## Introduction

Matrix multiplication occupies a central role in scientific computing with an extremely wide range of applications. Many numerical procedures in linear algebra (e.g. solving linear systems, matrix inversion, factorizations, determinants) can essentially be reduced to matrix multiplication [5, 3]. Hence, there is great interest in investigating fast matrix multiplication algorithms, to accelerate matrix multiplication (and other numerical procedures in turn).

SuanShu was already the fastest in matrix multiplication and hence linear algebra per our benchmark.
SuanShu v3.0.0 benchmark

Starting version 3.3.0, SuanShu has implemented an advanced algorithm for even faster matrix multiplication. It makes some operations 100x times faster those of our competitors! The new benchmark can be found here:
SuanShu v3.3.0 benchmark

In this article, we briefly describe our implementation of a matrix multiplication algorithm that dramatically accelerates dense matrix-matrix multiplication compared to the classical IJK algorithm.

## Parallel IJK

We first describe the method against which our new algorithm is compared against, IJK. Here is the algorithm performing multiplication $C = A\times B$ for $A$ is $m\times n$ $B$ is $n\times p$, and $C$ is $m\times p$:

for (i = 1; i < = m; i ++){
for (j = 1; j <= p; j ++){
for (k = 1; k <= n; k ++){
C[i,k] += A[i,j] * B[j,k];
}
}
}

In Suanshu, this is implemented in parallel; the outermost loop is passed to a ParallelExecutor .

As there are often more rows than threads available, the time complexity of this parallelized IJK is still roughly the same as IJK: $O\left(mnp\right)$, or cubic time $O\left(n^3\right)$ for $m = n = p$. This complexity is not most desirable.

The core of our new multiplication algorithm, the Strassen algorithm, reduces the time complexity to $O\left(n^{\log_27}\right)$.

## Strassen’s Algorithm

The Strassen algorithm  is based on the following block matrix multiplication: $\displaystyle \begin{pmatrix}A_{11} & A_{12}\\ A_{21} & A_{22} \end{pmatrix}\begin{pmatrix}B_{11} & B_{12}\\ B_{21} & B_{22} \end{pmatrix}=\begin{pmatrix}A_{11}B_{11}+A_{12}B_{21} & A_{11}B_{12}+A_{12}B_{22}\\ A_{21}B_{11}+A_{22}B_{21} & A_{21}B_{12}+A_{22}B_{22} \end{pmatrix}$

The naive method of completing this involves 8 submatrix multiplications and 4 additions. The version (Winograd’s variant ) of Strassen’s algorithm that we use forgoes one submatrix multiplication in exchange for eleven extra additions/subtractions, which is faster when the submatrices are large enough.

The algorithm runs as follows (visualization above):

1. Split $A$ into four equally-sized quadrants $A_{11}$ $A_{12}$ $A_{21}$ $A_{22}$. The same for $B$. (Assume first that all dimensions are even.)

2. Obtain the following factor matrices: $\displaystyle \begin{array}{ccccccc} M_{1} & = & A_{11} & \qquad & N_{1} & = & B_{11}\\ M_{2} & = & A_{12} & \qquad & N_{2} & = & B_{21}\\ M_{3} & = & A_{21}+A_{22} & \qquad & N_{3} & = & B_{12}-B_{11}\\ M_{4} & = & M_{3}-A_{11} & \qquad & N_{4} & = & B_{22}-N_{3}\\ M_{5} & = & A_{11}-A_{21} & \qquad & N_{5} & = & B_{22}-B_{12}\\ M_{6} & = & A_{12}-M_{4} & \qquad & N_{6} & = & B_{22}\\ M_{7} & = & A_{22} & \qquad & N_{7} & = & B_{21}-N_{4} \end{array}$

3. Obtain $P_{i}=M_{i}\times N_{i}$ for $i\in\left\{ 1,2,\ldots,7\right\}$. Again depending on the dimensions, we either use Parallel IJK, or make a recursive call to Strassen’s algorithm.

4. The final product $C=A\times B$ can then be obtained as follows (using some temporary matrices $T_{i}$): $\displaystyle \begin{array}{ccccccc} C_{11} & = & P_{1}+P_{2} & \qquad & C_{22} & = & T_{2}+P_{3}\\ T_{1} & = & P_{1}+P_{4} & \qquad & T_{3} & = & T_{1}+P_{3}\\ T_{2} & = & T_{1}+P_{5} & \qquad & C_{12} & = & T_{3}+P_{6}\\ C_{21} & = & T_{2}+P_{7} & \qquad \end{array}$

### Odd Dimensions

So far this algorithm has ignored the cases when $A$ and/or $B$ has an odd number of rows/columns. There are several methods of dealing with this [2, 4]. For example, one could pad the matrices statically so that the dimensions are always even until the recursion passes to IJK (static padding); or pad only when one of the dimensions is odd (dynamic padding).

Alternatively one could disregard the extra rows/columns until after the algorithm completes, and then take care of them afterwards (i.e. if $A$ has an extra row or $B$ has an extra column, use the appropriate matrix-vector operation to calculate the remaining row/column of $C$. If $A$ has an extra column and $B$ has an extra row, their product can be added on to $C$ afterwards.) We chose this method, called dynamic peeling, for our implementation.

## Blocking and Tiling

Taken on its own, the above Strassen’s algorithm works well, provided both matrices are roughly square. In practice, we may encounter cases where either is highly rectangular (e.g., too tall, or long).

We solve this by slicing the matrices into blocks which are nearly square, then using Strassen’s algorithm on the submatrices. The blocking scheme is devised so that long or tall strips are avoided. Fig 2. Left: A strict blocking scheme (strips!). Right: An efficient blocking scheme (our implementation).

## Performance

The following charts show the performance of our hybrid Block-Strassen algorithm versus Parallel IJK on an Intel ® Core i5-3337U CPU @ 1.80 GHz with 6GB RAM, running Java 1.8.0 update 40.

Tests are patterned after D’Alberto and Nicolau : We ran $C = A \times B$ for random matrices $A$ ( $m\times n$) and $B$ ( $n \times p$) , for every triple $\left(m, n, p\right)$ in $S^3$, where $S = \{100,500,1000,2500,5000,7500,10000\}$. This multiplication was done three times using Parallel IJK, then three times using Hybrid Block-Strassen. The average times using each method are compared and tabulated.

Figure 3 shows the shows the multiplication time plotted against the product $mnp$ of the dimensions. The multiplication time for IJK is $O\left(mnp\right)$, and our empirical results show that multiplication times for both Parallel IJK and HBS are strongly linearly related to complexity. HBS, however, has a significantly smaller gradient.

The gradients of the best-fit lines suggest that as complexity approaches infinity (ignoring memory constraints), HBS will take 63.5% less time than IJK. Several data points, however, (e.g. $m,n,p\ge5000$ ) show an even greater speedup. Fig 4. Time Savings (in seconds) using HBS versus Parallel IJK Fig 5. Time Savings (as a percent of P.IJK Time) using HBS versus Parallel IJK

Figures 4 and 5 show the time saving of HBS over Parallel IJK, in seconds (Figure 4), and as a percentage of the Parallel IJK time (Figure 5). Each table is for a specific value of $n$ (number of columns of $A$), and runs over values of $m$ (number of rows of $A$) and $p$ (number of columns of $B$). Fig 6. Maximum entry-wise Relative Error of HBS versus PIJK (in units of 1e-15)

Finally, an accuracy test was also run to address concerns regarding the numerical stability of Strassen’s algorithm . Figure 6 shows the maximum entry-wise relative error $\displaystyle\max_{i,j}\left|\frac{C_{i,j}^{{\rm IJK}}-C_{i,j}^{{\rm HBS}}}{C_{i,j}^{{\rm IJK}}}\right|$

of the resulting product matrix. We see that none of the errors exceed $2 \times 10^{-14}$ (note that we did not run Strassen’s algorithm completely to the scalar level; when the matrices $M_i$ and $N_i$ are too small, IJK is used. This reduces the error.), which suggests that HBS is surprisingly accurate, and a strong candidate for use in general-purpose contexts where speed is a priority.

## References

 Paolo D’Alberto and Alexandru Nicolau. Adaptive strassen’s matrix multiplication. In Proceedings of the 21st Annual International Conference on Supercomputing, ICS ’07, pages 284–292, New York, NY, USA, 2007. ACM.
 Hossam ElGindy and George Ferizis. On improving the memory access patterns during the execution of strassen’s matrix multiplication algorithm. In Proceedings of the 27th Australasian Conference on Computer Science – Volume 26, ACSC ’04, pages 109–115, Darlinghurst, Australia, Australia, 2004. Australian Computer Society, Inc.
 Nicholas J Higham. Accuracy and stability of numerical algorithms. Siam, 2002.
 Steven Huss-Lederman, Elaine M. Jacobson, Anna Tsao, Thomas Turnbull, and Jeremy R. Johnson. Implementation of strassen’s algorithm for matrix multiplication. In Proceedings of the 1996 ACM/IEEE Conference on Supercomputing, Supercomputing ’96, Washington, DC, USA, 1996. IEEE Computer Society.
 Steven S. Skiena. The Algorithm Design Manual. Springer Publishing Company, Incorporated, 2nd edition, 2008.
 Volker Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13(4):354–356, 1969.
 Shmuel Winograd. On multiplication of 2×2 matrices. Linear algebra and its applications, 4(4):381– 388, 1971.