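A Quadratic Programming (QP) problem in the form of

$$\min_{x} \; \tfrac{1}{2} x^T H x + p^T x \quad \text{s.t.} \quad A_{eq} x = b_{eq}, \quad A x \ge b,$$

where $x \in \mathbb{R}^n$, can be transformed to a Second-Order Cone Programming (SOCP) problem in the form of

$$\min_{z} \; f^T z \quad \text{s.t.} \quad \|G_i z + g_i\|_2 \le q_i^T z + d_i, \quad i = 1, \dots, M.$$

Consider $y = \|F x + c\|_2$, and

$$y^2 = \|F x + c\|_2^2 = x^T F^T F x + 2 c^T F x + c^T c.$$

As $y$ is non-negative, minimizing $y$ is equivalent to minimizing $y^2$, and hence, because $c^T c$ is a constant, is equivalent to minimizing $x^T F^T F x + 2 c^T F x$.

If we have $F^T F = \tfrac{1}{2} H$ and $F^T c = \tfrac{1}{2} p$, then the objective function in QP can be written as $x^T F^T F x + 2 c^T F x$. We can thus minimize $y$.

Thus, the QP problem can now be written as

$$\min_{x, y} \; y \quad \text{s.t.} \quad \|F x + c\|_2 \le y, \quad A_{eq} x = b_{eq}, \quad A x \ge b.$$

At the optimum the first constraint is tight, so $y$ equals the norm.

As $H$ is symmetric by the definition of QP, a real $F$ with $F^T F = \tfrac{1}{2} H$ exists exactly when $H$ is positive semidefinite, that is, when the QP is convex. In that case, applying Cholesky factorization gives $H = L L^T$ (or $H = U^T U$), and $F = L^T / \sqrt{2}$ (or $F = U / \sqrt{2}$).

Next, as $\|A_{eq} x - b_{eq}\|_2$ is always non-negative, the equality constraint can be written as

$$\|A_{eq} x - b_{eq}\|_2 \le 0.$$

Finally, each row in the inequality constraint can be written as

$$\|0\|_2 \le a_i^T x - b_i,$$

where $a_i^T$ is the i-th row of $A$, and $b_i$ is the i-th element of $b$.

Therefore, a QP problem can be transformed to an equivalent SOCP problem in the following way. We need to introduce a few variables first: let $z = (x, y) \in \mathbb{R}^{n+1}$ and $f = (0, \dots, 0, 1)^T$, so that $f^T z = y$. The SOCP is then

$$\min_{z} \; f^T z \quad \text{s.t.} \quad \|[F \;\; 0]\, z + c\|_2 \le f^T z, \quad \|[A_{eq} \;\; 0]\, z - b_{eq}\|_2 \le 0, \quad \|0\|_2 \le [a_i^T \;\; 0]\, z - b_i \;\; \text{for each row } i.$$

The sub-vector with the first $n$ elements in the solution of the transformed SOCP problem is the solution of the original QP problem.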
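The key step of the transformation, rewriting the quadratic objective as a squared norm, can be checked numerically. The sketch below is not SuanShu code; the class and method names, and the symbols H, p, F and c (quadratic term, linear term, the factor with F^T F = H/2, and the shift with F^T c = p/2), are assumptions made for this illustration. It assumes H is positive definite so that a Cholesky factor exists.

```java
final class QpToSocpCheck {

    /** Lower-triangular Cholesky factor L of a symmetric positive definite H, so that H = L L^T. */
    static double[][] cholesky(double[][] h) {
        int n = h.length;
        double[][] l = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j <= i; j++) {
                double sum = h[i][j];
                for (int k = 0; k < j; k++) {
                    sum -= l[i][k] * l[j][k];
                }
                if (i == j) {
                    if (sum <= 0) throw new IllegalArgumentException("H is not positive definite");
                    l[i][i] = Math.sqrt(sum);
                } else {
                    l[i][j] = sum / l[j][j];
                }
            }
        }
        return l;
    }

    /** QP objective 0.5 * x^T H x + p^T x. */
    static double qpObjective(double[][] h, double[] p, double[] x) {
        double value = 0;
        for (int i = 0; i < x.length; i++) {
            value += p[i] * x[i];
            for (int j = 0; j < x.length; j++) {
                value += 0.5 * x[i] * h[i][j] * x[j];
            }
        }
        return value;
    }

    /** Returns ||F x + c||^2 - c^T c, with F = L^T / sqrt(2) and c solving F^T c = p / 2. */
    static double coneObjective(double[][] h, double[] p, double[] x) {
        int n = x.length;
        double[][] l = cholesky(h);
        double s = Math.sqrt(2.0);
        // F^T = L / sqrt(2) is lower triangular, so F^T c = p / 2 is solved by forward substitution.
        double[] c = new double[n];
        for (int i = 0; i < n; i++) {
            double sum = p[i] / 2.0;
            for (int k = 0; k < i; k++) {
                sum -= (l[i][k] / s) * c[k];
            }
            c[i] = sum / (l[i][i] / s);
        }
        double normSq = 0, cc = 0;
        for (int i = 0; i < n; i++) {
            // (F x)_i = sum_j F[i][j] * x[j], where F[i][j] = L[j][i] / sqrt(2)
            double u = c[i];
            for (int j = 0; j < n; j++) {
                u += (l[j][i] / s) * x[j];
            }
            normSq += u * u;
            cc += c[i] * c[i];
        }
        return normSq - cc;
    }
}
```

For any x, `qpObjective` and `coneObjective` should agree up to rounding, which is why minimizing the norm gives the same minimizer as the original QP.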

SuanShu has implementations to solve both SOCP and QP problems.

Matrix multiplication occupies a central role in scientific computing with an extremely wide range of applications. Many numerical procedures in linear algebra (e.g. solving linear systems, matrix inversion, factorizations, determinants) can essentially be reduced to matrix multiplication [5, 3]. Hence, there is great interest in investigating fast matrix multiplication algorithms, to accelerate matrix multiplication (and other numerical procedures in turn).

SuanShu was already the fastest in matrix multiplication, and hence in linear algebra, per our benchmark: SuanShu v3.0.0 benchmark

Starting with version 3.3.0, SuanShu has implemented an advanced algorithm for even faster matrix multiplication. It makes some operations 100 times faster than those of our competitors! The new benchmark can be found here: SuanShu v3.3.0 benchmark

In this article, we briefly describe our implementation of a matrix multiplication algorithm that dramatically accelerates dense matrix-matrix multiplication compared to the classical IJK algorithm.

Parallel IJK

We first describe the method against which our new algorithm is compared, IJK. Here is the algorithm performing the multiplication $C = AB$, where $A$ is $m \times p$, $B$ is $p \times n$, and $C$ is $m \times n$:

// C (m x n) is assumed to start as all zeros
for (i = 1; i <= m; i++) {
    for (j = 1; j <= p; j++) {
        for (k = 1; k <= n; k++) {
            C[i,k] += A[i,j] * B[j,k];
        }
    }
}

In SuanShu, this is implemented in parallel; the outermost loop is passed to a ParallelExecutor.
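As a rough illustration of parallelizing the outermost loop, here is a minimal sketch that uses the standard `java.util.stream` API as a stand-in. It does not use SuanShu's ParallelExecutor; the class name `ParallelIjk` is hypothetical, and the inputs are assumed to be non-empty rectangular arrays.

```java
import java.util.stream.IntStream;

final class ParallelIjk {

    /** Computes C = A * B for A (m x p) and B (p x n), one row of C per parallel task. */
    static double[][] multiply(double[][] a, double[][] b) {
        int m = a.length, p = b.length, n = b[0].length;
        if (a[0].length != p) throw new IllegalArgumentException("inner dimensions differ");
        double[][] c = new double[m][n];
        // Each task writes only to row i of C, so no synchronization is needed.
        IntStream.range(0, m).parallel().forEach(i -> {
            for (int j = 0; j < p; j++) {
                double aij = a[i][j];
                for (int k = 0; k < n; k++) {
                    c[i][k] += aij * b[j][k];
                }
            }
        });
        return c;
    }
}
```

Splitting by rows keeps the tasks independent. The total work is unchanged, so the arithmetic cost is still proportional to the product of the three dimensions.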

As there are often more rows than threads available, the time complexity of this parallelized IJK is still roughly the same as IJK: $O(mpn)$, or cubic time $O(n^3)$ for $m = n = p$. This complexity is not the most desirable.

The core of our new multiplication algorithm, the Strassen algorithm, reduces the time complexity to $O(n^{\log_2 7}) \approx O(n^{2.807})$.
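For reference, the standard recurrence argument behind that exponent can be written as follows, for square matrices of size $n$ (a textbook derivation, not specific to SuanShu). Each level replaces one product of size $n$ by 7 products of size $n/2$ plus a fixed number of additions costing $\Theta(n^2)$:

$$T(n) = 7\,T(n/2) + \Theta(n^2) \;\Longrightarrow\; T(n) = \Theta(n^{\log_2 7}) \approx \Theta(n^{2.807}),$$

whereas block multiplication with 8 products per level gives

$$T(n) = 8\,T(n/2) + \Theta(n^2) \;\Longrightarrow\; T(n) = \Theta(n^{\log_2 8}) = \Theta(n^3).$$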

Strassen’s Algorithm

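The Strassen algorithm [6] is based on the following block matrix multiplication:

$$\begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix} = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}$$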

The naive method of completing this involves 8 submatrix multiplications and 4 additions. The version (Winograd’s variant [7]) of Strassen’s algorithm that we use forgoes one submatrix multiplication in exchange for eleven extra additions/subtractions, which is faster when the submatrices are large enough.

The algorithm runs as follows (visualization above):

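1. Split $A$ into four equally-sized quadrants $A_{11}$, $A_{12}$, $A_{21}$, $A_{22}$. The same for $B$. (Assume first that all dimensions are even.)

2. Obtain the following factor matrices:

$$S_1 = A_{21} + A_{22}, \quad S_2 = S_1 - A_{11}, \quad S_3 = A_{11} - A_{21}, \quad S_4 = A_{12} - S_2,$$

$$T_1 = B_{12} - B_{11}, \quad T_2 = B_{22} - T_1, \quad T_3 = B_{22} - B_{12}, \quad T_4 = T_2 - B_{21}.$$

3. Obtain $M_i$ for $i = 1, \dots, 7$:

$$M_1 = A_{11} B_{11}, \quad M_2 = A_{12} B_{21}, \quad M_3 = S_4 B_{22}, \quad M_4 = A_{22} T_4, \quad M_5 = S_1 T_1, \quad M_6 = S_2 T_2, \quad M_7 = S_3 T_3.$$

Depending on the dimensions, we either use Parallel IJK, or make a recursive call to Strassen’s algorithm.

4. The final product $C$ can then be obtained as follows (using some temporary matrices $U_1$, $U_2$, $U_3$):

$$U_1 = M_1 + M_6, \quad U_2 = U_1 + M_7, \quad U_3 = U_1 + M_5,$$

$$C_{11} = M_1 + M_2, \quad C_{12} = U_3 + M_3, \quad C_{21} = U_2 - M_4, \quad C_{22} = U_2 + M_5.$$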
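The four steps can be sketched in plain Java for a single level of recursion. This is a minimal illustration using the commonly cited schedule for Winograd's variant (7 products, 15 additions/subtractions), not SuanShu's implementation. The class `WinogradStep`, its helper methods, and the names `s`, `t`, `m`, `u` are assumptions of this sketch, and the seven products fall back to the classical triple loop instead of recursing.

```java
final class WinogradStep {

    /** Classical triple loop, used for the seven half-size products. */
    static double[][] mul(double[][] a, double[][] b) {
        int m = a.length, p = b.length, n = b[0].length;
        double[][] c = new double[m][n];
        for (int i = 0; i < m; i++)
            for (int j = 0; j < p; j++)
                for (int k = 0; k < n; k++)
                    c[i][k] += a[i][j] * b[j][k];
        return c;
    }

    /** Returns x + sign * y, entry by entry. */
    static double[][] add(double[][] x, double[][] y, double sign) {
        double[][] r = new double[x.length][x[0].length];
        for (int i = 0; i < x.length; i++)
            for (int j = 0; j < x[0].length; j++)
                r[i][j] = x[i][j] + sign * y[i][j];
        return r;
    }

    /** Copies the rows x cols block of a whose top-left corner is (r0, c0). */
    static double[][] block(double[][] a, int r0, int c0, int rows, int cols) {
        double[][] r = new double[rows][cols];
        for (int i = 0; i < rows; i++)
            System.arraycopy(a[r0 + i], c0, r[i], 0, cols);
        return r;
    }

    /** One level of the Winograd variant; all dimensions of a and b must be even. */
    static double[][] multiply(double[][] a, double[][] b) {
        int m = a.length / 2, p = b.length / 2, n = b[0].length / 2;
        // Step 1: split both operands into quadrants.
        double[][] a11 = block(a, 0, 0, m, p), a12 = block(a, 0, p, m, p);
        double[][] a21 = block(a, m, 0, m, p), a22 = block(a, m, p, m, p);
        double[][] b11 = block(b, 0, 0, p, n), b12 = block(b, 0, n, p, n);
        double[][] b21 = block(b, p, 0, p, n), b22 = block(b, p, n, p, n);

        // Step 2: eight additions/subtractions on the quadrants.
        double[][] s1 = add(a21, a22, 1), s2 = add(s1, a11, -1);
        double[][] s3 = add(a11, a21, -1), s4 = add(a12, s2, -1);
        double[][] t1 = add(b12, b11, -1), t2 = add(b22, t1, -1);
        double[][] t3 = add(b22, b12, -1), t4 = add(t2, b21, -1);

        // Step 3: seven half-size products (a recursive call could replace mul here).
        double[][] m1 = mul(a11, b11), m2 = mul(a12, b21), m3 = mul(s4, b22);
        double[][] m4 = mul(a22, t4), m5 = mul(s1, t1), m6 = mul(s2, t2), m7 = mul(s3, t3);

        // Step 4: seven more additions/subtractions assemble the quadrants of C.
        double[][] u1 = add(m1, m6, 1), u2 = add(u1, m7, 1), u3 = add(u1, m5, 1);
        double[][] c11 = add(m1, m2, 1), c12 = add(u3, m3, 1);
        double[][] c21 = add(u2, m4, -1), c22 = add(u2, m5, 1);

        double[][] c = new double[2 * m][2 * n];
        for (int i = 0; i < m; i++) {
            System.arraycopy(c11[i], 0, c[i], 0, n);
            System.arraycopy(c12[i], 0, c[i], n, n);
            System.arraycopy(c21[i], 0, c[m + i], 0, n);
            System.arraycopy(c22[i], 0, c[m + i], n, n);
        }
        return c;
    }
}
```

Copying quadrants keeps the sketch short. A production implementation would more likely work on views of the original arrays to avoid the extra memory traffic.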

Odd Dimensions

So far this algorithm has ignored the cases when $A$ and/or $B$ has an odd number of rows/columns. There are several methods of dealing with this [2, 4]. For example, one could pad the matrices statically so that the dimensions are always even until the recursion passes to IJK (static padding); or pad only when one of the dimensions is odd (dynamic padding).

Alternatively, one could disregard the extra rows/columns until after the algorithm completes, and then take care of them afterwards. That is, if $A$ has an extra row or $B$ has an extra column, use the appropriate matrix-vector operation to calculate the remaining row/column of $C$. If $A$ has an extra column and $B$ has an extra row, their product can be added on to $C$ afterwards. We chose this method, called dynamic peeling, for our implementation.
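The fix-up steps of dynamic peeling can be sketched as follows. This is an illustration only, with hypothetical names (`DynamicPeeling`, `multiplyCore`), and the even-sized core is multiplied by a plain triple loop standing in for Strassen's algorithm.

```java
final class DynamicPeeling {

    /** Stand-in for Strassen on the even core: leading m x p block of a times leading p x n block of b. */
    static void multiplyCore(double[][] a, double[][] b, double[][] c, int m, int p, int n) {
        for (int i = 0; i < m; i++)
            for (int j = 0; j < p; j++)
                for (int k = 0; k < n; k++)
                    c[i][k] += a[i][j] * b[j][k];
    }

    /** Computes C = A * B for arbitrary dimensions by peeling off odd rows/columns. */
    static double[][] multiply(double[][] a, double[][] b) {
        int m = a.length, p = b.length, n = b[0].length;
        int me = m - m % 2, pe = p - p % 2, ne = n - n % 2; // largest even sizes
        double[][] c = new double[m][n];

        // Even core: this is where Strassen's algorithm would run.
        multiplyCore(a, b, c, me, pe, ne);

        // A has an extra column and B an extra row: add their outer product to the core.
        if (pe < p)
            for (int i = 0; i < me; i++)
                for (int k = 0; k < ne; k++)
                    c[i][k] += a[i][pe] * b[pe][k];

        // A has an extra row: the last row of C is a vector-matrix product.
        if (me < m)
            for (int j = 0; j < p; j++)
                for (int k = 0; k < n; k++)
                    c[me][k] += a[me][j] * b[j][k];

        // B has an extra column: the rest of the last column of C is a matrix-vector product.
        if (ne < n)
            for (int i = 0; i < me; i++)
                for (int j = 0; j < p; j++)
                    c[i][ne] += a[i][j] * b[j][ne];

        return c;
    }
}
```

The bottom-right corner of C is filled by the extra-row step, so the extra-column step stops at row `me` to avoid counting it twice.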

Blocking and Tiling

Taken on its own, the above Strassen’s algorithm works well, provided both matrices are roughly square. In practice, we may encounter cases where either $A$ or $B$ is highly rectangular (e.g., too tall or too wide).

We solve this by slicing the matrices into blocks which are nearly square, then using Strassen’s algorithm on the submatrices. The blocking scheme is devised so that long or tall strips are avoided.

Performance

The following charts show the performance of our hybrid Block-Strassen algorithm versus Parallel IJK on an Intel® Core i5-3337U CPU @ 1.80 GHz with 6GB RAM, running Java 1.8.0 update 40.

Tests are patterned after D’Alberto and Nicolau [1]: We ran $C = AB$ for random matrices $A$ ($m \times p$) and $B$ ($p \times n$), for every triple $(m, p, n)$ of tested dimensions. This multiplication was done three times using Parallel IJK, then three times using Hybrid Block-Strassen (HBS). The average times using each method are compared and tabulated.

Figure 3 shows the multiplication time plotted against the product of the dimensions $mpn$. The multiplication time for IJK is $O(mpn)$, and our empirical results show that multiplication times for both Parallel IJK and HBS are strongly linearly related to complexity. HBS, however, has a significantly smaller gradient.

The gradients of the best-fit lines suggest that as complexity approaches infinity (ignoring memory constraints), HBS will take 63.5% less time than IJK. Several data points, however, show an even greater speedup.

Figures 4 and 5 show the time saving of HBS over Parallel IJK, in seconds (Figure 4), and as a percentage of the Parallel IJK time (Figure 5). Each table is for a specific value of $p$ (number of columns of $A$), and runs over values of $m$ (number of rows of $A$) and $n$ (number of columns of $B$).
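To put the 63.5% figure in terms of a speedup factor (this only restates the number quoted above, with $t$ denoting running time):

$$t_{\mathrm{HBS}} = (1 - 0.635)\, t_{\mathrm{IJK}} = 0.365\, t_{\mathrm{IJK}}, \qquad \frac{t_{\mathrm{IJK}}}{t_{\mathrm{HBS}}} = \frac{1}{0.365} \approx 2.74.$$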

Finally, an accuracy test was also run to address concerns regarding the numerical stability of Strassen’s algorithm [3]. Figure 6 shows the maximum entry-wise relative error

of the resulting product matrix. We see that none of the errors exceed (note that we did not run Strassen’s algorithm completely to the scalar level; when the matrices $A$ and $B$ are too small, IJK is used. This reduces the error.), which suggests that HBS is surprisingly accurate, and a strong candidate for use in general-purpose contexts where speed is a priority.

References

[1] Paolo D’Alberto and Alexandru Nicolau. Adaptive Strassen’s matrix multiplication. In Proceedings of the 21st Annual International Conference on Supercomputing, ICS ’07, pages 284–292, New York, NY, USA, 2007. ACM.

[2] Hossam ElGindy and George Ferizis. On improving the memory access patterns during the execution of Strassen’s matrix multiplication algorithm. In Proceedings of the 27th Australasian Conference on Computer Science – Volume 26, ACSC ’04, pages 109–115, Darlinghurst, Australia, 2004. Australian Computer Society, Inc.

[3] Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, 2002.

[4] Steven Huss-Lederman, Elaine M. Jacobson, Anna Tsao, Thomas Turnbull, and Jeremy R. Johnson. Implementation of Strassen’s algorithm for matrix multiplication. In Proceedings of the 1996 ACM/IEEE Conference on Supercomputing, Supercomputing ’96, Washington, DC, USA, 1996. IEEE Computer Society.

[5] Steven S. Skiena. The Algorithm Design Manual. Springer Publishing Company, Incorporated, 2nd edition, 2008.

[6] Volker Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13(4):354–356, 1969.

[7] Shmuel Winograd. On multiplication of 2×2 matrices. Linear Algebra and its Applications, 4(4):381–388, 1971.

Cloud computing is very popular nowadays. Delegating your CPU-intensive computation (or simulation) to the cloud seems to be a smart choice. Many of our users have asked whether SuanShu can be run on Amazon’s Elastic Compute Cloud (EC2), because a SuanShu license requires a MAC address and they have no control over which machine is used when they launch an EC2 instance. Here is some good news! Amazon Web Services (AWS) now supports Elastic Network Interfaces (ENI), by which you can bind your EC2 instance to a specified network interface. Therefore, you can license your SuanShu against the MAC address of the ENI, and launch an instance with the same ENI and MAC address. For details, please visit the blog here. The user guide for ENI can also be found here.

Numerical Method Inc. publishes SuanShu, a Java numerical and statistical library. The objective of SuanShu is to enable very easy programming of engineering applications. Programmers are able to program mathematics in a way that the source code is solidly object-oriented and individually testable. SuanShu source code adheres to the strictest coding standard so that it is readable, maintainable, and can be easily modified and extended.

SuanShu revolutionizes how numerical computing is traditionally done, e.g., netlib, gsl. The repositories of these most popular and somewhat “standard” libraries are rather collections of ad-hoc source code in obsolete languages, e.g., FORTRAN and C. One of the biggest problems with this code is that it is not readable (for most modern programmers), hence unmaintainable. For example, it is quite a challenge to understand AS 288, let alone improve it. Other problems include, but are not limited to, the lack of data structures, duplicated code, being entirely procedural, very bad variable naming, abuse of GOTO, the lack of test cases, insufficient documentation, the lack of IDE support, inconvenient linking to modern languages such as Java, and being unfriendly to parallel computing.

To address these problems, SuanShu designs a framework of reusable math components (not procedures) so that programmers can put components together like Legos to build more complex algorithms. SuanShu is written anew so that it conforms to modern programming practice in variable naming, code structuring, reusability, readability, maintainability, and software engineering procedure. To ensure very high quality of the code and very few bugs, SuanShu has a few thousand unit test cases that run daily.

The basics of SuanShu cover the following.

– numerical differentiation and integration
– polynomial and Jenkins-Traub
– root finding
– unconstrained and constrained optimization for univariate and multivariate functions
– linear algebra: matrix operations and factorization
– sparse matrix
– descriptive statistics
– random sampling from distributions

Compared to competing products, SuanShu, we believe, has the most extensive coverage in statistics. SuanShu covers the following.

– Ordinary Least Square (OLS) regression
– Generalized Linear Model (GLM) regression
– a full suite of residual analysis
– Stochastic Differential Equation (SDE) simulation
– a comprehensive library of hypothesis testing: Kolmogorov-Smirnov, D’Agostino, Jarque-Bera, Lilliefors, Shapiro-Wilk, One-way ANOVA, T, Kruskal-Wallis, Siegel-Tukey, Van der Waerden, Wilcoxon rank sum, Wilcoxon signed rank, Breusch-Pagan, ADF, Bartlett, Brown-Forsythe, F, Levene, Pearson’s Chi-square, Portmanteau
– time series analysis, univariate and multivariate
– ARIMA, GARCH modelling, simulation, fitting, and prediction
– sample and theoretical auto-correlation
– cointegration
– hidden Markov chain
– Kalman filter
– more

Today I have my first post ever on the Numerical Method blog. I’d like to discuss a design decision we made when writing our library. This is to give our friends some insight into how the SuanShu library works. Please leave me feedback and comments so that we can continue to improve the product for you.

We have decided to make the class library as parsimonious as possible to avoid method pollution. This is inspired by the jMatrices white paper. The challenge is to organize the methods into minimal and correct packages.

I will illustrate this with SuanShu’s Matrix packages.

The Matrix class has only 26 methods, of which 9 are constructors and related methods; 3 are overrides for the AbstractMatrix interfaces; 8 are overrides for the MatrixSpace interfaces. Only 6 of them are class-specific, to make calling these methods convenient for the user. The other dozens of matrix operations, such as the different factorizations, properties like rank, and transformations like inverse, are grouped into multiple classes and packages. In most cases, each of these operations is a class on its own.

For instance, the inverse operation itself is a class inheriting from Matrix. The constructor takes as input a Matrix to invert. For example, to find the inverse for

$$A = \begin{bmatrix} 1 & 2 & 3 \\ 6 & 5 & 4 \\ 8 & 7 & 9 \end{bmatrix},$$

we code

<pre>
Matrix A = new Matrix(new double[][]{
    {1, 2, 3},
    {6, 5, 4},
    {8, 7, 9}
});

Matrix Ainv = new Inverse(A);
</pre>

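SuanShu computes

$$A^{-1} = \begin{bmatrix} -17/21 & -1/7 & 1/3 \\ 22/21 & 5/7 & -2/3 \\ -2/21 & -3/7 & 1/3 \end{bmatrix}.$$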

It is important to note that Ainv is a Matrix, created by the keyword new, not by a method call.
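The "operation as a class" idea can be illustrated without SuanShu at all. In the sketch below, `DenseMatrix` and `InverseOf` are hypothetical stand-ins invented for this example (they are not SuanShu's Matrix and Inverse classes), and Gauss-Jordan elimination is simply one convenient way to compute the inverse here.

```java
/** Hypothetical stand-in for a matrix type; not SuanShu's Matrix class. */
class DenseMatrix {
    protected final double[][] data;

    DenseMatrix(double[][] data) {
        this.data = data;
    }

    double get(int i, int j) {
        return data[i][j];
    }

    int size() {
        return data.length;
    }
}

/** The inverse is itself a matrix: constructing it performs the computation. */
class InverseOf extends DenseMatrix {

    InverseOf(DenseMatrix a) {
        super(invert(a));
    }

    /** Gauss-Jordan elimination with partial pivoting on the augmented matrix [A | I]. */
    private static double[][] invert(DenseMatrix a) {
        int n = a.size();
        double[][] w = new double[n][2 * n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) w[i][j] = a.get(i, j);
            w[i][n + i] = 1.0;
        }
        for (int col = 0; col < n; col++) {
            // Pick the row with the largest entry in this column as the pivot row.
            int pivot = col;
            for (int r = col + 1; r < n; r++)
                if (Math.abs(w[r][col]) > Math.abs(w[pivot][col])) pivot = r;
            if (w[pivot][col] == 0.0) throw new ArithmeticException("matrix is singular");
            double[] tmp = w[col]; w[col] = w[pivot]; w[pivot] = tmp;
            // Scale the pivot row, then eliminate the column from every other row.
            double d = w[col][col];
            for (int j = 0; j < 2 * n; j++) w[col][j] /= d;
            for (int r = 0; r < n; r++) {
                if (r == col) continue;
                double f = w[r][col];
                for (int j = 0; j < 2 * n; j++) w[r][j] -= f * w[col][j];
            }
        }
        // The right half of the augmented matrix now holds the inverse.
        double[][] inv = new double[n][n];
        for (int i = 0; i < n; i++) System.arraycopy(w[i], n, inv[i], 0, n);
        return inv;
    }
}
```

With this layout, `DenseMatrix inv = new InverseOf(a);` yields an object that can be passed anywhere a matrix is expected, while the base class stays free of an `inverse()` method.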

In summary, we choose to have 100s of classes, rather than to have a class with 100s of methods. Each class is kept deliberately short. This class parsimony principle is a key design decision guiding the whole library development.