Erik Ehrling <f95-eeh@nada.kth.se>

Supervisor: Stefan Nilsson <snilsson@nada.kth.se>

Royal Institute of Technology, Stockholm

Fast Parallel Matrix Multiplication - Strategies for Practical Hybrid Algorithms

Summary

The aim of the proposed study is to investigate implementations of different algorithms for

matrix multiplication (Strassen, Winograd) and their applicability for practical cases.

Parallelizing compilers will be used in order to determine near-optimal strategies for serial,

vector and parallel computers - including strategies for building hybrids of known algorithms

(determination of breakpoints) and study of multiplication of matrices of sizes not a power of

2 (fewer steps of recursion vs. padding-out with zeros). In cases where the full potential

speed-up is not realised factors such as loss in data locality/cache effects, recursion

overheads, etc. will be studied.

Background

Matrix multiplication is one of the most fundamental operations in linear algebra and serves

as the main building block in many different algorithms, including the solution of systems of

linear equations, matrix inversion, evaluation of the matrix determinant and the transitive

closure of a graph. In several cases the asymptotic complexities of these algorithms depend

directly on the complexity of matrix multiplication - which motivates the study of possibilities

to speed up matrix multiplication. Also, the inclusion of matrix multiplication in many

benchmarks points at its role as a determining factor for the performance of high speed

computations.

Since Strassen, in his now famous paper from 1969, showed that the O(n^{3}) asymptotic complexity of matrix multiplication can be reduced to O(n^{2.807}), the exponent has been further decreased; the lowest currently known order, O(n^{2.376}), is due to Coppersmith and Winograd (1990). These results have been achieved by first studying how matrix multiplication

depends on bilinear, later also trilinear, combinations of factors. But while the former lead to smaller reductions in the exponent, the latter suffer from a constant factor so large that it renders them unusable for all practical sizes of matrices. Also, the trilinear algorithms apply exclusively to certain subsets of the natural numbers and are not immediately applicable to floating-point calculations.

However, a lower order in the number of floating-point operations does not necessarily imply that the overall processing becomes faster. As Pan points out in his book, *How to Multiply
Matrices Faster*: "bit-time complexity classification of linear algebra problems [...] differs

from their arithmetical complexity classification." Further, the recursive block structure of the algorithms introduces overheads for recursion and for padding with zeros when the size is not a power of 2, which makes the full speed-up harder to achieve. The aim of this

study is to examine different strategies for overcoming these obstacles. (On the other hand

some positive effects could be expected due to better usage of cache memory as the recursive

approach provides blocking.)

Bilinear algorithms

Three main bilinear algorithms are relevant for this study: the first is the block formulation of ordinary matrix multiplication, the second is Strassen's algorithm, and the last is Winograd's variant of it.

Ordinary matrix multiplication / Blocked matrix multiplication

Asymptotic complexity: 2n^{3} operations

Each recursion step (blocked version): 8 multiplications, 4 additions

Strassen

Strassen’s formulation makes use of the fact that while the cost for additions and multiplications

might be of the same order in the scalar case, multiplications are considerably

more costly for increasing sizes of matrices. By utilising a mapping of bilinear combinations

that trades one multiplication for fourteen additions/subtractions in each recursion step, the

asymptotic complexity is reduced.

Asymptotic complexity: 4.695n^{2.807} operations

Each recursion step: 7 multiplications, 18 additions/subtractions

Winograd

Winograd's variant of Strassen's algorithm has the same exponent but a slightly lower constant, as the number of additions/subtractions is reduced from 18 to 15. Even though it might seem somewhat more complicated, using several temporary variables, it can be shown that the product can be computed using only n^{2}/2 extra memory on a sequential machine if the order of computations is scheduled carefully.

Asymptotic complexity: 4.537n^{2.807} operations

Each recursion step: 7 multiplications, 15 additions/subtractions

Implementation and evaluation

For the reasons given above, only strategies concerning bilinear algorithms for matrix multiplication will be considered in the study. Different bilinear algorithms, and hybrids formed between them, will be implemented on sequential, vector and parallel platforms, and their performance will be evaluated for each platform:

Sequential: Intel Pentium / Sun Sparc Ultra

Vector: Single processor of CRAY J932

Parallel: Silicon Graphics Origin 2000

Parallelizing C compilers will be used for the vector and parallel platforms (F90 if explicit parallel constructs appear to be necessary).

The work starts on the sequential platform, laying a foundation and sorting out initial implementation problems before moving on. The influence of factors such as the breakpoint level between two different algorithms in a hybrid, memory effects (extra memory requirements, data locality, cache effects), recursion overheads, and the handling of odd-sized matrices will be studied. The code from the sequential platform will then be ported to form the basis for the vector and parallel implementations.

Breakpoints

For matrices of smaller sizes (typically n<16..256), where the difference in cost between

multiplications and additions of submatrices is not significant, ordinary multiplication is

likely to outperform both Strassen variants. The most intuitive hybrid algorithm to take

advantage of this is a Strassen variant where at a certain level the recursion is stopped and

ordinary matrix multiplication is called instead. The main parameter to be investigated in this case is the breakpoint size. The breakpoint is likely to depend on hardware specifications, and implementations should therefore be kept open for adaptation.

Memory effects

The current trend is towards faster processors and larger memory sizes, while the gap between

processor and memory speed is widening. Cache performance will therefore play a role of

increasing importance. Computer systems are becoming more hierarchical. This is to some

extent analogous to the situation twenty years back in time when main memory was small and

algorithms processing large amounts of data had to rely on much slower secondary storage.

An important step to achieve good performance was then to schedule the data so that as large

a fraction of memory requests as possible was from main memory. Today this is equivalent to

having as few cache misses as possible. One major difference, though, is that cache memory, while reasonably predictable, is usually not controllable at the program level.

The speed difference between cache and main memory is likely to give negative effects from

losses in data locality, while the recursive algorithms, on the other hand, provide blocking so

that the actual data processed would fit better in the cache. The block formulation of ordinary

matrix multiplication is included in the study in order to compare it with the two Strassen variants with respect to the positive cache effects on speed-up. The blocked variant

of ordinary matrix multiplication is likely to have similar effects from loss in data locality and

from blocking as the Strassen variants while maintaining the same number of floating point

operations as the straightforward algorithm. This should give some hints as to what proportion

of speed-up comes from the reduction in complexity and what proportion comes from

different memory usage only.

The extra memory requirements for the recursive algorithms will have an impact on the

running time for the algorithms. This is certainly the case for Winograd’s variant where there

is a mapping that only requires n^{2}/2 extra memory but also requires the computation to be

strictly sequential, which means that there is a probable trade-off between speed and memory

efficiency in the parallel case.

Odd dimensions

The recursive algorithms apply naturally to matrices whose sizes are powers of 2, as such a matrix can easily be split into its four quadrants in each recursion step. Matrices of all other sizes, however, will introduce extra overheads. There exist several conceivable

strategies for dealing with these cases. The most obvious one is padding out the matrix with

zeros from the start to the nearest upper dimension that is a power of 2 (static padding), but

other variants could include padding out only at the recursion level where it is necessary

(dynamic padding), or simply using fewer recursion steps (recall from the Breakpoints section above that it is not optimal to run the recursion all the way down anyway).

...

References

Andrew W. Appel, *Modern Compiler Implementation in C*, Cambridge University Press, 1998

David H. Bailey, *Extra high speed matrix multiplication on the Cray-2*, SIAM J. Sci. Stat. Comput. 9(3), May 1988

D. Coppersmith, S. Winograd, *Matrix multiplication via arithmetic progressions*, J. Symbolic Comput. 9, pp. 251-280, 1990

D. Coppersmith, *Rectangular matrix multiplication revisited*, Journal of Complexity 13(1):42-49, March 1997

P. C. Fischer, R. L. Probert, *Efficient procedures for using matrix algorithms*, Automata, Languages and Programming, Lecture Notes in Computer Science 14, pp. 413-427, Springer-Verlag, Berlin, 1974

J. Fr. Hake, W. Homberg, *The impact of memory organization on the performance of matrix calculations*, Parallel Computing 17:311-327, 1991

R. W. Hockney, *Parallel Computers 2*, 2nd ed., Adam Hilger, 1988

Xiaohan Huang, Victor Pan, *Fast rectangular matrix multiplication and applications*, Journal of Complexity 14(2):257-299, June 1998

M. S. Lam, E. E. Rothberg, M. E. Wolf, *The cache performance and optimizations of blocked algorithms*, 4th Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS IV), Palo Alto, California, April 9-11, 1991

Victor Pan, *How to Multiply Matrices Faster*, Springer-Verlag, 1984

Volker Strassen, *Gaussian Elimination is not Optimal*, Numer. Math. 13, pp. 354-356, 1969