Erik Ehrling <f95-eeh@nada.kth.se>
Supervisor: Stefan Nilsson <snilsson@nada.kth.se>
Royal Institute of Technology, Stockholm

 

 

Fast Parallel Matrix Multiplication - Strategies for
Practical Hybrid Algorithms

 

Summary

The aim of the proposed study is to investigate implementations of different algorithms for
matrix multiplication (Strassen, Winograd) and their applicability in practical cases.
Parallelizing compilers will be used in order to determine near-optimal strategies for serial,
vector and parallel computers, including strategies for building hybrids of known algorithms
(determination of breakpoints) and the multiplication of matrices whose sizes are not a power
of 2 (fewer steps of recursion vs. padding out with zeros). In cases where the full potential
speed-up is not realised, factors such as loss of data locality/cache effects, recursion
overheads, etc. will be studied.

 

Background

Matrix multiplication is one of the most fundamental operations in linear algebra and serves
as the main building block in many different algorithms, including the solution of systems of
linear equations, matrix inversion, evaluation of the matrix determinant and the transitive
closure of a graph. In several cases the asymptotic complexities of these algorithms depend
directly on the complexity of matrix multiplication, which motivates studying ways to speed
up matrix multiplication. Also, the inclusion of matrix multiplication in many benchmarks
points to its role as a determining factor in the performance of high-speed computation.

Since Strassen, in his now famous paper from 1969, showed that the O(n^3) asymptotic
complexity of matrix multiplication can be reduced to O(n^2.807), the exponent has been
further decreased, and the currently lowest known order, due to Coppersmith and Winograd
(1990), is O(n^2.376). These results have been achieved by first studying how matrix
multiplication depends on bilinear, and later also trilinear, combinations of factors. But while
the former lead to smaller reductions in the exponent, the latter suffer from constant factors
so large that they are unusable for all practical sizes of matrices. Moreover, the trilinear
algorithms apply exclusively to certain subsets of the natural numbers and are not
immediately applicable to floating-point calculations.

However, a lower order in the number of floating-point operations does not necessarily imply
that the overall processing becomes faster. As Pan points out in his book How to Multiply
Matrices Faster: "bit-time complexity classification of linear algebra problems [...] differs
from their arithmetical complexity classification." Further, the recursive block structure of
the algorithms introduces overheads for recursion and for padding with zeros when the size is
not a power of 2, which makes the full speed-up harder to achieve. The aim of this study is to
examine different strategies for overcoming these obstacles. (On the other hand, some positive
effects can be expected from better usage of the cache memory, since the recursive approach
provides blocking.)


Bilinear algorithms

Three main bilinear algorithms are relevant for this study: the first is the block formulation of
ordinary matrix multiplication, the second is Strassen's algorithm, and the third is Winograd's
variant of it.

Ordinary matrix multiplication / Blocked matrix multiplication

Asymptotic complexity: 2n^3 operations
Each recursion step (blocked version): 8 multiplications, 4 additions
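
For reference, the 2-by-2 block decomposition behind the recursion is (each entry denoting a
half-size block):

    C11 = A11*B11 + A12*B21        C12 = A11*B12 + A12*B22
    C21 = A21*B11 + A22*B21        C22 = A21*B12 + A22*B22

i.e. eight block multiplications and four block additions per step.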

Strassen

Strassen's formulation makes use of the fact that, while additions and multiplications may
cost about the same in the scalar case, multiplications become considerably more costly than
additions as the matrix size increases. By using a mapping of bilinear combinations that trades
one multiplication for fourteen extra additions/subtractions in each recursion step, the
asymptotic complexity is reduced.

Asymptotic complexity: 4.695n^2.807 operations
Each recursion step: 7 multiplications, 18 additions/subtractions
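
For reference, the seven products from Strassen's 1969 paper and the combinations that form
the result are (all operands are half-size blocks):

    M1 = (A11 + A22)(B11 + B22)        C11 = M1 + M4 - M5 + M7
    M2 = (A21 + A22) B11               C12 = M3 + M5
    M3 = A11 (B12 - B22)               C21 = M2 + M4
    M4 = A22 (B21 - B11)               C22 = M1 - M2 + M3 + M6
    M5 = (A11 + A12) B22
    M6 = (A21 - A11)(B11 + B12)
    M7 = (A12 - A22)(B21 + B22)

This accounts for the 7 multiplications and 18 additions/subtractions per recursion step.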

Winograd

Winograd's variant of Strassen's algorithm has the same exponent but a slightly lower
constant, as the number of additions/subtractions is reduced from 18 to 15. Even though it
may seem somewhat more complicated, using several temporary variables, it can be shown
that the product can be computed using only n^2/2 extra memory on a sequential machine if
the order of computations is scheduled carefully.

Asymptotic complexity: 4.537n^2.807 operations
Each recursion step: 7 multiplications, 15 additions/subtractions
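
One common formulation of the Winograd variant, reconstructed here from the standard
literature rather than taken from the proposal (the careful scheduling that achieves the n^2/2
memory bound is not shown), is:

    Pre-additions (8):
        S1 = A21 + A22    S2 = S1 - A11    S3 = A11 - A21    S4 = A12 - S2
        T1 = B12 - B11    T2 = B22 - T1    T3 = B22 - B12    T4 = T2 - B21

    Products (7):
        P1 = A11*B11    P2 = A12*B21    P3 = S4*B22    P4 = A22*T4
        P5 = S1*T1      P6 = S2*T2      P7 = S3*T3

    Post-additions (7):
        U1 = P1 + P6      U2 = U1 + P7
        C11 = P1 + P2     C12 = U1 + P5 + P3
        C21 = U2 - P4     C22 = U2 + P5

This gives 8 pre-additions, 7 recursive products and 7 post-additions, i.e. the 15
additions/subtractions quoted above.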

 

Implementation and evaluation

For the reasons given above, only strategies based on bilinear algorithms for matrix
multiplication will be considered in the study. Different bilinear algorithms, and hybrids
formed between them, will be implemented on sequential, vector and parallel platforms, and
their performance will be evaluated on each of these platforms:

Sequential: Intel Pentium / Sun Sparc Ultra
Vector: Single processor of CRAY J932
Parallel: Silicon Graphics Origin 2000

Parallelizing C compilers will be used for the vector and parallel platforms (F90 if explicit
parallel constructs turn out to be necessary).

The work starts on the sequential platform, to lay a foundation and sort out initial
implementation problems before moving on to the vector and parallel platforms. The influence
of factors such as the breakpoint level between two different algorithms in a hybrid, memory
effects (extra memory requirements, data locality, cache effects), recursion overheads, and the
handling of odd-sized matrices will be studied. The code from the sequential platform will
then be ported to form the basis for the vector and parallel implementations.

Breakpoints

For matrices of smaller sizes (typically n < 16..256), where the difference in cost between
multiplications and additions of submatrices is not significant, ordinary multiplication is
likely to outperform both Strassen variants. The most intuitive hybrid algorithm taking
advantage of this is a Strassen variant where, at a certain level, the recursion is stopped and
ordinary matrix multiplication is called instead. The main parameter to be investigated in this
case is the breakpoint size. The breakpoint is likely to depend on hardware characteristics, and
the implementations should therefore be kept open for adaptation.
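
As an illustration, a minimal self-contained C sketch of such a hybrid is given below. It
assumes square row-major matrices with n a power of two and uses straightforward quadrant
copies; BREAKPOINT, hybrid_mul and the helper names are illustrative placeholders (error
handling omitted), not code from the study itself.

    #include <stdlib.h>

    #define BREAKPOINT 64      /* hypothetical cutoff, to be tuned per platform */

    /* Classical triple-loop kernel: C = A*B for n-by-n matrices. */
    static void classical(int n, const double *A, const double *B, double *C)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double s = 0.0;
                for (int k = 0; k < n; k++)
                    s += A[i*n + k] * B[k*n + j];
                C[i*n + j] = s;
            }
    }

    static void add(int n, const double *X, const double *Y, double *Z)
    {
        for (int i = 0; i < n*n; i++) Z[i] = X[i] + Y[i];
    }

    static void sub(int n, const double *X, const double *Y, double *Z)
    {
        for (int i = 0; i < n*n; i++) Z[i] = X[i] - Y[i];
    }

    /* Hybrid: one level of Strassen recursion above the breakpoint,
       classical multiplication below it. */
    void hybrid_mul(int n, const double *A, const double *B, double *C)
    {
        if (n <= BREAKPOINT) { classical(n, A, B, C); return; }

        int h = n / 2, sz = h * h;
        /* Workspace: eight quadrant copies, seven products, two temporaries. */
        double *w = malloc(17 * (size_t)sz * sizeof *w);
        double *A11 = w,        *A12 = A11 + sz, *A21 = A12 + sz, *A22 = A21 + sz;
        double *B11 = A22 + sz, *B12 = B11 + sz, *B21 = B12 + sz, *B22 = B21 + sz;
        double *M = B22 + sz;                  /* M + k*sz holds product k+1 */
        double *T1 = M + 7*sz, *T2 = T1 + sz;

        for (int i = 0; i < h; i++)            /* split A and B into quadrants */
            for (int j = 0; j < h; j++) {
                A11[i*h+j] = A[i*n+j];      A12[i*h+j] = A[i*n+j+h];
                A21[i*h+j] = A[(i+h)*n+j];  A22[i*h+j] = A[(i+h)*n+j+h];
                B11[i*h+j] = B[i*n+j];      B12[i*h+j] = B[i*n+j+h];
                B21[i*h+j] = B[(i+h)*n+j];  B22[i*h+j] = B[(i+h)*n+j+h];
            }

        /* Strassen's seven half-size products M1..M7. */
        add(h, A11, A22, T1); add(h, B11, B22, T2); hybrid_mul(h, T1, T2, M + 0*sz);
        add(h, A21, A22, T1);                       hybrid_mul(h, T1, B11, M + 1*sz);
        sub(h, B12, B22, T2);                       hybrid_mul(h, A11, T2, M + 2*sz);
        sub(h, B21, B11, T2);                       hybrid_mul(h, A22, T2, M + 3*sz);
        add(h, A11, A12, T1);                       hybrid_mul(h, T1, B22, M + 4*sz);
        sub(h, A21, A11, T1); add(h, B11, B12, T2); hybrid_mul(h, T1, T2, M + 5*sz);
        sub(h, A12, A22, T1); add(h, B21, B22, T2); hybrid_mul(h, T1, T2, M + 6*sz);

        for (int i = 0; i < h; i++)            /* combine into the quadrants of C */
            for (int j = 0; j < h; j++) {
                int q = i*h + j;
                C[i*n+j]       = M[0*sz+q] + M[3*sz+q] - M[4*sz+q] + M[6*sz+q];
                C[i*n+j+h]     = M[2*sz+q] + M[4*sz+q];
                C[(i+h)*n+j]   = M[1*sz+q] + M[3*sz+q];
                C[(i+h)*n+j+h] = M[0*sz+q] - M[1*sz+q] + M[2*sz+q] + M[5*sz+q];
            }

        free(w);
    }

Changing BREAKPOINT (and measuring the resulting running times) is then the central
experiment of this part of the study.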

Memory effects

The current trend is towards faster processors and larger memories, while the gap between
processor and memory speed is widening. Cache performance will therefore play an
increasingly important role, and computer systems are becoming more hierarchical. This is to
some extent analogous to the situation twenty years ago, when main memory was small and
algorithms processing large amounts of data had to rely on much slower secondary storage.
An important step towards good performance was then to schedule the data so that as large a
fraction of memory requests as possible could be served from main memory. Today the
equivalent is to have as few cache misses as possible. One major difference, though, is that
while cache memory is reasonably predictable, it is usually not controllable at the program
level.

The speed difference between cache and main memory is likely to give negative effects from
losses in data locality, while the recursive algorithms, on the other hand, provide blocking so
that the data actually being processed fits better in the cache. The block formulation of
ordinary matrix multiplication is included in the study for the purpose of comparison with the
two Strassen variants concerning the positive cache effects on speed-up. The blocked variant
of ordinary matrix multiplication is likely to show similar effects from loss of data locality
and from blocking as the Strassen variants, while maintaining the same number of floating-
point operations as the straightforward algorithm. This should give some hints as to what
proportion of the speed-up comes from the reduction in complexity and what proportion
comes from different memory usage alone.
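
One way to realise blocking without changing the operation count is loop tiling; a minimal C
sketch is given below (NB is a hypothetical tile size, chosen so that a few NB-by-NB tiles fit
in the cache, and C is assumed to be zero-initialized). The recursive block formulation
described earlier achieves a similar effect by subdividing the matrices into quadrants.

    #include <stddef.h>

    #define NB 32   /* hypothetical tile size */

    /* C += A*B with all three loops tiled over NB-by-NB blocks. */
    void blocked_mul(size_t n, const double *A, const double *B, double *C)
    {
        for (size_t ii = 0; ii < n; ii += NB)
            for (size_t kk = 0; kk < n; kk += NB)
                for (size_t jj = 0; jj < n; jj += NB) {
                    size_t imax = ii + NB < n ? ii + NB : n;
                    size_t kmax = kk + NB < n ? kk + NB : n;
                    size_t jmax = jj + NB < n ? jj + NB : n;
                    for (size_t i = ii; i < imax; i++)
                        for (size_t k = kk; k < kmax; k++) {
                            double aik = A[i*n + k];
                            for (size_t j = jj; j < jmax; j++)
                                C[i*n + j] += aik * B[k*n + j];
                        }
                }
    }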

The extra memory requirements of the recursive algorithms will have an impact on their
running time. This is certainly the case for Winograd's variant, where there is a scheduling
that requires only n^2/2 extra memory but also requires the computation to be strictly
sequential, which means that there is a probable trade-off between speed and memory
efficiency in the parallel case.

Odd dimensions

The recursive algorithms apply naturally to matrices whose sizes are powers of 2, as such a
matrix can easily be split into its four quadrants in each recursion step. Matrices of all other
sizes, however, introduce extra overhead. There exist several conceivable strategies for
dealing with these cases. The most obvious one is padding the matrix with zeros from the
start up to the nearest larger power of 2 (static padding), but other variants include padding
only at the recursion level where it is necessary (dynamic padding) or simply using fewer
recursion steps (recall from the breakpoint section above that it is not optimal to run the
recursion to the bottom anyway).
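
A minimal C sketch of the static-padding strategy, assuming row-major storage, is given
below (next_pow2 and pad_to_pow2 are illustrative names, not part of the proposal):

    #include <stdlib.h>
    #include <string.h>

    /* Smallest power of two >= n. */
    static size_t next_pow2(size_t n)
    {
        size_t p = 1;
        while (p < n) p *= 2;
        return p;
    }

    /* Copy the n-by-n matrix A into the top-left corner of a zero-filled
       p-by-p matrix, where p = next_pow2(n); returns the padded matrix. */
    double *pad_to_pow2(size_t n, const double *A, size_t *p_out)
    {
        size_t p = next_pow2(n);
        double *P = calloc(p * p, sizeof *P);
        if (P != NULL)
            for (size_t i = 0; i < n; i++)
                memcpy(P + i*p, A + i*n, n * sizeof *P);
        *p_out = p;
        return P;
    }

Dynamic padding would instead add at most one extra row and column at each recursion level
where the current dimension is odd, which keeps the amount of padded data smaller.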

...

References

Andrew W. Appel, Modern Compiler Implementation in C, Cambridge University Press, 1998

David H. Bailey, Extra high speed matrix multiplication on the Cray-2, SIAM J. Sci. Stat.
Comput., Vol. 9, No. 3, May 1988

D. Coppersmith, S. Winograd, Matrix multiplication via arithmetic progressions, J. Symbolic
Comput. 9, p. 251-280, 1990

Don Coppersmith, Rectangular matrix multiplication revisited, Journal of Complexity,
13(1):42-49, March 1997

P. C. Fischer, R. L. Probert, Efficient procedures for using matrix algorithms, Automata,
Languages and Programming, Lecture Notes in Computer Science, Vol. 14, p. 413-427,
Springer-Verlag, Berlin, 1974

J. Fr. Hake, W. Homberg, The impact of memory organization on the performance of matrix
calculations, Parallel Computing 17:311-327, 1991

R. W. Hockney, Parallel Computers 2, 2nd ed., Adam Hilger Publ., 1988

Xiaohan Huang, Victor Pan, Fast rectangular matrix multiplication and applications, Journal
of Complexity 14(2):257-299, June 1998

M. S. Lam, E. E. Rothberg, M. E. Wolf, The cache performance and optimizations of blocked
algorithms, 4th Int. Conf. on Architectural Support for Programming Languages and
Operating Systems (ASPLOS IV), Palo Alto, California, April 9-11, 1991

Victor Pan, How to Multiply Matrices Faster, Springer-Verlag, 1984

Volker Strassen, Gaussian Elimination is not Optimal, Numer. Math. 13, p. 354-356, 1969