[Last update: 17 November 1997]
There is currently some work on optimizing the library for the Pentium
Pro, which should be quite good for the Pentium II, too. At the moment, no
results are available yet, but please look back in the near future. Thanks.
Pentium Optimized BLAS
This is the homepage of the Pentium optimized BLAS project with the aim of
helping programmers writing high performance numerical software on the
widespread intel processors.
Please note that this library is only specially optimized for the original
Intel Pentium processor. Other CPUs like the 486, AMD K-xx, Cyrix/IBM 6x86
and the PentiumPro may or may not gain some speed. Since I don't have access
to such other CPUs, it is nearly impossible for me to produce optimal code
on all platforms.
Purpose of this project
The BLAS library, first introduced by Lawson et. al.
[1], has proved to be
very successful in advanced numeric processing. A standard FORTRAN-77
implementation of the proposed set of routines is available at Netlib
[2], together with some papers describing all of
the routines and with
some design decisions. This standard implementation should compile on
every platform with every standard conforming FORTRAN compiler, with the
exception of the COMPLEX*16 routines. Yet to gain the best efficiency out
of the hardware architectures of today, considering such items as caches,
superscalar and superpipelined processors, latency and throughput, it is
necessary to make some effort to optimize those functions, especially if
one takes into account how many people and libraries are using the BLAS as
their low level routines, that are doing the hard work of numerical
computations.
Goal of this project is to do this for the widely used Intel Pentium
processor and its successors. I have started with the most important
functions of the BLAS level 1 and will continue as my spare time will
allow. With the current interim implementation (see below), speed
improvements of 50-300% can be achieved, therefore spending some time on
this subject is no waste of time.
Results
For those of you only interested in hard facts, here is a short table
comparing three different implementations of some selected functions of
BLAS 1, on different vector lengths.
The test system consists of a P-133, Gigabyte GA-586ATE board, 256K PB,
32M EDO (7-2-2-2) + 32M FPM (8-3-3-3), tests running in EDO, no paging.
Compiler is gcc-2.7.2.1f, g77-0.5.18, compiled for DOS (DJGPP), full
optimization (-O3 -ffast-math -funroll-loops), timing with Pentium
builtin TSC (time stamp counter).
Shown are typical vector sizes for 1st and 2nd level cache (256 & 8192)
and for main memory access (262144). Results are MFLOPS resp. the
speed improvement over the original FORTRAN version. Note that the C
routines are just plain rewrites of the original FORTRAN code.
Assembler C FORTRAN
N MFLOPS MFLOPS MFLOPS
caxpy
256 48,88 37,7% 24,57 -30,8% 35,50
8192 38,02 34,8% 22,10 -21,6% 28,20
262144 24,21 12,6% 16,50 -23,3% 21,50
ccopy
256 28,14 21,6% 12,74 -45,0% 23,14
8192 17,00 128,7% 8,24 10,8% 7,43
262144 12,42 103,1% 6,11 -0,0% 6,11
cdotc
256 64,46 264,2% 39,02 120,4% 17,70
8192 43,34 178,6% 31,80 104,4% 15,56
262144 28,68 111,9% 24,56 81,4% 13,54
cdotu
256 67,33 63,3% 36,52 -11,4% 41,24
8192 45,27 43,6% 29,77 -5,6% 31,52
262144 29,66 22,8% 22,59 -6,4% 24,14
cscal
256 52,83 70,9% 31,45 1,8% 30,91
8192 45,90 68,5% 27,98 2,7% 27,24
262144 37,02 57,2% 24,14 2,5% 23,56
dasum
256 71,29 122,3% 61,16 90,7% 32,07
8192 50,41 100,0% 35,62 41,3% 25,21
262144 34,20 93,7% 22,17 25,5% 17,66
daxpy
256 36,36 208,0% 18,62 57,7% 11,80
8192 21,66 147,7% 13,80 57,8% 8,74
262144 11,88 105,0% 7,68 32,6% 5,79
dcopy
256 27,10 86,0% 27,17 86,5% 14,57
8192 16,36 136,2% 8,23 18,8% 6,93
262144 11,94 118,5% 6,15 12,7% 5,46
ddot
256 48,82 165,7% 33,77 83,8% 18,37
8192 27,67 116,2% 18,67 45,9% 12,80
262144 15,32 79,2% 11,12 30,1% 8,55
dnrm2
256 71,84 362,2% 44,98 189,4% 15,54
8192 50,95 247,1% 29,41 100,4% 14,68
262144 34,68 182,7% 19,58 59,6% 12,27
dscal
256 28,95 272,0% 17,25 121,6% 7,78
8192 16,05 134,2% 12,61 84,0% 6,85
262144 9,18 65,9% 8,81 59,3% 5,53
dzasum
256 73,46 340,3% 58,70 251,8% 16,69
8192 49,61 259,5% 33,30 141,3% 13,80
262144 33,59 201,7% 21,26 91,0% 11,13
dznrm2
256 74,57 286,5% 48,25 150,0% 19,30
8192 50,52 192,9% 29,70 72,1% 17,25
262144 34,17 151,3% 19,74 45,2% 13,60
sasum
256 75,13 134,8% 45,39 41,9% 31,99
8192 64,35 117,0% 36,22 22,1% 29,65
262144 49,62 109,3% 27,63 16,6% 23,70
saxpy
256 32,60 147,8% 22,07 67,8% 13,16
8192 27,97 155,5% 17,45 59,4% 10,95
262144 16,94 111,1% 11,58 44,4% 8,02
scopy
256 43,85 173,1% 67,82 322,4% 16,05
8192 32,32 185,6% 17,18 51,8% 11,31
262144 23,62 154,1% 12,73 36,9% 9,30
sdot
256 48,28 161,6% 39,35 113,2% 18,45
8192 38,45 148,7% 25,35 64,0% 15,46
262144 23,42 104,1% 16,51 43,8% 11,48
snrm2
256 78,09 406,0% 45,55 195,2% 15,43
8192 67,10 340,4% 36,52 139,7% 15,24
262144 51,74 274,0% 27,87 101,4% 13,84
sscal
256 31,12 212,4% 17,51 75,8% 9,96
8192 24,41 155,1% 15,07 57,4% 9,57
262144 17,40 110,1% 12,03 45,2% 8,28
scasum
256 77,00 28,7% 31,39 -47,5% 59,83
8192 62,59 39,3% 26,50 -41,0% 44,94
262144 48,12 46,0% 21,71 -34,2% 32,97
scnrm2
256 82,26 304,2% 44,36 118,0% 20,35
8192 67,08 245,0% 35,34 81,7% 19,44
262144 51,67 205,0% 27,33 61,3% 16,94
zaxpy
256 48,36 67,4% 23,43 -18,9% 28,88
8192 32,67 50,6% 19,19 -11,5% 21,69
262144 18,10 19,0% 12,89 -15,2% 15,21
zcopy
256 15,10 91,1% 9,39 18,9% 7,90
8192 8,36 142,3% 4,24 23,0% 3,45
262144 6,06 128,3% 3,06 15,2% 2,65
zdotc
256 56,76 358,8% 33,75 172,8% 12,37
8192 34,87 224,8% 24,58 129,0% 10,74
262144 21,24 133,9% 16,96 86,8% 9,08
zdotu
256 55,86 39,5% 35,43 -11,5% 40,04
8192 34,32 31,1% 25,82 -1,4% 26,18
262144 20,90 16,9% 17,81 -0,4% 17,89
zscal
256 52,21 82,6% 28,98 1,4% 28,59
8192 40,08 76,8% 23,42 3,3% 22,67
262144 27,45 51,3% 18,10 -0,2% 18,14
Basic ideas
If one looks at the code generated by g77 from the netlib sources, some
points of optimization are evident: better use of registers, better use
of the advanced adressing capabilities, fewer reloadings and recalculations.
Others are less obvious, especially the rearranging of the FP instructions
because of the difference between latency and throughput. The pairable
instruction FXCH (which means that it's essentially clockless) allows a
rearranging without additional clock cycles. One example is the main loop
in xAXPY: a good optimizing compiler might produce a code like this:
FLD [ESI]
FMUL ST(1)
FADD [EDI]
FSTP [EDI]
This requires 9 cycles, one for each instruction (two for FSTP) and
2*2 for the latency of FMUL and FADD. The rearrangement provides us
with 5 cycles, avoiding the 4 cycles of latency, which means that we have
achieved a speed improvement of 80% (meaning 80% more work can be done in
the same time, or, if you prefer, a time reduction of 44% for the same work
has been achieved).
Loop unrolling is an essential requirement to achieve an effective
rearrangement, in order to be able to schedule more instructions. In
addition, the loop overhead per operation is reduced. If vectors do not
come out of the 1st level cache built into the CPU, further items have to
be taken into consideration: on the one hand, one has to force the memory
interface to its limits in order to supply the FPU with the maximal possible
amount of data to process, and on the other hand, one has to reduce page
misses on (slow) DRAM. If these suggestions are carefully implemented,one
can significantly improve the performance of large data sets, without
reducing the performance of little vectors out of the cache. A detailed
paper with all those key ideas is currently in progress and probably will
be finished till christmas (I hope).
Work done until now
(This part is a little bit outdated. See below for the newest test version
with xSCAL and xASUM included, as well as the complex counterparts in both
precisions.)
At the moment (12 November 1996), these routines are rather complete:
xCOPY, xDOT, xNRM2, with x either S or D (single and double precision).
Not really complete, but completely useful are SAXPY and DAXPY, which
already do work, but have no optimizations and larger unrolling factors
for incx=1 and/or incy=1 yet. The assembler routines all have two entry
points, one for FORTRAN calling and one for C calling conventions. At least
under DOS/DJGPP[4] (GCC2.7.2 and G77 0.5.18), the
right one is selected at
link time, consequently there is no problem including both calling
conventions into one library.
As already pointed out, all routines have been completely rewritten using
the GNU assembler, in combination with the GNU assembler preprocessor GASP.
Many thanks from this point to the FSF [3] for providing
all of their GNU
tools. In my opinion, Assembler is the only way to achieve optimal
performance on such architectures as the Pentium because of the stack
based FPU, for which machine generated optimizations are very difficult.
There's only one compiler I know [5], which can
rearrange FP instructions
to avoid latency induced pipline stalls.
C compilers have additional difficulties with regard to pointer aliasing
(see the regularly upcoming discussions in sci.math.num-analysis). Since
most users of the
BLAS come from the FORTRAN world, aliasing should be no problem as the
FORTRAN standard explicitly prohibits it, therefore we can safely assume
that no aliasing will occur. The C users should be aware of running into
problems if they provide the functions with aliased pointers. At least the
pointers which indicate where data is being saved, shouldn't be aliased.
Again, as we do really much computational work, speed is a main factor,
and users should beware of doing discouraged things. Perhaps a debugging
version with checks for aliasing would be a good idea, comments are
welcome.
Things that have to be done
It should be relatively straightforward to write the other routines in the
BLAS 1 with the achieved knowledge of the Pentium FPU and memory interface:
this concerns the different xROTxx variants, xSWAP, xASUM and IxAMAX,
and of course all of the complex functions. Volunteers for this job are
always welcome, just as other suggestions and more testers. If the BLAS 1
is completed, perhaps the BLAS 2 and BLAS 3 can be set about. Especially the
BLAS 3 can be VERY much improved by reusing cache contents more than
once.
(This part is also obsolete. xASUM and xSCAL are working, the complex
routines too. For BLAS 3 speed improvements you may read the article
available at the end. In short, matrix-matrix multiply can get a speedup of
factor 9 compared with a straightforward C implementation!)
A completely different thing are specialized routines for the 486 and
the PPro. The 486 is not very difficult, because it has no pipelined FPU,
therefore loop unrolling and minimizing instruction counts is the only work
that has to be done. If there is enough demand, I could give it a try
myself.
This is completely different with the PPro. As I don't have access to such
a beast,
I know almost nothing about its programming details. I have a paper dealing
with the features of the PPro, but it's a little bit difficult without a
real machine. The PPro seems to be inferior compared with a Pentium (at the
same clock speed, of course), because the FPU pipeline is deeper, resulting
in more latency clocks, and the branch prediction is much more sensitive to
misprediction. Since there is nearly no integer work to be done in the inner
loops, the superscalar design (max. 4 opcodes per cycle) can not be
exploited. Sorry to say this.
Actually, I even don't know whether it's worth the effort, since some kind
soul out there got just no speed improvements of the optimized routines
over just unrolled loops. Consequently the PPro might make some
optimizations unnecessary. Further investigations on this subject has to
be done, therefore people with a PPro are VERY welcome, especially for
the execution of some tests in order to get the same insights into the
working of the FPU and
memory interface as into the Pentium. And if anybody is acquainted with
a document describing exactly this, perferably online available, PLEASE
do let me know.
AND, if possible, the documentation and this homepage should be
completely rewritten by somebody who knows English much
better than myself ;-)
Acknowledgements
Last but not least, some people must be mentioned without whose
work nothing could appear about this subject. I wish to thank all those
friendly men (and women?) out there for sharing their knowledge and work
with the rest of the world.
- BLAS library,
level 1 proposed by C. Lawson, R. Hanson, D. Kincaid, F. Krogh: "Basic
Linear Algebra Subprograms for FORTRAN Usage," ACM Trans.Math.Soft., 5
(1979),
level 2 proposed by J Dongarra, J. DuCroz, S. Hammarling, R.J. Hanson,
"An Extended Set of FORTRAN Basic Linear Algebra Subprograms," ACM
Trans.Math.Soft., 14 (1988),
level 3 proposed by J.J. Dongarra, J.J. Du Croz, I. Duff, S. Hammarling,
"A set of level 3 Basic Linear Algebra Subprograms," ACM Trans.Math.Soft.,
16 (1990)
Implementation available via NetLib
- Netlib software repository, with tons of
mathematical codes publicly available, see
[FTP] and
[WWW].
- Free
Software Foundation (FSF), responsible for
such tools like gcc, g77 and all the other numerous tools.
- DJ Delorie
and his project of porting GNU software,
especially gcc, to the somewhat unfriendly (with respect to porting UNIX
software) DOS environment.
- This is concerning the NDP FORTRAN compiler, about
which I know nothing more than that it can schedule FPU instructions, from
an article in Dr.
Dobb's Journal, January 1995, pp. 18-29. I don't even know whether it
still exists today, but if it does, I'd like to hear something from
people working with this compiler.
- My tester in the first place
Konstantinos Kourakis, who answered my
very first posting in UseNet and encouraged me to develop my ideas further.
- And all the friendly people on the UseNet, for their interesting (most
of the time) discussions and insights, especially the people showing
interest in my work (which doesn't mean anybody else than YOU).
Availability
In this section the described library is online available. Please remember
that this code is copyrighted by me, but you may use it and redistribute it
provided that nothing is changed. There is absolutely no warranty! I have
made a package of the compiled library, the source codes and a short
performance abstract. The current version number is 0.1 (first version
released). Choose the one suitable for your system:
- DOS/DJGPP V2.0: BLAS1.ZIP
and TESTS.ZIP, the testsuite
and performance test program I use.
- LINUX: blas1.tar.gz
- CBLAS1.ZIP, a C version of
all the BLAS1 routines, just plain rewrites of the FORTRAN ones. For those
of you who are interested in comparing their speed.
The new testversion (0.9) including xSCAL, xASUM and the COMPLEX routines is
available as zipfile for
DOS/DJGPP and Borland C++, also as
tarfile for Linux.
And an article about pentium FPU
optimization tricks in postscript format, original published in the german
magazine Toolbox 2/97, but only available in german, sorry! The
example programs from the article.
And the extracted parts for a very partial implementation of the level 3
dgemm routine.
Back to my english or
german homepage.
WOW, this page has been accessed more than 4500 times since its
start (Sep. 1996). The library has been fetched more than 1500 times in its
various versions.
Manuel Kessler