[Last update: 17 November 1997]

There is currently some work on optimizing the library for the Pentium Pro, which should be quite good for the Pentium II, too. At the moment, no results are available yet, but please look back in the near future. Thanks.

Pentium Optimized BLAS

This is the homepage of the Pentium optimized BLAS project with the aim of helping programmers writing high performance numerical software on the widespread intel processors.

Please note that this library is only specially optimized for the original Intel Pentium processor. Other CPUs like the 486, AMD K-xx, Cyrix/IBM 6x86 and the PentiumPro may or may not gain some speed. Since I don't have access to such other CPUs, it is nearly impossible for me to produce optimal code on all platforms.

Purpose of this project

The BLAS library, first introduced by Lawson et. al. [1], has proved to be very successful in advanced numeric processing. A standard FORTRAN-77 implementation of the proposed set of routines is available at Netlib [2], together with some papers describing all of the routines and with some design decisions. This standard implementation should compile on every platform with every standard conforming FORTRAN compiler, with the exception of the COMPLEX*16 routines. Yet to gain the best efficiency out of the hardware architectures of today, considering such items as caches, superscalar and superpipelined processors, latency and throughput, it is necessary to make some effort to optimize those functions, especially if one takes into account how many people and libraries are using the BLAS as their low level routines, that are doing the hard work of numerical computations.
Goal of this project is to do this for the widely used Intel Pentium processor and its successors. I have started with the most important functions of the BLAS level 1 and will continue as my spare time will allow. With the current interim implementation (see below), speed improvements of 50-300% can be achieved, therefore spending some time on this subject is no waste of time.

Results

For those of you only interested in hard facts, here is a short table comparing three different implementations of some selected functions of BLAS 1, on different vector lengths. The test system consists of a P-133, Gigabyte GA-586ATE board, 256K PB, 32M EDO (7-2-2-2) + 32M FPM (8-3-3-3), tests running in EDO, no paging. Compiler is gcc-2.7.2.1f, g77-0.5.18, compiled for DOS (DJGPP), full optimization (-O3 -ffast-math -funroll-loops), timing with Pentium builtin TSC (time stamp counter). Shown are typical vector sizes for 1st and 2nd level cache (256 & 8192) and for main memory access (262144). Results are MFLOPS resp. the speed improvement over the original FORTRAN version. Note that the C routines are just plain rewrites of the original FORTRAN code.
	Assembler	C		FORTRAN
N	MFLOPS		MFLOPS		MFLOPS
caxpy					
256	48,88	37,7%	24,57	-30,8%	35,50
8192	38,02	34,8%	22,10	-21,6%	28,20
262144	24,21	12,6%	16,50	-23,3%	21,50
ccopy					
256	28,14	21,6%	12,74	-45,0%	23,14
8192	17,00	128,7%	8,24	10,8%	7,43
262144	12,42	103,1%	6,11	-0,0%	6,11
cdotc					
256	64,46	264,2%	39,02	120,4%	17,70
8192	43,34	178,6%	31,80	104,4%	15,56
262144	28,68	111,9%	24,56	81,4%	13,54
cdotu					
256	67,33	63,3%	36,52	-11,4%	41,24
8192	45,27	43,6%	29,77	-5,6%	31,52
262144	29,66	22,8%	22,59	-6,4%	24,14
cscal					
256	52,83	70,9%	31,45	1,8%	30,91
8192	45,90	68,5%	27,98	2,7%	27,24
262144	37,02	57,2%	24,14	2,5%	23,56
dasum					
256	71,29	122,3%	61,16	90,7%	32,07
8192	50,41	100,0%	35,62	41,3%	25,21
262144	34,20	93,7%	22,17	25,5%	17,66
daxpy					
256	36,36	208,0%	18,62	57,7%	11,80
8192	21,66	147,7%	13,80	57,8%	8,74
262144	11,88	105,0%	7,68	32,6%	5,79
dcopy					
256	27,10	86,0%	27,17	86,5%	14,57
8192	16,36	136,2%	8,23	18,8%	6,93
262144	11,94	118,5%	6,15	12,7%	5,46
ddot					
256	48,82	165,7%	33,77	83,8%	18,37
8192	27,67	116,2%	18,67	45,9%	12,80
262144	15,32	79,2%	11,12	30,1%	8,55
dnrm2					
256	71,84	362,2%	44,98	189,4%	15,54
8192	50,95	247,1%	29,41	100,4%	14,68
262144	34,68	182,7%	19,58	59,6%	12,27
dscal					
256	28,95	272,0%	17,25	121,6%	7,78
8192	16,05	134,2%	12,61	84,0%	6,85
262144	9,18	65,9%	8,81	59,3%	5,53
dzasum					
256	73,46	340,3%	58,70	251,8%	16,69
8192	49,61	259,5%	33,30	141,3%	13,80
262144	33,59	201,7%	21,26	91,0%	11,13
dznrm2					
256	74,57	286,5%	48,25	150,0%	19,30
8192	50,52	192,9%	29,70	72,1%	17,25
262144	34,17	151,3%	19,74	45,2%	13,60
sasum					
256	75,13	134,8%	45,39	41,9%	31,99
8192	64,35	117,0%	36,22	22,1%	29,65
262144	49,62	109,3%	27,63	16,6%	23,70
saxpy					
256	32,60	147,8%	22,07	67,8%	13,16
8192	27,97	155,5%	17,45	59,4%	10,95
262144	16,94	111,1%	11,58	44,4%	8,02
scopy					
256	43,85	173,1%	67,82	322,4%	16,05
8192	32,32	185,6%	17,18	51,8%	11,31
262144	23,62	154,1%	12,73	36,9%	9,30
sdot					
256	48,28	161,6%	39,35	113,2%	18,45
8192	38,45	148,7%	25,35	64,0%	15,46
262144	23,42	104,1%	16,51	43,8%	11,48
snrm2					
256	78,09	406,0%	45,55	195,2%	15,43
8192	67,10	340,4%	36,52	139,7%	15,24
262144	51,74	274,0%	27,87	101,4%	13,84
sscal					
256	31,12	212,4%	17,51	75,8%	9,96
8192	24,41	155,1%	15,07	57,4%	9,57
262144	17,40	110,1%	12,03	45,2%	8,28
scasum					
256	77,00	28,7%	31,39	-47,5%	59,83
8192	62,59	39,3%	26,50	-41,0%	44,94
262144	48,12	46,0%	21,71	-34,2%	32,97
scnrm2					
256	82,26	304,2%	44,36	118,0%	20,35
8192	67,08	245,0%	35,34	81,7%	19,44
262144	51,67	205,0%	27,33	61,3%	16,94
zaxpy					
256	48,36	67,4%	23,43	-18,9%	28,88
8192	32,67	50,6%	19,19	-11,5%	21,69
262144	18,10	19,0%	12,89	-15,2%	15,21
zcopy					
256	15,10	91,1%	9,39	18,9%	7,90
8192	8,36	142,3%	4,24	23,0%	3,45
262144	6,06	128,3%	3,06	15,2%	2,65
zdotc					
256	56,76	358,8%	33,75	172,8%	12,37
8192	34,87	224,8%	24,58	129,0%	10,74
262144	21,24	133,9%	16,96	86,8%	9,08
zdotu					
256	55,86	39,5%	35,43	-11,5%	40,04
8192	34,32	31,1%	25,82	-1,4%	26,18
262144	20,90	16,9%	17,81	-0,4%	17,89
zscal					
256	52,21	82,6%	28,98	1,4%	28,59
8192	40,08	76,8%	23,42	3,3%	22,67
262144	27,45	51,3%	18,10	-0,2%	18,14

Basic ideas

If one looks at the code generated by g77 from the netlib sources, some points of optimization are evident: better use of registers, better use of the advanced adressing capabilities, fewer reloadings and recalculations. Others are less obvious, especially the rearranging of the FP instructions because of the difference between latency and throughput. The pairable instruction FXCH (which means that it's essentially clockless) allows a rearranging without additional clock cycles. One example is the main loop in xAXPY: a good optimizing compiler might produce a code like this:
	FLD     [ESI]
	FMUL    ST(1)
	FADD    [EDI]
	FSTP    [EDI]
This requires 9 cycles, one for each instruction (two for FSTP) and 2*2 for the latency of FMUL and FADD. The rearrangement provides us with 5 cycles, avoiding the 4 cycles of latency, which means that we have achieved a speed improvement of 80% (meaning 80% more work can be done in the same time, or, if you prefer, a time reduction of 44% for the same work has been achieved).
Loop unrolling is an essential requirement to achieve an effective rearrangement, in order to be able to schedule more instructions. In addition, the loop overhead per operation is reduced. If vectors do not come out of the 1st level cache built into the CPU, further items have to be taken into consideration: on the one hand, one has to force the memory interface to its limits in order to supply the FPU with the maximal possible amount of data to process, and on the other hand, one has to reduce page misses on (slow) DRAM. If these suggestions are carefully implemented,one can significantly improve the performance of large data sets, without reducing the performance of little vectors out of the cache. A detailed paper with all those key ideas is currently in progress and probably will be finished till christmas (I hope).

Work done until now

(This part is a little bit outdated. See below for the newest test version with xSCAL and xASUM included, as well as the complex counterparts in both precisions.)
At the moment (12 November 1996), these routines are rather complete: xCOPY, xDOT, xNRM2, with x either S or D (single and double precision). Not really complete, but completely useful are SAXPY and DAXPY, which already do work, but have no optimizations and larger unrolling factors for incx=1 and/or incy=1 yet. The assembler routines all have two entry points, one for FORTRAN calling and one for C calling conventions. At least under DOS/DJGPP[4] (GCC2.7.2 and G77 0.5.18), the right one is selected at link time, consequently there is no problem including both calling conventions into one library.
As already pointed out, all routines have been completely rewritten using the GNU assembler, in combination with the GNU assembler preprocessor GASP. Many thanks from this point to the FSF [3] for providing all of their GNU tools. In my opinion, Assembler is the only way to achieve optimal performance on such architectures as the Pentium because of the stack based FPU, for which machine generated optimizations are very difficult. There's only one compiler I know [5], which can rearrange FP instructions to avoid latency induced pipline stalls.
C compilers have additional difficulties with regard to pointer aliasing (see the regularly upcoming discussions in sci.math.num-analysis). Since most users of the BLAS come from the FORTRAN world, aliasing should be no problem as the FORTRAN standard explicitly prohibits it, therefore we can safely assume that no aliasing will occur. The C users should be aware of running into problems if they provide the functions with aliased pointers. At least the pointers which indicate where data is being saved, shouldn't be aliased. Again, as we do really much computational work, speed is a main factor, and users should beware of doing discouraged things. Perhaps a debugging version with checks for aliasing would be a good idea, comments are welcome.

Things that have to be done

It should be relatively straightforward to write the other routines in the BLAS 1 with the achieved knowledge of the Pentium FPU and memory interface: this concerns the different xROTxx variants, xSWAP, xASUM and IxAMAX, and of course all of the complex functions. Volunteers for this job are always welcome, just as other suggestions and more testers. If the BLAS 1 is completed, perhaps the BLAS 2 and BLAS 3 can be set about. Especially the BLAS 3 can be VERY much improved by reusing cache contents more than once.
(This part is also obsolete. xASUM and xSCAL are working, the complex routines too. For BLAS 3 speed improvements you may read the article available at the end. In short, matrix-matrix multiply can get a speedup of factor 9 compared with a straightforward C implementation!)
A completely different thing are specialized routines for the 486 and the PPro. The 486 is not very difficult, because it has no pipelined FPU, therefore loop unrolling and minimizing instruction counts is the only work that has to be done. If there is enough demand, I could give it a try myself.
This is completely different with the PPro. As I don't have access to such a beast, I know almost nothing about its programming details. I have a paper dealing with the features of the PPro, but it's a little bit difficult without a real machine. The PPro seems to be inferior compared with a Pentium (at the same clock speed, of course), because the FPU pipeline is deeper, resulting in more latency clocks, and the branch prediction is much more sensitive to misprediction. Since there is nearly no integer work to be done in the inner loops, the superscalar design (max. 4 opcodes per cycle) can not be exploited. Sorry to say this.
Actually, I even don't know whether it's worth the effort, since some kind soul out there got just no speed improvements of the optimized routines over just unrolled loops. Consequently the PPro might make some optimizations unnecessary. Further investigations on this subject has to be done, therefore people with a PPro are VERY welcome, especially for the execution of some tests in order to get the same insights into the working of the FPU and memory interface as into the Pentium. And if anybody is acquainted with a document describing exactly this, perferably online available, PLEASE do let me know.
AND, if possible, the documentation and this homepage should be completely rewritten by somebody who knows English much better than myself ;-)

Acknowledgements

Last but not least, some people must be mentioned without whose work nothing could appear about this subject. I wish to thank all those friendly men (and women?) out there for sharing their knowledge and work with the rest of the world.
  1. BLAS library,
    level 1 proposed by C. Lawson, R. Hanson, D. Kincaid, F. Krogh: "Basic Linear Algebra Subprograms for FORTRAN Usage," ACM Trans.Math.Soft., 5 (1979),
    level 2 proposed by J Dongarra, J. DuCroz, S. Hammarling, R.J. Hanson, "An Extended Set of FORTRAN Basic Linear Algebra Subprograms," ACM Trans.Math.Soft., 14 (1988),
    level 3 proposed by J.J. Dongarra, J.J. Du Croz, I. Duff, S. Hammarling, "A set of level 3 Basic Linear Algebra Subprograms," ACM Trans.Math.Soft., 16 (1990)
    Implementation available via NetLib
  2. Netlib software repository, with tons of mathematical codes publicly available, see [FTP] and [WWW].
  3. Free Software Foundation (FSF), responsible for such tools like gcc, g77 and all the other numerous tools.
  4. DJ Delorie and his project of porting GNU software, especially gcc, to the somewhat unfriendly (with respect to porting UNIX software) DOS environment.
  5. This is concerning the NDP FORTRAN compiler, about which I know nothing more than that it can schedule FPU instructions, from an article in Dr. Dobb's Journal, January 1995, pp. 18-29. I don't even know whether it still exists today, but if it does, I'd like to hear something from people working with this compiler.
  6. My tester in the first place Konstantinos Kourakis, who answered my very first posting in UseNet and encouraged me to develop my ideas further.
  7. And all the friendly people on the UseNet, for their interesting (most of the time) discussions and insights, especially the people showing interest in my work (which doesn't mean anybody else than YOU).

Availability

In this section the described library is online available. Please remember that this code is copyrighted by me, but you may use it and redistribute it provided that nothing is changed. There is absolutely no warranty! I have made a package of the compiled library, the source codes and a short performance abstract. The current version number is 0.1 (first version released). Choose the one suitable for your system: The new testversion (0.9) including xSCAL, xASUM and the COMPLEX routines is available as zipfile for DOS/DJGPP and Borland C++, also as tarfile for Linux.

And an article about pentium FPU optimization tricks in postscript format, original published in the german magazine Toolbox 2/97, but only available in german, sorry! The example programs from the article. And the extracted parts for a very partial implementation of the level 3 dgemm routine.

Back to my english or german homepage.


WOW, this page has been accessed more than 4500 times since its start (Sep. 1996). The library has been fetched more than 1500 times in its various versions.
Manuel Kessler