Re: [spr8043] Allegro 4.1 on a Sparc 10

Your problem report has been assigned tracking id spr8043.
Please use it in the subject line of any correspondence
regarding your problem report.

>> I'm sending this as a bug-report as well as to the mailing list
>> because I'm hoping to get some information from both Franz and other
>> Allegro users.  My basic question is, why isn't my Allegro code very
>> much faster on a Sparc 10 than a Sparc 2?
>> I'm running Allegro CL 4.1 [SPARC; R1].  The first thing to say is
>> that a standard Allegro image cannot run on both a 10 and a non-10
>> (something to do with the stack location).  So we installed the
>> relevant patches, and now have an image that can run on both.  To my
>> vast disappointment, however, my system (a natural language
>> understander) runs only marginally faster on the 10 (approximately 1.4
>> times faster).
>> Now, if I remember right, a 10 is supposed to have approximately 2.5
>> to 3 times the cojones of a 2, measured in spec marks.  Some simple,
>> purely numeric test code that I wrote does indeed run at least 3 times
>> faster.  Even some simple, non-numeric test code (lots of consing,
>> calls to MEMBER, EQUAL, etc.) runs about twice as fast.
>> But my big, hairy, heavily CLOSified system just isn't that much
>> faster.  Just to make sure it wasn't the patched image, I compared the
>> unpatched and the patched Allegro (on a Sparc 2), and there was no
>> appreciable difference in performance.
>> So, do any other users have any experience with 10s yet?  Does Franz
>> have any intuitions about this situation?
>> I'd greatly appreciate any information.  I imagine the mailing list as
>> a whole would be interested as well.

My thanks to those others who have submitted answers to your question.
An oversimplified encapsulation of what they stated is "your mileage may
vary."  Of course, we can't just leave it at that, and we have been doing
some research into this phenomenon.


When we first heard of this problem from another customer in November, we
asked Sun about it.  Sun loaned us a Sparc 10 Model 30 for a month, since
we had none of our own on which to run experiments.  We gathered some
initial findings, some of which led to immediate speedups in development
versions of Allegro CL.  At the end of the month, we decided that we should
purchase a Sparc 10 of our own.  We have now received this Sparc 10/30, and
will continue to work on characterizing and speeding up the lisp so that it
scales better on the 10.

Where improvements to the lisp are possible, I have been working in some
cases on Allegro CL 4.1, but most of my effort has been spent in the 4.2
development sources, since that version has the best chance of significant
improvements.  Allegro CL 4.2 has a different compiler that does a better
job of register allocation and allows such features as unboxed 32-bit
integers, functions that don't link in a stack frame, and tail-call
elimination (very useful for this study).  Allegro CL 4.2 also has 30-bit
fixnums (although the currently released 4.2.beta.0 still has 29-bit
fixnums), which provide advantages for array accessing.  Wherever I refer
to Allegro CL 4.2 below, I am referring to the current development version,
and not to any version released so far.

Initial Findings:

Certain aspects of the Sparc 10 do not scale up from the Sparc 2 as well as
others.  The cpu itself is certainly fast, but so far we have found several
areas which affect lisp performance:

 1. Register-window saving/restoring:  The 10/30 seems to take longer to
save out its register-window set than the sparc 2.  Thus, if the program you
are running does deep recursion, windows will be saved more, and the total
run time will increase.

In Allegro CL, the deepest recursion tendency is in CLOS and closified
streams code.  I spent some time tuning the streams code with the specific
idea in mind that the code should be "flattened out" using tail-call-merging
and functions that do not link in a stack frame (i.e. functions that do not
use the "save" and "restore" instructions).  There is still some work that
needs to be done in this area, and since we did not have the loaner Sparc 10
for long, we could not accurately measure how well the optimizations did
(bear in mind that the same optimizations are likely to speed up operations
on the sparc 2 as well!).
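To make the effect of call depth concrete, here is a minimal sketch in C of
a simplified register-window model.  It is not taken from Allegro CL or from
any Sun documentation: the function name count_window_traps, the event
encoding, and the usable_windows parameter are all assumptions made for
illustration.  The model only counts how often a window would have to be
written out or read back for a given sequence of calls and returns.

```c
#include <assert.h>
#include <stddef.h>

/* Trap counts in this simplified model (hypothetical type). */
struct window_traps {
    long spills;  /* overflow: a window had to be written out to memory */
    long fills;   /* underflow: a window had to be read back in */
};

/*
 * events[i] > 0 models a call that executes "save"; anything else models
 * the return that executes "restore".  usable_windows is an assumed number
 * of windows that can be resident before a trap is needed.
 */
struct window_traps count_window_traps(const int *events, size_t n,
                                       int usable_windows)
{
    struct window_traps t = {0, 0};
    long depth = 0;   /* current call depth */
    long oldest = 0;  /* depth of the oldest frame still held in registers */

    assert(usable_windows > 0);
    for (size_t i = 0; i < n; i++) {
        if (events[i] > 0) {
            depth++;
            if (depth - oldest + 1 > usable_windows) {
                oldest++;        /* evict the oldest resident window */
                t.spills++;
            }
        } else {
            depth--;
            if (depth < oldest) {
                oldest--;        /* bring the caller's window back */
                t.fills++;
            }
        }
    }
    return t;
}
```

In this model, 100 nested calls followed by 100 returns with 8 usable
windows produce 93 spills and 93 fills, while 100 back-to-back call/return
pairs at the same depth produce none.  That is the motivation for
flattening deep call chains.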

 2. Fixed-point multiply/divide instructions:  The Sparc Version 8
architecture allows hardware multiply (smul) and divide (sdiv)
instructions to be implemented.  When the same instruction encodings are
executed on a sparc 2, they cause an unimplemented-instruction trap, at
which point the operating system does the multiply/divide and returns as if
the instruction were really implemented.  This emulation is both a blessing
and a curse: code that uses the new instructions still runs on older chips,
but I don't yet know how to tell for sure whether a particular sparc chip
has the multiply/divide hardware or not.

We perform integer multiplies by doing a primitive call to an assembler
function whose address is in a table.  This function either calls the
software version of the multiply, which is about 35 instructions, or it
executes the hardware multiply instruction (and a setup instruction) which
takes two instructions.  Incidentally, we cannot just move to the new
hardware multiply, because if a sparc2 "executes" this instruction, it
takes several hundred instructions to take the trap, go through the software
multiply, and return.  Thus, we will need to figure out how to determine at
lisp startup time whether the current machine on which the lisp is running
has the hardware multiply or not.
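The table-driven dispatch described above can be sketched in C roughly as
follows.  This is only an illustration, not the Allegro CL runtime: the
names primitive_table, install_multiply, multiply_software and
multiply_hardware are hypothetical, the "hardware" path is just the C
multiply operator standing in for smul, and the has_hardware_multiply flag
is assumed to come from a startup check that, as noted, we do not yet know
how to write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t (*multiply_fn)(int32_t, int32_t);

/* Software path: shift-and-add, keeping the low 32 bits of the product. */
static int32_t multiply_software(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b, acc = 0;

    while (ub != 0) {
        if (ub & 1u)
            acc += ua;
        ua <<= 1;
        ub >>= 1;
    }
    return (int32_t)acc;
}

/* Stand-in for the hardware path; a real runtime would issue smul here. */
static int32_t multiply_hardware(int32_t a, int32_t b)
{
    return (int32_t)((uint32_t)a * (uint32_t)b);
}

/* Hypothetical table of primitive entry points, filled in at startup. */
enum { PRIM_MULTIPLY, PRIM_COUNT };
static multiply_fn primitive_table[PRIM_COUNT];

/* has_hardware_multiply is an assumed input; detection is not shown. */
void install_multiply(int has_hardware_multiply)
{
    primitive_table[PRIM_MULTIPLY] =
        has_hardware_multiply ? multiply_hardware : multiply_software;
}

/* Every integer multiply goes through the table entry. */
int32_t multiply(int32_t a, int32_t b)
{
    assert(primitive_table[PRIM_MULTIPLY] != NULL);
    return primitive_table[PRIM_MULTIPLY](a, b);
}
```

The point of the indirection is that compiled code never contains the
multiply instruction itself, so the choice between the two paths can be
made once per run without recompiling anything.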

 3. Floating-point conversion:  I did not get very far with this one in my
experimentation, because it was discovered fairly late in the loaner period.
There _seems_ to be some lack of scalability when you mix single-float and
double-float code.  I have not yet had a chance to look into this more
deeply.

If you have any code that simply declares numbers as "float" instead of
single-float or double-float, or if you do a double-float calculation on a
single-float constant, you may end up going through some data type
conversion.  This problem has plagued us enough times that we have added
a (muffle-able) warning that flags any plain float declarations (as
opposed to single-float or double-float).
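The same kind of hidden conversion can be shown by analogy in C, where an
unsuffixed floating constant is a double.  This is only an analogy to the
Lisp situation described above, not Allegro CL code, and the function names
scale_mixed and scale_single are made up for the example.

```c
#include <assert.h>

/* Mixing precisions: 0.1 is a double constant, so x is widened to double,
   the multiply is done in double, and the result is narrowed on return. */
float scale_mixed(float x)
{
    return x * 0.1;
}

/* Staying in single precision: 0.1f is a float constant. */
float scale_single(float x)
{
    return x * 0.1f;
}

/* The expression types differ even though both functions return float. */
static_assert(sizeof(1.0f * 0.1) == sizeof(double),
              "mixed expression has type double");
static_assert(sizeof(1.0f * 0.1f) == sizeof(float),
              "single-precision expression has type float");
```

Both functions give nearly the same answer; the difference is the two extra
conversions in scale_mixed, which is the kind of cost that an explicit
single-float or double-float declaration avoids.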

 4. Memory latency:  There are differences in memory latency between the
sparc 2, the 10/30 and the 10/41.  This affects how the system performs in
the presence of "interlocked" reads (i.e. read from memory into a register,
and then use that register to read again from memory).  Allegro CL 4.2
has slightly better interlocking characteristics than 4.1, although there
are still some things we can do.
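As an illustration of what an interlocked read looks like in code, the C
sketch below contrasts a chain of dependent loads with a run of independent
ones.  The struct cell type and both function names are assumptions for the
example; no timings are claimed, since those depend on the machine.

```c
#include <assert.h>
#include <stddef.h>

/* A cons-like cell (hypothetical layout). */
struct cell {
    long value;
    struct cell *next;
};

/* Dependent loads: the address of each cell comes out of the previous
   load, so every read of c->next must complete before the next begins. */
long sum_list(const struct cell *head)
{
    long total = 0;

    for (const struct cell *c = head; c != NULL; c = c->next)
        total += c->value;
    return total;
}

/* Independent loads: each address is computed from the base and index,
   so no load has to wait for the result of another. */
long sum_vector(const long *values, size_t n)
{
    long total = 0;

    assert(values != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        total += values[i];
    return total;
}
```

List-heavy lisp code tends to look like sum_list, which is why memory
latency shows up more in it than in simple numeric loops.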

 5. PSO:  In addition to memory latency, there is another memory issue
that affects performance.  The Sparc Version 8 architecture manual defines
a PSO bit, which may or may not be implemented.  The older sparcs all
implement TSO, or Total Store Ordering, in which there is conceptually no
data cache between your virtual address space and your real address space.
When your program writes to a data location, every processor (including
the instruction-fetch mechanism) will see the data you wrote out, and is not
in danger of seeing the old data.  Those sparcs that implement PSO, or
Partial Store Ordering, are really providing what has become the more
traditional write-through-cache technique.  This of course would have
ramifications on lisp, because we would need to flush the cache whenever a
code vector moves.  On the other hand, we have had quite a bit of experience
with other architectures that only provide write-through cache, so this
would not be a problem.

The TSO concept is inherently slower than PSO, because the hardware is
forced to perform checks to guarantee synchronicity.  PSO is faster, but
also more dangerous, since some programs, especially those which rely
on multiple CPU execution or on executing code in data space, would need
modification in order to run in PSO mode.

I understand that the only sparc that implements PSO is the chip that is
in the model 41.  I am told, though, that the PSO bit cannot be turned on by
a user, and there are not yet any plans to allow specific programs to turn
this bit on (on a per-process basis).
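For the cache flush mentioned above, a minimal sketch in C of moving a code
vector and then flushing might look like this.  It is not how Allegro CL
does it: move_code_vector is a hypothetical name, and the flush relies on
the GCC/Clang builtin __builtin___clear_cache, which is assumed to be
available (it is skipped on other compilers).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy nbytes of machine code to its new location, then make sure the
   instruction-fetch side cannot see stale bytes at that address. */
void *move_code_vector(void *dst, const void *src, size_t nbytes)
{
    assert(dst != NULL && src != NULL);
    memmove(dst, src, nbytes);
#if defined(__GNUC__) || defined(__clang__)
    /* Compiler builtin; emits whatever flush the target requires. */
    __builtin___clear_cache((char *)dst, (char *)dst + nbytes);
#endif
    return dst;
}
```

On a machine whose memory model already keeps instruction fetch coherent
with stores, the flush costs little or nothing; where it does not, leaving
it out is the kind of bug that only appears after a garbage collection
moves code.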

I have tried to provide you with some of the insights we have obtained over
the past few months.  I hope these help you to reconcile the actual
performance gains you see with the expectations you were given.  Please feel
free to ask me more questions, or to make suggestions.

Duane Rettig, Franz Inc.        1995 University Avenue, Suite 275
duane@Franz.COM (internet)      Berkeley, CA  94704
uunet!franz!duane (uucp)        Phone: (510) 548-3600; FAX: (510) 548-8253