Re: CMU CL 15c performance
Thanks for taking the time to look into this and responding so
quickly.
Some comments, though:
Rob_MacLachlan@lisp-pmax1.slisp.cs.cmu.edu writes:
> Unfortunately, CL lacks any standard way to do block I/O. Changing to use
> CMU specific operations resulted in:
> Seconds | Consed | Calls | Sec/Call | Name:
> ------------------------------------------------------
> 19.650 | 2,464,112 | 1 | 19.64990 | LOCSUM
> 4.860 | 307,752 | 1 | 4.85990 | CONVERT-FLOAT-BYTE-IMAGE
> 4.660 | 1,232,056 | 1 | 4.65990 | CONVERT-BYTE-FLOAT-IMAGE
> 0.150 | 307,936 | 1 | 0.14990 | READ-BYTE-IMAGE-RAW
> 0.010 | 760 | 1 | 0.00990 | WRITE-BYTE-IMAGE-RAW
> ------------------------------------------------------
> 29.329 | 4,312,616 | 5 | | Total
Even so, the C version runs in under 6 seconds, including byte-wise
I/O, and the fixnum multiply for array indexing makes up only 18% of
the running time.
I think the differences between C and CMU Common Lisp are mostly due
to missed peephole-style optimizations.
Consider:
(defun foo (n r)
  (declare (fixnum n) (single-float r))
  (if (< n 1) 1.0s0 (foo (- n 1) (* -1.0s0 r))))
The optimal inner loop (from GCC 2.0) is (at least, it looks optimal
to me):
        cmp %i0,0
        ble L3
        nop
L4:
        add %i0,-1,%i0
        cmp %i0,0
        bg L4
        fnegs %f0,%f0
The corresponding output from Python is:
58: L0: MOVE %A0, %LRA ; No-arg-parsing entry point
5C: SRA %CFUNC, 2, %NL1 ; %CFUNC = N
60: ADD %ZERO, 1, %NL2
64: CMP %NL1, %NL2
68: BLT L1
6C: NOP
70: SUB 4, %CFUNC ; %CFUNC = N
74: LD [%CODE+17], %A0 ; -1.0
78: LDF [%A0+-3], %F2
7C: FMULS %F2, %F0 ; %F0 = R
80: MOVE %LRA, %A0
84: B L0
88: NOP
Now, the compiler failed to fill some branch slots, but I think the
main problems are:
* there is redundant unboxing (on line 5C)
* the floating point constant -1.0s0 is loaded on every
iteration (note that GCC 2.0 even figured out that it
could replace the multiplication with a negate)
* the moves on lines 58 and 80 seem redundant
* the termination condition is checked at the top, not
at the bottom
None of these seem to be SPARC-specific optimizations, either.
Anyway, I'm not complaining. I think CMU CL is a great implementation,
and I understand that you operate under constraints.
BTW, do you have any comments on why I had to declare types explicitly
in the parts of the code that I marked "INFERENCEn"? I think doing
this kind of inference in the compiler (if legal) would greatly
simplify writing numerical code.
Also, what are your thoughts on how/whether I can (or should be able
to) use polymorphic higher-order functions in numerical code
efficiently? At least for higher-order functions that have been
explicitly declared inline and are present in the same compilation
unit as the client, I would hope that any consing for return values
could be avoided.
Thanks, Thomas.