
Re: CMU CL 15c performance



Thanks for taking the time to look into this and responding so
quickly.

Some comments, though:

Rob_MacLachlan@lisp-pmax1.slisp.cs.cmu.edu writes:
 > Unfortunately, CL lacks any standard way to do block I/O.  Changing to use
 > CMU specific operations resulted in:
 >       Seconds  |  Consed   |  Calls  |  Sec/Call  |  Name:
 >     ------------------------------------------------------
 > 	19.650 | 2,464,112 |       1 |   19.64990 | LOCSUM
 > 	 4.860 |   307,752 |       1 |    4.85990 | CONVERT-FLOAT-BYTE-IMAGE
 > 	 4.660 | 1,232,056 |       1 |    4.65990 | CONVERT-BYTE-FLOAT-IMAGE
 > 	 0.150 |   307,936 |       1 |    0.14990 | READ-BYTE-IMAGE-RAW
 > 	 0.010 |       760 |       1 |    0.00990 | WRITE-BYTE-IMAGE-RAW
 >     ------------------------------------------------------
 > 	29.329 | 4,312,616 |       5 |            | Total

Even so, the C version runs in under 6 seconds, including byte-wise
I/O, while the fixnum multiply for array indexing accounts for only
18% of the running time.
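To make the comparison concrete, here is a hypothetical C sketch of
the kind of inner loop being timed (the actual
CONVERT-BYTE-FLOAT-IMAGE source is not quoted in this message, and
the names here are made up); the y * width term is the array-indexing
multiply mentioned above:

```c
/* Hypothetical sketch of the byte-to-float image conversion loop.
   The y * width term is the "fixnum multiply" used for row-major
   array indexing. */
void convert_byte_float_image(const unsigned char *src, float *dst,
                              int height, int width)
{
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            dst[y * width + x] = (float)src[y * width + x];
}
```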

I think the differences between C and CMU Common Lisp here are mostly
due to missed peephole-style optimizations.

Consider:

(defun foo (n r)
  (declare (fixnum n) (single-float r))
  (if (< n 1) 1.0s0 (foo (- n 1) (* -1.0s0 r))))

The optimal inner loop (from GCC 2.0) is (at least, it looks optimal
to me):

        cmp %i0,0
        ble L3
        nop
L4:
        add %i0,-1,%i0
        cmp %i0,0
        bg L4
        fnegs %f0,%f0
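
For reference, the C source was along these lines (a reconstruction
to match FOO; the exact C I compiled is not quoted here). GCC turns
the tail call into the loop above and replaces the multiply by -1.0
with the sign flip (fnegs):

```c
/* Reconstructed C equivalent of the Lisp FOO above.  The result is
   always 1.0; the negated r is a dead value that keeps the loop
   body non-empty. */
float foo(int n, float r)
{
    if (n < 1)
        return 1.0f;
    return foo(n - 1, -1.0f * r);
}
```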

The corresponding output from Python (the CMU CL compiler) is:

      58: L0:   MOVE  %A0, %LRA              ; No-arg-parsing entry point
      5C:       SRA   %CFUNC, 2, %NL1        ; %CFUNC = N
      60:       ADD   %ZERO, 1, %NL2
      64:       CMP   %NL1, %NL2
      68:       BLT   L1
      6C:       NOP
      70:       SUB   4, %CFUNC              ; %CFUNC = N
      74:       LD    [%CODE+17], %A0        ; -1.0
      78:       LDF   [%A0+-3], %F2
      7C:       FMULS %F2, %F0               ; %F0 = R
      80:       MOVE  %LRA, %A0
      84:       B     L0
      88:       NOP

Now, the compiler failed to fill some branch delay slots, but I think
the main problems are:

 * the unboxing at address 5C is redundant
 * the floating-point constant -1.0s0 is reloaded on every
   iteration (note that GCC 2.0 even figured out that it
   could replace the multiplication with a negation)
 * the moves at addresses 58 and 80 seem redundant
 * the termination condition is checked at the top of the
   loop rather than at the bottom

None of these seem to be SPARC-specific optimizations, either.
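
The constant-load point can be seen in miniature (hypothetical helper
names): a multiply by the constant -1.0 is exactly a sign flip in
IEEE arithmetic, so no constant needs to be loaded in the loop at
all.

```c
/* The peephole transform GCC applied: multiplying by -1.0 is
   equivalent to negating, so the constant load and the FP multiply
   can both be replaced by a single sign-flip instruction. */
float negate_by_multiply(float r) { return -1.0f * r; }
float negate_directly(float r)    { return -r; }
```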

Anyway, I'm not complaining. I think CMU CL is a great implementation,
and I understand that you operate under constraints.

BTW, do you have any comments on why I had to declare types
explicitly in the parts of the code that I have marked "INFERENCEn"?
I think doing this kind of inference in the compiler (where legal)
would greatly simplify writing numerical code.

Also, what are your thoughts on how (or whether) I can efficiently
use polymorphic higher-order functions in numerical code? At least
for higher-order functions that are explicitly declared inline and
appear in the same compilation unit as their callers, I would hope
that any consing for return values could be avoided.
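
What I have in mind is the analogue of this C sketch (hypothetical
names): because both functions are visible at the call site, the
compiler can inline them and keep the intermediate float in a
register, which is the analogue of avoiding a consed float return.

```c
/* Rough C analogue of an inline higher-order function.  With both
   definitions visible, the compiler can inline apply_twice and
   halve into quarter, so no call overhead or boxed intermediate
   result remains. */
static inline float halve(float x) { return x * 0.5f; }

static inline float apply_twice(float (*f)(float), float x)
{
    return f(f(x));
}

float quarter(float x) { return apply_twice(halve, x); }
```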

					Thanks, Thomas.