assembly routine call sequence.
Currently, the biggest problem with the PMAX core is its size. Rob and I
believe that a significant amount of the bloat is due to the fact that the
assembly routine interface and the static-function interface need much more
code than the miscop interface (their counterpart in the old system). This
message outlines some ideas I had to make the assembly routine interface
more terse. I'm sending this to the whole group so I don't have to figure
out who would be interested, even though Rob, Chris, and Bill are probably
the only people with enough background to follow it.
I'm not exactly sure what the call sequence for miscops is, but the current
assembly routine call sequence looks like:
    compute-lra-from-code
    lui     temp, <upper half of assembly routine address>
    ori     temp, <lower half of assembly routine address>
    jr      temp
    <padding>
    lra header word
That is either 20 or 24 bytes, depending on whether the padding is needed.
I see two independent ways to improve on this:
The first improvement is to use the jump-immediate instruction (j) instead
of the jump-register (jr). With the jump-immediate, you can specify 26 bits
of address in the instruction (the low two bits are implicitly zero, and the
high four bits are copied from the current pc). To do this, we would need a
set of special allocation regions for code objects to guarantee that they
all have the same four high bits as the assembly routines. This would
involve minor changes to GC, purify, genesis, and the code object
allocator. Making all these changes would probably take less than a day,
but recompiling everything would take another two days.
The second improvement is to get rid of all the LRA stuff. If we use the
native call instruction (jalr or jal), then we can save the 8 or 12 bytes
the LRA uses up at the call site. Given that the current code object is
already in a register, we can just put the return address in the
lisp-interior-pointer (LIP) register, and GC will fix it correctly. To
return, we just jr to the return address. If the assembly routine needs to
use the LIP for something else, it can subtract the code object from the
return address, use the LIP, and then add the code object back.
This also causes problems for backtrace. It's easy to tell when an
assembly routine was interrupted, but it would be impossible to determine
what called the assembly routine without a standard location for the return
pc.
In addition to being shorter, the RT miscop sequence is implicitly atomic.
On the PMAX, we use a pseudo-atomic flag to mark code that must not be
interrupted. The code sequence for this is:
    clear pending flag
    set atomic flag
    <do atomic stuff>
    clear atomic flag
    test pending flag
    branch if clear to foo
    breakpoint instruction
foo:
The pseudo-atomic stuff is used primarily for allocation. Obviously,
allocation has to be fast, but all that flag twiddling is just as expensive
as calling an assembly routine. Therefore, if we make assembly routines
implicitly atomic, out-of-line allocation will be just as fast as inline
allocation, but will take up much less space in the resulting code.
Unfortunately, making assembly routines implicitly atomic poses reliability
problems. The interrupt handler has to be able to tell when an assembly
routine was interrupted, and it has to arrange for the interrupt to be
handled later. The handler can easily check whether the PC is inside a
range of values, but this requires that all the assembly routines sit in
their own allocation space.
Arranging for the interrupt to be handled later is much harder. We can
either set some flag (in a register or memory location) or change the
return address. Setting a flag has the advantage that it doesn't rely on
the return address being in any standard location, but has the disadvantage
that it must be checked on every return. Changing the return address
requires that the return address be in a standard location, but needs no
explicit test. Changing the return address will therefore result in faster
code for the normal case (since very few interrupts are actually delivered).
I don't like the number of restrictions this puts on assembler routines.
It would be really handy if we had some data structure similar to
debug-info for assembler routines. Each entry could contain the start
address, length, and an interruptible-p flag. The entries could be linked
together, and the C interrupt handler could easily scan this list and
figure out what to do. Actually, instead of having one entry per assembly
routine, we could just have a linked list of uninterruptible regions.
Comments, suggestions, other ideas?
-William