
Re: functional programming



Discussing my bit-set problem, Paul made the following wild claims about
Yale LOOP:

    - LOOP is dangerous because programmers often try to "force" it onto
      inappropriate situations.

    - It is hard to tell what an appropriate use of LOOP is without first
      writing general recursive code and then recognizing that it is tail
      recursive.

    - Why not write general recursive code, letting the compiler optimize
      the tail recursions?

I'll deal with each of these in turn.  Let the term "syntactic recursion"
refer to any code written in the low-level recursive style that Paul
prefers, using T's LABELS, ITERATE, or explicitly defined recursive
functions.  All of my arguments will be based on clarity of code; a good
programmer does not prefer LOOP over syntactic recursion because of any
naive sense that it is more efficient.

    - It is hard to tell what an appropriate use of LOOP is without first
      writing general recursive code and then recognizing that it is tail
      recursive.

When a good programmer starts thinking about how to solve a problem, his
first solutions involve high-level operations on some "fuzzy" or "abstract"
data structure.  Gradually, he refines his fuzzy notions into terms closer
to the target programming language, until finally he is ready to write
the final code.  During this refinement process, he picks actual data
and control structures.  After he picks an implementation, it is almost
always (95%) immediately clear what control structure to use in
implementing the abstract operation, since data structures and control
structures go hand in hand (trees use true recursion, lists usually use
true iteration, etc.).

If at this point the programmer has a data structure but doesn't know
IMMEDIATELY whether he should use true recursion or iteration for an
operation, then he had better not start coding in any style.  Either his notion
of the abstract algorithm is too ill-defined, he has picked the wrong
data structure, or else he doesn't fully understand the data structure
he is attempting to use.

During the refinement from abstraction to implementation, the programming
language should help the programmer in this task by eliding detail and
expressing the abstract algorithm as directly as possible (while still
meeting any efficiency requirements).  Given two language constructs of
acceptable efficiency, one that expresses the abstract operation directly
and clearly and one that requires the programmer to supply lots of detail,
the former is clearly preferable.  This ideal has been the main driving
force behind research into programming languages.

The set, the collection, and the sequence are among the most common
algorithmic abstractions, if not the most common.  Other abstractions
such as trees and graphs build on these three, at both the abstract
and the implementation levels.  When designing a high level algorithm
using sets, collections, or sequences, the programmer uses operations
such as:

    For each X do...

    Find the first X satisfying some property...

    Return a subset satisfying some property...

    Perform a reduction over a set (e.g. add up (F X) for each X, or
        find the elements that maximize (F X) )...

    Join two sequences ("zip" them up)

    Find the first pair X,Y in the join of two sequences that satisfies some
        property...

    etc. etc. etc.

The above abstract operations are "naturally iterative", in that their
most natural definitions depend on enumeration of the objects in the set,
collection, or sequence.  These operations are often composed.
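As a rough modern illustration (not part of the original T discussion), these enumerative operations map directly onto one-line Python comprehensions and built-ins:

```python
seq = [3, 1, 4, 1, 5]

for_each = [x * 2 for x in seq]              # for each X do (F X)
first    = next(x for x in seq if x > 3)     # find the first X satisfying a property
subset   = [x for x in seq if x % 2 == 1]    # subset satisfying a property
total    = sum(seq)                          # reduction over the collection
zipped   = list(zip(seq, [1, 2, 3, 4, 5]))   # join ("zip") two sequences
```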

After the programmer has expressed his algorithm in such high level terms,
he must then decide on an implementation.  In the case of sets,
collections, and sequences, he picks from a number of well known data
structures and algorithms:  lists, vectors, strings, bit-vectors,
union-find trees, priority queues, etc.  The most common implementations
are lists, vectors, and strings, and the most common control structures
are iterative.

    - LOOP is dangerous because programmers often try to "force" it onto
      inappropriate situations.

To emphasize, if the programmer reaches this point in the refinement
process, having picked a list, vector, string, or bit-set data structure,
and he doesn't know whether or not his control structure will be iterative
or truly recursive, then he is in BIG TROUBLE.  To recommend that he start
writing recursive code, or any code, is wrong:  He has basic confusions
about his algorithm or a fundamental gap in his knowledge, and no
programming language will help him.  If he starts programming in such
a state, he will only produce garbage.

So now we have established that:

    - The most common abstractions are sets, collections, and sequences,
      and the operations on these are naturally enumerative (though not
      necessarily implemented that way).

    - The most common implementations of these abstractions are singly
      and doubly linked lists and vectors.

    - The most common control structures on these data structures are
      iterative.

A good programming language should make it easy for the programmer to
use these abstractions and their implementations, hiding as much of the
detail as possible without sacrificing either clarity or the efficiency
required for the particular application.


So now we're back to the original issue:

    - Why not write general recursive code, letting the compiler optimize
      the tail recursions?

For dealing with the most common implementations of sets, collections,
and sequences, LOOP is vastly superior to syntactic recursion.  I'll explain
why by showing the numerous defects of syntactic recursion when used for
iteration and how LOOP corrects those defects.  An example follows.

* Syntactic recursion requires the programmer to specify too many
implementation details of the data structure, whereas LOOP hides many
of those details.  In the simplest example of iterating through a sequence
implemented as a list, the programmer must explicitly write CAR and CDR,
whereas LOOP provides information-hiding clauses:

    (for x in l)
    (for-each-stat s)
    (for-each-bit-set-element s x)
    (incr i from 1 to n)

LOOP lets the programmer define his own iteration clauses so he can hide
implementation, but syntactic recursion forces clients to use gory
implementation detail.  Information hiding (modularity) has long been
a basic tenet of good programming practice.

* Syntactic recursion separates the initialization, the termination
conditions, and the induction steps so that they are often spread out
across a page or more of code.  If there are any side effects (e.g. to
a global data structure), that code is often mixed in with the iteration
code, and it can be hard to quickly distinguish the two kinds.  But LOOP
clusters the iteration steps of initialization, termination, and induction
in one concise clause, at the top of the loop; side effects are usually
clustered separately in the DO clause.

* Syntactic recursion uses the hated positional notation for induction
variables.  When there is more than one induction variable (and there
usually is), positional notation is impossible to read.  Ever look at
a large recursion and find yourself constantly skipping back and forth
between the recursive calls, the initial call, and the parameters just
to find out the cognitive purpose of an induction variable?  LOOP avoids
the need for positional notation by its clustering of clauses that package
all the information about an induction variable in one concise line.

* When there are multiple continuation conditions (i.e. continue iterating
if any of these conditions are true) or subset conditions, syntactic
recursion often encourages the programmer to implement this as multiple
tail-recursive calls, one for each continuation or subset condition.  This
not only adds unnecessary clutter but requires the repetition of
implementation detail.

* Syntactic recursion produces unnecessarily deep levels of syntactic
nesting of LETs and IFs, even though the iteration as expressed abstractly
has only one level of nesting ("for each X do some piece of code").  LOOP
produces code whose nesting corresponds closely to the abstract nesting.
I'll have to reserve most of my lecture on indentation for another day,
but the example I'll present later will illustrate this point.  Suffice
it to say that "river" indenting makes for unreadable code because there
is no direct correlation between indenting and abstraction.

* Even in the most common situations syntactic recursion forces the
programmer to introduce variables to hold the intermediate results (such
as an output list).  LOOP hides the details with its various constructive
clauses such as SAVE, APPEND, SPLICE, and REDUCE.

* Syntactic recursion makes it hard to separate induction variables used
for stepping through a sequence or collection from those used for holding
results (another bad feature of positional notation); disciplined
programmers must rely on convention alone (output variables come after
the iterative variables).  Through its various clauses LOOP explicitly
indicates which variables are iterating through sequences or collections,
which variables are mappings of the elements, and which are reductions.

* Syntactic recursion does not compose well.  For example, suppose that
you have a large iteration over the elements of a list, and you now want
to modify it to look at only those elements of a list that meet a certain
property that requires the position of the element in the list.  That
is, you are composing enumeration of list elements with filtering and
with a join of the sequence 1, 2, 3, etc.  To do this composition, you
now have to add in another induction variable and wrap the body in
an IF, and because of all the bad properties mentioned above, the new
code doesn't look very similar to the old code, especially because of
the new levels of indentation.

But the different components of iteration compose very well using LOOP,
and the composition does not cause more levels of nesting:

    (loop (for x in l)
          ... )

gets transformed to:

    (loop (for x in l)
          (incr i from 1)
          (when (p? x i) )
          ... )

Programmer-defined iteration clauses for LOOP compose just as easily as
the standard ones:

    (loop (for x in l)
          (for-each-stat-set-element set s)
          (when (p? x s) )
          ... )

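For comparison only, the same composition (enumerate a list, join it with the sequence 1, 2, 3, ..., filter on a property of the element and its position) can be sketched in Python, where a comprehension composes the pieces without extra nesting; `p` is a hypothetical stand-in for the predicate P?:

```python
def filtered_elements(l, p):
    # enumerate(l, 1) joins l with the sequence 1, 2, 3, ...;
    # the filter p(x, i) then composes in the same clause-like way as
    # (loop (for x in l) (incr i from 1) (when (p? x i)) ...)
    return [x for i, x in enumerate(l, 1) if p(x, i)]
```

For example, `filtered_elements(['a', 'b', 'c', 'd'], lambda x, i: i % 2 == 1)` keeps the elements at odd positions.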

* Syntactic recursion does not emphasize the distinction between the
abstract enumerative operations and abstract true recursion.  When looking
at syntactically recursive code, it is not always clear whether the
abstraction is truly recursive (e.g. trees and DAGs) or truly enumerative
(e.g. simple sets, collections, and sequences).  This violates the
principle that the programming language should clearly express the
abstractions.

LOOP, on the other hand, directly expresses the most common operations
on collections and sequences.  There is no chance of ever confusing LOOP
with true recursion.

---------

Finally, I'll give one T example of how syntactic recursion loses badly.
The example is representative of typical Lisp problems, and I'm sure
I could quickly find grosser examples by looking at everyone's T code.

Suppose I have a collection represented as a list, and I want to find
the mapping of a verbose function applied to some subset of the elements
specified by some other verbose conditions.  Using syntactic recursion
I often see code from expert programmers that looks like (forgive my T):

    (iterate loop ((rest l) (result ()))
        (if (null? rest)
            (reverse! result)
            (let ((x (car rest)))
                (if (and (< x 3)
                         (p? x))
                    (let ((intermediate-name -verbose-code-using-x-))
                        (if (q? -verbose-code-using-intermediate-name-)
                            (let ((another-name -another-verbose-expression-))
                                (loop (cdr rest) (cons -another-name-expression-
                                                       result)))
                            (loop (cdr rest) result)))
                    (loop (cdr rest) result)))))

For efficiency, the simpler conditions (< X 3) and (P? X) are tested
first, and only if they are true are the more complicated subparts of
the subset specification tested.  (One need only remember the bit-set
problem solutions submitted to come up with more such gross examples.)

This example exhibits almost all the bad features of syntactic recursion
described above:  random indenting unrelated to the abstraction, multiple
tail-recursive calls that aren't identical, positional notation, problems
with distinguishing termination conditions from subset conditions, etc.
etc.

Here is the LOOP version:

    (loop (for x in l)
          (when (&& (< x 3)
                    (p? x) ) )
          (bind intermediate-name -verbose-code-using-x-)
          (when (q? -verbose-code-using-intermediate-name-) )
          (bind another-name -another-verbose-expression-)
          (save -another-name-expression-) )

Notice that it is easy to immediately determine all the essential
characteristics of this iteration:  It enumerates through a list, finding
elements meeting a certain sequential series of tests, producing a list
of a function applied to those elements.  The loop has a single level
of indentation to match that of the abstraction.  Implementation details
are hidden.  Etc. etc.
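For comparison only, here is a Python sketch of the same sequential filter/bind/filter/map shape; p, f, q, and g are hypothetical stand-ins for the verbose tests and expressions elided in the example:

```python
def mapped_subset(l, p, f, q, g):
    # One level of nesting, like the LOOP version: each failed test
    # simply continues to the next element instead of nesting an IF.
    out = []
    for x in l:
        if not (x < 3 and p(x)):
            continue
        intermediate = f(x)          # the first verbose expression
        if not q(intermediate):
            continue
        out.append(g(intermediate))  # the final mapped value
    return out
```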

Now suppose that you want to modify this iteration, say by joining two
sequences (iterating through two sequences in parallel instead of one).
Look at all the places I have to change in the ITERATE version -- good
luck at finding them all, getting the positional notation right,
duplicating the induction steps, etc.  But to modify LOOP, one just inserts
a single, concise line without touching the rest of the code:

    (for y in l1)

Finally, consider an even worse scenario (one that I happen to deal with
every day).  Suppose you have a universe of objects on which you perform
set operations; a frequent operation is to enumerate through the elements
of a set performing various side effects, mappings, or reductions.  With
LOOP, you need to define only one simple clause to get all the
compositional powers of LOOP:

    (loop (for-each-stat-set-element set stat)
          (incr i from 1)
          (when (stat:predicate? stat) )
          ... )

I shudder in pure terror to think what I'd have to do without LOOP and
that clause FOR-EACH-STAT-SET-ELEMENT.  In the implementation, each STAT
is tagged with a unique number, and sets of STATs are implemented by
bit-sets (lists of fixnums); there is a vector that maps unique numbers
onto STATs.  So enumeration through a STAT-SET:

          (for-each-stat-set-element set stat)

is implemented as:

          (for-each-bit-set-element set temp)
          (bind stat (number:stat temp) )

The hard part is implementing the clause FOR-EACH-BIT-SET-ELEMENT: It
enumerates through the fixnums of the list, for each fixnum using
FIND-FIRST-ONE and LOGAND to find the next one-bit in the fixnum and then
mask it out.  The implementation looks like:

((INITIAL G0319 SET G0320 0 I 0 G0321 -32 G0322 32)
 (WHILE (LOOP (DO (IF (== 32 G0322)
                      (THEN (IF (! G0319) (RETURN NIL))
                            (POP G0319 G0320)
                            (:= G0321 (+ G0321 32))))
                  (:= G0322 (JFFO32 G0320))
                  (IF (!== 32 G0322)
                      (THEN (:= G0320 (BOOLE 4 G0320 (LSH32 1 (- 31 G0322))))
                            (:= I (+ G0321 G0322))
                            (RETURN T)))))))

Look at all the implementation detail being hidden under such a clean
abstraction!  It is clearly something you wouldn't want to write using
low-level syntactic recursion.  (If you're interested, ELISP compiles
this quite nicely in less than 30 instructions, and it runs very fast
to boot.)
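The mask-out enumeration can be sketched in Python (a modern translation of the idea, not the ELISP; note that JFFO scans from the high bit on the PDP-10, while this sketch takes the low bit first, so the order within each fixnum differs):

```python
def bit_set_elements(words, word_size=32):
    """Enumerate element numbers of a bit-set stored as a list of
    word-size integers: find a one-bit, yield its position, mask it out."""
    base = 0
    for word in words:
        while word:
            low = word & -word                 # isolate the lowest one-bit
            yield base + low.bit_length() - 1  # its element number in the set
            word &= ~low                       # mask the bit out
        base += word_size
```

For example, `list(bit_set_elements([0b1010, 0b1]))` yields the element numbers 1, 3, and 32.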

The only relatively efficient alternative the syntactic recursion people
have to offer is to write a "walk" procedure to which I hand some procedure
that gets invoked on every element of the set.  This is fine for simple
enumerations but falls down the minute I need to compose set enumeration
with other sequence operations (e.g. perform a mapping on a subset defined
by a certain property, count the number of elements, enumerate through
a set and a list in parallel, find the first element satisfying a property
and terminate the loop, etc.).  You end up constructing a family
of walk functions for each new datatype, none of which does quite what
you want and none of which composes (and of course, you'd better watch
out to make sure no closures are being consed for your walk procedures!).
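The composition failure can be sketched in Python: a walk procedure is an internal iterator (the walk drives the loop), while a generator is an external one (the caller drives it), so only the latter composes with parallel enumeration or early termination; the names here are illustrative:

```python
def walk_set(elements, f):
    # Internal iteration: walk_set owns the loop, so the caller cannot
    # step this set and another sequence in parallel or stop early cleanly.
    for e in elements:
        f(e)

def set_elements(elements):
    # External iteration: the caller owns the loop, so zip, filters,
    # and early exit all compose.
    yield from elements

# Stepping a set and a list in parallel is trivial with the generator:
pairs = list(zip(set_elements([1, 2, 3]), "abc"))
```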

------

LOOP is also good at writing Knuth-style and Dijkstra-style loops, superior
to syntactic recursion mainly because of the odiousness of positional
notation.  But I must admit, I don't have occasion to write such loops
often; usually, they are very clever enumerations of sequences (e.g.
munging the node of a B-tree with a Knuth-loop).

Though it expresses operations on collections and sequences more clearly
than syntactic recursion, LOOP is obviously not as expressive as a language
like SETL or the very high-level sequence package that whats-his-name
is constructing for Lisp.  On the other hand, LOOP doesn't require a fancy,
complicated, expensive (and buggy) optimizer, and the programmer can
confidently write efficient iterations with a good idea of their
implementation cost.

LOOP is a better engineering tool.  So there.
-------