8. An Analysis of Daisy

This chapter provides an analysis of Daisy programs in terms of parallelism and efficiency under suspending construction. A brief introduction to the language was given in section 2.3; for a full description of the language, see the Daisy Programming Manual [Joh89b].

       
8.1 Demand Driven Computation

It is instructive to understand the operation of the output-driven pipeline that is the conduit of demand in Daisy programs. Most Daisy programs are pipelined compositions of streams, terminated at either end by input and output devices. A good example of this is the main interpretation pipeline built by the bootstrap process, which is roughly analogous to the init process in Unix. The interpretation pipeline corresponds to the following Daisy program:
        screen: scnos: prsos: evlst: prsis: scnis: console: "&"
Application is denoted by infix colon (:), and associates to the right. This program builds a stream pipeline as shown in figure 25.
  
Figure 25: The Daisy Interpretation Pipeline

console returns a stream of characters from the keyboard, interleaving lines of input with ampersand prompts. This stream is fed to a scanner (scnis), which converts it to a stream of tokens. The tokens are passed through a parser (prsis), which creates a stream of expressions. The expressions are read by a mapped evaluator (evlst), which interprets them and returns a stream of results. The results are piped through a deparser (prsos) and then a de-scanner (scnos) before finally being consumed by the terminal output driver (screen).

Demand emanating from the printer (screen) ripples down through the pipeline to the input device, and data flows in the opposite direction. In the middle (evlst), programs are evaluated. Programs will typically create their own network of streams, which plug in transparently to this top-level stream pipeline. The demand rippling down through the output pipeline is what drives program computation.
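
The transparency of this plumbing is easy to see with a circular stream definition (a sketch in the notation of figure 26):

        ones = [1 ! ones]

Evaluating ones at the console produces an unending stream of ones: evlst returns the infinite stream, and it is the printer's demand, rippling back through scnos and prsos, that pulls elements out of it one at a time.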

   
8.2 Limitations of Suspending Construction

Suspending construction offers many possibilities for implicit parallelism, as described in section 2.3. However, the model also has deficiencies that undermine its effective parallelism. We explore some of these here, and suggest solutions.

8.3 Excessive Laziness

While we have touted the virtues of laziness, it can also have the unintended effect of squelching parallelism. This is well illustrated using a naive version of the quicksort program given in chapter 2, the code for which is shown in figure 26 (this example is adapted from [OK92]).
  
Figure 26: The Quicksort Algorithm in Daisy
quick = \L.
    let:[[X ! Xs] L
      if:[nil?:L  L
          nil?:Xs L
            let:[[lo hi] partition:[X Xs]
                 append:[quick:lo [X ! quick:hi]]
                ]
         ]]

partition = \[p L]. part:[p L [] []]
part = \[p L lo hi].
  let:[[X ! Xs] L
    if:[nil?:L      [lo hi]
        le?:[X p]  part:[p Xs [X ! lo] hi]
                   part:[p Xs lo [X ! hi]]
       ]]

append = \[L1 L2].
  let:[[X ! Xs] L1
       if:[nil?:L1 L2
           [X ! append:[Xs L2]] ]
      ]

quick operates in the usual way, partitioning its argument and then calling two recursive quicksorts on the result. This kind of divide-and-conquer parallelism is ideal for parallel applicative processing.

Consider the application of quick to a list of numbers. The result of quick:[...] eventually reduces to a sequence of nested, suspended appends. This can be visualized as:

        append:[append:[append:[ ...
(this represents what a string reduction model might actually do). Under normal output-driven evaluation, the printer, attempting to coerce the head of this structure, causes the inner suspended appends to unfold until the innermost append performs a suspending cons (the cons in the last line of append in figure 26). This satisfies the next outer append's demand for X, so it also performs a lazy cons and converges. In this manner, computation ripples back up to the topmost append as the dependence chain is satisfied. The printer then issues the first number of the result and iterates, probing the second element, which repeats the entire scenario just described.

There are two problems with this behavior. First, append is lazy in its second argument, which means the suspended recursive call quick:hi will not be coerced until the printer has output the entire result of quick:lo. Thus, no parallelism has been achieved. Second, the lazy cons in append results in N complete traversals of the suspended computation graph to output N elements. Ideally, if we know that the entire structure is needed, we would like to propagate exhaustive demand [OK92], so that when demand ripples down through the inner appends, their results are completely coerced before converging. This would result in a single traversal of the computation graph, and thus much less context switching. Neither of these problems is due to the kernel's default process management: given a scheduling request, the kernel will execute that suspension, and any suspensions it coerces, until it converges. Rather, two problems at the language level are responsible:

  1. the serialized demand of the printer, and

  2. the laziness of suspending construction, which "cuts off" eagerness at each suspension creation.

Suppose we were to address the first problem by using a concurrent primitive at top level instead of the serial printer. For example, suppose we have a primitive pcoerce (see chapter 6) that schedules its argument suspensions in parallel and returns after they have all converged. We could apply it to our top-level quicksort like so:
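
        pcoerce: quick:[8 3 5 1 9 2]

(The argument list is an arbitrary example, and the form of the call is only a sketch; the essential point is that pcoerce, rather than the serial printer, coerces the result of quick.)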

This would cause the suspension for quick:hi to be coerced as well as quick:lo. This example works because our result is not bushy; if it were, pcoerce would likely coerce only the top-level elements down to their first suspended cons, at which point they would have satisfied the top-level convergence requirements of pcoerce. This is the problem referred to by item 2 above; the suspending cons "cuts off" even our explicit attempt to be parallel.

8.3.1 Bounded Eagerness

We might consider addressing the excessive laziness problem in the kernel through the demand coefficient mechanism, having the system speculate on suspensions as they are created. Demand coefficients would apply not only to probes but to allocation as well; i.e., when a cell containing a suspension is created, the system automatically siphons off some of the running process's demand coefficient into the suspension and schedules it immediately. This could be implemented by modifying the new instruction to trap after the allocation if one of its source registers contains a suspension. This would implement a form of bounded eagerness.

The problem with this approach is that it is naive system-level speculative parallelism (see section 2.3). There is not enough useful context information for intelligent scheduling choices. Does the entire coefficient of the running process get transferred to the first suspension created? If only part of it, how much, and to which created suspensions? The former policy would essentially default the system to an inefficient form of strictness: every suspension would be scheduled immediately after creation, and the creating process would be suspended. The latter policy results in massively speculative parallelism: all suspensions are scheduled with a little bit of energy. This is far too general, and undermines the purpose of laziness. Suppose the demand is for a particular element (say the 10th) of the result of our quicksort. Under this "shotgun" approach we expend effort across the entire structure even though the exhaustive demand in this case is "lateral" and then "down".
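
To make the 10th-element case concrete, such a probe might be written as follows (a sketch only: nth is not among the primitives shown in this chapter, and the sub arithmetic primitive is assumed, in the style of le? from figure 26):

        nth = \[N L].
          let:[[X ! Xs] L
               if:[le?:[N 1] X
                   nth:[sub:[N 1] Xs] ]]

        nth:[10 quick:[...]]

The demand here is lateral across the first nine conses of the spine and then down into a single element; shotgun speculation at every suspension creation expends effort on the other elements that does nothing to serve this demand.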

The limitation of the demand coefficient approach is that while it can indicate the relative overall demand for a suspension's result, it does not provide specific information about how to distribute that demand through probes or suspension creations. If the demand coefficient were not just a number but a bushy list structure containing numbers, it might be a start toward conveying the demand information we need; however, such a structure would be expensive to update externally and would require more effort for the kernel to interpret. For these reasons it seems preferable to avoid general system-level speculation on new suspensions and relegate scheduling decisions to the language level (except for demand-driven scheduling).

8.3.2 Granularity Revisited

The fine granularity of process decomposition under suspending construction is also a potential problem. Suspending construction suspends every computation embedded in a list, no matter how trivial. Under our allocation scheme, trivial suspensions are just as likely to be allocated remotely as suspensions representing significant computations, which are the ones we are interested in offloading to other processors. We would like trivial suspensions to be eliminated, or at least allocated locally, and suspensions representing important parallelism to be allocated remotely, so that parallelism migrates properly.
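
The problem is easy to see in a transducer that increments every element of a list (a sketch; the add primitive is assumed, in the style of le? from figure 26):

        incs = \L.
          let:[[X ! Xs] L
               if:[nil?:L L
                   [add:[X 1] ! incs:Xs] ]]

Each iteration suspends both add:[X 1], a trivial computation, and incs:Xs, which carries all of the remaining work; under our allocation scheme the two are equally likely to be placed on a remote processor.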

   
8.3.3 Strictness Analysis

One way to address these problems at the language level is through the use of strictness analysis [Hal87,HW87]. The information from strictness analysis can be used to eliminate many "trivial" suspensions that do not need to be created. Unfortunately, strictness analysis does not discriminate between trivial suspensions and "important" suspensions that we want to preserve for parallelism's sake.

Strictness analysis can also indicate instances where the system can be safely eager, but this bypasses some parallel opportunities where the system is being properly lazy. For example, strictness analysis does not increase parallelism in our quicksort program above, since append is not strict in its second argument. The language, being properly lazy, will suspend the recursive quicksort calls; the system, being properly demand-driven, will not schedule them until they are probed.

   
8.4 Program Annotations

We have discovered some limitations of naive suspending construction. We can either settle for what the system provides, or annotate our programs to give the system clues about how to handle these situations.

Consider our example; suppose we annotate the recursive calls to quicksort like so:

        ...
            let:[[lo hi] @partition:[X Xs]
                 append:[quick:lo [X ! @quick:hi]]
                ]
        ...
Here, the @ annotation could suggest suspensions that should be allocated remotely, thus ensuring that important computations are offloaded. This kind of annotation might be most effective if coupled with a policy of allocating all non-annotated suspensions locally.

We might address the second problem, inefficient demand-driven computation, with an annotation as well. Suppose we used a definition of append along the following lines (a sketch of the annotated code):
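
append = \[L1 L2].
  let:[[X ! Xs] L1
       if:[nil?:L1 L2
           [X ! $append:[Xs L2]] ]
      ]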

Here, the $ annotation would force the recursive call to append to be performed before the cons is returned, while still leaving append non-strict in its second argument.

Thus, it seems that we would like both an "important parallelism" annotation and a demand or strictness annotation. With this in mind, I suggest three kinds of annotations:

  1. a parallel annotation (e.g. @), suggesting suspensions that embody important parallelism and should be allocated remotely;

  2. an eager annotation (e.g. $), directing that a suspension be scheduled immediately and coerced before the enclosing expression converges; and

  3. a strictness annotation, directing that no suspension be created at all, the expression being evaluated in-line.

The use of annotations represents a step back from the goal of completely transparent resource management. However, the eager and parallel annotations we describe here can be considered guidelines, which the system is free to disregard. Strictness annotations cannot be disregarded, but they can be viewed as a programmer or compiler optimization. The annotations we are suggesting are not overly intrusive, and deserve consideration as a viable approach to the problems presented in this chapter.

