Next: Bibliography Up: An Architecture for Parallel Previous: 8. An Analysis of

Subsections

9. Conclusion

Section 9.1 outlines the main results and contributions of the work described in this dissertation. Section 9.2 describes possibilities for future research and development efforts stemming from this work, and section 9.3 examines other symbolic language implementations most relevant to my work.

9.1 Summary of Results

This dissertation discusses the design of a list processing engine for implementing parallel symbolic languages on stock multiprocessors. The design results from experience implementing an applicative language on the BBN Butterfly multiprocessor. The language and engine are based on suspending construction, a fine-grained, concurrent, non-strict computation model for Lisp-type languages.

The contributions of the dissertation are:

A virtual machine architecture for fine-grained parallel list multiprocessing. This virtual machine design identifies key areas where hardware support would accelerate the execution of programs using computational models similar to suspending construction, such as Lisp with futures.
A microkernel providing memory management, process management and device management for high level parallel symbolic programs. The resource management techniques described represent new approaches to these problems for many parallel symbolic languages.
An analysis of suspending construction in Daisy, a language implemented on this architecture and microkernel.

9.1.1 Implementation Notes

The virtual machine, kernel and language described in this thesis (DSI V4.1) has been implemented in slightly simpler forms (DSI V2.x and V3.x) on the BBN Butterfly multiprocessor. The notable differences between that implementation and the design described herein can be summarized as follows:

The Butterfly virtual machine has two identical parallel sets of registers: one for the kernel and one for regular processes; this was later generalized into the context window model.
Signals are used, but in a more limited way.
The memory management is essentially the same as described here, but uses a slightly different representation for allocation vectors.
Process management is limited to conservative parallelism based on annotations in the user program; demand coefficients and speculative computation are not supported.
Each processor maintains a separate, single circular queue for incoming scheduling and allocation requests. Scheduling requests consist of distributed dependence stacks, as described here.

This implementation served as a prototype for the current design.

The current system as described in this thesis (DSI V4.1) has been implemented sequentially. It includes a virtual machine assembler [Jes] implemented in Yacc/C, a host interface library (in C) and a set of DSI assembly modules comprising the DSI kernel and Daisy interpreter. A Daisy compiler, written in Daisy, translates Daisy code to DSI code.

I plan to fold the improvements described in this report back into the Butterfly implementation. Another possible target for a near-term port is a Silicon Graphics Challenge multiprocessor.

9.2 Future Work

Specific experiments aimed at analyzing the characteristics and performance of DSI's resource management algorithms are a top priority. These experiments would target:

Load-balancing effectiveness and performance compared to other systems. I am particularly interested in comparing my allocation-based system with the task-stealing mechanism used in Multilisp [RHH85] and its offshoots.
An analysis of the effects of allocation block size on parallelism and load balancing.
Our local task execution model should be analytically compared to systems which use task migration strategies or non-local process execution. The basic question to be answered is whether locality considerations, such as the ones used in DSI, are beneficial enough to justify their use with such a small process granularity.
The effectiveness of our priority scheduling structure on constraining parallelism and excessive speculation.

These are just a few of the immediate possibilities for analytical comparison and experimentation.

Some crude instrumentation is provided in the Butterfly implementation for event logging using the BBN gist utility. This has provided some limited preliminary quantitative data. Better instrumentation is needed for more accurate performance and analytic measurements for the experiments described above. Support for host-specific tools (e.g. gist) is useful, but a portable, cross-platform mechanism would be better.

9.2.1 Incremental Enhancements

In addition to the experimentation outlined above, there are a number of incremental enhancements that would improve DSI's utility and capability for supporting surface languages.

Daisy has a similar design philosophy as Scheme: a simple, but expressive core language, without a lot of bells and whistles. The suggestions for improvements in this section reflect that spirit and are meant to address limitations based on experience using Daisy and not to match feature for feature in other languages. Solving these problems would also address some of the issues faced in implementing other languages on DSI (see section 9.2.2).

9.2.1.1 Storage Management

There are two main limitations of DSI's storage management system that should be addressed. First, the current garbage collector only compacts two different sized objects: cells and suspensions. This rules out dynamically allocated arrays (static arrays are allowed) which are standard in modern applicative languages; their exclusion is a noticeable omission in Daisy. This could be rectified by implementing a more sophisticated compaction scheme.

A second limitation I would like to address is the system-wide pauses caused by garbage collection. System-wide garbage collection pauses become unacceptable as heap sizes increase, making interactive response time suffer and may cause data overruns in a device handler if it occurs during pending I/O and the host cannot provide sufficient buffering. Our system has the advantage of parallel collection in all phases, so that collection pauses increase according to the heap size divided by the total number of processors. Nevertheless, this still allows for significant pauses. I would like to incorporate an incremental garbage collector [AAL88,Moo84,Kie86] to avoid the latency penalties incurred by the current collector. The incremental collector should preserve locality of data as in the current scheme and not migrate data or processes to other processor nodes.

Generational collectors are becoming commonplace in many strict applicative systems, but are likely to be ineffective for a lazy language. Generational collectors are designed around the principle that there are few references from older generations to newer generations; stores into old generations have to be monitored so that the references can be updated when newer generations are collected. This is generally true of strict languages, but lazy languages have many updates of old structures pointing to new structures. Indeed, suspending construction is predicated on this fact. Therefore, generational collection may not be a viable method for incremental collection on DSI.

9.2.1.2 Global Namespaces

Lexical scoping solves most identifier conflict issues, but does not solve the problem of conflicting global identifiers. While the current implementation is satisfactory for smaller single user sessions, a larger layered software architecture or multiuser environment will require the shared global namespace issue (see chapter 4) to be resolved. Some Lisp-based systems address this issue through per-task storage based on fluid-binding [Mil87], first class environments, or extending deep binding to top-level environments [DAKM89]. A flexible module or package system for Daisy would address this problem in a more general way that might have benefits for other languages implemented on DSI.

9.2.1.3 User-Defined Exceptions

Daisy's treatment of errors is woefully inadequate for serious programming. Errors are handled by propagating errons (error values) back up the computation tree to top level, which essentially provides the equivalent of a stack dump. Explicitly testing for errors is not a viable solution. In addition to introducing overhead, it is a disincentive to programmers to consistently check for errors in every conceivable spot in the code where they could occur. For this reason, many languages provide configurable exception handling. This often takes the form of user-specified procedures or continuations for exception conditions, like errors. These kind of control mechanisms are not a viable solution for Daisy, given its applicative model.

DSI does provide a low-level exception mechanism (signals) that is compatible with an applicative solution. Signals establish an asynchronous event dependence between two processes without imposing state or control changes within those processes. A signal occurring in one process simply causes another process to be activated, presumably to handle that exception. An exception mechanism based on specifying exception dependences between expressions might fit naturally into this model.

The major obstacle to implementing this in DSI is that signals are a component of the architecture, not processes; i.e. signals map between context windows (running processes), not suspensions. An example of how this is adapted to arbitrary processes is given by the use of signals in supporting demand-driven scheduling (see chapter 4 and 6). It might be possible to further generalize the mechanisms used for this purpose to support user-defined signals. Ideally, exception support could be introduced into the language without requiring additional state fields to be added to suspensions. However, some kernel support is probably necessary, so this can also be considered as an enhancement to the DSI system.

9.2.2 Implementing Other Languages

In addition to enhancing Daisy, I would like to implement additional languages on top of DSI. This would validate the usefulness of DSI as a symbolic language implementation platform and indicate what further enhancements are necessary in the virtual machine and kernel to support compatible models of computation, such as futures. I don't expect DSI to become a completely general-purpose language implementation platform; by nature that leads to inefficiencies for certain classes of languages. Rather, DSI will remain optimized for fine-grained symbolic languages. Support for richer data types will be determined by progress in enhancing storage management (see remarks in section 9.2.1 above).

9.2.2.1 Scheme

I am particularly interested in implementing a strict, parallel Scheme-dialect on DSI, primarily for comparison of DSI's resource management algorithms with those of Multilisp-type languages. DSI's resource management algorithms (especially load-sensitive allocation) seem well-suited for an eagerly parallel language. Scheme is similar internally to Daisy, but semantically quite different (applicative order, eager, explicit parallelism) and would thus also be a good starting point for language comparisons. The dialect would support future and delay constructs that would be implemented with suspensions, the difference being that futures are scheduled immediately after creation.

Some potential issues to be resolved in this proposed language are whether the language should provide side-effects and what sort of I/O interfaces should be provided. If side-effects are included, some general-purpose visible synchronization constructs will need to be provided in the language. Also, the interaction of continuations across process boundaries is troublesome [KW90]. If side-effects are not provided, the language is free to be parallel wherever possible in its primitives (including speculatively) just like Daisy. As for I/O, the language could easily provide stream I/O based on the kernel's device interface, but standard Scheme also provides imperative I/O; providing a proper superset of the Scheme standard [CJR91] might be beneficial.

9.2.2.2 Communicating Sequential Processes

Another language that would be particularly well suited to DSI's task and communication model would be a CSP-like language. A CSP language would be naturally implemented in DSI using streams for communication and signals for synchronization between processes. This kind of language would allow the safe integration of imperative programming with applicative programming, by encapsulating side-effects within the scope of a suspension, which communicates with other processes through a port-type interface to obtain its values. The integration of the two styles would be very similar to that described for Rediflow [RMKT84].

The major hurdle for implementing this type of language on DSI is the same one described above for providing exception handling in Daisy; namely, generalizing the signal mechanism from the architecture to arbitrary suspensions and user-defined signals.

9.3 Related work

In this section I compare several parallel symbolic language implementations to the Daisy/DSI work described in this thesis. The number of such projects is fairly large and a thorough comparison of all of them is outside the scope of this report. My comparisons are limited to those parallel Lisp dialects and "general-purpose" parallel symbolic kernels that are most similar to the Daisy/DSI project in terms of the characteristics listed in section 1.2 (p.

). DSI is a list-processing engine, and the architecture and kernel are oriented toward Lisp-like languages; parallel Lisp systems therefore invite the most direct comparison. Nevertheless, the resource management solutions described herein may be applicable, in various forms, in wider circles. I reserve some general remarks for the class of functional languages, with whom Daisy also shares a fair number of traits.

9.3.1 Parallel Functional Languages

Although DSI and Daisy have a strong heritage in Lisp, superficially Daisy appears to have much in common with functional languages. Like functional languages, Daisy lacks side-effects; however, Daisy is not referentially transparent due to its indeterministic set operator.

Daisy's similarity to functional languages extends beyond laziness and lack of side-effects to include stream I/O and the use of error values. Lazy, side-effect-free programming also leads to similar styles of expressing data recursion, functional mappings and other functional programming hallmarks. At the same time, Daisy does not have many of the trappings of "modern" functional languages [Hud89] such as pattern matching, type inference, and algebraic or abstract data type constructors. In these areas Daisy remains very Lisp like, using lambda forms, dynamic typing, and standard Lisp data types.

Internally, Daisy's evaluation model is also Lisp like, using an environment-based closure model rather then the ubiquitous graph reduction, combinator or dataflow approaches common in functional languages. This point is significant, because the reduction model strongly influences the underlying view of hardware; DSI's virtual machine is fairly different from published descriptions of abstract machines for graph reduction.

This dissimilarity extends to the task model. Under graph reduction, a task is simply an available redex; this is just a pointer to a subgraph that can be reduced in parallel. The reduction machine will "spark" a parallel task [Jon87] by communicating this pointer to some other processor. In contrast, DSI's process decomposition occurs in list construction, not in reducing application redexes. Suspensions are implemented as a fine-grained process control records rather than as graph redexes.

In summary, Daisy shares many semantic traits with functional languages due to its lazy semantics and lack of side-effects, but differs substantially in most other respects. To avoid confusion, I refer to Daisy as a quasi-functional or applicative language.

9.3.2 General-Purpose Symbolic Processing Kernels

This category refers to kernels that are presented in the literature as general (symbolic) language implementation vehicles. Virtual machines for specific languages are not included here. Kernel issues relating to specific languages or implementations are described in the section 9.2.2.

9.3.2.1 STING

The STING project [JP92b,JP92a] lays claim to being a general purpose high-level parallel "substrate" for implementing parallel symbolic languages. The STING system offers a high-level programming environment (the system is based on a Scheme compiler) with a "concurrency toolkit" approach for implementing other languages. Under STING, users can dynamically create virtual machines, each containing an arbitrary number of virtual processors. Each virtual machine provides a separate address space for the threads executing within it and the user can define a scheduling policy for threads on each virtual processor. Their system provides explicit creation, synchronization, blocking and task killing operations. So, for example, users can implement speculative computation, but must manage the detection, killing and removal of useless tasks themselves.

In contrast to STING, DSI is at the same time both lower-level and higher-level. It is lower-level in the sense that programming the DSI virtual machine amounts roughly to assembly programming, with registers and instructions that are very close to the actual machine; STING's virtual machines, virtual processors, and so forth are all first-class objects that can be manipulated from within Scheme (in fact the system was written mostly in Scheme). DSI is higher-level in the sense that its kernel provides a more packaged solution to process control, with the kernel handling synchronization, demand-driven scheduling, useless task removal, and so forth.

Another very notable difference between STING and DSI is that DSI's processes are much finer-grained than STING's threads. Each STING thread contains a stack and a heap^9.1, and is associated (through virtual machines) with a protected address space. Thus STING's threads, while conceptually smaller than traditional operating system processes, are still fairly large-grained tasks. It is highly unlikely that one could implement suspending construction on STING, using threads for suspensions, in any efficient way, other than by re-implementing a finer-grained process package on top of it, which would obviate the purpose of using it.

DSI's low-level implementation is due to one of the original motivations of the project: to explore fine-grained list-multiprocessing at target levels. DSI's virtual machine is "the machine" and its kernel is the operating system on that machine. In contrast, STING is a relatively high-level environment that is implemented on stock multiprocessing operating systems in Scheme. DSI's kernel reflects the motivation to provide as much system control of parallelism as possible, so that the language implementor is simply responsible for identifying parallelism. STING's approach is to allow the programmer full control (and thus full responsibility) over process decomposition, mapping, etc.

9.3.2.2 The Chare Kernel

The Chare kernel [KS88] is closer in spirit to DSI than the STING system described above. The Chare kernel was developed to support the distributed execution of Prolog on message-passing multiprocessors, although it is touted as a general-purpose parallel programming environment. Like DSI, the Chare kernel handles all aspects of resource allocation for processes (chares in their terminology). Programming on the Chare kernel is fairly low-level (an improper superset of C; i.e. not all C features are supported), and chare granularity seems to be similar to that of suspensions, perhaps finer.

That is about where the similarities end. The chare computation model is inherently distributed (although there is an implementation for shared memory machines). Chare programming consists of creating chares and sending messages between them. Shared data is handled on distributed processors by having the user program provide a callback routine to package any data that the kernel is about to pass to another processor, where it is unpacked. Pointers must be eliminated by the packaging procedure. On shared memory machines this is not required. The kernel can be configured to use one of several load-balancing procedures that migrate chares to neighboring processors and also to use one of several queueing strategies for messages.

It is not clear how speculative processing could be handled in the Chare kernel. Chares can be created with specific priorities, but termination is explicit (a chare kills itself). It is unclear from [KS88] whether or how chares are garbage collected (or for that matter, how storage management in general is handled).

In summary, the Chare system bears a resemblance to DSI in that the process granularity is similar and the goals of the two systems are total process resource management. Otherwise, the systems are fairly different not only in computation model and style, but in the algorithms used for resource management.

9.3.3 Parallel Lisp

In the family of symbolic processing languages, the work done on parallel Lisp is perhaps the most relevant to the work presented here and vice versa. That in itself is not surprising, since Lisp is the grandfather of symbolic processing, and the seminal research leading to the work presented here was conducted in Lisp. This is particularly true regarding the body of work centered around the future construct used in Multilisp [RHH85,RHH86,Osb90], and its offspring, MultiScheme [Mil87] and Mul-T [DAKM89]. There seems to have been a cross-fertilization of ideas in the genesis of suspensions and futures [BH77,RHH85], although the work on suspensions seems to predate that of futures [FW76a,FW76b]. A number of other Scheme constructs like delay/force, continuations and engines share some characteristics with suspensions in various ways, but only futures share the inherent notion of concurrent process entities. A high-level, denotational analysis of these constructs vs. suspensions can be found in [Joh89c].

Conceptually, both suspensions and futures represent fine-grained tasks. However, most implementation descriptions of a future's state indicate a larger granularity than that of suspensions. This is probably due to the historical fact that futures were designed primarily as an explicit eager construct, and thus can represent an arbitrarily large granularity, even though they are generally used in a fine-grained way. Suspensions, on the other hand, were intended from the outset to be a lazy object and to be created implicitly during list construction. Thus the suspension's design attempts to minimize the state profile. It is instructive to consider whether an efficient suspending construction type language could be built using delays, Multilisp's counterpart to suspensions.

Aside from granularity and scheduling semantics, futures and suspensions also differ operationally. Although they are mostly transparent in Scheme (most strict primitives will implicitly touch them into existence), futures can be directly shared in a way that suspensions cannot. Non-strict operations are free to pass future pointers around as arguments, insert them into new structures, etc. This is because when the future is determined (their terminology) the value is stored in the future itself, until the garbage collector comes along and "shorts out" the value. Suspensions, on the other hand, overwrite their references upon converging. This means that suspension checking occurs earlier than future detection, which is generally tied in to run-time type checking. Some implementations, such as MultiScheme, further differentiate between a placeholder (future) and task, actually separating the two entities into different objects, both of which may be manipulated within Scheme.

This more explicit manipulation of futures in Scheme is indicative of the language differences between Scheme and . Multilisp is a strict applicative order language (Scheme) with side-effects and explicit parallelism using futures. It also includes explicit synchronization constructs such as as atomic replace, locks and semaphores. These features allow explicit process granularity, low-level heap mutation and process synchronization to be performed in Scheme, which aids the parallel Lisp implementor, since more of the system code can be written in Scheme itself. It also provides a great deal of flexibility for implementing other parallel languages on top of it, as in [JP92b], although one implicitly inherits Scheme's control structures, memory management and other artifacts with this approach.

Since these low-level hooks into the implementation exist, it is tempting to make them visible to the programmer, on the principle that it affords the programmer greater flexibility and capabilities. In fact, this is the hallmark of Lisp systems. However, this also can work to the language's detriment. It drags the language down toward the level of the implementation, exposing the gritty details of implementation and requiring the programmer to take more active control in resource management. In contrast, Daisy remains a fairly high-level language, without low-level hooks, and in doing so has much flexibility in parallel scheduling at the language implementation level. DSI remains at a very-low level, for fine control of the machine. In this light, parallel Scheme appears to be more of a middle-level language between DSI and Daisy.

This concludes my general comparison of parallel Scheme and Daisy/DSI. In the next few sections I highlight the differences in resource management between the various flavors of parallel Scheme and DSI. The information presented here is gleaned from various papers in the bibliography, cited here where appropriate.

9.3.3.1 Multilisp

Multilisp is a version of Scheme with futures implemented on the experimental Concert multiprocessor at MIT [RHH85,RHH86]. Multilisp was instrumental in starting the parallel Lisp revolution, providing one of the first working implementations of futures, and bringing many fine-grained parallel resource management issues to light.

Multilisp's memory management is based on a copying, incremental collector. Each processor has an oldspace and newspace; objects are allocated out of the processor's own newspace, which means there is no allocation synchronization and very high data locality for local processes. During garbage collection, the first processor to reach an object copies it to it's newspace, regardless of whose oldspace it was originally in. The author claims this may improve locality of reference because if the containing processor has dereferenced the object it will be entirely migrated to the copying node, which presumably still needs it. However, if there is more than one reference to an object there is no guarantee that the processor copying it is the one that will access it the most, so this may not be a valid assumption. Unlike Multilisp, DSI's allocation is distributed

Multilisp employs an "unfair" scheduler to constrain parallelism. Processes are placed on a local stack, oldest first; executing a future results in the parent task being stacked while the child continues to execute. When a processor runs out of tasks it looks around at other processor's stacks for a task to "steal" off the top. This arrangement limits task queue growth to the same order of magnitude as if the futures executed on the stack. Note, however, that this also encourages depth-first expansion of the process decomposition, since the processes higher in the process tree are buried on the stacks; these processes may be more likely to represent significant computations that should be migrated rather than leaf tasks. A good example of this is the quicksort program and other divide-and-conquer strategies. It is interesting to note that the Mul-T implementation (Multilisp's heir apparent) uses a queue for this structure. Halstead notes that under the LIFO scheduling strategy, once the system is saturated, futures complete before their parent task ever gets resumed; in essence, the future expression could have been evaluated on the stack. This may have motivated the work on lazy future creation [Moh90] used in Mul-T (see below).

9.3.3.2 MultiScheme

MultiScheme [Mil87,FM90] is a parallel dialect of MIT-Scheme, extended to support futures.

Miller's dissertation [Mil87] concentrates on the details of extending MIT Scheme to support futures and the other features of MultiScheme. The MultiScheme scheduler, written in Scheme, and presented in the thesis, uses a set of primitive functions to pass tasks to the underlying multiprocessing engine for distribution. There are no details about the underlying resource management algorithms used in the implementation, making it difficult to draw comparisons to DSI in this report, which essentially describes a different level of implementation and/or set of problems, although some issues such as task blocking and resumption are described.

Miller does describe the basis for speculative task recovery in MultiScheme. The system provides a special data type, weak cons cells, which have the property that the garbage collector will remove the car reference if no other references point to the same object. Thus, weak cons cells can be used to build hash tables, device lists, and other structures that must be specially collected in systems like DSI (see sections 5.3.2 and 7.3.3).

Weak cons cells are used to build the waiting (blocking) queues used for tasks waiting on the value of a future (placeholder in MultiScheme), so that any speculative task that has been dereferenced elsewhere will not be retained following a garbage collection. Osborne [Osb90] notes that the success of this approach depends on the frequency of garbage collection, although priority propagation was later added to downgrade speculative tasks between collections.

MultiScheme is similar to DSI in its use of global interrupts (visible in scheme) and barrier synchronization to organize garbage collection, although the garbage collection itself is encapsulated by a primitive function at the lower implementation level, and is not described. This example typifies the difference between Multilisp and Daisy, regarding low level hooks into the implementation, described earlier.

9.3.3.3 Mul-T

Mul-T is a extended version of Scheme augmented with futures and implemented on the Encore Multimax multiprocessor using a modified form of the ORBIT Scheme compiler [DAKM89]. For language and computation model comparisons with DSI, see the general remarks on parallel Lisp, above.

Mul-T's run-time system, like most other implementations (except ours, it seems) queues blocked tasks into the task state of processes directly. No mention is made of whether weak pointers are used or speculative tasks are garbage collected, as in MultiScheme. The language does provide a grouping mechanism for debugging, and users can suspend entire groups of tasks and kill the group. Mul-T provides support for process-specific variables (i.e. global name spaces) by converting T's normal shallow binding to deep binding.

Mul-T uses a memory management scheme very similar to that described for Butterfly Portable Standard Lisp (see below). Chunks of memory are allocated from a global heap and then objects are suballocated out of the chunks. This reduces contention for the shared global pointer managing allocation in the heap as a whole, although large objects are allocated out of the heap directly. Since the Encore is a non-NUMA architecture (see sections 2.1 and 2.2.1, chapters 3 and 5), no locality considerations are made for heap allocation. Mul-T uses a parallel stop-and-copy garbage collector.

The most significant difference between DSI and Mul-T regarding allocation is that Mul-T always allocates futures locally, whereas DSI distributes suspension allocation, subject to the availability of remote suspensions. This primarily reflects the difference in load-balancing strategies (see below). For further comparison remarks to DSI's memory allocation scheme, see section 9.3.3 below.

Mul-T scheduling uses two queues per processor, one for new tasks and one for resumed blocked tasks. Blocked tasks are restarted on the processor on which they last executed, in an attempt to improve snoopy cache locality, but tasks may migrate around the system due to stealing [RHH85]. The scheduling priority is as follows:

Run a task from the processor's own suspended task queue.
Run a task from the processor's own new task queue.
Steal a task from the new task queue of another processor.
Steal a task from the suspended task queue of another processor.

No priorities are provided within a queue.

DSI, like Mul-T, employs a multi-tier task scheduling algorithm, but the queuing strategies used in the two systems are quite different. DSI uses a load-sensitive allocation strategy for load-balancing rather than migrating tasks via stealing. An experiment is needed to determine whether this allocation strategy results in less overhead than task stealing, because processors do not have to spend time examining each other's task queues, or whether the overhead of our allocation strategy and inter-processor scheduling requests outweighs the cost of non-local context switching. DSI tasks always run locally and do not migrate, so cache locality should be good for DSI suspensions on a bus-based multiprocessor like the Encore.

Mut-T's per-processor task queues correspond closely with DSI's conservative task queue and conservative task stack (see chapter 6). DSI's use of a stack for local suspensions rather than a queue plus the lack of stealing may provide even more parallelism throttling than is available under Mul-T's queuing scheme. Note, however, that Mul-T uses a technique called lazy task creation [Moh90] to increase granularity of programs dynamically by inlining, and this probably reduces the need for kernel-based throttling. This technique is unlikely to be applicable to a lazy language like Daisy, which relies on suspension creation for laziness in addition to parallelism. DSI's suspensions are also likely smaller than futures.

In addition, DSI provides an additional speculative task queue and stack with corresponding lower priorities to keep speculative processes from disturbing conservative ones. DSI removes speculative tasks from the system implicitly. Kranz [DAKM89] provides no indication of how speculative computation is handled, if at all, in Mul-T. The techniques used in [Osb90] may be a possible direction for Mul-T in this regard.

Mul-T's I/O also shares similarities with DSI's device management. Mul-T uses a "distinguished task" dedicated to terminal I/O to avoid the problems of shared, parallel I/O; this corresponds roughly with DSI's input device manager (see chapter 7). Mul-T likely uses the imperative I/O model of Scheme, however, compared to DSI's stream I/O model; Mul-T's I/O efforts are motivated by integration with it's task group debugging environment (see above). Mul-T also uses a distinguished "exception handler task" for each processor in the system devoted to coordinating task groups. In DSI, all exception handling is accomplished via separate processes using signals (see chapter 3); our "distinguished exception handler task" is the kernel. DSI signals are also used to coordinate I/O, tracing and debugging, garbage collection, suspension detection, etc.

Finally, Kranz [DAKM89] makes very good arguments for hardware support for tag-checking to accelerate future detection on stock hardware, validating similar remarks I made in section 2.1 and chapter 3.

9.3.3.4 QLisp

QLisp [GM84,GG88] is a eagerly parallel version of Lisp based on a shared-queuing discipline. The language is based on a version of Lucid Common Lisp and compiles Lisp code to machine instructions. The language provides several explicitly parallel constructs:

A qlet construct for evaluating arguments to a let form in parallel.
A qlambda form, for creating a separate process which is accessed by passing arguments to the closure. The process executes the closure body on the arguments and returns the result to the caller.
Futures, a la Multilisp.

An interesting feature of QLisp is that the process granularity of qlet and qlambda processes is directly controlled by a predicate expression in the form. If the predicate evaluates to false the form reverts to its normal (non-parallel) counterpart.

The philosophy of the QLisp implementation seems to be oriented toward providing full programmer control over parallelism. QLisp provides functions to create, acquire, release and test locks. It uses catch and throw to explicitly kill other tasks; that is, when a processor executes a throw, all other tasks created within the corresponding catch are killed. This feature can be used to manage OR-parallel speculation.

Published details [GM84,GG88] are sketchy on the memory and process management used in QLisp. The language is implemented on an Alliant FX/8 multiprocessor (a non-NUMA design), which means QLisp probably does not micro-manage for locality purposes. The report states that processors contend for a shared lock on the "free memory pool" and that whichever processor exhausts the memory pool first performs a garbage collection while the other processors wait. The comparison remarks for DSI's memory management scheme verses QLisp are basically the same as for PSL, below.

A comparison of process management is difficult, given the lack of details on QLisp's design. However, the authors [GM84] describe the use of a single shared queue for distributing work among processors, hence the name, QLisp.

9.3.3.5 Butterfly Portable Standard Lisp

A group at the University of Utah has also implemented Lisp on the BBN Butterfly [MRSL88]. PSL is a standard Lisp augmented with futures and fluid binding to support concurrent name spaces.

The memory management described in [MRSL88] has some resemblance to DSI's. It divides the heap into segments across all processors. Distributed allocation is accomplished by allocating chunks out of the global heap; suballocation occurs out of the chunk until it is exhausted, at which point another global chunk is allocated. Their scheme allocates a chunk at a time from shared global heap state variables. DSI also allocates in blocks, but each processor's allocation vector contains blocks for each processor so that suballocation is more distributed. Also, in our scheme blocks are requested directly from the remote processor and can only be donated by that processor; they are not allocated on demand from global shared heap variables. This supports our load-balancing scheme in which suspensions only execute locally (see section 5.2.2); under PSL, blocked tasks can be rescheduled on any processor. An interesting result in [MRSL88] supports our choice of distributed allocation of cells for the BBN Butterfly. They first tried a local allocation scheme, which fared poorly due to contention; a subsequent refinement to distributed round-robin allocation improved performance (see chapter 5).

The garbage collector described in [MRSL88] is a monolithic and sequential (i.e. it runs on one processor) stop-and-copy design; DSI's mark-sweep collector is distributed and fully parallel in all phases. PSL uses a single global scheduling queue, this also handles load balancing; DSI uses distributed queues, executes suspensions locally, and relies on distributed allocation for load balancing. Butterfly PSL is very similar to Multilisp and similar comparison remarks apply (see section 9.3.3).

9.3.4 Parallel Lisp: Summary

DSI's virtual machine architecture and parallel kernel differ from the other Lisp implementations described above in several ways. The most significant and notable differences are summarized below.

9.3.4.1 Machine Design

DSI's virtual machine differs in two significant ways from most other parallel Lisp VMs. First, it uses a context window design in which several processes may be register resident on the same processor at the same time, as opposed to context swapping into a single set of registers. Secondly, in the other systems exceptions are handled in the traditional way using a stack, and exception handlers are defined using procedures or continuations; in DSI, exceptions (signals) are handled by separate processes.

9.3.4.2 Kernel Design

Many parallel Lisps use a tightly-coupled kernel design in which interrupts and kernel services are handled on the user or system stack. The DSI kernel is implemented as a special group of processes, distributed across all processor nodes; kernel communication is asynchronous, implemented by message streams.

9.3.4.3 Distributed Allocation

The parallel Lisp systems we have reviewed use one of two allocation methods:

Processors contend for access to global shared heap allocation pointers. Synchronization (usually in the form of locks) must be used to arbitrate access. Some systems optimize this by allocating large chunks at a time and then suballocating out of them to reduce contention.
Processors allocate only out of their own local space and either rely on a copying collector to migrate data, or they don't worry about locality of data at all.

In DSI processors asynchronously request and receive allocation blocks directly from remote processors. They can only suballocate from blocks that have been provided by these processors; allocation is distributed evenly across processors, subject to availability of allocation blocks.

9.3.4.4 Load Balancing

The parallel Lisp systems we have reviewed use task stealing or other task migration methods to balance the workload across the set of processors. In DSI, tasks can only execute locally; since the garbage collection does not move data between processors, suspensions execute on the processor on which they were allocated. The distributed allocation mechanism is indirectly tied to the processor load in an attempt to self-balance the system.

9.3.4.5 Process Management

Most parallel Lisps require the programmer to specify parallel decomposition and explicitly manage synchronization, at least in the case of side effects. Daisy/DSI attempts to provide automated process management in addition to automated memory management. The user should no more care about process decomposition, synchronization, scheduling or process reclamation than he does about memory management. However, annotations may be necessary to achieve maximal parallelism where excessive laziness interferes.

9.3.4.6 Speculative Computation

The Lisp systems we have reviewed either require the programmer to explicitly kill useless tasks, or rely on garbage collection of processes coupled with priority propagation to prevent them from running. DSI uses bounded computation to control the growth of new speculation and to remove useless tasks.

9.3.4.7 Parallel I/O

Imperative I/O to a single device (e.g. a terminal) presents problems for any parallel system; like side-effects, it requires explicit synchronization to control access and achieve the proper sequencing of events. At least one parallel Lisp system (Mul-T) uses a specialized process and grouping mechanism to control access to the terminal or controlling window. DSI uses a stream model of I/O designed to work with the nonintuitive scheduling orders imposed by lazy evaluation; for the same reasons this model works extremely well for parallel I/O. Non-temporal character streams can be merged, directed to separate windows, or combined in any fashion by the user.

9.4 Conclusion

Parallel processing technology has been maturing for a number of years, to the point where shared-memory multiprocessors are poised to become commodity items on the desktop. High performance and numeric computing is likely to continue to be dominated by traditional imperative languages, but high level symbolic languages have the opportunity to carve a niche in a more general class of computing problems on these machines. These languages have much to offer programmers, and indirectly, end-users. The ease of programming in these languages is in a large part due to their automated resource management as much as any other attractive qualities they might possess. The success of these languages will depend on the performance and effectiveness of this resource management as well as the co-evolvement of hardware to better support the execution of symbolic language data types and fine-grained processes. I hope that the work presented herein will in some way aid in this effort.

Next: Bibliography Up: An Architecture for Parallel Previous: 8. An Analysis of

Eric Jeschke
1999-07-08