5. Memory Management
DSI's kernel provides complete memory management for processes.
In this chapter we discuss the organization of DSI's memory space
and the techniques used for distributed storage allocation and
reclamation.
5.1 Memory Organization
DSI memory requirements are divided into three categories:
- a large, shared heap of untyped cells;
- a small, per-processor private data area;
- the DSI implementation code in host executable format.
If the host architecture supports distributed shared-memory
(e.g. the BBN Butterfly),
the heap is allocated as identically-sized segments distributed
across all nodes, as shown in figure 9.
Figure 9: DSI Memory: Physical Layout
These individual heap segments are mapped into the logical
address space of all processors as one large contiguous heap.
The DSI implementation code is replicated across processors
for maximum instruction locality.
If the host supports only uniform (non-NUMA) shared memory,
a single large heap is allocated in shared memory
and artificially divided into equal segments.
Thus, regardless of the underlying shared memory implementation,
the heap as a whole is globally shared,
and each processor is assigned responsibility for a particular
segment of the heap.
Figure 10 shows the view from one processor.
Figure 10: DSI Memory: Logical Layout
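As a concrete illustration of this layout, the sketch below (in C, with
entirely hypothetical names such as heap_base and SEG_CELLS, not taken from
the DSI sources) shows how a logical processor number maps to its heap
segment within the single contiguous logical heap, and how the owning
processor of an arbitrary reference can be recovered.

    /* A minimal sketch of the logical layout described above.  The names
       (heap_base, SEG_CELLS, NPROCS, cell_t) are illustrative, not DSI's. */

    typedef struct { void *head, *tail; } cell_t;

    #define NPROCS    16                     /* logical processors           */
    #define SEG_CELLS (1L << 20)             /* cells per heap segment       */

    static cell_t *heap_base;                /* start of the contiguous heap */

    /* Base of the heap segment owned by logical processor p. */
    static cell_t *segment_base(int p)
    {
        return heap_base + (long)p * SEG_CELLS;
    }

    /* Which processor's segment does an arbitrary heap reference fall in?
       Used, for example, to schedule a suspension on its home node.        */
    static int owner_of(cell_t *c)
    {
        return (int)((c - heap_base) / SEG_CELLS);
    }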
5.2 Storage Allocation
DSI uses a three-level distributed storage allocation scheme.
Each level builds upon the level below it to provide its
functionality.
At the lowest level, a processor manages its own heap segment,
and only allocates from its own area.
Each segment is organized as shown in figure 11.
Figure 11: Heap Segment Organization
Ordinary cells are allocated in blocks from the low end of available
space (MEMLO) upward,
while suspensions are allocated in blocks from the high end of the
segment (MEMHI) downward.
The registers AVLLST and AVLSPN track the respective
allocation regions;
when they meet, available space is exhausted (for that node) and
garbage collection must be performed to recover space.
The reason for the bidirectional allocation has to do with garbage
collection;
its compaction algorithm is designed to relocate objects of the
same size.
This will become clearer in section 5.3;
for now, suffice it to say that if DSI had a more sophisticated
storage reclamation scheme this would not be necessary.
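The following is a minimal sketch of the two allocation frontiers within one
segment; the register and bound names follow figure 11, but the C structure
layout and function names are hypothetical.

    /* Sketch of bidirectional allocation within one heap segment.          */

    typedef struct { void *head, *tail; } cell_t;

    typedef struct segment {
        cell_t *memlo, *memhi;    /* fixed bounds of allocatable space      */
        cell_t *avllst;           /* cell frontier, grows upward            */
        cell_t *avlspn;           /* suspension frontier, grows downward    */
    } segment_t;

    /* Reserve n contiguous cells from the low end; NULL means the two
       frontiers would cross and garbage collection is required.            */
    static cell_t *alloc_cell_block(segment_t *s, long n)
    {
        if (s->avllst + n > s->avlspn)
            return NULL;
        cell_t *block = s->avllst;
        s->avllst += n;
        return block;
    }

    /* Reserve n contiguous cells from the high end for a suspension block. */
    static cell_t *alloc_susp_block(segment_t *s, long n)
    {
        if (s->avlspn - n < s->avllst)
            return NULL;
        s->avlspn -= n;
        return s->avlspn;
    }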
The second level handles the distributed aspect of allocation.
Each heap segment maintains two allocation vectors
stored in the Static Data area shown in figure 11.
The allocation vectors contain
pointers to reserved blocks of free space on other processors.
One vector is for free cell space, the other is for free suspension
space.
Figure 12: Cell Allocation Vector
Each vector is organized as a region of contiguous unary cells,
indexed by logical processor number.
Each element's tail points to a block of free space on the
associated processor and the data field contains the size of
the block (in cells).
Figure 12 depicts a cell allocation vector.
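The following sketch (with a hypothetical C layout) illustrates
suballocation from one allocation-vector entry as just described: the tail
of each unary cell points to a reserved block on the associated processor,
and the data field holds the block's remaining size in cells.

    #define NPROCS 16

    typedef struct cell { long data; struct cell *tail; } cell;

    typedef struct { cell entries[NPROCS]; } alloc_vector;

    /* Take one free cell reserved on processor p, or NULL if that block
       is exhausted and a refill request must be sent to p's server.        */
    static cell *take_from(alloc_vector *v, int p)
    {
        cell *e = &v->entries[p];
        if (e->data == 0)
            return NULL;
        cell *c = e->tail;      /* next free cell in the reserved block     */
        e->tail = c + 1;        /* the reserved block is contiguous         */
        e->data -= 1;
        return c;
    }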
The allocation vectors provide a general mechanism by which data
can be allocated on any particular processor in the system.
The technique of allocating larger blocks of free space and
then suballocating out of it improves efficiency by reducing
the frequency of communication costs between processors
[MRSL88].
The free blocks pointed to by the allocation vector are
preallocated and reserved,
so there are no mutual exclusion problems to contend with in arbitrating
access to the blocks.
The kernel can explicitly allocate objects on remote processors by
manipulating the allocation vector directly, but most
processes will typically use the allocation instructions new
and suspend instead,
the operation of which is described below.
When a processor exhausts a block of free space in its allocation
vector it sends a request to the remote processor in question
to supply another free block.
The requesting processor will be unable to allocate objects of that
type (cells or suspensions) on that particular remote processor
until it receives another free block from that processor.
An allocation server on each processor handles these requests
from other processors for blocks of free space.
A request contains a pointer to the appropriate cell in the
allocation vector of the requesting processor.
The allocation server will allocate a block of cells or suspensions
locally from its own AVLLST or AVLSPN (respectively)
and store the location and size of the block into the remote allocation
vector.
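A sketch of the exchange follows; the request structure, the local carving
routines, and all other names are hypothetical stand-ins for whatever the
implementation actually uses.

    typedef struct cell { long data; struct cell *tail; } cell;
    struct segment;

    extern struct segment *local_segment;                     /* this node's segment */
    extern cell *carve_cell_block(struct segment *, long);    /* from AVLLST         */
    extern cell *carve_susp_block(struct segment *, long);    /* from AVLSPN         */

    struct alloc_request {
        cell *vector_entry;   /* requester's allocation-vector entry for us */
        int   want_susp;      /* nonzero: suspension space, else cell space */
        long  nwanted;        /* requested block size, in cells             */
    };

    /* Run by the low-priority allocation server on the donating node.      */
    static void serve_request(struct alloc_request *req)
    {
        cell *block = req->want_susp
            ? carve_susp_block(local_segment, req->nwanted)
            : carve_cell_block(local_segment, req->nwanted);

        if (block == NULL)
            return;           /* nearly out of space: simply ignore request */

        /* Donate: publish the block's location and size to the requester.
           The block is already reserved, so no further locking is needed.  */
        req->vector_entry->tail = block;
        req->vector_entry->data = req->nwanted;
    }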
This request/donate allocation mechanism stands in contrast to the
shared heap state variables used in other systems [DAKM89,MRSL88].
Although the allocation block/suballocation technique can be (and is)
used in these systems, processors must occasionally synchronize for
access to shared global allocation pointers.
Two obvious factors affecting the availability of free space using
our technique are the size of the free blocks exchanged and how
responsive the allocation server is to requests.
The larger the block requested, the more data can be remotely allocated
before running out of space and the less often processors will have to
exchange requests.
However, allocating larger blocks to other nodes will result in a
processor using up its own free space faster than if it doled it out
in smaller chunks.
Since garbage collection is a global phenomenon, the interval
between collections is set by the first processor to run out of
space;
this would argue in favor of smaller blocks.
As we shall see, there are other reasons that a processor should limit
the amount of local space it is allocating to other processors.
We will return to this topic in section 5.2.1.
If the server can't supply a block it simply ignores the request;
in this situation it is very close to being out of space itself.
The allocation vectors are not garbage collected;
this is necessary so that any partially filled allocation
blocks may be compacted and heap fragmentation eliminated.
After a collection there is a flurry of allocation activity as
processors request and resupply each other's allocation vectors.
The third level of storage allocation buffers cells from the allocation
vector for use by the allocation instructions (new and
suspend).
The kernel culls free cell and suspension addresses from the
allocation vector into two internal buffers.
The new and suspend instructions implicitly access these
buffers to obtain their allocation references.
This culling action is activated by a signal,
which is generated by the allocation instructions
when the buffers begin to run low.
The kernel scans the allocation vector, culling free
references and building up the new and suspend buffers.
When a free block for a particular processor is exhausted,
the kernel sends a request to the allocation server on that
processor for another free block, as described above.
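The sketch below illustrates this third level under hypothetical names and
buffer sizes; the round-robin sweep shown here anticipates the default
distribution policy described in the next subsection, and the refill is
shown as a direct call rather than the signal-driven mechanism DSI uses.

    #define NPROCS  16
    #define BUFSIZE 64
    #define LOWMARK 8

    typedef struct cell { long data; struct cell *tail; } cell;
    typedef struct { cell entries[NPROCS]; } alloc_vector;

    extern void request_block_from(int p);    /* ask p's allocation server  */

    static cell *new_buffer[BUFSIZE];         /* culled free cell addresses */
    static int   new_count;

    /* Refill the buffer for new, sweeping the allocation vector round-robin
       and skipping processors whose reserved blocks are empty.             */
    static void cull_cells(alloc_vector *v)
    {
        int p = 0, empty = 0;
        while (new_count < BUFSIZE && empty < NPROCS) {
            cell *e = &v->entries[p];
            if (e->data > 0) {
                empty = 0;
                new_buffer[new_count++] = e->tail;
                e->tail += 1;                 /* reserved block is contiguous */
                if (--e->data == 0)
                    request_block_from(p);    /* refill, as described above   */
            } else {
                empty++;                      /* skip exhausted entry         */
            }
            p = (p + 1) % NPROCS;
        }
    }

    /* The new instruction draws from the buffer and triggers culling when
       the buffer runs low (suspend is analogous).                           */
    static cell *new_cell(alloc_vector *v)
    {
        if (new_count <= LOWMARK)
            cull_cells(v);
        return new_count ? new_buffer[--new_count] : NULL;
    }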
5.2.1 Data Distribution
Under DSI's allocation scheme
the data distribution pattern (across processors) generated by a
series of allocations is determined by two main factors:
- the culling (distribution) pattern used by the kernel
(and thus new and suspend), and
- the responsiveness of the allocation servers to requests
from other processors for free allocation blocks.
The default distribution algorithm used in the Butterfly implementation
culls free addresses in a round-robin fashion from the free blocks
for each processor, skipping those for which the free blocks are empty.
The rationale for this is that in the absence of any other useful
metric we should distribute data as evenly as possible to prevent
network hot spots [PCYL87,LD88,MRSL88].
The other logical benefit is simply to balance allocation across
processors.
The responsiveness of the allocation servers is deliberately
underplayed.
The server is not signal-activated (event-driven) like other
system processes.
In fact, it runs at a very low priority compared to all other processes
on the node.
The effect is that the server on a given node runs with a frequency
inversely proportional to the processing load on
that processor.
If the load is light, it will run frequently and
quickly refresh the free blocks on other nodes;
if the load is heavy, it will run infrequently or not at all,
and supply few or no blocks to other nodes.
The end result of these design choices is that new allocations are
distributed more readily to lightly loaded nodes in the system,
because those processors are able to supply more free space in a
timely manner.
Recall the discussion of the new-sink in chapter 3,
which effectively accomplishes the same result using a hypothetically
constructed processor interconnection network (LiMP).
Our load-sensitive allocation strategy models that behavior in software.
The success of this approach depends on maintaining the
connection between processor load and allocation;
i.e. allocation must be rendered load-sensitive.
There are two significant factors at work here:
proper scheduling of the allocation server (it should run less
frequently when the load is heavier) and the size of the
allocation blocks used.
Using too large a free block will render the allocation less
sensitive to loading,
because processors can suffer longer periods of allocation-server
latency before their allocation blocks are used up.
Small-to-moderate block sizes seem best suited for this purpose.
Smaller block sizes also increase the interval between garbage
collections, as explained in section 5.2.
Further experimentation is necessary to determine the precise effect
of block sizes on system performance.
5.2.2 Load Balancing
The interaction between allocation servers and kernels described above
results in a dynamic system that distributes allocation evenly
under balanced load conditions and favors lightly used processors
under imbalanced load conditions.
This is particularly significant in regard to the allocation of
new suspensions,
because what this really implies is the distribution of new processes
over processors; i.e. load balancing.
Our load-balancing allocation scheme is predicated on two factors:
- The fine-grained nature of suspending construction coupled with
an applicative environment forces data and processes to be recycled
frequently.
- A scheduling invariant in which suspensions can only execute
locally; i.e. on the processor in whose heap segment they are located.
Applicative languages enforce a stateless mode of programming
in which new data and processes are constantly created.
Most processes are short-lived; those that aren't are usually blocked
awaiting the results of new processes.
This is important because it lets our allocation strategy shift
work to new processors as they become less busy.
If our system consisted of relatively static processes then
an unbalanced system would remain unbalanced, since work is
not being reflected in the form of new processes.
Without invariant 2, our load-sensitive allocation strategy would
also be meaningless, since processors would be disregarding the
load-based partitioning of suspensions.
The garbage collector does not migrate storage between processors,
so a given suspension lives out its existence on one processor.
Since new suspensions tend to be created on lightly-loaded nodes,
computation overall continually gravitates towards an equilibrium.
Our system of load-balancing stands in contrast to most other
symbolic processing systems which actively move processes around
to try to balance the system, or have processors steal
tasks from other processors' ready queues [RHH85,DAKM89].
Our approach to load balancing has low overhead compared to these
approaches:
- Process state does not have to be relocated (process migration
techniques might not relocate it either, but they then add the cost of
nonlocal context switching due to NUMA (see section 3.5) or loss of
cache locality).
- Processors do not have to spend time examining each other's
queues, notifying each other of task availability, or handling
intermediate routing of tasks to other processors.
5.2.2.1 Considerations for Load Balancing
DSI's load-balancing strategy doesn't distinguish between
"important" suspensions and "trivial" ones.
Under our allocation scheme, trivial suspensions are just as likely
to be allocated remotely as suspensions representing significant
computations, which we are more interested in offloading to other
processors.
This is not a consideration if the surface language is explicitly
parallel, since the programmer will be indicating the parallelism
of computations using annotations (e.g. futures).
It is possible under our allocation scheme to allocate on specific
processors (provided a block is available) so the language implementor
has the option of offloading important tasks should the language be
able to distinguish them.
A potential weakness concerning the load-balancing aspect of our
allocation scheme is that, under suspending construction, the time of a
suspension's creation and the time of its coercion (scheduling) are
potentially unrelated (this is the basis of the model).
This would appear to undermine the load-sensitive strategy for
allocating free suspension blocks, since it is when the
suspension is scheduled, not when it is created, that impacts the load.
Consider a situation in which a few idle processors are supplying
blocks of free suspension space to other processors,
which are creating significant suspended computations, but not
coercing them.
Then at some future point this group of significant suspensions
is clustered on a few processors.
It's worth pointing out that this anomalous condition would only arise
in an unbalanced state, which our strategy is designed to prevent.
Nevertheless, assuming that the system did get into this state,
how is the situation handled?
Here the small grain size of suspending construction may prove
beneficial, since it tends to break up significant computations into
many smaller processes which will be distributed.
5.2.3 Locality
Heap-based symbolic languages are notoriously non-local.
Heap objects tend to be distributed randomly throughout the heap
and are also extensively shared.
Lazy languages compound the problem,
because the control flow is not as amenable to locality optimizations
for data allocation (e.g. stack allocation) and heap sharing is even
more pronounced.
DSI's heap allocation strategy acknowledges this and attempts to make
the best of the situation by dispersing data evenly over the machine.
However, for data that is not going to be shared DSI makes
an effort to maximize locality.
The Butterfly implementation, for example, uses locality
wherever feasible:
- The push operation allocates stack cells from the local heap
segment, so stack accesses are local and stack activity does not
involve network traffic.
- Suspensions are scheduled for execution on the nodes on which they
reside, so that context switching is localized to the node;
entire process contexts are not transmitted through the network.
This policy also serves the load-balancing strategy outlined above.
- Garbage collection does not move data across processor boundaries.
This preserves existing locality conditions.
- Compiled code is replicated across all nodes so that instruction
stream access is localized.
The goal of these optimizations is to maximize locality,
going through the processor network only when a nonlocal heap access
is necessary.
Tagged references help here as well, allowing typing of objects
without the penalty of a network memory reference to examine a field
within the object.
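For illustration, a hypothetical tag encoding in the low bits of a
reference might look like the sketch below; the actual DSI representation
may differ, but the point is that type tests need no memory access at all,
local or remote.

    /* Assumes objects are aligned to at least 4 bytes, so the low two bits
       of a pointer are free to carry the tag.                              */
    #include <stdint.h>

    enum { TAG_CELL = 0x0, TAG_SUSP = 0x1, TAG_MASK = 0x3 };

    typedef uintptr_t ref_t;

    static ref_t    make_ref(void *addr, unsigned tag) { return (uintptr_t)addr | tag; }
    static unsigned ref_tag(ref_t r)  { return (unsigned)(r & TAG_MASK); }
    static void    *ref_addr(ref_t r) { return (void *)(r & ~(uintptr_t)TAG_MASK); }

    /* Type dispatch without a (possibly nonlocal) load of the object:      */
    static int is_suspension(ref_t r) { return ref_tag(r) == TAG_SUSP; }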
Finally, we should say a word about systems that have little or no
sense of locality.
On a uniform-access shared memory system (e.g. a bus-based,
cache-coherent multiprocessor) there is no sense of locality as
regards the heap.
Cells and suspensions might as well be allocated anywhere;
the access time is (supposedly) uniform.
Nevertheless, our distributed allocation system provides benefits even
in this environment.
First, it provides locality for the allocation instructions.
Once a series of addresses has been collected and buffered the
individual processors are not competing to update shared allocation
variables.
Second, the system provides processor load balancing as described
above if processors respect the (artificial) "ownership" of
suspensions in other heap segments.
There is an implicit trade-off here:
on such systems it may make more sense not to enforce such artificial
boundaries and instead allow processors to execute any suspension,
anywhere.
This would cut down on the amount of interprocessor communication
required to transfer execution requests to other processors,
but would also require a new method of load balancing to be devised.
Task stealing [RHH85] might be appropriate in this case.
5.3 Storage Reclamation
Like many other heap-based symbolic processing systems,
DSI uses garbage collection as its primary method of storage
reclamation.
DSI does employ ancillary methods, such as
code optimization that recycles cells in non-sharing situations,
but at best these methods simply increase the interval between
garbage collections.
Under normal circumstances allocation operations will eventually exhaust
all available free space in the heap and garbage collection must be
performed to recover space.
5.3.1 Garbage Collection
DSI uses a distributed mark-sweep garbage collection algorithm.
Garbage collection is triggered by a SIG_GC signal delivered
to all processors.
The kernel process on each node responds to this signal
by invoking a local dormant garbage collection process.
Garbage collection suspends all other processing activity for
its duration;
when the collection process terminates, the kernel resumes the system
where it left off.
Figure 13 illustrates the major phases of garbage
collection.
Figure 13: Garbage Collection Execution
All phases operate in parallel on each node;
each phase is prefaced by a synchronization barrier to ensure
that all processors are executing in the same phase.
5.3.1.1 Initialization
During the initialization phase the collector swaps out all active
processes from the context windows to their respective suspensions so
that all valid references in registers will be collected.
5.3.1.2 The Mark Phase
During the mark phase the collector process performs a
pointer-reversal, depth-first traversal of the live data reachable
from each node's root.
Three bits in each cell are reserved for garbage collection
(the gc field in figure 5):
one for interprocessor synchronization and two for coloring.
If, during the mark phase, two or more processors reach the same cell
through different references, the synchronization bit determines which
one will proceed into the cell and which one will back up and terminate
that thread of its traversal.
The mark phase leaves the gc coloring bits set according to the
configuration of live collectible pointers found in the
cell.
At the end of the mark phase all cells are marked in one of the
configurations shown in table 2.
Table 2: Possible gc Bit Configurations

  VALUE | CONFIGURATION
  ------+--------------------------
  0x0   | Unmarked cell (garbage).
  0x1   | Unary cell.
  0x2   | Stack cell.
  0x3   | Binary cell.
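Expressed as constants (the values are those of table 2; the position of
the synchronization bit and the atomic primitive are our own stand-ins,
not the Butterfly implementation's), the mark-phase bookkeeping looks
roughly like this:

    enum {
        GC_UNMARKED = 0x0,    /* unmarked cell (garbage)                   */
        GC_UNARY    = 0x1,    /* unary cell                                */
        GC_STACK    = 0x2,    /* stack cell                                */
        GC_BINARY   = 0x3     /* binary cell                               */
    };
    #define GC_SYNC 0x4       /* hypothetical: claimed during marking      */

    /* The first processor to reach a cell claims it; a loser backs up and
       abandons that branch of its traversal.  A modern GCC builtin stands
       in here for whatever atomic primitive the hardware provides.         */
    static int try_claim(unsigned char *gc_field)
    {
        return !(__sync_fetch_and_or(gc_field, GC_SYNC) & GC_SYNC);
    }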
5.3.1.3 The Compaction Phase
The compaction phase uses a "two-finger" algorithm to sweep
through the heap segment,
defragmenting and freeing up space for future allocation.
The algorithm works as follows:
Pointer A starts at the low end of allocatable memory
(MEMLO in figure 11) and scans upward for
unmarked cells, stopping when it finds one.
Pointer B starts at the upper end of the allocation area
(AVLLST) and scans downward for marked cells,
stopping when it finds one.
The cell pointed to by B is copied to the slot pointed to by
A, thus freeing up a cell at the upper end of the allocation
space.
A forwarding address is placed in the head of the former
B location noting the relocation address of the cell,
and a special flag is left in the tail indicating a relocation.
This process iterates until pointers A and B meet.
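A sketch of this pass over the cell area of one segment follows, under a
simplified, hypothetical cell layout (head, tail, and a gc mark); the
suspension area is compacted by a similar pass.

    #define RELOCATED ((void *)0x1)     /* flag left in the tail            */

    typedef struct cell {
        void    *head, *tail;
        unsigned gc;                    /* 0 means unmarked (garbage)       */
    } cell_t;

    static void compact_cells(cell_t *memlo, cell_t *avllst)
    {
        cell_t *a = memlo;              /* scans upward for holes           */
        cell_t *b = avllst - 1;         /* scans downward for live cells    */

        for (;;) {
            while (a < b && a->gc != 0) a++;   /* A: find an unmarked hole  */
            while (a < b && b->gc == 0) b--;   /* B: find a live cell       */
            if (a >= b)
                break;                  /* the fingers have met             */

            *a = *b;                    /* slide the live cell downward     */
            b->head = a;                /* forwarding address in the head   */
            b->tail = RELOCATED;        /* relocation flag in the tail      */
            b->gc   = 0;
            a++; b--;
        }
    }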
This compaction scheme relies on the fact that cells are of a
uniform size.
Although suspensions are fundamentally made up of cells,
suspension cells must remain as an ordered, contiguous group.
This restriction, coupled with our choice of compaction algorithm,
accounts for the segregated allocation scheme discussed earlier
in which suspensions are allocated at the upper end of the heap
segment and cells at the lower end.
This upper (suspension) area must be defragmented as well,
so a similar compaction is performed at the high end of the heap
segment between AVLSPN and MEMHI (see figure
11).
After compaction each heap segment has been restored to the state
pictured in figure 11,
where free space is concentrated in the middle of the segment.
Each processor compacts its own heap segment in parallel with the
others.
Unlike the mark phase,
each processor's locality is perfect since the compaction does not
require the processor to reference anything outside of its own heap
segment.
5.3.1.4 The Update Phase
The update phase involves a second scan through the
compressed live data
in which the mark bits are cleared and any pointers to relocated
cells or suspensions are updated from their forwarding addresses.
The latter is handled by loading the tail of the cell pointed to by
each valid reference encountered during the scan and checking it
against the relocation pointer-flag.
If the cell was relocated it is necessary to load the head of the cell
to obtain the relocation address and then update the pointer in question.
The gc bit configuration indicates which fields contain valid
references during the scan.
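A sketch of that relocation test, reusing the hypothetical layout and flag
value from the compaction sketch above:

    #define RELOCATED ((void *)0x1)

    typedef struct cell { void *head, *tail; unsigned gc; } cell_t;

    /* Update one live reference in place if its target was relocated.      */
    static void update_ref(cell_t **ref)
    {
        cell_t *target = *ref;          /* the tail load may cross the network */
        if (target->tail == RELOCATED)
            *ref = (cell_t *)target->head;   /* follow the forwarding address  */
    }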
As with compaction, all heap segments are updated in parallel.
Some switch contention is possible, because the relocation tests may
require loads from other heap segments, but otherwise locality is
good.
5.3.1.5 Cleaning Up
During the exit phase processors synchronize again to ensure
that collection has finished system-wide, then reload the context
windows with the active process set.
Finally, the collector process relinquishes control back to the
process that was interrupted in allocation.
5.3.2 Minor Phases
There are two minor phases devoted to collecting the system hash
table and device list as special entities.
DSI must provide a way to remove identifiers that are not
referenced outside of the hash table and recover the space.
Most systems facing this problem use weak pointers or similar
mechanisms [Mil87].
DSI takes a slightly different approach.
The hash table is maintained as a global entity that is not reachable
from any garbage collection root.
During collection, the normal mark phase marks any identifiers that
are reachable from live data.
A minor phase between the mark and compaction phases traverses the hash
table and removes all unmarked entries and marks the table structure
itself, so that it can be successfully scanned during the update phase.
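A sketch of the sweep over one hash bucket, under a hypothetical table
representation (the entry structure and helper routines are ours): because
the table is not reachable from any root, an identifier survives only if
the ordinary mark phase reached it through live data.

    typedef struct ident ident_t;
    typedef struct entry { struct entry *next; ident_t *id; } entry_t;

    extern int  ident_marked(ident_t *);   /* set during the mark phase     */
    extern void mark_entry(entry_t *);     /* keep the table cell itself    */

    static void sweep_bucket(entry_t **bucket)
    {
        entry_t **p = bucket;
        while (*p != NULL) {
            if (!ident_marked((*p)->id)) {
                *p = (*p)->next;           /* unreferenced identifier: drop */
            } else {
                mark_entry(*p);            /* rescanned in the update phase */
                p = &(*p)->next;
            }
        }
    }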
A second minor phase devoted to collecting the device list is
discussed in chapter 7.
These two minor phases correspond to the hash demon and
files demon described in [Mil87].
5.3.3 Garbage Collection: Observations
Our garbage collection algorithm is neither as efficient nor as
full-featured as others described in the literature (e.g. [AAL88]);
nevertheless, we are encouraged by recent studies on the efficiency
of mark-sweep collection [Zor90,HJBS91].
Our collector has some useful properties that contribute to the
overall storage management strategy in DSI,
the most important being that storage is not migrated from one
processor to another, preserving locality conditions that are the basis
for a number of design decisions (see sections 5.2.2 and
5.2.3).
Our garbage collector's cost is proportional to the sum cost of the
mark, compaction and update phases.
Although our collector requires three main passes as opposed to one
pass used in copy-collect, the heap segments are not divided into
semi-spaces, and good locality is ensured for
the compaction and update phases, which may have major caching
benefits.
Since the mark phase is fully parallelized, but allows individual
processors to traverse the global heap, it is difficult to assess
the total cost of marking.
Optimally, it is proportional to
    reachable cells / number of processors        (5.1)
although in practice it is unlikely to turn out that way,
depending on the amount of live data reachable from each processor's
root and contention in the network.
The compaction phase sweeps the entire allocatable portion of a
processor's heap segment (marked and unmarked), and so is proportional
to that size.
The update phase sweeps the static area and compressed portion
of a processor's heap segment.
We have described a system of storage allocation and reclamation
for a cell-oriented, distributed heap.
Where feasible, the system allocates cells locally.
When data must be allocated globally, the system resorts to a
load-sensitive, distributed allocation scheme.
The garbage collector does not migrate data or processes from one node
to another,
which helps maintain the locality of processes and data allocated locally.
Our approach differs from related systems which attempt to improve
locality by copying structure between nodes during garbage collection.
Our system relies on the short lifetimes of data and processes
to allow load-sensitive allocation to draw the system toward a
balanced state.
The interaction between load-based allocation and process management
is explored further in chapter 6.