Next: 5. Memory Management
Up: An Architecture for Parallel
Previous: 3. Architecture of the
Subsections
4. The DSI Kernel
DSI refers not only to our virtual hardware architecture,
but also to a kernel written for that architecture
(see figure 4, p. ).
In this chapter we describe the role of the kernel and
its overall organization and operation.
4.1 Resource Management in Symbolic Programs
DSI's kernel provides much of the same core functionality as an
operating system.
Its main purpose is to provide run-time resource management for
programs running on the DSI architecture.
We define resource management as:
-
Memory management, in the form of distributed storage
allocation and reclamation.
-
Process management, in the form of distributed, demand-driven
scheduling of processes, with system calls for parallel task scheduling.
The kernel handles interprocess synchronization, excess parallelism
throttling, system load balancing, and support for speculative
parallelism.
-
Device management, in the form of event-driven device
scheduling, a character-stream I/O interface and garbage collection
of open devices.
DSI's kernel differs substantially from conventional operating
system and parallel language kernels due to its particular
orientation toward fine-grained symbolic processing.
The differences in processing requirements is reflected in the
emphasis or deemphasis of certain features.
For example, a finer process grain size leads to greatly increased
emphasis on the efficiency of process manipulation.
Suspension grain size is much smaller than in conventional
operating systems;
smaller than kernel-supported threads in conventional
processes, and even smaller than that of many symbolic languages
such as Multilisp, where grain size is specified by the
programmer.
Process manipulation overhead (creation, scheduling, synchronization,
context switching) is therefore significantly more critical to system
performance than in conventional systems, where it is typically a
fraction of overall computation effort.
This is reflected in DSI's use of context windows and demand
driven scheduling traps in the virtual machine.
Another major difference is in virtual memory support, or lack
thereof.
Studies of the interaction between virtual memory and heap-based,
symbolic languages [Moo84] reveal that
the large memory requirements of symbolic languages plus the
non-locality of heap allocation and garbage collection do not
mesh well with conventional virtual memory systems which depend
heavily on spatial locality.
The development of generational garbage collection has
reduced the problem somewhat [PRWM92], but the issue is
still a topic of active research.
Lazy languages such as Daisy may have even worse virtual-memory
locality than applicative-order languages such as Scheme, which
can make heavy use of stack frames.
In any case, the task granularity of symbolic processes is
much too fine to associate with virtual memory table flushes
on every context switch.
Another reason for a deemphasis on virtual memory is that it is
often used in conventional systems to support protected address
spaces.
In symbolic processing the dominant communication paradigm is
shared memory;
i.e. shared memory is the norm, rather than the exception.
There has been some work in using shared virtual memory to
support type-checking, memory barriers and other features needed
by symbolic languages (e.g. [AL91]).
This approach exploits the trapping aspect of memory management units
on stock hardware
for purposes that might also be handled by appropriate processor
modifications.
Finally, symbolic languages do not have visible pointers as
traditional languages do.
Memory protection is accomplished at the language level by the
simple fact that if you do not have a reference to an object you
cannot modify it, even in systems with side-effects.
Of more concern is the issue of conflicting global name-spaces.
Most systems have some notion of a global namespace for identifiers.
On systems with simple shallow binding or similar schemes, this
leads to shared identifier references; on a concurrent system this
is not always what is actually desired, especially for concurrent
users.
This problem might be addressed at the language level with a suitably
designed module or package system, as opposed to the use
of protected address spaces.
4.2 Kernel Structure
We distinguish between the design of the kernel's
resource-management algorithms and the design of its structure and interface.
DSI's design in the former sense is the subject of the following
three chapters.
The remainder of this chapter discusses the latter;
the structure, interface and operation of DSI's kernel.
In traditional operating systems such as Unix,
the majority of system resource management is encapsulated by a set
of system services provided by a monolithic kernel.
The kernel is implemented as a protected layer (or layers) of code
underlying every process.
System services are accessed by exceptions or special "function"
calls, both of which are handled on the user or system stacks.
These calls put the calling process in supervisor or
kernel mode, which allows the called kernel routines to
execute privileged operations on behalf of that process or other
processes in the system.
In other words, the kernel is a special mode (and code) that is
executed by all processes and executes as a part of their respective
address spaces.
Many modern operating systems are structured so that most system
services are handled outside of the kernel by special
user-mode processes.
The resulting stripped-down microkernel provides only the
absolute core supervisor-mode functionality required to manage
the machine resources: processor scheduling, memory management,
and interrupt handling.
The microkernel itself is implemented as a separate process
to which ordinary processes make requests for services,
not as a special code layer in processes' own address spaces.
This provides certain advantages over the traditional
kernel implementation model,
such as increased robustness, distributability, and concurrency
in the operating system itself.
A microkernel design results in a modular decomposition of operating
system functionality into a few separate processes that operate
interdependently to manage the system as a whole.
This modularity greatly simplifies understanding, extending and
debugging system-level code.
Many symbolic language kernels share aspects of the monolithic design;
they may not be nearly as large and they may use very different
resource management techniques, but they still implement the kernel as
a low-level code layer interfaced through the stack of every process.
In contrast, DSI borrows the microkernel design philosophy in the
structure of its own kernel (we will hereafter not distinguish between
the terms kernel and microkernel when referring to DSI's
kernel).
DSI's kernel is implemented as a set of special processes,
not as layers of code accessed through a stack interface.
Interfacing with the kernel is a kind of interprocess communication.
This is not nearly as expensive as in conventional IPC, since
- interprocess communication in DSI is through shared memory;
- DSI's processes are very lightweight;
- DSI's context window architecture allows multiple processes to
be register resident.
These factors blur the distinction between traditional notions of
interprocess communication and, say, shared-memory coroutining.
However, from a programming perspective our approach offers the
modularity and loose coupling of the former with the efficiency
of the latter.
The kernel interface is discussed in more detail below.
Another issue in parallel kernel design is whether a single kernel
controls all the processors in a sort of master-slave
relationship,
or whether the kernel is distributed and runs on each processor.
Master-slave arrangements are common in systems that do not use
symmetric multiprocessing;
i.e. in a processor arrangement in which one processor controls the
others.
In this case the "master" processor might run the kernel and
distribute work to the slave processors.
The advantage of this kind of design is that no locks or special
synchronizations are needed for the kernel, since only one processor
is running it.
In symmetric multiprocessing systems all processors have the same
functionality and capabilities.
This kind of system encourages a distributed kernel design where all
processors have kernel functionality,
an approach that requires more care in determining how the processors
interact, but pays off in greater parallelism in the kernel itself.
Although this issue is somewhat orthogonal to the issue of macro vs.
microkernel, the two issues impact one another.
If a distributed kernel is combined with a monolithic kernel design,
the processors may need to use shared locks and other measures to
arbitrate access to shared kernel structures.
A microkernel design allows the kernels to use interprocess
communication to inter-operate with each other,
resulting in a more loosely-coupled design that is easier to scale
and results in less bottlenecks.
Many parallel symbolic processing kernels use a distributed design.
The macrokernel approach of most of them is reflected in in the use
of locks to control access to global allocation pointers, shared
schedules, and other shared kernel structures.
DSI uses a distributed design in which the kernel processes are
distributed across all processor nodes.
There are no shared kernel structures; the only synchronization
required is to the queues of each processor's message area.
The kernels communicate with each other through these queues to
manage the machine as a whole.
4.2.3 Kernel Organization
The DSI kernel consists of four main processes.
The supervisor handles process scheduling, context swapping,
and miscellaneous other kernel duties.
The garbage collector handles storage reclamation.
The device manager handles device I/O, and
the tracer handles machine level tracing and debugging.
At system startup, each processor constructs this same configuration
of processes.
4.3 The Kernel Interface
Kernel services are accessed in one of two ways:
traps and message requests.
Both types of communication are implemented with signals.
Traps are associated with high-priority hardware interrupts
or instruction exceptions.
Examples of traps include device I/O signals, conditional load
traps, garbage collection, etc.
Message requests correspond to traditional "system call" type
services such as scheduling and allocation requests.
The term message request is not meant to imply
copying data between processors, as is done in message
passing systems;
rather it refers to the loosely coupled, asynchronous communication
that is required by our microkernel design.
4.3.1 Message Requests
Kernel "messages" are simply pointers to data (cells) in the heap
that are passed to the kernel to act upon.
Each message contains an indication of the type of request and
some optional data.
Common requests (e.g. scheduling requests) are encoded in the
reference tag of the message itself;
others are encoded in the head of the message.
The data (tail) includes any further data being passed to the kernel.
This may include a pointer to a location where data is to be returned
to the sender; a sort of call-by-reference mechanism4.1.
For example, this is how allocation requests are handled across
processors (see chapter 5).
Message communication is handled with streams, a natural choice for
communication in DSI.
There are two parts to a message request: appending the message to
the appropriate stream and (optionally) signaling the processor
that a message of the appropriate priority has been sent.
The signaling part may not be used for messages that are routinely
handled, such as allocation requests.
This approach is used both for local and remote kernel requests.
In normal stream communication under DSI there may be many readers
but only one writer4.2;
the only synchronization required is an atomic store operation,
which is provided by the virtual machine.
With message requests, there are multiple writers and only one reader;
namely, the kernel handling the requests.
Thus, some synchronization is required to arbitrate access between
processors to the tail of the communication streams for the purposes
of appending messages.
Note that since message streams are distributed over the processors
(each processor has several message streams), synchronization
efficiency is not overly critical, and simple spin locks
will suffice to arbitrate access among writers.
4.3.1.1 Implementation
The kernel's message streams are implemented as a set of queues.
The queue pointers are stored as a contiguous vector of cells in the
static data area of each processor's heap segment4.3.
The vector resides at the same offset in each heap segment;
a particular queue can be accessed by adding the offset and queue number
to the heap segment base pointer for the target processor.
Each queue of a processor's set of message queues is associated
with a signal (see table 1).
After appending a message, the sender signals the processor with the
signal associated with that queue.
The priority of the signals assigned to queues establishes the priority
of the queues themselves;
messages in a higher priority queue will always be serviced
before messages in the next lower priority queue, and so forth.
This provides a way to prioritize various types of requests;
see chapter 6 for an example of how this is used.
4.3.2 Traps
A special method is available for passing data between the kernel
and a running process on the same node.
The transient registers are globally accessible from any context
window and allow small fixed amounts of data to be passed from
resident process to resident process much faster than allocating
storage in the heap.
This method is used for implementing the conditional probe traps
and other frequent kernel trap interruptions.
Using these optimizations, a round-trip system trap in DSI
is faster than issuing a message request,
since no allocation operations are performed.
The catch is that this technique can only be used with very high
priority signals or with careful masking,
because data in the transient registers is volatile.
If a trap occurs when a higher priority signal is
pending4.4
then the transient registers may be overwritten before the
kernel gets around to reading them.
Message requests are relegated to the lower priority signals so as not
to disrupt trap handling.
We have described how signals are used to prioritize the kernel's
response;
as discussed in chapter 3,
DSI employs a number of optimizations to accelerate these transactions.
Context switching speed between regular and kernel processes
must be as efficient as possible to make this interface viable.
The use of context windows allows both regular and kernel processes
to be simultaneously register-resident4.5.
The kernel maps signals to the context windows to allow
direct transitions between programs and the handler processes;
e.g. tracing signals directly invoke the tracer, etc.
Signals have fixed priorities, which naturally provides a mechanism
for arbitrating among them (this can be overridden by masking).
The kernel assigns symbolic names to the values and exports the names
(using the assembler module system) so that the actual values can be
changed transparently without disrupting other modules.
Table 1 lists some of the signals and their respective
handlers used in the current implementation.
Table:
DSI Signals and Handlers.
Signal Name |
Type |
Description |
Handler |
SIG_EXIT |
sw |
Exit signal. |
Supervisor |
SIG_ABORT |
sw |
Abort signal. |
Tracer |
SIG_TRACE |
sw |
Tracing. |
Tracer |
SIG_PROBERR |
hw |
Invalid probe. |
Supervisor |
SIG_GC |
sw |
Garbage collection. |
Garbage Collector |
SIG_GC_DUMP |
sw |
Dump heap. |
Garbage Collector |
SIG_HDC |
hw |
Conditional head. |
Supervisor |
SIG_TLC |
hw |
Conditional tail. |
Supervisor |
SIG_RESET |
hw/sw |
Boot/reboot. |
Supervisor |
SIG_SYNC |
sw |
Synchronize nodes. |
Supervisor |
SIG_TIMER |
hw |
Interval timer. |
Supervisor |
SIG_IO |
hw |
An input event. |
Device Manager |
SIG_IOBLOCK |
hw |
I/O blocked. |
Supervisor |
SIG_DETACH |
sw |
Detach request. |
Supervisor |
SIG_QUEUE8 |
sw |
Message in queue 8. |
Supervisor |
SIG_QUEUE7 |
sw |
Message in queue 7. |
Supervisor |
SIG_QUEUE6 |
sw |
Message in queue 6. |
Supervisor |
SIG_QUEUE5 |
sw |
Message in queue 5. |
Supervisor |
SIG_QUEUE4 |
sw |
Message in queue 4. |
Supervisor |
SIG_QUEUE3 |
sw |
Message in queue 3. |
Supervisor |
SIG_QUEUE2 |
sw |
Message in queue 2. |
Supervisor |
SIG_QUEUE1 |
sw |
Message in queue 1. |
Supervisor |
|
The development of DSI's kernel and virtual machine was motivated by
the desire to support suspending construction at target levels;
i.e. hardware and low-level software.
Thus it is oriented toward applicative languages based on list
processing, and using a fine-grained, largely demand-driven
concurrency model.
Nevertheless, we have attempted to draw clear boundaries between
the language (chapter 8), the virtual hardware
architecture (chapter 3) and the kernel (chapters
4-7).
DSI's kernel is independent of the Daisy language and could easily
support other applicative languages (see section 9.2).
In this regard,
the DSI kernel can be compared to other generalized parallel symbolic
processing kernels such as STING [JP92b,JP92a],
which was built around parallel Scheme, and Chare
[KS88],
which was motivated by concurrent Prolog.
The DSI kernel differs from these systems, among other things,
in that it is generally targeting a lower level of implementation.
It also provides device management for lazy languages and uses
different resource management algorithms.
Next: 5. Memory Management
Up: An Architecture for Parallel
Previous: 3. Architecture of the
Eric Jeschke
1999-07-08