A Frame for the simulation of Shared Memory on a Distributed Memory

Introduction

The majority of the algorithms for General Purpose Parallel Machines have been designed until now with respect to widely accepted, machine independent cost models. The aim of these models of parallel computation is to provide unified frameworks for the design of algorithms, that will estimate as close as possible the cost of an implementation on realistic parallel machines.

Unfortunately, too many models of parallel computation have been proposed until now, since the High Performance Computing vendors were adopting their own computation models, while the scientists were attracted by more abstracted models that allowed them to focus on the matters of parallelism itself, rather than technical issues such as synchronization and communication cost. As a result, for many years there was no commonly accepted cost model in the society of parallel computation, since there was no cost model that clearly outnumbered all the other models. Yet, the designers of parallel algorithms have exhibited a significant preference in abstracted, shared memory models, while the vendors of parallel machines have almost exclusively been based on models that adopt the message passing for interprocessor communication, which is closer to realistic parallel architectures.

As a result, a central issue has come up in the parallel computation, which is the gap between these ideal models that utilize shared memory and abstract away some machine oriented technical details, and the more realistic models that try to capture some additional difficulties in the design of parallel algorithms, and still remain machine independent. For this reason, many simulation algorithms have been proposed in the literature of the theory of parallel computation, from shared memory to distributed memory models.

In this work, a brief survey on the most recent results concerning the simulation of PRAMs on DMMs is presented, along with the design of a Shared Memory Simulator (SMS) that is based on a distributed environment provided by many realistic architectures, based on the DMM model. The reason why the models of PRAM and DMM have been chosen, is that the former has proved to be the most popular cost model in the community of the parallel computation scientists, while the latter is an abstracted model based on message passing, that is very close to, or is provided through specialized interfaces by the majority of the vendors of parallel machines.

Shared Memory vs Distributed Memory

During the past decades, many scientists of the theory of parallel computation have expressed various opinions and have proposed several cost models for parallel algorithms. The most popular proposed models are PRAM, DMM, BSP, OCPC and LogP. For more details on these models the reader is referred to ([?]).

There are two major categories of these models:

The Shared Memory Models (SMMs) that assume the existence of processing elements that communicate via a Shared Memory. The most popular representatives of this category are the variants of the PRAM cost model.

The Distributed Memory Models (DMMs) which consider that each processing element has direct access to its own local memory module, while the interprocessor communication is achieved by message passing. In this category belong models such as the DMM and the OCPC.

Figure 1. The DMM and SMM architectures.

Unfortunately, although the purely abstracted shared memory models give the designers of parallel algorithms the opportunity to focus in the matters of the parallelism itself when considering a specific problem, they are very difficult to implement on realistic architectures, at least to within the performance guarantee indicated by the analysis of the algorithm.

On the other hand, there are already available in almost all the widely known High Performance Computers either direct implementations of distributed memory models, or some interfaces that give the designers the impression of interacting with a Distributed Memory Machine (See PVM, MPI, corresponding section). This fact makes the execution of algorithms designed for such a model on realistic parallel machines almost direct, thus yielding a unified framework for the implementation of parallel algorithms on parallel architectures. The problem is that such models are very restrictive for the parallel algorithms designers, since they have to deal not only with the parallelization of the problem itself exploiting its own nature, but also with some communication issues, synchronization among the processing elements, etc.

One additional parameter that votes in favor of the Shared Memory Models, is that almost all the existing parallel algorithms have been based on a member of this family of models, namely the PRAM model. PRAM has attracted the parallel computation scientists' attention in the last decades and has prevailed over the rest of the computation models that have been proposed up to now. The only model that really tried to compete PRAM is the BSP cost model proposed by L. Valiant as a unified framework for the design of parallel algorithms. BSP tries to capture some synchronization and communication overheads without been limited by the technical details of the underlying parallel architecture. Nevertheless, the literature is based on PRAM which is already the most widely accepted cost model and it seems that PRAM is a more convenient measure of cost in the hands of the scientists, despite its large scale of abstraction.

From the above discussion it is clear that a simulation of an ideal (but desirable) Shared Memory Model on a more realistic and restrictive Distributed Memory Model without causing dramatic slowdowns or loss of efficiency, will give rise to the use of the numerous efficient PRAM algorithms available in the literature, on realistic parallel architectures through a unified simulation framework.

The PRAM Model

PRAM is a synchronous model of parallel computation in which processors communicate via a shared memory. It consists of m shared memory cells, M₁, …, M_m and p processors P₁,…,P_p. Each processing element is a Random Access Machine (RAM) with a private local memory, which is used for elementary local computations, for buffering, etc. During each step of a computation, a processor may read from one shared memory cell, perform a local operation, and write to one shared memory cell. Different processing elements may execute different operations during a step and may make reference to their own index. The local operations and the read/write accesses, may also be viewed as occurring during three separate phases. This simplifies the analysis and will only change the running time of an algorithm by a constant factor.

An input x consists of n variables, x₁,…,x_n. At the beginning of the computation, these values are located in shared memory cells M₁,…,M_n respectively, . If there are output values, they appear in the first r Shared Memory cells at the end of the computation.

The word size of a PRAM is the maximum number of bits that can be contained in a single cell of the shared memory. Limiting the word size can significantly affect the amount of time required to solve certain problems.

The interprocessor communications is taking place through the shared memory. Limiting the size of the shared memory implies a limitation on the amount of information that can be communicated between processing elements during a single step (for example processors communicating with each other by a single bus can be viewed as a CREW PRAM with a single shared memory cell).

The efficiency of a PRAM algorithm is considered with respect to the processors-time product, for solving a problem, which is called the work of the algorithm. A parallel algorithm for a problem is considered to be work-optimal, if its work is equal to the optimal sequential time for solving the same problem. Since this is a very restrictive definition, the term of efficiency is also used for parallel algorithms. If a PRAM algorithm solves a problem in t steps using p processors, with the product within a constant factor of the sequential time complexity of the problem, then the PRAM algorithm is called efficient.

There are many variants of the PRAM cost model, depending on the contention resolution strategy during the concurrent accesses to the shared memory cells. The most important categories are:

EREW PRAM: At each step of a PRAM computation only one processor may read from or write to a single shared memory cell.
CREW PRAM: At each step of a PRAM computation, multiple processors may access a shared memory cell for reading, but only one processor may write to it.
CRCW PRAM: This is the weakest version of PRAM where multiple processors may access a single shared memory cell for either reading or writing on it. There are some subcases, according to the contention resolution rule for the concurrent write accesses:

ARBITRARY writing: An arbitrary processor, chosen at random among the processors attempting to access a specific memory cell for writing, may write on it at a single step.
COMMON writing: This version requires that all the processors that are simultaneously writing to the same memory cell write a common value.
PRIORITY writing: The processors are assigned fixed, distinct priorities. The processor of the highest priority among those accessing a specific shared memory cell for writing, finally writes to it.

Note that any algorithm that runs on an ARBITRARY PRAM will run unchanged on a PRIORITY PRAM. Thus PRIORITY is at least as powerful as ARBITRARY. Similarly, ARBITRARY is at least as powerful as COMMON, COMMON is at least as powerful as the CREW PRAM, and the CREW PRAM is at least as powerful as EREW PRAM. The specific correlations of these variants of the PRAM cost model are shown in Table 1.

Original model	Simulating model	Number of processors	Amount of time
PRIORITY ARBITRARY COMMON	EREW or CREW	p	(logp)
PRIORITY ARBITRARY	COMMON	kp	(logp/(k(loglogp - logp)))
PRIORITY	ARBITRARY	kp	O(loglogp / log(k + 1))
CREW PRAM	EREW PRAM	p	((logp)/loglogp)

Table 1. Relationships among the variants of the PRAM Model.

The Distributed Memory Model (DMM)

A Distributed Memory Model (DMM) consists of n processors Q₁,…,Q_n and n memory modules M₁,…,M_n. Each processor Q_j has a link to the corresponding module M_j. The memory modules have communication windows which they may read from or write to. The reader may consider these communication windows the only available (CRCW) shared memory for the processing elements of the DMM. A basic communication step of such a DMM consists of the processors sending read or write requests to the memory module, at most one request per processor. Each module processes some of the requests directed to it and sends an acknowledgment to each processor whose request was chosen to be processed. Obviously this is a purely message passing model that is very close to the realistic architectures. It is considered that all the processing elements may communicate with any memory module. The DMMs are distinguished in the following categories, according to the contention resolution rules for choosing requests for processing:

Arbitrary DMM: In this case, one arbitrarily chosen request out of all the requests arriving at a single module is processed per communication step. The answer given by the module is accessible by all the processors having requested for access to the specific module.
c-collision DMM: In this case all requests arriving at a single module are processed in a communication step, as long as they are at most c (c is a positive constant independent of the input size); otherwise none of the requests is processed. An answer is accessible only by the corresponding processor this time.

Note that a 1-collision DMM is equivalent to the Optical Crossbar (the OCPC model), while a c-collision DMM can easily be simulated on an ARBITRARY DMM with delay O(c).

The Frames approach

The lack of software solutions which are directly applicable to a large family of computing environments, including parallel machines and topologies of heterogeneous computer systems, necessitates the existence of a unified framework for the development of a uniformly applicable source codes and executable files for realistic problems.

The Frames approach aims to provide support for the programming of distributed memory machines via a library of basic algorithms, data structures and so-called programming frames (or frameworks). The latter are skeletons with problem dependent parameters to be provided by the users. Frames focuses on re-usability and portability, as well as on small and easy-to-learn interfaces. Thus, non-expert users will be provided with tools to program and exploit parallel machines efficiently. Basic frames will be developed (which can be used to build more sophisticated ones) as well as complex frames for, e.g., Branch-and-Cut, finite element methods, tree-structured combinatorial optimization algorithms, finite difference methods, sorting and searching algorithms. All these frames will be especially valuable and relevant to industrial cooperators.

Frames will be constructed so as to provide applicable source code or executable files for different target machines and common programming environments, since a major objective is the interoperability among heterogeneous environments. The focus, however, is on distributed memory machines. Frames will be adapted optimally to the target systems, contain efficient state-of-the-art programming techniques, and therefore increase the efficiency and acceptance of parallel computing.

The need for the support of algorithms designed for shared memory models is a realistic situation depicted by the fact that the theory of parallel computation is mostly based on the PRAM cost model which is a purely shared memory model. The current work is presenting a solution to this problem that complies with the rationale of the Frames and enhances their applicability to a large area of computations. That is, a distinct Frame will be implemented for providing access to a Shared Memory Module by a number of processing elements, that will comprise altogether the hypothetical parallel machine on which an algorithm is supposed to be executed.

The SMS-Frame

The designers of parallel algorithms consider that a Shared Memory Model such as PRAM is very convenient for parallel programming because the programmer will not have to specify interprocessor communication, or to allocate distributed memory and synchronize the processing elements.

As mentioned in the introductory section, the purpose of the current work is the implementation of a basic frame, the Shared Memory Simulation Frame (SMS-Frame), that will give the opportunity of executing parallel algorithms designed for a variant of PRAM, over a purely distributed working environment, that will be considered to comply with the DMM model. The input of the SMS-Frame will be a PRAM algorithm, while the output of the computation will be stored in the first r cells of the hypothetical Shared Memory.

The outcome of such a simulation will be the transparent to the end user access of a hypothetical Shared Memory Machine, without introducing prohibitive costs slowdowns to the runtime of the parallel algorithm. Thus the SMS-Frame will motivate the implementation of various parallel algorithms already existing in the literature of the theory of Parallel Computation and are designed for Shared Memory Models, over existing parallel architectures that follow the Distributed Memory Model.

The major advantage of the SMS-Frame is the provision of a unified approach to the Shared Memory Simulation by various parallel architectures that are based on distributed memory. This is because the Frames are based on the idea of the 3-level implementation, that is, while the abstract level and the instance level will be the same for any parallel architecture, the implementation level will be implemented using a unified programming environment, thus providing source codes that may be adjusted to specific architectures or topologies, according to the implementations of these environments on the existing machines.

The expert that will have to provide the implementation level of this Frame over a specific Distributed Memory Machine, will be able to consider a unified Parallel Architecture Interface through which he/she will access the DMM, such as PVM or MPI (something that is compliant with the rationale of the Frames approach for programming on distributed memory machines). As for the abstract and instance levels, these will be the same for all the distributed environments, comprised of a single parallel machine, heterogeneous computer systems, etc.

Figure 2. The SMS-Frame as a Shared Memory Interface.

The major parameters that will have to be specified by a programmer of PRAM algorithms who wants to use the SMS-Frame as a PRAM, will concern the following:

Definition of the Shared Memory Computation Model. This category of parameters are necessary for the simulation process and will definitely have to be specified by the end user. The most important of these parameters are the following:

The variant of the PRAM that is considered by the algorithm.
The number of processors of the hypothetical PRAM.
The size of the Shared Memory that is used by any of the processing elements.
The parallel algorithm that will be executed.

Specification of the underlying DMM Parallel Architecture. It should be mentioned that the specification of these parameters may affect the optimality of the simulation process. When not defined, the SMS-Frame will initiate them so as to achieve simulations as closer to optimality as possible. Nevertheless, the user may need for example faster simulation, at the expense of some extra processing elements when this is possible, or may want to test an algorithm on a machine with a fixed number of processors. The following are the basic parameters that will (optionally) be specified by the user and concern the underlying, distributed machine:

The number of processing elements that will be used.
The number of hash functions that will be used for the distribution of the replicas of the Shared Memory cells of PRAM to the Modules of the DMM.

For convenience, a single PRAM step will be distinguished into 3 cycles, a read operation, a local computation and a write operation. The local computations will be just executed to the processing elements of the DMM that have taken over the simulation of a PRAM processor, after having read the appropriate Shared Memory cells that will be used in this computation. The difficult part is the simulation of read/write operations that deal with the Shared Memory access. According to the variant of PRAM that the end user has specified, these accesses may be concurrent or not, and in the case of concurrent writes, a contention resolution rule will be used (e.g. ARBITRARY, COMMON, PRIORITY). It should be mentioned that since there are trivial simulations of strong variants of PRAM (e.g. PRIORITY PRAM) on weaker ones (e.g. EREW PRAM), a single and weak variant might be considered in a pilot phase of the SMS-Frame such as EREW PRAM, on which any PRAM algorithm may run, at the expense of some extra overhead when it is designed for stronger variants (see Table 1).

As for the underlying distributed machine that will simulate the operation of PRAM, it will comply with a variant of the DMM model, according to the demands of the simulation. Nevertheless, most of the advanced simulations existing in the literature, are based on a variant of a c-collision DMM, the constant c being defined by the simulation process. For an overview of the strategies for simulating Shared Memory on Distributed Memory Machines, the reader is referred to the corresponding state-of-the-art section of this work, as well as to [MadH92].

Related Work

Because of the significance of the existence of such a simulation process, there have been many advanced works on this specific topic in the theory of parallel computation. In this section the most recent and work-optimal (or at least efficient) algorithms and bounds are presented, that will give the reader an intuition of how difficult such a task is and the possible strategies for implementing it.

There are two major categories of simulations of Shared Memories on Distributed Memories: The deterministic and the randomized simulations. Almost all the randomized schemes are based on the use of universal classes of hash functions to allocate (in some cases multiple copies of) the shared memory cells to the modules of DMM. The distribution properties of these functions aid the analysis of the strategies for work-optimal or efficient simulations, which may succeed with high probability. For example, using brilliant hashing strategies along with the h-relation routing algorithm of [GJLR93], the authors of [GMR94] achieve a work-efficient simulation with doubly logarithmic slowdown. The same slowdown is also achieved in the work of [DMadH93] by using a special class of hash functions and 3 replicas per shared memory cell. This slowdown can be easily be proved to be work optimal for proper choices of the parameters of the PRAM and the DMM model. A variant of the same idea is presented in [KLMadH94] where a parallel hashing table is used that allows delayed executions of the write operations.

On the other hand, the development of fast deterministic simulations of PRAM on a DMM seems to be much harder. The pioneering work of Melhorn and Vishkin ([MV84]) introduced the idea of multiple copies, for the implementation of faster read operations. Later, Upfal and Widgerson ([UW87]) introduced the idea of the majority trick, which was also used in [DMadH93] and [KLMadH94]. They also stated that there exists a suitably constructed expanding graph that guarantees an O(logn(loglogn)²) slowdown on an n-DMM. This result was improved to O(logn) by [AHMP87] by using a more complex access strategy. In [PP94] a lower and an upper bound were shown for any deterministic simulation of an n-processor PRAM on a p-processor DMM (pn), as a function of the number r of replicas (they call it redundancy) of the shared memory cells, which become tight for any m = poly(n) and .

In the following we shall deal with some randomized simulations, since it is clear that a work-efficient randomized simulation is much faster and simpler than a deterministic one, while the failure probability of these algorithms is significantly small.

[DMadH93]

The basic idea of the simulation in this work is the use of 3 hash functions to map each memory cell of a Shared Memory Machine to 3 distinct modules of DMM. A write operation is implemented by writing a value and a time stamp to at least 2 out of the 3 copies held in the associated DMM memory cells. This work exploits the well known majority trick which was introduced in [UW87], according to which, it suffices to successfully access at least out of the copies of a memory cell each time that an access of the Shared Memory is needed, provided that at least that many copies of the specific cell are up to date at any time, in order to assure consistency of the simulation model.

This simulation is a simplified (non-optimal) simulation of an n-processor EREW PRAM on an n-processor (3-collision) DMM with expected delay. The simulation presented in this paper circumvents the need for using a CRCW PRAM perfect hashing by an elegant use of the majority trick.

[KLMadH94]

In this work, a work-optimal randomized simulation of an EREW PRAM with delay O(loglog(n)log*(n)), and a simulation of a CRCW PRAM with the same delay, where the time-processor product is only away from the optimal by a factor of log*(n). According to the authors, the delay bound is very reliable, that is, it is guaranteed whp.

The simulation is based on a novel scheme which is more complicated than the simple hashing strategies used in many other works. They speed up the simulation of a read operation of an n-processor PRAM on an n-processor DMM, by using 2 or more hash functions thus providing multiple copies of each cell of the Shared Memory in PRAM. The simulation of a write operation is achieved by allowing delayed executions of write instructions; this means that whenever a memory contention prevents a write request from being executed during the present memory cycle, the request is stored in a parallel hash table. The size of this table of deferred write requests never exceeds O(n), whp. Thus, it is possible to distribute the parallel hash table among the modules so that accesses to it can be performed in constant time.

The analysis of their simulation depends on the properties of a particular -universal class of hash functions. The structure of these hash functions enables the analysis of the delay in the simulation process.

[GMR94]

In this paper a work optimal randomized algorithm for simulating a Shared Memory Machine (PRAM) on an Optical Communication Network (OCPC) is presented. The OCPC model is equivalent to the 1-collision DMM, where each processor has a local memory module, and each memory module only services a request per communication step, if and only if it receives exactly one request, otherwise no request is serviced.

More specifically they simulate each step of a (loglog(n))-processor EREW PRAM on an n-processor OCPC in O(loglog(n)) expected delay. Their simulation incorporates techniques and ideas from the simulation algorithms used in [KLMadH92] and [DMadH93] for simulating SMMs on DMMs, along with the h-relation routing algorithm presented in [GJLR93].

The authors, use the analysis of [DMadH93] for the successful elimination of a randomly generated tripartite hypergraph. More specifically, since they are simulating an -processor PRAM on an n-node OCPC, they must simultaneously implement the process used in [DMadH93] for 3n-node hypergraphs using only n processors. For this, they sparcify all the hypergraphs using ideas from the -relation routing algorithm in [GJTR93]. That is, they route all but messages and ensure that at most one undelivered message remains at any processor of the OCPC. Even so, implementing the process above in parallel could still require time steps per iteration, since each destination may participate in as many as different hypergraphs. Thus they have also to copy each destination in such a manner that each message can locate the appropriate copy of its destination. They perform afterwards the process of [DMadH93] in each hypergraph, ensuring that the process delivers at most a constant number of messages to each copy of a destination. After that, the messages can be sequentially forwarded to their true destinations in time.

Note that it is not possible to directly perform the process above on any of the hypergraphs since our processors can only receive one message in a time step, whereas the processors in [DMadH93] can receive 3 messages in a time step (because a 3-collision DMM is considered there).

[MacKPR94]

This work deals with the problem of on-line scheduling in which a set of abstract processes are competing for the use of a number of resources. Furthermore, it is either prohibitively expensive, or impossible for any two of the processes to directly communicate with one another. If several processes simultaneously attempt to allocate a particular resource, (as may be expected to occur, since the processes cannot easily coordinate their allocations), then none of them succeeds. In such a framework, it is a challenge to design efficient contention resolution protocols.

Two recently proposed approaches of PRAM emulation give rise to scheduling problems of the above kind. In one approach, the resources (in this case, Shared Memory cells) are duplicated and distributed randomly. A simple and efficient deterministic algorithm for accessing some subset of the duplicated resources is analyzed in this work. In the other approach, it is exhibited how quickly a process can access the given (non duplicated) resource using a simple randomized strategy. The authors obtain precise bounds on the performance of both strategies.

Interfaces of Parallel Architectures

For the simulation of a Shared Memory on a Distributed Memory, a specific interface that will describe accurately the architecture of the DMM must be adopted. In the past years several such interfaces have been developed that try to be as scaleable and machine independent as possible. In this section the two most popular interfaces of this kind are presented, that will give an impression of what is expected by a Parallel Architecture Interface.

Message Passing Interface (MPI)

The Message Passing Interface (MPI) is a library of functions and macros that can be used in C, FORTRAN and C++ programs. As its name implies, MPI is intended for use in programs that exploit the existence of multiple processors in a working environment, by message passing.

MPI was developed in 1993 by a group of researchers from the industry, the government and the academic society. As such, it is one of the first standards for programming parallel processors and is the first that is based exclusively on message passing.

In MPI, the processes involved in the execution of a parallel program are identified by a sequence of non-negative integers. If there are p processes executing a program, they will have ranks in the range {0,…,p-1}.

The main advantage of establishing a message passing standard are portability and ease of use. In a distributed memory communication environment in which the higher level routines and/or abstractions are built upon lower level message passing routines, the benefits of standardization are particularly apparent. Furthermore, the definitions of a message passing standard, such as these proposed in MPI, provides vendors with a clearly defined base set of routines that they can implement efficiently or in some cases provide hardware support for, thereby enhancing scalability.

The goal of MPI is to develop a widely used standard for writing programs that exploit distributed environments based on message passing. As such, the interface should establish a practical portable, efficient and flexible standard for message passing. More analytically, the major objectives of MPI are:

The provision of an application interface that exploits the multiprocessor environments.
To allow efficient communication, that will avoid memory to memory copying and overlap local computation and communication.
To allow for implementations that can be used in a heterogeneous environment, that may consist of small workstations, UNIX machines, complex parallel architectures, etc.
To provide the programmer with the ability of assuming a reliable communication interface. The user need not cope with communication failures, because these are handled by the underlying communication layer, which is guided by MPI.
To provide an interface that can be implemented on many vendors' platforms, with no significant changes in the underlying communication and system software calls.
To support language independent semantics for the interface.

This standard is intended for use by all those who want to write portable message passing programs in FORTRAN, C or C++, that are designed for running on parallel machines. The features that MPI provides are:

Point to point communication. This is the basic pairwise communication process which MPI is based. The message transmission and reception operations designed for this part of MPI provide a powerful and efficient basic communication facility.
Collective operations. This basically includes the process-group collective communication operations, such as barrier, and broadcast operations over a group of processes.
Process groups. MPI supports the partition of the set of processes that comprise a parallel program, into groups of processes that may be manipulated as a group by other processes.
Process topologies. This is a set of utility functions meant to assist in the mapping of process groups (i.e., a linearly ordered set) to more complex topological structures, such as a multidimensional grid.
Environmental management and inquiry. This is how the programmers manipulate and make inquiries of the current MPI environment. These functions are needed for writing correct, robust programs, and are especially important for the construction of highly portable message passing programs.
Profiling interface. It is a simple name-shifting convention that any MPI implementation supports. One motivation for this is the ability to put performance profiling calls into MPI without the need for access to the MPI source code.

The problem of MPI is that it is an exclusively message passing model, and thus it does not support explicit shared memory operations at all. Furthermore, it does not support explicit manipulation of threads and debugging facilities, while its I/O functions are limited since this is out of the scope of MPI.

The Parallel Virtual Machine (PVM)

The Parallel Virtual Machine (PVM) is a software system that enables a collection of heterogeneous computers to be used as a coherent and flexible concurrent computational resource. The individual computers may be shared or local-memory multiprocessors, vector computers, specialized graphics engines, or scalar workstations, that may be interconnected by a network infrastructure. PVM support software executes on each machine in a user-configurable pool and presents a unified, general and powerful computational environment for applications designed to run over multiprocessor environments. User programs written in C/C++ or FORTRAN, are provided access to PVM through the use of calls to the PVM library routines for functions such as process initiation, message transmission/reception, and synchronization operations via barrier or rendezvous facilities.

Users may optionally control the execution location of specific application components; the PVM system transparently handles message routing, data conversion for incompatible architectures and other tasks that are necessary for operation in a heterogeneous network environment.

PVM is ideally suited for concurrent applications composed of many interrelated parts. It is particularly effective for heterogeneous applications that exploit specific strengths of individual machines on a network. As a loosely coupled concurrent supercomputer environment, PVM is a viable scientific computing platform. PVM system has been used for a number of applications such as molecular dynamics simulations, superconductivity studies, distributed fractal computations, matrix algorithms, and in the classroom as the basis for teaching concurrent computing. The major features of PVM are:

Portability. It runs on nearly every UNIX machine, plus many shared/distributed memory multiprocessing environments.
Heterogeneity. Any types of machines can be combined in a single Virtual Machine.
Scalability. The Virtual Machines can include hundreds of host computers and run thousands of tasks.
Dynamic configuration. The host computers can be dynamically allocated/deallocated from a Parallel Virtual Machine either by the application itself or manually.
Hooks for fault tolerance. The application can be notified via messages of lost processes, processors of addition of new resources.
Dynamic process groups. User defined process grouping for functions like broadcast and barrier may be done. The groups may overlap and change dynamically during the execution of the application.
Signaling. PVM tasks can send signals to other tasks.
Multiple message buffering. This allows easier development of PVM math libraries, graphical user interfaces, etc.
Tracing. A call-level tracing is built into the PVM library.
Customization. The user can write manager tasks to implement custom scheduling policies.

The Simulation process

It is widely accepted in the Parallel Computation Society that it is relatively more convenient to design parallel algorithms considering a Shared Memory Model, rather than having in mind a Distributed Memory Model in which the programmer will also have to deal with matters such as synchronization of the processing elements, distribution of Shared Memory, and contentions during the communication process. On the other hand, a Distributed Memory Machine is much easier to be implemented on a realistic parallel or distributed architecture. But when a PRAM algorithm has to run over a DMM a memory contention occurs, because of the sequentialization of the memory requests that correspond to the same memory module.

Some measures for the quality of a simulation, are the slowdown that it induces to the execution of a PRAM algorithm, the time-processor efficiency, the memory contention (that is, the number of requests that have to be served by a single Distributed Memory module in a simulated step) and the simplicity of the strategy.

As mentioned in Section 3, the simulation strategies are categorized in two major groups: The deterministic and the randomized strategies. In this work it is our belief that a randomized strategy is much more interesting than a deterministic one. This is because a randomized approach is significantly simpler than a deterministic one, and easier to be implemented, while the failure probability is sufficiently small. Especially in the case of randomized PRAM algorithms, this failure probability is added to an already existing (also small) failure probability, thus not affecting at all the execution of the algorithm. Nevertheless, the clarity of the deterministic algorithms is a desirable feature, and thus a deterministic and clear Shared Memory access schedule is considered to be preferable. In some randomized simulations (e.g. in [DMadH93]) the randomization only concerns the choice of the hash functions, while the access schedule is clear and simple.

In the case of randomized algorithms, the most promising approaches have been based on hashing. More specifically, one (or more) hash function(s) chosen uniformly at random (u.a.r.) from a properly created universal class, is (are) used for the distribution of the (possibly multiple) replicas of the Shared Memory cells in the Distributed Memory modules. In the case that more than one hash functions are used, there is a speed-up in the Shared Memory access because of the well known majority trick. It is also well known that for any scheme that uses a single hash function to distribute the Shared Memory cells among n memory modules, the expected contention is necessarily , even in the case that the hash function behaves as a function chosen u.a.r. from , where p is the size of Shared Memory. Thus, any improvement to this bound may stem from the use of more than one hash functions, that imply more than one replicas of a single Shared Memory cell.

On the other hand, when considering more than one hash functions, there is an extra overhead because of the augmented write and update operations on the Shared Memory cells. Yet, this overhead may be evenly distributed so as not to affect significantly the efficiency of the simulation. In this case, the access schedule, that determines the strategy of approaching the replicas of a specific Shared Memory cell is of great importance for the efficiency of the simulation. Note that the space efficiency with respect to the size of the Distributed Memory modules is considered to be of minor significance in this work, since the key point is the estimation of the overhead of the simulation itself.

By introducing parallel slackness (that is, tuning our simulation parameters concerning the processors of the simulated PRAM and the simulating parallel architecture), it is easy to achieve work-optimal simulations, but only when considering complete interconnection networks among the processing elements of the simulating distributed machine. This is why our interest will be focused on DMM, avoiding this way some complex situations where the parallel algorithms programmer should deal with slowdowns because of routing in the interconnection network.

In Section 0, three works were briefly presented that comply with the latter philosophy, that is, they are randomized strategies for the simulation of PRAM on DMM. These are [GMR94] and [DMadH93], and [KLMadH94]. The last two works are closely related to each other, since the first uses 3 hash functions that are chosen u.a.r. from a suitably constructed universal class, while in the last, 2 hash functions are chosen at random from the same universal class, and a parallel hashing table is used that allows delayed write operations to the Shared Memory. The simulation of [GMR94] is based on several ideas from the previously referred two works, and it additionally uses the h-relation routing algorithm of [GJLR93] for the division of the problem to appropriately created subproblems. Despite this is a brilliant work that achieves a work-optimal simulation and a very efficient slowdown of , it has a very complex access schedule to the Shared Memory.

This work will be based on the works of [DMadH93] and [KLMadH94] where a c-collision DMM is considered, while the work of [GMR94] considers an optical crossbar, that is a 1-collision DMM which is more restrictive.

Simulation of an n-processor EREW PRAM on an n-processor c-collision DMM

This simulation is presented in [DMadH93] and its basic idea is to use three hash functions to store the replicas of each Shared Memory cell into the modules of the DMM. During the write operation, arbitrary two out of three copies of each Shared Memory cell x_i are updated. As for a read operation, the access of two out of three replicas per Shared Memory cell is enough, since at least one of them is up-to-date. Clearly the memory access schedule is based on the previously referred idea of the majority trick ([UW87]). By this strategy for both reading and writing operations, we need a schedule that accesses two arbitrary of the three copies of each Shared Memory cell x_i. An overview of the Memory Access Schedule is shown in the following.

The simulation of a single Shared Memory access operation consists of consecutive rounds. Each round consists of three phases, P₀, P₁ and P₂. In P_j each Q_i tries to access the j^th replica of x_i. It skips an access to the j^th replica, if it was successfully accessed during a previous round. Q_i finishes the access operation as soon as it gets two out of the three replicas of x_i. Each module answers all the requests it gets in one phase, provided that all these requests are at most c. Otherwise it acknowledges no request (according to the c-collision rule of the DMM).

In the following, the Memory Access Schedule will be presented, and it will be briefly explained why this schedule finishes within rounds, whp. Further it will be shown that it guarantees constant memory contention. It runs on the weak c-collision DMM () but several experiments by the authors of [DMadH93] have shown that the delay bound also holds for c = 2.

The algorithm

As previously mentioned, this access schedule is simulating the access operation of the Shared Memory of an EREW PRAM over a c-collision DMM. It should be noted that this algorithm, although it causes very small slowdown, is not work-optimal. It is far from optimality. Nevertheless, we have chosen this strategy because of its simplicity and clarity. And of course, this is considered to be a very efficient slowdown for parallel computations that usually require polylogarithmic runtime. The analysis of the algorithm demands that and at most n of the PRAM accesses are treated at the same time, for a suitable constant. The latter is not a problem, since (1/e) such phases are sufficient for simulating one step of any n-processor EREW PRAM. The Memory Access Schedule is the following:

Input: Each Q_i, i [n], possesses an address x_i (the Shared Memory cell accessed by P_i in the current PRAM step). Hash functions h₁, h₂ and h₃ map the keys x₁,…,x_n randomly into [n].

Task: Processor Q_i wants to access two out of the three memory modules

Memory Access Schedule:

Each Q_i , i [n], performs the following steps:

y = x_i;

for r=1 to 3 do s[r]=0;

while s[1] + s[2] + s[3] < 2 do

if s[r] == 0 then ACCESS(h_r(y), s[r]);
r = (r+1) mod 3;
endwhile

end.

All the processing elements of the DMM are considered to call the ACCESS routine in the t^th execution of the loop exactly at the same time. The c-collision rule for accessing a memory module is reflected by the following description of the semantics of the ACCESS operation:

Assume Q_i, iJ, and we have to perform ACCESS(k_i,s_i), for some . For , let J_k = {iJ: k_i = k}. After the ACCESS operation, s_i=1, iff iJ_k with , and s_i=0 otherwise. If is an arbitrary constant, an sufficiently small constant, and is sufficiently large, the Memory Access Schedule algorithm finishes after rounds with probability . The proof of this fact is based on the elimination of the vertices of an appropriately constructed hypergraph H, until no more vertices have left. More specifically, given a randomly generated tripartite hypergraph on 3n nodes with edges, one can, with high probability, remove all the nodes in the hypergraph using the following process:

Repeat O(loglog(n)) times:

Remove all the nodes with degree at most c.
Remove all the resulting trivial hyperedges (i.e., hyperedges in which only one incident node remains)

Each hyperedge corresponds to a read or write operation on a Shared Memory location. The 3 vertices correspond to the 3 processors of the DMM that are associated with that memory location. For example, for the Shared Memory location v, suppose that the 3 replicas of this specific location are v⁰,v¹ and v². Then in the hypergraph H an edge (v⁰,v¹,v²) will exist that corresponds to the access request of the specific Shared Memory cell.

Obviously, the hypergraphs H₀=H,H₁,…,H_t that are created by the execution of the elimination process, have only edges of size 2 or 3, and the number of rounds the above protocol needs is the minimum t such that H_t= {}. But with probability , the largest component of H has at most nodes, for some constant that depends on s. Thus, rounds suffice to reach an empty hypergraph H_t. For more details in the proof of the delay of the access schedule the reader is referred to [DMadH93].

Constructing almost random hash functions

Until now, for the ease of analysis the hash functions that were considered for the distribution of the replicas of the Shared Memory cells in the Distributed Memory modules, were considered to be chosen u.a.r. from , where p is the size of Shared Memory. In fact such a choice of the hash functions is unrealistic and an effective way for constructing a universal class of hash functions that will be computable in constant time, has to be provided. Some desirable features of the hash functions that will be used by the Memory Access Schedule are the following:

They have to behave like random functions.
They must be evaluated in constant time.
They must require little space.

Definition: Let H_p,n be a family of hash functions from . Hp,n is called (,k)-universal if for any set of keys and locations , it holds that if the hash function h is drawn u.a.r. from H_p,n then . If = 1 then we simply call H_p,n k-universal.

Carter and Wegman [CW79] introduced the notion of universality for families of hash functions as a measure of randomness. More specifically, they introduced a class of linear hash functions which they showed that they are (2,2)-universal. Later on this class was generalized to a class of polynomials of degree d. This new class is proved there to be (2,d+1)-universal and computable in O(d) time steps.

For our purposes, a universal class as this of [KLMadH94] is sufficient for the analysis of Memory Access Schedule that is adopted. In fact any universal class of hash functions as these introduced in [Si89] that assures sufficiently chosen limited independence is enough.

In [KLMadH94] a new class of hash functions is introduced. It combines two previously known universal class. The first is the class of functions , where h is a degree (d-1) polynomial over _p, was introduced in [CW79], which, as previously stated, is (2,d+1)-universal. The second class, is introduced in [Si89] and consists of more complicated functions. It is the first class with high degree of universality whose functions can be generated fast using little space and have constant evaluation time, if the universe is of size n^k, for constant k.

In [KLMadH94] the universal class that is also considered in the current work, is introduced as follows:

Definition (): A particular function h is specified by:

A primary hash function .
A secondary hash function .
A set of offsets where [n].

The function h is defined in terms of as h(x) = (r(s(x)) + _f(x) ) mod n.

Note that can be evaluated in constant time if d and k are constants. Further, a random h can be constructed by a randomized DMM with processors in constant time. These properties follow from the properties of the universal classes used for the definition of . Lemma 5.4 of [KLMadH94] states the desired result of this class of hash functions, i.e. that is a (1, )-universal class, which is more than enough for our randomized strategy.

Table of contents