EURO-PAR'95, Stockholm, Sweden August 29-31,
Swedish Institute of Computer Science (SICS) and Department of
Teleinformatics, KTH

Paper Information

This page contains further information on the papers, as contributed by the authors or keynote speakers. Also, the first page of all papers are available in postscript form. Please report any errors or problems in accessing this material to psm@sics.se

Language Implementation I

(Tuesday 10:30-12:30)

Execution of Distributed Reactive Systems

Paul Caspi and Alain Girault
[view first page | fetch full paper | conference presentation ]

Abstract:

This paper addresses the problem of distributed reactive systems execution. We first show that a natural parallel description of such systems can be achieved with synchronous languages. Then, we explain how a centralized synchronous program can be executed in its environment, which is intrinsically asynchronous. For this purpose, we define a synchronous/asynchronous interface, which links the program logical time with the environment physical time. Finally, we motivate the need for distribution and show how a desired distribution can be easily achieved, thanks to the object code distribution algorithm implemented in the OC2REP tool. We then propose and discuss three solutions that allow distributed synchronous programs to be executed on an asynchronous network of processors.

Keywords: parallel computing, program interface, reactive system,synchronous language

Information from authors:

Paul Caspi received the Docteur es Science thesis from INPG, Grenoble, France. Currently, he is Charge de Recherche CNRS at Verimag, INRIA SPECTRE project, in Grenoble, France. His previous research activities mainly dealt with the use of computers in process control, and where concerned with dependability problems in critical applications, from both hardware and software point of view. This led him to participate in the design of the synchronous data flow language LUSTRE. He is now mainly concerned with the theory of synchronous data flow, and the problem of building distributed systems from synchronous programs. He also acted as a consultant for several companies and French administrations, on problems of dependable computing systems design and validation.

Alain Girault received the Ph.D degree from INPG, Grenoble, France. Currently, he is working with a post doctoral position at the INRIA MEIJE project, in Sophia-Antipolis, France. During his thesis, he worked on distribution of synchronous programs, and designed the OC2REP distribution tool.

Relating Data-Parallelism and (And-) Parallelism in Logic Programs

Manuel V. Hermenegildo and Manuel Carro
[view first page | fetch full paper | conference presentation ]

Abstract:

Much work has been done in the areas of and--parallelism and data--parallelism in Logic Programs. Both types of parallelism offer advantages and disadvantages: traditional (and--) parallel models offer generality, whereas data--parallelism techniques offer increased performance for a restricted class of programs. The thesis of this paper is that these two forms of parallelism are not fundamentally different and that relating them opens the possibility of obtaining the advantages of both within the same system. Some relevant issues are discussed and solutions proposed. The discussion is illustrated through visualizations of actual parallel executions implementing the ideas proposed.

Keywords: Parallel Logic Programming, And-Parallelism, Data-Parallelism, Fast Task Startup, Scheduling

Information from authors:


About Manuel Hermenegildo

Manuel Hermenegildo received his Masters and Ph.D. degrees in Computer Engineering from the University of Texas at Austin. He is currently Full Professor at the Dept. of Artificial Intelligence in the School of Computer Science of the Technical University of Madrid (UPM), Spain, where he leads the CLIP Group. Previously he was project leader at the MCC research center and Adjunct Professor at the Dept. of Computer Science of the University of Texas at Austin. His research interests include advanced compilation techniques, parallelizing compilers, high-performance and distributed computing, constraint logic programming, program execution visualization, and parallel computer architecture. He has published more than fifty papers in conferences and Journals. He is also area editor of the Journal of Logic Programming and editorial advisor of the Journal of New Generation Computing.


About Manuel Carro

Manuel Carro received his Masters degree in Computer Science from the School of Computer Science of the Technical University of Madrid (UPM). He is currently working in the CLIP Group of the Dept. of Artificial Intelligence as research assistant and working on his Ph.D. degree under the supervision of Manuel Hermenegildo. He has been involved in the ParForce (EP 6707) and ACCLAIM (EP 7195) ESPRIT projects.


On the Duality Between Or-parallelism and And-parallelism in Logic Programming

Enrico Pontelli and Gopal Gupta
[view first page | fetch full paper | conference presentation ]

Abstract:

Or-parallelism and And-parallelism have often been considered as two distinct forms of parallelism with not much in common. As a result, they have been treated quite independently and different mechanisms have been employed for realizing them. The purpose of this paper is to highlight the inherently dual nature of the two forms of parallelism and the similarities that exist between them.

The dualities and similarities observed are then exploited for gaining new insights into the design, implementation, and optimization of and- and or-parallel systems. The ideas developed in this paper are illustrated with the help of ACE system - a parallel Prolog system incorporating both and- and or-parallelism. The results from implementing these ideas are also briefly reported.

Keywords: Prolog, And-parallelism, Or-parallelism, Optimizations

Information from authors:


For further details on this project please access the home page of the Laboratory for Logic, Databases, and Advanced Programming, New Mexico State University.

Laboratory for Logic, Databases, and Advanced Programming

Functional Skeletons for Parallel Coordination

John Darlington, Yi-ke Guo, Hing Wing To and Jin Yang
[view first page | fetch full paper ]

Abstract:

In this paper we propose a methodology for structured parallel programming using functional skeletons to compose and co-ordinate concurrent activities written in a standard imperative language. Skeletons are higher order functional forms with built-in parallel behaviour. We show how such forms can be used uniformly to abstract all aspects of a parallel program's behaviour including data partitioning, placement and re-arrangement (communication) as well as computation. Skeletons are naturally data parallel and are capable of expressing computation and co-ordination at a higher level of abstraction than other process oriented co-ordination notations. Examples of the application of this methodology are given and an implementation technique outlined.

Keywords: Programming Language, Parallel Computing, Skeleton, Coordination Language, High Performance Fortran

Architecture Design

(Tuesday 10:30-12:30)

On the Scalability of Demand-Driven Parallel Systems

Ronald C. Unrau, Michael Stumm and Orran Krieger
[view first page | fetch full paper | conference presentation ]

Abstract:

Demand-driven systems follow the model where customers enter the system, request some service, and then depart. Examples are databases, transaction processing systems and operating systems, which form the system software layer between the applications and the hardware. Achieving scalability at the system software layer is critical for the scalability of the system as a whole, and yet this layer has largely been ignored.

In this paper, we characterize the scalability of the system software layer of demand-driven parallel systems based on fundamental metrics of quantitative system performance analysis. We develop a set of sufficient conditions so that if a system satisfies these conditions, then the system is scalable. We further argue that in practice these conditions are also necessary. In the remainder of the paper, we use the necessary and sufficient conditions to develop a set of practical design guidelines, to study the effect of application workloads, and to examine the scalability behavior of a system with only a limited number of processors.

Keywords: scalability, parallel, demand-driven, system software

Information from authors:

Ronald Unrau
============
Ronald Unrau is working on advanced compiler optimizations at the IBM
Software Solutions Lab in Toronto, Canada. His research interests
include parallel operating systems and compilers.

Unrau received his BASc in computer engineering form the University of
Alberta in 1984 and his MASc in biomedical engineering and PhD in
electrical and computer engineering, both from the University of
Toronto, in 1989 and 1993.

Michael Stumm
=============
Michael Stumm is an associate professor in the departments of electrical
and computer engineering, and computer science at the University of Toronto.
His research interests are in the area of computer systems.

Stumm received a diploma in mathematics and a PhD in computer science from the
University of Zurich in 1980 and 1984, respectively.
He is a member of the IEEE Computer Society and ACM.

Orran Krieger
=============
Orran Krieger completed his PhD in electrical and computer engineering at the
University of Toronto in 1994. His research interests include
operating systems, file systems, and multiprocessors.

Krieger received a BASc from the University of Ottawa in 1985 and an
MASc from the University of Toronto in 1989, both in electrical
engineering.

Bounds on Memory Bandwidth in Streamed Computations

Sally A. McKee, Wm. A. Wulf and Trevor C. Landon
[view first page | fetch full paper | conference presentation ]

Abstract:

The growing disparity between processor and memory speeds has caused memory bandwidth to become the performance bottleneck for many applications. In particular, this performance gap severely impacts stream-orientated computations such as (de)compression, encryption, text searching, and scientific (vector) processing. This paper looks at streaming computations and derives analytic upper bounds on the bandwidth attainable from a class of access reordering schemes. We compare these bounds to the simulated performance of a particular dynamic access ordering scheme, the Stream Memory Controller (SMC). We are building the SMC, and where possible we relate our analytic bounds and simulation data to the simulation performance of the hardware. The results suggest that the SMC can deliver nearly the full attainable bandwidth with relatively modest hardware costs.

Keywords: memory bandwidth, vector processing, access ordering, data prefetching, performance evaluation

StarT-NG: Delivering Seamless Parallel Computing

Derek Chiou, Boon S. Ang, Robert Greiner, Arvind, James C. Hoe, Michael J. Beckerle, James E. Hicks and Andy Boughton
[view first page | fetch full paper ]

Abstract:

StarT-NG is a joint MIT-Motorola project to build a high-performance message passing machine from commercial systems. Each {\em site} of the machine consists of a PowerPC 620-based Motorola symmetric multiprocessor (SMP) running the AIX 4.1 operating system. Every processor is connected to a low-latency, high-bandwidth network that is directly accessible from user-level code. In addition to fast message passing capabilities, the machine has experimental support for cache-coherent shared memory across sites. When the machine requires memory to be kept globally coherent, one processor on each site is devoted to supporting shared memory. When globally coherent shared memory is not required, that processor can be used for normal computation tasks. StarT-NG will be delivered at about the time the base SMP is introduced into the marketplace. The ability to be both a collection of standard SMP and an aggressive message passing machine with coherent shared memory makes StarT-NG a good building block for incrementally expandable parallel machines.

Keywords: parallel architecture, message passing, coherent sharedmemory, SMP, PowerPC

Information from authors:

The StarT-NG project is a collaborative effort between the Computation Structures Group of the Laboratory for Computer Science at MIT and the Motorola Computer Group. Derek Chiou, Boon S. Ang and James C. Hoe are graduate students at MIT and are supervised by Arvind. Andy Boughton is a staff member in CSG. Bob Greiner and Jamey Hicks are employees of Motorola while Mike Beckerle was formerly an employee of Motorola who now works for Applied Parallel Technologies, Inc.

Costs and Benefits of Multithreading with Off-the-Shelf RISC Processors

Olivier C. Maquelin, Herbert H.J. Hum and Guang R. Gao
[view first page | fetch full paper | conference presentation ]

Abstract:

Multithreaded architectures have been proposed for future multiprocessor systems due to their ability to cope with network and synchronization latencies. Some of these architectures depart significantly from current RISC processor designs, while others retain most of the RISC core unchanged. However, in light of the very low cost and excellent performance of off-the-shelf microprocessors it seems important to determine whether it is possible to build efficient multithreaded machines based on unmodified RISC processors, or if such an approach faces inherent limitations. This paper describes the costs and benefits of running multithreaded programs on the EARTH-MANNA system, which uses two Intel i860 XP microprocessors per node.

Keywords: multiprocessors, multithreaded architectures, performance measurements, EARTH, MANNA

Information from authors:

The authors of this paper:
Dr. Olivier Maquelin, Prof. Herbert H.J Hum and Prof. Guang R. Gao are members of the EARTH project, which groups researchers from the ACAPS Lab at McGill University and from the Department of Electrical and Computer Engineering at Concordia University. The aim of the EARTH project as a whole is to investigate compiler techniques and architectural features that are required for better supporting parallel computing in future high-performance architectures. This ranges from parallel language and algorithm design, over automatic parallelizing compilers for high-level languages, benchmarking, emulation and simulation of parallel machines, down to the production of an EARTH hardware prototype.

Semantics and Tools

(Tuesday 14:00-16:00)

Transformation Techniques in Pei

S. Genaud, E. Violard and G.-R. Perrin
[view first page | fetch full paper | conference presentation ]

Abstract:

This article presents an example of program transformation strategies in the language {\sc Pei} using several strategies : a simplification of the communications, the introduction of broadcasts by removing recursion from data field definitions, and the introduction of a reduction operator.

These transformations emphasize the relationships between several programs solving a given problem, especially in the data parallelism area.

Keywords: Abstraction, Broadcast, Communication, Data-parallelism, Reduction, Refinement

Information from authors:

PEI & VPEI

The PEI project defines a new equational style framework for data parallel programming. It lies on a small but powerful set of primitives and induces a refinement calculus to design or transform programs. Expressions and transformations can be carried out by using an interactive environment which supplies symbolic evaluations. Current works concern the foundations of PEI and practical developments as well, such as a compiler to a data parallel language.

On the Completeness of a Proof System for a Simple Data-parallel Programming Language

Luc Bougé and David Cachera
[view first page | fetch full paper ]

Abstract:

We prove the completeness of an assertional proof system for a simple loop-free data-parallel language. This proof system is based on two-part assertions, where the predicate on the current value of variables is separated from the specification of the current extent of parallelism.

The proof is based on a Weakest Precondition (WP) calculus. In contrast with the case of usual scalar languages, not all WP can be defined by an assertion. Yet, partial definability suffices to prove the completeness thanks to the introduction of hidden variables in assertions. The case of data-parallel programs with loops is briefly discussed in the conclusion. The full version is available as LIP Research Report RR94-42 (URL http://www.ens-lyon.fr/~bouge/Info).

Keywords: Concurrent Programming, Specifying and Verifying and Reasoning about Programs, Semantics of Programming Languages, Data-Parallel Languages, Proof System, Hoare Logic, Weakest Preconditions.

Information from authors:

Luc Bougé defended his Habilitation Thesis in 1987 at University Paris 7 under the supervision of Prof. Maurice Nivat and Prof. Krzysztof Apt. He worked as a full-time CNRS researcher at LIENS, ENS Ulm, Paris until 1990 on the semantics of CSP-like parallel languages, and then got a position as a Professor at the LIP, ENS Lyon. There, he launched a research project on the design and implementation of data-parallel languages. A general position paper is The data-parallel programming model: a semantic perspective. Luc Bougé is now serving as the co-chairman of the National CNRS Research Program on Parallelism, Networks and Systems.
David Cachera received his Master Degree at ENS Lyon in 1994. He is currently working for a PhD under the supervision of Luc Bougé at LIP. His work concerns the semantics of data-parallel languages, and its application to formal program validation.
The full version of the EuroPar Paper (17 pages) is available: get it? . To get the list of all material available on-line as LIP Research Reports, please click here..

An Implementation of Race Detection and Deterministic Replay with MPI

C. Clémençon, J. Fritscher, M.J. Meehan and R. Rühl
[view first page | fetch full paper ]

Abstract:

The Parallel Debugging Tool (PDT) of the Annai programming environment is developed within the Joint CSCS-ETH/NEC Collaboration in Parallel Processing. Like the other components of the integrated environment, PDT aims to provide support for application developers to debug portable large-scale data-parallel programs based on HPF and message-passing programs based on the MPI standard. PDT supports MPI event tracing for race detection and deterministic replay for manually parallelized MPI programs as well as for code generated with the advanced techniques of a data-parallel compiler. This paper describes the tracing and replaying mechanisms included in PDT as well as their efficiency by presenting execution time overheads for several benchmark programs running on the NEC Cenju-2/3 distributed-memory parallel computers.

Keywords: Race Detection, Replay, Determinism, Distributed Memory, Message Passing, MPI, HPF, Parallel Debugger, Integrated Programming Environment

Information from authors:

CSCS (Centro Svizzero di Calcolo Scientifico) home page: http://www.cscs.ch

Formal and Experimental Validation of a Low Overhead Execution Replay Mechanism

Alain Fagot and Jacques Chassin de Kergommeaux
[view first page | fetch full paper | conference presentation ]

Abstract:

This paper presents a mechanism for record-replay of parallel programs written in a remote procedure call (RPC) based parallel programming model. This mechanism, which will serve as a basis for implementing a user-level debugger, exploits some properties of the programming model to limit drastically the number of records that need to be done. A formal proof of the equivalence between recorded and replayed executions is given. Systematic measurements of the time overhead of the recording indicate that it is sufficiently low for the recording mode to be considered as normal execution mode. Similar techniques can be applied to other programming models.

Keywords: Instant Replay, parallel debugging, deterministic reexecutions, Remote Procedure Call.

Information from authors:

The work described in this paper is part of the PhD research of
Alain Fagot. It is done in the framework of the APACHE project whose aim is to design and implement a parallel programming environment for parallel computers, providing both static and dynamic load balancing facilities.

The authors can be reached by s-mail at:
IMAG, APACHE project
46 avenue Félix Viallet,
F-38031 Grenoble Cedex 1, France.

and by e-mail at:
Alain.Fagot@imag.fr
Jacques.Chassin@imag.fr

Interconnection Networks I

(Tuesday 14:00-16:00)

On Efficient Embedding of Grids into Grids in PARIX

T. Römke, M. Röttger, U.-P. Schroeder, J. Simon
[view first page | fetch full paper | conference presentation ]

Abstract:

A hardware independent method of programming a massively parallel machine (MPP) can best be supported by a well-designed run-time environment. An important problem in this design is the ability of efficiently simulating networks different from the hardware topology. We will describe the mapping kernel of the virtual processors library for the commercial run-time system PARIX. This kernel contains description classes for several topologies (so- called virtual topologies) and implementations of respective embeddings which map given instances of virtual topologies onto others or onto the hardware. Using these functions, PARIX is able to establish concrete virtual topologies with corresponding communication channels. The implemented functions were selected with respect to the well-known criteria for graph embeddings: equal load and small dilation. Additionally, we focus on fast distributed computation and universal applicability. As an example, we will show new methods for efficiently embedding an arbitrary 2-dimensional grid as a guest graph into any 2- dimensional grid as a host graph.

Keywords: parallel run-time system, PARIX, virtual processors, embedding, grids

Information from authors:

Thomas Römke and Jens Simon,
Paderborn Center for Parallel Computing

Markus Röttger and Ulf-Peter Schoeder,
Efficient Use of Massively Parallel Systems, AG Monien,
Department of Mathematics and Computer Science,
University of Paderborn

Optimal Emulation of Meshes on Meshes of Trees

Alf-Christian Achilles
[view first page | fetch full paper ]

Abstract:

Many problems can be solved more efficiently on a mesh of trees network than on a mesh. Until now it has been an open problem whether the mesh of trees is always at least as fast as the mesh. In this paper, we present an emulation of N-node meshes on O(N)-node meshes of trees with constant slowdown, even though any embedding of a mesh into a mesh of trees requires dilation Omega(log N). This demonstrates that the mesh of trees is strictly more powerful than the mesh. As an application, we show how to construct an optimal O(sqrt(N)) sorting algorithm for the mesh of trees that improves on the best previously known algorithm by a logarithmic factor.

Keywords: interconnection networks, mesh, mesh of trees, emulation, sorting

Information from authors:

Alf-Christian Achilles is working towards his PhD at the Computer Science Department at the University of Kalrsruhe.

Optimal Embeddings in the Hamming Cube Networks

Sajal K. Das and Aisheng Mao
[view first page | fetch full paper ]

Abstract:

This paper studies network embeddings in the Hamming cubes, a recently designed interconnection topology for multicomputers. The Hamming cube networks are supergraphs of incomplete hypercubes such that the additional edges form an extra binomial spanning tree. The recursively constructible and unit incremental Hamming cubes have better properties than hypercubes, including half of logarithmic diameter and higher fault-tolerance. They also support simple routing and efficient broadcasting schemes.

In this paper, we show that Hamiltonian paths and cycles of all lengths, complete binary trees and their several variants are subgraphs of Hamming cubes. Our embeddings have both dilation and expansion equal to one. Furthermore, taking advantage of the enhanced edges in the Hamming cubes, tree machines can be embedded with dilation of one and expansion of 7/6. Thus, Hamming cubes provide embeddings at a lower cost than (incomplete) hypercubes of the same size.

Keywords: Network embedding, dilation, interconnection network,Hamming cube, incomplete hypercube, binary tree, hypertree, tree machine.

Information from authors:

Authors: Sajal K. Das and Aisheng Mao

Affiliation Address: Department of Computer Science
University of North Texas
P. O. Box 13886
Denton, TX 76203-6886
USA

About the Authors:

Dr. Sajal K. Das received the B. Tech. degree in 1983 from Calcutta University,
the M. S. degree in 1984 from the Indian Institute of Science, Bangalore, and
the Ph.D. degree in 1988 from the University of Central Florida, Orlando, all
in Computer Science. Currently he is an Associate Professor of Computer Science
at the University of North Texas, Denton, and also a faculty member of the
Center for Research in Parallel and Distributed Computing. He is a recipient
of the Cambridge Nehru Scholarship and an Honor Professor Award at UNT for best
teaching and scholarly research. Dr. Das was invited to visit The Leonardo
Fibonacci Institute in Trento, Italy, and the Slovak Academy of Sciences in
Bratislava, Slovakia, in 1994. His current research interests include the design
and analysis of parallel algorithms, parallel data structures, parallel discrete
event simulation, applied graph theory and combinatorics, multiprocessor
interconnection networks and embeddings, and mobile computing. He has published
over 80 papersin these areas in refereed journals and international conference
proceedings.

Dr. Das serves on the Editorial Boards of {\em Parallel Processing Letters}
(World Scientific Pub), the {\em Journal of Parallel Algorithms and Applications}
(Gordon and Breach), and the {\em CD-ROM Journal of Computing}.
He has been a member of Program Committees for several international conferences
and a Co-Editor of the {\em Journal of Computer and Sofware Engineering} special
issue on Parallel Algorithms and Architectures. Dr. Das is also a member of the
IEEE Computer Society and the ACM.

Mr. Aisheng Mao received the M.S. degree in Computer Science in 1992 from the
University of North Texas, Denton. Curently he is a Ph.D. student in computer
science and working on his thesis. His research interests include multicomputer
architectures and algorithms, interconnection network topologies, network
embeddingss, and telecommunications.

Hierarchical Adaptive Routing Under Hybrid Traffic Load

Ziqiang Liu
[view first page | fetch full paper ]

Abstract:

Adaptive routing can improve the network performance and the fault-tolerance by providing multiple routing paths. However, the implementation complexity of adaptive routing can be significant, discouraging its use in commercial massively parallel systems. In this paper, we introduce Hierarchical Adaptive Routing (HAR), a new adaptive routing framework, which provides a unified framework for simple and high performance fully adaptive deadlock-free wormhole routing. HAR divides the physical network into several level of virtual networks. There is one connection channel between two adjacent virtual networks, that allows blocked packets in the higher level to move to the lower level. Different routing algorithms can be used in each virtual network, and the overall network is deadlock-free, if the routing algorithm in the lowest level virtual network is deadlock-free. However, the routing algorithm in any other virtual network can be fully adaptive, even nonminimal routing to increase performance. HAR has three advantages: fully adaptive deadlock-free routing in any non-wrapped and wrapped k-ary n-cube network with 2 and 3 virtual channels respectively, relative small crossbars and applicability to a wide variety of network topologies. Detailed implementation and simulation studies of a HAR for 2D mesh networks are presented.

Keywords: Routing Networks, Adaptive Routing, Deadlock Prevention, Wormhole Routing

Parallel Algorithms I

(Tuesday 16:30-18:00)

Tight Bounds on Parallel List Marking

Sandeep N. Bhatt, Gianfranco Bilardi, Kieran T. Herley, Geppino Pucci and Abhiram G. Ranade
[view first page | fetch full paper ]

Abstract:

The list marking problem involves marking the nodes of an $\ell$-node linked list stored in the memory of a $(p, n)$-PRAM, when only the location of the head of the list is initially known. Under the assumption that memory cells containing list nodes bear no distinctive tags distinguishing them from other cells, we establish an $\Omega(\min \{ \ell, n/p \})$ randomized lower bound for $\ell$-node lists and present a deterministic algorithm whose running time is within a logarithmic additive term of this bound. In the case where list cells are tagged in a way that differentiates them from other cells, we establish a tight $\Theta(\min\{\ell, \ell/p+ \sqrt{(n/p)\log n}\})$ bound for randomized algorithms.

Keywords: parallel processing, list algorithms, randomized algorithms,lower bounds

Optimization of PRAM-Programs with Input-Dependent Memory Access

Welf Löwe
[view first page | fetch full paper ]

Abstract:

There exist transformations of PRAM programs with predictable communication behavior to existing architectures. We extend the class of tractable programs to those with communication depending on the input. First, we define this class of programs. Second, we give source code transformations to simplify the programs and to eliminate indirect addresses and conditionals. Third, we show how to derive the communication behavior statically. Fourth, we show how to compute the mapping at compile time. Finally, we give upper time bounds for execution on existing architectures.

Keywords: PRAM, LogP, Program Optimization, Program Transformation

Information from authors:

Welf Loewe: Machine-Independent Parallel Programming

Today, there is a great variety on parallel algorithms for shared-memory architectures, mainly the PRAM. However, the PRAM model does not take into account properties of realistic architectures. Recently, Culler et al. defined a new more realistic machine model which reflects better the practical behaviour of massively parallel computers. Their LogP model differs from the PRAM in the following points. First, the synchronous execution is dropped. Instead all processors perform their computation asynchronously. Second, there is no shared memory assumption. Instead they consider a communication latency, communication overhead and network bandwith. Finally, the number of processors is fixed and cannot increase with the problem size.

We focus on transformation of parallel algorithms and programs on the PRAM to equivalent LogP-programs and on optimization of the resulting programs.

Publications of mine on the above topic and further information on our institution can be optained via my home page.


Welf Loewe's home page

Optimal Circular Arc Representations

Lin Chen
[view first page | fetch full paper ]

Abstract:

We investigate some properties of minimal interval and circular-arc representations and give several optimal parallel recognition and construction algorithms. We show that, among other things, given an $s \times t$ interval or circular arc representation matrix, * deciding if the representation is minimal can be done in $O(\log s)$ time with $O(st/\log s)$ EREW PRAM processors, or in $O(1)$ time with $O(st)$ Common CRCW PRAM processors; * constructing a minimum interval representation can be done in $O(\log (st))$ time with $O(st/\log (st))$ EREW PRAM processors, or in $O(\log t/\log\log t)$ time with $O(st\log\log t/\log t)$ Common CRCW PRAM processors, or in $O(1)$ time with $O(st)$ BSR processors.

Keywords: parallel algorithms, time and work optimality, minimal interval and circular arc representations, recognition and construction

Cache Systems

(Tuesday 16:30-18:00)

Exploiting Parallelism in Cache Coherency Protocol Engines

Andreas Nowatzyk, Gunes Aybay, Michael Browne, Edmund Kelly, Michael Parkin, Bill Radke and Sanjay Vishin
[view first page | fetch full paper | conference presentation ]

Abstract:

Shared memory multiprocessors are based on memory models, which are precise contracts between hard- and software that spell out the semantics of memory operations. Scalable systems implementing such memory models rely on cache coherency protocols that use dedicated hardware. This paper discusses the design space for high performance cache coherency controllers and describes the architecture of the programmable protocol engines that were developed for the S3.mp shared memory multiprocessor. S3.mp uses two independent protocol engines, each of which can maintain multiple, concurrent contexts so that maintaining memory consistency does not limit the system performance. Programmability of these engines allows support of multiple memory organizations, including CC-NUMA and S-COMA.

Keywords: sistributed shared memory, protocol engines, cache Coherency, simple COMA

Information from authors:

Our WWW page is http://playground.sun.com/pub/S3.mp/s3mp.html

Verifying Distributed Directory-based Cache Coherence Protocols: S3.mp, a Case Study

Fong Pong, Andreas Nowatzyk, Gunes Aybay and Michel Dubois
[view first page | fetch full paper ]

Abstract:

Abstract. This paper presents the results for the verification of the S3.mp cache coherence protocol. The S3.mp protocol uses a distributed directory with limited number of pointers and hardware supported overflow handling that keeps processing nodes sharing a data block in a singly linked list. The complexity of the protocol is high and its validation is challenging because of the distributed algorithm used to maintain the linked lists and the non-FIFO network. We found several design errors, including an error which only appears in verification models of more than three processing nodes, which is very unlikely to be detected by intensive simulations. We believe that methods described in this paper are applicable to the verification of other linked list based protocols such as the IEEE Scalable Coherent Interface.

Keywords:

Efficient Software Data Prefetching for a Loop with Large Arrays

Se-Jin Hwang and Myong-Soon Park
[view first page | fetch full paper ]

Abstract:

Large array can bring unfavorable effects on software data prefetching. Therefore, it worth to be considered carefully to get the good effectiveness of data prefetching. This paper proposes a software data prefetching mechanism that considers the the size of array and the size of cache along with the behaviour of accesses to the array. We describe on the loss of data prefetching effectiveness in respects of the reuse property and the number of prefetching instructions.

We also propose loop reorganizing algorithm to insert prefetching instructions into a nested loop by using proposed prefetching method. Also, we implemented a preprocessor, (we call it as LOOP), using the proposed algorthms in this paper, and show that our prefetching method improves the execution speed of experimented loops.

Keywords: software data prefetching, cache size, preprocessor

Information from authors:

Since the deciency of man-power in my lab, and the poor environments,
I cannot help making the preprocessor, LOOP, only for myself. It has been
a very lone work. Anyway, I have done my efforts to refine it and made it
be more extensible, and I will do it in the future.

Some guys in my lab call it as 'PISH'(Poor sImulation System under
Highering). In fact, the name of 'LOOP' is not fit to my preprocessor.
Total source code is about 10,000 lines except for comments.
The frame structure is very rough. The day spent on implementing 'LOOP'
is about 20 days. Anyway, what I can firmly say is....
It can guarantee the correct result of simulation of simple loop,
in respects of execution cycles, cache hit ratio etc. Futhermore,
the usage of it is very simple.
In current, I am building an hardware description language for 'LOOP'.
The language is defined with a reference to VHDL. Surely, it is
,to large extent, inferior to VHDL. However, it can be used in simple level.
All the prerequisites to use it are the ability of describing the
state-machine using automaton.
Since the aim of 'LOOP' is to assist the work for compiler system,
the language for describing the hardware structure is very simplified.

Loop Parallelization

(Wednesday 10:30-12:30)

Generation of Synchronous Code for Automatic Parallelization of while Loops

Martin Griebl and Jean-François Collard
[view first page | fetch full paper ]

Abstract:

Automatic parallelization of imperative programs has focused on nests of for loops with affine bounds and affine dependences, because in this case execution domains and dependences can precisely be known at compile-time. When dynamic control structures, such as while loops, are used, existing methods for conversion into single-assignment form and domain scanning are inapplicable.

This paper gives an algorithm to automatically generate data-parallel codes, together with an algorithm to possibly convert the program into single-assignment form.

Keywords: automatic parallelization, while loops, code generation, data parallelism

Information from authors:

Authors:

Implementing Flexible Computation Rules with Subexpression-level Loop Transformations

Dattatraya Kulkarni, Michael Stumm and Ronald C. Unrau
[view first page | fetch full paper | conference presentation ]

Abstract:

Computation Decomposition and Alignment (CDA) is a new loop transformation framework that extends the linear loop transformation framework and the more recently proposed Computation Alignment frameworks by linearly transforming computations at the granularity of subexpressions. It can be applied to achieve a number of optimization objectives, including the removal of data alignment constraints, the elimination of ownership tests, the reduction of cache conflicts, and improvements in data access locality.

In this paper we show how CDA can be used to effectively implement flexible computation rules with the objective of minimizing communication and, whenever possible, eliminating intrinsics that test whether computations need to be executed or not. We describe CDA, show how it can be used to implement flexible computation rules, and present an algorithm for deriving appropriate CDA transformations.

Keywords: Loop transformations, HPF, SPMD code generation

Information from authors:

http://www.eecg.toronto.edu/~kulki
http://www.eecg.toronto.edu/EECG/RESEARCH/ParallelSys

http://www.eecg.toronto.edu/~kulki
http://www.eecg.toronto.edu/EECG/RESEARCH/ParallelSys

Synchronization Migration for Performance Enhancement in a DOACROSS Loop

Rong-Yuh Hwang
[view first page | fetch full paper ]

Abstract:

An efficient technique for migrating the synchronization operations is proposed. This technique rewrites the original statements, Send_Signal(S) to be moved up and Wait_Signal(S) to be moved down, or rearranges the sequence of statements or the position of array element in a synchronization region by data dependence analysis to migrate the synchronization operations. Theorems show that the migration of synchronization operation can reduce the parallel execution time of loop significantly. Perfect benchmarks are employed to measure the system performance after migration. Experimental result shows that the enhancement is very significant.

Keywords: Code migration, Data dependence, LBD, LFD, Synchronization, Synchronization region

An Array Partitioning Analysis for Parallel Loop Distribution

Marc Le Fur, Jean-Louis Pazat and Françoise André
[view first page | fetch full paper | conference presentation ]

Abstract:

This paper presents the compilation techniques implemented in a compiler for a HPF-like language. The stress is especially put on the description of an optimized scheme which is dedicated to the compilation of parallel nested loops. The generation of the SPMD code is based on the polyhedral model and allows for the partitioning of the arrays involved in the loop in order to achieve symbolic restriction of iteration domains and message aggregation. Experimental results for some well-known kernels are shown.

Keywords: Distributed Memory Parallel Computers, Loop Distribution,Domain Analysis, Restriction of Iteration Domains, Message Aggregation.

Information from authors:

http://www.irisa.fr/EXTERNE/projet/pampa/PANDORE/pandore.html

Load Balancing and Parallel Algorithms II

(Wednesday 10:30-12:30)

A Model for Efficient Programming of Dynamic Applications on Distributed Memory Multiprocessors

A. Erzmann, M. Hadeler and C. Müller-Schloer
Winner of EURO-PAR'95 Best Paper Award
[view first page | fetch full paper | conference presentation ]

Abstract:

We present the TDC programming model which aims to ease the efficient implementation of dynamic applications on distributed memory multiprocessors. This model is based on task descriptors, data objects and capabilities which reside in distinct, globally accessible domains. Dynamic load balancing will be done by the system software and is completely transparent to the user. This often leads to a significant reduction of code complexity. Our prototype of the TDC model on an 128 node nCUBE 2 uses a distributed diffusion scheme to balance load dynamically. We have developed a task selection strategy which reduces the load balancing overhead. Measuring and simulation results for a parallel implementation of a block matching algorithm indicate that runtime efficiency close to the optimum can be achieved with the TDC model even for highly parallel systems.

Keywords: Programming Model, Dynamic Load Balancing, Block Matching, Distributed Memory Multiprocessors

Information from authors:

Andreas Erzmann received his diploma in electrical engineering from the University of Hannover, Germany, in 1990 and is currently working towards the doctor's degree at the institute for computer architecture and operating systems of the University of Hannover, Germany.

Michael Hadeler is a student of electrial engineering (technical informatics) at the University of Hannover, Germany.

Prof. C. Müller-Schloer is the head of the institute for computer architecture and operating systems of the University of Hannover, Germany.

Further information about the authors, their institution can be obtained from our WWW-Server.

Efficient Solutions for mapping parallel programs

P. Bouvry, J. Chassin de Kergommeaux and D. Trystram
[view first page | fetch full paper ]

Abstract:

This paper describes a mapping toolbox, whose aim is to optimize the execution time of parallel programs described as task graphs on distributed memory parallel systems. The toolbox includes several classical mapping algorithms. It was assessed by computing the mapping of randomly generated task graphs and by mapping and executing on a parallel system synthetic programs representing some classical numerical algorithms. A large number of experiments were used to validate the cost functions used in the toolbox and to compare the algorithms.

Keywords: Parallel environment, Load-balancing, Mapping.

Information from authors:

The work described in this paper was part of the PhD research of
Pascal Bouvry. It was done in the framework of the APACHE project whose aim is to design and implement a parallel programming environment for parallel computers, providing both static and dynamic load balancing facilities.

The authors can be reached by s-mail at:
IMAG, APACHE project
46 avenue Félix Viallet,
F-38031 Grenoble Cedex 1, France.

and by e-mail at:
Pascal.Bouvry@cwi.nl

Jacques.Chassin@imag.fr

Denis.Trystram@imag.fr

Optimal Data Distributions for LU Decomposition

Thomas Rauber and Gudula Rünger
[view first page | fetch full paper ]

Abstract:

The paper considers the well--known problem of LU decomposition to study a method to derive data distributions for parallel computers with a distributed memory organization. The importance of the paper lies not so much in the special application but with the principle that the problem of finding an optimal data distribution is formulated as an optimization problem. This is possible by using a parameterized data distribution and a rigorous performance prediction technique that allows us to derive runtime formulas containing the parameters of the data distribution. The parameters are determined % by a mathematical optimization technique in such a way that the total runtime is minimized, thus also minimizing the communication overhead and the load imbalance penalty.

Keywords: LU decomposition, data distribution, parallel computers, performance prediction, computation model

Detecting Quantified Global Predicates in Parallel Programs

Mark Minas
[view first page | fetch full paper | conference presentation ]

Abstract:

Global predicates in parallel programs are predicates considering the state of more than one process. They are a useful concept for debugging parallel programs, e.g., for specifying assertions or breakpoints. In this paper $\exists$-predicates are defined and examined, a restricted class of global predicates. $\exists$-predicates are defined by two local predicates which have to be simultaneously satisfied by two different processes. Such predicates are frequently needed to express synchronization properties. Efficient centralized and parallel algorithms for detecting satisfaction of $\exists$-predicates are proposed. Furthermore, it is outlined how $\exists$-predicates can be used for global breakpoints and where to stop a parallel program reaching such a breakpoint. The underlying machine models is a fixed set of processes communicating by message passing or shared memory.

Keywords: Global predicates, testing, debugging, parallel debugger

Information from authors:

Mark Minas

Compiling Techniques

(Wednesday 14:00-16:00)

Using Knowledge-Based Techniques for Parallelization on Parallelizing Compilers

Chao-Tung Yang, Shian-Shyong Tseng, Cheng-Der Chuang and Wen-Chung Shih
[view first page | fetch full paper ]

Abstract:

In this paper we propose a knowledge-based approach for solving data dependence testing and loop scheduling problems. A rule-based system, called the K test, is developed by repertory grid and attribute ording table to construct the knowledge base. The K test chooses an appropriate testing algorithm according to some features of the input program by using knowledge-based techniques, and then applies the resulting test to detect data dependences for loop parallelization. Another rule-based system, called the KPLS, is also proposed to be able to choose an appropriate scheduling by inferring some features of loops and assign parallel loops on multiprocessors for achieving high speedup. The experimental results show that the graceful speedup obtained by our compiler is obvious.

Keywords: Parallelizing compiler, data dependence testing,loop parallelization, parallel loop scheduling, knowledge-based, repertory grid analysis, speedup.

Automatic Vectorization of Communications for Data-parallel Programs

Cécile Germain and Franck Delaplace
[view first page | fetch full paper ]

Abstract:

Optimizing communication is a key issue in compiling data-parallel languages for distributed memory architectures. We examine here the case of cyclic distribution, and we derive symbolic expressions for communication sets under the only assumption that the initial parallel loop is defined by affine expressions of the indices. This technique relys on unimodular changes of basis and provides a fully explicit SPMD code, wich is efficient as it does not include any integer divide.

Analysis of the properties of communications leads to a tiling of the local memory addresses that provides maximal message vectorization in the most general case. Many actual program present some pecularities simplifying the compilation process and the output code. We show that our output code regularly improves with the simplicity of the input code

Keywords: Data-Parallelism, SPMD, Compilation, Code Generation, Message Vectorization,

Information from authors:

Cécile Germain is an Assistant Professor at Université Paris-Sud.

She is with the Parallel Architecture group in the Laboratoire de Recherche en Informatique .

Her research interests include interconnection networks for parallel architectures, randomized routing, parallel computing, and compilers for parallel architectures.

Contact and Information

  • Bibliography
  • e-mail : cecile@lri.fr
  • Office :LRI bat 490, Universite Paris-Sud, F-91405 ORSAY Cedex FRANCE

  • Franck Delaplace is an assistant Professor at University of Evry (France)

    He is with the Parallel Research Team.
    His research interests are :

    Contact and Information :
    email : delapla@lami.univ-evry.fr

    Université d'Evry Val d'Essonne
    Boulevard des Coquibus 91 000 Evry

    The Program Compaction Revisited: the Functional Framework

    Marc Pouzet
    [view first page | fetch full paper ]

    Abstract:

    This paper presents a general method to compact the first-order part of functional languages with call-by-value semantics for fine-grain parallel machines like VLIW or super-scalars. This work extends previous works on compaction in two ways. First, it defines a new formal system for the compaction problem usable to design a meta-compiler for these machines. Second, the compaction is directly applied to functional expressions instead of graph based representations (control flow or dependence flow based representations) leading to a very uniform and simple presentation.

    Keywords: fine-grain parallelization, scheduling, software-pipelining, functional languages

    Featherweight Threads and ANDF Compilation of Concurrency

    Ben Sloman and Tom Lake
    [view first page | fetch full paper | conference presentation ]

    Abstract:

    We present an intermediate representation called ThreadTDF, a component of the Parallel TDF system for compiling distributed concurrent programs to shared and distributed memory multiprocessors. ThreadTDF is a parallel extension of the TDF architecture neutral distribution format (ANDF) for sequential programs. ThreadTDF provides {\em featherweight} thread mechanisms for explicitly scheduling dynamic fine-grain concurrent computations within procedures (and more generally within static local scopes). Communication between address spaces is supported by remote service request mechanisms based upon asynchronous activation of remote threads and synchronous remote procedure calls. In ThreadTDF variable lifetimes bound the lifetimes of featherweight threads declared in their scope. We show how a compiler uses thread lifetime information to integrate resource allocation and communication with thread scheduling for efficient intraprocedural concurrency. Initial performance results are given for the SPARC processor.

    Keywords: concurrency, compilation, ANDF, TDF, intermediate representation, thread, hierarchical scheduling, context switch

    Information from authors:

    Ben Sloman is a postgraduate student studying compilation for parallel
    machines at Reading University (UK) and at GLOSSA (a SME also based in
    Reading). He has worked for the last four years with the TDF
    intermediate representation, initially investigating vectorisation of
    TDF and now looking at extensions for compiling explicit parallelism.
    His research interests include processor architecture, parallel
    systems, compiler analysis, and graph algorithms.

    Tom Lake was local organiser for CONPAR'88 and the international
    workshop series `Compilers for Parallel Computers'. He has worked with
    TDF during much of its evolution and is currently examining commercial
    applications in the fields of parallel computation, reverse engineering
    and distributed discrete event simulation.

    Applications

    (Wednesday 14:00-16:00)

    Parallel N-Body Simulation on a Large-Scale Homogeneous Distributed System

    John W. Romein and Henri E. Bal
    [view first page | fetch full paper ]

    Abstract:

    A processor pool is a homogeneous collection of processors that are used for computationally intensive tasks, such as parallel programs. Processor pools are far less expensive than multicomputers and more convenient to use than collections of idle workstations. This paper gives a case study in parallel programming on a processor pool with 80 SPARCs connected by an Ethernet, running the Amoeba distributed operating system. We use a realistic application (N-body simulation of water molecules) and show that a good performance can be obtained. We measured a speedup of 72 on 80 processors.

    Keywords: distributed system, parallel N-body simulation

    Information from authors:

    WWW Home pages: http://www.cs.vu.nl/~john
    http://www.cs.vu.nl/~bal

    Analysis of Parallel Scan Processing in Shared Disk Database Systems

    Erhard Rahm and Thomas Stöhr
    [view first page | fetch full paper ]

    Abstract:

    Shared Disk database systems offer a high flexibility for parallel transaction and query processing. This is because each node can process any transaction, query or subquery because it has access to the entire database. Compared to Shared Nothing database systems, this is particularly advantageous for scan queries for which the degree of intra-query parallelism as well as the scan processors themselves can dynamically be chosen. On the other hand, there is the danger of disk contention between subqueries, in particular for index scans. We present a detailed simulation study to analyze the effectiveness of parallel scan processing in Shared Disk database systems. In particular, we investigate the relationship between the degree of declustering and the degree of scan parallelism for relation scans, clustered index scans, and non-clustered index scans. Furthermore, we study the usefulness of disk caches and prefetching for limiting disk contention in parallel database systems. Finally, we show that disk contention in multi-user mode can be limited for Shared Disk database systems by dynamically choosing the degree of scan parallelism.

    Keywords: Parallel Database Systems, Shared Disk, Parallel Query Processing, Disk Contention, Dynamic Load Balancing, Performance Analysis

    Information from authors:

    Erhard Rahm is full professor of computer science and head of the database group at the University of Leipzig, Germany. He received his Ph.D. in 1988 from the University of Kaiserslautern, Germany, and spent a year as a Visiting Scientist at the IBM T.J. Watson Research Center in Yorktown Heights, NY. Prof. Rahm holds an US patent and is author of three books and over 40 refereed journal and conference publications. His research interests include parallel database system, dynamic load balancing, high performance transaction systems, heterogeneous databases, performance evaluation and geographic information systems.

    Thomas Stöhr received his Master's degree in Computer Science in 1993 from the University of Kaiserslautern, Germany. Currently, he is a member of the database group of Prof. Rahm at the Univ. of Leipzig. His research focus is on parallel database systems and dynamic load balancing.

    Polynomial Time Scheduling of Low Level Computer Vision Algorithms on Networks of Heterogeneous Machines

    Adam R. Nolan and Bryan Everding
    [view first page | fetch full paper ]

    Abstract:

    Defining an optimal schedule for arbitrary algorithms on a network of heterogeneous machines is an NP complete problem. This paper focuses on data parallel deterministic neighborhood computer vision algorithms. This focus enables the polynomial time definition of a schedule which minimizes the distributed execution time by overlapping computation and communication cycles on the network. The scheduling model allows for any speed machine to participate in the concurrent computation but makes the assumption of a master/slave control mechanism using a linear communication network. Several vision algorithms are presented which adhere to the scheduling model. The theoretical speedup of these algorithms is discussed and empirical data is presented and compared to theoretical results.

    Keywords: Computer Vision, Heterogeneous Architectures, Scheduling, Distributed Algorithms

    Information from authors:

    Adam's interests are:

    Bryan's interests are:

    Mapping Neural Network Back-Propagation onto Parallel Computers with Computation/Communication Overlapping

    Bernard Girau
    [view first page | fetch full paper | conference presentation ]

    Abstract:

    It is shown in [GhoshHwang89] that mapping neural networks onto existing parallel computers leads to an unsatisfactory efficiency, except for irregularly connected networks, with several distinguishable highly connected regions. This paper shows how a four step decomposition of the back-propagation algorithm allows to introduce computation/communication overlapping so as to improve any parallel mapping of differentiable feedforward neural networks. This solution can adapt to both irregular and regular feedforward neural structures. Its computational complexity is estimated for multilayered networks. Unlike most network partitioning schemes, it can deal with multilayer networks using non-standard neurons, such as wavelet networks. Unlike a pattern partitioning algorithm, it is able to implement the stochastic gradient learning algorithm. Numerical results show that this solution should be considered as soon as communication overlapping is available.

    Keywords: feed-forward neural networks, parallel implementation, back-propagation

    Information from authors:

    Bernard GIRAU is a PhD student at the LIP in Lyon, France. He is working about parallel algorithms for neural networks. He is also working about a theoretical model for differentiable feed-forward neural networks.

    Language Implementation II

    (Thursday 10:30-12:00)

    Super Monaco: Its Portable and Efficient Parallel Runtime System

    J. S. Larson, B. C. Massey and E. Tick
    Winner of EURO-PAR'95 Best Student Paper Award
    [view first page | fetch full paper | conference presentation ]

    Abstract:

    ``Super Monaco'' is the successor to Monaco, a shared-mem\-ory multiprocessor implementation of a flat concurrent logic programming language. While the system retains, by-and-large, the older Monaco compiler and intermediate abstract machine, the intermediate code translator and the runtime system have been completely replaced, incorporating a number of new features intended to improve robustness, flexibility, maintainability, and performance. There are currently two native-code backends for 80x86-based and MIPS-based multiprocessors. The runtime system, written in C, improves upon its predecessor with better memory utilization and garbage collection, and includes new features such as an efficient termination scheme and a novel variable binding and hooking mechanism. The result of this organization is a portable system\footnote{Available by anonymous ftp from {\tt ftp.cs.uoregon.edu:pub/sm.tar.gz}.} which is robust, extensible, and has performance competitive with C-based systems. This paper describes the design choices made in building the system and the interfaces between the components.

    Keywords: logic programming, parallelism, native code, runtime systems

    Information from authors:

    Jim Larson, jim@cs.uoregon.edu

    Bart Massey, bart@cs.uoregon.edu

    Bart Massey's Home Page
    Evan Tick, tick@cs.uoregon.edu
    Evan Tick's Home Page University of Oregon CIS Dept. Home Page

    Quiescence Detection in a Distributed KLIC Implementation

    Kazuaki Rokusawa, Akihiko Nakase, Takashi Chikayama
    [view first page | fetch full paper | conference presentation ]

    Abstract:

    Quiescence detection is a fundamental facility for parallel and distributed processing. This paper describes schemes for quiescence detection in a distributed KLIC implementation. KLIC is a portable implementation of concurrent logic programming language KL1.

    Termination is detected using the weighted throw counting (WTC) scheme. Based on the scheme a scheme for global suspension was invented. The postmortem system built-in predicate which provides meta programming facilities was designed, and its distributed implementation is also presented. 11~

    Keywords: concurrent logic programming, termination detection, quiescence detection, distributed implementation, weighted reference counting

    Information from authors:

    Kazuaki Rokusawa :
    Research and Development Group
    Oki Electric Industry Co., Ltd.
    rokusawa@okilab.oki.co.jp or rokusawa@icot.or.jp

    Akihiko Nakase :
    Research and Development Center
    Toshiba Corporation
    nakase@isl.rdc.toshiba.co.jp or nakase@icot.or.jp

    Takashi Chikayama :
    Department of Electronic Engineering, University of Tokyo
    chikayama@fuchi.t.u-tokyo.ac.jp or chikayama@icot.or.jp

    Compiler Optimizations in Reform Prolog: Experiments on the KSR-1 multiprocessor

    Thomas Lindgren, Johan Bevemyr and Håkan Millroth
    [view first page | fetch full paper ]

    Abstract:

    We describe the compiler analyses of Reform Prolog and evaluate their effectiveness in eliminating suspension and locking on a range of benchmarks. The results of the analysis may also be used to extract non-strict independent and-parallelism.

    We find that 90% of the predicate arguments are ground or local, and that 95% of the predicate arguments do not require suspension code. Hence, very few suspension operations need to be generated to maintain sequential semantics. The compiler can also remove unnecessary locking of local data by locking only updates to shared data; however, even though locking writes are reduced to 52% of the unoptimized number for our benchmark set, this has little effect on execution times. We find that the ineffectiveness of locking elimination is due to the relative rarity of locking writes, and the execution model of Reform Prolog, which results in few invalidations of shared cache lines when such writes occur.

    The benchmarks are evaluated on a cache-coherent KSR-1 multiprocessor with physically distributed memory, using up to 48 processors. Speedups scale from previous results on smaller, bus-based multiprocessors, and previous low parallelization overheads are retained.

    Keywords: Compilation, locality optimizations, interprocedural dataflow analysis, cache-coherent distributed memory, parallelism, Prolog

    Information from authors:

    Thomas Lindgren, Johan Bevemyr Häkan Millroth, Computing Science Department, Uppsala University. Reform Prolog project.

    Interconnection Networks II

    (Thursday 10:30-12:00)

    Bidirectional Ring: An Alternative to the Hierarchy of Unidirectional Rings

    Muhammad Jaseemuddin and Zvonko G. Vranesic
    [view first page | fetch full paper | conference presentation ]

    Abstract:

    A hierarchy of unidirectional rings has been used successfully in distributed shared-memory multiprocessors. The fixed cluster size of the hierarchy prevents full exploitation of communication locality. The bidirectional ring is presented as an alternative to the hierarchy. Its relative performance is evaluated for a variety of memory access patterns and network sizes. It gives superior performance for low communication locality and for large networks. Another useful feature of the bidirectional ring is that the network load tends to be balanced over the two constituent unidirectional rings. These features make the bidirectional ring an attractive possibility as a network structure for scalable NUMA multiprocessors.

    Keywords: ring, interconnection network, NUMA multiprocessor, communication locality, performance.

    Information from authors:

    Muhammad Jaseemuddin:

    Muhammad Jaseemuddin is a PhD candidate in the department of electrical
    and computer engineering at the University of Toronto. His research
    interests include parallel processing, distributed computing, computer
    architecture, and local area networks. He received his BS from
    NED University of Engg. and Tech., Karachi, Pakistan, in 1989,
    and MS from The University of Texas at Arlington, Texas, USA, in 1991.
    He is a student member of ACM, and IEEE.

    URL: http://www.eecg.toronto.edu/~jaseemud/

    Zvonko G. Vranesic:

    Zvonko G. Vranesic is a professor of electrical and computer engineering
    at the University of Toronto. His research interests include computer
    architecture, VLSI systems, fault-tolerant computing, local area
    networks, and many-valued switching systems. He has had a major role in the
    design of Hector and NUMAchine multiprocessors. He was a senior
    visitor at the Computer Laboratory at Cambridge University, England, and
    at the Institute of Programming at the University of Paris.

    Vranesic received his BS, MS, and PhD from the University of Toronto in
    1963, 1966, and 1968, resprectively. He is a member of the Association
    of professional Engineers of Ontario and senior member of the IEEE.

    URL: http://www.eecg.toronto.edu/EECG/EECGProfStaff.html

    A Formal Study of the Mcube Interconnection Network

    Nitin K. Singhvi and Kanad Ghose
    [view first page | fetch full paper | conference presentation ]

    Abstract:

    The Mcube network has been previously proposed by us as a highly recursive and symmetrical interconnection network based on twisted links. The Mcube topology has been developed and defined in terms of the structural constraints between components rather than relations between binary strings because one of the chief goals of the construction is structural symmetry. The salient features of the Mcube interconnection include a uniform distance distribution, almost half the diameter of a comparable hypercube, a lower average internode distance than most other twisted networks, uniform traffic flow through every node, a lower link saturation rate than hypercubes under uniformly random traffic, good recursive partitionability and the ability to execute several parallel algorithms as fast as or faster than a hypercube. We establish the node distribution for the Mcubes to be uniform with the internode distance. Based on the constraints imposed by the construction and the routing algorithm, we derive an expression for the average internode distance for Mcubes. We also introduce an efficient broadcasting algorithm for the Mcube interconnection and show how Mcubes can be reconfigured into hypercubes. Finally, we show how representative parallel algorithms can be directly implemented on the Mcube topology.

    Keywords: hypercube, interconnection network, massively parallel, twisted cube

    Information from authors:

    Nitin K. Singhvi is a Ph.D. student in the department of Computer Science at
    the State University of New York at Binghamton. His research interests include
    interconnection networks, WDM-based distributed shared memory systems, cache
    coherence protocols and task scheduling.

    Kanad Ghose is an associate professor in the Department of Computer Science
    at the State University of New York at Binghamton. His research interests
    are in the area of high-performance architectures, parallel systems and
    compiling techniques. Some of the research projects he is currently leading
    include the following:

    SNOW: Hardware-supported shared memory over a network of workstations: prototype
    design and implementation of a cluster of workstations (NOW) with hardware
    support for shared memory coherence to allow a network of workstations to be
    used as a tightly-coupled system. The current SNOW prototype consists of eight
    4-CPU Sparc 20 nodes, FPGA device based high data rate custom SBus network interface
    cards, running a modified and extended version of Solaris 2.4.
    A fundamental goal of this project is to provide low end-to-end latencies
    to promote very tight coupling among the network nodes. Our use of
    programmable network interface cards allow for experimentation with degree and nature
    of support for shared memory, including support for coherent caches and barrier
    synchronization. This work is primarily supported by the NSF.

    Prototype Design & Implementation of a Signal Processing Multiprocessor: This
    work involves the design and implementation of a 64-node DSP multiprocessor
    using a novel reconfigurable interconnection. This work is sponsored by
    the US Air Force.

    Architectural Techniques for Low Power: This project examines and proposes
    architectural techniques for reducing power dissipation in high performance
    microprocessors. This work is supported by the NSF.

    Scheduling Tool for Distributed Memory Multiprocessors: This work is supported
    by CAE-Link corporation.

    A MPP-Based video Sever: This project involves the design and implementation
    of a video server based on a 512-node massively parallel processor. This
    project is supported by the State of New York and ForeFronts Inc. through
    the SPIR program.

    A Toolkit for Distributed Real-Time Scheduling: This project involves the
    design and implementation of a X-windows based toolkit for analyzing the
    performance of distributed real-time applications. It also allows users to
    verify timing requirements and experiment with a variety of real-time
    scheduling algorithms, both global and local. This tool accurately models
    delays in system calls, delays in the interconnections, scheduling overhead
    etc. The system incorporates a simple language and its compiler to allow
    users to define the characteristics of the system and the tasks. In addition
    to providing canned descriptions of standard interconnections (Futurebus II,
    Ethernet etc.), users can specify any arbitrary interconnection. This
    project is co-directed by Prof. S. Aggarwal and is supported by Loral Federal
    Systems, CAE-Link and NYSEG.

    Multiwave Interconnection Networks for MCM-Based Parallel Processing

    Shinichi Shionoya, Takafumi Aoki and Tatsuo Higuchi
    [view first page | fetch full paper ]

    Abstract:

    This paper presents multiwave interconnections -- optical interconnections that employ wavelength components as multiplexable information carriers -- for next-generation multiprocessor systems using MCM technology. A hypercube-based multiprocessor network called the multiwave hypercube (MWHC) is introduced, where multiwave interconnections provide highly-flexible dynamic communication channels. A performance analysis shows that the use of integrated multiwave optics enables the reduction of network complexity on a MCM substrate, while supporting low-latency message routing.

    In this paper, we also present the experimental fabrication of wavelength detectors for the proposed system, and discuss the physical limit on the number of wavelength components at the present state of technology.

    Keywords: Multichip module (MCM), opticalinterconnections, wavelength division multiplexing (WDM), interconnection networks, parallel processing, message-passing multiprocessors

    Information from authors:

    Shinichi Shionoya is currently a student in the Graduate
    School of Information Sciences at Tohoku University, Japan.

    His research interests include multi-wavelength optical
    computing.

    He received B. E. degree from Tohoku University in 1994.
    He is a student member of the Institute of Electronics,
    Information and Communication Engineers of Japan.

    Takafumi Aoki is a research associate in the Graduate School
    of Information Sciences at Tohoku University, Japan. His
    research interests include the VLSI architectures based on
    multiple-valued logic, multi-wavelength optical computing
    and biomolecular computing.

    He received the Outstanding Paper Award at the 1989 IEEE
    International Symposium on Multiple-Valued Logic and the
    Outstanding Transactions Paper Award from the Institute of
    Electronics, Information, Communication Engineers of Japan
    in 1989 and IEE Ambrose Fleming Premium Award in 1994.

    He received the B. E., M. E., and D. E. degrees in
    electronic engineering from Tohoku University in 1988, 1990,
    and 1992, respectively. He is a member of the IEEE Computer
    Society, the Society of Instrument and Control Engineers of
    Japan, and the Institute of Electronics, Information and
    Communication Engineers of Japan.

    Tatsuo Higuchi received the B.E., M.E. and D.E. degrees in
    electronic engineering from Tohoku University, Sendai,
    Japan, in 1962, 1964 and 1969, respectively. He was a
    Professor in the Department of Electronic Engineering at
    Tohoku University, and is currently a Professor and Dean of
    the Graduate School of Information Sciences at Tohoku
    University. His general research interests include the
    design of 1-D and multi-D digital filters, linear
    time-varying system theory, fractals and chaos in digital
    signal processing, VLSI computing structures for signal and
    image processing, multiple-valued ICs, multiwave
    opto-electronic ICs, interconnection-free bio-electronic ICs
    and fault-tolerant computing.

    He served as the Program Chairman of the 1983 IEEE
    International Symposium on Multiple-Valued Logic (ISMVL),
    and the Symposium Chairman of the 1992 IEEE ISMVL. He
    received the Outstanding Paper Awards at the 1984, 1985,
    1987, and 1989 IEEE International Symposiums on
    Multiple-Valued Logic, the Outstanding Transactions Paper
    Award from the Society of Instrument and Control Engineers
    of Japan in 1984, the Technically Excellent Award from the
    Society of Instrument and Control Engineers of Japan in
    1986, the Outstanding Transactions Paper Award from the
    Institute of Electronics, Information and Communication
    Engineers of Japan in 1989, the Technically Excellent Award
    from the Robotics Society of Japan in 1990, and IEE Ambrose
    Fleming Premium Award in 1994. He is a member of the
    Society of Instrument and Control Engineers of Japan, and
    Robotics Society of Japan. He is an IEEE fellow.

    Scheduling

    (Thursday 13:30-15:00)

    Scheduling Master-Slave Multiprocessor Systems

    Sartaj Sahni
    [view first page | fetch full paper | conference presentation ]

    Abstract:

    We define the master-slave multiprocessor scheduling model and provide several applications for the model. $O(n \log n$) algorithms are developed for some of the problems formulated and some others are shown to be NP-hard.

    Keywords: multiprocessor scheduling, complexity

    Time Space Sharing Scheduling: A Simulation Analysis

    Atsushi Hori, Yutaka Ishikawa, Jörg Nolte, Hiroki Konaka, Munenori Maeda and Takashi Tomokiyo
    [view first page | fetch full paper ]

    Abstract:

    We explain a new job scheduling class, called "Time Space Sharing Scheduling" (TSSS) for partitionable parallel machines. TSSS is a combination of time-sharing and space-sharing job scheduling techniques. Our proposed "Distributed Queue Tree" (DQT) is an instance of TSSS. We evaluate and analyze DQT behavior in more detail with a number of simulations. The result shows that DQT performs very well in low-load to high-load situations, almost independent of system size and task size distribution. We also compare our DQT and ScanUp batch scheduling, and we find that our DQT performs as well as ScanUp scheduling in processor utilization, but that both DQT and ScanUp have drawbacks in terms of scheduling fairness. Finally, we find that TSSS can inherently achieve higher processor utilization.

    Keywords: Job Scheduling, Parallel Operating System, Simulation, Time-Sharing, Space-Sharing

    Information from authors:

    WWW Pages:


    If you have any question on our paper and/or above WWW pages, feel free to mail to hori@rwcp.or.jp.

    "Agency Scheduling" - A Model for Dynamic Task Scheduling

    Johann Rost, Franz-Josef Markus and Li Yan-Hua
    [view first page | fetch full paper ]

    Abstract:

    This paper describes a class of algorithms for scheduling parallel programs represented by macro dataflow graphs (task precedence graphs onto a multiprocessor system such that the total execution time is minimized. The schedule will be computed dynamically during the runtime of the process system. The model allows to represent centralized and fully distributed algorithms as well as intermediate forms. The algorithms are able to schedule static as well as dynamic dataflow graphs. Knowledge of the execution times of the tasks is not necessary. Some variants of the model have been implemented using a multi-transputer system. Practical experiences are included in the paper.

    Keywords: dynamic dataflow graphs, dynamic task scheduling, distributed decision making

    Information from authors:

    Authors:

    The Agency Scheduling model for distributed dynamic scheduling described in the paper has been extended in the FDDS project such that the parallel programs are executed in a fault-tolerant way. As task replication (static redundancy) is too costly for high-performance parallel computers dynamic redundancy comprising checkpointing, self-diagnosis and rollback recovery is used. The proposed approach works in a completely distributed way by making nodes which have finished a task responsible for allocating their ready task successors. The basic idea for achieving fault tolerance is to keep all input data sets of a task as checkpoints on different nodes in such a way that after a node failure the lost task can automatically be restarted on a remaining intact node. So, fail-soft behavior is realized in a distributed and user-transparent way.

    Other projects at the Institute for Computer Engineering deal with intelligent automation systems, tools for debugging and performance analysis of parallel applications, fault tolerant massively parallel systems and operator site software.

    Fault Tolerance and SIMD Arrays

    (Thursday 13:30-15:00)

    FFTs on a Linear SIMD array

    Mattias Johannesson
    [view first page | fetch full paper | conference presentation ]

    Abstract:

    This paper describes different ways to implement one- and two-dimensional Fast Fourier Transforms, FFTs, on the linear SIMD architecture family VIP.

    VIP, Video Image Processor, is a bit-serial linear array SIMD architecture developed at the Image Processing Laboratory at Linköping University aiming at real-time low-level image and radar signal processing.

    Since the processing elements in the linear array only communicate via a shiftregister there is a large communication overhead to PEs far away in the array but we nevertheless obtain very fast FFT implementations.

    We show for instance that it is possible to build a 512-PE VIP system capable of Fourier transforming 5122 images at video rate (25 frames per second).

    Keywords: FFT, Linear Array, SIMD, Bit-serial

    Information from authors:

    Mattias Johannesson is a PhD student working in the projects VLSI of Vision and the VIP project.
    The work is supervised by Prof. Per-Erik Danielsson and Dr Anders Åström.

    Mattias presented 1993 a licentiate thesis called "Sheet-of-light Range Imaging", which
    concerns fast sheet-of-light range imaging with different smart sensor architectures.

    In the VIP project we work on linear SIMD architectures for Image and Radar signal processing.
    One of our design is currently being developed by Ericcson Microwave systems.

    For more information look me up on my home page,
    it's found here

    Tolerating Faults in Faulty Hypercubes Using Maximal Fault-Free Subcube-Ring

    Jang-Ping Sheu and Yuh-Shyan Chen
    [view first page | fetch full paper | conference presentation ]

    Abstract:

    In this paper, we present a reconfiguration approach to identify the maximal fault-free subcube-ring for tolerating faults in faulty hypercubes. The fault-free subcube-ring is connected by a ring of fault-free subcubes with dilation 3. By exploiting the size of fault-free subcubes as large as possible, the maximal fault-free subcube-ring with higher processor utilization is obtained. Using this approach, we can tolerate more than n faults in n-dimensional hypercubes. To demonstrate the fault- tolerant capability of our approach, we implement a fault-tolerant matrix-multiplication algorithm on the nCUBE/2E hypercube machine with 32 processors. The simulation results show that our reconfiguration approach has low performance slowdown and high processor utilization.

    Keywords: Algorithms, fault tolerance, hypercubes, matrix-multiplication

    Information from authors:

    JANG-PING SHEU


    was born in Taiwan, Republic of China, on April 28, 1959. He received the B.S. degree in computer science from Tamkang University, Taiwan, in June 1981 and the M.S. and Ph.D. degrees in computer science from the National Tsing Hua University, Taiwan, in June 1983 and January 1987, respectively. He joined the faculty of the Department of Electrical Engineering at the National Central University, Taiwan, as an Associate Professor in February 1987. Since August 1992, he has been a Full Professor at the Department of Computer Science and Information Engineering. His current research interests include parallel processing and distributed computing systems.

    Dr. Sheu is a member of the IEEE Computer Society and Phi Tau Phi Society.

    YUH-SHYAN CHEN


    was born in Taiwan, Republic of China, on May 25, 1965. He received the B.S. degree in computer science from Tamkang University, Taiwan, in June 1988. In 1991 he received his M.S. degree in the Department of Electrical Engineering from the National Central University, Taiwan, where he is currently working toward the Ph.D. degree. His research interests are fault-tolerant computing and parallel algorithms.

    Mr. Chen is a student member of the IEEE Computer Society.

    See more information in High-Speed Computing Lab. Room of NCU.

    Communication in Multicomputers with Nonconvex Faults

    Suresh Chalasani and Rajendra V. Boppana
    [view first page | fetch full paper | conference presentation ]

    Abstract:

    A technique to enhance multicomputer routers for fault-tolerant routing with modest increase in routing complexity and resource requirements is described. This method handles solid faults in meshes, which includes all convex faults and many practical nonconvex faults, for example, faults in the shape of L or T. As examples of the proposed method, adaptive and nonadaptive fault-tolerant routing algorithms using four virtual channels per physical channel are described.

    Keywords: Fault tolerance, mesh networks, nonconvex faults, routing.

    Information from authors:

    In the last three years, Chalsani and
    Boppana have worked on the design of new wormhole routing algorithms, fault-tolerant message communication, and multicast communication for multicomputers. Students interested in pursuing Ph.D. in Computer Science at UTSA can browse its web site for more information about UTSA Computer Science Program and the faculty.

    Poster Session

    Wednesday 14:00-16:00

    Parallelising Programs with Algebraic Programming Tools

    Anatoly E. Doroshenko and Alexander B. Godlevsk
    [view first page | fetch full paper ]

    Abstract:

    A rewriting based approach to dynamical parallelization of a general class of sequential imperative programs by means of the algebraic programming system APS is proposed. It gives advantages of rapid prototyping and evolutionary development of efficient parallelizers. The poster contains brief explanation of the APS system, techniques for designing efficient parallelizers and an example of parallel program development.

    We follow an integrated compiler/interpreter approach to parallel software design having in mind that writing interpreter is much easier task than developing compiler but interpreters are commonly known to be inefficient. The emphasize is maden on program transformations and dynamical issues of parallelization that can be solved in compile-time and run-time respectively at low cost.

    Keywords: dynamic parallelization, algebraic programming, rapidprototyping, evolutionary parallel program development.

    Parallel Prolog with Uncertainty Handling

    Katalin Molnar
    [view first page | fetch full paper ]

    Abstract:

    Aurora Prolog has been extended with uncertainty handling and experiments show a reasonable speedup on multiple workers. There is a bigger scope for exploiting parallelism in an uncertain knowledge base than in an ordinary logic knowledge base because of the nature of calculating uncertainty.

    The asynchronous database handling predicates of Aurora are useful for implementing efficient information transfer between the parallel branches of the search tree. In a parallel environment one has to pay attention to careful use of the non-logical features of Prolog (such as the cut operator), as improper use may lead to serious performance degradation.

    Keywords: Aurora, Prolog, parallel, uncertainty, support logic

    Information from authors:

    This paper describes a part of the CUBIQ toolset that is built on top of Aurora, an OR-parallel implementation of Prolog for shared memory multiprocessors. Cubiq introduces frames, blackboard handling and uncertainty handling.

    A Special-Purpose Coprocessor for Qualitative Simulation

    Gerald Friedl, Marco Platzner and Bernhard Rinner
    [view first page | fetch full paper ]

    Abstract:

    Qualitative simulation is applied more and more in design, monitoring and fault-diagnosis. However, poor performance of current qualitative simulators complicates or even prevents its application in technical environments.

    In our research project a special-purpose computer architecture for the qualitative simulator QSIM is developed. Improved performance is achieved by mapping QSIM functions onto a multiprocessor system and executing runtime extensive functions on specialized coprocessors. In this paper, we present a part of this research project - the design and implementation of a specialized coprocessor for the constraint check functions (CCFs) of QSIM. The coprocessor is designed at gate- and register level to obtain maximum execution speed. The coprocessor is implemented in an FPGA of type Xilinx XC4013 running at a clock frequency of 15 MHz. With this FPGA-implementation a runtime improvement of up to 20 is achieved.

    Keywords: specialized coprocessor, FPGA, qualitative simulator QSIM

    Portable Software Tools for Parallel Architectures

    Catherine Barnes and Chris Wadsworth
    [view first page | fetch full paper ]

    Abstract:

    Portable Software Tools for Parallel Architectures (PSTPA) is a five-year programme of collaborative research funded by the UK Engineering and Physical Sciences Research Council. The paper summarises the background and aims of the programme and the set of 15 projects recently started.

    Keywords: parallel software methods, tools, user environments, portability, scalability

    Information from authors:

    Catherine Barnes is Programme Manager for Systems Architectures
    in EPSRC which includes the PSTPA Programme.

    Chris Wadsworth is Head of the Parallel and Distributed Systems Group
    at RAL, with projects in the systems aspects and techniques of parallel
    programming and the porting of applications software. He is Technical
    Co-ordinator (part-time) of the PSTPA Programme. His URL on WWW is
    http://www.cis.rl.ac.uk/people/cpw/contact.html

    Boosting the Performance of Workstations through WARPmemory

    Christoph Siegelin, Ulrich Finger and Ciaran O'Donnel
    [view first page | fetch full paper ]

    Abstract:

    This paper presents WARPmemory, a multiprocessor extension for networks of workstations. A multiported memory device is connected in a star topology to the workstations, allowing for transparent access to a shared memory. High-speed fiber optic links are used as the interconnect, thus keeping latencies in the order of a microsecond. This is three orders of magnitude lower than a purely software implementation over a local area network (e.g. Ethernet and PVM). Cache coherence is ensured using a directory based protocol. The design of the multiported memory module is elegant. A very fast RISC processor is used as an intelligent switch, much of what would be complex in hardware is migrated to software. WARPmemory allows the advantages of distributed processing and shared memory to be combined at a reasonable cost.

    Keywords: Shared Memory, Multiprocessor, Fiber Optic Networks

    Information from authors:

    ENST WWW Home Page.

    A Monitoring System for Software-Heterogeneous Distributed Environments

    Aleksander Laurentowski, Jakub Szymaszek, Andrzej Uszok and Krzysztof Zieliñski
    [view first page | fetch full paper ]

    Abstract:

    One computational model for distributed memory architectures is a set of autonomous objects, communicating via message-passing mechanism. Advanced applications within this model may be engineered of components implemented in various programming languages. Thus, particular parts of such applications can be executed in those environments which inherently express the nature of computation. This allows mixing various programming and execution paradigms (e.g. imperative, functional, object-oriented or logic) in one, software-heterogeneous system. The MODIMOS system aims at monitoring of this kind of systems. Its application area covers visualization of their structure, activity and management of their logical configuration, thus enabling appropriate tuning. The characteristic features of MODIMOS are: configurability, information management & filtering mechanisms, expandability of monitored environments set.

    Keywords: monitoring, software visualization, software heterogenity, multi-paradigm programming

    Information from authors:

    Authors:

    are members of Distributed Systems Research Group at the Institute of Computer Science, University of Mining and Metallurgy in Cracow, Poland.

    You may find more information about MODIMOS project here.

    A Metacircular Data-Parallel Functional Language

    Gaétan Hains and John Mullins
    [view first page | fetch full paper ]

    Abstract:

    We describe and relate two data-parallel semantics for the simply-typed lambda-calculus and obtain a semantic function expressible in its object language. The explicitly distributed semantics allow the formal specification of the interpreter's effect on data partitioning and communications.

    The CMU Scandal project's VCODE demonstrates the usefulness of a parallel intermediate language abstract enough to be portable yet explicit about parallel data. DPML and its latest implementation Caml FLight, were designed with these goals in mind. Caml Flight extends the sequential language Caml Light in a deterministic and portable way but looses its metacircularity, a symptom of a loss of expressive power. A language which can express its own interpreter is called metacircular. This paper describes the semantics of a metacircular DPML-like language.

    Keywords: key, words, come, here, separated, by, commas

    Efficient Run-Time Program Allocation on a Parallel Coprocessor

    Jurij Silc and Borut Robic
    [view first page | fetch full paper ]

    Abstract:

    One way to obtain higher computational performance of a computer H is to use a parallel coprocessor unit C which is attached to H and can efficiently solve computationally demanding applications. If such a unit C can be used for concurrent execution of several independent executable programs, the problem of efficient run-time (i.e. dynamic) allocation of programs to C arises. Our aim is to find an allocation strategy that maximizes throughput of the overall system H-C.

    We present two dynamic allocation approaches for C with mesh-connected architecture. In both cases we reduce the allocation problem to the weighted bipartite matching problem which must be solved during the run-time each time a set of waiting executable programs is to be allocated space in C. The two approaches differ in the way they search for free space in C.

    Keywords: mapping optimization, run-time allocation, mesh-connected architecture

    Information from authors:


    Jurij Silc

    received the Ph.D. degree in electrical engineering from the University of Ljubljana, Slovenia, in 1992. From 1980 he has been a researcher at the Department for Computer Science and Informatics, Jozef Stefan Institute, Ljubljana. From 1987 to 1993 he has been the Head of the Laboratory for Computer Architectures at the same department. Presently, he is a Deputy Head of the Computer Systems Department at the Jozef Stefan Institute. His current research interests are in:
    Address: Jozef Stefan Institute, Computer Systems Dept., Jamova 39, 61111 Ljubljana, Slovenia
    Email: jurij.silc@ijs.si

    Borut Robic

    received the Ph.D. degree in computer science from the University of Ljubljana, Slovenia, in 1993. From 1984 he has been a researcher at the Department for Computer Science and Informatics, and since 1994 a researcher at the Computer Systems Department, both at the Jozef Stefan Institute, Ljubljana. He is also an Assistant Professor of Computer Science at the Faculty of Electrical Engineering and Computer Science, University of Ljubljana, Slovenia. His present research interests include:
    Address: Jozef Stefan Institute, Computer Systems Dept., Jamova 39, 61111 Ljubljana, Slovenia
    Email: borut.robic@ijs.si

    A Program Manipulation System for Fine-Grained Architectures

    Vladimir A. Evstigneev and Victor N. Kasyanov
    [view first page | fetch full paper ]

    Abstract:

    The PROGRESS system being implemented at theInstitute of Informatics Systems in Novosibirsk is discussed. The system is intended to support rapid prototyping of compilers for high level languages (e.g. Fortran-77, Modula-2, SISAL) and for a family of architectures exploited fine-grained parallelism. The next goal of the project is to develop an environment for investigation of optimizing and restructuring transformations of programs to be parallelized.

    Keywords: parallel processing, fine-grainarchitectures, transformational approach, program restructuring, multifunctional cooperation

    Information from authors:

    Vladimir A. Evstigneev is Leading Scientist (staff member)
    of the Laboratory for Program Construction and Optimization
    of Institute of Informatics System at Novosibirsk (Russian
    Academy of Sciences, Siberian Division). He is also
    Professor of Novosibirsk State University. Current research
    interests include graph theory and its applications,
    parallel processing, parallel computer architectures, and
    tools. He is the author of the book "Applications of graph
    theory in programming" (Moskow, Nauka, 1985) and the
    co-author of the book "Graph theory: tree processing
    algorithms" (Novosibirsk, Nauka, 1994).

    Victor N. Kasyanov is Head of the Laboratory for Program
    Construction and Optimization of Institute of Informatics
    Systems of the Siberian Division of the Russian Academy of
    Sciences and Professor in the Department of Mathematics of
    Novosibirsk State University, Russia. His current research
    interests include optimizing compilation, program
    transformation, annotated programming, data flow analysis,
    parallel processing, graph theory, computer education. He
    has written over hundred papers and fourteen monographs in
    theoretical computer science, system programming and graph
    theory.

    Institute of Informatics Systems
    of the Siberian Division of the Russian Academy of Sciences
    (http://www.iis.nsk.su)

    Novosibirsk State University
    (http://www.cnit.nsk.su//univer/english/General.html)

    Real-Time Image Compression Using Data-Parallelism

    P. Moravie, H. Essafi, C. Lambert-Nebout and J-L. Basille
    [view first page | fetch full paper ]

    Abstract:

    The purpose of this paper is to present this new parallel image compression algorithm. We present implementation results on serveral parallel computers. We also examine load balancing and data mapping problems. We end by presenting a well-suited architecture for Real-Time image compression.

    Keywords: Data-Parallelism, Image Compression, Wavelet Transform, Vector Quantization, Huffman Coding

    Congestion Control in Wormhole Networks: First Results

    Abdel-Halim Smai
    [view first page | fetch full paper ]

    Abstract:

    We present the first results of a congestion control scheme for computer wormhole networks. The basic idea of this scheme is to require pre-selected nodes to restrict their packet injection in order to reduce and possibly eliminate congestion in the whole network. The scheme employs randomization to select nodes.

    Our preliminary performance measurements indicate clearly that the proposed scheme improves network performance. Some of the remaining studies are implementation cost, such as the impact of the scheme on the input/output queuees in the nodes. Another further study would be to compare such flow control schemes with other trends such as that of adaptive routing algorithms.

    Keywords: Parallel Architectures, Interconection Networks, Wormhole Networks, Congestion Control, Design and Performance

    Pages developed and maintained by psm@sics.se. Please direct comments or bug reports to europar95-www@sics.se