
Abstract:
This paper addresses the problem of distributed reactive systems execution. We first show that a natural parallel description of such systems can be achieved with synchronous languages. Then, we explain how a centralized synchronous program can be executed in its environment, which is intrinsically asynchronous. For this purpose, we define a synchronous/asynchronous interface, which links the program logical time with the environment physical time. Finally, we motivate the need for distribution and show how a desired distribution can be easily achieved, thanks to the object code distribution algorithm implemented in the OC2REP tool. We then propose and discuss three solutions that allow distributed synchronous programs to be executed on an asynchronous network of processors.
Keywords: parallel computing, program interface, reactive system,synchronous language
Information from authors:
Paul Caspi received the Docteur es Science thesis from INPG, Grenoble, France. Currently, he is Charge de Recherche CNRS at Verimag, INRIA SPECTRE project, in Grenoble, France. His previous research activities mainly dealt with the use of computers in process control, and where concerned with dependability problems in critical applications, from both hardware and software point of view. This led him to participate in the design of the synchronous data flow language LUSTRE. He is now mainly concerned with the theory of synchronous data flow, and the problem of building distributed systems from synchronous programs. He also acted as a consultant for several companies and French administrations, on problems of dependable computing systems design and validation.
Alain Girault received the Ph.D degree from INPG, Grenoble, France. Currently, he is working with a post doctoral position at the INRIA MEIJE project, in Sophia-Antipolis, France. During his thesis, he worked on distribution of synchronous programs, and designed the OC2REP distribution tool.
Abstract:
Much work has been done in the areas of and--parallelism and data--parallelism in Logic Programs. Both types of parallelism offer advantages and disadvantages: traditional (and--) parallel models offer generality, whereas data--parallelism techniques offer increased performance for a restricted class of programs. The thesis of this paper is that these two forms of parallelism are not fundamentally different and that relating them opens the possibility of obtaining the advantages of both within the same system. Some relevant issues are discussed and solutions proposed. The discussion is illustrated through visualizations of actual parallel executions implementing the ideas proposed.
Keywords: Parallel Logic Programming, And-Parallelism, Data-Parallelism, Fast Task Startup, Scheduling
Information from authors:
Abstract:
Or-parallelism and And-parallelism have often been considered as two distinct forms of parallelism with not much in common. As a result, they have been treated quite independently and different mechanisms have been employed for realizing them. The purpose of this paper is to highlight the inherently dual nature of the two forms of parallelism and the similarities that exist between them.
The dualities and similarities observed are then exploited for gaining new insights into the design, implementation, and optimization of and- and or-parallel systems. The ideas developed in this paper are illustrated with the help of ACE system - a parallel Prolog system incorporating both and- and or-parallelism. The results from implementing these ideas are also briefly reported.
Keywords: Prolog, And-parallelism, Or-parallelism, Optimizations
Information from authors:
Abstract:
In this paper we propose a methodology for structured parallel programming using functional skeletons to compose and co-ordinate concurrent activities written in a standard imperative language. Skeletons are higher order functional forms with built-in parallel behaviour. We show how such forms can be used uniformly to abstract all aspects of a parallel program's behaviour including data partitioning, placement and re-arrangement (communication) as well as computation. Skeletons are naturally data parallel and are capable of expressing computation and co-ordination at a higher level of abstraction than other process oriented co-ordination notations. Examples of the application of this methodology are given and an implementation technique outlined.
Keywords: Programming Language, Parallel Computing, Skeleton, Coordination Language, High Performance Fortran

Abstract:
Demand-driven systems follow the model where customers enter the system, request some service, and then depart. Examples are databases, transaction processing systems and operating systems, which form the system software layer between the applications and the hardware. Achieving scalability at the system software layer is critical for the scalability of the system as a whole, and yet this layer has largely been ignored.
In this paper, we characterize the scalability of the system software layer of demand-driven parallel systems based on fundamental metrics of quantitative system performance analysis. We develop a set of sufficient conditions so that if a system satisfies these conditions, then the system is scalable. We further argue that in practice these conditions are also necessary. In the remainder of the paper, we use the necessary and sufficient conditions to develop a set of practical design guidelines, to study the effect of application workloads, and to examine the scalability behavior of a system with only a limited number of processors.
Keywords: scalability, parallel, demand-driven, system software
Information from authors:
Ronald Unrau
============
Ronald Unrau is working on advanced compiler optimizations at the IBM
Software Solutions Lab in Toronto, Canada. His research interests
include parallel operating systems and compilers.
Unrau received his BASc in computer engineering form the University of
Alberta in 1984 and his MASc in biomedical engineering and PhD in
electrical and computer engineering, both from the University of
Toronto, in 1989 and 1993.
Michael Stumm
=============
Michael Stumm is an associate professor in the departments of electrical
and computer engineering, and computer science at the University of Toronto.
His research interests are in the area of computer systems.
Stumm received a diploma in mathematics and a PhD in computer science from the
University of Zurich in 1980 and 1984, respectively.
He is a member of the IEEE Computer Society and ACM.
Orran Krieger
=============
Orran Krieger completed his PhD in electrical and computer engineering at the
University of Toronto in 1994. His research interests include
operating systems, file systems, and multiprocessors.
Krieger received a BASc from the University of Ottawa in 1985 and an
MASc from the University of Toronto in 1989, both in electrical
engineering.
Abstract:
The growing disparity between processor and memory speeds has caused memory bandwidth to become the performance bottleneck for many applications. In particular, this performance gap severely impacts stream-orientated computations such as (de)compression, encryption, text searching, and scientific (vector) processing. This paper looks at streaming computations and derives analytic upper bounds on the bandwidth attainable from a class of access reordering schemes. We compare these bounds to the simulated performance of a particular dynamic access ordering scheme, the Stream Memory Controller (SMC). We are building the SMC, and where possible we relate our analytic bounds and simulation data to the simulation performance of the hardware. The results suggest that the SMC can deliver nearly the full attainable bandwidth with relatively modest hardware costs.
Keywords: memory bandwidth, vector processing, access ordering, data prefetching, performance evaluation
Abstract:
StarT-NG is a joint MIT-Motorola project to build a high-performance message passing machine from commercial systems. Each {\em site} of the machine consists of a PowerPC 620-based Motorola symmetric multiprocessor (SMP) running the AIX 4.1 operating system. Every processor is connected to a low-latency, high-bandwidth network that is directly accessible from user-level code. In addition to fast message passing capabilities, the machine has experimental support for cache-coherent shared memory across sites. When the machine requires memory to be kept globally coherent, one processor on each site is devoted to supporting shared memory. When globally coherent shared memory is not required, that processor can be used for normal computation tasks. StarT-NG will be delivered at about the time the base SMP is introduced into the marketplace. The ability to be both a collection of standard SMP and an aggressive message passing machine with coherent shared memory makes StarT-NG a good building block for incrementally expandable parallel machines.
Keywords: parallel architecture, message passing, coherent sharedmemory, SMP, PowerPC
Information from authors:
The StarT-NG project is a collaborative effort between the Computation Structures Group of the Laboratory for Computer Science at MIT and the Motorola Computer Group. Derek Chiou, Boon S. Ang and James C. Hoe are graduate students at MIT and are supervised by Arvind. Andy Boughton is a staff member in CSG. Bob Greiner and Jamey Hicks are employees of Motorola while Mike Beckerle was formerly an employee of Motorola who now works for Applied Parallel Technologies, Inc.
Abstract:
Multithreaded architectures have been proposed for future multiprocessor systems due to their ability to cope with network and synchronization latencies. Some of these architectures depart significantly from current RISC processor designs, while others retain most of the RISC core unchanged. However, in light of the very low cost and excellent performance of off-the-shelf microprocessors it seems important to determine whether it is possible to build efficient multithreaded machines based on unmodified RISC processors, or if such an approach faces inherent limitations. This paper describes the costs and benefits of running multithreaded programs on the EARTH-MANNA system, which uses two Intel i860 XP microprocessors per node.
Keywords: multiprocessors, multithreaded architectures, performance measurements, EARTH, MANNA
Information from authors:
The authors of this paper:
Dr. Olivier Maquelin,
Prof. Herbert H.J Hum and
Prof. Guang R. Gao
are members of the
EARTH project,
which groups researchers from the
ACAPS Lab at
McGill University
and from the
Department of Electrical and
Computer Engineering at
Concordia University.
The aim of the
EARTH project
as a whole is to investigate compiler techniques and architectural
features that are required for better supporting parallel computing in
future high-performance architectures. This ranges from parallel
language and algorithm design, over automatic parallelizing compilers
for high-level languages, benchmarking, emulation and simulation of
parallel machines, down to the production of an EARTH hardware
prototype.

Abstract:
This article presents an example of program transformation strategies in the language {\sc Pei} using several strategies : a simplification of the communications, the introduction of broadcasts by removing recursion from data field definitions, and the introduction of a reduction operator.
These transformations emphasize the relationships between several programs solving a given problem, especially in the data parallelism area.
Keywords: Abstraction, Broadcast, Communication, Data-parallelism, Reduction, Refinement
Information from authors:
The PEI project defines a new equational style framework for data parallel programming. It lies on a small but powerful set of primitives and induces a refinement calculus to design or transform programs. Expressions and transformations can be carried out by using an interactive environment which supplies symbolic evaluations. Current works concern the foundations of PEI and practical developments as well, such as a compiler to a data parallel language.
Abstract:
We prove the completeness of an assertional proof system for a simple loop-free data-parallel language. This proof system is based on two-part assertions, where the predicate on the current value of variables is separated from the specification of the current extent of parallelism.
The proof is based on a Weakest Precondition (WP) calculus. In contrast with the case of usual scalar languages, not all WP can be defined by an assertion. Yet, partial definability suffices to prove the completeness thanks to the introduction of hidden variables in assertions. The case of data-parallel programs with loops is briefly discussed in the conclusion. The full version is available as LIP Research Report RR94-42 (URL http://www.ens-lyon.fr/~bouge/Info).
Keywords: Concurrent Programming, Specifying and Verifying and Reasoning about Programs, Semantics of Programming Languages, Data-Parallel Languages, Proof System, Hoare Logic, Weakest Preconditions.
Information from authors:
Luc Bougé
defended his Habilitation Thesis in 1987 at University Paris 7 under
the supervision of Prof. Maurice Nivat and Prof. Krzysztof Apt. He
worked as a full-time CNRS researcher at LIENS, ENS Ulm, Paris until
1990 on the semantics of CSP-like parallel languages, and then got a
position as a Professor at the LIP,
ENS Lyon. There, he launched a research project on the design and
implementation of data-parallel languages. A general position paper is
The data-parallel programming model: a semantic
perspective. Luc Bougé is now serving as the
co-chairman of the National CNRS Research Program on Parallelism,
Networks and Systems.
David Cachera received
his Master Degree at ENS Lyon in 1994. He is currently working for a
PhD under the supervision of Luc Bougé at LIP. His work
concerns the semantics of data-parallel languages, and its application
to formal program validation.
The full version of the EuroPar Paper (17 pages) is available:
get it? . To get the list of all material available
on-line as LIP Research Reports, please
click here..
Abstract:
The Parallel Debugging Tool (PDT) of the Annai programming environment is developed within the Joint CSCS-ETH/NEC Collaboration in Parallel Processing. Like the other components of the integrated environment, PDT aims to provide support for application developers to debug portable large-scale data-parallel programs based on HPF and message-passing programs based on the MPI standard. PDT supports MPI event tracing for race detection and deterministic replay for manually parallelized MPI programs as well as for code generated with the advanced techniques of a data-parallel compiler. This paper describes the tracing and replaying mechanisms included in PDT as well as their efficiency by presenting execution time overheads for several benchmark programs running on the NEC Cenju-2/3 distributed-memory parallel computers.
Keywords: Race Detection, Replay, Determinism, Distributed Memory, Message Passing, MPI, HPF, Parallel Debugger, Integrated Programming Environment
Information from authors:
CSCS (Centro Svizzero di Calcolo Scientifico) home page: http://www.cscs.ch
Abstract:
This paper presents a mechanism for record-replay of parallel programs written in a remote procedure call (RPC) based parallel programming model. This mechanism, which will serve as a basis for implementing a user-level debugger, exploits some properties of the programming model to limit drastically the number of records that need to be done. A formal proof of the equivalence between recorded and replayed executions is given. Systematic measurements of the time overhead of the recording indicate that it is sufficiently low for the recording mode to be considered as normal execution mode. Similar techniques can be applied to other programming models.
Keywords: Instant Replay, parallel debugging, deterministic reexecutions, Remote Procedure Call.
Information from authors:
The work described in this paper is part of the PhD research of
Alain Fagot.
It is done in the framework of the
APACHE
project whose aim is to design and implement a parallel
programming environment for parallel computers, providing both static
and dynamic load balancing facilities.
The authors can be reached by s-mail at:
IMAG, APACHE project
46 avenue Félix Viallet,
F-38031 Grenoble Cedex 1, France.
and by e-mail at:
Alain.Fagot@imag.fr
Jacques.Chassin@imag.fr

Abstract:
A hardware independent method of programming a massively parallel machine (MPP) can best be supported by a well-designed run-time environment. An important problem in this design is the ability of efficiently simulating networks different from the hardware topology. We will describe the mapping kernel of the virtual processors library for the commercial run-time system PARIX. This kernel contains description classes for several topologies (so- called virtual topologies) and implementations of respective embeddings which map given instances of virtual topologies onto others or onto the hardware. Using these functions, PARIX is able to establish concrete virtual topologies with corresponding communication channels. The implemented functions were selected with respect to the well-known criteria for graph embeddings: equal load and small dilation. Additionally, we focus on fast distributed computation and universal applicability. As an example, we will show new methods for efficiently embedding an arbitrary 2-dimensional grid as a guest graph into any 2- dimensional grid as a host graph.
Keywords: parallel run-time system, PARIX, virtual processors, embedding, grids
Information from authors:
Thomas Römke and
Jens Simon,
Paderborn Center for Parallel Computing
Markus Röttger and
Ulf-Peter Schoeder,
Efficient Use of Massively Parallel Systems, AG Monien,
Department of
Mathematics
and
Computer Science,
University of Paderborn
Abstract:
Many problems can be solved more efficiently on a mesh of trees network than on a mesh. Until now it has been an open problem whether the mesh of trees is always at least as fast as the mesh. In this paper, we present an emulation of N-node meshes on O(N)-node meshes of trees with constant slowdown, even though any embedding of a mesh into a mesh of trees requires dilation Omega(log N). This demonstrates that the mesh of trees is strictly more powerful than the mesh. As an application, we show how to construct an optimal O(sqrt(N)) sorting algorithm for the mesh of trees that improves on the best previously known algorithm by a logarithmic factor.
Keywords: interconnection networks, mesh, mesh of trees, emulation, sorting
Information from authors:
Alf-Christian Achilles is working towards his PhD at the Computer Science Department at the University of Kalrsruhe.
Abstract:
This paper studies network embeddings in the Hamming cubes, a recently designed interconnection topology for multicomputers. The Hamming cube networks are supergraphs of incomplete hypercubes such that the additional edges form an extra binomial spanning tree. The recursively constructible and unit incremental Hamming cubes have better properties than hypercubes, including half of logarithmic diameter and higher fault-tolerance. They also support simple routing and efficient broadcasting schemes.
In this paper, we show that Hamiltonian paths and cycles of all lengths, complete binary trees and their several variants are subgraphs of Hamming cubes. Our embeddings have both dilation and expansion equal to one. Furthermore, taking advantage of the enhanced edges in the Hamming cubes, tree machines can be embedded with dilation of one and expansion of 7/6. Thus, Hamming cubes provide embeddings at a lower cost than (incomplete) hypercubes of the same size.
Keywords: Network embedding, dilation, interconnection network,Hamming cube, incomplete hypercube, binary tree, hypertree, tree machine.
Information from authors:
Authors: Sajal K. Das and Aisheng Mao
Affiliation Address: Department of Computer Science
University of North Texas
P. O. Box 13886
Denton, TX 76203-6886
USA
About the Authors:
Dr. Sajal K. Das received the B. Tech. degree in 1983 from Calcutta University,
the M. S. degree in 1984 from the Indian Institute of Science, Bangalore, and
the Ph.D. degree in 1988 from the University of Central Florida, Orlando, all
in Computer Science. Currently he is an Associate Professor of Computer Science
at the University of North Texas, Denton, and also a faculty member of the
Center for Research in Parallel and Distributed Computing. He is a recipient
of the Cambridge Nehru Scholarship and an Honor Professor Award at UNT for best
teaching and scholarly research. Dr. Das was invited to visit The Leonardo
Fibonacci Institute in Trento, Italy, and the Slovak Academy of Sciences in
Bratislava, Slovakia, in 1994. His current research interests include the design
and analysis of parallel algorithms, parallel data structures, parallel discrete
event simulation, applied graph theory and combinatorics, multiprocessor
interconnection networks and embeddings, and mobile computing. He has published
over 80 papersin these areas in refereed journals and international conference
proceedings.
Dr. Das serves on the Editorial Boards of {\em Parallel Processing Letters}
(World Scientific Pub), the {\em Journal of Parallel Algorithms and Applications}
(Gordon and Breach), and the {\em CD-ROM Journal of Computing}.
He has been a member of Program Committees for several international conferences
and a Co-Editor of the {\em Journal of Computer and Sofware Engineering} special
issue on Parallel Algorithms and Architectures. Dr. Das is also a member of the
IEEE Computer Society and the ACM.
Mr. Aisheng Mao received the M.S. degree in Computer Science in 1992 from the
University of North Texas, Denton. Curently he is a Ph.D. student in computer
science and working on his thesis. His research interests include multicomputer
architectures and algorithms, interconnection network topologies, network
embeddingss, and telecommunications.
Abstract:
Adaptive routing can improve the network performance and the fault-tolerance by providing multiple routing paths. However, the implementation complexity of adaptive routing can be significant, discouraging its use in commercial massively parallel systems. In this paper, we introduce Hierarchical Adaptive Routing (HAR), a new adaptive routing framework, which provides a unified framework for simple and high performance fully adaptive deadlock-free wormhole routing. HAR divides the physical network into several level of virtual networks. There is one connection channel between two adjacent virtual networks, that allows blocked packets in the higher level to move to the lower level. Different routing algorithms can be used in each virtual network, and the overall network is deadlock-free, if the routing algorithm in the lowest level virtual network is deadlock-free. However, the routing algorithm in any other virtual network can be fully adaptive, even nonminimal routing to increase performance. HAR has three advantages: fully adaptive deadlock-free routing in any non-wrapped and wrapped k-ary n-cube network with 2 and 3 virtual channels respectively, relative small crossbars and applicability to a wide variety of network topologies. Detailed implementation and simulation studies of a HAR for 2D mesh networks are presented.
Keywords: Routing Networks, Adaptive Routing, Deadlock Prevention, Wormhole Routing

Abstract:
The list marking problem involves marking the nodes of an $\ell$-node linked list stored in the memory of a $(p, n)$-PRAM, when only the location of the head of the list is initially known. Under the assumption that memory cells containing list nodes bear no distinctive tags distinguishing them from other cells, we establish an $\Omega(\min \{ \ell, n/p \})$ randomized lower bound for $\ell$-node lists and present a deterministic algorithm whose running time is within a logarithmic additive term of this bound. In the case where list cells are tagged in a way that differentiates them from other cells, we establish a tight $\Theta(\min\{\ell, \ell/p+ \sqrt{(n/p)\log n}\})$ bound for randomized algorithms.
Keywords: parallel processing, list algorithms, randomized algorithms,lower bounds
Abstract:
There exist transformations of PRAM programs with predictable communication behavior to existing architectures. We extend the class of tractable programs to those with communication depending on the input. First, we define this class of programs. Second, we give source code transformations to simplify the programs and to eliminate indirect addresses and conditionals. Third, we show how to derive the communication behavior statically. Fourth, we show how to compute the mapping at compile time. Finally, we give upper time bounds for execution on existing architectures.
Keywords: PRAM, LogP, Program Optimization, Program Transformation
Information from authors:
Today, there is a great variety on parallel algorithms for shared-memory architectures, mainly the PRAM. However, the PRAM model does not take into account properties of realistic architectures. Recently, Culler et al. defined a new more realistic machine model which reflects better the practical behaviour of massively parallel computers. Their LogP model differs from the PRAM in the following points. First, the synchronous execution is dropped. Instead all processors perform their computation asynchronously. Second, there is no shared memory assumption. Instead they consider a communication latency, communication overhead and network bandwith. Finally, the number of processors is fixed and cannot increase with the problem size.
We focus on transformation of parallel algorithms and programs on the PRAM to equivalent LogP-programs and on optimization of the resulting programs.
Publications of mine on the above topic and further information on our institution can be optained via my home page.
Welf Loewe's home page
Abstract:
We investigate some properties of minimal interval and circular-arc representations and give several optimal parallel recognition and construction algorithms. We show that, among other things, given an $s \times t$ interval or circular arc representation matrix, * deciding if the representation is minimal can be done in $O(\log s)$ time with $O(st/\log s)$ EREW PRAM processors, or in $O(1)$ time with $O(st)$ Common CRCW PRAM processors; * constructing a minimum interval representation can be done in $O(\log (st))$ time with $O(st/\log (st))$ EREW PRAM processors, or in $O(\log t/\log\log t)$ time with $O(st\log\log t/\log t)$ Common CRCW PRAM processors, or in $O(1)$ time with $O(st)$ BSR processors.
Keywords: parallel algorithms, time and work optimality, minimal interval and circular arc representations, recognition and construction

Abstract:
Shared memory multiprocessors are based on memory models, which are precise contracts between hard- and software that spell out the semantics of memory operations. Scalable systems implementing such memory models rely on cache coherency protocols that use dedicated hardware. This paper discusses the design space for high performance cache coherency controllers and describes the architecture of the programmable protocol engines that were developed for the S3.mp shared memory multiprocessor. S3.mp uses two independent protocol engines, each of which can maintain multiple, concurrent contexts so that maintaining memory consistency does not limit the system performance. Programmability of these engines allows support of multiple memory organizations, including CC-NUMA and S-COMA.
Keywords: sistributed shared memory, protocol engines, cache Coherency, simple COMA
Information from authors:
Our WWW page is http://playground.sun.com/pub/S3.mp/s3mp.html
Abstract:
Abstract. This paper presents the results for the verification of the S3.mp cache coherence protocol. The S3.mp protocol uses a distributed directory with limited number of pointers and hardware supported overflow handling that keeps processing nodes sharing a data block in a singly linked list. The complexity of the protocol is high and its validation is challenging because of the distributed algorithm used to maintain the linked lists and the non-FIFO network. We found several design errors, including an error which only appears in verification models of more than three processing nodes, which is very unlikely to be detected by intensive simulations. We believe that methods described in this paper are applicable to the verification of other linked list based protocols such as the IEEE Scalable Coherent Interface.
Keywords:
Abstract:
Large array can bring unfavorable effects on software data prefetching. Therefore, it worth to be considered carefully to get the good effectiveness of data prefetching. This paper proposes a software data prefetching mechanism that considers the the size of array and the size of cache along with the behaviour of accesses to the array. We describe on the loss of data prefetching effectiveness in respects of the reuse property and the number of prefetching instructions.
We also propose loop reorganizing algorithm to insert prefetching instructions into a nested loop by using proposed prefetching method. Also, we implemented a preprocessor, (we call it as LOOP), using the proposed algorthms in this paper, and show that our prefetching method improves the execution speed of experimented loops.
Keywords: software data prefetching, cache size, preprocessor
Information from authors:
Since the deciency of man-power in my lab, and the poor environments,
I cannot help making the preprocessor, LOOP, only for myself. It has been
a very lone work. Anyway, I have done my efforts to refine it and made it
be more extensible, and I will do it in the future.
Some guys in my lab call it as 'PISH'(Poor sImulation System under
Highering). In fact, the name of 'LOOP' is not fit to my preprocessor.
Total source code is about 10,000 lines except for comments.
The frame structure is very rough. The day spent on implementing 'LOOP'
is about 20 days. Anyway, what I can firmly say is....
It can guarantee the correct result of simulation of simple loop,
in respects of execution cycles, cache hit ratio etc. Futhermore,
the usage of it is very simple.
In current, I am building an hardware description language for 'LOOP'.
The language is defined with a reference to VHDL. Surely, it is
,to large extent, inferior to VHDL. However, it can be used in simple level.
All the prerequisites to use it are the ability of describing the
state-machine using automaton.
Since the aim of 'LOOP' is to assist the work for compiler system,
the language for describing the hardware structure is very simplified.

Abstract:
Automatic parallelization of imperative programs has focused on nests of for loops with affine bounds and affine dependences, because in this case execution domains and dependences can precisely be known at compile-time. When dynamic control structures, such as while loops, are used, existing methods for conversion into single-assignment form and domain scanning are inapplicable.
This paper gives an algorithm to automatically generate data-parallel codes, together with an algorithm to possibly convert the program into single-assignment form.
Keywords: automatic parallelization, while loops, code generation, data parallelism
Information from authors:
Abstract:
Computation Decomposition and Alignment (CDA) is a new loop transformation framework that extends the linear loop transformation framework and the more recently proposed Computation Alignment frameworks by linearly transforming computations at the granularity of subexpressions. It can be applied to achieve a number of optimization objectives, including the removal of data alignment constraints, the elimination of ownership tests, the reduction of cache conflicts, and improvements in data access locality.
In this paper we show how CDA can be used to effectively implement flexible computation rules with the objective of minimizing communication and, whenever possible, eliminating intrinsics that test whether computations need to be executed or not. We describe CDA, show how it can be used to implement flexible computation rules, and present an algorithm for deriving appropriate CDA transformations.
Keywords: Loop transformations, HPF, SPMD code generation
Information from authors:
http://www.eecg.toronto.edu/~kulki
http://www.eecg.toronto.edu/EECG/RESEARCH/ParallelSys
http://www.eecg.toronto.edu/~kulki
http://www.eecg.toronto.edu/EECG/RESEARCH/ParallelSys
Abstract:
An efficient technique for migrating the synchronization operations is proposed. This technique rewrites the original statements, Send_Signal(S) to be moved up and Wait_Signal(S) to be moved down, or rearranges the sequence of statements or the position of array element in a synchronization region by data dependence analysis to migrate the synchronization operations. Theorems show that the migration of synchronization operation can reduce the parallel execution time of loop significantly. Perfect benchmarks are employed to measure the system performance after migration. Experimental result shows that the enhancement is very significant.
Keywords: Code migration, Data dependence, LBD, LFD, Synchronization, Synchronization region
Abstract:
This paper presents the compilation techniques implemented in a compiler for a HPF-like language. The stress is especially put on the description of an optimized scheme which is dedicated to the compilation of parallel nested loops. The generation of the SPMD code is based on the polyhedral model and allows for the partitioning of the arrays involved in the loop in order to achieve symbolic restriction of iteration domains and message aggregation. Experimental results for some well-known kernels are shown.
Keywords: Distributed Memory Parallel Computers, Loop Distribution,Domain Analysis, Restriction of Iteration Domains, Message Aggregation.
Information from authors:
http://www.irisa.fr/EXTERNE/projet/pampa/PANDORE/pandore.html

Abstract:
We present the TDC programming model which aims to ease the efficient implementation of dynamic applications on distributed memory multiprocessors. This model is based on task descriptors, data objects and capabilities which reside in distinct, globally accessible domains. Dynamic load balancing will be done by the system software and is completely transparent to the user. This often leads to a significant reduction of code complexity. Our prototype of the TDC model on an 128 node nCUBE 2 uses a distributed diffusion scheme to balance load dynamically. We have developed a task selection strategy which reduces the load balancing overhead. Measuring and simulation results for a parallel implementation of a block matching algorithm indicate that runtime efficiency close to the optimum can be achieved with the TDC model even for highly parallel systems.
Keywords: Programming Model, Dynamic Load Balancing, Block Matching, Distributed Memory Multiprocessors
Information from authors:
Andreas Erzmann received his diploma in electrical engineering from the University of Hannover, Germany, in 1990 and is currently working towards the doctor's degree at the institute for computer architecture and operating systems of the University of Hannover, Germany.
Michael Hadeler is a student of electrial engineering (technical informatics) at the University of Hannover, Germany.
Prof. C. Müller-Schloer is the head of the institute for computer architecture and operating systems of the University of Hannover, Germany.
Further information about the authors, their institution can be obtained from our WWW-Server.
Abstract:
This paper describes a mapping toolbox, whose aim is to optimize the execution time of parallel programs described as task graphs on distributed memory parallel systems. The toolbox includes several classical mapping algorithms. It was assessed by computing the mapping of randomly generated task graphs and by mapping and executing on a parallel system synthetic programs representing some classical numerical algorithms. A large number of experiments were used to validate the cost functions used in the toolbox and to compare the algorithms.
Keywords: Parallel environment, Load-balancing, Mapping.
Information from authors:
The work described in this paper was part of the PhD research of
Pascal Bouvry.
It was done in the framework of the
APACHE
project whose aim is to design and implement a parallel
programming environment for parallel computers, providing both static
and dynamic load balancing facilities.
The authors can be reached by s-mail at:
IMAG, APACHE project
46 avenue Félix Viallet,
F-38031 Grenoble Cedex 1, France.
and by e-mail at:
Pascal.Bouvry@cwi.nl
Abstract:
The paper considers the well--known problem of LU decomposition to study a method to derive data distributions for parallel computers with a distributed memory organization. The importance of the paper lies not so much in the special application but with the principle that the problem of finding an optimal data distribution is formulated as an optimization problem. This is possible by using a parameterized data distribution and a rigorous performance prediction technique that allows us to derive runtime formulas containing the parameters of the data distribution. The parameters are determined % by a mathematical optimization technique in such a way that the total runtime is minimized, thus also minimizing the communication overhead and the load imbalance penalty.
Keywords: LU decomposition, data distribution, parallel computers, performance prediction, computation model
Abstract:
Global predicates in parallel programs are predicates considering the state of more than one process. They are a useful concept for debugging parallel programs, e.g., for specifying assertions or breakpoints. In this paper $\exists$-predicates are defined and examined, a restricted class of global predicates. $\exists$-predicates are defined by two local predicates which have to be simultaneously satisfied by two different processes. Such predicates are frequently needed to express synchronization properties. Efficient centralized and parallel algorithms for detecting satisfaction of $\exists$-predicates are proposed. Furthermore, it is outlined how $\exists$-predicates can be used for global breakpoints and where to stop a parallel program reaching such a breakpoint. The underlying machine models is a fixed set of processes communicating by message passing or shared memory.
Keywords: Global predicates, testing, debugging, parallel debugger
Information from authors:

Abstract:
In this paper we propose a knowledge-based approach for solving data dependence testing and loop scheduling problems. A rule-based system, called the K test, is developed by repertory grid and attribute ording table to construct the knowledge base. The K test chooses an appropriate testing algorithm according to some features of the input program by using knowledge-based techniques, and then applies the resulting test to detect data dependences for loop parallelization. Another rule-based system, called the KPLS, is also proposed to be able to choose an appropriate scheduling by inferring some features of loops and assign parallel loops on multiprocessors for achieving high speedup. The experimental results show that the graceful speedup obtained by our compiler is obvious.
Keywords: Parallelizing compiler, data dependence testing,loop parallelization, parallel loop scheduling, knowledge-based, repertory grid analysis, speedup.
Abstract:
Optimizing communication is a key issue in compiling data-parallel languages for distributed memory architectures. We examine here the case of cyclic distribution, and we derive symbolic expressions for communication sets under the only assumption that the initial parallel loop is defined by affine expressions of the indices. This technique relys on unimodular changes of basis and provides a fully explicit SPMD code, wich is efficient as it does not include any integer divide.
Analysis of the properties of communications leads to a tiling of the local memory addresses that provides maximal message vectorization in the most general case. Many actual program present some pecularities simplifying the compilation process and the output code. We show that our output code regularly improves with the simplicity of the input code
Keywords: Data-Parallelism, SPMD, Compilation, Code Generation, Message Vectorization,
Information from authors:
Cécile Germain is an Assistant Professor at Université Paris-Sud.
She is with the Parallel Architecture group in the Laboratoire de Recherche en Informatique .
Her research interests include interconnection networks for parallel architectures, randomized routing, parallel computing, and compilers for parallel architectures.
Contact and Information
Franck Delaplace is an assistant Professor at University of Evry (France)
He is with the Parallel Research Team.
His research interests are :
Abstract:
This paper presents a general method to compact the first-order part of functional languages with call-by-value semantics for fine-grain parallel machines like VLIW or super-scalars. This work extends previous works on compaction in two ways. First, it defines a new formal system for the compaction problem usable to design a meta-compiler for these machines. Second, the compaction is directly applied to functional expressions instead of graph based representations (control flow or dependence flow based representations) leading to a very uniform and simple presentation.
Keywords: fine-grain parallelization, scheduling, software-pipelining, functional languages
Abstract:
We present an intermediate representation called ThreadTDF, a component of the Parallel TDF system for compiling distributed concurrent programs to shared and distributed memory multiprocessors. ThreadTDF is a parallel extension of the TDF architecture neutral distribution format (ANDF) for sequential programs. ThreadTDF provides {\em featherweight} thread mechanisms for explicitly scheduling dynamic fine-grain concurrent computations within procedures (and more generally within static local scopes). Communication between address spaces is supported by remote service request mechanisms based upon asynchronous activation of remote threads and synchronous remote procedure calls. In ThreadTDF variable lifetimes bound the lifetimes of featherweight threads declared in their scope. We show how a compiler uses thread lifetime information to integrate resource allocation and communication with thread scheduling for efficient intraprocedural concurrency. Initial performance results are given for the SPARC processor.
Keywords: concurrency, compilation, ANDF, TDF, intermediate representation, thread, hierarchical scheduling, context switch
Information from authors:
Ben Sloman is a postgraduate student studying compilation for parallel
machines at Reading University (UK) and at GLOSSA (a SME also based in
Reading). He has worked for the last four years with the TDF
intermediate representation, initially investigating vectorisation of
TDF and now looking at extensions for compiling explicit parallelism.
His research interests include processor architecture, parallel
systems, compiler analysis, and graph algorithms.
Tom Lake was local organiser for CONPAR'88 and the international
workshop series `Compilers for Parallel Computers'. He has worked with
TDF during much of its evolution and is currently examining commercial
applications in the fields of parallel computation, reverse engineering
and distributed discrete event simulation.

Abstract:
A processor pool is a homogeneous collection of processors that are used for computationally intensive tasks, such as parallel programs. Processor pools are far less expensive than multicomputers and more convenient to use than collections of idle workstations. This paper gives a case study in parallel programming on a processor pool with 80 SPARCs connected by an Ethernet, running the Amoeba distributed operating system. We use a realistic application (N-body simulation of water molecules) and show that a good performance can be obtained. We measured a speedup of 72 on 80 processors.
Keywords: distributed system, parallel N-body simulation
Information from authors:
WWW Home pages: http://www.cs.vu.nl/~john
http://www.cs.vu.nl/~bal
Abstract:
Shared Disk database systems offer a high flexibility for parallel transaction and query processing. This is because each node can process any transaction, query or subquery because it has access to the entire database. Compared to Shared Nothing database systems, this is particularly advantageous for scan queries for which the degree of intra-query parallelism as well as the scan processors themselves can dynamically be chosen. On the other hand, there is the danger of disk contention between subqueries, in particular for index scans. We present a detailed simulation study to analyze the effectiveness of parallel scan processing in Shared Disk database systems. In particular, we investigate the relationship between the degree of declustering and the degree of scan parallelism for relation scans, clustered index scans, and non-clustered index scans. Furthermore, we study the usefulness of disk caches and prefetching for limiting disk contention in parallel database systems. Finally, we show that disk contention in multi-user mode can be limited for Shared Disk database systems by dynamically choosing the degree of scan parallelism.
Keywords: Parallel Database Systems, Shared Disk, Parallel Query Processing, Disk Contention, Dynamic Load Balancing, Performance Analysis
Information from authors:
Erhard Rahm is full professor of computer science and head of the database group at the University of Leipzig, Germany. He received his Ph.D. in 1988 from the University of Kaiserslautern, Germany, and spent a year as a Visiting Scientist at the IBM T.J. Watson Research Center in Yorktown Heights, NY. Prof. Rahm holds an US patent and is author of three books and over 40 refereed journal and conference publications. His research interests include parallel database system, dynamic load balancing, high performance transaction systems, heterogeneous databases, performance evaluation and geographic information systems.
Thomas Stöhr received his Master's degree in Computer Science in 1993 from the University of Kaiserslautern, Germany. Currently, he is a member of the database group of Prof. Rahm at the Univ. of Leipzig. His research focus is on parallel database systems and dynamic load balancing.
Abstract:
Defining an optimal schedule for arbitrary algorithms on a network of heterogeneous machines is an NP complete problem. This paper focuses on data parallel deterministic neighborhood computer vision algorithms. This focus enables the polynomial time definition of a schedule which minimizes the distributed execution time by overlapping computation and communication cycles on the network. The scheduling model allows for any speed machine to participate in the concurrent computation but makes the assumption of a master/slave control mechanism using a linear communication network. Several vision algorithms are presented which adhere to the scheduling model. The theoretical speedup of these algorithms is discussed and empirical data is presented and compared to theoretical results.
Keywords: Computer Vision, Heterogeneous Architectures, Scheduling, Distributed Algorithms
Information from authors:
Abstract:
It is shown in [GhoshHwang89] that mapping neural networks onto existing parallel computers leads to an unsatisfactory efficiency, except for irregularly connected networks, with several distinguishable highly connected regions. This paper shows how a four step decomposition of the back-propagation algorithm allows to introduce computation/communication overlapping so as to improve any parallel mapping of differentiable feedforward neural networks. This solution can adapt to both irregular and regular feedforward neural structures. Its computational complexity is estimated for multilayered networks. Unlike most network partitioning schemes, it can deal with multilayer networks using non-standard neurons, such as wavelet networks. Unlike a pattern partitioning algorithm, it is able to implement the stochastic gradient learning algorithm. Numerical results show that this solution should be considered as soon as communication overlapping is available.
Keywords: feed-forward neural networks, parallel implementation, back-propagation
Information from authors:
Bernard GIRAU is a PhD student at the LIP in Lyon, France. He is working about parallel algorithms for neural networks. He is also working about a theoretical model for differentiable feed-forward neural networks.

Abstract:
``Super Monaco'' is the successor to Monaco, a shared-mem\-ory multiprocessor implementation of a flat concurrent logic programming language. While the system retains, by-and-large, the older Monaco compiler and intermediate abstract machine, the intermediate code translator and the runtime system have been completely replaced, incorporating a number of new features intended to improve robustness, flexibility, maintainability, and performance. There are currently two native-code backends for 80x86-based and MIPS-based multiprocessors. The runtime system, written in C, improves upon its predecessor with better memory utilization and garbage collection, and includes new features such as an efficient termination scheme and a novel variable binding and hooking mechanism. The result of this organization is a portable system\footnote{Available by anonymous ftp from {\tt ftp.cs.uoregon.edu:pub/sm.tar.gz}.} which is robust, extensible, and has performance competitive with C-based systems. This paper describes the design choices made in building the system and the interfaces between the components.
Keywords: logic programming, parallelism, native code, runtime systems
Information from authors:
Jim Larson, jim@cs.uoregon.edu
Abstract:
Quiescence detection is a fundamental facility for parallel and distributed processing. This paper describes schemes for quiescence detection in a distributed KLIC implementation. KLIC is a portable implementation of concurrent logic programming language KL1.
Termination is detected using the weighted throw counting (WTC) scheme. Based on the scheme a scheme for global suspension was invented. The postmortem system built-in predicate which provides meta programming facilities was designed, and its distributed implementation is also presented. 11~
Keywords: concurrent logic programming, termination detection, quiescence detection, distributed implementation, weighted reference counting
Information from authors:
Kazuaki Rokusawa :
Research and Development Group
Oki Electric Industry Co., Ltd.
rokusawa@okilab.oki.co.jp or rokusawa@icot.or.jp
Akihiko Nakase :
Research and Development Center
Toshiba Corporation
nakase@isl.rdc.toshiba.co.jp or nakase@icot.or.jp
Takashi Chikayama :
Department of Electronic Engineering, University of Tokyo
chikayama@fuchi.t.u-tokyo.ac.jp or chikayama@icot.or.jp
Abstract:
We describe the compiler analyses of Reform Prolog and evaluate their effectiveness in eliminating suspension and locking on a range of benchmarks. The results of the analysis may also be used to extract non-strict independent and-parallelism.
We find that 90% of the predicate arguments are ground or local, and that 95% of the predicate arguments do not require suspension code. Hence, very few suspension operations need to be generated to maintain sequential semantics. The compiler can also remove unnecessary locking of local data by locking only updates to shared data; however, even though locking writes are reduced to 52% of the unoptimized number for our benchmark set, this has little effect on execution times. We find that the ineffectiveness of locking elimination is due to the relative rarity of locking writes, and the execution model of Reform Prolog, which results in few invalidations of shared cache lines when such writes occur.
The benchmarks are evaluated on a cache-coherent KSR-1 multiprocessor with physically distributed memory, using up to 48 processors. Speedups scale from previous results on smaller, bus-based multiprocessors, and previous low parallelization overheads are retained.
Keywords: Compilation, locality optimizations, interprocedural dataflow analysis, cache-coherent distributed memory, parallelism, Prolog
Information from authors:
Thomas Lindgren, Johan Bevemyr Häkan Millroth, Computing Science Department, Uppsala University. Reform Prolog project.

Abstract:
A hierarchy of unidirectional rings has been used successfully in distributed shared-memory multiprocessors. The fixed cluster size of the hierarchy prevents full exploitation of communication locality. The bidirectional ring is presented as an alternative to the hierarchy. Its relative performance is evaluated for a variety of memory access patterns and network sizes. It gives superior performance for low communication locality and for large networks. Another useful feature of the bidirectional ring is that the network load tends to be balanced over the two constituent unidirectional rings. These features make the bidirectional ring an attractive possibility as a network structure for scalable NUMA multiprocessors.
Keywords: ring, interconnection network, NUMA multiprocessor, communication locality, performance.
Information from authors:
Muhammad Jaseemuddin:
Muhammad Jaseemuddin is a PhD candidate in the department of electrical
and computer engineering at the University of Toronto. His research
interests include parallel processing, distributed computing, computer
architecture, and local area networks. He received his BS from
NED University of Engg. and Tech., Karachi, Pakistan, in 1989,
and MS from The University of Texas at Arlington, Texas, USA, in 1991.
He is a student member of ACM, and IEEE.
URL: http://www.eecg.toronto.edu/~jaseemud/
Zvonko G. Vranesic:
Zvonko G. Vranesic is a professor of electrical and computer engineering
at the University of Toronto. His research interests include computer
architecture, VLSI systems, fault-tolerant computing, local area
networks, and many-valued switching systems. He has had a major role in the
design of Hector and NUMAchine multiprocessors. He was a senior
visitor at the Computer Laboratory at Cambridge University, England, and
at the Institute of Programming at the University of Paris.
Vranesic received his BS, MS, and PhD from the University of Toronto in
1963, 1966, and 1968, resprectively. He is a member of the Association
of professional Engineers of Ontario and senior member of the IEEE.
URL: http://www.eecg.toronto.edu/EECG/EECGProfStaff.html
Abstract:
The Mcube network has been previously proposed by us as a highly recursive and symmetrical interconnection network based on twisted links. The Mcube topology has been developed and defined in terms of the structural constraints between components rather than relations between binary strings because one of the chief goals of the construction is structural symmetry. The salient features of the Mcube interconnection include a uniform distance distribution, almost half the diameter of a comparable hypercube, a lower average internode distance than most other twisted networks, uniform traffic flow through every node, a lower link saturation rate than hypercubes under uniformly random traffic, good recursive partitionability and the ability to execute several parallel algorithms as fast as or faster than a hypercube. We establish the node distribution for the Mcubes to be uniform with the internode distance. Based on the constraints imposed by the construction and the routing algorithm, we derive an expression for the average internode distance for Mcubes. We also introduce an efficient broadcasting algorithm for the Mcube interconnection and show how Mcubes can be reconfigured into hypercubes. Finally, we show how representative parallel algorithms can be directly implemented on the Mcube topology.
Keywords: hypercube, interconnection network, massively parallel, twisted cube
Information from authors:
Nitin K. Singhvi is a Ph.D. student in the department of Computer Science at
the State University of New York at Binghamton. His research interests include
interconnection networks, WDM-based distributed shared memory systems, cache
coherence protocols and task scheduling.
Kanad Ghose is an associate professor in the Department of Computer Science
at the State University of New York at Binghamton. His research interests
are in the area of high-performance architectures, parallel systems and
compiling techniques. Some of the research projects he is currently leading
include the following:
SNOW: Hardware-supported shared memory over a network of workstations: prototype
design and implementation of a cluster of workstations (NOW) with hardware
support for shared memory coherence to allow a network of workstations to be
used as a tightly-coupled system. The current SNOW prototype consists of eight
4-CPU Sparc 20 nodes, FPGA device based high data rate custom SBus network interface
cards, running a modified and extended version of Solaris 2.4.
A fundamental goal of this project is to provide low end-to-end latencies
to promote very tight coupling among the network nodes. Our use of
programmable network interface cards allow for experimentation with degree and nature
of support for shared memory, including support for coherent caches and barrier
synchronization. This work is primarily supported by the NSF.
Prototype Design & Implementation of a Signal Processing Multiprocessor: This
work involves the design and implementation of a 64-node DSP multiprocessor
using a novel reconfigurable interconnection. This work is sponsored by
the US Air Force.
Architectural Techniques for Low Power: This project examines and proposes
architectural techniques for reducing power dissipation in high performance
microprocessors. This work is supported by the NSF.
Scheduling Tool for Distributed Memory Multiprocessors: This work is supported
by CAE-Link corporation.
A MPP-Based video Sever: This project involves the design and implementation
of a video server based on a 512-node massively parallel processor. This
project is supported by the State of New York and ForeFronts Inc. through
the SPIR program.
A Toolkit for Distributed Real-Time Scheduling: This project involves the
design and implementation of a X-windows based toolkit for analyzing the
performance of distributed real-time applications. It also allows users to
verify timing requirements and experiment with a variety of real-time
scheduling algorithms, both global and local. This tool accurately models
delays in system calls, delays in the interconnections, scheduling overhead
etc. The system incorporates a simple language and its compiler to allow
users to define the characteristics of the system and the tasks. In addition
to providing canned descriptions of standard interconnections (Futurebus II,
Ethernet etc.), users can specify any arbitrary interconnection. This
project is co-directed by Prof. S. Aggarwal and is supported by Loral Federal
Systems, CAE-Link and NYSEG.
Abstract:
This paper presents multiwave interconnections -- optical interconnections that employ wavelength components as multiplexable information carriers -- for next-generation multiprocessor systems using MCM technology. A hypercube-based multiprocessor network called the multiwave hypercube (MWHC) is introduced, where multiwave interconnections provide highly-flexible dynamic communication channels. A performance analysis shows that the use of integrated multiwave optics enables the reduction of network complexity on a MCM substrate, while supporting low-latency message routing.
In this paper, we also present the experimental fabrication of wavelength detectors for the proposed system, and discuss the physical limit on the number of wavelength components at the present state of technology.
Keywords: Multichip module (MCM), opticalinterconnections, wavelength division multiplexing (WDM), interconnection networks, parallel processing, message-passing multiprocessors
Information from authors:
Shinichi Shionoya is currently a student in the Graduate
School of Information Sciences at Tohoku University, Japan.
His research interests include multi-wavelength optical
computing.
He received B. E. degree from Tohoku University in 1994.
He is a student member of the Institute of Electronics,
Information and Communication Engineers of Japan.
Takafumi Aoki is a research associate in the Graduate School
of Information Sciences at Tohoku University, Japan. His
research interests include the VLSI architectures based on
multiple-valued logic, multi-wavelength optical computing
and biomolecular computing.
He received the Outstanding Paper Award at the 1989 IEEE
International Symposium on Multiple-Valued Logic and the
Outstanding Transactions Paper Award from the Institute of
Electronics, Information, Communication Engineers of Japan
in 1989 and IEE Ambrose Fleming Premium Award in 1994.
He received the B. E., M. E., and D. E. degrees in
electronic engineering from Tohoku University in 1988, 1990,
and 1992, respectively. He is a member of the IEEE Computer
Society, the Society of Instrument and Control Engineers of
Japan, and the Institute of Electronics, Information and
Communication Engineers of Japan.
Tatsuo Higuchi received the B.E., M.E. and D.E. degrees in
electronic engineering from Tohoku University, Sendai,
Japan, in 1962, 1964 and 1969, respectively. He was a
Professor in the Department of Electronic Engineering at
Tohoku University, and is currently a Professor and Dean of
the Graduate School of Information Sciences at Tohoku
University. His general research interests include the
design of 1-D and multi-D digital filters, linear
time-varying system theory, fractals and chaos in digital
signal processing, VLSI computing structures for signal and
image processing, multiple-valued ICs, multiwave
opto-electronic ICs, interconnection-free bio-electronic ICs
and fault-tolerant computing.
He served as the Program Chairman of the 1983 IEEE
International Symposium on Multiple-Valued Logic (ISMVL),
and the Symposium Chairman of the 1992 IEEE ISMVL. He
received the Outstanding Paper Awards at the 1984, 1985,
1987, and 1989 IEEE International Symposiums on
Multiple-Valued Logic, the Outstanding Transactions Paper
Award from the Society of Instrument and Control Engineers
of Japan in 1984, the Technically Excellent Award from the
Society of Instrument and Control Engineers of Japan in
1986, the Outstanding Transactions Paper Award from the
Institute of Electronics, Information and Communication
Engineers of Japan in 1989, the Technically Excellent Award
from the Robotics Society of Japan in 1990, and IEE Ambrose
Fleming Premium Award in 1994. He is a member of the
Society of Instrument and Control Engineers of Japan, and
Robotics Society of Japan. He is an IEEE fellow.

Abstract:
We define the master-slave multiprocessor scheduling model and provide several applications for the model. $O(n \log n$) algorithms are developed for some of the problems formulated and some others are shown to be NP-hard.
Keywords: multiprocessor scheduling, complexity
Abstract:
We explain a new job scheduling class, called "Time Space Sharing Scheduling" (TSSS) for partitionable parallel machines. TSSS is a combination of time-sharing and space-sharing job scheduling techniques. Our proposed "Distributed Queue Tree" (DQT) is an instance of TSSS. We evaluate and analyze DQT behavior in more detail with a number of simulations. The result shows that DQT performs very well in low-load to high-load situations, almost independent of system size and task size distribution. We also compare our DQT and ScanUp batch scheduling, and we find that our DQT performs as well as ScanUp scheduling in processor utilization, but that both DQT and ScanUp have drawbacks in terms of scheduling fairness. Finally, we find that TSSS can inherently achieve higher processor utilization.
Keywords: Job Scheduling, Parallel Operating System, Simulation, Time-Sharing, Space-Sharing
Information from authors:
Abstract:
This paper describes a class of algorithms for scheduling parallel programs represented by macro dataflow graphs (task precedence graphs onto a multiprocessor system such that the total execution time is minimized. The schedule will be computed dynamically during the runtime of the process system. The model allows to represent centralized and fully distributed algorithms as well as intermediate forms. The algorithms are able to schedule static as well as dynamic dataflow graphs. Knowledge of the execution times of the tasks is not necessary. Some variants of the model have been implemented using a multi-transputer system. Practical experiences are included in the paper.
Keywords: dynamic dataflow graphs, dynamic task scheduling, distributed decision making
Information from authors:
Authors:
Other projects at the Institute for Computer Engineering deal with intelligent automation systems, tools for debugging and performance analysis of parallel applications, fault tolerant massively parallel systems and operator site software.

Abstract:
This paper describes different ways to implement one- and two-dimensional Fast Fourier Transforms, FFTs, on the linear SIMD architecture family VIP.
VIP, Video Image Processor, is a bit-serial linear array SIMD architecture developed at the Image Processing Laboratory at Linköping University aiming at real-time low-level image and radar signal processing.
Since the processing elements in the linear array only communicate via a shiftregister there is a large communication overhead to PEs far away in the array but we nevertheless obtain very fast FFT implementations.
We show for instance that it is possible to build a 512-PE VIP system capable of Fourier transforming 5122 images at video rate (25 frames per second).
Keywords: FFT, Linear Array, SIMD, Bit-serial
Information from authors:
Mattias Johannesson is a PhD student working in the projects VLSI of Vision and the VIP project.
The work is supervised by Prof. Per-Erik Danielsson and Dr Anders Åström.
Mattias presented 1993 a licentiate thesis called "Sheet-of-light Range Imaging", which
concerns fast sheet-of-light range imaging with different smart sensor architectures.
In the VIP project we work on linear SIMD architectures for Image and Radar signal processing.
One of our design is currently being developed by Ericcson Microwave systems.
For more information look me up on my home page,
it's found
here
Abstract:
In this paper, we present a reconfiguration approach to identify the maximal fault-free subcube-ring for tolerating faults in faulty hypercubes. The fault-free subcube-ring is connected by a ring of fault-free subcubes with dilation 3. By exploiting the size of fault-free subcubes as large as possible, the maximal fault-free subcube-ring with higher processor utilization is obtained. Using this approach, we can tolerate more than n faults in n-dimensional hypercubes. To demonstrate the fault- tolerant capability of our approach, we implement a fault-tolerant matrix-multiplication algorithm on the nCUBE/2E hypercube machine with 32 processors. The simulation results show that our reconfiguration approach has low performance slowdown and high processor utilization.
Keywords: Algorithms, fault tolerance, hypercubes, matrix-multiplication
Information from authors:
JANG-PING SHEU
Dr. Sheu is a member of the IEEE Computer Society and Phi Tau Phi Society.
YUH-SHYAN CHEN
Mr. Chen is a student member of the IEEE Computer Society.
See more information in High-Speed Computing Lab. Room of NCU.
Abstract:
A technique to enhance multicomputer routers for fault-tolerant routing with modest increase in routing complexity and resource requirements is described. This method handles solid faults in meshes, which includes all convex faults and many practical nonconvex faults, for example, faults in the shape of L or T. As examples of the proposed method, adaptive and nonadaptive fault-tolerant routing algorithms using four virtual channels per physical channel are described.
Keywords: Fault tolerance, mesh networks, nonconvex faults, routing.
Information from authors:
In the last three years, Chalsani and
Boppana
have worked on the design of new wormhole routing algorithms,
fault-tolerant message communication, and multicast communication
for multicomputers.
Students interested in pursuing Ph.D. in Computer Science
at UTSA can browse its web site for more information about
UTSA Computer Science Program
and the faculty.

Abstract:
A rewriting based approach to dynamical parallelization of a general class of sequential imperative programs by means of the algebraic programming system APS is proposed. It gives advantages of rapid prototyping and evolutionary development of efficient parallelizers. The poster contains brief explanation of the APS system, techniques for designing efficient parallelizers and an example of parallel program development.
We follow an integrated compiler/interpreter approach to parallel software design having in mind that writing interpreter is much easier task than developing compiler but interpreters are commonly known to be inefficient. The emphasize is maden on program transformations and dynamical issues of parallelization that can be solved in compile-time and run-time respectively at low cost.
Keywords: dynamic parallelization, algebraic programming, rapidprototyping, evolutionary parallel program development.
Abstract:
Aurora Prolog has been extended with uncertainty handling and experiments show a reasonable speedup on multiple workers. There is a bigger scope for exploiting parallelism in an uncertain knowledge base than in an ordinary logic knowledge base because of the nature of calculating uncertainty.
The asynchronous database handling predicates of Aurora are useful for implementing efficient information transfer between the parallel branches of the search tree. In a parallel environment one has to pay attention to careful use of the non-logical features of Prolog (such as the cut operator), as improper use may lead to serious performance degradation.
Keywords: Aurora, Prolog, parallel, uncertainty, support logic
Information from authors:
This paper describes a part of the CUBIQ toolset that is built on top of Aurora, an OR-parallel implementation of Prolog for shared memory multiprocessors. Cubiq introduces frames, blackboard handling and uncertainty handling.
Abstract:
Qualitative simulation is applied more and more in design, monitoring and fault-diagnosis. However, poor performance of current qualitative simulators complicates or even prevents its application in technical environments.
In our research project a special-purpose computer architecture for the qualitative simulator QSIM is developed. Improved performance is achieved by mapping QSIM functions onto a multiprocessor system and executing runtime extensive functions on specialized coprocessors. In this paper, we present a part of this research project - the design and implementation of a specialized coprocessor for the constraint check functions (CCFs) of QSIM. The coprocessor is designed at gate- and register level to obtain maximum execution speed. The coprocessor is implemented in an FPGA of type Xilinx XC4013 running at a clock frequency of 15 MHz. With this FPGA-implementation a runtime improvement of up to 20 is achieved.
Keywords: specialized coprocessor, FPGA, qualitative simulator QSIM
Abstract:
Portable Software Tools for Parallel Architectures (PSTPA) is a five-year programme of collaborative research funded by the UK Engineering and Physical Sciences Research Council. The paper summarises the background and aims of the programme and the set of 15 projects recently started.
Keywords: parallel software methods, tools, user environments, portability, scalability
Information from authors:
Catherine Barnes is Programme Manager for Systems Architectures
in EPSRC which includes the PSTPA Programme.
Chris Wadsworth is Head of the Parallel and Distributed Systems Group
at RAL, with projects in the systems aspects and techniques of parallel
programming and the porting of applications software. He is Technical
Co-ordinator (part-time) of the PSTPA Programme. His URL on WWW is
http://www.cis.rl.ac.uk/people/cpw/contact.html
Abstract:
This paper presents WARPmemory, a multiprocessor extension for networks of workstations. A multiported memory device is connected in a star topology to the workstations, allowing for transparent access to a shared memory. High-speed fiber optic links are used as the interconnect, thus keeping latencies in the order of a microsecond. This is three orders of magnitude lower than a purely software implementation over a local area network (e.g. Ethernet and PVM). Cache coherence is ensured using a directory based protocol. The design of the multiported memory module is elegant. A very fast RISC processor is used as an intelligent switch, much of what would be complex in hardware is migrated to software. WARPmemory allows the advantages of distributed processing and shared memory to be combined at a reasonable cost.
Keywords: Shared Memory, Multiprocessor, Fiber Optic Networks
Information from authors:
Abstract:
One computational model for distributed memory architectures is a set of autonomous objects, communicating via message-passing mechanism. Advanced applications within this model may be engineered of components implemented in various programming languages. Thus, particular parts of such applications can be executed in those environments which inherently express the nature of computation. This allows mixing various programming and execution paradigms (e.g. imperative, functional, object-oriented or logic) in one, software-heterogeneous system. The MODIMOS system aims at monitoring of this kind of systems. Its application area covers visualization of their structure, activity and management of their logical configuration, thus enabling appropriate tuning. The characteristic features of MODIMOS are: configurability, information management & filtering mechanisms, expandability of monitored environments set.
Keywords: monitoring, software visualization, software heterogenity, multi-paradigm programming
Information from authors:
Authors:
Abstract:
We describe and relate two data-parallel semantics for the simply-typed lambda-calculus and obtain a semantic function expressible in its object language. The explicitly distributed semantics allow the formal specification of the interpreter's effect on data partitioning and communications.
The CMU Scandal project's VCODE demonstrates the usefulness of a parallel intermediate language abstract enough to be portable yet explicit about parallel data. DPML and its latest implementation Caml FLight, were designed with these goals in mind. Caml Flight extends the sequential language Caml Light in a deterministic and portable way but looses its metacircularity, a symptom of a loss of expressive power. A language which can express its own interpreter is called metacircular. This paper describes the semantics of a metacircular DPML-like language.
Keywords: key, words, come, here, separated, by, commas
Abstract:
One way to obtain higher computational performance of a computer H is to use a parallel coprocessor unit C which is attached to H and can efficiently solve computationally demanding applications. If such a unit C can be used for concurrent execution of several independent executable programs, the problem of efficient run-time (i.e. dynamic) allocation of programs to C arises. Our aim is to find an allocation strategy that maximizes throughput of the overall system H-C.
We present two dynamic allocation approaches for C with mesh-connected architecture. In both cases we reduce the allocation problem to the weighted bipartite matching problem which must be solved during the run-time each time a set of waiting executable programs is to be allocated space in C. The two approaches differ in the way they search for free space in C.
Keywords: mapping optimization, run-time allocation, mesh-connected architecture
Information from authors:
Abstract:
The PROGRESS system being implemented at theInstitute of Informatics Systems in Novosibirsk is discussed. The system is intended to support rapid prototyping of compilers for high level languages (e.g. Fortran-77, Modula-2, SISAL) and for a family of architectures exploited fine-grained parallelism. The next goal of the project is to develop an environment for investigation of optimizing and restructuring transformations of programs to be parallelized.
Keywords: parallel processing, fine-grainarchitectures, transformational approach, program restructuring, multifunctional cooperation
Information from authors:
Vladimir A. Evstigneev is Leading Scientist (staff member)
of the Laboratory for Program Construction and Optimization
of Institute of Informatics System at Novosibirsk (Russian
Academy of Sciences, Siberian Division). He is also
Professor of Novosibirsk State University. Current research
interests include graph theory and its applications,
parallel processing, parallel computer architectures, and
tools. He is the author of the book "Applications of graph
theory in programming" (Moskow, Nauka, 1985) and the
co-author of the book "Graph theory: tree processing
algorithms" (Novosibirsk, Nauka, 1994).
Victor N. Kasyanov is Head of the Laboratory for Program
Construction and Optimization of Institute of Informatics
Systems of the Siberian Division of the Russian Academy of
Sciences and Professor in the Department of Mathematics of
Novosibirsk State University, Russia. His current research
interests include optimizing compilation, program
transformation, annotated programming, data flow analysis,
parallel processing, graph theory, computer education. He
has written over hundred papers and fourteen monographs in
theoretical computer science, system programming and graph
theory.
Institute of Informatics Systems
of the Siberian Division of the Russian Academy of Sciences
(http://www.iis.nsk.su)
Novosibirsk State University
(http://www.cnit.nsk.su//univer/english/General.html)
Abstract:
The purpose of this paper is to present this new parallel image compression algorithm. We present implementation results on serveral parallel computers. We also examine load balancing and data mapping problems. We end by presenting a well-suited architecture for Real-Time image compression.
Keywords: Data-Parallelism, Image Compression, Wavelet Transform, Vector Quantization, Huffman Coding
Abstract:
We present the first results of a congestion control scheme for computer wormhole networks. The basic idea of this scheme is to require pre-selected nodes to restrict their packet injection in order to reduce and possibly eliminate congestion in the whole network. The scheme employs randomization to select nodes.
Our preliminary performance measurements indicate clearly that the proposed scheme improves network performance. Some of the remaining studies are implementation cost, such as the impact of the scheme on the input/output queuees in the nodes. Another further study would be to compare such flow control schemes with other trends such as that of adaptive routing algorithms.
Keywords: Parallel Architectures, Interconection Networks, Wormhole Networks, Congestion Control, Design and Performance

Pages developed and maintained by psm@sics.se. Please direct comments or bug reports to europar95-www@sics.se