---------------------------------------------------------------------
File: Ravi_Stumm_ICPP95.ps.Z

Title: Hierarchical Ring Topologies and the effect of their Bisection
       Bandwidth Constraints

Authors: G. Ravindran and M. Stumm

Where  : Proc. Intl. Conf. on Parallel Processing, pp.I/51-55, 1995

Keywords: Multiprocessor architectures, Interconnection networks,
          Hierarchical rings, Bisection bandwidth

Abstract:

Ring-based hierarchical networks are interesting alternatives to
popular direct networks such as 2D meshes or tori. They allow for
simple router designs, wider communications paths, and faster networks
than their direct network counterparts. However, they have a constant
bisection bandwidth, regardless of system size. In this paper, we
present the results of a simulation study to determine how large
hierarchical ring networks can become before their performance
deteriorates due to their bisection bandwidth constraint. We show that
a system with a maximum of 128 processors can sustain most memory
access behaviors, but that larger systems can be sustained, only if
their bisection bandwidth is increased.

---------------------------------------------------------------------
File: Ravi_Stumm_JIEICE96.ps.Z

Title : A Comparison of Blocking and Non-blocking Packet Switching
        Techniques in Hierarchical Ring Networks

Authors: G. Ravindran and M. Stumm

Where : IEICE Trans. Inf. & Syst., vol. E79-D, No. 8, August 1996

keywords:

Networks, Switching, Wormhole, Virtual Cut-through, Hierarchical Ring Networks,
Slotted Rings

Abstract :

This paper presents the results of a simulation study of blocking and
non-blocking switching for hierarchical ring networks.
The switching techniques include wormhole, virtual cut-through, and
slotted ring.
We conclude that slotted ring network performs better than
the more popular wormhole and virtual cut-through networks.
We also show that the size of the node buffers is an important parameter and
that choosing them too large can hurt performance in some cases.
Slotted rings have the advantage that the choice of buffer size is easier in
that larger than necessary buffers do not hurt performance and hence a single
choice of buffer size performs well for all system configurations.
In contrast, the optimal buffer size for virtual cut-through and wormhole
switching nodes varies depending on the system configuration and the level
in the hierarchy in which the switching node lies.

---------------------------------------------------------------------
File: Zhou_Brecht_SM91.ps.Z

Title: Processor Pool-Based Scheduling for Large-Scale NUMA Multiprocessors

Where: Appears in: Proceedings of  the  1991  ACM  SIGMETRICS  Conference  on
Measurement and Modeling of Computer Systems, May (1991), pp. 133-142.

Authors: Songnian Zhou and Timothy Brecht

Keywords: NUMA, Schedulling, multiprocessor performance

Abstract:
Large-scale  Non-Uniform  Memory  Access  (NUMA)  multiprocessors  are
gaining  increased attention due to their potential for achieving high
performance through the replication of relatively  simple  components.
Because  of  the complexity of such systems, scheduling algorithms for
parallel  applications  are  crucial  in  realizing  the   performance
potential  of  these  systems.  In particular, scheduling methods must
consider the scale of the system, with  the  increased  likelihood  of
creating  bottlenecks,  along  with  the  NUMA  characteristics of the
system, and the benefits to be gained  by  placing  threads  close  to
their code and data.

We propose a class of scheduling algorithms based on processor  pools.
A processor pool is a software construct for organizing and managing a
large number of processors by dividing them into groups called  pools.
The  parallel  threads  of  a  job are run in a single processor pool,
unless there are performance advantages for a  job  to  span  multiple
pools.   Several  jobs may share one pool.  Our simulation experiments
show that processor pool-based scheduling may effectively  reduce  the
average  job  response time.  The performance improvements attained by
using processor pools increase with the  average  parallelism  of  the
jobs, the load level of the system, the differentials in memory access
costs, and the likelihood of having system bottlenecks.  As the system
size   increases,  while  maintaining  the  workload  composition  and
intensity, we observed that processor pools can  be  used  to  provide
significant  performance  improvements.   We  therefore  conclude that
processor pool-based scheduling may  be  an  effective  and  efficient
technique for scalable systems.


---------------------------------------------------------------------
File: Brecht_SEDMS93.ps.Z

Title: On the Importance of Parallel Application Placement
                       in NUMA Multiprocessors

Authors: Timothy Brecht

Where: Proceedings of the Fourth Symposium on Experiences  with
Distributed  and  Multiprocessor  Systems  (SEDMS  IV), San Diego, CA,
September, 1993.

Keywords: NUMA, multiprocessor scheduling, multiprocessor performance  

Abstract:
The thesis of this paper is that scheduling decisions in  large-scale,
shared-memory,  NUMA  (Non-Uniform Memory Access) multiprocessors must
consider not only how many processors, but also  which  processors  to
allocate  to  each  application.   We  call  the  problem of assigning
parallel  processes  of  an  application  to  processors   application
placement.

We explore the importance of  placement  decisions  by  measuring  the
execution  time  of  several  parallel  applications  using  different
placements on a shared-memory NUMA  multiprocessor.   The  results  of
these  experiments  lead  us  to conclude that, as expected, in small-
scale mildly NUMA multiprocessors, placement  decisions  have  only  a
minor affect on the execution time of parallel applications.  However,
the  results  also  show  that  placement  decisions  in   large-scale
multiprocessors  are  critical  and  localization  that  considers the
architectural clusters inherent in these systems  is  essential.   Our
experiments  also  show  that  the  importance  of placement decisions
increases substantially with the size and NUMAness of the  system  and
that  the  placement  of individual processes of an application within
the  subset  of   chosen   processors   also   significantly   impacts
performance.

---------------------------------------------------------------------

File: Kumar_Kulkarni_ICPP91.ps.Z
      (does not contain figures)
Title: Generalized Unimodular Loop Transformations for 
       Distributed Memory Multiprocessors

Authors: K G Kumar*, D Kulkarni+ and A Basu
        Center for Development of Advanced Computing
        2/1 Brunton Road, Bangalore 560 025, India
* Now at IBM TJ Watson, York Town Heights, NY 10598
+ Now at Dept of Computer Science, University of Toronto,
         Toronto, ON M5S 1A4

Where: International Conference of Parallel Processing -91

Keywords: Parallelizing Compilers, Restructuring Transformations, 
          Loop Partitioning, Iteration Spaces, Dependence Vectors.

                           Abstract
In this paper, we present a generalized unimodular loop
transformation as a simple, systematic and elegant method
for partitioning the iteration spaces of nested loops for
execution on distributed memory multiprocessors. We present
a methodology for deriving the transformations that internalize
multiple dependences in a multidimensional iteration space
without resulting in a deadlocking situation. We then
derive the general expression for the bounds of the
transformed loops in terms of the bounds of the original
space and the transformation matrix elements.

-------------------------------------------------------------------

File: Kumar_Kulkarni_ICS92.ps.Z

Title: Deriving Good Transformations for Mapping
       Nested Loops on Hierarchical Parallel Machines
       in Polynomial Time
Authors: K G Kumar*, D Kulkarni+ and A Basu
        Center for Development of Advanced Computing
        2/1 Brunton Road, Bangalore 560 025, India
* IBM TJ Watson, York Town Heights, NY 10598
+ Dept of Computer Science, University of Toronto,
         Toronto, ON M5S 1A4

Where: International Conference on Supercomputing 92

Keywords: Parallelizing Compilers, Restructuring Transformations,
          Loop Partitioning, Iteration Spaces, Dependence Vectors.

We present a computationally efficient method for deriving
the most appropriate transformation and mapping of a nested
loop for a given hierarchical parallel machine. This method
is in the context of our systematic and general theory of
unimodular loop transformations for the problem of iteration
space partitioning \cite{kandk6}. Finding an optimal mapping or
an optimal associated unimodular transformation is NP-complete.
We present a polynomial time method for obtaining a `good'
transformation using a simple parameterized model of the
hierarchical machine. We outline a systematic methodology
for obtaining the most appropriate mapping.

------------------------------------------------------------------- 

File: Li_Tandri_et_ICPP93.ps.Z

Title: LOCALITY AND LOOP SCHEDULING ON NUMA MULTIPROCESSORS

Authors: Hui Li, Sudarsan Tandri Michael Stumm, and Kenneth C. Sevcik

Where: International Conference on Parallel Processing 93

Keywords: NUMA multiprocessors, Locality, Scheduling 

Abstract:
An important issue in the parallel execution of loops is how to
partition and schedule the loops onto the available processors.
While most existing dynamic scheduling algorithms manage load
imbalances well, they fail to take locality into account and
therefore perform poorly on parallel systems with non-uniform
memory access times.  In this paper, we propose a new loop
scheduling algorithm, Locality-based Dynamic Scheduling (LDS),
that exploits locality, and dynamically balances the load.

--------------------------------------------------------------
File: Sandhu_et_al_PPOPP.ps.Z

Title: The shared  regions  approach to software cache coherence on 
       multiprocessors 

Where: Appears in: Proceedings of  the  1993  ACM  SIGPLAN Symposium on
Principles and Pranctice of Parallel Programming, May (1993).

Authors: Harjinder Sandhu, Benjamin Gamsa and Songnian Zhou 

Keywords: NUMA, cache coherence, multiprocessor performance

Abstract: The effective management of caches is critical to the performance of
applications on shared-memory multiprocessors. In this paper,
we discuss a technique for software cache coherence that is
based upon the integration of a program-level abstraction for shared data
with software cache management. The program-level abstraction,
called {\it Shared Regions}, explicitly relates synchronization objects
with the data they protect. Cache coherence algorithms are presented
which use the information provided by shared region primitives, and
ensure that shared regions are always cacheable by the processors accessing
them. Measurements and experiments of the Shared Region approach on a
shared-memory multiprocessor are shown. Comparisons with other software
based coherence strategies, including a user-controlled strategy and an
operating system-based strategy, show that this approach is able to
deliver better performance, with relatively low corresponding overhead
and only a small increase in the programming effort.  Compared to
a compiler-based coherence strategy, the Shared Regions
approach still performs better  than a compiler that can achieve 90\%
accuracy in allowing cacheing, as long as the regions are a few 
hundred bytes or larger, or they are re-used a few times in the cache.

-------------------------------------------------------------------
File: Wilton_Vranesic_SPDP.ps.Z
Title: 
Architectural Support for Block Transfers in a Shared-Memory Multiprocessor

Authors: Steven J.E. Wilton and Zvonko G. Vranesic

To appear in the Fifth IEEE Symposium on Parallel and Distributed Processing,
Irving, Texas, December 1993

Keywords:

Shared-memory multiprocessor, block transfer support

Abstract:

This paper examines how the performance of a shared-memory multiprocessor
can be improved by including hardware support for block transfers.  A system
similar to the Hector multiprocessor developed at the University of Toronto
is used as a base architecture.  It is shown that such hardware support can
improve the performance of initialization code by as much as 50%, but that
the amount of improvement depends on the memory access behavior of the
program and the way in which the operating system issues block transfer
requests.

----------------------------------------------------------------------

File: Sevcik_Zhou_PERF93.ps.Z

Title: Performance Benefits and Limitations of Large NUMA Multiprocessors

Authors: Kenneth C. Sevcik and Songnian Zhou

Where:  appeared in the Proceedings of Performance '93 , Rome, Italy,
  September 27 to October 1, 1993, pp. 183-204, Elsevier Science Publ.

Abstract: Please see the ps file.

----------------------------------------------------------------------
File: Harz_Sevcik_SC93.ps.Z

Title: Hot Spot Analysis in Large Scale Shared Memory Multiprocessors

Authors:  Karim Harzallah and Kenneth C. Sevcik

Where: will appear in the Proceedings of the Supercomputing '93
Conference, November, 1993, Portland, Oregon.

Abstract: Please see the ps file.

-----------------------------------------------------------------------
File: Sevcik_JPE.ps.Z

Title: Application Scheduling and Processor Allocation in 
Multiprogrammed Parallel Processing Systems

Authors:  Kenneth C. Sevcik

Where:  This paper will appear in a special issue of the journal
  "Performance Evaluation" on the performance evaluation of parallel
  systems in late 1993 or early 1994.

Abstract: Please see the ps file.

-----------------------------------------------------------------------
File :   Holliday_Stumm_IEEETC.ps.Z

Title:   Performance Evaluation of Hierarchical Ring-Based Shared 
         Memory Multiprocessors

Authors: Mark Holliday 
         Dept. of Computer Science, Duke University, Durham, NC 27706

         Michael Stumm
         Dept. of Electrical and Computer Engineering
         University of Toronto, Toronto, Canada M5S 1A4 

Date:    November 1992; revised April 1993

Published: Technical Report CS-1992-18, Duke University
         Accepted for publication in IEEE Transactions on Computers

Keywords: communication locality; hierarchical ring-based networks;
         hot spots; large scale parallel systems; memory banks;
         performance evaluation; prefetching; shared memory multiprocessors;
         simulation.

Abstract:

	This paper investigates the performance of word-packet, slotted
	unidirectional ring-based hierarchical direct networks in the
	context of large-scale shared memory multiprocessors.
	Slotted unidirectional rings are attractive because their
	electrical characteristics and simple interfaces allow for fast
	cycle times and large bandwidths.
	For large-scale systems, it is necessary to use multiple rings
	for increased aggregate bandwidth.
	Hierarchies are attractive because the topology ensures unique
	paths between nodes, simple node interfaces and simple inter-ring
	connections.

	To ensure that a realistic region of the design space is examined, 
	the architecture of the network used in the Hector prototype is 
	adopted as the initial design point.
	A simulator of that architecture has been developed and validated with
	measurements from the prototype.
	The system and workload parameterization reflects conditions
	expected in the near future.

	The results of our study show the importance of system balance on 
	performance. Large-scale systems inherently have large
	communication delays for distant accesses, so processor efficiency
	will be low, unless the processors can operate with multiple 
	outstanding transactions using techniques such as prefetching, 
	asynchronous writes and multiple hardware contexts.
	However with multiple outstanding transactions and only one memory
	bank per processing module, memory quickly saturates.
	Memory saturation can be alleviated by having multiple memory
	banks per processing module, but this shifts the bottleneck to the
	ring subsystem.
	While the topology of the ring hierarchy affects performance,
	the ring subsystem will inherently limit the throughput of the system.
	Hence increasing the number of
	outstanding transactions per processor beyond a certain point only
	has a limiting effect on performance, since it causes some of the
	rings to become congested.
	An adaptive maximum number of outstanding
	transactions appears necessary to adjust for the appropriate tradeoff
	between concurrency and contention as the communication locality 
	changes. We show the relationships between processor, ring and memory 
	speeds, and their effects on performance.
	Using block transfers for prefetching seems unlikely to be advantageous
	in that the improvement in the cache hit ratio needed to compensate for
	the increased network utilization is substantial.


-------------------------------------------------------------------------
File : Curran_Stumm_CS.ps.Z

Title:     A Comparison of basic CPU Scheduling Algorithms for 
           Multiprocessor Unix

Authors:   Stephen Curran  and  Michael Stumm
           Department of Electrical and Computer Engineering
           University of Toronto, Toronto, Canada M5S 1A4

Published: Computer Systems, 3(4), Oct., 1990, pp. 551--579.

Abstract:
	In this paper, we present the results of a simulation study
	comparing three basic algorithms that schedule independent tasks in 
	multiprocessor versions of Unix.
	Two of these algorithms, namely Central Queue and Initial Placement,
	are obvious extensions to the standard uniprocessor scheduling 
	algorithm and are in use in a number of multiprocessor systems.
	A third algorithm, Take, is a variation on Initial Placement,
	where processors are allowed to raid the task queues of the other
	processors.
	Our simulation results show the difference between the
	performance of the three algorithms to be small when
	scheduling a typical Unix workload running on a small, bus-based,
	shared memory multiprocessor.
	They also show that the Take algorithm performs best for those
	multiprocessors on which tasks incur overhead each time they migrate.
	In particular, the Take algorithm appears to be more stable
	than the other two algorithms under extreme conditions.


-----------------------------------------------------------------------

File: Stumm_Unrau_Krieger_USENIX92.ps.Z

Title:  HIERARCHICAL CLUSTERING: A STRUCTURE FOR SCALABLE MULTIPROCESSOR
          OPERATING SYSTEM DESIGN

Authors: Michael Stumm, Ron Unrau, and Orran Krieger 

Where:  Extended version of Clustering Micro-Kernels for Scalability,
        Proc.\ of the Usenix Workshop on Micro-Kernels and
        Other Kernel Architectures, April, 1992.

Abstract: Please see the ps file.

----------------------------------------------------------------------

File: Stumm_Vranesic_White_IPPS93.ps.Z

Title: EXPERIENCE WITH THE HECTOR MULTIPROCESSOR 

Authors: Michael Stumm, Zvonko Vranesic, Ron White

Where:  Extended version of paper with same title
        in Proc.\ Intl.\ Parallel Processing Symposium
        Parallel Systems Fair, 1993, pp.\ 9--16.

Abstract: Please see the ps file.

----------------------------------------------------------------------

File: Krieger_Stumm_Unrau_USENIX92.ps.Z

Title: The Alloc Stream Facility: A redesign of application-level
Stream I/O 

Authors: O. Krieger, M. Stumm, and R. Unrau

Where: Extended version of
``Exploiting the advantages of mapped files for stream I/O''
in Proc.\ of the Winter 1992 Usenix Conference, January, 1992.

Abstract:
This paper describes the design and implementation of a new
application level I/O facility, called the Alloc Stream Facility.
The Alloc Stream Facility has several key advantages.  First,
performance is substantially improved as a result of a)~the
structure of the facility that allows it to take advantage of
system specific features like mapped files, and b)~a reduction in
data copying and the number of I/O system calls.  Second, the
facility is designed for multi-threaded applications running on
multiprocessors and allows for a high degree of concurrency.
Finally, the facility can support a variety of I/O interfaces,
including stdio, emulated Unix I/O, ASI, and C++ streams, in a
way that allows applications to freely intermix calls to the
different interfaces, resulting in improved code reusability.

We show that on several Unix workstation platforms the
performance of Unix applications using the Alloc Stream Facility
can be substantially better that when the applications use the
original I/O facilities.


----------------------------------------------------------------------

File: Krieger_Stumm_DAGS93.ps.Z

Title: HFS: A Flexible File System for large-scale Multiprocessors 

Authors: Orran Krieger and Michael Stumm

Where: Proceedings of the 1993 DAGS/PC Symposium

Abstract:
The {H{\sc urricane}} File System (HFS) is a new file system
  being developed for large-scale shared memory multiprocessors with
  distributed disks. The main goal of this file system is scalability; that is,
  the file system is designed to handle demands that are expected to grow
  linearly with the number of processors in the system. To achieve this goal,
  HFS is designed using a new structuring technique called Hierarchical
  Clustering. HFS is also designed to be flexible in supporting a variety of
  policies for managing file data and for managing file system state. This
  flexibility is necessary to support in a scalable fashion the diverse
  workloads we expect for a multiprocessor file system.


----------------------------------------------------------------------

File: Krieger_etal_ICPP93.ps.Z

Title: A fair fast scalable reader-writer lock 

Authors: O. Krieger, M. Stumm, R. Unrau, and J. Hanna,

Where: Proc. Intl. Conf. on Parallel Processing, 1993.

Abstract:
A reader-writer lock allows either multiple readers to inspect
  shared data or a single writer exclusive access to that data.  On
  shared memory multiprocessors, the cost of acquiring and
  releasing these locks can have a large impact on the performance
  of parallel applications.  Other researchers have shown how to
  implement scalable locks, that is, locks that can become
  contended without resulting in memory or interconnection network
  contention.  This paper describes a new algorithm for a
  reader-writer lock that, while being scalable in the contended
  case, has a low overhead in the uncontended case.  This is
  important because most parallel applications are written so that
  locks are typically uncontended.  The only atomic operation
  required by this algorithm is fetch_and_store and hence
  it can be used on most current multiprocessor systems.
  Experimental results are provided.

----------------------------------------------------------------------

File: Kulkarni_Stumm_Tutorial.ps.Z

Title: Loop and Data Transformations: A tutorial

Authors: Dattatraya Kulkarni and Michael Stumm

Where: Internal document, a tutorial guide.

Abstract:
Hierarchically structured machines appear to be becoming the
dominant parallel computing structure. These systems have
non-uniform access times. We address the problem of restructuring
a possibly sequential program to execute efficiently on such
parallel machines. This restructuring involves transforming and
partitioning the loop structures and the data to so as to improve
{\it parallelism}, {\it static} and {\it dynamic locality}, and {\it
load balance}.
The objective of this paper is to present previous and ongoing work
on loop and data transformations and motivate a {\it unified} framework to
restructuring of a sequence of loops and data so as to execute efficiently on
parallel machines with several levels of hierarchy.

----------------------------------------------------------------------


File: Baru_Zilio_PADS93.ps.Z

Title: Data reorganization in parallel database systems

Author: Chaitanya Baru & Daniel C. Zilio

Where : Proc. of the IEEE Workshop on Advances in Parallel and
Distributed Systems}, Princeton, NJ, pp.102-107, Oct.  1993.

Abstract: Parallel database systems are suitable for use in
applications with high capacity and high performance and availability
requirements.  The trend in such systems is to provide efficient
on-line capability for performing various system administration
functions such as, index creation and maintenance, backup/restore,
reorganization, and gathering of statistics.  For some of these
functions, the on-line capability can be efficiently supported by the
use of ``incremental algorithms", i.e., algorithms that achieve the
function in several, relatively small (i.e., less time-consuming)
steps, rather than in a single, large step.  Incremental algorithms
ensure that only small parts of the database become inaccessible for
short durations as opposed to non-incremental algorithms which may
lock large portions of the database or the entire database for a
longer duration.  In this paper, we discuss issues in providing
concurrent data reorganization capability using incremental algorithms
in parallel database systems.


----------------------------------------------------------------------

File: Kulkarni_Stumm_292.ps.Z

Title: Computational Alignment:
A new, unified program transformation for local and global optimization

Authors: Dattatraya Kulkarni and Michael Stumm

Where: CSRI Tech report 292, ISSN 0834-1648

Abstract:
{\small {\em Computational Alignment} is a new class of program
transformations suitable for both local and global optimization.
Computational Alignment transforms all of the computations of a
{\em portion} of the loop body in order to align them to other
computations either in the same loop or in another loop. It
extends along a new dimension and is significantly more powerful
than linear transformations because $i)$
it can transform subsets of dependences and references;
$ii)$ it is sensitive to the location of data in that it can move
the computation relative to data; $iii)$ it applies to imperfect
loop nests; and $iv)$ it is the first loop transformation that
can change {\it access vectors}. Linear transformations are just a
special case of Computational Alignment.
Computational Alignment is highly suitable for global
optimization because it can transform given loops to access data
in similar ways. Two important subclasses of Computational
Alignment are presented as well, namely, {\em Freeing} and {\em
Isomerizing} Computational Alignment.}

-------------------------------------------------------------
File: Brecht_PhD_303.ps.Z

Title: Multiprogrammed Parallel Application Scheduling in NUMA Multiprocessors

Authors: Timothy B. Brecht

Where: Ph.D. Dissertation - CSRI Technical Report CSRI-303

Abstract:

          The   invention,   acceptance,    and    proliferation    of
     multiprocessors  are  primarily a result of the quest to increase
     computer system performance.   The  most  promising  features  of
     multiprocessors are their potential to solve problems faster than
     previously possible and to solve larger problems than  previously
     possible.    Large-scale  multiprocessors  offer  the  additional
     advantage of being able to execute multiple parallel applications
     simultaneously.

          The execution time of a  parallel  application  is  directly
     related  to  the  number  of  processors  it is allocated and, in
     shared-memory   non-uniform    memory    access    time    (NUMA)
     multiprocessors,  which processors it is allocated.  As a result,
     efficient and effective scheduling becomes  critical  to  overall
     system  performance.   In fact, it is likely to be a contributing
     factor in  ultimately  determining  the  success  or  failure  of
     shared-memory NUMA multiprocessors.

          The subjects  of  this  dissertation  are  the  problems  of
     processor  allocation  and  application placement.  The processor
     allocation problem involves determining the number of  processors
     to  allocate to each of several simultaneously executing parallel
     applications and possibly dynamically adjusting those allocations
     to  improve  overall  system performance.  The performance metric
     used is mean response time.   We  show  that  by  differentiating
     between  applications  based on the amount of remaining work they
     have to execute, performance can be improved significantly.  Then
     we  propose  techniques  for estimating an application's expected
     remaining work along with policies for using these  estimates  to
     make  improved  processor  allocation decisions.  An experimental
     evaluation demonstrates the promise of this approach.

          The placement problem involves determining which of the many
     processors  to  assign  to  each  application.  Using experiments
     conducted on a representative  system,  we  demonstrate  that  in
     large-scale  NUMA  multiprocessors the execution time of parallel
     applications is significantly affected by the  placement  of  the
     application.  This motivates the need for new techniques designed
     explicitly  for  NUMA  multiprocessors.   We  introduce  such   a
     technique,   called  processor  pool-based  scheduling,  that  is
     designed to  localize  the  execution  of  parallel  applications
     within  a  NUMA  architecture  and  to isolate different parallel
     applications from each other.  An experimental evaluation of this
     scheduling  method  shows  that  it  can be used to significantly
     reduce mean response time over methods that do not  consider  the
     placement of parallel applications.

-------------------------------------------------------------------
File: Gamsa_MASc.ps.Z

Title: Region-Oriented Main Memory Management
       in Shared-Memory NUMA Multiprocessors


Authors: Benjamin Gamsa

Where: M.Sc. Thesis

Abstract:

     In Non-Uniform Memory Access  time  (NUMA)  multiprocessors,
     distribution of the memory modules facilitates architectural
     scaling,  but creates complications for the programmers  who
     must  be  concerned  with the physical distribution of their
     data in order to obtain good performance.

     In order to reduce the impact of remote  accesses,  in  this
     thesis  we  propose  that  data  be  partitioned into Shared
     Regions that reflect the  granularity  of  data  sharing  in
     programs, and that special synchronization calls be added to
     enforce proper ordering of accesses to the  shared  data  as
     well  as to manage replication and consistency transparently
     to the programmer.

     Results   from   measurements   on   a   16-processor   NUMA
     multiprocessor  and from a model of the system indicate that
     the Shared Regions approach is successful in  obtaining  the
     necessary  locality critical to performance, while incurring
     only minimal overhead.  Data distribution methods  are  also
     observed  to have a significant impact on the performance of
     the  system,  especially  in  the   larger   multiprocessors
     modeled.

-------------------------------------------------------------------

File: Unrau_PhD.ps.Z

Title: Scalable Memory Management through Hierarchical 
       Symmetric Multiprocessing

Authors: Ronald C. Unrau

Where: Ph.D. Disseration

Abstract:

     This dissertation examines scalability issues in  the  design  of
     operating systems for large-scale, shared-memory multiprocessors.
     In particular, the thesis focuses on structuring issues  as  they
     relate to memory management.

     From a set of simple, well-known  queuing  network  formulas,  we
     derive  a  set  of properties that describe sufficient conditions
     for an operating system to scale. From these properties we  first
     develop  a  set of guidelines for designing scalable systems, and
     then develop  a  new  structuring  philosophy  for  shared-memory
     multiprocessor  operating  systems, called Hierarchical Symmetric
     Multiprocessing (HSM).

     HSM  manages  the  system  resources  in  clusters,  using  tight
     coupling  within  a  cluster, and loose coupling across clusters.
     Distributed systems principles are applied  by  distributing  and
     replicating   system   services  and  data  objects  to  increase
     locality,  increase  concurrency,  and   to   avoid   centralized
     bottlenecks,  thus  making  the  system scalable.  However, tight
     coupling is used within a cluster, so the  system  performs  well
     for  local  interactions.  HSM maximizes locality which is key to
     good performance in large systems, and systems based on  HSM  can
     easily  be  adapted  to  different  hardware  configurations  and
     architectures by changing the size of the clusters.  Finally, HSM
     leads  to a modular system composed from easy-to-design and hence
     efficient building blocks.

     Memory  management  is  a  particularly  challenging  service  to
     implement  within  the HSM framework, because it must provide the
     applications with an integrated and coherent  view  of  a  single
     system,  while  distributing and replicating services in order to
     fully exploit the hardware potential. We describe in  detail  the
     implementation  of an HSM structured memory management subsystem,
     and evaluate the performance of our implementation on  Hector,  a
     prototype scalable shared memory multiprocessor.

-------------------------------------------------------------------

File: Wu_MASc.ps.Z

Title: Processor Scheduling in Multiprogrammed Shared Memory 
       NUMA Multiprocessors

Authors: Chee-Shong Wu

Where: M.Sc. Thesis

Abstract:

     In a multiprogrammed multiprocessor, the scheduler  is  not  only
     responsible for deciding when to activate an application and when
     to suspend it, but is also responsible for determining  how  many
     processors  to  allocate  to each application. In a scalable Non-
     Uniform Memory Access  (NUMA)  multiprocessor,  it  must  further
     resolve  the  problem  of which processor(s) to allocate to which
     application since the memory reference times are not the same for
     all processor-memory pairs.

     In this thesis, we study  the  problem  of  how  to  characterize
     parallel   applications  and  how  to  apply  this  knowledge  in
     scheduling for NUMA systems.  We also study  the  performance  of
     several  scheduling  algorithms  in  an  NUMA  environment. These
     algorithms differ in the frequency of reallocations.  We  propose
     two  policies,  the  Static policy and the Immediate Start Static
     policy, that  utilize  application  characteristics  when  making
     scheduling  decisions.   The performance of these two policies is
     compared  with  that  of  the  Dynamic   policy,   on   an   NUMA
     multiprocessor, Hector.

---------------------------------------------------------------------
File: Parsons_Sevcik_IPPS95.ps.Z

Title: Multiprocessor Scheduling for High-Variability Service Time
       Distributions
 
Where: IPPS '95 Workshop on Job Scheduling Strategies for Parallel Processing
  reprinted in Springer-Verlag Lecture Notes in Computer Science, Vol 949,
  pages 127--145.

Authors: Eric W. Parsons and Kenneth C. Sevcik 

Keywords: Scheduling, multiprocessor performance

Abstract:
  Many disciplines have been proposed for scheduling and processor
  allocation in multiprogrammed multiprocessors for parallel processing.
  These have been, for the most part, designed and evaluated for workloads
  having relatively low variability in service demand.  But with reports
  that variability in service demands at high performance computing centers
  can actually be quite high, these disciplines must be reevaluated.  In
  this paper, we examine the performance of two well-known static
  scheduling disciplines, and propose preemptive versions of these that
  offer much better mean response times when the variability in service
  demand is high.  We argue that, in systems in which dynamic
  repartitioning in applications is expensive or impossible, these
  preemptive disciplines are well suited for handling high variability in
  service demand.

---------------------------------------------------------------------
File:  Okrieg_PhD.ps.Z

Title: HFS: A flexible file system for shared-memory multiprocessors 

Where: PhD Dissertation, Department of Electrical and Computer Engineering,
University of Toronto

Authors: Orran Krieger 

Keywords: File System, I/O, Hurricane, Hector NUMA multiprocessor

Abstract:

  The Hurricane File System (HFS) is designed for large-scale, shared-memory
  multiprocessors.  Its architecture is based on the principle that a
  file system must support a wide variety of file structures, file
  system policies and I/O interfaces to maximize performance
  for a wide variety of applications.  HFS uses a novel, object-oriented
  building-block approach to provide the flexibility needed to support
  this variety of file structures, policies, and I/O interfaces.  File
  structures can be defined in HFS that optimize for sequential or
  random access, read-only, write-only or read/write access, sparse or
  dense data, large or small file sizes, and different degrees of
  application concurrency.  Policies that can be defined on a per-file
  or per-open instance basis include locking policies, prefetching
  policies, compression/decompression policies and file cache management
  policies.  In contrast, most existing file systems have been designed
  to support a single file structure and a small set of policies.

  We have implemented large portions of HFS as part of the Hurricane operating
  system running on the \hec\ shared-memory multiprocessor.  We
  demonstrate that the flexibility of HFS comes with little processing
  or I/O overhead.  Also, we show that HFS is able to deliver the full
  I/O bandwidth of the disks on our system to the applications.

---------------------------------------------------------------------
File: Vranesic_etal_IEEEC.ps.Z 

Title: Hector -- A hierarchically structured shared memory multiprocessor

Where: IEEE Computer, 24(1): 72-80, January, 1991.

Authors: Z. Vranesic, M. Stumm, D. Lewis and R. White 

Keywords: Shared memory multiprocessors, slotted rings, NUMA, Scalability. 

Abstract: Please see the ps file.

---------------------------------------------------------------------
File: Kulkarni_etal_317.ps.Z 

Title: A Generalized Theory of Linear Loop Transformations

Where: CSRI Tech Report 317

Authors: D. Kulkarni, M. Stumm, R. Unrau 

Keywords: Computational Alignment, Computation Decomposition,
Linear loop transformations, SPMD code generation.

Abstract: In this paper we present a new theory of linear loop transformations
called {\em Computation Decomposition and Alignment\/} (CDA). A CDA
transformation has two components: {\em Computation Decomposition\/}
first decomposes the computations in the loop into computations
of finer granularity, from iterations to instances of subexpressions.
{\em Computation Alignment\/} subsequently, linearly transforms each
of these sets of computations, possibly by using a
different transformation for each set.  This framework subsumes all
existing linear transformation frameworks in that it reduces to a
conventional linear loop transformation when the smallest granularity
is an iteration, and it reduces to some of the more recently extended
frameworks when the smallest granularity is a statement instance.
The possibility  of being able to align computations at arbitrary
granularities adds a new dimensions to performance optimization
on high performance computing platforms.

We describe Computation Decomposition and Alignment and provide
examples of CDA transformations. We present some heuristics to
derive appropriate CDA transformations, given a desired optimization
objective. We present the results of experiments run on
the KSR1 multiprocessor and various RS6000 and
Sparc platforms that demonstrate that CDA can result in substantial
performance improvements.

---------------------------------------------------------------------
File: Kulkarni_Stumm_ACJ95.ps.Z 

Title: Linear Loop Transformations in Optimizing Compilers for Parallel Machines

Where: To appear in the Australian Computer Journal

Authors: D. Kulkarni, M. Stumm 

Keywords:  Linear loop transformations

Abstract: We present the linear loop transformation framework which is the formal
basis for state of the art optimization techniques in restructuring
compilers for parallel machines. The framework unifies most existing
transformations and provides a systematic set of code generation
techniques for arbitrary compound loop transformations. The algebraic
representation of the loop structure and its transformation give
way to quantitative techniques for optimizing performance on parallel
machines. We discuss in detail the techniques for generating
the transformed loop and deriving the desired linear transformation.

---------------------------------------------------------------------
File: Manjikian_Abdelrahman_315.ps.Z 

Title: Fusion of Loops for Parallelism and Locality

Where: CSRI Tech Report 315

Authors: N. Manjikian and T. Abdelrahaman

Keywords:  Loop fusion, cache performance, locality, NUMA

Abstract: Loop fusion improves data locality and reduces synchronization in
data-parallel applications. However, loop fusion is not always
legal. Even when legal, fusion may introduce loop-carried dependences
which reduce parallelism. In addition, performance losses result from
cache conflicts in fused loops. We present new, systematic techniques
which: (1) allow fusion of loop nests in the presence of
fusion-preventing dependences, (2) allow parallel execution of fused
loops with minimal synchronization, and (3) eliminate cache conflicts
in fused loops. We evaluate our techniques on a 56-processor KSR2
multiprocessor, and show performance improvements of up to 20% for
representative loop nest sequences. The results also indicate a
performance tradeoff as more processors are used, suggesting a careful
evaluation of the profitability of fusion.

 
<!--------------------------------------------------------------------->
<HR>
<A NAME="Kulkarni_Stumm_LCR95">.</A>
<HR>

<B>Title:</B> <A HREF="ftp://ftp.cs.toronto.edu/pub/parallel/Kulkarni_Stumm_LCR95.ps.Z">CDA Loop Transformations</A> 
 <P>
<B>Authors:</B> Dattatraya Kulkarni and Michael Stumm 
 <P>
<B>Where:</B> Proceedings of the Third workshop on languages, compilers and run-
time systems
  for scalable computers}, Troy, NY, May 1995, Kluwer Academic. 
 <P>
<B>Abstract:</B>
 <P>

In this paper we present a new loop transformation technique
called {\em Computation Decomposition and Alignment\/} (CDA).
{\em Computation Decomposition\/}
first decomposes the iteration space into finer computation spaces.
{\em Computation  Alignment\/} subsequently, linearly transforms each
computation space independently.
CDA is a general framework in that
linear transformations and its recent extensions are just special cases of CDA.
CDA's fine grained loop restructuring can incur considerable
computational effort, but can exploit optimization
opportunities that earlier frameworks cannot. We present four
optimization contexts in which CDA can be useful. Our
initial experiments demonstrate that CDA adds a new dimension to
performance optimization.

<!--------------------------------------------------------------------->
<HR>
<A NAME="Kulkarni_Stumm_EuroPar95">.</A>
<HR>

<B>Title:</B> <A HREF="ftp://ftp.cs.toronto.edu/pub/parallel/Kulkarni_Stumm_EuroPar95.ps.Z">Implementing Flexible Computation Rules with
Subexpression-level Loop Transformations</A> 
 <P>
<B>Authors:</B> Dattatraya Kulkarni, Michael Stumm  and Ronald C. Unrau
 <P>
<B>Where:</B>Proceedings of the Euro-Par95, Stockholm, Aug 28-31, 1995. 
 <P>
<B>Abstract:</B>
 <P>

  Computation Decomposition and Alignment (CDA) is a new loop transformation
  framework that extends the linear loop transformation framework and
the more recently proposed
  Computation Alignment frameworks by linearly transforming
  computations at the granularity of subexpressions.
  It can be applied to achieve a number of optimization objectives,
  including the removal of data alignment constraints, the elimination
  of ownership tests, the reduction of cache conflicts, and
improvements in data access locality.

  In this paper we show how CDA can be used to effectively implement
  flexible computation rules with the objective of minimizing
  communication and, whenever possible, eliminating intrinsics that
  test whether computations need to be executed or not.
  We describe CDA, show how it can be used to implement flexible
  computation rules, and present an algorithm for deriving appropriate
  CDA transformations.


<!--------------------------------------------------------------------->
<HR>
<A NAME="Unrau_etal_EuroPar95">.</A>
<HR>

<B>Title:</B> <A HREF="ftp://ftp.cs.toronto.edu/pub/parallel/Unrau_etal_EuroPar95.ps.Z">On the Scalability of Demand-Driven Parallel Systems </A> 
 <P>
<B>Authors:</B> Ronald C. Unrau and Michael Stumm and Orran Krieger 
 <P>
<B>Where:</B>Proceedings of the Euro-Par95, Stockholm, Aug 28-31, 1995. 
 <P>
<B>Abstract:</B>
 <P>

Demand-driven systems follow the model where customers enter the system,
request some service, and then depart.
Examples are databases, transaction processing systems and operating systems,
which form the system software layer between the applications and the hardware.
Achieving scalability at the system software layer is critical
for the scalability of the system as a whole, and yet this layer
has largely been ignored.

In this paper, we characterize the scalability of the system software layer
of demand-driven parallel systems 
based on fundamental metrics of quantitative system performance analysis.
We develop a set of sufficient conditions so that
if a system satisfies these conditions, then the system is scalable.
We further argue that in practice these conditions are also necessary.
In the remainder of the paper, we use the necessary and sufficient conditions 
to develop a set of practical design guidelines, to study the effect of
application workloads, and to examine the scalability behavior of a system with
only a limited number of processors.

<!--------------------------------------------------------------------->
<HR>
<A NAME="Parsons_etal_IWOOS95">.</A>
<HR>

<B>Title:</B> <A HREF="ftp://ftp.cs.toronto.edu/pub/parallel/Parsons_etal_IWOOS95.ps.Z">(De-)Clustering Objects for Multiprocessor System Software </A> 
 <P>
<B>Authors:</B> Eric Parsons, Ben Gamsa, Orran Krieger, Michael Stumm
<P>
<B>Where:</B> IWOOS95 (Fourth International Workshop on Object Orientation in Operating Systems 95)  
<P>
<B>Abstract:</B>
 <P>

  Designing system software for large-scale shared-memory multiprocessors
  is challenging because of the level of performance demanded by the
  application workload and the distributed nature of the system.  Adopting
  an object-oriented approach for our system, we have developed a framework
  for de-clustering objects, where each object may migrate, replicate, and
  distribute all or part of its data across the system memory using the
  policies that will best meet the locality requirements for that data.  The
  mechanism for object invocation hides the internal structure of an
  object, allowing a request to be made directly to the most suitable part
  of the object on a per-processor basis without any knowledge of how the
  object is de-clustered.  Method invocation is very efficient, both within
  and across address spaces, involving no remote memory accesses in the
  common case.  We describe the design and implementation of this framework
  in Tornado, our multiprocessor operating system.


<!--------------------------------------------------------------------->
<HR>
<A NAME="Ben_etal_OOPSLAW94">.</A>
<HR>

<B>Title:</B> <A HREF="ftp://ftp.cs.toronto.edu/pub/parallel/Ben_etal_OOPSLAW94.ps.Z">The Importance of Performance-Oriented Flexibility in
System Software for Large-Scale Shared-Memory Multiprocessors
 </A> 
 <P>
<B>Authors:</B> 
Orran Krieger, Ben Gamsa, Karen Reid, Paul Lu, Eric Parsons and Michael Stumm
<P>
<B>Where:</B> 
OOPSLA Workshop on Flexible System Software. October 1994.
<P>
<B>Abstract:</B>
 <P>
See paper for abstract.


<!--------------------------------------------------------------------->
<HR>
<A NAME="Orran_etal_SPDPW95">.</A>
<HR>

<B>Title:</B> <A HREF="ftp://ftp.cs.toronto.edu/pub/parallel/Orran_etal_SPDPW95.ps.Z">
 Exploiting Mapped Files for Parallel I/O
</A> 
 <P>
<B>Authors:</B> 
Orran Krieger, Karen Reid and Michael Stumm
<P>
<B>Where:</B> 
SPDP Workshop on Modeling and Specification of I/O (MSIO), October 1995

<P>
<B>Abstract:</B>
 <P>
  Harnessing the full I/O capabilities of a large-scale multiprocessor is
  difficult and requires a great deal of cooperation between the
  application programmer, the compiler and the operating (/file) system.
  Hence, the parallel I/O interface used by the application to communicate
  with the system is crucial in achieving good performance.  We present a
  set of properties we believe that a good I/O interface should have and
  consider current parallel I/O interfaces from the perspective of these
  properties.  We describe the advantages and disadvantages of mapped-file
  I/O and argue that if properly implemented it can be a good basis for a
  parallel I/O interface that can fulfill the suggested properties.  To
  demonstrate that such an implementation is feasible, we describe
  methodology used in our previous work on the Hurricane operating system
  and in our current work on the Tornado operating system to implement
  mapped files.