Publications catalog - books
Languages and Compilers for High Performance Computing: 17th International Workshop, LCPC 2004, West Lafayette, IN, USA, September 22-24, 2004, Revised Selected Papers
Rudolf Eigenmann; Zhiyuan Li; Samuel P. Midkiff (eds.)
Conference: 17th International Workshop on Languages and Compilers for Parallel Computing (LCPC), West Lafayette, IN, USA, September 22-24, 2004
Abstract/Description – provided by the publisher
Not available.
Keywords – provided by the publisher
Not available.
Availability
| Detected institution | Publication year | Browse | Download | Request |
|---|---|---|---|---|
| Not detected | 2005 | SpringerLink | | |
Information
Resource type:
books
Printed ISBN
978-3-540-28009-5
Electronic ISBN
978-3-540-31813-2
Publisher
Springer Nature
Country of publication
United Kingdom
Publication date
2005
Publication rights information
© Springer-Verlag Berlin Heidelberg 2005
Subject coverage
Table of contents
doi: 10.1007/11532378_20
MSA: Multiphase Specifically Shared Arrays
Jayant DeSouza; Laxmikant V. Kalé
Shared address space (SAS) parallel programming models have faced difficulty scaling to large numbers of processors. Further, although in some cases SAS programs are easier to develop, in other cases they face difficulties due to a large number of race conditions. We contend that a multi-paradigm programming model comprising a distributed-memory model with a disciplined form of shared-memory programming may constitute a “complete” and powerful parallel programming system. Optimized coherence mechanisms based on the specific access pattern of a shared variable show significant performance benefits over general DSM coherence protocols. We present MSA, a system that supports such specifically shared arrays, which can be shared in read-only, write-many, and accumulate modes. These simple modes scale well and are general enough to capture the majority of shared-memory access patterns. MSA does not support a general read-write access mode, but a single array can be shared in read-only mode in one phase and in write-many mode in another. MSA coexists with the message-passing paradigm (MPI) and the processor-virtualization-based message-driven paradigm (Charm++). We present the model, its implementation, programming examples, and preliminary performance results.
Keywords: Shared Memory; Access Pattern; Cache Size; Access Mode; Page Size.
Pp. 268-282
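The phase-based access modes described in the MSA abstract can be sketched in miniature. This is a hypothetical single-process Python toy, not the actual Charm++/MSA implementation: the class name and methods are invented for illustration, and the mode switch stands in for the collective synchronization that would occur at a real phase boundary.

```python
# Toy sketch of MSA-style phases: an array is in exactly one access mode
# per phase (read-only, write-many, or accumulate); switching modes models
# the synchronization point between phases.

class SpecificallySharedArray:
    MODES = ("read-only", "write-many", "accumulate")

    def __init__(self, size):
        self._data = [0] * size
        self._mode = "read-only"

    def set_mode(self, mode):
        # Phase boundary: in the real system this would be a collective sync.
        assert mode in self.MODES
        self._mode = mode

    def read(self, i):
        assert self._mode == "read-only", "reads allowed only in read-only phase"
        return self._data[i]

    def write(self, i, value):
        # write-many: each element is written by at most one worker per phase.
        assert self._mode == "write-many", "writes allowed only in write-many phase"
        self._data[i] = value

    def accumulate(self, i, value):
        # accumulate: a commutative-associative combine (here, addition).
        assert self._mode == "accumulate", "only accumulate allowed in this phase"
        self._data[i] += value

arr = SpecificallySharedArray(4)
arr.set_mode("write-many")
arr.write(0, 10)
arr.set_mode("accumulate")
arr.accumulate(0, 5)
arr.accumulate(0, 5)
arr.set_mode("read-only")
```

Because each mode rules out conflicting access patterns within a phase, the general read-write races of shared memory cannot arise; only the accumulate mode admits concurrent updates, and those commute.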
doi: 10.1007/11532378_21
Supporting SQL-3 Aggregations on Grid-Based Data Repositories
Li Weng; Gagan Agrawal; Umit Catalyurek; Joel Saltz
There is an increasing trend toward distributed and shared repositories for storing scientific datasets. Developing applications that retrieve and process data from such repositories involves a number of challenges. First, these repositories store data in complex, low-level layouts, which should be abstracted from application developers. Second, because data repositories are shared resources, part of the computation on the data must be performed on a different set of machines than the ones hosting the data. Third, because of the volume of data and the amount of computation involved, parallel configurations need to be used both for hosting the data and for processing the retrieved data. In this paper, we describe a system for executing SQL-3 queries over scientific data stored as flat files. A relational table-based virtual view is supported on these flat-file datasets. The class of queries we consider involves data retrieval using Select and Where clauses, and processing with user-defined aggregate functions and group-bys. We use a middleware system, STORM, to provide much of the low-level functionality. Our compiler analyzes the SQL-3 queries and generates many of the functions required by this middleware. Our experimental results show good scalability with respect to the number of nodes as well as the dataset size.
Keywords: Aggregation Function; Virtual View; Aggregate Function; Client Node; Storm System.
Pp. 283-298
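The class of queries the abstract describes can be illustrated with a small sketch: a flat binary file exposed as a virtual relational view, filtered with a Where predicate, and reduced with a user-defined aggregate and a group-by. All names here (the record layout, `scan`, `aggregate`) are hypothetical illustrations, not the STORM middleware API or the paper's generated code.

```python
# Sketch: present a flat binary file as a virtual table (region_id, value),
# then evaluate SELECT region_id, agg(value) ... WHERE ... GROUP BY region_id.
import io
import struct

RECORD = struct.Struct("<if")   # virtual view: (region_id: int, value: float)

def scan(stream):
    # Extraction step: turn the low-level flat-file layout into tuples.
    while chunk := stream.read(RECORD.size):
        yield RECORD.unpack(chunk)

def aggregate(stream, where, key, init, step):
    # Filter with the Where predicate, then fold each group with a
    # user-defined aggregate function.
    groups = {}
    for rec in scan(stream):
        if where(rec):
            k = key(rec)
            groups[k] = step(groups.get(k, init), rec)
    return groups

# Build a tiny flat file in memory and query it.
raw = b"".join(RECORD.pack(r, v) for r, v in [(1, 2.0), (2, 3.0), (1, 5.0)])
result = aggregate(io.BytesIO(raw),
                   where=lambda r: r[1] > 2.0,       # WHERE value > 2.0
                   key=lambda r: r[0],               # GROUP BY region_id
                   init=0.0,
                   step=lambda acc, r: acc + r[1])   # user-defined SUM
```

In the paper's setting the scan runs where the data is hosted and the aggregation can run on a different set of machines; this sketch collapses both into one process to show only the query shape.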
doi: 10.1007/11532378_22
Supporting XML Based High-Level Abstractions on HDF5 Datasets: A Case Study in Automatic Data Virtualization
Swarup Kumar Sahoo; Gagan Agrawal
Recently, we have been focusing on the notion of automatic data virtualization. The goal is to enable the automatic creation of efficient data services that support a high-level or virtual view of the data. Application developers express their processing assuming this virtual view, whereas the data is stored in a low-level format. The compiler uses information about the low-level layout and the relationship between the virtual and low-level layouts to generate efficient low-level data-processing code. In this paper, we describe a specific implementation of this approach. We provide XML-based abstractions on datasets stored in the Hierarchical Data Format (HDF). A high-level XML Schema provides a logical view of the HDF5 dataset, hiding the actual layout details. Based on this view, the processing is specified using XQuery, the XML query language developed by the World Wide Web Consortium (W3C). The HDF5 data layout is exposed to the compiler using a low-level XML Schema, and the relationship between the high-level and low-level Schemas is expressed using a Mapping Schema. We describe how our compiler can generate efficient code to access and process HDF5 datasets using the above information. A number of issues are addressed to ensure high locality in processing the datasets; these arise mainly because of the high-level nature of XQuery and because the actual data layout is abstracted away.
Keywords: Mapping Schema; Iteration Space; Virtual View; Data Layout; Very Large Data Base.
Pp. 299-318
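The core idea of data virtualization, a logical view, a physical layout, and a mapping between them, can be shown in a few lines. This is a hypothetical miniature, not the paper's XML-Schema/XQuery compiler: the column names and the `mapping` dictionary are invented stand-ins for the low-level Schema and the Mapping Schema.

```python
# Physical layout: columns stored separately (stand-in for an HDF5 layout).
low_level = {
    "temp_k":  [280.0, 295.0, 310.0],
    "lat_deg": [10.0, 20.0, 30.0],
}

# Mapping schema: which physical column backs each logical field.
mapping = {"temperature": "temp_k", "latitude": "lat_deg"}

def logical_records():
    # Generated accessor: presents the physical columns as logical records,
    # so queries never mention the real layout.
    cols = {field: low_level[col] for field, col in mapping.items()}
    n = len(next(iter(cols.values())))
    for i in range(n):
        yield {field: cols[field][i] for field in cols}

# A "query" written purely against the virtual view.
hot = [r["latitude"] for r in logical_records() if r["temperature"] > 290.0]
```

Changing the physical layout only requires regenerating the accessor from the mapping; the query itself is untouched, which is the point of the virtual view.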
doi: 10.1007/11532378_23
Performance of OSCAR Multigrain Parallelizing Compiler on SMP Servers
Kazuhisa Ishizaka; Takamichi Miyamoto; Jun Shirako; Motoki Obata; Keiji Kimura; Hironori Kasahara
This paper describes the performance of the OSCAR multigrain parallelizing compiler on various SMP servers, such as the IBM pSeries 690, Sun Fire V880, Sun Ultra 80, NEC TX7/i6010, and SGI Altix 3700. The OSCAR compiler hierarchically exploits coarse-grain task parallelism among loops, subroutines, and basic blocks, and near-fine-grain parallelism among statements inside a basic block, in addition to loop parallelism. It also enables global cache optimization across different loops, or coarse-grain tasks, based on a data-localization technique with inter-array padding to reduce memory-access overhead. The current performance of the OSCAR compiler is evaluated on the above SMP servers. For example, the OSCAR compiler, generating OpenMP parallelized programs from ordinary sequential Fortran programs, achieves a 5.7-fold speedup on average over seven programs (SPEC CFP95 tomcatv, swim, su2cor, hydro2d, mgrid, applu, and turb3d) compared with the IBM XL Fortran compiler 8.1 on a 24-processor IBM pSeries 690 SMP server. It also achieves a 2.6-fold speedup compared with the Intel Fortran Itanium Compiler 7.1 on a 16-processor Itanium 2 SGI Altix 3700 server, a 1.7-fold speedup compared with the NEC Fortran Itanium Compiler 3.4 on an 8-processor Itanium 2 NEC TX7/i6010 server, a 2.5-fold speedup compared with Sun Forte 7.0 on a 4-processor UltraSPARC II Sun Ultra 80 desktop workstation, and a 2.1-fold speedup compared with the Sun Forte compiler 7.1 on an 8-processor UltraSPARC III Cu Sun Fire V880 server.
Keywords: Time Speedup; Cache Size; Dynamic Schedule; Static Schedule; Task Parallelism.
Pp. 319-331
doi: 10.1007/11532378_24
Experiences with Co-array Fortran on Hardware Shared Memory Platforms
Yuri Dotsenko; Cristian Coarfa; John Mellor-Crummey; Daniel Chavarría-Miranda
When performing source-to-source compilation of Co-array Fortran (CAF) programs into SPMD Fortran 90 codes for shared-memory multiprocessors, there are several ways of representing and manipulating data at the Fortran 90 language level. We describe a set of implementation alternatives and evaluate their performance implications for CAF variants of the STREAM, Random Access, Spark98 and NAS MG & SP benchmarks. We compare the performance of library-based implementations of one-sided communication with fine-grain communication that accesses remote data using load and store operations. Our experiments show that using application-level loads and stores for fine-grain communication can improve performance by as much as a factor of 24; however, codes requiring only coarse-grain communication can achieve better performance by using an architecture’s tuned memcpy for bulk data movement.
Keywords: Shared Memory; Cache Line; Remote Access; Remote Data; Common Block.
Pp. 332-347
doi: 10.1007/11532378_25
Experiments with Auto-Parallelizing SPEC2000FP Benchmarks
Guansong Zhang; Priya Unnikrishnan; James Ren
In this paper, we document our experimental work in attempting to automatically parallelize the SPEC2000FP benchmarks for SMP machines. This was not purely a research project: it was carried out within IBM’s software laboratory, in a commercial compiler infrastructure that implements the OpenMP 2.0 specification in both Fortran and C/C++. From the beginning, our emphasis was on using simple parallelization techniques. We aim to maintain a good trade-off between performance, especially the scalability of an application program, and its compilation time. Although the parallelization results show relatively low speedups, they are still promising considering the problems associated with explicit parallel programming and the fact that more and more multi-thread and multi-core chips will soon be available even for home computing.
Keywords: automatic parallelization; parallelizing compiler; SMT machine; OpenMP; parallel do.
Pp. 348-362
doi: 10.1007/11532378_26
An Offline Approach for Whole-Program Paths Analysis Using Suffix Arrays
G. Pokam; F. Bodin
Software optimization techniques rely heavily on program behavior to deliver high performance. A key element of these techniques is identifying the program paths that are likely to yield the greatest performance benefits at runtime. Several approaches have been proposed to address this problem; however, many of them fail to cover a larger optimization scope, as they are restricted to loops or procedures. This paper introduces a novel approach for representing and analyzing complete program paths. Unlike the whole-program paths (WPP) approach, which relies on a DAG to represent program paths, our program trace is processed into a suffix array that enables very fast searching algorithms running in O(log N) time, N being the length of the trace. This allows reasonable trace sizes to be processed offline, avoiding the high runtime overhead incurred by WPPs, while accurately characterizing hot paths. Our evaluation shows impressive performance results, with almost 48% of the code being covered by hot paths. We also demonstrate the effectiveness of our approach in optimizing for power: an adaptive cache-resizing scheme shows energy savings on the order of 12%.
Keywords: Basic Block; Execution Trace; Global Coverage; Local Coverage; Strong Region.
Pp. 363-378
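The suffix-array idea in the abstract can be sketched directly: build a suffix array over a trace of basic-block IDs, then find every occurrence of a candidate hot path with binary search over the sorted suffixes. The trace contents and function names below are hypothetical; the suffix-array construction is the naive O(N² log N) comparison sort, not the efficient construction a real tool would use.

```python
# A whole-program trace as a sequence of basic-block IDs.
trace = ["A", "B", "C", "A", "B", "C", "A", "B", "D"]

# Suffix array: starting positions sorted by the suffix beginning there.
sa = sorted(range(len(trace)), key=lambda i: trace[i:])

def occurrences(path):
    """All start positions of `path` in the trace, via binary search on sa."""
    k = len(path)
    prefix = lambda j: trace[sa[j]:sa[j] + k]
    lo, hi = 0, len(sa)
    while lo < hi:                      # leftmost suffix with prefix >= path
        mid = (lo + hi) // 2
        if prefix(mid) < path:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                      # leftmost suffix with prefix > path
        mid = (lo + hi) // 2
        if prefix(mid) <= path:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

hot = occurrences(["A", "B", "C"])      # every start position of path A,B,C
```

Because all suffixes sharing a prefix are contiguous in the suffix array, each search is a pair of binary searches, which is what makes offline queries over long traces cheap once the index is built.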
doi: 10.1007/11532378_27
Automatic Parallelization Using the Value Evolution Graph
Silvius Rus; Dongmin Zhang; Lawrence Rauchwerger
We introduce a framework for the analysis of memory reference sets addressed by induction variables without closed forms. This framework relies on a new data structure, the Value Evolution Graph (VEG), which models the global flow of scalar and array values within a program. We describe the application of our framework to array data-flow analysis, privatization, and dependence analysis. This results in the automatic parallelization of loops that contain arrays indexed by induction variables without closed forms. We implemented this framework in the Polaris research compiler. We present experimental results on a set of codes from the PERFECT, SPEC, and NCSA benchmark suites.
Keywords: Closed Form; Data Dependence; Input Node; Memory Reference; Loop Nest.
Pp. 379-393
doi: 10.1007/11532378_28
A New Dependence Test Based on Shape Analysis for Pointer-Based Codes
A. Navarro; F. Corbera; R. Asenjo; A. Tineo; O. Plata; E. L. Zapata
The approach presented in this paper focuses on detecting data dependences induced by heap-directed pointers in loops that access dynamic data structures. Knowledge about the shape of the data structure accessible from a heap-directed pointer provides critical information for disambiguating heap accesses originating from it. Our approach is based on a previously developed shape analysis that maintains topological information about the connections among the different nodes (memory locations) in the data structure. The novelty is that our approach carries out abstract interpretation of the statements being analyzed, letting us annotate the memory locations reached by each statement with read/write information. This information is later used to find dependences in a very accurate dependence test, which we introduce in this paper.
Keywords: Access Pointer; Memory Location; Shape Analysis; Loop Iteration; Symbolic Execution.
Pp. 394-408
doi: 10.1007/11532378_29
Partial Value Number Redundancy Elimination
Rei Odaira; Kei Hiraki
When exploiting instruction-level parallelism in a runtime optimizing compiler, it is indispensable to quickly remove redundant computations and memory accesses to free up resources. We propose a fast and efficient algorithm called Partial Value Number Redundancy Elimination (PVNRE), which completely fuses Partial Redundancy Elimination (PRE) and Global Value Numbering (GVN). Using value numbers in its data-flow analyses, PVNRE can deal with data-dependent redundancy and can quickly remove path-dependent partial redundancy by converting value numbers at join nodes on demand during the data-flow analyses. Compared with the naive combination of GVN, PRE, and copy propagation, PVNRE analyzes up to 45% faster while retaining the same optimizing power on SPECjvm98.
Keywords: Hash Table; Partial Redundancy; Code Motion; Redundancy Elimination; Static Single Assignment.
Pp. 409-423
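The value-numbering half of the idea can be shown with a toy pass, a much-simplified local cousin of GVN, not PVNRE itself (no data-flow analysis, no join-node conversion): expressions are hashed by their operator and the value numbers of their operands, so two computations that produce the same value get the same number and the second is flagged as redundant.

```python
# Toy local value numbering over three-address code (dst, op, a, b).
def value_number(block):
    table = {}          # (op, vn_a, vn_b) -> value number of the result
    env = {}            # name -> current value number
    counter = [0]

    def vn_of(name):
        # Unseen operands get a fresh value number.
        if name not in env:
            env[name] = counter[0]
            counter[0] += 1
        return env[name]

    redundant = []
    for dst, op, a, b in block:
        key = (op, vn_of(a), vn_of(b))
        if key in table:
            redundant.append(dst)       # value already computed: removable
        else:
            table[key] = counter[0]
            counter[0] += 1
        env[dst] = table[key]
    return redundant

# t2 recomputes x + y with the same operand value numbers, so it is redundant.
block = [("t1", "+", "x", "y"),
         ("t2", "+", "x", "y"),
         ("t3", "*", "t1", "t2")]
redundant = value_number(block)
```

PVNRE's contribution, per the abstract, is carrying such value numbers through the PRE data-flow analyses and converting them on demand at join nodes, so path-dependent partial redundancy is caught in the same pass.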