Publications catalog - books



Languages and Compilers for High Performance Computing: 17th International Workshop, LCPC 2004, West Lafayette, IN, USA, September 22-24, 2004, Revised Selected Papers

Rudolf Eigenmann; Zhiyuan Li; Samuel P. Midkiff (eds.)

Conference: 17th International Workshop on Languages and Compilers for Parallel Computing (LCPC). West Lafayette, IN, USA. September 22-24, 2004

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Not available.

Availability
Detected institution: not detected
Publication year: 2005
Access: SpringerLink

Information

Resource type:

books

Print ISBN

978-3-540-28009-5

Electronic ISBN

978-3-540-31813-2

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

Publication rights information

© Springer-Verlag Berlin Heidelberg 2005

Table of contents

JuliusC: A Practical Approach for the Analysis of Divide-and-Conquer Algorithms

Paolo D’Alberto; Alexandru Nicolau

The development of divide and conquer (D&C) algorithms for matrix computations has led to their widespread use in high-performance scientific applications and libraries. In turn, D&C algorithms can be implemented using loop nests or recursion. Recursion is extremely appealing because it is an intuitive means for the deployment of top-down techniques, which exploit data locality and parallelism naturally. However, recursion has been considered impractical for high-performance codes, mostly because of the inherent overhead of the division process into small subproblems. In this work, we develop techniques to model the behavior of recursive algorithms in a way suitable for use by a compiler in estimating and reducing the division-process overheads. We describe these techniques and JuliusC, a (lite) C compiler we developed to exploit them. JuliusC partially unfolds the application call graph and extracts the relations among function calls. As a final result, it produces a directed acyclic graph (DAG) that models the function calls concisely. The approach combines compile-time and run-time analysis, both of negligible complexity. We illustrate the applicability of our approach by studying 6 test cases. We present the analysis results and show how our (optimizing) compiler can use them to increase the efficiency of the division process by a factor of between 14 and 20 million for our codes.

Keywords: Directed Acyclic Graph; Function Call; Recursive Algorithm; Loop Nest; Division Process.

Pp. 117-131
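The core idea the abstract describes can be pictured with a minimal, hypothetical sketch: partially unfold a divide-and-conquer call graph and collapse calls on identical subproblem sizes into a single DAG node, so the division overhead is governed by the number of distinct subproblems rather than the number of calls. The halving rule and cutoff below are illustrative assumptions, not JuliusC's actual model.

```python
# Hypothetical sketch of call-graph unfolding for a D&C recursion that halves
# its input: one DAG node per *distinct* subproblem size, instead of one tree
# node per call.

def unfold(n, dag, cutoff=4):
    """Record the call DAG of a size-halving divide-and-conquer recursion."""
    if n in dag:                         # this subproblem size is already modeled
        return
    if n <= cutoff:                      # leaf: solved directly, no division
        dag[n] = []
        return
    children = [n // 2, n - n // 2]      # the division step
    dag[n] = children
    for c in children:
        unfold(c, dag, cutoff)

dag = {}
unfold(1024, dag)
# The naive call tree for n = 1024 has hundreds of calls; the DAG needs only
# one node per distinct size (1024, 512, ..., 4).
print(len(dag), "distinct subproblem sizes")
```

For a power-of-two input the two halves coincide, so the DAG stays logarithmic in the input size, which is what makes a combined compile-time/run-time analysis cheap.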

Exploiting Parallelism in Memory Operations for Code Optimization

Yunheung Paek; Junsik Choi; Jinoo Joung; Junseo Lee; Seonwook Kim

Code size reduction is becoming ever more important for compilers targeting embedded processors, because these processors are often severely limited by storage constraints, and reduced code size can therefore have a significant positive impact on their performance. Various code size reduction techniques have different motivations and a variety of application contexts utilizing special hardware features of their target processors. In this work, we propose a novel technique that fully utilizes a set of hardware instructions, called multiple load/store (MLS) or parallel load/store (PLS), that are specially featured for reducing code size by minimizing the number of memory operations in the code. Many microprocessors support MLS instructions, yet no existing compiler fully exploits their potential benefit; they are used only in limited cases. This is mainly because optimizing memory accesses with MLS instructions in the general case is an NP-hard problem that necessitates complex assignments of registers and memory offsets for variables in a stack frame. Our technique uses a couple of heuristics to handle this problem efficiently within a polynomial time bound.

Keywords: Code Optimization; Code Size; Memory Operation; Register Assignment; Memory Layout.

Pp. 132-148
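The MLS idea in miniature: an instruction like ARM's LDM can fetch a run of consecutive stack words in one go. The hypothetical greedy pass below, a stand-in for the paper's heuristics, merely groups given load offsets into maximal consecutive runs; the hard part the paper addresses, choosing the memory layout and register assignment so that such runs exist, is where the NP-hardness lies.

```python
# Hypothetical sketch: partition the stack offsets a basic block loads from
# into maximal runs of consecutive words, each run becoming one multiple-load
# instruction instead of several single loads.

def group_loads(offsets, word=4):
    """Group sorted load offsets into maximal consecutive-word runs."""
    runs, run = [], []
    for off in sorted(offsets):
        if run and off != run[-1] + word:    # gap: start a new MLS group
            runs.append(run)
            run = []
        run.append(off)
    if run:
        runs.append(run)
    return runs

loads = [0, 4, 8, 20, 24, 36]
print(group_loads(loads))   # three MLS instructions replace six loads
```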

An ILP-Based Approach to Locality Optimization

Guilin Chen; Ozcan Ozturk; Mahmut Kandemir

One of the most important factors that determine performance of data-intensive applications is data locality. A program with high data locality makes better use of fast, on-chip memories and can avoid large main memory latencies. Although previous compiler research investigated numerous techniques for enhancing locality, we lack formal techniques against which the existing heuristics can be compared. Motivated by this observation, this paper presents a fresh look at locality optimization based on integer linear programming (ILP). We formulate the conditions for data locality, and present a system of constraints whose solution gives optimal computation re-ordering and data-to-memory assignment under our objective function and cost model. Our experimental results using three data-intensive applications clearly indicate that the ILP-based approach generates very good results and outperforms a previously proposed heuristic solution to locality.

Keywords: Locality Optimization; Data Item; Integer Linear Program; Cache Size; Input Size.

Pp. 149-163
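To make the "optimal computation re-ordering" idea concrete, here is a toy stand-in for the ILP formulation: for a few computations, each touching one data item, exhaustively search execution orders (the ILP searches the same space via constraints) and score each order by misses in a one-entry cache. The mapping of computations to items and the cache model are illustrative assumptions only.

```python
# Brute-force stand-in for an ILP over computation orders: minimize misses in
# a 1-entry cache. Grouping computations that touch the same data item is the
# locality-optimal order.
from itertools import permutations

def misses(order, touches):
    cached, count = None, 0
    for comp in order:
        item = touches[comp]
        if item != cached:               # miss: item not in the 1-entry cache
            count += 1
            cached = item
    return count

touches = {0: "A", 1: "B", 2: "A", 3: "B"}   # computation -> data item
best = min(permutations(touches), key=lambda o: misses(o, touches))
print(best, misses(best, touches))
```

The naive program order A, B, A, B misses four times; any order that groups the same-item computations misses only twice, which is the behavior an ILP solution would certify as optimal under this cost model.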

A Code Isolator: Isolating Code Fragments from Large Programs

Yoon-Ju Lee; Mary Hall

In this paper, we describe a tool we have developed called a code isolator. We envision such a tool will facilitate many software development activities in complex software systems, but we are using it to isolate code segments from large scientific and engineering codes, for the purposes of performance tuning. The goal of the code isolator is to provide an executable version of a code segment and representative data that mimics the performance of the code in the full program. The resulting isolated code can be used in performance tuning experiments, requiring just a tiny fraction of the execution time of the code when executing within the full program. We describe the analyses and transformations used in a code isolator tool, which we have largely automated in the SUIF compiler. We present a case study of its use with LS-DYNA, a large, widely used engineering application. In this paper, we demonstrate how the tool derives code that permits performance tuning for cache. We present results comparing L1 cache misses and execution time for the original program and the isolated program generated by the tool with some manual intervention. We find that the isolated code can be executed 3600 times faster than the original program, and most of the L1 cache misses are preserved. We identify areas where additional analyses can close the remaining gap in predicting and preserving cache misses in the isolated code.

Keywords: Machine State; Original Program; Cache Line; Code Fragment; Performance Tuning.

Pp. 164-178
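The code-isolator workflow can be sketched in miniature: snapshot the program state a fragment reads (the "machine state"), then run the fragment stand-alone against that snapshot rather than inside the full application. All names here are hypothetical, and the real tool performs this with compiler analysis in SUIF rather than by serializing values.

```python
# Hypothetical sketch: capture the state at the fragment's entry, then replay
# only the fragment from the captured state.
import pickle

def run_full_program():
    state = {"coeff": 3, "data": list(range(5))}
    snapshot = pickle.dumps(state)       # capture "machine state" at fragment entry
    result = [state["coeff"] * x for x in state["data"]]   # the hot fragment
    return result, snapshot

def run_isolated_fragment(snapshot):
    state = pickle.loads(snapshot)       # restore state without the full run
    return [state["coeff"] * x for x in state["data"]]     # same fragment, isolated

full_result, snap = run_full_program()
print(full_result == run_isolated_fragment(snap))
```

The tuning payoff is that the isolated run skips everything before the fragment, which is why the paper can report multi-thousand-fold speedups for the tuning loop while still preserving most cache behavior.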

The Use of Traces for Inlining in Java Programs

Borys J. Bradel; Tarek S. Abdelrahman

We explore the effectiveness of using traces in optimization. We build a trace collection system for the Jikes Research Virtual Machine and create traces based on the execution of the SPECjvm98 and Java Grande benchmarks. We evaluate the use of traces for inlining in Jikes, and find that the use of traces leads to a decrease in execution time of 10%, when compared to providing similar information from Jikes’s adaptive system from a previous execution. This increase in performance is achieved at the expense of a code expansion of 47%. Further, this performance is slightly better than that achieved when using a greedy algorithm. We conclude that traces may be used effectively to perform inlining, although code expansion and trace collection overhead must be addressed.

Keywords: Execution Time; Greedy Algorithm; Adaptive System; Basic Block; Java Program.

Pp. 179-193
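Trace-guided inlining can be sketched as follows, under stated assumptions: record which call sites appear on executed traces, then inline only the callees that dominate those traces, subject to a code-size budget. The counters stand in for the paper's trace collection system, and the names and budget are hypothetical.

```python
# Hypothetical sketch: count call-site occurrences across recorded traces and
# pick the hottest sites to inline, bounding code expansion with a budget.
from collections import Counter

def record_traces(runs):
    """Each run is the sequence of call sites executed on one trace."""
    hits = Counter()
    for trace in runs:
        hits.update(trace)
    return hits

def choose_inline(hits, budget=2):
    """Inline the call sites that dominate the traces, up to the budget."""
    return [site for site, _ in hits.most_common(budget)]

runs = [["sqrt", "dot", "sqrt"], ["sqrt", "norm"], ["sqrt", "dot"]]
print(choose_inline(record_traces(runs)))
```

The budget models the trade-off the abstract reports: the 10% speedup came at a 47% code expansion, so an inliner must cap how much trace-hot code it duplicates.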

A Practical MHP Information Analysis for Concurrent Java Programs

Lin Li; Clark Verbrugge

In this paper we present an implementation of May Happen in Parallel analysis for Java that attempts to address some of the practical implementation concerns of the original work. We describe a design that incorporates techniques for aiding a feasible implementation and expanding the range of acceptable inputs. We provide experimental results showing the utility and impact of our approach and optimizations using a variety of concurrent benchmarks.

Keywords: Input Program; Control Flow Graph; Call Graph; Strongly Connected Component; Program Language Design.

Pp. 194-208
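A drastically simplified picture of what an MHP analysis computes, under assumptions not taken from the paper: model each thread by the interval of the parent's program order during which it is live (start to join); two statements in different threads may happen in parallel iff those lifetime intervals overlap. The real analysis operates on call graphs and control flow graphs rather than intervals.

```python
# Hypothetical interval model of may-happen-in-parallel for structured
# fork/join code: threads whose lifetimes overlap may run concurrently.

def mhp(threads, a, b):
    """threads: name -> (start, join) positions in the parent's program order."""
    if a == b:
        return False                      # same thread: ordered by program order
    sa, ja = threads[a]
    sb, jb = threads[b]
    return sa < jb and sb < ja            # lifetime intervals overlap

threads = {"t1": (1, 5), "t2": (2, 4), "t3": (6, 8)}
print(mhp(threads, "t1", "t2"), mhp(threads, "t1", "t3"))
```

Here t1 and t2 are both live between positions 2 and 4, so their statements may race; t3 starts only after both are joined, so it cannot run in parallel with them.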

Efficient Computation of Communicator Variables for Programs with Unstructured Parallelism

Christoph von Praun

We present an algorithm to determine communicator variables in parallel programs. If communicator variables are accessed in program order and accesses to other shared variables are not reordered with respect to communicators, then program executions are sequentially consistent. Computing communicators is an efficient and effective alternative to delay set computation. The algorithm does not require a thread and whole-program control-flow model and tolerates the typical approximations that static program analyses make for threads and data. These properties make the algorithm suitable to handle multi-threaded object-oriented programs with unstructured parallelism. We demonstrate on several multi-threaded Java programs that the algorithm is effective in reducing the number of fences at memory access statements compared to a naive fence insertion algorithm (the reduction is on average 28%) and report the runtime overhead caused by the fences (between 0% and 231%, average 81%).

Keywords: Parallel Program; Communicator Variable; Data Race; Loop Boundary; Program Order.

Pp. 209-223
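The fence-reduction effect the abstract measures can be sketched with a hypothetical counting example: a naive scheme fences every shared-variable access, while the communicator-based scheme fences only accesses to communicator variables. The variable names and access sequence below are invented for illustration.

```python
# Hypothetical sketch: count memory fences under naive insertion (every shared
# access) versus communicator-only insertion.

def count_fences(accesses, communicators=None):
    if communicators is None:                 # naive: fence every shared access
        return len(accesses)
    return sum(1 for v in accesses if v in communicators)

accesses = ["flag", "x", "x", "flag", "y", "x"]   # shared accesses in one method
naive = count_fences(accesses)
reduced = count_fences(accesses, communicators={"flag"})
print(naive, reduced, f"{100 * (naive - reduced) // naive}% fewer fences")
```

The soundness condition from the abstract is what licenses this reduction: as long as communicator accesses stay in program order and other shared accesses are not reordered across them, executions remain sequentially consistent.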

Compiling High-Level Languages for Vector Architectures

Christopher D. Rickett; Sung-Eun Choi; Bradford L. Chamberlain

In this paper, we investigate the issues of compiling high-level languages for vector architectures. Vector architectures have regained popularity in recent years, from simple desktop computers with small vector units motivated by multimedia applications to large-scale vector multiprocessing machines motivated by ever-growing computational demands. We show that generating code for various types of vector architectures can be done using several idioms, and that the best idiom is not what a programmer would normally do. Using a set of benchmark programs, we also show that the benefits of vectorization can be significant and must not be ignored. Our results show that high-level languages are an attractive means of programming vector architectures since their compilers can generate code using the specific idioms that are most effective for the low-level vectorizing compiler. This leads to source code that is clearer and more maintainable, has excellent performance across the full spectrum of vector architectures, and therefore improves programmer productivity.

Keywords: Loop Nest; Dimensional Array; Vectorized Loop; Multidimensional Array; Array Access.

Pp. 224-237
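The abstract's point about idioms can be illustrated with a hypothetical rendering of the same computation two ways: element-at-a-time, as a programmer naturally writes it, and as a whole-array operation, the form a vectorizing back end maps most directly onto vector instructions. A high-level-language compiler can emit the second idiom even when the source was written as the first.

```python
# Two idioms for SAXPY (a*x + y). A low-level vectorizing compiler typically
# handles the whole-array form best; the names here are illustrative.

def saxpy_scalar(a, x, y):
    out = [0.0] * len(x)
    for i in range(len(x)):               # element-at-a-time idiom
        out[i] = a * x[i] + y[i]
    return out

def saxpy_whole_array(a, x, y):
    # whole-array idiom: one operation over the arrays, no explicit index loop
    return [a * xi + yi for xi, yi in zip(x, y)]

x, y = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
print(saxpy_scalar(2.0, x, y) == saxpy_whole_array(2.0, x, y))
```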

HiLO: High Level Optimization of FFTs

Nick Rizzolo; David Padua

As computing platforms become more and more complex, the task of optimizing performance critical codes becomes more challenging. Recently, more attention has been focused on automating this optimization process by making aggressive assumptions about the algorithm. Motivated by these trends, this paper presents HiLO, the high-level optimizer for FFT codes. HiLO blends traditional optimization techniques into an optimization strategy tailored to the needs of FFT codes and outputs C code ready to be further optimized by the native compiler. It has already been shown that such high-level transformations are important to coax the native compiler into doing its job well. HiLO provides a more appropriate platform for researching these phenomena, suggests an optimization strategy that improves on previous approaches, and shows that even software pipelining at the C level can improve the final binary’s performance.

Keywords: Software Pipeline; Abstract Syntax Tree; Program Language Design; Circular Code; Optimization Pass.

Pp. 238-252
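One kind of high-level transformation such an optimizer can apply, sketched hypothetically on an FFT butterfly: common subexpression elimination. The naive form computes the twiddle product twice; the rewritten form computes it once. This example is illustrative and is not drawn from HiLO's actual pass pipeline.

```python
# Hypothetical CSE sketch on a radix-2 FFT butterfly: hoist the common
# subexpression w*b so it is computed once per butterfly.

def butterfly_naive(a, b, w):
    return a + w * b, a - w * b       # w*b computed twice

def butterfly_cse(a, b, w):
    t = w * b                          # single twiddle multiply
    return a + t, a - t

a, b, w = 1 + 0j, 0 + 1j, 0 - 1j
print(butterfly_naive(a, b, w) == butterfly_cse(a, b, w))
```

Performing such rewrites before emitting C is the abstract's point: the generated source hands the native compiler code it can optimize well, instead of relying on it to rediscover the structure.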

Applying Loop Optimizations to Object-Oriented Abstractions Through General Classification of Array Semantics

Qing Yi; Dan Quinlan

Optimizing compilers have a long history of applying loop transformations to C and Fortran scientific applications. However, such optimizations are rare in compilers for object-oriented languages such as C++ or Java, where loops operating on user-defined types are left unoptimized due to their unknown semantics. Our goal is to reduce the performance penalty of using high-level object-oriented abstractions. We propose an approach that allows explicit communication between programmers and compilers. We have extended the traditional Fortran loop optimizations with an open interface. Through this interface, we have developed techniques to automatically recognize and optimize user-defined array abstractions. In addition, we have developed an adapted constant-propagation algorithm to automatically propagate properties of abstractions. We have implemented these techniques in a C++ source-to-source translator and have applied them to optimize several kernels written using an array-class library. Our experimental results show that using our approach, applications using high-level abstractions can achieve comparable, and in some cases superior, performance to that achieved by efficient low-level hand-written codes.

Keywords: Loop Optimization; Performance Penalty; Loop Transformation; Program Language Design; Annotation Language.

Pp. 253-267
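One way to picture why knowing an array abstraction's semantics enables loop optimization, via a hypothetical class not taken from the paper: if elementwise operators build a deferred expression, the whole expression can be evaluated in a single fused loop with no temporary arrays, the transformation a compiler can apply once annotations declare the abstraction elementwise.

```python
# Hypothetical lazy array class: operators build a deferred elementwise
# expression; eval() runs one fused loop over it, with no temporaries.

class Arr:
    def __init__(self, fn, n):
        self.fn, self.n = fn, n                 # fn(i) yields element i lazily

    @classmethod
    def of(cls, data):
        return cls(lambda i: data[i], len(data))

    def __add__(self, other):
        return Arr(lambda i: self.fn(i) + other.fn(i), self.n)

    def __mul__(self, other):
        return Arr(lambda i: self.fn(i) * other.fn(i), self.n)

    def eval(self):
        # one fused loop: (a + b) * c touches each index once
        return [self.fn(i) for i in range(self.n)]

a, b, c = Arr.of([1, 2]), Arr.of([3, 4]), Arr.of([10, 10])
print(((a + b) * c).eval())
```

An eager array class would materialize `a + b` into a temporary before multiplying; a compiler that knows the operators are elementwise can fuse the loops automatically, which is the performance gap the abstract's approach closes.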