Catálogo de publicaciones - libros
Languages and Compilers for High Performance Computing: 17th International Workshop, LCPC 2004, West Lafayette, IN, USA, September 22-24, 2004, Revised Selected Papers
Rudolf Eigenmann ; Zhiyuan Li ; Samuel P. Midkiff (eds.)
En conferencia: 17º International Workshop on Languages and Compilers for Parallel Computing (LCPC) . West Lafayette, IN, USA . September 22, 2004 - September 24, 2004
Resumen/Descripción – provisto por la editorial
No disponible.
Palabras clave – provistas por la editorial
No disponibles.
Disponibilidad
| Institución detectada | Año de publicación | Navegá | Descargá | Solicitá |
|---|---|---|---|---|
| No detectada | 2005 | SpringerLink |
Información
Tipo de recurso:
libros
ISBN impreso
978-3-540-28009-5
ISBN electrónico
978-3-540-31813-2
Editor responsable
Springer Nature
País de edición
Reino Unido
Fecha de publicación
2005
Información sobre derechos de publicación
© Springer-Verlag Berlin Heidelberg 2005
Cobertura temática
Tabla de contenidos
doi: 10.1007/11532378_10
JuliusC: A Practical Approach for the Analysis of Divide-and-Conquer Algorithms
Paolo D’Alberto; Alexandru Nicolau
The development of divide and conquer (D&C) algorithms for matrix computations has led to the widespread use of high- performance scientific applications and libraries. In turn, D&C algorithms can be implemented using loop nests or recursion. Recursion is extremely appealing because it is an intuitive means for the deployment of top-down techniques, which exploit data locality and parallelism naturally. However, recursion has been considered impractical for high-performance codes, mostly because of the inherent overhead of the division process into small subproblems. In this work, we develop techniques to model the behavior of recursive algorithms in a way suitable for use by a compiler in estimating and reducing the division process overheads. We describe these techniques and JuliusC, a (lite) C compiler, which we developed to exploit them. JuliusC unfolds the application call graph (partially) and extracts the relations among function calls. As a final result, it produces a directed acyclic graph (DAG) modeling the function calls concisely. The approach is a combination of compile-time and run-time analysis and both have negligible complexity. We illustrate the applicability of our approach by studying 6 test cases. We present the analysis results and we show how our (optimizing) compiler can use these results to increase the efficiency of the division process between 14 to 20 million times, for our codes.
Palabras clave: Directed Acyclic Graph; Function Call; Recursive Algorithm; Loop Nest; Division Process.
Pp. 117-131
doi: 10.1007/11532378_11
Exploiting Parallelism in Memory Operations for Code Optimization
Yunheung Paek; Junsik Choi; Jinoo Joung; Junseo Lee; Seonwook Kim
Code size reduction is ever becoming more important for compilers targeting embedded processors because these processors are often severely limited by storage constraints and thus the reduced code size can have a positively significant impact on their performance. Various code size reduction techniques have different motivations and a variety of application contexts utilizing special hardware features of their target processors. In this work, we propose a novel technique that fully utilizes a set of hardware instructions, called the multiple load/store (MLS) or parallel load/store (PLS), that are specially featured for reducing code size by minimizing the number of memory operations in the code. To take advantage of this feature, many microprocessors support the MLS instructions, whereas no existing compilers fully exploit the potential benefit of these instructions but only use them for some limited cases. This is mainly because optimizing memory accesses with MLS instructions for general cases is an NP-hard problem that necessitates complex assignments of registers and memory offsets for variables in a stack frame. Our technique uses a couple of heuristics to efficiently handle this problem in a polynomial time bound.
Palabras clave: Code Optimization; Code Size; Memory Operation; Register Assignment; Memory Layout.
Pp. 132-148
doi: 10.1007/11532378_12
An ILP-Based Approach to Locality Optimization
Guilin Chen; Ozcan Ozturk; Mahmut Kandemir
One of the most important factors that determine performance of data-intensive applications is data locality. A program with high data locality makes better use of fast, on-chip memories and can avoid large main memory latencies. Although previous compiler research investigated numerous techniques for enhancing locality, we lack of formal techniques, against which the existing heuristics can be compared. Motivated by this observation, this paper presents a fresh look at locality optimization based on integer linear programming (ILP). We formulate the conditions for data locality, and present a system of constraints whose solution gives optimal computation re-ordering and data-to-memory assignment under our objective function and cost model. Our experimental results using three data-intensive applications clearly indicate that the ILP-based approach generates very good results and outperforms a previously proposed heuristic solution to locality.
Palabras clave: Locality Optimization; Data Item; Integer Linear Program; Cache Size; Input Size.
Pp. 149-163
doi: 10.1007/11532378_13
A Code Isolator: Isolating Code Fragments from Large Programs
Yoon-Ju Lee; Mary Hall
In this paper, we describe a tool we have developed called a code isolator. We envision such a tool will facilitate many software development activities in complex software systems, but we are using it to isolate code segments from large scientific and engineering codes, for the purposes of performance tuning. The goal of the code isolator is to provide an executable version of a code segment and representative data that mimics the performance of the code in the full program. The resulting isolated code can be used in performance tuning experiments, requiring just a tiny fraction of the execution time of the code when executing within the full program. We describe the analyses and transformations used in a code isolator tool, which we have largely automated in the SUIF compiler. We present a case study of its use with LS-DYNA, a large widely-used engineering application. In this paper, we demonstrate how the tool derives code that permits performance tuning for cache. We present results comparing L1 cache misses and execution time for the original program and the isolated program generated by the tool with some manual intervention. We find that the isolated code can be executed 3600 times faster than the original program, and most of the L1 cache misses are preserved. We identify areas where additional analyses can close the remaining gap in predicting and preserving cache misses in the isolated code.
Palabras clave: Machine State; Original Program; Cache Line; Code Fragment; Performance Tuning.
Pp. 164-178
doi: 10.1007/11532378_14
The Use of Traces for Inlining in Java Programs
Borys J. Bradel; Tarek S. Abdelrahman
We explore the effectiveness of using traces in optimization. We build a trace collection system for the Jikes Research Virtual Machine and create traces based on the execution of the SPECjvm98 and Java Grande benchmarks. We evaluate the use of traces for inlining in Jikes, and find that the use of traces leads to a decrease in execution time of 10%, when compared to providing similar information from Jikes’s adaptive system from a previous execution. This increase in performance is achieved at the expense of a code expansion of 47%. Further, this performance is slightly better than that achieved when using a greedy algorithm. We conclude that traces may be used effectively to perform inlining, although code expansion and trace collection overhead must be addressed.
Palabras clave: Execution Time; Greedy Algorithm; Adaptive System; Basic Block; Java Program.
Pp. 179-193
doi: 10.1007/11532378_15
A Practical MHP Information Analysis for Concurrent Java Programs
Lin Li; Clark Verbrugge
In this paper we present an implementation of May Happen in Parallel analysis for Java that attempts to address some of the practical implementation concerns of the original work. We describe a design that incorporates techniques for aiding a feasible implementation and expanding the range of acceptable inputs. We provide experimental results showing the utility and impact of our approach and optimizations using a variety of concurrent benchmarks.
Palabras clave: Input Program; Control Flow Graph; Call Graph; Strongly Connect Component; Program Language Design.
Pp. 194-208
doi: 10.1007/11532378_16
Efficient Computation of Communicator Variables for Programs with Unstructured Parallelism
Christoph von Praun
We present an algorithm to determine communicator variables in parallel programs. If communicator variables are accessed in program order and accesses to other shared variables are not reordered with respect to communicators, then program executions are sequentially consistent. Computing communicators is an efficient and effective alternative to delay set computation. The algorithm does not require a thread and whole-program control-flow model and tolerates the typical approximations that static program analyses make for threads and data. These properties make the algorithm suitable to handle multi-threaded object-oriented programs with unstructured parallelism. We demonstrate on several multi-threaded Java programs that the algorithm is effective in reducing the number of fences at memory access statements compared to a naive fence insertion algorithm (the reduction is on average 28%) and report the runtime overhead caused by the fences (between 0% and 231%, average 81%).
Palabras clave: Parallel Program; Communicator Variable; Data Race; Loop Boundary; Program Order.
Pp. 209-223
doi: 10.1007/11532378_17
Compiling High-Level Languages for Vector Architectures
Christopher D. Rickett; Sung-Eun Choi; Bradford L. Chamberlain
In this paper, we investigate the issues of compiling high-level languages for vector architectures. Vector architectures have regained popularity in recent years, from simple desktop computers with small vector units motivated by multimedia applications to large-scale vector multiprocessing machines motivated by ever-growing computational demands. We show that generating code for various types of vector architectures can be done using several idioms, and that the best idiom is not what a programmer would normally do. Using a set of benchmark programs, we also show that the benefits of vectorization can be significant and must not be ignored. Our results show that high-level languages are an attractive means of programming vector architectures since their compilers can generate code using the specific idioms that are most effective for the low-level vectorizing compiler. This leads to source code that is clearer and more maintainable, has excellent performance across the full spectrum of vector architectures, and therefore improves programmer productivity.
Palabras clave: Loop Nest; Dimensional Array; Vectorized Loop; Multidimensional Array; Array Access.
Pp. 224-237
doi: 10.1007/11532378_18
HiLO: High Level Optimization of FFTs
Nick Rizzolo; David Padua
As computing platforms become more and more complex, the task of optimizing performance critical codes becomes more challenging. Recently, more attention has been focused on automating this optimization process by making aggressive assumptions about the algorithm. Motivated by these trends, this paper presents HiLO, the high-level optimizer for FFT codes. HiLO blends traditional optimization techniques into an optimization strategy tailored to the needs of FFT codes and outputs C code ready to be further optimized by the native compiler. It has already been shown that such high-level transformations are important to coax the native compiler into doing its job well. HiLO provides a more appropriate platform for researching these phenomena, suggests an optimization strategy that improves on previous approaches, and shows that even software pipelining at the C level can improve the final binary’s performance.
Palabras clave: Software Pipeline; Abstract Syntax Tree; Program Language Design; Circular Code; Optimization Pass.
Pp. 238-252
doi: 10.1007/11532378_19
Applying Loop Optimizations to Object-Oriented Abstractions Through General Classification of Array Semantics
Qing Yi; Dan Quinlan
Optimizing compilers have a long history of applying loop transformations to C and Fortran scientific applications. However, such optimizations are rare in compilers for object-oriented languages such as C++ or Java, where loops operating on user-defined types are left unoptimized due to their unknown semantics. Our goal is to reduce the performance penalty of using high-level object-oriented abstractions. We propose an approach that allows the explicit communication between programmers and compilers. We have extended the traditional Fortran loop optimizations with an open interface. Through this interface, we have developed techniques to automatically recognize and optimize user-defined array abstractions. In addition, we have developed an adapted constant-propagation algorithm to automatically propagate properties of abstractions. We have implemented these techniques in a C++ source-to-source translator and have applied them to optimize several kernels written using an array-class library. Our experimental results show that using our approach, applications using high-level abstractions can achieve comparable, and in cases superior, performance to that achieved by efficient low-level hand-written codes.
Palabras clave: Loop Optimization; Performance Penalty; Loop Transformation; Program Language Design; Annotation Language.
Pp. 253-267