Publications catalog - books



Shared Memory Parallel Programming with OpenMP: 5th International Workshop on OpenMP Applications and Tools, WOMPAT 2004, Houston, TX, USA, May 17-18, 2004

Barbara M. Chapman (ed.)

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Software Engineering/Programming and Operating Systems; Computer Systems Organization and Communication Networks; Theory of Computation; Mathematics of Computing

Availability

Institution detected: not detected
Year of publication: 2005
Access: SpringerLink

Information

Resource type:

books

Print ISBN

978-3-540-24560-5

Electronic ISBN

978-3-540-31832-3

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

2005

Publication rights information

© Springer-Verlag Berlin/Heidelberg 2005

Table of contents

Efficient Implementation of OpenMP for Clusters with Implicit Data Distribution

This paper discusses an approach to implement OpenMP on clusters by translating it to Global Arrays (GA). The basic translation strategy from OpenMP to GA is described. GA requires a data distribution; we do not expect the user to supply this; rather, we show how we perform data distribution and work distribution according to OpenMP static loop scheduling. An inspector-executor strategy is employed for irregular applications in order to gather information on accesses to potentially non-local data, group non-local data transfers and overlap communications with local computations. Furthermore, a new directive INVARIANT is proposed to provide information about the dynamic scope of data access patterns. This directive can help us generate efficient codes for irregular applications using the inspector-executor approach. Our experiments show promising results for the corresponding regular and irregular GA codes.
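The work distribution the translation mirrors can be illustrated in a few lines. Below is a minimal C/OpenMP sketch of schedule(static), under which each thread receives one contiguous block of iterations; the Global Arrays calls and the proposed INVARIANT directive are specific to the paper and are not shown here.

```c
#include <stdio.h>
#include <omp.h>

#define N 16

int main(void) {
    double a[N];
    /* schedule(static) gives every thread one contiguous chunk of
       iterations; the OpenMP-to-GA translation described above
       distributes the array data the same way, so each process
       mostly touches data it owns locally. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) {
        a[i] = 2.0 * i;
        printf("iteration %2d -> thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}
```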

Pp. Not available

Structure and Algorithm for Implementing OpenMP Workshares

Although OpenMP has become the leading standard in parallel programming languages, the implementation of its runtime environment is not well discussed in the literature. In this paper, we introduce some of the key data structures required to implement OpenMP workshares in our runtime library and also discuss considerations on how to improve its performance. This includes items such as how to set up a workshare control block queue, how to initialize the data within a control block, how to improve barrier performance, and how to handle implicit barrier and nowait situations.  Finally, we discuss the performance of this implementation focusing on the EPCC benchmark.

Keywords: OpenMP, parallel region, workshare, barrier, nowait.
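A small C example makes the nowait situation concrete; this is a sketch of the program side only, not the authors' runtime code.

```c
void scale_and_shift(double *a, double *b, int n) {
    #pragma omp parallel
    {
        /* First workshare: the runtime allocates a control block and
           splits the iterations among the team. nowait removes the
           implicit barrier, so a thread that finishes its chunk may
           enter the next workshare while teammates are still in this
           one -- the runtime must keep both control blocks alive. */
        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            a[i] *= 0.5;

        /* Second workshare: ends with the usual implicit barrier.
           Skipping the barrier above is safe here because the two
           loops touch disjoint arrays. */
        #pragma omp for
        for (int i = 0; i < n; i++)
            b[i] += 1.0;
    }
}
```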

Pp. Not available

Parallelization of General Matrix Multiply Routines Using OpenMP

An application programming interface (API) is developed to facilitate, via OpenMP, the parallelization of the double-precision general matrix multiply routine called from within GAMESS [1] during the execution of the coupled-cluster module for calculating physical properties of molecules. Results are reported using the ATLAS library and the Intel MKL on an Intel machine, and using the ESSL and the ATLAS library on an IBM SP.
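For orientation, here is a naive C version of the loop nest being parallelized; the paper's API dispatches to tuned BLAS libraries (ATLAS, MKL, ESSL) rather than to a hand-written kernel like this one.

```c
/* C = alpha*A*B + beta*C for row-major m x k, k x n, m x n matrices.
   Only the outer loop is parallelized; each thread owns whole rows
   of C, so no synchronization is needed inside the nest. */
void dgemm_omp(int m, int n, int k, double alpha,
               const double *A, const double *B,
               double beta, double *C) {
    #pragma omp parallel for
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int l = 0; l < k; l++)
                sum += A[i * k + l] * B[l * n + j];
            C[i * n + j] = alpha * sum + beta * C[i * n + j];
        }
    }
}
```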

Pp. Not available

Performance Analysis of Hybrid OpenMP/MPI N-Body Application

In this paper we show, through a case study, how adopting the MPI model for the distributed parallelism and OpenMP parallelizing compiler technology for the inner shared-memory parallelism yields a hierarchically distributed-shared memory implementation of an algorithm presenting multiple levels of parallelism. The chosen application solves the well-known N-body problem.
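The two-level structure can be sketched as follows. This is a hypothetical skeleton, not the paper's code: the all-gather decomposition, the force law, and the softening term are illustrative assumptions, and every rank is assumed to own the same number of bodies.

```c
#include <math.h>
#include <mpi.h>

/* Each MPI rank owns n_local bodies (distributed level); the O(N^2)
   force loop over the rank's bodies is an OpenMP loop (shared level). */
void compute_forces(const double *x_local, int n_local,
                    double *x_all, int n_total, double *f_local) {
    /* distributed level: every rank gathers all positions
       (assumes n_local is identical on all ranks) */
    MPI_Allgather(x_local, 3 * n_local, MPI_DOUBLE,
                  x_all, 3 * n_local, MPI_DOUBLE, MPI_COMM_WORLD);

    /* shared-memory level: threads split the rank's local bodies */
    #pragma omp parallel for
    for (int i = 0; i < n_local; i++) {
        double fx = 0.0, fy = 0.0, fz = 0.0;
        for (int j = 0; j < n_total; j++) {
            double dx = x_all[3*j]   - x_local[3*i];
            double dy = x_all[3*j+1] - x_local[3*i+1];
            double dz = x_all[3*j+2] - x_local[3*i+2];
            /* small softening term avoids division by zero at i == j */
            double r2 = dx*dx + dy*dy + dz*dz + 1e-9;
            double inv_r3 = 1.0 / (r2 * sqrt(r2));
            fx += dx * inv_r3;
            fy += dy * inv_r3;
            fz += dz * inv_r3;
        }
        f_local[3*i] = fx; f_local[3*i+1] = fy; f_local[3*i+2] = fz;
    }
}
```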

Pp. Not available

Performance and Scalability of OpenMP Programs on the Sun Fire^TM E25K Throughput Computing Server

The Sun Fire^TM E25K is the first generation of high-end servers built according to Sun Microsystems' Throughput Computing strategy. The server can scale up to 72 UltraSPARC® IV processors with Chip MultiThreading (CMT) technology, and execute up to 144 threads simultaneously. The Sun Studio^TM 9 software includes compilers and tools that support the C, C++, and Fortran OpenMP Version 2.0 specifications and fully exploit the capabilities of the UltraSPARC IV processor and the E25K server. This paper gives an overview of the Sun Fire E25K server and of the OpenMP support in the Sun Studio 9 software. The paper presents the latest world-class SPEC OMPL benchmark results on the Sun Fire E25K and compares them with results on the UltraSPARC III Cu based Sun Fire 15K server. Results show that base and peak ratios for the SPEC OMPL benchmarks increase by approximately 50% on the Sun Fire E25K server compared to a Sun Fire 15K server with the same number of processors and a higher clock frequency.

Pp. Not available

What Multilevel Parallel Programs Do When You Are Not Watching: A Performance Analysis Case Study Comparing MPI/OpenMP, MLP, and Nested OpenMP

In this paper we present a performance analysis case study of two multilevel parallel benchmark codes implemented in three different programming paradigms applicable to shared memory computer architectures. We describe how detailed analysis techniques help to differentiate between the influences of the programming model itself and other factors, such as implementation-specific behavior of the operating system or architectural issues.

Pp. Not available

Dragon: A Static and Dynamic Tool for OpenMP

A program analysis tool can play an important role in helping users understand and improve OpenMP codes. Dragon is a robust interactive program analysis tool based on Open64, an open-source OpenMP C/C++/Fortran 77/90 compiler for Intel Itanium systems. We developed the Dragon tool on top of Open64 to exploit its powerful analyses and to provide static as well as dynamic (feedback-based) information that can be used to develop or optimize OpenMP codes. Dragon enables users to visualize and print essential program structures and to obtain runtime information about their applications. Current features include static/dynamic call graphs and control flow graphs, data dependence analysis, and interprocedural array region summaries, which help users understand procedure side effects within parallel loops. Ongoing work extends Dragon to display data access patterns at runtime and to provide support for runtime instrumentation and optimizations.

Pp. Not available

Automatic Scoping of Variables in Parallel Regions of an OpenMP Program

The process of manually specifying scopes of variables when writing an OpenMP program is both tedious and error-prone. To improve productivity, an autoscoping feature was proposed in [1]. This feature leverages the analysis capability of a compiler to determine the appropriate scopes of variables. In this paper, we present the proposed autoscoping rules and describe the autoscoping feature provided in the Sun Studio^TM 9 Fortran 95 compiler. To investigate how much work can be saved by using autoscoping and the performance impact of this feature, we study the process of parallelizing PANTA, a 50,000-line 3D Navier-Stokes solver, using OpenMP. With pure manual scoping, a total of 1389 variables have to be explicitly privatized by the programmer. With the help of autoscoping, only 13 variables have to be manually scoped. Both versions of PANTA achieve the same performance.
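What autoscoping saves can be seen in a toy C loop where every scope is spelled out by hand; the compiler's analysis infers exactly these clauses. (In Sun Studio the feature is requested through a vendor extension of the default clause, spelled DEFAULT(__AUTO) in the Fortran compiler if the vendor documentation is remembered correctly; the C rendering below is only an illustration of the scoping decisions involved.)

```c
/* Manual scoping: the programmer must decide, variable by variable,
   what is shared and what must be privatized per thread. Autoscoping
   derives these clauses from the compiler's dataflow analysis. */
void smooth(const double *a, double *b, int n) {
    int i;
    double t;                          /* per-iteration scratch */
    #pragma omp parallel for shared(a, b, n) private(i, t)
    for (i = 1; i < n - 1; i++) {
        t = 0.5 * (a[i - 1] + a[i + 1]);
        b[i] = t;
    }
}
```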

Pp. Not available

Runtime Adjustment of Parallel Nested Loops

OpenMP allows programmers to specify nested parallelism in parallel applications. In scientific applications, parallel loops are the most important source of parallelism. In this paper we present an automatic mechanism that dynamically detects the best way to exploit the parallelism of nested parallel loops, based on the number of threads, the problem size, and the number of loop iterations. To that end, we argue that programmers should specify the potential parallelism of the application and give the runtime the responsibility for deciding how best to exploit it. We have implemented this mechanism inside the IBM XL runtime library. Evaluation shows that our mechanism dynamically adapts the generated parallelism to the application and runtime parameters, reaching the same speedup as the best static parallelization (with a priori information).
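The situation the runtime must arbitrate looks like this in C. This is a sketch of the program side only; the decision logic inside the IBM XL runtime is not shown.

```c
#include <omp.h>

/* Two nested parallel loops as the programmer expresses them. The
   mechanism described above decides at runtime -- from the number
   of threads, the problem size, and the trip counts -- whether to
   build thread teams at the outer level, the inner level, or both. */
void scale(double **a, int rows, int cols) {
    omp_set_nested(1);               /* permit nested parallelism */
    #pragma omp parallel for         /* outer level */
    for (int i = 0; i < rows; i++) {
        #pragma omp parallel for     /* inner level */
        for (int j = 0; j < cols; j++)
            a[i][j] *= 2.0;
    }
}
```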

Pp. Not available

SIMT/OMP: A Toolset to Study and Exploit Memory Locality of OpenMP Applications on NUMA Architectures

OpenMP has become the dominant standard for shared memory programming. It is traditionally used on symmetric multiprocessor (SMP) systems, but has more recently also found its way to parallel architectures with distributed shared memory, such as NUMA machines. This combines the advantages of OpenMP's easy-to-use programming model with the scalability and cost-effectiveness of NUMA architectures. In NUMA (Non-Uniform Memory Access) environments, however, OpenMP codes suffer from the longer latencies of remote memory accesses. This can be observed for both hardware and software DSM systems. In this paper we present SIMT/OMP, a simulation environment capable of modeling NUMA scenarios and providing comprehensive performance data about interconnect traffic. We use this tool to study the impact of NUMA on the performance of OpenMP applications and show how the memory layout of these codes can be improved using a visualization tool. Based on these techniques, we have achieved performance increases of up to a factor of five on some of our benchmarks, especially in larger system configurations.
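One standard layout fix that such an analysis typically points to is first-touch placement. A minimal C sketch follows; this is a common remedy for the remote-access problem described above, not necessarily the specific change made in the paper.

```c
#include <stdlib.h>

/* On NUMA systems with a first-touch policy, the thread that first
   writes a page determines the node the page is placed on. Touching
   the array in parallel, with the same static schedule as the later
   compute loops, keeps each thread's share on its local node. */
double *alloc_numa_friendly(size_t n) {
    double *a = malloc(n * sizeof *a);
    if (!a) return NULL;
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        a[i] = 0.0;                  /* first touch places the page */
    return a;
}
```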

Pp. Not available