Publications catalog - books



Euro-Par 2007 Parallel Processing: 13th International Euro-Par Conference, Rennes, France, August 28-31, 2007. Proceedings

Anne-Marie Kermarrec ; Luc Bougé ; Thierry Priol (eds.)

In conference: 13th European Conference on Parallel Processing (Euro-Par). Rennes, France. August 28, 2007 - August 31, 2007

Abstract/Description - provided by the publisher

Not available.

Keywords - provided by the publisher

Computer System Implementation; Computer Systems Organization and Communication Networks; Software Engineering/Programming and Operating Systems; Theory of Computation; Numeric Computing; Database Management

Availability

Detected institution: not detected
Year of publication: 2007
Access: SpringerLink

Information

Resource type:

books

Print ISBN

978-3-540-74465-8

Electronic ISBN

978-3-540-74466-5

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

Publication rights information

© Springer-Verlag Berlin Heidelberg 2007

Table of contents

Low-Overhead Online Parallel Performance Monitoring

Aroon Nataraj; Matthew Sottile; Alan Morris; Allen D. Malony; Sameer Shende

Online application performance monitoring allows tracking performance characteristics during execution, as opposed to doing so post-mortem. This opens up several possibilities otherwise unavailable, such as real-time visualization and application performance steering, which can be useful in the context of long-running applications. As HPC systems grow in size and complexity, the key challenge is to keep the online performance monitor scalable and low-overhead while still providing a useful performance reporting capability. Two fundamental components that constitute such a performance monitor are the measurement and transport systems. We adapt and combine two existing, mature systems, TAU and Supermon, to address this problem. TAU performs the measurement while Supermon is used to collect the distributed measurement state. Our experiments show that this novel approach leads to very low-overhead application monitoring, as well as other benefits unavailable when using a transport such as NFS.
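The core idea, sampling measurement state while the application runs instead of reading it post-mortem, can be sketched with a minimal monitor thread. This is an illustrative assumption, not the actual TAU measurement or Supermon transport mechanism:

```python
import threading
import time

class OnlineMonitor:
    """Periodically snapshots a shared counter dict while the application runs."""

    def __init__(self, counters, interval=0.01):
        self.counters = counters          # measurement state (stand-in for TAU)
        self.interval = interval
        self.snapshots = []               # collected state (stand-in for Supermon)
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            self.snapshots.append(dict(self.counters))  # copy current state
            time.sleep(self.interval)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

counters = {"iterations": 0}
monitor = OnlineMonitor(counters)
monitor.start()
for _ in range(5):                        # the "application" doing work
    counters["iterations"] += 1
    time.sleep(0.02)
monitor.stop()
print(f"{len(monitor.snapshots)} snapshots collected during execution")
```

A real monitor would ship each snapshot over a scalable transport rather than keep it in memory; the point here is only that state is observed during execution.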

- Topic 2: Performance Prediction and Evaluation | Pp. 85-96

Practical Differential Profiling

Martin Schulz; Bronis R. de Supinski

Comparing performance profiles from two runs is an essential performance analysis step that users routinely perform. In this work we present a tool that facilitates these comparisons through differential profiling inside an existing profiler. We chose this approach, rather than designing a new tool, since the host profiler is one of the few performance analysis tools accepted and used by a large community of users.

The tool allows users to "subtract" two performance profiles directly. It also includes call-graph visualization to highlight the differences in graphical form. Along with the design of this tool, we present several case studies that show how it can be used to quickly find and study the differences between two application executions, and hence how it can aid the user in this most common step in performance analysis. We do this without requiring major changes on the side of the user, the most important factor in guaranteeing the adoption of our tool by code teams.

- Topic 2: Performance Prediction and Evaluation | Pp. 97-106

Decision Trees and MPI Collective Algorithm Selection Problem

Jelena Pješivac-Grbović; George Bosilca; Graham E. Fagg; Thara Angskun; Jack J. Dongarra

Selecting the close-to-optimal collective algorithm based on the parameters of the collective call at run time is an important step for achieving good performance of MPI applications. In this paper, we explore the applicability of C4.5 decision trees to the MPI collective algorithm selection problem. We construct C4.5 decision trees from the measured algorithm performance data and analyze both the decision tree properties and the expected run time performance penalty.

In the cases we considered, the results show that C4.5 decision trees can be used to generate a reasonably small and very accurate decision function. For example, the broadcast decision tree with only 21 leaves was able to achieve a mean performance penalty of 2.08%. Similarly, combining experimental data for reduce and broadcast and generating a decision function from the combined decision trees resulted in less than 2.5% relative performance penalty. The results indicate that C4.5 decision trees are applicable to this problem and should be more widely used in this domain.
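The kind of decision function such a tree compiles down to can be illustrated with a hand-written sketch. The thresholds and algorithm names below are invented for illustration; in the paper they are learned by C4.5 from measured performance data:

```python
def choose_broadcast_algorithm(comm_size: int, msg_size: int) -> str:
    """A tiny, hand-written stand-in for a learned C4.5 decision tree.

    Each internal node tests one attribute of the collective call;
    each leaf names the algorithm to run. Thresholds are hypothetical.
    """
    if comm_size <= 8:                  # small communicators: linear is fine
        return "linear"
    if msg_size <= 2048:                # small messages: latency-bound
        return "binomial_tree"
    return "scatter_allgather"          # large messages: bandwidth-bound

# Decision made at run time from the parameters of the collective call:
print(choose_broadcast_algorithm(64, 4096))   # -> scatter_allgather
```

Because the learned tree is just nested conditionals like these, it can be compiled into the MPI library's algorithm-selection path with negligible run-time cost.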

- Topic 2: Performance Prediction and Evaluation | Pp. 107-117

Profiling of Task-Based Applications on Shared Memory Machines: Scalability and Bottlenecks

Ralf Hoffmann; Thomas Rauber

A sophisticated approach to the parallel execution of irregular applications on shared memory machines is decomposition into fine-grained tasks. These tasks can be executed using a task pool, which handles the scheduling of the tasks independently of the application. In this paper we present a transparent way to profile irregular applications that use task pools, without modifying the source code of the application. We show that it is possible to identify critical tasks that prevent scalability and to locate bottlenecks inside the application. We also show that the profiling information can be used to obtain a coarse estimate of the execution time for a given number of processors.
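The transparent-profiling idea, instrumenting the task pool rather than the application code, can be sketched as follows. The pool API here is hypothetical and sequential; a real task pool would run tasks on multiple threads:

```python
import time
from collections import defaultdict

class ProfilingTaskPool:
    """Sequential stand-in for a task pool that times each task as it runs,
    without any change to the task functions themselves."""

    def __init__(self):
        self._queue = []
        self.profile = defaultdict(lambda: {"count": 0, "total_time": 0.0})

    def put(self, func, *args):
        self._queue.append((func, args))

    def run_all(self):
        while self._queue:
            func, args = self._queue.pop(0)
            start = time.perf_counter()
            func(*args)                       # execute the application task
            elapsed = time.perf_counter() - start
            stats = self.profile[func.__name__]
            stats["count"] += 1
            stats["total_time"] += elapsed

def small_task(n):
    sum(range(n))                             # some irregular work

pool = ProfilingTaskPool()
for _ in range(3):
    pool.put(small_task, 1000)
pool.run_all()
print(pool.profile["small_task"]["count"])    # -> 3
```

Since the timing lives in `run_all`, every task type is profiled automatically, which is what makes it possible to spot the critical tasks that limit scalability.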

- Topic 2: Performance Prediction and Evaluation | Pp. 118-128

Search Strategies for Automatic Performance Analysis Tools

Michael Gerndt; Edmond Kereku

Periscope is a distributed automatic online performance analysis system for large scale parallel systems. It consists of a set of analysis agents distributed on the parallel machine. This article presents the architecture of the node agent and its central part, the search strategy driving the online search for performance properties. The focus is on strategies used to analyze memory access-related performance properties in OpenMP programs.

- Topic 2: Performance Prediction and Evaluation | Pp. 129-138

Experiences Understanding Performance in a Commercial Scale-Out Environment

Robert W. Wisniewski; Reza Azimi; Mathieu Desnoyers; Maged M. Michael; Jose Moreira; Doron Shiloach; Livio Soares

Clusters of loosely connected machines are becoming an important model for commercial computing. The cost/performance ratio makes these scale-out solutions an attractive platform for a class of computational needs. The work we describe in this paper focuses on understanding performance when using a scale-out environment to run commercial workloads. We describe the novel scale-out environment we configured and the workload we ran on it. We explain the unique performance challenges faced in such an environment and the tools we applied and improved for this environment to address the challenges. We present data from the tools that proved useful in optimizing performance on our system. We discuss the lessons we learned applying and modifying existing tools to a commercial scale-out environment, and offer insights into making future performance tools effective in this environment.

- Topic 2: Performance Prediction and Evaluation | Pp. 139-149

Detecting Application Load Imbalance on High End Massively Parallel Systems

Luiz DeRose; Bill Homer; Dean Johnson

Scientific applications must be well balanced in order to achieve high scalability on current and future high-end massively parallel systems. However, identifying the sources of load imbalance in such applications is not a trivial exercise, and the current state of the art in performance analysis tools does not provide an efficient mechanism to help users identify the main areas of load imbalance in an application. In this paper we discuss a new set of metrics that we defined to identify and measure application load imbalance. We then describe the extensions that were made to the Cray performance measurement and analysis infrastructure to detect application load imbalance and present it to the user in an insightful way.
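One commonly used load-imbalance metric (shown here for illustration; not necessarily the exact metric set the paper defines) compares the maximum per-process time to the mean:

```python
def imbalance_percentage(times):
    """Percent of time lost to imbalance: (max - mean) / max * n/(n-1) * 100.

    0% means perfectly balanced; 100% means all work sat on one process.
    This is one common formulation, not the paper's exact definition.
    """
    n = len(times)
    t_max = max(times)
    t_mean = sum(times) / n
    return (t_max - t_mean) / t_max * n / (n - 1) * 100.0

# Four processes, one straggler doing twice the work:
print(round(imbalance_percentage([10.0, 10.0, 10.0, 20.0]), 1))   # -> 50.0
```

The n/(n-1) factor scales the metric so that the worst case (one process does everything) reads as 100% regardless of process count.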

- Topic 2: Performance Prediction and Evaluation | Pp. 150-159

A First Step Towards Automatically Building Network Representations

Lionel Eyraud-Dubois; Arnaud Legrand; Martin Quinson; Frédéric Vivien

To fully harness Grids, users or middlewares must have some knowledge of the topology of the platform's interconnection network. As such knowledge is usually not available, one must use tools that automatically build a topological network model through measurements. In this article, we define a methodology to assess the quality of these network-model-building tools, and we apply this methodology to representatives of the main classes of model builders and to two new algorithms. We show that none of the main existing techniques builds models that make it possible to accurately predict the running time of simple application kernels on actual platforms. However, some of the new algorithms we propose give excellent results in a wide range of situations.
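The quality criterion used here, whether a reconstructed model predicts running times, can be sketched with the classic latency-plus-bandwidth link model. The path and link values below are hypothetical:

```python
def predicted_transfer_time(path, msg_bytes):
    """Predict the time to send msg_bytes along a path of
    (latency_seconds, bandwidth_bytes_per_sec) links, using the
    classic latency + size/bandwidth cost model per link."""
    return sum(lat + msg_bytes / bw for lat, bw in path)

# Hypothetical two-hop path in a reconstructed topology:
path = [(1e-4, 1e8), (1e-4, 1e8)]         # 100 us latency, 100 MB/s per link
t = predicted_transfer_time(path, 1_000_000)
print(round(t, 4))                         # -> 0.0202
```

Comparing such predictions against measured kernel running times is one way to score how faithful a model builder's output is to the real platform.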

- Topic 2: Performance Prediction and Evaluation | Pp. 160-169

Topic 3 Scheduling and Load-Balancing

Henri Casanova; Olivier Beaumont; Uwe Schwiegelshohn; Marek Tudruj

While scheduling and load-balancing problems have been studied for several decades, the dramatic multi-scale shifts in distributed systems and their usage in the last few years have raised new and exciting challenges. These challenges span the entire spectrum from theory to practice, as demonstrated by the selection of papers in the scheduling and load-balancing topic this year at Euro-Par. Out of the twenty-three submissions to the topic, we accepted six papers. The topic organizers would like to thank all reviewers, whose work made it possible for each paper to receive at least three reviews.

- Topic 3: Scheduling and Load-Balancing | Pp. 171-171

Toward Optimizing Latency Under Throughput Constraints for Application Workflows on Clusters

Nagavijayalakshmi Vydyanathan; Umit V. Catalyurek; Tahsin M. Kurc; Ponnuswamy Sadayappan; Joel H. Saltz

In many application domains, it is desirable to meet some user-defined performance requirement while minimizing resource usage and optimizing additional performance parameters. For example, application workflows with real-time constraints may have strict throughput requirements while also desiring low latency, or response time. The structure of these workflows can be represented as directed acyclic graphs of coarse-grained application tasks with data dependences. In this paper, we develop a novel mapping and scheduling algorithm that minimizes the latency of workflows that act on a stream of input data, while satisfying throughput requirements. The algorithm employs pipelined parallelism and intelligent clustering and replication of tasks to meet throughput requirements. Latency is minimized by exploiting task parallelism and reducing communication overheads. Evaluation using synthetic benchmarks and application task graphs shows that our algorithm (1) consistently meets throughput requirements even when other existing schemes fail, (2) produces lower-latency schedules, and (3) results in lower resource usage.
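The latency/throughput trade-off for a pipelined workflow can be illustrated numerically. The simple linear-chain model and replication rule below are a sketch under simplifying assumptions, not the paper's mapping algorithm:

```python
def pipeline_metrics(stage_times, replicas):
    """For a linear pipelined workflow: latency is the sum of stage times,
    while throughput is limited by the slowest stage. Replicating a stage
    k ways lets it accept new inputs k times as fast (a simplification)."""
    latency = sum(stage_times)
    period = max(t / r for t, r in zip(stage_times, replicas))
    return latency, 1.0 / period

stages = [2.0, 6.0, 1.0]                          # seconds of work per stage
lat, thr = pipeline_metrics(stages, [1, 1, 1])
print(lat, round(thr, 3))                          # -> 9.0 0.167
lat, thr = pipeline_metrics(stages, [1, 3, 1])     # replicate the bottleneck
print(lat, round(thr, 3))                          # -> 9.0 0.5
```

Replicating the bottleneck stage triples throughput without changing latency in this model, which is why clustering and replication can meet a throughput requirement while latency is optimized separately.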

- Topic 3: Scheduling and Load-Balancing | Pp. 173-183