Publications catalog - books



High Performance Computing: HiPC 2007: 14th International Conference, Goa, India, December 18-21, 2007. Proceedings

Srinivas Aluru; Manish Parashar; Ramamurthy Badrinath; Viktor K. Prasanna (eds.)

Conference: 14th International Conference on High-Performance Computing (HiPC). Goa, India. December 18-21, 2007

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Processor Architectures; Software Engineering/Programming and Operating Systems; Computer Systems Organization and Communication Networks; Algorithm Analysis and Problem Complexity; Computation by Abstract Devices; Mathematics of Computing

Availability

Institution detected: Not detected
Year of publication: 2007
Browse: SpringerLink

Information

Resource type:

books

Print ISBN

978-3-540-77219-4

Electronic ISBN

978-3-540-77220-0

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

2007

Publication rights information

© Springer-Verlag Berlin Heidelberg 2007

Table of contents

Accelerating Large Graph Algorithms on the GPU Using CUDA

Pawan Harish; P. J. Narayanan

Large graphs involving millions of vertices are common in many practical applications and are challenging to process. Practical-time implementations using high-end computers have been reported but are accessible only to a few. Today's Graphics Processing Units (GPUs) have high computation power and low price, but they have a restrictive programming model and are tricky to use. The G80 line of Nvidia GPUs can be treated as a SIMD processor array using the CUDA programming model. We present a few fundamental algorithms – including breadth-first search, single-source shortest path, and all-pairs shortest path – using CUDA on large graphs. We can compute the single-source shortest path on a 10-million-vertex graph in 1.5 seconds using the Nvidia 8800GTX GPU costing $600. In some cases, the optimal sequential algorithm is not the fastest on the GPU architecture. GPUs have great potential as high-performance co-processors.

- Session III - Applications of Novel Architectures | Pp. 197-208
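
The per-vertex, level-synchronous formulation described in this abstract can be sketched in ordinary Python: the loop over all vertices below stands in for one launch of a CUDA kernel in which thread i handles vertex i, and the compressed adjacency arrays mirror the data such a kernel would read. This is only an illustration of the iteration pattern, not the authors' CUDA code.

# Level-synchronous BFS in the per-vertex style described above.
# Each pass over the vertices models one GPU kernel launch in which
# thread i would process vertex i (CPU-side sketch for illustration).

def bfs_levels(v_offsets, adj, source):
    """The neighbours of vertex i are adj[v_offsets[i]:v_offsets[i+1]]."""
    n = len(v_offsets) - 1
    cost = [-1] * n          # BFS level of each vertex (-1 = unvisited)
    frontier = [False] * n   # vertices to expand at the current level
    cost[source] = 0
    frontier[source] = True
    level = 0
    while any(frontier):
        next_frontier = [False] * n
        for v in range(n):   # "one thread per vertex"
            if not frontier[v]:
                continue
            for u in adj[v_offsets[v]:v_offsets[v + 1]]:
                if cost[u] == -1:          # reached for the first time
                    cost[u] = level + 1
                    next_frontier[u] = True
        frontier = next_frontier
        level += 1
    return cost

# Tiny example: edges 0->1, 0->3, 1->2.
v_offsets = [0, 2, 3, 3, 3]
adj = [1, 3, 2]
print(bfs_levels(v_offsets, adj, 0))   # [0, 1, 2, 1]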

FT64: Scientific Computing with Streams

Mei Wen; Nan Wu; Chunyuan Zhang; Wei Wu; Qianming Yang; Changqing Xun

This paper describes FT64 and Multi-FT64, single- and multi-coprocessor systems designed for high-performance scientific computing with streams. We give a detailed case study of porting the Mersenne Prime Search problem to FT64 and Multi-FT64 systems. We discuss several special problems associated with streamizing, such as kernel processing granularity, stream organization, and workload partitioning for a multi-processor, which are generally applicable to other scientific codes on FT64. Finally, we perform experiments with eight typical scientific applications on FT64. The results show that a 500 MHz FT64 achieves over 50% of its peak performance and a 4.2x peak speedup over a 1.6 GHz Itanium 2. An eight-processor Multi-FT64 system achieves a 6.8x peak speedup over a single FT64.

- Session III - Applications of Novel Architectures | Pp. 209-220

Implementation and Evaluation of Jacobi Iteration on the Imagine Stream Processor

Jing Du; Xuejun Yang; Wenjing Yang; Tao Tang; Guibin Wang

In this paper, we explore an efficient streaming implementation of Jacobi iteration on the Imagine platform. In particular, we develop four programming optimizations based on different stream organizations – the SP, dot product, row product, and multi-row product methods – each highlighting different aspects of the underlying architecture. The experimental results show that the multi-row product optimization of Jacobi iteration on Imagine achieves a 2.27x speedup over the corresponding serial program running on Itanium 2. This shows that Jacobi iteration can efficiently exploit the potential of the Imagine stream processor through programming optimization.

- Session III - Applications of Novel Architectures | Pp. 221-232
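
For reference, the numerical recurrence that the paper maps onto Imagine's stream model is the standard Jacobi update, x_new[i] = (b[i] - sum_{j != i} A[i][j] * x[j]) / A[i][i]. The plain-Python sketch below shows only this recurrence; it says nothing about the stream and kernel organizations (SP, dot product, row product, multi-row product) that the paper actually compares.

# Plain Jacobi iteration; illustrative only. The paper's contribution is
# how to organise this computation as streams and kernels on Imagine.

def jacobi(A, b, iterations=50):
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        x_new = [0.0] * n
        for i in range(n):
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x_new[i] = (b[i] - s) / A[i][i]
        x = x_new
    return x

# Diagonally dominant example; the exact solution is [1.0, 2.0].
A = [[4.0, 1.0], [2.0, 5.0]]
b = [6.0, 12.0]
print(jacobi(A, b))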

Compiler-Directed Dynamic Voltage Scaling Using Program Phases

K. Shyam; R. Govindarajan

Energy consumption has become a major constraint in providing increased functionality for devices with small form factors. Dynamic voltage and frequency scaling has been identified as an effective approach for reducing the energy consumption of embedded systems. Earlier work on dynamic voltage scaling focused mainly on scaling the voltage when the CPU is waiting for the memory subsystem, or concentrated chiefly on loop nests and/or subroutine calls with a sufficient number of dynamic instructions. This paper concentrates on coarser program regions and, for the first time, uses program phase behavior for performing dynamic voltage scaling. Program phases are annotated at compile time with mode switch instructions. Further, we relate the dynamic voltage scaling problem to the Multiple-Choice Knapsack Problem and use well-known heuristics to solve it efficiently. We also develop a simple integer linear programming (ILP) formulation of the problem. Experimental evaluation on a set of media applications reveals that our heuristic method obtains a 38% reduction in energy consumption on average, with a performance degradation of 1%, and up to a 45% reduction in energy with a performance degradation of 5%. Further, the energy consumed by the heuristic solution is within 1% of the optimal solution obtained from the ILP approach.

- Session IV - System Software | Pp. 233-244
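
The reduction mentioned in this abstract can be made concrete as follows: each program phase contributes one group of (energy, delay) options, one per voltage level; exactly one option must be chosen per phase; and the summed delay must stay within the allowed performance degradation while the summed energy is minimized. The sketch below is a small dynamic program over made-up phase profiles, meant only to illustrate that Multiple-Choice Knapsack view; it is neither the authors' heuristic nor their ILP formulation.

# Multiple-Choice Knapsack view of DVS: pick one (energy, delay) option
# per program phase so that total delay stays within a budget and total
# energy is minimal. Phase profiles below are invented for illustration.

def min_energy(phases, delay_budget):
    """phases: list of groups, each a list of (energy, integer delay)
    options. Returns (minimal energy, chosen option index per phase)."""
    best = {0: (0.0, [])}            # delay used -> (energy, choices)
    for group in phases:
        new_best = {}
        for used, (energy, picks) in best.items():
            for idx, (e, d) in enumerate(group):
                nd = used + d
                if nd > delay_budget:
                    continue
                cand = (energy + e, picks + [idx])
                if nd not in new_best or cand[0] < new_best[nd][0]:
                    new_best[nd] = cand
        best = new_best
    return min(best.values(), key=lambda t: t[0]) if best else None

# Two hypothetical phases with (energy, extra delay) per voltage level.
phases = [
    [(10.0, 0), (7.0, 2), (5.0, 4)],   # phase A: high/med/low voltage
    [(8.0, 0), (6.0, 1), (4.0, 3)],    # phase B: high/med/low voltage
]
print(min_energy(phases, delay_budget=4))   # minimal energy is 13.0 here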

Partial Flow Sensitivity

Subhajit Roy; Y. N. Srikant

Compiler optimizations need precise and scalable analyses to discover program properties. We propose a partially flow-sensitive framework that tries to draw on the scalability of flow-insensitive algorithms while providing more precision at some specific program points. Provided with a set of critical nodes (basic blocks at which more precise information is desired), our partially flow-sensitive algorithm computes a reduced control-flow graph by collapsing sets of non-critical nodes. The algorithm is more scalable than a fully flow-sensitive one because, assuming that the number of critical nodes is small, the reduced flow graph is much smaller than the original flow graph. At the same time, much more precise information is obtained at certain program points than would have been obtained from a flow-insensitive algorithm.

- Session IV - System Software | Pp. 245-256
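
One simple way to picture the reduced control-flow graph this abstract describes: keep every critical node and collapse each connected region of non-critical nodes into a single summary node, redirecting edges accordingly. The toy sketch below does exactly that on an edge-list CFG; it only illustrates the collapsing idea and is not the paper's construction or its dataflow analysis.

# Toy CFG reduction: every weakly connected region of non-critical nodes
# becomes one summary node; edges to and from critical nodes are kept.

def reduce_cfg(edges, critical):
    """edges: list of (src, dst) pairs; critical: set of node names."""
    nodes = {n for e in edges for n in e}
    non_critical = nodes - set(critical)

    # Union-find over non-critical nodes joined by an edge.
    parent = {n: n for n in non_critical}
    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n
    for a, b in edges:
        if a in non_critical and b in non_critical:
            parent[find(a)] = find(b)

    def name(n):
        return n if n in critical else "region_" + find(n)

    return sorted({(name(a), name(b)) for a, b in edges if name(a) != name(b)})

# B and C are non-critical, so they collapse into one summary region.
edges = [("entry", "B"), ("B", "C"), ("C", "exit"), ("entry", "exit")]
print(reduce_cfg(edges, critical={"entry", "exit"}))
# [('entry', 'exit'), ('entry', 'region_C'), ('region_C', 'exit')]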

A Scalable Asynchronous Replication-Based Strategy for Fault Tolerant MPI Applications

John Paul Walters; Vipin Chaudhary

As computational clusters increase in size, their mean time to failure decreases. Typically, checkpointing is used to minimize the loss of computation. Most checkpointing techniques, however, require central storage for checkpoints, which severely limits the scalability of checkpointing. We propose a scalable, replication-based MPI checkpointing facility based on LAM/MPI. We extend the existing state of fault-tolerant MPI with asynchronous replication, eliminating the need for central or network storage. We evaluate centralized storage, SAN-based solutions, and a commercial parallel file system, and show that they are not scalable, particularly beyond 64 CPUs. We demonstrate the low overhead of our replication scheme with the NAS Parallel Benchmarks and the High Performance LINPACK benchmark on up to 256 nodes, showing that checkpointing and replication can be achieved with much lower overhead than current techniques provide.

- Session IV - System Software | Pp. 257-268
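
The abstract does not spell out how checkpoint replicas are placed, but the general idea of replacing central storage with peer-to-peer copies can be illustrated with a simple ring layout: each node keeps its own checkpoint and pushes copies to the next few nodes. The sketch below shows only this generic placement pattern and is not the authors' replication protocol.

# Generic ring-style replica placement for peer-to-peer checkpoints:
# node i sends copies of its checkpoint to the next `replicas` nodes,
# so no central server is needed. Illustration only.

def replica_targets(node, total_nodes, replicas=1):
    """Nodes that should hold copies of `node`'s checkpoint."""
    return [(node + k) % total_nodes for k in range(1, replicas + 1)]

# With 4 nodes and 2 replicas, node 3 replicates to nodes 0 and 1.
for n in range(4):
    print(n, "->", replica_targets(n, 4, replicas=2))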

Towards a Transparent Data Access Model for the GridRPC Paradigm

Gabriel Antoniu; Eddy Caron; Frédéric Desprez; Aurélia Fèvre; Mathieu Jan

As grids become more and more attractive for solving complex problems with high computational and storage requirements, the need for adequate grid programming models is considerable. To this purpose, the GridRPC model has been proposed as a grid version of the classical RPC paradigm, with the goal of building NES (Network-Enabled Server) environments. In this model, data management has not been defined and is currently left explicitly to the user. The contribution of this paper is to enhance data management in NES by introducing a transparent data access model, available through the concept of a grid data-sharing service. Data management is totally delegated to the service, whereas applications simply access shared data via global identifiers. We illustrate our approach using a GridRPC middleware together with a grid data-sharing service. Notably, our experiments performed on the Grid’5000 testbed with a real-life application show the efficiency of the data-sharing service for managing persistent data in the GridRPC model: application execution times in a grid environment are of the same order as in a cluster environment.

- Session IV - System Software | Pp. 269-284
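
The access model described here (data published once to a data-sharing service and then referred to only by a global identifier) can be pictured with a small in-memory stand-in. All class and method names below are invented for illustration; they do not correspond to any real middleware API.

import uuid

class ToyDataSharingService:
    """In-memory stand-in for a grid data-sharing service."""
    def __init__(self):
        self._store = {}

    def publish(self, data):
        data_id = str(uuid.uuid4())     # global identifier for the data
        self._store[data_id] = data
        return data_id

    def access(self, data_id):
        return self._store[data_id]     # transfers/consistency hidden here

service = ToyDataSharingService()

def remote_sum(data_id):
    """Stands in for a server-side RPC task: it receives only an
    identifier and pulls the data through the service."""
    return sum(service.access(data_id))

row = [1.0, 2.0, 3.0]
row_id = service.publish(row)          # data now managed by the service
print(remote_sum(row_id))              # 6.0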

A Proxy-Based Self-tuned Overload Control for Multi-tiered Server Systems

Rukma P. Verlekar; Varsha Apte

Web sites, especially e-commerce ones, are often faced with an incoming load of requests that exceeds their capacity, i.e., they are subjected to overload. Most existing servers show severe throughput degradation at high overload. Overload control mechanisms are required to prevent such occurrences. In this paper, we present a proxy-based overload control mechanism, which uses the drop in throughput relative to the arrival rate as an indicator of overload. On overload detection, a self-clocked admission control is activated, which admits a new request only when a successful reply is observed to be leaving the server system. Thus, the mechanism is self-tuned and requires no knowledge of the system. We validate our approach on an experimental testbed consisting of a two-tier Web application, and find that even at very high overload, the server operates at its maximum capacity while keeping response times within acceptable bounds.

- Session IV - System Software | Pp. 285-296
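
The admission rule described in this abstract ("admit a new request only when a successful reply is observed leaving the server") can be modeled as a credit counter at the proxy: each completed reply adds one credit, and an arriving request is admitted only if a credit is available. The sketch below reduces the throughput-versus-arrival-rate overload detector to a boolean flag and is only an illustration of the self-clocking idea, not the authors' proxy.

# Simplified self-clocked admission control: while overloaded, each
# successful reply leaving the server "clocks in" one new admission.

class SelfClockedAdmission:
    def __init__(self):
        self.overloaded = False   # set by an external overload detector
        self.credits = 0          # replies observed since overload began

    def on_reply_completed(self):
        if self.overloaded:
            self.credits += 1     # one departure allows one admission

    def admit(self):
        """Return True if an arriving request should be admitted."""
        if not self.overloaded:
            return True           # normal operation: admit everything
        if self.credits > 0:
            self.credits -= 1
            return True
        return False              # reject or queue until a reply departs

ctl = SelfClockedAdmission()
ctl.overloaded = True             # pretend the overload detector fired
print(ctl.admit())                # False: no reply has departed yet
ctl.on_reply_completed()
print(ctl.admit())                # True: clocked by the departing reply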

Approximation Algorithms for Scheduling with Reservations

Florian Diedrich; Klaus Jansen; Fanny Pascual; Denis Trystram

We study the problem of scheduling independent jobs on a system of identical parallel machines in the presence of reservations. This constraint is practically important; for various reasons, some machines are not available during specified time intervals. The objective is to minimize the makespan. This problem is inapproximable in the general case unless P = NP, which motivates the study of suitable restrictions. We use an approach based on algorithms for multiple subset sum problems; our technique yields a polynomial time approximation scheme (PTAS), which is best possible in the sense that the problem does not admit an FPTAS unless P = NP. The PTAS presented here is the first one for the problem under consideration; so far, no approximation schemes had been proposed even for special cases. We also derive a low-cost algorithm with a constant approximation ratio and discuss additional FPTASes for special cases as well as complexity results.

- Session V - Scheduling | Pp. 297-307
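
To make the setting concrete: jobs are placed on identical machines whose timelines are blocked by fixed reservation intervals, and the makespan of the jobs is to be minimized. The sketch below is only a greedy baseline under that model (each job goes to the machine where it can finish earliest, and a job may not overlap a reservation); it is not the PTAS or the constant-ratio algorithm from the paper.

# Greedy baseline for scheduling with reservations: place each job, longest
# first, at the earliest time on some machine where it fits entirely
# outside that machine's reserved intervals. Illustration only.

def earliest_fit(ready, reservations, p):
    """Earliest start >= ready at which a job of length p avoids all
    (start, end) reservation intervals on this machine."""
    t = ready
    for start, end in sorted(reservations):
        if t + p <= start:        # fits entirely before this reservation
            return t
        t = max(t, end)           # otherwise wait until it is over
    return t

def greedy_schedule(jobs, machines):
    """jobs: processing times; machines: one reservation list per machine.
    Returns (makespan, [(job, machine, start time), ...])."""
    ready = [0.0] * len(machines)
    placed = []
    for j, p in sorted(enumerate(jobs), key=lambda x: -x[1]):
        m, s = min(((m, earliest_fit(ready[m], res, p))
                    for m, res in enumerate(machines)),
                   key=lambda ms: ms[1] + p)
        ready[m] = s + p
        placed.append((j, m, s))
    return max(ready), placed

# Two machines; machine 0 is reserved during [2, 5).
print(greedy_schedule([3.0, 2.0, 2.0], [[(2.0, 5.0)], []]))   # makespan 5.0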

Enhanced Real-Time Divisible Load Scheduling with Different Processor Available Times

Xuan Lin; Ying Lu; Jitender Deogun; Steve Goddard

Providing QoS and performance guarantees for arbitrarily divisible loads in a cluster has become a significant problem. While progress is being made in scheduling arbitrarily divisible loads, some of the proposed approaches may cause Inserted Idle Times (IITs) that are detrimental to system performance. Two contributions are made in addressing this problem. First, we propose two constraints that, when satisfied, lead to an optimal partitioning in utilizing IITs. Second, we integrate the new partitioning method with a previous approach and develop an enhanced algorithm that better utilizes IITs. Simulation results demonstrate the advantages of our new approach.

- Session V - Scheduling | Pp. 308-319
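
As background for the divisible-load setting with different processor available times: in the simplest model (no communication cost, processor i free at time r[i] and needing c[i] time units per unit of load), the classical partition gives every participating processor a share that makes all of them finish at the same instant T, obtained from sum_i (T - r[i]) / (W * c[i]) = 1. The sketch below computes that partition; it illustrates the model only and is not the enhanced, IIT-aware algorithm of the paper.

# Equal-finish-time partition of a divisible load W across processors
# that become available at times r[i] and need c[i] time per unit load.
# Processors whose share would be negative are dropped and the common
# finish time T is recomputed. Simplified model, no communication cost.

def divisible_load_partition(W, r, c):
    active = list(range(len(r)))
    while True:
        T = (W + sum(r[i] / c[i] for i in active)) / sum(1.0 / c[i] for i in active)
        late = [i for i in active if r[i] >= T]
        if not late:
            break
        active = [i for i in active if i not in late]   # they get no load
    alpha = [0.0] * len(r)
    for i in active:
        alpha[i] = (T - r[i]) / (W * c[i])   # load fraction for processor i
    return T, alpha

# Three unit-speed processors; the third is busy until t = 4.
T, alpha = divisible_load_partition(10.0, [0.0, 0.0, 4.0], [1.0, 1.0, 1.0])
print(T, alpha)   # T = 4.666..., shares about [0.467, 0.467, 0.067]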