Catálogo de publicaciones - libros
High Performance Computing: HiPC 2007: 14th International Conference, Goa, India, December 18-21, 2007. Proceedings
Srinivas Aluru ; Manish Parashar ; Ramamurthy Badrinath ; Viktor K. Prasanna (eds.)
En conferencia: 14º International Conference on High-Performance Computing (HiPC) . Goa, India . December 18, 2007 - December 21, 2007
Resumen/Descripción – provisto por la editorial
No disponible.
Palabras clave – provistas por la editorial
Processor Architectures; Software Engineering/Programming and Operating Systems; Computer Systems Organization and Communication Networks; Algorithm Analysis and Problem Complexity; Computation by Abstract Devices; Mathematics of Computing
Disponibilidad
Institución detectada | Año de publicación | Navegá | Descargá | Solicitá |
---|---|---|---|---|
No detectada | 2007 | SpringerLink |
Información
Tipo de recurso:
libros
ISBN impreso
978-3-540-77219-4
ISBN electrónico
978-3-540-77220-0
Editor responsable
Springer Nature
País de edición
Reino Unido
Fecha de publicación
2007
Información sobre derechos de publicación
© Springer-Verlag Berlin Heidelberg 2007
Cobertura temática
Tabla de contenidos
Accelerating Large Graph Algorithms on the GPU Using CUDA
Pawan Harish; P. J. Narayanan
Large graphs involving millions of vertices are common in many practical applications and are challenging to process. Practical-time implementations using high-end computers are reported but are accessible only to a few. Graphics Processing Units (GPUs) of today have high computation power and low price. They have a restrictive programming model and are tricky to use. The G80 line of Nvidia GPUs can be treated as a SIMD processor array using the CUDA programming model. We present a few fundamental algorithms – including breadth first search, single source shortest path, and all-pairs shortest path – using CUDA on large graphs. We can compute the single source shortest path on a 10 million vertex graph in 1.5 seconds using the Nvidia 8800GTX GPU costing $600. In some cases optimal sequential algorithm is not the fastest on the GPU architecture. GPUs have great potential as high-performance co-processors.
- Session III - Applications of Novel Architectures | Pp. 197-208
FT64: Scientific Computing with Streams
Mei Wen; Nan Wu; Chunyuan Zhang; Wei Wu; Qianming Yang; Changqing Xun
This paper describes FT64 and Multi-FT64, single- and multi-coprocessor systems designed for high performance scientific computing with streams. We give a detailed case study of porting the Mersenne Prime Search problem to FT64 and Multi-FT64 systems. We discuss several special problems associated with streamizing, such as kernel processing granularity, stream organization and workload partitioning for a multi-processor, which are generally applicable to other scientific codes on FT64. Finally, we perform experiments with eight typical scientific applications on FT64. The results show that a 500MHz FT64 achieves over 50% of its peak performance and a 4.2x peak speedup over 1.6GHz Itanium2. An eight processor Multi-FT64 system achieves 6.8x peak speedup over a single FT64.
- Session III - Applications of Novel Architectures | Pp. 209-220
Implementation and Evaluation of Jacobi Iteration on the Imagine Stream Processor
Jing Du; Xuejun Yang; Wenjing Yang; Tao Tang; Guibin Wang
In this paper, we explore an efficient streaming implementation of Jacobi iteration on the Imagine platform. Especially, we develop four programming optimizations according to different stream organizations, involving using SP, dot product, row product and multi-row product methods, each highlighting different aspects of the underlying architecture. The experimental results show that the multi-row product optimization of Jacobi iteration on Imagine achieves 2.27 speedup over the corresponding serial program running on Itanium 2. It is certain that Jacobi iteration can efficiently exploit the tremendous potential of Imagine stream processor through programming optimization.
- Session III - Applications of Novel Architectures | Pp. 221-232
Compiler-Directed Dynamic Voltage Scaling Using Program Phases
K. Shyam; R. Govindarajan
Energy consumption has become a major constraint in providing increased functionality for devices with small form factors. Dynamic voltage and frequency scaling has been identified as an effective approach for reducing the energy consumption of embedded systems. Earlier works on dynamic voltage scaling focused mainly on performing voltage scaling when the CPU is waiting for memory subsystem or concentrated chiefly on loop nests and/or subroutine calls having sufficient number of dynamic instructions. This paper concentrates on coarser program regions and for the first time uses program phase behavior for performing dynamic voltage scaling. Program phases are annotated at compile time with mode switch instructions. Further, we relate the Dynamic Voltage Scaling Problem to the Multiple Choice Knapsack Problem, and use well known heuristics to solve it efficiently. Also, we develop a simple integer linear program formulation for this problem. Experimental evaluation on a set of media applications reveal that our heuristic method obtains a 38% reduction in energy consumption on an average, with a performance degradation of 1% and upto 45% reduction in energy with a performance degradation of 5%. Further, the energy consumed by the heuristic solution is within 1% of the optimal solution obtained from the ILP approach.
- Session IV - System Software | Pp. 233-244
Partial Flow Sensitivity
Subhajit Roy; Y. N. Srikant
Compiler optimizations need precise and scalable analyses to discover program properties. We propose a partially flow-sensitive framework that tries to draw on the scalability of flow-insensitive algorithms while providing more precision at some specific program points. Provided with a set of critical nodes — basic blocks at which more precise information is desired — our partially flow-sensitive algorithm computes a reduced control-flow graph by collapsing some sets of non-critical nodes. The algorithm is more scalable than a fully flow-sensitive one as, assuming that the number of critical nodes is small, the reduced flow-graph is much smaller than the original flow-graph. At the same time, a much more precise information is obtained at certain program points than would had been obtained from a flow-insensitive algorithm.
- Session IV - System Software | Pp. 245-256
A Scalable Asynchronous Replication-Based Strategy for Fault Tolerant MPI Applications
John Paul Walters; Vipin Chaudhary
As computational clusters increase in size, their mean-time-to-failure reduces. Typically checkpointing is used to minimize the loss of computation. Most checkpointing techniques, however, require a central storage for storing checkpoints. This severely limits the scalability of checkpointing. We propose a scalable replication-based MPI checkpointing facility that is based on LAM/MPI. We extend the existing state of fault-tolerant MPI with asynchronous replication, eliminating the need for central or network storage. We evaluate centralized storage, SAN-based solutions, and a commercial parallel file system, and show that they are not scalable, particularly beyond 64 CPUs. We demonstrate the low overhead of our replication scheme with the NAS Parallel Benchmarks and the High Performance LINPACK benchmark with tests up to 256 nodes while demonstrating that checkpointing and replication can be achieved with much lower overhead than that provided by current techniques.
- Session IV - System Software | Pp. 257-268
Towards a Transparent Data Access Model for the Grid Paradigm
Gabriel Antoniu; Eddy Caron; Frédéric Desprez; Aurélia Fèvre; Mathieu Jan
As grids become more and more attractive for solving complex problems with high computational and storage requirements, the need for adequate grid programming models is considerable. To this purpose, the model has been proposed as a grid version of the classical RPC paradigm, with the goal to build NES (Network-Enabled Server) environments. In this model, data management has not been defined and is now explicitly left at the user’s charge. The contribution of this paper is to enhance data management in NES by introducing a , available through the concept of grid data-sharing service. Data management is totally delegated to the service, whereas the applications simply access shared data via global identifiers. We illustrate our approach using the middleware and the data-sharing service. Notably, our experiments performed on the Grid’5000 using a real-life application show the efficiency of using for managing persitent data in the model: application execution times in a grid environment are of the same order as in a cluster environment.
- Session IV - System Software | Pp. 269-284
A Proxy-Based Self-tuned Overload Control for Multi-tiered Server Systems
Rukma P. Verlekar; Varsha Apte
Web-sites, especially E-commerce ones, are often faced with incoming load of requests that exceeds their capacity, i.e, they are subjected to . Most existing servers show severe throughput degradation at high overload. Overload control mechanisms are required to prevent such occurrences. In this paper, we present a proxy-based overload control mechanism, which uses the drop in throughput relative to arrival rate as an indicator of overload. On overload detection, a self-clocked admission control is activated, which admits a new request only when a successful reply is observed to be leaving the server system. Thus, the mechanism is , and requires no knowledge of the system. We validate our approach on an experimental testbed consisting of a two-tier Web application, and find that even at very high overload, the server operates at its maximum capacity while keeping response times within acceptable bounds.
- Session IV - System Software | Pp. 285-296
Approximation Algorithms for Scheduling with Reservations
Florian Diedrich; Klaus Jansen; Fanny Pascual; Denis Trystram
We study the problem of scheduling independent jobs on a system of identical parallel machines in the presence of reservations. This constraint is practically important; for various reasons, some machines are not available during specified time intervals. The objective is to minimize the makespan. This problem is inapproximable in the general case unless P = NP which motivates the study of suitable restrictions. We use an approach based on algorithms for multiple subset sum problems; our technique yields a polynomial time approximation scheme (PTAS) which is best possible in the sense that the problem does not admit an FPTAS unless P = NP. The PTAS presented here is the first one for the problem under consideration; so far, not even for special cases approximation schemes have been proposed. We also derive a low cost algorithm with a constant approximation ratio and discuss additional FPTASes for special cases and complexity results.
- Session V - Scheduling | Pp. 297-307
Enhanced Real-Time Divisible Load Scheduling with Different Processor Available Times
Xuan Lin; Ying Lu; Jitender Deogun; Steve Goddard
Providing QoS and performance guarantees for arbitrarily divisible loads in a cluster has become a significant problem. While progress is being made in scheduling arbitrarily divisible loads, some of the proposed approaches may cause Inserted Idle Times (IITs) that are detrimental to system performance. Two contributions are made in addressing this problem. First, we propose two constraints that, when satisfied, lead to an optimal partitioning in utilizing IITs. Second, we integrate the new partitioning method with a previous approach and develop an enhanced algorithm that better utilizes IITs. Simulation results demonstrate the advantages of our new approach.
- Session V - Scheduling | Pp. 308-319