Publications catalog - books



High Performance Computing: HiPC 2007: 14th International Conference, Goa, India, December 18-21, 2007. Proceedings

Srinivas Aluru ; Manish Parashar ; Ramamurthy Badrinath ; Viktor K. Prasanna (eds.)

Conference: 14th International Conference on High-Performance Computing (HiPC). Goa, India. December 18, 2007 - December 21, 2007

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Processor Architectures; Software Engineering/Programming and Operating Systems; Computer Systems Organization and Communication Networks; Algorithm Analysis and Problem Complexity; Computation by Abstract Devices; Mathematics of Computing

Availability

Institution detected: none. Year of publication: 2007. Browse at: SpringerLink.

Information

Resource type:

books

Print ISBN

978-3-540-77219-4

Electronic ISBN

978-3-540-77220-0

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

Publication rights information

© Springer-Verlag Berlin Heidelberg 2007

Table of contents

An FPGA-Based Accelerator for Multiple Biological Sequence Alignment with DIALIGN

Azzedine Boukerche; Jan Mendonca Correa; Alba Cristina Magalhaes Alves de Melo; Ricardo Pezzuol Jacobi; Adson Ferreira Rocha

Multiple sequence alignment (MSA) is a very important problem in Computational Biology, since it is often used to identify evolutionary relationships among organisms and to predict secondary/tertiary structure. Since MSA is known to be a computationally challenging problem, many proposals have been made to accelerate it, either by using parallel processing or hardware accelerators. In this paper, we propose an FPGA-based accelerator to execute the most compute-intensive part of DIALIGN, an iterative method to obtain multiple sequence alignments. The experimental results collected on our 200-element FPGA prototype show that a speedup of 383.41 was obtained when compared with the software implementation.

- Session I - Applications on I/O and FPGAs | Pp. 71-82
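FPGA accelerators for alignment kernels such as DIALIGN's typically exploit the fact that cells on the same anti-diagonal of the dynamic-programming matrix are mutually independent. As a rough software illustration of that scheduling idea (a generic global-alignment recurrence, not the authors' actual DIALIGN hardware), one might write:

```python
def score_matrix_antidiagonal(a, b, match=1, mismatch=-1, gap=-1):
    """Fill a global-alignment DP matrix by anti-diagonals.

    All cells with the same i + j depend only on earlier anti-diagonals,
    so a hardware accelerator can compute each anti-diagonal in parallel.
    """
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        H[i][0] = i * gap
    for j in range(1, n + 1):
        H[0][j] = j * gap
    for d in range(2, m + n + 1):                 # anti-diagonal index i + j
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + s,    # substitution
                          H[i - 1][j] + gap,      # gap in b
                          H[i][j - 1] + gap)      # gap in a
    return H[m][n]
```

In hardware, the cells of one anti-diagonal map onto an array of processing elements that advance in lockstep, which is where the reported speedup comes from.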

A Speed-Area Optimization of Full Search Block Matching Hardware with Applications in High-Definition TVs (HDTV)

Avishek Saha; Santosh Ghosh

HDTV-based applications require full-search block matching (FSBM) to sustain resolutions significantly higher than those of traditional broadcasting formats (NTSC, SECAM, PAL). This paper proposes techniques to increase the speed and reduce the area requirements of FSBM hardware. These techniques are based on modifications of the Sum-of-Absolute-Differences (SAD) computation and the macroblock (MB) searching strategy. The design of an FSBM architecture based on the proposed approaches is also outlined. The highlight of the proposed architecture is its split pipelined design, which facilitates parallel processing of macroblocks in the initial stages. The proposed hardware has high throughput and low silicon area, and compares favorably with other existing FPGA architectures.

- Session I - Applications on I/O and FPGAs | Pp. 83-94
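The SAD kernel at the heart of full-search block matching is simple enough to sketch directly. The following reference version (illustrative only, not the paper's optimized hardware) exhaustively scores every candidate displacement in a search window:

```python
def sad(block_a, block_b):
    """Sum of Absolute Differences between two equally sized pixel blocks."""
    return sum(abs(p - q)
               for row_a, row_b in zip(block_a, block_b)
               for p, q in zip(row_a, row_b))

def full_search(cur, ref, bx, by, n, r):
    """Exhaustively search a +/- r window in frame `ref` for the n x n
    block of frame `cur` anchored at (bx, by).
    Returns (best_dx, best_dy, best_sad)."""
    block = [row[bx:bx + n] for row in cur[by:by + n]]
    best = None
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            x, y = bx + dx, by + dy
            if x < 0 or y < 0 or x + n > len(ref[0]) or y + n > len(ref):
                continue                      # candidate outside the frame
            cand = [row[x:x + n] for row in ref[y:y + n]]
            d = sad(block, cand)
            if best is None or d < best[2]:
                best = (dx, dy, d)
    return best
```

The inner SAD loop is exactly the datapath that FSBM hardware replicates and pipelines; software evaluates the (2r+1)² candidates one by one.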

Evaluating ISA Support and Hardware Support for Recursive Data Layouts

Won-Taek Lim; Mithuna Thottethodi

Recursive data layouts for matrices (two-dimensional arrays) have been proposed to ameliorate the poor data locality caused by traditional layouts like row-major and column-major [3][12]. However, recursive data layouts require non-traditional address computation involving bit-level manipulations that are not supported in current processors. As such, a number of software-based address computation techniques have been developed, ranging from table-lookup-based techniques to arithmetic-and-logic-operation-based techniques. This effectively trades extra computation for locality. In this paper, we design appropriate instruction set architecture (ISA) support and hardware support for address computation in recursive data layouts. Our technique captures the locality benefits of the sophisticated data layouts while avoiding the cost of software-based address computation. Simulations reveal that our hardware approach improves the performance of matrix multiplication by factors ranging from 30% to 59% over software-computed Morton-ordered indexing, especially at larger matrix sizes.

- Session II - Microarchitecture and Multiprocessor Architecture | Pp. 95-106
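The "non-traditional address computation" mentioned above is bit interleaving: a Morton (Z-order) index is formed by interleaving the bits of the row and column indices, which a general-purpose ALU cannot do in a single instruction. A straightforward software version (the per-access overhead that the proposed ISA/hardware support eliminates) looks like:

```python
def morton_index(row, col, bits=16):
    """Interleave the bits of (row, col) into a Morton (Z-order) index.

    Column bits land in even positions, row bits in odd positions, so
    nearby (row, col) pairs map to nearby linear addresses.
    """
    z = 0
    for b in range(bits):
        z |= ((col >> b) & 1) << (2 * b)       # even bit positions
        z |= ((row >> b) & 1) << (2 * b + 1)   # odd bit positions
    return z
```

Note that the four elements of any aligned 2x2 tile get consecutive indices, which is the locality property recursive layouts are built on.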

qTLB: Looking Inside the Look-Aside Buffer

Omesh Tickoo; Hari Kannan; Vineet Chadha; Ramesh Illikkal; Ravi Iyer; Donald Newell

Rapid evolution of multi-core platforms is putting additional stress on shared processor resources like TLB. TLBs have mostly been private resources for the application running on the core, due to the constant flushing of entries on context switches. Recent technologies like virtualization enable independent execution of software domains leading to performance issues because of interesting dynamics at the shared hardware resources. The advent of TLB tagging with application and VM identifiers, however, increases the lifespan of these resources. In this paper, we demonstrate that TLB tagging and refraining from flushing the hypervisor TLB entries during a VM context switch can lead to considerable performance benefits. We show that it is possible to improve the TLB performance of an important application by protecting its TLB entries from the interference of other low priority VMs/applications and providing differentiated service. We present our QoS architecture framework for TLB (qTLB) and show its benefits.

- Session II - Microarchitecture and Multiprocessor Architecture | Pp. 107-118
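The benefit of tagging can be seen in a toy model: if each TLB entry carries an address-space/VM tag, a context switch no longer forces a flush, so a VM that is switched back in still hits on its old entries. The sketch below is a hypothetical illustration (the class name, structure, and LRU policy are ours, not the qTLB design):

```python
from collections import OrderedDict

class TaggedTLB:
    """Toy TLB whose entries are keyed by (tag, virtual_page).

    Because the tag disambiguates address spaces, a context switch keeps
    all entries instead of flushing them.
    """
    def __init__(self, capacity=4):
        self.entries = OrderedDict()   # (tag, vpage) -> frame, LRU order
        self.capacity = capacity
        self.hits = self.misses = 0

    def lookup(self, tag, vpage, page_table):
        key = (tag, vpage)
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)          # refresh LRU position
        else:
            self.misses += 1                        # walk the page table
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)    # evict LRU entry
            self.entries[key] = page_table[key]
        return self.entries[key]
```

In an untagged TLB the third access below would miss again after the switch to "vm2"; with tags it hits, which is the effect the paper builds its QoS framework on.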

Analysis of x86 ISA Condition Codes Influence on Superscalar Execution

Virginia Escuder; Raúl Durán; Rafael Rico

Instruction sets may have particular characteristics that negatively impact the amount of available parallelism, and the x86 instruction set architecture includes some of them. In particular, the negative impact of condition-code usage is well known: to a first approximation, condition codes can be considered responsible for greater code coupling. Moreover, several in-depth works show that they introduce additional complexity both in microcode binary translation and in supporting precise exception mechanisms, among others. To the best of our knowledge, no quantitative evaluation has been carried out to determine the impact of condition-code usage on the performance of x86 processors. In this work, we present a proposal for such a quantification based on Graph Theory.

- Session II - Microarchitecture and Multiprocessor Architecture | Pp. 119-132

Efficient Message Management in Tiled CMP Architectures Using a Heterogeneous Interconnection Network

Antonio Flores; Juan L. Aragón; Manuel E. Acacio

Previous studies have shown that the interconnection network of a Chip Multiprocessor (CMP) has a significant impact on both overall performance and energy consumption. Moreover, the wires used in such an interconnect can be designed with varying latency, bandwidth and power characteristics. In this work, we present a proposal for performance- and energy-efficient message management in tiled CMPs using a heterogeneous interconnect. Our proposal combines a technique that classifies all coherence messages into critical-and-short versus non-critical-and-long, with a heterogeneous interconnection network comprised of low-latency wires for critical messages and low-energy wires for non-critical ones. Through detailed simulations of 8- and 16-core CMPs, we show that our proposal obtains average improvements of 8% in execution time and 65% in the Energy-Delay Product metric of the interconnect over previous works.

- Session II - Microarchitecture and Multiprocessor Architecture | Pp. 133-146

Direct Coherence: Bringing Together Performance and Scalability in Shared-Memory Multiprocessors

Alberto Ros; Manuel E. Acacio; José M. García

Traditional directory-based cache coherence protocols suffer from long-latency cache misses as a consequence of the indirection introduced by the home node, which must be accessed on every cache miss before any coherence action can be performed. In this work we present a new protocol that moves the role of storing up-to-date coherence information (and thus ensuring totally ordered accesses) from the home node to one of the sharing caches. Our protocol allows most cache misses to be directly solved from the corresponding remote caches, without requiring the intervention of the home node. In this way, cache miss latencies are reduced. Detailed simulations show that this protocol leads to improvements in total execution time of 8% on average over a highly optimized MOESI directory-based protocol.

- Session II - Microarchitecture and Multiprocessor Architecture | Pp. 147-160

Constraint-Aware Large-Scale CMP Cache Design

L. Zhao; R. Iyer; S. Makineni; R. Illikkal; J. Moses; D. Newell

Within the next decade, we expect large-scale CMP (LCMP) platforms consisting of tens of cores to become mainstream. The performance and scalability of these architectures are highly dependent on the design of the cache hierarchy. In this paper, our goal is to explore the cache design space for LCMP platforms, which can be vast and subject to several constraints. We approach this exploration problem by developing a constraint-aware analysis methodology (CAAM). CAAM first considers two important constraints – cache area and on-die/off-die bandwidth limitations – to determine a viable range of cache hierarchy options. We then estimate the bandwidth requirements for these options by running server workload traces on our LCMP performance model. Based on allowable bandwidth constraints, we narrow the design space further to highlight a few cache options. Finally, we compare these options based on performance, area and bandwidth trade-offs to make recommendations.

- Session II - Microarchitecture and Multiprocessor Architecture | Pp. 161-171

FFTC: Fastest Fourier Transform for the IBM Cell Broadband Engine

David A. Bader; Virat Agarwal

The Fast Fourier Transform (FFT) is of primary importance and a fundamental kernel in many computationally intensive scientific applications. In this paper we investigate its performance on the Sony-Toshiba-IBM Cell Broadband Engine, a heterogeneous multicore chip architected for intensive gaming applications and high performance computing. The Cell processor consists of a traditional microprocessor (called the PPE) that controls eight SIMD co-processing units called synergistic processor elements (SPEs). We exploit the architectural features of the Cell processor to design an efficient parallel implementation of Fast Fourier Transform (FFT). While there have been several attempts to develop a fast implementation of FFT on the Cell, none have been able to achieve high performance for input series with several thousand complex points. We use an iterative out-of-place approach to design our parallel implementation of FFT with 1K to 16K complex input samples and attain a single precision performance of 18.6 GFLOP/s on the Cell. Our implementation beats FFTW on Cell by several GFLOP/s for these input sizes and outperforms Intel Duo Core (Woodcrest) for inputs of greater than 2K samples. To our knowledge we have the fastest FFT for this range of complex inputs.

- Session III - Applications of Novel Architectures | Pp. 172-184
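For reference, the algorithm family the paper parallelizes is the iterative radix-2 Cooley-Tukey FFT. A textbook scalar version (nothing like the SIMD-vectorized, out-of-place Cell kernel, but with the same butterfly structure) is:

```python
import cmath

def fft_iterative(x):
    """Iterative radix-2 Cooley-Tukey FFT for power-of-two lengths."""
    n = len(x)
    assert n and n & (n - 1) == 0, "length must be a power of two"
    a = list(x)
    # bit-reversal permutation brings butterflies into contiguous groups
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # log2(n) butterfly stages
    length = 2
    while length <= n:
        w_len = cmath.exp(-2j * cmath.pi / length)  # stage twiddle factor
        for start in range(0, n, length):
            w = 1.0
            for k in range(length // 2):
                u = a[start + k]
                v = a[start + k + length // 2] * w
                a[start + k] = u + v
                a[start + k + length // 2] = u - v
                w *= w_len
        length <<= 1
    return a
```

Each stage's butterflies are independent, which is what lets the Cell implementation split the work across the eight SPEs and vectorize within each.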

Molecular Dynamics Simulations on Commodity GPUs with CUDA

Weiguo Liu; Bertil Schmidt; Gerrit Voss; Wolfgang Müller-Wittig

Molecular dynamics simulations are a common and often repeated task in molecular biology. The need for speeding up this treatment comes from the requirement for large system simulations with many atoms and numerous time steps. In this paper we present a new approach to high performance molecular dynamics simulations on graphics processing units. Using modern graphics processing units for high performance computing is facilitated by their enhanced programmability and motivated by their attractive price/performance ratio and incredible growth in speed. To derive an efficient mapping onto this type of architecture, we have used the Compute Unified Device Architecture (CUDA) to design and implement a new parallel algorithm. This results in an implementation with significant runtime savings on an off-the-shelf computer graphics card.

- Session III - Applications of Novel Architectures | Pp. 185-196
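The computational core that makes molecular dynamics attractive for GPUs is the all-pairs force evaluation, where each particle's force sum is independent of the others and therefore maps naturally onto one CUDA thread per particle. A scalar sketch of a Lennard-Jones version (reduced units, no cutoff or periodic boundaries — our simplifications, not the paper's) is:

```python
def lj_forces(pos, epsilon=1.0, sigma=1.0):
    """All-pairs Lennard-Jones forces; pos is a list of [x, y, z] triples.

    The outer loop over i is the part a CUDA port assigns to one thread
    per particle; each thread accumulates its own force privately.
    """
    n = len(pos)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):                    # on the GPU: one thread per i
        for j in range(n):
            if i == j:
                continue
            d = [pos[i][k] - pos[j][k] for k in range(3)]
            r2 = sum(c * c for c in d)
            s6 = (sigma * sigma / r2) ** 3        # (sigma/r)^6
            # F = 24*eps*(2*(sigma/r)^12 - (sigma/r)^6) / r, along d/r
            f = 24.0 * epsilon * (2.0 * s6 * s6 - s6) / r2
            for k in range(3):
                forces[i][k] += f * d[k]
    return forces
```

The O(N²) pair loop is exactly where the runtime savings reported above come from: every thread reads the shared position array but writes only its own force, so no synchronization is needed within a time step.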