Publications catalog - books


Advances in Computer Systems Architecture: 11th Asia-Pacific Conference, ACSAC 2006, Shanghai, China, September 6-8, 2006, Proceedings

Chris Jesshope ; Colin Egan (eds.)

Conference: 11th Asia-Pacific Conference on Advances in Computer Systems Architecture (ACSAC). Shanghai, China. September 6, 2006 - September 8, 2006

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Computer System Implementation; Arithmetic and Logic Structures; Input/Output and Data Communications; Logic Design; Computer Communication Networks; Processor Architectures

Availability
Detected institution | Publication year | Browse | Download | Request
Not detected | 2006 | SpringerLink

Information

Resource type:

books

Print ISBN

978-3-540-40056-1

Electronic ISBN

978-3-540-40058-5

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Table of contents

Static WCET Analysis Based Compiler-Directed DVS Energy Optimization in Real-Time Applications

Yi Huizhan; Chen Juan; Yang Xuejun

Compiler-directed dynamic voltage scaling (DVS) is an effective low-power technique for real-time applications. With this technique, the compiler inserts voltage scaling points into a real-time application, and at each voltage scaling point the supply voltage and clock frequency are adjusted according to the relationship between the remaining time and the remaining workload. In this paper, based on the WCET (worst-case execution time) analysis tool HEPTANE and the performance/power simulator Sim-Panalyzer, we present a DVS-enabled simulation environment, RTLPower (Real-Time Low Power), which integrates static WCET estimation, performance/power simulation, automatic insertion of DVS code into a real-time application, and profile-guided energy optimization. Simulations of several benchmark applications show that the DVS technique and the profile-guided optimization technique significantly reduce energy consumption.

Keywords: Real-time; Low-power; WCET; Compiler.

Pp. 123-136
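
The adjustment rule in the abstract above (frequency chosen from the remaining time and remaining workload at a voltage scaling point) can be illustrated with a minimal sketch. This is not the RTLPower implementation: the function name `select_frequency`, the cycle-count workload model, and the discrete frequency levels are all assumptions made for illustration.

```python
def select_frequency(remaining_cycles, remaining_time_s, levels_hz):
    """Pick the lowest available frequency that still meets the deadline.

    Hypothetical model: the remaining worst-case workload is a cycle count,
    and the processor offers a fixed set of frequency levels.
    """
    fastest = max(levels_hz)
    if remaining_time_s <= 0:
        # No slack left: run as fast as the hardware allows.
        return fastest
    # Minimum frequency at which the remaining cycles fit in the remaining time.
    required_hz = remaining_cycles / remaining_time_s
    for level in sorted(levels_hz):
        if level >= required_hz:
            return level
    # The deadline cannot be met even at the top level; fall back to it.
    return fastest
```

A lower frequency permits a lower supply voltage, which is where the energy saving comes from; the sketch only covers the frequency choice.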

A Study on Transformation of Self-similar Processes with Arbitrary Marginal Distributions

Hae-Duck J. Jeong; Jong-Suk R. Lee

Stochastic discrete-event simulation studies of communication networks often require a mechanism to transform self-similar processes with normal marginal distributions into self-similar processes with arbitrary marginal distributions. The problem of generating a self-similar process with a given marginal distribution and autocorrelation structure is difficult and has not been fully solved. The results presented in this paper provide clear experimental evidence that the autocorrelation function of the input process is not preserved in the output process generated by the inverse cumulative distribution function (ICDF) transformation when the output process has an infinite variance. On the other hand, the transformation preserves the autocorrelation function of the input process when the output marginal distributions (exponential, gamma, Pareto with α = 20.0, uniform and Weibull) have finite variances and the ICDF transformation is applied to long-range dependent self-similar processes with normal marginal distributions.

Keywords: Self-similar process; Arbitrary marginal distribution; Autocorrelation function; Inverse cumulative distribution function; Stochastic simulation.

Pp. 137-146
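
As a rough illustration of the ICDF transformation named in the abstract above, the sketch below maps samples with a standard normal marginal distribution to a target marginal by applying the normal CDF and then the target's inverse CDF. The function names, the exponential target, and the clamping constant are assumptions for illustration; generating the self-similar input process itself is outside this sketch.

```python
import math
from statistics import NormalDist


def exponential_icdf(u, rate=1.0):
    """Inverse CDF of an exponential distribution with the given rate."""
    return -math.log(1.0 - u) / rate


def icdf_transform(xs, inverse_cdf, eps=1e-12):
    """Map standard-normal samples to the marginal defined by inverse_cdf.

    Each sample x becomes u = Phi(x), which is uniform on (0, 1), and then
    y = F^{-1}(u). Clamping u away from 0 and 1 avoids infinite outputs.
    """
    std_normal = NormalDist(0.0, 1.0)
    ys = []
    for x in xs:
        u = min(max(std_normal.cdf(x), eps), 1.0 - eps)
        ys.append(inverse_cdf(u))
    return ys
```

Because both steps are monotonic, the ordering of the samples is kept; whether the autocorrelation function is kept is the question the paper studies.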

μTC – An Intermediate Language for Programming Chip Multiprocessors

Chris Jesshope

μTC is a language that has been designed for programming chip multiprocessors. More specifically, it has been developed to program chip multiprocessors based on arrays of microthreaded microprocessors, as these processors directly implement the concepts introduced in the language. However, it is more general than that and is being used in other projects as an interface defining dynamic concurrency. Ideally, a program written in μTC is a dynamic, concurrent control structure over small sequences of code, which in the limit could be a few instructions each. μTC is being used as an intermediate language to capture concurrency from data-parallel languages such as single-assignment C, parallelising compilers for sequential languages such as C, and concurrent composition languages such as Snet. μTC’s advantage over other approaches is that it allows an abstract representation of maximal concurrency in a schedule-independent form. Both Snet and μTC are being used in a European project called AETHER to support all aspects of self-adaptive computation.

Keywords: Self-adaptive computing; concurrent languages; data-driven computation; programming chip multiprocessors.

Pp. 147-160

Functional Unit Chaining: A Runtime Adaptive Architecture for Reducing Bypass Delays

Lih Wen Koh; Oliver Diessel

Bypass delays are expected to grow beyond 1 ns as technology scales. These delays necessitate pipelining of bypass paths at processor frequencies above 1 GHz and thus affect the performance of sequential code sequences. We propose dealing with these delays through a dynamic functional unit chaining approach. We study the performance benefits of a superscalar, out-of-order processor augmented with a two-by-two array of ALUs interconnected by a fast, partial bypass network. An online profiler guides the automatic configuration of the network to accelerate specific patterns of dependent instructions. A detailed study of benchmark simulations demonstrates that these first steps towards mapping binaries to a small coarse-grained array at runtime can improve instruction throughput by over 18% and 25% when the microarchitecture includes bypass delays of one cycle and two cycles, respectively.

Pp. 161-174

Trace-Based Data Cache Leakage Reduction at Link Time

Lian Li; Jingling Xue

This paper investigates the benefits of conducting leakage energy optimisations for data caches at link time for embedded applications. We introduce an improved algorithm for identifying and constructing the traces in a binary program and present a trace-based optimisation for reducing leakage energy in data caches. Our experimental results using Mediabench benchmarks show that good leakage energy savings can be achieved at the cost of small performance and code size penalties. Furthermore, by varying the granularity of optimisation regions, which is a tunable parameter, embedded application programmers can trade off energy savings against these associated costs.

Keywords: Optimisation Region; Data Cache; Cache Line; Leakage Power; Dynamic Voltage Scaling.

Pp. 175-188

Parallelizing User-Defined and Implicit Reductions Globally on Multiprocessors

Shih-wei Liao

Multiprocessors are becoming prevalent in the PC world. Major CPU vendors such as Intel and Advanced Micro Devices have migrated to multicore processors. However, this also means that computers will run an application at full speed only if that application is parallelized. To take advantage of more than a fraction of the compute resources on a die, we develop a compiler to parallelize a common and powerful programming paradigm, namely reduction. Our goal is to exploit the full potential of reductions for efficient execution of applications on multiprocessors, including multicores. Note that reduction operations are common in streaming applications, financial computing and the HPC domain. In fact, 9% of all MPI invocations in the NAS Parallel Benchmarks are reduction library calls. Recognizing implicit reductions in Fortran and C is important for parallelization on multiprocessors. Recent languages such as the Brook streaming language and Chapel allow users to specify reduction functions. Our compiler provides a unified framework for processing both implicit and user-defined reductions. Both types of reductions are propagated and analyzed interprocedurally. Our global algorithm can enhance the scope of user-defined reductions and parallelize coarser-grained reductions. Thanks to the powerful algorithm and representation, we obtain an average speedup of 3 on 4 processors. The speedup is only 1.7 if only intraprocedural scalar reductions are parallelized.

Keywords: Reduction; multiprocessor; multicore; reduction recognition; interprocedural analysis; data flow analysis; parallelization; implicit reductions; user-defined reductions.

Pp. 189-202
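
The reduction paradigm discussed in the abstract above can be sketched at the library level: each worker reduces one chunk to a partial result, and the partial results are then combined with the same operator. This is only an illustration of the pattern, not the compiler analysis the paper describes. The name `parallel_reduce` is hypothetical, and the operator is assumed to be associative with `identity` as its identity element.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def parallel_reduce(op, values, identity, workers=4):
    """Reduce values with a user-defined associative operator op.

    The input is split into at most `workers` contiguous chunks; each chunk
    is reduced independently, then the partial results are combined in order.
    """
    chunk_size = max(1, -(-len(values) // workers))  # ceiling division
    chunks = [values[i:i + chunk_size]
              for i in range(0, len(values), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda c: reduce(op, c, identity), chunks))
    # Combining partials in chunk order keeps the result correct for
    # associative operators that are not commutative.
    return reduce(op, partials, identity)
```

In CPython, threads will not speed up pure-Python arithmetic; the sketch shows the structure of the computation rather than a performance technique.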

Overload Protection for Commodity Network Appliances

Luke Macpherson

Performance degradation under overload is a well-known problem in networked systems. While this problem has been explored extensively in the context of TCP-based web servers, other applications have unique requirements which need to be addressed. In existing admission control systems, the cost of admission control increases with the load on the system. This is acceptable for responsive TCP-based loads, but it is not effective in preventing overload for unresponsive workloads. We present a solution where admission control cost is a function of the traffic admitted to the system, allowing our approach to maintain peak throughput under overload. We have implemented our approach in a real system and evaluated its effectiveness in preventing overload for a number of demanding network workloads. We find that our solution is effective in eliminating performance degradation under overload, while having the desirable property of being simple to implement in commodity systems.

Keywords: Performance Degradation; Admission Control; Network Interface; Maximum Throughput; Admission Control Mechanism.

Pp. 203-218

An Integrated Temporal Partitioning and Mapping Framework for Handling Custom Instructions on a Reconfigurable Functional Unit

Farhad Mehdipour; Hamid Noori; Morteza Saheb Zamani; Kazuaki Murakami; Mehdi Sedighi; Koji Inoue

Extensible processors allow customization for an application by extending the core instruction set architecture. Extracting appropriate custom instructions is an important phase in implementing an application on an extensible processor with a reconfigurable functional unit (RFU). Custom instructions (CIs) are usually extracted from critical portions of applications. This paper presents approaches for CI generation that respect the RFU constraints in order to improve the speedup of the extensible processor. First, our proposed RFU architecture for an adaptive dynamic extensible processor called AMBER is described. Then, an integrated temporal partitioning and mapping framework is presented to partition and map the CIs on the RFU. In this framework, a mapping-aware temporal partitioning algorithm is used to generate CIs which are mappable on the RFU. Temporal partitioning iterates and modifies partitions incrementally to generate CIs. In addition, a mapping algorithm is presented which supports CIs whose critical path length exceeds the RFU depth.

Keywords: Critical Path; Integrate Framework; Mapping Framework; Custom Instruction; Data Flow Graph.

Pp. 219-230

A High Performance Simulator System for a Multiprocessor System Based on a Multi-way Cluster

Arata Shinozaki; Masatoshi Shima; Minyi Guo; Mitsunori Kubo

In the ubiquitous era, research is needed on multiprocessor system architectures that offer high performance and low power consumption. A processor simulator developed in a high-level language is useful because its system architecture, including application-specific instruction sets and functions, can be changed easily. However, processing speed is a problem: both PCs and workstations provide insufficient performance for simulating a multiprocessor system. In this research, a simulator for a multiprocessor system based on a multi-way cluster was developed. In the developed simulator system, one processor model consists of an instruction set simulator (ISS) process and several inter-processor communication processes. To maximize simulation performance, each processor model is assigned to a specific CPU on the multi-way cluster. In addition, each inter-processor communication process is implemented using the MPI library, which can minimize CPU resource usage in a communication waiting state. Evaluation of the processing and communication performance using a distributed application program such as JPEG encoding shows that each ISS process in the developed simulator system consumes approximately 100% of CPU resources while maintaining sufficient inter-processor communication performance. This result means that performance increases in proportion to the number of CPUs integrated in the cluster.

Keywords: Processing Module; Processing Element; Simulator System; Multiprocessor System; Communication Module.

Pp. 231-243

Hardware Budget and Runtime System for Data-Driven Multithreaded Chip Multiprocessor

Kyriakos Stavrou; Pedro Trancoso; Paraskevas Evripidou

The Data-Driven Multithreading Chip Multiprocessor (DDM-CMP) architecture has been shown to overcome the power and memory wall limitations by combining two key technologies: the Data-Driven Multithreading (DDM) model of execution and the Chip-Multiprocessor (CMP) architecture. DDM is able to hide memory and synchronization latencies, providing significant performance gains, whereas the CMP architecture offers a high degree of parallelism with a low-complexity design and is therefore power efficient. This paper presents the hardware budget analysis and the runtime support system for the DDM-CMP architecture. The hardware analysis shows that the DDM benefits may be achieved with only a 17% hardware cost increase compared to a traditional chip-multiprocessor implementation. The runtime system support was designed so that DDM applications can execute on the DDM-CMP chip using a regular, unmodified operating system and CPU cores.

Keywords: Code Block; Runtime System; Chip Multiprocessor; Graph Memory; Content Addressable Memory.

Pp. 244-259