Publications catalog - books



High Performance Embedded Architectures and Compilers: Second International Conference, HiPEAC 2007, Ghent, Belgium, January 28-30, 2007. Proceedings

Koen De Bosschere; David Kaeli; Per Stenström; David Whalley; Theo Ungerer (eds.)

Conference: 2nd International Conference on High-Performance Embedded Architectures and Compilers (HiPEAC), Ghent, Belgium, January 28-30, 2007

Abstract/Description – provided by the publisher

Not available.

Keywords – provided by the publisher

Theory of Computation; Arithmetic and Logic Structures; Processor Architectures; Input/Output and Data Communications; Logic Design; Computer Communication Networks

Availability

Detected institution: Not detected
Publication year: 2007
Browse: SpringerLink

Information

Resource type:

books

Print ISBN

978-3-540-69337-6

Electronic ISBN

978-3-540-69338-3

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

2007

Publication rights information

© Springer-Verlag Berlin Heidelberg 2007

Table of contents

Keynote: Insight, Not (Random) Numbers: An Embedded Perspective

Thomas M. Conte

Hamming said that the purpose of computing was “insight, not numbers.” Yet benchmarking embedded systems is today a numbers game. In this talk, I will dissect Hamming’s famous quote and provide some reasons to hope we can make benchmarking of embedded systems into a science. In particular, I will discuss how to model and measure quantities so that one can gain confidence in the results. I will use the industry-standard EEMBC benchmark set as an example. Along the way, I will (I hope) give some insight into what the EEMBC benchmarks are trying to test.

- Invited Program | Pp. 3-3

Compiler-Assisted Memory Encryption for Embedded Processors

Vijay Nagarajan; Rajiv Gupta; Arvind Krishnaswamy

A critical component in the design of secure processors is memory encryption, which protects the privacy of code and data stored in off-chip memory. The overhead of the decryption operation that must precede a load requiring an off-chip memory access, decryption being on the critical path, can significantly degrade performance. Recently, hardware counter-based one-time-pad encryption techniques [11,13,9] have been proposed to reduce this overhead. For high-end processors the performance impact of decryption has been successfully limited, due to the presence of fairly large on-chip L1 and L2 caches that reduce off-chip accesses, and to the additional hardware support proposed in [13,9] to reduce decryption latency. However, for low- to medium-end embedded processors the performance degradation is high because, first, they support only small (if any) on-chip L1 caches, leading to significant off-chip accesses, and second, the hardware cost of the decryption-latency-reduction solutions in [13,9] is too high, making them unattractive for embedded processors. In this paper we present a compiler-assisted strategy that uses minimal hardware support to reduce the overhead of memory encryption in low- to medium-end embedded processors. Our experiments show that the proposed technique reduces the average execution-time overhead of memory encryption for a low-end (medium-end) embedded processor with a 0 KB (32 KB) L1 cache from 60% (13.1%) with a single counter to 12.5% (2.1%) using only 8 additional hardware counter-registers.

- I Secure and Low-Power Embedded Memory Systems | Pp. 7-22
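As a rough illustration of the counter-based one-time-pad scheme described above, the C sketch below derives a pad from (key, block address, counter), so the pad can be computed while the off-chip access is in flight and decryption shrinks to a single XOR off the critical path. The mixing function, the 8-counter mapping, and all names are illustrative stand-ins (a real design would use AES), not the paper's implementation.

```c
/* Illustrative sketch, not the paper's design: counter-mode one-time-pad
 * memory encryption. A pad depends on (key, address, per-counter value),
 * so pads are never reused and can be precomputed during the memory access. */
#include <stdint.h>
#include <stdio.h>

#define NUM_COUNTERS 8          /* small counter-register file, as in the paper's setting */

static uint64_t counters[NUM_COUNTERS];

/* Stand-in keyed mixer; hardware would compute AES(key, addr || ctr). */
static uint64_t pad(uint64_t key, uint64_t addr, uint64_t ctr) {
    uint64_t x = key ^ (addr * 0x9E3779B97F4A7C15ULL) ^ (ctr * 0xC2B2AE3D27D4EB4FULL);
    x ^= x >> 33; x *= 0xFF51AFD7ED558CCDULL; x ^= x >> 33;
    return x;
}

/* Encrypt a block on write-back: bump the counter so this pad is fresh. */
uint64_t encrypt_block(uint64_t key, uint64_t addr, int c, uint64_t plain) {
    counters[c]++;
    return plain ^ pad(key, addr, counters[c]);
}

/* Decrypt on load: the pad is computed in parallel with the off-chip access. */
uint64_t decrypt_block(uint64_t key, uint64_t addr, int c, uint64_t cipher) {
    return cipher ^ pad(key, addr, counters[c]);
}

int main(void) {
    uint64_t key = 0xDEADBEEFCAFEF00DULL, addr = 0x1000, data = 42;
    uint64_t ct = encrypt_block(key, addr, 0, data);
    printf("cipher=%016llx plain=%llu\n", (unsigned long long)ct,
           (unsigned long long)decrypt_block(key, addr, 0, ct));
    return 0;
}
```

The compiler's role in the paper is to map memory regions onto the few available counter-registers; the sketch simply takes the counter index as an argument.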

Leveraging High Performance Data Cache Techniques to Save Power in Embedded Systems

Major Bhadauria; Sally A. McKee; Karan Singh; Gary S. Tyson

Voltage scaling reduces leakage power for cache lines unlikely to be referenced soon, while partitioning reduces dynamic power via smaller, specialized structures. We combine both approaches, adding a voltage-scaling design that provides finer control of power budgets. This delivers good performance at low power, consuming 34% of the power of previous designs.

- I Secure and Low-Power Embedded Memory Systems | Pp. 23-37
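A toy sketch of the voltage-scaling half of the idea above, under assumed parameters: lines idle beyond a decay window drop to a low-voltage drowsy state, and waking one costs an extra cycle. The window length and wake penalty are invented for illustration; the paper's actual design is more involved.

```c
/* Illustrative drowsy-line model: trade a small wake-up latency on rarely
 * used lines for a large reduction in leakage at the scaled voltage. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define LINES        64
#define DROWSY_AFTER 4000     /* assumed idle window, in cycles */

typedef struct { uint64_t last_use; bool drowsy; } line_t;
static line_t cache[LINES];

/* On every access: waking a drowsy line costs one extra cycle. */
int access_line(int idx, uint64_t now) {
    int penalty = cache[idx].drowsy ? 1 : 0;
    cache[idx].drowsy = false;
    cache[idx].last_use = now;
    return penalty;
}

/* Periodic sweep: scale down lines unlikely to be referenced soon. */
void decay_sweep(uint64_t now) {
    for (int i = 0; i < LINES; i++)
        if (!cache[i].drowsy && now - cache[i].last_use > DROWSY_AFTER)
            cache[i].drowsy = true;
}

int main(void) {
    access_line(3, 0);                 /* line 3 used at cycle 0 */
    decay_sweep(5000);                 /* idle > 4000 cycles: goes drowsy */
    printf("wake penalty on line 3: %d cycle(s)\n", access_line(3, 5000));
    return 0;
}
```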

Applying Decay to Reduce Dynamic Power in Set-Associative Caches

Georgios Keramidas; Polychronis Xekalakis; Stefanos Kaxiras

In this paper, we propose a novel approach to reducing dynamic power in set-associative caches that builds on a leakage-saving proposal, namely Cache Decay. We thus open the possibility of unifying dynamic and leakage management in the same framework. The main intuition is that in a decaying cache, dead lines in a set need not be searched. Thus, rather than trying to predict which cache way holds a specific line, we predict, for each way, whether the line could be live in it. We access all the ways that possibly contain the live line; we call this way-selection. In contrast to way-prediction, way-selection cannot be wrong: the line is either in the selected ways or not in the cache. The important implication is that we have a fixed hit time, indispensable for both performance and ease-of-implementation reasons. To achieve high accuracy, in terms of total ways accessed, we use Decaying Bloom filters to track only the live lines in ways; dead lines are automatically purged. We offer efficient implementations of such autonomously Decaying Bloom filters, using novel quasi-static cells. Our prediction approach grants us high accuracy in narrowing the choice of ways for hits, as well as the ability to predict misses, a known weakness of way-prediction.

- I Secure and Low-Power Embedded Memory Systems | Pp. 38-53
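The way-selection idea above can be sketched as one small Bloom-style counter filter per way that tracks live lines; a lookup probes only the ways whose filter may contain the address, so selecting too many ways is safe and hit time stays fixed. The hash and filter size below are illustrative assumptions, and real Decaying Bloom filters decay autonomously in hardware rather than via the explicit calls used here.

```c
/* Way-selection sketch: probe only ways that might hold a live copy. */
#include <stdint.h>
#include <stdio.h>

#define WAYS        4
#define FILTER_BITS 256

static uint8_t filt[WAYS][FILTER_BITS];   /* small counters per way */

static unsigned h(uint64_t addr) {        /* assumed hash function */
    return (unsigned)((addr * 0x9E3779B97F4A7C15ULL) >> 56) % FILTER_BITS;
}

void line_filled(int way, uint64_t addr)  { filt[way][h(addr)]++; }  /* line becomes live */
void line_decayed(int way, uint64_t addr) { filt[way][h(addr)]--; }  /* dead line purged  */

/* Bitmask of ways that may hold addr; only these are searched, so the
 * result can over-select (costing power) but never miss a live line. */
unsigned select_ways(uint64_t addr) {
    unsigned mask = 0;
    for (int w = 0; w < WAYS; w++)
        if (filt[w][h(addr)] > 0) mask |= 1u << w;
    return mask;
}

int main(void) {
    line_filled(2, 0xABC0);
    printf("probe mask for 0xABC0: 0x%x\n", select_ways(0xABC0));
    line_decayed(2, 0xABC0);              /* decay purges the dead line */
    printf("probe mask after decay: 0x%x\n", select_ways(0xABC0));
    return 0;
}
```

An all-zero mask predicts a miss outright, which is the miss-prediction ability the abstract contrasts with way-prediction.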

Virtual Registers: Reducing Register Pressure Without Enlarging the Register File

Jun Yan; Wei Zhang

This paper proposes a novel scheme to mitigate register pressure in statically scheduled high-performance embedded processors without physically enlarging the register file. Our scheme exploits the fact that a large fraction of variables are short-lived and need not be written to or read from real registers. Instead, the compiler can allocate these short-lived variables to virtual registers, which are simply placeholders (rather than physical storage locations in the register file) that identify dependences among instructions. Our experimental results demonstrate that virtual registers are very effective at reducing the number of register spills; in many cases they achieve performance close to that of a processor with twice the number of real registers. Our results also indicate that for some multimedia and communication applications, using a large number of virtual registers with a small number of real registers can even achieve higher performance than a mid-sized register file without any virtual registers.

- II Architecture/Compiler Optimizations for Efficient Embedded Processing | Pp. 57-70
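A hypothetical compiler-side sketch of the virtual-register idea: live ranges short enough to be satisfied by the forwarding network get a virtual (placeholder) name, longer ones get a real register, and whatever remains spills. The bypass distance and register counts are assumptions for illustration, not the paper's parameters.

```c
/* Toy allocation pass: short-lived values bypass the physical register file. */
#include <stdio.h>

#define BYPASS_DIST 3    /* assumed forwarding window, in instructions */
#define REAL_REGS   8
#define VIRT_REGS   24

typedef struct { int def, last_use; } range_t;

/* Pick a register name for a live range: virtual if the value can travel
 * entirely on the bypass network, real otherwise, else spill to memory. */
const char *assign(range_t r, int *next_real, int *next_virt) {
    static char buf[16];
    if (r.last_use - r.def <= BYPASS_DIST && *next_virt < VIRT_REGS)
        snprintf(buf, sizeof buf, "v%d", (*next_virt)++);  /* placeholder only */
    else if (*next_real < REAL_REGS)
        snprintf(buf, sizeof buf, "r%d", (*next_real)++);  /* physical slot */
    else
        return "spill";
    return buf;
}

int main(void) {
    range_t ranges[] = { {0, 2}, {1, 9}, {3, 4}, {5, 20} };
    int nr = 0, nv = 0;
    for (int i = 0; i < 4; i++)
        printf("range %d -> %s\n", i, assign(ranges[i], &nr, &nv));
    return 0;
}
```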

Bounds Checking with Taint-Based Analysis

Weihaw Chuang; Satish Narayanasamy; Brad Calder; Ranjit Jhala

We analyze the performance of different bounds checking implementations. Specifically, we examine using the x86 bound instruction to reduce the run-time overhead. We also propose a compiler optimization that prunes the bounds checks that are not necessary to guarantee security. The optimization is based on the observation that buffer overflow attacks are launched through external inputs. Therefore, it is sufficient to bounds check only the accesses to those data structures that can possibly hold the external inputs. Also, it is sufficient to bounds check only the memory writes. The proposed optimizations reduce the number of required bounds checks as well as the amount of meta-data that needs to be maintained to perform those checks.

- II Architecture/Compiler Optimizations for Efficient Embedded Processing | Pp. 71-86
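A minimal C sketch of the pruning rule stated above: only writes, and only into buffers that may hold external (tainted) input, keep their bounds checks. The buf_t type and taint flag are a hypothetical illustration, not the paper's metadata representation.

```c
/* Taint-pruned bounds checking: the compiler keeps checks only where an
 * attacker-controlled input could reach the written buffer. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct { char *base; size_t len; bool tainted; } buf_t;

void store(buf_t *b, size_t idx, char v) {
    if (b->tainted && idx >= b->len) {   /* check kept: external data may reach b */
        fprintf(stderr, "bounds violation at index %zu\n", idx);
        abort();
    }
    b->base[idx] = v;                    /* untainted writes: check pruned */
}

int main(void) {
    char a[4], b[4];
    buf_t user = { a, sizeof a, true  }; /* may hold external input: checked */
    buf_t tmp  = { b, sizeof b, false }; /* proven internal: unchecked */
    store(&tmp, 1, 'x');
    store(&user, 2, 'y');
    store(&user, 9, 'z');                /* caught: aborts */
    return 0;
}
```

Reads and untainted writes run unchecked, which is where the abstract's reduction in checks and metadata comes from.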

Reducing Exit Stub Memory Consumption in Code Caches

Apala Guha; Kim Hazelwood; Mary Lou Soffa

The interest in translation-based virtual execution environments (VEEs) is growing with the recognition of their importance in a variety of applications. However, due to constrained memory and energy resources, developing a VEE for an embedded system presents a number of challenges. In this paper we focus on the VEE’s memory overhead, and in particular the code cache. Both code traces and exit stubs are stored in a code cache; exit stubs keep track of the branches off a trace, and we show they consume up to 66.7% of the code cache. We present four techniques for reducing the space occupied by exit stubs: two that assume unbounded code caches and the absence of code cache invalidations, and two that operate without these restrictions. These techniques reduce space by 43.5% and also improve performance by 1.5%. After applying our techniques, the percentage of the resulting code cache consumed by exit stubs drops to 41.4%.

- II Architecture/Compiler Optimizations for Efficient Embedded Processing | Pp. 87-101
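To make the space problem concrete, the toy model below allocates one exit stub per unique off-trace target instead of one per exit branch; sharing duplicates is one plausible direction for shrinking stub space, though it is only a sketch and not necessarily one of the paper's four techniques.

```c
/* Toy code-cache model: each off-trace branch needs a stub that records
 * the intended target and re-enters the dispatcher. Sharing one stub per
 * unique target collapses duplicates. */
#include <stdint.h>
#include <stdio.h>

#define MAX_STUBS 128
static uint64_t stub_target[MAX_STUBS];
static int nstubs;

/* Return an existing stub for this target if present, else create one. */
int get_shared_stub(uint64_t target) {
    for (int i = 0; i < nstubs; i++)
        if (stub_target[i] == target) return i;   /* reuse: no new stub space */
    stub_target[nstubs] = target;
    return nstubs++;
}

int main(void) {
    uint64_t exits[] = { 0x400100, 0x400200, 0x400100, 0x400100 };
    for (int i = 0; i < 4; i++)
        printf("exit to %#llx -> stub %d\n",
               (unsigned long long)exits[i], get_shared_stub(exits[i]));
    printf("stubs allocated: %d (vs 4 without sharing)\n", nstubs);
    return 0;
}
```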

Reducing Branch Misprediction Penalties Via Adaptive Pipeline Scaling

Chang-Ching Yeh; Kuei-Chung Chang; Tien-Fu Chen; Chingwei Yeh

Pipeline scaling provides an attractive solution to the increasingly serious branch misprediction penalties of deep-pipeline processors. In this paper we investigate techniques for reducing branch misprediction penalties. We present a dual supply-voltage architecture framework that can be efficiently exploited in a deep-pipeline processor to reduce pipeline depth depending on the confidence level of the branches in the pipeline. We also propose two techniques that increase the efficiency of pipeline scaling. With these techniques, we show that our approach not only provides fast branch misprediction recovery, but also speeds up the resolution of mispredicted branches. An evaluation in a 13-stage superscalar processor with benchmarks from SPEC2000 applications shows a performance improvement (between 3% and 12%, average 8%) over a baseline processor that does not exploit pipeline scaling.

- III Adaptive Microarchitectures | Pp. 105-119
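A hedged sketch of the confidence-driven control implied above: when the number of unresolved low-confidence branches in flight crosses a threshold, the pipeline is scaled to a shallower configuration so a misprediction flushes fewer stages. The depths and threshold below are invented for illustration; only the 13-stage baseline depth comes from the abstract.

```c
/* Confidence-driven pipeline scaling: shallow mode while risky branches
 * are unresolved, deep (full-speed) mode otherwise. */
#include <stdio.h>

#define DEEP_DEPTH    13   /* baseline depth from the abstract */
#define SHALLOW_DEPTH  7   /* assumed scaled depth */
#define LOWCONF_LIMIT  2   /* assumed threshold */

int choose_depth(int lowconf_branches_inflight) {
    return (lowconf_branches_inflight >= LOWCONF_LIMIT) ? SHALLOW_DEPTH
                                                        : DEEP_DEPTH;
}

int main(void) {
    for (int n = 0; n <= 3; n++) {
        int d = choose_depth(n);
        printf("%d low-confidence branch(es) in flight -> depth %d\n", n, d);
    }
    return 0;
}
```

The misprediction penalty tracks the depth of the pipeline in front of the mispredicted branch, which is why scaling down while confidence is low cuts the recovery cost.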

Fetch Gating Control Through Speculative Instruction Window Weighting

Hans Vandierendonck; André Seznec

In a dynamic reordering superscalar processor, the front-end fetches instructions and places them in the issue queue, and instructions are then issued by the back-end execution core. Until recently, the front-end was designed to maximize performance without considering energy consumption: it fetches instructions as fast as it can until it is stalled by a filled issue queue or some other blocking structure. This approach wastes energy: (i) speculative execution causes many wrong-path instructions to be fetched and executed, and (ii) the back-end execution rate is usually below its peak rate, yet front-end structures are dimensioned to sustain peak performance. Dynamically reducing the front-end instruction rate and the active size of front-end structures (e.g., the issue queue) is a necessary performance-energy trade-off. Techniques proposed in the literature attack only one of these effects.

In this paper, we propose Speculative Instruction Window Weighting (SIWW), a fetch gating technique that addresses both fetch gating and dynamic sizing of the instruction issue queue. A global weight is computed over the set of inflight instructions. This weight depends on the number and types of inflight instructions (non-branches, high-confidence or low-confidence branches, ...). The front-end instruction rate can be continuously adapted based on this weight. SIWW is shown to perform better than previously proposed fetch gating techniques, and also to enable dynamic adaptation of the active instruction queue size.

- III Adaptive Microarchitectures | Pp. 120-135
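A minimal sketch of SIWW-style gating under assumed numbers: each inflight instruction contributes a per-type weight (low-confidence branches weigh most), and fetch is gated while the total weight of the speculative window exceeds a threshold. The weights and threshold are illustrative, not the paper's values.

```c
/* Speculative-window weighting: gate fetch on total risk, not queue fill. */
#include <stdbool.h>
#include <stdio.h>

enum itype { NONBRANCH, HICONF_BRANCH, LOCONF_BRANCH };

static const int weight[] = { 1, 2, 8 };   /* assumed per-type contributions */
#define GATE_THRESHOLD 96                  /* assumed gating threshold */

bool gate_fetch(const enum itype *win, int n) {
    int total = 0;
    for (int i = 0; i < n; i++) total += weight[win[i]];
    return total > GATE_THRESHOLD;         /* stop fetching until weight drains */
}

int main(void) {
    enum itype win[] = { NONBRANCH, LOCONF_BRANCH, NONBRANCH, HICONF_BRANCH };
    printf("gate fetch: %s\n", gate_fetch(win, 4) ? "yes" : "no");
    return 0;
}
```

Because the weight rises steeply with low-confidence branches, the same mechanism throttles wrong-path fetch and bounds the useful issue-queue occupancy, the two effects the abstract says prior techniques attacked separately.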

Dynamic Capacity-Speed Tradeoffs in SMT Processor Caches

Sonia López; Steve Dropsho; David H. Albonesi; Oscar Garnica; Juan Lanchares

Caches are designed to provide the best tradeoff between access speed and capacity for a set of target applications. Unfortunately, different applications, and even different phases within the same application, may require a different capacity-speed tradeoff. This problem is exacerbated in a Simultaneous Multi-Threaded (SMT) processor where the optimal cache design may vary drastically with the number of running threads and their characteristics.

We propose to make this capacity-speed cache tradeoff dynamic within an SMT core. We extend a previously proposed globally asynchronous, locally synchronous (GALS) processor core with multi-threaded support, and implement dynamically resizable instruction and data caches. As the number of threads and their characteristics change, these adaptive caches automatically adjust from small sizes with fast access times to higher-capacity configurations. The former is most performance-effective when the core runs a single thread, or a dual-thread workload with modest cache requirements, while higher-capacity caches work best with most multi-threaded workloads. The use of a GALS microarchitecture permits the rest of the processor, namely the execution core, to run at full speed irrespective of the cache speeds. This approach yields an overall performance improvement of 24.7% over the best fixed-size caches for dual-thread workloads, and 19.2% for single-threaded applications.

- III Adaptive Microarchitectures | Pp. 136-150
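A rough sketch of the capacity-speed controller implied above: grow the cache when multiple threads are missing heavily, shrink back to the fast configuration otherwise; under GALS, only the cache clock domain changes speed while the execution core runs at full rate. The configurations and miss-rate thresholds are illustrative assumptions, not the paper's design points.

```c
/* Dynamic capacity-speed selection for an SMT cache. */
#include <stdio.h>

typedef struct { int kb; double cycle_ns; } cfg_t;

/* Assumed configurations: bigger means slower access in its own domain. */
static const cfg_t cfgs[] = { {16, 0.50}, {32, 0.65}, {64, 0.80} };

int pick_cfg(int threads, double miss_rate) {
    if (threads == 1 || miss_rate < 0.02) return 0;  /* small and fast */
    if (miss_rate < 0.08)                 return 1;
    return 2;                                        /* big and slower */
}

int main(void) {
    int c = pick_cfg(2, 0.05);
    printf("2 threads, 5%% misses -> %d KB cache, %.2f ns access\n",
           cfgs[c].kb, cfgs[c].cycle_ns);
    return 0;
}
```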