Publications catalog - books
High Performance Embedded Architectures and Compilers: Second International Conference, HiPEAC 2007, Ghent, Belgium, January 28-30, 2007. Proceedings
Koen De Bosschere ; David Kaeli ; Per Stenström ; David Whalley ; Theo Ungerer (eds.)
Conference: 2nd International Conference on High-Performance Embedded Architectures and Compilers (HiPEAC). Ghent, Belgium. January 28-30, 2007
Abstract/Description – provided by the publisher
Not available.
Keywords – provided by the publisher
Theory of Computation; Arithmetic and Logic Structures; Processor Architectures; Input/Output and Data Communications; Logic Design; Computer Communication Networks
Availability
Detected institution | Publication year | Browse | Download | Request |
---|---|---|---|---|
Not detected | 2007 | SpringerLink | | |
Information
Resource type:
books
Print ISBN
978-3-540-69337-6
Electronic ISBN
978-3-540-69338-3
Publisher
Springer Nature
Country of publication
United Kingdom
Publication date
2007
Publication rights information
© Springer-Verlag Berlin Heidelberg 2007
Subject coverage
Table of contents
Keynote: Insight, Not (Random) Numbers: An Embedded Perspective
Thomas M. Conte
Hamming said that the purpose of computing was “insight, not numbers.” Yet benchmarking embedded systems is today a numbers game. In this talk, I will dissect Hamming’s famous quote and provide some reasons to hope we can make benchmarking of embedded systems into a science. In particular, I will discuss how to model and measure quantities so that one can gain confidence in the results. I will use the industry standard EEMBC benchmark set as an example. Along the way, I will (I hope) give some insight into what the EEMBC benchmarks are trying to test.
- Invited Program | Pp. 3-3
Compiler-Assisted Memory Encryption for Embedded Processors
Vijay Nagarajan; Rajiv Gupta; Arvind Krishnaswamy
A critical component in the design of secure processors is memory encryption, which protects the privacy of code and data stored in off-chip memory. The decryption operation that must precede a load requiring an off-chip memory access lies on the critical path, and its overhead can significantly degrade performance. Recently, hardware counter-based one-time pad encryption techniques [11,13,9] have been proposed to reduce this overhead. For high-end processors the performance impact of decryption has been successfully limited thanks to the presence of fairly large on-chip L1 and L2 caches that reduce off-chip accesses, and to additional hardware support proposed in [13,9] to reduce decryption latency. However, for low- to medium-end embedded processors the performance degradation is high because, first, they support only small (if any) on-chip L1 caches, leading to significant off-chip accesses, and second, the hardware cost of the decryption latency reduction solutions in [13,9] is too high, making them unattractive for embedded processors. In this paper we present a compiler-assisted strategy that uses minimal hardware support to reduce the overhead of memory encryption in low- to medium-end embedded processors. Our experiments show that the proposed technique reduces the average execution time overhead of memory encryption for a low-end (medium-end) embedded processor with a 0 KB (32 KB) L1 cache from 60% (13.1%) with a single counter to 12.5% (2.1%) by additionally using only 8 hardware counter-registers.
- I Secure and Low-Power Embedded Memory Systems | Pp. 7-22
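As a rough illustration of the counter-mode one-time-pad scheme this abstract builds on, the sketch below simulates encrypting off-chip memory blocks with a pad derived from a per-block counter and the block address, so the decryption pad can be computed while the off-chip access is in flight. It is a minimal sketch, not the authors' design: HMAC-SHA256 stands in for the block cipher, and the compiler's mapping of data to a small pool of counter-registers is not modeled.

```python
import hmac, hashlib

KEY = b"per-process secret key"

def pad(counter: int, addr: int, size: int) -> bytes:
    """One-time pad derived from (counter, address).
    A real design would use a block cipher; HMAC-SHA256 is a stand-in here."""
    msg = counter.to_bytes(8, "little") + addr.to_bytes(8, "little")
    return hmac.new(KEY, msg, hashlib.sha256).digest()[:size]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

class EncryptedMemory:
    """Off-chip memory holds only ciphertext; the counter of the last write to
    each block is kept (conceptually) on chip so the pad can be regenerated."""
    def __init__(self):
        self.cipher = {}    # addr -> ciphertext block
        self.counter = {}   # addr -> counter used for the last write

    def store(self, addr: int, plaintext: bytes):
        ctr = self.counter.get(addr, 0) + 1   # fresh counter per write => fresh pad
        self.counter[addr] = ctr
        self.cipher[addr] = xor(plaintext, pad(ctr, addr, len(plaintext)))

    def load(self, addr: int, size: int) -> bytes:
        # The pad depends only on the known counter and address, so it can be
        # computed in parallel with the off-chip access, hiding decryption latency.
        return xor(self.cipher[addr], pad(self.counter[addr], addr, size))

mem = EncryptedMemory()
mem.store(0x1000, b"secret data bloc")
assert mem.load(0x1000, 16) == b"secret data bloc"
```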
Leveraging High Performance Data Cache Techniques to Save Power in Embedded Systems
Major Bhadauria; Sally A. McKee; Karan Singh; Gary S. Tyson
Voltage scaling reduces leakage power for cache lines unlikely to be referenced soon. Partitioning reduces dynamic power via smaller, specialized structures. We combine both approaches, adding a voltage-scaling design that provides finer control of power budgets. This delivers good performance and low power, consuming 34% of the power of previous designs.
- I Secure and Low-Power Embedded Memory Systems | Pp. 23-37
Applying Decay to Reduce Dynamic Power in Set-Associative Caches
Georgios Keramidas; Polychronis Xekalakis; Stefanos Kaxiras
In this paper, we propose a novel approach to reduce dynamic power in set-associative caches that leverages a leakage-saving proposal, namely Cache Decay. We thus open the possibility to unify dynamic and leakage management in the same framework. The main intuition is that in a decaying cache, dead lines in a set need not be searched. Thus, rather than trying to predict which cache way holds a specific line, we predict, for each way, whether the line could be live in it. We access all the ways that could possibly contain the live line; we call this way-selection. In contrast to way-prediction, way-selection cannot be wrong: the line is either in the selected ways or not in the cache. The important implication is that we have a fixed hit time, indispensable for both performance and ease-of-implementation reasons. In order to achieve high accuracy, in terms of total ways accessed, we use Decaying Bloom filters to track only the live lines in the ways; dead lines are automatically purged. We offer efficient implementations of such autonomously Decaying Bloom filters, using novel quasi-static cells. Our prediction approach grants us high accuracy in narrowing the choice of ways for hits, as well as the ability to predict misses, a known weakness of way-prediction.
- I Secure and Low-Power Embedded Memory Systems | Pp. 38-53
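A minimal sketch of the way-selection idea described above: each way keeps a small decaying Bloom-style filter of possibly-live lines, and a lookup probes only the ways whose filter reports a possible match. The filter size, lifetime, and single hash function are illustrative assumptions; the paper's quasi-static cell implementation is not modeled.

```python
class DecayingBloomFilter:
    """Tiny Bloom-style membership filter for one cache way.
    Slot counters decay toward zero, so entries for dead (decayed) lines vanish."""
    def __init__(self, bits=64, ttl=4):
        self.ttl = ttl
        self.slots = [0] * bits          # each slot holds a remaining lifetime

    def _idx(self, tag):
        return hash(tag) % len(self.slots)

    def insert(self, tag):
        self.slots[self._idx(tag)] = self.ttl

    def maybe_live(self, tag):
        return self.slots[self._idx(tag)] > 0

    def tick(self):
        self.slots = [max(0, s - 1) for s in self.slots]

def lookup(ways, filters, set_index, tag):
    """Way-selection: probe only the ways whose filter says the line may be live.
    Unlike way-prediction this cannot miss a resident line, so hit time is fixed."""
    selected = [w for w, f in zip(ways, filters) if f.maybe_live(tag)]
    for way in selected:
        if way.get(set_index) == tag:
            return True, len(selected)   # hit; report how many ways were probed
    return False, len(selected)          # no selected way holds the line: a miss

ways = [{0: "A"}, {0: None}]             # one set, two ways
filters = [DecayingBloomFilter(), DecayingBloomFilter()]
filters[0].insert("A")
print(lookup(ways, filters, 0, "A"))     # (True, 1): only one way was probed
```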
Virtual Registers: Reducing Register Pressure Without Enlarging the Register File
Jun Yan; Wei Zhang
This paper proposes a novel scheme to mitigate register pressure for statically scheduled high-performance embedded processors without physically enlarging the register file. Our scheme exploits the fact that a large fraction of variables are short-lived and do not need to be written to or read from real registers. Instead, the compiler can allocate these short-lived variables to virtual registers, which are simply placeholders (instead of physical storage locations in the register file) used to identify dependences among instructions. Our experimental results demonstrate that virtual registers are very effective at reducing the number of register spills; in many cases they achieve performance close to that of a processor with twice the number of real registers. Also, our results indicate that for some multimedia and communication applications, using a large number of virtual registers with a small number of real registers can even achieve higher performance than a mid-sized register file without any virtual registers.
- II Architecture/Compiler Optimizations for Efficient Embedded Processing | Pp. 57-70
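The allocation decision described above can be sketched as follows: live ranges below a (hypothetical) shortness threshold are mapped to virtual-register placeholders that name a dependence but occupy no register-file entry, and only the remaining values compete for real registers. This is an illustrative greedy sketch, not the paper's allocator; the threshold and the lack of register reuse across disjoint ranges are simplifications.

```python
# A live range for one variable: (def_cycle, last_use_cycle).
SHORT_LIVED_THRESHOLD = 3   # illustrative cycle count, not from the paper

def allocate(live_ranges, num_real_regs, num_virtual_regs):
    real, virtual, spilled = {}, {}, []
    free_real = list(range(num_real_regs))
    free_virtual = list(range(num_virtual_regs))
    # Walk variables in definition order; no reuse of freed registers,
    # to keep the sketch short.
    for var, (start, end) in sorted(live_ranges.items(), key=lambda kv: kv[1][0]):
        if end - start <= SHORT_LIVED_THRESHOLD and free_virtual:
            virtual[var] = free_virtual.pop(0)   # placeholder: no RF storage
        elif free_real:
            real[var] = free_real.pop(0)         # ordinary physical register
        else:
            spilled.append(var)                  # would go to memory
    return real, virtual, spilled

ranges = {"t0": (0, 1), "t1": (0, 9), "t2": (2, 3), "t3": (1, 12)}
print(allocate(ranges, num_real_regs=1, num_virtual_regs=8))
# Short-lived t0 and t2 land in virtual registers; only t3 spills.
```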
Bounds Checking with Taint-Based Analysis
Weihaw Chuang; Satish Narayanasamy; Brad Calder; Ranjit Jhala
We analyze the performance of different bounds checking implementations. Specifically, we examine using the x86 bound instruction to reduce the run-time overhead. We also propose a compiler optimization that prunes the bounds checks that are not necessary to guarantee security. The optimization is based on the observation that buffer overflow attacks are launched through external inputs. Therefore, it is sufficient to bounds check only the accesses to those data structures that can possibly hold the external inputs. Also, it is sufficient to bounds check only the memory writes. The proposed optimizations reduce the number of required bounds checks as well as the amount of metadata that needs to be maintained to perform those checks.
- II Architecture/Compiler Optimizations for Efficient Embedded Processing | Pp. 71-86
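A small sketch of the pruning idea in this abstract: propagate taint from external input sources to the buffers that may hold their data, then keep bounds checks only for memory writes into tainted buffers. The toy IR representation and the taint roots (stdin, socket, argv) are illustrative assumptions, not the paper's analysis.

```python
# Each IR "store" names a destination buffer and the source of the data.
EXTERNAL_SOURCES = {"stdin", "socket", "argv"}   # illustrative taint roots

def propagate_taint(assignments):
    """assignments: list of (dst_buffer, src) pairs, in program order."""
    tainted = set()
    changed = True
    while changed:                      # iterate to a fixed point
        changed = False
        for dst, src in assignments:
            if (src in EXTERNAL_SOURCES or src in tainted) and dst not in tainted:
                tainted.add(dst)
                changed = True
    return tainted

def prune_bounds_checks(stores, tainted):
    """Keep a check only for writes whose destination may hold external data."""
    return [(dst, src) for dst, src in stores if dst in tainted]

assigns = [("buf", "stdin"), ("copy", "buf"), ("local", "constant")]
tainted = propagate_taint(assigns)
print(tainted)                                   # {'buf', 'copy'}
print(prune_bounds_checks(assigns, tainted))     # checks only on tainted writes
```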
Reducing Exit Stub Memory Consumption in Code Caches
Apala Guha; Kim Hazelwood; Mary Lou Soffa
The interest in translation-based virtual execution environments (VEEs) is growing with the recognition of their importance in a variety of applications. However, due to constrained memory and energy resources, developing a VEE for an embedded system presents a number of challenges. In this paper we focus on the VEE’s memory overhead, and in particular on the code cache. Both code traces and exit stubs are stored in a code cache. Exit stubs keep track of the branches off a trace, and we show they consume up to 66.7% of the code cache. We present four techniques for reducing the space occupied by exit stubs: two assume unbounded code caches and the absence of code cache invalidations, and two work without these restrictions. These techniques reduce space by 43.5% and also improve performance by 1.5%. After applying our techniques, the percentage of space consumed by exit stubs in the resulting code cache was reduced to 41.4%.
- II Architecture/Compiler Optimizations for Efficient Embedded Processing | Pp. 87-101
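To make the space accounting concrete, the toy model below keeps one exit stub per untranslated side exit of a trace and reclaims the stub once that exit is linked to a translated target, which is the kind of saving the techniques above aim for. The stub and instruction sizes are arbitrary illustrative units, not measurements from the paper.

```python
class CodeCache:
    def __init__(self):
        self.traces = {}   # trace id -> list of translated instructions
        self.stubs = {}    # (trace id, exit index) -> original target address
        self.links = {}    # (trace id, exit index) -> trace id of linked target

    def add_trace(self, tid, instructions, exit_targets):
        self.traces[tid] = instructions
        for i, target in enumerate(exit_targets):
            self.stubs[(tid, i)] = target        # one stub per side exit

    def link_exit(self, tid, exit_index, target_tid):
        # Once the target trace exists, the exit branches to it directly and
        # the stub can be reclaimed, which is where the space saving comes from.
        self.links[(tid, exit_index)] = target_tid
        del self.stubs[(tid, exit_index)]

    def stub_fraction(self, stub_size=4, instr_size=1):
        stub_units = len(self.stubs) * stub_size
        trace_units = sum(len(t) for t in self.traces.values()) * instr_size
        return stub_units / (stub_units + trace_units)

cc = CodeCache()
cc.add_trace("T1", ["i1", "i2", "i3"], exit_targets=[0x400100, 0x400200])
print(f"{cc.stub_fraction():.0%} of this tiny cache holds exit stubs")   # 73%
cc.add_trace("T2", ["i4", "i5"], exit_targets=[])
cc.link_exit("T1", 0, "T2")
print(f"{cc.stub_fraction():.0%} after linking one exit")                # 44%
```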
Reducing Branch Misprediction Penalties Via Adaptive Pipeline Scaling
Chang-Ching Yeh; Kuei-Chung Chang; Tien-Fu Chen; Chingwei Yeh
Pipeline scaling provides an attractive solution for the increasingly serious branch misprediction penalties of deep-pipeline processors. In this paper we investigate techniques for reducing branch misprediction penalties. We present a dual supply-voltage architecture framework that can be efficiently exploited in a deep-pipeline processor to reduce the pipeline depth depending on the confidence level of the branches in the pipeline. We also propose two techniques that increase the efficiency of pipeline scaling. With these techniques, we then show that adaptive pipeline scaling not only provides fast branch misprediction recovery, but also speeds up the resolution of mispredicted branches. The evaluation in a 13-stage superscalar processor with benchmarks from SPEC2000 applications shows a performance improvement (between 3%-12%, average 8%) over a baseline processor that does not exploit pipeline scaling.
- III Adaptive Microarchitectures | Pp. 105-119
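The core control decision can be sketched as follows: count the low-confidence branches currently in flight and, when there are enough of them, run the pipeline in the shallow (second supply voltage) configuration so a likely misprediction is resolved with a shorter penalty. The threshold and penalty figures below are illustrative, not the paper's.

```python
# Decide, cycle by cycle, whether the pipeline should run "deep" (high clock,
# long misprediction penalty) or "shallow" (stages fused at a second, lower
# supply voltage, shorter penalty).

LOW_CONFIDENCE_THRESHOLD = 1             # illustrative
PENALTY = {"deep": 13, "shallow": 7}     # misprediction penalty in cycles (illustrative)

def choose_mode(inflight_branches):
    """inflight_branches: list of confidence estimates, True = high confidence."""
    low_conf = sum(1 for high in inflight_branches if not high)
    return "shallow" if low_conf >= LOW_CONFIDENCE_THRESHOLD else "deep"

def misprediction_penalty(inflight_branches):
    return PENALTY[choose_mode(inflight_branches)]

print(choose_mode([True, True]))             # deep: all in-flight branches look safe
print(misprediction_penalty([True, False]))  # shallow mode: 7-cycle recovery
```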
Fetch Gating Control Through Speculative Instruction Window Weighting
Hans Vandierendonck; André Seznec
In a dynamic reordering superscalar processor, the front-end fetches instructions and places them in the issue queue. Instructions are then issued by the back-end execution core. Until recently, the front-end was designed to maximize performance without considering energy consumption: it fetches instructions as fast as it can until it is stalled by a filled issue queue or some other blocking structure. This approach wastes energy: (i) speculative execution causes many wrong-path instructions to be fetched and executed, and (ii) the back-end execution rate is usually less than its peak rate, yet front-end structures are dimensioned to sustain peak performance. Dynamically reducing the front-end instruction rate and the active size of front-end structures (e.g., the issue queue) is a required performance-energy trade-off. Techniques proposed in the literature attack only one of these effects.
In this paper, we propose Speculative Instruction Window Weighting (SIWW), a fetch gating technique that addresses both fetch gating and dynamic sizing of the instruction issue queue. A global weight is computed over the set of in-flight instructions. This weight depends on the number and types of in-flight instructions (non-branches, high-confidence or low-confidence branches, ...). The front-end instruction rate can be continuously adapted based on this weight. SIWW is shown to perform better than previously proposed fetch gating techniques. SIWW is also shown to allow the size of the active instruction queue to be adapted dynamically.
- III Adaptive Microarchitectures | Pp. 120-135
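A minimal sketch of the SIWW weighting decision: each in-flight instruction contributes a weight according to its kind, and fetch is gated once the summed weight of the speculative instruction window crosses a threshold. The per-kind weights and the threshold are illustrative placeholders, not the paper's tuned values.

```python
# Assign each in-flight instruction a weight reflecting how likely the work
# fetched beyond it is useful; gate fetch when the summed weight is too high.

WEIGHTS = {                # illustrative values
    "non_branch": 1,
    "high_conf_branch": 2,
    "low_conf_branch": 8,  # fetching past a shaky branch is likely wasted work
}
GATE_THRESHOLD = 48        # illustrative

def window_weight(inflight):
    """inflight: list of instruction kinds currently between fetch and commit."""
    return sum(WEIGHTS[kind] for kind in inflight)

def fetch_allowed(inflight):
    return window_weight(inflight) < GATE_THRESHOLD

window = ["non_branch"] * 20 + ["high_conf_branch"] * 4 + ["low_conf_branch"] * 3
print(window_weight(window), fetch_allowed(window))   # 52 False -> gate fetch
```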
Dynamic Capacity-Speed Tradeoffs in SMT Processor Caches
Sonia López; Steve Dropsho; David H. Albonesi; Oscar Garnica; Juan Lanchares
Caches are designed to provide the best tradeoff between access speed and capacity for a set of target applications. Unfortunately, different applications, and even different phases within the same application, may require a different capacity-speed tradeoff. This problem is exacerbated in a Simultaneous Multi-Threaded (SMT) processor where the optimal cache design may vary drastically with the number of running threads and their characteristics.
We propose to make this capacity-speed cache tradeoff dynamic within an SMT core. We extend a previously proposed globally asynchronous, locally synchronous (GALS) processor core with multi-threaded support, and implement dynamically resizable instruction and data caches. As the number of threads and their characteristics change, these adaptive caches automatically adjust from small sizes with fast access times to higher-capacity configurations. While the former is more performance-optimal when the core runs a single thread, or a dual-thread workload with modest cache requirements, higher-capacity caches work best for most multi-threaded workloads. The use of a GALS microarchitecture permits the rest of the processor, namely the execution core, to run at full speed irrespective of the cache speeds. This approach yields an overall performance improvement of 24.7% over the best fixed-size caches for dual-thread workloads, and 19.2% for single-threaded applications.
- III Adaptive Microarchitectures | Pp. 136-150
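The adaptation loop described above can be caricatured as a controller that, at the end of each interval, picks between a small fast cache configuration and a larger slower one based on the number of active threads and the observed miss rate. The configurations and thresholds below are illustrative assumptions; the GALS mechanism that lets the execution core clock stay fixed is not modeled.

```python
# Pick a cache configuration for the next interval from simple counters:
# few threads with good hit rates favor the small, fast configuration;
# more threads or a high miss rate favor the larger, slower one.

CONFIGS = {                                     # illustrative sizes and latencies
    "small_fast": {"size_kb": 16, "latency_cycles": 2},
    "large_slow": {"size_kb": 64, "latency_cycles": 4},
}
MISS_RATE_THRESHOLD = 0.05                      # illustrative

def choose_config(active_threads: int, miss_rate: float) -> str:
    if active_threads <= 2 and miss_rate < MISS_RATE_THRESHOLD:
        return "small_fast"
    return "large_slow"

print(choose_config(1, 0.01))   # small_fast: single thread, modest working set
print(choose_config(4, 0.12))   # large_slow: many threads thrash the small cache
```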