Publications catalog - books
High Performance Embedded Architectures and Compilers: Second International Conference, HiPEAC 2007, Ghent, Belgium, January 28-30, 2007. Proceedings
Koen De Bosschere; David Kaeli; Per Stenström; David Whalley; Theo Ungerer (eds.)
Conference: 2nd International Conference on High-Performance Embedded Architectures and Compilers (HiPEAC). Ghent, Belgium. January 28-30, 2007
Abstract/Description - provided by the publisher
Not available.
Keywords - provided by the publisher
Theory of Computation; Arithmetic and Logic Structures; Processor Architectures; Input/Output and Data Communications; Logic Design; Computer Communication Networks
Availability
Detected institution | Publication year | Browse | Download | Request |
---|---|---|---|---|
Not detected | 2007 | SpringerLink | | |
Information
Resource type:
books
Print ISBN
978-3-540-69337-6
Electronic ISBN
978-3-540-69338-3
Publisher
Springer Nature
Country of publication
United Kingdom
Publication date
2007
Publication rights information
© Springer-Verlag Berlin Heidelberg 2007
Subject coverage
Table of contents
Branch History Matching: Branch Predictor Warmup for Sampled Simulation
Simon Kluyskens; Lieven Eeckhout
Computer architects and designers rely heavily on simulation. The downside of simulation is that it is very time-consuming: simulating an industry-standard benchmark on today's fastest machines and simulators takes several weeks. A practical solution to the simulation problem is sampling. Sampled simulation selects a number of sampling units out of a complete program execution and simulates only those sampling units in detail. An important problem with sampling, however, is the unknown microarchitecture state at the beginning of each sampling unit. Large hardware structures such as caches and branch predictors suffer most from unknown hardware state. Although a great body of work exists on cache state warmup, very little work has been done on branch predictor warmup. This paper proposes Branch History Matching (BHM) for accurate branch predictor warmup during sampled simulation. The idea is to build, for each sampling unit, a distribution of how far back one needs to go in the pre-sampling-unit region to find the same static branch with a similar global and local history as a branch instance appearing in the sampling unit. Those distributions are then used to determine where to start the warmup phase for each sampling unit under a given total warmup length budget. Using the SPEC CPU2000 integer benchmarks, we show that BHM is substantially more efficient than fixed-length warmup in terms of warmup length for the same accuracy; conversely, BHM is substantially more accurate than fixed-length warmup for the same warmup budget.
III - Adaptive Microarchitectures | Pp. 153-167
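A minimal sketch of the warmup-distance matching step from the abstract above, assuming a simple trace of (pc, global history) pairs; the trace format, history masking, and per-unit budget policy are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical trace format: one (pc, global_history) pair per dynamic branch.

def bhm_distribution(trace, unit_start, unit_end, history_bits=8):
    """For each branch in the sampling unit [unit_start, unit_end), find how
    far back into the pre-sampling-unit region one must go to see the same
    static branch (same pc) with a matching (masked) global history, and
    return the distribution of the required warmup distances."""
    mask = (1 << history_bits) - 1
    last_seen = {}  # (pc, masked history) -> most recent pre-unit position
    for pos in range(unit_start):
        pc, hist = trace[pos]
        last_seen[(pc, hist & mask)] = pos
    distances = []
    for pos in range(unit_start, unit_end):
        pc, hist = trace[pos]
        match = last_seen.get((pc, hist & mask))
        if match is not None:
            distances.append(unit_start - match)
    return sorted(distances)

def warmup_start(distances, unit_start, budget):
    """Start warmup just far enough back to cover every match that fits the
    per-unit warmup budget (a simplistic budget policy for illustration)."""
    affordable = [d for d in distances if d <= budget]
    needed = max(affordable) if affordable else budget
    return max(0, unit_start - needed)
```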
Full-System, Embedded Microarchitecture Evaluation
Phillip Stanley-Marbell; Diana Marculescu
This paper describes a full-system microarchitectural evaluation environment for embedded computing systems. The environment enables detailed microarchitectural simulation of multiple instances of complete embedded systems, their peripherals, and medium access control / physical layer communication between systems. The environment models the microarchitecture; computation and communication upset events under a variety of stochastic distributions; compute and communication power consumption; electrochemical battery systems and power regulation circuitry; and analog signals external to the processing elements.
The simulation environment provides facilities for speeding up simulation, which trade off accuracy of simulated properties for simulation speed. Through the detailed simulation of benchmarks in which the effect of simulation speedup on correctness can be accurately quantified, it is demonstrated that traditional techniques proposed for simulation speedup can introduce significant error when simulating a combination of computation and analog physical phenomena external to a processor.
III - Adaptive Microarchitectures | Pp. 168-182
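The stochastically distributed upset events mentioned above invite a small illustration: drawing upset times with exponentially distributed inter-arrival gaps and injecting them into a cycle-driven loop. The distribution choice, rate parameter, and function names are assumptions for illustration only:

```python
import random

def upset_schedule(sim_cycles, rate_per_cycle, seed=42):
    """Draw upset-event cycles with exponentially distributed inter-arrivals."""
    rng = random.Random(seed)
    events, t = [], 0.0
    while True:
        t += rng.expovariate(rate_per_cycle)
        if t >= sim_cycles:
            return events
        events.append(int(t))

def run(sim_cycles, rate_per_cycle=1e-6):
    upsets = set(upset_schedule(sim_cycles, rate_per_cycle))
    for cycle in range(sim_cycles):
        # ... advance processor, peripheral, and network models one cycle ...
        if cycle in upsets:
            pass  # inject the upset here, e.g. flip a state bit or corrupt a packet
```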
Efficient Program Power Behavior Characterization
Chunling Hu; Daniel A. Jiménez; Ulrich Kremer
Fine-grained program power behavior is useful in both evaluating power optimizations and observing power optimization opportunities. Detailed power simulation is time consuming and often inaccurate. Physical power measurement is faster and objective. However, fine-grained measurement generates enormous amounts of data in which locating important features is difficult, while coarse-grained measurement sacrifices important detail.
We present a program power behavior characterization infrastructure that identifies program phases, selects a representative interval of execution for each phase, and instruments the program to enable precise power measurement of these intervals to obtain their time-dependent power behavior. We show that the representative intervals accurately model the fine-grained time-dependent behavior of the program. They also accurately estimate the total energy of a program. Our compiler infrastructure allows for easy mapping between a measurement result and its corresponding source code. We improve the accuracy of our technique over previous work by using edge vectors, i.e., counts of traversals of control-flow edges, instead of basic block vectors, and by incorporating event counters into our phase classification.
We validate our infrastructure through the physical power measurement of 10 SPEC CPU 2000 integer benchmarks on an Intel Pentium 4 system. We show that using edge vectors reduces the error of estimating total program energy by 35% over using basic block vectors, and using edge vectors plus event counters reduces the error of estimating the fine-grained time-dependent power profile by 22% over using basic block vectors.
III - Adaptive Microarchitectures | Pp. 183-197
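A hedged sketch of the phase-classification flow from the abstract above: normalize per-interval edge vectors, cluster them, and take the interval closest to each cluster centroid as the representative. The use of k-means and the parameter values are assumptions, not the paper's exact setup:

```python
import numpy as np
from sklearn.cluster import KMeans

def classify_phases(edge_vectors, n_phases=8):
    """edge_vectors: (n_intervals, n_edges) array of control-flow-edge
    traversal counts per fixed-length execution interval."""
    # Normalize so intervals with different activity levels are comparable.
    X = edge_vectors / np.maximum(edge_vectors.sum(axis=1, keepdims=True), 1)
    km = KMeans(n_clusters=n_phases, n_init=10).fit(X)
    representatives = []
    for c in range(n_phases):
        members = np.flatnonzero(km.labels_ == c)
        if members.size == 0:
            continue
        # Representative interval: the member closest to the cluster centroid.
        dist = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        representatives.append(int(members[np.argmin(dist)]))
    return km.labels_, representatives
```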
Performance/Energy Optimization of DSP Transforms on the XScale Processor
Paolo D’Alberto; Markus Püschel; Franz Franchetti
The XScale processor family provides user-controllable, independent configuration of CPU, bus, and memory frequencies. This feature introduces another handle for optimizing code with respect to energy consumption or runtime performance. We quantify the effect of frequency configurations on both performance and energy for three signal processing transforms: the discrete Fourier transform (DFT), finite impulse response (FIR) filters, and the Walsh-Hadamard transform (WHT).
To do this, we use SPIRAL, a program generation and optimization system for signal processing transforms. For a given transform to be implemented, SPIRAL searches over different algorithms to find the best match to the given platform with respect to the chosen performance metric (usually runtime). In this paper we use SPIRAL to generate implementations for different frequency configurations and optimize for runtime and physically measured energy consumption. In doing so we show, first, that each transform achieves its best performance/energy consumption under a different system configuration; second, that the best code depends on the chosen configuration, problem size, and algorithm; and third, that the fastest implementation is not always the most energy efficient. Fourth, we introduce dynamic (i.e., during execution) reconfiguration to further improve performance/energy. Finally, we benchmark SPIRAL-generated code against Intel's vendor library routines. We show competitive results, as well as 20% performance improvements or energy reductions for selected transforms and problem sizes.
V - Generation of Efficient Embedded Applications | Pp. 201-214
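The search over frequency configurations described above can be pictured as a simple exhaustive loop. The frequency sets below and the measure() callback (e.g. timing a SPIRAL-generated transform and reading an energy meter) are illustrative assumptions, not XScale specifics:

```python
import itertools

CPU_MHZ = [200, 300, 400, 500, 600]   # assumed configurable frequency steps
BUS_MHZ = [100, 150, 200]
MEM_MHZ = [100, 133]

def best_configuration(measure, metric="runtime"):
    """measure(cpu, bus, mem) -> {'runtime': s, 'energy': J} for one run of
    the transform under that frequency configuration."""
    best, best_cost = None, float("inf")
    for cpu, bus, mem in itertools.product(CPU_MHZ, BUS_MHZ, MEM_MHZ):
        cost = measure(cpu, bus, mem)[metric]
        if cost < best_cost:
            best, best_cost = (cpu, bus, mem), cost
    return best, best_cost
```

The same loop also makes the paper's third finding easy to check empirically: run it once with metric="runtime" and once with metric="energy" and compare the winning configurations.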
Arx: A Toolset for the Efficient Simulation and Direct Synthesis of High-Performance Signal Processing Algorithms
Klaas L. Hofstra; Sabih H. Gerez
This paper addresses the efficient implementation of high-performance signal-processing algorithms. In the early stages of such designs, many computation-intensive simulations may be necessary. This calls for hardware description formalisms targeted at efficient simulation (such as the programming language C). In current practice, other formalisms (such as VHDL) will often be used to map the design onto hardware by means of logic synthesis. A manual, error-prone translation of the description is then necessary.
The line of thought of this paper is that the gap between simulation and synthesis should not be bridged by stretching the use of existing formalisms (e.g. defining a synthesizable subset of C), but by a language dedicated to an application domain. This resulted in Arx, which is meant for signal-processing hardware at the register-transfer level, using either floating-point or fixed-point data. Code generators with knowledge of the application domain then generate efficient simulation models and synthesizable VHDL.
Several designers have already completed complex signal-processing designs using Arx in a short time, proving in practice that Arx is easy to learn. Benchmarks presented in this paper show that the generated simulation code is significantly faster than SystemC.
V - Generation of Efficient Embedded Applications | Pp. 215-226
A Throughput-Driven Task Creation and Mapping for Network Processors
Lixia Liu; Xiao-Feng Li; Michael Chen; Roy D. C. Ju
Network processors are programmable devices that can process packets at high speed. A network processor is typified by multi-threading and heterogeneous multiprocessing, which usually requires programmers to manually create multiple tasks and map these tasks onto different processing elements. This paper addresses the problem of automating task creation and mapping of network applications onto the underlying hardware to maximize their throughput. We propose a throughput cost model to guide task creation and mapping, with the objective of simultaneously minimizing the number of stages in the processing pipeline and maximizing the average throughput of the slowest task. The average throughput is modeled by taking communication cost, computation cost, memory access latency, and synchronization cost into account. We envision that programmers write small functions for network applications, and we use grouping and duplication to construct tasks from these functions. Finding the optimal creation of tasks from functions and mapping of tasks to processors is NP-hard. Therefore, we present a practical and efficient heuristic algorithm with low polynomial complexity and show that the obtained solutions produce excellent performance for typical network applications. The entire framework has been implemented in the Open Research Compiler (ORC), adapted to compile network applications written in a domain-specific dataflow language. Experimental results show that the code produced by our compiler can achieve 100% throughput at the OC-48 input line rate; OC-48 is a fiber-optic connection handling 2.488 Gbps, which is what our targeted hardware was designed for. We also demonstrate the importance of good task creation and mapping choices for achieving high throughput. Furthermore, we show that reducing communication cost and efficient resource management are the most important factors for maximizing throughput on Intel IXP network processors.
V - Generation of Efficient Embedded Applications | Pp. 227-241
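As a toy rendering of the throughput cost model sketched above: a pipeline's throughput is limited by its slowest task, and duplicating a task across processing elements scales that stage's throughput. The cost fields and numbers are illustrative assumptions, not the paper's model:

```python
def service_time(task):
    """Per-packet service time of one task, in cycles: computation plus
    memory access latency plus communication plus synchronization cost."""
    return task["compute"] + task["mem"] + task["comm"] + task["sync"]

def pipeline_throughput(tasks, duplication):
    """Throughput (packets/cycle) of the slowest stage; duplicating a task
    onto k processing elements scales that stage's throughput by k."""
    return min(k / service_time(t) for t, k in zip(tasks, duplication))

# Merging two functions into one task removes the communication between them
# but lengthens that stage; the model lets both options be compared directly.
stages = [{"compute": 40, "mem": 20, "comm": 10, "sync": 5},
          {"compute": 90, "mem": 30, "comm": 10, "sync": 5}]
print(pipeline_throughput(stages, duplication=[1, 2]))
```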
MiDataSets: Creating the Conditions for a More Realistic Evaluation of Iterative Optimization
Grigori Fursin; John Cavazos; Michael O’Boyle; Olivier Temam
Iterative optimization has become a popular technique to obtain improvements over the default settings in a compiler for performance-critical applications, such as embedded applications. An implicit assumption, however, is that the best configuration found for any arbitrary data set will work well with other data sets that a program uses.
In this article, we evaluate that assumption based on 20 data sets per benchmark of the MiBench suite. We find that, though a majority of programs exhibit stable performance across data sets, the variability can significantly increase with many optimizations. However, for the best optimization configurations, we find that this variability is in fact small. Furthermore, we show that it is possible to find a compromise optimization configuration across data sets which is often within 5% of the best possible configuration for most data sets, and that the iterative process can converge in less than 20 iterations (for a population of 200 optimization configurations). All these conclusions have significant and positive implications for the practical utilization of iterative optimization.
VI - Optimizations and Architectural Tradeoffs for Embedded Systems | Pp. 245-260
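A minimal sketch of finding the compromise configuration discussed above: pick the optimization configuration whose worst slowdown, relative to each data set's own best configuration, is smallest. The measured-runtimes layout is an assumption for illustration:

```python
def compromise_configuration(runtimes):
    """runtimes: dict mapping configuration -> list of runtimes, one entry
    per data set. Returns the configuration minimizing the worst slowdown
    relative to the per-data-set best."""
    n_sets = len(next(iter(runtimes.values())))
    best_per_set = [min(r[d] for r in runtimes.values()) for d in range(n_sets)]

    def worst_slowdown(cfg):
        return max(runtimes[cfg][d] / best_per_set[d] for d in range(n_sets))

    return min(runtimes, key=worst_slowdown)
```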
Evaluation of Offset Assignment Heuristics
Johnny Huynh; José Nelson Amaral; Paul Berube; Sid-Ahmed-Ali Touati
In digital signal processors (DSPs), variables are accessed using address registers. The problem of finding a memory layout for a set of variables that minimizes the address-computation overhead is known as the General Offset Assignment (GOA) problem. The most common approach to this problem is to partition the set of variables and to assign each partition to an address register, effectively decomposing the GOA problem into several Simple Offset Assignment (SOA) problems. Many heuristic-based algorithms have been proposed in the literature to approximate solutions to the partitioning and SOA problems. However, the address-computation overheads of the resulting memory layouts are not accurately evaluated. In this paper we use Gebotys' optimal address-code generation technique to evaluate memory layouts. Using this evaluation method introduces a new problem, which we call the Memory Layout Permutation (MLP) problem. We then use Gebotys' technique and an exhaustive solution to the MLP problem to evaluate heuristic-based offset-assignment algorithms. The memory layouts produced by each algorithm are compared against each other and against the optimal layouts. Our results show that even for small access sequences with 12 variables or fewer, current heuristics may produce memory layouts with address-computation overheads up to two times higher than that of an optimal layout.
VI - Optimizations and Architectural Tradeoffs for Embedded Systems | Pp. 261-275
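The evaluation idea above can be illustrated for a single address register with unit auto-increment/decrement addressing: any access whose offset is more than one word from the previous access requires an explicit address computation. This brute-force sketch (a simplification of Gebotys' optimal address-code generation, assumed here as the cost model) enumerates layout permutations exactly as an exhaustive MLP-style search would:

```python
from itertools import permutations

def overhead(layout, access_sequence):
    """Count accesses outside auto-increment/decrement reach, i.e. those
    requiring an explicit address-register update."""
    pos = {v: i for i, v in enumerate(layout)}
    return sum(1 for prev, cur in zip(access_sequence, access_sequence[1:])
               if abs(pos[cur] - pos[prev]) > 1)

def optimal_layout(variables, access_sequence):
    """Exhaustively search all memory layouts (feasible only for small sets)."""
    return min(permutations(variables), key=lambda L: overhead(L, access_sequence))

print(optimal_layout("abcd", "abadcdca"))  # a layout with minimal overhead
```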
Customizing the Datapath and ISA of Soft VLIW Processors
Mazen A. R. Saghir; Mohamad El-Majzoub; Patrick Akl
In this paper, we examine the trade-offs in performance and area due to customizing the datapath and instruction set architecture of a soft VLIW processor implemented in a high-density FPGA. In addition to describing our processor, we describe a number of microarchitectural optimizations we used to reduce the area of the datapath. We also describe the tools we developed to customize, generate, and program our processor. Our experimental results show that datapath and instruction set customization achieve high levels of performance, and that using on-chip resources and implementing microarchitectural optimizations like selective data forwarding help keep FPGA resource utilization in check.
VI - Optimizations and Architectural Tradeoffs for Embedded Systems | Pp. 276-290
Instruction Set Extension Generation with Considering Physical Constraints
I-Wei Wu; Shih-Chia Huang; Chung-Ping Chung; Jyh-Jiun Shann
In this paper, we propose new algorithms for both ISE (instruction set extension) exploration and selection that take into account important physical constraints such as pipestage timing, instruction set architecture (ISA) format, silicon area, and the register file. The proposed ISE exploration algorithm explores not only ISE candidates but also their implementation options, minimizing execution time while using less silicon area. In ISE selection, much prior research takes only silicon area into account, which is not comprehensive. In this paper, we formulate ISE selection as a multiconstrained 0-1 knapsack problem so that it can consider multiple constraints. Results with MiBench indicate that, for the same number of ISEs, our approach achieves further reductions in silicon area of 69.43% (max.), 1.26% (min.), and 33.8% (avg.), and also achieves up to a 1.6% performance improvement compared with the previous approach.
VI - Optimizations and Architectural Tradeoffs for Embedded Systems | Pp. 291-305
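The multiconstrained 0-1 knapsack formulation mentioned above, as a hedged brute-force sketch (exponential, so only for tiny candidate sets): maximize total cycle savings subject to silicon-area and ISA-encoding budgets. The candidate fields and budget names are illustrative assumptions:

```python
from itertools import combinations

def select_ises(candidates, area_budget, opcode_budget):
    """candidates: dicts with 'saving' (cycles saved), 'area' (silicon area),
    and 'opcodes' (ISA encoding slots used). Returns the best feasible set."""
    best, best_saving = (), 0
    for r in range(len(candidates) + 1):
        for subset in combinations(candidates, r):
            if (sum(c["area"] for c in subset) <= area_budget and
                    sum(c["opcodes"] for c in subset) <= opcode_budget):
                saving = sum(c["saving"] for c in subset)
                if saving > best_saving:
                    best, best_saving = subset, saving
    return best, best_saving
```

In practice the paper's point is that selection must respect all such constraints at once; a production implementation would replace this enumeration with a proper multiconstrained knapsack solver.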