Publications catalog - books



Open Access Title

Supercomputing Frontiers

Part of: Theoretical Computer Science and General Issues

In conference: 4th Asian Conference on Supercomputing Frontiers (SCFA), Singapore, Singapore, March 26-29, 2018

Abstract/Description - provided by the publisher

Not available.

Keywords - provided by the publisher

artificial intelligence; big data; cloud computing; communication; computer architecture; computer science; computer systems; data management; databases; hardware; High-Performance Computing (HPC); information management; map-reduce; processors; programming languages; semantics; wireless telecommunication systems

Availability

Detected institution | Publication year | Browse / Download / Request
Not required | 2018 | Directory of Open Access Books (open access)
Not required | 2018 | SpringerLink (open access)

Information

Resource type:

books

Print ISBN

978-3-319-69952-3

Electronic ISBN

978-3-319-69953-0

Publisher

Springer Nature

Country of publication

United Kingdom

Publication date

2018

Table of contents

TINS: A Task-Based Dynamic Helper Core Strategy for In Situ Analytics

Estelle Dirand; Laurent Colombet; Bruno Raffin

The in situ paradigm proposes to co-locate simulation and analytics on the same compute node, analyzing data while it is still resident in the compute node's memory and hence reducing the need for post-processing methods. A standard approach that has proved efficient for sharing resources on each node is to run the analytics processes on a set of dedicated cores, called helper cores, to isolate them from the simulation processes; simulation and analytics then run concurrently with limited interference. In this paper we show that performance can be improved further through a dynamic helper core strategy. We rely on a work-stealing scheduler to implement TINS, a task-based in situ framework with on-demand analytics isolation: the helper cores are dedicated to analytics only when analytics tasks are available, and otherwise join the other cores in processing simulation tasks. TINS relies on the Intel TBB library. In experiments on up to 14,336 cores, a set of representative analytics parallelized with TBB is coupled with the hybrid MPI+TBB ExaStamp molecular dynamics code. TINS shows up to 40% performance improvement over various other approaches, including the standard helper core approach. (An illustrative sketch of the on-demand isolation idea follows this entry.)

- Performance Tools | Pp. 159-178
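
The following Python sketch illustrates the on-demand isolation idea only; it is not the TINS implementation (which builds on the Intel TBB work-stealing scheduler), and the task bodies, queue granularity, and worker count are invented for illustration.

    # A minimal sketch: helper workers prefer analytics tasks and fall back
    # to simulation tasks when none are queued, emulating on-demand
    # isolation with plain Python threads and queues (not TBB).
    import queue
    import threading
    import time

    analytics_q = queue.Queue()
    simulation_q = queue.Queue()
    stop = threading.Event()

    def helper_worker():
        """Dedicated to analytics only while analytics tasks exist."""
        while not stop.is_set():
            try:
                task = analytics_q.get(timeout=0.01)       # analytics has priority
            except queue.Empty:
                try:
                    task = simulation_q.get(timeout=0.01)  # otherwise help the simulation
                except queue.Empty:
                    continue
            task()  # each task is a plain callable in this sketch

    workers = [threading.Thread(target=helper_worker) for _ in range(4)]
    for w in workers:
        w.start()

    # Enqueue mixed work: simulation tasks plus one analytics task.
    for i in range(8):
        simulation_q.put(lambda i=i: print(f"simulation task {i}"))
    analytics_q.put(lambda: print("analytics task"))

    time.sleep(0.5)
    stop.set()
    for w in workers:
        w.join()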

Machine Learning Predictions for Underestimation of Job Runtime on HPC System

Jian Guo; Akihiro Nomura; Ryan Barton; Haoyu Zhang; Satoshi Matsuoka

In modern high-performance computing (HPC) systems, users are usually asked to estimate the job runtime for system scheduling when they submit a job. In general, an underestimate of the job runtime will cause the HPC system to terminate the job before its completion. If users could be notified that their jobs may not finish before the allocated time expires, they could take action, such as killing the job and resubmitting it after parameter adjustment, to save time and cost; the productivity of HPC systems could also be vastly improved. In this paper, we propose a data-driven approach (one that actively observes, analyzes, and logs jobs) for predicting underestimation of job runtime on HPC systems. Using data produced by TSUBAME 2.5, a supercomputer deployed at the Tokyo Institute of Technology, we apply machine learning algorithms to recognize patterns indicating whether the runtime of a job is underestimated. Our experimental results show that our approach predicts runtime underestimation with 80% precision, 70% recall, and a 74% F1-score on the entirety of a given dataset. Finally, we split the job dataset into subsets by scientific application name; the best per-application precision, recall, and F1-score reached 90%, 95%, and 92%, respectively. (A sketch of such a classification pipeline follows this entry.)

- Performance Tools | Pp. 179-198
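
As a hedged illustration of such a pipeline: the feature names and the synthetic data below are hypothetical, not the TSUBAME 2.5 log schema, and scikit-learn's random forest stands in for whichever algorithms the authors evaluated.

    # Minimal sketch: classify whether a job will overrun its requested
    # runtime, then report precision/recall/F1 as in the abstract.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import precision_recall_fscore_support
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n = 1000
    # Hypothetical per-job features; label = 1 if the job overran its request.
    X = np.column_stack([
        rng.uniform(0.1, 24.0, n),   # requested hours
        rng.integers(1, 512, n),     # requested cores
        rng.uniform(0.0, 1.0, n),    # user's historical overrun rate
    ])
    y = (X[:, 2] + rng.normal(0, 0.2, n) > 0.5).astype(int)  # synthetic label

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    p, r, f1, _ = precision_recall_fscore_support(y_te, clf.predict(X_te),
                                                  average="binary")
    print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")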

A Power Management Framework with Simple DSL for Automatic Power-Performance Optimization on Power-Constrained HPC Systems

Yasutaka Wada; Yuan He; Thang Cao; Masaaki Kondo

In designing exascale HPC systems, power limitation is one of the most crucial and unavoidable issues; it is necessary to optimize the power-performance of user applications while keeping the power consumption of the HPC system below a given power budget. This kind of power-performance optimization for HPC applications requires sufficient information about, and a good understanding of, both the system (what hardware resources it includes, which components can be used as "power-knobs", how to control each knob, and so on) and the user applications (which parts are CPU-intensive, which are memory-intensive, and so on). Because this situation demands considerable effort and cost from both users and administrators of power-constrained HPC systems, a simple framework that automates the power-performance optimization process, together with a simple user interface to it, is in high demand. To address these concerns, we propose and implement a versatile framework that helps carry out power management and performance optimization on power-constrained HPC systems. Within this framework, we also propose a simple DSL as its interface. We believe this is a key to effectively utilizing HPC systems under a limited power budget. (A hypothetical sketch of such a DSL follows this entry.)

- Performance Tools | Pp. 199-218
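
A minimal sketch of what a DSL-driven interface could look like; the syntax, the region names, and the knob name cpu_power_cap are invented for illustration and are not the DSL proposed in the paper.

    # Minimal sketch: map simple "region -> knob" statements onto
    # power-knob settings such as a per-region CPU power cap.
    from dataclasses import dataclass

    @dataclass
    class KnobSetting:
        region: str      # application region the setting applies to
        knob: str        # e.g. "cpu_power_cap" (hypothetical knob name)
        value: float     # watts

    def parse_power_dsl(text: str) -> list[KnobSetting]:
        """Parse lines like: 'region compute: cpu_power_cap = 95'."""
        settings = []
        for line in text.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            head, assign = line.split(":", 1)
            region = head.removeprefix("region").strip()
            knob, value = (s.strip() for s in assign.split("="))
            settings.append(KnobSetting(region, knob, float(value)))
        return settings

    program = """
    # memory-bound phase: shift the power budget away from the CPU
    region stencil: cpu_power_cap = 80
    region io:      cpu_power_cap = 60
    """
    for s in parse_power_dsl(program):
        print(s)  # a real framework would apply these via hardware power controls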

Scalable Data Management of the Uintah Simulation Framework for Next-Generation Engineering Problems with Radiation

Sidharth Kumar; Alan Humphrey; Will Usher; Steve Petruzza; Brad Peterson; John A. Schmidt; Derek Harris; Ben Isaac; Jeremy Thornock; Todd Harman; Valerio Pascucci; Martin Berzins

The need to scale next-generation industrial engineering problems to the largest computational platforms presents unique challenges. This paper focuses on data-management problems faced by the Uintah simulation framework at a production scale of 260K processes. Uintah provides a highly scalable asynchronous many-task runtime system, which in this work is used to model a 1000 megawatt electric (MWe) ultra-supercritical (USC) coal boiler. At 260K processes we faced both parallel I/O and visualization challenges; for example, Uintah's default file-per-process I/O approach did not scale on Mira. In this paper we present a simple-to-implement, restructuring-based parallel I/O technique. We impose a restructuring step that alters the distribution of data among processes: the dataset is redistributed so that each process holds a larger chunk of data, which it then writes to a file independently. This approach finds a middle ground between the two most common parallel I/O schemes, file-per-process I/O and shared-file I/O, in terms of both the total number of generated files and the extent of communication during the data aggregation phase. To address scalability issues when visualizing the simulation data, we developed a lightweight renderer using OSPRay, which allows scientists to visualize the data interactively at high quality and to make production movies. Finally, this work presents a highly efficient and scalable radiation model based on the sweeping method, which significantly outperforms previous approaches in Uintah, such as discrete ordinates. The integrated approach allowed the USC boiler problem to run on 260K CPU cores on Mira. (A sketch of the restructuring arithmetic follows this entry.)

- Performance Tools | Pp. 219-240
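
A minimal sketch of the restructuring idea, with illustrative parameters (16 processes, groups of 4) far below the paper's 260K-process scale; NumPy arrays stand in for the simulation data and dictionary entries for the output files.

    # Minimal sketch: instead of one file per process, every group of g
    # processes ships its block to one aggregator, which writes a single
    # larger file independently - fewer files than file-per-process,
    # less contention than a single shared file.
    import numpy as np

    n_procs = 16          # total processes (260K in the paper's runs)
    g = 4                 # processes per aggregator group
    block = 1024          # doubles owned by each process

    # Each "process" owns a local block of data.
    local = [np.full(block, rank, dtype=np.float64) for rank in range(n_procs)]

    files = {}
    for agg in range(0, n_procs, g):
        # Aggregation phase: ranks agg..agg+g-1 send their blocks to rank agg.
        chunk = np.concatenate(local[agg:agg + g])
        # Independent write phase: one file per aggregator, not per process.
        files[f"restructured_{agg // g:04d}.bin"] = chunk

    print(f"{n_procs} processes -> {len(files)} files "
          f"of {files['restructured_0000.bin'].nbytes} bytes each")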

High Performance LOBPCG Method for Solving Multiple Eigenvalues of Hubbard Model: Efficiency of Communication Avoiding Neumann Expansion Preconditioner

Susumu Yamada; Toshiyuki Imamura; Masahiko Machida

The exact diagonalization method is a high-accuracy numerical approach for solving the Hubbard model of a system of electrons with strong correlation. The method solves for the eigenvalues and eigenvectors of the Hamiltonian matrix derived from the Hubbard model. Since the Hamiltonian is a huge sparse symmetric matrix, the LOBPCG method with an appropriate preconditioner was expected to solve the problem in a short time, and it has indeed succeeded in solving for the ground state (the smallest eigenvalue and its corresponding eigenvector) of the Hamiltonian. In order to solve for multiple eigenvalues of the Hamiltonian in a short time, we use a preconditioner based on the Neumann expansion, which uses the approximate eigenvalues and eigenvectors given by the LOBPCG iteration. We apply a communication-avoiding strategy, developed in light of the physical properties of the Hubbard model, to the preconditioner. Our numerical experiments on two parallel computers show that the LOBPCG method coupled with the Neumann preconditioner and the communication-avoiding strategy improves convergence and achieves excellent scalability when solving for multiple eigenvalues. (A sketch of a Neumann-series preconditioner follows this entry.)

- Linear Algebra | Pp. 243-256
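
A minimal sketch of a Neumann-expansion preconditioner inside SciPy's LOBPCG, assuming the common splitting A = D - R with D the diagonal; the random test matrix is a stand-in for the Hamiltonian, and the paper's communication-avoiding strategy is not reproduced here.

    # Minimal sketch: truncate the Neumann series for A^{-1} and use it as
    # the preconditioner M in scipy's lobpcg. With A = D - R,
    # A^{-1} ~= sum_{k<=m} (D^{-1} R)^k D^{-1}, which converges when the
    # spectral radius of D^{-1} R is < 1 (diagonal dominance below).
    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import LinearOperator, lobpcg

    rng = np.random.default_rng(0)
    n = 400
    # A small sparse symmetric diagonally dominant stand-in matrix.
    A = sp.random(n, n, density=0.01, random_state=0)
    A = ((A + A.T) * 0.5 + sp.identity(n) * 4.0).tocsr()

    d = A.diagonal()
    R = sp.diags(d) - A          # A = D - R
    m = 3                        # truncation order of the Neumann series

    def neumann_apply(x):
        """y ~= A^{-1} x via m Horner steps of the truncated series."""
        x = np.ravel(x)
        y = x / d
        for _ in range(m):
            y = x / d + (R @ y) / d
        return y

    M = LinearOperator((n, n), matvec=neumann_apply)
    X = rng.standard_normal((n, 4))   # solve for 4 smallest eigenpairs
    vals, vecs = lobpcg(A, X, M=M, largest=False, tol=1e-8, maxiter=200)
    print("smallest eigenvalues:", np.sort(vals))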

Application of a Preconditioned Chebyshev Basis Communication-Avoiding Conjugate Gradient Method to a Multiphase Thermal-Hydraulic CFD Code

Yasuhiro Idomura; Takuya Ina; Akie Mayumi; Susumu Yamada; Toshiyuki Imamura

A preconditioned Chebyshev basis communication-avoiding conjugate gradient (P-CBCG) method is applied to the pressure Poisson equation in the multiphase thermal-hydraulic CFD code JUPITER, and its computational performance and convergence properties are compared against a preconditioned conjugate gradient (P-CG) method and a preconditioned communication-avoiding conjugate gradient (P-CACG) method on Oakforest-PACS, which consists of 8,208 KNLs. The P-CBCG method reduces the number of collective communications while keeping the convergence properties robust. Compared with the P-CACG method, this improved robustness enables an order of magnitude more communication-avoiding steps. We show that the P-CBCG method is 1.38x and 1.17x faster than the P-CG and P-CACG methods, respectively, at 2,000 processors. (A sketch of the Chebyshev basis construction follows this entry.)

- Linear Algebra | Pp. 257-273
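
A minimal sketch of the Chebyshev-basis construction that underlies CBCG-type methods (not the JUPITER implementation): s basis vectors of the Krylov subspace are built with the three-term Chebyshev recurrence, which needs only eigenvalue bounds of A and matrix-vector products, hence no inner products (and no collective communication) per step.

    # Minimal sketch: p_k = T_k(Z) r, where Z maps the spectrum of A from
    # [lam_min, lam_max] affinely onto [-1, 1] and T_k are Chebyshev
    # polynomials, giving a better-conditioned basis than monomials A^k r.
    import numpy as np

    def chebyshev_basis(A, r, s, lam_min, lam_max):
        """Return the n x s matrix [p_0, ..., p_{s-1}] with p_k = T_k(Z) r."""
        alpha = 2.0 / (lam_max - lam_min)
        beta = -(lam_max + lam_min) / (lam_max - lam_min)
        Z = lambda v: alpha * (A @ v) + beta * v   # Z = (2A - (lmax+lmin)I)/(lmax-lmin)
        P = [r, Z(r)]                              # T_0 r, T_1 r
        for _ in range(2, s):
            P.append(2.0 * Z(P[-1]) - P[-2])       # T_{k+1} = 2 Z T_k - T_{k-1}
        return np.column_stack(P[:s])

    # Tiny example: SPD tridiagonal A with known eigenvalue bounds (0, 4).
    n = 100
    A = np.diag(np.full(n, 2.0)) + np.diag(np.full(n - 1, -1.0), 1) \
        + np.diag(np.full(n - 1, -1.0), -1)
    r = np.ones(n)
    P = chebyshev_basis(A, r, s=8, lam_min=1e-4, lam_max=4.0)
    print("basis shape:", P.shape, "condition estimate:", np.linalg.cond(P))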

Optimization of Hierarchical Matrix Computation on GPU

Satoshi Ohshima; Ichitaro Yamazaki; Akihiro Ida; Rio Yokota

The demand for dense matrix computation in large-scale and complex simulations is increasing; however, the memory capacity of current computer systems is insufficient for such simulations. The hierarchical matrix method (H-matrices) is attracting attention as a computational method that can reduce the memory requirements of dense matrix computations. However, computing with H-matrices is more complex than computing with dense or sparse matrices, so accelerating H-matrix computations is required. We focus on H-matrix-vector multiplication (HMVM) on a single NVIDIA Tesla P100 GPU. We implement five GPU kernels and compare their execution times against various processors (Broadwell-EP, Skylake-SP, and Knights Landing) parallelized with OpenMP. The results show that, although the HMVM computation decomposes into many small GEMV operations, merging them into a single GPU kernel was the most effective implementation. Moreover, the performance of the BATCHED BLAS in the MAGMA library was comparable to that of the manually tuned GPU kernel. (A sketch of the batching idea follows this entry.)

- Linear Algebra | Pp. 274-292
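
A minimal sketch of the batching idea, with NumPy standing in for GPU batched BLAS; the block count and sizes are illustrative, not taken from the paper.

    # Minimal sketch: an H-matrix-vector product decomposes into many small
    # dense GEMVs over low-rank blocks; launching them one by one is
    # overhead-bound, so they are grouped into a single batched call
    # (the analogue of MAGMA's batched BLAS or a merged GPU kernel).
    import numpy as np

    rng = np.random.default_rng(0)
    batch, m, k = 512, 16, 16          # 512 small 16x16 blocks (illustrative)
    blocks = rng.standard_normal((batch, m, k))
    vecs = rng.standard_normal((batch, k))

    # One-by-one GEMVs (what a naive kernel-per-block launch would do):
    ys_loop = np.stack([Ab @ x for Ab, x in zip(blocks, vecs)])

    # Single batched call:
    ys_batched = np.einsum("bmk,bk->bm", blocks, vecs)

    assert np.allclose(ys_loop, ys_batched)
    print("batched result shape:", ys_batched.shape)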