BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20181221T160903Z
LOCATION:C2/3/4 Ballroom
DTSTART;TZID=America/Chicago:20181115T083000
DTEND;TZID=America/Chicago:20181115T170000
UID:submissions.supercomputing.org_SC18_sess324@linklings.com
SUMMARY:Research Posters
DESCRIPTION:Poster\nTech Program Reg Pass, Exhibits Reg Pass\n\nExploring
Application Performance on Fat-Tree Networks in the Presence of Congestion
\n\nTaffet, Rao, Karlin\n\nNetwork congestion, which occurs when multiple
applications simultaneously use shared links in cluster network, can cause
poor communication performance, decreasing the performance and scalabilit
y of parallel applications. Many studies are performed while clusters also
run other production workloads...\n\n---------------------\nMulti-Client
DeepIO for Large-Scale Deep Learning on HPC Systems\n\nZhu, Chowdhury, Fu,
Moody, Mohror...\n\nWith the growth of computation power, leadership High
-Performance Computing (HPC) systems can train larger datasets for Deep ne
ural networks (DNNs) more efficiently. On HPC systems, a training dataset
is on a parallel file system or node-local storage devices. However, not a
ll HPC clusters have node...\n\n---------------------\nEnergy Efficiency o
f Reconfigurable Caches on FPGAs\n\nWang, Li, Geng, Herbordt\n\nThe perfor
mance of a given cache architecture depends largely on the applications th
at run on it. Even though each application has its best-suited cache confi
guration, vendors of fixed HPC systems must provide compromise designs. Re
configurable caches can adjust cache configuration dynamically to ge...\n\
n---------------------\nRGB (Redfish Green500 Benchmarker): A Green500 Ben
chmarking Tool Using Redfish\n\nHojati, Chen, Sill, Hass\n\nPerformance an
d energy are important factors for supercomputers and data-centers with a
trade-off between them. Energy efficiency metric considers both of these p
roperties. The Green500 is a branch of Top500 project which provides a li
st of supercomputers based on energy efficiency. It has a manual...\n\n---
------------------\nOptimization of Ultrasound Simulations on Multi-GPU Se
rvers\n\nVaverka, Spetko, Treeby, Jaros\n\nRealistic ultrasound simulation
s have found a broad area of applications in preoperative photoacoustic sc
reening and non-invasive ultrasound treatment planning. However, the domain
s are typically thousands of wavelengths in size, leading to large-scale n
umerical models with billions of unknowns. The ...\n\n--------------------
-\nGPGPU Performance Estimation with Core and Memory Frequency Scaling\n\n
Wang, Chu\n\nGraphics processing units (GPUs) support dynamic voltage and
frequency scaling to balance computational performance and energy consumpt
ion. However, simple and accurate performance estimation for a given GPU k
ernel under different frequency settings is still lacking for real hardwar
e, which is impor...\n\n---------------------\nMaking Sense of Scientific
Simulation Ensembles\n\nDahshan, Polys\n\nScientists run many simulations
with varying initial conditions, known as "ensembles", to understand the i
nfluence and relationships among multiple parameters or ensemble members.
Most of the ensemble visualization and analysis approaches and techniques
focus on analyzing the relationships between e...\n\n---------------------
\nWhich Architecture Is Better Suited for Matrix-Free Finite-Element Algor
ithms: Intel Skylake or Nvidia Volta?\n\nKronbichler, Allalen, Ohlerich, W
all\n\nThis work presents a performance comparison of highly tuned matrix-
free finite element kernels from the finite element library on different c
ontemporary computer architectures, NVIDIA V100 and P100 GPUs, an Intel Kn
ights Landing Xeon Phi, and two multi-core Intel CPUs (Broadwell and Skyla
ke). The a...\n\n---------------------\nSpotSDC: an Information Visualiza
tion System to Analyze Silent Data Corruption\n\nLi, Menon, Livnat, Mohror
, Pascucci\n\nAggressive technology scaling trends are expected to make th
e hardware of HPC systems more susceptible to transient faults. Transient
faults in hardware may be masked without affecting the program output, cau
se a program to crash, or lead to silent data corruptions (SDC). While fau
lt injection studi...\n\n---------------------\nHigh-Accuracy Scalable Sol
utions to the Dynamic Facility Layout Problem\n\nQasem, Novoa, Kolla, Coyl
e\n\nThe dynamic facility layout problem (DFLP) is concerned with finding
arrangements of facilities within plant locations that minimize the sum of
material handling and relocation costs over a planning horizon. DFLP is r
elevant in manufacturing engineering; accurate solutions can reduce operat
ional cos...\n\n---------------------\nHPC-as-a-Service for Life Sciences\
n\nSvaton, Martinovic, Jeliazkova, Chupakhin, Tomancak...\n\nHPC-as-a-Serv
ice is a well-known term in the area of high performance computing. It ena
bles users to access an HPC infrastructure without a need to buy and manag
e their own infrastructure. Through this service, academia and industry ca
n take advantage of the technology without an upfront investment ...\n\n--
-------------------\nSciGaP: Apache Airavata Hosted Science Gateways\n\nPi
erce, Marru, Abeysinghe, Pamidighantam, Christie...\n\nThe goal of the Sci
ence Gateways Platform as a service (SciGaP.org) project is to provide cor
e services for building and hosting science gateways. Over the last two ye
ars, SciGaP services have been used to build and host over twenty-five sci
ence gateways. SciGaP services support these gateways throu...\n\n--------
-------------\nReproducibility as Side Effect\n\nWang, Zhen, Anderson, Kea
hey\n\nThe ability to keep records and reproduce experiments is a critical
element of the scientific method for any discipline. However, the recordi
ng and publishing of research artifacts that allow to reproduce and direct
ly compare against existing research continue to be a challenge. In this p
aper, we pr...\n\n---------------------\nUsing Darshan and CODES to Evalua
te Application I/O Performance\n\nKhetawat, Zimmer, Mueller, Vazhkudai, At
chley\n\nBurst buffers have become increasingly popular in HPC systems, al
lowing bursty I/O traffic to be serviced faster without slowing down appli
cation execution. The ubiquity of burst buffers creates opportunities for
studying their ideal placement in the HPC topology. Furthermore, the topol
ogy of the ne...\n\n---------------------\nGPU-Accelerated Interpolation f
or 3D Image Registration\n\nHimthani, Mang, Gholami, Biros\n\nImage regist
ration is a key technology in image computing with numerous applications i
n medical imaging. Our overarching goal is the design of a consistent and
unbiased computational framework for the integration of medical imaging da
ta with simulation and optimization to support clinical decision m...\n\n-
--------------------\nHIVE: A Cross-Platform, Modular Visualization Ecosys
tem for Heterogeneous Computational Environments\n\nNonaka, Ono, Sakamoto,
Hayashi, Kawanabe...\n\nHPC operational environments usually have support
ing computational systems for assisting pre- and post-processing activitie
s such as the visualization and analysis of simulation results. A wide var
iety of hardware systems can be found at different HPC sites, and in our c
ase, we have a CPU-only (x86...\n\n---------------------\nImproving the I
/O Performance and Memory Usage of the Xolotl Cluster Dynamics Simulator\n
\nRoth, Blondel, Bernholdt, Wirth\n\nXolotl is a cluster dynamics simulato
r used to predict gas bubble evolution in solids. It is currently being us
ed to simulate bubble formation in the plasma-facing surface within fusion
reactors and the nuclear fuel used in fission reactors. After observing p
erformance problems in coupled-code simul...\n\n---------------------\nPer
formance Evaluation of the Shifted Cholesky QR Algorithm for Ill-Condition
ed Matrices\n\nFukaya, Kannan, Nakatsukasa, Yamamoto, Yanagisawa\n\nThe Ch
olesky QR algorithm, which computes the QR factorization of a matrix, is a
simple yet efficient algorithm for high-performance computing. However it
suffers from numerical instability. In a recent work, this instability ha
s been remedied by repeating Cholesky QR twice (CholeskyQR2). CholeskyQ..
.\n\n---------------------\nLarge Scale MPI-Parallelization of LBM and DEM
Systems: Accelerating Research by Using HPC\n\nJelinek, Mason, Peters, Jo
hnson, Brumfield...\n\nCasting, solidification, and the behavior of dry, s
aturated, and partially saturated granular media are examples of interesti
ng and important problems in multiple areas of civil, mechanical, and chem
ical engineering. For interacting particle-fluid systems, the Discrete Ele
ment Method (DEM) and Latti...\n\n---------------------\nHermes: a Multi-T
iered Distributed I/O Buffering System for HDF5\n\nDevarajan\n\nHigh-Perfo
rmance Computing (HPC) systems’ increasing ability to run data-intensive p
roblems at larger scale and resolution has driven the evolution of modern
storage technologies. In addition, extreme amounts of data are collected b
y large scientific instruments and sensor network is resulting in a ...\n\
n---------------------\nWorkflow for Parallel Processing of Sequential Mes
h Databases\n\nMeca, Říha, Brzobohatý\n\nThis poster presents a workf
low for parallel loading of sequentially stored mesh databases. It can be
used as a connection between tools for the creation of complex engineering
models along with parallel solvers to allow broader usage of HPC by the e
ngineering community. Scalability tests show that ...\n\n-----------------
----\nThe NAStJA Framework: Non-Collective Scalable Global Communications\
n\nBerghoff, Kondov\n\nIn recent years, simulations in various areas of sc
ience and engineering have proven to be very useful. To efficiently deplo
y simulation codes on current and future high-performance computer systems
, high node level performance, scalable communication and the exclusion of
unnecessary calculations a...\n\n---------------------\nHardware Accelera
tion of CNNs with Coherent FPGAs\n\nSefat, Aslan, Qasem\n\nThis paper desc
ribes a new flexible approach to implementing energy-efficient CNNs on FPG
As. Our design leverages the Coherent Accelerator Processor Interface (CAP
I) which provides a cache-coherent view of system memory to attached accel
erators. Convolution layers are formulated as matrix multiplica...\n\n----
-----------------\nDistributed Fast Boundary Element Methods\n\nMerta, Zap
letal, Kravcenko\n\nWe present a parallel implementation of the fast bound
ary element method (BEM) for the Helmholtz equation. After a brief descrip
tion of BEM, vectorization of the computationally most demanding kernels,
and shared memory parallelization, we focus on the distributed memory para
llelization using a new ...\n\n---------------------\nDevelopment of Numer
ical Coupled Analysis Method by Air Flow Analysis and Snow Accretion Analy
sis\n\nMurotani, Nakade, Kamata, Takahashi\n\nIn this research, to take co
untermeasures for the snow accretion damage, we developed a simulator of r
ealizing the snow accretion process in the following steps. Firstly, air f
low analysis is performed by “Airflow simulator” developed by RTRI (Railwa
y Technical Research Institute). Secondly, traject...\n\n-----------------
----\nPortable Parallel Performance via Multi-Dimensional Homomorphisms\n\
nRasch, Schulze, Gorlatch\n\nAchieving portable performance over different
parallel architectures and varying problem sizes is hard: e.g., a program
optimized for multi-core CPUs on large input sizes can significantly diff
er from the same program optimized for Graphics Processing Units (GPUs) on
small sizes.\n\nWe propose an appr...\n\n---------------------\nWarpX: To
ward Exascale Modeling of Plasma Particle Accelerators\n\nThevenet, Vay, A
lmgren, Bell, Lehe...\n\nTurning the current experimental plasma accelerat
or state-of-the-art from a promising technology into mainstream scientific
tools depends critically on high-performance, high-fidelity modeling of c
omplex processes that develop over a wide range of space and time scales.
As part of the U.S. Departmen...\n\n---------------------\nEnabling Data A
nalytics Workflows Using Node-Local Storage\n\nDo, Jiang, Gallagher, Chu,
Harrison...\n\nThe convergence of high-performance computing (HPC) and Big
Data is a necessity with the push toward extreme-scale computing. As HPC
simulations become more complex, the analytics need to process larger amou
nts of data, which poses significant challenges for coupling HPC simulatio
ns with Big Data an...\n\n---------------------\nOpeNNdd: Open Neural Netw
orks for Drug Discovery: Creating Free and Easy Methods for Designing Medi
cine\n\nKroencke, Shacterman, Pavini, Samudio, Crivelli\n\nBringing new me
dicines to patients can be prohibitively expensive in terms of time, cost,
and resources. This leaves many diseases without therapeutic interventio
ns. In addition, new and reemerging diseases are increasing in prevalence
across the globe at an alarming rate. The speed and scale of ...\n\n----
-----------------\nSC18 Research Posters\n\n\n\nSC18 Research Posters will
be on display on Tuesday, Wednesday, Thursday from 8:30am to 5pm in the C
2/3/4 Ballroom.\n\n---------------------\nFeatherCNN: Fast Inference Compu
tation with TensorGEMM on ARM Architectures\n\nLan, Meng, Hundt, Schmidt,
Deng...\n\nThis poster presents a fast inference computation library for A
RM architecture named as CNNForward. CNNForward is trying to improve the e
fficiency of inference computation for convolutional neural networks on AR
M-based multi-core and many-core architectures using both mathematical for
mula reconstruc...\n\n---------------------\nBoosting the Scalability of C
ar-Parrinello Molecular Dynamics Simulations for Multi- and Manycore Archi
tectures\n\nKlöffel, Meyer, Mathias\n\nWe present our recent optimizations
of the ultra-soft pseudo-potential (USPP) code path of the ab initio molec
ular dynamics program CPMD (www.cpmd.org). Following the internal instrume
ntation of CPMD, all relevant USPP routines have been revised to fully sup
port hybrid MPI+OpenMP parallelization. For...\n\n---------------------\nC
haracterizing Declustered Software RAID for Enhancing Storage Reliability
and Performance\n\nQiao, Fu, Chen, Settlemyer\n\nRedundant array of indepe
ndent disks (RAID) has been widely used to address the reliability issue i
n storage systems. As the scale of modern storage systems continues growin
g, disk failure becomes the norm. With ever-increasing disk capacity, RAID
recovery based on disk rebuild becomes more costly, ...\n\n--------------
-------\nParallel Implementation of Machine Learning-Based Many-Body Poten
tials on CPU and GPU\n\nZhai, Danandeh, Tan, Gao, Paesani...\n\nMachine le
arning models can be used to develop highly accurate and efficient many-bo
dy potentials for molecular simulations based on the many-body expansion o
f the total energy. A prominent example is the MB-pol water model that em
ploys permutationally invariant polynomials (PIPs) to represent the ...\n\
n---------------------\nImplementing Efficient Data Compression and Encryp
tion in a Persistent Key-Value Store for HPC\n\nKim, Vetter\n\nRecently, p
ersistent data structures, like key-value stores (KVSs), which are stored
in an HPC system's nonvolatile memory, provide an attractive solution for
a number of emerging challenges like limited I/O performance. This paper i
nvestigates how to efficiently integrate data compression and encry...\n\n
---------------------\nA Parallel-Efficient GPU Package for Multiphase Flo
w in Realistic Nano-Pore Networks\n\nXia, Blumers, Li, Luo, Goral...\n\nSi
mulations of fluid flow in oil/gas shale rocks are challenging in part due
to the heterogeneous pore sizes ranging from a few nanometers to a few mi
crometers. Additionally, the complex fluid-solid interaction occurring phy
sically and chemically must be captured with high resolution. To address t
he...\n\n---------------------\nProcessing-in-Storage Architecture for Mac
hine Learning and Bioinformatics\n\nKaplan, Yavits, Ginosar\n\nUser-genera
ted and bioinformatics database volumes have been increasing exponentially
for more than a decade. With the slowdown and approaching end of Moore's l
aw, traditional technologies cannot satisfy the increasing demands for pro
cessing power. This work presents PRINS, a highly-parallel in-sto...\n\n
---------------------\nKernel-Based and Total Performance Analysis of CGYR
O on 4 Leadership Systems\n\nSfiligoi, Candy, Belli\n\nWe present the resu
lts of an exhaustive performance analysis of the CGYRO code on 4 leadershi
p systems spanning 5 different configurations (2 KNL-based, 1 Skylake-base
d, and 2 hybrid CPU-GPU architectures). CGYRO is an Eulerian gyrokinetic s
olver designed and optimized for collisional, electromagnet...\n\n--------
-------------\nRedesigning The Absorbing Boundary Algorithm for Asynchrono
us High Performance Acoustic Wave Propagation\n\nAbdelkhalak, Akbudak, Eti
enne, Tonellot\n\nExploiting high concurrency, relaxing the synchrony of e
xisting algorithms, and increasing data reuse have immense effect in perfo
rmance. We integrate the Multicore-optimized Wavefront Diamond (MWD) tilin
g approach by Malas et al. [SIAM SISC, 2015, ACM Trans. Parallel Comput. 2
017], which takes int...\n\n---------------------\nCapsule Networks for P
rotein Structure Classification\n\nRosa de Jesus, Cuevas Paniagua, Rivera,
Crivelli\n\nCapsule Networks have great potential to tackle problems in s
tructural biology because of their attention to hierarchical relationships
. This work describes the implementation and application of a capsule netw
ork architecture to the classification of RAS protein family structures on
GPU-based comput...\n\n---------------------\nCross-Layer Group Regulariz
ation for Deep Neural Network Pruning\n\nGao, Liu\n\nImproving weights spa
rsity is a common strategy for deep neural network pruning. Most existing
methods use regularizations that only consider structural sparsity within
an individual layer. In this paper, we propose a cross-layer group regular
ization taking into account the statistics from multiple ...\n\n----------
-----------\nMachine Learning for Adaptive Discretization in Massive Multi
scale Biomedical Modeling\n\nHan, Gupta, Zhang, Bluestein, Deng\n\nFor mul
tiscale problems, traditional time stepping algorithms use a single smalle
st time stepsize in order to capture the finest details; using this scale
leads to a significant waste of computing resources for simulating coarse-
grained portion of the problem. To improve computing efficiency for mul...
\n\n---------------------\nMulti-GPU Accelerated Non-Hydrostatic Numerical
Ocean Model with GPUDirect RDMA Transfers\n\nYamagishi, Matsumura, Hasumi
\n\nWe have implemented our “kinaco” numerical ocean model on Tokyo Univer
sity’s Reedbush supercomputer, which utilizes the latest Nvidia Pascal P10
0 GPUs with GPUDirect technology. We have also optimized the model’s Poiss
on/Helmholtz solver by adjusting the global memory alignment and thread bl
ock conf...\n\n---------------------\nA Locality and Memory Congestion-Awa
re Thread Mapping Method for Modern NUMA Systems\n\nAgung, Amrizal, Egawa,
Takizawa\n\nOn modern NUMA systems, the memory congestion problem could d
egrade performance more than the memory access locality problem because a
large number of processor cores in the systems can cause heavy congestion
on memory controllers. In this work, we propose a thread mapping method th
at considers the ...\n\n---------------------\nTuning CFD Applications for
Intel Xeon Phi with TAU Commander and ParaTools ThreadSpotter\n\nBeekman,
Chaimov, Shende, Malony, Bisek...\n\nTuning and understanding the perform
ance characteristics of computational fluid dynamics (CFD) codes on many-c
ore, NUMA architectures is challenging. One must determine how programming
choices impact algorithm performance and how best to utilize the availabl
e memory caches, high-bandwidth memory, an...\n\n---------------------\nMa
ssively Parallel Stress Chain Characterization for Billion Particle DEM Si
mulation of Accretionary Prism Formation\n\nFuruichi, Nishiura, Hori\n\nHe
rein, a novel algorithm for characterizing stress chains using a large par
allel computer system is presented. Stress chains are important for analyz
ing the results of large-scale discrete element method (DEM) simulations.
However, the general algorithm is difficult to parallelize especially when
s...\n\n---------------------\nRefactoring and Optimizing Multiphysics Co
mbustion Models for Data Parallelism\n\nStone, Poludnenko, Taylor\n\nHigh-
fidelity combustion simulations combine high-resolution computational flui
d dynamics numerical methods with multi-physics models to capture chemical
kinetics and transport processes. These multi-physics models can dominate
the computation cost of the simulation. Due to the high cost of combusti.
..\n\n---------------------\nInteractive HPC Deep Learning with Jupyter No
tebooks\n\nBhimji, Farrell, Evans, Henderson, Cholia...\n\nDeep learning r
esearchers are increasingly using Jupyter notebooks to implement interacti
ve, reproducible workflows. Such solutions are typically deployed on small
-scale (e.g. single server) computing systems. However, as the sizes and c
omplexities of datasets and associated neural network models in...\n\n----
-----------------\nFast and Accurate Training of an AI Radiologist\n\nWils
on, Gundecha, Varadharajan, Filby, Yang...\n\nThe health care industry is
expected to be an early adopter of AI and deep learning to improve patient
outcomes, reduce costs, and speed up diagnosis. We have developed models
for using AI to diagnose pneumonia, emphysema, and other thoracic patholog
ies from chest x-rays. Using the Stanford Universi...\n\n-----------------
----\nFull State Quantum Circuit Simulation by Using Lossy Data Compressio
n\n\nWu, Di, Cappello, Finkel, Alexeev...\n\nIn order to evaluate, validat
e, and refine the design of a new quantum algorithm or a quantum computer,
researchers and developers need methods to assess their correctness and f
idelity. This requires the capabilities of simulation for full quantum sta
te amplitudes. However, the number of quantum sta...\n\n------------------
---\nAn Efficient SIMD Implementation of Pseudo-Verlet Lists for Neighbor
Interactions in Particle-Based Codes\n\nWillis, Schaller, Gonnet\n\nIn par
ticle-based simulations, neighbour finding (i.e. finding pairs of particle
s to interact within a given range) is the most time consuming part of the
computation. One of the best such algorithms, which can be used for both
Molecular Dynamics (MD) and Smoothed Particle Hydrodynamics (SPH) simula..
.\n\n---------------------\nUnderstanding Potential Performance Issues Usi
ng Resource-Based alongside Time Models\n\nDing, Lee, Xue, Zheng\n\nNumero
us challenges and opportunities are introduced by the complexity and enorm
ous code legacy of HPC applications, the diversity of HPC architectures, a
nd the nonlinearity of interactions between applications and HPC systems.
To address these issues, we propose the Resource-based Alongside Time (R..
.\n\n---------------------\nMPI/OpenMP parallelization of the Fragment Mol
ecular Orbitals Method in GAMESS\n\nMironov, Alexeev, Fedorov\n\nIn this w
ork, we present a novel parallelization strategy for the Fragment Molecula
r Orbital (FMO) method in the quantum chemistry package GAMESS. The origin
al FMO code has been parallelized only with MPI, which limits scalability
of the code on multi-core massively parallel machines. To address thi...\n
\n---------------------\nAutomatic Generation of Mixed-Precision Programs\
n\nMoody, Pinnow, Lam, Menon, Schordan...\n\nFloating-point arithmetic is
foundational to scientific computing in HPC, and choices about floating-po
int precision can have a significant effect on the accuracy and speed of H
PC codes. Unfortunately, current precision optimization tools require sign
ificant user interaction, and few work on the sca...\n\n------------------
---\nUPC++ and GASNet-EX: PGAS Support for Exascale Applications and Runti
mes\n\nBaden, Hargrove, Ahmed, Bachan, Bonachea...\n\nLawrence Berkeley Na
tional Lab is developing a programming system to support HPC application d
evelopment using the Partitioned Global Address Space (PGAS) model. This w
ork is driven by the emerging need for adaptive, lightweight communication
in irregular applications at exascale. We present an ove...\n\n---------
------------\nEnabling Reproducible Microbiome Science through Decentraliz
ed Provenance Tracking in QIIME 2\n\nNaimey, Keefe\n\nIn this poster, we d
emonstrate the ways in which automatic, integrated, decentralized provenan
ce tracking in QIIME 2, a leading microbiome bioinformatics platform, enab
les reproducible microbiome science. We use sample data from a recent stud
y of arid soil microbiomes (Significant Impacts of Increa...\n\n---------
------------\nOptimizing Next Generation Hydrodynamics Code for Exascale S
ystems\n\nAkhmetova, Lakshmiranganatha, Mukherjee, Oullet, Payne...\n\nStu
dying continuum dynamics problems computationally can illuminate complex p
hysical phenomena where experimentation is too costly. However, the models
used in studying these phenomena usually require intensive calculations,
some of which are beyond even the largest supercomputers to date. Emerging
...\n\n---------------------\nMGRIT Preconditioned Krylov Subspace Method
\n\nYoda, Fujii, Tanaka\n\nMGRIT re-discretize the problem with larger tim
e-step width at the coarse-levels, which often cause unstable convergence.
We propose a Krylov subspace method with MGRIT preconditioning as a more
stable solver. For unstable problems, MGRIT preconditioned Krylov subspace
method performed better than M...\n\n---------------------\nEnabling Neut
rino and Antineutrino Appearance Observation Measurements with HPC Facilit
ies\n\nBuchanan, Calvez, Ding, Doyle, Himmel...\n\nWhen fitting to data wi
th low statistics and near physical boundaries, extra measures need to be
taken to ensure proper statistical coverage. The method NOvA uses is calle
d the Feldman-Cousins procedure, which entails fitting thousands of indepe
ndent pseudoexperiments to generate acceptance interval...\n\n------------
---------\nLarge Scale Computation of Quantiles Using MELISSA\n\nRibes, Te
rraz, Fournier, Iooss, Raffin\n\nQuantiles being order statistics, the cla
ssical approach for their computation requires availability of the full sa
mple before ranking it. This approach is not suitable at exascale. Large e
nsembles would need to gather a prohibitively large amount of data. We pro
pose an iterative approach based on t...\n\n---------------------\nFlowOS-
RM: Disaggregated Resource Management System\n\nTakano, Suzaki, Koie\n\nA
traditional data center consists of monolithic-servers is confronted with
limitations including lack of operational flexibility, low resource utiliz
ation, low maintainability, etc. Resource disaggregation is a promising so
lution to address the above issues. We propose a concept of disaggregated
da...\n\n---------------------\nProgramming the EMU Architecture: Algorith
m Design Considerations for Migratory-Threads-Based Systems\n\nBelviranli,
Lee, Vetter\n\nThe decades-old memory bottleneck problem for data-intensi
ve applications is getting worse as the processor core counts continue to
increase. Workloads with sparse memory access characteristics only achieve
a fraction of a system's total memory bandwidth. EMU architecture provide
s a radical approach...\n\n---------------------\nOpenACC to FPGA: A Direc
tive-Based High-Level Programming Framework for High-Performance Reconfigu
rable Computing\n\nLee, Lambert, Kim, Vetter, Malony\n\nAccelerator-based
heterogeneous computing has become popular solutions for power-efficient h
igh performance computing (HPC). Along these lines, Field Programmable Ga
te Arrays (FPGAs) have offered more advantages in terms of performance and
energy efficiency for specific workloads than other acceler...\n\n-------
--------------\nTensor-Optimized Hardware Accelerates Fused Discontinuous
Galerkin Simulations\n\nBreuer, Heinecke, Cui\n\nIn recent years the compu
te/memory balance of processors has been continuously shifting towards com
pute. The rise of Deep Learning, based on matrix multiplications, accelera
ted this path, especially in terms of single precision and lower precision
compute. An important research question is if this d...\n\n--------------
-------\nAI Matrix – Synthetic Benchmarks for DNN\n\nWei, Xu, Jin, Zhang,
Zhang\n\nThe current AI benchmarks suffer from a number of drawbacks. Firs
t, they cannot adapt to the emerging changes of deep learning (DL) algorit
hms and are fixed once selected. Second, they contain tens to hundreds of
applications and have very long running time. Third, they are mainly selec
ted from open...\n\n---------------------\nApplying the Execution-Cache-Me
mory Model: Current State of Practice\n\nHager, Eitzinger, Hornich, Cremon
esi, Alappat...\n\nThe ECM (Execution-Cache-Memory) model is an analytic,
resource-based performance model for steady-state loop code running on mu
lticore processors. Starting from a machine model, which describes the int
eraction between the code and the hardware, and static code analysis, it a
llows an accurate predi...\n\n---------------------\nPerformance Evaluatio
n of the NVIDIA Tesla V100: Block Level Pipelining vs. Kernel Level Pipeli
ning\n\nCui, Scogland, de Supinski, Feng\n\nAs accelerators become more co
mmon, expressive and performant, interfaces for them become ever more impo
rtant. Programming models like OpenMP offer simple-to-use but powerful dir
ective-based offload mechanisms. By default, these models naively copy dat
a to or from the device without overlapping comp...\n\n-------------------
--\nJob Simulation for Large-Scale PBS-Based Clusters with the Maui Schedu
ler\n\nZitzlsberer, Jansik, Martinovic\n\nFor large-scale High Performance
Computing centers with a wide range of different projects and heterogeneo
us infrastructures, efficiency is an important consideration. Understandin
g how compute jobs are scheduled is necessary for improving the job schedu
ling strategies in order to optimize cluster u...\n\n---------------------
\nScript of Scripts Polyglot Notebook and Workflow System\n\nWang, Leong,
Peng\n\nComputationally intensive disciplines such as computational biolog
y often use tools implemented in different languages and analyze data on h
igh-performance computing systems. Although scientific workflow systems ca
n powerfully execute large-scale data-processing, they are not suitable fo
r ad hoc dat...\n\n---------------------\nEnabling High-Level Graph Proces
sing via Dynamic Tasking\n\nDrocco, Castellana, Minutoli, Tumeo, Feo\n\nDa
ta-intensive computing yields irregular and unbalanced workloads, in parti
cular on large-scale problems running on distributed systems. Task-based r
untime systems are commonly exploited to implement higher-level data-centr
ic programming models, promoting multithreading and asynchronous coordinat
io...\n\n---------------------\nTensorfolding: Improving Convolutional Neu
ral Network Performance with Fused Microkernels\n\nAnderson, Georganas, Av
ancha, Heinecke\n\nConvolution layers are prevalent in many classes of dee
p neural networks, including Convolutional Neural Networks (CNNs) which pr
ovide state-of-the-art results for tasks like image recognition, neural ma
chine translation and speech recognition. In the recent past, several tech
niques to improve gener...\n\n---------------------\nBinarized ImageNet In
ference in 29us\n\nGeng, Li, Wang, Song, Herbordt\n\nWe propose a single-F
PGA-based accelerator for ultra-low-latency inference of ImageNet in this
work. The design can complete the inference of Binarized AlexNet within 29
us with accuracy comparable to other BNN implementations. We achieve this
performance with the following contributions: 1. We comp...\n\n----------
-----------\nToward Smoothing Data Movement Between RAM and Storage\n\nAlt
urkestani, Tonellot, Etienne, Ltaief\n\nWe propose to design and implement
a software framework, which provides a Multilayer Buffer System (MBS) to
cache in/out datasets into CPU main memory from/to slower storage media, s
uch as parallel file systems (e.g., Lustre), solid-state drive (e.g., Burs
t Buffer) or non-volatile RAM. Although MBS ...\n\n---------------------\n
MATEDOR: MAtrix, TEnsor, and Deep-Learning Optimized Routines\n\nAbdelfatt
ah, Dongarra, Tomov, Yamazaki, Haidar\n\nThe MAtrix, TEnsor, and Deep-lear
ning Optimized Routines (MATEDOR) project develops software technologies a
nd standard APIs, along with a sustainable and portable library, for large
-scale computations that can be broken down into very small matrix or tens
or computations. The main target of MATEDOR i...\n\n---------------------\
nAccelerating Wave-Propagation Algorithms with Adaptive Mesh Refinement Us
ing the Graphics Processing Unit (GPU)\n\nQin, LeVeque, Motley\n\nClawpack
is a library for solving nonlinear hyperbolic partial differential equati
ons using high-resolution finite volume methods based on Riemann solvers a
nd limiters. It supports Adaptive Mesh Refinement (AMR), which is essentia
l in solving multi-scale problems. Recently, we added capabilities to ...\
n\n---------------------\nDistributed Adaptive Radix Tree for Efficient Me
tadata Search on HPC Systems\n\nZhang, Tang, Byna, Chen\n\nAffix-based sea
rch allows users to retrieve data without the need to remember all relevan
t information precisely. While building an inverted index to facilitate ef
ficient affix-based search is a common practice for standalone databases a
nd desktop file systems, they are often insufficient for high-p...\n\n----
-----------------\nImproving Error-Bounded Lossy Compression for Cosmologi
cal N-Body Simulation\n\nLi, Di, Liang, Chen, Cappello\n\nCosmological sim
ulations may produce extremely large amount of data, such that its success
ful run depends on large storage capacity and huge I/O bandwidth, especial
ly in the exascale computing scale. Effective error-bounded lossy compress
ors with both high compression ratios and low data distortion ...\n\n-----
----------------\nVeloC: Very Low Overhead Checkpointing System\n\nNicolae
, Cappello, Moody, Gonsiorowski, Mohror\n\nCheckpointing large amounts of
related data concurrently to stable storage is a common I/O pattern of man
y HPC applications. However, such a pattern frequently leads to I/O bottle
necks that lead to poor scalability and performance. As modern HPC infrast
ructures continue to evolve, there is a growing...\n\n--------------------
-\nEstimating Molecular Dynamics Chemical Shift with GPUs\n\nWright, Ferra
to\n\nExperimental chemical shifts (CS) from solution and solid state magi
c-angle-spinning nuclear magnetic resonance spectra provide atomic level d
ata for each amino acid within a protein or complex. However, structure de
termination of large complexes and assemblies based on NMR data alone rema
ins challe...\n\n---------------------\nUsing Thrill to Process Scientific
Data on HPC\n\nKarabin, Chen, Suresh, Jimenez, Lo...\n\nWith ongoing impr
ovement of computational power and memory capacity, the volume of scientif
ic data keeps growing. To gain insights from vast amounts of data, scienti
sts are starting to look at Big Data processing and analytics tools such a
s Apache Spark. In this poster, we explore Thrill, a framewor...\n\n------
---------------\nGPU Acceleration at Scale with OpenPower Platforms in Cod
e_Saturne\n\nAntao, Moulinec, Fournier, Sawko, Zimon...\n\nCode_Saturne is
a widely used computational fluid dynamics software package that uses fin
ite-volume methods to simulate different kinds of flows tailored to tackle
multi-billion-cell unstructured mesh simulations. This class of codes has
shown to be challenging to accelerate on GPUs as they consist o...\n\n----
-----------------\nLarge-Message Size Allreduce at Wire Speed for Distribu
ted Deep Learning\n\nTanaka, Arikawa, Kawai, Kato, Ito...\n\nIn large-scal
e distributed deep learning, the Allreduce operation for large messages (1
00 KB or more) is critical for gathering gradients from multiple worker no
des and broadcasting the sum of the gradients to them. When the message is
large, the latency in Allreduce operation would make it difficul...\n\n--
-------------------\nSol: Transparent Neural Network Acceleration Platform
\n\nWeber\n\nWith the usage of neural networks in a wide range of applicat
ion fields, the necessity to execute these efficiently on high performance
hardware is one of the key problems for artificial intelligence (AI) fram
ework providers. More and more new specialized hardware types and correspo
nding libraries a...\n\n---------------------\nDetection of Silent Data Co
rruptions in Smooth Particle Hydrodynamics Simulations\n\nCavelan, Ciorba,
Cabezón\n\nSoft errors, such as silent data corruptions (SDCs) hinder the
correctness of large-scale scientific applications. Ghost replication (GR
) is proposed herein as the first SDCs detector relying on the fast error
propagation inherent to applications that employ the smooth particle hydro
dynamics (SPH) m...\n\n---------------------\nDeepSim-HiPAC: Deep Learning
High Performance Approximate Calculation for Interactive Design and Proto
typing\n\nAl-Jarro, Georgescu, Tomita, Nakashima\n\nWe present a data-driv
en technique that can learn from physical-based simulations for the instan
t prediction of field distribution for 3D objects. Such techniques are ext
remely useful when considering, for example, computer aided engineering (C
AE), where computationally expensive simulations are oft...\n\n-----------
----------\nTop-Down Performance Analysis of Workflow Applications\n\nHero
ld, Williams\n\nScientific simulation frameworks are common to use on HPC
systems. They contain parallelized algorithms and provide various solvers
for a specific application domain. Usually, engineers execute multiple ste
ps to solve a particular problem which are often distributed over multiple
jobs. Finding perfo...\n\n---------------------\nConvolutional Neural Net
works for Coronary Plaque Classification in Intravascular Optical Coherenc
e Tomography (IVOCT) Images\n\nKolluru, Prabhu, Gharaibeh, Wilson, Gajurel
\n\nCurrently, IVOCT is the only imaging technique with the resolution nec
essary to identify vulnerable thin cap fibro-atheromas (TCFAs). IVOCT also
has greater penetration depth in calcified plaques as compared to Intrava
scular Ultrasound (IVUS). Despite its advantages, IVOCT image interpretati
on is ch...\n\n---------------------\nCompiling SIMT Programs on Multi- an
d Many-Core Processors with Wide Vector Units: A Case Study with CUDA\n\nW
u, Ravi, Becchi\n\nThere has been an increasing interest in SIMT programmi
ng tools for multi- and manycore (co)processors with wide vector extension
s. In this work, we study the effective implementation of a SIMT programmi
ng model (a subset of CUDA C) on Intel platforms with 512-bit vector exten
sions (hybrid MIMD/SIMD...\n\n---------------------\nAn Alternative Approa
ch to Teaching Bigdata and Cloud Computing Topics at CS Undergraduate Leve
l\n\nDeb, Fuad, Irwin\n\nBig data and cloud computing collectively offer a
paradigm shift in the way businesses are now acquiring, using and managin
g information technology. This creates the need for every CS student to be
equipped with foundation knowledge in this collective paradigm and to pos
sess some hands-on-experience...\n\n---------------------\nA Massively Par
allel Evolutionary Markov Chain Monte Carlo Algorithm for Sampling Complic
ated Multimodal State Spaces\n\nCho, Liu\n\nWe develop an Evolutionar
y Markov Chain Monte Carlo (EMCMC) algorithm for sampling from large multi
-modal state spaces. Our algorithm combines the advantages of evolutionary
algorithms (EAs) as optimization heuristics and the theoretical convergen
ce properties of Markov Chain Monte Carlo (MCMC) algo...\n\n--------------
-------\nMLModelScope: Evaluate and Measure Machine Learning Models within
AI Pipelines\n\nDakkak, Li, Hwu, Xiong\n\nThe current landscape of Machin
e Learning (ML) and Deep Learning (DL) is rife with non-uniform frameworks
, models, and system stacks but lacks standard tools to facilitate the eva
luation and measurement of models. Due to the absence of such tools, the c
urrent practice for evaluating and comparing th...\n\n--------------------
-\nA Compiler Framework for Fixed-Topology Non-Deterministic Finite Automa
ta on SIMD Platforms\n\nNourian, Wu, Becchi\n\nAutomata traversal accelera
tion has been studied on various parallel platforms. Many existing acceler
ation methods store finite automata states and transitions in memory. For
these designs memory size and bandwidth are the main limiting factors to p
erformance and power efficiency. Many applications,...\n\n----------------
-----\nA Low-Communication Method to Solve Poisson's Equation on Locally-St
ructured Grids\n\nVan Straalen, McCorquodale, Colella, Kavouklis\n\nThis p
oster describes a new algorithm, Method of Local Corrections (MLC), and a
high-performance implementation for solving Poisson's equation with infini
te-domain boundary conditions, on locally-refined nested rectangular grids
. The data motion is comparable to that of only a single V-cycle of mul..
.\n\n---------------------\nFloating-Point Autotuner for CPU-Based Mixed-P
recision Applications\n\nGu, Beata, Becchi\n\nIn this poster, we present t
he design and development of an autotuning tool for floating-point code. T
he goal is to balance accuracy and performance in order to produce an effi
cient and accurate mixed-precision program. The tuner starts by maximizing
accuracy through the use of a high-precision libr...\n
END:VEVENT
END:VCALENDAR