Presentation


Workshop: Reproducibility for Streaming Analysis

Session: Computational Reproducibility at Exascale 2018 (CRE2018)

Author/Presenters

Christopher J. Wright

Line Pouchard

Simon J. L. Billinge

Event Type

Workshop


Time: Sunday, November 11th, 4:10pm - 4:30pm

Location: D221

Description: The natural and physical sciences increasingly need streaming data processing for live data analysis and autonomous experimentation. Furthermore, data provenance and replicability are important for assuring the veracity of scientific results. Here we describe a software system that combines high performance computing, streaming data processing, and automatic data provenance capture to address this need. Data provenance and streaming data processing share a common data structure, the directed acyclic graph (DAG), which describes the order of the computational steps. Data processing requires the DAG to specify which computations to run and in what order; because the execution can be recreated from the graph, the same DAG serves to reproduce the analyzed data and to capture provenance. In our framework the description and ordering of the analysis steps (the pipeline) are separated from their execution (the streaming analysis), and the DAG created for the streaming data processing is captured during data analysis. Streaming data can have high throughput, so our system allows users to choose among multiple parallel processing backends, including Dask. To guarantee reproducibility, unique links to the incoming data and their timestamps are captured alongside the DAG. Analyzed data, along with provenance metadata, are stored in a database, from which analysis can be re-run from raw data, enabling verification of results, exploration of how parameters change outcomes, and reuse of data processing. This system is running in production at the National Synchrotron Light Source-II (NSLS-II) x-ray powder diffraction beamlines.
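The idea in the description can be illustrated with a minimal sketch. This is not the authors' software or its API. It assumes a toy two-step pipeline, and every name in it (STEPS, PIPELINE, run, replay, the "scan-0001" identifier, and the timestamp string) is hypothetical. It uses only the Python standard library and runs serially, where the described system would dispatch work to a parallel backend such as Dask. The sketch shows the pipeline description kept separate from its execution, the DAG captured together with a unique data link and timestamp, and the analysis re-run from raw data using only that captured record.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical registry of named analysis steps (illustrative only).
STEPS = {
    "subtract_background": lambda raw, background: [r - background for r in raw],
    "normalize": lambda values: [v / max(values) for v in values],
}

# Pipeline description: node -> (step name, input nodes, parameters).
# This is data only; nothing is computed until run() is called.
PIPELINE = {
    "corrected": ("subtract_background", ["raw"], {"background": 1.0}),
    "normalized": ("normalize", ["corrected"], {}),
}


def run(pipeline, raw, uid, timestamp):
    """Execute the DAG on one incoming datum; return (results, provenance)."""
    # Map each node to the set of nodes it depends on, then order them.
    graph = {node: set(spec[1]) for node, spec in pipeline.items()}
    order = list(TopologicalSorter(graph).static_order())

    results = {"raw": raw}
    for node in order:
        if node in results:  # "raw" is supplied, not computed
            continue
        step, inputs, params = pipeline[node]
        results[node] = STEPS[step](*(results[i] for i in inputs), **params)

    # Provenance: link to the raw data, its timestamp, and the DAG itself.
    provenance = {
        "uid": uid,
        "timestamp": timestamp,
        "pipeline": pipeline,
        "order": order,
    }
    return results, provenance


def replay(provenance, raw_store):
    """Re-run the analysis from raw data using only the provenance record."""
    raw = raw_store[provenance["uid"]]
    results, _ = run(
        provenance["pipeline"], raw, provenance["uid"], provenance["timestamp"]
    )
    return results


# Hypothetical raw-data store keyed by unique identifier.
raw_store = {"scan-0001": [2.0, 3.0, 5.0]}
results, prov = run(
    PIPELINE, raw_store["scan-0001"], "scan-0001", "2018-11-11T16:10:00"
)

# Replaying from the provenance record reproduces the analyzed data.
assert replay(prov, raw_store) == results
```

Because the provenance record holds the pipeline description itself, editing a parameter in that record (for example the assumed "background" value) and replaying is one way to explore how parameters change outcomes.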

Program November 11–16, 2018

Exhibits November 12–15, 2018

KAY BAILEY HUTCHISON CONVENTION CENTER DALLAS

The International Conference for High Performance Computing, Networking, Storage, and Analysis
