Reproducibility for Streaming Analysis
Authors: Christopher J. Wright (Columbia University)
Abstract: The natural and physical sciences increasingly need streaming data processing for live data analysis and autonomous experimentation. At the same time, data provenance and replicability are essential to assure the veracity of scientific results. Here we describe a software system that combines high-performance computing, streaming data processing, and automatic provenance capture to address this need. Data provenance and streaming data processing share a common data structure, the directed acyclic graph (DAG), which describes the order of the computational steps. Data processing requires the DAG to specify which computations to run in which order, and the execution can be recreated from the graph, both reproducing the analyzed data and capturing provenance. In our framework the description and ordering of the analysis steps (the pipeline) are separated from their execution (the streaming analysis), and the DAG created for the streaming data processing is captured during data analysis. Streaming data can have high throughput, so our system allows users to choose among multiple parallel-processing backends, including Dask. To guarantee reproducibility, unique links to the incoming data and their timestamps are captured alongside the DAG. Analyzed data, along with provenance metadata, are stored in a database from which the analysis can be re-run from the raw data, enabling verification of results, exploration of how parameter changes affect outcomes, and reuse of data-processing pipelines. This system runs in production at the National Synchrotron Light Source-II (NSLS-II) x-ray powder diffraction beamlines.
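The core idea in the abstract, separating the pipeline description (a DAG of analysis steps) from its execution, and recording the DAG plus input hashes and timestamps as provenance, can be sketched in a few lines. This is an illustrative toy, not the production system's API: the `Node`, `topological_order`, and `run` names and the single-parent chain are assumptions made for brevity.

```python
import hashlib
import time


class Node:
    """A named computation step in the pipeline DAG."""

    def __init__(self, name, func, upstream=None):
        self.name = name
        self.func = func
        self.upstream = upstream  # parent Node, or None for the source


def topological_order(sink):
    """Walk from a sink node back to the source, then reverse."""
    chain = []
    node = sink
    while node is not None:
        chain.append(node)
        node = node.upstream
    return list(reversed(chain))


def run(sink, raw_value):
    """Execute the DAG on one streamed datum, capturing provenance.

    The provenance record holds a unique link to the input (its hash)
    and a timestamp for every step, so the run can be replayed later.
    """
    provenance = {
        "input_hash": hashlib.sha256(repr(raw_value).encode()).hexdigest(),
        "steps": [],
    }
    value = raw_value
    for node in topological_order(sink):
        value = node.func(value)
        provenance["steps"].append({"node": node.name, "timestamp": time.time()})
    return value, provenance


# The pipeline description (the DAG) exists independently of execution:
source = Node("raw", lambda x: x)
dark = Node("subtract_dark", lambda x: x - 1.0, upstream=source)
norm = Node("normalize", lambda x: x / 10.0, upstream=dark)

result, record = run(norm, 21.0)
# result == 2.0; `record` holds the DAG order, input hash, and timestamps
```

Because `record` lists the steps in execution order together with a hash of the raw input, re-running the same DAG on the same datum reproduces the analyzed value, which is the reproducibility property the paper relies on.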
Presented at the Computational Reproducibility at Exascale 2018 (CRE2018) workshop.