Presentation


Workshop: A Practical Roadmap for Provenance Capture and Data Analysis in Spark-Based Scientific Workflows

Session: WORKS 2018: 13th Workshop on Workflows in Support of Large-Scale

Event Type: Workshop

Time: Sunday, November 11th, 11:55am - 12:20pm

Location: D173

Description: When high-performance computing applications meet data-intensive scalable systems, Apache Spark is an attractive option for managing scientific workflows. Spark is widely supported and provides efficient in-memory data management for large-scale applications. However, it still lacks support for data tracking and workflow provenance. Additionally, Spark’s memory management requires accessing all data movements between the workflow activities. As a result, running legacy programs on Spark is treated as a “black-box” activity, which prevents the capture and analysis of implicit data movements. Here, we present SAMbA, an Apache Spark extension for gathering prospective and retrospective provenance and domain data within distributed scientific workflows. Our approach envelops both the RDD structure and its data contents at runtime so that (i) the data consumed and produced inside the RDD enclosure are captured and registered by SAMbA in a structured way, and (ii) provenance data can be queried during and after the execution of scientific workflows. Following the W3C PROV representation, we model the roles of RDDs with respect to prospective and retrospective provenance data. Our solution provides mechanisms for capturing and storing provenance data without jeopardizing Spark’s performance. We evaluate the provenance retrieval capabilities of our proposal in a practical case study, in which data analytics are provided by several SAMbA parameterizations.
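
The idea of enveloping a dataset so that consumed and produced data are registered as a transformation runs can be illustrated with a small sketch. This is not SAMbA's API and does not use Spark: `TrackedCollection` and `ProvenanceLog` are hypothetical names, the wrapped data is a plain Python list standing in for an RDD, and only the relation names (`used`, `wasGeneratedBy`, `wasDerivedFrom`) are taken from W3C PROV.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Any, Callable, List, Tuple

_ids = count(1)  # simple id generator for data elements


@dataclass
class ProvenanceLog:
    """Hypothetical in-memory store of PROV-style relations (not SAMbA's schema)."""
    relations: List[Tuple[str, Any, Any]] = field(default_factory=list)

    def record(self, kind: str, subject: Any, obj: Any) -> None:
        self.relations.append((kind, subject, obj))

    def derived_from(self, element_id: int) -> List[int]:
        """Return the ids of the inputs an output element was derived from."""
        return [o for k, s, o in self.relations
                if k == "wasDerivedFrom" and s == element_id]


class TrackedCollection:
    """Toy stand-in for an RDD wrapper: a list whose elements carry ids."""

    def __init__(self, values, log: ProvenanceLog, _items=None):
        self.log = log
        # Each element is stored as (id, value) so lineage can refer to it.
        self.items = _items if _items is not None else [(next(_ids), v) for v in values]

    def map(self, activity: str, fn: Callable[[Any], Any]) -> "TrackedCollection":
        out = []
        for in_id, value in self.items:
            out_id = next(_ids)
            # Register what the activity consumed and what it produced.
            self.log.record("used", activity, in_id)
            self.log.record("wasGeneratedBy", out_id, activity)
            self.log.record("wasDerivedFrom", out_id, in_id)
            out.append((out_id, fn(value)))
        return TrackedCollection(None, self.log, _items=out)

    def collect(self) -> List[Any]:
        return [v for _, v in self.items]


# Usage: the log can be queried during or after the run.
log = ProvenanceLog()
raw = TrackedCollection([1, 2, 3], log)
squared = raw.map("square", lambda x: x * x)
first_output_id = squared.items[0][0]
print(squared.collect(), log.derived_from(first_output_id))
```

Because the log is filled while `map` executes, the same query works on a partially finished run, which loosely mirrors point (ii) of the description; a real extension would also have to handle distribution, persistence, and black-box legacy programs, none of which this sketch attempts.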

Program November 11–16, 2018

Exhibits November 12–15, 2018

KAY BAILEY HUTCHISON CONVENTION CENTER DALLAS

The International Conference for High Performance
Computing, Networking, Storage, and Analysis