Automated Parallel Data Processing Engine with Application to Large-Scale Feature Extraction

SC18 Proceedings

Machine Learning in HPC Environments

Authors: Kesheng Wu (Lawrence Berkeley National Laboratory)

Abstract: As new scientific instruments generate ever more data, advanced data analysis algorithms such as machine learning must be parallelized to harness the available computing power. The success of commercial Big Data systems has demonstrated that these algorithms can be parallelized automatically. However, these Big Data tools have trouble handling the complex analysis operations found in scientific applications. To overcome this difficulty, we have started to build an automated parallel data processing engine for science, known as SystemA. This paper provides an overview of the engine and presents a use case involving a complex feature extraction task on data from a large-scale seismic recording technology called distributed acoustic sensing (DAS). The key challenge with DAS is that it produces a vast amount of noisy data. The existing methods used by the DAS team to extract useful signals, such as traveling seismic waves, from this data are very time-consuming. Our parallel data processing engine reduces the job execution time from hundreds of hours to tens of seconds and achieves 95% parallelization efficiency. We are implementing more advanced techniques, including machine learning, using SystemA, and plan to work with more scientific applications.
