Automated Parallel Data Processing Engine with Application to Large-Scale Feature Extraction
Time: Sunday, November 11th, 4:30pm - 5pm
Description: As new scientific instruments generate ever more data, we need to parallelize advanced data analysis algorithms, such as machine learning, to harness the available computing power. The success of commercial Big Data systems has demonstrated that these algorithms can be parallelized automatically. However, these Big Data tools have trouble handling the complex analysis operations found in scientific applications. To overcome this difficulty, we have started to build an automated parallel data processing engine for science, known as SystemA. This paper provides an overview of the engine and presents a use case: a complex feature extraction task on data from a large-scale seismic recording technology called distributed acoustic sensing (DAS). The key challenge with DAS is that it produces a vast amount of noisy data. The existing methods the DAS team uses to extract useful signals, such as traveling seismic waves, from this data are very time-consuming. Our parallel data processing engine reduces the job execution time from hundreds of hours to tens of seconds and achieves 95% parallelization efficiency. We are implementing more advanced techniques, including machine learning, using SystemA, and plan to work with more scientific applications.
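The description reports a parallelization efficiency figure without stating how it is computed. A minimal sketch follows, assuming the common definition of efficiency as speedup divided by the number of workers; the paper may use a different definition. The function name and the timings are hypothetical and chosen only for illustration; they are not measurements from SystemA.

```python
def parallel_efficiency(serial_seconds: float, parallel_seconds: float, workers: int) -> float:
    """Return speedup divided by worker count: T_serial / (workers * T_parallel).

    Assumed definition (not confirmed by the source). A value of 1.0 means
    perfect linear scaling; lower values reflect parallel overhead.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    if serial_seconds <= 0 or parallel_seconds <= 0:
        raise ValueError("timings must be positive")
    speedup = serial_seconds / parallel_seconds
    return speedup / workers


# Hypothetical timings for illustration only (not taken from the paper):
# a job that takes 1000 s serially and 12.5 s on 100 workers.
example_efficiency = parallel_efficiency(1000.0, 12.5, 100)
print(f"efficiency = {example_efficiency:.2f}")
```

Under this definition, an efficiency close to 1.0 indicates that adding workers shortens the run almost proportionally.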