Introduction of Practical Approaches to Data Analytics for HPC with Spark
Parallel Programming Languages, Libraries, and Models
Time: Monday, November 12th, 8:30am - 12pm
Description: This tutorial provides a practical introduction to big data analytics, blending theory (e.g., clustering algorithms and techniques for dealing with noisy data) and practice (e.g., using Apache Spark, Jupyter Notebooks, and GitHub). Over the course of five modules, participants will become familiar with modern data science methods, gain comfort with the tools of the trade, explore real-world datasets, and leverage the power of HPC resources to extract insights from data. Upon completing the tutorial, participants will have: used Jupyter notebooks to create reproducible, explanatory data science workflows; learned a modern MapReduce implementation, Apache Spark; implemented parallel clustering methods in Spark; studied strategies for overcoming the common imperfections in real-world datasets; and applied their new skills to extract insights from a high-dimensional medical dataset.