HPC Impact Showcase: The BP Data Science Sandbox
Time: Thursday, November 15th, 1:30pm - 2:15pm
Description: Recent years have seen major advances in the state of the art in machine learning, particularly in fields such as natural language processing and 2D computer vision.
These advances have naturally spurred interest in applying similar techniques to new fields in medicine, science, and engineering. However, the problems in these fields differ from previous machine learning successes in the level of domain expertise required. While classifying an image as a cat, dog, horse, etc. is a task that anyone can understand, automatic identification of malignant tumors, subsurface faults, or financial fraud often requires far more background in the specific domain. Unfortunately, it is rare today for one person to have the skills of both a data scientist/statistician and a domain expert (e.g. an oncologist or petroleum engineer).
This problem can generally be solved in two ways: (1) through education (of your data scientists and/or domain experts), or (2) through co-location of these two groups of people such that they can work closely together.
This talk will introduce the BP Data Science Sandbox (DSS), an internal environment at BP that supports both of the above solutions. The sandbox is a platform made up of hardware, software, and people. On the hardware front, the sandbox includes everything from large-memory machines to GPU machines to compute clusters, enabling users to choose the platform that meets their resource requirements. On the software front, the sandbox is built entirely on free and open source software, including common tools such as Jupyter, JupyterHub, Spark, Dask, TensorFlow, and other packages in the Conda ecosystem. On the people front, the sandbox is staffed by a team of dedicated data scientists and infrastructure engineers who support its users and internal customers.