The Gen3 Approach to Portability and Repeatability for Cancer Genomics Projects
Time: Sunday, November 11th, 11:15am - 11:45am
Description: The Gen3 software stack is an open-source platform for managing, analyzing, and sharing petabyte-scale research data. In this note, we describe the approach that we have used with Gen3 to support portability and repeatability for cancer genomics projects. Data in a Gen3 data commons is divided into projects. Project data is of two types. The first type is large files, such as BAM files and image files, which are managed as data objects and stored in one or more private and public clouds. The second type is all of the other data associated with a project, including the clinical phenotype data and biospecimen data. We call this other data “core data” and have developed a data serialization format for it, which includes versioning and schema information. Data objects are available across multiple data commons, while core data can be exported from one data commons and imported into another using the serialization format. In this way, we support portability for data projects. We support repeatability by representing workflows using the Common Workflow Language (CWL) and managing the CWL files as data objects. With this approach, making a project portable and repeatable reduces to managing and versioning the data objects, core data, and CWL files associated with it.
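The export and import of core data described above can be illustrated with a minimal sketch. The abstract does not specify the actual Gen3 serialization format, so everything here is an assumption made for illustration: the JSON envelope, the field names (`project_id`, `schema_version`, `schema`, `records`, `checksum`), and the functions `export_core_data` and `import_core_data` are hypothetical and are not part of the Gen3 API.

```python
import hashlib
import json


def export_core_data(project_id, schema_version, schema, records):
    """Serialize core data together with its version and schema information.

    Hypothetical envelope layout; not the actual Gen3 format.
    """
    payload = {
        "project_id": project_id,
        "schema_version": schema_version,
        "schema": schema,
        "records": records,
    }
    # Sorting keys makes the serialized form deterministic, so the
    # checksum is stable across exports of the same content.
    body = json.dumps(payload, sort_keys=True)
    checksum = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return json.dumps({"checksum": checksum, "payload": payload}, sort_keys=True)


def import_core_data(serialized):
    """Load an exported project, verifying its integrity and required fields."""
    envelope = json.loads(serialized)
    payload = envelope["payload"]

    # Recompute the checksum over the payload to detect corruption.
    body = json.dumps(payload, sort_keys=True)
    if hashlib.sha256(body.encode("utf-8")).hexdigest() != envelope["checksum"]:
        raise ValueError("checksum mismatch: export is corrupted or was modified")

    # Check each record against the (assumed) list of required fields
    # carried in the embedded schema.
    required = payload["schema"].get("required", [])
    for index, record in enumerate(payload["records"]):
        missing = [field for field in required if field not in record]
        if missing:
            raise ValueError(f"record {index} is missing fields: {missing}")
    return payload


if __name__ == "__main__":
    exported = export_core_data(
        project_id="demo-project",
        schema_version="1.0.0",
        schema={"required": ["case_id"]},
        records=[{"case_id": "case-001", "primary_site": "lung"}],
    )
    imported = import_core_data(exported)
    print(imported["project_id"], imported["schema_version"])
```

Carrying the schema and its version inside the export lets the receiving data commons validate records without access to the originating system, which is the property that makes the core data portable in this sketch.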