Year 5

DCM Summary of Scientific Progress

The Data Coordination and Management Core (DCM) has continued to bring in a large amount of data over the past six months. We’ve brought in two new data sets (Quake and Yeo) and extended data for several more data sets (Chi, Crooks, Fan, Frazer, Sanford, Yeo, Jones, Loring, Bruneau, Belmonte). We’ve also added analysis files and coordinated the passage of data through the Uniform Processing Pipeline as data sets complete data generation. We are in the process of bringing in additional data for the Yeo, Sanford, Kriegstein, Loring and Corn labs.

DCG Summary of Overall Progress

The Data Curation Group (DCG) continues its work during the no-cost-extension period as described in the previous report. Agreement on uniform pipelines (starting from quantified CellRanger expression and proceeding through batch correction, normalization, clustering and trajectory inference) was reached soon after the April face-to-face meeting at Stanford. The group has since incorporated alternatives into these pipelines to suit the needs of the HoC collaborators, for example replacing t-SNE with UMAP for dimensionality reduction, or using URD in place of Monocle2 for trajectory inference.

In addition to solidifying these processing pipelines, we will finalize the SCHub interfaces for searching the marker genes of clusters and for predicting cell types with SamplePsychic over the next reporting period. Anticipating that a number of new datasets will become available at the end of this reporting period, we are now automating the identification and comparison of dataset clusters, along with the RDF descriptions of the marker genes that distinguish these clusters. Work on marker gene identification (NS-Forest) and on feature transformations of scRNA-Seq datasets has progressed. Progenitor cells have been further annotated using StemID. Both the JCVI and UCSC groups are establishing methods for detecting cell types from gene expression data and for making 1) the predictors available in a portal (SamplePsychic) and 2) the marker gene lists and cell ontology terms available to index SCHub datasets. JCVI continues to improve its NS-Forest algorithm and is now applying the updated version to BoC and HoC datasets; this work is still in progress. Controlled representations of the marker genes and cell types will be developed using RDF so that they can be retrieved by user queries.
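
As an illustration of what an RDF representation of cluster marker genes might look like, the following minimal sketch emits N-Triples lines for one cluster and performs a naive lookup by gene. The namespace, the predicate names (cellType, hasMarkerGene), the function names, and the dataset, cluster and gene identifiers are all hypothetical placeholders; this is not the vocabulary or schema adopted by the DCG, and a production system would use a proper RDF store and query language rather than string matching.

```python
# Hypothetical namespace and vocabulary, chosen only for illustration.
BASE = "http://example.org/schub/"


def marker_triples(dataset, cluster, cell_type, marker_genes):
    """Return N-Triples lines describing one cluster and its marker genes."""
    cluster_iri = f"<{BASE}dataset/{dataset}/cluster/{cluster}>"
    # One triple assigning a cell type label to the cluster.
    triples = [f'{cluster_iri} <{BASE}vocab/cellType> "{cell_type}" .']
    # One triple per marker gene linked to the cluster.
    for gene in marker_genes:
        triples.append(
            f"{cluster_iri} <{BASE}vocab/hasMarkerGene> <{BASE}gene/{gene}> ."
        )
    return triples


def clusters_with_marker(triples, gene):
    """Naive lookup: clusters that list the given gene as a marker."""
    predicate = f"<{BASE}vocab/hasMarkerGene>"
    target = f"<{BASE}gene/{gene}>"
    hits = []
    for line in triples:
        # Drop the trailing " ." and split into subject, predicate, object.
        subject, pred, obj = line.rstrip(" .").split(" ", 2)
        if pred == predicate and obj == target:
            hits.append(subject)
    return hits


if __name__ == "__main__":
    # Placeholder identifiers, not real project data.
    triples = marker_triples("ds1", "3", "example cell type", ["GENE_A", "GENE_B"])
    print("\n".join(triples))
    print(clusters_with_marker(triples, "GENE_A"))
```

Representing each cluster as a subject with one triple per marker gene keeps the description flat, so a user query for a gene reduces to matching a single predicate and object.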