In Year 2, the Data Center Management Group (DCM) wrangled in production or pilot data from CIP1, CIP2, and all but one of the first-round CRP labs, and initiated contact with the labs funded in the second CRP round. We created an infrastructure for higher-level, whole-dataset views in addition to our existing file-by-file views. We implemented a simple, informative visualization of neighbor-joining-type RNA-seq clustering that can show multiple types of metadata in the same display. We developed interactive figures for displaying the results of principal components analysis and t-SNE clustering that are particularly useful for single-cell data, and tested these on datasets of up to 6000 cells. We extended the SCHub capabilities to handle mouse as well as human data, and extended the website to allow authorized users to download both data and metadata without the need for a Unix account.
We elected to defer one milestone, a public-facing portal of CIRM production data, until Y3Q1Q2 in order to focus on the larger-than-expected number of new CRP awardees. The deferral also allowed us to wrangle in four additional pilot data sets from existing labs. This delay has minimal impact on the overall project.
The secondary work of the DCM is to provide visualizations of the data. There is a growing demand for the DCM to produce new data visualizations, as well as to run both the standard analyses and a new analysis developed by CIP4 on the data. Given these increased requests for visualization and analysis, which often go hand in hand, as well as the additional wrangling work from the large number of new CRP-funded labs, we would like to discuss adding an FTE to the DCM budget.
In the next six months we hope to produce web displays showing the results of CIP4 analysis. We plan to help the Data Curation Group (DCG) roll out their metadata standards, help train the labs in these protocols, and perform QA to ensure adherence to the standards. We will continue to import production data as it becomes available. We will work with the labs that were awarded second-round CRP funding, using pilot data if production data sets are not yet available. We plan to release a set of new features and usability enhancements on the website, and to develop a public-facing portal with a wider range of published data than is currently available. We will assist CIP4 in cross-lab and cross-dataset analysis.
The Data Curation Group (DCG) has worked to establish procedures to richly annotate the data sets collected through the genomics center. Because the CIRM collection will include results from many different labs and projects, it is critical to establish methods to describe the experiments and results in a coherent fashion. The definition of uniform terminology will aid downstream computational analysis in shedding as much light as possible on the known and novel cell types identified through these experiments. The effort is particularly challenging given the constantly evolving nature of both the genomics technologies employed in these investigations and our understanding of the fundamental cellular and molecular entities that exist or contribute to the various assays.
In light of the dynamic nature of these investigations, the DCG has established a communication mechanism with the DCM to provide standard terminology for describing experiments wherever possible. At the same time, the DCG will be able to respond by expanding and adapting the ontology specifications as needed by the consortium.
During the last six months, the DCG has developed a minimum information standard (MISCE) and worked with several of the collaborating labs on a pilot project to complete a full cycle in which metadata was obtained from the labs, mapped to the standard, and returned to the labs for incorporation into the submission of subsequent experiments. Because MISCE will be used throughout, future collaborating labs will also be able to take advantage of the controlled terms. To support labs in adopting these standards, the DCG is creating a portal, based on JCVI’s previously developed O-META system, that allows investigators to annotate their own experimental samples using the standards developed by the DCG and DCM procedures. In the next year, the DCG will work to further develop and test this system, and to find ways to assist labs in annotating their own experiments.
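To illustrate the mapping step of the pilot cycle described above, the following minimal Python sketch shows one way lab-supplied metadata values could be normalized against a controlled vocabulary, with unrecognized values flagged for curator review. The field names, synonyms, and standard terms are hypothetical placeholders chosen for illustration; they are not taken from the DCG standard or the O-META system.

```python
# Hypothetical controlled vocabulary: field -> {lab synonym (lowercase) -> standard term}.
# These fields and terms are illustrative only, not the actual DCG standard.
CONTROLLED_TERMS = {
    "species": {
        "human": "Homo sapiens",
        "h. sapiens": "Homo sapiens",
        "mouse": "Mus musculus",
    },
    "assay": {
        "rna-seq": "RNA-seq",
        "rnaseq": "RNA-seq",
        "scrna-seq": "single-cell RNA-seq",
    },
}


def map_to_standard(record):
    """Return (mapped_record, unmapped) for one lab-supplied metadata record."""
    mapped, unmapped = {}, {}
    for field, value in record.items():
        synonyms = CONTROLLED_TERMS.get(field)
        key = value.strip().lower()
        if synonyms is not None and key in synonyms:
            # Known synonym: replace it with the controlled term.
            mapped[field] = synonyms[key]
        else:
            # Unknown field or value: keep the lab's wording and flag it.
            mapped[field] = value
            unmapped[field] = value
    return mapped, unmapped


lab_record = {"species": "Mouse", "assay": "RNAseq", "tissue": "cortex"}
mapped, unmapped = map_to_standard(lab_record)
print(mapped)    # record to return to the lab
print(unmapped)  # values a curator would need to review
```

In this sketch, values that cannot be mapped are preserved rather than dropped, so that a curator could decide whether the vocabulary needs to be extended.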
To provide a backdrop of cell state information against which samples collected within the consortium can be compared, the DCG is collecting a set of publicly available datasets from GEO and the SRA. These datasets include tens of thousands of samples for which gene expression data has been collected on microarrays and through RNA sequencing. The DCG has processed over 80,000 RNA-Seq samples at this point and is in the process of using semi-automated methods to assign cell states to each. Once completed, these samples can be used by the center to develop machine-learning predictors to connect new experiments with those established in the literature.
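As a rough illustration of how annotated reference samples might be used to relate a new experiment to known cell states, the sketch below implements a nearest-centroid predictor based on Pearson correlation. This technique is an assumption made for the example; the report does not specify which machine-learning method will be used. The cell-state labels and expression values are invented toy data, and the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def centroids(labeled_profiles):
    """Average the expression profiles of each cell-state label."""
    sums, counts = {}, {}
    for label, profile in labeled_profiles:
        if label not in sums:
            sums[label] = [0.0] * len(profile)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], profile)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict(profile, cents):
    """Return the label whose centroid correlates best with the profile."""
    return max(cents, key=lambda label: pearson(profile, cents[label]))


# Toy reference set: (assigned cell state, expression over four genes).
reference = [
    ("neuron", [9.0, 1.0, 0.5, 0.2]),
    ("neuron", [8.0, 2.0, 0.4, 0.1]),
    ("hepatocyte", [0.5, 0.3, 7.0, 9.0]),
    ("hepatocyte", [0.2, 0.6, 8.0, 8.5]),
]
cents = centroids(reference)
print(predict([7.5, 1.5, 1.0, 0.3], cents))
```

A correlation-based comparison is used here because it is insensitive to overall scaling differences between samples; a production predictor would also need normalization and batch-effect handling, which this sketch omits.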