Year 6 NCE

The Data Coordination and Management Core (DCM) has continued to ingest a large amount of data over the past six months. We brought in two new datasets (quakeFetalPancreas and pyleSkeletalMuscle) and extended the primary data for existing datasets, including belmonteMouseDnmt3a and sanfordRnaRegulation1. We also added analysis files for eleven datasets and coordinated the processing of data through the Uniform Processing Pipeline as datasets concluded data generation. We are finalizing the ingestion of all remaining analysis files as we receive them from the Stanford CESCG Core.

Over the past six months, our team worked closely with CESCG labs to update summary pages and identify data that are ready to be shared on the public SCHub. We pushed five datasets to the public site (kriegsteinBrainOrganoids2, fanIcf1, jonesYap, pyleSkeletalMuscle, and belmonteMouseDnmt3a). Additionally, we updated the Cell Browser, adding new user interface and command-line software features. The Cell Browser has also grown by twelve datasets, three of which were added to aid researchers working on COVID-19 therapeutics and vaccines.

Data ingestion and annotation into SCHub are now complete. Pipelines for analyzing scRNA-seq and scATAC-seq data have also been completed. Most recently, methods for linking scRNA-seq and ATAC-seq data were installed during this past NCE period (see CIP4 report). The pipelines include several dimensionality reduction techniques, among them t-SNE, PCA, and UMAP. Clustering approaches, including DBSCAN, Louvain, and graph-based spectral methods, have been installed, as have trajectory methods for identifying trends, including Scimitar, Monocle2, Slingshot, and URD. The DCG has created various pipelines to ingest and analyze SCHub datasets. We developed analytical approaches for trajectory analysis, machine-learning methods to annotate datasets (such as JCVI’s NS-Forest algorithm), and new data sharing platforms that let investigators compare the results of their experiments. A legacy of datasets and methodologies is now publicly available to the research community and will have lasting value in follow-on projects and as open-source, community-contributed efforts.
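As a rough illustration of one of the clustering approaches named above, the following is a minimal, standard-library sketch of DBSCAN applied to a toy two-dimensional embedding (the kind of coordinates a PCA, t-SNE, or UMAP step might produce). It is not the pipeline's own code; the function names, the `eps` and `min_pts` values, and the sample points are all assumptions chosen for illustration.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i], including i."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Assign each point a cluster id (0, 1, ...) or NOISE (-1)."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already visited
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point
    return labels


# Hypothetical embedding: two tight groups of cells plus one outlier.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (10.0, 0.0),
]
labels = dbscan(points, eps=0.5, min_pts=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

Unlike Louvain or spectral methods, DBSCAN needs no preset number of clusters and marks isolated points as noise, which is one reason a pipeline might offer it alongside graph-based approaches. This brute-force neighbor search is quadratic in the number of points, so a real dataset would call for a spatial index or an established library implementation.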

The DCG worked most closely with the Heart-of-Cells (HoC) investigators to test methodology. The collaboration involved weekly, and later bi-weekly, telephone calls to advance the analysis of two main HoC manuscripts. During the NCE, we finalized JCVI’s NS-Forest to find marker gene sets, created an ontology-aware annotation tool called TreeMAP, and built a signature collection called scBeacon, which collects, shares, and displays cell types found from RNA-seq data through their cluster signatures using a peer-to-peer system. If widely adopted, this system will allow investigators to share their experimental findings in a decentralized yet fully verifiable setting. As a side point, but with relevance to the current pandemic, the same scBeacon system is also being deployed to share COVID-19 diagnostic test results, starting with the UCSC campus and soon the county, allowing patients to be in complete control of access to, and the privacy of, their health information.