Over the past few years, the amount of data generated with single cell RNA sequencing has grown exponentially, with the number of cells surveyed in a single academic study going from dozens to millions. Hundreds of studies are published each year, often containing dozens of experiments and tens or hundreds of thousands of cells each. This growth is driven by cost: for a few thousand dollars, it is now possible to profile tens of thousands of single cells in a standard experiment.
Number of cells reported in academic studies using single cell RNA sequencing each year. Data from Svensson 2021 (DOI: 10.1093/database/baaa073).
However, one of the challenges of using public data is its quality and consistency. When integrating public data from many different resources, laboratories, and publications, it becomes impossible to discern biological variability from technical artifacts, because the two are confounded. ImYoo is creating a large standardized immune reference dataset, with samples from hundreds of people processed consistently through our end-to-end pipelines.
Our standardized reference dataset currently contains over 300 samples, including self-collected capillary blood samples, venous blood samples, and buffy coat samples. In total there are more than 2.4M single cells and nuclei, which has enabled us to build a high-resolution cell taxonomy identifying over 40 cell subtypes in our dataset.
A t-SNE visualization of the cells in the ImYoo database, colored by cell type.
One of the challenges of working with data at this complexity and scale, however, is the number of processing steps needed to get from raw sequencing data to clean visualizations of which genes are expressed in which cells. Each step has many parameters, and many of the steps take a long time even on high-performance computers. In the next section we discuss some of these challenges.
Machine Learning Workflows
Despite the complexity of data processing, it is still the easier part of dealing with scRNA-seq data. The harder parts are grouping cells into types, analyzing differences between groups of cells or people, and deriving mechanistic insights worth following up on. An interesting feature of single cell RNA sequencing data is that its format is quite standardized: every sample is a large sparse matrix of cells and their gene counts. Conveniently, this format is well suited to machine learning. By training machine learning models on our large database of cells, we identify groups of cells that share common features: cells of the same type (such as B cells), cells in similar immune states (such as antibody-producing plasma B cells), or even cells in specific disease states (such as cancerous cells). This process keeps on giving: as we accrue more cells in our database, our models pick up increasingly fine-grained cellular signatures. In short, more data keeps improving our models.
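To make the "sparse matrix of cells and their gene counts" concrete, here is a minimal sketch with synthetic data. The simulated gene programs, the normalization constant, and the tiny k-means grouping are all illustrative stand-ins, not our actual pipeline or parameters.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# Synthetic example: 200 "cells" x 50 "genes", with two simulated
# cell populations that express different gene programs.
n_cells, n_genes = 200, 50
labels_true = np.repeat([0, 1], n_cells // 2)
rates = np.ones((2, n_genes))
rates[0, :10] = 8.0    # population 0 highly expresses genes 0-9
rates[1, 10:20] = 8.0  # population 1 highly expresses genes 10-19
counts = rng.poisson(rates[labels_true])

# scRNA-seq count matrices are mostly zeros, so they are stored sparse.
X = sparse.csr_matrix(counts)

# Library-size normalization + log1p, a common preprocessing step.
libsize = np.asarray(X.sum(axis=1)).ravel()
Xn = np.log1p(X.multiply(1e4 / libsize[:, None]).toarray())

# Tiny k-means to group cells by expression profile (a stand-in for
# the much richer models described in the text).
def kmeans(X, k=2, iters=20):
    # Toy init: centers taken from evenly spaced cells.
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(0)
    return assign

assign = kmeans(Xn)
```

On data with such strong structure, the recovered groups line up with the simulated populations; real cell typing uses far richer models, but the input format is exactly this cells-by-genes sparse matrix.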
For training models capturing the state of individual cells or cell populations, we rely heavily on scvi-tools, an awesome suite of tools for single-cell machine learning in Python.
However, there is another approach that is very different from modeling the state of individual cells: modeling the state of entire immune systems. Most machine learning architectures cannot operate on matrices of varying sizes; they expect every sample to have the same number of data points. When we collect cells, though, we obtain very different numbers of cells from different people, anywhere from hundreds to tens of thousands. We don't want to throw away all that data; we want to use as much as we can! So we have developed our own machine learning model architectures that learn similarities between people using all of their captured immune cells, e.g. people who might share a common disease state, or people with similar predispositions to a pathogen. As with the cell-based models, the more people's samples we have in our database, the better we can group people together. Again, more people's data helps everyone else.
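One common way to handle a variable number of cells per person is a permutation-invariant, Deep Sets-style pooling: encode each cell identically, then pool across the cell axis into a fixed-size vector. The sketch below illustrates the idea only; the encoder `phi`, the embedding size, and mean pooling are illustrative assumptions, not ImYoo's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical embedding size and per-person cell counts, which
# vary widely between people.
d_embed = 16
n_genes = 50
cells_per_person = [300, 5000, 1200]

def phi(cells, W):
    # Per-cell encoder: one linear layer + ReLU, applied identically
    # to every cell (a stand-in for a deeper network).
    return np.maximum(cells @ W, 0.0)

def person_embedding(cells, W):
    # Permutation-invariant pooling: averaging over the cell axis
    # collapses any number of cells into one fixed-size vector.
    return phi(cells, W).mean(axis=0)

W = rng.normal(scale=0.1, size=(n_genes, d_embed))

embeddings = np.stack([
    person_embedding(rng.poisson(1.0, size=(n, n_genes)).astype(float), W)
    for n in cells_per_person
])
# Every person gets the same-size representation regardless of how
# many of their cells were captured.
```

Because the pooling is order- and count-independent, all captured cells contribute to a person's representation, and downstream models can compare people directly.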
An example of the neural network architecture that can be used to classify the immune state of people, as opposed to individual cells.
We have doubled down on machine learning methods for finding similarities between cells and between people, because we believe that disease does not fall along the neat, binary lines so often drawn by diagnostics. The human immune system is incredibly complex, and people with very different traditional diagnostic criteria might have very similar mechanisms underlying their disease. By letting models learn these mechanistic similarities, we want to find immune systems that might be suffering from the same failure modes, or that might benefit from the same intervention.