Database & ML workflows

ImYoo's database

ImYoo is creating a large, standardized immune reference dataset with samples from hundreds of people. This dataset includes self-collected capillary blood samples and venous blood samples. Individuals across multiple conditions are listed below. Additionally, there are 36 capillary samples from a longitudinal study in which 6 people were sampled weekly over 6 weeks.

cell_types.png

A t-SNE visualization of the cells in the ImYoo database, colored by cell type.

Data processing 

Over the past few years, the amount of data generated with single-cell RNA sequencing has grown exponentially, with the number of cells surveyed in a single academic study going from dozens to millions. This is because, for a few thousand dollars, it is now possible to profile tens of thousands of single cells in a standard experiment. Each year hundreds of studies are published, often containing dozens of experiments with tens or hundreds of thousands of cells each.

cell_over_time.png

One of the challenges of having data at this complexity and scale, however, is that it takes many processing steps to get from the raw sequencing data to clean visualizations of which genes are being expressed in which cells. Each of these steps has many parameters, and many of the steps take a long time, even on high-performance computers.

Machine Learning Workflows

Despite the complexity of data processing, it is still the easier part of dealing with scRNA-seq data. The harder part is grouping cells into types, analyzing differences between groups of cells or people, and deriving mechanistic insights that merit follow-up. An interesting feature of single-cell RNA sequencing data is that the format is quite standardized: every sample is a big sparse matrix of cells and their gene counts. Conveniently, this format is suitable for machine learning by default. By training machine learning models on our large database of cells, we identify groups of cells that share common features: cells of the same type (such as B cells), cells in similar immune states (such as antibody-producing plasma B cells), or even cells in specific disease states (such as cancerous cells). This process keeps on giving: as we accrue more cells in our database, our models pick up on increasingly fine-grained cellular signatures. In short, having more data keeps improving our models.
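To make the "big sparse matrix of cells and their gene counts" concrete, here is a minimal sketch of a compressed sparse row (CSR) layout in plain Python. The class name `SparseCounts`, the helper `from_dense`, and the toy numbers are all hypothetical and made up for illustration; real pipelines typically use an established sparse-matrix library rather than hand-rolled code like this.

```python
from dataclasses import dataclass


@dataclass
class SparseCounts:
    """Cells x genes count matrix in CSR form (illustrative, hypothetical class)."""
    n_genes: int
    data: list     # nonzero counts, stored cell by cell
    indices: list  # gene (column) index of each nonzero count
    indptr: list   # data[indptr[i]:indptr[i + 1]] holds the counts of cell i

    @property
    def n_cells(self):
        return len(self.indptr) - 1

    def row(self, cell):
        """Return {gene_index: count} for one cell, skipping zeros."""
        start, end = self.indptr[cell], self.indptr[cell + 1]
        return dict(zip(self.indices[start:end], self.data[start:end]))

    def total_counts(self, cell):
        """Sum of all gene counts observed in one cell."""
        start, end = self.indptr[cell], self.indptr[cell + 1]
        return sum(self.data[start:end])


def from_dense(rows):
    """Build a SparseCounts from a list of equal-length dense rows."""
    n_genes = len(rows[0]) if rows else 0
    data, indices, indptr = [], [], [0]
    for dense_row in rows:
        if len(dense_row) != n_genes:
            raise ValueError("every cell must have the same number of genes")
        for gene, count in enumerate(dense_row):
            if count != 0:
                data.append(count)
                indices.append(gene)
        indptr.append(len(data))  # marks where this cell's entries end
    return SparseCounts(n_genes, data, indices, indptr)


# Toy data (made up): 3 cells x 5 genes, mostly zeros.
toy = from_dense([
    [0, 3, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [2, 0, 0, 7, 0],
])
```

Only the four nonzero counts are stored, which is why this layout scales to samples where most genes are not detected in most cells.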

For training models capturing the state of individual cells or cell populations, we rely heavily on scvi-tools, an awesome suite of tools for single-cell machine learning in Python.

However, there is another type of approach that is very different from modeling the state of individual cells: modeling the state of entire immune systems. Most machine learning architectures can't operate on matrices of varying sizes; they expect every sample to have a consistent number of data points. When we collect cells, however, we obtain very different numbers of cells from different people: anywhere from hundreds to tens of thousands. We don't want to throw away all that data; we want to use as much as we can! So, we have developed our own machine learning model architectures that learn similarities between people using all of the immune cells we've captured from them; for example, people who might share a common disease state, or people who have similar predispositions to a pathogen. As with the cell-based models, the more people's samples we have in our database, the better we can group people together. Again, having more people's data helps everyone else.
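As a rough illustration of the varying-size problem described above, one generic way to compare people with different numbers of cells is to pool each person's per-cell feature vectors into a single fixed-length vector and then compare those. The sketch below uses plain mean pooling and cosine similarity; it is not ImYoo's architecture, and the function names and toy feature values are hypothetical.

```python
import math


def pool_cells(cells):
    """Average per-cell feature vectors into one fixed-length vector.

    The result does not depend on how many cells there are or on their order.
    """
    if not cells:
        raise ValueError("need at least one cell")
    n_features = len(cells[0])
    sums = [0.0] * n_features
    for cell in cells:
        if len(cell) != n_features:
            raise ValueError("all cells must have the same number of features")
        for j, value in enumerate(cell):
            sums[j] += value
    return [s / len(cells) for s in sums]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy data (made up): two people with different numbers of cells,
# each cell described by two features.
person_a = [[1.0, 0.0], [3.0, 2.0]]   # 2 cells
person_b = [[2.0, 1.0]] * 5           # 5 cells

emb_a = pool_cells(person_a)
emb_b = pool_cells(person_b)
similarity = cosine_similarity(emb_a, emb_b)
```

Mean pooling discards a lot of information about the distribution of cells within a person, which is one reason purpose-built architectures are attractive; the point here is only that a per-person vector of fixed length makes people directly comparable.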

ml diagram

We have doubled down on machine learning methods for finding similarities between cells and between people, because we believe that disease does not fall along the neat, binary lines so often assigned by diagnostics. The human immune system is incredibly complex, and people who differ under traditional diagnostic criteria might have very similar mechanisms underlying their disease. By letting models learn these mechanistic similarities, we want to find immune systems that might be suffering from the same failure modes, or that might benefit from the same intervention.