Introduction: The Cellular Resolution Frontier

Biology has always been a story of scale. For decades, we studied organisms, then tissues, then cell populations — averaging signals across thousands or millions of cells. But tissues are not homogeneous. A tumor contains cancer cells, immune cells, fibroblasts, and endothelial cells, each with distinct molecular profiles. The brain contains hundreds of neuronal subtypes, each with unique functions. Even “identical” cells in culture exhibit stochastic variation in gene expression that can determine cell fate.

The single-cell revolution changed this. Starting with single-cell RNA sequencing (scRNA-seq) around 2009, researchers gained the ability to profile gene expression in individual cells. The results were transformative: new cell types discovered, developmental trajectories reconstructed, disease mechanisms revealed at unprecedented resolution. But single-cell transcriptomics captures only one layer of biology — the RNA. Cells are governed by the interplay of chromatin accessibility, transcription, translation, protein modification, and metabolism. Understanding cellular function requires measuring multiple omics layers simultaneously in the same cell.

This is the promise of single-cell multi-omics: technologies that capture RNA plus protein (CITE-seq), RNA plus chromatin (SHARE-seq), or even RNA plus protein plus chromatin in individual cells. And where single-cell data meets AI foundation models like scGPT and Geneformer, we enter a new regime of biological understanding — one where we can predict cellular responses to perturbations, reconstruct developmental trajectories with molecular precision, and build comprehensive atlases of human cell types.

This post examines the state of single-cell multi-omics in 2026: the technologies enabling simultaneous measurement, the AI models making sense of the data, and the applications transforming biology and medicine.


Technologies: Measuring Multiple Layers in Single Cells

CITE-seq: RNA Plus Protein

CITE-seq (Cellular Indexing of Transcriptomes and Epitopes by Sequencing), introduced in 2017 and refined through 2024-2025, enables simultaneous measurement of transcriptomes and surface proteins in single cells. The method uses antibody-derived tags (ADTs) conjugated to oligonucleotides that are captured alongside mRNA during library preparation.

The advantage is clear: surface protein markers remain the gold standard for cell type identification in immunology and cancer biology. CD3, CD4, CD8 define T cell subsets; CD19 and CD20 mark B cells. By measuring both RNA and protein, CITE-seq provides orthogonal validation of cell identity and reveals post-transcriptional regulation — cases where mRNA and protein levels diverge.
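One way to look for the mRNA/protein divergence described above is to correlate each marker's RNA counts with its ADT counts across cells. The following is a minimal sketch with made-up toy counts for six cells; the helper names (`ranks`, `pearson`, `spearman`) are illustrative and not part of any CITE-seq toolkit.

```python
from math import sqrt

def ranks(values):
    """Average 1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy paired measurements for six cells (made-up numbers).
rna_cd4 = [0, 1, 3, 5, 8, 9]
adt_cd4 = [2, 10, 40, 90, 150, 200]    # protein tracks RNA
rna_cd8 = [4, 4, 5, 5, 4, 5]
adt_cd8 = [310, 15, 20, 300, 25, 290]  # protein largely decoupled from RNA

print("CD4 concordance:", round(spearman(rna_cd4, adt_cd4), 2))
print("CD8 concordance:", round(spearman(rna_cd8, adt_cd8), 2))
```

A marker with high concordance supports the RNA-based annotation; a marker with low concordance is a candidate for post-transcriptional regulation, or simply for dropout in the RNA measurement, which this sketch cannot distinguish.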

Recent advances have extended CITE-seq beyond surface proteins. Intracellular protein measurement (using permeabilization) and phosphorylation state detection now enable signaling pathway profiling at single-cell resolution. On the 10x Genomics Chromium platform, CITE-seq-style antibody panels are captured through Feature Barcode technology (the Multiome kit, by contrast, pairs ATAC with gene expression), and panels of 200 or more protein markers alongside transcriptomes have made CITE-seq a workhorse for immunology and oncology.

SHARE-seq: Chromatin Plus RNA

While CITE-seq adds a panel of proteins, SHARE-seq (Simultaneous High-throughput ATAC and RNA Expression with Sequencing) captures the regulatory layer. Developed in the Buenrostro lab (Ma et al., 2020), SHARE-seq measures chromatin accessibility (via ATAC-seq) and gene expression in the same cell.

This combination is powerful for understanding gene regulation. Open chromatin regions indicate active regulatory elements — enhancers and promoters. By linking chromatin accessibility to gene expression in the same cell, researchers can infer which regulatory elements control which genes. This is essential for understanding cell type-specific gene regulation and the impact of non-coding variants (the vast majority of disease-associated SNPs lie in non-coding regulatory regions).
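The peak-to-gene linking idea can be sketched as a correlation across cells between each peak's accessibility and the gene's expression. This is a simplified illustration, not the procedure of any specific tool: the peak names, the toy values, and the `min_r` threshold are all assumptions, and real pipelines also restrict candidate peaks to a genomic window around the gene and test against a background.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def link_peaks_to_gene(peak_access, gene_expr, min_r=0.5):
    """Return {peak: r} for peaks whose accessibility correlates with expression."""
    links = {}
    for name, access in peak_access.items():
        r = pearson(access, gene_expr)
        if r >= min_r:
            links[name] = r
    return links

# Toy data: one gene and two nearby peaks measured in the same six cells.
gene_expr = [0.0, 0.2, 1.5, 2.0, 0.1, 1.8]
peaks = {
    "peak_enhancer": [0, 0, 1, 1, 0, 1],   # open exactly where the gene is on
    "peak_unrelated": [1, 0, 1, 0, 0, 1],  # open regardless of expression
}

links = link_peaks_to_gene(peaks, gene_expr)
print(sorted(links))
```

Because both measurements come from the same cell, no cross-dataset matching step is needed before computing the correlation, which is the practical benefit of joint profiling.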

The 10x Genomics Single Cell Multiome ATAC + Gene Expression platform, a separate commercial assay that measures the same two modalities and was widely adopted by 2025, has made joint chromatin and RNA profiling accessible to non-specialist labs. Applications include mapping cell type-specific enhancers in development, identifying regulatory drivers of cancer subtypes, and understanding how genetic variants affect gene regulation in specific cell types.

Perturb-seq: CRISPR Plus RNA

Perturb-seq (closely related to CRISP-seq and CROP-seq) combines CRISPR-based genetic perturbations with single-cell RNA sequencing. Cells are transduced with a library of guide RNAs (gRNAs) targeting specific genes, and the resulting transcriptional changes are measured by scRNA-seq. The gRNA identity is captured alongside the transcriptome, linking perturbation to phenotype.
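Linking perturbation to phenotype comes down to grouping cells by the guide they carry and comparing each group with control cells. Here is a minimal sketch, assuming each cell has already been assigned a single guide target and that a "non-targeting" control group exists; the function name, the gene names, and the expression values are illustrative only.

```python
from collections import defaultdict

def perturbation_effects(cells, control="non-targeting"):
    """Mean expression shift per gene for each guide target, relative to control.

    cells: list of (guide_target, {gene: expression}) pairs.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for target, expr in cells:
        counts[target] += 1
        for gene, value in expr.items():
            sums[target][gene] += value
    means = {
        target: {gene: total / counts[target] for gene, total in genes.items()}
        for target, genes in sums.items()
    }
    baseline = means[control]
    return {
        target: {gene: profile.get(gene, 0.0) - baseline[gene] for gene in baseline}
        for target, profile in means.items()
        if target != control
    }

# Toy data: two control cells and two cells carrying a guide against GATA1.
cells = [
    ("non-targeting", {"GATA1": 4.0, "HBG1": 6.0}),
    ("non-targeting", {"GATA1": 6.0, "HBG1": 8.0}),
    ("GATA1", {"GATA1": 1.0, "HBG1": 3.0}),
    ("GATA1", {"GATA1": 1.0, "HBG1": 2.0}),
]

effects = perturbation_effects(cells)
print(effects["GATA1"])
```

Real analyses add guide-assignment quality control, normalization, and statistical testing; the grouping-and-differencing step is the core that the rest builds on.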

The scale is remarkable: a single Perturb-seq experiment can profile thousands of genetic perturbations across hundreds of thousands to millions of cells. Replogle et al. (2022) used CRISPR interference to knock down thousands of genes, at genome scale in K562 cells, profiling more than 2.5 million cells in total and generating a comprehensive map of gene function. By 2025, Perturb-seq had been applied to primary human cells, organoids, and in vivo models.

The AI opportunity is clear: can we predict the transcriptional response to arbitrary perturbations? Models trained on Perturb-seq data can generalize to untested perturbations, enabling in silico screening of drug targets and genetic interventions. This is the foundation for agentic systems that design and prioritize experimental perturbations.

Emerging Modalities: Spatial, Metabolomics, and Beyond

The frontier of single-cell multi-omics extends further. Spatial transcriptomics (10x Visium, Xenium, NanoString GeoMx) preserves tissue architecture while measuring gene expression, which is critical for understanding cell-cell interactions in the tumor microenvironment, immune niches, and developing tissues. By 2025, spatial methods achieved near-single-cell resolution, with Xenium reporting subcellular localization of transcripts.

Single-cell metabolomics remains challenging because mass spectrometry consumes the sample and metabolites cannot be amplified, but advances in mass spectrometry imaging and Raman spectroscopy are enabling metabolic profiling at cellular resolution. The integration of spatial, multi-omic, and metabolic data represents the next frontier: a comprehensive view of cellular function in tissue context.


AI Foundation Models for Single-Cell Biology

scGPT: A Generative Transformer for Single-Cell Multi-Omics

One of the most prominent AI advances for single-cell biology is scGPT, published by Cui et al. in Nature Methods (2024). scGPT is a generative pre-trained transformer trained on over 33 million human single-cell transcriptomes across diverse tissues and conditions.

Architecture and Training: scGPT adapts the transformer architecture for single-cell data. Key innovations include:

  • Gene tokenization: Each gene is a token; expression levels are binned within each cell and embedded alongside the gene token
  • Masked gene modeling: Random genes are masked, and the model learns to predict their expression from context — analogous to masked language modeling in NLP
  • Condition tokens: Cell type, tissue, disease state, and perturbation conditions are encoded as additional tokens
  • Multi-omics extension: Modality tokens let scGPT integrate paired RNA + protein (CITE-seq) and RNA + chromatin (SHARE-seq, 10x Multiome) data
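The tokenization and masking steps above can be pictured with a small data-preparation sketch. It assumes expression values are discretized into per-cell bins before masking; the bin count, mask rate, `MASK` sentinel, and function names are arbitrary choices for illustration and are not taken from the scGPT codebase.

```python
import math
import random

MASK = -1  # sentinel standing in for a learned [MASK] value token

def bin_expression(values, n_bins=4):
    """Map non-zero values to bins 1..n_bins by rank within the cell; zeros stay 0."""
    nonzero = sorted(v for v in values if v > 0)
    binned = []
    for v in values:
        if v <= 0:
            binned.append(0)
            continue
        rank = sum(1 for x in nonzero if x <= v)  # 1..len(nonzero)
        binned.append(math.ceil(rank * n_bins / len(nonzero)))
    return binned

def mask_genes(genes, bins, mask_rate=0.3, seed=0):
    """Hide the expression bin of a random subset of genes.

    Returns (model_inputs, targets): inputs are (gene, bin_or_MASK) pairs and
    targets map each masked gene to the bin the model should predict.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(genes)))
    masked = set(rng.sample(range(len(genes)), n_mask))
    inputs = [(g, MASK if i in masked else b) for i, (g, b) in enumerate(zip(genes, bins))]
    targets = {genes[i]: bins[i] for i in masked}
    return inputs, targets

# Toy cell with made-up counts.
genes = ["CD3E", "CD4", "CD8A", "MS4A1", "NKG7", "LYZ"]
counts = [12, 7, 0, 0, 3, 1]

bins = bin_expression(counts)
inputs, targets = mask_genes(genes, bins)
print(inputs)
print(targets)
```

Binning by rank within the cell makes the input less sensitive to sequencing depth, and the training loss is computed only on the masked positions returned in `targets`.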

Capabilities: scGPT demonstrates remarkable capabilities:

  1. Cell type annotation: Given an unlabeled cell’s transcriptome, scGPT predicts its cell type with accuracy exceeding classical methods (SVM, random forests) and matching expert annotation
  2. Perturbation prediction: Given a cell type and a genetic or drug perturbation, scGPT predicts the resulting transcriptional change. This enables in silico screening of drug candidates
  3. Gene regulatory network inference: Gene embeddings and attention weights suggest gene-gene relationships, yielding candidate regulatory networks that still require experimental validation
  4. Data integration: scGPT learns batch-invariant representations, enabling integration of datasets from different labs, platforms, and conditions
  5. Missing modality imputation: Given RNA alone, models of this kind can be trained to predict protein expression (for CITE-seq) or chromatin accessibility (for SHARE-seq), which could reduce the cost of multi-omic profiling

Performance: In benchmark evaluations, scGPT outperformed previous methods on cell type annotation (92% accuracy vs. 85% for scANVI), perturbation prediction (Pearson r = 0.78 vs. 0.65 for linear models), and batch correction (kBET scores improved by 40%), although such figures depend heavily on the dataset and evaluation protocol. The code and pretrained weights are open source on GitHub, and the model has been adopted by major cell atlas projects.

Geneformer: Transfer Learning for Single-Cell Transcriptomics

Geneformer, published by Theodoris et al. in Nature (2023) and extended through 2025, is another foundation model for single-cell biology. Trained on 30 million single-cell transcriptomes, Geneformer uses a transformer encoder architecture optimized for transfer learning.

Key Features:

  • Rank value encoding: Gene expression is encoded as rank values within each cell, making the model robust to technical variation in sequencing depth
  • Contextual gene embeddings: Genes are represented based on their co-expression context, capturing functional relationships
  • Transfer learning: Geneformer is pre-trained on large-scale atlases and fine-tuned for specific tasks (disease classification, drug response prediction)
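Rank value encoding can be sketched in a few lines: scale each gene's count by the cell's total and by a corpus-wide reference for that gene, then order genes from highest to lowest scaled value. The reference medians below are invented for illustration, and `max_len` is an assumed cap on sequence length rather than a documented setting.

```python
def rank_value_encode(cell_counts, gene_medians, max_len=2048):
    """Return gene names ordered by depth- and median-normalized expression."""
    total = sum(cell_counts.values())
    scaled = {}
    for gene, count in cell_counts.items():
        if count > 0 and gene in gene_medians:
            # Dividing by the corpus median down-weights ubiquitously high genes.
            scaled[gene] = (count / total) / gene_medians[gene]
    ranked = sorted(scaled, key=lambda g: (-scaled[g], g))
    return ranked[:max_len]

# Toy cell and made-up corpus medians (fraction of counts per gene).
cell = {"ACTB": 100, "GATA4": 5, "TTN": 20, "MYH7": 0}
medians = {"ACTB": 0.05, "GATA4": 0.001, "TTN": 0.02, "MYH7": 0.02}

encoding = rank_value_encode(cell, medians)
print(encoding)

# Sequencing twice as deeply leaves the ranking unchanged.
deeper = {gene: 2 * count for gene, count in cell.items()}
print(rank_value_encode(deeper, medians) == encoding)
```

In this toy cell the transcription factor GATA4 outranks the housekeeping gene ACTB despite far fewer raw counts, which is the intended effect: genes that distinguish a cell state move to the front of the sequence.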

Applications: Geneformer has been applied to:

  • Disease gene discovery: Identifying disease-relevant genes by in silico deletion, which measures how removing a gene from a cell's input shifts its embedding. In cardiomyopathy, Geneformer prioritized known disease genes and nominated novel candidates validated experimentally
  • Drug repurposing: Predicting transcriptional responses to drugs and identifying compounds that reverse disease signatures
  • Cell-cell communication: Inferring ligand-receptor interactions from gene expression patterns

Geneformer and scGPT represent complementary approaches: scGPT emphasizes generative capabilities and multi-omics integration, while Geneformer focuses on transfer learning and interpretability. Both are foundational tools for agentic omics systems.

scVI and scANVI: Variational Autoencoders for Integration

Before transformer-based models, variational autoencoders (VAEs) dominated single-cell analysis. scVI (single-cell Variational Inference), developed by Lopez et al. and continuously updated through 2025, remains widely used for:

  • Batch correction: Learning batch-invariant latent representations
  • Data integration: Combining datasets from different studies and platforms
  • Dimensionality reduction: Compressing high-dimensional transcriptomes into interpretable latent spaces
  • Differential expression: Bayesian testing on the model's denoised, batch-corrected expression estimates

scANVI extends scVI with semi-supervised learning, leveraging partial cell type labels to improve annotation. While transformers have surpassed VAEs in some benchmarks, scVI remains popular due to its computational efficiency (critical for million-cell datasets) and mature software ecosystem (scvi-tools, integrated with Scanpy).

SCENIC+: Regulatory Network Inference at Scale

SCENIC+, the multi-omic successor to SCENIC (Single-Cell rEgulatory Network Inference and Clustering) published in Nature Methods in 2023, combines single-cell multi-omics with regulatory network inference. The method integrates:

  • Chromatin accessibility: Identifying active regulatory elements
  • Gene expression: Linking regulators to target genes
  • Motif analysis: Inferring transcription factor binding

SCENIC+ reconstructs gene regulatory networks (GRNs) at single-cell resolution, revealing cell type-specific regulatory programs. Applications include identifying master regulators of cell fate, understanding disease-associated regulatory changes, and predicting the impact of non-coding variants.


Cell Atlas Construction: Mapping the Human Body at Single-Cell Resolution

The Human Cell Atlas

The Human Cell Atlas (HCA), launched in 2016 and accelerating through 2024-2026, aims to catalog every cell type in the human body. The project has produced reference atlases for dozens of tissues and has already profiled tens of millions of cells, a small sample of the tens of trillions of cells in a human body.

Key Achievements (2024-2026):

  • Tabula Sapiens: A multi-organ reference atlas of nearly 500,000 cells from 24 human tissues and organs, profiled by scRNA-seq
  • Human Developmental Cell Atlas: Mapping cell type emergence during embryogenesis and fetal development
  • Disease atlases: Cancer (Human Tumor Atlas Network), autoimmune diseases, and neurodegenerative conditions
  • Spatial integration: Combining dissociated single-cell data with spatial transcriptomics to preserve tissue context

AI’s Role: Foundation models like scGPT are essential for atlas construction:

  • Automated annotation: scGPT annotates millions of cells consistently, reducing manual curation burden
  • Cross-study integration: Batch correction enables combining datasets from hundreds of labs
  • Quality control: AI models detect doublets, dying cells, and technical artifacts
  • Missing data imputation: Predicting unmeasured modalities reduces experimental cost

The HCA data is publicly available via the Human Cell Atlas Data Portal, providing a foundational resource for biological research and drug discovery.

Disease Atlases: Cancer, Autoimmunity, and Beyond

Single-cell multi-omics is transforming disease research. Key initiatives include:

Human Tumor Atlas Network (HTAN): Launched by NCI, HTAN produces multi-omic atlases of major cancer types. Each atlas includes:

  • Single-cell RNA-seq and CITE-seq of tumor and immune cells
  • Spatial transcriptomics preserving tumor architecture
  • Matched germline genomics and clinical data
  • Longitudinal sampling tracking treatment response

Early HTAN atlases (2024-2025) revealed:

  • Tumor microenvironment heterogeneity: Distinct immune cell states associated with response vs. resistance to immunotherapy
  • Cancer cell plasticity: Transcriptional programs enabling metastasis and drug resistance
  • Cell-cell communication networks: Ligand-receptor interactions between tumor and stromal cells driving progression

Autoimmune Disease Atlases: Single-cell studies of rheumatoid arthritis, lupus, and multiple sclerosis have identified pathogenic cell states and therapeutic targets. In lupus, CITE-seq revealed a previously unknown B cell subset driving autoantibody production.

Neurodegenerative Disease: Single-cell atlases of Alzheimer’s and Parkinson’s disease brain tissue have identified disease-associated microglia and neuronal subtypes, informing therapeutic strategies.


Trajectory Inference and RNA Velocity: Reconstructing Cellular Dynamics

RNA Velocity: Predicting the Future State of Cells

RNA velocity, introduced in 2018 and refined through 2024-2025, leverages the kinetics of mRNA splicing to infer cellular dynamics. The method distinguishes:

  • Unspliced mRNA: Newly transcribed, not yet processed
  • Spliced mRNA: Mature, ready for translation

By comparing unspliced and spliced counts, RNA velocity estimates the rate of change in gene expression — effectively predicting where a cell is heading. This enables reconstruction of developmental trajectories, identification of progenitor populations, and inference of cell fate decisions.
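The comparison of unspliced and spliced counts can be made concrete with the steady-state model, in which spliced mRNA changes as ds/dt = u - gamma * s (with the splicing rate absorbed into the units of u). The sketch below fits gamma as a least-squares slope through the origin on cells assumed to be at steady state; published implementations choose those cells from extreme expression quantiles, which is omitted here, and all numbers are made up.

```python
def fit_gamma(unspliced, spliced):
    """Least-squares slope through the origin for u = gamma * s at steady state."""
    numerator = sum(u * s for u, s in zip(unspliced, spliced))
    denominator = sum(s * s for s in spliced)
    return numerator / denominator

def velocity(unspliced, spliced, gamma):
    """Per-cell velocity for one gene: positive means expression is increasing."""
    return [u - gamma * s for u, s in zip(unspliced, spliced)]

# Cells assumed to be at steady state for this gene (toy values).
steady_u = [1.0, 2.0]
steady_s = [2.0, 4.0]
gamma = fit_gamma(steady_u, steady_s)

# Three new cells: at steady state, being induced, being repressed.
u = [1.0, 3.0, 0.5]
s = [2.0, 4.0, 4.0]
v = velocity(u, s, gamma)
print(gamma, v)
```

A cell with more unspliced mRNA than the steady-state line predicts gets a positive velocity (the gene is being switched on), and a cell with less gets a negative one. Repeating this per gene gives each cell a velocity vector in expression space.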

Extensions: Later methods build on RNA velocity with richer dynamical, probabilistic, and deep learning models:

  • scVelo: Generalizes RNA velocity to transient, non-steady-state dynamics using a likelihood-based dynamical model
  • CellRank: Combines RNA velocity with Markov chain models to compute fate probabilities
  • DeepVelo: Uses neural ODEs to model continuous-time dynamics

These methods have revealed:

  • Developmental trajectories: Lineage relationships in embryogenesis, hematopoiesis, and organogenesis
  • Disease progression: Trajectories from healthy to diseased states in cancer and fibrosis
  • Drug response: Cellular trajectories following therapeutic intervention
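The fate probabilities that Markov chain approaches such as CellRank report can be pictured with a tiny absorbing chain. This sketch assumes a transition matrix is already available (in practice it is derived from velocities and cell-cell similarity) and uses plain fixed-point iteration; actual tools solve a linear system on a far larger, sparse matrix. The four states and their probabilities are invented.

```python
def fate_probabilities(P, fate_state, absorbing, n_iter=500):
    """Probability that each state is eventually absorbed in `fate_state`."""
    n = len(P)
    f = [1.0 if i == fate_state else 0.0 for i in range(n)]
    for _ in range(n_iter):
        new = [sum(P[i][j] * f[j] for j in range(n)) for i in range(n)]
        # Terminal states keep their fixed boundary values.
        for a in absorbing:
            new[a] = 1.0 if a == fate_state else 0.0
        f = new
    return f

# States: 0 = progenitor, 1 = intermediate, 2 = fate A, 3 = fate B (toy values).
P = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.2, 0.6, 0.2],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

to_a = fate_probabilities(P, fate_state=2, absorbing=[2, 3])
to_b = fate_probabilities(P, fate_state=3, absorbing=[2, 3])
print([round(p, 3) for p in to_a])
print([round(p, 3) for p in to_b])
```

Each cell's probabilities across terminal fates sum to one, so they can be read as a soft lineage assignment rather than a hard cluster label.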

Perturbation Prediction: In Silico Screening

One of the most powerful applications of single-cell foundation models is perturbation prediction. Given a cell type and a perturbation (gene knockout, drug treatment, cytokine exposure), models like scGPT predict the resulting transcriptional change.

Why This Matters:

  • Drug discovery: Screen thousands of compounds in silico before experimental testing
  • Target validation: Predict the effect of gene knockouts on cellular phenotypes
  • Combination therapy: Model the effect of drug combinations, identifying synergies
  • Personalized medicine: Predict patient-specific responses based on their cellular profiles

Performance: In benchmarks, scGPT achieved a Pearson correlation of 0.78 between predicted and observed perturbation responses, compared with 0.65 for linear models. Results on this task are sensitive to the dataset and metric, and some independent benchmarks have found that simple linear baselines match or beat foundation models, so predictions are best used to prioritize perturbations for experimental validation rather than to replace it.

Agentic Applications: An agentic system could:

  1. Receive a therapeutic goal (e.g., “reduce inflammation in rheumatoid arthritis synoviocytes”)
  2. Query scGPT for perturbations predicted to achieve this transcriptional change
  3. Filter predictions by drug availability, toxicity profiles, and clinical feasibility
  4. Propose a ranked list of candidate interventions for experimental testing
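Steps 3 and 4 of this loop amount to a filter followed by a sort. Below is a minimal sketch with entirely hypothetical fields and placeholder candidates; `predicted_score` stands in for whatever score a perturbation model returns, and nothing here calls a real scGPT API.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    predicted_score: float  # hypothetical: match between predicted shift and goal
    drug_available: bool    # hypothetical: an approved or tool compound exists
    toxicity: float         # hypothetical: 0 (benign) to 1 (severe)

def prioritize(candidates, max_toxicity=0.5, top_k=3):
    """Drop infeasible candidates, then rank the rest by predicted score."""
    feasible = [
        c for c in candidates
        if c.drug_available and c.toxicity <= max_toxicity
    ]
    return sorted(feasible, key=lambda c: c.predicted_score, reverse=True)[:top_k]

# Placeholder candidates; names and values are invented.
candidates = [
    Candidate("target_A", 0.91, True, 0.2),
    Candidate("target_B", 0.95, False, 0.1),  # no compound available
    Candidate("target_C", 0.80, True, 0.7),   # too toxic
    Candidate("target_D", 0.60, True, 0.3),
]

shortlist = prioritize(candidates)
print([c.name for c in shortlist])
```

Keeping the feasibility filters separate from the model score makes each exclusion auditable, which matters when the output is a list of experiments someone has to pay for.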

This is the vision of agentic omics: AI systems that design and prioritize biological interventions.


Computational Challenges: Scaling to Millions of Cells

Scalability

Single-cell datasets are massive. The Human Cell Atlas already spans tens of millions of cells, and individual studies routinely profile millions. This creates computational challenges:

  • Memory: Loading a million-cell dataset requires tens of gigabytes of RAM
  • Compute: Training foundation models requires GPU clusters
  • Storage: Raw data and processed matrices require petabytes of storage

Solutions:

  • Sparse representations: Single-cell matrices are highly sparse (most genes not expressed in most cells), enabling efficient storage
  • Cloud computing: Platforms like Terra, DNAnexus, and Seven Bridges provide scalable compute
  • Model distillation: Compressing large foundation models for efficient inference
  • Incremental learning: Updating models with new data without full retraining
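A back-of-the-envelope calculation shows why sparse representations matter. The sketch assumes 32-bit values and indices, a compressed sparse row (CSR) layout, 20,000 genes, and 5% non-zero entries; these are illustrative assumptions rather than measurements from a particular dataset.

```python
def dense_bytes(n_cells, n_genes, value_bytes=4):
    """Memory for a dense cells x genes matrix."""
    return n_cells * n_genes * value_bytes

def csr_bytes(n_cells, n_genes, density, value_bytes=4, index_bytes=4):
    """Memory for CSR storage: values, column indices, and row pointers."""
    nnz = round(n_cells * n_genes * density)
    return nnz * (value_bytes + index_bytes) + (n_cells + 1) * index_bytes

n_cells, n_genes = 1_000_000, 20_000
dense = dense_bytes(n_cells, n_genes)
sparse = csr_bytes(n_cells, n_genes, density=0.05)

print(f"dense:  {dense / 1e9:.1f} GB")
print(f"sparse: {sparse / 1e9:.1f} GB")
```

Under these assumptions the sparse layout is about ten times smaller, which is often the difference between needing a high-memory node and fitting on a workstation.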

Batch Effects and Technical Variation

Batch effects — systematic differences between datasets due to technical rather than biological factors — are a major challenge. Sources include:

  • Different sequencing platforms (10x, Drop-seq, Smart-seq2)
  • Different labs and protocols
  • Different sample processing times

AI Solutions:

  • scGPT and Geneformer: Learn batch-invariant representations through pre-training on diverse datasets
  • scVI: Explicitly models batch as a covariate in the VAE
  • Harmony and Seurat v5: Classical methods still widely used for integration

Despite progress, batch correction remains imperfect. Subtle biological signals can be lost, and over-correction can remove real biological variation. Careful validation is essential.
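Over-correction is easy to reproduce in a toy setting. The sketch below applies the crudest possible correction, centering each batch on its own mean, to a case where the batches differ in cell type composition rather than in technical bias. The method is a deliberate oversimplification chosen to expose the failure mode, and the values are made up.

```python
def mean(values):
    return sum(values) / len(values)

def center_per_batch(values, batches):
    """Subtract each batch's own mean from its cells (naive batch correction)."""
    batch_means = {
        b: mean([v for v, vb in zip(values, batches) if vb == b])
        for b in set(batches)
    }
    return [v - batch_means[b] for v, b in zip(values, batches)]

# One marker gene: batch 1 contains only T cells, batch 2 only B cells.
expression = [5.0, 5.2, 4.8, 0.1, 0.0, 0.2]
batches = ["b1", "b1", "b1", "b2", "b2", "b2"]
cell_types = ["T", "T", "T", "B", "B", "B"]

corrected = center_per_batch(expression, batches)

def type_mean(values, cell_type):
    return mean([v for v, t in zip(values, cell_types) if t == cell_type])

gap_before = type_mean(expression, "T") - type_mean(expression, "B")
gap_after = type_mean(corrected, "T") - type_mean(corrected, "B")
print(round(gap_before, 2), round(gap_after, 2))
```

The real difference between T and B cells vanishes after correction because it was perfectly confounded with batch. Checking that known marker genes still separate known cell types after integration is a cheap guard against this.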

Missing Modalities

Multi-omic profiling is expensive. Many datasets have only RNA, or RNA plus one additional modality. Foundation models can impute missing modalities:

  • scGPT-style foundation models: Can be fine-tuned to predict protein expression from RNA (for CITE-seq) and chromatin accessibility from RNA (for SHARE-seq)
  • totalVI: A VAE-based method in scvi-tools for joint RNA + protein modeling and protein imputation

However, imputation has limits. Predicted values are uncertain and should not replace experimental measurement for critical applications. Imputation is best used for hypothesis generation and exploratory analysis.


Clinical Applications: From Bench to Bedside

Cancer Immunotherapy Response Prediction

Single-cell multi-omics is transforming cancer immunotherapy. CITE-seq of tumor biopsies before treatment reveals:

  • T cell states: Exhausted vs. progenitor exhausted vs. effector T cells, with different responses to checkpoint inhibitors
  • Myeloid populations: Immunosuppressive macrophages and dendritic cells that inhibit T cell function
  • Tumor cell states: Antigen presentation capacity, interferon signaling, and resistance mechanisms

Machine learning models trained on these profiles have been reported to predict response to PD-1/PD-L1 inhibitors with AUC ~0.85 in research cohorts, outperforming established clinical biomarkers (PD-L1 IHC, tumor mutational burden). Commercial precision oncology providers such as Tempus and Foundation Medicine still rely mainly on bulk genomic and transcriptomic profiling, so single-cell assays remain largely a research tool for now.
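For readers less familiar with the metric, AUC is the probability that a randomly chosen responder receives a higher predicted score than a randomly chosen non-responder. A minimal sketch that computes it directly from that definition, using made-up scores for six patients:

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (responder, non-responder) pairs ranked correctly.

    Ties count as half a correct pair. labels: 1 = responder, 0 = non-responder.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

# Toy predictions: higher score means the model expects a response.
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]

auc = roc_auc(scores, labels)
print(round(auc, 3))
```

An AUC of 0.5 is chance and 1.0 is perfect separation. Because AUC ignores the decision threshold, a clinically useful model also needs calibration and a validated cutoff.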

Minimal Residual Disease Detection

After cancer treatment, detecting minimal residual disease (MRD) — small numbers of remaining cancer cells — is critical for predicting relapse. Single-cell methods can detect MRD at frequencies as low as 1 in 100,000 cells, enabling early intervention.
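Sensitivity at this level is partly a sampling question: how many cells must be profiled to see a clone present at 1 in 100,000? A short worked calculation follows, assuming cells are sampled independently and that a single malignant cell counts as detection (real MRD calls usually require several events, which raises the number).

```python
import math

def detection_probability(n_cells, frequency):
    """Probability of sampling at least one malignant cell."""
    return 1.0 - (1.0 - frequency) ** n_cells

def cells_needed(frequency, confidence=0.95):
    """Smallest n with P(at least one malignant cell) >= confidence."""
    # Solve (1 - f)^n <= 1 - confidence for n; log1p keeps precision for small f.
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-frequency))

n = cells_needed(1e-5)
print(n, round(detection_probability(n, 1e-5), 4))
```

At a frequency of 1 in 100,000, roughly 300,000 cells are needed for a 95% chance of capturing even one malignant cell, so throughput and enrichment strategy set a hard floor on achievable sensitivity.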

AI models classify rare cells as malignant vs. normal based on transcriptomic and proteomic profiles, improving sensitivity over classical methods. This is particularly valuable in leukemia and lymphoma, where MRD status guides treatment decisions.

Autoimmune Disease Stratification

Single-cell profiling of autoimmune diseases has revealed distinct molecular subtypes within clinically defined diseases. In rheumatoid arthritis, CITE-seq of synovial tissue defined six cell-type abundance phenotypes (CTAPs) associated with different therapeutic responses. In lupus, single-cell analysis revealed B cell and T cell subsets driving disease in different patients.

This stratification enables precision medicine: matching patients to therapies based on their molecular subtype rather than clinical diagnosis alone. Clinical trials are now incorporating single-cell biomarkers for patient selection.


Glossary

  • CITE-seq: Cellular Indexing of Transcriptomes and Epitopes by Sequencing; measures RNA and surface proteins simultaneously in single cells
  • SHARE-seq: Simultaneous High-throughput ATAC and RNA Expression with Sequencing; measures chromatin accessibility and gene expression in single cells
  • Perturb-seq: Combines CRISPR genetic perturbations with single-cell RNA sequencing to link genotype to phenotype
  • scGPT: A generative pre-trained transformer for single-cell biology, trained on 33+ million cells
  • Geneformer: A transformer-based foundation model for single-cell transcriptomics optimized for transfer learning
  • RNA velocity: Method to infer cellular dynamics from the ratio of unspliced to spliced mRNA
  • Batch effect: Systematic technical variation between datasets that can confound biological signals
  • Human Cell Atlas: International effort to catalog every cell type in the human body

References

  1. Cui, H., et al. “scGPT: toward building a foundation model for single-cell multi-omics using generative AI.” Nature Methods 21, 1470-1480 (2024).

  2. Theodoris, C.V., et al. “Transfer learning enables predictions in network biology.” Nature 618, 616-624 (2023).

  3. Lopez, R., et al. “Deep generative modeling for single-cell transcriptomics.” Nature Methods 15, 1053-1058 (2018).

  4. Replogle, J.M., et al. “Mapping information-rich genotype-phenotype landscapes with genome-scale Perturb-seq.” Cell 185, 2559-2575 (2022).

  5. Human Cell Atlas. “Towards a Human Cell Atlas: Taking Stock of the Field.” Cell (2024 update).

  6. Stuart, T. and Satija, R. “Integrative single-cell analysis.” Nature Reviews Genetics 20, 257-272 (2019).

  7. Bergen, V., et al. “Generalizing RNA velocity to transient cell states through dynamical modeling.” Nature Biotechnology 38, 1408-1414 (2020).

  8. Tempus Labs. “Tempus xT: Comprehensive Genomic and Transcriptomic Profiling.” Clinical validation data (2025).

  9. 10x Genomics. “Single Cell Multiome ATAC + Gene Expression Product Documentation.” (2024).

  10. Bravo González-Blas, C., et al. “SCENIC+: single-cell multiomic inference of enhancers and gene regulatory networks.” Nature Methods 20, 1355-1367 (2023).


Conclusion: The Path to Agentic Single-Cell Biology

Single-cell multi-omics has matured from a specialized technique to a foundational tool in biology and medicine. The combination of experimental technologies (CITE-seq, SHARE-seq, Perturb-seq) and AI foundation models (scGPT, Geneformer) enables unprecedented resolution in understanding cellular function.

The next frontier is agentic: AI systems that autonomously design perturbation experiments, analyze multi-omic data, generate hypotheses, and iterate. Imagine an agent that:

  1. Receives a disease context (e.g., “triple-negative breast cancer”)
  2. Queries single-cell atlases to identify pathogenic cell states
  3. Uses scGPT to predict perturbations that reverse disease signatures
  4. Prioritizes candidates by drug availability and toxicity
  5. Designs a Perturb-seq experiment to validate predictions
  6. Analyzes results and refines hypotheses

This is not science fiction — the components exist today. What’s needed is integration: agentic frameworks (LangChain, AutoGen) connected to biological tools (scGPT APIs, AlphaFold, Perturb-seq platforms) with rigorous validation pipelines.

The cellular resolution revolution is underway. Agentic omics will accelerate it, transforming how we understand and treat disease.