When a new foundation model in computational biology is released, the accompanying paper inevitably features tables of bolded numbers demonstrating state-of-the-art performance. Whether it is predicting protein structures or annotating single-cell data, the claims are often spectacular. But how do we truly know if these AI systems work in ways that matter to biology, rather than just optimizing arbitrary computational metrics?

For the vision of Agentic Omics to become reality—where autonomous agents orchestrate models like AlphaFold and DNABERT-2 to drive drug discovery—we need a rigorous understanding of when these models succeed, when they hallucinate, and when their benchmarks deceive us. Claims of AI breakthroughs are only as strong as their evaluation methodologies.

In this post, we explore the landscape of benchmarks in biological AI, the common pitfalls that inflate reported performance, and the crucial chasm between in silico success and clinical validity.

The Gold Standard: CASP

If there is a benchmark that proved the viability of AI in biology, it is CASP (Critical Assessment of Protein Structure Prediction). Established in 1994, CASP is the gold standard because of its blind, prospective nature.

Instead of evaluating models on historical data—where data leakage is almost inevitable—CASP challenges participants to predict the 3D structures of proteins whose structures have been experimentally determined (via X-ray crystallography or cryo-EM) but not yet publicly released. The models cannot cheat because the answers literally do not exist in the public domain.

  • CASP14 (2020): AlphaFold 2 achieved unprecedented accuracy, effectively “solving” the single-chain protein folding problem for many families.
  • CASP15 and CASP16 (2022–2024): The focus shifted from single proteins to complex multimers, protein-ligand interactions, and conformational dynamics. In CASP16, predictors built on AlphaFold 3 and ESM-3, such as MULTICOM4, demonstrated significant advancements in tertiary structure prediction, highlighting how ensembling and advanced sampling can push beyond standard foundation model baselines.

CASP’s rigorous, blind evaluation has catalyzed progress precisely because it is virtually impossible to game. It is the model for what biological evaluation should look like.
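How is that "unprecedented accuracy" actually quantified? CASP's headline metric is GDT_TS, which averages the fraction of residues whose predicted positions fall within 1, 2, 4, and 8 Å of the experimental structure. The sketch below is a simplified illustration that assumes the predicted and reference Cα coordinates are already optimally superimposed (real CASP scoring searches over superpositions); the coordinates are invented.

```python
# Simplified GDT_TS-style score, assuming pre-superimposed C-alpha
# coordinates (real CASP scoring optimizes the superposition).
import math

def gdt_ts(pred, ref, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Average fraction of residues within each distance cutoff (angstroms), x100."""
    assert len(pred) == len(ref)
    dists = [math.dist(p, r) for p, r in zip(pred, ref)]
    fractions = [sum(d <= c for d in dists) / len(dists) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)

# Toy example: a 4-residue "protein"; three residues are predicted
# accurately, the last one is off by ~4.6 angstroms.
ref  = [(0.0, 0, 0), (3.8, 0, 0), (7.6, 0.0, 0), (11.4, 0, 0)]
pred = [(0.5, 0, 0), (3.9, 0, 0), (7.6, 0.3, 0), (16.0, 0, 0)]
print(gdt_ts(pred, ref))  # → 81.25
```

A perfect prediction scores 100; the penalty grows as residues drift past each successive cutoff, which makes the metric forgiving of small local errors but sensitive to globally misplaced regions.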

Benchmarking the Genome: The GUE Framework

While protein folding has CASP, genomics has historically lacked a unified, rigorous evaluation framework. When evaluating language models trained on DNA sequences, researchers often used disparate datasets with varying preprocessing steps, making direct comparisons difficult.

This changed with the introduction of the Genome Understanding Evaluation (GUE) benchmark, released alongside DNABERT-2 (Zhou et al., ICLR 2024).

The GUE benchmark is to genomic models what GLUE (General Language Understanding Evaluation) was to early natural language processing. It aggregates 28 datasets across multi-species genomes into a standardized format. GUE evaluates models on tasks such as:

  • Predicting promoter regions
  • Identifying splice sites
  • Transcription factor binding site prediction
  • Epigenetic marker prediction

By standardizing these tasks, GUE allows for apples-to-apples comparisons. For example, DNABERT-2 outperformed the prior state-of-the-art model (the much larger Nucleotide Transformer) on 23 of the 28 GUE datasets despite using 21 times fewer parameters, providing clear, quantifiable evidence of more efficient sequence representation.
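For most of its classification tasks, GUE reports the Matthews correlation coefficient (MCC), which, unlike raw accuracy, stays honest on imbalanced datasets. Here is a bare-bones sketch of the metric; the labels and predictions are made up for illustration.

```python
# Matthews correlation coefficient (MCC) from binary labels.
# Ranges from -1 to 1; 0 is no better than chance.
import math

def mcc(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy promoter-prediction labels: 1 = promoter, 0 = non-promoter.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(mcc(y_true, y_pred))  # → 0.5
```

Because MCC uses all four cells of the confusion matrix, a model that simply predicts the majority class scores near zero rather than near its (inflated) accuracy.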

Single-Cell Benchmarks: Evaluating scGPT and Geneformer

In transcriptomics, the rise of single-cell foundation models (like scGPT and Geneformer) has necessitated complex evaluation strategies. Single-cell data is extremely noisy, sparse, and prone to severe batch effects.

Evaluating a model like scGPT typically involves benchmarking its performance on:

  1. Cell Type Annotation: Can the model correctly identify a cell type across different tissues or experimental batches?
  2. Perturbation Prediction: If we simulate knocking out a specific gene, does the model accurately predict the resulting shift in the entire gene expression profile?
  3. Batch Integration: Can the model’s embeddings group cells by true biological identity rather than technical artifacts?

While tools like scib (Single-Cell Integration Benchmarking) exist, evaluating generative models in this space is notoriously tricky. A model might achieve a high silhouette score for clustering, but completely fail to capture rare, biologically significant cell states.
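The core trade-off scib formalizes can be shown with a toy silhouette computation: score the same embedding once by cell-type labels (biology conserved, so we want a high score) and once by batch labels (batches mixed, so we want a score near or below zero). Everything below is made-up data, and scib's real metrics rescale and aggregate these quantities, so treat this as a bare-bones illustration.

```python
# Mean silhouette width of an embedding under a given labeling.
# High = points cluster by that label; near zero or negative = labels are mixed.
import math

def silhouette(points, labels):
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        same = [math.dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
                if l == lab and j != i]
        a = sum(same) / len(same)  # mean distance to own cluster
        b = min(                   # mean distance to nearest other cluster
            sum(math.dist(p, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Four cells in a toy 2-D embedding: two T cells, two B cells,
# interleaved across two technical batches.
emb        = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
cell_types = ["T", "T", "B", "B"]
batches    = ["b1", "b2", "b1", "b2"]

print(silhouette(emb, cell_types) > 0.9)  # True: cell types well separated
print(silhouette(emb, batches) < 0.1)     # True: batches well mixed
```

A well-integrated embedding scores high on the first call and low on the second; a model that "integrates" by crushing all cells together would trivially ace the batch score while failing the biology score, which is exactly why both must be reported.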

The Pitfalls of Biological AI Evaluation

Despite these frameworks, the computational biology literature is rife with inflated claims. The complexity of biological data introduces unique pitfalls that can make a model appear superhuman when it is merely memorizing artifacts.

1. Data Leakage and Homology-Based Splitting Failures

In natural language, randomly splitting data into 80% training and 20% testing works reasonably well. In biology, it is disastrous.

Biological sequences are evolutionarily related (homologous). If a model is trained on Protein A and tested on Protein A’ (which shares 90% sequence identity due to shared ancestry), the model isn’t learning to predict structure or function; it is performing a nearest-neighbor lookup. Robust evaluation requires clustering sequences by identity and ensuring that train, validation, and test sets share no significant homology. Failures in homology-aware splitting are among the most common causes of over-reported performance in omics AI.
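In practice, a homology-aware split looks something like the sketch below. It uses k-mer Jaccard similarity as a cheap stand-in for alignment-based sequence identity (production pipelines use tools like MMseqs2 or CD-HIT at a chosen identity threshold), and all sequences are invented for illustration.

```python
# Homology-aware train/test split: cluster similar sequences first,
# then assign whole clusters to a split so no homologous pair straddles it.

def kmers(seq, k=3):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def similar(a, b, threshold=0.5):
    # k-mer Jaccard similarity as a cheap proxy for sequence identity.
    ka, kb = kmers(a), kmers(b)
    return len(ka & kb) / len(ka | kb) >= threshold

def cluster(seqs):
    clusters = []
    for s in seqs:
        for c in clusters:
            if any(similar(s, t) for t in c):
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters

seqs = ["MKTAYIAKQR", "MKTAYIAKQK",   # near-identical homologs
        "GGSGGSLVPR", "GGSGGSLVPK",   # another homologous pair
        "WNDPLYQQQA"]                 # a singleton
clusters = cluster(seqs)

# Split by cluster, not by sequence: homologs never cross the boundary.
train = [s for c in clusters[:-1] for s in c]
test  = [s for c in clusters[-1:] for s in c]
print(len(clusters), test)  # → 3 ['WNDPLYQQQA']
```

A naive random split of the same five sequences would almost certainly put one member of a homologous pair in the test set, letting the model score by lookup rather than generalization.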

2. The “Easy Task” Illusion

Many papers report accuracy on datasets that are fundamentally too easy. For example, distinguishing between entirely unrelated cell types in single-cell data is trivial. The true test of a foundation model is distinguishing closely related cell states or identifying subtle pathogenic shifts that traditional statistical methods miss.

3. Ignoring Compositional and Batch Biases

In microbiome analysis (metagenomics), data is compositional: sequencing yields only relative abundances, which sum to a constant. In single-cell data, sequencing depth varies wildly between cells. Models whose evaluations ignore these biases will often learn technical artifacts rather than biological truth.
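A standard remedy for compositionality is the centered log-ratio (CLR) transform, which maps constrained relative abundances into unconstrained real space before modeling; a pseudocount handles zero counts. The counts below are invented, chosen so two samples differ only in sequencing depth.

```python
# Centered log-ratio (CLR) transform for compositional count data.
import math

def clr(counts, pseudocount=0.5):
    vals = [c + pseudocount for c in counts]       # pseudocount avoids log(0)
    logs = [math.log(v) for v in vals]
    mean_log = sum(logs) / len(logs)               # log of the geometric mean
    return [l - mean_log for l in logs]

# Two microbiome samples with the same taxon proportions
# but a 100x difference in sequencing depth:
shallow = [10, 30, 60]
deep    = [1000, 3000, 6000]
print([round(x, 2) for x in clr(shallow)])
print([round(x, 2) for x in clr(deep)])
```

After CLR, the two samples look nearly identical despite the depth difference, and each transformed vector sums to approximately zero: a model evaluated on CLR-transformed data is being tested on composition, not on how deeply each sample happened to be sequenced.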

The Chasm: In Silico vs. In Vivo

The most profound gap in evaluating Agentic Omics is the difference between in silico (computational) metrics and in vivo (biological/clinical) reality.

An AI agent might design a highly stable protein with an optimal predicted binding affinity (a computational success), but if that protein aggregates in a cell, is toxic to the liver, or provokes an immune response, it is a clinical failure. Similarly, predicting a variant’s effect on gene expression with high accuracy on a benchmark does not guarantee that the variant is driving a patient’s disease.

This highlights the reproducibility crisis in computational biology: many models perform beautifully on their curated test sets but degrade immediately when applied to noisy, real-world clinical data.

For Agentic Omics systems to be trusted in critical workflows like drug discovery or precision medicine, we must adopt stricter evaluation paradigms:

  1. Prospective, Blind Testing: Whenever possible, models must be tested on data generated after the model was trained, mimicking the CASP approach.
  2. Biological Orthogonality: Computational predictions must be validated using orthogonal biological assays. If a model predicts a gene regulatory network, does it align with independent ChIP-seq or CRISPR perturbation data?
  3. Uncertainty Quantification: AI agents must not just provide predictions; they must provide well-calibrated confidence intervals. An agent orchestrating a multi-omics workflow must know when it does not know.
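The third requirement above can be measured directly. A common summary is expected calibration error (ECE): bin predictions by the model's stated confidence and compare each bin's average confidence to its realized accuracy. The sketch below uses made-up confidences and outcomes.

```python
# Expected calibration error (ECE): a well-calibrated model's
# "90% confident" predictions should be right ~90% of the time.

def ece(confidences, correct, n_bins=5):
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # which confidence bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    err = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        err += len(b) / total * abs(avg_conf - accuracy)
    return err

# An overconfident model: claims ~90% confidence but is right half the time.
confs   = [0.9, 0.9, 0.9, 0.9]
correct = [True, False, True, False]
print(ece(confs, correct))  # large gap between confidence and accuracy
```

An agent orchestrating a multi-omics workflow could use exactly this kind of check on held-out predictions to decide whether its own confidence estimates are trustworthy enough to act on.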

As we transition from single-task models to autonomous Agentic Omics, the benchmark is no longer just “can it predict the sequence?” The new benchmark is “can it reliably navigate biological uncertainty to generate testable, correct hypotheses?”

Glossary

  • CASP: Critical Assessment of Protein Structure Prediction; a blind, community-wide experiment for evaluating protein structure models.
  • GUE: Genome Understanding Evaluation; a comprehensive benchmark for evaluating DNA foundation models across diverse tasks.
  • Homology: Similarity in sequence or structure due to shared evolutionary ancestry.
  • Data Leakage: When information from outside the training dataset (such as test data or highly correlated homologous sequences) is improperly used to create the model, leading to inflated performance.

References

  1. Kryshtafovych, A., et al. (2021). Critical assessment of methods of protein structure prediction (CASP)—Round 14. Proteins: Structure, Function, and Bioinformatics.
  2. Kryshtafovych, A., et al. (2023). Critical assessment of methods of protein structure prediction (CASP)—Round XV. Proteins: Structure, Function, and Bioinformatics.
  3. Zhou, Z., et al. (2024). DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome. ICLR 2024.
  4. Cui, H., et al. (2024). scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nature Methods.
  5. Luecken, M. D., et al. (2022). Benchmarking atlas-level data integration in single-cell genomics. Nature Methods.