The bottleneck for AI in computational biology is rarely a shortage of sophisticated models; it is the sheer difficulty of making biological data AI-ready. The “Agentic Omics” vision—where autonomous AI agents orchestrate domain-specific models to accelerate drug discovery—fundamentally rests on the assumption that these agents have access to standardized, clean, and computable data.
In this post, we explore the unglamorous but critical foundation of omics AI: the data infrastructure. We trace the journey from raw sequencing reads to the structured tensor formats required by modern foundation models, exploring the evolving standards, the scale of the challenge, and how cloud infrastructure is adapting.
The Scale of the Omics Data Deluge
The phrase “data explosion” is a cliché, but in genomics, it is a mathematically precise description. Since the Human Genome Project, the cost of sequencing has fallen faster than Moore’s Law would predict, leading to an exponential accumulation of data.
- Genomics: A single high-coverage human whole-genome sequencing (WGS) run generates roughly 100 gigabytes of raw data. Large-scale efforts like the UK Biobank and the NIH All of Us Research Program have sequenced hundreds of thousands of individuals, generating petabytes of data.
- Transcriptomics: Single-cell RNA sequencing (scRNA-seq) datasets have grown from measuring hundreds of cells to millions. The Human Cell Atlas aims to map billions of cells, generating complex, sparse, high-dimensional matrices.
- Proteomics & Metabolomics: High-resolution mass spectrometry produces vast, complex spectra that require significant preprocessing to interpret.
Training a foundation model like Nucleotide Transformer or scGPT requires ingesting massive, curated slices of these global datasets. The challenge is not merely storage; it is the computability of the data. Raw biological data is noisy, biased, and siloed in heterogeneous formats.
The Pipeline: From Machine to Model
The transformation of omics data into an AI-ready state involves a complex, multi-stage pipeline.
1. Raw Data Generation and Primary Analysis
Biological samples are processed by sequencing machines (e.g., Illumina, PacBio, Oxford Nanopore) or mass spectrometers.
- In genomics, this yields raw “reads”—short or long strings of nucleotides—typically stored in FASTQ format. FASTQ files contain both the sequence and a quality score for each base.
- On their own, raw reads are of little use to most AI models; they must first be aligned to a reference genome.
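The four-line FASTQ record structure described above is simple enough to parse with the standard library alone. The sketch below is illustrative (real pipelines use tooling like Biopython or samtools); the read id and sequence are made-up example data:

```python
from io import StringIO

def parse_fastq(handle):
    """Yield (read_id, sequence, quality) from 4-line FASTQ records."""
    while True:
        header = handle.readline().strip()
        if not header:
            return
        seq = handle.readline().strip()
        handle.readline()          # '+' separator line (ignored here)
        qual = handle.readline().strip()
        yield header[1:], seq, qual   # strip the leading '@'

def phred_scores(qual):
    """Phred+33 encoding: ASCII code minus 33 gives the quality score."""
    return [ord(c) - 33 for c in qual]

example = StringIO("@read1\nACGT\n+\nIIII\n")
for rid, seq, qual in parse_fastq(example):
    print(rid, seq, phred_scores(qual))   # read1 ACGT [40, 40, 40, 40]
```

A Phred score of 40 corresponds to a 1-in-10,000 base-call error probability, which is why quality strings travel alongside every sequence.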
2. Alignment and Variant Calling (Secondary Analysis)
- BAM/CRAM: The raw reads are mapped against a reference genome. The resulting alignments are stored in BAM (Binary Alignment/Map) or its highly compressed successor, CRAM. These files are massive and expensive to process.
- VCF (Variant Call Format): Once aligned, software identifies variations (SNPs, indels) compared to the reference. These differences are extracted and stored in a VCF file. A VCF is significantly smaller and more structured, representing the genetic differences of an individual.
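The structure of a VCF data line (eight tab-separated fixed columns) can be sketched with stdlib Python. This is a minimal illustration, not a full parser; production code should use a real library such as cyvcf2 or pysam, and the record values below are invented example data:

```python
def parse_vcf_line(line):
    """Parse the 8 fixed VCF columns into a dict (minimal sketch)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    return {
        "chrom": chrom,
        "pos": int(pos),            # 1-based position on the reference
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),      # multiple alternate alleles allowed
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        # INFO is ';'-separated key=value pairs; bare keys are flags
        "info": dict(kv.split("=", 1) if "=" in kv else (kv, True)
                     for kv in info.split(";")),
    }

record = parse_vcf_line("chr1\t12345\trs99\tA\tG\t50\tPASS\tDP=20;DB")
print(record["pos"], record["alt"], record["info"]["DP"])
```

The compactness is the point: instead of gigabytes of aligned reads, a VCF row records only "position 12345 differs from the reference: A became G."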
3. Making Data “AI-Ready” (Tertiary Analysis and Tensorization)
Traditional bioinformatics largely stops at the VCF or the gene expression matrix. AI models, however, require further transformation.
- Tokenization: As discussed in our previous post, models like DNABERT-2 require DNA sequences to be tokenized (e.g., via Byte-Pair Encoding) into discrete integers.
- AnnData (Annotated Data): In single-cell transcriptomics, the standard has rapidly converged on the AnnData format (used by Python’s Scanpy). AnnData elegantly packages the sparse count matrix, cell metadata, and gene metadata into a single, HDF5-backed file structure. This is the direct input format for foundation models like scGPT and Geneformer.
- PDB and SMILES: In structural biology and cheminformatics, 3D protein structures (PDB) and chemical graphs (SMILES strings) must be converted into graphs or sequences for models like AlphaFold or ChemCrow.
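To make the tokenization step concrete: DNABERT-2 learns a Byte-Pair Encoding vocabulary from genomic text, but the simpler overlapping k-mer scheme used by the original DNABERT conveys the same idea of turning nucleotides into integer ids. A dependency-free sketch:

```python
def kmer_tokenize(seq, k=3):
    """Split a DNA sequence into overlapping k-mers (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_vocab(kmers):
    """Map each distinct k-mer to an integer id (0 reserved for padding)."""
    vocab = {}
    for kmer in kmers:
        vocab.setdefault(kmer, len(vocab) + 1)
    return vocab

tokens = kmer_tokenize("ACGTAC", k=3)   # ['ACG', 'CGT', 'GTA', 'TAC']
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]        # the integers a model actually sees
print(tokens, ids)
```

Whether the vocabulary comes from fixed k-mers or learned BPE merges, the output is the same: a sequence of integers a transformer can embed.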
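The AnnData layout is worth seeing in miniature. The real API is `anndata.AnnData(X, obs=..., var=...)` with pandas DataFrames for the metadata; the toy stand-in below only mimics that cells-by-genes layout with stdlib types, using a COO-style dict for sparsity (the cell types and gene names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class MiniAnnData:
    """Toy stand-in for anndata.AnnData: a sparse cells-by-genes count
    matrix plus per-cell (obs) and per-gene (var) metadata."""
    X: dict     # {(cell_idx, gene_idx): count} - nonzero entries only
    obs: list   # one metadata dict per cell (rows)
    var: list   # one metadata dict per gene (columns)

    @property
    def shape(self):
        return (len(self.obs), len(self.var))

adata = MiniAnnData(
    X={(0, 1): 5, (1, 0): 2},   # most counts are zero, so store only these
    obs=[{"cell_type": "T cell"}, {"cell_type": "B cell"}],
    var=[{"gene": "CD3E"}, {"gene": "MS4A1"}],
)
print(adata.shape)   # (2, 2)
```

Keeping matrix and metadata in one object is what makes the format "AI-ready": a training loop can slice cells, genes, and labels together without joining separate files.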
The Standardization Imperative: GA4GH and FAIR Principles
AI models are voracious for diverse data, which means researchers must aggregate datasets from multiple institutions. This is impossible without rigid standards.
The Global Alliance for Genomics and Health (GA4GH) has been instrumental in creating interoperable standards for genomic data sharing. APIs like the GA4GH Data Repository Service (DRS) allow AI workflows to access data across different cloud environments seamlessly, without needing to understand the underlying storage mechanics.
Furthermore, the data must adhere to FAIR principles: Findable, Accessible, Interoperable, and Reusable. For an Agentic Omics system to function, it must be able to autonomously find the relevant dataset, access it via an API, interoperate with its format, and reuse the data to test new hypotheses. If data lacks robust metadata (e.g., missing batch information, ambiguous cell type labels), the AI will learn the artifacts rather than the biology.
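The DRS access pattern described above is deliberately mechanical: a hostname-based DRS URI of the form `drs://<host>/<id>` maps to an HTTPS GET on `/ga4gh/drs/v1/objects/{id}`, which is exactly the kind of deterministic resolution an autonomous agent can perform. A sketch of that mapping (the host and object id are hypothetical):

```python
from urllib.parse import urlparse

def drs_to_https(drs_uri):
    """Translate a hostname-based DRS URI into the HTTPS request URL
    defined by the GA4GH DRS spec (GET /ga4gh/drs/v1/objects/{id})."""
    parsed = urlparse(drs_uri)
    if parsed.scheme != "drs":
        raise ValueError(f"not a DRS URI: {drs_uri}")
    object_id = parsed.path.lstrip("/")
    return f"https://{parsed.netloc}/ga4gh/drs/v1/objects/{object_id}"

# hypothetical host and object id, for illustration only
print(drs_to_https("drs://data.example.org/314159"))
# https://data.example.org/ga4gh/drs/v1/objects/314159
```

The response to that GET is a JSON object whose access methods tell the client where the bytes actually live, which is how workflows stay ignorant of the underlying storage mechanics.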
Cloud Infrastructure and Federated Learning
The days of downloading FASTQ files to a local high-performance computing (HPC) cluster are ending. The sheer mass of the data necessitates bringing the compute to the data.
Cloud Genomics Platforms
Platforms like Terra (Broad Institute/Verily), DNAnexus, and Seven Bridges (Velsera) provide cloud-native environments where data, analytical tools, and compute resources co-exist. These platforms are crucial for Agentic Omics because they offer the API-driven infrastructure that autonomous agents need to execute complex, multi-step workflows at scale.
Federated Learning: Privacy-Preserving AI
Genomic data is the ultimate personally identifiable information. Strict regulations (GDPR, HIPAA) heavily restrict data movement across borders or institutions.
Federated Learning offers a solution. Instead of centralizing the data to train the model, the model is sent to the data. The model trains locally at each hospital or research center, and only the updated model weights are shared centrally. This architecture is vital for training robust, generalizable clinical AI models on diverse populations without compromising patient privacy.
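The "send the model to the data" loop can be sketched as federated averaging (FedAvg): each site takes a local gradient step on its private data, and only the resulting weights travel, weighted centrally by local sample counts. This is a minimal illustration with made-up gradients and cohort sizes, not a production federated system:

```python
def local_update(weights, gradient, lr=0.1):
    """One local gradient step at a single site; raw data never leaves."""
    return [w - lr * g for w, g in zip(weights, gradient)]

def federated_average(site_weights, site_sizes):
    """FedAvg: average site models, weighted by local sample count."""
    total = sum(site_sizes)
    dim = len(site_weights[0])
    return [
        sum(w[i] * n for w, n in zip(site_weights, site_sizes)) / total
        for i in range(dim)
    ]

global_w = [0.0, 0.0]
# each hospital computes an update from its own (private) cohort
site_a = local_update(global_w, gradient=[1.0, -1.0])   # e.g. 100 patients
site_b = local_update(global_w, gradient=[0.5, 0.5])    # e.g. 300 patients
global_w = federated_average([site_a, site_b], site_sizes=[100, 300])
print(global_w)
```

Only `site_a` and `site_b` (weight vectors) cross institutional boundaries; the patient-level gradients and records stay behind each hospital's firewall.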
Challenges: Batch Effects, Noise, and Bias
Even when data is perfectly formatted, biological reality intrudes.
- Batch Effects: Technical variations between different sequencing runs, labs, or machine models can completely overwhelm the underlying biological signal. If a model trains on data with uncorrected batch effects, it will fail catastrophically in the real world.
- Label Noise: Supervised models rely on ground-truth labels (e.g., “This cell is a T-cell”). In biology, these labels are often probabilistic or derived from older, less accurate models.
- Population Bias: As we will explore deeper in Post 20, most genomic data is derived from individuals of European descent. Models trained on this skewed infrastructure inherit and amplify this bias, leading to AI systems that perform poorly on diverse populations.
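To see how easily a batch effect can masquerade as signal, consider one gene measured in two sequencing runs with a systematic offset. The crude fix sketched below (subtracting each batch's mean) is a one-gene toy version of what dedicated tools like ComBat or Harmony do across thousands of genes; the expression values are invented:

```python
from statistics import mean

def center_by_batch(values, batches):
    """Subtract each batch's mean from its samples - a crude one-gene
    sketch of batch correction, for illustration only."""
    batch_means = {}
    for b in set(batches):
        batch_means[b] = mean(v for v, bb in zip(values, batches) if bb == b)
    return [v - batch_means[b] for v, b in zip(values, batches)]

# one gene, two runs: run2 reads ~10 units higher for purely technical reasons
expr    = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
batches = ["run1", "run1", "run1", "run2", "run2", "run2"]
print(center_by_batch(expr, batches))   # [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0]
```

Before centering, a naive model would "learn" that run2 samples are high expressors; after centering, the two runs are comparable. Real correction is far subtler, because biological and technical variation are usually entangled.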
Conclusion
The transition toward Agentic Omics is blocked not by a lack of reasoning capabilities in LLMs, but by the friction of biological data. The heroic efforts of the last five years have centered on building the “data pipelines”—the ETL (Extract, Transform, Load) for the book of life.
As standards crystallize around formats like CRAM and AnnData, and as cloud-native APIs become ubiquitous, we are finally crossing the threshold where autonomous agents can fluidly access, manipulate, and reason over multi-omics data.
Glossary
- BAM (Binary Alignment/Map): A compressed binary format for storing sequence alignment data.
- VCF (Variant Call Format): A standard format for representing genetic variations.
- AnnData: A Python library and file format for handling annotated data matrices, widely used in single-cell genomics.
- GA4GH: Global Alliance for Genomics and Health, an organization developing standards for genomic data sharing.
- FAIR Principles: Guidelines to improve the Findability, Accessibility, Interoperability, and Reuse of digital assets.
References
- The Cancer Genome Atlas (TCGA) Research Network. (Various). Genomic data commons and cancer multi-omics datasets.
- The Human Cell Atlas Consortium. (2024). Progress towards a comprehensive reference map of all human cells.
- Global Alliance for Genomics and Health (GA4GH). (2024). Data Repository Service (DRS) API specifications.
- Cui, H., et al. (2024). scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nature Methods.
- Theodoris, C. V., et al. (2023). Transfer learning enables predictions in network biology. Nature (Geneformer).