Introduction: The Open Science Paradox
In May 2024, Google DeepMind published AlphaFold 3 in Nature, describing a system that could predict the structure of protein complexes with DNA, RNA, ligands, and small molecules—a dramatic leap beyond AlphaFold 2’s protein-only predictions. But there was a catch: the code wasn’t released. For six months, researchers could read about the breakthrough but couldn’t reproduce it, build on it, or verify the claims independently.
The backlash was swift and severe. An opinion piece in ASBMB Today argued that “AlphaFold 3 needs to be open source” for the very health of scientific progress. In November 2024, DeepMind reversed course, releasing the code under a non-commercial academic license. The OpenFold Consortium at Columbia University had already begun building an open-source alternative, demonstrating both the community’s demand and its capacity for self-organization.
This episode captures the central tension in biological AI today: the battle between open science and proprietary development. On one side stand models like ESM-3 (EvolutionaryScale), DNABERT-2, scGPT, and Evo—all openly available, enabling rapid iteration and democratized access. On the other side sit commercial efforts like Isomorphic Labs, Insilico Medicine’s proprietary pipelines, and the initial AlphaFold 3 release—driven by legitimate commercial incentives but raising concerns about reproducibility, equity, and the pace of scientific progress.
This post examines this tension honestly. We’ll map the landscape of open and closed biological AI models, analyze the arguments on both sides, explore the role of open data as the backbone of all biological AI, and consider what governance frameworks might balance openness with safety and commercial viability.
The Open Models: A Thriving Ecosystem
Protein Language Models: ESM-2 and ESM-3
Meta’s Evolutionary Scale Modeling (ESM) series represents one of the most significant open contributions to biological AI. ESM-2, released in 2023, offered protein language models up to 15 billion parameters trained on hundreds of millions of protein sequences. The models learned to predict protein structure, function, and mutational effects from sequence alone—all without explicit structural training data.
In 2024, ESM-3 followed: a multimodal generative model capable of designing proteins conditioned on structure, function, and sequence constraints. Critically, ESM-3 was released not by Meta itself but by EvolutionaryScale, a company founded by former members of Meta’s ESM team, with model weights and inference code openly available to researchers. The community license supports broad research use, though with restrictions on commercial deployment and on specific high-risk applications.
The impact has been substantial. ESM models have been integrated into countless research pipelines, from variant effect prediction to protein design. The openness enabled rapid validation—researchers could immediately test the models on their own datasets, identify limitations, and propose improvements.
DNA Foundation Models: DNABERT-2
DNABERT-2, published at ICLR 2024, represents the state of the art in DNA foundation models. Unlike its predecessor, which used k-mer tokenization, DNABERT-2 employs byte-pair encoding (BPE) trained on multi-species genomes, achieving 3× better efficiency while outperforming the original on 23 out of 28 benchmark datasets.
DNABERT-2 is fully open source under the Apache 2.0 license. The code, data, and pre-trained models are available on GitHub (MAGICS-LAB/DNABERT_2), enabling researchers worldwide to apply the model to their own genomic questions. This openness has accelerated adoption in variant effect prediction, regulatory element identification, and cross-species genome analysis.
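The shift from fixed k-mers to byte-pair encoding is easier to see in miniature. The sketch below is not DNABERT-2’s actual tokenizer (which is trained on multi-species genomes at scale); it is a toy BPE loop over DNA strings that repeatedly merges the most frequent adjacent token pair, so common motifs collapse into single tokens while rare regions stay fine-grained:

```python
from collections import Counter

def learn_bpe_merges(sequences, num_merges):
    """Learn byte-pair merges from DNA strings (toy illustration)."""
    # Start from single-base tokens, exactly like character-level BPE.
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair across the corpus.
        pairs = Counter()
        for toks in corpus:
            for a, b in zip(toks, toks[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        merged = a + b
        # Apply the winning merge everywhere it occurs.
        for i, toks in enumerate(corpus):
            out, j = [], 0
            while j < len(toks):
                if j + 1 < len(toks) and toks[j] == a and toks[j + 1] == b:
                    out.append(merged)
                    j += 2
                else:
                    out.append(toks[j])
                    j += 1
            corpus[i] = out
    return merges, corpus

seqs = ["ATATATGC", "ATATGCGC", "GCGCATAT"]
merges, tokenized = learn_bpe_merges(seqs, num_merges=3)
```

On this toy corpus the learned merges are ("A","T"), then ("G","C"), then ("AT","AT"), so the repeated "ATAT" motif becomes a single token. This frequency-driven vocabulary is what lets a BPE tokenizer cover a genome with far fewer tokens than overlapping fixed-length k-mers.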
Single-Cell Models: scGPT
scGPT, published in Nature Methods in February 2024, is a foundation model for single-cell multi-omics trained on 33+ million cells. The model enables cell type annotation, perturbation prediction, and gene regulatory network inference through transfer learning.
scGPT is openly available on GitHub (bowang-lab/scGPT) with pre-trained checkpoints and comprehensive documentation. The open release has enabled rapid adoption across the single-cell community, with researchers applying the model to diverse tissues and conditions. The model is also available through the Chan Zuckerberg Initiative’s Virtual Cells Platform, further democratizing access.
Cross-Domain Models: Evo
The Evo model, released by Arc Institute in 2024, is a 7-billion parameter model trained on 300 billion nucleotides spanning all domains of life—DNA, RNA, and protein sequences in a unified architecture. Published in Science, Evo can generate functional DNA sequences and predict the effects of mutations.
Evo is openly available through Arc Institute’s open science mission, with model weights and code accessible to researchers. The release included extensive safety discussions, acknowledging the dual-use potential while arguing that open access enables better safety research and governance.
The Hugging Face Biology Ecosystem
Hugging Face has become a central hub for open biological AI models. As of late 2025, the platform hosts hundreds of biology-related models, including:
- BioMistral-7B: An open-source large language model for medical domains, trained on biomedical literature
- IBM BioMed Collection: A biomedical foundation model trained on 2+ billion biological samples across proteins, small molecules, and single-cell data
- BioCLIP 2: A vision-language model for species identification, downloaded 45,000+ times in a single month
- Microsoft BioGPT: A generative pre-trained transformer for biomedical text mining
This ecosystem enables researchers to discover, test, and deploy biological AI models with minimal friction. The open nature of Hugging Face means models can be forked, fine-tuned, and improved by the community.
The Proprietary Side: Commercial Incentives and Restricted Access
AlphaFold 3: The Controversial Release
AlphaFold 3’s release trajectory illustrates the tensions in biological AI. When the paper was published in Nature in May 2024, DeepMind released only a web server for predictions—not the code or model weights. Researchers could submit sequences and receive predictions, but couldn’t:
- Run predictions locally on large datasets
- Integrate AlphaFold 3 into automated pipelines
- Audit the model for biases or errors
- Build derivative models or improvements
The justification centered on safety and responsible deployment. DeepMind expressed concerns about potential misuse and wanted to monitor usage patterns before broader release. However, many researchers saw this as antithetical to scientific norms—particularly for a tool with such transformative potential.
In November 2024, following sustained community pressure, DeepMind released the AlphaFold 3 code on GitHub under a non-commercial academic license. Model weights remained subject to a request process, with access granted at DeepMind’s discretion. This compromise satisfied some critics but left others frustrated by the ongoing restrictions.
The episode had lasting consequences. The OpenFold Consortium accelerated development of OpenFold-3, a fully open-source reproduction aiming for bitwise equivalence with AlphaFold 3. Ligo Biosciences also released an open-source implementation with training code. These efforts demonstrate both the community’s commitment to openness and the redundancy created when major models are initially restricted.
Isomorphic Labs: AlphaFold in Drug Discovery
Isomorphic Labs, a sister company to Google DeepMind, applies AlphaFold and related AI tools to real drug discovery programs. Unlike academic releases, Isomorphic’s tools and pipelines are proprietary, accessible only through partnerships with pharmaceutical companies.
The company has announced collaborations with Novartis, Eli Lilly, and other major pharma companies, applying AlphaFold 3 to target identification and validation. While these partnerships may accelerate drug development, the work happens behind closed doors. Results are published selectively, and the underlying methods remain trade secrets.
This model has legitimate advantages: pharmaceutical development requires massive capital investment, and proprietary tools can justify that investment. However, it creates a two-tier system where well-funded companies access state-of-the-art AI while academic researchers and smaller biotechs rely on open alternatives.
Insilico Medicine: AI-Designed Drugs in Clinical Trials
Insilico Medicine has advanced one of the most publicized AI-discovered drugs, ISM001-055 for idiopathic pulmonary fibrosis (IPF), into Phase II clinical trials. The company uses proprietary AI platforms for target discovery, molecule generation, and preclinical optimization.
Insilico’s platform is not open source. The company’s competitive advantage lies in its integrated AI pipeline, which it protects as intellectual property. While Insilico publishes research papers describing their methods, the actual tools remain proprietary.
This approach has produced tangible results—ISM001-055 represents a genuine milestone in AI-driven drug discovery. But it also means the broader research community cannot build on Insilico’s innovations or apply similar approaches to other diseases without developing their own proprietary pipelines.
Recursion Pharmaceuticals: Phenomics at Scale
Recursion Pharmaceuticals operates a massive phenomics-driven drug discovery platform, combining high-throughput cellular imaging with machine learning. The company has built one of the largest proprietary biological datasets in the world—hundreds of millions of cellular images linked to genetic and chemical perturbations.
Recursion’s platform and dataset are proprietary, accessible only to company researchers and partners. While Recursion has published research using their platform, the underlying data and tools remain closed.
This creates a paradox: Recursion’s dataset may be one of the most valuable resources for understanding cellular phenotypes, but it’s unavailable to the broader research community. The company argues that the dataset was built through massive capital investment ($500M+ raised) and represents their core competitive advantage. Researchers counter that datasets of this scale have public good implications that transcend any single company’s interests.
Open Data: The Backbone of Biological AI
All biological AI models—open or closed—depend on open data. The major biological databases represent decades of community effort to create shared infrastructure:
Structural Biology: PDB and AlphaFold DB
The Protein Data Bank (PDB), established in 1971, contains over 200,000 experimentally determined protein structures. This open database was essential for training AlphaFold 2 and virtually every protein structure prediction model since.
In 2021, DeepMind released the AlphaFold Protein Structure Database, providing predicted structures for 200+ million proteins across nearly all catalogued organisms. This database is freely accessible and has become a foundational resource for structural biology. The decision to make these predictions open was widely praised and demonstrates the value of open data even when models are partially restricted.
Sequence Databases: GenBank, UniProt, RefSeq
GenBank (NCBI), UniProt, and RefSeq form the backbone of sequence-based biological AI. These databases contain hundreds of millions of DNA, RNA, and protein sequences, all freely accessible. DNABERT-2, ESM, Evo, and other foundation models were trained on these open resources.
The openness of these databases enables reproducibility. Researchers can verify that models were trained on appropriate data, identify potential biases, and build improved models on the same foundation.
Single-Cell Resources: Human Cell Atlas, GEO, Tabula Sapiens
The Human Cell Atlas aims to create comprehensive reference maps of all human cells. Data from the project is openly shared through platforms like CELLxGENE, enabling researchers worldwide to access and analyze single-cell datasets.
The Gene Expression Omnibus (GEO) at NCBI hosts millions of gene expression profiles from experiments worldwide. Tabula Sapiens provides a comprehensive single-cell transcriptomic map of human organs. These open resources enabled the development of scGPT, Geneformer, and other single-cell foundation models.
The FAIR Principles
The biological data community has largely adopted the FAIR principles:
- Findable: Data should have persistent identifiers and rich metadata
- Accessible: Data should be retrievable via standard protocols
- Interoperable: Data should use standard formats and vocabularies
- Reusable: Data should have clear usage licenses and provenance
These principles have made biological AI possible. Without open, well-curated data, even the most sophisticated models would have nothing to learn from.
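In practice, FAIR compliance often comes down to checking that a dataset record carries specific metadata fields. The sketch below uses a hypothetical record schema (the field names are illustrative, not any repository’s actual standard) to show how each principle maps to concrete, checkable requirements:

```python
# Illustrative field names only; real repositories define their own schemas.
FAIR_CHECKS = {
    "Findable": ["identifier", "title", "keywords"],
    "Accessible": ["access_url", "access_protocol"],
    "Interoperable": ["format", "vocabulary"],
    "Reusable": ["license", "provenance"],
}

def fair_report(record):
    """Return, per FAIR principle, the required fields missing from a record."""
    return {
        principle: [f for f in fields if not record.get(f)]
        for principle, fields in FAIR_CHECKS.items()
    }

record = {
    "identifier": "doi:10.0000/example",       # toy DOI
    "title": "Single-cell atlas (toy record)",
    "keywords": ["scRNA-seq"],
    "access_url": "https://example.org/data",
    "access_protocol": "https",
    "format": "h5ad",
    # "vocabulary", "license", "provenance" deliberately absent
}
report = fair_report(record)
```

Here the report flags the record as findable and accessible but not yet interoperable (no controlled vocabulary) or reusable (no license or provenance), which is exactly the kind of gap that makes a dataset hard to train on responsibly.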
The Case for Open Biological AI
Reproducibility and Scientific Rigor
Science depends on reproducibility. When models are open, researchers can:
- Verify published results independently
- Identify bugs or errors in implementations
- Test models on diverse datasets to assess generalizability
- Understand failure modes and limitations
The AlphaFold 3 controversy highlighted this issue. Without code access, researchers couldn’t verify the model’s performance on their own targets or understand why predictions succeeded or failed. Open models like ESM-3 and DNABERT-2 enable this kind of rigorous validation.
Equity and Access
Open models democratize access to cutting-edge AI. A researcher at a university in Kenya can use ESM-3 or scGPT just as effectively as a scientist at a well-funded US institution. Proprietary models create access barriers:
- Academic licenses may exclude researchers at certain institutions
- Commercial licenses may be prohibitively expensive
- API-based access may have rate limits or usage restrictions
- Geographic restrictions may apply based on export controls
For global health challenges—neglected tropical diseases, pandemic preparedness, agricultural improvements for developing regions—open models ensure that AI benefits aren’t limited to wealthy nations and corporations.
Accelerated Scientific Progress
Openness accelerates progress through cumulative innovation. When DNABERT-2 was released openly, researchers immediately:
- Applied it to new species and genomic contexts
- Fine-tuned it for specific tasks (variant effect prediction, regulatory element identification)
- Identified limitations and proposed improvements
- Integrated it into existing bioinformatics pipelines
This “standing on the shoulders of giants” dynamic is how science advances. Proprietary models fragment the research landscape, with each company reinventing similar capabilities rather than building on shared foundations.
Safety Through Transparency
Counterintuitively, openness may improve safety. When models are open:
- The community can audit for biases, errors, and potential misuse
- Safety researchers can develop detection and mitigation strategies
- Governance frameworks can be informed by actual capabilities rather than speculation
- Dual-use concerns can be addressed through technical and social mechanisms
The Evo model release exemplified this approach. Arc Institute openly discussed the dual-use implications of a model that can generate DNA across all domains of life, inviting community input on governance and safety measures. This transparency enabled more informed policy discussions than would have been possible with a closed release.
The Case for Some Restrictions
Biosecurity and Dual-Use Concerns
Biological AI raises genuine dual-use concerns. The same models that design therapeutic proteins could potentially design harmful ones. Models that predict pathogen evolution could inform vaccine development—or enhancement of pathogens.
Proponents of restricted access argue that:
- Capability thresholds matter: Some capabilities may be too dangerous for unrestricted release
- Screening and monitoring: Controlled access enables usage monitoring and misuse detection
- Graduated release: Models can be released in stages as safety measures mature
- Responsible scaling: As models become more capable, access controls should tighten
These arguments have merit, though they must be balanced against the benefits of openness. The key question is whether restrictions actually improve safety or merely create an illusion of control while open alternatives emerge anyway.
Commercial Viability
Building state-of-the-art biological AI requires substantial investment:
- Compute costs: Training a 7B+ parameter model costs millions of dollars in GPU time
- Data curation: Assembling and cleaning training datasets requires significant labor
- Expertise: Teams need both ML and domain biology expertise, which is scarce and expensive
- Infrastructure: Serving models at scale requires ongoing operational investment
Companies argue that without some proprietary protection, they cannot justify these investments. Venture capital flows to businesses with defensible competitive advantages, not to projects that will be immediately commoditized by open-source releases.
This tension is real. If all biological AI were required to be open source, would companies still invest at current levels? Possibly not—which could slow overall progress even as openness increases.
Data Privacy and Patient Rights
Some biological AI applications involve sensitive patient data:
- Genomic data linked to health records
- Proprietary clinical trial data
- Patient-derived cellular models
In these contexts, openness must be balanced against privacy obligations. HIPAA, GDPR, and other regulations restrict how patient data can be shared, even for research. Federated learning and synthetic data generation offer partial solutions, but the tension remains.
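Federated learning, mentioned above as a partial solution, can be sketched in miniature: each site trains on its own data locally, and only model parameters, never patient records, leave the site. The toy federated-averaging (FedAvg-style) step below pools per-site parameter vectors weighted by cohort size; the parameter values are made up for illustration, with no real training involved:

```python
def fed_avg(site_params, site_sizes):
    """Average per-site parameter vectors, weighted by local sample count."""
    total = sum(site_sizes)
    dim = len(site_params[0])
    return [
        sum(p[i] * n for p, n in zip(site_params, site_sizes)) / total
        for i in range(dim)
    ]

# Two hospitals with different cohort sizes; raw data never leaves either site.
params_a = [0.2, 1.0]   # parameters after local training at site A (toy values)
params_b = [0.6, 0.0]   # parameters after local training at site B (toy values)
global_params = fed_avg([params_a, params_b], site_sizes=[100, 300])
```

The pooled model weights site B three times as heavily because it contributed three times the samples. What crosses institutional boundaries is only this small parameter vector, which is why the approach eases, without fully resolving, the tension between open model development and HIPAA/GDPR obligations.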
The AlphaFold 3 Compromise
DeepMind’s eventual AlphaFold 3 release—open code with restricted model weights—represents a compromise position. Academic researchers can run the model and build on the code, but commercial use requires negotiation. Model access is granted case-by-case, enabling some oversight.
Whether this is the right balance is debatable. Critics argue that model weights should be fully open for academic research. Defenders argue that the compromise enables both scientific progress and responsible commercialization.
Community Resources: The Open Infrastructure
Hugging Face Biology Models
Hugging Face has become the central hub for open biological AI. The platform hosts:
- Hundreds of biology-specific models (protein, DNA, single-cell, medical text)
- Datasets for training and benchmarking
- Community forums for troubleshooting and collaboration
- Integration with popular tools (PyTorch, TensorFlow, JAX)
The platform’s openness enables rapid iteration. When a new model is released, the community can immediately test it, report issues, and propose improvements. This dynamic has accelerated biological AI development significantly.
BioContainers and Galaxy Project
BioContainers provides containerized bioinformatics tools, ensuring reproducibility across computing environments. The Galaxy Project offers a web-based platform for accessible, reproducible computational biology.
These resources lower barriers to entry. Researchers without extensive computational expertise can access state-of-the-art tools through user-friendly interfaces. This democratization is essential for equitable participation in biological AI research.
OpenFold Consortium
The OpenFold Consortium emerged in response to AlphaFold 2’s initial restricted release and continued with AlphaFold 3. The consortium’s mission is to create fully open-source alternatives to proprietary structural biology AI.
OpenFold-3, currently in development, aims for bitwise equivalence with AlphaFold 3 while remaining fully open. This effort demonstrates the community’s commitment to openness even when major players restrict their releases.
Governance Frameworks: Finding the Balance
White House Executive Order on AI Safety
The October 2023 White House Executive Order on AI Safety includes provisions for biological AI:
- Screening requirements: Companies developing models that could generate biological sequences must implement nucleic acid sequence screening
- Reporting requirements: Developers of powerful dual-use models must share safety test results with the government
- Research funding: Increased investment in AI safety research, including biosecurity
These requirements acknowledge both the potential and the risks of biological AI. Implementation details are still being developed, but the framework signals that biological AI will face specific governance beyond general AI regulation.
NIH Guidelines and SecureDNA
The NIH has established guidelines for research involving synthetic nucleic acids, which apply to AI-generated sequences. SecureDNA, a screening tool developed through international collaboration, enables sequence providers to check generated sequences against databases of concern.
These technical and policy mechanisms aim to prevent misuse while enabling legitimate research. Their effectiveness depends on widespread adoption and international coordination.
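SecureDNA’s actual pipeline uses privacy-preserving cryptographic hashing against curated hazard databases, but the core idea of sequence screening can be sketched far more simply: index every length-k window of each sequence of concern, then flag any synthesis order that shares one of those windows. The strings below are synthetic placeholders, not real sequences of concern:

```python
def build_index(watchlist, k=20):
    """Index every length-k window of each sequence of concern."""
    index = set()
    for seq in watchlist:
        for i in range(len(seq) - k + 1):
            index.add(seq[i:i + k])
    return index

def screen(order, index, k=20):
    """Return True if any length-k window of the order hits the index."""
    return any(order[i:i + k] in index for i in range(len(order) - k + 1))

# Toy watchlist and orders; k=10 only because the example strings are short.
watchlist = ["ATGCGTACGTTAGCCGATCGATTACGGCTA"]
idx = build_index(watchlist, k=10)

hit = screen("CCCC" + watchlist[0][5:20] + "GGGG", idx, k=10)  # overlaps watchlist
clean = screen("ATATATATATATATATATATATAT", idx, k=10)          # no overlap
```

Exact matching like this is easy to evade with silent mutations, which is one reason production screening systems add fuzzy matching, translation to protein space, and hashed lookups so the watchlist itself is never exposed.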
CARE Principles for Indigenous Data Governance
The CARE Principles (Collective Benefit, Authority to Control, Responsibility, Ethics) provide a framework for Indigenous genomic data governance. These principles recognize that data sovereignty extends beyond individual consent to collective rights.
As biological AI models are trained on increasingly diverse genomic data, respecting Indigenous data sovereignty becomes essential. Open data initiatives must incorporate these principles to avoid perpetuating historical exploitation.
International Coordination
Biological AI governance requires international coordination. Models developed in one country can be accessed globally; misuse in one region affects all. Efforts include:
- OECD AI Principles: International standards for responsible AI
- WHO guidance: Health-specific AI governance frameworks
- BWC (Biological Weapons Convention): International treaty addressing biological weapons, increasingly relevant to AI-enabled biology
Effective governance will require ongoing dialogue between researchers, policymakers, and civil society across national boundaries.
Practical Implications for Researchers
Choosing Models for Your Work
When selecting biological AI models for research, consider:
- License terms: Can you use the model for your intended purpose (academic, commercial, clinical)?
- Reproducibility: Are the code and model weights available, or is there only API access?
- Community support: Is there an active user community for troubleshooting and improvements?
- Documentation: Are there clear usage guidelines, limitations, and best practices?
- Longevity: Will the model be maintained, or is it a one-off release?
Open models generally score better on these criteria, but proprietary models may offer unique capabilities or support.
Contributing to Open Biological AI
Researchers can support open biological AI by:
- Publishing openly: Release code, models, and data alongside papers
- Using open licenses: Choose licenses that enable reuse (Apache 2.0, MIT, CC-BY)
- Depositing in public repositories: Use GitHub, Hugging Face, Zenodo for long-term preservation
- Citing open resources: Acknowledge the databases and models you build upon
- Participating in community efforts: Contribute to OpenFold, BioContainers, Galaxy, and similar initiatives
Individual choices aggregate into cultural norms. The more researchers prioritize openness, the stronger the open ecosystem becomes.
Navigating Proprietary Tools
When proprietary tools are necessary (e.g., AlphaFold 3 for specific applications, commercial platforms for drug discovery):
- Understand the terms: Know what you can and cannot do with the tool
- Plan for continuity: What happens if access is revoked or pricing changes?
- Document limitations: Be transparent about tool constraints in publications
- Advocate for openness: Engage with vendors about opening access where possible
Proprietary tools have their place, but researchers should be intentional about dependence on them.
Conclusion: Openness as a Scientific Imperative
The battle between open and closed biological AI is not merely philosophical—it shapes who can participate in scientific progress, how quickly discoveries translate to applications, and whether the benefits of AI-driven biology are equitably distributed.
The evidence favors openness:
- Open models enable reproducibility, the foundation of scientific rigor
- Open data accelerates progress, allowing researchers to build on shared foundations
- Open access promotes equity, ensuring that AI benefits reach beyond wealthy institutions and nations
- Open governance improves safety, enabling community input on dual-use concerns
That said, legitimate concerns exist about biosecurity, commercial viability, and data privacy. The solution is not maximal openness at all costs, but thoughtful governance that balances these values.
The AlphaFold 3 episode offers lessons. Initial restriction provoked backlash and redundant open-source efforts. Eventual partial openness satisfied some critics but left others wanting more. Future releases of powerful biological AI would benefit from:
- Early community engagement: Discuss release plans before publication
- Clear justification: Explain restrictions transparently, with specific safety or commercial rationales
- Graduated access: Consider tiered release based on risk and use case
- Commitment to eventual openness: Even if initial release is restricted, plan for broader access
As agentic omics systems emerge—combining LLM reasoning with domain-specific models like AlphaFold, ESM, and scGPT—the openness question becomes even more critical. Will these powerful systems be accessible to researchers worldwide, or concentrated in a few well-funded organizations?
The choices made now will shape biological AI for years to come. The scientific community should advocate for openness while engaging constructively with legitimate safety and commercial concerns. The goal is not open vs. closed as a binary, but a thriving ecosystem where openness is the default, restrictions are justified and minimal, and the benefits of biological AI are shared broadly.
Glossary
| Term | Definition |
|---|---|
| Open Source | Software released with source code that can be inspected, modified, and redistributed under defined license terms |
| Proprietary | Software owned and controlled by a single entity, with restricted access to source code and usage rights |
| Non-Commercial License | License permitting academic and personal use but restricting commercial applications without separate agreement |
| Dual-Use | Technology that can be used for both beneficial and harmful purposes |
| FAIR Principles | Guidelines for data management: Findable, Accessible, Interoperable, Reusable |
| CARE Principles | Framework for Indigenous data governance: Collective Benefit, Authority to Control, Responsibility, Ethics |
| Model Weights | Learned parameters of a neural network that determine its behavior; distinct from inference code |
| API Access | Access to a model through a web interface rather than local deployment |
| Federated Learning | Training models across distributed datasets without centralizing sensitive data |
| Biosecurity | Measures to prevent misuse of biological research and technology for harmful purposes |
References
- Abramson, J., et al. “Accurate structure prediction of biomolecular interactions with AlphaFold 3.” Nature (2024). https://www.nature.com/articles/s41586-024-07487-w
- Watershed Bio. “Foundation Models for Biomedical Research.” (2024–2025).
- Zhou, Z., et al. “DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome.” ICLR (2024). https://github.com/MAGICS-LAB/DNABERT_2
- Cui, H., et al. “scGPT: toward building a foundation model for single-cell multi-omics using generative AI.” Nature Methods 21, 1470–1480 (2024). https://www.nature.com/articles/s41592-024-02201-0
- EvolutionaryScale. “ESM-3: Open Generative Model for Protein Design.” (2024). https://github.com/evolutionaryscale/esm
- Nguyen, E., et al. “Evo: A 7B-parameter model trained on 300B nucleotides spanning all domains of life.” Science (2024).
- ASBMB Today. “Why AlphaFold 3 needs to be open source.” (July 2024). https://www.asbmb.org/asbmb-today/opinions/070724/why-alphafold3-needs-to-be-open-source
- Nature News. “AI protein-prediction tool AlphaFold3 is now more open.” (November 2024). https://www.nature.com/articles/d41586-024-03708-4
- OpenFold Consortium. “OpenFold-3: A fully open source biomolecular structure prediction model.” https://openfold.io/
- Hugging Face. “Biology Models Collection.” https://huggingface.co/models?other=biology
- White House. “Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence.” (October 2023).
- Global Indigenous Data Alliance. “CARE Principles for Indigenous Data Governance.” (2019).
- BioMistral. “BioMistral-7B: A Collection of Open-Source Pretrained Large Language Models for Medical Domains.” (2024). https://huggingface.co/BioMistral/BioMistral-7B
- NVIDIA. “BioCLIP 2: Biology Model Trained on NVIDIA GPUs Identifies Over a Million Species.” (November 2025). https://blogs.nvidia.com/blog/bioclip2-foundation-ai-model/
- Insilico Medicine. “ISM001-055 Phase II Clinical Trial for Idiopathic Pulmonary Fibrosis.” (2024–2025).