Introduction: The Promise and the Peril

Precision medicine promised to treat each patient as an individual — to move beyond one-size-fits-all therapies to interventions tailored to your unique biology. AI-driven omics seemed poised to accelerate this vision: algorithms that could read your genome, interpret your proteome, and predict your disease risk with unprecedented accuracy.

But there’s a problem. The data powering these algorithms is profoundly unrepresentative of human diversity.

As of 2024, over 94% of participants in genome-wide association studies (GWAS) are of European ancestry, despite Europeans comprising only about 16% of the global population. This imbalance isn’t just a statistical curiosity — it has real consequences. Polygenic risk scores trained on European data perform significantly worse for individuals of African, Asian, Hispanic, and Indigenous ancestry. Variant classification algorithms misclassify pathogenic mutations in underrepresented populations. And the AI tools now entering clinical practice risk cementing these disparities into healthcare systems worldwide.

This post examines the ethical dimensions of omics AI. We’ll explore where bias comes from, what initiatives are working to address it, and what it will take to build AI systems that serve all of humanity — not just a privileged subset.


The Scale of the Problem

Genomic Databases Are Overwhelmingly European

The numbers are stark. As of September 2024, the GWAS Diversity Monitor showed that 94.48% of GWAS participants were of European ancestry, with Asian populations at 3.96% and all other groups (African, Hispanic/Latino, South Asian, Indigenous) each below 1%.

This isn’t a new problem. A landmark 2016 Nature analysis found 81% of GWAS participants were European. Despite years of advocacy, the non-European share has not grown; by these figures it has fallen from about 19% to under 6%. The absolute numbers of non-European participants have grown, but European-ancestry samples have grown faster.

Why does this matter for AI? Because machine learning models learn patterns from their training data. If your training data is 94% European, your model will be optimized for European biology. It will learn European-specific linkage disequilibrium patterns, European-specific allele frequencies, and European-specific genotype-phenotype relationships.

When you deploy that model on a patient of African ancestry, it’s operating outside its training distribution. The predictions become unreliable.

Polygenic Risk Scores: A Case Study in Bias

Polygenic risk scores (PRS) illustrate the problem concretely. PRS aggregate the effects of thousands of genetic variants to estimate disease risk. They’re increasingly used in clinical settings for conditions like breast cancer, coronary artery disease, and type 2 diabetes.
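
To make the aggregation concrete: a PRS is commonly computed as a weighted sum of risk-allele counts, with the weights taken from GWAS effect-size estimates. The sketch below assumes that simple additive form. The variant IDs, weights, and genotype are invented for illustration, and the function name is hypothetical; real pipelines add steps such as variant selection and score standardization that are left out here.

```python
from typing import Dict


def polygenic_risk_score(effect_sizes: Dict[str, float],
                         dosages: Dict[str, int]) -> float:
    """Weighted sum of effect-allele counts (0, 1, or 2 per variant).

    Variants missing from the genotype are treated as contributing 0,
    which is a simplification made for this sketch.
    """
    score = 0.0
    for variant_id, beta in effect_sizes.items():
        score += beta * dosages.get(variant_id, 0)
    return score


# Hypothetical per-variant weights (e.g. log odds ratios from a GWAS).
effect_sizes = {"rsA": 0.25, "rsB": -0.25, "rsC": 0.125}
# Hypothetical genotype: number of effect alleles carried at each variant.
dosages = {"rsA": 2, "rsB": 1, "rsC": 0}

prs = polygenic_risk_score(effect_sizes, dosages)
print(prs)  # 0.25*2 + (-0.25)*1 + 0.125*0 = 0.25
```

Because the weights are estimated in the discovery cohort, a score like this carries that cohort’s allele frequencies and linkage patterns with it.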

But PRS performance degrades dramatically with genetic distance from the training population. A 2019 study in Nature Genetics (Martin et al.) showed that PRS trained on European data had 4.5-fold lower prediction accuracy in African ancestry populations compared to Europeans. For some conditions, the scores were essentially uninformative.

A December 2025 review in Frontiers in Genetics examining cardiovascular GWAS found that clinical tools combining PRS with traditional risk factors “risk compounding bias in risk stratification” for African ancestry patients. The authors called for “ancestry-aware combination of clinical and genomic risk estimation” — but noted that most commercial tools don’t yet implement this.

The clinical implications are serious. A patient of African ancestry might be told they’re at low risk for a condition they’re actually predisposed to — or vice versa. Treatment decisions based on these scores could be harmful.


Consequences Beyond Prediction Accuracy

Variant Classification Disparities

Clinical genomics relies on accurate variant classification: is this mutation pathogenic, benign, or of uncertain significance (VUS)? AI-powered variant interpretation tools like AlphaMissense (DeepMind, 2023) and PrimateAI have improved classification accuracy overall.

But they inherit the biases of their underlying databases. ClinVar, the primary repository of clinical variant interpretations, is heavily skewed toward European-ancestry submissions. A variant that’s rare in Europeans but common in Africans might be misclassified as pathogenic simply because it’s unfamiliar.
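
One way to see why database coverage matters: a common filter treats a variant as unlikely to be pathogenic if it is frequent in the general population. The sketch below uses made-up frequencies and an illustrative 5% cutoff (an assumption for this example, not a clinical rule to copy) to show how the same variant passes or fails that filter depending on which populations the database covers. The function and dictionary names are hypothetical.

```python
from typing import Dict

COMMON_THRESHOLD = 0.05  # illustrative cutoff; an assumption for this sketch


def looks_too_common_to_be_pathogenic(freqs: Dict[str, float],
                                      threshold: float = COMMON_THRESHOLD) -> bool:
    """True if the variant exceeds the threshold in ANY covered population."""
    return max(freqs.values(), default=0.0) > threshold


# Hypothetical variant: rare in Europeans, common in an African population.
european_only_db = {"european": 0.001}
diverse_db = {"european": 0.001, "african": 0.12}

print(looks_too_common_to_be_pathogenic(european_only_db))  # False
print(looks_too_common_to_be_pathogenic(diverse_db))        # True
```

With only European frequencies available, the filter has no evidence that the variant is common anywhere, so it stays a candidate for a pathogenic or uncertain call.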

A 2024 analysis found that individuals of African ancestry receive VUS classifications at 2-3 times the rate of Europeans for the same genes. This isn’t just academically interesting — a VUS result often means no actionable clinical guidance. Patients don’t get preventive interventions they might need, or they undergo unnecessary surveillance.

Drug Development and Diagnostic Testing

Biased databases affect drug development too. Pharmacogenomic variants — genetic differences that affect drug metabolism — vary substantially across populations. The FDA now includes pharmacogenomic information on over 350 drug labels, but the underlying data is predominantly European.

A November 2024 report from the University of Maryland School of Medicine warned that biased genomic databases “could skew results in areas such as drug development, diagnostic testing, and polygenic risk scores.” Drugs dosed based on European metabolism data might be under- or over-dosed for other populations. Diagnostic tests optimized for European variants might miss disease-causing mutations in other groups.


Initiatives Addressing the Gap

All of Us Research Program

The NIH’s All of Us Research Program is perhaps the most ambitious effort to diversify genomic databases in the United States. Launched in 2018, it aims to enroll one million participants, with at least 50% from underrepresented communities.

As of 2025, All of Us has enrolled over 400,000 participants, with approximately 50% from historically underrepresented racial and ethnic groups. The program collects genomic data, electronic health records, wearable device data, and patient-reported outcomes — creating a rich, diverse resource for research.

Early results are promising. A 2024 analysis of All of Us data identified thousands of genetic variants previously unseen in European-biased databases, including clinically actionable variants in underrepresented populations. The program has also demonstrated that diverse recruitment is feasible at scale — countering claims that “minority communities don’t want to participate in research.”

But All of Us is a U.S. program. Global diversity requires global initiatives.

H3Africa: Building African Genomics Capacity

The Human Heredity and Health in Africa (H3Africa) initiative, launched in 2010, aimed to build genomics research capacity within Africa and generate African genomic data. By 2024, H3Africa included over 500 researchers across 30 African countries, had supported 480 PhD graduates and 467 trainees, held over 200 workshops, and published over 700 papers.

One of H3Africa’s landmark achievements was the 2020 Nature paper “High-depth African genomes inform human migration and health,” which uncovered over three million previously unknown genetic variants — variants absent from European-centric databases. These variants have implications for disease risk, drug response, and evolutionary history.

But H3Africa’s funding is ending. A February 2023 article on the H3Africa website warned: “Funds for a major genomics programme in Africa will run dry this year. A chance to address global inequity in health-related genomics by building on the success of this initiative must not be missed.” The initiative demonstrated what’s possible with sustained investment — but its future is uncertain.

GenomeAsia 100K

The GenomeAsia 100K Project aims to sequence 100,000 Asian individuals to create a comprehensive reference database for Asian populations. The pilot phase, published in Nature in 2019, sequenced 1,739 individuals from 219 population groups across Asia, identifying millions of novel variants.

The project has enabled genetic discoveries specific to Asian populations, including variants associated with drug metabolism, disease susceptibility, and population history. But the full 100,000-genome goal remains a work in progress, and the project has faced challenges in data sharing and commercial partnerships.

Three Million African Genomes

Building on H3Africa’s success, the Three Million African Genomes project was proposed to sequence three million African individuals — an order of magnitude larger than previous efforts. The project would capture Africa’s extraordinary genetic diversity (Africa is the most genetically diverse continent, reflecting humanity’s origins there) and create a resource for precision medicine across the continent.

But as of 2026, the project remains in the proposal stage. Funding for large-scale genomics in Africa remains scarce, despite the clear scientific and clinical value.


Algorithmic Fairness in Clinical Genomics

The Technical Challenge

Addressing bias isn’t just about collecting more diverse data — though that’s essential. It’s also about building algorithms that are fair across populations.

Several technical approaches are being explored:

  1. Ancestry-aware models: Training separate models for different ancestry groups, or including ancestry as an explicit input. This improves performance but requires sufficient data for each group.

  2. Transfer learning: Pre-training on diverse data, then fine-tuning on specific populations. This is promising but doesn’t solve the fundamental data scarcity problem.

  3. Fairness constraints: Building fairness metrics directly into model training, penalizing predictions that vary unfairly across groups. This is an active research area but not yet standard in clinical genomics.

  4. Multi-ancestry GWAS: Conducting genome-wide association studies in diverse cohorts from the start, rather than trying to correct European-trained models later. This is the gold standard but requires diverse recruitment.
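
As a minimal sketch of the third approach, a fairness term can be added to the training objective so that a model is penalized when its error differs across ancestry groups. The version below adds the gap between the worst and best per-group mean squared error to the overall error. This is only one of many possible penalties; the function names, the weight lam, and the group labels are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Sequence


def group_mse(y_true: Sequence[float], y_pred: Sequence[float],
              groups: Sequence[str]) -> Dict[str, float]:
    """Mean squared error computed separately for each group label."""
    sq_err = defaultdict(list)
    for t, p, g in zip(y_true, y_pred, groups):
        sq_err[g].append((t - p) ** 2)
    return {g: sum(errs) / len(errs) for g, errs in sq_err.items()}


def fairness_penalized_loss(y_true: Sequence[float], y_pred: Sequence[float],
                            groups: Sequence[str], lam: float = 1.0) -> float:
    """Overall MSE plus lam times the gap between worst and best group MSE."""
    per_group = group_mse(y_true, y_pred, groups)
    overall = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    gap = max(per_group.values()) - min(per_group.values())
    return overall + lam * gap


# Hypothetical predictions: perfect for one group, wrong for the other.
y_true = [1.0, 0.0, 1.0, 0.0]
y_pred = [1.0, 0.0, 0.0, 1.0]
groups = ["EUR", "EUR", "AFR", "AFR"]

print(fairness_penalized_loss(y_true, y_pred, groups, lam=0.0))  # 0.5
print(fairness_penalized_loss(y_true, y_pred, groups, lam=1.0))  # 1.5
```

With lam set to 0 the objective is ordinary mean squared error; raising it makes the between-group gap count for more relative to overall fit.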

The same December 2025 Frontiers in Genetics review emphasized that “clinical tools are increasingly used alongside PRS to guide preventive therapy,” which makes the need for ancestry-aware models urgent. But most commercial tools still don’t implement these approaches.

The Regulatory Gap

The FDA has approved several AI/ML-based genomic diagnostics, including FoundationOne CDx and various companion diagnostics. But regulatory requirements for demonstrating fairness across populations are limited.

The FDA’s 2021 action plan for AI/ML-based software as a medical device mentions “bias” but doesn’t mandate specific fairness metrics or diverse validation cohorts. Manufacturers can (and do) gain approval with predominantly European validation data.

This is changing slowly. The FDA’s 2024 draft guidance on AI/ML in diagnostics recommends “diverse representation in training and validation datasets” but stops short of requiring it. Without regulatory teeth, market incentives favor the path of least resistance — which often means European-centric data.


Indigenous Data Sovereignty and the CARE Principles

Beyond FAIR: The CARE Principles

The FAIR data principles (Findable, Accessible, Interoperable, Reusable) have been the gold standard for scientific data sharing. But for Indigenous communities, FAIR isn’t enough. FAIR emphasizes openness and accessibility — values that can conflict with Indigenous data sovereignty.

Enter the CARE Principles for Indigenous Data Governance:

  • Collective Benefit: Data should benefit the Indigenous communities from which it comes.
  • Authority to Control: Indigenous peoples have the right to control data about them.
  • Responsibility: Researchers and institutions have responsibilities to Indigenous communities.
  • Ethics: Research should align with Indigenous values and ethics.

The CARE Principles were formalized in 2019 by the Global Indigenous Data Alliance and have gained traction in 2024-2025. A November 2025 resource from the Australian Research Data Commons emphasizes that researchers “should apply [CARE] if your research involves Indigenous data.”

What CARE Means in Practice

Applying CARE principles requires fundamental shifts in how genomics research is conducted:

  1. Community engagement before data collection: Indigenous communities should be partners in research design, not just subjects. This means early consultation, co-design of studies, and community approval.

  2. Data governance structures: Indigenous communities should have control over how their data is stored, accessed, and used. This might mean data stays within community-controlled databases rather than public repositories.

  3. Benefit sharing: Research should provide tangible benefits to participating communities — not just publications for researchers. This could mean capacity building, healthcare improvements, or economic benefits.

  4. Respect for cultural values: Some Indigenous communities have cultural restrictions on genetic research (e.g., concerns about ancestry testing, restrictions on research after death). These must be respected.

A 2023 Nature Ecology & Evolution paper on applying CARE principles to ecology research noted that Indigenous peoples are “increasingly being sought out for research partnerships that incorporate Indigenous Knowledges” — but emphasized that such partnerships must be ethical and responsible, avoiding “extractive helicopter research practices.”

Tensions with Open Science

The CARE principles can create tensions with open science norms. Genomics has traditionally emphasized open data sharing — databases like GenBank, dbGaP, and the SRA are built on the principle that data should be freely available to researchers worldwide.

But Indigenous communities may prefer controlled access. They may want to approve each research use case. They may want to restrict certain types of research (e.g., ancestry inference, population history).

Resolving these tensions requires nuance. The CARE principles don’t reject openness — they insist that openness must be balanced with community rights. A March 2021 University of Wisconsin resource notes that “opportunities for increasing control are directly connected to properly identifying Indigenous data and making Indigenous data FAIR” — suggesting that FAIR and CARE can be complementary, not contradictory.


Access and Equity: Who Benefits?

The Geography of Precision Medicine

Even if we solve the data diversity problem, there’s another equity challenge: access. Precision medicine — especially AI-driven precision medicine — requires infrastructure: sequencing labs, computational resources, clinical genetics expertise, and healthcare systems that can act on genomic information.

This infrastructure is concentrated in wealthy countries. A May 2025 market report found that North America dominated the precision medicine market with 54% share in 2024, while Africa — despite having the world’s greatest genetic diversity — accounts for less than 1%.

The consequences are stark. A patient in Boston can get whole-genome sequencing, AI-powered variant interpretation, and targeted therapy selection. A patient in Lagos with the same condition might not have access to basic genetic testing.

AI Could Widen or Narrow the Gap

AI has the potential to narrow this gap — by automating interpretation, reducing costs, and enabling remote expertise. Cloud-based AI tools could, in principle, make sophisticated genomic analysis available anywhere with internet access.

But there’s a risk AI widens the gap instead. If the best AI tools are proprietary, expensive, or require infrastructure that low-resource settings lack, they’ll benefit only the already-advantaged.

A January 2025 article from AUDA-NEPAD noted that the African Union adopted a Continental Strategy for Artificial Intelligence (2024–2030) to “leverage AI responsibly across sectors” — including health. But strategy documents don’t build labs or train bioinformaticians. Implementation requires sustained investment.

The Cost Problem

The cost of sequencing a human genome has dropped from roughly $100 million in the early 2000s, around the time the Human Genome Project finished, to under $1,000 today. But sequencing is just the start. Interpretation, clinical validation, and targeted therapies add substantial cost.

AI could reduce interpretation costs — but only if the AI tools are affordable. Many commercial genomic AI platforms charge thousands of dollars per analysis, pricing them out of reach for most healthcare systems globally.

Open-source alternatives exist — tools like GATK, DeepVariant, and AlphaFold are freely available. But they require computational expertise and infrastructure that may not be available in low-resource settings. Capacity building is as important as tool development.


Responsible AI Principles for Biological Research

What would responsible, equitable omics AI look like? Several frameworks are emerging:

1. Diverse by Design

Diversity shouldn’t be an afterthought — it should be built into study design from the start. This means:

  • Setting explicit diversity targets for data collection
  • Partnering with communities early in research design
  • Budgeting for diverse recruitment (which often costs more)
  • Publishing demographic breakdowns of training data
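
A demographic breakdown can be as simple as a table of group shares compared against the targets a study set for itself. The sketch below assumes each participant carries a single ancestry label, which is itself a simplification; the labels, counts, and target shares are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def ancestry_breakdown(labels: Sequence[str]) -> Dict[str, float]:
    """Share of participants per ancestry label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def shortfalls(breakdown: Dict[str, float],
               targets: Dict[str, float]) -> Dict[str, float]:
    """Groups whose observed share is below target, with the size of the gap."""
    return {
        group: target - breakdown.get(group, 0.0)
        for group, target in targets.items()
        if breakdown.get(group, 0.0) < target
    }


# Hypothetical cohort of 100 participants and hypothetical recruitment targets.
labels = ["EUR"] * 94 + ["EAS"] * 4 + ["AFR"] * 2
targets = {"AFR": 0.25, "EAS": 0.25}

breakdown = ancestry_breakdown(labels)
gaps = shortfalls(breakdown, targets)
print({group: round(gap, 2) for group, gap in gaps.items()})
# {'AFR': 0.23, 'EAS': 0.21}
```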

2. Fairness Metrics and Reporting

AI models should be evaluated for fairness across populations, not just overall accuracy. This means:

  • Reporting performance metrics stratified by ancestry/ethnicity
  • Setting minimum performance thresholds for all groups
  • Publishing failure modes and limitations honestly
  • Avoiding claims of “clinical utility” without diverse validation
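
A stratified report along these lines takes only a few lines of code. The sketch below computes accuracy per ancestry group and flags any group that falls under a minimum threshold. The data, the group labels, and the 0.7 threshold are hypothetical; a real evaluation would use metrics suited to the task and report uncertainty alongside them.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def accuracy_by_group(y_true: Sequence[int], y_pred: Sequence[int],
                      groups: Sequence[str]) -> Dict[str, float]:
    """Fraction of correct predictions within each group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}


def groups_below_threshold(per_group: Dict[str, float],
                           minimum: float) -> List[str]:
    """Sorted list of groups whose metric is under the minimum."""
    return sorted(g for g, acc in per_group.items() if acc < minimum)


# Hypothetical labels and predictions for two groups of four people each.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
groups = ["EUR"] * 4 + ["AFR"] * 4

per_group = accuracy_by_group(y_true, y_pred, groups)
print(per_group)                               # {'EUR': 1.0, 'AFR': 0.5}
print(groups_below_threshold(per_group, 0.7))  # ['AFR']
```

Overall accuracy in this toy example is 0.75, which would clear the 0.7 bar on its own; only the stratified view shows that one group is served no better than a coin flip.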

3. Community Partnership

Research should be conducted with communities, not on them. This means:

  • Community advisory boards for genomics projects
  • Co-authorship with community representatives where appropriate
  • Returning results to participants in accessible formats
  • Long-term relationships, not extractive one-off studies

4. Capacity Building

Equity requires building capacity in underrepresented regions. This means:

  • Training programs for bioinformaticians and genetic counselors in Africa, Asia, Latin America
  • Infrastructure investment (sequencing, computing, storage)
  • Supporting local leadership, not just Northern institutions working “in” the Global South
  • Sustainable funding, not short-term grants

5. Open Where Possible, Protected Where Necessary

Data sharing should balance openness with community rights. This means:

  • Defaulting to open access for non-sensitive data
  • Respecting Indigenous data sovereignty and CARE principles
  • Tiered access for sensitive data (controlled rather than open)
  • Clear governance for data use decisions
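
One way to picture tiered access is as an explicit rule per tier. The sketch below is a hypothetical policy, not a description of any existing repository: the tier names, the request fields, and the approvals required at each tier are all assumptions chosen to illustrate the idea.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    OPEN = "open"                              # non-sensitive data
    CONTROLLED = "controlled"                  # sensitive, committee-reviewed
    COMMUNITY_GOVERNED = "community_governed"  # community holds authority


@dataclass(frozen=True)
class AccessRequest:
    identity_verified: bool
    committee_approved: bool
    community_approved: bool


def access_granted(tier: Tier, request: AccessRequest) -> bool:
    """Apply the (hypothetical) rule attached to each tier."""
    if tier is Tier.OPEN:
        return True
    if tier is Tier.CONTROLLED:
        return request.identity_verified and request.committee_approved
    # COMMUNITY_GOVERNED: the community's approval is the deciding factor.
    return request.identity_verified and request.community_approved


request = AccessRequest(identity_verified=True,
                        committee_approved=True,
                        community_approved=False)
print(access_granted(Tier.CONTROLLED, request))          # True
print(access_granted(Tier.COMMUNITY_GOVERNED, request))  # False
```

Writing the rules down this way also serves the last point in the list: the governance decision is visible in one place rather than scattered across ad hoc approvals.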

Honest Assessment: Where We Are and What’s Needed

Progress Made

We should acknowledge progress:

  • Awareness is higher: The diversity gap is now widely recognized. It’s no longer acceptable to publish a genomics paper without addressing ancestry composition.
  • Initiatives exist: All of Us, H3Africa, GenomeAsia 100K, and others are making real contributions.
  • Technical methods are improving: Ancestry-aware models, transfer learning, and fairness-aware algorithms are advancing.
  • Indigenous data sovereignty is gaining recognition: The CARE principles are being adopted by funders and institutions.

Gaps Remaining

But the gaps remain enormous:

  • The numbers haven’t shifted enough: 94% European in GWAS is unacceptable. We need sustained, large-scale investment in diverse data collection.
  • Clinical tools lag behind research: Research papers may report diverse validation, but commercial clinical tools often don’t. Regulatory requirements are weak.
  • Funding is uncertain: H3Africa’s funding is ending. The Three Million African Genomes project remains unfunded. Diversity initiatives are often the first cut in budget crises.
  • Access remains unequal: Even perfect AI tools won’t help patients who can’t access sequencing or clinical genetics services.
  • Indigenous sovereignty is still emerging: CARE principles are gaining traction, but implementation is inconsistent. Many Indigenous communities remain rightfully skeptical of genomics research.

What’s Needed

Addressing these gaps requires:

  1. Sustained funding: Diversity initiatives need long-term, stable funding — not short-term grants that end just as capacity is built.

  2. Regulatory teeth: The FDA and other regulators should require diverse validation for clinical genomic tools. Approval should be contingent on demonstrated fairness.

  3. Community trust-building: Genomics has a troubled history with many communities (e.g., the Havasupai case, Henrietta Lacks). Trust must be earned through consistent ethical behavior.

  4. Global infrastructure: Sequencing and computational capacity must be built in underrepresented regions, not just samples extracted and analyzed elsewhere.

  5. Honest accounting: Researchers and companies should report demographic breakdowns, performance disparities, and limitations transparently. Diversity claims without the numbers to back them help no one.


Conclusion: Equity as a Technical Requirement

Equity in omics AI isn’t just a moral imperative — it’s a technical requirement. AI models trained on unrepresentative data are scientifically inferior. They make less accurate predictions, miss important biology, and fail when deployed in real-world diverse populations.

Building equitable omics AI requires more than good intentions. It requires:

  • Diverse data collected through ethical partnerships
  • Fair algorithms evaluated across populations
  • Regulatory standards that enforce equity
  • Global capacity to ensure benefits are shared
  • Indigenous sovereignty respected through CARE principles

The vision of precision medicine — treatments tailored to your unique biology — can only be realized if “your” includes all of humanity, not just a privileged subset. AI has the potential to accelerate this vision or to cement existing disparities. The choice is ours.


Glossary

  • GWAS (Genome-Wide Association Study): A study that scans genomes across many individuals to find genetic variants associated with a trait or disease.
  • Polygenic Risk Score (PRS): A number that summarizes the combined effect of many genetic variants on disease risk.
  • Linkage Disequilibrium: The non-random association of alleles at different genetic loci; patterns vary across populations.
  • Variant of Uncertain Significance (VUS): A genetic variant for which there isn’t enough evidence to classify it as pathogenic or benign.
  • Pharmacogenomics: The study of how genetic variation affects drug response.
  • CARE Principles: Principles for Indigenous data governance (Collective Benefit, Authority to Control, Responsibility, Ethics).
  • FAIR Principles: Principles for scientific data management (Findable, Accessible, Interoperable, Reusable).
  • Data Sovereignty: The right of peoples or communities to control data about them.
  • Ancestry-Aware Models: AI models that explicitly account for genetic ancestry in training or prediction.
  • All of Us Research Program: NIH initiative to enroll one million diverse participants in a longitudinal health study.
  • H3Africa: Human Heredity and Health in Africa, an initiative building genomics capacity in Africa.
  • GenomeAsia 100K: Project to sequence 100,000 Asian individuals to create a reference database.

References

  1. Bridging genomics’ greatest challenge: The diversity gap. Cell Genomics. December 2024. Analysis of GWAS Diversity Monitor showing 94.48% European ancestry representation as of September 2024. https://pmc.ncbi.nlm.nih.gov/articles/PMC11770215/

  2. Martin AR, et al. “Clinical use of current polygenic risk scores may exacerbate health disparities.” Nature Genetics. 2019;51:584-591. Demonstrated 4.5-fold lower PRS accuracy in African vs. European ancestry populations.

  3. Jurado Vélez et al. “Ancestry gaps in cardiovascular GWAS: a multi-database review of African representation in genomic studies.” Frontiers in Genetics. December 2025. Review of bias in cardiovascular genomics tools. https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2025.1647176/full

  4. H3Africa Consortium. “H3Africa: a model for implementing biobank-based genomic research in resource-constrained settings.” PubMed. July 2025. Reported 500+ researchers, 480 PhD graduates, 700+ papers by 2024. https://pubmed.ncbi.nlm.nih.gov/40618283/

  5. Carroll SR, et al. “The CARE Principles for Indigenous Data Governance.” Data Science Journal. 2020;19(1):43. Formal statement of CARE principles for Indigenous data. https://datascience.codata.org/articles/dsj-2020-043

  6. All of Us Research Program. “Diverse genomic data from the All of Us Research Program.” Nature. 2024. Early results from 400,000+ participants with 50% from underrepresented groups.

  7. World Health Organization. “Global Strategy on Digital Health 2020-2025.” WHO, 2021. Framework for equitable digital health implementation.

  8. AUDA-NEPAD. “Harnessing Artificial Intelligence in Precision Medicine to Responsibly Transform Africa’s Health Landscape.” January 2025. Overview of African AI strategy and precision medicine initiatives. https://nepad.org/blog/harnessing-artificial-intelligence-precision-medicine-responsibly-transform-africas-health

  9. University of Maryland School of Medicine. “Genomic Databases Need More Diversity.” November 2024. Analysis of bias impacts on drug development and diagnostics. https://www.medschool.umaryland.edu/news/2024/genomic-databases-need-more-diversity.html

  10. Global Indigenous Data Alliance. “CARE Principles.” https://www.gida-global.org/care. Official resource for CARE principles implementation.