Platform

A foundation model built for plant genomes, not repurposed from protein models.

Designed from the ground up for plant genomics: intron-aware tokenization, multi-species joint training, and variant-level embeddings validated against GWAS hits across five crops. Purpose-built, not adapted.


Architecture

Three design decisions that matter

Every architecture choice was made specifically for plant genomics — not carried over from NLP or protein models.

01 — Tokenization

k-mer tokenizer tuned for plant genomic repetitiveness

Plant genomes are 40-80% repetitive sequence — transposons, tandem repeats, centromeric arrays. Standard BPE tokenizers collapse these into noise. Our k-mer vocabulary was built from a representative cross-species corpus to preserve signal from repetitive regions rather than discard it.
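As a rough illustration of why k-mer tokens keep repeat structure visible, the sketch below splits a sequence into overlapping k-mers and counts them. The choice of k=6, the stride of 1, and the function names are assumptions made for this example; the actual vocabulary and its parameters are not specified here.

```python
from collections import Counter


def kmer_tokenize(sequence: str, k: int = 6, stride: int = 1) -> list[str]:
    """Split a DNA sequence into k-mers, skipping windows with ambiguous bases."""
    seq = sequence.upper()
    tokens = []
    for start in range(0, len(seq) - k + 1, stride):
        kmer = seq[start:start + k]
        # Keep only windows made entirely of A, C, G, T (drops N and other codes).
        if set(kmer) <= set("ACGT"):
            tokens.append(kmer)
    return tokens


def kmer_counts(sequence: str, k: int = 6) -> Counter:
    """Count how often each k-mer occurs in the sequence."""
    return Counter(kmer_tokenize(sequence, k))


# A toy tandem repeat: the same 6-base unit four times in a row.
tandem_repeat = "ACGTAC" * 4
counts = kmer_counts(tandem_repeat)
print(len(counts), counts["ACGTAC"])
```

In this toy tandem repeat, the repeat unit shows up as a small set of high-count tokens instead of being split at arbitrary subword boundaries.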

02 — Pre-training objective

Masked sequence modeling + next-region prediction

Two complementary objectives: masked k-mer prediction teaches local variant context; next-region prediction learns long-range genomic structure across introns and regulatory elements. Pre-trained on 200M+ sequences across 47 species in joint multi-species batches — population variation is encoded, not averaged out.
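The masked k-mer objective can be pictured with a minimal masking routine like the one below. The 15% mask rate, the `[MASK]` symbol, and the function name are assumptions for illustration, not details of the actual training setup.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol


def mask_tokens(tokens, mask_rate=0.15, seed=None):
    """Replace a random subset of tokens with MASK.

    Returns the masked token list and a dict mapping each masked
    position to the original token the model would have to predict.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate)) if tokens else 0
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK
    return masked, targets


tokens = ["ACGTAC", "CGTACA", "GTACAC", "TACACG", "ACACGT", "CACGTA"]
masked, targets = mask_tokens(tokens, mask_rate=0.3, seed=1)
print(masked)
print(targets)
```

With overlapping k-mers, neighbouring tokens share most of their bases, so masking isolated tokens leaks the answer; pipelines of this kind typically mask contiguous spans instead.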

03 — Embedding geometry

512-dim space validated against GWAS hit loci

Embedding quality was validated by checking whether known GWAS-significant variant positions cluster in interpretable sub-regions of the embedding space — without label data. Correlation with GRIN phenotype records: ~91% across 12,000 maize accessions. This is not a proxy metric — it's a direct alignment test.
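One simple way to probe the kind of clustering described above is to compare how similar the embeddings of GWAS-significant variants are to each other versus to background variants. The sketch below is an illustrative check on toy vectors, not the validation procedure used for the model; the function names and the 2-dimensional example vectors are assumptions.

```python
import math
from itertools import combinations, product


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def clustering_gap(hit_vectors, background_vectors):
    """Mean hit-to-hit similarity minus mean hit-to-background similarity.

    A clearly positive gap means the hits sit closer to each other than
    to the background. Needs at least two hits and one background vector.
    """
    within = [cosine(a, b) for a, b in combinations(hit_vectors, 2)]
    across = [cosine(a, b) for a, b in product(hit_vectors, background_vectors)]
    return sum(within) / len(within) - sum(across) / len(across)


# Toy 2-dimensional stand-ins for 512-dimensional embeddings.
hits = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]]
background = [[0.0, 1.0], [0.1, 0.9]]
print(clustering_gap(hits, background) > 0)
```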

Training data

200M+ sequences. 47 crops. 120 wild relatives.

Scale and species breadth are not marketing metrics — they are the precondition for transfer learning to work in low-label plant breeding contexts.

200M+

plant sequences in pre-training corpus

47

crop species with full-genome coverage

120

wild relative taxa for diversity capture

Species group	Sequences	Assembly quality	Primary source
Cereals (wheat, maize, barley, sorghum, rice)	~82M	Chromosome-level (T2T for wheat)	Ensembl Plants + URGI
Solanaceae (tomato, pepper, potato)	~31M	Chromosome-level + scaffold	NCBI SRA + Sol Genomics
Legumes (soybean, pea, chickpea)	~44M	Chromosome-level	LegumeInfo + NCBI
Brassicas + root crops	~19M	Scaffold + chromosome	Ensembl Plants
Wild relatives + landraces	~24M	Mixed (scaffold-level accepted)	GRIN + internal curation

All sequences passed our deduplication and quality-filtering pipeline. Assembly artifacts, mislabeled cultivars, and contaminant sequences were removed before tokenization. Data sourcing details available on request.
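To make the deduplication and quality-filtering step concrete, here is a minimal sketch of exact-duplicate removal plus two basic quality checks. The thresholds (5% ambiguous bases, 50-base minimum length) and the function name are assumptions for illustration; the sketch does not attempt contaminant screening or cultivar-label checks, and it is not the pipeline described above.

```python
import hashlib


def filter_sequences(records, max_n_fraction=0.05, min_length=50):
    """Keep the first copy of each sequence that passes basic quality checks.

    records: iterable of (name, sequence) pairs.
    """
    seen = set()
    kept = []
    for name, seq in records:
        seq = seq.upper()
        if len(seq) < min_length:
            continue  # too short to be informative
        if seq.count("N") / len(seq) > max_n_fraction:
            continue  # too many ambiguous bases
        digest = hashlib.sha256(seq.encode("ascii")).hexdigest()
        if digest in seen:
            continue  # exact duplicate (case-insensitive)
        seen.add(digest)
        kept.append((name, seq))
    return kept


records = [
    ("a", "ACGT" * 20),
    ("b", "acgt" * 20),             # duplicate of "a" after upper-casing
    ("c", "N" * 40 + "ACGT" * 10),  # half ambiguous bases
    ("d", "ACG"),                   # too short
]
print([name for name, _ in filter_sequences(records)])
```

Hashing the sequence keeps memory bounded by the number of distinct sequences rather than their total length, which matters at corpus scale.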

API

Three primary endpoints

The Living Models API is REST-based, with a Python SDK wrapper. Authentication is via an API key sent in the X-API-Key request header. All responses include confidence bounds and known-limitation flags.

POST

/v1/embed

Submit a VCF, FASTA, or SNP array CSV. Returns a 512-dimensional embedding vector for each input accession.

POST

/v1/predict

Submit a VCF and a list of target traits. Returns predicted trait scores with confidence intervals, plus a limitation flag if the input falls outside the model's training distribution.

POST

/v1/fine-tune

Submit labeled phenotype data (minimum 200 accessions). Trains an adapter layer on top of the frozen foundation model. Returns a fine-tuned model ID for your breeding program.
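Before submitting a fine-tuning job, it can help to confirm locally that the phenotype file meets the 200-accession minimum. The sketch below counts distinct accessions that have a non-empty trait value; the CSV layout and the column names `accession_id` and `trait_value` are hypothetical, since the expected upload schema is not described here.

```python
import csv
import io

MIN_ACCESSIONS = 200  # minimum stated for /v1/fine-tune


def count_labeled_accessions(csv_text, id_column="accession_id",
                             trait_column="trait_value"):
    """Count distinct accessions that have a non-empty trait value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    labeled = set()
    for row in reader:
        accession = (row.get(id_column) or "").strip()
        value = (row.get(trait_column) or "").strip()
        if accession and value:
            labeled.add(accession)
    return len(labeled)


def meets_minimum(csv_text, **columns):
    """True if the file has at least MIN_ACCESSIONS labeled accessions."""
    return count_labeled_accessions(csv_text, **columns) >= MIN_ACCESSIONS


sample = "accession_id,trait_value\nACC1,0.82\nACC2,\nACC1,0.79\n"
print(count_labeled_accessions(sample), meets_minimum(sample))
```

Counting distinct accessions rather than rows avoids overstating the dataset size when the same accession was phenotyped in several trials.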

Full API reference documentation available after access request approval.

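import os

import requests

# Read the API key from the environment rather than hard-coding it
# (the variable name here is illustrative).
api_key = os.environ['LIVINGMODELS_API_KEY']

response = requests.post(
    'https://api.livingmodels.com/v1/predict',
    headers={'X-API-Key': api_key},
    json={
        'vcf_path': 'candidates_batch.vcf',
        'traits': ['drought_tolerance', 'yield_stability'],
        'species': 'triticum_aestivum'
    },
    timeout=60,
)
response.raise_for_status()  # fail loudly on 4xx/5xx responses

result = response.json()
# result['predictions'][0]
# {'drought_tolerance': {'score': 0.847, 'ci': [0.81, 0.88]},
#  'yield_stability': {'score': 0.712, 'ci': [0.67, 0.76]},
#  'out_of_distribution': False}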
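A response shaped like the example above can be screened before it feeds a selection decision. The sketch below keeps only in-distribution predictions whose confidence interval is reasonably narrow; the function name and the 0.10 width threshold are assumptions for illustration, and the response layout is taken from the example rather than from the full API reference.

```python
def usable_predictions(result, max_ci_width=0.10):
    """Return (candidate_index, trait, score) for trustworthy predictions.

    Skips candidates flagged as out of distribution and traits whose
    confidence interval is wider than max_ci_width.
    """
    rows = []
    for index, prediction in enumerate(result.get("predictions", [])):
        if prediction.get("out_of_distribution"):
            continue
        for trait, value in prediction.items():
            if trait == "out_of_distribution":
                continue  # the flag is not a trait entry
            low, high = value["ci"]
            if high - low <= max_ci_width:
                rows.append((index, trait, value["score"]))
    return rows


example = {
    "predictions": [
        {"drought_tolerance": {"score": 0.847, "ci": [0.81, 0.88]},
         "yield_stability": {"score": 0.712, "ci": [0.67, 0.76]},
         "out_of_distribution": False},
        {"drought_tolerance": {"score": 0.910, "ci": [0.88, 0.93]},
         "out_of_distribution": True},
    ]
}
print(usable_predictions(example))
```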

Integrations

The four input formats plant genomics actually uses

No custom conversion is required for VCF, FASTA, or SNP array CSV; PLINK binary support is in development. We handle format parsing on ingest, so you start from your existing pipeline output.

REST API

Standard HTTP endpoints. Language-agnostic. Works with any stack that can make POST requests.

Available

Python SDK

Install with pip install livingmodels. A high-level wrapper with batch processing, retry logic, and pandas DataFrame output.

Available

VCF / FASTA upload

Direct file upload for standard plant genomics formats. VCF 4.2+, multi-FASTA, SNP array CSV. No pre-processing required.

Available

PLINK format

.bed/.bim/.fam binary PLINK format, for labs whose population-structure workflows already output PLINK files. Files will be parsed on ingest, with no manual conversion to VCF required.

In development

Start with a benchmark run.

Submit a sample dataset and receive trait predictions to evaluate model quality before committing to an integration. No contract required for the initial benchmark.
