Fine-Tuning a Genomic Foundation Model When You Have 200 Samples

Cyril Veran

Most crop breeding programmes have 200 to 500 phenotyped lines per trait, per cycle. If they have 1,000 for a major commercial trait, that is exceptional. This is the data regime we spend the most time thinking about, because it is the common case for any programme that is not a large multinational with decades of historical phenotype records. Here is what we have found to work — and what does not — when fine-tuning a genomic foundation model with that kind of constraint.

Why Small-N Is Different for Genomic Models

The challenge in fine-tuning any large pre-trained model with limited labeled data is overfitting. But genomic data has a specific structure that makes the problem harder than general-domain transfer learning. Plant genomes have high linkage disequilibrium (LD), meaning SNPs in physical proximity are correlated. In a 200-sample dataset, the genomic background (hundreds of thousands of SNPs) offers far more dimensions of variation than 200 phenotype values can constrain. A model fine-tuned on all of its parameters can therefore fit noise in the background markers and produce predictions that generalise poorly to different genetic backgrounds, which is exactly the situation where you want good predictions.

Additionally, 200 phenotyped samples drawn from a single breeding cycle are not independent observations. They share recent common ancestry, meaning the effective diversity they represent is narrower than the N suggests. A Pearson correlation computed on 200 samples from a single half-sib family has different statistical properties than 200 samples drawn from a diverse panel.

Strategy 1: LoRA Adapter Fine-Tuning Over Full-Parameter Updates

Our preferred approach for small datasets is LoRA (Low-Rank Adaptation). Rather than updating all model weights during fine-tuning, LoRA injects trainable rank-decomposition matrices into specific layers — in our case, the attention projection matrices and the feed-forward intermediate layers — while keeping the pre-trained weights frozen. The number of trainable parameters drops from tens of millions to hundreds of thousands, making the optimisation problem substantially more tractable with 200 labeled examples.

from lmgn_sdk import LivingModelsClient, AdapterConfig

client = LivingModelsClient(api_key="your_key")

# Configure LoRA adapter for yield prediction task
adapter_config = AdapterConfig(
    task="regression",
    trait="yield_tonnes_per_hectare",
    lora_rank=8,           # low-rank dimension; 4-16 typical range
    lora_alpha=16,         # scaling parameter; usually 2x rank
    target_layers=["attn_q", "attn_v", "ffn_up"],
    dropout=0.1,
)

# Submit fine-tuning job with your labeled samples
job = client.fine_tune(
    base_model="lmgn-plant-v2",
    samples=labeled_samples,   # list of {sample_id, allele_dosages, phenotype_value}
    config=adapter_config,
    validation_split=0.2,      # held-out fraction
    max_epochs=50,
    early_stopping_patience=8,
)

print(f"Fine-tune job: {job.job_id}")
result = job.wait()
print(f"Validation Pearson r: {result.metrics['pearson_r']:.3f}")
print(f"Validation RMSE: {result.metrics['rmse']:.4f}")

With LoRA rank 8, we typically have fewer than 400,000 trainable parameters, a small fraction of the base model's weights. This means the model can fit reasonably to 200 samples without memorising them: the pre-trained weights stay frozen, so the representations they encode remain intact and continue to provide useful genomic context that fine-tuning cannot overwrite.
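A back-of-envelope count shows where a figure of this order comes from. The sketch below assumes a hypothetical base model (hidden size 384, feed-forward size 1536, 12 layers; none of these dimensions are stated for lmgn-plant-v2) and counts the two low-rank factors that LoRA adds to each adapted weight matrix.

```python
def lora_param_count(rank, shapes):
    """Trainable parameters LoRA adds for a set of weight matrices.

    Each adapted matrix of shape (d_out, d_in) gains two factors:
    A with shape (rank, d_in) and B with shape (d_out, rank).
    """
    return sum(rank * (d_in + d_out) for d_out, d_in in shapes)


# Hypothetical architecture, for illustration only.
HIDDEN = 384
FFN = 1536
N_LAYERS = 12
RANK = 8

# The three target layers used in the adapter config above.
per_layer_shapes = [
    (HIDDEN, HIDDEN),  # attn_q
    (HIDDEN, HIDDEN),  # attn_v
    (FFN, HIDDEN),     # ffn_up
]

per_layer = lora_param_count(RANK, per_layer_shapes)
total = per_layer * N_LAYERS

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

Under these assumed dimensions the total comes to 331,776 trainable parameters, and doubling the rank doubles the count.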

Strategy 2: Phenotype Normalisation Before Training

Field phenotype data from breeding programmes is almost never usable raw. Yield in tonnes per hectare from a 2022 trial in the Loire Valley is not directly comparable to yield from a 2023 trial in Champagne, even for the same variety, because of spatial heterogeneity, trial management differences, and year effects. If you feed raw field values into a fine-tuning loop, the model is partly learning year and location effects that have nothing to do with genotype.

We apply Best Linear Unbiased Prediction (BLUP) adjustment, or the simpler spatial correction of subtracting block means from field observations, before using phenotype values as fine-tuning targets. Even a simple adjustment substantially improves the genotype signal that the model is learning from. When clients provide raw field data without adjustment, we fit a model with trial, year and location effects using R's lme4 and use the residuals as training targets. We prefer transparent adjustment over no adjustment; imperfect normalisation is still better than none.
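The simplest of these adjustments, subtracting block means, can be sketched in a few lines. The record layout (`sample_id`, `block`, `value`) and the yields are made up for illustration. The sketch also assumes that blocks have similar average genetic merit; where that does not hold, block-mean subtraction removes genetic signal along with the environmental effect.

```python
from collections import defaultdict
from statistics import mean


def subtract_block_means(records):
    """Return {sample_id: phenotype minus the mean of its trial block}."""
    by_block = defaultdict(list)
    for rec in records:
        by_block[rec["block"]].append(rec["value"])
    block_mean = {block: mean(values) for block, values in by_block.items()}
    return {
        rec["sample_id"]: rec["value"] - block_mean[rec["block"]]
        for rec in records
    }


# Made-up yields (t/ha) from two hypothetical trial blocks.
records = [
    {"sample_id": "L001", "block": "2022-A", "value": 7.0},
    {"sample_id": "L002", "block": "2022-A", "value": 8.0},
    {"sample_id": "L003", "block": "2023-B", "value": 5.0},
    {"sample_id": "L004", "block": "2023-B", "value": 6.0},
]

adjusted = subtract_block_means(records)
for sample_id, value in adjusted.items():
    print(f"{sample_id}: {value:+.2f}")
```

After adjustment, L001 and L003 carry the same target value even though their raw yields differ by two tonnes per hectare, because each sits the same distance below its own block mean.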

Strategy 3: Data Augmentation via Synthetic Variants

With only 200 labeled samples, data augmentation can help. Our primary augmentation strategy for genomic data is to generate synthetic samples by randomly recombining haplotypes from the existing sample set, in effect simulating the progeny that would result from crossing existing labeled lines. This creates additional training examples whose phenotype is estimated by additive interpolation between the parent phenotypes (for an even mix of two parents, the mid-parent value).

The key limitation of this approach: additive interpolation only approximates the true genotype-phenotype mapping. For traits with significant dominance or epistatic effects, synthetic samples generated by simple haplotype recombination carry phenotype labels that are wrong in ways that could mislead fine-tuning. We apply augmentation selectively, only for traits where prior analysis suggests predominantly additive genetic architecture (narrow-sense heritability close to broad-sense heritability in historical cross data), and we cap augmentation at two synthetic samples per real sample (a 2:1 ratio) to avoid amplifying interpolation errors.
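A minimal sketch of this kind of augmentation follows. It makes several simplifying assumptions that are not from our pipeline: parents are treated as fully inbred lines (dosages of 0 or 2, so each line has a single haplotype), recombination is approximated by copying fixed-size marker blocks from one parent or the other, and the synthetic label is the parent phenotypes weighted by the share of markers inherited from each. The function names and sample layout are hypothetical.

```python
import random


def synthetic_progeny(parents, n_synthetic, block_size=4, seed=0):
    """Build mosaic genotypes from random parent pairs with interpolated labels."""
    rng = random.Random(seed)
    progeny = []
    for i in range(n_synthetic):
        p1, p2 = rng.sample(parents, 2)  # two distinct parents
        n_markers = len(p1["dosages"])
        dosages = []
        from_p1 = 0
        for start in range(0, n_markers, block_size):
            stop = min(start + block_size, n_markers)
            if rng.random() < 0.5:
                dosages.extend(p1["dosages"][start:stop])
                from_p1 += stop - start
            else:
                dosages.extend(p2["dosages"][start:stop])
        # Additive interpolation: weight each parent by its marker share.
        w = from_p1 / n_markers
        phenotype = w * p1["phenotype"] + (1 - w) * p2["phenotype"]
        progeny.append({
            "sample_id": f"syn_{i:04d}",
            "dosages": dosages,
            "phenotype": phenotype,
            "synthetic": True,
        })
    return progeny


def augment(real_samples, max_ratio=2.0, block_size=4, seed=0):
    """Append at most max_ratio synthetic samples per real sample."""
    n_synthetic = int(max_ratio * len(real_samples))
    return real_samples + synthetic_progeny(
        real_samples, n_synthetic, block_size=block_size, seed=seed
    )


# Three made-up inbred lines with eight markers each.
real = [
    {"sample_id": "L001", "dosages": [0, 0, 2, 2, 0, 2, 0, 0], "phenotype": 6.0},
    {"sample_id": "L002", "dosages": [2, 2, 2, 0, 0, 0, 2, 2], "phenotype": 7.0},
    {"sample_id": "L003", "dosages": [0, 2, 0, 2, 2, 2, 0, 2], "phenotype": 8.0},
]

augmented = augment(real, max_ratio=2.0, block_size=4, seed=1)
print(f"{len(real)} real + {len(augmented) - len(real)} synthetic samples")
```

Keeping a `synthetic` flag on each generated record makes it easy to exclude augmented samples from validation and calibration splits, where only real phenotypes should be scored.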

Validation: Cross-Validation Strategy Matters

Random k-fold cross-validation is misleading for breeding programme data. Because samples within a cycle share recent ancestry, a random split puts closely related lines in both training and validation, producing optimistically high validation correlations that do not predict how the model performs on the next breeding cycle's material.

We use leave-family-out or leave-cycle-out cross-validation instead. If samples can be grouped by family or year of development, each fold holds out one group entirely and trains on the others. This gives a more honest estimate of out-of-cycle generalisation, which is the prediction task that matters in practice. With 200 samples and four families of similar size, each fold has approximately 50 validation samples, which limits statistical precision, but the estimate at least targets the right prediction task.
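The grouped split itself is simple to write down. The sketch below is a plain-Python illustration with hypothetical family labels; scikit-learn's `LeaveOneGroupOut` offers the same behaviour for anyone already using that library.

```python
from collections import defaultdict


def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, validation_indices) per group."""
    members = defaultdict(list)
    for idx, group in enumerate(groups):
        members[group].append(idx)
    for group in sorted(members):
        val_idx = members[group]
        held_out = set(val_idx)
        train_idx = [i for i in range(len(groups)) if i not in held_out]
        yield group, train_idx, val_idx


# 200 samples in four hypothetical families of 50 lines each.
families = [f"family_{i // 50 + 1}" for i in range(200)]

folds = list(leave_one_group_out(families))
for name, train_idx, val_idx in folds:
    print(f"held out {name}: train={len(train_idx)} validation={len(val_idx)}")
```

No family ever appears on both sides of a fold, which is the property a random k-fold split cannot guarantee.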

What 200 Samples Can and Cannot Support

We want to be direct about the limits. Two hundred samples is enough to fine-tune for a single trait with moderate heritability (h² above 0.4) and predominantly additive genetic architecture, assuming the sample diversity is reasonable. It is not enough to fine-tune for a low-heritability trait (h² below 0.2), to simultaneously optimise for multiple correlated traits, or to learn the interaction of genotype with a new environment not represented in the pre-training data.

The pre-trained model carries representation knowledge that reduces the effective data requirement relative to training from scratch — but it does not eliminate the requirement. When a client comes to us with 80 samples for a trait we have no prior training signal for, we advise against fine-tuning and instead use the base model's embeddings directly with a lightweight regression head trained on their 80 observations. This is not as good as fine-tuning with adequate data, but it is more honest about what can be learned from 80 samples without fabricating prediction precision the data cannot support.
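As an illustration of what a lightweight regression head can look like, the sketch below fits a ridge regression on frozen embeddings. Ridge is an assumption chosen for the example, not a statement of which head we use, and the two-dimensional "embeddings" are toy stand-ins for real embedding vectors. The solver is plain Python for self-containment; with real embeddings a numerical library would be the sensible choice.

```python
def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression with an unpenalised intercept."""
    n, d = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(d)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(d)] for row in X]
    yc = [v - y_mean for v in y]

    # Normal equations: (Xc^T Xc + lam * I) w = Xc^T yc
    A = [
        [sum(r[i] * r[j] for r in Xc) + (lam if i == j else 0.0) for j in range(d)]
        for i in range(d)
    ]
    b = [sum(r[i] * t for r, t in zip(Xc, yc)) for i in range(d)]

    # Gaussian elimination with partial pivoting, then back-substitution.
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, d):
            factor = A[r][col] / A[col][col]
            for c in range(col, d):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    w = [0.0] * d
    for i in reversed(range(d)):
        tail = sum(A[i][j] * w[j] for j in range(i + 1, d))
        w[i] = (b[i] - tail) / A[i][i]

    intercept = y_mean - sum(wj * mj for wj, mj in zip(w, x_mean))
    return w, intercept


def ridge_predict(w, intercept, x):
    return intercept + sum(wj * xj for wj, xj in zip(w, x))


# Toy stand-ins for frozen base-model embeddings and adjusted phenotypes.
embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
phenotypes = [3.0, 5.0, 2.0, 4.0, 6.0]  # generated as 2*x0 - x1 + 3

w_weak, b_weak = ridge_fit(embeddings, phenotypes, lam=1e-9)
w_strong, b_strong = ridge_fit(embeddings, phenotypes, lam=10.0)
print("weak penalty:  ", [round(v, 3) for v in w_weak], round(b_weak, 3))
print("strong penalty:", [round(v, 3) for v in w_strong], round(b_strong, 3))
```

The penalty `lam` is the only knob: a larger value shrinks the weights toward zero, which is usually the safer direction when there are more embedding dimensions than observations.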

Calibration: Confidence Intervals for Small-Sample Fine-Tuned Models

Small sample sizes produce models with under-estimated uncertainty. The validation metrics look reasonable, but the model's confidence intervals are too narrow because they were calibrated on data that was not fully independent of training. We apply temperature scaling to the predictive distribution after fine-tuning, a simple post-hoc calibration step that widens confidence intervals until the empirical coverage matches the nominal level. For regression outputs, we use conformal prediction with a held-out calibration set of 20–30 samples to set interval widths that achieve the nominal coverage.
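For the regression case, split conformal prediction with absolute residuals is one common variant, sketched below with made-up calibration data. It is an illustration of the idea, not our exact procedure, and its coverage guarantee assumes the calibration samples are exchangeable with the lines being predicted, which closely related breeding material may not satisfy.

```python
import math


def conformal_half_width(cal_true, cal_pred, alpha=0.2):
    """Half-width giving (1 - alpha) coverage under exchangeability."""
    scores = sorted(abs(t - p) for t, p in zip(cal_true, cal_pred))
    n = len(scores)
    # Finite-sample correction: take the ceil((n + 1)(1 - alpha))-th score.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # calibration set too small for this coverage level
    return scores[rank - 1]


def conformal_interval(prediction, half_width):
    return prediction - half_width, prediction + half_width


# Made-up calibration set of 20 held-out lines (yield in t/ha).
cal_pred = [6.0 + 0.1 * i for i in range(20)]
errors = [0.05 * (i + 1) * (-1) ** i for i in range(20)]
cal_true = [p + e for p, e in zip(cal_pred, errors)]

half_width = conformal_half_width(cal_true, cal_pred, alpha=0.2)
lower, upper = conformal_interval(7.2, half_width)
print(f"80% interval for a prediction of 7.2: [{lower:.2f}, {upper:.2f}]")
```

With 20 calibration samples and 80% nominal coverage, the half-width is the 17th smallest absolute residual. Asking for 99% coverage from the same set returns an infinite width, which is the method's way of saying that 20 samples cannot support that claim.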

Calibration is easy to skip and the prediction point estimates look fine without it. But when a breeder is making advancement decisions, a mis-calibrated confidence interval that says 80% coverage but actually covers 55% of ground truth values is worse than no interval at all, because it produces false confidence. We build calibration into our fine-tuning pipeline by default, even at the cost of requiring a separate calibration split from an already small dataset.
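Checking for this failure is cheap: count how often held-out truths fall inside their intervals and compare the fraction with the nominal level. The numbers below are fabricated to reproduce the 80% versus 55% situation and are not measurements from any model.

```python
def empirical_coverage(y_true, lower, upper):
    """Fraction of true values that fall inside their predicted intervals."""
    inside = sum(lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper))
    return inside / len(y_true)


# Fabricated example: 20 held-out lines, intervals of +/- 1.0 around each
# prediction, with 9 predictions off by more than the interval half-width.
y_true = [float(i) for i in range(20)]
pred_errors = [0.5] * 11 + [1.5] * 9
y_pred = [y + e for y, e in zip(y_true, pred_errors)]
lower = [p - 1.0 for p in y_pred]
upper = [p + 1.0 for p in y_pred]

nominal = 0.80
coverage = empirical_coverage(y_true, lower, upper)
print(f"nominal {nominal:.0%}, empirical {coverage:.0%}")
```

On a small held-out set the empirical figure is itself noisy, so a gap of a few percentage points is not alarming; a gap of 25 points is.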
