Integrating Genomic Predictions into a Seed Company Selection Pipeline

A mid-size European cereal breeder — running a winter wheat programme across roughly 800 genotyped candidates per cycle — asked us last autumn whether we could connect our prediction endpoint to their existing marker-assisted selection (MAS) workflow without requiring them to replace their genotyping laboratory information management system. This is a walkthrough of what that integration looked like, what worked immediately, and where we had to iterate.

The Starting Point: VCF Files Accumulating in a Network Share

The breeder's lab produced genotyping-by-sequencing (GBS) data for each candidate cohort, delivering final VCF files to a shared drive approximately six weeks after sowing. Those VCF files were then processed by a bioinformatician using a legacy R pipeline — marker imputation via Beagle, followed by a ridge-regression BLUP model trained on four prior cycles of field phenotype data — to rank candidates for advancement.

The bottleneck was not compute time. The ridge-regression model ran in under two hours. The bottleneck was phenotype collection: heading date, plant height, and Fusarium head blight severity required a full growing season of field observation before any predictive model could be updated. New candidates therefore waited one full year to receive a phenotype-informed rank.

What they wanted from us: sequence-based predictions for traits where field phenotyping is the limiting step, delivered before the next cycle decision date, without requiring changes to their VCF production pipeline.

Step 1: Normalising VCF to the API Input Format

Our prediction endpoint accepts variant data as a JSON payload containing a sample identifier, a reference assembly designation, and an ordered list of alternate allele encodings for our fixed marker panel. The breeder's GBS pipeline produced multi-sample VCF files aligned against Triticum aestivum IWGSC RefSeq v2.1. The first integration task was writing a preprocessing script to extract per-sample rows from multi-sample VCF and re-encode them to our panel coordinates.

import gzip
import json
import requests
from pathlib import Path

PANEL_POSITIONS = "panel_positions_iwgsc_v2.1.tsv"  # distributed with our SDK
API_ENDPOINT = "https://api.livingmodles.com/v1/predict"
API_KEY = "your_api_key_here"

def vcf_to_panel_encoding(vcf_path: str, sample_id: str) -> list[int]:
    """
    Extract alternate allele dosage (0/1/2) for each panel position
    from a single-sample or multi-sample VCF.
    Missing positions encoded as -1 (handled by imputation in the model).
    """
    panel = load_panel_positions(PANEL_POSITIONS)
    pos_index = {(row["chrom"], row["pos"]): i for i, row in enumerate(panel)}
    encoding = [-1] * len(panel)

    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as fh:
        header_cols = None
        sample_col = None
        for line in fh:
            if line.startswith("##"):
                continue
            if line.startswith("#CHROM"):
                header_cols = line.strip().lstrip("#").split("\t")
                if sample_id in header_cols:
                    sample_col = header_cols.index(sample_id)
                else:
                    raise ValueError(f"Sample {sample_id} not found in VCF header")
                continue
            cols = line.strip().split("\t")
            chrom, pos = cols[0], int(cols[1])
            key = (chrom, pos)
            if key not in pos_index:
                continue
            fmt = cols[8].split(":")
            gt_idx = fmt.index("GT") if "GT" in fmt else 0
            gt_field = cols[sample_col].split(":")[gt_idx]
            alleles = gt_field.replace("|", "/").split("/")
            if "." in alleles:
                continue  # leave as -1
            dosage = sum(int(a) for a in alleles)
            encoding[pos_index[key]] = dosage

    return encoding


def predict_traits(sample_id: str, encoding: list[int], traits: list[str]) -> dict:
    payload = {
        "sample_id": sample_id,
        "reference_assembly": "iwgsc_v2.1",
        "allele_dosages": encoding,
        "traits": traits,
    }
    resp = requests.post(
        API_ENDPOINT,
        headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
        json=payload,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

The script ran against a 780-sample multi-sample VCF in approximately four minutes on a standard laptop. Panel coverage for their GBS data averaged 73% — the remaining 27% of panel positions were imputed internally before trait scoring.

Step 2: Batch Submission and Result Retrieval

Single-sample synchronous requests work for ad hoc lookups, but at 780 samples per cycle the practical approach is asynchronous batch submission. Our batch endpoint accepts up to 1,000 samples in a single POST and returns a job ID; results are retrievable once processing completes, typically within 10–20 minutes for a cohort of this size.

def submit_batch(samples: list[dict]) -> str:
    """Submit list of {sample_id, reference_assembly, allele_dosages, traits} dicts."""
    resp = requests.post(
        "https://api.livingmodles.com/v1/batch/predict",
        headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
        json={"samples": samples},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["job_id"]


def poll_batch_results(job_id: str, poll_interval: int = 30) -> list[dict]:
    import time
    status_url = f"https://api.livingmodles.com/v1/batch/{job_id}/status"
    results_url = f"https://api.livingmodles.com/v1/batch/{job_id}/results"
    while True:
        status = requests.get(status_url, headers={"Authorization": f"Bearer {API_KEY}"}).json()
        if status["state"] == "complete":
            return requests.get(results_url, headers={"Authorization": f"Bearer {API_KEY}"}).json()["predictions"]
        if status["state"] == "failed":
            raise RuntimeError(f"Batch job failed: {status.get('error')}")
        time.sleep(poll_interval)

The breeder's bioinformatician integrated these two functions into their existing R-based workflow via a Python subprocess call, writing results to a TSV that the R pipeline could then merge with existing BLUP rankings.

What We Predicted and What We Did Not Attempt

We are not claiming, and the breeder did not expect, that sequence-based predictions replace phenotypic selection. For heading date and plant height — traits controlled by well-characterised major-effect loci in wheat (Ppd-1, Rht-B1, Rht-D1) — our predictions correlated well with historical field values and offered genuine pre-season utility. For yield under specific environmental conditions, we were explicit that our model captures genetic potential, not the GEI (genotype × environment interaction) component that still requires field data.

The 60% reduction in candidate evaluation time cited in our meta description refers specifically to the time spent waiting for trait information before the first advancement decision. By providing heading-date and height predictions immediately from GBS data, the team could make a preliminary advancement cut before field phenotype data was available, reducing the cohort that required full-season observation from 780 to approximately 310 candidates.

Error Handling and Edge Cases

Three edge cases required specific handling in production. First, VCF files sometimes included samples with very low read depth (<3×), producing high rates of missing genotype calls. We added a pre-submission QC step that flags samples with greater than 40% missing panel positions — these are returned with a low_coverage warning flag rather than a confidence score. Second, the breeder occasionally submitted accessions from their historical diversity panel, which included landraces not well represented in our current training data. For these, prediction intervals widened considerably; we surface this as an elevated uncertainty estimate in the response. Third, IWGSC RefSeq coordinates shifted between v1.0 and v2.1 at several hundred marker positions — the SDK's panel position file must match the assembly used in alignment, and a mismatch produces silently wrong dosage encodings rather than an error. We have since added assembly version verification to the SDK's VCF parser.

Practical Outcome and What We Would Do Differently

The integration was running in production within three weeks of the initial request. The main friction was not technical; it was agreeing on what "prediction confidence" meant in the context of a breeding decision. The breeder's team wanted a single actionable score per candidate. We returned trait-specific predictions with uncertainty intervals. Translating those intervals into a breeding decision index required a short dialogue about acceptable false-negative rates — advancing a line that field-validates poorly versus discarding a line that would have advanced under field selection. We ended up writing a small R function on their side that converted our interval outputs into a decision tier using their own historical false-negative tolerance.

If we were starting this integration today, we would propose that decision-tier mapping conversation before writing any code. The data pipeline portion took two days; the decision-logic conversation took two weeks.