Building Research Collaborations Between AI Startups and Genomics Labs

We are a small computational team. We do not run a wet lab. We do not grow plants. Everything we build depends on data that comes out of laboratories where people spend their days phenotyping, sequencing, and maintaining germplasm collections. Getting those collaborations right — structurally, not just interpersonally — has been one of the harder parts of building this company, and one we did not see discussed much in the technical literature we read.

Why the Gap Exists

Computational biology and experimental plant genomics share a vocabulary and a journal ecosystem, but they have substantially different daily work rhythms, output metrics, and incentive structures. A computational team measures progress by model performance on held-out benchmarks and time-to-integration of new training data. A wet-lab research group measures progress by successful field seasons, genotyping throughput, and publications from multi-year experimental designs. A collaboration where one side is iterating weekly and the other is iterating annually requires explicit management of that asymmetry or it will frustrate both parties.

The specific tension we have encountered most often: we need data before we can demonstrate that our approach works, but a laboratory group cannot justify the data sharing overhead until they see demonstrated value. This is a genuinely circular problem. The way out of it is to start with a much smaller scope than either party would ideally like, produce a concrete result with that limited data, and use that result as the negotiating foundation for a larger exchange.

The First Step: Defining a Shared Evaluation Protocol

Before discussing data transfer, before discussing authorship, before discussing any kind of formal agreement, the most important conversation is: what would a convincing result look like to each party? This question often reveals that the two teams have different definitions of "working."

For us, a working genomic prediction model achieves a Pearson correlation of 0.6 or above between predicted and observed phenotype values on held-out accessions. For a breeding-programme collaborator, a working prediction system is one where predicted top-quartile candidates overlap substantially with phenotypic top-quartile candidates in field evaluation. These are related but not identical criteria. A model with an overall Pearson r of 0.65 might still recover 70% of the true top 10% of accessions in its own predicted top 10%, which is the kind of figure the breeder cares about. Or it might do considerably worse: correlation across the whole range does not guarantee precision at the extremes.
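The difference between the two criteria can be made concrete with a small sketch. The function names and the default fraction of 0.25 are illustrative assumptions, as is the convention that a higher trait value is better; the code simply computes both figures from the same pair of predicted and observed vectors.

```python
import numpy as np


def pearson_r(predicted, observed):
    """Pearson correlation between predicted and observed phenotype values."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    if predicted.shape != observed.shape:
        raise ValueError("predicted and observed must have the same length")
    return float(np.corrcoef(predicted, observed)[0, 1])


def top_fraction_overlap(predicted, observed, fraction=0.25):
    """Share of the observed top `fraction` that also sits in the predicted top.

    Assumes higher values are better; negate both inputs for traits where
    lower is better. Ties are broken by array position, which is arbitrary.
    """
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    if predicted.shape != observed.shape:
        raise ValueError("predicted and observed must have the same length")
    n_top = max(1, int(round(fraction * len(predicted))))
    top_predicted = set(np.argsort(predicted)[-n_top:].tolist())
    top_observed = set(np.argsort(observed)[-n_top:].tolist())
    return len(top_predicted & top_observed) / n_top
```

Reporting both numbers side by side for the same held-out accessions lets each party read the result against its own definition of success.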

We now write a shared evaluation specification before any data is transferred. The specification records: (a) the prediction target and its unit of measure, (b) the validation accession set and how it will be held out, (c) the success threshold for each party, and (d) who interprets the results. This takes two or three meetings to produce but prevents the situation where we report a result we consider positive and the collaborator considers inconclusive because we were measuring different things.
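One way to keep such a specification honest is to hold it as a structured record rather than as meeting notes. The sketch below is an assumption about how that could look, not a description of a fixed template: the class name, field names, and the free-text threshold strings are all hypothetical, and it assumes Python 3.9 or later. The fields mirror items (a) to (d) above.

```python
from dataclasses import dataclass


@dataclass
class EvaluationSpec:
    """Shared evaluation specification agreed before data transfer."""

    target_trait: str                   # (a) prediction target
    unit: str                           # (a) unit of measure
    holdout_accessions: list[str]       # (b) validation accession set
    holdout_method: str                 # (b) how the set is held out
    success_thresholds: dict[str, str]  # (c) party name -> threshold, in words
    interpreters: list[str]             # (d) who interprets the results

    def missing_items(self) -> list[str]:
        """Return the names of fields that are still empty."""
        return [name for name, value in vars(self).items() if not value]
```

A specification would be treated as ready only when `missing_items()` returns an empty list, which gives both teams a simple check before any files move.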

Data Format Negotiations and the Hidden Overhead

Laboratory groups generate data in the formats their instruments and pipelines produce: VCF files from Illumina GBS, flat-file phenotype records from field books, and Excel sheets of SNP array results with institutional column naming conventions. Every format conversion and normalisation step requires coordination: which VCF version? Which reference assembly? How are missing values encoded? Does "plant height" mean flag leaf height or ear height?

We have learned to send a data format specification sheet before any data arrives: a single-page document describing exactly what format we need, with example rows. We also offer to handle the conversion from their native format to ours. Placing the conversion burden on ourselves rather than on the collaborator removes a significant friction point. A genomics laboratory bioinformatician who has to learn our input format to help us is spending time on our problem, not theirs. If we can write the ETL script and they can point us to the file server, the collaboration moves faster.
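A conversion script of this kind is usually small. The following is a minimal sketch under stated assumptions: the source column names (`Acc_ID`, `PH_cm`), the target names, and the set of missing-value tokens are all hypothetical stand-ins for whatever a given collaborator's files actually contain. It renames columns to a canonical form and maps the agreed missing-value codes to `None`.

```python
import csv
import io

# Hypothetical mapping from a collaborator's column names to canonical ones.
COLUMN_MAP = {"Acc_ID": "accession_id", "PH_cm": "plant_height_cm"}
# Hypothetical missing-value encodings, to be agreed per collaborator.
MISSING_TOKENS = {"", "NA", "."}


def normalise_phenotypes(text, column_map=COLUMN_MAP, missing=MISSING_TOKENS):
    """Convert collaborator CSV text into a list of canonical row dicts."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    absent = [col for col in column_map if col not in header]
    if absent:
        raise ValueError(f"missing expected columns: {absent}")
    rows = []
    for raw in reader:
        row = {}
        for source, target in column_map.items():
            value = (raw[source] or "").strip()  # short rows yield None
            row[target] = None if value in missing else value
        rows.append(row)
    return rows
```

Failing loudly on an unexpected header is a deliberate choice here: a silent mismatch in column naming is exactly the kind of problem the specification sheet is meant to catch.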

The one exception: phenotype normalisation. Raw field data has site and year structure that only the experimental team understands. We need them to either provide normalised genotypic values (BLUPs or adjusted means) or to explain the experimental design in enough detail that we can apply the adjustment ourselves. Getting this information typically requires at least one extended conversation with the person who ran the field trials — not the PI, not the bioinformatician, but the person who recorded the plot-level data.
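To illustrate why the site and year structure matters, here is a deliberately crude sketch. It is not a BLUP and not a proper adjusted mean: it only subtracts each site-year mean from the plot values and averages the deviations per accession, which is a rough stand-in that ignores replication, blocking, and unbalanced designs. The function name and the record layout are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def environment_centred_means(records):
    """Average per-accession deviation from each site-year mean.

    `records` is an iterable of (accession, site, year, value) tuples.
    This is a simplification for illustration, not a mixed-model estimate.
    """
    records = list(records)  # allow two passes over a generator

    # First pass: mean trait value within each site-year environment.
    by_environment = defaultdict(list)
    for _accession, site, year, value in records:
        by_environment[(site, year)].append(value)
    environment_mean = {env: mean(vals) for env, vals in by_environment.items()}

    # Second pass: deviations from the environment mean, grouped by accession.
    by_accession = defaultdict(list)
    for accession, site, year, value in records:
        by_accession[accession].append(value - environment_mean[(site, year)])
    return {acc: mean(devs) for acc, devs in by_accession.items()}
```

Even this simple version needs to know which plots belong to which site and year, which is the information that tends to live with the person who recorded the plot-level data.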

Publication and Intellectual Property

The IP conversation is harder when one party is a startup. Academic collaborators are accustomed to working with other academic groups where the default assumption is open publication. When we arrive as a company, there is often initial uncertainty about whether we intend to patent the model outputs, restrict data reuse, or claim exclusive commercial rights to anything derived from their germplasm data.

Our position, which we state explicitly at the beginning of any collaboration discussion: we do not seek exclusive IP claims over data contributed by collaborators. The model we train using that data becomes part of our commercial product, but we commit to publishing the methodology and to not restricting the collaborator's ability to publish their own experimental data or to describe our joint evaluation work in their papers. We ask that pre-publication model outputs be treated as confidential until we agree on a coordinated publication timeline, which is a standard academic courtesy request that most groups accept easily.

For joint publications, authorship follows standard academic convention: wet-lab experimental contributors are named as co-authors rather than relegated to the acknowledgements, and the PI of the laboratory group is typically corresponding author on any paper where the primary scientific contribution is the phenotypic data they produced. We take co-first authorship on papers where the modelling methodology is a primary contribution. These negotiations are often smoother when initiated early, before the data has been shared, because raising them early signals that we are treating the collaboration as a genuine intellectual partnership.

Managing Expectation Cycles

A significant practical source of friction in these collaborations is the mismatch between our iteration speed and the pace of experimental validation. We can train a model in a week, deliver a prediction file, and ask for feedback. A laboratory group that needs to complete a field season to validate those predictions operates on a nine-month feedback cycle. If we have iterated to a new model version during those nine months, which model do the field results validate? We have started maintaining a prediction archive keyed by the model version and date that produced each file, so that when field validation results arrive months later, we can identify exactly which model they correspond to.
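An archive of this kind does not need much machinery. The sketch below is one possible shape, with assumptions stated plainly: the function names, the JSON layout, and the `<version>_<date>.json` file naming are hypothetical, and the model version string is assumed to contain no path separators. Each delivered file is written once and never overwritten.

```python
import json
from datetime import date
from pathlib import Path


def archive_predictions(archive_dir, model_version, produced_on, predictions):
    """Store one delivered prediction set, keyed by model version and date.

    `produced_on` is a datetime.date; `predictions` maps accession IDs to
    predicted values. Refuses to overwrite an existing archive entry.
    """
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    path = archive_dir / f"{model_version}_{produced_on.isoformat()}.json"
    if path.exists():
        raise FileExistsError(f"archive entry already exists: {path.name}")
    payload = {
        "model_version": model_version,
        "produced_on": produced_on.isoformat(),
        "predictions": predictions,
    }
    path.write_text(json.dumps(payload, indent=2, sort_keys=True))
    return path


def load_archived(path):
    """Read an archive entry back, including its version and date."""
    return json.loads(Path(path).read_text())
```

Because the version and date are stored inside the file as well as in its name, a prediction file that has been renamed on the collaborator's side can still be traced back to the model that produced it.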

We have also learned not to ask collaborators to evaluate predictions in the middle of their growing season. A request for feedback that arrives in August — when field teams are managing trial harvests — will be deprioritised. Quarterly check-ins at natural breaks in the agricultural calendar work better than ad hoc requests triggered by our computational milestones.

What Makes These Collaborations Worth the Overhead

The honest answer is that building these relationships is slow, and for a small company with limited time, each collaboration represents a significant coordination investment. We have had collaborations that produced useful data, one publication in preparation, and a genuine improvement to our model. We have also had collaborations that yielded two meetings, a partially completed data sharing agreement, and nothing further, because the collaborator's priorities shifted.

We are not saying external research collaborations are always the right source of training and validation data. For some trait and species combinations, public databases provide enough material to proceed. But for species where public phenotype data is sparse — which is most species outside the major cereal crops — and for prediction tasks that require current, programme-specific phenotype records, the collaboration overhead is the only path to the data that would make our model genuinely useful. The work of structuring those collaborations well is not separable from the work of building good prediction models.