Bio ML Opportunities

This note is a loose collection of thoughts on where biology could benefit from modern deep learning techniques.

Framing big-tech ML and bio-tech ML

(Note: Celine Halioua of loyalfordogs.com has an interesting broader framing comparing and contrasting the biotech and tech industries here; in this note I’m focusing more on deep learning.)

Biotech is a high-risk, high-reward industry. Fundamentally, when human life is at stake there is a lot of regulation, and when the underlying biological mechanisms and pathways are not always understood, there is a large scientific “moat” for companies to capitalize on. Even if a company can find a treatment in the discovery phase, the “pre-revenue” cost structure to spin up bioreactors, clinical trials, and manufacturing processes is remarkably expensive.

Contrast this cost structure to network-effect driven, ~zero marginal cost “big tech” companies like Google, Facebook, Amazon, Uber, and others. The limiting resource for this type of company tends to be users, which is why these companies tend to invest in scaling their services to every human on the planet.

At least as of 2023, the current state of “bio-type” data that one would use for ML is fundamentally different between these two types of companies.

| Attribute | Biotech Co. Data | Tech Co. Data |
| --- | --- | --- |
| Mental model | Cottage: artisan scale. | Factory: clean and huge scale. |
| Size | 100s-1,000s of data points, because it costs a lot to run bioreactors or do drug trials. | Millions-billions of data points of purely digital data. |
| Shape | Wide and heterogeneous: there may be a large number of features relevant for a given bioprocess or patient outcome, gathered at varying levels of sparseness. | “Deep” and homogeneous: many data points of the same type, where the B2C side of the tech company has mostly full vertical control over what is logged. |
| Completeness | Fundamentally incomplete, since the scientific understanding and causation of the underlying biology may not be fully understood. | Mostly complete: many tasks “translate” from one domain to another (speech to text, large language models, image classification). |

Biotech Ideas for ML Niches

Given the cottage scale, there is an interesting question around how deep learning techniques can actually provide benefit to biotech.

I think there are at least three areas where deep learning can contribute:

  1. Distilling complex patterns of evolutionary biology to a degree that was difficult to achieve before. At “evolutionary scale” there are more factory-oriented datasets available. For example, DeepMind researchers revolutionized structural biology with AlphaFold, and “evolutionary scale” protein language models are also interesting for supplementing smaller datasets with representations that can be used for protein folding (ESM Atlas). Further, I think comparative genomics across species will be interesting for understanding how genes find their niche, such as the work from fauna.bio. Progress in these areas will inform, and will continue to be informed by, broader biological understanding: ML will become part of Laura Deming’s biological flywheel toolkit described in “Sequencing is the new microscope”.
  2. Another important direction that I’m less familiar with is the notion of using “quantified self” metrics from wearables, as advocated by Balaji Srinivasan (example tweet). Since it is easier than ever to see trends thanks to wearables, smart beds/scales, and phones for logging data, motivated individuals may be able to take ownership of their key biometrics and find approaches that work for them.
  3. Helping to streamline existing healthcare systems. For example, precision-medicine biopharma companies screen for the potential patients most likely to succeed on a treatment, ML can streamline and improve on existing imaging techniques, and in-silico simulation can improve experiment quality during drug discovery.

More classical ML approaches will continue to be helpful at the cottage scale:

  1. More direct applications of statistical approaches and processes can also improve the chance of success for existing process development. Biomanufacturing emphasizes the design-build-test-learn cycle (e.g. as practiced at Zymergen), which tends to favor iterative and interpretable techniques such as design of experiments and, to some extent, Bayesian black-box optimization. In these cottage-industry cases, explainable AI models become quite important, so even basic techniques like XGBoost can go a long way for the ML practitioner.
  2. Lab automation will also help scale up experiments. While there may be further iteration in the idea maze in this area to execute experiments, at the very least, basic ML techniques like outlier detection can replace the manual monitoring common in many bioreactor runs. More classical time-series modeling and experiment-design techniques can continue to make headway in scaling up data collection. As data scale-up becomes easier, many companies may move toward deep learning approaches as well.
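To make the design-of-experiments idea concrete, here is a minimal sketch of generating a full-factorial design in plain Python. The factor names and levels are hypothetical, purely for illustration; a real bioprocess study would likely use fractional-factorial or response-surface designs from dedicated statistical software.

```python
# Minimal full-factorial design-of-experiments sketch.
# Factor names and levels below are hypothetical examples.
from itertools import product

def full_factorial(factors):
    """Enumerate every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, levels))
            for levels in product(*(factors[n] for n in names))]

# Hypothetical bioreactor factors to vary across runs.
factors = {
    "temperature_C": [30, 37],
    "pH": [6.8, 7.2],
    "feed_rate_mL_h": [5, 10, 20],
}

design = full_factorial(factors)
print(len(design))  # 2 * 2 * 3 = 12 experimental runs
```

Even this brute-force enumeration illustrates the cottage-scale tradeoff: the number of runs grows multiplicatively with factors, which is exactly why interpretable, sample-efficient designs matter when each run is an expensive bioreactor batch.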
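As a sketch of the outlier-detection point above, here is a simple z-score detector in plain Python. The sensor values and threshold are illustrative assumptions, not real bioreactor data; production monitoring would typically use more robust statistics (e.g. median absolute deviation) or time-series-aware methods.

```python
# Simple z-score outlier detection, e.g. for flagging anomalous
# sensor readings in a bioreactor run. Data below is simulated.
from statistics import mean, stdev

def zscore_outliers(readings, threshold=3.0):
    """Return indices of readings more than `threshold` standard
    deviations from the mean."""
    mu = mean(readings)
    sigma = stdev(readings)
    if sigma == 0:
        return []
    return [i for i, x in enumerate(readings)
            if abs(x - mu) / sigma > threshold]

# Simulated dissolved-oxygen readings with one obvious spike.
readings = [6.9, 7.0, 7.1, 7.0, 6.8, 7.2, 12.5, 7.0, 6.9, 7.1]
print(zscore_outliers(readings, threshold=2.0))  # [6]
```

A detector this simple can already replace a human glancing at a dashboard for gross sensor faults, which is the kind of incremental automation the point above describes.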