Getting Started with Machine Learning for Experimental Biochemists and Other Molecular Scientists

Abstract

Machine learning (ML) is rapidly gaining traction in many areas of experimental molecular science for elucidating relationships and patterns in large or complex data sets. Historically, ML was largely the preserve of those with specialized training in fields such as statistics or cheminformatics. Increasingly, however, ML methodologies are becoming part of the standard toolkit for experimental scientists across a range of disciplines. Lowering the barrier to entry for scientists without a significant background in computer science or statistics is important for broadening access to these powerful methods. Here we provide detailed, step-by-step tutorials for performing four ML methods that are particularly useful for applications in biochemistry, cell biology, and drug discovery: hierarchical clustering, Principal Component Analysis (PCA), Partial Least-Squares Discriminant Analysis (PLSDA), and Partial Least-Squares Regression (PLSR). The protocols are written for the widely used software MATLAB, but no prior experience with MATLAB is required to use them. We include an explanation of each step, pitched at a level to be understood by investigators without any prior experience with ML, MATLAB, or any kind of coding. We also highlight the scientific issues involved in selecting and scaling the data to be analyzed, and describe controls to test the validity of the results obtained. Throughout, we emphasize the relationship between the scientific question and the choice of data and methods that will allow it to be addressed in a meaningful way. Our aim is to provide a basic introduction that will equip experimental chemical biologists and other chemical and biomedical scientists with the knowledge required to use ML to aid in the design of experiments, the formulation and data-driven testing of hypotheses, and the analysis of experimental data.

Basic Protocol 1: Clustering

Basic Protocol 2: Principal Component Analysis (PCA)

Basic Protocol 3: Partial Least-Squares Discriminant Analysis (PLSDA)

Basic Protocol 4: Partial Least-Squares Regression (PLSR)
