Getting started in machine learning

Every so often, a chemist or biologist will ask me how they can get started in machine learning. Here is my current answer:

My experience

When I started out, I had a strong quantitative background (a chemical engineering undergrad degree, and I was taking PhD courses in chemical engineering) and some working programming skills. From there, I first dove deep into one type of machine learning (Gaussian processes) along with general ML practice (how to set up ML experiments in order to evaluate your models), because that was what I needed for my project. I learned mostly online and by reading papers, but I also took one class on data analysis for biologists that wasn’t ML-focused but did cover programming and statistical thinking. Later, I took a linear algebra class, an ML survey class, and an advanced topics class on structured learning at Caltech. Those classes gave me a broad knowledge of ML. Since then, I’ve gained a deeper understanding of the subfields that interest me or are especially relevant by reading papers closely: chasing down references and anything I don’t understand, implementing the core algorithms myself, or both.

General advice and where to start

More generally, to do applied ML you need to be able to:

1. Put your data into your model.
2. Understand at a high level what the model is doing.
3. Analyze the results.

For 1 and 3, programming is essential. The most popular language for ML right now is Python, but programming skills transfer pretty well between languages and frameworks.
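To make the three steps above concrete, here is a minimal sketch in plain Python. Everything in it is an assumption chosen for illustration: the data is synthetic (a noisy line with slope 2 and intercept 1), the model is ordinary least squares on one feature, and the helper names (`make_data`, `fit_line`, `rmse`) are hypothetical. A real project would more likely use a library such as NumPy or scikit-learn, but the shape of the workflow is the same.

```python
import math
import random


def make_data(n, seed=0):
    """Synthetic data (an assumption): y = 2x + 1 plus Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length lists."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Step 1: put the data into the model (hold out a test set first).
xs, ys = make_data(200)
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]
slope, intercept = fit_line(train_x, train_y)

# Step 2: understand what the model is doing. Here it is just a line,
# so the fitted parameters can be read directly.
print(f"fitted line: y = {slope:.2f} * x + {intercept:.2f}")

# Step 3: analyze the results on data the model has not seen, and
# compare against a trivial baseline (always predict the training mean).
test_pred = [slope * x + intercept for x in test_x]
model_rmse = rmse(test_pred, test_y)
baseline = sum(train_y) / len(train_y)
baseline_rmse = rmse([baseline] * len(test_y), test_y)
print(f"model RMSE: {model_rmse:.2f}, baseline RMSE: {baseline_rmse:.2f}")
```

The held-out test set and the baseline comparison are the parts worth carrying over to real problems: an error number means little until you know what a trivial predictor would have scored on the same data.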

For 2 and 3, you need undergraduate-level probability and statistics (you should know, for example, what probability density functions and random variables are). Introductory linear algebra is also helpful here, as it underpins a lot of the data manipulations and theory behind machine learning algorithms.
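As a quick self-check on those two terms, the sketch below (my own illustration, using only the Python standard library) treats a standard normal as the random variable. It compares the probability of landing in [-1, 1] computed two ways: by numerically integrating the probability density function, and by drawing samples and counting. The function names `normal_pdf` and `integrate` are hypothetical helpers, and the midpoint rule is just one simple choice of integrator.

```python
import math
import random


def normal_pdf(x, mean=0.0, std=1.0):
    """Probability density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def integrate(f, a, b, steps=10_000):
    """Midpoint-rule numerical integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


# Probability that the random variable falls in [-1, 1], from the density.
prob_from_pdf = integrate(normal_pdf, -1.0, 1.0)

# The same probability estimated by sampling the random variable.
rng = random.Random(0)
n_samples = 100_000
samples = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
prob_from_samples = sum(1 for s in samples if -1.0 <= s <= 1.0) / n_samples

print(f"from density: {prob_from_pdf:.4f}")
print(f"from samples: {prob_from_samples:.4f}")
```

The two numbers should agree closely (both near 0.68). The point to take away is that a density value is not itself a probability; probabilities come from integrating the density over a range.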

If you are interested in developing methods or even moving towards core machine learning research, then the mathematical and formal computer science knowledge required becomes much more demanding. When starting out, though, my advice is to pick a problem you’re interested in, learn what you need to solve that problem, and then learn more as you go along.

Some resources

Caltech’s Programming Bootcamp for Biologists, taught by Justin Bois
Caltech’s Data Analysis in the Biological Sciences, taught by Justin Bois
Stanford’s ML Intro course
All the Stanford deep learning classes are well-taught and have resources online.