Random Thoughts & References when reviewing “Designing Machine Learning Systems”

Lulu Yan
4 min read · Jan 31, 2024

--

I could not agree more with one trend Barr Moses called out in “Top data and AI trends in 2024”: data teams will look more like software teams, and software teams will become, more or less, data practitioners. There is always a lag in how organizations adapt to technological change in ML systems, so this material is still worth going through even though the contents of this note are roughly two years behind the latest cutting-edge developments. The public summary of the book can be found here.

Thanks to a Chinese self-study group, I was able to review Chip Huyen’s book “Designing Machine Learning Systems” with some parallel learning. For those of you who understand Chinese (中文), here is the link to the Zoom recording. The archived slide decks (Chapters 1–4, mostly in English) were prepared by Jin Wu, a senior software engineer at Google who currently works on Google Core ML infrastructure, deploying ML models on TPUs; they can be found here. He also shared with us his notes on reading the book, Designing Machine Learning Systems, Ch. 1–4.

A few points reminded me of projects from my early career as a statistician in healthcare and are worth reviewing:

1. The F1 score was not widely used for multi-class classification in healthcare (e.g., diagnosis) back in 2008–2012 except in a few cases; this measure, the harmonic mean of precision and recall, has since become popular in other sectors such as tech. Commonly used as an evaluation metric in binary and multi-class classification, the F1 score integrates precision and recall into a single metric for a better overall view of model performance (see the first sketch after this list). Reference: A Look at Precision, Recall, and F1-Score

2. Sampling methodology: Sampling Methods in Clinical Research: an Educational Review. I entered the world of survey sampling in 2006, and some of my complex sampling projects ran until 2013; purely from a methodology standpoint, the sampling methods in insurance and finance, where I later worked, were much simpler (see the second sketch below for a toy comparison of designs).
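
Here is the first sketch, a minimal illustration of the F1 score with made-up toy labels; scikit-learn is assumed purely for convenience, not something prescribed by the book. Per class, F1 = 2 · precision · recall / (precision + recall); macro averaging then treats each class equally in the multi-class case.

```python
# A minimal sketch of the F1 score: the harmonic mean of precision and recall.
# The labels below are toy values, not from any real healthcare dataset.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [0, 1, 2, 2, 1, 0, 1, 2, 0, 1]
y_pred = [0, 1, 2, 1, 1, 0, 2, 2, 0, 0]

# Per class, F1 = 2 * P * R / (P + R). For multi-class problems, "macro"
# averaging computes the per-class F1 scores and averages them equally.
print("macro precision:", precision_score(y_true, y_pred, average="macro"))
print("macro recall:   ", recall_score(y_true, y_pred, average="macro"))
print("macro F1:       ", f1_score(y_true, y_pred, average="macro"))
```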
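And the second sketch: a toy comparison of three classic designs from the survey-sampling literature (simple random, systematic, and stratified sampling) on a synthetic population; any real study would of course draw from an actual sampling frame.

```python
# Toy illustration of three sampling designs on a synthetic population.
import random

random.seed(42)
population = [{"id": i, "stratum": "A" if i % 3 else "B"} for i in range(300)]

# 1) Simple random sampling: every unit has equal selection probability.
srs = random.sample(population, k=30)

# 2) Systematic sampling: a random start, then every k-th unit.
k = len(population) // 30
start = random.randrange(k)
systematic = population[start::k]

# 3) Stratified sampling: sample within each stratum proportionally.
stratified = []
for stratum in {"A", "B"}:
    units = [u for u in population if u["stratum"] == stratum]
    n = round(30 * len(units) / len(population))
    stratified.extend(random.sample(units, k=n))

print(len(srs), len(systematic), len(stratified))
```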

We covered one third of the book today! Of course, the more technical content comes in later chapters, especially model development and deployment in Chapters 6 and 7. I’m currently going through feature engineering and will be glad to post a few takeaway points (that don’t violate the book’s copyright, of course) here sometime in the near future.

Source of the slides: DeepLearning.AI

Here are some links I found helpful and bookmarked in my notes while going over the Machine Learning Engineering for Production (MLOps) Specialization:

1. Monitoring ML Models: https://christophergs.com/machine%20learning/2020/03/14/how-to-monitor-machine-learning-models/ (a generic drift-check sketch appears after the next link)

A Chat with Andrew on MLOps: From Model-centric to Data-centric: https://youtu.be/06-AZXmwHjo
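
As a tiny companion to the monitoring post above, here is one generic input-drift check (my own illustration, not code from the linked post): a two-sample Kolmogorov–Smirnov test on a single feature, using SciPy and synthetic data.

```python
# Compare a feature's live distribution against its training distribution
# with a two-sample Kolmogorov-Smirnov test. All data here is synthetic.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)
live_feature = rng.normal(loc=0.3, scale=1.0, size=1_000)  # drifted mean

stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:  # the threshold is a judgment call per application
    print(f"Possible drift detected (KS={stat:.3f}, p={p_value:.2e})")
else:
    print("No significant drift detected")
```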

2. Notes/Takeaways related to Model Selection and Training:

Error analysis helps identify additional features.

Additional features can be hand-coded, or they can be generated by a learning algorithm. For unstructured data, the dream of no longer needing to hand-design features has largely come true. That said, even with modern deep learning, designing features, especially for structured-data problems, can still be an important driver of performance improvements. If the dataset isn’t massive, feature design driven by error analysis can still be useful in many applications. Learning algorithms are very good at learning features automatically from unstructured data such as images, audio, and text, so we don’t hand-design features nearly as much there; for structured data, though, it’s fine to go in and work on feature design (a small sketch follows).
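
A small, hypothetical sketch of what error-analysis-driven feature design on structured data can look like; the columns and domain (insurance-style claims) are invented purely for illustration.

```python
# Hand-designed features on a made-up structured dataset.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "claim_amount": [120.0, 15000.0, 430.0, 980.0],
    "num_prior_claims": [0, 4, 1, 2],
    "policy_start": pd.to_datetime(["2019-01-01", "2015-06-15",
                                    "2021-03-10", "2018-11-30"]),
    "claim_date": pd.to_datetime(["2022-05-01", "2022-05-03",
                                  "2022-05-02", "2022-05-04"]),
})

# Domain-driven features a model may not discover on its own from raw columns:
df["tenure_days"] = (df["claim_date"] - df["policy_start"]).dt.days
df["claims_per_year"] = df["num_prior_claims"] / (df["tenure_days"] / 365.25)
df["log_claim_amount"] = np.log1p(df["claim_amount"])  # tame heavy tails
print(df[["tenure_days", "claims_per_year", "log_claim_amount"]])
```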

Establishing a baseline

Image from the above reference. Copyright belongs to the original author and publisher at https://blog.ml.cmu.edu/.

Error analysis
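
In the course, error analysis largely amounts to tagging misclassified examples and looking for patterns; a minimal sketch of that bookkeeping (the tags and counts below are invented) might look like this:

```python
# Tag misclassified examples and see which tags dominate, so data or feature
# work can be prioritized where errors concentrate.
import pandas as pd

errors = pd.DataFrame({
    "example_id": [101, 102, 103, 104, 105, 106],
    "tag": ["blurry", "blurry", "dark", "occluded", "dark", "blurry"],
})

# Which categories of errors are most common?
print(errors["tag"].value_counts())
```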

Experiment tracking

Brundage, M., Avin, S., Wang, J., Belfield, H., Krueger, G., Hadfield, G., … Anderljung, M. (2020). Toward trustworthy AI development: Mechanisms for supporting verifiable claims. Retrieved from http://arxiv.org/abs/2004.07213v2

Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., & Sutskever, I. (2019). Deep double descent: Where bigger models and more data hurt. Retrieved from http://arxiv.org/abs/1912.02292
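
On experiment tracking, the course’s main point is to record enough about each run to reproduce it. Real projects typically use a dedicated tracking tool; the bare-bones sketch below (with field names of my own choosing) just shows the minimum worth recording.

```python
# Append one JSON record per run so hyperparameters, data version, and
# metrics stay tied together and reproducible.
import json
import time

def log_run(params: dict, metrics: dict, path: str = "runs.jsonl") -> None:
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "params": params,    # e.g. learning rate, model size, data version
        "metrics": metrics,  # e.g. validation F1, validation loss
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_run({"lr": 3e-4, "data_version": "v2"}, {"val_f1": 0.87})
```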

3. Data Definition & Baseline Establishment:

Best practices for unstructured vs. structured data are quite different.

Image Source Credits: DeepLearning.AI. Retrieved from Andrew Ng’s teaching on Coursera.

If we are working on a problem in one of the four quadrants above, then advice from someone who has worked on problems in the same quadrant will, on average, be more useful than advice from someone who has worked in a different quadrant. Instincts and decisions transfer better within one quadrant than across totally different quadrants.

To build a useful application, rather than trying to beat Human Level Performance (HLP), it’s often more productive to raise HLP by improving label consistency, which ultimately results in better learning performance as well (a quick consistency check follows).
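
One way to quantify that label consistency is inter-annotator agreement. The sketch below, with toy labels and scikit-learn’s Cohen’s kappa assumed as a convenience, is my own illustration rather than code from the course.

```python
# Gauge label consistency between two annotators: raw agreement plus
# Cohen's kappa, which corrects for chance agreement.
from sklearn.metrics import cohen_kappa_score

annotator_1 = ["cat", "dog", "dog", "cat", "bird", "dog"]
annotator_2 = ["cat", "dog", "cat", "cat", "bird", "dog"]

agreement = sum(a == b for a, b in zip(annotator_1, annotator_2)) / len(annotator_1)
print(f"raw agreement: {agreement:.2f}")
print(f"Cohen's kappa: {cohen_kappa_score(annotator_1, annotator_2):.2f}")
```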

Image credits: ditto. A simplified illustration of a data pipeline.
Image credits: ditto. Keeping track of data provenance and lineage can make life easier for large, complex ML systems.

Metadata can be useful for error analysis and for spotting unexpected effects in particular tags or categories of data; store it in a timely way (a storage sketch follows).
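
As a concrete, hypothetical illustration of storing metadata and provenance alongside each example, so questions like “where did this record come from, and when?” stay answerable, here is a sketch with made-up field names:

```python
# Store per-example metadata next to the data itself.
import json

record = {
    "features": {"age": 54, "bp_systolic": 132},
    "label": "hypertensive",
    "metadata": {
        "source": "clinic_export_2022_05",  # data provenance
        "labeler": "annotator_07",
        "labeled_at": "2022-05-11T09:30:00",
        "pipeline_version": "etl-1.4.2",    # lineage: which code produced it
    },
}
with open("dataset.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```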

It’s important to have balanced train / dev (hold-out validation set) / test splits.
