Machine Learning at Scale

Get your team up to speed with our hands-on, in-depth training in Machine Learning, based on materials presented at top academic conferences.

What is it about?

This course will provide an accessible introduction to the principles of Machine Learning and Data Analytics, first on a single-core computer and then on distributed MapReduce frameworks, with a particular focus on Spark and the scale it brings to academic and commercial data science. Conceptually, the course is divided into three parts.

The first part will cover fundamental concepts of data analysis, data storage and management, and machine learning on a single-core machine. The second will focus on MapReduce parallel computing via Hadoop, MRJob, and Spark, diving deep into Spark core, data frames, the Spark shell, Spark streaming, Spark SQL, MLlib, and more. The third part will focus on hands-on algorithmic design and development in parallel computing environments such as Spark, building algorithms from scratch:

- gradient descent algorithms for supervised learning (support vector machines, perceptrons, logistic regression, linear regression, and regularized versions of these algorithms);
- non-gradient-descent algorithms such as decision trees and naive Bayes;
- the street smarts that make machine learning work in practice, such as feature normalization, feature and data engineering, and feature selection; and
- unsupervised machine learning approaches such as k-Means.
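To give a flavor of the from-scratch algorithm development in the third part, here is a minimal sketch of batch gradient descent for linear regression in plain Python. The toy data and hyperparameters are illustrative assumptions, not course materials; the course develops such algorithms for Spark-scale data.

```python
# Minimal sketch: batch gradient descent for linear regression on a
# hypothetical toy dataset (illustrative only, not course material).

def gradient_descent(xs, ys, lr=0.05, epochs=2000):
    """Fit y ~ w*x + b by minimizing mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data generated from y = 2x, so the fit should recover w near 2, b near 0.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w, b = gradient_descent(xs, ys)
```

The same update rule, swapped-in loss gradients aside, drives the perceptron, logistic regression, and SVM variants listed above.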

Industrial applications and deployments of MapReduce parallel computing frameworks from various fields, including digital advertising, finance, health care, and search engines, will also be presented. Examples and exercises will be made available in Python notebooks (Hadoop streaming, MRJob, and PySpark).
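Hadoop streaming, MRJob, and PySpark all build on the same map, shuffle, reduce pattern; the following is a minimal pure-Python illustration of that pattern with a toy word count (the frameworks themselves distribute these phases across a cluster).

```python
# Toy illustration of the MapReduce pattern: map -> shuffle -> reduce.
# In Hadoop streaming, MRJob, or PySpark these phases run distributed;
# here they run in-process purely to show the data flow.
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["spark makes mapreduce fast", "spark streaming and spark sql"]
counts = reduce_phase(shuffle(map_phase(lines)))
# counts["spark"] == 3
```

In PySpark the same computation collapses to a few RDD transformations; the course exercises cover both styles.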

Training Content