Get your team up to speed with our comprehensive, hands-on Apache Spark training, based on materials presented at top academic conferences.
What is it about?
Apache Spark is an open-source cluster computing framework. It has emerged as the next-generation big data processing engine, overtaking Hadoop MapReduce, which helped ignite the big data revolution. Spark maintains MapReduce's linear scalability and fault tolerance but extends it in several important ways: it is much faster (up to 100 times faster for certain applications); it is much easier to program, thanks to rich APIs in Python, Java, Scala, SQL, and R (MapReduce exposes only two core operations, map and reduce); and its core data abstraction, the distributed data frame, is far more flexible. In addition, it goes well beyond batch processing to support a variety of workloads, including interactive queries, streaming, machine learning, and graph processing.
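To give a flavor of the map-and-reduce functional style that Spark's APIs build on, here is a minimal word-count sketch in plain Python. The three phases mirror what a Spark job would do across a cluster (in PySpark the same pattern would use `flatMap`, `map`, and `reduceByKey` on an RDD); the input lines are illustrative:

```python
from itertools import groupby

lines = ["to be or not to be", "that is the question"]

# map phase: emit a (word, 1) pair for every word in every line
pairs = [(word, 1) for line in lines for word in line.split()]

# shuffle phase: bring pairs with the same key together
grouped = groupby(sorted(pairs), key=lambda kv: kv[0])

# reduce phase: sum the counts within each group
counts = {word: sum(n for _, n in group) for word, group in grouped}
```

In Spark, the map phase runs independently on each partition of the data, and the shuffle moves pairs with the same key to the same machine before reduction, which is what gives the pattern its linear scalability.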
This training provides an accessible introduction to large-scale distributed machine learning and data mining, and to Spark's potential to transform academic and commercial data science practice. It is divided into two parts. The first part covers fundamental Spark concepts, including Spark Core, functional programming à la MapReduce, RDDs/data frames/datasets, the Spark shell, Spark Streaming and online learning, Spark SQL, MLlib, and more. The second part focuses on hands-on algorithmic design and development with Spark: building algorithms from scratch, such as decision tree learning, association rule mining (Apriori), graph algorithms such as PageRank and shortest path, gradient descent algorithms such as support vector machines and matrix factorization, and deep learning. These homegrown implementations help shed light on the internals of the MLlib library (and on the difficulties of parallelizing some key machine learning algorithms). Industrial applications and deployments of Spark will also be presented. Example code will be made available in Python (PySpark) notebooks.
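As a taste of the hands-on part, the gradient-descent loop at the heart of the SVM and matrix-factorization exercises can be sketched in plain Python on a toy least-squares problem. The dataset, learning rate, and iteration count below are illustrative choices; in the course, the per-point gradient step would be distributed across partitions with Spark's map-and-reduce operations:

```python
# Fit y = w*x + b by batch gradient descent on a toy 1-D dataset.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # points lying on y = 2x + 1

w, b = 0.0, 0.0
lr = 0.1  # learning rate (illustrative)
for _ in range(2000):
    # map step: each point's gradient contribution (parallelizable in Spark)
    grads = [((w * x + b - y) * x, (w * x + b - y)) for x, y in data]
    # reduce step: average the contributions into a single gradient
    gw = sum(g[0] for g in grads) / len(data)
    gb = sum(g[1] for g in grads) / len(data)
    # update the parameters on the driver
    w -= lr * gw
    b -= lr * gb
```

The map step is embarrassingly parallel while the parameter update is a sequential bottleneck on the driver, which is exactly the tension the course's discussion of parallelizing machine learning algorithms addresses.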
Module 1: Spark Introduction and "Hello World" Sample Problem
Module 2: Overview of Parallel Computing Paradigms and Frameworks
Module 3: Apache Spark Foundations and Spark APIs
Module 4: Exploratory Data Analysis with Apache Spark
Module 5: Fundamental Data Mining Algorithms in Apache Spark
Module 6: Spark at Scale in the Cloud
Module 7: Online Learning, Apache Spark Deployments and Case Studies
Module 8: Hands-on Large-scale Distributed Classifier Design using Apache Spark