Spark Course Content
Lesson 1: Introduction to Spark, Spark Basics
o Introduction to Spark
o how Spark overcomes the drawbacks of working with MapReduce
o understanding in-memory MapReduce
o interactive operations on MapReduce
o Spark stack, fine vs. coarse grained update
o Spark stack, Spark on Hadoop YARN
o HDFS revision, YARN revision, an overview of Spark and how it is better than Hadoop
o deploying Spark without Hadoop
o Spark history server, Cloudera distribution
o Spark installation guide, Spark configuration
o memory management, executor memory vs. driver memory, working with Spark Shell
o the concept of Resilient Distributed Datasets (RDD)
o learning to do functional programming in Spark
o the architecture of Spark.
Lesson 2: Working with RDDs in Spark, Aggregating Data with Pair RDDs
o Understanding the concept of key-value pairs in RDDs
o learning how Spark makes MapReduce operations faster
o various operations of RDD, MapReduce interactive operations, fine & coarse grained update, Spark stack
o Spark RDD, creating RDDs, RDD partitioning
o operations & transformations in RDD, deep dive into Spark RDDs
o the RDD general operations
o a read-only partitioned collection of records
o using the concept of RDD for faster and more efficient data processing, RDD actions such as collect, count, collectAsMap and saveAsTextFile, pair RDD functions.
Lesson 3: Writing and Deploying Spark Applications, Parallel Processing
o Comparing the Spark applications with Spark Shell
o creating a Spark application using Scala or Java
o deploying a Spark application
o Scala built application, creation of mutable list, set & set operations, list, tuple, concatenating list, creating application using SBT
o deploying application using Maven, the web user interface of Spark application
o a real-world example of Spark and configuring Spark.
o Learning about Spark parallel processing
o deploying on a cluster, introduction to Spark partitions
o file-based partitioning of RDDs
o understanding of HDFS and data locality
o mastering the technique of parallel operations
o comparing repartition & coalesce, RDD actions.
Lesson 4: Spark RDD Persistence and Spark Streaming & MLlib
o The execution flow in Spark
o Understanding the RDD persistence overview
o Spark execution flow & Spark terminology
o distributed shared memory vs. RDD, RDD limitations
o Spark shell arguments, distributed persistence
o RDD lineage
o key/value pair operations available through implicit conversion, such as countByKey, reduceByKey, sortByKey, aggregateByKey
o Spark Streaming Architecture
o writing a streaming program
o processing of Spark streams
o processing Spark Discretized Stream (DStream)
o the context of Spark Streaming
o streaming transformation, Flume Spark streaming, request count and DStream, multi-batch operation, sliding window operations and advanced data sources.
o Different Algorithms, the concept of iterative algorithm in Spark
o analyzing with Spark graph processing
o introduction to K-Means and machine learning
o shared variables in Spark, such as broadcast variables
o learning about accumulators.
Lesson 5: Improving Spark Performance, Spark SQL and Data Frames
o Introduction to shared variables in Spark, such as broadcast variables, learning about accumulators
o the common performance issues and troubleshooting the performance problems.
o Learning about Spark SQL
o the context of SQL in Spark for providing structured data processing
o JSON support in Spark SQL
o working with XML data, parquet files
o creating HiveContext
o writing Data Frame to Hive
o reading data from JDBC sources
o understanding the Data Frames in Spark, creating Data Frames, manual inferring of schema, working with CSV files, reading JDBC tables, Data Frame to JDBC
o user defined functions in Spark SQL
o shared variable and accumulators
o learning to query and transform data in Data Frames
o how Data Frame provides the benefit of both Spark RDD and Spark SQL
o deploying Hive with Spark as the execution engine.
Lesson 6: Scheduling/ Partitioning
o Learning about scheduling and partitioning in Spark, hash partition, range partition, scheduling within and across applications, static partitioning, dynamic sharing, fair scheduling
o mapPartitionsWithIndex, zip, groupByKey, Spark master high availability, standby Masters with ZooKeeper
o single-node recovery with local file system, higher-order functions.
Scala Course Content
Lesson 7: Introduction to Scala, Pattern Matching
o Introducing Scala and deployment of Scala for Big Data applications and Apache Spark analytics.
o The importance of Scala, the concept of REPL (Read Evaluate Print Loop)
o deep dive into Scala pattern matching
o type inference
o higher-order functions, currying, traits, application space and Scala for data analysis.
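As a taste of the higher-order function and currying topics listed above, here is a minimal sketch in plain Scala. The names `CurryingSketch`, `applyTwice` and `scale` are hypothetical and chosen only for illustration; they are not part of the course material.

```scala
object CurryingSketch {
  // Higher-order function: takes another function as a parameter.
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // Curried function: two parameter lists, so the first argument
  // can be supplied on its own to obtain a new function.
  def scale(factor: Int)(value: Int): Int = factor * value

  def main(args: Array[String]): Unit = {
    val triple: Int => Int = scale(3)       // only the first parameter list is applied
    println(applyTwice(triple, 2))          // 3 * (3 * 2) = 18
    println(List(1, 2, 3).map(scale(10)))   // List(10, 20, 30)
  }
}
```

Supplying only `factor` yields a reusable `Int => Int`, which is why curried functions combine naturally with `map` and other higher-order functions.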
Lesson 8: Scala collections, Case classes and pattern matching
o Introduction to Scala collections
o classification of collections
o the difference between Iterator and Iterable in Scala, example of list sequence in Scala.
o Understanding sealed traits; wildcard, constructor, tuple, variable, and constant patterns.
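The pattern kinds named in this lesson can be seen side by side in a small sketch. `PatternSketch`, `Shape`, `Circle`, `Rect`, `area` and `describe` are hypothetical names used only for this illustration.

```scala
object PatternSketch {
  // A sealed trait can only be extended in this file, so the compiler
  // can check that a match over Shape covers every case.
  sealed trait Shape
  final case class Circle(radius: Double) extends Shape
  final case class Rect(width: Double, height: Double) extends Shape

  def area(s: Shape): Double = s match {
    case Circle(r)  => math.Pi * r * r   // constructor pattern; r is a variable pattern
    case Rect(w, h) => w * h
  }

  def describe(x: Any): String = x match {
    case 0          => "zero"                    // constant pattern
    case (a, b)     => s"pair of $a and $b"      // tuple pattern binding two variables
    case Rect(w, _) => s"rectangle of width $w"  // constructor pattern with a wildcard inside
    case n: Int     => s"the number $n"          // typed pattern
    case _          => "something else"          // wildcard pattern as the fallback
  }

  def main(args: Array[String]): Unit = {
    println(area(Rect(2.0, 3.0)))       // 6.0
    println(describe(0))                // zero
    println(describe((1, "a")))         // pair of 1 and a
    println(describe(Rect(2.0, 3.0)))   // rectangle of width 2.0
    println(describe("hi"))             // something else
  }
}
```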
Lesson 9: Concepts of traits with example
o Understanding traits in Scala
o the advantages of traits, linearization of traits
o the Java equivalent and avoiding boilerplate code.
Lesson 10: Scala-Java Interoperability
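Linearization of traits is easiest to see when two traits both call `super`. The sketch below uses hypothetical names (`LinearizationSketch`, `Greeter`, `Loud`, `Polite`, `A`, `B`) and assumes nothing beyond the Scala standard library.

```scala
object LinearizationSketch {
  trait Greeter { def greet: String = "hello" }

  // Each trait wraps whatever comes next in the linearization order.
  trait Loud extends Greeter {
    override def greet: String = super.greet.toUpperCase
  }
  trait Polite extends Greeter {
    override def greet: String = super.greet + ", please"
  }

  class A extends Loud with Polite   // order: A, Polite, Loud, Greeter
  class B extends Polite with Loud   // order: B, Loud, Polite, Greeter

  def main(args: Array[String]): Unit = {
    println(new A().greet)   // HELLO, please
    println(new B().greet)   // HELLO, PLEASE
  }
}
```

The trait mixed in last is consulted first, so swapping the mixin order changes which `super.greet` each trait sees.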
o Implementation of traits in Scala and Java
o handling classes that extend multiple traits.
Lesson 11: Mutable collections vs. Immutable collections
o The two types of collections in Scala
o Mutable and Immutable collections
o understanding lists and arrays in Scala
o the list buffer and array buffer, Queue in Scala, double-ended queue Deque, Stacks, Sets, Maps, Tuples in Scala.
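The contrast between immutable and mutable collections can be sketched as follows. `CollectionsSketch`, `prependZero`, `fill` and `edit` are hypothetical helper names introduced for this illustration only.

```scala
import scala.collection.mutable.{ArrayBuffer, ListBuffer}

object CollectionsSketch {
  // Immutable List: "adding" builds a new list and leaves the original untouched.
  def prependZero(xs: List[Int]): List[Int] = 0 :: xs

  // Mutable ListBuffer: the same buffer object is changed in place.
  def fill(buf: ListBuffer[Int]): Unit = {
    buf += 4        // append at the end
    buf.prepend(0)  // insert at the front
  }

  // Mutable ArrayBuffer: supports appending and in-place update by index.
  def edit(arr: ArrayBuffer[String]): Unit = {
    arr += "c"
    arr(0) = "z"
  }

  def main(args: Array[String]): Unit = {
    val fixed = List(1, 2, 3)
    println(prependZero(fixed))   // List(0, 1, 2, 3)
    println(fixed)                // List(1, 2, 3)

    val buf = ListBuffer(1, 2, 3)
    fill(buf)
    println(buf.toList)           // List(0, 1, 2, 3, 4)

    val arr = ArrayBuffer("a", "b")
    edit(arr)
    println(arr.mkString(","))    // z,b,c
  }
}
```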
Lesson 12: Use Case bobsrockets package
o Introduction to Scala packages and imports
o selective imports, ScalaTest test classes
o introduction to JUnit test class
o JUnit interface via the JUnit 3 suite for ScalaTest
o packaging of Scala applications in Directory Structure
o example of Spark Split and Spark Scala.