Course Description

Produced Exclusively for DASCA by John Wiley, USA

This resource, produced under the DASCA Data Science Knowledgeware project, is an integral part of the exam preparation kit provided to all individuals formally registered for the DASCA ABDE™ certification program. The courses in this program are intended solely to assist and complement learning and comprehension of important topics covered in the certification exam preparation kit. Access to these course modules is restricted to individuals formally registered in the DASCA ABDE™ certification program.

Course curriculum

  • 1

    Apache Spark and Scala - Overview

    • Course Objectives

    • Target Audience

    • Course Prerequisites

    • Value to the Professionals

    • Value to the Professionals-2

    • Value to the Professionals-3

    • Lessons Covered

    • Conclusion

  • 2

    Introduction to Spark

    • Objectives

    • Need for New-Generation Distributed Systems

    • Limitations of MapReduce in Hadoop

    • Limitations of MapReduce in Hadoop-2

    • Batch vs Real-Time Processing

    • Applications of Stream Processing

    • Applications of In-Memory Processing

    • Introduction to Apache Spark

    • History of Spark

    • Language Flexibility in Spark

    • Spark Execution Architecture

    • Automatic Parallelization of Complex Flows

    • Automatic Parallelization of Complex Flows-Important Points

    • APIs That Match User Goals

    • Apache Spark - A Unified Platform for Big Data Apps

    • More Benefits of Apache Spark

    • Running Spark in Different Modes

    • Installing Spark as a Standalone Cluster - Configuration

    • Demo - Install Apache Spark

    • Overview of Spark on a Cluster

    • Demo - Install Apache Spark-1

    • Tasks of Spark on a Cluster

    • Companies Using Spark - Use Cases

    • Hadoop Ecosystem vs Apache Spark

    • Hadoop Ecosystem vs Apache Spark-2

    • Summary

    • Summary-2

    • Conclusion

  • 3

    Introduction to Programming in Scala

    • Objectives

    • Introduction to Scala

    • Basic Data Types

    • Basic Literals

    • Basic Literals-2

    • Basic Literals-3

    • Introduction to Operators

    • Use Basic Literals and the Arithmetic Operator

    • Demo Use Basic Literals and the Arithmetic Operator

    • Use the Logical Operator

    • Demo Use the Logical Operator

    • Introduction to Type Inference

    • Type Inference for Recursive Methods

    • Type Inference for Polymorphic Methods and Generic Classes

    • Unreliability of the Type Inference Mechanism

    • Mutable Collection vs Immutable Collection

    • Functions

    • Anonymous Functions

    • Objects

    • Classes

    • Use Type Inference, Functions, Anonymous Functions, and Classes

    • Demo Use Type Inference, Functions, Anonymous Functions, and Classes

    • Traits as Interfaces

    • Traits - Example

    • Collections

    • Types of Collections

    • Types of Collections-2

    • Lists

    • Perform Operations on Lists

    • Demo Use Data Structures

    • Maps

    • Pattern Matching

    • Implicits

    • Implicits-2

    • Streams

    • Use Data Structures

    • Demo Perform Operations on Lists

    • Summary

    • Summary-2

    • Conclusion
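
The Scala features named in this lesson list (type inference, anonymous functions, immutable collections, pattern matching) can be previewed in a few lines of plain Scala; the values and names below are illustrative only, not taken from the course material.

```scala
// Type inference: no annotations needed, the compiler infers Int and String.
val answer = 42
val greeting = "hello"

// An anonymous function bound to a val; its inferred type is Int => Int.
val square = (x: Int) => x * x

// An immutable List transformed with a higher-order method.
val squares = List(1, 2, 3).map(square)

// An immutable Map; get returns an Option, handled with pattern matching.
val capitals = Map("France" -> "Paris", "Japan" -> "Tokyo")
val capital = capitals.get("Japan") match {
  case Some(city) => city
  case None       => "unknown"
}
```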

  • 4

    Using RDD for Creating Applications in Spark

    • Objectives

    • RDD API

    • Creating RDDs

    • Creating RDDs - Referencing an External Dataset

    • Referencing an External Dataset - Text Files

    • Referencing an External Dataset - Text Files-2

    • Referencing an External Dataset - Sequence Files

    • Referencing an External Dataset - Other Hadoop Input Formats

    • Creating RDDs - Important Points

    • RDD Operations

    • RDD Operations - Transformations

    • Features of RDD Persistence

    • Storage Levels of RDD Persistence

    • Invoking the Spark Shell

    • Importing Spark Classes

    • Creating the Spark Context

    • Creating the Spark Context-2

    • Loading a File in Shell

    • Performing Some Basic Operations on Files in Spark Shell RDDs

    • Packaging a Spark Project With SBT

    • Running a Spark Project with SBT

    • Demo - Build a Scala Project

    • Build a Scala Project-1

    • Demo - Build a Spark Java Project

    • Build a Spark Java Project-1

    • Shared Variables - Broadcast

    • Shared Variables - Accumulators

    • Writing a Scala Application

    • Demo - Run a Scala Application

    • Run a Scala Application

    • Write a Scala Application Reading the Hadoop Data

    • Demo - Run a Scala Application Reading the Hadoop Data

    • Run a Scala Application Reading the Hadoop Data

    • DoubleRDD Methods

    • PairRDD Methods- Join

    • PairRDD Methods- Others

    • JavaPairRDD Methods

    • JavaPairRDD Methods-2

    • General RDD Methods

    • General RDD Methods-2

    • Java RDD Methods

    • Common Java RDD Methods

    • Spark Java Function Classes

    • Method for Combining JavaPairRDD Functions

    • Transformations in RDD

    • Other Methods

    • Actions in RDD

    • Key-value Pair RDD in Scala

    • Key-value Pair RDD in Java

    • Using MapReduce and Pair RDD Operations

    • Reading Text File from HDFS

    • Reading Sequence File from HDFS

    • Writing Text Data to HDFS

    • Writing Sequence File to HDFS

    • Using GroupBy

    • Using GroupBy-2

    • Demo - Run a Scala Application Performing GroupBy Operation

    • Run a Scala Application Performing GroupBy Operation-1

    • Demo - Write and Run a Java Application

    • Write and Run a Java Application

    • Summary

    • Summary-2

    • Conclusion
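
The transformations listed above (flatMap, map, reduceByKey) have direct counterparts on plain Scala collections, so the classic RDD word count can be sketched without a cluster. This is an illustrative analogy, not Spark code: in Spark the same pipeline would start from sc.textFile(...) and use reduceByKey(_ + _) in place of the groupBy step.

```scala
// Word count over a toy, made-up dataset using plain collections.
val lines = Seq("spark and scala", "spark streaming")

val counts = lines
  .flatMap(_.split(" "))   // split each line into words
  .map(word => (word, 1))  // pair each word with a count of 1
  .groupBy(_._1)           // plain-collection stand-in for reduceByKey
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
```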

  • 5

    Running SQL Queries Using Spark SQL

    • Objectives

    • Importance of Spark SQL

    • Benefits of Spark SQL

    • DataFrames

    • SQLContext

    • SQLContext-2

    • Creating a DataFrame

    • Using DataFrame Operations

    • Using DataFrame Operations-2

    • Demo - Run Spark SQL with a DataFrame

    • Run Spark SQL Programmatically-1

    • Save Modes

    • Saving to Persistent Tables

    • Parquet Files

    • Partition Discovery

    • Schema Merging

    • JSON Data

    • Hive Table

    • DML Operation - Hive Queries

    • Demo - Run Hive Queries Using Spark SQL

    • JDBC to other Databases

    • Supported Hive Features

    • Supported Hive Features-2

    • Supported Hive Data Types

    • Case Classes

    • Case Classes-2

    • Summary

    • Summary-2

    • Conclusion
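
Case classes, covered in the last lessons above, are the mechanism Spark SQL uses to infer a DataFrame schema from Scala objects: each field becomes a column. This sketch sticks to plain Scala (made-up data, no Spark session) to show the case-class mechanics themselves.

```scala
// A case class gets equals, pattern-matching support, and copy for free.
case class Person(name: String, age: Int)

val people = Seq(Person("Ana", 34), Person("Ben", 19))

// A collection filter analogous to "SELECT name FROM people WHERE age > 21".
val adults = people.filter(_.age > 21).map(_.name)

// Pattern matching destructures a case class by its fields.
val described = people.map { case Person(n, a) => s"$n is $a" }
```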

  • 6

    Spark Streaming

    • Objectives

    • Introduction to Spark Streaming

    • Working of Spark Streaming

    • Streaming Word Count

    • Micro Batch

    • DStreams

    • DStreams-2

    • Input DStreams and Receivers

    • Input DStreams and Receivers-2

    • Basic Sources

    • Advanced Sources

    • Transformations on DStreams

    • Output Operations on DStreams

    • Design Patterns for Using foreachRDD

    • DataFrame and SQL Operations

    • DataFrame and SQL Operations-2

    • Checkpointing

    • Enabling Checkpointing

    • Socket Stream

    • File Stream

    • Stateful Operations

    • Window Operations

    • Types of Window Operations

    • Types of Window Operations-2

    • Join Operations - Stream-Dataset Joins

    • Monitoring Spark Streaming Application

    • Performance Tuning - High Level

    • Demo - Capture and Process the Netcat Data

    • Capture and Process the Flume Data

    • Demo - Capture the Twitter Data

    • Capture the Twitter Data

    • Summary

    • Summary-2

    • Conclusion
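
The window operations listed above group the last N micro-batches and slide forward by a step. The same idea can be previewed on a plain sequence of per-batch counts with Scala's sliding(size, step); the numbers are made up, and the reduceByWindow call named in the comment is the Spark Streaming analogue, not what this snippet runs.

```scala
// Toy per-micro-batch event counts.
val batchCounts = Seq(3, 1, 4, 1, 5, 9)

// A window of 3 batches sliding by 1 batch: conceptually like
// reduceByWindow(_ + _, <3-batch window>, <1-batch slide>) on a DStream.
val windowedSums = batchCounts.sliding(3, 1).map(_.sum).toList
```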

  • 7

    Spark ML Programming

    • Objectives

    • Introduction to Machine Learning

    • Applications of Machine Learning

    • Machine Learning in Spark

    • DataFrames

    • Transformers and Estimators

    • Pipeline

    • Working of a Pipeline

    • Working of a Pipeline-2

    • DAG Pipelines

    • Runtime Checking

    • Parameter Passing

    • General Machine Learning Pipeline - Example

    • Model Selection via Cross-Validation

    • Supported Types, Algorithms and Utilities

    • Data Types

    • Feature Extraction and Basic Statistics

    • Clustering

    • K-Means

    • K-Means-1

    • K-Means-2

    • Demo - Perform Clustering Using K-Means

    • Perform Clustering Using K-Means-1

    • Gaussian Mixture

    • Power Iteration Clustering

    • Latent Dirichlet Allocation

    • Latent Dirichlet Allocation-2

    • Collaborative Filtering

    • Classification

    • Classification-2

    • Regression

    • Example of Regression

    • Demo - Perform Classification Using Linear Regression

    • Perform Classification Using Linear Regression

    • Demo - Run Linear Regression

    • Run Linear Regression

    • Demo - Perform Recommendation Using Collaborative Filtering

    • Perform Recommendation Using Collaborative Filtering

    • Demo - Run Recommendation System

    • Run Recommendation System

    • Summary

    • Summary-2

    • Conclusion
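
One iteration of the K-Means algorithm covered above consists of two steps: assign each point to its nearest centroid, then recompute each centroid as the mean of its cluster. A minimal sketch on made-up 1-D points (Spark's KMeans in MLlib runs this loop distributed over an RDD/DataFrame):

```scala
// Toy data: four 1-D points and two initial centroids.
val points    = Seq(1.0, 2.0, 9.0, 10.0)
val centroids = Seq(0.0, 8.0)

// Index of the centroid closest to point p.
def nearest(p: Double): Int =
  centroids.indices.minBy(i => math.abs(p - centroids(i)))

// Step 1: assign each point to a cluster.
val clusters = points.groupBy(nearest)

// Step 2: move each centroid to the mean of its cluster
// (keeping the old centroid if its cluster is empty).
val newCentroids = centroids.indices.map { i =>
  val members = clusters.getOrElse(i, Seq(centroids(i)))
  members.sum / members.size
}
```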

  • 8

    Spark GraphX Programming

    • Objectives

    • Introduction to Graph-Parallel Systems

    • Limitations of Graph-Parallel Systems

    • Introduction to GraphX

    • Introduction to GraphX-2

    • Importing GraphX

    • The Property Graph

    • The Property Graph-2

    • Creating a Graph

    • Demo - Create a Graph Using GraphX

    • Create a Graph Using GraphX

    • Triplet View

    • Graph Operators

    • List of Operators

    • List of Operators-2

    • Property Operators

    • Structural Operators

    • Subgraphs

    • Join Operators

    • Demo - Perform Graph Operations Using GraphX

    • Perform Graph Operations Using GraphX-1

    • Demo - Perform Subgraph Operations

    • Perform Subgraph Operations-1

    • Neighborhood Aggregation

    • Map Reduce Triplets

    • Demo - Perform Map Reduce Operations

    • Perform Map Reduce Operations-1

    • Counting Degree of Vertex

    • Collecting Neighbors

    • Caching and Uncaching

    • Vertex and Edge RDDs

    • Graph System Optimizations-1

    • Summary

    • Summary-2

    • Conclusion
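
The degree-counting lesson above (GraphX's graph.degrees) boils down to: each edge contributes one degree to its source vertex and one to its destination. A plain-Scala sketch of that computation on a made-up edge list:

```scala
// Toy undirected edge list; vertex IDs are Longs, as in GraphX.
val edges = Seq((1L, 2L), (2L, 3L), (1L, 3L), (3L, 4L))

// Emit both endpoints of every edge, then count occurrences per vertex.
val degrees = edges
  .flatMap { case (src, dst) => Seq(src, dst) }
  .groupBy(identity)
  .map { case (vertex, occurrences) => (vertex, occurrences.size) }
```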