Big Data & Hadoop Development with Spark and Scala

With this Apache Spark & Scala certification course, you will master essential skills such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming, and Spark shell scripting.

About Course

What is Big Data Hadoop?

Big data is a collection of data so large that it can’t be processed using traditional database management systems. This huge amount of data comes from many sources, such as smartphones, Twitter, and Facebook. According to various surveys, 90% of the world’s data was generated in the last two years.

To address these issues, Google devised an approach that splits large amounts of data into smaller chunks, maps them to many computers, and, when the calculations are done, brings the results back to be consolidated. Hadoop is an open-source software framework for storing and processing big data that is built on this idea. The Hadoop framework has many components, such as HDFS, MapReduce, HBase, Hive, Pig, Sqoop, and ZooKeeper, for analyzing structured and unstructured data on commodity hardware.

This is an industry-recognized training course that combines the training courses for Hadoop developer, Hadoop administrator, Hadoop testing, and big data analytics. This Cloudera Hadoop training will prepare you to clear the big data certification.
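
The split, map, and consolidate approach described above can be sketched in a few lines of plain Python. This is only an illustrative single-machine sketch, not Hadoop's actual API: the function names (split_into_chunks, map_chunk, shuffle, reduce_groups, word_count) and the sample sentences are assumptions made up for this example, and on a real cluster each chunk would be processed on a different machine.

```python
from collections import defaultdict


def split_into_chunks(records, chunk_size):
    """Split the input into fixed-size chunks, one per (simulated) worker."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_chunk(chunk):
    """Map step: emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks):
    """Group the emitted values by key across all chunks."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_groups(grouped):
    """Reduce step: consolidate each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, chunk_size=2):
    """Run the split -> map -> shuffle -> reduce pipeline on a list of lines."""
    chunks = split_into_chunks(lines, chunk_size)
    mapped = [map_chunk(chunk) for chunk in chunks]  # would run in parallel on a cluster
    return reduce_groups(shuffle(mapped))


# Hypothetical sample input, chosen only for illustration.
lines = [
    "big data needs big clusters",
    "spark runs on hadoop clusters",
    "hadoop stores big data",
]
counts = word_count(lines)
print(counts["big"], counts["hadoop"], counts["spark"])
```

Because each chunk is mapped independently, the map step is the part that can be spread across many machines; only the grouped results need to be brought back together.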

Apache Spark and Scala

The Apache Spark and Scala Certification Training Course gives you experience in large-scale data processing using RDDs, Spark Streaming, Spark SQL, GraphX, MLlib, and Scala, with real-life use cases from the telecom and banking domains.

Why Should You Learn Apache Spark and Scala?

Apache Spark is opening up various opportunities for big data exploration and making it easier for organizations to solve different kinds of big data problems. Spark is one of the hottest technologies right now, not only among data engineers: the majority of data scientists also prefer to work with Spark. Apache Spark is a fascinating platform for data scientists, with use cases spanning exploratory and operational analytics.

Data scientists are showing interest in working with Spark because of its ability to keep data resident in memory, which helps speed up machine learning workloads compared with Hadoop MapReduce. Apache Spark has seen a continuous upward trajectory in the big data ecosystem. With IBM’s recent announcement that it will educate more than one million data engineers and data scientists on Apache Spark, 2016 is unquestionably the year to learn Spark and pursue a profitable career.

After the introduction of Hadoop, many organizations invested in new computing clusters to make use of the technology. Apache Spark, however, does not require investment in new computing clusters, as organizations can run Spark on top of their existing Hadoop clusters.

Spark’s enterprise adoption is rising because of its potential to eclipse Hadoop, as it is the best alternative to MapReduce, whether inside the Hadoop framework or outside it. Like Hadoop, Apache Spark requires technical expertise in object-oriented programming concepts to program and run, which opens up job opportunities for people with hands-on working experience in Spark.

With IT Skills’s Apache Spark and Scala Certification Training Course, you will advance your expertise in the Big Data and Hadoop ecosystems.

Course Overview

IT Skills's online Apache Spark and Scala Training Course shows learners how Spark enables in-memory data processing and runs faster than Hadoop MapReduce.
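
The benefit of in-memory processing for iterative workloads can be illustrated without Spark itself. The sketch below is a toy model in plain Python, not Spark's API: ToyDataset, load_points, and iterate_mean are hypothetical names invented for this example. It counts how often the data has to be reloaded when an algorithm makes several passes over it, with and without keeping the data in memory.

```python
class ToyDataset:
    """Toy stand-in for a dataset that is expensive to load (not a Spark class)."""

    def __init__(self, loader):
        self._loader = loader
        self._use_cache = False
        self._data = None
        self.load_count = 0  # how many times the loader was actually called

    def cache(self):
        """Ask the dataset to keep its rows in memory after the first load."""
        self._use_cache = True
        return self

    def rows(self):
        """Return the rows, reloading them unless a cached copy is available."""
        if self._use_cache and self._data is not None:
            return self._data
        self.load_count += 1
        data = self._loader()
        if self._use_cache:
            self._data = data
        return data


def load_points():
    """Hypothetical loader; in practice this would be a slow disk or network read."""
    return [1.0, 2.0, 3.0, 4.0]


def iterate_mean(dataset, passes):
    """Estimate the mean by gradient descent, reading the dataset once per pass."""
    estimate = 0.0
    for _ in range(passes):
        rows = dataset.rows()
        gradient = sum(estimate - x for x in rows) / len(rows)
        estimate -= 0.5 * gradient
    return estimate


uncached = ToyDataset(load_points)
cached = ToyDataset(load_points).cache()

result_uncached = iterate_mean(uncached, passes=10)
result_cached = iterate_mean(cached, passes=10)

print("loads without cache:", uncached.load_count)
print("loads with cache:", cached.load_count)
```

Both runs compute the same estimate; the difference is only in how many times the data is fetched, which is the cost that in-memory processing is meant to avoid for multi-pass jobs such as machine learning training.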

Objectives of the Apache Spark and Scala Certification Course

After completing the Apache Spark and Scala Certification Course from IT Skills, you should be able to:

* Install Spark and implement Spark operations on Spark Shell

* Become skilled in using RDDs to create applications in Spark

* Understand the limitations of MapReduce and the role of Spark in overcoming them

* Understand functional programming in Scala

* Understand Spark Streaming features

* Understand the fundamentals and features of the Scala programming language

* Understand the GraphX API and implement graph algorithms

* Master SQL queries using Spark SQL

* Understand the features of Spark ML programming and GraphX programming

Who can attend?

* Data Management Professionals

* Data Analysts

* BI Analysts

* Data Scientists

* Architects and Developers

* Anyone looking for a career in Big Data


    Module 1: Introduction to Big Data and Hadoop

    • What is Big Data?
    • The Rise of Bytes
    • Data Explosion and its Sources
    • Types of Data – Structured, Semi-structured, Unstructured data
    • Why did Big Data suddenly become so prominent
    • Data – The most valuable resource
    • Characteristics of Big Data – IBM’s Definition
    • Limitations of Traditional Large-Scale Systems
    • Various Use Cases for Big Data
    • Challenges of Big Data
    • Hadoop Introduction - What is Hadoop? Why Hadoop?
    • Is Hadoop a fad or here to stay? - Hadoop Job Trends
    • History and Milestones of Hadoop
    • Hadoop Core Components – MapReduce & HDFS
    • Why HDFS?
    • Comparing SQL Database with Hadoop
    • Understanding the big picture - Hadoop Eco-Systems
    • Commercial Distributions of Hadoop – Cloudera, Hortonworks, MapR, IBM BigInsights; Cloud Computing – Amazon Web Services, Microsoft Azure HDInsight
    • Supported Operating Systems
    • Organizations using Hadoop
    • Hands on with Linux File System
    • Hadoop Documentation and Resources

    Module 2: Getting Started with Hadoop Setup

    • Deployment Modes – Standalone, Pseudo-Distributed (Single Node), Multi-Node
    • Demo Pseudo-Distributed Virtual Machine Setup on Windows
    •       VMware Player - Introduction
            Install VMware Player
            Open a VM in VMware Player
    • Hadoop Configuration overview
    •       Configuration parameters and values
            HDFS parameters
            MapReduce parameters
            YARN parameters
            Hadoop environment setup
            Environment variables
    • Hadoop Security - Authentication/Authorization
    • Hadoop Core Services – Daemon Process Status using JPS
    • Overview of Hadoop WebUI
    •       Firefox Bookmarks
            Web Ports
    • Eclipse development environment setup
    • References

    Module 3: Hadoop Architecture and HDFS

    • Introduction to Hadoop Distributed File System
    • Regular File System v/s HDFS
    • HDFS Architecture
    • Components of HDFS - NameNode, DataNode, Secondary NameNode
    • HDFS Features - Fault Tolerance, Horizontal Scaling
    • Data Replication, Rack Awareness
    • Setting up HDFS Block Size
    • HDFS 2.0 - High Availability, Federation
    • Hands on with Hadoop HDFS, WebUI and Linux Terminal Commands
    • HDFS File System Operations
    • NameNode Metadata, File System Namespace, NameNode Operation
    • Data Block Split, Benefits of Data Block Approach, HDFS - Block Replication Architecture, Block placement, Replication Method, Data Replication Topology, Network Topology, Data Replication Representation
    • Anatomy of Read and Write data on HDFS
    • Failure and Recovery in Read/Write Operation
    • Hadoop Component failures and recoveries
    • HDFS Programming Basics – Java API
    •       Java API Introduction
            Hadoop Configuration API
            HDFS API Overview
            HDFS File CRUD API
            HDFS Directory CRUD API
            Accessing HDFS Programmatically
    • HDFS Programming Advanced – Hadoop I/O
    •       File Compression, Decompression
            Writable Class Hierarchy
            Data Type Serialization, Deserialization
            Sequence Files
    • Running built-in MapReduce Examples
    •       Run MapReduce example to get a high level understanding
            Checking the output of M/R Job – console, WebUI
            Understanding the dump of M/R Job
    • When Hadoop is not suitable
    • Reference

    Module 4: Pseudo-Distributed Cluster Installation

    • HDFS Architecture
    • Data Flow (File Read, File Write)
    • HDFS Shell Commands
    • Hadoop Archives
    • Configuration - Installation of Pseudo-Distributed Cluster - Configuration Files
    • Introduction to MapReduce - Compression (LZO, Snappy) – Installation of Snappy

    Module 5: Multi-Node Cluster Installation

    • Multi-Node Cluster Setup using AWS Cloud Machines
    • Cluster Hardware Considerations – Operating Systems
    • Commands (fsck, job, dfsadmin)
    • Job Schedulers (Fair Scheduler, Capacity Scheduler)
    • Rack Awareness Policy
    • Balancing
    • NameNode Failure and Recovery - Commissioning and Decommissioning a Node

    Module 6: Data Warehousing - Pig

    Pig Data Flow Language – MapReduce using Scripting

    • Challenges Of MapReduce Development Using Java
    • Need for High Level Languages - Pig
    • Pig v/s MapReduce
    • What is/isn’t Pig, Pig Latin, Grunt Shell
    • Where to/not to use Pig?
    • Pig Installation and Configuration
    • Architecture: The Big Picture, Pig Components
    • Execution Environments - Local, MapReduce
    • Different ways of Invoking Pig – Interactive, Batch
    • Pig Example: Data Analysis in Pig Latin
    • Data Flow in Hadoop
    • Quickstart and Interoperability
    • Data Model and Nested Data Model
    • Expression in Pig Latin
    • Pig Data Types
    • Nulls in Pig Latin
    • Pig Macros
    • Pig Operation
    • Core Relational Operators – Load, Store, Filter, Transform, Join, Group, CoGroup, Union, Foreach, Sort/Order, Combine/Split, Distinct, Limit, Describe, Explain, Illustrate
    • Group v/s CoGroup v/s Join
    • Pig Latin: File Loaders & built-in UDF (Python, Java) usage
    • Pig v/s SQL
    • Implementation & Usage of Pig UDF
    • Hands on with Pig – 3 different datasets
    • Reference

    Module 7: Data Warehousing - Hive and HiveQL

    • Limitations of MapReduce
    • Need for High Level Languages
    • Analytical OLAP - Data Warehousing with Apache Hive and Apache Pig
    • HiveQL - SQL-like interface for MapReduce
    • What is Hive, Background, HiveQL
    • Where to use Hive? Why use Hive when Pig is here?
    • Pig v/s Hive
    • Hive Installation, Configuration Files
    • Hive Components, Architecture and Metastore
    • Metastore – configuration
    • Driver, Query Compiler, Optimizer and Execution Engine
    • Hive Server and Client components
    • Hive Data Types
    • Hive Data Model
    • File Formats
    • Hive Example
    • Hive DDL
            Create/Show Database
            Create/Show/Drop Tables
    • Hive DML
            Load Files & Insert Data into Tables
    • Managed Tables v/s External Tables – Loading Data
    • HiveQL - Select, Filter, Join, Group By, Having, Cubes - Fact/Dimension (Star Schema)
    • Implementation & Usage of Hive UDF, UDAF, UDTF and SerDe
    • Partitioned Table - loading data
    • Clustered Table – loading data in Clustered Table
    • Views
    • Bucketing
    • Multi-Table Inserts
    • Using HCatalog
    • Joins
    • Hands on with Hive – CRUD - Get, Put, Delete, Scan
    • Limitations of Hive
    • SQL v/s Hive
    • Reference

    Module 8: NoSQL Databases - HBase

    • NoSQL Introduction
    • RDBMS (SQL) v/s HBase (NoSQL)
    • Transactional (OLTP)
    • RDBMS – Benefits, ACID, Demerits
    • CAP Theorem and Eventual consistency
    • Row Oriented v/s Column Oriented Storage
    • NoSQL: ColumnDB (HBase, Cassandra), Document (MongoDB, CouchDB, MarkLogic), GraphDB (Neo4j), KeyValue (Memcached, Riak, Redis, DynamoDB)
    • What is HBase?
    • Synopsis of how typical RDBMS scaling story runs
    • HBase comes to the rescue
    • HBase Introduction, Installation, Configuration
    • HBase Overview: part of Hadoop Ecosystem
    • Problems with Batch Processing like MR
    • HBase v/s HDFS
    • Batch vs. Real Time Data Processing
    • Use-cases for Real Time Data Read/Write
    • Seek v/s Transfer
    • HBase Storage Architecture
    • Write Path, Read Path
    • HBase components - HMaster, HRegionServer
    • ZooKeeper
    • Replication
    • HBase Data Model
    • HBase Schema Design: Tall-Narrow Tables and Flat-Wide Tables
    • Column Families
    • Column Value & Key Pair
    • HBase Operation - Memstore / HFile / WAL
    • HBase Client - HBase Shell
    • Admin Operation
    • CRUD Operations
    •       Create via Put method
            Read via Get method
            Update via Put method
            Delete via Delete method
    • Creating table, table properties, versioning, compression
    • Bulk Loading HBase
    • Hive, Pig - HBase Integration
    • Accessing HBase using Java Client v/s Admin API
    •       Introduction to Java API
            Read / Write path
            CRUD Operations – Create, Read, Update, Delete
            Scan Caching
            Batch Caching
            MapReduce Integration
            Secondary Index
    • Compaction – major, minor
    • Splits
    • Bloom Filters
    • Caches
    • Performance Tuning
    • Apache Phoenix
    • When Would I Use Apache HBase?
    • Companies Using HBase
    • Hands on HBase, Cassandra
    • When/Why to use HBase/Cassandra/MongoDB/Neo4j?
    • Who is using What?
    • Reference

    Module 9: Import/Export Data - Sqoop, Flume

    • Setup MySQL RDBMS
    • Introduction to Sqoop
    • Installing Sqoop, Configuration
    • Why Sqoop
    • Benefits of Sqoop
    • Sqoop Processing
    • How Sqoop works
    • Sqoop Architecture
    • Importing Data – to HDFS, Hive, HBase
    • Exporting Data – to MySQL
    • Sqoop Connectors
    • Sqoop Commands
    • Why Flume
    • Flume - Introduction
    • Flume Model
    • Scalability In Flume
    • How Flume works
    • Flume Complex Flow - Multiplexing
    • Hands on with Sqoop, Flume
    • Reference

    Module 10: Workflows using Oozie

    MapReduce Workflows

    • Workflows Introduction
    • Decomposing Problems into MapReduce Workflow
    • Using JobControl class
    • Introduction to Oozie
    • Oozie Installation
    • Creating Oozie Workflows
    • Oozie Service/Scheduler
    • Deploy and Run Oozie Workflow
    • Oozie use-cases
    • Hands on with Oozie
    • Reference

    Module 11: Administering Hadoop

    • Hadoop Deployment Modes
    • Pseudo-Distributed Mode - Virtual Machine for Hadoop Training
    •       VMware Player - Introduction
            Install VMware Player
            Create a VM in VMware Player
            Open a VM in VMware Player
            Oracle VirtualBox to Open a VM
            Open a VM using Oracle VirtualBox
    • Hadoop Cluster Configuration overview
    •       Configuration parameters and values
            HDFS parameters
            MapReduce parameters
            Hadoop environment setup
            ‘Include’ and ‘Exclude’ configuration files
            Site v/s Default conf files
            Environment Variables
            Hadoop Multi-node Installation using VM on single machine
            Hadoop Multi-Node Fully Distributed Mode Installation
            Passwordless SSH setup
            Configuration Files of Hadoop Cluster
            hadoop default.xml
            Safe Mode
            Load Balancer
            Hadoop Ports
            Hadoop Quotas
            Security - Kerberos
    • ZooKeeper
    •       What is ZooKeeper
            Introduction to ZooKeeper
            Who is using it
            Features of ZooKeeper
            Challenges Faced in Distributed Applications
            ZooKeeper Coordination, Architecture
            Uses of ZooKeeper
            Entities, Data Model, Services
            ZNode Types
            Sequential ZNodes
            Client API Functions
            ZooKeeper Installation, Configuration and Running ZooKeeper
            ZooKeeper use cases – HBase, Hadoop, Kafka
    • Hue, Cloudera Manager, Ambari, Mesos
    • Performance Monitoring and Tuning
    • Hadoop Cluster Performance Management
    •       Important Hadoop tuning parameters
            Hadoop Cluster Benchmarking Jobs – How to run the jobs
            TestDFSIO, NNBench, MRBench
            HDFS Benchmarking
    • Debugging, Troubleshooting
    • Reference

    Module 12: Apache Spark

    • Spark Concepts, Installation and Architecture
    • Spark ecosystem (Core, Spark SQL, MLlib, Streaming)
    • Spark Modes
    • Spark web UI
    • Spark shell
    • RDDs In Depth
    • Partitions
    • RDD Operations / transformations
    • RDD types
    • Key-Value pair RDDs
    • MapReduce on RDD
    • Caching and persistence
    • Submitting the first program to Spark
    • Broadcast variables
    • Accumulators
    • Memory management - Executors
    • Spark SQL
    • Spark Streaming – with NetCat, Kafka
    • Reference

    Module 13: Introduction to Spark

    • Limitations of MapReduce in Hadoop
    • Batch vs. Real-time analytics
    • Application of stream processing
    • How to install Spark
    • Spark Eco-system
    • Modes of Spark
    • Spark standalone cluster
    • Spark Web UI

    Module 16: Running SQL queries Using SparkSQL

    • SQLContext in Spark SQL
    • Explain the importance and features of SparkSQL
    • Describe methods to convert RDDs to Data Frames
    • Explain concepts of SparkSQL
    • Hive queries through Spark
    • Describe the concept of hive integration

    Module 17: Spark Streaming

    • Spark Streaming Architecture
    • Explain the concepts of Spark Streaming
    • Describe basic and advanced sources
    • Fault tolerance, Checkpointing, Parallelism level
    • Explain how stateful operations work
    • Explain window and join operations

    Module 18: Spark ML Programming

    • Machine Learning (ML) - use cases and techniques
    • Concepts of Spark ML
    • ML Dataset, ML algorithm, and model selection via cross-validation

    Module 19: Spark GraphX Programming

    • Spark GraphX Programming
    • Data frames and implementation of Spark SQL
    • Explain the key concepts of Spark GraphX programming
    • Limitations of the Graph Parallel system

Key Features

48 hours of practical, certification-oriented training; the program also covers Spark and Scala

Trainers are industry experts and certified professionals with 15+ years of experience

A WhatsApp group will be created to assist with all queries

Training is mainly focused on certification; after the training is completed, assistance will be provided by the trainer

100% Money-Back Guarantee* (refund in case of non-satisfaction on the first day of the class)

Pre-assessment test, quizzes, assignments and projects, mock test, and practice test

Batch size will not exceed 15 candidates

Case studies will be discussed on all the major topics

Training requires a machine with 8 GB RAM for hands-on practice

Training sessions are held on weekends, which is convenient for working professionals


    Who are the Instructors?

    • We believe in quality and follow a rigorous process in selecting our trainers. All our trainers are industry experts and professionals with experience in delivering training.

Date: May 5th, 2018
Time: 09.00 AM to 1.00 PM
Price: + 18% (GST)