Course Outline

  • Introduction
    • Hadoop history and concepts
    • Ecosystem
    • Distributions
    • High-level architecture
    • Hadoop myths
    • Hadoop challenges (hardware / software)
    • Labs: discussing your Big Data projects and challenges
  • Planning and installation
    • Selecting software and Hadoop distributions
    • Sizing the cluster and planning for growth
    • Selecting hardware and network
    • Rack topology
    • Installation
    • Multi-tenancy
    • Directory structure and logs
    • Benchmarking
    • Labs: cluster installation and running performance benchmarks
  • HDFS operations
    • Concepts (horizontal scaling, replication, data locality, rack awareness)
    • Nodes and daemons (NameNode, Secondary NameNode, HA Standby NameNode, DataNode)
    • Health monitoring
    • Command-line and browser-based administration
    • Adding storage and replacing defective drives
    • Labs: getting familiar with the HDFS command line (sample commands appear after this outline)
  • Data ingestion
    • Flume for ingesting logs and other data into HDFS
    • Sqoop for importing from SQL databases to HDFS, as well as exporting back to SQL
    • Hadoop data warehousing with Hive
    • Copying data between clusters (distcp)
    • Using S3 as a complement to HDFS
    • Data ingestion best practices and architectures
    • Labs: setting up and using Flume, then doing the same with Sqoop (example commands appear after this outline)
  • MapReduce operations and administration
    • Parallel computing before MapReduce: comparing HPC vs Hadoop administration
    • MapReduce cluster loads
    • Nodes and daemons (JobTracker, TaskTracker)
    • MapReduce UI walkthrough
    • MapReduce configuration
    • Job configuration
    • Optimising MapReduce
    • Fool-proofing MapReduce: what to tell your programmers
    • Labs: running MapReduce examples (a sample job submission appears after this outline)
  • YARN: new architecture and new capabilities
    • YARN design goals and implementation architecture
    • New actors: ResourceManager, NodeManager, ApplicationMaster
    • Installing YARN
    • Job scheduling under YARN
    • Labs: investigating job scheduling
  • Advanced topics
    • Hardware monitoring
    • Cluster monitoring
    • Adding and removing servers, upgrading Hadoop
    • Backup, recovery and business continuity planning
    • Oozie job workflows
    • Hadoop high availability (HA)
    • Hadoop Federation
    • Securing your cluster with Kerberos
    • Labs: setting up monitoring
  • Optional tracks
    • Cloudera Manager for cluster administration, monitoring and routine tasks; installation and use. In this track, all exercises and labs are performed within the Cloudera distribution environment (CDH5)
    • Ambari for cluster administration, monitoring and routine tasks; installation and use. In this track, all exercises and labs are performed within the Ambari cluster manager and Hortonworks Data Platform (HDP 2.0)
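
To give a flavour of the HDFS operations labs above, here is a minimal command-line sketch of the kind of tasks covered in that module. The paths, file name and replication factor are illustrative assumptions; the commands themselves are standard hdfs shell and dfsadmin invocations.

    # Cluster health from the command line
    hdfs dfsadmin -report                       # live/dead DataNodes, capacity, remaining space
    hdfs fsck / -files -blocks -locations       # block replication and placement report

    # Basic file operations (user directory and file name are examples)
    hdfs dfs -mkdir -p /user/student
    hdfs dfs -put sample.log /user/student/     # copy a local file into HDFS
    hdfs dfs -ls /user/student
    hdfs dfs -setrep -w 3 /user/student/sample.log   # change replication and wait for it to apply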
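
The data ingestion module likewise works from the shell with Sqoop and distcp. The snippet below is a sketch only: the JDBC URL, credentials, table name, directories and cluster addresses are placeholders to be replaced with your own.

    # Import a SQL table into HDFS with Sqoop (connection details are placeholders)
    sqoop import \
        --connect jdbc:mysql://dbhost/salesdb \
        --username etl --password-file /user/etl/.db-password \
        --table orders \
        --target-dir /data/orders \
        --num-mappers 4

    # Copy a directory between clusters with distcp (cluster addresses are placeholders)
    hadoop distcp hdfs://cluster-a:8020/data/orders hdfs://cluster-b:8020/backup/orders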
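
The MapReduce and YARN labs also run from the command line. Below, one of the example jobs bundled with Hadoop is submitted and then running work is listed; the jar path varies between distributions, so treat it as an assumption to adjust for CDH or HDP.

    # Submit a bundled example job (the jar location differs by distribution)
    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 1000

    # Inspect jobs and applications
    mapred job -list                 # jobs known to the MapReduce framework
    yarn application -list           # applications known to the YARN ResourceManager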

Requirements

  • Comfortable with basic Linux system administration
  • Basic scripting skills

Prior knowledge of Hadoop and distributed computing is not required, as these concepts will be introduced and explained during the course.

Lab environment

Zero Install: There is no need to install Hadoop software on students’ machines! A working Hadoop cluster will be provided for participants.

Students will need the following:

  • an SSH client (Linux and macOS already include one; for Windows, PuTTY is recommended)
  • a browser to access the cluster; we recommend Firefox with the FoxyProxy extension installed (see the example connection below)
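
As an example of how participants typically connect (the hostname and username below are placeholders handed out at the start of the class), an SSH session with a SOCKS tunnel lets FoxyProxy route browser traffic to the cluster's web UIs:

    # Open a shell on the lab cluster and a local SOCKS proxy on port 1080
    ssh -D 1080 student@lab-cluster.example.com

    # Then configure FoxyProxy to use a SOCKS proxy on localhost:1080 so the
    # NameNode and ResourceManager web UIs are reachable from your browser.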
Course duration: 21 hours
