Course Outline

1: HDFS (17%)

  • Explain the role of HDFS daemons.
  • Describe the standard operation of an Apache Hadoop cluster, covering both data storage and data processing.
  • Identify current computing system features that drive the need for a solution like Apache Hadoop.
  • Classify the primary objectives of HDFS design.
  • Given a specific scenario, determine the appropriate use case for HDFS Federation.
  • Identify the components and daemons within an HDFS High Availability (HA) Quorum cluster.
  • Analyse the role of HDFS security, specifically Kerberos.
  • Determine the optimal data serialization choice for a given scenario.
  • Describe file read and write pathways.
  • Identify the commands used to manipulate files within the Hadoop File System Shell.
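
The write pathway and replica placement covered above can be illustrated with a small sketch. Assumptions not taken from this outline: the default 128 MB block size and replication factor of 3, and a deliberately simplified rack-placement model (first replica on the writer's rack, second and third on one remote rack) standing in for HDFS's rack-awareness logic.

```python
# Sketch of the HDFS write path: block splitting and replica placement.
# Assumes the default dfs.blocksize (128 MB) and dfs.replication (3);
# this is a simplified model for illustration, not HDFS source code.

BLOCK_SIZE = 128 * 1024 * 1024  # bytes

def split_into_blocks(file_size):
    """Return the sizes of the HDFS blocks a file of file_size bytes occupies."""
    full, rem = divmod(file_size, BLOCK_SIZE)
    return [BLOCK_SIZE] * full + ([rem] if rem else [])

def place_replicas(writer_rack, racks):
    """Pick racks for 3 replicas: writer's rack, then a remote rack twice
    (mirroring the default policy of second and third replicas sharing a rack)."""
    remote = next(r for r in racks if r != writer_rack)
    return [writer_rack, remote, remote]

blocks = split_into_blocks(300 * 1024 * 1024)       # a 300 MB file
print(len(blocks))                                  # 3 blocks: 128 + 128 + 44 MB
print(place_replicas("rack1", ["rack1", "rack2"]))  # ['rack1', 'rack2', 'rack2']
```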

2: YARN and MapReduce version 2 (MRv2) (17%)

  • Understand how upgrading a cluster from Hadoop 1 to Hadoop 2 impacts cluster configuration settings.
  • Understand how to deploy MapReduce v2 (MRv2 / YARN), including all YARN daemons.
  • Grasp the fundamental design strategy for MapReduce v2 (MRv2).
  • Determine how YARN manages resource allocation.
  • Identify the workflow of a MapReduce job running on YARN.
  • Determine which files require modification and how to migrate a cluster from MapReduce version 1 (MRv1) to MapReduce version 2 (MRv2) running on YARN.
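
The fundamental design strategy behind MapReduce can be sketched in plain Python. This models only the map → shuffle → reduce data flow; the function names are illustrative and this is not the Hadoop MRv2 API.

```python
from collections import defaultdict

# Conceptual model of MapReduce: map emits key/value pairs, the shuffle
# groups values by key, and reduce aggregates each group. Illustrative
# only -- not Hadoop API code.

def word_count_map(line):
    for word in line.split():
        yield word, 1

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def word_count_reduce(key, values):
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog"]
mapped = [pair for line in lines for pair in word_count_map(line)]
result = dict(word_count_reduce(k, v) for k, v in shuffle(mapped).items())
print(result["the"])   # 2
```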

3: Hadoop Cluster Planning (16%)

  • Consider key factors when selecting hardware and operating systems to host an Apache Hadoop cluster.
  • Analyse options when selecting an operating system.
  • Understand kernel tuning and disk swapping.
  • Given a scenario and workload pattern, identify a hardware configuration suitable for the requirements.
  • Given a scenario, determine the necessary ecosystem components to run the cluster in order to meet Service Level Agreements (SLAs).
  • Cluster sizing: given a scenario and execution frequency, identify workload specifics, including CPU, memory, storage, and disk I/O.
  • Disk sizing and configuration, including JBOD versus RAID, SANs, virtualisation, and disk sizing requirements within a cluster.
  • Network topologies: understand network usage in Hadoop (for both HDFS and MapReduce) and propose or identify key network design components for a given scenario.
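
A back-of-the-envelope sizing calculation of the kind this section covers might look as follows. The 3× replication factor is the HDFS default; the 25% temporary-space allowance for MapReduce intermediate output is a common rule of thumb, not a figure from this outline, and should be adjusted per workload.

```python
# Rough raw-storage estimate for an HDFS cluster.
# replication=3 is the HDFS default; temp_fraction=0.25 reserves headroom
# for MapReduce intermediate output (a rule of thumb, tune per workload).

def raw_storage_needed(data_tb, replication=3, temp_fraction=0.25):
    return data_tb * replication * (1 + temp_fraction)

def nodes_needed(data_tb, disks_per_node=12, disk_tb=4, **kwargs):
    per_node = disks_per_node * disk_tb          # TB of raw JBOD per worker
    raw = raw_storage_needed(data_tb, **kwargs)
    return int(-(-raw // per_node))              # ceiling division

print(raw_storage_needed(100))   # 375.0 TB raw for 100 TB of data
print(nodes_needed(100))         # 8 workers with 12 x 4 TB JBOD each
```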

4: Hadoop Cluster Installation and Administration (25%)

  • Given a scenario, identify how the cluster manages disk and machine failures.
  • Analyse logging configurations and logging configuration file formats.
  • Understand the fundamentals of Hadoop metrics and cluster health monitoring.
  • Identify the function and purpose of available tools for cluster monitoring.
  • Be able to install all ecosystem components in CDH 5, including (but not limited to): Impala, Flume, Oozie, Hue, Cloudera Manager, Sqoop, Hive, and Pig.
  • Identify the function and purpose of available tools for managing the Apache Hadoop file system.

5: Resource Management (10%)

  • Understand the overarching design goals of each Hadoop scheduler.
  • Given a scenario, determine how the FIFO Scheduler allocates cluster resources.
  • Given a scenario, determine how the Fair Scheduler allocates cluster resources under YARN.
  • Given a scenario, determine how the Capacity Scheduler allocates cluster resources.
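
The contrast between the schedulers can be made concrete with a sketch of fair-share allocation. This is a simplified max-min fairness model: the real YARN Fair Scheduler also handles queues, weights, and preemption, and a FIFO Scheduler would instead satisfy demands strictly in arrival order.

```python
# Max-min fair share: capacity is split equally, and any share a small
# job cannot use is redistributed among the jobs still wanting more.
# A simplified model of Fair Scheduler behaviour, for illustration only.

def fair_shares(capacity, demands):
    shares = {job: 0 for job in demands}
    remaining = dict(demands)
    while capacity > 0 and remaining:
        equal = capacity / len(remaining)
        satisfied = {j: d for j, d in remaining.items() if d <= equal}
        if not satisfied:                 # everyone wants at least the equal share
            for job in remaining:
                shares[job] += equal
            return shares
        for job, demand in satisfied.items():
            shares[job] += demand         # small jobs get all they asked for
            capacity -= demand
            del remaining[job]
    return shares

# Job "a" needs little, so its leftover share is split between "b" and "c".
print(fair_shares(100, {"a": 20, "b": 50, "c": 100}))
```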

6: Monitoring and Logging (15%)

  • Understand the functions and features of Hadoop's metric collection capabilities.
  • Analyse the NameNode and JobTracker Web UIs.
  • Understand how to monitor cluster daemons.
  • Identify and monitor CPU usage on master nodes.
  • Describe how to monitor swap and memory allocation across all nodes.
  • Identify how to view and manage Hadoop's log files.
  • Interpret a log file.
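
Interpreting a log line can be practiced against Log4j's usual layout for Hadoop daemons, `%d{ISO8601} %p %c: %m`. The sample line below is fabricated for illustration, and the exact pattern should be checked against your cluster's `log4j.properties`.

```python
import re

# Parse a Hadoop daemon log line in the common Log4j layout:
# "<ISO8601 timestamp> <LEVEL> <logger class>: <message>".
# The sample line is invented for illustration.

LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<logger>\S+): "
    r"(?P<message>.*)$"
)

line = ("2015-03-10 14:02:51,873 WARN "
        "org.apache.hadoop.hdfs.server.namenode.NameNode: "
        "Low on available disk space")

entry = LOG_PATTERN.match(line).groupdict()
print(entry["level"])    # WARN
print(entry["logger"])   # org.apache.hadoop.hdfs.server.namenode.NameNode
```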

Requirements

  • Fundamental Linux administration skills
  • Basic programming skills

Duration: 35 Hours
