Course Outline

Introduction:

  • Apache Spark within the Hadoop ecosystem
  • Brief overview of Python and Scala

Basics (theory):

  • Architecture
  • RDDs
  • Transformations and Actions (see the sketch after this list)
  • Stages, Tasks, and Dependencies
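
To make these ideas concrete, here is a minimal PySpark sketch (Python is used for all sketches in this outline, since the course targets Python or Scala; it assumes a local Spark installation, and all names are illustrative rather than course material):

    from pyspark.sql import SparkSession

    # Build a local session; in the workshop a Databricks cluster provides this.
    spark = SparkSession.builder.master("local[*]").appName("rdd-basics").getOrCreate()
    sc = spark.sparkContext

    numbers = sc.parallelize(range(1, 11))      # create an RDD

    # Transformations only record lineage (a DAG of dependencies); nothing runs yet.
    evens = numbers.filter(lambda n: n % 2 == 0)
    squares = evens.map(lambda n: n * n)

    # An action triggers a job, which Spark splits into stages and tasks.
    print(squares.collect())                    # [4, 16, 36, 64, 100]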

Hands-on workshop: Using the Databricks environment to grasp the basics:

  • Exercises using the RDD API (see the PairRDD sketch after this list)
      • Basic transformation and action functions
      • PairRDDs
      • Joins
      • Caching strategies
  • Exercises using the DataFrame API (see the DataFrame sketch after this list)
      • Spark SQL
      • DataFrame operations: select, filter, group, sort
      • UDFs (User Defined Functions)
  • Exploring the Dataset API
  • Streaming
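
For the RDD exercises, a short PairRDD sketch (it reuses the SparkContext `sc` from the earlier example, and the data is invented):

    # Key-value RDDs: (customer_id, amount) and (customer_id, name).
    orders = sc.parallelize([(1, 99.0), (2, 15.5), (1, 42.0)])
    names = sc.parallelize([(1, "Ada"), (2, "Grace")])

    # join matches pairs by key, yielding (customer_id, (amount, name)).
    joined = orders.join(names)
    joined.cache()    # caching pays off only if several actions reuse the result

    print(joined.collect())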
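And a DataFrame sketch covering select/filter/group/sort, Spark SQL, and a UDF (reusing `spark` from above; the column and table names are made up):

    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    df = spark.createDataFrame(
        [("Ada", 99.0), ("Grace", 15.5), ("Ada", 42.0)],
        ["name", "amount"])

    # select / filter / group / sort with the DataFrame API.
    (df.filter(F.col("amount") > 20)
       .groupBy("name")
       .agg(F.sum("amount").alias("total"))
       .orderBy(F.desc("total"))
       .show())

    # The same query expressed in Spark SQL.
    df.createOrReplaceTempView("orders")
    spark.sql("""SELECT name, SUM(amount) AS total
                 FROM orders WHERE amount > 20
                 GROUP BY name ORDER BY total DESC""").show()

    # A simple UDF; a built-in function is usually faster when one exists.
    shout = F.udf(lambda s: s.upper(), StringType())
    df.select(shout(F.col("name")).alias("name_upper")).show()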

Hands-on workshop: Using the AWS environment to understand deployment:

  • Introduction to AWS Glue
  • Understanding the differences between AWS EMR and AWS Glue
  • Example jobs on both platforms (a Glue job sketch follows this list)
  • Review of pros and cons
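
To give a feel for the Glue side, a hypothetical job script (this boilerplate runs only inside the AWS Glue runtime; the database and table names are placeholders):

    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read a table registered in the Glue Data Catalog (names are assumptions).
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="orders")

    # DynamicFrames convert freely to plain Spark DataFrames and back.
    df = dyf.toDF().filter("amount > 0")
    print(df.count())

    job.commit()

On EMR, the same logic would be a plain spark-submit script without the Glue-specific wrappers.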

Extra:

  • Introduction to Apache Airflow orchestration
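
As a taste of orchestration, a minimal Airflow DAG sketch (Airflow 2.x syntax; the dag_id, schedule, and spark-submit path are illustrative):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="spark_daily_etl",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",    # Airflow 2.4+; older versions use schedule_interval
        catchup=False,
    ) as dag:
        # Hypothetical Spark job submitted once a day.
        BashOperator(
            task_id="run_spark_job",
            bash_command="spark-submit /opt/jobs/etl_job.py",
        )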

Requirements

  • Programming skills (preferably Python or Scala)
  • Foundational SQL knowledge

Duration: 21 hours
