Data Engineering using AWS Analytics Services

Build Data Engineering Pipelines using AWS Analytics Services such as Glue, EMR, Athena, Kinesis, QuickSight, and more.

Data Engineering is all about building Data Pipelines that move data from multiple sources into a Data Lake or Data Warehouse, and then from the Data Lake or Data Warehouse to downstream systems. As part of this course, I will walk you through building Data Engineering Pipelines using the AWS Analytics Stack, which includes services such as Glue, Elastic MapReduce (EMR), Lambda Functions, Athena, QuickSight, and many more.

What you’ll learn

  • Data Engineering leveraging AWS Analytics features.
  • Managing Tables using Glue Catalog.
  • Engineering Batch Data Pipelines using Glue Jobs.
  • Orchestrating Batch Data Pipelines using Glue Workflows.
  • Running Queries using Athena – a serverless query engine service.
  • Using AWS Elastic MapReduce (EMR) Clusters for building Data Pipelines.
  • Using AWS Elastic MapReduce (EMR) Clusters for reports and dashboards.
  • Data Ingestion using Lambda Functions.
  • Scheduling using EventBridge.
  • Engineering Streaming Pipelines using Kinesis.
  • Streaming Web Server logs using Kinesis Firehose.

Course Content

  • Introduction to the course –> 1 lecture • 6min.
  • Setup Local Environment for Practice –> 10 lectures • 16min.
  • Setup Environment for Practice using Cloud9 –> 11 lectures • 44min.
  • AWS Getting Started –> 10 lectures • 26min.
  • Storage – All about AWS S3 (Simple Storage Service) –> 9 lectures • 55min.
  • User Level Security – Managing Users, Roles and Policies using IAM –> 8 lectures • 54min.
  • Infrastructure – AWS EC2 (Elastic Compute Cloud) Basics –> 10 lectures • 1hr 3min.
  • Infrastructure – AWS EC2 Advanced –> 6 lectures • 32min.
  • Data Ingestion using Lambda Functions –> 15 lectures • 1hr 41min.
  • Development Lifecycle for PySpark –> 12 lectures • 1hr 4min.
  • Overview of Glue Components –> 9 lectures • 49min.
  • Setup Spark History Server for Glue Jobs –> 6 lectures • 22min.
  • Deep Dive into Glue Catalog –> 9 lectures • 43min.
  • Exploring Glue Job APIs –> 6 lectures • 27min.
  • Glue Job Bookmarks –> 10 lectures • 31min.
  • Streaming Pipeline using Kinesis –> 10 lectures • 1hr.
  • Consuming Data from S3 using boto3 –> 9 lectures • 46min.
  • Populating GitHub Data to DynamoDB –> 12 lectures • 58min.

Requirements

  • Programming experience using Python.
  • Data Engineering experience using Spark.
  • Ability to write and interpret SQL Queries.
  • This course is ideal for experienced data engineers who want to add AWS Analytics Services to their skill set.

Here are the high-level steps which you will follow as part of the course.

  • Setup Development Environment
  • Getting Started with AWS
  • Development Lifecycle for PySpark
  • Overview of Glue Components
  • Setup Spark History Server for Glue Jobs
  • Deep Dive into Glue Catalog
  • Exploring Glue Job APIs
  • Glue Job Bookmarks
  • Data Ingestion using Lambda Functions
  • Streaming Pipeline using Kinesis
  • Consuming Data from S3 using boto3
  • Populating GitHub Data to DynamoDB
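The Kinesis step above pushes web-server logs into a stream. A minimal sketch of the client-side batching involved (the stream name and partition key are illustrative, not from the course): Kinesis `PutRecords` accepts at most 500 records per call, so log lines are shaped into record dicts and chunked before sending.

```python
# Sketch of preparing web-server log lines for Kinesis PutRecords.
# Stream and partition-key names are illustrative assumptions.

MAX_BATCH = 500  # PutRecords accepts at most 500 records per call


def to_kinesis_records(log_lines, partition_key="web-logs"):
    """Shape raw log lines into the record dicts PutRecords expects."""
    return [
        {"Data": (line.rstrip("\n") + "\n").encode("utf-8"),
         "PartitionKey": partition_key}
        for line in log_lines
    ]


def batches(records, size=MAX_BATCH):
    """Yield record batches no larger than the PutRecords limit."""
    for i in range(0, len(records), size):
        yield records[i:i + size]

# With AWS credentials configured, each batch would then be sent with:
#   boto3.client("kinesis").put_records(StreamName="web-logs", Records=batch)
```

Batching client-side keeps each API call within service limits and amortizes the per-request overhead when streaming high-volume logs.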

Getting Started with AWS

  • Introduction – AWS Getting Started
  • Create S3 Bucket
  • Create IAM Group and User
  • Overview of Roles
  • Create and Attach Custom Policy
  • Configure and Validate AWS CLI
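The "Create and Attach Custom Policy" step boils down to writing an IAM policy document. A minimal sketch of such a document, granting read/write access to a single S3 bucket (the bucket name is a placeholder; in the course you use your own):

```python
import json

# Hypothetical bucket name for illustration.
BUCKET = "my-data-engineering-bucket"

# A minimal custom policy granting list plus object read/write on one
# S3 bucket, as you would attach to the IAM group created above.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ListBucket",
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{BUCKET}"],
        },
        {
            "Sid": "ReadWriteObjects",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": [f"arn:aws:s3:::{BUCKET}/*"],
        },
    ],
}

policy_json = json.dumps(policy, indent=2)
# This JSON can be pasted into the IAM console or passed via
#   aws iam create-policy --policy-document file://policy.json
```

Note that bucket-level actions (`s3:ListBucket`) target the bucket ARN, while object-level actions target `bucket/*`; mixing these up is a common reason policies silently fail.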

Development Lifecycle for PySpark

  • Setup Virtual Environment and Install PySpark
  • Getting Started with PyCharm
  • Passing Runtime Arguments
  • Accessing OS Environment Variables
  • Getting Started with Spark
  • Create Function for Spark Session
  • Setup Sample Data
  • Read Data from Files
  • Process Data using Spark APIs
  • Write Data to Files
  • Validate Data Written to Files
  • Productionizing the Code
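The lifecycle above can be sketched as a small job skeleton: parse runtime arguments, let OS environment variables override them, create a Spark session, then read, process, and write. This is an illustrative sketch, not the course's actual code; the paths and the `SRC_DIR` variable name are assumptions, and the PySpark import is kept local so the helpers can be defined even where PySpark is not installed.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse runtime arguments (environment and source directory)."""
    parser = argparse.ArgumentParser(description="Sample PySpark job")
    parser.add_argument("--env", default="dev", choices=["dev", "prod"])
    parser.add_argument("--src-dir", default="data/retail_db")
    return parser.parse_args(argv)


def get_spark(app_name="sample-job"):
    """Create a SparkSession; requires `pip install pyspark`."""
    from pyspark.sql import SparkSession
    return SparkSession.builder.appName(app_name).getOrCreate()


def run(argv=None):
    """Read, process, and write, following the lifecycle steps above."""
    args = parse_args(argv)
    # OS environment variables can override runtime arguments.
    src_dir = os.environ.get("SRC_DIR", args.src_dir)
    spark = get_spark()
    df = spark.read.csv(src_dir, header=True, inferSchema=True)
    df.write.mode("overwrite").parquet(f"output/{args.env}")
```

Separating argument parsing from the job body is what makes "Productionizing the Code" straightforward: the same script runs unchanged in dev and prod with different arguments.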

Overview of Glue Components

  • Introduction – Overview of Glue Components
  • Create Crawler and Catalog Table
  • Analyze Data using Athena
  • Create S3 Bucket and Role
  • Create and Run the Glue Job
  • Validate using Glue Catalog Table and Athena
  • Create and Run Glue Trigger
  • Create Glue Workflow
  • Run Glue Workflow and Validate
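The trigger and workflow steps above map directly onto Glue's API. A minimal sketch, assuming hypothetical workflow and job names, of building the definition for a trigger that starts a job inside a workflow; the dict matches the keyword arguments of boto3's `glue.create_trigger`.

```python
def workflow_trigger(workflow_name, job_name, schedule=None):
    """Build a Glue trigger definition that starts a job in a workflow.

    With no schedule an ON_DEMAND trigger is returned; with a cron
    expression a SCHEDULED trigger is returned.
    """
    trigger = {
        "Name": f"{workflow_name}-start-{job_name}",
        "WorkflowName": workflow_name,
        "Actions": [{"JobName": job_name}],
    }
    if schedule:
        trigger["Type"] = "SCHEDULED"
        trigger["Schedule"] = schedule  # e.g. "cron(0 2 * * ? *)"
        trigger["StartOnCreation"] = True
    else:
        trigger["Type"] = "ON_DEMAND"
    return trigger

# With AWS credentials configured, the workflow would be wired up with:
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_workflow(Name="daily-ingest")
#   glue.create_trigger(**workflow_trigger("daily-ingest", "sample-job"))
#   glue.start_workflow_run(Name="daily-ingest")
```

Attaching triggers to a workflow, rather than scheduling jobs individually, is what lets Glue track the whole pipeline as one run that you can validate end to end.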