Data Engineering Duke Fall 2023-2024


Data Engineering Course Syllabus - Fall 2024

Course Description

Data Engineering is a crucial discipline in our increasingly data-centric world. As preparation for a career in data and machine learning, this class equips you with the skills necessary to undertake software engineering tasks in a challenging, high-pressure technology environment. The curriculum is designed to mirror real-world job scenarios and requires a substantial time commitment from you.

Throughout the course, you will master the principles of data engineering, which include understanding diverse data types, data storage and management, and data processing to extract valuable insights. You’ll gain hands-on experience with various data engineering tools and technologies, such as cloud-based data platforms, big data processing frameworks, and data visualization tools. Moreover, you will hone your skills in building and deploying data pipelines, extracting data from diverse sources, transforming data into a desired format, and loading data into a target system.

Effective teamwork is a cornerstone in data engineering, considering the diverse skill sets and expertise within a team. Consequently, this course emphasizes developing communication, collaboration, and conflict-resolution skills. Furthermore, since the field of data engineering is continuously evolving, the ability to swiftly learn and adopt new technologies and techniques is a key learning objective of the course.

To augment the learning experience, the course will engage you in weekly demos, an industry-standard practice that enhances your metacognitive abilities. By understanding what you know and identifying areas for improvement, you can fast-track your path to mastery in real-world software engineering.

One of the unique aspects of the course is the emphasis on AI Pair Programming. This approach is designed to raise the difficulty level of projects while mitigating the risk of errors and side-effects. You can leverage AI Pair Programming to develop complex projects and enhance your learning experience by applying robust DevOps automation, critical thinking skills, and effective teamwork.

Upon completing the course, you’ll have an impressive portfolio, including five substantial projects and 15 mini-projects, which will attest to your readiness to prospective employers. This intensive journey is supported by mentorship from faculty and TAs at a world-class institution, ensuring you’re not alone as you tackle these challenges.

Learning Objectives

  • Understand the principles of data engineering, including different data types, storage, management, and data processing for valuable insights.
  • Learn to use data engineering tools and technologies, including cloud-based data platforms, big data processing frameworks, and data visualization tools.
  • Develop skills to build and deploy data pipelines, including extracting data from diverse sources, transforming data into the desired format, and loading data into a target system.
  • Cultivate practical teamwork skills to work in a team environment, mastering effective communication, collaboration, and conflict resolution.
  • Gain proficiency in learning new things quickly to stay up-to-date in the ever-evolving field of data engineering.
  • Enhance your understanding and application of AI Pair Programming, DevOps automation, and critical thinking skills in software engineering projects.

Reading Material

Media and Labs

  • Coursera
  • AWS Credits
  • Gitlab Codespaces
  • AWS Academy Learner Labs
  • Coursera-Rust Bootcamp (Live: 9/096/2023)
  • MySQL for Data Engineering

Weekly Schedule

Section One: DevOps

Week One: Introduction to Cloud Infrastructure and Teamwork

Week Description: The first week serves as an introduction to the essentials of cloud infrastructure, which will form the foundation for all your future cloud-based projects. Alongside this, we’ll delve into understanding the fundamental principles of teams and teamwork.

Media and Readings:

  • Reading: Chapter One, Teamwork Book: Toward Understanding Teams and Teamwork
  • Reading: Chap. 6, Developing on AWS with C#: DevOps
  • Coursera: Week 5: Applying DevOps Principles

Assignments:

  • Weekly Mini-project 1: Create a Python Gitlab template you will use for the rest of the class (.devcontainer, Makefile, Gitlab Actions, requirements.txt, README.md)
  • Prepare for the in-class discussion based on the media and readings, and complete the individual discussion spreadsheet

Week Two: Goal Setting and Effective Technical Communication

Week Description: This week focuses on the importance of setting clear, ambitious goals as a part of effective teamwork, and how to develop and refine communication skills specifically for technical contexts.

Media and Readings:

  • Reading: Chapter Two, Teamwork Book: A Clear, Elevating Goal
  • Coursera: Week 2: Developing Effective Technical Communication

Assignments:

  • Weekly Mini-project 2: Pandas Descriptive Statistics Script
  • Prepare for the in-class discussion based on the media and readings, and complete the individual discussion spreadsheet
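
As a starting point, the descriptive-statistics script could look like the following sketch. The column names and sample data here are illustrative, not part of the assignment; a real submission would load its own dataset.

```python
import pandas as pd


def describe(df: pd.DataFrame) -> pd.DataFrame:
    """Return mean, median, and standard deviation for numeric columns."""
    stats = df.describe().loc[["mean", "50%", "std"]]
    return stats.rename(index={"50%": "median"})


if __name__ == "__main__":
    # Illustrative sample data; a real script would read a CSV instead
    df = pd.DataFrame({"score": [80, 90, 100], "hours": [2, 4, 6]})
    print(describe(df))
```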

Week Three: Results-Driven Structure and Cloud Onboarding

Week Description: In the third week, we will focus on the importance of structuring a team in a way that is driven by results, as well as an introduction to cloud onboarding processes, including its challenges and best practices.

Media and Readings:

  • Reading: Chapter Three, Teamwork Book: Results-Driven Structure
  • Coursera: Week 3: Exploring Cloud Onboarding
  • Digital Rights of Humans: Duke-Fall-2024-digital-rights-humans

Assignments:

  • Individual Project #1 Due
  • Weekly Mini-project 3: Polars Descriptive Statistics Script
  • Prepare for the in-class discussion based on the media and readings, and complete the individual discussion spreadsheet

Week Four: Competence and DevOps Principles

Week Description: The fourth week focuses on the importance of having competent team members and how their skills can significantly contribute to achieving team goals. Also, we’ll start exploring DevOps principles, an integral part of modern cloud computing practices.

Media and Readings:

  • Reading: Chapter Four, Teamwork Book: Competent Team Members
  • Coursera: Week 5: Applying DevOps Principles

Assignments:

  • Weekly Mini-project 4: Create a Gitlab Actions matrix build that tests more than one version of Python.
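
A matrix build of this kind might be sketched as below, assuming GitLab CI's `parallel: matrix` keyword; the image tags and the `make test` target are placeholders you would adapt to your template repo.

```yaml
# .gitlab-ci.yml — run the same test job once per Python version
test:
  image: python:${PYTHON_VERSION}
  parallel:
    matrix:
      - PYTHON_VERSION: ["3.10", "3.11"]
  script:
    - pip install -r requirements.txt
    - make test
```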

Section Two: Building Tools with SQL, Rust, and Python

Week Five: Python Scripting, SQL, and Fostering Unified Commitment

Week Description: This week, you’ll learn about Python scripting and SQL, integral tools for building robust data applications. We’ll also discuss the importance of fostering a unified commitment within a team.

Media and Readings:

  • Coursera: Week 2: Python Scripting and SQL
  • Reading: Chapter Five, Teamwork Book: Unified Commitment
  • Reading: One size database doesn’t fit anyone
  • Reading: Understanding Availability
  • Reading: Learning MySQL, 2nd Edition, Chapter 6: Transactions and Locking

Assignments:

  • Mini-Project: Create a Python script that interacts with a SQL database.
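
A minimal version of this script, using Python's built-in `sqlite3` module as a stand-in for a SQL database (the table and rows are invented for illustration):

```python
import sqlite3


def run(conn: sqlite3.Connection) -> list:
    """Create a table, insert rows, and read them back sorted."""
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)")
    cur.executemany("INSERT INTO users (name) VALUES (?)", [("ada",), ("grace",)])
    conn.commit()
    return cur.execute("SELECT name FROM users ORDER BY name").fetchall()


if __name__ == "__main__":
    print(run(sqlite3.connect(":memory:")))  # [('ada',), ('grace',)]
```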

Week Six: Advanced SQL and Promoting a Collaborative Climate

Week Description: This week is about mastering advanced SQL techniques and understanding how a collaborative climate can significantly impact teamwork.

Media and Readings:

  • Coursera: Week 4: Working with MySQL
  • Reading: Chapter Six, Teamwork Book: Collaborative Climate
  • Reading: SQL Pocket Guide, 4th Edition, Chapter 1: SQL Crash Course

Assignments:

  • Mini-Project: Design a complex SQL query for a MySQL database and explain the results.
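
One possible shape for such a query is a join with grouping and a `HAVING` filter. The sketch below runs it against SQLite for portability; the schema, data, and revenue threshold are made up, but the SQL itself is standard enough to run on MySQL as well.

```python
import sqlite3

# Customers with more than 100 in total order revenue, highest first
QUERY = """
SELECT c.name, COUNT(o.id) AS orders, SUM(o.total) AS revenue
FROM customers c
JOIN orders o ON o.customer_id = c.id
GROUP BY c.name
HAVING SUM(o.total) > 100
ORDER BY revenue DESC
"""


def top_customers(conn: sqlite3.Connection) -> list:
    cur = conn.cursor()
    cur.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
        INSERT INTO customers VALUES (1, 'ada'), (2, 'grace');
        INSERT INTO orders VALUES (1, 1, 120.0), (2, 1, 30.0), (3, 2, 40.0);
    """)
    return cur.execute(QUERY).fetchall()


if __name__ == "__main__":
    print(top_customers(sqlite3.connect(":memory:")))  # [('ada', 2, 150.0)]
```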


Week Seven: Python Packaging, Command Line Tools, and Upholding Standards of Excellence

Week Description: This week introduces Python packaging and the use of command-line tools. We will also discuss the importance of setting and upholding standards of excellence within a team.

Media and Readings:

  • Coursera: Week 4: Python Packaging and Command Line Tools
  • Reading: Chapter Seven, Teamwork Book: Standards of Excellence

Assignments:

  • Mini-Project: Package a Python script into a command-line tool and write a user guide.
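
The command-line wrapper might be sketched with `argparse` from the standard library; to install it as a console command you would then point a `[project.scripts]` entry in `pyproject.toml` at `main`. The tool name and statistics here are illustrative.

```python
import argparse
import statistics


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="stats", description="Summarize a list of numbers")
    parser.add_argument("numbers", nargs="+", type=float, help="numbers to summarize")
    parser.add_argument("--stat", choices=["mean", "median"], default="mean")
    return parser


def main(argv=None) -> float:
    args = build_parser().parse_args(argv)
    fn = statistics.mean if args.stat == "mean" else statistics.median
    result = fn(args.numbers)
    print(result)
    return result


if __name__ == "__main__":
    main()
```

Invoked as `stats 1 2 10 --stat median` once packaged, or directly via `python stats.py …` during development.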

Week Eight: Transitioning from Python to Rust for MLOps and the Role of External Support and Recognition

Week Description: This week delves into the practical aspects of transitioning from Python to Rust in MLOps. We will also examine the role of external support and recognition in a team’s success.

Media and Readings:

  • Coursera: Week 5: Rust for MLOps: The Practical Transition from Python to Rust
  • Reading: Chapter Eight, Teamwork Book: External Support and Recognition

Assignments:

  • Mini-Project: Rewrite a Python script for data processing in Rust, highlighting the improvements in speed and resource usage.
  • Individual Project #2: Rust CLI Binary with SQLite

Section Three: Building Data Pipelines

Week Nine: Cloud-Hosted Notebooks and Principled Leadership

Week Description: This week introduces you to the concept of cloud-hosted notebooks and their application in data management. Concurrently, we will discuss the role of principled leadership in successful teamwork.

Media and Readings:

  • Coursera: Week 2: Cloud-Hosted Notebooks
  • Reading: Chapter Nine, Teamwork Book: Principled Leadership
  • Reading: Chapter Ten, Teamwork Book: Inside Management Teams

Assignments:

  • Mini-project 9: Set up a cloud-hosted notebook and demonstrate data manipulation with a sample dataset.
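
In the notebook, the data-manipulation demo can be as small as a filter plus a group-by, for example with pandas. The city names and the -999 sentinel value below are invented for illustration.

```python
import pandas as pd


def average_temp(df: pd.DataFrame) -> pd.DataFrame:
    """Drop sentinel readings, then average temperature per city."""
    kept = df[df["temp"] > -100]
    return kept.groupby("city", as_index=False)["temp"].mean()


readings = pd.DataFrame({
    "city": ["durham", "durham", "raleigh", "raleigh"],
    "temp": [20.0, 30.0, -999.0, 10.0],  # -999 marks a missing reading
})
```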

Week Ten: Introduction to PySpark and Innovation in Energy and Public Health

Week Description: This week serves as an introduction to PySpark, a powerful tool for large-scale data processing. Also, we will explore the innovations in the energy and public health sectors.

Media and Readings:

  • Coursera: Week 1: Overview and Introduction to PySpark
  • Reading: Intro, Chap. 1, 2: How Innovation Works (Energy, Public Health)

Assignments:

  • Mini-project 10: Use PySpark to perform data processing on a large dataset.

Week Eleven: Using the Databricks Platform

Week Description: This week, you’ll work with the Databricks platform, designed for massive-scale data engineering and collaborative data science.

Media and Readings:

  • Coursera: AWS Databricks and MLFlow
  • Reading: Chap. 3, 4, 5: How Innovation Works (Transport, Food, Low-technology innovation)

Assignments:

  • Mini-project 11: Create a data pipeline using the Databricks platform.

Week Twelve: Introduction to MLflow

Media and Readings:

  • Coursera: Week 1: Introduction to MLflow
  • Reading: Chap. 6, 7: How Innovation Works (Prehistoric innovation, Innovation’s essentials)

Assignments:

  • Mini-project 12: Use MLflow to manage a simple machine learning project.
  • Individual Project #3: Databricks ETL (Extract Transform Load) Pipeline

Section Four: Building Containerized Serverless Data Pipelines

Week Thirteen: Virtualization, Containers, and Understanding Innovation Failures

Week Description: This week, we dive into virtualization and containers, crucial components in building serverless data pipelines. We’ll also explore various reasons why some innovations fail.

Media and Readings:

  • Coursera: Week 2: Virtualization and Containers
  • Reading: Chap. 9, 10: How Innovation Works (Fakes, frauds, fads, and failures)

Assignments:

  • Mini-project 13: Build and deploy a simple containerized application using Docker.
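
A containerized app of this size usually needs only a few Dockerfile lines. This sketch assumes a `main.py` entry point and a `requirements.txt`, both placeholders for your own project layout.

```dockerfile
# Small Python base image; copy dependencies first so the pip layer caches
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "main.py"]
```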

Week Fourteen: Python Microservices, Resistance to Innovation

Week Description: You will learn about Python microservices, an architectural style that structures an application as a collection of loosely coupled services. We’ll also understand the resistance to innovation and its impacts.

Media and Readings:

  • Coursera: Week 3: Python Microservices
  • Reading: Chap. 11, 12: How Innovation Works (Resistance to innovation, An innovation famine)

Assignments:

  • Mini-project 14: Develop a simple microservice using Python and deploy it to a cloud platform
  • Present Final Project in Class

Course Grade

Grade Breakdown:

  • Individual Projects: 25%
    • Project 1: Continuous Integration using Gitlab Actions of Python Data Science Project (6.25%)
    • Project 2: Rust CLI Binary with SQLite (6.25%)
    • Project 3: Databricks ETL (Extract Transform Load) Pipeline (6.25%)
    • Project 4: Auto Scaling Flask App Using Any Platform As a Service (6.25%)
  • Team Project: 25%
  • Class Discussion Grade: 25%
  • Mini-projects: 25%

Letter Grades

  • A+: 97-100%
  • A: 93-96%
  • A-: 90-92%
  • B+: 87-89%
  • B: 83-86%
  • B-: 80-82%
  • C+: 77-79%
  • C: 73-76%
  • C-: 70-72%
  • D+: 67-69%
  • D: 63-66%
  • D-: 60-62%
  • F: Below 60%

Individual Projects

Project #1: Continuous Integration using Gitlab Actions of Python Data Science Project

Requirements:

  • Jupyter Notebook with:
    • Cells that perform descriptive statistics using Polars or Pandas
    • Tested by using the nbval plugin for pytest
  • Python script performing the same descriptive statistics using Polars or Pandas
  • lib.py file that shares the common code between the script and notebook
  • Makefile with commands to:
    • Run all tests (must test notebook, script, and lib)
    • Format code with Python black
    • Lint code with Ruff
    • Install code via: pip install -r requirements.txt
  • test_script.py to test the script
  • test_lib.py to test the library
  • Pinned requirements.txt
  • Gitlab Actions performs all four Makefile commands with badges for each one in the README.md
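
The four required Makefile commands might be sketched like this; the file names follow the requirements above, while the exact tool flags are a suggestion rather than part of the spec.

```makefile
install:
	pip install -r requirements.txt

format:
	black *.py

lint:
	ruff check *.py

test:
	python -m pytest --nbval *.ipynb test_script.py test_lib.py
```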

Project #2: Rust CLI Binary with SQLite

Requirements:

  • Rust source code demonstrating comprehensive understanding of Rust’s syntax and unique features
  • Use of Gitlab Copilot (explained in README)
  • SQLite database with CRUD operations
  • Optimized Rust binary as a Gitlab Actions artifact
  • README.md explaining the project, dependencies, how to run it, and the use of Gitlab Copilot
  • Gitlab Actions workflow for testing, building, and linting
  • Video demo linked in README.md

Project #3: Databricks ETL (Extract Transform Load) Pipeline

Requirements:

  • Well-documented Databricks notebook performing ETL operations
  • Usage of Delta Lake for data storage
  • Usage of Spark SQL for data transformations
  • Error handling and data validation
  • Visualization of transformed data
  • Automated trigger to initiate the pipeline
  • README.md explaining the project, dependencies, how to run it, and including actionable recommendations
  • Video demo linked in README.md

Project #4: Auto Scaling Flask App Using Any Platform As a Service

Requirements:

  • README.md file explaining the project, dependencies, how to run it, and including recommendations
  • Complete Gitlab repo with all required scripts and documentation
  • Flask app (functioning within Docker/Platform)
  • Use of DockerHub (or equivalent)
  • AWS Web App (or equivalent) deployment
  • Video demo linked in README.md

Team Project

Requirements:

  • Microservice interfacing with a data pipeline (Python or Rust)
  • Load test handling 10,000 requests per second
  • Data engineering library usage
  • Infrastructure as Code (IaC) solution
  • CI/CD pipeline implementation
  • Comprehensive README.md
  • Architectural diagram
  • Gitlab configurations (Actions, .devcontainer, build badges)
  • Teamwork reflection (individual submissions)
  • Quantitative assessment of reliability and stability
  • Demo video

Class Discussion

  • 25% of Grade
  • In-Person Attendance Mandatory
  • Self-reported participation in Google sheet
  • Final self-reported Google sheet must have approximately a 50/50 mix of noting good points made by other students and mentioning points made by oneself in class
  • The final grade will follow the Nick Eubanks rubric, and the TAs and instructor will validate the self-grading
  • If not called upon in class, students should put their comments in Canvas AFTER CLASS, reflecting what actually happened in class

Weekly Mini-Project

  • 25% of Grade
  • 80% Python / 20% Rust (core languages)
  • Rubric:
    • Must pass lint/test/format in Gitlab Actions and proven via Gitlab badge for each of these three actions
    • Small Tool or service: CLI/Serverless/Microservice
    • Average time 1-2 hours max

Grading Rubric for Weekly Mini-Project

  • Project Development (25 points)
    • Appropriate project scope and complexity: 10 points
    • Functionality of the tool or service: 15 points
  • Language Use (25 points)
    • Correct and efficient use of Python in Python-only projects: 15 points
    • Correct and efficient use of Rust in Rust-only projects: 10 points
  • Linting, Testing, and Formatting (25 points)
    • Passing lint check: 8 points
    • Passing tests: 8 points
    • Proper code formatting: 9 points
  • Gitlab README and Submission (25 points)
    • Clear and accurate README with badges: 15 points
    • Proper submission on Microsoft Teams (including pasted version of README): 10 points

Total: 100 points

Four Individual Projects

  • 25% of Grade

Group Project

  • 25% of Grade
  • Rubric:
    • Group Feedback Session Required as Part of Final Writeup
    • At the end of the class, each team member is required to share three positives and three areas to improve on, during a Zoom or in-person meeting
    • Each student then writes a reflection statement about what they learned they could improve during the final project submission
  • The final project must include the following criteria to pass:
    • Demonstrated load-test
    • README explains project, and includes architectural diagram
    • Must pass test/lint/format in Gitlab Actions and have badges proving it
    • Must continuously deploy
