Data Engineer

roadmap.sh: https://roadmap.sh/data-engineer

Suggested path through the Data Engineer nodes. Each node links to its lesson when written.

Nodes

Introduction

  • What is Data Engineering
  • Data Engineering vs Data Science
  • The Data Engineering Lifecycle
  • Skills and Responsibilities
  • Choosing the Right Technologies
  • Best Practices

Programming Skills

  • Python
  • Java
  • Scala
  • Go
  • Learn SQL
  • Data Structures and Algorithms
  • Git and GitHub
  • Linux Basics
  • APIs

Databases & Storage

  • Database Fundamentals
  • Relational Databases
  • NoSQL Databases
  • OLTP vs OLAP
  • Transactions
  • Indexing
  • Data Normalization
  • Data Modelling Techniques

Relational Databases

  • PostgreSQL
  • MySQL
  • MariaDB
  • MS SQL
  • Oracle

NoSQL Databases

  • Key-Value
  • Document
  • Column
  • Graph
  • MongoDB
  • Cassandra
  • HBase
  • CouchDB
  • Redis
  • Memcached
  • Elasticsearch
  • Neo4j

Data Warehousing

  • What is a Data Warehouse
  • Data Warehousing Architectures
  • Data Mart
  • Data Lake
  • Data Hub
  • Star vs Snowflake Schema
  • Slowly Changing Dimension (SCD)
  • Snowflake
  • Amazon Redshift
  • Google BigQuery
  • Databricks Delta Lake

Data Pipelines & ETL

  • Data Pipelines
  • Data Ingestion
  • Types of Data Ingestion (Batch / Streaming / Realtime)
  • Sources of Data
  • Data Collection Considerations
  • Extract Data
  • Transform Data
  • Load Data
  • ETL vs Reverse ETL
  • Reverse ETL
  • Reverse ETL Use Cases
  • Job Scheduling
  • Idempotency

ETL & Orchestration Tools

  • Apache Airflow
  • dbt
  • Luigi
  • Prefect
  • AWS Glue ETL
  • Azure Data Factory (ETL)
  • GCP Dataflow
  • Hightouch
  • Census
  • Segment

Big Data & Distributed Systems

  • Big Data Tools
  • Distributed Systems Basics
  • CAP Theorem
  • Distributed File Systems
  • Cluster Computing Basics
  • What is Cluster Computing
  • Cluster Management Tools
  • Horizontal vs Vertical Scaling
  • Apache Hadoop YARN
  • HDFS
  • MapReduce
  • Apache Spark

Messaging & Streaming

  • Messaging Systems
  • Messages vs Streams
  • Async vs Sync Communication
  • Apache Kafka
  • RabbitMQ
  • AWS SQS
  • AWS SNS
  • Streaming
  • Batch
  • Realtime

Data Quality, Governance & Privacy

  • Data Quality
  • Data Lineage
  • Metadata Management
  • Metadata-First Architecture
  • Data Mesh
  • Data Fabric
  • Data Interoperability
  • Data Masking
  • Data Obfuscation
  • Encryption
  • Tokenization
  • Authentication vs Authorization
  • GDPR
  • ECPA
  • EU AI Act

Analytics, BI & Serving

  • Data Serving
  • Data Analytics
  • Business Intelligence
  • AB Testing
  • Microsoft Power BI
  • Tableau
  • Looker
  • Streamlit
  • Machine Learning
  • MLOps

Cloud Computing

  • Cloud Computing
  • Cloud Architectures
  • Serverless Options
  • Hybrid Cloud
  • Amazon EC2 (Compute)
  • GCP Compute Engine
  • Azure Virtual Machines
  • S3 Storage
  • Google Cloud Storage
  • Azure Blob Storage
  • Amazon RDS
  • Aurora DB
  • Cloud SQL Database
  • Azure SQL Database
  • DynamoDB
  • Bigtable
  • CosmosDB
  • Neptune

DevOps & Infrastructure

  • Containers & Orchestration
  • Docker
  • Kubernetes
  • AWS EKS
  • Google Cloud GKE
  • Infrastructure as Code (IaC)
  • Terraform
  • OpenTofu
  • AWS CDK
  • Google Deployment Manager
  • CI/CD
  • GitHub Actions
  • GitLab CI
  • CircleCI
  • ArgoCD
  • Networking Fundamentals
  • Declarative vs Imperative

Monitoring & Testing

  • Monitoring
  • Logs
  • Prometheus
  • Datadog
  • New Relic
  • Sentry
  • Testing
  • Unit Testing
  • Integration Testing
  • Functional Testing
  • End-to-End Testing
  • Load Testing
  • Smoke Testing
  • Reusability
  • Environmental Management

Resources

See resources.md.

Project ideas

  • Batch ELT pipeline: ingest a public dataset (e.g. NYC taxi trips) into object storage, model it with dbt into a star schema in a warehouse (BigQuery/Snowflake/Postgres), and orchestrate the whole DAG with Airflow. Wire in data-quality tests and lineage. See playgrounds/.
  • Real-time streaming aggregator: produce synthetic events to Kafka, process them with Spark Structured Streaming (windowed aggregations), and sink results to a serving store (Redis/Postgres) with a small Streamlit dashboard on top.
  • Self-hosted lakehouse on containers: stand up MinIO + Spark + Delta Lake with Docker Compose, provisioned via Terraform, and load incremental data with SCD Type-2 handling; add Prometheus/Grafana monitoring for the jobs.

1 item under this folder.