Image by Author | DALLE-3 & Canva
Data engineering is growing rapidly, and companies are now hiring more data engineers than data scientists. Operational roles such as data engineering, cloud architecture, and MLOps engineering are in high demand.
As a data engineer, you need to master containerization, infrastructure as code, workflow orchestration, analytics engineering, batch processing, and streaming tools. Beyond these tools, you need to understand cloud infrastructure and managed services like Databricks and Snowflake.
In this blog, we will learn about 10 GitHub repositories that will help you master all core tools and concepts. These GitHub repositories contain courses, experiences, roadmaps, a list of essential tools, projects, and a handbook. All you need to do is bookmark them while learning to become a professional data engineer.
1. Awesome Data Engineering
The Awesome Data Engineering repository contains a list of tools, frameworks, and libraries for data engineering, making it an excellent starting point for anyone looking to dive into the field.
It covers tools for databases, data ingestion, file systems, streaming, batch processing, data lake management, workflow orchestration, monitoring, testing, and charts and dashboards.
Link: igorbarinov/awesome-data-engineering
2. Data Engineering Zoomcamp
Data Engineering Zoomcamp is a complete course that provides a hands-on learning experience in data engineering. You learn new concepts and tools using video tutorials, quizzes, projects, homework, and community-driven assessments.
The Data Engineering Zoomcamp covers:
- Containerization and Infrastructure as Code
- Workflow Orchestration
- Data Ingestion
- Data Warehouse
- Analytics Engineering
- Batch processing
- Streaming
Link: DataTalksClub/data-engineering-zoomcamp
3. The Data Engineering Cookbook
The Data Engineering Cookbook is a collection of articles and tutorials that cover various aspects of data engineering, including data ingestion, data processing, and data warehousing.
The Data Engineering Cookbook includes:
- Basic Engineering Skills
- Advanced Engineering Skills
- Free Hands On Courses / Tutorials
- Case Studies
- Best Practices Cloud Platforms
- 130+ Data Sources Data Science
- 1001 Interview Questions
- Recommended Books, Courses, and Podcasts
Link: andkret/Cookbook
4. Data Engineer Roadmap
The Data Engineer Roadmap repository provides a step-by-step guide to becoming a data engineer. It covers everything from the basics of data engineering to advanced topics like infrastructure as code and cloud computing.
The Data Engineer Roadmap includes:
- CS fundamentals
- Learning Python
- Testing
- Database
- Data Warehouse
- Cluster Computing
- Data Processing
- Messaging
- Workflow Scheduling
- Network
- Infrastructure as Code
- CI/CD
- Data Security and Privacy
Link: datastacktv/data-engineer-roadmap
5. Data Engineering HowTo
Data Engineering HowTo is a beginner-friendly resource for learning data engineering from scratch. It contains a list of tutorials, courses, books, and other resources to help you build a solid foundation in data engineering concepts and best practices. If you’re new to the field, this repository will help you navigate the vast landscape of data engineering with ease.
Data Engineering HowTo includes:
- Useful articles and blogs
- Talks
- Algorithms & Data Structures
- SQL
- Programming
- Databases
- Distributed Systems
- Books
- Courses
- Tools
- Cloud Platforms
- Communities
- Jobs
- Newsletters
Link: adilkhash/Data-Engineering-HowTo
6. Awesome Open Source Data Engineering
Awesome Open Source Data Engineering is a curated list of open-source data engineering tools. It is a goldmine for anyone who wants to contribute to these tools or use them to build real-world data engineering projects, and an excellent resource for exploring alternative data engineering solutions.
The repository includes open-source tools on:
- Analytics
- Business Intelligence
- Data Lakehouse
- Change Data Capture
- Datastores
- Data Governance and Registries
- Data Virtualization
- Data Orchestration
- Formats
- Integration
- Messaging Infrastructure
- Specifications and Standards
- Stream Processing
- Testing
- Monitoring and Logging
- Versioning
- Workflow Management
Link: gunnarmorling/awesome-opensource-data-engineering
7. PySpark Example Project
The PySpark Example Project repository provides a practical example of implementing best practices for PySpark ETL jobs and applications.
PySpark is a popular tool for data processing, and this repository will help you master it. You will learn how to structure your code, handle data transformations, and optimize your PySpark workflows.
The project covers:
- Structure of an ETL Job
- Passing Configuration Parameters to the ETL Job
- Packaging ETL Job Dependencies
- Running the ETL job
- Debugging Spark Jobs
- Automated Testing
- Managing Project Dependencies
Link: AlexIoannides/pyspark-example-project
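To make the idea of a structured ETL job concrete, here is a minimal sketch in plain Python (no Spark required). It is not code from the repository. It only illustrates the general pattern of separate extract, transform, and load steps driven by configuration passed in from outside. All function names, the sample rows, and the `steps_per_floor` configuration key are hypothetical.

```python
import json
from typing import Any, Dict, List

Row = Dict[str, Any]


def extract() -> List[Row]:
    """Return source rows; a real job would read from a file or table."""
    return [
        {"id": 1, "name": "ada", "floor": 3},
        {"id": 2, "name": "grace", "floor": 5},
    ]


def transform(rows: List[Row], steps_per_floor: int) -> List[Row]:
    """Pure function with no I/O, so it can be unit tested in isolation."""
    return [
        {
            "id": row["id"],
            "name": row["name"].title(),
            "steps": row["floor"] * steps_per_floor,
        }
        for row in rows
    ]


def load(rows: List[Row], sink: List[Row]) -> None:
    """Append rows to an in-memory sink; a real job would write to storage."""
    sink.extend(rows)


def run_job(config_json: str, sink: List[Row]) -> None:
    """Wire the three steps together using externally supplied configuration."""
    config = json.loads(config_json)
    load(transform(extract(), config["steps_per_floor"]), sink)


if __name__ == "__main__":
    output: List[Row] = []
    run_job('{"steps_per_floor": 21}', output)
    print(output)
```

Keeping `transform` free of I/O is the design choice that makes automated testing straightforward: a test can call it with a handful of hand-written rows and compare the result directly.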
8. Data Engineer Handbook
Data Engineer Handbook is a comprehensive collection of resources covering all aspects of data engineering. It includes tutorials, articles, and books on all the topics related to data engineering. Whether you are looking for a quick reference guide or in-depth knowledge, this handbook has something for data engineers of all levels.
The Handbook includes:
- Great Books
- Communities to Follow
- Companies to Keep an Eye On
- Blogs to Read
- Whitepapers
- Great YouTube Channels
- Great Podcasts
- Newsletters
- LinkedIn, Twitter, TikTok, and Instagram Influencers to Follow
- Courses
- Certifications
- Conferences
Link: DataExpert-io/data-engineer-handbook
9. Data Engineering Wiki
The Data Engineering Wiki repository is a community-driven wiki that provides a comprehensive resource for learning data engineering. This repository covers a wide range of topics, including data pipelines, data warehousing, and data modeling.
Data Engineering Wiki includes:
- Data Engineering Concepts
- Frequently Asked Questions about Data Engineering
- Guides on How to Make Data Engineering Decisions
- Commonly Used Tools for Data Engineering
- Step-by-Step Guides for Data Engineering Tasks
- Learning Resources
Link: data-engineering-community/data-engineering-wiki
10. Data Engineering Practice
Data Engineering Practice offers a hands-on approach to learning data engineering. It provides practice projects and exercises to help you apply your knowledge and skills in real-world scenarios. By working through these projects, you will gain practical experience and build a portfolio that showcases your data engineering capabilities.
Data Engineering Practice Problems include exercises on:
- Downloading Files
- Web Scraping + Downloading + Pandas
- Boto3 AWS + S3 + Python
- Convert JSON to CSV + Ragged Directories
- Data Modeling for Postgres + Python
- Ingestion and Aggregation with PySpark
- Using Various PySpark Functions
- Using DuckDB for Analytics and Transforms
- Using Polars Lazy Computation
Link: danielbeach/data-engineering-practice
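As a taste of the kind of task these exercises involve, here is a small sketch of converting JSON records to CSV using only the Python standard library. It is not taken from the repository and does not cover the directory-walking part of that exercise. The helper names `flatten` and `json_to_csv`, the underscore-joined column names, and the sample data are all assumptions made for illustration.

```python
import csv
import io
import json
from typing import Any, Dict


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into one level, joining keys with underscores."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


def json_to_csv(json_text: str) -> str:
    """Convert a JSON array of objects into CSV text."""
    records = [flatten(record) for record in json.loads(json_text)]
    # Collect the union of column names, preserving first-seen order.
    fieldnames = list(dict.fromkeys(key for record in records for key in record))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    # Columns missing from a record are written as empty strings.
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = '[{"id": 1, "geo": {"lat": 1.5, "lon": 2.5}}, {"id": 2, "tag": "x"}]'
    print(json_to_csv(sample))
```

Records with different shapes still produce a rectangular CSV, because the header is built from every key seen across all records.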
Final Words
Mastering data engineering requires dedication, persistence, and a passion for learning new concepts and tools. These 10 GitHub repositories provide a wealth of information and resources to help you become a professional data engineer and keep you updated on current trends.
Whether you are just starting out or are already an experienced data engineer, I encourage you to explore these resources, contribute to open-source projects, and stay engaged with the vibrant data engineering community on GitHub.
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master’s degree in technology management and a bachelor’s degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.