Image by Author | Midjourney & Canva
Introduction
ETL, or Extract, Transform, Load, is an essential data engineering process: it involves extracting data from various sources, converting it into a workable form, and loading it into a destination such as a database. ETL pipelines automate this process, ensuring that data is processed consistently and efficiently; this provides a framework for tasks like data analysis, reporting, and machine learning, and ensures the data is clean, reliable, and ready to use.
Bash, short for Bourne-Again Shell, is a Unix shell and a powerful tool for building ETL pipelines, thanks to its simplicity, flexibility, and extremely wide applicability, which makes it an excellent option for novices and seasoned pros alike. Bash scripts can automate tasks, move files around, and talk to other tools on the command line, all of which makes Bash a good choice for ETL work. Moreover, Bash is ubiquitous on Unix-like systems (Linux, BSD, macOS, etc.), so it is ready to use on most such systems with no extra work on your part.
This article is intended for beginner and practitioner data scientists and data engineers who are looking to build their first ETL pipeline. It assumes a basic understanding of the command line and aims to provide a practical guide to creating an ETL pipeline using Bash.
The goal of this article is to guide readers through the process of building a basic ETL pipeline using Bash. By the end of the article, readers will have a working understanding of implementing an ETL pipeline that extracts data from a source, transforms it, and loads it into a destination database.
Setting Up Your Environment
Before we begin, ensure you have the following:
- A Unix-based system (Linux or macOS)
- Bash shell (usually pre-installed on Unix systems)
- Basic understanding of command-line operations
For our ETL pipeline, we will need the following command line tools:
- curl: fetches data from web APIs and remote servers
- jq: parses and manipulates JSON data
- awk: processes and transforms text and CSV data
- sed: performs stream editing and text cleanup
- sqlite3: loads data into and queries an SQLite database
You can install them using your system’s package manager. Note that awk and sed are typically preinstalled on Unix-like systems, so in most cases you only need to install curl, jq, and sqlite3. On a Debian-based system, you can use apt-get:
sudo apt-get install curl jq sqlite3
On macOS, you can use brew:
brew install curl jq sqlite3
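As an optional sanity check, you can loop over the required tools and report any that are missing from your PATH:
# Report any required tool that is not found on the PATH
for tool in curl jq awk sed sqlite3; do
  command -v "$tool" >/dev/null 2>&1 || echo "Missing: $tool"
done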
Let’s set up a dedicated directory for our ETL project. Open your terminal and run:
mkdir ~/etl_project
cd ~/etl_project
This creates a new directory called etl_project and navigates into it.
Extracting Data
Data can come from various sources such as APIs, CSV files, or databases. For this tutorial, we’ll demonstrate extracting data from a public API and a CSV file.
Let’s use curl to fetch data from a public API. For example, we’ll extract data from a mock API that provides sample data.
# Fetching data from a public API
curl -o data.json "https://api.example.com/data"
This command will download the data and save it as data.json.
We can also use curl to download a CSV file from a remote server.
# Downloading a CSV file
curl -o data.csv "https://example.com/data.csv"
This will save the CSV file as data.csv in our working directory.
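To make the extraction step repeatable, the two downloads can be wrapped in a small script. Below is a minimal sketch using the placeholder URLs from above, where -f makes curl fail on HTTP errors and set -euo pipefail stops the script as soon as something goes wrong:
#!/bin/bash
# extract.sh: minimal sketch of the extraction step (placeholder URLs)
set -euo pipefail

# -f: fail on HTTP errors, -s: silent, -S: still show errors, -L: follow redirects
curl -fsSL -o data.json "https://api.example.com/data"
curl -fsSL -o data.csv "https://example.com/data.csv"

echo "Extraction complete: data.json and data.csv downloaded."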
Transforming Data
Data transformation is necessary to convert raw data into a format suitable for analysis or storage. This may involve parsing JSON, filtering CSV files, or cleaning text data.
jq is a powerful tool for working with JSON data. Let’s use it to extract specific fields from our JSON file.
# Parsing and extracting specific fields from JSON
jq '.data[] | {id, name, value}' data.json > transformed_data.json
This command extracts the id, name, and value fields from each entry in the JSON data and saves the result in transformed_data.json.
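As a quick illustration, here is the same filter applied to a tiny, made-up payload (the {"data": [...]} structure is an assumption about the API’s response shape); the -c flag simply prints compact, one-line JSON:
# Hypothetical sample payload, piped through the same filter used above
echo '{"data": [{"id": 1, "name": "alpha", "value": 3.14, "extra": "ignored"}]}' \
  | jq -c '.data[] | {id, name, value}'
# Prints: {"id":1,"name":"alpha","value":3.14}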
awk is a versatile tool for processing CSV files. We’ll use it to extract specific columns from our CSV file.
# Extracting specific columns from CSV
awk -F, 'BEGIN {OFS=","} {print $1, $3}' data.csv > transformed_data.csv
This command extracts the first and third columns from data.csv and saves them, still comma-separated (thanks to OFS=","), in transformed_data.csv.
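If data.csv happens to have a header row that you do not want to carry into the output, a common variant skips the first line with NR>1 (this assumes a header is actually present):
# Skip the header row (NR>1) while extracting the first and third columns
awk -F, 'BEGIN {OFS=","} NR>1 {print $1, $3}' data.csv > transformed_data.csv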
sed is a stream editor for filtering and transforming text. We can use it to perform text replacements and clean up our data.
# Replacing text in a file
sed 's/old_text/new_text/g' transformed_data.csv
This command replaces occurrences of old_text with new_text and writes the result to standard output; it does not change transformed_data.csv itself. To modify the file, redirect the output to a new file or use sed's in-place option (-i), keeping in mind that its syntax differs between GNU sed on Linux and BSD sed on macOS.
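As a more concrete cleanup example, the following hypothetical step combines tr and sed to strip Windows-style carriage returns and stray double quotes, writing to a temporary file so the approach works with both GNU and BSD sed:
# Hypothetical cleanup: remove carriage returns and double quotes, then replace the original file
tr -d '\r' < transformed_data.csv | sed 's/"//g' > transformed_data.tmp
mv transformed_data.tmp transformed_data.csv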
Loading Data
Common destinations for loading data include databases and files. For this tutorial, we’ll use SQLite, a commonly used lightweight database.
First, let’s create a new SQLite database and a table to hold our data.
# Creating a new SQLite database and table
sqlite3 etl_database.db "CREATE TABLE data (id INTEGER PRIMARY KEY, name TEXT, value REAL);"
This command creates a database file named etl_database.db and a table named data with three columns: id, name, and value.
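If you want to confirm the table was created as expected, you can inspect its definition with the .schema dot-command:
# Inspecting the table definition
sqlite3 etl_database.db ".schema data"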
Next, we’ll insert our transformed data into the SQLite database. Keep in mind that the import expects the CSV’s columns to line up with the table’s columns, so make sure your transformation step produces exactly the fields defined in the data table (here, id, name, and value).
# Inserting data into SQLite database
sqlite3 etl_database.db <<EOF
.mode csv
.import transformed_data.csv data
EOF
This block of commands sets the mode to CSV and imports transformed_data.csv into the data table.
We can verify that the data has been inserted correctly by querying the database.
# Querying the database
sqlite3 etl_database.db "SELECT * FROM data;"
This command retrieves all rows from the data table and displays them.
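Putting the pieces together, here is a minimal sketch of how the steps above could be combined into a single script. The name etl_pipeline.sh, the URL, and the column choices are placeholders, so adjust them to your own data:
#!/bin/bash
# etl_pipeline.sh: minimal end-to-end sketch of the extract, transform, and load steps
set -euo pipefail

# Extract: download the raw CSV (placeholder URL)
curl -fsSL -o data.csv "https://example.com/data.csv"

# Transform: keep the columns that match the data table's schema (adjust as needed)
awk -F, 'BEGIN {OFS=","} {print $1, $2, $3}' data.csv > transformed_data.csv

# Load: create the table if it does not exist, then import the transformed CSV
sqlite3 etl_database.db "CREATE TABLE IF NOT EXISTS data (id INTEGER PRIMARY KEY, name TEXT, value REAL);"
sqlite3 etl_database.db <<EOF
.mode csv
.import transformed_data.csv data
EOF

echo "ETL pipeline finished successfully."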
Final Thoughts
While building our ETL pipeline with Bash, we have covered the following steps:
- Environment setup and tool installation
- Data extraction from a public API and a CSV file with curl
- Data transformation using jq, awk, and sed
- Data loading into an SQLite database with sqlite3
Bash is a good choice for ETL due to its simplicity, flexibility, automation capabilities, and interoperability with other CLI tools.
For further investigation, consider incorporating error handling, scheduling the pipeline via cron, or learning more advanced Bash concepts. You may also wish to investigate alternative transformation tools and methods to broaden your pipeline skill set.
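As one example, a cron entry like the following (added via crontab -e) would run a pipeline script such as the hypothetical etl_pipeline.sh sketched earlier every day at 2:00 AM, appending its output to a log file; the paths shown are placeholders:
# Hypothetical crontab entry: run the pipeline daily at 2:00 AM and log the output
0 2 * * * /bin/bash /home/youruser/etl_project/etl_pipeline.sh >> /home/youruser/etl_project/etl.log 2>&1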
Try out your own ETL projects, putting what you have learned to the test in more elaborate scenarios. With some luck, the basic concepts covered here will be a good jumping-off point for more complex data engineering tasks.
Matthew Mayo (@mattmayo13) holds a Master’s degree in computer science and a graduate diploma in data mining. As Managing Editor, Matthew aims to make complex data science concepts accessible. His professional interests include natural language processing, machine learning algorithms, and exploring emerging AI. He is driven by a mission to democratize knowledge in the data science community. Matthew has been coding since he was 6 years old.