Image by Author
I know the word ‘Python’ is probably the most overused word in the context of data science. To some degree, there’s a reason for that. But, in this article, I want to focus on SQL, which often gets overlooked when talking about data science. I emphasize talking because, in practice, SQL is not overlooked at all. On the contrary, it’s one of the holy trinity of the programming languages in data science: SQL, Python, and R.
SQL is made for data querying and manipulation but also has respectable data analysis and reporting capabilities. I’ll show some of the main SQL concepts you need as a data scientist and some easy examples from StrataScratch and LeetCode.
Then, I’ll provide two common business scenarios in which all or most of those SQL concepts must be applied.
Main SQL Concepts for Data Scientists
Here’s the overview of the concepts I’ll discuss.
1. Querying and Filtering Data
This is where your practical work as a data scientist usually starts: querying a database and extracting only the data you need for your task.
This typically involves relatively simple SELECT statements with the FROM and WHERE clauses. To get the unique values, use DISTINCT. If you need to use several tables, you also add JOINs.
You’ll often need to use ORDER BY to make your dataset more organized.
Example of Combining Two Tables: You could be required to list the persons’ names and the city and state they live in by joining two tables and sorting the output by last name.
SELECT FirstName,
LastName,
City,
State
FROM Person p LEFT JOIN Address a
ON p.PersonId = a.PersonId
ORDER BY LastName ASC;
2. Working with NULLs
NULLs are values that data scientists are often not indifferent to – they either want only NULLs, they want to remove them, or they want to replace them with something else.
You can select data with or without NULLs using IS NULL or IS NOT NULL in WHERE.
Replacing NULLs with some other values is typically done using conditional expressions:
- NULLIF()
- COALESCE()
- CASE statement
Example of IS NULL: With this query, you can find all the customers not referred by the customer with ID = 2.
SELECT name
FROM customer
WHERE referee_id IS NULL OR referee_id <> 2;
Example of COALESCE(): I can rework this example by saying I want to query all the data but also add a column that will show 0% as a host response rate instead of NULL.
SELECT *,
COALESCE(host_response_rate, '0%') AS edited_host_response_rate
FROM airbnb_search_details;
3. Data Type Conversion
As a data scientist, you will convert data frequently. Data often doesn’t come in the desired format, so you must adapt it to your needs. This is usually done using CAST(), but there are also some alternatives, depending on your SQL flavor.
Example of Casting Data: This query casts the star data from VARCHAR to INTEGER and removes the values that have non-integer values.
SELECT business_name,
review_id,
user_id,
CAST(stars AS INTEGER) AS cast_stars,
review_date,
review_text,
funny,
useful,
cool
FROM yelp_reviews
WHERE stars '?';
4. Data Aggregation
To better understand the data they’re working with (or simply because they need to produce some reports), data scientists very often have to aggregate data.
In most cases, you must use aggregate functions and GROUP BY. Some of the common aggregate functions are:
- COUNT()
- SUM()
- AVG()
- MIN()
- MAX()
If you want to filter aggregated data, use HAVING instead of WHERE.
Example of Sum: You can use this query to sum the bank account for each user and show only those with a balance above 1,000.
SELECT u.name,
SUM(t.amount) AS balance
FROM Users u
JOIN Transactions t
ON u.account = t.account
GROUP BY u.name
HAVING SUM(t.amount) > 10000;
5. Handling Dates
Working with dates is commonplace for data scientists. Again, the dates are only sometimes formatted according to your taste or needs. To maximize the flexibility of dates, you will sometimes need to extract parts of dates or reformat them. To do that in PostgreSQL, you’ll most commonly use these date/time functions:
- EXTRACT()
- DATE_PART()
- DATE_TRUNC()
- TO_CHAR()
One of the common operations with dates is to find a difference between the dates or to add dates. You do that by simply subtracting or adding the two values or by using the functions dedicated for that, depending on the database you use.
Example of Extracting Year: The following query extracts the year from the DATETIME type column to show the number of violations per year for Roxanne Cafe.
SELECT EXTRACT(YEAR FROM inspection_date) AS year_of_violation,
COUNT(*) AS n_violations
FROM sf_restaurant_health_violations
WHERE business_name="Roxanne Cafe" AND violation_id IS NOT NULL
GROUP BY year_of_violation
ORDER BY year_of_violation ASC;
Example of Date Formatting: With the query below, you format the start date as ‘YYYY-MM’ using TO_CHAR().
SELECT TO_CHAR(started_at, 'YYYY-MM'),
COUNT(*) AS n_registrations
FROM noom_signups
GROUP BY 1;
6. Handling Text
Apart from dates and numerical data, very often databases contain text values. Sometimes, these values have to be cleaned, reformatted, unified, split and merged. Due to these needs, every database has many text functions. In PostgreSQL, some of the more popular ones are:
- CONCAT() or ||
- SUBSTRING()
- LENGTH()
- REPLACE()
- TRIM()
- POSITION()
- UPPER() & LOWER()
- REGEXP_REPLACE() & REGEXP_MATCHES() & REGEXP_SPLIT_TO_ARRAY()
- LEFT() & RIGHT()
- LTRIM() & RTRIM()
There are usually some overlapping string functions in all databases, but each has some distinct functions.
Example of Finding the Length of the Text: This query uses the LENGTH() function to find invalid tweets based on their length.
SELECT tweet_id
FROM Tweets
WHERE LENGTH(content) > 15;
7. Ranking Data
Ranking data is one of the widespread tasks in data science. For instance, it can be used to find the best or worst-selling products, quarters with the highest revenue, songs ranked by number of streams, and the highest and lowest-paid employees.
The ranking is done using window functions (which we’ll talk a bit more in the next section):
- ROW_NUMBER()
- RANK()
- DENSE_RANK()
Example of Ranking: This query uses DENSE_RANK() to rank hosts based on the number of beds they have listed.
SELECT host_id,
SUM(n_beds) AS number_of_beds,
DENSE_RANK() OVER(ORDER BY SUM(n_beds) DESC) AS rank
FROM airbnb_apartments
GROUP BY host_id
ORDER BY number_of_beds DESC;
8. Window Functions
Window functions in SQL allow you to calculate the rows related to the current row. This characteristic is not only used to rank data. Depending on the window function category, they can have many different uses. You can read more about them in the window functions article. However, their main characteristic is that they can show analytical and aggregated data at the same time. In other words, they don’t collapse individual rows when performing calculations.
Example of FIRST_VALUE() Window Function: One window function example is to show the latest user login for a particular year. The FIRST_VALUE() window function makes this easier.
SELECT DISTINCT user_id,
FIRST_VALUE(time_stamp) OVER (PARTITION BY user_id ORDER BY time_stamp DESC) AS last_stamp
FROM Logins
WHERE EXTRACT(YEAR FROM time_stamp) = 2020;
9. Subqueries & CTEs
Subqueries and CTEs (known as tidier subqueries) allow you to reach a more advanced level of calculations. By knowing subqueries and CTEs, you can write complex SQL queries, with subqueries or CTEs used for sub-calculations referenced in the main query.
Example of Subqueries and CTEs: The query below uses the subquery to find the first year of the product sale. This data is then used in WHERE for the main query to filter data.
SELECT product_id,
year AS first_year,
quantity,
price
FROM Sales
WHERE (product_id, year) IN (
SELECT product_id,
MIN(year) AS year
FROM Sales
GROUP BY product_id
);
The code can be written using CTE instead of a subquery.
WITH first_year_sales AS (
SELECT product_id,
MIN(year) AS first_year
FROM Sales
GROUP BY product_id
)
SELECT s.product_id,
s.year AS first_year,
s.quantity,
s.price
FROM Sales s
JOIN first_year_sales AS fys
ON s.product_id = fys.product_id AND s.year = fys.first_year;
Business Examples of Using SQL
Let’s now look at a couple of business cases where data scientists can use SQL and apply all (or most) of the concepts we discussed earlier.
Finding Best Selling Product
In this example, you must know subqueries, data aggregation, handling dates, ranking data using window functions, and filtering the output.
The subquery calculates each product’s sales for each month and ranks them by sales. The main query then simply selects the required columns and leaves only products with the first rank, i.e., best-selling products.
SELECT sale_month,
description,
total_paid
FROM
(SELECT DATE_PART('MONTH', invoicedate) AS sale_month,
description,
SUM(unitprice * quantity) AS total_paid,
RANK() OVER (PARTITION BY DATE_PART('MONTH', invoicedate) ORDER BY SUM(unitprice * quantity) DESC) AS sale_rank
FROM online_retail
GROUP BY sale_month,
description) AS ranking_sales
WHERE sale_rank = 1;
Calculating Moving Average
The rolling or moving average is a common business calculation to which data scientists can apply their extensive SQL knowledge, as in this example.
The subquery in the code below calculates revenues by month. The main query then uses the AVG() window functions to calculate the 3-month rolling average revenue.
SELECT t.month,
AVG(t.monthly_revenue) OVER(ORDER BY t.month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS avg_revenue
FROM
(SELECT TO_CHAR(created_at::DATE, 'YYYY-MM') AS month,
SUM(purchase_amt) AS monthly_revenue
FROM amazon_purchases
WHERE purchase_amt>0
GROUP BY 1
ORDER BY 1) AS t
ORDER BY t.month ASC;
Conclusion
All these SQL queries show you how to use SQL in your data science tasks. While SQL is not made for complex statistical analysis or machine learning, it’s perfect for querying, manipulating, aggregating data, and performing calculations.
These example queries should help you in your job. If you don’t have a data science job, many of these queries will come up in your SQL interview questions.
Nate Rosidi is a data scientist and in product strategy. He’s also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.