“Python!”
“No, R.”
“Fools, it’s obviously Rust.”
Many data science learners and experts alike are keen to pin down the very best language for data science. In my opinion, most people are wrong. Amidst the hunt for the newest, the sexiest, the most container-able data science language, people are looking for the wrong thing.
Image from Reddit
It’s easy to overlook. It’s easy to even discount it as a language. But the humble Structured Query Language, or SQL, is my pick for the language to learn for data science. All those other languages certainly have their place, but SQL is the one non-negotiable language that I consider a base requirement for anyone working in data science. Here’s why.
Look, databases come hand in hand with data science. It’s in the name. If you’re working with data science, you’re working with databases. And if you’re working with databases, you’re probably working with SQL.
Why? Because SQL is the universal database query language. There is no other. Imagine someone told you that if you just learned a specific language, you’d be able to speak to and understand every single person on Earth. How valuable would that be? SQL is that language in data science, the language that everyone uses to manage and access databases.
Image from X
Every data scientist needs to access and retrieve data, to explore data and build hypotheses, to filter, aggregate, and sort data. And hence, every data scientist will need SQL. As long as you know how to write a SQL query, you’ll go far.
Someone, reading this article right now, is piping up about the NoSQL movement. Indeed, certain data is now more commonly stored in non-relational databases, such as by key-value pairs or graph data. It’s true that there are benefits to storing data like that – you gain more scalability and flexibility. But there’s no standard NoSQL query language. You might learn one for one job, and then need to learn an entirely new one for a new job.
Plus, you will very rarely find a business that works entirely with NoSQL databases, while many companies don’t need non-relational databases.
There’s that famous (and debunked) stat about how data scientists spend 80% of their time cleaning. While it’s not true, I think if you ask any data scientist what they spend time on, data cleaning will rank in the top five tasks. That’s why this section is the longest.
You can clean and process data with other languages, but SQL in particular offers unique advantages for certain aspects of data cleaning and processing.
SQL’s expressive query language allows data scientists to efficiently filter, sort, and aggregate data using concise statements. This level of flexibility is especially useful when dealing with large datasets where manual data manipulation would be time-consuming and error-prone. Compare that to a language like Python, where achieving similar data manipulation tasks might require writing more lines of code and dealing with loops, conditions, and external libraries. While Python is renowned for its versatility and rich ecosystem of data science libraries, SQL’s focused syntax can expedite routine data cleaning operations, enabling data scientists to swiftly prepare data for analysis.
Plus, any data scientist will complain about the bane of their existence: missing values. SQL’s functions and capabilities for handling missing values—such as using COALESCE, CASE, and NULL handling—provide straightforward approaches to address gaps in data without the need for complex programming logic.
The other bane of a data scientist’s existence is duplicates. Happily, SQL offers efficient methods to identify and eliminate duplicate records from datasets, like the `DISTINCT` keyword and the `GROUP BY` clause.
You’ve probably heard of ETL pipelines. Well, SQL can be used to create data transformation pipelines, which take raw or semi-processed data and convert it into a format suitable for analysis. This is particularly beneficial for automating and standardizing that repetitive data-cleaning processes we all know and hate.
SQL’s ability to join tables from different databases or files streamlines the process of merging data for analysis is essential for projects involving data integration or aggregating data from diverse origins. Which, for a data scientist, comprises a majority of projects.
Finally, I like to remind people that data science does not happen in a vacuum. SQL queries are self-contained and can be easily shared with colleagues. This fosters collaboration and ensures that others can reproduce data cleaning steps without manual intervention.
Now, you won’t get far in data science if you only know SQL. But happily, SQL integrates perfectly well with any other of the top data science languages like R, Python, Julia, or Rust. You get all the benefits of analysis, data viz, and machine learning while still retaining SQL’s strength for data manipulation.
Image from LinkedIn
This is especially powerful when you think about all that data cleaning and processing I talked about earlier. You can use SQL to preprocess and clean data directly within databases, and then lean on Python, R, Julia, or Rust to perform more advanced data transformations or feature engineering, leveraging the extensive libraries available.
Many organizations rely on SQL – or, more accurately, rely on data scientists who know how to use SQL – to generate reports, dashboards, and visualizations that inform decision-making. Familiarity with SQL enables data scientists to produce meaningful reports directly from databases. And because SQL is so widespread, these reports are usually compatible and interoperable across almost any system.
Because of how interoperable it is with reporting tools and scripting languages like Python, R, and JavaScript, data scientists can actually automate the reporting processes, seamlessly combining SQL’s data extraction and manipulation capabilities with the visualization and reporting features of these languages. The upshot is you get comprehensive and insightful reports that effectively communicate data-driven insights to stakeholders, all inside one place.
There’s a reason you’ll get asked a bunch of SQL interview questions at any data science interview. Almost every data science job requires at least a basic familiarity with SQL.
Here’s an example of what I mean: the job listing says, “Expertise in SQL, and R or Python for data analysis and platform development.” In other words, SQL is a must. And then either R or Python, but one is as good as another to most employers. But thanks to SQL domination, there’s no alternative to SQL. Every data science job will require you to work with SQL.
The really cool thing about it is that it makes SQL the ultimate transferable tool. One job may prefer Python, while a startup might require Rust due to personal preference or legacy infrastructure. But no matter where you go, or what you do, it’s SQL or bust. Take the time to learn it, and you’ll always be able to tick off a job requirement.
Ultimately, if you find a job as a data scientist that doesn’t require SQL, you’re probably not going to be doing a whole lot of data science.
It really comes down to the database. Data science requires the storage, manipulation, retrieval, and management of a lot of data. That data lives somewhere. It can only be accessed with one tool, normally, and that tool is SQL. SQL is the language to learn for data science and will be for as long as we rely on databases to do data science.
Nate Rosidi is a data scientist and in product strategy. He’s also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Connect with him on Twitter: StrataScratch or LinkedIn.
Nate Rosidi is a data scientist and in product strategy. He’s also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Connect with him on Twitter: StrataScratch or LinkedIn.