Strong Python and SQL skills are integral for many data professionals. As a data professional, you’re probably comfortable with Python programming, so much so that writing Python code feels natural. But are you following best practices when working on data science projects with Python?
Though it’s easy to learn Python and build data science applications with it, it’s perhaps even easier to write code that’s hard to maintain. To help you write better code, this tutorial explores Python coding best practices that help with dependency management and maintainability, such as:
- Setting up dedicated virtual environments when working on data science projects locally
- Improving maintainability using type hints
- Modeling and validating data using Pydantic
- Profiling code
- Using vectorized operations when possible
So let’s get coding!
1. Use Virtual Environments for Each Project
Virtual environments ensure project dependencies are isolated, preventing conflicts between different projects. In data science, where projects often involve different sets of libraries and versions, virtual environments are particularly useful for maintaining reproducibility and managing dependencies effectively.
Additionally, virtual environments make it easier for collaborators to set up the same project environment without worrying about conflicting dependencies.
You can use tools like Poetry to create and manage virtual environments. There are many benefits to using Poetry, but if all you need is to create virtual environments for your projects, the built-in venv module works just as well.
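If you do go the Poetry route, the basic workflow looks roughly like this (a minimal sketch; the project name and the pandas dependency are just placeholders, and pipx is another common way to install Poetry):
$ pip install poetry
$ poetry new my-project        # scaffold a new project with a pyproject.toml
$ cd my-project
$ poetry add pandas            # add a dependency; Poetry manages the virtual environment for you
$ poetry install               # install everything declared in pyproject.toml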
If you are on a Linux machine (or a Mac), you can instead create and activate a virtual environment with venv like so:
# Create a virtual environment for the project
python -m venv my_project_env
# Activate the virtual environment
source my_project_env/bin/activate
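On Windows, the activation script lives in the Scripts folder rather than bin; a minimal sketch, assuming the same environment name:
# Activate the virtual environment (Command Prompt)
my_project_env\Scripts\activate
# Activate the virtual environment (PowerShell)
my_project_env\Scripts\Activate.ps1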
You can also check the venv docs for platform-specific details on activating the environment. Using a virtual environment for each project keeps dependencies isolated and consistent.
2. Add Type Hints for Maintainability
Because Python is a dynamically typed language, you don’t have to specify the data type of the variables that you create. However, you can add type hints, indicating the expected data type, to make your code more maintainable.
Let’s take an example of a function that calculates the mean of a numerical feature in a dataset with appropriate type annotations:
from typing import List
def calculate_mean(feature: List[float]) -> float:
    # Calculate mean of the feature
    mean_value = sum(feature) / len(feature)
    return mean_value
Here, the type hints let the user know that the calculate_mean function takes in a list of floating-point numbers and returns a floating-point value.
Remember that Python does not enforce type hints at runtime. But you can use a static type checker such as mypy to catch invalid types before the code runs.
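For example, suppose you save the function to a file, hypothetically named stats_utils.py, and run mypy over it. The sketch below shows a call that a type checker flags, even though plain Python only fails at runtime:
# stats_utils.py
from typing import List

def calculate_mean(feature: List[float]) -> float:
    # Calculate mean of the feature
    return sum(feature) / len(feature)

# mypy flags this call: the argument is a list of strings, not floats.
# Plain Python only fails at runtime, when sum() tries to add the strings.
calculate_mean(["1.0", "2.0", "3.0"])
Running mypy stats_utils.py reports the incompatible argument type without executing the script.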
3. Model Your Data with Pydantic
Previously, we talked about adding type hints to make code more maintainable. This works fine for Python functions. But when working with data from external sources, it’s often helpful to model the data by defining classes and fields with the expected data types.
You can use built-in dataclasses in Python, but you don’t get data validation support out of the box. With Pydantic, you can model your data and also use its built-in data validation capabilities. To use Pydantic, you can install it along with the email validator using pip:
$ pip install "pydantic[email]"
Here’s an example of modeling customer data with Pydantic. You can create a model class that inherits from BaseModel and define the various fields and attributes:
from pydantic import BaseModel, EmailStr
class Customer(BaseModel):
    customer_id: int
    name: str
    email: EmailStr
    phone: str
    address: str
# Sample data
customer_data = {
    'customer_id': 1,
    'name': 'John Doe',
    'email': 'john.doe@example.com',
    'phone': '123-456-7890',
    'address': '123 Main St, City, Country'
}
# Create a customer object
customer = Customer(**customer_data)
print(customer)
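Because the email field is declared as EmailStr, an invalid address is rejected when you create the object. Here’s a minimal sketch of that built-in validation in action, reusing the Customer model and customer_data from above:
from pydantic import ValidationError

# Replace the email with something invalid
bad_data = {**customer_data, 'email': 'not-an-email'}

try:
    Customer(**bad_data)
except ValidationError as exc:
    # Pydantic reports which field failed validation and why
    print(exc)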
You can take this further by adding custom validators to check that the other fields also have valid values. If you need a tutorial on using Pydantic, from defining models to validating data, read Pydantic Tutorial: Data Validation in Python Made Simple.
4. Profile Code to Identify Performance Bottlenecks
Profiling code is helpful if you’re looking to optimize your application for performance. In data science projects, you can profile memory usage and execution times depending on the context.
Suppose you’re working on a machine learning project where preprocessing a large dataset is a crucial step before training your model. Let’s profile a function that applies common preprocessing steps such as standardization:
import numpy as np
import cProfile
def preprocess_data(data):
    # Standardize the data: subtract the mean and divide by the standard deviation
    scaled_data = (data - np.mean(data)) / np.std(data)
    return scaled_data
# Generate sample data
data = np.random.rand(100)
# Profile preprocessing function
cProfile.run('preprocess_data(data)')
When you run the script, cProfile prints statistics on the number of function calls and the time spent in each.
In this example, we’re profiling the preprocess_data() function, which preprocesses sample data. Profiling, in general, helps identify potential bottlenecks and guides the optimizations that improve performance.
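Profiling isn’t limited to execution time. If memory is the concern, the standard-library tracemalloc module can report peak allocation; here’s a minimal sketch, reusing the preprocess_data() function and data from above:
import tracemalloc

# Trace memory allocations around the preprocessing step
tracemalloc.start()
preprocess_data(data)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"Peak memory during preprocessing: {peak / 1024:.1f} KiB")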
5. Use NumPy’s Vectorized Operations
For any data processing task, you can always write a pure Python implementation from scratch. But you may not want to when working with large arrays of numbers. Most common operations can be formulated as operations on vectors, and NumPy performs them far more efficiently.
Let’s take the following example of element-wise multiplication:
import numpy as np
import timeit
# Set seed for reproducibility
np.random.seed(42)
# Two arrays with 1 million random integers each
array1 = np.random.randint(1, 10, size=1000000)
array2 = np.random.randint(1, 10, size=1000000)
Here are the NumPy and Python-only implementations:
# NumPy vectorized implementation for element-wise multiplication
def elementwise_multiply_numpy(array1, array2):
    return array1 * array2

# Pure Python implementation of element-wise multiplication
def elementwise_multiply_python(array1, array2):
    result = []
    for x, y in zip(array1, array2):
        result.append(x * y)
    return result
Let’s use the timeit function from the timeit module to measure the execution times for the above implementations:
# Measure execution time for NumPy implementation
numpy_execution_time = timeit.timeit(lambda: elementwise_multiply_numpy(array1, array2), number=10) / 10
numpy_execution_time = round(numpy_execution_time, 6)
# Measure execution time for Python implementation
python_execution_time = timeit.timeit(lambda: elementwise_multiply_python(array1, array2), number=10) / 10
python_execution_time = round(python_execution_time, 6)
# Compare execution times
print("NumPy Execution Time:", numpy_execution_time, "seconds")
print("Python Execution Time:", python_execution_time, "seconds")
We see that the NumPy implementation is over 80 times faster:
Output >>>
NumPy Execution Time: 0.00251 seconds
Python Execution Time: 0.216055 seconds
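As a quick sanity check (a minimal sketch), you can also confirm that both implementations return the same products:
# Verify the two implementations agree element-wise
assert np.array_equal(
    elementwise_multiply_numpy(array1, array2),
    np.array(elementwise_multiply_python(array1, array2))
)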
Wrapping Up
In this tutorial, we have explored a few Python coding best practices for data science. I hope you found them helpful.
If you are interested in learning Python for data science, check out 5 Free Courses to Master Python for Data Science. Happy learning!
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.