Image created by Author with Midjourney
As a data scientist, having a standardized and portable environment for analysis and modeling is crucial. Docker provides an excellent way to create reusable and sharable data science environments. In this article, we’ll walk through the steps to set up a basic data science environment using Docker.
Why would we consider using Docker in the first place? Docker allows data scientists to create isolated, reproducible environments for their work. Some key advantages include:
- Consistency – The same environment can be replicated across different machines. No more “it works on my machine” issues.
- Portability – Docker environments can easily be shared and deployed across multiple platforms.
- Isolation – Containers isolate dependencies and libraries needed for different projects. No more conflicts!
- Scalability – It’s easy to scale an application built inside Docker by spinning up more containers.
- Collaboration – Docker enables collaboration by allowing teams to share development environments.
The starting point for any Docker environment is the Dockerfile. This text file contains instructions for building the Docker image.
Let’s create a basic Dockerfile for a Python data science environment and save it as ‘Dockerfile’ without an extension.
# Use official Python image
FROM python:3.9-slim-buster
# Set environment variable
ENV PYTHONUNBUFFERED=1
# Install Python libraries
RUN pip install numpy pandas matplotlib scikit-learn jupyter
# Run Jupyter by default
CMD ["jupyter", "lab", "--ip='0.0.0.0'", "--allow-root"]
This Dockerfile uses the official Python image and installs some popular data science libraries on top of it. The last line defines the default command to run Jupyter Lab when a container is started.
Now we can build the image using the docker build command:
docker build -t ds-python .
This will create an image tagged ds-python based on our Dockerfile.
Building the image may take a few minutes while the dependencies are installed. Once complete, we can view our local Docker images using docker images.
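For example, to confirm the new image is there, you can filter the listing by repository name:
docker images ds-python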
With the image built, we can now launch a container:
docker run -p 8888:8888 ds-python
This will start a Jupyter Lab instance and map port 8888 on the host to 8888 in the container.
We can now navigate to localhost:8888 in a browser to access Jupyter and start running notebooks. By default, Jupyter requires an access token; the full login URL, token included, is printed in the container's logs.
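Keep in mind that nothing created inside the container persists once it is removed, so for real work it is common to mount a local directory as a volume. A minimal sketch, assuming your notebooks live in the current directory and using /work as an arbitrary mount point inside the container:
docker run -p 8888:8888 -v "$(pwd)":/work ds-python
The mounted folder then appears in Jupyter's file browser, and anything saved there remains on the host after the container stops.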
A key benefit of Docker is the ability to share and deploy images across environments.
To save an image to a tar archive, run:
docker save -o ds-python.tar ds-python
This tarball can then be loaded on any other system with Docker installed via:
docker load -i ds-python.tar
We can also push images to a Docker registry like Docker Hub to share with others publicly or privately within an organization.
To push the image to Docker Hub:
- Create a Docker Hub account if you don’t already have one
- Log in to Docker Hub from the command line using docker login
- Tag the image with your Docker Hub username: docker tag ds-python yourusername/ds-python
- Push the image: docker push yourusername/ds-python
The ds-python image is now hosted on Docker Hub. Other users can pull the image by running:
docker pull yourusername/ds-python
For private repositories, you can create an organization and add users. This allows you to share Docker images securely within teams.
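For example, assuming a hypothetical organization namespace called yourorg on Docker Hub, the workflow mirrors the public one:
docker tag ds-python yourorg/ds-python
docker push yourorg/ds-python
Team members who have been granted access to the private repository can then pull it with docker pull yourorg/ds-python after logging in.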
To load and run the Docker image on another system:
- Copy the ds-python.tar file to the new system
- Load the image using docker load -i ds-python.tar
- Start a container using docker run -p 8888:8888 ds-python
- Access Jupyter Lab at localhost:8888
That’s it! The ds-python image is now ready to use on the new system.
This gives you a quick primer on setting up a reproducible data science environment with Docker. Some additional best practices to consider:
- Use smaller base images like Python slim to optimize image size
- Leverage Docker volumes for data persistence and sharing
- Follow security principles like avoiding running containers as root (a sketch follows this list)
- Use Docker Compose for defining and running multi-container applications
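To illustrate the non-root point, here is one possible way to adapt the earlier Dockerfile. Treat it as a rough sketch rather than a hardened recipe; the dsuser name and UID are arbitrary choices:
# Same slim base and libraries as before
FROM python:3.9-slim-buster
ENV PYTHONUNBUFFERED=1
RUN pip install numpy pandas matplotlib scikit-learn jupyter
# Create an unprivileged user and switch to it
RUN useradd --create-home --uid 1000 dsuser
USER dsuser
WORKDIR /home/dsuser
# No --allow-root needed now that Jupyter runs as a regular user
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--no-browser"]
Building and running this image works exactly as before; only the user inside the container changes.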
I hope you find this intro helpful. Docker enables tons of possibilities for streamlining and scaling data science workflows.
Matthew Mayo (@mattmayo13) is a Data Scientist and the Editor-in-Chief of KDnuggets, the seminal online Data Science and Machine Learning resource. His interests lie in natural language processing, algorithm design and optimization, unsupervised learning, neural networks, and automated approaches to machine learning. Matthew holds a Master’s degree in computer science and a graduate diploma in data mining. He can be reached at editor1 at kdnuggets[dot]com.