Ruddra.com

Reduce Build Time for Images with Data Science Packages

If you install data science packages such as numpy, scipy, or pandas in a Docker container with pip, building the image can take a very long time. When no pre-built wheel matches the platform, pip compiles the packages' C extensions from source, and that compilation is slow. For numpy it took me around 4 minutes. For scipy, I terminated the build process after 30 minutes or so.

Today, I am going to share some ways in which you can build them faster. You can follow any one of them.

One: build using Anaconda

Anaconda is a free and open-source distribution of Python (and of R, which we will not consider here) for scientific computing. It ships pre-compiled binary packages, so you do not have to compile them on your machine. Here is an example Dockerfile based on the official Miniconda image:

FROM continuumio/miniconda3

# Set environment variables
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

# Set work directory
WORKDIR /code

# Install dependencies
COPY requirement.txt /code/
# -y answers the confirmation prompt, which a non-interactive build cannot do
RUN conda install -y -c conda-forge --file requirement.txt

# Copy project
COPY . /code/

Here is another example using Alpine Linux:

FROM frolvlad/alpine-miniconda3

# Set environment variables
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

# Set work directory
WORKDIR /code

# Install dependencies
COPY requirement.txt /code/
# -y answers the confirmation prompt, which a non-interactive build cannot do
RUN conda install -y -c conda-forge --file requirement.txt

# Copy project
COPY . /code/

Two: use apt-get

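If you do not want to use Anaconda, there is another package manager besides pip that distributes data science packages like numpy, scipy, and pandas: apt. apt is the official package manager for Debian and Ubuntu, and it is very reliable. Its packages are pre-compiled, so nothing has to be built from source. Note that apt installs them for the distribution's own Python 3, which is why this example starts from a plain Ubuntu image instead of the official python image (that image ships a separate interpreter which does not see apt's packages). The apt versions are usually not the latest, but you can still install newer ones with pip. pip skips any requirement the apt packages already satisfy, so this time the pip step takes much less time.

FROM ubuntu:22.04

# Set environment variables
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

# Install Python and the pre-compiled data science packages
RUN apt-get update -y && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y \
    python3 python3-pip python3-numpy python3-scipy python3-pandas && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

# Install dependencies
COPY requirement.txt /code/
RUN python3 -m pip install --no-cache-dir -r /code/requirement.txt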

Three: use separate layers to install packages

If you install the slow-to-build packages in a separate layer, Docker caches that layer and reuses it in subsequent builds. Your first build will still take a long time, but later builds will be significantly faster.

FROM python:3.7

# Set environment variables
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

# Installing scipy
RUN pip3 install --no-cache-dir --disable-pip-version-check scipy==1.3.1

# Installing pandas, numpy, psycopg2, gensim
RUN pip3 install --no-cache-dir \
    pandas==0.25.2 \
    numpy==1.17.3 \
    psycopg2==2.8.4 \
    gensim==3.8.1

# Install dependencies
COPY requirement.txt /code/
RUN pip install -r /code/requirement.txt

The advantage of separating the layers is that changes to your requirement.txt file or to your source code do not invalidate the cached layers that hold the data science packages.
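
The same idea can be taken one step further by splitting the requirements into two files. The sketch below assumes a hypothetical requirement-heavy.txt that holds the pinned, slow-to-build packages, while requirement.txt keeps everything else. Neither the file name nor the split comes from a real project, so treat it as an illustration rather than a recommended layout.

```dockerfile
FROM python:3.7

ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

WORKDIR /code

# Hypothetical file: pinned, slow-to-build packages that rarely change.
# This layer is rebuilt only when requirement-heavy.txt itself changes.
COPY requirement-heavy.txt /code/
RUN pip install --no-cache-dir -r requirement-heavy.txt

# Frequently edited dependencies. Editing requirement.txt rebuilds from
# here on and leaves the layer above cached.
COPY requirement.txt /code/
RUN pip install --no-cache-dir -r requirement.txt

# Source code changes invalidate only this last layer.
COPY . /code/
```

Docker decides whether a COPY layer can be reused by comparing the contents of the copied files, so the order matters: the files that change least often should be copied first.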

Four: build your own wheel

Finally, you can build your own wheel files using pip wheel, which builds wheel archives for your requirements and their dependencies. Wheel is a built-package format, and it offers the advantage of not recompiling your software during every install. Here is how you can do that:

pip wheel numpy scipy pandas -w wheels

It will store the wheel files (extension .whl) inside the wheels directory. Wheels that contain compiled code are specific to the platform and Python version they were built on, so build them in an environment that matches the image. You can either add them to your Docker image (as in the following example) or put them somewhere in the cloud and download them during the Docker build. FYI, keeping wheel files in your repository will increase its size.

FROM python:3.7

# Set environment variables
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

# Installing dependencies
COPY requirement.txt /code/
COPY ./wheels /tmp/wheels
RUN pip install --find-links=/tmp/wheels -r /code/requirement.txt
RUN rm -rf /tmp/wheels
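
A variation on the Dockerfile above, offered as a sketch rather than as the method this article used: a multi-stage build can produce the wheels inside Docker itself, so they always match the image's Python version and platform and never need to be committed to the repository. The stage name builder and the /wheels path are assumptions chosen for illustration. The first build still compiles everything; later builds are faster only while Docker's cache for the builder stage stays valid, that is, until requirement.txt changes.

```dockerfile
# Stage 1: build wheels on the same base image as the final stage,
# so they match its Python version and platform.
FROM python:3.7 AS builder
COPY requirement.txt /code/
RUN pip wheel --no-cache-dir -r /code/requirement.txt -w /wheels

# Stage 2: install from the pre-built wheels only.
FROM python:3.7

ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

COPY requirement.txt /code/
COPY --from=builder /wheels /tmp/wheels

# --no-index stops pip from contacting PyPI, so the install fails loudly
# if a wheel is missing instead of silently compiling from source.
RUN pip install --no-cache-dir --no-index --find-links=/tmp/wheels \
    -r /code/requirement.txt
```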

In conclusion

In this article, we saw several ways to reduce the build time of Docker images with data science packages; you can follow any of them. If you have any thoughts on this, please let me know in the comments section below.

Last updated: Jul 13, 2024

