
OpenCV CUDA on 🛞's

I think we all know opencv - it’s a pretty old yet still widely used and performant computer vision library with a lot of useful algorithms. One of its advantages is that you can configure the build to get a really nice performance gain. One way is to compile it to run matrix operations on GPUs, which speeds up both the “old-school” CV filters (essentially convolutions) and deep learning inference (yes, you can do that in opencv too, via the special dnn module).

Preparing the environment with GPUs and CUDA

Here I use a machine with GPUs and Ubuntu as the base OS, with the CUDA drivers already installed (I don’t want to put CUDA installation instructions here, but I suggest just using the “official” nvidia docker images).

Nowadays it’s pretty common to use relatively cheap VMs with NVIDIA A10G GPUs both for training and for inference in prod. So to build our libs, I’ve used a g5.xlarge AWS EC2 instance with Ubuntu 20.04 as the base image, python 3.8 and CUDA 11.6 - you can use any similar setup.
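
Before building anything, it’s worth a quick sanity check that both the driver and the CUDA toolkit are actually visible on the machine:

nvidia-smi        # should list your A10G and the driver version
nvcc --version    # should report the toolkit version, e.g. 11.6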

Why Wheel

Python wheel is the standard of python package distribution (per Google’s definition). Essentially, it’s just a zip archive with all the files needed to install the package: python code, *.pyc byte code and some compiled platform-specific native shared libraries (e.g. *.so).
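
You can see that for yourself by listing the contents of any wheel you have around (the filename below is just a placeholder):

python -m zipfile --list some_package-1.0-cp38-cp38-linux_x86_64.whl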

The core advantage that makes us care about wheels in the ML domain: no need to compile anything during installation, since the wheel already contains the compiled extension modules. Since almost all of the libraries that we use are essentially python bindings for some lower-level code in C/C++/Fortran, it can become really hard to always keep the host OS ready to build such libs.
Ideally, a lib’s developers should put pre-built wheels for the specific platforms somewhere public, so everybody could just pip install them. But that’s not the case for a lot of important libraries, unfortunately ;(

It becomes especially important when we want to build libs with CUDA support.

And, also, some libraries may be very slow to build on CI workers inside special containers, like the tritonserver one. So, obviously, installing a pre-built wheel with a plain pip install saves a lot of time.

Compile

First, log in to the build machine and install the OS dependencies:

apt-get update -y && apt-get upgrade -y
apt-get install -y \
        build-essential git cmake \
        unzip pkg-config wget \
        libavcodec-dev libavformat-dev libswscale-dev \
        libgstreamer-plugins-base1.0-dev libgstreamer1.0-dev \
        libgtk-3-dev libpng-dev libjpeg-dev \
        libopenexr-dev libtiff-dev libwebp-dev \
        libv4l-dev libxvidcore-dev libx264-dev \
        libatlas-base-dev gfortran

Then, you need to pull the opencv-python repo and follow its instructions for the manual build.
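
Something like this (the repo pulls in opencv and opencv_contrib as git submodules, so the --recursive flag matters):

git clone --recursive https://github.com/opencv/opencv-python.git
# if you already cloned without submodules:
# git submodule update --init --recursive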

During the build, you need to provide cmake flags that suit your use case. In this particular case, we want CUDA support and the extra contrib libs (ENABLE_CONTRIB) with neural network support to be compiled. Note that CUDA_ARCH_BIN=8.6 below matches the A10G’s Ampere architecture - adjust it for your GPU. Here is the script:

cd opencv-python
rm -rf build && mkdir -p build 
pip install numpy==1.23.4
export CMAKE_ARGS="-D CMAKE_BUILD_TYPE=RELEASE -D INSTALL_PYTHON_EXAMPLES=OFF -D INSTALL_C_EXAMPLES=OFF -D OPENCV_ENABLE_NONFREE=ON -D WITH_CUDA=ON -D WITH_CUDNN=ON -D OPENCV_DNN_CUDA=ON -D ENABLE_FAST_MATH=1 -D CUDA_FAST_MATH=1 -D CUDA_ARCH_BIN=8.6 -D WITH_CUBLAS=1 -D BUILD_EXAMPLES=OFF"
export ENABLE_CONTRIB=1
pip wheel . --verbose -w dist

Compilation could take from 30 minutes to a couple of hours. Be patient.

If everything works, you’ll have a precious wheel archive in dist:

opencv_contrib_python-4.7.0.77db6ba-cp38-cp38-linux_x86_64.whl

Install and add runtime dependencies

Now, you can just install the built wheel with pip:

pip install opencv_contrib_python-4.7.0.77db6ba-cp38-cp38-linux_x86_64.whl

After that, open a Python interpreter and try to import cv2. Most probably you’ll see errors about some *.so that could not be found:

ImportError: libhdf5_serial.so.100: cannot open shared object file: No such file or directory

That’s because opencv relies on lots of shared libraries that should be installed at the OS level, and you don’t have them on your system yet.

In order to debug that issue, you’ll need two tools: ldd and apt-file. ldd is present on most Linux distributions and is used to print out all the shared object dependencies of a binary. And apt-file is a utility that finds the apt package containing a file matching some string pattern. You can install it and update its cache like this:

sudo apt install apt-file && apt-file update

So first you run ldd against the opencv shared object, which should be located somewhere around here:

ldd /usr/local/lib/python3.X/dist-packages/cv2*.so

The output of ldd could look like this:

libpthread.so.0 => /usr/lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f60a55bf000)
...
libhdf5_serial.so.100 => not found

So you’re interested in the “not found” lines (to make it simple, just pipe ldd into grep):

ldd cv2*.so | grep "not found"

And then apt-file can be used to find the package those object files belong to:

apt-file search libhdf5_serial.so.100

It will output a list of apt packages (there could be duplicates - that’s normal):

libhdf5-100: /usr/lib/arm-linux-gnueabihf/libhdf5_serial.so.100
libhdf5-100: /usr/lib/arm-linux-gnueabihf/libhdf5_serial.so.100.0.1

So you just install that libhdf5-100 with apt install libhdf5-100, and the library should be found the next time you run ldd!

Repeat that for all the not found dependencies and opencv will finally work!
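
If the list of missing libraries is long, a small loop can save some typing (a sketch; adjust the python3.X part to your actual version):

for lib in $(ldd /usr/local/lib/python3.X/dist-packages/cv2*.so | grep "not found" | awk '{print $1}'); do
    echo "=== $lib ==="
    apt-file search "$lib"
done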

>>> import cv2
>>> cv2.__version__
'4.7.0'
>>> print(dir(cv2.cuda))
[... 'ORB', 'ORB_create', 'OpticalFlowDual_TVL1', 'OpticalFlowDual_TVL1_create', 'SHARED_ATOMICS', 'SURF_CUDA',...]
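
Beyond eyeballing the module listing, a quick smoke test confirms that the CUDA runtime is actually usable (a minimal sketch; a correctly built install should report at least one device):

import cv2
import numpy as np

print(cv2.cuda.getCudaEnabledDeviceCount())  # >= 1 on a working setup

# round-trip a random image through GPU memory and resize it there
img = (np.random.random((480, 640, 3)) * 255).astype(np.uint8)
gpu = cv2.cuda_GpuMat()
gpu.upload(img)
resized = cv2.cuda.resize(gpu, (320, 240))
print(resized.download().shape)  # (240, 320, 3)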

After doing all that, you’ll end up with the list of OS dependencies that should be installed in the container alongside the built wheel to use it in your apps - so just add them to your Dockerfile. Here are the actual missing deps for the tritonserver container:

RUN apt-get update -y && apt-get install -y \
    libhdf5-103 \
    libgtk-3-0 \
    libdc1394-22 \
    libgstreamer-plugins-base1.0-0 \
    libavcodec58 \
    libavformat58 \
    libswscale5

Test inference

Let’s imagine that we downloaded the TensorFlow pre-trained FSRCNN model and placed it in the same folder as the run script.
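
The OpenCV dnn_superres tutorial points to the Saafke/FSRCNN_Tensorflow repo for the pre-trained weights (double-check the URL, it may have moved):

wget https://github.com/Saafke/FSRCNN_Tensorflow/raw/master/models/FSRCNN_x2.pb

And the run script itself: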

import os

import cv2
from cv2 import dnn_superres 
import numpy as np

base_path = os.path.dirname(os.path.abspath(__file__))
model_id = os.path.join(base_path, "FSRCNN_x2.pb")

# create the super-resolution net and load the pre-trained weights
net = dnn_superres.DnnSuperResImpl_create()
net.readModel(model_id)
# route inference through the CUDA backend we've just compiled in
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
net.setModel("fsrcnn", 2)  # model name and upscale factor

# a random uint8 image stands in for a real input frame
test_img = (np.random.random((512, 512, 3)) * 255).astype(np.uint8)
result = net.upsample(test_img)

print(test_img.shape, result.shape)

Expected output:

(512, 512, 3) (1024, 1024, 3)

Congrats, it works!
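
If you want to see the actual payoff, here’s a rough CPU-vs-CUDA benchmark of the same model (a sketch; absolute numbers will vary with your GPU and image size):

import time

import cv2
from cv2 import dnn_superres
import numpy as np

def bench(backend, target, runs=10):
    net = dnn_superres.DnnSuperResImpl_create()
    net.readModel("FSRCNN_x2.pb")
    net.setPreferableBackend(backend)
    net.setPreferableTarget(target)
    net.setModel("fsrcnn", 2)
    img = (np.random.random((512, 512, 3)) * 255).astype(np.uint8)
    net.upsample(img)  # warm-up, excluded from timing
    start = time.perf_counter()
    for _ in range(runs):
        net.upsample(img)
    return (time.perf_counter() - start) / runs

print("cpu :", bench(cv2.dnn.DNN_BACKEND_OPENCV, cv2.dnn.DNN_TARGET_CPU))
print("cuda:", bench(cv2.dnn.DNN_BACKEND_CUDA, cv2.dnn.DNN_TARGET_CUDA))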