How to properly Multithread in OpenCV in 2019?

First of all, thank you for the clarity of the question.

Q: Is it okay to use (multi)threading on application level with OpenCV?

A: Yes, it is fine to use multithreading on the application level with OpenCV, as long as the functions you call can be applied to parts of an image independently, such as blurring or colour space conversion. In those cases you can split the image into multiple parts, apply the function to each part in its own thread, and then recombine the parts to get the final output.

For functions such as Hough transforms or pca_analysis, which do not give correct results when they are applied to separate image sections and then recombined, application-level multithreading of this kind may give wrong results and should not be used.

As πάντα ῥεῖ mentioned, your implementation of multithreading will not give you an advantage because you join each thread inside the for loop itself, so the loop waits for one thread to finish before it starts the next. I would suggest you use promise and future objects (if you want an example of how to, let me know in the comments and I will share a snippet).

The answer below took a lot of research. Thanks for asking the question, it really helped me add to my multithreading knowledge 🙂

Q: If yes, why are the time spans printed by my program above GROWING over time?

A: After a lot of research I found out that creating and destroying threads takes a lot of CPU as well as memory resources. When we initialize a thread (in your code, with the line thread t(blurSlowdown, nullptr);), an identifier is written to the memory location of that variable, and this identifier lets us refer to the thread. Your program creates and destroys threads at a very high rate. As I understand it, a thread pool is allocated to the program, from which it can run and destroy threads. To keep it short, here is what happens:

  1. When you create a thread, an identifier is created that points to this thread.
  2. When you destroy the thread, this memory is freed.

BUT

  1. When you create a new thread immediately after the first one is destroyed, the identifier of the new thread points to a new location in the thread pool (a location other than that of the previous thread).

  2. After repeatedly creating and destroying threads, the thread pool is exhausted, so the CPU is forced to slow our program down a bit until the pool has space for a new thread again.

Intel TBB and OpenMP are very good at thread pool management so this problem may not occur while using them.

Q: Is TBB in 2019 now widely supported?

A: Yes, you can take advantage of TBB in your OpenCV program, provided TBB support is turned on when building OpenCV (the WITH_TBB CMake option).

Here is a program that parallelizes medianBlur with cv::parallel_for_, which uses TBB when OpenCV is built with it:

#include "opencv2/imgproc/imgproc.hpp"
#include "opencv2/highgui/highgui.hpp"
#include <iostream>
#include <chrono>

using namespace cv;
using namespace std;
using namespace std::chrono;

class Parallel_process : public cv::ParallelLoopBody
{

private:
    cv::Mat img;
    cv::Mat& retVal;
    int size;
    int diff;

public:
    Parallel_process(cv::Mat inputImage, cv::Mat& outImage,
                     int sizeVal, int diffVal)
        : img(inputImage), retVal(outImage),
          size(sizeVal), diff(diffVal)
    {
    }

    void operator()(const cv::Range& range) const override
    {
        for(int i = range.start; i < range.end; i++)
        {
            /* divide the image into 'diff' horizontal strips and
               process them simultaneously; rows left over when
               img.rows is not divisible by 'diff' are not processed */

            cv::Mat in(img, cv::Rect(0, (img.rows/diff)*i,
                                     img.cols, img.rows/diff));
            cv::Mat out(retVal, cv::Rect(0, (retVal.rows/diff)*i,
                                         retVal.cols, retVal.rows/diff));

            cv::medianBlur(in, out, size);
        }
    }
};

int main()
{
    VideoCapture cap(0);

    cv::Mat img, out;

    while(1)
    {
        if(!cap.read(img) || img.empty())
            break; // stop when no frame could be grabbed
        out = cv::Mat::zeros(img.size(), img.type());

        // split the frame into 8 strips; parallel_for_ runs them on
        // OpenCV's parallel backend (TBB if OpenCV was built with it)
        auto start1 = high_resolution_clock::now();
        cv::parallel_for_(cv::Range(0, 8), Parallel_process(img, out, 9, 8));
        //cv::medianBlur(img, out, 9); //Uncomment to compare time w/o TBB
        auto stop1 = high_resolution_clock::now();
        auto duration1 = duration_cast<microseconds>(stop1 - start1);

        auto time_taken1 = duration1.count()/1000;
        cout << "TBB Time: " <<  time_taken1 << "ms" << endl;

        cv::imshow("image", img);
        cv::imshow("blur", out);
        cv::waitKey(1);
    }

    return 0;
}

On my machine, the TBB implementation takes around 10 ms, and without TBB it takes around 40 ms.

Q: If yes, what offers better performance, multithreading on the application level(if allowed) or TBB / OpenMP?

A: I would suggest using TBB/OpenMP over raw threads (pthread / std::thread), because TBB offers better control over threads and a better structure for writing parallel code, and it manages the underlying threads internally. If you use pthreads directly, you have to take care of synchronization and thread safety in your own code. These frameworks abstract away that thread handling, which can get very complex.

Edit: I checked the comments regarding image dimensions that are incompatible with the number of threads among which you want to divide the processing. Here is a potential workaround (I haven't tested it, but it should work): scale the image resolution to compatible dimensions, for example:

If your image resolution is 485 x 647, scale it to 488 x 648, pass it to Parallel_process, then scale the output back to the original size of 485 x 647.

For comparison of TBB and OpenMP check this answer
