Problems with real-time image classification using Python neural networks

Question

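I am trying to do real-time image classification with Caffe and Python. I use OpenCV to stream from my webcam in one process, and in a separate process I use Caffe to classify the frames pulled from the webcam. I then pass the classification result back to the main thread to caption the webcam stream.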

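The problem is that even though I have an NVIDIA GPU and run the Caffe predictions on the GPU, the main thread gets slowed down. Normally, without any predictions, my webcam stream runs at 30 fps. With the predictions, it reaches 15 fps at best.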

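I have verified that Caffe is indeed using the GPU when performing the predictions, and that neither the GPU nor the GPU memory is maxing out. I have also verified that my CPU cores are not maxed out at any point during the program. I am wondering whether I am doing something wrong or whether there is no way to keep these two processes truly separate. Any advice is appreciated. Here is my code for reference: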

    import multiprocessing

    import caffe
    import cv2


    class Consumer(multiprocessing.Process):

        def __init__(self, task_queue, result_queue):
            multiprocessing.Process.__init__(self)
            self.task_queue = task_queue
            self.result_queue = result_queue
            # other initialization stuff

        def run(self):
            caffe.set_mode_gpu()
            caffe.set_device(0)
            # Load caffe net -- code omitted
            while True:
                image = self.task_queue.get()
                # crop image -- code omitted
                text = net.predict(image)
                self.result_queue.put(text)


    tasks = multiprocessing.Queue()
    results = multiprocessing.Queue()
    consumer = Consumer(tasks, results)
    consumer.start()

    # Create the window and start capturing video from the camera
    cv2.namedWindow("preview")
    vc = cv2.VideoCapture(0)

    # Try to get the first frame
    if vc.isOpened():
        rval, frame = vc.read()
    else:
        rval = False

    # Separate buffer for the frame handed to the consumer process
    frame_copy = frame.copy() if rval else None
    task_empty = True
    while rval:
        if task_empty:
            tasks.put(frame_copy)
            task_empty = False
        if not results.empty():
            text = results.get()
            # Add text to frame (position and font arguments omitted)
            cv2.putText(frame, text)
            task_empty = True

        # Show the frame with all the applied modifications
        cv2.imshow("preview", frame)

        # Get the next frame from the camera
        rval, frame = vc.read()
        if rval:
            frame_copy[:] = frame

        # Get keyboard input; exit on ESC
        key = cv2.waitKey(1)
        if key == 27:
            break

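I am pretty sure it is the Caffe prediction that slows everything down, because when I comment out the prediction and pass dummy text back and forth between the processes, I get 30 fps again.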

    import multiprocessing

    import caffe
    import cv2


    class Consumer(multiprocessing.Process):

        def __init__(self, task_queue, result_queue):
            multiprocessing.Process.__init__(self)
            self.task_queue = task_queue
            self.result_queue = result_queue
            # other initialization stuff

        def run(self):
            caffe.set_mode_gpu()
            caffe.set_device(0)
            # Load caffe net -- code omitted
            while True:
                image = self.task_queue.get()
                # crop image -- code omitted
                # text = net.predict(image)
                text = "dummy text"
                self.result_queue.put(text)


    tasks = multiprocessing.Queue()
    results = multiprocessing.Queue()
    consumer = Consumer(tasks, results)
    consumer.start()

    # Create the window and start capturing video from the camera
    cv2.namedWindow("preview")
    vc = cv2.VideoCapture(0)

    # Try to get the first frame
    if vc.isOpened():
        rval, frame = vc.read()
    else:
        rval = False

    # Separate buffer for the frame handed to the consumer process
    frame_copy = frame.copy() if rval else None
    task_empty = True
    while rval:
        if task_empty:
            tasks.put(frame_copy)
            task_empty = False
        if not results.empty():
            text = results.get()
            # Add text to frame (position and font arguments omitted)
            cv2.putText(frame, text)
            task_empty = True

        # Show the frame with all the applied modifications
        cv2.imshow("preview", frame)

        # Get the next frame from the camera
        rval, frame = vc.read()
        if rval:
            frame_copy[:] = frame

        # Get keyboard input; exit on ESC
        key = cv2.waitKey(1)
        if key == 27:
            break
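To put numbers on the 30 fps versus 15 fps comparison, the display loop can be timed directly. The following is only a minimal sketch: the `FpsMeter` class and its `window` parameter are hypothetical names introduced here, not part of OpenCV or Caffe.

```python
import time
from collections import deque


class FpsMeter:
    """Rolling frames-per-second estimate over the last `window` frames."""

    def __init__(self, window=30):
        self.stamps = deque(maxlen=window)

    def tick(self, now=None):
        """Record one displayed frame; `now` can be injected for testing."""
        self.stamps.append(time.perf_counter() if now is None else now)

    def fps(self):
        """Frames per second across the recorded timestamps (0.0 if too few)."""
        if len(self.stamps) < 2:
            return 0.0
        elapsed = self.stamps[-1] - self.stamps[0]
        return (len(self.stamps) - 1) / elapsed if elapsed > 0 else 0.0
```

Calling `meter.tick()` once per iteration of the `while rval:` loop, right after `cv2.imshow`, and reading `meter.fps()` would give a comparable figure for the runs with and without prediction.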

Dale · Answer

Some explanations and some rethinking:

I ran the code below on a laptop with an _Intel Core i5-6300HQ @2.3GHz_ CPU, _8 GB RAM_, and an _NVIDIA GeForce GTX 960M_ GPU (2 GB memory), and the result was:

Whether or not I ran the code with Caffe running (by commenting out or keeping net_output = this->net_->Forward(net_input) and the related code in void Consumer::entry()), I could always get around 30 fps in the main thread.

A similar result was obtained on a PC with an _Intel Core i5-4440_ CPU, _8 GB RAM_, and an _NVIDIA GeForce GT 630_ GPU (1 GB memory).
I ran the code of @user3543300 from the question on the same laptop, and the result was:

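Whether or not Caffe was running (on the GPU), I could also get around 30 fps.
According to @user3543300's feedback, with the two versions of the code mentioned above, @user3543300 could get only around 15 fps when running Caffe (on a laptop with an _Nvidia GeForce 940MX GPU and Intel® Core™ i7-6500U CPU @ 2.50GHz × 4_). The webcam frame rate also drops when Caffe runs on the GPU as an independent program.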

So I still think the problem most likely lies in hardware I/O limitations such as DMA bandwidth (this thread about DMA may hint at it) or RAM bandwidth. Hopefully @user3543300 can check this or find the true problem that I have not noticed.

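If the problem really is what I think it is above, then a sensible idea would be to reduce the memory I/O overhead introduced by the CNN network. In fact, to solve similar problems on embedded systems with limited hardware resources, there has been some research on this topic, e.g. Quantization, Structurally Sparse Deep Neural Networks, SqueezeNet, and Deep Compression. Hopefully, applying such techniques will also help improve the webcam frame rate in the question.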
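As a rough illustration of why quantization reduces memory traffic, the sketch below stores a handful of weights as 8-bit codes instead of 32-bit floats. It is a generic linear quantization written for this example, not the scheme from any of the works named above, and the helper names `quantize_uint8` and `dequantize` are hypothetical.

```python
from array import array


def quantize_uint8(weights):
    """Map floats linearly onto 0..255; returns (codes, scale, minimum)."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255.0 or 1.0  # avoid a zero scale for constant input
    codes = array("B", (round((w - lo) / scale) for w in weights))
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate float values from the 8-bit codes."""
    return [lo + c * scale for c in codes]


weights = array("f", [-0.5, -0.1, 0.0, 0.25, 0.5])
codes, scale, lo = quantize_uint8(weights)
restored = dequantize(codes, scale, lo)

bytes_before = weights.itemsize * len(weights)  # float32 storage
bytes_after = codes.itemsize * len(codes)       # one byte per weight
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The weights take a quarter of the space at the cost of a rounding error bounded by about half of `scale`; whether that trade-off is acceptable for a given network has to be measured.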

Original answer:

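Try this C++ solution. It uses threads for the I/O overhead in your task. I tested it using _bvlc_alexnet.caffemodel_ and deploy.prototxt to do image classification, and did not see an obvious slowdown of the main thread (webcam stream) when Caffe was running (on the GPU):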

    #include <stdio.h>
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>
    #include <boost/thread.hpp>
    #include <boost/shared_ptr.hpp>
    #include "caffe/caffe.hpp"
    #include "caffe/util/blocking_queue.hpp"
    #include "caffe/data_transformer.hpp"
    #include "opencv2/opencv.hpp"

    using namespace cv;

    // Queue pair for sharing image/results between webcam and caffe threads
    template<typename T>
    class QueuePair {
     public:
      explicit QueuePair(int size);
      ~QueuePair();

      caffe::BlockingQueue<T*> free_;
      caffe::BlockingQueue<T*> full_;

      DISABLE_COPY_AND_ASSIGN(QueuePair);
    };

    template<typename T>
    QueuePair<T>::QueuePair(int size) {
      // Initialize the free queue
      for (int i = 0; i < size; ++i) {
        free_.push(new T);
      }
    }

    template<typename T>
    QueuePair<T>::~QueuePair() {
      T *data;
      while (free_.try_pop(&data)) {
        delete data;
      }
      while (full_.try_pop(&data)) {
        delete data;
      }
    }

    template class QueuePair<Mat>;
    template class QueuePair<std::string>;

    // Do image classification (caffe predict) using a subthread
    class Consumer {
     public:
      Consumer(boost::shared_ptr<QueuePair<Mat>> task,
               boost::shared_ptr<QueuePair<std::string>> result);
      ~Consumer();
      void Run();
      void Stop();
      void entry(boost::shared_ptr<QueuePair<Mat>> task,
                 boost::shared_ptr<QueuePair<std::string>> result);

     private:
      bool must_stop();

      boost::shared_ptr<QueuePair<Mat> > task_q_;
      boost::shared_ptr<QueuePair<std::string> > result_q_;

      // caffe::Blob<float> *net_input_blob_;
      boost::shared_ptr<caffe::DataTransformer<float> > data_transformer_;
      boost::shared_ptr<caffe::Net<float> > net_;
      std::vector<std::string> synset_words_;
      boost::shared_ptr<boost::thread> thread_;
    };

    Consumer::Consumer(boost::shared_ptr<QueuePair<Mat>> task,
                       boost::shared_ptr<QueuePair<std::string>> result)
        : task_q_(task), result_q_(result), thread_() {
      // for data preprocess
      caffe::TransformationParameter trans_para;
      // set mean
      trans_para.set_mean_file("/path/to/imagenet_mean.binaryproto");
      // set crop size, here is cropping 227x227 from 256x256
      trans_para.set_crop_size(227);
      // instantiate a DataTransformer using trans_para for image preprocess
      data_transformer_.reset(
          new caffe::DataTransformer<float>(trans_para, caffe::TEST));

      // initialize a caffe net
      net_.reset(new caffe::Net<float>(std::string("/path/to/deploy.prototxt"),
                                       caffe::TEST));
      // net parameter
      net_->CopyTrainedLayersFrom(std::string("/path/to/bvlc_alexnet.caffemodel"));

      std::fstream synset_word("path/to/caffe/data/ilsvrc12/synset_words.txt");
      std::string line;
      if (!synset_word.good()) {
        std::cerr << "synset words open failed!" << std::endl;
      }
      while (std::getline(synset_word, line)) {
        synset_words_.push_back(
            line.substr(line.find_first_of(' '), line.length()));
      }
      // a container for net input, holds data converted from cv::Mat
      // net_input_blob_ = new caffe::Blob<float>(1, 3, 227, 227);
    }

    Consumer::~Consumer() {
      Stop();
      // delete net_input_blob_;
    }

    void Consumer::entry(boost::shared_ptr<QueuePair<Mat>> task,
                         boost::shared_ptr<QueuePair<std::string>> result) {
      caffe::Caffe::set_mode(caffe::Caffe::GPU);
      caffe::Caffe::SetDevice(0);

      cv::Mat *frame;
      cv::Mat resized_image(256, 256, CV_8UC3);
      cv::Size re_size(resized_image.cols, resized_image.rows);

      // for caffe input and output
      const std::vector<caffe::Blob<float> *> net_input =
          this->net_->input_blobs();
      std::vector<caffe::Blob<float> *> net_output;

      // net_input.push_back(net_input_blob_);
      std::string *res;

      int pre_num = 1;
      while (!must_stop()) {
        std::stringstream result_strm;
        frame = task->full_.pop();
        cv::resize(*frame, resized_image, re_size, 0, 0, CV_INTER_LINEAR);
        this->data_transformer_->Transform(resized_image, *net_input[0]);
        net_output = this->net_->Forward();
        task->free_.push(frame);

        res = result->free_.pop();
        // Process results here
        for (int i = 0; i < pre_num; ++i) {
          result_strm << synset_words_[net_output[0]->cpu_data()[i]] << " "
                      << net_output[0]->cpu_data()[i + pre_num] << "\n";
        }
        *res = result_strm.str();
        result->full_.push(res);
      }
    }

    void Consumer::Run() {
      if (!thread_) {
        try {
          thread_.reset(
              new boost::thread(&Consumer::entry, this, task_q_, result_q_));
        } catch (std::exception& e) {
          std::cerr << "Thread exception: " << e.what() << std::endl;
        }
      } else {
        std::cout << "Consumer thread may have been running!" << std::endl;
      }
    }

    void Consumer::Stop() {
      if (thread_ && thread_->joinable()) {
        thread_->interrupt();
        try {
          thread_->join();
        } catch (boost::thread_interrupted&) {
        } catch (std::exception& e) {
          std::cerr << "Thread exception: " << e.what() << std::endl;
        }
      }
    }

    bool Consumer::must_stop() {
      return thread_ && thread_->interruption_requested();
    }

    int main(void) {
      int max_queue_size = 1000;
      boost::shared_ptr<QueuePair<Mat>> tasks(
          new QueuePair<Mat>(max_queue_size));
      boost::shared_ptr<QueuePair<std::string>> results(
          new QueuePair<std::string>(max_queue_size));

      char str[100], info_str[100] = " results: ";
      VideoCapture vc(0);
      if (!vc.isOpened())
        return -1;

      Consumer consumer(tasks, results);
      consumer.Run();

      Mat frame, *frame_copy;
      namedWindow("preview");
      double t, fps;

      while (true) {
        t = (double)getTickCount();
        vc.read(frame);

        if (waitKey(1) >= 0) {
          consumer.Stop();
          break;
        }

        // Hand a copy of the frame to the classifier if a free slot exists
        if (tasks->free_.try_peek(&frame_copy)) {
          frame_copy = tasks->free_.pop();
          *frame_copy = frame.clone();
          tasks->full_.push(frame_copy);
        }

        // Pick up a finished result, if any, without blocking
        std::string *res;
        std::string frame_info("");
        if (results->full_.try_peek(&res)) {
          res = results->full_.pop();
          frame_info = frame_info + info_str;
          frame_info = frame_info + *res;
          results->free_.push(res);
        }

        t = ((double)getTickCount() - t) / getTickFrequency();
        fps = 1.0 / t;

        sprintf(str, " fps: %.2f", fps);
        frame_info = frame_info + str;

        putText(frame, frame_info, Point(5, 20),
                FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0, 255, 0));
        imshow("preview", frame);
      }
    }

And in src/caffe/util/blocking_queue.cpp, make the small change below and rebuild Caffe:

    ... // Other stuff
    template class BlockingQueue<Batch<float>*>;
    template class BlockingQueue<Batch<double>*>;
    template class BlockingQueue<Datum*>;
    template class BlockingQueue<shared_ptr<DataReader::QueuePair> >;
    template class BlockingQueue<P2PSync<float>*>;
    template class BlockingQueue<P2PSync<double>*>;
    // add these 2 lines below
    template class BlockingQueue<cv::Mat*>;
    template class BlockingQueue<std::string*>;

Shai · Answer

It seems that Caffe's Python wrapper blocks the Global Interpreter Lock (GIL). Thus, calling any Caffe Python command blocks ALL Python threads.

A workaround (at your own risk) is to disable the GIL for specific Caffe functions. For instance, if you want to be able to run forward without the lock, you can edit _$CAFFE_ROOT/python/caffe/_caffe.cpp_ and add this function:

    void Net_Forward(Net<Dtype>& net, int start, int end) {
      Py_BEGIN_ALLOW_THREADS;   // <-- disable GIL
      net.ForwardFromTo(start, end);
      Py_END_ALLOW_THREADS;     // <-- restore GIL
    }

And replace .def("_forward", &Net<Dtype>::ForwardFromTo) with:

    .def("_forward", &Net_Forward)

Don't forget to run _make pycaffe_ after the change.

See this for more details.
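The effect of holding versus releasing the GIL can be observed with the standard library alone. The sketch below is not Caffe code: `busy_worker` stands in for a call that keeps the GIL, `sleeping_worker` for one that releases it (as `time.sleep` does, much like code wrapped in Py_BEGIN_ALLOW_THREADS), and all three function names are hypothetical.

```python
import threading
import time


def busy_worker(stop):
    """Pure-Python loop: keeps the GIL except at interpreter switch points."""
    while not stop.is_set():
        sum(range(1000))


def sleeping_worker(stop):
    """time.sleep releases the GIL while waiting."""
    while not stop.is_set():
        time.sleep(0.001)


def main_loop_iterations(worker, duration=0.2):
    """Count main-thread iterations completed while `worker` runs alongside."""
    stop = threading.Event()
    thread = threading.Thread(target=worker, args=(stop,))
    thread.start()
    count = 0
    deadline = time.perf_counter() + duration
    while time.perf_counter() < deadline:
        sum(range(1000))  # stand-in for per-frame work in the main thread
        count += 1
    stop.set()
    thread.join()
    return count


if __name__ == "__main__":
    print("worker holds the GIL:   ", main_loop_iterations(busy_worker))
    print("worker releases the GIL:", main_loop_iterations(sleeping_worker))
```

On a standard CPython build one would expect the main loop to complete noticeably more iterations next to the sleeping worker, although the exact figures depend on the machine and interpreter.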

MD. Nazmul Kibria · Answer

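Try a multithreading approach instead of multiprocessing. Spawning processes is slower than spawning threads. Once they are running, there is not much difference. In your case, since so much frame data is involved, I think a threading approach will help.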
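A threaded version of the producer/consumer pattern from the question could look roughly like the sketch below. It assumes a placeholder `classify` function in place of `net.predict`, and `submit_frames` is a hypothetical driver that replaces the webcam loop so the example runs without a camera.

```python
import queue
import threading


def classify(frame):
    """Hypothetical stand-in for net.predict(frame); not a Caffe call."""
    return "label for frame %s" % (frame,)


def consumer(tasks, results):
    """Classify frames until a None sentinel arrives."""
    while True:
        frame = tasks.get()
        if frame is None:
            break
        results.put(classify(frame))


def submit_frames(frames):
    """Feed frames without blocking; frames are dropped while one is queued."""
    tasks = queue.Queue(maxsize=1)
    results = queue.Queue()
    worker = threading.Thread(target=consumer, args=(tasks, results))
    worker.start()
    for frame in frames:
        try:
            tasks.put_nowait(frame)  # never stalls the display loop
        except queue.Full:
            pass                     # classifier busy: skip this frame
    tasks.put(None)                  # ask the worker to finish
    worker.join()
    labels = []
    while not results.empty():
        labels.append(results.get())
    return labels
```

Unlike `multiprocessing.Queue`, `queue.Queue` passes object references between threads, so frames are not pickled and copied. The threads still share one interpreter, so this only helps if the prediction call releases the GIL while it runs.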

MD. Nazmul Kibria · Answer

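One thing that may be happening in your code is that it works in GPU mode for the first call, and on later calls it computes the classification in CPU mode, since that is the default mode. In older versions of Caffe, setting GPU mode once was sufficient, but newer versions need the mode set every time. You can try the following change: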

    def run(self):
        # Load caffe net -- code omitted
        while True:
            caffe.set_mode_gpu()
            caffe.set_device(0)
            image = self.task_queue.get()
            # crop image -- code omitted
            text = net.predict(image)
            self.result_queue.put(text)

Please also check the GPU timings while the consumer thread is running. For NVIDIA you can use the following command:

nvidia-smi

The above command shows the GPU utilization at run time.
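To log utilization from inside the program rather than watching a terminal, nvidia-smi can be polled through its CSV query mode. This is a sketch under assumptions: the helper names are hypothetical, and it assumes the driver reports plain integers for both fields (some GPUs report a "not supported" marker instead, which this parser does not handle).

```python
import shutil
import subprocess


def parse_gpu_stats(line):
    """Parse one CSV line such as '45, 1024' into (utilization %, MiB used)."""
    util, mem = (int(field.strip()) for field in line.split(","))
    return util, mem


def query_gpu_stats():
    """Return [(utilization, memory_used)] per GPU, or None without nvidia-smi."""
    if shutil.which("nvidia-smi") is None:
        return None
    out = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=utilization.gpu,memory.used",
            "--format=csv,noheader,nounits",
        ],
        universal_newlines=True,
    )
    return [parse_gpu_stats(line) for line in out.splitlines() if line.strip()]
```

Calling `query_gpu_stats()` every second or so from a side thread while the consumer runs would show whether utilization stays high across consecutive predictions.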

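If that does not solve it, another solution is to move the OpenCV frame-extraction code into a thread. Since it involves I/O and device access, you might benefit from running it on a thread separate from the GUI/main thread. That thread pushes frames into a queue and the current consumer thread predicts. In that case, handle the queue carefully with a critical section.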
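One way to sketch such a capture thread is shown below. Instead of a queue it uses a single lock-protected slot that always holds the newest frame, which is a design choice made for this example rather than something the answer prescribes. `LatestFrame` and `capture_loop` are hypothetical names, and `read_frame` is assumed to behave like `vc.read()`, returning an `(ok, frame)` pair.

```python
import threading


class LatestFrame:
    """Thread-safe slot that keeps only the newest frame."""

    def __init__(self):
        self._cond = threading.Condition()
        self._frame = None
        self._seq = 0

    def publish(self, frame):
        """Store a new frame and wake any waiting reader."""
        with self._cond:
            self._frame = frame
            self._seq += 1
            self._cond.notify_all()

    def wait_newer(self, last_seq, timeout=1.0):
        """Wait for a frame newer than `last_seq`; return (seq, frame)."""
        with self._cond:
            self._cond.wait_for(lambda: self._seq > last_seq, timeout)
            return self._seq, self._frame


def capture_loop(read_frame, slot, stop):
    """Grabber thread body: publish frames until reading fails or stop is set."""
    while not stop.is_set():
        ok, frame = read_frame()
        if not ok:
            break
        slot.publish(frame)
```

The grabber would be started with `threading.Thread(target=capture_loop, args=(vc.read, slot, stop))`, and both the display loop and the consumer would call `slot.wait_newer(last_seq)`. Keeping only the newest frame means a slow consumer never works on stale images and the critical section stays very short.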