Multithreading and performance
Detecting and recognizing faces in a live video stream is a computationally intensive task. Therefore, FaceSpeaker needs to work efficiently and must take advantage of all the processing power the computer has to offer. Nearly all modern computers feature a multi-core central processing unit (CPU). Taking advantage of those multi-core CPUs requires a multithreaded application.
Programming a multithreaded application can be a challenging task and requires specialized knowledge of concurrent programming. Fortunately, the C# programming language and the Microsoft .NET Framework 4.0 contain many features that make writing multithreaded applications easier. Those features were a key reason for choosing the C# programming language. The book Threading in C# by Joseph Albahari was the key reference for those features. Further knowledge of concurrent and distributed programming was mainly derived from Gerard Tel's excellent distributed programming course, and especially his lecture notes.
Nevertheless, getting FaceSpeaker to work reliably in a multithreaded environment proved complicated and took up much of the development time. This chapter explains some of the design choices and resolved issues related to multithreading and performance optimization.
FaceSpeaker became a reliable multithreaded application, but as expected it places a very high load on system resources. On most quad-core computers used for testing, FaceSpeaker fully utilized the CPU and had to throttle its frame rate to prevent overloading the system. High-end computers can run FaceSpeaker without throttling the frame rate, but even on the unusually powerful development computer (Intel Xeon E5-2600 six-core CPU, 32 GB RAM) CPU utilization is as high as 70%. Some options to reduce FaceSpeaker's computational footprint are discussed at the end of this chapter.
How FaceSpeaker captures and processes frames
Skipping over most technical details, FaceSpeaker's operation can be summarized as follows. A dedicated thread (the "capture thread") continuously grabs frames from the camera and stores them in the "FramesToProcess" queue. Every time it adds a frame to this queue, it starts a new "ProcessFrame" task using C#'s thread pooling mechanism, which automatically and efficiently distributes all tasks over multiple threads. Thanks to this mechanism, multithreaded computations run efficiently while the programmer need not worry about most of the technical complications involved.
Every "ProcessFrame" task dequeues a frame from the FramesToProcess queue, detects the faces present in that frame, and recognizes those faces. It then puts the processed frame (now containing identifications for the faces present) into the RecentFrames sorted list.
In addition, an "onNewFrame" event is fired whenever a new frame has been captured. The user interface displays each new frame's image and, approximately once per second, attempts to identify the person currently looking at the camera using the identifyPerson function.
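The capture-and-process pipeline described above can be sketched as follows. This is a simplified, language-neutral illustration in Python rather than FaceSpeaker's actual C# code: the detect_and_recognize stand-in and the plain dictionary used for RecentFrames are placeholders, and Python's ThreadPoolExecutor plays the role of the .NET thread pool.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

frames_to_process = queue.Queue()         # the "FramesToProcess" queue
recent_frames = {}                        # stand-in for the RecentFrames sorted list
recent_lock = threading.Lock()
pool = ThreadPoolExecutor(max_workers=4)  # analogue of the .NET thread pool

def detect_and_recognize(image):
    # Placeholder for the OpenCV face detection/recognition step.
    return ["face-in-" + image]

def process_frame():
    # Each task dequeues one frame and stores the processed result.
    seq, image = frames_to_process.get()
    with recent_lock:
        recent_frames[seq] = detect_and_recognize(image)

def capture_loop(frames):
    # Dedicated capture thread's job: enqueue each frame, then submit a task.
    futures = []
    for seq, image in enumerate(frames):
        frames_to_process.put((seq, image))
        futures.append(pool.submit(process_frame))
    for f in futures:
        f.result()  # wait here only so the sketch terminates cleanly

capture_loop(["frame0", "frame1", "frame2"])
print(sorted(recent_frames))  # → [0, 1, 2]
```

Because there are exactly as many tasks as enqueued frames, every frame is processed exactly once, regardless of which pool thread picks it up.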
Frame rate throttling
Most cameras can capture 25 to 30 frames per second, but most personal computers in use as of 2013 cannot perform live face recognition using the OpenCV algorithms at that frame rate even if the program is multithreaded.
Therefore FaceSpeaker automatically throttles the frame rate if frames are being processed too slowly. Frames should be processed within half a second of being captured. The capture thread sleeps for a period of "capture delay" (0 milliseconds by default, i.e. no delay) between capturing frames.
If a ProcessFrame task detects that processing a frame finished more than half a second after that frame was captured, it processes the frame normally but increases the capture delay by one millisecond, causing the frame rate to decrease. If a frame has been waiting in the FramesToProcess queue for over half a second, the capture delay is also increased by one millisecond and the frame is discarded without being processed. This avoids an accumulation of unprocessed frames in the FramesToProcess queue, which would degrade performance even further. If no frames have exceeded the processing deadline within the last second, the capture delay is decreased by one millisecond, preventing temporary performance inhibitors from causing excessive frame rate throttling.
This mechanism avoids wasting system resources on capturing frames which cannot be processed in time, while avoiding excessive frame rate throttling.
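The three throttling rules can be sketched as follows. This is a simplified Python illustration, not FaceSpeaker's actual C# implementation; the function names and the synthetic timestamps in the example are invented for this sketch.

```python
DEADLINE = 0.5     # seconds: frames must be processed within half a second
capture_delay = 0  # milliseconds slept between captures (0 = no throttling)
last_overrun = None  # time of the most recent missed deadline

def on_frame_processed(captured_at, now):
    """Rule 1: processing finished late -> slow the capture rate by 1 ms."""
    global capture_delay, last_overrun
    if now - captured_at > DEADLINE:
        capture_delay += 1
        last_overrun = now

def should_process(captured_at, now):
    """Rule 2: a frame stuck in the queue for > 0.5 s also raises the delay,
    and the stale frame is dropped instead of processed."""
    global capture_delay, last_overrun
    if now - captured_at > DEADLINE:
        capture_delay += 1
        last_overrun = now
        return False
    return True

def on_tick(now):
    """Rule 3: a full second without overruns -> speed back up by 1 ms."""
    global capture_delay
    if capture_delay > 0 and (last_overrun is None or now - last_overrun > 1.0):
        capture_delay -= 1

# Illustration with synthetic timestamps (in seconds):
on_frame_processed(captured_at=0.0, now=0.6)              # late -> delay is 1 ms
assert should_process(captured_at=0.6, now=1.2) is False  # stale frame dropped -> 2 ms
on_tick(now=2.5)                                          # quiet second -> back to 1 ms
print(capture_delay)  # → 1
```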
Power saver
The high CPU utilization caused by running FaceSpeaker is especially problematic when it is used on a laptop computer. Increased CPU utilization causes increased power consumption and heat production. High CPU utilization will drain the laptop battery very quickly, and the excessive heat - which cannot escape the user's backpack - may cause significant discomfort to the user or damage to the laptop.
When a person is looking at the camera, a high frame rate is desirable because this increases the probability of detecting a usable face and increases recognition accuracy. However, when no face is in front of the camera or when a person has been recognized and stays in front of the camera for an extended period of time, the high frame rate just wastes CPU resources.
Therefore a power saver feature was implemented. If enabled, FaceSpeaker enters standby mode when it has not detected a face for over 2 seconds or when a person has been finally identified (i.e. the user interface has issued the second beep indicating recognition confidence). In standby mode the capture delay is fixed at 200 milliseconds, causing the frame rate to drop to 4 - 5 frames per second. This is enough to detect when a person comes into view, and as soon as a face is detected FaceSpeaker leaves standby mode. The reduced frame rate causes a considerable drop in CPU utilization.
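The power saver logic amounts to a small state machine, sketched below in Python for illustration (FaceSpeaker itself is written in C#). The class and method names are invented for this sketch, and the interaction between final identification and a face that stays in view is simplified.

```python
class PowerSaver:
    NO_FACE_TIMEOUT = 2.0  # seconds without a face before entering standby
    STANDBY_DELAY = 200    # ms capture delay in standby (roughly 4-5 fps)

    def __init__(self, now=0.0):
        self.standby = False
        self.last_face_seen = now

    def face_detected(self, now):
        # A detected face wakes FaceSpeaker from standby immediately.
        self.last_face_seen = now
        self.standby = False

    def person_identified(self):
        # Final identification (the second confidence beep) enters standby.
        self.standby = True

    def no_face(self, now):
        # No face for over 2 seconds -> enter standby.
        if now - self.last_face_seen > self.NO_FACE_TIMEOUT:
            self.standby = True

    def capture_delay(self, normal_delay):
        return self.STANDBY_DELAY if self.standby else normal_delay

saver = PowerSaver()
saver.no_face(now=2.5)         # 2.5 s without a face -> standby
print(saver.capture_delay(0))  # → 200
saver.face_detected(now=3.0)   # a face appears -> wake immediately
print(saver.capture_delay(0))  # → 0
```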
Process and thread priority
In early field trials 2 symptoms surfaced:
- Sound playback and speech output were delayed or distorted.
- When a screen reader was active, the screen reader tended to "lock up". This made it nearly impossible to read or navigate the user interface.
This is a case of resource starvation. In a multitasking operating system such as Microsoft Windows, the machine runs multiple processes (programs) simultaneously. Each process consists of one or more threads. A part of the operating system called the "scheduler" is responsible for distributing available CPU time among the various processes and threads trying to perform calculations. When multiple processes are active, the scheduler should distribute the available CPU time equally among processes, and should distribute each process's CPU time equally among that process's threads. However, in practice the scheduler's operation is not that simple.
Notably, the scheduler operates in a non-deterministic fashion. This means that a programmer cannot make any assumptions about how the scheduler will distribute available CPU time among competing processes and threads. In other words, the programmer has no way of estimating how much CPU time the program will get in a given time frame, or how that CPU time will be distributed among the process's threads. If CPU time is scarce, process and thread execution will inevitably be delayed in order to allow other threads to execute, and those waiting delays will be of unpredictable, highly variable duration.

FaceSpeaker's background threads (i.e. the threads responsible for face detection and face recognition) request an inordinate amount of CPU time, indeed more CPU time than most computers can provide. Behind the scenes, C#'s thread pooling mechanism distributes this "background" work over many background threads. The "user interface thread" (the thread responsible for interacting with the user and issuing messages) is just one of n (n >> 1) threads competing for time on the fully occupied CPU. Due to the non-deterministic operation of the scheduler, there will be unpredictable and highly variable delays in the execution of the user interface thread. If the CPU is fully occupied, those execution delays can be long enough to cause noticeable effects in the user interface, such as delayed or distorted sound and speech. Because the FaceSpeaker process tries to claim more CPU time than is available, competing processes (such as screen readers) may experience similar noticeable delays.
Fortunately, process and thread priorities allow programmers to have the scheduler favor certain processes and threads over (or below) others when distributing CPU time. FaceSpeaker's user interface thread has a thread priority of "high", meaning the scheduler will schedule it in preference to the process's other threads (which have the default thread priority of "normal"). The FaceSpeaker process as a whole has a process priority of "below normal", which prevents it from starving other processes such as screen readers of CPU time. After setting those priorities, both symptoms described earlier disappeared while FaceSpeaker retained about the same frame rate.
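Lowering a process's own priority can be illustrated with a short Python sketch. Note that os.setpriority is POSIX-only; on Windows, .NET exposes the corresponding settings through the Process.PriorityClass property (for the whole process) and Thread.Priority (for individual threads), which is what FaceSpeaker uses. Python's standard library has no portable per-thread priority, so only the process side is shown.

```python
import os

# Lower the priority of the whole current process so it yields CPU time
# to other processes, analogous to FaceSpeaker's "below normal" process
# priority. On POSIX systems, a larger "nice" value means lower priority,
# and an unprivileged process may always lower its own priority.
before = os.getpriority(os.PRIO_PROCESS, 0)     # current nice value
os.setpriority(os.PRIO_PROCESS, 0, before + 5)  # raise nice = lower priority
after = os.getpriority(os.PRIO_PROCESS, 0)
print(before, after)  # the process now runs at lower priority
```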
Improving performance
After implementing the mechanisms described in this chapter and resolving many subtle technical issues, FaceSpeaker worked well enough for a prototype product. However, given enough development time and resources FaceSpeaker's performance could be improved.
As described on EmguCV's licensing page, a commercial version of EmguCV is available. It has better performance characteristics and native multithreading support thanks to a more optimized build process. At about $200 this product is quite affordable, although it would still have had a significant impact on FaceSpeaker's very limited budget. Because FaceSpeaker performs acceptably using EmguCV's open source release, and because using a commercial EmguCV version might interfere with FaceSpeaker's goal of releasing an open source product, this option was not investigated further. But if FaceSpeaker were to be developed beyond the current experimental phase, designing an optimized build process might be an easy way to improve performance.
Changing the process and thread priorities as explained in the previous section effectively resolved the problems experienced, but this quick fix is not an ideal solution. Lowering FaceSpeaker's process priority causes it to yield significantly more CPU time to background processes. The work done by those background processes always slows down FaceSpeaker but does not necessarily benefit the user directly (most users would prefer better FaceSpeaker performance over background processes like virus scans). Albahari points out that prioritizing the user interface thread over other threads may cause user interface updating code to take up a disproportionate amount of CPU time, unnecessarily slowing down the background threads. The solution he proposes is to split the application into 2 processes: one process runs the user interface thread, while the other runs all other threads. The 2 program parts then communicate through inter-process messaging.
This is a desirable solution for various reasons. Not only could it allow significant performance increases without fiddling with priorities, but once implemented it becomes relatively easy to offload the face detection and recognition algorithms onto a server, which would receive the low bandwidth grayscale camera stream through a mobile internet connection and pass the negligibly small recognition results back to the client. The best part of such a solution is the possibility of running FaceSpeaker on augmented reality glasses like Google Glass or on other mobile devices available today. Again, such options are outside the scope of prototype development, but if FaceSpeaker is to be developed further this should be a top priority. As explained in the software design chapter, the user interface is loosely coupled to the rest of the program, so it is relatively straightforward to write classes bridging the user interface and the program backbone through a network connection. Of course, a user interface suitable for augmented reality glasses or other mobile devices must also be developed.