Depicted above is a rough visualization of how HOG works. Image gradients are computed and then binned into 18 orientations; only 4 are displayed here, shown by their intensity. Only the strongest orientation at each point contributes to the HOG descriptor.
HOG descriptors are commonly used in computer vision and image processing to detect objects. They describe an image using intensity gradients and edge orientations, which makes HOG features particularly useful for detecting pedestrians, who have a distinct outline and orientation.
Figure 1: An image of a pedestrian and its corresponding HOG descriptor.
The main data structures used in the algorithm are histograms, which bin gradient-orientation data for different regions of the image. We also use lookup tables to speed up computationally intensive steps.

Key Operations
The data structures are not complex; the main issues we tackle in our code are cache coherency and mutual exclusion. We aim to optimize the algorithm as well as parallelize it, which requires intensive benchmarking and careful consideration of how the hardware works.

Algorithm's inputs and outputs
The algorithm takes in an image and outputs a HOG descriptor.

Parallelism
We are parallelizing the entire algorithm. There are data dependencies when storing data: most notably, the algorithm bins data into neighboring blocks' histograms. The algorithm is amenable to SIMD execution, and we implemented it using Intel vector intrinsics. With this in mind, we may also implement a GPU version of the algorithm. However, there are multiple points where the algorithm must synchronize; for example, after determining the histograms of orientations, we must sync before normalizing based on intensity.
Without the lookup table, here are our results:
| SIMD Type | SIMD Width | Precision | Speedup |
|-----------|------------|-----------|---------|
| SSE4.2    | 4          | Single    | 4.56x   |
| SSE4.2    | 2          | Double    | 2.75x   |
| AVX1.0    | 4          | Double    | 3.31x   |

(On a 2012 MacBook Air)
The lookup table without parallelism gives the following speedup:

| SIMD Type | Precision | Speedup |
|-----------|-----------|---------|
| N/A       | Double    | 1.62x   |

Combining the lookup table with SIMD gives:

| SIMD Type | SIMD Width | Precision | Speedup |
|-----------|------------|-----------|---------|
| SSE4.2    | 4          | Single    | 2.78x   |
| SSE4.2    | 2          | Double    | 1.58x   |
| AVX1.0    | 4          | Double    | 2.22x   |

(On a 2012 MacBook Air)
Currently implemented is a lookup table that avoids computing the orientation of the gradient directly. That computation is 9 iterations of 3 floating-point operations plus a two-branch if statement, which at least partially flushes the pipeline on each iteration; we estimate around ~110 cycles total. An L2 cache hit (which is what we aim for with the table) takes ~10 cycles, and an L3 hit takes ~40.
We used a cycle counter to measure the real effect of the lookup table: without the table we saw ~200 cycles to compute the orientation of 4 pixels, whereas with the table we saw ~50 cycles (not including cache misses, which were infrequent).

AVX Single precision
This can't be tested without AVX2.0: single precision here requires integer operations on 8-wide vectors, which the AVX1.0 instruction set unfortunately does not include.
We brought the lookup table used to compute gradient orientations down in size by reducing it to a char array of size 512x512. At 256 KB the table fits in the L2 cache, which caused a significant speedup on the MacBook; with the larger lookup table it had been hitting main memory.
We also looked into why the speedup from AVX instructions was quite low (we assumed 4-wide AVX doubles would match the relative speedup of the 4-wide single-precision SSE4.2 version). AVX1.0 does not include integer operations, so we were forced to use SSE intrinsics wherever we needed integers. This is a real problem, so much so that Intel has a paper on it [4]. The transition between AVX and SSE code is expensive and can account for much of the slowdown we see on the MacBook Air. Another possible reason for a relative speedup below 4.0x is that we used the Clang compiler, which auto-vectorizes the scalar baseline as much as possible.
Another goal is to get a live object-detection stream running from an iPhone. We are holding this as a low priority, but we have written a sample server to interface with MATLAB and an iPhone client. We want the object recognition to run a bit faster before we do anything live.
We are working on a GPU implementation of HOG. We hope to show that a GPU-accelerated version offers even better results; a recent paper claims a 67x speedup on a GPU [6].
[1] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, "Object Detection with Discriminatively Trained Part-Based Models," IEEE Transactions on Pattern Analysis and Machine Intelligence, Sep. 2010.
[2] J. Yan, Z. Lei, L. Wen, and S. Z. Li, "The Fastest Deformable Part Model for Object Detection," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[3] T. Malisiewicz, A. Gupta, and A. A. Efros, "Ensemble of Exemplar-SVMs for Object Detection and Beyond," IEEE International Conference on Computer Vision (ICCV), 2011.
[4] Intel, "Avoiding AVX-SSE Transition Penalties."
[5] I. Misra, A. Shrivastava, and M. Hebert, "HOG and Spatial Convolution on SIMD Architecture."
[6] V. A. Prisacariu and I. D. Reid, "fastHOG - a real-time GPU implementation of HOG," Department of Engineering Science, Oxford University, 2009.