Optimizing and Parallelizing
Histogram of Oriented Gradients Computation

by Michael O'Farrell (mofarrel) and Bram Wasti (bwasti)
A HOG demo. Please allow your webcam to be accessed.



Depicted above is a rough visualization of how HOG works. Image gradients are computed and binned into 18 orientations; only 4 of them are displayed here, by intensity. At each point, only the strongest orientation is used to build the HOG descriptor.

Summary

We optimized and parallelized the computation of Histograms of Oriented Gradients (HOG features) using Intel Vector Intrinsics.

Background

HOG descriptors are commonly used in computer vision and image processing to detect objects. These descriptors use intensity gradients and edge orientations to describe an image, which makes HOG features particularly useful for detecting pedestrians, who have a distinct outline and orientation.


Figure 1: An image of a pedestrian and its corresponding HOG descriptor.

Calculating these descriptors takes up a large part of the computation time necessary to perform object recognition.

Key Data Structures

The main data structures used in the algorithm are histograms, which bin gradient orientation data for different regions of the image. We also use lookup tables to speed up computationally intensive steps.
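
As a rough sketch of the binning step (a minimal illustration; the cell size, bin layout, and names here are ours, not the actual implementation's):

    #include <cmath>

    // Illustrative sketch: one 18-bin orientation histogram per image cell.
    // The 18 bins follow the 18-orientation scheme described above; the
    // actual implementation's cell size and layout may differ.
    struct Cell {
        float bins[18] = {0.0f};
    };

    // Bin one pixel's gradient (dx, dy) into its cell's histogram,
    // weighted by gradient magnitude.
    void binGradient(Cell& cell, float dx, float dy) {
        float magnitude = std::sqrt(dx * dx + dy * dy);
        float angle = std::atan2(dy, dx);                 // in [-pi, pi]
        if (angle < 0.0f) angle += 6.2831853f;            // shift to [0, 2*pi)
        int bin = (int)(angle / 6.2831853f * 18.0f) % 18;
        cell.bins[bin] += magnitude;
    }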

Key Operations

The data structures are not complex; the main issues we tackle are cache coherence and mutual exclusion. We aim to optimize the algorithm as well as parallelize it, which requires intensive benchmarking and careful consideration of how the hardware works.

Algorithm Inputs and Outputs

The algorithm takes in an image and outputs a HOG descriptor.
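
In code terms, the top-level interface is a single image-in, descriptor-out function. A minimal skeleton, with hypothetical names and the pipeline stages as comments:

    #include <vector>

    // Illustrative interface sketch: grayscale image in, HOG descriptor out.
    // The output holds one block-normalized orientation histogram per block.
    std::vector<float> computeHOG(const unsigned char* image,
                                  int width, int height) {
        std::vector<float> descriptor;
        // 1. Compute per-pixel gradients (dx, dy).
        // 2. Bin gradient magnitudes into per-cell orientation histograms.
        // 3. Normalize the histograms over neighboring blocks.
        // (Each stage is discussed in the sections below.)
        return descriptor;
    }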

Parallelism

We parallelize the entire algorithm. There are data dependencies when storing results: most notably, the algorithm bins data into neighboring blocks' histograms. The algorithm is amenable to SIMD execution, so we implemented it using Intel Vector Intrinsics; with this in mind, we may also implement a GPU version. However, there are multiple points where the algorithm must be synchronized. For example, after determining the histograms of orientations, we must sync before normalizing based on intensity.
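
To give a flavor of the SIMD formulation, here is a minimal sketch (not our actual kernel) of the 4-wide pattern using SSE intrinsics, computing gradient magnitudes for four pixels at once:

    #include <xmmintrin.h>  // SSE

    // Compute gradient magnitudes for four adjacent pixels in one shot.
    void magnitudes4(const float* dx, const float* dy, float* mag) {
        __m128 vdx = _mm_loadu_ps(dx);                   // load 4 dx values
        __m128 vdy = _mm_loadu_ps(dy);                   // load 4 dy values
        __m128 sq  = _mm_add_ps(_mm_mul_ps(vdx, vdx),
                                _mm_mul_ps(vdy, vdy));   // dx^2 + dy^2
        _mm_storeu_ps(mag, _mm_sqrt_ps(sq));             // 4 square roots at once
    }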

Approach

We implemented HOG using Intel Vector Intrinsics (not ISPC). We created a small header library that allows flexible machine targeting as well as the ability to easily change precision. We have implemented single and double precision versions of the algorithm with SSE4, and a double precision version for AVX1.0. Unfortunately, we could not find a machine on which to test our AVX2.0 version (which would allow 8-way parallelism for single precision).
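
A minimal sketch of the kind of precision-flexible wrapper such a header library can provide. The names and the HOG_DOUBLE switch are hypothetical, but the pattern is the same: one set of vector type aliases and operations, selected at compile time (SSE shown; an AVX target would swap in 256-bit types and intrinsics):

    #include <emmintrin.h>  // SSE2 (pulls in single precision SSE as well)

    #ifdef HOG_DOUBLE
    typedef double  scalar_t;
    typedef __m128d vec_t;                    // 2-wide double vectors
    #define VEC_WIDTH 2
    #define vec_load(p)     _mm_loadu_pd(p)
    #define vec_add(a, b)   _mm_add_pd(a, b)
    #define vec_mul(a, b)   _mm_mul_pd(a, b)
    #define vec_store(p, v) _mm_storeu_pd(p, v)
    #else
    typedef float  scalar_t;
    typedef __m128 vec_t;                     // 4-wide float vectors
    #define VEC_WIDTH 4
    #define vec_load(p)     _mm_loadu_ps(p)
    #define vec_add(a, b)   _mm_add_ps(a, b)
    #define vec_mul(a, b)   _mm_mul_ps(a, b)
    #define vec_store(p, v) _mm_storeu_ps(p, v)
    #endif

Kernels written against vec_t and VEC_WIDTH then compile unchanged for either precision.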

Results

Speedups are relative to Pedro Felzenszwalb's sequential implementation. We achieved significant speedups by exploiting a lookup table and careful SIMD use, which allowed us to beat linear speedup in multiple cases. These are the current results using the lookup table described in the performance notes below:

|   SIMD Type   |   SIMD Width   |   Precision   |   Result   |
|---------------|----------------|---------------|------------|
|    SSE4.2     |       4        |    Single     |    4.56x   |
|    SSE4.2     |       2        |    Double     |    2.75x   |
|    AVX1.0     |       4        |    Double     |    3.31x*  |

* On a 2012 MacBook Air.

Without the lookup table, here are our results:

|   SIMD Type   |   SIMD Width   |   Precision   |   Result   |
|---------------|----------------|---------------|------------|
|    SSE4.2     |       4        |    Single     |    2.78x   |
|    SSE4.2     |       2        |    Double     |    1.58x   |
|    AVX1.0     |       4        |    Double     |    2.22x*  |

* On a 2012 MacBook Air.

The lookup table without parallelism gives the following speedup:

|   SIMD Type   |   Precision   |   Result   |
|---------------|---------------|------------|
|      N/A      |    Double     |    1.62x   |

Notes on Performance

Lookup Table

We currently implement a lookup table that avoids computing the orientation of the gradient directly. That computation is 9 iterations of 3 floating point operations plus a two-branch if statement, which at least partially flushes the pipeline on each iteration; we estimate ~110 cycles total. An L2 cache hit (which is what we aim for with the table) takes ~10 cycles, and an L3 hit takes ~40 cycles.
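
For reference, the replaced computation looks roughly like the snap-to-orientation loop in Felzenszwalb's reference code; the constants are the cosines and sines of the 9 base orientations:

    // Snap a gradient (dx, dy) to one of 18 contrast-sensitive orientations.
    // Each of the 9 iterations costs two multiplies and an add, plus the
    // two-branch comparison described above.
    int snapOrientation(double dx, double dy) {
        static const double uu[9] = { 1.0000,  0.9397,  0.7660,  0.5000,  0.1736,
                                     -0.1736, -0.5000, -0.7660, -0.9397};
        static const double vv[9] = { 0.0000,  0.3420,  0.6428,  0.8660,  0.9848,
                                      0.9848,  0.8660,  0.6428,  0.3420};
        double best_dot = 0.0;
        int best_o = 0;
        for (int o = 0; o < 9; o++) {
            double dot = uu[o] * dx + vv[o] * dy;
            if (dot > best_dot)       { best_dot =  dot; best_o = o;     }
            else if (-dot > best_dot) { best_dot = -dot; best_o = o + 9; }
        }
        return best_o;
    }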

We used a cycle counter to measure the real cost of the lookup table and found that without the table it took ~200 cycles to compute the orientation of 4 pixels, whereas with the table it took ~50 cycles (not including cache misses, which were infrequent).

AVX Single Precision

This could not be tested without AVX2.0: single precision requires integer operations on 8-wide vectors, which are unfortunately not included in the AVX1.0 instruction set.

Updates 5/7/2014

We shrank the lookup table used to compute gradient orientations, reducing it to a pair of 512x512 char arrays. This allows the table to fit in the L2 cache and produced a significant speedup on the MacBook, which had previously been hitting main memory with the larger lookup table.
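
A sketch of how the reduced table might be laid out and indexed; the quantization range and plane contents here are assumptions, not the exact implementation:

    // Two 512x512 char planes (the "char array[2]"), indexed by the
    // quantized gradient components. Assumes dx, dy land in [-256, 255].
    static unsigned char table[2][512][512];

    int lookupOrientation(int dx, int dy) {
        int ix = dx + 256;          // shift signed dx into [0, 511]
        int iy = dy + 256;          // shift signed dy into [0, 511]
        return table[0][iy][ix];    // plane 0: snapped orientation index
    }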

We also looked into why the speedup from AVX instructions was low (we had assumed that, being 4 wide, it would match the relative speedup of the single precision SSE4.2 version). AVX1.0 does not include integer operations, so we were forced to fall back on SSE intrinsics when working with integers. This is a real problem, so much so that Intel has a paper on it ("Avoiding AVX Transition Penalties," referenced below). Switching between AVX and SSE is expensive and can account for much of the slowdown we see on the MacBook Air. Another possible reason for a relative speedup below 4.0x is that we used the Clang compiler, which aggressively auto-vectorizes and therefore speeds up the sequential baseline as well.
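
One standard mitigation, described in the Intel note referenced below, is to clear the upper halves of the YMM registers with _mm256_zeroupper() before executing legacy SSE code. A minimal illustration:

    #include <immintrin.h>  // AVX

    // After 256-bit AVX work, issue VZEROUPPER before legacy 128-bit SSE
    // instructions so the CPU does not pay a state-transition penalty
    // preserving the upper YMM halves.
    void avxThenSSE(const double* a, const double* b, double* out,
                    const float* x, const float* y, float* res) {
        __m256d va = _mm256_loadu_pd(a);                 // 256-bit AVX work
        __m256d vb = _mm256_loadu_pd(b);
        _mm256_storeu_pd(out, _mm256_add_pd(va, vb));

        _mm256_zeroupper();                              // clear upper YMM state

        __m128 vx = _mm_loadu_ps(x);                     // legacy SSE work
        __m128 vy = _mm_loadu_ps(y);
        _mm_storeu_ps(res, _mm_add_ps(vx, vy));
    }

(When a whole translation unit is compiled with AVX enabled, the compiler typically emits VEX-encoded SSE and handles this itself; the penalty arises when AVX code calls into separately compiled SSE code.)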

Goals

Clearly there is room for improvement. We plan to increase the parallelism of the algorithm and will try to find more locations of heavy computation that could be improved with a static lookup table.

Another goal we have is to get a live object detection stream running from an iPhone. We are holding this as a low priority but have written a sample server to interface with Matlab and an iPhone client. We want the object recognition to run a bit faster before we can do anything live.

We are working on a GPU implementation of HOG. We hope to show that a GPU-accelerated version will offer even better results; a recent paper (fastHOG, referenced below) claims a 67x speedup on a GPU.

References

N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2005.

P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, "Object Detection with Discriminatively Trained Part Based Models," in IEEE Transactions on Pattern Analysis and Machine Intelligence, Sep. 2010.

J. Yan, Z. Lei, L. Wen, and S. Z. Li, "The Fastest Deformable Part Model for Object Detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2014.

T. Malisiewicz, A. Gupta, and A. A. Efros, "Ensemble of Exemplar-SVMs for Object Detection and Beyond," in IEEE International Conference on Computer Vision, 2011.

Intel, "Avoiding AVX Transition Penalties."

I. Misra, A. Shrivastava, and M. Hebert, "HOG and Spatial Convolution on SIMD Architecture."

V. A. Prisacariu and I. D. Reid, "fastHOG - a real-time GPU implementation of HOG," Department of Engineering Science, Oxford University, 2009.