This article explains how to optimize image processing using mathematical techniques and Intrinsics ARM NEON instructions.

Note.pngNote: This was an entry in the Nokia Imaging and Big UI Wiki Competition 2013Q4.


This article explains how to speed up computationally expensive operations when using the Nokia Imaging SDK.

By using math optimization techniques and SIMD instructions in C++, you can boost your application's performance by a factor of up to 16x. For the purposes of illustration, on a Snapdragon™ S4 in my HDR I - Implementing High Dynamic Range filters using Nokia Imaging SDK project, to process and blend three shots, I saved more than 3 seconds compared to the C# not optimized version. The Nokia Imaging SDK is highly optimized in C++.

Although using C++ remains the best way to increase performance, it isn't always feasible. For image processing, C# works fine in most cases.

The techniques presented in this article can be used in both C++ and C#. This article explains how to:

  • use Magic Numbers to speed up division
  • use Fast Square Root to speed up sqrt
  • use SIMD intrinsics ARM NEON instructions for image processing

Speeding up Division with Magic Numbers

In image processing, division is one of the most used mathematical operations. Unfortunately it is also one of most expensive operations, and thus is often the cause of performance failure. This is why it's important to optimize division operations.

To do this, we'll take use Magic Numbers to take advantage of two facts:

  • CPUs perform bit-shift operations much faster than other operations
  • Multiplication is faster than division

You can perform division by multiplying by certain magic numbers and right-shifting by certain const bits.

To some extent, you sacrifice accuracy with this approach. Division performed in this way will not have the same accuracy as divisions performed with floats, as is more commonly the case. This could be a problem in scientific calculations where an accuracy of 0.0000000001 can, in some cases, make a difference. For image processing, even if the result of our division is 34 instead of 35.02378 ( that would be rounded to 35 anyway), this loss of accuracy is not a problem at all compared to the benefits.

Division by 3

Division by 3 is a common operation in image processing. For example, you use it to calculate the RGB average ((R+G+B)/3).

To divide by 3 we will use the following values:

  • magic number - 0xAAAB
  • right shift - 17

Let's say we need to compute the average of the RGB values (137,78,246).

Using normal division, the result will be ((137 + 78 + 246 ) / 3) = ( 461 / 3 ) = 153.

Using magic numbers, the result will be (( 461 * 0xAAAB ) >> 17) = ((461*43691)>>17) = ( 20141551 >> 17 ) = 153.

For ease of use, the magic number operation can be put inside a MACRO.

Of course, for a few operations, you will not be able to see differences compared to standard division. However, once you see the effect multiplied by the millions of pixels there are in a image, you will start to see a definite improvement.

0xAAAB is not the only magic number that can be used to divide by three. It is the smallest one that results in very good accuracy. However, you should use a smaller number in some circumstances; for example, when using SIMD ARM NEON instructions, to be sure to stay inside the register's size.

In this case you can use the following values:

  • magic number - 0x55
  • right shift - 8

These numbers are easier to use, but result in lower accuracy.

Using normal division, the result will be 30 / 3 = 10.

Using magic numbers, the result will be ((30*0x55)>8) = ((30*85)>8) = 9.

In some circumstances such a gap is unacceptable, but in my opinion, in a pixel context, there is no difference between 9 and 10. It's up to you to decide which magic number is more suitable; the bigger, but more accurate 0xAAAB, or the smaller, but not so accurate 0x55.

Division by multiples of 2

To divide by 2 you don't need any magic numbers. You simply right-shift your value by 1.

Using normal division: (255/2) = 127

Right-shifting values: ( 255 >> 1 ) = 127

Each time you shift by 1 you perform a division by 2. So if, for example, you want to divide by 4, perform a shift by 2:

Using normal division: (255 / 4) = 63

Right-shifting values: (255 >>2 ) = 63

And so on.

These are the operations I found most useful, but there are many others explained on the web.

Speeding up sqrt with Fast Square Root

Another operation always present in computer graphics is the square root ( sqrt in C++ ). As you probably know, sqrt is a very expensive operation. Let's see how to improve it.

Assuming that the float is in the [IEEE 754 single precision floating point format, we will treat it as an int, leveraging the bit organization.

To understand the math behind this solution, you can read Fast Inversion Square Root.

The first approach is based on the method of reciprocal square root, or better, sqrt(x) = x−½.

The algorithm accepts a 32-bit floating point number as the input and stores a halved value for later use. Then, treating the bits representing the floating point number as a 32-bit integer, a logical right-shift of one bit is performed, and the result subtracted from the magic number 0x5f3759df. This is the first approximation of the inverse square root of the input. Treating the bits again as floating point, it runs one iteration of Newton's method to return a more precise approximation. This computes an approximation of the inverse square root of a floating point number approximately four times faster than floating point division.

float sqrt( float number )
long i;
float x2, y;
const float threehalfs = 1.5F;
x2 = number * 0.5F;
y = number;
i = * ( long * ) &y;
i = 0x5f3759df - ( i >> 1 );
y = * ( float * ) &i;
y = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration
// y = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration, this can be removed
return y;

You can add as many Newton's method iterations you want to increase the accuracy, but of course the accuracy will lead to lower performance.

Another method is not based on reciprocal and magic numbers, but the approach is very similar. This approach is based on the implementation of the following formula:

((((val_int / 2^m) - b) / 2) + b) * 2^m = ((val_int - 2^m) / 2) + ((b + 1) / 2) * 2^m)

where b is the exponent bias and m the number of mantissa bits. Here the code that implements it:

float sqrt(float z)
int val_int = *(int*)&z; // Same bits, but as an int
val_int -= 1 << 23; // Subtract 2^m
val_int >>= 1; // Divide by 2
val_int += 1 << 29; // Add ((b + 1) / 2) * 2^m
return *(float*)&val_int; // Interpret again as float

Both methods improve performance. The first method is more accurate but less performant. The second method is less accurate and offers better performance.

Which one you choose depends on how many operations you need to perform. The difference between both methods is not so great, and both bring dramatic performance improvements. In my opinion, if your algorithm needs to perform millions of square root operations, it is reasonable to sacrifice accuracy and use the second method.

In my opinion these solutions are simply awesome! The dark math world sometimes produces miracles. The small loss in accuracy is worth all benefits in performance gains. In the end, it's up to you whether to use these mathematical techniques in your application.

Using ARM Advanced SIMD (NEON)

ARM NEON Instrinsics is a Single Instruction, Multiple Data (SIMD) instruction set. It speeds processing by performing a single set of instructions in parallel over multiple devices. It is commonly used for audio, video and graphics applications.

The Advanced SIMD NEON extension is a combined 64- and 128-bit SIMD instruction set that provides standardized acceleration for media and signal processing applications like to execute MP3 audio decoding, GSM adaptive multi-rate (AMR) speech codec. It features a comprehensive instruction set, separate register files and independent execution hardware. NEON supports 8-, 16-, 32- and 64-bit integer and single-precision (32-bit) floating-point data and SIMD operations for handling audio and video processing as well as graphics and gaming processing. In NEON, the SIMD supports up to 16 operations at the same time.

We will explain how to use the most common ARM NEON instructions for image processing, such as for interleaving, multiply-accumulator and bit-shifting. We will also give some examples that demonstrate how to combine these techniques to solve common problems.

Using ARM NEON memcpy

memcpy is one of my favourite, and, in my opinion, most useful functions can be optimized in ARM NEON. Think of image processing and moving data from one buffer to another more than once. The standard resolution starts from 5 milion pixels that in the ARRGB (or BGRA) means 20 million bytes of raw data to move each time! Now that you understand the amount of data we are talking about, think about how many pixels are present in high resolution images. This gives you an idea of the millions of byte we are moving.

Using 128-bit ARM NEON registers, which can process 16 bytes at once, we can speed up this process up to 16x, depending on whether the size of the array we want to move is 16-aligned or, better yet, an exact multiple of 16.

Let's start working with ARM NEON by importing the library with the following code:

#if defined(_M_ARM)
#include <arm_neon.h>

The following function works with ARM NEON to copy data in as many 16-byte vectors as possible, then copies the remainder with the standard function. As said before, the best performance results when the data length is an exact multiple of 16, as we can leverage ARM NEON for the entire length.

void ARM::memcpy(void* Dest, void* Source, int length)
// Divide by 16 to find the limits of the array that can be managed by ARM NEON
int arm_length = length / 16;
uint8 * src = (uint8 *) Source;
uint8 * dest = (uint8 *) Dest;
// Declare a vector capable of 16 unsigned bytes
uint8x16_t buffer;
for( int i = 0 ; i < arm_length; i++ )
// Load a single vector from memory
buffer= vld1q_u8(src);
// Store a single vector into memory
vst1q_u8(dest, buffer);
// We moved 16 bytes so do not increment just one byte as usual but 16
src +=16;
dest +=16;
// Prefetch the next data
// Copy the rest of data with standard function
int gap = length - (arm_length*16);
if(gap > 0 )
memcpy((byte*) Dest+arm_length, (byte*)Source+arm_length, gap);
  • uint8x16_t vld1q_u8(__transfersize(16) uint8_t const * ptr); Load a single vector from memory
  • void vst1q_u8(__transfersize(16) uint8_t * ptr, uint8x16_t val); Store a single vector into memory

Converting to grayscale

This operation can be done easily using the Imaging SDK. We are using it as a starting point for explaining how to create the code for your own filter. We will use the approach of fast division we explained before, working with multiplication and shifting.

For a better grayscale conversion we should avoid performing the average of three color components, (( R+ G + B ) / 3), and instead use an operation like this:

  • GrayLevel = (R*0.30 + g*0.59 + b*0.10);

Even in C++ this simple operation could be straining, so we proceed with the fast division method. In our case the approximate values are given by:

  • 0.30 by 77/255
  • 0.59 by 151/255
  • 0.10 by 20/255

To perform this operation we introduce the Multiply–accumulate operation, a common step that computes the product of two numbers. It then adds that product to an accumulator; a register in which intermediate arithmetic and logic results are stored.

Another very important feature introduced by the SIMD ARM NEON instruction set for image processing is interleaving. Assuming we are working with images using the ARGB color model, this means that looking at the byte in sequence, the first byte is the Alpha channel, the second Red, the third Green, the fourth Blue, the fifth again the Alpha channel, but of the second pixel, and so on.

Interleaving operations vld4_u8 translate each channel on the same line in a single register, while de-interleaving vst4_u8 does the inverse restoring operation.

ARM NEON Interleaving

This step is useful for perform operation like multiply each channel by a constant, 8 bytes at once. You use it as follows:

  • uint8x8_t vdup_n_u8(uint8_t value); Loads all lanes of vector to the same literal value
  • uint8x8x4_t vld4_u8(__transfersize(32) uint8_t const * ptr); Loads N-element structure from memory where in that case
struct uint8x8x4_t
uint8x8x4_t val[4];

Imagine you have an RGB buffer like this:

RGB buffer

Interleaving data using vld4_u8 we get a vector structure like this:

ARM NEON interleaving for multiplication ( Red channel )
ARM NEON interleaving for multiplication ( Green channel )
ARM NEON interleaving for multiplication ( Blue channel )

Performing the multiply-accumulate operation, and finally, right-shifting by 8 bits, we implement fast division.

ARM NEON interleaving for multiply-accumulation and right-shifting

The last step is to copy the gray value to each of three former R,G,B channels and then perform de-interleaving.

void Utilities::ConvertToGrayNeon( unsigned char* inputBuffer, unsigned char* outputBuffer, int length)
uint8 * src = (uint8 *) inputBuffer;
uint8 * dest = (uint8 *) outputBuffer;
int n = length;
// Copy the const value of 77 for each of eight entries of vector dedicated to red channel
uint8x8_t rfac = vdup_n_u8 (77);
// Copy the const value of 151 for each of eight entries of vector dedicated to green channel
uint8x8_t gfac = vdup_n_u8 (151);
// Copy the const value of 28 for each of eight entries of vector dedicated to blue channel
uint8x8_t bfac = vdup_n_u8 (28);
// Calculate the new array length based on the fact we are processing eight bytes at once, so dividing by 8
// Assign the default value to Alpha channel
uint8x8x4_t interleaved;
interleaved.val[0] = vdup_n_u8 (0xFF); //Alpha value
for (int i=0; i < n; i++)
uint16x8_t temp;
// Copy and interleave
uint8x8x4_t rgb = vld4_u8 (src);
// Multiply the red value by the const value 77
temp = vmull_u8 (rgb.val[1], rfac);
// Multiply the green values by const value of 151 and add the result to previous
temp = vmlal_u8 (temp,rgb.val[2], gfac);
// Multiply the blue values by const value of 20 and add the result to previous
temp = vmlal_u8 (temp,rgb.val[3], bfac);
// Right-shift all values by 8, performing a division by 255
interleaved.val[1] = vshrn_n_u16 (temp, 8);
// Since this is a gray scale value it can be copied to all remaining two (green and blue)
interleaved.val[2] = interleaved.val[0];
interleaved.val[3] = interleaved.val[0];
// De-interleave the result
vst4_u8 (dest, interleaved);
// Move the pointer forward
src += 8*4;
dest += 8*4;
  • uint16x8_t vmull_u8(uint8x8_t a, uint8x8_t b); vector long multiply
  • uint16x8_t vmlal_u8(uint16x8_t a, uint8x8_t b, uint8x8_t c); vector multiply accumulate long: vmlal_<type>. Vr[i] := Va[i] + Vb[i] * Vc[i]
  • uint8x8_t vshrn_n_u16(uint16x8_t a, __constrange(1,8) int b); vector narrowing shift right by constant
  • void vst4_u8(__transfersize(32) uint8_t * ptr, uint8x8x4_t val); store N-element structure to memory

Blending images

The HDR::BlendArmNeon method for my project HDR I - Implementing High Dynamic Range filters using Nokia Imaging SDK is built on the two ARM functions we just discussed: using memcpy and converting to grayscale.

It takes as input the raw camera buffer of unsigned char*. It assumes that it is working with anNV12 color model.

The code that converts to grayscale takes as input a Buffer object. Ideally you would work with IBuffer but this doesn't change the nature of the function, you should just use a different conversion function from AsBuffer to FromIBuffer, both returning a unsigned char* that are referenced in the Nokia Imaging SDK in native code article.

This portion of code that handles the Luminance information performs the blending as ((y1+y2+y3)/3). Division by 3 is performed using the magic number 0x55.

All ARM instructions are already described but combined a bit differently. The code should be easy to understand. The main difference is that we are working with register that is 16-bit wide, rather than 8-bit as in one of previous examples, so that we can store numbers greater than 255.

byte* HDR::BlendArmNeon(unsigned char* image1,
unsigned char* image2,
unsigned char* image3,
DSP::CartesianPoint displacement1,
DSP::CartesianPoint displacement2)
int _width = (int) CaptureResolution.Width;
int _height = (int) CaptureResolution.Height;
int image_size = _width * _height;
int displacement1_X=0, displacement1_Y=0, displacement2_X=0, displacement2_Y=0;
double y1=0, y2=0, y3=0;
double totalLumaWeights = 0;
int r=0, g=0, b=0;
cancellation_token_source cts;
auto token = cts.get_token();
unsigned char *result = new unsigned char[_width*_height];
concurrency::parallel_for(0, _height, [this, image1, image2, image3, &result, _height, _width, &displacement1, &displacement2](int y)
int line = y*_width;
int size = _width * _height;
int m_displacement_1 = (int) (displacement1.X() + (displacement1.Y() * _width));
int m_displacement_2 = (int) (displacement2.X() + (displacement2.Y() * _width));
if( (m_displacement_1 < 0) || (line + _width + m_displacement_1 >= size) ) m_displacement_1 = 0;
if( (m_displacement_2 < 0) || (line + _width + m_displacement_2 >= size) ) m_displacement_2 = 0;
int arm_neon_length = _width / 8;
//uint8x8_t divideby3 = vdup_n_u8 (0xAAAb);
uint16x8_t divideby3 = vdupq_n_u16 (0x55);
uint8 * m_image1 = (uint8 *) image1 + line;
uint8 * m_image2 = (uint8 *) image2 + line + m_displacement_1;
uint8 * m_image3 = (uint8 *) image3 + line + m_displacement_2;
uint8 * dest = (uint8 *) result + line;
uint8x8_t y_image1;
uint8x8_t y_image2;
uint8x8_t y_image3;
uint16x8_t y_temp;
uint8x8_t result_temp;
for (int x = 0; x < arm_neon_length ; x++)
y_image1 = vld1_u8 (m_image1);
y_image2 = vld1_u8 (m_image2);
y_image3 = vld1_u8 (m_image3);
y_temp = vaddl_u8(y_image1, y_image2);
y_temp = vaddq_u16(y_temp, vmovl_u8(y_image3));
y_temp = vmulq_u16(y_temp, divideby3);
result_temp = vshrn_n_u16 (y_temp, 8);
vst1_u8(dest, result_temp);
m_image1 += 8;
m_image2 += 8;
m_image3 += 8;
dest += 8;
return result;

Parallel Processing

The native C++ libraries were extended to provide rich support for parallel programming. There are different layers at which users can interact with the parallel runtime. The highest of these layers is the Parallel Patterns Library. It can be accessed using the header file ppl.hppl.h contains different constructs that you can use to parallelize your programs without extensive knowledge of scheduling decisions, underlying threads, the surrounding environment, etc. One of the constructs in ppl.h is the parallel_for construct, which is used to quickly parallelize a for-loop.

parallel_for takes the body of a for-loop as captured in a function, divides the number of iterations amongst the available computing resources (processors), and executes that work in parallel. Here's a simple example of serial-to-parallel transformation:

    for (int i = 0; i < n; i++)


    Concurrency::parallel_for(0, n,
[] (int i)

Converting all for-loops in a program into parallel for-loops might have unintended consequences. You can only do this if all iterations of the loop run independently. Suppose, for example, you have the following code:

    for (int i = 0; i < n; i++)
// Array "a" contains both an original sequence and the end result
a[i] += a[i-1];

In order to compute the kth term in a resulting sequence, the k-1th term must be known. If you were to execute iterations in parallel, it's possible that the k-1th term may not be populated by the time the kth term is processed, yielding an incorrect result.

    Concurrency::parallel_for(0, n,
[] (int i)
a[i] += a[i-1]; // incorrect!

Consider, also, that creating a for-loop thread has a performance cost. Therefore it is makes sense to parallelize for-loops that process a large amount of data, but may not make sense to parallelize small for-loops.

In our example, we parallelize each row of the image. On a dual core device, this means that we are likely processing two rows in parallel. However the final decision on parallelization is left to the scheduler.

In my application, running on a Lumia 820 so working with 8MPx images, using parallelization resulted in a net improvement of 1.5 seconds, compared using a standard for-loop.

Blending ARGB images

void HDR::BlendARGBArmNeon(Buffer^ ne_bitmapBuffer,
Buffer^ se_bitmapBuffer,
Buffer^ oe_bitmapBuffer,
DSP::CartesianPoint displacement1,
DSP::CartesianPoint displacement2)
int _width = (int) CaptureResolution.Width;
int _height = (int) CaptureResolution.Height;
int image_size = _width * _height;
int scanline = _width * PIXELSIZEINBYTES;
int displacement1_X=0, displacement1_Y=0, displacement2_X=0, displacement2_Y=0;
double y1=0, y2=0, y3=0;
double totalLumaWeights = 0;
int r=0, g=0, b=0;
cancellation_token_source cts;
auto token = cts.get_token();
unsigned char* image1 = AsArray(ne_bitmapBuffer);
unsigned char* image2 = AsArray(se_bitmapBuffer);
unsigned char* image3 = AsArray(oe_bitmapBuffer);
unsigned char* result = new unsigned char[scanline*_height];
concurrency::parallel_for(0, _height, [this, image1, image2, image3, &result, _height,_width, scanline, &displacement1, &displacement2](int y)
int line = y*scanline;
int size = scanline * _height;
int m_displacement_1 = (int) (displacement1.X()*4 + (displacement1.Y() * scanline));
int m_displacement_2 = (int) (displacement2.X()*4 + (displacement2.Y() * scanline));
if( (m_displacement_1 < 0) || (line + scanline + m_displacement_1 >= size) ) m_displacement_1 = 0;
if( (m_displacement_2 < 0) || (line + scanline + m_displacement_2 >= size) ) m_displacement_2 = 0;
int arm_neon_length = _width / 8;
//uint8x8_t divideby3 = vdup_n_u8 (0xAAAb);
uint16x8_t divideby3 = vdupq_n_u16 (0x55);
uint8 * m_image1 = (uint8 *) image1 + line;
uint8 * m_image2 = (uint8 *) image2 + line + m_displacement_1;
uint8 * m_image3 = (uint8 *) image3 + line + m_displacement_2;
uint8 * dest = (uint8 *) result + line;
uint8x8x4_t interleaved;
interleaved.val[0] = vdup_n_u8 (0xFF); //Alpha value
for (int x = 0; x < arm_neon_length ; x++)
uint16x8_t temp_r;
uint16x8_t temp_g;
uint16x8_t temp_b;
uint8x8x4_t rgb_image1 = vld4_u8 (m_image1);
uint8x8x4_t rgb_image2 = vld4_u8 (m_image2);
uint8x8x4_t rgb_image3 = vld4_u8 (m_image3);
temp_r = vaddl_u8(rgb_image1.val[1], rgb_image2.val[1]);
temp_g = vaddl_u8(rgb_image1.val[2], rgb_image2.val[2]);
temp_b = vaddl_u8(rgb_image1.val[3], rgb_image2.val[3]);
temp_r = vaddw_u8(temp_r, rgb_image3.val[1]);
temp_g = vaddw_u8(temp_g, rgb_image3.val[2]);
temp_b = vaddw_u8(temp_b, rgb_image3.val[3]);
temp_r = vmulq_u16(temp_r, divideby3);
temp_g = vmulq_u16(temp_g, divideby3);
temp_b = vmulq_u16(temp_b, divideby3);
interleaved.val[1] = vshrn_n_u16 (temp_r, 8);
interleaved.val[2] = vshrn_n_u16 (temp_g, 8);
interleaved.val[3] = vshrn_n_u16 (temp_b, 8);
vst4_u8 (dest, interleaved);
m_image1 += 8*4;
m_image2 += 8*4;
m_image3 += 8*4;
dest += 8*4;
class CartesianPoint
CartesianPoint(double X, double Y)
CartesianPoint(CartesianPoint& other ) : _x(other._x), _y(other._y) {}
virtual ~CartesianPoint(void){}
CartesianPoint& CartesianPoint::operator=(const CartesianPoint& p1)
_x = p1._x ;
_y = p1._y ;
return *this ;
double X() { return _x; }
double Y() { return _y; }
void setX(double value){_x=value;}
void setY(double value){_y=value;}
double _x;
double _y;


This article presented three techniques for speeding up image processing in C++/CX and C#:

  • Using Magic Numbers to speed up division
  • Using Fast Square Root to speed up sqrt
  • Using SIMD intrinsics ARM NEON instructions for image processing


Originally contributed on 13 December 2013‎ by galazzo