Performance Optimization on Modern Processor Architecture through Vectorization

by Yong Fu, Software Engineer  

December 2016


Hardware provides performance through three sources of parallelism at different levels: thread parallelism, instruction-level parallelism, and data parallelism. Thread parallelism is achieved by splitting the workload equally, if possible, among threads. Instruction-level parallelism is achieved when the processor executes independent instructions from the same stream simultaneously, which compilers encourage by emitting tight machine code. Data parallelism is achieved by implementing operations to use SIMD (Single Instruction Multiple Data) instructions effectively. Modern processors have evolved on all sources of parallelism, featuring massively superscalar pipelines, out-of-order execution, simultaneous multithreading (SMT), and advanced SIMD capabilities, all integrated on multiple cores per processor. In this article, we focus on optimizing performance through SIMD.

SIMD processing exploits data-level parallelism. Data-level parallelism means that the operations required to transform a set of vector elements can be performed on all elements of the vector at the same time. That is, a single instruction can be applied to multiple data elements in parallel. All mainstream processors today, such as Intel Sandy/Ivy Bridge and Haswell/Broadwell, AMD Bulldozer, ARM, and PowerPC, implement SIMD features, although the implementation details may vary. Since most SIMD instructions on specific hardware involve operations on vector operands, SIMD operation is sometimes also referred to as vectorization. Although there is a subtle difference between these two terms, in this article we use them interchangeably.



Intel Xeon Phi Architecture & Micro-Architecture for each Tile 

As an example of a processor with SIMD support, the above figure is an architecture diagram of Intel's Xeon Phi. In each core of Xeon Phi's 36 twin-core tiles there are two vector processing units (VPUs), which provide the capability of two vector operations per cycle. To fully exploit the VPUs, Knights Landing provides the AVX512 instruction set for operations on 512-bit vectors.


Vectorization or SIMD is a class of parallel computing in Flynn's taxonomy, which refers to computers with multiple processing elements that perform the same operation on multiple data points simultaneously. Such machines exploit data parallelism rather than concurrency: there are simultaneous computations, but only a single process (instruction stream) at a given moment.

Single Instruction, Multiple Data (SIMD)
Scalar vs. SIMD Operation

As shown in the figures above, unlike normal operations, whose operands are scalars, a SIMD operation takes vector operands: each operand contains a set of data elements packed into a one-dimensional array. The elements can be integer, floating-point, or other data types.

Automatic vectorization by compiler

Modern compilers like GCC can automatically identify opportunities and vectorize the code without intervention by the programmer. Automatic vectorization is straightforward and easy to perform with only the knowledge of the related compiler options. However, automatic vectorization can only happen if memory is allocated contiguously and there are no data dependencies that prevent vectorization. In addition, compilers don't always have enough static information to know what they can vectorize. In some cases, they cannot achieve the same performance speedup as code optimized by a human expert.
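As an illustration (a minimal sketch; the function name saxpy and the flags in the comment are just one common choice), the following loop meets the conditions above, contiguous memory and no loop-carried dependency, so GCC can typically vectorize it automatically:

```c
#include <stddef.h>

/* A loop the compiler can typically auto-vectorize: contiguous arrays,
   no loop-carried dependencies. Build with e.g.
   `gcc -O3 -ftree-vectorize -fopt-info-vec` to have GCC report which
   loops it vectorized. `restrict` promises x and y don't overlap,
   which removes a dependency the compiler would otherwise assume. */
void saxpy(float a, const float *restrict x, float *restrict y, size_t n) {
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

With `-fopt-info-vec`, GCC prints a note such as "loop vectorized" for this function; removing `restrict` or introducing a dependency like `y[i] = y[i-1] + x[i]` usually suppresses the optimization.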

Explicit vectorization

Besides automatic vectorization by the compiler, the programmer may want to explicitly specify parallelism in the code. Vectorization may be explicitly exploited through the use of machine instructions (assembly language) or, to lower the effort for programmers, intrinsics, which are functions that directly wrap low-level machine instructions and operations on vector data types. There are also alternatives which provide higher abstractions than intrinsics. For example, the C++ library libsimdpp is a portable header-only library wrapping intrinsics that presents a single interface over several instruction sets. Boost.SIMD also offers a portable library for SIMD operations under the framework of Boost (currently it is a Boost candidate). In addition, OpenMP and the Open Computing Language (OpenCL) exploit both coarse-grained (thread) and fine-grained (vectorization) parallelism in code. They are suitable for applications exploiting large quantities of data parallelism on many-core architectures.
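As a small example of the explicit-but-portable end of this spectrum, the OpenMP 4.0 `simd` directive marks a loop as safe to vectorize. The sketch below is illustrative (the function name is made up); when OpenMP is not enabled at compile time, the pragma is simply ignored and the loop runs as ordinary scalar code:

```c
#include <stddef.h>

/* The OpenMP `simd` directive asserts to the compiler that the loop
   iterations are independent and may be executed with SIMD lanes.
   Compile with `-fopenmp` (GCC/Clang) to activate it; without that
   flag the pragma is ignored and the code is still correct. */
void vec_add(const float *a, const float *b, float *c, size_t n) {
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

This keeps the vectorization hint in portable source code rather than tying it to a particular instruction set.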

Generally, there are two places in code where vectorization could weigh in: loops and basic blocks.


Vectorization Principles

For loop-based vectorization, there are some principles which can be applied strategically to vectorize the loop computation.

  for (int t = 0; t < l; ++t) {
    for (int i = 0; i < n; ++i) {
      for (int j = 0; j < n; ++j) {
        a[t+1][i][j] = a[t][i+1][j+1] + a[t][i-1][j-1] -
                       a[t][i-1][j+1] - a[t][i+1][j-1];
      }
    }
  }
Loop Dependencies
      • Stripmining
        • Stripmining splits a single loop into a nested pair: an outer loop that steps over strips and an inner loop that walks the elements of each strip, where the strip length is chosen to match the vector size. Consider the loops below.
  for (int i = 0; i < n; ++i) {
    x[i] = y[i];
  }
Original Loop
  for (int i = 0; i < n; i += 4) {
    for (int j = 0; (j < 4) && (i + j < n); ++j) {
      x[i + j] = y[i + j];
    }
  }
Loop After Stripmining
        • Now the inner loop can be vectorized with vector operands of 16 bytes (strictly speaking, this example is only conceptual and needs further transformation to remove the condition j < 4 from the inner loop guard before vectorization). Note that we can also use the technique of loop fission (described below) to split the original loop into two loops, one transformed by stripmining and the other not.
      • Unrolling
        • After a loop has been stripmined, it can be unrolled to increase the degree of instruction level parallelism and to allow basic block level vectorization. The unroll transformation merges the loop bodies of several iterations into a single body, removing some of the overhead of iterating through the loop. The number of merged iterations is called the unroll factor.
  for (int i = 0; i < n; i += 4) {
    if (i + 3 < n) {
      x[i+0] = y[i+0];
      x[i+1] = y[i+1];
      x[i+2] = y[i+2];
      x[i+3] = y[i+3];
    } else {
      for (int j = 0; i + j < n; ++j) {
        x[i+j] = y[i+j];
      }
    }
  }
Loop After Unrolling
        • Note that the example introduces an if statement into the loop body when unrolling, which often causes a performance penalty. A more careful implementation could avoid this overhead.
      • Fission
        • Fission refers to splitting a single loop into two or more loops. This transformation may enable vectorization by breaking inter-loop dependencies.
  for (int i = 0; i < (n - n % 4); i += 4) {
    for (int j = 0; j < 4; ++j) {
      x[i + j] = y[i + j];
    }
  }
  for (int k = (n - n % 4); k < n; ++k) {
    x[k] = y[k];
  }
Loop After Fission
        • The example also uses loop stripmining. Compared to loop unrolling, this transformation adds more loops, so it may add extra loop overhead.

Only in rare cases are one or two transformations used alone for vectorization. Generally, combinations of loop transformations achieve better performance.
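To make this concrete, here is a sketch (the function name copy4 is illustrative) that combines the transformations above on the same copy loop: stripmining into strips of 4, unrolling the strip body, and fission of the remainder into its own scalar loop, so the main loop carries no bounds check inside its body:

```c
#include <stddef.h>

/* Stripmine the copy loop into chunks of 4, fully unroll the chunk
   body, and split off the remainder as a separate scalar loop
   (fission). The main loop is now a clean candidate for 16-byte
   vector operations. */
void copy4(const float *y, float *x, size_t n) {
    size_t main_n = n - n % 4;
    for (size_t i = 0; i < main_n; i += 4) {  /* stripmined + unrolled */
        x[i]     = y[i];
        x[i + 1] = y[i + 1];
        x[i + 2] = y[i + 2];
        x[i + 3] = y[i + 3];
    }
    for (size_t i = main_n; i < n; ++i)       /* remainder loop */
        x[i] = y[i];
}
```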

Vectorization with Intel AVX

In 2008, Intel introduced a new set of high-performance vectorization instructions called Advanced Vector Extensions (AVX). These instructions extend previous SIMD offerings, MMX instructions and Intel Streaming SIMD Extensions (SSE), by operating on larger chunks of data faster. Recently, Intel has released additional vectorization instructions in AVX2 and AVX512.

Intel SIMD Evolution

Since the AVX/AVX2/AVX512 instruction sets include different SIMD functions, the first step in vectorizing with AVX is to check availability and support in the development and production environment. For example, on Linux the following command directly shows which AVX instruction sets are supported on your platform.

      cat /proc/cpuinfo

For example, on an Intel Core i7 (Sandy Bridge), the above command generates output like:

      flags: ... sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand ...

The only relevant flag, avx, indicates that only the base AVX instruction set (not AVX2 or AVX512) is supported on this platform.
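The same check can also be done at run time inside a program. A minimal sketch, assuming GCC or Clang on x86 (the builtin below is compiler-specific, and the function name is illustrative):

```c
/* Runtime feature detection with GCC/Clang's __builtin_cpu_supports,
   an alternative to parsing /proc/cpuinfo. Returns 1 if the CPU
   executing the program supports base AVX, 0 otherwise. */
int have_avx(void) {
    return __builtin_cpu_supports("avx") ? 1 : 0;
}
```

This is useful for dispatching at run time between an AVX code path and a scalar fallback within a single binary.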


Intel AVX is a low-level instruction set. To take advantage of AVX to speed up performance, one could inline a couple of assembly instructions, but this method is both non-portable and error-prone. The preferred approach is to use intrinsics instead. Simply speaking, intrinsics are functions that provide functionality equivalent to a few lines of assembly, so an intrinsic can serve as a drop-in replacement for hand-written assembly at compile time. This allows performance-critical code to be inlined without sacrificing the beauty and readability of the high-level language syntax. The correct intrinsics header for AVX development can be selected with the following code snippet.

  #if defined(_MSC_VER)
  /* Microsoft C/C++-compatible compiler */
  #include <intrin.h>
  #elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  /* GCC-compatible compiler, targeting x86/x86-64 */
  #include <x86intrin.h>
  #elif defined(__GNUC__) && defined(__ARM_NEON__)
  /* GCC-compatible compiler, targeting ARM with NEON */
  #include <arm_neon.h>
  #elif defined(__GNUC__) && defined(__IWMMXT__)
  /* GCC-compatible compiler, targeting ARM with WMMX */
  #include <mmintrin.h>
  #endif

Data types

In the intrinsics library, each vector type starts with two underscores, an m, and the width of the vector in bits; a suffix indicates the type of data it contains. AVX512 supports 512-bit vector types that start with __m512, but AVX/AVX2 vectors don't go beyond 256 bits.

Data Types for AVX Intrinsics
Data Type   Description
__m128      128-bit vector containing 4 floats
__m128d     128-bit vector containing 2 doubles
__m128i     128-bit vector containing integers
__m256      256-bit vector containing 8 floats
__m256d     256-bit vector containing 4 doubles
__m256i     256-bit vector containing integers

If a vector type ends in d, it contains doubles, and if it doesn't have a suffix, it contains floats. It might look like __m128i and __m256i vectors must contain ints, but this isn't the case. An integer vector type can contain any type of integer, from chars to shorts to unsigned long longs. That is, an __m256i may contain 32 chars, 16 shorts, 8 ints, or 4 long longs. These integers can be signed or unsigned.
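To illustrate that an integer vector type is just a bag of bits whose grouping is decided by the intrinsic applied to it, here is a plain-C analogy (the union name bits256 is made up for this sketch), viewing the same 256 bits at different element widths:

```c
#include <stdint.h>
#include <string.h>

/* A plain-C stand-in for __m256i: the same 32 bytes viewed as
   32 chars, 16 shorts, 8 ints, or 4 64-bit integers. With the real
   type, the choice of intrinsic (epi8 vs epi32 vs epi64) plays the
   role that the union member selection plays here. */
typedef union {
    int8_t  b[32];
    int16_t s[16];
    int32_t i[8];
    int64_t q[4];
} bits256;
```

For example, zeroing a bits256 and then setting q[0] to -1 makes the first two 32-bit elements read as -1 while the upper bytes stay zero, regardless of endianness.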


The names of AVX/AVX2 intrinsics can be confusing at first, but the naming convention is really straightforward. Once you understand it, you'll be able to determine what a function does by looking at its name. A generic AVX/AVX2 intrinsic function is given as follows:

      _mm<bit_width>_<name>_<vector_type>

  1. bit_width: the size of the vector returned by the function. For 128-bit vectors, this is empty. For 256-bit vectors, this is 256.
  2. name: the operation performed by the intrinsic
  3. vector_type: the data type contained in the function's primary argument vector

Vector type in function names actually refers to the type of data packed in the vector.

Vector Type of AVX Intrinsics Functions
Vector Type      Description
ps               floats (ps stands for packed single-precision)
pd               doubles (pd stands for packed double-precision)
epi8/16/32/64    8-bit/16-bit/32-bit/64-bit signed integers
epu8/16/32/64    8-bit/16-bit/32-bit/64-bit unsigned integers
si128/256        unspecified 128-bit or 256-bit vector; could be any type
m128/m256<i/d>   identifies input vector types when they're different from the type of the returned vector

As an example, consider _mm256_srlv_epi64. Even if you don't know what srlv means, the _mm256 prefix tells you the function returns a 256-bit vector, and the _epi64 tells you that the arguments contain 64-bit signed integers.

Build applications using AVX intrinsics

For example, the following intrinsic initializes a 256-bit vector of doubles to zero:

      __m256d vector = _mm256_setzero_pd();

Loading data into a vector differs depending on whether the memory is 32-byte aligned. For aligned memory:

      float* aligned_floats = (float*)aligned_alloc(32, 64 * sizeof(float));
      // some operations
      __m256 aligned_vector = _mm256_load_ps(aligned_floats);

For unaligned memory, the u variant of the load intrinsic must be used:

      float* unaligned_floats = (float*)malloc(64 * sizeof(float));
      // some operations
      __m256 unaligned_vector = _mm256_loadu_ps(unaligned_floats);

To compile code containing AVX intrinsics with GCC:

      gcc -mavx -o <your_avx_binary> <your_avx_code>.c

However, once FMA is used, the following command should be executed:

      gcc -mfma -o <your_avx_binary> <your_avx_code>.c
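Putting the pieces together, here is a minimal sketch of an AVX-accelerated function (the name add_f32 is illustrative, not from any library). It guards the intrinsics with #ifdef __AVX__ so the file still compiles when -mavx is not passed, falling back to the scalar loop:

```c
#include <stddef.h>
#ifdef __AVX__
#include <immintrin.h>
#endif

/* Adds two float arrays. When the compiler targets AVX (gcc -mavx),
   the main loop processes 8 floats per iteration with 256-bit
   vectors; the scalar loop handles the tail (or everything, on
   non-AVX builds). Unaligned loads/stores are used since the caller's
   buffers may not be 32-byte aligned. */
void add_f32(const float *a, const float *b, float *c, size_t n) {
    size_t i = 0;
#ifdef __AVX__
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
#endif
    for (; i < n; ++i)   /* scalar tail */
        c[i] = a[i] + b[i];
}
```

Note the stripmining/fission pattern from earlier: a vector main loop plus a scalar remainder loop.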

AVX matrix computation example

The following code snippets are from Pronghorn, which is developed in Java but can accelerate matrix operations with C/C++ AVX functions through the Java Native Interface (JNI). Whether it is suitable to use native C/C++ AVX functions to optimize a Java application needs careful consideration, as explained later.

  void mulNoAVX(const int row, const int col,
                jint** colSlabs_nat,
                jint* rowSlab_nat,
                const jint rowMask,
                const jlong rowPosition,
                const jint colMask,
                jint* colPositions_nat,
                int* results) {
    for (int c = col - 1; c >= 0; --c) {
      int prod = 0;
      for (int p = row - 1; p >= 0; --p) {
        int v1 = rowSlab_nat[rowMask & (jint)(rowPosition + p)];
        int v2 = colSlabs_nat[c][colMask & (colPositions_nat[c] + p)];
        prod += v1 * v2;
      }
      results[c] = prod;
    }
  }
Baseline Implementation of Integer Vector & Matrix Product Without AVX

The pointers rowSlab_nat and colSlabs_nat are a vector and a matrix, respectively. The result of this code is the vector product of rowSlab_nat and colSlabs_nat. The index operations rowMask & (jint)(rowPosition + p) and colMask & (colPositions_nat[c] + p) calculate the indices of the operands of each multiplication. In this code, the inner loop actually computes the dot product of two vectors.

In the implementation with AVX intrinsics, only 256-bit AVX operations are used to improve portability. The code mainly consists of a for loop which has two parts. The first part is the same as the implementation without AVX but only works on the elements that cannot be gathered into a 256-bit vector.

The second part is a loop with multiple AVX operations.

  1. When loading data, the intrinsic function _mm256_loadu_si256 is responsible for loading unaligned data into a 256-bit vector. Since the input vector and matrix are transferred by the JVM through the JNI interface, it is impossible to control the alignment of the data. Although loading/storing unaligned data can bring extra overhead, modern processors have optimized load/store instructions and reduced the performance difference between aligned and unaligned data.
  2. If the vector may be used to store different data types, _mm256_cvtepi32_ps and other conversion intrinsics can convert the data in the vector to the correct data type.
  3. The dot product of two vectors is a little bit complicated. With _mm256_dp_ps, the result vector stores one half of the product in its first element and the other half in its fifth element. Like this, some AVX math intrinsics do not behave in an intuitive way; users should refer to the documentation and check their usage carefully.
  4. The AVX implementation applies loop transformations like stripmining/fission to help vectorization. It also applies loop unrolling to improve the opportunity for the processor to exploit out-of-order instruction execution.
  void mulAVXInt(const int row, const int col,
                 jint** colSlabs_nat,
                 jint* rowSlab_nat,
                 const jint rowMask,
                 const jlong rowPosition,
                 const jint colMask,
                 jint* colPositions_nat,
                 int* results) {
    __m256 ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6;
    for (int c = col - 1; c >= 0; --c) {
      int prod = 0;
      int row_remain = (row - 1) - (row - 1) % 16;
      for (int p = (row - 1); p >= row_remain; --p) {
        int v1 = rowSlab_nat[rowMask & (jint)(rowPosition + p)];
        int v2 = colSlabs_nat[c][colMask & (colPositions_nat[c] + p)];
        prod += v1 * v2;
      }
      // upgrade integer to float due to no dot product for int in AVX
      // another implementation is to use "hadd" instructions
      // see example (
      for (int p = row_remain - 1; p >= 0; p -= 16) {
        ymm0 = _mm256_cvtepi32_ps(_mm256_loadu_si256((__m256i *)(&rowSlab_nat[(int)rowMask & ((int)(rowPosition) + p - 7)])));
        ymm1 = _mm256_cvtepi32_ps(_mm256_loadu_si256((__m256i *)(&rowSlab_nat[(int)rowMask & ((int)(rowPosition) + p - 15)])));
        ymm2 = _mm256_cvtepi32_ps(_mm256_loadu_si256((__m256i *)&colSlabs_nat[c][colMask & (colPositions_nat[c] + p - 7)]));
        ymm3 = _mm256_cvtepi32_ps(_mm256_loadu_si256((__m256i *)&colSlabs_nat[c][colMask & (colPositions_nat[c] + p - 15)]));
        ymm4 = _mm256_dp_ps(ymm0, ymm2, 255);
        ymm5 = _mm256_dp_ps(ymm1, ymm3, 255);
        ymm6 = _mm256_add_ps(ymm4, ymm5);
        float tmp_prod[8] = {0};
        _mm256_storeu_ps(tmp_prod, ymm6);
        prod += (int)tmp_prod[0] + (int)tmp_prod[4];
      }
      // results[c] = prod;
      memcpy((void*)(results + c), (void*)&prod, sizeof(int));
    }
  }
AVX Implementation of Integer Vector & Matrix Product
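Since the two-lane behavior of _mm256_dp_ps trips up many users, the following scalar model (a sketch written for illustration, not part of Pronghorn) shows what _mm256_dp_ps(a, b, 255) leaves in the output vector, and why the code above adds tmp_prod[0] and tmp_prod[4]:

```c
/* Scalar model of _mm256_dp_ps(a, b, 255): each 128-bit lane computes
   its own 4-element dot product and broadcasts that sum across the
   lane. So out[0..3] all hold a[0..3]·b[0..3] and out[4..7] all hold
   a[4..7]·b[4..7]; summing out[0] and out[4] yields the full 8-element
   dot product. */
void dp_ps_model(const float a[8], const float b[8], float out[8]) {
    float lo = 0.0f, hi = 0.0f;
    for (int k = 0; k < 4; ++k) {
        lo += a[k] * b[k];
        hi += a[k + 4] * b[k + 4];
    }
    for (int k = 0; k < 4; ++k) {
        out[k] = lo;
        out[k + 4] = hi;
    }
}
```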

Vectorization in other languages

Although the aforementioned vectorization through AVX was shown only in C/C++ code for convenience, it is also possible to take advantage of AVX optimization in other languages.


Since Java applications run on the JVM, there is no direct way to use AVX in Java. However, we can still leverage the JIT for automatic vectorization, or go further with AVX optimization through JNI.

  1. The HotSpot JVM implements automatic vectorization as of version 8. Specifically, the HotSpot compiler implements superword level parallelism, which can be enabled with the JVM option -XX:+UseSuperWord. However, the JIT compiler suffers from two main disadvantages. First, the JIT is a black box for developers; there is no control over optimization phases like vectorization. Second, the time constraint narrows the degree of optimization compared to static compilers like GCC or LLVM. So it is compelling to use statically compiled code, since it benefits from additional optimization reducing performance bottlenecks. Static compilation can be achieved through JNI in Java.
  2. JNI is a specification that allows applications to call statically compiled methods from dynamic libraries and to manipulate Java objects from there. So through JNI we can realize all the AVX optimizations aforementioned. But normally JNI incurs an additional invocation cost. Invoking a native target function requires two calls: the first from the Java application to the JNI wrapper, which sets up the native function parameters, and the second from the JNI wrapper to the native target function. The most significant source of overhead occurs during the invocation of JNI callbacks; an example is getting an array pointer from the Java heap. A callback pays a performance penalty because two levels of indirection are used: one to obtain the appropriate function pointer inside a function table, and one to invoke the function. Yet these callbacks remain necessary to work with Java objects while avoiding memory copies that are even more expensive. In addition to the Java heap, where objects are allocated and automatically removed by the garbage collector, Java allows programmers to allocate chunks of memory outside the Java heap. This kind of memory is called native memory. But the JNI specification does not recommend using native memory directly, since it can cause instability of the JVM and other issues.
Comparison of JIT and JNI Vectorization
  Approach   Optimization control         Overhead
  JIT        no control of optimization   lower overhead
  JNI        customized optimization      higher overhead


The Python library NumPy has optimized its code to leverage AVX/AVX2. Since it is a cornerstone of most applications involving math and matrix operations, there is usually no need for users to do manual optimization. But vectorization at the level of Python code, i.e. expressing computations as whole-array operations, is still useful, since it allows NumPy to implement those operations with machine-level vectorization such as AVX.


This article only examines the basic ideas and usage of vectorization for performance optimization. A simple example demonstrates the process of using AVX intrinsics to implement vectorization. As a powerful performance-optimization technique, vectorization can be applied in many applications, especially considering the rich SIMD instruction-set support in mainstream modern processor architectures. Although manual vectorization may require some effort and careful thought on algorithm and implementation, in some cases it can achieve greater speedup than automatic vectorization generated by compilers.


This article is inspired by optimizing math performance for a Pronghorn project at OCI. I want to thank my colleagues Adam Mitz, Nathan Tippy, and Charles Calkins from OCI for their useful and professional comments that greatly improved the quality of this article.

Further Reading
