SIMD is a technique for high-performance computation on the CPU. In this post we take a brief look at it.

Meaning Of SIMD

SIMD stands for Single Instruction, Multiple Data: a single instruction operates on several data elements at once, held in dedicated registers on the CPU. To use it, you need to know the basic register sets of the Intel x86 architecture.

X86 registers
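  • MMX is the first version of the SIMD registers. It was introduced with the Pentium MMX, and the Pentium II and later CPUs have it as well. The registers are 64 bits wide and are named MM0..MM7.
  • SSE adds another SIMD register set, XMM0..XMM7. These are 128-bit registers and are separate from the MM registers.
  • SSE2 adds double-precision and integer operations on the 128-bit XMM registers. 64-bit mode (x86-64) adds 8 more of them, XMM8..XMM15, along with 64-bit general-purpose registers such as RAX and RDI.
  • SSE3 adds a few more instructions, such as horizontal adds.
  • SSE4 adds several more instructions.
  • AVX extends the 128-bit XMM registers to 256-bit YMM registers.
  • AVX-512 extends the YMM registers to 512-bit ZMM registers and adds ZMM16 to ZMM31.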
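
Which of these register sets are available depends on the CPU the program runs on. As a rough sketch, assuming GCC or Clang on x86 (the `__builtin_cpu_supports` builtin is compiler-specific, and the helper name `widest_simd_bits` is made up for this example), you could check it at run time like this:

```c
#include <stdio.h>

/* Hypothetical helper: returns the width in bits of the widest SIMD
   register set the running CPU reports, or 0 if none is reported.
   Relies on GCC/Clang builtins, so it is not portable to MSVC. */
int widest_simd_bits(void) {
    __builtin_cpu_init();                                /* fill in the CPU feature data */
    if (__builtin_cpu_supports("avx512f")) return 512;   /* ZMM registers */
    if (__builtin_cpu_supports("avx"))     return 256;   /* YMM registers */
    if (__builtin_cpu_supports("sse"))     return 128;   /* XMM registers */
    if (__builtin_cpu_supports("mmx"))     return 64;    /* MM registers  */
    return 0;
}

/* Prints the result of the check above. */
void print_simd_support(void) {
    printf("widest SIMD registers: %d bits\n", widest_simd_bits());
}
```

A check like this is typically done once at startup, to pick between a SIMD code path and a plain scalar fallback.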

How to do it?

To vectorise your calculation, you can use the SSE or SSE2 registers and instructions. The most direct way is to write inline assembly.
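
To give an idea of what that looks like, here is a minimal sketch that adds four floats with a single `addps` instruction. It assumes GCC or Clang extended inline assembly in AT&T syntax on x86 (MSVC uses a different syntax and has no inline assembly on x64), and the function name `add4_asm` is invented for this example:

```c
/* Adds four floats from a and four from b, and writes the sums to out.
   The register names are hard-coded, so they are listed as clobbers. */
void add4_asm(const float *a, const float *b, float *out) {
    __asm__ volatile(
        "movups (%0), %%xmm0\n\t"     /* load a[0..3] into XMM0 (unaligned load) */
        "movups (%1), %%xmm1\n\t"     /* load b[0..3] into XMM1                  */
        "addps  %%xmm1, %%xmm0\n\t"   /* XMM0 = XMM0 + XMM1, four adds at once   */
        "movups %%xmm0, (%2)\n\t"     /* store the four sums to out[0..3]        */
        :                             /* no register outputs                     */
        : "r"(a), "r"(b), "r"(out)    /* the three pointers as inputs            */
        : "xmm0", "xmm1", "memory");  /* registers and memory we modify          */
}
```

Even for this tiny operation you have to manage loads, stores, register names and clobber lists by hand.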

Disadvantages of inline assembly for SIMD?

  • Assembly is a really hard language
  • There are tons of CPU instructions

The solution is to use intrinsics

Intrinsics use a simple format for their data types:

  • __m128i : 128 bits of packed integers
  • __m256d : 256 bits, four packed doubles
  • __m128 : 128 bits, four packed floats
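
The number of elements each type holds follows from its size. The sketch below only computes those counts with `sizeof`; the function names are made up for illustration. Note that `__m128i` does not fix an element width by itself: the intrinsic you call decides whether it is treated as 16 bytes, four 32-bit integers, and so on.

```c
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>  /* declares __m128, __m128i and __m256d */

/* How many elements of a given type fit into each vector type. */
size_t floats_in_m128(void)   { return sizeof(__m128)  / sizeof(float);   }  /* 4  */
size_t doubles_in_m256d(void) { return sizeof(__m256d) / sizeof(double);  }  /* 4  */
size_t int32s_in_m128i(void)  { return sizeof(__m128i) / sizeof(int32_t); }  /* 4  */
size_t bytes_in_m128i(void)   { return sizeof(__m128i) / sizeof(uint8_t); }  /* 16 */
```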

Although intrinsics are easy to use, you can't rely on simple operators such as + or * with these types, so you need additional functions, such as:

  • _mm256_mul_ps() : multiplies eight packed floats in 256-bit registers
  • _mm_add_ps() : adds four packed floats in 128-bit registers
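
Here is a short sketch of how such functions are used in practice, with the 128-bit SSE intrinsics `_mm_add_ps` and `_mm_mul_ps`. The function name `add_scale` and its parameters are invented for this example. The unaligned load and store variants are used so the arrays need no special alignment, and a plain scalar loop handles the elements left over when the length is not a multiple of four:

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_add_ps, _mm_mul_ps */

/* Computes out[i] = (a[i] + b[i]) * scale for n floats,
   processing four elements per loop iteration. */
void add_scale(const float *a, const float *b, float scale,
               float *out, size_t n) {
    __m128 vscale = _mm_set1_ps(scale);       /* scale copied into all 4 lanes */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va  = _mm_loadu_ps(a + i);     /* load 4 floats from a */
        __m128 vb  = _mm_loadu_ps(b + i);     /* load 4 floats from b */
        __m128 sum = _mm_add_ps(va, vb);      /* 4 additions at once  */
        _mm_storeu_ps(out + i, _mm_mul_ps(sum, vscale));
    }
    for (; i < n; ++i) {                      /* scalar tail for the remainder */
        out[i] = (a[i] + b[i]) * scale;
    }
}
```

The 256-bit version with `_mm256_add_ps` and `_mm256_mul_ps` looks the same with a step of eight, but it needs an AVX-capable CPU and, with GCC or Clang, a flag such as `-mavx`.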

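To use intrinsics, include <intrin.h> in your code if you compile with MSVC; with GCC or Clang, include <x86intrin.h> or <immintrin.h> instead. You can also include only the header for the instruction set you need: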

 x86intrin.h     Everything, including non-vector x86 instructions like _rdtsc().                         
 mmintrin.h      MMX (Pentium MMX!)                                                                       
 mm3dnow.h       3dnow! (K6-2) (deprecated)                                                               
 xmmintrin.h     SSE + MMX (Pentium 3, Athlon XP)                                                         
 emmintrin.h     SSE2 + SSE + MMX (Pentium 4, Athlon 64)                                                  
 pmmintrin.h     SSE3 + SSE2 + SSE + MMX (Pentium 4 Prescott, Athlon 64 San Diego)                        
 tmmintrin.h     SSSE3 + SSE3 + SSE2 + SSE + MMX (Core 2, Bulldozer)                                      
 popcntintrin.h  POPCNT (Nehalem (Core i7), Phenom)                                                       
 ammintrin.h     SSE4A + SSE3 + SSE2 + SSE + MMX (AMD-only, starting with Phenom)                         
 smmintrin.h     SSE4_1 + SSSE3 + SSE3 + SSE2 + SSE + MMX (Penryn, Bulldozer)                             
 nmmintrin.h     SSE4_2 + SSE4_1 + SSSE3 + SSE3 + SSE2 + SSE + MMX (Nehalem (aka Core i7), Bulldozer)     
 wmmintrin.h     AES (Core i7 Westmere, Bulldozer)                                                        
 immintrin.h     AVX, AVX2, AVX512, all SSE+MMX (except SSE4A and XOP), popcnt, BMI/BMI2, FMA             
Table: intrinsics headers and the instruction sets they provide
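
Including the narrowest header that covers your code also documents which instruction set it needs. As a small sketch, assuming SSE2 is available (it always is on x86-64), the function below includes only emmintrin.h and adds four 32-bit integers at once; the name `add4_i32` is made up for this example:

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: __m128i and the _mm_*_epi32 functions */

/* Adds four 32-bit integers from a and b and writes the sums to out. */
void add4_i32(const int32_t *a, const int32_t *b, int32_t *out) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);   /* unaligned 128-bit load */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}
```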


In conclusion, SIMD programming has some advantages and disadvantages.

Advantages

  • You use nothing but CPU instructions to speed up your calculations.

Disadvantages

  • Programming in assembly language is really hard.
  • Code written with intrinsics is not always compact in the disassembler output.
  • SIMD instructions run on a single core of the CPU; using the other cores still requires threads.
  • The compiler's /O optimization options often cannot vectorize code automatically.

Many matrix calculation libraries use SIMD, such as Boost's BLAS library (uBLAS), Armadillo, Eigen, IT++, and Newmat.
