SIMD
SIMD is a term for a high-performance calculation technique available on the CPU. In this post we are going to take a brief look at it.
Meaning Of SIMD
SIMD stands for Single Instruction, Multiple Data: one instruction performs the same operation on several data elements at once, using dedicated registers on the CPU. To use it, you have to know the basic SIMD register sets of the Intel x86 architecture:
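- MMX is the first generation of SIMD registers. It was introduced with the Pentium MMX, and the Pentium II and later have this technology. The registers are 64 bits wide and are named MM0..MM7.
- SSE adds another set of SIMD registers, XMM0..XMM7. These registers are 128 bits wide and are separate from the MM registers.
- SSE2 adds double-precision and integer operations on the 128-bit XMM registers. The 64-bit x86-64 architecture, which brings the 64-bit general registers such as RAX and RDI, also adds XMM8..XMM15.
- SSE3 adds a few more instructions, such as horizontal additions.
- SSE4 adds several more instructions.
- AVX extends the 128-bit XMM registers to 256-bit YMM registers. AVX-512 extends the YMM registers to 512-bit ZMM registers and adds ZMM16 to ZMM31.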
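To get a feel for these register widths, the sketch below asks the compiler for the size of the vector type that maps onto each register family. It assumes GCC or Clang on x86-64 with the <immintrin.h> header; float_lanes and print_register_widths are hypothetical helper names used only for illustration. Only sizes are queried, so no MMX or AVX instruction is executed.

```cpp
#include <cstddef>
#include <immintrin.h>  // declares __m64, __m128, __m256 and __m512 on GCC/Clang
#include <iostream>

// Hypothetical helper: how many 32-bit floats fit in a vector of the given size.
constexpr std::size_t float_lanes(std::size_t vector_bytes) {
    return vector_bytes / sizeof(float);
}

// Print the width of each register family and, for the float-capable ones,
// how many floats a single instruction can process at once.
void print_register_widths() {
    std::cout << "MM  (MMX):     " << sizeof(__m64) * 8 << " bits\n";
    std::cout << "XMM (SSE):     " << sizeof(__m128) * 8 << " bits, "
              << float_lanes(sizeof(__m128)) << " floats\n";
    std::cout << "YMM (AVX):     " << sizeof(__m256) * 8 << " bits, "
              << float_lanes(sizeof(__m256)) << " floats\n";
    std::cout << "ZMM (AVX-512): " << sizeof(__m512) * 8 << " bits, "
              << float_lanes(sizeof(__m512)) << " floats\n";
}
```

Calling print_register_widths() from main should list 64, 128, 256 and 512 bits, which is why a wider register family can process more values per instruction.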
```cpp
#include <cstdint>
#include <cstdlib>
#include <iostream>

// Report whether a CPUID feature bit is set.
static void report(const char* name, bool supported) {
    if (supported) {
        std::cout << name << " support\n";
    } else {
        std::cerr << "You don't have " << name << "\n";
    }
}

void get_cpu_info() {
    uint32_t a, b, c, d;
    // CPUID leaf 1 (EAX = 1): the feature flags come back in ECX and EDX.
    __asm__ __volatile__("cpuid"
                         : "=a"(a), "=b"(b), "=c"(c), "=d"(d)
                         : "a"(1), "c"(0));
    (void)a;  // EAX and EBX are not needed here
    (void)b;

    // Feature flags in EDX
    report("MMX", (d & (1u << 23)) != 0);
    report("SSE", (d & (1u << 25)) != 0);
    report("SSE2", (d & (1u << 26)) != 0);
    // Feature flags in ECX
    report("SSE3", (c & 1u) != 0);
    report("SSSE3", (c & (1u << 9)) != 0);
    report("SSE4.1", (c & (1u << 19)) != 0);
    report("SSE4.2", (c & (1u << 20)) != 0);
}

int main() {
    // read https://en.wikipedia.org/wiki/CPUID
    get_cpu_info();
    return EXIT_SUCCESS;
}
```
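As an alternative to issuing cpuid by hand, GCC and Clang provide the builtins __builtin_cpu_init() and __builtin_cpu_supports(), which answer the same questions by name. The sketch below assumes one of those compilers on an x86 target; SimdSupport and detect_simd are hypothetical names, not part of any library.

```cpp
#include <iostream>

// Hypothetical struct collecting a few of the feature flags discussed above.
struct SimdSupport {
    bool mmx, sse, sse2, sse3, ssse3, sse41, sse42, avx, avx2;
};

// Query the CPU through the GCC/Clang builtins instead of raw cpuid.
// __builtin_cpu_supports only accepts string literals, so each feature
// is spelled out explicitly.
SimdSupport detect_simd() {
    __builtin_cpu_init();
    SimdSupport s{};
    s.mmx   = __builtin_cpu_supports("mmx") != 0;
    s.sse   = __builtin_cpu_supports("sse") != 0;
    s.sse2  = __builtin_cpu_supports("sse2") != 0;
    s.sse3  = __builtin_cpu_supports("sse3") != 0;
    s.ssse3 = __builtin_cpu_supports("ssse3") != 0;
    s.sse41 = __builtin_cpu_supports("sse4.1") != 0;
    s.sse42 = __builtin_cpu_supports("sse4.2") != 0;
    s.avx   = __builtin_cpu_supports("avx") != 0;
    s.avx2  = __builtin_cpu_supports("avx2") != 0;
    return s;
}

// Print a short yes/no report for the wider extensions.
void print_simd_support() {
    const SimdSupport s = detect_simd();
    std::cout << "SSE2:   " << (s.sse2 ? "yes" : "no") << "\n";
    std::cout << "SSE4.2: " << (s.sse42 ? "yes" : "no") << "\n";
    std::cout << "AVX:    " << (s.avx ? "yes" : "no") << "\n";
    std::cout << "AVX2:   " << (s.avx2 ? "yes" : "no") << "\n";
}
```

This form is less error-prone than decoding bit positions by hand, at the cost of being compiler-specific.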
How to use it?
To vectorise your calculations, you can use the SSE or SSE2 registers and instructions. The most direct way is to write inline assembly code.
```cpp
#include <cstdlib>
#include <iostream>

void div_sse_two_vector() {
    float f1[4] = {1.1f, 2.2f, 3.3f, 4.4f};
    float f2[4] = {5.1f, 6.2f, 7.3f, 8.4f};
    float re[4] = {0.f};

    // SSE SIMD code (AT&T syntax: the destination is the last operand)
    __asm__ __volatile__(
        "movups %1, %%xmm1;"     // xmm1 = f1
        "movups %2, %%xmm2;"     // xmm2 = f2
        "divps %%xmm1, %%xmm2;"  // xmm2 = xmm2 / xmm1, four divisions at once
        "movups %%xmm2, %0"      // re = xmm2
        : "=m"(re)
        : "m"(f1), "m"(f2)
        : "xmm1", "xmm2");       // registers modified by the assembly

    for (int x = 0; x < 4; x++) {
        std::cout << f2[x] << " / " << f1[x] << " = " << re[x] << "\n";
    }
}

int main() {
    // read https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
    div_sse_two_vector();
    return EXIT_SUCCESS;
}
```
Disadvantages of inline assembly for SIMD
- Assembly is a really hard language to write and maintain.
- There are tons of CPU instructions to learn.
The solution is to use intrinsics.
Intrinsics
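Intrinsics offer an easier format: the SIMD registers are exposed as C/C++ vector types, such as
- __m128i : 128-bit vector of integers
- __m256d : 256-bit vector of doubles
- __m128 : 128-bit vector of floats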
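A small sketch of how values get into and out of these types, assuming SSE2 and the <emmintrin.h> header. The helper names floats_of, ints_of and print_lane_order are hypothetical. __m256d is left out here because it needs AVX to be enabled at compile time.

```cpp
#include <array>
#include <emmintrin.h>  // SSE2: __m128, __m128i and their set/load/store intrinsics
#include <iostream>

// Copy the four float lanes of v into an array, lane 0 first.
std::array<float, 4> floats_of(__m128 v) {
    std::array<float, 4> out{};
    _mm_storeu_ps(out.data(), v);
    return out;
}

// Copy the four 32-bit integer lanes of v into an array, lane 0 first.
std::array<int, 4> ints_of(__m128i v) {
    std::array<int, 4> out{};
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), v);
    return out;
}

// Show that _mm_set_ps fills lanes from highest to lowest,
// while _mm_setr_ps fills them in memory order.
void print_lane_order() {
    const std::array<float, 4> hi_first = floats_of(_mm_set_ps(1.f, 2.f, 3.f, 4.f));
    const std::array<float, 4> lo_first = floats_of(_mm_setr_ps(1.f, 2.f, 3.f, 4.f));
    std::cout << "_mm_set_ps  lane 0: " << hi_first[0] << "\n";
    std::cout << "_mm_setr_ps lane 0: " << lo_first[0] << "\n";
}
```

The lane order matters when results are printed: with _mm_set_ps the last argument ends up in lane 0.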
Although intrinsics are easy, you can't rely on the simple operators, so you need dedicated functions, such as
- _mm256_mul_ps() : packed multiplication of 256-bit float vectors
- _mm_add_ps() : packed addition of 128-bit float vectors
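As a minimal sketch of how such functions replace the usual operators, the routine below computes out[i] = a * x[i] + y[i] four floats at a time using only SSE intrinsics. The function name scale_add is hypothetical, and the scalar loop at the end handles lengths that are not a multiple of four.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE: __m128, _mm_set1_ps, _mm_mul_ps, _mm_add_ps

// Computes out[i] = a * x[i] + y[i] for i in [0, n).
void scale_add(float a, const float* x, const float* y, float* out, std::size_t n) {
    const __m128 va = _mm_set1_ps(a);  // broadcast a into all four lanes
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        const __m128 vx = _mm_loadu_ps(x + i);  // unaligned load of 4 floats
        const __m128 vy = _mm_loadu_ps(y + i);
        const __m128 r = _mm_add_ps(_mm_mul_ps(va, vx), vy);
        _mm_storeu_ps(out + i, r);              // unaligned store of 4 floats
    }
    for (; i < n; ++i) {  // scalar tail for the remaining elements
        out[i] = a * x[i] + y[i];
    }
}
```

The unaligned load and store variants are used so the caller does not have to guarantee 16-byte alignment.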
To use intrinsics, include <immintrin.h> in your code (or <intrin.h> with MSVC), or pick one of the more specific headers below.
Header | Purpose |
x86intrin.h | Everything, including non-vector x86 instructions like _rdtsc(). |
mmintrin.h | MMX (Pentium MMX!) |
mm3dnow.h | 3dnow! (K6-2) (deprecated) |
xmmintrin.h | SSE + MMX (Pentium 3, Athlon XP) |
emmintrin.h | SSE2 + SSE + MMX (Pentium 4, Athlon 64) |
pmmintrin.h | SSE3 + SSE2 + SSE + MMX (Pentium 4 Prescott, Athlon 64 San Diego) |
tmmintrin.h | SSSE3 + SSE3 + SSE2 + SSE + MMX (Core 2, Bulldozer) |
popcntintrin.h | POPCNT (Nehalem (Core i7), Phenom) |
ammintrin.h | SSE4A + SSE3 + SSE2 + SSE + MMX (AMD-only, starting with Phenom) |
smmintrin.h | SSE4_1 + SSSE3 + SSE3 + SSE2 + SSE + MMX (Penryn, Bulldozer) |
nmmintrin.h | SSE4_2 + SSE4_1 + SSSE3 + SSE3 + SSE2 + SSE + MMX (Nehalem (aka Core i7), Bulldozer) |
wmmintrin.h | AES (Core i7 Westmere, Bulldozer) |
immintrin.h | AVX, AVX2, AVX512, all SSE+MMX (except SSE4A and XOP), popcnt, BMI/BMI2, FMA |
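```cpp
#include <cstdlib>
#include <iostream>
#include <xmmintrin.h>

void div_sse_two_vector() {
    // _mm_set_ps takes its arguments from the highest lane to the lowest.
    __m128 a = _mm_set_ps(1.1f, 2.2f, 3.3f, 4.4f);
    __m128 b = _mm_set_ps(5.1f, 6.2f, 7.3f, 8.4f);
    __m128 c = _mm_div_ps(b, a);  // c = b / a, lane by lane

    // Copy the lanes to plain arrays before printing; indexing a __m128
    // directly is a GCC/Clang extension and is not portable.
    float fa[4], fb[4], fc[4];
    _mm_storeu_ps(fa, a);
    _mm_storeu_ps(fb, b);
    _mm_storeu_ps(fc, c);

    for (int x = 0; x < 4; x++) {
        std::cout << fb[x] << "\t / \t" << fa[x] << "\t = \t" << fc[x] << "\n";
    }
}

int main() {
    div_sse_two_vector();
    return EXIT_SUCCESS;
}
```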
Conclusion
In conclusion, SIMD programming has some advantages and disadvantages.
Advantages:
- You speed up your calculations using nothing but CPU instructions.
Disadvantages:
- Programming in assembly language is really hard.
- Code written with intrinsics is not always compact in the disassembler output.
- SIMD instructions run on only one core of the CPU, so using several cores still requires threads.
- The compiler's optimization passes (the -O or /O flags) often cannot vectorize code automatically.
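On the last point, automatic vectorization tends to have the best chance on simple loops over non-overlapping arrays. The sketch below is one such loop, under the assumption of GCC or Clang; div_arrays is a hypothetical name, and __restrict is a compiler extension rather than standard C++. Building with g++ -O3 -fopt-info-vec or clang++ -O3 -Rpass=loop-vectorize makes the compiler report which loops it vectorized, so you can check instead of guessing.

```cpp
#include <cstddef>

// Element-wise division written as a plain loop. __restrict promises the
// compiler that the three arrays do not overlap, which removes one common
// obstacle to automatic vectorization.
void div_arrays(const float* __restrict a, const float* __restrict b,
                float* __restrict out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] / b[i];
    }
}
```

If the report shows the loop was not vectorized, falling back to intrinsics is still an option.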
Many matrix calculation libraries use SIMD, such as uBLAS from Boost, Armadillo, Eigen, IT++, and Newmat.