SIMD

by okanakyuz · 2020-04-11

SIMD is a term for high-performance calculation on the CPU. In this post we are going to take a short look at it.

Meaning Of SIMD

SIMD means Single Instruction, Multiple Data: one instruction performs the same calculation on several data elements held in dedicated registers on the CPU. To use it, you have to know the basic SIMD registers of the Intel x86 architecture.

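MMX is the first version of the SIMD registers. The Pentium MMX, the Pentium II and later CPUs have this technology. The registers are 64 bits wide and are named MM0..MM7.
SSE is another SIMD register set. It adds separate 128-bit registers, XMM0..XMM7 (XMM8..XMM15 are also available in 64-bit mode). These registers are not extensions of the MM registers.
SSE2 keeps the 128-bit XMM registers and adds double-precision and integer instructions for them. The 256-bit YMM registers arrive later, with AVX.
SSE3 adds a few more instructions for the XMM registers. The 512-bit ZMM registers, including ZMM16 to ZMM31, belong to AVX-512, and the 64-bit general registers such as RAX and RDI come from the x86-64 architecture, not from SSE3.
SSE4 (SSE4.1 and SSE4.2) adds several more instructions.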

// Checks which SIMD extensions the CPU supports by reading the CPUID feature bits.
#include <cstdint>
#include <cstdlib>
#include <iostream>

using namespace std;

void get_cpu_info(){
    uint32_t a, b, c, d;
    __asm__ __volatile__(
        "cpuid"
        : "=a"(a), "=b"(b), "=c"(c), "=d"(d) // outputs: EAX, EBX, ECX, EDX
        : "a"(1), "c"(0)                     // input: leaf 1 (feature bits)
    );
    (void)a; // EAX and EBX are not needed here
    (void)b;

    // MMX, SSE and SSE2 are reported in EDX (d)
    if ((d & (1u << 23)) != 0){
        cout << "MMX support \n";
    }
    else
    {
        cerr << "You don't have MMX \n";
    }

    if ((d & (1u << 25)) != 0)
    {
        cout << "SSE support \n";
    }
    else
    {
        cerr << "You don't have SSE \n";
    }

    if ((d & (1u << 26)) != 0)
    {
        cout << "SSE2 support \n";
    }
    else
    {
        cerr << "You don't have SSE2 \n";
    }

    // SSE3, SSSE3, SSE4.1 and SSE4.2 are reported in ECX (c)
    if ((c & 1u) != 0)
    {
        cout << "SSE3 support \n";
    }
    else
    {
        cerr << "You don't have SSE3 \n";
    }

    if ((c & (1u << 9)) != 0)
    {
        cout << "SSSE3 support \n";
    }
    else
    {
        cerr << "You don't have SSSE3 \n";
    }

    if ((c & (1u << 19)) != 0)
    {
        cout << "SSE4.1 support \n";
    } else {
        cerr << "You don't have SSE4.1 \n";
    }

    if ((c & (1u << 20)) != 0)
    {
        cout << "SSE4.2 support \n";
    } else {
        cerr << "You don't have SSE4.2 \n";
    }
}

int main(int argc, char const *argv[])
{
    //read https://en.wikipedia.org/wiki/CPUID
    get_cpu_info();

    return EXIT_SUCCESS;
}

How to do?

To vectorise your calculation, you may use the SSE or SSE2 registers and instructions. The most direct way is to use inline assembly code.

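#include <cstdlib>
#include <iostream>

using namespace std;

void div_sse_two_vector(){
    float f1[4] = { 1.1f, 2.2f, 3.3f, 4.4f };
    float f2[4] = { 5.1f, 6.2f, 7.3f, 8.4f };
    float re[4] = { 0.f };

    //SSE SIMD code: re = f2 / f1, four floats at once
    __asm__ __volatile__ (  "movups %1, %%xmm1;"
                        "movups %2, %%xmm2;"
                        "divps %%xmm1, %%xmm2;"    // xmm2 = xmm2 / xmm1 (AT&T operand order)
                        "movups %%xmm2, %0"
                        : "=m" (re)
                        : "m" (f1), "m" (f2)
                        : "xmm1", "xmm2");         // registers overwritten by the code

    //print the four quotients
    for (int i = 0; i < 4; i++){
        cout << f2[i] << " / " << f1[i] << " = " << re[i] << "\n";
    }
}

int main(int argc, char const *argv[])
{
    //read https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
    div_sse_two_vector();

    return EXIT_SUCCESS;
}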

Disadvantages of inline assembly for SIMD?

Assembly is a really hard language.
There are tons of CPU instructions.

The solution is to use intrinsics.

Intrinsics

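Intrinsics are data types and functions that map almost directly to the SIMD registers and instructions. The data types have an easy format:

__m128i : 128-bit integer
__m256d : 256-bit double
__m128 : 128-bit float

Although intrinsics are easy, you cannot portably use simple operators such as + or * on these types, so you need additional functions, such as:

_mm256_mul_ps() : 256-bit packed single-precision multiplication
_mm_add_ps() : 128-bit packed single-precision addition

To use intrinsics, include <immintrin.h> in your code (<intrin.h> with MSVC), or one of the narrower headers listed below.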

Header	Purpose
x86intrin.h	Everything, including non-vector x86 instructions like _rdtsc().
mmintrin.h	MMX (Pentium MMX!)
mm3dnow.h	3dnow! (K6-2) (deprecated)
xmmintrin.h	SSE + MMX (Pentium 3, Athlon XP)
emmintrin.h	SSE2 + SSE + MMX (Pentium 4, Athlon 64)
pmmintrin.h	SSE3 + SSE2 + SSE + MMX (Pentium 4 Prescott, Athlon 64 San Diego)
tmmintrin.h	SSSE3 + SSE3 + SSE2 + SSE + MMX (Core 2, Bulldozer)
popcntintrin.h	POPCNT (Nehalem (Core i7), Phenom)
ammintrin.h	SSE4A + SSE3 + SSE2 + SSE + MMX (AMD-only, starting with Phenom)
smmintrin.h	SSE4_1 + SSSE3 + SSE3 + SSE2 + SSE + MMX (Penryn, Bulldozer)
nmmintrin.h	SSE4_2 + SSE4_1 + SSSE3 + SSE3 + SSE2 + SSE + MMX (Nehalem (aka Core i7), Bulldozer)
wmmintrin.h	AES (Core i7 Westmere, Bulldozer)
immintrin.h	AVX, AVX2, AVX512, all SSE+MMX (except SSE4A and XOP), popcnt, BMI/BMI2, FMA

intrin headers

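#include <cstdlib>
#include <iostream>
#include <xmmintrin.h>

void div_sse_two_vector(){
    // _mm_set_ps takes the highest element first, so element 0 holds 4.4f and 8.4f
    __m128 a = _mm_set_ps(1.1f, 2.2f, 3.3f, 4.4f);
    __m128 b = _mm_set_ps(5.1f, 6.2f, 7.3f, 8.4f);
    __m128 c = _mm_div_ps(b,a);   // c = b / a, element by element

    // copy the vectors to plain arrays; indexing __m128 directly is a compiler extension
    float fa[4], fb[4], fc[4];
    _mm_storeu_ps(fa, a);
    _mm_storeu_ps(fb, b);
    _mm_storeu_ps(fc, c);

    for(int x = 0 ; x < 4 ; x++){
        std::cout << fb[x] << "\t / \t" << fa[x] << "\t = \t" << fc[x] << "\n";
    }
}

int main(int argc, char const *argv[])
{
    div_sse_two_vector();
    return EXIT_SUCCESS;
}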

Result

In conclusion, SIMD programming has some advantages and disadvantages.

Advantages:

You use nothing but CPU instructions to speed up your calculations.

Disadvantages:

Programming in assembly language is really hard.
Code written with intrinsics is not always that small in the disassembler output.
SIMD instructions use only one core of the CPU.
Automatic vectorization by the compiler's /O optimization options is generally not guaranteed, so you often have to vectorize by hand.

There are many matrix calculation frameworks that use SIMD, such as BLAS from Boost, Armadillo, Eigen, IT++, and Newmat.
