Friday, December 18, 2015

Software Portability and Optimization course review


The Software Portability and Optimization course has been rough; it is packed with information. Everything I learned was new to me, and most of it was interesting. Chris Tyler also showed us handy tools beyond the course material. In the beginning I regretted taking this course, but it turned out to be an interesting class, and when you email Chris with a problem he replies within a day to keep you moving so you don't get stuck on your work.

The things I liked most were working with shell scripting, learning new things about Linux, improving programs through better algorithms, and the compilation process (compiler options). Chris gave samples and showed how he did the work, which got us started. Before this course, when I wrote code I would try to improve it by removing unnecessary variables and code and using as little memory as possible, but these things never made a difference: I learned that the compiler rearranges the code to improve it. The things I did by hand, the compiler does automatically, so from now on I'll work with compiler options to improve my code rather than doing it manually.

Vectorizing

Everyone wants to write code that performs well, and loops are one of the places where programs spend most of their time. If you are writing a loop that walks through each element of an array and you want better performance, you can use vectorization. Vectorization is the process of rewriting a loop so that instead of processing a single element of an array N times, it processes (say) 4 elements of the array simultaneously N/4 times.

Ex:

void copy(char *dest, char *source, long size) {
    long i;    /* match the type of size so very large copies don't overflow */
    for (i = 0; i < size; i++)
        dest[i] = source[i];
}
The code above copies data from one location to another. The performance wouldn't matter if the size is small (say 100 - 1000), but if the size is large (millions) you can see the effect. With vectorization the program copies N elements (let's say 4) at the cost of one element copy in the code above, but only if the memory regions of destination and source don't overlap. To apply vectorization to your program, compile it with the -O3 flag. If the loop is guaranteed to operate on non-overlapping memory regions, you can add "#pragma GCC ivdep" above the loop, which tells the compiler to ignore vector dependencies; the loop will then be vectorized with even better performance than plain -O3, because the runtime overlap checks are skipped. But if the loop does operate on overlapping memory, your program will give incorrect results.
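Here is a minimal sketch of how that looks (the function name is mine; the pragma needs GCC 4.9 or later):

/* Compile with: gcc -O3 -fopt-info-vec copy.c
   (-fopt-info-vec makes GCC report which loops it vectorized) */
void copy_no_overlap(char *dest, char *source, long size) {
    long i;
    /* promise the compiler that dest and source never overlap,
       so it can vectorize without emitting runtime overlap checks */
    #pragma GCC ivdep
    for (i = 0; i < size; i++)
        dest[i] = source[i];
}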
The difference between loop unrolling and vectorization is that loop unrolling performs 4 (as an example) separate operations per index increment, while vectorization copies 4 (as an example) elements into the destination in 1 operation. It's easier to show in code, so here it is:

Loop Unrolling

/* four separate copies per iteration; assumes size is a multiple of 4 */
for (int i = 0; i < size; i += 4) {
    dest[i]     = source[i];
    dest[i + 1] = source[i + 1];
    dest[i + 2] = source[i + 2];
    dest[i + 3] = source[i + 3];
}

Vectorization

/* one operation per iteration; copyFourThingsAtOnce stands in for a
   vector instruction - again assumes size is a multiple of 4 */
for (int i = 0; i < size; i += 4) {
    copyFourThingsAtOnce(&dest[i], &source[i]);
}
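Out of curiosity, here is what a real vectorized copy could look like with SSE2 intrinsics on x86_64 (my own sketch, not course material; it moves 16 bytes per iteration and assumes non-overlapping regions and a size that is a multiple of 16):

#include <emmintrin.h>   /* SSE2 intrinsics */

void copy_sse2(char *dest, const char *source, long size) {
    long i;
    for (i = 0; i < size; i += 16) {
        /* one 128-bit load and one 128-bit store per iteration */
        __m128i v = _mm_loadu_si128((const __m128i *)(source + i));
        _mm_storeu_si128((__m128i *)(dest + i), v);
    }
}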

Assembly Language

Writing assembly code is very difficult: it is the lowest-level programming language, and you have to write code specific to the architecture of the machine, for example 32-bit versus 64-bit machines.

Take a look at the following assembly code written for an x86_64 machine. This code loops until the index (start) reaches the value in max, printing each value.

.globl    _start    /* declared global so the linker knows where the program starts; it plays the role of main */

.set     start, 0     /* starting value for the loop index */
.set     max, 10    /* loop exits when the index hits this number (loop condition is i<max) */
_start:
mov     $start,%r15    /* loop index */
loop:
mov     %r15,%r14     /* copy to r14 for ASCII conversion */
mov     $'0',%r13     /* load the ASCII code of '0' into r13 */
add     %r13,%r14     /* add r13 to r14; r14 now holds the ASCII digit for the index */
movq   $digit,%rsi       /* address of the digit character within the message */
movb   %r14b,(%rsi)  /* write the ASCII digit into the message */
movq   $len,%rdx       /* message length, rdx is register d extended */
movq   $msg,%rsi      /* message location, rsi is register source index */
movq   $1,%rdi          /* file descriptor stdout, rdi is register destination index */
movq   $1,%rax         /* syscall, sys_write, rax is register a extended */
syscall

inc       %r15              /* increment register 15 */
cmp     $max,%r15    /* see if we're done */
jne       loop               /* loop if we are not */

mov     $0,%rdi         /* exit status */
mov     $60,%rax      /* syscall sys_exit */
syscall 

.data
msg:    .ascii    "Loop: #\n"   /* message text - the # is replaced by the index digit */
digit = . - 2                           /* address of the # placeholder */
len = . - msg                         /* length of message */
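For reference, when this program is assembled and run it prints:

Loop: 0
Loop: 1
...
Loop: 9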

The code belongs to Chris Tyler, and some of the comments are his as well; some of the steps are quite difficult to understand. Note that this code can only print a single-digit value. All the registers used in the code above are specific to x86_64; AArch64 has a different set of registers. Take a look at the following table for some basic differences between x86_64 and AArch64 machines.


x86_64:
- r8 to r15 are general-purpose registers; rax, rbx, rcx, rdx, rbp, rsp, rsi, and rdi are specialized registers.
- 64-bit registers use the 'r' prefix: rax, r15.
- 32-bit registers use the 'e' prefix (original registers: e_x) or the 'd' suffix (added registers: r__d): eax, r15d.
- 16-bit registers use no prefix (original registers: _x) or the 'w' suffix (added registers: r__w): ax, r15w.
- 8-bit registers use the 'h' suffix ("high byte", bits 8-15 of the original registers: _h): ah, bh.
- 8-bit registers use the 'l' suffix ("low byte", bits 0-7 of the original registers: _l) or the 'b' suffix (added registers: r__b): al, bl, r15b.
- Arithmetic instructions like add take two operands, and (in AT&T syntax) the result is stored in the second operand.
- ...and many more differences.

AArch64:
- x0 to x30 are general-purpose registers; there is no separate set of specialized registers.
- x0 through x30 access the registers as 64 bits wide; w0 through w30 access the same registers as 32 bits wide.
- Arithmetic instructions like add take three operands: the result is stored in the first operand, and the other two hold the source values.
- ...and many more differences.
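To make the operand-order difference concrete, here is the same addition on both architectures (my example; it assumes the source values are already in the registers shown):

add     %r13,%r14      /* x86_64 (AT&T syntax): r14 = r14 + r13, result in the second operand */

add     x0, x1, x2     /* AArch64: x0 = x1 + x2, result in the first operand */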

One of the good things about writing assembly code is that when you compile it and analyze the executable with the objdump command, you will not see any code other than what you have written. When you write C/C++ code and analyze its executable, there is a lot of other code that you didn't write; that code comes from your includes. Since you don't need to include anything in assembly code, what you write is what you get. The assembly program goes in a file with the .s extension, and you just run:
as -o name.o file.s
ld -o executable-name name.o
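Then you can disassemble the result and confirm it contains only your code:

objdump -d executable-name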

You can also embed assembly code in your C/C++ program using the asm statement, e.g. asm("add $1, %0"), and compile it with the ordinary gcc or g++ compiler.
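A complete, compilable version of that fragment might look like this (my sketch; the constraint syntax is GCC extended asm, and the instruction is x86-specific):

#include <stdio.h>

int main(void) {
    int x = 41;
    /* "+r" keeps x in a register and marks it as both input and output,
       so the add instruction increments it in place */
    __asm__("add $1, %0" : "+r"(x));
    printf("x = %d\n", x);   /* prints x = 42 */
    return 0;
}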

Thursday, December 17, 2015

Benchmarking MPICH package

In this post I'm going to write about benchmarking the MPICH software using the framework we designed in the Software Portability and Optimization class. Here is the link to the framework: https://github.com/pk400/SPO600-Build-Framework. MPICH is a high-performance and widely portable implementation of the Message Passing Interface (MPI) standard, used exclusively on nine of the top 10 supercomputers.

I spent a lot of time getting this test framework to run with my MPICH package, and I faced many problems. First, when I edited the build plugin file to put the compiler options selected by the test framework into CFLAGS, the build failed, and I didn't know why until the professor (Chris Tyler) looked at it and told me that exporting the CFLAGS variable was breaking the build. I looked into it and still don't know the cause; I checked that both the configure script and the makefile use the CFLAGS variable. I found a workaround: set the compiler options directly on the variable the build uses, and unset CFLAGS after calling configure, because configure appends its own compiler options to CFLAGS, and make copies CFLAGS into the variable that holds the build flags, MPICH_MAKE_CFLAGS. While figuring this out, I also found that the MPICH libraries were being built from CFLAGS plus the other options appended by the configure script. By unsetting CFLAGS after running configure, I was able to build the MPICH package and its libraries with exactly the compiler flags chosen by the permutation, which is what I wanted.
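In shell terms, the workaround amounts to something like this (a rough sketch; PERMUTATION_FLAGS stands in for whatever flags the framework's current permutation selects):

# configure appends its own options to CFLAGS, so run it first...
./configure
# ...then clear CFLAGS so it can't leak into the build,
unset CFLAGS
# and hand the permutation's flags straight to the variable make uses
make MPICH_MAKE_CFLAGS="$PERMUTATION_FLAGS"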

For testing, I used -O2 and -O3 as a trial run for my package. The results on the x86_64 system are:

Permutation columns: (uniqueId, givenId, flags, buildExit, buildWallTime, buildUserSystemTime, buildSize, testExit)
Benchmark columns: (uniqueId, permutationUniqueId, givenId, speedScore, memoryScore)

(1, 1, '-O2', 0, 274.37, 703.28, 0, 0)
    (1, 1, 1, 702, 17016)
    (2, 1, 2, 702, 18736)
    (3, 1, 3, 704, 18796)
    (4, 1, 4, 701, 18796)

(2, 2, '-O3', 0, 224.73, 598.86, 0, 0)
    (5, 2, 1, 819, 18796)
    (6, 2, 2, 825, 18776)
    (7, 2, 3, 820, 18764)
    (8, 2, 4, 820, 18784)

The results with the test framework's compiler-option groups are below.
There was an issue with the first group: it contains both the options that stay on at every O-level and the options that are kept off at every level (-O1, -O2, -O3), so there was no delimiter, which caused the permutation to fail after the first group. To get results from all the groups, I had to run the framework with group one alone, then remove group one from the config file so the other groups were used to build the package with different sets of compiler options. In the benchmarking results I'll only give the group name, because a lot of flags are involved; to see which compiler options are used, go to https://github.com/pk400/SPO600-Build-Framework and look at the monkeys10k.config file.

(1, 1, 'group1', 0, 256.45, 697.5, 0, 0)
    (1, 1, 1, 702, 18760)
    (2, 1, 2, 703, 18888)
    (3, 1, 3, 701, 18796)
    (4, 1, 4, 704, 18720)

(1, 1, 'group2 permute: 1', 0, 487.48, 729.73, 0, 0)
    (1, 1, 1, 1373, 18804)
    (2, 1, 2, 1424, 18800)
    (3, 1, 3, 1405, 18780)
    (4, 1, 4, 1361, 7356)

(2, 2, 'group2 permute: 2', 0, 470.55, 630.61, 0, 0)
    (5, 2, 1, 1248, 14856)
    (6, 2, 2, 1267, 16396)
    (7, 2, 3, 1268, 18768)
    (8, 2, 4, 1338, 17056)

(1, 1, 'group3 permute: 1', 0, 550.99, 734.2, 0, 0)
    (1, 1, 1, 1555, 16392)
    (2, 1, 2, 1660, 14876)
    (3, 1, 3, 1659, 18700)
    (4, 1, 4, 1653, 18764)

(2, 2, 'group3 permute: 2', 0, 435.94, 621.67, 0, 0)
    (5, 2, 1, 703, 18804)
    (6, 2, 2, 704, 18820)
    (7, 2, 3, 699, 18836)
    (8, 2, 4, 707, 18772)

(1, 1, 'group4 permute: 1', 0, 368.79, 714.64, 0, 0)
    (1, 1, 1, 1292, 18776)
    (2, 1, 2, 1303, 18744)
    (3, 1, 3, 1266, 18816)
    (4, 1, 4, 767, 18776)

(2, 2, 'group4 permute: 2', 0, 321.62, 616.49, 0, 0)
    (5, 2, 1, 1409, 18800)
    (6, 2, 2, 1328, 18768)
    (7, 2, 3, 1337, 17272)
    (8, 2, 4, 1302, 18812)

These measures combine the time it takes to configure, make, run make check, and run the benchmark on the package. In the results above, group 1, which contains the flags that are always on/off at every O-level, produced the fastest wall time, while group 4's "off" options were fastest in user+system time. We don't care much about user+system time, though, because we want to measure the time a person with a stopwatch would see, not the computer's CPU cycles. This suggests that MPICH compiles fastest with the defaults, without any additional compiler flags, at least on this x86_64 machine. To benchmark more deeply, I would add the compiler options from the other groups to group one, one by one, and observe the effect on MPICH's performance.