Friday, December 18, 2015

Software Portability and Optimization course review


The Software Portability and Optimization course have been a rough, this course is full of a lot of information. Everything that I learned was new to me and most of that was interesting and additionally Chris Tyler told us about some handy tools beside course material. In the beginning I regretted taking this course but it turned out an interesting class and when you email Chris with problem he replies within a day to keep us moving so we don't get stuck on our work.

Things that I liked most is working with shell scripting, learning new things about Linux, improving programs using algorithms and compilation process (compiler options). Chris gave and showed samples on how he did the work which got us started. Before this course when I wrote code I'd try to improve it by removing unnecessary variables and code, trying to consume minimal memory as possible but these things never made difference because I learned that the compiler readjusted the code to improve it. So, the things that I did to improve the code the compiler does it automatically therefore, from now on I'll just work with compiler options to improve my code other than manually doing it.

Vectorizing

Everyone wants to write their code that gives better performance and one of the processes which slows down program performance is loops. If you are writing a loop that loops through each element in array/s and you want to optimize it to get better performance, you could use vectorization to get best performance on the loop. Vectorization is the process of rewriting a loop so that instead of processing a single element of an array N times, it processes (say) 4 elements of the array simultaneously N/4 times.

Ex:
void copy(char* dest, char* source, long size) {
int i;
for (i = 0; i < size; i++)
dest[i] = source[i];
}
The code above copies data from one location into another, the performance wouldn't matter is the size is small (like: 100 - 1000) but if the size was high (millions) then you can see the effect but by using vectorization the program copies N elements (lets say 4) at cost of one element copy in code above only if the memory region of destination and source aren't overlapping. To apply vectorization on your program you could do is compile your program with -03 flag. If the loop is guaranteed to operate on non-overlapping memory regions then you could add “#pragma ivdep” on top of loop which informs compiler to ignore vector dependencies and the vectorizing will be applied to your loop with even better performance then -03 because the vectorization checks will not run but if the loop operates on overlapping memory then your program will give false results.
The difference between loop unrolling and vectorization is loop unrolling will do 4 (4 as an example) separate operations in one index increment but in vectorization it will copy 4 (4 as an example) data and place it in destination in 1 operation. It's easy to show it in code, so here it is:

Loop Unrolling
for (int i=0; i < size; i += 4){
       dest[i] = source[i];
       dest[i + 1] = source[i + 1];
       dest[i + 2] = source[i + 2];
       dest[i + 3] = source[i + 3];
}

Vectorization
for (int i=0; i < size; i += 4){
       copyFourThingsAtOnce(&dest[i], &source[i]);
}

Assembly Language

Writing assembler code is very difficult, it's difficult because you have to write lowest level programming language, you have to write code specific to architecture of machine, 32bit or 64bit machines.

Take a look at following assembly code written for x86_64 machine. This code loops until start reaches to value in max and prints out each value.

.globl    _start    /* must declare for program to know where to start program, it's just says which is main function. */

.set     start, 0     /* starting value for the loop index */
.set     max, 10    /* loop exits when the index hits this number (loop condition is i<max) */
_start:
mov     $start,%r15    /* loop index */
loop:
mov     %r15,%r14     /* copy to r14 for ascii conversion */
mov     $'0',%r13        /* loding value of start into r13 register */
add      %r13,%r14     /* adding r13 and r14 values and result is store into r14 register */
movq   $digit,%rsi       /* message location, rsi is register source index */
movb   %r14b,(%rsi)  /* write byte to start of message */
movq   $len,%rdx       /* message length, rdx is register d extended */
movq   $msg,%rsi      /* message location, rsi is register source index */
movq   $1,%rdi          /* file descriptor stdout, rdi is register destination index */
movq   $1,%rax         /* syscall, sys_write, rax is register a extended */
syscall

inc       %r15              /* increment register 15 */
cmp     $max,%r15    /* see if we're done */
jne       loop               /* loop if we are not */

mov     $0,%rdi         /* exit status */
mov     $60,%rax      /* syscall sys_exit */
syscall 

.data
msg:    .ascii    "Loop: #\n"   /* message text - # is number of index */
digit = . - 2                           /* memory address for digit */
len = . - msg                         /* length of message */

The code belongs to Chris Tyler and some of the comments belongs to him as well, it's quite difficult to understand some of the steps. This code is able to handle only single digit value for being printed. All the registers that were used in code above are for only x86_64 machine, for Aarch64 there are different set of registers. Take a look at following table for some basic difference between x86_64 and Aarch64 machine.


X86_64 Aarch64
R8 to r15 are general registers r0 to r30 are general registers
rax, rbx, rcx, rdx, rbp, rsp, rsi, rdi are specialied regisers -
64-bit registers using the 'r' prefix: rax, r15

32-bit registers using the 'e' prefix (original registers: e_x) or 'd' suffix (added registers: r__d): eax, r15d

16-bit registers using no prefix (original registers: _x) or a 'w' suffix (added registers: r__w): ax, r15w

8-bit registers using 'h' ("high byte" of 16 bits) suffix (original registers - bits 8-15: _h): ah, bh

8-bit registers using 'l' ("low byte" of 16 bits) suffix (original registers - bits 0-7: _l) or 'b' suffix (added registers: r__b): al, bl, r15b
x0 through x30 - for 64-bit-wide access registers w0 through w30 - for 32-bit-wide access registers
Openrations like add, divide, multiply takes two parameters and operation result is stored in second parameter. Openrations like add, divide, multiply takes three parametrs and result is stored in first parameter and other two are for operation values.
Many differences.. Many differences..

One of the good thing about writing assembly code is when you compiler it and analyze executable file using objdump command you will not see any code other than what you have written but when you write C/C++ code and analyze it' executable there is a lot of other code that you didn't write, that code is coming from your includes but since you don't need to include anything in assembly code so what you write is what you get. This assembly program is written in .s type extension and you just run:
ls -o name.o file.s
ld -o executable-name name.o

You can also write/add assembly code in your C/C++ program using asm function, ex: asm("add $1, %0"). And to compile this you can just using gcc or g++ compiler.

Thursday, December 17, 2015

Benchmarking MPICH package

In this post I'm going to write about benchmarking MPICH software using the framework we designed in Software Portability and Optimization class, Here is the link to get to the framework: https://github.com/pk400/SPO600-Build-Framework. MPICH is a high performance and widely portable implementation of the Message Passing Interface (MPI) standard, it's being used exclusively on nine of the top 10 supercomputers.

I have spent a lot of time on getting this test framework run with my MPICH package and I faced many problems as well. First of all when I edited build plugin file to change CFLAGS to use compiler options selected in test framework and build my project the build fails and I didn't know why until the professor (Chris Tyler) looked at it and told me that exporting to CFLAGS variable is breaking the build. I looked into it and I wasn't sure why it was causing it to fail and I still don't know, I checked that the configure file and make file uses CFLAGS variable. I found the a way around by setting compiler options directly to the variable used for build but before that I unset the CFLAGS variable after calling configure command because configure command was adding compiler options to CFLAGS variable and in make command it uses CFLAGS variable to assign it into variable that has flags for building package which is MPICH_MAKE_CFLAGS. While I was figuring out this I found out that MPICH library was being build from CFLAGS and appended other compiler options in configure file as will. So by unsetting the variable CFLAGS after calling configure file I was able to build MPICH package and it's libraries with compiler flags chosen by the permutation, which I wanted.

For the tesing reason I used -02 and -03 as trial run for my package. The results on x86_64 system is:

Permutation columns: (uniqueId, givenId, flags, buildExit, buildWallTime, buildUserSystemTime, buildSize, testExit)
Benchmark columns: (uniqueId, permutationUniqueId, givenId, speedScore, memoryScore)

(1, 1, '-02', 0, 274.37, 703.28, 0, 0)
    (1, 1, 1, 702, 17016)
    (2, 1, 2, 702, 18736)
    (3, 1, 3, 704, 18796)
    (4, 1, 4, 701, 18796)

(2, 2, '-03', 0, 224.73, 598.86, 0, 0)
    (5, 2, 1, 819, 18796)
    (6, 2, 2, 825, 18776)
    (7, 2, 3, 820, 18764)
    (8, 2, 4, 820, 18784)

The results with test framework compiler options are:
There was an issue with first group because it has all the options that stay on in all O-levels and options that are kept off in all levels (-01, -02, -03) and because of that there was no delimiter causing the permutation to fail after first group. So to get results from all the groups I had to run the framework with group one only and remove group one from config file so other groups are used for building package in different set of compiler options. In the benchmarking results I'll only specify group name because there are a lot of flags being used, to see what compiler options are being used go to: https://github.com/pk400/SPO600-Build-Framework and look at monkeys10k.config file.

(1, 1, 'group1', 0, 256.45, 697.5, 0, 0)
    (1, 1, 1, 702, 18760)
    (2, 1, 2, 703, 18888)
    (3, 1, 3, 701, 18796)
    (4, 1, 4, 704, 18720)

(1, 1, 'group2 permute: 1', 0, 487.48, 729.73, 0, 0)
    (1, 1, 1, 1373, 18804)
    (2, 1, 2, 1424, 18800)
    (3, 1, 3, 1405, 18780)
    (4, 1, 4, 1361, 7356)

(2, 2, 'group2 permute: 2', 0, 470.55, 630.61, 0, 0)
    (5, 2, 1, 1248, 14856)
    (6, 2, 2, 1267, 16396)
    (7, 2, 3, 1268, 18768)
    (8, 2, 4, 1338, 17056)

(1, 1, 'group3 permute: 1', 0, 550.99, 734.2, 0, 0)
    (1, 1, 1, 1555, 16392)
    (2, 1, 2, 1660, 14876)
    (3, 1, 3, 1659, 18700)
    (4, 1, 4, 1653, 18764)

(2, 2, 'group3 permute: 2', 0, 435.94, 621.67, 0, 0)
    (5, 2, 1, 703, 18804)
    (6, 2, 2, 704, 18820)
    (7, 2, 3, 699, 18836)
    (8, 2, 4, 707, 18772)

(1, 1, 'group4 permute: 1', 0, 368.79, 714.64, 0, 0)
    (1, 1, 1, 1292, 18776)
    (2, 1, 2, 1303, 18744)
    (3, 1, 3, 1266, 18816)
    (4, 1, 4, 767, 18776)

(2, 2, 'group4 permute: 2', 0, 321.62, 616.49, 0, 0)
    (5, 2, 1, 1409, 18800)
    (6, 2, 2, 1328, 18768)
    (7, 2, 3, 1337, 17272)
    (8, 2, 4, 1302, 18812)

This measures are a combination of time it takes to configure, make, run make check, run benchmark on package. In the results above the group 1 contains flags which are always on/off in all levels made fastest test in wall time but group 4's off options were fastest in user systems time but we don't care about user system time because we want to measure the time user/person (stop watch) not computer's cpu cycles or etc. This just proves that the MPICH program compiler faster on defaults without any additional compiler flags because the once that are set by default is faster in this case on x86_64 machine but I would say to benchmark it more deep I would like to add compiler options of other groups (not one) to group one, one by one, and see performance on MPICH software.

Sunday, November 8, 2015

Algorithm Selection

A couple of weeks ago, we were asked to write two different approaches to adjusting the volume of sound samples. The two approaches we were asked to write was:
  • Scaling the values of sound sample by multiplying value by a scaling factor.
  • Scaling the values of sound sample by multiplying value by a scaling factor and storing it into a lookup table. So, when the program needs the data it will provide it from lookup table until the scaling factor isn’t changed.

To analyze the performance I’ll be using the Archie system (CDOT's ARMv8 - AArch64) and Xerxes system (CDOT's x86_64). I’m going to be using data type int16_t (16 bits) to represent a sound sample and providing 5,000,000 samples which adds up to 10MB of sound sample’s volume to adjust.
To calculate the performance of each approach I first made a dummy run of the program to get its performance without adjusting the volume. In this the program just creates a 10MB of sample data and prints out all its values.  Then I calculated the performance on each approach of volume adjustment. I made five runs on each approach on each system.

Performance on Archie:
Tests
1
2
3
4
5
Dummy run
10.86s (11392 KB)
7.79s  (11328 KB)
10.82s  (11328 KB)
7.49s  (11328 KB)
10.82s  (11392 KB)
Multiplication
9.54s  (11328 KB)
9.50s  (11392 KB)
12.15s  (11392 KB)
9.06s  (11392 KB)
11.99s  (11392 KB)
Lookup Table
7.96s  (11584 KB)
11.39s  (11584 KB)
7.37s  (11584 KB)
11.39s  (11648 KB)
11.42s  (11584 KB)

Performance on Xerxes:
Tests
1
2
3
4
5
Dummy run
0.86s  (12416 KB)
0.85s  (12360 KB)
0.87s  (12368 KB)
0.85s  (12364 KB)
0.93s  (12392 KB)
Multiplication
0.92s  (12308 KB)
0.90s  (12416 KB)
0.91s  (12316 KB)
0.91s  (12424 KB)
0.86s  (12472 KB)
Lookup Table
0.93s  (12448 KB)
0.95s  (12496 KB)
0.92s  (12524 KB)
0.94s  (12608 KB)
0.92s  (12540 KB)

As you can see results from Archie system is inconsistent in terms of run time, but the memory usage of the program is consistent and Xerxes system is providing consistence run time and memory usage. By looking at these stats we can say that Xerxes system is faster in performing these tests.

The average run time on Archie system is: 9.556s on dummy run, 10.448s on multiplication, and 9.906 on the lookup table; and the average run time on Xerxes system is: 0.892s on dummy run, 0.900 on multiplication, and 0.932s on the lookup table. Using the average we can calculate the run time of adjusting volume on Archie system is: 0.892s on multiplication method and 0.350s on lookup table; and for Xerxes system is: 0.008s on multiplication method and 0.04s on lookup table method. You can also see that Xerxes system consumes more memory than Archie system. So, based on these results I would say that the approach of adjusting volume depends on the system you are using.

Saturday, November 7, 2015

File Transferring

A couple of days ago I wanted to transfer some source files on Archie (CDOT's ARMv8 - AArch64 system) and Xerxes (CDOT's x86_64 system) machines and run my program and analyze the result between these systems. When I regularly connect to these systems it uses my private key generated on my Linux machine as password and provides me access to Archie and Xerxes systems. I was having problems making a file transfer to these systems but after some research and help from professor (Chris Tyler) I learn few more interesting things as well.

Transferring file to Xerxes system was easy, If I’m on College’s network then I can just use “scp test.cpp (filename) (username)@xerxes.internal.cdot.systems:(destination)”. Xerxes server is only access able from Seneca’s network so, if you want to connect to Xerxes server from outside you can use other system which can be access able from outside like Archie and Matrix and using that you can connect to Xerxes system but you have to transfer the private key to these systems. Also after connection name if you do not include “: (destination)” the file will not be transferred it will just make copy of that file and put it in current directory with connection name as file name.

When you want to transfer file on Archie machine it’s little complicated because you have to specify the port number. Archie machine connection is set up to connect via different system so you have to connect to it using “ssh –p 2200 (connection name)”. Also for Archie there are two different connection names, one for connecting from outside College network (ehl.cdot.systems) and another for connecting from within Seneca network (ehl.internal.cdot.systems). so if you want to transfer file on Archie system you can use “scp –P 2200 test.cpp (filename) (username)@(connection name either internal or external):(destination)”. Also keep in mind that because Linux is case sensitive the scp command uses capital ‘P’ to specify port number.

If you have your private key protected by a passphrase and every time you want to connect to different machines you use your private key and you have enter your passphrase to unlock private key. If you use ssh-agent and add your private key to that agent and by doing that you can connect to multiple servers without entering your passphrase and also keeping you private key secured at the same time. You can do that by entering command “eval $(ssh-agent)” the add your private key by entering “ssh-add ~/.ssh/id_rsa (path to private key)” and you will be prompt to enter your passphrase and enjoy.

Finally, there are many other services you can use to transfer files like: ftp, sftp, filezilla (GUI), wget, and more.

Thursday, November 5, 2015

Open Source Community


There are a lot of open source software’s out there and a lot of programmers come to gather to make that software. One of the open source software I looked at was Front Accounting which is licenced under GNU General Public License. This license basically allows users the freedom to run, study, share, and modify the software

How to Contribute to Software
To begin contributing to this software you can head over to Front Accounting’s wiki page which gives you the information on the project, like: different ways to obtain project source code, links to forums/discussions, and more.  On their forum page you can find out the project broken down into many little pieces and how are they planning to implement it, place for reporting bugs, wish list, modules discussions, and even job offers.
Bug Tracker
On the forum page there is a place to report bugs and people discuss on how to fix these but the primary system is Mantis bug tracking system which is allowed for registered users only but you can make an account easily. Mantis system is an issue tracker that provides a delicate balance between simplicity and power.
Bug fixes/Patches
Eventually someone who will be fixing bugs will be discussing their progress and on forum or on the Mantis bug tracking system and many people would be testing to see if the patch fixes these problems or not and according to these results that patch will be used in project.
Community
The community can jump into discussions or issues and provide solutions. The source code is made available on GitHub, SourceForge, BitBucket and with different versions/modules. On the Forum page there are over 7,000 registered users and they have discussed in over 4,000 topics about Front Accounting software.



Second open source software I looked at was Spam Assassin licenced under The Apache Software Foundation. Apache license agreement allows users the freedom to run, study, and share. The licence agreement must be sign if you want to modify the software.
How to Contribute to Software
To begin contributing to this software you can head over to Spam Assassin wiki page where you can learn about the software, join to mailing list, look at reviews, and etc. There is also a link for developers where it gives you information about joining to development spam assassin software. You can find information on general rules like coding style, development rules, how to get source code, and etc. You will have to join a mailing list in order to get latest updates on software. Contributors are divided in to different roles, like: contributors (anyone can provide feedback, submit bug reports, or submit patches), committers (a committer is simply an individual who was given write access to the code base), PMC Members (the project management committee is responsible for managing a project), and finally there are users who uses the software.
Bug Tracker
There is a forum page where all the problems are reported and discussed by developers. You have to be registered to report bugs and collaborate on solving these problems.
Accepting Patches
Eventually someone will come around to fix the bugs reported in software, they will be discussing their progress on bug tracker system and when they are done they will be submitting it. According to development mode page some one will be reviewing the code and after that they will be testing it.

Sunday, October 4, 2015

Compiler Options

A few weeks ago, we learn about gcc compiler and few of it's options, using these options your compiler will change/move your code around to make your code for better performance. I'm going to discuss about GCC compiler options which I researched on for my presentation for my Software Portability and Optimization course.

-ftree-dce is one of the option which I've chose for my research. -ftree-dce performs dead code elimination (DCE) on trees. This flag is enabled by default at -O and higher.

Example: 
int main(){
    int t = 5 + 5; //dead code
    int r = 3 + 3;
    printf(“Result is %d”, r);
}

In this example variable ‘t’ is considered a dead variable because it takes result of two numbers but the value of ‘t’ is never used. It isn't going to affect the final result so, calculating 5 + 5 and assigning the result to a variable will consume resources and space for something that can be avoided. Dead Code Elimination checks for conditions that will never be true and unnecessary calculations on variables which are not affecting the final result and it will remove that block of code before it converts the code into machine language which will provide better performance in your program by eliminating steps that are unnecessary.

Hence compiler option is enabled by default at -O level, so the dead code elimination option is always in use, unless you compile with –fno-tree-dce but still there are other similar options which eliminates dead code.
These are: -fdump-rtl-dce, -fdump-rtl-dce1, -fdump-rtl-dce2 and –ftree-builtin-call-dce which gives warning on unused variables it's default enabled at -O2  

-fmove-loop-invariants is another compiler option which I've chose for my research. -fmove-loop-invariants enables the loop invariant motion pass in the RTL loop optimizer and its enabled at level -O1. Its purpose is to find any value or expression that remains the same on all iterations of the loop. By substituting out the expressions which are unchanged on all iteration the computer will have less number of operations to perform then original.

Example:

Loop invariants transform following code:
             for (int i = 0; i < n; i++) {    
                   x = y + z;
                   array[i] = 6 * i + x * x;
             }
 to:
              Int x = y + z;
              Int t = x * x;

              for (int i = 0; i < n; i++){ 
                    array[i] = 6 * i + t;
              }
Let's calculate number of operation in each code if n is 10. For untransformed code it will be 10 iterations * (2 for x=y+z + 4 for array[i] = 6 * i + x * x ) which is 60 operations and if we look at transformed code it will be 4 for code before loop + 10 iterations * (3 for array[i] = 6 * i + t) which is 34 operations. So, in small code as this we have almost half the number of difference and which means less processing for computer and you could have better performing applications.

SPO600 Lab 2 - Examining compiler options

SPO600 - Lab 2

Modifying C program and compiling the program with gcc options and examining the output. Link to Lab is: SPO600_Compiled_C_Lab. Everything shown in this page has been ran on aarchie 64 machine.

Simple C - hello world program.

int main(){
    printf("Hello World!");
}

Compiling Using: gcc -g -O0 -fno-builtin hello.c
Examining Executable File Using:
objdump -f -s -d --source (executable filename)
Executable File Size: 9569 bytes

1. Add the compiler option -static. Note and explain the change in size, section headers, and the function call.
 
Executable File Size: 681707 bytes

Using the objdump command to determine changes gcc compiler did using -static option tell us that there was no changes made on how the main function is going to run, it just prevented linking with shared gcc libraries. This makes compiler copy all required information from gcc libraries to the executable file.

2. Remove the compiler option -fno-builtin. Note and explain the change in the function call.


Executable File Size: 9182 bytes

One small changed has been made in executable file, In original executable the printf() function is called to print
the line/it makes a one line function results in smaller and faster code but without -fno-builtin option the printf()
is using puts@plt option it puts printf() function on stack so it will be called later.

3. Remove the compiler option -g. Note and explain the change in size, section headers, and disassembly output.
Executable File Size: 8192 bytes

The executable file doesn't contain information on what each line does in main function, -g option adds debuggig information
to the executable file.

4. Add additional arguments to the printf() function in your program. Note which register each argument is placed in.


Executable File Size: 9456 bytes

Modified print line: printf("Helo World! %d, %d, %d, %d, %s, %f, %f, %f, %f, %s\n", 10, 20, 30, 40, "fifty", 59.99, 69.99, 79.99, 89.99, "Hundred");
I added multiple intergers, floats and strings arguments to printf funtion and the compiler uses ldr to create floating variables and fmove
pushes stores these variable into memory and adrp makes a string variables and add pushes/adds these varibles to memory.

5. Move the printf() call to a separate function named output(), and call that function from main(). Explain the changes in the object code.

Executable File Size: 9359 bytes

The executable file has few steps which is doing the same thing in both function like: stp and mov steps after function calls and
ldp and ret after function closing. Two bl (callq) are being made one to output function from main and then another to printf function
from output function these extras and repeting steps are made, for this scenario the function calls are too many then actucal task(printf) thats why this is inefficient. Having an extra function adds more work to the program and it provides less performance.

6. Remove -O0 and add -O3 to the gcc options. Note and explain the difference in the compiled code.

Executable File Size: 10616 bytes

The main function doesn't have the steps stp, mov, ldp and ret for main function because in main function is calling printf function and nothing else so the compiler doesn't save information about main function. Before it use to make callq to printf function and then comes back to main functions to finish its steps but now it just jumps to printf function and doesn't come back to make closing statements on main like ldp and ret because there isn't anything other then printf function so the return from printf function will go to operating system rather then going back to main function and main function returning to operating system. Option -O3 is optimized for program speed over memory space, the compiler will apply all optimiations to make it's speed better without worrying about memory space.

Sunday, September 13, 2015

Introduction

Developing softwares to run on new architectures and optimizing it's performance.