User:Sbbhuva/sandbox/Program Level Parallelism

From Wikipedia, the free encyclopedia

Program level parallelism is a type of parallelism that builds on the concepts of task-oriented parallelism and multi-threading. Multiple processors are given different blocks of code, which helps the overall task complete in a shorter duration of time. When a program is to be parallelized, the programmer needs to decide how large the chunks of work assigned to the different processors should be. By assigning the work to different processors in this way, the total time required to execute the program can be reduced, which results in an increase in the throughput of the system.[1]

Background


With the advent of multi-processor systems and intense competition in the market, the emphasis has always been on building systems that are faster, more power efficient and require fewer instructions. This is where the concepts of parallelism come in: multiple processors can be used to work on different blocks of a program and hence complete the given task in a shorter amount of time.

The execution time (T) of a program is mathematically defined as the product of the number of instructions executed (instruction count), the average number of clock cycles per instruction (CPI) and the time required for each clock cycle:

T = Instruction count × CPI × Clock cycle time [2]
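
As an illustration with assumed numbers (not taken from the cited source), a program that executes 10^9 instructions with an average CPI of 2 on a 1 GHz clock (cycle time of 1 ns) takes

T = 10^9 × 2 × 1 ns = 2 seconds

Halving either the instruction count or the CPI would therefore halve the execution time.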

However, looking at the factors that affect the execution time, we realize that lowering the clock cycle time is not always feasible, as there is a threshold beyond which the clock frequency cannot be increased with the current manufacturing process. Also, power consumption rises with increasing frequency, which can cause the device to overheat and burn out.

The other two factors, the number of instructions and the CPI, are continuously being improved through changes in the architecture and design of processors. The common thread that binds these improvements is parallelism.

Program level parallelism is one way of improving this performance.

Description


Program level parallelism, as the name suggests, is achieved by splitting a program into different parts, each executed by a different processor.

Consider the example of matrix multiplication.

[C] = [A] * [B]

If matrix [A] is m*n and [B] is n*d, then by the properties of matrix multiplication the resultant matrix [C] will be an m*d matrix. The following is generalized code for carrying out matrix multiplication:

for (int i = 0; i < m; i++)
    for (int j = 0; j < d; j++)
    {
        int sum = 0;
        for (int k = 0; k < n; k++)
            sum = sum + a[i][k] * b[k][j];
        c[i][j] = sum;
    }

Analyzing the code for matrix multiplication, we realize that there is no dependency across the first and second loops while calculating the value of sum. This is because the computation of sum for a given value of i and a given value of j does not depend on any other (i, j) pair. This is not the case with the k loop: the value of sum changes on every iteration of the third loop (k). Thus, the work of the first two loops can be distributed across different processors. However, there is a true dependency between the iterations of the k loop, so those iterations cannot be executed on multiple processors (i.e. the same processor needs to execute that part of the task).

Suppose both [A] and [B] are 2*2 matrices, so [C] will also be a 2*2 matrix. Let us assume that we have 4 processors which can be used to execute this code in parallel. Each processor can then be used to calculate one element of the result matrix. The code can be divided among the 4 processors in the following way (using 1-based matrix indices):

Processor 1:
int sum1 = 0;
for (int k = 1; k <= 2; k++)
    sum1 += a[1][k] * b[k][1];
c[1][1] = sum1;

Processor 2:
int sum2 = 0;
for (int k = 1; k <= 2; k++)
    sum2 += a[1][k] * b[k][2];
c[1][2] = sum2;

Processor 3:
int sum3 = 0;
for (int k = 1; k <= 2; k++)
    sum3 += a[2][k] * b[k][1];
c[2][1] = sum3;

Processor 4:
int sum4 = 0;
for (int k = 1; k <= 2; k++)
    sum4 += a[2][k] * b[k][2];
c[2][2] = sum4;

Thus, a computation that needed three nested loops now needs only a single loop on each processor, and it completes more quickly because all the elements are computed in parallel.
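
The same four-way split can also be written with threads. Below is a minimal sketch using POSIX threads and 0-based C array indexing (an assumed implementation detail, not part of the original example); each thread plays the role of one processor and computes a single element of [C]. It can be compiled with a command such as gcc -pthread.

/* Minimal sketch: one thread per element of the 2x2 result matrix c = a * b.
 * The sample matrices and the use of pthreads are assumptions for illustration. */
#include <pthread.h>
#include <stdio.h>

#define N 2

static int a[N][N] = {{1, 2}, {3, 4}};
static int b[N][N] = {{5, 6}, {7, 8}};
static int c[N][N];

/* Each worker receives the (row, col) of the element it must compute. */
struct cell { int row, col; };

static void *compute_cell(void *arg)
{
    struct cell *p = arg;
    int sum = 0;
    for (int k = 0; k < N; k++)
        sum += a[p->row][k] * b[k][p->col];
    c[p->row][p->col] = sum;
    return NULL;
}

int main(void)
{
    pthread_t tid[N * N];
    struct cell cells[N * N];

    /* One thread per element, mirroring the four-processor split above. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int idx = i * N + j;
            cells[idx].row = i;
            cells[idx].col = j;
            pthread_create(&tid[idx], NULL, compute_cell, &cells[idx]);
        }

    /* Wait until every element has been computed. */
    for (int t = 0; t < N * N; t++)
        pthread_join(tid[t], NULL);

    for (int i = 0; i < N; i++)
        printf("%d %d\n", c[i][0], c[i][1]);
    return 0;
}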

Scenarios


Iterative code within a loop


A code segment consisting of loop iterations without any data dependencies between the iterations is one scenario where code can be parallelized at the program level. In such a scenario, the values associated with one index do not depend on the values at the succeeding or preceding indices. Thus every iteration can be executed as a separate unit of work, with all of them running in parallel. [2]

Example

Consider the following piece of code, having a loop over i:

for (i = 0; i <= n; i++)
{
    c[i] = a[i] + b[i];  /* code segment S1 */
    f[i] = d[i] + e[i];  /* code segment S2 */
}

In code segment S1, the value computed in iteration i does not depend on the previous iteration (i-1) or the next iteration (i+1), which means the loop has no data dependencies. Thus all the iterations, from i = 0 to i = n, can be computed in parallel.

Independent processors can therefore be assigned to calculate the values of c[0], c[1], c[2], ..., c[n], helping the system perform the task more quickly.

c[0] = a[0] + b[0];

c[1] = a[1] + b[1];

c[2] = a[2] + b[2];

…..

c[n] = a[n] + b[n];

All of the above computations can run in parallel, as there is no dependency between any of them.
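
As a practical sketch of how such a loop might be parallelized, the following uses OpenMP (an assumed tool choice, not mentioned in the original); the parallel for directive lets the runtime hand different iterations of the independent loop to different processors.

/* Minimal OpenMP sketch (assumed tooling): because the iterations are
 * independent, the parallel for directive may distribute them across
 * processors. Compile with an OpenMP-enabled compiler, e.g. gcc -fopenmp;
 * without it the pragma is ignored and the loop simply runs sequentially. */
#include <stdio.h>

#define N 8

int main(void)
{
    int a[N], b[N], c[N], d[N], e[N], f[N];

    for (int i = 0; i < N; i++) {   /* set up some sample input values */
        a[i] = i;
        b[i] = 2 * i;
        d[i] = i + 1;
        e[i] = 3 * i;
    }

    #pragma omp parallel for        /* iterations may run on different processors */
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];         /* code segment S1 */
        f[i] = d[i] + e[i];         /* code segment S2 */
    }

    for (int i = 0; i < N; i++)
        printf("c[%d] = %d, f[%d] = %d\n", i, c[i], i, f[i]);
    return 0;
}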

Parallel compilation of files


A compiler is responsible for the efficient execution of programs on a given computer architecture. During the compilation phase, the compiler translates the high-level language into machine language. Thus, for a program made up of a combination of files, multiple processes can compile the different source files simultaneously, provided the files do not depend on each other. [1]

Example

Pmake can be used to build several targets at once, i.e. multiple files can be compiled at the same time provided they do not have any dependencies among themselves. Consider a program with three ".c" files named a.c, b.c and c.c. With pmake, all three files can be compiled at the same time, producing the object files a.o, b.o and c.o corresponding to each of the ".c" files.[1][3]

Realization


Different processors executing different task


Implementation

A multiprocessor system in which each processor executes a different task, with all of the processors connected to a single bus, is one method of implementing program level parallelism.

Figure 1: Multiple processors, each with its own cache, connected to a single shared bus.

As can be seen in the adjoining figure, multiple processors, each with its own cache, execute different programs, speeding up the total computation. This form of parallelism is mostly seen in servers employing four to eight processors. [4]

Limitations and challenges

There is a possibility of an uneven load across the processors, which might lead to some processors doing more work than others. To handle such issues, load balancing techniques are used to make sure that the work is spread evenly across all the processors.[1]

Single program (task) spread across multiple threads


Implementation

A program executing on a single processor can be sped up by distributing it across multiple threads running on multiple processors. With all the threads running concurrently on different processors, the overall execution time is reduced. [4]

In the adjoining figure, Thread A, Thread B and Thread C are all executing concurrently.
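
As a minimal sketch of this idea (an assumed implementation using POSIX threads, not taken from the cited patent), a single task, summing an array, can be split across three threads that play the roles of Thread A, Thread B and Thread C; compile with a command such as gcc -pthread.

/* Assumed illustration: each thread sums one third of the array, and the
 * main thread combines the partial results once all threads have finished. */
#include <pthread.h>
#include <stdio.h>

#define N 9
#define THREADS 3

static int data[N] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
static long partial[THREADS];            /* one partial sum per thread */
static int ids[THREADS] = {0, 1, 2};

static void *sum_chunk(void *arg)
{
    int id = *(int *)arg;                /* which third of the array to sum */
    for (int i = id * (N / THREADS); i < (id + 1) * (N / THREADS); i++)
        partial[id] += data[i];
    return NULL;
}

int main(void)
{
    pthread_t tid[THREADS];
    long total = 0;

    for (int t = 0; t < THREADS; t++)    /* start Thread A, B and C */
        pthread_create(&tid[t], NULL, sum_chunk, &ids[t]);

    for (int t = 0; t < THREADS; t++) {  /* wait for all, combine results */
        pthread_join(tid[t], NULL);
        total += partial[t];
    }

    printf("total = %ld\n", total);      /* prints: total = 45 */
    return 0;
}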

Limitations and challenges

Creating enough threads so that all the available cores are optimally utilized is one of the challenges faced by software developers. If too many threads are created, heavy rescheduling and context switching can result. Moreover, programs developed using a fixed threading model cannot easily adjust to changes in the multi-core system architecture.[5]

Shortcomings


Limitations on speed-up


How effectively a program can be distributed amongst multiple processors, so that each processor shares an equal load, is the most important question when talking about parallelism. The extent to which parallelism is possible therefore depends mainly on the program being executed. As per Amdahl's law, if a program is parallel only in short bursts and the majority of it is sequential code, it will theoretically need many processors to achieve a significant speed-up. If the number of processors is limited, some operations will be delayed compared to others; however, this delay adds to the execution time only if the delayed operation lies on the critical path.[6]
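
In its usual statement, Amdahl's law bounds the speed-up obtainable with N processors when a fraction P of the program can be parallelized (the formula below is the standard textbook form, not reproduced from the cited paper):

Speed-up(N) = 1 / ((1 - P) + P / N)

For example, if only half of a program can run in parallel (P = 0.5), then even with an unlimited number of processors the speed-up can never exceed 1 / (1 - 0.5) = 2.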

Data dependencies


Executing a program in parallel is often limited by the data dependencies that sections of the program have on one another. Because of these dependencies, one processor might have to wait until another processor completes its execution, so synchronization signals come into play, which increases the communication overhead. A further limitation is that loops whose iterations carry dependences, such as those requiring DOACROSS or DOPIPE style execution, cannot simply be split across processors at the program level in the way described above.[1]
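
As an assumed illustration (not taken from the cited source) of a loop-carried true dependence, in the following loop each iteration reads the value written by the previous one, so the iterations cannot safely be handed to different processors without synchronization.

/* Assumed example: a[i] depends on a[i-1], so the iterations of this loop
 * form a chain and cannot simply be distributed across processors. */
#include <stdio.h>

#define N 5

int main(void)
{
    int a[N] = {1, 0, 0, 0, 0};
    int b[N] = {1, 2, 3, 4, 5};

    for (int i = 1; i < N; i++)
        a[i] = a[i - 1] + b[i];   /* reads the result of the previous iteration */

    for (int i = 0; i < N; i++)
        printf("%d ", a[i]);      /* prints the running sums: 1 3 6 10 15 */
    printf("\n");
    return 0;
}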

See Also

1) Data dependencies

2) Types of parallelism

3) Pmake

4) Parallelism

References

  1. Solihin, Yan (2016). Fundamentals of Parallel Multicore Architecture. Boca Raton, FL: CRC Press, Taylor & Francis Group.
  2. "3 High Performance Computer Architecture". www.phy.ornl.gov. Retrieved 2016-09-05.
  3. Shotts, William (July 28, 2016). The Linux Command Line. No Starch Press.
  4. Dowling, Eric M. (Jan 2, 2001). Apparatus and method for program level parallelism in a VLIW processor. Retrieved 2016-09-05.
  5. Drepper, Ulrich (May 27, 2010). Methods and systems for managing program-level parallelism. Retrieved 2016-09-05.
  6. Theobald, K. B.; Gao, G. R.; Hendren, L. J. (1992). "On The Limits Of Program Parallelism And Its Smoothability". Proceedings of the 25th Annual International Symposium on Microarchitecture (MICRO 25): 10–19. doi:10.1109/MICRO.1992.696993.