# Gang scheduling

In computer science, gang scheduling is a scheduling algorithm for parallel systems that schedules related threads or processes to run simultaneously on different processors. Usually these are threads belonging to the same process, but they may also come from different processes, for example when the processes have a producer-consumer relationship or when they all come from the same MPI program.

Gang scheduling ensures that if two or more threads or processes communicate with each other, they are all ready to communicate at the same time. Without gang scheduling, one of them could wait to send or receive a message while its partner is sleeping, and vice versa. When processors are over-subscribed and gang scheduling is not used within a group of communicating processes or threads, each communication event can suffer the overhead of a context switch.

Gang scheduling is based on a data structure called the Ousterhout matrix. In this matrix each row represents a time slice, and each column a processor. The threads or processes of each job are packed into a single row of the matrix.[1] During execution, coordinated context switching is performed across all nodes to switch from the processes in one row to those in the next row.
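The matrix described above can be sketched in a few lines of Python. This is only an illustration: the names `pack_job` and `row_for_slice` are hypothetical, and the rule of reusing the first row with enough idle cells is an assumption made for the example, not something the matrix itself prescribes.

```python
from typing import List, Optional

Row = List[Optional[str]]  # one time slice: a cell per processor, None = idle


def pack_job(matrix: List[Row], num_procs: int, job: str, num_threads: int) -> int:
    """Place every thread of `job` in a single row and return that row's index."""
    if not 1 <= num_threads <= num_procs:
        raise ValueError("gang size must be between 1 and the number of processors")
    for index, row in enumerate(matrix):
        idle = [col for col, cell in enumerate(row) if cell is None]
        if len(idle) >= num_threads:
            for col in idle[:num_threads]:
                row[col] = job
            return index
    # No existing row can hold the whole gang, so open a new time slice.
    matrix.append([job] * num_threads + [None] * (num_procs - num_threads))
    return len(matrix) - 1


def row_for_slice(matrix: List[Row], time_slice: int) -> Row:
    """Rows are visited round-robin: all processors switch rows together."""
    return matrix[time_slice % len(matrix)]
```

Because a gang never spans two rows, switching from one row to the next replaces every running gang at once, which is the coordinated context switch mentioned above.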

Gang scheduling is stricter than coscheduling.[2] It requires all threads of the same process to run concurrently, while coscheduling allows for fragments, which are sets of threads that do not run concurrently with the rest of the gang.

Gang scheduling was implemented and used in production mode on several parallel machines, most notably the Connection Machine CM-5.

## Types

### Bag of gangs (BoG)

In gang scheduling, a one-to-one mapping is used, which means that each task is mapped to its own processor. Jobs are usually treated as independent gangs, but with the bag of gangs scheme, all the gangs of a job are combined and submitted to the system together. A job is not complete until all the gangs that belong to the same BoG have completed their execution.[3] As a result, a gang that finishes early has to wait until all the other gangs in its BoG finish, which increases the synchronization delay overhead.

The response time ${\displaystyle R_{j}}$ of the ${\displaystyle j^{th}}$ bag of gangs is defined as the time interval from the arrival of the BoG at the grid dispatcher to the completion of all the sub-gangs that belong to the BoG. The average response time is defined as follows:

Response Time (RT) = ${\displaystyle {\frac {1}{N}}\sum _{j=1}^{N}R_{j}}$.[3]
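As a small worked example of this average, the following Python sketch computes each BoG's response time as the interval from its arrival to the completion of its last sub-gang. The function name and the input layout are assumptions chosen for illustration.

```python
from typing import List


def average_response_time(arrivals: List[float], completions: List[List[float]]) -> float:
    """arrivals[j] is the arrival time of BoG j; completions[j] lists the
    finish times of its sub-gangs. A BoG is done when its last sub-gang is."""
    response_times = [max(done) - arrived for arrived, done in zip(arrivals, completions)]
    return sum(response_times) / len(response_times)
```

For two BoGs arriving at times 0 and 2, with sub-gangs finishing at (5, 9) and (6, 4), the response times are 9 and 4, giving an average of 6.5.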

The response time is further affected when a priority job arrives. A priority job takes precedence over all other jobs, including those currently executing on the processors. When a priority job arrives, the sub-gang that is currently executing is stopped, and all the progress it has made is lost and has to be redone. This interruption further increases the total response time of the BoG.[3]

### Adapted first come first served (AFCFS)

The adapted first come first served (AFCFS) scheme is an adapted version of the first-come, first-served scheme. Under first-come, first-served, whichever job arrives first is forwarded for execution. Under AFCFS, a job that arrives at the system is not scheduled until enough processors are available to execute it.[3] When a large job is at the head of the ready queue but not enough processors are available for it, the AFCFS policy schedules a smaller job for which enough processors are available, even if that job is at the back of the queue. In other words, this scheme favors smaller jobs over larger ones based on processor availability, which leads to increased fragmentation in the system.[3][4]
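A minimal Python sketch of one AFCFS scheduling pass is shown below. It is a simplified model, assuming a single shared queue of `(name, size)` pairs in arrival order and a count of free processors; the function name `afcfs_select` is hypothetical.

```python
from typing import List, Tuple

Job = Tuple[str, int]  # (job name, number of processors required)


def afcfs_select(queue: List[Job], free_procs: int) -> Tuple[List[str], List[Job], int]:
    """Scan the queue in arrival order and start every job that currently fits.

    Returns the names of the started jobs, the jobs still waiting, and the
    number of processors left idle.
    """
    started: List[str] = []
    waiting: List[Job] = []
    for name, size in queue:
        if size <= free_procs:
            started.append(name)
            free_procs -= size
        else:
            waiting.append((name, size))  # skipped, keeps its place in line
    return started, waiting, free_procs
```

With five free processors and a queue of jobs needing 6, 2, and 3 processors, the large job at the head is skipped and the two smaller jobs behind it start, which illustrates the bias toward small jobs.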

### Largest gang first served (LGFS)

In the LGFS scheme, tasks are placed in a queue ordered by job size, and the tasks belonging to the largest gang are scheduled first. This method tends to starve smaller jobs of resources and is therefore unsuitable for systems where the number of processors is comparatively low.[5]
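The ordering used by LGFS can be sketched in Python as follows. The sketch assumes a single queue of `(name, size)` pairs and assumes that a gang which does not fit is skipped in favor of the next largest one; both are simplifications for illustration, and `lgfs_select` is a hypothetical name.

```python
from typing import List, Tuple

Job = Tuple[str, int]  # (job name, number of processors required)


def lgfs_select(queue: List[Job], free_procs: int) -> Tuple[List[str], int]:
    """Consider gangs from largest to smallest and start each one that fits."""
    # sorted() is stable, so gangs of equal size keep their arrival order.
    ordered = sorted(queue, key=lambda job: job[1], reverse=True)
    started: List[str] = []
    for name, size in ordered:
        if size <= free_procs:
            started.append(name)
            free_procs -= size
    return started, free_procs
```

On a small machine the largest gangs consume most of the processors each round, so small gangs are served only from whatever is left over.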

Both AFCFS and LGFS also have to deal with processor failure. In that case, the tasks executing on the failed processor are submitted to other processors for execution. These tasks wait at the head of the queue on those processors while the failed processor is being repaired.

There are two scenarios which emerge from the above issue:[5]

• Blocking case: The processors assigned to the interrupted jobs are blocked and cannot execute other jobs in their queues until the jobs from the failed processors are cleared.[5]
• Non-blocking case: The jobs already executing on the processors are processed early instead of waiting for the blocked jobs to resume execution.[5]

### Paired gang scheduling

When gang scheduling executes I/O-bound processes, the CPUs are kept idle while awaiting responses from the other processors, although those idle processors could be used to execute other tasks. This idle time can be exploited if the characteristics of each gang are known beforehand, that is, if a gang is known to consist of CPU-bound or I/O-bound processes. Such processes interfere little with each other's operation, so algorithms can be defined to keep both the CPU and the I/O devices busy at the same time. Paired gang scheduling does this by matching pairs of gangs, one I/O-bound and one CPU-bound. Each gang can assume that it is working in isolation, because the two gangs use different devices.[6]

#### Scheduling algorithm

• General case: A central node in the network is designated to handle task allocation and resource allocation. It maintains this information in an Ousterhout matrix. In strict gang scheduling, one row is selected at a time, after which a node scheduler schedules a process in the respective cell of that row.[6]
• Paired gang: In paired gang scheduling, two rows are selected instead of one: one for an I/O-bound gang and one for a CPU-bound gang. The local scheduler decides how to allot jobs to the appropriate processors so as to obtain the maximum allowed parallelism.[6]
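The pairing step can be illustrated with a short Python sketch. It assumes each gang has already been classified as `"cpu"` or `"io"`, which the text treats as known beforehand; the function name and the in-order pairing rule are hypothetical choices for the example.

```python
from typing import List, Tuple


def pair_gangs(gangs: List[Tuple[str, str]]) -> List[Tuple[str, ...]]:
    """Match each CPU-bound gang with an I/O-bound gang to share a time slice.

    `gangs` holds (name, kind) pairs with kind "cpu" or "io". Gangs left
    without a partner are scheduled alone, as in ordinary gang scheduling.
    """
    cpu = [name for name, kind in gangs if kind == "cpu"]
    io = [name for name, kind in gangs if kind == "io"]
    pairs: List[Tuple[str, ...]] = list(zip(cpu, io))
    leftover = cpu[len(pairs):] + io[len(pairs):]
    return pairs + [(name,) for name in leftover]
```

Each returned tuple corresponds to the rows selected together for one time slice: two rows for a matched pair, one row for an unmatched gang.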

## Synchronization methods

### Concurrent gang scheduling (CGS)

Concurrent gang scheduling is a highly scalable and versatile algorithm that assumes the existence of a synchronizer using the internal clock of each node. CGS primarily consists of the following three components:[7]

• Processor/memory module (also called a processing element, PE).
• Two-way network that allows one-to-one communication.
• A synchronizer that synchronizes all PEs after a constant interval.

The synchronization algorithm is performed in two stages.[7]

• When the load changes, a dedicated timetable is created by the front-end scheduler.
• Each local scheduler then follows the timetable by switching between the jobs that the front-end scheduler has distributed to it.

The scheme assumes the existence of a synchronizer that sends a signal to all the nodes in a cluster at a constant interval. CGS uses the fact that the most common events on a PC are timer interrupts, and uses them as the basis of the internal clock.[7]

• A common counter is initialized, incremented every time an interrupt is encountered, and designated the OS's internal clock.
• All nodes are synchronized after a checking interval t, using the internal clocks of the individual nodes.
• If after time t there is no discrepancy between the clocks of the individual nodes and the global clock, the interval t is extended; otherwise it is shortened.
• The checking interval t is constantly checked and updated.
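The adaptive checking interval can be sketched in Python as below. The doubling and halving factor, the interval bounds, and the tolerance are all assumptions made for illustration; the text only states that the interval is extended when the clocks agree and shortened otherwise.

```python
def next_interval(
    interval: float,
    local_clock: int,
    global_clock: int,
    tolerance: int = 0,       # assumed: allowed drift in timer ticks
    factor: float = 2.0,      # assumed: growth/shrink factor
    min_interval: float = 1.0,
    max_interval: float = 64.0,
) -> float:
    """Return the checking interval to use until the next synchronization."""
    drift = abs(local_clock - global_clock)
    if drift <= tolerance:
        # Clocks agree: check less often.
        return min(interval * factor, max_interval)
    # Clocks have drifted apart: check more often.
    return max(interval / factor, min_interval)
```

Called after every synchronization, a rule of this kind keeps synchronization traffic low while the node clocks stay aligned and tightens the checks as soon as drift appears.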

### SHARE scheduling system

The SHARE scheduling system uses the internal clock of each node, synchronized using the Network Time Protocol (NTP). Scheduling is implemented by collecting jobs with the same resource requirements into a group and executing the group for a predefined time slice. Incomplete jobs are pre-empted once the time slice is exhausted.[8]

The local memory of the node is used as swap space for pre-empted jobs. The main advantages of the SHARE scheduling system are that it guarantees the service time for accepted jobs and supports both batch and interactive jobs.

Synchronization:

Each gang of processes using the same resources is mapped to a different processor. The SHARE system primarily consists of three collaborating modules:[8]

• A global scheduler: This scheduler tells the local schedulers the specific order in which to execute their processes (local gang members).
• A local scheduler: After the global scheduler has provided the local scheduler with jobs to execute, the local scheduler ensures that only one of the parallel processes is executed on any one processor in a given time slot. The local scheduler requires a context switch to pre-empt a job once its time slice has expired and swap a new one in its place.
• Interface to the communication system: The communication subsystem must satisfy several requirements, which greatly increase the overhead of the scheduler. They primarily consist of:
  • Efficiency: It must expose the hardware performance of the interconnect to the client level.
  • Protection and security: The interconnect must maintain atomicity of the processors by not allowing one to affect the performance of another in any way.
  • Multi-protocol: The interconnect must be able to map various protocols simultaneously to cater to different client needs.

## Packing criteria

A new slot is created when a job cannot be packed into any available slot. If a new slot is opened even though the job could be packed into an available slot, the run fraction, which is equal to one over the number of slots used, will decrease. Therefore, several algorithms based on different packing criteria have been devised, as described below:

### Capacity based algorithm

This algorithm monitors the capacity of the slots and decides whether a new slot needs to be opened. It has two variants, listed below:

#### First fit

In this algorithm, the used slots are checked for capacity in sequential order, and the first one that has sufficient capacity is chosen. If none of the used slots has enough capacity, a new slot is opened. Once the new slot is opened, the processing elements (PEs) are allocated in the slot in sequential order.[9]
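A minimal Python sketch of first-fit slot selection follows. It models each slot only by its number of idle PEs, which is a simplification; the function name and arguments are hypothetical.

```python
from typing import List


def first_fit(free_capacity: List[int], job_size: int, num_pes: int) -> int:
    """Return the index of the slot that receives the job.

    free_capacity[i] is the number of idle PEs in slot i and is updated in
    place. A new slot with `num_pes` PEs is opened when no used slot fits.
    """
    if not 1 <= job_size <= num_pes:
        raise ValueError("job size must be between 1 and the number of PEs")
    for index, free in enumerate(free_capacity):
        if free >= job_size:  # first slot with enough room wins
            free_capacity[index] -= job_size
            return index
    free_capacity.append(num_pes - job_size)
    return len(free_capacity) - 1
```

For slots with 1, 3, and 2 idle PEs, a job of size 2 goes to the second slot, even though the third slot would be a tighter fit.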

#### Best fit

Unlike first fit, this algorithm sorts the used slots by capacity instead of checking them in sequential order. The slot with the smallest sufficient capacity is chosen. A new slot is opened only if none of the used slots has sufficient capacity. Once the new slot is opened, the processing elements (PEs) are allocated in the slot in sequential order, as in first fit.[9]
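Best-fit selection can be sketched in Python in the same simplified model, where each slot is represented only by its number of idle PEs. The function name and the tie-breaking rule (lowest slot index) are assumptions for illustration.

```python
from typing import List


def best_fit(free_capacity: List[int], job_size: int, num_pes: int) -> int:
    """Return the index of the used slot with the smallest sufficient capacity.

    free_capacity[i] is the number of idle PEs in slot i and is updated in
    place. A new slot with `num_pes` PEs is opened when no used slot fits.
    """
    if not 1 <= job_size <= num_pes:
        raise ValueError("job size must be between 1 and the number of PEs")
    candidates = [(free, index) for index, free in enumerate(free_capacity) if free >= job_size]
    if candidates:
        _, index = min(candidates)  # tightest fit; ties go to the lowest index
        free_capacity[index] -= job_size
        return index
    free_capacity.append(num_pes - job_size)
    return len(free_capacity) - 1
```

For slots with 4, 2, and 3 idle PEs, a job of size 2 goes to the second slot, leaving the larger gaps available for bigger jobs.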

### Left-right based algorithms

This algorithm is a modified version of the best fit algorithm. In best fit, the PEs are allocated in sequential order, whereas in this algorithm the PEs can be inserted from both directions so as to reduce the overlap between the sets of PEs assigned to different jobs.[9]

#### Left-right by size

In this algorithm, the PEs are inserted in sequential order or in reverse sequential order depending on the size of the job. If the job is small, the PEs are inserted from left to right; if the job is large, they are inserted from right to left.[9]
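The direction rule can be sketched in Python as follows. The cutoff between small and large jobs is not given in the text, so `small_threshold` is a hypothetical parameter, as is the function name.

```python
from typing import List, Optional


def allocate_by_size(
    slot: List[Optional[str]], job: str, size: int, small_threshold: int
) -> Optional[List[int]]:
    """Allocate `size` idle PEs in `slot` and return their indices.

    Jobs of at most `small_threshold` PEs (an assumed cutoff) fill from the
    left; larger jobs fill from the right. Returns None if the job cannot fit.
    """
    idle = [i for i, cell in enumerate(slot) if cell is None]
    if size < 1 or len(idle) < size:
        return None
    chosen = idle[:size] if size <= small_threshold else idle[-size:]
    for i in chosen:
        slot[i] = job
    return chosen
```

Small and large jobs therefore grow toward each other from opposite ends of the slot, which keeps their PE sets apart for as long as possible.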

#### Left-right by slots

Whereas the previous algorithm bases the choice on the size of the job, this algorithm bases it on the slot. Each slot is marked as being filled either from the left or from the right, and PEs are allocated to a job in that order. The number of slots of each kind should be approximately equal, so when a new slot is opened, its direction is chosen based on the number of slots currently filled in each direction.[9]

### Load based algorithms

Neither the capacity-based nor the left-right based algorithms consider the load on individual PEs. Load-based algorithms take into account the load on each individual PE while also tracking the overlap between the sets of PEs assigned to different jobs.[9]

#### Minimal maximum load

In this scheme, PEs are sorted by the load that the jobs place on them. The availability of free PEs in a slot determines the capacity of the slot. Suppose that PEs are to be allocated to a job that has ${\displaystyle x}$ threads. The ${\displaystyle x^{th}}$ PE in load order (the last one) determines the maximum load on any PE that is available in the slot. The slot with the minimal maximum load on any PE is chosen, and the ${\displaystyle x}$ least loaded free PEs in that slot are used.[9]

#### Minimal average load

Unlike the previous scheme, in which slots are chosen based on the minimal maximum load on the ${\displaystyle x^{th}}$ PE, this scheme chooses slots based on the average load on the ${\displaystyle x}$ least loaded PEs.[9]
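The two load-based selection rules described above differ only in how a slot is scored, which the following Python sketch makes explicit. The data layout (a list of idle PE ids per slot and a per-PE load table) and the function name are assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def choose_slot(
    slots_free: List[List[int]], load: Sequence[float], x: int, criterion: str = "max"
) -> Optional[Tuple[int, List[int]]]:
    """Pick a slot for a job with x threads and return (slot index, PE ids).

    slots_free[i] lists the idle PEs in slot i; load[pe] is the load on a PE.
    criterion "max" scores a slot by its x-th least loaded idle PE;
    criterion "avg" scores it by the mean load of its x least loaded idle PEs.
    """
    if x < 1:
        raise ValueError("a job needs at least one thread")
    best: Optional[Tuple[float, int, List[int]]] = None
    for index, free in enumerate(slots_free):
        if len(free) < x:
            continue  # slot lacks the capacity for this job
        picked = sorted(free, key=lambda pe: load[pe])[:x]
        loads = [load[pe] for pe in picked]
        score = loads[-1] if criterion == "max" else sum(loads) / x
        if best is None or score < best[0]:
            best = (score, index, picked)
    return None if best is None else (best[1], best[2])
```

With PE loads `[0, 4, 3, 3, 1, 1]` and idle PEs `[0, 1]` in the first slot and `[2, 3]` in the second, a two-thread job is placed in the second slot under the maximum rule (3 versus 4) but in the first slot under the average rule (2 versus 3).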

### Buddy based algorithm

In this algorithm the PEs are assigned in clusters, not individually. The PEs are first partitioned into groups whose sizes are powers of two. Each group is assigned a controller, and when a job of size n arrives, it is assigned to a controller of size ${\displaystyle 2^{\lceil \lg n\rceil }}$ (the smallest power of 2 that is larger than or equal to n). The controller is assigned by first sorting all the used slots and then identifying groups of ${\displaystyle 2^{\lceil \lg n\rceil }}$ contiguous free processors. If a controller has all its PEs free in some slot, the newly arrived job is assigned to that controller; otherwise a new slot is opened.[9]
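The size rounding and the search for a free group can be sketched in Python as below. The sketch assumes the number of PEs in a slot is itself a power of two and that groups are aligned to multiples of their size; both function names are hypothetical.

```python
from typing import List, Optional


def controller_size(n: int) -> int:
    """Smallest power of two that is greater than or equal to n."""
    if n < 1:
        raise ValueError("job size must be at least 1")
    return 1 << (n - 1).bit_length()


def find_free_block(slot: List[Optional[str]], size: int) -> Optional[int]:
    """Return the start index of the first aligned group of `size` idle PEs.

    `size` is expected to be a power of two; None marks an idle PE.
    Returns None when no such group is entirely free in this slot.
    """
    for start in range(0, len(slot) - size + 1, size):
        if all(cell is None for cell in slot[start:start + size]):
            return start
    return None
```

A job of size 5 is rounded up to a group of 8 PEs, so up to three PEs in its group stay idle during its time slice; this internal fragmentation is the price of the simpler placement.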

### Migration based algorithm

In all the above-mentioned algorithms, the initial placement policy is fixed and jobs are allocated to the PEs based on it. This scheme differs in that, as the name suggests, it migrates jobs from one set of PEs to another, which in turn improves the run fraction of the system. Although systems have been implemented with this algorithm, the migration rate is kept low.[9]