
Introduction to Shared Memory Multiprocessors

Cache Coherence Protocols, Memory Consistency Models, and Synchronization are the three main types of support necessary for the correct execution of shared-memory parallel programs on a multiprocessor system.

Multiprocessors

Inter-Process Communication (IPC)

IPC defines the methods used to exchange information between multiple threads in a threaded program. The four main types of communication between processes are message passing, synchronization, shared memory, and remote procedure calls. In shared-memory IPC, one process allocates a region of memory that is then mapped and used by other processes.
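
As a concrete illustration, the following is a minimal POSIX sketch (not taken from any of the cited sources) in which one process creates a named shared-memory object that other processes can open and map; the object name "/demo_shm" and the 4 KiB size are arbitrary choices for the example.

  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  /* Producer side: create a named shared-memory object and write into it.
   * Another process can shm_open("/demo_shm", ...) and mmap it to read
   * the same bytes. */
  int main(void) {
      int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
      if (fd < 0) { perror("shm_open"); return 1; }
      if (ftruncate(fd, 4096) != 0) { perror("ftruncate"); return 1; }
      char *region = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
      if (region == MAP_FAILED) { perror("mmap"); return 1; }
      strcpy(region, "hello from the producer");  /* visible to other processes */
      munmap(region, 4096);
      close(fd);
      return 0;
  }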

Symmetric Shared-Memory Multi-Processing (SMP)

In traditional SMP (Symmetric Multiprocessing) systems, the computer has a single memory controller that is shared by all CPUs. This single memory connection often becomes a bottleneck when all processors access memory at the same time, and it does not scale well to larger systems with many CPUs. For this reason, more and more modern systems use a CC/NUMA (Cache Coherent/Non-Uniform Memory Access) architecture; examples include the AMD Opteron, IBM Power5, HP Superdome, and SGI Altix. [1]

Non-Uniform Memory Access (NUMA)

File:Ccnuma.PNG

NUMA refers to a hardware architecture in which a processor can access its own local memory faster than memory that is local to another processor. The latency is determined by the physical distance between the processor and the memory. NUMA requires some support from the operating system to perform well, by considering spatial locality when allocating memory pages. One great advantage of the NUMA architecture is that even in a large system with many CPUs it is possible to get very low latency to local memory. Because modern CPUs are much faster than memory chips, a CPU often spends considerable time waiting for data to arrive from memory, so minimizing memory latency can improve software performance.[1] NUMA is also referred to as Distributed Shared Memory.
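
For example, the Linux NUMA API described in [1] lets a program explicitly keep its pages on the local node. The sketch below is a hypothetical use of libnuma (assuming the library is installed and the program is linked with -lnuma); it is illustrative rather than code from the reference.

  #include <numa.h>
  #include <stdio.h>

  /* Allocate a buffer on the calling CPU's local node so that accesses
   * stay low-latency instead of landing on a remote node. */
  int main(void) {
      if (numa_available() < 0) {
          fprintf(stderr, "NUMA is not supported on this system\n");
          return 1;
      }
      size_t size = 1 << 20;                  /* 1 MiB buffer */
      void *buf = numa_alloc_local(size);     /* pages placed on the local node */
      if (buf == NULL) return 1;
      /* ... use buf from threads pinned to this node ... */
      numa_free(buf, size);
      return 0;
  }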

Most modern systems use some sort of local, non-shared cache, so NUMA hardware incurs memory-access overhead to keep the caches coherent. As a result, Cache-Coherent NUMA (CC-NUMA) is most commonly used in modern systems. In an SMP-based CC-NUMA multiprocessor system, SMP nodes are interconnected via an interconnection network based on the cache-coherent non-uniform memory access model. All processors belonging to the same SMP node can uniformly access that node's memory modules, and all SMP nodes have access to all of the physically distributed memories. [2] NUMA configurations are also common in Massively Parallel Processing (MPP) systems because they provide scalability in terms of disk space.

Cache Coherence

Hardware-Based Coherence

Snoop devices are attached to cores and their caches so that shared data can be cached while the caches remain coherent. A comparison of a variety of snoop-based cache coherency schemes shows a “sensitivity to cache write policy more than the specific coherency protocol.” [3] The speed ratio between a cache hit and a shared-memory access is less than an order of magnitude, and the power consumed in accessing fast, power-hungry cache memories is larger than that for the on-chip bus and slower shared memories. Therefore, it is harder to amortize the power cost of caches when snoop devices perform many redundant accesses.


Cache memory is a cost-effective way of increasing performance in uniprocessor systems. Write-back schemes are more efficient than write-through, generating less bus traffic, despite the increased hardware complexity of cache-coherency support.[4]

Software-Based Coherence

In software-based coherence, shared data are not cached. More advanced schemes require the compiler to perform the correct analysis so that some shared data may be cached when it is safe to do so.[3] 'Light-weight' schemes avoid caching nonshared data for energy efficiency and have the advantage of scalability.

In the sample code from [3], declaring a variable with the shared keyword implies that it cannot be cached.
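
The paper's actual sample code is not reproduced here; the following hedged C sketch shows one plausible way such a shared keyword could be realized, by mapping annotated variables into a linker section configured as non-cacheable. The macro, the section name .noncacheable, and the function names are illustrative assumptions.

  /* Assumed mapping: "shared" places a variable in a section that the
   * linker script marks as uncached, so every access goes to memory. */
  #define shared __attribute__((section(".noncacheable")))

  shared volatile int flag;   /* never cached: visible to all cores */
  int private_counter;        /* ordinary data: may be cached locally */

  void producer(void) {
      private_counter++;      /* cacheable, core-local update */
      flag = 1;               /* uncached write seen by other cores */
  }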

OS-Based Coherence

In OS-based coherence, a communication infrastructure is built on message queues implemented as packets. Remote processes use global identifiers on the queues to obtain packet buffers, and locks are used for synchronized access to the buffers. The OS is thus able to guarantee coherence. OS-based coherence, however, has an extremely high cost, and “general-purpose libraries... are not a practical alternative in a highly performance- and power-constrained context.”[3]

File:Cacheorganization.PNG[4]



Memory Consistency

Memory consistency problems occur when memory instructions are reordered during the execution of threaded programs.
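
The hazard can be seen in the classic flag/data idiom below, a deliberately simplified pthread sketch (not from the cited sources): if the two stores in the writer become visible out of order, the reader can observe ready == 1 and still read the stale value of data. The volatile qualifier only keeps the compiler from caching the values in registers; it does not order the accesses on a weakly ordered machine.

  #include <pthread.h>
  #include <stdio.h>

  volatile int data = 0;
  volatile int ready = 0;

  void *writer(void *arg) {
      data = 42;       /* (1) */
      ready = 1;       /* (2) may become visible before (1) on a weak model */
      return NULL;
  }

  void *reader(void *arg) {
      while (!ready)
          ;            /* spin until the flag is observed */
      printf("data = %d\n", data);   /* may print 0 if the stores reordered */
      return NULL;
  }

  int main(void) {
      pthread_t t1, t2;
      pthread_create(&t1, NULL, writer, NULL);
      pthread_create(&t2, NULL, reader, NULL);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
      return 0;
  }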

Load Queues

Enforcing memory consistency can be done by strictly ordering memory operations, but this can result in unnecessary overhead. As a result, load queues track dependencies between memory operations to enforce memory consistency while preventing violations of the model. There are mainly two types of load queues.

In a processor with a snooping load queue, originally described by Gharachorloo et al., the memory system forwards external write requests (i.e., invalidate messages from other processors or I/O devices) to the load queue, which searches for already-issued loads whose addresses match the invalidation address, squashing any overlapping load. If inclusion is enforced between the load queue and any cache, replacements from that cache also result in an external load-queue search. Insulated load queues enforce the memory consistency model without processing external invalidations, by squashing and replaying loads that may have violated the consistency model. [5]


Alpha Memory Model

The Alpha model provides two different fence instructions, the memory barrier (MB) and the write memory barrier (WMB). The MB instruction can be used to maintain program order from any memory operations before the MB to any memory operations after the MB. The WMB instruction provides this guarantee only among write operations. The Alpha model does not require a safety net for write atomicity. [6]
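
As an illustration, the following C sketch shows where the Alpha fences would be placed in the flag/data idiom. The mb()/wmb() wrappers use inline assembly in the style commonly seen in Alpha system code; they are an assumed mapping for this example, not code from the cited tutorial.

  /* Assumed inline-assembly wrappers for the Alpha fence instructions. */
  #define mb()  __asm__ __volatile__("mb"  : : : "memory")
  #define wmb() __asm__ __volatile__("wmb" : : : "memory")

  extern volatile int data, ready;

  void producer(void) {
      data = 42;
      wmb();        /* order the two writes: data must be visible before ready */
      ready = 1;
  }

  int consumer(void) {
      while (!ready)
          ;
      mb();         /* order the flag read before the data read */
      return data;
  }

Because WMB only orders writes, the consumer side still needs the full MB to keep its load of data from being performed before the load of ready.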

File:Alphamodel1.PNG[7]

Synchronization

MCS Lock

The MCS lock is a spin lock algorithm designed by Mellor-Crummey and Scott which ensures FIFO ordering of lock acquisition, spins on local flag variables, uses a small constant amount of space per lock, and works well on machines with or without coherent caches. When several spin locks were tested, the MCS lock performed best of all. The only drawback is that the time needed to release an MCS lock depends on whether another processor is waiting. The algorithm, sketched in the sample code below, maintains a queue of processors requesting the lock, which enables each processor to spin on a “unique, locally-accessible flag variable”. [8]

File:MCSalgorithm.PNG[8]
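
For reference, a hedged C11 sketch of the MCS queue lock using <stdatomic.h> follows. It mirrors the structure of the published algorithm (enqueue with an atomic exchange, spin on a locally accessible flag, hand the lock to the successor on release), but the type and function names are illustrative and the atomics use the sequentially consistent defaults for clarity.

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stddef.h>

  typedef struct mcs_node {
      _Atomic(struct mcs_node *) next;
      atomic_bool locked;               /* local flag each processor spins on */
  } mcs_node_t;

  typedef struct {
      _Atomic(mcs_node_t *) tail;       /* last node in the waiting queue */
  } mcs_lock_t;

  void mcs_acquire(mcs_lock_t *lock, mcs_node_t *self) {
      atomic_store(&self->next, NULL);
      atomic_store(&self->locked, true);
      /* Append ourselves to the queue; the previous tail is our predecessor. */
      mcs_node_t *pred = atomic_exchange(&lock->tail, self);
      if (pred != NULL) {
          atomic_store(&pred->next, self);
          while (atomic_load(&self->locked))
              ;                         /* spin only on our own flag */
      }
  }

  void mcs_release(mcs_lock_t *lock, mcs_node_t *self) {
      mcs_node_t *succ = atomic_load(&self->next);
      if (succ == NULL) {
          /* No visible successor: try to swing the tail back to empty. */
          mcs_node_t *expected = self;
          if (atomic_compare_exchange_strong(&lock->tail, &expected, NULL))
              return;
          /* A successor is enqueueing itself; wait for the link to appear. */
          while ((succ = atomic_load(&self->next)) == NULL)
              ;
      }
      atomic_store(&succ->locked, false);   /* hand the lock to the successor */
  }

Each processor passes its own queue node to mcs_acquire and mcs_release, so per-lock space stays constant and all spinning happens on memory local to the waiting processor.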


PowerPC Sync Instruction

In context synchronization, the isync instruction is used to guarantee that preceding memory accesses have completed; instructions after the isync execute in the new context. In execution synchronization, the sync instruction synchronizes execution and broadcasts an address on the bus, which may be done to synchronize coherent memory with other processors. The difference between isync and sync is that with sync, external accesses must complete “with respect to other processors and mechanisms that access memory”. [9]
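
As an illustrative sketch (assumed macro names, not taken from the application note), sync and isync can be wrapped in inline assembly and used around a critical section on PowerPC:

  /* Assumed wrappers for the PowerPC synchronizing instructions. */
  #define ppc_sync()  __asm__ __volatile__("sync"  : : : "memory")
  #define ppc_isync() __asm__ __volatile__("isync" : : : "memory")

  extern volatile int lock_word;

  void after_acquire(void) {
      ppc_isync();   /* context-synchronizing: later instructions execute in
                        the new context, after the lock acquisition completes */
  }

  void unlock(void) {
      ppc_sync();    /* execution-synchronizing: stores from the critical
                        section complete with respect to other processors
                        before the lock release becomes visible */
      lock_word = 0;
  }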

References

  1. A NUMA API For Linux (http://www.novell.com/rc/docrepository/public/37/basedocument.2009-11-18.5883877819/4621437_en.pdf?noredir=True)
  2. Cache Coherent Protocols in NUMA Multiprocessors (http://ettrends.etri.re.kr/PDFData/13-5-2.pdf)
  3. Cache Coherence Tradeoffs in Shared-Memory MPSoCs (https://wiki.ittc.ku.edu/ittc/images/0/0f/Loghi.pdf)
  4. A Low-Overhead Coherence Solution for Multiprocessors With Private Cache Memories (http://portal.acm.org/citation.cfm?id=808204)
  5. Memory Ordering: A Value Based Approach (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.2874&rep=rep1&type=pdf)
  6. Shared Memory Consistency Models: A Tutorial
  7. Shared Memory Consistency Protocol Verification Against Weak Memory Models: Refinement via Model-Checking (http://www.cs.utah.edu/formal_verification/papers/cav02paper.pdf)
  8. Synchronization Without Contention (http://www.freescale.com/files/32bit/doc/app_note/AN2540.pdf?noredir=True)
  9. Synchronizing Instructions of PowerPC Instruction Set Architecture (http://www.freescale.com/files/32bit/doc/app_note/AN2540.pdf?noredir=True)