Load-link/store-conditional
In computer science, load-link and store-conditional (LL/SC) are a pair of instructions used in multithreading to achieve synchronization. Load-link returns the current value of a memory location, while a subsequent store-conditional to the same memory location stores a new value only if no updates have occurred to that location since the load-link. Together, they implement a lock-free atomic read-modify-write operation.
LL/SC was originally proposed by Jensen, Hagensen, and Broughton for the S-1 AAP multiprocessor at Lawrence Livermore National Laboratory. Load-link is also known as "load-linked", "load and reserve", or "load-locked".
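The read-modify-write pattern described above can be illustrated with a minimal sketch. It is a toy, single-threaded model rather than real hardware behaviour: the `Cell` class, its `version` counter and the `interfere` callback are assumptions introduced only for illustration, with an ordinary `store` standing in for an update made by another thread.

```python
class Cell:
    """Toy single-threaded model of one memory word with a link (hypothetical)."""

    def __init__(self, value):
        self.value = value
        self.version = 0      # bumped on every store; models "an update occurred"
        self.linked = None    # version observed by the most recent load_link

    def load_link(self):
        self.linked = self.version
        return self.value

    def store(self, value):
        """Ordinary store: always succeeds and invalidates any earlier link."""
        self.value = value
        self.version += 1

    def store_conditional(self, value):
        """Store only if nothing was written since load_link; report success."""
        ok = self.linked is not None and self.linked == self.version
        self.linked = None    # the link is consumed whether or not the store happens
        if ok:
            self.value = value
            self.version += 1
        return ok


def fetch_and_add(cell, delta, interfere=None):
    """Retry loop: reload and recompute until store_conditional succeeds."""
    attempts = 0
    while True:
        attempts += 1
        old = cell.load_link()
        if interfere is not None and attempts == 1:
            interfere(cell)   # simulate another thread's update on the first try
        if cell.store_conditional(old + delta):
            return old, attempts


counter = Cell(10)
first = fetch_and_add(counter, 5)                                    # no interference
second = fetch_and_add(counter, 1, interfere=lambda c: c.store(100))  # one retry
```

In this model the second call fails its first store-conditional because of the simulated update, then reloads the new value and succeeds on the second attempt.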
Comparison of LL/SC and compare-and-swap
If any updates have occurred, the store-conditional is guaranteed to fail, even if the value read by the load-link has since been restored. As such, an LL/SC pair is stronger than a read followed by a compare-and-swap (CAS), which will not detect updates if the old value has been restored (see ABA problem).
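The difference can be shown with a small simulation, assuming an idealised LL/SC that tracks every write. The `Word` class and its `writes` counter are hypothetical modelling devices, not part of any instruction set; the helper `aba` changes the value from A to B and back to A.

```python
class Word:
    """Toy model of a memory word offering both CAS and LL/SC (hypothetical)."""

    def __init__(self, value):
        self.value = value
        self.writes = 0       # counts every store; the LL/SC link compares against it
        self.link = None

    def store(self, value):
        self.value = value
        self.writes += 1

    def compare_and_swap(self, expected, new):
        # CAS looks only at the current value, not at the history of writes.
        if self.value != expected:
            return False
        self.store(new)
        return True

    def load_link(self):
        self.link = self.writes
        return self.value

    def store_conditional(self, new):
        # SC succeeds only if no store happened since the matching load_link.
        ok = self.link == self.writes
        self.link = None
        if ok:
            self.store(new)
        return ok


def aba(word):
    """Change the value away from "A" and then restore it."""
    word.store("B")
    word.store("A")


w1 = Word("A")
seen = w1.value                    # plain read followed by CAS
aba(w1)
cas_result = w1.compare_and_swap(seen, "C")

w2 = Word("A")
seen = w2.load_link()              # load-link followed by store-conditional
aba(w2)
sc_result = w2.store_conditional("C")
```

Here the CAS succeeds because the value it compares against has been restored, whereas the store-conditional fails because two intervening stores took place.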
Real implementations of LL/SC do not always succeed even when there are no concurrent updates to the memory location in question. Any exceptional event between the two operations, such as a context switch, another load-link, or even (on many platforms) another load or store operation, will cause the store-conditional to fail spuriously. Older implementations fail if any updates are broadcast over the memory bus. Researchers often call this weak LL/SC, as it breaks many theoretical LL/SC algorithms. Weakness is relative, and some weak implementations can be used for some algorithms.
LL/SC is more difficult to emulate than CAS. Additionally, stopping running code between paired LL/SC instructions, such as when single-stepping through code, can prevent forward progress, which makes debugging difficult.
Nevertheless, LL/SC can be implemented in O(1) time and in a wait-free manner using CAS, and vice versa, meaning that the two primitives are equivalent from this viewpoint.[1]
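The simpler direction, building CAS on top of LL/SC, can be sketched as a retry loop. This is an illustrative sketch and not the construction from the cited paper: `WeakCell` is a hypothetical toy model whose store-conditional fails spuriously a scripted number of times, and the loop shown is wait-free only if such spurious failures are bounded.

```python
class WeakCell:
    """Toy model of weak LL/SC: SC may fail spuriously (hypothetical)."""

    def __init__(self, value, spurious_failures=0):
        self.value = value
        self.writes = 0
        self.link = None
        self.spurious_failures = spurious_failures   # scripted failures still to inject

    def load_link(self):
        self.link = self.writes
        return self.value

    def store_conditional(self, new):
        ok = self.link == self.writes
        self.link = None
        if ok and self.spurious_failures > 0:
            self.spurious_failures -= 1   # fail although no update occurred
            return False
        if ok:
            self.value = new
            self.writes += 1
        return ok


def compare_and_swap(cell, expected, new):
    """Emulate CAS with an LL/SC retry loop."""
    while True:
        if cell.load_link() != expected:
            return False                  # genuine mismatch: CAS fails
        if cell.store_conditional(new):
            return True                   # value matched and the store went through
        # SC failed (real or spurious): re-read and decide again


cell = WeakCell(7, spurious_failures=2)
swapped = compare_and_swap(cell, 7, 8)    # succeeds after two spurious failures
rejected = compare_and_swap(cell, 7, 9)   # value is now 8, so the CAS fails
```

The loop distinguishes a genuine mismatch, which ends the operation, from a failed store-conditional, which only triggers another attempt.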
Implementations
LL/SC instructions are supported by:
- Alpha: ldl_l/stl_c and ldq_l/stq_c
- PowerPC: lwarx/stwcx. and ldarx/stdcx.
- MIPS: ll/sc
- ARM: ldrex/strex (ARMv6 and ARMv7), and ldxr/stxr (ARMv8)
- RISC-V: lr/sc
- ARC: LLOCK/SCOND
Some CPUs require the address being accessed exclusively to be configured in write-through mode.
Typically, CPUs track the load-linked address at a cache-line or other granularity, such that any modification to any portion of the cache line (whether via another core's store-conditional or merely by an ordinary store) is sufficient to cause the store-conditional to fail.
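The effect of tracking at a coarser granularity can be sketched with a toy model. The `Memory` class, the 64-byte granule size and the addresses used are assumptions chosen for illustration; the ordinary `store` stands in for a write by another core.

```python
class Memory:
    """Toy model in which the link covers a whole granule, not one word."""

    def __init__(self, granule_bytes=64):
        self.granule_bytes = granule_bytes   # assumed tracking granularity
        self.words = {}                      # address -> value
        self.reserved_granule = None

    def _granule(self, addr):
        return addr // self.granule_bytes

    def load_link(self, addr):
        self.reserved_granule = self._granule(addr)
        return self.words.get(addr, 0)

    def store(self, addr, value):
        # Any write that lands in the reserved granule clears the reservation,
        # even if it targets a different word.
        if self._granule(addr) == self.reserved_granule:
            self.reserved_granule = None
        self.words[addr] = value

    def store_conditional(self, addr, value):
        ok = self.reserved_granule == self._granule(addr)
        self.reserved_granule = None
        if ok:
            self.words[addr] = value
        return ok


mem = Memory(granule_bytes=64)

mem.load_link(0)
mem.store(8, 1)                             # different word, same 64-byte granule
same_line = mem.store_conditional(0, 42)    # fails in this model

mem.load_link(0)
mem.store(64, 1)                            # word in the next granule
other_line = mem.store_conditional(0, 42)   # succeeds in this model
```

In this model a write to address 8 defeats a reservation on address 0 because both fall in the same granule, while a write to address 64 does not.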
All of these platforms provide weak LL/SC. The PowerPC implementation allows an LL/SC pair to wrap loads and even stores to other cache lines (although this approach is vulnerable to false cache line sharing). This allows it to implement, for example, lock-free reference counting in the face of changing object graphs with arbitrary counter reuse (which otherwise requires double compare-and-swap, DCAS). RISC-V provides an architectural guarantee of eventual progress for LL/SC sequences of limited length.
Some ARM implementations define platform-dependent blocks, ranging from 8 bytes to 2048 bytes; an LL/SC attempt in a given block fails if a normal memory access to the same block occurs between the LL and the SC. Other ARM implementations fail if there is a modification anywhere in the whole address space. The former implementation is the stronger and more practical of the two.
LL/SC has two advantages over CAS when designing a load-store architecture: reads and writes are separate instructions, as required by the design philosophy (and pipeline architecture); and both instructions can be performed using only two registers (address and value), fitting naturally into common 2-operand ISAs. CAS, on the other hand, requires three registers (address, old value, new value) and a dependency between the value read and the value written. x86, being a CISC architecture, does not have this constraint, though modern chips may well translate a CAS instruction into separate LL/SC micro-operations internally.
Extensions
Hardware LL/SC implementations typically do not allow nesting of LL/SC pairs.[2] A nesting LL/SC mechanism can be used to provide an MCAS primitive (multi-word CAS, where the words can be scattered).[3] In 2013, Trevor Brown, Faith Ellen, and Eric Ruppert implemented in software a multi-address LL/SC extension (which they call LLX/SCX) that relies on automated code generation;[4] they used it to implement one of the best-performing concurrent binary search trees (actually a chromatic tree), slightly outperforming the JDK CAS-based skip list implementation.[5]
References
- ^ Anderson, James H.; Moir, Mark (1995). "Universal constructions for multi-object operations". PODC '95 Proceedings of the fourteenth annual ACM symposium on Principles of distributed computing. ACM. pp. 184–193. doi:10.1145/224964.224985. ISBN 0-89791-710-3. See their Table 1, Figures 1 & 2 and Section 2 in particular.
- ^ Larus, James R.; Rajwar, Ravi (2007). Transactional Memory. Morgan & Claypool. p. 55. ISBN 978-1-59829-124-7.
- ^ Fraser, Keir (February 2004). Practical lock-freedom (PDF) (Technical report). University of Cambridge Computer Laboratory. p. 20. UCAM-CL-TR-579.
- ^ Brown, Trevor; Ellen, Faith; Ruppert, Eric (2013). "Pragmatic primitives for non-blocking data structures". PODC '13 Proceedings of the 2013 ACM symposium on Principles of distributed computing. ACM. pp. 13–22. doi:10.1145/2484239.2484273. ISBN 978-1-4503-2065-8.
- ^ Brown, Trevor; Ellen, Faith; Ruppert, Eric (2014). "A general technique for non-blocking trees". PPoPP '14 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM. pp. 329–342. doi:10.1145/2555243.2555267. ISBN 978-1-4503-2656-8.
- Jensen, Eric H.; Hagensen, Gary W.; Broughton, Jeffrey M. (November 1987). A New Approach to Exclusive Data Access in Shared Memory Multiprocessors (PDF) (Technical report). Lawrence Livermore National Laboratory. UCRL-97663.
- Bruner, John D.; Hagensen, Gary W.; Jensen, Eric H.; Pattin, Jay C.; Broughton, Jeffrey M. (11 November 1987). Cache Coherence on the S-1 AAP (PDF) (Technical report). Lawrence Livermore National Laboratory. UCRL-97646.
- Detlefs, D.; Martin, P.; Moir, M.; Steele, Jr., Guy L. (2001). "Lock-free reference counting". PODC '01 Proceedings of the twentieth annual ACM symposium on Principles of distributed computing. ACM. pp. 190–9. doi:10.1145/383962.384016. ISBN 1-58113-383-9.
- Reinholtz, Kirk (December 2004). "Atomic Reference Counting Pointers". C/C++ Users Journal.
- Sites, R. L. (February 1993). "Alpha AXP architecture". Comm. ACM. 36 (2): 33–44. doi:10.1145/151220.151226.