Transactional memory

From Wikipedia, the free encyclopedia
  (Redirected from Hardware transactional memory)
Jump to: navigation, search

In computer science and engineering, transactional memory attempts to simplify concurrent programming by allowing a group of load and store instructions to execute in an atomic way. It is a concurrency control mechanism analogous to database transactions for controlling access to shared memory in concurrent computing.

Overview[edit]

The biggest challenge of any hardware implementation of transactional memory is dealing with writes. A transactional memory system must track stores and assemble a write-set for the transaction; the actual write-set data must be buffered until the end of a transaction. In case of a successful commit, all the stores in the write-set become globally visible in an atomic fashion, typically by writing the buffered data to a cache. Alternatively, if the transaction aborts, then the buffered stores must be discarded, without modifying the memory.[1]

Reads (loads) are somewhat simpler, because they are not altering memory, only the architectural registers. The transactional memory must still track reads, creating a read-set; a successful transaction simply writes all the reads belonging to a read-set, to the register file. Aborting a transaction involves undoing the changes to the register file, which is far easier than undoing changes to the memory system.[1]

Motivation[edit]

The motivation of transactional memory lies in the programming interface of parallel programs. The goal of a transactional memory system is to transparently support the definition of regions of code that are considered a transaction, that is, that have atomicity, consistency and isolation requirements. Transactional memory allows writing code like this example:

def transfer_money(from_account, to_account, amount):
    with transaction():
        from_account -= amount
        to_account += amount

In the code, the block defined by "transaction" has the atomicity, consistency and isolation guarantees and the underlying transactional memory implementation must assure those guarantees transparently.

Hardware vs. software[edit]

Hardware transactional memory systems may comprise modifications in processors, cache and bus protocol to support transactions.[2][3][4][5][6] Load-link/store-conditional (LL/SC) offered by many RISC processors can be viewed as the most basic transactional memory support. However, LL/SC usually operates on data that is the size of a native machine word, so only nano-transactions are supported.

Software transactional memory provides transactional memory semantics in a software runtime library or the programming language,[7] and requires minimal hardware support (typically an atomic compare and swap operation, or equivalent). As the downside, software implementations usually come with a performance penalty, when compared to hardware solutions.

History[edit]

One of the earliest implementations of transactional memory was the gated store buffer used in Transmeta's Crusoe and Efficeon processors. However, this was only used to facilitate speculative optimizations for binary translation, rather than any form of speculative multithreading, or exposing it directly to programmers. Azul Systems also implemented hardware transactional memory to accelerate their Java appliances, but this was similarly hidden from outsiders.[8]

Sun Microsystems implemented hardware transactional memory and a limited form of speculative multithreading in its high-end Rock processor. This implementation proved that it could be used for lock elision and more complex hybrid transactional memory systems, where transactions are handled with a combination of hardware and software. The Rock processor was cancelled by Oracle in 2009; while the actual products were never released, a number of prototype systems were available to researchers.[8]

In 2009, AMD proposed the Advanced Synchronization Facility (ASF), a set of x86 extensions that provide a very limited form of hardware transactional memory support. The goal was to provide hardware primitives that could be used for higher-level synchronization, such as software transactional memory or lock-free algorithms. However, AMD has not announced whether ASF will be used in products, and if so, in what timeframe.[8]

More recently, IBM announced in 2011 that Blue Gene/Q had hardware support for both transactional memory and speculative multithreading. The transactional memory could be configured in two modes; the first is an unordered and single-version mode, where a write from one transaction causes a conflict with any transactions reading the same memory address. The second mode is for speculative multithreading, providing an ordered, multi-versioned transactional memory. Speculative threads can have different versions of the same memory address, and hardware implementation keeps tracks of the age for each thread. The younger threads can access data from older threads (but not the other way around), and writes to the same address are based on the thread order. In some cases, dependencies between threads can cause the younger versions to abort.[8]

The most recent development is Intel's Transactional Synchronization Extensions (TSX) and the actual implementation in Haswell processors. Haswell is the first x86 processor to feature hardware transactional memory. Intel's TSX specification describes how the transactional memory is exposed to programmers, but withholds details on the actual transactional memory implementation.[8]

Hardware implementation[edit]

Caches extensions[edit]

The crux of transactional memory is detecting conflicts (intersections) in a transaction. A read-set conflict occurs if a cache line in the read-set is written by another thread. Similarly, a write-set conflict occurs if a cache line in the write-set is read or written by another thread. When a conflict occurs, the transaction aborts and must rollback any reads and writes in the transaction, along with any dependent instructions.[1]

Extending the cache coherence protocol and modifying the processor's caches for supporting transactional memory is one possible implementation, but not the only option. Cache-based systems have potential limits due to associativity, and certain access patterns will cause transaction aborts due to cache contention.[1]

For example, Intel's TSX performs read-set and write-set tracking at the cache line (64 bytes) granularity during a transaction.[1]

Ordering buffers[edit]

One alternative to modifying the caches to support transactional memory is extending the functionality of the memory ordering buffer (MOB) and re-order buffer (ROB) in modern x86 microprocessors. These processors have a relatively strong ordering model; generally, loads and stores must appear to execute in program order. However, modern microprocessors can substantially enhance performance by re-ordering loads and stores during out-of-order execution.[1]

When a store is issued to the out-of-order core for renaming and scheduling, an entry in the store buffer is allocated (in-order) for the address and the data. The store buffer will hold the address and data until the instruction has retired and the data has been written to the L1 data cache.[1]

Analogously, when a load is issued, an entry in the load buffer is reserved for the address. However, loads must also compare the load address against the contents of the entire store buffer to check for aliasing with older stores. If the load address matches an older store, then the load must wait for the older store to complete, to preserve the dependency. Most x86 processors optimize this further, by allowing the store to forward data to the load without accessing the cache. The load buffer entry can be released, once the instruction has retired and the load data is written into the register file.[1]

Pipeline modifications[edit]

Implementing transactional memory can be done with a few modifications to the pipeline of an x86 processor. The overall idea is to handle relatively small transactions using out-of-order execution. Before a transaction starts, the old out-of-order window must be retired and cleared. Transaction commits are handled by retiring the speculative out-of-order window, and aborts are treated like clearing a pipeline (for example, due to a branch misprediction or an exception).[1]

The load buffer essentially already tracks the read-set. The only necessary change is avoiding any buffer overflows. The buffer is locked at the start of a transaction, so loads do not leave the read-set; additionally, a transaction must be aborted if an entry is not available for allocation. The load buffer is already snooped; if a remote store hits the buffer, that indicates a conflict has occurred and the transaction must be aborted. Similarly, the ROB must be locked to prevent instructions from retiring and potentially overwriting the saved architectural registers.[1]

Rolling back the read-set for a transaction is quite simple for out-of-order microprocessors with branch predictors. Speculative execution already requires undoing changes to the register file (for example, while loading a data value from memory to a register) when a branch is mispredicted. The ROB naturally tracks the old values of the architectural registers. If the processor forces all instructions to retire before starting a transaction, that creates a clean copy of the architectural registers, which can be reloaded after cleaning a pipeline.[1]

Handling store instructions requires a few more changes. As with load instructions, the store buffer must be emptied before a transaction begins, and cause an abort if it overflows. Stores are locked in the buffer and are not allowed to writeback to the L1 data cache until the end of a transaction. The store buffer can safely hold the write-set, and if an abort occurs, their contents are easily discarded. However, the store buffer must be snooped by remote loads and stores to track the write-set and detect conflicts. This is a substantial change and involves a fair bit of overhead.[1]

Available implementations[edit]

See also[edit]

References[edit]

  1. ^ a b c d e f g h i j k l David Kanter (2012-08-21). "Haswell Transactional Memory Alternatives". Real World Technologies. Retrieved 2013-11-14. 
  2. ^ Herlihy, Maurice; Moss, J. Eliot B. (1993). "Transactional memory: Architectural support for lock-free data structures" (PDF). Proceedings of the 20th International Symposium on Computer Architecture (ISCA). pp. 289–300. 
  3. ^ Multiple Reservations and the Oklahoma Update. doi:10.1109/88.260295. 
  4. ^ Hammond, L; Wong, V.; Chen, M.; Carlstrom, B.D.; Davis, J.D.; Hertzberg, B.; Prabhu, M.K.; Honggo Wijaya; Kozyrakis, C.; Olukotun, K. (2004). "Transactional memory coherence and consistency". Proceedings of the 31st annual International Symposium on Computer Architecture (ISCA). pp. 102–13. 
  5. ^ "Unbounded transactional memory". 
  6. ^ "LogTM: Log-based transactional memory" (PDF). WISC. 
  7. ^ "The ATOMOΣ Transactional Programming Language" (PDF). Stanford. 
  8. ^ a b c d e David Kanter (2012-08-21). "Analysis of Haswell's Transactional Memory". Real World Technologies. Retrieved 2013-11-19. 
  9. ^ "IBM plants transactional memory in CPU". EE Times. 
  10. ^ Java on a 1000 Cores – Tales of Hardware/Software CoDesign on YouTube
  11. ^ Wong, Michael. "Transactional Language Constructs for C++". Retrieved 12 Jan 2011. 

Further reading[edit]

  • Larus, James R.; Rajwar, Ravi (2006), Transactional Memory, Synthesis Lectures on Computer Architecture 1 (1), Morgan & Claypool, pp. 1–226, doi:10.2200/S00070ED1V01Y200611CAC002 
  • Harris, Tim; Larus, James R.; Rajwar, Ravi (December 2010), Transactional Memory, 2nd edition, Synthesis Lectures on Computer Architecture 5 (1), Morgan & Claypool, pp. 1–263, doi:10.2200/S00272ED1V01Y201006CAC011 
  • McKenney, Paul E.; Michael, Maged M.; Triplett, Josh; Walpole, Jonathan (July 2010). "Why the grass may not be greener on the other side: a comparison of locking vs. transactional memory". SIGOPS Oper. Syst. Rev. (New York, NY, USA: ACM) 44 (3): 93–101. doi:10.1145/1842733.1842749. ISSN 0163-5980. 

External links[edit]