Concurrent Algorithms for Emerging Hardware Platforms

by Irina Calciu
B.Sc., Jacobs University Bremen; Bremen, Germany, 2009
M.Sc., Brown University; Providence, RI, 2011

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in The Department of Computer Science at Brown University

PROVIDENCE, RHODE ISLAND
May 2015

© Copyright 2015 by Irina Calciu

This dissertation by Irina Calciu is accepted in its present form by The Department of Computer Science as satisfying the dissertation requirement for the degree of Doctor of Philosophy.

Date    Maurice Herlihy, Ph.D., Advisor

Recommended to the Graduate Council

Date    Justin Gottschlich, Ph.D., Co-advisor
Date    Rodrigo Fonseca, Ph.D., Reader

Approved by the Graduate Council

Date    Peter M. Weber, Dean of the Graduate School

Vita

Irina Calciu was born and raised in Bucharest, Romania. She became interested in Computer Science at an early age and participated in various programming competitions during high school. In 2006 she moved to Bremen, Germany, to pursue a B.Sc. in Computer Science at Jacobs University Bremen. In the fall of 2008, she was an exchange student in the School of Computer Science at Carnegie Mellon University. Irina joined Brown University as a PhD student in 2009. She obtained an M.Sc. in Computer Science from Brown in 2011 and a PhD in Computer Science in 2015.

Her research focuses on designing algorithms that leverage new architectural features of modern hardware to enable more software parallelism. In particular, she is interested in hybrid transactional memory and in non-uniform memory access (NUMA) algorithms. Irina has co-authored papers at top conferences and workshops, such as PACT, PPoPP, DISC, OPODIS, HotPar and TRANSACT, obtaining a Best Paper Award for work on software fallbacks for best-effort hardware transactional memory at TRANSACT 2014. She was awarded a Kanellakis Fellowship in 2014 and a Dissertation Fellowship in 2013, and she has been a research intern at Microsoft Research, Intel Labs, Oracle Labs, Google and Mozilla.

Acknowledgements

"It's always easier to ask forgiveness than it is to get permission."
Grace Murray Hopper

I am incredibly lucky to have had amazing people accompany me through my PhD journey. This thesis exists because of their support and guidance.

I am very grateful to my advisor, Maurice Herlihy, who gave me independence before I even knew I wanted it. He inspired and motivated me. He found the time to listen, to help and to provide feedback whenever I needed it. Yet he also gave me the confidence to fly on my own and to carve my own path. He taught me to be bold in my pursuits and not to ask for permission. It was a great experience to have him as an advisor!

I am forever indebted to my co-advisor, Justin Gottschlich. Justin was my mentor during an internship with Intel Labs and he remained my mentor even after the internship ended. He insisted that I could write better papers and patents and that I could make better visuals for my presentations. He trusted me more than I trusted myself. Today, I am a better writer, presenter, researcher and a better person because of his trust.

I am thankful to Rodrigo Fonseca, who was a member of my thesis committee, and to the other faculty at Brown, in particular Sorin Istrail, Ugur Cetintemel and Shriram Krishnamurthi. They have all provided me with invaluable guidance and advice on this journey and I do not have enough words to thank them for it.
I had the great pleasure to work with amazing researchers during my internships. I want to thank Mark Moir and the Scalable Synchronization group for teaching me about NUMA machines and systems research. I am grateful to the Programming Systems Lab at Intel for offering me the opportunity to work on Hybrid Transactional Memory on Haswell before anyone else could. I am also thankful to Konrad Lai and Andi Kleen. They spared me much pain by providing support and tools to navigate the prototype hardware. It was a great honor to work with Marcos Aguilera, Mahesh Balakrishnan, Rama Ramasubramanian and Sid Sen during two internships at Microsoft Research. I am excited to continue working with them at VMware Research Group. I am also grateful to the interns with whom I shared many special moments in the Bay Area - Fangbo, Ilya, Jana, Mohsen, Rajiv, Tobias, and Tomas. Special thanks go to Yehuda Afek, who provided access to the machines used for many experiments in this thesis.

I wouldn't be here today without the love and support of my family back in Romania. They have always cheered for me and they have forgiven my absence from so many family holidays. I am grateful to my aunt and godmother, Rodica Ristea, who was the first to show me the world outside my country. I am also thankful to my aunts tanti Miti and tanti Anda, to my uncle Nenicu, and to my cousins, Catalina, Dani, Bogdan and Doru. Moreover, I was incredibly lucky to find a second family in Minnesota. They were kind enough to make me feel part of the family. Their encouragement during these last few years was unmatched. I am thankful to Theresa and Joe Berg, Rita and Kyle Johansen, Jenny Berg, and aunts Dorothy and Lisa.

My journey would not have been the same without my amazing family at Brown. Jenny and Nathan were my first officemates. They taught me to navigate the intricacies of the PhD and prepared me for every next step in the process. I shared the ups and downs of the PhD life with my good friends, Jeff, Shane, Yuri, Hammurabi and Carrie, Marcelo and Yoko, Alex and Tanya, Steve and Alyssa, Oana and Igor. My friend Marquita has been my research and my gym buddy. Archita, Vikram and Zhiyu have been great academic siblings. Although we haven't had a chance to collaborate so far, I hope we will do so in the future. I have enjoyed countless cupcakes and life discussions with my GWICS friends - Alexandra, Betsy, Esha, Foteini, Gen, Hannah, Layla, Olya, Rebecca, Sasha, and Silvia. I have spent many late hours in the CIT with Deepak, Greg and Sunil early on during my PhD. I am thankful to Lauren Clarke for always having a solution to all problems and to Genie DeGouveia for always having a key to my office when I locked myself out before a deadline.

I am grateful to my friends in Providence. Robin Feder provided a great introduction to Providence and American culture. Charlie and Michael have been great neighbors and have organized unforgettable get-togethers. Elena, Alin and Andrei have been very far, but also very close all these years. We have all experienced the PhD adventure together, even though from very different places. I am thankful to Elena for sharing all the joys and sorrows of the PhD life. Thank you for making all the roadblocks seem easier to navigate!

I owe all my achievements and success to my wonderful mother and closest confidant, Germina Ristea. Thank you for being my idol and inspiration!
You magically managed to strike the perfect balance between giving me the freedom to explore the world while also keeping the bar high for me. You taught me to work hard and to see failures as a learning experience. You gave me all the skills I needed to succeed in a PhD program, perhaps without even knowing it.

My fiancé and best friend, Daniel Berg, has been my biggest supporter and my biggest critic during all these years. Thank you for believing in me and for not letting me settle for anything less than the best! Your passion and dedication to research continue to inspire me every day.

Abstract of "Concurrent Algorithms for Emerging Hardware Platforms" by Irina Calciu, Ph.D., Brown University, May 2015

Computer architecture has recently seen an explosion of innovation that has enabled more parallel execution, while parallel software systems have been making strides in providing simpler programming models. The number of computing cores used in every area of the software ecosystem continues to increase, and parallelism within programs is now ubiquitous. Ideally, performance would scale linearly with the number of cores, but that is rarely the case in practice. Communication and synchronization between cores running the same application are often necessary, but usually come at a high cost. This results in reduced scalability and a significant drop in performance. In this context, parallel software needs to provide simpler programming patterns and tools that enable a higher potential for parallelism without increasing the burden on the programmer.

This thesis discusses new techniques to simplify writing efficient parallel code by leveraging novel architectural features of many current systems. First, we describe various programming abstractions, such as delegation, elimination, combining and transactional memory, which improve the scalability and performance of concurrent programs. Next, we show how to use and integrate these abstractions to write scalable concurrent algorithms, such as stacks and priority queues. Finally, we describe how to further improve these abstractions. In particular, we present new transactional memory algorithms that use Intel's new extension to the x86 instruction set architecture, called Restricted Transactional Memory, to simplify general synchronization. Developers can use all of these abstractions as building blocks to create efficient code that is able to scale on very diverse platforms, with minimal specialized knowledge of parallel programming.

Contents

1 Introduction
  1.1 Motivation
  1.2 Outline
  1.3 Contributions
2 Design Patterns and Abstractions for Concurrent Algorithms
  2.1 Concurrent Data Structures
  2.2 Elimination
  2.3 Combining and Delegation
  2.4 Transactional Memory
3 A Concurrent NUMA-Aware Stack
  3.1 Background
  3.2 Algorithm Design
    3.2.1 Delegation
    3.2.2 Elimination
    3.2.3 Advantages and Limitations
  3.3 Evaluation
  3.4 Summary
4 A Concurrent Priority Queue
  4.1 Background
  4.2 Algorithm Design
    4.2.1 Concurrent Skiplist
    4.2.2 Elimination and Combining
  4.3 Linearizability
  4.4 Evaluation
    4.4.1 PQ::moveHead() and PQ::chopHead()
  4.5 Hardware Transactions
    4.5.1 Skiplist
    4.5.2 Aborted Transactions
    4.5.3 Combining and Elimination
  4.6 Summary
5 Software Fallbacks for Best-effort Hardware Transactional Memory
  5.1 Background
  5.2 SGL Fallback (E-SGL)
  5.3 Lazy SGL (L-SGL)
    5.3.1 Correctness
    5.3.2 Sandboxing
  5.4 Evaluation
    5.4.1 Speedup relative to sequential execution
    5.4.2 Percentage of lock acquisitions
    5.4.3 Single-threaded penalty
  5.5 Fine-grained SGL
    5.5.1 Use Cases
    5.5.2 Performance and Practicality
  5.6 Hardware Optimizations
  5.7 Summary
6 Hybrid Transactional Memory
  6.1 Background
  6.2 Overview of InvalSTM
  6.3 Invyswell's Design
    6.3.1 SpecSW: An HTM-Friendly InvalSTM
    6.3.2 BFHW: Hardware-Software Conflict Detection
    6.3.3 LiteHW: Optimizing for Small Transactions
    6.3.4 IrrevocSW: Progress Guarantees
    6.3.5 SglSW: Progress Guarantees with Reduced Overhead
    6.3.6 Transitioning Between Transaction Types
    6.3.7 SpecSW Validation
    6.3.8 Contention Manager (CM)
  6.4 Correctness
    6.4.1 Opacity and Sandboxing
    6.4.2 Hardware Sandboxing Limitations
  6.5 Optimizations
  6.6 Evaluation
  6.7 Summary
7 Conclusion

List of Figures

3.1 Example of a NUMA system with two nodes and 128 hardware threads.
3.2 Delegation
3.3 Communication protocol
3.4 Results for 50% pushes and 50% pops
3.5 Results for 70% pushes and 30% pops
3.6 Results for 90% pushes and 10% pops
4.1 Skiplist design
4.2 Transitions of a slot in the elimination array.
4.3 Linearizability of priority queue elimination
4.4 Linearizability of priority queue delegation
4.5 Priority queue performance with 50% add()s, 50% removeMin()s.
4.6 Priority queue performance with 80% add()s, 20% removeMin()s.
4.7 add() work breakdown.
4.8 removeMin() work breakdown.
4.9 Priority queue performance using transactions
4.10 Priority queue performance using transactions
5.1 Obvious SGL Fallback implementation (E-SGL).
5.2 Lazy SGL (L-SGL).
5.3 Inconsistent reads.
5.4 Correctness: Cases 1-4. Arrows denote the "happens-before" relation.
5.5 Example of overflow due to hyper-threading (vacation high benchmark).
5.6 STAMP Throughput.
5.7 STAMP Percentage of Lock Acquisitions.
5.8 Speedup for 8 threads
5.9 Slowdown for 1 thread.
6.1 STAMP Performance Differential by Geometric Mean.
6.2 Transactional Events for Invyswell's Different Transaction Types.
6.3 Invyswell's State Machine
6.4 Speculative Software Transaction (SpecSW).
6.5 Bloom Filter Hardware Transaction (BFHW).
6.6 Overview of Invyswell's SpecSW Validation Process.
6.7 Invyswell's Concurrent Execution Matrix.
6.8 Speedup on STAMP Benchmarks
6.9 Invyswell Transaction Types: 1-threaded execution.
6.10 Invyswell Transaction Types: 8-threaded execution.
6.11 Percentage of Committed Hardware Transactions.

Chapter 1

Introduction

This thesis discusses new techniques to simplify writing efficient parallel code that leverage architectural features of current systems. We focus on a few design patterns, such as elimination, delegation, combining and transactional memory. These techniques promise to simplify writing parallel code and to improve scalability in scenarios with high contention. We describe how to use these techniques and integrate them to design scalable concurrent algorithms.
First, we show how to use delegation and elimination to implement a scalable concurrent stack, suitable for the Non-Uniform Memory Access (NUMA) architecture, from a sequential stack. Next, we present the first elimination algorithm for a priority queue and describe how to integrate this algorithm with delegation and transactional memory to design a scalable concurrent priority queue. Finally, we describe two hybrid transactional memory (HyTM) algorithms that use Intel's Restricted Transactional Memory (RTM) to simplify general synchronization. These techniques make parallel programming simpler and more efficient and are suitable for the rapidly evolving hardware ecosystem. They represent a foundation for building large-scale concurrent systems that may be suitable to address wide-interest problems.

1.1 Motivation

The landscape of Computer Science is fundamentally changing. For a long time, Moore's law ensured that performance would increase with each new CPU iteration. But the "free ride" is over and the demand for faster computation is now satisfied through parallelism. A boom in hardware innovation is enabling more concurrency. Therefore, server machines with hundreds of cores are becoming ubiquitous. Ideally, performance would increase linearly with the number of cores, but that is rarely the case in practice. The culprits are communication and synchronization between cores running the same application. These are often necessary, and usually come at a high cost, causing a loss of scalability and reduced performance. In order to leverage the huge potential of these emerging hardware platforms, we need better synchronization methods and updated parallel programming abstractions.

Moreover, as computer architectures are changing and growing to accommodate more cores, the connecting bus is becoming the limiting factor on how many cores a system can support. To circumvent these issues, machines are progressively adopting the non-uniform memory access (NUMA) model, where each processor (also called a node) has its own memory. Multiple cores are grouped on a node and share a last level cache. Although all the memory is shared, a thread running on a node can access local memory (on the same node) faster than it can access remote memory (on another node). Different access times and cache-to-cache traffic can significantly impact the performance of applications unaware of this non-uniformity. As these machines are becoming critical components in data centers, it is essential to provide software building blocks for developing efficient parallel applications on these platforms.

Meanwhile, software does not yet fully leverage this potential for increased parallelism. Shared data needs to be protected from simultaneous access by multiple threads. The primary mechanism currently used to ensure mutually exclusive access to shared data is locking. Nevertheless, fine-grained locking is complex and prone to errors, while coarse-grained locking can impact scalability. Moreover, locks are not composable: multiple critical sections cannot be combined into a single one, which hurts the code's modularity. Locks can also cause priority inversions and deadlocks, which are difficult to detect and recover from. For these reasons, locking is not an ideal solution for synchronization, especially on NUMA machines with hundreds of cores.
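As a small illustration of the composability problem (our example, not code from this thesis): each account below is correctly protected by its own lock, yet composing the two critical sections into a transfer forces the caller to hold both locks, and two threads transferring in opposite directions can deadlock.

    #include <mutex>

    // Illustrative only: two accounts, each protected by its own lock.
    struct Account {
        std::mutex lock;
        long balance = 0;
    };

    // Composing two correct, individually locked operations requires holding
    // both locks. If thread A calls transfer(x, y) while thread B calls
    // transfer(y, x), each can grab its first lock and wait forever for the
    // second: a deadlock that neither operation exhibits on its own.
    void transfer(Account& from, Account& to, long amount) {
        std::lock_guard<std::mutex> g1(from.lock);
        std::lock_guard<std::mutex> g2(to.lock);   // potential deadlock here
        from.balance -= amount;
        to.balance += amount;
    }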
Transactional memory (TM) has been proposed to abstract away the complexity of lock-based mutual exclusion while providing benefits comparable to fine-grained locking. Moreover, transactional memory eliminates the negative side-effects of locking, such as deadlocks and priority inversions. Transactional memory executes critical sections speculatively, as transactions, tracking all memory accesses and restarting or stalling one or more transactions if it detects a conflict. Transactional memory can enable more parallelism by allowing critical sections to execute concurrently as long as there are no data conflicts between them.

For example, two threads that insert elements into different buckets of a hash table can execute in parallel, as they do not have any data conflicts. If using coarse-grained locking, these threads would have to acquire a lock on the hash table before doing the insert; therefore, they would not be able to execute in parallel. However, if these threads were using fine-grained locking, with each thread locking only its corresponding bucket, both threads could proceed in parallel. Nevertheless, designing fine-grained synchronization is a bigger undertaking than using a coarse-grained lock and it is more prone to programming errors [44, 33]. Instead, one may be able to achieve the efficiency of fine-grained locking with the programming simplicity of coarse-grained locking by using transactional memory. In the previous example, both threads can perform the inserts as transactions. If a conflict is detected, one of the transactions needs to roll back and retry.

Software transactional memory (STM) is implemented in software only. STM is most effective when used in applications with large, contended critical sections, where smart contention managers can efficiently manage transactions to obtain the best throughput. Unfortunately, keeping track of all memory accesses in software generally incurs a prohibitive overhead for short critical sections. For these, hardware transactional memory (HTM) has proven more feasible [17, 24, 66]. HTMs have recently become available in Intel's Haswell processor [48, 49] and IBM's Blue Gene/Q [89] and System z [51]. Practical HTMs, such as those provided by Haswell and Blue Gene/Q, are best-effort, which means there are no forward progress guarantees. In particular, hardware transactions are restricted from using certain unsupported instructions and are limited in size. Therefore, a fallback is needed to ensure forward progress of hardware transactions. In practice, a single global lock (SGL) is often used as a fallback for an aborted hardware transaction. The SGL is similar to Intel's Hardware Lock Elision (HLE) technique, used for legacy code, where existing locks are elided and the critical sections are executed as transactions. If a transaction aborts, the hardware acquires the locks that were previously elided and executes pessimistically. However, the SGL prevents any concurrency while the lock is being held.

Hybrid transactional memory (HyTM) [19] combines lightweight hardware transactions with the forward progress guarantees offered by software transactions, while also offering more flexibility for transaction and contention management. Therefore, HyTMs represent a complete solution for the problem of synchronization in shared memory.
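For concreteness, the sketch below shows the general shape of the single-global-lock fallback discussed above, using Intel's RTM intrinsics. It is a minimal illustration with an assumed retry policy and helper names; it is not the E-SGL or L-SGL algorithm developed later in this thesis. The transaction reads the lock flag so that a thread running under the fallback lock aborts any concurrent speculative executions.

    #include <immintrin.h>   // _xbegin/_xend/_xabort (compile with -mrtm)
    #include <atomic>
    #include <mutex>

    static std::mutex g_sgl;                       // single global lock (fallback)
    static std::atomic<bool> g_sgl_held{false};

    // Illustrative sketch: try the critical section as a hardware transaction a
    // few times, then fall back to pessimistic execution under the global lock.
    template <typename F>
    void run_transactional(F critical_section, int max_retries = 3) {
        for (int attempt = 0; attempt < max_retries; ++attempt) {
            unsigned status = _xbegin();
            if (status == _XBEGIN_STARTED) {
                // Subscribe to the lock: if another thread holds the SGL,
                // abort so we never run concurrently with the fallback path.
                if (g_sgl_held.load(std::memory_order_relaxed)) _xabort(0xff);
                critical_section();
                _xend();                            // commit hardware transaction
                return;
            }
            // Aborted (conflict, capacity overflow, unsupported instruction...).
        }
        std::lock_guard<std::mutex> guard(g_sgl);   // pessimistic fallback
        g_sgl_held.store(true, std::memory_order_relaxed);
        critical_section();
        g_sgl_held.store(false, std::memory_order_relaxed);
    }

A caller would pass the critical section as a lambda, e.g. run_transactional([&]{ table.insert(key, value); });.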
Although current consumer systems supporting HTM are limited to four cores (eight hardware threads using hyperthreading), we believe next-generation architectures may eventually offer support for HTM on machines with hundreds of cores, thus making transactional memory a viable and portable synchronization solution.

As more parallel architectures emerge, wide-scale adoption of parallel programming across multiple disciplines may be possible. However, the programming paradigm needs to be greatly simplified, as it is with transactional memory, or parallel programming is likely to remain a restricted "experts-only" domain.

1.2 Outline

In this thesis, we explore design patterns and abstractions that leverage novel hardware features to improve the scalability and performance of concurrent programs and to simplify writing parallel code. In particular, we explore elimination [39], delegation [64], combining [38, 64, 54, 86, 6, 27, 74] and transactional memory [43, 84], and we propose new ways to use and integrate these abstractions to design new scalable concurrent algorithms. We present new designs for concurrent stacks and priority queues and analyze their performance and scalability benefits compared to prior work. Next, we propose new ways to further improve these abstractions. We focus on transactional memory and propose new fallback algorithms to be used in conjunction with Intel's new Restricted Transactional Memory instructions [49] to provide forward progress guarantees. This thesis is organized as follows.

In Chapter 2, we describe related work. We focus on various abstractions that have been proposed in the concurrent computing community, such as elimination, delegation, combining and transactional memory. Elimination consists of canceling out inverse operations. For example, a thread that executes a push operation on a stack can eliminate its operation with another thread executing a pop operation on the same stack. The delegation method consists of one dedicated thread, called a server, which is responsible for managing a sequential data structure and executing operations on behalf of other threads, called clients. Clients post synchronous (blocking) operation requests in dedicated memory locations, called slots, and the server loops through these slots, collects all operations and executes them on the data structure. The server is the only thread able to access the data structure, so it does not need any synchronization for the access. Combining is similar to delegation, but there is no dedicated server thread: operations are performed by one of the clients, the one that acquires the combiner lock. Combining and delegation can reduce cache-to-cache traffic by allowing one thread to execute multiple operations. Moreover, some operations can be executed more efficiently as a batch, allowing the combiner or the server to achieve more throughput with less work. For example, removing multiple consecutive items in a sorted linked list can be executed at once with the cost of a single operation by a server or combiner thread, while it would take multiple operations if each operation was executed separately by the client threads. Transactional memory has been proposed as a general synchronization method and allows critical sections to execute speculatively. In case of conflicts, where multiple transactions access the same data, one of the conflicting transactions needs to be stalled or aborted and retried at a later time.
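To make the batching claim above concrete, the following sketch (ours; the list layout and function name are assumptions) shows how a server or combiner that has collected several pending "remove smallest" requests can answer all of them with one traversal and a single update of the list head, instead of one separate removal per client.

    #include <cstddef>

    struct Node { int key; Node* next; };

    // Illustrative sketch: satisfy 'count' pending remove-smallest requests on a
    // sorted singly linked list in one pass. Each waiting client receives one
    // key; the whole prefix is detached with a single head update.
    Node* batch_remove_min(Node*& head, int* out_keys, std::size_t count) {
        Node* first = head;
        Node* prev = nullptr;
        Node* cur = head;
        std::size_t i = 0;
        while (cur != nullptr && i < count) {
            out_keys[i++] = cur->key;       // answer for the i-th waiting client
            prev = cur;
            cur = cur->next;
        }
        if (prev != nullptr) prev->next = nullptr;   // detach the removed prefix
        head = cur;                                  // one pointer update for the batch
        return first;                                // caller may reclaim these nodes
    }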
In Chapter 3, we describe how to use elimination and delegation to design a scalable NUMA-aware stack. In our design, clients use elimination before delegating their requests in order to reduce the burden on the server thread and to parallelize mixed workloads in which operations cancel each other out. Moreover, we experiment with placing the elimination layer locally, on each NUMA node, and globally, where it is shared between NUMA nodes. We show that there is significant benefit from using local elimination and delegation together, by comparing to state-of-the-art concurrent stack implementations and to global-elimination based stacks.

In Chapter 4, we describe a scalable concurrent priority queue design. We present the first elimination algorithm for a priority queue and show how to integrate this algorithm with delegation, combining and transactional memory to achieve a highly scalable design. Our algorithm is based on the observation that high-value add() operations should execute in parallel for high scalability, while removeMin() operations should be executed by a combiner to avoid contention on the smallest items in the priority queue. Moreover, we noticed that small-value add() operations can either eliminate immediately, if their values are smaller than the priority queue minimum, or they can quickly become eligible for elimination, if they are likely to conflict with the removeMin() operation. Therefore, we allow these operations to attempt elimination before accessing the shared priority queue. Moreover, in order to reduce contention, we use a dedicated server thread that collects operations that failed to eliminate and executes them sequentially on the priority queue. We show that our design is more scalable and performs better than state-of-the-art priority queue implementations.

In Chapter 5, we describe improvements to the simple, but widely used, Single Global Lock (SGL) fallback mechanism to ensure forward progress for best-effort hardware transactional memory. First, we present Lazy Single Global Lock (L-SGL), a simple optimization that can achieve surprising benefits. Its simplicity, combined with high throughput on current Haswell machines, makes L-SGL likely to be adopted by industry and implemented as a software library, or even in the compiler or hardware. Next, we describe how to refine conflict detection between multiple hardware transactions and a single software SGL transaction using Bloom Filters. Finally, we discuss how implementing these features in hardware would improve performance even further.

In Chapter 6, we present Invyswell, a novel hybrid transactional memory algorithm that uses Haswell RTM for hardware transactions. Invyswell is more complex than L-SGL and uses InvalSTM [30] software transactions as a fallback mechanism for Haswell. This algorithm pays a penalty for its complexity at low thread counts, but we anticipate that it will be more scalable than L-SGL on machines with more cores. L-SGL and Invyswell are not meant to be compared. Rather, we believe they are complementary. L-SGL can be provided as an out-of-the-box hardware solution for simple applications that do not utilize many hardware resources, while Invyswell can provide added benefit in software to highly parallel applications with tens or hundreds of threads.

Finally, Chapter 7 provides concluding remarks.

1.3 Contributions

Related papers published:

1. Chapter 3. Using Elimination and Delegation to Implement a Scalable, NUMA-Friendly Stack, I. Calciu, J. Gottschlich, M.
Herlihy, HotPar 2013 [11].
2. Chapter 4. The Adaptive Priority Queue with Elimination and Combining, I. Calciu, H. Mendes, M. Herlihy, DISC 2014 [13].
3. Chapter 5. Improved Single Global Lock Fallback for Best-effort Hardware Transactional Memory, I. Calciu, T. Shpeisman, G. Pokam, M. Herlihy, Transact 2014 [14].
4. Chapter 6. Invyswell, A Hybrid Transactional Memory for Haswell's Restricted Transactional Memory, I. Calciu, J. Gottschlich, T. Shpeisman, G. Pokam, M. Herlihy, PACT 2014 [12].

Other publications:

1. Message Passing or Shared Memory: Evaluating the Delegation Abstraction for Multicores, I. Calciu, D. Dice, T. Harris, M. Herlihy, A. Kogan, V. Marathe, M. Moir, OPODIS 2013 [9].
2. NUMA-Aware Reader-Writer Locks, I. Calciu, D. Dice, Y. Lev, V. Luchangco, V. Marathe, N. Shavit, PPoPP 2013 [10].

Chapter 2

Design Patterns and Abstractions for Concurrent Algorithms

There have been many proposals for general techniques to design and analyze concurrent algorithms [20, 21]. In addition, specific concurrent data structure designs have also been proposed, such as Stacks [65, 82], Queues [71, 37, 65], Deques [41, 55, 60], Trees [4, 34, 53, 57, 73, 72] and Priority Queues [87, 3, 40, 46, 50]. In this chapter, we survey prior work related to the main techniques used in this thesis. First, we describe different notions related to concurrent data structures. Next, we present techniques such as elimination, delegation and combining. Finally, we survey the literature related to transactional memory.

2.1 Concurrent Data Structures

Concurrent data structures are quickly gaining importance, as multicore machines are becoming ubiquitous. The building blocks for designing concurrent data structures generally consist of locks and atomic primitives to ensure the safety of all the shared data accesses. The most commonly used atomic primitive is Compare-And-Swap (CAS), which is supported in most current processors. A CAS operation consists of atomically changing a memory location from a known old value to a new value, only if the memory location has not been updated in the meantime. CAS operations are often used to design lock-free algorithms. In addition, they are also used in designing efficient blocking synchronization techniques.

However, designing concurrent data structures is generally difficult [68, 83] because the many possible interleavings of different threads can cause different outcomes. Therefore, concurrent algorithms need to account for non-deterministic thread schedules and always produce the expected outcomes.

Linearizability [45] is the most commonly used correctness condition for data structures. Linearizability requires that all operations appear to take place instantaneously, at some moment in time between the invocation of the operation and its response. This means that all operations on the shared data structure can be ordered so that the result is equivalent to a sequential execution of these operations. In addition, linearizability enforces that this order reflects the real-time order of these operations. Therefore, concurrent operations, i.e., those operations whose executions overlap, can be re-ordered, but operations whose executions do not overlap cannot be re-ordered. The moment at which an operation appears to take place is called its linearization point. Linearizability is composable, which means that a data structure created out of multiple linearizable parts is itself linearizable.
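To connect the CAS primitive above with the notion of a linearization point, the sketch below (our illustration, not code from this thesis) shows a Treiber-style lock-free stack push: the successful compare-and-swap is the instant at which the push appears to take effect.

    #include <atomic>

    struct Node { int value; Node* next; };

    // Illustrative sketch: a lock-free push built from the CAS primitive.
    // The successful compare_exchange is the operation's linearization point;
    // on failure, 'old_top' is refreshed and the loop retries.
    void push(std::atomic<Node*>& top, Node* n) {
        Node* old_top = top.load();
        do {
            n->next = old_top;                              // link to current top
        } while (!top.compare_exchange_weak(old_top, n));   // CAS; retry on failure
    }

The elimination technique described next in Section 2.2 can be layered on top of exactly this kind of CAS retry loop as a backoff mechanism.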
In this thesis, we focus on linearizable designs of concurrent data structures.

2.2 Elimination

Stacks are generally seen as sequential data structures because all threads contend for access to the stack at its top location. However, prior work has shown that stacks can be parallelized using a technique called elimination [39]. This technique uses an additional data structure to allow threads performing push operations to meet threads performing pop operations and exchange their arguments. This is equivalent to the push being executed on the stack and immediately followed by a pop. The elimination data structure, generally implemented as an array, allows multiple such pairs to exchange arguments in parallel and decreases contention on the underlying lock-free stack. If one thread fails to find an inverse operation, then its elimination attempt times out and the thread accesses the stack directly.

This technique can be used as a backoff mechanism for a lock-free stack. A thread can first try to perform its operation on the lock-free stack using a CAS operation and only use the elimination array if the CAS fails. Using elimination as a backoff mechanism allows the throughput to be significantly increased in high-contention cases, without affecting the latency of operations in cases where there is not much contention.

If the original stack design is linearizable, the resulting stack design using the elimination method is also linearizable. As described in Section 2.1, concurrent operations can be reordered. Therefore, operations that perform elimination concurrently can be reordered to make it appear that each push operation was immediately followed by its eliminating pop operation. The rendezvous method [2] improves the elimination algorithm by replacing the elimination array with a smarter structure for processing the elimination, consisting of an adaptive circular ring.

Elimination is generally used in the context of stacks, but an elimination algorithm for queues has also been proposed [67]. The main idea behind the queue elimination is to allow threads that fail to enqueue to linger for some time in the elimination array, until they become eligible to eliminate. This process is called aging the operation. The enqueue operation becomes eligible to eliminate with a dequeue operation when all the items that were enqueued before the start of the enqueue operation have already been dequeued. The operation aging process is necessary for linearizability: if any enqueue operation were allowed to eliminate, the First-In-First-Out property of the queue would not be satisfied. We use this idea as a basis for our priority queue elimination algorithm described in Chapter 4.

2.3 Combining and Delegation

The idea of one thread helping other threads to complete their work is a well-known concept [38, 64, 54, 86, 6, 27, 74, 40]. A recent example of this helping mechanism is called flat combining [38], in which a thread that acquires a lock for a data structure executes operations for itself and also for other threads that are waiting on the same lock. The global lock and the data remain in this thread's cache while it executes operations on behalf of other threads, thereby decreasing the number of cache misses and contention on the lock. Moreover, flat combining aligns well with data structures that are inherently sequential, because only one thread is able to operate on them at a time regardless.
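The sketch below (our illustration with hypothetical names, not the flat-combining implementation of [38]) shows the core of the combining pattern just described: each thread publishes a request in a per-thread slot, and whichever thread acquires the combiner lock applies every pending request to the sequential structure before returning.

    #include <atomic>
    #include <functional>
    #include <mutex>

    constexpr int kMaxThreads = 64;

    struct Request {
        std::atomic<bool> pending{false};
        std::function<int()> op;     // operation on the sequential structure
        int result = 0;
    };

    Request g_slots[kMaxThreads];
    std::mutex g_combiner_lock;

    // A thread publishes its operation, then either becomes the combiner or
    // waits for another combiner to execute the operation on its behalf.
    int submit(int my_id, std::function<int()> op) {
        g_slots[my_id].op = std::move(op);
        g_slots[my_id].pending.store(true, std::memory_order_release);
        while (g_slots[my_id].pending.load(std::memory_order_acquire)) {
            if (g_combiner_lock.try_lock()) {          // become the combiner
                for (auto& s : g_slots) {
                    if (s.pending.load(std::memory_order_acquire)) {
                        s.result = s.op();             // run it for the owner
                        s.pending.store(false, std::memory_order_release);
                    }
                }
                g_combiner_lock.unlock();
            }
        }
        return g_slots[my_id].result;
    }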
Due to the increasing number of hardware threads in a system, the helper thread could be a dedicated thread (called a server thread) used only to service requests from other threads (client threads). This is especially useful on heterogeneous architectures, where some cores could be faster than others. An example of this approach is CPHash [64], a partitioned hash table. Each partition has an associated server thread that receives add and remove requests from clients and sends back the responses obtained from performing the requested operations. Each client-server pair shares a location where they exchange messages, called a communication channel. In [9], Calciu et al. investigate the tradeoffs between traditional shared-memory techniques and a message-passing approach based on delegation.

2.4 Transactional Memory

Transactional memory (TM) systems [36] fall into three rough categories: software (STM), hardware (HTM), and hybrid (HyTM). Most of the research literature concerns itself with STM systems [84, 1, 25, 30, 42, 61, 76, 79]. In this thesis, we compare our HyTM design to NOrec [18], a state-of-the-art STM that uses value-based validation, deferred updates and lazy conflict detection.

Early HTM research was limited to simulation [35, 69]. Early implementations include Azul's Vega [16] and Sun's Rock [22], though neither became widely available. Recently, however, Intel [49] and IBM [89, 51, 8] announced new processors with hardware support for transactions, and it seems likely that others will follow. Like Herlihy and Moss's original TM proposal [43], these systems rely on modified cache coherence protocols to achieve atomicity and isolation. Haswell also supports hardware lock elision [75, 48], a scheme where annotated lock-based critical sections are executed speculatively, but are retried pessimistically if speculation fails. We restrict the evaluation in this thesis to Intel's Haswell transactional memory instructions, called Restricted Transactional Memory.

HyTM schemes promise to provide the best of both worlds: the efficiency of HTM with the scalability and forward progress guarantees of STM. The first papers to articulate this point are from Damron et al. [19] and Kumar et al. [56]. Later work in this area includes PhTM [58], intended for Sun's Rock architecture, and Riegel et al.'s work [77], intended for AMD's proposed Advanced Synchronization Facility (ASF). Dalessandro et al. [17] proposed Hybrid NOrec, a HyTM based on the NOrec STM. We compare our HyTM algorithm with Hybrid NOrec in Chapter 6. Matveev and Shavit [62, 63] describe a new type of HyTM based on reduced hardware transactions: HTM is used not only for the hardware transactions, but also for making the software transactions more efficient. More recently, Wang et al. [89] proposed a HyTM for IBM Blue Gene/Q's best-effort HTM, based on a Single-Global-Lock fallback. Our HyTM algorithms, like any other TM, must address the problem of how to most efficiently resolve conflicts between transactions, a problem known as contention management (CM) [31, 80, 28, 85].

Chapter 3

A Concurrent NUMA-Aware Stack

Emerging cache-coherent non-uniform memory access (cc-NUMA) architectures provide cache coherence across hundreds of cores. These architectures change how applications perform: while local memory accesses can be fast, remote memory accesses suffer from high access times and increased interconnect contention.
Because of these costs, performance of legacy code on NUMA systems is often worse than their uniform memory counterparts despite the potential for increased parallelism. In this chapter, we explore these effects on prior implementations of concurrent stacks and propose the first NUMA-aware stack design that improves data locality and minimizes interconnect contention. We achieve this by using a dedicated server thread that performs all operations requested by the client threads. Data is kept in the cache local to the server thread thereby restricting cache-to-cache traffic to messages exchanged between the clients and the server. In addition, we match reciprocal operations (pushes and pops) by using the rendezvous algorithm [2], an improved elimination algorithm, before sending requests to the server. This has the 16 17 dual effect of avoiding unnecessary interconnect traffic and reducing the number of operations that change the data structure. The result of combining elimination and delegation is a scalable stack that outperforms all previous stack implementations on NUMA systems. 3.1 Background The current trend in computer architecture is to increase system performance by adding more cores so that more work can be done simultaneously. In order to enable systems to scale to hundreds of cores, the main hardware vendors are switching to non-uniform memory access (NUMA) architectures. Recent examples include Intel’s Nehalem family and the SPARC Niagara line. Figure 3.1: Example of a NUMA system with two nodes and 128 hardware threads. NUMA systems contain multiple sockets connected by an interconnect. Each socket (also called a node) consists of multiple processing cores with a shared last level cache (LLC) and a local memory (as in Figure 3.1). A thread can quickly access the local memory on its own socket and it can access the memory on another socket using the interconnect, so the programming model is similar to uniform memory architectures. The NUMA design allows systems to scale to hundreds of cores and 18 provides inexpensive data sharing for cores on the same socket. However, remote cache invalidations and remote memory access can drastically degrade performance because of the interconnect’s high latency and limited bandwidth. Therefore, in many cases, legacy code exhibits worse throughput when ported to NUMA machines than on non-NUMA ones. Prior research addresses this by using a NUMA aware contention manager that migrates threads closer to the data they access [5]. However, migrating threads is a complex solution that, while feasible for operating systems, is not generally realistic for end-user applications. Alternatively, one could devise solutions in which the data are moved to the accessing threads. For example, cohort locks [23] and NUMA reader-writer locks [10] keep the data local to one cache as long as possible. This is implemented by transferring ownership of the locks from the threads finishing their critical sections to other threads on the same socket. Similarly, Metreveli et al. [64] minimize cache data transfers by partitioning a concurrent hash table and distributing operations for each partition to a specifically assigned thread. All threads wanting to access the hash table submit requests to these server threads through message passing implemented in shared memory. Essentially, the hash table resides in the caches of the accessing threads and the cache-to-cache traffic is limited to requests sent to and from the servers. Making Data Structures NUMA-Aware. 
To maximize performance, Metrev- eli et al. [64] leverage the concurrency properties of hash tables in their partition im- plementation. Namely, hash tables are highly concurrent, easily partitionable data structures. However, many data structures do not have the inherent concurrency benefits of hash tables. This chapter focuses on a NUMA-aware implementation of a stack. Nevertheless, the method presented can be applied to other data structures as well. 19 Stacks have a broad range of uses: from evaluating expressions in calculators to parsing the syntax of expressions and program blocks in compilers. In addition, stacks can easily be used to implement unfair thread pools and any containers with- out ordering guarantees. An example of this is the Java unfair synchronous queue [81]. Unfortunately, stacks cannot be easily partitioned without forfeiting their last-in- first-out (LIFO) property. Because of this, multiple threads often contend on the single entry point providing access into the stack. It is primarily for this reason that stacks seem to be inherently sequential. However, prior work has shown that stacks can benefit from parallelism under balanced workloads (i.e., a similar number of push and pop operations) using elimination [39, 2]. This is implemented by canceling concurrent inverse operations from different threads even before they reach the stack. Elimination is not specific to stacks. Moir et al. [67] have shown how to use elimination with queues. Although this method significantly improves scalability of stacks, it does not address the primary concern of this chapter: i.e., remote cache invalidations on NUMA systems. The goal is to reduce cache traffic and maintain data locality while using the prop- erties of the underlying data structure to enable parallelism. The result is a scalable and highly parallel stack that outperforms all previous stack implementations on NUMA systems. 3.2 Algorithm Design This section describes the use of delegation to implement a NUMA-aware stack. At the highest level, the design provides efficiencies in increased cache locality and reduced interconnect contention. After discussing the design, this section shows 20 how to employ the rendezvous elimination algorithm [2] to make this stack scalable. Moreover, this section presents the difference between global elimination, which is implemented using one rendezvous structure shared by all threads, and local elimi- nation, which contains an elimination structure for each NUMA node. 3.2.1 Delegation We use the delegation approach to implement a NUMA-aware stack. In particular, we use one dedicated server thread that accepts push and pop requests from many client threads. Figure 3.2 shows the overall interaction between the server and the clients. The communication is implemented in shared memory, using one location (which we call a slot) for each client. The server loops through all the slots, collecting and processing requests and writing the responses back to the slots. The clients post requests to their private slots and spin-wait on that slot until a response is provided by the server. Figure 3.3a provides a high-level overview of this communication protocol. We note that only the pop operations need to spin-wait until a response is provided. The push operations could return as soon as the server notices their requests. This optimization improves throughput, but we decided not to use it in our experiments, for a more fair comparison with the other methods. 
A weakness of this design is that using a reserved slot for each client can result in wasted space if the clients’ workloads are not evenly distributed. Furthermore, the server must loop through all slots, even those not in use, when looking for requests. These two drawbacks can result in increased space and time complexity. To overcome these limitations, we statically assign several threads to the same slot 21 1 by thread id. To synchronize the access of multiple threads to the same slot, we introduce an additional spinlock for each slot. Figure 3.3b reflects these changes to the communication protocol. Figure 3.2: Delegation: clients post their requests in shared local slots and wait for the server to process them. The server loops through all the slots, processes requests as it finds them and immediately posts back the response in the same slot. The sequential stack is only accessed by the server thread; therefore, the top part of the stack remains in the server’s cache all throughout execution. 3.2.2 Elimination Elimination generally works best when the number of inverse operations are roughly equivalent. For inequivalent, unbalanced workloads, many operations cannot be eliminated, thereby requiring a thread to access the data structure directly. This creates contention and cache-to-cache traffic because these operations could origi- nate from different NUMA nodes. In order to solve these problems, we augment the delegation stack presented in the previous section with a rendezvous elimination layer. Threads first try to eliminate and, if they time out, they delegate their oper- ation to the server thread. Delegation ensures that the data remains in the server’s cache, while elimination enables parallelism, thus making the NUMA-aware stack more scalable. Moreover, threads can continue to try to eliminate while they wait for the spinlock of their slot to be released. The complete algorithm is described in 1 It is important to note that all threads using the same slot need to be on the same NUMA node in order to maintain the slot’s locality. 22 Figure 3.3c. (a) (Black) Single thread per slot: each thread posts requests in its private slot, without any synchronization. (b) (Blue) Multiple threads per slot: threads share slots, so they need to acquire the slot’s spinlock before writing the request. (c) (Red) Elimination: Threads first try to eliminate; if they fail they then try to acquire the slot spinlock and submit a request, but if the lock is already taken, they go back to the elimination structure; they continue this loop until either they eliminate, or they acquire the spinlock. Figure 3.3: Communication protocol between a client thread and the server thread using slots. Local vs. Global Elimination. For the rendezvous stack, threads first try elim- ination and, in the case of failure, they then directly access the stack. Our NUMA- aware stack is an improvement over this design, because it increases locality and re- duces contention on the stack by replacing direct access to the stack with delegation. However, the initial stage of elimination can still cause a number of invalidations be- tween different NUMA nodes’ caches because each thread accesses the same shared structure when performing elimination. To overcome this bottleneck, our NUMA- aware stack splits the single elimination data structure into several local structures, equal to the number of NUMA nodes. To minimize interconnect contention, we limit elimination to occur only between those threads located on the same socket. 
23 3.2.3 Advantages and Limitations Our stack design is optimized for the NUMA architecture. Local elimination and delegation both contribute to removing the contention on the interconnect and on the stack. Moreover, delegation makes the inter-node communication explicit and reduces it to the messages exchanged between the server and the clients. The stack remains local to the server’s cache and requires no synchronization, because only the server thread can access it directly. In contrast, state-of-the-art synchronization methods, such as locking, allow all threads to access the shared data, causing more cache-to-cache transfers than used by delegation. In addition, these methods also require communication for achieving synchronization. One potential drawback of this approach is that the access to the stack is serialized by using only the server thread. However, the direct access of multiple threads to a stack would also be serialized by a lock to keep the stack’s integrity. Moreover, we enable parallelism by using elimination, which compensates for accessing the stack sequentially. Another drawback is the potential for additional communication overhead between the clients and the server. For example, if the stack is only rarely accessed, then direct access to it would likely be more efficient. However, the overhead of elimination and delegation is eclipsed by their benefits when there are many threads contending for access to the stack. Finally, our description assumes one server thread for each shared stack. In order to maintain high throughput, this thread must always be available to handle queries. Therefore, each server thread is assigned a hardware thread and runs at high priority. Unfortunately, we might not have enough hardware threads if an application uses multiple shared data structures, so some of the structures might have to share a server. If the application uses many shared data structures, the server threads could 24 become a performance bottleneck. However, we believe this scenario does not happen often in practice. 3.3 Evaluation We conducted our experiments on an Oracle SPARC T5240 machine with two Nia- gara T2+ processors running at 1.165GHz. Each chip has 8 cores and each core has 8 hardware threads for a total of 128 hardware threads (64 per chip). We implemented our NUMA stack algorithm in C++ and we compared it to previous stack implemen- tations using the same microbenchmark as [2]: a rendezvous stack, a flat combining stack and a lock-free stack. The benchmark has flexible parameters, allowing us to measure throughput under different percentages of push and pop operations. The re- sults we present were obtained using threads with fixed roles (e.g. push-only threads and pop-only threads). We allow the scheduler to assign our software threads to 2 NUMA nodes and then pin them to their respective processors. The server thread is created with increased priority compared to the client threads, to decrease the likelihood of being swapped out by the scheduler. For our experiments, we started by comparing our local elimination and delegation NUMA stack (nstack el) with a lock-free stack (lfstack) [88], which has been the basis for other stack implementations such as rendezvous [2] and flat-combining [38]. Then, we compared our stack to the flat combining stack (fcstack) [38], which outperforms the rendezvous stack when there is no significant potential for elimination (i.e., in unbalanced workloads). 
The scalable performance of the lock-free stack begins to degrade around 16 threads. The flat-combining stack, however, is unaffected by the type of workload and achieves 2 We also experimented with unbounded and variable role threads, but the results were too similar to warrant inclusion in this thesis. 25 relatively stable scalability across different thread counts. However, the elimination based NUMA stack outperforms both of them by a large margin. These results can be observed in Figures 3.4, 3.5 and 3.6. Figure 3.4: Results for 50% pushes and 50% pops Effect of elimination. To judge the effect of the local elimination structures used in our implementation, we compared our NUMA stack (nstack el) against two other versions; one without elimination (nstack) and one with global elimination (nstack el gl). As expected, the global elimination algorithm outperforms the algo- rithm without elimination, while both perform worse than local elimination. From Figures 3.4, 3.5 and 3.6, we conclude that local elimination is crucial for the scala- bility of our algorithm because it achieves locality for most of the operations. Our experiments were performed on a 2-node NUMA system, but we expect that these results generalize to systems with more nodes, as long as the push and pop operations are distributed uniformly across all the nodes. 26 Figure 3.5: Results for 70% pushes and 30% pops Figure 3.6: Results for 90% pushes and 10% pops 27 Effect of delegation. To better understand and characterize the impact of del- egation, and because elimination has such a strong influence on performance, we compare our stack against two variations of the rendezvous stack: one uses local elim- ination and the other uses global elimination. The rendezvous stack (rendezvous) consists of global elimination and direct access. To provide a more fair comparison, we modified the rendezvous stack to perform elimination locally on each NUMA node (rendezvous loc). Threads that fail to eliminate on each node must access the data structure directly. This local version of the rendezvous stack improves the scalability of the rendezvous stack for NUMA systems. However, our NUMA stack performs even better, indicating there is an observable performance benefit using delegation under high contention, for both balanced and unbalanced workloads (Figures 3.4, 3.5 and 3.6) due to reduced cache-to-cache traffic. We believe the benefit of delega- tion would become more apparent on a NUMA system with more sockets, because the penalty of inter-node communication is higher on such systems. Although the latency of an individual operation could increase because the server needs to inspect slots on more nodes, cache and memory locality would play an even more signifi- cant role than they do on a 2-node system, so the benefit given by delegation would increase. We leave evaluation on such a system as future work. Balanced workloads. We experimented with different percentages of push and pop operations. Elimination works best when the number of pushes is very similar to the number of pops. In the balanced workload case, we use 50% push threads and 50% pop threads. Experimental results are shown in Figure 3.4. For this setting, elimination plays a significant role, as most operations will manage to eliminate. There is some benefit from delegation, as we can see when we compare to the local rendezvous algorithm, but not that significant. 28 Unbalanced workloads. 
For unbalanced workloads, elimination plays a much smaller role in reducing the number of operations. We present results for 70% pushes, 30% pops in Figure 3.5 and 90% pushes, 10% pops in Figure 3.6. In both cases, there is some elimination, but not as significant as in the balanced workload case. However, delegation plays a much more important role for these workloads, as more operations fail to eliminate and need to access the stack. Results show that preserving cache locality through delegation works much better than direct access to the stack. Number of slots. Finally, we want to measure the impact of the synchronization introduced with sharing slots by different threads. We compared the implementation of the NUMA stack using shared slots (nstack el) with the implementation using one slot per client thread, which does not require any synchronization to access the slots (nstack el st - nstack elimination single thread per slot). The results indicate that there is no clear winner in this case, which can be explained by the fact that the server has to loop through all the slots to service requests. Each request might have to wait a linear time in the number of slots to be found by the server. If the server finds too many of the slots empty, then much of the work performed by the server is wasted. However, if the server finds requests in most of the slots, then the algorithm can benefit from more slots because of the lack of synchronization. Our results seem to support this claim: the single thread (ST) per slot version outperforms the multiple threads per slot version (MT) for very unbalanced workloads as in Figure 3.6, while MT outperforms ST for more balanced workloads, as in Figures 3.4 and 3.5. This is due to the elimination algorithm significantly reducing the number of requests sent to the server for balanced workloads, while for unbalanced workloads there is less elimination and more requests sent to the server. In our experiments, we assumed that we know the maximum number of client threads in the system and always check all the slots, even when running with fewer threads. 29 This could be improved using an adaptive way of determining the number of slots, but we leave that as future work. 3.4 Summary Hardware’s shift towards NUMA systems urges a compatible software redesign. Ba- sic data structures are not optimized for these architectures. We propose the first NUMA-aware design of a stack, using local elimination and delegation. Combin- ing these two methods is favorable across a number of scenarios: elimination works best when the number of pushes and pops is roughly the same, while delegation significantly reduces contention in the cases in which there is not enough potential for elimination because the workload is not very balanced. Our NUMA-aware stack outperforms prior stack implementations across different scenarios from completely balanced workloads to the more unbalanced ones. However, this is just the first step in transitioning to NUMA systems. There are vast and exciting opportunities for exploring the design of other NUMA-aware data structures. We presented one technique and showed that it works well for a stack. The same technique could be applied to other data structures, such as queues and lists, which also admit inverse operations. In contrast, other data structures might not be suitable for elimination or might suffer from the serialized access of the server thread. 
For these data structures, we need to find new tools that allow us to redesign them for the NUMA space. Chapter 4 A Concurrent Priority Queue Priority queues are fundamental abstract data structures, often used to manage limited resources in parallel programming. Several proposed parallel priority queue implementations are based on skiplists, harnessing the potential for parallelism of the add() operations. In addition, methods such as Flat Combining [38] have been proposed to reduce contention, batching together multiple operations to be executed by a single thread. While this technique can decrease lock-switching overhead and the number of pointer changes required by the removeMin() operations in the priority queue, it can also create a sequential bottleneck and limit parallelism, especially for non-conflicting add() operations. In this chapter, we describe a novel priority queue design, harnessing the scalability of parallel insertions in conjunction with the efficiency of batched removals. Moreover, we present a new elimination algorithm suitable for a priority queue, which further increases concurrency on balanced workloads with similar numbers of add() and removeMin() operations. We implement and evaluate our design using a variety of techniques including locking, atomic operations, hardware transactional memory, as well as employing adaptive heuristics given the workload. 30 31 4.1 Background A priority queue is a fundamental abstract data structure that stores a set of keys (or a set of key-value pairs), where keys represent priorities. It usually exports two main operations: add(), to insert a new item in the priority queue, and removeMin(), to remove the first item (the one with the highest priority). Parallel priority queues are often used in discrete event simulations and resource management, such as operating systems schedulers. Therefore, it is important to carefully design these data struc- tures in order to limit contention and improve scalability. Prior work in concurrent priority queues exploited parallelism by using either a heap [47] or a skiplist [59] as the underlying data structures. In the skiplist-based implementation of Lotan and Shavit [59], each node has a “deleted” flag, and processors contend to mark such “deleted” flags concurrently, in the beginning of the list. When a thread logically deletes a node, it tries to remove it from the skiplist using the standard removal algorithm. A lock-free skiplist implementation is presented in [87]. However, these methods may incur limited scalability at high thread counts due to contention on shared memory accesses. Hendler et al. [38] introduced Flat Combin- ing, a method for batching together multiple operations to be performed by only one thread, thus reducing the contention on the data structure. This idea has also been explored in subsequent work on delegation [64, 9], where a dedicated thread called a server performs work on behalf of other threads, called clients. Unfortunately, the server thread could become a sequential bottleneck. A method of combining delegation with elimination has been proposed to alleviate this problem for a stack data structure [11]. Elimination [39] is a method of matching concurrent inverse operations so that they don’t access the shared data structure, thus significantly reducing contention and increasing parallelism for otherwise sequential structures, such as stacks. 
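To make the elimination idea concrete, the following is a toy C++ sketch of a single elimination slot for a stack, in the spirit of [39]: a push and a pop that meet at the same slot exchange the value through a CAS and never touch the stack itself. The slot encoding, the bounded wait, and the assumption that pushed values are at least 2 are ours, purely for illustration; the real elimination array is more involved.

    #include <atomic>
    #include <cstdint>

    constexpr uint64_t FREE = 0, WAITING_POP = 1;   // reserved slot states (assumed)

    bool tryEliminatePush(std::atomic<uint64_t>& slot, uint64_t value) {
        uint64_t expected = WAITING_POP;
        return slot.compare_exchange_strong(expected, value);  // hand value to a waiting pop
    }

    bool tryEliminatePop(std::atomic<uint64_t>& slot, uint64_t& out) {
        uint64_t expected = FREE;
        if (!slot.compare_exchange_strong(expected, WAITING_POP)) return false;
        for (int i = 0; i < 1000; ++i) {            // bounded wait for a matching push
            uint64_t v = slot.load();
            if (v != WAITING_POP) { out = v; slot.store(FREE); return true; }
        }
        expected = WAITING_POP;                     // timed out: try to withdraw the request
        if (slot.compare_exchange_strong(expected, FREE)) return false;
        out = slot.load();                          // a push arrived at the last moment
        slot.store(FREE);
        return true;
    }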
An elimination algorithm has also been proposed in the context of a 32 queue [67], where the authors introduce the notion of aging operations - operations that wait until they become suitable for elimination. In this chapter, we describe, to the best of our knowledge, the first elimination al- gorithm for a priority queue. Only add() operations with values smaller than the priority queue minimum value are allowed to eliminate. However, we use the idea of aging operations introduced in the queue algorithm [67] to allow add() values that are small enough to participate in the elimination protocol, in the hope that they will soon become eligible for elimination. We implement the priority queue using a skiplist and we exploit the skiplist’s capability for both operations-batching and disjoint-access parallelism. RemoveMin() requests can be batched and executed by a server thread using the combining/delegation paradigm. Add() requests with high keys will most likely not become eligible for elimination, but need to be inserted in the skiplist, requiring expensive traversals towards the end of the data structure. These operations would represent a bottleneck for the server and a missed opportunity for parallelism if executed sequentially. Therefore, we split the underlying skiplist into two parts: a sequential part, managed by the server thread and a parallel part, where high-valued add() operations can insert their arguments in parallel. Our design re- duces contention by performing batched sequential removeMin() and small-value add() operations, while also leveraging parallelism opportunities through elimina- tion and parallel high-value add() operations. We show that our priority queue outperforms prior algorithms in high contention workloads on a SPARC Niagara II machine. Finally, we explore whether the use of hardware transactions could simplify our design and improve throughput. Unfortunately, machines that support hardware transactional memory (HTM) are only available for up to four cores (eight hardware threads), which is not enough to measure scalability of our design in high contention scenarios. Nevertheless, we showed that a transactional version of our algorithm is better than a non-transactional version on a Haswell four-core machine. We believe 33 that these preliminary results will generalize to machines with more threads with support for HTM, once they become available. 4.2 Algorithm Design Our priority queue exports two operations: add() and removeMin() and is im- plemented using an underlying skiplist. The elements of the skiplist are buckets associated with keys. For a bucket b, the field b.key denotes the associated key. We split the skiplist in two distinct parts. The sequential part, in the beginning of the skiplist, is likely to serve forthcoming removeMin() operations of the prior- ity queue (PQ::removeMin() for short) as well as add(v) operations of the priority queue (PQ:add() for short) with v small enough (hence expected to be removed soon). The parallel part, which complements the sequential part, is likely to serve PQ::add(v) operations where v is large enough (hence not expected to be removed soon). Either the sequential or the parallel part may become empty. Both lists are complete skiplists, with (dummy) head buckets called headSeq and headPar, respec- tively, with key −∞. Both lists also contain (dummy) tail buckets, with key +∞. We call the last non-dummy bucket of the sequential part lastSeq, which is the logical divider between parts. 
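As an aid to the description that follows, here is a minimal sketch of the data layout we have in mind for the split skiplist. The field and type names are illustrative, not the exact ones from our implementation; the point is only to show the two dummy heads, the divider lastSeq, and the shared minimum estimate.

    #include <atomic>

    constexpr int MaxLvl = 20;               // assumed maximum skiplist level

    struct Bucket {
        long key;                            // priority (dummy heads/tails use -inf/+inf)
        std::atomic<long> counter;           // number of items stored under this key
        int topLevel;                        // highest level this bucket appears on
        std::atomic<Bucket*> next[MaxLvl + 1];
    };

    struct SplitSkiplist {
        Bucket* headSeq;                     // dummy head of the sequential part
        Bucket* headPar;                     // dummy head of the parallel part
        Bucket* tail;                        // dummy tail
        Bucket* lastSeq;                     // last non-dummy bucket of the sequential part
        Bucket* currSeq;                     // next bucket the server will consume
        std::atomic<long> minValue;          // estimate of the current minimum key
        // A Single-Writer-Multi-Readers lock with writer preference guards the
        // boundary between the two parts (see Section 4.2.1).
    };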
Figure 4.1 shows the design. When a thread performs a PQ::add(v), either (1) v > lastSeq.key, and the thread inserts the value concurrently in the parallel part of the skiplist, calling the SL::addPar() skiplist operation; or (2) v ≤ lastSeq.key, and the thread tries to perform elimination with a PQ::removeMin() using an elimination array. A PQ::add(v) with v less than the smallest value in the priority queue can immediately eliminate with a PQ::removeMin(), if one is available. A PQ::add(v) operation with v bigger than minValue (the current minimal key) but smaller than lastSeq.key lingers in the elimination array for some time, waiting to become eligible for elimination or to time out. A server thread executes sequentially all operations that fail to eliminate.

Figure 4.1: Skiplist design. An elimination array is used for removeMin()s and add()s with small keys. A dedicated server thread collects the operations that do not eliminate and executes them on the sequential part of the skiplist. Concurrent threads operate on the parallel part, performing add()s with bigger keys. The dotted lines show pointers that would be established if the single skiplist were not divided in two parts.

This mechanism describes the first elimination algorithm for a priority queue, well integrated with delegation/combining, and is presented in more detail in Section 4.2.2. Specifically: (1) the scheme harnesses the parallelism of the priority queue add() operations, letting those add() operations with keys physically distant and large enough (bigger than lastSeq.key) execute in parallel; (2) at the same time, we batch concurrent priority queue add() operations with small keys and removeMin() operations that timed out in the elimination array, serving such requests quickly through the server thread – this latter operation simply consumes elements from the sequential part by navigating through the elements in its bottom level, merely decreasing counters and moving pointers in the most common situation. While detaching a sequential part is non-negligible cost-wise, a sequential part has the potential to serve multiple removals.

4.2.1 Concurrent Skiplist

Our underlying skiplist is operated by the server thread in the sequential part and by concurrently inserting threads with bigger keys in the parallel part.

Sequential part. The server calls the skiplist function SL::moveHead() to extract a new sequential part from the parallel part if some PQ::removeMin() operation was requested and the sequential part was empty. Conversely, it calls the skiplist function SL::chopHead() to relink the sequential and the parallel parts, forming a completely parallel skiplist, if no PQ::removeMin() operations have been requested for some time. In SL::moveHead(), we initially determine the elements to be moved to the sequential part. If no elements are found, the server clears the sequential part; otherwise, it separates the sequential part from the rest of the list, which becomes the parallel part. The number of elements that SL::moveHead() tries to detach to the sequential part adaptively varies between 8 and 65,536. Our policy is simple: if more than N insertions (e.g. N = 1000) occurred in the sequential part since the last SL::moveHead(), we halve the number of elements moved; otherwise, if less than M insertions (e.g. M = 100) were made, we double this number.
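The adaptive sizing policy can be summarized by the short sketch below, using the thresholds quoted in the text (N = 1000, M = 100, bounds 8 and 65,536). The structure and names are illustrative, not the literal code of our implementation.

    #include <algorithm>

    struct MoveHeadPolicy {
        long toMove = 1024;                        // current batch size, kept within [8, 65536]
        long seqInsertsSinceLastMove = 0;          // SL::addSeq() calls since the last moveHead()

        long nextBatchSize() {
            if (seqInsertsSinceLastMove > 1000)             // many sequential inserts:
                toMove = std::max(8L, toMove / 2);          // detach fewer elements next time
            else if (seqInsertsSinceLastMove < 100)         // few inserts, removals dominate:
                toMove = std::min(65536L, toMove * 2);      // detach more elements next time
            seqInsertsSinceLastMove = 0;
            return toMove;
        }
    };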
After SL::moveHead() executes, a pointer called currSeq indicates the first bucket in the sequential part, and another called lastSeq indicates the final bucket. The server uses SL::addSeq() and SL::removeSeq() within the sequential part to remove ele- ments or insert elements with small keys (i.e., belonging to the sequential part) that failed to eliminate. Buckets are not deleted at this time; they are deleted lazily when the whole sequential part gets consumed. A new sequential part can be created by calling SL::moveHead() again. Parallel part. The skiplist function SL::addPar() inserts elements into the par- allel part, and is called by concurrent threads performing PQ::add(). While these insertions are concurrent, the skiplist still relies on a Single-Writer Multi-Readers 36 lock with writer preference for the following purpose. Multiple SL::addPar() oper- ations acquire the lock for reading (executing concurrently), while SL::moveHead() and SL::chopHead() operations acquire the lock for writing. This way, we avoid that SL::addPar() operates on buckets that are currently being moved to the sequential part by SL::moveHead(), or interferes with SL::chopHead(). Despite the lock, SL::addPar() is not mutually exclusive with the head-moving operations (SL::moveHead() and SL::chopHead()). Only the pointer updates (for new buck- ets) or the counter increment (for existing buckets) must be done in the parallel part (and not have been moved to the sequential part) after we determine the locations of these changes. Hence, in the SL::addPar() operation, we first try to get a clean SL::find(): a find operation followed by lock acquisition for reading, with no in- tervening head-moving operations. We can tell whether no head-moving operation took place since our lock operations always increases a timestamp variable, checked in the critical section. After a clean SL::find(), therefore now holding the lock, if a bucket corresponding to the key is found, we insert the element in the bucket (incrementing a counter). Otherwise, a new bucket is created, and inserted level by level using CAS() operations. If a CAS() fails in a certain level, we release the lock and retry a clean SL::find(). Our algorithm differs from the traditional concurrent skiplist insertion algorithms in two ways: (1) we hold a lock to avoid head-moving operations to take place after a clean SL::find(); and (2) if the new bucket is moved out of the parallel section while we insert the element in the upper levels, we stop SL::addPar(), leaving this element with a capped level. This bucket is likely to be soon consumed by a SL::removeSeq() operation, resulting from a PQ::removeMin() operation. 37 Pseudo-code We present the pseudo-code for the concurrent skiplist algorithm. The skiplist contains a Single-Writer-Multi-Readers lock with writer preference, called simply lock. In terms of notation, lock.acquireR() acquires the lock for reads, and lock.acquireW() acquires the lock for writes. The SL::removeSeq() skiplist pro- cedure is described in Algorithm 1. Algorithm 1 SL::removeSeq() 1: if minValue = MaxInt then 2: return MaxInt 3: if currSeq = ⊥ then 4: moveHead() 5: key ← currSeq.key 6: currSeq.counter ← currSeq.counter - 1 7: if currSeq.counter = 0 then 8: while currSeq 6= lastSeq do 9: currSeq ← currSeq.next[0] 10: if currSeq.counter > 0 then 11: minValue = currSeq.key 12: return key 13: moveHead() 14: return key The variable lock.timestmap contains the timestamp associated with the lock (and hence with the head-moving operations). 
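For intuition, the following is a minimal sketch of a Single-Writer-Multi-Readers lock that also exposes a timestamp advanced by the head-moving (writer) operations, which is what the clean-find protocol relies on. It is a simplified variant: the real lock has writer preference and a single release() entry point, whereas this sketch distinguishes releaseR()/releaseW() and omits the preference logic.

    #include <atomic>

    class SwmrLock {
        std::atomic<int>  readers{0};
        std::atomic<bool> writer{false};
    public:
        std::atomic<unsigned long> timestamp{0};   // advanced by head-moving operations

        void acquireR() {
            for (;;) {
                while (writer.load()) { /* spin; writer preference omitted */ }
                readers.fetch_add(1);
                if (!writer.load()) return;        // no writer slipped in: read lock held
                readers.fetch_sub(1);              // otherwise back off and retry
            }
        }
        void acquireW() {
            bool expected = false;
            while (!writer.compare_exchange_weak(expected, true)) expected = false;
            while (readers.load() != 0) { /* wait for readers to drain */ }
            timestamp.fetch_add(1);                // a head-moving operation begins
        }
        void releaseR() { readers.fetch_sub(1); }
        void releaseW() { timestamp.fetch_add(1); writer.store(false); }  // and ends
    };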
Algorithm 2 returns a pair of elements (b, r): b is a bucket found using the skiplist SL::find() operation, and r is a boolean defined as follows. If a head-moving operation happened anywhere between Lines 1 and 4, the timestamp moved and r will be false. The SL::addPar() skiplist procedure is described in Algorithm 3. It uses the clean find protocol above. It performs a clean find, followed by mutable operations (ei- ther increasing a counter or inserting a bucket), executed with lock acquired for reading. 38 Algorithm 2 cleanFind(v, preds, succs) 1: t ← lock.timestamp 2: b ← find(headPar, v, preds, succs) 3: lock.acquireR() 4: if t < lock.timestamp then 5: lock.release() 6: return (⊥, false) 7: return (b, true) Algorithm 3 SL::addPar(v) 1: if v ≤ lastSeq.key then 2: return false 3: (b, r) ← cleanFind(v, preds, succs) 4: if r = false then 5: restart at line 3 6: if b 6= ⊥ then 7: Atomically increment b.counter 8: lock.release() 9: return true 10: b ← newNode(v) 11: for i: 1 → b.topLevel do 12: b.next[i] ← succs[i] 13: if not CAS(preds[0].next[0]: succs[0] → b) then 14: lock.release() 15: restart at line 3 16: repeat 17: m ← minValue 18: until m ≤ v or CAS(minValue: m → v) 19: for i: 1 → b.topLevel do 20: b.next[i] ← succs[i] 21: if CAS(preds[i].next[i]: succs[i] → b) then 22: continue 23: lock.release() 24: repeat 25: (b, r) ← cleanFind(v, preds, succs) 26: until r = true 27: if b = ⊥ then 28: lock.release() 29: return true 30: return true 39 The SL::moveHead() skiplist procedure is described in Algorithm 4. Line 19 creates the sequential part starting from where the parallel part used to be, and the opera- tions starting at Line 21 separate the skiplist in two parts. Note how SL::find() is used to locate the pointers that will change in order to separate the skiplist. Algorithm 4 SL::moveHead() 1: n is determined dynamically (see text) 2: lock.acquireW() 3: currSeq ← ⊥ 4: pred ← headPar 5: curr ← headPar.next[0] 6: i=0 7: while i < n and curr 6= tail do 8: i ← i + curr.counter 9: if currSeq = ⊥ then 10: currSeq ← curr; minValue ← curr.key 11: pred ← curr; curr ← curr.next[0] 12: if i = 0 then 13: for i : MaxLvl → 0 do 14: headPar[i], headSeq[i] ← tail 15: lastSeq ← headPar, minValue ← MaxLvl 16: lock.release() 17: return false 18: lastSeq ← pred 19: for i : MaxLvl → 0 do 20: headSeq[i] ← headPar[i] 21: find(headSeq, lastSeq + 1, preds, succs) 22: for i : MaxLvl → 0 do 23: preds[i].next[i] ← tail 24: headPar.next[i] ← succs[i] 25: lock.release() 26: return true Finally, the SL::chopHead() skiplist procedure is described in Algorithm 5. Note that all the SL::find() operations are executed outside the critical section. These operations identify the pointers that will change in order to relink the skiplist. 40 Algorithm 5 SL::chopHead() 1: if currSeq = ⊥ then 2: return false 3: find(headSeq, lastSeq.key + 1, preds, ⊥) 4: find(headSeq, currSeq.key, ⊥, succs) 5: lock.acquireW() 6: for i : MaxLvl → 0 do preds[i].next[i] ← headPar.next[i] 7: lastSeq ← headPar, currSeq ← ⊥ 8: for i : MaxLvl → 0 do 9: headPar.next[i] ← succs[i] if succs[i] 6= tail 10: lock.release() 11: return true 4.2.2 Elimination and Combining Elimination allows matching operations to complete without accessing the shared data structure, thus increasing parallelism and scalability. In a priority queue, any SL::removeMin() operation can be eliminated, but only SL::add() operations with values smaller or equal to the current minimum value can be so. If the priority queue is empty, any SL::add() value can be eliminated. 
We used an elimination array similar to the one in the stack elimination algorithm [39]. Each slot uses 64 bits to pack together a 32-bit value that represents either an opcode or a value to be inserted in the priority queue and a stamp that is unique for each operation. The opcodes are: EMPTY, REMREQ, TAKEN and INPROG. These are special values that cannot be used in the priority queue. All other values are admissible. In our implementation, each thread has a local count of how many operations it performed. This count is combined with the thread ID to obtain a unique stamp for each operation. Overflow was not an issue in our experiments, but if it becomes a problem a different algorithm for associating unique stamps to each operation could be used. The unique stamp is used to ensure linearizability, as explained in Section 4.3. All slots are initially empty, marked with the special value EMPTY, and the stamp value is zero. 41 A PQ::removeMin() thread loops through the elimination array until it finds a re- quest to eliminate with or it finds an empty slot in the array, as described in Al- gorithm 6. If it finds a value in the slot, then it must ensure that the stamp is positive, otherwise the value was posted as a response to another thread. The value it finds must be smaller than the current priority queue minimum value. Then, the PQ::removeMin() thread can CAS the slot, which contains both the value and the stamp, and replace it with an indicator that the value was taken (TAKEN, with stamp zero). The thread returns the value found. If instead, the PQ::remove() thread finds an empty slot, it posts a remove request (REMREQ), with a unique stamp generated as above. The thread waits until the slot is changed by another thread, having a value with stamp zero. The PQ::removeMin() thread can then return that value. Algorithm 6 PQ::removeMin() 1: while true do 2: pos ← (id + 1)% ELIM SIZE; (value, stamp) ← elim[pos] 3: if IsValue(value) and (stamp > 0) and (value ≤ skiplist.minValue)) then 4: if CAS(elim[pos], (value, stamp), (TAKEN, 0)) then 5: return value 6: if value = EMPTY then 7: if CAS(elim[pos], (value, stamp), (REMREQ, uniqueStamp())) then 8: repeat 9: (value, stamp) ← elim[pos] 10: until value 6= REMREQ and value 6= INPROG 11: elim[pos] ← (EMPTY, 0); return value 12: inc(pos) A PQ::add() thread initially tries to use SL::addPar() to add its key concurrently in the parallel part of the skiplist. A failed attempt indicates that the value should try to eliminate or should be inserted in the sequential part instead. The PQ::add() thread tries to eliminate by checking through the elimination array for REMREQ indicators. If it finds a remove request, and its value is smaller than the priority queue minValue, it can CAS its value with stamp zero, effectively handing it to 42 another thread. If multiple such attempts fail, the thread changes its behavior: it still tries to perform elimination as above, but as soon as an empty slot is found, it uses a CAS to insert its own value and the current stamp in the slot, waiting for another thread to match the operation (and change the opcode to TAKEN) returning the corresponding value. The PQ::add() and PQ::removeMin() threads that post a request in an empty slot of the elimination array wait for a matching thread to perform elimination. However, elimination could fail because no matching thread shows up or because the PQ::add() value is never smaller than the priority queue minValue. 
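One possible encoding of the elimination slots described above is sketched below: a 64-bit word packing a 32-bit value or opcode with a 32-bit stamp, so that a single CAS can replace both. The specific opcode constants and the bit split inside the stamp are assumptions made for illustration.

    #include <atomic>
    #include <cstdint>

    constexpr uint32_t EMPTY  = 0xFFFFFFFFu;   // assumed reserved opcodes; these values
    constexpr uint32_t REMREQ = 0xFFFFFFFEu;   // cannot be used as priority queue keys
    constexpr uint32_t TAKEN  = 0xFFFFFFFDu;
    constexpr uint32_t INPROG = 0xFFFFFFFCu;

    inline uint64_t packSlot(uint32_t value, uint32_t stamp) {
        return (uint64_t(value) << 32) | stamp;
    }
    inline uint32_t slotValue(uint64_t s) { return uint32_t(s >> 32); }
    inline uint32_t slotStamp(uint64_t s) { return uint32_t(s); }

    // Unique stamp: thread id combined with a per-thread operation counter.
    inline uint32_t uniqueStamp(uint32_t tid, uint32_t& localOps) {
        return (tid << 24) | (++localOps & 0xFFFFFFu);   // assumed bit split
    }

    // Elimination slots are plain 64-bit atomics that can be CASed in one shot;
    // each slot would start out as packSlot(EMPTY, 0).
    std::atomic<uint64_t> elim[64];                      // ELIM SIZE assumed to be 64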
To ensure that all threads make progress, we use a dedicated server thread that collects add and remove requests that fail to eliminate. The server thread executes the operations sequentially on the skiplist, calling the SL::addSeq() and SL::removeSeq() operations. To ensure linearizability, the server marks a slot that contains an operation it is about to execute as in progress (INPROG). Subsequently, it executes the sequential skiplist operation and writes back the response in the elimination slot for the other thread to find it. A state machine showing the possible transitions of a slot in the elimination array is shown in Figure 4.2, and the algorithm is described in Algorithm 7.

Algorithm 7 Server::execute()
1: while true do
2:   for i: 1 → ELIM SIZE do
3:     (value, stamp) ← elim[i]
4:     if value = REMREQ then
5:       if CAS(elim[i], (value, stamp), (INPROG, 0)) then
6:         min ← skiplist.removeSeq(); elim[i] ← (min, 0)
7:     if IsValue(value) and (stamp > 0) then
8:       if CAS(elim[i], (value, stamp), (INPROG, 0)) then
9:         skiplist.addSeq(value); elim[i] ← (TAKEN, 0)

Figure 4.2: Transitions of a slot in the elimination array.

The priority queue insertion algorithm is shown in Algorithm 8. If the value being inserted is not suitable for the parallel part (PQ::addPar() returns false), the request is posted in the elimination array, until it is eliminated with a suitable PQ::removeMin() or consumed by the server thread.

Algorithm 8 PQ::add(inValue)
1: if inValue ≤ skiplist.minValue then
2:   rep ← MAX ELIM MIN
3: else
4:   if skiplist.addPar(inValue) then
5:     return true
6:   rep ← MAX ELIM
7: while rep > 0 do
8:   pos ← (id + 1)% ELIM SIZE; (value, stamp) ← elim[pos]
9:   if value = REMREQ and (inValue ≤ skiplist.minValue) then
10:    if CAS(elim[pos], (value, stamp), (inValue, 0)) then
11:      return true
12:  rep ← rep − 1; inc(pos)
13: if skiplist.addPar(inValue) then
14:   return true
15: while true do
16:   (value, stamp) ← elim[pos]
17:   if value = REMREQ and (inValue ≤ skiplist.minValue) then
18:     if CAS(elim[pos], (value, stamp), (inValue, 0)) then
19:       return true
20:   if value = EMPTY then
21:     if CAS(elim[pos], (value, stamp), (inValue, uniqueStamp())) then
22:       repeat
23:         (value, stamp) ← elim[pos]
24:       until value = TAKEN
25:       elim[pos] ← (EMPTY, 0); return true
26:   inc(pos)

4.3 Linearizability

Our design provides a linearizable [45] priority queue algorithm. Some operations have multiple possible linearization points by design, requiring careful analysis and implementation.

Skiplist. A successful SL::addPar(v) (respectively, SL::addSeq(v)) usually linearizes when it inserts the element in the bottom level of the skiplist with a CAS (respectively, with a store), or when the bucket for key v has its counter incremented with a CAS (respectively, with a store). However, a thread inserting a minimal bucket, whenever v < minValue, is required to update minValue. When the sequential part is not empty, only the server can update minValue (without synchronization). When the sequential part is empty, a parallel add with minimal value needs to update minValue. The adding thread loops until a CAS decreasing minValue succeeds or another thread inserts a bucket with key smaller than v. Note that no head-moving operation can execute concurrently, because the SL::addPar() threads hold the lock. Threads that succeed in changing minValue linearize their operation at the point of the successful CAS.
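The minValue update loop just described (the repeat/until at the end of Algorithm 3) can be written compactly as follows; this is a sketch of the same logic in C++, with the names assumed as before.

    #include <atomic>

    // Loop until we either install v as the new minimum or observe that the
    // minimum is already no larger than v (another thread inserted a smaller bucket).
    inline void updateMinValue(std::atomic<long>& minValue, long v) {
        long m = minValue.load();
        while (m > v && !minValue.compare_exchange_weak(m, v)) {
            // compare_exchange_weak reloads m on failure; retry until m <= v or the CAS succeeds
        }
    }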
The head-moving operations SL::moveHead() and SL::chopHead() execute while holding the lock for writing, which effectively linearizes the operation at the lock.release() instant because: (1) no SL::addPar() is running; and (2) no SL::addSeq() or SL::removeSeq() is running, as the server thread is the single thread performing those operations. The head-moving operations do not change minValue; in fact, they preclude any changes to it. During these operations, however, threads may still perform elimination, which we discuss next.

Elimination. A unique stamp is used in each request posted in the array entries to avoid the "ABA" problem. Each elimination slot is a 64-bit value that contains 32 bits for the posted value (for PQ::add()) or a special opcode (for PQ::removeMin()) and 32 bits for the unique stamp. In our implementation, the unique stamp is obtained by combining the thread id with the number of operations performed by each thread. Each thread, either adding or removing, that finds the inverse operation in the elimination array must verify that the exchanged value is smaller than minValue. If so, the thread can CAS the elimination slot, exchanging arguments with the waiting thread. It is possible that the priority queue minimum value is changed by a concurrent PQ::add(). In that case, the linearization point for both threads engaged in elimination is at the point where the value was observed to be smaller than the priority queue minimum. See Fig. 4.3.

Figure 4.3: Concurrent execution of an op thread posting its request to an empty slot, and an inv thread executing a matching operation. The operation by the inv thread could begin any time before the Read and finish any time after the CAS. The linearization point is marked with a red X.

The thread performing the CAS first reads the stamp of the thread that posted the request in the array and verifies that it is allowed to eliminate. Only then does it perform a CAS on both the value and the stamp, guaranteeing that the waiting thread did not change in the meantime. Because both threads were running at the time of the verification, they can be linearized at that point. Without the unique stamp, the eliminating thread could perform a CAS on an identical request (i.e., identical operation and value) posted in the array by a different thread. The CAS would incorrectly succeed, but the operations would not be linearizable because the new thread was not executing while the suitable minimum was observed.

The linearizability of the combining operation follows from the linearizability of the skiplist. The threads post their operation in the elimination array and wait for the server to process it. The server first marks the operation as in progress by CASing INPROG into the slot. Then it performs the sequential operation on the skiplist and writes the result back in the slot, releasing the waiting thread. The waiting thread observes the new value and returns it. The linearization point of the operation happens during the sequential operation on the skiplist, as discussed above. See Fig. 4.4.
Figure 4.4: Concurrent execution of a client thread and the server thread. The client posts its operation op to an empty slot and waits for the server to collect the operation and execute it sequentially on the skiplist. The linearization point occurs in the sequential operation and is marked with a red X.

4.4 Evaluation

In this section, we discuss results on a Sun SPARC T5240, which contains two UltraSPARC T2 Plus chips with 8 cores each, running at 1.165 GHz. Each core has 8 hardware strands, for a total of 64 hardware threads per chip. A core has an 8KB L1 data cache and shares a 4MB L2 data cache with the other cores on the chip. We restrict the evaluation to cores within one chip to avoid cross-chip cache traffic and memory effects. Each experiment was performed five times and we report the median; variance was very low for all experiments. Each test was run for ten seconds to measure throughput. We used the same benchmark as flat combining [38]: a thread randomly flips a coin with probability p to perform a PQ::add() and 1 − p to perform a PQ::removeMin(). We started each run after inserting 2000 elements in the priority queue, to obtain stable-state results. Our priority queue algorithm (pqe) uses combining and elimination, and leverages the parallelism of PQ::add(). We performed experiments to compare against previous priority queues using combining methods, such as the flat combining skiplist (fcskiplist) and the flat combining pairing heap (fcpairheap). We also compared against previous priority queues using skiplists with parallel operations, such as a lock-free skiplist (lfskiplist) and a lazy skiplist (lazyskiplist). The flat combining methods are very fast at performing PQ::removeMin() operations, which get combined and executed together; however, performing the PQ::add() operations sequentially is a bottleneck for these methods. Conversely, the lfskiplist and lazyskiplist algorithms are very fast at performing the parallel adds, but are significantly slowed down by having PQ::removeMin() operations in the mix, due to the synchronization overhead involved. Our pqe design tries to address these limitations through its dual (sequential and parallel), adaptive implementation, which can be beneficial in the different scenarios.

Figure 4.5: Priority queue performance with 50% add()s, 50% removeMin()s.
Figure 4.6: Priority queue performance with 80% add()s, 20% removeMin()s.

We considered different percentages of PQ::add() and PQ::removeMin() in our tests.

Figure 4.7: add() work breakdown.
Figure 4.8: removeMin() work breakdown.
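For concreteness, the benchmark loop described above has roughly the following shape. This is a hedged sketch: PriorityQueue stands for whichever queue is under test, and the key range, seeding, and timing plumbing are illustrative choices, not the exact harness we used.

    #include <atomic>
    #include <random>

    struct PriorityQueue {            // assumed interface of the queue under test
        void add(long key);
        long removeMin();
    };

    void worker(PriorityQueue& pq, double p, std::atomic<bool>& stop,
                std::atomic<long>& ops, unsigned seed) {
        std::mt19937 rng(seed);
        std::bernoulli_distribution coin(p);                // p -> add(), 1 - p -> removeMin()
        std::uniform_int_distribution<long> keys(1, 1000000);
        long done = 0;
        while (!stop.load(std::memory_order_relaxed)) {
            if (coin(rng)) pq.add(keys(rng));
            else           pq.removeMin();
            ++done;
        }
        ops.fetch_add(done);
    }
    // The queue is pre-filled with 2000 elements before the timed run, and
    // throughput is reported as total ops over the 10-second measurement window.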
When the operations are roughly the same number, pqe can fully take advantage of both elimination and parallel adds, so it has peak performance. Figure 4.5 shows how for 50% PQ::add() and 50% PQ::removeMin(), pqe is much more scalable and can be up to 2.3 times faster than all other methods. When there are more PQ::add() than PQ::removeMin(), as in Figure 4.6 with 80% PQ::add() and 20% PQ::removeMin(), pqe behavior approaches the other methods, but it is still 70% faster than all other methods at high thread counts. In this specific case there is only little potential for elimination, but having parallel insertion operations makes our algorithm outperform the flat combining methods. The lazyskiplist algorithm also performs better than other methods, as it also takes advantage of parallel insertions. However, pqe uses the limited elimination and the combining methods to reduce contention, making it faster than the lazyskiplist. For more PQ::removeMin() operations than PQ::add() operations, the pqe’s potential for elimination and parallel adds are both limited, thus other methods can be faster. Pqe is designed for high contention scenarios, in which elimination and combining thrive. Therefore, it can incur a penalty at lower thread counts, where there is not enough contention to justify the overhead of the indirection caused by the elimination array and the server thread. To better understand when each of the optimizations used is more beneficial, we ana- lyzed the breakdown of the PQ::add() and PQ::removeMin() operations for different 49 PQ::add() percentages. When we have 80% PQ::add(), most of them are likely to be inserted in parallel (75%), with a smaller percentage being able to eliminate and an even smaller percentage being executed by the server, as shown in Fig. 4.7. In the same scenario, 75% of removeMin() operations eliminate, while the rest is executed by the server, as seen in Fig. 4.8. For balanced workloads (50% − 50%), most oper- ations eliminate and a few PQ::add() operations are inserted in parallel. When the workload is dominated by PQ::removeMin(), most PQ::add() eliminate, but most PQ::removeMin() are still left to be executed by the server thread, thus introducing a sequential bottleneck. Eventually the priority queue would become empty, not being able to satisfy PQ::removeMin() requests with an actual value anymore. In this case, any add() operation can eliminate, allowing full parallelism. We do not present results for this case because it is an unlikely scenario that unrealistically favors elimination. 4.4.1 PQ::moveHead() and PQ::chopHead() Maintaining separate skiplists for the sequential and the parallel part of the priority queue is beneficial for the overall throughput, but adds some overhead, which we quantify in this section. The number of elements that become part of the sequential skiplist changes dynamically based on the observed mix of operations. This adap- tive behavior helps reduce the number of moveHead() and chopHead() operations required. Table 4.1 shows the percentage of the number of head-moving opera- tions out of the total number of PQ::removeMin() operations for different mixes of PQ::add() and PQ::removeMin() operations. The head-moving operations are rarely called due to the priority queue’s adaptive behavior. 
Add() percentage    % moveHead()    % chopHead()
80                  0.24%           0.03%
50                  0.32%           0.01%
20                  0.00%           0.00%

Table 4.1: The number of head-moving operations as a percentage of the total number of PQ::removeMin() operations, considering different add() and removeMin() mixes.

4.5 Hardware Transactions

Transactional memory [43] is an optimistic mechanism to synchronize threads accessing shared data. Threads are allowed to execute critical sections speculatively in parallel, but, if there is a data conflict, one of them has to roll back and retry its critical section. Recently, IBM and Intel added HTM instructions to their processors [89, 49]. In our priority queue implementation, we used Intel's Transactional Synchronization Extensions (TSX) [49] to simplify the implementation and reduce the overhead caused by the synchronization necessary to manage a sequential and a parallel skiplist. We evaluate our results on an Intel Haswell four-core processor, Core i7-4770, with hardware transactions enabled (restricted transactional memory - RTM), running at 3.4GHz. There are 8GB of RAM shared across the machine and each core has a 32KB L1 cache. Hyperthreading was enabled on our machine, so we collected results using all 8 hardware threads. Hyperthreading causes resource sharing between the hyperthreads, including L1 cache sharing, when running with more than 4 threads, so it can negatively impact results, especially for hardware transactions. We did not notice a hyperthreading effect in our experiments. We used the GCC 4.8 compiler with support for RTM and optimizations enabled (-O3).

4.5.1 Skiplist

The Single-Writer-Multi-Readers lock used to synchronize the sequential and the parallel skiplists complicates the priority queue design and adds overhead. In this section, we explore an alternative design using hardware transactions. The naive approach of making all operations transactional causes too many aborts. Instead, the server increments a timestamp whenever a head-moving operation (SL::moveHead() or SL::chopHead()) starts or finishes. A SL::addPar() operation first reads the timestamp and executes a nontransactional SL::find(), and then starts a transaction for the actual insertion, adding the server's timestamp to its read set and aborting if it is different from the initially recorded value. Moreover, if the timestamp changes after the transaction starts, indicating a head-moving operation, the transaction will be aborted due to the timestamp conflict. If the timestamp is valid, SL::find() must have recorded the predecessors and successors of the new bucket at each level i in preds[i] and succs[i], respectively. If a bucket already exists, the counter is incremented inside the transaction and the operation completes. If the bucket does not exist, the operation proceeds to check whether preds[i] points to succs[i] for all levels 0 ≤ i ≤ MaxLvl. If so, the pointers have not changed before starting the transaction and the new bucket can be correctly inserted between preds[i] and succs[i]. Otherwise, we commit the (innocuous) transaction, yet restart the operation.
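The transactional insertion path just described can be sketched as follows, using Intel's RTM intrinsics (compile with -mrtm). The sketch reuses the Bucket layout sketched in Section 4.2; find(), linkBucket(), headPar, and headMoveTimestamp are assumed helper names rather than the exact ones from our implementation, and the retry and fallback policy is deliberately simplified.

    #include <immintrin.h>
    #include <atomic>

    extern std::atomic<unsigned long> headMoveTimestamp; // bumped by moveHead()/chopHead()
    extern Bucket* headPar;
    Bucket* find(Bucket* head, long v, Bucket** preds, Bucket** succs); // nontransactional traversal
    void linkBucket(long v, Bucket** preds, Bucket** succs);            // splice in a new bucket

    bool addParTSX(long v) {
        for (;;) {
            unsigned long ts = headMoveTimestamp.load();
            Bucket *preds[MaxLvl + 1], *succs[MaxLvl + 1];
            Bucket* b = find(headPar, v, preds, succs);

            unsigned status = _xbegin();
            if (status == _XBEGIN_STARTED) {
                if (headMoveTimestamp.load() != ts)   // timestamp joins the read set;
                    _xabort(0x01);                    // abort if a head move intervened
                if (b != nullptr) {                   // bucket exists: bump its counter
                    b->counter.fetch_add(1);
                    _xend();
                    return true;
                }
                bool clean = true;
                for (int i = 0; i <= MaxLvl && clean; ++i)
                    clean = (preds[i]->next[i].load() == succs[i]);
                if (clean) {
                    linkBucket(v, preds, succs);      // pointers unchanged: insert the bucket
                    _xend();
                    return true;
                }
                _xend();                              // innocuous commit: nothing was written
            }
            // aborted or inconsistent: retry; a real implementation bounds the retries
        }
    }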
Figures 4.9 and 4.10 compare the performance of the lock-based implementation and the implementation based on hardware transactions for two different percentages of PQ::add()s and PQ::removeMin()s. When fewer PQ::removeMin() operations are present, the timestamp changes less frequently and the PQ::add() transactions are aborted fewer times, which increases performance in the 80%-20% insertion-removal mix. In the 50%-50% mix, we obtain results comparable to the pqe algorithm using the lock-based approach, albeit with a much simpler implementation.

Figure 4.9: Priority queue performance when we use a transaction-based dual skiplist; 80% add()s, 20% removeMin()s.
Figure 4.10: Priority queue performance when we use a transaction-based dual skiplist; 50% add()s, 50% removeMin()s.

4.5.2 Aborted Transactions

The impact of aborted transactions is reported in Tables 4.2 and 4.3. As the number of threads increases, the number of transactions per successful operation also increases, as does the percentage of operations that need more than 10 retries to succeed. Note that the innocuous transactions that find inconsistent pointers (changed between the SL::find() and the start of the transaction) are not included in the measurement. After 10 retries, threads give up on retrying the transactional path and the server executes the operations on their behalf, either in the sequential part, using sequential operations, or in the parallel part, using CAS() for the pointer changes, but without holding the readers lock. The server does not need to acquire the readers lock because no other thread will try to acquire the writer lock. The number of transactions per successful operation is at most 3.92 overall, and 3.22 in the 50%-50% case. The percentage of operations that get executed by the server (after aborting 10 times) is at most 10% of the total number of operations, and between 1.73% and 2.01% for the 50%-50% case.
Working Threads    Transactions per successful operation    Fallbacks per total operations
1                  1.01                                     0.00%
2                  2.34                                     0.51%
3                  3.21                                     1.73%
4                  3.31                                     2.12%
5                  3.46                                     2.74%
6                  3.46                                     2.67%
7                  3.61                                     3.25%

Table 4.2: Transaction stats for varying # of threads, with 50% PQ::add()s and 50% PQ::removeMin()s.

Add percentage     Transactions per successful operation    Fallbacks per total operations
100                1.32                                     0.00%
80                 1.77                                     0.01%
60                 2.37                                     0.29%
50                 3.22                                     2.01%
40                 3.64                                     5.24%
20                 3.92                                     10.34%
0                  1.09                                     0.00%

Table 4.3: Transaction stats for varying mixes, with 1 server thread and 3 working threads.

4.5.3 Combining and Elimination

In this section, we describe our experience using Intel TSX to simplify combining and elimination. Adapting the elimination algorithm to use transactions was straightforward: we simply replaced the pessimistic synchronization with transactions. We note that the unique stamp described in Section 4.2.2 is not necessary for the linearizability of elimination if the operations are performed inside hardware transactions. If a thread finds a matching operation and ensures inside a transaction that the value is smaller than the minimum, then elimination is safe; if a change in the matching operation had occurred, the transaction would have aborted. We retry each transaction N times (e.g. N = 3 in our implementation). If a thread's transaction is aborted too many times during elimination, the thread moves on to other slots without retrying the failed slot in a fallback path. However, if the transaction fails while trying to insert a PQ::add() or PQ::removeMin() operation in an empty slot to be collected by the server thread, the original pessimistic algorithm is used as a software fallback path in order to guarantee forward progress. Unfortunately, the unique stamp is still needed to ensure linearizability of the operations executed on the fallback path.

Using transactions in the server thread implementation required including SL::addSeq() and SL::removeSeq() inside a transaction, which in turn caused too many aborts. Therefore, we designed an alternative combining algorithm that executes these operations outside the critical section. The complete algorithm is presented in Algorithm 9. It is based on the observation that, as long as there is a sequential part in the skiplist, the SL::removeSeq() and SL::addSeq() operations can be executed lazily. The server can use the skiplist's minValue to return a value to a remove request and execute the sequential operation only afterwards, without the removing thread waiting for it. Note that the skiplist's minValue could, in the meantime, return a value that is outdated. However, this value is always smaller than or equal to the actual minimum in the skiplist, because it can lag behind by at most one sequential remove. This value is used by the PQ::add() operations to determine whether they can eliminate. Therefore, estimating a minimum smaller than the actual minimum can affect performance, but does not impact the correctness of our algorithm. Moreover, the server performs the PQ::removeMin() operation immediately after writing the minimum, thus cleaning up the sequential part and updating the minimum estimate. The PQ::add() case is similar: if there is a sequential part in the skiplist, the server can update the skiplist lazily, after it releases the waiting thread. There is one difference: if the value inserted is smaller than minValue, then minValue needs to be updated before releasing the waiting thread. These changes to the combining algorithm allowed a straightforward implementation using hardware transactions.
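The retry-then-fallback pattern used throughout this section can be captured by the small wrapper below: attempt the critical section as a hardware transaction up to N times, then run the original pessimistic code. The lambda-based shape of the wrapper is our illustration, not the literal structure of the implementation.

    #include <immintrin.h>

    template <class Txn, class Fallback>
    auto withTSX(int maxRetries, Txn txn, Fallback fallback) -> decltype(fallback()) {
        for (int attempt = 0; attempt < maxRetries; ++attempt) {
            unsigned status = _xbegin();
            if (status == _XBEGIN_STARTED) {
                auto result = txn();   // e.g. check value <= minValue and exchange slots
                _xend();
                return result;
            }
            // aborted: fall through and retry (N = 3 in our implementation)
        }
        return fallback();             // pessimistic path; this path needs the unique stamp
    }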
However, our experiments indicated that certain particularities of the best-effort HTM design make it unsuitable for this sce- nario. First of all, because of its best-effort nature, a fallback is necessary in order to make progress. Therefore, the algorithm might be simplified on the common case, but it is still as complex as the fallback. Moreover, changes are often needed to adapt algorithms for an implementation using hardware transactions. Because these changes involve decreasing the sizes of the critical sections and decreasing the 55 number of potential conflicts, these changes could be beneficial to the original al- gorithm too. Finally, it seems that communications paradigms, such as elimination and combining, are best implemented using pessimistic methods. Intel TSX has no means of implementing non-transactional operations inside transactions (also called escape actions) and no polite spinning mechanism to allow a thread to wait for a change that is going to be performed in a transaction. The spinning thread could often abort the thread that it is waiting for. We used the PAUSE instruction in the spinning thread to alleviate this issue, but better hardware support for implementing communication paradigms using hardware transactions is necessary. For our elim- ination and combining algorithms, we concluded that pessimistic synchronization works better. Algorithm 9 Server::execute() 1: while true do 2: for i: 1 → ELIM SIZE do 3: (value, stamp) ← elim[i] 4: if value = REMREQ then 5: if skiplist.currSeq = ⊥ then 6: skiplist.moveHead() 7: if skiplist.currSeq 6= ⊥ then 8: if CAS(elim[i], (value, stamp), (skiplist.minValue, 0)) then 9: skiplist.removeSeq() 10: else 11: if CAS(elim[i], (value, stamp), (INPROG, 0)) then 12: min ← skiplist.removeSeq(); elim[i] ← (min, 0) 13: if IsValue(value) and (stamp > 0) then 14: if skiplist.currSeq 6= ⊥ then 15: if CAS(elim[i], (value, stamp), (INPROG, 0)) then 16: if value < skiplist.minValue then 17: skiplist.minValue ← value 18: elim[i] ← (TAKEN, 0); skiplist.addSeq(value) 19: else 20: if CAS(elim[i], (value, stamp), (INPROG, 0)) then 21: skiplist.addSeq(value); elim[i] ← (TAKEN, 0) 56 4.6 Summary In this chapter, we describe a technique to implement a scalable, linearizable priority queue based on a skiplist, divided into a sequential and a parallel part. Our scheme simultaneously enables parallel PQ::add() operations as well as sequential batched PQ::removeMin() operations. The sequential part is beneficial for batched removals, which are performed by a special server thread. While detaching the sequential part from the parallel part is non-negligible cost-wise, the sequential part has the potential to serve multiple subsequent removals at a small constant cost. The parallel part is beneficial for concurrent insertions of elements with bigger keys (smaller priority), not likely to be removed soon. In other words, we integrate the flat combining/delegation paradigm introduced in prior work with disjoint-access parallelism. In addition, we present a novel priority queue elimination algorithm, where PQ::add() operations with keys smaller than the priority queue minimum can eliminate with PQ::removeMin() operations. We permit PQ::add() operations, with keys small enough, to linger in the elimination array, waiting to become eligible for elimination. If the elimination is not possible, the operation is delegated to the server thread. 
Batched removals (combining) by the server thread is well-integrated with both: (1) parallelism of add() operations with bigger keys; and (2) the elimination algorithm, that possibly delegates failed elimination attempts (of elements with smaller keys) to the server thread in a natural manner. Our priority queue integrates delega- tion, combining, and elimination, while still leveraging the parallelism potential of insertions. Chapter 5 Software Fallbacks for Best-effort Hardware Transactional Memory Intel’s Haswell and IBM’s Blue Gene/Q and System Z are the first commercially available systems to include hardware transactional memory (HTM). However, they are all best-effort, meaning that every hardware transaction must have an alterna- tive software fallback path that guarantees forward progress. The simplest and most widely used software fallback is a single global lock (SGL), in which aborted hardware transactions acquire the SGL before they are re-executed in software. Other hard- ware transactions need to subscribe to this lock and abort as soon as it is acquired. This approach, however, causes many hardware transactions to abort unnecessarily, determining even more transactions to fail and resort to the SGL. In this chapter we suggest improvements to the simple SGL fallback. First, we use lazy subscription to reduce the rate of SGL acquisitions. Next, we propose fine- grained conflict detection mechanisms between hardware transactions and a software 57 58 SGL transaction. Finally, we describe how our findings can be used to improve future generations of HTMs. 5.1 Background Parallel programming has gained significant importance due to the rise of commodity multicore computer systems. Unfortunately, writing correct and efficient software that effectively utilizes the resources of multicore systems remains an obstacle for a wider spread adoption of parallel programming. Locks, the current state-of-the-art solution for synchronizing shared memory access in parallel programs, are notoriously challenging, even for expert programmers [44]. Transactional memory (TM) [43] aims to simplify writing correct and efficient parallel software while avoiding the pitfalls of locks. Threads can speculatively execute trans- actions, maintaining read and write sets to track conflicts. If a conflict is detected between two transactions, one is usually aborted and rolled back so the other can commit. Transactional memory generally promises all-or-nothing semantics, where critical sections appear as if they executed atomically or not at all. Unfortunately, the overhead associated with these designs is generally high. Hardware transactional memory (HTM), on the other hand, promises a faster per- forming, lower overhead alternative to STM. Yet, practical HTMs are best-effort: they do not guarantee forward progress. Furthermore, practical HTMs are bounded in size and support a restricted set of operations. It is for these reasons that an HTM alone is an insufficient TM solution. In short, ensuring forward progress requires a software fallback. A simple and at- tractive solution to use for many applications is a single global lock (SGL) mecha- nism [89, 90], where all transactions that access a particular data structure synchro- 59 nize through a single lock1 . Perhaps the most visible example of an SGL fallback scheme is Haswell’s hardware lock elision (HLE) [48], which supports a lock fallback directly through the instruction set architecture. 
SGL schemes are attractive be- cause they can easily be retrofitted to legacy code, and they do not require code duplication. In both HLE, and HTM with SGL fallback, each hardware transaction starts by reading the lock’s state, called subscribing to the lock. Subscription ensure that any software transaction that subsequently acquires that lock will provoke a data con- flict, ensuring correctness by forcing any active subscribing hardware transactions to abort. The duration of a lock subscription represents a “window of vulnerabil- ity” during which the arrival of a software transaction will prevent any subscribing hardware transactions from executing. In this chapter we present novel optimizations to the simple SGL fallback approach. We show that one can significantly improve performance by performing lock sub- scription in a lazy manner: optimistically postponing reading the lock state for as long as possible (usually the very end of the transaction). Lazy subscription was first proposed in the context of Hybrid NOrec [17] to allow concurrent execution of multiple hardware transactions with the committing phase of a speculative soft- ware transaction. Here, lazy subscription allows concurrent execution of multiple hardware transactions with a single non-speculative SGL transaction. The resulting mechanism maintains the simplicity and correctness of the original SGL fallback, but reduces its costs. We evaluate this design using Haswell’s restricted transac- tional memory (RTM) running the STAMP benchmark suite, and compare it to several alternatives: a non-speculative SGL implementation, a speculative imple- mentation with the usual SGL fallback, the hardware-only HLE, and to an STM 1 “Global” here could mean a single lock per data structure, not necessarily system-wide if composability is not an issue. 60 (TL2). We also show how to improve conflict detection with the SGL transaction and we propose several novel hardware extensions. 5.2 SGL Fallback (E-SGL) As noted, hardware transactional memory (HTM) has become a commercial reality, but HTM provided by processors such as Intel’s Haswell and IBM’s Power ISA offer no progress guarantees, implying that some form of software fallback is needed. In the single global lock (SGL) approach, each shared data structure has an associated lock. When it starts, a hardware transaction immediately reads the lock state, an action known as eager subscription. When a repeatedly failed hardware transaction restarts in software, it acquires exclusive access to the lock, forcing any subscribed hardware transactions to abort. SGL fallback is attractive because it is simple, requiring no memory access annota- tion, and no code duplication between alternative paths. Nevertheless, an inherent limitation of current SGL fallbacks schemes is that hardware and software transac- tions that share a global lock cannot execute concurrently. Figure 5.1 shows the four ways in which hardware and software transaction can overlap. In cases 2 and 3, the hardware transaction is aborted as soon as it checks the lock, while in cases 1 and 4 the hardware transaction is aborted when the software transaction acquires the lock. With eager subscription, it makes sense for a thread starting a hardware transaction to wait until the SGL becomes free. In this chapter, we describe how to improve conflict detection to allow some con- currency between the hardware and software transaction that share a lock. 
In Section 5.3, we describe a lazy subscription mechanism that permits concurrent hardware and software transactions to share the same SGL, and we give an intuitive argument for its correctness. We evaluate this scheme's performance in Section 5.4. We describe finer-grained conflict detection mechanisms in Section 5.5. In Section 5.6, we describe how these observations might improve future hardware.
Figure 5.1: Obvious SGL Fallback implementation (E-SGL).
5.3 Lazy SGL (L-SGL)
In a naïve SGL implementation (E-SGL), a hardware transaction immediately adds the lock to its read set, ensuring the transaction will be aborted if that lock is acquired by a software transaction. Hardware and software transactions cannot overlap (Figure 5.1). Lazy subscription can improve the chances of success of a hardware transaction by allowing some overlap with a software transaction. In Figure 5.2, L-SGL allows transactions (3) and (4) to commit, while E-SGL would abort them. Software and hardware transactions are treated differently in L-SGL. Each software transaction must acquire the SGL. Hardware transactions do not acquire the SGL, but they must check its status. With some exceptions described later, L-SGL hardware transactions read the lock only at the end, right before committing. If the lock is held by a software transaction, the hardware transaction explicitly aborts. This check is necessary because the hardware transaction may have observed an inconsistent state. If the lock is free, then no software transaction is in progress, and the hardware transaction can commit. Lazy subscription has been proposed to improve HyTM performance [17], but its use for an SGL fallback is new. HyTMs typically use sophisticated techniques to allow concurrency between multiple hardware and software transactions, but SGLs' simplicity makes them attractive in practice [89, 90]. The lazy SGL (L-SGL) approach described here improves a popular HTM fallback mechanism by allowing multiple hardware transactions to run concurrently with one software transaction.
Figure 5.2: Lazy SGL (L-SGL).
Haswell RTM provides an abort status code that offers limited information about why a hardware transaction aborted. L-SGL makes it easier to collect diagnostic information about failed hardware transactions from this abort status code. When an E-SGL hardware transaction is about to start, it makes sense to wait until the SGL is free. As a result, eager subscription rarely aborts hardware transactions explicitly at the time of subscription, so transactions are much more likely to be aborted automatically in-flight. Therefore, the abort status code will report this abort as a conflict. By contrast, L-SGL's lazy subscription mechanism makes it more likely that transactions will be aborted explicitly on subscription, allowing the programmer to obtain more detailed diagnostic information because, in this case, the abort status code can indicate precisely that the abort was caused by the lock. Like E-SGL, L-SGL does not require read or write annotations and permits transactions to be arbitrarily nested, but it does not permit explicit transaction aborts in user code. A software transaction waiting to acquire the SGL uses a combination of backoff and sleeping to reduce cache line contention. It starts by inserting an exponentially increasing number of null operations (NOPs) between successive lock attempts.
When the number of NOPs reaches a threshold, T , the transaction calls the sleep func- tion to release the processor for a brief duration before trying again. We found that sleeping right away is generally too slow for benchmarks where transactions are small and fast, but works well for larger and slower running transactions. Overall, we found that exponential waiting followed by sleeping works best across the range of benchmarks we considered. Before a thread starts a hardware transaction, it reads the SGL to prefetch the lock into the cache. If no software transaction tries to acquire that lock, the lock is likely to be cached at commit time, which our experiments have observed to speed commit. 64 5.3.1 Correctness STM designers often go to great efforts to ensure that all transactions see a con- sistent state, even after synchronization conflicts have occurred, a property called opacity [32]. The L-SGL design is simplified because hardware transactions do not need opacity. Instead, the L-SGL design relies on two guarantees. First, Haswell’s hardware sandboxing mechanism ensures that any hardware transaction that raises an exception or enters an infinite loop because of an inconsistent state is aborted and rolled back without affecting any other transactions. Second, the L-SGL design ensures that no hardware transaction can commit while a software transaction is in progress. There is one exception, explained in the next section. Fig. 5.3 illustrates why opacity is unnecessary: variables X and Y are linked by the invariant Y = X + 1. Now suppose a hardware transaction reads X and Y after a software transaction has incremented X, but before it has incremented Y , resulting in the inconsistent view X = Y . This hardware transaction will never commit, but it may encounter a division by zero when it evaluates 1/(Y − X). The Haswell hardware sandboxing mechanism will suppress the exception and roll back the transaction, ensuring that no other transaction is affected. Figure 5.3: Inconsistent reads. Fig. 5.4 outlines possible orderings between hardware and software transactions. 65 We order transactions by their commit time. Because software transactions cannot abort, any conflicting operation a software transaction executes after a hardware transaction has committed must be ordered after the hardware transaction. More- over, because TSX provides no “escape actions” a hardware transaction cannot wait for a software transaction to commit. In cases 1 (Fig. 5.4a) and 2 (Fig. 5.4b), the hardware transaction ends before the software transaction ends, and finds the lock taken when it tries to commit. In these two cases, the hardware transaction must be serialized before the software transaction. If a software transaction performs an operation that conflicts with a concurrently executing hardware transaction while the hardware transaction is still in-flight, the hardware transaction is aborted by the Haswell HTM conflict detection mechanism. If, on the other hand, the conflicting operation is performed by the hardware transaction, the conflict would not be detected. If both transactions were permitted to commit, the value of the conflicting location would be incorrect because the hardware overwrote the software transaction’s write (see Fig. 5.4). Here, we must abort the hardware transaction, because software transactions cannot be aborted. 
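This check happens entirely on the hardware side: in code, the hardware path of L-SGL differs from the E-SGL sketch above only in when the lock is read, since the subscription moves to the last step before commit. The following sketch reuses the illustrative names from the E-SGL sketch and omits the retry policy.

void lsgl_execute(void (*critical_section)(void))
{
    if (_xbegin() == _XBEGIN_STARTED) {
        critical_section();          /* speculative reads and writes           */
        if (sgl)                     /* lazy subscription: check the SGL only  */
            _xabort(0x01);           /* now; a software transaction is active  */
        _xend();                     /* lock free at commit time: safe to end  */
        return;
    }
    /* on abort: retry in hardware or acquire the SGL, as in the E-SGL sketch */
    sgl_lock();
    critical_section();
    sgl_unlock();
}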
It does not matter when the hardware transaction is aborted, so it is sufficient to check for conflicts as the final step of the hardware transaction before it commits. In L-SGL, such conflicts are detected by inspecting the state of the lock. In cases 3 (Fig. 5.4c) and 4 (Fig. 5.4d), the hardware transaction begins its commit after the concurrent software transaction has committed. If the lock is free at the time of the hardware commit, then the hardware transaction can commit even though it might have overlapped one or more software transactions. Because the hardware transaction commits after any concurrently executing software transaction, it will be ordered after any such overlapping software transaction. Therefore the correct value for any conflicting location is the value written by the hardware transaction. If the last value written to a location that conflicts with the hardware transaction belonged to the software transaction, then the hardware transaction would have aborted, because Haswell's HTM conflict detection system would have identified such a conflict and aborted the hardware transaction. Moreover, a software transaction observes only old values until the hardware transaction commits, so the software reads are serialized before the hardware writes.
Figure 5.4: Correctness: Cases 1-4. Arrows denote the "happens-before" relation.
5.3.2 Sandboxing
Hardware sandboxing prevents faults that occur inside a hardware transaction from propagating outside of the transaction. Spurious writes and faults caused by reading inconsistent state from the SGL transaction are not visible to other threads. There is, however, one unlikely situation in which inconsistent reads can cause a hardware transaction to commit prematurely. In principle, inconsistent reads could cause a spurious write to a location that is later used by the same transaction as the target of an indirect jump. If the target of the incorrect jump is an xend (commit) instruction, or data that looks like one, then the hardware transaction might commit incorrectly, without checking the lock. Note that the inconsistent transaction cannot actually change the program code and insert spurious xend instructions, as the code area is protected and writing to it would cause the transaction to abort. To address this hazard, lazy subscription must be performed before any indirect jump executed inside a hardware transaction that has written to memory. A read-only transaction, or one that is read-only up to the indirect jump, is not subject to this hazard. Moreover, if a transaction makes multiple indirect jumps, it is sufficient to check the lock only before the first jump, because the lock remains in the transaction's read set. In the results presented in Section 5.4, we use L-SGL with early subscription on the first indirect jump that occurs after a shared memory write. We found that this restriction did not affect performance. In general, this problem is similar to security concerns caused by buffer overflows. There is a trend towards compiler support to help with this issue, which might also be used to protect hardware transactions from incorrect premature commits. For example, the latest GCC supports security functionality to check vtable integrity. Moreover, for optimization levels higher than -O2, GCC uses devirtualization and inlining for the most likely target in indirect jumps.
A transactional compiler could use similar techniques to generate multiple likely targets and use the early lock check only in the unlikely case that none of the pre-established targets are chosen. 68 5.4 Evaluation Our experimental evaluation was performed on an Intel Haswell processor (Core i7-4770) with RTM and HLE enabled, running at 3.40 GHz. The machine has a total of 8GB of RAM shared across four cores, each having a 32 KB L1 cache. For our experiments, hyper-threading was enabled, giving us a total of eight hardware threads. However, we notice that hyper-threading negatively impacts performance at 8 threads due to L1 cache sharing. In practice, this results in more hardware transactions being aborted because of overflow. To show this effect, we performed a simple experiment in which we measured the rate of aborts due to overflow for one, two, four and eight threads for all STAMP benchmarks. The rate of overflow for 1 thread is indicative of the percentage of transactions that cannot succeed in hardware because of cache size or associativity limitations. As we increase the number of threads, the rate of overflow decreases, as more and more transactions abort because of conflicts with other transactions. However, for 8 threads, the rate of overflow significantly increases, showing the negative effects of hyper-threading, as can be seen for the Vacation High benchmark in Fig. 5.5. Results were similar for all other STAMP benchmarks, except for the Labyrinth benchmark, where most of the aborts are caused by unsupported instructions; we omitted these graphs due to space constraints. We used GCC 4.8.2 compiler with -O3 optimization enabled and gcc intrinsics [49]. We used the STAMP benchmarks [15] to compare L-SGL’s speedup relative to a single-threaded sequential execution with software only approaches - a state-of-the- art STM (TL2) and a single global lock (spinlock) without any transactional execu- tion (SGL) - and with a hardware only solution (Haswell HLE). For HLE, we used a single global spin lock prefixed with HLE-Acquire and HLE-Release instructions to suggest that the critical section should be executed speculatively. If speculation 69 fails, the critical section is retried non-speculatively, according to a hardware pol- icy. We also compared to the na¨ıve SGL implementation with eager subscription (E-SGL). We ran all methods five times and presented the median of the results. Variance was generally low. We also compared L-SGL’s rate of transactional success with that of HLE and E-SGL, by measuring the percentage of transactions executed non-speculatively for both methods. Figure 5.5: Example of overflow due to hyper-threading (vacation high benchmark). 5.4.1 Speedup relative to sequential execution L-SGL performs best on benchmarks with medium sized transactions, such as In- truder 5.6c, Vacation Low 5.6h and Vacation High 5.6i, where it outperforms all prior methods. On the benchmarks with smaller transactions, such as Ssca2 5.6g, Kmeans Low 5.6d and Kmeans High 5.6e, L-SGL has good speedup compared to sequential execution, and outperforms TL2, which has too much overhead for these small transactions. However, L-SGL does not present a significant advantage com- pared to HLE on these benchmarks, because most transactions will quickly succeed in hardware, therefore making the differences between L-SGL and HLE less notice- able. For Kmeans Low 5.6d, where there is little contention, SGL performs similar to L-SGL and HLE as well. 
However, when there is more contention, as is the case with Kmeans High 5.6e, or when most transactions can succeed in hardware, in parallel, as in Ssca2 5.6g, the performance of SGL quickly degrades. Finally, for large transactions and those with unsupported instructions, as in Bayes 5.6a, Labyrinth 5.6f and Yada 5.6j, TL2 is more advantageous, because it can execute transactions in parallel, in software, without overflowing the cache. The effects of hyper-threading when running with 8 threads are even more pronounced on these benchmarks, because most transactions are large. We note that Labyrinth in particular is very suitable for STM systems because it uses very large transactions whose initial memory accesses are all local. Therefore, these memory accesses do not contribute towards generating conflicts in an STM. Unfortunately, Haswell RTM does not have escape actions, so these local accesses are counted as transactional and can overflow the cache unnecessarily.
Figure 5.6: STAMP Throughput.
Figure 5.7: STAMP Percentage of Lock Acquisitions.
Figure 5.8: Speedup for 8 threads.
Figure 5.9: Slowdown for 1 thread.
5.4.2 Percentage of lock acquisitions
We measured the percentage of lock acquisitions in L-SGL by inserting statistics in our code to measure the total number of transactions and the percentage executed non-speculatively. We measured the percentage of lock acquisitions in HLE using perf with support for TSX, a performance analysis tool for Linux. We can see in Fig. 5.7 that L-SGL achieves a better rate of transactional execution than HLE on all STAMP benchmarks (its rate of lock acquisitions is lower than HLE's rate on all benchmarks). L-SGL uses lazy subscription, so the lock is read transactionally at the end of the critical section. In contrast, HLE subscribes to the lock address at the beginning of the critical section, suffering more aborts due to changes to the lock.
5.4.3 Single-threaded penalty
One of the biggest advantages of L-SGL is that it manages to improve performance for 4 and 8 threads without paying a big penalty for single-threaded execution, as is the case with most STMs. For example, Fig. 5.8 shows L-SGL's speedup relative to sequential execution for 8 threads and Fig. 5.9 shows the slowdown for 1 thread. We can see that TL2 pays a huge penalty for single-threaded execution, while L-SGL is almost as fast as sequential execution.
5.5 Fine-grained SGL
L-SGL allows multiple hardware transactions to execute concurrently with a software transaction as long as the software transaction commits first (Fig. 5.4c and Fig. 5.4d). Unfortunately, hardware transactions that attempt to commit while a software transaction is in progress will abort (Fig. 5.4a and Fig. 5.4b). This is the correct and expected behavior if there are conflicts between the hardware transactions and the software transaction, but otherwise these hardware transactions could successfully commit. Despite being an improvement over the simple single global lock algorithm, L-SGL does not enable the maximum amount of concurrency possible between multiple hardware transactions and a software transaction.
In this section, we describe another SGL fallback mechanism that performs finer grained conflict detection than E-SGL and L-SGL, based on Bloom filters (BF-SGL). BF-SGL increases the amount of concurrency offered by the hybrid transactional memory system in Fig. 5.4a and Fig. 5.4b. In order to detect conflicts between the software transaction and hardware transactions, we add a Bloom filter for each thread. Each read and write is annotated to add the memory location to the Bloom filter. Hardware transactions consult the global lock before committing and, if they 75 find it free, they can commit successfully. However, if the lock is taken, they can compare their Bloom filter with the software transaction’s Bloom filter to determine if there are conflicts. The Bloom filter allows false positives, but not false negatives. Therefore, it could detect a conflict despite the transactions not having any conflicts, but it will never report zero conflicts if the transactions accessed the same memory. So the hardware transactions can commit successfully even if the lock is taken as long as the Bloom filters do not report conflicts. L-SGL represents a particular case of BF-SGL. Specifically, L-SGL can be obtained from BF-SGL if the Bloom filter set intersection operation between the hardware transaction trying to commit and the currently executing software transaction always returns that there exists at least one conflict. 5.5.1 Use Cases Using BF-SGL, many small hardware transactions that access disjoint memory loca- tions and concurrently executing large software transactions can commit. The same is not true for any other system that we are aware of. This is because we provide precise conflict detection using the Bloom filters for the HW and SW transactions to track memory accesses. Consider, for example, an array representing an open addressing hash-table. Threads can perform lookup(x) operations and insert(x) op- erations in this hash-table. Once a threshold of occupancy is achieved, a thread decides to double the size of the hash-table by allocating a new array and rehashing elements from the old array to the new array. Lookup and insert are short transac- tions and can succeed in hardware most of the time. Rehashing is always executed as a software transaction, so the thread needs to acquire the single global lock. Using L-SGL, no lookup and insert operations can succeed during rehashing. How- ever, using BF-SGL with precise conflict detection between the software transaction 76 and the concurrent hardware transactions, lookup operations executed as hardware transactions can commit using data from the old array while the rehashing to the new array is taking place. Moreover, insert operations executed as hardware trans- actions at the end of the old array, in the part that has not been rehashed yet, can also commit during rehashing. Therefore, BF-SGL improves throughput by allowing small hardware transactions to commit concurrently with long executing software transactions. 5.5.2 Performance and Practicality Adding the software Bloom filter to hardware transactions incurs some overhead compared to simple hardware transactions. However, the Bloom filter adds the ben- efit of being able to commit hardware transactions even when a software transaction is executing as long as there are no real conflicts or false conflicts caused by the Bloom filter. An efficient Bloom filter implementation allows insertion and set intersection in O(1) time, minimizing the overhead. 
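A minimal per-transaction filter of the kind BF-SGL relies on might look like the sketch below; the filter size, the single hash function, and all names are illustrative simplifications, not the implementation we measured.

#include <stdint.h>
#include <string.h>

#define BF_WORDS 8                              /* 8 x 64 bits = one cache line */

typedef struct { uint64_t bits[BF_WORDS]; } bloom_t;

static inline void bf_add(bloom_t *bf, const void *addr)
{
    unsigned bit = ((uintptr_t)addr >> 3) % (BF_WORDS * 64);  /* one cheap hash */
    bf->bits[bit / 64] |= 1ull << (bit % 64);
}

/* Nonzero if the filters may overlap: false positives are possible,
   false negatives are not. */
static inline int bf_intersects(const bloom_t *a, const bloom_t *b)
{
    for (int i = 0; i < BF_WORDS; i++)
        if (a->bits[i] & b->bits[i])
            return 1;
    return 0;
}

static inline void bf_clear(bloom_t *bf) { memset(bf, 0, sizeof *bf); }

At commit time, a BF-SGL hardware transaction that finds the lock taken aborts only if bf_intersects reports a potential overlap between its own filters and those of the software transaction.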
In addition, reading these two locations in the hardware transaction only adds two additional cache lines to the read set of the transaction. This can be optimized so that a bit of the Bloom filter is used to indicate whether the lock is taken or not and the rest is used as a Bloom filter. Therefore, the lock location can serve both purposes, reducing the read set size of the hardware transaction to just one additional location. The transaction’s own Bloom filters add additional cache lines to the write set, but this could be as low as only one cache line, depending on the Bloom filter size. Hardware transactions read the software Bloom filter only right before committing, narrowing the window when hardware transactions could be aborted by software transactions. Unfortunately, the software transaction needs to modify its Bloom 77 filters for every read and write, causing many spurious aborts for the hardware transactions. We found that this behavior significantly affects the performance of BF-SGL, so we did not include results for this system. However, we note that this is a strong motivation why escape actions should be included with any HTM. If we had escape actions, the Bloom filters could be read non-transactionally at the end of the hardware transaction, avoiding the spurious aborts caused by the software transaction updating its Bloom filters. Correctness would still be maintained because any conflicting read or write performed by the software will still abort the hardware transaction. We believe this support will be available in the future, making the bloom filter based conflict detection a viable option. For example, IBM’s Power ISA suspended mode [8] provides the necessary functionality. 5.6 Hardware Optimizations Lazy Hardware Lock Elision (LHLE). Haswell’s HLE works by eliding locks prefixed with HLE-Acquire and executing the critical sections as hardware transac- tions. If the speculation fails for any reason, the lock is acquired and the critical section is re-executed non-speculatively in software. HLE is similar to E-SGL: hard- ware transactions need to subscribe to the lock in the beginning of their execution to ensure correctness. However, we have shown that L-SGL, implemented in software, outperforms the hardware only HLE. Therefore, we speculate that Lazy Hardware Lock Elision (LHLE), where the lock is added to the read set at the end of the critical section, would perform better than HLE. Similar to HLE, LHLE enables multiple speculative critical sections to execute in parallel if there are no conflicts detected at run-time and it simplifies programming by enabling more parallelism for coarse- grained critical sections. In contrast to HLE, LHLE supports parallelism between one non-speculative critical section and multiple speculative critical sections. More- 78 over, LHLE is designed to be implemented entirely in hardware, so the sandboxing issues described in Section 5.3.2 do not arise, as the hardware can ensure that the subscription to the lock occurs whenever the xend instruction is invoked. Bloom Filter Hardware Lock Elision. As described in Section 5.5, BF-SGL can improve the granularity of conflict detection with an SGL, but causes spurious aborts because the SGL transaction’s Bloom filters become part of the read set of the hardware transactions. This could be avoided if the HTM allowed escape actions. In that case, the Bloom filters would be read non-transactionally to detect conflicts. 
Alternatively, if the Bloom filters were handled by the hardware instead of the software, they could avoid the tracking mechanism of HTM and avoid the unnecessary aborts. Haswell HLE could be extended with Bloom filters for the hardware transaction, as well as for the SGL transaction. With this design, conflict detection would be realized at a finer-grained level than it is currently done. 5.7 Summary The na¨ıve SGL fallback’s simplicity makes it an appealing alternative to more com- plicated, even if better-performing, HyTM schemes. In this chapter, we introduced novel SGL methods that improve the performance of the simple SGL fallback, while maintaining its simplicity. First, we described L-SGL, a simple SGL-based fallback for HTM that uses lazy subscription to allow hardware-software transaction concur- rency. L-SGL improves performance on current machines by up to 4X compared to state-of-the-art software and hardware solutions. In addition, L-SGL has some appealing properties. For example, it does not require read and write annotations, making it suitable for implementation in a real system, either in the compiler or even in hardware. Our L-SGL software implementation 79 improves performance over native Haswell lock elision by almost a factor of 2, and reduces the rate of lock acquisitions by up to 35%. We conjecture this difference would be even higher if L-SGL were implemented in hardware. We also described BF-SGL, an alternative SGL fallback mechanism with more accu- rate conflict detection. Our BF-SGL results, perhaps counter-intuitively, show that adding a mechanism to support better conflict detection, such as Bloom filters, hin- ders performance by increasing the abort rate. If the HTM were to support escape actions, allowing precise conflict detection to be performed outside of transactional tracking, we speculate that this comparison would change in favor of BF-SGL. Fi- nally, we showed how to use these ideas to improve future HTMs with minimal microarchitectural changes. Chapter 6 Hybrid Transactional Memory The Intel Haswell processor includes restricted transactional memory (RTM), which is the first commodity-based hardware transactional memory (HTM) to become pub- licly available. However, like other real HTMs, such as IBM’s Blue Gene/Q, Haswell’s RTM is best-effort, meaning it provides no transactional forward progress guaran- tees. Because of this, a software fallback system must be used in conjunction with Haswell’s RTM to ensure transactional programs execute to completion. To com- plicate matters, Haswell does not provide escape actions. Without escape actions, non-transactional instructions cannot be executed within the context of a hardware transaction, thereby restricting the ways in which a software fallback can interact with the HTM. As such, the challenge of creating a scalable hybrid TM (HyTM) that uses Haswell’s RTM and a software TM (STM) fallback is exacerbated. In this chapter, we present Invyswell, a novel HyTM that exploits the benefits and manages the limitations of Haswell’s RTM. After describing Invyswell’s design, we show that it outperforms NOrec, a state-of-the-art STM, by 35%, Hybrid NOrec, NOrec’s hybrid implementation, by 18%, and Haswell’s hardware-only lock elision by 25% across all STAMP benchmarks. 80 81 6.1 Background Traditionally, locks have been the predominant mechanism used to synchronize shared memory in multithreaded programs [44]. 
Yet, developing software that cor- rectly and efficiently uses locks is notoriously challenging, even for the most seasoned programmers. Transactional memory (TM) has been proposed as an alternative to locks, where much of the mechanical complexity of synchronization is managed by the underlying system, not the programmer [43, 84]. Experience with software transactional memory (STM), where transactions are im- plemented entirely in software, has demonstrated the simplicity of transactional pro- gramming, but has raised challenging performance issues. Modern STMs tend to be scalable at high thread counts [26], meaning that beyond a certain point (and up to a limit), adding more threads typically increases throughput for many benchmarks, yielding performance that is often competitive with fine-grained locking. Unfortu- nately, these STMs tend to perform poorly at low or medium thread counts, because of non-amortized transactional overhead, resulting in performance that is not com- petitive with fine-grained locking. To improve the performance of transactions, hardware vendors such as Intel and IBM have included support for hardware transactional memory (HTM). One such exam- ple is Intel’s Haswell processor [49], which includes restricted transactional mem- ory (RTM), a cache-based HTM design that uses the microarchitecture’s existing cache coherence protocol to manage transactional conflicts. Yet, it is unclear how RTM can be most effectively used by software. One cannot simply substitute hard- ware transactions for software transactions, because RTM, like other HTMs, such as IBM’s Blue Gene/Q [89] and System z [51], is best-effort, providing no progress 82 1 guarantees. Whether a transaction succeeds depends on whether its data set fits in the processor’s cache, whether the transaction finishes without interruption, and a myriad of other architectural and platform-specific limitations best hidden from the programmer. It has been recognized that effectively integrating best-effort HTM with the software that uses it requires an intermediate software fallback when hardware transactions fail. Such a system is called hybrid transactional memory (HyTM) [19, 17, 22, 58], where hardware and software transactions execute under the umbrella of a single TM system. In this chapter, we present a novel HyTM, called Invyswell, that uses hardware transactions from Haswell’s RTM in conjunction with software transac- tions from a heavily modified design of InvalSTM [30], an STM designed to provide scalability and performance for large transactions with notable contention. Invyswell enables the concurrent execution of both hardware and software transac- tions with the aim of being performant for all transaction sizes and degrees of con- tentions. Haswell’s RTM performs best for small transactions with low contention, as it imposes no instrumentation overhead, but is limited to a “requester-wins” con- tention policy. InvalSTM performs best for large transactions with high contention, because it can make highly informed contention management decisions through its commit-time invalidation process. Yet, challenges remain in finding an efficient solu- tion for the “transactional twilight zone” - midsize transactions that are small enough to successfully execute in hardware but have a non-trivial degree of contention. 
Furthermore, even after designing a TM that addresses the unique challenges of each of these categories, that system must ensure that each individual component does not negatively impact the overall performance by mismanaging transactions for which it was not intended. Invyswell addresses this by using a sophisticated design that employs several hardware and software modes of execution. This gives the system the flexibility to trade execution overhead for precision in conflict detection.
1 Although System z supports constrained transactions, which are guaranteed to commit, we believe this does not present a generalized mechanism for HTM forward progress as constrained transactions are size-restricted.
Haswell's RTM does not support escape actions, i.e., non-transactional instructions executed within transactions [70]. This limitation complicated our design, especially with respect to opacity [32], a correctness condition that guarantees consistency of eventually-aborted transactions. Another challenge we encountered was designing Invyswell's contention manager (CM), a decision-making process aimed at improving throughput, due to the different isolation properties of hardware and software transactions. The lack of escape actions further complicated this issue, as it restricts the way a hardware transaction can abort a software transaction before the hardware transaction itself commits. We evaluate Invyswell's performance using the STAMP benchmark suite. Invyswell's performance compares favorably to that of pure software, pure hardware, and hybrid solutions. Invyswell is 35% faster than NOrecSTM [18], a state-of-the-art software transactional memory, and 18% faster than NOrecHy [17], a state-of-the-art hybrid transactional memory, as shown in Figure 6.1. It also outperforms Haswell's native hardware lock elision (HLE) [48, 75], a hardware mechanism that attempts to elide locks by executing critical sections as transactions and supports transactional re-execution with a single global lock fallback implemented purely in hardware. Although on average Invyswell is only 25% faster than HLE, the performance difference is significant for some benchmarks with large transactions, where Invyswell outperforms HLE by 2× to 5.4×.
Figure 6.1: STAMP Performance Differential by Geometric Mean. *Hyperthreading is enabled for 8 threads. (Note: NOrec and Hybrid NOrec are abbreviated as NorecSTM and NorecHy, respectively, in the legend.)
6.2 Overview of InvalSTM
One of the key differences between InvalSTM and other STMs is that it performs commit-time invalidation [30]. This approach requires that a transaction identify and resolve conflicts with all other in-flight (i.e., concurrently executing) transactions during its commit phase. InvalSTM achieves this by storing read and write sets in transaction-specific Bloom filters so it can perform conflict detection using constant-time set intersection. With commit-time invalidation, InvalSTM has complete knowledge of all conflicts between a committing transaction and other in-flight transactions, allowing it to make informed decisions on how to best mitigate contention. All InvalSTM transactions perform validation to achieve opacity in O(N) total computational complexity, where N is the number of read elements, which is notably faster than the O(N²) overhead incurred by incremental validation and can drastically reduce the opacity cost for large transactions.
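As a rough illustration of commit-time invalidation, the sketch below intersects the committing transaction's write filter with every in-flight transaction's read filter, reusing the illustrative bloom_t helpers from the sketch in Chapter 5; InvalSTM's actual commit phase additionally consults the contention manager and synchronizes access to the in-flight set.

typedef struct txn {
    bloom_t read_bf, write_bf;        /* per-transaction Bloom filters           */
    volatile int valid;               /* cleared when another txn invalidates us */
} txn_t;

#define MAX_TXNS 64
extern txn_t *in_flight[MAX_TXNS];    /* slots holding in-flight transactions    */

static void invalidate_conflicting(txn_t *committing)
{
    for (int i = 0; i < MAX_TXNS; i++) {
        txn_t *t = in_flight[i];
        if (t && t != committing &&
            bf_intersects(&committing->write_bf, &t->read_bf))
            t->valid = 0;             /* doomed: it may have read what we wrote  */
    }
}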
Additionally, read-only transactions commit without incurring any commit-time serialization overhead. For these reasons, InvalSTM naturally complements Haswell’s RTM. Haswell’s RTM can be used for small transactions and low thread counts, while InvalSTM can be used for large transactions and high thread counts. Moreover, Haswell’s RTM can lever- 85 Figure 6.2: Transactional Events for Invyswell’s Different Transaction Types. 86 age InvalSTM’s use of Bloom filters for conflict detection by augmenting Haswell’s hardware transactions with Bloom filters to enable many hardware transactions to execute concurrently with many software transactions. These Bloom filters are a good fit for Haswell’s cache-based HTM design because they can be structured for constant-sized cache line alignment, thereby minimizing the negative impact of intro- ducing hardware-to-software conflict detection into an already restricted HTM space. Finally, because InvalSTM’s read-only transactions do not introduce any serializa- tion in their execution, the performance overhead for transactions is transparent to Haswell RTM’s faster executing hardware transactions. This enables Haswell’s RTM to perform without interference when read-only software transactions are executing within InvalSTM, regardless of their size. 6.3 Invyswell’s Design In this section, we describe Invyswell, a HyTM that supports the concurrent exe- cution of multiple hardware and multiple software transactions while guaranteeing forward progress. Invyswell uses Haswell’s RTM [49] and a modified version of Inval- STM [30]. In InvalSTM, when a transaction is ready to commit, it marks conflicting in-flight transactions as invalid. InvalSTM uses Bloom filters for fast conflict detec- tion between software transactions. Invyswell also uses Bloom filters at times, but not always, for conflict detection between hardware and software transactions. Because Haswell’s RTM does not support escape actions, the communication be- tween in-flight hardware and software transactions is essentially impossible without introducing conflicts between them. For example, if a software transaction writes to memory shared by a hardware transaction, the latter will abort. Yet, communi- cation between hardware and software transactions might be useful to improve the 87 precision of conflict detection between them, thereby increasing throughput in cases when conflicts do not occur. To manage this space, Invyswell generally performs conflict detection between a hardware and a software transaction after the hardware transaction has committed. This enables increased throughput in cases where no conflicts exist while minimizing the chance of aborting a hardware transaction because of communication with in- flight software transactions. Furthermore, Invyswell exploits the observation that hardware transactions do not need to check for conflicts with software transactions until just before committing, a mechanism called lazy subscription, which was introduced by Dalessandro et al. in their NOrec HyTM system [17]. By using lazy subscription, Invyswell reduces the “window of vulnerability” in which a write to a software transaction’s conflict detection metadata (e.g., its read set, its execution lock, etc.) will abort a non- conflicting, in-flight hardware transaction. Invyswell supports five transactions types, motivated by the need for progress guar- antees and adaptability to different types of workloads. 
Two types are in hard- ware, lightweight (LiteHW) and bloom filter-based (BFHW), and three types are in software, speculative (SpecSW), irrevocable (IrrevocSW), and single global lock (SglSW). The pseudocode for these transaction is shown in Figure 6.2. Invyswell’s state transitions between them are shown in Figure 6.3. 6.3.1 SpecSW: An HTM-Friendly InvalSTM Invyswell’s first type of transaction is the speculative software transaction (SpecSW), which is similar to an InvalSTM transaction, and is shown in Figure 6.4. It tracks its read and write locations in transaction-specific Bloom filters and stores its write 88 Figure 6.3: Invyswell’s State Machine Describing the Transitions Between the Different Transaction Types. set’s values in a hash table for deferred update during its commit phase. Note that a memory barrier is necessary after inserting a memory location in a read bloom filter and before reading the value from memory. At commit-time, a SpecSW performs invalidation, where it compares its write Bloom filter against all other in- flight SpecSWs’ read Bloom filters. If a conflict is found, it consults the contention manager (CM) on how to proceed. The CM then either aborts the committing transaction or permits it to commit. If permitted to commit, the SpecSW transaction updates all write locations and then marks all conflicting in-flight transactions as invalid. During a SpecSW’s execution, it checks to see if it has been marked as invalid prior to each read and write and prior to committing. If it has, it aborts and it retries again as a SpecSW or another type as illustrated in Figure 6.3. A key difference between Invyswell and InvalSTM is that SpecSWs perform invali- dation after committing changes to memory, unlike InvalSTM, which performs in- validation before. The reason for doing this is the following. In InvalSTM, new transactions acquire an in-flight lock to insert their transaction ID into an in-flight linked list. If Invyswell did the same, hardware transactions would have to read this 89 lock before committing, to ensure correctness in their conflict detection. However, reading such a lock could subsequently cause many unnecessary hardware transac- tion aborts because whenever a new SpecSW was added to the list the in-flight lock would be acquired, automatically aborting all hardware transactions that previously read it. To avoid this behavior, Invyswell performs invalidation after committing SpecSW’s changes to memory and uses a slotted array for the in-flight SpecSWs, rather than a linked list. The combination of these changes results in Invyswell’s elimination of the InvalSTM in-flight lock, thereby reducing the likelihood of unnecessary hard- ware transaction aborts. Instead, if a new transaction starts while the committing transaction is updating memory, it will be detected by the invalidation phase of the committing transaction, which will follow the memory update phase. Alternatively, if the new transaction starts after the memory was already updated, it could be missed by the invalidation phase. However, this new transaction is guaranteed to only read consistent states because the committing transaction has finished updating the memory, making the bloom filter check unnecessary for this transaction. Initially, this modification results in the loss of opacity for SpecSWs, however, we restore opacity for SpecSWs by adding inexpensive validation to each read as de- scribed in Section 6.3.7. 
This change makes SpecSWs compatible with hardware transactions that can invalidate in-flight SpecSWs and it permits Invyswell to elim- inate the need for an in-flight lock and the per-transaction locks that are required by InvalSTM. 6.3.2 BFHW: Hardware-Software Conflict Detection Invyswell’s second type of transaction is the Bloom filter hardware transaction (BFHW). BFHWs execute in hardware and, like SpecSWs, record the memory loca- 90 Figure 6.4: Speculative Software Transaction (SpecSW). tions they read and write in transaction-specific software Bloom filters. At commit time, if a BFHW sees the software commit lock is free, it increments the hw post commit counter, which subsequently prevents SpecSWs from committing or reading new values while its value is non-zero, and then commits its speculative writes to memory and performs post-commit invalidation on all in-flight SpecSWs, where all conflicting transactions are marked as invalid. The BFHW then decrements the hw post commit counter to indicate its post-commit phase has completed, allowing software transactions to again commit, as shown in Figure 6.5. The hw post commit counter is necessary because there is a window of vulnerability after a BFHW has committed, but before it has finished executing the invalidation phase, when SpecSWs can read inconsistent values written by the BFHW. Without the hw post commit counter these SpecSWs will be marked as invalid by the BFHW during its invalidation phase, but they could still execute momentarily returning inconsistent reads, causing SpecSWs to lose their opacity. 91 Alternatively, if the commit lock is taken when a BFHW enters its commit phase, this means a SpecSW is committing. In this scenario, the simplest option is for the BFHW to abort, because there may be a conflict with the committing SpecSW. How- ever, because BFHWs track their read and write accesses, Invyswell can instead per- form conflict detection between the committing BFHW and the committing SpecSW via Bloom filter set intersection. If an overlap is found, the BFHW is aborted. Oth- erwise, no conflict exists between the BFHW and the SpecSW, and, because their respective read and write sets are immutable during their commit phases, the BFHW is permitted to commit. When a SpecSW commits, it releases the commit lock before clearing its read and write sets. This ensures that if the SpecSW commits before a committing BFHW performs conflict detection against the committing SpecSW, that the BFHW is au- tomatically aborted because the write performed by the SpecSW to the commit lock would trigger a hardware conflict with the BFHW from its prior read. Note that if a BFHW transaction aborts, its hw post commit counter increment never becomes visible, because it is part of its speculative write set. Moreover, the new counter value becomes visible only when the hardware transaction commits. If a SpecSW reads this counter after it has been written to by a BFHW, but before the BFHW has committed, the BFHW will be automatically aborted by Haswell RTM’s strong isolation property, thereby avoiding a race. 6.3.3 LiteHW: Optimizing for Small Transactions Although BFHWs enable the concurrent execution of hardware and software trans- actions, they come with added overhead because each load and store requires an associated Bloom filter insert operation. Invyswell addresses this limitation with its third type of transaction, the LiteHW. 92 Figure 6.5: Bloom Filter Hardware Transaction (BFHW). 
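Following the outline in Figure 6.5, the commit path of a BFHW can be sketched as below. It is called as the final step inside the hardware transaction; the variable names (commit_lock, hw_post_commit, the committing SpecSW's filters) follow the text, while the helper functions and the exact handling of the lock-taken case are simplifying assumptions rather than Invyswell's actual code. The sketch reuses the RTM intrinsics from <immintrin.h> and the bloom_t helpers shown in Chapter 5.

extern volatile int commit_lock;            /* held by a committing SpecSW       */
extern volatile int hw_post_commit;         /* BFHWs still invalidating           */
extern bloom_t committer_read_bf, committer_write_bf;   /* committing SpecSW     */
void invalidate_conflicting_specsws(const bloom_t *writes);   /* assumed helper  */

void bfhw_commit(bloom_t *my_reads, bloom_t *my_writes)
{
    if (commit_lock &&                      /* a SpecSW is committing: check     */
        (bf_intersects(my_reads,  &committer_write_bf) ||    /* for any overlap  */
         bf_intersects(my_writes, &committer_read_bf)  ||
         bf_intersects(my_writes, &committer_write_bf)))
        _xabort(0x02);                      /* potential conflict: abort BFHW    */

    hw_post_commit++;                       /* speculative: visible only at xend */
    _xend();                                /* commit the BFHW's writes          */
    invalidate_conflicting_specsws(my_writes);   /* post-commit invalidation     */
    __atomic_fetch_sub(&hw_post_commit, 1, __ATOMIC_RELEASE);
}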
LiteHWs are lightweight hardware transactions, which execute without read or write annotations. They can only commit if there are no in-flight software transactions when they begin their commit phase. Unfortunately, because LiteHWs do not main- tain read or write set metadata, if a software transaction is in-flight when a LiteHW enters its commit phase, Invyswell must assume a conflict exists between the LiteHW and the software transaction and, therefore, must abort the LiteHW. LiteHWs deter- mine if there is an in-flight software transaction by reading the commit lock and the software transaction counter, sw cnt, prior to committing. Because LiteHWs do not perform conflict detection against software transactions, they require no post-commit phase. 6.3.4 IrrevocSW: Progress Guarantees InvalSTM guarantees forward progress by using transaction-specific priorities that are incremented each time a transaction is aborted. Using this mechanism, a contin- 93 uously aborted transaction will eventually yield the highest priority and is guaran- teed to commit. Invyswell’s BFHWs, however, deviate from this model and instead commit memory changes first and perform invalidation second, at which point all conflicting software transactions are aborted. Because of this change, there is a dan- ger that BFHWs could repeatedly abort high-priority SpecSWs, resulting in their starvation. To address this problem, Invyswell introduces a fourth transaction type, the Ir- revocSW, a direct update irrevocable transaction type that cannot be aborted. To ensure irrevocability, IrrevocSWs acquire the commit lock as soon as they begin their execution and hold it until they have committed. To enable conflict detection with other transactions, an IrrevocSW transactions records its read and write locations in Bloom filters. An IrrevocSW needs no commit phase, because its writes are in-place. Its post-commit phase invalidates conflicting in-flight SpecSWs. While an IrrevocSW is executing, SpecSWs are required to perform validation and are disallowed from committing. Furthermore, LiteHW transactions must abort if their commit phase overlaps with any part of an IrrevocSW’s execution. However, BFHWs can exe- cute concurrently with an IrrevocSW. Yet, to ensure correctness, a BFHW needs to check for conflicts with the IrrevocSW transaction prior to committing its changes to memory and it must abort itself if a conflict is found. 6.3.5 SglSW: Progress Guarantees with Reduced Overhead Small transactions that execute instructions not supported by Haswell’s RTM need to be executed in software. However, both SpecSWs and IrrevocSWs add transactional metadata that may be too expensive for transactions that only access a few memory elements. To address this need, Invyswell employs a final transaction type that uses a single global lock without any associated transactional metadata. 94 This transaction type, SglSW, uses direct update and is irrevocable. SglSW is fast, but it does not allow the concurrent execution of other software transactions. Be- cause SglSW does not track its reads or writes, it cannot perform conflict detection. Instead, it uses a sequence lock to force all in-flight SpecSWs to abort and acquires the commit lock when it begins its execution to prevent IrrevocSWs from starting. BFHW and LiteHW transactions abort if an SglSW is executing when they try to commit. 
However, SglSWs allows for some overlap in execution with BFHWs and LiteHWs, as long as the executing SglSW commits before the hardware transac- tions do, thereby ensuring that the hardware’s strong isolation property aborts any BFHWs and LiteHWs that conflict with the SglSW. 6.3.6 Transitioning Between Transaction Types Transactions are scheduled opportunistically, first as fast, high-risk hardware trans- actions, then as slower, low-risk software transactions as shown in Figure 6.3. Each transaction is first tried in hardware, as LiteHW or BFHW, depending on whether other software transactions are present. If the hardware abort status suggests that a transaction is unlikely to succeed in hardware, then it is retried as a SpecSW. If it fails again, it is either retried as a SpecSW or it is escalated to irrevocable status, preventing it from aborting and ensuring progress. The transitions between the different types are decided automatically at runtime based on a heuristic that is 2 application-independent. 2 Due to limitations in Intel’s first generation HTM (e.g., imprecision on a transaction’s abort status and lim- itations of only four concurrent hardware threads, eight with hyperthreading) Invyswell’s state transitions deviate slightly from that shown in Figure 6.3. In particular, we use a modified design that transitions to SglSW when SpecSWs fail. 95 6.3.7 SpecSW Validation InvalSTM performs invalidation before committing a transaction’s writes to mem- ory. It uses a per-transaction invalid flag which is set to true when a committing transaction invalidates a conflicting in-flight transaction. For reasons described in Section 6.3.1, Invyswell departs from this design and performs invalidation after committing a SpecSW’s writes to memory. Unfortunately, this change makes Inval- STM’s approach to ensure opacity – using the transaction’s invalid flag – insufficient for SpecSWs. Instead, on every new read that is not present in a SpecSW’s write set, Invyswell inserts the new read location into the SpecSW’s Bloom filter and only then is the SpecSW permitted to read the value. This ensures that a potential conflict will not be missed by another transaction’s invalidation phase. Next, the SpecSW performs the validation process shown in Figure 6.6. This validation process is nec- essary because of the interactions SpecSWs can have with different transactions and the inconsistent reads they might cause, as we explain next. SglSW First, a SpecSW read could be inconsistent due to a concurrently executing SglSW. Because SglSWs do not store reads and writes using Bloom filters, conflict detection cannot be performed between them and a SpecSW. Thus, the SpecSW must abort if the commit sequence has changed (Line 1 in Figure 6.6) after it was read at tx begin (Figure 6.2). IrrevocSw Second, a concurrently executing IrrevocSW or a committing SpecSW could cause an inconsistent read. Thus, the SpecSW read must check if the read location is in the Bloom filter of the transaction holding the commit lock (Line 2 in Figure 6.6). If so, it must abort. If commit lock changes during the read validation, the conflict may go unnoticed by the validation code. However, if the lock has changed, it means the transaction that released it must have finished the invalidation 96 phase. Therefore, it is sufficient to check if the SpecSW has been invalidated in the meantime (Line 4 in Figure 6.6). 
BFHW Finally, a SpecSW must wait for all committed BFHWs to finish inval- idation (hw post commit to reach zero) before using a new read value (Line 3 in Figure 6.6). If the SpecSW is not marked as invalid, the read is safe (Line 4 in Figure 6.6). Figure 6.6: Overview of Invyswell’s SpecSW Validation Process. 6.3.8 Contention Manager (CM) SpecSWs consult the CM during the commit phase to acquire permission to commit. As in InvalSTM, the CM considers all in-flight transactions that would be aborted if the committing transaction was allowed to commit. Any CM policy can be used. Invyswell uses iBalanced [29], which makes decisions based on priority, read and write set sizes, and other factors. Invyswell has trade-offs that the original InvalSTM design does not have. For exam- ple, InvalSTM’s ability to make decisions based on complete knowledge of in-flight transactions is lost. Essentially, there is no CM for Invyswell’s hardware transactions because Haswell’s RTM does not support escape actions, and thus a hardware trans- action has to abort all conflicting software transactions after the hardware transac- tion has committed. The side-effect of this approach is that, conceptually, hardware transactions are likely to scale to high thread counts only when there is little to no contention, even if mitigation of that contention could be possible with an intelligent 97 CM. On the other hand, software transactions retain a complete knowledge of the CM decision-making process, enabling them to scale for high thread counts amidst high contention when the contention can be managed to provide wide transactional throughput. 6.4 Correctness Figures 6.2 and 6.3 show the five types of Invyswell transactions and the transitions between them, respectively. In this section, we give an informal explanation why these five transaction types can run concurrently with one another without violating atomicity, as shown in Figure 6.7. However, atomicity by itself does not guarantee that aborted transactions are opaque; that is, that they only observe consistent states, a topic we discuss in Section 6.4.1. Figure 6.7: Invyswell’s Concurrent Execution Matrix. LiteHW and BFHW vs. LiteHW and BFHW Haswell’s hardware transac- tions are strongly isolated, meaning that their changes to memory become visible to other threads only on commit, whether those threads are executing a transaction or not. The hardware automatically detects conflicts between these types of trans- actions, and any conflict will abort at least one transaction. There is no need for additional mechanisms to synchronize concurrently executing LiteHWs and BFHWs with respect to each other. 98 LiteHW vs. Software Transactions LiteHWs can execute concurrently with Invyswell’s software transactions, but they cannot commit while such software trans- actions are executing. A LiteHW that overlaps execution with a software transaction (SpecSW, IrrevocSW, or SglSW) can commit only after the software transaction has committed, otherwise the resulting execution may be not serializable. A LiteHW that tries to commit while a software transaction is executing will abort. Such be- havior is detected by the sw cnt counter and the commit lock (see Figure 6.2). BFHW vs. SpecSW or IrrevocSW Unlike LiteHWs, BFHWs use software Bloom filters to keep track of the memory locations they access. By performing ex- plicit conflict detection with these Bloom filters, BFHWs can commit in the presence of software transactions. 
If a committing SpecSW conflicts with an in-flight BFHW, then the BFHW will automatically be aborted by the hardware when the SpecSW writes its speculative data to memory. If a committing BFHW conflicts with an in-flight SpecSW, the SpecSW will be aborted during the BFHW’s post-commit in- validation phase. Moreover, BFHWs’ use of lazy subscription means it is sufficient to compare the Bloom filters of BFHWs and SpecSWs at the end of the hardware transaction. Postponing conflict detection to the end of the BFHW’s execution narrows the win- dow in which it will be aborted by false conflicts. Moreover, SpecSWs’ Bloom filters do not change while it is committing, so a BFHW can read them without being aborted due to metadata interference (i.e., non-transactional interference). Note that SpecSWs that are doomed to abort after a BFHW invalidates them could read inconsistent memory before they notice they were aborted, generating faulty behav- ior. For this reason, atomicity by itself is not the only TM correctness property that Invyswell guarantees, an issue we discuss in Section 6.4.1. 99 SpecSW vs. SpecSW Conflict detection between multiple SpecSWs uses inval- idation. A committing SpecSW checks for conflicts with other in-flight SpecSWs and, if conflicts are found, the committing SpecSW either aborts itself or invalidates the SpecSWs it conflicts with. No SpecSW can commit during another SpecSW’s invalidation process because the committing SpecSW holds the commit lock. IrrevocSW vs. Software Transactions An IrrevocSW acquires the commit lock as soon as it becomes active, ensuring that no other software transaction can become irrevocable (i.e., other IrrevocSWs and SglSWs cannot start) or commit. When an IrrevocSW commits, it invalidates in-flight conflicting SpecSWs. SglSW vs. Everything When an SglSW begins, it acquires the commit lock and aborts all other concurrently executing transactions. While it holds that lock, SglSWs and IrrevocSWs are prevented from starting, and LiteHWs and BFHWs can- not commit. The SglSW also updates the commit sequence lock at the transaction’s start and end, aborting all concurrently executing SpecSWs and BFHWs. 6.4.1 Opacity and Sandboxing Opacity is a correctness property that ensures that aborted transactions do not observe inconsistent states [32]. The principal challenge to achieving opacity for Invyswell occurs when a hardware transaction and a software transaction conflict. Haswell’s hardware transactions are strongly isolated, but InvalSTM’s software trans- actions are not, so care must be taken when managing their interaction. Invyswell’s initial modification to InvalSTM’s design permits doomed SpecSWs, i.e. SpecSWs that are guaranteed to abort, to observe inconsistent states because com- mitting SpecSWs perform invalidation after writing their changes to memory. To 100 prevent these transactions from observing inconsistent states, Invyswell performs val- idation at commit-time and before each new read as described in Section 6.3.7. Unlike SpecSWs, Invyswell’s IrrevocSWs and SglSWs cannot observe inconsistent states because these transactions are never aborted and are, therefore, never doomed. Finally, Haswell’s shared memory writes executed by a hardware transaction become visible only when the transaction commits, and writes by aborted transactions never become visible. Moreover, Haswell’s transactions are (mostly) sandboxed, meaning that faulty behavior caused by inconsistent reads will cause the transaction to abort. 
6.4.2 Hardware Sandboxing Limitations

For the most part, hardware sandboxing ensures that no consistency violation within a hardware transaction can affect other transactions. There is, however, one vexing "loophole", an unlikely sequence of events in which (1) mutually inconsistent reads cause a spurious memory write, (2) which overwrites an address later used as the target of an indirect jump in that same transaction, (3) thereby causing a jump to a location that happens to contain either an xend (commit transaction) instruction, or immediate data that looks like one. Executing this instruction without the final commit lock check could prematurely commit an inconsistent set of changes.

This hazard, however unlikely, presents a challenge for any HyTM system implemented in an unmanaged language. Broadly speaking, without escape actions, hardware transactions cannot guarantee transactional consistency if they execute concurrently with either in-place-update software transactions or with the commit phase of a deferred-update software transaction.

To address this hazard, Invyswell's hardware transactions check the commit lock before doing an indirect jump through a function pointer. Simple optimizations can reduce the cost of such a policy. For example, there is no need to check the lock if the transaction has an empty write set, because it could not have corrupted the jump address. If a transaction makes multiple indirect jumps, it suffices to check the lock before the first jump, because once read, the commit lock remains in the transaction's read set, and the transaction will be aborted if the lock is changed externally.

In the results presented in Section 6.6, we performed these optimizations by hand. For some benchmarks, we found that early checking slightly improved performance, probably because transactions with indirect jumps are often longer, hence less likely to succeed in hardware, and more likely to benefit from a quicker fallback to software. In the long term, compiler support is likely to help with this issue.

The danger posed by indirect jumps in transactions is similar to the danger posed by common security threats, such as buffer overflows, in general-purpose programs. The security literature has many examples of compiler techniques to protect jump addresses, such as moving vtables and return addresses into a separate memory space [7] marked as read-only. The latest GCC also supports security functionality that checks vtable integrity.

Static validity checking for function pointers is difficult in general, but feasible for common special cases, such as initializers. GCC uses devirtualization and inlining of the most likely target of an indirect pointer at optimization levels -O2 and higher. When inlining is possible, GCC can make indirect jumps direct. A transactional compiler could be more aggressive about eliminating or protecting indirect jumps.
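As a concrete illustration of the guard just described, a hardware transaction's indirect calls can be wrapped roughly as follows. This is a sketch under stated assumptions: the commit lock follows the text, while the HwTx structure, its wrote_anything field, and the wrapper are hypothetical names introduced for the example; the intrinsics require a compiler with RTM support (for instance, GCC with -mrtm).

```cpp
#include <immintrin.h>   // RTM intrinsics; compile with -mrtm

// Stand-ins for Invyswell's metadata and per-transaction state.
volatile long commit_lock = 0;           // non-zero while a software commit is in progress
struct HwTx { bool wrote_anything; };    // maintained by the instrumented write barrier

// Guard an indirect call made inside a hardware transaction (Section 6.4.2 hazard).
// If this transaction has performed no writes, it cannot have corrupted the pointer,
// so the lock check is skipped entirely. Otherwise, if a software commit is in
// progress, abort rather than risk jumping to a stray xend; reading the lock also
// places it in the transaction's read set, so any later change aborts us automatically.
inline void guarded_indirect_call(const HwTx& tx, void (*fn)(void*), void* arg) {
    if (tx.wrote_anything && commit_lock != 0)
        _xabort(0xFF);                   // fall back instead of risking a premature commit
    fn(arg);
}
```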
6.5 Optimizations

In this section, we describe the modifications that we made to Invyswell's original design to improve its performance. We found these optimizations to be effective for the first-generation Intel Haswell RTM processor; however, some optimizations are designed specifically for performance at low thread counts (as indicated by the * below) and may degrade performance as thread counts increase. As a result, when Intel's RTM scales to higher thread counts, these "low thread count" changes should be eliminated.

Hardware Transactions Hardware transactions are retried with exponential backoff. Before starting a hardware transaction, the commit lock and the software transaction counter, sw_cnt, are read non-transactionally to increase the likelihood of finding these data cached, and to optimize for the case when only hardware transactions are active.

Validation Consider two SpecSWs, T_A and T_B. Assume that T_A has entered its commit phase and T_B is about to validate a read. Furthermore, assume that T_B has higher priority than T_A and that they conflict with one another. When T_B performs its validation, it could notice that T_A has acquired the commit lock and abort because of the conflict it identifies. At the same time, T_A could consult the CM and abort because T_B has a higher priority, resulting in both transactions aborting because of each other. A similar situation could also occur between a committing SpecSW and a committing BFHW.

To avoid such scenarios, we introduce two global flags, hw_check and sw_check, in addition to the commit lock, to distinguish the stages of a SpecSW's commit phase. At the highest level, these flags are used to ensure that SpecSWs and BFHWs are only aborted by a SpecSW that is guaranteed to commit. These flags change the SpecSW and BFHW commit process in the following way.

At commit, a SpecSW, called T_C, acquires the commit lock and then consults the CM to receive permission to commit. If permitted to commit, T_C sets hw_check = true to signal to BFHWs that it is committing its writes to memory. With this approach, BFHWs read the hw_check flag at commit time instead of the commit lock, which ensures that a BFHW can only be aborted by a SpecSW that will eventually commit; had it read the commit lock, a BFHW could be aborted by a SpecSW that has merely started its commit phase but may yet be aborted by the CM.

Next, T_C waits for the hw_post_commit counter to reach zero and, once it has, it checks whether it was invalidated by a concurrently committing BFHW. If still valid, T_C sets sw_check = true, which informs other SpecSWs about to read new memory to perform conflict detection against T_C's Bloom filters. At this point, T_C and many concurrently reading SpecSWs may perform simultaneous conflict detection on each other. If conflicts are found, the reading SpecSWs are aborted. If no conflicts are found between reading SpecSWs and T_C, the reading SpecSWs subsequently check their valid flag to ensure they were not invalidated by T_C, which may have performed conflict detection before the reading SpecSWs did, and subsequently cleared its Bloom filters before the reading SpecSWs could identify conflicts with them. Any reading SpecSWs that are still valid are permitted to continue their execution. Without the sw_check flag, the scenario of conflicting transactions T_A and T_B might occur. With it, a reading SpecSW's validation can only fail if it conflicts with a concurrently executing SpecSW that is guaranteed to commit.
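The ordering of these commit-phase steps can be summarized in code. The sketch below is one reading of the protocol under stated assumptions: the flag names follow the text, but the helper functions, the point at which write-back happens relative to sw_check, and the representation of the commit lock as a simple boolean are simplifications invented for the example, not Invyswell's implementation.

```cpp
#include <atomic>
#include <thread>

// Flags and counters named after the ones introduced in this section.
std::atomic<bool> commit_lock{false};   // simplified; the real lock is a commit sequence lock
std::atomic<bool> hw_check{false};      // a SpecSW that will commit is writing back
std::atomic<bool> sw_check{false};      // its Bloom filters are ready for readers to check
std::atomic<int>  hw_post_commit{0};    // committed BFHWs still running invalidation

struct SpecSW {
    std::atomic<bool> valid{true};      // cleared when another transaction invalidates us
    // read/write Bloom filters and the write set are omitted from the sketch
};

// Placeholder helpers standing in for the real machinery.
bool cm_permits_commit(SpecSW&) { return true; }         // iBalanced decision [29]
void write_back_and_invalidate(SpecSW&) { /* publish writes, invalidate conflicting SpecSWs */ }

bool specsw_commit(SpecSW& tc) {
    while (commit_lock.exchange(true, std::memory_order_acquire))   // 1. acquire the commit lock
        std::this_thread::yield();
    if (!cm_permits_commit(tc)) {                                    // 2. consult the CM
        commit_lock.store(false, std::memory_order_release);
        return false;                                                //    abort quietly: no flag was raised
    }
    hw_check.store(true, std::memory_order_release);                 // 3. BFHWs may now abort on conflict with us
    while (hw_post_commit.load(std::memory_order_acquire) != 0)      // 4. wait out BFHW post-commit invalidation
        std::this_thread::yield();
    if (!tc.valid.load(std::memory_order_acquire)) {                 // 5. a committing BFHW invalidated us
        hw_check.store(false, std::memory_order_release);
        commit_lock.store(false, std::memory_order_release);
        return false;
    }
    sw_check.store(true, std::memory_order_release);                 // 6. reading SpecSWs now check our filters
    write_back_and_invalidate(tc);                                   // 7. publish writes, invalidate conflicting SpecSWs
    sw_check.store(false, std::memory_order_release);
    hw_check.store(false, std::memory_order_release);
    commit_lock.store(false, std::memory_order_release);
    return true;
}
```

With this ordering, a BFHW or a reading SpecSW observes hw_check or sw_check only after the CM has already granted permission, which is exactly the guarantee the flags exist to provide.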
* Bloom Filters In principle, BFHWs and IrrevocSWs enable more concurrency than LiteHWs or SglSWs; in practice, however, the overhead associated with BFHWs' and IrrevocSWs' Bloom filters can negate their concurrency benefits. This is especially true at low thread counts, where there is not enough concurrency to justify such overhead. Because of this, we use SglSWs, rather than IrrevocSWs, as the fallback from SpecSWs for our experiments (see Figure 6.3), as SglSWs do not employ Bloom filters. However, once RTM becomes available with higher core counts, we plan to reinstate IrrevocSWs as the fallback for SpecSWs because they enable SpecSWs to execute alongside them, while SglSWs do not.

To reduce the overhead of BFHWs, we optimize away their read set Bloom filters. This optimization is possible because BFHWs only invalidate SpecSWs – SpecSWs never invalidate BFHWs – thereby requiring only write-write and write-read conflict detection for BFHWs' invalidation phase (BFHWs can be aborted by other hardware transactions, but that is handled automatically by the hardware). However, this change prohibits BFHWs and SpecSWs from committing concurrently, which the original Invyswell design permitted. For low thread counts, we have found that this change only improves performance. Yet, for higher thread counts, it will likely degrade performance and, therefore, it would be advisable to revert to Invyswell's original Bloom filter design.

* Fail-Fast When there is contention, many SpecSWs will repeatedly abort before reaching their retry threshold and falling back to SglSWs. The amount of wasted work that this process can incur could be substantial if contention is consistent, or even bursty, throughout the entire benchmark. To address this, we add a counter that counts the number of high-priority software transactions aborted during the invalidation phase. Whenever a thread notices that this number is over a threshold, it increments a racy shared counter. Once this counter reaches a pre-defined threshold, our optimized system switches to Fail-Fast mode, which uses only LiteHWs and SglSWs (see the sketch at the end of this section). We have found this optimization to be effective because it identifies the cases in which STMs are wasting work on too many retries that eventually fall back to irrevocable mode anyway. In these cases, we have found it is better to use irrevocable software transactions immediately.

Read-Only We employed optimizations for both read-only SpecSWs and BFHWs. Read-only SpecSWs can commit when they reach the commit phase without acquiring the commit lock, even if they were invalidated. First, the validation process in the read annotations ensures that the transaction's read set was consistent at the time of the last read. Second, read-only SpecSWs, as well as read-only BFHWs, do not need to perform invalidation, as they can be serialized before conflicting in-flight software transactions.
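The Fail-Fast switch described above can be sketched as follows. The structure follows the text (a per-abort counter, a deliberately racy vote counter, and a global mode flag), but the threshold values and all names are assumptions made for the example; the vote counter is a relaxed atomic here only to keep the sketch free of undefined behavior, since lost updates merely delay the switch.

```cpp
#include <atomic>

// Thresholds are illustrative; the thesis does not give the exact values.
constexpr int kLocalAbortThreshold = 16;   // high-priority aborts observed before a thread "votes"
constexpr int kGlobalVoteThreshold = 4;    // votes required before the whole system switches

std::atomic<int>  high_priority_aborts{0}; // bumped when invalidation kills a high-priority SpecSW
std::atomic<int>  fail_fast_votes{0};      // described as racy in the text; relaxed atomic here
std::atomic<bool> fail_fast_mode{false};   // once set, only LiteHWs and SglSWs are used

// Called by a committing transaction after it invalidates a high-priority SpecSW.
void note_high_priority_abort() {
    high_priority_aborts.fetch_add(1, std::memory_order_relaxed);
}

// Called by any thread when it is about to choose a transaction type.
bool should_fail_fast() {
    if (fail_fast_mode.load(std::memory_order_relaxed))
        return true;
    if (high_priority_aborts.load(std::memory_order_relaxed) > kLocalAbortThreshold &&
        fail_fast_votes.fetch_add(1, std::memory_order_relaxed) + 1 >= kGlobalVoteThreshold)
        fail_fast_mode.store(true, std::memory_order_relaxed);
    return fail_fast_mode.load(std::memory_order_relaxed);
}
```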
6.6 Evaluation

Our experimental results were gathered on an Intel Haswell four-core processor (Core i7-4770) with RTM and HLE support, running at 3.40GHz. Each core has a 32KB L1 cache, and the machine has 8GB of RAM shared across all cores. We enabled hyperthreading to collect data for up to eight threads. Because of L1 cache sharing due to hyperthreading, we noticed that at eight threads some hardware transactions that previously executed without failure began to abort due to overflow, thereby degrading performance. We used the GCC 4.8 compiler with -O3 optimizations for all benchmarks.

We used the STAMP benchmark suite [15] to measure the speedup that Invyswell provides relative to sequential execution. We compare this speedup against NOrec, which we call NorecSTM, Hybrid NOrec, which we abbreviate as NorecHy, and Haswell's HLE. For each of these systems, we executed each STAMP benchmark five times and present the median result, as shown in Figure 6.8. Variance was generally low, except for Bayes.

Invyswell Details We instrumented the STAMP code using its macros to use a thread-local transaction type indicator for choosing which code path to execute. This instrumentation incurs a run-time performance penalty. A compiler could generate different code paths for these transaction types, but it would not need to generate a code path for each type. In particular, LiteHW and SglSW have similar read/write annotations, as do BFHW and IrrevocSW. Moreover, the overhead incurred by manual instrumentation is higher than the overhead incurred by compiler instrumentation.

Hardware transactions are retried N times, where N = 10 for our experiments, unless the abort status indicates that the transaction is unlikely to succeed in hardware, in which case the transaction is immediately retried in software (a sketch of this policy appears below). SpecSWs are retried M times, where M = 4, and use SglSW as a fallback if the number of retries is exceeded. Invyswell was configured to use 1024-bit Bloom filters hashed with the SpookyHash function [52]. Outside of the normal Bloom filter trade-off of precision versus size, there is an additional trade-off for Invyswell's hardware transactions between the filters' precision and the aborts they cause by overflow: the larger the Bloom filter, the better its precision, but the more likely a hardware transaction using it will abort due to cache overflow, because the Bloom filter must be part of the hardware transaction's speculative state, stored, in this case, in Haswell's L1D cache. We found 1024 bits to be a good balance across all benchmarks. For example, the Yada benchmark emits many Bloom filter false positives and makes this trade-off apparent. Increasing the Bloom filters' size improves SpecSW performance but degrades BFHW performance, as it causes more aborts.

Figure 6.8: Speedup on STAMP Benchmarks (8 threads use hyperthreading); panels (a)-(j) show Bayes, Genome, Intruder, Kmeans Low, Kmeans High, Labyrinth, Ssca2, Vacation Low, Vacation High, and Yada.

Figure 6.9: Invyswell Transaction Types: 1-threaded execution.

Figure 6.10: Invyswell Transaction Types: 8-threaded execution.
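The hardware retry policy just described, combined with the exponential backoff and the non-transactional pre-read of the commit lock and sw_cnt from Section 6.5, can be sketched as follows. Interpreting "unlikely to succeed in hardware" as a capacity overflow or an abort without the retry hint is our assumption, as are the helper's name and the stand-in global variables; the intrinsics require RTM support (for instance, GCC 4.8 with -mrtm).

```cpp
#include <immintrin.h>   // _xbegin/_xend, _XABORT_* status bits, _mm_pause
#include <cstdint>

constexpr int kHwRetries = 10;        // N in the text
volatile long commit_lock = 0;        // stand-ins for Invyswell's real metadata
volatile long sw_cnt      = 0;

// Returns true with a hardware transaction open (the caller runs the body and then
// calls _xend()), or false if the caller should fall back to a software transaction.
bool begin_hardware_tx() {
    // Touch the metadata non-transactionally so it is likely cached when the
    // transaction starts and the hardware-only fast path stays cheap.
    long warm = commit_lock + sw_cnt;
    (void)warm;
    unsigned backoff = 1;
    for (int attempt = 0; attempt < kHwRetries; ++attempt) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED)
            return true;                                   // now speculating
        // Give up immediately if the abort status says hardware is unlikely to
        // ever succeed: capacity overflow, or an abort without the retry hint.
        if ((status & _XABORT_CAPACITY) || !(status & _XABORT_RETRY))
            break;
        for (unsigned i = 0; i < backoff; ++i)             // exponential backoff
            _mm_pause();
        backoff <<= 1;
    }
    return false;
}
```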
Hybrid NOrec and Invyswell Hybrid NOrec has many variants, many of which require nonspeculative loads; the variant in [77] requires both nonspeculative loads and nonspeculative stores. These variants cannot be implemented using TSX and are not considered in this thesis. The version of Hybrid NOrec evaluated in this chapter uses the two-location variant and the sw_exists filter described in [17]. Our implementation includes all the optimizations used in [17]. In addition, we tried a variant of the algorithm in which hardware transactions lazily subscribe to the software commit lock, using the same indirect-jump annotations that we used for Invyswell; this version performed similarly to Hybrid NOrec's normal eager subscription, so we omit those results for clarity.

Hybrid NOrec has two types of transactions, hardware and software, and both types can execute at the same time. To ensure that hardware transactions do not see inconsistent memory states, they eagerly subscribe to the software transactions' commit lock as soon as they begin execution; when a software transaction begins its commit phase, hardware transactions are automatically aborted. When a hardware transaction commits, it increments a shared counter, which notifies software transactions that they must perform value-based validation to ensure consistency. To perform validation, each software transaction maintains its own list of read memory locations. To reduce insertion overhead, each new read element is appended directly to the list's tail, even if the item is already in the list, resulting in O(1) insert time. A disadvantage of this approach is that the read list can become large if a software transaction reads many locations, increasing the time it takes to perform validation, during which the entire list must be walked. Each software transaction performs validation in O(N) time, where N is the size of the read set, for every new read added after a software or hardware transaction has committed (a sketch of this validation scheme appears at the end of this comparison).

In contrast to Hybrid NOrec, Invyswell has two hardware transaction types and three software transaction types, and it performs conflict detection using Bloom filters, not lists, which record the memory accessed by both hardware transactions (BFHW) and software transactions (SpecSW and IrrevocSW). With Bloom filters, Invyswell's conflict detection is performed in O(1) time; yet, because Invyswell uses invalidation, it has additional overhead that Hybrid NOrec does not have, as invalidation is performed after a transaction's speculative writes are committed to memory.

Invyswell's LiteHWs are similar to Hybrid NOrec's hardware transactions, but Invyswell's BFHWs have no Hybrid NOrec counterpart. Although BFHWs incur overhead not found in Hybrid NOrec's hardware transactions – storing read and write set data in Bloom filters – this overhead is amortized over large transactions because of the finer-grained conflict detection it enables. The improved precision of conflict detection enables greater transactional throughput between hardware and software transactions when they do not conflict (e.g., the benchmark in Figure 6.8f).

If Invyswell did not include BFHWs, nearly all of Labyrinth's transactions would execute as software transactions, because Invyswell's LiteHWs often get aborted by the long-running software transactions. With BFHWs, however, hardware and speculative software transactions (SpecSWs) can execute concurrently and both types can commit, as there are not many conflicts. Hybrid NOrec's hardware transactions do not incur Bloom filter overhead; instead, the overhead falls on its software transactions, which must do value-based validation, re-validating the entire read set after each transactional commit. Because 50% of the transactions in Labyrinth cannot succeed in hardware, the performance of both HyTMs is similar to that of the NOrec STM.

Another important difference between Invyswell and Hybrid NOrec is how fast software transactions execute at different transaction sizes. Invyswell's SpecSW transactions, which are similar to InvalSTM's transactions, are fast for large transactions, while NOrec's software transactions are fast for small transactions without many reads to re-validate. Yet, because Haswell's RTM can successfully execute most smaller transactions (those without unsupported instructions), we believe SpecSWs are the natural choice as a fallback mechanism for hardware transactions.
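As a concrete companion to Hybrid NOrec's read-log validation described at the start of this comparison, the sketch below shows the O(1) append and the O(N) value-based re-validation. It is an illustration only: it omits the synchronization with the global sequence lock that the real algorithm [17, 18] performs around validation, and the type and method names are invented for the example.

```cpp
#include <vector>
#include <cstdint>

// Sketch of a Hybrid NOrec style read log (not Invyswell's mechanism).
struct ReadLog {
    struct Entry { const std::uintptr_t* addr; std::uintptr_t value; };
    std::vector<Entry> entries;

    // Called from the transactional read barrier: O(1) append, even if the
    // location is already logged.
    std::uintptr_t logged_read(const std::uintptr_t* addr) {
        std::uintptr_t v = *addr;     // the real system reads under a consistent snapshot
        entries.push_back({addr, v});
        return v;
    }

    // Called after another transaction commits (the shared counter changed): the
    // transaction is still consistent only if every logged value is unchanged,
    // which costs O(N) in the size of the read set.
    bool validate() const {
        for (const Entry& e : entries)
            if (*e.addr != e.value) return false;
        return true;
    }
};
```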
Despite SpecSWs being the natural fallback, there is an interesting effect that occurs in the presence of hyperthreading, where hardware transactions overflow at smaller sizes than they would without hyperthreading because of cache sharing between the two hyperthreads on the same core. For example, in Genome (Figure 6.8b), at eight threads about 50% of hardware transactions spill to software, for both HyTMs, because of overflow. Because of this, Hybrid NOrec performs better than Invyswell for Genome at eight threads. However, we believe this is an artifact of hyperthreading, as Invyswell is notably faster than Hybrid NOrec for Genome at four threads, where significantly fewer hardware transactions spill to software. With this in mind, we expect Invyswell to perform better as HTMs scale in core count, as only large transactions will overflow the cache, so Invyswell's SpecSWs will be used only in the cases for which they were intended.

NOrec and HyTMs STMs typically scale at higher thread counts, but often perform poorly at low thread counts, especially for small and mid-sized transactions. NOrec, referred to as NorecSTM in our figures, like any STM, incurs instrumentation overhead that limits performance for small (Ssca2, Kmeans) and mid-sized (Intruder, Vacation, Genome) transactions. For such benchmarks, Invyswell can outperform NOrec by a factor of 3.5× (Figure 6.8g). Hybrid NOrec also outperforms NOrec on these benchmarks, indicating that a hybrid is preferable to an STM alone. However, Invyswell can be twice as fast as Hybrid NOrec (Figure 6.8c) because of its more lightweight SglSWs, for which Hybrid NOrec has no software equivalent.

As expected, NOrec performs best for benchmarks with longer transactions and bigger read and write sets, such as Bayes, Labyrinth and Yada (Figures 6.8a, 6.8f, and 6.8j, respectively). Hybrid NOrec closely approaches NOrec's speedup, as most of the benefit in these cases comes from the software transactions. In Figure 6.8a, NOrec is 2.1× faster than sequential execution, while Invyswell is 1.6× faster. For completeness, we included results for Bayes, but its high variance suggests that these results should be interpreted with caution [78].

Labyrinth (Figure 6.8f) has long transactions, where the first portion of the transaction manipulates non-shared memory. For this benchmark, 50% of the transactions cannot complete in hardware, so HLE's performance degrades to that of a lock. In contrast, NOrec yields high throughput because it enables concurrency between its transactions. Because Haswell does not support non-transactional loads and stores, all local operations performed inside a transaction are also transactional, putting pressure on the cache. Therefore, both Hybrid NOrec and Invyswell are negatively affected, resulting in performance similar to NOrec's.

Hardware Lock Elision (HLE) HLE is implemented entirely in hardware and has no instrumentation overhead, but it uses a non-scalable single-global-lock fallback when transactions fail. For large benchmarks, such as Bayes or Labyrinth, Invyswell outperforms HLE by a notable margin even at small thread counts. This is because many transactions overflow the cache and fall back to software, where they are serialized by HLE's lock. For medium-sized benchmarks, Invyswell also outperforms HLE. However, for small transactions, HLE benefits most from the lack of overhead, so it is faster than Invyswell on benchmarks such as Kmeans Low and Kmeans High. Ssca2 is also a benchmark with small transactions, but Invyswell and HLE perform similarly.
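For reference, the kind of elided critical section HLE provides looks roughly like the following single-lock sketch, using the HLE hints that GCC 4.8 exposes through its atomic builtins (built with -mhle on a TSX-capable part). This is a generic illustration of the mechanism, not the lock used by the STAMP harness; on hardware without HLE the hints are simply ignored and the code behaves as an ordinary test-and-set lock.

```cpp
#include <immintrin.h>   // _mm_pause; build with GCC 4.8+ and -mhle

static int hle_lock = 0;

// The XACQUIRE-prefixed exchange lets the processor first run the critical section
// as a hardware transaction; only if that speculation aborts is the section
// re-executed while actually holding the lock.
inline void hle_acquire() {
    while (__atomic_exchange_n(&hle_lock, 1,
                               __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE)) {
        // Spin until the lock looks free, then retry elision.
        while (__atomic_load_n(&hle_lock, __ATOMIC_RELAXED))
            _mm_pause();
    }
}

inline void hle_release() {
    __atomic_store_n(&hle_lock, 0, __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
}
```

The hard-wired fallback to this single lock is precisely the behavior that limits HLE on large or overflow-prone transactions, as the results above show.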
Figure 6.11 shows the percentage of committed hardware transactions at one thread and at four threads for both Invyswell and HLE. The one-threaded execution indicates, in general, the percentage of transactions that fail in hardware because of unsupported instructions or overflow. This provides a baseline for the maximum number of hardware transaction commits that are possible for each benchmark. We also found that the number of HLE hardware transactions that begin is higher than the total number of committed transactions. This suggests that HLE also retries failed transactions before falling back to its global lock.

Figure 6.11: Percentage of Committed Hardware Transactions.

Invyswell's percentage of committed hardware transactions at four threads is similar to its percentage at one thread, and it is higher than HLE's percentage at four threads. This suggests that Invyswell generally makes more efficient use of the hardware's transactional resources than the hardware's own mechanism (i.e., HLE) does. Figures 6.9 and 6.10 show the breakdown of Invyswell's transaction types for one-thread and eight-thread executions. The eight-threaded execution suffers from the effects of hyperthreading, so the number of hardware transactions successfully committed is lower than in the one-thread execution.

Overall, Invyswell outperforms HLE. For Yada, however, HLE is faster than Invyswell despite using fewer hardware transactions. This benchmark has large transactions and high contention, causing many conflicts between transactions. In this case, Invyswell suffers from many false positives in its Bloom filter set intersection. We noticed an increase in performance for SpecSWs as we increased the size of the Bloom filters. However, as we previously explained, larger Bloom filters negatively impact BFHWs. Therefore, the size of the Bloom filters represents a trade-off that balances the performance of SpecSWs and BFHWs.

Discussion In general, Invyswell outperforms prior methods across all STAMP benchmarks. Not only does Invyswell outperform HLE for all but the smallest transactions, it is inherently more flexible, because the programmer has explicit control over CM and failover policies. Although Invyswell is adapted from the earlier InvalSTM design, the existence of hardware transactions that bypass the CM means that the two systems are divergent in terms of design and behavior. Hardware transactions can fail for a variety of reasons, including resource exhaustion, timing anomalies, or illegal instructions. For future work, there is a need for a better adaptive CM that identifies when a particular approach is not working well and when to switch to a more effective alternative.

6.7 Summary

In this chapter, we described Invyswell, a HyTM that combines Haswell's RTM transactions with software transactions from a heavily modified version of InvalSTM. We evaluated Invyswell on a 3.4 GHz 4-core Haswell processor capable of supporting up to eight hardware threads and compared it to Haswell's native hardware lock elision (HLE), a state-of-the-art STM (NOrec), and a state-of-the-art HyTM (Hybrid NOrec).

Our main goals with Invyswell were (i) to improve performance for small- to medium-sized transactions, configurations where the instrumentation costs of STMs typically cause them to perform poorly, and (ii) to extend InvalSTM's design to support the concurrent execution of both hardware and software transactions.
We found that very small transactions are handled well by a simple combination of hardware transactions with fallback to a single global lock. The most interesting challenges were (i) modifying InvalSTM to provide some degree of precision in its conflict detection between concurrently executing hardware and software transactions and (ii) improving performance for mid-size transactions, those that are small enough to benefit from hardware transactions but too large to work well with a single global lock.

We evaluated a variety of transactional mechanisms, both hardware and software, on a range of STAMP benchmarks. As one might expect for such heterogeneous benchmarks, no single mechanism was best for every benchmark, but overall, Invyswell outperformed prior methods by more than 18%.

Haswell supports hardware lock elision (HLE), which allows an annotated critical section to be first executed speculatively as a hardware transaction and then, if that transaction fails, to be re-executed non-speculatively using the original lock. HLE already provides some of the functionality of HyTM, so it is natural to ask whether Haswell needs HyTM at all. We find that HyTM is indeed needed: on average, Invyswell is about 25% faster than HLE across all benchmarks. Moreover, for benchmarks with large transactions, such as Bayes and Labyrinth, HLE does not scale and is 2×-5.4× slower than Invyswell. The principal reason HLE does not eliminate the need for HyTM is that HyTM allows for better contention management. HLE follows a hard-wired policy of falling back to a lock after failure, but HyTM can make more intelligent and flexible decisions about resolving conflicts, taking advantage of software-based transactions, and making more effective transitions between speculative and various non-speculative synchronization mechanisms.

We tested alternative software mechanisms that trade overhead for precision. Conflict detection can be coarse and fast (SglSW) or more precise and slower (IrrevocSW and SpecSW). In the thread-count range supported by our platform, coarse-and-fast usually slightly outperforms precise-and-slower. We conjecture that precise conflict detection will become more attractive on future hardware platforms with more cores, where Invyswell is likely to perform well.

Any HyTM faces the challenge of providing opacity, which ensures that all transactions observe only consistent states. This is more difficult than it may seem, because the composition of two opaque mechanisms (for example, Haswell's RTM and InvalSTM) is not necessarily opaque. RTM's lack of escape actions complicated our task. Escape actions could make it substantially easier to ensure opacity and to provide more effective conflict management. For example, a hardware transaction could invalidate software transactions during its commit phase, rather than after it, allowing it, in some cases, to abort itself to improve overall throughput, as was the case in InvalSTM's original design.

Our experience suggests that hybrid mechanisms can improve the performance of small to mid-size transactions that can execute in hardware, compared to software-only or hardware lock-elision mechanisms. We conjecture that this difference will become even more pronounced when Haswell platforms with more cores become available.

Chapter 7

Conclusion

Computer architecture design has reached a "power wall", marking the end of CPU frequency scaling.
To further improve performance in this context, new hardware platforms are now increasingly focusing on leveraging more parallelism. These new architectures are continually increasing the number of cores, becoming more heterogeneous and offering new instructions in support of parallelism. However, they are also becoming harder to program. Parallel and concurrent programming have become necessities in this highly parallel environment, but our current abstractions are not up to the challenge. Locking is still the most widely used synchronization paradigm, but it either fails to deliver acceptable performance and scalability beyond a small number of cores or it comes at a very high cost in terms of development effort and expertise.

In this thesis, we proposed new techniques to simplify writing efficient parallel code that leverage the architectural features of these emerging systems. We focused on two commercially available platforms: NUMA architectures with hundreds of cores and Haswell processors with support for hardware transactional memory.

We described various abstractions that have been proposed in the concurrent computing community, such as delegation, elimination, combining and transactional memory, and we showed how to use and integrate these abstractions to design scalable concurrent algorithms. We designed, implemented and evaluated a NUMA-aware concurrent stack and a scalable concurrent priority queue using these abstractions. Our designs achieve significant performance benefits compared to prior work.

Moreover, we proposed improved algorithms for transactional memory. We presented new fallback algorithms for best-effort hardware transactional memory that outperform state-of-the-art software, hardware and hybrid solutions. First, we described Lazy Single Global Lock fallback (L-SGL), which uses an optimized single global lock as the software fallback. Second, we described Invyswell, a new Hybrid Transactional Memory solution based on a modified version of InvalSTM. Our experience suggests that hybrid mechanisms can improve the performance of small to mid-size transactions, in situations where the number of threads fits in hardware, compared to software-only or hardware lock-elision mechanisms. We conjecture that this improvement will become even more pronounced when Haswell platforms with more cores become available, although the trade-offs among the various hybrid mechanisms are likely to change as platforms scale.

As hardware changes and improves to provide more parallelism potential, we need better software mechanisms to leverage these new features. The methods we discussed are a step in the direction of scalable concurrent software design, but more abstractions are needed to design highly scalable programs and to eliminate the necessity of specializing code for particular architectures. Moreover, software needs to anticipate and inform hardware developments, because only a tight collaboration between hardware and software can achieve the performance and scalability desired by developers.

Bibliography

[1] Ali-Reza Adl-Tabatabai, Christos Kozyrakis, and Bratin Saha. Transactional programming in a multi-core environment. In Katherine A. Yelick and John M. Mellor-Crummey, editors, PPoPP, page 272. ACM, 2007.

[2] Yehuda Afek, Michael Hakimi, and Adam Morrison. Fast and scalable rendezvousing. In Proceedings of the 25th international conference on Distributed computing, DISC'11, pages 16–31, Berlin, Heidelberg, 2011. Springer-Verlag.

[3] Rassul Ayani.
Lr-algorithm: concurrent operations on priority queues. In Pro- ceedings of the Second IEEE Symposium on Parallel and Distributed Processing, SPDP 1990, Dallas, Texas, USA, December 9-13, 1990., pages 22–25, 1990. [4] R. Bayer and M. Schkolnick. Readings in database systems. chapter Concur- rency of Operations on B-trees, pages 129–139. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988. [5] Sergey Blagodurov, Sergey Zhuravlev, Mohammad Dashti, and Alexandra Fe- dorova. A case for numa-aware contention management on multicore systems. In Proceedings of the 2011 USENIX conference on USENIX annual technical conference, USENIXATC’11, pages 1–1, Berkeley, CA, USA, 2011. USENIX Association. [6] Silas Boyd-Wickizer, Haibo Chen, Rong Chen, Yandong Mao, Frans Kaashoek, Robert Morris, Aleksey Pesterev, Lex Stein, Ming Wu, Yuehua Dai, Yang Zhang, and Zheng Zhang. Corey: an operating system for many cores. In Proceedings of the 8th USENIX conference on Operating systems design and implementation, OSDI’08, pages 43–57, Berkeley, CA, USA, 2008. USENIX As- sociation. [7] Pete Broadwell, Matt Harren, and Naveen Sastry. Scrash: A system for generat- ing secure crash information. In Proceedings of the 12th Conference on USENIX Security Symposium - Volume 12, SSYM’03, pages 19–19, Berkeley, CA, USA, 2003. USENIX Association. [8] Harold W. Cain, Maged M. Michael, Brad Frey, Cathy May, Derek Williams, and Hung Le. Robust architectural support for transactional memory in the power architecture. In Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA ’13, pages 225–236, New York, NY, USA, 2013. ACM. 119 120 [9] Irina Calciu, Dave Dice, Tim Harris, Maurice Herlihy, Alex Kogan, Virendra J. Marathe, and Mark Moir. Message passing or shared memory: Evaluating the delegation abstraction for multicores. In OPODIS, pages 83–97, 2013. [10] Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit. Numa-aware reader-writer locks. In Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP ’13, pages 157–166, New York, NY, USA, 2013. ACM. [11] Irina Calciu, Justin Gottschlich, and Maurice Herlihy. Using delegation and elimination to implement a scalable numa-friendly stack. In 5th USENIX Work- shop on Hot Topics in Parallelism, 2013. [12] Irina Calciu, Justin Gottschlich, Tatiana Shpeisman, Gilles Pokam, and Mau- rice Herlihy. Invyswell: a hybrid transactional memory for haswell’s restricted transactional memory. In International Conference on Parallel Architectures and Compilation, PACT ’14, Edmonton, AB, Canada, August 24-27, 2014, pages 187–200, 2014. [13] Irina Calciu, Hammurabi Mendes, and Maurice Herlihy. The adaptive priority queue with elimination and combining. In Distributed Computing - 28th In- ternational Symposium, DISC 2014, Austin, TX, USA, October 12-15, 2014. Proceedings, pages 406–420, 2014. [14] Irina Calciu, Tatiana Shpeisman, Gilles Pokam, and Maurice Herlihy. Improved single global lock fallback for best-effort hardware transactional memory. In 9th ACM Sigplan Workshop on Transactional Computing, TRANSACT ’14, Salt Lake City, UT, USA, March 2, 2014. [15] Chi Cao Minh, JaeWoong Chung, Christos Kozyrakis, and Kunle Olukotun. Stamp: Stanford transactional applications for multi-processing. In IISWC ’08: Proceedings of The IEEE International Symposium on Workload Characteriza- tion, September 2008. [16] C. Click. Azul’s experiences with hardware transactional memory. 
HP Labs’ Bay Area Workshop on Transactional Memory, August 2007. [17] Luke Dalessandro, Fran¸cois Carouge, Sean White, Yossi Lev, Mark Moir, Michael L. Scott, and Michael F. Spear. Hybrid norec: a case study in the effectiveness of best effort hardware transactional memory. In Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems, ASPLOS XVI, pages 39–52, New York, NY, USA, 2011. ACM. [18] Luke Dalessandro, Michael F. Spear, and Michael L. Scott. Norec: streamlining stm by abolishing ownership records. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’10, pages 67–78, New York, NY, USA, 2010. ACM. [19] Peter Damron, Alexandra Fedorova, Yossi Lev, Victor Luchangco, Mark Moir, and Daniel Nussbaum. Hybrid transactional memory. In Proceedings of the 12th international conference on Architectural support for programming languages and operating systems, ASPLOS XII, pages 336–346, New York, NY, USA, 2006. ACM. 121 [20] Tudor David, Rachid Guerraoui, and Vasileios Trigonakis. Everything you al- ways wanted to know about synchronization but were afraid to ask. In Proceed- ings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP ’13, pages 33–48, New York, NY, USA, 2013. ACM. [21] Tudor David, Rachid Guerraoui, and Vasileios Trigonakis. Asynchronized con- currency: The secret to scaling concurrent search data structures. In Proceed- ings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’15, pages 631–644, New York, NY, USA, 2015. ACM. [22] Dave Dice, Yossi Lev, Mark Moir, and Daniel Nussbaum. Early experience with a commercial hardware transactional memory implementation. In Proceedings of the 14th international conference on Architectural support for programming languages and operating systems, ASPLOS XIV, pages 157–168, New York, NY, USA, 2009. ACM. [23] David Dice, Virendra J. Marathe, and Nir Shavit. Lock cohorting: a general technique for designing numa locks. In Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming, PPoPP ’12, pages 247–256, New York, NY, USA, 2012. ACM. [24] David Dice, Mark Moir, and Nir Shavit. Sun Microsystems: Transactional memory. [25] David Dice, Ori Shalev, and N. Shavit. Transactional locking II. In Proc. of the 20th International Symposium on Distributed Computing (DISC 2006), pages 194–208, 2006. [26] Aleksandar Dragojevi´c, Pascal Felber, Vincent Gramoli, and Rachid Guerraoui. Why stm can be more than a research toy. Commun. ACM, 54(4):70–77, April 2011. [27] Panagiota Fatourou and Nikolaos D. Kallimanis. Revisiting the combining syn- chronization technique. In Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming, PPoPP ’12, pages 257–266, New York, NY, USA, 2012. ACM. [28] Justin E. Gottschlich and Daniel A. Connors. Extending contention managers for user-defined priority-based transactions. In Proceedings of the 2008 Work- shop on Exploiting Parallelism with Transactional Memory and other Hardware Assisted Methods, Apr 2008. [29] Justin E. Gottschlich, Maurice P. Herlihy, Gilles A. Pokam, and Jeremy G. Siek. Visualizing transactional memory. In Proceedings of the 21st international con- ference on Parallel architectures and compilation techniques, PACT ’12, pages 159–170, New York, NY, USA, 2012. ACM. [30] Justin E. 
Gottschlich, Manish Vachharajani, and Jeremy G. Siek. An efficient software transactional memory using commit-time invalidation. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), April 2010. 122 [31] Rachid Guerraoui, Maurice Herlihy, and Bastian Pochon. Polymorphic con- tention management. In DISC. LNCS, 2005. [32] Rachid Guerraoui and Michal Kapalka. On the correctness of transactional memory. In Proceedings of the ACM SIGPLAN Symposium on Principles and practice of parallel programming, pages 175–184, New York, NY, USA, 2008. ACM. [33] Rachid Guerraoui and Michal Kapalka. Principles of Transactional Memory. Morgan and Claypool, 2010. [34] Leo J. Guibas and Robert Sedgewick. A dichromatic framework for balanced trees. In Proceedings of the 19th Annual Symposium on Foundations of Com- puter Science, SFCS ’78, pages 8–21, Washington, DC, USA, 1978. IEEE Com- puter Society. [35] Lance Hammond, Brian D. Carlstrom, Vicky Wong, Ben Hertzberg, Mike Chen, Christos Kozyrakis, and Kunle Olukotun. Programming with transactional co- herence and consistency (TCC). j-SIGPLAN, 39(11):1–13, November 2004. [36] Tim Harris, James R. Larus, and Ravi Rajwar. Transactional Memory, Second Edition. Morgan and Claypool, 2010. [37] D. Hendler and N. Shavit. Work dealing. In Proc. of the Fourteenth ACM Symposium on Parallel Algorithms and Architectures, pages 164–172, 2002. [38] Danny Hendler, Itai Incze, Nir Shavit, and Moran Tzafrir. Flat combining and the synchronization-parallelism tradeoff. In Proceedings of the 22nd ACM symposium on Parallelism in algorithms and architectures, SPAA ’10, pages 355–364, New York, NY, USA, 2010. ACM. [39] Danny Hendler, Nir Shavit, and Lena Yerushalmi. A scalable lock-free stack algorithm. J. Parallel Distrib. Comput., 70(1):1–12, January 2010. [40] Maurice Herlihy. A methodology for implementing highly concurrent data ob- jects. ACM Transactions on Programming Languages and Systems, 15(5):745– 770, November 1993. [41] Maurice Herlihy, Victor Luchangco, and Mark Moir. Obstruction-free synchro- nization: Double-ended queues as an example. In Proceedings of the 23rd In- ternational Conference on Distributed Computing Systems, ICDCS ’03, pages 522–, Washington, DC, USA, 2003. IEEE Computer Society. [42] Maurice Herlihy, Victor Luchangco, Mark Moir, and William N. Scherer, III. Software transactional memory for dynamic-sized data structures. In Proceed- ings of the twenty-second annual symposium on Principles of distributed com- puting, PODC ’03, pages 92–101, New York, NY, USA, 2003. ACM. [43] Maurice Herlihy and J. Eliot B. Moss. Transactional memory: Architectural support for lock-free data structures. In Proceedings of the International Sym- posium on Computer Architecture. May 1993. [44] Maurice Herlihy and Nir Shavit. The Art of Multiprocessor Programming. El- sevier, Inc., 2008. 123 [45] Maurice P. Herlihy and Jeannette M. Wing. Linearizability: A correctness condition for concurrent objects. ACM Trans. Program. Lang. Syst., 12(3):463– 492, July 1990. [46] Q. Huang. An evaluation of concurrent priority queue algorithms. Technical report, Cambridge, MA, USA, 1991. [47] Galen Hunt, M. Michael, S. Parthasarathy, and M. Scott. An efficient algorithm for concurrent priority queue heaps. Information Processing Letters, 60(3):151 – 157, 1996. [48] Intel Corporation. Hardware lock elision in Haswell. 
Re- trieved from http://software.intel.com/sites/products/ documentation/doclib/stdxe/2013/composerxe/compiler/cpp-win/ GUID-A462FBC8-37F2-490F-A68B-2FFA8010DEBC.htm. [49] Intel Corporation. Transactional Synchronization in Haswell. Retrieved from http://software.intel.com/en-us/blogs/2012/02/07/ transactional-synchronization-in-haswell/, 8 September 2012. [50] Amos Israeli and Lihu Rappoport. Efficient wait-free implementation of a con- current priority queue. In Proceedings of the 7th International Workshop on Dis- tributed Algorithms, WDAG ’93, pages 1–17, London, UK, UK, 1993. Springer- Verlag. [51] Christian Jacobi, Timothy Slegel, and Dan Greiner. Transactional memory ar- chitecture and implementation for ibm system z. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, MI- CRO ’12, pages 25–36, Washington, DC, USA, 2012. IEEE Computer Society. [52] Jenkins, B. SpookyHash: a 128-bit non-cryptographic hash (2010). Retrieved from http://burtleburtle.net/bob/hash/spooky.html, 25 June 2014. [53] J. L. W. Kessels. On-the-fly optimization of data structures. Commun. ACM, 26(11):895–901, November 1983. [54] Gokcen Kestor, Roberto Gioiosa, Tim Harris, Osman S. Unsal, Adri´an Cristal, Ibrahim Hur, and Mateo Valero. Stm2: A parallel stm for high performance simultaneous multithreading systems. In Lawrence Rauchwerger and Vivek Sarkar, editors, PACT, pages 221–231. IEEE Computer Society, 2011. [55] Donald E. Knuth. The Art of Computer Programming, Volume 2 (3rd Ed.): Seminumerical Algorithms. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1997. [56] Sanjeev Kumar, Michael Chu, Christopher J. Hughes, Partha Kundu, and An- thony Nguyen. Hybrid transactional memory. In Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP ’06, pages 209–220, New York, NY, USA, 2006. ACM. [57] H. T. Kung and Philip L. Lehman. Concurrent manipulation of binary search trees. ACM Trans. Database Syst., 5(3):354–382, September 1980. [58] Yosef Lev, Mark Moir, and Dan Nussbaum. PhTM: Phased transactional mem- ory. In TRANSACT ’07: 2nd Workshop on Transactional Computing, aug 2007. 124 [59] I. Lotan and N. Shavit. Skiplist-based concurrent priority queues. In Proc. of the 14th International Parallel and Distributed Processing Symposium (IPDPS), pages 263–268, 2000. [60] Udi Manber. On maintaining dynamic information in a concurrent environment. SIAM J. Comput., 15(4):1130–1142, November 1986. [61] Virendra Jayant Marathe and Mark Moir. Toward high performance nonblock- ing software transactional memory. In PPoPP ’08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, pages 227–236, New York, NY, USA, 2008. ACM. [62] Alex Matveev and Nir Shavit. Reduced hardware transactions: A new approach to hybrid transactional memory. In Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), July 2013. [63] Alexander Matveev and Nir Shavit. Reduced hardware norec: A safe and scal- able hybrid transactional memory. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’15, pages 59–71, New York, NY, USA, 2015. ACM. [64] Zviad Metreveli, Nickolai Zeldovich, and M. Frans Kaashoek. Cphash: a cache- partitioned hash table. 
In Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming, PPoPP ’12, pages 319–320, New York, NY, USA, 2012. ACM. [65] Maged M. Michael and Michael L. Scott. Nonblocking algorithms and preemption-safe locking on multiprogrammed shared memory multiprocessors. J. Parallel Distrib. Comput., 51(1):1–26, May 1998. [66] Mark Moir. Hybrid transactional memory. Jul 2005. [67] Mark Moir, Daniel Nussbaum, Ori Shalev, and Nir Shavit. Using elimination to implement scalable and lock-free fifo queues. In Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures, SPAA ’05, pages 253–262, New York, NY, USA, 2005. ACM. [68] Mark Moir and Nir Shavit. Concurrent data structures. In Dinesh P. Mehta and Sartaj Sahni, editors, Handbook of data structures and applications. Chapman and Hall/CRC Press, 2007. [69] K. E. Moore, J. Bobba, M. J. Moravan, M. D. Hill, and D. A. Wood. Logtm: log-based transactional memory. In High-Performance Computer Architecture, 2006. The Twelfth International Symposium on, pages 254–265, 2006. [70] Michelle J. Moravan, Jayaram Bobba, Kevin E. Moore, Luke Yen, Mark D. Hill, Ben Liblit, Michael M. Swift, and David A. Wood. Supporting nested transactional memory in logtm. In Proceedings of the 12th international confer- ence on Architectural support for programming languages and operating systems, ASPLOS XII, pages 359–370, New York, NY, USA, 2006. ACM. [71] Adam Morrison and Yehuda Afek. Fast concurrent queues for x86 processors. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’13, pages 103–112, New York, NY, USA, 2013. ACM. 125 [72] O. Nurmi, E. Soisalon-Soininen, and D. Wood. Concurrency control in database structures with relaxed balance. In Proceedings of the Sixth ACM SIGACT- SIGMOD-SIGART Symposium on Principles of Database Systems, PODS ’87, pages 170–176, New York, NY, USA, 1987. ACM. [73] Otto Nurmi and Eljas Soisalon-Soininen. Uncoupling updating and rebalancing in chromatic binary search trees. In Proceedings of the Tenth ACM SIGACT- SIGMOD-SIGART Symposium on Principles of Database Systems, PODS ’91, pages 192–198, New York, NY, USA, 1991. ACM. [74] Y. Oyama, K. Taura, and A. Yonezawa. Executing parallel programs with synchronization bottlenecks efficiently, 1999. [75] Ravi Rajwar and James R. Goodman. Speculative lock elision: enabling highly concurrent multithreaded execution. In MICRO, pages 294–305. ACM/IEEE, 2001. [76] Torvald Riegel, Christof Fetzer, and Pascal Felber. Time-based transactional memory with scalable time bases. In Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, SPAA ’07, pages 221–228, New York, NY, USA, 2007. ACM. [77] Torvald Riegel, Patrick Marlier, Martin Nowack, Pascal Felber, and Christof Fetzer. Optimizing hybrid transactional memory: the importance of nonspec- ulative operations. In Proceedings of the 23rd ACM symposium on Parallelism in algorithms and architectures, SPAA ’11, pages 53–64, New York, NY, USA, 2011. ACM. [78] Wenjia Ruan, Yujie Liu, and Michael Spear. Stamp need not be considered harmful. In Ninth ACM SIGPLAN Workshop on Transactional Computing. Mar 2014. [79] Bratin Saha, Ali-Reza Adl-Tabatabai, Richard L. Hudson, Chi Cao Minh, and Ben Hertzberg. McRT-STM: a high performance software transactional memory system for a multi-core runtime. In PPOPP. ACM SIGPLAN 2006, March 2006. [80] William N. Scherer and Michael L. Scott. 
Advanced contention management for dynamic software transactional memory. In Marcos Kawazoe Aguilera and James Aspnes, editors, PODC, pages 240–248. ACM, 2005. [81] William N. Scherer, III, Doug Lea, and Michael L. Scott. Scalable synchronous queues. Commun. ACM, 52(5):100–111, May 2009. [82] Nir Shavit. Combining funnels: a dynamic approach to software combining. Journal of Parallel and Distributed Computing, page 2000, 2000. [83] Nir Shavit. Data structures in the multicore age. Commun. ACM, 54(3):76–84, March 2011. [84] Nir Shavit and Dan Touitou. Software transactional memory. In Proceedings of the Principles of Distributed Computing. Aug 1995. [85] Michael F. Spear, Luke Dalessandro, Virendra Marathe, and Michael L. Scott. A comprehensive strategy for contention management in software transactional memory. In PPoPP, February 2009. 126 [86] M. Aater Suleman, Onur Mutlu, Moinuddin K. Qureshi, and Yale N. Patt. Accelerating critical section execution with asymmetric multi-core architectures. In Proceedings of the 14th international conference on Architectural support for programming languages and operating systems, ASPLOS XIV, pages 253–264, New York, NY, USA, 2009. ACM. [87] H. Sundell and P. Tsigas. Fast and lock-free concurrent priority queues for multi- thread systems. In IEEE International Symposium on Parallel and Distributed Processing, page 11 pp., april 2003. [88] R. Kent Treiber. Systems programming: Coping with parallelism. Technical report, IBM Almaden Research Center, 2006. [89] Amy Wang, Matthew Gaudet, Peng Wu, Jos´e Nelson Amaral, Martin Ohmacht, Christopher Barton, Raul Silvera, and Maged Michael. Evaluation of Blue Gene/Q hardware support for transactional memories. In Proceedings of the 21st international conference on Parallel architectures and compilation techniques, PACT ’12, pages 127–136, New York, NY, USA, 2012. ACM. [90] Richard M. Yoo, Christopher J. Hughes, Konrad Lai, and Ravi Rajwar. Per- formance evaluation of intel transactional synchronization extensions for high- performance computing. In Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’13, pages 19:1–19:11, New York, NY, USA, 2013. ACM.