Abstract of "Theory and Applications of Parallelism with Futures" by Zhiyu Liu, Ph.D., Brown University, May 2017.

Futures are an attractive way to structure parallel computations. When a thread creates an expression with the keyword future, a new thread is spawned to compute that expression in parallel. When a thread later applies a touch operation to that future, it gets the result of the expression if the result has been computed, and otherwise blocks until the result becomes ready. In this thesis, we explore different aspects of parallel programs with futures, including their theoretical bounds, scheduling, and applications.

Researchers have shown that futures can have a deleterious effect on cache locality. We will show, however, that if futures are used in a simple, disciplined way, then their negative impact can be greatly alleviated. This structured use of futures is characteristic of many parallel applications.

Futures lend themselves well to dynamic scheduling algorithms, such as work stealing, that ensure high processor utilization. Implementing work stealing on hierarchical platforms, such as NUMA systems and distributed clusters, has recently drawn a great deal of attention. However, little is known about its theoretical bounds. We present lower and upper bounds for work stealing on fork-join programs, a well-studied subclass of future-parallel programs, on hierarchical systems.

As originally conceived, a future encapsulates a functional computation without side-effects. Recently, however, futures have been proposed as a way to encapsulate method calls to shared data structures. We propose a new program model that supports both normal futures without side-effects and linearizable futures that exist for their side-effects. Using this model, we propose the lazy work stealing scheduler, which facilitates certain optimizations for linearizable futures and guarantees good time bounds.

The processing-in-memory (PIM) model has recently reemerged as a way to alleviate the growing speed discrepancy between CPU computation and memory access, commonly known as the memory wall. In this model, lightweight computing units are attached directly to the main memory, providing fast memory access. We study applications of linearizable futures in the PIM model: operation requests to concurrent data structures are sent as linearizable futures to those computing units to execute. These PIM-managed data structures can outperform state-of-the-art concurrent data structures in the literature.

Theory and Applications of Parallelism with Futures, by Zhiyu Liu. B.E., Beijing University of Posts and Telecommunications, 2010; Sc.M., Dartmouth College, 2012. A dissertation submitted in partial fulfillment of the requirements for the Degree of Doctor of Philosophy in the Department of Computer Science at Brown University. Providence, Rhode Island, May 2017.

© Copyright 2017 by Zhiyu Liu

This dissertation by Zhiyu Liu is accepted in its present form by the Department of Computer Science as satisfying the dissertation requirement for the degree of Doctor of Philosophy. Date, Maurice Herlihy, Director. Recommended to the Graduate Council: Date, Rodrigo Fonseca, Reader, Brown University; Date, Eli Upfal, Reader, Brown University. Approved by the Graduate Council: Date, Andrew G. Campbell, Dean of the Graduate School.

Vita

Education • Ph.D. in Computer Science, Brown University, September 2012 – 2017. Advisor: Maurice Herlihy • M.S.
in Computer Science, Dartmouth College, September 2010 – June 2012 Advisor: Prasad Jayanti Thesis: Abortable Reader-Writer Locks Are No More Complex Than Abortable Mutex Locks • B.Eng. in Computer Science and Technology, Beijing University of Posts and Telecommunications, September 2006 – July 2010 Work Experience • Research Intern in System Algorithms Research Group at Microsoft Research Asia, May 2015 – August 2015 Mentor: Thomas Moscibroda Project: Automatic and efficient scheduling for dependent cloud tasks on Microsoft Azure • Research Intern at VMware Research Group, June 2016 – August 2016 Mentor: Irina Calciu Project: Concurrent data structures in the processing-in-memory model Publications • PIM-managed Concurrent Data Structures Zhiyu Liu, Irina Calciu, Maurice Herlihy, and Onur Mutlu ACM Symposium on Parallelism in Algorithms and Architectures (SPAA) 2017 iv • Well-Structured Futures and Cache Locality Maurice Herlihy and Zhiyu Liu ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP) 2014 (Best Paper Award) • Approximate Local Sums and Their Applications in Radio Networks Zhiyu Liu and Maurice Herlihy International Symposium on DIStributed Computing (DISC) 2014 • Abortable Reader-Writer Locks Are No More Complex Than Abortable Mutex Locks Prasad Jayanti and Zhiyu Liu International Symposium on DIStributed Computing (DISC) 2012 Manuscripts Related to Thesis • Theoretical Analysis of Work Stealing on Hierarchical Platforms Zhiyu Liu and Maurice Herlihy • Work Stealing for Linearizable Futures Zhiyu Liu and Maurice Herlihy v Acknowledgements First, I would like to thank my advisor, Maurice Herlihy, who has been a perfect mentor throughout the five years of my PhD life. He gave me everything I needed in research: guidance, support, inspiration, vision, and freedom. I was incredibly lucky to be supervised by such a world-class researcher. I would also like to thank my thesis committee members, Rodrigo Fonseca and Eli Upfal, for their support and all the nice discussions. I wish to thank Irina Calciu and Thomas Moscibroda, my mentors during my internships at VMware and Microsoft Research, for the wonderful experiences in the two summers. I want to thank all my colleague students I have worked with: Archita Agarwal, Esha Ghosh, Eli Rossenthal, Vikram Saraph, and Hammurabi Mendes. I wish we could have even more collaborations in the future. I would thank all other faculty members and students in our department. One of the most enjoyable things to me at Brown was to attend their talks and lectures. It is a great honor for me to know and learn from them. I also want to thank all the administrative and technical staff in our department for all their support over the five years. In particular, my special thanks go to Lauren Clarke and Eugenia DeGouveia who helped me countless times during the process of scheduling my proposal and defense. Finally, I thank all my family and friends, in the U.S. and back in China, for everything they have done for me. vi Contents List of Tables ix List of Figures x 1 Introduction 1 1.1 Background and Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Overview of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2.1 Well-Structured Futures and Cache Locality . . . . . . . . . . . . . . . . . . . 3 1.2.2 Theoretical Analysis of Work Stealing on Hierarchical Platforms . . . . . . . 3 1.2.3 Work Stealing for Linearizable Futures . . . . . . . . . . . . . . . . . . . . . . 
4 1.2.4 Concurrent Data Structures for Near-Memory Computing . . . . . . . . . . . 4 2 Well-Structured Futures and Cache Locality 2.1 6 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1.1 7 Computation DAG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 Work-Stealing and Cache Locality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3 Structured Computations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.4 Structured Single-Touch Computations . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.4.1 Future Thread First at Each Fork . . . . . . . . . . . . . . . . . . . . . . . . 13 2.4.2 Parent Thread First at Each Fork . . . . . . . . . . . . . . . . . . . . . . . . 18 Other Kinds of Structured Computations . . . . . . . . . . . . . . . . . . . . . . . . 22 2.5.1 Structured Local-Touch Computations . . . . . . . . . . . . . . . . . . . . . . 23 2.5.2 Structured Computations with Super Final Nodes . . . . . . . . . . . . . . . 24 2.5 3 Theoretical Analysis of Work Stealing on Hierarchical Platforms 26 3.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.2 Fork-Join Model on Hierarchical Systems . . . . . . . . . . . . . . . . . . . . . . . . 27 3.2.1 Fork-Join Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.2.2 S-Bounded Fork-Join Programs . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.2.3 Hierarchical System Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Work Stealing on Hierarchical Systems . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.3 vii 3.4 3.5 3.3.1 The Class of Work Stealing Algorithms . . . . . . . . . . . . . . . . . . . . . 32 3.3.2 Global Work Stealing Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 33 Bounds for Fork-Join Programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.4.1 Lower Bound for All Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.4.2 Upper bounds of Two Work Stealing Algorithms . . . . . . . . . . . . . . . . 37 Bounds for s-Bounded Programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.5.1 Lower Bound for Work Stealing Algorithms . . . . . . . . . . . . . . . . . . . 39 3.5.2 Unbalanced Work Stealing and Its Upper Bound . . . . . . . . . . . . . . . . 42 4 Work Stealing for Linearizable Futures 44 4.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.2 Linearizable-Futures Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.2.1 Normal Futures and Linearizable Futures . . . . . . . . . . . . . . . . . . . . 45 4.2.2 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.2.3 Computation DAG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.2.4 Combining and Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.3 Lazy Work Stealing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.4 Performance Analysis of Lazy Work Stealing . . . . . . . . . . . . . . . . . . . . . . 53 5 Concurrent Data Structures for Near-Memory Computing 65 5.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 5.2 Hardware Architecture and Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 5.3 Performance Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . 67 5.4 Low Contention Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 5.4.1 Linked-lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 5.4.2 Skip-lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 High Contention Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 5.5.1 FIFO queues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 5.5.2 Pipelining and Performance analysis . . . . . . . . . . . . . . . . . . . . . . . 77 5.5 Bibliography 81 viii List of Tables 5.1 Throughputs of linked-list algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . 70 5.2 Throughputs of skip-list algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 ix List of Figures 2.1 Node and thread terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 The interesting part of the bound is Ω(CtT∞ ). Figure 5 in [73] shows a DAG, as a 8 building block of a worst-case computation, that can incur Ω(T∞ ) deviations because of one touch. We can replace it with the DAG in Figure 2.2, which can incur Ω(CT∞ ) additional cache misses due to one touch v (if the processor at a fork always chooses the parent thread to execute first), so that the worst-case computation in [73] can incur Ω(CtT∞ ) additional cache misses because of t such touches. This DAG is similar to the DAG in Figure 2.7(a) in this paper. The proof of Theorem 10 shows how a parallel execution of this DAG incurs Ω(CT∞ ) additional cache misses. . . . . 2.3 10 A simplified version of the DAG in [73] that can incur high cache overhead. Here, v1 and v2 are touches. Suppose a processor p1 executes the root node, pushes the right child x of the root node into its deque, and then falls asleep. Now another processor p2 steals x from p1 ’s deque and executes the subgraph rooted at x. Thus, v1 and v2 will be checked (to see if they are available) even before the corresponding future threads are spawned at u1 and u2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4 11 In this structured (single-touch) computation, the touches v1 and v2 will not be checked until their corresponding future threads have been spawned at u1 and u2 , respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.5 Two examples illustrating single-touch computations are more flexible than fork-join computations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.6 11 Figure (c) shows a DAG on which work stealing can incur 2 Ω(P T∞ ) 2 Ω(P T∞ ) 13 deviations and additional cache misses. It uses the DAGs in (a) and (b) as building blocks. 16 2.7 DAGs used by Figure 2.8 as building blocks. . . . . . . . . . . . . . . . . . . . . . . . 19 2.8 A DAG on which work stealing can incur Ω(tT∞ ) deviations and Ω(CtT∞ ) if it chooses parents threads to execute first at forks. This example uses the DAGs in Figure 2.7 as building blocks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1 22 The graph on the left hand side shows a thread in the dashed box spawned by its parent thread at a fork. The graph on the right hand side shows the join of a thread in the dashed box and its parent thread. . . . . . . . . . . . . . . . . . . . . . . . . . x 28 3.2 A fork-join DAG, where the main thread is the rightmost directed path in the DAG. 3.3 The DAG of Fibonacci(7) of Algorithm 1. 
The main thread (i.e., the rightmost one) 29 spawns a child thread x to compute Fibonacci(6) and then continues to compute Fibonacci(5) itself. Thread x spawns a child thread y of its own to compute Fibonacci(5) and then computes Fibonacci(4) itself. According to the pseudocode, Fibonacci(5) and Fibonacci(4) will be executed sequentially. Later thread x sums up Fibonacci(4) and Fibonacci(5) after the join of x and y, and finally the main thread sums up Fibonacci(5) and Fibonacci(6) after the join of x and itself. . . . . . . . . . . . . . . 3.4 30 A DAG where a single remote steal can incur a sequence of remote join operations. Suppose the processor executing the main thread always execute the child thread first after a fork. At each of the first three joins of the main thread, the processor spawns and quickly completes a child thread, and then gets back to the main thread. It finally arrives at the last fork, spawns a child thread and is about to execute v, while u is pushed into its deque. Now a remote processor steals u and start executing the main thread. If v is executed before u, the remote processor will have to make a remote join operation at each of the four joins in the main thread, in order to retrieve the results of the four child threads. We can imagine if there are a sequence of Θ(T∞ ) joins, Θ(T∞ ) has to be made. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.5 The graph on the left presents a diamond structure and the DAG on the right consists of a sequence of four diamonds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.6 35 36 A DAG G(T1 , T∞ ), where the subgraph on the left branch of the source node is a sequence of diamonds, and the subgraph on the right branch of the source node is an arbitrary DAG. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.7 A hexagon where each thread has s = 4 sequential nodes between the last level of forks and the first level of joins. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.8 37 40 The first two hexagons in a series of hexagons. The s sequential nodes of a randomly chosen thread in the first hexagon (on the left-hand side) is replaced by the second hexagon. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.1 Pseudocode for computing the sum of the third and the fifth Fibonacci numbers using futures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2 41 45 A main thread t spawns two linearizable futures f1 and f2 , both on object o, at forks v1 and v2 respectively. v3 and v4 are touches of f1 and f2 respectively. By definition, (v1 , v5 ) and (v2 , v7 ) are future edges. (v6 , v4 ) and (v8 , v3 ) are touch edges. (v6 , v7 ) is an order edge. All the other edges are continuation edges. . . . . . . . . . . . . . . . 4.3 48 The DAG on the left shows an “original” execution of the program, where each linearizable future fi is executed solely, for any 0 ≤ i ≤ 3. The DAG on the right shows the DAG of an execution, where f1 and f2 are grouped. . . . . . . . . . . . . . . . . xi 50 4.4 An example illustrating how lazy work stealing works when a processor P is executing node v in thread t. First, f3 is added to the linked list of ready linearizable futures on object o created by thread t. Since f3 is the last future on o before the first touch of the ready futures on o (i.e., the touch of f3 ), a pointer node to the linked list is pushed into the bottom of P ’s deque. . . . . . . . . . . 
. . . . . . . . . . . . . . . . . 4.5 52 A well-formed program obtained by modifying the program in Figure 4.4. The only modification is moving the touch of f1 to a node before the fork of f4 , so that f4 is created after all f1 , f2 , and f3 have be touched. . . . . . . . . . . . . . . . . . . . . . 4.6 54 The DAG on the left is a path βi in G. g1 , g2 , and g3 are linearizable futures on object o contained in βi , and the other segments of βi are all in the main thread and normal futures (for simplicity, we assume there is no linearizable futures on other objects). The DAG on the right is part of G0 , where the dotted paths are αo and α0 . . . . . . Q01 58 4.7 Q1 and in the proof of Theorem 38 . . . . . . . . . . . . . . . . . . . . . . . . . . 62 5.1 The PIM model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 5.2 Experimental results of linked-lists. We evaluated the linked-list with Fine-grained locks and the flat-combining linked-list (FC) with and without the combining optimization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5.3 A PIM-managed FIFO queue with three partitions . . . . . . . . . . . . . . . . . . . 72 5.4 Experimental results of skip-lists. We evaluated the lock-free skip-list and the flatcombining skip-list (FC) with different numbers (1, 4, 8, 16) of partitions. . . . . . . 73 5.5 A PIM-managed FIFO queue with three segments . . . . . . . . . . . . . . . . . . . 75 5.6 (a) illustrates the pipelining optimization, where a PIM core can start executing a new deq() (step 1 of deq() for the CPU on the left), without waiting for the dequeued node of the previous deq() to return to the CPU on the right (step 3). (b) shows the timeline of pipelining four deq() requests. . . . . . . . . . . . . . . . . . . . . . . . . xii 79 Chapter 1 Introduction 1.1 Background and Motivations Futures [41, 42] are an attractive way to structure many parallel programs. When a thread creates an expression with a keyword future, a new thread is spawned to compute that expression in parallel with the thread that created it. When a thread later applies a touch operation to that future, it gets the result of the expression if the result has been computed, and otherwise blocks until the result becomes ready. Futures were first proposed by Halstead [41, 42] and have been well studied since then (e.g., [55, 8, 58, 31, 16, 7, 20, 77, 32, 1, 73]), sometimes under different names. Futures lend themselves well to sophisticated dynamic scheduling algorithms, such as work stealing [20] and its variations, that ensure high processor utilization and hence high execution speedup. Arora et al. [7] proved that (parsimonious) work stealing achieves the asymptotically optimal speedup for the parallel execution of a program in the future-parallel model. As originally conceived, a future encapsulates a short-lived functional computation that has no side-effects, so the order in which futures are execute cannot be observed. Recently, Kogan and Herlihy [53] proposed an alternative approach, in which futures encapsulate method calls to long-lived shared data structures, facilitating common optimizations such as combining [37, 39, 46] and elimination [45, 63, 72]. Since operations on shared objects are typically executed for their sideeffects, they also proposed several variations of linearizability [49] to constrain how the computations of futures with side-effects can be interleaved. 
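To make the future/touch vocabulary concrete, the sketch below (not taken from the thesis) approximates it with Java's standard CompletableFuture, in the spirit of the Fibonacci example of Figure 4.1: supplyAsync plays the role of creating a future, and join plays the role of a touch. A future-parallel runtime such as Cilk with futures would express the same structure with its own keywords; the class and method names here are purely illustrative.

    import java.util.concurrent.CompletableFuture;

    public class FuturesSketch {
        // A deliberately naive sequential Fibonacci, standing in for the expression a future evaluates.
        static long fib(int n) {
            return n < 2 ? n : fib(n - 1) + fib(n - 2);
        }

        public static void main(String[] args) {
            // "future fib(3)" and "future fib(5)": each call spawns an asynchronous task,
            // conceptually the new thread created by the future keyword.
            CompletableFuture<Long> f3 = CompletableFuture.supplyAsync(() -> fib(3));
            CompletableFuture<Long> f5 = CompletableFuture.supplyAsync(() -> fib(5));

            // The creating thread keeps running and later "touches" each future;
            // join() blocks until the corresponding result is ready.
            long sum = f3.join() + f5.join();
            System.out.println("fib(3) + fib(5) = " + sum); // prints 2 + 5 = 7
        }
    }

Linearizable futures differ only in what the deferred expression does: instead of a pure computation, it is a method call on a shared object, executed later for its side-effects.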
Despite the rich literature on future-parallel programs, we find the following interesting problems unexplored.

• The asymptotically optimal execution time of programs with futures [7] is proved without considering cache performance. However, modern multicore architectures employ complex multi-level memory hierarchies, and technology trends are increasing the relative performance differences among the various levels of memory. As a result, processor utilization can no longer be the sole figure of merit for schedulers, and the cache locality of a parallel execution will become increasingly critical to overall performance. Several researchers [1, 73] have shown, however, that introducing parallelism through the use of futures can sometimes substantially reduce cache locality. In the worst case, if we add futures to a sequential program, a parallel execution managed by a work stealing scheduler can incur Ω(P T∞ + tT∞) deviations, which implies Ω(CP T∞ + CtT∞) more cache misses than the sequential execution. Here, C is the number of cache lines, P is the number of processors, t is the number of touch operations, and T∞ is the computation's span (or critical path). As technology trends cause the cost of cache misses to increase, this additional cost is troubling. Therefore, we believe it will be very interesting to study new ways of using futures, such as constructing parallel programs in structured forms or designing cache-friendly schedulers for parallel executions, to improve the cache performance of programs with futures.

• As hierarchical platforms, such as NUMA systems and distributed clusters, are becoming more and more prevalent, implementing efficient work stealing algorithms for parallel programs on such hierarchical systems has drawn a lot of attention recently. Different techniques have been proposed to improve the performance of work stealing algorithms on hierarchical systems. For example, researchers have proposed various heuristic strategies for deciding when a processor should make a remote steal and how much work it should steal (e.g., [62, 70, 65, 76]). Others have presented work stealing variants based on message passing [29, 71, 78, 2, 57], as opposed to the traditional implementations based on concurrent data structures. Despite the rich research on practical implementations and empirical analysis of work stealing on hierarchical systems, however, little has been explored with respect to its theoretical bounds. Quintin and Wagner [70] gave a theoretical bound on the time complexity of their hierarchical work stealing (HWS) algorithm, in which the leader processors of clusters first distribute tasks across a system and then the worker processors in each cluster execute their local tasks. However, without good knowledge of a program, the workloads distributed to different clusters can be highly unbalanced in theory, and hence the upper bound of HWS cannot help us figure out the performance bounds of work stealing.

• As we mentioned earlier, Kogan and Herlihy [53] proposed a new type of futures, which we call linearizable futures, to encapsulate method calls to long-lived shared data structures. Their original proposal left open the problem of how best to schedule programs that employ linearizable futures with side-effects. The idea behind linearizable futures is to delay the execution of data structures' method calls, so that certain threads can batch them and apply various optimizations to them, in order to improve overall throughput.
Therefore, another interesting problem is to design new data structures on certain platforms that can utilize linearizable futures well to achieve good performance.

1.2 Overview of Contributions

This thesis makes four contributions, as summarized in Sections 1.2.1–1.2.4 below.

1.2.1 Well-Structured Futures and Cache Locality

In Chapter 2, we will study the cache performance of future-parallel programs and show that if futures are used in a simple, disciplined way, then the situation with respect to cache locality is much better: if each future is touched only once, either by the thread that created it, or by a thread to which the future has been passed directly or indirectly from the thread that created it, then parallel executions with work stealing can incur at most O(CP T∞²) additional cache misses, a substantial improvement over the unstructured case. This result provides a simple way to identify computations for which introducing futures will not incur a high cost in cache locality, as well as providing guidelines for the design of future-parallel computations. (Informally, we think these guidelines are natural, and correspond to structures programmers are likely to use anyway.) We also prove that this upper bound is tight within a factor of C. Our second result is the observation that when a work stealing scheduler has a choice between running the thread that created a future and the thread that implements the future, running the future thread first provides better cache locality. Finally, we show that certain variations of structured computation also have good cache locality.

1.2.2 Theoretical Analysis of Work Stealing on Hierarchical Platforms

In Chapter 3, we will present lower bounds for all work stealing algorithms as well as upper bounds for specific work stealing variants in a theoretical hierarchical model. More specifically, we consider a hierarchical system model of k homogeneous clusters, each having n local processors. The execution of an instruction of a program and an operation for local communication both take time 1, while a remote operation, such as a remote steal or a remote join of two threads, takes time Θ(r). We focus on fork-join programs [15], a well-studied subclass of future-parallel programs. Our hierarchical system model is kept abstract and general, in order to cover both hierarchical shared-memory and message-passing systems. We prove that the lower bound on the execution time of any load balancing algorithm in this model is Ω(min{T1/n + T∞, T1/p + rT∞/log(nr)}), where T1 and T∞ are the total work and the critical path length of a fork-join program, respectively, and p = kn is the number of processors in the whole system. This lower bound indicates that, when T1/T∞ ≤ nr/log(nr), i.e., when the workload of a program is light, running a single cluster of n processors with the classical work stealing algorithm can achieve an optimal expected execution time O(T1/n + T∞). When T1/T∞ > nr/log(nr), i.e., when the workload is heavy, we show that an algorithm, called the global work stealing algorithm, can achieve an execution time O(T1/p + rT∞) in expectation, which is optimal within a factor of O(log(nr)). This algorithm is essentially the classical work stealing algorithm with the "attaching scheme", which resembles the clone optimization [34], to reduce remote joins. Its upper bound is a little surprising, as processors in the global work stealing algorithm choose victim processors uniformly at random, which is not considered a good strategy on hierarchical systems.

In most fork-join programs we find in practice, a thread stops splitting into two parallel ones when the amount of work is small enough that it can be quickly done sequentially. Thus, to analyze work stealing on these programs, we also define two subclasses of fork-join programs, called s-bounded fork-join programs and s-bounded divide-and-conquer programs, where each thread has at least s nodes of work. We show that a very similar lower bound Ω(min{T1/n + T∞, T1/p + rT∞/log(nr/s)}) still holds for all work stealing algorithms, for any s = O(r). On the other hand, we prove that the class of unbalanced work stealing algorithms can achieve a good upper bound O(T1/p + rT∞), for any s = Ω(r). This bound is optimal within a factor of log(n) when the workload is heavy, i.e., when T1/T∞ > nr/log(n). We believe the class of unbalanced work stealing algorithms captures the characteristics of many work stealing algorithms using the heuristic that processors should make more local steals than remote steals, a commonly used strategy found empirically effective in practice [62]. Therefore, our result may imply that this heuristic, combined with the strategy of having threads stop splitting early, is likely to have good performance guarantees.

1.2.3 Work Stealing for Linearizable Futures

In Chapter 4, we will propose a new program model, called the linearizable-futures model, that supports both futures without side-effects, which we call normal futures, and futures that exist for their side-effects, which we call linearizable futures. This model requires futures to be used in certain structured ways. These constraints are reasonable, in the sense that they rule out only certain pathological uses that seem unlikely to occur in practice. We use this model to propose a novel scheduler, called lazy work stealing, a variant of classical work stealing intended to facilitate combining and elimination optimizations for linearizable futures. Finally, we prove bounds on program execution time using lazy work stealing. We show that if the execution time of a program under an optimal offline scheduler is Θ(T1/PA + T∞), then the expected execution time under lazy work stealing is O(T1/PA + (c + 1)T∞), where c is the containment level of the program, defined below. Roughly speaking, when the containment level c is a small constant, which is the case for many programs, the performance of lazy work stealing is close to that of an optimal offline scheduler. We also show that this bound is asymptotically optimal for non-clairvoyant schedulers, by proving a matching lower bound.

1.2.4 Concurrent Data Structures for Near-Memory Computing

The performance gap between memory and CPU has grown exponentially. Memory vendors have focused mostly on improving memory capacity and bandwidth, sometimes even at the cost of increased memory access latencies. To provide higher bandwidth with lower access latencies, hardware architects have proposed near-memory computing (also called processing-in-memory, or PIM), where a lightweight processor (called a PIM core) is located close to memory. A memory access from a PIM core is much faster than from a CPU core. Near-memory computing is an old idea that has been intensely studied in the past (e.g., [74, 54, 36, 68, 67, 52, 40]), but so far has not yet materialized.
However, new advances in 3D integration and in die stacked memory make near-memory computing viable in the near future. For example, one PIM design assumes memory is organized in multiple vaults, each having an in-order PIM core to manage it. These PIM cores can communicate through message passing, but do not share memory, and cannot access each other’s vaults. This new technology promises to revolutionize the interaction between computation and data, as memory becomes an active component in managing the data. Therefore, it invites a fundamental rethinking of basic data structures and promotes a tighter dependency between algorithmic design and hardware characteristics. Prior work has already shown significant performance improvements by using PIM for embarrassingly parallel and data-intensive applications [79, 4, 80, 6], as well as for pointer-chasing traversals [50] in sequential data structures. However, current server machines have hundreds of cores; algorithms for concurrent data structures exploit these cores to achieve high throughput and scalability, with significant benefits over sequential data structures (e.g., [33, 69, 75, 48]). As we will show, naive PIM data structures cannot outperform state-of-the-art concurrent data structures. In particular, the lower latency access to memory cannot compensate for the loss of parallelism. To be competitive with traditional concurrent data structures, PIM data structures need new algorithms and new approaches to leverage parallelism. In Chapter 5, we will present some PIM-managed concurrent data structures, where threads send their operation requests as linearizable futures to PIM cores which execute those requests with certain optimizations. In particular, we analyze pointer chasing data structures, which have a high degree of inherent parallelism and low contention, but incur significant overhead due to unpredictable memory accesses. We propose using techniques such as combining and partitioning the data across vaults to reintroduce parallelism for these data structures. Second, we explore contended data structures, such as FIFO queues, which can leverage CPU caches to exploit their inherent high locality. Therefore, FIFO queues might not seem to be able to leverage PIM’s faster memory accesses. Nevertheless, these data structures exhibit a high degree of contention, which makes it difficult even for the most advanced algorithms to obtain good performance for many threads accessing the data oncurrently. We use pipelining of requests, which can be done very efficiently in PIM, to design a new FIFO queue suitable for PIM that can outperform state-of-the-art concurrent FIFO queues [64, 44]. Chapter 2 Well-Structured Futures and Cache Locality In this chapter, we will show that if futures are used in a simple, disciplined way, then their negative impact on cache locality can be much alleviated, a significant improvement over the previous result. This chapter is organized as follows. Section 2.1 describes the model for future-parallel computations. In Section 2.2, we describe parsimonious work-stealing schedulers, and briefly discuss their cache performance measures. In Section 2.3, we define some restricted forms of structured futureparallel computations. Among them, we highlight structured single-touch computations, which, we believe, are likely to arise naturally in many programs. 
In Section 2.4.1, we prove that work-stealing schedulers on structured single-touch computations incur only O(CP T∞²) additional cache misses, if a processor always chooses the future thread to execute first when it creates that future. We also prove this bound is tight within a factor of C. In Section 2.4.2, we show that if a processor chooses the current thread over the future thread when it creates that future, then the cache locality of a structured single-touch computation can be much worse. In Section 2.5, we show that some other kinds of structured future-parallel computations also achieve relatively good cache locality.

2.1 Model

In fork-join parallelism [15, 13, 17], a sequential program is split into a directed acyclic graph of tasks linked by directed dependency edges. These tasks are executed in an order consistent with their dependencies, and tasks unrelated by dependencies can be executed in parallel. Fork-join parallelism is well-suited to dynamic load-balancing techniques such as work stealing [22, 41, 7, 20, 1, 42, 18, 34, 55, 3, 23]. A popular and effective way to extend fork-join parallelism is to allow threads to create futures [41, 42, 8, 16, 32]. A future is a data object that represents a promise to deliver the result of an asynchronous computation when it is ready. That result becomes available to a thread when the thread touches that future, blocking if necessary until the result is ready. Futures are attractive because they provide greater flexibility than fork-join programs, and they can also be implemented effectively using dynamic load-balancing techniques such as work stealing. Fork-join parallelism can be viewed as a special case of future-parallelism, where the spawn operation is an implicit future creation, and the sync operation is an implicit touch of the untouched futures created by a thread. Future-parallelism is more flexible than fork-join parallelism, because the programmer has finer-grained control over touches (joins).

2.1.1 Computation DAG

A thread creates a future by marking an expression (usually a method call) as a future. This statement spawns a new thread to evaluate that expression in parallel with the thread that created the future. When a thread needs access to the results of the computation, it applies a touch operation to the future. If the result is ready, it is returned by the touch, and otherwise the touching thread blocks until the result becomes ready. Without loss of generality, we will consider fork-join parallelism to be a special case of future-parallelism, where forking a thread creates a future, and joining one thread to another is a touch operation.

Our notation and terminology follow earlier work [7, 20, 1, 73]. A future-parallel computation is modeled as a directed acyclic graph (DAG). Each node in the DAG represents a task (one or more instructions), and an edge from node u to node v represents the dependency constraint that u must be executed before v. We follow the convention that each node in the DAG has in-degree and out-degree either 1 or 2, except for a distinguished root node with in-degree 0, where the computation starts, and a distinguished final node with out-degree 0, where the computation ends.
There are three types of edges:

• continuation edges, which point from one node to the next in the same thread,
• future edges (sometimes called spawn edges), which point from a node u to the first node of another thread spawned at u by a future creation,
• touch edges (sometimes called join edges), directed from a node u in one thread t to a node v in another thread, indicating that v touches the future computed by t.

A thread is a maximal chain of nodes connected by continuation edges. There is a distinguished main thread that begins at the root node and ends at the final node, and every other thread t begins at a node with an incoming future edge from a node of the thread that spawns t. The last node of t has only one outgoing edge, which is a touch edge directed to another thread, while other nodes of t may or may not have incoming and outgoing touch edges. A critical path of a DAG is a longest directed path in the DAG, and the DAG's computation span is the length of a critical path.

As illustrated in Figure 2.1, if a thread t1 spawns a new thread t2 at node v in t1 (i.e., v has two outgoing edges, a continuation edge and a future edge to the first node of t2), then we call t1 the parent thread of t2, t2 the future thread (of t1) at v, and v the fork of t2. A thread t3 is a descendant thread of t1 if t3 is a future thread of t1 or, by induction, t3's parent thread is a descendant thread of t1.

Figure 2.1: Node and thread terminology.

If there is a touch edge directed from node v1 in thread t1 to node v2 in thread t2 (i.e., t2 touches a future computed by t1), and a continuation edge directed from node u2 in t2 to v2, then we call node v2 a touch of t1 by t2, v1 the future parent of v2, u2 the local parent of v2, and t1 the future thread of v2. (Note that the touch v2 is actually a node in thread t2.) We call the fork of t1 the corresponding fork of v2. Note that only touch nodes have in-degree 2. To distinguish between the two types of nodes with out-degree 2, forks and future parents of touches, we follow the convention of previous work that the children of a fork both have in-degree 1 and cannot be touches. In this way, a fork node has two children with in-degree 1, while a touch's future parent has a (touch) child with in-degree 2. We follow the convention that when a fork appears in a DAG, the future thread is shown on the left, and the parent thread on the right. (Note that this does not mean the future thread is chosen to execute first at a fork.) Similarly, the future parent of a touch is shown on the left, and the local parent on the right.

We use the following (standard) notation. Given a computation DAG, P is the number of processors executing the computation, t is the number of touches in the DAG, T∞, the computation span (or critical path), is the length of the longest directed path, and C is the number of cache lines in each processor.

2.2 Work-Stealing and Cache Locality

In this paper, we focus on parsimonious work stealing algorithms [7], which have been extensively studied [7, 20, 1, 73, 19] and used in systems such as Cilk [18]. In a parsimonious work stealing algorithm, each processor is assigned a double-ended queue (deque). After a processor executes a node with out-degree 1, it continues to execute the next node if the next node is ready to execute. After the processor executes a fork, it pushes one child of the fork onto the bottom of its deque and executes the other.
When the processor runs out of nodes to execute, it pops the first node from the bottom of its deque if the deque is not empty. If, however, its deque is empty, it steals a node from the top of the deque of an arbitrary processor.

In our model, a cache is fully associative and consists of multiple cache lines, each of which holds the data in a memory block. Each instruction can access only one memory block. In our analysis we focus only on the widely used least-recently-used (LRU) cache replacement policy, but our results about the upper bounds on cache overheads should apply to all simple cache replacement policies [1]. (This is because the upper bounds in this paper are based on the results of [1] that bound the number of drifted nodes (i.e., deviations), and those results hold for all simple cache replacement policies, even with set-associative caches, as discussed in [1].) The cache locality of an execution is measured by the number of cache misses it incurs, which depends on the structure of the computation. To measure the effect of parallelism on cache locality, it is common to compare the cache misses encountered in a sequential execution to the cache misses encountered in various parallel executions, focusing on the number of additional cache misses introduced by parallelism.

Scheduling choices at forks affect the cache locality of executions with work stealing. After executing a fork, a processor picks one of the two child nodes to execute and pushes the other into its deque. For a sequential execution, whether a choice results in better cache performance is a characteristic of the computation itself. For a parallel execution of a computation satisfying certain properties, however, we will show that choosing future threads (the left children) at forks to execute first guarantees a relatively good upper bound on the number of additional cache misses, compared to a sequential execution that also chooses future threads first. In contrast, choosing the parent threads (the right children) to execute first can result in a large number of additional cache misses, compared to a sequential execution that also chooses parent threads first.

2.3 Structured Computations

Consider a sequential execution where node v1 is executed immediately before node v2. A deviation [73], also called a drifted node [1], occurs in a parallel execution if a processor p executes v2, but not immediately after v1. For example, p might execute v1 after v2, it might execute other nodes between v1 and v2, or v1 and v2 might be executed by distinct processors. [73] showed that a parallel execution of a future-parallel computation with work stealing can incur Ω(P T∞ + tT∞) deviations. This implies that a parallel execution of a future-parallel computation with work stealing can incur Ω(P T∞ + tT∞) additional cache misses. With minor modifications to that computation (see Figure 2.2), a parallel execution can even incur Ω(CP T∞ + CtT∞) additional cache misses.

Figure 2.2: The interesting part of the bound is Ω(CtT∞). Figure 5 in [73] shows a DAG, as a building block of a worst-case computation, that can incur Ω(T∞) deviations because of one touch. We can replace it with the DAG in Figure 2.2, which can incur Ω(CT∞) additional cache misses due to one touch v (if the processor at a fork always chooses the parent thread to execute first), so that the worst-case computation in [73] can incur Ω(CtT∞) additional cache misses because of t such touches. This DAG is similar to the DAG in Figure 2.7(a) in this paper. The proof of Theorem 10 shows how a parallel execution of this DAG incurs Ω(CT∞) additional cache misses.

Our contribution in this paper is based on the observation that such poor cache locality occurs primarily when futures in the DAG can be touched by arbitrary threads, resulting in unrealistic and complicated dependencies. For example, in the worst-case DAGs in [73] that can incur significantly high cache overheads, futures are touched by threads that can be created before the future threads computing these futures were created. As illustrated in Figure 2.3, a parallel execution of such a computation can arrive at a scenario where a thread touches a future before the future thread computing that future has been spawned. (As a practical matter, an implementation must ensure that such a touch does not return a reference to a memory location that has not yet been allocated.) Such scenarios are avoided by structured future-parallel computations (e.g., Figure 2.4) that follow certain simple restrictions.

Figure 2.3: A simplified version of the DAG in [73] that can incur high cache overhead. Here, v1 and v2 are touches. Suppose a processor p1 executes the root node, pushes the right child x of the root node into its deque, and then falls asleep. Now another processor p2 steals x from p1's deque and executes the subgraph rooted at x. Thus, v1 and v2 will be checked (to see if they are available) even before the corresponding future threads are spawned at u1 and u2.

Figure 2.4: In this structured (single-touch) computation, the touches v1 and v2 will not be checked until their corresponding future threads have been spawned at u1 and u2, respectively.

Definition 1. A DAG is a structured future-parallel computation if, (1) for the future thread t of any fork v, the local parents of the touches of t are descendants of v, and (2) at least one touch of t is a descendant of the right child of v.

There are two reasons we require that at least one touch of t is a descendant of the right child of v. First, it is natural that a computation spawns a future thread to compute a future because the computation itself later needs that value. At the fork v, the parent thread (the right child of v) represents the "main body" of the computation. Hence, the future will usually be touched either by the parent thread, or by threads spawned directly or indirectly by the parent thread. Second, a computation usually needs a kind of "barrier" synchronization to deal with resource release at the end of the computation. Some node in the future thread t, usually the last node, should have an outgoing edge pointing to the "main body" of the computation to tell the main body that the future thread has finished. Without such synchronization, t and its descendants will be isolated from the main body of the computation, and we can imagine a dangerous scenario where the main body of the computation finishes and releases its resources while t or its descendant threads are still running. In our DAG model, such a synchronization point is by definition a touch node, though it may not be a real touch. We follow the convention that the thread that spawns a future thread releases it, so the synchronization point is a node in the parent thread or one of its descendants.
Another possibility is to place the synchronization point at the last node of the entire computation, which is typically the case in languages such as Java, where the main thread of a program is in charge of releasing resources for the entire computation. These two styles are essentially equivalent, and should have almost the same bounds on cache overheads. We will briefly discuss this issue in Section 2.5.2.

We consider how the following constraint affects cache locality.

Definition 2. A structured single-touch computation is a structured future-parallel computation where each future thread spawned at a fork v is touched only once, and the touch node is a descendant of v's right child.

By the definition of threads, the future parent of the only touch of a future thread must be the last node of the future thread (the last node can also be a parent of a join node, but we do not distinguish between a touch node and a join node). The DAG in Figure 2.4 represents a structured single-touch computation. We will show that work-stealing parallel executions of structured single-touch computations incur significantly lower cache overheads than those of unstructured computations.

In principle, a future could be touched multiple times by different threads, so structured single-touch computations are more restrictive than structured computations in general. Nevertheless, the single-touch constraint is one that is likely to be observed by many programs. For example, as noted, the Cilk [18] language supports fork-join parallelism, a strict subset of the future-parallelism model considered here. If we interpret the Cilk language's spawn statement as creating a future, and its sync statement as touching all untouched futures previously created by that thread, then Cilk programs (like all fork-join programs) are structured single-touch computations.

Structured single-touch computations encompass fork-join computations, but are strictly more flexible. Figure 2.5 presents two examples that illustrate the differences. If a thread creates multiple futures first and touches them later, fork-join parallelism requires that they be touched (evaluated) in the reverse order. MethodA in Figure 2.5(a) shows the only order in which a thread can first create two futures and then touch them in a fork-join computation. This rules out, for instance, a program where a thread creates a sequence of futures, stores them in a priority queue, and evaluates them in some priority order. In contrast, our structured computations permit such futures to be evaluated by their creating thread or its descendants in any order. Also, unlike fork-join parallelism, our notion of structured computation permits a thread to pass a future to another thread which touches that future, as illustrated in Figure 2.5(b): after a future is created, the future can be passed, as an argument of a new method call or as the return value of the current thread's method call, to another thread. The thread receiving the future (MethodC in the figure) can even pass it to another thread, and so on. The only constraint is that only one of the threads that have received the future can touch it. In a fork-join computation, however, only the thread creating the future can touch it, which is much more restrictive. We believe these restrictions are easy to follow and should be compatible with how many people program in practice.
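As an illustration only (this sketch is not from the thesis; the class and method names are invented, and Java's CompletableFuture merely stands in for a future-parallel runtime with futures), the hand-off pattern of Figure 2.5(b) can be written as follows: the creating method passes the untouched future to exactly one other method, which performs the only touch. A pure fork-join program cannot express this hand-off, since only the creating thread can sync.

    import java.util.concurrent.CompletableFuture;

    public class SingleTouchSketch {
        // Mirrors MethodB in Figure 2.5(b): create a future and hand it off without touching it.
        static long methodB() {
            CompletableFuture<Long> x = CompletableFuture.supplyAsync(() -> slowSum(1_000_000));
            return methodC(x); // the future is passed along; methodB never touches x itself
        }

        // Mirrors MethodC in Figure 2.5(b): the receiver performs the one and only touch.
        static long methodC(CompletableFuture<Long> f) {
            return f.join(); // the single touch
        }

        // A stand-in for "some computation".
        static long slowSum(int n) {
            long acc = 0;
            for (int i = 1; i <= n; i++) acc += i;
            return acc;
        }

        public static void main(String[] args) {
            System.out.println(methodB()); // 500000500000
        }
    }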
[16] observe that if a future can be touched multiple times, then complex and potentially inefficient operations and data structures are needed to correctly resume the suspended threads that are waiting for the touch. By contrast, the run-time support for futures can be significantly simplified if each future is touched at most once.

We also consider the following structured local-touch computations in this paper.

Definition 3. A structured local-touch computation is one where each future thread spawned at a fork v is touched only at nodes in its parent thread, and these touches are descendants of the right child of v.

Informally, the local-touch constraint implies that a thread that needs the value of a future should create the future itself. Note that in a structured computation with the local-touch constraint, a future thread is now allowed to evaluate multiple futures, and these futures can be touched at different times. Though allowing a future thread to compute multiple futures is not very common, [16] point out that it can be useful for some future-parallel computations like pipeline parallelism [16, 18, 38, 35, 56]. We will show in Section 2.5.1 that work-stealing parallel executions of computations satisfying the local-touch constraint also have relatively low cache overheads. Note that structured computations with both the single-touch and local-touch constraints are still a superset of fork-join computations.

void MethodA {
    Future x = some computation;
    Future y = some computation;
    a = y.touch();
    b = x.touch();
}
(a)

void MethodB {
    Future x = some computation;
    Future y = MethodC(x);
    ......
}
void MethodC(Future f) {
    a = f.touch();
}
(b)

Figure 2.5: Two examples illustrating that single-touch computations are more flexible than fork-join computations.

2.4 Structured Single-Touch Computations

2.4.1 Future Thread First at Each Fork

We now analyze the cache performance of work stealing on parallel executions of structured single-touch computations. We will show that work stealing has relatively low cache overhead if the processor at a fork always chooses the future thread to execute first, and pushes the parent thread into its deque. For brevity, all the arguments and results in this section assume that every execution chooses the future thread at a fork to execute first.

Lemma 4. In the sequential execution of a structured single-touch computation, any touch x's future parent is executed before x's local parent, and the right child of x's corresponding fork v immediately follows x's future parent.

Proof. By induction. Given a DAG, initially let S be an empty set and T the set of all touches. Note that

S ∩ T = ∅ and S ∪ T = {all touches}.    (2.1)

Consider any touch x in T such that x has no ancestors in T. (That is, x has no ancestor nodes that are also touches.) Let t be the future thread of x and v the corresponding fork. Note that x's future parent is the last node of t by definition. When the single processor executes v, the processor pushes v's right child into the deque and continues to execute thread t. By hypothesis, there are no touches by t, since any touch by t must be an ancestor of x. There may be some forks in t. However, whenever the single processor executes a fork in t, it pushes the right child of that fork, which is a node in t, into the deque, and hence t (i.e., a node in t) is right below v's right child in the deque. Therefore, the processor will always resume thread t before the right child of v. Since there is no touch by t, all the nodes in t are ready to execute one by one.
Thus, when the future parent of the touch x is eventually executed, the right child of v is right at the bottom of the deque. By the single-touch constraint, the local parent of x is a descendant of the right child of v, so the local parent of x cannot have been executed yet. Thus, the processor will now pop the right child of v out from the bottom of the deque. Since this node is not a touch, it is ready to execute. Therefore, x satisfies the following two properties.

Property 5. Its future parent is executed before its local parent.

Property 6. The right child of its corresponding fork immediately follows its future parent.

Now set S = S ∪ {x} and T = T − {x}. Thus, all touches in S satisfy Properties 5 and 6. Note that Equation (2.1) still holds.

Now suppose that at some point all nodes in S satisfy Properties 5 and 6, and that Equation (2.1) holds. Again, we now consider a touch x in T such that no touches in T are ancestors of x, i.e., all the touches that are ancestors of x are in S. Since the computation graph is a DAG, there must be such an x as long as T is not empty. Let t be the future thread of x and v the corresponding fork. If there are no touches by t, then we can prove that x satisfies Properties 5 and 6 by the same argument as for the first touch added into S. Now assume there are touches by t. Since those touches are ancestors of x, they are all in S and hence they all satisfy Property 5. When the processor executes v, it pushes v's right child into the deque and starts executing t. Similar to what we showed above, when the processor gets to a fork in t, it will always push t into its deque, right below the right child of v. Thus, the processor will always resume t before the right child of v. When the processor gets to the local parent of a touch by t, we know the future parent of the touch has already been executed, since the touch satisfies Property 5. Thus, the processor can immediately execute that touch and continue to execute t. Therefore, the processor will eventually execute the future parent of x while the right child of v is still the next node to pop in the deque. Again, since the local parent of x is a descendant of the right child of v, the local parent of x as well as x cannot have been executed yet. Therefore, the processor will now pop the right child of v to execute, and hence x satisfies Properties 5 and 6. Now we set S = S ∪ {x} and T = T − {x}. Therefore, all touches in S satisfy Properties 5 and 6, and Equation (2.1) also holds. By induction, we have S = {all touches}, and all touches satisfy Properties 5 and 6. □

[1] have shown that the number of additional cache misses in a work-stealing parallel computation is bounded by the product of the number of deviations and the number of cache lines. It is easy to see that only two types of nodes in a DAG can be deviations: the touches and the child nodes of forks that are not chosen to execute first. Since we assume the future thread (left child) at a fork is always executed first, only the right children of forks can be deviations. Next, we bound the number of deviations incurred by a work-stealing parallel execution to bound its cache overhead.

Lemma 7. Let t be the future thread at a fork v in a structured single-touch computation. If t's touch x or v's right child u is a deviation, then either u is stolen or there is a touch by t which is a deviation.

Proof. By Lemma 4, a touch is a deviation if and only if its local parent is executed before its future parent. Now suppose a processor p executes v and pushes u into its deque.
Assume that u is not stolen and no touches by t are deviations. Thus, u will stay in p's deque until p pops it out. The rest of the proof is similar to that of Lemma 4. After p spawns thread t at v, it moves on to execute t. When p executes "ordinary" nodes in t, no nodes are pushed into or popped out of p's deque, and hence u is still the next node in the deque to pop. When p executes a fork in t, it pushes t (more specifically, the right child of that fork) into its deque, right below u. Since a thief processor always steals from the top of a deque, and by hypothesis u is not stolen, t cannot be stolen. Thus, p will always resume t before u, and then u will again become the next node in the deque to pop. When p executes the local parent of a touch by t, the future parent of that touch must have been executed, since we assume that touch is not a deviation. Thus, p can continue to execute that touch immediately and keep moving on in t with its deque unchanged. Therefore, p will finally get to the local parent of x and then pop u out from its deque, since x is a descendant of u and x cannot be executed yet. Hence, neither x nor u can be a deviation. □

Theorem 8 If, at each fork, the future thread is chosen to execute first, then a parallel execution with work stealing incurs O(P T∞²) deviations and O(CP T∞²) additional cache misses in expectation on a structured single-touch computation, where (as usual) P is the number of processors involved in this computation, T∞ is the computation span, and C is the number of cache lines.

Proof. Arora et al. [7] have shown that in a parallel execution with work stealing, there are in expectation O(P T∞) steals. Now let us count how many deviations these steals can incur. A steal on the right child u of a fork v can make u and v's corresponding touch x1 deviations. Suppose x1 is a touch by a thread t2; then the right child of the fork of t2 and t2's touch x2 can be deviations. If x2 is a deviation and x2 is a touch by another thread t3, then the right child of the fork of t3 and t3's touch x3 can be deviations too. Note that x2 is a descendant of x1 and x3 is a descendant of x2. By repeating this observation, we can find a chain of touches x1, x2, x3, ..., xn, called a deviation chain, such that each xi and the right child of the corresponding fork of xi can be deviations. Since, for each i > 1, xi is a descendant of xi−1, the touches x1, x2, x3, ..., xn lie on a directed path in the computation DAG. Since the length of any path is at most T∞, we have n ≤ T∞. Since each future thread has only one touch, there is only one deviation chain for a steal. Since there are O(P T∞) steals in expectation in a parallel execution [7], we can find in expectation O(P T∞) deviation chains and in total O(P T∞²) touches and right children of the corresponding forks involved, i.e., O(P T∞²) deviations involved.

Figure 2.6: Figure (c) shows a DAG on which work stealing can incur Ω(P T∞²) deviations and Ω(P T∞²) additional cache misses. It uses the DAGs in (a) and (b) as building blocks.

Next, we prove by contradiction that no other touches or right children of forks can be deviations. Suppose there is a touch y such that y or the right child of the corresponding fork of y is a deviation, and that y is not in any deviation chain. The right child of the corresponding fork of y cannot be stolen, since by hypothesis y is not the first touch in any of those chains. Thus, by Lemma 7, there is a touch y′ by the future thread of y, and y′ is a deviation.
Note that y′ cannot be in any deviation chain either; otherwise, y together with the deviation chain containing y′ would form a deviation chain too, a contradiction. Therefore, by repeating such "tracing back", we will end up at a deviation touch that is not in any deviation chain and has no touches as its ancestors. Therefore, there are no touches by the future thread of this touch, and the right child of its corresponding fork is not stolen, contradicting Lemma 7.

The upper bound on the expected number of additional cache misses follows from the result of [1] that the number of additional cache misses in a work-stealing parallel computation is bounded by the product of the number of deviations and the number of cache lines. □

The bound on the number of deviations in Theorem 8 is tight, and the bound on the number of additional cache misses is tight within a factor of C, as shown below in Theorem 9.

Theorem 9 If, at each fork node, the future thread is chosen to execute first, then a parallel execution with work stealing can incur Ω(P T∞²) deviations and Ω(P T∞²) additional cache misses on a structured single-touch computation, while the sequential execution of this computation incurs O(P T∞²/C) cache misses.

Proof. Figure 2.6(c) shows a computation DAG on which we can get the bounds we want to prove. The DAG in Figure 2.6(c) uses the DAGs in Figures 2.6(a) and 2.6(b) as building blocks. Let us look at Figure 2.6(a) first. Suppose there are two processors p1 and p2 executing the DAG in Figure 2.6(a). Suppose p2 executes v, pushes u1 into its deque, and then falls asleep before executing w. Now suppose p1 steals u1. For each i ≤ k, neither si nor Zi can be executed, since w has not been executed yet. Now p1 takes a solo run, executing u1, x1, Y1, u2, x2, Y2, ..., xk, Yk. After p1 finishes, p2 wakes up and executes the rest of the computation DAG. Note that the right (local) parent of si is executed before its left (future) parent is executed. Thus, by Lemma 4, each si is a deviation. Hence, this parallel execution incurs k deviations, and the computation span of the computation is Θ(k).

Now let us consider a parallel execution of the computation in Figure 2.6(b). For each i ≤ k, the subgraph rooted at vi is identical to the computation DAG in Figure 2.6(a) (except that the last node of the subgraph has an extra edge pointing to a node of the main thread). Suppose there are three processors p1, p2, and p3 working on the computation. Assume p2 executes r1 and v1 and then falls asleep when it is about to execute w. p3 now steals r2 from p2 and then falls asleep too. Then p1 steals u1 from p2's deque. Now p1 and p2 execute the subgraph rooted at v1 in the same way they executed the DAG in Figure 2.6(a). After p1 and p2 finish, p3 wakes up and executes r2. Now these three processors start working on the subgraph rooted at r3 in the same way they executed the graph rooted at r1. By repeating this, the execution ends up incurring k² deviations when all the k subgraphs are done. Since the length of the path r1, r2, r3, ... on the right-hand side is Θ(k), the computation span of the DAG is still Θ(k).

Now we construct the final computation DAG, as in Figure 2.6(c). The "top" nodes of the DAG are all forks, each spawning a future thread. Thus, they form a binary tree and the number of threads increases exponentially. The DAG stops creating new threads at level Θ(log n), when it has n threads rooted at S1, S2, ..., Sn, respectively.
For each i, the subgraph rooted at Si is identical to the DAG in Figure 2.6(b). Suppose there are 3n processors working on the computation. It is easy to see that n processors can eventually get to S1, S2, ..., Sn. Suppose they all fall asleep immediately after executing the first two nodes of Si (corresponding to r1 and v1 in Figure 2.6(b)), and then two of the remaining 2n free processors join each subgraph rooted at Si, working in the same way p1, p2, and p3 did in Figure 2.6(b). Therefore, this execution will finally incur nk² deviations, while the computation span of the DAG is Θ(k + log n). Therefore, by setting n = P/3, we get a parallel execution that incurs Ω(P T∞²) deviations, when log P = O(k).

To get the bound on the number of additional cache misses, we just need to modify the graph in Figure 2.6(a) as follows. For each 1 ≤ i ≤ k, Yi consists of a chain of C nodes yi1, yi2, ..., yiC, where C is the number of cache lines. yi1, yi2, ..., yiC access memory blocks m1, m2, ..., mC, respectively. Similarly, each Zi consists of a chain of C nodes zi1, zi2, ..., ziC. zi1, zi2, ..., ziC access memory blocks mC, mC−1, ..., m1, respectively. All si access memory block mC. For all 1 ≤ i ≤ k, ui and xi both access memory block mC+1. It does not matter which memory blocks the other nodes in the DAG access. For simplicity, assume the other nodes do not access memory.

In the sequential execution, the single processor has m1, m2, ..., mC in its cache after executing v, w, u1, x1, Y1, Z1, and it has incurred (C + 1) cache misses so far. Now it executes u2 and x2, incurring one cache miss at node u2 by replacing mC with mC+1 in its cache, since mC is the least recently used block. When it executes Y2 and Z2, it only incurs one cache miss, by replacing mC+1 with mC at the last node of Y2, y2C. Likewise, it is easy to see that the sequential execution will only incur cache misses at nodes ui and at the last nodes of Yi for all i. Hence, the sequential execution incurs only O(k + C) cache misses. When k = Ω(C), the sequential execution incurs only O(k) cache misses.

Now consider the parallel execution by two processors p1 and p2 we described before. p2 will incur only C cache misses, since Zi and si only access C different blocks m1, m2, ..., mC, and hence p2 does not need to swap any memory blocks out of its cache. However, p1 will incur lots of cache misses. After executing each Yi, p1 will execute ui+1. Thus, at ui+1, one cache miss is incurred and m1 is replaced with mC+1, since m1 is the least recently used block. Then, when p1 executes the first node y(i+1)1 in Yi+1, m1 is not in its cache. Since m2 now becomes the least recently used memory block in p1's cache, m2 is replaced by m1. Thus, m2 will not be in the cache when it is needed at y(i+1)2. Therefore, p1 will incur a cache miss at each node in Yi+1, and hence it incurs Ck cache misses in total in the entire execution. Note that the computation span of this modified DAG is Θ(Ck), since each Zi now has C nodes. Therefore, the sequential execution and the parallel execution actually incur Θ(T∞/C) and Θ(T∞) cache misses, respectively, when log P = O(k). Therefore, if we use this modified DAG as the building block in Figure 2.6(c), we will get the bound on the number of additional cache misses stated in the theorem. □
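The cache-miss counts in the last two proofs come down to replaying a fixed sequence of memory-block accesses through an LRU cache with C lines. The following small Java sketch (illustrative only; the class and method names are not part of the constructions above) shows how such counts can be checked mechanically for small instances of the access sequences described in the proofs.

import java.util.LinkedHashMap;
import java.util.Map;

// Counts the misses of an LRU cache with 'capacity' lines on a sequence of block ids.
// Handy for sanity-checking access orders such as v, w, u1, x1, Y1, Z1, u2, ... by hand.
final class LruMissCounter {
    static long countMisses(final int capacity, int[] accesses) {
        // An access-ordered LinkedHashMap evicts its least recently used entry.
        LinkedHashMap<Integer, Boolean> cache =
            new LinkedHashMap<Integer, Boolean>(capacity, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Integer, Boolean> eldest) {
                    return size() > capacity;
                }
            };
        long misses = 0;
        for (int block : accesses) {
            if (cache.get(block) == null) {   // block not cached: a miss
                misses++;
                cache.put(block, Boolean.TRUE);
            }
        }
        return misses;
    }

    public static void main(String[] args) {
        // With C = 2 cache lines, the sequence m1, m2, m3, m1 incurs 4 misses.
        System.out.println(countMisses(2, new int[]{1, 2, 3, 1}));
    }
}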
2.4.2 Parent Thread First at Each Fork

In this section, we show that if the parent thread is always executed first at a fork, a work-stealing parallel execution of a structured single-touch computation can incur Ω(tT∞) deviations and Ω(CtT∞) additional cache misses, where t is the number of touches in the computation, while the corresponding sequential execution incurs only a small number of cache misses. This bound matches the upper bound for general, unstructured future-parallel computations [73]. (The bound on the expected number of deviations in [73] is actually O(P T∞ + tT∞). However, as pointed out in [73], a simple fork-join computation can already incur Ω(P T∞) deviations. Hence we focus on the more interesting part, Ω(tT∞).) This result, combined with the result in Section 2.4.1, shows that choosing the future threads at forks to execute first achieves better cache locality for work-stealing schedulers on structured single-touch computations.

Theorem 10 If, at each fork, the parent thread is chosen to execute first, then a parallel execution with work stealing can incur Ω(tT∞) deviations and Ω(CtT∞) additional cache misses on a structured single-touch computation, while the sequential execution of this computation incurs only O(C + t) cache misses.

Proof. The final DAG we want to construct is in Figure 2.8. It uses the DAGs in Figure 2.7 as building blocks. We first describe how a single deviation at a touch u3 can incur Ω(T∞) deviations and Ω(CT∞) additional cache misses in Figure 2.7(a). In order to get the bound we want to prove, here we follow the convention in [1, 73] to distinguish between touches and join nodes in the DAG. More specifically, yi is a join node, not a touch, for each 1 ≤ i ≤ n.

Figure 2.7: DAGs used by Figure 2.8 as building blocks.

For each 1 ≤ i ≤ n, node xi accesses memory block m1 and yi accesses memory block mC+1. Zi consists of a chain of C nodes zi1, zi2, ..., ziC, accessing memory blocks m1, m2, ..., mC, respectively. All the other nodes do not access memory.

Assume that in the sequential execution a single processor p1 executes the entire DAG in Figure 2.7(a). Suppose initially the left (future) parent of u3 has already been executed. p1 starts executing the DAG at u1. Since p1 always stays on the parent thread at a fork, it first pushes s into its deque, continues to execute u2, u3, u4, and then executes x1, x2, ..., xn while pushing z11, z21, ..., zn1 into its deque. Since v cannot be executed due to s, p1 pops zn1 out of its deque and executes the nodes in Zn. Then p1 executes all the nodes in Zn−1, Zn−2, ..., Z1, in this order. So far p1 has only incurred C cache misses, since all the nodes it has executed only access memory blocks m1, ..., mC and hence it did not need to swap any memory blocks out of its cache. Now p1 executes s, v and then yn, yn−1, ..., y1, incurring only one more cache miss by replacing m1 with mC+1 at yn. Hence, this execution incurs O(C) cache misses in total. Note that the left parent of yi is executed before the right parent of yi for all i.

Now assume that in another execution by p1, the left parent of u3 is in p1's deque when p1 starts executing u1. Thus, u3 is a deviation with respect to the previous execution. Since u3 is not ready to execute after p1 executes u2, p1 pops s out of its deque to execute. Since v is not ready, p1 now pops the left parent of u3 to execute and then executes u3, u4, x1, x2, ..., xn, v. Now p1 pops zn1 out and executes all the nodes in Zn.
Note that yn is now ready to execute and the memory blocks in p1's cache at the moment are m1, m2, ..., mC. Now p1 executes yn, replacing the least recently used block m1 with mC+1. p1 then pops z(n−1)1 out and executes all the nodes z(n−1)1, z(n−1)2, ..., z(n−1)C in Zn−1 one by one. When p1 executes z(n−1)1, it replaces m2 with m1, and when it executes z(n−1)2, it replaces m3 with m2, and so on. The same thing happens to all Zi and yi. Thus, p1 will incur a cache miss at every node afterwards, ending up with Ω(Cn) cache misses in total. Note that the computation span of this DAG is T∞ = Θ(C + n). Thus, this execution with a deviation at u3 incurs Ω(CT∞) cache misses when n = Ω(C). Moreover, all yi are deviations, and hence this execution incurs Ω(T∞) deviations.

Now let us see how a single steal at the beginning of a thread results in Ω(T∞) deviations and Ω(CT∞) cache misses at the end of the thread. Figure 2.7(b) presents such a computation. First we consider the sequential execution by a processor p1. It is easy to see that p1 executes nodes in the order r, u1, w1, s2, s1, v1, u2, w2, v2, u3, w3, s4, s3, v3, u4, .... The key observation is that wi is executed before si for any odd-numbered i, while wi is executed after si for any even-numbered i. This statement can be proved by induction. Obviously, it holds for i = 1 and i = 2, as we showed before. Now suppose this fact holds for all 1, 2, ..., i, for some even-numbered i. Suppose p1 executes ui−1. Then p1 pushes si into its deque and executes wi−1. Since we know wi−1 should be executed before si−1, si−1 has not been executed yet. Moreover, si−1 must already be in the deque before si was pushed into the deque, since si−1's parent ui−2 has been executed and si−1 is ready to execute. Now p1 pops si out to execute. Since vi is not ready to execute, p1 pops si−1 out and then executes si−1, vi−1, ui, and pushes si+1 into the deque. Now p1 continues to execute wi, vi, ui+1 and pushes si+2 into its deque. Then p1 executes wi+1 and pops si+2 out, since vi+1 is not ready due to si+1. Now we can see that wi+1 and si+2 have been executed, but si+1 and wi+2 not yet. That is, wi+1 is executed before si+1 and wi+2 is executed after si+2. Therefore, the statement holds for i + 1 and i + 2, and the proof is complete. The subgraph rooted at uk is identical to the graph in Figure 2.7(a), with vk corresponding to u3 in Figure 2.7(a). Therefore, if k is an even number, vk's left parent has been executed when wk is executed, and hence the sequential execution will incur only O(C) cache misses on the subgraph rooted at uk.

Now consider the following parallel execution of the DAG in Figure 2.7(b) by two processors p1 and p2. p1 first executes r and pushes s1 into its deque. Then p2 immediately steals s1 and executes it. Now p2 falls asleep, leaving p1 to execute the rest of the DAG alone. It is easy to see that p1 will execute the nodes in the DAG in the order u1, w1, v1, u2, w2, s3, s2, v2, u3, w3, v3, u4, s4, .... It can be proved by induction that wi is executed after si for any odd-numbered i, while wi is executed before si for any even-numbered i, which is the opposite of the order in the sequential execution. The induction proof is similar to that of the previous observation about the sequential execution, so we omit it here.
If k is an even number, wk will be executed before the left parent of vk, and hence this execution will incur Ω(T∞) deviations and Ω(CT∞) cache misses when n = Ω(C) and n = Ω(k).

The final DAG we want to construct is in Figure 2.8. It is a generalization of the DAG in Figure 2.7(b). Instead of having one fork ui before each touch vi, it has two forks ui and xi, for each i. After each touch vi, the thread at yi splits into two identical branches, touching the futures spawned at ui and xi, respectively. In this figure, we only depict the right branch and omit the identical left branch. As we can see, the right branch later has a touch vi+1 touching the future si+1 spawned at the fork xi. If we only look at the thread on the right-hand side, it is essentially the same as the DAG in Figure 2.7(b). The sequential execution of this DAG by p1 is similar to that in Figure 2.7(b). The only difference is that p1 at each yi will execute the right branch first and then the left branch, recursively. Similarly, it can be proved by induction that wi is executed before si for any odd-numbered i, while wi is executed after si for any even-numbered i. Obviously this also holds for each left branch. Now consider a parallel execution by two processors p1 and p2. p1 first executes r. p2 immediately steals s1, executes it, and then sleeps forever. Now p1 makes a solo run to execute the rest of the DAG. Again, we can prove by the same induction argument that wi is executed after si for any odd-numbered i, while wi is executed before si for any even-numbered i, which is the opposite of the order in the sequential execution. The above two induction proofs are a little more complicated than those for the DAG in Figure 2.7(b), but the ideas are essentially the same (the only difference is that now we have to prove the statements hold for the two identical branches split at yi at the inductive step), and hence we omit the proofs again.

By splitting each thread into two after each yi, the number of branches in the DAG increases exponentially. Suppose there are t touches in the DAG. Thus, there are eventually Θ(t) branches and the height of this structure is Θ(log t). At the end of each branch is a subgraph identical to the DAG in Figure 2.7(a).

Figure 2.8: A DAG on which work stealing can incur Ω(tT∞) deviations and Ω(CtT∞) additional cache misses if it chooses parent threads to execute first at forks. This example uses the DAGs in Figure 2.7 as building blocks.

Therefore, the parallel execution with only one steal can end up incurring Θ(tn) deviations and Θ(Ctn) cache misses. The sequential execution incurs only Θ(C + t) cache misses, since the sequential execution will incur only 2 cache misses at each branch, by swapping mC+1 in and out, after it incurs C cache misses to load m1, m2, ..., mC at the first branch. Hence, when n = Ω(log t) and n = Ω(C), we get the bound stated in the theorem. □

2.5 Other Kinds of Structured Computations

It is natural to ask whether other kinds of structured computations can also achieve relatively good cache locality. We now consider two alternative kinds of restrictions.

2.5.1 Structured Local-Touch Computations

In this section, we prove that work-stealing parallel executions of structured local-touch computations also have relatively good cache locality, if the future thread is chosen to execute first at each fork.
This result, combined with Theorems 8 and 10, implies that work-stealing schedulers for structured computations are likely better off choosing future threads to execute first at forks.

Lemma 11 In the sequential execution of a structured local-touch computation where the future thread at a fork is always chosen to execute first, any touch x's future parent is executed before x's local parent, and the right child of any fork v immediately follows the last node of the future thread spawned at v, i.e., the future parent of the last touch of the future thread.

The proof is omitted because it is almost identical to that of Lemma 4. (We first consider a future thread whose touches are the "earliest" in the DAG, that is, no other touches are ancestors of them, and we can easily prove that the statement in Lemma 11 holds for those touches. Then, by the same induction proof as for Lemma 4, we can prove that the statement holds for all future threads' touches.)

Theorem 12 If the future thread at a fork is always chosen to execute first, then a parallel execution with work stealing incurs O(P T∞²) deviations and O(CP T∞²) additional cache misses in expectation on a structured local-touch computation.

Proof. Let v be a fork that spawns a future thread t. Now we consider a parallel execution. Let p be a processor that executes v and pushes the right child of v into its deque. Suppose the right child of v is not stolen. Now consider the subgraph G′ consisting of t and its descendant threads. Note that G′ itself is a structured computation DAG with the local-touch constraint. Now p starts executing G′. According to the local-touch constraint, the only nodes outside G′ that connect to the nodes in G′ are v and the touches of t, and v is the only node outside G′ that the nodes in G′ depend on. Now v has been executed and the touches of t are not ready to execute due to the right child of v. Hence, p is able to make a sequential execution on G′ without waiting for any node outside to be done or jumping to a node outside, as long as no one steals a node in G′ from p's deque. Since we assume the right child of v will not be stolen and any nodes in G′ can only be pushed into p's deque below the right child of v, no nodes in G′ can be stolen. Hence, G′ will be executed sequentially by p. Therefore, there are no deviations in G′. After p executes the last node in G′, which is the last node of t, p pops the right child of v to execute. Hence, the right child of v cannot be a deviation either, if it is not stolen. That is, those nodes can be deviations only if the right child of v is stolen. Since there are in expectation O(P T∞) steals in a parallel execution and each future thread has at most T∞ touches, the expected number of deviations is bounded by O(P T∞²) and the expected number of additional cache misses is bounded by O(CP T∞²). □

2.5.2 Structured Computations with Super Final Nodes

As discussed in Section 2.3, in languages such as Java, the program's main thread typically releases all resources at the end of an execution. To model this structure, we add an edge from the last node of each thread to the final node of the computation DAG. Thus, the final node becomes the only node with in-degree greater than 2. Since the final node is always the last to execute, simply adding those edges pointing to the final node into a DAG will not change the execution order of the nodes in the DAG.
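As a purely illustrative picture of this structure (not taken from the thesis itself), the short Java sketch below forks a task for its side effect only: the main thread never touches the task's result, and the only wait happens at the very end of the program, which plays the role of the super final node. The printed messages and the summation are made-up placeholders.

import java.util.concurrent.CompletableFuture;

// A future created only for its side-effect: the parent never touches its value,
// but the program's last step (the "super final node") waits for it to finish.
public class SuperFinalNodeExample {
    public static void main(String[] args) {
        CompletableFuture<Void> sideEffect =
            CompletableFuture.runAsync(() -> System.out.println("doing side-effect work"));

        // The main thread does its own work without ever touching 'sideEffect'.
        long sum = 0;
        for (int i = 1; i <= 1000; i++) sum += i;
        System.out.println("main thread result: " + sum);

        // The only wait on the side-effect future happens here, at the very end,
        // mirroring the edge from each thread's last node to the super final node.
        sideEffect.join();
    }
}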
It is easy to see that having such a super node will not change the upper bound on the cache overheads of the work-stealing parallel executions of a structured computation. For structured computations with super final nodes, it also makes sense to slightly relax the single-touch constraint as follows.

Definition 13 A structured single-touch computation with a super final node is one where each future thread t at a fork v has at least one and at most two touches: a descendant of v's right child and the super final node.

In such a computation, a future thread can have the super final node as its only touch. This structure corresponds to a program where one thread forks another thread to accomplish a side-effect instead of computing a value. The parent thread never touches the resulting future, but the computation as a whole cannot terminate until the forked thread completes its work.

Now we show that the parallel executions of structured single-touch computations with super final nodes also have relatively low cache overheads.

Lemma 14 In the sequential execution of a structured single-touch computation with a super final node, where the future thread at a fork is always chosen to execute first, any touch x's future parent is executed before x's local parent, and the right child u of any fork v immediately follows the last node of the future thread spawned at v, i.e., the future parent of the last touch of the future thread.

Lemma 15 Let t be the future thread at a fork v in a structured single-touch computation with a super final node. If a touch of t or v's right child u is a deviation, then either u is stolen or there is a touch by t which is a deviation.

Proof. The proofs of Lemma 4 and Lemma 7, with only minor modifications, also apply to the above two lemmas, respectively. That is because introducing the super final node into a computation does not affect the order in which other nodes are executed, since no other nodes need to wait for the super final node and the super final node is always the last node to execute. More specifically, when a processor executing any thread t reaches a node that is a parent of the super final node, the processor will continue to work on t if that node is not the last node of t, and otherwise it will try to pop a node out of its deque. Therefore, by the same proof techniques as for Lemmas 4 and 7, we can show that a processor will execute the right child u of a fork v and the parents of the touches of the future spawned at v in the order stated in Lemmas 14 and 15. □

Theorem 16 If, at each fork, the future thread is chosen to execute first, then a parallel execution with work stealing incurs O(P T∞²) deviations and O(CP T∞²) additional cache misses in expectation on a structured single-touch computation with a super final node.

Proof. The proof is similar to that of Theorem 8. The only difference is that if a touch by a thread t is a deviation, now the two touches of t can both be deviations, which could complicate the construction of the deviation chains. Fortunately, one of these two touches is the super final node, which is always the last node to execute and hence will not make the touches of other threads become deviations. Therefore, we can still get a unique deviation chain starting from a steal, and hence the proof of Theorem 8 still applies here. □

Similarly, we can also introduce a super final node to a structured local-touch computation, as follows.
Definition 17 A structured local-touch computation with a super final node is one where each future thread t spawned at a fork v can be touched only by the super final node and by t's parent thread, at nodes that are descendants of the right child of v.

It is obvious that, by the same proof as for Theorem 12, we can prove the following bounds.

Theorem 18 If the future thread at a fork is always chosen to execute first, then a parallel execution with work stealing incurs O(P T∞²) deviations and O(CP T∞²) additional cache misses in expectation on a structured local-touch computation with a super final node.

Chapter 3 Theoretical Analysis of Work Stealing on Hierarchical Platforms

In this chapter, we will discuss the performance of fork-join programs on hierarchical platforms. We present lower bounds for their executions by all work stealing algorithms, as well as upper bounds for specific work stealing variants, in a theoretical hierarchical model. We first discuss the related work in Section 3.1. Then, in Section 3.2, we introduce the theoretical model for hierarchical systems and the computation models for fork-join and divide-and-conquer programs, as well as their s-bounded subclasses. In Section 3.3, we define the class of work stealing algorithms and introduce the global work stealing algorithm with its attaching scheme. Section 3.4 presents upper and lower bounds of work stealing algorithms on general fork-join programs, while Section 3.5 shows their bounds on s-bounded fork-join and s-bounded divide-and-conquer programs.

3.1 Related Work

Work stealing [20] on traditional shared-memory systems has been an active research area for decades. On the theoretical side, its time and space bounds [15, 7, 19, 20] as well as its cache locality [1, 73, 47] for different parallel program models, such as fork-join parallelism (i.e., nested parallelism) [15] and future parallelism [41, 42], have been extensively studied.

To implement efficient work stealing algorithms on hierarchical platforms such as NUMA systems and distributed clusters, different heuristic strategies for choosing victim processors to steal from have been discussed. A heuristic that has proven effective is that processors should favor local steals when neighboring processors have enough work to be stolen, in order to reduce the number of costly remote steals. A commonly used method is to have a processor first make one or more steal attempts locally and then try remote steals only if all the local steal attempts have failed [62]. Another method is to have each processor make a local steal with a higher probability than a remote steal [70]. Other implementations [24, 66] only allow leaders of sockets/clusters to make remote steals. Researchers have also examined the appropriate amount of work a processor should steal. Many papers (e.g., [62, 65, 76]) suggested the StealHalf policy for remote steals, that is, a thief should steal half of the tasks from a remote processor's task pool, as opposed to stealing only one task as in the classical work stealing algorithm.

Work stealing is usually implemented using concurrent deques in traditional shared-memory systems. However, in some hierarchical platforms such as distributed clusters, concurrent deques for remote steals can be very inefficient or even impossible to implement. For this reason, researchers [29, 71, 78, 2, 57] have proposed work stealing variants with non-concurrent deques via message passing.
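To make the heuristics above concrete, here is a minimal, illustrative Java sketch of a victim-selection routine that tries a few local steals first and falls back to a remote StealHalf only when the local attempts fail. The class, field, and parameter names are assumptions for illustration and are not taken from any of the cited systems.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative victim selection: try a few local steals first, then steal half of a
// remote victim's tasks (the StealHalf policy mentioned above) if all local attempts fail.
final class HierarchicalStealPolicy<T> {
    private final List<Deque<T>> localPools;   // pools of processors in the same cluster
    private final List<Deque<T>> remotePools;  // pools of processors in other clusters
    private final int localAttempts;

    HierarchicalStealPolicy(List<Deque<T>> localPools, List<Deque<T>> remotePools, int localAttempts) {
        this.localPools = localPools;
        this.remotePools = remotePools;
        this.localAttempts = localAttempts;
    }

    // Returns the stolen tasks; the returned deque is empty if every attempt failed.
    Deque<T> steal() {
        Random rng = ThreadLocalRandom.current();
        Deque<T> loot = new ArrayDeque<>();
        for (int i = 0; i < localAttempts; i++) {            // cheap local steals first
            Deque<T> victim = localPools.get(rng.nextInt(localPools.size()));
            T task = victim.pollFirst();                     // classical single-task steal
            if (task != null) { loot.add(task); return loot; }
        }
        Deque<T> victim = remotePools.get(rng.nextInt(remotePools.size()));
        int half = victim.size() / 2;                        // StealHalf for the costly remote steal
        for (int i = 0; i < half; i++) {
            T task = victim.pollFirst();
            if (task == null) break;
            loot.add(task);
        }
        return loot;
    }
}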
In contrast to the large body of research on the systems side we just mentioned, the theoretical analysis of hierarchical work stealing has been much less discussed. The only theoretical bound we are aware of is given by Quintin and Wagner [70], on the time complexity of their hierarchical work stealing (HWS) algorithm. The leader processors of clusters in HWS first distribute tasks across the system, and then the worker processors in each cluster execute their local tasks. However, the time bound of HWS can be very bad in theory, since the workloads in different clusters can be highly unbalanced unless HWS knows the structure of the target program very well in advance. Therefore, the upper bound of HWS cannot help us figure out the real bounds of the class of all work stealing algorithms.

Some hierarchical cache models, such as the parallel cache-oblivious (PCO) model [14], the hierarchical multi-level multicore (HM) model [25], and the Threaded Many-core Memory (TMM) model [61], have been proposed. They focus on the cache locality of parallel programs, while the results of this paper focus more on the costs of remote communications caused by remote steals and joins. The performance of work stealing in these models has not been studied yet, and we consider it an interesting open problem.

3.2 Fork-Join Model on Hierarchical Systems

3.2.1 Fork-Join Model

Fork-join parallelism is also called nested parallelism. As in previous work (e.g., [1, 20]), we model a program as a directed acyclic graph (DAG), where each node represents a single instruction of the program that can be executed by a processor and each directed edge (u, v) indicates that v cannot be executed until u has been executed. Both the indegree and outdegree of a node in a DAG are at most two. There are three types of edges in the model—continuation edges, spawn edges, and join edges. At most one of the incoming edges and one of the outgoing edges of a node are continuation edges. A thread is represented as a maximal sequence of nodes connected by continuation edges in a DAG. Obviously, the nodes, i.e., the instructions, in a thread can only be executed sequentially.

Figure 3.1: The graph on the left-hand side shows a thread in the dashed box spawned by its parent thread at a fork. The graph on the right-hand side shows the join of a thread in the dashed box and its parent thread.

If a node has two outgoing edges, we call it a fork node. One outgoing edge of a fork is a continuation edge and the other is a spawn edge. A thread x spawns a new thread y at a fork node u in x: the spawn edge of u points to the first node of thread y (see Figure 3.1). We call x the parent thread of y, and y a child thread of x. For a better illustration, we always put the spawn edge of a fork to the left of its outgoing continuation edge in a figure in the paper. If a node has two incoming edges, we call it a join node, representing the join of a thread and its parent thread. (In the fork-join model, only a thread and its parent thread can join, but in the more general future-parallel model [41, 42], any two threads can join in principle.) One incoming edge (on the right-hand side in a figure) of a join is a continuation edge and the other (on the left-hand side) is a join edge (see Figure 3.1). The node that the join edge comes from is in the child thread. This node must be the last node of the child thread, since the node has no outgoing continuation edge. (If the node had an outgoing continuation edge, it would have to be a fork by definition.
However, a fork cannot have an outgoing join edge, a contradiction.) A node cannot be both a fork and a join, since it represents only a single instruction of a thread.

In a typical fork-join program in practice, when a thread wants to partition some computation into two parts to execute in parallel, it makes a fork operation, creating a child thread for one part and working on the other itself. When the thread finishes its work, it calls a join, retrieving the result of the child thread if the child thread has completed, and waiting for the child thread otherwise.

There is only one node of indegree 0 in a DAG, called the source node, and only one node of outdegree 0, called the sink node. The source node and the sink node are the first and the last nodes of the main thread, respectively. All the other threads are spawned directly or indirectly by the main thread (see Figure 3.2 for an example). It is easy to see that the source node is the first node to execute in the DAG and the sink node is the last.

Figure 3.2: A fork-join DAG, where the main thread is the rightmost directed path in the DAG.

Another requirement for fork-join DAGs is that if a thread spawns some child threads at different forks before joining any of them, these child threads must join in the reverse order. This constraint guarantees that the forks and joins of threads in a DAG are in a "nested form", and that is why fork-join programs are also called nested-parallel programs. (Another common way to create parallel threads is by calling a parallel-for loop, where each iteration of a for-loop in a parent thread creates a parallel child thread and the parent later makes a single join call to collect the results of all child threads. In fact, a parallel-for loop can also be thought of as a fork-join DAG, if we model the single join as a sequence of consecutive joins for those child threads, in reverse order.) As Acar et al. [1] pointed out, a fork-join DAG can also be defined by induction: if we only consider the nodes between a fork and its corresponding join, the two threads derived at the fork can be thought of as the main threads of two independent DAGs that eventually merge at the join.

Algorithm 1 Fibonacci(n)
  if n > 5 then
    fork f1 = Fibonacci(n-1);
    f2 = Fibonacci(n-2);
    join f1;
    return f1 + f2;
  else
    return Sequential Fibonacci(n);

Figure 3.3: The DAG of Fibonacci(7) of Algorithm 1. The main thread (i.e., the rightmost one) spawns a child thread x to compute Fibonacci(6) and then continues to compute Fibonacci(5) itself. Thread x spawns a child thread y of its own to compute Fibonacci(5) and then computes Fibonacci(4) itself. According to the pseudocode, Fibonacci(5) and Fibonacci(4) will be executed sequentially. Later, thread x sums up Fibonacci(4) and Fibonacci(5) after the join of x and y, and finally the main thread sums up Fibonacci(5) and Fibonacci(6) after the join of x and itself.

Algorithm 1 is the pseudocode for computing the nth Fibonacci number recursively, and Figure 3.3 illustrates how Fibonacci(7) is modeled as a fork-join DAG. We use T1 to denote the total number of nodes in a DAG and T∞ to denote the number of nodes in a longest directed path in the DAG. T∞ is called the critical path length (or the span) of the DAG. Since we will focus on the theoretical bounds of DAGs with different ratios of T1/T∞, we use G(T1, T∞) to denote a DAG of T1 nodes with critical path length T∞.
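For concreteness, the following is a minimal Java sketch of Algorithm 1 using the standard fork/join framework; fork() plays the role of a fork node and join() the role of the matching join node. The threshold of 5 and the sequential helper mirror the pseudocode, while the class layout is an illustrative assumption rather than part of the model.

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// A sketch of Algorithm 1 using Java's fork/join framework.
class Fibonacci extends RecursiveTask<Long> {
    private final int n;
    Fibonacci(int n) { this.n = n; }

    @Override
    protected Long compute() {
        if (n > 5) {
            Fibonacci f1 = new Fibonacci(n - 1);
            f1.fork();                                   // spawn a child thread for Fibonacci(n-1)
            long f2 = new Fibonacci(n - 2).compute();    // the parent computes Fibonacci(n-2) itself
            return f1.join() + f2;                       // join retrieves (or waits for) the child's result
        } else {
            return sequentialFibonacci(n);               // small inputs are computed sequentially
        }
    }

    private static long sequentialFibonacci(int n) {
        long a = 0, b = 1;
        for (int i = 0; i < n; i++) { long t = a + b; a = b; b = t; }
        return a;
    }

    public static void main(String[] args) {
        System.out.println(new ForkJoinPool().invoke(new Fibonacci(7)));
    }
}

Running it with n = 7 corresponds to the DAG sketched in Figure 3.3.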
A well-studied subset of fork-join programs are divide-and-conquer programs. In a divide-and-conquer program, a thread has no forks after the first join node in the thread. In other words, threads in a divide-and-conquer program first keep forking to compute things in parallel and later join in pairs recursively. Most fork-join programs discussed in the literature, such as the Merge-Sort algorithm, the N-Queens algorithm, and the Fibonacci algorithm shown in Figure 3.3, are in fact divide-and-conquer programs.

3.2.2 S-Bounded Fork-Join Programs

Prior research (e.g., [60]) has found it empirically beneficial to prevent a computation from splitting into tasks that are too small. This is because the extra parallelism provided by splitting a sequential computation into tiny parallel subroutines may not pay off the costs of spawning and joining threads, as well as the communication costs among processors, especially on a hierarchical system where remote operations and communications are expensive. For instance, it is common that a Merge-Sort program stops partitioning an array of numbers into smaller ones when the size of the array is a few kilobytes [70]. Another example is the recursive 8-Queens algorithm in [70]. The algorithm first spawns a thread for each feasible position of the queen on the first line. Then each thread spawns a new thread for each feasible position of the queen on the second line, and so on. When a thread reaches a feasible position of the queen on the fourth line, it stops splitting and the rest of the work is executed sequentially in the thread. To analyze the performance of work stealing on these algorithms, we define the class of s-bounded fork-join programs. We say a DAG is s-bounded if each thread in it has at least s nodes. We use G(T1, T∞, s) to denote an s-bounded DAG G(T1, T∞). We will show in Section 3.5 some upper and lower bounds for s-bounded fork-join programs and s-bounded divide-and-conquer programs.

3.2.3 Hierarchical System Model

The hierarchical system model we consider in the paper consists of k distributed, homogeneous clusters. Each cluster has n processors, and there are in total p = kn processors in the system. A processor can execute nodes in a DAG, one at a time, and communicate with other processors in order to distribute nodes across the system or share information. The cluster a processor belongs to is its local cluster, and the other clusters are remote clusters to it. Processors within a cluster are local to each other, while processors in different clusters are remote to each other. Communication is efficient in time within a cluster and inefficient across different clusters. More specifically, there is a global clock on a hierarchical system, which processors may or may not be aware of. A tick of the global clock is the minimum unit of time. The execution of a node in a DAG, as well as one round of one-way communication between two local processors, takes time 1, while one round of one-way communication between two remote processors, called a remote operation, takes time r. Our communication model can be thought of as a simplified version of the LogP model [27]. We keep the communication model abstract in order to cover both hierarchical shared-memory and message-passing systems. In a hierarchical shared-memory system, for instance, each cluster has its local shared memory that can be efficiently accessed by processors in the cluster: we can assume it takes a processor time 1 to make a read or write on the memory of the local cluster, while it takes time r to do so on the memory of a remote cluster.
In a hierarchical message-passing model, it takes time 1 for a message sent by a processor to arrive at a local processor, while it takes time r for a message to arrive at a remote processor.

For any load balancing algorithm executing a fork-join program, there are at least two scenarios where a remote operation has to be incurred. First, if a thread executed by a processor in a cluster Ca spawns a new thread at a fork and later a remote processor in another cluster Cb wants to take the spawned thread to execute, a remote operation must be made. That is because the information about the new thread has to be transferred to Cb, either by a message sent to Cb in the message-passing model, or by a remote read from Ca and/or a remote write to Cb in the shared-memory model. In either case, it takes at least time r. Second, if a thread and its parent thread are held in two different clusters, then at least one remote operation has to be made by the time the join node of the two threads is executed, so that a processor can combine them at the join and continue to execute the parent thread. This remote operation is usually made when one of the two threads arrives at the join, but in principle it can happen at any time before the join is executed.

3.3 Work Stealing on Hierarchical Systems

3.3.1 The Class of Work Stealing Algorithms

An algorithm is a work stealing algorithm if processors always follow the rules below:
• Each processor maintains its own pool to store threads it creates.
• A processor currently executing a thread will continue until it arrives at a fork or a join.
• When a processor executing thread x arrives at a fork that creates a new thread y, it puts either x or y in its pool and continues to execute the other.
• When a processor reaches a join, it executes the join if the join is ready to be executed, and otherwise tries to pick a thread from its pool to execute.
• If the pool of a processor is empty when the processor tries to take a thread from it, the processor "steals" some threads from the pool of another processor, executes one of them, and puts the others in its own pool.

A processor making a steal is a thief processor, and the processor a thief steals from is a victim processor. A steal is a local steal if the thief and the victim processor are in the same cluster, and otherwise it is a remote steal. A local steal takes time 1 to complete, while a remote steal takes at least time r. We assume programs are compute-intensive, but not very data-intensive, such that a remote steal of time Θ(r) is enough to transfer all necessary information of a thread from one cluster to another. In order to keep the class of work stealing algorithms as general as possible, we do not put any constraints on which thread a processor takes out from its pool to execute or which threads of which processor a thief processor steals. We will show lower bounds for the class of work stealing algorithms on s-bounded divide-and-conquer programs in Section 3.5. Keeping the class of work stealing algorithms general makes our lower bounds stronger.

3.3.2 Global Work Stealing Algorithm

The classical work stealing algorithm [20] has been proved to perform well in the traditional shared-memory model, both theoretically and empirically. However, little was known about its theoretical time bounds on hierarchical platforms. In this section, we introduce the global work stealing algorithm, a variant of the classical work stealing algorithm modified for the hierarchical system model.
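As a rough illustration of these rules, the Java sketch below shows the processor loop implied by Section 3.3.1, with the pool realized as a deque and the victim chosen uniformly at random over the whole system, anticipating the global algorithm described next. All names are illustrative assumptions, and the bookkeeping of forks, joins, and readiness is deliberately left abstract.

import java.util.Random;
import java.util.concurrent.ConcurrentLinkedDeque;

// A rough sketch of the processor loop implied by the rules in Section 3.3.1.
// A Runnable stands in for a ready thread of the DAG; tracking forks, joins,
// and readiness is left out of this sketch.
class Worker implements Runnable {
    private final ConcurrentLinkedDeque<Runnable> pool = new ConcurrentLinkedDeque<>();
    private final Worker[] all;              // every worker in the system
    private final Random rng = new Random();
    volatile boolean done = false;

    Worker(Worker[] all) { this.all = all; }

    void spawn(Runnable task) { pool.addLast(task); }   // at a fork, keep one branch in the local pool

    public void run() {
        while (!done) {
            Runnable task = pool.pollLast();             // take the most recently pushed thread
            if (task == null) {                          // pool empty: become a thief
                Worker victim = all[rng.nextInt(all.length)];  // uniformly random victim (possibly itself)
                task = victim.pool.pollFirst();          // steal from the top of the victim's pool
            }
            if (task != null) {
                task.run();                              // execute until the next fork or join
            }
        }
    }
}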
We will show how to modify the classical work stealing algorithm so that it has a good upper bound on the expected execution time for any fork-join program in the hierarchical system model. As in the classical work stealing algorithm, the pool of a processor to store threads is a double-ended queue (deque) in the global work stealing algorithm. When the processor executing thread x arrives at a fork u that creates a new thread y, it pushes either x or y into the bottom of its deque and continues to execute the other. When a processor needs to take a thread out of its deque, it always pops the first one from the bottom of the deque, i.e., the one pushed into the deque at the last fork. When a processor makes a steal, it chooses a processor in the whole system uniformly at random as the victim and steals only the first thread from the top of the deque of the victim.

As we discussed before, a remote steal incurs a remote operation. To be more concrete, we assume a remote steal takes time 2r to complete. In the first interval of time r, the thief makes a steal attempt (e.g., by sending a request message to the victim), during which the node to be stolen stays in the victim's deque. At the first time slot of the second interval of time r, the actual steal action is made. If there are multiple steal actions made at the same moment, an arbitrary one succeeds and the others fail. Whether or not a remote steal succeeds, it takes time r for the thief to get the result. Therefore, if the remote steal succeeds, the stolen node is in transit during the second interval of time r, and hence no processor can execute it until it arrives at the thief at the end of the interval.

The algorithm also needs to deal with joins carefully. When a processor in cluster Ca executes thread x and spawns thread y at a fork, some space in Ca (e.g., some memory location in Ca) is reserved for the join of x and y. When the processor executing x reaches the join, it first checks the reserved space to figure out whether y has already arrived at the join. If so, the processor retrieves the information of y and executes the join. Otherwise, it blocks x and puts the information of x in the reserved space. The processor executing y behaves similarly; the only difference is that the information of y to be written is the result (output) of the computation of y, as y is finished at the join. As we can see, if x and y are executed by two processors in different clusters, Ca is remote to at least one of those processors. We say a processor makes a remote join operation at the join if Ca is remote to it. Like a remote steal, a remote join operation takes time 2r (it could be up to 4r if the processor did not combine its operations; we keep it simple, as the constant factor does not affect our theoretical analysis).

In the classical work stealing algorithm, a steal only takes the thread to be stolen from the victim. Here we make an important change in the global work stealing algorithm: when a processor steals a thread x, it also steals the threads that have been attached to x. A thread y can be attached to x only in the following scenario: if a processor executing thread y arrives at the join of threads x and y, and finds that x has not reached the join yet, the processor pops a thread from the bottom of its deque as usual. If the thread popped out happens to be x, the processor attaches y to x, by attaching to x everything needed for the join of x and y. Since this is an operation within the same cluster, it takes time 1 in our theoretical model.
Now that y has been attached to x, the processor executing x can retrieve y locally when it later reaches the join of x and y, without the need to check the reserved space for the join, and hence it avoids the potential risk of making a remote join operation. This attaching scheme resembles the clone optimization [34] in the classic shared-memory setting: the high-level idea of the clone optimization is that, when thread x is completed while thread y has not been stolen yet, y can be modified to a fast version such that, when reaching the join of x and y, it can simply resolve the join using the result of x.

Now assume the global work stealing algorithm is work-first at forks. That is, a processor at a fork always puts the parent thread into its deque and continues to execute the child thread. Then we can prove that only the child thread can be attached to the parent thread in this case, not vice versa. (Lemma 3 in Arora et al.'s paper [7] implies that if the algorithm follows the work-first strategy, the first thread in the bottom of a processor's deque must be an ancestor thread of the thread the processor is executing. This rules out the possibility that the thread popped out from its deque is a child thread of the thread that just got blocked.) Therefore, when a thread is completed by a processor and the next thread popped out is the parent thread of the completed thread, the processor can just attach the result of the completed thread to its parent thread. Although the choice at a join does not affect the theoretical bounds in the paper, we think attaching only the result of a thread to its parent is easier to implement in practice.

The attaching scheme can largely reduce the number of remote join operations in an execution. Roughly speaking, without this trick, a processor can incur Θ(T∞) remote join operations after a single remote steal of a thread in the worst case. For instance, if all the Θ(T∞) threads spawned by thread x have already been executed by the time x is stolen by a remote processor, that remote processor has to make a remote join operation whenever it reaches the join of x and one of the Θ(T∞) threads. Figure 3.4 shows an example of a single remote steal incurring a sequence of remote joins. As we will show in Lemma 22, the attaching scheme helps us bound the number of remote join operations in an execution, which is critical in the proof of the upper bound on the algorithm's execution time.

3.4 Bounds for Fork-Join Programs

3.4.1 Lower Bound for All Algorithms

We now prove that the lower bounds in the following theorem hold for any load balancing algorithm on fork-join programs.

Theorem 19 Given any T1 and T∞, there is a fork-join DAG G(T1, T∞) that any algorithm takes time Ω(min{T1/n + T∞, T1/p + rT∞/log(nr)}) to execute on a hierarchical system.

Figure 3.4: A DAG where a single remote steal can incur a sequence of remote join operations. Suppose the processor executing the main thread always executes the child thread first after a fork. At each of the first three forks of the main thread, the processor spawns and quickly completes a child thread, and then gets back to the main thread. It finally arrives at the last fork, spawns a child thread, and is about to execute v, while u is pushed into its deque. Now a remote processor steals u and starts executing the main thread.
If v is executed before u, the remote processor will have to make a remote join operation at each of the four joins in the main thread, in order to retrieve the results of the four child threads. We can imagine that if there were a sequence of Θ(T∞) joins, Θ(T∞) remote join operations would have to be made.

Proof. Let us first discuss the case when T1/T∞ ≤ nr/(2 log(nr) − 1). We will prove the lower bound in this case is Ω(T1/n + T∞). Without loss of generality and for simplicity, assume T1/T∞ = (2^t + 2^(t−1) − 2)/(2t − 1) for some integer t. Note that 2^t ≤ nr. Consider a fork-join DAG G(T1, T∞) consisting of a sequence of "diamond" structures, as illustrated in Figure 3.5. The first half of each diamond is a complete binary tree of 2^t − 1 nodes and the second half is the reverse of the first half—nodes keep joining in pairs until it becomes a single node. Obviously, each diamond has 2^t + 2^(t−1) − 2 nodes and height h = 2t − 1. The DAG consists of T∞/h such diamonds.

Now consider an arbitrary algorithm that executes this DAG. Suppose the first node of a diamond is executed by a processor in a cluster at time t0. All other nodes of the diamond are descendants of its first node, so if there is no remote steal that successfully steals a node of the diamond from that cluster, then the diamond will be executed only by the n processors in the cluster. Thus, it takes time Ω(2^t/n + h) to complete the diamond. If one node of the diamond is stolen by a remote processor, it will take the algorithm at least time r to complete the diamond, as that is the cost of making a remote operation. Since 2^t/n ≤ r and h = 2t − 1 < r, we conclude that the execution time of the diamond is Ω(2^t/n + h). Since the T∞/h diamonds have to be executed sequentially, the execution time of the entire DAG is Ω((T∞/h) · (2^t/n + h)) = Ω(T1/n + T∞).

Figure 3.5: The graph on the left presents a diamond structure and the DAG on the right consists of a sequence of four diamonds.

Now we discuss the case when T1/T∞ > nr/(2 log(nr) − 1). The DAG G(T1, T∞) we want to construct is shown in Figure 3.6. The first node of the DAG forks into two parallel subgraphs that join at the end of the DAG. The left subgraph is a sequence of (T∞ − 2)/h diamonds, each having nr nodes, where h is the height of each diamond, and we know h = Θ(log(nr)) (again, without loss of generality and for simplicity, assume (T∞ − 2)/h is an integer and nr nodes are just enough to construct a diamond). The right subgraph consists of the remaining T1 − 2 − nr(T∞ − 2)/h nodes in an arbitrary form, with the only requirement that its critical path length is at most T∞ − 2. As we analyzed before, the execution time of the left subgraph is Ω(((T∞ − 2)/h) · (nr/n + h)) = Ω(rT∞/log(nr)). On the other hand, since there are p processors in the system, the execution time of the entire DAG is Ω(T1/p). Hence the lower bound on the execution time is Ω(T1/p + rT∞/log(nr)).

Now, combining the lower bounds in the two cases we have discussed above, we complete the proof. □

Figure 3.6: A DAG G(T1, T∞), where the subgraph on the left branch of the source node is a sequence of diamonds, and the subgraph on the right branch of the source node is an arbitrary DAG.

3.4.2 Upper Bounds of Two Work Stealing Algorithms

In this section, we will show that two work stealing algorithms, the local work stealing algorithm and the global work stealing algorithm, achieve good upper bounds for different ratios of T1/T∞.
In the local work stealing algorithm, a processor steals uniformly at random from a processor within the same cluster, and hence all steals are local. Since, in our theoretical model, a DAG is induced from a single source node held by a processor in a cluster at the start of an execution, the DAG will be executed only by the n processors in that cluster under the local work stealing algorithm. Therefore, the local work stealing algorithm is essentially the classical work stealing algorithm running in a single cluster. (Although in practice, different programs that have synchronization points with each other can initially be assigned to different clusters, and those programs together may be considered as a DAG.) Thus, the upper bound of the classical work stealing algorithm in the shared memory model by Arora et al. [7] immediately gives the following bound.

Claim 20 The local work stealing algorithm executes any DAG G(T1, T∞) in expected time O(T1/n + T∞) on a hierarchical system.

Comparing Claim 20 with Theorem 19, we know the local work stealing algorithm is optimal when T1/T∞ ≤ nr/log(nr).

Since most of the steals in the global work stealing algorithm are remote, each taking time 2r, one may think the algorithm should be quite inefficient for many programs. Surprisingly, the theorem below shows that the global work stealing algorithm actually achieves a very good upper bound when the workload of a program is heavy (i.e., when T1/T∞ is large).

Theorem 21 The global work stealing algorithm executes any DAG G(T1, T∞) in expected time O(T1/p + rT∞) on a hierarchical system.

It is easy to see that, when nr/log(nr) ≤ T1/T∞ ≤ pr, the global work stealing algorithm is optimal within a factor of Θ(log(nr)), and when T1/T∞ ≥ pr, the global work stealing algorithm is optimal. Intuitively, doing remote steals frequently can quickly balance the workload across the system. Also, when the workload is heavy, the cost of remote steals will be negligible compared to the cost of executing the program itself.

The rest of the section is the proof of Theorem 21. We start with the following lemma, which bounds the number of remote joins in an execution.

Lemma 22 In any execution of a fork-join program by the global work stealing algorithm on a hierarchical system, the number of remote join operations is at most twice the number of successful steals.

Proof. Suppose a thread x creates a new thread y at a fork u and the two threads will join at node v. Without loss of generality, assume the processor p0 currently executing x pushes x into the bottom of its deque and then starts executing y. Since it is a fork-join program, we know that the subgraph induced from y is a fork-join DAG, denoted by Gy, whose source and sink nodes are u and v, respectively. Assume x is not stolen by any other processor. If p0 needs to push nodes at forks in Gy into its deque, the nodes will be put below x in its deque. Since other processors steal nodes from the top of the deque, no nodes in Gy can be stolen, and hence p0 will execute all nodes in Gy and finally complete y. Now p0 will immediately pop x from its deque and attach the result of y to x. Therefore, if x is not stolen, neither of the two join operations for v is remote. In other words, the two join operations for v can be remote only if x is stolen. Therefore, the number of remote join operations is at most twice the number of successful steals. □
Lemma 23 In any execution of G(T1, T∞) by the global work stealing algorithm on a hierarchical system, the expected number of steals is O(pT∞).

Proof. (Sketch) The proof is a variant of the well-known proof (of Corollary 4 and Lemmas 6–8) in Arora et al.'s paper [7]. The main idea of the proof in Arora et al.'s paper [7] is to analyze the decrease of a potential Φ, which is 3^{Θ(T∞)} at the beginning of an execution. The decrease of Φ can be caused by making steals and executing nodes. The proof by Arora et al. shows that Φ decreases by a constant factor with a constant probability after a period in which p steals are made. Therefore, Φ becomes 0, which indicates the end of the execution, after O(pT∞) steals in expectation.

We only have to make one change to Arora et al.'s proof in order to apply it to the global work stealing algorithm on a hierarchical system: we show that Φ decreases by a constant factor with a constant probability after a period in which 2p remote steals are made (instead of p steals). The idea behind it is that each remote join operation takes time 2r on a hierarchical system, and hence the decrease of Φ caused by a remote join operation takes effect within time 2r. Since each remote steal also takes time 2r, we know that after 2p remote steals have been made, all the remote joins that occurred during the period in which the first p of the 2p remote steals were done must have been completed. The rest of the proof is identical to those of Corollary 4 and Lemmas 6–8 in Arora et al.'s paper [7] and hence we omit it. We complete the proof by observing that the total number of steals has the same bound, as in expectation the number of remote steals is a (k − 1)/k fraction of the total number of steals. □

Now we are ready to prove Theorem 21.

Proof of Theorem 21. Processors in a work stealing algorithm only have three kinds of actions: executing a node, making a steal, and doing a remote join operation. Suppose an execution finishes in time T. Thus, we know that pT ≤ T1 + 2r · (Ns + Nj), and hence T ≤ T1/p + 2r(Ns + Nj)/p, where Ns is the number of steals and Nj is the number of remote join operations. By Lemmas 22 and 23, Ns = O(pT∞) in expectation and Nj ≤ 2Ns. Therefore T = O(T1/p + rT∞) in expectation. □

3.5 Bounds for s-Bounded Programs

3.5.1 Lower Bound for Work Stealing Algorithms

We now discuss the bounds for work stealing on s-bounded programs, which we believe are the programs encountered most often in practice. Theorem 24 below shows that no work stealing algorithm can beat the lower bound in Theorem 19 even for s-bounded divide-and-conquer algorithms, a subclass of s-bounded fork-join programs.

Theorem 24 Given any T1 ≥ 8nr and T∞, and any positive integer s = O(r), there is a divide-and-conquer DAG G(T1, T∞, s) that any work stealing algorithm takes expected time Ω(min{T1/n + T∞, T1/p + rT∞/log(nr/s)}) to execute on a hierarchical system.

Proof. When T1 = O(nT∞), T1/n + T∞ = O(T∞). Therefore the bound Ω(T1/n + T∞) holds in this case, as Ω(T∞) is a lower bound for any DAG whose critical path length is T∞. The rest of the proof will focus on the case when T1 > nT∞. For simplicity, we will hide some cumbersome constants in the proof and use asymptotic notation (usually the Θ notation) instead, as the lower bounds we want to prove are not affected this way and our proof is essentially similar to that of Theorem 19. The main idea is to construct a series of "hexagon" structures recursively.
As shown in Figure 3.7, the first part of a hexagon is a complete binary tree of max{4nr/s, 4n} leaves and of height Θ(log(nr/s)). For simplicity, we assume s ≤ r and hence there are 4nr/s leaves. (When s > r, the proof is almost identical and hence we omit it.) All the 4nr/s leaves each spawn a sequence of s nodes, representing 4nr/s threads each having s nodes to execute after their last forks. Finally, the threads join in pairs recursively and eventually join into a single node.

Figure 3.7: A hexagon where each thread has s = 4 sequential nodes between the last level of forks and the first level of joins.

The series of hexagons we want to construct is illustrated in Figure 3.8. We choose one of the 4nr/s threads in the first hexagon uniformly at random, and replace its s nodes between the last fork and the first join with another hexagon. Recursively, we replace the s sequential nodes of a random thread in the ith hexagon with the (i + 1)th hexagon, until we reach the last hexagon we want to create. If there are t hexagons, it is not hard to see that the whole structure has St = Θ(tnr) nodes and height Ht = Θ(t log(nr/s)), for t ≥ s/log(nr/s).

Now consider the case when 8nT∞ < T1 ≤ nrT∞/log(nr/s). Suppose T1 = Θ(dnT∞/log(nr/s)), for some integer d such that log(nr/s) < d < r. Let t be the largest integer such that St < T1, where, as we defined before, St is the number of nodes in a series of t hexagons. The DAG we construct is a series of t hexagons, with the last node of the first hexagon followed by an arbitrary DAG G′(T1 − St, T∞ − Ht) in order to make the entire DAG a G(T1, T∞, s). It is not hard to prove that St = Θ(T1) and t = Θ(dT∞/(r log(nr/s))), when T1 > 8nr (so that there are enough nodes to construct a DAG with at least one hexagon).

We will now prove that the expected execution time of the t hexagons by any work stealing algorithm is Ω(dT∞/log(nr/s)), and hence the bound Ω(T1/n + T∞) = Ω(dT∞/log(nr/s)) holds in this case. It suffices to prove that, after reaching the first node of a hexagon, it takes any algorithm expected time Ω(r) to reach the next hexagon.

Figure 3.8: The first two hexagons in a series of hexagons. The s sequential nodes of a randomly chosen thread in the first hexagon (on the left-hand side) are replaced by the second hexagon.

Suppose the first node of a hexagon is about to be executed by a processor in a cluster. If a remote processor steals a node that is an ancestor of the next hexagon, it is obvious that it takes at least time r for that node to get to the remote processor, and hence time Ω(r) to reach the next hexagon in this case. Since there are 4nr/s threads in the hexagon and those threads are identical before the last layer of forks, we can conclude that if the first node of the next hexagon has not been reached and remote processors have stolen nodes that will spawn more than 2nr/s threads of the current hexagon, then with probability at least 1/2 the thread spawning the next hexagon is among those stolen threads. Hence the lower bound of expected time Ω(r) holds in this case. On the other hand, suppose remote processors only steal nodes that will spawn no more than 2nr/s threads and the one spawning the next hexagon remains in the cluster. Now the n processors in the cluster will have to find the thread containing the next hexagon out of more than 2nr/s seemingly identical threads.
Note that if a processor in a work stealing algorithm reaches the first node of a thread after the last fork, it won't do anything else until it finishes all the s nodes. Therefore, with probability at least 1/2, the first nr/s threads whose nodes after the last layer of forks are executed do not include the one spawning the next hexagon, and the execution time is already Ω(((nr/s) · s)/n) = Ω(r). This completes the proof for the case when 8nT∞ < T1 ≤ nrT∞/log(nr/s).

Now we prove the bound Ω(T1/p + rT∞/log(nr/s)) for the case when T1 > nrT∞/log(nr/s). Since Ω(T1/p) always holds for any system of p processors, we will prove Ω(rT∞/log(nr/s)). The proof is similar to that of Theorem 19 and hence we keep it brief. The DAG G(T1, T∞, s) we want to construct is similar to the one in Figure 3.6: a single node forking into two subgraphs that eventually join. The left subgraph is a series of Θ(T∞/log(nr/s)) hexagons as explained above, and the right subgraph contains the rest of the nodes in an arbitrary form. As we proved before, it takes any work stealing algorithm expected time Ω(r · T∞/log(nr/s)) to execute the left subgraph, which completes the proof. □

3.5.2 Unbalanced Work Stealing and Its Upper Bound

We say a work stealing algorithm is an unbalanced work stealing algorithm if, in any execution, 1) each processor pushes and pops threads through the bottom of its deque, 2) each local steal of a processor chooses a victim processor uniformly at random from the same cluster, 3) each remote steal of a processor chooses a victim processor uniformly at random from other clusters, 4) each steal takes only the first thread from the top of the victim's deque, 5) after a processor makes a remote steal, its next steal is a local steal with probability at least c, for a constant c, and 6) Nrem ≤ Nloc ≤ λ · Nrem, where λ = Θ(r), Nrem is the number of remote steals, and Nloc is the number of local steals in the execution.

The first three rules are standard in most work stealing algorithms in the literature. Rule 4 is also the strategy many existing algorithms use; in some other algorithms, however, a thief processor can steal multiple threads (e.g., half of the threads) from a deque. Rule 5 implies that after making a remote steal, a processor is likely to make a local steal. Rule 6 means intuitively that the algorithm has a bias towards local steals while still making enough remote steals.

The most common heuristic we find in the existing work stealing algorithms for hierarchical systems is to have each processor make more local steals than remote steals, hoping that this can reduce the number of costly remote operations while most processors can have enough work to do by only balancing workloads locally most of the time [62]. Algorithms using this heuristic usually behave like unbalanced work stealing algorithms, at least when running on unbalanced programs in which workloads cannot stay evenly distributed among clusters for long without remote steals. Since unbalanced programs are usually the ones on which algorithms perform worst in practice, we believe we can capture the performance bounds of many work stealing algorithms using this heuristic by proving bounds for the class of unbalanced work stealing algorithms. When the workload is light, the local work stealing algorithm has an optimal upper bound, as we showed before. Since algorithms with this heuristic favor local steals, they should work well in this case.
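To make rules 5 and 6 concrete, the sketch below shows one way a thief could decide whether its next steal attempt is local or remote. It is only an illustration of the bias the rules describe: the class, the counters, and the exact decision rule are ours, not taken from this thesis or from any algorithm in the literature, and the constants c and λ are assumed to be given.

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative steal policy biased towards local steals, in the spirit of an
// "unbalanced" work stealing algorithm (rules 5 and 6 above).
final class UnbalancedStealPolicy {
    final double c;          // probability of a local steal right after a remote one (rule 5)
    final long lambda;       // bias factor, lambda = Theta(r) (rule 6)
    long localSteals = 0, remoteSteals = 0;
    boolean lastStealWasRemote = false;

    UnbalancedStealPolicy(double c, long lambda) { this.c = c; this.lambda = lambda; }

    // Decide whether the next steal attempt should target the thief's own cluster.
    boolean nextStealIsLocal() {
        if (lastStealWasRemote && ThreadLocalRandom.current().nextDouble() < c)
            return true;                              // rule 5: likely local after a remote steal
        // Keep the local-to-remote ratio at most lambda (rule 6): stay local until
        // the ratio is reached, then make a remote steal.
        return localSteals < lambda * Math.max(1, remoteSteals);
    }

    // Bookkeeping after a steal attempt has been made.
    void recordSteal(boolean local) {
        if (local) localSteals++; else remoteSteals++;
        lastStealWasRemote = !local;
    }
}
```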
Theorem 25 indicates that algorithms using this heuristic are likely to have good performance guarantees for Ω(r)-bounded fork-join programs, when the workload is heavy (i.e., T1/T∞ is large) and each thread contains a relatively large amount of work.

Theorem 25 An unbalanced work stealing algorithm executes any s-bounded DAG G(T1, T∞, s) in expected time O(T1/p + rT∞) on a hierarchical system, for any s = Ω(r).

The proof of Theorem 25 is based on the following lemma.

Lemma 26 In any execution of a DAG G(T1, T∞, s) by any work stealing algorithm on a hierarchical system, the number of joins is no more than T1/s.

Proof. The proof is simple. Since each thread in an s-bounded DAG G(T1, T∞, s) has at least s nodes and there are in total T1 nodes in the DAG, the number of threads is at most T1/s. Since a join in a fork-join DAG is the end of a thread, the number of joins is bounded by the number of threads in the DAG and hence is at most T1/s. □

Proof of Theorem 25. By the definition of unbalanced work stealing algorithms, after a processor makes a remote steal, its next steal is a local steal with probability at least c, for some constant c. Suppose a processor makes m remote steals in an execution. We call each odd-numbered one of these m remote steals and the steal of the processor after it (which is a local steal with probability at least c) a steal couple. Thus, the processor has ⌈m/2⌉ steal couples and they don't overlap. It is easy to see that by making a steal couple, the processor steals from each other processor in the system with probability at least min{1/(n(k − 1)), c/(n − 1)} ≥ c/p. Therefore, by the same proof as that of Lemma 23, we can prove that the expected number of steal couples by all processors in an execution is O(pT∞). Hence, the expected number of remote steals in an execution is also O(pT∞).

The rest of the proof is similar to that of Theorem 21. The execution time of an unbalanced work stealing algorithm is T ≤ T1/p + (2r(Nrem + Nj) + Nloc)/p, where Nrem is the number of remote steals, Nloc is the number of local steals, and Nj is the number of remote join operations. Since Nrem = O(pT∞) in expectation, Nloc = O(rNrem) by definition, and Nj = O(T1/r) by Lemma 26 and the assumption s = Ω(r), we have T = O(T1/p + rT∞) in expectation. □

As a byproduct, the following theorem shows that the global work stealing algorithm, even without the attaching scheme, has the same upper bound for Ω(r)-bounded fork-join programs.

Theorem 27 The global work stealing algorithm, even without the attaching scheme, executes any DAG G(T1, T∞, s) in expected time O(T1/p + rT∞) on a hierarchical system, for any s = Ω(r).

Proof. The proof is similar to that of Theorem 21. Lemma 26 indicates that any work stealing algorithm incurs only O(T1/r) remote join operations in an execution of a DAG G(T1, T∞, s), when s = Ω(r). Also note that Lemma 23 still holds for s-bounded DAGs. The execution time of the global work stealing algorithm without the attaching scheme is T ≤ T1/p + 2r(Ns + Nj)/p, where Ns is the number of steals and Nj is the number of remote join operations. Since Ns = O(pT∞) in expectation and Nj = O(T1/r), we have T = O(T1/p + rT∞) in expectation. □

Chapter 4 Work Stealing for Linearizable Futures

This chapter presents our work on linearizable futures, a new type of futures invented recently [53] to handle functions with side effects, especially method calls to long-lived shared data structures.
We propose a new program model, called the linearizable-futures model, that supports both futures without side-effects, which we call normal futures, and futures that exist for their side-effects, which we call linearizable futures. We use this model to propose a novel work-stealing scheduler that facilitates the kind of combining and elimination optimizations supported by linearizable futures.

The rest of the chapter is organized as follows. Section 4.1 discusses related work. In Section 4.2, we explain our new program model, the linearizable-futures model, in detail. In Section 4.3, we show how to modify work stealing for the linearizable-futures model, in order to make good use of combining and elimination optimizations. Finally, we prove in Section 4.4 that the modified work stealing, combined with combining and elimination, achieves very good bounds on its execution times on programs in the linearizable-futures model.

4.1 Related Work

Futures were first proposed by Halstead [41, 42] and have been well studied since then (e.g., [55, 8, 58, 31, 16, 77, 32]), sometimes under different names. They are a flexible way to structure parallel programs, as the future-parallel model is a generalization of the widely used fork-join model (also called the nested-parallel model) [13, 15, 17]. Futures are well suited to dynamic load-balancing techniques such as the popular work stealing scheduler [20]. Arora et al. [7] proved that (parsimonious) work stealing achieves the asymptotically best speedup for the parallel execution of a program in the future-parallel model. If a program in the future-parallel model is written in a natural, structured way [47], its parallel execution using work stealing also guarantees good cache locality, much better than a worst-case unstructured program [1, 73].

Kogan and Herlihy [53] recently proposed futures for side-effects, intended to facilitate batching optimizations such as combining [39, 37, 46] and elimination [45, 63, 72]. An operation on a shared data structure immediately returns a future, and a later touch will retrieve the return value of the operation, as soon as that value is ready. Since operations on shared objects have side-effects, it is necessary to specify when these side-effects might be observed. Three alternative correctness conditions were proposed: strong, medium, and weak futures linearizability. These conditions should be thought of as extensions to linearizability [49], a widely used correctness condition in distributed and parallel computing. Here, we restrict our attention to medium futures linearizability, arguably the most useful of the futures-based extensions to linearizability. For brevity, futures satisfying this condition are called linearizable futures here.

4.2 Linearizable-Futures Model

4.2.1 Normal Futures and Linearizable Futures

As noted, futures [41, 42] are an attractive way to structure parallel computation. When a thread creates an expression (usually a method call) with a keyword future, a new thread, called a future thread or a future for short, is spawned to compute that expression in parallel with the thread that created it. When a thread needs the result of that expression, it applies a touch operation to that future. If the result is ready, it is returned to the touching thread; otherwise the touching thread blocks until the result becomes ready (see Figure 4.1 for an example). A future thread itself can even create new futures.

Figure 4.1: Pseudocode for computing the sum of the third and the fifth Fibonacci numbers using futures.
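The pseudocode of Figure 4.1 is not reproduced here, but the same computation can be sketched with the futures of a standard library. The snippet below is a rough Java analogue, not the thesis's pseudocode; the choice of executor and the 0-indexed Fibonacci convention are our assumptions.

```java
import java.util.concurrent.*;

public class FibFutures {
    static final ExecutorService pool = Executors.newWorkStealingPool();

    // Plain sequential Fibonacci with fib(0) = 0 and fib(1) = 1.
    static int fib(int n) { return n < 2 ? n : fib(n - 1) + fib(n - 2); }

    public static void main(String[] args) throws Exception {
        // "future": spawn the two computations so they can run in parallel.
        Future<Integer> f3 = pool.submit(() -> fib(3));
        Future<Integer> f5 = pool.submit(() -> fib(5));
        // "touch": block until each result is ready, then combine them.
        int sum = f3.get() + f5.get();
        System.out.println(sum);   // 2 + 5 = 7 under this indexing
        pool.shutdown();
    }
}
```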
In prior models, futures are deterministic in a program (with a given input): the sequence of a future thread's instructions is fixed in all executions, no matter how those instructions are interleaved with other threads. Once we introduce threads with asynchronous side-effects, we can no longer guarantee determinism. Instead, we allow future threads to be regular: in any execution of a program, (1) the number of a regular thread t's instructions is fixed, (2) the futures and touches t creates, as well as their positions in t, are fixed,[1] and (3) the instructions between two successive touch operations in t take effect as if they were executed atomically at the moment the first of the two touches is executed.[2] We call futures that are regular normal futures, to distinguish them from linearizable futures, introduced next.

[1] In fact, the number of instructions of a thread is allowed to vary within a constant factor, and the positions of futures and touches can move within a constant number of instructions, as long as their order is preserved. We ignore these variations, as they are negligible in our theoretical performance analysis, and do not seem useful in practice.

[2] Intuitively, the instructions between two successive touches are required to be side-effect free. We define it this way because, in our new program model, introduced later, only the touch operations in a regular thread can return data from shared objects, causing nondeterminism in the thread.

Kogan and Herlihy [53] recently proposed a new way of using futures to encapsulate operations on shared mutable data structures. When an operation is applied to a shared data structure, it immediately returns a future. Touching that future later returns the operation's result, along with implicit confirmation that the operation has taken effect. We call futures for operations on shared objects linearizable futures. A linearizable future for an operation on a shared object o is also called a linearizable future on o, for short.

In the presence of concurrency, it is important to specify when a future-returning operation can take effect. Three alternative correctness conditions have been proposed: strong, medium, and weak futures linearizability [53]. In this paper, we focus on medium futures linearizability, which appears to be the most useful. Medium futures linearizability requires that (1) the operation associated with a linearizable future should appear to take effect at some instant between when that future is created and when it is touched, and (2) linearizable futures created by the same thread on the same object should take effect in the same order as their creation operations. Informally, these conditions mean that (1) the creation of a linearizable future and its matching touch fill the roles of operation invocation and response in the classic linearizability condition, and (2) the order of operations called by a thread on the same shared object is preserved.

4.2.2 The Model

In this paper, we propose a new parallel program model, called the linearizable-futures model, that exploits both normal futures and linearizable futures in a structured way. Suppose there are multiple main threads (which represent multiple independent programs in practice) in a program, each having its own side-effect free computation. The main threads also need to communicate with each other by accessing shared objects in the shared memory. As we can imagine, an efficient way to run this program is to have each main thread spawn "worker threads" to execute its local, computationally heavy work in parallel, and also have each main thread create linearizable futures to apply operations to shared objects in order to communicate with other main threads. Our linearizable-futures model captures this scenario.

More specifically, a program in the linearizable-futures model consists of some independent main threads and their descendant threads, which are normal futures and linearizable futures.
The main threads and the normal futures are regular. A main thread can spawn and touch both normal futures and its own linearizable futures. A normal future thread, which is spawned directly or indirectly by a main thread, can only create and touch normal futures, but not linearizable futures, as linearizable futures are designed for operations on shared objects by main threads. A linearizable future thread can neither create nor touch any type of futures, since in practice an operation on a shared object is executed by a single thread. We also require each future to be touched only once (note that the thread touching a normal future doesn't have to be the one that created that future). This requirement is beneficial in practice: it makes implementations of futures and touches simpler and lowers their overheads [16], and, together with another practical constraint proposed in the well-structured future-parallel model in [47], it guarantees good cache locality bounds for parallel executions by the work stealing scheduler.

We believe that writing a parallel program in the linearizable-futures model is easy and natural: when we want to execute some side-effect free computation in a thread in parallel, or when we want to split a thread into two or more to run different parts of a program later, we create a normal future in that thread, and we may have the newly spawned normal future thread create further normal futures, if necessary; when a main thread needs to exchange information with other main threads, we have it create a linearizable future to apply an operation to a shared object, which is the standard way of communication in shared-memory systems. Unlike the previous future-parallel models, our model supports nondeterminism by the use of linearizable futures, with a clear correctness condition, medium futures linearizability. Although the requirement that main threads and normal futures must be regular limits the degree of nondeterminism, in return it prevents a program from becoming notoriously hard to reason about and debug, by forcing nondeterminism to be caused only by single-purpose linearizable future threads and by keeping the structure of the program unchanged in all parallel executions.

4.2.3 Computation DAG

For theoretical analysis, an execution of a program is modeled as a directed acyclic graph (DAG). A node in a DAG represents an instruction, and a directed edge (u, v) represents the dependency constraint that v must be executed after u. Therefore, a node is able to execute when all its parent nodes have been executed. A node that creates a future is called a fork and a node that touches a future is called a touch. The most common edges in a DAG are continuation edges, which point from one node to the next in the same thread. Thus, a thread is a maximal chain of nodes connected by continuation edges.
There are three other types of edges (Figure 4.2 shows a DAG consisting of edges of all these types):

• future edges, which point from a fork u to the first node of the future thread spawned by u.

• touch edges, which point from the last node of a future thread f to a touch v in another thread that touches the future computed by f.

• order edges, which point from the last node of a linearizable future f1 on object o created by a thread t to the first node of the next linearizable future f2 on o created by t, indicating that f2 cannot be executed until f1 has finished, according to medium futures linearizability.

Figure 4.2: A main thread t spawns two linearizable futures f1 and f2, both on object o, at forks v1 and v2 respectively. v3 and v4 are touches of f1 and f2 respectively. By definition, (v1, v5) and (v2, v7) are future edges. (v6, v4) and (v8, v3) are touch edges. (v6, v7) is an order edge. All the other edges are continuation edges.

A DAG consists of some main threads that don't have dependency constraints on each other, and some normal futures and linearizable futures spawned directly or indirectly by them. Since each node is a single instruction, it cannot spawn multiple futures or touch multiple futures. Combining this with the fact that each future can be touched only once, we can conclude that each node has in-degree and out-degree either 1 or 2, except that the first and the last nodes of each main thread have in-degree 0 and out-degree 0, respectively. A critical path of a DAG is a longest directed path in the DAG, and the length (i.e., the number of nodes) of a critical path is sometimes called the computation span of the DAG. In the next subsection, we will show how to generalize the DAG model to represent programs that support combining and elimination optimizations.

4.2.4 Combining and Elimination

Kogan and Herlihy [53] show that if shared object operations are created as futures, some powerful optimizations, such as combining and elimination, can be applied to a program. Since a linearizable future doesn't have to be executed immediately after its creation, a clever scheduler can delay its execution until it can be done efficiently. For instance, consider a shared queue implemented as a linked list and k enqueue operations created as k linearizable futures by a thread. If a thread can execute those k enqueue operations together, it can locally link them into a list in the same order as they were created, and then append that list to the linked list of the shared queue with only one CAS operation. Since local operations are usually much faster than operations on a shared object, this combining optimization will largely boost the performance of the execution, compared to simply enqueuing k elements one by one (a sketch of this single-CAS append appears below). Similarly, if a push operation and a pop operation on the same shared stack object can be combined locally, they can be canceled out by an elimination optimization, without even accessing the shared stack.

As the main results of the paper, we will show in Section 4.4 that the work stealing scheduler, with some modifications, can make good use of combining and elimination optimizations in our model. As Kogan and Herlihy did in [53], we only consider combining and eliminating linearizable futures on the same object created by the same thread. Modeling and analyzing the performance of combining and eliminating linearizable futures created by different threads is beyond the scope of the paper, and we consider it an interesting open question for future work.
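The queue example above can be made concrete with a small sketch: a batch of enqueue operations is first linked together locally and then published to the shared queue with a single CAS. This is only an illustration of the combining idea, not an implementation from this thesis or from [53]; the dequeue side, and the brief window in which the newly appended chain is not yet reachable from its predecessor, are ignored for brevity.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of a shared queue kept as a singly linked list, where a batch of k
// enqueues that were combined locally is appended with a single CAS on the tail.
class CombinedQueue<T> {
    static final class Node<T> {
        final T value;
        volatile Node<T> next;
        Node(T value) { this.value = value; }
    }

    // Sentinel node; 'tail' always points at the last node that has been published.
    private final AtomicReference<Node<T>> tail = new AtomicReference<>(new Node<>(null));

    // Combine k enqueue "futures" locally into one private chain, then publish
    // the whole chain with one successful CAS.
    void enqueueAll(Iterable<T> batch) {
        Node<T> first = null, last = null;
        for (T v : batch) {                        // local work: build the chain privately
            Node<T> node = new Node<>(v);
            if (first == null) first = node; else last.next = node;
            last = node;
        }
        if (first == null) return;                 // empty batch
        Node<T> prev;
        do {
            prev = tail.get();
        } while (!tail.compareAndSet(prev, last)); // one (retried) CAS publishes the batch
        prev.next = first;                         // link the old tail to the new chain
    }
}
```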
Since we focus on the analysis of "local" combining and elimination, we make a simplifying assumption in our model: the length of the thread that groups some linearizable futures to execute together (with possible combining and elimination) is fixed, regardless of the execution of any other part of the program. For instance, let f0, f1, f2 and f3 be the linearizable futures on object o created by the same thread t. Grouping f1 and f2 to execute together will replace these two future threads by a new thread g(f1 ∪ f2) in the DAG, as illustrated in Figure 4.3, and the length of g(f1 ∪ f2) is fixed in all executions where f1 and f2 are grouped with no other futures. Since we cannot group f1 and f2 until f2 is created, g(f1 ∪ f2) is spawned at the fork of f2. The touch of g(f1 ∪ f2) is the same as the earlier one of the touches of f1 and f2, which is the touch of f2 in this case. Because of medium futures linearizability, there is an order edge from the last node of f0 to the first node of g(f1 ∪ f2), and an order edge from the last node of g(f1 ∪ f2) to the first node of f3. Note that only consecutive linearizable futures can be grouped, according to medium futures linearizability (e.g., if f0 and f2 are grouped, then executing f1 either before or after g(f0 ∪ f2) violates medium futures linearizability).

Formally, let f0, f1, ..., fn denote all the linearizable futures on object o created by thread t in a program, where fi is created before fi+1 for all 0 ≤ i ≤ n − 1. If a thread, denoted by g(fi ∪ fi+1 ∪ ... ∪ fj), takes fi, fi+1, ..., fj to execute together in an execution, for some 0 ≤ i ≤ j ≤ n, we say g(fi ∪ fi+1 ∪ ... ∪ fj) is a future group (or a group for short) in that execution. In the DAG of that execution, future threads fi, fi+1, ..., fj are replaced by thread g(fi ∪ fi+1 ∪ ... ∪ fj), where g(fi ∪ fi+1 ∪ ... ∪ fj) is created at the fork that creates fj and touched by the earliest one among the touches of fi, fi+1, ..., fj. If i > 0, there is an order edge from the last node of the group fi−1 is in to the first node of g(fi ∪ fi+1 ∪ ... ∪ fj). If j < n, there is an order edge from the last node of g(fi ∪ fi+1 ∪ ... ∪ fj) to the first node of the group fj+1 is in.

Figure 4.3: The DAG on the left shows an "original" execution of the program, where each linearizable future fi is executed solely, for any 0 ≤ i ≤ 3. The DAG on the right shows the DAG of an execution where f1 and f2 are grouped.

If a linearizable future fk is executed solely in an execution, we still consider it as a future group g(fk), in order to keep notation consistent. Thus, all linearizable futures are replaced by future groups in the DAG of an execution. When we say the fork (resp. the touch) of a linearizable future f in a DAG, we refer to the node where f is created (resp. touched) in the original program, although that node is not necessarily the fork (resp. the touch) of a future group in the DAG. When we say the DAG of a program, we refer to the DAG of an execution of the program where each linearizable future is executed solely as a group.

Let T(g) denote the number of nodes of group g, i.e., the length or the execution time of g. If both combining and elimination can be applied to the program whenever possible, we assume T(g(fi ∪ fi+1 ∪ ... ∪ fj)) is fixed in any execution of the program and satisfies the inequalities of combining and elimination: 0 ≤ T(g(fi ∪ fi+1 ∪ ... ∪ fj)) ≤ T(g1) + T(g2) + ...
+ T(gk), where g1, g2, ..., gk are a partition of fi, fi+1, ..., fj, i.e., future groups that execute disjoint subsets of fi, fi+1, ..., fj and collectively execute all of fi, fi+1, ..., fj, such that if fa is in gc and fb is in gd for some i ≤ a < b ≤ j, then 1 ≤ c ≤ d ≤ k.

The two bounds in the inequalities of combining and elimination show that (1) in the best case, where fi, fi+1, ..., fj can be entirely canceled out, no work is needed, (2) executing fi, fi+1, ..., fj together in a single group won't be less efficient than splitting them into subgroups to execute one by one (because a group can always simulate the execution of its subgroups if that is the most efficient way), and (3) in the worst case T(g(fi ∪ fi+1 ∪ ... ∪ fj)) = T(g(fi)) + T(g(fi+1)) + ... + T(g(fj)), where no optimization can be applied to those futures and g(fi ∪ fi+1 ∪ ... ∪ fj) has to execute those linearizable futures one by one. Note that the inequalities of combining and elimination are a very weak assumption, allowing the performance of a future group to be anything between the two extreme cases.

If only combining is applied to the program, then we assume T(g(fi ∪ fi+1 ∪ ... ∪ fj)) is fixed in any execution and satisfies the inequalities of combining: T(g*) ≤ T(g(fi ∪ fi+1 ∪ ... ∪ fj)) ≤ T(g1) + T(g2) + ... + T(gk), where g* is a subgroup of g(fi ∪ fi+1 ∪ ... ∪ fj), i.e., a future group g(fa ∪ fa+1 ∪ ... ∪ fb), for some i ≤ a ≤ b ≤ j. The inequalities of combining capture the fact that if we cannot cancel out operations, combining a group of linearizable futures to execute usually takes at least the time needed for executing any subgroup of them (in the best case, it takes the time needed for executing a single linearizable future in the group). Although the inequalities of combining are slightly more restrictive than the inequalities of combining and elimination, we believe they are still a reasonable assumption in many cases in practice, as efficient elimination is feasible only for specific data structures in specific situations.

4.3 Lazy Work Stealing

Given a program with futures, we still need a scheduler to assign the threads of the program to different processors in order to run them in parallel. Work stealing [20, 7] is the most popular scheduler for programs in the original future-parallel model, achieving very good load balancing and low scheduling overheads. In this section, we propose a modified work stealing scheduler, called lazy work stealing, for programs in the linearizable-futures model. We will show in the next section that lazy work stealing can make good use of combining and elimination, by proving good theoretical bounds on its performance.

As in the original work stealing scheduler, each processor in lazy work stealing has its own double-ended queue (deque) to store the threads it created that are ready to execute (in fact, the deque stores the next available nodes of those threads). When a processor executing a thread spawns a normal future thread at a fork, it chooses either the current thread or the newly spawned thread to execute, and pushes the other onto the bottom of its deque. When a processor doesn't have a thread to work on, it pops the first thread from the bottom of its deque to execute if the deque is not empty; otherwise it randomly chooses another processor and steals the first thread from the top of that processor's deque to execute.
At the beginning of an execution of a program, the main threads of the program are stored in the deques of some arbitrary processors, so active processors can start popping them out to execute. The execution finishes when all threads (all nodes) in the program have been executed. To deal with linearizable futures and make good use of combining and elimination, lazy work stealing also does the following:

• For each thread t and each shared object o, all the ready linearizable futures on o created by t are stored together in the shared memory, forming a linked list ℓ in the order they were created. (Note that linearizable futures in the same linked list can be created by different processors, as t can be executed by different processors at different times.)

• When a processor executing t spawns a linearizable future f on o at a fork, it always continues to execute the current thread t and appends f to the linked list ℓ of the ready linearizable futures on o created by t. If f is the last linearizable future on o created by t before the first of the touches of all the ready linearizable futures (including f) on o by t, the processor also creates a pointer node pointing to the head of ℓ and pushes the pointer node to the bottom of its deque (see Figure 4.4).

• When the node a processor pops or steals from a deque is a pointer to the linked list of the ready linearizable futures on o created by t, the processor groups those linearizable futures to execute together, with combining and elimination if possible (a minimal sketch of this bookkeeping appears at the end of this subsection).

Figure 4.4: An example illustrating how lazy work stealing works when a processor P is executing node v in thread t. First, f3 is added to the linked list of ready linearizable futures on object o created by thread t. Since f3 is the last future on o before the first touch of the ready futures on o (i.e., the touch of f3), a pointer node to the linked list is pushed into the bottom of P's deque.

We can observe that using lazy work stealing, linearizable futures can be executed only if the pointer to the linked list containing them has been pushed into a deque. It is also easy to see that in any execution linearizable futures are always partitioned into the same groups to execute.

Now we explain why a linked list as described above can be implemented easily and efficiently. We can, for example, maintain a hash table and use t and o as the key to quickly find the location of the linked list for ready linearizable futures on o created by t. The linked list is created dynamically only when a linearizable future on o is created by t, and its memory can be freed when all the ready linearizable futures have been executed, so lazy work stealing won't take much extra memory in general. Also note that (1) only one processor takes control of thread t at a time and (2) when a processor executes the future group for the ready linearizable futures on o created by t, no more linearizable futures on o will be added to that linked list, as the creation (fork) of the next linearizable future on o created by t is after the touch of that future group. Therefore, we can conclude that only one processor at a time can update the linked list, and hence a simple implementation of the linked list without supporting concurrency will suffice.

The only tricky part in the implementation of lazy work stealing is to figure out whether a linearizable future on o is the last one before the first touch of the ready linearizable futures on o in the same thread. To achieve that, we can, for example, have the compiler find and mark all the linearizable futures of that kind when it compiles the program. Alternatively, we can have the programmer (or the IDE automatically) mark those linearizable futures when the programmer is writing the program. Given that futures and touches are keywords explicitly written in the program, both methods should be easy and convenient.
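The sketch below is our illustration of the bookkeeping described above, not an implementation from this thesis: the class, the hash-table key, and the lastBeforeTouch flag (standing in for the compiler or programmer mark just discussed) are hypothetical names, and the deque, stealing, and combining/elimination machinery are omitted.

```java
import java.util.*;

// Per-processor bookkeeping that lazy work stealing adds for linearizable futures.
class LazyWorkStealingBookkeeping {
    // Key identifying "linearizable futures on object o created by thread t".
    record Key(long threadId, Object sharedObject) {}

    interface LinFuture { void run(); }   // one operation on a shared object

    // Ready linearizable futures, kept in creation order. Only the processor that
    // currently owns the creating thread updates a given list, so the lists need
    // no extra synchronization (as argued in the text).
    private final Map<Key, Deque<LinFuture>> ready = new HashMap<>();

    // This processor's work deque; a Key stored here plays the role of a pointer node.
    private final Deque<Object> deque = new ArrayDeque<>();

    // Called at the fork of a linearizable future f on object o by thread t.
    void onLinearizableFork(long t, Object o, LinFuture f, boolean lastBeforeTouch) {
        Deque<LinFuture> list = ready.computeIfAbsent(new Key(t, o), k -> new ArrayDeque<>());
        list.addLast(f);                                     // append in creation order
        if (lastBeforeTouch) deque.addLast(new Key(t, o));   // push a pointer node
    }

    // Called when a popped or stolen deque entry turns out to be a pointer node:
    // execute the whole group together (combining/elimination would be applied here).
    void executeGroup(Key pointer) {
        Deque<LinFuture> group = ready.remove(pointer);
        if (group != null) group.forEach(LinFuture::run);
    }
}
```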
4.4 Performance Analysis of Lazy Work Stealing

One may be concerned that lazy work stealing is very inefficient in some scenarios, as lazily waiting for all linearizable futures to be created before executing them can sometimes waste too much time of the available idle processors. In this section, we prove that lazy work stealing performs well in the linearizable-futures model, having the best bounds on execution times that any non-clairvoyant scheduler can achieve. Moreover, we prove that in many cases, the performance of lazy work stealing is also close to that of an optimal offline scheduler. Note that in our analysis, we assume each processor can execute different main threads and future threads in an execution, but our results also apply to the case where each processor is assigned to only one main thread and its descendant threads all the time, which is usually the case in hierarchical shared memory systems.

Consider a group g of linearizable futures on object o created by thread t in an execution of a program. We define the span of g as the path in thread t from the fork that creates the first future in g to the last touch of the futures in g. The span of g is divided into three segments: the creation span, the middle span, and the touch span. The creation span of g is from the fork that creates the first future in g to the fork that creates the last future in g. The middle span of g is from the fork that creates the last future in g to the first touch of the futures in g. The touch span of g is from the first touch to the last touch of the futures in g. For instance, in any execution of the program in Figure 4.4 by lazy work stealing, f1, f2 and f3 are grouped, and the creation, middle and touch spans of group g(f1 ∪ f2 ∪ f3) are the paths from the fork of f1 to v, from v to the touch of f3, and from the touch of f3 to the touch of f1, respectively, in thread t.

We will prove theoretical bounds for programs in general form and for programs in a special form, called well-formed programs, in the linearizable-futures model. A program in the linearizable-futures model is well formed if, in the DAG of any execution by lazy work stealing, the spans of future groups on the same object created by the same thread do not overlap. Intuitively, this means that once a thread touches a linearizable future on object o, it has to touch all the other ready linearizable futures on o it has created before it can create new linearizable futures on o (see Figure 4.5). It is easy to see that a well-formed program implicitly divides linearizable futures into the same groups as lazy work stealing does.

Figure 4.5: A well-formed program obtained by modifying the program in Figure 4.4. The only modification is moving the touch of f1 to a node before the fork of f4, so that f4 is created after all of f1, f2, and f3 have been touched.

Recall that in all executions of a program by lazy work stealing, linearizable futures are partitioned into the same groups, and all those executions can be represented as the same DAG.
We define the containment level of a program with respect to the DAG G of its executions by lazy work stealing as follows. Let the size of a future group be the number of linearizable futures in that group. We define the containment level of the creation span (resp. the touch span) of a group g in G as the number of different shared objects on which there are future groups of size at least 2 whose middle spans are contained in the creation span (resp. the touch span) of g. The containment level of g is the maximum of the containment level of the creation span of g and the containment level of the touch span of g. Finally, the containment level of a program is the maximum of the containment levels of all the future groups in G. We will show later in the section that the containment level of a program is often a small constant in practice.

The main results about the performance of lazy work stealing are as follows:

Theorem 28 Given a program with the combining optimization in the linearizable-futures model, its execution time by an optimal offline scheduler is Θ(T1/PA + T∞), where T1 and T∞ are the number of nodes and the length of a critical path in the DAG of the execution respectively, and PA is the average number of available processors in the execution. Its execution time by lazy work stealing is O(T1/P′A + (c + 1)T∞) in expectation, where c is the containment level of the program and P′A is the average number of available processors in the execution.

Theorem 29 Given a well-formed program with combining and elimination in the linearizable-futures model, its execution time by an optimal offline scheduler is Θ(T1/PA + T∞), where T1 and T∞ are the number of nodes and the length of a critical path in the DAG of the execution respectively, and PA is the average number of available processors in the execution. Its execution time by lazy work stealing is O(T1/P′A + (c + 1)T∞) in expectation, where c is the containment level of the program and P′A is the average number of available processors in the execution.

We start with the proofs of the lemmas for proving Theorem 29.

Lemma 30 Let T1 be the number of nodes in the DAG G of any execution of a well-formed program by lazy work stealing. The number of nodes in the DAG of any execution of that program by any scheduler is at least T1.

Proof. Consider a linearizable future f on object o created by a main thread t and the group g that f is in in G. Since the program is well formed, any linearizable future on o created by t that is not in g cannot be grouped with f in any execution by any scheduler. This implies any scheduler can either group all the linearizable futures in g to execute together or split them into subgroups to execute one by one. According to the inequalities of combining and elimination, the number of nodes in g is no more than the total number of nodes in those subgroups of g in any execution. Since the number of nodes in normal futures is fixed in all executions and lazy work stealing incurs the smallest total number of nodes in linearizable futures, we can conclude that T1 is a lower bound for any execution by any scheduler. □

The following lemma holds not only for well-formed programs, but also for programs in the general form.

Lemma 31 Let G be the DAG of an execution of a program of containment level c by lazy work stealing and let g1, g2, ..., gn denote all the future groups contained in a path in G.
If g1, g2, ..., gn are on 2c + 4 or more different shared objects, then any linearizable future in g1 precedes any linearizable future in gn in the program.

Proof. Since g1, g2, ..., gn are on 2c + 4 or more different shared objects, there must be a group gi whose touch u divides the path into two segments such that the future groups contained in each of the two segments are on at least c + 2 different shared objects. Note that g1, g2, ..., gn must be spawned and touched by the same main thread. Therefore, we know that the touch of any linearizable future in g1 must be a node in that main thread before u, because otherwise the touch span of g1 would contain all the middle spans of g2, g3, ..., gi, which are on at least c + 1 different shared objects, contradicting the fact that the containment level of the program is c. Similarly, the fork of any linearizable future in gn must be a node in that main thread after u, because otherwise the creation span of gn would contain all the middle spans of gi+1, gi+2, ..., gn−1, which are on at least c + 1 different shared objects, contradicting the fact that the containment level of the program is c. Therefore, we can conclude that any linearizable future in g1 must precede any linearizable future in gn in the program. □

Lemma 32 Let G be the DAG of an execution of a well-formed program by lazy work stealing, and let g1, g2, ..., gn be all the future groups on object o created by a main thread in G. In the DAG of any execution of the program by any scheduler, there exists a path from u to v such that the path is not shorter than the sum of the lengths of ga, ga+1, ..., gb for any 1 ≤ a ≤ b ≤ n, where u is the fork of a linearizable future in ga and v is the touch of a linearizable future in gb.

Proof. As we pointed out in the proof of Lemma 30, since the program is well formed, for any 1 ≤ i ≤ n, the linearizable futures in gi have to either be grouped together or be partitioned into subgroups of gi in any execution by any scheduler. By the inequalities of combining and elimination, the sum of the lengths of those subgroups is not smaller than the length of gi. Therefore, in any execution by any scheduler, the groups consisting of all the linearizable futures in ga, ga+1, ..., gb form a path α_{u,v} from the fork u of the first group to the touch v of the last group, through all the nodes in those groups (with order edges connecting successive groups), and the length of α_{u,v} is not shorter than the sum of the lengths of ga, ga+1, ..., gb. It is obvious that u is the fork of a linearizable future in ga and v is the touch of a linearizable future in gb, so α_{u,v} suffices. □

Lemma 33 Let G be the DAG of an execution of a well-formed program of containment level c by lazy work stealing. If the length of a critical path in G is T∞, then there is a path of length Ω(T∞/(c + 1)) in the DAG of any execution of the program by any scheduler.

Proof. Consider any critical path in G and all the future groups contained in the path. We partition the critical path into segments as follows. The first segment of the path is from the first node of the path to the touch of the first future group contained in the path such that the future groups contained in this segment are on 2c + 4 different objects. Then, by induction, the next segment is from the end of the previous segment to the touch of the first future group in the rest of the path such that the future groups contained in this segment are on 2c + 4 different objects.
This partitioning completes when we reach the end of the critical path, and hence the last segment is from the end of the previous segment to the end of the path. Let β1, β2, ..., βn denote the segments. Without loss of generality, assume n = 2k is an even number.

Now we consider βi, for any 1 ≤ i ≤ n. By Lemma 32, for an object o on which there are future groups in βi, there exists a path αo from u to v in the DAG G′ of any execution by any scheduler, such that αo is not shorter than the sum of the lengths of all the groups on o in βi in G, where u is the fork of a linearizable future in the first group on o in βi and v is the touch of a linearizable future in the last group on o in βi. In G′, we can also find a path α′ from the start of βi to the end of βi, such that α′ contains all the segments of βi that are in normal futures and the main thread, and goes through the nodes in the main thread between the fork and the touch of each linearizable future group instead of going through that group (see Figure 4.6). Since α′ contains all the nodes in the normal futures and the main thread in βi, and, for each object o, αo is not shorter than the sum of the lengths of all the linearizable future groups on o in βi in G, we know that the sum of the lengths of α′ and all the paths αo is not smaller than the length of βi. Since the future groups in βi are on at most 2c + 4 different objects, there is a path α_{βi} among α′ and all the αo such that α_{βi} is not shorter than 1/(2c + 5) of the length of βi. Note that the first node of α′ is the first node of βi, which is essentially the touch of a linearizable future in the last group in βi−1. Also note that the first node of any αo is the fork of a linearizable future in the first group on o in βi. Hence, we know the first node of α_{βi} is either the touch of a linearizable future in the last group in βi−1 or the fork of a linearizable future in a group in βi. Since both the last node of α′ (which is the last node of βi) and the last node of any αo are touches of linearizable futures in groups in βi, we know the last node of α_{βi} is the touch of a linearizable future in a group in βi (note that the last node of αo may be after the last node of βi).

Figure 4.6: The DAG on the left is a path βi in G. g1, g2, and g3 are linearizable futures on object o contained in βi, and the other segments of βi are all in the main thread and normal futures (for simplicity, we assume there are no linearizable futures on other objects). The DAG on the right is part of G′, where the dotted paths are αo and α′.

For any 1 ≤ i ≤ k − 1, since the future groups in β2i are on 2c + 4 different objects, by Lemma 31, any linearizable future in a group in β2i−1 precedes both any linearizable future in the last group in β2i and any linearizable future in a group in β2i+1 in the program. Since (1) the last node of α_{β2i−1} is the touch of a linearizable future in a group in β2i−1 and (2) the first node of α_{β2i+1} is either the touch of a linearizable future in the last group in β2i or the fork of a linearizable future in a group in β2i+1, we can conclude that in G′, the last node of α_{β2i−1} is before the first node of α_{β2i+1}. Hence, in G′, there is always a path α_odd containing the paths α_{β2i−1} for all 1 ≤ i ≤ k, and the length of that path is at least 1/(2c + 5) of the sum of the lengths of the segments β2i−1 for all 1 ≤ i ≤ k. Similarly, we can prove that in G′, there is always a path α_even containing the paths α_{β2i} for all 1 ≤ i ≤ k, and the length of that path is at least 1/(2c + 5) of the sum of the lengths of the segments β2i for all 1 ≤ i ≤ k. Therefore, the sum of the lengths of α_odd and α_even is at least 1/(2c + 5) of the length of the union of the βi for all 1 ≤ i ≤ 2k, which is the critical path in G. Hence, at least one of α_odd and α_even is of length Ω(T∞/(c + 1)). □

Proof of Theorem 29.
It is a well-known result that any execution of a program by any scheduler takes time Ω(T*1/P*A + T*∞), where T*1 and T*∞ are the number of nodes and the length of a critical path in the DAG of the execution respectively, and P*A is the average number of available processors in the execution. It has been proved that the (nonblocking) work stealing scheduler is asymptotically optimal, achieving Θ(T*1/P*A + T*∞) execution time in expectation for any program with only normal futures (i.e., a deterministic program whose DAG is fixed) [7]. A key observation is that lazy work stealing behaves the same as the ordinary work stealing scheduler with respect to G, the DAG of any execution of the program by lazy work stealing. This implies that the expected execution time of a program in our model by lazy work stealing is Θ(T′1/P′A + T′∞), where T′1 and T′∞ are the number of nodes and the length of a critical path in G, respectively.

By Lemmas 30 and 33, in the DAG of any execution of the program by an optimal offline scheduler, the number of nodes T1 is at least T′1 and the length of a critical path T∞ is Ω(T′∞/(c + 1)), where c is the containment level of the program. Hence, we can conclude that the expected execution time of the program by lazy work stealing is O(T1/P′A + (c + 1)T∞). □

Now we prove Theorem 28 with the help of the lemmas below.

Lemma 34 Let G be the DAG of any execution of a program by lazy work stealing, and let g1, g2, ..., gn be all the future groups on an object o created by a main thread in G. For any 1 ≤ a ≤ b ≤ n, let fa and fb denote the first future in ga and the last future in gb, respectively. In the DAG G′ of any execution of the program by any scheduler, the groups g′_{a′}, g′_{a′+1}, ..., g′_{b′} that contain all the futures in ga, ga+1, ..., gb satisfy T(g′_{a′}) + T(g′_{a′+1}) + ... + T(g′_{b′}) ≥ (1/2)(T(ga) + T(ga+1) + ... + T(gb)), where g′_{a′} contains fa and g′_{b′} contains fb.

Proof. We first prove that no group in G′ can contain both a linearizable future in gi and a linearizable future in gj, for any 1 ≤ i, j ≤ n and j ≥ i + 2. To see that, suppose by way of contradiction that a group g′ contains both a linearizable future in gi and a linearizable future in gj. By medium futures linearizability, g′ must contain all the linearizable futures in gi+1. However, we know that the touch of some linearizable future in gi+1 is before the fork of any linearizable future in gj (because of the way lazy work stealing groups linearizable futures), contradicting the assumption that g′ can contain a linearizable future in gj.

Without loss of generality, suppose b = a + 2k for some k ≥ 0. For any a ≤ i ≤ b, let S′_i denote the set of groups in G′ that contain linearizable futures in gi. What we just proved above implies that S′_i ∩ S′_j = ∅ for any j ≥ i + 2. Let S′_{a+2i} = {g′_c, g′_{c+1}, ..., g′_d} be the subset of {g′_{a′}, g′_{a′+1}, ..., g′_{b′}}, for some 0 ≤ i ≤ k, where g′_c and g′_d contain the first and the last linearizable futures in group g_{a+2i}, respectively.
By the inequalities of combining, we have ∑_{g′ ∈ S′_{a+2i}} T(g′) = T(g′_c) + T(g′_{c+1}) + ... + T(g′_d) ≥ T(g*_c) + T(g′_{c+1}) + T(g′_{c+2}) + ... + T(g′_{d−1}) + T(g*_d) ≥ T(g_{a+2i}), where g*_c and g*_d are the subgroups of g′_c and g′_d, respectively, that contain only the futures in g_{a+2i}. Since S′_{a+2i} ∩ S′_{a+2j} = ∅ for any i ≠ j, we have

∑_{i=0}^{k} ∑_{g′ ∈ S′_{a+2i}} T(g′) ≥ ∑_{i=0}^{k} T(g_{a+2i}).

Similarly, we can prove

∑_{i=0}^{k−1} ∑_{g′ ∈ S′_{a+2i+1}} T(g′) ≥ ∑_{i=0}^{k−1} T(g_{a+2i+1}).

Therefore, we can conclude that either the total number of nodes in the future groups in ∪_{i=0}^{k} S′_{a+2i} or the total number of nodes in the future groups in ∪_{i=0}^{k−1} S′_{a+2i+1} is not smaller than (1/2)(T(ga) + T(ga+1) + ... + T(gb)). Since both the future groups in ∪_{i=0}^{k} S′_{a+2i} and the future groups in ∪_{i=0}^{k−1} S′_{a+2i+1} are among g′_{a′}, g′_{a′+1}, ..., g′_{b′}, we complete the proof. □

Lemma 35 Let T1 be the number of nodes in the DAG of any execution of a program by lazy work stealing. Then the number of nodes in the DAG of an execution of the program by any scheduler is at least T1/2.

Proof. Lemma 34 implies that in the DAG of any execution of the program by any scheduler, the total number of nodes in linearizable future groups on an object o created by a main thread is at least half of that in the DAG of an execution by lazy work stealing. Given that the total number of nodes in the main threads and the normal futures in the program is fixed in all executions, we can conclude that the number of nodes in the DAG of any execution by any scheduler is at least T1/2. □

Lemma 36 Let G be the DAG of an execution of a program by lazy work stealing, and let g1, g2, ..., gn be all the future groups on object o created by a main thread in G. In the DAG of any execution of the program by any scheduler, there exists a path from u to v such that the path is not shorter than half of the sum of the lengths of ga, ga+1, ..., gb for any 1 ≤ a ≤ b ≤ n, where u is after the fork of some linearizable future in ga and v is before the touch of some linearizable future in gb.

Proof. By Lemma 34, we know the path from the fork of g′_{a′} to the touch of g′_{b′} through all the nodes in groups g′_{a′}, g′_{a′+1}, ..., g′_{b′} is not shorter than half of the sum of the lengths of ga, ga+1, ..., gb. Now the only thing we need to prove is that the fork of g′_{a′} is after the fork of some linearizable future in ga and the touch of g′_{b′} is before the touch of some linearizable future in gb. This follows from the fact that g′_{a′} contains the first future in ga and g′_{b′} contains the last future in gb. □

Lemma 37 Let G be the DAG of an execution of a program of containment level c by lazy work stealing. If the length of a critical path in G is T∞, then there is always a path of length Ω(T∞/(c + 1)) in the DAG of any execution of the program by any scheduler.

Proof. The proof is almost identical to that of Lemma 33. The only difference is that this proof is based on Lemma 36, instead of Lemma 32. □

Proof of Theorem 28. The proof is almost identical to that of Theorem 29. □

We argue that many programs in practice have small containment levels. For example, if a program makes operation calls to only c different shared objects, for some small constant c, its containment level is at most c.
Moreover, the containment level depends only on interleavings of linearizable futures of one very restricted form—the creation or touch span of a future group of size at least 2 containing the middle span of another future group of size at least 2—so interleaving in any other form does not affect the containment level of a program at all. For instance, a program can have a very small containment level even if (1) a thread creates a large number of linearizable futures first and then touches them later, since their interleaving is not of the form related to the containment level, (2) the creation/touch span of a group contains the middle spans of many groups that are on only a few different shared objects, or (3) the creation/touch span of a group contains the middle spans of many groups of size 1 (i.e., single linearizable futures).

The theorem below shows that no non-clairvoyant scheduler can achieve better bounds than lazy work stealing. A scheduler is non-clairvoyant if it cannot know the part of the DAG that has not been executed yet. Here we give a non-clairvoyant scheduler the extra power to know whether a linearizable future on object o is the last one in a thread before the first touch of the ready linearizable futures on o in that thread, as lazy work stealing knows that information.

Theorem 38. Given a non-clairvoyant scheduler A, an integer $c \ge 0$, and the set of available processors at each time step, there exists a well-formed program of containment level at most c with the combining optimization in the linearizable-futures model such that its execution time using an optimal offline scheduler is $\Theta(T_1/P_A + T_\infty)$ and its execution time using A is $\Omega(T_1/P_A' + (c+1)T_\infty)$ in expectation, where $P_A$ and $P_A'$ are the average numbers of available processors in the two executions, respectively, and $T_1$ and $T_\infty$ are the number of nodes and the length of a critical path in the DAG of the execution by the optimal offline scheduler, respectively.

Since Theorem 38 exhibits a well-formed program that uses only combining, it covers both Theorem 28 (a program with only combining) and Theorem 29 (a well-formed program with combining and elimination), and hence the upper bounds in those two theorems are tight. Its proof is presented below.

Proof of Theorem 38. When c = 0, the theorem is trivially true. Now we consider c > 0. Since A is non-clairvoyant, we will adaptively construct a program Q, based on the probabilities of the choices A has made in the execution of the part of Q generated so far, such that the expected number of nodes and the expected length of a critical path in the DAG of an execution of Q by A are $\Omega(T_1)$ and $\Omega(cT_\infty)$, respectively.

We construct Q in c steps. In the first step, we construct $Q_1^*$, a subgraph of Q. We first let A execute a program $Q_1$, whose DAG is shown in Figure 4.7. $Q_1$ has only one main thread, in which c linearizable futures $f_{1,1}, f_{1,2}, \ldots, f_{1,c}$, all on object $o_1$, are created consecutively and then touched consecutively in the same order as they are created. We assume that grouping any linearizable futures takes time t, for some fixed number t; that is, $T(g(f_{1,i} \cup f_{1,i+1} \cup \cdots \cup f_{1,j})) = T(g(f_{1,i})) = t$ for any $1 \le i \le j \le c$. Let $P_{1,i}$ be the probability that A groups $f_{1,i}$ with some other linearizable futures in an execution of $Q_1$, for any i. If $P_{1,i} < 1/2$ for all $1 \le i \le c$, we know that in expectation at least c/2 linearizable futures are executed solely, with no other futures, in an execution by A.
Therefore, the length of any path `1 going though all the groups containing the c futures in the DAG of an execution by A is at least tc/2 in expectation. Now consider the case where P1,k ≥ 1/2 for some 1 ≤ k ≤ c. Since at least one of f1,k−1 and f1,k+1 must be in the group containing f1,k when f1,k is grouped with other futures, we know that the probability that f1,k0 and f1,k0 +1 are in the same group in an execution by A is at least 1/4, for some k 0 = k−1 or k 0 = k. Now let us construct another program Q01 by modifying Q1 as follows. The positions of the touches of f1,k0 +1 , f1,k0 +2 , ..., f1,c are all moved to a super node S (which represents a sequence of touch nodes) at the end of the complete program Q, as illustrated in Figure 4.7. 62 Figure 4.7: Q1 and Q01 in the proof of Theorem 38 63 The execution times of future groups are also modified. We keep T (g(f1,i ∪ f1,i+1 ∪ ... ∪ f1,j )) = T (g(f1,i )) = t, for any 1 ≤ i ≤ j ≤ k 0 , but we let T (g(f1,i ∪ f1,i+1 ∪ ... ∪ f1,j )) = T (g(f1,j )) = ct, for any j ≥ k 0 + 1. Intuitively, f1,k0 +1 , f1,k0 +2 , ..., f1,c are now “longer” futures each taking time ct, and grouping any of them, with or without linearizable futures prior to fk0 +1 , takes ct. Note that the execution times of the future groups in G1 and G01 all satisfy the inequalities of combining. A key observation is that the two programs Q1 and Q01 are indistinguishable from A’s perspective before A starts executing the group containing f1,k0 , because (1) the touch of f1,k0 blocks A from moving forward in the main thread and (2) all the instructions in the main thread before that touch and the future groups containing only futures prior to f1,k0 are the same in the two problems. More specifically, given the same sequence of random bits for running A (since it may be a randomized scheduler) and the same available processors at each time step, the executions of Q1 and Q01 by A are identical until A starts executing the group containing f1,k0 . Therefore, it is still true that the probability that f1,k0 and f1,k0 +1 are in the same group in an execution of Q01 by A is at least 1/4. Thus in an execution of Q01 by A, the expected length of the path `01 consisting of the group containing f1,k0 is at least ct/4 and obviously the end of `01 , which is the touch of that group, is the touch of a future f1,i for some i ≤ k 0 . If P1,k ≥ 1/2 for some 1 ≤ k ≤ c, we set Q∗1 = Q1 and `∗1 = `1 , and otherwise we set Q∗1 = Q01 and `∗1 = `01 . Thus, in the DAG of an execution of Q∗1 by A, the expected length of `∗1 is at least ct/4. Now in the second step of constructing Q, we generate Q∗2 by extending Q∗1 as follows. Starting from the “last node” of Q∗1 (which is either the touch of f1,c in Q1 or the touch of f1,k0 in Q01 ), we create linearizable futures f2,k0 +1 , f2,k0 +2 , ..., f2,c on another object o2 consecutively in the main thread and then touch them in the same order. We assume grouping any of those futures takes time t. Let Q2 denote the new program that consists of Q∗1 and the newly added part. Like what we did for Q1 , if the probability that f2,i is executed solely without other linearizable futures in an execution by A is less than 1/2 for all i, we let Q∗2 = Q2 . 
Otherwise, there must be futures f2,k0 and f2,k0 +1 for some k 0 such that they are in the same group in an execution by A with probability at least 1/4 and hence we let Q∗2 = Q02 , where Q02 is obtained by moving the touches of f2,k+1 , f2,k+2 , ..., f2,c to the super node S and changing the execution time to ct for any group containing any of f2,k+1 , f2,k+2 , ..., f2,c . Again, we can show that Q2 and Q02 are indistinguishable from A’s perspective until A starts executing the group containing fk0 . By the same argument for Q∗1 , We can prove that there is a path `∗2 from the fork of f2,i to the touch of f2,j , for some i, j, such that its length is at least ct/4 in expectation in the DAG of an execution of Q∗2 by A. By induction, in the kth step for any 2 ≤ k ≤ c, we construct Q∗i by “appending” to Q∗k−1 linearizable futures fk,1 , fk,2 , ..., fk,c on a new object ok in the same way we explained above, and we can conclude that there is a path `∗k from the fork of fk,i to the touch of fk,j , for some i, j, such that its length is at least ct/4 in expectation in the DAG of an execution of Q∗k by A. Finally, we construct G by appending the super node S to the last node in the main thread in G∗c . Noth that S represents a sequence of at most c nodes that are touches of linearizable futures 64 in G∗c . Now consider any path ` that contains all `∗1 , `∗2 , ..., `∗c in the DAG of an execution of G by A. Obviously, the expected length of ` is at least c · ct/4 = c2 t/4. In contrast, an optimal offline scheduler will group all fk,1 , fk,2 , ..., fk,c together if G∗k = Gk , or group all the “shorter” futures fk,1 , fk,2 , ..., fk,i and then group all the “longer” futures fk,i+1 , fk,i+2 , ..., fk,c if G∗k = G0k for any 1 ≤ k ≤ c. It is easy to see that the DAG of an execution by this optimal scheduler has Θ(c2 t) nodes, asymptotically optimal, and the length of its critical path T∞ is only Θ(ct). Thus, the expected length of a critical path in an execution by A is Ω(cT∞ ), i.e. Ω((c + 1)T∞ ), which completes the proof. (Note that G has only one main thread. We can also construct a program with multiple main threads that suffices, where each main thread is constructed in the same way as G is constructed.) t u Chapter 5 Concurrent Data Structures for Near-Memory Computing In this chapter, we will discuss applications of linearizable futures in the near-memory computing model, also called the PIM model. As we briefly mentioned earlier, to be competitive with traditional concurrent data structures, data structures in the PIM model need new approaches to leverage parallelism. Here we present some PIM-managed concurrent data structures, where threads send their operation requests as linearizable futures to PIM cores which execute those requests with certain optimizations. With those optimizations and the help of PIM cores’ fast memory access speed, our data structures can beat state-of-the-art concurrent data structures in the literature. This chapter is organized as follows. We first present related work in Section 5.1. In Section 5.2 we briefly describe our assumptions about the hardware architecture. In Section 5.3 we introduce a simplified performance model that we use throughout this paper to predict performance of our algorithms using the hardware architecture described in Section 5.2. Finally in Sections 5.4 and 5.5, we describe and analyze our PIM algorithms and use our model to compare them to prior work. 
We also use current architectures to simulate the behavior of our algorithms and evaluate them against state-of-the-art concurrent algorithms.

5.1 Related Work

The PIM model is undergoing a renaissance. Studied for decades (e.g., [74, 54, 36, 68, 67, 52, 40]), this model has recently re-emerged due to advances in 3D-stacked technology that can stack memory dies on top of a logic layer [51, 59, 12]. For example, Micron and others have recently released a PIM prototype called the Hybrid Memory Cube [26], and the model has again become the focus of architectural research. Different PIM-based architectures have been proposed, either for general purposes or for specific applications [5, 4, 79, 50, 11, 6, 10, 9, 21, 80, 81]. The PIM model has many advantages, including low energy consumption and high bandwidth (e.g., [4, 79, 80, 9]). Here, we focus on one more: low memory access latency [59, 50, 11].

To our knowledge, however, we are the first to utilize PIM memory for designing efficient concurrent data structures. Although some researchers have studied how PIM memory can help speed up concurrent operations on data structures, such as parallel graph processing [4] and parallel pointer chasing on linked data structures [50], the applications they consider require very simple, if any, synchronization between operations. In contrast, operations on concurrent data structures can interleave in arbitrary orders, and therefore have to synchronize correctly with one another in all possible situations. This makes designing concurrent data structures with correctness guarantees like linearizability [49] very challenging. Moreover, no one has previously compared the performance of data structures in the PIM model with that of state-of-the-art concurrent data structures in the classic shared memory model.

We analyze and evaluate concurrent linked-lists and skip-lists, as representatives of pointer-chasing data structures, and concurrent FIFO queues, as representatives of contended data structures. For linked-lists, we compare our PIM-managed implementation with well-known approaches such as fine-grained locking [43] and flat combining [44, 30]. For skip-lists, we compare our implementation with the lock-free skip-list [48] and a skip-list with flat combining and the partitioning optimization. For FIFO queues, we compare our implementation with the flat-combining FIFO queue [44] and the F&A-based FIFO queue [64].

5.2 Hardware Architecture and Model

In the PIM hardware model, multiple CPUs are connected to the main memory via a shared crossbar network, as illustrated in Figure 5.1. The main memory consists of two parts—one is a normal DRAM accessible by CPUs, and the other, called the PIM memory, is divided into multiple partitions, called PIM vaults or simply vaults. According to the Hybrid Memory Cube specification 1.0 [26], each HMC consists of 16 or 32 vaults and has a total size of 2 GB or 4 GB (so each vault has size roughly 100 MB). We assume the same specifications in our PIM model, although the size of a PIM memory and the number of its vaults can be greater. Each CPU also has access to a hierarchy of caches backed by DRAM, and there can be last-level caches shared among multiple CPUs. Each vault has a PIM core directly attached to it. We say a vault is local to the PIM core attached to it, and vice versa.
A PIM core is a lightweight CPU that may be slower than a full-fledged CPU with respect to computation speed. (A PIM core can be thought of as an in-order CPU with only a small private L1 cache and without some optimizations that full-fledged CPUs usually have.) A vault can be accessed only by its local PIM core. (We may alternatively assume that a PIM core has direct access to remote vaults, at a larger cost. We may also assume that vaults are accessible by CPUs as well, but at the cost of dealing with cache coherence between CPUs and PIM cores. Some cache coherence mechanisms for PIM memory claim to be not costly (e.g., [21, 5]). However, we prefer to keep the hardware model simple, and we will show that we are still able to design efficient concurrent data structure algorithms with this simple, less powerful PIM memory.) Although a PIM core is relatively slow computationally, it has fast access to its local vault. A PIM core communicates with other PIM cores and CPUs via messages.

Figure 5.1: The PIM model

Each PIM core, as well as each CPU, has buffers for storing incoming messages. A message is guaranteed to eventually arrive at the buffer of its receiver. Messages from the same sender to the same receiver are delivered in FIFO order: the message sent first arrives at the receiver first. However, messages from different senders or to different receivers can arrive in an arbitrary order. To keep the PIM memory simple, we assume that a PIM core can only make read and write operations to its local vault, while a CPU also supports more powerful atomic operations, such as CAS and F&A. Virtual memory can be achieved cheaply in this model by having each PIM core maintain its own page table for its local vault [50].

5.3 Performance Model

Based on the latency numbers in prior work on PIM memory, in particular on the Hybrid Memory Cube [26, 11], and on the evaluation of operations in multiprocessor architectures [28], we propose the following simple performance model to compare our PIM-managed algorithms with existing concurrent data structure algorithms.

For read and write operations, we assume $L_{cpu} = 3L_{pim} = 3L_{llc}$, where $L_{cpu}$ is the latency of a memory access by a CPU, $L_{pim}$ is the latency of a local memory access by a PIM core, and $L_{llc}$ is the latency of a last-level cache access by a CPU. We ignore the costs of cache accesses at other levels in our performance model, as they are negligible in the concurrent data structure algorithms we will consider. We assume that the latency of a CPU performing an atomic operation, such as a CAS or an F&A, on a cache line is $L_{atomic} = L_{cpu}$, even if the cache line is currently in cache, because an atomic operation that hits in the cache is usually as costly as a memory access by a CPU, according to [28]. When there are k atomic operations competing for a cache line concurrently, we assume that they are executed sequentially, that is, they complete at times $L_{atomic}, 2L_{atomic}, \ldots, k \cdot L_{atomic}$, respectively.

We assume that the size of a message sent by a PIM core or a CPU is at most the size of a cache line. Given that a message transferred between a CPU and a PIM core goes through the crossbar network, we assume that the latency for a message to arrive at its receiver is $L_{message} = L_{cpu}$. We make the conservative assumption that the latency of a message transferred between two PIM cores is also $L_{message}$.
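To make this cost model easy to apply in the analyses that follow, the short sketch below encodes the latency assumptions as plain constants. It is an illustration added here rather than part of the thesis artifacts; the time unit is arbitrary, and only the ratios matter.

# A minimal sketch of the performance model of Section 5.3 (assumed constants).
L_PIM = 1.0            # latency of a local memory access by a PIM core (arbitrary unit)
L_LLC = L_PIM          # latency of a last-level cache access by a CPU
L_CPU = 3 * L_PIM      # latency of a memory access by a CPU: L_cpu = 3*L_pim = 3*L_llc
L_ATOMIC = L_CPU       # a CAS or F&A by a CPU costs a memory access, even on a cache hit
L_MESSAGE = L_CPU      # message latency between a CPU and a PIM core (or two PIM cores)

def contended_atomic_completion_times(k):
    """k atomic operations contending for one cache line are serialized:
    they complete at times L_ATOMIC, 2*L_ATOMIC, ..., k*L_ATOMIC."""
    return [i * L_ATOMIC for i in range(1, k + 1)]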
Note that the message latency we consider here is the transfer time of a message through a message passing channel, that is, the period between the moment when a PIM or a CPU sends off the message and the moment when the message arrives at the buffer of its receiver. We ignore the time spent in other parts of a message passing procedure, such as preprocessing and constructing the message, as it is negligible compared to the time spent in message transfer. 5.4 Low Contention Data Structures In this section we consider data structures with low contention; pointer chasing data structures, such as linked-lists and skip-lists, fall in this category. These are data structures whose operations need to de-reference a non-constant sequence of pointers before completing. We assume they support operations such as add(x), delete(x) and contains(x), which follow “next node” pointers until reaching the position of node x. When these data structures are too large to fit in CPU caches and access uniformly random keys, they incur expensive memory accesses, which cannot be easily predicted, making the pointer chasing the dominating overhead of these data structures. Naturally, these data structures have been early examples of the benefit of near-memory computing [50], as the entire pointer chase could be performed by the PIM core, and only the final result returned to the application. However, under the same conditions, these data structures have inherently low contention. Lockfree algorithms [33, 69, 75, 48] have shown that these data structures can scale to hundreds of cores 69 under low contention. Unfortunately, each vault in PIM memory has a single core; as a consequence, prior work has only compared PIM data structures with sequential data structures, not with carefully crafted concurrent data structures. We analyze linked-lists and skip-lists, and show that the naive PIM data structure in each case cannot outperform the equivalent CPU managed concurrent data structure even for a small number of cores. Next, we show how to use state-of-the art techniques from concurrent computing literature to optimize algorithms for near-memory computing to outperform well-known concurrent data structures. 5.4.1 Linked-lists We now describe a naive PIM linked-list. The linked-list is stored in a vault, maintained by the local PIM core. Whenever a CPU3 wants to perform an operation on the linked-list, it sends a request to the PIM core. The PIM core will retrieve the message, execute the operation, and send the result back to the CPU. The PIM linked-list is sequential, as it can only be accessed by one PIM core. Doing pointer chasing on sequential data structures by PIM cores is not a new idea (e.g., [50, 4]). It is obvious that for a sequential data structure like a sequential linked-list, replacing the CPU with a PIM core to access the data structure will largely improve its performance due to the PIM core’s much faster memory access. However, we are not aware of any prior comparison between the performance of PIM-managed data structures and concurrent data structures in which CPUs can make operations in parallel. In fact, our analytical and experimental results will show that the performance of the naive PIM-managed linked-list is much worse than that of the concurrent linked-list with fine-grained locks [43]. 
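As a concrete illustration of the naive design just described, the following sketch shows the sequential list and the handler the local PIM core would run for each incoming request. It is only a schematic rendering under the stated assumptions—the operation names mirror add(x), delete(x), and contains(x)—and not the thesis's implementation.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    key: int
    next: Optional["Node"] = None

class NaivePimLinkedList:
    """Single-vault sketch: the local PIM core owns a sorted linked-list and
    serves one CPU request at a time, sending the result back afterwards."""
    def __init__(self):
        self.head: Optional[Node] = None

    def serve(self, op: str, key: int):
        # The pointer chase is performed entirely inside the vault by the PIM core.
        prev, cur = None, self.head
        while cur is not None and cur.key < key:
            prev, cur = cur, cur.next
        found = cur is not None and cur.key == key
        if op == "contains":
            return found
        if op == "add" and not found:
            node = Node(key, cur)
            if prev is None:
                self.head = node
            else:
                prev.next = node
            return True
        if op == "delete" and found:
            if prev is None:
                self.head = cur.next
            else:
                prev.next = cur.next
            return True
        return False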
To improve the performance of the PIM-managed linked-list, we apply the following combining optimization to it: the PIM core retrieves all pending requests from its buffer and executes all of them during only one traversal over the linked-list. It is not hard to see that the role of the PIM core in our PIM-managed linked-list is very similar to that of the combiner in a concurrent linked-list implemented using flat combining [44], where, roughly speaking, threads compete for a "combiner lock" to become the combiner, and the combiner takes over all operation requests from other threads and executes them. Therefore, we think the performance of the flat-combining linked-list is a good indicator of the performance of our PIM-managed linked-list.

Based on our performance model, we can calculate the approximate expected throughputs of the linked-list algorithms mentioned above when there are p CPUs making operation requests concurrently. (We use the term CPU to refer to CPU cores, as opposed to PIM cores.) We assume that a linked-list consists of nodes with integer keys in the range [1, N]. Initially a linked-list has n nodes with keys generated independently and uniformly at random from [1, N]. The keys of the operation requests are generated the same way. To simplify the analysis, we assume that CPUs make only contains() requests (or that the number of add() requests equals the number of delete() requests, so that the size of each linked-list stays nearly unchanged). We also assume that a CPU makes a new operation request immediately after its previous one completes. Assuming that $n \gg p$ and $N \gg p$, the approximate expected throughputs (per second) of the concurrent linked-lists are presented in Table 5.1, where $S_p = \sum_{i=1}^{n} \left(\frac{i}{n+1}\right)^p$. (We define the rank of an operation request to a linked-list as the number of pointers it has to traverse until it finds the right position for it in the linked-list; $n - S_p$ is the expected rank of the request with the biggest key among the p random requests a PIM core or a combiner combines, which is essentially the expected number of pointers a PIM core or a combiner has to go through during one pointer-chasing procedure.)

Algorithm                                        Throughput
Linked-list with fine-grained locks              $2p/((n+1)L_{cpu})$
Flat-combining linked-list without combining     $2/((n+1)L_{cpu})$
PIM-managed linked-list without combining        $2/((n+1)L_{pim})$
Flat-combining linked-list with combining        $p/((n-S_p)L_{cpu})$
PIM-managed linked-list with combining           $p/((n-S_p)L_{pim})$

Table 5.1: Throughputs of linked-list algorithms.

It is easy to see that the PIM-managed linked-list with combining outperforms the linked-list with fine-grained locks, which is the best among the other algorithms, as long as $L_{cpu}/L_{pim} > 2(n-S_p)/(n+1)$. Given that $0 < S_p \le n/2$ and $L_{cpu} = 3L_{pim}$, the throughput of the PIM-managed linked-list with combining should be at least 1.5 times the throughput of the linked-list with fine-grained locks. Without combining, however, the PIM-managed linked-list cannot beat the linked-list with fine-grained locks when p > 6.

We implemented the linked-list with fine-grained locks and the flat-combining linked-list with and without the combining optimization. We tested them on a Dell server with 512 GB RAM and 56 cores on four Intel Xeon E7-4850v3 processors at 2.2 GHz. To avoid NUMA access effects, we ran the experiments on only one processor, which is a NUMA node with 14 cores, a 35 MB shared L3 cache, and private L2/L1 caches of size 256 KB/64 KB per core. Each core has 2 hyperthreads, for a total of 28 hyperthreads. Cache lines have 64 bytes. The throughputs of the algorithms are presented in Figure 5.2. The results confirm the validity of our analysis in Table 5.1. The throughput of the flat-combining algorithm without the combining optimization is much worse than that of the algorithm with fine-grained locks.
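As a quick sanity check on the analysis above, the following calculation plugs the model latencies into the Table 5.1 expressions. The values of n and p are arbitrary examples chosen only for illustration; they are not the experimental settings.

# Evaluate the analytical throughputs of Table 5.1 under the Section 5.3 model.
L_PIM, L_CPU = 1.0, 3.0          # assumed: L_cpu = 3 * L_pim
n, p = 10_000, 8                 # example list size and number of CPUs

S_p = sum((i / (n + 1)) ** p for i in range(1, n + 1))

fine_grained       = 2 * p / ((n + 1) * L_CPU)
fc_no_combining    = 2 / ((n + 1) * L_CPU)
pim_no_combining   = 2 / ((n + 1) * L_PIM)
fc_with_combining  = p / ((n - S_p) * L_CPU)
pim_with_combining = p / ((n - S_p) * L_PIM)

# With L_cpu = 3*L_pim and 0 < S_p <= n/2, this ratio is at least 1.5,
# matching the claim made above.
print(pim_with_combining / fine_grained)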
Since we believe the performance of the flat-combining linked-list is a good indicator of that of the PIM-managed linked-list, we triple the throughput of the flat-combining algorithm without the combining optimization to get the estimated throughput of the PIM-managed algorithm. As we can see, it is still far below the throughput of the one with fine-grained locks. However, with the combining optimization, the performance of the flat-combining algorithm improves significantly, and the estimated throughput of our PIM-managed linked-list with the combining optimization now beats all the others.

Figure 5.2: Experimental results of linked-lists. We evaluated the linked-list with fine-grained locks and the flat-combining linked-list (FC) with and without the combining optimization.

5.4.2 Skip-lists

Like the naive PIM-managed linked-list, the naive PIM-managed skip-list keeps the skip-list in a single vault, and CPUs send operation requests to the local PIM core, which executes those operations. As we will see, this algorithm is less efficient than some existing algorithms. Unfortunately, the combining optimization cannot be applied to skip-lists effectively, because for any two nodes not close to each other in the skip-list, the paths we traverse to reach them do not overlap much.

On the other hand, PIM memory usually consists of many vaults and PIM cores. For instance, the first generation of the Hybrid Memory Cube [26] has up to 32 vaults. Hence, a PIM-managed skip-list may achieve much better performance if we can exploit the parallelism of multiple vaults. Here we present our PIM-managed skip-list with a partitioning optimization: a skip-list is divided into partitions of disjoint key ranges, stored in different vaults, so that a CPU sends its operation request to the PIM core of the vault to which the key of the operation belongs.

Figure 5.3 illustrates the structure of a PIM-managed skip-list. Each partition of a skip-list starts with a sentinel node, which is a node of the maximum height. For simplicity, assume the maximum height $H_{max}$ is predefined. A partition covers the key range between the key of its sentinel node and the key of the sentinel node of the next partition. CPUs also store a copy of each sentinel node in the normal DRAM, and each copy has an extra variable indicating the vault containing the sentinel node. Since the number of nodes of the maximum height is very small with high probability, those copies of the sentinel nodes can almost certainly stay in cache if CPUs access them frequently. When a CPU applies an operation for a key to the skip-list, it first compares the key with those of the sentinels, discovers which vault the key belongs to, and then sends its operation request to that vault's PIM core. Once the PIM core retrieves the request, it executes the operation in the local vault and finally sends the result back to the CPU.
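The CPU-side routing step just described can be sketched as follows. The cached sentinel copies are represented here by two parallel lists, and send_to_pim stands in for the message-passing layer; both are illustrative assumptions rather than the thesis's code.

import bisect

class PartitionedSkipListClient:
    """CPU-side sketch: map a key to the vault whose partition covers it,
    using the cached sentinel copies, then forward the request."""
    def __init__(self, sentinel_keys, sentinel_vaults, send_to_pim):
        # sentinel_keys is sorted; sentinel_vaults[i] owns keys in
        # [sentinel_keys[i], sentinel_keys[i+1]).
        self.sentinel_keys = sentinel_keys
        self.sentinel_vaults = sentinel_vaults
        self.send_to_pim = send_to_pim

    def request(self, op, key):
        idx = max(bisect.bisect_right(self.sentinel_keys, key) - 1, 0)
        vault = self.sentinel_vaults[idx]
        # The PIM core of that vault executes the operation locally and replies.
        return self.send_to_pim(vault, op, key)

With the initial layout described next (k partitions covering equal key ranges), sentinel_keys would simply hold the k partition boundaries.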
Now let us discuss how we implement the PIM-managed skip-list when the key of each operation is an integer generated uniformly at random from the range [0, n] and the PIM memory has k vaults available. Initially we can create k partitions starting with fake sentinel nodes with keys $0, n/k, 2n/k, \ldots, (k-1)n/k$, respectively, and allocate each partition in a different vault. The sentinel nodes are never deleted. If a new node to be added has the same key as a sentinel node, we insert it immediately after the sentinel node.

Figure 5.3: A PIM-managed skip-list with three partitions

We compare the performance of our PIM-managed skip-list with partitions to the performance of a flat-combining skip-list [44] and a lock-free skip-list [48], where p CPUs keep making operation requests. We also apply the partitioning optimization to the flat-combining skip-list, so that k combiners are in charge of the k partitions of the skip-list. To simplify the comparison, we assume that all skip-lists have the same initial structure (except that skip-lists with partitions have extra sentinel nodes) and that all the operations are contains() operations (or the number of add() requests is the same as the number of delete() requests, so that the size of each skip-list stays nearly unchanged). Their approximate expected throughputs are presented in Table 5.2, where $\beta$ is the average number of nodes an operation has to go through in order to find the location of its key in a skip-list ($\beta = \Theta(\log N)$, where N is the size of the skip-list).

Algorithm                                    Throughput
Lock-free skip-list                          $p/(\beta L_{cpu})$
Flat-combining skip-list                     $1/(\beta L_{cpu})$
PIM-managed skip-list                        $1/(\beta L_{pim} + L_{message})$
Flat-combining skip-list with k partitions   $k/(\beta L_{cpu})$
PIM-managed skip-list with k partitions      $k/(\beta L_{pim} + L_{message})$

Table 5.2: Throughputs of skip-list algorithms.

Note that we have ignored some overheads in the flat-combining algorithms, such as maintaining combiner locks and publication lists (we will discuss publication lists in more detail in Section 5.5). We have also overestimated the performance of the lock-free skip-list by not counting the CAS operations used in add() and delete() requests, as well as the cost of retries caused by conflicting updates. Even so, our PIM-managed skip-list with the partitioning optimization is still expected to outperform the second best algorithm, the lock-free skip-list, when $k > p(\beta L_{pim} + L_{message})/(\beta L_{cpu})$. Given that $L_{message} = L_{cpu} = 3L_{pim}$, $k > p/3$ should suffice.

Our experiments revealed similar results, as presented in Figure 5.4. We implemented and ran the flat-combining skip-list with different numbers of partitions and compared them with the lock-free skip-list. As the number of partitions increases, the performance of the flat-combining skip-list gets better, which implies the effectiveness of the partitioning optimization. Again, we believe the performance of the flat-combining skip-list is a good indicator of the performance of our PIM-managed skip-list. Therefore, according to the analytical results in Table 5.2, we can triple the throughput of a flat-combining skip-list to get the expected performance of a PIM-managed skip-list. As Figure 5.4 illustrates, when the PIM-managed skip-list has 8 or 16 partitions, it is expected to outperform the lock-free skip-list with up to 28 hardware threads.

Figure 5.4: Experimental results of skip-lists. We evaluated the lock-free skip-list and the flat-combining skip-list (FC) with different numbers (1, 4, 8, 16) of partitions.
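The comparison in Table 5.2 can be turned into a small worked example: the helper below computes the number of partitions k at which the partitioned PIM-managed skip-list overtakes the lock-free skip-list under the Section 5.3 latency assumptions. The values of p and β below are illustrative only.

# From Table 5.2: k / (beta*L_pim + L_message) > p / (beta*L_cpu).
L_PIM = 1.0
L_CPU = L_MESSAGE = 3.0 * L_PIM          # model assumptions of Section 5.3

def min_partitions(p, beta):
    """Smallest (fractional) k for which the partitioned PIM skip-list beats
    the lock-free skip-list; beta = Theta(log N) is the average number of
    nodes traversed per operation."""
    return p * (beta * L_PIM + L_MESSAGE) / (beta * L_CPU)

# For large beta this tends to p/3. Example: p = 24 threads, beta = 20.
print(min_partitions(24, 20))            # ~9.2, so k = 16 partitions suffices here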
Skip-list Rebalancing

The PIM-managed skip-list performs well with a uniform distribution of requests. However, if the distribution of requests is not uniform, a static partitioning scheme will result in unbalanced partitions, with some PIM cores being idle while others have to serve a majority of the requests. To address this problem, we introduce a non-blocking protocol for migrating consecutive nodes from one vault to another.

The protocol works as follows. A PIM core p that manages a vault v′ can send a message to another PIM core q, managing vault v, to request that some nodes be moved from v′ to v. First, p sends a message notifying q of the start of the migration. Then p sends messages adding those nodes to q one by one, in ascending order of the keys of the nodes. After all the nodes have been migrated, p sends notification messages to the CPUs so that they can update their copies of the sentinel nodes accordingly. After p receives acknowledgement messages from all CPUs, it notifies q of the end of the migration. To keep the node migration protocol simple, we don't allow q to move those nodes to yet another vault until p finishes its node migration.

During the node migration, p can still serve requests from CPUs. Assume that a request with key k1 is sent to p while p is migrating nodes in a key range containing k1. If p is about to migrate a node with key k2 at that moment and k1 ≥ k2, p serves the request itself. Otherwise, p must have already migrated all nodes in the subset containing key k1, and therefore p forwards the request to q, which will serve the request and respond directly to the requesting CPU. The algorithm is correct because a request eventually reaches the vault that currently contains the nodes in the key range the request belongs to: if a request arrives at p, which no longer holds the partition the request belongs to, p can simply reply with a rejection to the CPU, and the CPU will resend its request to the correct PIM core, because by then it has updated its sentinels and knows which PIM core it should contact.

Using this node migration protocol, the PIM-managed skip-list can support two rebalancing schemes: (1) if a partition has too many nodes, the local PIM core can send the nodes in a key range to a vault that has fewer nodes; (2) if two consecutive partitions are both small, we can merge them by moving one to the vault containing the other. In practice, we expect that rebalancing will not happen very frequently, so its overhead can be offset by the improved efficiency resulting from a rebalance.

5.5 High Contention Data Structures

In this section, we consider data structures that are often contended when accessed by many threads concurrently. In these data structures, operations compete to access one or several locations, creating a contention spot that can become a performance bottleneck. Examples include the head and tail pointers of queues and the top pointer of a stack. These data structures have good locality, and the contention spots are often found in shared CPU caches, such as the last-level cache of a multi-socket non-uniform memory access machine when accessed by threads running on only one socket. Therefore, these data structures might seem to be a poor fit for near-memory computing, because the advantage of faster access to memory is muted by having the frequently accessed data in the cache.
However, such a perspective does not consider the overhead introduced by contention in a concurrent data structure where all threads try to access the same locations. As a representative example of this class of data structures, we consider a FIFO queue, where concurrent enqueue and dequeue operations compete for the head and the tail of the queue, respectively. Although a naive PIM FIFO queue is not a good replacement for a well-crafted concurrent FIFO queue, we show that, counterintuitively, PIM can still have benefits over a traditional concurrent FIFO queue. In particular, we exploit the pipelining of requests from CPUs, which can be done very efficiently in PIM, to design a PIM FIFO queue that can outperform state-of-the-art concurrent FIFO queues, such as the one using flat combining [44] and the one using F&A [64].

5.5.1 FIFO queues

The structure of our PIM-managed FIFO queue is shown in Figure 5.5. A queue consists of a sequence of segments, each containing consecutive nodes of the queue. A segment is allocated in a PIM vault, with a head node and a tail node pointing to the first and the last nodes of the segment, respectively. A vault can contain multiple (most likely non-consecutive) segments. There are two special segments—the enqueue segment and the dequeue segment. To enqueue a node, a CPU sends an enqueue request to the PIM core of the vault containing the enqueue segment. The PIM core then inserts the node at the head of the segment. Similarly, to dequeue a node, a CPU sends a dequeue request to the PIM core of the vault holding the dequeue segment. The PIM core then pops the node at the tail of the dequeue segment and sends that node back to the CPU.

Figure 5.5: A PIM-managed FIFO queue with three segments

Initially the queue consists of a single empty segment, which acts as both the enqueue segment and the dequeue segment. When the length of the enqueue segment exceeds some threshold, the PIM core maintaining it notifies another PIM core to create a new segment as the new enqueue segment. (When and how to create a new segment can be decided in other ways. For example, CPUs, instead of the PIM core holding the enqueue segment, can decide when to create the new segment and which vault should hold it, based on more complex criteria—e.g., if a PIM core is currently holding the dequeue segment, it will not be chosen for the new segment, so as to avoid the situation where it deals with both enqueue and dequeue requests. To simplify the description of our algorithm, we omit those variants.) When the dequeue segment becomes empty and the queue has other segments, the dequeue segment is deleted, and the segment that was created first among all the remaining segments is designated as the new dequeue segment. (It is not hard to see that this new dequeue segment was created when the old dequeue segment acted as the enqueue segment and exceeded the length threshold.) If the enqueue segment is different from the dequeue segment, enqueue and dequeue operations can be executed by two different PIM cores in parallel, which doubles the throughput compared to a straightforward queue implementation held in a single vault.
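Before the pseudocode, the sketch below spells out the per-PIM-core state that Algorithm 2 manipulates. The field names mirror the pseudocode, while the Python rendering itself is only an illustration of the data layout, not the thesis's implementation.

from collections import deque
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class QueueNode:
    value: object
    next: Optional["QueueNode"] = None

@dataclass
class Segment:
    """A chunk of consecutive queue nodes stored in one vault."""
    head: Optional[QueueNode] = None     # newest node (enqueue side)
    tail: Optional[QueueNode] = None     # oldest node (dequeue side)
    count: int = 0                       # length, compared against the threshold
    nextSegCid: Optional[int] = None     # CID of the PIM core holding the next segment

@dataclass
class PimCoreState:
    """Local state of one PIM core, as used in Algorithm 2."""
    enqSeg: Optional[Segment] = None     # non-null iff this core holds the enqueue segment
    deqSeg: Optional[Segment] = None     # non-null iff this core holds the dequeue segment
    segQueue: deque = field(default_factory=deque)   # locally created segments, oldest first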
Algorithm 2 PIM-managed FIFO queue: the PIM core's procedures for handling requests enq(cid, u), deq(cid), newEnqSeg(), and newDeqSeg()

 1: procedure enq(cid, u)
 2:   if enqSeg == null then
 3:     send message(cid, false);
 4:   else
 5:     if enqSeg.head ≠ null then
 6:       enqSeg.head.next = u;
 7:       enqSeg.head = u;
 8:     else
 9:       enqSeg.head = u;
10:       enqSeg.tail = u;
11:     enqSeg.count = enqSeg.count + 1;
12:     send message(cid, true);
13:     if enqSeg.count > threshold then
14:       cid' = the CID of the PIM core chosen to maintain the new segment;
15:       send message(cid', newEnqSeg());
16:       enqSeg.nextSegCid = cid';
17:       enqSeg = null;

 1: procedure newEnqSeg()
 2:   enqSeg = new Segment();
 3:   segQueue.enq(enqSeg);
 4:   notify CPUs of the new enqueue segment;

 1: procedure deq(cid)
 2:   if deqSeg == null then
 3:     send message(cid, false);
 4:   else
 5:     if deqSeg.tail ≠ null then
 6:       send message(cid, deqSeg.tail);
 7:       deqSeg.tail = deqSeg.tail.next;
 8:     else
 9:       if deqSeg == enqSeg then
10:         send message(cid, null);
11:       else
12:         send message(deqSeg.nextSegCid, newDeqSeg());
13:         deqSeg = null;
14:         send message(cid, false);

 1: procedure newDeqSeg()
 2:   deqSeg = segQueue.deq();
 3:   notify CPUs of the new dequeue segment;

The pseudocode of the algorithm is presented in Algorithm 2. Each PIM core has local variables enqSeg and deqSeg that are references to its local enqueue and dequeue segments. When enqSeg (respectively deqSeg) is not null, it indicates that the PIM core is currently holding the enqueue (respectively dequeue) segment. Each PIM core also maintains a local queue segQueue for storing local segments. CPUs and PIM cores communicate via message(cid, content) calls, where cid is the unique core ID (CID) of the receiver and content is either a request or a response to a request.

Once a PIM core receives an enqueue request enq(cid, u) for a node u from a CPU whose CID is cid, it first checks whether it is holding the enqueue segment (line 2 of Procedure enq(cid, u)). If so, the PIM core enqueues u (lines 5-12); otherwise it sends back a message informing the CPU that the request is rejected (line 3), so that the CPU can resend its request to the right PIM core holding the enqueue segment (we will explain later how the CPU can find the right PIM core). After enqueuing u, the PIM core may find that the enqueue segment is longer than the threshold (line 13). If so, it sends a message with a newEnqSeg() request to the PIM core of another vault that is chosen to create a new enqueue segment. Finally, the PIM core sets its enqSeg to null, indicating that it no longer handles enqueue operations. Note that the CID cid' of the PIM core chosen for creating the new segment is recorded in enqSeg.nextSegCid for future use in dequeue requests. As Procedure newEnqSeg() in Algorithm 2 shows, the PIM core receiving this newEnqSeg() request creates a new enqueue segment and enqueues the segment into its segQueue (line 3). Finally, it notifies the CPUs of the new enqueue segment (we will return to this in more detail later).

Similarly, when a PIM core receives a dequeue request deq(cid) from a CPU with CID cid, it first checks whether it still holds the dequeue segment (line 2 of Procedure deq(cid)). If so, the PIM core dequeues a node and sends it back to the CPU (lines 5-7). Otherwise, it informs the CPU that this request has failed (line 3), and the CPU will have to resend its request to the right PIM core.
If the dequeue segment is empty (line 8) and the dequeue segment is not the same as the enqueue segment (line 11), which indicates that the FIFO queue is not empty and there exists another segment, the PIM core sends a message with a newDeqSeg() request to the PIM core with CID deqSeg.nextSegCid. (We know that this PIM core must hold the next segment, according to how we create new segments in enqueue operations, as shown at lines 14-16 of Procedure enq(cid, u).) Upon receiving the newDeqSeg() request, as illustrated in Procedure newDeqSeg(), the PIM core retrieves from its segQueue the oldest segment it has created and makes it the new dequeue segment (line 2). Finally the PIM core notifies CPU that it is holding the new dequeue segment now. Now we explain how CPUs and PIM cores coordinate to make sure that CPUs can find the right enqueue and dequeue segments, when their previous attempts have failed due to changes of those segments. We will only discuss how to deal with enqueue segments here, since the same methods can be applied to dequeue segments. A straightforward way to inform CPUs is to have the owner PIM core of the new enqueue segment send notification messages to them (line 4 of newEngSeg()) and wait until CPUs all send back acknowledgement messages. However, if there is a slow CPU that doesn’t reply in time, the PIM core has to wait for it and therefore other CPUs cannot have their requests executed. A more efficient, non-blocking method is to have the PIM core start working for new requests immediately after it has sent off those notifications. A CPU does not have to reply to those notifications in this case, but if its request later fails, it needs to send messages to (sometimes all) PIM cores to ask whether a PIM core is currently in charge of the enqueue segment. In either case, the correctness of the algorithm is guaranteed: at any time, there is only one enqueue segment and only one dequeue segment, and only requests sent to them will be executed. We would like to mention that the PIM-managed FIFO can be further optimized. For example, the PIM core holding the enqueue segment can combine multiple pending enqueue requests and store the nodes to be enqueued in an array as a “fat” node of the queue, so as to reduce memory accesses. This optimization is also used in the flat-combining FIFO queue [44]. Even without this optimization, our algorithm still performs well, as we will show next. 5.5.2 Pipelining and Performance analysis We compare the performance of three concurrent FIFO queue algorithms—our PIM-manged FIFO queue, a flat-combining FIFO queue and a F&A-based FIFO queue [64]. The F&A-based FIFO queue is the most efficient concurrent FIFO queue we are aware of, where threads make F&A operations on 78 two shared variables, one for enqueues and the other for dequeues, to compete for slots in the FIFO queue to enqueue and dequeue nodes (see [64] for more details). The flat-combining FIFO queue we consider is based on the one proposed by [44], with a modification that threads compete for two “combiner locks”, one for enqueues and the other for dequeues. We further simplify it based on the assumption that the queue is always non-empty, so that it doesn’t have to deal with synchronization issues between enqueues and dequeues when the queue is empty. Let us first assume that a queue is long enough such that the PIM-managed FIFO queue has more than one segment, and enqueue and dequeue requests can be executed separately. 
Since changes of the enqueue and dequeue segments happen very infrequently, their overhead is negligible and is therefore ignored to simplify our analysis. (If the threshold on segment length at line 13 of enq(cid, u) is a large integer n, then, in the worst case, changing an enqueue or dequeue segment happens only once every n requests, and the cost is only the latency of sending one message plus a few steps of local computation.) Since enqueues and dequeues are isolated in all three algorithms when queues are long enough, we will focus on dequeues; the analysis of enqueues is almost identical.

Assume there are p concurrent dequeue requests by p threads. Since each thread needs to make an F&A operation on a shared variable in the F&A-based algorithm, and F&A operations on a shared variable are essentially serialized, the execution time of p requests in that algorithm is at least $pL_{atomic}$. If we assume that each CPU makes a request immediately after its previous request completes, we can prove that the throughput of the algorithm is at most $1/L_{atomic}$.

The flat-combining FIFO queue maintains a sequential FIFO queue, and threads submit their requests into a publication list. The publication list consists of slots, one for each thread, to store those requests. After writing a request into the list, a thread competes with other threads to acquire a lock and become the "combiner". The combiner then goes through the publication list to retrieve requests, executes the operations for those requests, and writes the results back to the list, while the other threads spin on their slots, waiting for the results. The combiner therefore makes two last-level cache accesses to each slot other than its own, one to read the request and one to write the result back. Thus, the execution time of p requests in this algorithm is at least $(2p-1)L_{llc}$, and the throughput of the algorithm is roughly $1/(2L_{llc})$ for large enough p.

Note that we have made quite optimistic analyses of the F&A-based and flat-combining algorithms by counting only the costs of part of their executions. The latency of accessing and modifying queue nodes in the two algorithms is ignored here. For dequeues, this latency can be high: since the nodes to be dequeued in a long queue are unlikely to be cached, the combiner has to make a sequence of memory accesses to dequeue them one by one. Moreover, the F&A-based algorithm may suffer performance degradation under heavy contention, because contended F&A operations may perform worse in practice.

The performance of our PIM-managed FIFO queue seems poor at first sight: although a PIM core can update the queue efficiently, it takes a lot of time for the PIM core to send results back to the CPUs one by one. To improve its performance, the PIM core can pipeline the execution of requests, as illustrated in Figure 5.6(a). Suppose p CPUs send p dequeue requests concurrently to the PIM core, which takes time $L_{message}$. The PIM core first retrieves a request from its message buffer (step 1 in the figure), dequeues a node for the request (step 2), and sends the node back to the CPU (step 3).

Figure 5.6: (a) illustrates the pipelining optimization, where a PIM core can start executing a new deq() (step 1 of deq() for the CPU on the left) without waiting for the dequeued node of the previous deq() to return to the CPU on the right (step 3). (b) shows the timeline of pipelining four deq() requests.
After the PIM core sends off the message containing the node, it immediately retrieves the next request, without waiting for the message to arrive at its receiver. This way, the PIM core can pipeline requests by overlapping the latency of message transfer (step 3) and the latency of memory accesses and local computation (steps 1 and 2) across multiple requests (see Figure 5.6(b)). During the execution of a dequeue, the PIM core makes only one memory access, to read the node to be dequeued, and two L1 cache accesses, to read and modify the tail node of the dequeue segment. It is easy to prove that the execution time of p requests, including the time the CPUs take to send their requests to the PIM core, is only $L_{message} + p(L_{pim} + \epsilon) + L_{message}$, where $\epsilon$ is the total latency of the PIM core's L1 cache accesses and of sending off one message, which is negligible in our performance model. If each CPU makes another request immediately after it receives the result of its previous request, we can prove that the throughput of the PIM-managed FIFO queue is approximately
$$\frac{p}{2L_{message} + p(L_{pim} + \epsilon)} \;\approx\; \frac{1}{L_{pim} + \epsilon} \;\approx\; \frac{1}{L_{pim}},$$
which is expected to be twice the throughput of the flat-combining queue and three times that of the F&A queue in our performance model, assuming $L_{atomic} = 3L_{llc} = 3L_{pim}$.

When the PIM-managed FIFO queue is short, it may contain only one segment, which handles both enqueue and dequeue requests. In this case, its throughput is only half of the throughput shown above, but it should still be at least as good as the throughput of the other two algorithms.

Bibliography

[1] Umut A. Acar, Guy E. Blelloch, and Robert D. Blumofe. The data locality of work stealing. In Proceedings of the twelfth annual ACM symposium on Parallel algorithms and architectures, SPAA '00, pages 1–12, New York, NY, USA, 2000. ACM.
[2] Umut A. Acar, Arthur Chargueraud, and Mike Rainey. Scheduling parallel programs by work stealing with private deques. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '13, pages 219–228, New York, NY, USA, 2013. ACM.
[3] Kunal Agrawal, Yuxiong He, and Charles E. Leiserson. Adaptive work stealing with parallelism feedback. In Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP '07, pages 112–120, New York, NY, USA, 2007. ACM.
[4] Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. A scalable processing-in-memory accelerator for parallel graph processing. In Proceedings of the 42nd Annual International Symposium on Computer Architecture, ISCA '15, pages 105–117, New York, NY, USA, 2015. ACM.
[5] Junwhan Ahn, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture. In Proceedings of the 42nd Annual International Symposium on Computer Architecture, ISCA '15, pages 336–348, New York, NY, USA, 2015. ACM.
[6] Berkin Akin, Franz Franchetti, and James C. Hoe. Data reorganization in memory using 3D-stacked DRAM. In Proceedings of the 42nd Annual International Symposium on Computer Architecture, ISCA '15, pages 131–143, New York, NY, USA, 2015. ACM.
[7] Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the tenth annual ACM symposium on Parallel algorithms and architectures, SPAA '98, pages 119–129, New York, NY, USA, 1998. ACM.
[8] Arvind, Rishiyur S. Nikhil, and Keshav K. Pingali. I-structures: data structures for parallel computing.