Date of Award
Doctor of Philosophy (PhD)
Modern computing systems from all domains are becoming increasingly more parallel. Manufacturers are taking advantage of the increasing number of available transistors by packaging more and more computing resources together on a single chip or within a single system. These platforms generally contain many levels of private and shared caches in addition to physically distributed main memory. Therefore, some memory is more expensive to access than other and high-performance software must consider memory locality as one of the first level considerations.
Memory locality is often difficult for application developers to consider directly, however, since many of these NUMA affects are invisible to the application programmer and only show up in low performance. Moreover, on parallel platforms, the performance depends on both locality and load balance and these two metrics are often at odds with each other. Therefore, directly considering locality and load balance at the application level may make the application much more complex to program.
In this work, we develop locality-conscious concurrency platforms for multiple different structured parallel programming models, including streaming applications, task-graphs and parallel for loops. In all of this work, the idea is to minimally disrupt the application programming model so that the application developer is either unimpacted or must only provide high-level hints to the runtime system. The runtime system then schedules the application to provide good locality of access while, at the same time also providing good load balance. In particular, we address cache locality for streaming applications through static partitioning and developed an extensible platform to execute partitioned streaming applications. For task-graphs, we extend a task-graph scheduling library to guide scheduling decisions towards better NUMA locality with the help of user-provided locality hints. CilkPlus parallel for loops utilize a randomized dynamic scheduler to distribute work which, in many loop based applications, results in poor locality at all levels of the memory hierarchy. We address this issue with a novel parallel for loop implementation that can get good cache and NUMA locality while providing support to maintain good load balance dynamically.
Chris Gill, I-Ting A. Lee, Roger Chamberlain, Sriram Krishnamoorthy,