Date of Award
Doctor of Philosophy (PhD)
The disparity in performance between processors and main memories has
led computer architects to incorporate large cache hierarchies in
modern computers. These cache hierarchies are designed to be
general-purpose in that they strive to provide the best possible
performance across a wide range of applications. However, such a memory
subsystem does not necessarily provide the best possible performance for
a particular application.
Although general-purpose memory subsystems are desirable when the
work-load is unknown and the memory subsystem must remain fixed,
when this is not the case a custom memory subsystem may be beneficial.
For example, in an application-specific integrated circuit (ASIC) or
a field-programmable gate array (FPGA) designed to run a particular
application, a custom memory subsystem optimized for that application
would be desirable. In addition, when there are tunable
parameters in the memory subsystem, it may make sense to change these
parameters depending on the application being run. Such a situation
arises today with FPGAs and, to a lesser extent, GPUs, and it is
plausible that general-purpose computers will begin to support
greater flexibility in the memory subsystem in the future.
In this dissertation, we first show that it is possible to create
application-specific memory subsystems that provide much better
performance than a general-purpose memory subsystem. In addition,
we show a way to discover such memory subsystems automatically using
a superoptimization technique on memory address traces gathered
from applications. This allows one to generate a custom memory subsystem
with little effort.
We next show that our memory subsystem superoptimization technique can
be used to optimize for objectives other than performance. As an example,
we show that it is possible to reduce the number of writes to the main
memory, which can be useful for main memories with limited write
durability, such as flash or Phase-Change Memory (PCM).
Finally, we show how to superoptimize memory subsystems for streaming
applications, which are a class of parallel applications. In particular, we
show that, through the use of ScalaPipe, we can author and deploy streaming
applications targeting FPGAs with superoptimized memory subsystems.
ScalaPipe is a domain-specific language (DSL) embedded in the Scala
programming language for generating streaming applications that can be
implemented on CPUs and FPGAs. Using the ScalaPipe implementation, we
are able to demonstrate actual performance improvements using the
superoptimized memory subsystem with applications implemented in hardware.
Roger D Chamberlain
Kunal Agrawal, Ron K Cytron, Viktor Gruev, Krishna Kavi, Hiro Mukai