Using a User-Level Memory Thread for Correlation Prefetching
or=black>
Using a User-Level Memory Thread for Correlation Prefetching
Using a User-Level Memory Thread for Correlation Prefetching
Yan Solihin
¡
Jaejin Lee
¢
Josep Torrellas
¡
¡
University of Illinois at Urbana-Champaign
¢
Michigan State University
http://iacoma.cs.uiuc.edu
http://www.cse.msu.edu/
£
jlee
Abstract
This paper introduces the idea of using a User-Level Memory Thread
(ULMT) for correlation prefetching. In this approach, a user thread
runs on a general-purpose processor in main memory, either in the
memory controller chip or in a DRAM chip. The thread performs
correlation prefetching in software, sending the prefetched data into
the L2 cache of the main processor. This approach requires mini-
mal hardware beyond the memory processor: the correlation table
is a software data structure that resides in main memory, while the
main processor only needs a few modications to its L2 cache so
that it can accept incoming prefetches. In addition, the approach has
wide usability, as it can effectively prefetch even for irregular ap-
plications. Finally, it is very exible, as the prefetching algorithm
can be customized by the user on an application basis. Our simula-
tion results show that, through a new design of the correlation table
and prefetching algorithm, our scheme delivers good results. Specif-
ically, nine mostly-irregular applications show an average speedup
of 1.32. Furthermore, our scheme works well in combination with
a conventional processor-side sequential prefetcher, in which case
the average speedup increases to 1.46. Finally, by exploiting the
customization of the prefetching algorithm, we increase the average
speedup to 1.53.
1. Introduction
Data prefetching is a popular technique to tolerate long memory ac-
cess latencies. Most of the past work on data prefetching has focused
on processor-side prefetching [6, 7, 8, 12, 13, 14, 15, 19, 20, 23, 25,
26, 28, 29]. In this approach, the processor or an engine in its cache
hierarchy issues the prefetch requests. An interesting alternative is
memory-side prefetching, where the engine that prefetches data for
the processor is in the main memory system [1, 4, 9, 11, 22, 28].
Memory-side prefetching is attractive for several reasons. First, it
eliminates the overheads and state bookkeeping that prefetch re-
quests introduce in the paths between the main processor and its
caches. Second, it can be supported with a few modications to
the controller of the L2 cache and no modication to the main pro-
cessor. Third, the prefetcher can exploit its proximity to the memory
to its advantage, for example by storing its state in memory. Fi-
nally, memory-side prefetching has the additional attraction of riding
the technology trend of increased chip integration. Indeed, popular
platforms like PCs are being equipped with graphics engines in the
memory system [27]. Some chipsets like NVIDIAs nForce even in-
tegrate a powerful processor in the North Bridge chip [22]. Simpler
¤
This work was supported in part by the National Science Foundation un-
der grants CCR-9970488, EIA-0081307, EIA-0072102, and CHE-0121357;
by DARPA under grant F30602-01-C-0078; by Michigan State University;
and by gifts from IBM, Intel, and Hewlett-Packard.
engines can be provided for prefetching, or existing graphics pro-
cessors can be augmented with prefetching capabilities. Moreover,
there are proposals to integrate processing logic in DRAM chips,
such as IRAM [16].
Unfortunately, existing proposals for memory-side prefetching en-
gines have a narrow scope [1, 9, 11, 22, 28]. Indeed, some de-
signs are hardware controllers that perform simple and specic op-
erations [1, 9, 22]. Other designs are specialized engines that are
custom-designed to prefetch linked data structures [11, 28]. Instead,
we would like an engine that is usable in a wide variety of workloads
and that offers exibility of use to the programmer.
While memory-side prefetching can support a variety of prefetching
algorithms, one type that is particularly suited to it is Correlation
prefetching [1, 6, 12, 18, 26]. Correlation prefetching uses past se-
quences of reference or miss addresses to predict and prefetch future
misses. Since no program knowledge is needed, correlation prefetch-
ing can be easily moved to the memory side.
In the past, correlation prefetching has been supported by hardware
controllers that typically require a large hardware table to keep the
correlations [1, 6, 12, 18]. In all cases but one, these controllers are
placed between the L1 and L2 caches, or between the processor and
the L1. While effective, this approach has a high hardware cost. Fur-
thermore, it is often unable to prefetch far ahead enough and deliver
good prefetch coverage.
In this paper, we present a new scheme where correlation prefetch-
ing is performed by a User-Level Memory Thread (ULMT) running
on a simple general-purpose processor in memory. Such a proces-
sor is either in the memory controller chip or in a DRAM chip,
and prefetches lines to the L2 cache of the main processor. The
scheme requires minimal hardware support beyond the memory pro-
cessor: the correlation table is a software data structure that resides
in main memory, while the main processor only needs a few mod-
ications to its L2 cache controller so that it can accept incoming
prefetches. Moreover, our scheme has wide usability, as it can ef-
fectively prefetch even for irregular applications. Finally, it is very
exible, as the prefetching algorithm executed by the ULMT can be
customized by the programmer on an application basis.
Using a new design of the correlation table and correlation prefetch-
ing algorithm, our scheme delivers an average speedup of 1.32 for
nine mostly-irregular applications. Furthermore, our scheme works
well in combination with a conventional processor-side sequential
prefetcher, in which case the average speedup increases to 1.46. Fi-
nally, by exploiting the customization of the prefetching algorithm,
we increase the average speedup to 1.53.
This paper is organized as follows: Section 2 discusses memory-side
and correlation prefetching; Section 3 presents ULMT for correla-
Possible
Locations of
the Memory
Processor
Chip
North
Bridge
DRAM
Memory
CPU
L2 $
L1 $
(a)
Proc
Main
Mem
Proc
Mem
Main Memory System
2: Lookup
2: Reply i
3: Prefetch j, k
1: Fetch i
(b)
Proc
Main
Mem
Proc
Mem
Main Memory System
1: Execute
2: Fetch i
3: Prefetch i
3: Reply i
(c)
Figure 1.
Memory-side prefetching: some locations where the memory processor can be placed (a), and actions under push passive (b)
and push active (c) prefetching.
tion prefetching; Section 4 discusses our evaluation setup; Section 5
evaluates our design; Section 6 discusses related work; and Section 7
concludes.
2. Memory-Side and Correlation Prefetching
2.1. Memory-Side Prefetching
Memory-Side prefetching occurs when prefetching is initiated by an
engine that resides either close to the main memory (beyond any
memory bus) or inside of it [1, 4, 9, 11, 22, 28]. Some manufacturers
have built such engines. Typically, they are simple hardwired con-
trollers that probably recognize only simple stride-based sequences
and prefetch data into local buffers. Some examples are NVIDIAs
DASP engine in the North Bridge chip [22] and Intels prefetch cache
in the i860 chipset.
In this paper, we propose to support memory-side prefetching with a
user-level thread running on a general-purpose core. The core can be
very simple and does not need to support oating point. For illustra-
tion purposes, Figure 1-(a) shows the memory system of a PC. The
core can be placed in different places, such as in the North Bridge
(memory controller) chip or in the DRAM chips. Placing it in the
North Bridge simplies the design because the DRAM is not modi-
ed. Moreover, some existing systems already include a core in the
North Bridge for graphics processing [22], which could potentially
be reused for prefetching. Placing the core in a DRAM chip compli-
cates the design, but the resulting highly-integrated system has lower
memory access latency and higher memory bandwidth. In this paper,
we examine the performance potential of both designs.
Memory- and processor-side prefetching are not the same as Push
and Pull (or On-Demand) prefetching [28], respectively.
Push
prefetching occurs when prefetched data is sent to a cache or proces-
sor that has not requested it, while pull prefetching is the opposite.
Clearly, a memory prefetcher can act as a