Notes on Interconnection Networks for PIM

Notes on Interconnection Networks for PIM1
Michael Niemier Department of Computer Science and Engineering, University of Notre Dame College of Computing, Georgia Institute of Technology 1 Introduction
"On-chip interconnection networks are becoming increasingly important for SoCs and CMPs. Power and wire constraints are forcing the adoption of new design methodologies for systems-on-chip (SOC) ­ namely those that incorporate explicit parallelism. To enable these MP-SoC platforms, researchers have recently pursued scaleable communication centric fabrics, (i.e. networks-on-chip ­ NOC), which possess many features that are particularly attractive for these. These communication-centric interconnect fabrics are characterized by different trade-offs with regard to latency, throughput, energy dissipation, and silicon area requirements" [Saleh 2005]. "Chip multiprocessors (CMPs) use increasing transistor budgets to integrate multiple processors onto a single die. Tiled architectures provide a scalable means for effectively using the resources available in advanced VLSI technologies as wire delay becomes a significant design constraint. On-chip networks address many of the difficulties associated with managing the increasing complexity of on-chip communication: their modular construction facilitates design reuse and allows high-performance circuits to be used in the interconnect; the structured wiring allows the abundant wire resources to be efficiently used while maintaining well controlled electrical characteristics" [Balfour 2006]. Here we consider on-chip interconnection networks in the context of various PIM configurations. The goal of this work is to determine what on-chip interconnection networks map well to various configurations of PIM nodes. We will leverage some existing work that looks at interconnection networks for CMPs and SoCs. However, most existing work looks at the performance of networks for a specific number of on-chip nodes, to be implemented in a specific technology, for specific message sizes [Brockman 2003]. The goal of this work is to create a model that's a little bit more general ­ sacrificing some accuracy and detail for the flexibility of evaluating PIM nodes of varying complexity. While we could easily tie our study to a specific configuration and technology, at present a more flexible framework is actually more useful. Expected traffic patterns between PIM nodes, message sizes, etc. are not clearly defined. A more flexible model will allow us to quickly evaluate different interconnection topologies as this information evolves. This should allow us to narrow the spectrum of potential topologies that may be suitable for a particular configuration, and we can study these in more detail. Additionally, the configuration of a PIM chip will be tailored to an application ­ which will most likely result in the need for a different interconnection network. By having a more general model, we will be able to consider how connections between PIM nodes will affect overall system performance; and while we may want to tailor our interconnection network to the ideal hardware for a given application, it may become necessary to tailor functional units to the interconnection network. This work represents the first study of interconnection networks in the context of PIM configurations. We have developed a spreadsheet simulation that allows a user to determine the number of CCs required to pass a message from one node to another, for various interconnection network topologies, in the absence of contention as a function of: a. b. c. d. e. f. g. h. the chosen PIM hardware the interconnection network topology the clock rate what layers of metal are used for the interconnection network what wire resistivities and dielectrics are assumed wire geometry the technology node the size of the message being sent between nodes. 1 This research was sponsored by DARPA through Cray, Inc. under the HPCS program and the Cascade project. While by no means complete ­ again, specific layouts, traffic patterns, switch designs, etc. are needed ­ this study does provide the initial infrastructure for a more detailed study when appropriate, as well as a good estimate of a lower bound on CC latency. We will discuss topologies and latencies first. How we can (easily) leverage this work to perform a more detailed analysis of a specific configuration will be explained at the end of this document. 2 Overview
Latency in an interconnection network is ultimately a function of many different variables ­ or perhaps betterstated "aspects of a design." At the highest level, interconnection network topology, the technology node, the configuration of hardware on the PIM chip, and the number of bits being transmitted between nodes of the PIM chip all have an effect on clock cycle latency. Of course, CCs are not the only performance metric worth considering. Eventually we will want to consider the interconnection network in the context of the power budget, fabrication costs, etc. However, this work can easily be used as a jumping off point for such studies ­ and is also affected by performance requirements at the architectural-level. As an example, consider Fig. 1. A 16 node mesh network and torus network are shown in Fig. 1a. When translated to a physical layout, the maximum wire length for a torus will be double that of the mesh [Saleh 2005]. Technology scaling will of course have an impact too. If one simply considers RC delay it is relatively easy to see that as wire width goes down, resistance per unit length (and hence latency) goes up (Fig. 1b).
Wire thickness also affects bandwidth as well as latency (as the number of wires between 2 nodes will decrease as thickness increases). + ~ ...
(a) Mesh Torus Wire length in mesh approximately 1/2 of torus. ... ... ...
(b) Fig. 1: (a) Latencies between nodes will be different for different interconnection network topologies, (b) technology scaling will also affect interconnection network latency. Fig. 1 2.1 Topology-based Latency Dally and Towles define latency in the absence of contention as a function of the number of hops required for any topology (Eq. 1) [Dally 2003]. Here, latency, (To) is a combination of three different elements. Eq. 1 To = H min tr + D min L + v b The specific components of the equation are explained below in Table 1.
Hmin tr Dmin v L b Minimum number of hops to send a message from node A to B. ! Time spent in the router (function of flit length, number of virtual channels, etc.) Distance traveled in wire. Propagation velocity. Length of message. Bandwidth. Table 1: Variables that influence latency in an interconnection network. The first term (Hmintr) represents the time spent in routers, the second term (Dmin/v) is the IC latency (the time for one part of a message spends on the wires), and the third term (L/b) is the serialization latency (or time spent to move the whole message over a set of wires). 2.2 Architecture and Configuration-based Latency This model would give us some insight as to the total number of seconds required to move a message from one node to the next. However, it does not consider latency as a function of CCs (more useful for an on-chip architecture), and it is more difficult to parameterize in terms of topologies and technology scaling. Additionally, we may be working around a clock rate dictated by processing logic and memory ­ not the interconnection network. We will want to evaluate how we can get the best performance from this aspect of the design (i.e. the interconnection network) as a function of the actual processor hardware. To estimate the number of CCs required to send a message from one node to the next we present a slightly different version of Eq. 12. Specifically, the total clock cycle latency is defined by: a. the number of times a group of wires between hops must be reused (due to bandwidth limitations) b. the number of hops that must be traversed. c. the number of CCs between hops (due to wire length limitations). In equation form, we have the following: Eq. 2 [# of Hops + 1] " [ # of CCs due to BW limits] " [ # of CCs due to latency limits] Each term in the above equation is a function of some subset of the design parameters listed above (PIM ! hardware, interconnection network topology, clock rate, layers of metal used for interconnection network, wire resistivities and dielectrics are assumed, wire geometry, technology node, and size of message being sent between nodes). We will use this equation to illustrate how a PIM configuration, technology, and IC NW topologybased CC estimations will evolve. For example, as seen in Fig. 2a, latency is in part a function of the number of "hops" a message must traverse ­ which is in turn a function of the network topology chosen. The number of connections available between nodes is limited (ultimately) by wiring pitch ­ which affects the number of message bits that can be sent from one node to the next at the same time (Fig. 2b). This might be further constrained by the number of dedicated links in each direction. For example, bidirectional channels might necessitate that only one half of the wires that could be place between nodes be used to send a message from node X to node Y (i.e. if we wanted to simultaneously send a message from node Y to node X).
3 4 5 B ... ... ...
(b) But wiring is not unlimited between nodes. Here, there are 8 wires between nodes. (c) With bi-directional channels, only 4 bits of data can be sent between nodes. 2 1 A (a) In this 4x4 mesh, to send a message from A to B, we must traverse 5 nodes (or 5+1 links). Fig. 2: (a) 4x4 mesh with a group of wires between each node, (b) wire pitch will limit the number of wires between nodes, (c) pitch and bi-directionality will ultimately limit the number of bits sent during each CC. 3 Number of Hops for a Topology
Table 2 reports the average number of hops for ring, mesh, and tori interconnection networks as a function of the number of nodes [Dally 2003]. A 4-children, 2-parent butterfly fat-tree (4-ary) [Saleh 2005] has also been considered (for 32 nodes). This data will be used later when we consider systems-level latency.
Average Case3
2 3 This assumes uniform wire length between nodes (not an unreasonable assumption as will be seen later). Note: k represents the number of nodes. Ring (n=1) Mesh (n=2) Torus (n=2) Fat-tree (nk)/4, k even n[(k/4)-(1/4 k)], k odd (nk)/3, k even n[(k/3)-(1/3 k)], k odd (nk)/4, k even n[(k/4)-(1/4 k)], k odd 3.7 Table 2: Number of hops for various interconnection networks. 4 Bandwidth and Clock Cycle Latency
4.1 Overview Here we explain how we will estimate the number of CCs required to move a message between nodes. We begin by examining potential delays due to bandwidth limitations. For example, assume we need to send an 8-bit message between 2 nodes, but there are only 4 wires between them. If you can send one bit on each wire during each CC, at least 2 CCs (8/4=2) are required to send the message. If channels are bidirectional 4 CCs will be required (8/2 = 4). Here we will determine the maximum number of wires that could exist between two nodes for a given PIM chip configuration. This upper bound will create a lower bound on the number of CCs required to send a message from one node to the next (as the maximum number of wires will maximize our available bandwidth and minimize the number of CCs required). Assuming we can move one bit, on each wire, during each CC, the number of CCs required to move a message of size L is: Eq. 3 # of CCs = L # of wires We apply the ceiling function to this result as we must have an integer number of CCs: ! Eq. 4
" $ L # of CCs = # # # of wires% % The number of wires is a function of the area per node (on a PIM chip), the spacing between parallel wires, ! technology scaling, and what elements might interrupt potential wiring channels (i.e. vias, other dedicated wires, etc.) As shown in Fig. 3, in this section we will assume all PIM nodes have equal area. By dividing the total PIM chip area by the number of nodes (here the number of integer datapaths) (Fig. 3a), we can determine the area of one node (Fig. 3b). Taking the square root of this area provides an estimate for the length of one edge. By considering variables like wire pitch (Fig. 3c), we can estimate the maximum number of wires between two nodes. 4.1.1 First approximation: Here we expand Eq. 4. Specifically, we want to calculate the maximum number of wires that could exist between two on-chip PIM nodes. As seen in Fig. 3c, the number of wires will ultimately be limited by a node's edge length and the number of wires that can be placed between two sides. Thus, we can transform Eq. 4 to
Eq. 5 ­ where the denominator represents how many bits we can move between nodes in 1 CC. Eq. 5 # of CCs = L Area of PIM chip # of on - chip nodes pitch between wires ! 4.1.2 Second approximation: Unfortunately, the number of bits that we could move between nodes in 1 CC will realistically be cut in half. With bidirectional channels, only half of the wires between two nodes can be used to send a message from node A to node B. The other half would be used to send a message from node B to node A. To account for this aspect of the design, we introduce a scaling factor n. If channels are bidirectional, n will equal 2. If not, n will equal 1. As a result, the number of CCs required to send a message from node A to B becomes Eq. 6. Fig. 3: (a) we divide the chip area into n equal parts. The square root of the node area (b) provides an estimate of the edge length ­ and an upper bound on the number of wires between nodes. Wire pitch (and any channel "interruptions") (c) set an upper bound for the maximum amount of data that can be moved simultaneously between 2 nodes. While this does not account for any metrics other than area, it is an ultimate upper bound ­ as this is the maximum number of wires that could possibly fit. Eq. 6 # of CCs = L Area of PIM chip # of on - chip nodes n " pitch between wires 4.1.3 Third approximation: Up to this point we have also assumed that every channel will be available to send a bit from one node to the next ! ­ and that the number of wires will just be a function of the length of one side of a node and the wire pitch. For a variety of reasons, realistically, not every wiring channel will be available. While specific data will not be available until actual layouts are done, to get a feel for how this might affect bandwidth, we simply introduce a variable that accounts for the "edge length lost" to the overhead discussed above (see Eq. 7).
Eq. 7 # of CCs = L
Area of PIM chip " length lost to overhead # of on - chip nodes n # pitch between wires 4.1.4 Fourth approximation: We next consider metal wiring pitch (denoted by in Fig. 4). Table 3 summarizes data from the 2005 ITRS ! Roadmap.
Node MPU M1 65 136 45 90 32 64 22 44 Intermediate Global DRAM M1 140 210 130 90 135 90 64 96 64 44 66 44 Table 3: Wire pitches. However, we would like to consider pitch in the context of wire width. Intuitively, making wires wider will only reduce the number of wires between nodes, lower bandwidth, and possibly increase the number of CCs required to send a message between two nodes (as wire groups will have to be used more than one time). However, increasing wire width can help decrease the time required to move a bit of data over a fixed distance on a metal wire (thus reducing the third element of Eq. 2). This will be discussed in more detail in the next section. However, regardless of the wire width ultimately selected, it will still be the same for the second and third components of Eq. 2 and we introduce a scaling factor now 4. To consider with width (W) as a function of pitch (), we will refer to Fig. ­ a cross section of two parallel metal wires in the same metal layer. One common g assumption [Saleh 2005] is that the width of the wire is equal to half the pitch. Thus, if we were looking at intermediate interconnect for the 45 nm node, W W W would equal 45 and g would also equal 45 nm. But what happens if we scale our wire width up? One option would simply be to assume the same approximation (i.e. if W = 90 nm, g would also equal 90 nm). However, a more appropriate measure may be to consider g as a function of . is generally considered to be Fig. 4: Pitch and width. some fraction of the technology node (i.e. = a * technology node where a is usually 0.5). Two parallel metal wires are required to have a spacing of x (where x is usually 2 or 3). We believe that the second option (g is a function of ) is the most reasonable, but augmented our formula for the number of CCs due to latency such that either option could be used simply by setting the two scaling factors a and x appropriately. The net result is Eq. 8. Eq. 8 # of CCs = L
Area of PIM chip " length lost to overhead # of on - chip nodes n # (Ww + (x # a # technology node)) Note that we have also introduced the scaling factor w ­ which is simply a way to scale the width W. The base value for W is just assumed to be one half of the metal pitch given in Table 3. Thus, if we did not want to scale W ! and wanted to assume intermediate interconnect for the 45 nm node, W would initially be 45 nm, w would equal 1, x would equal 2, a would equal ½, and the technology node would be 45 nm. Plugging via pitch these numbers into the above formula would reveal a wire pitch of 90 nm ­ exactly what is provided by the roadmap. 4.1.5 Fifth Approximation: While the improved calculation for wire pitch (and hence the number of possible wires) is good, a better approximation for the number of possible wires would consider the pitch between vias. Throughout this document we will assume that the interconnection network that is implemented will sit above the logic in higher layers of metal. As a result, we will Fig. 5: Via Pitch. have to connect a wire in the interconnection network to a wire in the logic. Below, we augment our equation for the number of CCs due to latency to consider the pitch between vias.
Generally, a via will be 1 value greater than the width of the wire itself [Rabaey 1996]. With this in mind, we change Ww to (Ww + ). Additionally, the spacing between vias is generally 3. We leave this in terms of a
4 Ultimately, we will want to find a wire width that minimizes the number of CCs due to wire length and does not negatively affect bandwidth. variable however and note that the via (and hence wire) pitch will be dictated by (Ww + ) +s (where s is usually 3). Again, as a short example, let's again assume the 45 nm node and no width scaling. Metal pitch = 90 nm, w = 1, = 45/2 = 22.5 nm, and s = 3. (Ww + ) + s = (45 * 1 + 22.5) + (3)(22.5) = 135 nm ­ indicating that there would be 135 nm between two parallel wires. If we merge all of the above information into one formula, we arrive at Eq. 9.
) , + . + Area of PIM chip "# of channels lost due to "overhead". + . # of nodes + )% , . 1 'if wire : ( + Ww + (x # a # node)) (x ~ 2 - 3, a ~ ). . n # +& 2. . + +' + Ww + $ ) + s$ (s ~ 3) *(if via : ( - . + . Eq. 9 4.1.6 Sixth Approximation: Finally, we consider the "Area of PIM chip" and "# of nodes". This will be determined by leveraging the document ! "Notes on Processor-in-Memory Component Sizings" in [Brockman 2003]. This document considers the state of the art for commodity DRAM, embedded DRAM, integer datapaths, floating point units, and cache SRAM, and normalizes their area in terms of process technology. Table 4 summarizes the functional unit sizing (embedded DRAM data is for Cu-8 process; the rest are averages)
Functional Unit 1 MB Commodity DRAM 1 MB Hi-Density Embedded DRAM 1 MB Hi-Performance Embedded DRAM 1 MB SRAM Cache 32-bit Integer Unit/Core 64-bit Floating Point Unit Area (10 6 x L p2) 96 206 434 2120 69 169 Table 4: Normalized areas 4.1.6.7 A short example We can use this data to estimate the area required for a chip with certain functional unit specifications. For example, let's say that we wanted to fabricate a PIM chip in a 45 nm process with 32 integer units, 32 KB of cache per 4 integer units/1 FP unit group, and 64 MB of commodity DRAM. The area for this configuration would be calculated as follows:
DRAM (mm2) Integer Unit (mm2) FP Unit (mm2) Cache Area (mm2) Total: 96 x 106 x (45x10-6)2 69 x 106 x (45x10-6)2 169 x 106 x (45x10-6)2 2120 x 106 x (45x10-6)2 = 0.879 mm 2 / MB = 0.140 mm 2 / unit = 0.342 mm 2 / unit = 0.034 mm 2 / MB = 0.879 x 64 = 0.140 x 32 = 0.342 x 8 = 0.034 x (32/1000) x 8 = 56.3 mm 2 = 4.48 mm 2 = 2.74 mm 2 = 0.0087 mm 2 = 63.5 mm 2 Table 5: Example PIM configuration and area. Thus, the total area for the configuration referenced in Table 5 would be 63.5 mm2. (Note that all data to be reported here assumes this configuration ­ except that commodity DRAM has been replaced with HP DRAM. Other configurations that might be considered are summarized in Table 6. Our simulation suite is well suited for this task ­ as only the number of functional units will need to be changed in our spreadsheet and the rest of this analysis will be automatically redone.)
Configuration Integer Only Single Shared FPU 1 FPU per node 4 SIMD FPU per node DRAM (MB) 64 64 64 64 Integer Units 4 4 4 4 FP Units 0 1 4 16 Cache (KB) 32 32 32 32 Table 6: Other PIM configurations. 4.2 Putting it all together Using information reported in Section 4.1, we determine the allowable message size before wire groups will have to be reused. Table 7 reports this information for the PIM configuration discussed above as a function of both wire and via spacing for both high-performance embedded DRAM as well as commodity DRAM. Each column assumes a different percentage of wire channels (or the edge length of a node) will be unavailable.
0% 10,800 7,200 5,700 3,800 16.7% 9,000 6,000 4,800 3,200 20% 8,700 5,800 4,600 3,100 25% 8,100 5,400 4,300 2,900 33% 7,200 4,800 3,800 2,600 50% 5,400 3,600 2,900 1,900 Wire pitch (HP) Via pitch (HP) Wire pitch (C) Via pitch (C) Table 7: Message size (bits) before wire set is reused (considering percentage of edge lost) (45 nm). Table 7 reports data assuming the 45 nm technology node. As one can see, the number of bits could theoretically be quite high. A natural question to ask is whether or not allowable message sizes will vary as the technology node changes. As seen in Table 8, this is not the case.
65 nm 3,600 1,900 45 nm 3,600 1,900 32 nm 3,600 1,900 22 nm 3,600 1,900 Via pitch (HP) Via pitch (C) Table 8: Message size (bits) before wire set is reused (33% edge area lost) Finally, we consider the number of times a wire group must be reused as a function of width scaling and message size (Fig. 6). Assuming the 45 nm node and 33% lost edge area, for reasonable message sizes, wire groups should not have to be reused. Fig. 6: (a) HP DRAM, (b) Commodity DRAM ­ times wires are reused as a function of width & message size 5 CCs due to Wire Delay
The analysis presented in Section 4 allows us to determine whether or not we will have enough wires to get a message from one node to the next without having to reuse a subset of those wires one or more times. Here, we estimate how long of a wire will be required to connect one node to the next ­ and whether or not it is traversable in one CC. 5.1 Sketch of Calculation Each time we send a message from node A to node B, we will need to traverse a wire of some particular length. This is a function of the complexity of the PIM configuration, the process technology, and the size of the die. We will also be at least somewhat bound by the CC period ­ in that a bit of information will only be able to move on a wire so far in a given amount of time (in this case the time for one CC). In equation form, we can formulate this problem as shown in Eq. 10. All units but seconds will cancel ­ and the number of seconds will be a multiple of the number of CCs. Eq. 10
Distance we need to travel s/CC distance/CC However, like before, we will need to take the ceiling of this number in order to get an integer number of CCs as ! seen in Eq. 11:
" $ #Distance we need to travel% # % s/CC # % # % distance/CC Eq. 11 We now expand Eq. 11 in order to analyze different PIM configurations in different technology nodes. ! 5.1.1 First approximation: We begin by considering the distance reachable in an unbuffered wire. The delay (or amount of time required to go a distance L) is given by Eq. 12 where Rw is the wire's resistance per unit length, Cw is the capacitance per unit length, and L is the length of the wire segment in question. Eq. 12 Dunbuffered = 0.4 " Rw " Cw " L2 We can leverage this relation to consider the distance reachable in a single CC. Specifically, we solve for L and set D equal to the period of the ! CC. Thus, the maximum reachable distance in a CC with period D is given by Eq. 13. Eq. 11 is thus transformed into Eq. 14. Eq. 13 L= D 0.4RwCw !
Eq. 14 " $ #Distance needed to travel% # % D # % # % 0.4RwCw 5.1.2 Second approximation: ! We now turn our attention to the distance that we need to travel. This is a function of chip area (and hence a specific PIM configuration) and the interconnection network topology chosen. The chip area is determined by the information presented in Section 4.1.6. The maximum wire distance needed for a specific topology can be estimated by the equations that appear in Table 2 [Saleh 2005].
In general, when thinking about IC wire delay, switched-based architectures offer advantages in that wires between IP blocks and between switches are logically structured ­ and lengths are largely predictable and consistent across a network. The wire area estimation problem involves determining the longest wire segment in each architecture and distribution. Long wire segments block wiring channels and force other wires to become longer. [Saleh 2005] notes that there will be additional area consumed by wires than what is predicted by a first order analysis. Assuming this overhead, their aim is to estimate the wire lengths in all interconnection network architectures under consideration. In an NOC environment inter-switch wire segments are generally the longest on-chip wires except for those dedicated to the clock, power, and ground. Because of the structured nature of NOC-based interconnection network, we can get good estimates for inter-switch wire lengths. Interconnection Network Ring Mesh Torus Tree Max Wire Distance Needed ! ! w a +1,a Chip Area N "1 Chip Area w = N "1 2 " Chip Area w = N #1 Chip Area = ; levels = log4 N 2levels "a w = ! Table 9: Maximum wire lengths for different interconnect network topologies. N is the number of on-chip nodes.
5.1.3 Third approximation: ! We now turn our attention to the resistance per unit length and capacitance per unit length variables of Eq. 14. Resistance per Unit Length (Rw) is defined by the resistivity of the wire () divided by the wire's thickness (T) times its width (W) (as seen in Eq. 15). has units of ( unit length) and the product TW has units of (unit length)2. Thus, /(length unit) falls out. Eq. 15 Rw = " TW Expected wire resistivities are predicted in the ITRS roadmap and summarized in Table 10.
Resistivity MPU (µ cm) Scattering -M1 -Intermediate -Global No Scattering No Scattering ! 65 3.47 3.43 2.73 2.2 2.2 45 4.08 4.08 3.1 2.2 2.2 32 4.83 4.83 3.52 2.2 2.2 22 6.01 6.01 4.2 2.2 2.2 MPU DRAM Table 10: Resistivites from ITRS roadmap. As mentioned in Section 4, a good approximation for an initial value of W is one half of the metal pitch. As seen in Fig. (a cross section of several layers of metal) width is not the only variable that we must consider. Here, T represents wire thickness, S represents intra-level dielectric thickness, and H represents the inter-level dielectric thickness. For now, we are only concerned with W and T. Like W, a first approximation for the base value of T is one half of the metal pitch (i.e. the same as W) 5. 5 Aspect ratios of approximately 2-3 are not uncommon [Ho 2003]. This will be considered/discussed later. Fig. 7: H, S, W, and T Capacitance per Unit Length (or Cw) is a function of the dielectic constant (d), the physical constant o (the permittivity of free space which is approximately 8.85x10-14 F/cm), the thickness of the wire T, the width of the wire W, and the fringe capacitance Cfringe (~ 0.04 fF/µm for all technology nodes) [Saleh 2004]. How we will determine W and T has already been discussed ­ and like resistivity, expected dielectrics for different technology nodes are provided by the ITRS Roadmap. Data points for specific technology nodes under consideration are summarized below in Table 11.
Capacitance MPU (unitless) Scattering Effective ­ Max Effective ­ Min Bulk (min. expected) No Scattering Effective ­ Max Effective ­ Min (fF/µm) value 65 3 2.7 2.4 4.1 3.6 0.04 45 2.8 2.55 2.2 3.4 3.1 0.04 32 2.4 2.1 2 3 2.7 0.04 22 2.2 1.9 1.8 2.8 2.5 0.04 MPU Fringe Table 11: Dielectrics. All of these parameters can be folded into one equation (Eq. 16) which estimates capacitance per unit length6: Eq. 16 #1+ 2(T /W )2 & Cw = 2"d"o% ( + Cfringe $ (T /W ) ' By substituting Eq. 15, Eq. 16, and the data in Table 9 into Eq. 14, we arrive at Eq. 17 which summarizes the number of CCs required (for a given clock period) to move a bit from one node to another in a given network ! topology. 6 This is "pessimistic" ­ i.e. the factor of 2 accounts for 2 different values being moved in opposite directions. Eq. 17 0 $ 2 Chip Area 1 &Ring : 3 N "1 1 & 3 1 & 3 Chip Area 1 &Mesh : 3 N "1 1 & 3 % 1 & 3 2 # Chip Area 1 & Torus : 3 N "1 1 & 3 1 & 3 Chip Area ; levels = log4 N 3 1 & Tree : w a +1,a = 2levels "a 1 ' 3 CC Period 1 3 2, 1 ) ,3 )1+ 2(T /W ) ) ( , 1 0.4 # +Rw = .+Cw = 2/d/o+ . + Cfringe. 3 * TW -* * (T /W ) -3 1 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 5.1.4 Fourth Approximation As in Section 4, we will also introduce W and T scaling factors (with w as the scaling factor for the width and t as ! the scaling factor for the thickness). Again, the base value for W is just assumed to be one half of the metal pitch given in Table 3. Similarly, the base value of T is also assumed to be one half of the metal pitch given in Table 3. Thus, Eq. 17 becomes: Eq. 18 0 2 $ Chip Area 1 3 &Ring : N "1 1 3 & 1 3 & Chip Area 1 3 &Mesh : & N "1 1 3 % 1 3 & 2 # Chip Area Torus : 1 3 & N "1 1 3 & 1 3 Chip Area & Tree : w a +1,a = ; levels = log4 N 1 3 & ' 2levels "a 1 3 CC Period 1 3 2, 1 ) ,3 )1+ 2(Tt /Ww ) ) ( , 1 0.4 # +Rw = .+Cw = 2/d/o+ . + Cfringe. 3 * TtWw -* * (Tt /Ww ) -3 1 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 ! We next consider how far we can go on a wire. The minimum conceivable clock cycle time (for a highly pipelined design style) is assumed to be equal to 15 FO47. We will calculate 15FO4 and solve the above equation for L to determine the maximum distance a bit of information can move on an unbuffered wire (as a function of technology). Similarly, we will solve the above equation for L and consider the maximum distance a bit of information can move on an unbuffered wire as a function of slower clock rates (currently 500, 50 and 1000 MHz). Assuming that a wire's width and thickness are identical, the maximum distance that a bit of data could traverse in an unbuffered wire is illustrated in Fig. 8. As expected, reachable distance increases as aspect ratio grows. (a) (b) Fig. 8: Reachable distance as a function of technology node. (a) Aspect Ratio = 1, (b) = 2. 7 Estimates for F04 include (a) 425*Lmin (ps) [Saleh 2004] or (b) 500*Lmin (ps) [Ho 2003]. Lmin is assumed to be in nm. We assume option (a) in this work. Fig. 8 tells us how long of a wire we can traverse in a single clock cycle as a function of technology. Here, we will compare that information to the distance that we need to traverse for a given topology. Fig. 9 considers reachable distance and needed distance for two different topologies and two different DRAM types. Both curves assume global wiring pitches and bulk resistivity. Note that for almost every topology and DRAM type, for at least one of the four clock rates considered, unbuffered wire could be used (b). For a lower aspect ratio (a), only a mesh interconnection network with commodity DRAM appears to work well with unbuffered wire. Reachable Distance in Unbuffered Wire Reachable Distance in Unbuffered Wire Fig. 9: Reachable distance versus needed distance. (a) Aspect Ratio = 1, (b) = 2. (What this means is that if we wanted to wire up this particular PIM configuration ­ with high performance embedded DRAM for example ­ with a torus interconnection network, we would either need to add repeaters in our wires or spend multiple clock cycles moving a bit over the wire span.) 6 Putting it all together:
Here, we tie together the information discussed in Sections 3-5 in the context of Eq. 2, rewritten in the context of Eq. 9 and Eq. 18. In Fig. 10, for each technology node, we consider the maximum distance reachable in one, 1 GHz CC as a function of wire width and wire thickness scaling (w and t variables). An aspect ratio of approximately 2 is also highlighted. We also consider the total number of clock cycles to send a message from one node to the next (leveraging data in Table 2) in the absence of contention for a mesh network. This analysis is primarily meant to show how the reachable distance scales with w and t, and how this affects the amount of time required to send a message from one node to the next. (a) 65 nm (b) 45 nm (c) 32 nm (d) 22 nm Fig. 10: Mesh networks assuming PIMs of HP DRAM at (a) 65 nm, (b) 45 nm, (c) 32 nm, and (d) 22 nm. A perhaps more interesting analysis appears in Fig. 11 In Fig. 11, for the 65 nm technology node we consider the number of 1 GHz CCs needed to send a message over each of the four interconnection network topologies discussed above, for that networks average number of hops, as a function of w and t. Again, an aspect ratio of approximately 2 is highlighted. Data points for several w, t ratios also appear in Table 12. aspect ratio ~2 aspect ratio ~2 (a) Ring (b) Mesh aspect ratio ~2 aspect ratio ~2 (c) Torus (d) Tree Fig. 11: (a) Ring network, (b) Mesh network, (c) Torus network, and (d) Tree network. w 1 1.5 2 1 1 1.5 t 1 1 1 1.5 2 1.5 Ring 24 24 24 24 24 16 Mesh 16 16 16 16 16 10.67 Torus 24 20 20 24 20 16 Tree 14.5 14.5 14.5 14.5 14.5 16 Table 12: Number of CCs for various interconnection networks as a function of w, t scaling (same PIM config.) From this analysis, tree networks seemingly perform best, followed by mesh networks. However, it is important to reiterate that this is just a first approximation, and (among other things) does not consider contention. However, this study is already underway ­ and we will target meshes and trees first based on the information presented here. 7 Buffered Wire:
Finally, we briefly consider how interconnection network performance would be affected by buffered wire (with repeaters). We assume the simple delay model (reported in [Ho 2003] among other places) that models a long wire broken up by CMOS inverters as a collection of resistors and capacitors. In this model, important parameters include: · · · · The transistor width (w) The wire length (l) The normalized sum of PMOS and NMOS gate widths (a) Technology parameters that represent transistor drive and parasitics (Rv, Cd, Cg, Rw, and Cw) The delay for one stage of a repeated wire (between inverters) is given as: Eq. 19 Where: R C & # Dstage = 0.7 ' $ Rv a ( g + Cd )+ v Cwl + Rw w l 2 + RwlaC g w! C w 2 % " Eq. 20 lopt = RvC g RwCw Eq. 21 wopt = 1 3 RvCw RwC g For this study, we will look at data from the 65 nm node ­ not from the ITRS roadmap, but rather from [Balfour 2006] ­ which compiles the parameters summarized in Table X specifically for this node (via [Bai, Chat., Luo, 2004]) 8. Local Technology Node:
8 Semi-global 65 65 Global 65 [Bai, Chat., Luo 2004] represent a very close look at a specific technology node. Therefore, we felt that this data might be a bit more insightful (and will also provide a point of comparison to other CMP-based systems). Nevertheless, it is worth noting that data points (i.e. for Rw and Cw) match up quite will with the analysis discussed here. Rv (/mm) C g (F/mm) C d (F/mm) W (mm) T (mm) R w (/mm) C w (F/mm) lopt (mm) w opt (mm) a Dstage (with ideals) (s) Target Clock Rate (MHz): Target Clock Period (s): Reachable Distance (max, mm): 1.625 9.5E-13 1.14E-12 0.0001 1.80E-04 1550 1.80E-13 0.22 0.00814 3 2.32251E-11 1000 0.000000001 9.61 1.625 9.5E-13 1.14E-12 0.0002 0.00036 350 2.20E-13 0.42 0.01893 3 2.32251E-11 1000 0.000000001 18.29 1.625 9.5E-13 1.14E-12 0.0004 0.00072 80 2.40E-13 0.85 0.04136 3 2.32251E-11 1000 0.000000001 36.63 Table 13: Data used in buffered wire analysis. Looking at Equations 19-21, we could consider delay as a function of any number of variables. Instead, we solve for l and will consider the stage delay as a function of the inverter width w 9. Using Eq. 19 and the data in Table Y, the theoretically reachable distance vs. aw is shown in Fig. 12 for local, semi-global, global interconnect. Also shown in Fig. 12 is an estimate of the number of lines possible between PIM nodes (estimated as the length of a PIM node divided by (w + ). Not surprisingly, for large w values, our inter-node bandwidth falls off quickly. Reachable Distance (Repeaters) as function of inverter width 35.000 30.000 25.000 20.000 )D (local )D (semi )D (global BW 10000 Reachable Distance 1000 100
15.000 10.000 5.000 0.000
9.1E-03 8.5E-03 7.8E-03 3.9E-03 4.6E-03 3.3E-03 1.0E-02 9.8E-03 7.2E-03 6.5E-03 5.9E-03 5.2E-03 2.6E-03 2.0E-03 1.3E-03 1.4E-02 6.5E-04 1.4E-02 1.2E-02 1.1E-02 1.6E-02 1.6E-02 1.5E-02 1.3E-02 1.2E-02 10 1 w (mm) 9 Solving for w and considering the optimal segment plus estimate of numberuseful ­ as ultimately we will have to Fig. 12 : Reachable distance as a function of w , length l is not nearly as of wires between PIM nodes. traverse the distance nl ­ where n is the number of segments of length l needed to traverse the distance for a particular interconnection network for a particular PIM configuration. On the other hand, scaling w can affect the time required to traverse a wire segment as well as the number of wires between nodes. Lines between PIM nodes A more interesting question to consider might be how wires with repeaters affect performance at the architectural level. (In other words, how many CCs are required to send a message from node A to node B, over the average number of hops for a given number of nodes, for a ring, mesh, torus, and tree network. These results are shown in Fig. 13. Also shown in Fig. 13 are estimates for the number of CCs required for unbuffered wire, for each interconnection network, at the 65 nm node (using data from [Balfour 2006] for consistency) 10.
CCs to send message via average number of hops
60 50 )Ring (Semi )Tree (Semi .)Torus (unbuf )Mesh (Semi )Ring (unbuf .)Tree (unbuf )Torus (Semi .)Mesh (unbuf 40 CCs 30 20 10 0 As one can see, in some cases, buffered wire performs better than unbuffered wire ­ but for every network but the torus, we can move a bit of data from one node to the next in the minimum number of hops. This is important as otherwise we would have to account for the area due to latches (just as with inverters). Data is summarized in Table 14. NW Distance Needed: (mm) CCs (unbuffered, local wire) CCs (unbuffered, semi wire) CCs (unbuffered, global wire) 62 1. 7E 56 -0 1. 2E 2 49 -0 1. 7E 2 43 -0 1. 2E 2 36 -0 1. 7E 2 30 -0 1. 2E 2 23 -0 1. 7E 2 17 -0 1. 2E 2 10 -0 1. 7E 2 04 -0 9. 2E 2 76 -0 9. 5E 2 11 -0 8. 4E 3 46 -0 7. 3E 3 81 -0 7. 2E 3 16 -0 6. 1E 3 51 -0 5. 0E 3 85 -0 5. 9E 3 20 -0 4. 8E 3 55 -0 3. 7E 3 90 -0 3. 6E 3 25 -0 2. 5E 3 60 -0 1. 4E 3 95 -0 1. 3E 3 30 -0 6. 2E 3 51 -0 0E 3 -0 4 1. w (mm) Fig. 13: Estimates of network delay for buffered and unbuffered wire. Ring 3.41 16.00 8.00 8.00 Mesh 3.41 10.66 5.33 5.33 Torus 6.82 12.00 8.00 4.00 Tree 3.97 7.40 3.70 3.70 Table 14: CC estimates using data from [Balfour 2006] for unbuffered wire. 10 Semi-global wire data is used as semi-global interconnect will most likely be used for form the interconnection network. The power dissipated per node-to-node message can also be quite high. Estimates that leverage work in [Balfour 2006] appear in Fig. 14.
Power dissipated to send message between nodes 10.00000
Local Semi-global Global Watts to Send Message 1.00000
62 7E 54 -02 6 1. E46 02 5 1. E38 02 3 1. E30 02 2 1. E22 02 1 1. E13 02 9 1. E05 02 8E 9. 76 -02 5 8. E95 03 1 8. E13 03 7 7. E32 03 3 6. E51 03 0E 5. 69 -03 6 4. E88 03 2 4. E06 03 9 3. E25 03 5 2. E44 03 1 1. E62 03 7 8. E13 03 7E -0 4 1. 0.10000 0.01000 1. w (mm) Fig 14: Estimates of power dissipated to send 640 bit message from one node to next (activity factor =0.5). Future work will consider buffered wire versus unbuffered wire for specific topologies in more detail. As mentioned, we are already starting to consider contention. We will use these results to consider switch structures, traffic, specific configurations, etc. ­ and tie this work more heavily to specific implementations and layouts. References:
[Saleh 2005]: P. Pande, C. Grecu, M. Jones, A. Ivanov, and R. Saleh, "Performance Evaluation and Design Tradeoffs for Network-on-Chip Interconnect Architectures," IEEE T. on Computers, Vol 54, No. 8, August 2005. [Balfour 2006]: J. Balfour, "Design Tradeoffs for Tiled CMP On-Chip Networks", at the 2006 International Conference on Supercomputing, June/July 2006, Cairns, Australia. [ITRS]: [Rabaey]: [Saleh 2004]: International Technology Roadmap for Semiconductors, 2005 Edition. J. Rabaey, "Digital Integrated Circuits: A Design Perspective," Prentice Hall, New Jersey, 1996. C. Grecu, P. Pande, A. Ivanov, and R. Saleh, "A Scalable Communication-Centric SoC Interconnect Architecture," 5th International Symposium on Quality Electronic Design, 2004, p. 343-348. W. Dally and B. Towles, "Principles and Practices of Interconnection Networks," Morgan Kauffman, New York, 2003. J. Brockman, "Notes on Processor-in-Memory Component Sizings," July 21, 2003. Ron Ho, "On-chip Wires: Scaling and Efficiency,", Ph.D. Dissertation, Stanford University, August, 2003. P. Bai, et. al, "A 65 nm logic technology featuring 35 nm gate lengths, enhanced channel strain, 8 cu interconnect layers, low-k ild and 0.57 µm2 SRAM cell," in Electronic Devices Meeting, 2004 IEDM Technical Digest. IEEE International, Dec. 2004, pp. 657-660. [Dally 2003]: [Brock. 2003] [Ho 2003] [Bai 2004] [Chat. 2004] A. Chatterjee, et. al., "A 65 nm CMOS technology for mobile and digital signal processing applications," in Electronic Devices Meeting, 2004 IEDM Technical Digest. IEEE International, Dec. 2004, pp. 665-668. Z. Luo, et. al., "High performance and low power transistors integrated in 65 nm bulk CMOS technology," in Electronic Devices Meeting, 2004 IEDM Technical Digest. IEEE International, Dec. 2004, pp. 661-664. [Luo 2004]