8/13/2019 smm_cap1
Course lecturer: Dr. Eng. Ioan [email protected]
Duration: 14 weeks x 3 hours
How is grading done? Final grade: NF = 45%·NE + 45%·NL + 10%·PC
NF = Final Grade
NE = Exam Grade, NL = Lab Grade
PC = Course Attendance
The exam is written.
Three problem-style questions are given.
Course attendance can be compensated by an additional course-related assignment, with the instructor's approval.
The exam grade is computed from the executed and graded activities and must be at least 45% (4.5).
Bonuses of up to 10% are given for extra activities
(participation in contests, student conferences, projects, etc.)
Bibliography
1. Francisc Iacob, Sisteme multimicroprocesor, Editura Victor, 2000, ISBN 973-8128-05-6.
2. Kai Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability, McGraw-Hill, 1993, ISBN 0-07-113342-9.
3. Hesham El-Rewini (Southern Methodist University) and Mostafa Abd-El-Barr (Kuwait University), Advanced Computer Architecture and Parallel Processing, John Wiley & Sons, 2005, ISBN 0-471-46740-5.
4. Distributed and Parallel Systems: Cluster and Grid Computing, Springer, 2005, eBook ISBN 0-387-23096-3.
5. Joshy Joseph and Craig Fellenstein, Grid Computing, Prentice Hall, 2003, ISBN 0-13-145660-1.
Topics covered in the course
Multiprocessor architectures
Interconnection buses: PCI, PCIe, InfiniBand
Multi-microprocessor chips: Intel Xeon, Cell BE
Cluster computing
Grid computing
Introduction
1.1 Why Parallel Processing?
Fig. 1.1 The exponential growth of microprocessor performance, known as Moore's Law, shown over the past two decades (extrapolated). [Figure: calendar year (1980-2010) vs. processor performance (KIPS to TIPS, log scale); data points: 68000, 80286, 80386, 80486, 68040, Pentium, Pentium II, R10000; trend: ~1.6x / yr.]
Evolution of Computer Performance/Cost
Mental power in four scales
From: "Robots, After All," by H. Moravec, CACM, pp. 90-97, October 2003.
The Semiconductor Technology Roadmap
From the 2001 edition of the roadmap [Alla02]
Calendar year      2001  2004  2007  2010  2013  2016
Half-pitch (nm)     140    90    65    45    32    22
Clock freq. (GHz)     2     4     7    12    20    30
Wiring levels         7     8     9    10    10    10
Power supply (V)    1.1   1.0   0.8   0.7   0.6   0.5
Max. power (W)      130   160   190   220   250   290
Factors contributing to the validity of Moore's law: denser circuits; architectural improvements.
Measures of processor performance:
Instructions per second (MIPS, GIPS, TIPS, PIPS)
Floating-point operations per second (MFLOPS, GFLOPS, TFLOPS, PFLOPS)
Running time on benchmark suites
Why High-Performance Computing?
Higher speed (solve problems faster). Important when there are hard or soft deadlines; e.g., 24-hour weather forecast.
Higher throughput (solve more problems). Important when there are many similar tasks to perform; e.g., transaction processing.
Higher computational power (solve larger problems). E.g., weather forecast for a week rather than 24 hours, or with a finer mesh for greater accuracy.
Categories of supercomputers:
Uniprocessor; aka vector machine
Multiprocessor; centralized or distributed shared memory
Multicomputer; communicating via message passing
Massively parallel processor (MPP; 1K or more processors)
The Speed-of-Light Argument
The speed of light is about 30 cm/ns.
Signals travel at a fraction of the speed of light (say, 1/3).
If signals must travel 1 cm during the execution of an instruction, that instruction will take at least 0.1 ns; thus, performance will be limited to 10 GIPS.
This limitation is eased by continued miniaturization and architectural methods such as cache memory; however, a fundamental limit does exist.
How does parallel processing help? Wouldn't multiple processors need to communicate via signals as well?
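The back-of-the-envelope argument above can be checked directly; a minimal sketch (variable names are mine, not from the slides):

```python
# Speed-of-light limit on a single processor, per the argument above.
SPEED_OF_LIGHT_CM_PER_NS = 30.0
signal_speed = SPEED_OF_LIGHT_CM_PER_NS / 3       # signals at ~1/3 c -> 10 cm/ns
distance_cm = 1.0                                 # signal path per instruction
min_instr_time_ns = distance_cm / signal_speed    # 0.1 ns per instruction
max_ips = 1e9 / min_instr_time_ns                 # instructions per second
print(f"{max_ips / 1e9:.0f} GIPS")                # -> 10 GIPS
```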
Why Do We Need TIPS or TFLOPS Performance?
Reasonable running time = fraction of an hour to several hours (10^3-10^4 s).
In this time, a TIPS/TFLOPS machine can perform 10^15-10^16 operations.
Example 1: Southern oceans heat modeling (10-minute iterations): 300 GFLOP per iteration x 300 000 iterations per 6 yrs = 10^16 FLOP.
Example 2: Fluid dynamics calculations (1000 x 1000 x 1000 lattice): 10^9 lattice points x 1000 FLOP/point x 10 000 time steps = 10^16 FLOP.
Example 3: Monte Carlo simulation of nuclear reactor: 10^11 particles to track (for 1000 escapes) x 10^4 FLOP/particle = 10^15 FLOP.
Decentralized supercomputing (from MathWorld News, 2006/4/7): a grid of tens of thousands of networked computers discovers 2^30402457 - 1, the 43rd Mersenne prime, as the largest known prime (9 152 052 digits).
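The arithmetic in these examples can be sanity-checked; a small sketch (the digit count of a Mersenne number 2^n - 1 is floor(n·log10 2) + 1):

```python
# Check the FLOP-count arithmetic and the Mersenne-prime digit count above.
import math

fluid = 1e9 * 1000 * 10_000      # lattice points x FLOP/point x time steps
monte = 1e11 * 1e4               # particles x FLOP/particle
digits = math.floor(30402457 * math.log10(2)) + 1   # digits of 2^30402457 - 1

print(f"fluid dynamics: {fluid:.0e} FLOP")   # -> 1e+16
print(f"Monte Carlo:    {monte:.0e} FLOP")   # -> 1e+15
print(f"M43 digits:     {digits}")           # -> 9152052
```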
The ASCI Program
Fig. 24.1 Milestones in the Accelerated Strategic (Advanced Simulation &) Computing Initiative (ASCI) program, sponsored by the US Department of Energy, with extrapolation up to the PFLOPS level. [Figure: calendar year (1995-2010) vs. performance (1-1000 TFLOPS, log scale), with Plan/Develop/Use phases; milestones: ASCI Red (1+ TFLOPS, 0.5 TB), ASCI Blue (3+ TFLOPS, 1.5 TB), ASCI White (10+ TFLOPS, 5 TB), ASCI Q (30+ TFLOPS, 10 TB), ASCI Purple (100+ TFLOPS, 20 TB).]
The Quest for Higher Performance
Top Three Supercomputers in 2005 (IEEE Spectrum, Feb. 2005, pp. 15-16)
1. IBM Blue Gene/L - LLNL, California. Material science, nuclear stockpile sim. 32,768 procs, 8 TB, 28 TB disk storage. Linux + custom OS. 71 TFLOPS, $100 M. Dual-proc PowerPC chips (10-15 W power). Full system: 130k procs, 360 TFLOPS (est).
2. SGI Columbia - NASA Ames, California. Aerospace/space sim, climate research. 10,240 procs, 20 TB, 440 TB disk storage. Linux. 52 TFLOPS, $50 M. 20x Altix (512 Itanium 2) linked by InfiniBand.
3. NEC Earth Simulator - Earth Sim Ctr, Yokohama. Atmospheric, oceanic, and earth sciences. 5,120 procs, 10 TB, 700 TB disk storage. Unix. 36 TFLOPS*, $400 M? Built of custom vector microprocessors. Volume = 50x IBM, power = 14x IBM.
* Led the Top500 list for 2.5 yrs.
The Quest for Higher Performance: 2008 Update
Top Three Supercomputers in June 2008 (http://www.top500.org)
1. IBM Roadrunner - LANL, New Mexico. Nuclear stockpile calculations, and more. 122,400 procs, 98 TB, 0.4 TB/s file system I/O. Red Hat Linux. 1.38 PFLOPS, $130 M. PowerXCell 8i 3.2 GHz + AMD Opteron (hybrid). 2.35 MW power, expands to 1M procs.
2. IBM Blue Gene/L - LLNL, California. Advanced scientific simulations. 212,992 procs, 74 TB, 2 PB disk storage. CNK/SLES 9. 0.596 PFLOPS, $100 M. PowerPC 440, 700 MHz. 1.60 MW power, expands to 0.5M procs.
3. Sun Blade X6420 - U Texas Austin. Open science research. 62,976 procs, 126 TB. Linux. 0.504 PFLOPS*. AMD x86-64 Opteron quad-core, 2 GHz. 2.00 MW power, expands to 0.3M procs.
* Actually 4th on the Top500 list, with the 3rd being another IBM Blue Gene system at 0.557 PFLOPS.
Supercomputer Performance Growth
Fig. 1.2 The exponential growth in supercomputer performance over the past two decades (from [Bell92], with ASCI performance goals and microprocessor peak FLOPS superimposed as dotted lines). [Figure: calendar year (1980-2010) vs. supercomputer performance (MFLOPS to PFLOPS, log scale); data points: Cray X-MP, Y-MP, CM-2, CM-5, vector supers, $240M MPPs, $30M MPPs, ASCI goals; micros: 80386, 80860, Alpha.]
What Exactly is Parallel Processing?
Parallelism = Concurrency
Doing more than one thing at a time.
Has been around for decades, since early computers: I/O channels, DMA, device controllers, multiple ALUs.
The sense in which we use it in this course: multiple agents (hardware units, software processes) collaborate to perform our main computational task, e.g.:
- Multiplying two matrices
- Breaking a secret code
- Deciding on the next chess move
ABCs of Parallel Processing in One Slide
A  Amdahl's Law (Speedup Formula)
Bad news: sequential overhead will kill you, because:
Speedup = T1/Tp <= 1/[f + (1 - f)/p] <= min(1/f, p)
Moral: for f = 0.1, speedup is at best 10, regardless of peak OPS.
B  Brent's Scheduling Theorem
Good news: optimal scheduling is very difficult, but even a naive scheduling algorithm can ensure:
T1/p <= Tp <= T1/p + Tinf = (T1/p)[1 + p/(T1/Tinf)]
Result: for a reasonably parallel task (large T1/Tinf), or for a suitably small p (say, p << T1/Tinf), good speedup and efficiency are possible.
C  Cost-Effectiveness Adage
Real news: the most cost-effective parallel solution may not be the one with the highest peak OPS (communication?), greatest speedup (at what cost?), or best utilization (hardware busy doing what?).
Analogy: mass transit might be more cost-effective than private cars even if it is slower and leads to many empty seats.
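The two bounds above can be checked numerically; a minimal sketch (function names are mine, not from the slides):

```python
# Numeric check of Amdahl's law and Brent's scheduling bounds.
def amdahl_speedup(f, p):
    """Upper bound on speedup with sequential fraction f on p processors."""
    return 1.0 / (f + (1.0 - f) / p)

def brent_time_bounds(t1, t_inf, p):
    """Brent: T1/p <= Tp <= T1/p + Tinf (t1 = serial work, t_inf = span)."""
    return t1 / p, t1 / p + t_inf

# f = 0.1: speedup saturates near 10 no matter how many processors we add.
for p in (10, 100, 10_000):
    print(p, round(amdahl_speedup(0.1, p), 2))   # 5.26, 9.17, 9.99
```

Note how the p = 10,000 case barely improves on p = 100: this is the "sequential overhead will kill you" message of part A.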
Historical Perspective
Types of Parallelism
Note: many overlaps
lookahead & pipelining
vectorization
concurrency & simultaneity
data and control parallelism
partitioning & specialization
interleaving & overlapping of physical subsystems
multiplicity & replication
time & space sharing
multitasking & multiprogramming
multi-threading
distributed computing - for speed or availability
Architectural Evolution
Scalar Sequential -> Lookahead -> I/E Overlap -> Functional Parallelism -> Multiple Func. Units / Pipeline
Pipeline -> Implicit Vector (memory-to-memory) / Explicit Vector (register-to-register)
Explicit Vector -> SIMD / MIMD
SIMD -> Associative Processor / Processor Array
MIMD -> Multicomputer / Multiprocessor
Parallel Computer Architectures
(a) On-chip parallelism. (b) A coprocessor. (c) A multiprocessor. (d) A multicomputer. (e) A grid.
Instruction-Level Parallelism
(a) A CPU pipeline. (b) A sequence of VLIW instructions.
(c) An instruction stream with bundles marked.
The TriMedia VLIW CPU (1)
A typical TriMedia instruction, showing five possible operations.
The TriMedia VLIW CPU (2)
The TM3260 functional units, their quantity, latency, and which instruction slots they can use.
The TriMedia VLIW CPU (3)
The major groups of TriMedia custom operations.
The TriMedia VLIW CPU (4)
(a) An array of 8-bit elements. (b) The transposed array.
(c) The original array fetched into four registers.
(d) The transposed array in four registers.
On-Chip Multithreading (1)
(a)-(c) Three threads. The empty boxes indicate that the thread has stalled waiting for memory. (d) Fine-grained multithreading. (e) Coarse-grained multithreading.
On-Chip Multithreading (2)
Multithreading with a dual-issue superscalar CPU.
(a) Fine-grained multithreading.
(b) Coarse-grained multithreading.
(c) Simultaneous multithreading.
Hyperthreading on the Pentium 4
Resource sharing between threads in the
Pentium 4 NetBurst microarchitecture.
Homogeneous Multiprocessors on a Chip
Single-chip multiprocessors.
(a) A dual-pipeline chip. (b) A chip with two cores.
Heterogeneous Multiprocessors on a Chip (1)
The logical structure of a simple DVD player: a heterogeneous multiprocessor with multiple cores for different functions.
Heterogeneous Multiprocessors on a Chip (2)
An example of the IBM CoreConnect architecture.
Introduction to Networking (1)
How users are connected to servers on the Internet.
Introduction to Networking (2)
A packet as it appears on the Ethernet.
Introduction to Network Processors
A typical network processor board and chip.
The Nexperia Media Processor
The Nexperia heterogeneous multiprocessor on a chip.
Multiprocessors
(a) A multiprocessor with 16 CPUs sharing a common memory. (b) An image partitioned into 16 sections, each being analyzed by a different CPU.
Multicomputers (1)
(a) A multicomputer with 16 CPUs, each with its own private memory. (b) The bit-map image of Fig. 8-17 split up among the 16 memories.
Multicomputers (2)
Various layers where shared memory can be implemented. (a) The hardware. (b) The operating system. (c) The language runtime system.
Hybrid approach
In short, programming a multicomputer is much more difficult than programming a multiprocessor.
Under these conditions, why would anyone build multicomputers, when multiprocessors are easier to program? The answer is simple: large multicomputers are much simpler and cheaper to build than multiprocessors with the same number of CPUs.
Implementing a memory shared by even a few hundred CPUs is a substantial undertaking, whereas building a multicomputer with 10,000 CPUs or more is straightforward.
Hybrid approach - cont
Thus we have a dilemma: multiprocessors are hard to build but easy to program, whereas multicomputers are easy to build but hard to program.
This observation has led to a great deal of effort to construct hybrid systems that are relatively easy to build and relatively easy to program.
This work has led to the realization that shared memory can be implemented in various ways, each with its own set of advantages and disadvantages.
In fact, much of the research in parallel architectures these days relates to the convergence of multiprocessor and multicomputer architectures into hybrid forms that combine the strengths of each.
The holy grail here is to find designs that are scalable, that is, continue to perform well as more and more CPUs are added.
Hybrid approachcont
One approachto building hybrid systems is based on the fact thatmodern computer systems are not monolithic but are constructed as
a series of layers. This insight opens the possibility of implementing the shared
memory at any one of several layers, as shown in Fig. 8-3.
In Fig. 8-3(a) we see the shared memory being implemented by thehardware as a true multiprocessor.
In this design, there is a single copy of the operating system with asingle set of tables, in particular, the memory allocation table.
When a process needs more memory, it traps to the operatingsystem, which then looks in its table for a free page and maps thepage into the callers address space.
As far as the operating system is concerned, there is a singlememory and it keeps track of which process owns which page insoftware.
There are many ways to implement hardware shared memory, aswe will see later.
Hybrid approach - cont
A second possibility is to use multicomputer hardware and have the operating system simulate shared memory by providing a single system-wide paged shared virtual address space.
In this approach, called DSM (Distributed Shared Memory) (Li, 1988; Li and Hudak, 1986, 1989), each page is located in one of the memories of Fig. 8-2(a).
Each machine has its own virtual memory and its own page tables. When a CPU does a LOAD or STORE on a page it does not have, a trap to the operating system occurs.
The operating system then locates the page and asks the CPU currently holding it to unmap the page and send it over the interconnection network.
When it arrives, the page is mapped in and the faulting instruction is restarted.
In effect, the operating system is just satisfying page faults from remote memory instead of from disk. To the user, the machine looks as if it has shared memory.
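The page-migration mechanism described above can be sketched in a few lines; a toy, single-process illustration only (class and method names are mine, not Li and Hudak's implementation):

```python
# Toy sketch of DSM page migration: touching a remote page "traps", the
# owning node unmaps the page and sends it over, then the access retries.
class Directory:
    """Cluster-wide map: page number -> node currently holding the page."""
    def __init__(self):
        self.owner = {}

class DsmNode:
    def __init__(self, directory):
        self.directory = directory
        self.local_pages = {}             # page number -> page contents

    def store(self, page, data):          # initial placement of a page
        self.local_pages[page] = data
        self.directory.owner[page] = self

    def load(self, page):
        if page not in self.local_pages:  # page fault: page is remote
            self._fetch_remote(page)
        return self.local_pages[page]     # restart the faulting access

    def _fetch_remote(self, page):
        owner = self.directory.owner[page]
        data = owner.local_pages.pop(page)  # owner unmaps the page
        self.local_pages[page] = data       # page arrives and is mapped in
        self.directory.owner[page] = self

d = Directory()
n0, n1 = DsmNode(d), DsmNode(d)
n0.store(10, b"page-10-contents")
print(n1.load(10))      # faults, migrates page 10 from n0, then succeeds
```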
Hybrid approach - cont
A third possibility is to have a user-level runtime system implement a (possibly language-specific) form of shared memory.
In this approach, the programming language provides some kind of shared-memory abstraction, which is then implemented by the compiler and runtime system.
For example, the Linda model is based on the abstraction of a shared space of tuples (data records containing a collection of fields).
Processes on any machine can input a tuple from the shared tuple space or output a tuple to the shared tuple space.
Because access to the tuple space is controlled entirely in software (by the Linda runtime system), no special hardware or operating system support is needed.
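A Linda-style tuple space can be sketched in a few lines; a minimal single-process illustration only (real Linda blocks on missing tuples and spans machines; the class is mine):

```python
# Minimal Linda-style tuple space: out() adds a tuple, inp() matches and
# removes one. A None field in the pattern acts as a wildcard.
class TupleSpace:
    def __init__(self):
        self._tuples = []

    def out(self, tup):            # Linda "out": deposit a tuple
        self._tuples.append(tup)

    def inp(self, pattern):        # Linda "inp": match, remove, and return
        for t in self._tuples:
            if len(t) == len(pattern) and all(
                p is None or p == f for p, f in zip(pattern, t)
            ):
                self._tuples.remove(t)
                return t
        return None                # no matching tuple

ts = TupleSpace()
ts.out(("temp", "kyoto", 21.5))
print(ts.inp(("temp", None, None)))   # -> ('temp', 'kyoto', 21.5)
```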
Topology
Various topologies. The heavy dots represent switches. The CPUs and memories are not shown. (a) A star. (b) A complete interconnect. (c) A tree. (d) A ring. (e) A grid. (f) A double torus. (g) A cube. (h) A 4D hypercube.
Topology - cont
When designing or analyzing an interconnection network, several areas stand out as being important.
First, there is the matter of topology, that is, how the components are arranged.
Second is how the switching works and how to handle contention for resources.
Third is what routing algorithm is used to get messages to their destination efficiently.
Topology - cont
The topology of an interconnection network describes how the links and switches are arranged, for example, as a ring or as a grid.
Topological designs can be modeled as graphs, with the links as arcs and the switches as nodes, as shown in Fig. 8-4.
Each node in an interconnection network (or its graph) has some number of links connected to it. Mathematicians call the number of links the degree of the node; engineers call it the fanout.
In general, the greater the fanout, the more routing choices there are and the greater the fault tolerance, that is, the ability to continue functioning even if a link fails, by routing around it.
If every node has k arcs and the wiring is done right, it is possible to design the network so that it remains fully connected even if k - 1 links fail.
Topology - cont
Another property of an interconnection network (or its graph) is its diameter.
If we measure the distance between two nodes by the number of arcs that have to be traversed to get from one to the other, then the diameter of a graph is the distance between the two nodes that are the farthest apart (i.e., have the greatest distance between them).
The diameter of an interconnection network is related to the worst-case delay when sending packets from CPU to CPU or from CPU to memory, because each hop across a link takes a finite amount of time.
The smaller the diameter is, the better the worst-case performance is.
Also important is the average distance between two nodes, since this relates to the average packet transit time.
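The diameter as defined above can be computed by breadth-first search over the topology's graph; a small sketch (function names are mine):

```python
# Diameter of a topology: max over nodes of the BFS eccentricity.
from collections import deque

def diameter(adj):
    """adj: dict node -> list of neighbours. Returns the graph diameter."""
    def eccentricity(src):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return max(dist.values())   # farthest node from src
    return max(eccentricity(n) for n in adj)

# An 8-node ring: the farthest pair of nodes is 4 hops apart.
ring = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
print(diameter(ring))   # -> 4
```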
Topology - cont
Yet another important property of an interconnection network is its transmission capacity, that is, how much data it can move per second. One useful measure of this capacity is the bisection bandwidth.
To compute this quantity, we first have to (conceptually) partition the network into two equal (in terms of number of nodes) but unconnected parts by removing a set of arcs from its graph.
Then we compute the total bandwidth of the arcs that have been removed. There may be many different ways to partition the network into two equal parts; the bisection bandwidth is the minimum over all the possible partitions.
The significance of this number is that if the bisection bandwidth is, say, 800 bits/sec, then if there is a lot of communication between the two halves, the total throughput may be limited to only 800 bits/sec in the worst case.
Many designers believe bisection bandwidth is the most important metric of an interconnection network, and many interconnection networks are designed with the goal of maximizing it.
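The definition above can be turned directly into a brute-force computation for tiny networks (the enumeration over all equal splits is exponential, so this is illustrative only; names are mine):

```python
# Brute-force bisection bandwidth: minimum, over all equal-size splits,
# of the total bandwidth of the links crossing the split.
from itertools import combinations

def bisection_bandwidth(nodes, links):
    """links: dict {frozenset({u, v}): bandwidth}. len(nodes) must be even."""
    nodes = list(nodes)
    half = len(nodes) // 2
    best = float("inf")
    for part in combinations(nodes, half):
        side = set(part)
        cut = sum(bw for link, bw in links.items()
                  if len(link & side) == 1)   # link crosses the split
        best = min(best, cut)
    return best

# A 4-node ring with 100 units per link: every equal split cuts >= 2 links.
ring_links = {frozenset({i, (i + 1) % 4}): 100 for i in range(4)}
print(bisection_bandwidth(range(4), ring_links))   # -> 200
```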
Topology - cont
Interconnection networks can be characterized by their dimensionality.
For our purposes, the dimensionality is determined by the number of choices there are to get from the source to the destination.
If there is never any choice (i.e., there is only one path from each source to each destination), the network is zero-dimensional.
If there is one dimension in which a choice can be made, for example, go east or go west, the network is one-dimensional.
If there are two axes, so a packet can go east or west or, alternatively, north or south, the network is two-dimensional, and so on.
Topology - cont
Several topologies are shown in Fig. 8-4. Only the links (lines) and switches (dots) are shown here; the memories and CPUs (not shown) would typically be attached to the switches by interfaces.
In Fig. 8-4(a), we have a zero-dimensional star configuration, in which the CPUs and memories would be attached to the outer nodes, with the central one just doing switching.
Although a simple design, for a large system the central switch is likely to be a major bottleneck.
Also, from a fault-tolerance perspective, this is a poor design, since a single failure at the central switch completely destroys the system.
Topology - cont
In Fig. 8-4(b), we have another zero-dimensional design that is at the other end of the spectrum, a full interconnect.
Here every node has a direct connection to every other node.
This design maximizes the bisection bandwidth, minimizes the diameter, and is exceedingly fault tolerant (it can lose any six links and still be fully connected).
Unfortunately, the number of links required for k nodes is k(k - 1)/2, which quickly gets out of hand for large k.
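The quadratic link count is easy to see numerically; a one-liner sketch (function name is mine):

```python
# Link count of a full interconnect: one link per unordered pair of nodes.
def full_interconnect_links(k):
    return k * (k - 1) // 2

for k in (8, 64, 1024):
    print(k, full_interconnect_links(k))   # 8 -> 28, 64 -> 2016, 1024 -> 523776
```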
Topology - cont
Yet a third zero-dimensional topology is the tree, illustrated in Fig. 8-4(c). A problem with this design is that the bisection bandwidth is equal to the link capacity.
Since there will normally be a lot of traffic near the top of the tree, the top few nodes will become bottlenecks.
One way around this problem is to increase the bisection bandwidth by giving the upper links more bandwidth. For example, the lowest-level links might have a capacity b, the next level a capacity 2b, and the top-level links 4b each.
Such a design is called a fat tree and has been used in commercial multicomputers, such as the (now-defunct) Thinking Machines CM-5.
Topology - cont
The ring of Fig. 8-4(d) is a one-dimensional topology by our definition, because every packet sent has a choice of going left or going right.
The grid or mesh of Fig. 8-4(e) is a two-dimensional design that has been used in many commercial systems.
It is highly regular, easy to scale up to large sizes, and has a diameter that only increases as the square root of the number of nodes.
A variant on the grid is the double torus of Fig. 8-4(f), which is a grid with the edges connected.
Not only is it more fault tolerant than the grid, but the diameter is also less, because the opposite corners can now communicate in only two hops.
Topology - cont
The cube of Fig. 8-4(g) is a regular three-dimensional topology. We have illustrated a 2 x 2 x 2 cube, but in the general case it could be a k x k x k cube.
In Fig. 8-4(h) we have a four-dimensional cube constructed from two three-dimensional cubes with the corresponding edges connected.
We could make a five-dimensional cube by cloning the structure of Fig. 8-4(h) and connecting the corresponding nodes to form a block of four cubes.
To go to six dimensions, we could replicate the block of four cubes and interconnect the corresponding nodes, and so on.
Topology - cont
An n-dimensional cube formed this way is called a hypercube.
Many parallel computers use this topology because the diameter grows linearly with the dimensionality.
Put another way, the diameter is the base-2 logarithm of the number of nodes, so, for example, a 10-dimensional hypercube has 1024 nodes but a diameter of only 10, giving excellent delay properties.
Note that, in contrast, 1024 nodes arranged as a 32 x 32 grid have a diameter of 62, more than six times worse than the hypercube.
The price paid for the smaller diameter is that the fanout, and thus the number of links (and the cost), is much larger for the hypercube.
Nevertheless, the hypercube is a common choice for high-performance systems.
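The 10-vs-62 comparison above follows from two closed-form diameters; a minimal sketch (function names are mine):

```python
# Diameter comparison at equal node count: an n-dimensional hypercube has
# diameter n (= log2 of node count); a k x k grid without wraparound has
# diameter 2*(k - 1) (corner to opposite corner).
import math

def hypercube_diameter(nodes):
    return int(math.log2(nodes))

def grid_diameter(k):
    return 2 * (k - 1)

print(hypercube_diameter(1024))   # -> 10
print(grid_diameter(32))          # -> 62
```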
Flynn's taxonomy
1.4 Types of Parallelism: A Taxonomy
Fig. 1.11 The Flynn-Johnson classification of computer systems.
Flynn's categories (single/multiple instruction streams x single/multiple data streams):
SISD: uniprocessors
SIMD: array or vector processors
MISD: rarely used
MIMD: multiprocessors or multicomputers
Johnson's expansion of MIMD (global/distributed memory x shared variables/message passing):
GMSV: shared-memory multiprocessors
GMMP: rarely used
DMSV: distributed shared memory
DMMP: distributed-memory multicomputers
MIMD category
The MIMD category includes a wide class of computers. For this reason, in 1988, E. E. Johnson proposed a further classification of such machines based on
their memory structure (global or distributed) and
the mechanism used for communication/synchronization (shared variables or message passing).
Again, one of the four categories (GMMP) is not widely used.
The GMSV class is what is loosely referred to as (shared-memory) multiprocessors.
At the other extreme, the DMMP class is known as (distributed-memory) multicomputers.
MIMD category - cont
Finally, the DMSV class, which is becoming popular in view of combining the implementation ease of distributed memory with the programming ease of the shared-variable scheme, is sometimes called distributed shared memory.
When all processors in a MIMD-type machine execute the same program, the result is sometimes referred to as single-program multiple-data [SPMD (spim-dee)].
Although Fig. 1.11 lumps all SIMD machines together, there are in fact variations similar to those suggested above for MIMD machines. At least conceptually, there can be shared-memory and distributed-memory SIMD machines in which the processors communicate by means of shared variables or explicit message passing.
A taxonomy of parallel computers / Tanenbaum
Tanenbaum taxonomy
In Tanenbaum's taxonomy, the MIMD category has been split into multiprocessors (shared-memory machines) and multicomputers (message-passing machines).
Three kinds of multiprocessors exist, distinguished by the way the shared memory is implemented on them. They are called:
UMA (Uniform Memory Access),
NUMA (NonUniform Memory Access), and
COMA (Cache Only Memory Access).
Tanenbaum taxonomy - cont
These categories exist because in large multiprocessors, the memory is usually split up into multiple modules.
UMA machines have the property that each CPU has the same access time to every memory module.
In other words, every memory word can be read as fast as every other memory word. If this is technically impossible, the fastest references are slowed down to match the slowest ones, so programmers do not see the difference. This is what uniform means here.
This uniformity makes the performance predictable, an important factor for writing efficient code.
Tanenbaum taxonomy - cont
As we mentioned before, even on a multicomputer, user programs may have the ability to access remote memory by using LOAD and STORE instructions, but this illusion is supported by the operating system, not the hardware.
This difference is subtle, but very important.
Because multicomputers do not have direct access to remote memory, they are sometimes called NORMA (NO Remote Memory Access) machines.
Multicomputers can be roughly divided into two categories.
The first category contains the MPPs (Massively Parallel Processors), which are expensive supercomputers consisting of many CPUs tightly coupled by a high-speed proprietary interconnection network. The Cray T3E and IBM SP/2 are well-known examples.
Tanenbaum taxonomy - cont
The other category consists of regular PCs or workstations, possibly rack-mounted, connected by commercial off-the-shelf interconnection technology.
Logically, there is not much difference, but huge supercomputers costing many millions of dollars are used differently than networks of PCs assembled by the users for a fraction of the price of an MPP.
These home-brew machines go by various names, including NOW (Network of Workstations) and COW (Cluster of Workstations).
Sequential Consistency
(a) Two CPUs writing and two CPUs reading a common memory word. (b)-(d) Three possible ways the two writes and four reads might be interleaved in time.
Weak Consistency
Weakly consistent memory uses synchronization operations to divide time into sequential epochs.
UMA Symmetric Multiprocessor Architectures
Three bus-based multiprocessors. (a) Without caching. (b) With caching. (c) With caching and private memories.
Snooping Caches
The write-through cache coherence protocol. The empty boxes indicate that no action is taken.
The MESI Cache Coherence Protocol
The MESI cache coherence protocol.
UMA Multiprocessors Using Crossbar Switches
(a) An 8 x 8 crossbar switch.
(b) An open crosspoint.
(c) A closed crosspoint.
UMA Multiprocessors Using Multistage Switching
Networks (1)
(a) A 2 x 2 switch.
(b) A message format.
UMA Multiprocessors Using Multistage Switching
Networks (2)
An omega switching network.
NUMA Multiprocessors
A NUMA machine based on two levels of buses. The Cm* was the first multiprocessor to use this design.
Cache Coherent NUMA Multiprocessors
(a) A 256-node directory-based multiprocessor. (b) Division of a 32-bit memory address into fields. (c) The directory at node 36.
The Sun Fire E25K NUMA Multiprocessor (1)
The Sun Microsystems E25K multiprocessor.
The Sun Fire E25K NUMA Multiprocessor (2)
The Sun Fire E25K uses a four-level interconnect. Dashed lines are address paths. Solid lines are data paths.
Message-Passing Multicomputers
A generic multicomputer.
BlueGene (1)
The BlueGene/L custom processor chip.
BlueGene (2)
The BlueGene/L. (a) Chip. (b) Card. (c) Board.
(d) Cabinet. (e) System.
Red Storm (1)
Packaging of the Red Storm components.
Red Storm (2)
The Red Storm system as viewed from above.
A Comparison of BlueGene/L and Red Storm
A comparison of BlueGene/L and Red Storm.
Google (1)
Processing of a Google query.
Google (2)
A typical Google cluster.
Scheduling
Scheduling a cluster. (a) FIFO. (b) Without head-of-line blocking.
(c) Tiling. The shaded areas indicate idle CPUs.
Distributed Shared Memory (1)
A virtual address space consisting of 16 pages
spread over four nodes of a multicomputer.
(a) The initial situation.
Distributed Shared Memory (2)
A virtual address space consisting of 16 pages
spread over four nodes of a multicomputer.
(b) After CPU 0 references page 10.
Distributed Shared Memory (3)
A virtual address space consisting of 16 pages
spread over four nodes of a multicomputer.
(c) After CPU 1 references page 10, here assumed to be a read-only page.
Linda
Three Linda tuples.
Orca
A simplified Orca stack object, with internal data and two operations.
Software Metrics (1)
Real programs achieve less than the perfect speedup indicated by the dotted line.
Software Metrics (2)
(a) A program has a sequential part and a parallelizable part. (b) Effect of running part of the program in parallel.
Achieving High Performance
(a) A 4-CPU bus-based system. (b) A 16-CPU bus-based system. (c) A 4-CPU grid-based system. (d) A 16-CPU grid-based system.
Grid Computing
The grid layers.