
Course lecturer: dr. ing. Ioan [email protected]

Duration: 14 weeks x 3 hours


How is grading done? Final grade NF = 45% NE + 45% NL + 10% PC

NF = final grade

NE = exam grade; NL = laboratory grade

PC = lecture attendance

The exam is written.

Three subjects are given, in the form of problems.

Lecture attendance can be compensated by an additional assignment related to the course, with the instructor's approval.

The exam grade is computed from the completed and graded activities and must be at least 45% (4.5).

Bonuses of up to 10% are given for additional activities

(participation in contests, student conferences, projects, etc.)


Bibliography

1. Francisc IACOB, SISTEME MULTIMICROPROCESOR, Editura Victor, 2000, ISBN 973-8128-05-6.

2. Kai Hwang, ADVANCED COMPUTER ARCHITECTURE: PARALLELISM, SCALABILITY, PROGRAMMABILITY, McGraw-Hill, 1993, ISBN 0-07-113342-9.

3. Hesham El-Rewini (Southern Methodist University), Mostafa Abd-El-Barr (Kuwait University), ADVANCED COMPUTER ARCHITECTURE AND PARALLEL PROCESSING, John Wiley & Sons, Inc., 2005, ISBN 0-471-46740-5.

4. DISTRIBUTED AND PARALLEL SYSTEMS: CLUSTER AND GRID COMPUTING, Springer, 2005, eBook ISBN 0-387-23096-3.

5. Joshy Joseph, Craig Fellenstein, Grid Computing, Prentice Hall, 2003, ISBN 0-13-145660-1.


Main topics of the course

Multiprocessor architectures

Interconnection buses: PCI, PCIe, InfiniBand

Multi-microprocessor chips: Intel Xeon, Cell BE

    Cluster computing

    Grid computing


Introduction


    1.1 Why Parallel Processing?

Fig. 1.1 The exponential growth of microprocessor performance, known as Moore's Law, shown over the past two decades (extrapolated).

[Plot: processor performance (KIPS, MIPS, GIPS, TIPS) vs. calendar year 1980-2010, with 68000, 80286, 80386, 80486/68040, Pentium, Pentium II, and R10000 marked; trend roughly 1.6x per year.]


Evolution of Computer Performance/Cost

Mental power in four scales

From: "Robots After All," by H. Moravec, CACM, pp. 90-97, October 2003.


    The Semiconductor Technology Roadmap

    From the 2001 edition of the roadmap [Alla02]

Calendar year        2001  2004  2007  2010  2013  2016
Half-pitch (nm)       140    90    65    45    32    22
Clock freq. (GHz)       2     4     7    12    20    30
Wiring levels           7     8     9    10    10    10
Power supply (V)      1.1   1.0   0.8   0.7   0.6   0.5
Max. power (W)        130   160   190   220   250   290

Factors contributing to the validity of Moore's law:

Denser circuits; Architectural improvements

Measures of processor performance: Instructions per second (MIPS, GIPS, TIPS, PIPS)

    Floating-point operations per second

    (MFLOPS, GFLOPS, TFLOPS, PFLOPS)

Running time on benchmark suites

[Plot repeated from Fig. 1.1: processor performance vs. calendar year, 1980-2010.]


    Why High-Performance Computing?

Higher speed (solve problems faster). Important when there are hard or soft deadlines; e.g., 24-hour weather forecast.

Higher throughput (solve more problems). Important when there are many similar tasks to perform; e.g., transaction processing.

Higher computational power (solve larger problems). E.g., weather forecast for a week rather than 24 hours, or with a finer mesh for greater accuracy.

Categories of supercomputers:
- Uniprocessor; aka vector machine
- Multiprocessor; centralized or distributed shared memory
- Multicomputer; communicating via message passing
- Massively parallel processor (MPP; 1K or more processors)


    The Speed-of-Light Argument

    The speed of light is about 30 cm/ns.

Signals travel at a fraction of the speed of light (say, 1/3).

If signals must travel 1 cm during the execution of an instruction, that instruction will take at least 0.1 ns; thus, performance will be limited to 10 GIPS.

This limitation is eased by continued miniaturization, architectural methods such as cache memory, etc.; however, a fundamental limit does exist.

How does parallel processing help? Wouldn't multiple processors need to communicate via signals as well?


    Why Do We Need TIPS or TFLOPS Performance?

Reasonable running time = fraction of an hour to several hours (10^3-10^4 s).

In this time, a TIPS/TFLOPS machine can perform 10^15-10^16 operations.

Example 1: Southern oceans heat modeling (10-minute iterations)
300 GFLOP per iteration x 300,000 iterations per 6 yrs ~ 10^16 FLOP

Example 2: Fluid dynamics calculations (1000 x 1000 x 1000 lattice)
10^9 lattice points x 1000 FLOP/point x 10,000 time steps = 10^16 FLOP

Example 3: Monte Carlo simulation of a nuclear reactor
10^11 particles to track (for 1000 escapes) x 10^4 FLOP/particle = 10^15 FLOP

Decentralized supercomputing (from MathWorld News, 2006/4/7):
A grid of tens of thousands of networked computers discovers 2^30402457 - 1, the 43rd Mersenne prime, as the largest known prime (9,152,052 digits).
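As a quick sanity check of Examples 2 and 3, the sketch below just multiplies out the slide's numbers and estimates the running time, assuming a machine that sustains 1 TFLOPS:

```python
# Operation counts from the slide, and the time a sustained-1-TFLOPS machine needs.
fluid_dynamics = 1e9 * 1000 * 10_000   # lattice points * FLOP/point * time steps
monte_carlo = 1e11 * 1e4               # particles * FLOP per particle

for name, flop in [("fluid dynamics", fluid_dynamics), ("Monte Carlo", monte_carlo)]:
    seconds = flop / 1e12              # sustained 1 TFLOPS assumed
    print(f"{name}: {flop:.0e} FLOP -> about {seconds / 3600:.1f} h at 1 TFLOPS")
```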


    The ASCI Program

Fig. 24.1 Milestones in the Accelerated Strategic (Advanced Simulation &) Computing Initiative (ASCI) program, sponsored by the US Department of Energy, with extrapolation up to the PFLOPS level.

[Chart: performance (TFLOPS, 1 to 1000) vs. calendar year 1995-2010, with plan/develop/use phases; milestones: ASCI Red (1+ TFLOPS, 0.5 TB), ASCI Blue (3+ TFLOPS, 1.5 TB), ASCI White (10+ TFLOPS, 5 TB), ASCI Q (30+ TFLOPS, 10 TB), ASCI Purple (100+ TFLOPS, 20 TB).]


    The Quest for Higher Performance

Top Three Supercomputers in 2005 (IEEE Spectrum, Feb. 2005, pp. 15-16)

1. IBM Blue Gene/L — LLNL, California
   Material science, nuclear stockpile simulation
   32,768 procs, 8 TB memory, 28 TB disk storage
   Linux + custom OS
   71 TFLOPS, $100 M
   Dual-proc PowerPC chips (10-15 W power)
   Full system: 130k procs, 360 TFLOPS (est)

2. SGI Columbia — NASA Ames, California
   Aerospace/space simulation, climate research
   10,240 procs, 20 TB memory, 440 TB disk storage
   Linux
   52 TFLOPS, $50 M
   20x Altix (512 Itanium 2) linked by InfiniBand

3. NEC Earth Simulator — Earth Simulator Ctr, Yokohama
   Atmospheric, oceanic, and earth sciences
   5,120 procs, 10 TB memory, 700 TB disk storage
   Unix
   36 TFLOPS*, $400 M?
   Built of custom vector microprocessors
   Volume = 50x IBM, power = 14x IBM

* Led the top500 list for 2.5 yrs


    The Quest for Higher Performance: 2008 Update

Top Three Supercomputers in June 2008 (http://www.top500.org)

1. IBM Roadrunner — LANL, New Mexico
   Nuclear stockpile calculations, and more
   122,400 procs, 98 TB memory, 0.4 TB/s file system I/O
   Red Hat Linux
   1.38 PFLOPS, $130 M
   PowerXCell 8i 3.2 GHz, AMD Opteron (hybrid)
   2.35 MW power, expands to 1M procs

2. IBM Blue Gene/L — LLNL, California
   Advanced scientific simulations
   212,992 procs, 74 TB memory, 2 PB disk storage
   CNK/SLES 9
   0.596 PFLOPS, $100 M
   PowerPC 440 700 MHz
   1.60 MW power, expands to 0.5M procs

3. Sun Blade X6420 — U Texas Austin
   Open science research
   62,976 procs, 126 TB memory
   Linux
   0.504 PFLOPS*
   AMD x86-64 Opteron quad core 2 GHz
   2.00 MW power, expands to 0.3M procs

* Actually 4th on the top-500 list, with the 3rd being another IBM Blue Gene system at 0.557 PFLOPS


    Supercomputer Performance Growth

Fig. 1.2 The exponential growth in supercomputer performance over the past two decades (from [Bell92], with ASCI performance goals and microprocessor peak FLOPS superimposed as dotted lines).

[Plot: supercomputer performance (MFLOPS to PFLOPS) vs. calendar year 1980-2010; vector supers (Cray X-MP, Y-MP), MPPs (CM-2, CM-5, $30M and $240M MPPs), ASCI goals, and micros (80386, 80860, Alpha).]


    What Exactly is Parallel Processing?

    Parallelism = Concurrency

    Doing more than one thing at a time

    Has been around for decades, since early computers

    I/O channels, DMA, device controllers, multiple ALUs

    The sense in which we use it in this course

Multiple agents (hardware units, software processes) collaborate to perform our main computational task:

- Multiplying two matrices
- Breaking a secret code
- Deciding on the next chess move


    ABCs of Parallel Processing in One Slide

A. Amdahl's Law (Speedup Formula)
Bad news: sequential overhead will kill you, because:
Speedup = T1/Tp ≤ 1/[f + (1 - f)/p] ≤ min(1/f, p)
Moral: for f = 0.1, speedup is at best 10, regardless of peak OPS.

B. Brent's Scheduling Theorem
Good news: optimal scheduling is very difficult, but even a naive scheduling algorithm can ensure:
T1/p ≤ Tp ≤ T1/p + T∞ = (T1/p)[1 + p/(T1/T∞)]
Result: for a reasonably parallel task (large T1/T∞), or for a suitably small p (say, p ≤ T1/T∞), good speedup and efficiency are possible.

C. Cost-Effectiveness Adage
Real news: the most cost-effective parallel solution may not be the one with the highest peak OPS (communication?), greatest speed-up (at what cost?), or best utilization (hardware busy doing what?).
Analogy: mass transit might be more cost-effective than private cars even if it is slower and leads to many empty seats.
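A small sketch of both bounds in code, using the symbols above (the T1, T∞, and p values below are only illustrative):

```python
# Amdahl's law: speedup on p processors when a fraction f of the work is sequential.
def amdahl_speedup(f, p):
    return 1.0 / (f + (1.0 - f) / p)

# Brent's bound: T1/p <= Tp <= T1/p + Tinf, for total work T1 and critical path Tinf.
def brent_bounds(t1, tinf, p):
    return t1 / p, t1 / p + tinf

print(amdahl_speedup(0.1, 1000))      # ~9.9: at best 10x when f = 0.1
print(brent_bounds(1e6, 1e3, 100))    # (10000.0, 11000.0) time units
```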


    Historical Perspective


    Types of Parallelism

    Note: many overlaps

    lookahead & pipelining

    vectorization

    concurrency & simultaneity data and control parallelism

    partitioning & specialization

    interleaving & overlapping of physical subsystems

    multiplicity & replication

    time & space sharing multitasking & multiprogramming

    multi-threading

    distributed computing - for speed or availability


Architectural Evolution

[Diagram: architectural evolution from scalar sequential machines through lookahead, I/E overlap, and functional parallelism to multiple functional units and pipelining; pipelining leads to implicit and explicit vector machines (memory-to-memory or register-to-register), and onward to SIMD (associative processor, processor array) and MIMD (multiprocessor, multicomputer) organizations.]


    Parallel Computer Architectures

(a) On-chip parallelism. (b) A coprocessor. (c) A multiprocessor.

    (d) A multicomputer. (e) A grid.


    Instruction-Level Parallelism

    (a) A CPU pipeline. (b) A sequence of VLIW instructions.

    (c) An instruction stream with bundles marked.


    The TriMedia VLIW CPU (1)

A typical TriMedia instruction, showing five possible operations.


    The TriMedia VLIW CPU (2)

The TM3260 functional units, their quantity, latency, and which instruction slots they can use.


    The TriMedia VLIW CPU (3)

    The major groups of TriMedia custom operations.


    The TriMedia VLIW CPU (4)

    (a) An array of 8-bit elements. (b) The transposed array.

    (c) The original array fetched into four registers.

    (d) The transposed array in four registers.


    On-Chip Multithreading (1)

(a)-(c) Three threads. The empty boxes indicate that the thread has stalled waiting for memory. (d) Fine-grained multithreading.

    (e) Coarse-grained multithreading.


    On-Chip Multithreading (2)

    Multithreading with a dual-issue superscalar CPU.

    (a) Fine-grained multithreading.

    (b) Coarse-grained multithreading.

    (c) Simultaneous multithreading.


    Hyperthreading on the Pentium 4

    Resource sharing between threads in the

    Pentium 4 NetBurst microarchitecture.


    Homogeneous Multiprocessors on a Chip

    Single-chip multiprocessors.

    (a) A dual-pipeline chip. (b) A chip with two cores.


    Heterogeneous Multiprocessors on a Chip (1)

The logical structure of a simple DVD player contains a heterogeneous multiprocessor with multiple cores for different functions.


    Heterogeneous Multiprocessors on a Chip (2)

    An example of the IBM CoreConnect architecture.


    Introduction to Networking (1)

How users are connected to servers on the Internet.


    Introduction to Networking (2)

    A packet as it appears on the Ethernet.


    Introduction to Network Processors

    A typical network processor board and chip.


    The Nexperia Media Processor

    The Nexperia heterogeneous multiprocessor on a chip.


    Multiprocessors

(a) A multiprocessor with 16 CPUs sharing a common memory. (b) An image partitioned into 16 sections, each being analyzed by a different CPU.


    Multicomputers (1)

(a) A multicomputer with 16 CPUs, each with its own private memory. (b) The bit-map image of Fig. 8-17 split up among the 16 memories.


    Multicomputers (2)

Various layers where shared memory can be implemented. (a) The hardware. (b) The operating system. (c) The language runtime system.


    Hybrid approach

In short, programming a multicomputer is much more difficult than programming a multiprocessor.

Under these conditions, why would anyone build multicomputers, when multiprocessors are easier to program? The answer is simple: large multicomputers are much simpler and cheaper to build than multiprocessors with the same number of CPUs.

Implementing a memory shared by even a few hundred CPUs is a substantial undertaking, whereas building a multicomputer with 10,000 CPUs or more is straightforward.


Hybrid approach - cont

Thus we have a dilemma: multiprocessors are hard to build but easy to program, whereas multicomputers are easy to build but hard to program.

This observation has led to a great deal of effort to construct hybrid systems that are relatively easy to build and relatively easy to program.

This work has led to the realization that shared memory can be implemented in various ways, each with its own set of advantages and disadvantages.

In fact, much of the research in parallel architectures these days relates to the convergence of multiprocessor and multicomputer architectures into hybrid forms that combine the strengths of each.

The holy grail here is to find designs that are scalable, that is, continue to perform well as more and more CPUs are added.


Hybrid approach - cont

One approach to building hybrid systems is based on the fact that modern computer systems are not monolithic but are constructed as a series of layers.

This insight opens the possibility of implementing the shared memory at any one of several layers, as shown in Fig. 8-3.

In Fig. 8-3(a) we see the shared memory being implemented by the hardware as a true multiprocessor.

In this design, there is a single copy of the operating system with a single set of tables, in particular, the memory allocation table.

When a process needs more memory, it traps to the operating system, which then looks in its table for a free page and maps the page into the caller's address space.

As far as the operating system is concerned, there is a single memory and it keeps track of which process owns which page in software.

There are many ways to implement hardware shared memory, as we will see later.


Hybrid approach - cont

A second possibility is to use multicomputer hardware and have the operating system simulate shared memory by providing a single system-wide paged shared virtual address space.

In this approach, called DSM (Distributed Shared Memory) (Li, 1988; and Li and Hudak, 1986, 1989), each page is located in one of the memories of Fig. 8-2(a).

Each machine has its own virtual memory and its own page tables. When a CPU does a LOAD or STORE on a page it does not have, a trap to the operating system occurs.

The operating system then locates the page and asks the CPU currently holding it to unmap the page and send it over the interconnection network.

When it arrives, the page is mapped in and the faulting instruction is restarted.

In effect, the operating system is just satisfying page faults from remote memory instead of from disk. To the user, the machine looks as if it has shared memory.
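A toy, single-process sketch of that page-migration idea (hypothetical data structures; a real DSM does this inside the OS fault handler and must also handle read sharing, invalidation, and races):

```python
# Each "node" owns some pages; referencing a remote page faults, the current
# owner unmaps it and "sends" it over, and the faulting access is then retried.
nodes = {n: {} for n in range(4)}            # node id -> locally mapped pages
owner = {p: p % 4 for p in range(16)}        # page number -> owning node
for page, node in owner.items():
    nodes[node][page] = f"contents of page {page}"

def load(node, page):
    if page not in nodes[node]:              # page fault: page lives elsewhere
        data = nodes[owner[page]].pop(page)  # ask the owner to unmap and send it
        nodes[node][page] = data             # map it in locally
        owner[page] = node
    return nodes[node][page]                 # restart the faulting LOAD

print(load(0, 10))    # CPU 0 references page 10 -> the page migrates to node 0
```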


Hybrid approach - cont

A third possibility is to have a user-level runtime system implement a (possibly language-specific) form of shared memory.

In this approach, the programming language provides some kind of shared memory abstraction, which is then implemented by the compiler and runtime system.

For example, the Linda model is based on the abstraction of a shared space of tuples (data records containing a collection of fields).

Processes on any machine can input a tuple from the shared tuple space or output a tuple to the shared tuple space.

Because access to the tuple space is controlled entirely in software (by the Linda runtime system), no special hardware or operating system support is needed.
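A toy, single-process sketch of the tuple-space abstraction (hypothetical helper names; real Linda matches tuples by field types and values, and in() blocks until a match exists):

```python
# out() deposits a tuple; in_() removes and returns the first tuple matching a
# template, where None fields act as wildcards.
tuple_space = []

def out(*fields):
    tuple_space.append(tuple(fields))

def in_(*template):
    for t in tuple_space:
        if len(t) == len(template) and all(w is None or w == f
                                           for w, f in zip(template, t)):
            tuple_space.remove(t)
            return t
    return None     # a real Linda in() would block here instead

out("abc", 2, 5)
out("matrix-1", 1, 6, 3.14)
print(in_("abc", None, None))    # -> ('abc', 2, 5)
```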


    Topology

Various topologies. The heavy dots represent switches. The CPUs and memories are not shown. (a) A star. (b) A complete interconnect. (c) A tree. (d) A ring. (e) A grid. (f) A double torus. (g) A cube. (h) A 4D hypercube.


    Topology - cont

When designing or analyzing an interconnection network, several areas stand out as being important.

First, there is the matter of topology, that is, how the components are arranged.

Second is how the switching works and how to handle contention for resources.

Third is what routing algorithm is used to get messages to their destination efficiently.


    Topology - cont

The topology of an interconnection network describes how the links and switches are arranged, for example, as a ring or as a grid.

Topological designs can be modeled as graphs, with the links as arcs and the switches as nodes, as shown in Fig. 8-4.

Each node in an interconnection network (or its graph) has some number of links connected to it. Mathematicians call the number of links the degree of the node; engineers call it the fanout.

In general, the greater the fanout, the more routing choices there are and the greater the fault tolerance, that is, the ability to continue functioning even if a link fails by routing around it.

If every node has k arcs and the wiring is done right, it is possible to design the network so that it remains fully connected even if k - 1 links fail.


    Topology - cont

Another property of an interconnection network (or its graph) is its diameter.

If we measure the distance between two nodes by the number of arcs that have to be traversed to get from one to the other, then the diameter of a graph is the distance between the two nodes that are the farthest apart (i.e., have the greatest distance between them).

The diameter of an interconnection network is related to the worst-case delay when sending packets from CPU to CPU or from CPU to memory, because each hop across a link takes a finite amount of time.

The smaller the diameter is, the better the worst-case performance is.

Also important is the average distance between two nodes, since this relates to the average packet transit time.
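A short sketch of computing these two quantities from a topology's graph, by breadth-first search from every node (the 4-node ring at the end is only an illustration):

```python
from collections import deque

def distances_from(graph, start):
    """Hop counts from start to every reachable node (BFS)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter_and_average(graph):
    d = [x for n in graph for x in distances_from(graph, n).values() if x > 0]
    return max(d), sum(d) / len(d)

ring4 = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}   # a 4-node ring
print(diameter_and_average(ring4))                      # (2, 1.33...)
```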


Topology - cont

Yet another important property of an interconnection network is its transmission capacity, that is, how much data it can move per second. One useful measure of this capacity is the bisection bandwidth.

To compute this quantity, we first have to (conceptually) partition the network into two equal (in terms of number of nodes) but unconnected parts by removing a set of arcs from its graph.

Then we compute the total bandwidth of the arcs that have been removed. There may be many different ways to partition the network into two equal parts. The bisection bandwidth is the minimum over all the possible partitions.

The significance of this number is that if the bisection bandwidth is, say, 800 bits/sec, then if there is a lot of communication between the two halves, the total throughput may be limited to only 800 bits/sec in the worst case.

Many designers believe bisection bandwidth is the most important metric of an interconnection network. Many interconnection networks are designed with the goal of maximizing the bisection bandwidth.
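A brute-force sketch of that definition (enumerate every equal-size split of the nodes, add up the bandwidth of the links that cross it, and take the minimum; feasible only for tiny graphs):

```python
from itertools import combinations

def bisection_bandwidth(nodes, links):
    """Minimum cut bandwidth over all equal halves; links maps (a, b) -> bandwidth."""
    best = None
    for half in combinations(nodes, len(nodes) // 2):
        half = set(half)
        cut = sum(bw for (a, b), bw in links.items() if (a in half) != (b in half))
        best = cut if best is None else min(best, cut)
    return best

# 4-node ring with 800 bits/sec links (illustrative numbers).
links = {(0, 1): 800, (1, 2): 800, (2, 3): 800, (3, 0): 800}
print(bisection_bandwidth([0, 1, 2, 3], links))   # 1600: every bisection cuts 2 links
```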


    Topology - cont

Interconnection networks can be characterized by their dimensionality.

For our purposes, the dimensionality is determined by the number of choices there are to get from the source to the destination.

If there is never any choice (i.e., there is only one path from each source to each destination), the network is zero-dimensional.

If there is one dimension in which a choice can be made, for example, go east or go west, the network is one-dimensional.

If there are two axes, so a packet can go east or west or, alternatively, go north or south, the network is two-dimensional, and so on.


    Topology - cont

Several topologies are shown in Fig. 8-4. Only the links (lines) and switches (dots) are shown here. The memories and CPUs (not shown) would typically be attached to the switches by interfaces.

In Fig. 8-4(a), we have a zero-dimensional star configuration, in which the CPUs and memories would be attached to the outer nodes, with the central one just doing switching.

Although a simple design, for a large system the central switch is likely to be a major bottleneck.

Also, from a fault-tolerance perspective, this is a poor design since a single failure at the central switch completely destroys the system.


    Topology - cont

In Fig. 8-4(b), we have another zero-dimensional design that is at the other end of the spectrum, a full interconnect.

Here every node has a direct connection to every other node. This design maximizes the bisection bandwidth, minimizes the diameter, and is exceedingly fault tolerant (it can lose any six links and still be fully connected).

Unfortunately, the number of links required for k nodes is k(k - 1)/2, which quickly gets out of hand for large k.
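A one-line check of that growth (sketch):

```python
# Links in a full interconnect of k nodes: k * (k - 1) / 2.
for k in (8, 64, 1024):
    print(k, k * (k - 1) // 2)    # 8 -> 28, 64 -> 2016, 1024 -> 523776
```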


    Topology - cont

Yet a third zero-dimensional topology is the tree, illustrated in Fig. 8-4(c). A problem with this design is that the bisection bandwidth is equal to the link capacity.

Since there will normally be a lot of traffic near the top of the tree, the top few nodes will become bottlenecks.

One way around this problem is to increase the bisection bandwidth by giving the upper links more bandwidth. For example, the lowest-level links might have a capacity b, the next level might have a capacity 2b, and the top-level links might each have 4b.

Such a design is called a fat tree and has been used in commercial multicomputers, such as the (now-defunct) Thinking Machines CM-5.


    Topology - cont

The ring of Fig. 8-4(d) is a one-dimensional topology by our definition because every packet sent has a choice of going left or going right.

The grid or mesh of Fig. 8-4(e) is a two-dimensional design that has been used in many commercial systems.

It is highly regular, easy to scale up to large sizes, and has a diameter that only increases as the square root of the number of nodes.

A variant on the grid is the double torus of Fig. 8-4(f), which is a grid with the edges connected.

Not only is it more fault tolerant than the grid, but the diameter is also less because the opposite corners can now communicate in only two hops.


    Topology - cont

The cube of Fig. 8-4(g) is a regular three-dimensional topology. We have illustrated a 2 x 2 x 2 cube, but in the general case it could be a k x k x k cube.

In Fig. 8-4(h) we have a four-dimensional cube constructed from two three-dimensional cubes with the corresponding edges connected.

We could make a five-dimensional cube by cloning the structure of Fig. 8-4(h) and connecting the corresponding nodes to form a block of four cubes.

To go to six dimensions, we could replicate the block of four cubes and interconnect the corresponding nodes, and so on.


Topology - cont

An n-dimensional cube formed this way is called a hypercube.

Many parallel computers use this topology because the diameter grows linearly with the dimensionality.

Put in other words, the diameter is the base-2 logarithm of the number of nodes, so, for example, a 10-dimensional hypercube has 1024 nodes but a diameter of only 10, giving excellent delay properties. Note that, in contrast, 1024 nodes arranged as a 32 x 32 grid have a diameter of 62, more than six times worse than the hypercube.

The price paid for the smaller diameter is that the fanout, and thus the number of links (and the cost), is much larger for the hypercube.

Nevertheless, the hypercube is a common choice for high-performance systems.
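A quick sketch comparing those two diameters for the same node count (an n-dimensional hypercube with 2^n nodes versus a square grid without wraparound):

```python
import math

nodes = 1024
hypercube_diameter = int(math.log2(nodes))   # n-cube diameter = n -> 10
side = int(math.sqrt(nodes))                 # 32 x 32 grid
grid_diameter = 2 * (side - 1)               # corner to opposite corner -> 62
print(hypercube_diameter, grid_diameter)     # 10 62
```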


Topology - cont

When designing or analyzing an interconnection network, several areas stand out as being important.

First, there is the matter of topology, that is, how the components are arranged. Other important aspects are:

Second, how the switching works and how to handle contention for resources.

Third, what routing algorithm is used to get messages to their destination efficiently.


Flynn's taxonomy


    1.4 Types of Parallelism: A Taxonomy

Fig. 1.11 The Flynn-Johnson classification of computer systems.

[Diagram: Flynn's categories cross single/multiple instruction streams with single/multiple data streams: SISD (uniprocessors), SIMD (array or vector processors), MISD (rarely used), MIMD (multiprocessors or multicomputers). Johnson's expansion splits MIMD by global vs. distributed memory and shared variables vs. message passing: GMSV (shared-memory multiprocessors), GMMP (rarely used), DMSV (distributed shared memory), DMMP (distributed-memory multicomputers).]


    MIMD category

The MIMD category includes a wide class of computers. For this reason, in 1988, E. E. Johnson proposed a further classification of such machines based on:

- their memory structure (global or distributed) and
- the mechanism used for communication/synchronization (shared variables or message passing).

Again, one of the four categories (GMMP) is not widely used.

The GMSV class is what is loosely referred to as (shared-memory) multiprocessors.

At the other extreme, the DMMP class is known as (distributed-memory) multicomputers.


    MIMD category - cont

Finally, the DMSV class, which is becoming popular in view of combining the implementation ease of distributed memory with the programming ease of the shared-variable scheme, is sometimes called distributed shared memory.

When all processors in a MIMD-type machine execute the same program, the result is sometimes referred to as single-program multiple-data [SPMD (spim-dee)].

Although Fig. 1.11 lumps all SIMD machines together, there are in fact variations similar to those suggested above for MIMD machines. At least conceptually, there can be shared-memory and distributed-memory SIMD machines in which the processors communicate by means of shared variables or explicit message passing.

A taxonomy of parallel computers (Tanenbaum)


    Tanenbaum taxonomy

In Tanenbaum's taxonomy, the MIMD category has been split into Multiprocessors (shared-memory machines) and Multicomputers (message-passing machines).

Three kinds of multiprocessors exist, distinguished by the way the shared memory is implemented on them. They are called:

- UMA (Uniform Memory Access),
- NUMA (NonUniform Memory Access), and
- COMA (Cache Only Memory Access).


    Tanenbaum taxonomy - cont

These categories exist because in large multiprocessors, the memory is usually split up into multiple modules.

UMA machines have the property that each CPU has the same access time to every memory module.

In other words, every memory word can be read as fast as every other memory word.

If this is technically impossible, the fastest references are slowed down to match the slowest ones, so programmers do not see the difference. This is what uniform means here.

This uniformity makes the performance predictable, an important factor for writing efficient code.


Tanenbaum taxonomy - cont

As we mentioned before, even on a multicomputer, user programs may have the ability to access remote memory by using LOAD and STORE instructions, but this illusion is supported by the operating system, not the hardware.

This difference is subtle, but very important.

Because multicomputers do not have direct access to remote memory, they are sometimes called NORMA (NO Remote Memory Access) machines.

Multicomputers can be roughly divided into two categories.

The first category contains the MPPs (Massively Parallel Processors), which are expensive supercomputers consisting of many CPUs tightly coupled by a high-speed proprietary interconnection network. The Cray T3E and IBM SP/2 are well-known examples.


Tanenbaum taxonomy - cont

The other category consists of regular PCs or workstations, possibly rack-mounted, and connected by commercial off-the-shelf interconnection technology.

Logically, there is not much difference, but huge supercomputers costing many millions of dollars are used differently than networks of PCs assembled by the users for a fraction of the price of an MPP.

These home-brew machines go by various names, including NOW (Network of Workstations) and COW (Cluster of Workstations).


    Sequential Consistency

(a) Two CPUs writing and two CPUs reading a common memory word. (b)-(d) Three possible ways the two writes and four reads might be interleaved in time.


    Weak Consistency

Weakly consistent memory uses synchronization operations to divide time into sequential epochs.


    UMA Symmetric Multiprocessor Architectures

Three bus-based multiprocessors. (a) Without caching. (b) With caching. (c) With caching and private memories.


    Snooping Caches

The write-through cache coherence protocol. The empty boxes indicate that no action is taken.


    The MESI Cache Coherence Protocol

    The MESI cache coherence protocol.


    UMA Multiprocessors Using Crossbar Switches

(a) An 8 x 8 crossbar switch.

(b) An open crosspoint.

(c) A closed crosspoint.


UMA Multiprocessors Using Multistage Switching Networks (1)

(a) A 2 x 2 switch.

(b) A message format.


UMA Multiprocessors Using Multistage Switching Networks (2)

    An omega switching network.


    NUMA Multiprocessors

A NUMA machine based on two levels of buses. The Cm* was the first multiprocessor to use this design.


    Cache Coherent NUMA Multiprocessors

(a) A 256-node directory-based multiprocessor. (b) Division of a 32-bit memory address into fields. (c) The directory at node 36.


    The Sun Fire E25K NUMA Multiprocessor (1)

    The Sun Microsystems E25K multiprocessor.


    The Sun Fire E25K NUMA Multiprocessor (2)

The Sun Fire E25K uses a four-level interconnect. Dashed lines are address paths. Solid lines are data paths.


    Message-Passing Multicomputers

    A generic multicomputer.


    BlueGene (1)

    The BlueGene/L custom processor chip.


BlueGene (2)

    The BlueGene/L. (a) Chip. (b) Card. (c) Board.

    (d) Cabinet. (e) System.


    Red Storm (1)

    Packaging of the Red Storm components.


    Red Storm (2)

    The Red Storm system as viewed from above.


    A Comparison of BlueGene/L and Red Storm

A comparison of BlueGene/L and Red Storm.

    Google (1)


    Processing of a Google query.


    Google (2)

A typical Google cluster.

    Scheduling


    Scheduling a cluster. (a) FIFO. (b) Without head-of-line blocking.

    (c) Tiling. The shaded areas indicate idle CPUs.


    Distributed Shared Memory (1)

    A virtual address space consisting of 16 pages

    spread over four nodes of a multicomputer.

(a) The initial situation.


    Distributed Shared Memory (2)

    A virtual address space consisting of 16 pages

    spread over four nodes of a multicomputer.

    (b) After CPU 0 references page 10.


    Distributed Shared Memory (3)

    A virtual address space consisting of 16 pages

    spread over four nodes of a multicomputer.

    (c) After CPU 1 references page 10, here assumed to be a read-only page.


    Linda

    Three Linda tuples.


    Orca

A simplified ORCA stack object, with internal data and two operations.

    Software Metrics (1)


Real programs achieve less than the perfect speedup indicated by the dotted line.

    Software Metrics (2)


(a) A program has a sequential part and a parallelizable part. (b) Effect of running part of the program in parallel.

    Achieving High Performance


(a) A 4-CPU bus-based system. (b) A 16-CPU bus-based system. (c) A 4-CPU grid-based system. (d) A 16-CPU grid-based system.


    Grid Computing

    The grid layers.

