
Course lecturer: dr. ing. Ioan [email protected]

Duration: 14 weeks x 3 hours


How is grading done? Final grade NF = 45% NE + 45% NL + 10% PC

NF = final grade

NE = exam grade; NL = laboratory grade

PC = lecture attendance

The exam is written.

Three subjects are given, in the form of problems.

Lecture attendance can be compensated by an additional assignment related to the course, with the instructor's approval.

The exam grade is computed from the completed and graded activities and must be at least 45% (4.5).

Bonuses of up to 10% are given for additional activities

(participation in contests, student conferences, projects, etc.)


Bibliography

1. Francisc IACOB, SISTEME MULTIMICROPROCESOR, Editura Victor, 2000, ISBN 973-8128-05-6.

2. Kai Hwang, ADVANCED COMPUTER ARCHITECTURE: PARALLELISM, SCALABILITY, PROGRAMMABILITY, McGraw-Hill, 1993, ISBN 0-07-113342-9.

3. Hesham El-Rewini (Southern Methodist University), Mostafa Abd-El-Barr (Kuwait University), ADVANCED COMPUTER ARCHITECTURE AND PARALLEL PROCESSING, John Wiley & Sons, Inc., 2005, ISBN 0-471-46740-5.

4. DISTRIBUTED AND PARALLEL SYSTEMS: CLUSTER AND GRID COMPUTING, Springer, 2005, eBook ISBN 0-387-23096-3.

5. Joshy Joseph, Craig Fellenstein, Grid Computing, Prentice Hall, 2003, ISBN 0-13-145660-1.


Main topics of the course

Multiprocessor architectures

Interconnection buses: PCI, PCIe, InfiniBand

Multi-microprocessor chips: Intel Xeon, Cell BE

    Cluster computing

    Grid computing


Introduction


    1.1 Why Parallel Processing?

Fig. 1.1 The exponential growth of microprocessor performance, known as Moore's Law, shown over the past two decades (extrapolated).

[Plot: processor performance (KIPS, MIPS, GIPS, TIPS) vs. calendar year 1980-2010, with 68000, 80286, 80386, 80486/68040, Pentium, Pentium II, and R10000 marked; trend roughly 1.6x per year.]


Evolution of Computer Performance/Cost

Mental power in four scales

From: "Robots After All," by H. Moravec, CACM, pp. 90-97, October 2003.


    The Semiconductor Technology Roadmap

    From the 2001 edition of the roadmap [Alla02]

Calendar year        2001  2004  2007  2010  2013  2016
Half-pitch (nm)       140    90    65    45    32    22
Clock freq. (GHz)       2     4     7    12    20    30
Wiring levels           7     8     9    10    10    10
Power supply (V)      1.1   1.0   0.8   0.7   0.6   0.5
Max. power (W)        130   160   190   220   250   290

Factors contributing to the validity of Moore's law:

Denser circuits; Architectural improvements

Measures of processor performance: Instructions per second (MIPS, GIPS, TIPS, PIPS)

    Floating-point operations per second

    (MFLOPS, GFLOPS, TFLOPS, PFLOPS)

Running time on benchmark suites

[Plot repeated from Fig. 1.1: processor performance vs. calendar year, 1980-2010.]


    Why High-Performance Computing?

Higher speed (solve problems faster). Important when there are hard or soft deadlines; e.g., 24-hour weather forecast.

Higher throughput (solve more problems). Important when there are many similar tasks to perform; e.g., transaction processing.

Higher computational power (solve larger problems). E.g., weather forecast for a week rather than 24 hours, or with a finer mesh for greater accuracy.

Categories of supercomputers:
- Uniprocessor; aka vector machine
- Multiprocessor; centralized or distributed shared memory
- Multicomputer; communicating via message passing
- Massively parallel processor (MPP; 1K or more processors)


    The Speed-of-Light Argument

    The speed of light is about 30 cm/ns.

Signals travel at a fraction of the speed of light (say, 1/3).

If signals must travel 1 cm during the execution of an instruction, that instruction will take at least 0.1 ns; thus, performance will be limited to 10 GIPS.

This limitation is eased by continued miniaturization, architectural methods such as cache memory, etc.; however, a fundamental limit does exist.

How does parallel processing help? Wouldn't multiple processors need to communicate via signals as well?


    Why Do We Need TIPS or TFLOPS Performance?

Reasonable running time = fraction of an hour to several hours (10^3-10^4 s).

In this time, a TIPS/TFLOPS machine can perform 10^15-10^16 operations.

Example 1: Southern oceans heat modeling (10-minute iterations)
300 GFLOP per iteration x 300,000 iterations per 6 yrs ~ 10^16 FLOP

Example 2: Fluid dynamics calculations (1000 x 1000 x 1000 lattice)
10^9 lattice points x 1000 FLOP/point x 10,000 time steps = 10^16 FLOP

Example 3: Monte Carlo simulation of a nuclear reactor
10^11 particles to track (for 1000 escapes) x 10^4 FLOP/particle = 10^15 FLOP

Decentralized supercomputing (from MathWorld News, 2006/4/7):
A grid of tens of thousands of networked computers discovers 2^30402457 - 1, the 43rd Mersenne prime, as the largest known prime (9,152,052 digits).
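As a quick sanity check of Examples 2 and 3, the sketch below just multiplies out the slide's numbers and estimates the running time, assuming a machine that sustains 1 TFLOPS:

```python
# Operation counts from the slide, and the time a sustained-1-TFLOPS machine needs.
fluid_dynamics = 1e9 * 1000 * 10_000   # lattice points * FLOP/point * time steps
monte_carlo = 1e11 * 1e4               # particles * FLOP per particle

for name, flop in [("fluid dynamics", fluid_dynamics), ("Monte Carlo", monte_carlo)]:
    seconds = flop / 1e12              # sustained 1 TFLOPS assumed
    print(f"{name}: {flop:.0e} FLOP -> about {seconds / 3600:.1f} h at 1 TFLOPS")
```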


    The ASCI Program

Fig. 24.1 Milestones in the Accelerated Strategic (Advanced Simulation &) Computing Initiative (ASCI) program, sponsored by the US Department of Energy, with extrapolation up to the PFLOPS level.

[Chart: performance (TFLOPS, 1 to 1000) vs. calendar year 1995-2010, with plan/develop/use phases; milestones: ASCI Red (1+ TFLOPS, 0.5 TB), ASCI Blue (3+ TFLOPS, 1.5 TB), ASCI White (10+ TFLOPS, 5 TB), ASCI Q (30+ TFLOPS, 10 TB), ASCI Purple (100+ TFLOPS, 20 TB).]


    The Quest for Higher Performance

Top Three Supercomputers in 2005 (IEEE Spectrum, Feb. 2005, pp. 15-16)

1. IBM Blue Gene/L — LLNL, California
   Material science, nuclear stockpile simulation
   32,768 procs, 8 TB memory, 28 TB disk storage
   Linux + custom OS
   71 TFLOPS, $100 M
   Dual-proc PowerPC chips (10-15 W power)
   Full system: 130k procs, 360 TFLOPS (est)

2. SGI Columbia — NASA Ames, California
   Aerospace/space simulation, climate research
   10,240 procs, 20 TB memory, 440 TB disk storage
   Linux
   52 TFLOPS, $50 M
   20x Altix (512 Itanium 2) linked by InfiniBand

3. NEC Earth Simulator — Earth Simulator Ctr, Yokohama
   Atmospheric, oceanic, and earth sciences
   5,120 procs, 10 TB memory, 700 TB disk storage
   Unix
   36 TFLOPS*, $400 M?
   Built of custom vector microprocessors
   Volume = 50x IBM, power = 14x IBM

* Led the top500 list for 2.5 yrs


    The Quest for Higher Performance: 2008 Update

Top Three Supercomputers in June 2008 (http://www.top500.org)

1. IBM Roadrunner — LANL, New Mexico
   Nuclear stockpile calculations, and more
   122,400 procs, 98 TB memory, 0.4 TB/s file system I/O
   Red Hat Linux
   1.38 PFLOPS, $130 M
   PowerXCell 8i 3.2 GHz, AMD Opteron (hybrid)
   2.35 MW power, expands to 1M procs

2. IBM Blue Gene/L — LLNL, California
   Advanced scientific simulations
   212,992 procs, 74 TB memory, 2 PB disk storage
   CNK/SLES 9
   0.596 PFLOPS, $100 M
   PowerPC 440 700 MHz
   1.60 MW power, expands to 0.5M procs

3. Sun Blade X6420 — U Texas Austin
   Open science research
   62,976 procs, 126 TB memory
   Linux
   0.504 PFLOPS*
   AMD x86-64 Opteron quad core 2 GHz
   2.00 MW power, expands to 0.3M procs

* Actually 4th on the top-500 list, with the 3rd being another IBM Blue Gene system at 0.557 PFLOPS


    Supercomputer Performance Growth

Fig. 1.2 The exponential growth in supercomputer performance over the past two decades (from [Bell92], with ASCI performance goals and microprocessor peak FLOPS superimposed as dotted lines).

[Plot: supercomputer performance (MFLOPS to PFLOPS) vs. calendar year 1980-2010; vector supers (Cray X-MP, Y-MP), MPPs (CM-2, CM-5, $30M and $240M MPPs), ASCI goals, and micros (80386, 80860, Alpha).]


    What Exactly is Parallel Processing?

    Parallelism = Concurrency

    Doing more than one thing at a time

    Has been around for decades, since early computers

    I/O channels, DMA, device controllers, multiple ALUs

    The sense in which we use it in this course

Multiple agents (hardware units, software processes) collaborate to perform our main computational task:

- Multiplying two matrices
- Breaking a secret code
- Deciding on the next chess move


    ABCs of Parallel Processing in One Slide

A. Amdahl's Law (Speedup Formula)
Bad news: sequential overhead will kill you, because:
Speedup = T1/Tp ≤ 1/[f + (1 - f)/p] ≤ min(1/f, p)
Moral: for f = 0.1, speedup is at best 10, regardless of peak OPS.

B. Brent's Scheduling Theorem
Good news: optimal scheduling is very difficult, but even a naive scheduling algorithm can ensure:
T1/p ≤ Tp ≤ T1/p + T∞ = (T1/p)[1 + p/(T1/T∞)]
Result: for a reasonably parallel task (large T1/T∞), or for a suitably small p (say, p ≤ T1/T∞), good speedup and efficiency are possible.

C. Cost-Effectiveness Adage
Real news: the most cost-effective parallel solution may not be the one with the highest peak OPS (communication?), greatest speed-up (at what cost?), or best utilization (hardware busy doing what?).
Analogy: mass transit might be more cost-effective than private cars even if it is slower and leads to many empty seats.
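A small sketch of both bounds in code, using the symbols above (the T1, T∞, and p values below are only illustrative):

```python
# Amdahl's law: speedup on p processors when a fraction f of the work is sequential.
def amdahl_speedup(f, p):
    return 1.0 / (f + (1.0 - f) / p)

# Brent's bound: T1/p <= Tp <= T1/p + Tinf, for total work T1 and critical path Tinf.
def brent_bounds(t1, tinf, p):
    return t1 / p, t1 / p + tinf

print(amdahl_speedup(0.1, 1000))      # ~9.9: at best 10x when f = 0.1
print(brent_bounds(1e6, 1e3, 100))    # (10000.0, 11000.0) time units
```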


    Historical Perspective


    Types of Parallelism

    Note: many overlaps

    lookahead & pipelining

    vectorization

    concurrency & simultaneity data and control parallelism

    partitioning & specialization

    interleaving & overlapping of physical subsystems

    multiplicity & replication

    time & space sharing multitasking & multiprogramming

    multi-threading

    distributed computing - for speed or availability


Architectural Evolution

[Diagram: architectural evolution from scalar sequential machines through lookahead, I/E overlap, and functional parallelism to multiple functional units and pipelining; pipelining leads to implicit and explicit vector machines (memory-to-memory or register-to-register), and onward to SIMD (associative processor, processor array) and MIMD (multiprocessor, multicomputer) organizations.]


    Parallel Computer Architectures

(a) On-chip parallelism. (b) A coprocessor. (c) A multiprocessor.

    (d) A multicomputer. (e) A grid.


    Instruction-Level Parallelism

    (a) A CPU pipeline. (b) A sequence of VLIW instructions.

    (c) An instruction stream with bundles marked.


    The TriMedia VLIW CPU (1)

A typical TriMedia instruction, showing five possible operations.


    The TriMedia VLIW CPU (2)

The TM3260 functional units, their quantity, latency, and which instruction slots they can use.


    The TriMedia VLIW CPU (3)

    The major groups of TriMedia custom operations.


    The TriMedia VLIW CPU (4)

    (a) An array of 8-bit elements. (b) The transposed array.

    (c) The original array fetched into four registers.

    (d) The transposed array in four registers.


    On-Chip Multithreading (1)

(a)-(c) Three threads. The empty boxes indicate that the thread has stalled waiting for memory. (d) Fine-grained multithreading.

    (e) Coarse-grained multithreading.


    On-Chip Multithreading (2)

    Multithreading with a dual-issue superscalar CPU.

    (a) Fine-grained multithreading.

    (b) Coarse-grained multithreading.

    (c) Simultaneous multithreading.


    Hyperthreading on the Pentium 4

    Resource sharing between threads in the

    Pentium 4 NetBurst microarchitecture.


    Homogeneous Multiprocessors on a Chip

    Single-chip multiprocessors.

    (a) A dual-pipeline chip. (b) A chip with two cores.


    Heterogeneous Multiprocessors on a Chip (1)

The logical structure of a simple DVD player contains a heterogeneous multiprocessor with multiple cores for different functions.


    Heterogeneous Multiprocessors on a Chip (2)

    An example of the IBM CoreConnect architecture.


    Introduction to Networking (1)

How users are connected to servers on the Internet.


    Introduction to Networking (2)

    A packet as it appears on the Ethernet.


    Introduction to Network Processors

    A typical network processor board and chip.


    The Nexperia Media Processor

    The Nexperia heterogeneous multiprocessor on a chip.


    Multiprocessors

(a) A multiprocessor with 16 CPUs sharing a common memory. (b) An image partitioned into 16 sections, each being analyzed by a different CPU.


    Multicomputers (1)

(a) A multicomputer with 16 CPUs, each with its own private memory. (b) The bit-map image of Fig. 8-17 split up among the 16 memories.


    Multicomputers (2)

Various layers where shared memory can be implemented. (a) The hardware. (b) The operating system. (c) The language runtime system.


    Hybrid approach

In short, programming a multicomputer is much more difficult than programming a multiprocessor.

Under these conditions, why would anyone build multicomputers, when multiprocessors are easier to program? The answer is simple: large multicomputers are much simpler and cheaper to build than multiprocessors with the same number of CPUs.

Implementing a memory shared by even a few hundred CPUs is a substantial undertaking, whereas building a multicomputer with 10,000 CPUs or more is straightforward.


Hybrid approach - cont

Thus we have a dilemma: multiprocessors are hard to build but easy to program, whereas multicomputers are easy to build but hard to program.

This observation has led to a great deal of effort to construct hybrid systems that are relatively easy to build and relatively easy to program.

This work has led to the realization that shared memory can be implemented in various ways, each with its own set of advantages and disadvantages.

In fact, much of the research in parallel architectures these days relates to the convergence of multiprocessor and multicomputer architectures into hybrid forms that combine the strengths of each.

The holy grail here is to find designs that are scalable, that is, continue to perform well as more and more CPUs are added.


Hybrid approach - cont

One approach to building hybrid systems is based on the fact that modern computer systems are not monolithic but are constructed as a series of layers.

This insight opens the possibility of implementing the shared memory at any one of several layers, as shown in Fig. 8-3.

In Fig. 8-3(a) we see the shared memory being implemented by the hardware as a true multiprocessor.

In this design, there is a single copy of the operating system with a single set of tables, in particular, the memory allocation table.

When a process needs more memory, it traps to the operating system, which then looks in its table for a free page and maps the page into the caller's address space.

As far as the operating system is concerned, there is a single memory and it keeps track of which process owns which page in software.

There are many ways to implement hardware shared memory, as we will see later.


Hybrid approach - cont

A second possibility is to use multicomputer hardware and have the operating system simulate shared memory by providing a single system-wide paged shared virtual address space.

In this approach, called DSM (Distributed Shared Memory) (Li, 1988; and Li and Hudak, 1986, 1989), each page is located in one of the memories of Fig. 8-2(a).

Each machine has its own virtual memory and its own page tables. When a CPU does a LOAD or STORE on a page it does not have, a trap to the operating system occurs.

The operating system then locates the page and asks the CPU currently holding it to unmap the page and send it over the interconnection network.

When it arrives, the page is mapped in and the faulting instruction is restarted.

In effect, the operating system is just satisfying page faults from remote memory instead of from disk. To the user, the machine looks as if it has shared memory.
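A toy, single-process sketch of that page-migration idea (hypothetical data structures; a real DSM does this inside the OS fault handler and must also handle read sharing, invalidation, and races):

```python
# Each "node" owns some pages; referencing a remote page faults, the current
# owner unmaps it and "sends" it over, and the faulting access is then retried.
nodes = {n: {} for n in range(4)}            # node id -> locally mapped pages
owner = {p: p % 4 for p in range(16)}        # page number -> owning node
for page, node in owner.items():
    nodes[node][page] = f"contents of page {page}"

def load(node, page):
    if page not in nodes[node]:              # page fault: page lives elsewhere
        data = nodes[owner[page]].pop(page)  # ask the owner to unmap and send it
        nodes[node][page] = data             # map it in locally
        owner[page] = node
    return nodes[node][page]                 # restart the faulting LOAD

print(load(0, 10))    # CPU 0 references page 10 -> the page migrates to node 0
```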


Hybrid approach - cont

A third possibility is to have a user-level runtime system implement a (possibly language-specific) form of shared memory.

In this approach, the programming language provides some kind of shared memory abstraction, which is then implemented by the compiler and runtime system.

For example, the Linda model is based on the abstraction of a shared space of tuples (data records containing a collection of fields).

Processes on any machine can input a tuple from the shared tuple space or output a tuple to the shared tuple space.

Because access to the tuple space is controlled entirely in software (by the Linda runtime system), no special hardware or operating system support is needed.
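A toy, single-process sketch of the tuple-space abstraction (hypothetical helper names; real Linda matches tuples by field types and values, and in() blocks until a match exists):

```python
# out() deposits a tuple; in_() removes and returns the first tuple matching a
# template, where None fields act as wildcards.
tuple_space = []

def out(*fields):
    tuple_space.append(tuple(fields))

def in_(*template):
    for t in tuple_space:
        if len(t) == len(template) and all(w is None or w == f
                                           for w, f in zip(template, t)):
            tuple_space.remove(t)
            return t
    return None     # a real Linda in() would block here instead

out("abc", 2, 5)
out("matrix-1", 1, 6, 3.14)
print(in_("abc", None, None))    # -> ('abc', 2, 5)
```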


    Topology

Various topologies. The heavy dots represent switches. The CPUs and memories are not shown. (a) A star. (b) A complete interconnect. (c) A tree. (d) A ring. (e) A grid. (f) A double torus. (g) A cube. (h) A 4D hypercube.


    Topology - cont

When designing or analyzing an interconnection network, several areas stand out as being important.

First, there is the matter of topology, that is, how the components are arranged.

Second is how the switching works and how to handle contention for resources.

Third is what routing algorithm is used to get messages to their destination efficiently.


    Topology - cont

The topology of an interconnection network describes how the links and switches are arranged, for example, as a ring or as a grid.

Topological designs can be modeled as graphs, with the links as arcs and the switches as nodes, as shown in Fig. 8-4.

Each node in an interconnection network (or its graph) has some number of links connected to it. Mathematicians call the number of links the degree of the node; engineers call it the fanout.

In general, the greater the fanout, the more routing choices there are and the greater the fault tolerance, that is, the ability to continue functioning even if a link fails by routing around it.

If every node has k arcs and the wiring is done right, it is possible to design the network so that it remains fully connected even if k - 1 links fail.


    Topology - cont

Another property of an interconnection network (or its graph) is its diameter.

If we measure the distance between two nodes by the number of arcs that have to be traversed to get from one to the other, then the diameter of a graph is the distance between the two nodes that are the farthest apart (i.e., have the greatest distance between them).

The diameter of an interconnection network is related to the worst-case delay when sending packets from CPU to CPU or from CPU to memory, because each hop across a link takes a finite amount of time.

The smaller the diameter is, the better the worst-case performance is.

Also important is the average distance between two nodes, since this relates to the average packet transit time.
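A short sketch of computing these two quantities from a topology's graph, by breadth-first search from every node (the 4-node ring at the end is only an illustration):

```python
from collections import deque

def distances_from(graph, start):
    """Hop counts from start to every reachable node (BFS)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter_and_average(graph):
    d = [x for n in graph for x in distances_from(graph, n).values() if x > 0]
    return max(d), sum(d) / len(d)

ring4 = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}   # a 4-node ring
print(diameter_and_average(ring4))                      # (2, 1.33...)
```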


Topology - cont

Yet another important property of an interconnection network is its transmission capacity, that is, how much data it can move per second. One useful measure of this capacity is the bisection bandwidth.

To compute this quantity, we first have to (conceptually) partition the network into two equal (in terms of number of nodes) but unconnected parts by removing a set of arcs from its graph.

Then we compute the total bandwidth of the arcs that have been removed. There may be many different ways to partition the network into two equal parts. The bisection bandwidth is the minimum over all the possible partitions.

The significance of this number is that if the bisection bandwidth is, say, 800 bits/sec, then if there is a lot of communication between the two halves, the total throughput may be limited to only 800 bits/sec in the worst case.

Many designers believe bisection bandwidth is the most important metric of an interconnection network. Many interconnection networks are designed with the goal of maximizing the bisection bandwidth.
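A brute-force sketch of that definition (enumerate every equal-size split of the nodes, add up the bandwidth of the links that cross it, and take the minimum; feasible only for tiny graphs):

```python
from itertools import combinations

def bisection_bandwidth(nodes, links):
    """Minimum cut bandwidth over all equal halves; links maps (a, b) -> bandwidth."""
    best = None
    for half in combinations(nodes, len(nodes) // 2):
        half = set(half)
        cut = sum(bw for (a, b), bw in links.items() if (a in half) != (b in half))
        best = cut if best is None else min(best, cut)
    return best

# 4-node ring with 800 bits/sec links (illustrative numbers).
links = {(0, 1): 800, (1, 2): 800, (2, 3): 800, (3, 0): 800}
print(bisection_bandwidth([0, 1, 2, 3], links))   # 1600: every bisection cuts 2 links
```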


    Topology - cont

Interconnection networks can be characterized by their dimensionality.

For our purposes, the dimensionality is determined by the number of choices there are to get from the source to the destination.

If there is never any choice (i.e., there is only one path from each source to each destination), the network is zero-dimensional.

If there is one dimension in which a choice can be made, for example, go east or go west, the network is one-dimensional.

If there are two axes, so a packet can go east or west or, alternatively, go north or south, the network is two-dimensional, and so on.


    Topology - cont

Several topologies are shown in Fig. 8-4. Only the links (lines) and switches (dots) are shown here. The memories and CPUs (not shown) would typically be attached to the switches by interfaces.

In Fig. 8-4(a), we have a zero-dimensional star configuration, in which the CPUs and memories would be attached to the outer nodes, with the central one just doing switching.

Although a simple design, for a large system the central switch is likely to be a major bottleneck.

Also, from a fault-tolerance perspective, this is a poor design since a single failure at the central switch completely destroys the system.


    Topology - cont

In Fig. 8-4(b), we have another zero-dimensional design that is at the other end of the spectrum, a full interconnect.

Here every node has a direct connection to every other node. This design maximizes the bisection bandwidth, minimizes the diameter, and is exceedingly fault tolerant (it can lose any six links and still be fully connected).

Unfortunately, the number of links required for k nodes is k(k - 1)/2, which quickly gets out of hand for large k.
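A one-line check of that growth (sketch):

```python
# Links in a full interconnect of k nodes: k * (k - 1) / 2.
for k in (8, 64, 1024):
    print(k, k * (k - 1) // 2)    # 8 -> 28, 64 -> 2016, 1024 -> 523776
```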


    Topology - cont

Yet a third zero-dimensional topology is the tree, illustrated in Fig. 8-4(c). A problem with this design is that the bisection bandwidth is equal to the link capacity.

Since there will normally be a lot of traffic near the top of the tree, the top few nodes will become bottlenecks.

One way around this problem is to increase the bisection bandwidth by giving the upper links more bandwidth. For example, the lowest-level links might have a capacity b, the next level might have a capacity 2b, and the top-level links might each have 4b.

Such a design is called a fat tree and has been used in commercial multicomputers, such as the (now-defunct) Thinking Machines CM-5.


    Topology - cont

The ring of Fig. 8-4(d) is a one-dimensional topology by our definition because every packet sent has a choice of going left or going right.

The grid or mesh of Fig. 8-4(e) is a two-dimensional design that has been used in many commercial systems.

It is highly regular, easy to scale up to large sizes, and has a diameter that only increases as the square root of the number of nodes.

A variant on the grid is the double torus of Fig. 8-4(f), which is a grid with the edges connected.

Not only is it more fault tolerant than the grid, but the diameter is also less because the opposite corners can now communicate in only two hops.


    Topology - cont

The cube of Fig. 8-4(g) is a regular three-dimensional topology. We have illustrated a 2 x 2 x 2 cube, but in the general case it could be a k x k x k cube.

In Fig. 8-4(h) we have a four-dimensional cube constructed from two three-dimensional cubes with the corresponding edges connected.

We could make a five-dimensional cube by cloning the structure of Fig. 8-4(h) and connecting the corresponding nodes to form a block of four cubes.

To go to six dimensions, we could replicate the block of four cubes and interconnect the corresponding nodes, and so on.


Topology - cont

An n-dimensional cube formed this way is called a hypercube.

Many parallel computers use this topology because the diameter grows linearly with the dimensionality.

Put in other words, the diameter is the base-2 logarithm of the number of nodes, so, for example, a 10-dimensional hypercube has 1024 nodes but a diameter of only 10, giving excellent delay properties. Note that, in contrast, 1024 nodes arranged as a 32 x 32 grid have a diameter of 62, more than six times worse than the hypercube.

The price paid for the smaller diameter is that the fanout, and thus the number of links (and the cost), is much larger for the hypercube.

Nevertheless, the hypercube is a common choice for high-performance systems.
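A quick sketch comparing those two diameters for the same node count (an n-dimensional hypercube with 2^n nodes versus a square grid without wraparound):

```python
import math

nodes = 1024
hypercube_diameter = int(math.log2(nodes))   # n-cube diameter = n -> 10
side = int(math.sqrt(nodes))                 # 32 x 32 grid
grid_diameter = 2 * (side - 1)               # corner to opposite corner -> 62
print(hypercube_diameter, grid_diameter)     # 10 62
```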


Topology - cont

When designing or analyzing an interconnection network, several areas stand out as being important.

First, there is the matter of topology, that is, how the components are arranged. Other important aspects are:

Second, how the switching works and how to handle contention for resources.

Third, what routing algorithm is used to get messages to their destination efficiently.


Flynn's taxonomy


    1.4 Types of Parallelism: A Taxonomy

Fig. 1.11 The Flynn-Johnson classification of computer systems.

[Diagram: Flynn's categories cross single/multiple instruction streams with single/multiple data streams: SISD (uniprocessors), SIMD (array or vector processors), MISD (rarely used), MIMD (multiprocessors or multicomputers). Johnson's expansion splits MIMD by global vs. distributed memory and shared variables vs. message passing: GMSV (shared-memory multiprocessors), GMMP (rarely used), DMSV (distributed shared memory), DMMP (distributed-memory multicomputers).]


    MIMD category

The MIMD category includes a wide class of computers. For this reason, in 1988, E. E. Johnson proposed a further classification of such machines based on:

- their memory structure (global or distributed) and
- the mechanism used for communication/synchronization (shared variables or message passing).

Again, one of the four categories (GMMP) is not widely used.

The GMSV class is what is loosely referred to as (shared-memory) multiprocessors.

At the other extreme, the DMMP class is known as (distributed-memory) multicomputers.


    MIMD category - cont

Finally, the DMSV class, which is becoming popular in view of combining the implementation ease of distributed memory with the programming ease of the shared-variable scheme, is sometimes called distributed shared memory.

When all processors in a MIMD-type machine execute the same program, the result is sometimes referred to as single-program multiple-data [SPMD (spim-dee)].

Although Fig. 1.11 lumps all SIMD machines together, there are in fact variations similar to those suggested above for MIMD machines. At least conceptually, there can be shared-memory and distributed-memory SIMD machines in which the processors communicate by means of shared variables or explicit message passing.

A taxonomy of parallel computers (Tanenbaum)


    Tanenbaum taxonomy

In Tanenbaum's taxonomy, the MIMD category has been split into Multiprocessors (shared-memory machines) and Multicomputers (message-passing machines).

Three kinds of multiprocessors exist, distinguished by the way the shared memory is implemented on them. They are called:

- UMA (Uniform Memory Access),
- NUMA (NonUniform Memory Access), and
- COMA (Cache Only Memory Access).


    Tanenbaum taxonomy - cont

These categories exist because in large multiprocessors, the memory is usually split up into multiple modules.

UMA machines have the property that each CPU has the same access time to every memory module.

In other words, every memory word can be read as fast as every other memory word.

If this is technically impossible, the fastest references are slowed down to match the slowest ones, so programmers do not see the difference. This is what uniform means here.

This uniformity makes the performance predictable, an important factor for writing efficient code.


Tanenbaum taxonomy - cont

As we mentioned before, even on a multicomputer, user programs may have the ability to access remote memory by using LOAD and STORE instructions, but this illusion is supported by the operating system, not the hardware.

This difference is subtle, but very important.

Because multicomputers do not have direct access to remote memory, they are sometimes called NORMA (NO Remote Memory Access) machines.

Multicomputers can be roughly divided into two categories.

The first category contains the MPPs (Massively Parallel Processors), which are expensive supercomputers consisting of many CPUs tightly coupled by a high-speed proprietary interconnection network. The Cray T3E and IBM SP/2 are well-known examples.


Tanenbaum taxonomy - cont

The other category consists of regular PCs or workstations, possibly rack-mounted, and connected by commercial off-the-shelf interconnection technology.

Logically, there is not much difference, but huge supercomputers costing many millions of dollars are used differently than networks of PCs assembled by the users for a fraction of the price of an MPP.

These home-brew machines go by various names, including NOW (Network of Workstations) and COW (Cluster of Workstations).


    Sequential Consistency

(a) Two CPUs writing and two CPUs reading a common memory word. (b)-(d) Three possible ways the two writes and four reads might be interleaved in time.


    Weak Consistency

Weakly consistent memory uses synchronization operations to divide time into sequential epochs.


    UMA Symmetric Multiprocessor Architectures

Three bus-based multiprocessors. (a) Without caching. (b) With caching. (c) With caching and private memories.


    Snooping Caches

The write-through cache coherence protocol. The empty boxes indicate that no action is taken.


    The MESI Cache Coherence Protocol

    The MESI cache coherence protocol.


    UMA Multiprocessors Using Crossbar Switches

(a) An 8 x 8 crossbar switch.

(b) An open crosspoint.

(c) A closed crosspoint.


UMA Multiprocessors Using Multistage Switching Networks (1)

(a) A 2 x 2 switch.

(b) A message format.


UMA Multiprocessors Using Multistage Switching Networks (2)

    An omega switching network.


    NUMA Multiprocessors

A NUMA machine based on two levels of buses. The Cm* was the first multiprocessor to use this design.


    Cache Coherent NUMA Multiprocessors

(a) A 256-node directory-based multiprocessor. (b) Division of a 32-bit memory address into fields. (c) The directory at node 36.


    The Sun Fire E25K NUMA Multiprocessor (1)

    The Sun Microsystems E25K multiprocessor.


    The Sun Fire E25K NUMA Multiprocessor (2)

The Sun Fire E25K uses a four-level interconnect. Dashed lines are address paths. Solid lines are data paths.


    Message-Passing Multicomputers

    A generic multicomputer.


    BlueGene (1)

    The BlueGene/L custom processor chip.


BlueGene (2)

    The BlueGene/L. (a) Chip. (b) Card. (c) Board.

    (d) Cabinet. (e) System.


    Red Storm (1)

    Packaging of the Red Storm components.


    Red Storm (2)

    The Red Storm system as viewed from above.


    A Comparison of BlueGene/L and Red Storm

A comparison of BlueGene/L and Red Storm.

    Google (1)


    Processing of a Google query.


    Google (2)

A typical Google cluster.

    Scheduling


    Scheduling a cluster. (a) FIFO. (b) Without head-of-line blocking.

    (c) Tiling. The shaded areas indicate idle CPUs.


    Distributed Shared Memory (1)

    A virtual address space consisting of 16 pages

    spread over four nodes of a multicomputer.

(a) The initial situation.


    Distributed Shared Memory (2)

    A virtual address space consisting of 16 pages

    spread over four nodes of a multicomputer.

    (b) After CPU 0 references page 10.


    Distributed Shared Memory (3)

    A virtual address space consisting of 16 pages

    spread over four nodes of a multicomputer.

    (c) After CPU 1 references page 10, here assumed to be a read-only page.


    Linda

    Three Linda tuples.


    Orca

A simplified ORCA stack object, with internal data and two operations.

    Software Metrics (1)


Real programs achieve less than the perfect speedup indicated by the dotted line.

    Software Metrics (2)


(a) A program has a sequential part and a parallelizable part. (b) Effect of running part of the program in parallel.

    Achieving High Performance


(a) A 4-CPU bus-based system. (b) A 16-CPU bus-based system. (c) A 4-CPU grid-based system. (d) A 16-CPU grid-based system.


    Grid Computing

    The grid layers.

