2009, multi-core programming for medical imaging.pdf

Upload: anatoli-krasilnikov

Post on 03-Apr-2018

  • 7/28/2019 2009, Multi-core Programming for Medical Imaging.pdf

    1/163

    Dec. 22, 2009 Harbin Engineering University

    Enabling Technology of Multi-core Computing for Medical Imaging

    Dr. Jun Ni, Ph.D., M.E.

    Associate Professor, Radiology, Biomedical Engineering, Mechanical Engineering, and Computer Science

    The University of Iowa, Iowa City, Iowa, USA

    Outline

    Multi-core Architecture and Programming Environment

    Enabling Technology of Multi-core Computing for Medical Imaging

    Recommended Resources (Wikipedia and Textbooks)

    Multicore programming on Wikipedia

    Extracted from Multi-core Programming, by Shameem Akhter and Jason Roberts

    Professional Multicore Programming: Design and Implementation for C++ Developers (Wrox), 2008

    The Art of Multiprocessor Programming, by Maurice Herlihy and Nir Shavit, 2008

    Parallel MATLAB for Multicore and Multinode Computers, by Jeremy Kepner, 2009

    Java Performance on Multi-Core Platforms, by Charles J. Hunt, Paul Hohensee, Binu John, and David Dagastine

    Multi-core Computing Theme

    Increasing performance through

    hardware: multi-core architecture

    software: multi-threading

    Need for Multi-core Architecture

    Early concurrency existed only at the process level

    The operating system handled the details of allocating CPU time to each individual program, one at a time

    Job and task switching was left to the systems programmer

    Early PCs

    Standalone devices with simple, single-user operating systems

    Only one program would run at a time; user interaction occurred via simple text-based interfaces

    Programs followed a straight-line instruction flow

    Later came more sophisticated computing platforms

    Operating system vendors used advances in CPU and graphics performance to develop more sophisticated user environments

    Graphical User Interfaces (GUIs) became standard and enabled users to start and run multiple programs in the same user environment

    Networking on PCs became pervasive

    Increased user expectations

    Users expect to run multiple jobs simultaneously and want their computing platform to be quick and responsive

    Applications are expected to start quickly and to handle inconvenient background tasks

    Challenges

    These expectations create problems that hardware and software developers must solve

    Most end users:

    hold a simplistic view of complex computer systems

    Reality:

    implementing such a system is far more difficult

    Client-server computing environment for multimedia streaming and display

    Client side: the PC must be able to

    download the streaming video data,

    decompress/decode it,

    and draw it on the video display

    The PC also handles any streaming audio that accompanies the video stream and sends it to the sound card.

    On the server side, a provider must be able

    to receive the original broadcast,

    to encode/compress it in near real time,

    and to send it over the network to potentially hundreds of thousands of clients

    A computer system capable of streaming a Web broadcast

    A streaming multimedia delivery service, seen from the end user's perspective

    To provide an acceptable end-user experience, system designers must be able to effectively manage many independent subsystems that operate in parallel

    Concurrency

    A way to manage the sharing of resources used at the same time

    Important for several reasons:

    Concurrency allows for the most efficient use of system resources

    Efficient resource utilization is the key to maximizing the performance of computing systems

    A highly inefficient approach:

    the system sits completely idle while waiting for data to come in from the network

    A better approach is to stage the work:

    while the system is waiting for the next job to come in from the network, the previous job is being decoded by the CPU, thereby improving overall resource utilization

    Multi-core processors have been developed recently in response to this demand
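The staged approach above is a producer-consumer pipeline. The Python sketch below is illustrative only: `fetch_jobs` and `decode_jobs` are hypothetical stand-ins, with `time.sleep` simulating network latency. One thread waits for the next job while a second thread processes the previous one, handing work over through a bounded `queue.Queue`. In CPython, threads overlap well when one of them is blocked waiting, as here, but they do not execute Python bytecode on several cores at once; CPU-bound stages would need processes or native code to use multiple cores.

```python
import queue
import threading
import time

def fetch_jobs(n_jobs, out_q, delay=0.01):
    """Hypothetical network stage: wait for each job, then hand it downstream."""
    for job_id in range(n_jobs):
        time.sleep(delay)        # stands in for waiting on the network
        out_q.put(job_id)
    out_q.put(None)              # sentinel: no more jobs will arrive

def decode_jobs(in_q, results):
    """Hypothetical decode stage: works on job k while job k+1 is still arriving."""
    while True:
        job_id = in_q.get()
        if job_id is None:
            break
        results.append(f"decoded-{job_id}")

def run_pipeline(n_jobs):
    buffer = queue.Queue(maxsize=4)   # bounded hand-off between the two stages
    results = []
    stages = [
        threading.Thread(target=fetch_jobs, args=(n_jobs, buffer)),
        threading.Thread(target=decode_jobs, args=(buffer, results)),
    ]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return results

print(run_pipeline(3))  # ['decoded-0', 'decoded-1', 'decoded-2']
```

The bounded queue keeps the fetch stage from running arbitrarily far ahead of the decode stage, which is usually what a streaming client wants.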

    What is Multi-core?

    A multi-core processor

    is a processing system composed of two or more independent cores: an integrated circuit to which two or more individual sub-processors (called cores in this sense) have been attached

    The cores are typically

    integrated onto a single integrated-circuit die (a Chip Multiprocessor, or CMP), but may be integrated onto multiple dies in a single chip package

    A dual-core processor contains two cores.

    A quad-core processor contains four cores.

    A multi-core processor implements multiprocessing in a single physical package.
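Software can ask the operating system how many processors it exposes. A small Python sketch (the helper name is hypothetical): `os.cpu_count()` reports logical CPUs, i.e. hardware threads, so a processor with simultaneous multithreading can report more than its number of physical cores. It counts the machine's CPUs rather than the ones a given process is allowed to use, and it may return `None` when the count cannot be determined.

```python
import os

def logical_cpu_count():
    """Number of logical CPUs reported by the OS, or 1 if it cannot be determined."""
    # os.cpu_count() counts hardware threads: a 4-core CPU with 2-way
    # simultaneous multithreading typically reports 8.
    return os.cpu_count() or 1

print(f"The OS reports {logical_cpu_count()} logical CPUs")
```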

    Cores in a multi-core device may be coupled together tightly or loosely, and they may or may not share caches

    Inter-core communication may be implemented with either

    message passing or

    shared memory

    Common network topologies used to interconnect cores include:

    bus, ring, two-dimensional mesh, and crossbar
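Message passing and shared memory are also the two basic ways software threads exchange data. As a software-level analogy only (it does not model the on-chip interconnect), the Python sketch below computes the same sum both ways. In `shared_memory_sum`, workers update one shared total under a `threading.Lock`. In `message_passing_sum`, workers share nothing and send partial results through a `queue.Queue`. Both function names are hypothetical.

```python
import queue
import threading

def _run(workers):
    """Start all threads, then wait for all of them."""
    for t in workers:
        t.start()
    for t in workers:
        t.join()

def shared_memory_sum(values, n_workers=2):
    """Workers add partial sums into one shared variable guarded by a lock."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)
        with lock:                    # serialize access to the shared state
            total += partial

    chunks = [values[i::n_workers] for i in range(n_workers)]
    _run([threading.Thread(target=worker, args=(c,)) for c in chunks])
    return total

def message_passing_sum(values, n_workers=2):
    """Workers share no state; each sends its partial sum as a message."""
    mailbox = queue.Queue()

    def worker(chunk):
        mailbox.put(sum(chunk))       # communicate by sending a message

    chunks = [values[i::n_workers] for i in range(n_workers)]
    _run([threading.Thread(target=worker, args=(c,)) for c in chunks])
    return sum(mailbox.get() for _ in range(n_workers))

data = list(range(100))
print(shared_memory_sum(data), message_passing_sum(data, n_workers=4))  # 4950 4950
```

The shared-memory version needs explicit synchronization. The message-passing version avoids shared mutable state but must move every result through the channel.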

    All cores are identical in homogeneous multi-core systems; they are not identical in heterogeneous multi-core systems.

    Just as with single-processor systems, cores in multi-core systems may implement different architectures, such as:

    Superscalar

    VLIW

    Vector processing

    SIMD

    Multithreading

    Multi-core processors are widely used across many application domains, including:

    General-purpose computing

    Embedded systems

    Networking

    Digital signal processing

    Graphics

    Medical imaging is our focus!

    The amount of performance gained by using a multi-core processor

    is strongly dependent on the software algorithms and their implementation

    Parallel Cases:

    The speedup is limited by the fraction of the software that can be parallelized to run on multiple cores simultaneously

    This limit is described by Amdahl's law

    In the best case, embarrassingly parallel problems may realize speedup factors near the number of cores

    Many typical applications do not realize such large speedup factors

    Parallelization of software is a significant ongoing topic of research
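Amdahl's law makes "the fraction that can be parallelized" precise. If a fraction p of the run time can be spread over N cores while the rest stays serial, the ideal speedup is S(N) = 1 / ((1 - p) + p/N). The small Python sketch below evaluates that formula. It ignores communication and memory-bandwidth effects, which can only lower the real speedup.

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Ideal speedup on n_cores when parallel_fraction of the work parallelizes."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if n_cores < 1:
        raise ValueError("n_cores must be >= 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_cores)

for p in (0.5, 0.9, 0.99, 1.0):
    print(f"p={p:<4}  4 cores: {amdahl_speedup(p, 4):5.2f}x  "
          f"16 cores: {amdahl_speedup(p, 16):5.2f}x")
```

With p = 0.9, four cores give about 3.08x, and no number of cores can exceed 1 / 0.1 = 10x. Only an embarrassingly parallel problem (p close to 1) approaches N.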

    Terminology

    There is some discrepancy in the semantics by which the terms multi-core and dual-core are defined. Most commonly they are used to refer to some sort of central processing unit (CPU)

    They are sometimes also applied to digital signal processors (DSPs) and systems-on-a-chip (SoCs).

    Some use these terms to refer only to multi-core microprocessors that are manufactured on the same integrated-circuit die.

    These people generally refer to separate microprocessor dies in the same package by another name, such as multi-chip module.

    Here, both the terms "multi-core" and "dual-core" are used to refer to microelectronic CPUs manufactured on the same integrated circuit.

    In contrast to multi-core systems, the term multi-CPU refers to multiple physically separate processing units,

    which often contain special circuitry to facilitate communication with each other.

    The terms many-core and massively multi-core are sometimes used to describe multi-core architectures with an especially high number of cores (tens or hundreds).

    Some systems use many soft microprocessor cores placed on a single FPGA.

    Each of these "cores" can be considered a "semiconductor intellectual property core" as well as a CPU core.

    Development

    While manufacturing technology continues to improve, reducing the size of individual gates,

    the physical limits of semiconductor-based microelectronics have become a major design concern. These physical limitations can cause significant problems, such as

    heat dissipation and data synchronization.

    The demand for more capable microprocessors causes CPU designers to use various methods of increasing performance.

    Instruction-level parallelism (ILP) methods such as superscalar pipelining are suitable for many applications,

    but they are inefficient for others that tend to contain difficult-to-predict code.

    Many applications are better suited to thread-level parallelism (TLP) methods.

    Using multiple independent CPUs is one common method of increasing a system's overall TLP.

    The combination of increased available die space (due to refined manufacturing processes) and the demand for increased TLP is the logic behind the creation of multi-core CPUs.

    Commercial Incentives

    Several business motives drive the development of dual-core architectures.

    Since symmetric multiprocessing (SMP) designs have long been implemented using discrete CPUs, the issues of implementing the architecture and supporting it in software are well known.

    Utilizing a proven processing-core design without architectural changes reduces design risk significantly.

    For general-purpose processors, much of the motivation for multi-core processors comes from the greatly diminished gains in processor performance obtained by increasing the operating frequency.

    This is due to three primary factors:

    The memory wall: the increasing gap between processor and memory speeds. This effect pushes cache sizes larger in order to mask the latency of memory.

    This helps only to the extent that memory bandwidth is not the bottleneck in performance.

    The ILP wall: the increasing difficulty of finding enough parallelism in a single instruction stream to keep a high-performance single-core processor busy.

    The power wall: power consumption rises far faster than linearly as the operating frequency increases.

    This increase can be mitigated by "shrinking" the processor, using smaller traces for the same logic.

    The power wall poses manufacturing, system-design, and deployment problems that have not been justified in the face of the diminished performance gains due to the memory wall and the ILP wall.
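Background not stated on the slide: a common first-order model of CMOS dynamic (switching) power shows why pushing the clock frequency runs into this wall. In the sketch below, α is the activity factor, C the switched capacitance, V the supply voltage and f the clock frequency. Leakage power is ignored, and the assumption that V must scale roughly in proportion to f is a simplification used only for illustration.

```latex
P_{\mathrm{dyn}} \approx \alpha \, C \, V^{2} f
% If V scales roughly in proportion to f, then P_dyn grows roughly as f^3.
% Illustration: two cores, each at 0.8f and 0.8V, versus one core at f and V
\frac{2\,\alpha C\,(0.8V)^{2}\,(0.8f)}{\alpha C\,V^{2} f} = 2 \times 0.8^{3} = 1.024
```

Under these assumptions, the dual-core configuration draws about the same power as the single fast core (1.024 times). It offers up to 2 × 0.8 = 1.6 times the throughput on perfectly parallel work.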

    The terminology "dual-core" (and other multiples) lends itself to marketing efforts.

    To continue delivering regular performance improvements for general-purpose processors, manufacturers such as Intel and AMD have turned to multi-core designs, sacrificing lower manufacturing costs for higher performance in some applications and systems.

    Multi-core architectures are being developed, but so are the alternatives.

    An especially strong contender for established markets is the further integration of peripheral functions into the chip.

    Advantages

    The proximity of multiple CPU cores on the same die allows the cache-coherency circuitry to operate at a much higher clock rate than is possible if the signals have to travel off-chip.

    Combining equivalent CPUs on a single die significantly improves the performance of cache-snoop (alternatively, bus-snooping) operations.

    Put simply, this means that signals between different CPUs travel shorter distances and therefore degrade less. These higher-quality signals allow more data to be sent in a given time period, since individual signals can be shorter and do not need to be repeated as often.

    The largest boost in performance will likely be noticed as improved response time while running CPU-intensive processes, like antivirus scans, ripping/burning media (requiring file conversion), or searching for folders.

    If an automatic virus scan starts while a movie is being watched, the application playing the movie is far less likely to be starved of processor power, because the operating system can run the antivirus program on a different processor core from the one running the movie playback.

    Assuming that the die can physically fit into the package, multi-core CPU designs require much less printed circuit board (PCB) space than multi-chip SMP designs.

    A dual-core processor uses slightly less power than two coupled single-core processors, principally because of the decreased power required to drive signals external to the chip.

    Furthermore, the cores share some circuitry, such as the L2 cache and the interface to the front-side bus (FSB).

    In terms of competing technologies for the available silicon die area, a multi-core design can make use of proven CPU core library designs and produce a product with lower risk of design error than devising a new, wider core design.

    Adding more cache, by contrast, suffers from diminishing returns.

    Disadvantages

    In addition to operating system (OS) support, adjustments to existing software are required to maximize utilization of the computing resources provided by multi-core processors.

    Also, the ability of multi-core processors to increase application performance depends on the use of multiple threads within applications.

    The situation is improving: Valve Corporation's Source engine offers multi-core support,

    and Crytek has developed similar technologies for CryEngine 2, which powers their game, Crysis.

    Emergent Game Technologies' Gamebryo engine includes their Floodgate technology, which simplifies multi-core development across game platforms.

    Integrating multiple cores on one chip drives production yields down, and such chips are more difficult to manage thermally than lower-density single-chip designs.

    Intel has partially countered the first problem by building its quad-core designs from two dual-core dies in a single package: any two working dual-core dies can be used, as opposed to producing four cores on a single die and requiring all four to work to produce a quad-core.
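A toy model makes the yield argument concrete. Assume, purely for illustration, that each core is defect-free independently with probability y. A monolithic quad-core die is then good with probability y^4. Each dual-core die is good with probability y^2, and because good dual-core dies can be paired freely, roughly a fraction y^2 of the dual-core silicon ends up in sellable quad-core parts. The function names are hypothetical, and the model ignores die-size effects, packaging yield, and the leftover die when the number of good dies is odd.

```python
def die_yield(core_yield, cores_per_die):
    """Probability that every core on a die is defect-free.

    Toy assumption: each core is good independently with probability core_yield.
    """
    return core_yield ** cores_per_die

def quad_core_yields(core_yield):
    """Usable-silicon fraction: monolithic quad die vs. two paired dual-core dies."""
    monolithic = die_yield(core_yield, 4)  # all four cores on one die must work
    paired = die_yield(core_yield, 2)      # each dual die is tested on its own, and
                                           # any two good ones are packaged together
    return monolithic, paired

mono, paired = quad_core_yields(0.9)
print(f"monolithic: {mono:.3f}, two dual-core dies: {paired:.3f}")
# monolithic: 0.656, two dual-core dies: 0.810
```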

    From an architectural point of view, single-CPU designs may ultimately make better use of the silicon surface area than multiprocessing cores, so a development commitment to this architecture may carry the risk of obsolescence.

    Finally, raw processing power is not the only constraint on system performance.

    When two processing cores share the same system bus and memory bandwidth, the real-world performance advantage is limited. If a single core is close to being memory-bandwidth limited, going to dual-core might give only a 30% to 70% improvement.

    If memory bandwidth is not a problem, a 90% improvement can be expected.

    It would even be possible for an application that used two CPUs to end up running faster on one dual-core processor if communication between the CPUs was the limiting factor, which would count as more than a 100% improvement.

    Hardware

    The general trend in processor development has been from multi-core to many-core: from dual-, tri-, quad-, hexa-, and octo-core chips to ones with tens or even hundreds of cores.

    In addition, multi-core chips mixed with simultaneous multithreading, memory-on-chip, and special-purpose "heterogeneous" cores promise further performance and efficiency gains, especially in processing multimedia, recognition, and networking applications.

    There is also a trend of improving energy efficiency by focusing on performance per watt, with advanced fine-grain or ultra-fine-grain power management and dynamic voltage and frequency scaling (e.g., in laptop computers and portable media players).

    Architecture

    Some architectures use one core design that is repeated consistently ("homogeneous"), while others use a mixture of different cores, each optimized for a different role ("heterogeneous").

    As an example of this discussion, the article "CPU designers debate multi-core future" by Rick Merritt (EE Times, 2008) includes the following comment:

    "Chuck Moore... suggested computers should be more like cellphones, using a variety of specialty cores to run modular software scheduled by a high-level applications programming interface."

    For example, an application may create a new thread for a scan process, while the GUI thread waits for commands from the user (e.g., cancel the scan).

    In such cases, a multi-core architecture is of little benefit to the application itself, because a single thread does all the heavy lifting and the work cannot be balanced evenly across multiple cores.
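The scan-plus-GUI pattern above can be sketched in Python. In this sketch, `scan` and `run_scan` are hypothetical names and `time.sleep` stands in for scanning one file. A single worker thread does all the work, and the main thread stays free to react to a user's Cancel, signalled through a `threading.Event`. Only one thread is busy, so as the slide notes, extra cores add little to this application's speed; the benefit is responsiveness.

```python
import threading
import time

def scan(files, cancel_event, progress):
    """Worker: 'scans' files one by one, checking for cancellation in between."""
    for name in files:
        if cancel_event.is_set():
            return "cancelled"
        time.sleep(0.001)          # stands in for scanning one file
        progress.append(name)
    return "finished"

def run_scan(files, cancel_after=None):
    """Main ('GUI') thread starts the worker and stays free to react to the user."""
    cancel_event = threading.Event()
    progress, outcome = [], {}

    def worker():
        outcome["status"] = scan(files, cancel_event, progress)

    thread = threading.Thread(target=worker)
    thread.start()
    if cancel_after is not None:   # simulate the user pressing "Cancel"
        time.sleep(cancel_after)
        cancel_event.set()
    thread.join()
    return outcome["status"], len(progress)

print(run_scan([f"file{i}" for i in range(5)]))  # ('finished', 5)
```

Checking the flag between units of work is a cooperative design: cancellation takes effect at the next file boundary rather than interrupting the worker mid-file.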

    Programming truly multithreaded code often requires complex coordination of threads and can easily introduce subtle, difficult-to-find bugs due to the interleaving of processing on data shared between threads (see thread safety).

    Consequently, such code is much more difficult to debug than single-threaded code when it breaks.
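A shared counter is the smallest example of such an interleaving bug. `value += 1` is a read-modify-write: two threads can read the same old value, and one update is lost. The Python sketch below, whose names are hypothetical, shows an unsafe increment and a lock-protected one. Whether the unsafe version actually loses updates on a given run depends on thread scheduling and the interpreter version, which is exactly what makes such bugs hard to reproduce. The locked version always produces the correct count.

```python
import threading

class SharedCounter:
    """Counter updated by several threads."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment_unsafe(self):
        # Read-modify-write with no lock: concurrent updates can be lost.
        self.value += 1

    def increment_safe(self):
        # The lock makes the read-modify-write atomic with respect to other threads.
        with self._lock:
            self.value += 1

def hammer(counter, method_name, n_threads=4, n_iters=10_000):
    """Call counter.<method_name>() n_iters times from each of n_threads threads."""
    def work():
        op = getattr(counter, method_name)
        for _ in range(n_iters):
            op()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value

print(hammer(SharedCounter(), "increment_safe"))  # 40000
```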

    There has been a perceived lack of motivation for writing consumer-level threaded applications because of the relative rarity of consumer-level multiprocessor hardware.

    Although threaded applications incur little additional performance penalty on single-processor machines, the extra development overhead has been difficult to justify given the preponderance of single-processor machines.

    Programming Environment

    Given the increasing emphasis on multi-core chip design, stemming from the grave thermal and power-consumption problems posed by any further significant increase in processor clock speeds, the extent to which software can be multithreaded to take advantage of these new chips is likely to be the single greatest constraint on computer performance in the future.

    If developers are unable to design software to fully exploit the resources provided by multiple cores, they will ultimately reach an insurmountable performance ceiling.

    The telecommunications market was one of the first to need a new design for parallel datapath packet processing, because multi-core processors were adopted very quickly for both the datapath and the control plane.

    These MPUs are going to replace the traditional network processors that were based on proprietary micro- or pico-code.

    Parallel programming techniques can benefit from multiple cores directly.

    Some existing parallel programming models, such as Cilk++, OpenMP, Skandium, and MPI, can be used on multi-core platforms.

    Intel introduced a new abstraction for C++ parallelism called TBB (Threading Building Blocks).

    Other research efforts include

    Codeplay's Sieve System,

    Cray's Chapel,

    Sun's Fortress, and

    IBM's X10.

    Multi-core processing has also affected modern computational software development.

    Developers programming in newer languages might find that their modern languages do not support multi-core functionality.

    This then requires the use of numerical libraries to access code written in languages like C and Fortran, which perform math computations faster than newer languages like C#.

    Intel's MKL and AMD's ACML are written in these native languages and take advantage of multi-core processing.

    Managing concurrency acquires a central role in developing parallel applications.

    The basic steps in designing parallel applications are:

    Partitioning

    The partitioning stage of a design is intended to expose opportunities for parallel execution.

    Hence, the focus is on defining a large number of small tasks in order to yield what is termed a fine-grained decomposition of a problem.
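As a hypothetical medical-imaging case (not from the slides), consider filtering one 2D image slice. The finest-grained partition makes every output pixel its own task. The Python sketch below simply enumerates those tasks to show how many a fine-grained decomposition exposes. The function name and the 512 x 512 slice size are illustrative assumptions.

```python
from typing import List, Tuple

Task = Tuple[int, int]  # (row, col) of the output pixel this task computes

def partition_pixels(height: int, width: int) -> List[Task]:
    """Finest-grained decomposition: one task per output pixel."""
    return [(r, c) for r in range(height) for c in range(width)]

tasks = partition_pixels(512, 512)   # e.g. a 512 x 512 CT slice (illustrative size)
print(len(tasks))                    # 262144
```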

    Communication

    The tasks generated by a partition are intended to execute concurrently but cannot, in general, execute independently.

    The computation to be performed in one task will typically require data associated with another task.

    Data must then be transferred between tasks so as to allow computation to proceed. This information flow is specified in the communication phase of a design.
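
    A minimal sketch of this communication step, assuming a 1-D signal and a 3-point moving-average filter (both hypothetical choices for illustration): the signal is split into chunks, and before computing, each chunk receives one boundary value from each neighbour, often called a halo or ghost cell. The chunks are processed one after another here; in a real parallel program each chunk would run on a different core, and the halo copy would be a message or a shared-memory read.

```python
def split(signal, nchunks):
    """Partition a 1-D signal into nchunks contiguous, non-empty chunks."""
    signal = list(signal)
    n = len(signal)
    if not 1 <= nchunks <= n:
        raise ValueError("need 1 <= nchunks <= len(signal)")
    bounds = [n * i // nchunks for i in range(nchunks + 1)]
    return [signal[bounds[i]:bounds[i + 1]] for i in range(nchunks)]


def exchange_halos(chunks):
    """Communication phase: give each chunk one boundary value from each neighbour.

    At the two ends of the whole signal, the end value is repeated instead.
    """
    padded = []
    for i, chunk in enumerate(chunks):
        left = chunks[i - 1][-1] if i > 0 else chunk[0]
        right = chunks[i + 1][0] if i < len(chunks) - 1 else chunk[-1]
        padded.append([left] + chunk + [right])
    return padded


def smooth_chunk(padded):
    """Local computation: 3-point moving average over the interior points."""
    return [(padded[j - 1] + padded[j] + padded[j + 1]) / 3
            for j in range(1, len(padded) - 1)]


def smooth_chunked(signal, nchunks):
    """Partition, communicate, then compute each chunk (sequentially here)."""
    result = []
    for padded in exchange_halos(split(signal, nchunks)):
        result.extend(smooth_chunk(padded))
    return result


def smooth_serial(signal):
    """Reference result computed on the whole signal at once."""
    signal = list(signal)
    return smooth_chunk([signal[0]] + signal + [signal[-1]])
```

    Without the halo values, the points at each chunk edge would be computed from incomplete data; with them, the chunked result matches the serial one exactly.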

    Agglomeration

    In the third stage, we move from the abstract toward the concrete. We revisit decisions made in the partitioning and communication phases with a view to obtaining an algorithm that will execute efficiently on some class of parallel computer.

    In particular, we consider whether it is useful to combine, or agglomerate, tasks identified by the partitioning phase, so as to provide a smaller number of tasks, each of greater size.

    We also determine whether it is worthwhile to replicate data and/or computation.
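
    A minimal sketch of agglomeration, assuming N equal-cost, fine-grained tasks (for example, one per image row) and P workers: the tasks are grouped into one contiguous block per worker, with block sizes differing by at most one. Contiguous blocks keep neighbouring rows together, so only the block edges need to exchange data. The function names are hypothetical.

```python
def agglomerate(num_tasks, num_workers):
    """Group fine-grained tasks 0..num_tasks-1 into num_workers contiguous blocks.

    Block sizes differ by at most one, keeping the work balanced when all
    tasks cost the same; contiguity means each block borders at most two others.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    base, extra = divmod(num_tasks, num_workers)
    blocks, start = [], 0
    for w in range(num_workers):
        size = base + (1 if w < extra else 0)  # first `extra` blocks get one more
        blocks.append(range(start, start + size))
        start += size
    return blocks


def internal_boundaries(blocks):
    """Number of places where adjacent non-empty blocks must exchange data."""
    return max(0, sum(1 for b in blocks if len(b) > 0) - 1)
```

    With hypothetical sizes of 512 row tasks on 8 workers, the result is 8 blocks of 64 rows, and the number of task boundaries across which neighbouring rows must be exchanged drops from 511 to 7.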

    Mapping

    In the fourth and final stage of the parallel algorithm design process, we specify where each task is to execute.

    This mapping problem does not arise on uniprocessors or on shared-memory computers that provide automatic task scheduling.
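
    The design process above does not prescribe a particular mapping method. As one hedged illustration, the sketch below contrasts a static round-robin (cyclic) mapping with the longest-processing-time-first (LPT) list-scheduling heuristic, which places each task, largest first, on the currently least-loaded processor. The task costs are hypothetical numbers; on shared-memory machines with automatic scheduling, the operating system or runtime makes these decisions instead.

```python
import heapq


def cyclic_mapping(num_tasks, num_procs):
    """Static round-robin mapping: task t runs on processor t % num_procs."""
    return {t: t % num_procs for t in range(num_tasks)}


def greedy_mapping(costs, num_procs):
    """Longest-processing-time-first heuristic for tasks with unequal costs.

    Tasks are taken in decreasing cost order and each is placed on the
    currently least-loaded processor. Returns (mapping, per-processor load).
    """
    heap = [(0, p) for p in range(num_procs)]  # (current load, processor id)
    mapping = {}
    for t in sorted(range(len(costs)), key=lambda t: -costs[t]):
        load, p = heapq.heappop(heap)
        mapping[t] = p
        heapq.heappush(heap, (load + costs[t], p))
    loads = [0] * num_procs
    for t, p in mapping.items():
        loads[p] += costs[t]
    return mapping, loads


if __name__ == "__main__":
    costs = [7, 5, 4, 3, 3, 2]  # hypothetical task costs, e.g. in seconds
    print(cyclic_mapping(len(costs), 2))
    print(greedy_mapping(costs, 2))
```

    With these costs, round-robin gives loads of 14 and 10 units, so the busier processor finishes at 14, whereas LPT balances the loads at 12 and 12.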

    On the server side, multicore processors are ideal because they allow many users to connect to a site simultaneously, each served by an independent thread of execution.

    This allows Web servers and application servers to achieve much better throughput.
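
    A minimal sketch of the thread-per-connection model, using Python's standard `socketserver.ThreadingTCPServer`, which starts a new thread for each client connection. The echo handler and the request strings are hypothetical stand-ins for real application work. In CPython, CPU-bound handlers still contend for the global interpreter lock, so this shows the server structure rather than a multicore throughput gain; I/O-bound handlers do overlap.

```python
import socket
import socketserver
import threading


class EchoHandler(socketserver.StreamRequestHandler):
    """Serves one client connection; ThreadingTCPServer runs each in its own thread."""

    def handle(self):
        line = self.rfile.readline()
        # Prefix the reply with the serving thread's name, then echo the request.
        name = threading.current_thread().name
        self.wfile.write(name.encode() + b": " + line)


def start_server():
    """Start a thread-per-connection server on an OS-chosen port of localhost."""
    server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), EchoHandler)
    server.daemon_threads = True  # don't block interpreter exit on handlers
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def request(port, text):
    """Minimal client: send one line and return the server's one-line reply."""
    with socket.create_connection(("127.0.0.1", port)) as sock:
        sock.sendall(text.encode() + b"\n")
        with sock.makefile("rb") as reply:
            return reply.readline().decode().rstrip("\n")


if __name__ == "__main__":
    srv = start_server()
    port = srv.server_address[1]
    for i in range(3):
        print(request(port, f"scan-{i}"))
    srv.shutdown()
    srv.server_close()
```

    Calling `request` from several client threads at once makes the server handle those connections concurrently, each in its own handler thread.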

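    Typically, proprietary enterprise server software is licensed "per processor".

    In the past, a CPU was a processor and most computers had only one CPU, so there was no ambiguity. With multi-core chips, a vendor must decide whether a "processor" means a core, a die, or a whole chip package.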

    Oracle counts an AMD X2 or Intel dual-core CPU as a single processor, but applies different counts to other processor types, especially those with more than two cores.

    IBM and HP count a multi-chip module as multiple processors. If a multi-chip module counted as one processor, CPU makers would have an incentive to build large, expensive multi-chip modules so that their customers could save on software licensing.

    The industry therefore seems to be slowly moving towards counting each die (see Integrated circuit) as a processor, no matter how many cores the die has.

    An area of processor technology distinct from "mainstream" PCs is that of embedded computing.

    The same technological drivers towards multicore apply here too. Indeed, in many cases the application is a "natural" fit for multicore technologies, if the task can easily be partitioned between the different processors.

    In addition, embedded software is typically developed for a specific hardware release, making issues of software portability, legacy code, or support for independent developers less critical than is the case for PC or enterprise computing.

    As a result, it is easier for developers to adopt new technologies, and there is a greater variety of multicore processing architectures and suppliers.

    In network processing, it is now mainstream for devices to be multi-core, with companies such as Freescale Semiconductor, Cavium Networks, and Broadcom all manufacturing products with eight processors.

    Texas Instruments: three-core TMS320C6488 and four-core TMS320C5441.

    Freescale: four-core MSC8144 (with eight-core successors).

    Stream Processors, Inc.: newer entries include the Storm-1 family, with 40 and 80 general-purpose ALUs per chip, all programmable in C as a SIMD engine.

    Picochip: three hundred processors on a single die, focused on communication applications.

    Commercial Hardware

    SPARC: a multi-core design that also exists in a fault-tolerant version.

    Ageia PhysX: a multi-core physics processing unit.

    Ambric Am2045: a 336-core Massively Parallel Processor Array (MPPA).

    AMD

    Athlon 64, Athlon 64 FX and Athlon 64 X2 family: dual-core desktop processors.

    Opteron: dual-, quad-, and hex-core server/workstation processors.

    Phenom: dual-, triple-, and quad-core desktop processors; dual-core entry-level processors.

    Turion 64 X2: dual-core laptop processors.

    Radeon and FireStream: multi-core GPU/GPGPU (10 cores, 16 5-issue-wide superscalar stream processors per core).

    Analog Devices Blackfin BF561: a symmetrical dual-core processor.

    ARM: a fully synthesizable multicore container for ARM Cortex-A9 MPCore processor cores, intended for high-performance embedded and entertainment applications.

    ModemX: up to 128 cores, wireless applications.

    Azul Systems: Vega 1, a 24-core processor, released in 2005; Vega 2, a 48-core processor, released in 2006; Vega 3, a 54-core processor, released in 2008.

    Broadcom SiByte SB1250, SB1255 and SB1455.

    Cradle Technologies CT3400 and CT3600, both multi-core DSPs.

    Cavium Networks Octeon, a 16-core MIPS MPU.

    Freescale Semiconductor QorIQ series processors, up to 8 cores, Power Architecture MPU.

    Hewlett-Packard PA-8800 and PA-8900, dual-core PA-RISC processors.

    IBM

    POWER4, the world's first non-embedded dual-core processor, released in 2001.

    POWER5, a dual-core processor, released in 2004.

    POWER6, a dual-core processor, released in 2007.

    PowerPC 970MP, a dual-core processor, used in the Apple Power Mac G5.

    Xenon, a triple-core, SMT-capable PowerPC microprocessor used in the Microsoft Xbox 360 game console.

    IBM, Sony, and Toshiba Cell processor, a nine-core processor with one general-purpose PowerPC core and eight specialized SPUs (Synergistic Processing Units) optimized for vector operations, used in the Sony PlayStation 3.

    Infineon Danube, a dual-core, MIPS-based home gateway processor.

    Intel

    Celeron Dual-Core, the first dual-core processor for the budget/entry-level market.

    Core Duo, a dual-core processor.

    Core 2 Duo, a dual-core processor.

    Core 2 Quad, a quad-core processor.

    Core i3, Core i5, Core i7 and Core i9, a family of multicore processors, the successor of the Core 2 Duo and the Core 2 Quad.

    Itanium 2, a dual-core processor.

    Pentium D, two single-core dies packaged in a multi-chip module.

    Pentium Dual-Core, a dual-core processor.

    Teraflops Research Chip (Polaris), a 3.16 GHz, 80-core processor prototype, which the company says will be released within the next five years[8].

    Xeon dual-, quad- and hexa-core processors.

    IntellaSys

    SEAforth 40C18, a 40-core processor [9].

    SEAforth24, a 24-core processor designed by Charles H. Moore.

    Nvidia

    GeForce 9 multi-core GPU (8 cores, 16 scalar stream processors per core).

    GeForce 200 multi-core GPU (10 cores, 24 scalar stream processors per core).

    Tesla multi-core GPGPU (10 cores, 24 scalar stream processors per core).

    Parallax Propeller P8X32, an eight-core microcontroller.

    picoChip PC200 series, 200-300 cores per device for DSP and wireless.

    Plurality HAL series, tightly coupled 16-256 cores, L1 shared memory, hardware-synchronized processor.

    Rapport Kilocore KC256, a 257-core microcontroller with a PowerPC core and 256 8-bit "processing elements". The company is now out of business.

    Raza Microelectronics XLR, an eight-core MIPS MPU.

    SiCortex: a "SiCortex node" has six MIPS64 cores on a single chip.

    Sun Microsystems

    MAJC 5200, a two-core VLIW processor.

    UltraSPARC IV and UltraSPARC IV+, dual-core processors.

    UltraSPARC T1, an eight-core, 32-thread processor.

    UltraSPARC T2, an eight-core, 64-concurrent-thread processor.

    Texas Instruments TMS320C80 MVP, a five-core multimedia video processor.

    Tilera TILE64, a 64-core processor.

    XMOS Software Defined Silicon quad-core XS1-G4.

    Academic

    MIT, 16-core RAW processor.

    University of California, Davis, Asynchronous array of simple processors (AsAP): 36-core 610 MHz AsAP; 167-core 1.2 GHz AsAP2.

    Keywords

    Multicore Association

    Multithreading (computer hardware)

    Multiprocessing

    Hyper-threading

    Symmetric multiprocessing (SMP)

    Simultaneous multithreading (SMT)

    Multitasking

    Parallel computing

    PureMVC MultiCore, a modular programming framework

    XMTC

    Parallel Random Access Machine

    Medical Imaging Applications

    IBM and Mayo Clinic announced their collaboration to explore parallel computer architecture and memory bandwidth for the processing of 3-D medical images.

    Graphics chips that Sony, Toshiba, and IBM made for gaming can be employed to improve health care services.

    Mayo Clinic scientists used IBM Cell processors to align two medical images obtained on different dates and with different imaging devices. With the images aligned, Mayo Clinic radiologists can more easily detect structural changes such as the growth or shrinkage of tumors.

    "This alignment of images both improves the accuracy of interpretation and improves radiologist efficiency, particularly for diseases like cancer," says Mayo radiology researcher Bradley Erickson, M.D., Ph.D., who initially contacted IBM to discuss Mayo's computing needs.

    By porting and optimizing Mayo Clinic's Image Registration Application on the IBM BladeCenter QS20, image registration ran 50 times faster than the application running on a traditional processor configuration.

    This breakthrough inspired University of Iowa (UI) medical imaging researchers to seek a high-end computing facility to accelerate their current NIH-funded research projects.

    At present, no supercomputer or HPC cluster is available to these project investigators.

    This project is driven entirely by UI's end users in medical imaging and informatics. They fall into these two areas and form 5 major user groups.

    Medical imaging application profiles help us to define the system's basic requirements:

    (1) a high-end supercomputer with multiple computing nodes;

    (2) multi-core and graphics accelerator processors to speed up and handle multiple threads;

    (3) a high-performance interconnect;

    (4) capability for graphics computing and data visualization;

    (5) sufficient data storage capacity and the required connection to PACS;

    (6) a multi-core programming environment, selected medical imaging software, parallel libraries or tools, and administration/management suites (accounting, job scheduling and monitoring, etc.);

    (7) strong technical support; and

    (8) support for parallel applications.

    Medical Imaging Registration using CellBE/GPU processors.

    Cell processors and graphics accelerators, initially developed for the game industry, have begun to replace traditional CPUs in some applications.

    This recent trend makes it possible to port a medical imaging application to a Cell- or GPU-based system. For example, Sony, Toshiba, and IBM recently established a joint effort to develop the Cell Broadband Engine (Cell/BE).

    In 2007, IBM and Mayo Clinic performed a linear image registration of 98 sets of medical images using IBM Cell QS20 processors as regular processors.

    They used their own application software, comprising the MRIcro viewer, Mayo Clinic Image (ImageFile), and Mayo Open Source-ITK, to register a moving image to a fixed image on an IBM Cell-based cluster.

    They obtained a 60-fold speedup in the total registration time for the 98 data sets, reducing it from hours to 516 seconds.
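
These figures can be checked with the usual definition of speedup, S = T_serial / T_parallel. A 60-fold speedup that ends at 516 s implies a serial time of about 516 × 60 = 30,960 s, roughly 8.6 hours, which is consistent with "hours". The helpers below are a minimal sketch of that arithmetic. The function names are hypothetical.

```c
#include <assert.h>

/* Speedup S = T_serial / T_parallel (both in the same time unit). */
double speedup(double t_serial, double t_parallel)
{
    assert(t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Serial time implied by a measured parallel time and a reported speedup. */
double implied_serial_time(double t_parallel, double s)
{
    assert(t_parallel >= 0.0 && s > 0.0);
    return t_parallel * s;
}
```

For example, `implied_serial_time(516.0, 60.0)` returns 30960.0 seconds, and dividing by 3600 gives 8.6 hours.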

    Achieving this performance gain required restructuring the entire program: maximizing SPE utilization, minimizing memory traffic, and optimizing the code for the SPE pipeline with SIMD intrinsics.
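
SPE code of this kind is normally written with the SPU intrinsics of the Cell SDK (spu_intrinsics.h), which compile only with the Cell toolchain. As a portable stand-in, the sketch below uses GCC/Clang vector extensions to show the same idea: four single-precision pixels, one 128-bit register, are processed per operation.

The cost function is a sum of squared differences (SSD) between a fixed and a moving image. It is an assumption chosen for illustration; the source does not say which similarity metric the IBM/Mayo code used.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Four packed floats = 128 bits, the width of an SPE register.
 * GCC/Clang vector extension used as a portable stand-in for SPU intrinsics. */
typedef float v4sf __attribute__((vector_size(16)));

/* Sum of squared differences (SSD) between a fixed and a moving image:
 * four pixels per vector step, then a scalar loop for the remainder. */
float ssd_simd(const float *fixed, const float *moving, size_t n)
{
    assert(n == 0 || (fixed != NULL && moving != NULL));
    v4sf acc = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        v4sf a, b;
        /* memcpy makes no alignment assumption; compilers typically lower it to a vector load */
        memcpy(&a, fixed + i, sizeof a);
        memcpy(&b, moving + i, sizeof b);
        v4sf d = a - b;
        acc += d * d; /* element-wise multiply and accumulate in all four lanes */
    }
    float sum = acc[0] + acc[1] + acc[2] + acc[3]; /* horizontal reduction */
    for (; i < n; ++i) { /* scalar tail when n is not a multiple of 4 */
        float d = fixed[i] - moving[i];
        sum += d * d;
    }
    return sum;
}
```

On an actual SPE, the operands would first have to be brought into its 256 KB local store by DMA. That is why minimizing memory traffic matters as much as vectorizing the arithmetic.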

    (IBM Research Report RC24138, 2007; Ohara et al., 2007a,b; Gong et al., 2008).

    The collaboration between IBM and the Mayo Clinic made it possible to register medical images up to 50 times faster. It also made it possible to provide critical diagnostic information, such as the growth or shrinkage of tumors, in seconds instead of hours.

    With the IBM Cell/BE cluster, they are tackling several clinically promising projects. These include maximum-resolution organ imaging, image-guided tumor ablation, and automated change detection and analysis.

    This successful study is encouraging for many of our University of Iowa (UI) users who need high-performance registration.

    It inspires us to conduct a preliminary study of how to develop efficient parallel algorithms and data decompositions. These would exploit the Cell's intrinsic architecture (PPE and SPEs) for multithreaded data fetching.
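
A minimal sketch of such a PPE/SPE data decomposition is shown below. POSIX threads stand in for SPE threads; the real Cell SDK would use libspe2 contexts and DMA transfers, which are not shown.

The "PPE" side splits the image into contiguous slabs. Each worker fetches its slab chunk by chunk into a small local buffer, mimicking DMA into local store, and computes a partial SSD. The main thread then reduces the partial sums. The chunk size, the worker limit, and the SSD metric are hypothetical choices.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define MAX_WORKERS 8  /* the Cell/BE has eight SPEs */
#define CHUNK 1024     /* pixels per simulated DMA transfer (hypothetical size) */

/* One slab of the image assigned to one worker ("SPE"). */
typedef struct {
    const float *fixed, *moving;
    size_t begin, end; /* pixel range [begin, end) */
    double partial;    /* partial SSD handed back for the reduction */
} slab_task;

static void *slab_worker(void *arg)
{
    slab_task *t = arg;
    float lf[CHUNK], lm[CHUNK]; /* stand-in for local-store buffers */
    double sum = 0.0;
    for (size_t i = t->begin; i < t->end; i += CHUNK) {
        size_t len = (t->end - i < CHUNK) ? t->end - i : CHUNK;
        /* On the Cell, these copies would be DMA transfers into local store. */
        memcpy(lf, t->fixed + i, len * sizeof(float));
        memcpy(lm, t->moving + i, len * sizeof(float));
        for (size_t j = 0; j < len; ++j) {
            double d = (double)lf[j] - (double)lm[j];
            sum += d * d;
        }
    }
    t->partial = sum;
    return NULL;
}

/* "PPE" side: split n pixels into contiguous slabs, run one worker per slab,
 * then reduce the partial sums. */
double ssd_parallel(const float *fixed, const float *moving, size_t n, int workers)
{
    assert(workers >= 1 && workers <= MAX_WORKERS);
    pthread_t tid[MAX_WORKERS];
    int started[MAX_WORKERS];
    slab_task task[MAX_WORKERS];
    size_t base = n / (size_t)workers, extra = n % (size_t)workers, start = 0;

    for (int w = 0; w < workers; ++w) {
        size_t len = base + ((size_t)w < extra ? 1 : 0);
        task[w] = (slab_task){fixed, moving, start, start + len, 0.0};
        start += len;
        started[w] = pthread_create(&tid[w], NULL, slab_worker, &task[w]) == 0;
        if (!started[w])
            slab_worker(&task[w]); /* fall back to running the slab inline */
    }

    double total = 0.0;
    for (int w = 0; w < workers; ++w) {
        if (started[w])
            pthread_join(tid[w], NULL);
        total += task[w].partial;
    }
    return total;
}
```

Calling `ssd_parallel(fixed, moving, 512 * 512, 6)` would mirror the six SPEs available to applications on a PS3 Cell. Double buffering, which fetches chunk k+1 while chunk k is being computed, is the usual next step on the Cell and is omitted here for brevity.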

    Alternatively, GPU processors can be used to handle image registration.

    For example, Samant et al. (2008) compared traditional CPU-based and GPU-based deformable image registration (DIR) for adaptive radiotherapy. They concluded that GPU registration is about 50 times faster than a single-threaded CPU implementation and 30 times faster than a multithreaded CPU implementation.

    Yang (2009) described a robust and accurate 2D/3D image registration algorithm. The 2D version of the algorithm was implemented on the IBM Cell/B.E., where Yang achieved about a 10-fold speedup. This allows the nonlinear registration of a pair of 192 × 192 images to complete in less than five seconds.

    Whether the Cell or the GPU is the better solution is unclear. It depends on the cluster's internal architecture and on multi-core programming skills.

    For this application, we will take a three-step strategy. First, we will implement our existing parallel registration codes on CPU-based processors so that we have a baseline solution.

    Then, we will deploy our registration programs on Cell/BE or GPU processors for comparison.

    Finally, we will conduct our parallel production registrations on the system for the NIH projects. The experience, parallel algorithms, implementation procedures, and tips, as well as the software programs, will be open for public use.

    Medical Imaging Reconstruction on Cell/B.E. processors

    Image reconstruction is one of our technical tasks. Computational acceleration on graphics processing units (GPUs) can make advanced magnetic resonance imaging (MRI) reconstruction algorithms attractive in clinical settings. This in turn improves the quality of MR images across a broad spectrum of applications.

    Stone et al. (2008) presented their acceleration algorithm on a single NVIDIA Quadro FX 5600.

    The reconstruction of a 3D image with 128³ voxels achieves up to 180 GFLOPS and requires just over one minute on the Quadro. Reconstruction on a quad-core CPU is twenty-one times slower.

    Cell processor technology offers a cost-effective, high-performance platform for medical image reconstruction and imaging.

    A research group at the Institute of Medical Physics in Erlangen, Germany, worked with Mercury Computer Systems in Germany on many test cases of CT image reconstruction of 512³ volumes on the Cell/BE. They achieved sufficient computing performance for high image quality.

    (Knaup and Kachelrieß, 2007; Kachelrieß et al., 2007a,b; Knaup et al., 2007b; Knaup et al., 2007).

    They systematically compared the performance of CT reconstruction on GPUs, field-programmable gate arrays (FPGAs), and Cell processors.

    Their recent study shows that cone-beam backprojection of 512 projections into a 512³ volume took 3.2 min on a PC but only 13.6 s on the Cell. The Cell thereby greatly outperforms today's top-notch GPU-based backprojections.

    Using both CBEs of the dual Cell-based blade provided by Mercury Computer Systems, 330 two-dimensional projection images per second can be backprojected, and the 3D cone-beam backprojection can be completed in 6.8 s (Kachelrieß, 2007).
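
Backprojection is the kernel behind these timings. The sketch below is a voxel-driven 2D parallel-beam backprojector with linear interpolation. This is a simplifying assumption, since the cited work uses 3D cone-beam geometry, which adds a second detector coordinate and distance weighting.

The per-angle cosine and sine tables (`cos_t`, `sin_t`) are precomputed by the caller, so the inner loop contains only multiply-adds. Ramp filtering (for FBP) and the usual π/N_angles scale factor are left out. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Voxel-driven 2D parallel-beam backprojection (unfiltered; for FBP the
 * sinogram rows would be ramp-filtered first, and the result scaled).
 * sino:  nang x ndet sinogram, row-major (one row per projection angle)
 * cos_t, sin_t: per-angle trig tables precomputed by the caller
 * img:   n x n output image, row-major */
void backproject_2d(const float *sino, int nang, int ndet,
                    const float *cos_t, const float *sin_t,
                    float *img, int n)
{
    assert(nang > 0 && ndet > 1 && n > 0);
    const float c_img = 0.5f * (float)(n - 1);    /* image centre */
    const float c_det = 0.5f * (float)(ndet - 1); /* detector centre */
    for (int row = 0; row < n; ++row) {
        for (int col = 0; col < n; ++col) {
            float x = (float)col - c_img, y = (float)row - c_img;
            float acc = 0.0f;
            for (int a = 0; a < nang; ++a) {
                /* detector coordinate of the ray through this pixel */
                float u = x * cos_t[a] + y * sin_t[a] + c_det;
                if (u < 0.0f || u > (float)(ndet - 1))
                    continue;                      /* ray misses the detector */
                int i0 = (int)u;                   /* u >= 0, so truncation == floor */
                if (i0 == ndet - 1)
                    i0 = ndet - 2;                 /* keep i0 + 1 in range */
                float w = u - (float)i0;           /* linear interpolation weight */
                const float *p = sino + (size_t)a * ndet;
                acc += (1.0f - w) * p[i0] + w * p[i0 + 1];
            }
            img[(size_t)row * n + col] = acc;
        }
    }
}
```

Every pixel is computed independently, so rows (or voxel slabs in 3D) can be distributed across SPEs or GPU threads without synchronization. For scale, and assuming the same 512-projection task, the times reported above correspond to speedups over the PC of 192 s / 13.6 s ≈ 14 for one Cell and 192 s / 6.8 s ≈ 28 for the dual-Cell blade.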

    We will deploy many image reconstruction algorithms on these systems. They range from 2D to 3D, from parallel-beam and fan-beam to spiral cone-beam geometries, and from FBP to EM. We would like to compare the results with the ones we obtained before.
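
The EM end of that range is illustrated below with a minimal sketch of one maximum-likelihood expectation-maximization (ML-EM) iteration. It assumes a small dense system matrix A with entries a_ij (ray i, voxel j) and measured counts y_i.

Each iteration applies x_j ← (x_j / Σ_i a_ij) · Σ_i a_ij · y_i / (A x)_i. In other words, it performs a forward projection followed by a backprojection of the measured-to-estimated ratio. Practical implementations typically compute A on the fly or store it sparsely; the dense matrix and the function name are assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* One ML-EM iteration with a dense system matrix.
 * a:    m x n, row-major (row i = ray/detector bin, column j = voxel)
 * y:    m measured counts
 * x:    n current image estimate, updated in place
 * proj: scratch buffer of length m */
void mlem_iteration(const double *a, const double *y, double *x,
                    double *proj, size_t m, size_t n)
{
    assert(m > 0 && n > 0);
    /* forward projection: proj = A x */
    for (size_t i = 0; i < m; ++i) {
        double s = 0.0;
        for (size_t j = 0; j < n; ++j)
            s += a[i * n + j] * x[j];
        proj[i] = s;
    }
    /* backproject measured/estimated ratios, normalise by voxel sensitivity */
    for (size_t j = 0; j < n; ++j) {
        double num = 0.0, sens = 0.0;
        for (size_t i = 0; i < m; ++i) {
            double aij = a[i * n + j];
            sens += aij;
            if (proj[i] > 0.0)
                num += aij * y[i] / proj[i];
        }
        if (sens > 0.0)
            x[j] *= num / sens;
    }
}
```

The forward-projection loop is independent across rays and the update loop is independent across voxels. EM therefore parallelizes well on multi-core hardware, like FBP; the price is many iterations, each costing roughly two projections.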

    We have begun to program our algorithms on the Cell processors to benchmark their performance. The best option will be recommended to our major users for production.

    Medical Image Segmentation on GPU/Cell/B.E. processors

    As discussed, besides image registration, image segmentation is one of the applications most frequently used by our major users.

    Medical image segmentation using high-end computing technology is still at a very early stage, although it is extremely important to clinical practice.

    Baggia et al. (2007) presented a performance comparison of image segmentation across different multi-core architectures, namely Cell, GPU, and SIMD.

    For a 256 × 256 image on single processors, their results show the following timings:
    3.3 ms on a CPU;
    2.4 ms on a CPU with SIMD;
    1.0 ms on a GPU (8 pixel shaders);
    0.87 ms on a PS3 Cell (6 SPEs);
    0.4 ms on another GPU (32 pixel shaders).
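
The source does not say which segmentation algorithm Baggia et al. timed. As a hypothetical stand-in, the sketch below shows a kernel of the same data-parallel shape.

Otsu's method picks a global threshold from the intensity histogram. The thresholding pass then labels each pixel independently, so it can be split across SIMD lanes, pixel shaders, or SPEs without communication. Only the histogram step needs a reduction across workers. Relative to the plain CPU, the reported times correspond to speedups of about 1.4 (SIMD), 3.3 (8-shader GPU), 3.8 (Cell), and 8.3 (32-shader GPU).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Otsu's method: pick the threshold that maximizes the between-class variance
 * of the 8-bit intensity histogram. Pixels with value >= the returned
 * threshold are foreground. */
uint8_t otsu_threshold(const uint8_t *img, size_t n)
{
    size_t hist[256] = {0};
    for (size_t i = 0; i < n; ++i)
        hist[img[i]]++; /* needs a reduction across workers if parallelized */

    double sum_all = 0.0;
    for (int v = 0; v < 256; ++v)
        sum_all += (double)v * (double)hist[v];

    double w0 = 0.0, sum0 = 0.0, best = -1.0;
    int best_t = 0;
    for (int t = 0; t < 255; ++t) { /* class 0 = values <= t */
        w0 += (double)hist[t];
        sum0 += (double)t * (double)hist[t];
        double w1 = (double)n - w0;
        if (w0 == 0.0 || w1 == 0.0)
            continue;
        double m0 = sum0 / w0, m1 = (sum_all - sum0) / w1;
        double between = w0 * w1 * (m0 - m1) * (m0 - m1); /* proportional to variance */
        if (between > best) {
            best = between;
            best_t = t;
        }
    }
    return (uint8_t)(best_t + 1);
}

/* Per-pixel labelling: each output depends only on its own input pixel.
 * Returns the number of foreground pixels. */
size_t threshold_segment(const uint8_t *img, uint8_t *mask, size_t n, uint8_t t)
{
    assert(n == 0 || (img != NULL && mask != NULL));
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        mask[i] = (img[i] >= t) ? 255 : 0;
        count += (mask[i] != 0);
    }
    return count;
}
```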

    Again, effective and fundamental parallelism in segmentation is the key to opening the door to advanced computation on Cell or GPU processors. The major concern for the Cell is the programming effort.

    It has been suggested that implementing the code with NVIDIA's CUDA API on a GPU cluster may yield a ninefold speedup over CPU processors.

    They strongly suggested that GPU-based image libraries should become available for GPU and Cell processors in the future. We have already started working on parallel algorithms for segmentation and running them on the NCSA TeraGrid clusters.

    We will migrate our segmentation program to the proposed system. We will use the CPU processors first, and then test the program on the Cell processors with multi-core programming.

    Bioinformatics on Cell/B.E. processors

    Recently, Sachdeva et al. (2008) systematically evaluated the performance of three popular bioinformatics applications (FASTA, ClustalW, and HMMer) on the Cell/B.E. Their preliminary results show that a Cell-based cluster is a promising, power-efficient platform for future bioinformatics.
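
Sequence-alignment tools such as these are built on dynamic programming; the FASTA package, for instance, includes a full Smith-Waterman search. A minimal sketch of the Smith-Waterman local alignment score with a linear gap penalty follows.

The sketch keeps only two rows of the DP matrix, so the working set stays small, a property that matters on the SPE's 256 KB local store. The scoring scheme (match, mismatch, gap) is a hypothetical simplification of the substitution matrices and affine gaps that real tools use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Smith-Waterman local alignment score with a linear gap penalty.
 * match/mismatch are added to the score (pass a negative mismatch);
 * gap is subtracted per gap position. Only two DP rows are kept, so
 * memory is O(strlen(t)). Returns -1 on allocation failure. */
int sw_score(const char *s, const char *t, int match, int mismatch, int gap)
{
    assert(s != NULL && t != NULL);
    size_t n = strlen(s), m = strlen(t);
    int *prev = calloc(m + 1, sizeof *prev); /* row i-1, starts as all zeros */
    int *cur = calloc(m + 1, sizeof *cur);   /* row i */
    if (prev == NULL || cur == NULL) {
        free(prev);
        free(cur);
        return -1;
    }
    int best = 0;
    for (size_t i = 1; i <= n; ++i) {
        cur[0] = 0;
        for (size_t j = 1; j <= m; ++j) {
            int diag = prev[j - 1] + (s[i - 1] == t[j - 1] ? match : mismatch);
            int up = prev[j] - gap;      /* gap in t */
            int left = cur[j - 1] - gap; /* gap in s */
            int h = 0;                   /* local alignment: never below zero */
            if (diag > h) h = diag;
            if (up > h) h = up;
            if (left > h) h = left;
            cur[j] = h;
            if (h > best) best = h;
        }
        int *tmp = prev; /* row i becomes the previous row */
        prev = cur;
        cur = tmp;
    }
    free(prev);
    free(cur);
    return best;
}
```

For example, `sw_score("TTACGTT", "ACG", 2, -1, 1)` finds the exact local match and returns 6. Cells on the same anti-diagonal of the DP matrix are independent of one another, which is one common way such kernels are vectorized or spread across cores.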

    Zola et al. (2009) constructed gene regulatory networks by exploiting parallelism at three levels: across multiple Cells, across the multiple cores within each Cell, and across the vector units within those cores. This yielded a high-performance implementation.

    They presented experimental results comparing the Cell implementation with a standard uniprocessor implementation and with an implementation on a conventional supercomputer. They concluded that a Cell cluster outperforms the Blue Gene/L system. The computation time with 64 SPE cores on the Cell cluster equals that with 128 PPC440 cores on BG/L, a factor-of-two performance gain per core.

    Martin (2008), at the University of Aarhus, Denmark, evaluated the applicability of the Cell processor to phylogenetics and other computationally intensive problems.

    Martin concluded that the Cell processor has impressive performance and is an interesting alternative to mainstream processors such as x86.

    However, the Cell architecture makes software development hard and time-consuming compared to mainstream architectures. Libraries and compilers that could make software development easier are under development.

    Development tools such as the Cell SDK, debuggers, and optimization tools exist for Cell software development. Even so, extensive knowledge of the Cell architecture is strongly needed.

    Martin concluded that the Cell should be seen as a hybrid between an x86 processor and a GPU. It is therefore suited to problems that require the properties of both architectures.

    If a suitable problem can be found and there is plenty of time for software development, the Cell processor is worth considering; otherwise, x86 processors are a better choice.

    GPUs become more useful as they gain support for branching and wider floating-point calculations.

    Regarding technical feasibility, applicability, and cost/performance, Buehrer and Parthasarathy (2007) conducted an NSF project to study the potential of the Cell/BE for data mining. They report that Cell processors are, in general, up to 34 times more efficient than competing technologies.

    However, for the major data mining algorithms, their preliminary investigation indicated that Cell technology is not quite ready for end-user applications, although it has great potential.

    Therefore, for this application, we will use only CPUs, while keeping an eye open for new solutions.

    We believe that the genetic linkage codes will soon be portable to GPU and Cell processors.

    Fast Fourier Transform and Discrete Wavelet Transform on Cell/BE processors

    Transform on Cell