redp4477

  • 7/29/2019 redp4477

    ibm.com/redbooks

    Redpaper

    Front cover

    Roadrunner: Hardware and Software Overview

    Dr. Andrew Komornicki

    Gary Mullen-Schulz

    Deb Landon

    Review components that comprise the Roadrunner supercomputer

    Understand Roadrunner hardware components

    Learn about Roadrunner system software


    International Technical Support Organization

    Roadrunner: Hardware and Software Overview

    January 2009

    REDP-4477-00


    Copyright International Business Machines Corporation 2009. All rights reserved.

    Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule

    Contract with IBM Corp.

    First Edition (January 2009)

    This edition applies to the Roadrunner computing system.

    Note: Before using this information and the product it supports, read the information in Notices on page v.


    Copyright IBM Corp. 2009. All rights reserved.

    Contents

    Notices
      Trademarks

    Preface
      The team that wrote this paper
      Become a published author
      Comments welcome

    Chapter 1. Roadrunner hardware overview
      1.1 What Roadrunner is
        1.1.1 A historical perspective
      1.2 Roadrunner hardware components
        1.2.1 TriBlade: a unique concept
        1.2.2 IBM BladeCenter QS22
        1.2.3 IBM BladeCenter LS21
      1.3 Rack configurations
        1.3.1 Compute node rack
        1.3.2 Compute node and I/O rack
        1.3.3 Switch and service rack
      1.4 The Connected Unit
      1.5 Networks
        1.5.1 Networks within a Connected Unit cluster
        1.5.2 Networks between Connected Unit clusters

    Chapter 2. Roadrunner software overview
      2.1 Roadrunner components
        2.1.1 Compute node (TriBlade)
        2.1.2 I/O node
        2.1.3 Service node
        2.1.4 Master (management) node
      2.2 Cluster boot sequence
        2.2.1 Boot scenarios
      2.3 xCAT
      2.4 How applications are written and executed
        2.4.1 Application core
        2.4.2 Offloading logic

    Appendix A. The Cell Broadband Engine (Cell/B.E.) processor
      Background
      The processor elements
      The Element Interconnect Bus
      Memory Flow Controller

    Glossary

    Abbreviations and acronyms

    Related publications
      IBM Redbooks
      Other publications
      Online resources
      How to get Redbooks


    Notices

    This information was developed for products and services offered in the U.S.A.

    IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

    IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.

    The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

    This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

    Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.

    IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

    Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

    This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

    COPYRIGHT LICENSE:

    This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.


    Trademarks

    IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. These and other IBM trademarked terms are marked on their first occurrence in this information with the appropriate symbol (® or ™), indicating US registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml

    The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both:

    AS/400

    BladeCenter

    Blue Gene/L

    Blue Gene

    Domino

    GPFS

    IBM PowerXCell

    IBM

    iSeries

    PartnerWorld

    Power Architecture

    POWER3

    POWER5

    PowerPC

    Redbooks

    Redbooks (logo)

    RS/6000

    System i

    WebSphere

    The following terms are trademarks of other companies:

    AMD, AMD Opteron, HyperTransport, the AMD Arrow logo, and combinations thereof, are trademarks of Advanced Micro Devices, Inc.

    InfiniBand, and the InfiniBand design marks are trademarks and/or service marks of the InfiniBand Trade Association.

    Cell Broadband Engine and Cell/B.E. are trademarks of Sony Computer Entertainment, Inc., in the United States, other countries, or both and is used under license therefrom.

    Java, Sun, and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.

    Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

    Intel Pentium, Intel, Pentium, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States, other countries, or both.

    UNIX is a registered trademark of The Open Group in the United States and other countries.

    Linux is a trademark of Linus Torvalds in the United States, other countries, or both.

    Other company, product, or service names may be trademarks or service marks of others.


    Preface

    This IBM Redpaper publication provides an overview of the hardware and software components that constitute a Roadrunner system. This includes the chips, cards, and other parts that make up a Roadrunner connected unit, as well as the peripheral systems required to run applications. It also includes a brief description of the software used to manage and run the system.

    The team that wrote this paper

    This publication was produced by a team of IBM specialists working in collaboration with the International Technical Support Organization (ITSO), Rochester Center.

    Dr. Andrew Komornicki is an accomplished computational scientist with many years of experience. Prior to joining IBM, his career included independent research, scientific management, government service, as well as work in the computer industry. During the 1990s, he spent two years as a rotator at the National Science Foundation as a program director, where he co-managed the program in computational chemistry. As a computational scientist, he also spent four years as the chair of the allocation committee at the San Diego Supercomputer Center. He has consulted extensively in both the computer and chemical industry. Upon his return from Washington, he spent several years at Sun Microsystems, where he worked as a business development executive tasked with the development of vertical markets in the chemistry and pharmaceutical markets. Three years ago, he joined the Advanced Technical Support group at IBM in the role of supporting scientific computing in the High Performance Computing (HPC) arena. His duties have included support of large scale procurements, benchmarks, and some software contributions.

    Gary Mullen-Schulz is a Consulting IT Specialist at the ITSO, Rochester Center. He leads the team responsible for producing Roadrunner documentation, and was the primary author of IBM System Blue Gene Solution: Application Development, SG24-7179. Gary also focuses on Java and WebSphere. He is a Sun Certified Java Programmer, Developer and Architect, and has three issued patents.

    Deb Landon is an IBM Certified Senior IT Specialist in the IBM ITSO, Rochester Center. Debbie has been with IBM for 25 years, working first with the S/36 and then the AS/400, which has since evolved to the IBM System i platform. Before joining the ITSO in November of 2000, Debbie was a member of the PartnerWorld for Developers iSeries team, supporting IBM Business Partners in the area of Domino for iSeries.

    Thanks to the following people for their contributions to this project:

    Bill Brandmeyer
    Mike Brutman
    Chris Engel
    Susan Lee
    Dave Limpert
    Camille Mann
    Andrew Schram
    IBM Rochester

    Prashant Manikal
    Cornell Wright
    IBM Austin

    Debbie Landon
    Wade Wallace
    International Technical Support Organization, Rochester Center

    Become a published author

    Join us for a two- to six-week residency program! Help write a book dealing with specific products or solutions, while getting hands-on experience with leading-edge technologies. You will have the opportunity to team with IBM technical professionals, Business Partners, and Clients.

    Your efforts will help increase product acceptance and customer satisfaction. As a bonus, you will develop a network of contacts in IBM development labs, and increase your productivity and marketability.

    Learn more about the residency program, browse the residency index, and apply online at:

    ibm.com/redbooks/residencies.html

    Comments welcome

    Your comments are important to us!

    We want our papers to be as helpful as possible. Send us your comments about this paper or

    other IBM Redbooks in one of the following ways:

    Use the online "Contact us" review Redbooks form found at:

    ibm.com/redbooks

    Send your comments in an e-mail to:

    [email protected]

    Mail your comments to:

    IBM Corporation, International Technical Support Organization
    Dept. HYTD Mail Station P099
    2455 South Road
    Poughkeepsie, NY 12601-5400


    Chapter 1. Roadrunner hardware overview

    This chapter describes the hardware components that make up the Roadrunner system. Specifically, this chapter examines the various components that make up a Connected Unit (CU) and then discusses how the CUs are tied together to create a complete Roadrunner cluster.

    Note: This IBM Redpaper publication is not intended to be a detailed analysis, but rather a big picture discussion meant to acquaint the reader with the Roadrunner system.


    1.1 What Roadrunner is

    Roadrunner is the first general purpose computer system to reach the petaflop milestone. On June 10, 2008, IBM announced that this supercomputer had sustained a record-breaking petaflop, or 10^15 floating point operations per second, as measured by the Linpack benchmark. As a result of this achievement, Roadrunner became the world's fastest supercomputer.

    Roadrunner was designed, manufactured, and tested at the IBM facility in Rochester, Minnesota. The actual initial petaflop run was done in Poughkeepsie, New York. Its final destination is the Los Alamos National Laboratory (LANL) in New Mexico, which will use this system for a variety of scientific efforts. Most notably, Roadrunner is the latest tool used by the National Nuclear Security Administration (NNSA) to ensure the safety and reliability of the US nuclear weapons stockpile.

    This computer system has a number of unique characteristics. The most notable are its sheer size and the fact that it is the first modern heterogeneous system of its kind. As a petascale design, the Roadrunner system has the fewest compute nodes and the fewest cores of any of the outstanding designs considered to date. In a nutshell, the attributes of this system can be summarized with the following characteristics:

    Roadrunner is a cluster of clusters.

    The fundamental building block of the Roadrunner system is a Connected Unit (CU). As originally designed, Roadrunner would have 18 such connected units, of which 17 have been delivered to LANL for the final system configuration. Roadrunner is made up of approximately 6500 AMD dual-core processors coupled with 12,240 Cell Broadband Engine (Cell/B.E.) processors. The total peak (theoretical) performance of this hybrid system is in excess of 1.3 petaflops. The memory on this system consists of a total of 98 TB equally distributed between the Opteron and the Cell/B.E. nodes.

    Each CU is made up of 180 compute nodes and 12 I/O nodes. A unique aspect of the Roadrunner design is the creation of a TriBlade as a fundamental building block for the CU. Each TriBlade consists of an AMD Opteron blade and two Cell/B.E. IBM BladeCenter QS22 blades. The Opteron blade contains two dual-core processors, while the Cell/B.E. blades each contain two new Cell/B.E. eDP (double precision) processors. This architecture allows for a one-to-one mapping of Opteron cores to Cell/B.E. processors. As discussed in 1.2.1, "TriBlade: a unique concept", this design architecture creates a master-subordinate relationship between the Opterons and the Cell/B.E. processors. Each Opteron core is connected to a Cell/B.E. chip through a dedicated PCIe link. Communication between Opteron nodes is accomplished through an extensive InfiniBand network.

    Fedora Linux is the operating system of choice for this system.

    System management of this cluster of clusters is accomplished with the xCAT cluster management software tools.
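The totals quoted in the list above can be cross-checked with a little arithmetic. The following Python sketch is illustrative only and is not part of any Roadrunner software. The constant names are invented for this example, and the figure of two dual-core Opteron processors per I/O node is an assumption that this paper does not state.

```python
# Back-of-the-envelope check of the system totals quoted above.
# All names are illustrative; they do not come from any Roadrunner tool.

CONNECTED_UNITS = 17              # CUs delivered to LANL
COMPUTE_NODES_PER_CU = 180        # each compute node is one TriBlade
IO_NODES_PER_CU = 12

OPTERONS_PER_TRIBLADE = 2         # one Opteron blade, two dual-core processors
CELLS_PER_TRIBLADE = 4            # two QS22 blades, two Cell/B.E. processors each
OPTERON_GB_PER_TRIBLADE = 16      # figure from section 1.2.1
CELL_GB_PER_TRIBLADE = 16         # figure from section 1.2.1

# ASSUMPTION (not stated in this paper): two dual-core Opterons per I/O node.
OPTERONS_PER_IO_NODE = 2

triblades = CONNECTED_UNITS * COMPUTE_NODES_PER_CU
cell_processors = triblades * CELLS_PER_TRIBLADE
opteron_processors = (
    triblades * OPTERONS_PER_TRIBLADE
    + CONNECTED_UNITS * IO_NODES_PER_CU * OPTERONS_PER_IO_NODE
)
memory_tb = triblades * (OPTERON_GB_PER_TRIBLADE + CELL_GB_PER_TRIBLADE) / 1000

print(f"TriBlades:            {triblades}")
print(f"Cell/B.E. processors: {cell_processors}")
print(f"Opteron processors:   {opteron_processors}")
print(f"Compute node memory:  {memory_tb:.1f} TB")
```

Under these assumptions the sketch reproduces the 12,240 Cell/B.E. processors exactly and lands close to the rounded figures of 6500 Opteron processors and 98 TB of memory.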

    It is worthwhile to note some of the physical characteristics of this system. The entire system consists of 278 racks that occupy approximately 5000 square feet of floor space. The weight of this system is approximately 500,000 pounds, or 250 tons. The networking required for both the compute and management tasks consists of 55 miles of InfiniBand (IB) cables. Lastly, even though the system consumes 2.4 MW of power, it is very energy efficient, delivering almost 437 megaflops per watt.

    Roadrunner holds a unique position in the history of scientific computing. It was over ten years ago that the first teraflop (10^12 floating point operations per second) computer was built. In 1997, a computer consisting of 7000+ Intel Pentium II processors sustained a teraflop on the Linpack benchmark. Roadrunner in 2008 has demonstrated a thousandfold increase in sustained compute performance.

    1.1.1 A historical perspective

    Machines of Roadrunner's size and capability are the direct result of the scientific needs of the weapons-physics communities. In October 1992, the United States (U.S.) entered a nuclear testing moratorium that banned all nuclear testing above and below ground. Prior to this moratorium, the US nuclear weapons stockpile was maintained through a combination of underground nuclear testing and the development of new weapons systems. When theory and experiment were combined, the Department of Energy could rely on much simpler models than those needed today. Without nuclear testing, weapons scientists must rely much more heavily on sophisticated hardware and software to simulate the complex aging process of both weapons systems and their components.

    Established in 1995, the Advanced Simulation and Computing Program (ASC) is an integral part of the Department of Energy's National Nuclear Security Administration (NNSA) shift in emphasis from test-based to simulation-based programs. Under the ASC, computer simulation capabilities are continually developed to analyze and predict the performance, safety, and reliability of nuclear weapons and to certify their functionality. All of this work is integrated into the three weapons laboratories:

    Los Alamos National Laboratory (LANL)

    Lawrence Livermore National Laboratory (LLNL)

    Sandia National Laboratories (SNL)

    The predecessor of the ASC was the Accelerated Strategic Computing Initiative (known as the ASCI program), established in direct response to the National Defense Authorization Act of 1994, which required the Department of Energy, in the absence of nuclear testing, to:

    Support a focused multifaceted program to increase the understanding of the existing nuclear stockpile.

    Predict, detect, and evaluate potential problems associated with the aging of the nuclear stockpile.

    Maintain the science and engineering institutions needed to support the national nuclear deterrent, now and in the future.

    In response to this mandate, the ASCI program set the following objectives in order to meet the needs and requirements of the Stockpile Stewardship program. These were enumerated to include performance, safety, reliability, and renewal, and were articulated in the ASCI program plan, published by the Department of Energy Defense Programs in January 2000:

    Create predictive simulations of nuclear weapon systems to analyze behavior and assess performance in an environment without nuclear testing.

    Predict with high certainty the behavior of full weapon systems in complex accident scenarios.

    Achieve sufficient, validated predictive simulations to extend the lifetime of the stockpile, predict failure mechanisms, and reduce routine maintenance.

    Use virtual prototyping and modeling to understand how new production processes and materials affect performance, safety, reliability, and aging. This understanding helps define the right configuration of production and testing facilities necessary for managing the stockpile throughout the next several decades.

    Note: The name Roadrunner was chosen by Los Alamos National Laboratory and is not a product name of the IBM Corporation. This supercomputer was designed and developed for the Department of Energy and Los Alamos National Laboratory under the project name Roadrunner. The project was named after the state bird of New Mexico.

    Throughout the history of this program, the IBM Corporation has been a key partner of the Department of Energy's National Nuclear Security Administration (NNSA) program. Here are several historical examples:

    In 1998, IBM delivered the ASCI Blue Pacific system, which consisted of 5,856 PowerPC 604e microprocessors. The theoretical peak performance of this system was 3.8 teraflops.

    In 2000, IBM delivered the ASCI White system. This computer system was based on the IBM RS/6000 computer, which contained IBM POWER3 nodes running at 375 MHz. This cluster consisted of 512 nodes, each of which had 16 processors for a total of 8,192 processors. The power requirements for this machine consisted of 3 MW for the computer and an additional 3 MW required for cooling. The theoretical peak processing power was 12.3 teraflops, with a Linpack performance of 7.2 teraflops.

    In 2005, IBM delivered and installed the ASC Purple system at Lawrence Livermore National Laboratory. This system was a 100 teraflop machine and was the successful realization of a goal set a decade earlier (1996) to deliver a 100 teraflop machine within the 2004 to 2005 time frame.

    ASC Purple is based on the symmetric shared memory IBM POWER5 architecture. The combined system contains approximately 12,500 POWER5 processors and requires 7.5 MW of electrical power for both the computer and cooling equipment.

    Another machine in the ASC program is the IBM System Blue Gene/L machine delivered by IBM to Lawrence Livermore National Laboratory. The Blue Gene architecture is unique in that it allows for a very dense packing of compute nodes. A single Blue Gene rack contains 1024 nodes. On March 24, 2005, the US Department of Energy announced that the Blue Gene/L installation at Lawrence Livermore National Laboratory had achieved a speed of 135 teraflops on a system consisting of 32 racks. On October 27, 2005, Lawrence Livermore National Laboratory and IBM announced that Blue Gene/L had produced a Linpack benchmark that exceeded 280 teraflops. This system consisted of 65,536 compute nodes housed in 64 Blue Gene racks.

    As with each of the systems described above, the Roadrunner project is a partnership with IBM. The original contract was signed in September 2006 and projected for three phases. In phase 1, a base system was delivered consisting of Opteron nodes. A hybrid node prototype system was projected for phase 2. The delivery of a hybrid final system, one that would achieve a sustained petaflop in Linpack performance, was projected for phase 3.

    For more information, refer to the Advanced Simulation and Computing Web site at:

    http://www.sandia.gov/NNSA/ASC/about.html

    Note: At the time these goals were set, computers were still at the gigaflop level and

    were still two years away from the realization of the first teraflop machine.


    1.2 Roadrunner hardware components

    A simple way to describe the Roadrunner system is that it is a heterogeneous cluster of clusters, each of which is accelerated by Cell/B.E. processors. The unique feature of this design is that the Cell/B.E. processors are attached to individual compute nodes as accelerators, rather than forming a simple cluster of Cell/B.E. processors. A collection of such compute and I/O nodes, all connected through a high speed switch fabric, makes up a scalable unit known as a Connected Unit (CU).

    The fundamental building block of a CU is the compute node, each of which is a TriBlade. The TriBlade is an original design concept created for the Roadrunner system and allows for the integration of Cell/B.E. and Opteron blades. Architecturally, this design allows for the incorporation of these TriBlades into an IBM BladeCenter chassis.

    1.2.1 TriBlade: a unique concept

    The TriBlade makes up what is called a hybrid compute node. The components of this node consist of an IBM LS21 Opteron blade, two IBM BladeCenter QS22 Cell/B.E. blades, and a fourth blade that houses the communications fabric for the compute node. This expansion blade connects the two QS22 blades through four PCI Express x8 links to the Opteron blade and provides each node with an InfiniBand 4x DDR cluster interconnect. Figure 1-1 shows a schematic of a TriBlade.

    Figure 1-1 TriBlade schematic


    The node design of the TriBlade offers a number of important characteristics. Since each node is accelerated by Cell/B.E. processors, by design there is one Cell/B.E. chip for each Opteron core. The TriBlade is populated with 16 GB of Opteron memory and an equal amount of Cell/B.E. memory. Since each of the new Cell/B.E. eDP processors is capable of delivering 102.4 gigaflops of peak performance, each TriBlade node is capable of approximately 400 gigaflops of double precision compute power. For additional information about the Cell/B.E. processor, see Appendix A, "The Cell Broadband Engine (Cell/B.E.) processor".
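The peak figure for a TriBlade follows directly from the per-processor number quoted above. The short Python sketch below works through that arithmetic. The variable names are illustrative, the estimate counts only the Cell/B.E. contribution, and the system-wide extrapolation assumes the 17 CUs of 180 TriBlades each described in 1.1.

```python
# Peak double precision estimate from the figures quoted in the text.
CELL_PEAK_GFLOPS = 102.4     # peak per Cell/B.E. eDP processor
CELLS_PER_TRIBLADE = 4       # one Cell/B.E. processor per Opteron core

triblade_peak_gflops = CELL_PEAK_GFLOPS * CELLS_PER_TRIBLADE

# Extrapolation to the full machine: 17 CUs x 180 TriBlades (section 1.1).
# Opteron floating point capability is deliberately left out of this estimate.
TRIBLADES = 17 * 180
system_cell_peak_pflops = triblade_peak_gflops * TRIBLADES / 1e6

print(f"Per TriBlade:             {triblade_peak_gflops:.1f} gigaflops")
print(f"All Cell/B.E. processors: {system_cell_peak_pflops:.2f} petaflops")
```

On these numbers the Cell/B.E. processors alone account for roughly 1.25 petaflops. The system peak of more than 1.3 petaflops quoted in 1.1 also includes the Opteron processors, whose per-core peak is not given in this section.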

    The design of the TriBlade presents the user with a very specific memory hierarchy. The Opteron processors establish a master-subordinate relationship with the Cell/B.E. processors. Each Opteron blade contains 4 GB of memory per core, resulting in 8 GB of shared memory per socket. The Opteron blade thus contains 16 GB of NUMA shared memory per node.

    Each Cell/B.E. processor contains 4 GB of shared memory, resulting in 8 GB of shared memory per blade. In total, the Cell/B.E. blades contain 16 GB of distributed memory per TriBlade node. It is important to note that not only is there a one-to-one mapping of Opteron cores to Cell/B.E. processors, but also each node consists of a distribution of equal memory among each of these components.
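    The memory totals above follow from simple multiplication. The short Python sketch below reproduces them; the function and parameter names are illustrative only, and the counts (two cores per socket, two sockets per LS21, two processors per QS22, two QS22s per TriBlade) are taken from this chapter.

```python
def node_memory_gb(gb_per_unit, units_per_group, groups):
    """Total memory when every unit carries the same amount."""
    return gb_per_unit * units_per_group * groups

# Opteron side: 4 GB per core, 2 cores per socket, 2 sockets per LS21.
opteron_gb = node_memory_gb(4, 2, 2)

# Cell/B.E. side: 4 GB per processor, 2 processors per QS22, 2 QS22s per TriBlade.
cell_gb = node_memory_gb(4, 2, 2)

print(opteron_gb, cell_gb)  # 16 16
```

    Both sides come to 16 GB, which is the balanced distribution described above.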

    In order to sustain this compute power, the connectivity within each node consists of four PCI Express x8 links, each capable of 2 GB/s transfer rates, with a 2 microsecond latency. The expansion slot also contains the InfiniBand interconnect, which allows communications to the rest of the cluster. The capability of the InfiniBand 4x DDR interconnect is rated at 2 GB/s with a 2 microsecond latency.
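    As a rough illustration of what these ratings imply, the following Python sketch applies a first-order model (fixed latency plus size divided by bandwidth) to a single link. The model itself, the function name, and the reading of 1 GB as 10^9 bytes are assumptions made for this example; real transfers also incur protocol and software overhead that is not modeled here.

```python
def transfer_time_us(num_bytes, bandwidth_gb_per_s=2.0, latency_us=2.0):
    """Estimated time in microseconds to move num_bytes over one link.

    Assumes 1 GB/s = 1e9 bytes/s, which is 1e3 bytes per microsecond.
    """
    bytes_per_us = bandwidth_gb_per_s * 1e3
    return latency_us + num_bytes / bytes_per_us

small = transfer_time_us(64)         # dominated by the 2 microsecond latency
large = transfer_time_us(1_000_000)  # dominated by the bandwidth term
```

    Under this model, small messages cost little more than the latency, while a 1 MB transfer takes roughly 500 microseconds, so larger transfers are needed to make good use of the link.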

    1.2.2 IBM BladeCenter QS22

    The IBM BladeCenter QS22 is based on the IBM PowerXCell 8i processor, a new generation processor based on the Cell/B.E. architecture. In contrast to its predecessors, the QS20 and QS21, the QS22 is based on the second generation processor of the Cell/B.E. architecture and offers single instruction multiple data (SIMD) vector capability along with strong parallelization. It performs double precision floating point operations at five times the speed of the previous generations of Cell/B.E. processors.

    Due to its parallel nature and extraordinary computing speed, the QS22 is ideal for use in scientific applications, which is why it was chosen as an integral part of the Roadrunner system by IBM and Los Alamos. The QS22 is a single-wide blade server that offers an SMP with shared memory and two Cell/B.E. processors in a single blade enclosure.

    Figure 1-2 on page 7 provides an illustration of the IBM BladeCenter QS22. Features of the QS22 include:

    Two 3.2 GHz IBM PowerXCell 8i processors

    Up to 32 GB of PC2-6400 800 MHz DDR2 memory

    460 single-precision gigaflops per blade (peak)

    217 double-precision gigaflops per blade (peak)

    Integrated dual 1 Gb Ethernet

    IBM Enhanced I/O Bridge chip

    Serial Over LAN

    The QS22 is based on the 64-bit IBM PowerXCell 8i processor. This processor operates at 3.2 GHz. Each of the eight SIMD vector processors is capable of producing four floating point results per clock period. The memory subsystem on the QS22 consists of eight DIMM slots, enabling configurations from 4 GB up to 32 GB of ECC memory.
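    The 102.4 gigaflops per-processor figure quoted in this chapter can be recovered from the numbers in the preceding paragraph. The Python sketch below is a minimal check with illustrative names; it counts only the eight SIMD vector processors and does not attempt to reproduce the per-blade peak figures in the feature list.

```python
CLOCK_GHZ = 3.2         # PowerXCell 8i clock rate
SIMD_UNITS = 8          # SIMD vector processors per chip
RESULTS_PER_CLOCK = 4   # floating point results per unit per clock period

def chip_peak_gflops():
    """Peak rate of one processor from its SIMD units alone."""
    return SIMD_UNITS * RESULTS_PER_CLOCK * CLOCK_GHZ

def triblade_peak_gflops(chips_per_triblade=4):
    """Two QS22 blades with two processors each give four chips per TriBlade."""
    return chips_per_triblade * chip_peak_gflops()

print(chip_peak_gflops())      # 102.4
print(triblade_peak_gflops())  # 409.6
```

    The second value is the source of the "approximately 400 gigaflops" per TriBlade quoted in 1.2.1.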


    For additional information about the Cell/B.E. processor, see Appendix A, The Cell

    Broadband Engine (Cell/B.E.) processor on page 27.

    Figure 1-2 IBM BladeCenter QS22

    For more information about the QS22, see the IBM BladeCenter QS22 Web page at:

    http://www.ibm.com/systems/bladecenter/hardware/servers/qs22/index.html

    1.2.3 IBM BladeCenter LS21

    The IBM BladeCenter LS21 is a single width AMD Opteron-based server. The LS21 blade server supports up to two of the dual-core 2200 series AMD Opteron processors combined with up to 32 GB of ECC memory and one fixed SAS HDD.

    The memory used in the LS21 is DDR2 and is ECC protected. The general memory configuration for the LS21 has to follow these guidelines:

    A total of eight DIMM slots (four per processor socket). Two of these slots (1 and 2) are preconfigured with a pair of DIMMs.

    Because memory is 2-way interleaved, the memory modules must be installed in matched pairs. However, one DIMM pair is not required to match the other in capacity.

    A maximum of 32 GB of installed memory is achieved when all DIMM sockets are populated with 4 GB DIMMs.

    Important: The implementation chosen for the Roadrunner system consists of the standard blade populated with 16 GB of DDR2 memory. As with the Opteron blades, all of the Cell/B.E. based blades are diskless.

    Important: The configuration used for the Roadrunner system contains two AMD Opteron processors running at 1.8 GHz, 16 GB of ECC memory, and no hard disk. The diskless configuration is an important implementation design, which eliminates additional moving parts and potential points of failure for a system with so many thousands of nodes.


    For each installed microprocessor, a set of four DIMM sockets is enabled.
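    The pairing guidelines above can be expressed as a small validation routine. The Python sketch below is only an illustration: the function name and the pair-based input format are hypothetical, the allowed DIMM sizes are those listed for the standard LS21 later in this section, and the rule that sockets are enabled per installed processor is not modeled.

```python
ALLOWED_DIMM_GB = {0.5, 1, 2, 4}   # 512 MB, 1 GB, 2 GB, and 4 GB DIMMs
MAX_PAIRS = 4                      # eight DIMM slots, populated two at a time

def validate_dimm_pairs(pairs):
    """Return the total installed memory in GB, or raise ValueError.

    pairs is a list of (size_gb, size_gb) tuples, one tuple per DIMM pair.
    """
    if len(pairs) > MAX_PAIRS:
        raise ValueError("more DIMM pairs than slots")
    for first, second in pairs:
        if first != second:
            raise ValueError("DIMMs within a pair must match")
        if first not in ALLOWED_DIMM_GB:
            raise ValueError("unsupported DIMM size")
    return sum(first + second for first, second in pairs)

print(validate_dimm_pairs([(4, 4)] * 4))        # 32, the documented maximum
print(validate_dimm_pairs([(2, 2), (0.5, 0.5)]))  # 5.0, pairs may differ from each other
```

    A layout such as [(2, 4)] is rejected because the two DIMMs in an interleaved pair must match.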

    The processors used in these blades are standard low-power processors. The standard AMD Opteron processors draw a maximum of 95 W. Specially manufactured low-power processors operate at 68 W or less without any performance trade-offs. This savings in power at the processor level, combined with the smarter power solution that IBM BladeCenter delivers, makes these blades very attractive for installations that are limited by power and cooling resources.

    This blade is designed with power management capability to provide the maximum up time possible. In extended thermal conditions, rather than shut down completely or fail, the LS21 automatically reduces the processor frequency to maintain acceptable thermal levels.

    A standard LS21 blade server offers these features:

    Up to two high-performance, AMD Dual-Core Opteron processors.

    A system board containing eight DIMM connectors, supporting 512 MB, 1 GB, 2 GB, or 4

    GB DIMMs.

    Up to 32 GB of system memory is supported with 4 GB DIMMs.

    A SAS controller, supporting one internal SAS drive (36 or 73 GB) and up to three additional SAS drives with optional SIO blade.

    Two TCP/IP Offload Engine enabled Gigabit Ethernet controllers (Broadcom 5706S) as standard, with load balancing and failover features.

    Support for concurrent KVM (cKVM) and concurrent USB/DVD (cMedia) through

    Advanced Management Module and an optional daughter card.

    Support for a Storage and I/O Expansion (SIO) unit.

    Dual Gigabit Ethernet controllers are standard, providing high-speed data transfers and offering TCP/IP Offload Engine support, load-balancing, and failover capabilities. The version used for Roadrunner uses optional InfiniBand expansion cards, allowing high speed communication between nodes. The InfiniBand fabric installed with Roadrunner provides 4x DDR connections that have a theoretical peak of 2 GB per second.

    Finally, the LS21 supports both the Windows and Linux operating systems. The Roadrunner

    implementation uses the Fedora version of Linux.

    Figure 1-3 on page 9 shows a schematic of the planar of an LS21.


    Figure 1-3 LS21 planar

    For more information about the LS21, see the IBM BladeCenter LS21 Web page at:

    http://www.ibm.com/systems/bladecenter/hardware/servers/ls21/features.html

    1.3 Rack configurations

    TriBlades are combined into racks to create assemblies of hybrid compute nodes. In addition, some racks contain other components for other required functionality. There are three different rack types:

    Compute node rack

    Compute node and I/O rack

    Switch and service rack

    In general, these racks look very similar. Each can hold a maximum of 12 TriBlades and some hold additional components.


    1.3.1 Compute node rack

    A compute node rack holds a total of 12 TriBlades, which means it holds 12 LS21s and 24 QS22s. A compute node rack looks similar to the picture shown in Figure 1-4.

    Figure 1-4 Compute node rack

    1.3.2 Compute node and I/O rack

    A compute node and I/O rack contains 12 TriBlades, but also contains an IBM System x3655 (x3655) at the bottom of the rack. The x3655 performs input/output (I/O) services on behalf of the system. A compute and I/O node rack looks similar to the picture shown in Figure 1-5 on page 11.

    The x3655 is a new rack-optimized server based on the AMD Opteron dual-core processor. The x3655 supports four processor sockets and 32 memory DIMM slots. The memory is 667 MHz DDR2, in sizes ranging from 512 MB to 4 GB per DIMM. This gives a total capacity of up to 128 GB of main system memory.

    Note: The x3655 used in the Roadrunner system supports 16 GB or 32 GB of memory.


    Figure 1-5 Compute and I/O node rack

    1.3.3 Switch and service rack

    The switch and service rack contains no TriBlades. This rack contains a Voltaire Grid Director ISR 9288 switch that is used to manage InfiniBand networking traffic. This is known in Roadrunner as a first-stage switch. See "First-stage InfiniBand switch" on page 14 for more information about its role and function.

    You can learn more about the Voltaire switch technology on the Voltaire Web page at:

    http://www.voltaire.com/Products/Grid_Backbone_Switches/Voltaire_Grid_Director_ISR_9288

    In addition, this rack contains an IBM System x3655, which serves as the service node for the

    CU. The functions that the service node performs include the following:

    Holds the boot images used to IPL the Opteron and Cell/B.E. blades, as well as the I/O

    nodes.

    IPLs all elements in the CU when instructed to do so by the central management node.


    A switch and service rack looks similar to the picture shown in Figure 1-6.

    Figure 1-6 Switch and service rack

    1.4 The Connected Unit

    The Connected Unit (CU) is a core concept in the Roadrunner system. Groups of the various rack configurations discussed in 1.3, "Rack configurations" on page 9 are put together to create a single CU. Table 1-1 lists the racks that comprise a single CU.

    Table 1-1 Racks making up a Connected Unit

    Rack type                   Number of racks in     Number of TriBlades   Total number of
                                the Connected Unit     in a rack             TriBlades
    Compute node rack           3                      12                    36
    Compute node and I/O rack   12                     12                    144
    Switch and service rack     1                      0                     0
    Total                       16                     N/A                   180

    A CU can be thought of as a base cluster unit. The racks that make up a CU are connected to each other through first-stage switches. CUs are then tied together through second-stage switches to create a larger grid.

    The size of a CU is largely determined by the capabilities of the first-stage switch. There are 180 TriBlades in a CU. This number of TriBlades means that a Connected Unit contains 180 AMD Opteron LS21s and 360 IBM BladeCenter QS22s. See Figure 1-7 on page 13.
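    The blade counts for a CU follow directly from the rack counts in Table 1-1. The Python sketch below recomputes them; the dictionary layout and names are illustrative only.

```python
# Rack type -> (racks per Connected Unit, TriBlades per rack), from Table 1-1.
RACKS = {
    "compute node rack": (3, 12),
    "compute node and I/O rack": (12, 12),
    "switch and service rack": (1, 0),
}

def connected_unit_totals(racks=RACKS):
    """Return rack, TriBlade, LS21, and QS22 counts for one Connected Unit."""
    total_racks = sum(count for count, _ in racks.values())
    triblades = sum(count * per_rack for count, per_rack in racks.values())
    return {
        "racks": total_racks,
        "triblades": triblades,
        "ls21": triblades,        # one LS21 per TriBlade
        "qs22": 2 * triblades,    # two QS22s per TriBlade
    }

print(connected_unit_totals())
# {'racks': 16, 'triblades': 180, 'ls21': 180, 'qs22': 360}
```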


    Figure 1-7 Racks comprising a Connected Unit

    1.5 Networks

    Given the high number of racks and nodes in the Roadrunner system, it should come as no surprise that there are several different networks used to tie the system together. This section provides an overview of the different networks involved as well as their functional purpose.

    1.5.1 Networks within a Connected Unit cluster

    First-stage switches are used to connect all the racks making up a Connected Unit (CU) together and to allow the CU to communicate with the outside world (for example, a file system) and other CUs. The second-stage switches primarily serve as a hub to tie the 17 CUs together into a common computational system.

    Note: As previously discussed in this chapter, the entire Roadrunner system or cluster is

    comprised of a total of 17 CUs.
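    Scaling the per-CU figures to the full system is a matter of multiplying by 17. The Python sketch below does this using the 180 TriBlades per CU and the 102.4 gigaflops per Cell/B.E. processor quoted earlier in this chapter; the names are illustrative, and the peak figure counts only the Cell/B.E. processors, so it is an estimate rather than an official system rating.

```python
CONNECTED_UNITS = 17
TRIBLADES_PER_CU = 180
CELLS_PER_TRIBLADE = 4        # two QS22 blades, two processors each
CELL_PEAK_GFLOPS = 102.4      # double precision peak per processor

def system_totals():
    """Return TriBlade and Cell/B.E. counts and the Cell-only peak in petaflops."""
    triblades = CONNECTED_UNITS * TRIBLADES_PER_CU
    cells = triblades * CELLS_PER_TRIBLADE
    peak_pflops = cells * CELL_PEAK_GFLOPS / 1e6   # 1 petaflops = 1e6 gigaflops
    return triblades, cells, peak_pflops

triblades, cells, peak_pflops = system_totals()
print(triblades, cells, round(peak_pflops, 2))  # 3060 12240 1.25
```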


    First-stage InfiniBand switch

    As discussed in 1.3.3, "Switch and service rack" on page 11, each CU contains a rack with a Voltaire Grid Director ISR 9288 switch. This switch allows for 288 different InfiniBand inputs, which are used as shown in Table 1-2.

    Table 1-2 Connections in and out from a first-stage switch

    Component                                 Number of connections   Purpose
    TriBlades InfiniBand link                 180                     Connects the AMD Opteron nodes together to allow them to participate in a network.
    InfiniBand links to second-stage switch   96                      Allows the CUs to be tied together into a single network.
    InfiniBand links to I/O nodes             12                      Provides the hybrid compute nodes access to the file system for application input and output.
    Total                                     288

    InfiniBand Connected Unit

    This network creates a fat tree that allows the AMD Opterons to communicate with each other using the industry-standard Message Passing Interface (MPI). It is built on top of the switched InfiniBand network. A fat tree is a special topology invented by Charles E. Leiserson of MIT. Unlike a traditional binary tree, a fat tree has thicker branches the closer you get to the tree's root. In this way, you do not end up with a communications bottleneck at the root of the tree.

    Figure 1-8 shows a traditional binary tree. Note that as messages flow up the tree, the single links to the root node can become a point of congestion.

    Figure 1-8 Traditional binary tree

    Figure 1-9 on page 15, on the other hand, shows a fat tree. Notice how the number of links between nodes increases as you get closer to the tree's root. The number of links shown is just one example of a fat tree configuration; the actual number may be higher or lower between any two nodes depending on the given requirements.


    Figure 1-9 Fat tree

    Fat tree topologies are becoming quite popular in InfiniBand clusters. For more information about fat trees and their usage with InfiniBand, see the article "Performance Modeling of Subnet Management on Fat Tree InfiniBand Networks using OpenSM", which is available at the following Web site:

    http://nowlab.cse.ohio-state.edu/publications/conf-papers/2005/vishnu-fastos05.pdf
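    The difference between the two topologies can be made concrete by counting how many leaf nodes share each upward link. The Python sketch below is an idealized model, not a description of the Roadrunner fabric: it assumes a complete binary-shaped tree and a fat tree whose link count doubles at every level, whereas real fat trees may use more or fewer links. All names are illustrative.

```python
def uplink_profile(depth, fat):
    """Describe the uplink of each subtree in a tree with 2**depth leaves.

    Returns (height, leaves_below, links, leaves_per_link) tuples, where
    height 0 is a single leaf and height depth - 1 is a child of the root.
    """
    profile = []
    for height in range(depth):
        leaves_below = 2 ** height
        links = leaves_below if fat else 1   # fat tree: one link per leaf below
        profile.append((height, leaves_below, links, leaves_below // links))
    return profile

for name, fat in (("binary tree", False), ("fat tree", True)):
    print(name, [row[3] for row in uplink_profile(4, fat)])
# binary tree [1, 2, 4, 8]
# fat tree [1, 1, 1, 1]
```

    In the plain tree, the links nearest the root are shared by the most leaves, which is the congestion point described above. In the idealized fat tree, the sharing stays constant at every level.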

    10 Gigabit Ethernet file system LAN

    Every CU has twelve I/O nodes, each of which has a single InfiniBand connection to the CU's InfiniBand switch. This allows the hybrid compute nodes (TriBlades) to retrieve and pass data to the I/O nodes over the InfiniBand network. The file system is connected through the I/O nodes, each of which has two 10 Gb Ethernet links to the file system LAN.
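    With one InfiniBand link per I/O node, the port usage of a first-stage switch can be tallied against its 288 inputs. The Python sketch below does so using the TriBlade and second-stage link counts from Table 1-2 and the twelve I/O node links described here; the names are illustrative.

```python
SWITCH_PORTS = 288   # InfiniBand inputs on the first-stage switch

# Links on one first-stage switch.
LINKS = {
    "TriBlades": 180,            # one per hybrid compute node
    "second-stage switch": 96,   # uplinks tying the CUs together
    "I/O nodes": 12,             # one per I/O node
}

def port_budget(links=LINKS, ports=SWITCH_PORTS):
    """Return the number of ports used and the number left free."""
    used = sum(links.values())
    return used, ports - used

print(port_budget())  # (288, 0)
```

    Every port is accounted for, which is consistent with the earlier statement that the size of a CU is largely determined by the capabilities of the first-stage switch.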

    Gigabit Ethernet Control VLAN (CVLAN)

    The 1 Gb Ethernet control VLAN is used to perform vital program and node control functions within each CU, such as passing the Message Passing Interface (MPI) information required for program operation and communication.

    Gigabit Ethernet Management VLAN (MVLAN)

    The 1 Gb Ethernet Management VLAN is used to perform vital system management functions within each CU, such as passing the required operating system boot images from the CU's service node to the processors on the hybrid compute nodes and I/O nodes in order to IPL them.

    PCI Express link between LS21 and Cell/B.E. blades

    Each AMD Opteron core has a one-to-one master-subordinate relationship with a Cell/B.E. processor. Although the Opterons participate in MPI communications with other Opteron nodes and access the file system through the I/O nodes, the Cell/B.E. processors only communicate with their master Opteron.

    Important: The CVLAN is used exclusively for control traffic; no user data flows across this network.


    Gigabit Ethernet management VLAN (MVLAN)

    The 1 Gb Ethernet management VLAN is the grid-wide system management network. It is used for booting, system control, and status determination operations between the management nodes and the various managed elements throughout the cluster. The MVLAN does not have direct network access to the internals of a CU (for example, the hybrid compute nodes and I/O nodes). Management operations to those nodes occur from the MVLAN to the CU's MVLAN through the service node to the desired target.

    No user or application data flows across the MVLAN. Only system management and control traffic flows across this network.


    Chapter 2. Roadrunner software overview

    This chapter briefly describes the software used to run applications on the Roadrunner

    system.


    Note: This IBM Redpaper publication is not intended to be a detailed analysis, but rather a

    big picture discussion meant to acquaint the reader with the Roadrunner system.


    2.1 Roadrunner components

    This section provides a brief explanation of the software that runs on the various components that comprise a Roadrunner system.

    2.1.1 Compute node (TriBlade)

    As described in 1.2.1, "TriBlade: a unique concept" on page 5, a TriBlade is made up of one IBM BladeCenter LS21 blade and two IBM BladeCenter QS22 blades. Each of these runs its own operating system image, but shares a common user application.

    The following is the software that runs on the various components of the TriBlade:

    AMD Opteron LS21 for IBM BladeCenter

    Each LS21 is standard except for the fact that it is diskless. The operating system is Fedora Linux. Since it is diskless, it is booted up from its Connected Unit's service node.

    IBM BladeCenter QS22

    Each QS22 is standard except for the fact that it is diskless. The operating system is Fedora Linux. Since it is diskless, it is booted up from its Connected Unit's service node.

    Broadcom HT-2100 (PCIe adapter)

    The dual Opteron host blade (LS21) is connected to the two QS22s through a PCI Express (PCIe) interconnect. Two HyperTransport x16 connections from the LS21 blade drive an expansion card containing two Broadcom HT-2100 HyperTransport to PCI Express bridge chips. Each Broadcom HT-2100 drives two PCI Express x8 connections to the two Axon Southbridge chips on one of the Cell Broadband Engine (Cell/B.E.) blades (QS22). This provides a dedicated PCIe x8 connection to each Cell/B.E. processor.

    The PCIe interconnect is supported by a low-level device driver that provides direct memory access (DMA) and a remote memory mapped small message area (SMA). DMA operations can be started by calls to the device driver from programs on either the LS21 or the QS22. The device driver initiates the DMA operation using a DMA controller in the Axon Southbridge. The small message area provides regions of memory that can be accessed remotely by user space instructions without a context switch to the kernel or device driver interaction. There is a unique device driver instance on both the Opteron and the Cell/B.E. blade for each Axon Southbridge. A virtual Ethernet driver (also replicated per Axon) supports point-to-point communications between the Opteron and each Cell/B.E. processor.

    Note: From an IBM BladeCenter Advanced Management Module (AMM) perspective, the TriBlade still appears as separate blades. In other words, it appears as one LS21 and two QS22s. The logical grouping of the LS21 and QS22s is handled through the xCAT management tools. See 2.3, "xCAT" on page 23 for more information.

    2.1.2 I/O node

    As mentioned previously in 1.3.2, "Compute node and I/O rack" on page 10, each I/O node is an IBM System x3655 server. I/O nodes are diskless and serve as pipes to the external file system across the 10 Gigabit Ethernet file system LAN.

    Each I/O node runs Fedora Linux as its operating system. Since the node is diskless, it is booted up from its Connected Unit's service node. The I/O node runs either the IBM GPFS or Panasas PanFS client to communicate with the external file system, depending on what file system software is running there.

    2.1.3 Service node

    Service nodes are standard IBM System x3655 Opteron-based servers and are diskless. There is one dedicated service node per Connected Unit, so its image can be updated directly from the master node over the management network (MVLAN) described in "Gigabit Ethernet management VLAN (MVLAN)" on page 17.

    Service nodes obtain copies of the boot images for the I/O nodes and compute nodes from the master node. These images are refreshed on an as-needed basis. The images are loaded over the CVLAN (see "Gigabit Ethernet Control VLAN (CVLAN)" on page 15).

    2.1.4 Master (management) node

    The master node is a standard IBM System x3655 Opteron-based server and is booted from the local disk. The master node runs Fedora Linux.

    2.2 Cluster boot sequence

    The initial booting of the nodes is complicated by two factors in the Roadrunner system:

    All of the nodes except for the master node are diskless, so they must boot over the network.

    There are over 3,000 total nodes and 10,000 operating system images that need to be

    installed and booted.

    There will be times when the entire system needs to be booted, and there will be times when only parts of the system need to be booted (while the rest of the system is still available but powered off). This places two distinct demands on the management network:

    It must be able to boot the entire system without causing timeouts on the management

    network such that no boot progress is being made.

    It must be able to boot substantial portions of the system without interfering with any status and control operations that are occurring on the running portion of the system.

    Since the majority of nodes are diskless, a scalable way to move the boot images to each of the nodes is required. To this end, a hierarchy of management nodes has been created.

    The solution is to use the bootstrap protocol (BOOTP) together with trivial file transfer protocol (TFTP) subnet multicast to boot the diskless LS21 Opteron and QS22 Cell/B.E. blades. This method provides a broadcast of the common boot image that the LS21s and QS22s can pick up midstream. The multicast repeats until all requesting blades have received all packets of the boot image. There are unique boot images for the various configurations. The boot images are stored on the Connected Unit service nodes and multicast over the CVLAN. This method significantly reduces network traffic compared to sending individual boot images to each processor.
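    The traffic saving from a repeating multicast can be illustrated with a small simulation. The Python sketch below is a simplified model under stated assumptions, not the actual BOOTP/TFTP implementation: the server cycles through the packets of one image in order, every client that has joined receives each packet without loss, and a client is finished once it has seen every packet. The function names and the example join times are hypothetical.

```python
def carousel_packets_sent(image_packets, join_ticks):
    """Packets the server multicasts until every client holds the full image.

    join_ticks[i] is the tick at which client i starts listening; the server
    sends packet number (tick % image_packets) at each tick.
    """
    missing = [set(range(image_packets)) for _ in join_ticks]
    tick = 0
    while any(missing):
        packet = tick % image_packets
        for client, joined in enumerate(join_ticks):
            if tick >= joined:
                missing[client].discard(packet)   # joined clients all receive it
        tick += 1
    return tick

def unicast_packets_sent(image_packets, clients):
    """Packets sent when each client is given its own copy of the image."""
    return image_packets * clients

joins = [0, 250, 900]                      # clients pick up the stream midstream
print(carousel_packets_sent(1000, joins))  # 1900
print(unicast_packets_sent(1000, len(joins)))  # 3000
```

    In this model the multicast cost depends on when the last client joins rather than on how many clients there are, so the saving grows with the number of blades booting at once.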

    Note: There is only one master node for the entire Roadrunner cluster.


    2.2.1 Boot scenarios

    This section describes in more detail what happens when a cluster (or parts of the cluster) are booted up.

    Master (management) node (tier 1)

    This node is installed and booted with the required management node image. The management node boots from the local disk.

    Service nodes (tier 2)

    There is only one service node per Connected Unit, so this image can be updated directly from the master node over the MVLAN at any time (not just at service node bring-up). Once booted, service nodes obtain copies of the boot images for the I/O nodes and compute nodes from the master node. These images are refreshed on an as-needed basis. The images are loaded over the CVLAN through the multicast boot process, which allows for far less network traffic and parallel image download.

    I/O nodes

    Once successfully booted, the service nodes begin transferring the required boot images down the CVLAN. The I/O nodes are standard Opteron Linux servers and are booted diskless with the required image. I/O nodes are connected to the 10 Gb Global File System (GFS) to service the compute nodes' file access requests. The image required to boot the I/O node is received from its local service node through the CVLAN network.

    Compute nodes (TriBlades)

    Compute nodes (TriBlades) are either accelerated or non-accelerated, with the difference being that accelerated nodes will have their associated Cell/B.E. blades powered on and booted, while Cell/B.E. blades on the non-accelerated nodes are left powered off.

    There is no need for a heartbeat function between the Opteron core and its associated Cell Broadband Engine processor. The general health of both resources is known by the xCAT software and reflected in the resource manager. Communication health status between the two resources is monitored and understood on demand by the application running on the Opteron side. The Data Communication and Synchronization (DaCS) API is notified of errors from the Cell/B.E. processor concerning any data transfer or communications request. Failures of these transactions are reported by the software structures. If the PCI Express connection between the Opteron and Cell/B.E. processor fails, an appropriate error event is posted and the application is terminated.

    Given the PCI Express interface between the Opteron and Cell/B.E. processor, it is necessary to boot the Cell/B.E. processor portions of a compute node (in the accelerated node pool) before the Opteron portion. This allows the proper initialization of the interconnect firmware and PCI Express device drivers. The Cell/B.E. PCI Express device drivers listen for the necessary firmware/driver handshakes from the LS21 and Broadcom HT-2100 (PCIe adapter) expansion card to establish communication. The process of ensuring the correct booting sequence is controlled by the xCAT software.
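    The ordering rule described above can be summarized in a few lines of Python. This sketch is purely illustrative: the function and the blade labels are hypothetical and do not represent xCAT commands or its internal logic.

```python
def triblade_boot_order(accelerated):
    """Return the order in which the blades of one TriBlade are booted.

    In an accelerated node, both Cell/B.E. blades come up before the Opteron
    blade so that the PCI Express drivers on the Cell/B.E. side are already
    listening. In a non-accelerated node, the Cell/B.E. blades stay off.
    """
    order = []
    if accelerated:
        order += ["QS22 (first Cell/B.E. blade)", "QS22 (second Cell/B.E. blade)"]
    order.append("LS21 (Opteron blade)")
    return order

print(triblade_boot_order(accelerated=True))
print(triblade_boot_order(accelerated=False))
```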

    Note: There is no low power mode for the Cell/B.E. blades, so some sort of standby

    mode is not possible. They are either on (accelerated) or off (non-accelerated).


    2.3 xCAT

    Setting up the installation and management of a cluster is a complicated task, and doing everything manually quickly becomes unmanageable. The development of xCAT grew out of the desire to automate many of the repetitive steps involved in installing and configuring a Linux cluster.

    The development of xCAT is driven by customer requirements. Because xCAT itself is written entirely using scripting languages such as Korn shell, Perl, and Expect, an administrator can easily modify the scripts should the need arise.

    The main functions of xCAT are grouped as follows:

    Automated installation

    Hardware management and monitoring

    Software administration

    Remote console support for text and graphics

    For more information about xCAT, refer to the xCAT Web site at:

    http://xcat.sourceforge.net

    2.4 How applications are written and executed

    This section discusses how applications are written and executed on the Roadrunner system. The unique architecture employed means that applications are designed and written in a revolutionary new manner compared to previous parallel processing applications.

    2.4.1 Application core

    The bulk of the user application, including initiation and termination, runs on the AMD Opteron processor (LS21). It uses Message Passing Interface (MPI) APIs to communicate with the other Opteron processors the application is running on in a typical single program, multiple data (SPMD) fashion. The number of compute nodes used to run the application is determined at program launch.

    The MPI implementation of Roadrunner is based on the open-source Open MPI Project and therefore is standard MPI. In this regard, Roadrunner applications are similar to other typical MPI applications (such as those that run on the IBM Blue Gene solution). Where Roadrunner differs in the sphere of application architecture is how its Cell/B.E. accelerators are employed. At any point in the application flow, the MPI application running on each Opteron can offload computationally-complex logic to its subordinate Cell/B.E. processor.

    For more information about Open MPI Project, refer to the Open MPI: Open Source High Performance Computing Web site at:

    http://www.open-mpi.org/
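    The host-plus-accelerator pattern described in this section can be sketched in a few lines of Python. This is only an analogy: a worker thread stands in for the subordinate Cell/B.E. processor, and none of the names below belong to MPI, DaCS, or ALF. The sketch shows the shape of an asynchronous offload, in which the host starts the work, continues with its own logic, and synchronizes when it needs the result.

```python
from concurrent.futures import ThreadPoolExecutor

def kernel(values):
    """Stand-in for a compute-heavy routine that an application might offload."""
    return sum(v * v for v in values)

def host_step(data):
    """Run one application step on the host, offloading the kernel."""
    with ThreadPoolExecutor(max_workers=1) as accelerator:
        pending = accelerator.submit(kernel, data)   # start the offload
        host_result = len(data)                      # host keeps working meanwhile
        return host_result, pending.result()         # wait for the offloaded result

print(host_step([1, 2, 3]))  # (3, 14)
```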


    2.4.2 Offloading logic

    Determining which logic routines get offloaded to the Cell/B.E. processor, and when that occurs, is one of the most challenging tasks facing an application developer of the Roadrunner system. But it is this very challenge that makes the opportunity for incredibly high application performance possible.

    There are two primary techniques that a developer can employ to actually perform asynchronous offloads of logic. This section briefly describes each, and points to areas where you can find more detailed information.

    DaCS

    The Data Communication and Synchronization (DaCS) library provides a set of services that ease the development of applications and application frameworks in a heterogeneous multi-tiered system (for example, a 64-bit x86 system (x86_64) and one or more Cell/B.E. processor systems). The DaCS services are implemented as a set of APIs providing an architecturally neutral layer for application developers on a variety of multi-core systems. One of the key abstractions that further differentiates DaCS from other programming frameworks is a hierarchical topology of processing elements, each referred to as a DaCS Element (DE).

    Within the hierarchy, each DE can serve one or both of the following roles:

    A general purpose processing element, acting as a supervisor, control, or master processor. This type of element usually runs a full operating system and manages jobs running on other DEs. This is referred to as a Host Element (HE).

    A general or special purpose processing element running tasks assigned by an HE. This is referred to as an Accelerator Element (AE).
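    The two roles can be modeled with a small tree structure. The Python sketch below is a hypothetical illustration of the hierarchy only; the class, its methods, and the three-level example are assumptions made for this example and are not part of the DaCS API.

```python
class DacsElement:
    """Illustrative model of a DaCS Element (DE) in a hierarchy."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def roles(self):
        """Return the roles this element plays: HE, AE, or both."""
        found = []
        if self.children:            # manages work on other elements
            found.append("HE")
        if self.parent is not None:  # runs work assigned by a host element
            found.append("AE")
        return found

# Hypothetical three-level topology used only to show an element with both roles.
host = DacsElement("x86_64 host")
middle = DacsElement("Cell/B.E. PPE", parent=host)
leaf = DacsElement("SPE", parent=middle)

for element in (host, middle, leaf):
    print(element.name, element.roles())
```

    The middle element reports both roles, which corresponds to the statement that a DE can serve one or both of them.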

    DaCS for Hybrid (DaCSH) is an implementation of the DaCS API specification that supports the connection of an HE on an x86_64 system to one or more AEs on Cell/B.E. processors. In SDK 3.0, DaCSH only supports the use of sockets to connect the HE with the AEs. Direct access to the Synergistic Processor Elements (SPEs) on the Cell/B.E. processor is not provided. Instead, DaCSH provides access to the PowerPC Processor Element (PPE), allowing a PPE program to be started and stopped and allowing data transfer between the x86_64 system and the PPE. The SPEs can only be used by the program running on the PPE.

    For more information about DaCS, see IBM Software Development Kit for Multicore Acceleration Data Communication and Synchronization Library for Hybrid-x86 Programmer's Guide and API Reference, SC33-8408.

    ALF

    The Accelerated Library Framework (ALF) provides a programming environment for data and task parallel applications and libraries. The ALF API provides you with a set of interfaces to simplify library development on heterogeneous multi-core systems. You can use the provided framework to offload the computationally intensive work to the accelerators. More complex applications can be developed by combining several function offload libraries. You can also choose to implement applications directly to the ALF interface.

    ALF supports the multiple-program-multiple-data (MPMD) programming model, where multiple programs can be scheduled to run on multiple accelerator elements at the same time.


    The ALF functionality includes:

    Data transfer management

    Parallel task management

    Double buffering

    Dynamic load balancing for data parallel tasks

    With the provided API, you can also create descriptions for multiple compute tasks and define their execution orders by defining task dependency. Task parallelism is accomplished by having tasks without direct or indirect dependencies between them. The ALF run time provides an optimal parallel scheduling scheme for the tasks based on given dependencies.
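    To illustrate how task dependencies determine what can run in parallel, the following sketch groups tasks into "waves" in which no task depends on another. It is a generic dependency-ordering example, not the ALF scheduler; the function name and the task names are hypothetical.

```python
from typing import Dict, List, Set


def schedule_waves(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tasks into waves; tasks in the same wave do not depend on each other.

    deps maps each task name to the set of tasks it must wait for.
    Every dependency must itself be a key of deps.
    """
    remaining = {task: set(waits_for) for task, waits_for in deps.items()}
    waves: List[List[str]] = []
    while remaining:
        # Tasks whose dependencies have all completed can run concurrently.
        ready = sorted(task for task, waits_for in remaining.items() if not waits_for)
        if not ready:
            raise ValueError("cyclic or unknown dependency")
        waves.append(ready)
        for task in ready:
            del remaining[task]
        for waits_for in remaining.values():
            waits_for.difference_update(ready)
    return waves


example = {
    "load": set(),
    "fft_a": {"load"},
    "fft_b": {"load"},
    "merge": {"fft_a", "fft_b"},
}
print(schedule_waves(example))  # [['load'], ['fft_a', 'fft_b'], ['merge']]
```

    Here fft_a and fft_b have no direct or indirect dependency on each other, so they land in the same wave and could be dispatched to different accelerators at the same time.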

    For more information about ALF, see IBM Software Development Kit for Multicore Acceleration Accelerated Library Framework for Hybrid-x86 Programmer's Guide and API Reference, SC33-8406.


    Appendix A. The Cell Broadband Engine (Cell/B.E.) processor

    Of all of the components that make up the Roadrunner cluster, the Cell/B.E. processor holds a special place in that it provides extraordinary compute power that can be harnessed from a single multi-core chip. This appendix provides a brief architectural overview of the current Cell/B.E. processor, the motivation for some of its features, as well as the general properties of this unique processor.

    For additional information about the Cell/B.E. processor, refer to the following resources:

    Programming the Cell Broadband Engine Architecture: Examples and Best Practices, SG24-7575

    IBM Software Development Kit for Multicore Acceleration Data Communication and Synchronization Library for Cell/B.E. Programmer's Guide and API Reference, SC33-8407

    IBM Software Development Kit for Multicore Acceleration Accelerated Library Framework for Cell/B.E. Programmer's Guide and API Reference, SC33-8333

    The Cell/B.E. project at IBM Research, found at:

    http://www.research.ibm.com/cell/

    The Cell/B.E. resource center, found at:

    http://www.ibm.com/developerworks/power/cell/


    Note: Extensive resources exist on the Cell/B.E. processor and the Cell/B.E. architecture, including tutorials for the interested programmer. It is not the intention of this publication to reproduce all of this information in this short section. We have drawn on these resources to provide this summary.


    Background

    The Cell/B.E. architecture is designed to support a very broad range of applications. The first implementation is a single-chip multiprocessor with nine processor elements operating on a shared memory model, as shown in Figure A-1. In this respect, the Cell/B.E. processor extends current trends in PC and server processors. The most distinguishing feature of the Cell/B.E. processor is that, although all processor elements can share or access all available memory, their function is specialized into two types: the Power Processor Element (PPE) and the Synergistic Processor Element (SPE). The Cell/B.E. processor has one PPE and eight SPEs.

    The architectural definition of the physical Cell/B.E. architecture-compliant processor is much more general than the initial implementation. A Cell/B.E. architecture-compliant processor can consist of a single chip, a multi-chip module (or modules), or multiple single-chip modules on a system board or other second-level package. The design depends on the technology used and performance characteristics of the intended design.

    Logically, the Cell/B.E. architecture defines four separate types of functional components:

    PowerPC Processor Element (PPE)

    Synergistic Processor Unit (SPU)

    Memory Flow Controller (MFC)

    Internal Interrupt Controller (IIC)

    The computational units in the Cell/B.E. architecture-compliant processor are the PPEs and the SPUs. Each SPU must have a dedicated local storage, a dedicated MFC with its associated memory management unit (MMU), and a replacement management table (RMT). The combination of these components is called a Synergistic Processor Element (SPE).

    Figure A-1 Cell/B.E. schematic

    The first type of processor element, the PPE, contains a 64-bit PowerPC architecture core. It complies with the 64-bit PowerPC architecture and can run 32-bit and 64-bit applications. The second type of processor element, the SPE, is designed to run computationally intensive single-instruction multiple-data (SIMD)/vector applications. It is not intended to run a full-featured operating system. The SPEs are independent processor elements, each running its own individual application programs or threads. Each SPE has full access to shared memory, including the memory-mapped I/O space implemented by multiple DMA units. There is a mutual dependence between the PPE and the SPEs. The SPEs depend on the PPE to run the operating system and, in many cases, the top-level thread control for a user code. The PPE depends on the SPEs to provide the bulk of compute power.


    The SPEs are designed to be programmed in high-level languages. They support a rich instruction set that includes extensive SIMD functionality. However, like conventional processors with SIMD extensions, use of SIMD data is preferred but not mandatory. For programming convenience, the PPE also supports the standard PowerPC architecture instruction set and the SIMD/vector multimedia extensions. To an application programmer, the Cell/B.E. processor looks like a single-core, dual-threaded processor with eight additional cores, each having its own local store. The PPE is more adept than the SPEs at control-intensive tasks and quicker at task switching. The SPEs are more adept at compute-intensive tasks and slower than the PPE at task switching. Either processor element is capable of both types of functions. This specialization is a significant factor in accounting for the order-of-magnitude improvement in peak computational performance and power efficiency that the Cell/B.E. processor achieves over conventional processors.

    The more significant difference between the SPE and PPE lies in how they access memory. The PPE accesses memory with load and store instructions that move data between main storage and a set of registers, the contents of which may be cached. PPE memory access is like that of a conventional processor. The SPEs, in contrast, access main storage with direct memory access (DMA) commands that move data and instructions between main storage and a private local memory, called a local store (LS). An SPE's instruction fetches and load/store instructions access a private local store rather than the shared main memory.

    This three-level organization of storage (registers, LS, and main memory), with asynchronous DMA transfers between LS and main memory, is a radical break from conventional architecture and programming models. It explicitly parallelizes computation with the transfer of the data and instructions that feed computation and with the storing of the results of computation in main memory.
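    The overlap of computation with data transfer can be sketched in ordinary Python. In this minimal illustration, a single worker thread stands in for the asynchronous DMA engine and a Python list stands in for main memory; the function names are hypothetical and no Cell/B.E. SDK calls are used.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def process_double_buffered(
    main_memory: Sequence[int],
    chunk: int,
    kernel: Callable[[List[int]], List[int]],
) -> List[int]:
    """Run kernel over main_memory in chunks, fetching one chunk ahead."""

    def dma_get(offset: int) -> List[int]:
        # Stand-in for an asynchronous transfer from main memory to local store.
        return list(main_memory[offset:offset + chunk])

    results: List[int] = []
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = None
        for offset in range(0, len(main_memory), chunk):
            fetch = dma.submit(dma_get, offset)  # start the next transfer
            if pending is not None:
                # Compute on the previously fetched buffer while the new one loads.
                results.extend(kernel(pending.result()))
            pending = fetch
        if pending is not None:
            results.extend(kernel(pending.result()))  # drain the last buffer
    return results


doubled = process_double_buffered(list(range(10)), 4, lambda buf: [2 * x for x in buf])
```

    Only two buffers are alive at any time (the one being computed on and the one being filled), which is the essence of the double buffering that ALF lists among its functions.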

    A primary motivation for this new memory model is the realization that over the past twenty-five years, memory latency, as measured in processor cycles, has increased by almost three orders of magnitude. The result is that application performance is, in most cases, limited by memory latency rather than peak compute capability, as measured by processor clock speeds. When a sequential program performs a load instruction that encounters a cache miss, program execution comes to a halt for several hundred cycles (techniques such as hardware threading attempt to hide these stalls, but they do not help single-threaded applications). Compared to this penalty, the few cycles that it takes to set up a DMA transfer for an SPE are a much better trade-off, especially considering that each of the eight SPEs' DMA controllers can maintain up to 16 DMA transfers in flight simultaneously. Anticipating DMA needs efficiently can provide just-in-time delivery of data, which may reduce this stall or eliminate it entirely. Conventional processors, even with deep and costly speculation, manage to get, at best, a handful of independent memory accesses in flight.
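    As a quick check on the figures in this paragraph, the following sketch multiplies out how many DMA transfers the SPEs can keep in flight across one processor. The constant names are illustrative only.

```python
SPE_COUNT = 8                # SPEs per Cell/B.E. processor
DMA_IN_FLIGHT_PER_SPE = 16   # transfers each SPE's DMA controller can keep in flight

total_in_flight = SPE_COUNT * DMA_IN_FLIGHT_PER_SPE
print(total_in_flight)  # 128
```

    This total is what the comparison with the "handful" of independent accesses on a conventional processor rests on.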

    One of the SPE's DMA transfer methods supports a list (such as a scatter-gather list) of DMA transfers that is constructed in an SPE's local store, so that the SPE's DMA controller can process the list asynchronously while the SPE operates on previously transferred data. In several cases, this approach of accessing memory has improved application performance by almost two orders of magnitude when compared to the performance of conventional processors. This is significantly more than one would expect from the peak performance ratio (approximately 10x) between the Cell/B.E. processor and conventional PC processors.
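    The idea of a gather list can be sketched as follows. This is a conceptual model only: the list is a plain Python list of (offset, length) pairs, main memory is a Python sequence, and the names are hypothetical rather than taken from the MFC programming interface.

```python
from typing import List, Sequence, Tuple

# Each entry names one region of main memory: (offset, length).
DmaList = List[Tuple[int, int]]


def gather(main_memory: Sequence[int], dma_list: DmaList) -> List[int]:
    """Copy each listed region of main memory into one contiguous local buffer."""
    local_store: List[int] = []
    for offset, length in dma_list:
        local_store.extend(main_memory[offset:offset + length])
    return local_store


memory = list(range(100))
buffer = gather(memory, [(10, 3), (50, 2)])
print(buffer)  # [10, 11, 12, 50, 51]
```

    On the real hardware the list is walked by the DMA controller rather than by the SPE's own instruction stream, which is what frees the SPE to keep computing.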


    The processor elements

    The general Cell/B.E. architecture-compliant processor may contain one or more PPEs, while the current implementation consists of only one. The PPE contains a 64-bit, dual-threaded PowerPC RISC core and supports a PowerPC virtual memory subsystem. The current PowerPC PPE runs at 3.2 GHz. It has 32 KB level-1 (L1) instruction and data caches and a 512 KB level-2 (L2) unified (instruction and data) cache. It is intended primarily for control processing, running an operating system, managing system resources, and managing SPE threads. It can run existing PowerPC architecture software and is well suited to executing system control code. The instruction set for the PPE is an extended version of the PowerPC instruction set. It includes the vector/SIMD multimedia extensions.

    Each of the eight Synergistic Processor Elements (SPEs) contains a 3.2 GHz Synergistic Processor Unit (SPU) vector processor plus 256 KB of local store that is directly addressable. Computationally, each of these SPEs is capable of producing four floating point results per clock period. Multiplying these figures out (8 SPEs x 3.2 GHz x 4 results per cycle) gives a combined peak compute power of 102.4 gigaflops for the eight SPEs.
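    The arithmetic behind the 102.4 gigaflops figure can be written out directly. The constant names below are illustrative; the values are the ones quoted in the paragraph above.

```python
SPE_COUNT = 8           # SPEs per Cell/B.E. processor
CLOCK_GHZ = 3.2         # SPU clock frequency
RESULTS_PER_CYCLE = 4   # floating point results per SPE per clock period

# GHz x results per cycle gives billions of results per second per SPE.
peak_gflops = SPE_COUNT * CLOCK_GHZ * RESULTS_PER_CYCLE
print(round(peak_gflops, 1))  # 102.4
```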

    The eight identical SPEs are single-instruction multiple-data (SIMD) processor elements that are intended for computationally intensive operations allocated to them by the PPE. Each SPE contains a RISC core, 256 KB software-controlled local store for instructions and data, and a set of 128 registers, each of which is 128 bits wide. The SPEs support a special SIMD instruction set and a unique set of commands for managing DMA transfers and inter-processor messaging and control.

    SPE DMA transfers access main memory using PowerPC effective addresses. As in the PPE, SPE address translation is governed by PowerPC architecture segment and page tables, which are loaded into the SPEs by privileged software running on the PPE. The SPEs are not intended to run an operating system.

    An SPE controls DMA transfers and communicates with the system by means of channels that are implemented in and managed by the SPE's Memory Flow Controller (MFC). The channels are unidirectional message passing interfaces. The PPE and other devices on the system, including other SPEs, can also access this MFC state through the MFC's memory-mapped I/O (MMIO) registers and queues, which are visible to software in the main memory address space.
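    A unidirectional message passing interface can be pictured as a bounded queue with one writing side and one reading side. The class below is a hypothetical illustration of that idea, not a model of the actual MFC channel set; in particular, it returns a status where real hardware might stall the caller.

```python
from collections import deque
from typing import Deque, Optional


class Channel:
    """Hypothetical unidirectional channel: one side writes, the other reads."""

    def __init__(self, capacity: int) -> None:
        self._queue: Deque[int] = deque()
        self._capacity = capacity

    def write(self, message: int) -> bool:
        """Enqueue a message; return False when the channel is full."""
        if len(self._queue) >= self._capacity:
            return False
        self._queue.append(message)
        return True

    def read(self) -> Optional[int]:
        """Dequeue the oldest message, or return None when the channel is empty."""
        return self._queue.popleft() if self._queue else None


# Messages flow one way only and arrive in the order they were written.
to_spe = Channel(capacity=2)
to_spe.write(7)
to_spe.write(9)
```

    Two-way communication in this picture needs two channels, one per direction.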

    The Element Interconnect Bus

    The SPEs, PPE, the Memory Interface Controller (MIC) and broadband interface, and the connection to other Cell/B.E. processors within an SMP are interconnected through a high-speed Element Interconnect Bus (EIB). The EIB is the communication path for commands and data between all processor elements on the Cell/B.E. processor and the on-chip controllers for memory and I/O. The EIB supports fully memory-coherent and symmetric multiprocessor (SMP) operations. A Cell/B.E. architecture processor is designed to be combined coherently with other Cell/B.E. architecture processors to produce a cluster. The Cell/B.E. blade is one such example where two Cell/B.E. processors are combined in a shared memory environment to produce an SMP.

    The EIB consists of four 16-byte-wide data rings, two in each direction, and a central arbiter. In the absence of path contention, each ring can perform three concurrent data transfers. Each ring transfers 128 bytes (one PPE cache line) at a time. Processor elements can drive and receive data simultaneously. The SPEs, PPE, and PIC each have 25.6 GBps links to and from the EIB. In aggregate, the EIB is capable of 204.8 GBps transfers. Figure A-1 shows each of these elements and the order in which the elements are connected to the EIB. The connection order is important to programmers seeking to minimize the latency of transfers on the EIB, where latency is a function of the number of connection hops. Transfers between adjacent elements have the shortest latencies, while transfers between elements separated by multiple hops have the longest latencies.
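    Because rings run in both directions, the hop count between two elements is the shorter of the two ways around the ring. The following sketch computes it under an assumed connection order; the order in RING_ORDER is a placeholder for illustration and should be replaced with the order shown in Figure A-1.

```python
from typing import List

# Assumed connection order, for illustration only; take the real order from Figure A-1.
RING_ORDER = ["PPE", "SPE1", "SPE3", "SPE5", "SPE7", "IOIF1",
              "IOIF0", "SPE6", "SPE4", "SPE2", "SPE0", "MIC"]


def hops(src: str, dst: str, order: List[str] = RING_ORDER) -> int:
    """Fewest hops between two elements when data can travel in either direction."""
    i, j = order.index(src), order.index(dst)
    one_way = (j - i) % len(order)
    return min(one_way, len(order) - one_way)


print(hops("PPE", "SPE1"))   # 1: adjacent elements
print(hops("PPE", "IOIF0"))  # 6: directly opposite on a 12-element ring
```

    A program that pairs communicating SPEs with small hop counts would, by the reasoning above, see lower transfer latency than one that pairs elements on opposite sides of the ring.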

    The EIB's internal maximum bandwidth is 96 bytes per processor clock cycle. Multiple transfers can be in process concurrently on each ring, including more than 100 outstanding DMA memory transfer requests between main storage and the SPEs in either direction. These requests also may include SPE memory to and from the I/O space. The EIB does not support any particular quality of service (QoS) behavior other than to guarantee forward progress. However, a resource allocation management (RAM) facility resides in the EIB. Privileged software can use it to regulate the rate at which resource requesters (the PPE, SPEs, and I/O devices) can use memory and I/O resources.
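    The 96 bytes per cycle and 25.6 GBps figures can be reproduced from the ring parameters given earlier, provided one assumes that the EIB is clocked at half the processor frequency. That clock ratio is an assumption of this sketch and is not stated in the text; the constant names are illustrative.

```python
PROCESSOR_CLOCK_GHZ = 3.2   # processor clock quoted in this appendix
RINGS = 4                   # data rings in the EIB
RING_WIDTH_BYTES = 16       # width of each ring
TRANSFERS_PER_RING = 3      # concurrent transfers per ring without contention

# Assumption (not stated in the text): the EIB runs at half the processor clock.
BUS_CLOCK_DIVIDER = 2
eib_clock_ghz = PROCESSOR_CLOCK_GHZ / BUS_CLOCK_DIVIDER

bytes_per_bus_cycle = RINGS * RING_WIDTH_BYTES * TRANSFERS_PER_RING
bytes_per_processor_cycle = bytes_per_bus_cycle / BUS_CLOCK_DIVIDER
link_gbps = RING_WIDTH_BYTES * eib_clock_ghz   # one 16-byte link, one direction

print(bytes_per_processor_cycle)  # 96.0
print(round(link_gbps, 1))        # 25.6
```

    The 204.8 GBps aggregate figure is not derived here because the text does not say how it is obtained.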

    Memory Flow Controller

    The Memory Flow Controller (MFC) is the data transfer engine. It provides the primary method for data transfer, protection, and synchronization between main storage and the associated local storage, or between the associated local storage and another local storage. An MFC command describes the transfer to be performed. A principal architectural objective of the MFC is to perform these data transfer operations in as fast and as fair a manner as possible, thereby maximizing the overall throughput of the processor.

    Commands that transfer data are called MFC DMA commands. These commands are converted into DMA transfers between the local storage domain and main storage domain. Each MFC can typically support multiple DMA transfers at the same time and can maintain and process multiple MFC commands. To accomplish this, the MFC maintains and processes queues of MFC commands. Each MFC provides one queue for the associated SPU (MFC SPU command queue) and one queue for other processors and devices (MFC proxy command queue). Logically, a set of MFC queues is always associated with each SPU in a Cell/B.E. architecture-compliant processor.
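    The two-queue arrangement can be modeled with a pair of bounded queues. The sketch below is hypothetical throughout: the class name, the queue depths, and the alternating service order are assumptions chosen for illustration, not documented MFC behavior.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple


class MfcQueueModel:
    """Hypothetical model of an MFC with an SPU queue and a proxy queue."""

    def __init__(self, spu_depth: int = 16, proxy_depth: int = 8) -> None:
        # Depths are illustrative defaults, not values taken from the text.
        self._queues: Dict[str, Deque[str]] = {"spu": deque(), "proxy": deque()}
        self._depth = {"spu": spu_depth, "proxy": proxy_depth}
        self._turn = "spu"

    def enqueue(self, source: str, command: str) -> bool:
        """Add a command from 'spu' or 'proxy'; return False if that queue is full."""
        queue = self._queues[source]
        if len(queue) >= self._depth[source]:
            return False
        queue.append(command)
        return True

    def next_command(self) -> Optional[Tuple[str, str]]:
        """Pick the next command, alternating between queues when both have work."""
        other = "proxy" if self._turn == "spu" else "spu"
        for source in (self._turn, other):
            if self._queues[source]:
                self._turn = "proxy" if source == "spu" else "spu"
                return source, self._queues[source].popleft()
        return None


mfc = MfcQueueModel()
mfc.enqueue("spu", "get A")
mfc.enqueue("spu", "get B")
mfc.enqueue("proxy", "put C")
```

    Alternating between the queues is one simple way to express the "fast and fair" objective described above; the real arbitration policy may differ.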

    The on-chip memory interface controller (MIC) provides the interface between the EIB and physical memory. The IBM BladeCenter QS22 uses normal DDR memory and additional hardware logic to implement the MIC. Memory accesses on each interface are 1 to 8, 16, 32, 64, or 128 bytes, with coherent memory ordering. Up to 64 reads and 64 writes can be queued. The resource allocation token manager provides feedback about queue levels. The MIC has multiple software-controlled modes, including fast path mode (for improved latency when command queues are empty), high priority read (for prioritizing SPE reads in front of all other reads), early read (for starting a read before a previous write completes), speculative read, and slow mode (for power management). The MIC implements a closed page controller (bank rows are closed after being read, written, or refreshed), memory initialization, and memory scrubbing.
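    The list of permitted access sizes can be expressed as a small predicate. The function name is hypothetical; the sizes are the ones given in the paragraph above.

```python
def is_valid_mic_access(size_bytes: int) -> bool:
    """Return True for access sizes of 1 to 8, 16, 32, 64, or 128 bytes."""
    return 1 <= size_bytes <= 8 or size_bytes in (16, 32, 64, 128)


print([n for n in range(1, 130) if is_valid_mic_access(n)])
# [1, 2, 3, 4, 5, 6, 7, 8, 16, 32, 64, 128]
```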


    Glossary

    Accelerator General or special purpose processing

    element in a hybrid system. An accelerator might have a multi-level architecture with both host elements and

    accelerator elements. An accelerator, as defined here, is

    a hierarchy with potentially multiple layers of hosts and

    accelerators. An accelerator element is always associated

    with one host. Aside from its direct host, an accelerator

    cannot communicate with other processing elements in

    the system. The memory subsystem of the accelerator

    can be viewed as distinct and independent from a host.

    This is referred to as the subordinate in a cluster

    collective.

    All-reduce operation Output from multiple accelerators

    is reduced and combined into one output.

    API Application Programming Interface. An application

    programming interface defines the syntax and semantics

    for invoking services from within an executing application.

    All APIs are targeted to be available to both FORTRAN

    and C programs, although implementation issues (such

    as whether the FORTRAN routines are simply wrappers

    for calling C routines) are up to the supplier.

    ASCI The name commonly used for the Advanced

    Simulation and Computing program administered by

    Department of Energy (DOE)/National Nuclear Security

    Administration (NNSA).

    ASIC Application Specific Integrated Circuit.

    B/U Bring up.

    CEC Central electronic complex.

    cluster A collection of nodes.

    compute kernel Part of the accelerator code that does

    stateless computation tasks on one piece of input data

    and generates the corresponding output results.

    compute task An accelerator execution image that

    consists of a compute kernel linked with the accelerated

    library framework accelerator runtime library.

    DaCS element A general or special purpose processing

    element in a topology. This refers specifically to the

    physical unit in the topology. A DaCS element can serve

    as a host or an accelerator.

    DDR Double Data Rate. DDR is a technique for

    doubling the switching rate of a circuit by triggering both

    the rising edge and falling edge of a clock signal.

    DE See DaCS element.

    de_id A unique number assigned by the DaCS

    application at run time to a physical processing element in a topology.

    group A group construct specifies a collection of DEs and processes in a system.

    EDRAM Enhanced dynamic random access memory is

    dynamic random access memory that includes a small

    amount of static RAM (SRAM) inside a larger amount of

    DRAM. Performance is enhanced by making sure that

    many of the memory accesses will be to the faster SRAM.

    EMC Electromagnetic compatibility.

    ESD Electrostatic discharge.

    ETH Ethernet, as in adapter or interface.

    FLOP Floating Point OPeration. A measure of

    computational speed frequently used with

    supercomputers.

    FLOP/s FLOPs per second.

    FPU Floating point unit.

    FRU Field replaceable unit.

    GFLOP GigaFLOP. A gigaFLOP/s is a billion (10^9 =

    1,000,000,000) floating point operations per second.

    handle A handle is an abstraction of a data object, usually a pointer to a structure.

    HBCT Hardware-based cycle time.

    host A general purpose processing element in a hybrid

    system. A host can have multiple accelerators attached to

    it. This is often referred to as the master node in a cluster

    collective.

    hybrid A 64-bit x86 system using a Cell Broadband

    Engine (Cell/B.E.) architecture as an accelerator.

    I/O I/O (input/output) describes any operation, program,

    or device that transfers data to or from a computer.

    I/O node The I/O nodes (ION) are responsible, in part,

    for providing I/O services to compute nodes.

    Job A job is a cluster-wide abstraction similar to a

    POSIX session, with certain characteristics and attributes.

    Commands are targeted to be available to manipulate a

    job as a single entity (including kill, modify, query

    characteristics, and query state).


    LANL Los Alamos National Laboratory.

    LINPACK LINPACK is a collection of FORTRAN

    subroutines that analyze and solve linear equations and

    linear least-squares problems.

    main thread The main thread of the application. In

    many cases, Cell/B.E. architecture programs are multi-threaded using multiple SPEs running concurrently.

    A typical scenario is that the application consists of a main

    thread that creates as many SPE threads as needed and

    the application organizes them.

    MFLOP MegaFLOP/s. A megaFLOP/s is a million (10^6

    = 1,000,000) floating point operations per second.

    MPI Message passing interface.

    MPICH2 MPICH is an implementation of the MPI

    standard available from Argonne National Laboratory.

    node A node is a functional unit in the system topology, consisting of one host together with all the accelerators

    connected as children in the topology (this includes any

    children of accelerators).

    parent The parent of a DE is the DE that resides

    immediately above it in the topology tree.

    PPE Power Processor Element: 64-bit Power

    Architecture unit within the CBE that is optimized for

    running operating systems and applications. The PPE

    depends on the SPEs to provide the bulk of the application

    performance.

    PPE PowerPC Processor Element. The general-purpose processor in the Cell/B.E. processor.

    process A process is a standard UNIX-type process

    with a separate address space.

    RAS Reliability, availability, and serviceability.

    service node The service node is responsible, in part,

    for management and control of Roadrunner.

    SIMD Single Instruction Multiple Data. Processing in

    which a single instruction operates on multiple data

    elements that make up a vector data type. Also known as

    vector processing. This style of programming implements data-level parallelism.

    SN See service node.

    SPE Synergistic Processor Element. Eight of these exist

    within the Cell/B.E. processor, optimized for running

    compute-intensive applications, and they are not

    optimized for running an operating system. The SPEs are

    independent processors, each running its own individual

    application programs.

    SPE Synergistic Processor Element. Extends the

    PowerPC 64 architecture by acting as cooperative offload

    processors (synergistic processors), with the direct

    memory access (DMA) and synchronization mechanisms

    to communicate with them (memory flow control), and with

    enhancements for real-time management. There are eight

    SPEs on each Cell/B.E. processor.

    SPMD Single Program Multiple Data. A common style of

    parallel computing. All processes use the same program,

    but each has its own data.

    SPU Synergistic Processor Unit. The part of an SPE

    that execut