redp4477

7/29/2019 redp4477

ibm.com/redbooks

Redpaper

Front cover

Roadrunner: Hardware and Software Overview

Dr. Andrew Komornicki

Gary Mullen-Schulz

Deb Landon

Review components that comprise the Roadrunner supercomputer

Understand Roadrunner hardware components

Learn about Roadrunner system software


International Technical Support Organization

Roadrunner: Hardware and Software Overview

January 2009

REDP-4477-00


Copyright International Business Machines Corporation 2009. All rights reserved.

Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

First Edition (January 2009)

This edition applies to the Roadrunner computing system.

Note: Before using this information and the product it supports, read the information in "Notices".


Contents

Notices
Trademarks
Preface
  The team that wrote this paper
  Become a published author
  Comments welcome
Chapter 1. Roadrunner hardware overview
  1.1 What Roadrunner is
    1.1.1 A historical perspective
  1.2 Roadrunner hardware components
    1.2.1 TriBlade: a unique concept
    1.2.2 IBM BladeCenter QS22
    1.2.3 IBM BladeCenter LS21
  1.3 Rack configurations
    1.3.1 Compute node rack
    1.3.2 Compute node and I/O rack
    1.3.3 Switch and service rack
  1.4 The Connected Unit
  1.5 Networks
    1.5.1 Networks within a Connected Unit cluster
    1.5.2 Networks between Connected Unit clusters
Chapter 2. Roadrunner software overview
  2.1 Roadrunner components
    2.1.1 Compute node (TriBlade)
    2.1.2 I/O node
    2.1.3 Service node
    2.1.4 Master (management) node
  2.2 Cluster boot sequence
    2.2.1 Boot scenarios
  2.3 xCAT
  2.4 How applications are written and executed
    2.4.1 Application core
    2.4.2 Offloading logic
Appendix A. The Cell Broadband Engine (Cell/B.E.) processor
  Background
  The processor elements
  The Element Interconnect Bus
  Memory Flow Controller
Glossary
Abbreviations and acronyms
Related publications
  IBM Redbooks
  Other publications
  Online resources
  How to get Redbooks


Notices

This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to:

IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE:

This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.


Trademarks

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. These and other IBM trademarked terms are marked on their first occurrence in this information with the appropriate symbol (® or ™), indicating US registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml

The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both:

AS/400

BladeCenter

Blue Gene/L

Blue Gene

Domino

GPFS

IBM PowerXCell

IBM

iSeries

PartnerWorld

Power Architecture

POWER3

POWER5

PowerPC

Redbooks

Redbooks (logo)

RS/6000

System i

WebSphere

The following terms are trademarks of other companies:

AMD, AMD Opteron, HyperTransport, the AMD Arrow logo, and combinations thereof, are trademarks of Advanced Micro Devices, Inc.

InfiniBand, and the InfiniBand design marks are trademarks and/or service marks of the InfiniBand Trade Association.

Cell Broadband Engine and Cell/B.E. are trademarks of Sony Computer Entertainment, Inc., in the United States, other countries, or both and is used under license therefrom.

Java, Sun, and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.

Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

Intel Pentium, Intel, Pentium, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Linux is a trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.


Preface

This IBM Redpaper publication provides an overview of the hardware and software components that constitute a Roadrunner system. This includes the actual chips, cards, and so on that comprise a Roadrunner connected unit, as well as the peripheral systems required to run applications. It also includes a brief description of the software used to manage and run the system.

The team that wrote this paper

This publication was produced by a team of IBM specialists working in collaboration with the International Technical Support Organization (ITSO), Rochester Center.

Dr. Andrew Komornicki is an accomplished computational scientist with many years of experience. Prior to joining IBM, his career included independent research, scientific management, government service, as well as work in the computer industry. During the 1990s, he spent two years as a rotator at the National Science Foundation as a program director, where he co-managed the program in computational chemistry. As a computational scientist, he also spent four years as the chair of the allocation committee at the San Diego Supercomputer Center. He has consulted extensively in both the computer and chemical industry. Upon his return from Washington, he spent several years at Sun Microsystems, where he worked as a business development executive tasked with the development of vertical markets in the chemistry and pharmaceutical markets. Three years ago, he joined the Advanced Technical Support group at IBM in the role of supporting scientific computing in the High Performance Computing (HPC) arena. His duties have included support of large scale procurements, benchmarks, and some software contributions.

Gary Mullen-Schulz is a Consulting IT Specialist at the ITSO, Rochester Center. He leads the team responsible for producing Roadrunner documentation, and was the primary author of IBM System Blue Gene Solution: Application Development, SG24-7179. Gary also focuses on Java and WebSphere. He is a Sun Certified Java Programmer, Developer and Architect, and has three issued patents.

Deb Landon is an IBM Certified Senior IT Specialist in the IBM ITSO, Rochester Center. Debbie has been with IBM for 25 years, working first with the S/36 and then the AS/400, which has since evolved to the IBM System i platform. Before joining the ITSO in November of 2000, Debbie was a member of the PartnerWorld for Developers iSeries team, supporting IBM Business Partners in the area of Domino for iSeries.

Thanks to the following people for their contributions to this project:

Bill Brandmeyer
Mike Brutman
Chris Engel
Susan Lee
Dave Limpert
Camille Mann
Andrew Schram
IBM Rochester

Prashant Manikal
Cornell Wright
IBM Austin

Debbie Landon
Wade Wallace
International Technical Support Organization, Rochester Center

Become a published author

Join us for a two- to six-week residency program! Help write a book dealing with specific products or solutions, while getting hands-on experience with leading-edge technologies. You will have the opportunity to team with IBM technical professionals, Business Partners, and Clients.

Your efforts will help increase product acceptance and customer satisfaction. As a bonus, you will develop a network of contacts in IBM development labs, and increase your productivity and marketability.

Learn more about the residency program, browse the residency index, and apply online at:

ibm.com/redbooks/residencies.html

Comments welcome

Your comments are important to us!

We want our papers to be as helpful as possible. Send us your comments about this paper or other IBM Redbooks in one of the following ways:

Use the online Contact us review Redbooks form found at:

ibm.com/redbooks

Send your comments in an e-mail to:

[email protected]

Mail your comments to:

IBM Corporation, International Technical Support Organization
Dept. HYTD Mail Station P099
2455 South Road
Poughkeepsie, NY 12601-5400

Chapter 1. Roadrunner hardware overview

This chapter describes the hardware components that comprise the Roadrunner system. Specifically, this chapter examines the various components that make up a Connected Unit (CU) and then discusses how the CUs are tied together to create a complete Roadrunner cluster.

Note: This IBM Redpaper publication is not intended to be a detailed analysis, but rather a big picture discussion meant to acquaint the reader with the Roadrunner system.


1.1 What Roadrunner is

Roadrunner is the first general purpose computer system to reach the petaflop milestone. On June 10, 2008, IBM announced that this supercomputer had sustained a record-breaking petaflop, or 10^15 floating point operations per second, as measured by the Linpack benchmark. As a result of this achievement, Roadrunner became the world's fastest supercomputer.

Roadrunner was designed, manufactured, and tested at the IBM facility in Rochester, Minnesota. The actual initial petaflop run was done in Poughkeepsie, New York. Its final destination is the Los Alamos National Laboratory (LANL) in New Mexico, which will use this system for a variety of scientific efforts. Most notably, Roadrunner is the latest tool used by the National Nuclear Security Administration (NNSA) to ensure the safety and reliability of the US nuclear weapons stockpile.

This computer system has a number of unique characteristics. The most notable are its sheer size and the fact that it is the first modern heterogeneous system of its kind. As a petascale design, the Roadrunner system has the fewest compute nodes and the fewest cores of any of the outstanding designs considered to date. In a nutshell, the attributes of this system can be summarized with the following characteristics:

Roadrunner is a cluster of clusters.

The fundamental building block of the Roadrunner system is a Connected Unit (CU). As originally designed, Roadrunner would have 18 such connected units, of which 17 have been delivered to LANL for the final system configuration. Roadrunner is made up of approximately 6500 AMD dual-core processors coupled with 12,240 Cell Broadband Engine (Cell/B.E.) processors. The total peak (theoretical) performance of this hybrid system is in excess of 1.3 petaflops. The memory on this system consists of a total of 98 TB equally distributed between the Opteron and the Cell/B.E. nodes.

Each CU is made up of 180 compute nodes and 12 I/O nodes. A unique aspect of the Roadrunner design is the creation of a TriBlade as a fundamental building block for the CU. Each TriBlade consists of an AMD Opteron blade and two Cell/B.E. IBM BladeCenter QS22 blades. The Opteron blade contains two dual-core processors, while the Cell/B.E. blades each contain two new Cell/B.E. eDP (double precision) processors. This architecture allows for a one-to-one mapping of Opteron cores to Cell/B.E. processors. As discussed in 1.2.1, "TriBlade: a unique concept", this design architecture creates a master-subordinate relationship between the Opterons and the Cell/B.E. processors. Each Opteron core is connected to a Cell/B.E. chip through a dedicated PCIe link. Communication between Opteron nodes is accomplished through an extensive InfiniBand network.

Fedora Linux is the operating system of choice for this system.

System management of this cluster of clusters is accomplished with the xCAT cluster management software tools.
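The component counts quoted above can be cross-checked with simple arithmetic. The following is a minimal sketch, not part of any Roadrunner tooling: all constant names are illustrative, the 32 GB per TriBlade figure is taken from 1.2.1, and the number of Opteron processors per I/O node is an assumption made here only to show how a total near 6500 could arise.

```python
# Back-of-the-envelope totals from the figures quoted in this section.
# Names are illustrative only.

CONNECTED_UNITS = 17           # CUs delivered to LANL
COMPUTE_NODES_PER_CU = 180     # one TriBlade per compute node
IO_NODES_PER_CU = 12

OPTERON_CHIPS_PER_TRIBLADE = 2     # two dual-core Opterons per Opteron blade
CORES_PER_OPTERON_CHIP = 2
CELL_BLADES_PER_TRIBLADE = 2       # two QS22 blades
CELLS_PER_CELL_BLADE = 2
MEMORY_GB_PER_TRIBLADE = 16 + 16   # Opteron + Cell/B.E. memory (see 1.2.1)

# ASSUMPTION (not stated in this section): each I/O node also carries
# two dual-core Opteron processors.
ASSUMED_OPTERON_CHIPS_PER_IO_NODE = 2

compute_nodes = CONNECTED_UNITS * COMPUTE_NODES_PER_CU
cell_processors = compute_nodes * CELL_BLADES_PER_TRIBLADE * CELLS_PER_CELL_BLADE
opteron_cores = compute_nodes * OPTERON_CHIPS_PER_TRIBLADE * CORES_PER_OPTERON_CHIP
compute_opteron_chips = compute_nodes * OPTERON_CHIPS_PER_TRIBLADE
io_opteron_chips = (CONNECTED_UNITS * IO_NODES_PER_CU
                    * ASSUMED_OPTERON_CHIPS_PER_IO_NODE)
total_memory_gb = compute_nodes * MEMORY_GB_PER_TRIBLADE

print("compute nodes:", compute_nodes)                    # 3060
print("Cell/B.E. processors:", cell_processors)           # 12240
print("Opteron cores on compute nodes:", opteron_cores)   # 12240
print("Opteron chips (compute + assumed I/O):",
      compute_opteron_chips + io_opteron_chips)           # 6528
print("compute node memory (GB):", total_memory_gb)       # 97920
```

Under these assumptions the Cell/B.E. count matches the 12,240 quoted above, the one-to-one mapping of Opteron cores to Cell/B.E. processors holds, and 97,920 GB is consistent with the stated total of roughly 98 TB.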

It is worthwhile to note some of the physical characteristics of this system. The entire system consists of 278 racks that occupy approximately 5000 square feet of floor space. The weight of this system is approximately 500,000 pounds, or 250 tons. The networking required for both the compute and management tasks consists of 55 miles of InfiniBand (IB) cables. Lastly, even though the system consumes 2.4 MW of power, it is very energy efficient, delivering almost 437 megaflops per watt.
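As a rough consistency check, the power and efficiency figures can be multiplied to see what sustained performance they imply. This is only a sketch using the rounded numbers quoted above; the variable names are illustrative.

```python
# Implied sustained performance from the quoted power draw and efficiency.
# Both inputs are rounded figures from the text, so the result is approximate.

POWER_WATTS = 2.4e6          # 2.4 MW
MEGAFLOPS_PER_WATT = 437     # "almost 437 megaflops per watt"

implied_flops = POWER_WATTS * MEGAFLOPS_PER_WATT * 1e6
implied_petaflops = implied_flops / 1e15

print(f"{implied_petaflops:.2f} petaflops")   # 1.05 petaflops
```

The result, a little over one petaflop, is in line with the sustained Linpack petaflop described at the start of this section.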

Roadrunner holds a unique position in the history of scientific computing. It was over ten years ago that the first teraflop (10^12 floating point operations per second) computer was built. In 1997, a computer consisting of 7000+ Intel Pentium II processors sustained a teraflop on the Linpack benchmark. Roadrunner in 2008 has demonstrated a thousandfold increase in sustained compute performance.

1.1.1 A historical perspective

Machines of Roadrunner's size and capability are the direct result of the scientific needs of the weapons-physics communities. In October of 1992, the United States (U.S.) entered the start of the nuclear testing moratorium that banned all nuclear testing above and below ground. Prior to this moratorium, the US nuclear weapons stockpile was maintained through a combination of underground nuclear testing as well as the development of new weapons systems. When theory and experiment were combined, the Department of Energy could rely on much simpler models than those needed today. Without nuclear testing, weapons scientists must rely much more heavily on sophisticated hardware and software to simulate the complex aging process of both weapons systems and their components.

Established in 1995, the Advanced Simulation and Computing Program (ASC) is an integral part of the Department of Energy's National Nuclear Security Administration (NNSA) shift in emphasis from test-based to simulation-based programs. Under the ASC, computer simulation capabilities are continually developed to analyze and predict the performance, safety, and reliability of nuclear weapons and to certify their functionality. All of this work is integrated into the three weapons laboratories:

Los Alamos National Laboratory (LANL)

Lawrence Livermore National Laboratory (LLNL)

Sandia National Laboratories (SNL)

The predecessor of the ASC was the Accelerated Strategic Computing Initiative (known as the ASCI program), created in direct response to the National Defense Authorization Act of 1994, which required the Department of Energy, in the absence of nuclear testing, to:

Support a focused multifaceted program to increase the understanding of the existing nuclear stockpile.

Predict, detect, and evaluate potential problems associated with the aging of the nuclear stockpile.

Maintain the science and engineering institutions needed to support the national nuclear deterrent, now and in the future.

In response to this mandate, the ASCI program set the following objectives in order to meet the needs and requirements of the Stockpile Stewardship program. These were enumerated to include performance, safety, reliability, and renewal, and were articulated in the ASCI program plan, published by the Department of Energy Defense Programs in January 2000:

Create predictive simulations of nuclear weapon systems to analyze behavior and assess performance in an environment without nuclear testing.

Predict with high certainty the behavior of full weapon systems in complex accident scenarios.

Achieve sufficient, validated predictive simulations to extend the lifetime of the stockpile, predict failure mechanisms, and reduce routine maintenance.

Note: The name Roadrunner was chosen by Los Alamos National Laboratory and is not a product name of the IBM Corporation. This supercomputer was designed and developed for the Department of Energy and Los Alamos National Laboratory under the project name Roadrunner. The project was named after the state bird of New Mexico.



Use virtual prototyping and modeling to understand how new production processes and materials affect performance, safety, reliability, and aging. This understanding helps define the right configuration of production and testing facilities necessary for managing the stockpile throughout the next several decades.

Throughout the history of this program, the IBM Corporation has been a key partner of the Department of Energy's National Nuclear Security Administration (NNSA) program. Here are several historical examples:

In 1998, IBM delivered the ASCI Blue Pacific system, which consisted of 5,856 PowerPC 604e microprocessors. The theoretical peak performance of this system was 3.8 teraflops.

In 2000, IBM delivered the ASCI White system. This computer system was based on the IBM RS/6000 computer, which contained IBM POWER3 nodes running at 375 MHz. This cluster consisted of 512 nodes, each of which had 16 processors for a total of 8,192 processors. The power requirements for this machine consisted of 3 MW for the computer and an additional 3 MW for cooling. The theoretical peak processing power was 12.3 teraflops, with a Linpack performance of 7.2 teraflops.

In 2005, IBM delivered and installed the ASC Purple system at Lawrence Livermore National Laboratory. This system was a 100 teraflop machine and was the successful realization of a goal set a decade earlier (1996) to deliver a 100 teraflop machine within the 2004 to 2005 time frame.

ASC Purple is based on the symmetric shared memory IBM POWER5 architecture. The combined system contains approximately 12,500 POWER5 processors and requires 7.5 MW of electrical power for both the computer and cooling equipment.

Another machine in the ASC program is the IBM System Blue Gene/L machine delivered by IBM to Lawrence Livermore National Laboratory. The Blue Gene architecture is unique in that it allows for a very dense packing of compute nodes. A single Blue Gene rack contains 1024 nodes. On March 24, 2005, the US Department of Energy announced that the Blue Gene/L installation at Lawrence Livermore National Laboratory had achieved a speed of 135 teraflops on a system consisting of 32 racks. On October 27, 2005, Lawrence Livermore National Laboratory and IBM announced that Blue Gene/L had produced a Linpack benchmark that exceeded 280 teraflops. This system consisted of 65,536 compute nodes housed in 64 Blue Gene racks.

As with each of the systems described above, the Roadrunner project is a partnership with IBM. The original contract was signed in September 2006 and projected for three phases. In phase 1, a base system was delivered consisting of Opteron nodes. A hybrid node prototype system was projected for phase 2. The delivery of a hybrid final system, one that would achieve a sustained petaflop in Linpack performance, was projected for phase 3.

For more information, refer to the Advanced Simulation and Computing Web site at:

http://www.sandia.gov/NNSA/ASC/about.html

Note: At the time these goals were set, computers were still at the gigaflop level and were still two years away from the realization of the first teraflop machine.


1.2 Roadrunner hardware components

A simple way to describe the Roadrunner system is that it is a heterogeneous cluster of clusters, each of which is accelerated by Cell/B.E. processors. The unique feature of this design is that each compute node consists of node-attached Cell/B.E. processors, rather than a simple cluster of Cell/B.E. processors. A collection of such compute and I/O nodes, all connected through a high speed switch fabric, makes up a scalable unit known as a Connected Unit (CU).

The fundamental building block of a CU is a compute node, each of which is a TriBlade. The TriBlade is an original design concept created for the Roadrunner system and allows for the integration of Cell/B.E. and Opteron blades. Architecturally, this design allows for the incorporation of these TriBlades into an IBM BladeCenter chassis.

1.2.1 TriBlade: a unique concept

The TriBlade makes up what is called a hybrid compute node. The components of this node consist of an IBM LS21 Opteron blade, two IBM BladeCenter QS22 Cell/B.E. blades, and a fourth blade that houses the communications fabric for the compute node. This expansion blade connects the two QS22 blades through four PCI Express x8 links to the Opteron blade and provides each node with an InfiniBand 4x DDR cluster interconnect. Figure 1-1 shows a schematic of a TriBlade.

Figure 1-1 TriBlade schematic



The node design of the TriBlade offers a number of important characteristics. Since each node is accelerated by Cell/B.E. processors, by design there is one Cell/B.E. chip for each Opteron core. The TriBlade is populated with 16 GB of Opteron memory and an equal amount of Cell/B.E. memory. Since the new Cell/B.E. eDP processors are capable of delivering 102.4 gigaflops of peak performance, each TriBlade node is capable of approximately 400 gigaflops of double precision compute power. For additional information about the Cell/B.E. processor, see Appendix A, "The Cell Broadband Engine (Cell/B.E.) processor".
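The "approximately 400 gigaflops" figure follows directly from the per-processor peak and the four Cell/B.E. processors in a TriBlade. The sketch below works through that arithmetic and, as an extension, scales it to the 17 CUs of 180 compute nodes described in 1.1. The names are illustrative, and only the Cell/B.E. contribution is modeled because this section does not give a peak figure for the Opteron cores.

```python
import math

# Peak double precision throughput contributed by the Cell/B.E. processors.
CELL_PEAK_GFLOPS = 102.4     # per Cell/B.E. eDP processor, as quoted above
CELLS_PER_TRIBLADE = 4       # two QS22 blades with two processors each

triblade_cell_peak_gflops = CELLS_PER_TRIBLADE * CELL_PEAK_GFLOPS

# Scale to the whole machine: 17 CUs of 180 compute nodes each (see 1.1).
COMPUTE_NODES = 17 * 180
system_cell_peak_pflops = COMPUTE_NODES * triblade_cell_peak_gflops / 1e6

print(f"per TriBlade: {triblade_cell_peak_gflops:.1f} gigaflops")      # 409.6
print(f"system, Cell/B.E. only: {system_cell_peak_pflops:.2f} petaflops")  # 1.25
```

The Cell/B.E. processors alone account for about 1.25 petaflops of peak performance. The remainder of the "in excess of 1.3 petaflops" system peak quoted in 1.1 would come from the Opteron processors.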

The design of the TriBlade presents the user with a very specific memory hierarchy. The Opteron processors establish a master-subordinate relationship with the Cell/B.E. processors. Each Opteron blade contains 4 GB of memory per core, resulting in 8 GB of shared memory per socket. The Opteron blade thus contains 16 GB of NUMA shared memory per node.

Each Cell/B.E. processor contains 4 GB of shared memory, resulting in 8 GB of shared memory per blade. In total, the Cell/B.E. blades contain 16 GB of distributed memory per TriBlade node. It is important to note that not only is there a one-to-one mapping of Opteron cores to Cell/B.E. processors, but also each node consists of a distribution of equal memory among each of these components.
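The memory layout described in the last two paragraphs can be captured in a small data structure. This is a hypothetical model written for illustration only; the class and field names are not taken from any Roadrunner software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriBladeMemory:
    """Hypothetical model of the TriBlade memory layout described above."""
    opteron_sockets: int = 2
    cores_per_socket: int = 2
    gb_per_opteron_core: int = 4
    cell_blades: int = 2
    cells_per_blade: int = 2
    gb_per_cell: int = 4

    @property
    def opteron_cores(self) -> int:
        return self.opteron_sockets * self.cores_per_socket

    @property
    def cell_processors(self) -> int:
        return self.cell_blades * self.cells_per_blade

    @property
    def opteron_gb_per_socket(self) -> int:
        return self.cores_per_socket * self.gb_per_opteron_core

    @property
    def opteron_gb(self) -> int:
        # NUMA shared memory on the Opteron blade
        return self.opteron_sockets * self.opteron_gb_per_socket

    @property
    def cell_gb_per_blade(self) -> int:
        return self.cells_per_blade * self.gb_per_cell

    @property
    def cell_gb(self) -> int:
        # Distributed across the two QS22 blades
        return self.cell_blades * self.cell_gb_per_blade

    def is_balanced(self) -> bool:
        """True when cores map one-to-one and memory is split equally."""
        return (self.opteron_cores == self.cell_processors
                and self.opteron_gb == self.cell_gb)


layout = TriBladeMemory()
print(layout.opteron_gb, layout.cell_gb, layout.is_balanced())  # 16 16 True
```

With the default values, each Opteron core and its paired Cell/B.E. processor both have 4 GB available, which is the balance the text emphasizes.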

In order to sustain this compute power, the connectivity within each node consists of four PCI Express x8 links, each capable of 2 GB/s transfer rates, with a 2 microsecond latency. The expansion slot also contains the InfiniBand interconnect, which allows communications to the rest of the cluster. The capability of the InfiniBand 4x DDR interconnect is rated at 2 GB/s with a 2 microsecond latency.
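To get a feel for what these ratings mean for message passing, a simple latency-plus-bandwidth cost model can be applied. The model itself is an assumption introduced here for illustration (real links have protocol overheads that it ignores), 1 GB is taken as 10^9 bytes, and all names are hypothetical.

```python
import math

# Ratings quoted above; they apply to each PCIe x8 link and to the
# InfiniBand 4x DDR interconnect alike.
LINK_BANDWIDTH_BYTES_PER_S = 2e9   # 2 GB/s, assuming 1 GB = 10^9 bytes
LINK_LATENCY_S = 2e-6              # 2 microseconds
PCIE_LINKS_PER_NODE = 4


def transfer_time_s(message_bytes: float,
                    bandwidth: float = LINK_BANDWIDTH_BYTES_PER_S,
                    latency: float = LINK_LATENCY_S) -> float:
    """Estimated one-way transfer time: fixed latency plus size / bandwidth."""
    return latency + message_bytes / bandwidth


# Aggregate PCIe bandwidth inside one TriBlade if all four links are busy.
aggregate_pcie_bytes_per_s = PCIE_LINKS_PER_NODE * LINK_BANDWIDTH_BYTES_PER_S

# Message size at which half of the rated bandwidth is achieved
# (latency and transfer time contribute equally).
half_bandwidth_bytes = LINK_LATENCY_S * LINK_BANDWIDTH_BYTES_PER_S

for size in (1_000, 4_000, 1_000_000):
    t = transfer_time_s(size)
    print(f"{size:>9} bytes: {t * 1e6:8.1f} microseconds, "
          f"{size / t / 1e9:.2f} GB/s effective")
```

Under this model, messages of a few kilobytes are dominated by latency, while transfers of a megabyte or more approach the rated 2 GB/s. This is one reason an application would tend to batch data moved between an Opteron core and its Cell/B.E. processor.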

1.2.2 IBM BladeCenter QS22

The IBM BladeCenter QS22 is based on the IBM PowerXCell 8i processor, a new generation processor based on the Cell/B.E. architecture. In contrast to its predecessors, the QS20 and QS21, the QS22 is based on the second generation processor of the Cell/B.E. architecture and offers single instruction multiple data (SIMD) vector capability along with strong parallelization. It performs double precision floating point operations at five times the speed of the previous generations of Cell/B.E. processors.

Due to its parallel nature and extraordinary computing speed, the QS22 is ideal for use in scientific applications, which is why it was chosen as an integral part of the Roadrunner system by IBM and Los Alamos. The QS22 is a single-wide blade server that offers an SMP with shared memory and two Cell/B.E. processors in a single blade enclosure.

Figure 1-2 on page 7 provides an illustration of the IBM BladeCenter QS22. Features of the QS22 include:

Two 3.2 GHz IBM PowerXCell 8i processors

Up to 32 GB of PC2-6400 800 MHz DDR2 memory

460 single-precision gigaflops per blade (peak)

217 double-precision gigaflops per blade (peak)

Integrated dual 1 Gb Ethernet

IBM Enhanced I/O Bridge chip

Serial Over LAN

The QS22 is based on the 64-bit IBM PowerXCell 8i processor. This processor operates at 3.2 GHz. Each of the eight SIMD vector processors is capable of producing four floating point results per clock period. The memory subsystem on the QS22 consists of eight DIMM slots, enabling configurations from 4 GB up to 32 GB of ECC memory.
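The clock rate and per-clock throughput quoted above are enough to reproduce the 102.4 gigaflops per processor figure used earlier in this chapter. The following Python sketch shows the arithmetic; the names are illustrative.

```python
CLOCK_GHZ = 3.2                  # processor clock
SIMD_UNITS_PER_PROCESSOR = 8     # the eight SIMD vector processors
RESULTS_PER_CLOCK_PER_UNIT = 4   # floating point results per clock period


def simd_peak_gflops() -> float:
    """Peak gigaflops delivered by the SIMD vector processors of one PowerXCell 8i."""
    return SIMD_UNITS_PER_PROCESSOR * RESULTS_PER_CLOCK_PER_UNIT * CLOCK_GHZ


per_processor = simd_peak_gflops()
per_blade = 2 * per_processor    # two processors per QS22
print(per_processor, per_blade)
```

The per-blade value computed here (about 204.8 gigaflops) is slightly below the 217 double-precision gigaflops listed for the blade. The difference presumably comes from the general purpose cores, which this sketch does not model.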



For additional information about the Cell/B.E. processor, see Appendix A, "The Cell Broadband Engine (Cell/B.E.) processor" on page 27.

Figure 1-2 IBM BladeCenter QS22

For more information about the QS22, see the IBM BladeCenter QS22 Web page at:

http://www.ibm.com/systems/bladecenter/hardware/servers/qs22/index.html

1.2.3 IBM BladeCenter LS21

The IBM BladeCenter LS21 is a single width AMD Opteron-based server. The LS21 blade server supports up to two of the dual-core 2200 series AMD Opteron processors combined with up to 32 GB of ECC memory and one fixed SAS HDD.

The memory used in the LS21 is ECC-protected DDR2. The general memory configuration for the LS21 has to follow these guidelines:

A total of eight DIMM slots (four per processor socket). Two of these slots (1 and 2) are preconfigured with a pair of DIMMs.

Because memory is 2-way interleaved, the memory modules must be installed in matched pairs. However, one DIMM pair is not required to match the other in capacity.

A maximum of 32 GB of installed memory is achieved when all DIMM sockets are populated with 4 GB DIMMs.

Important: The QS22 implementation chosen for the Roadrunner system consists of the standard blade populated with 8 GB of DDR2 memory. As with the Opteron blades, all of the Cell/B.E. based blades are diskless.

Important: The configuration used for the Roadrunner system contains two AMD Opteron processors running at 1.8 GHz, 16 GB of ECC memory, and no hard disk. The diskless configuration is an important implementation design, which eliminates additional moving parts and potential points of failure for a system with so many thousands of nodes.


For each installed microprocessor, a set of four DIMM sockets is enabled.
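The LS21 memory guidelines (four DIMM sockets enabled per installed processor, DIMMs installed in matched pairs, pairs allowed to differ in capacity) lend themselves to a small configuration check. The Python sketch below is a hypothetical helper written for illustration, not an IBM tool; the allowed DIMM sizes are the ones listed among the standard LS21 features.

```python
ALLOWED_DIMM_MB = {512, 1024, 2048, 4096}   # DIMM sizes listed for the LS21
SLOTS_PER_SOCKET = 4                        # four DIMM sockets per processor


def validate_ls21_memory(pairs_mb: list[int], installed_processors: int = 2) -> int:
    """Check a proposed DIMM layout and return the total memory in MB.

    pairs_mb holds one entry per matched pair: the size of each DIMM in that pair.
    Describing the layout by pairs enforces the matched-pair rule by construction.
    """
    if installed_processors not in (1, 2):
        raise ValueError("the LS21 supports one or two processors")
    if not pairs_mb:
        raise ValueError("at least one DIMM pair is required")
    enabled_slots = SLOTS_PER_SOCKET * installed_processors
    if 2 * len(pairs_mb) > enabled_slots:
        raise ValueError(f"only {enabled_slots} DIMM sockets are enabled")
    for size in pairs_mb:
        if size not in ALLOWED_DIMM_MB:
            raise ValueError(f"unsupported DIMM size: {size} MB")
    return 2 * sum(pairs_mb)


print(validate_ls21_memory([4096, 4096, 4096, 4096]))  # fully populated blade
print(validate_ls21_memory([1024, 4096]))              # pairs may differ in capacity
```

Four pairs of 4 GB DIMMs give the 32 GB maximum. A 16 GB blade such as the one used in Roadrunner could be reached with four pairs of 2 GB DIMMs, although the actual DIMM layout used is not stated in this paper.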

The processors used in these blades are standard low-power processors. The standard AMD Opteron processors draw a maximum of 95 W. Specially manufactured low-power processors operate at 68 W or less without any performance trade-offs. These power savings at the processor level, combined with the smarter power solution that IBM BladeCenter delivers, make these blades very attractive for installations that are limited by power and cooling resources.

This blade is designed with power management capability to provide the maximum up time possible. In extended thermal conditions, rather than shut down completely or fail, the LS21 automatically reduces the processor frequency to maintain acceptable thermal levels.

A standard LS21 blade server offers these features:

Up to two high-performance, AMD Dual-Core Opteron processors.

A system board containing eight DIMM connectors, supporting 512 MB, 1 GB, 2 GB, or 4 GB DIMMs.

Up to 32 GB of system memory is supported with 4 GB DIMMs.

A SAS controller, supporting one internal SAS drive (36 or 73 GB) and up to three additional SAS drives with optional SIO blade.

Two TCP/IP Offload Engine enabled Gigabit Ethernet controllers (Broadcom 5706S) as standard, with load balancing and failover features.

Support for concurrent KVM (cKVM) and concurrent USB/DVD (cMedia) through Advanced Management Module and an optional daughter card.

Support for a Storage and I/O Expansion (SIO) unit.

Dual Gigabit Ethernet controllers are standard, providing high-speed data transfers and offering TCP/IP Offload Engine support, load-balancing, and failover capabilities. The version used for Roadrunner uses optional InfiniBand expansion cards, allowing high speed communication between nodes. The InfiniBand fabric installed with Roadrunner provides 4x DDR connections that have a theoretical peak of 2 GB per second.
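The 2 GB per second figure can be derived from the link parameters. The values used below (a 5 Gbit/s signalling rate per lane for DDR and 8b/10b line encoding) are general InfiniBand characteristics rather than numbers taken from this paper, so treat the sketch as background arithmetic.

```python
LANES = 4                        # a "4x" link aggregates four lanes
SIGNAL_GBIT_PER_LANE = 5.0       # DDR signalling rate per lane (general InfiniBand fact)
ENCODING_EFFICIENCY = 8 / 10     # 8b/10b encoding: 8 data bits per 10 line bits
BITS_PER_BYTE = 8


def ib_4x_ddr_gbyte_per_s() -> float:
    """Theoretical peak data rate of one 4x DDR link, in GB/s per direction."""
    data_gbit = LANES * SIGNAL_GBIT_PER_LANE * ENCODING_EFFICIENCY
    return data_gbit / BITS_PER_BYTE


print(ib_4x_ddr_gbyte_per_s())
```

The 20 Gbit/s raw signalling rate therefore carries 16 Gbit/s of data, which is the 2 GB per second quoted for each connection.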

Finally, the LS21 supports both the Windows and Linux operating systems. The Roadrunner

implementation uses the Fedora version of Linux.

Figure 1-3 on page 9 shows a schematic of the planar of an LS21.



Figure 1-3 LS21 planar

For more information about the LS21, see the IBM BladeCenter LS21 Web page at:

http://www.ibm.com/systems/bladecenter/hardware/servers/ls21/features.html

1.3 Rack configurations

TriBlades are combined into racks to create assemblies of hybrid compute nodes. In addition, some racks contain other components for other required functionality. There are three different rack types:

Compute node rack

Compute node and I/O rack

Switch and service rack

In general, these racks look very similar. Each can hold a maximum of 12 TriBlades and some hold additional components.


1.3.1 Compute node rack

A compute node rack holds a total of 12 TriBlades, which means it holds 12 LS21s and 24 QS22s. A compute node rack looks similar to the picture shown in Figure 1-4.

Figure 1-4 Compute node rack

1.3.2 Compute node and I/O rack

A compute node and I/O rack contains 12 TriBlades, but also contains an IBM System x3655 (x3655) at the bottom of the rack. The x3655 performs input/output (I/O) services on behalf of the system. A compute node and I/O rack looks similar to the picture shown in Figure 1-5 on page 11.

The x3655 is a new rack-optimized server based on the AMD Opteron dual-core processor. The x3655 supports four processor sockets and 32 memory DIMM slots. The memory is 667 MHz DDR2, in sizes ranging from 512 MB to 4 GB per DIMM. This gives a total capacity of up to 128 GB of main system memory.

Note: The x3655 used in the Roadrunner system supports 16 GB or 32 GB of memory.



Figure 1-5 Compute and I/O node rack

1.3.3 Switch and service rack

The switch and service rack contains no TriBlades. This rack contains a Voltaire Grid Director ISR 9288 switch that is used to manage InfiniBand networking traffic. This is known in Roadrunner as a first-stage switch. See "First-stage InfiniBand switch" on page 14 for more information about its role and function.

You can learn more about the Voltaire switch technology on the Voltaire Web page at:

http://www.voltaire.com/Products/Grid_Backbone_Switches/Voltaire_Grid_Director_ISR_9288

In addition, this rack contains an IBM System x3655, which serves as the service node for the

CU. The functions that the service node performs include the following:

Holds the boot images used to IPL (initial program load) the Opteron and Cell/B.E. blades, as well as the I/O nodes.

IPLs all elements in the CU when instructed to do so by the central management node.


A switch and service rack looks similar to the picture shown in Figure 1-6.

Figure 1-6 Switch and service rack

1.4 The Connected Unit

The Connected Unit (CU) is a core concept in the Roadrunner system. Groups of the various rack configurations discussed in 1.3, "Rack configurations" on page 9 are put together to create a single CU. Table 1-1 lists the racks that comprise a single CU.

Table 1-1 Racks making up a Connected Unit

A CU can be thought of as a base cluster unit. The racks that make up a CU are connected to each other through first-stage switches. CUs are then tied together through second-stage switches to create a larger grid.

The size of a CU is largely determined by the capabilities of the first-stage switch. There are 180 TriBlades in a CU. This number of TriBlades means that a Connected Unit contains 180 AMD Opteron LS21s and 360 IBM BladeCenter QS22s. See Figure 1-7 on page 13.

Rack type                    Number of racks in     Number of TriBlades   Total number of
                             the Connected Unit     in a rack             TriBlades

Compute node rack            3                      12                    36

Compute node and I/O rack    12                     12                    144

Switch and service rack      1                      0                     0

Total                        16                     N/A                   180
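The totals in Table 1-1 can be checked, and extended to blade counts, with a few lines of Python. The dictionary layout and names are illustrative; the rack counts come from the table, and the figure of 17 CUs for the full system is the one given in 1.5.1.

```python
# (number of racks in a CU, TriBlades per rack), from Table 1-1
RACKS = {
    "Compute node rack": (3, 12),
    "Compute node and I/O rack": (12, 12),
    "Switch and service rack": (1, 0),
}
CUS_IN_SYSTEM = 17   # as stated for the full Roadrunner cluster


def cu_totals() -> dict[str, int]:
    """Racks, TriBlades, and blades in one Connected Unit."""
    racks = sum(count for count, _ in RACKS.values())
    triblades = sum(count * per_rack for count, per_rack in RACKS.values())
    return {
        "racks": racks,
        "triblades": triblades,
        "ls21": triblades,        # one LS21 per TriBlade
        "qs22": 2 * triblades,    # two QS22s per TriBlade
    }


totals = cu_totals()
print(totals)
print("TriBlades in the full system:", CUS_IN_SYSTEM * totals["triblades"])
```

One CU therefore holds 16 racks, 180 LS21s, and 360 QS22s, and 17 CUs give 3,060 TriBlades in total.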



Figure 1-7 Racks comprising a Connected Unit

1.5 Networks

Given the high number of racks and nodes in the Roadrunner system, it should come as no surprise that there are several different networks used to tie the system together. This section provides an overview of the different networks involved as well as their functional purpose.

1.5.1 Networks within a Connected Unit cluster

First-stage switches are used to connect all the racks making up a Connected Unit (CU) together and to allow the CU to communicate with the outside world (for example, a file system) and other CUs. The second-stage switches primarily serve as a hub to tie the 17 CUs together into a common computational system.

Note: As previously discussed in this chapter, the entire Roadrunner system or cluster is

comprised of a total of 17 CUs.

[Figure 1-7 labels: Connected Unit; I/O + Compute rack (x12); Compute rack (x3); Switch and Service rack]



First-stage InfiniBand switch

As discussed in 1.3.3, "Switch and service rack" on page 11, each CU contains a rack with a Voltaire Grid Director ISR 9288 switch. This switch allows for 288 different InfiniBand inputs, which are used as shown in Table 1-2.

Table 1-2 Connections in and out from a first-stage switch

InfiniBand Connected Unit

This network creates a fat tree that allows the AMD Opterons to communicate with each other using the industry-standard Message Passing Interface (MPI). It is built on top of the switched InfiniBand network. A fat tree is a special topology invented by Charles E. Leiserson of MIT. Unlike a traditional binary tree, a fat tree has thicker branches the closer you get to the tree's root. In this way, you do not end up with a communications bottleneck at the root of the tree.

Figure 1-8 shows a traditional binary tree. Note that as messages flow up the tree, the single

links to the root node can become a point of congestion.

Figure 1-8 Traditional binary tree

Figure 1-9 on page 15, on the other hand, shows a fat tree. Notice how the number of links between nodes increases as you get closer to the tree's root. The number of links shown is just one example of a fat tree configuration; the actual number may be higher or lower between any two nodes depending on the given requirements.
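The difference between the two topologies can be made concrete by counting the links that cross each level of the tree. The Python sketch below assumes a complete binary tree and, for the fat tree, one particular rule (each upward edge carries as many links as there are leaves beneath it). As noted above, real fat trees may use other link counts, so this is only one illustrative configuration.

```python
def aggregate_uplinks(depth: int, fat: bool) -> list[int]:
    """Total links crossing each level of a complete binary tree, leaves first.

    depth is the number of levels of edges; the tree has 2**depth leaves.
    In the plain tree every edge is a single link. In the fat tree an edge
    above a subtree with k leaves carries k links (an assumed doubling rule).
    """
    totals = []
    for height in range(depth):
        nodes_at_height = 2 ** (depth - height)
        links_per_edge = 2 ** height if fat else 1
        totals.append(nodes_at_height * links_per_edge)
    return totals


print("binary tree:", aggregate_uplinks(3, fat=False))
print("fat tree:   ", aggregate_uplinks(3, fat=True))
```

With eight leaves, the plain binary tree narrows from eight links at the leaves to two at the root, while the fat tree keeps eight links at every level. That constant cross-section is what avoids the congestion at the root.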

Component                                  Number of      Purpose
                                           connections

TriBlades InfiniBand link                  180            Connects the AMD Opteron nodes together to allow them to participate in a network.

InfiniBand links to second-stage switch    96             Allows the CUs to be tied together into a single network.

InfiniBand links to I/O nodes              12             Provides the hybrid compute nodes access to the file system for application input and output.

Total                                      288
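As a quick consistency check, the port budget of the first-stage switch can be added up in Python. The sketch assumes one InfiniBand link for each of the twelve I/O nodes in a CU, as described for the file system LAN; the names are illustrative.

```python
SWITCH_PORTS = 288   # connections offered by the first-stage switch

# Links per purpose; the I/O figure assumes one link per I/O node, twelve per CU.
LINKS = {
    "TriBlades": 180,
    "second-stage switches": 96,
    "I/O nodes": 12,
}


def ports_used() -> int:
    """Total first-stage switch connections consumed by one Connected Unit."""
    return sum(LINKS.values())


print(ports_used(), "of", SWITCH_PORTS, "connections used")
print("compute links per uplink:", LINKS["TriBlades"] / LINKS["second-stage switches"])
```

Under that assumption every connection on the switch is accounted for, and there are 1.875 TriBlade links for each uplink to the second stage.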



Figure 1-9 Fat tree

Fat tree topologies are becoming quite popular in InfiniBand clusters. For more information about fat trees and their usage with InfiniBand, see the article Performance Modeling of Subnet Management on Fat Tree InfiniBand Networks using OpenSM, which is available at the following Web site:

http://nowlab.cse.ohio-state.edu/publications/conf-papers/2005/vishnu-fastos05.pdf

10 Gigabit Ethernet file system LAN

Every CU has twelve I/O nodes, each of which has a single InfiniBand connection to the CU's InfiniBand switch. This allows the hybrid compute nodes (TriBlades) to retrieve and pass data to the I/O nodes over the InfiniBand network. The file system is connected through the I/O nodes, each of which has two 10 Gb links to the file system LAN.

Gigabit Ethernet Control VLAN (CVLAN)

The 1 Gb Ethernet control VLAN is used to perform vital program and node control functions within each CU, such as the Message Passing Interface (MPI) information required for program operation and communication.

Important: This VLAN is used exclusively for control traffic; no user data flows across this network.

Gigabit Ethernet Management VLAN (MVLAN)

The 1 Gb Ethernet Management VLAN is used to perform vital system management functions within each CU, such as passing the required operating system boot images from the CU's service node to the processors on the hybrid compute nodes and I/O nodes in order to IPL them.

PCI Express link between LS21 and Cell/B.E. blades

Each AMD Opteron has a one-to-one master-subordinate relationship with a Cell/B.E. processor. Although the Opterons participate in MPI communications with other Opteron nodes and access the file system through the I/O nodes, the Cell/B.E. processors only communicate with their master Opteron.



Gigabit Ethernet management VLAN (MVLAN)

The 1 Gb Ethernet management VLAN is the grid-wide system management network. It is used for booting, system control, and status determination operations between the management nodes and the various managed elements throughout the cluster. The MVLAN does not have direct network access to the internals of a CU (for example, the hybrid compute nodes and I/O nodes). Management operations to those nodes occur from the MVLAN to the CU's MVLAN through the service node to the desired target.

No user or application data flows across the MVLAN. Only system management and control traffic flows across this network.



Chapter 2. Roadrunner software overview

This chapter briefly describes the software used to run applications on the Roadrunner

system.


Note: This IBM Redpaper publication is not intended to be a detailed analysis, but rather a

big picture discussion meant to acquaint the reader with the Roadrunner system.



2.1 Roadrunner components

This section provides a brief explanation of the software used to run on the various components that comprise a Roadrunner system.

2.1.1 Compute node (TriBlade)

As described in 1.2.1, "TriBlade: a unique concept" on page 5, a TriBlade is made up of one IBM BladeCenter LS21 blade and two IBM BladeCenter QS22 blades. Each of these runs its own operating system image, but shares a common user application.

The following is the software that runs on the various components of the TriBlade:

AMD Opteron LS21 for IBM BladeCenter

Each LS21 is standard except for the fact that it is diskless. The operating system is Fedora Linux. Since it is diskless, it is booted up from its Connected Unit's service node.

IBM BladeCenter QS22

Each QS22 is standard except for the fact that it is diskless. The operating system is Fedora Linux. Since it is diskless, it is booted up from its Connected Unit's service node.

Broadcom HT-2100 (PCIe adapter)

The dual Opteron host blade (LS21) is connected to the two QS22s through a PCI Express (PCIe) interconnect. Two HyperTransport x16 connections from the LS21 blade drive an expansion card containing two Broadcom HT-2100 HyperTransport to PCI Express bridge chips. Each Broadcom HT-2100 drives two PCI Express x8 connections to the two Axon Southbridge chips on one of the Cell Broadband Engine (Cell/B.E.) blades (QS22). This provides a dedicated PCIe x8 connection to each Cell/B.E. processor.

The PCIe interconnect is supported by a low-level device driver that provides direct memory access (DMA) and a remote memory mapped small message area (SMA). DMA operations can be started by calls to the device driver from programs on either the LS21 or the QS22. The device driver initiates the DMA operation using a DMA controller in the Axon Southbridge. The small message area provides regions of memory that can be accessed remotely by user space instructions without a context switch to the kernel or device driver interaction. There is a unique device driver instance on both the Opteron and the Cell/B.E. blade for each Axon Southbridge. A virtual Ethernet driver (also replicated per Axon) supports point-to-point communications between the Opteron and each Cell/B.E. processor.

Note: From an IBM BladeCenter Advanced Management Module (AMM) perspective, the TriBlade still appears as separate blades. In other words, it appears as one LS21 and two QS22s. The logical grouping of the LS21 and QS22s is handled through the xCAT management tools. See 2.3, "xCAT" on page 23 for more information.

2.1.2 I/O node

As mentioned previously in 1.3.2, "Compute node and I/O rack" on page 10, each I/O node is an IBM System x3655 server. I/O nodes are diskless and serve as pipes to the external file system across the 10 Gigabit Ethernet file system LAN.

Each I/O node runs Fedora Linux as its operating system. Since the node is diskless, it is booted up from its Connected Unit's service node. The I/O node will run either the IBM GPFS or Panasas PanFS client to communicate with the external file system, depending on what file system software is running there.

2.1.3 Service node

Service nodes are standard IBM System x3655 Opteron-based servers and are diskless. There is one dedicated service node per Connected Unit, so this image can be updated directly from the master node over the management network (MVLAN) described in "Gigabit Ethernet management VLAN (MVLAN)" on page 17.

Service nodes obtain copies of the boot images for the I/O nodes and compute nodes from the master node. These images are refreshed on an as-needed basis. The images are loaded over the CVLAN (see "Gigabit Ethernet Control VLAN (CVLAN)" on page 15).

2.1.4 Master (management) node

The master node is a standard IBM System x3655 Opteron-based server and is booted from the local disk. The master node runs Fedora Linux.

2.2 Cluster boot sequence

The initial booting of the nodes is complicated by two factors in the Roadrunner system:

All of the nodes except for the master node are diskless, so they must boot over the network.

There are over 3,000 total nodes and 10,000 operating system images that need to be

installed and booted.

There will be times when the entire system needs to be booted, and there will be times when only parts of the system need to be booted (while the rest of the system remains available and running). This places two distinct demands on the management network:

It must be able to boot the entire system without causing timeouts on the management

network such that no boot progress is being made.

It must be able to boot substantial portions of the system without interfering with any status and control operations that are occurring on the running portion of the system.

Since the majority of nodes are diskless, a scalable way to move the boot images to each of the nodes is required. To this end, a hierarchy of management nodes has been created.

The solution is to use the Bootstrap Protocol (BOOTP) together with Trivial File Transfer Protocol (TFTP) subnet multicast to boot the diskless LS21 Opteron and QS22 Cell/B.E. blades. This method provides a broadcast of the common boot image that the LS21s and QS22s can pick up midstream. The multicast repeats until all requesting blades have received all packets of the boot image. There are unique boot images for the various configurations. The boot images are stored on the Connected Unit service nodes and multicast over the CVLAN. This method significantly reduces network traffic compared to sending individual boot images to each processor.
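The benefit of a repeating multicast that blades can join midstream can be illustrated with a small simulation. The Python sketch below is an abstract model only: it does not implement BOOTP or TFTP, it assumes a lossless network, and the function and parameter names are hypothetical.

```python
def multicast_boot(num_packets: int, join_ticks: list[int]) -> int:
    """Simulate a cyclic multicast of a boot image.

    The server sends packets 0..num_packets-1 over and over, one per tick.
    Each blade starts listening at its own tick (possibly midstream) and keeps
    every packet it hears. Returns the number of packets sent until every
    blade holds the complete image.
    """
    received = [set() for _ in join_ticks]
    tick = 0
    while any(len(have) < num_packets for have in received):
        packet = tick % num_packets
        for blade, joined_at in enumerate(join_ticks):
            if tick >= joined_at:
                received[blade].add(packet)
        tick += 1
    return tick


image_packets = 100
joins = [0, 30, 99]                      # three blades requesting at different times
sent = multicast_boot(image_packets, joins)
unicast = image_packets * len(joins)     # cost of sending each blade its own copy
print(sent, "packets multicast versus", unicast, "packets unicast")
```

In this example the last blade joins at tick 99 and needs one full pass of the image from that point, so 199 packets are sent in total, compared with 300 for individual transfers. The saving grows with the number of blades that boot together.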

Note: There is only one master node for the entire Roadrunner cluster.



2.2.1 Boot scenarios

This section describes in more detail what happens when a cluster (or parts of the cluster) are booted up.

Master (management) node (tier 1)

This node is installed and booted with the required management node image. The management node boots from the local disk.

Service nodes (tier 2)

There is only one service node per Connected Unit, so this image can be updated directly from the master node over the MVLAN at any time (not just at service node bring-up). Once booted, service nodes obtain copies of the boot images for the I/O nodes and compute nodes from the master node. These images are refreshed on an as-needed basis. The images are loaded over the CVLAN through the multicast boot process, which allows for far less network traffic and parallel image download.

I/O nodes

Once successfully booted, the service nodes begin transferring the required boot images down the CVLAN. The I/O nodes are standard Opteron Linux servers and are booted diskless with the required image. I/O nodes are connected to the 10 GB Global File System (GFS) to service the compute nodes' file access requests. The image required to boot the I/O node is received from its local service node through the CVLAN network.

Compute nodes (TriBlades)

Compute nodes (TriBlades) are either accelerated or non-accelerated, with the difference being that accelerated nodes will have their associated Cell/B.E. blades powered on and booted, while Cell/B.E. blades on the non-accelerated nodes are left powered off.

There is no need for a heartbeat function between the Opteron core and its associated Cell Broadband Engine processor. The general health of both resources is known by the xCAT software and reflected in the resource manager. Communication health status between the two resources is monitored and understood on demand by the application running on the Opteron side. The Data Communication and Synchronization (DaCS) API is notified of errors from the Cell/B.E. processor concerning any data transfer or communications request. Failures of these transactions are reported by the software structures. If the PCI Express connection between the Opteron and Cell/B.E. processor fails, an appropriate error event is posted and the application terminated.

Given the PCI Express interface between the Opteron and Cell/B.E. processor, it is necessary to boot the Cell/B.E. processor portions of a compute node (in the accelerated node pool) before the Opteron portion. This allows the proper initialization of the interconnect firmware and PCI Express device drivers. The Cell/B.E. PCI Express device drivers listen for the necessary firmware/driver handshakes from the LS21 and Broadcom HT-2100 (PCIe adapter) expansion card to establish communication. The process of ensuring the correct booting sequence is controlled by the xCAT software.
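The ordering rules in this section can be summarized in a short Python sketch. It is not xCAT code and does not use any xCAT interface; the labels and function name are hypothetical, and the relative order of the two QS22 blades is arbitrary here because the text only requires that both precede the LS21.

```python
# Boot tiers as described in this section (labels are illustrative).
BOOT_TIERS = ["master node", "service nodes", "I/O nodes and compute nodes"]


def triblade_boot_order(accelerated: bool) -> list[str]:
    """Return the order in which the blades of one TriBlade are booted."""
    if accelerated:
        # Cell/B.E. blades first, so that their PCI Express device drivers
        # are listening when the LS21 and its HT-2100 expansion card come up.
        return ["QS22 (first)", "QS22 (second)", "LS21"]
    # Non-accelerated node: the Cell/B.E. blades are left powered off.
    return ["LS21"]


print(" -> ".join(BOOT_TIERS))
print(triblade_boot_order(accelerated=True))
print(triblade_boot_order(accelerated=False))
```

The key constraint is that, on an accelerated node, the LS21 always appears last in the sequence.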

Note: There is no low power mode for the Cell/B.E. blades, so some sort of standby

mode is not possible. They are either on (accelerated) or off (non-accelerated).



2.3 xCAT

Installing and managing a cluster is a complicated task, and doing everything manually quickly becomes unwieldy. The development of xCAT grew out of the desire to automate many of the repetitive steps involved in installing and configuring a Linux cluster.

The development of xCAT is driven by customer requirements. Because xCAT itself is written entirely using scripting languages such as Korn shell, Perl, and Expect, an administrator can easily modify the scripts should the need arise.

The main functions of xCAT are grouped as follows:

Automated installation

Hardware management and monitoring

Software administration

Remote console support for text and graphics

For more information about xCAT, refer to the xCAT Web site at:

http://xcat.sourceforge.net

2.4 How applications are written and executed

This section discusses how applications are written and executed on the Roadrunner system. The unique architecture employed means that applications are designed and written in a revolutionary new manner compared to previous parallel processing applications.

2.4.1 Application core

The bulk of the user application, including initiation and termination, runs on the AMD Opteron processor (LS21). It uses Message Passing Interface (MPI) APIs to communicate with the other Opteron processors the application is running on in a typical single program, multiple data (SPMD) fashion. The number of compute nodes used to run the application is determined at program launch.

The MPI implementation of Roadrunner is based on the open-source Open MPI Project and therefore is standard MPI. In this regard, Roadrunner applications are similar to other typical MPI applications (such as those that run on the IBM Blue Gene solution). Where Roadrunner differs in the sphere of application architecture is in how its Cell/B.E. accelerators are employed. At any point in the application flow, the MPI application running on each Opteron can offload computationally-complex logic to its subordinate Cell/B.E. processor.
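The host-and-accelerator pattern can be illustrated with an analogy in plain Python. The sketch below uses neither MPI nor the DaCS or ALF libraries: a single-worker thread pool stands in for the one Cell/B.E. processor subordinate to an Opteron core, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def accelerator_kernel(block: list[int]) -> int:
    """Stand-in for compute-intensive logic that would run on the Cell/B.E."""
    return sum(x * x for x in block)


def host_step(data: list[int], n_blocks: int = 4) -> tuple[int, int]:
    """Offload blocks of work asynchronously, keep working, then collect results."""
    blocks = [data[i::n_blocks] for i in range(n_blocks)]
    # One worker models the one-to-one Opteron core to Cell/B.E. relationship.
    with ThreadPoolExecutor(max_workers=1) as accelerator:
        pending = [accelerator.submit(accelerator_kernel, b) for b in blocks]
        host_work = len(data)                    # the host continues with other work
        partials = [p.result() for p in pending]  # synchronize when results are needed
    return sum(partials), host_work


print(host_step(list(range(10))))
```

The point of the pattern is the gap between submitting work and collecting results, during which the host side remains free to communicate with other nodes or prepare the next offload.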

For more information about Open MPI Project, refer to the Open MPI: Open Source High Performance Computing Web site at:

http://www.open-mpi.org/


2.4.2 Offloading logic

Determining which logic routines get offloaded to the Cell/B.E. processor, and when that occurs, is one of the most challenging tasks facing an application developer of the Roadrunner system. But it is this very challenge that makes the opportunity for incredibly high application performance possible.

There are two primary techniques that a developer can employ to actually perform asynchronous offloads of logic. This section briefly describes each, and points to areas where you can find more detailed information.

DaCS

The Data Communication and Synchronization (DaCS) library provides a set of services that ease the development of applications and application frameworks in a heterogeneous multi-tiered system (for example, a 64-bit x86 system (x86_64) and one or more Cell/B.E. processor systems). The DaCS services are implemented as a set of APIs providing an architecturally neutral layer for application developers on a variety of multi-core systems. One of the key abstractions that further differentiates DaCS from other programming frameworks is a hierarchical topology of processing elements, each referred to as a DaCS Element (DE).

Within the hierarchy, each DE can serve one or both of the following roles:

A general purpose processing element, acting as a supervisor, control, or master processor. This type of element usually runs a full operating system and manages jobs running on other DEs. This is referred to as a Host Element (HE).

A general or special purpose processing element running tasks assigned by an HE. This is referred to as an Accelerator Element (AE).

DaCS for Hybrid (DaCSH) is an implementation of the DaCS API specification that supports the connection of an HE on an x86_64 system to one or more AEs on Cell/B.E. processors. In SDK 3.0, DaCSH only supports the use of sockets to connect the HE with the AEs. Direct access to the Synergistic Processor Elements (SPEs) on the Cell/B.E. processor is not provided. Instead, DaCSH provides access to the PowerPC Processor Element (PPE), allowing a PPE program to be started and stopped and allowing data transfer between the x86_64 system and the PPE. The SPEs can only be used by the program running on the PPE.

For more information about DaCS, see IBM Software Development Kit for Multicore Acceleration Data Communication and Synchronization Library for Hybrid-x86 Programmer's Guide and API Reference, SC33-8408.

ALF

The Accelerated Library Framework (ALF) provides a programming environment for data and task parallel applications and libraries. The ALF API provides you with a set of interfaces to simplify library development on heterogeneous multi-core systems. You can use the provided framework to offload the computationally intensive work to the accelerators. More complex applications can be developed by combining several function offload libraries. You can also choose to implement applications directly to the ALF interface.

ALF supports the multiple-program-multiple-data (MPMD) programming model, where multiple programs can be scheduled to run on multiple accelerator elements at the same time.



The ALF functionality includes:

Data transfer management

Parallel task management

Double buffering

Dynamic load balancing for data parallel tasks

With the provided API, you can also create descriptions for multiple compute tasks and define their execution orders by defining task dependency. Task parallelism is accomplished by having tasks without direct or indirect dependencies between them. The ALF run time provides an optimal parallel scheduling scheme for the tasks based on given dependencies.
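The idea of deriving parallelism from task dependencies can be sketched with Python's standard graphlib module. This is not the ALF API and makes no claim about how the ALF run time schedules work; the task names are hypothetical and the sketch only groups tasks into waves that could run concurrently.

```python
from graphlib import TopologicalSorter


def schedule_waves(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into waves; tasks in the same wave have no dependencies
    on each other and could run on different accelerators at the same time.

    dependencies maps each task to the set of tasks it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())   # every task whose inputs are complete
        waves.append(ready)
        sorter.done(*ready)                  # mark the whole wave as finished
    return waves


tasks = {
    "load": set(),
    "fft_a": {"load"},
    "fft_b": {"load"},
    "combine": {"fft_a", "fft_b"},
}
print(schedule_waves(tasks))
```

Here the two transform tasks depend only on the load step, so they land in the same wave, while the final combine step must wait for both.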

For more information about ALF, see IBM Software Development Kit for Multicore Acceleration Accelerated Library Framework for Hybrid-x86 Programmer's Guide and API Reference, SC33-8406.



Appendix A. The Cell Broadband Engine (Cell/B.E.) processor

Of all of the components that make up the Roadrunner cluster, the Cell/B.E. processor holds a special place in that it provides extraordinary compute power that can be harnessed from a single multi-core chip. This appendix provides a brief architectural overview of the current Cell/B.E. processor, the motivation for some of its features, as well as the general properties of this unique processor.

For additional information about the Cell/B.E. processor, refer to the following resources:

Programming the Cell Broadband Engine Architecture: Examples and Best Practices, SG24-7575

IBM Software Development Kit for Multicore Acceleration Data Communication and Synchronization Library for Cell/B.E. Programmer's Guide and API Reference, SC33-8407

IBM Software Development Kit for Multicore Acceleration Accelerated Library Framework for Cell/B.E. Programmer's Guide and API Reference, SC33-8333

The Cell/B.E. project at IBM Research, found at:

http://www.research.ibm.com/cell/

The Cell/B.E. resource center, found at:

http://www.ibm.com/developerworks/power/cell/


Note: Be aware that ample and extensive resources exist on the Cell/B.E. processor, the Cell/B.E. architecture, as well as tutorials for the interested programmer. It is not the intention of this publication to reproduce all of this information in this short section. We have utilized these extensive resources in our attempt to provide this summary.


Background

The Cell/B.E. architecture is designed to support a very broad range of applications. The first implementation is a single-chip multiprocessor with nine processor elements operating on a shared memory model, as shown in Figure A-1. In this respect, the Cell/B.E. processor extends current trends in PC and server processors. The most distinguishing feature of the Cell/B.E. processor is that, although all processor elements can share or access all available memory, their function is specialized into two types: the PowerPC Processor Element (PPE) and the Synergistic Processor Element (SPE). The Cell/B.E. processor has one PPE and eight SPEs.

The architectural definition of the physical Cell/B.E. architecture-compliant processor is much more general than the initial implementation. A Cell/B.E. architecture-compliant processor can consist of a single chip, a multi-chip module (or modules), or multiple single-chip modules on a system board or other second-level package. The design depends on the technology used and performance characteristics of the intended design.

Logically, the Cell/B.E. architecture defines four separate types of functional components:

PowerPC Processor Element (PPE)
Synergistic Processor Unit (SPU)
Memory Flow Controller (MFC)
Internal Interrupt Controller (IIC)

The computational units in the Cell/B.E. architecture-compliant processor are the PPEs and the SPUs. Each SPU must have a dedicated local storage, a dedicated MFC with its associated memory management unit (MMU), and a replacement management table (RMT). The combination of these components is called a Synergistic Processor Element (SPE).

Figure A-1 Cell/B.E. schematic

The first type of processor element, the PPE, contains a 64-bit PowerPC architecture core. It complies with the 64-bit PowerPC architecture and can run 32-bit and 64-bit applications. The second type of processor element, the SPE, is designed to run computationally intensive single-instruction multiple-data (SIMD)/vector applications. It is not intended to run a full-featured operating system. The SPEs are independent processor elements, each running its own individual application programs or threads. Each SPE has full access to shared memory, including the memory-mapped I/O space implemented by multiple DMA units. There is a mutual dependence between the PPE and the SPEs. The SPEs depend on the PPE to run the operating system and, in many cases, the top-level thread control for a user code. The PPE depends on the SPEs to provide the bulk of compute power.


The SPEs are designed to be programmed in high-level languages. They support a rich instruction set that includes extensive SIMD functionality. However, like conventional processors with SIMD extensions, use of SIMD data is preferred but not mandatory. For programming convenience, the PPE also supports the standard PowerPC architecture instruction set and the SIMD/vector multimedia extensions. To an application programmer, the Cell/B.E. processor looks like a single-core, dual-threaded processor with eight additional cores, each having its own local store. The PPE is more adept than the SPEs at control-intensive tasks and quicker at task switching. The SPEs are more adept at compute-intensive tasks and slower than the PPE at task switching. Either processor element is capable of both types of functions. This specialization is a significant factor in accounting for the order-of-magnitude improvement in peak computational performance and power efficiency that the Cell/B.E. processor achieves over conventional processors.

The more significant difference between the SPE and PPE lies in how they access memory. The PPE accesses memory with load and store instructions that move data between main storage and a set of registers, the contents of which may be cached. PPE memory access is like that of a conventional processor. The SPEs, in contrast, access main storage with direct memory access (DMA) commands that move data and instructions between main storage and a private local memory, called a local store (LS). An SPE's instruction fetches and load/store instructions access a private local store rather than the shared main memory.

This three-level organization of storage (registers, LS, and main memory), with asynchronous DMA transfers between LS and main memory, is a radical break from conventional architecture and programming models. It explicitly overlaps computation with the transfers of data and instructions that feed computation and that store the results of computation in main memory.
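The benefit of running transfers in parallel with computation can be illustrated with a simple timing model. The following Python sketch is not SPE code. It assumes a double-buffering scheme in which the transfer of the next block proceeds while the current block is being processed, and the function names and cycle counts are hypothetical values chosen only for illustration.

```python
def serial_time(blocks, transfer, compute):
    """Each block is fetched and then processed, with no overlap."""
    return blocks * (transfer + compute)


def double_buffered_time(blocks, transfer, compute):
    """The fetch of block i+1 overlaps with the computation on block i."""
    if blocks == 0:
        return 0
    # The first fetch cannot be hidden, and the last computation has
    # no following fetch to overlap with. In between, each step costs
    # whichever of the two activities is slower.
    return transfer + (blocks - 1) * max(transfer, compute) + compute


# Hypothetical workload: 100 blocks, 400 cycles per transfer,
# 600 cycles of computation per block.
print(serial_time(100, 400, 600))
print(double_buffered_time(100, 400, 600))
```

In this model, the transfer cost disappears from the total whenever computation on a block takes at least as long as fetching the next one, apart from the initial fetch.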

A primary motivation for this new memory model is the realization that over the past twenty-five years, memory latency, as measured in processor cycles, has increased by almost three orders of magnitude. The result is that application performance is, in most cases, limited by memory latency rather than peak compute capability, as measured by processor clock speeds. When a sequential program performs a load instruction that encounters a cache miss, program execution comes to a halt for several hundred cycles (techniques such as hardware threading attempt to hide these stalls, but they do not help single-threaded applications). Compared to this penalty, the few cycles that it takes to set up a DMA transfer for an SPE are a much better trade-off, especially considering that the DMA controller of each of the eight SPEs can maintain up to 16 DMA transfers in flight simultaneously. Anticipating DMA needs efficiently can provide just-in-time delivery of data, which may reduce this stall or eliminate it entirely. Conventional processors, even with deep and costly speculation, manage to get, at best, a handful of independent memory accesses in flight.

One of the SPE's DMA transfer methods supports a list (such as a scatter-gather list) of DMA transfers that is constructed in an SPE's local store, so that the SPE's DMA controller can process the list asynchronously while the SPE operates on previously transferred data. In several cases, this approach of accessing memory has improved application performance by almost two orders of magnitude when compared to the performance of conventional processors. This is significantly more than one would expect from the peak performance ratio (approximately 10x) between the Cell/B.E. processor and conventional PC processors.
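The list-based transfer can be pictured as follows. This is a minimal Python sketch of the gather idea only, not the MFC interface. `ListElement`, `gather`, and the byte string that stands in for main memory are hypothetical names introduced for illustration, and the copy here is synchronous, whereas the SPE's DMA controller processes its list asynchronously.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ListElement:
    """One entry of a hypothetical DMA list: where to read and how much."""
    effective_address: int
    size: int


def gather(main_memory: bytes, dma_list: list) -> bytearray:
    """Copy each listed region of main memory, back to back, into one buffer."""
    local_store = bytearray()
    for element in dma_list:
        start = element.effective_address
        end = start + element.size
        if start < 0 or end > len(main_memory):
            raise ValueError("list element falls outside main memory")
        local_store += main_memory[start:end]
    return local_store


# Three scattered regions of a 64-byte "main memory" are gathered
# into one contiguous local buffer.
memory = bytes(range(64))
regions = [ListElement(0, 4), ListElement(16, 2), ListElement(60, 4)]
buffer = gather(memory, regions)
print(list(buffer))
```

The point of the list form is that one request describes many non-contiguous regions, so the processor that built the list is free to keep computing while the regions are collected.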



The processor elements

The general Cell/B.E. architecture-compliant processor may contain one or more PPEs, while the current implementation consists of only one. The PPE contains a 64-bit, dual-threaded PowerPC RISC core and supports a PowerPC virtual memory subsystem. The current PowerPC PPE runs at 3.2 GHz. It has 32 KB level-1 (L1) instruction and data caches and a 512 KB level-2 (L2) unified (instruction and data) cache. It is intended primarily for control processing, running an operating system, managing system resources, and managing SPE threads. It can run existing PowerPC architecture software and is well suited to executing system control code. The instruction set for the PPE is an extended version of the PowerPC instruction set. It includes the vector/SIMD multimedia extensions.

Each of the eight Synergistic Processor Elements (SPEs) contains a 3.2 GHz Synergistic Processor Unit (SPU) vector processor plus 256 KB of directly addressable local store. Computationally, each of these SPEs is capable of producing four floating point results per clock period. Simple arithmetic shows that the eight SPEs together have a peak compute power of 102.4 gigaflops.
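The arithmetic behind the peak figure can be written out explicitly. The short Python sketch below uses only the numbers quoted in this section (eight SPEs, a 3.2 GHz clock, and four floating point results per clock period). The variable names are illustrative.

```python
SPE_COUNT = 8            # SPEs per Cell/B.E. processor
CLOCK_GHZ = 3.2          # SPU clock rate
RESULTS_PER_CLOCK = 4    # floating point results per SPE per clock period

per_spe_gflops = CLOCK_GHZ * RESULTS_PER_CLOCK
peak_gflops = SPE_COUNT * per_spe_gflops

print(f"{per_spe_gflops:.1f} gigaflops per SPE")
print(f"{peak_gflops:.1f} gigaflops for all eight SPEs")
```

This is a peak rate. It counts only the SPEs and assumes that every SPE produces its four results in every clock period.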

The eight identical SPEs are single-instruction multiple-data (SIMD) processor elements that are intended for computationally intensive operations allocated to them by the PPE. Each SPE contains a RISC core, a 256 KB software-controlled local store for instructions and data, and a set of 128 registers, each of which is 128 bits wide. The SPEs support a special SIMD instruction set and a unique set of commands for managing DMA transfers and inter-processor messaging and control.

SPE DMA transfers access main memory using PowerPC effective addresses. As in the PPE, SPE address translation is governed by PowerPC architecture segment and page tables, which are loaded into the SPEs by privileged software running on the PPE. The SPEs are not intended to run an operating system.

An SPE controls DMA transfers and communicates with the system by means of channels that are implemented in and managed by the SPE's Memory Flow Controller (MFC). The channels are unidirectional message-passing interfaces. The PPE and other devices on the system, including other SPEs, can also access this MFC state through the MFC's memory-mapped I/O (MMIO) registers and queues, which are visible to software in the main memory address space.

The Element Interconnect Bus

The SPEs, PPE, the Memory Interface Controller (MIC) and broadband interface, and the connection to other Cell/B.E. processors within an SMP are interconnected through a high-speed Element Interconnect Bus (EIB). The EIB is the communication path for commands and data between all processor elements on the Cell/B.E. processor and the on-chip controllers for memory and I/O. The EIB supports full memory-coherent and symmetric multiprocessor (SMP) operations. A Cell/B.E. architecture processor is designed to be combined coherently with other Cell/B.E. architecture processors to produce a cluster. The Cell/B.E. blade is one such example, where two Cell/B.E. processors are combined in a shared memory environment to produce an SMP.

The EIB consists of four 16-byte-wide data rings, two in each direction, and a central arbiter. In the absence of path contention, each ring can perform three concurrent data transfers. Each ring transfers 128 bytes (one PPE cache line) at a time. Processor elements can drive and receive data simultaneously. The SPEs, PPE, and PIC each have 25.6 GBps links to and from the EIB. In aggregate, the EIB is capable of 204.8 GBps transfers. Figure A-1 on page 28 shows each of these elements and the order in which the elements are connected to the EIB. The connection order is important to programmers seeking to minimize the latency of transfers on the EIB, where latency is a function of the number of connection hops. Transfers between adjacent elements have the shortest latencies, while transfers between elements separated by multiple hops have the longest latencies.
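The effect of connection order on hop count can be sketched as follows. The element order below is a placeholder, not the order shown in Figure A-1, and `hops` is a hypothetical helper. The sketch assumes only that the elements sit on a ring and that data can travel in either direction, as the description of two data rings in each direction suggests.

```python
# Placeholder connection order, NOT the order shown in Figure A-1.
ORDER = ["PPE", "SPE0", "SPE1", "SPE2", "SPE3", "SPE4",
         "SPE5", "SPE6", "SPE7", "MIC", "IO0", "IO1"]


def hops(order, source, target):
    """Fewest ring hops between two elements, travelling in either direction."""
    i, j = order.index(source), order.index(target)
    one_way = (j - i) % len(order)
    other_way = (i - j) % len(order)
    return min(one_way, other_way)


for target in ("SPE0", "SPE5", "IO1"):
    print("PPE ->", target, hops(ORDER, "PPE", target), "hop(s)")
```

Under these assumptions, placing communicating tasks on neighboring elements keeps their transfers at one hop, while the worst case is half the ring.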

The EIB's internal maximum bandwidth is 96 bytes per processor clock cycle. Multiple transfers can be in process concurrently on each ring, including more than 100 outstanding DMA memory transfer requests between main storage and the SPEs in either direction. These requests also may include SPE memory to and from the I/O space. The EIB does not support any particular quality of service (QoS) behavior other than to guarantee forward progress. However, a resource allocation management (RAM) facility resides in the EIB. Privileged software can use it to regulate the rate at which resource requesters (the PPE, SPEs, and I/O devices) can use memory and I/O resources.
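The bandwidth figures quoted for the EIB can be related to one another with a short calculation. The Python sketch below rests on one assumption that this paper does not state: that the EIB runs at half the processor clock rate. Under that assumption, the ring width, ring count, and transfers per ring given above reproduce both the 25.6 GBps link rate and the 96 bytes per processor clock cycle.

```python
PROCESSOR_CLOCK_GHZ = 3.2
# Assumption (not stated in this paper): the bus runs at half the
# processor clock rate.
BUS_CLOCK_GHZ = PROCESSOR_CLOCK_GHZ / 2

RINGS = 4                 # data rings
TRANSFERS_PER_RING = 3    # concurrent transfers per ring without contention
RING_WIDTH_BYTES = 16     # width of each data ring

# One 16-byte link moving data once per bus cycle.
link_gbps = RING_WIDTH_BYTES * BUS_CLOCK_GHZ

# All rings fully occupied, counted per bus cycle and per processor cycle.
bytes_per_bus_cycle = RINGS * TRANSFERS_PER_RING * RING_WIDTH_BYTES
bytes_per_processor_cycle = (
    bytes_per_bus_cycle * BUS_CLOCK_GHZ / PROCESSOR_CLOCK_GHZ
)

print(f"link rate: {link_gbps:.1f} GBps")
print(f"internal maximum: {bytes_per_processor_cycle:.1f} bytes per processor cycle")
```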

Memory Flow Controller

The Memory Flow Controller (MFC) is the data transfer engine. It provides the primary method for data transfer, protection, and synchronization between main storage and the associated local storage, or between the associated local storage and another local storage. An MFC command describes the transfer to be performed. A principal architectural objective of the MFC is to perform these data transfer operations in as fast and as fair a manner as possible, thereby maximizing the overall throughput of the processor.

Commands that transfer data are called MFC DMA commands. These commands are converted into DMA transfers between the local storage domain and main storage domain. Each MFC can typically support multiple DMA transfers at the same time and can maintain and process multiple MFC commands. To accomplish this, the MFC maintains and processes queues of MFC commands. Each MFC provides one queue for the associated SPU (MFC SPU command queue) and one queue for other processors and devices (MFC proxy command queue). Logically, a set of MFC queues is always associated with each SPU in a Cell/B.E. architecture-compliant processor.
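The two-queue arrangement can be modeled in a few lines. The Python sketch below is a toy model, not the MFC programming interface: the class name, method names, queue depths, and the policy of serving the SPU queue first are all assumptions made for illustration.

```python
from collections import deque


class MfcQueues:
    """Toy model of one MFC's two command queues (depths are assumptions)."""

    def __init__(self, spu_depth=16, proxy_depth=8):
        self.spu_queue = deque()      # commands from the associated SPU
        self.proxy_queue = deque()    # commands from the PPE or other devices
        self.spu_depth = spu_depth
        self.proxy_depth = proxy_depth

    def enqueue(self, source, command):
        """Queue a command and return False when the chosen queue is full."""
        if source == "spu":
            queue, depth = self.spu_queue, self.spu_depth
        else:
            queue, depth = self.proxy_queue, self.proxy_depth
        if len(queue) >= depth:
            return False
        queue.append(command)
        return True

    def next_command(self):
        """Return the oldest pending command, SPU queue first (arbitrary policy)."""
        for queue in (self.spu_queue, self.proxy_queue):
            if queue:
                return queue.popleft()
        return None


mfc = MfcQueues()
mfc.enqueue("spu", "get block 0")
mfc.enqueue("proxy", "put results")
print(mfc.next_command())
print(mfc.next_command())
```

Keeping the queues separate means that commands issued by the PPE or another device do not consume entries that the SPU needs for its own transfers.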

The on-chip memory interface controller (MIC) provides the interface between the EIB and physical memory. The IBM BladeCenter QS22 uses normal DDR memory and additional hardware logic to implement the MIC. Memory accesses on each interface are 1 to 8, 16, 32, 64, or 128 bytes, with coherent memory ordering. Up to 64 reads and 64 writes can be queued. The resource allocation token manager provides feedback about queue levels. The MIC has multiple software-controlled modes, including fast path mode (for improved latency when command queues are empty), high priority read (for prioritizing SPE reads in front of all other reads), early read (for starting a read before a previous write completes), speculative read, and slow mode (for power management). The MIC implements a closed page controller (bank rows are closed after being read, written, or refreshed), memory initialization, and memory scrubbing.



Glossary

Accelerator General or special purpose processing element in a hybrid system. An accelerator might have a multi-level architecture with both host elements and accelerator elements. An accelerator, as defined here, is a hierarchy with potentially multiple layers of hosts and accelerators. An accelerator element is always associated with one host. Aside from its direct host, an accelerator cannot communicate with other processing elements in the system. The memory subsystem of the accelerator can be viewed as distinct and independent from a host. This is referred to as the subordinate in a cluster collective.

All-reduce operation Output from multiple accelerators is reduced and combined into one output.

API Application Programming Interface. An application programming interface defines the syntax and semantics for invoking services from within an executing application. All APIs are targeted to be available to both FORTRAN and C programs, although implementation issues (such as whether the FORTRAN routines are simply wrappers for calling C routines) are up to the supplier.

ASCI The name commonly used for the Advanced Simulation and Computing program administered by the Department of Energy (DOE)/National Nuclear Security Administration (NNSA).

ASIC Application Specific Integrated Circuit.

B/U Bring up.

CEC Central electronic complex.

cluster A collection of nodes.

compute kernel Part of the accelerator code that does stateless computation tasks on one piece of input data and generates the corresponding output results.

compute task An accelerator execution image that consists of a compute kernel linked with the accelerated library framework accelerator runtime library.

DaCS element A general or special purpose processing element in a topology. This refers specifically to the physical unit in the topology. A DaCS element can serve as a host or an accelerator.

DDR Double Data Rate. DDR is a technique for doubling the switching rate of a circuit by triggering on both the rising edge and falling edge of a clock signal.

DE See DaCS element.

de_id A unique number assigned by the DaCS application at run time to a physical processing element in a topology.

group A group construct specifies a collection of DEs and processes in a system.

EDRAM Enhanced dynamic random access memory is dynamic random access memory that includes a small amount of static RAM (SRAM) inside a larger amount of DRAM. Performance is enhanced by making sure that many of the memory accesses will be to the faster SRAM.

EMC Electromagnetic compatibility.

ESD Electrostatic discharge.

ETH Ethernet, as in adapter or interface.

FLOP Floating Point OPeration. A measure of computation speed frequently used with supercomputers.

FLOP/s FLOPs per second.

FPU Floating point unit.

FRU Field replaceable unit.

GFLOP GigaFLOP. A gigaFLOP/s is a billion (10^9 = 1,000,000,000) floating point operations per second.

handle A handle is an abstraction of a data object, usually a pointer to a structure.

HBCT Hardware-based cycle time.

host A general purpose processing element in a hybrid system. A host can have multiple accelerators attached to it. This is often referred to as the master node in a cluster collective.

hybrid A 64-bit x86 system using a Cell Broadband Engine (Cell/B.E.) architecture as an accelerator.

I/O I/O (input/output) describes any operation, program, or device that transfers data to or from a computer.

I/O node The I/O nodes (ION) are responsible, in part, for providing I/O services to compute nodes.

Job A job is a cluster-wide abstraction similar to a POSIX session, with certain characteristics and attributes. Commands are targeted to be available to manipulate a job as a single entity (including kill, modify, query characteristics, and query state).



LANL Los Alamos National Laboratory.

LINPACK LINPACK is a collection of FORTRAN subroutines that analyze and solve linear equations and linear least-squares problems.

main thread The main thread of the application. In many cases, Cell/B.E. architecture programs are multi-threaded using multiple SPEs running concurrently. A typical scenario is that the application consists of a main thread that creates as many SPE threads as needed and the application organizes them.

MFLOP MegaFLOP/s. A megaFLOP/s is a million (10^6 = 1,000,000) floating point operations per second.

MPI Message passing interface.

MPICH2 MPICH is an implementation of the MPI standard available from Argonne National Laboratory.

node A node is a functional unit in the system topology, consisting of one host together with all the accelerators connected as children in the topology (this includes any children of accelerators).

parent The parent of a DE is the DE that resides immediately above it in the topology tree.

PPE Power Processor Element: 64-bit Power Architecture unit within the CBE that is optimized for running operating systems and applications. The PPE depends on the SPEs to provide the bulk of the application performance.

PPE PowerPC Processor Element. The general-purpose processor in the Cell/B.E. processor.

process A process is a standard UNIX-type process with a separate address space.

RAS Reliability, availability, and serviceability.

service node The service node is responsible, in part, for management and control of Roadrunner.

SIMD Single Instruction Multiple Data. Processing in which a single instruction operates on multiple data elements that make up a vector data type. Also known as vector processing. This style of programming implements data-level parallelism.

SN See service node.

SPE Synergistic Processor Element. Eight of these exist within the Cell/B.E. processor, optimized for running compute-intensive applications, and they are not optimized for running an operating system. The SPEs are independent processors, each running its own individual application programs.

SPE Synergistic Processor Element. Extends the PowerPC 64 architecture by acting as cooperative offload processors (synergistic processors), with the direct memory access (DMA) and synchronization mechanisms to communicate with them (memory flow control), and with enhancements for real-time management. There are eight SPEs on each Cell/B.E. processor.

SPMD Single Program Multiple Data. A common style of parallel computing. All processes use the same program, but each has its own data.

SPU Synergistic Processor Unit. The part of an SPE

that execut
