ibm.com/redbooks
Redpaper
Front cover
Roadrunner: Hardware and Software Overview
Dr. Andrew Komornicki
Gary Mullen-Schulz
Deb Landon
Review components that comprise the Roadrunner supercomputer
Understand Roadrunner hardware components
Learn about Roadrunner system software
International Technical Support Organization
Roadrunner: Hardware and Software Overview
January 2009
REDP-4477-00
Copyright International Business Machines Corporation 2009. All rights reserved.
Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule
Contract with IBM Corp.
First Edition (January 2009)
This edition applies to the Roadrunner computing system.
Note: Before using this information and the product it supports, read the information in "Notices" on page v.
Contents
Notices
  Trademarks
Preface
  The team that wrote this paper
  Become a published author
  Comments welcome
Chapter 1. Roadrunner hardware overview
  1.1 What Roadrunner is
    1.1.1 A historical perspective
  1.2 Roadrunner hardware components
    1.2.1 TriBlade: a unique concept
    1.2.2 IBM BladeCenter QS22
    1.2.3 IBM BladeCenter LS21
  1.3 Rack configurations
    1.3.1 Compute node rack
    1.3.2 Compute node and I/O rack
    1.3.3 Switch and service rack
  1.4 The Connected Unit
  1.5 Networks
    1.5.1 Networks within a Connected Unit cluster
    1.5.2 Networks between Connected Unit clusters
Chapter 2. Roadrunner software overview
  2.1 Roadrunner components
    2.1.1 Compute node (TriBlade)
    2.1.2 I/O node
    2.1.3 Service node
    2.1.4 Master (management) node
  2.2 Cluster boot sequence
    2.2.1 Boot scenarios
  2.3 xCAT
  2.4 How applications are written and executed
    2.4.1 Application core
    2.4.2 Offloading logic
Appendix A. The Cell Broadband Engine (Cell/B.E.) processor
  Background
  The processor elements
  The Element Interconnect Bus
  Memory Flow Controller
Glossary
Abbreviations and acronyms
Related publications
  IBM Redbooks
  Other publications
  Online resources
  How to get Redbooks
Notices
This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.
Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. These and other IBM trademarked terms are marked on their first occurrence in this information with the appropriate symbol (® or ™), indicating US registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml
The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both:
AS/400
BladeCenter
Blue Gene/L
Blue Gene
Domino
GPFS
IBM PowerXCell
IBM
iSeries
PartnerWorld
Power Architecture
POWER3
POWER5
PowerPC
Redbooks
Redbooks (logo)
RS/6000
System i
WebSphere
The following terms are trademarks of other companies:
AMD, AMD Opteron, HyperTransport, the AMD Arrow logo, and combinations thereof, are trademarks of Advanced Micro Devices, Inc.
InfiniBand, and the InfiniBand design marks are trademarks and/or service marks of the InfiniBand Trade Association.
Cell Broadband Engine and Cell/B.E. are trademarks of Sony Computer Entertainment, Inc., in the United States, other countries, or both and is used under license therefrom.
Java, Sun, and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.
Intel Pentium, Intel, Pentium, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States, other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.
Preface
This IBM Redpaper publication provides an overview of the hardware and software components that constitute a Roadrunner system. This includes the actual chips, cards, and so on that comprise a Roadrunner connected unit, as well as the peripheral systems required to run applications. It also includes a brief description of the software used to manage and run the system.
The team that wrote this paper
This publication was produced by a team of IBM specialists working in collaboration with the International Technical Support Organization (ITSO), Rochester Center.
Dr. Andrew Komornicki is an accomplished computational scientist with many years of experience. Prior to joining IBM, his career included independent research, scientific management, government service, and work in the computer industry. During the 1990s, he spent two years as a rotator at the National Science Foundation as a program director, where he co-managed the program in computational chemistry. As a computational scientist, he also spent four years as the chair of the allocation committee at the San Diego Supercomputer Center. He has consulted extensively in both the computer and chemical industries. Upon his return from Washington, he spent several years at Sun Microsystems, where he worked as a business development executive tasked with developing vertical markets in the chemistry and pharmaceutical sectors. Three years ago, he joined the Advanced Technical Support group at IBM in the role of supporting scientific computing in the High Performance Computing (HPC) arena. His duties have included support of large-scale procurements, benchmarks, and some software contributions.
Gary Mullen-Schulz is a Consulting IT Specialist at the ITSO, Rochester Center. He leads the team responsible for producing Roadrunner documentation, and was the primary author of IBM System Blue Gene Solution: Application Development, SG24-7179. Gary also focuses on Java and WebSphere. He is a Sun Certified Java Programmer, Developer and Architect, and has three issued patents.
Deb Landon is an IBM Certified Senior IT Specialist in the IBM ITSO, Rochester Center. Debbie has been with IBM for 25 years, working first with the S/36 and then the AS/400, which has since evolved to the IBM System i platform. Before joining the ITSO in November of 2000, Debbie was a member of the PartnerWorld for Developers iSeries team, supporting IBM Business Partners in the area of Domino for iSeries.
Thanks to the following people for their contributions to this project:
Bill Brandmeyer
Mike Brutman
Chris Engel
Susan Lee
Dave Limpert
Camille Mann
Andrew Schram
IBM Rochester
Prashant Manikal
Cornell Wright
IBM Austin
Debbie Landon
Wade Wallace
International Technical Support Organization, Rochester Center
Become a published author
Join us for a two- to six-week residency program! Help write a book dealing with specific products or solutions, while getting hands-on experience with leading-edge technologies. You will have the opportunity to team with IBM technical professionals, Business Partners, and Clients.
Your efforts will help increase product acceptance and customer satisfaction. As a bonus, you will develop a network of contacts in IBM development labs, and increase your productivity and marketability.
Learn more about the residency program, browse the residency index, and apply online at:
ibm.com/redbooks/residencies.html
Comments welcome
Your comments are important to us!
We want our papers to be as helpful as possible. Send us your comments about this paper or other IBM Redbooks in one of the following ways:
- Use the online "Contact us" review Redbooks form found at:
  ibm.com/redbooks
- Send your comments in an e-mail to:
- Mail your comments to:
  IBM Corporation, International Technical Support Organization
  Dept. HYTD Mail Station P099
  2455 South Road
  Poughkeepsie, NY 12601-5400
Chapter 1. Roadrunner hardware overview
This chapter describes the hardware components that comprise the Roadrunner system. Specifically, this chapter examines the various components that make up a Connected Unit (CU) and then discusses how the CUs are tied together to create a complete Roadrunner cluster.
Note: This IBM Redpaper publication is not intended to be a detailed analysis, but rather a big picture discussion meant to acquaint the reader with the Roadrunner system.
1.1 What Roadrunner is
Roadrunner is the first general purpose computer system to reach the petaflop milestone. On June 10, 2008, IBM announced that this supercomputer had sustained a record-breaking petaflop, or 10^15 floating point operations per second, as measured by the Linpack benchmark. As a result of this achievement, Roadrunner became the world's fastest supercomputer.
Roadrunner was designed, manufactured, and tested at the IBM facility in Rochester, Minnesota. The actual initial petaflop run was done in Poughkeepsie, New York. Its final destination is the Los Alamos National Laboratory (LANL) in New Mexico, which will use this system for a variety of scientific efforts. Most notably, Roadrunner is the latest tool used by the National Nuclear Security Administration (NNSA) to ensure the safety and reliability of the US nuclear weapons stockpile.
This computer system has a number of unique characteristics. The most notable are its sheer size and the fact that it is the first modern heterogeneous system of its kind. As a petascale design, the Roadrunner system has the fewest compute nodes and the fewest cores of any of the outstanding designs considered to date. In a nutshell, the system can be summarized by the following characteristics:
- Roadrunner is a cluster of clusters.
- The fundamental building block of the Roadrunner system is a Connected Unit (CU). As originally designed, Roadrunner would have 18 such connected units, of which 17 have been delivered to LANL for the final system configuration. Roadrunner is made up of approximately 6500 AMD dual-core processors coupled with 12,240 Cell Broadband Engine (Cell/B.E.) processors. The total peak (theoretical) performance of this hybrid system is in excess of 1.3 petaflops. The memory on this system consists of a total of 98 TB equally distributed between the Opteron and the Cell/B.E. nodes.
- Each CU is made up of 180 compute nodes and 12 I/O nodes. A unique aspect of the Roadrunner design is the creation of a TriBlade as a fundamental building block for the CU. Each TriBlade consists of an AMD Opteron blade and two Cell/B.E. IBM BladeCenter QS22 blades. The Opteron blade contains two dual-core processors, while the Cell/B.E. blades each contain two new Cell/B.E. eDP (double precision) processors. This architecture allows for a one-to-one mapping of Opteron cores to Cell/B.E. processors. As discussed in 1.2.1, "TriBlade: a unique concept" on page 5, this design architecture creates a master-subordinate relationship between the Opterons and the Cell/B.E. processors. Each Opteron core is connected to a Cell/B.E. chip through a dedicated PCIe link. Communication between Opteron nodes is accomplished through an extensive InfiniBand network.
- Fedora Linux is the operating system of choice for this system.
- System management of this cluster of clusters is accomplished with the xCAT cluster management software tools.
It is worthwhile to note some of the physical characteristics of this system. The entire system consists of 278 racks that occupy approximately 5000 square feet of floor space. The weight of this system is approximately 500,000 pounds, or 250 tons. The networking required for both the compute and management tasks consists of 55 miles of InfiniBand (IB) cables. Lastly, even though the system consumes 2.4 MW of power, it is very energy efficient, delivering almost 437 megaflops per watt.
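The headline figures in this section can be cross-checked with simple arithmetic. The following Python sketch is not part of the original paper. It uses only numbers quoted in this chapter, plus one labeled assumption: that each I/O node carries two Opteron processors, which the text does not state.

```python
# Figures quoted in this chapter.
CONNECTED_UNITS = 17          # CUs delivered to LANL
COMPUTE_NODES_PER_CU = 180    # each compute node is one TriBlade
IO_NODES_PER_CU = 12
OPTERONS_PER_TRIBLADE = 2     # one Opteron blade with two dual-core processors
CORES_PER_OPTERON = 2
CELLS_PER_TRIBLADE = 4        # two QS22 blades, two Cell/B.E. processors each
MEMORY_GB_PER_TRIBLADE = 32   # 16 GB Opteron + 16 GB Cell/B.E. (section 1.2.1)
MEGAFLOPS_PER_WATT = 437
POWER_MEGAWATTS = 2.4

# ASSUMPTION (not stated in the text): each I/O node has two Opteron processors.
OPTERONS_PER_IO_NODE = 2

triblades = CONNECTED_UNITS * COMPUTE_NODES_PER_CU
cell_processors = triblades * CELLS_PER_TRIBLADE
compute_opterons = triblades * OPTERONS_PER_TRIBLADE
io_opterons = CONNECTED_UNITS * IO_NODES_PER_CU * OPTERONS_PER_IO_NODE
opteron_cores = compute_opterons * CORES_PER_OPTERON

# Decimal terabytes of memory across all compute nodes.
total_memory_tb = triblades * MEMORY_GB_PER_TRIBLADE / 1000

# Sustained rate implied by the quoted efficiency and power draw, in petaflops.
implied_sustained_pflops = (MEGAFLOPS_PER_WATT * 1e6) * (POWER_MEGAWATTS * 1e6) / 1e15

print(f"TriBlades:            {triblades}")
print(f"Cell/B.E. processors: {cell_processors}")
print(f"Opteron processors:   {compute_opterons + io_opterons}")
print(f"Memory (TB):          {total_memory_tb:.2f}")
print(f"Implied petaflops:    {implied_sustained_pflops:.3f}")
```

Under these assumptions the sketch gives 3,060 TriBlades and 12,240 Cell/B.E. processors, which matches the text, and one Opteron core per Cell/B.E. processor on the compute nodes. The Opteron total lands near the quoted "approximately 6500" only if the I/O node assumption holds.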
Roadrunner holds a unique position in the history of scientific computing. It was over ten years ago that the first teraflop (10^12 floating point operations per second) computer was built. In 1997, a computer consisting of 7000+ Intel Pentium II processors sustained a teraflop on the Linpack benchmark. Roadrunner in 2008 has demonstrated a thousandfold increase in sustained compute performance.
1.1.1 A historical perspective
Machines of Roadrunner's size and capability are the direct result of the scientific needs of the weapons-physics communities. In October of 1992, the United States (U.S.) entered a nuclear testing moratorium that banned all nuclear testing above and below ground. Prior to this moratorium, the US nuclear weapons stockpile was maintained through a combination of underground nuclear testing and the development of new weapons systems. When theory and experiment were combined, the Department of Energy could rely on much simpler models than those needed today. Without nuclear testing, weapons scientists must rely much more heavily on sophisticated hardware and software to simulate the complex aging process of both weapons systems and their components.
Established in 1995, the Advanced Simulation and Computing Program (ASC) is an integral part of the Department of Energy's National Nuclear Security Administration (NNSA) shift in emphasis from test-based to simulation-based programs. Under the ASC, computer simulation capabilities are continually developed to analyze and predict the performance, safety, and reliability of nuclear weapons and to certify their functionality. All of this work is integrated into the three weapons laboratories:
- Los Alamos National Laboratory (LANL)
- Lawrence Livermore National Laboratory (LLNL)
- Sandia National Laboratories (SNL)
The predecessor of the ASC was the Accelerated Strategic Computing Initiative (known as the ASCI program), created in direct response to the National Defense Authorization Act of 1994, which required the Department of Energy, in the absence of nuclear testing, to:
- Support a focused multifaceted program to increase the understanding of the existing nuclear stockpile.
- Predict, detect, and evaluate potential problems associated with the aging of the nuclear stockpile.
- Maintain the science and engineering institutions needed to support the national nuclear deterrent, now and in the future.
In response to this mandate, the ASCI program set the following objectives in order to meet the needs and requirements of the Stockpile Stewardship program. These were enumerated to include performance, safety, reliability, and renewal, and were articulated in the ASCI program plan, published by the Department of Energy Defense Programs in January 2000:
- Create predictive simulations of nuclear weapon systems to analyze behavior and assess performance in an environment without nuclear testing.
- Predict with high certainty the behavior of full weapon systems in complex accident scenarios.
- Achieve sufficient, validated predictive simulations to extend the lifetime of the stockpile, predict failure mechanisms, and reduce routine maintenance.
- Use virtual prototyping and modeling to understand how new production processes and materials affect performance, safety, reliability, and aging. This understanding helps define the right configuration of production and testing facilities necessary for managing the stockpile throughout the next several decades.
Note: The name Roadrunner was chosen by Los Alamos National Laboratory and is not a product name of the IBM Corporation. This supercomputer was designed and developed for the Department of Energy and Los Alamos National Laboratory under the project name Roadrunner. The project was named after the state bird of New Mexico.
Throughout the history of this program, the IBM Corporation has been a key partner of the Department of Energy's National Nuclear Security Administration (NNSA) program. Here are several historical examples:
- In 1998, IBM delivered the ASCI Blue Pacific system, which consisted of 5,856 PowerPC 604e microprocessors. The theoretical peak performance of this system was 3.8 teraflops.
- In 2000, IBM delivered the ASCI White system. This computer system was based on the IBM RS/6000 computer, which contained IBM POWER3 nodes running at 375 MHz. This cluster consisted of 512 nodes, each of which had 16 processors, for a total of 8,192 processors. This machine required 3 MW of power for the computer and an additional 3 MW for cooling. The theoretical peak processing power was 12.3 teraflops, with a Linpack performance of 7.2 teraflops.
- In 2005, IBM delivered and installed the ASC Purple system at Lawrence Livermore National Laboratory. This system was a 100 teraflop machine and was the successful realization of a goal set a decade earlier (1996) to deliver a 100 teraflop machine within the 2004 to 2005 time frame.
  ASC Purple is based on the symmetric shared memory IBM POWER5 architecture. The combined system contains approximately 12,500 POWER5 processors and requires 7.5 MW of electrical power for both the computer and cooling equipment.
- Another machine in the ASC program is the IBM System Blue Gene/L machine delivered by IBM to Lawrence Livermore National Laboratory. The Blue Gene architecture is unique in that it allows for a very dense packing of compute nodes. A single Blue Gene rack contains 1024 nodes. On March 24, 2005, the US Department of Energy announced that the Blue Gene/L installation at Lawrence Livermore National Laboratory had achieved a speed of 135 teraflops on a system consisting of 32 racks. On October 27, 2005, Lawrence Livermore National Laboratory and IBM announced that Blue Gene/L had produced a Linpack benchmark that exceeded 280 teraflops. This system consisted of 65,536 compute nodes housed in 64 Blue Gene racks.
As with each of the systems described above, the Roadrunner project is a partnership with IBM. The original contract was signed in September 2006 and projected three phases. In phase 1, a base system consisting of Opteron nodes was delivered. A hybrid node prototype system was projected for phase 2. The delivery of a hybrid final system, one that would achieve a sustained petaflop in Linpack performance, was projected for phase 3.
For more information, refer to the Advanced Simulation and Computing Web site at:
http://www.sandia.gov/NNSA/ASC/about.html
Note: At the time these goals were set, computers were still at the gigaflop level and
were still two years away from the realization of the first teraflop machine.
1.2 Roadrunner hardware components
A simple way to describe the Roadrunner system is that it is a heterogeneous cluster of clusters, each of which is accelerated by Cell/B.E. processors. The unique feature of this design is that the Cell/B.E. processors are attached to each compute node, rather than forming a simple cluster of Cell/B.E. processors. A collection of such compute nodes and I/O nodes, all connected through a high speed switch fabric, makes up a scalable unit known as a Connected Unit (CU).
The fundamental building block of a CU is the compute node, each of which is a TriBlade. The TriBlade is an original design concept created for the Roadrunner system and allows for the integration of Cell/B.E. and Opteron blades. Architecturally, this design allows for the incorporation of these TriBlades into an IBM BladeCenter chassis.
1.2.1 TriBlade: a unique concept
The TriBlade makes up what is called a hybrid compute node. The components of this node consist of an IBM LS21 Opteron blade, two IBM BladeCenter QS22 Cell/B.E. blades, and a fourth blade that houses the communications fabric for the compute node. This expansion blade connects the two QS22 blades through four PCI Express x8 links to the Opteron blade and provides each node with an InfiniBand 4x DDR cluster interconnect. Figure 1-1 shows a schematic of a TriBlade.
Figure 1-1 TriBlade schematic
The node design of the TriBlade offers a number of important characteristics. Since each node is accelerated by Cell/B.E. processors, by design there is one Cell/B.E. chip for each Opteron core. The TriBlade is populated with 16 GB of Opteron memory and an equal amount of Cell/B.E. memory. Since the new Cell/B.E. eDP processors are capable of delivering 102.4 gigaflops of peak performance, each TriBlade node is capable of approximately 400 gigaflops of double precision compute power. For additional information about the Cell/B.E. processor, see Appendix A, "The Cell Broadband Engine (Cell/B.E.) processor" on page 27.
The design of the TriBlade presents the user with a very specific memory hierarchy. The Opteron processors establish a master-subordinate relationship with the Cell/B.E. processors. Each Opteron blade contains 4 GB of memory per core, resulting in 8 GB of shared memory per socket. The Opteron blade thus contains 16 GB of NUMA shared memory per node.
Each Cell/B.E. processor contains 4 GB of shared memory, resulting in 8 GB of shared memory per blade. In total, the Cell/B.E. blades contain 16 GB of distributed memory per TriBlade node. It is important to note that not only is there a one-to-one mapping of Opteron cores to Cell/B.E. processors, but also each node consists of a distribution of equal memory among each of these components.
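The memory hierarchy described above can be written down as a small data model. The following Python sketch is illustrative only; the class and attribute names are hypothetical, and the numbers are the per-core, per-socket, and per-blade sizes given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriBladeMemory:
    """Memory layout of one TriBlade (names are illustrative, sizes are from the text)."""

    opteron_sockets: int = 2       # two dual-core Opterons on the LS21 blade
    cores_per_socket: int = 2
    gb_per_opteron_core: int = 4
    cell_blades: int = 2           # two QS22 blades per TriBlade
    cells_per_blade: int = 2
    gb_per_cell: int = 4

    @property
    def opteron_cores(self) -> int:
        return self.opteron_sockets * self.cores_per_socket

    @property
    def cell_processors(self) -> int:
        return self.cell_blades * self.cells_per_blade

    @property
    def opteron_gb_per_socket(self) -> int:
        return self.cores_per_socket * self.gb_per_opteron_core

    @property
    def opteron_gb(self) -> int:
        # NUMA shared memory on the Opteron blade
        return self.opteron_sockets * self.opteron_gb_per_socket

    @property
    def cell_gb_per_blade(self) -> int:
        return self.cells_per_blade * self.gb_per_cell

    @property
    def cell_gb(self) -> int:
        # Distributed across the two QS22 blades
        return self.cell_blades * self.cell_gb_per_blade


node = TriBladeMemory()
print("Opteron cores / Cell processors:", node.opteron_cores, "/", node.cell_processors)
print("Opteron GB / Cell GB:", node.opteron_gb, "/", node.cell_gb)
```

The model makes the symmetry explicit: four Opteron cores face four Cell/B.E. processors, and each side holds 16 GB, so every core and its paired Cell/B.E. processor see the same 4 GB.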
In order to sustain this compute power, the connectivity within each node consists of four PCI Express x8 links, each capable of a 2 GB/s transfer rate with a 2 microsecond latency. The expansion slot also contains the InfiniBand interconnect, which allows communications to the rest of the cluster. The InfiniBand 4x DDR interconnect is rated at 2 GB/s with a 2 microsecond latency.
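To see what these ratings mean for message sizes, a first-order estimate can combine latency and bandwidth. The following Python sketch assumes a simple linear model (time = latency + size / bandwidth) and decimal gigabytes; neither assumption comes from the paper, and real links show protocol overhead that this model ignores. The function names are hypothetical.

```python
# Ratings quoted in the text for one PCI Express x8 link (the InfiniBand
# 4x DDR interconnect is quoted with the same figures).
LINK_BANDWIDTH_BYTES_PER_S = 2e9   # 2 GB/s, assuming decimal gigabytes
LINK_LATENCY_S = 2e-6              # 2 microseconds


def transfer_time(message_bytes: float,
                  bandwidth: float = LINK_BANDWIDTH_BYTES_PER_S,
                  latency: float = LINK_LATENCY_S) -> float:
    """Estimated seconds to move one message under the linear model."""
    if message_bytes < 0:
        raise ValueError("message_bytes must be non-negative")
    return latency + message_bytes / bandwidth


def effective_bandwidth(message_bytes: float) -> float:
    """Bytes per second actually achieved for a message of the given size."""
    return message_bytes / transfer_time(message_bytes)


# Message size at which half of the rated bandwidth is reached.
half_bandwidth_bytes = LINK_LATENCY_S * LINK_BANDWIDTH_BYTES_PER_S

for size in (64, 4_000, 1_000_000):
    micros = transfer_time(size) * 1e6
    gb_per_s = effective_bandwidth(size) / 1e9
    print(f"{size:>9} bytes: {micros:8.2f} us, {gb_per_s:.3f} GB/s effective")
```

Under this model, messages of about 4,000 bytes reach only half of the rated bandwidth, which suggests that data exchanged between an Opteron core and its Cell/B.E. processor is best moved in large blocks.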
1.2.2 IBM BladeCenter QS22
The IBM BladeCenter QS22 is based on the IBM PowerXCell 8i processor, a newgeneration processor based on the Cell/B.E. architecture. In contrast to its predecessors, the
QS20 and QS21, the QS22 is based on the second generation processor of the Cell/B.E.architecture and offers single instruction multiple data (SIMD) vector capability along with
strong parallelization. It performs double precision floating point operations at five times thespeed of the previous generations of Cell/B.E. processors.
Due to its parallel nature and extraordinary computing speed, the QS22 is ideal for use in
scientific applications, which is why it was chosen as an integral part of the Roadrunner system by IBM and Los Alamos. The QS22 is a single-wide blade server that offers an SMP with shared memory and two Cell/B.E. processors in a single blade enclosure.
Figure 1-2 on page 7 provides an illustration of the IBM BladeCenter QS22. Features of the QS22 include:
Two 3.2 GHz IBM PowerXCell 8i processors
Up to 32 GB of PC2-6400 800 MHz DDR2 memory
460 single-precision gigaflops per blade (peak)
217 double-precision gigaflops per blade (peak)
Integrated dual 1 Gb Ethernet
IBM Enhanced I/O Bridge chip
Serial Over LAN
The QS22 is based on the 64-bit IBM PowerXCell 8i processor. This processor operates at
3.2 GHz. Each of the eight SIMD vector processors is capable of producing four floating point results per clock period. The memory subsystem on the QS22 consists of eight DIMM slots, enabling configurations from 4 GB up to 32 GB of ECC memory.
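The clock rate and per-clock figures above give a quick peak estimate, sketched below. The calculation counts only the eight SIMD vector processors; the constant and function names are hypothetical. The result per blade is somewhat lower than the 217 gigaflops quoted for the QS22, presumably because that figure also counts a contribution from the PowerPC core. That attribution is an assumption, as the text does not break the number down.

```python
# Peak estimate from the figures in the text: clock rate x vector processors
# x floating point results per clock. Only the SIMD vector processors are
# counted here; names are hypothetical.
CLOCK_GHZ = 3.2
VECTOR_PROCESSORS_PER_CHIP = 8
RESULTS_PER_CLOCK = 4
CHIPS_PER_QS22 = 2
QS22_PER_TRIBLADE = 2


def peak_gflops_per_chip():
    return CLOCK_GHZ * VECTOR_PROCESSORS_PER_CHIP * RESULTS_PER_CLOCK


def peak_gflops_per_blade():
    return peak_gflops_per_chip() * CHIPS_PER_QS22


def peak_gflops_per_triblade():
    return peak_gflops_per_blade() * QS22_PER_TRIBLADE


if __name__ == "__main__":
    print(peak_gflops_per_chip(), peak_gflops_per_blade(), peak_gflops_per_triblade())
```

The per-TriBlade value from this estimate is a little over 400 gigaflops, consistent with the "approximately 400 gigaflops" quoted for a TriBlade node.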
For additional information about the Cell/B.E. processor, see Appendix A, The Cell
Broadband Engine (Cell/B.E.) processor on page 27.
Figure 1-2 IBM BladeCenter QS22
For more information about the QS22, see the IBM BladeCenter QS22 Web page at:
http://www.ibm.com/systems/bladecenter/hardware/servers/qs22/index.html
1.2.3 IBM BladeCenter LS21
The IBM BladeCenter LS21 is a single width AMD Opteron-based server. The LS21 blade server supports up to two of the dual-core 2200 series AMD Opteron processors combined
with up to 32 GB of ECC memory and one fixed SAS HDD.
The memory used in the LS21 is DDR2 and is ECC protected. The general memory configuration for the LS21 has to follow these guidelines:
A total of eight DIMM slots (four per processor socket). Two of these slots (1 and 2) are preconfigured with a pair of DIMMs.
Because memory is 2-way interleaved, the memory modules must be installed in matched pairs. However, one DIMM pair is not required to match the other in capacity.
A maximum of 32 GB of installed memory is achieved when all DIMM sockets are
populated with 4 GB DIMMs.
Important: The implementation chosen for the Roadrunner system consists of the standard blade populated with 16 GB of DDR2 memory. As with the Opteron blades, all of
the Cell/B.E. based blades are diskless.
Important: The configuration used for the Roadrunner system contains two AMD Opteron
processors running at 1.8 GHz, 16 GB of ECC memory, and no hard disk. The diskless configuration is an important implementation design, which eliminates additional moving parts and potential points of failure for a system with so many thousands of nodes.
For each installed microprocessor, a set of four DIMM sockets is enabled.
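The memory guidelines above can be expressed as a small validity check. This is a sketch only: the function names are hypothetical, the supported DIMM sizes are those listed for the standard LS21, and two details are assumptions not stated in the text, namely that adjacent slots (1-2, 3-4, 5-6, 7-8) form the interleaved pairs and that slots 1-4 belong to the first processor.

```python
# Sketch of the LS21 DIMM population rules. Assumptions (not from the text):
# adjacent slots form the matched pairs, and slots 1-4 belong to processor 1.
SUPPORTED_DIMM_MB = {512, 1024, 2048, 4096}
TOTAL_SLOTS = 8
SLOTS_PER_PROCESSOR = 4


def check_ls21_memory(slots_mb, processors=2):
    """Return a list of problems; slots_mb holds one size per slot, 0 = empty."""
    if len(slots_mb) != TOTAL_SLOTS:
        return ["expected exactly 8 slot entries"]
    problems = []
    enabled_slots = processors * SLOTS_PER_PROCESSOR
    for index, size in enumerate(slots_mb):
        if size and size not in SUPPORTED_DIMM_MB:
            problems.append(f"slot {index + 1}: unsupported DIMM size {size} MB")
        if size and index >= enabled_slots:
            problems.append(f"slot {index + 1}: not enabled with {processors} processor(s)")
    # Memory is 2-way interleaved, so each pair must hold identical DIMMs.
    for first in range(0, TOTAL_SLOTS, 2):
        if slots_mb[first] != slots_mb[first + 1]:
            problems.append(f"slots {first + 1} and {first + 2}: not a matched pair")
    return problems


def total_gb(slots_mb):
    return sum(slots_mb) / 1024


if __name__ == "__main__":
    full = [4096] * 8
    print(check_ls21_memory(full), total_gb(full))
```

Pairs are only compared within themselves, so a configuration such as two 2 GB DIMMs plus two 1 GB DIMMs passes, matching the guideline that one pair need not match another in capacity.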
The processors used in these blades are standard low-power processors. The standard AMD
Opteron processors draw a maximum of 95 W. Specially manufactured low-power processors operate at 68 W or less without any performance trade-offs. This savings in power at the
processor level, combined with the smarter power solution that IBM BladeCenter delivers, makes these blades very attractive for installations that are limited by power and cooling resources.
This blade is designed with power management capability to provide the maximum up time
possible. In extended thermal conditions, rather than shut down completely or fail, the LS21 automatically reduces the processor frequency to maintain acceptable thermal levels.
A standard LS21 blade server offers these features:
Up to two high-performance, AMD Dual-Core Opteron processors.
A system board containing eight DIMM connectors, supporting 512 MB, 1 GB, 2 GB, or 4 GB DIMMs.
Up to 32 GB of system memory is supported with 4 GB DIMMs.
A SAS controller, supporting one internal SAS drive (36 or 73 GB) and up to three additional SAS drives with optional SIO blade.
Two TCP/IP Offload Engine enabled Gigabit Ethernet controllers (Broadcom 5706S) as standard, with load balancing and failover features.
Support for concurrent KVM (cKVM) and concurrent USB/DVD (cMedia) through
Advanced Management Module and an optional daughter card.
Support for a Storage and I/O Expansion (SIO) unit.
Dual Gigabit Ethernet controllers are standard, providing high-speed data transfers and
offering TCP/IP Offload Engine support, load-balancing, and failover capabilities. The version used for Roadrunner uses optional InfiniBand expansion cards, allowing high speed
communication between nodes. The InfiniBand fabric installed with Roadrunner provides
4x DDR connections that have a theoretical peak of 2 GB per second.
Finally, the LS21 supports both the Windows and Linux operating systems. The Roadrunner
implementation uses the Fedora version of Linux.
Figure 1-3 on page 9 shows a schematic of the planar of an LS21.
Figure 1-3 LS21 planar
For more information about the LS21, see the IBM BladeCenter LS21 Web page at:
http://www.ibm.com/systems/bladecenter/hardware/servers/ls21/features.html
1.3 Rack configurations
TriBlades are combined into racks to create assemblies of hybrid compute nodes. In addition, some racks contain other components for other required functionality. There are three
different rack types:
Compute node rack
Compute node and I/O rack
Switch and service rack
In general, these racks look very similar. Each can hold a maximum of 12 TriBlades and some hold additional components.
1.3.1 Compute node rack
A compute node rack holds a total of 12 TriBlades, which means it holds 12 LS21s and 24 QS22s. A compute node rack looks similar to the picture shown in Figure 1-4.
Figure 1-4 Compute node rack
1.3.2 Compute node and I/O rack
A compute node and I/O rack contains 12 TriBlades, but also contains an IBM System x3655 (x3655) at the bottom of the rack. The x3655 performs input/output (I/O) services on behalf of the system. A compute and I/O node rack looks similar to the picture shown in Figure 1-5 on
page 11.
The x3655 is a new rack-optimized server based on the AMD Opteron dual-core processor.
The x3655 supports four processor sockets and 32 memory DIMM slots. The memory is 667 MHz DDR2, in sizes ranging from 512 MB to 4 GB per DIMM. This gives a total capacity of up to 128 GB of main system memory.
Note: The x3655 used in the Roadrunner system supports 16 GB or 32 GB of memory.
Figure 1-5 Compute and I/O node rack
1.3.3 Switch and service rack
The switch and service rack contains no TriBlades. This rack contains a Voltaire Grid Director ISR 9288 switch that is used to manage InfiniBand networking traffic. This is known in
Roadrunner as a first-stage switch. See First-stage InfiniBand switch on page 14 for more information about its role and function.
You can learn more about the Voltaire switch technology on the Voltaire Web page at:
http://www.voltaire.com/Products/Grid_Backbone_Switches/Voltaire_Grid_Director_ISR_9288
In addition, this rack contains an IBM System x3655, which serves as the service node for the
CU. The functions that the service node performs include the following:
Holds the boot images used to IPL the Opteron and Cell/B.E. blades, as well as the I/O
nodes.
IPLs all elements in the CU when instructed to do so by the central management node.
A switch and service rack looks similar to the picture shown in Figure 1-6.
Figure 1-6 Switch and service rack
1.4 The Connected Unit
The Connected Unit (CU) is a core concept in the Roadrunner system. Groups of the various
rack configurations discussed in 1.3, Rack configurations on page 9 are put together to create a single CU. Table 1-1 lists the racks that comprise a single CU.
Table 1-1 Racks making up a Connected Unit
A CU can be thought of as a base cluster unit. The racks that make up a CU are connected to each other through first-stage switches. CUs are then tied together through second-stage
switches to create a larger grid.
The size of a CU is largely determined by the capabilities of the first-stage switch. There are
180 TriBlades in a CU. This number of TriBlades means that a Connected Unit contains 180 AMD Opteron LS21s and 360 IBM BladeCenter QS22s. See Figure 1-7 on page 13.
Rack type                    Racks in the CU   TriBlades per rack   Total TriBlades
Compute node rack                          3                   12                36
Compute node and I/O rack                 12                   12               144
Switch and service rack                    1                    0                 0
Total                                     16                  N/A               180
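The totals in Table 1-1 follow directly from the rack counts, and scaling them by the 17 CUs that make up the full system gives a sense of overall size. The sketch below shows that arithmetic; the dictionary and function names are hypothetical, and the numbers are the ones given in this chapter.

```python
# Rack arithmetic for one Connected Unit and for the whole system.
# Values are (racks per CU, TriBlades per rack) as listed in Table 1-1.
RACKS = {
    "compute node rack": (3, 12),
    "compute node and I/O rack": (12, 12),
    "switch and service rack": (1, 0),
}
CUS_IN_SYSTEM = 17
QS22_PER_TRIBLADE = 2


def connected_unit_totals():
    """Totals for a single Connected Unit."""
    racks = sum(count for count, _ in RACKS.values())
    triblades = sum(count * per_rack for count, per_rack in RACKS.values())
    return {
        "racks": racks,
        "triblades": triblades,
        "ls21": triblades,                      # one LS21 per TriBlade
        "qs22": triblades * QS22_PER_TRIBLADE,  # two QS22s per TriBlade
    }


def system_totals(cus=CUS_IN_SYSTEM):
    """Totals for the given number of Connected Units."""
    return {name: value * cus for name, value in connected_unit_totals().items()}


if __name__ == "__main__":
    print(connected_unit_totals())
    print(system_totals())
```

For 17 CUs this gives 3,060 TriBlades, that is, 3,060 LS21 blades and 6,120 QS22 blades.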
Figure 1-7 Racks comprising a Connected Unit
1.5 Networks
Given the high number of racks and nodes in the Roadrunner system, it should come as no surprise that there are several different networks used to tie the system together. This section provides an overview of the different networks involved as well as their functional purpose.
1.5.1 Networks within a Connected Unit cluster
First-stage switches are used to connect all the racks making up a Connected Unit (CU)
together and to allow the CU to communicate with the outside world (for example, a file system) and other CUs. The second-stage switches primarily serve as a hub to tie the 17 CUs
together into a common computational system.
Note: As previously discussed in this chapter, the entire Roadrunner system or cluster is
comprised of a total of 17 CUs.
[Figure 1-7 labels: a Connected Unit consists of 12 I/O + compute racks, 3 compute racks, and 1 switch and service rack.]
First-stage InfiniBand switch
As discussed in 1.3.3, Switch and service rack on page 11, each CU contains a rack with a Voltaire Grid Director ISR 9288 switch. This switch allows for 288 different InfiniBand inputs, which are used as shown in Table 1-2.
Table 1-2 Connections in and out from a first-stage switch
InfiniBand Connected Unit
This network creates a fat tree that allows the AMD Opterons to communicate with each other using the industry-standard Message Passing Interface (MPI). It is built on top of the
switched InfiniBand network. A fat tree is a special topology invented by Charles E. Leiserson of MIT. Unlike a traditional binary tree, a fat tree has thicker branches the closer you get to the tree's root. In this way, you do not end up with a communications bottleneck at
the root of the tree.
Figure 1-8 shows a traditional binary tree. Note that as messages flow up the tree, the single
links to the root node can become a point of congestion.
Figure 1-8 Traditional binary tree
Figure 1-9 on page 15, on the other hand, shows a fat tree. Notice how the number of links
between nodes increases as you get closer to the tree's root. The number of links shown is just one example of a fat tree configuration; the actual number may be higher or lower
between any two nodes depending on the given requirements.
Component                                 Connections   Purpose
TriBlade InfiniBand links                         180   Connects the AMD Opteron nodes together to allow them to participate in a network.
InfiniBand links to second-stage switch            96   Allows the CUs to be tied together into a single network.
InfiniBand links to I/O nodes                      12   Provides the hybrid compute nodes access to the file system for application input and output.
Total                                             288
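The port budget of the first-stage switch can be checked with a few lines of arithmetic. The sketch below uses hypothetical names; it takes 180 TriBlade links and 96 links to the second-stage switch from the table, and assumes one InfiniBand link for each of the twelve I/O nodes in a CU, as described for the file system LAN.

```python
# Port budget of the 288-port first-stage switch in one Connected Unit.
# Names are hypothetical; one link per I/O node (12 per CU) is taken from
# the description of the file system LAN.
SWITCH_PORTS = 288
TRIBLADE_LINKS = 180
SECOND_STAGE_LINKS = 96
IO_NODE_LINKS = 12


def ports_used():
    return TRIBLADE_LINKS + SECOND_STAGE_LINKS + IO_NODE_LINKS


def ports_free():
    return SWITCH_PORTS - ports_used()


def downlink_to_uplink_ratio():
    """Links facing nodes inside the CU per link leaving the CU."""
    return (TRIBLADE_LINKS + IO_NODE_LINKS) / SECOND_STAGE_LINKS


if __name__ == "__main__":
    print(ports_used(), ports_free(), downlink_to_uplink_ratio())
```

Under these figures every port is in use, and there are two node-facing links for each link to the second-stage switches.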
Figure 1-9 Fat tree
Fat tree topologies are becoming quite popular in InfiniBand clusters. For more information about fat trees and their usage with InfiniBand, see the article Performance Modeling of
Subnet Management on Fat Tree InfiniBand Networks using OpenSM, which is available at the following Web site:
http://nowlab.cse.ohio-state.edu/publications/conf-papers/2005/vishnu-fastos05.pdf
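The difference between the two tree shapes can be made concrete with a small model. The sketch below is illustrative only: it assumes a complete binary tree and a fat tree in which the number of links on an edge doubles at each level toward the root, which is just one possible fat tree configuration and not the actual Roadrunner wiring. The function names are hypothetical.

```python
# Compare how many leaves share one upward link at each level of a complete
# binary tree. Assumption for this sketch: in the fat tree, the number of
# links per edge doubles at each level toward the root.
def links_per_edge(depth, fat):
    """Links on one edge at each level, from just below the root to the leaves."""
    return [2 ** (depth - level) if fat else 1 for level in range(1, depth + 1)]


def leaves_per_link(depth, fat):
    """Leaves whose upward traffic shares a single link, root level first."""
    shared = []
    for level, links in enumerate(links_per_edge(depth, fat), start=1):
        leaves_below_edge = 2 ** (depth - level)
        shared.append(leaves_below_edge / links)
    return shared


if __name__ == "__main__":
    print("binary tree:", leaves_per_link(3, fat=False))
    print("fat tree:   ", leaves_per_link(3, fat=True))
```

In the plain binary tree, the links nearest the root are shared by the most leaves, which is the congestion point described above. In the fat tree modeled here, every link carries the traffic of one leaf regardless of level.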
10 Gigabit Ethernet file system LAN
Every CU has twelve I/O nodes, each of which has a single InfiniBand connection to the CU's
InfiniBand switch. This allows the hybrid compute nodes (TriBlades) to retrieve and pass data to the I/O nodes over the InfiniBand network. The file system is connected through the I/O
nodes, each of which has two 10 Gb links to the file system LAN.
Gigabit Ethernet Control VLAN (CVLAN)
The 1 Gb Ethernet control VLAN is used to perform vital program and node control functions within each CU, such as passing the Message Passing Interface (MPI) information required for program operation and communication.
Gigabit Ethernet Management VLAN (MVLAN)
The 1 Gb Ethernet Management VLAN is used to perform vital system management functions within each CU, such as passing the required operating system boot images from
the CU's service node to the processors on the hybrid compute nodes and I/O nodes in order to IPL them.
PCI Express link between LS21 and Cell/B.E. blades
Each AMD Opteron has a one-to-one master-subordinate relationship with a Cell/B.E.
processor. Although the Opterons participate in MPI communications with other Opteron nodes and access the file system through the I/O nodes, the Cell/B.E. processors only
communicate with their master Opteron.
Important: This VLAN is used exclusively for control traffic; no user data flows across this network.
Gigabit Ethernet management VLAN (MVLAN)
The 1 Gb Ethernet management VLAN is the grid-wide system management network. It is used for booting, system control, and status determination operations between the management nodes and the various managed elements throughout the cluster. The MVLAN
does not have direct network access to the internals of a CU (for example, the hybrid compute nodes and I/O nodes). Management operations to those nodes occur from the
MVLAN to the CU's MVLAN through the service node to the desired target.
No user or application data flows across the MVLAN. Only system management and control traffic flows across this network.
Chapter 2. Roadrunner software overview
This chapter briefly describes the software used to run applications on the Roadrunner
system.
Note: This IBM Redpaper publication is not intended to be a detailed analysis, but rather a
big picture discussion meant to acquaint the reader with the Roadrunner system.
2.1 Roadrunner components
This section provides a brief explanation of the software that runs on the various components that comprise a Roadrunner system.
2.1.1 Compute node (TriBlade)
As described in 1.2.1, TriBlade: a unique concept on page 5, a TriBlade is made up of one
IBM BladeCenter LS21 blade and two IBM BladeCenter QS22 blades. Each of these runs its own operating system image, but shares a common user application.
The following is the software that runs on the various components of the TriBlade:
AMD Opteron LS21 for IBM BladeCenter
Each LS21 is standard except for the fact that it is diskless. The operating system is Fedora Linux. Since it is diskless, it is booted up from its Connected Unit's service node.
IBM BladeCenter QS22
Each QS22 is standard except for the fact that it is diskless. The operating system is Fedora Linux. Since it is diskless, it is booted up from its Connected Unit's service node.
Broadcom HT-2100 (PCIe adapter)
The dual Opteron host blade (LS21) is connected to the two QS22s through a PCI Express (PCIe) interconnect. Two HyperTransport x16 connections from the LS21 blade
drive an expansion card containing two Broadcom HT-2100 HyperTransport to PCI
Express bridge chips. Each Broadcom HT-2100 drives two PCI Express x8 connections to the two Axon Southbridge chips on one of the Cell Broadband Engine (Cell/B.E.) blades (QS22). This provides a dedicated PCIe x8 connection to each Cell/B.E. processor.
The PCIe interconnect is supported by a low-level device driver that provides direct
memory access (DMA) and a remote memory mapped small message area (SMA). DMA operations can be started by calls to the device driver from programs on either the LS21 or the QS22. The device driver initiates the DMA operation using a DMA controller in the
Axon Southbridge. The small message area provides regions of memory that can be accessed remotely by user space instructions without a context switch to the kernel or
device driver interaction. There is a unique device driver instance on both the Opteron and the Cell/B.E. blade for each Axon Southbridge. A virtual Ethernet driver (also replicated
per Axon) supports point-to-point communications between the Opteron and each
Cell/B.E. processor.
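The link structure described above can be laid out as a small data model. This is a sketch with hypothetical names: it enumerates one PCIe x8 link per Axon Southbridge behind each of the two HT-2100 bridges. The specific numbering of which Opteron core is paired with which Cell/B.E. processor is an assumption made for illustration.

```python
# Enumerate the PCIe x8 links inside one TriBlade. The pairing of a given
# Opteron core with a given Cell/B.E. processor is an assumed numbering.
HT2100_BRIDGES = 2       # one per HyperTransport x16 connection from the LS21
LINKS_PER_BRIDGE = 2     # each HT-2100 drives two PCIe x8 connections


def triblade_links():
    """Return one record per dedicated PCIe x8 link."""
    links = []
    core = 0
    for bridge in range(HT2100_BRIDGES):
        for axon in range(LINKS_PER_BRIDGE):
            links.append({
                "opteron_core": core,
                "ht2100": bridge,
                "qs22_blade": bridge,   # each bridge serves one QS22 blade
                "axon": axon,           # one Axon Southbridge per Cell/B.E. processor
                "link": "PCIe x8",
            })
            core += 1
    return links


if __name__ == "__main__":
    for record in triblade_links():
        print(record)
```

The model yields four links, two per QS22 blade, so each Cell/B.E. processor has its own connection and its own device driver instance on the Opteron side.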
Note: From an IBM BladeCenter Advanced Management Module (AMM) perspective, the TriBlade still appears as separate blades. In other words, it appears as one LS21 and two
QS22s. The logical grouping of the LS21 and QS22s is handled through the xCAT management tools. See 2.3, xCAT on page 23 for more information.
2.1.2 I/O node
As mentioned previously in 1.3.2, Compute node and I/O rack on page 10, each I/O node is an IBM System x3655 server. I/O nodes are diskless and serve as pipes to the external file
system across the 10 Gigabit Ethernet file system LAN.
Each I/O node runs Fedora Linux as its operating system. Since the node is diskless, it is booted up from its Connected Unit's service node. The I/O node will run either the IBM
GPFS or Panasas PanFS client to communicate with the external file system, depending on
what file system software is running there.
2.1.3 Service node
Service nodes are standard IBM System x3655 Opteron-based servers and are diskless.
There is one dedicated service node per Connected Unit, so the service node image can be updated directly from the master node over the management network (MVLAN) described in Gigabit Ethernet management VLAN (MVLAN) on page 17.
Service nodes obtain copies of the boot images for the I/O nodes and compute nodes from the master node. These images are refreshed on an as-needed basis. The images are loaded over the CVLAN (see Gigabit Ethernet Control VLAN (CVLAN) on page 15).
2.1.4 Master (management) node
The master node is a standard IBM System x3655 Opteron-based server and is booted from the local disk. The master node runs Fedora Linux.
2.2 Cluster boot sequence
The initial booting of the nodes is complicated by two factors in the Roadrunner system:
All of the nodes except for the master node are diskless, so they must boot over the network.
There are over 3,000 total nodes and 10,000 operating system images that need to be
installed and booted.
There will be times when the entire system needs to be booted, and there will be times when
only parts of the system need to be booted (while the rest of the system is still available butpowered off). This places two distinct demands on the management network:
It must be able to boot the entire system without causing timeouts on the management
network such that no boot progress is being made.
It must be able to boot substantial portions of the system without interfering with any status and control operations that are occurring on the running portion of the system.
Since the majority of nodes are diskless, a scalable way to move the boot images to each of the nodes is required. To this end, a hierarchy of management nodes has been created.
The solution to this concern is to use a bootstrap protocol (BOOTP) together with the trivial file transfer protocol (TFTP) subnet multicast to boot the diskless LS21 Opteron and QS22 Cell/B.E. blades. This method provides a broadcast of the common boot image that the
LS21s and QS22s can pick up midstream. The multicast repeats until all requesting blades have received all packets of the boot image. There are unique boot images for the various
configurations. The boot images are stored on the Connected Unit service nodes and multicast over the CVLAN. This method significantly reduces network traffic compared to sending individual boot images to each processor.
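The benefit of a repeating multicast that blades can join midstream can be illustrated with a toy simulation. The sketch below is not an implementation of BOOTP or TFTP; it only models a sender that cycles through the packets of one image until every listening blade has collected all of them, and compares the packet count with sending the image to each blade separately. All names and numbers are hypothetical.

```python
# Toy model of a repeating multicast of one boot image. Each tick, the sender
# transmits the next packet in a cycle; every blade that has already joined
# receives it. This is not BOOTP/TFTP, only an illustration of the idea.
def multicast_packets_sent(num_packets, join_ticks):
    """Packets sent until every blade holds the full image.

    join_ticks lists the tick at which each blade starts listening.
    """
    received = [set() for _ in join_ticks]
    tick = 0
    while any(len(packets) < num_packets for packets in received):
        packet = tick % num_packets              # the image repeats cyclically
        for blade, joined_at in enumerate(join_ticks):
            if tick >= joined_at:                # blades can join midstream
                received[blade].add(packet)
        tick += 1
    return tick


def unicast_packets_sent(num_packets, num_blades):
    """Packets sent if every blade is given its own copy of the image."""
    return num_packets * num_blades


if __name__ == "__main__":
    joins = [0, 30, 75]
    print(multicast_packets_sent(100, joins), unicast_packets_sent(100, len(joins)))
```

In this model the multicast cost depends on when the last blade joins rather than on how many blades are listening, whereas the unicast cost grows in proportion to the number of blades.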
Note: There is only one master node for the entire Roadrunner cluster.
2.2.1 Boot scenarios
This section describes in more detail what happens when a cluster (or parts of the cluster) is booted up.
Master (management) node (tier 1)
This node is installed and booted with the required management node image. The management node boots from the local disk.
Service nodes (tier 2)
There is only one service node per Connected Unit, so this image can be updated directly from the master node over the MVLAN at any time (not just at service node bring-up). Once
booted, service nodes obtain copies of the boot images for the I/O nodes and compute nodes from the master node. These images are refreshed on an as-needed basis. The images are
loaded over the CVLAN through the multicast boot process, which allows for far less network traffic and parallel image download.
I/O nodes
Once successfully booted, the service nodes begin transferring the required boot images down the CVLAN. The I/O nodes are standard Opteron Linux servers and are booted diskless
with the required image. I/O nodes are connected to the 10 GB Global File System (GFS) to service the compute nodes' file access requests. The image required to boot the I/O node is
received from its local service node through the CVLAN network.
Compute nodes (TriBlades)
Compute nodes (TriBlades) are either accelerated or non-accelerated, with the difference
being that accelerated nodes will have their associated Cell/B.E. blades powered on and booted, while Cell/B.E. blades on the non-accelerated nodes are left powered off.
There is no need for a heartbeat function between the Opteron core and its associated Cell
Broadband Engine processor. The general health of both resources is known by the xCAT software and reflected in the resource manager. Communication health status between the
two resources is monitored and understood on demand by the application running on the Opteron side. The Data Communication and Synchronization (DaCS) API is notified of errors
from the Cell/B.E. processor concerning any data transfer or communications request. Failures of these transactions are reported by the software structures. If the PCI Express connection between the Opteron and Cell/B.E. processor fails, an appropriate error event is
posted and the application is terminated.
Given the PCI Express interface between the Opteron and Cell/B.E. processor, it is necessary to boot the Cell/B.E. processor portions of a compute node (in the accelerated node pool) before the Opteron portion. This allows the proper initialization of the interconnect firmware and PCI Express device drivers. The Cell/B.E. PCI Express device drivers listen for the
necessary firmware/driver handshakes from the LS21 and Broadcom HT-2100 (PCIe adapter) expansion card to establish communication. The process of ensuring the correct booting
sequence is controlled by the xCAT software.
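The ordering rule for compute nodes can be summarized in a few lines. The sketch below is purely illustrative and is not xCAT code: the function names and blade labels are hypothetical. It only encodes the two statements above, that the Cell/B.E. blades of an accelerated node are booted before its Opteron blade, and that they are left powered off on a non-accelerated node.

```python
# Illustration of the per-TriBlade boot ordering described in the text.
# Not xCAT code; names and labels are hypothetical.
def triblade_boot_order(accelerated):
    """Blades of one TriBlade in the order they are booted."""
    if accelerated:
        # Cell/B.E. blades first, so their PCIe drivers are listening
        # when the LS21 and its expansion card come up.
        return ["QS22 blade 1", "QS22 blade 2", "LS21 blade"]
    # Non-accelerated: the Cell/B.E. blades stay powered off.
    return ["LS21 blade"]


def boot_plan(nodes):
    """nodes maps a TriBlade name to True (accelerated) or False."""
    plan = []
    for name in sorted(nodes):
        for blade in triblade_boot_order(nodes[name]):
            plan.append(f"{name}: {blade}")
    return plan


if __name__ == "__main__":
    for step in boot_plan({"tb001": True, "tb002": False}):
        print(step)
```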
Note: There is no low power mode for the Cell/B.E. blades, so some sort of standby
mode is not possible. They are either on (accelerated) or off (non-accelerated).
2.3 xCAT
Setting up the installation and management of a cluster is a complicated task, and doing everything manually can become very complicated. The development of xCAT grew out of the
desire to automate a lot of the repetitive steps involved in installing and configuring a Linux cluster.
The development of xCAT is driven by customer requirements. Because xCAT itself is written
entirely using scripting languages such as Korn shell, Perl, and Expect, an administrator can easily modify the scripts should the need arise.
The main functions of xCAT are grouped as follows:
Automated installation
Hardware management and monitoring
Software administration
Remote console support for text and graphics
For more information about xCAT, refer to the xCAT Web site at:
http://xcat.sourceforge.net
2.4 How applications are written and executed
This section discusses how applications are written and executed on the Roadrunner system. The unique architecture employed means that applications are designed and written in a
revolutionary new manner compared to previous parallel processing applications.
2.4.1 Application core
The bulk of the user application, including initiation and termination, runs on the AMD Opteron processor (LS21). It uses Message Passing Interface (MPI) APIs to communicate with the other Opteron processors the application is running on in a typical single program,
multiple data (SPMD) fashion. The number of compute nodes used to run the application is
determined at program launch.
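The SPMD pattern mentioned above can be shown in miniature. The sketch below does not use MPI at all: the ranks are simulated by an ordinary loop, and the final sum stands in for an MPI reduction. The function names are hypothetical. What it illustrates is that every rank runs the same program on its own share of the data, and that the number of ranks is chosen at launch.

```python
# Miniature SPMD illustration without MPI: every "rank" runs the same
# function on its own slice of the data. The loop in launch() stands in
# for the parallel ranks, and the final sum for a reduction.
def spmd_program(rank, size, data):
    """The single program; its behavior depends only on rank and size."""
    my_share = data[rank::size]           # cyclic distribution of the work
    return sum(value * value for value in my_share)


def launch(size, data):
    """Run the program once per rank and combine the partial results."""
    partial_results = [spmd_program(rank, size, data) for rank in range(size)]
    return sum(partial_results)


if __name__ == "__main__":
    numbers = list(range(10))
    # The answer does not depend on how many ranks are launched.
    print(launch(1, numbers), launch(4, numbers))
```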
The MPI implementation of Roadrunner is based on the open-source Open MPI Project and
therefore is standard MPI. In this regard, Roadrunner applications are similar to other typical MPI applications (such as those that run on the IBM Blue Gene solution). Where Roadrunner
differs in the sphere of application architecture is how its Cell/B.E. accelerators are employed. At any point in the application flow, the MPI application running on each Opteron can offload computationally-complex logic to its subordinate Cell/B.E. processor.
For more information about the Open MPI Project, refer to the Open MPI: Open Source High Performance Computing Web site at:
http://www.open-mpi.org/
2.4.2 Offloading logic
Determining which logic routines get offloaded to the Cell/B.E. processor, and when that occurs, is one of the most challenging tasks facing an application developer of the Roadrunner system. But it is this very challenge that makes the opportunity for incredibly high
application performance possible.
There are two primary techniques that a developer can employ to actually perform
asynchronous offloads of logic. This section briefly describes each, and points to areas where you can find more detailed information.
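Before looking at the two libraries, it may help to see the general shape of an asynchronous offload. The sketch below uses a Python thread pool as a stand-in for the accelerator; it is not DaCS or ALF, and all function names are hypothetical. The pattern is what matters: start the offloaded work, continue with host-side work, then wait for the result.

```python
# Generic shape of an asynchronous offload, with a thread pool standing in
# for the accelerator. This is not DaCS or ALF; names are hypothetical.
from concurrent.futures import ThreadPoolExecutor


def heavy_kernel(values):
    """Stand-in for the computationally intensive routine to offload."""
    return sum(value * value for value in values)


def host_step(values):
    """Offload the kernel, keep working on the host, then synchronize."""
    with ThreadPoolExecutor(max_workers=1) as accelerator:
        pending = accelerator.submit(heavy_kernel, values)  # returns immediately
        bookkeeping = len(values)                           # host-side work overlaps
        result = pending.result()                           # wait for the offload
    return bookkeeping, result


if __name__ == "__main__":
    print(host_step([1, 2, 3]))
```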
DaCS
The Data Communication and Synchronization (DaCS) library provides a set of services that ease the development of applications and application frameworks in a heterogeneous
multi-tiered system (for example, a 64-bit x86 system (x86_64) and one or more Cell/B.E. processor systems). The DaCS services are implemented as a set of APIs providing an
architecturally neutral layer for application developers on a variety of multi-core systems. One of the key abstractions that further differentiates DaCS from other programming frameworks
is a hierarchical topology of processing elements, each referred to as a DaCS Element (DE).
Within the hierarchy, each DE can serve one or both of the following roles:
A general purpose processing element, acting as a supervisor, control, or master processor. This type of element usually runs a full operating system and manages jobs running on other DEs. This is referred to as a Host Element (HE).
A general or special purpose processing element running tasks assigned by an HE. This is referred to as an Accelerator Element (AE).
DaCS for Hybrid (DaCSH) is an implementation of the DaCS API specification that supports
the connection of an HE on an x86_64 system to one or more AEs on Cell/B.E. processors. In SDK 3.0, DaCSH only supports the use of sockets to connect the HE with the AEs. Direct
access to the Synergistic Processor Elements (SPEs) on the Cell/B.E. processor is not provided. Instead, DaCSH provides access to the PowerPC Processor Element (PPE),
allowing a PPE program to be started and stopped and allowing data transfer between the x86_64 system and the PPE. The SPEs can only be used by the program running on the PPE.
For more information about DaCS, see IBM Software Development Kit for Multicore Acceleration Data Communication and Synchronization Library for Hybrid-x86 Programmer's Guide and API Reference, SC33-8408.
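The element hierarchy described above can be pictured with a small data model. The sketch below is not the DaCS API; the class and method names are hypothetical. It models each element as a node in a tree, where an element with children acts as an HE and an element with a parent acts as an AE. Treating the SPEs as children of the PPE is an illustration of the "one or both roles" statement and of the rule that only the PPE program can use the SPEs.

```python
# Data model of a DaCS-style element hierarchy. Not the DaCS API; the class
# and method names are hypothetical.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Element:
    name: str
    parent: Optional["Element"] = field(default=None, repr=False)
    children: List["Element"] = field(default_factory=list, repr=False)

    def add_child(self, name):
        child = Element(name, parent=self)
        self.children.append(child)
        return child

    def roles(self):
        """HE if it manages other elements, AE if it is managed by one."""
        roles = []
        if self.children:
            roles.append("HE")
        if self.parent is not None:
            roles.append("AE")
        return roles

    def can_offload_to(self, other):
        """An element can only hand work to its direct children."""
        return any(child is other for child in self.children)


def build_hybrid_node():
    host = Element("x86_64 Opteron")
    ppe = host.add_child("Cell/B.E. PPE")
    spes = [ppe.add_child(f"SPE {index}") for index in range(8)]
    return host, ppe, spes


if __name__ == "__main__":
    host, ppe, spes = build_hybrid_node()
    print(host.roles(), ppe.roles(), spes[0].roles())
```

In this model the host can hand work to the PPE but not directly to an SPE, while the PPE serves in both roles.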
ALF
The Accelerated Library Framework (ALF) provides a programming environment for data and
task parallel applications and libraries. The ALF API provides you with a set of interfaces to simplify library development on heterogeneous multi-core systems. You can use the provided
framework to offload the computationally intensive work to the accelerators. More complex applications can be developed by combining the several function offload libraries. You can
also choose to implement applications directly to the ALF interface.
ALF supports the multiple-program-multiple-data (MPMD) programming model where multiple programs can be scheduled to run on multiple accelerator elements at the same
time.
The ALF functionality includes:
Data transfer management
Parallel task management
Double buffering
Dynamic load balancing for data parallel tasks
With the provided API, you can also create descriptions for multiple compute tasks and define their execution orders by defining task dependency. Task parallelism is accomplished by
having tasks without direct or indirect dependencies between them. The ALF run time provides an optimal parallel scheduling scheme for the tasks based on given dependencies.
For more information about ALF, see IBM Software Development Kit for Multicore
Acceleration Accelerated Library Framework for Hybrid-x86 Programmer's Guide and API Reference, SC33-8406.
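The idea of deriving parallelism from task dependencies can be sketched independently of ALF. The function below is not part of the ALF API and does not reproduce the ALF scheduler; it is a plain topological grouping, with hypothetical names, that places tasks into successive waves so that tasks in the same wave have no direct or indirect dependency on one another.

```python
# Group tasks into waves based on their dependencies. Not the ALF API or the
# ALF scheduler; a plain topological grouping for illustration.
def schedule_waves(dependencies):
    """dependencies maps each task to the set of tasks it depends on.

    Every task, including those with no dependencies, must appear as a key.
    Returns a list of waves; tasks within one wave can run in parallel.
    """
    remaining = {task: set(deps) for task, deps in dependencies.items()}
    waves = []
    while remaining:
        ready = sorted(task for task, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("cyclic or unknown dependency")
        waves.append(ready)
        for task in ready:
            del remaining[task]
        for deps in remaining.values():
            deps.difference_update(ready)   # these dependencies are now satisfied
    return waves


if __name__ == "__main__":
    tasks = {
        "load": set(),
        "transform_x": {"load"},
        "transform_y": {"load"},
        "combine": {"transform_x", "transform_y"},
    }
    print(schedule_waves(tasks))
```

In the example, the two transform tasks depend only on the load task, so they land in the same wave and could be dispatched to different accelerators at the same time.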
Appendix A. The Cell Broadband Engine
(Cell/B.E.) processor
Of all of the components that make up the Roadrunner cluster, the Cell/B.E. processor holds a special place in that it provides extraordinary compute power that can be harnessed from a
single multi-core chip. This appendix provides a brief architectural overview of the current Cell/B.E. processor, the motivation for some of its features, as well as the general properties
of this unique processor.
For additional information about the Cell/B.E. processor, refer to the following resources:
Programming the Cell Broadband Engine Architecture: Examples and Best Practices, SG24-7575
IBM Software Development Kit for Multicore Acceleration Data Communication and Synchronization Library for Cell/B.E. Programmer's Guide and API Reference, SC33-8407
IBM Software Development Kit for Multicore Acceleration Accelerated Library Framework for Cell/B.E. Programmer's Guide and API Reference, SC33-8333
The Cell/B.E. project at IBM Research, found at:
http://www.research.ibm.com/cell/
The Cell/B.E. resource center, found at:
http://www.ibm.com/developerworks/power/cell/
Note: Be aware that ample and extensive resources exist on the Cell/B.E. processor, the Cell/B.E. architecture, as well as tutorials for the interested programmer. It is not the
intention of this publication to reproduce all of this information in this short section. We have utilized these extensive resources in our attempt to provide this summary.
Background
The Cell/B.E. architecture is designed to support a very broad range of applications. The first implementation is a single-chip multiprocessor with nine processor elements operating on a
shared memory model, as shown in Figure A-1. In this respect, the Cell/B.E. processor extends current trends in PC and server processors. The most distinguishing feature of the
Cell/B.E. processor is that, although all processor elements can share or access all available memory, their function is specialized into two types: the Power Processor Element (PPE) and the Synergistic Processor Element (SPE). The Cell/B.E. processor has one PPE and eight
SPEs.
The architectural definition of the physical Cell/B.E. architecture-compliant processor is much more general than the initial implementation. A Cell/B.E. architecture-compliant processor
can consist of a single chip, a multi-chip module (or modules), or multiple single-chip modules on a system board or other second-level package. The design depends on the technology
used and performance characteristics of the intended design.
Logically, the Cell/B.E. architecture defines four separate types of functional components:
PowerPC Processor Element (PPE)
Synergistic Processor Unit (SPU)
Memory Flow Controller (MFC)
Internal Interrupt Controller (IIC)
The computational units in the Cell/B.E. architecture-compliant processor are the PPEs and the SPUs. Each SPU must have a dedicated local storage, a dedicated MFC with its
associated memory management unit (MMU), and a replacement management table (RMT). The combination of these components is called a Synergistic Processor Element (SPE).
Figure A-1 Cell/B.E. schematic
The first type of processor element, the PPE, contains a 64-bit PowerPC architecture core. It complies with the 64-bit PowerPC architecture and can run 32-bit and 64-bit applications. The
second type of processor element, the SPE, is designed to run computationally intensive single-instruction multiple-data (SIMD)/vector applications. It is not intended to run a
full-featured operating system. The SPEs are independent processor elements, each running its own individual application programs or threads. Each SPE has full access to shared memory, including the memory-mapped I/O space implemented by multiple DMA units. There
is a mutual dependence between the PPE and the SPEs. The SPEs depend on the PPE to run the operating system and, in many cases, the top-level thread control for a user code. The
PPE depends on the SPEs to provide the bulk of compute power.
The SPEs are designed to be programmed in high-level languages. They support a rich
instruction set that includes extensive SIMD functionality. However, like conventional processors with SIMD extensions, use of SIMD data is preferred but not mandatory. For programming convenience, the PPE also supports the standard PowerPC architecture
instruction set and the SIMD/vector multimedia extensions. To an application programmer, the Cell/B.E. processor looks like a single-core, dual-threaded processor with eight additional
cores, each having its own local store. The PPE is more adept than the SPEs at control-intensive tasks and quicker at task switching. The SPEs are more adept at
compute-intensive tasks and slower than the PPE at task switching. Either processor element is capable of both types of functions. This specialization is a significant factor in accounting for the order-of-magnitude improvement in peak computational performance and power
efficiency that the Cell/B.E. processor achieves over conventional processors.
The more significant difference between the SPE and PPE lies in how they access memory.
The PPE accesses memory with load and store instructions that move data between main storage and a set of registers, the contents of which may be cached. PPE memory access is like that of a conventional processor. The SPEs, in contrast, access main storage with direct
memory access (DMA) commands that move data and instructions between main storage and a private local memory, called a local store (LS). An SPE's instruction fetches and
load/store instructions access a private local store rather than the shared main memory.
This three-level organization of storage (registers, LS, and main memory), with asynchronous
DMA transfers between LS and main memory, is a radical break from conventional architecture and programming models. It explicitly parallelizes computation with the transfers of
data and instructions that feed computation and that store the results of computation in main memory.
A primary motivation for this new memory model is the realization that over the past twenty-five
years, memory latency, as measured in processor cycles, has increased by almost three orders of magnitude. The result is that application performance is, in most cases, limited by
memory latency rather than peak compute capability, as measured by processor clock speeds. When a sequential program performs a load instruction that encounters a cache
miss, program execution comes to a halt for several hundred cycles (techniques such as hardware threading attempt to hide these stalls, but they do not help single-threaded applications). Compared to this penalty, the few cycles that it takes to set up a DMA transfer
for an SPE are a much better trade-off, especially considering that each of the eight SPEs' DMA controllers can maintain up to 16 DMA transfers in flight simultaneously.
Anticipating DMA needs efficiently can provide just-in-time delivery of data, which may reduce this stall or eliminate it entirely. Conventional processors, even with deep and costly
speculation, manage to get, at best, a handful of independent memory accesses in flight.
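The benefit of overlapping DMA transfers with computation can be illustrated with a simple timing model. This is a sketch under stated assumptions, not a measurement: the function names are hypothetical, the cycle counts are invented for illustration, and the model assumes that the transfer of the next block proceeds entirely in parallel with computation on the current block (double buffering).

```python
def serial_cycles(n_blocks, t_transfer, t_compute):
    """Each block is fetched, then processed; the processor stalls during every fetch."""
    return n_blocks * (t_transfer + t_compute)


def double_buffered_cycles(n_blocks, t_transfer, t_compute):
    """The fetch of block i+1 overlaps with computation on block i."""
    if n_blocks == 0:
        return 0
    # The first fetch and the last computation cannot be overlapped with anything;
    # in between, each step costs whichever of the two activities is slower.
    return t_transfer + (n_blocks - 1) * max(t_transfer, t_compute) + t_compute


# Illustrative (invented) figures: 8 blocks, 300-cycle transfer, 400-cycle compute.
serial = serial_cycles(8, 300, 400)
overlapped = double_buffered_cycles(8, 300, 400)
print(serial, overlapped)
```

With these invented figures, the transfer time is hidden behind computation for every block except the first, so the overlapped total approaches the pure compute time as the number of blocks grows.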
One of the SPE's DMA transfer methods supports a list (such as a scatter-gather list) of DMA transfers that is constructed in an SPE's local store, so that the SPE's DMA controller can
process the list asynchronously while the SPE operates on previously transferred data. In several cases, this approach of accessing memory has improved application performance by
almost two orders of magnitude when compared to the performance of conventional processors. This is significantly more than one would expect from the peak performance ratio (approximately 10x) between the Cell/B.E. processor and conventional PC processors.
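The data movement performed by a gather-style DMA list can be sketched as follows. This is a toy model only: the function name, the list element layout (address, size), and the memory contents are hypothetical and do not reflect the real MFC list element format, and the sketch runs synchronously, whereas the hardware processes the list asynchronously.

```python
LOCAL_STORE_BYTES = 256 * 1024  # local store size stated in this appendix


def gather_with_dma_list(main_memory, dma_list):
    """Copy each (address, size) region of main memory into one contiguous buffer."""
    local_buffer = bytearray()
    for address, size in dma_list:
        local_buffer += main_memory[address:address + size]
        if len(local_buffer) > LOCAL_STORE_BYTES:
            raise ValueError("gathered data would not fit in the local store")
    return bytes(local_buffer)


# Hypothetical 1 KB "main memory" and three scattered regions to gather.
main_memory = bytes(range(256)) * 4
dma_list = [(0, 4), (272, 4), (1000, 2)]
gathered = gather_with_dma_list(main_memory, dma_list)
print(list(gathered))
```

The point of the list form is that one command describes many scattered regions, so the SPU sets up the transfer once and then continues with other work.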
The processor elements
The general Cell/B.E. architecture-compliant processor may contain one or more PPEs, while the current implementation consists of only one. The PPE contains a 64-bit, dual-threaded
PowerPC RISC core and supports a PowerPC virtual memory subsystem. The current PowerPC PPE runs at 3.2 GHz. It has 32 KB level-1 (L1) instruction and data caches and a
512 KB level-2 (L2) unified (instruction and data) cache. It is intended primarily for control processing, running an operating system, managing system resources, and managing SPE threads. It can run existing PowerPC architecture software and is well suited to executing
system control code. The instruction set for the PPE is an extended version of the PowerPC instruction set. It includes the vector/SIMD multimedia extensions.
Each of the eight Synergistic Processor Elements (SPEs) contains a 3.2 GHz Synergistic
Processor Unit (SPU) vector processor plus 256 KB of local store that is directly addressable. Computationally, each of these SPEs is capable of producing four floating-point
results per clock period. Simple arithmetic shows that the eight SPEs together have a peak compute power of 102.4 gigaflops.
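The arithmetic behind the peak figure can be written out as a short check. The constant names are illustrative; the values are the ones quoted in the paragraph above.

```python
SPE_COUNT = 8             # SPEs per Cell/B.E. processor
CLOCK_GHZ = 3.2           # SPU clock frequency in GHz
RESULTS_PER_CLOCK = 4     # floating-point results per SPE per clock period

# GHz x results per clock gives billions of results per second (gigaflops) per SPE.
per_spe_gflops = CLOCK_GHZ * RESULTS_PER_CLOCK
peak_gflops = SPE_COUNT * per_spe_gflops
print(f"{per_spe_gflops:.1f} gigaflops per SPE, {peak_gflops:.1f} gigaflops per processor")
```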
The eight identical SPEs are single-instruction multiple-data (SIMD) processor elements that
are intended for computationally intensive operations allocated to them by the PPE. Each SPE contains a RISC core, a 256 KB software-controlled local store for instructions and data,
and a set of 128 registers, each of which is 128 bits wide. The SPEs support a special SIMD instruction set and a unique set of commands for managing DMA transfers and
inter-processor messaging and control.
SPE DMA transfers access main memory using PowerPC effective addresses. As in the PPE,
SPE address translation is governed by PowerPC architecture segment and page tables, which are loaded into the SPEs by privileged software running on the PPE. The SPEs are not
intended to run an operating system.
An SPE controls DMA transfers and communicates with the system by means of channels that are implemented in and managed by the SPE's Memory Flow Controller (MFC). The
channels are unidirectional message-passing interfaces. The PPE and other devices on the system, including other SPEs, can also access this MFC state through the MFC's
memory-mapped I/O (MMIO) registers and queues, which are visible to software in the main memory address space.
The Element Interconnect Bus
The SPEs, PPE, the Memory Interface Controller (MIC) and broadband interface, and the
connection to other Cell/B.E. processors within an SMP are interconnected through a high-speed Element Interconnect Bus (EIB). The EIB is the communication path for commands and data between all processor elements on the Cell/B.E. processor and the on-chip
controllers for memory and I/O. The EIB supports full memory-coherent and symmetric multiprocessor (SMP) operations. A Cell/B.E. architecture processor is designed to be
combined coherently with other Cell/B.E. architecture processors to produce a cluster. The Cell/B.E. blade is one such example where two Cell/B.E. processors are combined in a
shared memory environment to produce an SMP.
The EIB consists of four 16-byte-wide data rings, two in each direction, and a central arbiter. In the absence of path contention, each ring can perform three concurrent data transfers.
Each ring transfers 128 bytes (one PPE cache line) at a time. Processor elements can drive and receive data simultaneously. The SPEs, PPE, and PIC each have 25.6 GBps links to and
from the EIB. In aggregate, the EIB is capable of 204.8 GBps transfers. Figure A-1 on
page 28 shows each of these elements and the order in which the elements are connected to
the EIB. The connection order is important to programmers seeking to minimize the latency of transfers on the EIB, where latency is a function of the number of connection hops. Transfers between adjacent elements have the shortest latencies, while transfers between elements
separated by multiple hops have the longest latencies.
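The relationship between connection order and hop count can be sketched with a small ring model. The element names and their order below are hypothetical placeholders, not the order shown in Figure A-1, and the model assumes that a transfer can travel in whichever direction around the ring is shorter, since rings run in both directions.

```python
# HYPOTHETICAL connection order, for illustration only; consult Figure A-1 for the real one.
RING_ORDER = [
    "PPE", "SPE1", "SPE3", "SPE5", "SPE7", "IO-1",
    "IO-0", "SPE6", "SPE4", "SPE2", "SPE0", "MIC",
]


def hops(source, destination, order=RING_ORDER):
    """Return the number of hops between two elements along the shorter ring direction."""
    distance = abs(order.index(source) - order.index(destination))
    return min(distance, len(order) - distance)


print(hops("PPE", "SPE1"))   # adjacent elements
print(hops("PPE", "MIC"))    # adjacent across the wrap-around of the ring
print(hops("SPE1", "SPE6"))  # elements on opposite sides of the ring
```

A programmer who knows the real connection order could use a function like this to place communicating SPE threads on neighboring elements.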
The EIB's internal maximum bandwidth is 96 bytes per processor clock cycle. Multiple transfers can be in process concurrently on each ring, including more than 100 outstanding
DMA memory transfer requests between main storage and the SPEs in either direction. These requests also may include SPE memory to and from the I/O space. The EIB does not
support any particular quality of service (QoS) behavior other than to guarantee forward progress. However, a resource allocation management (RAM) facility resides in the EIB.
Privileged software can use it to regulate the rate at which resource requesters (the PPE, SPEs, and I/O devices) can use memory and I/O resources.
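The ring and bandwidth figures quoted in this section can be related to each other with some simple arithmetic. One input is an assumption that this paper does not state: that the EIB is clocked at half the processor frequency. Under that assumption, the ring count, ring width, and concurrency reproduce both the 96 bytes per processor clock cycle and the 25.6 GBps per-link figures. The sketch does not attempt to derive the 204.8 GBps aggregate figure.

```python
RINGS = 4                    # data rings in the EIB
TRANSFERS_PER_RING = 3       # concurrent transfers per ring without path contention
RING_WIDTH_BYTES = 16        # width of each ring
PROCESSOR_CLOCK_GHZ = 3.2    # processor clock frequency

# ASSUMPTION (not stated in this paper): the EIB runs at half the processor clock.
BUS_CLOCK_RATIO = 0.5

bytes_per_bus_cycle = RINGS * TRANSFERS_PER_RING * RING_WIDTH_BYTES
bytes_per_processor_cycle = bytes_per_bus_cycle * BUS_CLOCK_RATIO

# One 16-byte ring port moving data every bus cycle, in one direction.
link_gbps = RING_WIDTH_BYTES * PROCESSOR_CLOCK_GHZ * BUS_CLOCK_RATIO

print(bytes_per_bus_cycle, bytes_per_processor_cycle, link_gbps)
```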
Memory Flow Controller
The Memory Flow Controller (MFC) is the data transfer engine. It provides the primary method for data transfer, protection, and synchronization between main storage and the
associated local storage, or between the associated local storage and another local storage. An MFC command describes the transfer to be performed. A principal architectural objective of the MFC is to perform these data transfer operations in as fast and as fair a manner as
possible, thereby maximizing the overall throughput of the processor.
Commands that transfer data are called MFC DMA commands. These commands are converted into DMA transfers between the local storage domain and main storage domain.
Each MFC can typically support multiple DMA transfers at the same time and can maintain and process multiple MFC commands. To accomplish this, the MFC maintains and processes
queues of MFC commands. Each MFC provides one queue for the associated SPU (MFC SPU command queue) and one queue for other processors and devices (MFC proxy
command queue). Logically, a set of MFC queues is always associated with each SPU in a Cell/B.E. architecture-compliant processor.
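The two-queue arrangement described above can be modeled with a small sketch. This is a toy model and not the MFC programming interface: the class name, the method names, the queue depths, and the round-robin service policy are all assumptions made for illustration.

```python
from collections import deque


class MfcQueueModel:
    """Toy model of one MFC with an SPU command queue and a proxy command queue."""

    def __init__(self, spu_depth=16, proxy_depth=8):  # depths are placeholder values
        self.queues = {"spu": deque(), "proxy": deque()}
        self.depths = {"spu": spu_depth, "proxy": proxy_depth}

    def enqueue(self, source, command):
        """Queue a command from the local SPU ("spu") or another device ("proxy")."""
        queue = self.queues[source]
        if len(queue) >= self.depths[source]:
            return False  # queue full; the requester must retry later
        queue.append(command)
        return True

    def drain(self):
        """Service both queues in turn (assumed round-robin) until they are empty."""
        completed = []
        while self.queues["spu"] or self.queues["proxy"]:
            for name in ("spu", "proxy"):
                if self.queues[name]:
                    completed.append(self.queues[name].popleft())
        return completed


mfc = MfcQueueModel()
mfc.enqueue("spu", "get A")
mfc.enqueue("spu", "get B")
mfc.enqueue("proxy", "put C")
order = mfc.drain()
print(order)
```

Alternating between the queues is one simple way to keep either requester from starving the other; the real arbitration policy is not described in this paper.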
The on-chip memory interface controller (MIC) provides the interface between the EIB and
physical memory. The IBM BladeCenter QS22 uses normal DDR memory and additional hardware logic to implement the MIC. Memory accesses on each interface are 1 to 8, 16, 32,
64, or 128 bytes, with coherent memory ordering. Up to 64 reads and 64 writes can be queued. The resource allocation token manager provides feedback about queue levels. The MIC has multiple software-controlled modes, including fast path mode (for improved latency
when command queues are empty), high priority read (for prioritizing SPE reads in front of all other reads), early read (for starting a read before a previous write completes), speculative
read, and slow mode (for power management). The MIC implements a closed page controller (bank rows are closed after being read, written, or refreshed), memory initialization, and
memory scrubbing.
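The set of access sizes listed above can be expressed as a simple membership check. The names are hypothetical, and the sketch assumes that "1 to 8" means every whole number of bytes from 1 through 8.

```python
# Access sizes in bytes, as listed above: 1 through 8, then 16, 32, 64, and 128.
VALID_ACCESS_SIZES = set(range(1, 9)) | {16, 32, 64, 128}


def is_valid_access_size(size_in_bytes):
    """Return True if a single memory access of this size is in the listed set."""
    return size_in_bytes in VALID_ACCESS_SIZES


for size in (4, 12, 128, 256):
    print(size, is_valid_access_size(size))
```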
Glossary
Accelerator General or special purpose processing
element in a hybrid system. An accelerator might have a multi-level architecture with both host elements and
accelerator elements. An accelerator, as defined here, is
a hierarchy with potentially multiple layers of hosts and
accelerators. An accelerator element is always associated
with one host. Aside from its direct host, an accelerator
cannot communicate with other processing elements in
the system. The memory subsystem of the accelerator
can be viewed as distinct and independent from a host.
This is referred to as the subordinate in a cluster
collective.
All-reduce operation Output from multiple accelerators
is reduced and combined into one output.
API Application Programming Interface. An application
programming interface defines the syntax and semantics
for invoking services from within an executing application.
All APIs are targeted to be available to both FORTRAN
and C programs, although implementation issues (such
as whether the FORTRAN routines are simply wrappers
for calling C routines) are up to the supplier.
ASCI The name commonly used for the Advanced
Simulation and Computing program administered by the
Department of Energy (DOE)/National Nuclear Security
Administration (NNSA).
ASIC Application Specific Integrated Circuit.
B/U Bring up.
CEC Central electronic complex.
cluster A collection of nodes.
compute kernel Part of the accelerator code that does
stateless computation tasks on one piece of input data
and generates the corresponding output results.
compute task An accelerator execution image that
consists of a compute kernel linked with the accelerated
library framework accelerator runtime library.
DaCS element A general or special purpose processing
element in a topology. This refers specifically to the
physical unit in the topology. A DaCS element can serve
as a host or an accelerator.
DDR Double Data Rate. DDR is a technique for
doubling the switching rate of a circuit by triggering both
the rising edge and falling edge of a clock signal.
DE See DaCS element.
de_id A unique number assigned by the DaCS
application at run time to a physical processing element in a topology.
group A group construct specifies a collection
of DEs and processes in a system.
EDRAM Enhanced dynamic random access memory is
dynamic random access memory that includes a small
amount of static RAM (SRAM) inside a larger amount of
DRAM. Performance is enhanced by making sure that
many of the memory accesses will be to the faster SRAM.
EMC Electromagnetic compatibility.
ESD Electrostatic discharge.
ETH Ethernet, as in adapter or interface.
FLOP Floating Point OPeration. A measure of
computation speed frequently used with
supercomputers.
FLOP/s FLOPs per second.
FPU Floating point unit.
FRU Field replaceable unit.
GFLOP GigaFLOP. A gigaFLOP/s is a billion (10^9 =
1,000,000,000) floating point operations per second.
handle A handle is an abstraction of a data object, usually a pointer to a structure.
HBCT Hardware-based cycle time.
host A general purpose processing element in a hybrid
system. A host can have multiple accelerators attached to
it. This is often referred to as the master node in a cluster
collective.
hybrid A 64-bit x86 system using a Cell Broadband
Engine (Cell/B.E.) architecture as an accelerator.
I/O I/O (input/output) describes any operation, program,
or device that transfers data to or from a computer.
I/O node The I/O nodes (ION) are responsible, in part,
for providing I/O services to compute nodes.
Job A job is a cluster-wide abstraction similar to a
POSIX session, with certain characteristics and attributes.
Commands are targeted to be available to manipulate a
job as a single entity (including kill, modify, query
characteristics, and query state).
LANL Los Alamos National Laboratory.
LINPACK LINPACK is a collection of FORTRAN
subroutines that analyze and solve linear equations and
linear least-squares problems.
main thread The main thread of the application. In
many cases, Cell/B.E. architecture programs are multi-threaded using multiple SPEs running concurrently.
A typical scenario is that the application consists of a main
thread that creates as many SPE threads as needed and
the application organizes them.
MFLOP MegaFLOP/s. A megaFLOP/s is a million (10^6
= 1,000,000) floating point operations per second.
MPI Message passing interface.
MPICH2 MPICH is an implementation of the MPI
standard available from Argonne National Laboratory.
node A node is a functional unit in the system topology, consisting of one host together with all the accelerators
connected as children in the topology (this includes any
children of accelerators).
parent The parent of a DE is the DE that resides
immediately above it in the topology tree.
PPE Power Processor Element: 64-bit Power
Architecture unit within the CBE that is optimized for
running operating systems and applications. The PPE
depends on the SPEs to provide the bulk of the application
performance.
PPE PowerPC Processor Element. The general-purpose processor in the Cell/B.E. processor.
process A process is a standard UNIX-type process
with a separate address space.
RAS Reliability, availability, and serviceability.
service node The service node is responsible, in part,
for management and control of Roadrunner.
SIMD Single Instruction Multiple Data. Processing in
which a single instruction operates on multiple data
elements that make up a vector data type. Also known as
vector processing. This style of programming implements data-level parallelism.
SN See service node.
SPE Synergistic Processor Element. Eight of these exist
within the Cell/B.E. processor, optimized for running
compute-intensive applications, and they are not
optimized for running an operating system. The SPEs are
independent processors, each running its own individual
application programs.
SPE Synergistic Processor Element. Extends the
PowerPC 64 architecture by acting as cooperative offload
processors (synergistic processors), with the direct
memory access (DMA) and synchronization mechanisms
to communicate with them (memory flow control), and with
enhancements for real-time management. There are eight
SPEs on each Cell/B.E. processor.
SPMD Single Program Multiple Data. A common style of
parallel computing. All processes use the same program,
but each has its own data.
SPU Synergistic Processor Unit. The part of an SPE
that execut