One Team – One Culture – One Purpose – One SSC
Environment and Climate Change Canada HPC Renewal Project:Procurement Results
17th Workshop on HPC in meteorology
ECMWF, Reading, UK
Alain St-Denis & Luc Corbeil
October 2016
Outline
• Background
• History
• Scope
• RFP
• Outcome
HPC Renewal for ECCC: Background
• Environment and Climate Change Canada (ECCC) is highly dependent on HPC to deliver its mandate: simulation of environmental forecasts for the health, safety, security and economic well-being of Canadians.
• The contract with IBM was expiring, with few remaining options to extend.
• Linked to the Meteorological Service of Canada (MSC) Renewal Treasury Board Submission:
  Component 1: Monitoring Networks
  Component 2: Supercomputing Capacity
  Component 3: Weather Warnings and Forecast System
• Joint ECCC-SSC submission for Supercomputing Capacity
New player: Shared Services Canada
• Created in 2012 to take responsibility for email, networks and data centres for the whole Government of Canada.
• Supercomputing IT staff working for ECCC were transferred to SSC.
• The scope of the HPC team expanded to cover all science departments.
• As in any reorganization, there are challenges and opportunities!
Shared Services Canada – Our Mandate

Shared Services Canada was formed to consolidate and streamline the delivery of IT infrastructure services, specifically email, data centre and network services. Our mandate is to do this so that federal organizations and their stakeholders have access to reliable, efficient and secure IT infrastructure services at the best possible value.

SSC will innovate, ensure full value for money and achieve service excellence!

[Diagram: SSC Services supporting Departmental Programs, which deliver Service to Canadians]
A Bit of History
• ECCC has been using a supercomputer for weather forecasting and atmospheric science for more than half a century
[Chart: peak and sustained performance of EC supercomputers by year, in millions of floating-point operations per second (log scale, 1.E-02 to 1.E+09), from the Bendix G20 and IBM 360/65 through the CDC 7600 and 176, Cray 1, X-MP 28 and X-MP 4-16, NEC SX-3/44, SX-3/44R, SX-4/16, SX-4/80M3, SX-5/32M2 and SX-6/80M10, to the IBM Power4, Power5 and Power7]
Power5
A Bit of (More Recent) History
• Request for Information (Fall 2012)
• Invitation to Qualify (Fall 2013, 4 bidders qualified)
• Review and Refine Requirements (Summer 2014)
• Request for Proposals (November 2014 – June 2015)
• Treasury Board Approval (April 2016)
• Contract Award (May 27, 2016)
Scope
Scope                                    In replacement of
Supercomputer clusters                   Two 8,192-core Power7 clusters
Pre/Post-Processing clusters (PPP)       Two 640-core x86 custom clusters
Global Parallel Storage (Site-Store)     CNFS and ESS clusters
Near-Line Storage (HP-NLS)               StorNext-based archiving cluster
Home directories                         NetApp home directories

As well as:
• Hosting of the solution
• High-performance interconnects
• Software & tools
• Maintenance & support
• Training & conversion support
• On-going high availability
ECCC Supercomputing Procurement Requirements
• Contract for a hosted HPC solution: 8.5 years + one 2.5-year option (transition year + two upgrades + one optional)
• Connectivity between the HPC Solution Data Halls and Dorval
• No more than 70 km between Hall A, Hall B and Dorval
• Flexible options for additional needs
[Diagram: Solution Data Hall A and Solution Data Hall B, each connected to the other and to the NCF by redundant inter-hall links (x2 each); on-going availability]
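The 70 km cap effectively bounds the propagation latency between the data halls and Dorval, which matters for synchronized storage across sites. A rough sketch of the implied delay, assuming (not stated on the slides) a signal speed in optical fibre of about two thirds of c and a cable run equal to the 70 km cap:

```python
# Rough one-way propagation delay implied by the 70 km distance cap.
# Assumptions (not from the slides): fibre signal speed ~ 2/3 c;
# cable length equals the 70 km straight-line maximum.
C = 299_792_458          # speed of light in vacuum, m/s
FIBRE_SPEED = 2 / 3 * C  # approximate propagation speed in fibre, m/s
DISTANCE_M = 70_000      # maximum allowed inter-hall distance, metres

one_way_ms = DISTANCE_M / FIBRE_SPEED * 1_000
print(f"One-way propagation delay: {one_way_ms:.2f} ms")  # about 0.35 ms
```

Sub-millisecond one-way delay keeps cross-hall storage synchronization practical.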
High Level Architecture
SCF Data Flow – Logical View (2014-10-07, LPT, HPN/DADS, SSC)
[Diagram: each Solution Data Hall hosts a Supercomputer with Scratch storage, a Pre/Post-Processing cluster, a Site Store, Home directories, and an HP-NLS with disk Cache; the halls exchange data via HPN Data Transfer and Storage Synchronization links, receive DATA feeds from the NCF, and share Out-of-Band Management]
Outcome
• IBM was awarded the contract
  – Evaluation based on benchmark performance on a fixed budget
• IBM's proposal for the initial system:
  – Supercomputer: Cray XC-40, Intel Broadwell, Sonexion Lustre storage
  – PPP: Cray CS-400, Intel Broadwell
  – Site-Store and homes: IBM Elastic Storage Server (ESS, GPFS-based)
  – HP-NLS: based on IBM High Performance Storage System (HPSS)
Sizing
• Computing
  – About 35,000 Intel Broadwell cores per data hall (supercomputer and PPP combined)
• More than 40 PB of disk storage
  – 2.5 PB scratch storage per supercomputer (one per data hall)
  – 18 PB site store per data hall
  – 1.1 PB disk cache to the archive per data hall
• More than 230 PB of tape storage (two copies)
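The per-hall disk figures are consistent with the "more than 40 PB" headline; a quick check (per-hall figures from the slide, two-hall total derived):

```python
# Back-of-envelope check of the disk-storage sizing quoted on this slide.
# Per-hall figures come from the slide; only the total is computed here.
SCRATCH_PB = 2.5        # scratch per supercomputer (one per data hall)
SITE_STORE_PB = 18.0    # site store per data hall
ARCHIVE_CACHE_PB = 1.1  # disk cache to the archive per data hall
HALLS = 2

total_pb = HALLS * (SCRATCH_PB + SITE_STORE_PB + ARCHIVE_CACHE_PB)
print(f"Total disk across both halls: {total_pb:.1f} PB")  # 43.2 PB
```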
Comparison
[Chart: increase factors (roughly 0–6x) of the new system over the current one for: HP-NLS storage vs current tape capacity (PB); Site-Store and homes storage vs current (PB); sustained TFlops, supercomputer and PPP (vs P7 and current PPP); peak TFlops, supercomputer and PPP (vs P7 and current PPP); core count, supercomputer and PPP (vs P7 and current PPP); scratch storage vs P7 (PB)]
The Newest Addition to a Long History
[Chart: Historical Performance, EC Supercomputers (Flops), sustained and peak on a log scale, from the Bendix G20 and IBM 360/65 through the CDC 7600 and 176, Cray 1S, Cray X-MP 28 and 416, NEC SX-3/44, SX-3/44R, SX-4/16, SX-4/80M3, SX-5/32M2 and SX-6/80M10, and the IBM P4, P5 and P7, to the new IBM/Cray XC-40]
Resulting Architecture
HPC Implementation Milestones: Delivery to Acceptance
• Data Hall and Hosting Site Certification
• Functionality testing (IT infrastructure)
• Security accreditation
• Performance testing
• Conversion of operational codes (Automated Environmental Analysis & Production, AEAPPS)
• Meeting the above triggers a 30-day availability test

Inspection → Functionality Testing → Performance Testing → Conversion → RFU → Acceptance
Challenge
• Change the supercomputer clusters, PPP clusters, archiving system and homes. All at once. This has never been done before.
  A lot of preparation work has been done ahead of time:
  ♦ Most codes have already been ported to the Intel architecture
  ♦ Our General Purpose Science Clusters are available for PPP migration work
    – Linux containers are being leveraged to smooth the transition
Thank you!
Questions?