Hosting Large-scale e-Infrastructure Resources
Mark Leese
mark.leese@stfc.ac.uk
Contents
• Speed dating introduction to STFC
• Idyllic life, pre-e-Infrastructure
• Sample STFC hosted e-Infrastructure projects
• RAL network re-design
• Other issues to consider
STFC
• One of seven publicly funded UK Research Councils
• Formed from 2007 merger of CCLRC and PPARC
• STFC does a lot, including…
– awarding research, project & PhD grants
– providing access to international science facilities through its funded membership of bodies like CERN
– sharing its expertise in areas such as materials and space science with academic and industrial communities
• …but it is mainly recognised for hosting large-scale scientific facilities, inc. High Performance Computing (HPC) resources
Harwell Oxford Campus
- STFC major shareholder in Diamond Light Source
- Electron beam accelerated to near light speed within ring
- Resulting light (X-Ray, UV or IR) interacts with samples being studied
- ISIS: ‘super-microscope’ employing neutron beams to study materials at atomic level
Harwell Oxford Campus
- STFC’s Rutherford Appleton Lab is part of Harwell Oxford Science and Innovation Campus with UKAEA, and commercial campus management company
- Co-locate hi-tech start-ups and multi-national organisations alongside established scientific and technical expertise
- Similar arrangement at Daresbury in Cheshire
- Both within George Osborne Enterprise Zones:
  - Reduced business rates
  - Government support for roll-out of superfast broadband
Previous Experiences
[Diagram: the 16.5-mile LHC ring with its four detectors: ATLAS, CMS, ALICE, LHCb]
Large Hadron Collider
• LHC at CERN
• Search for elementary but hypothetical Higgs boson particle
• Two proton (hadron) beams
• Four experiments (particle detectors)
• Detector electronics generate data during collisions
LHC and Tier-1
• After initial processing, the four experiments generated 13 PetaBytes of data in 2010 (> 15m GB or 3.3m single-layer DVDs)
• In last 12 months, Tier-1 received ≈ 6 PBs from CERN and other Tier-1s
• GridPP contributes equivalent of 20,000 PCs
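As a sanity check on those figures, a quick sketch; it assumes binary petabytes and 4.7 GB single-layer DVDs, so the results land slightly below the slide's rounded numbers:

```python
# Sanity-check the slide's figures: 13 PB of LHC data in 2010.
# Assumes binary petabytes (PiB) and 4.7 GB single-layer DVDs.

def pb_to_gb(pb: float) -> float:
    """Binary petabytes to decimal gigabytes."""
    return pb * 2**50 / 1e9

def gb_to_dvds(gb: float, dvd_gb: float = 4.7) -> float:
    """Number of single-layer DVDs needed to hold `gb` gigabytes."""
    return gb / dvd_gb

gigabytes = pb_to_gb(13)       # roughly 14.6 million GB
dvds = gb_to_dvds(gigabytes)   # roughly 3.1 million DVDs
print(f"{gigabytes/1e6:.1f}m GB, {dvds/1e6:.1f}m DVDs")
```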
UK Tier-1 at RAL

[Diagram: "normal" data enters the RAL site through the Site Access Router (primary and backup links), front-door firewall and security to reach the Tier-1 ("PetaBytes?!?"); LHC data from Tier-0 and other Tier-1s arrives over the CERN LHC OPN (Optical Private Network) on a 10 Gbps lightpath, plus backup, to the UKLight Router, bypassing the firewall]
Tier-1 to Tier-2s (universities)
• Individual Tier-1 hosts route data to Router A or UKLight as appropriate
• Config pushed out with Quattor Grid/cluster management tool
• Access Control Lists of IP addresses on SAR, UKLight router and/or hosts replace firewall security
• As Tier-2 (universities) network capabilities increase, so must RAL’s (10 → 20 → 30 Gbps)
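As an illustration of the ACL idea, a minimal sketch; the prefixes (RFC 5737 documentation ranges) and the rule format are invented for the example, and real configs would be pushed out via Quattor as above:

```python
# Illustrative sketch only: permit known Tier-0/Tier-1 prefixes so LHC
# traffic bypasses the firewall, and deny everything else.
# The prefixes and ACL text format below are made up for illustration.
import ipaddress

TRUSTED_PREFIXES = [
    ipaddress.ip_network("192.0.2.0/24"),     # stand-in for a Tier-0 range
    ipaddress.ip_network("198.51.100.0/24"),  # stand-in for another Tier-1
]

def permitted(src: str) -> bool:
    """True if the source address matches a trusted prefix."""
    addr = ipaddress.ip_address(src)
    return any(addr in net for net in TRUSTED_PREFIXES)

def acl_lines() -> list[str]:
    """Render the allowlist as simple ACL-style lines."""
    return [f"permit ip {net} any" for net in TRUSTED_PREFIXES] + ["deny ip any any"]

print("\n".join(acl_lines()))
```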
LOFAR
- LOw Frequency Array
- World's largest and most sensitive radio telescope
- Thousands of simple dipole antennas, 38 European arrays
- 1st UK array opened at Chilbolton, Sept 2010
- 7 PetaBytes a year raw data generated (> 1.5m DVDs)
- Data transmitted in real-time to IBM BlueGene/P supercomputer at Uni of Groningen
- Data processed & combined in software to produce images of the radio sky
LOFAR
- 10 Gbps Janet Lightpath
- Janet → GÉANT → SURFnet
- Big leap from FedEx’ing data tapes or drives
- 2011 RCUK e-IAG: “Southampton and UCL make specific reference ... quicker to courier 1TB of data on a portable drive”
- Funded by LOFAR-UK
- cf. LHC: centralised not distributed processing
- Expected to pioneer approach for other projects, e.g. Square Kilometre Array
Sample STFC e-Infrastructure Projects
ICE-CSE
• International Centre of Excellence for Computational Science and Engineering
• Was going to be Hartree Centre, now DFSC
• STFC Daresbury Laboratory, Cheshire
• Partnership with IBM
• Mission to provide HPC resources and develop software
• DL previously hosted HPCx, big academic HPC before HECToR
• IBM BlueGene/Q supercomputer
• 114,688 processor cores, 1.4 Petaflops peak performance
• Partner IBM’s tests were first time a Petaflop application has been run in the UK (one thousand trillion calculations per second)
• 13th in this year’s TOP500 worldwide list
• Rest of Europe appears five times in Top 10
• DiRAC and HECToR (Edinburgh) 20th and 32nd
ICE-CSE
• DL network upgraded to support up to 8 × 10 Gbps lightpaths to current regional Janet deliverer, Net North West, in Liverpool and Manchester
• Same optical fibres, different colours of light:
  1. 10G JANET IP service (primary)
  2. 10G JANET IP service (secondary)
  3. 10G DEISA (consortium of European supercomputers)
  4. 10G HECToR (Edinburgh)
  5. 10G ISIC (STFC-RAL)
  More expected as part of IBM-STFC collaboration
• Feasible because NNW rents its own dark (unlit) fibre network
• NNW ‘simply’ changes the optical equipment on each end of the dark fibre
• Key aim is for machine and expertise to be available to commercial companies
• How? Over Janet?
• A Strategic Vision for UK e-Infrastructure estimates that 1,400 companies could make use of HPC, with 300 quite likely to do so
• So even if some instead go for the commercial “cloud” option...
JASMIN & CEMS
• Joint Analysis System Meeting Infrastructure Needs
• JASMIN and CEMS funded by BIS through NERC, and UKSA and ISIC respectively
• Compute and storage cluster for the climate and earth system modelling community
[Diagram: JASMIN, a big compute and storage cluster with 4.6 PetaBytes of fast disc storage. JASMIN will talk internally to other STFC resources (compute + 500 TB), to its satellite systems (150 TB each), and to the Netherlands, the Met Office & Edinburgh over UKLight]
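For a sense of scale, a sketch of how long JASMIN's full 4.6 PB would take to move over a single 10 Gbps lightpath; idealised numbers, with no protocol overhead assumed:

```python
# How long to move 4.6 PB over one 10 Gbps lightpath at full rate?
# Idealised: ignores TCP/protocol overhead and contention.

def transfer_days(n_bytes: float, bps: float) -> float:
    """Days to move `n_bytes` at `bps` bits per second."""
    return n_bytes * 8 / bps / 86400

days = transfer_days(4.6e15, 10e9)  # roughly 42.6 days
print(f"{days:.1f} days")
```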
CEMS in the ISIC
• Climate and Environmental Monitoring from Space
• Essentially JASMIN for commercial users
• Promote use of ‘space’ data and technology within new market sectors
• Four consortia already won funding from publicly funded ‘Space for Growth’ competition (run by UKSA, TSB and SEEDA)
• Hosted in International Space Innovation Centre
• A ‘not-for-profit’ formed by industrials, academia and government
• Part of UK’s Space Innovation and Growth Strategy to grow the sector’s £ turnover
• ISIC is STFC ‘Partner Organisation’ in terms of Janet Eligibility Policy
• So... Janet-BCE (Business and Community Engagement) for network access related to academic and ISIC partners
• Commercial ISP for network access related to commercial customers
• As the industrial collaboration agenda is pushed, this needs to be controlled and applicable elsewhere in STFC
[Diagram: the JASMIN router connects over one 10 Gbps fibre to Janet (Janet & Janet-BCE traffic) and over another 10 Gbps fibre to BT (commercial traffic), with commercial customers and Janet-BCE on separate VLANs from the RAL infrastructure]
• JASMIN and CEMS connected at 10 Gbps…
• …but no Janet access for CEMS via JASMIN
• Keeping Janet ‘permitted’ traffic as separate BCE VLAN allows tighter control
• Customers will access CEMS on different IP addresses depending on who they are (academia, partners, commercials)
• This could be enforced
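A minimal sketch of such enforcement, assuming per-class address ranges; the prefixes (RFC 5737 documentation ranges) and the class-to-path mapping are invented for illustration:

```python
# Sketch of "different IP addresses depending on who they are":
# classify a customer by source prefix, then pick the permitted path.
# Prefixes and path names below are illustrative only.
import ipaddress

CLASSES = {
    "academia":   ipaddress.ip_network("192.0.2.0/24"),
    "partner":    ipaddress.ip_network("198.51.100.0/24"),
    "commercial": ipaddress.ip_network("203.0.113.0/24"),
}

# Which network path each class of customer may use.
PATHS = {
    "academia":   "Janet",
    "partner":    "Janet-BCE VLAN",
    "commercial": "commercial ISP VLAN",
}

def classify(src: str) -> str:
    """Map a source address to its customer class, or 'unknown'."""
    addr = ipaddress.ip_address(src)
    for name, net in CLASSES.items():
        if addr in net:
            return name
    return "unknown"

def path_for(src: str) -> str:
    """Network path the traffic should take; 'denied' if unrecognised."""
    return PATHS.get(classify(src), "denied")
```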
RAL Network Re-Design & Other Issues
[Diagram: Tier-1 clouds connecting to The Outside World]
Two main aims:
1. Resilience: Reduce serial paths and single points of failure.
2. Scalability and flexibility: Remove need for special cases. Make adding bandwidth and adding ‘clouds’ (e.g. Tier-1 or tenants) a repeatable process with known costs.
[Diagram: current design: RAL PoP, Site Access Router, UKLight Router, Internal Distribution Router A]

RAL Network Re-Design

[Diagram: re-designed RAL site. Site external connectivity: primary and backup links to Janet, the CERN LHC OPN, the campus, and a commercial ISP (for tenants and visitors). Site access & distribution: Rtr 1 & 2 and Sw 1 & 2 at the RAL PoP handle campus access & distribution, with security from a virtual firewall; an implicit trust relationship = bypass firewall for LHC data. Internal site distribution: Rtr A and per-cloud routers serve Tier-1, ISIS, Admin, JASMIN, projects/facilities/departments and the rest of the RAL site; "normal" data and LHC data take separate paths]
Rtr 1 & 2, Sw 1 & 2:
- Front: 48 ports 1/10 GbE (SFP+)
- Back: 4 ports 40 GbE (QSFP+)

• Lots of 10 Gigs:
– clouds and new providers can be readily added
– bandwidth readily added to existing clouds
– clouds can be dual connected
RAL Site Resilience
[Map: primary link to Reading, backup link to London (scale: 500 ft / 100 m)]
User Education
• Belief that you can plug a node or cluster into “the network” and be immediately firing lots of data all over the world is a fallacy
• Over-provisioning is not a complete solution
• Having invested £m’s elsewhere, most network problems that do arise are within the last mile: campus network → individual devices → applications
• On the end systems...
– Network Interface Card
– Hard disc
– TCP configuration
– Poor cabling
– Does your application use parallel TCP streams?
– What protocols does your application use for data transfer (GridFTP, HTTP...)?
• Know what to do on your end systems
• Know what questions to ask of others
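The “TCP configuration” point is usually the bandwidth-delay product: the buffer a single TCP stream needs to keep a fast, long path full. A quick sketch with illustrative numbers:

```python
# Bandwidth-delay product: the TCP window/buffer needed to keep a
# single stream's pipe full. Numbers below are illustrative only.

def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Buffer (bytes) needed to sustain `bandwidth_bps` over `rtt_s` RTT."""
    return bandwidth_bps / 8 * rtt_s

# 1 Gbps to the US west coast (~150 ms RTT) needs ~19 MB of buffer,
# far above common OS defaults: one reason single-stream transfers
# crawl, and why parallel TCP streams (as GridFTP uses) help.
buf = bdp_bytes(1e9, 0.150)
print(f"{buf/1e6:.2f} MB")
```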
User Support
• 2010 example: CMIP5 - RAL Space sharing environmental data with Lawrence Livermore (West coast US) and DKRZ (Germany)
– ESNet, California → GÉANT, London: 800 Mbps
– ESNet, California → RAL Space: 30 Mbps
– RAL Space → DKRZ, Germany: 40 Mbps
– So RAL is the problem, right? Not necessarily...
– DKRZ, Germany → RAL Space: up to 700 Mbps
• Involved six distinct parties: RAL Space, STFC Networking, Janet, DANTE, ESNet, LLNL
• Difficult, although the experiences probably fed into the aforementioned JASMIN
• Tildesley’s Strategic Vision for UK e-Infrastructure talks of “the additional effort to provide the skills and training needed for advice and guidance on matching end-systems to high-capacity networks”
I’ll do anything for a free lunch
• Access Control and Identity Management
– During DTI’s e-Science programme, access to resources was often controlled using personal X.509 certificates
– Is that scalable?
– Will you run or pay for a PKI?
– Resource providers may want to try Moonshot
  • extension of eduroam technology
  • users of e-Infrastructure resources authenticated with user credentials held by their employer
• Will the Janet Brokerage be applicable to HPC e-Infrastructure resources?
Conclusions
From the STFC networking perspective:
• Adding bandwidth should be a repeatable process with known costs
• Networking is now a core utility, just like electricity: plan for resilience on many levels
• Plan for commercial interaction
• In all the excitement, don’t forget security
• e-Infrastructure funding is paying for capital investments - be aware of the recurrent costs