Israel ATLAS TIER-2 Status, April 2011
Lorne Levinson
NL Cloud Meeting, 5 April 2011

Israel HEP community
• ATLAS is the only LHC experiment in which we participate
  – also Phenix (Heavy Ion @ BNL), ILC, ZEUS
  – Israel is “1.35% of ATLAS” (MoU pledge, authors, common fund)
  – 25-30 people doing physics analysis
• 3 sites:
  – Tel Aviv University, Tel Aviv (1956)
    • a university
  – The Technion Israel Institute of Technology, Haifa (1924)
    • a university
  – Weizmann Institute of Science, Rehovot (1934)
    • a research institute (Biology, Chemistry, Physics, Math & CS) with a graduate school (no undergrads)
• longest travel is Weizmann to Technion, 2 hours office-to-office

Organization
• we are a distributed Tier2/Tier3
• each site combines Tier2 and Tier3 resources in the same cluster
  – all resources shared flexibly between T2 and T3 (Lustre/Storm)

• single management and budget, single purchasing

• three sites as identical as possible

• Steering Committee for overall policy

• Management & Operations team for the three sites

• stable funding approved until 2012

Storage
Continues to be the biggest reliability issue.
• Our hardware is now stable:
  – replaced DDN 6620s with DDN 9900
    • fully redundant, 300 disk slots, 8 x 8Gb/s FC ports, 5GB/s
  – two Lustre “OSS” servers
  – WI servers have 10Gb/s to the cluster; TAU and Technion will install 10G in April
• Gave up on Thumpers+Lustre and Thumpers+iSCSI+Lustre
  – we NFS-mount Thumpers running Solaris+ZFS for extra "archive" storage, home directories or /opt/exp_soft
• Lustre + Storm problem: the Storm team does not test new Storm releases on Lustre
  – the Storm-Lustre community must solve this
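
A rough way to see which stage limits the new storage path is to compare the FC front end with the OSS network links. The sketch below assumes about 800 MB/s of usable payload per 8Gb/s FC port and line-rate 10Gb/s Ethernet on each of the two OSS servers; neither figure is a measurement from our sites.

```python
# Back-of-envelope bandwidth check for the storage path.
# Assumptions (not from the slide): 8Gb/s Fibre Channel carries about
# 800 MB/s of payload per port per direction, and each OSS server's
# 10Gb/s Ethernet link carries at most 1250 MB/s.

FC_PORTS = 8                 # from the slide: 8 x 8Gb/s FC ports
FC_PAYLOAD_MB_S = 800        # assumed usable payload per 8Gb/s FC port
QUOTED_ARRAY_MB_S = 5000     # from the slide: 5GB/s
OSS_SERVERS = 2              # from the slide: two Lustre OSS servers
OSS_LINK_GBIT = 10           # from the slide: 10Gb/s to the cluster


def fc_aggregate_mb_s(ports: int, per_port_mb_s: int) -> int:
    """Upper bound on what the FC front end can move, in MB/s."""
    return ports * per_port_mb_s


def ethernet_aggregate_mb_s(servers: int, link_gbit: int) -> int:
    """Upper bound on what the OSS servers can push to the cluster, in MB/s."""
    return servers * link_gbit * 1000 // 8


def bottleneck_mb_s() -> int:
    """The slowest stage bounds end-to-end throughput."""
    return min(
        fc_aggregate_mb_s(FC_PORTS, FC_PAYLOAD_MB_S),
        QUOTED_ARRAY_MB_S,
        ethernet_aggregate_mb_s(OSS_SERVERS, OSS_LINK_GBIT),
    )


if __name__ == "__main__":
    print("FC front end:", fc_aggregate_mb_s(FC_PORTS, FC_PAYLOAD_MB_S), "MB/s")
    print("OSS Ethernet:", ethernet_aggregate_mb_s(OSS_SERVERS, OSS_LINK_GBIT), "MB/s")
    print("bottleneck:  ", bottleneck_mb_s(), "MB/s")
```

Under those assumptions the two 10Gb/s OSS links (about 2.5 GB/s together), not the 5GB/s DDN 9900, would cap throughput to the cluster.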

Storm/Lustre
• Storm allows LCG SRM storage and our local global file name space to share the same physical storage
  – no rigid boundary
  – jobs in the cluster can read SRM files with ordinary Linux file I/O
• Storm can run over Lustre (open source) or GPFS (IBM)
• Lustre:
  – Object Storage Targets serve (stripes of) file data
  – the Meta-Data Server holds directories
    • redundant failover of MDSs will soon be supported
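
The point that Object Storage Targets serve stripes of file data can be made concrete with a small model of RAID-0 striping: consecutive fixed-size stripes are placed round-robin over the OSTs in the file's layout. The stripe size and OST list below are made-up values; on a real Lustre system the layout is set per file or directory (for example with `lfs setstripe`).

```python
# Minimal model of Lustre-style RAID-0 file striping: a file is cut into
# fixed-size stripes placed round-robin on the OSTs of its layout.
# Stripe size and OST indices used here are illustrative assumptions,
# not this site's configuration.

from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    stripe_size: int        # bytes per stripe
    osts: tuple[int, ...]   # OST indices the file is striped over, in order


def locate(layout: Layout, offset: int) -> tuple[int, int]:
    """Map a byte offset in the file to (OST index, offset within that OST's object)."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    stripe_number = offset // layout.stripe_size
    count = len(layout.osts)
    ost = layout.osts[stripe_number % count]
    # Each OST object holds every count-th stripe, packed contiguously.
    object_offset = (stripe_number // count) * layout.stripe_size + offset % layout.stripe_size
    return ost, object_offset


if __name__ == "__main__":
    MiB = 1 << 20
    layout = Layout(stripe_size=1 * MiB, osts=(3, 7, 11, 15))
    for off in (0, 1 * MiB, 4 * MiB + 10):
        print(off, locate(layout, off))
```

A single large file is thus read in parallel from several OSTs, while the MDS is only consulted for the name and layout.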

Storage – installed SRM + local capacity (net TB)

                   TAU   Technion   Weizmann   Total
2010               240        192        288     720
2011 purchase       96        144        144     384
Total 2011         336        336        432    1104

Heavy Ion, 3Q2011: a further 48, for a total of 1152
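
The totals in the storage table can be recomputed from the per-site rows; this short sketch does that with the numbers copied from the table (net TB). The site split of the 48 TB Heavy Ion addition is not given, so it is only added to the grand total.

```python
# Recompute the storage table totals (net TB) from its per-site rows.

SITES = ("TAU", "Technion", "Weizmann")
INSTALLED_2010 = {"TAU": 240, "Technion": 192, "Weizmann": 288}
PURCHASE_2011 = {"TAU": 96, "Technion": 144, "Weizmann": 144}
HEAVY_ION_3Q2011 = 48  # site split not given in the table


def totals_2011() -> dict[str, int]:
    """Per-site capacity after the 2011 purchase."""
    return {s: INSTALLED_2010[s] + PURCHASE_2011[s] for s in SITES}


def grand_total(per_site: dict[str, int]) -> int:
    """Sum over all sites."""
    return sum(per_site.values())


if __name__ == "__main__":
    per_site = totals_2011()
    print(per_site)                                   # {'TAU': 336, 'Technion': 336, 'Weizmann': 432}
    print(grand_total(per_site))                      # 1104
    print(grand_total(per_site) + HEAVY_ION_3Q2011)   # 1152
```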

Group disks

• We are hosting four ATLASGROUPDISK areas:
  – Muon performance (Technion)
  – Top (Weizmann)
  – Heavy Ion (Weizmann)
  – Standard Model (TAU) (empty)

CPU
• Last purchase was dual Intel E5520 quad-core
• May delivery purchase is dual Intel X5650 hex-core
  – again 4 motherboards per 2U box with redundant power supplies
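
The density gain from the new CPUs follows from the numbers above: dual-socket boards, 4 boards per 2U box. The sketch counts physical cores only (Hyper-Threading ignored) and assumes the previous E5520 purchase used the same 4-board 2U chassis, as "again" suggests.

```python
# Physical cores per 2U chassis for the two CPU generations.
# Assumptions: Hyper-Threading ignored; 4 dual-socket boards per 2U box
# for both purchases.

CORES_PER_SOCKET = {
    "E5520": 4,  # Intel Xeon E5520: quad-core
    "X5650": 6,  # Intel Xeon X5650: hex-core
}


def cores_per_box(cpu: str, sockets_per_board: int = 2, boards_per_box: int = 4) -> int:
    """Physical cores in one 2U box."""
    return CORES_PER_SOCKET[cpu] * sockets_per_board * boards_per_box


if __name__ == "__main__":
    for cpu in ("E5520", "X5650"):
        print(cpu, cores_per_box(cpu), "cores per 2U box")
```

With those assumptions a 2U box goes from 32 to 48 cores, 50% more per rack unit.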

Cores          Tel Aviv   Technion   Weizmann   Total
Now                 192        272        448     944
May                 336        464        640    1440

We benefit greatly from other groups placing some of their cores in our cluster:
* Weizmann: ATLAS+Phenix/Heavy-Ion, HEP Theory, Condensed matter
* Technion: HEP Theory and Bio-informatics
* TAU includes: HEP Theory

Services nodes
Virtualize most services
• Two 8-core servers, 48GB
• Failover
• Easier management
  – VM images
  – Roll-back
  – Image sharing
  – Easier testing: temp machines
• May delivery of HW
• Deciding among: VMware, Xen, Citrix, KVM
• SE not included
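
Failover with only two service hosts means either host must be able to carry every VM on its own. A minimal N+1 capacity check, using the host size from the slide (8 cores, 48GB); the VM list and per-VM sizes are hypothetical placeholders, not our actual sizing.

```python
# N+1 sanity check for two virtualization hosts: with failover, either host
# must run every VM alone if the other one dies.
# Host size is from the slide; VM names and sizes are hypothetical.

from dataclasses import dataclass


@dataclass(frozen=True)
class Vm:
    name: str
    vcpus: int
    mem_gb: int


HOST_CORES = 8
HOST_MEM_GB = 48

# Hypothetical sizing for a few of the per-site services.
VMS = [
    Vm("ce", 2, 8),
    Vm("site-bdii", 1, 4),
    Vm("mon", 1, 4),
    Vm("apel", 1, 4),
    Vm("frontier-cache", 2, 8),
    Vm("ldap-dns-dhcp-syslog", 1, 4),
]


def survives_single_host_failure(vms, host_cores, host_mem_gb, cpu_overcommit=1.0):
    """True if all VMs fit on one host.

    cpu_overcommit > 1 lets vCPUs share physical cores; memory is not overcommitted.
    """
    need_cpu = sum(v.vcpus for v in vms)
    need_mem = sum(v.mem_gb for v in vms)
    return need_cpu <= host_cores * cpu_overcommit and need_mem <= host_mem_gb


if __name__ == "__main__":
    print(survives_single_host_failure(VMS, HOST_CORES, HOST_MEM_GB))
```

Memory is the usual hard limit here, since CPU can be shared between VMs while RAM generally cannot be safely overcommitted.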

Service                                              Where
gLite CE                                             per site
gLite site-BDII                                      per site
gLite MON                                            per site
gLite APEL                                           per site
ELOG electronic log book                             WI
Zenoss fabric monitoring                             per site
LDAP, DNS, DHCP, syslog                              per site
Frontier DB cache                                    per site
VOMS (for Israel)                                    TAU
gLite WMS, LB (for Israel)                           WI
gLite myproxy (for Israel)                           WI
gLite Top-BDII (for Israel)                          WI
gLite NAGIOS for Israel grid service monitoring      WI
Mantis issue tracker                                 Tech
Managers’ Wiki pages                                 Tech

Networking
Our networking is not good.
• Geant connection is 2 x 1.5G (subscribed on 2 x 2.5G infrastructure)
• “Political” limits: TAU 500M, Technion 350M, WI 400M
  – because a 1G line is shared with institute traffic and the shared router cannot really do 1G duplex
• We suspect that the gross mismatch with SARA/NIKHEF’s 10G causes failed connections due to dropped packets
  – lowering the number of files & streams to avoid dropped packets leaves us with even worse net bandwidth
• Expensive because it is an undersea fiber and one (Italian) company owns the fibers
  – an Israeli competitor is installing another fiber now
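
Why dropped packets and fewer streams both hurt can be illustrated with the Mathis et al. (1997) approximation for TCP Reno, rate ≤ (MSS/RTT)·(C/√p). Per-stream rate falls as 1/√p, so loss caused by the 10G to 1.5G mismatch slows every stream, and fewer parallel streams then lower the aggregate further. The RTT, loss rate and MSS below are illustrative assumptions, not measurements of the path to SARA/NIKHEF; newer TCP variants such as CUBIC behave differently.

```python
# Mathis et al. (1997) steady-state bound for one TCP Reno stream:
#     rate <= (MSS / RTT) * (C / sqrt(p)),   C = sqrt(3/2)
# where p is the packet-loss probability. Inputs below are assumptions.

import math

C = math.sqrt(1.5)


def mathis_rate_mbit(mss_bytes: int, rtt_s: float, loss: float) -> float:
    """Upper bound on one TCP stream's throughput, in Mbit/s."""
    if loss <= 0:
        raise ValueError("model needs a non-zero loss rate")
    return (mss_bytes * 8 / rtt_s) * (C / math.sqrt(loss)) / 1e6


def aggregate_mbit(streams: int, per_stream: float, cap_mbit: float) -> float:
    """Parallel streams add up until they reach the bottleneck link."""
    return min(streams * per_stream, cap_mbit)


if __name__ == "__main__":
    # Assumed: 1460-byte MSS, 60 ms RTT, 0.01% loss.
    per_stream = mathis_rate_mbit(mss_bytes=1460, rtt_s=0.060, loss=1e-4)
    for n in (1, 4, 16):
        # 500 Mbit/s cap: the TAU "political" limit from the slide
        print(n, round(aggregate_mbit(n, per_stream, cap_mbit=500.0), 1))
```

Under these assumptions a single stream stays in the low tens of Mbit/s, so cutting the number of streams to reduce loss trades one problem for another, as observed.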

Networking

GEANT

Networking plans
May 2011(?):
• Increase international connection from 3Gb/s to 4Gb/s
  – 5G might be possible later this year, but is not budgeted
• Replace old routers at the entrances to the institutes with 10G-capable equipment
  – this should increase our throughput and reliability and allow us to actually use a major share of the 1G bandwidth to the sites
• Negotiating a 10G academic backbone
• Could have 10G to Geant in spring 2012
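
The effect of these upgrades on bulk transfers can be estimated directly from the link rates. This sketch assumes a hypothetical 100 TB dataset and 70% usable link efficiency; both numbers are placeholders.

```python
# Rough transfer-time comparison for the planned link upgrades.
# The dataset size and the 70% link efficiency are assumptions.

def transfer_hours(size_tb: float, link_gbit: float, efficiency: float = 0.7) -> float:
    """Hours to move size_tb terabytes (10**12 bytes) over a link_gbit link."""
    bits = size_tb * 1e12 * 8
    return bits / (link_gbit * 1e9 * efficiency) / 3600


if __name__ == "__main__":
    # today, May 2011 plan, possible later this year, possible spring 2012
    for gbit in (3, 4, 5, 10):
        print(f"{gbit:>2} Gb/s: {transfer_hours(100, gbit):6.1f} h for 100 TB")
```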

SAM/NAGIOS
• Our NGI did not take on the SAM/NAGIOS monitoring responsibility
• After the new NAGIOS tests replaced the SAM tests, we received no alerts on failed tests
• This was a severe problem
• Finally, in December, EGI, our NGI and we agreed that we would deploy a NAGIOS test service for Israel until our NGI succeeded in doing so
  – the only functioning grid sites in Israel are our 3 ATLAS sites
• Our NAGIOS service was up and running in January
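
For reference, Nagios plugins report results through the process exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) plus one line of output. The sketch below is a hypothetical TCP reachability probe in that style; it is not one of the EGI/WLCG grid probes deployed for Israel, and the host, port and thresholds are placeholders.

```python
# A minimal Nagios-style probe: checks that a TCP port answers in time.
# Host, port and thresholds are hypothetical; not an EGI/WLCG grid probe.

import socket
import time

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def status_for_latency(seconds: float, warn: float, crit: float) -> int:
    """Map a connect time to a Nagios status code."""
    if seconds >= crit:
        return CRITICAL
    if seconds >= warn:
        return WARNING
    return OK


def check_tcp(host: str, port: int, warn: float = 1.0, crit: float = 5.0) -> tuple[int, str]:
    """Try to open a TCP connection; return (status code, one-line message)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=crit):
            pass
    except OSError as exc:  # refused, unreachable, timed out, name lookup failure
        return CRITICAL, f"{host}:{port} unreachable ({exc})"
    elapsed = time.monotonic() - start
    return status_for_latency(elapsed, warn, crit), f"{host}:{port} answered in {elapsed:.3f}s"


def main(argv: list[str]) -> int:
    """Plugin entry point: argv is [host, port]; returns the Nagios exit code."""
    if len(argv) != 2 or not argv[1].isdigit():
        print("UNKNOWN - usage: check_tcp HOST PORT")
        return UNKNOWN
    code, message = check_tcp(argv[0], int(argv[1]))
    print(f"TCP {LABELS[code]} - {message}")
    return code
```

A plugin script would end with `sys.exit(main(sys.argv[1:]))` so that Nagios sees the exit code.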

Upcoming work
• Deploy Zenoss fabric and service monitor on all three clusters
  – currently in test at Weizmann
• Deploy Puppet configuration system on all three clusters
  – we gave up on Quattor after finally succeeding in getting it to run
    • clear that it was unsustainable
  – currently for worker nodes at Weizmann
  – needs to include gLite nodes
• Virtualization of services (excl. SE)
• Address Storm “untested new version” problem

End