The CMS Tier2 federation

M. Paganoni

The CMS Tier2 federation

http://www.cnaf.infn.it/~dbonacorsi/SC4-INFNT2.html (a twiki page will follow shortly)

Legnaro-Padova and Roma are approved as CMS Tier2s

Pisa is a Tier2 sub judice (infrastructure cost)

Bari is a proto-Tier2 (subject to infrastructure definition and local approval)

Funding for 2006 will be discussed at the CSN1 meeting in July

All 4 centres contribute to CMS, with strong support from their reference communities (including Tier3s)

Tier2 Legnaro-Padova

• 76 computing nodes (152 CPUs), most of them in 5 Intel Blade Centers (dual Xeon, 2.4 GHz to 3.0 GHz), plus some dual-core Opteron 275 (~200 kSI2K)

• Old “production” storage: disk servers with 3ware RAID arrays, access through ‘classic’ rfio protocol (16 TB)

• New storage (under a storage management system, currently DPM, not yet in production for CMS):
  – ~5 TB in old 3ware servers (used in SC3)
  – ~7 TB in our new SAN infrastructure (FC controllers + SATA/FC disk boxes): just installed the first components, need to build experience on this, plan to use in SC4

Tier2 Roma

• 11 WN for a total of 23 kSI2k, plus 3 service machines (CE, UI, Squid); hardware ranges from PIII (being phased out) to dual-core Opteron 275

• 4 NAS servers, 16 TB effective:
  – 2 for local use (6 TB)
  – 2 for Grid use (3 TB classic SE, 7 TB DPM SE)

Otranto, 8/6/06 M. Paganoni 7

The roles of Tier0,1,2 for CMS

• Tier0 (CERN):
  – safe keeping of RAW data (first copy);
  – first pass reconstruction;
  – distribution of RAW and RECO to Tier1;
  – reprocessing of data during LHC down-times.

• Tier1 (ASCC, CCIN2P3, FNAL, GridKA, INFN-CNAF, PIC, RAL):
  – safe keeping of a proportional share of RAW and RECO (2nd copy);
  – large scale reprocessing and safe keeping of the output;
  – distribution of data products to Tier2s and safe keeping of a share of simulated data produced at these Tier2s.

• Tier2 (~40 centres):
  – handling analysis requirements;
  – proportional share of simulated event production and


Service Challenge 4

• The SC4 goal is to progress the distributed computing infrastructure to a production-level service (WLCG)

• In April: throughput phase for disk-to-disk and disk-to-tape transfers

• In May: roll-out of gLite 3.0

• In the first two weeks of June, CMS will complete a computing model functionality test (rerun of the functionalities missing in SC3)

• The last two weeks of July: integration tests

• In the first two weeks of September, CMS will prepare CSA06 (see next slides)

Transfer activities Tier1-Tier2 for SC4

Tier-1 to Tier-2: very bursty and driven by analysis

Goal is to reach from 10MB/s (worst Tier-2s) to 100MB/s (best Tier-2s) by June 2006.

Tier-2 to Tier-1: continuous simulation transfers

Goal is to reach 10MB/s from Tier-2s to Tier-1 centers (1TB per day)

PhEDEx-FTS integration has been achieved

Two tools (Heartbeat and transfer activity) help CMS with the continuous transfers

CMS distributed analysis uses CMS Remote Analysis Builder (CRAB), now interfaced to CMSSW

Trivial file catalogs also work

The goal is 25-50 kjobs/day
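As a quick sanity check on the figures in this slide, the sketch below converts a sustained transfer rate into a daily volume and a daily job target into an average submission rate. It assumes decimal units (1 TB = 10^6 MB) and perfectly sustained rates; the function names are illustrative and not part of any CMS tool.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_volume_tb(rate_mb_per_s: float) -> float:
    """Volume moved in one day at a sustained rate, in TB (1 TB = 1e6 MB)."""
    return rate_mb_per_s * SECONDS_PER_DAY / 1e6


def jobs_per_second(jobs_per_day: float) -> float:
    """Average submission rate needed to reach a daily job target."""
    return jobs_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    # Transfer targets quoted in the slide (MB/s).
    for label, rate in [
        ("Tier-2 to Tier-1", 10),
        ("Tier-1 to Tier-2 (worst Tier-2s)", 10),
        ("Tier-1 to Tier-2 (best Tier-2s)", 100),
    ]:
        print(f"{label}: {rate} MB/s -> {daily_volume_tb(rate):.2f} TB/day")

    # Analysis target quoted in the slide (jobs/day).
    for target in (25_000, 50_000):
        print(f"{target} jobs/day -> {jobs_per_second(target):.2f} jobs/s on average")
```

At 10 MB/s the daily volume comes out at 0.864 TB, which is where the rounded "1 TB per day" comes from.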

First outcomes from SC4

• The difficult part has been the end-to-end system and maintaining the rates over long periods of time

• It takes too long to get going and it takes too much effort to keep going

• Even if the challenge has concentration periods, we need a continuous effort to make things work and scale

• Need a CMS coordinator to monitor PhEDEx and a service coordinator to monitor FTS (shifts ?)

• A larger number of application failures come from data publishing and data access problems than from problems with grid submission

• Need more testing of the new event data model and data management infrastructure

Goals of SC4

Transfers
• Demonstration of PhEDEx driving FTS at EGEE sites
• Demonstration of Data Administration on sites
• Transfer into Trivial File Catalog and Access Data
• Remove Data from site
• Request new data for site
• Achieve Tier-1 to Tier-2 transfers at all permutations

Analysis Workflow
• CRAB Access to CMSSW Data at all sites
• Bulk submission use of gLite
• Achieve more than 1k successful jobs/day on all Tiers

Production Workflow
• Submission to all participating LCG and OSG sites and return of results
• Data registration in DBS and import to PhEDEx for replication to CERN

Computing, Software, & Analysis Challenge 2006

– A 50 million event exercise to test the workflow and dataflow associated with the data handling and data access model of CMS

– Receive from HLT (previously simulated) events with online tag at 25 % of the HLT bandwidth (35-40 Hz)

– Prompt reconstruction at Tier-0, including determination of calibration constants (some FEVT and all AOD to the Tier-1s)

– Streaming of ~7 physics datasets (Local creation of AOD and distribution to all Tier-1s)

– Physics jobs on AOD at some Tier-1s

– Skim jobs at some Tier-1s, with data propagated to Tier-2s to run physics jobs there (50 kjobs/day in total)

Wide scale system test of software-computing synchronization at the production level focusing on the early data scenario.

Performance metric under scrutiny
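The rate figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below derives the full HLT rate implied by "25% = 35-40 Hz" and estimates how long 50 million events would take to stream at that rate. It assumes continuous, uninterrupted streaming, which is an idealisation and not something the slide states; the function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60
N_EVENTS = 50_000_000  # size of the CSA06 exercise


def full_hlt_rate_hz(csa06_rate_hz: float, fraction: float = 0.25) -> float:
    """Full HLT output rate implied by a rate quoted as a fraction of it."""
    return csa06_rate_hz / fraction


def streaming_days(n_events: int, rate_hz: float) -> float:
    """Days needed to stream n_events at a constant rate (idealised)."""
    return n_events / rate_hz / SECONDS_PER_DAY


if __name__ == "__main__":
    for rate in (35, 40):
        print(
            f"{rate} Hz -> implied full HLT rate {full_hlt_rate_hz(rate):.0f} Hz, "
            f"{streaming_days(N_EVENTS, rate):.1f} days for 50M events"
        )
```

Under these assumptions the 50M events correspond to roughly two weeks of continuous streaming, well inside the two-month CSA06 window given in the timescale.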

Timescale foreseen for CSA06

1-6-06: Simulation software ready for CSA06; computing systems ready for SC4

15-6-06: Physics validation complete

1-7-06: Start simulation production (25M minbias; 5M electrons; 5M muons; 5M jets; 5M HLT “cocktail”; 5M miscalibrated/misaligned)

15-8-06: Calibration, alignment, HLT, reconstruction, and analysis tools ready

30-8-06: 50 Mevt produced, 5M with HLT pre-processing

1-9-06: Computing systems ready for CSA06

15-9-06: Start CSA06

15-11-06: Finish CSA06
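The timeline implies an average production rate that is easy to make explicit. The sketch below sums the sample sizes listed for the simulation production and divides by the length of the production window (1-7-06 to 30-8-06, read as day-month-year). It assumes a constant production rate over the whole window, which is a simplification; the variable names are illustrative.

```python
from datetime import date

# Sample sizes from the timeline, in millions of events.
SAMPLES_MEVT = {
    "minbias": 25,
    "electrons": 5,
    "muons": 5,
    "jets": 5,
    "HLT cocktail": 5,
    "miscalibrated/misaligned": 5,
}

production_start = date(2006, 7, 1)   # 1-7-06: start simulation production
production_end = date(2006, 8, 30)    # 30-8-06: 50 Mevt produced

total_mevt = sum(SAMPLES_MEVT.values())
window_days = (production_end - production_start).days
average_mevt_per_day = total_mevt / window_days

if __name__ == "__main__":
    print(f"Total sample: {total_mevt} Mevt")
    print(f"Production window: {window_days} days")
    print(f"Average rate needed: {average_mevt_per_day:.2f} Mevt/day")
```

The samples add up to the 50 Mevt quoted for 30-8-06, and the window is 60 days long, so the federation as a whole (all participating sites together) would need to average a bit under 1 Mevt/day.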

Resources needed for CSA06

• Taking into account that 40% of the resources are located at the Tier-2s and that CSA06 is a test at 25% of what is needed in 2008:

➨ 100 CPUs per Tier-2

➨ 25 TB per Tier-2

➨ 10-100 MB/s to each Tier-2

• Should test most of the possible Tier-1 Tier-2 permutations

• The pre-production of MC events is on the critical path
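A site can compare its own resources with the per-Tier-2 targets above using a small helper like the one sketched here. The targets (100 CPUs, 25 TB) come from this slide and the Legnaro-Padova figures from the site slide earlier in the talk; which storage pools should count towards the target is an assumption made for illustration only, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier2Resources:
    cpus: int
    disk_tb: float


# Per-Tier-2 targets quoted for CSA06.
CSA06_TARGET = Tier2Resources(cpus=100, disk_tb=25)


def shortfall(site: Tier2Resources, target: Tier2Resources = CSA06_TARGET) -> Tier2Resources:
    """Resources still missing to reach the target (zero where the target is met)."""
    return Tier2Resources(
        cpus=max(0, target.cpus - site.cpus),
        disk_tb=max(0.0, target.disk_tb - site.disk_tb),
    )


# Legnaro-Padova as described in the talk: 152 CPUs, 16 TB of "production"
# storage, plus ~5 TB and ~7 TB of new storage not yet in production for CMS.
# Counting or not counting the new storage is an assumption of this example.
legnaro_production_only = Tier2Resources(cpus=152, disk_tb=16)
legnaro_with_new_storage = Tier2Resources(cpus=152, disk_tb=16 + 5 + 7)

if __name__ == "__main__":
    print("Production storage only:", shortfall(legnaro_production_only))
    print("Including new storage:  ", shortfall(legnaro_with_new_storage))
```

The same helper could be fed the figures of any other Tier-2 once its CPU count and usable disk space are known.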

Coordination of activities

• Weekly phone conference (Mon 14:30)

• Periodic meetings of the Tier2 reference communities (e.g. Roma, 18-5-06)

• Meetings at CNAF for Tier1 and Tier2 coordination

• Meetings at CERN to coordinate activities with CMS and WLCG (SC4, CSA06, …)

• Contacts with other computing centres of the collaboration (Lyon, DESY, Barcelona, …)

• Dashboard (web page or wiki)

• Besides the local coordinators, each Tier2 identifies the people who act as site managers for CRAB, PhEDEx and MC production

CRAB site manager

– Keeps in touch with the developer community
  • To decide when upgrades are needed, follow up on problems, ...

– Keeps in touch with the user community
  • Specific needs? Requests? Support?

– Installation/configuration and maintenance
  • Find out whether there are specific needs
    – Software to be installed on the machines?
    – Configuration of dedicated queues?

– In contact with the national CRAB coordinator (S. Lacaprara)

PhEDEx site manager

– Manages day-to-day PhEDEx operations
  • Checks the logs for problems, ...

– Requests the injection of new files, in response to requests
  • from CMS
  • from the Tier2's “local” user community

– Manages the injection into PhEDEx of the files produced by the T2

– Acts as the contact point with the PhEDEx developers and with the PhEDEx managers of the Tier1s and the other Tier2s

– Identifies specific needs
  • Insufficient disk space? ...

– Installation/configuration and system maintenance of PhEDEx

– In contact with the national PhEDEx coordinator (D. Bonacorsi)

MC production site manager

• Manages the official MC production of the T2, interfacing with CMS

  – Requests new datasets when a production is complete

  – Verifies that the transfer of the produced data completed successfully

  – Optimises the use of resources (CPU, disk, ...)

  – Day-to-day production tasks
    • Checking logs, failed productions and resubmissions where needed, ...

  – Handles requests to update the production software
    • Interfacing with the CMS Software Manager

  – Requests system maintenance when needed

  – In contact with the national MC production coordinator (S. Gennai)

The Tier2 and the GRID infrastructure

CMS user point of view:

1. Hidden interface to distributed data and resources (CRAB)

2. Standard and unified support interface (GGUS ticketing system)

3. Advanced policy management (in the near future)

• Dynamic allocation of resources for production and analysis tasks

• Dynamic allocation of resources for CMS analysis groups

Tier2 administrator point of view:

1. The Tier2 infrastructure can be built upon the standard grid farm infrastructure (maintained by grid people) by sharing hardware and middleware support

2. User access, authentication and management are done by the GRID middleware

3. Grid infrastructure controlled and monitored 24 hours/day, 7 days/week

4. Automatic discovery of problems related to job submission, data management, etc., handled by

• OMC, CIC and ROC support (via ticketing system)

• ROC shifts

• GridICE notification

5. Shared interface for error handling of user related problems and infrastructure failures

6. Standard information system to publish farm configuration and software tags

Open questions

• Storage Management (dCache/DPM/STORM)

  – DPM is currently preferred by the Tier2s for its simple interface, but
    • its scalability is not guaranteed
    • it has interface problems with SRM (implementation for Castor) and other missing functionality

  – dCache requires more complex local expertise, but scales to more complex systems

  – STORM is under development

• Local databases
  – Trivial Catalogue or local LFC implementation?
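To make the catalogue question concrete: the appeal of a "trivial" catalogue is that logical file names are turned into physical ones by a handful of site-wide rules, with no per-file database to maintain, whereas a catalogue service such as LFC stores an entry for each file. The sketch below illustrates only the rule-based idea. The rule format, the storage paths and the host name `se.example.org` are hypothetical and are not the actual CMS configuration.

```python
import re
from typing import NamedTuple, Optional, Sequence


class Rule(NamedTuple):
    protocol: str    # access protocol the rule applies to
    path_match: str  # regular expression applied to the logical file name
    result: str      # physical name template; \1 is the first captured group


# Hypothetical site rules, for illustration only.
RULES = [
    Rule("direct", r"/store/(.*)", r"/storage/cms/store/\1"),
    Rule("srm", r"/store/(.*)", r"srm://se.example.org/storage/cms/store/\1"),
]


def lfn_to_pfn(lfn: str, protocol: str, rules: Sequence[Rule] = RULES) -> Optional[str]:
    """Map a logical file name to a physical one using the first matching rule."""
    for rule in rules:
        if rule.protocol != protocol:
            continue
        match = re.fullmatch(rule.path_match, lfn)
        if match:
            return match.expand(rule.result)
    return None  # no rule applies: the file is not resolvable at this site


if __name__ == "__main__":
    print(lfn_to_pfn("/store/mc/sample/file.root", "direct"))
    print(lfn_to_pfn("/store/mc/sample/file.root", "srm"))
    print(lfn_to_pfn("/unknown/file.root", "direct"))
```

The trade-off is that a rule set cannot say whether a given file actually exists at the site; that information has to come from elsewhere.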

Conclusions

• We are putting together the structure of the Federation

• The main difficulty lies in the multi-level decision process (local Tier2, Tier2 Federation, experiment, GRID)

• We need CCR to continue its support, especially for system-management aspects and for advice on tenders