www.consorzio-cometa.it
FESR
Consorzio COMETA - Progetto PI2S2
The gLite Workload Management System
Annamaria Muoio, INFN Catania, Italy, [email protected]
Tutorial for users and Grid application development, 16 - 20 July 2007, Catania
Outline
This presentation covers the following topics:
• Overview of the gLite WMS Architecture
• Job Description Language Overview - Principal Attributes
• References and hands-on
Overview of gLite Middleware
Workload Management System
The Workload Management System (WMS) comprises a set of Grid middleware components responsible for the distribution and management of tasks across Grid resources.
The purpose of the Workload Manager (WM) is to accept and satisfy requests for job management coming from its clients. Submitting a request means passing responsibility for the job to the WM.
The WM passes the job to an appropriate Computing Element (CE) for execution, taking into account the requirements and preferences expressed in the job description.
The decision of which resource should be used is the outcome of a matchmaking process.
[Figure: WMS architecture. Components shown: UI, Network Server, Job Controller (CondorG), Computing Element, Storage Element, LFC, Information System (collecting CE and SE characteristics and status), Logging & Book-keeping. Flow shown: job submission.]
UI: allows users to access the functionalities of the WMS (Workload Management System) via command line, GUI, C++ and Java APIs.
Job Description Language (JDL) to specify job characteristics and requirements. The job is submitted from the UI to the RB node:
$ edg-job-submit myjob.jdl
myjob.jdl:
JobType = "Normal";
Executable = "$(CMS)/exe/sum.exe";
InputSandbox = {"/home/user/WP1testC","/home/file*"};
OutputSandbox = {"sim.err", "test.out", "sim.log"};
Requirements = other.GlueHostOperatingSystemName == "linux" && other.GlueCEPolicyMaxCPUTime > 10000;
Rank = other.GlueCEStateFreeCPUs;
Job Status: submitted
NS: network daemon responsible for accepting incoming requests.
[Figure: the job and its Input Sandbox files reach the Network Server and the WMS storage.]
Job Status: submitted
WM: responsible for taking the appropriate actions to satisfy the request.
[Figure: the WM asks the Match-Maker/Broker "Where must this job be executed?"]
Job Status: submitted, waiting
Matchmaker: responsible for finding the "best" CE to which a job should be submitted.
[Figure: the Match-Maker/Broker asks the LFC "Where are (which SEs) the needed data?" and the Information System "What is the status of the Grid?"]
Job Status: submitted, waiting
[Figure: the Match-Maker/Broker returns its CE choice.]
Job Status: submitted, waiting
JA (Job Adapter): responsible for the final "touches" to the job before it is passed to Condor (e.g. creation of the wrapper script).
Job Status: submitted, waiting
JC (Job Controller): responsible for the actual job management operations (done via CondorG).
Job Status: submitted, waiting, ready
[Figure: the job and its Input Sandbox files are passed to the Computing Element.]
Job Status: submitted, waiting, ready, scheduled
[Figure: the job runs on the Computing Element, with "Grid enabled" data transfers/accesses to the Storage Element.]
Job Status: submitted, waiting, ready, scheduled, running
[Figure: the Output Sandbox files are transferred from the Computing Element to the WMS storage.]
Job Status: submitted, waiting, ready, scheduled, running, done
The user retrieves the Output Sandbox from the UI:
$ edg-job-get-output <dg-job-id>
Job Status: submitted, waiting, ready, scheduled, running, done
[Figure: the Output Sandbox files have been delivered to the UI.]
Job Status: submitted, waiting, ready, scheduled, running, done, cleared
Possible job states
Flag        Meaning
SUBMITTED   submission logged in the LB
WAIT        job matchmaking for resources
READY       job being sent to the executing CE
SCHEDULED   job scheduled in the CE queue manager
RUNNING     job executing on a WN of the selected CE queue
DONE        job terminated without grid errors
CLEARED     job output retrieved
ABORT       job aborted by the middleware, check the reason
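The state table can be captured in a few lines of Python. This is only an illustrative sketch: the flag names and meanings come from the table, while `NORMAL_PATH`, `next_state` and `output_can_be_retrieved` are hypothetical helper names, and treating ABORT as a state with no successor is an assumption of this sketch.

```python
from enum import Enum


class JobState(Enum):
    """Job states and their meanings, as listed in the table."""
    SUBMITTED = "submission logged in the LB"
    WAIT = "job matchmaking for resources"
    READY = "job being sent to the executing CE"
    SCHEDULED = "job scheduled in the CE queue manager"
    RUNNING = "job executing on a WN of the selected CE queue"
    DONE = "job terminated without grid errors"
    CLEARED = "job output retrieved"
    ABORT = "job aborted by the middleware, check the reason"


# Order of states for a job that completes without problems.
NORMAL_PATH = [
    JobState.SUBMITTED, JobState.WAIT, JobState.READY, JobState.SCHEDULED,
    JobState.RUNNING, JobState.DONE, JobState.CLEARED,
]


def next_state(state):
    """Return the next state on the normal path, or None if there is none."""
    if state not in NORMAL_PATH:
        return None  # assumption: ABORT has no successor in this sketch
    index = NORMAL_PATH.index(state)
    return NORMAL_PATH[index + 1] if index + 1 < len(NORMAL_PATH) else None


def output_can_be_retrieved(state):
    """The output is fetched (edg-job-get-output) once the job is DONE."""
    return state is JobState.DONE
```

A script polling the job status could use `output_can_be_retrieved` to decide when to call edg-job-get-output.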
Workload Management System
[Figure: WMS interactions among the User Interface, the Resource Broker (Workload Manager), Logging & Book-keeping, the LFC Catalogue, the Information Service, the Computing Element and the Storage Element. Labelled flows: job submit event, job query, job status, authorization and authentication, dataset info, input "sandbox", input "sandbox" + Broker Info, output "sandbox", publish, SE & CE info.]
Command Line Interface
Job Submission
$ edg-job-submit [options] <jdl_file>
--vo <vo name>: perform the submission with a different VO than the UI default one.
--output, -o <output file>: save the jobId to a file.
--resource, -r <resource value>: specify the resource for execution.
--nomsgi: neither messages nor errors will be displayed on stdout.
If the request has been correctly submitted, this is the typical output that you can get:
$ edg-job-submit test.jdl
==================== edg-job-submit Success =====================
The job has been successfully submitted to the Network Server.
Use edg-job-status command to check job current status. Your job identifier (edg_jobId) is:
- https://lxshare0234.cern.ch:9000/rIBubkFFKhnSQ6CjiLUY8Q
=================================================================
In case of failure, an error message will be displayed instead, and an exit status different from zero will be returned.
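When submissions are scripted, the job identifier has to be pulled out of this output. The following is a minimal sketch, assuming the identifier always has the `https://host:port/id` shape seen in the sample; `JOB_ID_PATTERN` and `extract_job_ids` are hypothetical names, not part of the edg tools. The `--output` option of edg-job-submit, which saves the jobId to a file, avoids the parsing altogether.

```python
import re

# Assumed shape of a job identifier, based only on the sample output:
# https://<host>:<port>/<unique string of letters, digits, '-' and '_'>
JOB_ID_PATTERN = re.compile(r"https://[\w.-]+:\d+/[\w-]+")


def extract_job_ids(output):
    """Return every job identifier found in the text printed by a submission."""
    return JOB_ID_PATTERN.findall(output)


SAMPLE_SUBMIT_OUTPUT = """\
==================== edg-job-submit Success =====================
The job has been successfully submitted to the Network Server.
Use edg-job-status command to check job current status. Your job identifier (edg_jobId) is:
- https://lxshare0234.cern.ch:9000/rIBubkFFKhnSQ6CjiLUY8Q
=================================================================
"""

job_ids = extract_job_ids(SAMPLE_SUBMIT_OUTPUT)
```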
If the command returns the following error message:
**** Error: API_NATIVE_ERROR ****
Error while calling the "NSClient::multi" native api
AuthenticationException: Failed to establish security context...
**** Error: UI_NO_NS_CONTACT ****
Unable to contact any Network Server
it means that there are authentication problems between the UI and the Network Server (check your proxy or contact the site administrator).
It is possible to see which CEs are eligible to run a job specified by a given JDL file using the command edg-job-list-match:
$ edg-job-list-match test.jdl
Connecting to host lxshare0380.cern.ch, port 7772
Selected Virtual Organisation name (from UI conf file): dteam
*********************************************************************
COMPUTING ELEMENT IDs LIST
The following CE(s) matching your job requirements have been found:
adc0015.cern.ch:2119/jobmanager-lcgpbs-infinite
adc0015.cern.ch:2119/jobmanager-lcgpbs-long
adc0015.cern.ch:2119/jobmanager-lcgpbs-short
*********************************************************************
After a job is submitted, it is possible to see its status using the edg-job-status command.
$ edg-job-status https://lxshare0234.cern.ch:9000/X-ehTxfdlXxSoIdVLS0L0w
*************************************************************
BOOKKEEPING INFORMATION:
Printing status info for the Job:
https://lxshare0234.cern.ch:9000/X-ehTxfdlXxSoIdVLS0L0w
Current Status: Scheduled
Status Reason: Job successfully submitted to Globus
Destination: lxshare0277.cern.ch:2119/jobmanager-pbs-infinite
reached on: Fri Aug 1 12:21:35 2003
*************************************************************
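A script that waits for a job to finish needs the "Current Status" field from this report. Below is a minimal sketch, assuming each field is printed on its own line as `Label: value`; the line layout is reconstructed from the slide, and `parse_status` is a hypothetical helper, not part of the edg tools.

```python
import re


def parse_status(output):
    """Extract a few labelled fields from the text of a status report."""
    fields = {}
    for label in ("Current Status", "Status Reason", "Destination"):
        # Match "<label>: <value>" at the start of a line.
        match = re.search(rf"^{re.escape(label)}:\s*(.+)$", output, re.MULTILINE)
        if match:
            fields[label] = match.group(1).strip()
    return fields


SAMPLE_STATUS_OUTPUT = """\
BOOKKEEPING INFORMATION:
Printing status info for the Job:
https://lxshare0234.cern.ch:9000/X-ehTxfdlXxSoIdVLS0L0w
Current Status: Scheduled
Status Reason: Job successfully submitted to Globus
Destination: lxshare0277.cern.ch:2119/jobmanager-pbs-infinite
reached on: Fri Aug 1 12:21:35 2003
"""

status = parse_status(SAMPLE_STATUS_OUTPUT)
```

Matching only known labels avoids misreading the job identifier line, which also contains colons.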
After the job has finished (it reaches the DONE status), its output can be copied to the UI:
$ edg-job-get-output https://lxshare0234.cern.ch:9000/snPegp1YMJcnS22yF5pFlg
Retrieving files from host lxshare0234.cern.ch
*****************************************************************
JOB GET OUTPUT OUTCOME
Output sandbox files for the job:
- https://lxshare0234.cern.ch:9000/snPegp1YMJcnS22yF5pFlg
have been successfully retrieved and stored in the directory:
/tmp/jobOutput/larocca_snPegp1YMJcnS22yF5pFlg
*****************************************************************
By default, the output is stored under /tmp/jobOutput, but it is possible to specify the directory in which to save the output using the --dir <path name> option.
A job can be canceled before it ends using the command edg-job-cancel.
$ edg-job-cancel https://lxshare0234.cern.ch:9000/dAE162is6EStca0VqhVkog
Are you sure you want to remove specified job(s)? [y/n]n :y
=================== edg-job-cancel Success ====================
The cancellation request has been successfully submitted for the following job(s)
- https://lxshare0234.cern.ch:9000/dAE162is6EStca0VqhVkog
===========================================================
• In gLite, the Job Description Language (JDL) is used to describe jobs for execution on the Grid.
• The JDL adopted within the gLite middleware is based upon the Condor CLASSified ADvertisement language (ClassAd).
– A ClassAd is a record-like structure composed of a finite number of attributes separated by a semicolon (;)
– A ClassAd is highly flexible and can be used to represent arbitrary services
• The JDL is used in gLite to specify the job's characteristics and constraints, which are used during the matchmaking process to select the best resources that satisfy the job's requirements.
The JDL syntax consists of statements like:
Attribute = value;
Comments must be preceded by a sharp character (#) or follow the C++ syntax.
WARNING: The JDL is sensitive to blank characters and tabs. No blank characters or tabs should follow the semicolon at the end of a line.
In a JDL, some attributes are mandatory while others are optional.
An "essential" JDL is the following:
Executable = "test.sh";
StdOutput = "std.out";
StdError = "std.err";
InputSandbox = {"test.sh","input.dat"};
OutputSandbox = {"std.out","std.err"};
If needed, arguments to the executable can be passed:
Arguments = "Hello World!";
If the argument contains quoted strings, the quotes must be escaped with a backslash,
e.g. Arguments = "\"Hello World!\" 10";
Special characters such as &, |, >, < are only allowed if specified inside a quoted string or preceded by a triple backslash (\\\)
(e.g. Arguments = "-f file1\\\&file2";)
The backtick character ` cannot be specified in the JDL.
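The quoting rules are easy to get wrong when a JDL file is written by hand, so generating it from a script can help. The sketch below is only an illustration of the rules stated here: it handles string and list-of-strings attributes, escapes embedded double quotes, rejects backticks and leaves nothing after the final semicolon of each line. The function names are hypothetical, and expression attributes such as Requirements or Rank (which are not quoted strings) are deliberately not covered.

```python
def jdl_string(value):
    """Quote a string for the JDL, escaping embedded double quotes."""
    if "`" in value:
        raise ValueError("the backtick character cannot be specified in the JDL")
    return '"' + value.replace('"', '\\"') + '"'


def jdl_value(value):
    """Render a string, or a list of strings as a {...} list."""
    if isinstance(value, (list, tuple)):
        return "{" + ", ".join(jdl_string(item) for item in value) + "}"
    return jdl_string(value)


def render_jdl(attributes):
    """Render 'Attribute = value;' lines with no blanks after the semicolon."""
    return "\n".join(
        f"{name} = {jdl_value(value)};" for name, value in attributes.items()
    )


jdl_text = render_jdl({
    "Executable": "test.sh",
    "Arguments": '"Hello World!" 10',
    "StdOutput": "std.out",
    "StdError": "std.err",
    "InputSandbox": ["test.sh", "input.dat"],
    "OutputSandbox": ["std.out", "std.err"],
})
```

Writing `jdl_text` to a file such as test.jdl gives a file that could then be passed to edg-job-submit.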
JDL : Relevant Attributes
• JobType (optional)
– Normal (simple, sequential job), Interactive, MPICH, Checkpointable, Partitionable, Parametric
– Or a combination of them: Checkpointable + Interactive, Checkpointable + MPI
E.g. JobType = "Interactive";
JobType = {"Interactive","Checkpointable"};
"Interactive" + "MPI" not yet permitted
Executable (mandatory)
This is a string representing the executable/command name.
The user can specify an executable which is already on the remote CE:
Executable = "/opt/EGEODE/GCT/egeode.sh";
The user can provide a local executable name, which will be staged from the UI to the WN:
Executable = "egeode.sh";
InputSandbox = {"/home/larocca/egeode/egeode.sh"};
Arguments (optional)
This is a string containing all the job command line arguments.
E.g.: If your executable sum has to be started as:
$ sum N1 N2 –out result.out
Executable = "sum";
Arguments = "N1 N2 –out result.out";
Environment (optional)
List of environment settings needed by the job to run properly.
E.g. Environment = {"JAVABIN=/usr/local/java"};
InputSandbox (optional)
List of files on the UI local disk needed by the job for running. The listed files will be automatically staged to the remote resource.
E.g. InputSandbox = {"myscript.sh","/tmp/cc.sh"};
OutputSandbox (optional)
List of files, generated by the job, which have to be retrieved.
E.g. OutputSandbox = {"std.out","std.err","image.png"};
Requirements (optional)
Job requirements on computing resources.
Specified using attributes of resources published in the Information Service.
If not specified, the default value defined in the UI configuration file is used.
Default: Requirements = other.GlueCEStateStatus == "Production";
E.g. Requirements = other.GlueCEInfoLRMSType == "PBS" && other.GlueCEInfoTotalCPUs > 2 && Member("ALICE-2.1.7", other.GlueHostApplicationSoftwareRunTimeEnvironment);
Rank (optional)
Floating-point expression used to rank the CEs that already fulfill the Requirements expression.
The Rank expression can contain attributes that describe the CE in the Information System (IS).
The evaluation of the Rank expression is performed by the Resource Broker (RB) during the matchmaking phase.
A higher numeric value means a better rank.
E.g.: Rank = other.GlueCEStateFreeCPUs;
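How Requirements and Rank work together can be shown with a toy matchmaker. This is a sketch of the idea only, not the Resource Broker's implementation: CEs are plain dictionaries keyed by Glue attribute names, the Requirements and Rank expressions are stood in for by Python functions, and the CE identifiers are invented for the example.

```python
def matchmake(ces, requirements, rank):
    """Keep the CEs that satisfy `requirements`, best rank first."""
    eligible = [ce for ce in ces if requirements(ce)]
    # A higher numeric rank value means a better CE.
    return sorted(eligible, key=rank, reverse=True)


# Hypothetical CE descriptions, as the Information System might publish them.
CES = [
    {"id": "ce-a.example.org", "GlueCEStateStatus": "Production", "GlueCEStateFreeCPUs": 4},
    {"id": "ce-b.example.org", "GlueCEStateStatus": "Production", "GlueCEStateFreeCPUs": 12},
    {"id": "ce-c.example.org", "GlueCEStateStatus": "Draining", "GlueCEStateFreeCPUs": 40},
]

# Requirements = other.GlueCEStateStatus == "Production";
# Rank = other.GlueCEStateFreeCPUs;
ranked = matchmake(
    CES,
    requirements=lambda ce: ce["GlueCEStateStatus"] == "Production",
    rank=lambda ce: ce["GlueCEStateFreeCPUs"],
)
best_ce = ranked[0]["id"]
```

The CE with the most free CPUs is discarded here because it fails the Requirements filter, which is applied before any ranking takes place.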
InputData (optional)
This is a string or a list of strings representing the Logical File Name (LFN) or Grid Unique Identifier (GUID) of the files needed by the job as input.
The RB uses the list to find the CE from which the specified files can be best accessed, and schedules the job to run there.
InputData = {"lfn:cmstestfile", "guid:135b7b23-4a6a-11d7-87e7-9d101f8c8b70"};
DataAccessProtocol (mandatory if InputData has been specified)
The protocol or list of protocols that the application is able to "speak" for accessing the files listed in InputData on a given SE.
Supported protocols in gLite are currently gsiftp and file.
DataAccessProtocol = {"file","gsiftp"};
OutputSE (optional)
This is a string representing the URI of the Storage Element (SE) where the user wants to store the output data.
This attribute is used by the Resource Broker to find the best CE "close" to this SE and schedule the job there.
OutputSE = "aliserv6.ct.infn.it";
OutputData (optional)
This attribute allows the user to ask for the automatic upload and registration of datasets produced by the job on the Worker Node (WN).
This attribute contains the following three attributes:
OutputFile
StorageElement
LogicalFileName
NodeNumber (mandatory if JobType=MPICH)
The NodeNumber attribute is an integer specifying the number of nodes needed for an MPI job.
The RB uses this attribute during matchmaking to select those CEs having a number of CPUs equal to or greater than the one specified in NodeNumber.
NodeNumber = 5;
JobSteps (mandatory for checkpointable or partitionable jobs)
The JobSteps attribute can be either an integer representing the number of steps for a checkpointable or partitionable job, e.g.:
JobSteps = 100000;
or a list of strings representing labels associated with the steps of a checkpointable or partitionable job, e.g.:
JobSteps = {"d0", "d1", "gmos"};
CurrentStep (mandatory for checkpointable or partitionable jobs)
The CurrentStep attribute is used to indicate the initial step when submitting a checkpointable or partitionable job.
CurrentStep = 2;
References & Hands-on
JDL (submission via WMS Network Server)
https://edms.cern.ch/file/555796/1/EGEE-JRA1-TEC-555796-JDL-Attributes-v0-7.doc
https://grid.ct.infn.it/twiki/bin/view/GILDA/SimpleJobSubmissionWithRB
https://grid.ct.infn.it/twiki/bin/view/GILDA/MoreOnJDL-withedgcommands
Remember to initialize the proxy before interacting with the WMS!
Thank you for your attention!