
SDP LOCAL MONITORING AND CONTROL DESIGN

Document number  SKA-TEL-SDP-0000026
Document type  DRE
Revision  02
Author  S. Ratcliffe, I. Emsley, S. Gounden, Y. Grange, R. Lakhoo, C. Wu, A. DeMarco
Release Date  2016-07-20
Document Classification  Unrestricted
Status  Released

Lead Author  Designation  Affiliation
Shagita Gounden  LMC Team Lead  SKA-SA
Signature & Date:

Owned by  Designation  Affiliation
Bojan Nikolic  SDP Project Engineer  University of Cambridge
Signature & Date:

Approved by  Designation  Affiliation
Paul Alexander  SDP Project Lead  University of Cambridge
Signature & Date:

Released by  Designation  Affiliation
Paul Alexander  SDP Project Lead  University of Cambridge
Signature & Date:

 

 

 

 

Version  Date of Issue  Prepared by  Comments
0.95    LMC Lead et al.  Release for PDR
1.0    LMC Lead et al.  Updated with observations from OAR.
01C  2016-03-31  LMC Lead et al.  Release for dPDR
02  2016-07-20  LMC Lead et al.  Updated with delta-PDR OARs.

 

 

ORGANISATION DETAILS

Name  Science Data Processor Consortium
Address  Astrophysics
Cavendish Laboratory
JJ Thomson Avenue
Cambridge CB3 0HE
Website  http://ska-sdp.org
Email  ska-sdp-[email protected]


1 Table of Contents

1 Table of Contents
2 List of Figures
3 List of Tables
4 List of Abbreviations
5 References
5.1 Applicable Documents
5.2 Reference Documents
6 Introduction
6.1 Purpose of the document
7 LMC
7.1 Configuration and Control
7.1.1 Master Controller
7.1.2 Discovery
7.1.3 Configuration / Deployment
7.1.4 Execution
7.1.5 Teardown
7.1.6 Simulation
7.1.7 Target of Opportunity
7.1.8 States and Modes
7.1.9 Standby
7.2 Health and Status Monitoring
7.2.1 Standard Monitoring Points
7.2.2 Logging
7.2.3 Alarms and Events
7.3 Error Handling
7.3.1 Error Detection
7.3.2 Error Recovery
7.4 Quality Assessment (QA)
7.4.1 Metrics
7.4.2 Visualisation
7.4.3 QA Interface and Data Flow
7.5 Telescope State Information
7.5.1 Telescope State
7.5.2 State Information Categories
7.5.3 Simulated State Information
7.5.4 Lifespan
7.6 Local Telescope State (LTS) Manager
7.7 Local Sky Model (LSM) Manager
7.8 Fast Telescope State Producer
7.9 LMC Resource and Workflow Manager
7.9.1 SDP Resource and Workflow Manager
7.10 Workload Characterisation Framework
7.10.1 WCF External Interfaces
7.11 Managing Science Events
7.12 Interfaces
7.12.1 External Interfaces
7.12.2 Internal Interfaces
7.13 Implementation
7.14 Infrastructure
7.15 Logical Data Store
7.16 Configuration Management
7.17 Software
8 Prototyping Plan and Associated Risks
9 Function to Requirements Traceability

 

 

  

2 List of Figures

Figure 1: LMC Functional breakdown
Figure 2: LMC Product Tree breakdown
Figure 3: Component relationships between the internal components of the LMC, the internal SDP components and external interfaces
Figure 4: Top level state transition diagram for the SDP
Figure 5: State transitions within a single SDP capability
Figure 6: Sample interface showing an aggregate of the spectrum for each auto-correlation with average
Figure 7: Quality Assurance (QA) data flow
Figure 8: Local Telescope State (LTS) architectural focus
Figure 9: Local Sky Model (LSM) architectural focus
Figure 10: Workload Characterisation Framework
Figure 11: WCF External interfaces
Figure 12: Level 1 functional flow diagram of the SDP
Figure 13: LMC-SDP Internal interfaces

 

 

3 List of Tables

Table 1: Log levels as proposed by the LMC Interface Guidelines
Table 2: Function to Product Traceability
Table 3: Function to L2 Requirements Traceability

 

 

4 List of Abbreviations

API  Application Programming Interface
CDR  Critical Design Review
CPU  Central Processing Unit
CSP  Central Signal Processor
CTRL  Configuration and Control
FLOP  Floating-Point Operations
FOV  Field of View
GSM  General Sky Model
GTS  General Telescope State
GUI  Graphical User Interface
HM  Health and Status Monitoring
I/O  Input/Output
ICD  Interface Control Document
JSON  JavaScript Object Notation
LHC  Large Hadron Collider
LMC  Local Monitoring and Control
LOM  Lights-out management
LSM  Local Sky Model
LTS  Local Telescope State
PDG  Physical Deployment Graph
PDR  Preliminary Design Review
PIP  Pipelines
QA  Quality Assurance
SDBS  SDP Deployable Benchmark Suite
SDP  Science Data Processor
SPEAD  Streaming Protocol for Exchanging Astronomical Data
SSD  Solid State Drive
TM  Telescope Manager
ToO  Targets of Opportunity
WCF  Workload Characterisation Framework


5 References

5.1 Applicable Documents

The following documents are applicable to the extent stated herein. In the event of conflict between the contents of the applicable documents and this document, the applicable documents shall take precedence.

Ref No  Doc No  Document Title
AD01  SKA-TEL-SKO-DD-001  SKA1 SYSTEM BASELINE DESIGN
AD02    MISCELLANEOUS CORRECTIONS TO THE BASELINE DESIGN
AD03  SKA-TEL-SKO-0000008  SKA PHASE 1 SYSTEM (LEVEL 1) REQUIREMENTS SPECIFICATION
AD04  300-000000-029  SKA1 INTERFACE CONTROL DOCUMENT SDP TO TM MID
AD05  300-000000-002  SKA1 MID SDP-CSP INTERFACE CONTROL DOCUMENT
AD06  SKA-TEL.SADT.SE-TEL.SDP.SE-ICD  SKA-TEL.SADT.SE-TEL.SDP.SE-ICD-001_Rev_D.docx
AD07  SKA-TEL-SDP-0000013  PDR.01 SDP Architecture
AD08  PDR-03-05  Requirements Matrix

5.2 Reference Documents

The following documents are referenced in this document. In the event of conflict between the contents of the referenced documents and this document, this document shall take precedence.

Reference Number  Reference
RD01  LMC Scope and Responsibilities Document, SKA-TEL-TM-0000030, Revision: 01, Dated: 2015-11-24
RD02  LMC Interface Guidelines Document, SKA-TEL-TM-0000031, Revision: 01, Dated: 2015-11-24
RD03  SDP Architecture, SKA-TEL-SDP-0000013, Revision: 02C, Dated: 2016-03-24
RD04  SDP Execution Framework Design, SKA-TEL-SDP-0000015, Revision: 01C, Dated: 2016-03-24
RD05  SDP Pipelines Design, SKA-TEL-SDP-0000027, Revision: 01C, Dated: 2016-03-24
RD06  SDP Data Processor Platform Design, SKA-TEL-SDP-0000018, Revision: 01C, Dated: 2016-03-24


6 Introduction

In any large and complex engineering effort, proper definition of interfaces and clearly delineated functional elements are essential. The phase 1 design effort of the SKA has been split into a number of elements, each with a distinct functional scope.

These elements are controlled by a central Telescope Manager (TM), which requires each element to provide a well-defined interface to allow monitoring and control of the element. This Local Monitoring and Control (LMC) function promotes an interface-led design and allows segmented simulation and integration of telescope components.

In the context of the Science Data Processor (SDP), the LMC also provides top-level control over the SDP resources, both hardware and software, to ensure that the required capabilities can be realised. The LMC thus encompasses both the internal control and monitoring of the individual components of the SDP and the interface(s) between the SDP and other telescope subsystems, directly in the case of TM, and by proxy via the TM in the case of other elements.

6.1 Purpose of the document

The purpose of this document is to describe the SDP LMC architecture and design. The roles of the LMC are to:

● Monitor and control hardware of the various SDP subsystems
● Raise alarms and events to inform external entities of the state of the SDP
● Run/schedule data processing pipelines
● Run maintenance workflows
● Routinely perform status and diagnostic checks
● Expose maintenance functionality for fault finding, mitigation and correction
● Interact with the primary user of the SDP, which is TM, through a defined interface
● Save any data required by the LMC subsystem to a local store
● Manage quality assurance (QA) of the SDP sub-systems
● Control and manage the Local Sky Model
● Control and manage the Local Telescope State

 

7 LMC

The LMC performs the following functions:

● Configuration and Control (CTRL): Controls the high-level capabilities provided by the SDP, including commands required for availability and schedule planning, and exposes this control to the TM. This function is performed by the LMC Master Controller.
● Health and Status Monitoring (HM): Performs health, alarm and status monitoring and provides this information to the TM and internal elements as required. This function will be carried out by the LMC Master Controller.
● Aggregate QA metrics: Provides aggregate quality assessment metrics to the end user to allow verification of the quality of science data as it is processed through the SDP.
● Visualise Quality Assessment: This element is responsible for the visualisation of the produced metrics, and their encapsulation within an appropriate web-deliverable container. An accessible reference to this will then be passed to the Telescope Manager, to allow the visualisation to be displayed within the general user interface framework.
● Manage Local Telescope State (LTS): A repository of telescope state that is accessible by the sub-elements of the SDP. This includes static configuration, external dynamic state such as antenna pointing, and internally generated state such as calibration tables.
● Manage Local Sky Model (LSM): A repository of sources, e.g. point sources and Gaussian extended sources. Sources can be grouped into so-called patches.
● Produce Fast Telescope State: The element responsible for ensuring timely delivery of critical telescope state information from the Telescope Manager to the SDP Receiver.
● Manage Science Events: The element responsible for alerting TM to transient events.

Figure 1 indicates the major functional elements of the LMC. The LMC product tree is shown in Figure 2. Figure 3 depicts a component relationship diagram that places the LMC in the context of the SDP. The LMC products are discussed in more detail in the sections to follow.

 

FIGURE 1: LMC FUNCTIONAL BREAKDOWN   


 FIGURE 2: LMC PRODUCT TREE BREAKDOWN 


 

 

FIGURE 3: THE DIAGRAM SHOWS THE COMPONENT RELATIONSHIPS BETWEEN THE INTERNAL COMPONENTS OF LMC, THE INTERNAL SDP COMPONENTS AND EXTERNAL INTERFACES.


 

 

Fundamentally, the role of the LMC is to present a single, unified view of the SDP element to the rest of the observatory. This black-box view of each element is central to the design of the Telescope Manager, which has published a set of documents outlining the scope and responsibilities of each of the LMC element functions (see RD01, RD02).

This single view encompasses a wide variety of interactions, including:

● Resource discovery, reservation and control
● System health monitoring and status reports
● Science payload quality control and integrity monitoring
● Operator visibility into various element-level details
● Element activation from a power-off state

In addition to this outward-facing role, the LMC has the primary internal control responsibility for configuration and preparation of resources. The concept of a capability is central to the architectural approach adopted in the LMC component of the SDP. This fundamental definition is chosen to align with the current description of a capability given by the Telescope Manager.

In general, a capability is a grouping of SDP components and products to provide a particular pipeline analysis function (see RD03). In the context of the SDP, a capability maps directly to a particular analysis or engineering task. More specifically for the SDP:

● A capability is the minimum-sized task which can be scheduled by TM – TM does not control what happens within a capability;
● Capabilities do not share information directly, but information (e.g. Telescope State information) may be made available between capabilities via TM;
● A capability is defined by a set of configuration parameters, functions and products which are required to produce a defined product;
● When a capability is executed, further parameters may be defined that refine the behaviour of the capability, and SDP resources are allocated for execution;
● The SDP will support multiple simultaneously executing capabilities – this will be a standard situation, with, for example, one executing capability being responsible for receiving data for the current observation while another executing capability is processing data stored within the Buffer.

 

A capability may further be regarded as a package of functions which are required for a given operation: not all of this functionality may be required for every capability. In the SDP architecture, pipelines are executed within a data-driven framework (see RD03). The data dependencies and their execution within this framework are graph-driven, and so-called 'logical graph templates' provide a basic description of this functionality. A capability may also be constructed internally by the SDP from one or more logical graph templates, each describing an aspect of the capability, which are then formed into a single graph execution framework. Not all SDP functions are associated with individual capabilities. We define those functions that exist independently of capabilities as "singleton functions".
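To make the capability concept concrete, the minimal sketch below models a capability definition and an executing instance as plain data structures. The field names (parameters, logical_graph_templates, and so on) are illustrative assumptions for this sketch, not names defined by the SDP architecture documents.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Capability:
    """A schedulable SDP task: defining configuration parameters plus the
    logical graph template(s) describing its processing (names assumed)."""
    name: str                                   # e.g. "continuum_imaging"
    parameters: Dict[str, Any]                  # defining configuration
    logical_graph_templates: List[str] = field(default_factory=list)


@dataclass
class Observation:
    """A specific executing instance of a capability, refined with
    per-run parameters (field of view, channel count, ...)."""
    capability: Capability
    run_parameters: Dict[str, Any]


# Two observations of the same underlying capability may run in parallel
# as long as the SDP resource pool is sufficient.
imaging = Capability("continuum_imaging",
                     {"max_major_cycles": 10},
                     ["ingest_template", "imaging_template"])
obs_a = Observation(imaging, {"fov_deg": 1.0, "channels": 65536})
obs_b = Observation(imaging, {"fov_deg": 0.5, "channels": 32768})
```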

 

Capabilities are the minimal-sized tasks which can be scheduled by TM, as stated above. The relationship between capabilities, which are exposed to TM by the SDP, and the scheduling block used by TM needs further definition and clarity. As currently envisaged, the SDP capabilities are not formally exposed to observers, but rather TM translates project requirements into a series of scheduled SDP capabilities. Further system-level consideration needs to be undertaken to decide if this overall architectural approach is optimal.

 

In the context of this document, an observation is a specific instance of an SDP capability. In other words, an observation involves executing a particular SDP capability for a defined period of time. This naturally defines the basic unit of control, scheduling and availability monitoring of the SDP as a system.

 

As an example, "continuum imaging" is a particular capability supported by the SDP. When combined with appropriate parameters such as desired field of view, number of channels, number of major cycles, etc., an observation is defined. Thus, multiple observations using the same underlying capability may operate in parallel as long as the resource pool of the SDP is sufficient.

7.1 Configuration and Control

The SDP control concept is a combination of centrally orchestrated control by the Master Controller and devolved control by the Execution Framework (see RD03, RD04). The boundaries of this control concept are established by the definition of an SDP capability. All singleton functions and associated processes not within capabilities are managed directly by the Master Controller, and the Master Controller has the responsibility for triggering the setup and execution of a capability. Within an executing capability, the execution framework manages processes which are instantiated by the requirements of the graph-driven execution framework. These processes have access to services provided by components under the direct control of the Master Controller (e.g. the LTS, LSM and metric aggregation).

Configuration and control is handled on a per-capability basis. To the outside world, the various capabilities are presented as completely parallel SDP instances, controlled through a single point. Physically, this single point is implemented in a redundant master-slave configuration, which duplicates internal state across two nodes. Should the master fail, the slave will take over the connection to TM. Commands in progress during the failure will return an error code and TM will need to reissue them.

This failure behaviour could be mitigated through the use of virtual lockstep techniques that allow a slave to be run in complete synchronisation with the master. This approach may yield significant benefits with respect to the overall SDP availability allocation; however, no formal requirement exists as yet for a seamless LMC handover on failure. The formal analysis of the availability requirements will be concluded during the CDR phase, which will in turn provide guidance to LMC on this topic.

From the perspective of the TM, a particular observation has several phases that make up the overall operational model, namely: Discovery, Configuration/Deployment, Execution and Teardown. These are described in the sections to follow.

7.1.1 Master Controller

The Master Control function encompasses a wide variety of functionality that is needed to ensure that the appropriate emergent behaviour of the SDP system is achieved in response to command requests originating from the Telescope Manager. In general, this will involve the receipt of a command to activate a particular capability, the setup and initial control of which is the responsibility of the Master Controller. The Master Control function will also include functionality to estimate the resource usage of a particular capability, to allow both internal planning and external scheduling by the Telescope Manager.

In our data-driven system, fine-grained control of components does not require direct oversight from the Master Controller. It simply plays a supervisory role, ensuring that all needed components have been started with an appropriate configuration, and that such components persist in a functional state for the duration of the requested capability.

For many of the components of the SDP, the control from the Master Controller will be very lightweight and will simply ensure component availability, base configuration and health monitoring.

The Master Control function is also responsible for maintaining the overall SDP state machine, and in particular handles the critical power-on-to-idle and power-off transitions.
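The supervisory role described above can be illustrated with a minimal polling loop: start every component a capability needs, then periodically confirm each is still alive and restart any that are not. The Component interface here (start, is_alive) is a hypothetical stand-in for whatever process-management mechanism the platform layer finally provides.

```python
import time


class Component:
    """Hypothetical stand-in for a managed SDP component."""
    def __init__(self, name, config):
        self.name, self.config = name, config
        self.running = False

    def start(self):
        self.running = True    # a real version would launch/configure a process

    def is_alive(self):
        return self.running    # a real version would probe the process


def supervise(components, cycles=3, poll_interval=1.0):
    """Ensure components are started with their configuration and persist
    in a functional state for the duration of the requested capability."""
    for comp in components:
        comp.start()
    for _ in range(cycles):    # bounded for the sketch; unbounded in practice
        for comp in components:
            if not comp.is_alive():
                print(f"restarting {comp.name}")
                comp.start()
        time.sleep(poll_interval)


supervise([Component("lts", {}), Component("lsm", {}), Component("qa_agg", {})])
```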

7.1.2 Discovery

In the discovery phase, the TM attempts to construct a view of the telescope resources available for a particular observation, and the health state of these resources.

In the LMC context, this discovery phase is supported by queries received from TM relating to the availability of particular capabilities. A query will thus be issued by TM against a particular, predefined capability model. The LMC will then populate this model with a potential physical set of resources and report on the availability of this set to TM.

This provides no particular guarantee of availability at the specific time of observation, but is a crucial general-purpose planning tool.
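The shape of such a query/response exchange might look as follows. This is a sketch only: the field names are invented for illustration, and the actual capability model and transport are defined by the SDP-TM ICD [AD04].

```python
# Hypothetical TM -> SDP availability query against a predefined
# capability model (all field names are illustrative only).
query = {
    "capability": "continuum_imaging",
    "parameters": {"channels": 65536, "duration_s": 3600},
}

# Hypothetical LMC response: the model populated with a potential
# physical resource set and an availability verdict. Note that this
# is a planning estimate, not a reservation or guarantee.
response = {
    "capability": "continuum_imaging",
    "available": True,
    "candidate_resources": {"compute_nodes": 412, "buffer_tb": 18.5},
}
```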

7.1.3 Configuration / Deployment

Once TM has ascertained the availability of a particular capability from a system-wide perspective, the next step is to configure the SDP to activate the specified capability in preparation for data flow.

The TM will issue a configuration command to the SDP that contains a sufficiently complete set of capability parameters to allow a physical graph to be constructed. The Physical Deployment Graph (PDG) describes each physical node that is to be used to process this observation, what will be run on it, where it gets its data from and where it passes its results to.

In general, this list of parameters will be quite large, as it encompasses all of the system- and user-supplied settings needed to perform the full set of reduction steps specified by the capability. This set of parameters will be referred to as the Observation Configuration.

The LMC then queries the processing platform to determine the availability of the specific group of resources determined by the combination of the logical graph template and the Observation Configuration. The logical graph template is pre-computed for each particular SDP capability.

At this stage, due to failures or other unexpected events, there may be insufficient resources available for this configuration to proceed. This will be reported back to TM as a configuration failure. TM has requested that this failure include sufficient information to allow TM to modify the configuration appropriately, so that a new observation can proceed. This will then be used to issue a modified observation request to the SDP. The request should leave some margin on the resources required, to prevent churn.

Once the pool of available compute resources has been confirmed, the LMC will construct a final PDG that fully encodes all of the resources and connections needed. The nodes identified in the PDG are then started, and the required software and configuration are deployed to them via the scheduling capability of the SDP.

More detail on the construction of the physical graph is provided in the Execution Framework document (see RD04). A sketch of the configuration flow described above follows.
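This sketch summarises only the control flow; every function here (estimate_resources, query_platform_resources, and so on) is a hypothetical placeholder for an SDP-internal service, with trivial stub behaviour so the example runs.

```python
def estimate_resources(template, config):
    """Hypothetical cost model: resources implied by template + config."""
    return {"compute_nodes": config["channels"] // 512}


def query_platform_resources():
    """Hypothetical platform query; a real one would ask the platform manager."""
    return {"compute_nodes": 400}


def covers(available, required):
    return all(available.get(k, 0) >= v for k, v in required.items())


def configure_capability(template, config):
    """Sketch of the LMC configuration / deployment flow."""
    required = estimate_resources(template, config)
    available = query_platform_resources()
    if not covers(available, required):
        # Configuration failure: include enough detail for TM to issue a
        # modified observation request (leaving some resource margin).
        return {"status": "failed", "required": required, "available": available}
    # Resources confirmed: construct the final PDG and deploy it.
    pdg = {"template": template, "nodes": required["compute_nodes"]}
    return {"status": "ready", "pdg": pdg}


print(configure_capability("continuum_imaging", {"channels": 65536}))
```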

7.1.4 Execution

The execution phase for a generic observation has two distinct sub-phases. The first is synchronous data ingest and real-time processing, in which raw scientific data is ingested, flagged and marked up before being stored in buffer data drops and made available to the real-time pipelines.

The real-time processing aspects of the system will include service processing, such as gain calibration for beamformer use, and the image-plane transient detection pipeline, which runs at the correlator dump rate. Results that are required by the wider telescope will be fed back to TM via the Local Telescope State repository.

All other downstream processing runs in an asynchronous mode, using data which has accumulated for some period in the buffer.

The starting of the reduction/scientific pipelines will be triggered automatically by the buffer DOs. No further direct control from the LMC is envisaged, as the actual processing proceeds purely based on the availability of data for each node in the execution graph.

Generally, the synchronous and asynchronous execution phases will be running simultaneously, but for different observations, i.e. one or more observations will be buffering whilst the others are processing.

Execution progress will be monitored via the LMC, as the data triggers that drive progression through the execution graph will be aggregated through the various levels of data managers and reported as monitoring data.

7.1.5 Teardown

Explicit teardown of an observation is triggered in a number of ways:

● A stop packet in the input data stream is encountered (issued by CSP)
● The TM terminates the observation ahead of schedule
● A sufficiently severe error occurs (i.e. the error state is entered)

In all cases the initial step is a graceful shutdown: simply terminating the data input and allowing the graph to halt in a steady state. The state of the graph must then be interrogated to determine whether an appropriate point in the data reduction, processing and storage has been reached, i.e. whether a complete, viable data product is available.

If not, the graph must be backed out to the nearest termination point. These will be predefined breakpoints in the processing chain that have been identified by the initial static analysis. Effectively, they describe the points at which the processing is in a suitable state to have a checkpoint issued.

At this point the state of the observation is communicated to the Telescope Manager, which can make decisions about how to proceed.

For example, an observation could be terminated whilst the visibility buffer is still being filled. In this case it is quite likely that allowing the graph to terminate will result in a structurally viable visibility buffer, albeit one that contains less data than the observation initially intended. The operator (or an automated TM process) may then decide to discard the buffer, or go ahead anyway and produce a science product from it.

More information on errors and error handling can be found in section 7.3.

Implicit teardown on normal observation completion is handled by the DOs and nodes themselves.
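The teardown decision logic reduces to a small amount of control flow, sketched below with a hypothetical graph interface (halt_input, has_viable_product, back_out_to_checkpoint); the real interrogation mechanism belongs to the Execution Framework (RD04).

```python
class ExecutionGraph:
    """Hypothetical stand-in for an executing capability graph."""
    def __init__(self, fraction_complete):
        self.fraction_complete = fraction_complete

    def halt_input(self):
        pass    # graceful: stop data input, let the graph drain to steady state

    def has_viable_product(self):
        return self.fraction_complete >= 1.0

    def back_out_to_checkpoint(self):
        # Back out to the nearest predefined breakpoint identified
        # by the initial static analysis of the processing chain.
        return "checkpoint_reached"


def teardown(graph):
    """Sketch of explicit teardown: graceful halt, viability check,
    back-out if needed, then report to the Telescope Manager."""
    graph.halt_input()
    if graph.has_viable_product():
        return {"observation_state": "complete"}
    return {"observation_state": graph.back_out_to_checkpoint()}


print(teardown(ExecutionGraph(fraction_complete=0.6)))
```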

7.1.6 Simulation

The LMC will also provide simulation capabilities for use by the TM and internal components. For TM, this will allow a control sequence to be tested, generating appropriate timings, state changes and log messages.

For internal use, it will allow testing of system components without a real TM present. For this use case, a simple web-delivered interface will be developed to allow the execution of fake observations, the tweaking of observational parameters and the monitoring of logs and events.

7.1.7 Target of Opportunity

In general, telescope observations are planned and scheduled ahead of time and can thus afford a modest setup and configuration time. Targets of Opportunity (ToO), on the other hand, typically require a reaction time of order 10 seconds from a subordinate system such as the SDP.

Practically, this requires a stripped-down startup capability that at minimum will start the ingest pipeline and allow data to be recorded to the buffer. Once recording has begun, other startup can be done in a more orderly fashion.

The LMC will provide a ToO capability that has a pre-configured data flow graph allowing a minimal set of ingest capability. This will be periodically refreshed by the LMC so that the bulk of the required resources are always available.

In practice, it may be necessary to mandate that ToO events reuse the existing ingest configuration (channelization, baselines, etc.) to ensure sufficient reactivity to the event. In this case, the SDP can effectively respond in near-zero time to the requirement to capture data to the buffer.

7.1.8 States and Modes

The overall state of the SDP is directly controlled, and expressed to TM, by the master controller. Three types of transitions can occur:

● Normal: A commanded (or automatic) transition from one state to another.
● Error: An error condition may result in the transition to another state or prevent a normal state transition from occurring.
● Recovery: On resolution of error conditions (either through direct intervention or automated recovery) a state transition may occur.

Figure 4 shows the main states expressed at the top level of the SDP. A brief description of each state follows:

● Off: Complete shutdown of the entire SDP system. In normal operation, transition out of this state will require an external system to apply power to the essential components required to start the master controller.
● Standby: In this state the master controller is fully powered and operational, and sufficient platform elements are started to allow the master controller to enact further state transitions.
● Error: A special state used purely to handle the case in which the master controller fails to start or suffers catastrophic failure. Recovery from this state to Standby is possible.
● Un-configured: A transition state used when the full SDP system is being started or stopped. Once complete, the system will transition into some form of availability state or be back in Standby.
● Available: In this state all capabilities of the SDP are available for configuration and use by the TM. In general it is not required that every component of the SDP is available for the overall system to be in the Available state, as the system will be built with some excess capacity to handle failures.
● Degraded: Some capabilities are not available, but the system can still perform some functions and may be able to continue with operations that are in flight.
● Unavailable: No capabilities are currently available due to errors in the system. It may be possible for some tasks to complete gracefully. Transition out of this state will either occur through automated recovery of sufficient capacity to move to the Degraded state, or the system may be completely de-configured.

In general, the SDP will spend the majority of its operational time transitioning between the Available, Degraded and Unavailable states as system component availability fluctuates due to errors.

 

 

FIGURE 4: TOP LEVEL STATE TRANSITION DIAGRAM FOR THE SDP 
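A minimal sketch of this top-level state machine is given below, transcribing the states listed above. The transition sets are an indicative reading of the prose descriptions; Figure 4 remains the authoritative definition.

```python
from enum import Enum, auto


class SDPState(Enum):
    OFF = auto()
    STANDBY = auto()
    ERROR = auto()
    UNCONFIGURED = auto()
    AVAILABLE = auto()
    DEGRADED = auto()
    UNAVAILABLE = auto()


# Indicative transitions only; Figure 4 is the authoritative definition.
ALLOWED = {
    SDPState.OFF: {SDPState.STANDBY},
    SDPState.STANDBY: {SDPState.UNCONFIGURED, SDPState.ERROR, SDPState.OFF},
    SDPState.ERROR: {SDPState.STANDBY},
    SDPState.UNCONFIGURED: {SDPState.AVAILABLE, SDPState.DEGRADED,
                            SDPState.STANDBY},
    SDPState.AVAILABLE: {SDPState.DEGRADED, SDPState.UNAVAILABLE,
                         SDPState.UNCONFIGURED},
    SDPState.DEGRADED: {SDPState.AVAILABLE, SDPState.UNAVAILABLE,
                        SDPState.UNCONFIGURED},
    SDPState.UNAVAILABLE: {SDPState.DEGRADED, SDPState.UNCONFIGURED},
}


def transition(current, target):
    """Normal, error and recovery transitions all pass through one gate,
    so illegal jumps (e.g. OFF -> AVAILABLE) are rejected uniformly."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


state = transition(SDPState.OFF, SDPState.STANDBY)   # external power-on path
```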


Figure 5 shows the state transitions within a single SDP capability. These state machines are maintained per capability by the master controller. A brief description of each state follows:

● Un-allocated capability: A virtual state used to represent a capability that has not yet been configured.
● Idle: A capability that has been configured by TM but does not yet have SDP capacity allocated. In terms of the data flow system, this represents a fully populated logical graph that has not yet been allocated to actual hardware.
● Ready: The capability is fully allocated to SDP hardware and is ready to receive observational data from CSP and TM.
● Buffering: An intermediate state used to indicate that data is being received from the CSP and stored in the buffer.
● Processing: In this state, no further data is expected from the CSP, but further processing by the SDP is occurring.
● Stopped: The graph for this capability has completed successfully and the system will transition back to Idle.
● Capability Error: Indicates that a critical error has occurred in one of the required processing steps and the capability could not continue without external intervention. Recovery could include a transition back to Idle (i.e. discarding the capability) or back to Ready, which could allow the capability to resume.


 

FIGURE 5: STATE TRANSITIONS WITHIN A SINGLE SDP CAPABILITY 

7.1.9 Standby

Particular attention needs to be paid to the Standby mode, as this differs from most other states in that large portions of the SDP element are unavailable.

The only component of the LMC that is running in the Standby state is the local master controller. This must be configured such that, on application of power to the SDP element, it will start up. In addition to the master controller, the management network switches must be powered, and power must be physically provided to the various nodes in the compute platform to allow remote power-on.

It may be that the Platform Management component also starts up in the Standby state, to allow the LMC to interact with the lights-out-management (LOM) capabilities of the compute nodes indirectly.

The master controller contains a boot configuration database component that provides initial boot and startup information, so that on a request from TM to transition to the Un-configured state, the remainder of the SDP (both hardware and software) can be instructed to start.

This will likely be a significant step that requires close monitoring of component states and precise sequencing of operations. Lessons from the operation of large-scale compute facilities and other projects such as the LHC will be valuable here.

 

 

 


7.2 Health and Status Monitoring

To support the configuration and control capabilities provided to TM by the LMC, a broad range of sensors and events are supported to provide a continuous view of the functional status of the SDP element.

This section details some of the most important types of information provided by the LMC; a full accounting of the individual sensors and event types is provided only in the referenced ICD [AD04].

7.2.1 Standard Monitoring Points

Each element LMC is expected to provide a wide range of standard monitoring points to allow the TM, and by proxy the telescope operators, to have sufficient system-wide visibility to make informed decisions on the operation of the telescope.

SDP is no exception and will provide monitoring information in the following broad categories:

Health
● Rolled-up Element health
● Rolled-up Capability health

Status
● Operational, Usage, Capability, etc.
● Version and build (overall and per sub-element)

The SDP LMC will handle monitoring using a defined push-based strategy from lower-level components. It is expected that many components will include some local aggregation of subordinate components, to reduce the operating volume of metadata; a minimal sketch of such a roll-up is given below. In an error or diagnosis scenario, more detailed monitoring from these components will be enabled. Internal distribution of this information is likely to use the same communication protocol as proposed by TM for the top-level LMC monitoring distribution. TM has decided to use TANGO as its control and communication protocol. Having a common protocol in use for LMC operations, both internally and externally, will be a significant risk reduction, especially in the early phases of commissioning.
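The roll-up can be illustrated as a worst-of aggregation over subordinate components, so that only the aggregate routinely crosses the network; the three-valued health scale used here is an assumption for the sketch, not a defined interface.

```python
from enum import IntEnum


class Health(IntEnum):
    OK = 0
    DEGRADED = 1
    FAILED = 2


def rolled_up_health(child_healths):
    """Worst-of aggregation: a parent reports the health of its least
    healthy subordinate (scale assumed for this sketch)."""
    return max(child_healths, default=Health.OK)


# Example: one degraded compute island degrades the element roll-up.
islands = [Health.OK, Health.OK, Health.DEGRADED]
assert rolled_up_health(islands) is Health.DEGRADED
```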

7.2.2 Logging

The LMC will provide both remote and local (but downloadable) logging of the various internal SDP components.

Internally, all log messages will be collected via a central component provided by the LMC. This will create local, rotating log files based on the specified log levels and retention settings. Sensible defaults will be in place, but these can be overridden by TM.

This central component will also buffer a portion of the log messages in memory and allow remote connections to this log buffer for the purposes of online logging. It will also allow remote users to request changes to the log levels of internal components for the purposes of debugging.

This feature will be extremely useful to both SDP and TM during the early phases of integration and set-to-work activities. In addition to allowing log retrieval via the TM interface, a simple web-based UI will be provided for early work. A minimal sketch of the central collector is given below.
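This sketch uses Python's standard logging module, with placeholder file sizes, retention counts and component names: a rotating file handler gives local retention, and a bounded in-memory deque serves online log viewers.

```python
import collections
import logging
from logging.handlers import RotatingFileHandler


class RingBufferHandler(logging.Handler):
    """Keep the most recent records in memory for online log viewers."""
    def __init__(self, capacity=10000):
        super().__init__()
        self.buffer = collections.deque(maxlen=capacity)

    def emit(self, record):
        self.buffer.append(self.format(record))


log = logging.getLogger("sdp.lmc")
log.setLevel(logging.INFO)               # sensible default; TM may override

# Local, rotating log files based on retention settings (placeholder values).
log.addHandler(RotatingFileHandler("sdp-lmc.log",
                                   maxBytes=50 * 2**20, backupCount=10))

# In-memory buffer serving remote "online logging" connections.
ring = RingBufferHandler()
log.addHandler(ring)

# Remote users may request log-level changes on internal components,
# e.g. (hypothetical component name):
logging.getLogger("sdp.lmc.visibility_receiver").setLevel(logging.DEBUG)
```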

At present, the products that produce logging information are (see RD05):

● Visibility Receiver (C.1.2.1.4)
● Flagging Components (C.1.2.2.1.3.1)
● Strong Source Removal (C.1.2.2.1.3.2)
● Calibration Components (C.1.2.2.1.4)
● Source Finding Components (C.1.2.2.1.5)
● Imaging Components (C.1.2.2.1.6)

In addition to the standard logging level controls (as defined by the LMC Interface Guidelines document – extract shown below), it is likely that the underlying architecture will be amenable to very fine-grained logging control, to the extent that individual nodes within the overall execution graph can be targeted to enable logging output. This, coupled with a filtering and rate-limiting system, will allow precision logging to be used as a powerful debugging tool.

In terms of the logging of processing platform components, each compute island will have its own log server, which will collect routine system logs for the components in the island. A condensed log will be collected centrally by an LMC-controlled logging service (see RD06). This central service will send an even further condensed log summary to TM. At node or switch level, access to performance counters and integrated sensors will be available upon request from the LMC, subject to data-rate limitations in the relevant networks.

   


TABLE 1: LOG LEVELS AS PROPOSED BY THE LMC INTERFACE GUIDELINES

OFF
Definition: OFF is the highest possible logging level and is intended to turn logging off.
Expected Content: No information. Devices should never log messages directly to the OFF logging level.

FATAL
Definition: The device has failed. There is no workaround. Recovery is not possible.
Expected Content: The logged message should capture as much system state information as possible in order to assist with debugging the problem. Logging information at this level should not directly impact the performance of the device.

ERROR
Definition: An error has occurred. A function or operation did not complete successfully. A workaround may be possible. The device can continue, potentially with degraded functionality. Logging information at this level should not directly impact the performance of the device.
Expected Content: The error message should capture detailed information relating to the event that has occurred.

WARN
Definition: A condition was detected which may lead to functional degradation (e.g. an anomaly threshold has been crossed), but the device is still fully functional. Logging information at this level should not directly impact the performance of the device.
Expected Content: The warning message should capture the information relating to what functional degradation may occur and list the thresholds that have been exceeded.

INFO
Definition: This level of logging should give information about workflow at a coarse-grained level. Information at this level may be considered useful for tracking process flow. Logging information at this level should not directly impact the performance of the device.
Expected Content: The information message should capture information relating to the operation that has completed.

DEBUG
Definition: Verbose output used for detailed analysis and debugging of a device. Logging information at this level may impact the performance of the device.
Expected Content: This level of logging should show workflow at a fine-grained level. Information relating to parameters, data values and device states should be reported.

TRACE
Definition: Extremely verbose output for detailed analysis and debugging of a device. Logging information at this level may impact the performance of the device.
Expected Content: This level of logging should show function call stacks and provide a high level of debug information.

ALL
Definition: ALL is the lowest possible logging level and is intended to turn on all logging.
Expected Content: Logging will occur at the most detailed level. Devices should never log messages directly to the ALL logging level.


 

7.2.3 Alarms and Events The LMC is responsible for first level monitoring of all the equipment and services that comprise the                                 

SDP element. In particular detecting and reporting critical conditions, both those affecting operations                         

and those that compromise the telescope mission, is an important role within the LMC. 

In addition to these predefined critical conditions, the TM may wish to define run time specifications                               

for particular conditions to aid in debugging and commissioning. 

The LMC will handle this requirement through a generic event framework, which will describe and                             

maintain a set of centralised rules that reside within the monitoring component. Events will be                             

generated with a specific destination, criticality level and sufficient information to accurately convey                         

intent. 

Escalation and repeat event thresholds can be specified per criticality level so as to prevent alarm                               

swamping of upstream components. 
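One possible realisation of such a rule, with the suppression behaviour made explicit, is sketched here; the rule fields and names are illustrative assumptions, not part of the defined framework.

```python
from dataclasses import dataclass


@dataclass
class EventRule:
    """Centralised rule: destination, criticality and repeat threshold."""
    name: str
    destination: str           # e.g. "TM" or "operator"
    criticality: str           # e.g. "alarm" or "event"
    repeat_threshold: int = 3  # suppress forwarding after this many repeats
    count: int = 0

    def fire(self, detail):
        """Return an event to forward, or None if suppressed.
        Suppressed repeats are still counted (tracked, not lost)."""
        self.count += 1
        if self.count > self.repeat_threshold:
            return None
        return {"rule": self.name, "to": self.destination,
                "criticality": self.criticality, "detail": detail,
                "occurrence": self.count}


rule = EventRule("buffer_node_offline", "TM", "alarm", repeat_threshold=2)
for _ in range(4):
    print(rule.fire({"node": "buf-017"}))   # third and fourth are suppressed
```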

7.3 Error Handling

Perhaps the most challenging aspect of a large-scale facility such as the SDP, from a control perspective, will be error handling. When dealing with extreme scale, it is a certainty that a fraction of the overall facility will be unavailable at any one time. More challengingly, we can expect multiple failures throughout the system during the processing of any particular observation run.

Some of these failures will be amenable to trend prediction and can be dealt with pre-emptively, but those that occur without warning need to be dealt with in near real time, especially in the front end of the SDP.

7.3.1 Error Detection

Of the three broad categories of error (hard, soft, silent), errors which are undetectable through standard OS-level tools will be the hardest to deal with. This includes memory and transport problems, which introduce subtle, bit-level errors that can, at worst, lead to completely unexpected operation of the processing chain.

A cohesive, introspective error detection system, combining system tools that monitor the platform and applications with the detailed automated QA system that provides real-time monitoring to the end user, will allow the most detailed view of errors possible.

In a system of the eventual complexity of the SKA, it will be essential to have aggregate, hierarchical reporting to avoid users becoming swamped with low-level error messages. This is handled via a number of mechanisms.

Firstly, errors and state information are broadly classified into alarms and events. Alarms are events that, without further scope broadening, represent a critical telescope fault and require operator notification (although intervention may still be automatic within the element). Events are points of interest that, when taken in a broader system view, perhaps via automatic aggregation, could reveal a higher-level issue.

It is also important to be able to mark alarms, and other errors, with a variety of states, such as 'known failure', to manage and prioritise faults. This is especially important for alarms that get triggered at a regular cadence – the system must be able to suppress further warnings when directed, but still track the alarm events themselves.

7.3.2 Error Recovery

The primary assumption of the SDP is that sufficient parallelism exists at every level of the processing chain to allow scale-out without becoming limited by any particular serial process. This tenet is also fundamental to resilience and error recovery. In effect, and where possible, errors will be detected, and the affected parallel thread (typically a frequency channel) will simply be discarded until such time as the error can be corrected. If the system holds data that is deemed critical to a particular reduction (so-called 'precious' data), or indeed to the operation of the SDP itself, this data will be tagged as such and protection mechanisms will be used to prevent data loss. This will likely include redundant copies of the data stored on multiple nodes, allowing easy failover.

The major design outcome of this is to ensure that all data reception points are amenable to missing, or corrupt (and flagged as such), data. In the most simplistic view, missing data will be covered by the existing flagging mechanisms, as sketched below.
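Concretely, discarding a failed parallel thread amounts to setting the corresponding slice of a flag array, so that downstream reception points treat it like any other flagged data. The [time, baseline, channel] layout assumed here is illustrative only.

```python
import numpy as np

# Assumed layout: visibilities indexed as [time, baseline, channel].
n_time, n_baseline, n_channel = 10, 351, 4096
flags = np.zeros((n_time, n_baseline, n_channel), dtype=bool)


def discard_channel(flags, channel):
    """A failed parallel thread (here: one frequency channel) is simply
    flagged; downstream consumers already tolerate flagged data."""
    flags[:, :, channel] = True


discard_channel(flags, 2048)        # e.g. the node handling channel 2048 failed
print(f"flagged fraction: {flags.mean():.4%}")   # a few extra flags are cheap
```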

In general, we expect a significant portion of the data to be flagged due to science quality issues. Adding a few percent of additional flagging based on system errors is not a significant cost.

Some of the critical paths will be explicitly covered in the execution graph by including redundant paths. Other errors may require sections of the graph to be discarded, and later nodes redistributed to hardware not originally assigned to the physical graph.

7.4 Quality Assessment (QA)

Whilst the functional health of the SDP element is exposed to TM by the health monitoring function, the quality and integrity of the scientific payload is monitored separately. The LMC is responsible for two primary aspects of QA, namely metric aggregation and visualisation.

7.4.1 Metrics

Internal QA reporting is achieved through the production of a range of predefined metrics that are calculated within various stages of the SDP processing pipeline. These range from low-level indicators, such as closure quantities generated in the Ingest phase, to high-level indicators, such as snapshot images produced from the various imaging pipelines.

In general, these metrics do not stand on their own, and may require merging with other metrics or telescope state information in order to allow complete and accurate interpretation by the end user.

The LMC is responsible for this aggregation and sanitisation function, and provides the single point of contact for end users and other elements wishing to receive these QA metrics.


At this stage it is assumed that the SDP will provide both a programmatic and a web-based human interface to allow examination of QA information. This will be clarified once the observatory-wide user interface policy has been completed, and the human interface could be incorporated into the central user interfaces delivered by TM.

QA information is provided per capability, although it may be possible to produce cross-capability aggregate metrics in certain cases. A wide variety of metrics will be provided; these will be detailed in the L2/L3 SDP Requirements, which will be delivered at element PDR.

The following broad categories of metrics are included:

Interferometric Measures – These represent a near real-time view of telescope performance and include metrics such as visibility phase per baseline as a function of time.

Instrumental Calibration – These show the state of dynamic calibration of the instrument and include metrics such as bandpass calibration.

Performance Metrics – These are higher-level, derived metrics that show the overall current performance of the instrument in terms of quantities such as astrometry (showing deviation of source position from known standards).

7.4.2 Visualisation

In the current system breakdown, it is proposed that the SDP handles the visualisation of the produced metrics, and their encapsulation within an appropriate web-deliverable container. An accessible reference to this will then be passed to the Telescope Manager, to allow the visualisation to be displayed within the general user interface framework.

This approach also allows optimisation of the outbound network traffic from the SDP element, by using an adaptive mix of server- and client-side rendering to suit the particular metric under display.

Many of the metrics under consideration will have a substantial number of individual components (e.g. phase per baseline), and thus the visualisation of these in a human-digestible form is a significant challenge.

Care needs to be taken that the data presented is both sufficient for diagnostic needs and minimal enough to prevent data fatigue for the operator.

 


 

FIGURE 6: SAMPLE INTERFACE SHOWING AN AGGREGATE OF THE SPECTRUM FOR EACH AUTO-CORRELATION WITH AVERAGE

An example of a possible approach is shown in Figure 6. In this case, we show an aggregate of the spectrum for each auto-correlation with its average, and then percentile levels to show an overall envelope. Outliers are shown individually, and particular baselines or signals of interest can be added for debugging.
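
To make the aggregate-plus-envelope idea concrete, the following is a minimal sketch in Python with NumPy; the array shapes, the 5/95 percentile levels, the outlier threshold and the random test data are illustrative assumptions, not the SDP implementation:

```python
import numpy as np

# Toy input: one row per auto-correlation signal path, one column per channel.
rng = np.random.default_rng(1)
spectra = rng.normal(1.0, 0.1, size=(64, 512))
spectra[7] += 0.8                                  # a misbehaving signal path

average = spectra.mean(axis=0)                     # aggregate spectrum
lo, hi = np.percentile(spectra, [5, 95], axis=0)   # overall envelope

# Flag paths sitting above the envelope in most channels as outliers,
# to be drawn individually on top of the aggregate view.
outliers = np.where((spectra > hi).mean(axis=1) > 0.5)[0]
print("outlier auto-correlations:", outliers)      # -> [7]
```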

7.4.3 QA Interface and Data Flow 

An overview of QA LMC data flow is shown in Figure 7. Each processing block that produces QA metric data sends this data over the network to the QA aggregator device. The services that send data from the various blocks may require some configuration parameters to be set. In this diagram, TANGO is indicated as the framework used for controlling service configurations. TANGO is used here as a placeholder for the control framework to be used internally to the SDP; the choice of the exact messaging and control framework used internally to the SDP has not yet been decided upon. At this stage, it is certain that TANGO will be used to implement the external interface between SDP and TM. This is described fully in the SDP-TM ICD document [AD04]. In the example shown in Figure 7, the TANGO device serving the various blocks producing metrics could be responsible for functionality such as the following (a minimal device sketch follows the list):

● Start/Stop metric publisher 

● Report current rate of publishing 

● Read/write destination IP/port to which metrics are published (essentially the address where the QA aggregator is running)
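
As a sketch of such a device, the following uses PyTango's high-level server API. The device name, attribute set and stubbed publisher logic are illustrative assumptions, not an agreed SDP interface:

```python
from tango import AttrWriteType
from tango.server import Device, attribute, command, run

class QAMetricPublisher(Device):
    """Illustrative TANGO device wrapping one block's QA metric publisher."""

    publish_rate = attribute(dtype=float, unit="Hz",
                             doc="Current rate of publishing")
    destination = attribute(dtype=str, access=AttrWriteType.READ_WRITE,
                            doc="IP:port of the QA aggregator")

    def init_device(self):
        Device.init_device(self)
        self._destination = "0.0.0.0:0"
        self._rate = 0.0
        self._running = False

    def read_publish_rate(self):
        return self._rate

    def read_destination(self):
        return self._destination

    def write_destination(self, value):
        self._destination = value      # redirect the metric stream

    @command
    def Start(self):
        self._running = True           # a real device would start the stream here

    @command
    def Stop(self):
        self._running = False

if __name__ == "__main__":
    run((QAMetricPublisher,))
```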

  

As an alternative to sending data over the network in raw packets (e.g. UDP), depending on the estimated data rate of the various blocks, data can be sent over the TANGO bus via the use of pipes (streaming). This solution is, however, not recommended.


 

FIGURE 7: QUALITY ASSURANCE (QA) DATA FLOW. THE VARIOUS FUNCTIONAL BLOCKS INTERACTING WITH THE QA AGGREGATOR ARE ALL UNIDIRECTIONAL PUSH DATA FLOWS, WITH A LISTENING AGGREGATOR THAT PROCESSES AND PUSHES THE RESULT FLOW TO A VISUALIZER.

 

By allowing a control protocol such as TANGO to marshal whether the individual blocks publish data to the network or not, and by allowing the configuration of a destination to which the metric packets are transferred, the control protocol essentially serves as a very light marshal for a publish-subscribe data flow mechanism through which QA metrics flow to the QA metric aggregator. The QA aggregator is a service that continuously listens for metrics from different processing blocks. The aggregator will perform QA aggregation on the incoming flow, and publish the resulting data to the QA visualizer. It is opportune to monitor and control the QA aggregator similarly to other metric-publishing blocks, with the following basic options:

● Start/Stop the Aggregator service.

● Report current rate of ingestion.

● Report current rate of publishing.

● Read/write destination IP/port to which aggregated metrics are published (essentially the address where the QA visualizer is running)

  

The QA visualizer will be implemented as a graphical user interface, populated by the SDP LMC and presented to TM.
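
As an illustration of this light publish-subscribe marshalling, the sketch below uses ZeroMQ (one of the candidate brokers listed in section 8). The endpoint, topic name and payload are illustrative only, and both ends are shown in one process purely for demonstration:

```python
import json
import time
import zmq

ctx = zmq.Context()

# Aggregator side: a continuously listening subscriber.
sub = ctx.socket(zmq.SUB)
sub.bind("tcp://*:5556")
sub.setsockopt(zmq.SUBSCRIBE, b"")      # accept every metric topic

# Processing-block side: a publisher whose destination is set by the LMC.
pub = ctx.socket(zmq.PUB)
pub.connect("tcp://localhost:5556")

time.sleep(0.2)                         # let the subscription handshake finish

# One QA metric pushed as a topic-tagged JSON message.
pub.send_multipart([b"ingest.closure_phase",
                    json.dumps({"t": 0.0, "value": 0.02}).encode()])

topic, body = sub.recv_multipart()
print(topic.decode(), json.loads(body))
```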

 

 

 


7.5 Telescope State Information

The additional metadata required by the SDP to fulfil its mission, over and above the scientific payload received from the CSP, can become a serious complication in the design of the facility. This is due to two factors. Firstly, the metadata itself is very varied, drawn from multiple sources at multiple cadences, and in many cases requires interpretation through a computational model before it can be used. Secondly, the complexity of these interpretive models is itself very varied, and thus so is their computational cost.

At scale, this means careful attention must be paid to reducing the complexity of the interfaces to external sources of metadata, and to ensuring that the correct models run in the optimal location (which may be internal or external).

7.5.1 Telescope State

The SKAO is currently in discussions with a number of parties regarding the use of a central telescope model. This would be a single, telescope-wide repository of not only telescope state metadata but also computational models.

The clear benefit of this approach is that there is a single source of truth, and updating and versioning models is straightforward. It also avoids the distribution of metadata which may be extraneous to a particular task, as only the specifically required information is pulled into a task. The potential downside, especially at scale, is that the model becomes a bottleneck, and the computational load of the modelling becomes too great.

For now, the baseline design for the SDP includes a local telescope state repository that will contain a copy of the system metadata needed for correct operation of the SDP. This repository will also be used for the storage and distribution of internally generated metadata, particularly metadata that needs to be exposed upstream to the Telescope Manager. This is described in more detail in section 7.6.

More commentary on this will be included in the SDP CDR design documents once the discussions around the Telescope Model have concluded.

7.5.2 State Information Categories

The following broad categories of state information are likely, and are presented here to enhance clarity on the role played by this information:

● Data Identification – Information directly needed to interpret the scientific payload structure (e.g. baseline indexing).

● Pointing – Information related to physical (antenna) and virtual (beam-former) positions, modes and targets.

● Telescope State / Configuration – The static configuration of the telescope as it describes the environment in which a particular capability is to be executed (e.g. channel bandwidth).


● Environment – Factors within the telescope environment used for further scientific analysis (e.g. wind speed).

● SDP Calibration Data – Data products required in near real-time by other parts of the telescope (e.g. beamformer gain solutions) will need to be passed to the Telescope Manager through the Telescope State repository. Further description of these products is given in the SDP-TM ICD [AD04]; in their current form they appear to be quite manageable (as the overall update rate required is relatively low). However, additional use cases may well emerge, and very careful attention will need to be paid to the resultant requirements in terms of data volumes and latency.

7.5.3 Simulated State Information

Standalone operation of the SDP processing chain is highly desirable during commissioning and debugging. To support this, a facility to simulate Telescope State Information is required. This can be done either from an extant observation, to allow recreation of a particular test, or from a purely simulated source. In both cases a telescope model is populated, and this is then either serialized to a stream or accessed directly.
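
A minimal sketch of this facility, assuming a simple dictionary-shaped telescope model and JSON serialisation (both purely illustrative choices):

```python
import io
import json

def simulated_telescope_state(seed_obs=None):
    """Populate a toy telescope model either from an extant observation
    record (to recreate a particular test) or from simulated values."""
    if seed_obs is not None:
        return dict(seed_obs)
    return {"pointing": {"ra_deg": 83.6, "dec_deg": -5.4},
            "channel_bandwidth_hz": 26.0e3,
            "wind_speed_mps": 3.2}

state = simulated_telescope_state()

stream = io.StringIO()
json.dump(state, stream)             # serialized to a stream ...
ra = state["pointing"]["ra_deg"]     # ... or accessed directly
```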

7.5.4 Lifespan

In all cases it is assumed that at least the local portion of the telescope model persists for the duration of a particular observation. Once an observation has been completed, the telescope model will be serialized and stored in the archive alongside the standard scientific products output by the SDP. This is a critical aspect of producing a fully marked-up data product, and will ensure that post-processing investigations can determine the full operating context of the telescope without resorting to additional queries.

7.6 Local Telescope State (LTS) Manager

Two categories of data exist within the confines of the Local Telescope State (LTS):

● Data that is relatively static and can simply be distributed to ensure uniform access times across the full compute platform.

● Working data that may be highly localised, with intensive access in these locales; this will likely involve intense periods of local read/write accesses marked by transaction boundaries that align with iteration boundaries within the processing pipeline.

  

The static data of the Local Telescope State (LTS) contains the static metadata of an observation, more or less the information contained in the subtables of a MeasurementSet. The LTS has to offer functionality to let applications obtain this metadata. An interface to the LTS will provide access in one of three forms:

1. Through getter methods, e.g. "getXXX" functions, where XXX is more or less the name of the metadata item to be obtained.

2. Storing the metadata as a dictionary of key-value pairs (where the value can be an array). A generic getter method then provides the value for a given metadata key. The advantage is that this is simple and open (any item can be obtained). A possible disadvantage is that the user needs to know the name of the key (though that is not very different from having to know the name of a function). Another disadvantage is that the return type has to be somewhat abstract to handle all possible data types.

3. A richer option is to provide, for a given key, a data structure which includes the actual data together with metadata about the data, such as its datatype. This could take the form of a JSON result.

  

It should be possible to query the LTS for data at various granularity levels. Requests could either return singular values for a singular key, or effectively return a collection of metadata, such as the entire configuration of an antenna (names, positions, diameters etc.). In this case, the third option above is more suitable, as an entire document with a hierarchical structure can be returned.
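
The following sketch illustrates options 2 and 3 together with the two query granularities; the hierarchical key scheme and field names are assumptions made for the example, not the LTS design:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class LTSItem:
    """A value plus metadata about the value (option 3 above)."""
    key: str
    value: Any
    dtype: str
    unit: str = ""

class LocalTelescopeState:
    """Toy in-memory stand-in for the static part of the LTS."""

    def __init__(self):
        self._store = {}

    def put(self, key, value, dtype, unit=""):
        self._store[key] = LTSItem(key, value, dtype, unit)

    def get(self, key):
        """Generic getter (option 2), returning the rich item (option 3)."""
        return self._store[key]

    def get_subtree(self, prefix):
        """Coarse-grained query: every item under a hierarchical prefix."""
        return {k: v for k, v in self._store.items() if k.startswith(prefix)}

lts = LocalTelescopeState()
lts.put("antenna/m001/position", [5109.2, 2006.1, -3239.0], "float64[3]", "m")
lts.put("antenna/m001/diameter", 13.5, "float64", "m")

print(lts.get("antenna/m001/diameter").value)   # singular value, singular key
print(lts.get_subtree("antenna/m001"))          # whole antenna configuration
```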

 

The working data is highly localised to the various processes that interface with the LTS, performing updates on the fly. For this kind of application the data flow is more intensive, and an in-memory data store of sorts can be used.

 

One way to satisfy the requirements of both the static and the highly volatile parts of the LTS is to use a distributed database. In order not to introduce bottlenecks in the data flow, the different services will all act as direct clients of the LTS distributed database. The load of queries and updates on the LTS has to be balanced against the compute size of the platform hosting the LTS database. Typical large-scale distributed databases use techniques such as clustering and sharding specifically for this use case. These distributed database systems should provide APIs that allow native access to the LTS database.

 

 

 

FIGURE 8: LOCAL TELESCOPE STATE (LTS) ARCHITECTURAL FOCUS. THE VARIOUS FUNCTIONAL BLOCKS INTERACTING WITH THE LTS ARE SHOWN WITH UNIDIRECTIONAL OR BIDIRECTIONAL REQUESTS (PUSH/PULL REQUESTS). LMC INTERFACING FROM LTS TO GTM IS DONE VIA TANGO.


 

In the case of static data, the values from this data store will be wrapped into higher-level descriptions as described earlier, through a higher-level "getter" API. An overview of the data flow for LTS within SDP LMC is shown in Figure 8. The different blocks interacting with the LTS inside the SDP LMC system will connect directly to an API for the distributed database providing LTS services. In turn, the LTS store can be configured to push/pull from the General Telescope State. A TANGO interface will exist between LTS and GTS so that the GTS can notify LTS of updates, as well as carry any other control parameters required. It is possible, if required, to have other modules in the system hook up to the LTS manager to listen for particular notifications or control information, such as the current state of the LTS (e.g. whether it is available or not).

7.7 Local Sky Model (LSM) Manager 

The Local Sky Model (LSM) will constitute a collection of sources, such as point sources and Gaussian extended sources. Possibly other source types could be needed as well, most notably shapelets. Sources can and should be grouped into so-called patches. A patch is mainly used to model an A-team source in a detailed manner, such as with a group of 1000 or more point sources; alternatively, this can be done with a shapelet. A number of operations are envisaged to be performed on the LSM (a toy sketch follows the list):

● Creation of the LSM from the GSM given the field of view (FOV). The LSM should also contain the strong sources outside the FOV that could enter through the sidelobes.

● Addition of newly found sources/clean components. Each addition could have a fidelity metric associated with it.

● Tessellation of the sources into facets; this should be sufficiently smart that sources don't cross facets.

● Stepping through the facets in order of total flux, so that faceted calibration can process the facets in order of flux.

● Obtaining the strongest source from a facet (to be used as a directional calibrator).

● Getting all sources of a facet.

● Updating sources with improved flux and/or position.

● Merging the final sources into the GSM.
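
The toy sketch below illustrates a few of these operations (flux-ordered facet stepping and extraction of the strongest source); the data structures and the trivial facet assignment are illustrative assumptions, not the LSM design:

```python
from dataclasses import dataclass

@dataclass
class Source:
    name: str
    ra_deg: float
    dec_deg: float
    flux_jy: float
    kind: str = "point"   # "point", "gaussian", possibly "shapelet"
    patch: str = ""       # e.g. an A-team source modelled by many components

class LocalSkyModel:
    def __init__(self, sources):
        self.sources = list(sources)

    def facets_by_flux(self, facet_of):
        """Group sources into facets and yield facets in decreasing total
        flux; a real tessellation would ensure sources never cross facets."""
        facets = {}
        for s in self.sources:
            facets.setdefault(facet_of(s), []).append(s)
        yield from sorted(facets.items(),
                          key=lambda kv: -sum(s.flux_jy for s in kv[1]))

    @staticmethod
    def strongest(facet_sources):
        """Directional calibrator: the brightest source of a facet."""
        return max(facet_sources, key=lambda s: s.flux_jy)

lsm = LocalSkyModel([Source("a", 10.0, -30.0, 5.0),
                     Source("b", 11.0, -30.5, 1.2),
                     Source("c", 50.0, -45.0, 9.9)])
for fid, members in lsm.facets_by_flux(lambda s: int(s.ra_deg // 20)):
    print(fid, LocalSkyModel.strongest(members).name)
```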

It is as yet undecided whether the LSM will exist physically within SDP LMC, or whether it is a logical reference to a centralised LSM store owned by TM. In any case, the first and last operations above are the points of contact with the General Sky Model (GSM). Possibly, the source information is kept per frequency band. It is envisaged that the data I/O can occur in bursts of high cadence; the data processing will be parallel in frequency bands, and could be occurring at the same time. It would then be possible to store the solutions in an indexed fashion (e.g. by antenna, frequency, time) and, in turn, fetch them as such.

 

 


 

FIGURE 9: LOCAL SKY MODEL (LSM) ARCHITECTURAL FOCUS. THE VARIOUS FUNCTIONAL BLOCKS INTERACTING WITH THE LSM ARE SHOWN WITH UNIDIRECTIONAL OR BIDIRECTIONAL REQUESTS (PUSH/PULL REQUESTS). LMC INTERFACING FROM LSM TO GSM IS DONE VIA TANGO.

The architecture for the LSM is meant to deal with high-volume and high-cadence operations; for this reason, the LSM is implemented as a large-scale, highly available, distributed database. Queries to this database system will be done via an API to the LSM services. Only basic measures of monitoring and control over the LSM database subsystem are anticipated, in much the same way as is done for the LTS.

An overview of the data flow for LSM within SDP LMC is shown in Figure 9. The different blocks interacting with the LSM inside the SDP LMC system will connect directly to an API for the distributed database providing LSM services. In turn, the LSM store can be configured to push/pull from the General Sky Model. A TANGO interface will exist between LSM and GSM so that the GSM can notify LSM of updates, as well as carry any other control parameters required. It is possible, if required, to have other modules in the system hook up to the LSM manager to listen for particular notifications or control information, such as the current state of the LSM.

7.8 Fast Telescope State Producer 

The Fast Telescope State Producer is responsible for ensuring timely delivery of critical telescope state information from the Telescope Manager to the SDP Receiver. Such critical state information is needed at the point at which SPEAD data from the Central Signal Processor is received and decoded. If such data is not present at the time of receiving a particular correlator dump, the dump must either be buffered or discarded.

The ICD between SDP and TM defines those pieces of telescope state information that are deemed to be critical; in general these will include information needed to unambiguously assign the incoming data to a particular observation and to correctly interpret the axes of the data.
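
A minimal sketch of the buffer-or-discard behaviour at the receiver, assuming dumps and state records are matched by timestamp; the matching key, buffer depth and decode step are illustrative assumptions:

```python
import collections

class FastStateCache:
    def __init__(self, max_buffered=16):
        self.state = {}    # timestamp -> critical state record from TM
        self.pending = collections.deque(maxlen=max_buffered)

    def on_state(self, timestamp, record):
        """Called when the Fast Telescope State Producer delivers state."""
        self.state[timestamp] = record
        self._drain()

    def on_dump(self, timestamp, dump):
        """Called when a correlator dump arrives from CSP (e.g. via SPEAD)."""
        if timestamp in self.state:
            self._decode(dump, self.state[timestamp])
        else:
            # Buffer until the state arrives; the bounded deque drops the
            # oldest entry when full, matching "buffered or discarded".
            self.pending.append((timestamp, dump))

    def _drain(self):
        remaining = collections.deque(maxlen=self.pending.maxlen)
        for t, d in self.pending:
            if t in self.state:
                self._decode(d, self.state[t])
            else:
                remaining.append((t, d))
        self.pending = remaining

    def _decode(self, dump, state):
        print("decoding dump using", state)
```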

 

 


7.9 LMC Resource and Workflow Manager

The resources of the LMC hardware platform can technically be managed separately on a node-by-node basis, such that the job submission system as well as the administrators need to know how the cluster is configured. An abstraction over this is to aggregate all the available resources into a single entity, such that the LMC hardware platform is viewed as a single large node. Tools that do this implement a level of abstraction on top of operating systems, and can be viewed as a cluster operating system. It is suggested that an open-source toolkit be selected to manage the LMC platform. The LMC platform will require a level of redundancy and reliability, especially when a master node fails.

7.9.1 SDP Resource and Workflow Manager

Similarly to the above, it is envisaged that the SDP as a whole will be composed of a number of compute islands, forming the bulk of the SDP compute cluster. This cluster will require its resources to be managed; the selected resource manager must meet a number of strict criteria, namely:

● Scalability to tens of thousands of nodes

● Fault-tolerant replicated master and slave nodes

● Support for stacks like containers and other virtualisation technologies

● Isolation between the task being performed and the physical node executing the operation

● An API that allows interfacing from the LMC controller.

  

The internal interface between the SDP data processing platform and the LMC is responsible for the subscription of events, listening for alarms, reporting health and monitoring information, initiating and running pipelines, etc. The role of the LMC here is strictly that of an interface to monitor and control the processing platform at a very high level of abstraction. The complex platform management is handled by one of the highly specialised toolkits available for this task. It is suggested, again, to utilise an open-source framework that is designed to handle the data processing use cases of the SDP, e.g. OpenStack, Mesos, etc.

 

In general, such a framework running on top of the processing platform cluster nodes consists of two components: a scheduler that registers with the master nodes to be offered resources, and an executor process that is launched on slave nodes to run the framework's tasks. Much more is required, such as routing of data, software-defined networking, etc., and it is not suggested that this be directly controlled via TANGO.

 

The framework, or the unit within the framework for job scheduling and execution, will run applications and services across the shared pool of machines, and is responsible for keeping them running, forever if needed. When machines experience failure, the framework shall intelligently reschedule those jobs onto healthy machines.
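
The sketch below is a toy of this two-component pattern (resource offers consumed by a scheduler, with rescheduling on node failure); it mimics the Mesos-style offer model in miniature and is not the actual resource manager:

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    cpus: float
    mem_gb: float

@dataclass
class Offer:
    node: str
    cpus: float
    mem_gb: float

class OfferScheduler:
    """Scheduler half of the pattern; executors on the slave nodes would
    actually launch the placed tasks."""

    def __init__(self, tasks):
        self.queue = list(tasks)   # tasks awaiting placement
        self.placements = {}       # task name -> (node, task)

    def resource_offer(self, offer):
        """Called by the master with spare capacity on one node."""
        for task in list(self.queue):
            if task.cpus <= offer.cpus and task.mem_gb <= offer.mem_gb:
                self.queue.remove(task)
                self.placements[task.name] = (offer.node, task)
                offer.cpus -= task.cpus
                offer.mem_gb -= task.mem_gb

    def node_failed(self, node):
        """Requeue tasks from a failed node so the next offers pick them up."""
        for name, (where, task) in list(self.placements.items()):
            if where == node:
                del self.placements[name]
                self.queue.append(task)

sched = OfferScheduler([Task("ingest", 4, 16), Task("qa", 1, 2)])
sched.resource_offer(Offer("island-07", cpus=8, mem_gb=32))
sched.node_failed("island-07")   # both tasks go back onto the queue
```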

While the selected framework core focuses on distributing individual tasks across nodes in a cluster, typical services consist of dozens or hundreds of replicas of tasks. As a service scheduler, the framework will provide the abstraction of a "job" to bundle and manage these tasks/pipelines. Many key functionalities centre around a job, including: definition, the concepts of an instance and the server-set, deployment and scheduling, health checking, and introspection. It also allows cross-cutting concerns, such as observability and log collection, to be handled.

7.10 Workload Characterisation Framework 

The purpose of the Workload Characterisation Framework (WCF) is three-fold. The framework contains two main parts: the Workload Characterisation Manager (WCM) and the SDP Deployable Benchmark Suite (SDBS). The WCM communicates characterisation and performance information to the scheduling process, and provides query, visualisation and modelling functionality. The SDBS captures and replays workload characteristics, and allows micro-benchmarks to be executed in a reproducible manner. Figure 10 depicts the WCF and its sub-components.

 

 

FIGURE 10: WORKLOAD CHARACTERISATION FRAMEWORK 

The first purpose of the WCF is to provide the scheduler (or the "graph manager") with the essential workload characteristics for each Processing Component (i.e. an Application Drop), in order to make optimal (cost-efficient) scheduling decisions. The WCF aims to facilitate the mapping from the logical graph to the physical graph, thus determining which Processing Components are allocated to which compute islands, according to their workload characteristics. The workload characteristics are directly related to the Resource Usage (including compute, I/O, and energy consumption) of each Processing Component, and the estimates will be provided as a set of performance metrics with some basic statistics. For example, if a Processing Component exhibits a strongly random-writes-dominated I/O access pattern, it is desirable that the scheduler allocate such a Processing Component to an adequate SSD-backed storage I/O buffer. The WCF is designed with low system overheads, to provide information with minimal application and performance impact, including energy consumption, for continuous performance monitoring. Figure 5 in the SDP Execution Framework Design document (see RD04) depicts the sequence of interactions between TM, LMC and the WCF.

The WCF aims to provide a framework for developers and vendors alike in this sub-area, serving as a common, consistent platform for testing, optimisation and verification. The micro-benchmarks emulate specific, intrinsic workload characteristics of processing pipelines. Existing and custom tools can be executed by the WCF to model and replay I/O transactions and mimic behaviours (for example), without having to compile, deploy and execute the entire pipeline processing software stack on vendors' hardware platforms. This is an efficient method for both developers and vendors to evaluate, design, and compare hardware platforms in the relevant context of pipelines in a scalable, modular, consistent and reproducible manner.

Secondly, the WCF aims to assist SDP developers and deployers of Pipeline Processing Components to obtain a quantitative understanding of the compute and data access behaviours exhibited by various radio astronomy data processing pipelines, components and algorithms, especially in production environments. Developers can compare results in a reproducible manner across different hardware/software platforms, aiding them to make informed decisions about their implementations.

Finally, the WCF aims to enable SDP platform vendors to micro-benchmark their platforms and derive standardised performance statistics, which can be used to optimise and compare platforms that will be tendered during the procurement process.

7.10.1 WCF External Interfaces 

 

The WCF consists of four sets of interfaces: the component submission interface, the component profile query interface, the hardware benchmarking interface, and the management interface.

 

 

FIGURE 11: WCF EXTERNAL INTERFACES 


The component submission interface allows PIP or SDP component developers to submit Processing Components to the WCF, which then triggers the process of capturing and recording workload characteristics.

  

The high-level interface between PIP and WCF for the submission is defined as:

Submit Processing Component
INPUT from Component developer: Pipeline component location, identifier, option parameters and setup information.
OUTPUT from WCF: Acknowledgement of receipt and exception information (e.g. validation errors, component not found, component with the same identifier already exists, etc.).

Note that the component identifier must be produced consistently prior to the submission procedure.

  

Check profile status
INPUT from PIP: Component identifier.
OUTPUT from WCF: Status (e.g. Ready / Not Ready / Component not found).

  

Upon receiving a legitimate Processing Component, the WCF performs workload characterisation and stores the profiling information (e.g. traces or models), together with the corresponding hardware capability information, in the workload characteristics database.

  

The component profile query interface allows the SDP Execution Framework (in particular the Graph Manager product tree item) and other related SDP product tree items (e.g. PIP) to query the profiling information of a given component.

  

The high-level interface between the Graph Manager and WCF for querying the profile information is defined as:

Query profile
INPUT from Graph Manager: Processing Component identifier, and the level of granularity needed (e.g. summary, time series, time resolution, etc.).
OUTPUT from WCF: (1) Detailed workload characteristics, including usage statistics (CPU, accelerator, memory, I/O, FLOP, power), completion time, and bottleneck hints (I/O-bound, memory-bound, compute-bound, etc.); (2) information on the associated hardware capabilities on which the workload was executed and profiled.
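
An in-memory sketch of the submission, status and query interfaces defined above; the function names, return shapes and error handling are assumptions made for illustration, not the WCF API:

```python
_profiles = {}   # identifier -> profile record (toy stand-in for the WCF DB)

def submit_processing_component(identifier, location, options=None):
    """Component submission interface: triggers characterisation."""
    if identifier in _profiles:
        raise ValueError("component with the same identifier already exists")
    # The real WCF would now capture and record workload characteristics.
    _profiles[identifier] = {"status": "Not Ready",
                             "location": location,
                             "options": options or {}}
    return {"ack": True, "identifier": identifier}

def check_profile_status(identifier):
    entry = _profiles.get(identifier)
    return entry["status"] if entry else "Component not found"

def query_profile(identifier, granularity="summary"):
    """Component profile query interface (e.g. used by the Graph Manager)."""
    entry = _profiles.get(identifier)
    if entry is None or entry["status"] != "Ready":
        raise LookupError("profile not available: " + identifier)
    return {"granularity": granularity,
            "usage": entry["usage"],        # CPU/accelerator/memory/I/O/FLOP/power
            "bottlenecks": entry["bottlenecks"],
            "hardware": entry["hardware"]}  # platform the profile was taken on

submit_processing_component("grid-kernel-v2", "/components/grid-kernel-v2")
print(check_profile_status("grid-kernel-v2"))   # -> Not Ready
```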

  

  

The management interface allows the WCF admin to configure the WCF. The high-level interface between the admin and WCF is defined as:


INPUT from admin: CRUD of the metrics to be included in the profile report, and CRUD of the hardware capabilities with which components will be profiled.
OUTPUT from WCF: Acknowledgement of receipt and exception information.

The hardware benchmark interface allows hardware vendors to collect performance measurements on their platforms by running and profiling submitted Processing Components, in order to either optimise their platforms or compare different configurations of the same platform. Upon a specific benchmark request from a vendor, the interface produces a parameterised benchmark software package for the vendor, who will deploy the package on their vendor-specific hardware platform. The benchmark package in turn sets up a performance-capturing mechanism on the targeted platform, executes Input Traces, micro-benchmarks or Processing Components, and collects performance measurements that will be sent back to the WCF, to be displayed and visualised via the query profile interface.

7.11 Managing Science Events

The LMC function, 'Manage Science Events', is responsible for alerting TM to the following transient events:

a) detection of an imaging transient

b) detection of a single pulse transient

c) Pulsar Timing accuracy is achieved

7.12 Interfaces

7.12.1 External Interfaces

The Level 1 functional flow diagram of the SDP is shown in Figure 12. As depicted in Figure 12, the LMC interfaces with the Telescope Manager (TM). This interface is fully described by the SDP-TM Interface Control Document [AD04].


FIGURE 12: LEVEL 1 FUNCTIONAL FLOW DIAGRAM OF THE SDP 


 

7.12.2 Internal interfaces

Internally to the SDP, the LMC interfaces with the following products:

● Data Processor

● Preservation

● Delivery System

● Observatory Support Tools

Internal SDP-LMC interfaces are shown in Figure 13.


 

FIGURE 13: LMC-SDP INTERNAL INTERFACES

 

7.13 Implementation

In the overall SDP context the hardware demands of the LMC are not particularly onerous. The major areas of concern for the LMC platform are as follows:

● High Availability: Those functions that directly support the ability of TM to manage and monitor the health and integrity aspects of the SDP must have very high availability. This will include redundant hardware and lock-step slave software protection.


● PDG Generation: The generation and management of the PDG could become a significant challenge, as the number of graph nodes could extend into the millions. Fortunately the process of generation is amenable to parallelisation and thus can be tackled in a traditional HPC fashion; a small cluster is envisioned for this function. The final graph size is influenced directly by the likely axes of parallelism that get used. If we assume 100k channels and tens of facets, this already gives us one million nodes for a single step of the imager. Combined with multiple imaging steps, this number will grow substantially.

● Log handling: In such a large and distributed system, log handling can quickly grow into a major concern. This risk area will be covered through the use of distributed, hierarchical loggers that rate-limit upward events to prevent tree pollution (a minimal sketch of such rate limiting follows this list). As mentioned earlier, the fine-grained control over logging should allow most debugging events to generate only targeted logs rather than a system-wide storm of logging.

● Local telescope model: Some aspects of the LTM will require careful design, in particular aspects such as the Local Sky Model, which will need to field a large number of parallel reads and writes and must have sufficient performance to avoid becoming a bottleneck to any of the pipeline processing.

● Documentation: As the single point of contact for the SDP, it is important that the LMC document set is complete and accurate. This will be particularly key for startup and shutdown sequencing, which is likely to be significantly complex.
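
A minimal sketch of such rate limiting, using Python's standard logging module with an illustrative per-second budget:

```python
import logging
import time

class RateLimitFilter(logging.Filter):
    """Drop records beyond max_per_sec so one misbehaving node cannot
    flood the upward logging tree."""

    def __init__(self, max_per_sec=10):
        super().__init__()
        self.max_per_sec = max_per_sec
        self.window = 0
        self.count = 0

    def filter(self, record):
        now = int(time.time())
        if now != self.window:
            self.window, self.count = now, 0
        self.count += 1
        return self.count <= self.max_per_sec

log = logging.getLogger("sdp.lmc.node042")
log.addHandler(logging.StreamHandler())
log.addFilter(RateLimitFilter(max_per_sec=10))
for i in range(1000):
    log.warning("event %d", i)   # only the first 10 per second get through
```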

Apart from these areas, the remainder of the LMC is fairly straightforward and can be deployed into an environment that prioritises ease of deployment and management over outright performance.

 

As with other sub-elements of the SDP, the LMC will have a distinct instance per telescope (Mid, Low). This allows completely independent operation of each facility with no particular dependencies between them. Coordinated activities around maintenance of the common codebase will be undertaken.

7.14 Infrastructure  

At PDR stage, we do not specify any particular hardware platform in detail, but rather list any requirements on it that are deemed to be significant drivers of cost or complexity. It is highly likely that the LMC will be deployed in a distributed, componentised form using containerised virtualisation techniques such as Docker (https://www.docker.com/). This technique will carry significant deployment and testing benefits, as it allows containers to be easily migrated and used by a variety of team members in a variety of locations. Deployment of containers into testing environments, and the subsequent migration of these directly to production, will also significantly ease the testing-to-production loop. Hypervisor virtualisation is likely to be used in conjunction with containerisation techniques to enable the virtual lock-step slave takeover functions that are likely to be employed in the master controller high-availability solution. A number of excellent deployment management tools have emerged over the last few years, particularly in response to the rise of cloud and virtualised deployment. A tool such as Chef (https://www.chef.io/chef/) is well suited to our needs at exceedingly modest cost.

 


 

7.15 Logical Data Store

It is possible that SDP LMC will require the use of a logical volume that caters for storing LMC-related data. It is probable that this data store will be a distributed file system (or a partition of the full SDP storage resources), or a distributed database. However, it is important for this logical store to satisfy a number of criteria:

● The store has to be efficient at distributed access.

● The store has to be resilient to failure and allow for automatic recovery.

● The store has to allow for data of dynamic and undefined structure to be stored.

  

While it is not necessary to state, as yet, which particular solution is to be chosen for this purpose, it is advisable to have an open-source, off-the-shelf solution that provides programmatic (API-based) access to the operation of the logical data store. Valid examples of file-based stores are GlusterFS and Lustre; an example of a database-based store is MongoDB.
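
A minimal sketch against the database-based option, using MongoDB purely because it is the example named above; the host, database and document shapes are illustrative assumptions (resilience would come from server-side replication and sharding, not from this client code):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://lmc-store:27017")   # illustrative host
store = client["sdp_lmc"]["logical_store"]

# Documents of dynamic, undefined structure can be stored as-is ...
store.insert_one({"kind": "qa_snapshot",
                  "capability": "cap-42",
                  "metrics": {"phase_rms": 0.03}})

# ... and retrieved through the same programmatic (API-based) access.
doc = store.find_one({"kind": "qa_snapshot", "capability": "cap-42"})
print(doc["metrics"]["phase_rms"])
```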

7.16 Configuration Management

Configuration management is a systems engineering process for establishing and maintaining consistency of a cluster's setup, performance, and functional and physical attributes with its requirements, design and operational information throughout its life. The setup of the software platform for the SDP will require substantial configuration. It is anticipated, therefore, that the platform and services for the SDP LMC nodes, as well as the SDP cluster in general, will utilise a toolkit for configuration management to maintain this deployment. Again, it is advisable to have an open-source, off-the-shelf solution that provides programmatic (API-based) access to its operation. An example of such a toolkit is Ansible.

7.17 Software  

The bulk of the LMC software implementation is deemed to be relatively low risk. This is based both on the inherent difficulty as estimated during the costing phase, and on the basis that similar systems exist in the various precursor facilities and are currently in near or full operation.

Management of the data flow graphs and graceful error handling are deemed to be the highest-risk items, as they are both inherently complex topics and not fully addressed by any of the extant precursors.

 

For the lower-risk, non-performance-critical components, rapid development and the ability to leverage existing code (astronomical and other) become key.

Significant reuse of code from both TM and the precursors is envisaged. In particular, the master controller, logging framework, internal monitoring and some of the proposed GUIs can be used directly from the precursor facilities.

 


A decision has been taken by the local interface working group of the Telescope Manager consortium to standardise on the TANGO control framework for interoperation between the LMC components of the various elements.

Incorporation of this design decision into the SDP LMC design will be undertaken early in the post-PDR design period.

8 Prototyping Plan and Associated Risks  

TABLE 2: FUNCTION TO PRODUCT TRACEABILITY 

 

 

Table 2 indicates the LMC function-to-product-tree mapping. During the analysis of the product tree, the Master Controller (C.3.2.3) was the only product identified as high-risk, as it represents a single point of failure. This risk has been mitigated by completing the prototyping selection process for the Master Controller: TANGO was selected as the candidate solution for implementing the SDP-TM interface functionality of the Master Controller, and prototyping work in TANGO has commenced.

 

The functionality of the Master Controller that is internal to the SDP will be carried out with a message broker. Alternative control approaches, such as JSON-based messaging, will also be investigated to carry out the internal SDP control functionality. The LMC message broker will pass control information within the SDP, thus facilitating the internal LMC interfaces (refer to section 7.12.2). Products C.3.2.1 (QA Monitoring) and C.3.2.4 (Event Monitoring and Logging) will employ the same message broker. Candidate message brokers that were identified for prototyping are listed below and in the C.3.2.1 (QA Monitoring) section of the Product Tree analysis space in Confluence:

● ZeroMQ (TANGO’s messaging broker) 

● RabbitMQ 

● Kafka 

● ActiveMQ 

● Kestrel 

 

Prototyping work on the Workload Characterisation Framework (C.3.2.5) is on-going, and will remain so, as new hardware requires new methods to provide suitable characterisation metrics. Software components and operating system kernels (i.e. Linux) are also expected to continue evolving during the project's lifetime. To mitigate this risk, prototyping activities are on-going, in line with the project's requirements, and use standard OTS interfaces in a modular design.

The LMC hardware platform is not a high-risk product, as the physical architecture will comprise COTS (commercial, off-the-shelf) hardware.

9 Function to Requirements Traceability  

TABLE 3: FUNCTION TO L2 REQUIREMENTS TRACEABILITY 

 

Function | L2 Requirement | Name | Description | L1 Parent
F.4.1 Visualise QA Metrics | SDP_REQ-40 | Quality assessment interface | The SDP shall provide an interface to allow the quality of scientific data to be assessed interactively. | SKA1-SYS_REQ-2744
F.4.1 Visualise QA Metrics | SDP_REQ-732 | Science Product Catalogue QA | The SDP shall allow internal users access to query the Science Product Catalogue for the purposes of Quality Assessment. | SKA1-SYS_REQ-2744
F.4.1 Visualise QA Metrics | SDP_REQ-737 | Science product QA analysis tools | The SDP shall provide the capability, and allow observatory staff and other authorised persons, to analyse science products for the purpose of Quality Assessment. | SKA1-SYS_REQ-2744
F.4.2 Manage Local Telescope State | SDP_REQ-288 | Telescope Model | The SDP shall use a dynamic computational model of the Telescope to answer all queries about the state of the Telescope. The telescope model shall consist of configuration information, numerical models, empirical parameters and conventions. | SKA1-SYS_REQ-2645
F.4.2 Manage Local Telescope State | SDP_REQ-38 | Metadata input | The SDP shall query the Telescope Model for all static and dynamic configuration data that is required prior to and during the execution of a specific observation. | SKA1-SYS_REQ-2645
F.4.2 Manage Local Telescope State | SDP_REQ-725 | Telescope state information bandwidth | The SDP shall support receiving telescope state information from TM at 1 Gb/s (TBC) and send telescope state information to TM at 1 Gb/s (TBC). | SKA1-SYS_REQ-2645
F.4.2 Manage Local Telescope State | SDP_REQ-726 | Telescope state information data items | The SDP shall support at least 10 000 telescope state information data items per capability. | SKA1-SYS_REQ-2645
F.4.3 Aggregate QA Metrics | SDP_REQ-441 | Automated Quality Assessment | The SDP shall perform standardised, automated Quality Assessment of Images along the axes of astrometry, photometry, radiometry, polarimetry and spectrometry. | SKA1-SYS_REQ-2347
F.4.3 Aggregate QA Metrics | SDP_REQ-450 | SDP standard pipeline products | The SDP shall include as data products the pipeline processing log, and Quality Assessment log for all pipelines. | SKA1-SYS_REQ-2336
F.4.3 Aggregate QA Metrics | SDP_REQ-718 | QA metric aggregation | The SDP shall allow aggregation of at least 1% (TBC) of scientific data for QA visualisation purposes. | SKA1-SYS_REQ-2347
F.4.3 Aggregate QA Metrics | SDP_REQ-733 | Science Product QA | The SDP shall allow internal users access to request and receive Science Products for the purposes of Quality Assessment. | SKA1-SYS_REQ-2744
F.4.4 Manage Local Sky Model | SDP_REQ-720 | Update Global Sky Model | The SDP shall send updates for the Global Sky Model to the Telescope Manager once the execution of a Capability has completed. | SKA1-SYS_REQ-2322
F.4.4 Manage Local Sky Model | SDP_REQ-719 | Receive Local Sky Model from TM | The LSM shall be capable of storing various source types (point, gaussian). | SKA1-SYS_REQ-2322
F.4.5 Produce Fast Telescope State | SDP_REQ-750 | Critical Telescope State Information latency | The Fast Telescope State component shall immediately process received metadata changes from TM. | (not listed)
F.4.5 Produce Fast Telescope State | SDP_REQ-750 | Critical Telescope State Information latency | The Fast Telescope State component shall receive relevant metadata changes within 0.01 (TBC) seconds of receiving the CSP data. | (not listed)
F.4.5 Produce Fast Telescope State | SDP_REQ-751 | Critical Telescope State Information | Relevant metadata changes are: broken hardware (to flag data), pointing, TBD. | (not listed)
F.4.6 Master Control | SDP_REQ-29 | Targets of Opportunity | The SDP shall provide a priority override mechanism to support scheduling and execution of approved Target of Opportunity (ToO) events. | SKA1-SYS_REQ-2285
F.4.6 Master Control | SDP_REQ-3 | Observation Handling | The SDP shall support the concept of observations via the LMC. | SKA1-SYS_REQ-2133
F.4.6 Master Control | SDP_REQ-4 | Capability availability | The SDP shall, on request from the TM, provide availability information for the requested capability. | SKA1-SYS_REQ-2133
F.4.6 Master Control | SDP_REQ-5 | Observation scheduling | The SDP shall allow TM to schedule observation data taking and data processing, either together or separately. The resources required for this will be allocated and reserved by the SDP for this scheduled capability. | SKA1-SYS_REQ-2133
F.4.6 Master Control | SDP_REQ-7 | Observation Control | The SDP shall provide commands to control a previously scheduled observation. | SKA1-SYS_REQ-2133
F.4.6 Master Control | SDP_REQ-728 | Master controller failover latency | The SDP master controller shall support failover to an operational state within 1 s (TBC). | SKA1-SYS_REQ-2133
F.4.6 Master Control | SDP_REQ-33 | Flagging control | The SDP shall allow the TM to control internal flagging, including the definition of a pre-selected RFI mask, and other parameters as may be relevant to particular flagging strategies. | SKA1-SYS_REQ-2474
F.4.6 Master Control | SDP_REQ-34 | Self testing | The SDP shall provide an automated self-test function that will generate a report on the health of the SDP system as a whole, including malfunction and out-of-tolerance operation. | SKA1-SYS_REQ-2544
F.4.6 Master Control | SDP_REQ-35 | Alarms | The SDP shall support the generation of notification events, in the form of alarms, to the TM based on predefined conditions. | SKA1-SYS_REQ-2312
F.4.6 Master Control | SDP_REQ-37 | Alarm latency | The SDP shall issue an alarm no more than 1 second after detection of the relevant event. | SKA1-SYS_REQ-2312
F.4.6 Master Control | SDP_REQ-4 | Capability availability | The SDP shall, on request from the TM, provide availability information for the requested capability. | SKA1-SYS_REQ-2133
F.4.6 Master Control | SDP_REQ-50 | Invalid input handling | The SDP shall be robust against input errors including invalid state, improper authorisation and malformed or malicious commands. | SKA1-SYS_REQ-2431, SKA1-SYS_REQ-2482
F.4.6 Master Control | SDP_REQ-724 | Critical metadata arrival latency | The SDP shall receive critical metadata from TM within 1 second of TM receiving the critical metadata. | SKA1-SYS_REQ-2431
F.4.6 Master Control | SDP_REQ-727 | Monitoring and alert reporting latency | The SDP shall report monitoring and internal alerts to TM within 1 s. | SKA1-SYS_REQ-2431
F.4.6 Master Control | SDP_REQ-592 | Status Monitoring | The SDP shall query the status of science and calibration data streams. | SKA1-SYS_REQ-2280
F.4.6 Master Control | SDP_REQ-721 | Log level configuration | The SDP shall support configuration of internal log levels to support external drill-down. | (not listed)
F.4.6 Master Control | SDP_REQ-594 | Status Reporting | The SDP shall use internally received status information to report on technical performance. | SKA1-SYS_REQ-2280
F.4.6 Master Control | SDP_REQ-604 | Precursor integration | The SDP shall directly control any precursor infrastructure required for use by the SDP. | SKA1-SYS_REQ-2833
F.4.6 Master Control | SDP_REQ-25 | Independent Operations | The SDP control functions should be able to operate concurrently and with a performance which is independent of other SDP instances. | SKA1-SYS_REQ-2127
F.4.6 Master Control | SDP_REQ-30 | Graceful degradation | The SDP shall handle internal hardware and software failures in a graceful fashion. | SKA1-SYS_REQ-2716
F.4.6 Master Control | SDP_REQ-51 | Engineering Support | The SDP shall provide engineering support functions to the relevant Engineering Operations centres. | SKA1-SYS_REQ-2116, SKA1-SYS_REQ-2118
F.4.6 Master Control | SDP_REQ-52 | Failsafe | The SDP shall actively ensure that internal failures do not result in a hazardous situation to the systems and personnel with which it interfaces. | SKA1-SYS_REQ-2788
F.4.6 Master Control | SDP_REQ-11 | Frequency handling | The SDP shall support the transition from one frequency band of operation to another within 30 seconds (TBC - need feedback on allocation of parent time allowance; this is 30 s overall and the SDP's allocation of this time will depend on feedback from the System Review (Project Office)). | SKA1-SYS_REQ-2224
F.4.6 Master Control | SDP_REQ-27 | Observation control - switching | The SDP shall be able to switch between previously scheduled observations within 30 seconds (TBC - based on performance allocations of SKA-2133; this is 30 s overall and the SDP's allocation of this time will depend on feedback from the System Review (Project Office)). | SKA1-SYS_REQ-3021
F.4.6 Master Control | SDP_REQ-37 | Alarm latency | The SDP shall issue an alarm no more than 1 second after detection of the relevant event. | SKA1-SYS_REQ-2312
F.4.6 Master Control | SDP_REQ-725 | Telescope State Information bandwidth | The SDP shall support receiving telescope state information from TM at 1 Gb/s and send telescope state information to TM at 1 Gb/s per capability. | SKA1-SYS_REQ-2645
F.4.6 Master Control | SDP_REQ-726 | Telescope state information data items | The SDP shall support at least 10 000 telescope state information data items per capability. | SKA1-SYS_REQ-2645
F.4.6 Master Control | SDP_REQ-727 | Monitoring and alert reporting latency | The SDP shall report monitoring and internal alerts to TM within 1 s. | SKA1-SYS_REQ-2431
F.4.6 Master Control | SDP_REQ-730 | Processing requirement estimation accuracy | The SDP shall be able to estimate the processing requirements of a capability to be scheduled to better than 10%. | SKA1-SYS_REQ-2285
F.4.6 Master Control | SDP_REQ-594 | Status Reporting | Continuous performance monitoring: where possible, the system shall be designed to provide continuous performance monitoring. | SKA1-SYS_REQ-2546
F.4.7 Manage Science Events | SDP_REQ-731 | Science events | The SDP shall send events to the TM for the following activities: a) detection of an imaging transient; b) detection of a single pulse transient; c) Pulsar Timing accuracy is achieved. | (not listed)

                 

 

  
