ALMA MATER STUDIORUM ● UNIVERSITÀ DI BOLOGNA
Scuola di Ingegneria e Architettura
Dipartimento di Ingegneria dell’Energia Elettrica
e dell’Informazione “Guglielmo Marconi” - DEI
Corso di Laurea in Ingegneria Elettronica e Telecomunicazioni
Tesi di Laurea
in
Fisica Generale T-2
Testing Platform for a PCIe-based readout and
control board to interface with new-generation
detectors for the LHC upgrade
Anno Accademico 2016/2017
Sessione I
Relatore:
Chiar.mo Prof. Mauro Villa
Correlatore:
Chiar.mo Prof. Alessandro Gabrielli
Candidato:
Giulio Masinelli
“Cui dono lepidum novum libellum
arida modo pumice expolitum?
Corneli, tibi: namque tu solebas
meas esse aliquid putare nugas.”
Catullo, Carme I
Abstract
This thesis is devoted to the understanding and validation of a PCIe board, named
Pixel-ROD, designed by the electronics laboratory of the Istituto Nazionale di
Fisica Nucleare and by professors of the Department of Physics and Astronomy
to meet the needs of the ATLAS experiment at CERN, in Geneva.
The thesis work consisted in developing a testing platform to be used to verify the
correct functionality of the board and its ability to respond to repeated and
continuous stimuli, with particular emphasis on the memory subsystem and on the
PCIe interface. The board was designed as a replacement for the off-detector
electronics currently installed for the ATLAS experiment, made up of VME boards
known as Back of Crate (BOC) and Read Out Driver (ROD). The choice of the
PCIe standard follows the now well-established trend of employing FPGA boards
to speed up the real-time computation performed on PCs. These boards usually
consist of dedicated PCBs and need to be connected to the computer motherboard
through an adequate interface. Since this interface usually constitutes the
bottleneck of such systems, the latest-generation demo boards communicate
through the high-performance PCIe interface. Moreover, PCIe boards can be
directly connected to the motherboard of PCs dedicated to data acquisition, thus
allowing both a faster response (having direct access to the main resources of the
host machine) and an easier installation.
This thesis provides a brief overview of the environment for which the board was
designed. In particular, after this introduction, the first chapter presents the
ATLAS experiment, focusing on the detectors closest to the interaction point,
where the collisions between the proton beams of the accelerator take place. The
second chapter describes the current off-detector electronics as well as its main
limitations. The third chapter outlines the PCIe standard, focusing on the aspects
that proved most important during the testing phase. The fourth chapter provides
the main motivations behind the design choices, such as the decision to build the
board as the union of two demo boards, adopting the well-proven Master-Slave
architecture of the BOC-ROD pair. Finally, the fifth chapter describes the testing
platform that was developed and all the tests the board underwent.
The work was mainly focused on the development of an intensive test involving
the memories installed on the board, and on the creation of a testing platform
providing the hardware needed to verify the possibility of performing data
transfers between the host PC and the on-board memories through the PCIe
interface.
Index
1 The LHC accelerator and the ATLAS experiment ........................................ 3
1.1 The Large Hadron Collider ....................................................................................3
1.1.1 Machine Parameters ......................................................................................4
1.1.2 Main experiments at LHC .............................................................................5
1.2 The ATLAS detector .............................................................................................5
1.2.1 The coordinate system of ATLAS .................................................................5
1.3 The Layout of the ATLAS detector ........................................................................6
1.3.1 Inner Detector ...............................................................................................7
1.3.2 Semiconductor Tracker .................................................................................9
1.3.3 Pixel Detector ...............................................................................................9
1.4 Structure of Pixel Detector ................................................................................... 10
1.4.1 IBL ............................................................................................................. 10
1.4.2 Sensors for IBL ........................................................................................... 11
1.4.3 FE-I4 .......................................................................................................... 12
2 Current off detector electronics for IBL ...................................................... 15
2.1 IBL electronics .................................................................................................... 15
2.2 IBL BOC ............................................................................................................. 17
2.2.1 BOC Control FPGA .................................................................................... 18
2.2.2 BOC Main FPGA........................................................................................ 18
2.3 IBL ROD............................................................................................................. 19
2.3.1 ROD Master................................................................................................ 19
2.3.2 ROD Slaves ................................................................................................ 19
2.4 TIM ..................................................................................................................... 19
2.5 System limitations ............................................................................................... 20
3 PCIe specifications and usage model ............................................................ 21
3.1 Architecture of a PCIe system .............................................................................. 21
3.2 Interconnection .................................................................................................... 22
3.3 Topology ............................................................................................................. 24
3.4 Electrical specifications and I/O Lines ................................................................. 26
3.5 PCIe Address Space ............................................................................................ 27
3.6 Device Tree ......................................................................................................... 31
3.6.1 Linux PCIe Subsystem ................................................................................ 33
3.7 The three Layers of the protocol .......................................................................... 34
3.7.1 Transaction Layer ....................................................................................... 35
3.7.2 Data Link Layer .......................................................................................... 35
3.7.3 Physical Layer ............................................................................................ 37
4 Pixel-ROD board ........................................................................................... 39
4.1 Xilinx KC705 ...................................................................................................... 39
4.1.1 Why PCIe ................................................................................................... 40
4.1.2 Kintex-7 FPGA ........................................................................................... 41
4.2 Xilinx ZC702 ...................................................................................................... 41
4.3 The Pixel-ROD board .......................................................................................... 42
4.3.1 Space constraints ........................................................................................ 43
5 Pixel-ROD test results ................................................................................... 47
5.1 Power supply ....................................................................................................... 47
5.2 Board interfaces and memory .............................................................................. 48
5.2.1 Vivado Design Suite ................................................................................... 48
5.3 Memory test ........................................................................................................ 48
5.3.1 Vivado IP Integrator and AXI4 Interface ..................................................... 49
5.3.2 Architecture of the memory test .................................................................. 54
5.4 Hardware Testing Platform .................................................................................. 59
5.5 PCIe interface test ............................................................................................... 61
5.5.1 Architecture of the PCIe interface test ......................................................... 62
Conclusions ...................................................................................................... 67
Bibliography ..................................................................................................... 69
Introduction
This thesis concerns the understanding and validation of a PCIe board, named
Pixel-ROD, developed by the Electronic Design Laboratory of INFN (Istituto
Nazionale di Fisica Nucleare) and by professors of the DIFA (Dipartimento di Fisica
e Astronomia) to meet the needs of the ATLAS experiment at CERN, in Geneva.
The thesis work consisted in the development of a test bench used to verify the
board's full functionality and its response to stressful stimuli, with emphasis on the
memory subsystem and the PCIe interface. The board was designed as an upgrade
for the current off-detector electronics at ATLAS, replacing the previous
series of readout boards, which are mainly made up of VME boards known as Back
of Crate (BOC) and Read Out Driver (ROD). The choice of a PCIe card follows the
growing trend of exploiting FPGA boards to speed up real-time calculations
performed on a PC. Such boards are usually mounted on an external PCB, so they
must be connected to the motherboard of the host computer via an appropriate
interface. Since this interface is often the bottleneck of such systems, new-generation
FPGA evaluation boards communicate via PCIe. But the advantages of PCIe go
further than this: PCIe boards can be directly plugged into the motherboard of the
ATLAS TDAQ PCs, providing a faster response (through direct access to the main
resources of the PCs) and an easier installation.
The thesis is intended to provide a brief overview of the environment the board was
developed to be installed in. After this Introduction, Chapter 1 summarizes the
ATLAS experiment, focusing on the detectors closest to the interaction point.
Chapter 2 describes the off-detector electronics the board is meant to replace,
concentrating on their limitations. Chapter 3 outlines the PCIe protocol, focusing
on the aspects crucial for the board validation. Chapter 4 illustrates how the board
was developed as a merging of two demo boards, in order to exploit the Master-Slave
architecture of the BOC-ROD pair as well as to narrow down the possible sources
of mistakes. Finally, Chapter 5 describes the testing platform and the tests that have
been carried out.
The work of this thesis was mainly focused on thoroughly testing the board's memory
subsystem, in order to validate the device response to stressful stimuli, and on
developing a testing platform providing the hardware needed to fully verify the ability
to perform PCIe data transactions to a host PC, together with the related performance.
Chapter 1
The LHC accelerator and the ATLAS experiment
In this chapter, a brief overview of the LHC accelerator complex and of the ATLAS
experiment is presented. More details are provided for the ATLAS pixel detector, for which
the readout board described in the next chapters has been designed.
1.1 The Large Hadron Collider
The Large Hadron Collider (LHC) is the largest and most powerful particle accelerator on
Earth. It is located in the circular tunnel which housed the LEP (Large Electron-Positron
collider) in Geneva at the border between France and Switzerland and it is managed by the
European Organization for Nuclear Research, also known as CERN (Conseil Européen
pour la Recherche Nucléaire). CERN is a collaboration between 20 European member
states and non-member states from the rest of the world. The accelerator is approximately
27 km in circumference and lies 100 m below the ground [1]. There are four interaction
points where protons or lead ions are forced to collide at high energies (see figure 1.1). At
the four interaction points, gigantic experiments (ALICE, ATLAS, CMS and LHCb) are
set up to record every detail of the particle collisions. The acceleration chain consists of
several steps (see figure 1.2). Indeed, protons are not directly inserted into the beam pipe
of the main ring, but they begin their acceleration in the linear accelerator LINAC 4, where
they are accelerated from rest to an energy of 50 MeV. After being sent to the Booster to
be accelerated to 1.4 GeV, they are injected into the Proton Synchrotron to be accelerated
to 25 GeV. The last acceleration before the Large Hadron Collider is provided by the Super
Proton Synchrotron where they are accelerated to 450 GeV. Finally, in the LHC, protons
are accelerated from 450 GeV to 6.5 TeV.
Figure 1.1: LHC overview.
1.1.1 Machine Parameters
The nominal maximum collision energy for protons in the LHC is 14 TeV; however, the
accelerator is currently running at a collision energy of 13 TeV, i.e. 6.5 TeV per proton beam. At
this level of energy, protons move with a speed very close to the speed of light in vacuum.
The proton beams consist of 2808 bunches of protons. Each bunch contains about 10^11
particles, so that many proton collisions can happen at each bunch crossing. The
protons are held in the accelerator ring by 1232 superconducting dipole magnets that create
a maximum magnet field of 8.3 T. Along the beam line, 392 quadrupole magnets are used
to focus the particles at the interaction points and defocus them afterwards. The Large Hadron
Collider is built to have a peak instantaneous luminosity of L = 10^34 cm^-2 s^-1 (the ratio
of the number of events detected in a certain time to the interaction cross-section).
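These machine parameters can be put together in a quick numerical sanity check: the Lorentz factor of a 6.5 TeV proton, its speed relative to c, and the interaction rate implied by the quoted luminosity. This is an illustrative sketch only; the proton rest energy and the ~80 mb inelastic cross-section are assumed values, not figures taken from the text.

```python
import math

M_P_GEV = 0.938272   # proton rest energy in GeV (assumed value, not from the text)
E_BEAM_GEV = 6500.0  # 6.5 TeV per beam, as quoted

def lorentz_gamma(energy_gev, mass_gev=M_P_GEV):
    """Relativistic gamma factor of a proton with the given total energy."""
    return energy_gev / mass_gev

def beta(energy_gev, mass_gev=M_P_GEV):
    """Proton speed as a fraction of the speed of light in vacuum."""
    g = lorentz_gamma(energy_gev, mass_gev)
    return math.sqrt(1.0 - 1.0 / g ** 2)

def event_rate_hz(luminosity_cm2_s, cross_section_mb):
    """Interaction rate R = L * sigma, with 1 mb = 1e-27 cm^2."""
    return luminosity_cm2_s * cross_section_mb * 1e-27

print(f"gamma = {lorentz_gamma(E_BEAM_GEV):.0f}")   # ~6900: ultra-relativistic
print(f"beta  = {beta(E_BEAM_GEV):.10f}")           # extremely close to 1
print(f"rate  = {event_rate_hz(1e34, 80.0):.1e} interactions/s")
```

With an assumed ~80 mb cross-section, the design luminosity corresponds to roughly 10^9 interactions per second, which is the scale that motivates the high-bandwidth readout electronics discussed in the following chapters.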
Figure 1.2: the accelerator complex at CERN: LHC protons start from LINAC4 and are
accelerated by the Booster, the Proton Synchrotron and the Super Proton Synchrotron
before entering the LHC.
1.1.2 Main experiments at LHC
As already stated, at the four interaction points, where protons or lead ions are collided,
there are four detectors (fig. 1.1):
• ATLAS, A Toroidal LHC ApparatuS: being one of the two general-purpose
detectors at LHC, it was designed for the search for the Higgs Boson and it
investigates a wide range of physics problems, ranging from searches concerning
Beyond Standard Model physics (Dark Matter, Supersymmetry) to precision
measurements of the Standard Model parameters.
• CMS, Compact Muon Solenoid: the second general-purpose detector at LHC. It
shares the scientific goals of the ATLAS experiment, although it uses different
technical solutions and a different magnet-system design.
• LHCb, Large Hadron Collider beauty: specialized in investigating the
differences between matter and antimatter by studying particles containing the beauty quark.
• ALICE, A Large Ion Collider Experiment: a heavy-ion collision detector. It
studies the properties of strongly interacting matter at extreme energy densities,
where matter forms a new phase called the Quark-Gluon Plasma.
1.2 The ATLAS detector
The ATLAS experiment is a general-purpose particle detector installed at the LHC which
is used to study various kinds of physical phenomena. It is 46 m long, 25 m high, 25 m
wide and weighs 7000 tons [2]. It is operated by an international collaboration with
thousands of scientists from all over the world: more than 3000 scientists from 174
institutes in 38 countries work on the ATLAS experiment. Like the CMS and ALICE
detectors, ATLAS has a cylindrical symmetry.
1.2.1 The coordinate system of ATLAS
To properly and precisely describe the ATLAS detector, a coordinate system is introduced.
In this reference frame, collision events are described using right-handed spherical
coordinates: the origin is set at the nominal interaction point, the z-axis lies along the
beam direction and the x-y plane is transverse to it, with the positive x-axis pointing to
the center of the LHC ring and the positive y-axis pointing upwards (see figure 1.3) [3].
A vector can therefore be described using the azimuthal angle φ, the polar angle θ and
the radius r, as shown in figure 1.3. Instead of θ, however, it is usual to use the
pseudorapidity η, defined as η = −ln tan(θ/2). Its value ranges from −∞, corresponding
to a vector along the negative semi-axis of z, to +∞, corresponding to a vector along the
positive semi-axis of z; the value η = 0 thus corresponds to a vector in the x-y plane.
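The relation between θ and η can be sketched numerically. This is a minimal illustration of the definition given above, not material from the thesis:

```python
import math

def pseudorapidity(theta_rad):
    """eta = -ln(tan(theta/2)), theta being the polar angle in radians."""
    return -math.log(math.tan(theta_rad / 2.0))

def theta_from_eta(eta):
    """Inverse relation: theta = 2 * atan(exp(-eta))."""
    return 2.0 * math.atan(math.exp(-eta))

# A vector in the x-y plane (theta = 90 degrees) has eta = 0, as stated above
print(pseudorapidity(math.pi / 2.0))       # ~0 (up to floating-point rounding)

# |eta| = 2.5, the Inner Detector coverage, is only ~9.4 degrees from the beam axis
print(math.degrees(theta_from_eta(2.5)))
```

The second print shows why pseudorapidity is a convenient coordinate: equal steps in η pack progressively more solid angle toward the beam line, matching how collision products are distributed.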
1.3 The Layout of the ATLAS detector
Figure 1.4: ATLAS detector overview
Figure 1.3: LHC's ATLAS well and its coordinate system.
The structure of the ATLAS detector is illustrated in a cutaway view in Figure 1.4. It is built
with a cylindrical symmetry around the interaction point and geometrically divided into a
barrel region (low η region), two endcap regions (medium η region) and two forward
regions (high η region). The full detector is made up of several groups of sub-detectors,
designed to identify and record the particles coming out of the proton-proton collisions.
From inwards to outwards, these sub-detectors form three systems: the Inner Detector
(ID), the Calorimeters and the Muon Spectrometer. The Inner Detector is surrounded by a
central solenoid that provides a 2 T magnetic field, while the barrel and end-cap sections of
the Muon Spectrometer are surrounded by toroidal magnets that provide fields of 0.5 T and
1 T, respectively. Particles produced in the proton-proton collisions first arrive in the
Inner Detector, which covers the region |η| < 2.5. Charged particles interact with the
different layers of the detector, depositing part of their energy there. This energy is
converted into electronic signals, and the resulting hits are used to reconstruct the particle
trajectory. The momenta and charge of these charged particles can be measured from the
bending of their trajectories inside the 2 T magnetic field provided by the central solenoid.
As the innermost system of ATLAS, the ID provides essential information, such as the
reconstruction of the primary and secondary vertices from which charged particles are seen to emerge.
The ID is therefore designed to have a high granularity and a high momentum measurement
resolution. To meet the performance requirement, semiconductor detectors are used for
precise measurement close to the beam (the Pixel Detector and the Semiconductor Tracker)
and a noble gas detector is used in the outer layer (the Transition-Radiation Tracker).
Further away from the collision point, the Calorimeters can be found. They are composed
of the electromagnetic calorimeters and the hadronic calorimeters, which are designed to
identify electrons/photons or hadrons, respectively, and to measure their energy and coordinates.
The position information is obtained by segmenting the calorimeters longitudinally and
laterally. The calorimeters will not stop muons as they interact very little with the
calorimeter absorber. Therefore, muons will pass through the full detector and arrive in the
outermost layers of the ATLAS detector: the muon spectrometer. Figure 1.5 shows the
detector response to different particles, using a schematic transverse section view of the
ATLAS detector.
1.3.1 Inner Detector
As stated, the Inner Detector (ID) is placed closest to the beam line; therefore, its design
must ensure excellent radiation hardness and long-term stability in addition to adequate
performance. As shown in figure 1.6, the full ID is a cylinder 6.2 m long, with a diameter
of 2.1 m and a coverage of |η| < 2.5.
Figure 1.5: section view of ATLAS detector in the transverse plane, illustrating
layers’ positioning.
Figure 1.6: a section view of the ATLAS Inner Detector
The ID is segmented into cylindrical structures in the barrel region, while it has coaxial
sensor disks in the end-cap regions. As shown in figure 1.7, the ID in the barrel region is
made up of three main layers. In the following paragraphs, the two innermost ones are
presented, from the most external to the most internal one.
Figure 1.7: structure and arrangement of the layers of Inner Detector in the
barrel region.
1.3.2 Semiconductor Tracker
The SemiConductor Tracker (SCT) is a tracker made up of silicon strips, with a technology
similar to the one employed in the Silicon Pixel Detector. Each SCT layer has two sets of
strips glued back-to-back with a 40 mrad angle in between, to measure both the lateral
and the longitudinal coordinate. The choice of a silicon strip detector is mainly motivated
by two facts: the large area to be covered (about 63 m^2) and the small particle occupancy.
In the SCT region, fewer than one track per event crosses each SCT chip, allowing a
considerable reduction in the number of instrumented channels while keeping spatial
accuracy and noise levels within the design limits.
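The small-angle stereo arrangement can be illustrated with a back-of-the-envelope resolution estimate: the 40 mrad angle is from the text, while the 80 µm strip pitch below is an assumed typical value, not a figure quoted in the thesis.

```python
import math

STEREO_ANGLE_RAD = 0.040  # 40 mrad stereo angle, as quoted in the text
STRIP_PITCH_UM = 80.0     # assumed typical SCT strip pitch (not in the text)

def binary_resolution_um(pitch_um):
    """Single-layer resolution for binary strip readout: pitch / sqrt(12)."""
    return pitch_um / math.sqrt(12.0)

def z_resolution_um(pitch_um, stereo_rad):
    """Longitudinal resolution of a small-angle stereo pair: the in-plane
    resolution divided by sin(stereo angle). A small angle trades z
    precision for unambiguous pairing of the two strip measurements."""
    return binary_resolution_um(pitch_um) / math.sin(stereo_rad)

print(f"in-plane resolution: {binary_resolution_um(STRIP_PITCH_UM):.0f} um")
print(f"z resolution       : {z_resolution_um(STRIP_PITCH_UM, STEREO_ANGLE_RAD):.0f} um")
```

Under these assumptions the stereo pair measures the in-plane coordinate to a few tens of micrometers while still giving a useful (hundreds of micrometers) measurement along z from a single detector technology.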
1.3.3 Pixel Detector
The innermost and most important detector is the Pixel Detector. It is designed to have
the finest granularity of all the sub-detectors; indeed, being very close to the interaction
region, it faces the highest track density. The system consists of four cylindrical layers,
named as follows, going outwards: Insertable B-Layer (IBL), B-Layer (L0), Layer 1 (L1)
and Layer 2 (L2).
1.4 Structure of Pixel Detector
As stated, the current configuration of the Pixel Detector consists of four layers, as shown
in Figure 1.7. Together, the L0, L1 and L2 layers are composed of 112 long staves, each
made of 13 modules tilted along the z axis by 1.1 degrees toward the interaction point
(Figure 1.8b); furthermore, to allow overlapping, the staves are tilted by 20 degrees in the
x-y plane (see Figure 1.8a).
As regards the sensors, 16 Front End (FE-I3) chips, a flex-hybrid, a Module Controller Chip
(MCC) and a pigtail together form what is called a module. The FE-I3s are responsible for
reading the charge signal from the pixels. Each FE-I3 is 195 µm thick, with a top surface of
1.09 cm by 0.74 cm, and counts 3.5 million transistors in 250 nm CMOS technology. They
are bump-bonded onto the sensors (Figure 1.9) and each one has an analog amplifier able
to discriminate signals of 5000 electrons with a noise threshold of 200 electrons. The
module collects the signals from the sensors and packs them into a single data event which is
sent to the ROD board.
1.4.1 IBL
Figure 1.8: staves disposition around the beam pipe (a), and modules layout inside (b).
The Insertable B-Layer is a pixel detector inserted, together with a narrower beam-pipe,
inside the B-Layer. Being very close to the interaction point imposes constraints that are
not needed in the other layers: the electronics must be much more radiation hard and the
sensitive area needs to cover more than the 70% of the surface achieved in the B-Layer.
To achieve those objectives a new front-end readout chip, called FE-I4, was developed,
leading to an active area of 90% [4].
1.4.2 Sensors for IBL
The IBL's modules and sensors differ from the original ATLAS ones because of the
technology chosen for the pixels. There were two main candidates:
• planar;
• 3D.
The main characteristics of these two technologies are explained hereafter, as well as the
upgrade from the FE-I3 chip to the FE-I4.
Figure 1.9: silicon sensor and read-out chip (FE-I3) bump-bonded.
Figure 1.10: cross-sectional schematic of the n-in-p planar pixel sensor.
Planar sensors were used within the B-Layer too, but the requirements on the IBL ones
were much stricter: the inactive border had to shrink from the 1 mm of the old sensors to
450 µm. Several studies have been performed since the B-Layer pixels were produced,
and it is now clear that a thinner irradiated sensor collects more charge. One of the
adopted variants of planar sensors is the so-called conservative n-in-n design. It uses
sensors which are already known to work, while trying to fulfill all the requirements for
the IBL, such as the 450 µm inactive-edge limit. Moreover, the pixel length in z has been
reduced to match the new 250 µm pixel cell length of the FE-I4 [4] (Figure 1.10).
On the other hand, the geometry of 3-D sensors is completely different from the planar one
(see Figure 1.11). Their wafers take advantage of recent advances in silicon technology
that produce column-like electrodes penetrating the substrate instead of being implanted
on the surface as in the planar technology [5]. Indeed, they are built using plasma
micro-machining to etch deep narrow apertures in the silicon substrate, which form the
electrodes of the junctions [6]. Since the collected charge is low, it needs to be read from
two electrodes at once. Another downside of these sensors is that the noise increases with
the number of electrodes and is even affected by their diameter.
Nevertheless, the active area of full 3-D sensors extends much closer to the surface,
reducing the non-sensitive volume. The faces of 3-D sensors, independently of the type,
are much closer to one another, allowing a much lower bias voltage (150 V versus the
1000 V of a planar sensor). This also leads to a lower leakage current and thus less
cooling. When a particle passes through the electrode area, the efficiency is diminished by
3.3%. This effect only concerns perpendicular particles and thus will not affect the IBL,
since its sensors are tilted by 20°.
1.4.3 FE-I4
Figure 1.11: two types of 3-D sensors, double sided (a) and full 3-D (b).
FE-I4 (Figure 1.12) is the new ATLAS pixel chip developed to be used in
upgraded-luminosity environments, in the framework of the Insertable B-Layer (IBL)
project but also for the outer pixel layers of the Super-LHC. FE-I4 is developed in a
130 nm CMOS process, in an 8-metal option with 2 thick aluminum top layers for
enhanced power routing. Care has been taken to separate the analog and digital power
nets. Thanks to the reduced gate-oxide thickness, the 130 nm CMOS process shows an
increased radiation tolerance with respect to previous, less scaled processes. The reasons
behind the redesign of the pixel front-end FE-I3 came from several aspects related to
system issues and to the physics performance of the pixel detector. With the smaller
innermost-layer radius of the IBL project and an increased luminosity, the hit rate
increases to levels which the FE-I3 architecture is not capable of handling, while the
FE-I4 can sustain an average hit rate of 400 MHz/cm^2 with less than 1% data loss. It
was shown that the FE-I3 column-drain architecture scales badly with high hit rates and
increased front-end area, leading to unacceptable inefficiencies for the IBL (see Figure
1.13). To avoid that, the FE-I4 was designed to store hits locally, getting rid of the
column-drain based transfer.
The FE-I4 pixel size is also reduced, from 50 x 400 µm^2 to 50 x 250 µm^2, which
reduces the pixel cross-section and enhances the single-point resolution in the z direction.
FE-I4 is built up from an array of 80 by 336 pixels, each pixel being subdivided into an
analog and a digital section. The total FE-I4 active size is 20 mm (z direction) by 16.8 mm
(φ direction), with about 2 mm more foreseen for the periphery, leading to an active area
close to 90% of the total. The front-end is now a standalone unit, avoiding the extra
routing needed for a Module Controller Chip for communication and data output.
Communication and output blocks are included in the periphery of the front-end. Going to
a larger front-end size is beneficial with respect to the active-over-total area ratio, as well
as for the building of modules and staves. This leads to more integrated stave and barrel
concepts, and therefore reduces the amount of material needed per detector layer. This
material reduction provides a better overall tracking measurement, since the probability
of small or large scatterings of particles inside the tracker itself is reduced. One of the
main advantages of having a large front-end is also the cost reduction.
Figure 1.12: FE-I4 layout.
Figure 1.13: performance of the column-drain readout architecture [7].
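The quoted pixel-array size and pitch can be cross-checked against the stated active dimensions and the "close to 90%" active fraction. A minimal sketch, assuming the ~2 mm periphery extends the chip along the φ side (an assumption: the text does not specify on which side the periphery sits):

```python
# Figures quoted in the text: 80 x 336 pixel array, 250 x 50 um pixels,
# active size 20 mm (z) by 16.8 mm (phi), ~2 mm of periphery.
PIXEL_Z_MM = 0.250
PIXEL_PHI_MM = 0.050
N_COLS, N_ROWS = 80, 336

def active_size_mm():
    """Active area dimensions implied by the pixel matrix."""
    return N_COLS * PIXEL_Z_MM, N_ROWS * PIXEL_PHI_MM

def active_fraction(periphery_mm=2.0):
    """Active-over-total area ratio; the periphery is assumed to extend
    the chip along the phi side (the text does not specify the side)."""
    z_mm, phi_mm = active_size_mm()
    return (z_mm * phi_mm) / (z_mm * (phi_mm + periphery_mm))

z_mm, phi_mm = active_size_mm()
print(f"active size: {z_mm:.1f} x {phi_mm:.1f} mm")  # 20.0 x 16.8 mm, matching the text
print(f"active fraction: {active_fraction():.1%}")   # ~89%, i.e. "close to 90%"
```

The pixel matrix (80 x 250 µm, 336 x 50 µm) reproduces the stated 20 mm x 16.8 mm active area exactly, and the 2 mm periphery gives an active fraction of about 89%, consistent with the "close to 90%" figure in the text.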
Chapter 2
Current off detector electronics for IBL
Hereafter, the current set-up of the off-detector electronics for the Insertable B-Layer is
presented, in order to describe the environment in which the new Pixel-ROD board was
conceived and to understand the requirements that it needs to fulfill.
High-energy physics experiments usually distinguish between on-detector and off-detector
electronics referring to the front-end electronics implemented near the detector itself and
to the readout system that can be implemented far from the detector. While in the first
scenario radiation resistance is a fundamental requirement, in the second one it is less
compelling, allowing the employment of more powerful devices.
2.1 IBL electronics
The IBL readout requires an appropriate off-detector system that is schematically shown in
Figure 2.1.
Figure 2.1: schematic block of the IBL readout system
This readout system is made up of several components:
• Back of Crate (BOC) board:
• Optical modules to interface the FE-I4 chips with the BOC board;
• S-Link modules for sending data from the BOC board to the ATLAS TDAQ
system.
• Read Out Driver (ROD) board:
• Gigabit Ethernet to send front-end calibration histograms.
• VME Crate;
• TTC Interface Module (TIM);
• Single Board Computer (SBC).
FE-I4 data are received by the BOC board via the RX optical modules, then 8b/10b
decoding is performed before passing the data to the ROD, which processes them. During
physics runs, the events to be sent to the ATLAS TDAQ (Trigger and Data Acquisition)
system are sent back to the BOC, where 4 S-Link modules are implemented. S-Link
stands for Simple LINK and is a simple interconnection protocol that also implements
error reporting and test functions. Each BOC-ROD pair can interface and route data
coming from 16 IBL modules (32 FE-I4 chips, for a total input bandwidth of 5.12 Gb/s).
The whole IBL readout requires 15 BOC-ROD pairs, which can all be placed in a single
VME crate: one BOC-ROD pair for each of the 14 staves of the IBL detector (each stave
counting 32 FE-I4s), plus one pair to serve the diamond beam monitor detector [8].
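The quoted 5.12 Gb/s figure follows directly from the link count and line rate. A small sketch, where the 80% payload factor of 8b/10b encoding (10 line bits carrying 8 data bits) is a property of the encoding and not a figure taken from the text:

```python
def input_bandwidth_gbps(n_links=32, line_rate_mbps=160):
    """Aggregate raw input bandwidth of one BOC-ROD pair:
    32 FE-I4 links at 160 Mb/s each, as quoted in the text."""
    return n_links * line_rate_mbps / 1000.0

def payload_bandwidth_gbps(n_links=32, line_rate_mbps=160):
    """Useful payload after 8b/10b decoding: every 10 line bits carry
    8 data bits (a property of the encoding, not a figure from the text)."""
    return input_bandwidth_gbps(n_links, line_rate_mbps) * 8 / 10

print(input_bandwidth_gbps())    # 5.12 Gb/s, as quoted in the text
print(payload_bandwidth_gbps())  # ~4.1 Gb/s of decoded data
```

This also makes clear why the 8b/10b decoding is done on the BOC: the ROD only ever sees the ~4.1 Gb/s of useful data, not the raw line rate.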
Figure 2.2: visual layout of the data acquisition system. In red the normal data
path, in blue the deviation for the generation of the histogram
Figure 2.2 illustrates the data path: the 32 front-end FE-I4 chips drive 32 serial lines, each
supporting 160 Mb/s, connected to the BOC board via optical links. Here the signal from
each line is converted from optical to electrical, then demultiplexed onto a 12-bit-wide bus,
which proceeds towards the ROD board through the VME backplane connector.
After that, in order to build the data frame that has to be sent to the TDAQ computers, the
ROD board begins the data formatting. After being transmitted to the ROD board, data can
take two different paths, as shown in Figure 2.2.
In the first one, ROD data are sent back to the BOC, where four S-Link modules forward
the data towards the ATLAS TDAQ PCs, reaching a total output bandwidth of 5.12 Gb/s;
in the second one, data are delivered to a PC for histogram processing (exclusively used
during calibration runs to properly calibrate the FE-I4 chips).
2.2 IBL BOC
The BOC board (Figure 2.3) is responsible for handling the control interface to the detector, as well as the data interface from the detector itself. Another main task of the BOC is to provide the clock to the connected front-end chips. Furthermore, a Phase Locked Loop
(PLL) generates copies of this clock for the ROD and the detector. The detector clock is
then handled by the FPGAs and coded into the control streams for the individual detector
modules.
The IBL BOC contains three Xilinx Spartan FPGAs:
• one BOC Control FPGA (BCF);
• two BOC Main FPGAs (BMF).
Figure 2.3: IBL BOC board
2.2.1 BOC Control FPGA
The BOC Control FPGA is responsible for the overall control and data shipping of the board. An embedded soft processor (a MicroBlaze) is instantiated on this FPGA, mainly to provide Ethernet access to the card, but it can also implement some self-tests for the board, and it is responsible for configuring the other FPGAs by loading configuration data from a FLASH memory accessed via SPI.
2.2.2 BOC Main FPGA
The two main FPGAs encode the configuration data of the FE-I4 front-end chips connected to the ROD into a 40 Mb/s serial stream that is sent straight to the FE-I4s themselves. These two FPGAs also manage the deserialization of the incoming data from the front-end chips; after data collection and word alignment, the decoded data are sent to the ROD board. On the transmission side, these two Spartan FPGAs also manage the optical connection via four S-Links to the ATLAS TDAQ system.
Figure 2.4: IBL ROD board
2.3 IBL ROD
The Insertable B-Layer Read Out Driver (IBL ROD) [9] is a board developed to substitute the older ATLAS Silicon Read Out Driver (SiROD), which is used by the ATLAS off-detector electronics sub-system to interface with the SemiConductor Tracker (SCT) and the Pixel L0, L1 and L2 front-end detector modules.
The board's main tasks are data gathering and event fragment building during physics runs, and histogramming during calibration runs.
As stated, during runs, the board receives data and event fragments from the 32 FE-I4 chips
and transforms them into a ROD data frame, which is sent back to the ATLAS TDAQ
through the BOC S-Link connections (see Figure 2.2).
2.3.1 ROD Master
An FPGA (Xilinx Virtex-5 XC5VFX70T-FF1136) is the Master of the Read-Out Driver, which acts as the interface with the front-end chips. This FPGA also contains a PowerPC, an embedded hard processor. Its main task is sending the event information to the two slave FPGAs [10].
2.3.2 ROD Slaves
The two FPGAs that work as slaves on the ROD board are Xilinx Spartan-6 XC6SLX150-FGG900 devices. They implement an embedded soft processor, the MicroBlaze.
All data generated by IBL during ATLAS experiments pass through these two FPGAs and
are collected inside the on-board RAM (SODIMM DDR2 2GB); moreover, during
calibration runs, histograms can be generated and sent to the histogram server.
2.4 TIM
The TTC (Timing, Trigger and Control) Interface Module (TIM) acts as the interface between the ATLAS Level-1 Trigger system signals and the pixel Read-Out Drivers using the LHC-standard TTC and Busy systems. In particular, the board is designed to propagate the TTC
clock all over the experiment: for what concerns the IBL off-detector electronics, the TIM
sends the clock to the BOC board, which then propagates it to the ROD, as stated above.
Furthermore, the TIM receives and propagates triggers through the custom backplane.
2.5 System limitations
Since the ROD board described in the previous section met all the strict requirements of the ATLAS experiment, it was decided to adopt the same system for all the other layers (L0, L1, L2) as well. But while the IBL electronics required 15 ROD boards, the remaining layers required about 110 boards. Even if, at the moment, the space occupation is clearly high but still sustainable, it already shows the limits of this system: the future upgrade of the whole LHC detector to the higher-luminosity HL-LHC (whose luminosity will be raised by up to a factor of 10) will need more and more boards to face a much higher data rate [11].
Indeed, the link bandwidth is proportional to the product of occupancy (which is a function
of the luminosity) and the trigger rate of the front-end devices [8]. It is expected that the hit rate for IBL at 140 pile-up will rise to 3 GHz/cm² and the readout rate to 4.8 Gbit/s per chip for a 1 MHz trigger rate [12].
Chapter 3
PCIe specifications and usage model
In the so-called Long Shutdown of the LHC accelerator foreseen for 2023, the whole LHC
detector will be upgraded to the higher-luminosity HL-LHC. In particular, the nominal luminosity will be raised to roughly ten times the current one [11], implying that the electronics have to withstand a much higher data rate. Although many different read-out
electronics boards have been presented, they all share a common feature: high flexibility
and configurability with PCIe interface as well as powerful FPGAs connecting to many
optical transceivers [13]. So, before the description of the board named Pixel-ROD, an
overall architectural perspective of the PCIe technology is provided.
3.1 Architecture of a PCIe system
PCI Express (PCIe, which stands for Peripheral Component Interconnect Express) is a high-performance I/O device interconnection bus used in mobile, workstation, desktop, server, embedded computing and communication platforms in order to expand the
capabilities of a host system by providing slots where expansion cards can be installed. It
has established itself as the successor to PCI providing higher performance, increased
flexibility and scalability. Indeed, since the so-called first generation of buses (ISA, EISA, VESA and Micro Channel) and the second one (PCI, AGP, and PCI-X), PC buses have doubled in performance roughly every three years, while processors have doubled in performance in half that time, following Moore's Law. In addition, although PCI enjoyed remarkable success, there was evidence that a multi-drop, parallel bus implementation was close to its practical limit of performance, as it cannot be easily scaled
up in frequency or down in voltage: it faced a series of challenges such as bandwidth
limitations, host pin-count limitations, lack of real-time data transfer and signal skew
limited synchronously clocked data transfer. Indeed, the bandwidth of the PCI bus and its derivatives can be significantly less than the theoretical value due to protocol overhead and
bus topology. All approaches to pushing these limits to create a higher bandwidth, general-
purpose I/O bus result in large cost increases for little performance gain [14]. So, there was
the need to engineer a new generation of PCI to serve as a standard I/O bus for future
generation platforms. There have been several efforts to create higher bandwidth buses and
this has resulted in a PC platform supporting a variety of application-specific buses
alongside the PCI I/O expansion bus. Indeed, PCIe offers a serial architecture that alleviates
some of the limitations of parallel bus architectures by using clock data recovery (CDR)
and differential signaling. Using CDR as opposed to source synchronous clocking lowers
pin-count, enables superior frequency scalability and makes data synchronization easier.
Moreover, PCIe was designed to be software-compatible with PCI (so older software systems are still able to detect and configure PCIe cards, although without PCIe features such as access to the extended Configuration Space, which will be discussed in the next paragraphs).
3.2 Interconnection
A PCIe interconnect that connects two devices together is referred to as a Link. It consists of x1, x2, x4, x8, x12, x16 or x32 signal pairs in each direction. These signals are referred to as Lanes. So, a x1 Link consists of 1 Lane, i.e. 1 differential signal pair in each direction for a total of 4 signals (each Lane constitutes a full-duplex communication channel). A x32 Link consists of 32 Lanes, i.e. 32 signal pairs in each direction, for a total of 128 signals [15]. In order to make the bus software backwards-compatible with PCI and
PCI-X systems (predecessor buses), it maintains the same usage model and load-store
communication model. When it comes to the differences, PCI and PCI-X buses are multi-
drop parallel interconnect buses in which many devices share the same bus, while PCIe
implements a serial, point-to-point type interconnection for communication between
devices. In systems requiring multiple devices to be interconnected, interconnections are
made possible thanks to switches. The point-to-point interconnection leads to a limited electrical load on the Link, overcoming the limitations of a shared bus. Moreover, as stated,
a serial interconnection results in fewer pins per device package which reduces PCIe board
design cost and complexity. Another significant feature is the possibility to implement
scalable numbers for pins and signal Lanes that allows a huge flexibility according to
communication performance requirements (PCIe specifications defined operations for a
maximum of 32 Lanes). So, the size of PCIe cards and slots varies depending upon the number of supported Lanes. During hardware initialization, the Link width and frequency of operation are automatically negotiated by the devices at the opposite ends of the Link without involving any kind of firmware. A packet-based communication protocol
is used over the serial interconnection. Packets are serially transmitted and received and byte-striped across the available Lanes. This feature contributes to keeping the device pin count low and to reducing system cost, since Hot Plug, power management, error handling and interrupt signaling are accomplished in-band using packet-based messaging instead
of side-band signals. Each packet to be transmitted over the Link consists of Bytes of
information. The first generation of the standard has a transmission/reception rate of 2.5
Gbits/s per Lane per direction (it has been doubled in the second generation). PCIe standard
also specifies three clocking architectures: Common Refclk, Separate Refclk, and the
already cited Clock Data Recovery. Common Refclk specifies a 100 MHz clock (Refclk),
with greater than ±300 ppm frequency stability at both the transmitting and receiving
devices. It was the most widely supported architecture among the first commercially
available devices. However, the same clock source must be distributed to every PCIe device
while keeping the clock-to-clock skew to less than 12 ns between devices. This can be a
problem with large circuit boards or when crossing a backplane connector to another circuit
board. In the case a low-skew configuration is not workable, the Separate Refclk
architecture, with independent clocks at each end, can be used. The clocks do not have to
be more accurate than ±300 ppm, because the PCIe standard allows for a total frequency
deviation of 600 ppm between transmitter and receiver. Finally, the Clock Data Recovery
architecture is the simplest, as it requires only one clock source, at the transmitter [16] and
it has become the most used configuration. In this scenario, since there is no clock signal
on the Link, the receiver uses a PLL to recover a clock from the 0-to-1 and 1-to-0 transitions
of the incoming bit stream. To allow for clock recovery on the signal line independently of the data transmitted, a DC-balanced protocol is used. Every Byte of data to be
transmitted is converted into 10-bit code via an 8b/10b encoder in the transmitter device
(so 10-bit symbols are employed). The consequence is a 25% additional overhead to
transmit a byte of data. All symbols, in order to be compatible with the Clock Data
Recovery architecture, are guaranteed to have one-zero transitions. PCIe implements a
dual-simplex Link capable of transmitting and receiving data simultaneously on a transmit
and receive Lane. So, to obtain the aggregate bandwidth (which assumes simultaneous
traffic in both directions) the transmission/reception rate has to be multiplied by 2 then by
the number of Lanes (the so-called Link width) and finally divided by 10 to account for the 10-bit-per-Byte encoding. For example, a first-generation x1 PCI Express Link has an aggregate throughput of 0.5 GBytes/s, while a x32 PCI Express Link reaches 16 GBytes/s.
In the case of the second version of the PCIe protocol (named PCIe gen. 2), the previous values are doubled. Indeed, in this case, the data transfer rate is raised to 5 GTransfers/s, which means that every Lane can transfer up to 5 Gbit/s using the
8b/10b encoding format [17]. As a side note, the third version of the protocol not only increased the data-transfer rate, raising it to 8 GTransfers/s, but also changed the encoding format from 8b/10b to 128b/130b (to reduce the protocol overhead) [18].
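The throughput arithmetic described above can be sketched in a few lines; the rates and encoding efficiencies are those quoted in the text, while the function name is illustrative:

```python
# Aggregate full-duplex throughput of a PCIe Link, per the figures in the
# text: 8b/10b encoding for generations 1 and 2, 128b/130b for generation 3.
GEN_RATE_GTPS = {1: 2.5, 2: 5.0, 3: 8.0}          # GT/s per Lane per direction
GEN_EFFICIENCY = {1: 8 / 10, 2: 8 / 10, 3: 128 / 130}

def aggregate_throughput_gbytes(gen, lanes):
    """Both directions combined, in GBytes/s of payload (1 Byte = 8 bits)."""
    gbits = GEN_RATE_GTPS[gen] * GEN_EFFICIENCY[gen] * lanes * 2
    return gbits / 8

print(aggregate_throughput_gbytes(1, 1))   # 0.5  GBytes/s, as in the text
print(aggregate_throughput_gbytes(1, 32))  # 16.0 GBytes/s
```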
3.3 Topology
In order to understand the topology of the PCIe standard, some definitions are provided:
PCIe end-point: PCIe device to be connected.
Root complex: host controller that connects the CPU of the host machine to the rest of the
PCIe devices. PCIe has its own address space consisting of either 32 or 64 bits depending
upon the Root-Complex and it is only visible by PCIe components like the Root-Complex,
end-points, switches and bridges. The Root-Complex can interrupt the CPU for any of the events generated by the Root-Complex itself or by any of the PCIe devices. Moreover, it can also access the memory without CPU intervention (acting as a sort of DMA controller). PCIe end-points can use this feature to write/read data to/from the memory. In order to do so, the Root-Complex makes the end-point the bus master (giving it permission to access the memory) and generates the corresponding memory address.
Bridge: it provides forward and reverse bridging allowing designers to migrate local bus,
PCI, PCI-X and USB bus interfaces to the serial PCIe architecture.
Switch: born to replace the multi-drop bus used in PCI and to provide fan-out for the I/O bus, it is also used to realize peer-to-peer communication between different end-points; this traffic, if it does not involve cache-coherent memory transfers, need not be forwarded to the host bridge.
Figure 3.1 shows how the PCIe components (Root-Complex, bridges, end-points and
switches) are interconnected to PCIe Links.
Figure 3.1: PCIe topology
As stated, the Root-Complex allows the connection of many PCIe end-points. This task is accomplished thanks to root ports that can be directly connected to end-points, to a bridge, or to a switch connected to several end-points. In the case of the Root-Complex or switches, in order to implement a point-to-point topology (which means that a single serial link connects two devices), multiple virtual PCI-to-PCI bridges are used. These are the devices that connect multiple buses together, providing a (virtual) PCI bridge for the up-stream PCIe connection and one (virtual) PCI bridge for each down-stream PCIe connection (Figure 3.2). An identification number is assigned to each bus by the software during the enumeration process; this number is used by switches and bridges to identify the path of a transaction. Every switch or bridge must store three bus numbers:
the primary bus number (that reflects the number of the bus the switch is connected to), the
secondary bus number (identifying the bus with the lowest number that can be reached)
and the subordinate bus number (the bus with the highest number that can be reached).
Figure 3.2: SoC detail
In the case of the switch of the previous example (see Figure 3.3), the primary bus number is 3, the secondary bus number is 4, and the subordinate bus number is 8. So, any transaction targeted at a bus from 4 to 8 will be accepted and handled by the switch [19].
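The routing decision based on the primary/secondary/subordinate scheme described above can be sketched as follows (a simplification that covers only the downstream-acceptance check; names are illustrative):

```python
# Sketch of how a switch uses its three stored bus numbers to decide whether
# a transaction should be accepted and forwarded downstream.
def switch_accepts(target_bus, primary, secondary, subordinate):
    """True if the target bus lies in the downstream range of this switch."""
    return secondary <= target_bus <= subordinate

# The switch of the example: primary bus 3, secondary 4, subordinate 8.
print(switch_accepts(6, primary=3, secondary=4, subordinate=8))   # True
print(switch_accepts(9, primary=3, secondary=4, subordinate=8))   # False
```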
3.4 Electrical specifications and I/O Lines
Having adopted a serial bus technology, PCIe uses far fewer I/O lines than PCI. As stated, PCIe devices employ differential drivers and receivers (a pair of differential TX lines and a pair of differential RX lines for each Lane) implementing a high-speed LVDS (Low-Voltage Differential Signaling) electrical signaling standard [15]. The differential driver is AC-coupled to the differential receiver at the opposite end of the Link thanks to a capacitor on the driver side, which isolates the DC levels. This means that the two devices at the opposite ends of a Link can use different DC common-mode voltages (range: 0 V to 3.6 V). The differential signal is
derived by measuring the voltage difference between two terminals. Logical values: a
positive voltage difference between the positive terminal and the negative one implies
Logical 1. On the other hand, a negative voltage difference between the same terminals
implies a Logical 0. Finally, when the driver is put in a high-impedance tristate condition
(also called Electrical-Idle or low-power state of the Link), the two terminals are driven at
the same potential. Let the voltage with respect to ground on each conductor be V_D+ and V_D−. The differential peak-to-peak voltage is defined as 2 · max|V_D+ − V_D−|. To signal a logical 1 or a logical 0, the differential peak-to-peak voltage driven by the transmitter must be between 800 mV (minimum) and 1200 mV (maximum). Conversely, during the Link Electrical Idle state, the transmitter drives a differential peak voltage between 0 mV and 20 mV.
Figure 3.3: Switch detail
20 mV. As stated, the receiver is able to sense a logical 1, a logical 0 as well as the Electrical
Idle state of the Link, by detecting the voltage on the Link via a differential receiver
amplifier. Due to signal loss along the Link, the receiver must be designed to sense an
attenuated version of the differential signal driven by the transmitter. The receiver sensitivity is fixed to a differential peak-to-peak voltage between 175 mV and 1200 mV, while the electrical-idle detect threshold can range from 65 mV (minimum) to 175 mV (maximum). Any voltage less than 65 mV peak-to-peak implies that the Link is in the
Electrical Idle state [15].
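The receiver-side thresholds quoted above can be summarized in a small classification sketch; the specific idle threshold chosen here (100 mV) is an assumption, one design point inside the allowed 65 mV to 175 mV band:

```python
# Classify a received differential level (peak-to-peak, in mV) according to
# the thresholds quoted in the text. The idle threshold is a design choice
# inside the allowed 65-175 mV band (100 mV here is an assumed example).
def classify_rx_level(vdiff_pp_mv, idle_threshold_mv=100):
    if vdiff_pp_mv < idle_threshold_mv:
        return "electrical-idle"
    if 175 <= vdiff_pp_mv <= 1200:
        return "valid-signal"
    return "out-of-spec"

print(classify_rx_level(40))    # electrical-idle
print(classify_rx_level(800))   # valid-signal
```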
PCIe specifications also define other auxiliary signals: the differential clock REFCLK used in the Common Refclk clocking architecture, a +12V power rail, PERST# (fundamental reset) to indicate when power and clock are stable, and presence signals (PRSNT1# and PRSNT2#) for hot-plug detection (see Figure 3.4). As stated, unlike PCI, PCIe does not use dedicated interrupt lines but relies on in-band signaling transmitted through the differential TX and RX lines.
3.5 PCIe Address Space
The host system can access any of the PCIe end-points only by using the PCIe Address Space (Figure 3.5). It is important to note that this address space is virtual; there is no physical memory associated with it: it only represents a list of addresses used by the Transaction Layer (explained later) in order to identify the target of a transaction.
Figure 3.4: I/O Lines
The Root-Complex also has configuration registers (to configure the Link width, the frequency and the Address Translation Unit that translates CPU addresses into PCIe ones), a Configuration Space that contains all the information regarding the end-points (such as device ID and vendor ID), and registers to configure the end-points (used, for example, to put a device into low-power mode). PCIe specifications defined the Configuration Space to be backward-compatible with PCI but increased its size from 256 B to 4 kB. The first 64 Bytes are
standard (they are called the standard headers) and both PCIe and PCI defined two types
of standard headers: type 1 (containing info regarding root-ports, bridges and switches
(such as primary, secondary and subordinate bus numbers)) and type 0 (containing info
regarding end-points). Every PCIe component has its own Configuration Space. Figure 3.6
shows the standardized type 0 header that is present in the Configuration Space of a PCIe
end-point (only the first 64 Bytes are shown). It contains information regarding the device
(device ID, vendor ID, Status and Command used by the host system to configure and
control the end-point), the header type (that differentiates type 0 from type 1 headers) and
Base Address Registers used to configure the Memory Space. The mechanism that
determines the address to which the Configuration Space of a particular end-point should
be mapped in the PCIe Address Space is called Enhanced Configuration Access
Mechanism (ECAM).
Figure 3.5: PCIe Address Space
Figure 3.6: Configuration Space Header of an end-point
The address created is a function of bus number, device number, function and register
number (Figure 3.7).
Figure 3.7: Enhanced Configuration Access Mechanism
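The address packing sketched in Figure 3.7 can be illustrated in code; the exact bit positions below follow the standard ECAM layout (bus in bits 27-20, device in 19-15, function in 14-12, register offset in 11-0), which should be checked against the specification for a real implementation:

```python
# Sketch of the ECAM offset computation: bus, device, function and register
# number are packed into one address, as in Figure 3.7.
def ecam_offset(bus, device, function, register=0):
    """Standard ECAM bit layout: bus<<20 | device<<15 | function<<12 | reg."""
    return (bus << 20) | (device << 15) | (function << 12) | register

# The end-point at Bus 1, Device 0, Function 0 maps to 100000h (Figure 3.5).
print(hex(ecam_offset(1, 0, 0)))  # 0x100000
```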
For example (see Figure 3.5), the Configuration Space of the first end-point (Bus:1,
Device:0, Function:0) is mapped to the address 100000h of the PCIe Address Space. The
host system is capable of reading the Configuration Space of an end-point thanks to the
Configurable Address Space present in the Root-Complex, which has a region (CFG0) of 4 kB (matching the size of the Configuration Space). Indeed, the CPU can only access
the Root-Complex internal registers and cannot directly read the Configuration Spaces of
PCIe devices. To do so, Configurable Address Space has to be “connected” to the
Configuration Space of the end-point. To be able to do that, Root-Complex implements an
Address Translation Table which has to be programmed with the source address (A in the
example of Figure 3.8) that is an address in the Configurable Space, the destination address
(the ECAM address (for example 100000h)) and the size (4 kB in the case of an access to
the Configuration Space). So, when the CPU accesses the CFG0 region of the Configurable Address Space, the Address Translation Unit makes sure that the Root-Complex accesses the ECAM address corresponding to the Configuration Space of the desired end-point.
Naturally, multiple end-points can be connected to the Root-complex. For example, a PCIe
Bridge (see Figure 3.9). In the example, its ECAM address (mapped in the PCIe Address
Space) is 200000h. To access the configuration space of the PCIe Bridge, the same region
in the Configurable Address Space can be used (thus programming the Address Translation Table differently). There can be other devices connected to the Bridge (for
example, an end-point). Again, using the Enhanced Configuration Mechanism, the end-
point Configuration Space is mapped in the PCIe Address Space. PCIe specifications define
a new type of transaction in order to access devices connected beyond the bridge: there is
a second region in the Configurable Address Space (CFG1) dedicated to this task. The rest
of the Configurable Address Space can be used for I/O Space (generally 64 kB) and
Memory Space (where the peripheral registers and memory are mapped). The Memory
Space of an end-point cannot be accessed in the same way CFG0 and CFG1 are, because its size may vary from card to card. So, the host system must know the size of the
Memory Space of each end-point. To do so, the host uses the information stored in the Base Address Registers present in the Configuration Space Header (see Figure 3.10). Once
the host system has got the size of the Memory Space of an end-point, it allocates an equal
amount of memory in the Configurable Address Space (in the region dedicated to Memory
Space) that will be used to access the Memory Space of the PCIe end-point. In order to do
so, the Configurable Address Space has to be mapped in the PCIe Address Space. Again,
this is made possible thanks to the Address Translation Table. In the example of Figure 3.9,
Figure 3.8: addressing methods with Address
Translation Table
the Address Translation Table is programmed to have: source address B (the first address
of the Configurable Address Space dedicated to Memory Space), destination address B
(address in the PCIe Address Space) and size equal to 256 MB since the whole Memory
Space of the Configurable Address Space needs to be mapped in the PCIe Address Space.
Figure 3.9: PCIe Address Space
Now, whenever the CPU accesses these regions in the Configurable Address Space,
because of Address Translation Table, the Root-Complex will access the corresponding
regions in the PCIe Address Space. It has to be noted that, at this point, the PCIe end-points will not respond to accesses in this region because their Memory Space is not mapped into the PCIe Address Space yet. To do so, the Root-Complex has to access the Base Address Registers and write the starting address of the PCIe Address Space to which the Memory Space of the end-point has to be mapped. After that, the Memory Space of the PCIe end-point will respond to any memory request that targets this region of the PCIe Address Space, since there will be a one-to-one mapping between its Memory Address and the PCIe Address [19].
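The step in which the host learns how much Memory Space an end-point needs relies on the classic BAR sizing handshake: write all-ones to a Base Address Register, read it back, and the bits hardwired to zero reveal the region size. A minimal sketch (the readback value is a made-up example for a 256 MB region, matching the size used in the text):

```python
# Sketch of BAR sizing: after the host writes 0xFFFFFFFF to a 32-bit memory
# BAR, the read-back value has the size-alignment bits hardwired to zero.
def bar_region_size(readback, address_mask=0xFFFFFFF0):
    """Size in bytes implied by the value read back after writing all-ones.
    The low 4 bits of a memory BAR hold type flags, hence the mask."""
    return (~(readback & address_mask) & 0xFFFFFFFF) + 1

# An end-point with a 256 MB Memory Space hardwires bits below 2**28 to zero:
readback = 0xF0000000          # hypothetical value read back
print(bar_region_size(readback))  # 268435456 bytes = 256 MB
```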
3.6 Device Tree
A device tree (a data structure used to describe the hardware in Linux-based operating systems) is created for devices that are not enumerated dynamically (Figure 3.11). So, in this case, device-tree nodes are only created for the Root-Complex (the end-points are enumerated dynamically). The properties configured in the device tree are shared among all the end-points. One of the main properties that can be configured is a field of the structure called “ranges”. It is used to program the Address Translation Unit (see Figure 3.12).
Figure 3.10: Configuration Space Header
Figure 3.11: example of Device Tree
Figure 3.12: each cell is 32 bits wide. The first 3 cells are dedicated to the PCI address (the first one contains flags and the others store the address proper, since the PCIe Address Space has a maximum width of 64 bits), the fourth cell stores the CPU address (an address of the Configurable Address Space) and the last one contains the size information. This information is used to program the Address Translation Unit.
3.6.1 Linux PCIe Subsystem
Figure 3.13: Linux PCIe Subsystem: at the bottom, the Root-Complex platform driver.
Each platform can have its own Root-Complex driver, which is responsible for initializing the Root-Complex registers, programming the Address Translation Unit, extracting the I/O resource and memory information from the “ranges” property of the device tree and invoking an API in the PCI-BIOS layer in order to start the enumeration process. The PCI-BIOS layer performs BIOS-type initialization. It provides the aforementioned API used by the Root-Complex drivers and invokes the PCI Core to start bus scanning. So, the PCI Core
scans the bus by using the callback provided by the Root-Complex driver in order to read
the Configuration Space of the PCIe end-points. During bus scanning, when the PCI Core finds a device with a known device ID and vendor ID, it loads the corresponding PCIe device driver. This driver stores all the information about the PCIe end-point and it also provides the
implementation of the interrupt handlers for any of the interrupts that can be raised by the
PCIe card. Moreover, each end-point can also interact with its own domain specific upper-
layer (for example, an Ethernet card can interact with the Ethernet stack).
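The driver-matching step described above can be sketched with a lookup keyed on (vendor ID, device ID); the table below is hypothetical (0x8086/0x100E is the classic Intel e1000 Ethernet example, 0x10EE is the Xilinx vendor ID with a made-up device ID), not the actual Linux matching code:

```python
# Sketch of the device-ID / vendor-ID matching performed during bus
# scanning. The table entries are illustrative, not a real driver database.
DRIVER_TABLE = {
    (0x10EE, 0x7024): "xilinx_pcie_ep",   # hypothetical Xilinx end-point
    (0x8086, 0x100E): "e1000",            # classic Intel Ethernet example
}

def match_driver(vendor_id, device_id):
    return DRIVER_TABLE.get((vendor_id, device_id), "no driver: device ignored")

print(match_driver(0x8086, 0x100E))   # e1000
```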
3.7 The three Layers of the protocol
As stated, PCIe is a standard that uses a packet-based communication system and its architecture is specified in layers. Indeed, the protocol is made up of three layers: the Transaction, Data Link and Physical Layers, as shown in Figure 3.14. Layered protocols have been used for years in data communication. Indeed, they permit isolation between the different functional areas of the protocol and allow upgrading one or more layers without requiring updates of the others. So, a revision of the protocol might affect the physical media with no major effects on the higher layers [14].
Figure 3.14: layers of the PCIe architecture
The software layers will generate read and write requests that are transported by the
transaction layer to the I/O devices using a packet-based, split-transaction protocol. There
are two main packet types: Transaction Layer Packets (TLPs) and Data Link Layer Packets
(DLLPs). While DLLPs are meant for service communication between PCIe constitutive
elements, TLPs are the packets that move the data from and to devices.
3.7.1 Transaction Layer
The Transaction Layer is the highest layer of the PCI Express architecture. It receives read
and write requests from the software layer and creates request packets (TLPs) for
transmission to the Data Link Layer. On the transmitting side of a PCIe transaction, TLPs are formed with protocol information (type of transaction, recipient address, transfer size, etc.) inserted in header fields. As stated, PCIe does not use dedicated interrupt lines but
relies on in-band signaling. This method of propagating system interrupts was introduced
as an alternative to the hard-wired sideband signal in PCI rev 2.2 specifications and it was
made the primary method of interrupt processing in the PCIe protocol. Transactions are divided into posted and non-posted transactions. While posted transactions do not need any response packet, non-posted transactions need a reply. In this case, the Transaction
Layer has to receive the response packets from the Data Link Layer and to match these
with the original software requests. This task can be easily accomplished since each packet
has a unique identifier that enables response packets to be directed to the correct originator.
The packet format supports 32-bit memory addressing and extended 64-bit memory
addressing. Packets also have attributes such as “no-snoop,” “relaxed-ordering” and
“priority”, which may be used to prioritize the flow throughout the platform [14]. These can be used, for example, to process streaming data first in order to avoid delivering real-time data late.
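The matching of response packets to their originating requests via a unique identifier, described above, can be sketched as a tag table; class and method names are illustrative, not PCIe terminology:

```python
# Sketch of how a Transaction Layer matches completions of non-posted
# requests back to their originators via a unique identifier (tag).
class TransactionLayer:
    def __init__(self):
        self._pending = {}          # tag -> originating request
        self._next_tag = 0

    def send_non_posted(self, request):
        """Assign a unique tag and remember the request until its reply."""
        tag = self._next_tag
        self._next_tag += 1
        self._pending[tag] = request
        return tag                  # the tag travels in the TLP header

    def receive_completion(self, tag, data):
        """Match an incoming completion to the original software request."""
        request = self._pending.pop(tag)
        return request, data

tl = TransactionLayer()
tag = tl.send_non_posted("MemRd @0x1000")
req, data = tl.receive_completion(tag, b"\x2a")
print(req)   # MemRd @0x1000
```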
3.7.2 Data Link Layer
The Data Link Layer acts as an intermediate stage between the Transaction Layer and the Physical Layer. Its primary duty is to provide a reliable mechanism for the exchange of
the TLPs by appending a 32-bit cyclic redundancy check (CRC-32) and a sequence ID for
data integrity management (packet acknowledgement and retry mechanisms) (see Figure
3.15).
In order to reduce packet retries (and the associated waste of bus bandwidth), a credit-based
fair queuing is adopted: a “credit” is accumulated to queues as they wait for service and it
is spent by queues while they are being serviced. Queues with positive credit are eligible
for service [20] (see Figure 3.16). This scheduling algorithm ensures that packets are only transmitted when it is known that a buffer is available to receive the packet at the other end. For incoming TLPs, the Data Link Layer accepts them from the Physical Layer and checks the sequence number and the CRC. If an error is detected, the layer communicates the need to resend; otherwise, the TLP is delivered to the Transaction Layer.
Figure 3.15: each Layer appends a header and a
tail to the packet
Figure 3.16: credit-based fair queuing (traffic shaping)
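The credit mechanism described above can be sketched as a simple counter: the transmitter spends credits when it sends a TLP and receives them back when the far end frees buffer space. Class and method names are illustrative, and the credit granularity is simplified to abstract units:

```python
# Minimal sketch of credit-based flow control: a TLP is sent only when the
# receiver has advertised enough buffer credits for it.
class CreditedLink:
    def __init__(self, initial_credits):
        self.credits = initial_credits       # advertised by the receiver

    def try_send(self, tlp_credits_needed):
        """Transmit only if the far-end buffer can hold the packet."""
        if self.credits >= tlp_credits_needed:
            self.credits -= tlp_credits_needed
            return True                      # TLP goes on the wire
        return False                         # wait: no buffer at the far end

    def credit_update(self, returned):
        """Receiver freed buffer space and returned credits (via a DLLP)."""
        self.credits += returned

link = CreditedLink(initial_credits=2)
print(link.try_send(1))   # True
print(link.try_send(2))   # False: only 1 credit left
link.credit_update(2)
print(link.try_send(2))   # True
```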
3.7.3 Physical Layer
The Physical Layer interfaces the Data Link Layer with the signaling technology for Link data interchange. There are two sections of the Layer: the transmit logic and the receive logic, responsible for the transmission and reception of packets, respectively. These sections, in turn, are made up of a logical sub-layer and an electrical sub-layer. As the description of the electrical part was already provided (see paragraph 3.4), only the logical sub-layer will be discussed.
On the transmission side, its main tasks are: framing the packet with start-of-packet and end-of-packet bytes (see Figure 3.15); splitting the Byte data across the Lanes (in multi-Lane Links) (see Figure 3.17); Byte scrambling, to reduce electromagnetic emissions (by dispersing the power spectrum over a wider frequency band) and to facilitate the work of the clock recovery (by removing long sequences of ‘0’s or ‘1’s); and 8b/10b encoding and serialization of the 10-bit symbols before sending them across the Link to the receiving device. On the receiving side, the dual task is performed: deserialization, 8b/10b decoding, Byte descrambling, data reassembly (in multi-Lane Links) and unframing.
Figure 3.17: single-lane Link byte stream (on the left) and the splitting of the data across the lanes in the case of a 4-lane Link (on the right).
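Two of these steps are easy to sketch. The striping rule of Figure 3.17 sends byte i to lane i mod N, and scrambling XORs the stream with an LFSR-generated keystream; because the keystream does not depend on the data, descrambling is the same operation. The tap positions below are illustrative, not the exact PCIe Gen2 scrambler polynomial.

```python
def stripe_bytes(data: bytes, lanes: int) -> list:
    """Round-robin byte striping: byte i goes to lane i mod lanes."""
    return [data[i::lanes] for i in range(lanes)]

def scramble(data: bytes, seed: int = 0xFFFF) -> bytes:
    """XOR the byte stream with the output of a 16-bit LFSR. Applying
    scramble() twice restores the original bytes, since the keystream
    evolves independently of the data."""
    state, out = seed, bytearray()
    for byte in data:
        keystream = 0
        for _ in range(8):                      # 8 LFSR steps per byte
            bit = (state >> 15) & 1
            keystream = (keystream << 1) | bit
            feedback = bit ^ ((state >> 4) & 1) ^ ((state >> 3) & 1) ^ ((state >> 2) & 1)
            state = ((state << 1) | feedback) & 0xFFFF
        out.append(byte ^ keystream)
    return bytes(out)
```

For a 4-lane link, lane 0 carries bytes 0, 4, 8, 12, …, exactly as in the figure, and the scrambled stream no longer contains the long constant runs the clock-recovery circuit would struggle with.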
Chapter 4
Pixel-ROD board
The following paragraphs describe the board, still at the prototype stage, developed as a replacement for the previous series of readout boards employed in the ATLAS Pixel Detector. To take advantage of all the experience and effort invested in the ROD board (and to allow firmware portability), it was decided to keep working with FPGAs from Xilinx, upgrading to the 7-Series family. Moreover, building on the successful Master-Slave architecture of the previous boards, the Pixel-ROD was conceived as a merging of two Xilinx evaluation boards: the KC705 (which constitutes the slave device) and the ZC702 (master section). This approach also allows a huge speed-up of the design and debugging process, since the newly developed hardware and software testing platforms can first be validated on already tested and highly reliable boards (to try out the testing platforms themselves) before being applied to the prototype. Since the tests discussed later are meant to validate the slave section of the board, a brief general overview of the KC705 is provided before the description of the Pixel-ROD.
4.1 Xilinx KC705
As stated, the slave unit of the Pixel-ROD board is mainly based on a Xilinx evaluation board, the KC705. This board, shown in Figure 4.1, has a massive range of applications and its primary features are listed below [21]:
• Kintex-7 28nm FPGA (XC7K325T-2FFG900C);
• 1GB DDR3 memory SODIMM 800MHz/1600Mbps;
• PCIe gen. 2 8-lane endpoint connectivity;
• SFP+ connector;
• 10/100/1000 tri-speed Ethernet with Marvell Alaska 88E1111 PHY;
• 128MB Linear BPI Flash for PCIe Configuration;
• USB-to-UART bridge;
• USB JTAG via Digilent module;
• Fixed 200 MHz LVDS oscillator;
• I2C programmable LVDS oscillator;
Figure 4.1: Xilinx KC705 demo board
4.1.1 Why PCIe
The KC705 is a PCIe board, and the reasons behind the adoption of the PCIe interface lie not only in the fact that PCIe is the best candidate to replace the slower VME buses (whose data rate is limited to the 320 MB/s of VME320), but also in the possibility of a new installation configuration. Indeed, one or two PCIe boards can be directly connected to the motherboard of the TDAQ PCs, providing a faster response (giving direct access to the main resources of the PCs) and an easier installation. This configuration is the most likely to be adopted for the experimental phase that will start after the Long Shutdown scheduled for 2023, not only in the ATLAS experiment but also in CMS. It follows the trend established by KC705-like boards, which are mainly designed to speed up real-time calculations performed in a PC. They are usually mounted on an external PCB, so they must be connected via an appropriate interface to the motherboard of the host PC. Since this interface is often the bottleneck of such systems, new-generation FPGA evaluation boards communicate via PCIe.
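A back-of-the-envelope comparison makes the gap concrete (illustrative Python, not from the thesis): a Gen2 lane signals at 5 GT/s, and 8b/10b encoding leaves 8 of every 10 transmitted bits as payload, so one lane carries 0.5 GB/s and the x8 link available on the KC705 reaches 4 GB/s before protocol overhead, against the 320 MB/s of VME320.

```python
def pcie_payload_bandwidth_gbs(line_rate_gt_s: float, lanes: int,
                               encoding_efficiency: float = 8 / 10) -> float:
    """Peak usable bandwidth in GB/s: line rate (GT/s) times the 8b/10b
    encoding efficiency gives usable Gb/s per lane; divide by 8 bits per
    byte, multiply by the number of lanes."""
    return line_rate_gt_s * encoding_efficiency / 8 * lanes

# Gen2 x8: 5 GT/s * 0.8 / 8 = 0.5 GB/s per lane, i.e. 4 GB/s for the link
```

This peak figure still excludes TLP headers, DLLPs and flow-control stalls, so sustained application throughput is somewhat lower.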
4.1.2 Kintex-7 FPGA
The Xilinx Kintex-7 XC7K325T-2FFG900 mounted on this board is a powerful medium-
range FPGA, that can be used to replace both Spartan-6 devices on the ROD board [22,
23]. Its key features are presented hereafter:
• Advanced high-performance FPGA with logic elements based on real 6-input
lookup tables (LUTs) that can be used to implement combinatorial logic or be
configured as distributed memory;
• High-performance DDR3 interface supporting up to 1866 Mb/s;
• High-speed serial connectivity with 16 built-in Gigabit transceivers (GTX)
having rates from 600 Mb/s to a maximum of 12.5 Gb/s, offering a special low-
power mode, optimized for chip-to-chip interfaces;
• A user configurable analog interface (XADC), incorporating dual 12-bit analog-
to-digital converters (ADC) with on-chip temperature and supply sensors;
• Powerful clock management tiles (CMT), combining phase-locked loop (PLL) and
mixed-mode clock manager (MMCM) blocks for high precision and low jitter;
• Integrated block for PCI Express (PCIe), for up to x8 Gen2 Endpoint and Root Port
designs;
• 500 maximum user I/Os (excluding GTX) and 16 Mb of Block RAM (BRAM).
4.2 Xilinx ZC702
The demo board from which the master section was derived is the Xilinx ZC702. It was chosen among the many boards in the Xilinx catalogue because its FPGA embeds a hard processor (two ARM Cortex-A9 cores) [24] that will substitute the hard processor implemented on the Virtex-5 of the ROD board (see paragraph 2.31).
4.3 The Pixel-ROD board
As stated, the Pixel-ROD was conceived as a merging of the KC705 and ZC702. Naturally, many features of the two boards had to be removed, since they are not necessary for a read-out board (Xilinx demo boards are not application-specific), while many others had to be redesigned or completely developed from scratch, as they needed to be shared across the whole new board. Removed features include the LCD display, the SD card reader, the HDMI port, and a few GPIOs and LEDs. On the other hand, to implement a ROD-like Master-Slave architecture, a 21-bit differential bus was added between the two FPGAs to provide the necessary communication, as well as a 1-bit differential line to provide a common clock. Moreover, another 5-bit-wide single-ended bus was introduced as a general-purpose interconnection bus. One of the other features that needed a complete redesign was the JTAG chain, which had to include both FPGAs. To do so, a 12-pin (3x4) header (see Figure 4.2) was added to allow the Kintex to be excluded from the JTAG chain, in order to prevent unwanted programming of the slave FPGA. In addition, another internal JTAG connection from the Zynq to the Kintex was added. It allows the programming of the slave FPGA with the desired firmware, using the Zynq FPGA as master. This was very helpful during the debugging sessions: indeed, since the Pixel-ROD board was installed inside a PC, it was very difficult to access the JTAG port.
Figure 4.2: custom JTAG configuration header. In blue the full JTAG chain, in red the internal JTAG chain that excludes the Kintex.
The main devices and features implemented on the Pixel-ROD board are the following:
• Kintex-7 28 nm FPGA (XC7K325T-2FFG900C);
• Zynq 7000 FPGA (XC7Z020-1CLG484C), featuring two ARM Cortex A9
MPCore;
• 2 GB DDR3 memory SODIMM (Kintex DDR);
• 1 GB DDR3 component memory (Micron MT41J256M8HX-15E, Zynq DDR3);
• PCI Express Gen2 8-lane endpoint connectivity;
• SFP+ connector;
• Three VITA 57.1 FMC Connectors (one HPC, two LPC);
• Two 10/100/1000 tri-speed Ethernet with Marvell Alaska PHY;
• Two 128 Mb Quad SPI flash memory;
• Two USB-to-UART bridges;
• USB JTAG interface (using a Digilent module or header connection);
• Two fixed 200 MHz LVDS oscillators;
• I2C programmable LVDS oscillator;
As of now, a single Pixel-ROD can interface up to 16 equivalent FE-I4 channels (half of the 32 channels of the BOC-ROD pair).
4.3.1 Space constraints
The stack-up of a board defines the composition, the thickness and the function of each layer of a Printed Circuit Board (PCB). As stated, the constraints of the Pixel-ROD were derived from two Xilinx demo boards, but a one-to-one mapping was not possible, since all the resulting layers needed to be merged into a single 16-layer stack-up. In fact, the maximum number of layers is fixed by the PCIe standard in order to respect the constraint on the thickness of the board (otherwise, it would not fit into the PCIe slot): the allowable thickness ranges from 1.44 to 1.70 mm [25]. The stack-up adopted is shown in Figure 4.3: the 16 PCB layers were used to provide the necessary space for the high number of traces while ensuring the alternation of signal and ground layers, as well as the concentration of the power layers in the innermost section of the board, in order to reduce the cross-talk between planes and to reach the required level of insulation.
An example of a PCB layer is presented in Figure 4.4. Another constraint on the size was due to the PC case the board will be installed in: the maximum length was therefore set to 30 cm, leaving little margin for device placement. Finally, the height was left free, to allow sufficient room for all the necessary devices. The result of all these efforts is presented in Figure 4.5.
Figure 4.3: Stack-up of Pixel-ROD.
Figure 4.5: the Pixel-ROD prototype. In blue, the components connected to the Kintex FPGA; in red, the ones related to the Zynq FPGA; in yellow, the power stage.
Figure 4.4: one of the 16 layers.
Chapter 5
Pixel-ROD test results
Because of its complexity, the Pixel-ROD had to pass through several testing stages in order to verify its correct behavior. The stages were divided into: a hardware wake-up phase, to ensure the hardware works correctly and can be properly configured, and the validation of all the board functionalities, by configuring, debugging and testing each device installed on the board. Since this thesis work concerns the creation of a testing platform meant not only for the validation of the PCIe interface of the board, but also for presenting and achieving a high-performance PCIe system capable of transferring data between the board and a PC, only a brief description of the hardware wake-up phase is provided. Conversely, the platform itself, as well as the firmware adopted, will be comprehensively covered.
5.1 Power supply
As stated, the first testing stage the board passed through involved the configuration of the board power-up, which led to the programming of the three UCD9248 power controllers. The Fusion Digital Power Designer tool from Texas Instruments allows the user to set many important parameters, such as the voltage of each rail and the power-up sequence for each of them (see Figure 5.1).
Figure 5.1: Fusion Digital Power Designer GUI, configuration of the voltage rails
5.2 Board interfaces and memory
When we started developing the test platform to perform PCIe transactions between a host PC and the board, the Ethernet subsystem, the internal bus connecting the Zynq to the Kintex, the SFP port, and all the interfaces of the two FPGAs had already been tested. Since all the previous tests, in conjunction with the newly developed ones, take advantage of the many tools provided by the Vivado Design Suite by Xilinx, a brief description of this CAD tool is provided.
5.2.1 Vivado Design Suite
The Vivado Design Suite is a tool suite developed to increase the overall productivity of designing, integrating, and implementing systems using many of the Xilinx devices that come with a variety of recent technologies, including high-speed I/O interfaces, hardened microprocessors and peripherals, analog mixed signals and more. The Vivado Design Suite allows for the synthesis and implementation of HDL designs, enabling developers to synthesize their designs, perform timing analysis, examine RTL diagrams, simulate a design's reaction to different stimuli and configure the target device with the programmer. The design implementation is accelerated thanks to place-and-route tools that analytically optimize multiple concurrent design metrics, such as timing, congestion, total wire length, utilization and power [26].
5.3 Memory test
Once the functionality of the power stage and of the Pixel-ROD interfaces had been consolidated, a complete memory test was designed to prove not only that basic functions could be performed on the board, but also its ability to sustain high-speed memory accesses. In particular, the test was meant to verify whether the RAM module accessible by the Kintex FPGA (a 2 GB SODIMM DDR3 memory bank) is subject to disturbance errors when repeated accesses are performed to the same memory bank, but in different rows, within a short period of time. These disturbance errors are caused by charge leakage: they occur when the repeated accesses cause charge loss in a memory cell before the cell contents can be restored at the next DRAM refresh interval. Moreover, as DRAM process technology scales down to smaller dimensions, it becomes more difficult to prevent DRAM cells from electrically interacting with each other [27].
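The access pattern behind such a disturbance (row-hammer) test can be sketched as follows. The actual test drives the DDR3 through the FPGA memory controller in firmware; this Python fragment only illustrates the row-address sequence such a test issues, and the double-sided variant shown (hammering both neighbours of a victim row) is one common choice, not necessarily the one adopted in the thesis.

```python
def hammer_addresses(victim_row: int, n_rows: int, activations: int) -> list:
    """Build the row-access sequence for a double-sided disturbance test:
    alternately activate the two rows adjacent to the victim, so the same
    bank sees a burst of row open/close cycles in different rows before
    the next refresh can restore the victim cells' charge."""
    above, below = victim_row - 1, victim_row + 1
    if above < 0 or below >= n_rows:
        raise ValueError("victim row must have two neighbours in the bank")
    sequence = []
    for _ in range(activations):
        sequence += [above, below]
    return sequence
```

After issuing such a sequence, the test reads back the victim row and compares it with the pattern originally written: any mismatch is a disturbance error.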
Another advantage of a test of this kind is that it brings the board closer to implementing its full functionality. In fact, it involves not only the bare trace interconnections between devices, but also specific ICs and, especially, the development of a complex firmware (which can become very time-consuming). In this respect, the smart design obtained by using the KC705 board as a reference (see chapter 4.3) sped up the entire process, since it made available a platform very similar to the Pixel-ROD board, on which the firmware could be validated before being loaded onto the tested board itself.
5.3.1 Vivado IP Integrator and AXI4 Interface
In order to develop the firmware for the test, the Intellectual Property (IP) Integrator tool, part of the Vivado Design Suite, has been used. Xilinx defines it as “the industry’s first plug-and-play system integration design environment”, since it allows the user to create complex system designs by instantiating and interconnecting IP cores from the Vivado IP catalog on a design canvas. In this way, the user can take advantage of the IP already available in the Vivado library to speed up the firmware development, which would otherwise take a considerable amount of time. Available with the Vivado Design Suite are many IP subsystems for Ethernet, PCIe, HDMI, video processing and image sensor processing. As an example, the AXI4 PCIe subsystem is made up of multiple IP cores, including PCIe, DMA and AXI4 Interconnect, and it is used to provide the software stack necessary for the developed testing platform. Therefore, before going into further details of the tests, a brief description of the main IP cores used, as well as of the AXI interface, is provided.
The AXI protocol
AXI stands for Advanced eXtensible Interface [28] and is part of the ARM Advanced Microcontroller Bus Architecture (AMBA), a family of open-standard microcontroller buses. AMBA is widely used on a range of ASIC and SoC parts, including the application processors used in modern portable mobile devices such as smartphones. Nowadays, it has become a de-facto standard for 32-bit embedded processors because of its exhaustive documentation and the absence of royalties. The AXI4 protocol is the default interface for IP cores and it was extensively used during the debugging of the Pixel-ROD board. The AXI4 protocol presents three key features: firstly, it provides a standardized interface between the many IPs (thus allowing the user to concentrate on the system debug rather than on the protocol itself); secondly, the AXI4 protocol is flexible, meaning that it suits a variety of applications, from single, light data transactions to bursts of 256 data transfers with just a single address phase; finally, since AXI4 is an industrial standard, it also grants access to the whole ARM environment.
There are three types of AXI4 interfaces:
• AXI4, used for high-performance memory-mapped operations;
• AXI4-Lite, used for simple, low-throughput memory-mapped communication;
• AXI4-Stream, used for high-speed data streams.
The AXI4 interface uses a Master-Slave architecture, and all AXI4 masters and slaves can be connected by means of a specific IP named Interconnect. Both AXI4 and AXI4-Lite define the following independent transaction channels:
• Read Address Channel;
• Read Data Channel;
• Write Address Channel;
• Write Data Channel;
• Write Response Channel.
The address channels carry control information that describes the nature of the data to be transferred. Data can simultaneously move in both directions between master and slave, and data transfer sizes can vary (see Figure 5.2). The limit in AXI4 is a burst transaction of up to 256 data transfers, while the AXI4-Lite interface allows only 1 data transfer per transaction.
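The single-address-phase burst can be sketched as follows. This is a behavioral toy model, not RTL: the memory dictionary and the function are invented for illustration, while the names in the comments (ARADDR, ARLEN, RDATA, RLAST) mirror the real AXI4 signal names.

```python
def axi4_read_burst(memory: dict, araddr: int, arlen: int, beat_bytes: int = 4):
    """Model of an AXI4 incrementing read burst: one address phase
    (ARADDR, ARLEN on the Read Address Channel) produces ARLEN + 1 data
    beats on the Read Data Channel, the last one flagged with RLAST.
    AXI4 allows up to 256 beats per address phase; AXI4-Lite only 1."""
    n_beats = arlen + 1
    assert 1 <= n_beats <= 256, "AXI4 burst length limit"
    beats = []
    for i in range(n_beats):
        data = memory.get(araddr + i * beat_bytes, 0)   # address increments per beat
        beats.append((data, i == arlen))                # (RDATA, RLAST)
    return beats
```

A single request with ARLEN = 2 thus returns three data beats with no further address phases, which is where the AXI4 burst mode gains its throughput over AXI4-Lite.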
When the master needs to read data from a slave,
it sends over the dedicated channel bo