ALMA MATER STUDIORUM ● UNIVERSITÀ DI BOLOGNA
Scuola di Ingegneria e Architettura
Dipartimento di Ingegneria dell’Energia Elettrica
e dell’Informazione “Guglielmo Marconi” - DEI
Corso di Laurea in Ingegneria Elettronica e Telecomunicazioni
Tesi di Laurea
in
Fisica Generale T-2
Testing Platform for a PCIe-based readout and
control board to interface with new-generation
detectors for the LHC upgrade
Anno Accademico 2016/2017
Sessione I
Relatore:
Chiar.mo Prof. Mauro Villa
Correlatore:
Chiar.mo Prof. Alessandro Gabrielli
Candidato:
Giulio Masinelli
“Cui dono lepidum novum libellum
arida modo pumice expolitum?
Corneli, tibi: namque tu solebas
meas esse aliquid putare nugas.”
Catullo, Carme I
Abstract
This thesis is devoted to the understanding and validation of a PCIe board, named
Pixel-ROD, designed by the electronics laboratory of the Istituto Nazionale di
Fisica Nucleare and by professors of the Department of Physics and Astronomy
to meet the needs of the ATLAS experiment at CERN, in Geneva.
The thesis work consisted in developing a testing platform to be used to verify the
correct functionality of the board and its ability to respond to repeated and
continuous stimuli, with particular emphasis on the memory subsystem and on the
PCIe interface. The board was designed as a replacement for the off-detector
electronics currently installed for the ATLAS experiment, made up of VME boards
known as Back of Crate (BOC) and Read Out Driver (ROD). The choice of the
PCIe standard follows the now well-established trend of employing FPGA boards
to speed up the real-time computation performed on PCs. These boards usually
consist of dedicated PCBs and need to be connected to the computer motherboard
through an adequate interface. Since this interface usually constitutes the
bottleneck of such systems, the latest-generation demo boards communicate
through the high-performance PCIe interface. Moreover, PCIe boards can be
directly connected to the motherboard of PCs dedicated to data acquisition, thus
allowing both a faster response (having direct access to the main resources of the
host machine) and an easier installation.
This thesis provides a brief overview of the environment for which the board was
designed. In particular, after this introduction, the first chapter presents the
ATLAS experiment, focusing on the detectors closest to the interaction point,
where the collisions between the proton beams of the accelerator take place. The
second chapter describes the current off-detector electronics as well as its main
limitations. The third chapter outlines the PCIe standard, focusing on the aspects
that proved most important during the testing phase. The fourth chapter provides
the main motivations behind the design choices, such as the decision to build the
board as the union of two demo boards, adopting the well-proven Master-Slave
architecture of the BOC-ROD pair. Finally, the fifth chapter describes the testing
platform that was developed and all the tests the board underwent.
The work was mainly focused on the development of an intensive test involving
the memories installed on the board, and on the creation of a testing platform
providing the hardware needed to verify the possibility of performing data
transfers between the host PC and the on-board memories through the PCIe
interface.
Index
1 The LHC accelerator and the ATLAS experiment ........................................ 3
1.1 The Large Hadron Collider ....................................................................................3
1.1.1 Machine Parameters ......................................................................................4
1.1.2 Main experiments at LHC .............................................................................5
1.2 The ATLAS detector .............................................................................................5
1.2.1 The coordinate system of ATLAS .................................................................5
1.3 The Layout of the ATLAS detector ........................................................................6
1.3.1 Inner Detector ...............................................................................................7
1.3.2 Semiconductor Tracker .................................................................................9
1.3.3 Pixel Detector ...............................................................................................9
1.4 Structure of Pixel Detector ................................................................................... 10
1.4.1 IBL ............................................................................................................. 10
1.4.2 Sensors for IBL ........................................................................................... 11
1.4.3 FE-I4 .......................................................................................................... 12
2 Current off detector electronics for IBL ...................................................... 15
2.1 IBL electronics .................................................................................................... 15
2.2 IBL BOC ............................................................................................................. 17
2.2.1 BOC Control FPGA .................................................................................... 18
2.2.2 BOC Main FPGA........................................................................................ 18
2.3 IBL ROD............................................................................................................. 19
2.3.1 ROD Master................................................................................................ 19
2.3.2 ROD Slaves ................................................................................................ 19
2.4 TIM ..................................................................................................................... 19
2.5 System limitations ............................................................................................... 20
3 PCIe specifications and usage model ............................................................ 21
3.1 Architecture of a PCIe system .............................................................................. 21
3.2 Interconnection .................................................................................................... 22
3.3 Topology ............................................................................................................. 24
3.4 Electrical specifications and I/O Lines ................................................................. 26
3.5 PCIe Address Space ............................................................................................ 27
3.6 Device Tree ......................................................................................................... 31
3.6.1 Linux PCIe Subsystem ................................................................................ 33
3.7 The three Layers of the protocol .......................................................................... 34
3.7.1 Transaction Layer ....................................................................................... 35
3.7.2 Data Link Layer .......................................................................................... 35
3.7.3 Physical Layer ............................................................................................ 37
4 Pixel-ROD board ........................................................................................... 39
4.1 Xilinx KC705 ...................................................................................................... 39
4.1.1 Why PCIe ................................................................................................... 40
4.1.2 Kintex-7 FPGA ........................................................................................... 41
4.2 Xilinx ZC702 ...................................................................................................... 41
4.3 The Pixel-ROD board .......................................................................................... 42
4.3.1 Space constraints ........................................................................................ 43
5 Pixel-ROD test results ................................................................................... 47
5.1 Power supply ....................................................................................................... 47
5.2 Board interfaces and memory .............................................................................. 48
5.2.1 Vivado Design Suite ................................................................................... 48
5.3 Memory test ........................................................................................................ 48
5.3.1 Vivado IP Integrator and AXI4 Interface ..................................................... 49
5.3.2 Architecture of the memory test .................................................................. 54
5.4 Hardware Testing Platform .................................................................................. 59
5.5 PCIe interface test ............................................................................................... 61
5.5.1 Architecture of the PCIe interface test ......................................................... 62
Conclusions ...................................................................................................... 67
Bibliography ..................................................................................................... 69
Introduction
This thesis concerns the understanding and validation of a PCIe board, named
Pixel-ROD, developed by the Electronic Design Laboratory of INFN (Istituto
Nazionale di Fisica Nucleare) and by professors of the DIFA (Dipartimento di Fisica
e Astronomia) to meet the needs of the ATLAS experiment at CERN, in Geneva.
The thesis work consisted in the development of a test bench used to verify the
board's full functionality and its response to stressful stimuli, with emphasis on the
memory subsystem and the PCIe interface. The board was designed as an upgrade
for the current off-detector electronics at ATLAS, replacing the previous
series of readout boards, which are mainly made up of VME boards known as Back
of Crate (BOC) and Read Out Driver (ROD). The choice of a PCIe card follows the
growing trend of exploiting FPGA boards to speed up real-time calculations
performed on a PC. Such boards are usually mounted on an external PCB, so they
must be connected to the motherboard of the host computer via an appropriate
interface. Since this interface is often the bottleneck of such systems, new-generation
FPGA evaluation boards communicate via PCIe. But the advantages of PCIe go
further than this: PCIe boards can be directly plugged into the motherboard of the
ATLAS TDAQ PCs, providing a faster response (through direct access to the main
resources of the PCs) and an easier installation.
The thesis is intended to provide a brief overview of the environment the board was
developed to be installed in. After this Introduction, Chapter 1 summarizes the
ATLAS experiment, focusing on the detectors closest to the interaction point.
Chapter 2 describes the off-detector electronics the board is meant to replace,
concentrating on their limitations. Chapter 3 outlines the PCIe protocol, focusing
on the aspects crucial for the board validation. Chapter 4 illustrates how the board
was developed as a merging of two demo boards, in order to exploit the Master-Slave
architecture of the BOC-ROD pair as well as to narrow down the possible sources
of mistakes. Finally, Chapter 5 describes the testing platform and the tests that have
been carried out.
The work of this thesis was mainly focused on thoroughly testing the board's memory
subsystem, in order to validate the device response to stressful stimuli, and on
developing a testing platform providing the hardware needed to fully verify the ability
to perform PCIe data transactions to a host PC, together with the related performance.
Chapter 1
The LHC accelerator and the ATLAS experiment
In this chapter, a brief overview of the LHC accelerator complex and of the ATLAS
experiment is presented. More details are provided for the ATLAS pixel detector, for which
the readout board described in the next chapters has been designed.
1.1 The Large Hadron Collider
The Large Hadron Collider (LHC) is the largest and most powerful particle accelerator on
Earth. It is located in the circular tunnel which housed the LEP (Large Electron-Positron
collider) in Geneva at the border between France and Switzerland and it is managed by the
European Organization for Nuclear Research, also known as CERN (Conseil Européen
pour la Recherche Nucléaire). CERN is a collaboration between 20 European member
states and non-member states from the rest of the world. The accelerator is approximately
27 km in circumference and lies 100 m below the ground [1]. There are four interaction
points where protons or lead ions are forced to collide at high energies (see figure 1.1). At
the four interaction points, gigantic experiments (ALICE, ATLAS, CMS and LHCb) are
set up to record every detail of the particle collisions. The acceleration chain consists of
several steps (see figure 1.2). Indeed, protons are not directly inserted into the beam pipe
of the main ring, but they begin their acceleration in the linear accelerator LINAC 4, where
they are accelerated from rest to an energy of 50 MeV. After being sent to the Booster to
be accelerated to 1.4 GeV, they are injected into the Proton Synchrotron to be accelerated
to 25 GeV. The last acceleration before the Large Hadron Collider is provided by the Super
Proton Synchrotron where they are accelerated to 450 GeV. Finally, in the LHC, protons
are accelerated from 450 GeV to 6.5 TeV.
Figure 1.1: LHC overview.
1.1.1 Machine Parameters
The nominal maximum collision energy for protons in the LHC is 14 TeV; however, the
accelerator is currently running at a collision energy of 13 TeV, i.e. 6.5 TeV per proton beam. At
this level of energy, protons move with a speed very close to the speed of light in vacuum.
The proton beams consist of 2808 bunches of protons. Each bunch contains about 10^11
particles, so that many proton collisions can happen at each bunch crossing. The
protons are held in the accelerator ring by 1232 superconducting dipole magnets that create
a maximum magnet field of 8.3 T. Along the beam line, 392 quadrupole magnets are used
to focus the particles at the interaction points and defocus them afterwards. The Large Hadron
Collider is built to have a peak instantaneous luminosity of L = 10^34 cm^-2 s^-1 (the ratio
of the number of events detected in a certain time to the interaction cross-section).
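These machine parameters can be put together in a quick numerical sanity check: the Lorentz factor of a 6.5 TeV proton, its speed relative to c, and the interaction rate implied by the quoted luminosity. This is an illustrative sketch only; the proton rest energy and the ~80 mb inelastic cross-section are assumed values, not figures taken from the text.

```python
import math

M_P_GEV = 0.938272   # proton rest energy in GeV (assumed value, not from the text)
E_BEAM_GEV = 6500.0  # 6.5 TeV per beam, as quoted

def lorentz_gamma(energy_gev, mass_gev=M_P_GEV):
    """Relativistic gamma factor of a proton with the given total energy."""
    return energy_gev / mass_gev

def beta(energy_gev, mass_gev=M_P_GEV):
    """Proton speed as a fraction of the speed of light in vacuum."""
    g = lorentz_gamma(energy_gev, mass_gev)
    return math.sqrt(1.0 - 1.0 / g ** 2)

def event_rate_hz(luminosity_cm2_s, cross_section_mb):
    """Interaction rate R = L * sigma, with 1 mb = 1e-27 cm^2."""
    return luminosity_cm2_s * cross_section_mb * 1e-27

print(f"gamma = {lorentz_gamma(E_BEAM_GEV):.0f}")   # ~6900: ultra-relativistic
print(f"beta  = {beta(E_BEAM_GEV):.10f}")           # extremely close to 1
print(f"rate  = {event_rate_hz(1e34, 80.0):.1e} interactions/s")
```

With an assumed ~80 mb cross-section, the design luminosity corresponds to roughly 10^9 interactions per second, which is the scale that motivates the high-bandwidth readout electronics discussed in the following chapters.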
Figure 1.2: the accelerator complex at CERN: LHC protons start from LINAC4 and are
accelerated by the Booster, the Proton Synchrotron and the Super Proton Synchrotron
before entering the LHC.
1.1.2 Main experiments at LHC
As already stated, at the four interaction points, where protons or lead ions are collided,
there are four detectors (fig. 1.1):
• ATLAS, A Toroidal LHC ApparatuS: being one of the two general-purpose
detectors at LHC, it was designed for the search for the Higgs Boson and it
investigates a wide range of physics problems, ranging from searches concerning
Beyond Standard Model physics (Dark Matter, Supersymmetry) to precision
measurements of the Standard Model parameters.
• CMS, Compact Muon Solenoid: the second general-purpose detector at LHC. It
shares the scientific goals of the ATLAS experiment, although it uses different
technical solutions and a different magnet-system design.
• LHCb, Large Hadron Collider beauty: specialized in investigating the
differences between matter and antimatter by studying particles containing the beauty quark.
• ALICE, A Large Ion Collider Experiment: a heavy-ion collision detector. It
studies the properties of strongly interacting matter at extreme energy densities,
where matter forms a new phase called the Quark-Gluon Plasma.
1.2 The ATLAS detector
The ATLAS experiment is a general-purpose particle detector installed at the LHC which
is used to study various kinds of physical phenomena. It is 46 m long, 25 m high, 25 m
wide and weighs 7000 tons [2]. It is operated by an international collaboration with
thousands of scientists from all over the world: more than 3000 scientists from 174
institutes in 38 countries work on the ATLAS experiment. Like the CMS and ALICE
detectors, ATLAS has a cylindrical symmetry.
1.2.1 The coordinate system of ATLAS
To properly and precisely describe the ATLAS detector, a coordinate system is introduced.
In this reference frame, collision events are described using right-handed spherical
coordinates: the origin is set at the nominal interaction point, the z-axis lies along the
beam direction and the x-y plane is transverse to it, with the positive x-axis pointing to
the center of the LHC ring and the positive y-axis pointing upwards (see figure 1.3) [3].
A vector can therefore be described using the azimuthal angle φ, the polar angle θ and
the radius r, as shown in figure 1.3. Instead of θ, however, it is usual to use the
pseudorapidity η, defined as η = −ln tan(θ/2). Its value ranges from −∞, corresponding
to a vector along the negative semi-axis of z, to +∞, corresponding to a vector along the
positive semi-axis of z; the value η = 0 thus corresponds to a vector in the x-y plane.
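The relation between θ and η can be sketched numerically. This is a minimal illustration of the definition given above, not material from the thesis:

```python
import math

def pseudorapidity(theta_rad):
    """eta = -ln(tan(theta/2)), theta being the polar angle in radians."""
    return -math.log(math.tan(theta_rad / 2.0))

def theta_from_eta(eta):
    """Inverse relation: theta = 2 * atan(exp(-eta))."""
    return 2.0 * math.atan(math.exp(-eta))

# A vector in the x-y plane (theta = 90 degrees) has eta = 0, as stated above
print(pseudorapidity(math.pi / 2.0))       # ~0 (up to floating-point rounding)

# |eta| = 2.5, the Inner Detector coverage, is only ~9.4 degrees from the beam axis
print(math.degrees(theta_from_eta(2.5)))
```

The second print shows why pseudorapidity is a convenient coordinate: equal steps in η pack progressively more solid angle toward the beam line, matching how collision products are distributed.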
1.3 The Layout of the ATLAS detector
Figure 1.4: ATLAS detector overview
Figure 1.3: LHC's ATLAS well and its coordinate system.
The structure of the ATLAS detector is illustrated in a cutaway view in Figure 1.4. It is built
with a cylindrical symmetry around the interaction point and geometrically divided into a
barrel region (low η region), two endcap regions (medium η region) and two forward
regions (high η region). The full detector is made up of several groups of sub-detectors,
designed to identify and record the particles coming out of the proton-proton collisions.
From inwards to outwards, these sub-detectors form three systems: the Inner Detector
(ID), the Calorimeters and the Muon Spectrometer. The Inner Detector is surrounded by a
central solenoid that provides a 2 T magnetic field, while the barrel and end-cap sections of
the Muon Spectrometer are surrounded by toroidal magnets that provide fields of 0.5 T and
1 T, respectively. Particles produced in the proton-proton collisions first arrive in the
Inner Detector, which covers the region |η| < 2.5. Charged particles interact with the
different layers of the detector, depositing part of their energy there. This energy is
converted into electronic signals, and the resulting hits are used to reconstruct the particle
trajectory. The momenta and charge of these charged particles can be measured from the
bending of their trajectories inside the 2 T magnetic field provided by the central solenoid.
As the innermost system of ATLAS, the ID provides essential information, such as the
reconstruction of the primary and secondary vertices from which charged particles are seen to emerge.
The ID is therefore designed to have a high granularity and a high momentum measurement
resolution. To meet the performance requirement, semiconductor detectors are used for
precise measurement close to the beam (the Pixel Detector and the Semiconductor Tracker)
and a noble gas detector is used in the outer layer (the Transition-Radiation Tracker).
Further away from the collision point, the Calorimeters can be found. They are composed
of the electromagnetic calorimeters and the hadronic calorimeters, which are designed to
identify electrons/photons or hadrons, respectively, and to measure their energy and coordinates.
The position information is obtained by segmenting the calorimeters longitudinally and
laterally. The calorimeters will not stop muons as they interact very little with the
calorimeter absorber. Therefore, muons will pass through the full detector and arrive in the
outermost layers of the ATLAS detector: the muon spectrometer. Figure 1.5 shows the
detector response to different particles, using a schematic transverse section view of the
ATLAS detector.
1.3.1 Inner Detector
As stated, the Inner Detector (ID) is placed closest to the beam line; therefore, its design
must ensure excellent radiation hardness and long-term stability in addition to adequate
performance. As shown in figure 1.6, the full ID is a cylinder 6.2 m long, with a diameter
of 2.1 m and a coverage of |η| < 2.5.
Figure 1.5: section view of ATLAS detector in the transverse plane, illustrating
layers’ positioning.
Figure 1.6: a section view of the ATLAS Inner Detector
The ID is segmented into cylindrical structures in the barrel region, while it has coaxial
sensor disks in the end-cap regions. As shown in figure 1.7, the ID in the barrel region is
made up of three main layers. In the following paragraphs, the two innermost ones are
presented, from the most external to the most internal one.
Figure 1.7: structure and arrangement of the layers of Inner Detector in the
barrel region.
1.3.2 Semiconductor Tracker
The SemiConductor Tracker (SCT) is a tracker made up of silicon strips, with a technology
similar to the one employed in the Silicon Pixel Detector. Each SCT layer has two sets of
strips glued back-to-back with a 40 mrad angle in between, to measure both the lateral
and the longitudinal coordinate. The choice of a silicon strip detector is mainly motivated
by two facts: the large area to be covered (about 63 m^2) and the small particle occupancy.
In the SCT region, fewer than one track per event crosses each SCT chip, allowing a
considerable reduction in the number of instrumented channels while keeping spatial
accuracy and noise levels within the design limits.
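The small-angle stereo arrangement can be illustrated with a back-of-the-envelope resolution estimate: the 40 mrad angle is from the text, while the 80 µm strip pitch below is an assumed typical value, not a figure quoted in the thesis.

```python
import math

STEREO_ANGLE_RAD = 0.040  # 40 mrad stereo angle, as quoted in the text
STRIP_PITCH_UM = 80.0     # assumed typical SCT strip pitch (not in the text)

def binary_resolution_um(pitch_um):
    """Single-layer resolution for binary strip readout: pitch / sqrt(12)."""
    return pitch_um / math.sqrt(12.0)

def z_resolution_um(pitch_um, stereo_rad):
    """Longitudinal resolution of a small-angle stereo pair: the in-plane
    resolution divided by sin(stereo angle). A small angle trades z
    precision for unambiguous pairing of the two strip measurements."""
    return binary_resolution_um(pitch_um) / math.sin(stereo_rad)

print(f"in-plane resolution: {binary_resolution_um(STRIP_PITCH_UM):.0f} um")
print(f"z resolution       : {z_resolution_um(STRIP_PITCH_UM, STEREO_ANGLE_RAD):.0f} um")
```

Under these assumptions the stereo pair measures the in-plane coordinate to a few tens of micrometers while still giving a useful (hundreds of micrometers) measurement along z from a single detector technology.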
1.3.3 Pixel Detector
The innermost and most important detector is the Pixel Detector. It is designed to have
the finest granularity of all the sub-detectors; indeed, being very close to the interaction
region, it faces the highest track density. The system consists of four cylindrical layers,
named as follows, going outwards: Insertable B-Layer (IBL), B-Layer (L0), Layer 1 (L1)
and Layer 2 (L2).
1.4 Structure of Pixel Detector
As stated, the current configuration of the Pixel Detector consists of four layers, as shown
in Figure 1.7. Together, the L0, L1 and L2 layers are composed of 112 long staves, each
made of 13 modules tilted along the z axis by 1.1 degrees toward the interaction point
(Figure 1.8b); furthermore, to allow overlapping, the staves are tilted by 20 degrees in the
x-y plane (see Figure 1.8a).
As regards the sensors, 16 Front End (FE-I3) chips, a flex-hybrid, a Module Controller Chip
(MCC) and a pigtail together form what is called a module. The FE-I3s are responsible for
reading the charge signal from the pixels. Each FE-I3 is 195 µm thick, with a top surface of
1.09 cm by 0.74 cm, and counts 3.5 million transistors in 250 nm CMOS technology. They
are bump-bonded onto the sensors (Figure 1.9) and each one has an analog amplifier able
to discriminate signals of 5000 electrons with a noise threshold of 200 electrons. The
module collects the signals from the sensors and packs them into a single data event which is
sent to the ROD board.
1.4.1 IBL
Figure 1.8: staves disposition around the beam pipe (a), and modules layout inside (b).
The Insertable B-Layer is a pixel detector inserted, together with a narrower beam-pipe,
inside the B-Layer. Being very close to the interaction point imposes constraints that are
not needed in the other layers: the electronics must be much more radiation hard and the
sensitive area needs to cover more than the 70% of the surface achieved in the B-Layer.
To achieve those objectives a new front-end readout chip, called FE-I4, was developed,
leading to an active area of 90% [4].
1.4.2 Sensors for IBL
The IBL's modules and sensors differ from the original ATLAS ones because of the
technology chosen for the pixels. There were two main candidates:
• planar;
• 3D.
The main characteristics of these two technologies are explained hereafter, as well as the
upgrade from the FE-I3 chip to the FE-I4.
Figure 1.9: silicon sensor and read-out chip (FE-I3) bump-bonded.
Figure 1.10: cross-sectional schematic of the n-in-p planar pixel sensor.
Planar sensors were used within the B-Layer too, but the requirements on the IBL ones
were much stricter: the inactive border had to shrink from the 1 mm of the old sensors to
450 µm. Several studies have been performed since the B-Layer pixels were produced,
and it is now clear that a thinner irradiated sensor collects more charge. One of the
adopted variants of planar sensors is the so-called conservative n-in-n design. It uses
sensors which are already known to work, while trying to fulfill all the requirements for
the IBL, such as the 450 µm inactive-edge limit. Moreover, the pixel length in z has been
reduced to match the new 250 µm pixel cell length of the FE-I4 [4] (Figure 1.10).
On the other hand, the geometry of 3-D sensors is completely different from the planar one
(see Figure 1.11). Their wafers take advantage of recent advances in silicon technology
that produce column-like electrodes penetrating the substrate instead of being implanted
on the surface as in the planar technology [5]. Indeed, they are built using plasma
micro-machining to etch deep narrow apertures in the silicon substrate, which form the
electrodes of the junctions [6]. Since the collected charge is low, it needs to be read from
two electrodes at once. Another downside of these sensors is that the noise increases with
the number of electrodes and is even affected by their diameter.
Nevertheless, the active area of full 3-D sensors extends much closer to the surface,
reducing the non-sensitive volume. The faces of 3-D sensors, independently of the type,
are much closer to one another, allowing a much lower bias voltage (150 V versus the
1000 V of a planar sensor). This also leads to a lower leakage current and thus less
cooling. When a particle passes through the electrode area, the efficiency is diminished by
3.3%. This effect only concerns perpendicular particles and thus will not affect the IBL,
since its sensors are tilted by 20°.
1.4.3 FE-I4
Figure 1.11: two types of 3-D sensors, double sided (a) and full 3-D (b).
FE-I4 (Figure 1.12) is the new ATLAS pixel chip developed to be used in
upgraded-luminosity environments, in the framework of the Insertable B-Layer (IBL)
project but also for the outer pixel layers of the Super-LHC. FE-I4 is developed in a
130 nm CMOS process, in an 8-metal option with 2 thick aluminum top layers for
enhanced power routing. Care has been taken to separate the analog and digital power
nets. Thanks to the reduced gate-oxide thickness, the 130 nm CMOS process shows an
increased radiation tolerance with respect to previous, less scaled processes. The reasons
behind the redesign of the pixel front-end FE-I3 came from several aspects related to
system issues and to the physics performance of the pixel detector. With the smaller
innermost-layer radius of the IBL project and an increased luminosity, the hit rate
increases to levels which the FE-I3 architecture is not capable of handling, while the
FE-I4 can sustain an average hit rate of 400 MHz/cm^2 with less than 1% data loss. It
was shown that the FE-I3 column-drain architecture scales badly with high hit rates and
increased front-end area, leading to unacceptable inefficiencies for the IBL (see Figure
1.13). To avoid that, the FE-I4 was designed to store hits locally, getting rid of the
column-drain based transfer.
The FE-I4 pixel size is also reduced, from 50 x 400 µm^2 to 50 x 250 µm^2, which
reduces the pixel cross-section and enhances the single-point resolution in the z direction.
FE-I4 is built up from an array of 80 by 336 pixels, each pixel being subdivided into an
analog and a digital section. The total FE-I4 active size is 20 mm (z direction) by 16.8 mm
(φ direction), with about 2 mm more foreseen for the periphery, leading to an active area
close to 90% of the total. The front-end is now a standalone unit, avoiding the extra
routing needed for a Module Controller Chip for communication and data output.
Communication and output blocks are included in the periphery of the front-end. Going to
a larger front-end size is beneficial with respect to the active-over-total area ratio, as well
as for the building of modules and staves. This leads to more integrated stave and barrel
concepts, and therefore reduces the amount of material needed per detector layer. This
material reduction provides a better overall tracking measurement, since the probability
of small or large scatterings of particles inside the tracker itself is reduced. One of the
main advantages of having a large front-end is also the cost reduction.
Figure 1.12: FE-I4 layout.
Figure 1.13: performance of the column-drain readout architecture [7].
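The quoted pixel-array size and pitch can be cross-checked against the stated active dimensions and the "close to 90%" active fraction. A minimal sketch, assuming the ~2 mm periphery extends the chip along the φ side (an assumption: the text does not specify on which side the periphery sits):

```python
# Figures quoted in the text: 80 x 336 pixel array, 250 x 50 um pixels,
# active size 20 mm (z) by 16.8 mm (phi), ~2 mm of periphery.
PIXEL_Z_MM = 0.250
PIXEL_PHI_MM = 0.050
N_COLS, N_ROWS = 80, 336

def active_size_mm():
    """Active area dimensions implied by the pixel matrix."""
    return N_COLS * PIXEL_Z_MM, N_ROWS * PIXEL_PHI_MM

def active_fraction(periphery_mm=2.0):
    """Active-over-total area ratio; the periphery is assumed to extend
    the chip along the phi side (the text does not specify the side)."""
    z_mm, phi_mm = active_size_mm()
    return (z_mm * phi_mm) / (z_mm * (phi_mm + periphery_mm))

z_mm, phi_mm = active_size_mm()
print(f"active size: {z_mm:.1f} x {phi_mm:.1f} mm")  # 20.0 x 16.8 mm, matching the text
print(f"active fraction: {active_fraction():.1%}")   # ~89%, i.e. "close to 90%"
```

The pixel matrix (80 x 250 µm, 336 x 50 µm) reproduces the stated 20 mm x 16.8 mm active area exactly, and the 2 mm periphery gives an active fraction of about 89%, consistent with the "close to 90%" figure in the text.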
Chapter 2
Current off detector electronics for IBL
Hereafter, the current set-up of the off-detector electronics for the Insertable B-Layer is
presented, in order to describe the environment in which the new Pixel-ROD board was
conceived and to understand the requirements that it needs to fulfill.
High-energy physics experiments usually distinguish between on-detector and off-detector
electronics referring to the front-end electronics implemented near the detector itself and
to the readout system that can be implemented far from the detector. While in the first
scenario radiation resistance is a fundamental requirement, in the second one it is less
compelling, allowing the employment of more powerful devices.
2.1 IBL electronics
The IBL readout requires an appropriate off-detector system that is schematically shown in
Figure 2.1.
Figure 2.1: schematic block of the IBL readout system
This readout system is made up of several components:
• Back of Crate (BOC) board:
• Optical modules to interface the FE-I4 chips with the BOC board;
• S-Link modules for sending data from the BOC board to the ATLAS TDAQ
system.
• Read Out Driver (ROD) board:
• Gigabit Ethernet to send front-end calibration histograms.
• VME Crate;
• TTC Interface Module (TIM);
• Single Board Computer (SBC).
FE-I4 data are received by the BOC board via the RX optical modules, then 8b/10b
decoding is performed before passing the data to the ROD, which processes them. During
physics runs, the events to be sent to the ATLAS TDAQ (Trigger and Data Acquisition)
system are sent back to the BOC, where 4 S-Link modules are implemented. S-Link
stands for Simple LINK and is a simple interconnection protocol that also implements
error reporting and test functions. Each BOC-ROD pair can interface and route data
coming from 16 IBL modules (32 FE-I4 chips, for a total input bandwidth of 5.12 Gb/s).
The whole IBL readout requires 15 BOC-ROD pairs, which can all be placed in a single
VME crate: one BOC-ROD pair for each of the 14 staves of the IBL detector (each stave
counting 32 FE-I4s), plus one pair to serve the diamond beam monitor detector [8].
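The quoted 5.12 Gb/s figure follows directly from the link count and line rate. A small sketch, where the 80% payload factor of 8b/10b encoding (10 line bits carrying 8 data bits) is a property of the encoding and not a figure taken from the text:

```python
def input_bandwidth_gbps(n_links=32, line_rate_mbps=160):
    """Aggregate raw input bandwidth of one BOC-ROD pair:
    32 FE-I4 links at 160 Mb/s each, as quoted in the text."""
    return n_links * line_rate_mbps / 1000.0

def payload_bandwidth_gbps(n_links=32, line_rate_mbps=160):
    """Useful payload after 8b/10b decoding: every 10 line bits carry
    8 data bits (a property of the encoding, not a figure from the text)."""
    return input_bandwidth_gbps(n_links, line_rate_mbps) * 8 / 10

print(input_bandwidth_gbps())    # 5.12 Gb/s, as quoted in the text
print(payload_bandwidth_gbps())  # ~4.1 Gb/s of decoded data
```

This also makes clear why the 8b/10b decoding is done on the BOC: the ROD only ever sees the ~4.1 Gb/s of useful data, not the raw line rate.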
Figure 2.2: visual layout of the data acquisition system. In red the normal data
path, in blue the deviation for the generation of the histogram
Figure 2.2 illustrates the data path: the 32 front-end FE-I4 chips drive 32 serial lines, each
supporting 160 Mb/s, connected to the BOC board via optical links. Here the signal from
each line is converted from optical to electrical, then demultiplexed onto a 12-bit-wide bus,
which proceeds towards the ROD board through the VME backplane connector.
After that, in order to build the data frame that has to be sent to the TDAQ computers, the
ROD board begins the data formatting. After being transmitted to the ROD board, data can
take two different paths, as shown in Figure 2.2.
In the first one, ROD data are sent back to the BOC, where four S-Link modules forward
the data towards the ATLAS TDAQ PCs, reaching a total output bandwidth of 5.12 Gb/s;
in the second one, data are delivered to a PC for histogram processing (exclusively used
during calibration runs to properly calibrate the FE-I4 chips).
2.2 IBL BOC
The BOC board (Figure 2.3) is responsible for handling the control interface to the detector, as well as the data interface from the detector itself. Another main task of the BOC is to provide the clock to the connected front-end chips. Furthermore, a Phase Locked Loop
(PLL) generates copies of this clock for the ROD and the detector. The detector clock is
then handled by the FPGAs and coded into the control streams for the individual detector
modules.
The IBL BOC contains three Xilinx Spartan FPGAs:
• one BOC Control FPGA (BCF);
• two BOC Main FPGAs (BMF).
Figure 2.3: IBL BOC board
2.2.1 BOC Control FPGA
The BOC Control FPGA is responsible for the overall control and data shipping of the board. An embedded soft processor (a MicroBlaze) is instantiated on this FPGA, mainly to provide Ethernet access to the card, but it can also implement some self-tests for the board, and it is responsible for configuring the other FPGAs by loading configuration data from a FLASH memory accessed via SPI.
2.2.2 BOC Main FPGA
The two main FPGAs encode the configuration data of the FE-I4 front-end chips connected to the ROD into a 40 Mb/s serial stream that is sent straight to the FE-I4s themselves. These two FPGAs also manage the deserialization of the incoming data from the front-end chips; after data collection and word alignment, the decoded data are sent to the ROD board. On the transmission side, these two Spartan FPGAs also manage the optical connection via four S-Links to the ATLAS TDAQ system.
Figure 2.4: IBL ROD board
2.3 IBL ROD
The Insertable B-Layer Read Out Driver (IBL ROD) [9] is a board developed to substitute the older ATLAS Silicon Read Out Driver (SiROD), which is used by the ATLAS off-detector electronics sub-system to interface with the SemiConductor Tracker (SCT) and the Pixel L0, L1 and L2 front-end detector modules.
The board's main tasks are data gathering and event fragment building during physics runs, and histogramming during calibration runs.
As stated, during runs, the board receives data and event fragments from the 32 FE-I4 chips
and transforms them into a ROD data frame, which is sent back to the ATLAS TDAQ
through the BOC S-Link connections (see Figure 2.2).
2.3.1 ROD Master
An FPGA (Xilinx Virtex-5 XC5VFX70T-FF1136) is the Master of the Read-Out Driver, which acts as the interface with the front-end chips. This FPGA also contains a PowerPC, an embedded hard processor. Its main task is sending the event information to the two slave FPGAs [10].
2.3.2 ROD Slaves
The two FPGAs that work as slaves on the ROD board are Xilinx Spartan-6 XC6SLX150-FGG900 devices. They implement an embedded soft processor, the MicroBlaze.
All data generated by IBL during ATLAS experiments pass through these two FPGAs and
are collected inside the on-board RAM (SODIMM DDR2 2GB); moreover, during
calibration runs, histograms can be generated and sent to the histogram server.
2.4 TIM
The TTC (Timing, Trigger and Control) Interface Module (TIM) acts as the interface between the ATLAS Level-1 Trigger system signals and the pixel Read-Out Drivers using the LHC-standard TTC and Busy systems. In particular, the board is designed to propagate the TTC
clock all over the experiment: for what concerns the IBL off-detector electronics, the TIM
sends the clock to the BOC board, which then propagates it to the ROD, as stated above.
Furthermore, the TIM receives and propagates triggers through the custom backplane.
2.5 System limitations
Since the ROD board described in the previous section met all the strict requirements of the ATLAS experiment, it was decided to adopt the same system for all the other layers (L0, L1, L2) as well. But while the IBL electronics required 15 ROD boards, the remaining layers required about 110 boards. Even if, at the moment, the space occupation is clearly high but still sustainable, it already shows the limits of this system: the future upgrade of the whole LHC detector to the higher-luminosity HL-LHC (whose luminosity will be raised by up to a factor of 10) will need more and more boards to face a much higher data rate [11].
Indeed, the link bandwidth is proportional to the product of occupancy (which is a function
of the luminosity) and the trigger rate of the front-end devices [8]. It is expected that the hit rate for IBL at 140 pile-up will rise to 3 GHz/cm² and the readout rate to 4.8 Gbit/s per chip for a 1 MHz trigger rate [12].
Chapter 3
PCIe specifications and usage model
In the so-called Long Shutdown of the LHC accelerator foreseen for 2023, the whole LHC
detector will be upgraded to the higher-luminosity HL-LHC. In particular, the nominal luminosity will be raised to roughly ten times the current one [11], implying that the electronics have to withstand a much higher data rate. Although many different read-out
electronics boards have been presented, they all share a common feature: high flexibility
and configurability with PCIe interface as well as powerful FPGAs connecting to many
optical transceivers [13]. So, before the description of the board named Pixel-ROD, an
overall architectural perspective of the PCIe technology is provided.
3.1 Architecture of a PCIe system
PCI Express (PCIe, which stands for Peripheral Component Interconnect Express) is a high-performance I/O device interconnection bus used in mobile, workstation, desktop, server, embedded computing and communication platforms in order to expand the
capabilities of a host system by providing slots where expansion cards can be installed. It
has established itself as the successor to PCI providing higher performance, increased
flexibility and scalability. Indeed, since the so-called first generation of buses (ISA, EISA, VESA and Micro Channel) and the second one (PCI, AGP, and PCI-X), PC buses have doubled in performance roughly every three years, while processors have doubled in performance in half that time, following Moore's Law. In addition, although PCI enjoyed remarkable success, there was evidence that a multi-drop, parallel bus implementation was close to its practical limit of performance, as it cannot be easily scaled
up in frequency or down in voltage: it faced a series of challenges such as bandwidth
limitations, host pin-count limitations, lack of real-time data transfer and signal skew
limited synchronously clocked data transfer. Indeed, the bandwidth of the PCI bus and its derivatives can be significantly less than the theoretical value due to protocol overhead and
bus topology. All approaches to pushing these limits to create a higher bandwidth, general-
purpose I/O bus result in large cost increases for little performance gain [14]. So, there was
the need to engineer a new generation of PCI to serve as a standard I/O bus for future
generation platforms. There have been several efforts to create higher bandwidth buses and
this has resulted in a PC platform supporting a variety of application-specific buses
alongside the PCI I/O expansion bus. Indeed, PCIe offers a serial architecture that alleviates
some of the limitations of parallel bus architectures by using clock data recovery (CDR)
and differential signaling. Using CDR as opposed to source synchronous clocking lowers
pin-count, enables superior frequency scalability and makes data synchronization easier.
Moreover, PCIe was designed to be software-compatible with PCI (so older software systems are still able to detect and configure PCIe cards, although without PCIe features such as access to the extended Configuration Space, which will be discussed in the next paragraphs).
3.2 Interconnection
A PCIe interconnect that connects two devices together is referred to as a Link. It consists of x1, x2, x4, x8, x12, x16 or x32 signal pairs in each direction. These signals are referred to as Lanes. So, a x1 Link consists of 1 Lane, i.e. 1 differential signal pair in each direction for a total of 4 signals (each Lane constitutes a full-duplex communication channel). A x32 Link consists of 32 Lanes, i.e. 32 signal pairs in each direction, for a total of 128 signals [15]. In order to make the bus software backwards-compatible with PCI and
PCI-X systems (predecessor buses), it maintains the same usage model and load-store
communication model. When it comes to the differences, PCI and PCI-X buses are multi-
drop parallel interconnect buses in which many devices share the same bus, while PCIe
implements a serial, point-to-point type interconnection for communication between
devices. In systems requiring multiple devices to be interconnected, interconnections are
made possible thanks to switches. The point-to-point interconnection leads to a limited electrical load on the Link, overcoming the limitations of a shared bus. Moreover, as stated,
a serial interconnection results in fewer pins per device package which reduces PCIe board
design cost and complexity. Another significant feature is the possibility to implement
scalable numbers for pins and signal Lanes that allows a huge flexibility according to
communication performance requirements (PCIe specifications defined operations for a
maximum of 32 Lanes). So, the size of PCIe cards and slots varies depending upon the number of supported Lanes. During hardware initialization, the Link width and frequency of operation are automatically negotiated by the devices at the opposite ends of the Link without involving any kind of firmware. A packet-based communication protocol
is used over the serial interconnection. Packets are serially transmitted and received and byte-striped across the available Lanes. This feature contributes to keeping the device pin count low and to reducing system cost, since Hot Plug, power management, error handling and interrupt signaling are accomplished in-band using packet-based messaging instead
of side-band signals. Each packet to be transmitted over the Link consists of Bytes of
information. The first generation of the standard has a transmission/reception rate of 2.5
Gbits/s per Lane per direction (it has been doubled in the second generation). PCIe standard
also specifies three clocking architectures: Common Refclk, Separate Refclk, and the
already cited Clock Data Recovery. Common Refclk specifies a 100 MHz clock (Refclk),
with greater than ±300 ppm frequency stability at both the transmitting and receiving
devices. It was the most widely supported architecture among the first commercially
available devices. However, the same clock source must be distributed to every PCIe device
while keeping the clock-to-clock skew to less than 12 ns between devices. This can be a
problem with large circuit boards or when crossing a backplane connector to another circuit
board. In the case a low-skew configuration is not workable, the Separate Refclk
architecture, with independent clocks at each end, can be used. The clocks do not have to
be more accurate than ±300 ppm, because the PCIe standard allows for a total frequency
deviation of 600 ppm between transmitter and receiver. Finally, the Clock Data Recovery
architecture is the simplest, as it requires only one clock source, at the transmitter [16] and
it has become the most used configuration. In this scenario, since there is no clock signal
on the Link, the receiver uses a PLL to recover a clock from the 0-to-1 and 1-to-0 transitions
of the incoming bit stream. To allow for clock recovery on the signal line independently of the data transmitted, a DC-balanced protocol is used. Every Byte of data to be
transmitted is converted into 10-bit code via an 8b/10b encoder in the transmitter device
(so 10-bit symbols are employed). The consequence is a 25% additional overhead to
transmit a byte of data. All symbols, in order to be compatible with the Clock Data
Recovery architecture, are guaranteed to have one-zero transitions. PCIe implements a
dual-simplex Link capable of transmitting and receiving data simultaneously on a transmit
and receive Lane. So, to obtain the aggregate bandwidth (which assumes simultaneous
traffic in both directions) the transmission/reception rate has to be multiplied by 2 then by
the number of Lanes (the so-called Link width) and finally divided by 10 to account for the 10-bit-per-Byte encoding. For example, a first-generation x1 PCI Express Link has an aggregate throughput of 0.5 GBytes/s, while a x32 PCI Express Link reaches 16 GBytes/s.
In the case of the second version of the PCIe protocol (named PCIe gen. 2), the previous values are doubled. Indeed, in this case, the data transfer rate is raised to 5 GTransfers/s, which means that every Lane can transfer up to 5 Gbit/s using the
8b/10b encoding format [17]. As a side note, the third version of the protocol not only increased the data-transfer rate, raising it to 8 GTransfers/s, but also changed the encoding format from 8b/10b to 128b/130b (to reduce the protocol overhead) [18].
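The throughput arithmetic described above can be sketched in a few lines; the rates and encoding efficiencies are those quoted in the text, while the function name is illustrative:

```python
# Aggregate full-duplex throughput of a PCIe Link, per the figures in the
# text: 8b/10b encoding for generations 1 and 2, 128b/130b for generation 3.
GEN_RATE_GTPS = {1: 2.5, 2: 5.0, 3: 8.0}          # GT/s per Lane per direction
GEN_EFFICIENCY = {1: 8 / 10, 2: 8 / 10, 3: 128 / 130}

def aggregate_throughput_gbytes(gen, lanes):
    """Both directions combined, in GBytes/s of payload (1 Byte = 8 bits)."""
    gbits = GEN_RATE_GTPS[gen] * GEN_EFFICIENCY[gen] * lanes * 2
    return gbits / 8

print(aggregate_throughput_gbytes(1, 1))   # 0.5  GBytes/s, as in the text
print(aggregate_throughput_gbytes(1, 32))  # 16.0 GBytes/s
```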
3.3 Topology
In order to understand the topology of the PCIe standard, some definitions are provided:
PCIe end-point: PCIe device to be connected.
Root complex: host controller that connects the CPU of the host machine to the rest of the
PCIe devices. PCIe has its own address space consisting of either 32 or 64 bits depending
upon the Root-Complex and it is only visible by PCIe components like the Root-Complex,
end-points, switches and bridges. The Root-Complex can interrupt the CPU for any of the events generated by the Root-Complex itself or by any of the PCIe devices. Moreover, it can also access the memory without CPU intervention (acting as a sort of DMA controller). PCIe end-points can use this feature to write/read data to/from the memory. In order to do so, the Root-Complex makes the end-point the bus master (giving it permission to access the memory) and generates the corresponding memory address.
Bridge: it provides forward and reverse bridging allowing designers to migrate local bus,
PCI, PCI-X and USB bus interfaces to the serial PCIe architecture.
Switch: born to replace the multi-drop bus used in PCI and to provide fan-out for the I/O bus, it is also used to realize peer-to-peer communication between different end-points; this traffic, if it does not involve cache-coherent memory transfers, need not be forwarded to the host bridge.
Figure 3.1 shows how the PCIe components (Root-Complex, bridges, end-points and
switches) are interconnected to PCIe Links.
Figure 3.1: PCIe topology
As stated, the Root-Complex allows the connection of many PCIe end-points. This task is accomplished thanks to root ports that can be directly connected to end-points, to a bridge, or to a switch connected to several end-points. In the case of the Root-Complex or switches, in order to implement a point-to-point topology (which means that a single serial link connects two devices), multiple virtual PCI-to-PCI bridges are used. These are the devices that connect multiple buses together, providing a (virtual) PCI bridge for the up-stream PCIe connection and one (virtual) PCI bridge for each down-stream PCIe connection (Figure 3.2). An identification number is assigned to each bus by the software during the enumeration process; this number is used by switches and bridges to identify the path of a transaction. Every switch or bridge must store three bus numbers:
the primary bus number (that reflects the number of the bus the switch is connected to), the
secondary bus number (identifying the bus with the lowest number that can be reached)
and the subordinate bus number (the bus with the highest number that can be reached).
Figure 3.2: SoC detail
In the case of the switch of the previous example (see Figure 3.3), the primary bus number is 3, the secondary bus number is 4, and the subordinate bus number is 8. So, any transaction targeted at a bus from 4 to 8 will be accepted and handled by the switch [19].
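The routing decision based on the primary/secondary/subordinate scheme described above can be sketched as follows (a simplification that covers only the downstream-acceptance check; names are illustrative):

```python
# Sketch of how a switch uses its three stored bus numbers to decide whether
# a transaction should be accepted and forwarded downstream.
def switch_accepts(target_bus, primary, secondary, subordinate):
    """True if the target bus lies in the downstream range of this switch."""
    return secondary <= target_bus <= subordinate

# The switch of the example: primary bus 3, secondary 4, subordinate 8.
print(switch_accepts(6, primary=3, secondary=4, subordinate=8))   # True
print(switch_accepts(9, primary=3, secondary=4, subordinate=8))   # False
```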
3.4 Electrical specifications and I/O Lines
Having adopted a serial bus technology, PCIe uses far fewer I/O lines than PCI. As stated, PCIe devices employ differential drivers and receivers (a pair of differential TX lines and a pair of differential RX lines for each Lane) implementing a high-speed LVDS (Low-Voltage Differential Signaling) electrical signaling standard [15]. The differential driver is AC-coupled to the differential receiver at the opposite end of the Link thanks to a capacitor on the driver side, which isolates the DC levels. This means that the two devices at the opposite ends of a Link can use different DC common-mode voltages (range: 0 V to 3.6 V). The differential signal is
derived by measuring the voltage difference between two terminals. Logical values: a
positive voltage difference between the positive terminal and the negative one implies
Logical 1. On the other hand, a negative voltage difference between the same terminals
implies a Logical 0. Finally, when the driver is put in a high-impedance tristate condition
(also called Electrical-Idle or low-power state of the Link), the two terminals are driven at
the same potential. Let the voltage with respect to ground on each conductor be V_D+ and V_D−. The differential peak-to-peak voltage is defined as 2 · max|V_D+ − V_D−|. To signal a logical 1 or a logical 0, the differential peak-to-peak voltage driven by the transmitter must be between 800 mV (minimum) and 1200 mV (maximum). Conversely, during the Link Electrical Idle state, the transmitter drives a differential peak voltage between 0 mV and 20 mV.
Figure 3.3: Switch detail
20 mV. As stated, the receiver is able to sense a logical 1, a logical 0 as well as the Electrical
Idle state of the Link, by detecting the voltage on the Link via a differential receiver
amplifier. Due to signal loss along the Link, the receiver must be designed to sense an
attenuated version of the differential signal driven by the transmitter. The receiver sensitivity is fixed to a differential peak-to-peak voltage between 175 mV and 1200 mV, while the electrical-idle detect threshold can range from 65 mV (minimum) to 175 mV (maximum). Any voltage less than 65 mV peak-to-peak implies that the Link is in the
Electrical Idle state [15].
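The receiver-side thresholds quoted above can be summarized in a small classification sketch; the specific idle threshold chosen here (100 mV) is an assumption, one design point inside the allowed 65 mV to 175 mV band:

```python
# Classify a received differential level (peak-to-peak, in mV) according to
# the thresholds quoted in the text. The idle threshold is a design choice
# inside the allowed 65-175 mV band (100 mV here is an assumed example).
def classify_rx_level(vdiff_pp_mv, idle_threshold_mv=100):
    if vdiff_pp_mv < idle_threshold_mv:
        return "electrical-idle"
    if 175 <= vdiff_pp_mv <= 1200:
        return "valid-signal"
    return "out-of-spec"

print(classify_rx_level(40))    # electrical-idle
print(classify_rx_level(800))   # valid-signal
```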
PCIe specifications also define other auxiliary signals: the differential clock REFCLK used in the Common Refclk clocking architecture, a +12V power rail, PERST# (fundamental reset) to indicate when power and clock are stable, and presence signals (PRSNT1# and PRSNT2#) for hot-plug detection (see Figure 3.4). As stated, unlike PCI, PCIe does not use dedicated interrupt lines but relies on in-band signaling transmitted through the differential TX and RX lines.
3.5 PCIe Address Space
The host system can access any of the PCIe end-points only by using the PCIe Address Space (Figure 3.5). It is important to note that this address space is virtual; there is no physical memory associated with it: it only represents a list of addresses used by the Transaction Layer (explained later) in order to identify the target of a transaction.
Figure 3.4: I/O Lines
The Root-Complex also has configuration registers (to configure the Link width, the frequency and the Address Translation Unit that translates CPU addresses into PCIe ones), a Configuration Space that contains all the information regarding the end-points (such as device ID and vendor ID), and registers to configure the end-points (used, for example, to put a device into low-power mode). PCIe specifications defined the Configuration Space to be backward-compatible with PCI but increased its size from 256 B to 4 kB. The first 64 Bytes are
standard (they are called the standard headers) and both PCIe and PCI defined two types
of standard headers: type 1 (containing info regarding root-ports, bridges and switches
(such as primary, secondary and subordinate bus numbers)) and type 0 (containing info
regarding end-points). Every PCIe component has its own Configuration Space. Figure 3.6
shows the standardized type 0 header that is present in the Configuration Space of a PCIe
end-point (only the first 64 Bytes are shown). It contains information regarding the device
(device ID, vendor ID, Status and Command used by the host system to configure and
control the end-point), the header type (that differentiates type 0 from type 1 headers) and
Base Address Registers used to configure the Memory Space. The mechanism that
determines the address to which the Configuration Space of a particular end-point should
be mapped in the PCIe Address Space is called Enhanced Configuration Access
Mechanism (ECAM).
Figure 3.5: PCIe Address Space
Figure 3.6: Configuration Space Header of an end-point
The address created is a function of bus number, device number, function and register
number (Figure 3.7).
Figure 3.7: Enhanced Configuration Access Mechanism
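The address packing sketched in Figure 3.7 can be illustrated in code; the exact bit positions below follow the standard ECAM layout (bus in bits 27-20, device in 19-15, function in 14-12, register offset in 11-0), which should be checked against the specification for a real implementation:

```python
# Sketch of the ECAM offset computation: bus, device, function and register
# number are packed into one address, as in Figure 3.7.
def ecam_offset(bus, device, function, register=0):
    """Standard ECAM bit layout: bus<<20 | device<<15 | function<<12 | reg."""
    return (bus << 20) | (device << 15) | (function << 12) | register

# The end-point at Bus 1, Device 0, Function 0 maps to 100000h (Figure 3.5).
print(hex(ecam_offset(1, 0, 0)))  # 0x100000
```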
For example (see Figure 3.5), the Configuration Space of the first end-point (Bus:1,
Device:0, Function:0) is mapped to the address 100000h of the PCIe Address Space. The
host system is capable of reading the Configuration Space of an end-point thanks to the
Configurable Address Space present in the Root-Complex, which has a region (CFG0) of 4 kB (matching the size of the Configuration Space). Indeed, the CPU can only access
the Root-Complex internal registers and cannot directly read the Configuration Spaces of
PCIe devices. To do so, Configurable Address Space has to be “connected” to the
Configuration Space of the end-point. To be able to do that, Root-Complex implements an
Address Translation Table which has to be programmed with the source address (A in the
example of Figure 3.8) that is an address in the Configurable Space, the destination address
(the ECAM address (for example 100000h)) and the size (4 kB in the case of an access to
the Configuration Space). So, when the CPU accesses the CFG0 region of the Configurable Address Space, the Address Translation Unit makes sure that the Root-Complex accesses the ECAM address corresponding to the Configuration Space of the desired end-point.
Naturally, multiple end-points can be connected to the Root-complex. For example, a PCIe
Bridge (see Figure 3.9). In the example, its ECAM address (mapped in the PCIe Address
Space) is 200000h. To access the configuration space of the PCIe Bridge, the same region
in the Configurable Address Space can be used (thus programming the Address Translation Table differently). There can be other devices connected to the Bridge (for
example, an end-point). Again, using the Enhanced Configuration Mechanism, the end-
point Configuration Space is mapped in the PCIe Address Space. PCIe specifications define
a new type of transaction in order to access devices connected beyond the bridge: there is
a second region in the Configurable Address Space (CFG1) dedicated to this task. The rest
of the Configurable Address Space can be used for I/O Space (generally 64 kB) and
Memory Space (where the peripheral registers and memory are mapped). The Memory
Space of an end-point cannot be accessed in the same way CFG0 and CFG1 are, because its size may vary from card to card. So, the host system must know the size of the
Memory Space of each end-point. To do so, the host uses the information stored in the Base Address Registers present in the Configuration Space Header (see Figure 3.10). Once
the host system has got the size of the Memory Space of an end-point, it allocates an equal
amount of memory in the Configurable Address Space (in the region dedicated to Memory
Space) that will be used to access the Memory Space of the PCIe end-point. In order to do
so, the Configurable Address Space has to be mapped in the PCIe Address Space. Again,
this is made possible thanks to the Address Translation Table. In the example of Figure 3.9,
Figure 3.8: addressing methods with Address
Translation Table
the Address Translation Table is programmed to have: source address B (the first address
of the Configurable Address Space dedicated to Memory Space), destination address B
(address in the PCIe Address Space) and size equal to 256 MB since the whole Memory
Space of the Configurable Address Space needs to be mapped in the PCIe Address Space.
Figure 3.9: PCIe Address Space
Now, whenever the CPU accesses these regions in the Configurable Address Space,
because of Address Translation Table, the Root-Complex will access the corresponding
regions in the PCIe Address Space. It has to be noted that, at this point, the PCIe end-points will not respond to accesses in this region because their Memory Space is not mapped into the PCIe Address Space yet. To do so, the Root-Complex has to access the Base Address Registers and write the starting address of the PCIe Address Space to which the Memory Space of the end-point has to be mapped. After that, the Memory Space of the PCIe end-point will respond to any memory request that targets this region of the PCIe Address Space, since there will be a one-to-one mapping between its Memory Address and the PCIe Address [19].
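The step in which the host learns how much Memory Space an end-point needs relies on the classic BAR sizing handshake: write all-ones to a Base Address Register, read it back, and the bits hardwired to zero reveal the region size. A minimal sketch (the readback value is a made-up example for a 256 MB region, matching the size used in the text):

```python
# Sketch of BAR sizing: after the host writes 0xFFFFFFFF to a 32-bit memory
# BAR, the read-back value has the size-alignment bits hardwired to zero.
def bar_region_size(readback, address_mask=0xFFFFFFF0):
    """Size in bytes implied by the value read back after writing all-ones.
    The low 4 bits of a memory BAR hold type flags, hence the mask."""
    return (~(readback & address_mask) & 0xFFFFFFFF) + 1

# An end-point with a 256 MB Memory Space hardwires bits below 2**28 to zero:
readback = 0xF0000000          # hypothetical value read back
print(bar_region_size(readback))  # 268435456 bytes = 256 MB
```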
3.6 Device Tree
A device tree (a data structure used to describe the hardware in Linux-based operating systems) is created for devices that are not enumerated dynamically (Figure 3.11). So, in this case, device-tree nodes are only created for the Root-Complex (the end-points are enumerated dynamically). The properties configured in the device tree are shared among all the end-points. One of the main properties that can be configured is a field of the structure called “ranges”. It is used to program the Address Translation Unit (see Figure 3.12).
Figure 3.10: Configuration Space Header
Figure 3.11: example of Device Tree
Figure 3.12: each cell is 32 bits wide. The first 3 cells are dedicated to the PCI address (the first one contains flags and the others store the address proper, since the PCIe Address Space has a maximum width of 64 bits), the fourth cell stores the CPU address (an address of the Configurable Address Space) and the last one contains the size information. This information is used to program the Address Translation Unit.
3.6.1 Linux PCIe Subsystem
Figure 3.13: Linux PCIe Subsystem: at the bottom, the Root-Complex platform driver.
Each platform can have its own Root-Complex driver, which is responsible for initializing the Root-Complex registers, programming the Address Translation Unit, extracting the I/O resource and memory information from the “ranges” property of the device tree and invoking an API in the PCI-BIOS layer in order to start the enumeration process. The PCI-BIOS layer performs BIOS-type initialization. It provides the aforementioned API used by the Root-Complex drivers and invokes the PCI Core to start bus scanning. So, the PCI Core
scans the bus by using the callback provided by the Root-Complex driver in order to read
the Configuration Space of the PCIe end-points. During bus scanning, when the PCI Core finds a device with a known device ID and vendor ID, it loads the corresponding PCIe device driver. This driver stores all the information about the PCIe end-point and it also provides the
implementation of the interrupt handlers for any of the interrupts that can be raised by the
PCIe card. Moreover, each end-point can also interact with its own domain specific upper-
layer (for example, an Ethernet card can interact with the Ethernet stack).
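The driver-matching step described above can be sketched with a lookup keyed on (vendor ID, device ID); the table below is hypothetical (0x8086/0x100E is the classic Intel e1000 Ethernet example, 0x10EE is the Xilinx vendor ID with a made-up device ID), not the actual Linux matching code:

```python
# Sketch of the device-ID / vendor-ID matching performed during bus
# scanning. The table entries are illustrative, not a real driver database.
DRIVER_TABLE = {
    (0x10EE, 0x7024): "xilinx_pcie_ep",   # hypothetical Xilinx end-point
    (0x8086, 0x100E): "e1000",            # classic Intel Ethernet example
}

def match_driver(vendor_id, device_id):
    return DRIVER_TABLE.get((vendor_id, device_id), "no driver: device ignored")

print(match_driver(0x8086, 0x100E))   # e1000
```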
3.7 The three Layers of the protocol
As stated, PCIe is a standard that uses a packet-based communication system and its architecture is specified in layers. Indeed, the protocol is made up of three layers: the Transaction, Data Link and Physical Layers, as shown in Figure 3.14. Layered protocols have been used for years in data communication. Indeed, they permit isolation between the different functional areas of the protocol and allow upgrading one or more layers without requiring updates of the others. So, a revision of the protocol might affect the physical media with no major effects on the higher layers [14].
Figure 3.14: layers of the PCIe architecture
The software layers will generate read and write requests that are transported by the
transaction layer to the I/O devices using a packet-based, split-transaction protocol. There
are two main packet types: Transaction Layer Packets (TLPs) and Data Link Layer Packets
(DLLPs). While DLLPs are meant for service communication between PCIe constitutive
elements, TLPs are the packets that move the data from and to devices.
3.7.1 Transaction Layer
The Transaction Layer is the highest layer of the PCI Express architecture. It receives read
and write requests from the software layer and creates request packets (TLPs) for
transmission to the Data Link Layer. On the transmitting side of a PCIe transaction, TLPs are formed with protocol information (type of transaction, recipient address, transfer size, etc.) inserted in header fields. As stated, PCIe does not use dedicated interrupt lines but
relies on in-band signaling. This method of propagating system interrupts was introduced
as an alternative to the hard-wired sideband signal in PCI rev 2.2 specifications and it was
made the primary method of interrupt processing in the PCIe protocol. Transactions are divided into posted and non-posted transactions. While posted transactions do not need any response packet, non-posted transactions need a reply. In this case, the Transaction
Layer has to receive the response packets from the Data Link Layer and to match these
with the original software requests. This task can be easily accomplished since each packet
has a unique identifier that enables response packets to be directed to the correct originator.
The packet format supports 32-bit memory addressing and extended 64-bit memory
addressing. Packets also have attributes such as “no-snoop,” “relaxed-ordering” and
“priority”, which may be used to prioritize the flow throughout the platform [14]. These can be used, for example, to process streaming data first in order to avoid delivering real-time data late.
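The matching of response packets to their originating requests via a unique identifier, described above, can be sketched as a tag table; class and method names are illustrative, not PCIe terminology:

```python
# Sketch of how a Transaction Layer matches completions of non-posted
# requests back to their originators via a unique identifier (tag).
class TransactionLayer:
    def __init__(self):
        self._pending = {}          # tag -> originating request
        self._next_tag = 0

    def send_non_posted(self, request):
        """Assign a unique tag and remember the request until its reply."""
        tag = self._next_tag
        self._next_tag += 1
        self._pending[tag] = request
        return tag                  # the tag travels in the TLP header

    def receive_completion(self, tag, data):
        """Match an incoming completion to the original software request."""
        request = self._pending.pop(tag)
        return request, data

tl = TransactionLayer()
tag = tl.send_non_posted("MemRd @0x1000")
req, data = tl.receive_completion(tag, b"\x2a")
print(req)   # MemRd @0x1000
```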
3.7.2 Data Link Layer
The Data Link Layer acts as an intermediate stage between the Transaction Layer and the Physical Layer. Its primary duty is to provide a reliable mechanism for the exchange of
the TLPs by appending a 32-bit cyclic redundancy check (CRC-32) and a sequence ID for
data integrity management (packet acknowledgement and retry mechanisms) (see Figure
3.15).
In order to reduce packet retries (and the associated waste of bus bandwidth), a credit-based
fair queuing is adopted: a “credit” is accumulated to queues as they wait for service and it
is spent by queues while they are being serviced. Queues with positive credit are eligible
for service [20] (see Figure 3.16). This scheduling algorithm ensures that packets are only transmitted when it is known that a buffer is available to receive the packet at the other end. For incoming TLPs, the Data Link Layer accepts them from the Physical Layer and checks the sequence number and the CRC. If an error is detected, the layer communicates the need to resend; otherwise, the TLP is delivered to the Transaction Layer.
Figure 3.15: each Layer appends a header and a
tail to the packet
Figure 3.16: credit-based fair queuing (traffic shaping)
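The credit mechanism described above can be sketched as a simple counter: the transmitter spends credits when it sends a TLP and receives them back when the far end frees buffer space. Class and method names are illustrative, and the credit granularity is simplified to abstract units:

```python
# Minimal sketch of credit-based flow control: a TLP is sent only when the
# receiver has advertised enough buffer credits for it.
class CreditedLink:
    def __init__(self, initial_credits):
        self.credits = initial_credits       # advertised by the receiver

    def try_send(self, tlp_credits_needed):
        """Transmit only if the far-end buffer can hold the packet."""
        if self.credits >= tlp_credits_needed:
            self.credits -= tlp_credits_needed
            return True                      # TLP goes on the wire
        return False                         # wait: no buffer at the far end

    def credit_update(self, returned):
        """Receiver freed buffer space and returned credits (via a DLLP)."""
        self.credits += returned

link = CreditedLink(initial_credits=2)
print(link.try_send(1))   # True
print(link.try_send(2))   # False: only 1 credit left
link.credit_update(2)
print(link.try_send(2))   # True
```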
3.7.3 Physical Layer
The Physical Layer interfaces the Data Link Layer with the signaling technology for Link data interchange. There are two sections of the Layer: the transmit logic and the receive logic, responsible for the transmission and reception of packets, respectively. These sections, in turn, are made up of a logical sub-layer and an electrical sub-layer. As the description of the electrical part was already provided (see paragraph 3.4), only the logical sub-layer will be discussed.
On the transmission side, its main tasks are: framing the packet with start-of-packet and end-of-packet bytes (see Figure 3.15); splitting the Byte data across the Lanes (in multi-Lane Links) (see Figure 3.17); Byte scrambling, to reduce electromagnetic emissions (by dispersing the power spectrum over a wider frequency band) and to facilitate the work of the clock recovery (by removing long sequences of ‘0’s or ‘1’s); and 8b/10b encoding and serialization of the 10-bit symbols before sending them across the Link to the receiving device. On the receiving side, the dual task is performed: deserialization, 8b/10b decoding, Byte descrambling, data reassembly (in multi-Lane Links) and unframing.
Figure 3.17: single-lane Link byte stream (on the left) and the splitting of the data across the lanes in the case of a 4-lane Link (on the right).
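Two of these steps are easy to sketch. The striping rule of Figure 3.17 sends byte i to lane i mod N, and scrambling XORs the stream with an LFSR-generated keystream; because the keystream does not depend on the data, descrambling is the same operation. The tap positions below are illustrative, not the exact PCIe Gen2 scrambler polynomial.

```python
def stripe_bytes(data: bytes, lanes: int) -> list:
    """Round-robin byte striping: byte i goes to lane i mod lanes."""
    return [data[i::lanes] for i in range(lanes)]

def scramble(data: bytes, seed: int = 0xFFFF) -> bytes:
    """XOR the byte stream with the output of a 16-bit LFSR. Applying
    scramble() twice restores the original bytes, since the keystream
    evolves independently of the data."""
    state, out = seed, bytearray()
    for byte in data:
        keystream = 0
        for _ in range(8):                      # 8 LFSR steps per byte
            bit = (state >> 15) & 1
            keystream = (keystream << 1) | bit
            feedback = bit ^ ((state >> 4) & 1) ^ ((state >> 3) & 1) ^ ((state >> 2) & 1)
            state = ((state << 1) | feedback) & 0xFFFF
        out.append(byte ^ keystream)
    return bytes(out)
```

For a 4-lane link, lane 0 carries bytes 0, 4, 8, 12, …, exactly as in the figure, and the scrambled stream no longer contains the long constant runs the clock-recovery circuit would struggle with.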
Chapter 4
Pixel-ROD board
The following paragraphs describe the board, still at the prototype stage, developed as a replacement for the previous series of readout boards employed in the ATLAS Pixel Detector. To take advantage of all the experience and effort invested in the ROD board (and to allow firmware portability), it was decided to keep working with FPGAs from Xilinx, upgrading to the 7-Series family. Moreover, building on the successful Master-Slave architecture of the previous boards, the Pixel-ROD was conceived as a merging of two Xilinx evaluation boards: the KC705 (which constitutes the slave device) and the ZC702 (master section). This approach also allows a huge speed-up of the design and debugging process, since the newly developed hardware and software testing platforms can first be validated on already tested and highly reliable boards (to try out the testing platforms themselves) before being applied to the prototype. Since the tests discussed later are meant to validate the slave section of the board, a brief general overview of the KC705 is provided before the description of the Pixel-ROD.
4.1 Xilinx KC705
As stated, the slave unit of the Pixel-ROD board is mainly based on a Xilinx evaluation board, the KC705. This board, shown in Figure 4.1, has a massive range of applications and its primary features are listed below [21]:
• Kintex-7 28nm FPGA (XC7K325T-2FFG900C);
• 1GB DDR3 memory SODIMM 800MHz/1600Mbps;
• PCIe gen. 2 8-lane endpoint connectivity;
• SFP+ connector;
• 10/100/1000 tri-speed Ethernet with Marvell Alaska 88E1111 PHY;
• 128MB Linear BPI Flash for PCIe Configuration;
• USB-to-UART bridge;
• USB JTAG via Digilent module;
• Fixed 200 MHz LVDS oscillator;
• I2C programmable LVDS oscillator;
Figure 4.1: Xilinx KC705 demo board
4.1.1 Why PCIe
The KC705 is a PCIe board, and the reasons behind the adoption of the PCIe interface lie not only in the fact that PCIe is the best candidate to replace the slower VME buses (whose data rate is limited to the 320 MB/s of VME320), but also in the possibility of a new installation configuration. Indeed, one or two PCIe boards can be directly connected to the motherboard of the TDAQ PCs, providing a faster response (giving direct access to the main resources of the PCs) and an easier installation. This configuration is the most likely to be adopted for the experimental phase that will start after the Long Shutdown scheduled for 2023, not only in the ATLAS experiment but also in CMS. It follows the trend established by KC705-like boards, which are mainly designed to speed up real-time calculations performed in a PC. They are usually mounted on an external PCB, so they must be connected via an appropriate interface to the motherboard of the host PC. Since this interface is often the bottleneck of such systems, new-generation FPGA evaluation boards communicate via PCIe.
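A back-of-the-envelope comparison makes the gap concrete (illustrative Python, not from the thesis): a Gen2 lane signals at 5 GT/s, and 8b/10b encoding leaves 8 of every 10 transmitted bits as payload, so one lane carries 0.5 GB/s and the x8 link available on the KC705 reaches 4 GB/s before protocol overhead, against the 320 MB/s of VME320.

```python
def pcie_payload_bandwidth_gbs(line_rate_gt_s: float, lanes: int,
                               encoding_efficiency: float = 8 / 10) -> float:
    """Peak usable bandwidth in GB/s: line rate (GT/s) times the 8b/10b
    encoding efficiency gives usable Gb/s per lane; divide by 8 bits per
    byte, multiply by the number of lanes."""
    return line_rate_gt_s * encoding_efficiency / 8 * lanes

# Gen2 x8: 5 GT/s * 0.8 / 8 = 0.5 GB/s per lane, i.e. 4 GB/s for the link
```

This peak figure still excludes TLP headers, DLLPs and flow-control stalls, so sustained application throughput is somewhat lower.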
4.1.2 Kintex-7 FPGA
The Xilinx Kintex-7 XC7K325T-2FFG900 mounted on this board is a powerful medium-
range FPGA, that can be used to replace both Spartan-6 devices on the ROD board [22,
23]. Its key features are presented hereafter:
• Advanced high-performance FPGA with logic elements based on real 6-input
lookup tables (LUTs) that can be used to implement combinatorial logic or be
configured as distributed memory;
• High-performance DDR3 interface supporting up to 1866 Mb/s;
• High-speed serial connectivity with 16 built-in Gigabit transceivers (GTX)
having rates from 600 Mb/s to a maximum of 12.5 Gb/s, offering a special low-
power mode, optimized for chip-to-chip interfaces;
• A user configurable analog interface (XADC), incorporating dual 12-bit analog-
to-digital converters (ADC) with on-chip temperature and supply sensors;
• Powerful clock management tiles (CMT), combining phase-locked loop (PLL) and
mixed-mode clock manager (MMCM) blocks for high precision and low jitter;
• Integrated block for PCI Express (PCIe), for up to x8 Gen2 Endpoint and Root Port
designs;
• 500 maximum user I/Os (excluding GTX) and 16 Mb of Block RAM (BRAM).
4.2 Xilinx ZC702
The demo board from which the master section was derived is the Xilinx ZC702. It was chosen among the many boards in the Xilinx catalogue because its FPGA embeds a hard processor (two ARM Cortex-A9 cores) [24] that will substitute the hard processor implemented on the Virtex-5 of the ROD board (see paragraph 2.31).
4.3 The Pixel-ROD board
As stated, the Pixel-ROD was conceived as a merging of the KC705 and ZC702. Naturally, many features of the two boards had to be removed, since they are not necessary for a read-out board (Xilinx demo boards are not application-specific), while many others had to be redesigned or completely developed from scratch, as they needed to be shared across the whole new board. Removed features include the LCD display, the SD card reader, the HDMI port, and a few GPIOs and LEDs. On the other hand, to implement a ROD-like Master-Slave architecture, a 21-bit differential bus was added between the two FPGAs to provide the necessary communication, as well as a 1-bit differential line to provide a common clock. Moreover, another 5-bit-wide single-ended bus was introduced as a general-purpose interconnection bus. One of the other features that needed a complete redesign was the JTAG chain, which had to include both FPGAs. To do so, a 12-pin (3x4) header (see Figure 4.2) was added to allow the Kintex to be excluded from the JTAG chain, in order to prevent unwanted programming of the slave FPGA. In addition, another internal JTAG connection from the Zynq to the Kintex was added. It allows the programming of the slave FPGA with the desired firmware, using the Zynq FPGA as master. This was very helpful during the debugging sessions: indeed, since the Pixel-ROD board was installed inside a PC, it was very difficult to access the JTAG port.
Figure 4.2: custom JTAG configuration header. In blue the full JTAG chain, in red the internal JTAG chain that excludes the Kintex.
The main devices and features implemented on the Pixel-ROD board are the following:
• Kintex-7 28 nm FPGA (XC7K325T-2FFG900C);
• Zynq 7000 FPGA (XC7Z020-1CLG484C), featuring two ARM Cortex A9
MPCore;
• 2 GB DDR3 memory SODIMM (Kintex DDR);
• 1 GB DDR3 component memory (Micron MT41J256M8HX-15E, Zynq DDR3);
• PCI Express Gen2 8-lane endpoint connectivity;
• SFP+ connector;
• Three VITA 57.1 FMC Connectors (one HPC, two LPC);
• Two 10/100/1000 tri-speed Ethernet with Marvell Alaska PHY;
• Two 128 Mb Quad SPI flash memory;
• Two USB-to-UART bridges;
• USB JTAG interface (using a Digilent module or header connection);
• Two fixed 200 MHz LVDS oscillators;
• I2C programmable LVDS oscillator;
As of now, a single Pixel-ROD can interface up to 16 equivalent FE-I4 channels (half of the 32 channels of the BOC-ROD pair).
4.3.1 Space constraints
The stack-up of a board defines the composition, the thickness and the function of each layer of a Printed Circuit Board (PCB). As stated, the constraints of the Pixel-ROD were derived from two Xilinx demo boards, but a one-to-one mapping was not possible, since all the resulting layers needed to be merged into a single 16-layer stack-up. In fact, the maximum number of layers is fixed by the PCIe standard in order to respect the constraint on the thickness of the board (otherwise, it would not fit into the PCIe slot): the allowable thickness ranges from 1.44 to 1.70 mm [25]. The stack-up adopted is shown in Figure 4.3: the 16 PCB layers were used to provide the necessary space for the high number of traces while ensuring the alternation of signal and ground layers, as well as the concentration of the power layers in the innermost section of the board, in order to reduce the cross-talk between planes and to reach the required level of insulation.
An example of a PCB layer is presented in Figure 4.4. Another constraint on the size was due to the PC case the board will be installed in: the maximum length was therefore set to 30 cm, leaving little margin for device placement. Finally, the height was left free, to allow sufficient room for all the necessary devices. The result of all these efforts is presented in Figure 4.5.
Figure 4.3: Stack-up of Pixel-ROD.
Figure 4.5: the Pixel-ROD prototype. In blue, the components connected to the Kintex FPGA; in red, the ones related to the Zynq FPGA; in yellow, the power stage.
Figure 4.4: one of the 16 layers.
Chapter 5
Pixel-ROD test results
Because of its complexity, the Pixel-ROD had to pass through several testing stages in order to verify its correct behavior. The stages were divided into: a hardware wake-up phase, to ensure the hardware works correctly and can be properly configured, and the validation of all the board functionalities, by configuring, debugging and testing each device installed on the board. Since this thesis work concerns the creation of a testing platform meant not only for the validation of the PCIe interface of the board, but also for presenting and achieving a high-performance PCIe system capable of transferring data between the board and a PC, only a brief description of the hardware wake-up phase is provided. Conversely, the platform itself, as well as the firmware adopted, will be comprehensively covered.
5.1 Power supply
As stated, the first testing stage the board passed through involved the configuration of the board power-up, which led to the programming of the three UCD9248 power controllers. The Fusion Digital Power Designer tool from Texas Instruments allows the user to set many important parameters, such as the voltage of each rail and the power-up sequence for each of them (see Figure 5.1).
Figure 5.1: Fusion Digital Power Designer GUI, configuration of the voltage rails
5.2 Board interfaces and memory
When we started developing the test platform to perform PCIe transactions between a host PC and the board, the Ethernet subsystem, the internal bus connecting the Zynq to the Kintex, the SFP port, and all the interfaces of the two FPGAs had already been tested. Since all the previous tests, in conjunction with the newly developed ones, take advantage of the many tools provided by the Vivado Design Suite by Xilinx, a brief description of this CAD tool is provided.
5.2.1 Vivado Design Suite
The Vivado Design Suite is a tool suite developed to increase the overall productivity of designing, integrating, and implementing systems using many of the Xilinx devices that come with a variety of recent technologies, including high-speed I/O interfaces, hardened microprocessors and peripherals, analog mixed signals and more. The Vivado Design Suite allows for the synthesis and implementation of HDL designs, enabling developers to synthesize their designs, perform timing analysis, examine RTL diagrams, simulate a design's reaction to different stimuli and configure the target device with the programmer. The design implementation is accelerated thanks to place-and-route tools that analytically optimize multiple concurrent design metrics, such as timing, congestion, total wire length, utilization and power [26].
5.3 Memory test
Once the functionality of the power stage and of the Pixel-ROD interfaces had been consolidated, a complete memory test was designed to prove not only that basic functions could be performed on the board, but also its ability to sustain high-speed memory accesses. In particular, the test was meant to verify whether the RAM module accessible by the Kintex FPGA (a 2 GB SODIMM DDR3 memory bank) is subject to disturbance errors when repeated accesses are performed to the same memory bank, but in different rows, within a short period of time. These disturbance errors are caused by charge leakage: they occur when the repeated accesses cause charge loss in a memory cell before the cell contents can be restored at the next DRAM refresh interval. Moreover, as DRAM process technology scales down to smaller dimensions, it becomes more difficult to prevent DRAM cells from electrically interacting with each other [27].
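The access pattern behind such a disturbance (row-hammer) test can be sketched as follows. The actual test drives the DDR3 through the FPGA memory controller in firmware; this Python fragment only illustrates the row-address sequence such a test issues, and the double-sided variant shown (hammering both neighbours of a victim row) is one common choice, not necessarily the one adopted in the thesis.

```python
def hammer_addresses(victim_row: int, n_rows: int, activations: int) -> list:
    """Build the row-access sequence for a double-sided disturbance test:
    alternately activate the two rows adjacent to the victim, so the same
    bank sees a burst of row open/close cycles in different rows before
    the next refresh can restore the victim cells' charge."""
    above, below = victim_row - 1, victim_row + 1
    if above < 0 or below >= n_rows:
        raise ValueError("victim row must have two neighbours in the bank")
    sequence = []
    for _ in range(activations):
        sequence += [above, below]
    return sequence
```

After issuing such a sequence, the test reads back the victim row and compares it with the pattern originally written: any mismatch is a disturbance error.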
Another advantage of a test of this kind is that it brings the board closer to implementing its full functionality. In fact, it involves not only the bare trace interconnections between devices, but also specific ICs and, especially, the development of a complex firmware (which can become very time-consuming). In this respect, the smart design obtained by using the KC705 board as a reference (see chapter 4.3) sped up the entire process, since it made available a platform very similar to the Pixel-ROD board, on which the firmware could be validated before being loaded onto the tested board itself.
5.3.1 Vivado IP Integrator and AXI4 Interface
In order to develop the firmware for the test, the Intellectual Property (IP) Integrator tool, part of the Vivado Design Suite, has been used. Xilinx defines it as “the industry’s first plug-and-play system integration design environment”, since it allows the user to create complex system designs by instantiating and interconnecting IP cores from the Vivado IP catalog on a design canvas. In this way, the user can take advantage of the IP already available in the Vivado library to speed up the firmware development, which would otherwise take a considerable amount of time. Available with the Vivado Design Suite are many IP subsystems for Ethernet, PCIe, HDMI, video processing and image sensor processing. As an example, the AXI4 PCIe subsystem is made up of multiple IP cores, including PCIe, DMA and AXI4 Interconnect, and it is used to provide the software stack necessary for the developed testing platform. Therefore, before going into further details of the tests, a brief description of the main IP cores used, as well as of the AXI interface, is provided.
The AXI protocol
AXI stands for Advanced eXtensible Interface [28] and is part of the ARM Advanced Microcontroller Bus Architecture (AMBA), a family of open-standard microcontroller buses. AMBA is widely used on a range of ASIC and SoC parts, including the application processors used in modern portable mobile devices such as smartphones. Nowadays, it has become a de-facto standard for 32-bit embedded processors because of its exhaustive documentation and the absence of royalties. The AXI4 protocol is the default interface for IP cores and it was extensively used during the debugging of the Pixel-ROD board. The AXI4 protocol presents three key features: firstly, it provides a standardized interface between the many IPs (thus allowing the user to concentrate on the system debug rather than on the protocol itself); secondly, the AXI4 protocol is flexible, meaning that it suits a variety of applications, from single, light data transactions to bursts of 256 data transfers with just a single address phase; finally, since AXI4 is an industrial standard, it also grants access to the whole ARM environment.
There are three types of AXI4 interfaces:
• AXI4, used for high-performance memory-mapped operations;
• AXI4-Lite, used for simple, low-throughput memory-mapped communication;
• AXI4-Stream, used for high-speed data streams.
The AXI4 interface uses a Master-Slave architecture, and all AXI4 masters and slaves can be connected by means of a specific IP named Interconnect. Both AXI4 and AXI4-Lite define the following independent transaction channels:
• Read Address Channel;
• Read Data Channel;
• Write Address Channel;
• Write Data Channel;
• Write Response Channel.
The address channels carry control information that describes the nature of the data to be transferred. Data can simultaneously move in both directions between master and slave, and data transfer sizes can vary (see Figure 5.2). The limit in AXI4 is a burst transaction of up to 256 data transfers, while the AXI4-Lite interface allows only 1 data transfer per transaction.
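The single-address-phase burst can be sketched as follows. This is a behavioral toy model, not RTL: the memory dictionary and the function are invented for illustration, while the names in the comments (ARADDR, ARLEN, RDATA, RLAST) mirror the real AXI4 signal names.

```python
def axi4_read_burst(memory: dict, araddr: int, arlen: int, beat_bytes: int = 4):
    """Model of an AXI4 incrementing read burst: one address phase
    (ARADDR, ARLEN on the Read Address Channel) produces ARLEN + 1 data
    beats on the Read Data Channel, the last one flagged with RLAST.
    AXI4 allows up to 256 beats per address phase; AXI4-Lite only 1."""
    n_beats = arlen + 1
    assert 1 <= n_beats <= 256, "AXI4 burst length limit"
    beats = []
    for i in range(n_beats):
        data = memory.get(araddr + i * beat_bytes, 0)   # address increments per beat
        beats.append((data, i == arlen))                # (RDATA, RLAST)
    return beats
```

A single request with ARLEN = 2 thus returns three data beats with no further address phases, which is where the AXI4 burst mode gains its throughput over AXI4-Lite.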
When the master needs to read data from a slave,
it sends over the dedicated channel bo