

C.Schwick

### Contents

#### Introduction

The context: LHC & experiments

#### Part 1: Trigger at LHC

- Requirements & Concepts
- Muon and Calorimeter triggers (CMS and ATLAS)
- Specific solutions (ALICE, LHCb)
- Hardware implementation

#### Part 2: Data Flow, Event Building and higher trigger levels

- Data Flow of the 4 LHC experiments
- Data Readout (Interface to central DAQ systems)
- Event Building: CMS as an example
- Software: some technologies

### LHC experiments: Lvl 1 rate vs size



# **Further trigger levels**

The first-level trigger rate is still too high for permanent storage. Example CMS, ATLAS:

- Typical event size: 1 MB (ATLAS, CMS)
- 1 MB @ 100 kHz = **100 GB/s**
- "Reasonable" data rate to permanent storage: 100 MB/s (CMS & ATLAS) ... 1 GB/s (ALICE)

More trigger levels are needed to further reduce the fraction of less interesting events in the selected sample.
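The arithmetic behind these numbers can be checked in a few lines. This is a minimal sketch: the function names are illustrative only, and 1 MB is taken as 10^6 bytes.

```python
# Back-of-the-envelope check of the numbers quoted above.
# Assumption: 1 MB = 1e6 bytes, 1 GB = 1e9 bytes.

def bandwidth_bytes_per_s(event_size_bytes: float, rate_hz: float) -> float:
    """Data rate produced by events of a given size at a given trigger rate."""
    return event_size_bytes * rate_hz

def required_rejection(input_bw: float, storage_bw: float) -> float:
    """Factor by which further trigger levels must reduce the data rate."""
    return input_bw / storage_bw

lvl1_bw = bandwidth_bytes_per_s(1e6, 100e3)      # 1 MB @ 100 kHz
rejection = required_rejection(lvl1_bw, 100e6)   # storage: 100 MB/s

print(f"Level-1 output: {lvl1_bw / 1e9:.0f} GB/s")
print(f"Required reduction factor: {rejection:.0f}")
```

At a fixed event size, a reduction factor of 1000 means that roughly 100 out of 100 000 events per second reach permanent storage.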

### **Trigger/DAQ parameters**

High Level Trigger

| Experiment | No. of trigger levels | Level | Lvl 0,1,2 rate (Hz) | Event size (Byte) | Evt build. bandw. (GB/s) | HLT out MB/s (Event/s) |
|------------|-----------------------|-------|---------------------|-------------------|--------------------------|------------------------|
| ATLAS | 3 | | 10<sup>5</sup><br>3x10<sup>3</sup> | 1.5x10<sup>6</sup> | 4.5 | <b>300</b> (2x10<sup>2</sup>) |
| CMS | 2 | LV-1 | 10<sup>5</sup> | 10<sup>6</sup> | 100 | <b>100</b> (10<sup>2</sup>) |
| LHCb | 2 | LV-0 | 10<sup>6</sup> | 3x10<sup>4</sup> | 30 | <b>60</b> (2x10<sup>3</sup>) |
| ALICE | | | 500<br>10<sup>3</sup> | 5x10<sup>7</sup><br>2x10<sup>6</sup> | 25 | <b>1250</b> (10<sup>2</sup>)<br><b>200</b> (10<sup>2</sup>) |





# Implementation of EVBs and HLTs today



Eventbuilder and HLT Farm resemble an entire "computer center"

Higher level triggers are implemented in software. Farms of PCs investigate event data in parallel.



### **Data Flow: Architecture**

### **Data Flow: ALICE**



# **Data Flow: ATLAS**



# Data Flow: LHCb (original plan)



# **Data Flow: LHCb (final design)**



### **Data Flow: CMS**



### **Data Flow: Readout Links**

## **Data Flow: Data Readout**



- VME or Fastbus
- Parallel data transfer (typical: 32 bit) on shared bus
- One source at a time can use the bus



- Optical or electrical
- Data serialized
- Custom or standard protocols
- All sources can send data simultaneously
- Compare trends in industry market:
  - 198x: ISA, SCSI (1979), IDE, parallel port, VME (1982)
  - 199x: PCI (1990, 66 MHz 1995), USB (1996), FireWire (1995)
  - 200x: USB2, FireWire 800, PCI Express, InfiniBand, GbE, 10GbE

(Figure: data sources connected to a buffer via a shared data bus, which is the bottleneck.)

# **Readout Links of LHC Experiments**

| Experiment | Link | Characteristics | Flow control |
|------------|------|-----------------|--------------|
| ATLAS | SLINK | Optical: 160 MB/s, ≈ 1600 links<br>Receiver card interfaces to PC | yes |
| CMS | SLINK 64 | LVDS: 400 MB/s (max. 15 m), ≈ 500 links<br>(FE on average: 200 MB/s to readout buffer)<br>Receiver card interfaces to commercial NIC (Network Interface Card) | yes |
| ALICE | DDL | Optical: 200 MB/s, ≈ 500 links<br>Half duplex: controls FE (commands, pedestals, calibration data)<br>Receiver card interfaces to PC | yes |
| LHCb | TELL-1<br>& GbE Link | Copper quad GbE link, ≈ 400 links<br>Protocol: IPv4 (direct connection to GbE switch)<br>Forms "Multi Event Fragments"<br>Implements readout buffer | no |

# **Readout Links: Interface to PC**

### Problem:

Read data in PC with high bandwidth and low CPU load

Note: copying data costs a lot of CPU time!

### Solution: Buffer-Loaning

- Hardware shuffles data via DMA (Direct Memory Access) engines
- Software maintains tables of buffer-chains
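The buffer-loaning idea can be illustrated with a small software model. This is only a sketch under stated assumptions: `BufferPool`, `dma_write`, `loan` and `give_back` are hypothetical names, and the "DMA engine" is simulated by an ordinary function call. The point is that the application works on a view of the buffer the hardware filled, and only buffer indices change hands.

```python
# Minimal sketch of buffer loaning (hypothetical names, no real DMA).
from collections import deque

class BufferPool:
    """Pre-allocated buffers that are loaned out and handed back, never copied."""

    def __init__(self, n_buffers: int, size: int):
        self._buffers = [bytearray(size) for _ in range(n_buffers)]
        self._free = deque(range(n_buffers))   # indices the "hardware" may fill
        self._filled = deque()                 # (index, length) ready for software

    def dma_write(self, data: bytes) -> bool:
        """Stand-in for the DMA engine: fill a free buffer, or report back-pressure."""
        if not self._free:
            return False
        idx = self._free.popleft()
        self._buffers[idx][:len(data)] = data
        self._filled.append((idx, len(data)))
        return True

    def loan(self):
        """Loan the oldest filled buffer to the application as a zero-copy view."""
        idx, length = self._filled.popleft()
        return idx, memoryview(self._buffers[idx])[:length]

    def give_back(self, idx: int) -> None:
        """The application is done: the buffer returns to the free list."""
        self._free.append(idx)

pool = BufferPool(n_buffers=2, size=64)
pool.dma_write(b"fragment-1")
pool.dma_write(b"fragment-2")
blocked = not pool.dma_write(b"fragment-3")   # no free buffer left
idx, view = pool.loan()
first = bytes(view)                           # copied here only for printing
view.release()
pool.give_back(idx)
print(first, blocked)
```

The two deques play the role of the "tables of buffer-chains": software bookkeeping is limited to moving indices between them.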

### Advantage:




### **Example readout board: LHCb**



### **Event Building: example CMS**

### **Data Flow: Atlas vs CMS**



### **Readout Buffer**

### "Commodity"

#### Concept of "Region Of Interest" (ROI)

Increased complexity:

- ROI generation (at Lvl1)
- ROI Builder (custom module)
- selective readout from buffers

Implemented with commercial PCs

### "Commodity"

### **Event Builder**

### Challenging

1kHz @ 1 MB = O(1) GB/s

100kHz @ 1 MB = 100 GB/s

Increased complexity:

- traffic shaping
- specialized (commercial) hardware

### "Modern" EVB architecture



# Intermezzo: Networking

### • TCP/IP on Ethernet networks

- All data packets are surrounded by headers and a trailer



Ethernet:

- Addresses understood by hardware (NIC and switch)

IP:

- Unique addresses (world wide) known by DNS (you can search for <u>www.google.com</u>)

TCP:

- Provides the programmer with an API.
- Establishes "connections" = logical communication channels ("socket programming").
- Makes sure that your data arrives: sent packets must be acknowledged by the receiver (retransmission after timeout).
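As an illustration of "socket programming", the following sketch opens a TCP connection on the loopback interface and exchanges one message. It uses only the Python standard library. The message content and the application-level reply are made up for the example; acknowledgements and retransmissions happen inside TCP, below this API.

```python
# Minimal TCP sketch on the loopback interface (standard-library sockets only).
import socket
import threading

def serve_once(server: socket.socket) -> None:
    """Accept one connection and send a reply for the received data."""
    conn, _addr = server.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(b"got " + data)

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.settimeout(5)                   # do not wait forever for a client
server.bind(("127.0.0.1", 0))          # port 0: let the OS pick a free port
server.listen(1)
port = server.getsockname()[1]

thread = threading.Thread(target=serve_once, args=(server,))
thread.start()

# Client side: create_connection() establishes the logical channel.
with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
    client.sendall(b"fragment 42")
    reply = client.recv(1024)

thread.join()
server.close()
print(reply)
```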









### **Networking: EVB traffic**



C. Schwick (CERN/CMS)


### **Event Building dilemma**





# In spite of the Event Builder traffic pattern, congestion should be avoided.





#### Paradise scenario:

- All inputs want to send data to different destinations.
- No congestion, since every data packet finds a free path through the switch.
- Data traffic runs at the "wire speed" of the switch.

### **Switch implementation**

### Crossbar switch: Congestion in EVB traffic



Only one packet at a time can be routed to a given destination: "head of line" blocking.

#### Crossbar switch: Improvement : additional input FIFOs



FIFOs can "absorb" congestion ... until they are full.


















#### Still problematic:

Input FIFOs can absorb data fluctuations until they are full. All fine if:

FIFO capacity > event size

In practice: sizes of FIFOs are much smaller!

EVB traffic: blocking problem remains.

#### Crossbar switch: perfect scenario



Full wire speed can be reached (sustained)!

#### **Alternative switch implementation**



Similar issue:

The behavior of the switch (blocking or non-blocking) depends largely on the amount of internal memory (FIFOs and shared memory)

## **Conclusion: EVB traffic and switches**

- EVB network traffic is particularly hard for switches
  - The traffic pattern is such that it leads to congestion in the switch.
  - The switch either "blocks" (= packets at input have to "wait") or throws away data packets (Ethernet switches)
- Possible cures:
  - Buy many very expensive switches with a lot of high-speed memory inside, "over-dimension" your system in terms of bandwidth, and accept exploiting only a small fraction of the "wire speed".
    - A lot of readout links with lower bandwidth
  - Find a clever method which still allows you to exploit your switches to nearly 100%: traffic shaping

#### **EVB example: CMS**



| Parameter | Value |
|-----------|-------|
| Level-1 maximum trigger rate | 100 kHz |
| Average event size | 1 Mbyte |
| Builder network | 1 Terabit/s |
| Event filter computing power | 5 x 10<sup>6</sup> MIPS |
| Event flow control | ≈ 10<sup>6</sup> messages/s |
| No. programmable units | ≈512, ≈512 x n, ≈10000, ≈10000 |

### **EVB CMS: 2 stages**



#### CMS: 3D - EVB



## Advantages of 2 stage EVB

- Relaxed requirements:
  - Every RU-Builder works at 12.5 kHz (instead of 100kHz)
- Staging
  - To start up the experiment not the entire hardware needs to be present. Example:
    - If an Event Builder operating at 50 kHz is sufficient for the first beam, only 4 RU-builders need to be bought and set up.
- Technology independence:
  - The RU-Builder can be implemented with a different technology than the FED-Builder
  - Even different RU-Builders can be implemented with different technologies.
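The staging argument is simple arithmetic. A short sketch follows, assuming the full system consists of 8 RU-Builders, which is implied by 100 kHz / 12.5 kHz but not stated explicitly. The function name is illustrative.

```python
# Sketch of the staging arithmetic. Assumption (implied, not stated above):
# the full system has 100 kHz / 12.5 kHz = 8 RU-Builders.
import math

LVL1_RATE_KHZ = 100.0
RATE_PER_RU_BUILDER_KHZ = 12.5

def ru_builders_needed(target_rate_khz: float) -> int:
    """Number of RU-Builders to install for a given event-building rate."""
    return math.ceil(target_rate_khz / RATE_PER_RU_BUILDER_KHZ)

print(ru_builders_needed(LVL1_RATE_KHZ))  # full system
print(ru_builders_needed(50.0))           # first-beam example from the text
```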

## **Stage1: FED-Builder implementation**

- FED Builder functionality
  - Receives event fragments from 8 to 16 Readout Links (FRLs).
  - FRL fragments are merged into "super-fragments" at the destination (Readout Unit).
- FED Builder implementation
  - Requirements:
    - Sustained throughput of 200MB/s for every data source (500 in total).
    - Input interfaces to FPGA (in FRL) -> protocol must be simple.
  - Chosen network technology: Myrinet
    - NICs (Network Interface Cards) with 2x2.0 Gb/s optical links (≈ 2x250 MB/s)
    - Switches based on cross bars (predictable, understandable behaviour).
    - Full duplex with flow control (no packet loss).
    - NIC cards contain RISC processor. Development system available. Can be easily interfaced to FPGAs (custom electronics: receiving part of readout links)
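The merging of FRL fragments into super-fragments can be sketched in software. This is a toy model, not the Myrinet firmware: the class name, the `(event_id, source_id, payload)` layout and the choice of 8 sources are assumptions made for illustration.

```python
# Sketch of super-fragment assembly at a Readout Unit (hypothetical data layout).
from collections import defaultdict

class SuperFragmentBuilder:
    """Collects one fragment per source for each event, then concatenates them."""

    def __init__(self, n_sources: int):
        self.n_sources = n_sources
        self._pending = defaultdict(dict)   # event_id -> {source_id: payload}

    def add(self, event_id: int, source_id: int, payload: bytes):
        """Store a fragment; return the super-fragment once all sources arrived."""
        fragments = self._pending[event_id]
        fragments[source_id] = payload
        if len(fragments) < self.n_sources:
            return None
        del self._pending[event_id]
        # Fixed source order, independent of arrival order.
        return b"".join(fragments[s] for s in range(self.n_sources))

builder = SuperFragmentBuilder(n_sources=8)
result = None
for source in reversed(range(8)):           # fragments arrive out of order
    result = builder.add(event_id=1, source_id=source, payload=bytes([source]))
print(result)
```

Keying the pending fragments by event number lets fragments of different events interleave on the network without being mixed up.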



### **Performance of "1 rail" FEDBuilder**

Measurement configuration:

8 sources to 8 destinations






#### **Solution: "over-dimension" FED-Builder**



# Stage2: RU-Builder (original plan: 2004)

- Implementation: Myrinet
  - Connect 64 Readout-Units to 64 Builder-Units with a switch
  - Wire speed in Myrinet: 250 MB/s



- Avoid blocking of switch: Traffic shaping with Barrel Shifter
  - Chop event data into fixed size blocks (re-assembly done at receiver)
  - Barrel shifter (next slides)
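The effect of barrel-shifter traffic shaping can be sketched with a simple schedule. The model assumes N sources, N destinations and fixed-size blocks, with source i sending to destination (i + t) mod N in time slot t. It is compared with unshaped event building, where all sources address the same destination at once. The function names are illustrative.

```python
# Sketch of barrel-shifter scheduling for N sources and N destinations.
# In time slot t, source i sends one fixed-size block to destination (i + t) % N.

def barrel_shifter_slot(n: int, t: int) -> list[int]:
    """Destination chosen by each source in time slot t."""
    return [(source + t) % n for source in range(n)]

def naive_evb_slot(n: int, t: int) -> list[int]:
    """Unshaped event building: all sources send event t to the same destination."""
    return [t % n for _ in range(n)]

def usable_fraction(destinations: list[int]) -> float:
    """Fraction of sources a crossbar can serve at once (one block per output)."""
    return len(set(destinations)) / len(destinations)

N = 8
shaped = [usable_fraction(barrel_shifter_slot(N, t)) for t in range(N)]
unshaped = [usable_fraction(naive_evb_slot(N, t)) for t in range(N)]
print(min(shaped), max(unshaped))
```

In every slot the shaped schedule is a permutation of the destinations, so no two blocks compete for the same output. After N slots each source has sent one block to every destination.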

### **RU-Builder: Barrel shifter**



#### **RU Builder Performance**



## **Event Building: Current design**



#### **Event Builder Components**



#### Half of the CMS FED Builder

One half of the FEDBuilder is installed close to the experiment in the underground. The other half is on the surface close to the RU-Builder and the Filter Farm implementing the HLT.

The FEDBuilder is used to transport the data to the surface.

#### **Event Builder Components**



The RU-Builder Switch

- Aim: Event Builder should perform load balancing
  - If for some reason some destinations are slower than others, this should not slow down the entire DAQ system.
  - Another form of **traffic shaping**
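Why load balancing matters can be seen in a toy model. The sketch compares strict round-robin assignment with giving each event to the destination that becomes free first. The service times and function names are invented for illustration and do not describe the actual CMS protocol.

```python
# Sketch: pull-based event assignment versus strict round-robin (toy model).
import heapq

def round_robin_time(n_events: int, service_times: list[float]) -> float:
    """Each destination gets every k-th event regardless of its speed."""
    n = len(service_times)
    busy_until = [0.0] * n
    for event in range(n_events):
        busy_until[event % n] += service_times[event % n]
    return max(busy_until)

def load_balanced_time(n_events: int, service_times: list[float]) -> float:
    """Each event goes to the destination that becomes free first."""
    heap = [(0.0, dest) for dest in range(len(service_times))]
    heapq.heapify(heap)
    finish = 0.0
    for _ in range(n_events):
        free_at, dest = heapq.heappop(heap)
        done = free_at + service_times[dest]
        finish = max(finish, done)
        heapq.heappush(heap, (done, dest))
    return finish

times = [1.0, 1.0, 1.0, 4.0]   # one destination is four times slower
rr = round_robin_time(400, times)
lb = load_balanced_time(400, times)
print(rr, lb)
```

With round-robin the slowest destination dictates the total time. With load balancing the slow destination simply receives fewer events.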











#### Online software: Some aspects of software design

## **History: Procedural programming**

- Up to the 90's: procedural programming
  - Use of libraries for algorithms
  - Use of large data structures
    - Data structures passed to library functions
    - Results in form of data structures
- Typical languages used in Experiments:
  - Fortran for data analysis
  - C for online software

#### **Today: Object Oriented Programming**

#### • Fundamental idea of OO: Data is like money: completely useless...if you don't do anything with it...

- Objects (instances of classes) contain the data and the functionality:
  - Nobody wants the data itself: you always want to do something with the data (you want a "service": find jets, find heavy particles, ...)
  - Data is hidden from the user of the object
  - Only the interface (= methods = functions) is exposed to the user.
- Aim of this game:
  - Programmer should not care about data representation but about functionality
  - Achieve better robustness of software by encapsulating the data representation in classes which also contain the methods:
    - The class-designer is responsible for the data representation.
    - He can change it as long as the interface(= exposed functionality) stays the same.
- Used since the 90s in Physics experiments
- Experience so far:
  - It is true that for large software projects a good OO design is more robust and easier to maintain.
  - Good design of a class library is difficult and time consuming and needs experienced programmers.
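The encapsulation argument can be made concrete with two classes that offer the same interface but store their data differently. The class and method names below are invented for this sketch.

```python
# Sketch: the interface stays fixed while the data representation changes.
import math

class TrackCartesian:
    """Stores momentum as (px, py, pz)."""
    def __init__(self, px: float, py: float, pz: float):
        self._p = (px, py, pz)              # hidden representation

    def pt(self) -> float:
        """Transverse momentum: the service the user actually wants."""
        return math.hypot(self._p[0], self._p[1])

class TrackPolar:
    """Same interface, different internal representation (pt, phi, pz)."""
    def __init__(self, px: float, py: float, pz: float):
        self._pt = math.hypot(px, py)
        self._phi = math.atan2(py, px)
        self._pz = pz

    def pt(self) -> float:
        return self._pt

def high_pt(tracks, threshold: float):
    """User code: relies only on the pt() method, not on the stored fields."""
    return [t for t in tracks if t.pt() > threshold]

tracks = [TrackCartesian(3.0, 4.0, 1.0), TrackPolar(0.3, 0.4, 9.0)]
selected = high_pt(tracks, threshold=1.0)
print(len(selected), selected[0].pt())
```

The user function `high_pt` keeps working when the class designer switches from one representation to the other, because it only uses the exposed method.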

#### Frameworks vs Libraries

#### What is a software framework?

- Frameworks are programming environments which offer enhanced functionality to the programmer.
- Working with a framework usually implies programming according to some rules which the framework dictates. This is the difference with respect to the use of libraries.

#### • Some Examples:

- Many frameworks for programming GUIs "own" the main program. The programmer's code is only executed via callbacks if some events are happening (e.g. mouse click, value entered, ...)
- A physics analysis framework usually contains the main loop over the events to be analyzed.
- An online software framework contains the functionality to receive commands from a Run-Control program and executes specific callbacks in the programmer's code. It also contains functionality to send "messages" to applications on other computers, hiding the complexity of network programming from the application.
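The inversion of control that distinguishes a framework from a library can be sketched as follows. All class names and the command strings are hypothetical. A real online framework would receive the commands from Run Control over the network.

```python
# Sketch of a tiny "framework" that owns the control flow (all names hypothetical).

class Application:
    """Base class dictated by the framework; users override the callbacks."""
    def on_configure(self): ...
    def on_start(self): ...
    def on_stop(self): ...

class Framework:
    """Receives run-control commands and dispatches them to the user's callbacks."""
    def __init__(self, app: Application):
        self._callbacks = {
            "configure": app.on_configure,
            "start": app.on_start,
            "stop": app.on_stop,
        }

    def run(self, commands):
        """Main loop: lives in the framework, not in the user's code."""
        for command in commands:
            self._callbacks[command]()

class MyReadout(Application):
    def __init__(self):
        self.log = []
    def on_configure(self):
        self.log.append("configured")
    def on_start(self):
        self.log.append("running")
    def on_stop(self):
        self.log.append("stopped")

app = MyReadout()
Framework(app).run(["configure", "start", "stop"])
print(app.log)
```

The user never writes a main loop; the framework decides when each callback runs.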

## **Distributed computing**

- A way of doing network programming:
  - "Normal Program": runs on a single computer. Objects "live" in the program.
  - Distributed Computing: An application is distributed over many computers connected via a network.
    - An object in computer A can call a method (service) of an object in computer B.
    - Distributed computing is normally provided by a framework.
    - The complexity of network programming is hidden from the programmer.
- Examples:
  - CORBA (Common Object Request Broker Architecture)
    - Used by ATLAS
    - Platform independent and programming-language independent
  - SOAP (Simple Object Access Protocol)
    - Used by CMS
    - Designed for Web applications
    - Based on XML and therefore also independent of platform and language
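What such a framework hides from the programmer can be sketched by marshalling a method call into XML and back. The example uses XML-RPC from the Python standard library as a stand-in. It is neither SOAP nor CORBA, no network is involved, and the `JetFinder` class is invented for illustration.

```python
# Sketch of what a distributed-computing framework hides: marshalling a method
# call into XML and back. XML-RPC is used as a stand-in; it is NOT SOAP or CORBA.
import xmlrpc.client

class JetFinder:
    """Object that "lives" on computer B."""
    def count_above(self, energies, threshold):
        return sum(1 for e in energies if e > threshold)

def client_marshal(method: str, *args) -> str:
    """Computer A: turn the call into an XML request."""
    return xmlrpc.client.dumps(args, methodname=method)

def server_dispatch(obj, request_xml: str) -> str:
    """Computer B: decode the request, call the method, encode the result."""
    params, method = xmlrpc.client.loads(request_xml)
    result = getattr(obj, method)(*params)
    return xmlrpc.client.dumps((result,), methodresponse=True)

request = client_marshal("count_above", [12.5, 40.0, 75.0], 30.0)
response = server_dispatch(JetFinder(), request)
(answer,), _ = xmlrpc.client.loads(response)
print(answer)
```

In a real system the request and response strings would travel over the network between the two computers; because the payload is plain XML, either side could be written in a different language.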

## **Distributed computing**

A method on a remote object is called:



# **Distributed computing**

#### The result is coming back:



## Conclusions

• Trigger / DAQ at LHC experiments



- Many Trigger levels:
  - partial event readout
  - complex readout buffer
  - "straight forward" EVB



- One Trigger level (CMS):
  - "simple" readout buffer
  - high throughput EVB
  - complex EVB implementation (custom protocols, firmware)
- Detector Readout: Custom Point to Point Links
- Event-Building
  - Implemented with commercial Network technologies
  - Event building is done via "Network-switches" in large distributed systems.
  - Event Building traffic leads to network congestion
  - Traffic shaping copes with these problems

#### Outlook

#### **Moore's Law**



# **Technology: FPGAs**

- Performance and features of today's "Top Of The Line"
  - XILINX:
    - High Performance Serial Connectivity (3.125Gb/s transceivers):
      - 10GbE Cores, Infiniband, Fibre Channel, ...
    - PCI-express Core (1x and 4x => 10GbE ready)
    - Embedded Processor:
      - 1 or 2 400MHz Power PC 405 cores on chip
  - ALTERA:
    - 3.125 Gb/s transceivers
      - 10GbE Cores, Infiniband, Fibre Channel, ...
    - PCI-express Core
    - Embedded Processors:
      - ARM processor (200MHz)
      - NIOS "soft" RISC: configurable

# **Technology: PCs**

- Connectivity
  - PCI(x) --> PCI express
  - 10GbE network interface
- Intel's processor technology
  - Future processors (2015):
    - Parallel processing
    - Many CPU-cores on the same silicon chip
    - Cores might be different (e.g. special cores for communication to offload software)

#### Modern Server PC



#### **Current EVB architecture**






### **Future EVB architecture I**



## Conclusions

• Trigger / DAQ at LHC experiments



Many Trigger levels:

- partial event readout
- complex readout buffer
- "straight forward" EVB



One Trigger level (CMS):

- "simple" readout buffer
- high throughput EVB
- complex EVB implementation (custom protocols, firmware)
- Detector Readout: Custom Point to Point Links
- Event-Building
  - Implemented with commercial Network technologies
  - Event building is done via "Network-switches" in large distributed systems.
  - Event Building traffic leads to network congestion
  - Traffic shaping copes with these problems
- Outlook: Future technologies

#### **EXTRA SLIDES**

### **Example CMS: data flow**



| Acronym | Meaning                            |
|-------|--------------------------------------|
| BCN   | Builder Control Network              |
| BDN   | Builder Data Network                 |
| BM    | Builder Manager                      |
| BU    | Builder Unit                         |
| CSN   | Computing Service Network            |
| DCS   | Detector Control System              |
| DCN   | Detector Control Network             |
| DSN   | DAQ Service Network                  |
| D2S   | Data to Surface                      |
| EVM   | Event Manager                        |
| FB    | FED Builder                          |
| FEC   | Front-End Controller                 |
| FED   | Front-End Driver                     |
| FES   | Front-End System                     |
| FFN   | Filter Farm Network                  |
| FRL   | Front-End Readout Link               |
| FS    | Filter Subfarm                       |
| GTP   | Global Trigger Processor             |
| LV1   | Level-1 Trigger Processor            |
| RTP   | Regional Trigger Processor           |
| RM    | Readout Manager                      |
| RCN   | Readout Control Network              |
| RCMS  | Run Control and Monitor System       |
| RU    | Readout Unit                         |
| TPG   | Trigger Primitive Generator          |
| TTC   | Timing, Trigger and Control          |
| sTTS  | synchronous Trigger Throttle System  |
| aTTS  | asynchronous Trigger Throttle System |

# High Level Trigger: CPU usage

- Based on full simulation, full analysis and "offline" HLT Code
- All numbers for a 1 GHz, Intel Pentium-III CPU
- Total: 4092s for 15.1 kHz -> 271 ms/event
- Expect improvements, additions.
- A 100kHz system requires 1.2x10<sup>6</sup> SI95
- Corresponds to 2000 dual CPU boxes in 2007 (assuming Moore's law)

| Trigger                         | CPU (ms) | Rate (kHz) | Total (s) |
|---------------------------------|----------|------------|-----------|
| 1e/γ, 2e/γ                      | 160      | 4.3        | 688       |
| 1μ, 2μ                          | 710      | 3.6        | 2556      |
| 1τ, 2τ                          | 130      | 3.0        | 390       |
| Jets, Jet * Miss-E<sub>T</sub>  | 50       | 3.4        | 170       |
| e * jet                         | 165      | 0.8        | 132       |
| B-jets                          | 300      | 0.5        | 150       |
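The farm-size estimate can be retraced from the totals quoted on this slide. The sketch reads "Total (s)" as CPU-seconds needed per second of data taking (CPU time in ms times rate in kHz), which is an interpretation rather than something stated explicitly. The variable names are illustrative.

```python
# Retraces the farm-size arithmetic from the totals quoted on this slide.
# Assumption: "Total (s)" means CPU-seconds per second of data taking.
TOTAL_CPU_S_PER_S = 4092.0   # quoted total
TOTAL_RATE_KHZ = 15.1        # quoted Level-1 rate covered by the listed triggers
TARGET_RATE_KHZ = 100.0

ms_per_event = TOTAL_CPU_S_PER_S / TOTAL_RATE_KHZ         # s per kHz equals ms/event
cpus_needed = TARGET_RATE_KHZ * 1e3 * ms_per_event / 1e3  # 1 GHz P-III equivalents

print(f"{ms_per_event:.0f} ms/event")
print(f"{cpus_needed:.0f} CPUs of the 1 GHz Pentium-III class")
```

Roughly 27 000 such CPUs against 2000 dual-CPU boxes would imply that each 2007 CPU is assumed to be about seven times faster, which is where the Moore's law extrapolation enters.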

## **CMS and Pb collisions**

- Luminosity 10<sup>27</sup> cm<sup>-2</sup>s<sup>-1</sup>
  - 8 kHz expected event rate
  - 330 kB to 8.5 MB event size (depending on impact parameter)
  - ==> transfer all collisions to HLT farm (no rejection in Lvl 1)

On average 4s per collision with 1500 nodes in filter farm

# **Myrinet (old switch)**

- network built out of crossbars (Xbar16)
- wormhole routing, built-in back pressure (no packet loss)
- switch: 128-Clos switch crate
  - 64x64 x 2.0 Gbit/s port (bisection bandwidth 128 Gbit/s)
- NIC: M3S-PCI64B-2 (LANai9 with RISC), custom Firmware



# LHCb

- Operate at  $L = 2 \times 10^{32} \text{ cm}^{-2}\text{s}^{-1}$ : 10 MHz event rate
- Lvl0: 2-4 µs latency, 1 MHz output
  - Pile-up veto, calorimeter, muon
- Lvl1: 52.4ms latency, 40 kHz output
  - Impact parameter measurements
  - Runs on same farm as HLT, EVB
- Pile up veto
  - Can only tolerate one interaction per bunch crossing since otherwise always a displaced vertex would be found by trigger

### LHCb L1-HLT-Readout network



### **Clos-switch 128 from 8x8 Crossbars**

