
40 MHz scouting on-call guide

Monitoring the scouting system

All links in this section require access to the .cms network. Follow the steps in the cluster users guide to set this up.

The new L1 scouting Grafana dashboard can be used to obtain a quick view of the state of the entire scouting data flow.

The state of the L1 trigger can be monitored on the L1T Grafana instance. This also provides a link to the Muon rate monitor which can be helpful to determine whether the muon scouting should expect input.

In case of problems a good overview can sometimes be gained from the logs of the function manager. These can be followed in real time from any machine in .cms by executing ~daqoncall/DAQTools/utilities/HandsawLife.pl -s l1scoutdev -f DEBUG for the development FM or ~daqoncall/DAQTools/utilities/HandsawLife.pl -s l1scoutpro -f DEBUG for the production one. For post-mortem analysis, log into the machine cmsrc-l1scout.cms and execute cat /var/log/rcms/l1scoutpro/Logs_l1scoutpro.xml | ~hsakulin/hs/trunk/Handsaw.pl | less -R (replacing pro with dev when checking the development instance).
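Since the pro and dev commands differ only in the instance name, a tiny hypothetical helper (not part of DAQTools) can build the right post-mortem pipeline; it only prints the command so it can be reviewed before running:

```shell
# Hypothetical helper: print the Handsaw post-mortem command for an instance.
hslog() {  # usage: hslog pro|dev
  inst=${1:?expected pro or dev}
  echo "cat /var/log/rcms/l1scout${inst}/Logs_l1scout${inst}.xml | ~hsakulin/hs/trunk/Handsaw.pl | less -R"
}
hslog dev
```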

Recovering the system (after a power cut or reinstallation)

If the system comes back from a power cut, or a scoutdaq-type machine has been reinstalled, a few manual steps need to be followed.

Powercut

After a powercut it is likely that the bitfile needs to be loaded into the FPGA. This can be done with the command $ /usr/sbin/deploy_scouting_firmware.sh [CERN username] [bitfile version] [scouting_board_type] [input_system] as described in the section further below. The last deployed bitfile is recorded in /var/log/kcu1500_bitfile_deployments.log, so you can choose that one. If this log is not present (e.g. when the machine was wiped), check the latest release in the Gitlab project instead.
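A quick way to look up the last deployment is to read the final line of that log. A sketch, with a fallback message when the log is absent (the log's line format is not assumed here, the last line is printed verbatim):

```shell
# Sketch: read the most recent entry from the deployment log; fall back to
# a hint when the log is absent (e.g. on a wiped machine).
log=/var/log/kcu1500_bitfile_deployments.log
if [ -r "$log" ]; then
  last_deploy=$(tail -n 1 "$log")
else
  last_deploy="log not found; check the latest release in the Gitlab project"
fi
echo "$last_deploy"
```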

Once the bitfile has been loaded, the machine needs to be rebooted with sudo shutdown -r now so that the PCI tree is re-enumerated.

Reinstallation

After re-installation the following steps need to be performed:

  1. Reload bitfile as described above.
  2. TEMPORARY: Set the scdaq configuration file /etc/scdaq/scdaq.conf as needed for the given machine. (See /opt/scdaq/test/config/ for examples for a given machine. The fields to change are usually "processor_type", "output_filename_prefix", "output_filename_base", and "nOrbitsPerFile".)
  3. Enable and start SCONE with sudo systemctl enable --now scone
  4. Enable and start scdaq with sudo systemctl enable --now runSCdaq
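As a sanity check after step 2, one can verify that the configuration file defines the fields listed above before starting the services. This is a sketch only: the sample file and its key:value syntax are illustrative, not copied from a real scdaq.conf; compare with the examples in /opt/scdaq/test/config/.

```shell
# Sketch: verify that an scdaq.conf defines the fields that usually need
# changing per machine. Sample file and syntax are illustrative only.
conf=$(mktemp)   # stand-in for /etc/scdaq/scdaq.conf
cat > "$conf" <<'EOF'
processor_type:PASS_THROUGH
output_filename_prefix:scout
output_filename_base:/fff/output
nOrbitsPerFile:4096
EOF
missing=0
for key in processor_type output_filename_prefix output_filename_base nOrbitsPerFile; do
  if ! grep -q "^${key}" "$conf"; then
    echo "MISSING: ${key}"
    missing=1
  fi
done
if [ "$missing" -eq 0 ]; then echo "all required fields present"; fi
rm -f "$conf"
```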

Deploying a bitfile

KCU1500

Deployment of bitfiles is handled via a script called with $ /usr/sbin/deploy_scouting_firmware.sh [CERN username] [bitfile version] [scouting_board_type] [input_system], e.g.

[dinyar@scoutdaq-s1d12-39-01 ~]$ source /opt/Xilinx/Vivado_Lab/2018.3/settings64.sh
# Example with bitfile built from branch:
[dinyar@scoutdaq-s1d12-39-01 ~]$ deploy_scouting_firmware.sh dinyar chore-deploy_via_package_registry-125d3714-dev kcu1500 demux
# Example with bitfile built from release:
[dinyar@scoutdaq-s1d12-39-01 ~]$ deploy_scouting_firmware.sh dinyar v1.1.1 kcu1500 demux

Note: In case the board had not been programmed at boot (e.g. after a power cut) we still need a reboot after programming the FPGA. This is needed to correctly enumerate the PCI address space.

This script:

  • Stops scdaq and SCONE
  • Retrieves the bitfile package from Gitlab and extracts it in our repository path
  • Creates a symlink /opt/l1scouting-hardware/bitfiles/currently_used that points at the directory of the deployed bitfile
  • Loads the bitfile into the FPGA using the script supplied by the bitfile archive, allowing it to perform board-specific tasks (e.g. setting the oscillator correctly and rescanning the PCIe bus)
  • Makes note of the bitfile deployment in /var/log/bitfiles/[scouting_board_type]_deployments.log
  • Starts SCONE again
  • Resets the board using SCONE
  • Starts scdaq
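To see which bitfile is currently deployed, inspect the symlink the script maintains. The sketch below demonstrates the same bookkeeping on a throwaway directory so it is safe to run anywhere; on a real machine you would simply read the real link.

```shell
# Demonstrates the currently_used symlink bookkeeping on a temp directory;
# on a real machine just run:
#   readlink /opt/l1scouting-hardware/bitfiles/currently_used
base=$(mktemp -d)
mkdir -p "$base/v1.1.1"
ln -sfn "$base/v1.1.1" "$base/currently_used"   # -n repoints an existing link
deployed=$(readlink "$base/currently_used")
echo "deployed: ${deployed##*/}"
rm -rf "$base"
```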

The following register parameters must be set after loading the bitfile (correct as of March 2023):

On the GMT board:

curl -X POST -F "value=1" localhost:8080/kcu1500_ugmt/orbits_per_packet/write
curl -X POST -F "value=1" localhost:8080/kcu1500_ugmt/reset_board/write
curl -X POST -F "value=0" localhost:8080/kcu1500_ugmt/reset_board/write

On the Calo board:

curl -X POST -F "value=1" localhost:8080/kcu1500_demux/orbits_per_packet/write
curl -X POST -F "value=1" localhost:8080/kcu1500_demux/reset_board/write
curl -X POST -F "value=0" localhost:8080/kcu1500_demux/reset_board/write
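The six writes above differ only in device, register, and value, so they can be factored into a small wrapper. This helper is hypothetical (not part of SCONE); echo makes it a dry run that prints each request, and on the machine you would replace the echo line with the real curl call.

```shell
# Hypothetical wrapper around the SCONE register-write endpoint used above.
# Dry run via echo; on the machine replace the echo line with:
#   curl -X POST -F "value=$3" "localhost:8080/$1/$2/write"
write_reg() {  # usage: write_reg <device> <register> <value>
  echo "POST localhost:8080/$1/$2/write value=$3"
}
write_reg kcu1500_ugmt orbits_per_packet 1
write_reg kcu1500_ugmt reset_board 1   # pulse the reset high...
write_reg kcu1500_ugmt reset_board 0   # ...and back low
```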

VCU128

After a power cycle of the boards/chassis

First step:

source /opt/Xilinx/Vivado_Lab/2018.3/settings64.sh
deploy_scouting_firmware.sh $USER master-3228b700-dev vcu128 ugmtbmtf 0 1 
deploy_scouting_firmware.sh $USER calo_copy_and_p2gt-49d9000d-dev vcu128 calop2gt 1 1
sudo reboot
Note that of the two numeric arguments to deploy_scouting_firmware.sh, the first is the board_index and the second indicates whether this is the first upload of the bitfile after a power cut.

Executing deploy_scouting_firmware.sh without any arguments will print some helpful usage information.

If you are prompted to enter your user password, you may also just hit enter to continue.

Second step:

export PYTHONPATH=/opt/xdaq/etc/PyHAL/
export LD_LIBRARY_PATH=/opt/xdaq/lib/

## board 0
# on board QSFPs
prog_clock_Si570.py -a /opt/l1scouting-hardware/bitfiles/currently_used/0/address_map_vcu128_ugmtbmtf.dat --devidx 0 -q 1 -f 156.25 -v debug
prog_clock_Si570.py -a /opt/l1scouting-hardware/bitfiles/currently_used/0/address_map_vcu128_ugmtbmtf.dat --devidx 0 -q 2 -f 322.265625 -v debug
prog_clock_Si570.py -a /opt/l1scouting-hardware/bitfiles/currently_used/0/address_map_vcu128_ugmtbmtf.dat --devidx 0 -q 3 -f 322.265625 -v debug
prog_clock_Si570.py -a /opt/l1scouting-hardware/bitfiles/currently_used/0/address_map_vcu128_ugmtbmtf.dat --devidx 0 -q 4 -f 322.265625 -v debug
# mezzanine
prog_clock_Si5341.py -a /opt/l1scouting-hardware/bitfiles/currently_used/0/address_map_vcu128_ugmtbmtf.dat --devidx 0 -f 156.25 -v debug

## board 1
# on board QSFPs
prog_clock_Si570.py -a /opt/l1scouting-hardware/bitfiles/currently_used/1/address_map_vcu128_calop2gt.dat --devidx 1 -q 1 -f 156.25 -v debug
prog_clock_Si570.py -a /opt/l1scouting-hardware/bitfiles/currently_used/1/address_map_vcu128_calop2gt.dat --devidx 1 -q 2 -f 156.25 -v debug
prog_clock_Si570.py -a /opt/l1scouting-hardware/bitfiles/currently_used/1/address_map_vcu128_calop2gt.dat --devidx 1 -q 3 -f 322.265625 -v debug
prog_clock_Si570.py -a /opt/l1scouting-hardware/bitfiles/currently_used/1/address_map_vcu128_calop2gt.dat --devidx 1 -q 4 -f 322.265625 -v debug
# mezzanine
prog_clock_Si5341.py -a /opt/l1scouting-hardware/bitfiles/currently_used/1/address_map_vcu128_calop2gt.dat --devidx 1 -f 156.25 -v debug
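The Si570 commands above differ only in QSFP index and frequency, so they can be generated in a loop. The sketch below covers board 0 (board 1 differs in that QSFPs 1 and 2 both run at 156.25 MHz); echo makes it a dry run that prints the commands rather than programming the clocks.

```shell
# Loop form of the board-0 Si570 commands above; echo makes this a dry run
# (drop it to actually program the clocks).
map=/opt/l1scouting-hardware/bitfiles/currently_used/0/address_map_vcu128_ugmtbmtf.dat
cmds=$(for q in 1 2 3 4; do
  # On board 0 only QSFP 1 runs at 156.25 MHz; QSFPs 2-4 run at 322.265625 MHz.
  if [ "$q" -eq 1 ]; then f=156.25; else f=322.265625; fi
  echo "prog_clock_Si570.py -a $map --devidx 0 -q $q -f $f -v debug"
done)
echo "$cmds"
```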

Third step:

# reupload bitfiles
source /opt/Xilinx/Vivado_Lab/2018.3/settings64.sh
deploy_scouting_firmware.sh $USER master-5f9668ad-dev vcu128 ugmtbmtf 0 0
deploy_scouting_firmware.sh $USER calo_copy_and_p2gt-49d9000d-dev vcu128 calop2gt 1 0

Fourth step:

# initialize output transceivers
curl -X POST localhost:8080/v2/vcu128_ugmtbmtf/0/initialize
curl -X POST localhost:8080/v2/vcu128_calop2gt/1/initialize

Reload when a bitfile was already loaded on the boards

First step:

# reupload bitfiles
source /opt/Xilinx/Vivado_Lab/2018.3/settings64.sh
deploy_scouting_firmware.sh $USER master-5f9668ad-dev vcu128 ugmtbmtf 0 0
deploy_scouting_firmware.sh $USER master-5f9668ad-dev vcu128 ugtcalo 1 0

Second step:

# initialize output transceivers
curl -X POST localhost:8080/v2/vcu128_ugmtbmtf/0/initialize
curl -X POST localhost:8080/v2/vcu128_ugtcalo/1/initialize

Restart grafana or prometheus service

To restart the Prometheus service: on d3vfu-c2e35-33-02 run sudo service prometheus start

To restart the Grafana dashboard: on d3vfu-c2e35-33-01 run sudo service grafana-server start

Tests

Deploy a test version of SCDAQ

To deploy a test version of SCDAQ, Puppet can be disabled. You should do so with

sudo /usr/local/bin/maintenance.sh -d -m "([your initials go here]) testing something"

and can then install the test RPM with yum. To re-enable Puppet you can use:

sudo /usr/local/bin/maintenance.sh -e -c

and force Puppet to re-run with

sudo puppet agent -t

Recover from fatal crashes

First stop the boards on scoutctrl-s1d12-18-01 with

curl -X POST localhost:8080/v2/vcu128_ugmtbmtf/0/stop
curl -X POST localhost:8080/v2/vcu128_calop2gt/1/stop

then, to clear the RUBUs and FUs, run on the 'main' RUBU

touch /fff/ramdisk/tsunami_all