40 MHz scouting on-call guide
Monitoring the scouting system
All links in this section require access to the .cms network. Follow the steps in the cluster users guide to set this up.
The new L1 scouting Grafana dashboard can be used to obtain a quick view of the state of the entire scouting data flow.
The state of the L1 trigger can be monitored on the L1T Grafana instance. This also provides a link to the Muon rate monitor, which can be helpful for determining whether muon scouting should expect input.
In case of problems a good overview can sometimes be gained from the logs of the function manager. These can be accessed in real time from any machine in .cms by executing
~daqoncall/DAQTools/utilities/HandsawLife.pl -s l1scoutdev -f DEBUG
for the development FM and
~daqoncall/DAQTools/utilities/HandsawLife.pl -s l1scoutpro -f DEBUG
for the production one. For post-mortem analysis, log into the machine cmsrc-l1scout.cms and execute
cat /var/log/rcms/l1scoutpro/Logs_l1scoutpro.xml | ~hsakulin/hs/trunk/Handsaw.pl | less -R
(replacing pro with dev when checking the development instance).
Recovering the system (after a power cut or reinstallation)
If the system comes back from a power cut, or a scoutdaq-type machine has been reinstalled, a few manual steps need to be followed.
Power cut
After a power cut it is likely that the bitfile needs to be loaded into
the FPGA. This can be done with the command
$ /usr/sbin/deploy_scouting_firmware.sh [CERN username] [bitfile version] [scouting_board_type] [input_system]
as described in the section further below. The latest deployed bitfile can be found in /var/log/kcu1500_bitfile_deployments.log, so you can choose that one. If this file is not present (e.g., when the machine was wiped) you can check the latest release in the GitLab project.
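To quickly see what was deployed last, you can inspect the end of the deployment log (standard tail; assumes the newest entry is at the bottom):
# Show the most recent bitfile deployments recorded on this machine
tail -n 3 /var/log/kcu1500_bitfile_deployments.log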
Once the bitfile has been loaded, the machine needs to be rebooted with
sudo shutdown -r now
as the PCI tree needs to be re-enumerated.
Reinstallation
After reinstallation, the following steps need to be performed:
- Reload bitfile as described above.
- TEMPORARY: Set the scdaq configuration file /etc/scdaq/scdaq.conf as needed for the given machine. (See /opt/scdaq/test/config/ for examples for a given machine. The fields to change are usually "processor_type", "output_filename_prefix", "output_filename_base", and "nOrbitsPerFile"; a quick check is sketched after this list.)
- Enable and start SCONE with
sudo systemctl enable --now scone
- Enable and start scdaq with
sudo systemctl enable --now runSCdaq
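To see which example configuration applies and how the active configuration currently differs, a minimal check (the grep pattern simply lists the fields named above):
# List the packaged example configurations
ls /opt/scdaq/test/config/
# Show the fields that usually need adjusting in the active configuration
grep -E 'processor_type|output_filename_prefix|output_filename_base|nOrbitsPerFile' /etc/scdaq/scdaq.conf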
Deploying a bitfile
KCU1500
Deployment of bitfiles is handled via a script, called with
$ /usr/sbin/deploy_scouting_firmware.sh [CERN username] [bitfile version] [scouting_board_type] [input_system]
e.g.
[dinyar@scoutdaq-s1d12-39-01 ~]$ source /opt/Xilinx/Vivado_Lab/2018.3/settings64.sh
# Example with bitfile built from branch:
[dinyar@scoutdaq-s1d12-39-01 ~]$ deploy_scouting_firmware.sh dinyar chore-deploy_via_package_registry-125d3714-dev kcu1500 demux
# Example with bitfile built from release:
[dinyar@scoutdaq-s1d12-39-01 ~]$ deploy_scouting_firmware.sh dinyar v1.1.1 kcu1500 demux
Note: In case the board has not been programmed at boot (i.e. after a power cut) we still need a reboot after programming the FPGA. This is needed to correctly enumerate the PCI address space.
This script:
- Stops scdaq and SCONE
- Retrieves the bitfile package from GitLab and extracts it in our repository path
- Creates a symlink /opt/l1scouting-hardware/bitfiles/currently_used that points at the directory of the deployed bitfile
- Loads the bitfile into the FPGA using the script supplied by the bitfile archive, allowing it to perform board-specific tasks (e.g. setting the oscillator correctly and rescanning the PCIe bus)
- Makes note of the bitfile deployment in /var/log/bitfiles/[scouting_board_type]_deployments.log
- Starts SCONE again
- Resets the board using SCONE
- Starts scdaq
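Conceptually the sequence corresponds to the following sketch (illustrative only: service names and the SCONE reset endpoint are taken from elsewhere in this guide, and the real deploy_scouting_firmware.sh handles the GitLab download and board-specific loading itself):
# Illustrative outline of the deployment sequence -- not the actual script
sudo systemctl stop runSCdaq scone                                        # stop scdaq and SCONE
# ... download and extract the bitfile package from GitLab ...
sudo ln -sfn /opt/l1scouting-hardware/bitfiles/<deployed version> /opt/l1scouting-hardware/bitfiles/currently_used
# ... load the bitfile with the loader shipped in the bitfile archive ...
sudo systemctl start scone                                                # restart SCONE
curl -X POST -F "value=1" localhost:8080/kcu1500_ugmt/reset_board/write   # reset the board via SCONE (uGMT example)
curl -X POST -F "value=0" localhost:8080/kcu1500_ugmt/reset_board/write
sudo systemctl start runSCdaq                                             # restart scdaq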
Correct register parameters that must be set after loading the bitfile, as of March 2023:
On GMT board:
curl -X POST -F "value=1" localhost:8080/kcu1500_ugmt/orbits_per_packet/write
curl -X POST -F "value=1" localhost:8080/kcu1500_ugmt/reset_board/write
curl -X POST -F "value=0" localhost:8080/kcu1500_ugmt/reset_board/write
On Calo board:
curl -X POST -F "value=1" localhost:8080/kcu1500_demux/orbits_per_packet/write
curl -X POST -F "value=1" localhost:8080/kcu1500_demux/reset_board/write
curl -X POST -F "value=0" localhost:8080/kcu1500_demux/reset_board/write
VCU128
After a power cycle of the boards/chassis
First step:
source /opt/Xilinx/Vivado_Lab/2018.3/settings64.sh
deploy_scouting_firmware.sh $USER master-3228b700-dev vcu128 ugmtbmtf 0 1
deploy_scouting_firmware.sh $USER calo_copy_and_p2gt-49d9000d-dev vcu128 calop2gt 1 1
sudo reboot
Executing deploy_scouting_firmware.sh without any arguments will give you some helpful info.
If you are prompted to enter your user password, you may also just hit enter to continue.
Second step:
export PYTHONPATH=/opt/xdaq/etc/PyHAL/
export LD_LIBRARY_PATH=/opt/xdaq/lib/
## board 0
# on board QSFPs
prog_clock_Si570.py -a /opt/l1scouting-hardware/bitfiles/currently_used/0/address_map_vcu128_ugmtbmtf.dat --devidx 0 -q 1 -f 156.25 -v debug
prog_clock_Si570.py -a /opt/l1scouting-hardware/bitfiles/currently_used/0/address_map_vcu128_ugmtbmtf.dat --devidx 0 -q 2 -f 322.265625 -v debug
prog_clock_Si570.py -a /opt/l1scouting-hardware/bitfiles/currently_used/0/address_map_vcu128_ugmtbmtf.dat --devidx 0 -q 3 -f 322.265625 -v debug
prog_clock_Si570.py -a /opt/l1scouting-hardware/bitfiles/currently_used/0/address_map_vcu128_ugmtbmtf.dat --devidx 0 -q 4 -f 322.265625 -v debug
# mezzanine
prog_clock_Si5341.py -a /opt/l1scouting-hardware/bitfiles/currently_used/0/address_map_vcu128_ugmtbmtf.dat --devidx 0 -f 156.25 -v debug
## board 1
# on board QSFPs
prog_clock_Si570.py -a /opt/l1scouting-hardware/bitfiles/currently_used/1/address_map_vcu128_calop2gt.dat --devidx 1 -q 1 -f 156.25 -v debug
prog_clock_Si570.py -a /opt/l1scouting-hardware/bitfiles/currently_used/1/address_map_vcu128_calop2gt.dat --devidx 1 -q 2 -f 156.25 -v debug
prog_clock_Si570.py -a /opt/l1scouting-hardware/bitfiles/currently_used/1/address_map_vcu128_calop2gt.dat --devidx 1 -q 3 -f 322.265625 -v debug
prog_clock_Si570.py -a /opt/l1scouting-hardware/bitfiles/currently_used/1/address_map_vcu128_calop2gt.dat --devidx 1 -q 4 -f 322.265625 -v debug
# mezzanine
prog_clock_Si5341.py -a /opt/l1scouting-hardware/bitfiles/currently_used/1/address_map_vcu128_calop2gt.dat --devidx 1 -f 156.25 -v debug
Third step:
# reupload bitfiles
source /opt/Xilinx/Vivado_Lab/2018.3/settings64.sh
deploy_scouting_firmware.sh $USER master-5f9668ad-dev vcu128 ugmtbmtf 0 0
deploy_scouting_firmware.sh $USER calo_copy_and_p2gt-49d9000d-dev vcu128 calop2gt 1 0
Fourth step:
# initialize output transceivers
curl -X POST localhost:8080/v2/vcu128_ugmtbmtf/0/initialize
curl -X POST localhost:8080/v2/vcu128_calop2gt/1/initialize
Reload when a bitfile was already loaded on the boards
First step:
# reupload bitfiles
source /opt/Xilinx/Vivado_Lab/2018.3/settings64.sh
deploy_scouting_firmware.sh $USER master-5f9668ad-dev vcu128 ugmtbmtf 0 0
deploy_scouting_firmware.sh $USER master-5f9668ad-dev vcu128 ugtcalo 1 0
Second step:
# initialize output transceivers
curl -X POST localhost:8080/v2/vcu128_ugmtbmtf/0/initialize
curl -X POST localhost:8080/v2/vcu128_ugtcalo/1/initialize
Restart the Grafana or Prometheus service
To restart the Prometheus service, on d3vfu-c2e35-33-02 run
sudo service prometheus start
To restart the Grafana dashboard, on d3vfu-c2e35-33-01 run
sudo service grafana-server start
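If a service does not come back cleanly, checking its status and recent log output is a reasonable first step (standard systemd/journalctl commands, using the service names above):
# On d3vfu-c2e35-33-02
sudo systemctl status prometheus
# On d3vfu-c2e35-33-01
sudo systemctl status grafana-server
sudo journalctl -u grafana-server --since "10 minutes ago"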
Tests
Deploy a test version of SCDAQ
To deploy a test version of SCDAQ, Puppet can be disabled. You should do so with
sudo /usr/local/bin/maintenance.sh -d -m "([your initials go here]) testing something"
and can then install the test RPM with yum (an illustrative example follows below). To re-enable Puppet again you can use
sudo /usr/local/bin/maintenance.sh -e -c
and force Puppet to re-run with
sudo puppet agent -t
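For the yum step between disabling and re-enabling Puppet, a minimal sketch (the RPM file name is a placeholder for whatever test build you are deploying):
# Install the test RPM (file name is hypothetical) and restart scdaq to pick it up
sudo yum install ./scdaq-test.x86_64.rpm
sudo systemctl restart runSCdaq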
Recover from fatal crashes
First stop the boards on scoutctrl-s1d12-18-01 with
curl -X POST localhost:8080/v2/vcu128_ugmtbmtf/0/stop
curl -X POST localhost:8080/v2/vcu128_calop2gt/1/stop
then, to clear the RUBUs and FUs, run on the 'main' RUBU
touch /fff/ramdisk/tsunami_all