How to troubleshoot Raspberry Pi crashes

Long-running headless Pi has started crashing: how do I troubleshoot it?

I have searched the existing open and closed issues

Required Information

  • DietPi version | cat /boot/dietpi/.version
    kellis@DietPi:~$ cat /boot/dietpi/.version
    G_DIETPI_VERSION_CORE=9
    G_DIETPI_VERSION_SUB=8
    G_DIETPI_VERSION_RC=0
    G_GITBRANCH='master'
    G_GITOWNER='MichaIng'

  • Distro version | echo $G_DISTRO_NAME $G_RASPBIAN

bullseye 0
  • Kernel version | uname --all
Linux DietPi 6.1.21-v8+ #1642 SMP PREEMPT Mon Apr  3 17:24:16 BST 2023 aarch64 GNU/Linux
  • Architecture | dpkg --print-architecture
arm64
  • SBC model | echo $G_HW_MODEL_NAME or (EG: RPi3)
RPi 4 Model B (aarch64)
  • Power supply used | (EG: 5V 1A RAVpower)
Raspberry Pi Official supply 5v 3A
  • SD card used | (EG: SanDisk ultra)
No SD card, booting directly from SSD

Additional Information (if applicable)

  • Software title | (EG: Nextcloud)

Docker/Portainer with a monitoring stack including Prometheus, Grafana, cAdvisor, Alertmanager, node_exporter and postgres_exporter.
An SSH tunnel to a remote system pulling node_exporter stats.
Plus some other software installed via dietpi-software, such as Fail2Ban, LEMP, Tailscale, Nextcloud (not used), CUPS (for AirPrint), OpenSSH, Samba, MC, FFmpeg, MariaDB, Redis, Avahi-Daemon and Python 3.

  • Was the software title installed freshly or updated/migrated?
    The system has been running for quite a while and probably has quite a bit of unused software installed, particularly related to Nextcloud.
  • Can this issue be replicated on a fresh installation of DietPi?
    Not tried, I would like to avoid a reinstall.

Steps to reproduce

  1. The system has been running great for years. In recent weeks it has crashed twice. Each time the system still responds to ping, but I am unable to SSH in and my Docker containers are inaccessible. The only way to recover is to pull the power and re-apply it.

How can I troubleshoot this, and which logs are best to look at?

Just a bit more information: looking at the node_exporter data that this Raspberry Pi recorded for itself before it became unresponsive, there was a spike in disk IOPS, specifically disk writes to sda, which holds the / and /boot filesystems.

There were also some memory page faults.

How can I track down if this is the problem and what might be causing it?

You could enable persistent logging so that the logs from before the crash are still available after the next reboot.

persistent system logs:
[code]
dietpi-software uninstall 103 # uninstalls DietPi-RAMlog
mkdir /var/log/journal # makes systemd-journald store its logs on disk
reboot # required to finalise the RAMlog uninstall
[/code]

Then you can check system logs via:
[code]journalctl[/code]

This will then also show logs from previous boot sessions. To limit the journal size, you can additionally apply, for example, the following:
[code]
mkdir -p /etc/systemd/journald.conf.d
cat << '_EOF_' > /etc/systemd/journald.conf.d/99-custom.conf
[Journal]
SystemMaxFiles=2
MaxFileSec=7day
_EOF_
[/code]

This will limit logs to 14 days split across two journal files, so that with rotation you will always have between 7 and 14 days of logs available.
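
Once persistent logging is active and the next hang has happened, the journal of the previous boot is the place to look. The following is only a rough sketch, assuming the journal has kept at least one earlier boot. The function names `crash_hints` and `show_previous_boot` and the keyword list are made up for illustration and are not part of DietPi; the keywords are just a starting point for spotting memory exhaustion, stalled tasks or storage errors.

```shell
#!/bin/sh
# crash_hints (hypothetical helper): read log lines from stdin and keep only
# those that commonly point to memory exhaustion, hung tasks or disk errors.
crash_hints() {
    grep -i -E 'out of memory|oom-killer|blocked for more than|i/o error' || true
}

# show_previous_boot (hypothetical helper): inspect the boot before the current one.
show_previous_boot() {
    journalctl --list-boots --no-pager            # boots stored in the journal
    journalctl -b -1 -n 200 --no-pager            # last 200 lines before the hang
    journalctl -k -b -1 --no-pager | crash_hints  # kernel messages of that boot
}
```

After the next forced power cycle, running `show_previous_boot` as root should show what the system logged last. If the output simply stops without any error, that in itself suggests the system stalled before anything could be written.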

If possible, have a look at memory usage. Your system is probably running out of physical memory, which causes it to swap. That would at least explain the disk usage and paging.
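
One way to check this is to record memory and swap usage regularly, so that the last values before a hang end up in the persistent journal. This is a minimal sketch, not a finished tool: `mem_summary` and `log_mem` are hypothetical names, the 60 second default interval is an arbitrary assumption, and it relies on `/proc/meminfo` and `logger` being available.

```shell
#!/bin/sh
# mem_summary (hypothetical helper): print available RAM, total RAM and used
# swap in MiB. Reads /proc/meminfo by default; another file can be passed.
mem_summary() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        $1 == "SwapTotal:"    { stotal = $2 }
        $1 == "SwapFree:"     { sfree = $2 }
        END {
            printf "mem_available_mib=%d mem_total_mib=%d swap_used_mib=%d\n",
                   avail / 1024, total / 1024, (stotal - sfree) / 1024
        }
    ' "${1:-/proc/meminfo}"
}

# log_mem (hypothetical helper): write a summary to the system log at a fixed
# interval in seconds (default 60, an assumed value). Runs until stopped.
log_mem() {
    while :; do
        logger -t mem_summary "$(mem_summary)"
        sleep "${1:-60}"
    done
}
```

Started with `log_mem &`, the entries can be read back after a crash with `journalctl -t mem_summary -b -1`. Steadily shrinking `mem_available_mib` together with growing `swap_used_mib` would support the out-of-memory theory; `docker stats --no-stream` can then help narrow down which container is responsible.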

Thank you. I’ll start here.