RPI4B+ dietpi-backup fails(?) some days

theverymrp · 13 June 2022 02:12

Required Information

DietPi version |
G_DIETPI_VERSION_CORE=8
G_DIETPI_VERSION_SUB=5
G_DIETPI_VERSION_RC=1
G_GITBRANCH=‘master’
G_GITOWNER=‘MichaIng’
Distro version | bullseye
Kernel version | aarch64 GNU/Linux
SBC model | RPI4+
Power supply used | (EG: 5V 1A RAVpower)
SD card used |

Hi, I run a pretty basic setup, really just proFTP and Portainer (that runs vpn and various arrs + plex). Up until late May I was backing up the system as described here: Automate dietpi-backup - #14 by brightwolf

This functionality has since been added to dietpi-backup so I decided to swap over to that. The problem I’ve found is that dietpi-backup regularly doesn’t complete/fails and therefore that the services are not restarted until I do it manually.

Expected behaviour
– services are stopped

backup is completed
services are restarted

Actual behaviour

services are stopped
backup is not completed
services are not restarted

Extra details
Daily cron has been set to run at 2am.
.dietpi-backup.log shows no completed entry for today (13th)

.dietpi-backup_stats
Backup completed : 2022-05-22_04:53:59
Backup completed : 2022-05-23_02:04:32
Backup completed : 2022-05-24_02:40:28
Backup completed : 2022-05-25_02:14:13
Backup completed : 2022-05-28_02:03:55
Backup completed : 2022-05-31_02:04:10
Backup completed : 2022-06-03_02:03:39
Backup completed : 2022-06-06_02:04:34
Backup completed : 2022-06-09_02:05:09
Backup completed : 2022-06-12_02:02:54

● docker.service - Docker Application Container Engine                                                               │
                              │      Loaded: loaded (/lib/systemd/system/docker.service; disabled; vendor preset: enabled)                           │
                              │     Drop-In: /usr/lib/systemd/system/docker.service.d                                                                │
                              │              └─dietpi-simple.conf                                                                                    │
                              │      Active: inactive (dead) since Mon 2022-06-13 02:00:46 +07; 6h ago                                               │
                              │ TriggeredBy: ● docker.socket                                                                                         │
                              │        Docs: https://docs.docker.com                                                                                 │
                              │     Process: 688653 ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock (code=exited,   │
                              │ status=0/SUCCESS)                                                                                                    │
                              │    Main PID: 688653 (code=exited, status=0/SUCCESS)                                                                  │
                              │         CPU: 14min 39.913s                                                                                           │
                              │                                                                                                                      │
                              │ Jun 13 02:00:37 DietPi 85d4103e3a4f[688653]: [s6-finish] sending all processes the TERM signal.                      │
                              │ Jun 13 02:00:39 DietPi ad3d0da37160[688653]: [s6-finish] sending all processes the KILL signal and exiting.          │
                              │ Jun 13 02:00:39 DietPi 3466f74bc0f5[688653]: [s6-finish] sending all processes the KILL signal and exiting.          │
                              │ Jun 13 02:00:39 DietPi 68001ceb3ac3[688653]: [s6-finish] sending all processes the KILL signal and exiting.          │
                              │ Jun 13 02:00:40 DietPi 8f2ad045b4ca[688653]: [s6-finish] sending all processes the KILL signal and exiting.          │
                              │ Jun 13 02:00:40 DietPi 2f6df63c9780[688653]: [s6-finish] sending all processes the KILL signal and exiting.          │
                              │ Jun 13 02:00:40 DietPi 85d4103e3a4f[688653]: [s6-finish] sending all processes the KILL signal and exiting.          │
                              │ Jun 13 02:00:46 DietPi systemd[1]: docker.service: Succeeded.                                                        │
                              │ Jun 13 02:00:46 DietPi systemd[1]: Stopped Docker Application Container Engine.                                      │
                              │ Jun 13 02:00:46 DietPi systemd[1]: docker.service: Consumed 14min 39.913s CPU time.

   │ ● cron.service - Regular background program processing daemon                                                        │
                              │      Loaded: loaded (/lib/systemd/system/cron.service; disabled; vendor preset: enabled)                             │
                              │      Active: inactive (dead)                                                                                         │
                              │        Docs: man:cron(8)                                                                                             │
                              │                                                                                                                      │
                              │ Jun 13 02:00:35 DietPi systemd[1]: cron.service: Unit process 1093782 (cron) remains running after unit stopped.     │
                              │ Jun 13 02:00:35 DietPi systemd[1]: cron.service: Unit process 1093783 (dash) remains running after unit stopped.     │
                              │ Jun 13 02:00:35 DietPi systemd[1]: cron.service: Unit process 1093784 (run-parts) remains running after unit         │
                              │ stopped.                                                                                                             │
                              │ Jun 13 02:00:35 DietPi systemd[1]: cron.service: Unit process 1093786 (dietpi) remains running after unit stopped.   │
                              │ Jun 13 02:00:35 DietPi systemd[1]: cron.service: Unit process 1094556 (dietpi-backup) remains running after unit     │
                              │ stopped.                                                                                                             │
                              │ Jun 13 02:00:35 DietPi systemd[1]: cron.service: Unit process 1094571 (dietpi-services) remains running after unit   │
                              │ stopped.                                                                                                             │
                              │ Jun 13 02:00:35 DietPi systemd[1]: cron.service: Unit process 1094574 (systemctl) remains running after unit         │
                              │ stopped.                                                                                                             │
                              │ Jun 13 02:00:35 DietPi systemd[1]: Stopped Regular background program processing daemon.                             │
                              │ Jun 13 02:00:35 DietPi systemd[1]: cron.service: Consumed 27.041s CPU time.                                          │
                              │ Jun 13 02:02:34 DietPi CRON[1093782]: pam_unix(cron:session): session closed for user root
							  ```

looking at the cron.service for previous days where it ran successfully:

Jun 12 02:00:01 DietPi CRON[687280]: (root) CMD (test -x /usr/sbin/anacron || { cd / && run-parts --report /etc/cron.daily; })
Jun 12 02:00:25 DietPi systemd[1]: Stopping Regular background program processing daemon...
Jun 12 02:00:25 DietPi systemd[1]: cron.service: Succeeded.
Jun 12 02:00:25 DietPi systemd[1]: cron.service: Unit process 687279 (cron) remains running after unit stopped.
Jun 12 02:00:25 DietPi systemd[1]: cron.service: Unit process 687280 (dash) remains running after unit stopped.
Jun 12 02:00:25 DietPi systemd[1]: cron.service: Unit process 687281 (run-parts) remains running after unit stopped.
Jun 12 02:00:25 DietPi systemd[1]: cron.service: Unit process 687283 (dietpi) remains running after unit stopped.
Jun 12 02:00:25 DietPi systemd[1]: cron.service: Unit process 687985 (dietpi-backup) remains running after unit stopped.
Jun 12 02:00:25 DietPi systemd[1]: cron.service: Unit process 688000 (dietpi-services) remains running after unit stopped.
Jun 12 02:00:25 DietPi systemd[1]: cron.service: Unit process 688003 (systemctl) remains running after unit stopped.
Jun 12 02:00:25 DietPi systemd[1]: Stopped Regular background program processing daemon.
Jun 12 02:00:25 DietPi systemd[1]: cron.service: Consumed 25.778s CPU time.
Jun 12 02:02:54 DietPi systemd[1]: cron.service: Found left-over process 687279 (cron) in control group while starting unit. Ignoring.
Jun 12 02:02:54 DietPi systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jun 12 02:02:54 DietPi systemd[1]: cron.service: Found left-over process 687280 (dash) in control group while starting unit. Ignoring.
Jun 12 02:02:54 DietPi systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jun 12 02:02:54 DietPi systemd[1]: cron.service: Found left-over process 687281 (run-parts) in control group while starting unit. Ignoring.
Jun 12 02:02:54 DietPi systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jun 12 02:02:54 DietPi systemd[1]: cron.service: Found left-over process 687283 (dietpi) in control group while starting unit. Ignoring.
Jun 12 02:02:54 DietPi systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jun 12 02:02:54 DietPi systemd[1]: cron.service: Found left-over process 687985 (dietpi-backup) in control group while starting unit. Ignoring.
Jun 12 02:02:54 DietPi systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jun 12 02:02:54 DietPi systemd[1]: cron.service: Found left-over process 688645 (dietpi-services) in control group while starting unit. Ignoring.
Jun 12 02:02:54 DietPi systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jun 12 02:02:54 DietPi systemd[1]: cron.service: Found left-over process 688654 (systemctl) in control group while starting unit. Ignoring.
Jun 12 02:02:54 DietPi systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jun 12 02:02:54 DietPi systemd[1]: Started Regular background program processing daemon.
Jun 12 02:02:54 DietPi cron[688655]: (CRON) INFO (pidfile fd = 3)
Jun 12 02:02:54 DietPi cron[688655]: (CRON) INFO (Skipping @reboot jobs -- not system startup)
Jun 12 02:02:54 DietPi CRON[687279]: pam_unix(cron:session): session closed for user root
Jun 12 02:17:01 DietPi CRON[695567]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)```

Any guidance would be greatly appreciated

Joulinar · 13 June 2022 12:07

does the information on the dietpi-backup.log simply stopped? No error message at the end?

Is there anything in the system log around 2am on the day the backup was not working?

theverymrp · 14 June 2022 01:19

Thanks for the quick reply. Log simply stops, there are no records on failed days. I’ve since rebooted so journalctl is empty. Will check on next fail and report back.

I did find that Plex has scheduled maintenance tasks set to start at 2:00, I’ve changed that.

MichaIng · 14 June 2022 06:00

Could be the OOM (out-of-memory) killer when backup and Plex maintenance exhaust all system memory. Let’s see whether moving the Plex maintenance schedule helps. A quick check for kernel errors (including OOM killer) can be done via:

dmesg -l emerg,alert,crit,err

theverymrp · 15 June 2022 00:11

backup completed successfully today. Here’s what i get for the kernal errors

[    0.570571] bcm2708_fb soc:fb: Unable to determine number of FBs. Disabling driver.
[    2.273440] sd 0:0:0:0: [sda] No Caching mode page found
[    2.273452] sd 0:0:0:0: [sda] Assuming drive cache: write through
[   50.254950] EXT4-fs error (device sda1): ext4_validate_block_bitmap:390: comm ext4lazyinit: bg 1769: bad block bitmap checksum
[  142.755847] EXT4-fs error (device sda1): ext4_validate_block_bitmap:390: comm ext4lazyinit: bg 29304: bad block bitmap checksum
[65810.832264] EXT4-fs error (device sda1): ext4_lookup:1785: inode #11665819: comm du: iget: checksum invalid
[65810.835090] EXT4-fs error (device sda1): ext4_lookup:1785: inode #11665793: comm du: iget: checksum invalid
[65810.835892] EXT4-fs error (device sda1): ext4_lookup:1785: inode #11665825: comm du: iget: checksum invalid
[65810.837587] EXT4-fs error (device sda1): ext4_lookup:1785: inode #11665803: comm du: iget: checksum invalid
[65810.837851] EXT4-fs error (device sda1): ext4_lookup:1785: inode #11665799: comm du: iget: checksum invalid
[65810.838145] EXT4-fs error (device sda1): ext4_lookup:1785: inode #11665829: comm du: iget: checksum invalid
[65810.838375] EXT4-fs error (device sda1): ext4_lookup:1785: inode #11665804: comm du: iget: checksum invalid
[65810.838899] EXT4-fs error (device sda1): ext4_lookup:1785: inode #11665839: comm du: iget: checksum invalid
[65810.839963] EXT4-fs error (device sda1): ext4_lookup:1785: inode #11665808: comm du: iget: checksum invalid
[65810.840349] EXT4-fs error (device sda1): ext4_lookup:1785: inode #11665821: comm du: iget: checksum invalid
[65893.487131] EXT4-fs error (device sda1): ext4_lookup:1785: inode #11665819: comm rsync: iget: checksum invalid
[65893.487826] EXT4-fs error (device sda1): ext4_lookup:1785: inode #11665793: comm rsync: iget: checksum invalid
[65893.488215] EXT4-fs error (device sda1): ext4_lookup:1785: inode #11665825: comm rsync: iget: checksum invalid
[65893.489876] EXT4-fs error (device sda1): ext4_lookup:1785: inode #11665803: comm rsync: iget: checksum invalid
[65893.490199] EXT4-fs error (device sda1): ext4_lookup:1785: inode #11665799: comm rsync: iget: checksum invalid
[65893.490486] EXT4-fs error (device sda1): ext4_lookup:1785: inode #11665829: comm rsync: iget: checksum invalid
[65893.490756] EXT4-fs error (device sda1): ext4_lookup:1785: inode #11665804: comm rsync: iget: checksum invalid
[65893.491155] EXT4-fs error (device sda1): ext4_lookup:1785: inode #11665839: comm rsync: iget: checksum invalid
[65893.491850] EXT4-fs error (device sda1): ext4_lookup:1785: inode #11665808: comm rsync: iget: checksum invalid
[65893.492172] EXT4-fs error (device sda1): ext4_lookup:1785: inode #11665821: comm rsync: iget: checksum invalid
[65982.664186] EXT4-fs error (device sda1): ext4_lookup:1785: inode #11665819: comm rsync: iget: checksum invalid
[65982.664732] EXT4-fs error (device sda1): ext4_lookup:1785: inode #11665793: comm rsync: iget: checksum invalid
[65982.665004] EXT4-fs error (device sda1): ext4_lookup:1785: inode #11665825: comm rsync: iget: checksum invalid
[65982.665443] EXT4-fs error (device sda1): ext4_lookup:1785: inode #11665803: comm rsync: iget: checksum invalid
[65982.665662] EXT4-fs error (device sda1): ext4_lookup:1785: inode #11665799: comm rsync: iget: checksum invalid
[65982.665852] EXT4-fs error (device sda1): ext4_lookup:1785: inode #11665829: comm rsync: iget: checksum invalid
[65982.666025] EXT4-fs error (device sda1): ext4_lookup:1785: inode #11665804: comm rsync: iget: checksum invalid
[65982.666295] EXT4-fs error (device sda1): ext4_lookup:1785: inode #11665839: comm rsync: iget: checksum invalid
[65982.666712] EXT4-fs error (device sda1): ext4_lookup:1785: inode #11665808: comm rsync: iget: checksum invalid
[65982.666929] EXT4-fs error (device sda1): ext4_lookup:1785: inode #11665821: comm rsync: iget: checksum invalid```

Joulinar · 15 June 2022 12:52

looks like your external drive is not healthy. There are some file system errors. Is this drive your backup location? Or do you run the entire system on it?

theverymrp · 16 June 2022 02:31

Yes that drive is the backup location, the system runs on another drive. I ran fsck today to fix errors. Backup was successful again this morning.

Joulinar · 16 June 2022 08:21

keep an eye on the kernel error messages if you still have file system errors

dmesg -l emerg,alert,crit,err

theverymrp · 18 June 2022 01:20

file system errors are gone now:

[    0.570498] bcm2708_fb soc:fb: Unable to determine number of FBs. Disabling driver.
[    2.273416] sd 0:0:0:0: [sda] No Caching mode page found
[    2.273429] sd 0:0:0:0: [sda] Assuming drive cache: write through```

Since adjusting plex schedule backup has completed successfully.

Joulinar · 18 June 2022 07:13

That looks at least promising