Troubleshooting help - how to identify "Random" reboots and/or non-responsive device

1activegeek
Posts: 6
Joined: Sat Jun 15, 2019 7:30 pm

Troubleshooting help - how to identify "Random" reboots and/or non-responsive device

Post by 1activegeek » Sat Jun 15, 2019 7:41 pm

I've been having an issue lately that I can't put a finger on, and I'm hoping for troubleshooting tips to help identify the cause. I'm running DietPi 6.24.1 on a Rock64 board (I believe it has happened on the previous 1-2 releases as well). I run a few things on here, so it's possible that an apt package update or a software version update is the root cause. I'm trying to see what troubleshooting tools are available in this scenario. Packages running include Node-Red, HomeAssistant, PiHole, and NetData. I know when the issue has happened because of hc.io health checks running against the HomeAssistant web UI.

The scenario goes like this: once the device comes up, it runs fine for roughly 1-2 days. After that, it will either reboot or freeze. A reboot causes hc.io notifications that the service is down, and roughly 5-10 minutes later it comes back up. If the outage lasts longer than this, it has frozen. By frozen I mean that it no longer responds to any network activity: the IP no longer answers pings and I cannot access the device via SSH. Because of its location, I currently have no easy way to access the console or plug in a monitor/keyboard for more in-depth troubleshooting.

What I'm hoping for is a logging utility or troubleshooting tool that can dump the running system logs to non-volatile storage, rather than having the log data cleared after a reboot. Any thoughts, ideas, or help you guys can provide? I appreciate any input at this point, since this is driving me nuts. Controlling my HA systems with this device means they die every few days, and we rely heavily on our voice assistants working through HA.
Last edited by 1activegeek on Mon Jun 24, 2019 5:14 am, edited 1 time in total.

MichaIng
Site Admin
Posts: 1728
Joined: Sat Nov 18, 2017 5:21 pm

Re: Troubleshooting help - how to identify "Random" reboots and/or non-responsive device

Post by MichaIng » Thu Jun 20, 2019 7:43 pm

@1activegeek
First of all, check dmesg for any red lines, which may point to voltage/power issues or disk I/O errors.
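
As a rough sketch of that first check, the kernel log can be narrowed down to error and warning lines before reading it by eye. The keyword list and the helper name `kernel_problem_lines` are assumptions chosen for illustration, not an exhaustive or official set.

```shell
# Hypothetical helper: keep only kernel log lines that match a few
# illustrative keywords (power, disk I/O, file system, out-of-memory).
# The keyword list is an assumption; extend it for your board.
kernel_problem_lines() {
    grep -Ei 'under-?voltage|I/O error|EXT4-fs (error|warning)|out of memory'
}

# Usage on the device (dmesg may need root):
#   dmesg --level=err,warn          # kernel messages at error/warning level only
#   dmesg | kernel_problem_lines    # keyword filter over the full log
```

An empty result does not prove the hardware is healthy; it only means none of the assumed keywords appeared.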

To get boot-persistent system logs, so you can check the last entries before a freeze:
dietpi-software > Uninstall DietPi-RAMlog > reboot to apply (required to disable the tmpfs without losing any existing logs).
Enable boot-persistent system logs: mkdir /var/log/journal

Then, after a freeze or just a hang/restart of HA, run: journalctl
- Scroll to the end to see the last entries, or use reverse order to see the last entries at the top: journalctl -r
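
The journal steps above can be sketched as follows. This assumes journald runs with its default `Storage=auto` setting, in which case logs are kept across reboots only if `/var/log/journal` exists. The function names are hypothetical.

```shell
# Sketch, assuming journald uses its default Storage=auto setting:
# logs survive a reboot only if the journal directory exists.
# The optional argument exists only so the check can be tried on another path.
journal_dir_exists() {
    [ -d "${1:-/var/log/journal}" ]
}

# Create the directory (as root) and restart journald so it starts using it.
# A reboot works as well.
enable_persistent_journal() {
    mkdir -p /var/log/journal && systemctl restart systemd-journald
}

# After the next freeze or unexpected reboot:
#   journalctl --list-boots    # one line per recorded boot
#   journalctl -b -1 -r        # previous boot, newest entries first
```

Reading the previous boot with `-b -1` avoids scrolling through the current boot to find the last entries written before the crash.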

Re: Troubleshooting help - how to identify "Random" reboots and/or non-responsive device

Post by 1activegeek » Fri Jun 21, 2019 2:57 am

Thanks for the input @MichaIng - I was perusing dmesg originally and not seeing anything of use. Today, though, I notice the block of log lines below repeating. A quick Google search didn't turn up much beyond the purpose of lost+found. :lol: I will look at setting up the persistent logging so I can check what's going on after it fails. It's intermittent beyond belief lately. The other day (while I was away, no less) the thing was rebooting almost every hour through the night. At other times it has just gone down and stayed down until I power cycle the PoE switch port. Super aggravating at this point. Thanks for the pointers - I hope I can catch this again in the next day or so.

Code:

[ 2090.326503] EXT4-fs warning (device zram0): ext4_dirent_csum_verify:353: inode #11: comm find: No space for directory leaf checksum. Please run e2fsck -D.
[ 2090.326519] EXT4-fs error (device zram0): ext4_readdir:189: inode #11: comm find: path /var/log/lost+found: directory fails checksum at offset 4096
[ 2090.327424] EXT4-fs warning (device zram0): ext4_dirent_csum_verify:353: inode #11: comm find: No space for directory leaf checksum. Please run e2fsck -D.
[ 2090.327435] EXT4-fs error (device zram0): ext4_readdir:189: inode #11: comm find: path /var/log/lost+found: directory fails checksum at offset 8192
[ 2090.328472] EXT4-fs warning (device zram0): ext4_dirent_csum_verify:353: inode #11: comm find: No space for directory leaf checksum. Please run e2fsck -D.
[ 2090.328489] EXT4-fs error (device zram0): ext4_readdir:189: inode #11: comm find: path /var/log/lost+found: directory fails checksum at offset 12288
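
When a block like this repeats many times, a per-device count can make it easier to see which file system is affected. The following is a minimal sketch with a hypothetical helper name; it only understands the `EXT4-fs error (device …)` line format shown above.

```shell
# Hypothetical helper: count "EXT4-fs error" lines per device.
# Reads kernel log text on stdin and prints "<device>: <count>" per device.
ext4_errors_by_device() {
    sed -n 's/.*EXT4-fs error (device \([^)]*\)).*/\1/p' |
        sort | uniq -c | awk '{ print $2 ": " $1 }'
}

# Usage on the device:
#   dmesg | ext4_errors_by_device
```

In the excerpt above every error refers to zram0, the device behind /var/log, rather than to the SD card.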

Re: Troubleshooting help - how to identify "Random" reboots and/or non-responsive device

Post by MichaIng » Sun Jun 23, 2019 6:17 pm

@1activegeek
Ah this must be the zRam implementation that comes with ARMbian... Seems to cause issues here.

Please do the following:

Code:

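# Turn off the ARMbian zram config and remove the ARMbian cron jobs
G_CONFIG_INJECT 'ENABLED=' 'ENABLED=false' /etc/default/armbian-zram-config
rm /etc/cron.*/armbian*
# Stop services and call DietPi-RAMlog before /var/log is touched
dietpi-services stop
/DietPi/dietpi/func/dietpi-ramlog 1
# Disable and stop every ARMbian systemd unit
# (systemctl disable expects unit names, so strip the directory part)
for i in /lib/systemd/system/armbian*
do
    systemctl disable --now "${i##*/}"
done
# Recreate /var/log from scratch and remount it
umount /var/log
rm -R /var/log
mkdir /var/log
mount /var/log
# Call DietPi-RAMlog again, drop the leftover lost+found, restart services
/DietPi/dietpi/func/dietpi-ramlog 0
rm -R /var/log/lost+found
dietpi-services start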
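
One way to confirm the result of the steps above is to look at what is mounted on /var/log afterwards. This is a sketch with a hypothetical helper name; the expectation that a DietPi-RAMlog setup shows `tmpfs` rather than `/dev/zram0` is an assumption based on the tmpfs mentioned earlier in the thread.

```shell
# Hypothetical helper: print the device or file system mounted at a path.
# The second argument (mount table file) defaults to /proc/mounts.
# If the path is mounted more than once, the last entry is printed.
mount_source_for() {
    awk -v mp="$1" '$2 == mp { src = $1 } END { if (src != "") print src }' "${2:-/proc/mounts}"
}

# Usage on the device:
#   mount_source_for /var/log    # /dev/zram0 would mean the zram log is still active
```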

Re: Troubleshooting help - how to identify "Random" reboots and/or non-responsive device

Post by MichaIng » Sun Jun 23, 2019 8:04 pm

Added a patch for DietPi v6.25 to re-remove ARMbian services (and some related files). Some of them are new, so they were installed and enabled with the last linux-root-* package upgrade: https://github.com/MichaIng/DietPi/comm ... 6c8c18e7e9

Re: Troubleshooting help - how to identify "Random" reboots and/or non-responsive device

Post by 1activegeek » Mon Jun 24, 2019 5:37 am

Thank you for this. Is it possible this was the cause of the reboots? Was something in that function, or those functions, causing a conflict of sorts at certain points? I strongly suspect this started maybe 3-4 revisions ago. I wanted to blame my HA software, not DietPi, but perhaps I should have been more suspicious? :D

We'll see how this goes. I will say I also noticed a little extra bit of screen output about ARMbian on boot, like an MOTD that was being overwritten by the DietPi one.

Crossing my fingers for now. I also adjusted the lines you gave me to add system/ to the directory path for the systemd files:

Code:

for i in /lib/systemd/system/armbian*

Thankfully I understood what you were trying to do :P

WarHawk
Posts: 525
Joined: Thu Jul 20, 2017 7:55 am

Re: Troubleshooting help - how to identify "Random" reboots and/or non-responsive device

Post by WarHawk » Mon Jun 24, 2019 9:55 am

Most of the time, flaky restarts and instability are due to power.

The output of the adapter (or whatever supplies power) is 5+ VDC, but through a cheap cable the voltage at the device ends up well below 5 VDC...
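
The cable effect described here follows Ohm's law: the voltage at the device is the supply voltage minus current times cable resistance. A small sketch, where the helper name and all numbers are illustrative assumptions rather than measurements from this setup:

```shell
# Hypothetical helper: voltage left at the device after the cable drop.
# Arguments: supply voltage (V), load current (A), cable resistance (ohm).
voltage_at_device() {
    awk -v v="$1" -v i="$2" -v r="$3" 'BEGIN { printf "%.2f\n", v - i * r }'
}

# Assumed example values: 5.1 V supply, 2 A load, 0.25 ohm round-trip cable resistance
voltage_at_device 5.1 2 0.25    # prints 4.60
```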

Re: Troubleshooting help - how to identify "Random" reboots and/or non-responsive device

Post by 1activegeek » Mon Jun 24, 2019 5:15 pm

Well, I can try testing a different power source, but right now this is using what should be a pretty solid one: the Rock64's own PoE adapter. It has been running solidly for quite some time as well. This only started more recently (say the past 2+ months). I've been fighting it for a while and unfortunately haven't had time to sit down and troubleshoot extensively due to work travel.

Re: Troubleshooting help - how to identify "Random" reboots and/or non-responsive device

Post by MichaIng » Sun Jun 30, 2019 2:20 pm

@1activegeek
Ah lol, now I noticed the missing "system/" above, good that you found it ;).

I added this cleanup to v6.25 patch as well btw, including the removal of all the services and traces: https://github.com/MichaIng/DietPi/blob ... 2051-L2067

I hope this indeed solves your random reboot/crash issues. What I am interested in is whether any APT updates from the ARMbian repo will reinstall those files (most likely) and re-enable the services (hopefully not).

It would be great if you could check from time to time whether this prints any active units: systemctl status armbian*
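
For a repeatable version of that check, the list of unit files can be filtered for anything that has become enabled again. This is a sketch with a hypothetical helper name; it relies on `systemctl list-unit-files` printing the unit name in the first column and its state in the second.

```shell
# Hypothetical helper: from `systemctl list-unit-files` output (unit name in
# column 1, state in column 2), print the units whose state is "enabled".
enabled_units() {
    awk '$2 == "enabled" { print $1 }'
}

# Usage on the device; the pattern is quoted so the shell does not expand it
# against files in the current directory:
#   systemctl list-unit-files --no-legend 'armbian*' | enabled_units
#   systemctl list-units --all --no-legend 'armbian*'
```

No output from the first command would mean that no matching unit file is currently enabled.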

Re: Troubleshooting help - how to identify "Random" reboots and/or non-responsive device

Post by 1activegeek » Sun Jun 30, 2019 5:42 pm

Yeah, I checked the git link you posted. I did get it working for now. I'll try to set up a reminder to check that status info after the apt updates I run (usually every other week on Sundays).

So far it seems to be "working", but I'm still finding that the device has some reboots; it just isn't crashing the system anymore like it used to. What I mean is that when I log in, the uptime is not days (since my last crash alert) but some number of hours, or in some instances minutes. I also notice that sometimes when logging in, it gets hung up trying to display the full MOTD banner: it gets through about 80% of the info and then sits as if it's still running but never finishes. CTRL+C cancels it and I get a prompt, but it's odd behavior.
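
Silent reboots like these can be tracked by comparing the current uptime with the boots recorded in the journal. A minimal sketch with a hypothetical helper name; the boot list is only useful if persistent journald logging is enabled.

```shell
# Hypothetical helper: print the uptime in hours, read from /proc/uptime
# (or from a file in the same format passed as the first argument).
uptime_hours() {
    awk '{ printf "%.1f\n", $1 / 3600 }' "${1:-/proc/uptime}"
}

# Usage on the device:
#   uptime_hours               # hours since the last boot
#   uptime -s                  # date and time of the last boot
#   journalctl --list-boots    # one line per boot kept in the journal
```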

In the end I may decide to wipe it clean and start anew on a separate SD card, and possibly compare the two images. But for now at least it's better than it was: I went away for the whole week and it didn't crash to the point where I had to power cycle the PoE port. Thanks for the help!!
