Shouldn't fsck run with '-y' when needed?

Hello, I hope I’m posting in the right category, as I don’t think this is a troubleshooting issue; if not, please advise.

So, yesterday my Pi 5 restarted and wouldn’t come up on SSH. It pinged, but no SSH. I went and bought a micro-HDMI cable only to discover that video output is not enabled by default on a Raspberry Pi 5 -_-

I then booted from a fresh SD card, ran fsck on my original root partition, chose Y at every question, and that was it, it fixed the problem. You know, it’s a Raspberry Pi, sometimes corruption happens.

So I am left here wondering: why did I have to do that? What was the purpose of it? I doubt anyone nowadays (me included) has any clue what fsck is asking anyway, so shouldn’t fsck be run with the -y argument by default?

I understand that there is a degree of risk there, but we’re talking about an OS that mostly runs on small SoC systems that are usually headless and remote.

I think it’s a risk worth taking, considering that the alternative is hours of downtime, a potential trip to the store for some hardware, and an operator on site… just to press Y.

Also, is there any way to configure fsck to run with -y? I know some distros offer that option, but I couldn’t find anything official for DietPi.

Thanks to anyone that wants to contribute. The project is great by the way, this is the first real issue I had and except for this I couldn’t be happier. Cheers!

Maybe there is a misunderstanding of what DietPi is. DietPi is not its own operating system or distro. It’s a Debian system running some tweaks and scripts to make it “Diet”. That means all tools are plain Debian tools. This includes documentation and functionality.

I’m aware that it’s “just scripts on Debian”, but calling it that seems demeaning.

Back on topic: on Debian you’d add fsck.repair=yes to the kernel command line via GRUB, but generally speaking GRUB is not found on Raspberry Pis, and it’s not (at least by default) on DietPi.

Yet it seems that on Raspberry Pi OS the option is on by default in cmdline.txt, so I wonder why it is not present in DietPi?

Was it overlooked? Or is it a design choice?

In any case, I believe the current default should change or there should at the very least be an option to do so, but I’d like to hear other people opinion on that.

GRUB is a bootloader for x86-based systems. It’s not something available on ARM SBCs.

I guess @MichaIng can explain why it is like it is.

I wrote so because GRUB apparently works on Raspberry Pis too.

Still I’d like not to go off topic with this, my question was genuine and we’re focusing on the wrong things here.

HDMI output is not enabled OOTB on RPi 5? On my last test it was, IIRC :thinking:. But hotplug detection does not usually work in my case, so I need to have the screen attached before boot.

What did you do to enable video output, if it was not before?

About your actual question: The default for fsck.repair is preen, which passes the fsck -p argument instead of -y, fixing only issues which are safe to fix without potentially causing more damage/data loss: e2fsck(8) — e2fsprogs — Debian trixie — Debian Manpages

We decided to follow the conservative default instead of enforcing all repair steps, though I agree that this has downsides as well, especially if users have no other Linux system to manually check the affected drive/SD card. As in: better some chance of more data loss, but a booting system, than the system not booting at all?
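For the "manually check on another Linux system" route, here is a minimal sketch of the read-only inspection step. The scratch image is only there so the commands can run anywhere; on real hardware you would point e2fsck at the unmounted partition instead (the device name is a placeholder you must verify yourself).

```shell
# Hedged sketch: inspect a filesystem read-only before deciding on repairs.
# Demonstrated on a throwaway image file; on real hardware substitute the
# unmounted partition (e.g. /dev/sdX2 -- a placeholder, double-check it!).
IMG=$(mktemp)
dd if=/dev/zero of="$IMG" bs=1M count=8 2>/dev/null
mkfs.ext4 -Fq "$IMG"

# -f: check even if the filesystem is marked clean
# -n: answer "no" to every question, so nothing is modified -- you get a
#     damage report first and can then decide between -p, -y, or manual answers
e2fsck -fn "$IMG"
```

Running with `-n` first is the cautious counterpart to `-y`: it shows what a repair would touch without committing to anything.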

But it cannot be the actual bootloader, just a 2nd-stage bootloader, which then IMO does not make much sense, as long as you do not require some of its facilities in particular. Especially RPis have the convenient config.txt and cmdline.txt, which allow altering kernel command-line parameters and firmware settings more easily than GRUB does. We use GRUB where it can really act as first-stage bootloader on the root drive, which is the case only for x86 systems. All ARM and RISC-V systems I am aware of require a custom vendor bootloader (RPi), U-Boot, or EFI, and only from there can GRUB be additionally invoked, which IMO then loses its point. U-Boot can invoke more flexible own boot scripts, and U-Boot and EFI can use the more low-level, standardized extlinux etc.

So if you want fsck.repair=yes, knowing the implications, just add it to /boot/cmdline.txt on RPi. On other ARMs it would be the extraargs line in /boot/dietpiEnv.txt, or the append line in /boot/extlinux/extlinux.conf if that one is used instead of a boot script.
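As a sketch of that cmdline.txt edit: the helper function below is hypothetical (DietPi ships nothing like it), and note that on some images the file lives at /boot/firmware/cmdline.txt instead. The important detail is that cmdline.txt must stay a single line.

```shell
# Hedged sketch: idempotently set fsck.repair=yes on an RPi kernel command
# line. add_fsck_repair is a made-up helper name, and the default path is an
# assumption -- some images use /boot/firmware/cmdline.txt.
CMDLINE="${CMDLINE:-/boot/cmdline.txt}"

add_fsck_repair() {
    if grep -q 'fsck\.repair=' "$CMDLINE"; then
        # replace an existing fsck.repair=... value (e.g. preen)
        sed -i 's/fsck\.repair=[^ ]*/fsck.repair=yes/' "$CMDLINE"
    else
        # append to the single kernel command line, keeping it one line
        sed -i '1 s/$/ fsck.repair=yes/' "$CMDLINE"
    fi
}
```

A reboot is needed afterwards for the new command line to take effect.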

Sort of annoying that there is no generic bootloader for all systems. But ARM/RISC-V systems have no BIOS or UEFI, and they do no bus enumeration and hence require a device tree, so something else is needed in any case, whether we like it or not. Forcing GRUB on top of that other first-level bootloader may enable a generic high-level config, but it also causes more complexity and overhead, more points of failure etc.

I have multiple Raspberry Pis running (since the first one came out!) but, believe it or not, that was the first time I actually had to connect a screen to one (hence the need to get a cable for it), so I’m not sure if the screen issue was due to some hotplug detection that didn’t happen because no screen was there during install.

To get it working I had to launch dietpi-config and choose Display Options; it prompted me to install something, and after rebooting the screen worked.

We decided to follow the conservative default instead of enforcing all repair steps, though I agree that this has downsides as well, especially if users have no other Linux system to manually check the affected drive/SD card.

I am well aware of the tradeoff; I think it’s worth it though.

I might be wrong, but I assume that the target installs of DietPi are cheap, low-powered headless systems, usually hidden somewhere out of sight. Filesystem repairs are a routine thing, not an exception.

As in: better some chance of more data loss, but a booting system, than the system not booting at all?

Here you’re assuming the average user (because it’s the default behavior) knows how to answer fsck questions, which I’m quite sure is not the case: normal people will just press Y, unaware of any consequences.

Sure, there will always be that user who has been here since before Slackware and wrote his own drivers who will object, but for the vast majority of people you’re causing unnecessary downtime and adding a complicated manual step to what should be an automated process.

One could argue that data security is not the main selling point of the cheap, low-powered headless systems you’re targeting; still, those special power users in need of that kind of safety will definitely know how to configure their cmdline.txt (and will hopefully have more than one backup of their data). To everyone else, you’re doing them a disservice.

It’s no coincidence that everyone is doing it nowadays, from Windows to Linux. It has been both Debian’s and Raspberry Pi OS’s default for years, why change it?

No screen/HDMI device needs to be present during install, but it may be unreliably detected when you attach the screen after boot. It should actually work, RPis are supposed to have HDMI HPD, and you can actively disable it, enforcing either the assumption of a connected HDMI device, or of none. But leaving it enabled, in my case, does not work reliably. I usually need to reboot, or power cycle, for anything to show up on the screen.

Do you remember which option you toggled exactly? Because on RPi there are a lot :sweat_smile:. In case you called dietpi-display from there, it asked you to enable KMS/DRM to work. But probably the reboot itself was what fixed it, like in my case, and not the change you made in the config.

On average this is true, and at least DietPi is designed to work as well as it can on such hardware. But nowadays, with eMMC, NVMe-capable M.2 ports, USB boot capabilities, and simply dropping costs for better and especially much larger SD cards, filesystem corruption is a fading problem. Much has changed in the 13 (?) years since the first RPi was released. We also try hard on our end to reduce the chance of corruption, minimizing disk writes, but also having modern ext4 metadata checksumming enabled, automatic R/O remount in case of I/O errors etc.

However, of course it still happens, and whether this is more or less often is probably not so relevant, since more regular filesystem errors mean more regular fsck calls, and hence a more regular risk of making data loss worse if -y is used. So the question is more: in case of a filesystem error which fsck -p does not fix, is the risk of causing more data loss with -y a larger downside than the need to check, and if necessary manually repair, the filesystem on an external system? Considering what the alternative would be: using some professional data rescue tool, or bringing the boot media to a professional data rescue service, for the best but most expensive chance to bring things back. Probably not done by anyone with RPi boot media, of course :smile:.

Most will probably do. But taking away the chance for a more careful review from experienced users as a default?

I get your point. Valid argument, since we aim to make things easy for less experienced users. But note that fsck.repair=yes is, despite all these arguments, not the default on any other distro I know, other than Raspberry Pi OS. Armbian has it in a few selected cases, but the minority. systemd as well as initramfs implementations use fsck.repair=preen by default, and primary distributions do not override this. So the view that this “should be an automated process” is obviously not shared by any of them. The fsck -p option is basically made for exactly these cases: safe automated repair where possible. It can be discussed whether the benefits outweigh the risks in our case, for our average user base, or whether they conditionally do, e.g. if the system was booted from SD card or eMMC chip.

The question is what they expect to be the default, to even look into cmdline.txt. And since they have backups, for them fsck.repair=yes is probably a lower risk than it is for “everyone else” :wink:.

As said, I get your point, and I’ll think about it and talk with @Joulinar and @StephanStS. Just please do not call it a “disservice” and “unnecessary” and what it “should be” instead, in a case where we follow defaults which were chosen by almost all people in charge (systemd, initramfs-tools, Debian, other distros …), each of whom knows more about the implications of -y vs -p than both of us together. Sure, our average use case is different, but this has also shifted over the years, with a lot more x86_64 servers and VMs running DietPi, and a lot more capable SBCs with robust boot media options. So this decision is not as trivial to me as you seem to see it.

Now that you mention it, yes, it asked to enable KMS/DRM, and then after a reboot it worked. I’m positive it was not the reboot itself that fixed it, because the Raspberry Pi spent its day being moved from place to place, powered on and off over and over for troubleshooting. Also, what fixed it was a fresh SD install which didn’t have video output either (but it would at least come up on the network).

One could argue that all those options increase entropy and end up reducing stability, not the other way around. Take the Raspberry Pi ecosystem as an example: NVMe HATs are rarely original, they’re usually third party and purchased with the “order by cheapest” algorithm by the vast majority of people. Power supplies, NVMe drives (which are still not 100% compatible, some work and some don’t, and some mostly work, which is the worst case) and other devices face the same fate.

I’m glad we agree on that. Then the default should care for their needs and give them what they want: less hassle and more automation.

While the default will care for most of the people, those few power users who prefer a different approach are free to configure it. You’re not taking their choice away!

Again, I understand that it’s a tradeoff, there is no one-solution-suits-all here. In my humble experience, when fsck -y ends up doing more harm than good, the data is mostly gone already.

I apologize if my comment came out as harsh, it wasn’t my intention. I have lots of respect for you guys and this project.

If you end up deciding to stick with the preen approach, it might be worth communicating it somehow to users, so those coming from Raspberry Pi OS will know what to expect.

You mean the boot media options? I’d say they are a necessary evolution of SBCs, aligning their capabilities with regular PCs and server hardware at a lower power level, also being able to replace NAS boxes, routers etc. with less vendor lock-in regarding OS and software. Staying with the capabilities of the first RPi models while the hardware and software around them evolve, like the IT market in general always did, doesn’t make much sense either.

That is true both ways round: one can always add the cmdline argument for full fsck repair, like one needs to do on almost any other Linux distro if wanted. The question is only which default serves our typical use case best. We just chose to go with Linux defaults at some point.

I am considering adding fsck.repair=yes by default for the following reasons:

  • We enable fsck on boot for root and boot partitions only, not for any additionally attached drives. If there is important userdata, it is often on external drives.
  • The OS itself and installed software are harmless to lose. No one would use a professional data recovery service to save the time of installing a fresh instance of the OS. Either fsck -y manages to repair it sustainably, or one will need to flash a new image anyway, agreeing with what you said in these regards.
  • If there is really important userdata, one usually has a backup on another drive.
  • Indeed SD cards are still the most used rootfs media among our users.

The concerns I have:

  • ext4 is a journaling filesystem, so in case of power loss and similar, it can automatically restore functional metadata without any need for fsck. If there are lost links or incomplete file writes, fsck -p (the default) stores them in /lost+found. If really ext4 itself and in case fsck -p are not able to repair it, there is almost certainly some hardware issue, like bad blocks on the SD card, or some bug/corruption in kernel or low-level userland tools, which causes these non-safely-fixable filesystem errors. It is then quite common that fsck -y does not sustainably repair it, but that any next write just causes the same issue again. I had this very issue with bad blocks on my home server’s SD card. One could basically watch across fsck -y iterations how every repair moved other data into the bad blocks, so the next iteration found errors with other inodes again, breaking the related files that way. I did have an external drive and backups, so it was not a problem, and interesting to monitor somewhat. I reflashed the OS a few times, after which it worked well for a quite similar time, until data rotation reached the bad blocks again. fsck -y sometimes was able to make the system bootable again, just depending on the random chance of whether the files it broke in turn were system-critical or not. But it usually just took another apt upgrade until something serious was broken again. Just making that clear: fsck -y really broke additional files each time it was running, restoring or moving other inodes to /lost+found. This is natural, the way it works, and the very reason why -y is not the default.
  • While in some cases fsck -y is anyway just the only chance, either working or not, without much to lose, a bigger problem I see is that users do not even recognize for some time that their hardware (SD card/drive) is broken. They run into errors or crashes, and a reboot “fixes” it silently, since most people won’t know to check journalctl -t systemd-fsck or /run/initramfs/fsck.log (if an initramfs is used). So all seems fine while in fact they might have lost data already. And this may continue for an extended period of time until system-critical data is affected. The problem with this is that files might have been affected long before which do not cause immediate errors when missing or corrupted, but may compromise the system in other ways: think of an .htaccess file which prevents visitors of a web application from accessing personal data directly, bypassing authentication. Or plain-text passwords in one of its config files, which is quite common for e.g. database access, or SMTP server credentials for sending out emails etc. Chances for this exact thing to happen are low of course, but running a system solely on regular fsck -y certainly is a very bad idea, not only regarding potential irreversible data loss, and fsck.repair=yes raises the chance that such a situation remains unrecognized for a longer time.

Thank you! That’s what I’ve been trying to say!

Regarding the other two points

  • While it is true that ext4 is a journaling FS and fsck should not be needed, on cheap low-powered devices errors do happen more often than usual (think of the cheap NVMe HATs that are so popular nowadays), and those devices are kind of the target audience
    I don’t know why I got that error; the same system has been stable ever since. Maybe the super-thin NVMe HAT cable wasn’t seated properly? Maybe the power adapter was unstable (it wasn’t the original one, it was supposed to be a better GaN version)? Maybe moving 300 GB of data onto Immich triggered some sort of otherwise hidden error? Anyway, this stuff kind of happens on cheap devices. They’re not meant to be rock solid, they’re meant to be cheap.
  • It is 100% true that fsck -y can mask a failing SD card. I have nothing to argue here. I’ve seen systems boot just fine, writing to what was basically /dev/null, working for weeks until that data needed to be read again. I even saved one with rpi-clone onto a new SD!

So basically it all goes to this last point: warning the user.

In my opinion, forcing a user to connect a display, keyboard and mouse to a low-powered, usually hidden-away system should be avoided if not absolutely necessary. And I argue that with automatic fsck -y it is not.

Would it be possible to add a check of the journalctl -t systemd-fsck output to the login scripts and warn the user of any errors? I don’t mean running it at every login… Maybe it could be a once-per-boot thing that keeps a record of it on tmpfs? I don’t know, I’m just thinking out loud.

Done: fs_partition_resize: force fsck via max mount count · MichaIng/DietPi@77c11a5 · GitHub
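As a rough illustration of the mechanism that commit relies on (not the commit’s actual code): ext4 tracks a mount count and a maximum mount count, and fsck runs a full check once the former exceeds the latter. The sketch below demonstrates this with tune2fs on a scratch image rather than a real boot drive.

```shell
# Hedged sketch of forcing a full fsck via the mount count, demonstrated on
# a throwaway filesystem image (never experiment on a mounted boot drive).
IMG=$(mktemp)
dd if=/dev/zero of="$IMG" bs=1M count=8 2>/dev/null
mkfs.ext4 -Fq "$IMG"

# Max mount count 1, current mount count bumped past it: the next time the
# filesystem is mounted, fsck is forced to run a full check.
tune2fs -c 1 "$IMG" >/dev/null
tune2fs -C 2 "$IMG" >/dev/null
tune2fs -l "$IMG" | grep -E '^(Mount count|Maximum mount count):'
```

On a real system the same two tune2fs calls against the root partition would trigger a full check at the next boot.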

If an initramfs is used, it is /run/initramfs/fsck.log instead. Actually a good idea to check the logs for errors once at boot and show a notice in the login banner if some were found. It should be done for ext3/ext4 filesystems only, since with other rootfs types, fsck is either a dummy, or it always fails due to wrong arguments etc. Sadly, even though there is supposed to be a generic CLI, in practice all relevant filesystem types do not implement it that way.
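The once-per-boot banner idea could be sketched roughly as below. Everything here is an assumption, not a DietPi interface: the stamp path, the function name, and the error patterns are all made up for illustration; a real implementation would tune the patterns to actual systemd-fsck/e2fsck output.

```shell
# Hedged sketch of a once-per-boot fsck warning for a login script.
# The stamp path and grep patterns are assumptions, not official names.
STAMP=/run/dietpi-fsck-warned   # /run is tmpfs, so the stamp resets on reboot

scan_fsck_log() {
    # $1: a saved fsck log, e.g. a dump of `journalctl -t systemd-fsck`,
    # or /run/initramfs/fsck.log when an initramfs is used
    if grep -qiE 'error|corrupt|unexpected inconsistency' "$1"; then
        echo 'WARNING: fsck reported filesystem errors this boot, check your boot media!'
    fi
}

# In a login script, roughly (not executed here):
#   if [ ! -e "$STAMP" ]; then
#       journalctl -t systemd-fsck --no-pager -q > /tmp/fsck.dump 2>/dev/null
#       scan_fsck_log /tmp/fsck.dump
#       : > "$STAMP"
#   fi
```

Because the stamp lives on tmpfs, the check runs at most once per boot, matching the suggestion above.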