Nvme Controller down after retrieving big amount of data (log files home automation)

Creating a bug report/issue

Required Information

  • DietPi version |
    G_DIETPI_VERSION_CORE=8
    G_DIETPI_VERSION_SUB=12
    G_DIETPI_VERSION_RC=1
    G_GITBRANCH=‘master’
    G_GITOWNER=‘MichaIng’
    G_LIVE_PATCH_STATUS[0]=‘not applied’

  • Distro version |
    bullseye

  • Kernel version |
    Linux xxxnas 5.10.66-27-rockchip-gea60d388902d #rockchip SMP Mon Oct 24 08:25:47 UTC 2022 aarch64 GNU/Linux

  • SBC model |
    Generic Device (aarch64) - Radxa Rock 5b

  • SD card used
    no SD Card | System:emmc | Data: nvme Samsung 970 Evo

Steps to reproduce

  1. Retrieving temperature diagramms of the homeautomation system (webinterface). The Log files are on the nvme disk.
    Maybe the error is connected with the last dietpi-update I made.
    The system itself ist up and running for several weeks without a problem (before this issue).

Actual behaviour

  • The nvme is unmounted and no longer accessible

Extra details

nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[  146.629525] blk_update_request: I/O error, dev nvme0n1, sector 500288960 op 0x0:(READ) flags 0x80700 phys_seg 3 prio class 0
[  146.630033] blk_update_request: I/O error, dev nvme0n1, sector 500165712 op 0x0:(READ) flags 0x80700 phys_seg 3 prio class 0
[  146.658048] nvme 0000:01:00.0: enabling device (0000 -0002)
[  146.658159] nvme nvme0: Removing after probe failure status: -19
[  146.674836] blk_update_request: I/O error, dev nvme0n1, sector 499124272 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
[  146.674969] Buffer I/O error on dev nvme0n1p1, logical block 62390278, lost async page write
[  146.675038] blk_update_request: I/O error, dev nvme0n1, sector 499124248 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
[  146.675064] Buffer I/O error on dev nvme0n1p1, logical block 62390275, lost async page write
[  146.675095] blk_update_request: I/O error, dev nvme0n1, sector 331352064 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
[  146.675121] Buffer I/O error on dev nvme0n1p1, logical block 41418752, lost async page write
[  146.675153] blk_update_request: I/O error, dev nvme0n1, sector 327158016 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
[  146.675179] Buffer I/O error on dev nvme0n1p1, logical block 40894496, lost async page write
[  146.675210] blk_update_request: I/O error, dev nvme0n1, sector 296175592 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
[  146.675234] Buffer I/O error on dev nvme0n1p1, logical block 37021693, lost async page write
[  146.675266] blk_update_request: I/O error, dev nvme0n1, sector 2288 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
[  146.675290] Buffer I/O error on dev nvme0n1p1, logical block 30, lost async page write
[  146.675322] blk_update_request: I/O error, dev nvme0n1, sector 2208 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
[  146.675347] Buffer I/O error on dev nvme0n1p1, logical block 20, lost async page write
[  146.675376] blk_update_request: I/O error, dev nvme0n1, sector 486804432 op 0x1:(WRITE) flags 0x800 phys_seg 3 prio class 0
[  146.675476] EXT4-fs warning (device nvme0n1p1): ext4_end_bio:349: I/O error 10 writing to inode 9175961 starting block 62605846)
[  146.675511] Buffer I/O error on device nvme0n1p1, logical block 62605589
[  146.675560] EXT4-fs warning (device nvme0n1p1): ext4_end_bio:349: I/O error 10 writing to inode 9175947 starting block 62601605)
[  146.675586] Aborting journal on device nvme0n1p1-8.
[  146.675590] Buffer I/O error on device nvme0n1p1, logical block 62601348
[  146.675609] EXT4-fs warning (device nvme0n1p1): ext4_end_bio:349: I/O error 10 writing to inode 9175967 starting block 62608630)
[  146.675624] EXT4-fs error (device nvme0n1p1): ext4_check_bdev_write_error:216: comm perl: Error while async write back metadata
[  146.675630] Buffer I/O error on dev nvme0n1p1, logical block 60850176, lost sync page write
[  146.675639] JBD2: Error -5 detected when updating journal superblock for nvme0n1p1-8.
[  146.675642] Buffer I/O error on device nvme0n1p1, logical block 62608373
[  146.675659] EXT4-fs warning (device nvme0n1p1): ext4_end_bio:349: I/O error 10 writing to inode 9175952 starting block 62591579)
[  146.675666] Buffer I/O error on device nvme0n1p1, logical block 62591322
[  146.675697] Buffer I/O error on dev nvme0n1p1, logical block 0, lost sync page write
[  146.675717] EXT4-fs warning (device nvme0n1p1): ext4_end_bio:349: I/O error 10 writing to inode 9175951 starting block 62595481)
[  146.675732] EXT4-fs (nvme0n1p1): I/O error while writing superblock
[  146.675744] Buffer I/O error on device nvme0n1p1, logical block 62595224
[  146.675765] EXT4-fs error (device nvme0n1p1) in ext4_reserve_inode_write:5783: Journal has aborted
[  146.675831] Buffer I/O error on dev nvme0n1p1, logical block 0, lost sync page write
[  146.675858] EXT4-fs (nvme0n1p1): I/O error while writing superblock
[  146.675879] EXT4-fs error (device nvme0n1p1): ext4_dirty_inode:5985: inode #9175950: comm perl: mark_inode_dirty error
[  146.675910] EXT4-fs (nvme0n1p1): I/O error while writing superblock
[  146.675931] EXT4-fs error (device nvme0n1p1) in ext4_dirty_inode:5987: Journal has aborted
[  146.675959] EXT4-fs (nvme0n1p1): I/O error while writing superblock
[  146.675996] EXT4-fs error (device nvme0n1p1) in ext4_da_write_end:3141: Journal has aborted
[  146.676026] EXT4-fs (nvme0n1p1): I/O error while writing superblock
[  146.676709] EXT4-fs (nvme0n1p1): I/O error while writing superblock
[  146.676727] EXT4-fs error (device nvme0n1p1): ext4_journal_check_start:83: Detected aborted journal
[  146.676740] EXT4-fs (nvme0n1p1): Remounting filesystem read-only
[  146.709466] JBD2: Error while async write back metadata bh 20.
[  146.709515] JBD2: Error while async write back metadata bh 30.
[  146.710234] JBD2: Error while async write back metadata bh 37021693.
[  146.710411] JBD2: Error while async write back metadata bh 40894496.
[  146.710444] JBD2: Error while async write back metadata bh 41418752.
[  146.712791] JBD2: Error while async write back metadata bh 62390275.
[  146.712889] JBD2: Error while async write back metadata bh 62390278.
[  147.576774] EXT4-fs error (device nvme0n1p1): __ext4_find_entry:1611: inode #10223617: comm perl: reading directory lblock 0
[  147.577010] ------------[ cut here ]------------
[  147.577027] bdi-(unknown) not registered
[  147.577068] WARNING: CPU: 1 PID: 1469 at fs/fs-writeback.c:2309 __mark_inode_dirty+0x21c/0x2e8
[  147.577081] Modules linked in: ch341 usbserial fuse ip_tables x_tables ipv6 r8125
[  147.577122] CPU: 1 PID: 1469 Comm: perl Not tainted 5.10.66-27-rockchip-gea60d388902d #rockchip
[  147.577135] Hardware name: Radxa ROCK 5B (DT)
[  147.577148] pstate: 40400009 (nZcv daif +PAN -UAO -TCO BTYPE=--)
[  147.577160] pc : __mark_inode_dirty+0x21c/0x2e8
[  147.577171] lr : __mark_inode_dirty+0x21c/0x2e8
[  147.577180] sp : ffffffc015e8ba80
[  147.577189] x29: ffffffc015e8ba80 x28: ffffff80240dfcb8
[  147.577203] x27: 0000000000001000 x26: 0000000000000000
[  147.577217] x25: ffffff80240dfb98 x24: 0000000000000000
[  147.577230] x23: 0000000000000000 x22: 0000000000000004
[  147.577243] x21: ffffffc011be4000 x20: ffffff8001a1bc00
[  147.577256] x19: ffffff80240dfb18 x18: 000000000000000a
[  147.577270] x17: 0000000000000000 x16: 0000000000000000
[  147.577282] x15: 000000000008ce03 x14: 0720072007200720
[  147.577296] x13: 0720072007200720 x12: 0720072007200720
[  147.577309] x11: 0720072007200720 x10: 0720072007200720
[  147.577322] x9 : 0720072007200720 x8 : 6f6e20296e776f6e
[  147.577336] x7 : 6b6e75282d696462 x6 : ffffffc011a49518
[  147.577348] x5 : ffffff80ef67f900 x4 : 0000000000000000
[  147.577361] x3 : 0000000000000027 x2 : 0000000000000027
[  147.577375] x1 : 88dd3cfc11e6dd00 x0 : 0000000000000000
[  147.577389] Call trace:
[  147.577400]  __mark_inode_dirty+0x21c/0x2e8
[  147.577413]  mark_buffer_dirty+0xf8/0x124
[  147.577424]  __block_commit_write.isra.43+0xd8/0xdc
[  147.577437]  block_page_mkwrite+0x9c/0xdc
[  147.577450]  ext4_page_mkwrite+0xd0/0x404
[  147.577464]  do_page_mkwrite+0x58/0xac
[  147.577476]  do_wp_page+0x184/0x1e4
[  147.577486]  handle_pte_fault+0x5bc/0x6d8
[  147.577497]  __handle_mm_fault+0xf4/0x108
[  147.577507]  handle_mm_fault+0x7c/0xcc
[  147.577521]  do_page_fault+0x2ac/0x3c4
[  147.577531]  do_mem_abort+0x54/0xb8
[  147.577543]  el0_da+0x38/0x48
[  147.577553]  el0_sync_handler+0xe4/0x12c
[  147.577564]  el0_sync+0x178/0x180
[  147.577574]
               PC: 0xffffffc01022640c:
[  147.577584] 640c  340000e0 f9401ee0 b40000a0 f9400400 2a1403e2 aa1303e1 97ffeb7d f240069f
[  147.577616] 642c  54000040 12147a94 d5033bbf f9405660 93407e96 ea2002df 54000b80 12150298
[  147.577645] 644c  36580074 f240041f 54000b01 91020279 aa1903e0 942dfc74 f9405677 34000078
[  147.577673] 646c  f24006ff 54000ae1 ea3702df 54000aa0 f9409260 b5000080 d2800001 aa1303e0
[  147.577701] 648c  97ffff5d f240069f 54000080 f9405660 9274f800 f9005660 f9405660 aa0002d6
[  147.577730] 64ac  f9005676 378808f6 79400260 12140c00 7140181f 54000060 f9408660 b4000820
[  147.577758] 64cc  37280816 f2400aff 540007c1 aa1303e0 97fffe3b aa0003f4 f9400000 b9404401
[  147.577786] 64ec  36000121 f9400681 370000e1 97fdf0e2 aa0003e1 d0007a80 9103c400 97f943ed
[  147.577814] 650c  d4210000 b000b900 f944c001 f9007a61 34000078 f944c000 f9007e60 f9405661
[  147.577842] 652c  91006280 91012282 910882b5 911f02b5 f240083f aa1403e1 9a800042 aa1303e0
[  147.577870] 654c  97fffcca 12001c17 91016280 942dfc69 b9400aa0 7100001f 540001cd 97ffeace
[  147.577899] 656c  b00092b6 910002d6 9000b921 b8606ac0 91258021 97ffeac1 340000c0 f9401ea0
[  147.577927] 658c  b4000080 f9400400 aa1303e1 97ffeda1 340000d7 f9400280 b9404400 36000060
[  147.577955] 65ac  aa1403e0 97fdf3e0 a94153f3 a9425bf5 a94363f7 f94023f9 a8c57bfd d65f03c0
[  147.577983] 65cc  aa1903e0 942dfc4b 17fffff8 a9bb7bfd 910003fd a90153f3 a9025bf5 a90363f7
[  147.578011] 65ec  a9046bf9 aa0003f3 aa0103f5 aa1e03e0 d503201f f9405660 f9401a78 f94002b7
[  147.578039]
[  147.586145] ---[ end trace 0a1dc90b30cad801 ]---
[  148.445310] EXT4-fs error (device nvme0n1p1): ext4_get_inode_loc:4494: inode #7996900: block 31981694: comm smbd: unable to read itable block
[  148.445981] EXT4-fs error (device nvme0n1p1): ext4_get_inode_loc:4494: inode #7996900: block 31981694: comm smbd: unable to read itable block
[  150.525408] EXT4-fs warning (device nvme0n1p1): htree_dirblock_to_tree:1021: inode #2: lblock 0: comm ls: error -5 reading directory block
[  153.472847] EXT4-fs error (device nvme0n1p1): ext4_get_inode_loc:4494: inode #7996900: block 31981694: comm smbd: unable to read itable block
[  153.475337] EXT4-fs error (device nvme0n1p1): ext4_get_inode_loc:4494: inode #7996900: block 31981694: comm smbd: unable to read itable block
[  153.487639] EXT4-fs error (device nvme0n1p1): ext4_get_inode_loc:4494: inode #7996900: block 31981694: comm smbd: unable to read itable block
[  153.488718] EXT4-fs error (device nvme0n1p1): ext4_get_inode_loc:4494: inode #7996900: block 31981694: comm smbd: unable to read itable block
[  158.519886] EXT4-fs error (device nvme0n1p1): ext4_get_inode_loc:4494: inode #7996900: block 31981694: comm smbd: unable to read itable block
[  158.520454] EXT4-fs error (device nvme0n1p1): ext4_get_inode_loc:4494: inode #7996900: block 31981694: comm smbd: unable to read itable block

I have cut the call trace…

Additional System information:

 ─────────────────────────────────────────────────────
 DietPi v8.12.1 : 10:50 - Sa 24.12.2022
 ─────────────────────────────────────────────────────
 - Device model : Generic Device (aarch64)
 - CPU temp : 38 °C / 100 °F : Cool runnings
 - LAN IP : 192.168.x.x (eth0)
 - Freespace (RootFS) : 12G
 - MOTD : Check out the DietPi v8.12 release notes:
          https://dietpi.com/docs/releases/v8_12/
 ─────────────────────────────────────────────────────

 DietPi Team     : https://github.com/MichaIng/DietPi#the-dietpi-project-team
 Image by        : rock5b (pre-image: ROCK 5 Debian)
 Patreon Legends : Camry2731, Chris Gelatt
 Website         : https://dietpi.com/ | https://twitter.com/DietPi_
 Contribute      : https://dietpi.com/contribute.html
 Web Hosting by  : https://myvirtualserver.com

 dietpi-launcher : All the DietPi programs in one place
 dietpi-config   : Feature rich configuration tool for your device
 dietpi-software : Select optimised software for installation
 htop            : Resource monitor
 cpu             : Shows CPU information and stats

Output of smartctl:

root@xxxnas:~# smartctl -a /dev/nvme0
smartctl 7.2 2020-12-30 r5155 [aarch64-linux-5.10.66-27-rockchip-gea60d388902d] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 EVO Plus 500GB
Serial Number:                      S4EVNX0NB47204E
Firmware Version:                   2B2QEXM7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 500.107.862.016 [500 GB]
Unallocated NVM Capacity:           0
Controller ID:                      4
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          500.107.862.016 [500 GB]
Namespace 1 Utilization:            358.916.820.992 [358 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 5b01b0bcf7
Local Time is:                      Sat Dec 24 10:55:12 2022 CET
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     85 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.80W       -        -    0  0  0  0        0       0
 1 +     6.00W       -        -    1  1  1  1        0       0
 2 +     3.40W       -        -    2  2  2  2        0       0
 3 -   0.0700W       -        -    3  3  3  3      210    1200
 4 -   0.0100W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        51 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    4.639.035 [2,37 TB]
Data Units Written:                 3.968.308 [2,03 TB]
Host Read Commands:                 66.779.704
Host Write Commands:                44.655.502
Controller Busy Time:               157
Power Cycles:                       464
Power On Hours:                     363
Unsafe Shutdowns:                   180
Media and Data Integrity Errors:    0
Error Information Log Entries:      4.261
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               51 Celsius
Temperature Sensor 2:               55 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

looks like data corruption on your disk. How is the disk powered? Does it have an own PSU or just via USB/SBC?

Thanks for your reply.

I have tested the nvme/file system in another system - everything seem alright.

The nvme is onboard - the radxa rock 5b has an integrated slot for it.

The systems runs for weeks without a problem. It is possible that error occured first shortly after the last update of dietpi.

Everythins runs fine until this operation (show diagramms in webinterface (fhem home automation) which evaluates different log files with a huge amount of entries (text-files).

Our update don’t have any impact on attached drives. we just update our script and trigger an apt update && apt upgrade.

As you use the Radxa Debian image as basis, was there probably a kernel or firmware update done along with the DietPi update?

EDIT: The kernel was updated 3 weeks ago: stable repositories: update ROCK 5 Kernel packages · radxa/apt@6f9b463 · GitHub
Was is probably this one which caused the NVMe issues? Not sure whether it can be downgraded:

apt show -a linux-5.10-rock-5-latest
apt show linux-image-5.10.110-33-rockchip-g42dcedf2c3ac

The system is up and running (again :slight_smile: )!

I also changed the powercable a few weeks ago to a shorter (cheaper?) one and that caused the trouble… :frowning:

Thank your for your help!

Happy new year!

at least you found it :wink: