Debugging random freezes in Linux

In early 2024, I assembled a desktop. The machine ran smoothly for the first year or so, but after a NixOS upgrade I started getting random freezes. The whole desktop would stop responding: I couldn't switch browser tabs, clicking did nothing, and the OS ignored keyboard input, although I could still move the mouse cursor. Each freeze lasted around 20 to 30 seconds, after which everything returned to normal.

It happened occasionally, maybe once a day or every two days. Annoying, but not annoying enough to demand immediate attention.

I searched various forums and found similar threads, mostly about GPU issues. I tried different settings, but nothing worked. This was the kind of bug I didn't know how to debug further.

Given the popularity of LLMs, I thought this would be a good opportunity to put one to use. The steps I describe in the rest of this post were guided by an interactive chat with an LLM rather than by an agent, since I wanted to understand what was going on at each stage.

Dmesg logs

dmesg (diagnostic message) is a command that prints the kernel ring buffer, essentially a log of what the Linux kernel has been doing.

The next time the freeze happened, I checked dmesg and found something interesting:

[ 7956.370041] ata1.00: status: { DRDY }
[ 7956.370044] ata1.00: failed command: WRITE FPDMA QUEUED
[ 7956.370046] ata1.00: cmd 61/00:80:90:84:36/01:00:47:00:00/40 tag 16 ncq dma 131072 out
                        res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 7956.370052] ata1.00: status: { DRDY }
[ 7956.370055] ata1.00: failed command: WRITE FPDMA QUEUED
[ 7956.370057] ata1.00: cmd 61/08:88:98:80:80/00:00:5c:01:00/40 tag 17 ncq dma 4096 out
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 7956.370063] ata1.00: status: { DRDY }
[ 7956.370066] ata1.00: failed command: WRITE FPDMA QUEUED
[ 7956.370068] ata1.00: cmd 61/08:90:30:ba:80/00:00:5c:01:00/40 tag 18 ncq dma 4096 out
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 7956.370074] ata1.00: status: { DRDY }
[ 7956.370077] ata1.00: failed command: WRITE FPDMA QUEUED
[ 7956.370079] ata1.00: cmd 61/08:98:90:ad:81/00:00:5c:01:00/40 tag 19 ncq dma 4096 out
                        res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 7956.370085] ata1.00: status: { DRDY }
[ 7956.370088] ata1.00: failed command: WRITE FPDMA QUEUED
[ 7956.370090] ata1.00: cmd 61/08:a0:18:b0:81/00:00:5c:01:00/40 tag 20 ncq dma 4096 out
                        res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 7956.370096] ata1.00: status: { DRDY }
[ 7956.370104] ata1: hard resetting link
[ 7956.838153] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 7956.842299] ata1.00: configured for UDMA/133
[ 7956.842399] ata1: EH complete

Here's how the LLM interpreted these logs:

  • The drive stopped responding to queued write commands (WRITE FPDMA QUEUED), resulting in timeouts (Emask 0x4 (timeout)).
  • The PHYRdyChg error indicated the physical SATA connection between the motherboard and the drive had briefly dropped.
  • The Linux kernel successfully performed a hard reset to recover the drive and re-established the connection (SATA link up 6.0 Gbps... EH complete).
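The same reading of the log can be sketched in code. The snippet below is a minimal illustration, not a tool used in the original debugging session: it counts failed ATA commands, error-mask reasons, and link resets in dmesg-style text. The function name `summarize_ata_errors` and the regular expressions are assumptions based only on the line shapes in the excerpt above.

```python
import re
from collections import Counter

# "[ 7956.370044] ata1.00: failed command: WRITE FPDMA QUEUED"
FAILED_RE = re.compile(r"\] (ata\d+(?:\.\d+)?): failed command: (.+)$")
# "... Emask 0x4 (timeout)" on the result line that follows each command
EMASK_RE = re.compile(r"Emask (0x[0-9a-fA-F]+) \((.+?)\)")
# "[ 7956.370104] ata1: hard resetting link"
RESET_RE = re.compile(r"\] (ata\d+): hard resetting link")


def summarize_ata_errors(log_text):
    """Return (failed commands, Emask reasons, hard resets) as Counters."""
    failed, reasons, resets = Counter(), Counter(), Counter()
    for line in log_text.splitlines():
        line = line.rstrip()
        m = FAILED_RE.search(line)
        if m:
            failed[(m.group(1), m.group(2))] += 1
        m = EMASK_RE.search(line)
        if m:
            reasons[m.group(2)] += 1
        m = RESET_RE.search(line)
        if m:
            resets[m.group(1)] += 1
    return failed, reasons, resets


# A shortened copy of the excerpt above, used as sample input.
SAMPLE = """\
[ 7956.370044] ata1.00: failed command: WRITE FPDMA QUEUED
[ 7956.370046] ata1.00: cmd 61/00:80:90:84:36/01:00:47:00:00/40 tag 16 ncq dma 131072 out
                        res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 7956.370055] ata1.00: failed command: WRITE FPDMA QUEUED
[ 7956.370057] ata1.00: cmd 61/08:88:98:80:80/00:00:5c:01:00/40 tag 17 ncq dma 4096 out
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 7956.370104] ata1: hard resetting link
"""

failed, reasons, resets = summarize_ata_errors(SAMPLE)
for (port, command), count in failed.items():
    print(f"{port}: {count} x failed {command}")
print("reasons:", dict(reasons))
print("hard resets:", dict(resets))
```

Feeding it the full output of dmesg instead of `SAMPLE` would show whether the timeouts always hit the same port, which is what separates a single misbehaving link from a wider controller problem.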

This pointed to a temporary hardware communication failure with a SATA drive on port 1. Likely culprits:

  1. Loose SATA cables.
  2. The drive's controller is crashing or failing.

A loose SATA data cable seemed possible, but the fact that the system recovered on its own after a short time made me doubt it. I was also too lazy to open up the case and reseat cables without some evidence that they were the problem. I wanted to see how much I could learn about the disk's health from the software side before reaching for the screwdriver.

SMART diagnostics

SMART diagnostics (smartmontools) can query the drive's internal health data and error counters.

First, I needed to identify the drive for ata1. Then I used smartctl to check the drive's SMART attributes:

sudo smartctl -A /dev/sda
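The post does not show how ata1 was matched to /dev/sda. One common way, sketched here as an assumption rather than the method actually used, relies on the fact that the resolved sysfs path of a SATA disk normally contains its ATA port name. The helper names are hypothetical, and the PCI address in the example path is made up for illustration.

```python
import glob
import os
import re

# The resolved sysfs path of a libata disk usually has an "/ataN/" component.
ATA_PORT_RE = re.compile(r"/(ata\d+)/")


def ata_port_from_sysfs_path(resolved_path):
    """Extract 'ataN' from a resolved /sys/block/sdX path, or None."""
    match = ATA_PORT_RE.search(resolved_path)
    return match.group(1) if match else None


def block_devices_by_ata_port(sys_block="/sys/block"):
    """Map each ATA port name to the /dev/sdX devices attached to it."""
    mapping = {}
    for entry in sorted(glob.glob(os.path.join(sys_block, "sd*"))):
        port = ata_port_from_sysfs_path(os.path.realpath(entry))
        if port:
            mapping.setdefault(port, []).append("/dev/" + os.path.basename(entry))
    return mapping


# Example path (the PCI address is invented for illustration).
example = "/sys/devices/pci0000:00/0000:00:17.0/ata1/host0/target0:0:0/0:0:0:0/block/sda"
print(ata_port_from_sysfs_path(example))

# On a Linux machine this prints the real mapping; elsewhere it prints {}.
print(block_devices_by_ata_port())
```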

The LLM asked me to look for these specific SMART attributes and explained how to interpret them:

  • 199 UDMA_CRC_Error_Count: Greater than 0 and increasing over time would point to a faulty or loose SATA data cable.
  • 5 Reallocated_Sector_Ct or 197 Current_Pending_Sector: Greater than 0 would indicate the drive itself is failing.
  • 192 Power-Off_Retract_Count: High numbers here could point to power supply issues.

It also suggested running a drive self-test with smartctl -t short /dev/sda and checking results with smartctl -l selftest /dev/sda. A passing result would mean the drive's internal mechanics are likely fine, narrowing things down to cables or power.

Here's what I got when I ran smartctl:

$ smartctl -A /dev/sda
smartctl 7.5 2025-04-30 r5714 [x86_64-linux-6.12.60] (local build)
Copyright (C) 2002-25, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   206   205   021    Pre-fail  Always       -       2683
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       639
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       8118
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       638
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       4
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2961
194 Temperature_Celsius     0x0022   109   103   000    Old_age   Always       -       38
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

The SMART data looked clean to me. All values were within normal ranges, and the self-tests completed without errors. Thankfully, no issues from the drive's side.
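For repeated checks, the attribute table can be parsed rather than read by eye. This is a rough sketch that assumes the ten-column layout shown above; the function names and the `WATCHED` table are illustrative, and the interpretations in it are the ones the LLM gave, not an authoritative reference.

```python
# Attributes the LLM suggested watching, with its interpretation of each.
WATCHED = {
    5: "drive may be failing if greater than 0",
    192: "high values could point to power supply issues",
    197: "drive may be failing if greater than 0",
    199: "cable may be faulty or loose if greater than 0 and increasing",
}


def parse_smart_attributes(text):
    """Parse `smartctl -A` output into {id: (name, raw_value_string)}."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[int(fields[0])] = (fields[1], fields[9])
    return attrs


def watched_raw_values(attrs):
    """Return {id: raw_value} for watched attributes with a plain integer value."""
    result = {}
    for attr_id in WATCHED:
        if attr_id in attrs and attrs[attr_id][1].isdigit():
            result[attr_id] = int(attrs[attr_id][1])
    return result


# A few rows copied from the output above.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       4
194 Temperature_Celsius     0x0022   109   103   000    Old_age   Always       -       38
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

attrs = parse_smart_attributes(SAMPLE)
for attr_id, raw in sorted(watched_raw_values(attrs).items()):
    print(f"{attr_id:>3} {attrs[attr_id][0]:<24} raw={raw}  ({WATCHED[attr_id]})")
```

Since attribute 199 only matters when it grows over time, saving the parsed values after each freeze and comparing them would say more than a single snapshot.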

SATA link power management

SATA link power management, often referred to as ALPM (Aggressive Link Power Management), controls how aggressively the kernel puts SATA links into low-power states. If the policy is too aggressive, the link can fail to wake up properly, which can cause the PHYRdyChg error.

I checked the current policy:

$ cat /sys/class/scsi_host/host*/link_power_management_policy
med_power_with_dipm
med_power_with_dipm
med_power_with_dipm
med_power_with_dipm
med_power_with_dipm
med_power_with_dipm

A value of min_power or med_power_with_dipm means the kernel is using aggressive power saving, which made a link that fails to wake up properly a plausible explanation for the freezes.

I read the kernel documentation to understand this better. My system was using the medium power-saving setting for its SATA links, which allows them to enter low-power sleep states. And since DIPM (Device Initiated Power Management) was enabled, the hard drive itself could ask the motherboard to put the link into a low-power state.
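The sysfs check above can also be scripted, for example to confirm the policy after a configuration change. This is a minimal sketch that assumes the sysfs layout shown in the `cat` command; the function names are hypothetical, and the directory is a parameter only so the code can be tried against a test directory.

```python
import glob
import os


def read_lpm_policies(scsi_host_dir="/sys/class/scsi_host"):
    """Return {host_name: policy} for every host exposing an LPM policy file."""
    policies = {}
    pattern = os.path.join(scsi_host_dir, "host*", "link_power_management_policy")
    for path in sorted(glob.glob(pattern)):
        host = os.path.basename(os.path.dirname(path))
        with open(path) as policy_file:
            policies[host] = policy_file.read().strip()
    return policies


def hosts_not_at_max_performance(policies):
    """List hosts whose link policy still allows some form of power saving."""
    return sorted(host for host, policy in policies.items()
                  if policy != "max_performance")


# On a Linux machine with SATA hosts this prints the live policies;
# on other systems it prints an empty dict and an empty list.
policies = read_lpm_policies()
print(policies)
print("not at max_performance:", hosts_not_at_max_performance(policies))
```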

I applied this possible fix in my Nix configuration:

+  services.udev = {
+    # To avoid random freezes.
+    extraRules = ''
+      ACTION=="add|change", SUBSYSTEM=="scsi_host", KERNEL=="host*", ATTR{link_power_management_policy}="max_performance"
+    '';
+  };

And that actually fixed it. My system has been running with zero freezes for the last two months.

What caused it

I did a quick search to see if there was any kernel change related to this and found this Phoronix article. Turns out the kernel defaults for SATA link power management changed in version 6.11.

These are the past NixOS releases:

Version  Codename  Release date  Default kernel
24.05    Uakari    May 31, 2024  Linux 6.6 LTS
24.11    Vicuña    Nov 30, 2024  Linux 6.6 LTS
25.05    Warbler   May 23, 2025  Linux 6.12 LTS

Now I can see why my upgrade to 25.05 resulted in this issue: it moved the default kernel from 6.6 to 6.12, past the 6.11 change.
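The comparison is simple, but kernel versions are easy to get wrong when treated as decimal numbers, since 6.6 is numerically larger than 6.12. A small sketch that compares them as (major, minor) tuples instead, using the release data from the table above; `parse_kernel_version` and `LPM_CHANGE` are illustrative names, and the 6.11 threshold is the one reported in the article mentioned above.

```python
import re

# Kernel version in which the SATA LPM defaults changed, per the article above.
LPM_CHANGE = (6, 11)

# NixOS release -> default kernel, copied from the table above.
RELEASES = {
    "24.05": "Linux 6.6 LTS",
    "24.11": "Linux 6.6 LTS",
    "25.05": "Linux 6.12 LTS",
}


def parse_kernel_version(text):
    """Turn 'Linux 6.12 LTS' or '6.12.60' into a (major, minor) tuple."""
    match = re.search(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no kernel version found in {text!r}")
    return int(match.group(1)), int(match.group(2))


# Tuples compare element by element, so (6, 12) sorts after (6, 11),
# whereas the floats 6.12 and 6.6 would compare the wrong way round.
affected = [release for release, kernel in RELEASES.items()
            if parse_kernel_version(kernel) >= LPM_CHANGE]
print("releases with the new defaults:", affected)
```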

LLM assistance

In this case, using an LLM accelerated my debugging, and I learned several new things along the way. I had no prior experience with kernel logs, SMART diagnostics, or SATA power management. I fed it raw tool output (the dmesg logs with ATA errors and the smartctl SMART attribute dump), and it explained what those cryptic registers and counters meant and pointed out which fields to watch.

After solving the issue, I tried looking up the details in man pages, things like UDMA_CRC_Error_Count from the smartctl output and WRITE FPDMA QUEUED from the dmesg logs, but I couldn't find authoritative documentation for them. I found scattered forum discussions, but nothing I'd consider a canonical reference. If anyone can point me to proper documentation for these error codes, I'd appreciate it. The LLM answered confidently and its explanations matched what I observed, but I'd still prefer an authoritative source.

That said, iterating with the LLM got me to a resolution quickly for a problem I had been living with for months, and I plan to lean on this approach more for debugging going forward.