Hi,
I recently purchased an AMD Ryzen 9 5900X to replace my 5800X. The old 5800X was working fine. With the first 5900X I experienced WHEA errors and BSODs in Windows 10, so I RMAed it, and the replacement seems to work fine.
I like optimizing my computer before using it, so I tinkered with the PBO settings and curve optimizer and tested the system to be prime95 stable (per-core and multicore tests) in Windows 10. The computer worked fine in Windows 10, but I was only using Windows for stress testing. My main OS is Arch Linux and I am on linux-5.11.13-arch1-1. I pinned the kernel version because of an Nvidia driver issue; that was before I swapped the CPU. Everything was working fine in Windows 10, but when I switched to Linux I got random reboots. journalctl -b | grep Hardware gives two different kinds of hardware errors.
This is the kind of error that I get after the system randomly reboots, I think. I can't remember exactly.
Jun 04 03:23:26 Arch-AMD kernel: mce: [Hardware Error]: Machine check events logged
Jun 04 03:23:26 Arch-AMD kernel: mce: [Hardware Error]: CPU 12: Machine Check: 0 Bank 0: dc20000000080015
Jun 04 03:23:26 Arch-AMD kernel: mce: [Hardware Error]: TSC 0 ADDR 7f2c1b290000 MISC d0130fff00000000 SYND 3a01002a IPID 1000b000000000
Jun 04 03:23:26 Arch-AMD kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1622777000 SOCKET 0 APIC 1 microcode a201009
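The kernel did not decode this first record, so as a rough reading aid here is a minimal sketch that splits the raw status value from the "Bank 0" line into its flag bits. The bit positions are the MCI_STATUS_* ones I know from the Linux kernel's arch/x86/include/asm/mce.h (SyndV being AMD-specific), and the 6-bit extended error code field is an assumption based on how the kernel decodes these banks. The function and table names are made up for this illustration.

```python
# Flag bits of an x86 machine-check status register (MCi_STATUS).
# Positions follow the MCI_STATUS_* macros in the Linux kernel; SyndV is AMD-specific.
STATUS_FLAGS = {
    63: "Val",       # record is valid
    62: "Overflow",  # another error arrived while this bank still held one
    61: "UC",        # uncorrected error (clear means corrected)
    60: "En",        # error reporting enabled for this bank
    59: "MiscV",     # MISC register holds valid data
    58: "AddrV",     # ADDR register holds valid data
    57: "PCC",       # processor context corrupt
    53: "SyndV",     # syndrome register holds valid data
}


def decode_mca_status(status: int) -> dict:
    """Split a raw status value into flags and error-code fields."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "corrected": "UC" not in flags,
        # Assumption: extended error code sits in bits 21:16 (6 bits).
        "ext_error_code": (status >> 16) & 0x3F,
        "error_code": status & 0xFFFF,
    }


if __name__ == "__main__":
    # Value copied from the "Machine Check: 0 Bank 0" line above.
    print(decode_mca_status(0xDC20000000080015))
```

If the bit layout above is right, this record has UC clear and Overflow set, i.e. a corrected error with at least one further error lost while the bank was occupied, and the extended error code comes out as 8.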
Resetting the curve optimizer seems to have fixed the random reboots, but I still get these:
Jun 05 06:15:40 Arch-AMD kernel: mce: [Hardware Error]: Machine check events logged
Jun 05 06:15:40 Arch-AMD kernel: [Hardware Error]: Corrected error, no action required.
Jun 05 06:15:40 Arch-AMD kernel: [Hardware Error]: CPU:0 (19:21:0) MC0_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0x9c20000000080015
Jun 05 06:15:40 Arch-AMD kernel: [Hardware Error]: Error Addr: 0x00007fdd0203f000
Jun 05 06:15:40 Arch-AMD kernel: [Hardware Error]: IPID: 0x001000b000000000, Syndrome: 0x000000003a01002a
Jun 05 06:15:40 Arch-AMD kernel: [Hardware Error]: Load Store Unit Ext. Error Code: 8, A parity error was detected in an L1 TLB entry by any access.
Jun 05 06:15:40 Arch-AMD kernel: [Hardware Error]: cache level: L1, tx: DATA
Jun 05 06:47:04 Arch-AMD kernel: mce: [Hardware Error]: Machine check events logged
Jun 05 06:47:05 Arch-AMD kernel: [Hardware Error]: Corrected error, no action required.
Jun 05 06:47:05 Arch-AMD kernel: [Hardware Error]: CPU:0 (19:21:0) MC0_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0x9c20000000080015
Jun 05 06:47:05 Arch-AMD kernel: [Hardware Error]: Error Addr: 0x00007f30cb909000
Jun 05 06:47:05 Arch-AMD kernel: [Hardware Error]: IPID: 0x001000b000000000, Syndrome: 0x000000003a01002a
Jun 05 06:47:05 Arch-AMD kernel: [Hardware Error]: Load Store Unit Ext. Error Code: 8, A parity error was detected in an L1 TLB entry by any access.
Jun 05 06:47:05 Arch-AMD kernel: [Hardware Error]: cache level: L1, tx: DATA
Jun 05 13:26:53 Arch-AMD kernel: mce: [Hardware Error]: Machine check events logged
Jun 05 13:26:53 Arch-AMD kernel: [Hardware Error]: Corrected error, no action required.
Jun 05 13:26:53 Arch-AMD kernel: [Hardware Error]: CPU:12 (19:21:0) MC0_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0x9c20000000080015
Jun 05 13:26:53 Arch-AMD kernel: [Hardware Error]: Error Addr: 0x00007eff5357b000
Jun 05 13:26:53 Arch-AMD kernel: [Hardware Error]: IPID: 0x001000b000000000, Syndrome: 0x000000003a01002a
Jun 05 13:26:53 Arch-AMD kernel: [Hardware Error]: Load Store Unit Ext. Error Code: 8, A parity error was detected in an L1 TLB entry by any access.
Jun 05 13:26:53 Arch-AMD kernel: [Hardware Error]: cache level: L1, tx: DATA
Typically the problematic CPUs are 0 and 12. (It also happened on CPU:16, but that seems to have been fixed when I reset the curve optimizer.)
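To tally which logical CPUs show up instead of eyeballing the journal, something like the following sketch could be used. It only matches the two line formats that appear in the logs above (the raw "CPU 12: Machine Check" form and the decoded "CPU:0 (19:21:0)" form); the regexes and the function name are my own and may need adjusting for other kernels.

```python
import re
from collections import Counter

# Decoded form, e.g. "[Hardware Error]: CPU:12 (19:21:0) MC0_STATUS[...]"
DECODED = re.compile(r"\[Hardware Error\]: CPU:(\d+) \(")
# Raw form, e.g. "[Hardware Error]: CPU 12: Machine Check: 0 Bank 0: dc20..."
RAW = re.compile(r"\[Hardware Error\]: CPU (\d+): Machine Check")


def count_mce_by_cpu(lines):
    """Count machine-check records per logical CPU in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = DECODED.search(line) or RAW.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# A few lines copied from the logs above, used as sample input.
SAMPLE = [
    "Jun 04 03:23:26 Arch-AMD kernel: mce: [Hardware Error]: CPU 12: Machine Check: 0 Bank 0: dc20000000080015",
    "Jun 05 06:15:40 Arch-AMD kernel: [Hardware Error]: CPU:0 (19:21:0) MC0_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0x9c20000000080015",
    "Jun 05 06:15:40 Arch-AMD kernel: [Hardware Error]: Error Addr: 0x00007fdd0203f000",
    "Jun 05 06:47:05 Arch-AMD kernel: [Hardware Error]: CPU:0 (19:21:0) MC0_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0x9c20000000080015",
    "Jun 05 13:26:53 Arch-AMD kernel: [Hardware Error]: CPU:12 (19:21:0) MC0_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0x9c20000000080015",
]

if __name__ == "__main__":
    for cpu, n in sorted(count_mce_by_cpu(SAMPLE).items()):
        print(f"CPU {cpu}: {n} record(s)")
```

For real use, the SAMPLE list would be replaced by the lines of the kernel journal (for example the output of journalctl -k read from standard input).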
This is the output of lscpu --all --extended
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ MHZ
0 0 0 0 0:0:0:0 yes 5994.4331 2200.0000 4221.871
1 0 0 1 1:1:1:0 yes 6442.4800 2200.0000 4221.866
2 0 0 2 2:2:2:0 yes 6442.4800 2200.0000 4221.856
3 0 0 3 3:3:3:0 yes 6294.3350 2200.0000 4221.854
4 0 0 4 4:4:4:0 yes 5846.2891 2200.0000 4221.849
5 0 0 5 5:5:5:0 yes 6146.1909 2200.0000 4221.847
6 0 0 6 8:8:8:1 yes 5101.9531 2200.0000 4221.845
7 0 0 7 9:9:9:1 yes 5698.1440 2200.0000 4221.842
8 0 0 8 10:10:10:1 yes 5550.0000 2200.0000 4221.842
9 0 0 9 11:11:11:1 yes 5398.2422 2200.0000 4213.202
10 0 0 10 12:12:12:1 yes 4950.1948 2200.0000 4221.833
11 0 0 11 13:13:13:1 yes 5250.0972 2200.0000 4221.829
12 0 0 0 0:0:0:0 yes 5994.4331 2200.0000 4221.824
13 0 0 1 1:1:1:0 yes 6442.4800 2200.0000 4221.827
14 0 0 2 2:2:2:0 yes 6442.4800 2200.0000 4221.820
15 0 0 3 3:3:3:0 yes 6294.3350 2200.0000 4221.817
16 0 0 4 4:4:4:0 yes 5846.2891 2200.0000 4221.810
17 0 0 5 5:5:5:0 yes 6146.1909 2200.0000 4221.810
18 0 0 6 8:8:8:1 yes 5101.9531 2200.0000 4221.807
19 0 0 7 9:9:9:1 yes 5698.1440 2200.0000 4221.804
20 0 0 8 10:10:10:1 yes 5550.0000 2200.0000 4221.802
21 0 0 9 11:11:11:1 yes 5398.2422 2200.0000 4221.795
22 0 0 10 12:12:12:1 yes 4950.1948 2200.0000 4221.798
23 0 0 11 13:13:13:1 yes 5250.0972 2200.0000 4221.795
which suggests that CPUs 0 and 12 are the two logical processors (SMT threads) of the same core.
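The same grouping can be read off mechanically. Below is a minimal sketch that parses a table in the lscpu --all --extended layout shown above and groups logical CPUs by (SOCKET, CORE). It assumes whitespace-separated columns with a header row containing CPU, SOCKET and CORE; the function name and the shortened sample are for illustration only.

```python
from collections import defaultdict


def siblings_by_core(lscpu_text: str) -> dict:
    """Map (socket, core) to the list of logical CPUs that share that core."""
    rows = [line.split() for line in lscpu_text.strip().splitlines()]
    header, body = rows[0], rows[1:]
    cpu_col = header.index("CPU")
    socket_col = header.index("SOCKET")
    core_col = header.index("CORE")
    groups = defaultdict(list)
    for row in body:
        key = (int(row[socket_col]), int(row[core_col]))
        groups[key].append(int(row[cpu_col]))
    return dict(groups)


# Shortened excerpt of the table above (CPUs 0, 1, 12 and 13 only).
SAMPLE = """
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ MHZ
0 0 0 0 0:0:0:0 yes 5994.4331 2200.0000 4221.871
1 0 0 1 1:1:1:0 yes 6442.4800 2200.0000 4221.866
12 0 0 0 0:0:0:0 yes 5994.4331 2200.0000 4221.824
13 0 0 1 1:1:1:0 yes 6442.4800 2200.0000 4221.827
"""

if __name__ == "__main__":
    for (socket, core), cpus in sorted(siblings_by_core(SAMPLE).items()):
        print(f"socket {socket} core {core}: CPUs {cpus}")
```

On the excerpt, CPUs 0 and 12 end up under socket 0, core 0, which matches the reading of the full table.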
I don't think there's an issue with my other hardware because it was working fine with my old 5800X. However, I did recently update the motherboard BIOS, so that might be an issue. I don't know how to trigger these errors. I ran stress tests and couldn't reproduce them. They seem to appear from nowhere, even when my computer is idling. I would rather not RMA my CPU because I have already done it once. I won't do that unless it's absolutely necessary.
Here are my hardware specs:
- CPU: AMD Ryzen 9 5900X
- Motherboard: Gigabyte B550I AORUS PRO AX (BIOS F13i with AGESA 1.2.0.2)
- RAM: Corsair VENGEANCE LPX 32GB (2 x 16GB) DDR4 DRAM 3200MHz C16 Memory Kit
- Storage: Samsung SSD 970 EVO Plus 1TB
- GPU: Gigabyte RTX 3070 Eagle OC
I read online that some people suggest memory issues as a possible cause. I have now reduced the FCLK in my BIOS and will observe whether the problem persists.
Any help is appreciated. Thank you very much.
Edit 1 [2021-06-06]: Got the same Hardware corrected errors with FCLK decreased to 1567. These happened at random intervals at idle. Will try downgrading the BIOS to F13g.
Edit 2 [2021-06-06]: Switched to F13g, with PBO auto and D.O.C.P. on, still received corrected errors. Will try disabling D.O.C.P.
Edit 3 [2021-06-07]: With D.O.C.P. disabled, I haven't seen any hardware errors in 13+ hours of usage. This points to RAM issues, possibly an unstable RAM overclock. I then re-enabled D.O.C.P. and ramped up the DRAM voltage to 1.37 V (the BIOS shows channel A/B volts as 1.452 for some reason). Ran the computer for 12+ hours. No issues so far. Will try to re-enable the curve optimizer.
Edit 4 [2021-06-07]: Re-entered the previous prime95-stable negative curve optimizer offsets. Got corrected errors very quickly after boot. Will try lowering the offsets.
Edit 5 [2021-06-07]: Curve optimizer all core -10 and -5 resulted in hardware errors. Will disable curve optimizer once again and keep monitoring.
Edit 6 [2021-06-08]: With the curve optimizer disabled, a corrected hardware error occurred even with the DRAM voltage set to 1.37 V. I have honestly run out of ideas. My baseline is to have PBO and D.O.C.P. enabled. I will try lowering the RAM frequency to see if there is any improvement.
Edit 7 [2021-06-09]: With the curve optimizer disabled and the RAM frequency lowered, the hardware errors still persist. I ended up dialing in +5 on all cores in the curve optimizer and re-enabled the stock D.O.C.P. It seems to have worked, and I haven't seen an error in over a day of uptime. I will occasionally check the journal log, but if I don't update this post anymore, consider this the solution. Thank you.