Monday, May 29, 2023

The Mysterious Ryzen Reboot Bug

There's an old saying about boat fanatics. "The happiest days of a boat owner's life are the day he gets a boat and the day he sells the boat."

It's funny because it's true. A boat gives you access to happy (we hope) times on the water where you can swim, fish, sunbathe, and race. But the asset is often expensive to purchase and, like most things, can come with a high cost of ongoing ownership.

You can't say the same thing about a PC. For PC owners, it should be, "The happiest days for a PC owner are the day he gets a new PC and the day he gets an even newer PC." That's because a PC isn't a boat...it's a commodity and you expect it to just work and only get better as you climb up the tree of PC evolution. 

The Problem

Up to about a year ago I ran a Dell 5675 with a Ryzen 7 1700X that was mostly great. For the first three years it was solid and a good performer. It could run games fine and kicked Windows in the ass. Then, in late 2020, when I added a second monitor for work-from-home (WFH), the system started to crash randomly after a short period of idling. It didn't yield an error and it didn't point to a specific app; it would just reboot. I could not figure out what the source was. Because it started happening after I went to dual monitors, I thought it was a video driver issue. I had the latest drivers for the GPU (an Nvidia GTX 1060), but the problem persisted.

I Was Not Alone

Online, I found that others had the issue too. There were a myriad of ideas, but no one could really pin it down. Some found it was related to power management, some theorized a Windows issue, some felt it was due to a faulty CPU, others said it was a faulty motherboard, some said it was the BIOS. 

And sometimes the testimonials contradicted each other. One person would say it was only on one particular chip, but several people with different Ryzens were seeing the issue. Some said it was due to a bad motherboard maker but the issue was happening across different motherboard brands. 

The Long Road of Diagnostics (and Trial and Error)

In my personal experience, it might have been some weird combination of all those things. Updating drivers did not solve the issue. 

Reinstalling the Operating System

I got a temporary reprieve by refreshing my Windows 10 install. It seemed to settle the system down and for a month or two, it ran great again. The problem returned and I knew that going through the hassle of refreshing the OS install would only bring temporary relief. I knew there had to be an explanation but I could not figure it out.

Checking the RAM

I then looked at the memory. It was a single 8GB DIMM, and it didn't appear to be the issue. I bought another 8GB DIMM to swap it out with. No change: the reboots would still happen with either module. The one good thing about getting the extra 8GB stick was that I had effectively upgraded the PC's RAM to 16GB.

The CPU?

I wasn't doing any fancy overclocking and the PC had been running fine for years, so I didn't think it was the CPU. But I made sure to update the drivers and the BIOS and verify that both the CPU and the RAM modules were seated firmly. Nothing changed.

The GPU

Because the issue seemed to start when I went to dual monitors and was running Citrix to run a remote desktop from work, I wondered if it could have been the video card. But again I'd made sure the drivers were updated and I wasn't doing any overclocking. I tried changing the video cables. I tried going back to a single monitor by turning off and disconnecting one of the monitors. No luck.

Time to Make a Change

Through most of 2021 and the first half of 2022 I limped along with the issue. Sometimes it would be kind to me and not trigger for two weeks at a time. Sometimes I'd do the Windows refresh trick and buy a couple of months. The issue was like the taxman: it would always come back for me.

Here's a quick bit of info about my personal PC history. Since about 2008 or so, I've only upgraded my primary desktop PC about every five years. You'd think that's not often enough given that I play games too, but PCs are good enough now that I don't feel the need to change every two years like I did in the '80s and '90s.

As we rolled past the midpoint of 2022 I figured one way to solve the reboot issue was to replace the PC. It was about time anyway as the Dell was getting to be five years old. So I began researching what I was going to do. 

I've tended to alternate between building my own PC and buying a prebuilt. I enjoy building PCs but sometimes a mass PC producer can hit a pretty good price point and save you the time of having to order parts and put a PC together. I compared some builds at HP and Dell against a parts list I built at PCPartPicker.com. Because GPUs were ridiculously expensive at the time, I decided to go with the AMD Ryzen 7 5700G APU. I'd had great luck back in 2012 when I built a PC using the AMD A10 chip. It was a great chip; no, I couldn't run games at the highest settings, but it was more than good enough to run everything I needed, was great at handling Windows, and it would save me the cost, space, and power of having to use a discrete GPU.

As the time came to buy the components, the price of the 5700G spiked. So I looked at the Ryzen 5 version of that chip, the 5600G. It was nearly $100 cheaper and although it had two fewer CPU cores, its performance compared favorably to my existing Ryzen 7 1700X. I pulled the trigger on all the parts and looked forward to an infusion of upgrades coming in the form of a more modern motherboard, CPU, NVMe SSD, and a healthy 32GB of RAM. 

It Lives!

I put the 5600G system together and it was great. Performance was good, the motherboard came with a BIOS version that could handle the CPU on the first boot, the RAM and NVMe drive were fast, and I wasn't getting any random reboots.

It DIES!

What the fuck. After two blissful weeks the new system started to spontaneously reboot as the old system had. I was so frustrated. Again I started the scouring of the web for testimonials from others with the same issue. And there were definitely others that had the same issue. But as before, everyone had a different solution.

Several found it was a power issue: the Ryzens apparently have trouble with power fluctuations as the PC conserves or feeds more juice when its needs change. Some fixed it by switching the Windows power plan from Balanced to High performance. Others went into the BIOS and turned off overclocking or slightly increased the CPU core voltage. Some increased the voltage to the RAM.
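For anyone who wants to try the power plan change without clicking through Control Panel, here is a minimal sketch in Python. It assumes Windows' built-in powercfg tool and its standard plan aliases (SCHEME_MIN is the alias for the High performance plan); the function names are made up for illustration, and this is not something taken from any of the forum posts. Some machines don't expose a High performance plan at all, in which case powercfg will report an error.

```python
import subprocess
import sys

# powercfg accepts these built-in aliases in place of the power plan GUIDs.
PLAN_ALIASES = {
    "balanced": "SCHEME_BALANCED",
    "high_performance": "SCHEME_MIN",  # "minimum power savings"
    "power_saver": "SCHEME_MAX",       # "maximum power savings"
}


def build_setactive_command(plan):
    """Return the powercfg argument list that activates the named plan."""
    try:
        alias = PLAN_ALIASES[plan]
    except KeyError:
        raise ValueError(f"unknown plan: {plan!r}") from None
    return ["powercfg", "/setactive", alias]


def switch_plan(plan):
    """Activate a plan, then return powercfg's report of the active scheme."""
    if sys.platform != "win32":
        raise RuntimeError("powercfg is only available on Windows")
    subprocess.run(build_setactive_command(plan), check=True)
    result = subprocess.run(
        ["powercfg", "/getactivescheme"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    if sys.platform == "win32":
        print(switch_plan("high_performance"))
    else:
        # Off Windows, just show the command that would be run.
        print("Would run:", " ".join(build_setactive_command("high_performance")))
```

Passing "balanced" to the same function puts things back the way they were, which makes it easy to test whether the plan is really what's affecting the reboots.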

It's so damned frustrating to not know exactly what it was. But it really did seem like there was something to the power angle, especially since the reboots happened when the system was idling and not when it was active. I can't believe the Ryzen APU itself was broken, but perhaps some gradual degradation or change in the chip after two weeks of running led to an instability that required an adjustment to fix.

I tried many of the suggestions online and as of right now, I've not had any crashes after making these three power-related changes in the BIOS:

  • Disabled the RAM "Power Down Mode"
  • Disabled "Precision Boost Overdrive"
  • Disabled something else I'll have to look up.

All of the victims of this had different solutions, but shared a common suffering. It was painfully time-consuming to endure so many reboots, changes of settings, futile research, and absolute exasperation. At least a few just returned the CPU and tried a different one. I'm on the cusp of going to Intel if I can't get this working.

Back to Basics

I repeated a step with the 5600G system that worked with the 1700X system. I did a Windows reset/reinstall. And it worked. The system was stable again. And has been since. I'm really happy with the system.

Looking back, the problem started after I'd updated a bunch of AMD drivers for video and CPU. What this tells me is that one of the driver updates caused the issue. So this time around, I only installed driver updates as Windows recommended them, rather than preemptively going to the AMD site and downloading the latest. It's disappointing that I can't trust the latest drivers, but clearly there's something off in either the drivers or the site's recommendations on which drivers to use, because the system is rock solid now. My guess is that something in the video or CPU drivers was causing a memory conflict.

There was a time when I always liked to go get the latest drivers. Now, for this PC at least, I'm adopting the old adage: "If it isn't broke, don't fix it." 
