Monday, May 29, 2023

The Mysterious Ryzen Reboot Bug

There's an old saying about boat fanatics. "The happiest days of a boat owner's life are the day he gets a boat and the day he sells the boat."

It's funny because it's true. A boat gives you access to happy (we hope) times on the water where you can swim, fish, sunbathe, and race. But the asset is often expensive to purchase and, like most things, can come with a high cost of ongoing ownership.

You can't say the same thing about a PC. For PC owners, it should be, "The happiest days for a PC owner are the day he gets a new PC and the day he gets an even newer PC." That's because a PC isn't a boat...it's a commodity and you expect it to just work and only get better as you climb up the tree of PC evolution. 

The Problem

Up until about a year ago I ran a Dell 5675 with a Ryzen 7 1700X that was mostly great. For the first three years it was solid and a good performer. It could run games fine and kicked Windows in the ass. Then, in late 2020, when I added a second monitor for work-from-home (WFH), the system started to crash randomly after a short period of idling. It didn't yield an error, it didn't point to a specific app, it would just reboot. I could not figure out what the source was. Because it started happening after I went to dual monitors, I thought it was a video driver issue. I had the latest drivers for the GPU (an Nvidia GTX 1060) but the problem persisted.

I Was Not Alone

Online, I found that others had the issue too. There were a myriad of ideas, but no one could really pin it down. Some found it was related to power management, some theorized a Windows issue, some felt it was due to a faulty CPU, others said it was a faulty motherboard, some said it was the BIOS. 

And sometimes the testimonials contradicted each other. One person would say it was only on one particular chip, but several people with different Ryzens were seeing the issue. Some said it was due to a bad motherboard maker but the issue was happening across different motherboard brands. 

The Long Road of Diagnostics (and Trial and Error)

In my personal experience, it might have been some weird combination of all those things. Updating drivers did not solve the issue. 

Reinstalling the Operating System

I got a temporary reprieve by refreshing my Windows 10 install. It seemed to settle the system down and for a month or two, it ran great again. The problem returned and I knew that going through the hassle of refreshing the OS install would only bring temporary relief. I knew there had to be an explanation but I could not figure it out.

Checking the RAM

I then looked at the memory. The PC had a single 8GB DIMM, so I bought another 8GB DIMM to swap it out with. No change; the reboots still happened with either module, so the RAM didn't appear to be the issue. The one good thing about getting the extra 8GB stick was that I had effectively upgraded the PC's RAM to 16GB.

The CPU?

I wasn't doing any fancy overclocking and the PC had been running fine for years, so I didn't think it was the CPU. But I made sure to update the drivers and the BIOS and verified that both the CPU and the RAM modules were seated firmly. Nothing changed.

The GPU

Because the issue seemed to start when I went to dual monitors and was running Citrix to run a remote desktop from work, I wondered if it could have been the video card. But again I'd made sure the drivers were updated and I wasn't doing any overclocking. I tried changing the video cables. I tried going back to a single monitor by turning off and disconnecting one of the monitors. No luck.

Time to Make a Change

Through most of 2021 and the first half of 2022 I limped along with the issue. Sometimes it would be kind to me and not trigger for two weeks at a time. Sometimes I'd do the Windows refresh trick and buy a couple of months. The issue was like the taxman: it would always come back for me.

Here's a quick bit of info about my personal PC history. Since about 2008 or so, I've only upgraded my primary desktop PC about every five years. You'd think that's not often enough given that I play games too, but PCs are good enough now that I don't feel the need to change every two years like I did in the '80s and '90s.

As we rolled past the midpoint of 2022 I figured one way to solve the reboot issue was to replace the PC. It was about time anyway as the Dell was getting to be five years old. So I began researching what I was going to do. 

I've tended to alternate between building my own PC and buying a prebuilt. I enjoy building PCs but sometimes a mass PC producer can hit a pretty good price point and save you the time of having to order parts and put a PC together. I compared some builds at HP and Dell against a parts list I built at PCPartPicker.com. Because GPUs were ridiculously expensive at the time, I decided to go with the AMD Ryzen 7 5700G APU. I'd had great luck back in 2012 when I built a PC using the AMD A10 chip. It was a great chip; no, I couldn't run games at the highest settings, but it was more than good enough to run everything I needed, was great at handling Windows, and it would save me the cost, space, and power of having to use a discrete GPU.

As the time came to buy the components, the price of the 5700G spiked. So I looked at the Ryzen 5 version of that chip, the 5600G. It was nearly $100 cheaper and although it had two fewer CPU cores, its performance compared favorably to my existing Ryzen 7 1700X. I pulled the trigger on all the parts and looked forward to an infusion of upgrades coming in the form of a more modern motherboard, CPU, NVMe SSD, and a healthy 32GB of RAM. 

It Lives!

I put the 5600G system together and it was great. Performance was good, the motherboard came with a BIOS version already capable of handling the CPU on the first boot, the RAM and NVMe drive were fast, and I wasn't getting any random reboots.

It DIES!

What the fuck. After two blissful weeks the new system started to spontaneously reboot just as the old system had. I was so frustrated. Again I started scouring the web for testimonials from others with the same issue. And there were definitely others who had it. But as before, everyone had a different solution.

Several found it was a power issue: the Ryzens seemed to have trouble with power fluctuations as the system throttles down to conserve power or ramps up to feed more juice as its needs change. Some fixed it by switching the Windows power plan from Balanced to High performance. Others went into the BIOS and turned off overclocking or slightly increased CPU core voltage. Some increased voltage to the RAM.
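For anyone who would rather try the power plan switch from a script than click through Control Panel, here is a rough Python sketch. It parses the text that the Windows command `powercfg /list` prints and builds the matching `powercfg /setactive` command. The function names are my own inventions, and the sample text only mimics what an English-language Windows install typically prints; plan names are localized, so treat the name match as an assumption. This is a convenience, not a confirmed fix for the reboots.

```python
import re
import subprocess

# Matches lines such as:
#   Power Scheme GUID: 381b4222-f694-41f0-9685-ff5bb260df2e  (Balanced) *
# The trailing "*" marks the currently active plan.
SCHEME_RE = re.compile(
    r"Power Scheme GUID:\s*([0-9a-fA-F-]{36})\s+\((.+?)\)\s*(\*)?\s*$"
)

# Sample text in the shape `powercfg /list` usually prints (assumed layout).
SAMPLE = """\
Existing Power Schemes (* Active)
-----------------------------------
Power Scheme GUID: 381b4222-f694-41f0-9685-ff5bb260df2e  (Balanced) *
Power Scheme GUID: 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c  (High performance)
"""


def parse_power_schemes(listing):
    """Return (guid, name, is_active) tuples found in `powercfg /list` text."""
    schemes = []
    for line in listing.splitlines():
        match = SCHEME_RE.search(line)
        if match:
            guid, name, star = match.groups()
            schemes.append((guid.lower(), name, star is not None))
    return schemes


def find_scheme(schemes, wanted_name):
    """Return the GUID of the plan with this name (case-insensitive), or None."""
    for guid, name, _active in schemes:
        if name.lower() == wanted_name.lower():
            return guid
    return None


def set_active_command(guid):
    """Build the argument list that would switch Windows to the given plan."""
    return ["powercfg", "/setactive", guid]


def read_live_schemes():
    """Windows only: run `powercfg /list` and parse its real output."""
    result = subprocess.run(
        ["powercfg", "/list"], capture_output=True, text=True, check=True
    )
    return parse_power_schemes(result.stdout)
```

On a Windows machine you would call `read_live_schemes()`, pass the result to `find_scheme(..., "High performance")`, and hand the output of `set_active_command(...)` to `subprocess.run`. Keeping the parsing separate from the command call means you can check the logic against the sample text without touching your actual power settings.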

It's so damned frustrating to not know exactly what it was. But it really did seem like there was something to the power angle, especially since this seemed to happen when the system idled and not when it was active. I can't believe the Ryzen APU itself was broken, but perhaps gradual degradation or change in the chip after running under power for two weeks led to an instability that required an adjustment to fix.

I tried many of the suggestions online and as of right now, I've not had any crashes after turning off these three power-related settings in the BIOS:

  • Disabled the RAM "Power Down Mode"
  • Disabled "Precision Boost Overdrive"
  • Disabled something else I'll have to look up.

All of the victims of this had different solutions, but shared a common suffering. It was painfully time-consuming to endure so many reboots, changes of settings, futile research, and absolute exasperation. At least a few just returned the CPU and tried a different one. I'm on the cusp of going to Intel if I can't get this working.

Back to Basics

I repeated a step with the 5600G system that worked with the 1700X system. I did a Windows reset/reinstall. And it worked. The system was stable again. And has been since. I'm really happy with the system.

Looking back, the problem started after I'd updated a bunch of AMD drivers for video and CPU. That tells me one of the driver updates most likely caused the issue. So this time around, I only installed driver updates as Windows recommended them, rather than preemptively going to the AMD site and downloading the latest. It's disappointing that I can't trust the latest drivers, but there seems to be something off in either the drivers or the site's recommendations on which drivers to use, because the system is rock solid now. My guess is that something in the video or CPU drivers causes a conflict, perhaps over memory.

There was a time when I always liked to go get the latest drivers. Now, for this PC at least, I'm adopting the old adage: "If it isn't broke, don't fix it." 

Annual Memorial Day Post: 2023

Here we go. I'm going to potentially doom myself and write about an emotionally charged topic. Perhaps this post will be censored or taken down or I'll be burned alive in social media court, but remember it's just my opinion.

Talking About White Privilege

When I first heard the phrase "white privilege" I didn't like the concept and I didn't believe in it. I have always tended to agree with the argument that the answer to racism is not more racism, and outward finger pointing is just about the worst way to solve a problem. The finger should start by pointing inward, and honestly making sure you're not the one full of shit. 

I didn't like the sense that bandying around the catchphrase "white privilege" seemed to be accusing others of racism, when they may not have been racist. The false accusations and declarations of guilt without a trial so common in media today are signs of our civilization's regression to something barbaric. The smart are losing ground to the stupid and it's depressing. 

But I took a harder look at the concept of "white privilege" and although the extremists at either end of the spectrum might disagree with my reasoning, I've come to an acceptance of it by recognizing it as an extension of a concept I learned about in management studies.

The Halo Effect

In behavioral science, the halo effect represents the tendency for certain physical traits to impart the benefit of the doubt to the possessor of the trait. An example is height. Height is often perceived as a leadership indicator, even if the tall person is a complete idiot and couldn't lead himself/herself down a one-way street. With time and experience, the truth of a person's capabilities becomes known, but fresh off the starting line, people with halo effect traits have an advantage.

The most common example I remember from the management classes where I read about the halo effect is beauty. Attractive people tend to get the benefit of the doubt on first impressions as trustworthy or competent. In a group of people all applying for the same position, the most attractive person has the edge in being given the opportunity. There's some biological explanation here too, I'm sure, as most would like to be around people considered attractive.

And yes, lighter skin is something that could be considered a positive trait. We'd like to think that isn't the case. We'd like to think that everyone is judged on their character and competence and not by how tall they are or the color of their hair or skin. But biology tells us that isn't the case. Even the most aware, intelligent, and disciplined of us can't change the fact that we're human and can succumb to stereotypes and preconceived notions when developing a first impression. 

Demographics or Bias?

I have met people of all colors and can't draw any definitive association about racism to color. Yes, I've met white people that were rude and made racist comments, and I've met non-white people that have done it too. Conversely, I've met wonderful people of all colors. 

As I said before, this is why I didn't like the idea of racial privilege. But that was smart me trying to reach for the ideal. I tried to reason that a white person has an advantage in an environment where whites are the majority. It made sense to me that people fear the unfamiliar and are more comfortable with the familiar. In a predominantly white population, it wasn't intentionally hostile discrimination but simply familiarity that would favor people of similar traits. 

Then I heard that lighter skin is a halo effect trait and that even in populations that are non-white, the lighter skinned generally benefitted from it. There are a number of studies on this and I won't get into it here but this told me I was incorrect in assuming it was merely a matter of demographics (although I'm not ready to fully discount demographics as at least a factor in the equation). The halo effect, or reverse halo effect, could be active in any environment and like height or beauty, skin tone could have an effect, perhaps as a part of the perception of beauty.

Another superficiality that imparts the halo effect is the shape of eyes. I bring this into the discussion because in a totally different culture, you can see how a trait can have a similar effect. The Japanese graphic novels (manga) and animated cartoons (anime) are huge components of their culture's media consumption. In many of these stories, villains often have narrow or slanted eyes while the heroes often have circular eyes (and also often, lighter skin). 

The Treadmill

So, ultimately, I've sort of come around to the idea of white privilege as a part of the halo effect. This does not mean that the people engaging in awarding or benefitting from the halo effect are racists. You can make the argument that it is a form of discrimination and that it has negative effects, especially when exercised by gatekeepers of opportunity, and that's an acceptable argument. We cannot, however, assume that all people engage in it or that they are hostile. I'm living proof that you can be non-white and make a sustainable living by being employable and offering some value to an employer.

What finally convinced me to accept the concept of white privilege? Corny though it might sound, it was a video game analogy. In many games you may control a character that earns experience points and then gains levels, which then allow you to improve your character's abilities. This system of advancement is often referred to as a "level treadmill" as you keep grinding on the treadmill to earn new levels. 

I came to think of it this way: On the level treadmill of life, the easiest difficulty setting is White Male. I'm OK with this analogy because it made sense to me. It's not saying the person on the treadmill is evil or racist, which is good because they probably aren't. It's simply acknowledging the halo effect.

How do we solve this problem? I think it isn't solvable. It's like an incurable but manageable disease. Human beings are imperfect, that's part of what makes us human. It is through education and communication that we improve and learn and develop the wisdom to work against our base instincts, but you can never remove the base instincts. And it is very natural for humans to be wary when faced with something different or unfamiliar. But we can manage it better by installing better leaders and better teachers. That's probably one of the best ways to attack the issue. Unfortunately, we have problems in politics and business and especially law where questionable people achieve leadership positions. And we have problems in education where administrators are doing their best to make the teaching profession unattractive...but those are discussions for different blog posts. 

I'd Rather be Here than Not

I need to tie this post into Memorial Day, so here it is. You may not like the level treadmill, which can also symbolize the Sisyphean life of working for retirement, but I'm darn glad to have it compared to some of the alternatives (like slavery). And we have it because someone died to keep us from some of those worse alternatives. Gratitude as always for our veterans.