A year or two back, I had a problem with a computer that I use as a file & print server. It had 6 SATA ports on the motherboard, which I used as follows:
0) boot drive (SSD)
1) BluRay writer
2 – 5) RAID 6 array for data and backups
(For those not familiar with the term, RAID 6 is a method of grouping drives so that they work together as a single large drive, with built-in redundancy so that any 2 drives can fail and the data is still intact.)
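To put a number on that redundancy, here is a small Python sketch of the capacity arithmetic. The function name and the 2 TB drive size are made up for illustration; they are not the actual sizes of the drives in this server.

```python
def raid6_usable_capacity(drive_count: int, drive_size_tb: float) -> float:
    """Usable capacity of a RAID 6 array built from equal-sized drives.

    RAID 6 keeps two drives' worth of parity, so the usable space is
    that of (drive_count - 2) drives.
    """
    if drive_count < 4:
        raise ValueError("RAID 6 needs at least 4 drives")
    return (drive_count - 2) * drive_size_tb


# Four hypothetical 2 TB drives: half the raw space is usable,
# and any two of the four drives can fail without data loss.
print(raid6_usable_capacity(4, 2.0))  # 4.0
```

With only four drives, RAID 6 gives the same usable space as mirroring, but it survives any two failures rather than only certain pairs.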
For some reason ports 4 & 5 stopped detecting the disk drives attached to them. I solved this by purchasing a PCIe SATA card with 2 internal ports and connected the drives to it instead. This worked until a month ago. After upgrading the operating system (from Debian/Jessie (8) to Debian/Stretch (9)) the add-in card stopped being able to detect the 2 drives attached to it.
After much fiddling about, I discovered that there was nothing apparently wrong with the card or the drives. For some reason, only the first 4 ports (0–3) could now recognize the hard disk drives, while the last 2 (4 & 5) and the add-in card could recognize the boot drive and the BluRay writer but not a hard drive. After switching the cables around and removing the add-in SATA card, I was back in business.
This lasted for almost 2 weeks. Then the computer began emitting intermittent (but frequent) beeps. It also stopped recognizing two of the drives. I eventually traced the beeping to a flaky power supply (replacing it with a new one cured the beeping) but I had to reinstall the add-in card to get all 4 hard drives recognized.
More bizarrely, I couldn’t boot into the latest version of the Linux kernel. I had to boot into the older (Debian/Jessie) kernel, which fortunately was still an option. While not ideal, I was willing to live with it. My server was still doing what it was supposed to do.
That didn’t last long. About a week ago, you guessed it, the two drives attached to the add-in card stopped being recognized.
At this point I concluded that maybe it was time to replace the motherboard. I found a cheap board that could do the job for about $100 (including processor) although I’d have to buy new RAM (the old system used DDR2 while the new board used DDR3) and a 4-port SATA card. Still, it seemed like the cheapest solution so I placed the order on canadacomputers.com and asked for the free delivery to my closest store.
This was just before the long weekend. On Tuesday I got a call saying that the board I’d ordered was actually now out of stock.
Time to go for my first instinct – replace the server’s CPU, motherboard and memory with the ones in my workstation. That hardware has been very reliable but is showing its age as a desktop system. However, it’s half the age of the server components, and the AMD Ryzen processors are now on the market. I normally buy AMD because I don’t want Intel to have a real monopoly on the desktop market.
As an aside, it’s interesting to note that with the AM4 socket, AMD has actually reunited its two socket types. Previously they’d used the AM sockets for processors without an onchip graphics processor while they’d used the FM sockets for processors with onchip graphics. Newer A-series and Athlon processors now use the same AM4 socket as the Ryzen processors.
This leads to an interesting confusion. Socket AM4 motherboards all have back-panel video ports but you won’t get a video signal from them unless you are using a processor with an onchip graphics processor. I haven’t read much discussion of this anywhere and it’s barely mentioned in the advertising.
Wednesday I bought a new CPU, motherboard and memory (since the Ryzen processors require DDR4) for my workstation. Returning home, I removed the motherboard from my server and put it aside. Then I transferred the motherboard from my workstation to the server, swapping its video card for a more mundane one that uses passive cooling, since the server doesn’t run a GUI.
I reconnected everything and started it up. This being Linux, everything started up perfectly – almost. One of the disk drives wasn’t being seen and the network wasn’t starting. This latter point is not good for a server!
Since I didn’t need the network until I got my workstation running, I first added the working disk drive back into the RAID 6 array so that there would still be some redundancy in case a drive failed. While the drive was resynchronizing with the array, I started rebuilding the workstation.
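For anyone wanting to keep an eye on a resync like this, Linux software RAID reports its progress in /proc/mdstat. Below is a minimal Python sketch that pulls the percentage out of that text. It assumes the array is a Linux md (mdadm) array; the sample text, device names and numbers are invented for the example and only mimic the usual layout.

```python
import re

# Invented sample in the style of /proc/mdstat (not output from my server).
SAMPLE_MDSTAT = """\
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdd1[4] sdc1[2] sdb1[1] sda1[0]
      3906762752 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/3] [UUU_]
      [=>...................]  recovery =  8.5% (166134528/1953381376) finish=210.4min speed=141538K/sec
"""


def rebuild_progress(mdstat_text: str) -> dict[str, float]:
    """Map each md array that is rebuilding to its percent complete."""
    progress = {}
    current = None
    for line in mdstat_text.splitlines():
        # A line such as "md0 : active raid6 ..." starts a new array entry.
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The progress line reads "recovery = 8.5% ..." or "resync = ...".
        status = re.search(r"(?:recovery|resync)\s*=\s*([\d.]+)%", line)
        if status and current:
            progress[current] = float(status.group(1))
    return progress


print(rebuild_progress(SAMPLE_MDSTAT))  # {'md0': 8.5}
```

On a real system the same function could be fed the contents of /proc/mdstat instead of the sample string.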
I normally put the motherboard in the case before adding the CPU because it makes it easier to access all the screw locations. After getting everything connected, I went to install the CPU.
All the CPU coolers I’ve dealt with recently use a clip to hold them to the motherboard/CPU. The Ryzen’s cooler uses spring-loaded screws instead. This meant removing the motherboard’s clips so that the cooler could screw directly into the plate under the motherboard. However this plate isn’t attached to the board – it relies on the screws to hold it in place. When I removed the clips, the plate promptly fell off.
Not a big deal – I just had to remove the back of the case and hold the plate in place with one hand while trying to screw the CPU cooler in place with the other. And yes, that is as awkward as it sounds.
Finally, I inserted the memory, reconnected the peripherals, and powered it up.
Nothing. The lights came on, the fans started spinning but no sound, no video, nothing.
Back to basics. I removed everything except the video card and memory. No disk drives connected, no front USB or audio connections. Just the motherboard, memory and speaker. Still nothing. Next I removed the memory, since this should be a sure way to get the motherboard to start beeping. Another failure.
Next I shut down my server and yanked the video card from it to install in my workstation instead of the heavy-duty card. I still got nothing from the video.
After putting the card back into the server and starting it up, I removed the motherboard from the workstation and connected it to a different power supply. I still couldn’t get the system to give a beep.
Admitting defeat, I returned to Canada Computers to give their tech support a try. I left everything with them and returned home to work on debugging the server’s missing drive issue.
By swapping cables around, I eventually found that the drive that wasn’t being recognized was apparently dead. I could attach it to the power and data cables from one of the working drives but it still wasn’t being seen.
Just then Canada Computers called. They couldn’t get the motherboard to beep either. Returning, I got a refund on the CPU and motherboard but decided to keep the memory. I then bought another CPU (same model) and a different, more expensive, motherboard, along with a replacement drive for my server.
I replaced the dead drive with the new one and restarted the server. The drive was being recognized correctly so I added it back into the RAID 6 array so I’d have the full 2-drive redundancy again. While it was rebuilding, I started rebuilding my workstation.
Learning from my mistake, this time I installed the CPU before putting the motherboard in the case. Also, I went with a minimal install, just connecting the power and switches before trying to power up the system.
Still no beeps. Perhaps the speaker was faulty? I put the video card back in and connected it to my monitor and tried again. Still nothing. I again removed the memory, since that should surely trigger some angry beeping. Utter silence.
At this point I was wondering “WTF am I doing wrong?” One bad board could be a manufacturing defect, but two in a row is almost unheard of.
Back to Canada Computers again. When their tech guy came back, he rechecked everything then suggested perhaps it was the memory. Apparently the Ryzen processors are somewhat finicky. Moreover, unlike most processors, they don’t seem to trigger “no memory” POST warning beeps.
He grabbed a box of DDR4 RAM from another manufacturer and an open-box video card, plugged things in and powered it up. Success! Despite the new memory having identical specs to the old, the new memory worked while the original memory didn’t.
I returned home and commenced rebuilding the workstation. First I redid the minimal install, with just enough connected to get the system to boot. It worked, so I shut things down and reconnected everything. When I started it up, everything worked flawlessly – almost.
Remember, my server wasn’t starting the network properly so I didn’t have my network shares or printers available. Time to get that fixed.
I had vague recollections of network problems with the motherboard when I first installed it. I couldn’t recall how I’d fixed them, however, so I powered up my laptop and looked for solutions online.
The onboard NIC is a Realtek R8168. This is actually a family of devices that aren’t always supported on Linux out of the box. Realtek make their own Linux driver so I went to their web site, downloaded it onto a USB stick and installed it on the server.
When this didn’t work, I decided to do what I should have done in the first place – check out what the exact failure is.
The problem seems to be that with systemd replacing the venerable SysVinit, network interface names have changed. The old eth0, eth1, etc. only stick around if they were used before the upgrade. Since the network interface is on the replacement motherboard, it was given a new name. The old eth0 interface was tied to the old motherboard’s NIC so it is now obsolete. I had a udev persistent rule to name that old NIC eth0, so I removed it.
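The new-style names look cryptic but they encode where the device sits. As a rough illustration, this Python sketch decodes the common PCI form of such a name (a two-letter type prefix, then p plus the bus number, then s plus the slot number). The function name is my own, and it deliberately ignores the rarer variants such as onboard-index or MAC-based names.

```python
import re

# Type prefixes used by the predictable naming scheme.
KINDS = {"en": "ethernet", "wl": "wlan", "ww": "wwan"}


def decode_pci_ifname(name: str):
    """Decode names like 'enp5s0' or 'enp3s0f1'; return None for anything else."""
    match = re.fullmatch(r"(en|wl|ww)p(\d+)s(\d+)(?:f(\d+))?", name)
    if match is None:
        return None  # e.g. an old-style name such as eth0
    prefix, bus, slot, function = match.groups()
    return {
        "type": KINDS[prefix],
        "bus": int(bus),
        "slot": int(slot),
        # The function number only appears on multi-function devices.
        "function": int(function) if function is not None else None,
    }


print(decode_pci_ifname("enp5s0"))
print(decode_pci_ifname("eth0"))  # None
```

So a name like enp5s0 simply means an Ethernet device on PCI bus 5, slot 0, which is why a NIC on a different motherboard ends up with a different name.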
The new network interface is named enp5s0. However, simply creating an /etc/network/interfaces file asking for enp5s0 to get its address via DHCP didn’t work. This wasn’t a good solution anyway, but I thought it would be worth a shot just to see if the network would actually work.
Since the Realtek r8168 driver didn’t seem to be working, I tried replacing it with Debian’s r8168-dkms driver. That didn’t work either. However I knew that the NIC was working when it was part of my workstation, so perhaps I had the right driver originally after all.
I removed the r8168-dkms driver and tried the r8169. This seemed to work, but I was getting an error message about firmware. Ignoring this for the moment, I returned to the /etc/network/interfaces file.
Because I run virtual machines on the server, I require network bridging. The actual NIC (enp5s0) is tied to a bridge interface (br0), so I had to replace eth0 with enp5s0 in the interfaces file, both in the iface line and in the bridge definition.
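The rename described above can be sketched in a few lines of Python. The sample file contents below are a guess at what a typical bridged Debian setup looks like (using the bridge_ports option from bridge-utils), not a copy of my actual file, and the function name is made up.

```python
import re

# Hypothetical "before" contents of /etc/network/interfaces.
OLD_INTERFACES = """\
auto lo
iface lo inet loopback

iface eth0 inet manual

auto br0
iface br0 inet dhcp
    bridge_ports eth0
"""


def rename_interface(text: str, old: str, new: str) -> str:
    """Replace whole-word occurrences of one interface name with another."""
    # \b keeps a name such as eth01 from being touched when renaming eth0.
    return re.sub(rf"\b{re.escape(old)}\b", new, text)


print(rename_interface(OLD_INTERFACES, "eth0", "enp5s0"))
```

The point is that the old name appears in two places, once for the physical NIC and once in the list of bridge ports, and missing either one leaves the bridge without a working uplink.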
Another problem was that, because of my earlier fiddling with the drivers, the r8169 driver was no longer loading automatically. Adding r8169 to the /etc/modules file fixed this. I also found and removed a blacklist file set up by the r8168-dkms driver installation.
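In hindsight, a quick check of those two files would have saved some head-scratching. Here is a minimal Python sketch of that check, working on the file contents as strings. The sample contents and function names are invented for the example; the real blacklist file would live somewhere under /etc/modprobe.d.

```python
def blacklisted_modules(conf_text: str) -> set[str]:
    """Collect module names from 'blacklist <name>' lines of a modprobe config."""
    names = set()
    for line in conf_text.splitlines():
        parts = line.split("#", 1)[0].split()  # drop comments, then tokenize
        if len(parts) == 2 and parts[0] == "blacklist":
            names.add(parts[1])
    return names


def autoload_listed(modules_text: str, module: str) -> bool:
    """True if the module is named on a non-comment line of /etc/modules."""
    for line in modules_text.splitlines():
        parts = line.split("#", 1)[0].split()
        if parts and parts[0] == module:
            return True
    return False


# Invented sample contents for illustration.
sample_blacklist = "# added by a driver package\nblacklist r8169\n"
sample_modules = "# /etc/modules: kernel modules to load at boot time.\nr8169\n"

print(blacklisted_modules(sample_blacklist))      # {'r8169'}
print(autoload_listed(sample_modules, "r8169"))   # True
```

If a module shows up in both lists, the blacklist entry is the one to remove first.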
Now there was just the pesky firmware message to deal with. While it doesn’t seem to be necessary, installing the firmware-realtek package did remove the message since the package contains the correct firmware for the variant on my board.
I now have a speedy workstation and what seems to be a reliable server. Keeping my fingers crossed that no new problems will crop up!