PSOD on a non-existent piece of hardware
Over the last few weeks at a customer site we had some weird problems with random PSODs on HP BL460c Gen8 servers.
The customer is migrating from ESXi 5.5 to ESXi 6.0, and after a test period of a few months we started to upgrade all hosts using VMware Update Manager. After a few days some hosts suddenly started to throw PSODs as shown below.
All PSODs happened on HP BL460c Gen8 blades equipped with QLogic and Emulex adapters. The other thing they had in common is that all PSODs were caused by Red Hat Linux servers with Oracle installed on them.
If we look at the PSOD and distill the cause, we see mlx4_core, which is required by mlx4_en. These are Mellanox drivers. The fun part is that we don’t have any Mellanox hardware installed in the blades. A similar post can be found here.
After consulting both HP and VMware, the conclusion was to uninstall the VIBs instead of upgrading them. The drivers should have been removed during the upgrade to ESXi 6 but remained on the host.
After removing the VIBs and rebooting the hosts there were no more PSODs.
esxcli commands:
esxcli software vib remove -n net-mlx4-en
esxcli software vib remove -n net-mlx4-core
esxcli software vib remove -n net-mst
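Before removing anything, it can help to confirm which of these VIBs are actually present on a host. The following is only a minimal sketch: the function name list_mellanox_vibs is made up for illustration, and it assumes that awk is available in the ESXi shell and that the VIB name is the first column of the esxcli software vib list output.

```shell
# Hypothetical helper: reads the output of `esxcli software vib list` on stdin
# and prints only the names of the three Mellanox VIBs removed in this post.
# Assumption: the VIB name is the first whitespace-separated column.
list_mellanox_vibs() {
    awk '$1 ~ /^net-mlx4/ || $1 == "net-mst" { print $1 }'
}

# Intended use on the host:
#   esxcli software vib list | list_mellanox_vibs
```

If this prints nothing, the Mellanox VIBs are not installed on that host and the remove commands above are not needed there.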
The only remaining question for me is why only Linux machines with Oracle can stress the host and cause a PSOD.
Comments
We have seen this issue with our setup after upgrading from 5.5 to 6.0u2 with the HP custom ISO. We only get PSODs on our Linux VM with SAP HANA when it has more than 512GB of memory, never with less than 512GB. How large is your Oracle Linux VM? We have HP BL460c Gen9 blades with 1TB.
We have much smaller VMs. Our hosts are Gen8 with 256GB of memory.