Over the last few weeks at a customer site we had some weird problems with random PSODs on HP BL460c Gen 8 servers.
The customer is migrating from ESXi 5.5 to ESXi 6.0, and after a test period of a few months we started upgrading all hosts using VMware Update Manager. After a few days some hosts suddenly started throwing PSODs as shown below.
All PSODs happened on HP BL460c Gen 8 blades equipped with QLogic and Emulex adapters. The other thing they had in common is that all PSODs were triggered by Red Hat Linux servers with Oracle installed on them.
If we look at the PSOD and distill the cause, we see mlx4_core, which is required by mlx4_en. These are Mellanox drivers. The fun part is that we don't have any Mellanox hardware installed in the blades. A similar post can be found here.
After consulting both HP and VMware, the conclusion was to uninstall the VIBs instead of upgrading them. The drivers should have been removed during the upgrade to ESXi 6 but remained on the host.
After removing the VIBs and rebooting the hosts there were no more PSODs.
esxcli software vib remove -n net-mlx4-en
esxcli software vib remove -n net-mlx4-core
esxcli software vib remove -n net-mst
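Before running the remove commands above, it can help to confirm which of these VIBs are actually present on a host. The snippet below is a minimal sketch, not something HP or VMware provided: find_mlx_vibs is a hypothetical helper name, and it assumes that awk is available in the ESXi shell and that the VIB name is the first column of the esxcli software vib list output.

```shell
# Hypothetical helper: print the Mellanox-related VIB names found in
# `esxcli software vib list` output. It reads the listing on stdin, so it
# also works against output saved from a host.
# Assumption: the VIB name is the first whitespace-separated column.
find_mlx_vibs() {
    awk '$1 == "net-mlx4-en" || $1 == "net-mlx4-core" || $1 == "net-mst" { print $1 }'
}

# On the host (ESXi shell or SSH session):
#   esxcli software vib list | find_mlx_vibs
```

If the helper prints nothing, none of the three VIBs are installed and there is nothing to remove on that host.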
The only remaining question for me is why only the Linux machines running Oracle stress the host enough to cause a PSOD.