PSOD on a non-existent piece of hardware

16 February 2017

Over the last few weeks, at a customer site, we ran into some strange problems with random PSODs (Purple Screens of Death) on HP BL460c Gen8 servers.

The customer is migrating from ESXi 5.5 to ESXi 6.0, and after a test period of a few months we started upgrading all hosts using VMware Update Manager. A few days later, some hosts suddenly started throwing PSODs.

All PSODs happened on HP BL460c Gen8 blades equipped with QLogic and Emulex adapters. The other thing they had in common is that every PSOD was triggered by a Red Hat Linux virtual machine with Oracle installed on it.

Looking at the PSOD and narrowing down the cause, we see mlx4_core, which is required by mlx4_en. These are Mellanox drivers. The funny part is that there is no Mellanox hardware installed in these blades at all. A similar post can be found here.

After consulting both HP and VMware, the conclusion was to uninstall these VIBs instead of upgrading them. The drivers should have been removed during the upgrade to ESXi 6.0, but they remained on the hosts.

After removing the VIBs and rebooting the hosts, there were no more PSODs.

ESXCLI commands used:

esxcli software vib remove -n net-mlx4-en
esxcli software vib remove -n net-mlx4-core
esxcli software vib remove -n net-mst
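If you manage many hosts, it can help to check first which of these VIBs are actually still present, and to let esxcli report what it would do before changing anything. Below is a minimal sketch of that idea, assuming the ESXi Shell's POSIX sh and awk. The function names and the MLX_VIBS and DRY_RUN variables are hypothetical helpers of our own; --dry-run and passing -n several times are standard options of esxcli software vib remove. It only matches the three VIB names from this post, so verify them against your own host first.

```shell
#!/bin/sh
# Sketch (hypothetical helpers): find and remove leftover Mellanox VIBs on an
# ESXi host that has no Mellanox hardware. VIB names are the ones from this post.

# VIB names we found on our HP BL460c Gen8 hosts after the 5.5 -> 6.0 upgrade.
MLX_VIBS="${MLX_VIBS:-net-mlx4-en net-mlx4-core net-mst}"

# Print every VIB from MLX_VIBS that is still installed.
# `esxcli software vib list` prints one VIB per line with the name in the
# first column; the header lines never match a VIB name, so they are skipped.
leftover_mlx_vibs() {
    esxcli software vib list | awk -v vibs="$MLX_VIBS" '
        BEGIN { n = split(vibs, want, " "); for (i = 1; i <= n; i++) keep[want[i]] = 1 }
        ($1 in keep) { print $1 }'
}

# Remove all leftover VIBs in a single esxcli call.
# DRY_RUN defaults to 1, so esxcli only reports what it would remove;
# set DRY_RUN=0 to actually remove the VIBs.
remove_leftover_mlx_vibs() {
    vibs=$(leftover_mlx_vibs)
    if [ -z "$vibs" ]; then
        echo "No leftover Mellanox VIBs found."
        return 0
    fi
    args=""
    for vib in $vibs; do
        args="$args -n $vib"
    done
    # $args is left unquoted on purpose so every "-n name" pair becomes
    # separate arguments for esxcli.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        esxcli software vib remove --dry-run $args
    else
        esxcli software vib remove $args
    fi
}
```

On a host you would source the file, run remove_leftover_mlx_vibs to see the dry-run report, then set DRY_RUN=0, run it again, and reboot the host afterwards, as we did.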

The only remaining question for me is why only the Linux machines running Oracle stressed the host enough to cause a PSOD.


Comments

  1. We have seen this issue in our setup after upgrading from 5.5 to 6.0 U2 with the HP custom ISO. We only get PSODs on our Linux VMs running SAP HANA with more than 512 GB of memory, never with less than 512 GB. How large is your Oracle Linux VM? We are running HP BL460c Gen9 blades with 1 TB.
