Wednesday 16 May 2012

ESXi - NC522SFP issue (QLogic chipset)


*****************************************************************************
2nd January 2013 - Happy New Year!!! And since the card swaps we've had no more issues!!!

*****************************************************************************
 
We have been having an issue with our Hosts for the last few months (since our Server Refresh in mid-2011, which included a change from ESX 3.5 to an ESXi 4.1 USB installation).

The issue was the Hosts becoming unresponsive while still answering pings (which meant HA didn't kick in), so the Guests stayed where they were but were also unresponsive. This occurred on 3 separate Hosts, so it couldn't be put down to a single Host problem.

Calls were raised with VMware - the problem being we couldn't give them many logs to work with.
On the first occasion the logs had been overwritten, and we weren't sure, given the networking problem (the Host not responding), whether the logs were being sent out at all (the ESXi logs are configured to be stored on the SAN). Running "vm-support" from the Host didn't give them any other data to work with.
On the second occasion we managed to get some logs and a "vm-support" roll-up to VMware - they came back with not a lot, apart from noting that the firmware on our NC522SFP cards was out of date - the last firmware patching we had done was approx. 3 months prior.
On both occasions the calls were closed unresolved as they couldn't help.

Contacted our Account Manager and had a conference call with them and an SE - I then had feedback from the SE, who had looked into the calls and had a chat with an Escalation Engineer. The findings were that there is a known issue (Customer Advisory http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c02964542) with NC522SFP firmware 4.0.556 or earlier - the symptoms listed were identical to ours - and firmware at higher levels is not affected by the problems listed.

Unfortunately our firmware level is 4.0.579 - as per the information from the Engineer on the second VMware call, and confirmed by checking myself using the article http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1027206.

The VMware site also lists this in their knowledge base (essentially it isn't a VMware issue, it is an HP one): http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2012455

Action taken on all Hosts


1. Updated the Servers' firmware using the HP Firmware Update DVD 10.0
2. Updated the Hosts with the latest ESXi patches (3 were outstanding - none of which were NIC driver related)
3. Updated the NC522SFP driver to the latest we could find, as listed in the HP Advisory https://my.vmware.com/web/vmware/details/dt_esxi40_qlogic_nx_nic_40602/ZHcqYnRwZXBiZCpwcA

Updating drivers using the vMA


1. Mount the ISO in the CD drive of the vMA (I had to shut ours down and add a CD drive first).
2. In the vMA mount the driver CD:
    sudo mount /dev/cdrom /mnt
3. Navigate to /mnt/offline-bundle/ and locate the .zip file.
4. Run the vihostupdate command to install the drivers from the offline bundle (a full worked example follows these steps):
    vihostupdate --server <FQDN Hostname> --install --bundle xxxxx.zip

5. Reboot the ESXi Host
6. Check if the patch has been installed:
    vihostupdate --server <FQDN Hostname> --query
7. Check the driver version on the ESXi Host (not via the vMA):
    esxcfg-nics -l
    ethtool -i <vmnicNUMBER> 
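
For reference, here's the sequence end-to-end as run from the vMA. The hostname and bundle filename below are made up for illustration - use your own Host's FQDN and the actual .zip supplied in the offline bundle:

    sudo mount /dev/cdrom /mnt
    cd /mnt/offline-bundle
    ls *.zip                  # note the offline bundle filename
    vihostupdate --server esxhost01.mydomain.local --install --bundle NX_NIC-offline_bundle.zip
    # reboot the Host via the VI Client, then confirm the patch and driver took:
    vihostupdate --server esxhost01.mydomain.local --query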

 

Further Action


We are now monitoring the situation and have escalated with our HP Account Manager, as we were at a supported firmware level prior to any of these actions being taken.

Update 23rd May 2012

Hosts are still having issues. Over the weekend we had 2 Hosts evacuate their Guests, and we've had two further blips where the Guests weren't evacuated.

Further research has dug up some interesting Blogs - one of which lists the symptoms we've been having and makes for an interesting read, especially the Comments!
http://wahlnetwork.com/2011/08/16/identifying-and-resolving-netxen-nx_nic-qlogic-nic-failures/

Update 1st June 2012

HP made available another Firmware DVD (SPP2012020.2012_0302.51.iso), which I downloaded and ran against one Host - this upgraded the Firmware from 4.0.579 to 4.0.585.

To update remaining Hosts and monitor.

Update 8th June 2012

We had another major fault on the 4th June (the Host held onto the Guests again as the box was pingable - resolved by a physical press of the power button, after which the Guests HA'd).
The response from HP was to apply the latest Firmware released on the 2nd June (surprising, as we had only just applied a newer Firmware). Once that Firmware was downloaded and applied to one server it turned out to be the same version as before, 4.0.585.
Seriously lost confidence in the cards now and informed HP of this - they are willing to replace the cards IF their Hardware team in the States find evidence in the logs (which they won't, as the logs seem to kindly reset during a reboot). It would appear we are now stuck between a rock and a hard place.
Now investigating buying new Intel cards, with which other people seem to be having no issues - also going to speak to VMware about getting the HP cards off the HCL, as they clearly aren't compatible!!!

Update 12th June 2012

The 4.0.585 appears to have stopped the "flapping" issue, as it is being called. This is where an alarm titled "network uplink redundancy lost" pops up in the VI Client; in the messages log you'll see "watchdog timeout occurred for uplink vmnicX", then the NIC pops back up online seconds later (a quick way to check the log for these entries is shown below).
I don't recall seeing these entries until we applied 4.0.579.
Our main issue is the total catatonic state that occurs when the NIC comes under some sort of heavy load.
VMware are looking into the NIC issue now as well - HP are looking into logs gathered over the weekend.
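
As a quick check (run on the Host itself, and assuming the default /var/log/messages location on this ESXi 4.1 build), something like the following shows how recently and how often the timeouts are being logged:

    # most recent watchdog timeouts logged against the uplinks
    grep -i "watchdog timeout" /var/log/messages | tail -n 20
    # rough count of occurrences since the log last rolled
    grep -ic "watchdog timeout" /var/log/messages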

Update 21st June 2012

We had failures on Monday and Tuesday
- Monday was a clean failure: all the Guests migrated, and when the Host was checked we had a nice PSOD, the second one in a month (this turned out to be down to the System ROM version - even though we were on a safe version according to the Knowledge Base article).
- Tuesday's failure was on the Host still running 4.0.579, which we had been unable to update because it isn't in a Cluster and requires agreed downtime.
- At present no other failures on 4.0.585.
- (Friday 15th June) VMware are reviewing their HCL process to see whether this sort of thing can be picked up earlier. They are also engaging with HP to see if they can help resolve the issue.

Update 27th June 2012

Further failure on a Host which hadn't had the Firmware 4.0.585 update applied earlier due to Business downtime limitations.
- Host was patched 21st June and failed on the evening of the 22nd June. HP call updated.
- Investigation into replacement cards (Intel X520-DA2) brought up some interesting reading with regard to NIC port count (someone who had issues with the NC522SFP replaced them all with the Intel card and then ran into port count issues). It now states that with 4.1 you can only have 4 x 10GbE and that is it, no other ports. When we purchased the servers this wasn't the case. However, VMware's log investigations haven't highlighted any issues with port count (we have 4 x 10GbE and 4 x 1GbE, two of which are unused but can't be disabled as they are on cards).
- One option to mitigate the NIC port count issue is to upgrade to vSphere 5 (which allows 8 x 10GbE ports, or 6 x 10GbE and 4 x 1GbE).
- 25th June: replacement cards arrived for the one Host which had died two days after the latest Firmware had been applied.
- 26th June: HP called to check whether we had received the cards and to inform us that the recent Host failure met the criteria for card replacement, and that they were on their way.

Update 5th July 2012

No issues this week.
- 29th June: started a heap capture script kindly supplied by VMware and scheduled a vm-support capture for 1940hrs, as our last failure was at 1943hrs the previous Friday.
- 2nd July: no failures over the weekend. The heap capture was sent to VMware for investigation and the findings showed that the memory limits weren't being hit.
- To run the same heap capture and vm-support cron job again this coming weekend (7th/8th July) - a rough sketch of the cron entry is shown below.
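
For anyone wanting to do something similar, this is roughly what the scheduling looks like when done on the Host itself with BusyBox cron. The datastore path is illustrative, it assumes vm-support on this build writes its bundle to the current directory, and note that root's crontab does not survive a reboot:

    # 19:40 every Friday - run vm-support with a datastore as the working directory
    echo "40 19 * * 5 cd /vmfs/volumes/<datastore>/dumps && vm-support" >> /var/spool/cron/crontabs/root
    # restart crond so the new entry is picked up
    kill $(cat /var/run/crond.pid)
    /bin/busybox crond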

Update 9th July 2012

The Host that is awaiting replacement cards failed again over the weekend (07:40hrs Saturday morning). The heap capture was wiped out due to the data being stored in /tmp (I did try setting it to run from the iSCSI scratch area, but the script wouldn't run). Updated the VMware and HP calls.

Contents of the Heap script (courtesy of Emiliano at VMware).

#!/bin/sh
# Log and PID file locations - kept on a datastore rather than /tmp
LOGFILE=/vmfs/volumes/4e0d93ce-0a1f5ca4-7dd8-002655e35ddc/<host directory>/log2
PIDFILE=/vmfs/volumes/4e0d93ce-0a1f5ca4-7dd8-002655e35ddc/<host directory>/heapmon.pid

# Record this script's PID so it can be killed later
echo $$ > "${PIDFILE}"

# Every 5 minutes, timestamp the log and dump the stats of every netpkt heap
while true
do
   date >> "${LOGFILE}"
   vsish -e ls /system/heaps/ | grep -i netpkt | while read heap
   do
     echo "${heap}"
     vsish -e cat /system/heaps/${heap}stats
   done >> "${LOGFILE}"
   sleep 5m
done


To make the script executable:
chmod a+x /<script location>/<script name>.sh

To start the script:
nohup /<script location>/<script name>.sh &
(The script will now run even if the console session is closed)

To kill the script:
cat /<script location>/heapmon.pid
Output = the heapmon PID, e.g. 12345
kill 12345
The script is now finished and you can view the output file (log2) - see below for a quick way to skim it.
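
To get a quick feel for the captured data without reading the whole file, a grep along these lines works, assuming the vsish heap stats output includes free-space fields (the exact field names vary between builds):

    # pull out each sample's timestamp, the heap names and any free-space figures
    grep -iE "^(mon|tue|wed|thu|fri|sat|sun)|netpkt|free" /vmfs/volumes/<datastore>/<host directory>/log2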

Update 12th July 2012

- Replacement NICs arrived (11th July), to be installed on the 19th July (agreed downtime)
- To run the Heap script again over the weekend (13th-15th)
- The vm-support bundle (from Friday 6th July) was analysed by VMware on the 9th July and no issues were found.

Update 20th July 2012

- Replacement NICs installed in Host.
- Firmware updated on cards.
- Heap script started and left running

Update 27th July 2012

- Call closed with HP (no errors appearing on the remaining Hosts - stable??)
- Heap script still running...
No problems encountered this week. If I get any more problems I can open another call and link it to the old one.

Update 31st July 2012

- We had a Host failure today but we can't find anything that may be NIC related. Call opened with VMware.

Update 7th August 2012

- Heap script still running.
- Host failure from the 31st July entry: nothing found in the logs. Call closed with VMware.

Update 10th August 2012

- We had a PSOD on 7th August on another Host. Feedback from VMware is that it is a known issue with the NC522SFP cards and we should update the driver {sigh} to 4.0.618.

Update 14th August 2012

- Heap script stopped

Update 29th August 2012

- We had a couple of failures early last week, which involved a PSOD and a catatonic Host. Calls raised and the information also passed on to the Escalation Engineer from our original call.
- HP Account Manager informed.
- All Hosts' drivers updated to 4.0.618

Update 30th August 2012

- If the Host becomes unresponsive, see if you can do the following to generate something useful for problem investigation:
1. Make an iLO connection (or equivalent), get a console session up and running, and press ALT-F12 - the screen should change to show the tail of the vmkernel log (take a screenshot). If this fails, move to step 2.
2. Generate a PSOD via the iLO by sending an NMI to the system (again, take a screenshot).

Update 31st August 2012

- New call raised with HP (4641842283); requested that it be linked to the closed call (4640586880) and that our HP Account Manager chases it.

Update 6th September 2012

- The last two outages were due to the latest firmware (P65 System ROM) and the latest driver (4.0.618) not having been applied. The outages were NOT due to additional problems caused by the cards.

Update 19th September 2012

- No further issues and calls closed with VMware and HP.

Update 10th October 2012

- Over a month since the last issue and all seems okay!

Update 29th October 2012


Failure on one of the Hosts where it "lost" the default gateway - Hmmmm, sounds a bit familiar......
SystemROM P65 is 04/20/2012
Driver version is 4.0.618
Firmware version is 4.0.588

Logging a call with VMware and HP......

Update 2nd November 2012

After sending logs and pictures of the cards, HP are replacing the NICs and they will be arriving today! For some reason they didn't have to be sent out from the States this time....


I've also been asked to take photos of the NICs in the remaining Hosts to send to HP. They will then decide whether to change them or not depending on what they find (I'm assuming this is all to do with the revision levels Erich mentioned in July).

Update 13th November 2012


After uploading all the photos of the cards we have, I'm now getting inundated with replacement cards for all our remaining Hosts, regardless of whether they had a known history of problems.

I've asked why they are all getting swapped out, as I seem to be replacing the same revision numbers like-for-like....

Update 19th November 2012


All cards are now replaced:

Old revision numbers were 1395A2220303 A01 to A05

Replacement revision number 1395A2220304 A01

13 comments:

  1. you and I should talk. We have been going through the same thing with HP/VMware since April. It is no better on esxi5. HP just sent us new hardware that is "guaranteed to fix the problem". I asked for supporting documentation that this HW rev is going to fix it. DENIED! Let me know if you are interested in comparing notes.

    1. Hi Erich - Yes, if you want to exchange notes that would be great (cevhill at gmail dot com). Have you bitten the bullet and switched cards yet???

    2. I am at driver version 5.0.614 and fw 4.0.585 and we have not had an issue for a few weeks now. There is a 5.0.619 driver w/4.0.588 fw but I have not applied it yet. My 10G cards are emulex so I have not had the issue with those. I have not swapped the new hardware they sent yet and I won't until it does happen again. I started to approach the subject of replacing the 375s with 365s (Intel chipset) at HP's expense. They were not open to it but I'm sure a call to your account manager could expedite that.

    3. sent you an email with some good info.

  2. Interesting - thanks Erich. We haven't had an issue this week so fingers crossed we are past it all!

  3. This is the exact same issue we have seen with our NC522s, since esxi 4.x and into 5. Vmware support will only provide driver updates and refused to acknowledge if any other customers had issues with this card. We are prepared to go to CDW, our HP reseller, to discuss replacement options.

    1. Hi Dan,

      Best you speak to your HP Account manager as this isn't a VMware issue. VMware initially said this to me, then I went to HP and escalated via the Account Manager. I then escalated with VMware, speaking to their Account Manager, a couple of weeks later. I thought I'd involve VMware again by asking them to take the cards off the HCL as they didn't work.

      Both Companies treated the call as a higher priority and it ended up being dealt with by their more experienced Engineers. We are now pretty stable apart from the one Host which is having its cards swapped out tonight. In short: escalate, escalate, escalate!!

      Cheers
      Colin

  4. Hi all,

    I was curious if you all have gotten your issues resolved and if so what steps were taken? I am having random hosts drop connections with these cards and am on the 5.0.619 nx_nic driver and am at 4.0.588 on the firmware. Any info would be greatly appreciated!

    Thanks,
    Randy

    1. Hi Randy,

      No - all the problems are not resolved as yet. I've been off since Monday 13th and just got back. During this period we've had two Host failures. One of the Hosts affected was the one that had the PSOD a few days before and had its driver updated. At this point I think I'll re-open the HP call....

      Cheers
      Colin

  5. Randall, HP confirmed a hardware rev issue on the 375, 522 cards. They will replace them. I have not had any issues since replacing the cards (SPI board and add-in cards)

  6. What is your conclusion of these series of outage. I think this will help others learn lessons from you.

    1. Hi Preetam,

      Conclusions? Let's see. HP state they are a Global Company, but given how Erich was treated (they said they would replace the cards a number of months ago) and how I was (they flatly refused to replace the cards - until the last outage in late October), I would say that isn't the case. I would say each Region pretty much doesn't talk to any other - a view backed up by the experiences of former colleagues who were outsourced to HP a couple of years back.

      I can only repeat what I have said before - escalate. Whether that be the HP Account Manager, VMware Account Manager or both. Then keep bugging them.

      Luckily both of the Account Managers I dealt with tried their best - Unfortunately the HP Account Manager had to deal with way too much internal red tape.

      Next hardware refresh - HP or Dell? Hmmmmm........

      Cheers
      Colin

  7. Two of my cards have fallen over. Nothing related to VMware, but I did notice something interesting - they died after an update appeared and was applied through Windows Update. Funny that one died out of 3 that were the same age, and then another one we got off eBay a year later fell over with the same update.
