Thursday, March 13, 2008

MSA1500cs periodically grinding to a halt with VMware ESX 3.5

Last night, for the third time since going live with VMware in Sept. 2007, we had an issue where the MSA1500cs SAN that acts as the datastore for our 3-host ESX cluster ground to a halt. As the MSA1500 has no ability to log what it is doing, this is a tricky one to figure out.

The first time it happened, I thought it was because I had set the multipathing on the ESX hosts to Fixed. Apparently this is not the suggested configuration for the MSA1500: even though it has the v7.00 Active/Active firmware on it, ESX doesn't see it as a true Active/Active setup. So I changed the path setup to Most Recently Used (MRU) and left it to run.

The second time it happened was shortly after upgrading to v3.5, which caused High Availability (HA) on one of the hosts to go nuts and fill up the log partition.

Last night it happened again, but nothing had been changed on the setup for weeks. Basically, the VMs grind to a halt and become unavailable, and then some or all of the LUNs presented to the hosts disappear. The only way I've found to fix it is to shut the ESX hosts down (which takes about 30 minutes each, as each host tries to gracefully flush its cache to LUNs that aren't there), then power down the MSA1500 controller shelf and bring everything back up again in the correct order.

I opened a ticket with HP about it today, but they couldn't see anything in the CLI show tech_info dump I sent them (which I guess they wouldn't, as the thing had been power cycled!). They suggested that it could be the drivers or firmware in the BL480c servers that we are using, or that a firmware upgrade may be needed in the Brocade Fibre Channel switches.

As I had the time last night, I also applied the last batch of v3.5 patches.

So far it has happened after about 50-60 days of uptime, so I think for now we will be proactive and do a manual shutdown of everything on a weekend every 6 weeks or so.
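A minimal sketch of how that schedule could be worked out, assuming a 42-day limit and a Saturday maintenance window (the function name, the Saturday choice and the exact interval are illustrative assumptions, not something HP or VMware recommended):

```python
from datetime import date, timedelta


def next_maintenance_saturday(last_power_cycle: date, interval_days: int = 42) -> date:
    """Return the last Saturday on or before last_power_cycle + interval_days."""
    deadline = last_power_cycle + timedelta(days=interval_days)
    # date.weekday(): Monday is 0, Saturday is 5
    days_since_saturday = (deadline.weekday() - 5) % 7
    return deadline - timedelta(days=days_since_saturday)


if __name__ == "__main__":
    # Hypothetical input: the full power cycle on the night of March 12, 2008
    print(next_maintenance_saturday(date(2008, 3, 12)))
```

Picking the Saturday on or before the 6-week mark keeps the planned shutdown comfortably inside the 50-60 days of uptime at which the lockups have shown up so far.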

6 comments:

Anonymous said...

I have this same problem with ESX and multiple MSA1500cs. I have a thread open on VMware's community forums. If you want to discuss this, please email me at emiller at genesishosting.com.

Eric

Anonymous said...

I'm having the same problems. Have you found any info that might help?
Cheers

Unknown said...

Hi Anonymous,

Check out the thread on VMware's site here:

http://communities.vmware.com/message/935901;jsessionid=BFB0024C390D76FAE4C3978196B5B43F

Hope this helps!

Eric
Genesis Hosting Solutions, LLC
http://www.genesishosting.com/

Anonymous said...

Same problem at our site. We have 4 ESX servers running 3.5, and once in a while (about every 30 to 40 days) we get excessive SCSI reservations and our ESX servers are unable to access storage on the MSA. We opened a case with HP; their reaction: "we don't support ESX 3.5 on an MSA 1500" !?!? VMware's advice: modify /etc/modules.conf (remove the USB entries) and uninstall the HP Insight agents. HP's advice: downgrade to 3.0.2 and bring the firmware in line with the support matrix; we are running a firmware update now. The only solution when it happens is to bring down every ESX server.

Anonymous said...

Hi, I now have a forum dedicated to this issue at http://www.msa1500cs.com/. No 100% resolution, but feedback on this forum would be "greatly" appreciated so we can gather more information about the problem. What would also help is if you could turn on debugging mode and gather the debug output from the serial port when the MSA1500cs locks up. Thanks! Eric

Anonymous said...

Also, Kris has a new blog entry about the issue too: http://drowninginnumbers.blogspot.com/2008/06/possible-fix-found-msa1500sc-halting.html