Sunday, August 10, 2014

Recovering vCenter Server after a Major Outage


I have discussed this issue many times with colleagues and friends who have been through a whole-datacenter outage, whatever the cause. Depending on the products running in the environment, restoring everything back to normal can get difficult.

The products that can add complexity to the restoration process include Nexus 1000v, SRM, and VMware VDS. Let me emphasize that there is nothing wrong with these products; the challenge is recovering vCenter Server, which is a critical component of the whole infrastructure.

I am going to explain a few scenarios for recovering vCenter Server and getting everything up and running. This can take anywhere from minutes to a few hours, and sometimes a day or two, depending on the size and inventory of the datacenter.

1) If you are running VMware VDS or Nexus 1000v

First of all, you need to find out where the vCenter Server was last running.

Let's assume you are running a small number of ESXi hosts in the cluster/datacenter (say 1-20). If you have the location documented, or if vCenter was tied to certain hosts using DRS rules, you can connect directly to that particular ESXi host (using PuTTY or the DCUI, and later the vSphere Client) and find out whether the vCenter virtual machine is still registered on that host. This assumes the network connection is working; otherwise we need to get that working first. (If you are running the Web Client, I will cover that in an upcoming blog post.)

vmware-cmd -l

The above command tells us which VMs are still registered on the host. You can then power on the vCenter Server from the command line or from the vSphere Client. Once it is up, open the console of the vCenter VM and log in using the local administrator account. There is another assumption here: if the vCenter database resides on another VM or a physical machine, network connectivity must be available between the database machine and vCenter Server.
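On the ESXi shell, where vmware-cmd may not be available, vim-cmd vmsvc/getallvms gives a similar listing with a numeric Vmid in the first column. The sketch below is one way to pick out the vCenter VM's ID from that listing. The function name vmid_by_name is hypothetical, and "vcenter" is only an example pattern.

```shell
# Print the Vmid (first column) of every registered VM whose listing line
# contains the given pattern (case-insensitive).
# Reads `vim-cmd vmsvc/getallvms` output on stdin.
vmid_by_name() {
    # Only lines that start with a numeric Vmid are VM entries; this skips
    # the header and any multi-line annotation text.
    awk -v pat="$1" \
        '$1 ~ /^[0-9]+$/ && index(tolower($0), tolower(pat)) { print $1 }'
}

# On the ESXi host (not run here):
#   vim-cmd vmsvc/getallvms | vmid_by_name vcenter
#   vim-cmd vmsvc/power.getstate <vmid>
#   vim-cmd vmsvc/power.on <vmid>
```

The match runs against the whole line, so a pattern that also appears in another VM's datastore path will return more than one ID. Check the listing by eye before powering anything on.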

Now let's take a situation where the management network of the ESXi host is configured on a VMware VDS. If needed, restore the Standard Switch on the ESXi host so that you can at least connect to the management interface (on ESXi 5.1 and later, the DCUI offers this under Network Restore Options). Then you can access vCenter Server from the console of the vCenter Server VM itself, using the localhost option, and see the whole inventory from within the VC virtual machine. Once you power on all the ESXi hosts, you should be able to see all the VMs and other inventory items.

If the VMs show as inaccessible or orphaned, you can simply unregister them and register them again. (This requires having documented somewhere the names of the virtual machines, the networks they are connected to, and the datastores they are configured on.)

If you are running Nexus 1000v, make sure both VSMs are powered on and can reach vCenter Server and the ESXi host(s). If needed, you can register the VSMs on the same host as vCenter Server to make things easier. Once everything is in working order, you can move them back to the hosts where they belong as per the DRS rules (if any are specified for the Nexus VSMs).

2) Where you have a large number of hosts in the cluster (more than 20-50), you need at least the name of the datastore where the vCenter Server files reside (assuming no Storage vMotion occurred before the outage). It is good practice to designate a specific datastore for the vCenter Server virtual machine files, so that you can simply connect over SSH to one of the ESXi hosts that mounts that shared datastore.

You can enable SSH on the host or use the local ESXi Shell from the DCUI, browse to the datastore, and register the vCenter Server on that ESXi host.

vmware-cmd -s register .vmx


vim-cmd solo/registervm /vmfs/volumes/datastore_name/VM_directory/VM_name.vmx

Once you see the vCenter entry in the inventory, just power it on and make sure the vNIC is connected. If the VM was on a VDS, it will be connected to a dvPortGroup on that VDS, which is not accessible because the vCenter service is not available yet. So you need to create a Standard vSwitch on the ESXi host where you registered the vCenter Server VM, and give it at least one uplink that carries the same VLAN traffic (if a VLAN was configured for the vCenter network) to get connectivity. Check whether you have a spare NIC to use for the VSS; if not (assuming all the NICs are used by the VDS), you need to take one that was assigned to the VDS before the outage and use it on the VSS.
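The standard vSwitch steps above can be sketched with esxcli roughly as follows. This is a minimal sketch, not a definitive procedure: the function names and the DRY_RUN switch are hypothetical helpers, and the vSwitch name, uplink, port group name, and VLAN ID in the usage line are example values to replace with your own.

```shell
# Run a command, or only print it when DRY_RUN is set (useful for review
# before touching a production host).
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create a temporary standard vSwitch with one uplink and one VLAN-tagged
# port group for the vCenter VM. All argument values are supplied by you.
make_recovery_vswitch() {
    vswitch=$1
    uplink=$2
    portgroup=$3
    vlan=$4
    run esxcli network vswitch standard add --vswitch-name="$vswitch"
    run esxcli network vswitch standard uplink add \
        --uplink-name="$uplink" --vswitch-name="$vswitch"
    run esxcli network vswitch standard portgroup add \
        --portgroup-name="$portgroup" --vswitch-name="$vswitch"
    run esxcli network vswitch standard portgroup set \
        --portgroup-name="$portgroup" --vlan-id="$vlan"
}

# Example (hypothetical values), printing the commands without running them:
#   DRY_RUN=1; make_recovery_vswitch vSwitchRecovery vmnic1 vCenter-Recovery 100
```

After this, the vCenter VM's vNIC still has to be attached to the new port group, for example with the vSphere Client connected directly to the host. If every NIC is still claimed by the VDS, one way to release an uplink from the host side is esxcfg-vswitch -Q <vmnic> -V <dvPort ID> <dvSwitch name>; treat that syntax as something to verify against your ESXi version.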

3) Now assume you don't know the datastore or ESXi host where the vCenter Server was last running, and you have more than 50-100 hosts in the cluster.

Here comes the real part of this post, so be patient and read on.

The question is how to find the vCenter Server virtual machine directory.

There are a few methods you can use:

a) Run a PowerCLI script across all the ESXi hosts. This is definitely a time-consuming task, as you need to connect to each host individually and run the command.

I can update the post here if someone comes up with a PowerCLI one-liner, so please leave a comment.
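This is not the PowerCLI one-liner asked for above, but a rough alternative is to loop over the hosts from any machine with an SSH client. The sketch assumes SSH is enabled on every host and that root login is permitted, which may not hold in your environment. The function names list_vms and find_vm_on_hosts and the host names in the usage line are hypothetical.

```shell
# List registered VMs on one host over SSH.
# Assumption: SSH is enabled on the host and root login is allowed.
list_vms() {
    ssh -n -o ConnectTimeout=5 "root@$1" vim-cmd vmsvc/getallvms 2>/dev/null
}

# Print "<host>: <listing line>" for every VM entry, on any of the given
# hosts, whose listing line contains the pattern (case-insensitive).
find_vm_on_hosts() {
    pattern=$1
    shift
    for host in "$@"; do
        list_vms "$host" |
            awk -v pat="$pattern" -v host="$host" \
                '$1 ~ /^[0-9]+$/ && index(tolower($0), tolower(pat)) { print host ": " $0 }'
    done
}

# Example (hypothetical host names):
#   find_vm_on_hosts vcenter esx01 esx02 esx03
```

Hosts that are down or unreachable simply produce no output after the connection timeout, so an empty result does not prove the VM is unregistered everywhere.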

b) If you are running a SQL database for vCenter Server, you need to find that VM first.
Log in to the SQL VM with a local admin account, or with a Domain Admin account if a DC/AD machine is available and accessible. Then log in to the SQL database using the administrator account and run the following queries against the VC database.

The first query returns only a host ID:

select HOST_ID from VPX_VM where DNS_NAME like '%vcenter%'

Replace "vcenter" in the above query with the actual name of your vCenter Server VM, then use the value it returns in place of x in the second query.

select * from VPX_ENTITY where ID='x'

The above returns the ESXi host where the vCenter Server VM was last registered and running.
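The two queries above can probably be combined into one with a join, which saves copying the host ID by hand. The small helper below only prints the SQL text for a given name pattern so it can be pasted into your SQL client. The helper name host_lookup_sql is hypothetical, and the NAME column on VPX_ENTITY is an assumption about the schema that should be checked against your vCenter version.

```shell
# Print a single query that joins the VM row to its host entity.
# Assumption: VPX_ENTITY has a NAME column holding the host name.
host_lookup_sql() {
    cat <<EOF
SELECT e.ID, e.NAME
FROM VPX_VM v
JOIN VPX_ENTITY e ON e.ID = v.HOST_ID
WHERE v.DNS_NAME LIKE '%$1%';
EOF
}

# Example: print the query for a VM whose DNS name contains "vcenter".
#   host_lookup_sql vcenter
```

If the join returns nothing, fall back to the two separate queries, since that is the form the original procedure was written around.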

Hopefully the same queries work on an Oracle database as well, but I am not sure. If you are an Oracle expert, please leave a comment and I will update the post with the actual query for Oracle.

The above methods rely on certain assumptions, such as connectivity and login information for vCenter Server, SQL Server, the ESXi hosts, and so on, all of which are needed throughout the restore process.

4) Now the last situation, where you don't know the name of the vCenter Server, the ESXi host, or the datastore. Then you just cross your fingers, pray to God, and start digging for the VM in every direction possible. And first make a resolution that you will DOCUMENT everything about your virtual inventory going forward. I am not joking here; I have seen instances like this too.

Let me know if you would like to add to or update the existing information and I will be happy to do it. Just leave your comment through any available medium.

Please share and care !!

Thanks for your time.
