I have a customer that is using Windows Network Load Balancing for a fault-tolerant web service across two web servers. They started running W2003 x86 a few years ago and recently decided that they should “upgrade” to W2008 R2 to take advantage of some of the better web hosting features. Each server has 2 NICs. The first is the normal one we use to log into the servers and manage them. The second is used purely for the NLB clustered web traffic.
This meant a rebuild of the servers. For some architectural reasons, it was also decided to build a new NLB cluster. We would do this one web server at a time.
We rebuilt the first server. I brought up a new NLB cluster, with just itself as the only member for the moment. We would add the second server when it was rebuilt. To bring it into production we would:
- Change the production IP address on the old NLB cluster to a temporary one.
- Change the temporary IP address on the new NLB cluster to the production one.
Then we could rebuild the second web server and away we go!
Muggins here drew the short straw and I was awake at 06:00 this morning to VPN in, do some prep work and switch the IP addresses to bring the new server into production. I did that and tested. The websites would not respond. I had no idea what was up. Network Monitor showed external traffic coming in on TCP 80 and reaching the server. I could even see traffic from my own IP address arriving.
I checked the website bindings, which were set to the default of *, that is, all assigned IP addresses on the server. I verified with IPCONFIG that the production IP was live. I could ping it from other machines and see the traffic in Network Monitor. I decided I would configure the site in IIS7.5 to just use the NLB cluster IP address. That’s where issue #1 arose. I could not select that IP address. After a quick Google search I learned that IIS7.5 on W2008 R2 does not detect the NLB cluster IP address and load it into the drop-down list box. I had to type it in.
It should be OK now? I tested. And no joy. At this point I had to roll back the changes. The site had been offline for too long.
A few hours later I had the time to start investigating some more. I used another public IP address with a NAT rule to another internal IP address that I could use on the new NLB cluster. That would leave the production, old NLB, websites up and running and unaffected by my tests.
I still couldn’t access the site from outside. I tested the sites from another server in the same VLAN. I could access the sites from there. Strange! This meant that I had either a firewall or a routing issue. It couldn’t be a firewall issue. The same kind of NAT rule was being used for the new server. I was simply moving the IP address and we don’t do anything crazy with MAC addresses. It couldn’t be an ARP cache issue because I could see web traffic actually reaching the server in Network Monitor 3.3.
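The "works from the same VLAN, fails from outside" comparison can be scripted so it is quick to repeat from several vantage points. This is only a minimal sketch: `can_connect` is a hypothetical helper name, and the address in the usage line is a placeholder rather than anything from this environment. It checks whether a full TCP handshake completes, which is the part that needs the server's replies to find their way back.

```python
import socket

def can_connect(host: str, port: int = 80, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        # create_connection performs the full three-way handshake, so it only
        # succeeds if the server's SYN-ACK actually makes it back to us.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unreachable networks.
        return False

if __name__ == "__main__":
    # Placeholder address: replace with the NLB cluster IP under test.
    print(can_connect("192.0.2.10", 80))
```

Running the same script from a machine inside the VLAN and from one outside it makes the difference visible: inbound packets showing up in Network Monitor only prove the request arrived, not that the reply got out.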
I scratched my head. I could route out from the server. I could surf the web and traceroute out. Both the server’s management IP and NLB IP are in the same VLAN. The server management IP had the correct default gateway. The TCP configuration was identical to the W2003 R2 configuration.
What if … now I was reaching … what if NLB doesn’t route correctly? What if the NLB NIC’s IP configuration doesn’t pick up the default gateway set up on the management NIC’s IP configuration? If it was a normal NIC it probably would. I set up the default gateway on the NLB NIC. It was identical to the server management NIC configuration. I got the warning about multiple default gateways on a computer and clicked OK.
Now I tested web site access from an external IP and it worked perfectly. My conclusion? You have to configure the default gateway on an NLB NIC if using Network Load Balancing on Windows Server 2008 R2. Otherwise it will not route correctly to other networks; it should pick up the default gateway from the management NIC but it does not.
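One way to picture what I think was going on is a toy model in Python. It is not how Windows implements routing. It simply assumes that a reply has to leave through the NIC that owns the address the request arrived on, and that each NIC can only use its own on-link subnet and its own default gateway. The `Nic` class, `next_hop` function and all addresses are made up for illustration.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional

@dataclass
class Nic:
    name: str
    address: str                    # CIDR notation, e.g. "192.168.1.20/24"
    gateway: Optional[str] = None   # default gateway set on this NIC, if any

    @property
    def network(self):
        return ipaddress.ip_interface(self.address).network

def next_hop(nic: Nic, destination: str) -> Optional[str]:
    """Where a reply leaving via `nic` would go, or None if it has no route.

    Assumption: the reply may only use routes belonging to this NIC.
    """
    if ipaddress.ip_address(destination) in nic.network:
        return "on-link"            # same subnet: deliver directly
    return nic.gateway              # off-subnet: needs this NIC's gateway

# Hypothetical addresses: both NICs in the same VLAN, as in the story.
mgmt = Nic("management", "192.168.1.10/24", gateway="192.168.1.1")
nlb_before = Nic("nlb", "192.168.1.20/24")
nlb_after = Nic("nlb", "192.168.1.20/24", gateway="192.168.1.1")

print(next_hop(nlb_before, "192.168.1.50"))  # client in the same VLAN
print(next_hop(nlb_before, "203.0.113.7"))   # external client, no gateway
print(next_hop(nlb_after, "203.0.113.7"))    # external client, gateway set
```

Under those assumptions the model lines up with the symptoms: same-VLAN clients are reachable either way, while external clients only get an answer once the NLB NIC has its own default gateway.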