The settings for the WebObjects adaptor have this very temptingly named distribution method called “load balancing”.
Don’t use it. Use round-robin: it’s simple, and it works. If you have multiple webservers, be sure to tweak the round-robin order in which each one distributes requests, or use random.
Why: The load balancing method leads to a cascade failure, because what really happens is that the adaptor forwards ALL incoming requests to a single instance, bringing that instance down. That’s because the load figure the adaptor uses isn’t live; it’s the load as of the last successful request.
In Detail: Let’s say you have 3 instances: 1, 2 and 3. To start, let’s say requests have been coming in slowly, so the instances show up in the load balancing with various loads:
Instance 1 load: 3.4
Instance 2 load: 2.3
Instance 3 load: 4.5
Now let’s say your CEO goes on TV, and 50 people fire up their browsers and jump to your site at the same moment. Where are those 50 new sessions going to go? All of them will go to instance 2, because it has the lowest load.
Now sure, instance 2 now has a higher load, but that’s the real-time load. The load number that WebObjects uses for the load balancing is based on the value returned from the last completed request. So until the first of those 50 requests comes back from instance 2, the adaptor will continue to forward requests to instance 2.
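To make the difference between a stale and a live load figure concrete, here’s a rough sketch in Python. It is not the adaptor’s actual code: the function name `dispatch_burst`, the `live` flag and the idea that each in-flight request adds 1.0 to an instance’s load are all assumptions made for illustration. The only thing taken from the behaviour described above is that the adaptor picks the lowest load it knows about, and that this number doesn’t change while the burst is in flight.

```python
from collections import Counter

def dispatch_burst(reported_load, n_requests, live):
    """Send n_requests to whichever instance has the lowest *known* load.

    live=False: the table stays frozen during the burst (the stale case).
    live=True:  hypothetical alternative where the table is bumped by 1.0
                per dispatched request (an assumption, for contrast only).
    """
    known = dict(reported_load)          # copy, so the caller's table is untouched
    hits = Counter()
    for _ in range(n_requests):
        target = min(known, key=known.get)
        hits[target] += 1
        if live:
            known[target] += 1.0
    return dict(hits)

loads = {1: 3.4, 2: 2.3, 3: 4.5}         # the example loads from above
stale = dispatch_burst(loads, 50, live=False)
live = dispatch_burst(loads, 50, live=True)

print(stale)  # {2: 50}
print(live)   # spread over all three instances
```

With the frozen table every one of the 50 sessions lands on instance 2. With the hypothetical live table the same 50 sessions get spread across all three.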
Even then, it won’t help you, because then it might look like this:
Instance 1 load: 3.4
Instance 2 load: 52.3
Instance 3 load: 4.5
So the next 50 requests will go to instance 1. The same thing will happen:
Instance 1 load: 53.4
Instance 2 load: 52.3
Instance 3 load: 4.5
Now they will all go to instance 3. The same thing happens:
Instance 1 load: 53.4
Instance 2 load: 52.3
Instance 3 load: 54.5
OK, so now the surge passes. Instance 2 works through its backlog and completes some successful requests, so the adaptor picks up its lower load again:
Instance 1 load: 53.4
Instance 2 load: 2.3
Instance 3 load: 54.5
It will now be almost impossible for instances 1 and 3 to ever be used again: their load numbers only refresh when they complete a request, and they won’t be sent any requests while instance 2 looks so much cheaper. You’ve gone from having 3 instances to effectively having 1.
That’s with 3 instances. If your site is as busy as ours, so that you need something like 45 instances to deal with the load, the problem is even worse, because having all those requests go to a single instance basically destroys that instance. The adaptor then sends the next set of requests to the next instance in the list, destroys that, and so on.
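The whole walkthrough can be played back with a small model. Again, this is only a sketch under stated assumptions: `run_bursts` and `steady_traffic` are made-up names, each burst is assumed to add exactly 50 to the reported load, and the `real` loads after the surge are hypothetical. The one rule taken from the description above is that an instance’s reported load only changes when that instance answers a request.

```python
def run_bursts(reported, bursts, burst_size=50):
    """Each burst goes entirely to the instance with the lowest reported load."""
    reported = dict(reported)
    picked = []
    for _ in range(bursts):
        target = min(reported, key=reported.get)
        picked.append(target)
        # The adaptor only sees the higher figure once the burst's requests return.
        reported[target] += burst_size
    return picked, reported

def steady_traffic(reported, real, n_requests):
    """After the surge: one request at a time, each refreshing only its own instance."""
    reported = dict(reported)
    hits = {i: 0 for i in reported}
    for _ in range(n_requests):
        target = min(reported, key=reported.get)
        hits[target] += 1
        reported[target] = real[target]  # only the instance that answered is refreshed
    return hits, reported

picked, after_surge = run_bursts({1: 3.4, 2: 2.3, 3: 4.5}, bursts=3)
print(picked)  # [2, 1, 3]

# Hypothetical: every instance has really recovered, but only instance 2 has reported it.
real = {1: 3.4, 2: 2.3, 3: 4.5}
after_surge[2] = real[2]
hits, final = steady_traffic(after_surge, real, 1000)
print(hits)    # {1: 0, 2: 1000, 3: 0}
```

Instances 1 and 3 are idle in reality, but their stale numbers stay in the 50s, so in this model they never see another request.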
A cascade failure, which brings down the site.
Round-robin, on the other hand, is brutally simple, and always works if you have one webserver, because each instance gets allocated new sessions evenly. If you have multiple webservers, it’s a bit trickier, because each webserver runs its own round-robin and they won’t coordinate with each other. So instead of new sessions going to instances 1, 2, 3, 4, with two webservers they go 1, 1, 2, 2, 3, 3. You can fix this either by having one webserver allocate in reverse order (45, 44, 43, ...) by editing the XML config file, or by just having both use random.
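Here’s a small sketch of the two-webserver case. It assumes, purely for illustration, that new sessions arrive at the two webservers strictly alternately and that each webserver walks its own list in order; `interleave` and `back_to_back` are hypothetical helper names, not anything from WebObjects.

```python
from itertools import cycle, islice

def interleave(*orders):
    """Alternate new sessions between webservers, each with its own round-robin."""
    cursors = [cycle(order) for order in orders]
    while True:
        for cursor in cursors:
            yield next(cursor)

def back_to_back(sequence):
    """Count how often two consecutive sessions land on the same instance."""
    return sum(1 for a, b in zip(sequence, sequence[1:]) if a == b)

instances = [1, 2, 3, 4]

same_order = list(islice(interleave(instances, instances), 8))
mirrored = list(islice(interleave(instances, instances[::-1]), 8))

print(same_order, back_to_back(same_order))  # [1, 1, 2, 2, 3, 3, 4, 4] 4
print(mirrored, back_to_back(mirrored))      # [1, 4, 2, 3, 3, 2, 4, 1] 1
```

Under these assumptions the reversed order doesn’t remove the doubling completely (the two lists still cross in the middle), but it turns "every instance gets hit twice in a row" into an occasional collision.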