Yesterday, a Fox News reporter asked me to comment on the recent MN DNR website outage. I’ve since had some time to digest the incident and learn more of the details. I learned that the DNR website was designed and managed by a third-party SaaS provider. The provider has built a campground reservation system tailored to meet the needs of government organizations. Outsourcing this capability to a provider makes an awful lot of sense to me. I can understand why the MN DNR went this route.
I was intrigued by some of the incident details published in a St. Paul Pioneer Press article yesterday. I know I run the risk of sounding like an armchair quarterback in offering my thoughts on this matter. I’m sure that the service provider and the MN DNR team did everything they could to minimize service issues. But this scenario highlights some of the challenges faced by service providers, and these technical challenges fall squarely into my domain of expertise.
The Pioneer Press article notes that the campground reservation system was designed to handle 4,500 daily visits. What does that really mean in terms of capacity planning? Let’s say that all of those visits occur during a 12-hour daylight window, or 43,200 seconds. The Pareto principle (the 80/20 rule) would suggest that 80% of those visitors access the website during 20% of that window. That means 3,600 visitors would access the site over 8,640 seconds, or roughly one visitor every 2.4 seconds.
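For anyone who wants to play with these numbers, the back-of-the-envelope math above can be sketched in a few lines of Python. The function name and its parameters are my own, purely for illustration; the 80/20 split and the 12-hour window are the same assumptions I made above, not figures from the provider.

```python
def peak_seconds_per_visitor(daily_visits, window_hours,
                             visit_share=0.8, time_share=0.2):
    """Average gap between visitors during the busy part of the day.

    Assumes visit_share of the visits land in time_share of the window
    (the 80/20 rule of thumb). These shares are assumptions, not data.
    """
    peak_visits = daily_visits * visit_share            # 4,500 * 0.8 = 3,600
    peak_seconds = window_hours * 3600 * time_share     # 43,200 * 0.2 = 8,640
    return peak_seconds / peak_visits


gap = peak_seconds_per_visitor(daily_visits=4500, window_hours=12)
print(f"Roughly one visitor every {gap:.1f} seconds at peak")
```

Changing the shares shows how sensitive the estimate is: if 90% of visits landed in 10% of the window, the gap would shrink to roughly one visitor every second.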
Is this enough demand to take down a website? It depends on what those visitors are doing. Browsing a website doesn’t create much of a performance hit on website infrastructure. But more intensive database operations can generate significant website infrastructure loads. The combination of the two can be deadly.
According to the Pioneer Press, the service provider reduced the size of high-resolution images on the MN DNR site to conserve bandwidth and computing resources. That’s one way to tackle the problem. Some companies will utilize image caching systems to speed up the delivery of images and reduce the computing load on critical web application servers. However, this doesn’t solve bandwidth constraints caused by demand for high-resolution images.
I think this is a perfect use case for the cloud, and for Content Delivery Networks (CDNs) in particular. A CDN allows an organization to offload the storage and distribution of static objects like high-resolution images. The CDN can deliver those images much more efficiently than most organizations could deliver them on their own. The computing resources and, more importantly, the Internet bandwidth are outsourced to a service provider that is highly optimized for this type of service. Connecting your website to a CDN is a smart move these days if you plan to launch a website and expect a wide range of demand.
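To make the idea concrete, here is a minimal sketch of one common way to hook a site up to a CDN: rewriting the links for static files so that browsers fetch them from the CDN's hostname instead of the application servers. Everything here is hypothetical, including the `cdn.example.com` hostname, the list of file extensions, and the function name. I have no knowledge of how the DNR's provider actually serves its images.

```python
import posixpath
from urllib.parse import urlsplit

# File types treated as static objects (an illustrative list, not exhaustive).
STATIC_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".css", ".js"}


def to_cdn_url(url, cdn_host="cdn.example.com"):
    """Point static assets at a (hypothetical) CDN host.

    Dynamic pages, such as a reservation form, are returned unchanged so
    they still reach the web application servers.
    """
    parts = urlsplit(url)
    extension = posixpath.splitext(parts.path)[1].lower()
    if extension not in STATIC_EXTENSIONS:
        return url
    # Swap only the hostname; scheme, path, and query stay the same.
    return parts._replace(netloc=cdn_host).geturl()


print(to_cdn_url("https://www.example.org/images/park.jpg"))
print(to_cdn_url("https://www.example.org/reserve?site=12"))
```

The point of the split is that the heavy image traffic goes to the CDN while the application servers only see the requests that actually need them.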
And speaking of dynamic demand, it’s now possible to leverage computing infrastructure to elastically support customer demand. The MN DNR’s provider had to add more server capacity to support the website. Provisioning new servers can oftentimes take days or even weeks depending on the data center and application requirements.
Imagine building a website which is able to sense greater demand, and automatically provision more computing and storage resources to meet that demand. This kind of technology is available today in enterprise cloud platforms. The benefit is that an organization could spin up a significant amount of computing resources for a short period of time, to handle something like a surge of camping reservations. Then, once the website demand slows down, the organization can simply return the computing resources to the cloud provider. The key is that the website has to be designed to take advantage of dynamic cloud computing resources.
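The decision at the heart of that kind of elasticity is simple enough to sketch. The snippet below is only an illustration of the sizing logic, assuming you can measure the incoming request rate and know roughly how many requests per second one server can handle. The function name, the 70% target utilization, and the minimum and maximum server counts are all made-up values, and a real cloud platform would wrap this in its own monitoring and provisioning services.

```python
import math


def desired_servers(requests_per_second, capacity_per_server,
                    target_utilization=0.7, min_servers=2, max_servers=20):
    """Return how many servers to run for the current demand.

    All defaults are hypothetical. Servers are sized to run at
    target_utilization so there is headroom for sudden bursts.
    """
    if capacity_per_server <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("capacity and utilization must be positive")
    usable_capacity = capacity_per_server * target_utilization
    needed = math.ceil(requests_per_second / usable_capacity)
    # Never drop below the floor or grow past the budgeted ceiling.
    return max(min_servers, min(max_servers, needed))


# A quiet day versus a reservation surge (illustrative request rates).
print(desired_servers(requests_per_second=1, capacity_per_server=5))
print(desired_servers(requests_per_second=40, capacity_per_server=5))
```

Run on a schedule, a rule like this adds servers as the surge builds and hands them back once demand falls, which is exactly the behavior described above.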
I strongly believe in creating meaningful Service Level Agreements (SLAs) with service providers. An SLA helps providers understand the level of commitment they are making, and it ensures accountability. At VISI, we use SLAs as a way to benchmark our performance and prove our value to our customers. And if we fail, we share the pain with our customers. That’s the way it should be.