Thursday, October 6, 2011
With a mind on the cloud, eyes on virtualization and feet planted firmly on the (datacenter) ground...
Hindsight is great! Too bad it comes from mistakes.
Then along came the Personal Computer. At first it didn't seem like a ground-shaker; it was perceived more like an expensive toy. However, it got consistently better at doing, well, everything. Then came the need for PC users to be able to work together in groups, then the groups of users had to work together. Enter the Internet and, well, here we are.
We have essentially been stuck with the client-server model since the early nineties. Since then there have been a number of technological trends like thin clients, blade servers, virtualization, and now 'the Cloud'. And every time there's a new trend, the market goes through the same predictable loop. First there's the talk: expos, conferences, blogs, newsletters and the like. "Technology X is the word." "It's the new style." "It's the way of the future." Then along come the sales people with scare-mongering and blitzkrieg tactics: "Why are you still stuck with the old model?" "Do you realize that your competition is ahead of you?" "Research shows that 99.99% of companies that moved to the Technology X model are already seeing a 400% reduction in costs and gazillion% increase in productivity." Vendors release optimized hardware, evangelists preach the gospel of change and everyone is waiting for the dawn of a new era...
...that simply never comes. It's the same story over and over again. As IT gets ever more complex, new technologies provide solutions in one context and cause problems in another. Let's take Virtualization for example. It's a fantastic bit of technology that will play an ever-growing part in our lives for years to come. It allows businesses to save on hardware and electricity, consolidate their infrastructure and better provision for the future. But think for a minute that it's the answer to all your problems and you're in for a world of pain.
A great example of this is an old client of mine; they had 30+ mission-critical servers. Then along came a major hardware vendor that put the Virtualization spell on the management. The servers were aging, they said. Think of the electricity bill. Think of all the hardware support contracts. Think of the management overhead. The competition has already virtualized their entire infrastructure. What are you waiting for? And they went for it! The vendor consolidated all 30+ servers into a virtualization cluster with 2 physical hosts and a storage solution. The new equipment was an absolute beast performance- and storage-wise, with plenty of both to spare. Everyone was happy, plenty of smiles and pats on the back to go around.
Then performance began degrading. At first it was unnoticeable, but it progressively got worse, and within nine months of the system's introduction it all went pear-shaped. Performance was so slow the whole system became unworkable. The in-house IT department could not figure out why this was happening. When I first got involved, I was horrified! Storage performance indicators were off the chart, everything was in the red zone. The whole thing seemed to be set up horribly wrong. I fought the urge to ask the age-old IT Pro question: "Who set this up?" At a second glance it became clear to me that the problem wasn't the initial design and set-up. It was the lack of design and provisioning, and the absence of a distinct road-map for how that initially virtualized infrastructure would be allowed to grow inside the available hardware. The chaotic way the data grew was what caused the storage subsystem to choke to death.
Now I'm at a crossroads. I can either start the tech-talk and risk losing my non-technical audience or skip the tech-talk and get raised eyebrows from the IT Professionals reading this. I'll opt for the latter and promise to come back in a future blog post and analyse this case study in detail.
The bottom line is that data tends to grow, and it frequently grows exponentially. Now, the most prevalent argument in favour of Virtualization is that when you buy a new physical server you usually buy more than you need: more processing power, more memory, more disk space. The fact is that you also buy relative peace of mind; the knowledge that your data and the processing power it requires can expand unimpeded, at least up to a certain point. Then again, another pro-Virtualization argument is that you can add processing power, memory and storage at will. But where will these resources come from? How do you manage growth in a virtual machine environment? How do you plan for the future? You can easily come to a point where provisioning generously for a Virtualization environment costs as much as having a non-virtualized environment.
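To put a rough number on "data tends to grow", here is a minimal Python sketch of how quickly compound growth can eat up a shared storage pool. The function name and every figure in it are hypothetical, chosen purely for illustration; they are not measurements from the case study, and real growth is rarely this smooth.

```python
import math


def months_until_full(used_tb: float, capacity_tb: float, monthly_growth: float) -> float:
    """Estimate how many months compound growth needs to fill a storage pool.

    used_tb        -- space in use today (hypothetical figure)
    capacity_tb    -- total usable capacity of the shared pool (hypothetical)
    monthly_growth -- growth rate per month as a fraction, e.g. 0.08 for 8%
    """
    if used_tb <= 0 or capacity_tb <= 0:
        raise ValueError("used_tb and capacity_tb must be positive")
    if used_tb >= capacity_tb:
        return 0.0  # already full
    if monthly_growth <= 0:
        return math.inf  # no growth: the pool never fills up
    # Solve used * (1 + g)^n = capacity for n.
    return math.log(capacity_tb / used_tb) / math.log(1.0 + monthly_growth)


if __name__ == "__main__":
    # Assumed example: a pool that starts half full and grows 8% per month.
    months = months_until_full(used_tb=10.0, capacity_tb=20.0, monthly_growth=0.08)
    print(f"Pool is full in roughly {months:.1f} months")
```

The point of the exercise is not the exact figure but the shape of the curve: headroom that looks generous on day one can vanish in well under a year unless growth is tracked and planned for.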
The truth is that properly maintaining and provisioning a Virtualized environment not only has a larger administrative overhead, it also requires more knowledgeable staff. Something the case study client mentioned above hadn't planned for. In such a radical infrastructure change, the management needs to factor in the essential training and the costs it entails. And then there's the issue of availability: if any one of the aforementioned 30+ non-Virtualized mission-critical servers goes down, it's bad enough. What if, in the Virtualized environment, the entire storage unit gives up the ghost? Granted, it's unlikely, but it's still a possibility. If and when it happens, it's an absolute disaster. So now you need a second storage system to mirror the first. Let's face it, the math sometimes just doesn't add up.
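For those who like to see the math behind the second storage system, here is a back-of-the-envelope Python sketch. It assumes the two units fail independently, which is optimistic (shared firmware, power or human error can take out both), and the 99.9% availability figure is a made-up example, not a vendor specification. The function names are mine, for illustration only.

```python
HOURS_PER_YEAR = 24 * 365


def mirrored_availability(unit_availability: float, copies: int = 2) -> float:
    """Availability of a set of mirrored units, assuming independent failures.

    The set is considered down only when every copy is down at the same time.
    """
    if not 0.0 <= unit_availability <= 1.0:
        raise ValueError("unit_availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - unit_availability) ** copies


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    single = 0.999  # hypothetical availability of one storage unit
    mirrored = mirrored_availability(single)
    print(f"Single unit: {expected_downtime_hours(single):.2f} h/year down")
    print(f"Mirrored:    {expected_downtime_hours(mirrored):.4f} h/year down")
```

The mirror buys a dramatic drop in expected downtime on paper, but it also roughly doubles the storage bill, which is exactly where the consolidation savings start to erode.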
There's a simple reality behind all this: the larger the infrastructure, the more cost-effective Virtualization is. Could it be, then, that Virtualization is not THE way forward for everyone? Could it be that there's just no ONE technological trend to fast-forward us to the future?
I know a lot of Cloud advocates who would strongly disagree. The Cloud offers true 'resource on demand'. It charges only for what you use, when you use it. It allows for rapid expansion of any infrastructure, followed by rapid downsizing without losing investment in equipment. It's always available and has a reduced management overhead. It offers state-of-the-art security fail-safes. It's a businessman's dream which is destined to render IT Professionals obsolete. Or is it?
What is this 'Cloud'? Well, it's nothing more than a fully virtualized infrastructure with mechanisms that automatically allocate resources on demand or according to predefined rules. Nothing more and nothing less. So, in theory, the Cloud encompasses all the advantages of Virtualization with none of the disadvantages... but in reality, although it pieces together many aspects of the IT jigsaw puzzle, it raises new issues.
Where is the Cloud? Nobody really knows where and there are numerous Cloud providers out there. And anyway, location is supposedly irrelevant since a Cloud infrastructure is, or rather should be, everywhere. What I can tell you is where it's not: inside your company/organization. Do you care? Should you? Well, Cloud advocates say no. It's a seamless solution, where a Cloud provider takes all necessary steps to provide you with ample resources, absolute availability, total control, lightning-fast speeds, world-wide distribution, etc, etc. So what could go wrong? What issues could possibly stem from a move to the Cloud?
The Cloud is nothing more than an elaborate sum of components and sub-systems commonly found in any IT infrastructure. The cascading levels of redundancy, the intricate net of failover mechanisms and the wide physical distribution are what make any Cloud what it is. Cut enough strings from the net of redundancy and resources and the whole thing comes crumbling down. It's happening all over the place; even a superficial Google search produces tons of evidence: AWS (Amazon Web Services), Google and Google App Engine, Sony PlayStation Network, the list goes on and on. Security-wise, there's the issue of attack surface area. If you run an average-sized business with an average IT budget and an average datacenter, you can put enough security measures in place to deter potential attackers, simply because the amount of money and effort needed to hack your infrastructure is far greater than what your data is worth. Now, in the case of the Cloud, you've got thousands or hundreds of thousands of such bundles of data in virtually one place. Hacking the Cloud pays, hence the numerous occurrences of such incidents.
Location is also a major factor. In many countries a 100 Mbps pipeline to the Internet costs more than equipping and maintaining a datacenter for years. What's the point of having your local data on the Internet if you cannot access it at LAN speeds? What if that pipeline goes down? Now you need another one for failover. The productivity and cost-effectiveness equation is found wanting when it comes to moving all data to the cloud. Businesses tend to need some data close by for a variety of reasons: speed of access, security policies/protocols, sheer size, cost of Internet access and, yes, psychological reasons. I know many CEO and CFO old-timers who would rather die than have their ERP data anywhere they cannot see and touch. Call it a gross exaggeration, call it ignorant, call it backward, call it what you want and it still makes no difference. Sometimes the benefits of caution outweigh those of cost-effectiveness in the long run.
Another client of mine had the misfortune of hitting both snags in one go. They opted for a total move to the cloud after some coaxing by a major Cloud provider in South-East Europe. They just went for it, all-out. I didn't have a say in it since my support contract fees were, among others, part of the overall cost saving the move would bring. The move went great, everything ran smoothly, plenty of hand-shakes, smiles and pats on the back to go around. However, they soon found out that they had miscalculated their overall peak data usage, especially regarding file sharing. The amount of bandwidth required at midday far exceeded that of their Internet pipeline. So some services had to move back to their datacenter which, by the way, had just been refurbished into a meeting room. But anyway, the proper balance was struck, the management was happy, the users were as happy as users can be after a major infrastructure change, and all seemed well.
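The kind of sanity check that might have caught this before the move can be sketched in a few lines of Python. All names and numbers below are hypothetical, including the rule of thumb that only about 80% of a link's nominal speed is usable in practice; they are not the client's actual figures.

```python
def peak_demand_mbps(active_users: int, per_user_mbps: float) -> float:
    """Aggregate bandwidth needed when all active users are busy at once."""
    if active_users < 0 or per_user_mbps < 0:
        raise ValueError("inputs must not be negative")
    return active_users * per_user_mbps


def link_shortfall_mbps(demand_mbps: float, link_mbps: float,
                        usable_fraction: float = 0.8) -> float:
    """How far peak demand exceeds the usable part of the Internet link.

    usable_fraction is an assumed allowance for protocol overhead and
    headroom; returns 0.0 when the link can carry the demand.
    """
    if not 0.0 < usable_fraction <= 1.0:
        raise ValueError("usable_fraction must be in (0, 1]")
    usable = link_mbps * usable_fraction
    return max(0.0, demand_mbps - usable)


if __name__ == "__main__":
    # Assumed example: 150 users sharing files at 1.5 Mbps each at midday,
    # over a 100 Mbps Internet pipeline.
    demand = peak_demand_mbps(active_users=150, per_user_mbps=1.5)
    shortfall = link_shortfall_mbps(demand, link_mbps=100.0)
    print(f"Peak demand: {demand:.0f} Mbps, shortfall: {shortfall:.0f} Mbps")
```

Traffic that was free on the LAN suddenly competes for the Internet pipeline once the file servers move out of the building, so the peak figure, not the daily average, is what has to fit.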
Then, on a sunny summer morning (June 2011) it all went pear-shaped. Right now, I am actually looking at the email the Cloud provider's Operations Director sent to all customers. The subject reads: "Update for the Cloud infrastructure failure event dated [...]". Due to ethical and possibly legal implications I will not quote the body of the email. He proceeds to describe the reasons for this failure, namely the inability of the redundancy mechanisms to maintain data integrity and physical server connectivity. This caused performance issues and occasional service loss. Finally, he attributes the failure to the hardware vendor and promises swift resolution. To my knowledge this issue has still not been fully resolved, and it's October the 6th.
Having worked extensively with storage systems myself, I cannot help but sympathize. Such failures are more common than one might think and every precaution should be duly taken to plan for every contingency. But customers don't care who's to blame. The damage is done and no SLAs, no amount of compensation can make up for it. Extensive infrastructure downtime can hurt a business in so many ways that it would take a whole new blog post to even scratch the surface of the matter. The moral of the story is that the severity of any infrastructure failure rises proportionally with the level of consolidation that has been achieved. And as the Cloud is the epitome of consolidation, total reliance on it brings great risk. And don't get me wrong, I am not saying that there's a high chance of failure. Quite the contrary, the chances of any major public Cloud infrastructure going down are low. But it does happen and if you're totally reliant on it, it's out of your hands, and herein lies the risk. Now just pause and think. What are the chances of a major datacenter disaster? How often do we see one? You must admit, it's pretty rare as well...
There are various degrees of commitment to any type of infrastructure change. Some businesses start slow and some go all-out. Learn a lesson from the ones that got burnt, and take it slow. There's a distinct pattern emerging here. In the end, too much consolidation costs far more than no consolidation at all.
Trends and technologies might come and go, but my verdict remains decidedly fixed in its fluidity.
There is no single answer, nor will there ever be one, to our IT woes. It's a delicate balance that must be continuously struck to achieve the optimal mixture of available technologies, which changes dynamically depending on business demand. Given the current worldwide market conditions, long-term plans are measured in months or weeks, not years.
It's time for IT Departments and decision-makers to forget about technological trends and fight the urge to overhaul production environments. We cannot afford to generalize; every business and every individual business need must have its own tailor-made solution. Individual adaptations of new technologies should be tested in labs and pilot environments until they mature. It's not backward thinking, it's plain common sense.
In one sentence:
Set your mind to the future, your eyes to the present and plant your feet firmly on the past. It's a winning recipe.
the IT Guy.
- Andreas Panagopoulos