Managing Petabyte Scale Workloads In The Cloud

Thursday, February 2, 2017

The cloud tsunami is here. There is a massive move to the cloud, with new applications being born there, and the volume of data and the number of applications in the cloud have grown exponentially. For most customers, it is not a matter of if, but when, they move to the cloud.

We all know the cloud promise: a safe, secure, agile and cost-effective computing environment. The question is, are we there yet? Has the promise been delivered? This question is even more pertinent when we talk about petabyte-scale workloads.

From a cost perspective, public clouds have done a phenomenal job helping customers move from a CAPEX to an OPEX model. Newer offerings, such as pay-as-you-go pricing and micro-compute services like AWS Lambda and Azure Functions, have dramatically reduced compute costs. You can now pay for a fraction of a second of compute and scale as you grow.
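To make the micro-compute idea concrete, here is a minimal sketch of a Python handler of the kind AWS Lambda runs; the event payload and the small unit of work inside it are hypothetical, and the same pattern applies to Azure Functions.

```python
# Minimal sketch of a micro-compute function (AWS Lambda style).
# You pay only for the time this handler actually runs, not for an
# always-on server. The event shape below is a hypothetical example.

import json

def handler(event, context):
    # 'event' carries the request payload; 'context' carries runtime metadata.
    records = event.get("records", [])

    # Do a small unit of work per invocation, e.g. summarize incoming records.
    total_bytes = sum(r.get("size_bytes", 0) for r in records)

    return {
        "statusCode": 200,
        "body": json.dumps({"records_seen": len(records),
                            "total_bytes": total_bytes}),
    }
```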

In terms of agility, it's a mixed bag. The cloud's ability to provide burst compute is unmatched: you can fire up a thousand virtual machines to run analytics and then shut them down, something you simply cannot imagine doing on-prem. However, there are many locks (vendor lock-in, architecture lock-in and cloud lock-in) that prevent you from being entirely agile. Some new technologies, like micro-compute, not only have steep learning curves, but also require you to completely rewrite the application. The cloud is agile, as long as you stay within your framework.
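As a rough illustration of the burst pattern, here is a hedged Python sketch using the AWS boto3 SDK to launch a fleet of instances, run a job, and tear the fleet down; the AMI ID, instance type and fleet size are placeholders, and Azure's SDK offers an equivalent flow.

```python
# Rough sketch of burst compute: launch a fleet, do the work, tear it down.
# The AMI ID, instance type and count are hypothetical placeholders.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch the burst fleet (very large fleets may need to be requested in
# batches and are subject to per-account limits).
fleet = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder analytics image
    InstanceType="r4.2xlarge",         # placeholder instance type
    MinCount=100,
    MaxCount=100,
)
instance_ids = [i["InstanceId"] for i in fleet["Instances"]]

try:
    # ... submit the analytics job to the fleet here ...
    pass
finally:
    # Shut everything down so you stop paying the moment the job is done.
    ec2.terminate_instances(InstanceIds=instance_ids)
```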

Let’s take the example of a Fortune 500 organization that has eight petabytes of data. Ingesting that data and bringing it online in the cloud is a mammoth task in and of itself. An average virtual machine on Azure can mount about 40 TB of data; yes, there are bigger machines, but they cost a lot more. At 40 TB per VM, you would need roughly 200 VMs just to bring the data online, before you even do something meaningful with it. And before you make it available, it’s your task to manually slice and dice this data. It’s a Home Depot-style, DIY effort. Take this a step further: you now want to ingest this data into cloud services like Hadoop, Azure Media Services or many of the PaaS components available on Azure or AWS. That requires you to manually manage even more infrastructure in the cloud.
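The back-of-the-envelope arithmetic behind that VM count is simple; the sketch below assumes 8 PB of data and roughly 40 TB mountable per VM, the two figures from the scenario above.

```python
# Back-of-the-envelope sizing for bringing 8 PB online, assuming each
# VM can mount roughly 40 TB (figures from the scenario above).

import math

TOTAL_DATA_TB = 8 * 1000          # 8 PB expressed in TB (decimal units)
MOUNTABLE_TB_PER_VM = 40          # typical data-disk capacity per VM

vms_needed = math.ceil(TOTAL_DATA_TB / MOUNTABLE_TB_PER_VM)
print(f"VMs needed just to mount the data: {vms_needed}")   # -> 200
```

Swap in binary petabytes or larger disks per VM and the count shifts, but the point stands: you are managing hundreds of VMs before any meaningful work starts.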

No doubt, the wise men (the CIOs) with large amounts of data are still on the fence until a better framework comes along to help ingest, manage and ignite petabyte-scale dark data.

-Vijay Ramaswamy