A few years ago, the poster child for the public cloud generally, and Amazon’s public cloud in particular, was Netflix. And within Netflix, the person who had the biggest public visibility, at least when it came to cloud infrastructure, was Adrian Cockcroft. Cockcroft was the best product evangelist that AWS had, and (back then, at least) they didn’t even pay him. Netflix was the quintessential cloud case study, and AWS benefitted hugely from the company’s, and Cockcroft’s, patronage.

Fast forward to today and Cockcroft has landed at AWS itself where, at last, he is actually paid for doing all that awesome evangelism work. The lessons that Cockcroft and his team brought to the world about engineering for the cloud, however, haven’t changed.

One of the key cultural and system innovations that Netflix introduced was chaos engineering. Chaos engineering takes the perspective that failure will always occur in systems and that those systems should be engineered to proactively identify and fix unknown faults. In practice, it is every kid’s dream job and every operations engineer’s worst nightmare. It involves simulating the most extreme, unique or volatile operating conditions imaginable, fixing whatever breaks in the process and doing it all again.
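To make that loop concrete, here is a minimal sketch in Python of what a tiny chaos experiment could look like. It is purely illustrative and is not how Netflix’s tools or Gremlin’s product work: `fetch_title`, `with_chaos`, `fetch_with_fallback` and `run_experiment` are all hypothetical names invented for this example. The idea is simply to inject failures on purpose and check whether the resilience measure (here, a retry with a degraded fallback) holds up.

```python
import random


class InjectedFault(Exception):
    """Raised by the chaos wrapper to simulate a dependency failing."""


def with_chaos(func, failure_rate, rng):
    """Wrap func so that a fraction of calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("simulated outage")
        return func(*args, **kwargs)
    return wrapper


def fetch_title(movie_id):
    """Hypothetical stand-in for a call to a downstream service."""
    return f"title-{movie_id}"


def fetch_with_fallback(fetch, movie_id, retries=3):
    """The resilience measure under test: retry, then degrade gracefully."""
    for _ in range(retries):
        try:
            return fetch(movie_id)
        except InjectedFault:
            continue
    return None  # None means "serve a degraded response"


def run_experiment(calls=1000, failure_rate=0.3, seed=42):
    """Inject faults and count full answers versus degraded ones."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky_fetch = with_chaos(fetch_title, failure_rate, rng)
    results = {"ok": 0, "degraded": 0}
    for movie_id in range(calls):
        if fetch_with_fallback(flaky_fetch, movie_id) is None:
            results["degraded"] += 1
        else:
            results["ok"] += 1
    return results


if __name__ == "__main__":
    print(run_experiment())
```

If the degraded count comes out higher than the business can tolerate, that is the “whatever breaks” part: fix it (more retries, a cache, a better fallback) and run the experiment again.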

Chaos engineering is, as the name suggests, about building muscle memory within the organization for dealing with almost constant unexpected eventualities. Netflix open sourced a number of different chaos tools, but there was always an opportunity for commercial offerings.

This is where Gremlin comes in. Gremlin is an early-stage company that is building chaos engineering into consumable service offerings. Its founder, Kolton Andrus, was formerly at both Amazon and Netflix (recognize a pattern here?) where he was partly responsible for injecting these chaos principles into software teams. Gremlin aims to make it easier (and therefore cheaper and quicker) to build chaos engineering into the operating principles of an organization.

The value of chaos engineering is pretty obvious, but for those who don’t yet believe, here are some data points from Gremlin’s research:

  • 98% of organizations say a single hour of downtime costs over $100,000
  • 81% of respondents indicated that 60 minutes of downtime costs their business over $300,000
  • 33% of those enterprises reported that one hour of downtime costs their firms $1-5 million

OK, OK, before you lambast me, I’d be the first to admit that those are totally weird statistics with little verifiable basis, but let’s just agree that downtime costs… something (where something = a big, albeit unquantifiable, amount).

Anyway, the hard nut for Gremlin to crack is how to build a following, how to be seen as “thought leaders” (and, yes, that is a gross term) and, ultimately, how to generate revenue for the business. One strategy they’re embarking on is that of community building. Today they’re launching the unsurprisingly named Gremlin Community, which is a place to help organizations and practitioners find new ways of dealing with the ever-increasing complexity that they face. Gremlin’s hypothesis is that there’s not a huge amount of content out there specifically about this emergent area, and they want to be the party that fills that void.

Gremlin Community is, according to the company: “a central space for engineers to come together to build resilient systems.” To this end, the community includes tutorials, a Slack channel, a list of talks and space for arranging meetups – all the stuff a good community needs.

MyPOV

Chaos engineering is relatively new and, as such, any information that can be disseminated to the people who want to leverage it is a good thing. That said, there is always an interesting tension when a commercial organization, especially one building on top of generally available tools, tries to “own” that community.

This tension isn’t unique to Gremlin, and they seem to be designing the community in the right way. Fundamentally, the world needs more information about this new approach and, regardless of the commercial success (or otherwise) of Gremlin the business, this community is a useful thing.

Ben Kepes

Ben Kepes is a technology evangelist, an investor, a commentator and a business adviser. Ben covers the convergence of technology, mobile, ubiquity and agility, all enabled by the Cloud. His areas of interest extend to enterprise software, software integration, financial/accounting software, platforms and infrastructure as well as articulating technology simply for everyday users.
