When everything blows up

Oh dear… (Source: ROMEO nuclear test,  Atomic Bomb Test Photos)

It’s August. You’re having a decent week. Looking over the top of your monitor, you can see that it’s pretty quiet in the office. Many of the engineers are on their summer vacation since their kids are off school, and they’ve picked a perfect moment to do so: the afternoon sky is a deep azure blue and is barely tainted by a single wisp of cirrus cloud.

Still, you don’t mind being in the office while your colleagues are away. You don’t mind at all: you get a rare fortnight where you have very little interruption. You’re finding your flow state, you’re getting lots of programming done, and the icing on the proverbial cake is that a warm convection of summer breeze is passing by your neck. Bliss.

. . .

You see a notification going off on Slack. You switch windows.

It’s the production environment. Uh oh. You minimize your IDE and switch to the Web browser, heading over to the monitoring dashboard. There’s one red dot, which you click. It’s the black box check for the API, which just timed out. That’s weird. You open up a new tab and go to the login page for your application, but it doesn’t load. 

Oh dear.

You switch back to the monitoring dashboard. Ten red dots. Three black box checks have failed. The homepage won’t respond. Two of the databases are reported as offline. The Elasticsearch cluster has also become unhealthy with half of your index shards unassigned.

Oh dear, oh dear. Cold sweats.

You grab your colleagues that are sitting near you. You begin diving into logs together, trying to work out what’s going on. One of the customer support agents comes jogging over. The thudding of their footsteps increases in volume as they get near you. 

“What’s going on? Tickets are flooding in.”

“We don’t know yet, we need to investigate,” you reply, with one eye on the logs that you’re scrolling through.

“OK, well what do I say?”

“We’re looking into it.”

“When’s it going to be fixed?”

“No idea. We’ve got to work out what’s going on first.”

The phone rings. You pick up. It’s your sales director. She’s not happy.

“In the middle of a pitch here. What’s going on? Shaun has gone back to slides, but if this lasts much longer, we’re going to look really dumb out here.”

“Not sure what’s going on yet, we need to investigate,” you reply, although what you’re really concentrating on is the cacophony of errors appearing in application logs.

“Let me know, be quick. Call me back.”

And the phone call ends with a click.

Back in the logs, you’re trying to see exactly what caused that initial exception to occur. Then you hear the deep thuds again on the office carpet. It gets louder. It’s the CEO, looking really quite cross.

“Did you know the app is down?”

. . .

It isn’t really, little coffee dude.

Chaos is normal

If you are working at a SaaS company then I will make you a bet: something is going to go horribly wrong in production this year. I don’t mean an issue with the UI or a bug in a feature; I’m talking about something truly nasty, like a networking problem at your data center, a database having corrupt data written to it that then replicates to the failover and backups, or a DDoS attack. You name it: one of many potentially woeful, stress-inducing panics.

Yes, all of those have happened to us, and not only once. Many times.

It happens. You can’t differentiate yourself from other companies by never experiencing outages. However, you can put in place a process that maintains some sense of order when everything is on fire, and ensures that those that need to fix the issue are able to do so, and those that need to be informed, are.

Outages and unreliable service in SaaS will cost you a lot of money. Not only will repeated incidents cost you money via the salaries of the engineers that are spending their time fixing them rather than improving the product, but a slow or unreliable website will also cause your customers to lose their patience, open another tab and click on your competitor. When was the last time you waited more than 10 seconds for a site to load? Your business was lost in those moments.

Defining those moments of panic

To the disorganized or inexperienced, these chaotic moments of panic are truly awful. You don’t know what to do first, you’re overwhelmed, and your unease transmits to those around you, causing a swell, then a breaking wave, of paranoia. However, these moments can feel saner with a simple process wrapped around them to guide you through the mess. 

But first, some definitions. The excellent Art of Scalability defines incidents and problems based on definitions from ITIL:

  • An incident is any event that reduces the quality of service. This includes downtime, slowness to the user, or an event that gives incorrect responses or data to the user.
  • A problem is the unknown cause of one or more incidents, often identified as the result of multiple similar incidents.

Incidents and problems are a massive pain. They are unexpected, unwanted, destroy the productivity of the teams that are working on them, and negatively affect morale. However, they’re always going to happen. You need to be prepared.

Your weapons against these situations are two-fold:

  1. You need to accept that incidents are going to happen, and have an incident management process that allows you and your staff the space and coordination to restore the required level of service whilst communicating effectively with the rest of the business.
  2. You need to do your best to track and log incidents so that they don’t turn into problems, or in the case that they do, those problems are kept as small as possible.

Let’s begin by looking at incident management.

Incident management

When things blow up, there are three hats that need to be worn by the staff that are working on it:

  1. The manager of the situation, who is able to make decisions, delegate actions to staff, and be the point of contact.
  2. The communicator, who is responsible for broadcasting to the business what is going on at regular intervals.
  3. The technical expert, who is identifying and fixing the issue.

In organizations that are not well-versed in incident management, these three hats often get worn by the same person, making them ineffective at all three strands of work. What works best in practice is to ensure that these hats are worn by at least two people, and that the technical expert hat is never shared with the other two. Let those that are working on fixing the problem do nothing but fix it calmly.

The incident manager should be an experienced member of staff who has the authority to make difficult calls, and is comfortable doing so: for example, deciding whether to keep the service running at reduced speed for 6 hours, or to go offline to a holding page for 30 minutes to perform critical maintenance and resume at full speed afterwards.

The communicator hat can also be worn by the incident manager if they are content and competent at doing so. The communicator should broadcast to the rest of the business, through the expected channels, the latest on the incident, what is being done about it, and when the next update or expected fix is going to be.

Identify people in your organization who are able to fill these roles. Ideally, there should be multiple members of staff that can wear each of the hats. Then you can define a rota to ensure that people take turns, and so that your incident management process isn’t affected too much if numerous people are on vacation.
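
If you want the rota to run itself, even a tiny script will do. Here is a minimal sketch in Python that rotates the incident manager hat weekly; the names are made up, and you would plug in whichever roles and people fit your team.

```python
from datetime import date
from typing import Optional

# Hypothetical rota: anyone listed here can wear the incident manager hat.
INCIDENT_MANAGERS = ["Alice", "Bik", "Carlos", "Dana"]

def on_call_manager(today: Optional[date] = None) -> str:
    # Rotate weekly using the ISO week number, so the rota needs no upkeep.
    today = today or date.today()
    week = today.isocalendar()[1]
    return INCIDENT_MANAGERS[week % len(INCIDENT_MANAGERS)]

print(f"This week's incident manager: {on_call_manager()}")
```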

A playbook for incidents

Define a playbook to follow when an incident occurs. A simple one could look like the following:

  1. The notification that an incident is taking place, and the assignment of staff into the three roles.
  2. A decision on the means of communication between those working on the incident. Typically we use Slack for this, but any way is fine as long as it works for you.
  3. Communication to the rest of the business that an incident is occurring: what it is, what’s being done about it, and what the regularity of the updates is going to be. Typically we give updates every thirty minutes via Slack. Additionally, you should update your customer-facing status page if required, and send out notifications to customers if it is deemed necessary.
  4. Regular internal communication documenting what is being done to recover from the incident. This includes any major decisions that have been made. This can be used later to review the incident and learn from it.
  5. The continuation of steps 3-4 until the incident is fixed. When it is fixed, the business should be notified and the work that resolved the issue should be documented.
  6. Scheduling a 5 Whys postmortem to get to the root of the incident and decide on actions to prevent it in the future.

Following a playbook outline such as this ensures that the business is kept up to date while those that are working on fixing it are able to do so uninterrupted. 
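
You can keep the overhead of following the playbook low by capturing each incident in a simple, shared record. The sketch below shows one way to represent it in Python; the names, channel and update cadence are illustrative assumptions rather than a prescribed format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Tuple

@dataclass
class Incident:
    title: str
    incident_manager: str                 # makes the calls, point of contact
    communicator: str                     # broadcasts updates to the business
    technical_experts: List[str]          # left alone to diagnose and fix
    comms_channel: str = "#incident"      # e.g. a dedicated Slack channel
    update_every: timedelta = timedelta(minutes=30)
    timeline: List[Tuple[datetime, str]] = field(default_factory=list)
    resolved: bool = False

    def log(self, message: str) -> None:
        # Steps 3-4: record what is being done and any major decisions,
        # so the postmortem has an accurate timeline to work from.
        self.timeline.append((datetime.now(), message))

    def next_update_due(self) -> datetime:
        # Step 5: the business knows when to expect the next broadcast.
        last = self.timeline[-1][0] if self.timeline else datetime.now()
        return last + self.update_every

incident = Incident(
    title="API black box checks timing out",
    incident_manager="Priya",
    communicator="Priya",
    technical_experts=["Sam", "Jo"],
)
incident.log("Two databases reported offline; investigating replication.")
print(incident.next_update_due())
```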

After an incident

As specified in point 6 above, once an incident is over and the service has been restored, it is useful to run a 5 Whys session. These are well-documented elsewhere on the Web, so I won’t go into detail about how to run one. However, it is important to note that you do not want incidents to turn into problems that go unfixed long term. 

Your 5 Whys session should hopefully point to a piece of your infrastructure that needs improving, or a part of your application that needs to scale better. 

In order to nip the issue in the bud, you should identify the work that needs to be done to prevent it from happening again. Create tickets, and assign actions and owners. The follow-up from an incident is more important than the fix itself. Ensure that the issues identified are prevented from happening again in the future by improving monitoring, scalability, reliability, or whatever it takes.

Sometimes, depending on the scale of parts of your architecture, it may be worthwhile spending money to bring in an external expert to guide your path towards scalability. We have done so at various points for our main data stores. Although expensive on paper in the short term, the resulting reduction in incidents and problems, and the increase in availability of our application, has kept our customers loyal.

In summary

Don’t just treat incidents as annoying things getting in the way of doing your real work. Take them seriously and do the work to make them happen less in the future; you don’t want them to turn into problems that drive your customers away.

Algorithms to make you more effective

Shakey the Robot. (Source: Computer History Museum)

Your focus and how to protect it

Earlier on in my career, I was under the impression that success was strongly tied to saying yes.

That quick favor? No problem. That interesting idea that someone just mentioned to me in the kitchen? I should probably prototype that. Jumping on that call to a client? Why not. 

Always saying yes is how I thought I could be most helpful and how I could open myself up to the most opportunity. I mean, there’s even a very funny book about it.

Well, unfortunately saying yes all of the time, even with good intention and kindness, is a path towards being extremely nice but ultimately ineffective.

Being effective, on the other hand, involves two strands of managing yourself:

  1. Being able to organize my time and my mind so that I have the best chance of being as productive as possible.
  2. Saying yes to the most impactful pieces of work and politely refusing those that are not.

We’ll get on to exploring the types of things that you should be spending your time on shortly, but first, let’s zoom in a little more on how you spend your time.

Taking inspiration from algorithms

You, just like everyone else at your company, and like everyone else in your industry, partners and competitors alike, have exactly the same amount of time in the day. What really matters is how best you use it. People who understand how they best function and subsequently arrange their day around their productivity traits can be dramatically more productive than those that do not. 

To get some inspiration, let’s look at computers: specifically CPUs.

Context switching

Your operating system is doing a great number of different things at once. All of the applications that you are running execute inside many processes. Now, just out of curiosity, what’s my laptop doing right now? As I open up Activity Monitor on my MacBook Pro while writing this sentence, this is what it looks like:

Processes running on my laptop at the time of writing.

As you can see from the bottom of the window, there are hundreds of different processes and thousands of active threads. However, my laptop only has 8 CPU cores on which these processes can be executed. All of these processes are able to share a handful of cores thanks to a neat trick: context switching.
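
If you are curious what the same numbers look like on your own machine, a rough equivalent of that Activity Monitor view can be pulled from Python using the third-party psutil package (assuming you have it installed):

```python
import psutil  # third-party: pip install psutil

# Count every process the OS will let us see, and total up their threads.
processes = list(psutil.process_iter(attrs=["name", "num_threads"]))
total_threads = sum(p.info["num_threads"] or 0 for p in processes)

print(f"{len(processes)} processes, {total_threads} threads, "
      f"sharing {psutil.cpu_count()} logical CPU cores")
```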

In order to give us – the slow humans – the illusion that the computer is doing many things in parallel, all of these tasks rapidly switch between one another, executing the next bit of work, then switching to the next process, executing for a bit, then switching, and so on. This switching happens extremely quickly, often at the rate of hundreds or thousands of times per second. To us mere mammals, everything appears to be happening at once.

However, context switching is expensive

All of this multitasking requires administration: instead of executing instructions continuously in one process, the CPU has to stop a process, save its state, and then load in the state of the next process. The CPU isn’t doing anything useful while this context switching is happening. The less context switching that occurs, the more instructions get executed on the CPU cores.

Do you see where this is going?

The first step of protecting your focus is to realize that frequently context switching between your own tasks involves expending effort on administration rather than impactful output. The longer you can spend working on one task continuously, the more effective you will be in aggregate.
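
To see why, here is a toy back-of-the-envelope model rather than a benchmark: the slice lengths and switching costs are invented numbers, but the shape of the result is the point.

```python
SLICE = 1.0          # time units of useful work per slice (invented)
SWITCH_COST = 0.4    # time units lost saving and reloading context per switch (invented)
SLICES_PER_TASK = 50
TASKS = ["task A", "task B", "task C"]

def single_tasking() -> float:
    # Run each task to completion, switching only when a task is done.
    return len(TASKS) * (SLICES_PER_TASK * SLICE + SWITCH_COST)

def interleaving() -> float:
    # Round-robin one slice at a time, paying the switch cost on every slice.
    return len(TASKS) * SLICES_PER_TASK * (SLICE + SWITCH_COST)

print(f"one task at a time: {single_tasking():.1f}")  # 151.2
print(f"interleaved:        {interleaving():.1f}")    # 210.0
```

The same amount of useful work gets done in both cases; the only difference is how often you pay the cost of switching.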

Interruptions from other people aside, you can manage your own environment to ensure that you don’t context switch excessively:

  • Close all other windows and tabs while you’re working on something. Resist the temptation to peek at your notifications, or just disable them.
  • Block out periods of deep work in your calendar where you declare yourself uninterruptible. There is the concept of offline hours which we’ve tried at various times in the office.
  • Drive your focus away from reactive messaging. Batch process emails, DMs and chats at specified times of the day. Again, you’re just like a computer: batch processing is often more efficient than dealing with each interruption as it arrives.
  • Don’t start on a new task until you’ve finished the one that you’re doing. Although having multiple tasks on the go gives the illusion of productivity through busywork, just remember that CPU loading and saving state again, and again, and again. Inefficient, inefficient, inefficient.

Some context switching isn’t bad. Often a manager’s job relies on context switching between many different issues. But limiting it increases the throughput of individual tasks.

So we’ve looked at CPUs to direct our thinking about how to focus better on the tasks you’re working on. But can computing teach us anything about saying no to work that isn’t impactful?

I think that it can.

Pruning search trees

Search is a classic computer science problem. I’m not talking about Internet search engines here, though – I’m thinking about pathfinding. Given two places on a map, how do you decide the best route from A to B?

Let’s get our imaginations working.

Pretend that you are in London, standing at Trafalgar Square. You need to get to Regent’s Park. 

You have absolutely no idea how to get there and you have nothing at your disposal to help you out: no map, no people to ask, and no phone. The only way that you can probe your way to Regent’s Park is to effectively guess by walking in random directions for an unbounded amount of time, and it doesn’t take a stretch of the imagination to predict that you’re going to be quite lost quite quickly. 

Where do I go? (Source: Google Maps)

That’s not going to work.

Now imagine this time that you are standing at Trafalgar Square looking at a map. This time you can see the destination on the map, and you put your finger on it. That’s your first heuristic: the straight-line distance between your current location and the destination.

But which way should you go? There are sixty thousand roads within the six square miles of central London, so plotting out the shortest route ahead of time means exploring a massive, complex search space. Iterating through all of the possibilities would leave you standing there for weeks.

Instead, you decide to use your heuristic: you walk down the first road that’s roughly in the straight-line direction of your destination, walk to the next junction, and then look at the best direction to go based on your new orientation towards your destination. You repeat this process, and eventually you get there.

Neat! This is something that we’ve been making computers do for over 60 years. 

The A* search algorithm works in a similar way. It is a best-first search that finds the lowest-cost route from point A to point B, using a heuristic to decide which step to explore next and so avoiding the need to exhaustively explore each potential path ahead of time.

In the London route finding scenario above, the smallest cost is the least amount of walking required between the starting point and the destination. Typically the A* search algorithm performs this “walk” over a graph data structure.

A weighted graph. (Source: wikimedia.org)

Routes between places (A-E in the diagram) are represented as weighted edges on a graph, where the weights (the numbers in the diagram) represent the distance between those places. At each step of the algorithm, as in the walk through London above, the algorithm expands the next possible steps and scores each one by the distance travelled so far plus the heuristic estimate of the distance remaining, picking the one that costs the least.

Repeated application of this heuristic ensures a speedy arrival at the destination.
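
To make the mechanics concrete, here is a minimal A* sketch in Python over a small made-up graph; the edge weights and heuristic values are invented for illustration rather than taken from the diagram, with the heuristic playing the role of the straight-line guess.

```python
import heapq

# A small invented weighted graph: node -> list of (neighbour, distance).
GRAPH = {
    "A": [("B", 4), ("C", 2)],
    "B": [("A", 4), ("C", 1), ("D", 5)],
    "C": [("A", 2), ("B", 1), ("D", 8), ("E", 10)],
    "D": [("B", 5), ("C", 8), ("E", 2)],
    "E": [("C", 10), ("D", 2)],
}

# Optimistic guesses of the remaining distance to E, never overestimating
# (like the straight-line distance to Regent's Park).
HEURISTIC = {"A": 7, "B": 6, "C": 6, "D": 2, "E": 0}

def a_star(start: str, goal: str):
    # Frontier ordered by f = cost so far + heuristic estimate of the rest.
    frontier = [(HEURISTIC[start], 0, start, [start])]
    best_cost = {start: 0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, cost
        for neighbour, weight in GRAPH[node]:
            new_cost = cost + weight
            if new_cost < best_cost.get(neighbour, float("inf")):
                best_cost[neighbour] = new_cost
                f = new_cost + HEURISTIC[neighbour]
                heapq.heappush(frontier, (f, new_cost, neighbour, path + [neighbour]))
    return None, float("inf")

print(a_star("A", "E"))  # (['A', 'C', 'B', 'D', 'E'], 10)
```

Only the promising junctions get expanded; the rest of the map never needs to be looked at.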

An animation of A* search.
A* search in action. Note how the path is found without needing to explore the whole search space. (Source: imgur.com)

But how does this algorithm apply to the way that you manage your focus? I believe there are two related themes:

  • You can define a heuristic to prove that what you are working on is the most impactful task at any given time.
  • Then you can apply your heuristic to guide your choice of work, dramatically pruning your mental search space by focussing on the most important thing. You can say no to everything else with good reason.

Defining your own heuristic

I can’t predict the most important thing that you should be working on right now. What is it?

At a very high level, as a manager, I typically follow the formula that Andy Grove stated in High Output Management:

A manager’s output = the output of their organization + the output of the neighboring organizations under their influence. 

This formula allows me to prioritize the numerous things I could be working on each day, especially on days when I have free time and the luxury of choosing activities. Instead of getting overwhelmed, I can prune my own search space accordingly by making sure that what I am doing makes an impact on the largest possible number of people.

For example, if you have an important product launch coming up, then de-risking that launch as soon as possible may be your primary heuristic. If it’s growing your organization after receiving funding, then it’s that. If you’re an individual contributor working on the architecture of your application, then your heuristic could be continually improving the speed to serve data, or designing a plan to scale that architecture over the coming years. 
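
As a toy illustration only, with entirely made-up tasks and numbers, applying a single heuristic to a to-do list might look like this; the heuristic here is people helped per hour spent, but yours will be whatever serves your current goal.

```python
# Entirely invented candidates and estimates, purely to illustrate pruning.
CANDIDATES = [
    {"task": "de-risk the upcoming product launch", "people_helped": 5000, "hours": 16},
    {"task": "prototype the idea from the kitchen chat", "people_helped": 3, "hours": 8},
    {"task": "sit in on the client call", "people_helped": 1, "hours": 1},
]

def impact_per_hour(candidate: dict) -> float:
    return candidate["people_helped"] / candidate["hours"]

best = max(CANDIDATES, key=impact_per_hour)
print(f"Work on: {best['task']}")  # everything else can be politely declined
```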

Optimize towards taking the shortest path to achieving that goal at all times. Yes, you can be A* search!

Shakey the Robot: invented by the researchers that also invented A* search. Yes, this is you now. (Source: Wikipedia)

The best part about defining your own heuristic for choosing the work that you should be doing is that you have a bulletproof reason for how you are prioritizing your time. 

Refusing that meeting where your attendance is not completely necessary, or opting out of other peripheral work, is no longer a matter of letting anyone down personally: your reasons are justifiable because you are laser focussed on a goal that will be maximally impactful for the company.

So what can you do?

Think about how to create the best conditions for you to work. Prevent those context switches as much as you can. Then be smart about what you are working on throughout the week: what is your heuristic that guides you towards your goal? Prune everything else away.