It’s the day before launch.
The engineering team look frantic. There are empty takeaway coffee cups across their desks, in the bin, and on the floor. Kelly is slouched over her keyboard looking at her monitor through her fingers. It’s the loading prompt, and it’s still loading, even after three minutes.
“This just can’t be possible,” she remarks. “How come we’ve never had any issues with the loading speed until now?”
A Slack message from QA.
Ash: Why is it throwing an Internal Server Error every time you change the date range to last month?
Ash: Try it out, see if it happens for you too.
Kelly: ARGH, let me have a look…
Bringing up the Chrome developer console, she feels the subtle change in air movement as someone approaches her.
“I’m really sorry to interrupt you, K.”
It’s Evan, the infrastructure engineer. Kelly tries not to get frustrated, and scoots back on her chair.
“The database just isn’t coping with the amount of requests from your service. I think it might need reworking. When is this meant to ship?”
Kelly feels her cheeks flush red. “Tomorrow.”
Another Slack message pops up. It’s Marketing.
Jordan: I’m just about to send out the countdown video on Twitter. App doesn’t seem to load at the moment. What’s going on?
Jordan: Are you there?
Kelly feels the world spin, and would rather be anywhere else right now.
The most wonderful time of the year
Software launches are one of the most anxiety-inducing things about being a professional developer. No other work event, apart from giving a talk to a room full of people, feels as full of terror, mishap, last-minute stress and adrenaline as the day that the new application or feature gets switched on to a fanfare.
Building software is hard enough in the first place. Building it to a deadline is even harder. Building it to a precise deadline where all of the company – and soon your whole customer base – is looking at you, is terrifying.
Nothing ever goes entirely right in software. The bits of the project that you thought were going to be difficult turn out to be straightforward, and the bit that was going to be simple takes four times as long because of some obscure networking issue.
To top it all off, the kraken-like mega problem that nobody could have predicted eats all of your contingency time, and here you are once again – and you will forever be here, no matter the project – fixing bugs and performance issues right before the deadline, overtired, over-caffeinated, and overstressed.
The big bang theory
Big bang launches are a very bad thing. We, as an industry, and as professionals working in software, need to do our best to persuade those we work with that big bang launches never work.
What do I mean by big bang launches? I’m talking about launches where the application or feature:
- Is shipped to production only as the marketing launch goes out.
- Is enabled for all users at once.
- Hasn’t been profiled against real load in production.
The attraction of the big bang launch to the outside observer is clear: it is the ultimate demonstration of the whole company being in tight formation. There was nothing, and now there is something big. Magic.
The engineering, the marketing, the salespeople, everything – all of the planets align at just the right time – and the curtain opens in front of the ballet to rapturous applause. What a display of coordination and synergy!
But this just-in-time delivery never goes right. Ever.
What kind of strategy can we adopt to make it look like we’re performing miracles when we’re actually being measured and safe?
How can we allow all of the space that we need to roll things out in production, test them and tweak them, whilst still factoring in contingency in such a way that allows for flexibility when things inevitably go wrong? How can we do this with our existing and prospective users being none the wiser?
Engineering strategies for a smooth launch
The best engineering strategy for shipping a big splashy feature for the first time is to make sure that when everyone starts to use it in production, you already know everything about how it runs in production. Broadly speaking, this involves:
- Planning for usage that is many times beyond what your current system can handle.
- Making use of feature flags in order to constantly ship the feature into production where you can see how it behaves.
- Doing extensive load testing early enough to be sure of your architectural decisions.
- Taking advantage of beta programs with trusted customers to get real, non-internal feedback on the production code as early as possible.
- Using shadow loading to see the real production footprint, without those users knowing they’re using it.
Let’s visit each of these in turn.
Planning for usage
When planning the approach for the new feature, you should already be thinking of future load, rather than current load. Take the number of users and then stick a couple of zeros on the end and think about whether it will still perform.
If it doesn’t, is it possible to scale it horizontally? How would that work? More applications, more servers? How will requests be routed? Round robin, or sharded by client or user? Is it possible to get away with estimates of data rather than needing to count and aggregate everything? Will values be computed in batch, real time or pre-computed?
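To make the routing question concrete, here is a minimal Python sketch of the two options mentioned above: round robin, and sharding by user. The server names and function names are hypothetical, and a real system would normally do this in a load balancer rather than in application code.

```python
import hashlib
from itertools import cycle

SERVERS = ["app-1", "app-2", "app-3"]  # hypothetical server names


def make_round_robin(servers):
    """Return a router that hands each request to the next server in turn."""
    rotation = cycle(servers)
    return lambda: next(rotation)


def route_by_user(user_id, servers):
    """Pin a user to one server by hashing their ID (stable across processes)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


next_server = make_round_robin(SERVERS)
print([next_server() for _ in range(4)])   # cycles through the list, then wraps
print(route_by_user("kelly", SERVERS))     # same answer every time for "kelly"
```

Round robin spreads load evenly but lets any user land anywhere. Sharding by user keeps one user's requests on one server, which helps with caching, but this simple modulo scheme reassigns most users whenever the number of servers changes.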
Get creative. There may be a really neat solution.
Asking these questions with a diagram on a piece of paper or on a whiteboard is easier, cheaper and less stressful than doing it a week before launch. I once read that most of the effort in system design should be on picking through the edge cases and the contingency plan, rather than trying to make the core design perfect.
Make sure that you have a clear route for future scale, otherwise you’re going to be replacing the wheels on a moving car, rather than on a stationary chassis.
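The "couple of zeros" exercise is easy to do as back-of-the-envelope arithmetic. The sketch below is one way to write it down in Python. Every number in it (users, requests per day, peak factor, per-server throughput, headroom) is a made-up assumption that you would replace with your own measurements.

```python
import math

SECONDS_PER_DAY = 86_400


def peak_requests_per_second(users, requests_per_user_per_day, peak_factor=3.0):
    """Average request rate, scaled up by an assumed peak-to-average ratio."""
    average = users * requests_per_user_per_day / SECONDS_PER_DAY
    return average * peak_factor


def servers_needed(peak_rps, rps_per_server, headroom=0.5):
    """Servers required if each one is only run at (1 - headroom) of capacity."""
    usable_rps = rps_per_server * (1 - headroom)
    return math.ceil(peak_rps / usable_rps)


current_users = 20_000              # hypothetical figure
future_users = current_users * 100  # "stick a couple of zeros on the end"

for label, users in [("today", current_users), ("future", future_users)]:
    peak = peak_requests_per_second(users, requests_per_user_per_day=50)
    print(label, round(peak, 1), "req/s at peak,",
          servers_needed(peak, rps_per_server=200), "server(s)")
```

If the future row gives you a number that your architecture cannot reach by adding servers, that is the signal to revisit the design while it is still a diagram.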
Feature flags
Extensive use of feature flags can save all manner of headaches. Continually releasing code behind a flag means that large features don’t end up in branches of the codebase that remain unmerged for long periods of time, needing painful rebasing before they go into master. Instead, shipping code behind a flag means you can merge small increments of functionality as you go, without the user ever knowing.
Feature flags, especially those that are highly customizable such as the ones provided by LaunchDarkly, mean that you can test functionality with a percentage of your customers, or you can enable features for internal staff who will give you valuable feedback without the feature needing to be polished. You can also use flags to coordinate beta programs. When it comes to shipping time, you can roll out new functionality to customers in incremental cohorts to measure the impact.
Feature flags are great.
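If you have never looked inside one, a percentage rollout is less magical than it sounds. The following is a minimal, home-grown Python sketch of the idea, not LaunchDarkly's API: the `Flag` class, its fields and the flag name are all hypothetical. It assumes each user has a stable ID, and it hashes that ID so the same user always gets the same answer.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Flag:
    name: str
    rollout_percent: int = 0  # 0 = nobody, 100 = everybody
    allowed_users: set[str] = field(default_factory=set)  # staff, beta testers

    def is_enabled(self, user_id: str) -> bool:
        # Explicitly allowed users see the feature regardless of the rollout.
        if user_id in self.allowed_users:
            return True
        return self._bucket(user_id) < self.rollout_percent

    def _bucket(self, user_id: str) -> int:
        # Hash flag name + user ID into a stable bucket from 0 to 99, so a
        # user stays in the same cohort as the percentage is raised.
        key = f"{self.name}:{user_id}".encode("utf-8")
        digest = hashlib.sha256(key).digest()
        return int.from_bytes(digest[:8], "big") % 100


flag = Flag("new-dashboard", rollout_percent=10, allowed_users={"kelly", "ash"})

if flag.is_enabled("kelly"):
    print("render the new dashboard")
else:
    print("render the old dashboard")
```

Because the bucket depends only on the flag name and the user ID, raising `rollout_percent` from 10 to 50 keeps the original 10% enabled and adds new users on top, which is what makes incremental cohorts possible.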
Load testing
Prototype architectures can be tested by generating simulated load to prove that what you’re building is going to take the strain of real users. I’m used to backend tools such as Gatling, which let you simulate a large number of users and usage patterns hitting your services, and also easily collect the data from your tests to analyze the results.
What’s your 99th percentile case versus your 50th percentile? Is it acceptable? What will your biggest customer experience versus your mid-tier one?
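To show what those percentile questions look like in practice, here is a toy Python sketch. It is not Gatling and is no substitute for a real load testing tool: `run_load`, `percentile` and `fake_handler` are hypothetical names, and the handler just sleeps for a random time where a real test would call your service.

```python
import math
import random
import time
from concurrent.futures import ThreadPoolExecutor


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def run_load(handler, users, requests_per_user):
    """Run `users` concurrent simulated users and return every latency seen."""
    def one_user(user_number):
        latencies = []
        for _ in range(requests_per_user):
            started = time.perf_counter()
            handler(user_number)
            latencies.append(time.perf_counter() - started)
        return latencies

    with ThreadPoolExecutor(max_workers=users) as pool:
        per_user = pool.map(one_user, range(users))
        return [latency for latencies in per_user for latency in latencies]


def fake_handler(user_number):
    # Stand-in for a real request; replace with a call to your service.
    time.sleep(random.uniform(0.001, 0.005))


latencies = run_load(fake_handler, users=5, requests_per_user=10)
print(f"p50: {percentile(latencies, 50) * 1000:.1f} ms")
print(f"p99: {percentile(latencies, 99) * 1000:.1f} ms")
```

The gap between the two numbers is the interesting part. A healthy median can hide a 99th percentile that your biggest customer hits on every page load.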
Beta programs
As well as generating simulated load, it’s always valuable to get real users involved ahead of time. Identify users of your application that would be happy to provide feedback in return for using new – but potentially buggy – software ahead of everyone else.
With the help of feature flags you can give them the unpolished functionality ahead of time, monitor how the system performs under real usage patterns, and also speak to them for qualitative feedback.
Implementing their suggested improvements makes the final product better for everyone.
Shadow loading
Before doing a general rollout of your feature, you can route all traffic to it behind the scenes, but not let the user know that it is happening. This is often called shadow loading.
For example, if your new feature is going to be shown on the top of your homepage, why not have all users unknowingly call that new endpoint on page load, with the results being discarded?
This way you can measure what the load on the system is going to be like during normal conditions. You can feel assured that the functionality is ready for showtime.
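As a rough illustration, here is one server-side way to sketch shadow loading in Python. The `ShadowLoader` class and `render_homepage` function are hypothetical, and the sketch assumes the homepage handler can fire the call itself. The same idea works if the browser calls the new endpoint on page load and ignores the response.

```python
import logging
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ShadowLoader:
    """Calls the new feature in the background and throws the result away."""

    def __init__(self, new_feature, max_workers=4):
        self._new_feature = new_feature
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._lock = threading.Lock()
        self.timings = []   # seconds taken by each shadow call
        self.failures = 0   # shadow calls that raised

    def fire(self, user_id):
        """Queue a shadow call and return straight away."""
        self._pool.submit(self._call, user_id)

    def _call(self, user_id):
        started = time.perf_counter()
        failed = False
        try:
            self._new_feature(user_id)  # result deliberately discarded
        except Exception:
            failed = True
            logging.exception("shadow call failed for user %s", user_id)
        elapsed = time.perf_counter() - started
        with self._lock:
            self.timings.append(elapsed)
            self.failures += failed

    def drain(self):
        """Wait for queued shadow calls to finish (useful in tests)."""
        self._pool.shutdown(wait=True)


def render_homepage(user_id, shadow):
    shadow.fire(user_id)  # the user never sees this happen
    return f"<h1>Welcome back, {user_id}</h1>"
```

The two things that matter are that a failure in the new code path can never break the page, and that you record timings and errors somewhere you will actually look at them.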
Think about ways in which you can ensure that your new application or feature has been planned for scale and has been subject to production load a long time before it gets shipped to real users. Use feature toggles, load testing, beta programs and shadow loading to ensure that launch day is one where you can celebrate success, rather than tend to fires.
By following some or all of the techniques above, you can ensure that on the day that you flick the big switch on for all of your users, the system has already been doing all of the work, predictably, for some time. You can go out for a celebratory lunch with the team without a feeling of paranoia that everything is about to blow up catastrophically.