Engineering at scale is a people problem

Building people and technology

When your company is at an early stage, hard graft, rather than complexity, is the focus. Communication doesn’t require careful thought or organization because you’re either sitting next to each other or there are few enough of you to make decisions over video calls, emails or IMs. If there’s friction or disagreement amongst key people, then often the fight to keep afloat takes precedence anyway, dampening the urge to keep negotiating. Instead, you just do.

For the lucky few, dreams of success at those early stages can become a reality. That reality can be one playing out in fast-forward. Your company could attract funding and grow aggressively, expanding into many geographical locations and in turn creating multiple products that each generate their own revenue streams. That’s all brilliant. But success and scale also cause drag: communication gets harder, collaboration becomes trickier and decisions require much more lobbying and consensus. Adapting how the department works as it grows and matures is an iterative process that never finishes, and one that you’ll get wrong more often than you get right.

From a technical viewpoint, as a product and company grow, so does the complexity of adding new features. In the early days, when only a handful of developers were required to build the whole thing, decisions and progress were fast. But now building innovative new features often requires a multi-team effort: provisioning of new hardware or infrastructure, plenty of data science to prove concepts, new APIs and services built with future reuse in mind, and sweeping changes to the UI.

It becomes harder for each team to cleanly and completely own an area of the application. Dependencies begin to occur between different areas of the department and communication becomes more complex. You could argue that the hardest part of having a successful technology company is solving the technical challenges, but when it comes to doing this at scale it’s more often than not a people problem. How can everyone remain productive as the department steadily grows? Are there set processes and techniques that we can use to make this all predictable and frictionless?

Process isn’t always the answer

Being engineers at heart, we often try to solve all of our problems in the same way that we try to solve the problems that we have with technology. Surely, if we can streamline algorithms and build processes we should be able to apply the same logic to problems in the domain of communication and productivity, right?

Well, sometimes.

If we’re not careful then we can end up layering more and more complexity on top of what are essentially people problems and then inadvertently create even more people problems as a result. It’s a vicious circle.

It is sometimes necessary to take difficulties in the department and assess them as people problems rather than process problems to see if there are far simpler ways that they can be fixed. Let’s have a look at an example that we worked through in many iterations at Brandwatch. We still don’t have a right answer, but I feel that it highlights how we incorrectly identified a people problem as something that could be solved with even more process when the solution was in fact far simpler.

Pull requests

For many years I’ve been involved in reviewing code. This example covers how we worked through some issues with scaling backend code review as the department grew.

If you’re not familiar with them, pull requests are a method for submitting modifications to a codebase via GitHub so that others can review them before they are merged in. We’ve been using them for many years, and they’re great. However, when we first adopted them, it took a number of adaptations to the way we used them before the frustrations went away.

Let’s start with some history.

A very long time ago, in a pre-GitHub era, there weren’t many developers in the company. Typically people would ask for a quick in-person code review before pushing their work into the testing environment. Although an imperfect process, it encouraged people to seek review themselves before their code went any further. However, as we grew, and as the pull request functionality in GitHub became a lot better, we migrated to that system because it made it much easier to organize detailed reviews, especially when the number of developers in disparate geographic locations began to increase.

However, the pull request process highlighted a multitude of ambiguities in how we got things done:

  • How many people were required to review a given change before it could be merged in?
  • How senior did the reviewer need to be so that the review could be trusted?
  • Who was allowed to click the “merge” button once the reviews were complete?
  • How were conflicts resolved if the reviewers didn’t agree or if the submitter disagreed with the reviewers?

Mergers

In order to answer these questions, we added another layer of process: a group of senior and trusted developers called mergers, who had the authority to decide the fate of a pull request and ultimately were the ones that were allowed to click the “merge” button. Our new mergers enjoyed the initial responsibility and spent much more time carefully reviewing code. Two reviews were required: one from the merger and another from any other developer.

However, with time this additional layer of process created more people problems:

  • Our mergers became a single point of failure and if any of them got busy or sick then the throughput of teams suffered.
  • This also increased the pressure on our most senior developers: they were now slowing down the rest of the department if they were busy with their own work!
  • Those that were not mergers felt like they had diminished responsibility. Not only did this make them feel like they were not trusted enough, it also dramatically reduced the amount of time that they spent reviewing code because they felt that a merger would need to do it anyway, so why bother?

SLAs

Of course, to solve this problem, more process was added! We implemented an SLA for pull request review which mergers needed to stick to.

This made things worse:

  • More pressure was being put on the mergers to review quicker.
  • The SLA was being observed regardless of the size, complexity or impact of the pull request. Some bug fixes take 30 seconds to review, and some architectural pieces can take multiple days.
  • Our delivery managers would chase pull requests that weren’t meeting SLAs, causing even more stress for our mergers. They started to really hate the responsibility.

Easing the grip

After a while, it was clear that our issues with code review were people problems. I reached a point where I felt I was spending too much of my week addressing problems that the code review system had caused: stalemates due to disagreement, pressure from the business about things being “too slow”, angry staff who felt powerless to get their changes through, and endless circular arguments about everything from code style to personality clashes.

It was time for change. Fundamentally, regardless of process, we wanted to aim for the following:

  1. Everyone should feel empowered to make changes to the codebase.
  2. It should be in the best interest of the submitter to get their changes reviewed and merged. They have ultimate responsibility for the progress of their review.
  3. We trust that each engineer will submit their best work for review and will not want to do anything destructive or malicious.
  4. We trust everyone to give suitable constructive criticism, no matter their seniority, and to do their best to prevent bugs or odd patterns making their way into the code.
  5. Regardless of what happens, it’s just code after all. If something causes a bug when it gets merged into the codebase, we can just fix it or back it out. No big deal.
  6. Two positive reviews and it merges. No specific mergers required. The submitter should feel empowered to nominate reviewers if there are specialist concerns.
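To make the difference between the two merge rules concrete, here is a minimal sketch in Python. It is purely illustrative and not how we enforce anything in practice: the `PullRequest` class, the function names and the developer names are all hypothetical, and ignoring an author’s approval of their own change is an assumption on my part.

```python
from dataclasses import dataclass, field


@dataclass
class PullRequest:
    """Hypothetical stand-in for a pull request: an author plus approvers."""
    author: str
    approvals: set = field(default_factory=set)


def independent_approvals(pr):
    # Assumption: an author's approval of their own change does not count.
    return pr.approvals - {pr.author}


def can_merge_with_mergers(pr, mergers):
    """Earlier rule: two approvals, at least one from a designated merger."""
    reviewers = independent_approvals(pr)
    return len(reviewers) >= 2 and bool(reviewers & mergers)


def can_merge_with_two_approvals(pr):
    """Current rule: any two positive reviews, no special role required."""
    return len(independent_approvals(pr)) >= 2


pr = PullRequest(author="alice", approvals={"bob", "carol"})

# Under the earlier rule this change waits until a merger is free.
print(can_merge_with_mergers(pr, mergers={"dave"}))

# Under the current rule the team can merge it themselves.
print(can_merge_with_two_approvals(pr))
```

The second rule is simply the first with the role check removed, which is the point: the change was about who we trust, not about adding machinery.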

We’ve operated with this model since. It still has its flaws, but for the size we are now it makes do, and requires very little policing or intervention. It allows most review to happen within teams, too.

Redefining what we wanted to fix as a people problem showed that none of these ambitions required monitored implementations of particular processes. Instead, we just needed to encourage people to feel empowered to get on with their jobs to the best of their ability. And best of all, it had entirely positive effects. More people became empowered to get involved in reviewing because they could affect the outcome. Teams felt like they could take more control of their destiny by reviewing the code of their teammates rather than relying on nominated engineers in other teams and locations. And, when very complex or controversial pull requests were opened, people actively sought reviews from the right people themselves because it was in their best interest to do so.

In summary

Before you add another process or piece of technology to solve an issue, stop and think: is this a people problem?

Although it can be tempting to use one’s engineering brain to solve issues as if they were an algorithm, often that isn’t what needs fixing. In our example, it was giving individuals more trust and autonomy, rather than layering on more process, that made people work more efficiently. We are just humans after all.
