The Engineering Manager

The case for building from source

Previously I wrote about the inflection point that a particular part of your architecture will reach before you need to roll your own specialized piece of infrastructure. The summary of that article was that you probably won’t end up doing that unless you reach product-market fit and then have a real success that drives scale.

What this means for most of us is that we’ll be reusing existing software for the bulk of our application infrastructure. This is absolutely fine. There is a wealth of fantastic open source projects out there. From my background as a backend engineer, I know that there’s pretty much always going to be a great storage system that does what I want it to do.

At Brandwatch we run large Apache Solr and HBase clusters in production, and we are extremely grateful for the open source community that has enabled us to use them to support our business.

However, as good as these projects are, occasionally there are bugs. Sometimes there are big bugs. If they occur, not only are all of your customers dealing with unexpected downtime, but you might have absolutely no idea what the problem is, or how to fix it.

Let’s explore some advantages of building open source projects that comprise your core infrastructure from source. But first, let’s consider our mindset when looking at the downloads page of a popular open source database.

Which should I download?

Earlier on in my career, when navigating to a project’s website in order to download a database to use, I would often see two options and have the following thoughts:

Downloading the binary: This option is for people who want to use the database; developers like me who just want to run it and store some data.
Downloading the source: This option is for people who want to have a dig around and see how it works, or for those who want to contribute to the project themselves.

For a long while, this is what we used to do in production as well: download the binary and then deploy and run it. If we needed to upgrade it, we’d download the latest binary and replace the older version. But, with time, we realized the decision to download the source was more nuanced.

This change came with scale.

Dependence

As time passed, and as the company and our data storage needs grew, we began to place demands on our storage technologies beyond the levels for which you can easily find help and documentation. We’d see errors or odd behavior that Google couldn’t help with, and those same issues also perplexed contributors on the project mailing lists.

This is where we started to worry.

At large data volumes you begin to discover that not all new features added to open source projects have been thoroughly tested at the same scale you are running at – and who would expect them to be? After all, this is free software that you just so happen to be running a successful SaaS business with. We should all be grateful that we get such a head start.

But, regardless of that head start, you begin to get locked in. Any dramatic increase in scale results in an increase in your dependence on these systems: your decision to use them becomes harder to revert when they are supporting your customers around the clock.

Keeping up to date

Due to your increased dependence on a given system, you’ll want to more closely track the progress and roadmap of the project so that you can continue to upgrade and reap the benefits of new features and optimizations. You’ll also want to make sure that you swiftly apply new security patches.

However, even the simple act of upgrading to the latest version opens the door to more risk in your production environment:

Breaking changes. Some open source projects change extremely quickly and can move through non-backwards-compatible versions with regularity. Even if these changes are properly communicated, you’ll still need to do the work to support them. This is especially hard with storage systems. And, even worse, sometimes there are breaking changes that aren’t so well communicated, and you only notice when your own application breaks!
Weird regressions. These are typically a pain to track down. You may notice that your application is getting slower or buggier with time, only for the problem to ultimately lie within code that you haven’t written yourself. The code of open source systems isn’t necessarily the first place you’d look, either. Surely someone else has thoroughly tested it, right?
Unknown maturity of a subset of features. You may eagerly upgrade to the latest version of a database because it adds functionality that you want to use, only to find out that it hasn’t been tested extensively at scale. We struggled with the maturity of Solr’s ability to store indexes in HDFS, to the point that we gave up and returned to local storage on SSDs. It took many weeks of testing to reach that conclusion.

If you heavily depend on a technology in production to the point that it is business critical, there is a case for building projects from source and running your own builds in production, in order to be in control of your own destiny.

But why?

Reasons to build from source

Over the last few years, we have been migrating one of our larger data stores into Solr. We store hundreds of terabytes of data that users interact with via searches and facets (Solr’s term for aggregations). Data is continually ingested, updated and deleted: it is mutable.

While we were scaling our deployment, we hit numerous pain points that we had to overcome. In fact, some of these issues were deal breakers for that technology being able to work. We were able to overcome these hurdles much more easily by building our Solr deployment from source.

Familiarization with the build process

To begin with, building a project from source expands your knowledge of how it is put together. Large projects don’t always have trivial build systems and they can take some time for you to get your head around. You’ll learn about the dependencies and required configuration. You’ll learn about what gets packaged up and bundled ready for running on your production servers, and what really happens when the system is started and stopped.

This process might help you learn techniques that you can use in your own projects going forward, even if that is how not to do things! Also, if you happen to need to debug or patch the system yourself in the future, you will save a lot of panic down the road if you have already experienced how the build process works.

You can also decide how you roll out updates of the system when they are released upstream. When you create a new build, how long should you allow it to run on your testing or staging environment before allowing it to go live? How can you maximize your chances of experiencing any unexpected behavior or broken functionality before it hits production? Should you write your own integration tests to prove that the new build still works with your own application in the way you expect?
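
One way to make those rollout questions concrete is to write the policy down as code. The following is a minimal sketch, not a description of how any particular team does it: the class name, the 48-hour soak period and the single pass/fail flag for integration tests are all illustrative assumptions.

```java
import java.time.Duration;
import java.time.Instant;

/**
 * Sketch of a promotion rule for a locally built release.
 * Every name and threshold here is hypothetical.
 */
final class PromotionGate {
    // Assumed policy: a build must run on staging for at least this long.
    static final Duration MIN_SOAK = Duration.ofHours(48);

    /**
     * A build may go live only if our own integration tests passed
     * and it has soaked on staging for the minimum period.
     */
    static boolean readyForProduction(Instant deployedToStaging,
                                      Instant now,
                                      boolean integrationTestsPassed) {
        Duration soaked = Duration.between(deployedToStaging, now);
        return integrationTestsPassed && soaked.compareTo(MIN_SOAK) >= 0;
    }

    public static void main(String[] args) {
        Instant deployed = Instant.parse("2024-01-01T00:00:00Z");
        Instant now = Instant.parse("2024-01-03T12:00:00Z");
        System.out.println("ready: " + readyForProduction(deployed, now, true));
    }
}
```

Even a rule this small forces you to agree on numbers for the soak time and on what counts as a passing test run, which is most of the value.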

Speed of patches to production

If you’re depending on a technology in production at scale, it’s likely you’ll experience a few bugs. Building from source allows you to isolate them and fix them fast. Instead of panicking while you wait for the project to push out an official patch for an upstream bug, you can patch the source yourself, build it, and release it internally for immediate use. If you fix it first, then you’ve just contributed to the project.

The same is true for quickly applying the patches of others. While waiting for the maintainers to approve and release a submitted patch that affects your system, you can apply that patch to your own version and see if it solves your issue. Later, when everything is fixed officially upstream, you can build the latest upstream version and get back on track.

Building from source gives you this flexibility in patching. And because it has already forced you to go through the process of building and releasing yourself, the stressful moments of a production bug are confined to diagnosing and fixing it, rather than also working out how to build and release the project.

Debugging, configuring and testing

When investigating issues that you may suspect have come from an open source project, building from source allows you greater flexibility in debugging and testing.

When we found issues that we suspected originated in Solr, running our own build from source made it easier to attach the debugger. Solr is written in Java, and possessing the exact source code that produced the JAR file allowed us to use the Java debugger remotely via our IDEs to test our assumptions about where performance issues were occurring. This further improved our understanding of the system and allowed us to submit a number of patches upstream that improved performance specifically for our use case.
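
For readers who have not done remote debugging on the JVM before, here is a small sketch of what is involved. A JVM accepts IDE debugger connections when it is started with the standard JDWP agent flag shown in the comment. The port 5005 and the class name are assumptions for illustration, and the check itself is just a convenience for confirming that a running process was started that way.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

/**
 * Sketch: check whether this JVM was started with the JDWP debug agent.
 * A typical flag (the port is an arbitrary choice) looks like:
 *   -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005
 */
final class DebugAgentCheck {
    /** True if any JVM argument enables JDWP (current or legacy syntax). */
    static boolean hasJdwpAgent(List<String> jvmArgs) {
        return jvmArgs.stream().anyMatch(arg ->
                arg.startsWith("-agentlib:jdwp") || arg.startsWith("-Xrunjdwp"));
    }

    public static void main(String[] args) {
        // The arguments the current JVM was launched with.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("JDWP agent enabled: " + hasJdwpAgent(jvmArgs));
    }
}
```

With the agent enabled and the matching source checked out, an IDE can attach to the process and set breakpoints in code that lines up with the running JAR.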

We also noticed that a number of open source storage projects have hard-coded, but useful, variables that are not configurable via the command line. Building from source allows you to make them configurable. Notable examples for us include the Solr recovery thread pool size, for which we now use a different setting in production than the default.
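
To illustrate the kind of change involved, here is a minimal sketch of turning a hard-coded constant into a startup setting. It is not Solr’s actual code: the property name, the class name and the default of 4 are all hypothetical placeholders.

```java
import java.util.Properties;

/**
 * Sketch: replacing a hard-coded constant with a value that can be
 * overridden at startup. All names and values here are hypothetical.
 */
final class RecoveryPoolConfig {
    // Stands in for the value that was previously hard-coded in the source.
    static final int DEFAULT_RECOVERY_THREADS = 4;

    // Hypothetical property, e.g. passed as -Dexample.recovery.threads=16.
    static final String PROPERTY = "example.recovery.threads";

    /** Returns the configured pool size, falling back to the old constant. */
    static int recoveryThreads(Properties props) {
        String raw = props.getProperty(PROPERTY);
        if (raw == null || raw.isBlank()) {
            return DEFAULT_RECOVERY_THREADS;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            // Ignore non-positive sizes rather than start with a broken pool.
            return value > 0 ? value : DEFAULT_RECOVERY_THREADS;
        } catch (NumberFormatException e) {
            // Unparseable input also falls back to the old behavior.
            return DEFAULT_RECOVERY_THREADS;
        }
    }

    public static void main(String[] args) {
        System.out.println("recovery threads = " + recoveryThreads(System.getProperties()));
    }
}
```

Falling back to the original constant keeps the patched build behaving exactly like upstream unless the setting is supplied, which makes the change easier to reason about and to offer back to the project.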

So what should you build from source?

The pragmatic reader will have realized, clearly, that you don’t want to build all of your dependencies from source. As mentioned previously, at Brandwatch we build two storage technologies from source: Solr and HBase. Our decision to do this was guided by the following principles:

They are our primary storage systems. Our core competencies are in data enrichment, storage and analysis. These systems enable a lot of that functionality.
We run them at a large scale compared to what is documented. We’re nowhere near the size of deployments at Facebook or Apple, but we do run at a scale for which documented evidence is scarce. We absolutely need to know how these technologies work.
They are written in technologies where we have in-house expertise. Both Solr and HBase are written in Java, and most of our backend developers use Java as their primary language. We have lots of people who can understand the code and make changes to it if needed.
Bugs and breaking changes affect our bottom line. If there is a big issue in either Solr or HBase then our customers are going to be affected. We need to be able to take control of the situation if this were ever to occur. We’ve noticed that HBase doesn’t change much, but Solr moves at a fast pace and the risk of regressions is high.

Being able to contribute upstream is also fantastic for the motivation and engagement of your engineers: what better than to be paid to contribute to the greater good? It’s nice to give back to the community, and it helps attract new engineers who want to work on these technologies at scale.

In summary

Building your mission-critical systems from source allows you to be in control of your own destiny. We’ve been able to fix issues and improve the speed of Solr to make it an even better fit for our use case. Notably, we recently improved the speed of updates to existing documents and also added the ability to facet on functions.

That massive email

Goodbye, cruel world. Photo by MARVIN TOLENTINO on Unsplash.

My laptop is going out of the window

That email.

It’s been watching you all day. Lurking.

You’ve skirted around it, and you’ve turned your attention to other things: to Slack conversations, to pull requests, and even to writing that API documentation that’s been dangling at the bottom of your to-do list for weeks.

But it’s still there.

And now you have nothing else to distract you.

You open it. A wall of text appears on your screen. It’s even got proper formatting and numbered lists. What on Earth is this gigantic essay all about?

You feel your energy escape from your soul via your eyes, sucked into the event horizon of wordy dialogue that must have taken the author hours to write.

You sigh and begin reading through it.

Time passes.

By the time you get to the end, you’ve lost your train of thought and have forgotten the points that you wanted to write in your reply. You scroll right up to the beginning of the email and start reading it again.

Like before, you reach the bottom of the text, but – surprise, surprise – it’s taken so long to read that you’ve forgotten what you wanted to say.

You sigh again, but for longer this time.

Frustrated, you try a different technique.

You hit the “Reply” button so that you can write your response in sequence as you read through it. The in-line reply window opens, and now you can’t fit your response and the original email on the screen at the same time: no matter which way you resize it, all of the text and boxes dance around like a marionette.

Your brow furrows and you scratch the back of your head.

Feeling inspired, you open a new window, side by side with the original window, so you can fit the email on your screen and compose your reply at the same time.

So far so good.

You slowly read through the wall of text, making bullet point notes on it as you go. There are about six main talking points that you’ve extracted, and you expend some effort in polishing them up to form a solid narrative. You read, then re-read what you’ve written.

It looks reasonable. You sound intelligent. That makes a change.

You move your mouse cursor towards the “Send” button. As you click, you notice a message popping up at the bottom of the screen.

2 new message(s) in thread. Click here to show.

Ah, damn it.

You click to show the messages. Two more walls of text. You read them both.

The first reply has said everything you’d already written. Why did you bother? At least your opinion has been validated.

The second reply is quite confused. You don’t think they’ve read the original email properly, and they’ve tailed off into the realms of the bizarre, almost as if they’ve translated the text into Spanish, then into French, then into Klingon, then back into English.

What are they on about?

A blip sound.

1 new message(s) in thread. Click here to show.

You click.

It’s the original author’s out of office email.

A blip sound.

1 new message(s) in thread. Click here to show.

Oh, wonderful: your colleague who was very slow to write their initial reply has just chimed in on some points in the first email, taking things in a completely different direction.

You begin scanning the content to understand why they didn’t quite get the point.

1 new message(s) in thread. Click here to show.

Click.

Out of office again. Sigh.

You rest your head in the palm of your hand.

This conversation makes no sense any more, and you’ve been sitting here reading, writing and peeling apart these replies for 20 minutes.

You consider what your employer’s insurance is like on their equipment, and whether it covers your laptop “accidentally” launching itself out of the window in a bid for freedom.

Smash.

There’s a time and place for email

Let me begin by stating that email is brilliant. I love email.

It’s archival, the threading system works well, and since Ray Tomlinson sent the first ever message in 1971 (to himself, as a test – allegedly it may have read “QWERTYUIOP”), I would argue that email has done a fantastic job of bringing the world even closer together. The asynchronicity bridges timezones. Whole businesses are run via email communication.

Yet, there are times that email isn’t as good as other forms of communication. But let’s stay positive and look at what it’s good at first:

Archival notices. Since email hangs around forever, and since it is easily searchable, email is perfect for making timestamped announcements that everyone will see and refer back to.
Newsletters. In my experience, whatever may feel like oversharing is rarely received that way. Sending out regular newsletters to your team, department or company is an excellent use of email to inform and increase your visibility.
Conversations with a narrow focus. Emails that cover one concise topic can allow people globally to contribute, assuming that the purpose of the message is to solicit opinion.
Follow ups to ratify decisions. After having a meeting or a decision point, a follow up email is a perfect way of putting in writing what has just happened so everyone is aligned.

So that’s the good stuff. But what’s email bad for?

Conversations with many active authors. The little story above is obviously an exaggeration, but “hot” email threads with lots of active participants begin to feel like a series of sliding doors. Everything gets confusing, communication is poor, effort is wasted, and nobody gets anything done. Consider a flurry of email thread activity as a signal to jump in a Slack channel, or do a video call, or start a shared document.
Anything requiring a quick response. Email isn’t like a DM, and comes with no guarantee of timeliness of response. People have very different approaches to their email. Some batch, some practice inbox zero, some simply get so much that they forget to reply. If you want a quick response, send a DM or make a call.
Topics that have many layers of context. Extremely complex subjects with many sub-contexts become extremely difficult to reply to. The email format doesn’t support levels of nesting without being fairly creative. Maybe another medium is better. If you feel like you need a deep breath after reading something complex, suggest another medium to discuss it further.

If you’re going to write an email for the reasons above, don’t. Please save others the pain!

Help your reader out

As well as using email for the right purpose, there are some ways that you can be mindful of the fact that when your recipients open your message, they are giving up their time to you.

For non-trivial email content, you can specify the actions that you want the readers to take, even if those actions are just to read it and do nothing else. The recipient, upon digesting the first couple of lines, can rest assured that even if a big block of text is coming up, they need only understand it, and not compose a short essay in response. They’ll thank you for it.

Additionally, for longer emails you can provide a short summary at the top to ease the reader in gently, or let them make the decision as to whether they want to read the whole thing or not. You may even find that as you write the summary, you can delete large parts of the text that follows as they’re not needed after all.

And if, after all, it isn’t something that is best suited to being communicated via email, you could maybe try some alternatives:

Having a discussion in a private meeting room
Walking over to someone’s desk to ask them a question
Creating a Slack channel or group DM
Writing your thoughts in a shared document and soliciting comments
Going for a walk around the block to chat about it
Chatting about it over lunch
Just getting on with something using your best judgement

There are plenty of alternatives to firing up Gmail.

So, in summary

Be a good email citizen. Otherwise this laptop gets it.
