Previously I wrote about the inflection point that a particular part of your architecture will reach before you need to roll your own specialized piece of infrastructure. The summary of that article was that you probably won’t end up doing that unless you reach product market fit and then have a real success that drives scale.
What this means for most of us is that we’ll be reusing existing software for the bulk of our application infrastructure. This is absolutely fine. There is a wealth of fantastic open source projects out there. Observed from my origins as a backend engineer, I know that there’s pretty much always going to be a great storage system that does what I want it to do.
However, as good as these projects are, occasionally there are bugs. Sometimes there are big bugs. If they occur, not only are all of your customers dealing with unexpected downtime, but you might have absolutely no idea what the problem is, or how to fix it.
Let’s explore some advantages of building open source projects that comprise your core infrastructure from source. But first, let’s consider our mindset when looking at the downloads page of a popular open source database.
Which should I download?
Earlier on in my career, when navigating to a project’s website in order to download a database to use, I would often see two options and have the following thoughts:
- Downloading the binary: This option is for people who want to use the database; developers like me who just want to run it and store some data.
- Downloading the source: This option is for people that want to have a dig around and see how it works, or for those that want to contribute to the project themselves.
For a long while, this is what we used to do in production as well: download the binary and then deploy and run it. If we needed to upgrade it, we’d download the latest binary and replace the older version. But, with time, we realized the decision to download the source was more nuanced.
This change came with scale.
As time passed, and as the company grew and our data storage needs did also, we began to elevate the demands we placed on our storage technologies beyond the levels that you can easily find help and documentation for. We’d see errors or odd behavior that Google couldn’t help with, and those same issues also perplexed contributors on the project mailing lists.
This is where we started to worry.
At large data volumes you begin to discover that not all new features added to open source projects have been thoroughly tested at the same scale you are running at – and who would expect them to be? After all, this is free software that you just so happen to be running a successful SaaS business with. We should all be grateful that we get such a head start.
But, regardless of that head start, you begin to get locked in. Any dramatic increase in scale results in an increase in your dependence on these systems: your decision to use it becomes more complex to revert when it is supporting your customers around the clock.
Keeping up to date
Due to your increased dependence on a given system, you’ll want to more closely track the progress and roadmap of the project so that you can continue to upgrade and reap the benefits of new features and optimizations. You’ll also want to make sure that you swiftly apply new security patches.
However, even the simple act of upgrading to the latest version opens the door to more risk on your production environment:
- Breaking changes. Some open source projects change extremely quickly and can move through non-backwards compatible version with regularity. If these are properly communicated, you’ll need to do the work to support them. This is especially hard with storage systems. And, even worse, sometimes there are breaking changes that aren’t so well communicated, and you only notice when your own application breaks!
- Weird regressions. These are typically a pain to track down. You may notice that your application is getting slower or buggier with time, only for the problem to ultimately lie within code that you haven’t written yourself. The code of open source systems isn’t necessarily the first place you’d look, either. Surely someone else has thoroughly tested it, right?
- Unknown maturity of a subset of features. You may eagerly upgrade to the latest version of a database because new functionality has been released that you are eager to use, only for you to find out that it hasn’t been tested extensively at scale. We struggled with the maturity of Solr being able to store indexes in HDFS to the point that we gave up and returned to local storage on SSDs. This took many weeks of testing to prove otherwise.
If you heavily depend on a technology in production to the point that it is business critical, there is a case for building projects from source and running your own builds in production, in order to be in control of your own destiny.
Reasons to build from source
Over the last few years, we have been migrating one of our larger data stores into Solr. We store hundreds of terabytes of data that users interact with via searches and facets (a word used to mean aggregations in Solr terminology). Data is continually ingested, updated and deleted: it is mutable.
While we were scaling our deployment, we hit numerous pain points that we had to overcome. In fact, some of these issues were deal breakers for that technology being able to work. We were able overcome these hurdles much more easily by building our Solr deployment from source.
Familiarization with the build process
To begin with, building a project from source expands your knowledge of how it is put together. Large projects don’t always have trivial build systems and they can take some time for you to get your head around. You’ll learn about the dependencies and required configuration. You’ll learn about what gets packaged up and bundled ready for running on your production servers, and what really happens when the system is started and stopped.
This process might help you learn techniques that you can use in your own projects going forward, even if that is how not to do things! Also, if you happen to need to debug or patch the system yourself in the future, you will save a lot of panic down the road if you have already experienced how the build process works.
You can also decide how you roll out updates of the system when they are released upstream. When you create a new build, how long should you allow it to run on your testing or staging environment before allowing it to go live? How can you maximize your chances of experiencing any unexpected behavior or broken functionality before it hits production? Should you write your own integration tests to prove that the new build still works with your own application in the way you expect?
Speed of patches to production
If you’re depending on a technology in production at scale, it’s likely you’ll experience a few bugs. Building from source allows you to isolate them and fix them fast. Rather than an upstream bug causing panic while you wait for the project to push out an official patch, if your system is broken then you can patch the source yourself, build it, and release it locally for use immediately whilst waiting for the rollout of official patches upstream. If you fix it first, then you’ve just contributed to the project.
The same is true for quickly applying the patches of others. While waiting for the maintainers to approve and release a submitted patch that affects your system, you can apply that patch to your own version and see if it solves your issue. Later, when everything is fixed officially upstream, you can build the latest upstream version and get back on track.
Not only does building from source allow you this flexibility in patching, but since it has already forced you to go through the process of building and releasing yourself, the stressful times of production system bugs should be isolated to just diagnosing and fixing bugs, rather than additionally needing to work out how to build the project and release it.
Debugging, configuring and testing
When investigating issues that you may suspect have come from an open source project, building from source allows you greater flexibility in debugging and testing.
When finding issues we suspected to originate from Solr, running our own build from source made it easier to attach the debugger. Solr is written in Java, and possessing the exact source code that produced the JAR file allowed us to remotely use the Java debugger via our IDEs to test our assumptions about where performance issues were occurring. This further improved our understanding of the system and allowed us to submit a number of patches upstream that improved performance specifically for our use case.
We also noticed that a number of open source storage projects have hard-coded – but useful – variables that are not configurable via the command line. Building from source allows you to make them configurable. Notable examples for us include the Solr recovery thread pool size, of which we now use different settings in production than the default.
So what should you build from source?
The pragmatic reader will have realized. clearly, that you don’t want to build all of your dependencies from source. As mentioned previously, at Brandwatch we build two storage technologies from source: Solr and HBase. Our decision to do this was guided by the following principles:
- They are our primary storage systems. Our core competencies are in data enrichment, storage and analysis. These systems enable a lot of that functionality.
- We run them at a large scale compared to what is documented. We’re nowhere near the size of deployments at Facebook or Apple, but we do run at a scale where documented evidence of this scale is scarce. We absolutely need to know how these technologies work.
- They are written in technologies where we have in-house expertise. Both Solr and HBase are written in Java, and most of our backend developers use Java as their primary language. We have lots of people who can understand the code and make changes to it if needed.
- Bugs and breaking changes affect our bottom line. If there is a big issue in either Solr or HBase then our customers are going to be affected. We need to be able to take control of the situation if this was ever to occur. We’ve noticed that HBase doesn’t change much, but Solr moves at a fast pace and the risk of regressions is high.
Being able to contribute upstream is also fantastic for the motivation and engagement of your engineers: what better than to be paid to contribute to the greater good? It’s nice to give back to the community, and it helps attract new engineers who want to work on these technologies at scale.
Building your mission critical systems from source allows you to be in control of your own destiny. We’ve been able to fix issues and improve the speed of Solr to make it even more of a great fit for our use case. Notably we recently improved the speed of updates to existing documents and also added the ability to facet on functions.