If you’re a CEO or other business type, when you think “scaling” you think “I’m excited to increase our userbase and revenue by orders of magnitude!”
If you’re a backend engineer, when you think “scaling” you think “oh shit. What’s going to happen to my platform when it grows by orders of magnitude?”
For me as a CTO, scaling is the number 1 thing that keeps me up at night. When it was announced that my previous company would be featured on Shark Tank, everyone celebrated. I panicked. Millions of new users visiting a platform that had already exhibited signs of weakness? All visiting at the same time? We had a company-defining opportunity and we only had one shot to get it right. There would be no second chances.
The main reason scaling is so stressful is that it is not linear: your system could be functioning perfectly one minute, then a small increase in usage tips it over the edge and you have an outage. The outage can be very difficult to overcome while the usage remains high, and telling the CEO “we need to decrease usage” probably won’t go over very well. Further, scaling vulnerabilities are much more difficult to QA than most bugs. You either need a complete replica of your production environment, which is expensive, or you need to conduct your performance testing in production, which can negatively impact real users. Testing on a scaled-down version of your production environment (like your staging environment) is next to useless.
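To put rough numbers on that tipping point, here is a minimal sketch using the textbook single-server queueing model (M/M/1). The model and the 100 requests-per-second capacity are assumptions chosen for illustration, not measurements from any real platform, but the shape of the curve is the point: response time barely moves until you approach capacity, then it explodes.

```python
def mean_response_time(arrival_rate: float, service_rate: float) -> float:
    """Mean time a request spends in an M/M/1 queue (waiting plus service), in seconds."""
    if arrival_rate >= service_rate:
        # Demand meets or exceeds capacity: the backlog grows without bound.
        return float("inf")
    return 1.0 / (service_rate - arrival_rate)


SERVICE_RATE = 100.0  # hypothetical: requests per second one server can complete

for load in (50, 80, 90, 95, 99, 100):
    seconds = mean_response_time(load, SERVICE_RATE)
    print(f"{load:>3} req/s -> {seconds * 1000:.0f} ms")
```

In this simplified model, going from 50 to 90 requests per second raises the average response time from 20 ms to 100 ms. Going from 90 to 99 raises it to a full second, and at 100 the queue never drains.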
A central problem at startups is that decisions made very early on – when you’re just in the prototype phase and perhaps working with low-cost outsourced developers – can have a tremendous impact on scaling down the line, and those early decisions can be very difficult to reverse. If you choose the wrong database or hosting platform, it can take man-years of effort to move to a different one later.
There is a common misperception that on-demand hosting platforms like AWS have solved scaling, so that engineers can just “throw hardware at the problem” (e.g., buy twice as much on-demand hardware when usage doubles) and don’t have to worry about scaling anymore. While AWS certainly makes scaling easier, it’s only part of the solution. More full-service platforms like Heroku, Firebase, and AWS Lambda take even more of the work and stress out of scaling, but even those are not complete solutions. They also tend to be more expensive at scale.
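A simplified sketch shows why extra hardware only goes so far. It assumes a system with two tiers, a pool of application servers and one shared database, and every number in it is hypothetical. The slowest tier sets the ceiling, so adding servers stops helping once something else becomes the bottleneck.

```python
def max_throughput(app_servers: int, per_server_rps: float, database_rps: float) -> float:
    """Requests per second the whole system can sustain: the slowest tier sets the limit."""
    return min(app_servers * per_server_rps, database_rps)


PER_SERVER_RPS = 200.0  # hypothetical capacity of one application server
DATABASE_RPS = 1000.0   # hypothetical capacity of the single shared database

for servers in (2, 4, 8, 16):
    limit = max_throughput(servers, PER_SERVER_RPS, DATABASE_RPS)
    print(f"{servers:>2} servers -> {limit:.0f} req/s")
```

With these made-up numbers, doubling from 2 to 4 servers doubles capacity, but doubling from 8 to 16 buys nothing: the database is already saturated at 1000 requests per second.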
Your company’s database(s) are likely what your engineers think about most when it comes to scaling. Some databases are inherently more scalable than others, and it is a huge amount of work to switch to another one. In general, SQL databases like MySQL, PostgreSQL, and Oracle are more likely to have scalability limitations than NoSQL databases like MongoDB and Cassandra (NoSQL basically just means any database that is not SQL). But SQL databases are very powerful and have been the state of the art for decades, so there is a good chance your company is using one. Much like the full-service hosting platforms, there are fully managed database providers like Firebase, mLab, and Amazon RDS, but again they are not complete solutions to all of your team’s scaling concerns.
Scaling-related outages tend to occur at the worst possible time: after a big launch, in the middle of a big marketing push, or when you’ve finally hit that inflection point in your hockey-stick growth curve. And by the time you’ve hit it, it’s too late to do anything about it. Your engineers will work day and night to resolve it, but often there is no quick fix. If there is one, it may be in the form of purchasing unnecessarily expensive hardware as a workaround for an unscalable software architecture.
When your lead backend engineer says that they need to reserve some time to plan for growth, listen to them. It may delay that feature that you really want to get done ASAP, but it’s better than having a business-killing outage when that feature begins to take off.