NoSQL is a popular buzzword these days. Let’s demystify it and discuss why your engineer might be talking about using a NoSQL database.
WTF is SQL?
As you may have guessed, the inspiration for the grammatically questionable name “NoSQL” is SQL. SQL was the state of the art class of databases from the 1980’s until NoSQL recently challenged its dominance. SQL, which stands for Structured Query Language and may be pronounced either as “sequel” or by spelling out the letters S-Q-L, is the foundation for most of the popular databases that everyone was using during that time: PostgreSQL, MySQL, Oracle, SQLite, MS SQL, and many others. SQL databases are also referred to as relational databases.
There are many reasons SQL became popular. First, it provides a powerful query language that allows programmers as well as business analysts to answer very complicated questions. Second, that language was standardized, so that once you learned the language for one, it was be easy to learn the others. Third, because they have been under development for decades – an eternity in the tech world – many of the SQL databases on the market are very mature and powerful. Databases are complex, and newcomers have a high bar to clear in order to be competitive.
SQL sounds great! So why is NoSQL a thing?
SQL is indeed very powerful, and it is perfect for many applications. But it has several characteristics that make it less well suited for many use cases.
1. It has scaling limitations
The most important reason that people started looking beyond SQL is that part of what makes SQL so powerful also makes it hard to scale. This is a complicated subject (see my post Why do engineers worry about scaling?), but there are some fundamental reasons that SQL databases have trouble achieving massive scale.
Don’t get me wrong, I’ve seen Oracle and Postgres databases handle thousands of transactions per second without a hitch. That’s a lot, but it’s small potatoes for the Facebooks and Googles of the world. If your startup achieves the coveted hockey stick growth you’re probably hoping for, you might run into SQL’s scaling limitations earlier than you might think. And switching from SQL to NoSQL (or between any databases, for that matter) is very hard, likely taking many man-months or even man-years to migrate. And you’ll have to deal with it right when your company is taking off and there are a million other things you want your engineers to be working on.
A close cousin of scalability is an important concept called availability. In an overly-condensed nutshell, availability refers to the ability of a system to keep functioning even if some parts fail. All computers, even powerful supercomputers, fail sometimes; that’s why you occasionally need to reboot your laptop. If a system doesn’t have good availability, then a restart of one computer means your website is down until the computer is back online. That’s not acceptable for many business-critical applications. To achieve excellent availability, databases need redundancy, such that if one computer fails, another easily and automatically picks up the slack without missing a beat. Because of some of the powerful features inherent in SQL, it is a challenge to achieve very high availability with SQL databases
2. Data representation
Before putting data into a SQL database, your engineer will define a schema, which is a list of attributes (or columns) of a specific type of data. For example, you might have an Employee table which contains columns representing the employee’s name, phone number, salary, and email address. You might have another table in the same database which stores projects, with a project name, a deadline, and a status indicator. Representing the relationship between tables (eg. the team lead of Project ABC is Employee 123, and Employee 345 is one of several team members of Project XYZ) is an important part of these databases – it’s where the “relational” part comes from.
One of the side effects of this approach is that all data must have the same attributes. Sure, you can leave some attributes blank (we call these “null”), but if you can’t have extra data associated with a subset of users. For example, perhaps you have some hourly employees whose data includes their work schedule, but salaried employees employees lack these attributes, and perhaps have other attributes specific to them. There are several approaches to dealing with this common problem – including adding columns for all permutations of the fields that any Employee might have – but none are particularly natural.
Finally, tables with a lists of attributes are not the most friendly format for code that needs to interact with the data. Many codebases include an extensive mapping layer (sometimes referred to as object-relational mapping, or ORM) that converts the required SQL representation of data into a format more convenient for the rest of the codebase.
3. The schema is rigid.
Data changes a lot in the lifetime of an application – tables get added, columns are added, formats of the data change, etc. Some of these operations are difficult for SQL databases, possibly even causing an outage while the change is made.
Further, schema changes complicate deployments. If an engineer writes code that includes a new column, the new column has to be added before the code is deployed. If a column is removed, it has to happen in the other order. Every backend engineer has a horror story about a mistake caused by this sequence of events, and it adds stress to the already stressful procedure of deploying new code to production.
SQL sounds terrible! Tell me about NoSQL
Gladly. NoSQL is simply any database that does not use SQL. There are various different solutions that really have nothing to do with each other, but since SQL used to be so dominant, people grouped these oddballs together under one umbrella. The goal of any NoSQL database is to overcome one or more of the challenges listed above, since, aside from those, SQL is pretty great. But for many applications, one or more of those challenges is a deal breaker.
The first thing that people using a NoSQL database will notice is the data representation (#2 above). Since NoSQL is such a wide umbrella, the data representation varies widely. I’ll explore four main types of data representations below.
This is the simplest: there is no structure at all. The code will generate a key and some data to associate with it. Using the Employee example from above, the application code might generate a key called “employee-123-phonenumber” and set it to the value “(987) 123-4567” The application code can then retrieve, update, or delete this data at any time by providing the key “employee-123-phonenumber” to the database. Some key/value store databases encourage bundling more information in each key, such as a key of “employee-123” and a value of “phone: (987) 123-4567, salary: 100,000, email: email@example.com” (probably not this format, but you get the idea), rather than multiple keys for each bit of information.
Examples of key/value store databases include Dynamo and Redis.
A document store is arguably a special case of key/value store databases, but it is an important one. The difference is that the value is structured, meaning that you can run queries inside the data value. In the “employee-123” example above, the application has to fetch that whole data blob for that employee even if it only needs the phone number. In a document store, the database understands what the data value represents, so you can retrieve just what you need – important if you are storing lots and lots of data in each key. Further, you can query based on the value data. For example, you can say “give me all Empoyees with an area code starting with 987.” You simply can’t do that in traditional key/value stores.
The most popular document store is MongoDB.
This is the most similar to SQL. You set up tables and columns just like SQL. However, the query language is very different. An important difference is that there is no concept of a “join,” which is the main reason SQL is difficult to scale. While the other NoSQL options I’ve listed arguably attempt to tackle all three of the SQL challenges, column stores really only tackle the first one (scalability). The data representation is similar to SQL, and, if anything, it is even more rigid than SQL.
The most popular column store is Cassandra.
Quite different from SQL and all the other NoSQL databases, graph databases structure the data as graphs. Facebook friend data is the classic example, and the needs of many companies to import portions of the Facebook graph helped graph databases come to prominence. Graph databases consist of nodes and edges, both of which may contain important information. For example, two nodes may represent people, and all of their profile details (email address, photo, favorite color, etc.) are stored with that node. An edge between them could represent that the two are friends, and could store data like the duration of their friendship.
Popular graph databases include Neo4j and OrientDB.
You now understand the four most common categories of NoSQL data formats. Let’s talk a little bit more about WTF NoSQL database creators were thinking when they started taking on well-entrenched SQL.
In the olden days (the 90’s), engineers improved database performance (and system performance in general) by upgrading to more powerful hardware. The computing industry was focused on building more and more powerful machines, and critical databases ran on the latest expensive supercomputers.
There are several downsides to this approach:
- It’s expensive. The price of powerful supercomputers rises exponentially.
- You are limited by the latest hardware. So you’re a bazillion dollar company and you can afford the latest shiny supercomputer. But what if that’s not enough? There’s nothing you can do besides wait for the next hardware breakthrough.
- If one machine fails, your entire system is down. As discussed when introducing the concept of availability above, hardware fails all the time, even expensive supercomputers. When that happens, your system is down. This includes upgrades: while you switch your database over to the latest supercomputer, your system is down.
- You have to over-provision. Because you can’t upgrade to the latest hardware every day, you need to buy a computer that is as powerful as what you need in the future, not what you need today. If you’re a fast-growing company, you may be paying for more than you need right now in anticipation of what you think you need in the future.
Google popularized the notion that many less-powerful computers working together is better than one very powerful computer. Rather than have one supercomputer the size of Texas powering Google’s search engine, they started using much cheaper hardware – but a LOT of it. They designed a new database from scratch that works well under this setup. One requirement of this kind of setup is that any piece of data lives on multiple computers, so that if one (or two, or three…) of the computers fail, you never lose the data. This is a distributed database.
For example, your distributed database may have 100 computers (referred to as nodes), and the key “employee-123” may be saved on three of them. If one of those three nodes goes down, you still have two copies. Some databases will detect that one of the three is down and automatically copy the data from one of the remaining two to another node to ensure that the system quickly gets back to its target of three copies.
SQL was not designed to be distributed, mainly because joins would be unacceptably slow if the data didn’t all reside on the same machine. Many (but not all) NoSQL databases were designed with distribution as one of the most important considerations, and it is one of the top reasons to use a NoSQL database.
A goal of distributed databases is that they scale linearly. While you can’t get around paying more as you store more data, it’s reasonable to expect to pay about twice as much for twice as much data. A linearly scalable database does this: if you have 100 nodes, adding 100 identical nodes will give you about twice the data and query performance. Compare that to using a single supercomputer, where upgrading to the latest supercomputer could give you a 20% performance increase for ten times the cost.
A note about consistency
Like the word “availability,” you probably thought you knew what the word “consistency” meant, but it has a very specific technical definition that is important to engineers. In short, when working with a database, you would generally expect that when you insert data and read it back, the retrieved data is always the same as what you put it in. That means the data is consistent. Sounds logical, right?
While nobody would argue that consistency is a bad thing, some would argue that it’s not as important as it sounds. Consistency is an absolute bedrock of SQL databases, as well as some NoSQL databases; developers can 100% depend on this behavior when working with consistent databases. But as database creators started working on NoSQL databases, they used the opportunity to question long-standing assumptions like this one. And some concluded that consistency comes with a high cost that is not always justified. If database creators are not required to adhere to consistency guarantees, it frees them up to make the database much better in other respects, namely availability and scalability.
Many NoSQL databases adhere to “eventual consistency,” which means “if you put data into the database, then at some point later reading that data back will retrieve the same data you put in, but we can’t guarantee that will be the case right away.” In practice, this usually means that the vast majority of the time it is consistent, but occasionally it is not. That “occasionally” is important, as any code that interacts with such data has to handle this edge case. It’s a mindset change for engineers who have long been accustomed to consistent databases.
Okay, I get the difference between SQL and NoSQL! Which should I use?
Even within these two categories there are a lot of choices, and it’s an extremely important decision that is hard to change later. So which should you use? It’s all about tradeoffs – no database is perfect for every situation. Generally, these are the most important things to consider:
- Scalability. Most popular databases can handle very large amounts of data and support thousands or millions of users of your product. But SQL databases will hit a wall at some point down the road without an easy path forward. If you’re planning to be the next Facebook, it will save your future engineering team a lot of pain to use a more scalable database. Even if they end up migrating to another database for reasons you can’t anticipate, starting off with an unscalable database forces their hand and could rush an important future decision.
- Availability & Distributability. As discussed above, availability is distinct from (and sometimes in conflict with) scalability. If a 5 second outage would really hurt your business, then availability is important.
- Maturity. It is important that your database works. It needs to be stable, fast, and bug-free. You don’t want to risk losing or corrupting your data; even if you schedule regular backups, restoring backups can take hours or days, which could kill your business. One of the biggest mistakes I made at my last company was choosing a database that was only a few years old and therefore hadn’t been through the wringer. We spent a lot of time fixing problems with the database, often requiring us to contact the database creators, when we really just wanted to spend our time building our product.
- Speed. Pretty much any database measures time in milliseconds. That may sound fast, but the difference between 1ms and 10ms could matter to a lot of applications. Some databases are inherently faster than others, and some optimize the speed of certain operations (like reading existing data) at the expense of other operations (like writing new data). Which is more important depends on your application.
- Consistency. Consistency is great, but might not be worth the cost. Engineers who haven’t worked with eventually consistent databases might be scared of them and over-emphasize the importance of consistency, but such databases are slowly gaining acceptance as people realize what they are giving up by insisting on consistency.
- Data representation. I put this last because in my opinion it is overrated. People often look for a database whose data format “feels like” the data they’ll be working with, such as a graph database for social networking applications. In reality, pretty much any database could be made to work with pretty much any application, although some would be clunky integrations that perform poorly. The main reason I think it’s overrated is that engineers have been taught for decades that “relational is right” which biases experienced engineers toward SQL. (Sorry. I made it almost to the end without expressing my own opinion.)
All too often, startups who outsource their initial development let the dev shop choose the database (often MySQL), and their future full-time engineering team needs to spend months or years moving to another one. If you are in this situation, I strongly advise you to seek the advice of a technical advisor instead of relying on a dev shop to make the decision, as they will want to use whichever database their engineers are most comfortable using, which might not be right for your business.