By: Andrew Carlisle

Recently, my team was tasked with making a switch from a combined MySQL and Cassandra infrastructure to one in which all of this data is stored entirely on a PostgreSQL server. This change was partially due to an increased drive to provide necessary and crucial flexibility to our customers, in tandem with the fact that Cassandra was simply not necessary for this particular application, even with the high quantities of data we were receiving. On its face, the mere need for such a change almost looks backwards given how much movement within the tech industry has been made away from SQL databases and towards NoSQL databases. But, in fact, NoSQL — or even hybrid systems — are not always best.

In certain applications, one might find that performance gains, hoped to be reaped from NoSQL’s optimizations, may not translate perfectly to production without some forethought. I would personally argue that SQL databases often are preferable (over something like Cassandra) in non-trivial applications, most of all when JOIN operations are required.. Generally speaking, NoSQL databases — certainly Cassandra, among others — do not support JOIN. I will add to this that the vast majority of ORMs (for those who may not be familiar with the term, these are effectively systems of abstracting database relations into typically “object-oriented” style objects within one’s backend code) are built around SQL. Thus, the flexibility and readability that is afforded by these ORMs — at least when operating a database of non-trivial objects —can be a lifesaver for development time, database management, integrity, and readability. Indeed, I would even argue that, for most web applications, it often outweighs the sometimes marginal or even relatively negligible performance increases that a NoSQL database may provide (of course, this is completely dependent on the nature and scale of the data, but that is perhaps a topic for another time).

However, none of this matters if the engineer is not paying close attention to their cloud infrastructure and the way that they are actually using their queries in production. In evaluating one engineer’s project, I found they were doing all of their insertion operations individually rather than attempting to batch or bulk insert them (when this was well within the scope of this particular application). It appeared they had been developing with a local setup and then deploying their project to the cloud where their database was running on a separate machine from their server. The end result in this case was rather comical, as once insertions were batched, even in Postgres, they were orders of magnitude faster than the piecemeal NoSQL insertions. They had not considered the simple fact of latency.

How did this original engineer miss this? I do not know, as this particular piece of software was inherited with little background knowledge. But, given that they were testing locally, I can assume that they elected for individual insertions. Making queries in this way can sometimes be less tricky than bulk insertions (which often have all sorts of constraints around them, and require a bit more forethought, especially when it comes to Cassandra). We found the performance was beyond satisfactory. What they did not consider, however, is that the latency between the backend server and a Cassandra (or SQL) server hosted in any sort of distributed system (ie. production). This meant that it didn’t really matter how fast these queries were; the latency between the backend and the database was so much greater than the query runtime, that, in fact, it really didn’t even remotely matter which database was used. So it followed that the real-world performance was actually significantly improved by simply batching insertions in Postgres (though of course, batching is supported in Cassandra — but the change was necessary nonetheless).

In any case, the moral of the story here, in my opinion, is that understanding your own cloud infrastructure is crucial to writing actual performant programs in the real world. As well as the fact that, just because one database may be purported to perform better than another given certain circumstances, without a solid understanding of the environment in which this application is going to be deployed in, one cannot hope to see any appreciable performance gain.

Most Recent News & Articles

100th Episode Of The Dan Smolen Podcast

Prateek Joshi, Founder and CEO of Plutoshift, discusses how A.I. makes the world a better place on the 100th episode of The Dan Smolen Podcast. The Dan Smolen Podcast...
Data quality dimensions for machine learning

8 Dimensions of Data Quality

By Prateek Joshi Large companies have enormous physical infrastructure. This infrastructure is well instrumented and data is collected continuously. The Plutoshift platform uses this data to help them monitor...

Machine Learning In 20 Words Or Less

By John Sizemore, Vice President – Data Intelligence & Analytics I’m often told that Machine Learning sounds complicated – but it doesn’t have to be. If I was asked to...

Executive guide to assessing your data readiness in 5 steps

By Prateek Joshi Within large companies, data is stored across many systems. If we specifically look at companies with large operations infrastructure, there are many different types of data...