Andrew Carlisle
Manager, Backend Engineering

Recently, my team was tasked with migrating from a combined MySQL and Cassandra infrastructure to one in which all of our data is stored on a single PostgreSQL server. The change was driven partly by a push to give our customers more flexibility, and partly by the fact that Cassandra simply was not necessary for this application, even at the data volumes we were receiving. On its face, the mere need for such a change looks backwards, given how much of the tech industry has moved away from SQL databases and toward NoSQL. In practice, though, NoSQL (or even a hybrid system) is not always best.

Performance Gain Considerations

In certain applications, the performance gains one hopes to reap from NoSQL's optimizations may not translate to production without some forethought. I would argue that SQL databases are often preferable to something like Cassandra in non-trivial applications, above all when JOIN operations are required. Generally speaking, NoSQL databases, certainly Cassandra among others, do not support JOIN. I will add that the vast majority of ORMs (for those who may not be familiar with the term, these are systems that abstract database relations into, typically, object-oriented objects within one's backend code) are built around SQL. The flexibility and readability afforded by these ORMs, at least when operating a database of non-trivial objects, can be a lifesaver for development time, database management, integrity, and readability. Indeed, I would argue that for most web applications this often outweighs the marginal, or even negligible, performance gains a NoSQL database may provide (this depends entirely on the nature and scale of the data, but that is perhaps a topic for another time).
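To make the JOIN point concrete, here is a minimal sketch using SQLite from the Python standard library (the table and column names are hypothetical, chosen purely for illustration). A single declarative JOIN like this has no equivalent in Cassandra's CQL, where the same result would require denormalizing the data or merging rows in application code across multiple queries.

```python
import sqlite3

# In-memory relational database with two related tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        total REAL
    );
    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders VALUES (10, 1, 99.50), (11, 1, 12.00), (12, 2, 40.00);
""")

# One declarative query replaces what would otherwise be several round
# trips plus a manual merge against a store with no JOIN support.
rows = conn.execute("""
    SELECT c.name, SUM(o.total)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)  # [('Acme', 111.5), ('Globex', 40.0)]
```

An ORM would express the same query as method calls on mapped objects, but the underlying capability, relating tables inside the database engine, is the same.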

Cloud Infrastructure

However, none of this matters if the engineer is not paying close attention to their cloud infrastructure and the way their queries actually behave in production. While evaluating one engineer's project, I found they were performing all of their insertions individually rather than batching them, even though bulk insertion was well within the scope of this particular application. It appeared they had developed against a local setup and then deployed to the cloud, where the database ran on a separate machine from the application server. The end result was rather comical: once insertions were batched, even in Postgres they were orders of magnitude faster than the piecemeal NoSQL insertions. They had not considered a simple fact: latency.
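The shape of the fix can be sketched in a few lines. This uses SQLite in place of the production Postgres/Cassandra setup, and the table name and row shape are hypothetical; the point is the number of statements (and, in production, network round trips) issued, not the engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor_id INTEGER, value REAL)")
rows = [(i % 10, float(i)) for i in range(10_000)]

# Piecemeal: one statement per row. Over a network, that is one full
# round trip per row, which is what the inherited code was doing.
for row in rows[:100]:
    conn.execute("INSERT INTO readings VALUES (?, ?)", row)

# Batched: a single executemany call. Over a network this collapses
# thousands of round trips into a handful.
conn.executemany("INSERT INTO readings VALUES (?, ?)", rows[100:])

count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
print(count)  # 10000
```

With a Postgres driver the same idea shows up as `executemany`, multi-row `INSERT ... VALUES` statements, or `COPY`; the principle is identical.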

How did the original engineer miss this? I do not know; we inherited this particular piece of software with little background knowledge. But given that they were testing locally, I can guess why they elected for individual insertions: queries written this way can be less tricky than bulk insertions, which often come with all sorts of constraints and require a bit more forethought, especially in Cassandra, and against a local database the performance was beyond satisfactory. What they did not consider, however, is the latency between the backend server and a Cassandra (or SQL) server hosted in any sort of distributed system (i.e., production). That latency was so much greater than the query runtime that it barely mattered which database was used; the per-request round trip dominated everything. It followed that real-world performance improved significantly simply by batching insertions in Postgres (batching is of course supported in Cassandra as well, but the change was necessary nonetheless).
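The arithmetic here is worth spelling out. The numbers below are illustrative assumptions, not measurements from the system described above, but they show why round-trip count, not query speed, dominated:

```python
# Back-of-the-envelope latency model. All figures are assumed for
# illustration: a 1 ms network round trip and a 0.1 ms server-side
# cost per inserted row.
rows = 10_000
rtt_ms = 1.0        # assumed app-server-to-database round trip
query_ms = 0.1      # assumed per-row insert cost inside the database

piecemeal_ms = rows * (rtt_ms + query_ms)  # one round trip per row
batched_ms = rtt_ms + rows * query_ms      # one round trip in total

print(piecemeal_ms, batched_ms)  # 11000.0 1001.0
```

Under these assumptions the piecemeal approach spends over 90% of its time on the network, so no choice of database engine could have rescued it; only batching could.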

The Moral of the Story

In any case, the moral of the story, in my opinion, is that understanding your own cloud infrastructure is crucial to writing genuinely performant programs in the real world. However well one database is purported to perform relative to another under certain circumstances, without a solid understanding of the environment in which the application will be deployed, one cannot hope to see any appreciable performance gain.
