By Prateek Joshi
Large companies have enormous physical infrastructure. This infrastructure is well instrumented and data is collected continuously. The Plutoshift platform uses this data to help them monitor their physical infrastructure. When we look at the data flowing in, we need to standardize and centralize the data in an automated way. One of the first steps in monitoring physical infrastructure is to check data quality. How do we do that? What framework should we use to validate data quality?
The topic of data quality is vast. There are many ways to check and validate data quality.
To automate the work of monitoring physical infrastructure, we employ a variety of machine learning tools. You need to automate the work of looking for anomalous performance metrics and surface them. If you look at deploying machine learning in these situations, it is basically a data infrastructure problem. If you have good data infrastructure, your machine learning tools will do a good job. Needless to say, your machine learning tools will look bad if the data infrastructure is not robust.
In our experience, there are 8 criteria we can use to ensure data quality in the world of physical infrastructure:
There shouldn't be any contradictions within the data. If you do a sweep of the entire data store, the observations should be consistent with each other. For example, let's say that there's a sensor monitoring the temperature of a system. The dataset shouldn't contain the same timestamp with two different temperature values.
The data should accurately reflect the reality. You should be able to trust your instrumentation. In general, this is a consideration for the data collection systems. For example, let's say that you're looking at the data store for flow rates within a pipe. The data should accurately reflect the reality of what's actually happening in the pipe. The machine learning model will assume that the data is true to make a prediction. If the data itself is inaccurate, the machine learning model can't do much.
The data should be relevant to the use case. You need data that enables you to achieve a specific goal. For example, let's say we're looking at the energy consumption problem. If you want to reduce energy consumption, you need to have data on the levers that are responsible for driving energy consumption. Machine learning can't do much with high-quality data if it's not relevant.
We should be able to trace the changes made to the data. You can make sure that nothing gets overwritten permanently. By understanding the changes made to the data over time, you can detect useful patterns. For example, let's say that you're looking at a response tracker filled with user-inputted values. The ability to trace the changes made to the data gives us the ability to look at the evolution of the dataset.
It means that all elements of the data should be in our database. Fragmented data is one of the most issues of subpar performance. In order to drive a use case, you need all elements of the data. Data completeness allows machine learning models to perform better. For example, let's say you are looking at monitoring membranes within the physical infrastructure at a beverage company. The aim is to predict cleaning dates and there are 5 key factors that affect the cleaning dates. If the dataset only has 3 of those, then the machine learning model can't achieve the level of desired accuracy.
We should get data with minimal latency. Data tells us something about the real world. The sooner we know it, the sooner we can dissect it to take action. If something is happening in the real world, the data collection system should be able to get that data into the hands of the end-user with minimal latency. For example, let's say we're look at pump monitoring. In case of emergency, the aim is to take action within the hour to minimize damage. If the data collection system sends you the data with a gap of 3 hours, then you'll be too late.
The data should have a fixed structure and format. Data format plays an important role in building scalable products. For software to work at a large scale, the data needs to be in an agreed-upon shape. This allows machine learning systems to work at scale, which is really powerful given the amount of data they can handle. For example, let's say we're looking at monitoring cooling systems across 400 sites. A machine learning model is effective if the data from all those sites is in a standardized format. If all 400 sites have different data formats, then you'll have to build separate workflows for each. The ability to scale reduces.
Data shouldn't be duplicated. This one seems obvious, but data duplication is a very real issue that we face. In a given database, the data shouldn't be duplicated. There's no reason to store it more than once. It occupies more space and doesn't serve any purpose. For example, let's say we're looking at pressure values within a steam system. For a given timestamp and location, we only need the value to occur once. If the values are duplicated, we need to deduplicate it before processing further.