Optimizing Data Collection

How can you advance data science and machine learning with more effective data collection and storage?

Many power companies are deploying new technology and sensors to capture entirely new sets of data that can help improve operational efficiency, safety, and reliability. The real payoff is the ability to combine this new information with existing data found throughout your company – to discover new insights and anticipate future conditions. To ensure you can use this new data for advanced analytics such as predictive or prescriptive modeling, you may need to revisit your data storage strategy.

Objective

In this brief, we introduce you to ways to think about data collection to ensure you capture and store relevant data that can be used more effectively for advanced analytics.

Questions to Help You Optimize Data Collection

1 What questions am I trying to answer in my business?
2 What types of data do I need (and what fields) for the analysis? Why?
3 What should my data look like to enable analysis? (e.g., data structure, time horizon)
4 Do I have enough data, or should I combine data from multiple sources?
5 Do I have duplicates?
6 What data is missing?
7 What do I do with null values? (discard, impute, etc.)
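Questions 5 through 7 can be checked directly against a sample of collected data. The following is a minimal sketch in Python, assuming readings arrive as (sensor, timestamp, value) tuples with None marking a null value. The sample readings, the function names, and the choice of per-sensor mean imputation are illustrative assumptions, not a recommended method for any particular dataset.

```python
from collections import Counter
from statistics import mean

# Hypothetical readings: (sensor_id, timestamp, value); None marks a null value.
readings = [
    ("S1", "2024-01-01T00:00", 10.0),
    ("S1", "2024-01-01T00:15", None),
    ("S1", "2024-01-01T00:15", None),   # exact duplicate of the row above
    ("S1", "2024-01-01T00:30", 14.0),
    ("S2", "2024-01-01T00:00", 7.5),
]

def find_duplicates(rows):
    """Return each row that appears more than once (question 5)."""
    counts = Counter(rows)
    return [row for row, n in counts.items() if n > 1]

def count_nulls(rows):
    """Count rows whose value is missing (question 6)."""
    return sum(1 for _, _, value in rows if value is None)

def impute_with_mean(rows):
    """Fill null values with the per-sensor mean of known values (question 7).

    A sensor with no known values keeps its nulls.
    """
    known = {}
    for sensor, _, value in rows:
        if value is not None:
            known.setdefault(sensor, []).append(value)
    means = {sensor: mean(values) for sensor, values in known.items()}
    return [(s, t, means.get(s) if v is None else v) for s, t, v in rows]

# Drop exact duplicates (keeping first occurrences), then impute.
unique_rows = list(dict.fromkeys(readings))
cleaned = impute_with_mean(unique_rows)
```

Whether to discard or impute depends on the analysis. Imputing with a mean is only one option and can hide real gaps in collection, so it is worth recording which values were filled.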

Review Data Collection Early & Often

As you work with new data sources from sensors, advanced metering infrastructure (AMI), and smart systems, check how data is being captured and stored early in the collection process. Early review helps you determine whether to adjust what is being collected (the types of data and the fields within them). You’ll want to identify early on whether you need more or different information, and whether you’re capturing the right data to test your hypotheses with analytics.

As an example, consider the preliminary data structure below, extracted from monitoring equipment. The Data IDs represent sensors at different locations throughout a plant, each of which typically collects different data. Here, however, two Data IDs capture parameter A data and two capture parameter H data, which is unexpected. You might ask: is there more than one measurement location for each of these parameters? Or is there an error in the data collection and storage process that records two readings?
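A check like this can be automated so that it runs whenever a new extract arrives. The sketch below assumes each record pairs a Data ID with the parameter it reports. The IDs and the extra parameter letters are hypothetical stand-ins, not values from the original equipment extract.

```python
from collections import defaultdict

# Hypothetical extract: (data_id, parameter). Values are illustrative only.
records = [
    (101, "A"), (102, "B"), (103, "A"),
    (104, "H"), (105, "H"), (106, "C"),
]

def parameters_with_multiple_ids(rows):
    """Map each parameter reported by more than one Data ID to those IDs."""
    ids_by_parameter = defaultdict(set)
    for data_id, parameter in rows:
        ids_by_parameter[parameter].add(data_id)
    return {
        parameter: sorted(ids)
        for parameter, ids in ids_by_parameter.items()
        if len(ids) > 1
    }

flagged = parameters_with_multiple_ids(records)
```

The check only flags the pattern. Deciding whether a flagged parameter reflects a second measurement location or a collection error still requires someone who knows the plant.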

New Uses for your Existing Data

To develop analyses using machine learning, you will likely be using data from operational systems that was originally collected for very different purposes. Your storage capacity needs may increase sharply if you retain granular data that your systems currently either compress or discard. Not only will you save more data, but you will also want to replicate data from operational systems so that it can be used in a separate analytics sandbox environment.
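A back-of-the-envelope calculation shows how quickly retaining granular data can change capacity needs. The sketch below is rough arithmetic only. The sensor count, sampling intervals, record size, and retention period are hypothetical, and it ignores compression, indexing overhead, and the extra copy kept in a sandbox.

```python
SECONDS_PER_DAY = 86_400

def raw_storage_bytes(num_sensors, interval_seconds, bytes_per_reading, retention_days):
    """Estimate uncompressed storage for fixed-interval sensor readings."""
    readings_per_day = SECONDS_PER_DAY // interval_seconds
    return num_sensors * readings_per_day * retention_days * bytes_per_reading

# Hypothetical fleet: 1,000 sensors, 16 bytes per reading, one year retained.
summarized = raw_storage_bytes(1_000, 15 * 60, 16, 365)  # 15-minute values
granular = raw_storage_bytes(1_000, 1, 16, 365)          # 1-second values

print(f"15-minute data: {summarized / 1e9:.2f} GB")
print(f"1-second data:  {granular / 1e9:.2f} GB")
print(f"Increase:       {granular // summarized}x")
```

Under these assumptions, keeping one-second readings instead of 15-minute values multiplies raw volume by 900, before counting any replica held for analytics.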

Mitigate Data Storage Roadblocks

Quality:

Evaluate early on the quality of information being collected by new technology, such as sensors or AMI, to help identify and fix issues.

Volume:

Measure the performance and reliability of your storage systems when they store and retrieve large volumes of time series data.

Variety:

Plan to store a wide variety of data types and formats including pdf files, video, audio, images, and social media posts.

Metadata:

Determine how missing information in metadata (descriptive information about datasets) will be captured as data is collected and moved to storage.
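One way to surface missing metadata is to validate each dataset's descriptive record before it moves to storage. The sketch below assumes metadata is held as a simple dictionary. The required field names are hypothetical, and the real list would come from your own data governance standards.

```python
# Hypothetical set of fields every dataset should describe.
REQUIRED_FIELDS = ("source_system", "unit", "collected_at", "location")

def missing_metadata(metadata):
    """Return the required fields that are absent or empty in a metadata record."""
    return [
        field for field in REQUIRED_FIELDS
        if metadata.get(field) in (None, "")
    ]

# Example record with one empty field and one absent field.
example = {"source_system": "SCADA", "unit": "kV", "collected_at": ""}
gaps = missing_metadata(example)
```

Logging the returned gaps alongside the dataset, rather than silently rejecting it, keeps a record of what still needs to be filled in.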

Conclusion

Start planning your data storage strategy based on what data structure you need to test your data science hypotheses. Your storage strategy needs to enable the creation of datasets to be used for advanced analytics.