Blueprint for Data Science Success

What key enablers does your company need to use data smartly?

To encourage and accelerate data science capabilities, companies need an integrated system and the processes that support this goal. It starts with creating a framework for building and evolving your data science, machine learning, and AI initiatives. To help you get started, EPRI designed a System Blueprint for Data Science (a system architecture diagram). The Blueprint defines the essential elements and illustrates the requirements of an integrated system. You can use it as a template, modifying and enhancing elements to meet your specific needs and to reflect your company’s practices and corporate policies.

This Blueprint is also a valuable tool for helping stakeholders and decision-makers understand, validate, and communicate the data science needs that enable predictive analytics and machine learning. For companies with more mature capabilities, the diagram can help identify potential gaps in existing processes, procedures, technologies, or skills.

Here’s a snapshot of the Blueprint, which identifies the main components involved in the stages of the Data Science Lifecycle — Acquire, Store, Cleanse, Visualize, and Analyze.

Figure: System Blueprint to Support Data Science, Machine Learning, and AI

Getting the Most Out of the Blueprint

Your objective is to design an approach that will help your company speed the adoption of data science best practices, processes, tools, and technology. To guide your planning, this Blueprint helps you understand the stages of the Data Science Lifecycle. The diagram shows the elements you’ll need to optimize in order to capture more actionable insights from data and analytics.

1. Data Sources

Not only do you want to access all the data your teams may have already collected individually, but new data sources also become available daily from public sources such as government agencies, device manufacturers, and crowdsourcing by internet users. This Blueprint lists some examples of internal and external data sources, but the full list is much longer and constantly changing.

2. Delivery & Extraction

This component involves tools to extract and consolidate data from primary databases in bulk or in batches. These tools offer an efficient, systematic way to pull in large volumes of data. Typically, the data travels to a staging environment for virus and malware screening before moving into storage.
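To make the batch idea concrete, here is a minimal Python sketch of a bulk extraction that pulls rows from a primary database in batches and lands them in a staging file. The readings table, column names, and file path are illustrative assumptions, not part of the Blueprint.

```python
import csv
import sqlite3

BATCH_SIZE = 10_000  # rows pulled per batch; tune to your environment

def extract_in_batches(conn: sqlite3.Connection, staging_file: str) -> None:
    """Pull rows from a primary database in batches and land them in a staging file."""
    cursor = conn.execute("SELECT meter_id, reading_time, kwh FROM readings")
    with open(staging_file, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["meter_id", "reading_time", "kwh"])  # header row
        while True:
            batch = cursor.fetchmany(BATCH_SIZE)               # bulk/batch pull
            if not batch:
                break
            writer.writerows(batch)

if __name__ == "__main__":
    # Tiny in-memory database standing in for a real primary system.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (meter_id INTEGER, reading_time TEXT, kwh REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?)",
                     [(101, "2024-01-01T00:00", 1.2), (102, "2024-01-01T00:00", 0.8)])
    extract_in_batches(conn, "staged_readings.csv")
```

In practice the staging file would then be scanned for viruses and malware before the data moves into storage.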

3. Data Cleansing

Cleansing is a critical first step before conducting any data mining or advanced analytics with datasets. Some of the activities include anonymizing data (e.g., removing confidential, identifiable customer information), normalizing data into the same unit of measure or same time of day, removing duplicates, and understanding the magnitude and significance of missing values.
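The sketch below illustrates these cleansing activities with pandas on a hypothetical meter-reading dataset; the column names and the Wh-to-kWh conversion are assumptions chosen only for the example.

```python
import pandas as pd

def cleanse(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleansing pass over a hypothetical meter-reading dataset."""
    # Anonymize: drop columns that could identify individual customers.
    df = df.drop(columns=["customer_name", "service_address"], errors="ignore")

    # Normalize: convert a hypothetical Wh column into kWh so units match.
    if "energy_wh" in df.columns:
        df["energy_kwh"] = df["energy_wh"] / 1000.0
        df = df.drop(columns=["energy_wh"])

    # Remove exact duplicate records.
    df = df.drop_duplicates()

    # Understand the magnitude of missing values before deciding how to treat them.
    missing = df.isna().sum()
    print("Missing values per column:\n", missing[missing > 0])

    return df

# Example usage with a small in-memory frame standing in for real data.
raw = pd.DataFrame({
    "meter_id": [101, 101, 102],
    "customer_name": ["A. Smith", "A. Smith", "B. Jones"],
    "energy_wh": [1500.0, 1500.0, None],
})
print(cleanse(raw))
```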

4. Virtual Data Lake

This Blueprint assumes all data owners and managers will store their data “in place” with no change to its current location. The virtual data lake indexes all datasets, making them searchable and available for use by others within the company. Permission for use is determined and granted by the data owner. Some organizations may opt to develop data lakes, data warehouses, data marts, or other architectures. Those strategies are compatible with the overall approach outlined in the Blueprint.
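As a rough illustration of indexing data “in place,” the following sketch keeps only descriptions and locations in a catalog and makes them searchable; the dataset names, locations, and owners are invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """One dataset indexed in place: the data never moves, only its description does."""
    name: str
    location: str                       # where the data already lives (path, URI, table)
    owner: str                          # who grants permission to use it
    keywords: list = field(default_factory=list)

catalog = [
    CatalogEntry("Transformer load history", "s3://ops-bucket/load/", "Grid Ops",
                 ["load", "transformer"]),
    CatalogEntry("Outage reports 2023", r"\\fileshare\outages\2023", "Reliability",
                 ["outage", "storm"]),
]

def search(term: str) -> list:
    """Return catalog entries whose name or keywords mention the search term."""
    term = term.lower()
    return [e for e in catalog
            if term in e.name.lower() or any(term in k for k in e.keywords)]

print([e.location for e in search("outage")])
```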

5. Data Science Tools

The tools include open-source products such as Python and R, as well as proprietary platforms available from a number of vendors. They help data scientists discover predictive information, build analytic models, and arrive at actionable insights.
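For a flavor of how such tools are used, here is a small scikit-learn sketch that fits a simple predictive model; the temperature-versus-peak-load data is synthetic and exists purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical example: predict peak load (MW) from daily high temperature (°F).
rng = np.random.default_rng(seed=0)
temps = rng.uniform(60, 100, size=200).reshape(-1, 1)
peak_load = 50 + 2.5 * temps.ravel() + rng.normal(0, 10, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    temps, peak_load, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)        # fit the analytic model
print("R^2 on held-out data:", round(model.score(X_test, y_test), 3))
print("Predicted peak load at 95°F:", model.predict([[95.0]])[0])
```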

6. Compute Layer & Virtual Machines

The compute layer refers to the data processing power required to churn through volumes of data for visualization and advanced analytics. Processor-intensive work no longer requires physical machines or supercomputers. Today, companies can scale up with virtual machines (often cloud-based) to meet their changing needs for processing power.
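Scaling details vary by cloud provider, but the underlying idea of spreading processor-intensive work across available compute can be sketched with Python’s standard library; the chunked workload below is a stand-in for real analytics and says nothing about any particular VM service.

```python
from concurrent.futures import ProcessPoolExecutor

def summarize_chunk(chunk: list) -> float:
    """Stand-in for processor-intensive analytics on one slice of the data."""
    return sum(x * x for x in chunk) / len(chunk)

if __name__ == "__main__":
    # Split a large dataset into chunks and process them in parallel.
    data = [float(i) for i in range(1_000_000)]
    chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]

    with ProcessPoolExecutor() as pool:    # workers scale with available cores
        results = list(pool.map(summarize_chunk, chunks))

    print("Per-chunk summaries:", results[:3], "...")
```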

7. Data Science Workspace

This is a virtual sandbox for creating data visualizations and developing analytic models. For people working in data science, the Visualize and Analyze stages are the most rewarding parts of the lifecycle because they lead to new insights.
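A typical workspace task might look like the following matplotlib sketch, which plots a hypothetical feeder load profile; the data is synthetic and exists only to show the kind of visualization produced at this stage.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical hourly load profile for one feeder (illustrative values only).
hours = np.arange(24)
load_mw = 40 + 15 * np.sin((hours - 6) * np.pi / 12) \
          + np.random.default_rng(1).normal(0, 2, 24)

plt.figure(figsize=(8, 4))
plt.plot(hours, load_mw, marker="o")
plt.xlabel("Hour of day")
plt.ylabel("Load (MW)")
plt.title("Feeder load profile (illustrative)")
plt.tight_layout()
plt.savefig("load_profile.png")   # workspace artifacts can be shared with stakeholders
```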

8. Metadata & Data Management

Metadata is descriptive information captured about a dataset, such as the data source, the data owner, and the timeframe during which the data was collected. Metadata is critical for data sharing because it provides the information needed to index datasets and make them searchable.

Data management is the organization of datasets and administration of permissions to review, edit, and use data in the virtual data lake.
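One lightweight way to capture such a record is a JSON “sidecar” file stored alongside the dataset, as in the sketch below; the field names are illustrative rather than a formal metadata standard.

```python
import json
from datetime import date

# A minimal metadata record for one dataset; field names are illustrative.
metadata = {
    "dataset_name": "AMI interval readings",
    "data_source": "Meter data management system",
    "data_owner": "Customer Operations",
    "collection_start": date(2022, 1, 1).isoformat(),
    "collection_end": date(2022, 12, 31).isoformat(),
    "keywords": ["AMI", "interval", "load"],
}

# Writing the record next to the data gives the virtual data lake
# something concrete to index and search.
with open("ami_interval_readings.metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```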

9. Security & Governance

Processes and rules for governance are needed to screen, evaluate, and index datasets before they are stored in the virtual data lake. Governance helps ensure that data lake contents remain relevant and useful. Security protocols define the user permissions that determine who can read, edit, and use the data.
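The permission idea can be sketched as a simple lookup of which user groups may read, edit, or use each dataset; the groups and dataset names below are invented, and a production system would typically rely on the company’s existing identity and access management tooling.

```python
# Illustrative permission model: the data owner grants read/edit/use rights per user group.
PERMISSIONS = {
    "ami_interval_readings": {
        "read": {"data-science", "customer-ops"},
        "edit": {"customer-ops"},
        "use":  {"data-science"},     # e.g., allowed to build models on the data
    },
}

def is_allowed(dataset: str, action: str, group: str) -> bool:
    """Check whether a user group may perform an action on a dataset."""
    return group in PERMISSIONS.get(dataset, {}).get(action, set())

print(is_allowed("ami_interval_readings", "edit", "data-science"))   # False
print(is_allowed("ami_interval_readings", "read", "data-science"))   # True
```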