The data lake concept is not new in itself; nevertheless, some organizations struggle to understand it, as many are still caught up in the traditional paradigm of Enterprise Data Warehouses. James Dixon, founder and CTO of Pentaho, was the first to use the expression "Data Lake" and explained it as follows:
"If you think of a Data Mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples."
Our MDP with MindSphere will enable the realization of data-driven use cases of all kinds, will connect and index multiple data sources via a single interface, and will also act as a single source of truth for manufacturing data.
After analyzing all known use cases and interviewing different stakeholders, almost 600 requirements were identified and clustered into about 40 function clusters to define the overall data lake architecture. The resulting architectural concept was then benchmarked by an external consulting company against known and readily available data lake concepts.
We received confirmation that the proposed MDP architecture is future-oriented and superior to what is currently available on the market.
The MDP provides technology, enablement, data-as-a-service and finally even analytics-as-a-service to the factories and our customers:
- It supports all kinds of data and retains them in raw form
- Adapts easily to change (no hardware dependence) as requirements evolve
- Developers and users can access all available data
- App developers and data scientists can leverage existing data to create business value
- Translates business requirements into data queries and offers the resulting data sets to internal and external app developers and business users (data-as-a-service)
- Offers deep insights and data correlations, in addition to raw data, to internal and external app developers and business users (analytics-as-a-service)
Product owners and developers are supported by an "MDP organization" for connectivity and data sharing. The focus is on data analysis rather than on the technical connection.
At the bottom of figure 2, we have the data sources, which can be structured or unstructured and are fed into a raw data store, meaning that no transformation is applied to the data (data source connectivity).
Data ingestion and streaming, as well as data referencing, are possible to maintain the single-source-of-truth concept and avoid data duplication.
Since it resides in the cloud, it is a persistent store that holds data at scale and distributes processing across the different storage areas: batch (cold), online (warm) and streaming (hot). The batch processing engine transforms the data into consumable data sets that can be used for reporting, e.g. for dashboarding of KPIs. In addition, there is a real-time processing engine for streaming data and processes.
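As a minimal sketch of the tiering idea, the following routes a record to the hot, warm, or cold area based on its age. The age thresholds are hypothetical; the actual MDP routing rules are not described in this post.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical age thresholds for the storage tiers; illustrative only.
HOT_WINDOW = timedelta(minutes=5)   # streaming (hot)
WARM_WINDOW = timedelta(days=30)    # online (warm)

def storage_tier(record_timestamp: datetime, now: datetime) -> str:
    """Route a record to a storage tier based on its age."""
    age = now - record_timestamp
    if age <= HOT_WINDOW:
        return "hot"    # handled by the real-time processing engine
    if age <= WARM_WINDOW:
        return "warm"   # kept online for interactive access
    return "cold"       # left to the batch processing engine

now = datetime(2020, 1, 31, tzinfo=timezone.utc)
print(storage_tier(now - timedelta(minutes=1), now))  # hot
print(storage_tier(now - timedelta(days=7), now))     # warm
print(storage_tier(now - timedelta(days=365), now))   # cold
```

In a real deployment, the tier would typically be chosen by the platform's lifecycle policies rather than application code, but the decision logic follows the same pattern.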
Data persistence is very important in the context of traceability, warranty, and the analysis of long-term effects. Imagine you have defined a smart algorithm that makes important decisions; you might want to observe any data drift so you can adjust your algorithm accordingly.
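One simple way to watch for such drift is to compare recent data against a persisted baseline. The sketch below flags drift when the recent mean deviates from the baseline mean by more than a few baseline standard deviations; the threshold and the data are purely illustrative.

```python
from statistics import mean, stdev

def drift_alert(baseline, recent, z_threshold=3.0):
    """Flag drift when the recent mean deviates from the baseline
    mean by more than z_threshold baseline standard deviations."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return mean(recent) != mu
    return abs(mean(recent) - mu) / sigma > z_threshold

# Illustrative sensor readings, e.g. squeegee pressure measurements.
baseline = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7]
stable   = [10.1, 9.9, 10.0, 10.2]
shifted  = [12.5, 12.8, 12.6, 12.7]
print(drift_alert(baseline, stable))   # False
print(drift_alert(baseline, shifted))  # True
```

This only works if the baseline data is still available, which is exactly why persistent raw storage matters here.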
The semantic layer relates to data in terms of meaning rather than structure. It links and publishes structured data in a way that it can be easily consumed and combined with other data. The more use cases we implement and the more insights we gain, the more linked data we create, forming what is commonly called a knowledge graph. The knowledge graph is an advanced way to map all knowledge on a topic and to fill in the gaps of how this data is related. This will create entirely new insights, as different use cases may become connected via the knowledge graph. The difference to classical data warehouse concepts is that we are seeking answers instead of looking only for anticipated ones. We want the facts, wherever they come from. The data can represent concepts, objects, things, log files and whatever else you have in mind.
A very famous representative of this concept is Google's Knowledge Graph. Data lakes without a semantic layer will simply turn into data swamps, because over time it becomes extremely difficult to analyze all the data we have and will generate. The semantic layer with its knowledge graphs will also allow us to create structures for the relationships in the graphs.
We will be able to tell a graph that a solder joint of a specific shape might relate to insufficient squeegee pressure during stencil printing, and that this is related to specific stencils. At the same time, the solder material is specified, and the graph can link to the data of the material specification, and so on. This might lead to new insights regarding solder material and stencil specifications.
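The solder-joint example above can be sketched as a tiny triple store. All entity and relation names here are illustrative placeholders, not taken from the actual MDP semantic layer; the point is how graph traversal connects an observed defect to a material specification several hops away.

```python
# Toy knowledge graph as subject-relation-object triples (illustrative).
triples = [
    ("solder_joint_shape_X", "indicates", "insufficient_squeegee_pressure"),
    ("insufficient_squeegee_pressure", "occurs_during", "stencil_printing"),
    ("insufficient_squeegee_pressure", "related_to", "stencil_type_A"),
    ("stencil_printing", "uses", "solder_paste_B"),
    ("solder_paste_B", "specified_by", "material_spec_123"),
]

def neighbors(entity):
    """All entities directly linked to `entity`, in either direction."""
    return sorted({o for s, _, o in triples if s == entity} |
                  {s for s, _, o in triples if o == entity})

def reachable(entity):
    """All entities connected to `entity` via any chain of triples."""
    seen, frontier = set(), {entity}
    while frontier:
        node = frontier.pop()
        seen.add(node)
        frontier |= set(neighbors(node)) - seen
    return seen - {entity}

# Starting from the observed defect shape, the graph leads all the way
# to the solder material specification:
print("material_spec_123" in reachable("solder_joint_shape_X"))  # True
```

Production semantic layers typically use RDF triple stores and query languages like SPARQL for this, but the traversal principle is the same.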
In addition to the functionalities mentioned above, the MindSphere platform in combination with our MDP will offer analytical sandboxes: exploratory areas where data scientists can develop and test new hypotheses, combine and explore data, create prototypes, and simply run experiments to gain more insights and finally generate use cases or even business cases.
Above the MDP, we can allocate the different internal, external and MindSphere partner applications to make the data accessible to the diverse users with their distinct interests and rights.
Read my next blog, Edge Computing and MindSphere – A True Dreamteam, to find out more about the Industrial Edge layer we are integrating into our state-of-the-art manufacturing data architecture, and how it complements MindSphere.