Analytical data processing
Data lake as the basis for Industry 4.0
For a long time, the data warehouse was considered the central source for all data analyses. In the course of increasing digitalization, however, the data lake has overtaken the classic data warehouse. Especially in Industry 4.0, many use cases are no longer conceivable without it. What do companies need to consider when implementing the technology?
The right architecture for analytical data processing has been clearly defined since the 1990s. In a hub-and-spoke approach, a data warehouse collects the relevant data from the various operational source systems. The data is then harmonized, integrated and persisted in a multi-layered data integration and refinement process. The result is a single point of truth: a universally valid, correct data basis that can be relied upon. Users access this treasure trove of data via reporting and analysis tools.
An essential characteristic of the data warehouse is that it provides a uniform view of company data in a strict, predefined data model that is optimized for evaluation. This makes it well suited to past-oriented analyses of key figures along consolidated evaluation structures. However, the high standards of correctness and harmonization usually also mean that it takes a long time to integrate data from a new data source correctly, because a great deal of design and coordination work is required in advance.
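The collect, harmonize and integrate steps can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the two source systems, their field names and the `harmonize_*` and `integrate` helpers are hypothetical and stand in for what a real warehouse would do with dedicated ETL tooling.

```python
# Hypothetical extracts from two operational source systems (the "spokes").
erp_orders = [{"order_no": "A-1", "amount_eur": 120.0}]
shop_orders = [{"id": "S-7", "total_cents": 4550}]


def harmonize_erp(row):
    """Map an ERP row onto the common warehouse model."""
    return {"order_id": row["order_no"], "amount_eur": row["amount_eur"], "source": "erp"}


def harmonize_shop(row):
    """Map a shop row onto the common model, converting cents to euros."""
    return {"order_id": row["id"], "amount_eur": row["total_cents"] / 100, "source": "shop"}


def integrate(*harmonized_sets):
    """Merge harmonized rows into one core table keyed by order_id (the "hub")."""
    core = {}
    for rows in harmonized_sets:
        for row in rows:
            core[row["order_id"]] = row
    return core


core_layer = integrate(
    [harmonize_erp(r) for r in erp_orders],
    [harmonize_shop(r) for r in shop_orders],
)

# A key figure computed on the consolidated data, as a report would do.
total_revenue_eur = sum(row["amount_eur"] for row in core_layer.values())
```

Every new source needs its own harmonization step before it can enter the core layer. That is where the design and coordination effort arises.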
New data sources require new solutions
This problem has become particularly apparent since the emergence of new data sources such as social media or IoT data. This data is often semi-structured or unstructured, yet in a data warehouse it would still have to be fitted into the predefined data model. As these data sources grew in relevance, the idea of the data lake was born. The data lake makes all source data, internal and external, structured and polystructured, available as raw data in its unprocessed form so that it can be used as quickly as possible.
While the data warehouse clearly focuses on the past-oriented analysis of key figures along consolidated evaluation structures, the data lake offers greater agility and flexibility. It can quickly integrate diverse data sources and large volumes of data and process them as data streams. This enables complex analyses, including those that are usually not yet defined at the time the data is stored.
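The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the schema, the field names and the three helper functions are hypothetical, and plain in-memory lists stand in for the real storage layers. The warehouse checks the structure when data is written. The lake stores the raw payload unchanged and applies a structure only when data is read.

```python
import json

# Hypothetical warehouse table: a fixed set of typed columns.
WAREHOUSE_SCHEMA = {"machine_id": str, "temperature_c": float}


def load_into_warehouse(table, record):
    """Accept a record only if it matches the predefined schema exactly."""
    if set(record) != set(WAREHOUSE_SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for name, expected_type in WAREHOUSE_SCHEMA.items():
        if not isinstance(record[name], expected_type):
            raise ValueError(f"{name} must be {expected_type.__name__}")
    table.append(record)


def land_in_lake(lake, raw_line):
    """Store the raw payload unchanged; no schema is applied on write."""
    lake.append(raw_line)


def read_from_lake(lake, name):
    """Apply structure only when reading: collect one field where it exists."""
    values = []
    for raw_line in lake:
        try:
            record = json.loads(raw_line)
        except json.JSONDecodeError:
            continue  # unparseable raw data stays in the lake but is skipped here
        if isinstance(record, dict) and name in record:
            values.append(record[name])
    return values


lake = []
land_in_lake(lake, '{"machine_id": "m1", "temperature_c": 71.5}')
land_in_lake(lake, '{"machine_id": "m2", "vibration_mm_s": 3.2, "note": "new sensor"}')
land_in_lake(lake, "free-text operator comment")

warehouse = []
load_into_warehouse(warehouse, {"machine_id": "m1", "temperature_c": 71.5})

# An analysis nobody had planned when the data was stored.
vibration = read_from_lake(lake, "vibration_mm_s")
```

The second record would be rejected by `load_into_warehouse` until the data model is extended, whereas the lake accepts it immediately and leaves the interpretation to the later analysis.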
Looking at these different objectives and characteristics of data lakes and data warehouses, it becomes clear that a data lake does not replace a data warehouse, but complements it. Both architectural concepts have their relevance and serve different use cases.
The data lake enables product optimization in industry
Two business requirements in particular are driving the use of data lakes in industry: the optimization of production and the offering of better or new products, sometimes even entirely new business models. The basic use cases here are the "digital twin", i.e. the digital representation of the company's own machines or of the machines it produces, and the connection of these machines to the data lake with near-real-time data.
First-generation data lakes were technically very complex, and connecting data sources with the required timeliness was challenging. Today, the barriers to using a data lake have fallen. With second-generation data lakes the picture has shifted, driven by changes in the market for commercial distributions and by the general trend toward increased cloud usage: managing the basic platform becomes far simpler with native cloud services or dedicated managed Hadoop environments. This makes data lakes usable for companies of almost any size.
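A digital twin fed with near-real-time data can be pictured as an object that mirrors the latest known state of a machine. The following Python sketch is an assumption for illustration only: the `DigitalTwin` class, the event layout and the sensor names are hypothetical and do not describe any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalTwin:
    """Digital representation of one machine, updated from telemetry events."""

    machine_id: str
    state: dict = field(default_factory=dict)
    last_update_s: float = float("-inf")

    def apply(self, event):
        """Apply a telemetry event unless it is older than the current state."""
        if event["timestamp_s"] < self.last_update_s:
            return False  # late arrival: keep the newer state
        self.state.update(event["readings"])
        self.last_update_s = event["timestamp_s"]
        return True

    def lag_s(self, now_s):
        """Age of the twin's data in seconds, a simple measure of timeliness."""
        return now_s - self.last_update_s


twin = DigitalTwin("press-01")
twin.apply({"timestamp_s": 100.0, "readings": {"temperature_c": 70.2}})
twin.apply({"timestamp_s": 105.0, "readings": {"temperature_c": 71.0, "rpm": 1200}})

# An event that arrives out of order is ignored by this simple policy.
late = twin.apply({"timestamp_s": 101.0, "readings": {"temperature_c": 65.0}})
```

Whether late events are dropped, as here, or merged per sensor is a design choice that depends on the use case. The `lag_s` value makes explicit how close to real time the twin actually is.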
The right strategy for companies
If a company wants to use a data lake, a number of considerations need to be made in advance. It is advisable to clearly identify and prioritize the use cases as part of a roadmap and then to select the components to be used initially. Continuously searching for and evaluating alternatives among commercial, open source and cloud service options makes it possible to create the greatest added value for the company.
In addition to the functional requirements, industrial use raises further points, in particular the protection of trade secrets from competitors and legal aspects. Machine manufacturers also face the challenge of accessing data from their own machines at the customer's site: machines from different manufacturers are often used in combination, and customers in turn do not disclose all data in order to protect their own company.
In practice, certain key conditions also emerge as the basis for a successful data lake initiative. They are similar to those for implementing a central data warehouse. Two are fundamental: a strong management decision to set up and use a central platform, and the resulting close cooperation between business and production IT, and possibly also product development. Such cooperation has often not been practiced to date.
In addition, the operation of a data lake should be set up flexibly and holistically. A DevOps team that continuously develops the platform and keeps it stable in operation has proven to be best practice.
In summary, data warehouses and data lakes fulfill different requirements. In principle, every Industry 4.0 initiative requires a data lake platform. The technological entry barrier for data lakes has fallen, but the architecture still requires sound planning, and the basis should be a roadmap of use cases. To maximize value creation in the long term, the necessary organizational conditions for successfully using a data lake platform must be created alongside the technology.
Dr. Carsten Dittmar and Peter Schulz, Alexander Thamm GmbH










