The term Data Warehouse was first coined in the 1970s. In essence, a data warehouse is a database management system (DBMS) that houses all of the enterprise’s data. The data warehouse serves as a single source of truth for all business related queries.
Why a Data Warehouse
In a previous post, we learnt about the need for a reporting database to accomodate the different types of data access patterns.
By creating a separate DBMS with its own schema, we can avoid undue stress on the operational system, all the while making the data more accessible to business analysts. This is all well and good in theory, however, in practice, things can get messy.
Suppose that you recently joined an e-commerce company as a data scientist, and were tasked with creating a recommendations system. You attempt to famlizarize yourself with the data that can be used to train your model, however, you notice that there are multiple tables for the customer entity. One table was created by the legal department in order to detect seller fraud, another table was created by the marketing deparment in order to perform customer segmentation, etc… Given that every team extracted the data from the operational system at different points in time (i.e. nightly, weekly), data in one table could be missing from another thus changing the values of aggregations. Add to this the possibility of intermediate tables, and what could be interpreted as the same features have entirely different values.
As you can image, the preceding scenario didn’t limit itself to the customer dimension. Across the organization, it became increasingly difficult to find which version of the data was the right one to use among the many available databases. When end users failed to find the right version of the data, it resulted in poor business decisions. Over time, people came to understand that a different architecture was needed. This is where the data warehouse comes in.
The data warehouse solves the problem of data integrity.
The data warehouse does this by storing and managing the entire enterprise’s data in one central location. Going back to our example, instead of having multiple customer tables each created by a different deparment, we’d store a single customer dimension within the data warehouse.
The Furture of Data Warehouses
As time went on, several issues with the data warehouse architecture began to manifest themselves.
For instance, data warehouses were unsuitable for unstructured data because, in practice, a data warehouse is a relational DBMS. As a consequence, the end users could not make use of the latest advances in machine learning techniques specific to textual data, images, audio, etc…
Although we’ve made a lot of headway in terms of distributed systems since data warehouses were first introduced, traditional DBMS still have a difficult time with the sheer velocity and volumes of the data generated by today’s organizations.
Another major problem with the data warehouse architecture is that it’s expensive. Like we mentioned previously, data warehouses tended to consists of DBMS (i.e. Oracle). The licensing model for databases typically charges customers for the amount of data stored at rest. This number tends to be several times more expensive then cloud storage.
It’s for these reasons that organizations began to adopt a data lake as the landing area for all of their data. Upon arriving in the data lake, the structured data would make it’s way to the data warehouse via ETL pipelines.
When every team is allowed to create and manage their own reporting database, we run into data consistency issues. Someone looking to leverage some of what other teams have developed runs into trouble determining which of the many models is the most accurate portrayal of reality. The data warehouse solves this by providing a one stop shop (i.e. data marts in the form of star schemas) for all of the enterprise’s data. However, when used in this fashion data warehouses are notoriously expensive, run into problems of scalability, and fail to satisfy the data scientists’ needs for accessing unstructured data.