Data Vault modelling is an approach to building data warehouses that addresses the drawbacks of 3NF (Bill Inmon) and dimensional (Ralph Kimball) modelling. Data Vault, originally conceived by Dan Linstedt, was released in 2000. The Data Vault model has three types of entities: hubs, links, and satellites.
A Hub represents a core business entity within a company, such as a customer, a product, or a store. Hubs don’t contain any descriptive details about the entity. They contain only the defined business key and a few mandated Data Vault fields: a hash-based unique identifier (computed from the business key), the timestamp when the record was loaded, and the source system the record came from.
Direct hub to hub relationships are not allowed. In other words, a hub should never contain foreign keys. We connect hubs using links as we’ll see later.
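As a minimal sketch, a hub row can be built like this in Python. The table and column names (`hub_customer_hk`, `customer_id`) are illustrative assumptions, not prescribed by the specification:

```python
import hashlib
from datetime import datetime, timezone

def hash_key(business_key: str) -> str:
    # Deterministic hash of the normalized business key
    # (MD5 is a common choice in Data Vault 2.0 implementations).
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()

def hub_record(business_key: str, record_source: str) -> dict:
    # A hub row: the hash key, the business key itself,
    # and the mandated audit fields -- nothing else.
    return {
        "hub_customer_hk": hash_key(business_key),
        "customer_id": business_key,
        "load_date": datetime.now(timezone.utc),
        "record_source": record_source,
    }

row = hub_record("CUST-001", "salesforce")
```

Note that the hash key contains no foreign keys and no descriptive attributes; those live in satellites, as described below.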
A Link defines the relationship between business keys from two or more Hubs. These relationships are always modelled as many-to-many. Just like a Hub, a Link contains no contextual information about the entities: only its own hash key, the hash keys of the connected Hubs, and the same mandated Data Vault fields we saw previously.
If relationships aren’t currently known or data sources aren’t yet accessible, this is ok because Links are easily created when they are needed. Adding a new Link in no way impacts existing Hubs or Satellites.
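Continuing the sketch above, a link row connecting a hypothetical customer hub and order hub could look like this (names such as `link_customer_order_hk` and `hub_order_hk` are again illustrative):

```python
import hashlib
from datetime import datetime, timezone

def hash_key(*business_keys: str) -> str:
    # Concatenate normalized business keys with a delimiter, then hash.
    # One key yields a hub hash key; several yield a link hash key.
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def link_customer_order(customer_id: str, order_id: str, record_source: str) -> dict:
    # A link row: its own hash key, the hash keys of both hubs,
    # and the mandated audit fields -- no descriptive attributes.
    return {
        "link_customer_order_hk": hash_key(customer_id, order_id),
        "hub_customer_hk": hash_key(customer_id),
        "hub_order_hk": hash_key(order_id),
        "load_date": datetime.now(timezone.utc),
        "record_source": record_source,
    }

row = link_customer_order("CUST-001", "ORD-042", "webstore")
```

Because the link is its own table, a new relationship type is just a new table: the existing hubs are untouched.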
A Satellite houses all the contextual details about an entity; you can think of it as roughly analogous to a dimension in a star schema. When the data changes, a new row is inserted with the changed values (a Satellite always behaves like a type 2 slowly changing dimension). Records are differentiated from one another by the hash key together with one of the mandated Data Vault fields, the load date.
Another common consideration when creating Satellites is data classification. Satellites enable data to be split apart based on classification. For example, suppose we had a Customer Hub table and two satellite tables. One satellite would contain data from a retailer’s Salesforce instance, while the other contains data from the retailer’s webstore.
Instead of reengineering the model and wasting valuable cycles determining the impact of those changes, data from a new source can be ingested into a completely new Satellite table. This speed also enables data engineers to iterate rapidly with business users on the creation of new information marts.
In 2013, the Data Vault 2.0 specification was released, and with it came hash keys. Hashing provides several advantages over standard composite or surrogate keys and row-by-row data comparisons:
- Query Performance: Fewer comparisons to make when joining tables together. In other words, comparing a single column with a hash value is faster than comparing a primary key composed of multiple columns.
- Load Performance: Tables can be loaded in parallel because ingestion pipelines don’t need to wait for other surrogate keys to be created in the database. Every pipeline can compute all the needed keys.
- Deterministic: Meaning that the key can be computed from the data. There are no lookups necessary. This is advantageous because any system that has the same data can compute the same key.
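The determinism point can be shown in a few lines: two independent pipelines ingesting the same customer from different sources compute identical keys without any database lookup (source names here are made up for illustration):

```python
import hashlib

def hash_key(business_key: str) -> str:
    # Any pipeline holding the same business key computes the same
    # hash key -- no lookup table, no sequence generator, no waiting.
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()

# Two "pipelines" loading in parallel from different systems:
key_from_salesforce = hash_key("CUST-001")
key_from_webstore = hash_key("cust-001")  # same customer, different casing
```

Both pipelines can write their hub, link, and satellite rows concurrently, and the rows will join correctly on the shared hash key.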
Enterprise Data Warehouse
Similar to other data warehouse architectures, the bronze layer contains a table for every raw data source (no transformations).
The main Data Vault model (Hubs, Links and Satellites) lives in the silver layer.
The Business Vault is an optional tier of the Data Vault where the business can define common business entities, calculations, and logic. Rather than implementing these differently in every information mart, they are implemented once in the Business Vault and reused across the Information Marts. This can include things like type casting and removing whitespace.
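As a small sketch of the kind of shared logic that belongs in the Business Vault, here is a hypothetical cleansing function (the column names and rules are assumptions for illustration) that every downstream information mart can reuse instead of reimplementing:

```python
def clean_customer(row: dict) -> dict:
    # Centralized rules defined once in the Business Vault:
    # trim keys, collapse stray whitespace, cast source strings to numbers.
    return {
        "customer_id": row["customer_id"].strip(),
        "name": " ".join(row["name"].split()),
        "lifetime_value": float(row["lifetime_value"]),
    }

out = clean_customer({
    "customer_id": " C1 ",
    "name": "Ada  Lovelace",
    "lifetime_value": "12.5",  # arrives as text from the source system
})
```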
Finally, the information marts store data in a form that is easy to query.
Thanks to the Hub, Link and Satellite model, Data Vault lets you easily make changes (e.g. add a data source) without re-engineering everything. Every row in a Data Vault must carry record source and load date attributes, enabling an auditor to trace any value back to its source.