In my last post, I covered the subject of what BI is and where is headed, I thought on this one I could dig a bit deeper by talking about the specific’s terminologies and tools used for implementation.
One of the biggest pain points in BI is to choose the technology you are going to use. As each one will require big decisions on how the solution will be implemented. Technology is becoming faster and smarter, as a result we are getting more alternatives that will be better suited for each individual problem.
Today, we are going to cover what is a Database, Data warehouse, Data Lake and Data Vault. Understanding the advantages and differences of each one of them, we can plan for future scenarios to pick the right tool to use.
DATABASE
It is an organised collection of related data usually controlled by a Database Management System (DBMS). Data is typically modelled in rows and columns in a series of tables that have relationships between them, for efficient processing and retrieval. The data will require cleansing and transformation to be stored in a database (database normalisation).
Use case: When to use a database?
Databases are more efficient for transactional operations where the focus is on the current state of the business. Good for reporting but ideal for recording data. The ideal use cases are:
- Sales businesses: Use for storing customer, product and sales details.
- Manufacturing: Used for the data management of the supply chain and for tracking production of items, inventories status.
- Universities: To store student information, course registrations, subjects per degree, and results.
DATA WAREHOUSE
It is a centralised repository of data from one or more sources. This approach will also require data cleansing and transformation (star schema/normalization -3NF), for this reason, a data warehouse will normally be implemented using the extract, transform, load (ETL) process. This process will have 3 stages:
- Staging layer: data extraction (E from ETL process) from source systems.
- Integration Layer: Integrates and transforms (T from ETL process) the data from multiple sources to be arranged into dimensions (attributes) and facts (measurements and foreign keys)
- Loading stage: Getting the data into the data warehouse in a star schema ready for reporting (L from ETL process)
Use case: When to use a data warehouse?
Data warehouses are designed for analysing data, contains current and historical data. Separates analytics processing from transactional databases, improving the performance of both systems. The ideal use cases are:
- Insurance Companies: Used to analyse patterns, trends, and to track market fluctuations quickly.
- Healthcare sector: Used to strategize and predict outcomes, create patients reports, scheduling and results. Allow machine learning, big data can predict and help prepare for health care crisis and illnesses.
DATA LAKE
It is a centralised repository that allows us to store structured and unstructured data. The unstructured data can be loaded as-is, including emails, PDFs, binary data, without any cleaning or transformation. Data can be also loaded from structure source such as databases or data warehouses and anything in between, like semi-structured data, such as CVS files, XML, JSON.
The propose of this approach is to make available all data from different applications and sources to the experts to enable them to perform analysis and provide insights without carrying out a careful data design.
Use case: when to use a data lake?
A Data Lake allows to run different types of analytics, from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions, using all types of data. The ideal use cases are:
- Companies with teams of data experts: The data lake is more focused to be leveraged by a data expert that can upload data more easily and have the flexibility to run different types of analytics to discover insights.
DATA VAULT
It is a method and architecture for storing and delivering data. Data Vault data is generally RAW data sets, this means it requires loading data exactly as it is in the source system (no edits, no changes, no application of business rules to ensure is 100% auditable). Then, the data modelling will be done in the Data Vault Method which needs to break the data into a small number of standard components: Hubs (entities of interest to the business), Links (connect Hubs and may record a transaction or relationships) and Satellites (connect to hubs or links, contain their data in ‘Point in Time’).
The data vault is designed to run agile data warehouse projects, and the data warehouse is constructed by using the Data Vault standard components mentioned earlier.
The oversimplification of what a Data Vault is: a dynamic and customisable solution that allows access to all data, current and historical.
Use case: When to use a data vault?
A Data Vault will be more useful when the business requirements transform with time, the data vault will maintain the historical data for reference and easily implement the new standard or definitions of data that you required. The ideal use case is:
- A financial company: As they need to audit their data constantly but also need to add new data points making changes to the model and infrastructure. Using a data vault, the company will be able to checkpoint the time the change was made to the security model and infrastructure. This means the business team continues receiving the full view of historical and current information regarding the audit trail.
The final comparison of these technologies:
| Database | Data Warehouse | Data Lake | Data Vault | |
| Purpose | Record data | Analyse Data | Analyse Data by experts | Analytics and Data Science |
| Data structure | Structured | Structured | Structure, semi structure and unstructured | Unstructured load and then modelled to be structure in the Data Vault method |
| ETL | Yes | Yes | No | No |
| Security | High | High | Low | Medium |
| Flexibility of data | Medium | Low | High | High |
| Data Quality involved | Yes | Yes | No | No |
| Testing principle | Up to date data | Data correctness | Data volume and variety of sources | Data availability |
| Users | Data admin, developers | Business Professionals | Data Scientist | Data Scientist |
| Accessibility | Real time | Slow to make changes, needs data refresh to have up to date values | Highly accessible and quick to update | Not built for user accessibility. |
| Price/Performance | Fast query results using high cost storage | Fast query results using high cost storage | Query results getting faster using low-cost storage | Slow query results on high cost storage |
| Advantages | Secure Removes redundancy Real time data | Reliable Secure Good performance | Vast support High availability Huge volume | Flexible Scalable Accountable |