A data lakehouse is the new buzzword in the world of data. It is a relatively new concept that is getting more and more attention these days. In this short blog post I will talk about what a data lakehouse is and the top three advantages it offers.
First things first, what is a data lakehouse? Here is a definition from the DataBricks glossary:
“A data lakehouse is a new, open data management architecture that combines the flexibility, cost-efficiency, and scale of data lakes with the data management and ACID transactions of data warehouses, enabling business intelligence (BI) and machine learning (ML) on all data” https://databricks.com/glossary/data-lakehouse
Modern day data architecture
A couple of months ago I had a discussion with a client about data warehouses (DWH) built on top of relational database management systems (RDBMS). Going forward, the client was looking for a more flexible solution that would let them analyse the data in their data lake. Their main motive was that most of their data was now in flat files and they liked the flexibility that flat files gave them. They could just load the files into their data lake and come back to them later if they needed the data. They were reluctant to add the data to their existing RDBMS DWH because of the complexity and the lack of resources.
Coming from a SQL background, my initial approach was always to first get the data out of the flat files and load it into a database table before it could be analysed. This now seems like an unnecessary step.
The modern day data architecture looks like this:
Data comes in different formats and at different velocities. We have never generated so much data as quickly as we do today. Moreover, every day we are generating more data than the day before.
Data can come in batches or it can arrive in real time in streams. It can be structured, semi-structured or not structured at all. It can come from a myriad of Internet of Things (IoT) devices generating millions of messages every second. The volume of the data as well as the different types of data we are dealing with create new challenges.
Firstly, we need to be able to store the data somewhere. Secondly, once the data is stored it needs to be processed and analysed.
The new architecture makes use of data lakes for data storage and then DataBricks for data processing. The output is stored in data lakes in the format of delta tables which can then be consumed by data visualisation tools such as Power BI.
Advantages of a data lakehouse
Delta lakehouses allow us to do data warehousing using data lakes as the main storage solution. Here are the top three advantages of this approach:
1. SCALABILITY and PERFORMANCE (Separating storage from processing)
Data storage and data processing are separated. Your data lives in a data lake such as Azure Data Lake Storage Gen2 while the data processing happens in DataBricks. Data lakes have no practical storage limits, so we can load as much data as we want. At the same time, if we need more processing power to analyse this data, we can simply spin up a more powerful DataBricks cluster that will be able to handle it. Because storage and compute are separated, each can be scaled independently, which gives us scalability and better performance.
2. Support for different data types
CSV, JSON, Excel, Avro, Parquet and other file formats are all supported by DataBricks, either by default or with additional libraries. Data can come in batches or it can be streamed too. Your data lake can easily be mounted in DataBricks, allowing it to access all the files as if they were part of the internal DataBricks file system.
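To make the idea of one engine reading many formats a bit more concrete, here is a minimal sketch in plain Python. It is not DataBricks code: it only uses the standard library, and the payloads, the READERS table and the load function are made up for illustration. It shows two formats being read into the same record shape so that downstream processing does not care where a row came from.

```python
import csv
import io
import json


def read_csv_records(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def read_json_lines(text):
    """Parse newline-delimited JSON into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical payloads standing in for flat files sitting in the data lake.
csv_payload = "device_id,temperature\nd1,21.5\nd2,19.0\n"
json_payload = (
    '{"device_id": "d3", "temperature": 22.1}\n'
    '{"device_id": "d4", "temperature": 18.4}\n'
)

# One reader per supported format; adding a format means adding an entry here.
READERS = {"csv": read_csv_records, "json": read_json_lines}


def load(fmt, text):
    """Read the payload with the matching reader and normalise the types."""
    rows = READERS[fmt](text)
    return [
        {"device_id": str(r["device_id"]), "temperature": float(r["temperature"])}
        for r in rows
    ]


records = load("csv", csv_payload) + load("json", json_payload)
print(len(records), "records in one common shape")
```

In a real lakehouse the engine's own readers would do this work; the point of the sketch is only that every format ends up as the same kind of row.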
3. Data CONSISTENCY
Delta tables offer data consistency and schema enforcement by supporting ACID (atomicity, consistency, isolation, durability) transactions. Delta tables allow you to merge, delete and update your data. Time travel, in other words being able to see previous versions of the data, is possible as well. Change data capture (CDC) can be enabled too.
Delta tables are, in a way, very similar to regular RDBMS tables with one big difference: they are not database tables. They are a bunch of Parquet files stored in your data lake, together with a transaction log that keeps track of the changes.
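The following toy model may help to picture what merge, delete, schema enforcement and time travel mean in practice. It is a sketch in plain Python, not the Delta Lake API and not how Delta tables are implemented: the ToyVersionedTable class and all of its methods are invented for this post, and everything lives in memory. The only idea it borrows is that every change is committed as a new version, so older versions stay readable.

```python
import copy


class ToyVersionedTable:
    """Toy in-memory table: every commit appends a new immutable version."""

    def __init__(self, schema, key):
        self.schema = schema      # column name -> expected Python type
        self.key = key            # column used to match rows in a merge
        self.versions = [{}]      # version 0 is the empty table (key -> row)

    def _validate(self, row):
        # Schema enforcement: reject missing, extra or mistyped columns.
        if set(row) != set(self.schema):
            raise ValueError(f"columns {sorted(row)} do not match the schema")
        for name, expected in self.schema.items():
            if not isinstance(row[name], expected):
                raise ValueError(f"column {name!r} expects {expected.__name__}")

    def merge(self, rows):
        """Upsert: update rows whose key exists, insert the others."""
        for row in rows:
            self._validate(row)   # validate everything before changing anything
        snapshot = copy.deepcopy(self.versions[-1])
        for row in rows:
            snapshot[row[self.key]] = dict(row)
        self.versions.append(snapshot)   # one append = all-or-nothing commit
        return len(self.versions) - 1

    def delete(self, predicate):
        """Remove every row for which predicate(row) is true."""
        current = copy.deepcopy(self.versions[-1])
        snapshot = {k: v for k, v in current.items() if not predicate(v)}
        self.versions.append(snapshot)
        return len(self.versions) - 1

    def read(self, version=None):
        """Read the latest version, or an earlier one (time travel)."""
        snapshot = self.versions[-1 if version is None else version]
        rows = sorted(snapshot.values(), key=lambda r: r[self.key])
        return [dict(r) for r in rows]


table = ToyVersionedTable(schema={"id": int, "name": str}, key="id")
v1 = table.merge([{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}])
v2 = table.merge([{"id": 2, "name": "Robert"}, {"id": 3, "name": "Cy"}])
v3 = table.delete(lambda row: row["id"] == 1)

try:
    # The second row has the wrong type for "id", so the whole merge is rejected.
    table.merge([{"id": 4, "name": "Dee"}, {"id": "5", "name": "Eve"}])
except ValueError as err:
    print("rejected:", err)

print(table.read())             # latest version
print(table.read(version=v1))   # the table as it was after the first merge
```

Note that the rejected merge leaves no trace: the valid row with id 4 is not written either, which is the all-or-nothing behaviour that atomicity refers to.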
One other thing to mention is the Medallion data model. It is a three-stage architecture consisting of the bronze, silver and gold layers. It is a data quality framework applied to your data lakehouse.
The raw data is ingested in the bronze layer. It is a copy of your source data in its original format. The bronze layer can be seen as your staging layer. It is a place to land your data as it arrives. Then, the data is cleaned, transformed and validated in the silver layer. This is also the layer where data is joined with other objects. This layer can also be seen as your Operational Data Store (ODS).
Finally, the data is further processed, aggregated and prepared for use by end users in the gold layer.
The Medallion architecture gives you structure and helps to govern the data. It is very similar to the layers and stages of the usual ETL process in an RDBMS data warehouse.
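The three layers described above can be sketched in a few lines of plain Python. This is only an illustration under made-up assumptions: the order rows, the to_silver and to_gold functions and the cleaning rules are all hypothetical, and a real lakehouse would keep each layer in delta tables rather than in Python lists.

```python
# Bronze: hypothetical raw order rows, kept exactly as they arrived.
bronze = [
    {"order_id": "1", "country": " nl ", "amount": "10.50"},
    {"order_id": "2", "country": "BE", "amount": "4.00"},
    {"order_id": "2", "country": "BE", "amount": "4.00"},   # duplicate delivery
    {"order_id": "3", "country": "NL", "amount": "oops"},   # invalid amount
    {"order_id": "4", "country": "nl", "amount": "7.25"},
]


def to_silver(rows):
    """Silver: clean, validate and de-duplicate the raw rows."""
    seen, cleaned = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop invalid rows (a real pipeline might quarantine them)
        order_id = int(row["order_id"])
        if order_id in seen:
            continue  # skip rows already seen
        seen.add(order_id)
        cleaned.append(
            {
                "order_id": order_id,
                "country": row["country"].strip().upper(),
                "amount": amount,
            }
        )
    return cleaned


def to_gold(rows):
    """Gold: aggregate into a shape end users can report on."""
    totals = {}
    for row in rows:
        totals[row["country"]] = totals.get(row["country"], 0.0) + row["amount"]
    return totals


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

The bronze list is never modified, so the original data stays available if the cleaning rules in the silver step need to change later.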
Data lakes and data warehouses are two different things. Data warehouses are synonymous with data consistency, while data lakes are used for data storage. Alone, each does well in one aspect but falls short in the other. When the two come together we can start taking advantage of the best of both worlds, and this is exactly what a delta lakehouse is.
Delta tables mimic regular database tables. One could argue that we are reinventing the wheel. We already have data warehousing, so why step away from it if it works fine?
We used RDBMS for data analytics because most of our data came from OLTP systems such as SQL Server or Oracle. The new age of data is fundamentally different from what we had in the past. The sheer volume of data we have to deal with and the variety of that data create new challenges.
Delta lakehouses address these two challenges by allowing us to store the data in a limitless storage solution while at the same time allowing us to enforce data consistency with ACID transactions. In my opinion, these are the two most important advantages of delta lakehouses.
Delta lakehouses require us to think differently about the data. We need new skill sets such as Python or Scala and we need more data engineers! As of now it seems that the demand is bigger than the supply.
In my humble opinion, the data lakehouse will take over and replace the data warehouse in the long term. Whether this will happen remains to be seen :)