Demystifying Databricks Delta Lake (Format)
Nowadays, organizations can collect more data than ever before. That data comes in a variety of formats such as video, audio, and text, and potentially contains valuable insights for businesses. Most organizations therefore recognize data as a strategic asset for business initiatives such as growing revenue, improving products and services, and enhancing the customer experience. However, accessing and managing data has become increasingly complex.
To store and manage these volumes of data at low cost, data lakes have become a common solution. But traditional data lakes come with challenges. As they accumulate data in different formats, maintaining reliable data is difficult and can lead to inaccurate results. In addition, with few auditing and governance features, data lakes are hard to properly secure and govern.
By adding a transactional storage layer on top of a data lake, the lakehouse can be an answer to these challenges. To build a successful lakehouse, many organizations have turned to Delta Lake, an open-format data management and governance layer that combines the benefits of data lakes and data warehouses. Delta Lake increases productivity by optimizing for speed at scale with performance features like advanced indexing and schema enforcement. In addition, it operates flexibly in an open-source environment, with the data stored in Apache Parquet format. This reduces risk by quickly and accurately updating data in the data lake for compliance, and it maintains better governance through audit logging.
When talking to customers, some of the most common challenges we hear about ETL pipelines are reliability, performance, and flexibility. In recent projects, we have observed an increased demand for Databricks. Due to the high interest in Databricks know-how among our customers, we want to take a closer look at the Databricks architecture and examine different areas of the Databricks Delta Lake.
In this blog post, we will introduce the Databricks Delta format and its components, because this is a fundamental part of the Databricks Delta Lake and essential for understanding the key concepts behind it. We will briefly compare the Apache Parquet format with the Delta format and illustrate the key features. Then, we will describe the transaction log, which is one of the key elements for understanding Databricks Delta Lake under the hood. Finally, we will highlight the key features of Delta and underline them with examples.
From Parquet to Delta
What is Parquet?
Apache Parquet is an open-source, column-oriented data file format designed for efficient data storage and retrieval. It features efficient data compression and encoding schemes with enhanced performance for handling complex data.
Column-based storage is designed to be more efficient than row-based formats like CSV. An example is illustrated in Figure 01. In the CSV file, values of different kinds are mixed up row by row. When we convert it to a Parquet file, we achieve very efficient compression, because we no longer store the data row by row.
Instead, we store the data as columns, with each column as a separate block of data. Rather than storing the first column “a, a, b..” as many repeated strings, we store each string once and indicate how many times it is repeated. This is called run-length encoding (RLE). Thus, the more repetition we have in the data, the better Parquet compresses. Dictionary encoding also contributes to high compression ratios. As a result, aggregation queries are less time-consuming than in row-oriented databases.
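The idea behind run-length encoding can be sketched in a few lines of plain Python (this is an illustrative simplification, not Parquet's actual implementation):

```python
def rle_encode(values):
    """Compress a list into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return [(v, n) for v, n in runs]

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original list."""
    return [v for v, n in runs for _ in range(n)]

column = ["a", "a", "a", "b", "b", "c"]
encoded = rle_encode(column)
print(encoded)                        # [('a', 3), ('b', 2), ('c', 1)]
assert rle_decode(encoded) == column  # lossless round trip
```

Six values collapse into three pairs; the more repetitive the column, the larger the saving.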
Characteristics of Parquet:
- Open source file format
- Column-based format
- Language agnostic
- Highly efficient compression
- Structure is included in metadata
- Good for storing big data of any kind
- Saves on cloud storage space
- Increased data throughput and performance
However, there are a few problems inherent in Parquet. Repeatedly inserting data into a table results in lots of small Parquet files, because a Parquet file is immutable and cannot be modified after it is written. So, every transaction generates a new file. Another point to consider is that every Parquet file contains metadata, which means that when reading data, we need to scan a lot of metadata. Especially with big update queries where schemas have to be checked, this slows down data processing.
Another point is operational complexity. If we want to run transactions on a table, e.g., “DELETE FROM table WHERE...”, this is not possible in Parquet directly, so operational SQL does not work. To execute an operational update, we first need to read the Parquet file into a dataframe, filter the dataframe, write out a new Parquet file, and usually replace the existing file. Thus, several steps are needed to execute a single operation on Parquet.
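The read–filter–rewrite cycle can be sketched as follows. This is a plain-Python simulation of the steps (in practice you would use Spark and real Parquet files); the file and column names are made up:

```python
# Simulate an immutable Parquet file as a list of row dicts.
sales_file = [
    {"id": 1, "country": "Germany", "amount": 100},
    {"id": 2, "country": "France",  "amount": 200},
    {"id": 3, "country": "Germany", "amount": 150},
]

# Step 1: read the whole file into memory (a "dataframe").
df = list(sales_file)

# Step 2: filter out the rows we want to "delete".
df = [row for row in df if row["country"] != "Germany"]

# Step 3: write a brand-new file and replace the old one --
# the original file itself is never modified in place.
new_sales_file = df
print(new_sales_file)   # only the France row survives
```

A one-line SQL DELETE thus turns into a full read, an in-memory filter, and a complete rewrite of the file.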
In terms of auditing and recovery, we need to be careful with delete operations. For example, if we forget the WHERE filter in such a rewrite, we delete far more data than intended, and there are no backups and no rollbacks.
The problems of Parquet:
- Parquet files are immutable
- Every time we have a transaction, we have a new file (If data changes a lot, we have many files)
- Small files have a heavy performance impact
- Every Parquet file has metadata inside
- Operational complexity
- No backups and no rollbacks
- Difficult auditing and recovery mechanisms
- No indexes (full scan of the data every time)
What is Delta?
Delta is a file format that is based on Parquet files and that is fast for specific jobs. Here are two possible definitions of Delta:
“Databricks Delta is a unified data management system that brings reliability and performance (10-100 times faster than Spark on Parquet) to cloud data lakes”
“Delta Lake is an open format storage layer that delivers reliability, security and performance on your data lake — for both streaming and batch operations.”
Coming back to our previous example, where we wanted to delete data from a table: first, we load the data in the Sales folder into a dataframe, then we filter the data in memory and replace the old file with a new Parquet file that no longer contains the country “Germany”. Thus, the only way to delete is to replace the existing files with new files containing the non-deleted data.
In contrast, when we first create our Delta table, two Parquet files are still created. But additionally, a folder called “_delta_log” is created alongside them, which stores JSON files. These files are our transaction log: every added Parquet file is tracked in a JSON file together with some statistics. If we then delete data from the table, a new Parquet file is still generated; that has not changed. But with Delta, the old Parquet file is not deleted. Instead, the transaction log marks it as “removed”, so files are only deleted logically. Every change to the table is reflected in the transaction log, and in the JSON files you will find a lot of information describing each committed transaction.
Consequently, when we look into our “Sales” folder, we still see all the Parquet files. At query time, however, Delta uses the transaction log to decide which of these files are currently valid. In our example, Delta ignores the files “PART-001.parquet” and “PART-002.parquet” and reads only the latest file, “PART-003.parquet”. By looking into the transaction log, we can track which files were added and removed, and every change anyone makes to the table, whether it is a change of the table structure or the addition of new columns, is reflected there. Because this information is stored in the transaction log, we have a history of the table, we have backups, and we can roll back.
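Replaying the log to find the valid files can be sketched in plain Python. The log entries below are a hypothetical, heavily simplified version of the real _delta_log format (which records far more metadata per action):

```python
# Simplified _delta_log: each commit is a list of "add"/"remove" actions.
commits = [
    [{"add": {"path": "PART-001.parquet"}}],
    [{"add": {"path": "PART-002.parquet"}}],
    [{"remove": {"path": "PART-001.parquet"}},
     {"remove": {"path": "PART-002.parquet"}},
     {"add": {"path": "PART-003.parquet"}}],
]

def live_files(commits):
    """Replay the log in order to find which files are currently valid."""
    files = set()
    for actions in commits:
        for action in actions:
            if "add" in action:
                files.add(action["add"]["path"])
            if "remove" in action:
                files.discard(action["remove"]["path"])
    return files

print(live_files(commits))  # {'PART-003.parquet'}
```

Although all three Parquet files still sit in the folder, replaying the log tells the reader to scan only PART-003.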
SELECT * FROM sales TIMESTAMP AS OF '2018-10-18'
Hence, under the hood, Delta Lake stores the data in Apache Parquet format but adds features on top of it. Every transaction (commit) is recorded in the delta log as a JSON file. In the following, the key features are listed.
Delta key features
A Delta Lake guarantees that each read, write, or modification of a table has the following properties: atomicity, consistency, isolation, and durability. These ACID transactions ensure the highest possible data reliability and integrity: your data never falls into an inconsistent state because of an operation that only partially completes, and the serializable isolation level ensures that readers never see inconsistent data.
Streaming and batch unification: In Delta Lake, a delta table is a batch table as well as a streaming source and sink. This allows a unified view of both paradigms, stream processing and batch processing.
Schema enforcement: In Delta Lake, schema enforcement or schema validation is a feature that ensures data quality by rejecting writes to a table that do not match the table’s schema.
Schema evolution: The schema evolution implemented in Databricks Delta Lake allows users to easily change a table’s current schema to accommodate data that is changing over time. It is used when performing an append or overwrite operation, to automatically adapt the schema to include one or more new columns.
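Schema enforcement and evolution can be illustrated with a small sketch. The function and variable names here are made up for illustration; Delta Lake implements these checks inside the engine:

```python
# A toy table schema: column name -> expected Python type.
table_schema = {"id": int, "country": str}

def write_batch(rows, schema, merge_schema=False):
    """Reject rows that do not match the schema (enforcement); with
    merge_schema=True, add unknown columns instead (evolution)."""
    for row in rows:
        for col, value in row.items():
            if col not in schema:
                if merge_schema:
                    schema[col] = type(value)   # schema evolution
                else:
                    raise ValueError(f"column {col!r} not in table schema")
            elif not isinstance(value, schema[col]):
                raise ValueError(f"bad type for column {col!r}")
    return rows

write_batch([{"id": 1, "country": "Germany"}], table_schema)       # accepted
try:
    write_batch([{"id": 2, "amount": 99}], table_schema)           # rejected
except ValueError as e:
    print(e)
write_batch([{"id": 2, "amount": 99}], table_schema, merge_schema=True)
print(table_schema)   # schema now also contains 'amount'
```

The same write fails under enforcement but succeeds once schema evolution is allowed, which mirrors the mergeSchema behaviour described above.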
Data versioning and time travel: When writing out data into a Delta table, every operation is automatically versioned into the transaction log. Hence, we can access any version of the Delta table, this allows us to travel back to a specific version of the current Delta table.
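Time travel falls out of the same log-replay idea: to see an older version, we simply stop replaying the log after the requested commit. A minimal sketch, using the same simplified hypothetical log format as before:

```python
def snapshot_at(commits, version):
    """Replay only commits 0..version to reconstruct which files
    made up the table at that point in its history."""
    files = set()
    for actions in commits[: version + 1]:
        for action in actions:
            if "add" in action:
                files.add(action["add"]["path"])
            if "remove" in action:
                files.discard(action["remove"]["path"])
    return files

history = [
    [{"add": {"path": "PART-001.parquet"}}],                  # version 0
    [{"remove": {"path": "PART-001.parquet"}},                # version 1
     {"add": {"path": "PART-002.parquet"}}],
]
print(snapshot_at(history, 0))  # {'PART-001.parquet'} -- the old version
print(snapshot_at(history, 1))  # {'PART-002.parquet'} -- the latest
```

Because the "removed" files are still physically present, reading an old version is just a matter of replaying fewer log entries.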
Managing Delta Tables: In order to improve query speed and keep the delta lake clean and efficient, there are several optimization processes that Databricks supports.
The OPTIMIZE command compacts small files into larger, better-compressed files. This is treated like all other updates; the old files are NOT deleted. See Figure 08.
In this example, we have not ordered our small Parquet files, so data skipping only succeeds by occasional luck. If we use the OPTIMIZE command, we can additionally apply Z-ordering.
Z-Ordering: “Sort the data on specific columns before writing to files, to optimize data skipping”
In our example, we compact our four files into two larger files and sort the data column “name” alphabetically. As a result, we can skip the entire second file while executing our query.
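Why sorting enables skipping can be shown with a sketch of per-file min/max statistics (file names and the statistics layout are made up; Delta stores such statistics in the transaction log):

```python
# After Z-ordering, each file covers a contiguous range of "name".
files = [
    {"path": "part-1.parquet", "min_name": "Anna",  "max_name": "Lena"},
    {"path": "part-2.parquet", "min_name": "Maria", "max_name": "Zoe"},
]

def files_to_scan(files, name):
    """Keep only files whose [min, max] range can contain `name`."""
    return [f["path"] for f in files
            if f["min_name"] <= name <= f["max_name"]]

# A query for name = 'Ben' only needs the first file.
print(files_to_scan(files, "Ben"))   # ['part-1.parquet']
```

Without the sort, 'Ben'-like values could appear in every file, and no file could be skipped based on its min/max range.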
Another possibility to optimize query speed in Delta Lake is “Auto Optimize”, an optional feature that automatically compacts small files during writes to a Delta table. It consists of two complementary features:
- Optimized writes: Databricks dynamically optimizes partition sizes based on the actual data and attempts to write out files of about 128 MB. This is an approximate size and can vary depending on dataset characteristics.
- Auto compaction: After a write has finished, a lightweight OPTIMIZE job runs and looks for further file compaction opportunities, again targeting files of about 128 MB.
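The compaction idea behind both features can be sketched as a simple greedy bin-packing of small files toward the ~128 MB target (an illustrative simplification of what the engine actually does):

```python
TARGET_BYTES = 128 * 1024 * 1024  # the ~128 MB target mentioned above

def compact(file_sizes, target=TARGET_BYTES):
    """Greedily group small files into batches of roughly `target`
    bytes; each batch would be rewritten as one larger file."""
    batches, current, current_size = [], [], 0
    for size in file_sizes:
        if current and current_size + size > target:
            batches.append(current)   # close the current batch
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches

small_files = [30, 40, 60, 50, 80]   # sizes in MB for readability
print(compact(small_files, target=128))  # [[30, 40], [60, 50], [80]]
```

Five small files become three write targets, reducing both file count and per-file metadata overhead on subsequent reads.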
In summary, we have introduced the Databricks Delta Lake format and its key components. We have seen that Delta Lake stores the data in Apache Parquet format under the hood but adds a transactional layer on top. We have learned that a Delta Lake guarantees that each read, write, or modification of a table is ACID-compliant. Furthermore, every operation is automatically versioned into the transaction log, so we have backups and can roll back. We have also seen that Databricks Delta Lake offers several optimization mechanisms to improve query speed. We should now have a good overview and a better understanding of how Delta Lake's functionality helps organizations overcome issues with their data lakes.