Medallion Architecture: What, Why and How (2024)

The introduction of Medallion Architecture to enterprise data management, pioneered by Databricks and adopted by Microsoft in its Fabric platform, marks a significant shift. The architecture aims to address the complexities of organizing and processing data within data lakes and lakehouses. This blog provides a comprehensive overview, starting with a review of data lake and ELT concepts. It then delves into the challenges of organizing data in a data lake, highlighting the necessity of a structured approach. Medallion Architecture, with its Bronze, Silver, and Gold layers, offers a systematic framework for data organization, transformation, and consumption. It emphasizes incremental enhancement, flexibility, and governance, paving the way for advanced analytics and machine learning endeavors.

In the world of enterprise data management, medallion architecture has gained significant attention in recent years. First introduced by Databricks, it was adopted by Microsoft as the de facto design guideline for OneLake, the central data store of its new data platform Microsoft Fabric, released in November 2023. As Microsoft places Fabric at the center of its data offerings, medallion architecture is becoming an important part of its data framework and will receive more attention in the coming years.

So what is medallion architecture, what challenges does it resolve, and how should it be implemented? This blog answers these questions by explaining the underlying principles that shaped the architecture, along with some general implementation guidelines.

If you have heard about medallion architecture, you have likely also heard of, or may even be very familiar with, terms such as data warehouse, data lake, and data lakehouse. Medallion architecture is used in the context of these terms. The next few paragraphs provide a brief review of these concepts, with a focus on establishing a common ground for understanding medallion architecture.

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. Unlike traditional data storage solutions, data lakes are designed to store data in its raw format (the same format in which it arrives from the source), whether it’s structured data like relational databases, semi-structured data like JSON or XML files, or unstructured data like text documents, images, videos, and log files. This means you can store vast amounts of data from various sources without having to structure it first.

One of the key advantages of a data lake is its flexibility. Users can store data without worrying about its structure or format, making it suitable for storing diverse types of data for various purposes such as analytics, machine learning, and data exploration. Additionally, data lakes often support a variety of processing frameworks and tools for querying, analyzing, and extracting insights from the stored data.

Closely related to the data lake is the concept of the data lakehouse. Data in a data lake generally needs to be cleansed and conformed before it can be presented to business users. Because data lakes aim to store data in raw format, they generally lack the consistency and integrity needed for business reporting. Data warehouses, on the other hand, provide exactly that, but require much more effort and cost to build. When you combine a data lake and a data warehouse, building a consolidated data view by storing raw data in the lake and cleansed data in the warehouse, you get both the scalability and flexibility of the data lake and the analytical capability and integrity of the data warehouse. Such a solution is called a data lakehouse.

Other terms closely related to data lakes and data warehouses are ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform). The three letters in these two acronyms have the same meanings; only the order of operations differs. In ETL, data is transformed into a predefined format and then loaded into the target storage, while in ELT, data is loaded into the destination as-is and transformed only when needed. With ETL, you need to know what the target state looks like before you can design and implement the transformations; with ELT, you can load the data regardless of its format and defer transformation until it is requested. The ELT process is therefore often preferred in current data engineering architectures, especially with the rise of data lake systems, which store large volumes of data in raw formats.
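To make the contrast concrete, here is a minimal sketch of the two patterns in PySpark. The paths, tables, and column names are hypothetical; the point is only where the transformation step sits relative to the load.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-vs-elt").getOrCreate()

# ETL: transform into a predefined shape *before* loading the target.
raw = spark.read.json("s3://landing/orders/")  # hypothetical source path
conformed = (raw
    .select("order_id", "customer_id", "amount", "order_ts")
    .withColumn("order_date", F.to_date("order_ts")))  # target shape known up front
conformed.write.mode("append").saveAsTable("dw.orders")

# ELT: load the raw data as-is; transform later, only when needed.
spark.read.json("s3://landing/orders/") \
    .write.mode("append").saveAsTable("lake.orders_raw")

# The transformation is deferred to query/processing time:
daily_totals = (spark.table("lake.orders_raw")
    .groupBy(F.to_date("order_ts").alias("order_date"))
    .agg(F.sum("amount").alias("daily_total")))
```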

However, the introduction of data lakes and ELT processes also created a problem: how should you organize the data in a data lake and design the ELT processes accordingly? A traditional data warehouse usually consists of the following three layers:

1. Staging Area: This is an intermediate storage area where data is temporarily held during the ETL process. Staging areas allow for data validation, transformation, and error handling before it is loaded into the data warehouse.

2. Data Warehouse: The central repository for storing structured and organized data. It typically consists of a relational database optimized for analytical queries and reporting. Data in data warehouses is organized by business subject areas.

Sometimes there is an additional layer before the data warehouse, where data is organized by source, typically the transaction systems that generate it. This layer usually stores data that only needs to be retained for a short time. It can also be used for operational reporting, and is often called an Operational Data Store (ODS).

3. Data Mart: A subset of the data warehouse that is focused on a specific business function or department. Data marts are often designed for easier access and analysis by specific user groups, and are usually modeled as star schemas.


It is easy to envision that the data mart layer will contain only well-formatted, analytics-oriented data, while the data warehouse layer will contain all cleansed data organized by business subject areas, and the ODS will be organized by source systems. Data mart, data warehouse, and ODS are all long-term storage layers: once loaded, data in these layers is stored for a long time, usually several years. Before that, data passes through the staging area, which contains temporary data generated and utilized by the ETL process and truncated as the ETL processes finish their execution. In this traditional data warehouse ecosystem, the target state of each layer is well understood and well defined. Data modelers have full confidence in drawing the lines between layers and defining the processing needs of the ETL processes.

Now, when you move into a data lake world, things become tricky. By definition, data lakes store all data in its raw format. That is simple enough. But eventually you want to use this data for analytical purposes, and you usually cannot perform analysis directly on raw data. You need to cleanse it, transform it, combine it, and perform the necessary calculations and aggregations. If you stick to the original idea of ELT, all of this happens on top of the raw data whenever you run your reports. But data transformation takes time and resources, sometimes a lot of them. Naturally, people want to save some intermediate transformation results in a reusable form, either as a starting point for further calculation or as a direct cache for reporting. But if all these transformations happen in the data lake, with their results saved as intermediate files, where are the boundaries of processing?

You can clearly see that, between the starting raw data and the target data that is readily available for reporting and analysis, there are many different steps and operations (especially joins and aggregations) you can perform. If you try to perform every possible transformation and save the result of each step, in most real-world scenarios you end up with too many intermediate files that are used only once or twice. Storing and organizing them is a huge pain. Most of this data, half-processed or over-processed, will not fit your immediate needs and will soon be obsolete. Even worse, it still exists in your system and occupies the namespace. The end result is a huge data swamp of rotten data that nobody can use.

In order to resolve this issue, you need a framework to define the proper state of data, especially intermediate data. You need to define what kinds of data can be retained in the data lake, and how to organize them. This can be addressed by the Medallion Architecture.

A medallion architecture serves as a data design blueprint for organizing data within a lakehouse environment. Its primary aim is to enhance the structure and quality of data incrementally as it moves through the successive layers of the architecture, from Bronze to Silver to Gold.

Bronze layer

The Bronze layer serves as the initial landing ground for all data originating from external source systems. Datasets within this layer mirror the structures of the source system tables in their original state, supplemented by extra metadata columns such as load date/time and process ID. The primary emphasis here is on Change Data Capture, enabling historical archiving of the source data, maintaining data lineage, facilitating audit trails, and allowing for reprocessing if necessary without requiring a fresh read from the source system.
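As a sketch of what this looks like in practice, here is a minimal Bronze ingestion job, assuming PySpark and Delta Lake; the paths, table names, and process ID are hypothetical. The source data is kept as-is, with only audit metadata columns added.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bronze-ingest").getOrCreate()

# Land the source data in its original shape.
raw = spark.read.json("s3://landing/crm/customers/")  # hypothetical path

bronze = (raw
    .withColumn("_load_ts", F.current_timestamp())      # load date/time
    .withColumn("_process_id", F.lit("crm_daily_001"))  # ID of the ingesting run
    .withColumn("_source_file", F.input_file_name()))   # lineage back to the file

# Append-only: preserves a full history of source data so it can be
# reprocessed later without a fresh read from the source system.
bronze.write.format("delta").mode("append").saveAsTable("bronze.crm_customers")
```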

Silver layer

The next layer of the lakehouse is the Silver layer. Within this layer, data from the Bronze layer undergoes a series of operations that bring it to a “just-enough” state (discussed in detail later). This prepares the Silver layer to offer an encompassing “enterprise view” comprising essential business entities, concepts, and transactions.

Gold layer

The last layer of the lakehouse is the Gold layer. Data within the Gold layer is typically structured into subject-area-specific databases, primed for consumption. This layer is dedicated to reporting and employs denormalized, read-optimized data models with minimal joins. It serves as the final stage for applying data transformations and quality rules. Commonly, you will find Kimball-style star-schema data marts within the Gold layer of the lakehouse.
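A typical Silver-to-Gold load might look like the following sketch, again assuming PySpark and hypothetical table names: dimensions are joined in once, and the result is pre-aggregated and denormalized so that reports need minimal joins.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("gold-load").getOrCreate()

orders = spark.table("silver.orders")
customers = spark.table("silver.customers")

# Denormalize and aggregate into a read-optimized, consumption-ready table.
sales_by_region = (orders
    .join(customers, "customer_id")  # resolve the dimension once, at load time
    .groupBy("region", F.to_date("order_ts").alias("order_date"))
    .agg(F.sum("amount").alias("total_sales"),
         F.countDistinct("order_id").alias("order_count")))

sales_by_region.write.format("delta").mode("overwrite") \
    .saveAsTable("gold.sales_by_region")
```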


Now that you know what a medallion architecture consists of, let’s see how it helps with data lake organization. There is not much debate about the bronze and gold layers. The former is just the raw data plus metadata. The latter is the equivalent of the data mart layer in the traditional data warehouse ecosystem, containing consolidated and aggregated data ready for reporting. It is the silver layer that warrants a deeper dive.

By definition, the silver layer contains transformed data. It prepares data from diverse sources into a unified view and gets it ready for further transformation into the Gold layer. The data engineering paradigm associated with the silver layer is usually ELT, which applies only minimal or “just-enough” transformations during the loading process. These “just-enough” transformations typically include the following tasks:

- Data cleansing: Also known as data cleaning or data scrubbing, data cleansing is the process of identifying and correcting errors, inconsistencies, and inaccuracies in a dataset to improve its quality and reliability for analysis and decision-making. It is a crucial step in data preparation and is typically performed before data analysis or integration. Common tasks include removing duplicates, correcting typos, standardizing data formats (especially dates and addresses), and handling missing values (see the sketch after this list).

- Data verification: Data verification is the process of ensuring that data is accurate, consistent, and reliable through various validation techniques. It usually includes validating data against known quality control measures, confirming that data meets corporate data governance policies, resolving inconsistencies by cross-referencing different data sources or applying business rules and logic, standardizing and normalizing data, and handling outliers.

- Data conforming: Data conforming refers to the process of ensuring that data adheres to specific standards, formats, or requirements. It involves transforming and standardizing data to make it consistent and compatible with a particular data model, schema, or system. Data conforming is essential for integrating data from diverse sources, ensuring interoperability, and facilitating data analysis, reporting, and exchange.

- Data matching for integration purposes: To provide a unified and consistent view of an enterprise, data from different sources must eventually be consolidated and integrated. To make data consolidable, you must bring together data from disparate sources, identify the relationships between records across different datasets or sources, and transform them into consistent and coherent formats. The latter is usually achieved by generating a universal primary key for the different datasets that describe the same data entity.
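Here is a minimal sketch of such “just-enough” transformations, assuming PySpark and Delta Lake; all table and column names are hypothetical. It combines basic cleansing with a derived universal key for matching records across source systems.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("silver-load").getOrCreate()

bronze = spark.table("bronze.crm_customers")

silver = (bronze
    # Data cleansing: remove duplicates, standardize formats,
    # handle missing values.
    .dropDuplicates(["source_customer_id"])
    .withColumn("email", F.lower(F.trim("email")))
    .withColumn("signup_date", F.to_date("signup_date", "yyyy-MM-dd"))
    .fillna({"country": "UNKNOWN"})
    # Data matching: derive a universal key so records describing the
    # same entity can be related across source systems.
    .withColumn("customer_key",
                F.sha2(F.concat_ws("|", F.lower("email"),
                                   F.col("source_system")), 256)))

silver.write.format("delta").mode("overwrite").saveAsTable("silver.customers")
```

Note that no joins or aggregations happen here; those are deferred to the gold layer load.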

After these tasks, the data in the silver layer is meaningful and consolidable, ready to be served to the gold layer. The next question, then, is: should you perform more transformations in the silver layer? As mentioned previously, between this cleansed state and the gold layer there are many other transformations that can happen, typically column calculations, joins/merges, and aggregations. This is a highly debatable question, and the answer often depends more on personal preference than on science. The general rule of thumb is: unless there is a clear indication that a transformation is needed in multiple places and that materializing it would save significant time and resources, don’t add extra work. There are always many transformations you can perform based on project-specific needs and business rules, but most of these should be reserved for the data loading tasks from the silver layer to the gold layer.

The silver layer is often compared to the ODS and DWH layers of the traditional data warehouse architecture. All three serve very similar purposes, largely collecting and preserving the entirety of enterprise data, so they do share some common characteristics. However, there are distinctive differences among them, since they are designed following different methodologies. A data warehouse is usually organized by subject areas, which implies a defined relationship among tables and hierarchies. In a silver layer, data should remain in its original schema, with only the minimum required transformations applied. An ODS is usually a collection of data from different transaction systems, with some ETL effort to integrate the data, such as a common key. This bears some resemblance to the silver layer. However, as discussed above, the silver layer should have only just-enough transformations, whereas most ODSs apply a certain degree of further calculation, joining, and aggregation to satisfy the needs of operational reporting.

Medallion Architecture is a data design pattern for better data organization. It is not a data model: it is neither dimensional, nor data vault, nor relational. Instead, a system built under the medallion architecture can hold files and tables designed on any of these modeling principles. Similarly, medallion architecture can fit into many data management ecosystems, such as data warehouses, data lakes, and lakehouses. As discussed at the very beginning of this blog, data lakes consist of data files in raw format and are cheaper, more flexible, and more scalable, while data marts usually adopt Kimball-style star schemas and provide better performance and integrity for analytical purposes. As such, the bronze layer, which is the raw data plus metadata, usually resides in a data lake. The gold layer, which serves reporting and analytical needs, usually resides in a dimensional data warehouse, though it can also be implemented in a data lake. The silver layer can be extremely flexible from a data modeling standpoint: it can adopt models akin to third normal form (3NF) structures, data vault models, or even NoSQL schemas, and you can have all of these in the same silver layer. This wide variety of data models means the silver layer is usually built on a data lake, which provides enough flexibility. So most likely, you will see a medallion architecture in a lakehouse environment, using a data lake for the bronze and silver layers and a data warehouse for the gold layer.

It is even possible to build a medallion architecture on top of multiple data vendors. For example, you can have the bronze layer span three different clouds, holding data from the apps running on each cloud, build the silver layer on the specific cloud whose data transformation tools best meet your needs, and set up multiple gold layer datasets in different clouds for different reporting purposes, thus forming a multi-cloud data platform. All these setups fit the medallion architecture perfectly.


Overall, Medallion Architecture provides a well-governed, holistic view of enterprise data by organizing it into different layers (Bronze, Silver, Gold), meeting the needs of various stakeholders from raw data ingestion to consumption-ready formats for analytics and reporting. It provides the flexibility required by the ever-changing nature of business and the opportunity to improve your data ecosystem incrementally and progressively. Most importantly, it lays a solid foundation for data governance and compliance. By organizing data according to its maturity and setting boundaries for the processing needed in each layer, it becomes easy to manage metadata, track lineage, audit DataOps, and ensure data integrity and regulatory compliance in a medallion architecture based data lakehouse. With better organization comes better utilization and management: the medallion architecture can surely help you reduce cost, simplify management, and improve efficiency.

As a cloud solution architect, I am actively implementing medallion architectures for my clients right now. There are lots of opportunities and challenges out there, especially as companies move into the cloud, and increasingly into multiple clouds. The flexibility of medallion architecture makes it ideal for these situations. If you would like to discuss medallion architecture, data lakehouses, or anything cloud or data related, please feel free to leave your thoughts in the comment area. Looking forward to your valuable feedback!
