Issue #22: What makes the Medallion Architecture Different? (2024)

This has been a great week for high-quality articles on data; I had to stop looking for more earlier than usual, as I have a guide to finish. This week we have:

  • A short rant about the Medallion Architecture

  • Is Kimball Still Relevant?

  • Hello, World of CDC!

  • Grai: Open Source Data Lineage

  • You Can’t Master Data in a Database

  • Data Parallel, Task Parallel, and Agent Actor Architectures

  • Standardized Data Product Metadata Examples Based on Real-World Published Data Products

What makes the Medallion Architecture Different?

I’ve seen a few comments on social media about the Medallion Architecture, saying it looks no different from the classic three-tier structure of Raw, Conformed and Enriched found in batch Data Warehouses, and that it’s just meaningless buzzwords created by Databricks to attract more sweet Venture Capital money.

Could Databricks have come up with less vague layer names than Gold, Silver and Bronze? Maybe. I suspect they wanted a set of names that doesn’t tie them to one modelling style, to show how flexible the Medallion Architecture is (and sell more Databricks).

But that doesn’t change the fact that the Medallion Architecture does differ from other architectures.

What makes the architecture different is that Databricks supports both batch and streaming using the same technology across all three layers, whereas classic Batch, Lambda and Kappa Architectures have separate batch and real-time processing technologies.

This arguably makes it better than all of the above: it’s flexible enough to support both streaming and batch, while having lower maintenance than Lambda and Kappa because you only need one data processing product, not two (though you’d still keep the streaming and batch pipelines separate).
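To make the three layers concrete, here is a minimal sketch of the Bronze ⇒ Silver ⇒ Gold flow, using plain Python dicts to stand in for lakehouse tables; the column names and rules are my own invented example, not anything from Databricks.

```python
# Illustrative sketch of the Bronze -> Silver -> Gold flow, with plain
# Python structures standing in for Delta tables.

def to_silver(bronze_rows):
    """Clean and deduplicate raw events (Silver keeps full granularity)."""
    seen = set()
    silver = []
    for row in bronze_rows:
        if row["order_id"] in seen or row["amount"] is None:
            continue  # drop duplicates and rows failing basic validation
        seen.add(row["order_id"])
        silver.append(row)
    return silver

def to_gold(silver_rows):
    """Aggregate to a business-level projection (revenue per customer)."""
    gold = {}
    for row in silver_rows:
        gold[row["customer"]] = gold.get(row["customer"], 0) + row["amount"]
    return gold

bronze = [
    {"order_id": 1, "customer": "a", "amount": 10.0},
    {"order_id": 1, "customer": "a", "amount": 10.0},  # duplicate event
    {"order_id": 2, "customer": "b", "amount": None},  # bad record
    {"order_id": 3, "customer": "a", "amount": 5.0},
]
print(to_gold(to_silver(bronze)))  # {'a': 15.0}
```

In a real lakehouse the same shape of transformation would run as batch jobs or streaming jobs over Delta tables, which is the point of the architecture.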

And just so it doesn’t look like I’m shilling Databricks: this architecture can also be applied in Apache Flink, and possibly Snowflake, if Lakehouses are not your thing.


Though I will say that if you’re doing no stream processing, then yeah, it’s just classic batch processing.

Is Kimball Still Relevant?

Joe Reis, co-author of the excellent Fundamentals of Data Engineering book, woke up in a fiery mood on Friday:

Here’s the deal. If you’re aware of the various data modeling approaches and can pick the right approach for your particular situation, terrific. You’re a competent and thoughtful professional. To completely ignore data modeling is professionally negligent, and I’ll argue you’re unfit for your job. We can do better as an industry. Don’t burn down data modeling just yet…

I’m tempted to frame the above paragraph. The rest of the post is just as good.

I will add this, though: while the world of Data Engineering (DE) may feel a bit lukewarm on Kimball models, as there are arguments that they don’t scale as well as Data Vault, Activity Schema or One Big Table, I feel Kimball is in wider use than at any other point in time, because it’s the default way to model data in Self-Service Business Intelligence (BI) applications like Power BI and Tableau.

And BI is ten times bigger in usage than DE; I say that as a DE myself.

Though if you hate the idea of Kimball models in your BI apps, I’d check out Narrator, which uses Activity Schema.

I think an argument can be made that we’re living in an era where it’s common for a large organisation to use multiple types of data models, whereas 15 to 30 years ago you could only use Kimball and you’d be called crazy for questioning it (though I could be wrong, I was still in school then!).
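For anyone who hasn't worked with Kimball models, here is a toy star schema: one fact table keyed to dimension tables, queried with the kind of slice-and-dice aggregation BI tools generate. All table and column names are made up for illustration.

```python
# A minimal, made-up Kimball-style star schema: a sales fact table joined
# to product and date dimensions via surrogate keys.

dim_product = {
    1: {"name": "widget", "category": "hardware"},
    2: {"name": "gizmo", "category": "hardware"},
}
dim_date = {
    20240101: {"year": 2024, "month": 1},
    20240201: {"year": 2024, "month": 2},
}
fact_sales = [
    {"product_key": 1, "date_key": 20240101, "amount": 100.0},
    {"product_key": 2, "date_key": 20240101, "amount": 50.0},
    {"product_key": 1, "date_key": 20240201, "amount": 75.0},
]

# "Sales by category and month" -- the shape of query a BI tool would
# issue against a star schema.
totals = {}
for f in fact_sales:
    key = (dim_product[f["product_key"]]["category"],
           dim_date[f["date_key"]]["month"])
    totals[key] = totals.get(key, 0) + f["amount"]
print(totals)  # {('hardware', 1): 150.0, ('hardware', 2): 75.0}
```

The same query in SQL is two joins and a GROUP BY, which is exactly why self-service BI tools lean on this model.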

Hello, World of CDC!

I’ve covered Change Data Capture (CDC) in previous issues, but this three-part series (so far) by Ryan Blue, former Senior Engineer at Netflix and now CEO of Tabular, arguably goes into more depth about implementing CDC, the issues you might run into, and how to solve them.
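At its core, consuming CDC means applying an ordered stream of insert/update/delete events to a target table. A toy sketch of that merge step, with an event shape I've invented for illustration (not the format from the series):

```python
# A toy sketch of applying CDC events to a target table, keyed by
# primary key. Real pipelines must also handle ordering, late events
# and schema changes, which is where the hard problems live.

def apply_cdc(table, events):
    """Apply a batch of change events to the table, in order."""
    for e in events:
        if e["op"] == "delete":
            table.pop(e["id"], None)
        else:  # "insert" and "update" both upsert the latest row image
            table[e["id"]] = e["row"]
    return table

events = [
    {"op": "insert", "id": 1, "row": {"name": "Ada"}},
    {"op": "update", "id": 1, "row": {"name": "Ada L."}},
    {"op": "insert", "id": 2, "row": {"name": "Grace"}},
    {"op": "delete", "id": 2},
]
print(apply_cdc({}, events))  # {1: {'name': 'Ada L.'}}
```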

Grai: Open Source Data Lineage

Grai is a new start-up offering Open Source Data Lineage with a cloud option. It also has features to show the downstream impact of failing Data Quality tests.

You Can’t Master Data in a Database

This is a great article on something I’ve been thinking about for a while: Master Data Management (MDM) / Customer 360 / Single View of the Customer should be done as close to the operational data processing as possible, rather than after the data has been imported into analytical storage.

You want to master data at the source, or as close to it as possible, so duplicates have less impact than if the data is exported to the analytical plane to be mastered. Steve Jones of Capgemini lists the above and many other reasons why MDM is the solution to a business operations problem, not an analytical data problem.

Though I will argue that it can be hard to get this view across in a large organisation, so MDM ends up closer to the analytical data, because that is where the pain of having no mastered data is felt most.

This article is also great and on a very similar theme, talking about placing MDM close to the operational data in a Data Mesh context.
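For the avoidance of doubt about what the "mastering" step itself is, here is a toy golden-record merge: collapsing duplicate customer records into one, the kind of logic that is cheaper to run at the operational source than after the duplicates have fanned out into analytics. The merge rule and fields are my own simplified example.

```python
# A toy "golden record" merge: combine duplicate records of one customer,
# preferring the newest non-empty value for each field.

def golden_record(records):
    """Merge duplicates, with newer records winning on non-empty fields."""
    merged = {}
    for rec in sorted(records, key=lambda r: r["updated"]):
        for field, value in rec.items():
            if value not in (None, ""):
                merged[field] = value  # later (newer) records overwrite
    return merged

dupes = [
    {"email": "j@example.com", "phone": "", "name": "J. Smith", "updated": 1},
    {"email": "j@example.com", "phone": "555-0100", "name": None, "updated": 2},
]
print(golden_record(dupes))
```

Real MDM adds fuzzy matching, survivorship rules and stewardship workflows on top, but this is the basic shape.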

Data Parallel, Task Parallel, and Agent Actor Architectures

If you’re a big nerd like me and want to know how distributed data processing solutions like Spark, Flink and Ray work under the hood, this is the perfect article for you.

Zander Matheson, Konrad Sienkowski and Oli Makhasoeva of the Stream Processing product Bytewax go through three common types of distributed compute, their pros and cons, and their use cases.
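As a rough single-machine illustration of the first two of those models, the sketch below uses threads as stand-ins for the workers a system like Spark, Flink or Ray would schedule across machines; it is my own simplification, not the article's code.

```python
# Data parallel vs task parallel, in miniature, using threads as workers.
from concurrent.futures import ThreadPoolExecutor

data = list(range(8))

# Data parallel: the SAME function runs over partitions of the data,
# as in a Spark map over an RDD.
with ThreadPoolExecutor(max_workers=4) as pool:
    squared = list(pool.map(lambda x: x * x, data))

# Task parallel: DIFFERENT tasks run concurrently, each with its own job,
# as in a DAG of distinct operators.
def stage_sum(xs):
    return sum(xs)

def stage_max(xs):
    return max(xs)

with ThreadPoolExecutor(max_workers=2) as pool:
    total = pool.submit(stage_sum, squared)
    biggest = pool.submit(stage_max, squared)
    print(total.result(), biggest.result())  # 140 49
```

The agent/actor model adds long-lived stateful workers that exchange messages, which plain thread pools don't capture; see the article for that one.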

Standardized Data Product Metadata Examples Based on Real-World Published Data Products

While there is a lot of talk about how Data Products in Data Mesh should have a consistent metadata model across the organisation, we haven’t seen many examples shared in public, likely because organisations that have adopted Data Products don’t want to share their meta-model for fear it would give away company secrets or increase security risks.

To help organisations figure out what metamodel their data products should contain, Jarkko Moilanen of API Economy Hacklab has co-authored an open source specification with a few examples.

While the specification looks like a great starting point, I don’t think it’s the final say on the matter; I would like more detail in the Data Quality section, including which tests are run.
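To show the kind of thing I mean, here is a hypothetical data product metadata record. The field names below are my own guess at a sensible minimum, not the fields of Moilanen's specification; note the data quality section names the actual tests.

```python
# A hypothetical data product metadata record. Every name, address and
# field here is invented for illustration.

data_product = {
    "name": "customer_orders",
    "owner": "sales-data-team@example.com",
    "version": "1.2.0",
    "output_port": {"format": "parquet", "location": "s3://example-bucket/orders/"},
    "schema": [
        {"column": "order_id", "type": "string", "nullable": False},
        {"column": "amount", "type": "decimal(10,2)", "nullable": False},
    ],
    # The detail I'd like a spec to require: which quality tests are run.
    "data_quality": [
        {"test": "unique", "column": "order_id"},
        {"test": "not_null", "column": "amount"},
    ],
}
print(sorted(data_product))
```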

Sponsored by The Oakland Group, a full service data consultancy. Download our guide or contact us if you want to find out more about how we build Data Platforms.

Thanks for reading The Data Platform Journal! Subscribe for free to receive new posts and support my work.
