In October I stood in the vendor hall at dbt’s Coalesce in San Diego and found myself surrounded by booth after booth of data catalog companies, each one promising to ‘bring visibility’ to my data stack and ‘unlock the value’ of my data warehouse.

Despite my tone, I’m not fundamentally opposed to data catalogs or ignorant of the value of surfacing these kinds of insights. But admittedly I’ve always found them boring. The mere mention of the topic makes my eyes glaze over, like someone’s talking about cars, taxes, or crypto. (No shade if you like those things - they just aren’t my bag.)

So when my longtime colleague Zac Ruiz mentioned he was starting a metadata company of his own, I thought it was time I confronted my own biases toward metadata and saw what was so compelling about this topic that it filled an entire San Diego conference hall.

Big thanks of course to Zac Ruiz for taking the time to talk to a metadata-skeptic like me!

The basics

What do we mean by metadata?

“Metadata is data about the data. There’s infinite data, so there’s infinite metadata.” - Zac Ruiz

Metadata in our context refers to data about our data. This could be things like:

the number of rows in a table
when the table was last updated
the column definitions of a table
the lineage of that table
which dashboards or reports use that table
who has access to that data
how many null values are in a column
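
To make a few of these concrete, here is a minimal sketch of how some of them (row count, column definitions, nulls per column) could be collected. It assumes a SQLite database purely for illustration; the helper name table_metadata and the sample rows are hypothetical, and a real warehouse would expose the same facts through its own information schema.

```python
import sqlite3


def table_metadata(conn, table):
    """Collect a few basic facts about a table: row count, columns, nulls per column."""
    cur = conn.cursor()
    # PRAGMA table_info returns one row per column:
    # (cid, name, type, notnull, dflt_value, pk)
    columns = [(row[1], row[2]) for row in cur.execute(f'PRAGMA table_info("{table}")')]
    row_count = cur.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    null_counts = {
        name: cur.execute(
            f'SELECT COUNT(*) FROM "{table}" WHERE "{name}" IS NULL'
        ).fetchone()[0]
        for name, _ in columns
    }
    return {
        "table": table,
        "row_count": row_count,
        "columns": columns,
        "null_counts": null_counts,
    }


# Hypothetical sample data, including the inconsistent country codes analysts run into.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments_main (id INTEGER, amount REAL, country TEXT)")
conn.executemany(
    "INSERT INTO payments_main VALUES (?, ?, ?)",
    [(1, 9.99, "US"), (2, 25.0, None), (3, None, "usa")],
)

meta = table_metadata(conn, "payments_main")
print(meta)
```

None of this says what a payment means to the business, but everyone looking at the output will read it the same way.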

The go-to metadata story

The story I hear most often about metadata goes something like this:

A new analyst, a few weeks into the job, is asked by the marketing team to pull some data on payments. They look at the database and see three tables with the word “payments” in their names: payments_main, payments_fac, payments_marketing. So the analyst goes to the data catalog and looks up some metadata about each table, noting things like definitions, when they were last updated, the number of rows, and how many downstream entities are attached to each one. Based on that information, they figure out which table they ought to use, and pull the data easily.

Sound too good to be true?

I’ve lived that scenario and didn’t find the metadata at all helpful in choosing which of those tables to use. I still had to go to another person on the data team to tell me what each of them really meant in the context of what I was being asked to do. In short, the metadata let me down.

Thankfully, Zac had a different take on metadata that I could get behind.

A more compelling case for metadata

Let’s start with some philosophy

To explain the importance of metadata, Zac began with a notion of hyperobjects: objects that are so massively distributed in time and space as to transcend spatiotemporal specificity, such as global warming, styrofoam, and radioactive plutonium [1].

Zac explained it to me with a more accessible metaphor: Pizza Hut. Since there are Pizza Huts all around the world, no two people experience Pizza Hut the same way. Each of us has only experienced a piece of the whole (pie) and thus lacks a holistic understanding of the entire entity of Pizza Hut.

Data, Zac argues, is like Pizza Hut.

We all come to data with different understandings and expectations. Analysts may know a table’s columns by heart, and know that sometimes people enter ‘us’ and sometimes ‘US’ or ‘USA’. But a business user may know it as something else entirely.

It’s this nature of data as hyperobjects that makes data so intrinsically difficult to talk about - a single column of data can mean something different to everyone.

“I'm scared of data because data is representing something that is so complex that typically people can't come to the same understanding of what it is. And that’s dangerous.” - Zac

The need for a common language

All of us in data can agree the most challenging moments we have are when trying to talk to someone else about data, especially someone outside the data team. These conversations are full of loaded words and phrases: what ‘profit’ means to one person is different from what it means to another, which makes them frustrating and often futile.

“The more people you add, the more complexity you add. And so the more time you spend talking about data in circles as you're talking about things that are too complex and abstract.” - Zac

So, we need common ground. Things that everyone can understand and interpret in the same, predictable way. This is the role metadata can play, Zac argues:

“Metadata is data about the data. So it's how many rows, how many columns, where did this file come from? We can experience that. We can see it. And so when we talk about metadata, we're talking about something we can experience, something we can know about. If we talk about data, we're talking about [hyperobjects] something we do not know about.” - Zac

The idea that metadata could help us have better conversations is explained by the concept of relevance sorting:

“To have a conversation you and I have to agree on some facts about reality (what is relevant) and agree to leave out far more (what is not relevant). When we are talking about something we can experience (see, touch) we humans are relevance sorting wizards but when we are talking about hyperobjects we aren’t as good at relevance sorting - the conversations end up all over the place based on our different experiences at the edges of a hyperobject. We all talk about what we “think we know” and thus are all just making stuff up (including me).” - Zac

The idea of a common language in data is a compelling one. But there was something else I couldn’t shake…

Replace conversations or enhance them?

It is difficult to separate the theory from the application. While Zac’s idealized vision of metadata is compelling, I couldn't ignore the question of whether these tools were meant to enhance and improve our discussions with people about data or replace them entirely.

The proposed reality I often hear from data catalog companies is that, with their tools, instead of telling people which dashboard to use, you can send them to a searchable interface where they can work it out for themselves. Instead of asking a coworker which table to use, you can ask a metadata-fed AI chatbot.

I am uncomfortable with a future in which we insulate ourselves even more from the organizations we support. I’m skeptical we’ll be better off without these discussions with stakeholders that are so frustrating and confusing. There is real value there, and technology should help us improve those conversations, not replace them.

“The lack of real conversation and interaction happening right now is one of the biggest problems in data. In the world really. I think the reason I’m so passionate about this area is that I believe it can help that. I believe it can help people have more conversations, and make those conversations better than they are today.” - Zac

With this explanation, I began to rethink the payments table scenario I mentioned at the start. Maybe the metadata hadn’t failed me. Maybe metadata had done exactly what it was meant to do - helped me have a productive conversation with the right person.

Where to begin?

If, like me, you can buy into the need for a common language to facilitate better discussions, you might be ready to start seeing if metadata can help with that mission.

The following are Zac’s suggestions for where to start with metadata:

1. First, prove to yourself metadata can shift conversations

Before investing in significant process or tool changes, do a small test to see how metadata can shift the everyday discussions you’re having about data. And ask others to do the same:

“I would challenge you to listen intently and notice when a conversation with your stakeholders or even within your team shifts from talking about data to talking about metadata. Listen for how that conversation changes. Do you see more agreement? More head nods, more shared understanding? That is evidence that it can function in this way.” - Zac

2. Use the tools you have before you even look at the options out there

It’s easy to get caught up in which metadata tool is the best or which one is getting the most hype. When you’re ready for that, maybe that is what you should be paying attention to, but Zac argues it’s not where you need to start:

“You don’t need a special tool. Especially when you’re starting out. Start with data dictionaries in Notion or Google docs. Start discussing what each table and column mean, and if you disagree, then talk about it until you agree. Start small.” - Zac

3. Do an audit of what metadata you have

Are you making the valuable data lineage from dbt available? Do you have dashboard usage statistics that you never look at?

Doing an audit of your tools and the metadata available is a great way to identify new ways to start using the valuable info you already have.
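
As one illustration of using what you already have, lineage can be turned into a simple count of how many things sit downstream of each table. The sketch below assumes a dictionary shaped like the child_map section of dbt’s manifest.json artifact (each node id mapped to its direct children); the node ids and the downstream helper are hypothetical, made up for this example.

```python
from collections import deque


def downstream(child_map, node_id):
    """Return every node reachable from node_id by following child links."""
    seen = set()
    queue = deque(child_map.get(node_id, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(child_map.get(current, []))
    return seen


# Hypothetical lineage: node id -> ids of its direct children.
child_map = {
    "model.shop.payments_main": [
        "model.shop.payments_fac",
        "model.shop.payments_marketing",
    ],
    "model.shop.payments_fac": ["exposure.shop.finance_dashboard"],
    "model.shop.payments_marketing": [],
    "exposure.shop.finance_dashboard": [],
}

# Count direct and indirect dependents for every node.
counts = {node: len(downstream(child_map, node)) for node in child_map}
for node, count in sorted(counts.items(), key=lambda item: -item[1]):
    print(f"{node}: {count} downstream")
```

A table with many dependents and one with none invite very different conversations, which is the kind of shared starting point this audit is meant to surface.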

4. Do an audit of who has access to the metadata available

What use is metadata if no one can use it?

For each piece of metadata, work out who has access to it, and how often it’s used. Maybe making this information more available and understood can go a long way in cultivating trust with the wider business.

5. For every three tables/views/dashboards you build, retire one

A lot of requests for new assets stem from a misunderstanding of what that asset actually is, or what it easily could be. By making it clear that creating new models, tables, or dashboards isn’t costless, Zac argues you’ll end up having more constructive metadata conversations.

Metadata in the long-term

If the above steps are successful in proving the value of metadata, you may wish to continue this journey. Zac offers a few glimpses of the longer-term metadata vision:

Tooling

As an industry, we can be very solution-orientated, meaning a lot of the discussions around metadata are focused on the tools that exist in that space. Helpfully, Zac has reminded us (or maybe just me) of the real problems we’re trying to solve and how metadata could play a role in fixing them.

But what is next for this class of tools so prevalent they can dominate dbt’s Coalesce sponsor list?

“We still have further to go. A lot of interfaces that are just search portals aren’t scratching the surface of the potential of metadata. That’s the future I’m excited for.” - Zac

According to Zac, look out for more AI, more metadata, new (and better) UIs, and solutions that go beyond smart searches for data.

Metadata-first data strategy

More important than a tool is strategy. Are you prioritizing metadata? Making it a part of discussions, the lexicon of the team, and the wider organization?

“In a metadata first data strategy, you're always thinking about what other metadata do I need to solve problems? And so if I'm doing a metadata first data strategy for a company that has 15 payments tables, we’ll ask how do we tie metrics about outcomes to their data sources.” - Zac

Final thoughts

I have to admit I was far less cynical on the topic of metadata after the discussion with Zac. I still have concerns and issues with the presentation of metadata (e.g. why is it in a separate tool from where work happens), but those concerns no longer have me writing off the entire field.

Zac’s reminder to use metadata as a means of connection rather than isolation is a message we all needed to hear.

Resources

[1] ‘Timothy Morton’ (n.d.) Wikipedia. Available at: https://en.wikipedia.org/wiki/Timothy_Morton

[2] Daggett, C. (2014) ‘Hyperobjects By Timothy Morton’, Society and Space, 5 September. Available at: https://www.societyandspace.org/articles/hyperobjects-by-timothy-morton (Accessed: 01 March 2024).

[3] Madden, James D. (2023) ‘Unidentified Flying Hyperobject: UFOs, Philosophy, and the End of the World’.

Maybe metadata isn’t as boring as it sounds