Data Engineer! Why should you care about Knowledge Graphs?
I have realised there is a need for a part 0 of From Data Engineering to Knowledge Engineering.
Forty-seven tables with foreign keys pointing in every direction. A dbt DAG that looks like a plate of spaghetti. Three source systems for “customer” and none of them agree on what a customer is.
You don’t have a tooling problem. You have a modelling problem. And the solution already exists. It has just been marketed so badly that you’ve never seriously considered it!
In this article, I describe the why. Why should you, as a data engineer, care about knowledge graphs? Not the academic kind with ontology alignment workshops. The practical kind that solves problems you’re fighting every single day.
The problems you’ve normalised
A day in the life
A product manager asks: “Can you get me a list of all customers who have active subscriptions and have filed a support ticket in the last 30 days?” Simple question. You know exactly where the data lives! Customers in the CRM export. Subscriptions in the billing system. Support tickets in Zendesk.
Three sources and three different identifiers for “customer.” Three different schemas. You write a pipeline and do some LEFT JOIN gymnastics. You add a fuzzy match on email because one system uses customer_id and another uses user_id and the third uses email as the primary key. You add a COALESCE for the name field because sometimes it’s first_name + last_name, sometimes it’s full_name, and sometimes it’s display_name.
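To make the gymnastics concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, columns, and rows are all invented for illustration, and the "fuzzy match" is reduced to trimming and lowercasing the email, so treat it as a toy version of the real thing.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Hypothetical exports: each source identifies a customer differently.
    CREATE TABLE crm_customers (customer_id INTEGER, email TEXT, first_name TEXT, last_name TEXT);
    CREATE TABLE billing_subscriptions (user_id TEXT, email TEXT, full_name TEXT, status TEXT);
    CREATE TABLE support_tickets (email TEXT, display_name TEXT, created_on TEXT);

    INSERT INTO crm_customers VALUES
        (1, 'Alice@Example.com', 'Alice', 'Smith'),
        (2, 'bob@example.com', 'Bob', 'Jones');
    INSERT INTO billing_subscriptions VALUES
        ('u-77', 'alice@example.com', 'Alice Smith', 'active'),
        ('u-78', 'bob@example.com', 'Bob Jones', 'cancelled');
    INSERT INTO support_tickets VALUES
        ('alice@example.com ', 'alice_s', '2026-01-15'),
        ('bob@example.com', 'bobj', '2025-06-01');
""")

query = """
    SELECT DISTINCT
        c.customer_id,
        -- Three name conventions, one output column.
        COALESCE(c.first_name || ' ' || c.last_name, b.full_name, t.display_name) AS name
    FROM crm_customers AS c
    -- No shared key, so the normalised email has to act as one.
    LEFT JOIN billing_subscriptions AS b
        ON LOWER(TRIM(b.email)) = LOWER(TRIM(c.email))
    LEFT JOIN support_tickets AS t
        ON LOWER(TRIM(t.email)) = LOWER(TRIM(c.email))
    WHERE b.status = 'active'
      AND t.created_on >= DATE(:today, '-30 days')
    ORDER BY c.customer_id
"""

# "today" is pinned so the example gives the same result whenever it runs.
rows = con.execute(query, {"today": "2026-01-31"}).fetchall()
print(rows)
```

Every line of matching logic in that query is knowledge about how the three systems relate, and it lives only inside this one pipeline.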
You ship it, and it works. It took you a day and a half for what should have been a 10-minute question. Next week, someone asks the same question but adds “...and also show their contract tier.” That’s a fourth system. Your pipeline breaks. You fix it. You add more joins.
This cycle repeats every week, every month, every quarter. The pipelines get longer, the joins get more fragile, and the “simple questions” keep taking longer to answer.
You’ve normalised the fact that answering simple business questions requires days of pipeline work. That adding a single new data source means rewriting joins, fixing breakage, and retesting downstream models. That “how do we connect these two datasets?” is a recurring project, not a solved problem.
Everyone has. It’s just how data engineering works, right? No. It isn’t.
When JOINs are asked to do too much
JOINs are not the problem, though. The problem is what we’re asking JOINs to do in modern data engineering.
What’s happening in that day-in-the-life scenario? You have relationships between things (customers, subscriptions, tickets, contracts), and you’re trying to express those relationships by flattening them into rows and columns and then reconstructing them with JOINs. Within a single well-modelled system, that works beautifully. But across three or five source systems with different schemas, different IDs, and different assumptions about what “customer” means? That’s where JOINs start to strain.
Think about what a JOIN actually does. It takes two flat tables that have destroyed the natural structure of the data, and tries to stitch that structure back together using a shared key. If the key doesn’t match (and it often doesn’t across systems) you’re left writing transformation logic to create the match.
Every JOIN is a confession that your data model threw away a relationship that you now need back. The data was connected in the real world, your schema just didn’t preserve that connection, and now you’re paying the cost to rebuild it every time you query.
Most data engineers today work primarily on analytical use cases: integrating data from multiple sources, building models that answer cross-domain questions, creating data products that combine operational and business data. These are integration problems, and integration across heterogeneous systems is where relational modelling was never designed to shine.
A well-designed schema answers the questions it was designed for. But when the business asks questions that cross boundaries you didn’t anticipate (a new source, a new relationship, a new way of looking at data), the response is another pipeline, another set of joins, another migration. Not because JOINs are poor, but because they’re being overextended into problems where a different data model would carry the load more naturally.
What if the relationships were first-class?
Imagine a different model. Instead of tables with foreign keys, you have things and relationships between things, stated explicitly.
customer:Alice has_subscription subscription:Pro
customer:Alice filed_ticket ticket:12345
ticket:12345 created_on 2026-01-15
These are called triples: subject, predicate, object. That’s it! Three columns! Always. ❤️
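If you want to get a feel for triples without installing anything, a plain Python set of tuples is enough. This is only a sketch: the `match` helper is a made-up name for this example, and a real triple store does far more than this.

```python
# Each fact is a (subject, predicate, object) tuple; a graph is a set of facts.
triples = {
    ("customer:Alice", "has_subscription", "subscription:Pro"),
    ("customer:Alice", "filed_ticket", "ticket:12345"),
    ("ticket:12345", "created_on", "2026-01-15"),
}


def match(graph, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# "Everything we know about Alice" is a single pattern.
alice_facts = match(triples, s="customer:Alice")
print(alice_facts)
```

The same three-column shape holds whether the fact describes a customer, a ticket, or a date.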
Schema definitions (what is a customer, what is a ticket, and so on) are also available as triples. They are stored in the ontology, but that’s not the focus of this article. Here we focus on how you, the data engineer, can utilise the knowledge graph capabilities.
In this way of structuring data, adding a new source system with new columns doesn’t require redesigning your schema. You just state more facts:
customer:Alice has_contract contract:Enterprise
contract:Enterprise has_tier "Gold"
No migration, no new foreign keys, no broken pipelines. The new facts simply exist alongside the old ones, connected through the same identifiers. Feels almost like magic!
That day-in-the-life question? It becomes trivial, not because the query language is magical, but because the data model itself preserves the relationships. You don't need to reconstruct them with JOINs, because they were never flattened away in the first place.
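Here is a rough sketch of that question asked against triples, again in plain Python. The `has_status` fact and the customer Bob are assumptions added to make the example complete, and the helper names are hypothetical. In practice you would write this as a SPARQL query rather than Python loops.

```python
from datetime import date, timedelta

graph = {
    ("customer:Alice", "has_subscription", "subscription:Pro"),
    ("subscription:Pro", "has_status", "active"),  # assumed fact for this sketch
    ("customer:Alice", "filed_ticket", "ticket:12345"),
    ("ticket:12345", "created_on", "2026-01-15"),
    ("customer:Bob", "filed_ticket", "ticket:999"),
    ("ticket:999", "created_on", "2025-06-01"),
}

# The fourth source arrives: state more facts, leave the old ones alone.
graph |= {
    ("customer:Alice", "has_contract", "contract:Enterprise"),
    ("contract:Enterprise", "has_tier", "Gold"),
}


def objects(graph, subject, predicate):
    """All objects reachable from `subject` via `predicate`."""
    return {o for s, p, o in graph if s == subject and p == predicate}


def customers_to_contact(graph, today, days=30):
    """Customers with an active subscription and a recent ticket, plus their tiers."""
    cutoff = today - timedelta(days=days)
    customers = {s for s, p, _ in graph if p in ("has_subscription", "filed_ticket")}
    result = {}
    for customer in sorted(customers):
        active = any(
            "active" in objects(graph, sub, "has_status")
            for sub in objects(graph, customer, "has_subscription")
        )
        recent = any(
            date.fromisoformat(created) >= cutoff
            for ticket in objects(graph, customer, "filed_ticket")
            for created in objects(graph, ticket, "created_on")
        )
        if active and recent:
            result[customer] = sorted(
                tier
                for contract in objects(graph, customer, "has_contract")
                for tier in objects(graph, contract, "has_tier")
            )
    return result


answer = customers_to_contact(graph, today=date(2026, 1, 31))
print(answer)
```

Notice that adding the contract tier meant following two more relationships, not rewriting the matching logic.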
“This sounds like a graph database”
Sort of, but there’s an important difference.
Graph databases like Neo4j give you nodes and edges with arbitrary properties, which is fine. But they don’t come with a shared vocabulary for what those nodes and edges mean. Your customer graph and my customer graph have nothing in common unless we agree on the labels by hand. There’s no standard way to merge them, validate them against a shared definition, or reason about them.
Knowledge graphs, specifically the RDF standard, give you that shared vocabulary. Every thing has a globally unique identifier (an IRI). Every relationship has a globally unique identifier. When two systems both talk about http://schema.org/Person, they mean the same concept. When two different datasets use the same IRI for Alice, they’re talking about the same Alice.
This means that…
…merging datasets is free. If two graphs use the same identifiers, loading them into the same model automatically connects them. No join keys, no fuzzy matching, no COALESCE.
…schema evolution doesn’t break anything. Adding new types of facts doesn't require altering existing ones. Your old queries still work. Your new queries can see the new data.
…validation is declarative. Instead of writing Python assertions scattered across your pipeline, you define shapes that describe what valid data looks like, and the system checks conformance. (If you know SQL constraints, you already understand the idea.)
…meaning is machine-readable. An ontology (fancy word, simple concept) defines what types of things exist, what relationships are valid between them, and what properties they can have. This is your data contract, but one that machines can actually enforce and reason about.
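The first two points can be sketched with nothing but sets. The schema.org property IRIs below are real; everything under example.org is a hypothetical placeholder, and `names_of` is a helper invented for this example.

```python
NAME = "https://schema.org/name"
EMAIL = "https://schema.org/email"
HAS_SUBSCRIPTION = "https://example.org/vocab/hasSubscription"  # hypothetical
FILED_TICKET = "https://example.org/vocab/filedTicket"  # hypothetical
ALICE = "https://example.org/customer/alice"  # hypothetical shared IRI

crm_graph = {
    (ALICE, NAME, "Alice Smith"),
    (ALICE, EMAIL, "alice@example.com"),
}
billing_graph = {(ALICE, HAS_SUBSCRIPTION, "https://example.org/subscription/pro")}
support_graph = {(ALICE, FILED_TICKET, "https://example.org/ticket/12345")}

# Merging is a set union: facts about the same IRI land on the same node.
merged = crm_graph | billing_graph | support_graph


def names_of(graph, subject):
    """An 'old query' written before billing and support data existed."""
    return sorted(o for s, p, o in graph if s == subject and p == NAME)


before = names_of(crm_graph, ALICE)
after = names_of(merged, ALICE)
alice_predicates = sorted(p for s, p, _ in merged if s == ALICE)
print(before == after, len(alice_predicates))
```

The catch is in the word "if": the union is free only once the sources agree on identifiers, and getting them to agree is the mapping work.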
How this fits in your existing stack
A knowledge graph doesn’t replace your warehouse, lakehouse, dbt models, or your orchestration pipelines. It complements them. I am asking you to consider that some of the problems you're solving with ever-more-complex SQL could be solved more naturally with a graph model. Let me give you a few examples.
Master data management across sources
You have customer data in Salesforce, billing data in Stripe, and support data in Zendesk. Today you maintain a brittle set of dbt models that try to stitch these together using email matching and ID lookups. With a knowledge graph, you map each source into the same model using shared identifiers. The graph is your golden record, and when a fourth source appears next quarter, you add it without touching the existing mappings. Your warehouse stays as it is; the graph handles the integration layer on top.
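A minimal sketch of what "mapping each source into the same model" could look like. The field names are invented and are not the actual Salesforce, Stripe, or Zendesk schemas, and minting the IRI from a normalised email is just one possible convention.

```python
def customer_iri(email):
    """Mint a hypothetical shared IRI from a normalised email address."""
    return "https://example.org/customer/" + email.strip().lower()


def map_salesforce(row):
    return {(customer_iri(row["Email"]), "name", row["Name"])}


def map_stripe(row):
    plan = "https://example.org/subscription/" + row["plan"]
    return {(customer_iri(row["email"]), "has_subscription", plan)}


def map_zendesk(row):
    ticket = "https://example.org/ticket/" + str(row["id"])
    return {(customer_iri(row["requester_email"]), "filed_ticket", ticket)}


golden = set()
golden |= map_salesforce({"Email": "Alice@Example.com", "Name": "Alice Smith"})
golden |= map_stripe({"email": "alice@example.com", "plan": "pro"})
golden |= map_zendesk({"requester_email": "alice@example.com ", "id": 12345})


# Next quarter's fourth source: one new mapping, existing mappings untouched.
def map_contracts(row):
    return {(customer_iri(row["contact"]), "has_tier", row["tier"])}


golden |= map_contracts({"contact": "alice@example.com", "tier": "Gold"})

subjects = {s for s, _, _ in golden}
print(subjects)
```

All four sources end up describing one subject, because the identifier is decided once, in one place, rather than in every join.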
Data catalogue and lineage
You already track metadata somewhere. Maybe in a wiki, a spreadsheet, or a tool like DataHub. A knowledge graph formalises this: every dataset, every column, every transformation becomes a node with typed relationships. "This dashboard metric is derived from this dbt model, which reads from these three source tables, which are owned by this team." That's a graph query, not a documentation exercise. And because it's machine-readable, you can automate impact analysis when a source schema changes. And you know what? There already exist standardised ontologies for data catalogues (DCAT) and data products (DPROD), ready for you to use!
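Impact analysis over lineage is a transitive walk through the graph. Here is a small sketch with invented dataset names and a made-up `derived_from` predicate; with a standard vocabulary you would likely reach for something like PROV's wasDerivedFrom instead.

```python
from collections import deque

# Hypothetical lineage edges: (downstream, "derived_from", upstream).
lineage = {
    ("metric:monthly_revenue", "derived_from", "model:fct_revenue"),
    ("model:fct_revenue", "derived_from", "table:stripe.invoices"),
    ("model:fct_revenue", "derived_from", "table:crm.accounts"),
    ("model:dim_customer", "derived_from", "table:crm.accounts"),
    ("metric:churn_rate", "derived_from", "model:dim_customer"),
}


def impacted_by(graph, changed):
    """Everything that transitively depends on `changed` (breadth-first walk)."""
    seen, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for s, p, o in graph:
            if p == "derived_from" and o == node and s not in seen:
                seen.add(s)
                queue.append(s)
    return sorted(seen)


# The CRM team announces a schema change: what breaks?
impact = impacted_by(lineage, "table:crm.accounts")
print(impact)
```

In SPARQL the same walk is a single property path, which is part of why lineage maps so naturally onto a graph.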
Contextualising time series data
You have a time series database full of sensor readings, but the readings are just numbers without context. Which asset does this sensor belong to? What process does that asset support? What are the normal operating bounds? A knowledge graph provides that context layer. You keep the time series in your existing database (InfluxDB, TimescaleDB, a cloud data lake, etc.) and the graph provides the semantic bridge that lets you query across context and data in one go. "Give me all temperature readings from sensors on pumps in Hall A that exceeded their threshold last month."
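That last question can be sketched as a context graph plus a list of readings. The sensors, pumps, thresholds, and values are all invented, and the readings list stands in for whatever your time series database returns.

```python
# Context lives in the graph (hypothetical identifiers and predicates).
context = {
    ("sensor:T1", "measures", "temperature"),
    ("sensor:T2", "measures", "temperature"),
    ("sensor:T3", "measures", "temperature"),
    ("sensor:T1", "attached_to", "pump:P1"),
    ("sensor:T2", "attached_to", "pump:P2"),
    ("sensor:T3", "attached_to", "pump:P3"),
    ("pump:P1", "located_in", "hall:A"),
    ("pump:P2", "located_in", "hall:A"),
    ("pump:P3", "located_in", "hall:B"),
    ("sensor:T1", "max_threshold", 80.0),
    ("sensor:T2", "max_threshold", 90.0),
    ("sensor:T3", "max_threshold", 80.0),
}

# Numbers stay in the time series store: (sensor, timestamp, value).
readings = [
    ("sensor:T1", "2025-12-03T10:00", 85.2),
    ("sensor:T1", "2025-12-04T10:00", 70.1),
    ("sensor:T2", "2025-12-05T10:00", 91.0),
    ("sensor:T3", "2025-12-05T10:00", 99.0),
    ("sensor:T1", "2025-11-20T10:00", 88.0),
]


def one(graph, subject, predicate):
    """The single object for (subject, predicate), or None if there is none."""
    return next((o for s, p, o in graph if s == subject and p == predicate), None)


def hot_readings(graph, readings, hall, month):
    """Temperature readings above threshold from sensors on pumps in `hall`."""
    hits = []
    for sensor, timestamp, value in readings:
        pump = one(graph, sensor, "attached_to")
        threshold = one(graph, sensor, "max_threshold")
        if (
            one(graph, sensor, "measures") == "temperature"
            and one(graph, pump, "located_in") == hall
            and timestamp.startswith(month)
            and threshold is not None
            and value > threshold
        ):
            hits.append((sensor, timestamp, value))
    return hits


hits = hot_readings(context, readings, hall="hall:A", month="2025-12")
print(hits)
```

The readings never needed an extra column for hall, pump, or threshold. The graph supplied that context at query time.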
Data contracts and validation
You're writing dbt tests and Great Expectations suites to validate data quality. That works for tabular checks. But a knowledge graph lets you express richer constraints: "every Pump must have at least one Sensor", "every MaintenanceEvent must reference a valid Asset", "no Asset can be both Active and Decommissioned". These are SHACL shapes: declarative, portable, and reusable across your entire data estate. They complement your existing tests; they don't replace them.
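To show the idea of rules as data, here is a plain Python sketch. It is not SHACL syntax and not a SHACL engine. The dictionary keys only loosely mirror SHACL's minCount and maxCount, and the "Active and Decommissioned" rule is approximated as "at most one status".

```python
graph = {
    ("pump:P1", "type", "Pump"),
    ("pump:P2", "type", "Pump"),
    ("pump:P1", "has_sensor", "sensor:T1"),
    ("asset:A7", "type", "Asset"),
    ("asset:A7", "status", "Active"),
    ("asset:A7", "status", "Decommissioned"),
}

# Shapes are data, not code: target class, property, allowed cardinality.
shapes = [
    {"target": "Pump", "path": "has_sensor", "min_count": 1,
     "message": "every Pump must have at least one Sensor"},
    {"target": "Asset", "path": "status", "max_count": 1,
     "message": "an Asset cannot hold two statuses at once"},
]


def validate(graph, shapes):
    """Check every node of each target class against its shape."""
    violations = []
    for shape in shapes:
        targets = sorted(s for s, p, o in graph if p == "type" and o == shape["target"])
        for node in targets:
            count = sum(1 for s, p, _ in graph if s == node and p == shape["path"])
            too_few = count < shape.get("min_count", 0)
            too_many = count > shape.get("max_count", float("inf"))
            if too_few or too_many:
                violations.append((node, shape["message"]))
    return violations


report = validate(graph, shapes)
for node, message in report:
    print(node, "->", message)
```

Because the rules are data, adding a new constraint means appending to `shapes`, not editing the validator.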
In each of these cases, the knowledge graph isn't replacing something that works well. It's handling the integration, context, and meaning layer that your existing tools were never designed to provide. Your warehouse handles structured analytics. Your lakehouse handles scale. Your orchestrator handles scheduling. The knowledge graph handles what things are, how they relate, and what the rules are.
“Okay, but RDF is slow, verbose, and nobody uses it”
These are real objections that deserve real answers:
It’s slow
Historically, yes. Triple stores were not built for the kind of analytical workloads data engineers care about. But the landscape has changed. Modern tools can process knowledge graphs at DataFrame speeds, because they're built on the same foundations as the analytical engines you already use (Apache Arrow, Polars). You don't have to choose between semantic richness and performance.
It’s verbose
RDF in some of its raw serialisation formats (XML, N-Triples) is not pleasant to read. But you don't have to read it. Just like you don't read Parquet files. You work with data frames, and the serialisation is handled for you.
Nobody uses it
This one is just wrong, but I understand why it feels true. Google's Knowledge Graph Search API returns entities as JSON-LD (an RDF serialisation) typed with schema.org. Wikidata publishes its data as RDF. The European Union's interoperability vocabularies, such as DCAT-AP, are built on RDF. The energy sector (CIM/CGMES), healthcare (SNOMED-CT, FHIR), and financial services (FIBO) all have major RDF-based standards. The problem isn't adoption, it's visibility.
The RDF world has been terrible at telling its own story to the people who would benefit most from hearing it.
The AI elephant in the room
There's a reason this matters right now, and it isn't just data integration.
We're in the middle of a rush to deploy AI agents, systems that don't just generate text, but make decisions, take actions, and operate autonomously across your business.
An LLM can generate a fluent answer to any question. But fluency is not accuracy. When your agent decides to shut down a pump, reorder inventory, or escalate a customer complaint, you need more than a confident-sounding response. You need the agent to know what a pump is, which system it belongs to, what safety rules apply, and what the downstream consequences are. That's a knowledge graph!
Knowledge graphs give AI agents something they desperately lack: structured understanding of your domain. Not a prompt stuffed with context, or a vector database full of document chunks. An actual, queryable, machine-readable model of what exists in your operation, how things relate, and what rules govern them.
Concretely, this means grounding, guardrails, explainability and composability.
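As a rough illustration of grounding and guardrails, here is a sketch of an agent's proposed action being checked against graph facts before anything executes. The predicates, the approval rule, and the `review_action` helper are all hypothetical.

```python
graph = {
    ("pump:P1", "type", "Pump"),
    ("pump:P1", "part_of", "system:Cooling"),
    ("system:Cooling", "supports", "process:ReactorLine"),
    ("pump:P1", "requires_approval_from", "role:ShiftSupervisor"),
}


def objects(graph, subject, predicate):
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


def review_action(graph, action, target, approved_by=()):
    """Ground a proposed agent action in graph facts before executing it."""
    # Grounding: refuse to act on things the graph does not know about.
    if not objects(graph, target, "type"):
        return {"allowed": False, "reason": f"unknown entity {target}", "affected": []}
    # Explainability: list the downstream processes the action would touch.
    affected = [
        process
        for system in objects(graph, target, "part_of")
        for process in objects(graph, system, "supports")
    ]
    # Guardrail: rules are facts in the graph, not text in a prompt.
    missing = [
        role for role in objects(graph, target, "requires_approval_from")
        if role not in approved_by
    ]
    if missing:
        reason = f"{action} needs approval from {missing}"
        return {"allowed": False, "reason": reason, "affected": affected}
    return {"allowed": True, "reason": "all rules satisfied", "affected": affected}


blocked = review_action(graph, "shut_down", "pump:P1")
approved = review_action(graph, "shut_down", "pump:P1", approved_by=("role:ShiftSupervisor",))
unknown = review_action(graph, "shut_down", "pump:P9")
print(blocked["reason"], "| affected:", blocked["affected"])
```

The agent can still phrase its answer however it likes, but whether it may act is decided by facts you can inspect and query.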
The irony is thick: the semantic web community has been building exactly this infrastructure for two decades(!!), but failed to connect it to the people who need it most. Now that AI agents are forcing the question of "how do we give machines structured knowledge?", the answer has been sitting there all along.
Knowledge graphs need you
Here's something the knowledge graph world doesn't say often enough: knowledge graph projects fail without data engineers.
Historically, knowledge graphs have been treated as a concern for ontologists, researchers, and semantic web specialists. They design beautiful ontologies, publish them in a machine-readable format, and present them at academic conferences. And then nothing happens. The ontology sits in a repository, the data stays in its silos, and the business sees no value.
The reason is simple: ontologists model the world, but they don't move data. They don't build pipelines. They don't manage data quality, handle schema drift, or operationalise anything at scale. You do.
Every successful knowledge graph project I've seen has data engineers as a core part of the equation. They're the ones who connect the sources, map the data, maintain the quality, and keep the whole thing running in production. The ontologist defines what correct looks like; the data engineer makes it real.
This is why the knowledge graph community needs to stop talking exclusively to itself and start talking to you. Not because you need to become ontologists, but because without your skills, knowledge graphs remain an academic exercise. And with your skills, they become the most powerful data integration and AI-readiness technology available.
The tooling gap has been the main barrier. You shouldn't need to learn Java, RDF/XML serialisation, or the intricacies of OWL to build a knowledge graph. You should be able to do it with Python and DataFrames. That's what modern tooling such as maplib is finally making possible.
Where to go from here
If any of the above resonated, the rest of this article series walks you through the practical how, step by step, using Python and data frames you already know:
Part 1: From Data Engineering to Knowledge Engineering in the blink of an eye
Get a knowledge graph running from a CSV in four lines of Python.
Part 2: Data Engineering Ontologies
Connect your data to existing ontologies for instant semantic enrichment.
Part 3: SPARQL for SQL Developers
A translation guide for everything you already know.
Part 4: From SQL Constraints to SHACL Shapes
Declarative data validation, the knowledge graph way.
The tools are here and the standards are mature. The only thing that’s been missing is someone explaining why you should care in a language that makes sense to you.
That’s what we’re here to fix. ✨
A huge thanks to Joe Reis and Tyler Stewart for sanity checking and reviewing this article.



