From DataFrame to Knowledge Graph in 4 lines of Python
As seen at the Practical Data Community's Lunch & Learn session
On a Monday in early June, Joe Reis invited me to host a session for the Practical Data Community. I stitched together a demo on how to quickly get started with knowledge graphs and triples all over the place, using tooling such as Polars and Python. I had great fun giving this demo, and huge thanks to the PDC community for the great questions along the way.
This post walks through the demo, which takes three sources (CSV, Parquet, CSV) and turns them into a knowledge graph you can query across, enrich from external APIs, and export back to a DataFrame. All in memory with just Polars, Python and maplib.
Still not sure why you should care about knowledge graphs? Check out a previous post I wrote on the topic.
TL;DR
GitHub repo: https://github.com/veleda/pdc-demo
The demo
Before we get started, there are two things you need to know about knowledge graphs:
We talk in triples; subject predicate object
We use IRIs as global, unique identifiers for all things
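To make those two points concrete, here is a minimal plain-Python sketch (no maplib involved) of a handful of triples as tuples. The band IRI follows the pattern used later in this post; the MU constant and the literal values are only illustrative.

```python
# A triple is just (subject, predicate, object).
# IRIs are plain strings here; the namespace mirrors the one in the demo templates.
MU = "http://example.org/music/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

radiohead = MU + "band/Radiohead"

triples = [
    (radiohead, RDF_TYPE, MU + "Band"),
    (radiohead, MU + "name", "Radiohead"),
    (radiohead, MU + "country", "United Kingdom"),  # illustrative value
]

# Everything we know about one thing is found by filtering on its IRI.
about_radiohead = {p: o for s, p, o in triples if s == radiohead}
print(about_radiohead[MU + "name"])
```

Because the subject is a global identifier, any other source that uses the same IRI is automatically talking about the same band.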
The setup
We have three data sources about music. They overlap, but they don’t agree on schema or format.
bands.csv: a music encyclopedia. Band name, genre, year formed, and country. Sixteen bands from Pink Floyd to Deafheaven.
vinyl_store.parquet: a record store inventory. Album title, artist, year, price, stock status.
festival_lineups.csv: fictional festival schedules.
Each source knows something about the same band, but they describe different aspects.
Templates
import polars as pl
from maplib import Model
m = Model()
m.map_default(pl.read_csv("data/bands_with_iri.csv"), "iri")
I could have stopped at those four lines of Python and been done with it. That was what I promised, after all. But hey, there is so much more fun to be had once your data is triples than just mapping it. Let’s have a look at the rest of the demo.
m.add_template("""
@prefix mu:<http://example.org/music/>.
mu:Band [
ottr:IRI ?band_iri,
xsd:string ?name,
xsd:string ?genre,
xsd:long ?formed,
xsd:string ?country
] :: {
ottr:Triple(?band_iri, a, mu:Band),
ottr:Triple(?band_iri, mu:name, ?name),
ottr:Triple(?band_iri, mu:genre, ?genre),
ottr:Triple(?band_iri, mu:formed, ?formed),
ottr:Triple(?band_iri, mu:country, ?country)
} .
""")
The template takes variables in its signature [ … ], and the variable names have to match the DataFrame column headers. The body { … } then contains the instructions for serialising those variables into triples, structured as subject predicate object.
We have one template per source, and then run m.map(), and the graph fills up! Suddenly everything is in the same graph, connected through shared IRIs.
# ns is the template namespace as a string; bands_df, store_df and lineup_df are the three sources as Polars DataFrames
m.map(ns + "Band", bands_df)
m.map(ns + "Album", store_df)
m.map(ns + "FestivalLineup", lineup_df)
The magic is in the IRIs. ✨ Because we build band IRIs the same way in every source (http://example.org/music/band/Radiohead), the vinyl store’s albums automatically connect to the encyclopaedia’s genre and country data. No JOIN key, no fuzzy matching, no COALESCE. The identifiers are the integration.
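What “building IRIs the same way” could look like is sketched below. This is an assumption about the approach, not the code from the demo repo: band_iri is a hypothetical helper, the rows are made up, and only the IRI pattern is taken from the example above.

```python
from urllib.parse import quote

BAND_NS = "http://example.org/music/band/"  # pattern taken from the example IRI above

def band_iri(name: str) -> str:
    """Hypothetical helper: build a band IRI from a band name."""
    # Trim surrounding whitespace and percent-encode characters that are not IRI-safe.
    return BAND_NS + quote(name.strip())

# Two made-up rows from different sources that mention the same band
encyclopedia_row = {"name": "Radiohead", "genre": "Alternative Rock"}
store_row = {"artist": " Radiohead", "title": "OK Computer"}

same = band_iri(encyclopedia_row["name"]) == band_iri(store_row["artist"])
print(band_iri(store_row["artist"]), same)
```

As long as every source goes through the same function before mapping, the rows end up hanging off the same node in the graph.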
One query, three sources
Now, for the real fun stuff! A single SPARQL query that crosses all three sources:
SELECT ?band_name ?album ?price
WHERE {
?b mu:playsAt <.../festival/Roadburn> ;
mu:name ?band_name .
?a mu:artist ?b ;
mu:title ?album ;
mu:price ?price ;
mu:inStock true .
}
“Which albums are in stock for bands playing at Roadburn, and what do they cost?” The festival data tells us who plays at Roadburn, and the store data tells us their albums and prices. The query walks the graph, no JOINs needed, because the relationships were never flattened away.
Every select query result comes back as a Polars DataFrame. You can filter it, plot it, write it to Parquet, feed it to your existing pipeline.
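If “walking the graph” sounds abstract, here is a rough plain-Python analogue of what the query above does. It is not how maplib evaluates SPARQL; the shorthand identifiers stand in for full IRIs, and every row is made up.

```python
# Shorthand identifiers stand in for full IRIs; every row below is made up.
triples = {
    ("band:A", "playsAt", "festival:Roadburn"),
    ("band:A", "name", "Band A"),
    ("band:B", "name", "Band B"),
    ("album:1", "artist", "band:A"),
    ("album:1", "title", "First Album"),
    ("album:1", "price", 25.0),
    ("album:1", "inStock", True),
    ("album:2", "artist", "band:A"),
    ("album:2", "title", "Second Album"),
    ("album:2", "price", 30.0),
    ("album:2", "inStock", False),
    ("album:3", "artist", "band:B"),
    ("album:3", "title", "Third Album"),
    ("album:3", "price", 20.0),
    ("album:3", "inStock", True),
}

def subjects(predicate, obj):
    """All subjects that have the given predicate pointing at obj."""
    return [s for s, p, o in triples if p == predicate and o == obj]

def objects(subject, predicate):
    """All objects reachable from subject via predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

# Festival -> bands -> albums -> in-stock filter -> name, title, price
results = sorted(
    (name, title, price)
    for band in subjects("playsAt", "festival:Roadburn")
    for album in subjects("artist", band)
    if True in objects(album, "inStock")
    for name in objects(band, "name")
    for title in objects(album, "title")
    for price in objects(album, "price")
)
print(results)
```

Each step follows an edge from one node to the next, which is why no join keys show up anywhere.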
Cleaning up messy data with vocabularies
So, what happens if you have inconsistent data? Our sources have “Post-Rock” and “Post Rock”, “Trip Hop” and “Trip-Hop”, “Prog Rock” and “Progressive Rock”. Same concepts, but different spellings.
In SQL, one would perhaps write a mapping table or a CASE statement (or something?). In a knowledge graph, one can use a vocabulary for such cases: a controlled set of concepts with canonical labels and alternative spellings.
mu:genre_post_rock a skos:Concept ;
skos:prefLabel "Post-Rock" ;
skos:altLabel "Post Rock" .
One SPARQL update query swaps all the raw strings for concept IRIs, matching on both preferred and alternative labels. After that, “Post Rock” and “Post-Rock” point to the same concept. Sigur Rós and Mogwai are in the same genre, as they should be.
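The idea behind that update can be sketched in plain Python. This is an analogue, not the SPARQL update from the demo. Only mu:genre_post_rock and its two labels come from the snippet above; the other concept names, the choice of which spelling is preferred, and which band carries which raw spelling are assumptions.

```python
# Vocabulary: concept identifier -> (preferred label, alternative labels).
# Only the post-rock entry is from the post; the other two are assumed.
vocabulary = {
    "mu:genre_post_rock": ("Post-Rock", ["Post Rock"]),
    "mu:genre_trip_hop": ("Trip-Hop", ["Trip Hop"]),
    "mu:genre_prog_rock": ("Progressive Rock", ["Prog Rock"]),
}

# Index every label, preferred or alternative, to its concept.
label_to_concept = {}
for concept, (pref_label, alt_labels) in vocabulary.items():
    for label in [pref_label, *alt_labels]:
        label_to_concept[label] = concept

# Raw genre strings as they might appear in the sources (assumed spellings)
raw_genres = {"Sigur Rós": "Post-Rock", "Mogwai": "Post Rock"}

# Swap each raw string for its concept; unknown strings are left untouched.
normalised = {
    band: label_to_concept.get(genre, genre)
    for band, genre in raw_genres.items()
}
print(normalised)
```

The difference in the graph version is that the vocabulary lives in the same graph as the data, so the lookup table is itself queryable.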
The schema is data too!
This might be the most mind-bending thing about knowledge graphs for someone coming from SQL. In a relational database, the schema and the data live in separate worlds. In a knowledge graph, the ontology (your classes, properties, semantics) is just more triples. You query it with the same SPARQL you use for everything else.
SELECT ?cls ?prop ?rng
WHERE {
  ?cls a owl:Class .
  ?prop rdfs:domain ?cls ;
        rdfs:range ?rng .
}
So, if the graph holds a schema or ontology you don’t know, you can discover it with a single SPARQL query! rdfs:domain says what’s expected in the subject position of a triple where ?prop is the predicate, and rdfs:range says what’s expected in the object position.
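Here is a small plain-Python analogue of that schema query, to show that the ontology really is just more triples. The schema statements below are assumptions that are merely consistent with the template and query shown earlier; the actual ontology in the demo may differ.

```python
# Schema statements stored as ordinary triples (assumed, for illustration)
schema = [
    ("mu:Band", "rdf:type", "owl:Class"),
    ("mu:Album", "rdf:type", "owl:Class"),
    ("mu:artist", "rdfs:domain", "mu:Album"),
    ("mu:artist", "rdfs:range", "mu:Band"),
    ("mu:formed", "rdfs:domain", "mu:Band"),
    ("mu:formed", "rdfs:range", "xsd:long"),
]

classes = {s for s, p, o in schema if p == "rdf:type" and o == "owl:Class"}
domains = {s: o for s, p, o in schema if p == "rdfs:domain"}
ranges = {s: o for s, p, o in schema if p == "rdfs:range"}

# For every property whose domain is a known class: (class, property, range)
rows = sorted(
    (domains[prop], prop, ranges[prop])
    for prop in domains
    if prop in ranges and domains[prop] in classes
)
for cls, prop, rng in rows:
    print(f"{cls} --{prop}--> {rng}")
```

Read the first row as: mu:artist is used on albums (domain) and points at bands (range).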
Enriching from the outside world
The extended demo takes the saved graph and enriches it from two external sources (Wikidata and Discogs) to show two contrasting patterns.
Wikidata speaks RDF natively. We can send a SPARQL CONSTRUCT query to their endpoint and get Turtle back, actual RDF triples, not JSON we need to parse and reshape. We read those triples straight into our graph and link them to our bands with a single SPARQL INSERT. RDF in, RDF out. ❤️
import urllib.parse
import urllib.request

# Ask Wikidata for triples (sparql holds the CONSTRUCT query as a string)
req = urllib.request.Request(
    "https://query.wikidata.org/sparql?query=" + urllib.parse.quote(sparql),
    headers={"Accept": "text/turtle"},
)
turtle = urllib.request.urlopen(req).read().decode()
# Read them straight into our graph
m.reads(turtle, format="turtle", transient=True)
To be honest, there is a keyword in SPARQL that lets you skip the urllib stuff: SERVICE. maplib hasn’t implemented this part of SPARQL yet, but it’s in the spec!
PREFIX mu: <http://example.org/music/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?name ?mbid
WHERE {
?b a mu:Band ; mu:name ?name .
# just run as other queries, no urllib needed
SERVICE <https://query.wikidata.org/sparql> {
?item rdfs:label ?name ;
wdt:P434 ?mbid .
FILTER(LANG(?name) = "en")
}
}
ORDER BY ?name
Discogs is a traditional REST API with JSON responses, one request per album, rate-limited. We parse the JSON, build a DataFrame, and use map_triples() to add first pressing metadata and marketplace prices to the graph. It works, but it’s visibly more work than the Wikidata path.
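The shape of that REST-side work, one throttled request per album and JSON flattened into rows, might look roughly like the sketch below. Everything in it is hypothetical: fetch_all, the fake_fetch stand-in, the album ids and the lowest_price field are invented for illustration and are not the Discogs API schema or the code from the demo.

```python
import json
import time

def fetch_all(album_ids, fetch, min_interval=1.0, sleep=time.sleep):
    """Call fetch(album_id) once per album, pausing between calls, and flatten to rows."""
    rows = []
    for i, album_id in enumerate(album_ids):
        if i > 0:
            sleep(min_interval)  # crude rate limiting between requests
        payload = json.loads(fetch(album_id))
        # "lowest_price" is a made-up field name, not the real API schema
        rows.append({"album": album_id, "lowest_price": payload["lowest_price"]})
    return rows

def fake_fetch(album_id):
    """Stand-in for the HTTP call, so the sketch runs offline."""
    prices = {"album:1": 25.0, "album:2": 30.0}
    return json.dumps({"lowest_price": prices[album_id]})

# A no-op sleep keeps the example instant; real code would keep the default.
rows = fetch_all(["album:1", "album:2"], fake_fetch, sleep=lambda seconds: None)
print(rows)
```

A list of dicts like this can be handed to pl.DataFrame(rows) before mapping it into the graph.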
The contrast is the point. When both sides speak RDF, integration is almost free. When one side doesn’t, you’re back to the familiar pipeline work. Knowledge graphs will not eliminate all integration efforts, but they definitely eliminate unnecessary effort.
DataFrames back out
After enrichment and transformations, we pull the data back out as a Polars DataFrame and write it to Parquet.
catalog = m.query("""
SELECT ?name ?genre ?country ?formed ?mbid
WHERE { ... }
""")
catalog.write_parquet("band_catalog_enriched.parquet")
The graph does not have to replace your analytical stack. It’s the semantic layer for integration and enrichment through transformations, reasoning and validation.
Try it!
The demo is two Jupyter notebooks you can run locally:
demo.ipynb: the core loop with sources to graph and cross-source queries, SKOS genre normalisation and export.
demo-extended.ipynb: Wikidata and Discogs enrichment and price comparison.
pip install maplib polars
git clone https://github.com/veleda/pdc-demo
cd pdc-demo
jupyter lab
The whole thing runs in memory: just pip install maplib and you’re ready to go.
If you’re coming from the SQL world and this is your first look at knowledge graphs, I hope this demo is readable even if you’ve never seen SPARQL before. Any feedback on how to make this more accessible is welcome! If you want a translation guide, I wrote a SPARQL for SQL Developers piece that maps every concept to something you already know.
The tooling gap that kept knowledge graphs out of reach for data engineers is closing fast. This demo is one small proof of that.
About the author
Ontologist who codes in C, Java and Python. A knowledge graph practitioner for over a decade who has given numerous talks on the topic around the globe. Knowledge Graph Specialist at Data Treehouse. Author of SHACL for the Practitioner and named among Norway’s Top 50 Women in Tech 2024.


