RDFLib triplestore CPU/RAM #1784
Replies: 3 comments
-
Your question is timely. @nicholascar has expressed a desire for a comparative review of the various RDFLib persistence stores and I've been working on exactly that, using the SP²Bench SPARQL Performance Benchmark because it has a handy data generator which I can use to generate graphs of arbitrary numbers of triples. I have created an Ansible-driven Vagrant solution so that I can keep it all separate from my usual laptop work configuration and, when it's complete, I'll publish it. For testing, I've been using an updated version of the old store-performance script:

```python
from time import time

from rdflib import Graph, URIRef
from rdflib.namespace import FOAF, RDF

# memgraph, inputloc, path, store, timings, ntriples and nfoaf
# are defined earlier in the script.

# Parse the generated data into an in-memory graph
t0 = time()
memgraph.parse(location=inputloc, format="n3")
t1 = time()
timings["parsing"] = f"{t1 - t0:.5f}"

# Load the parsed triples into the store under test
graph = Graph(store, URIRef("http://rdflib.net"))
graph.open(path, create=(store != "SPARQLUpdateStore"))
graph.remove((None, None, None))  # start from an empty graph
skmemgraph = memgraph.skolemize()  # replace blank nodes with IRIs for stable counting
t0 = time()
graph += skmemgraph
t1 = time()
timings["loading"] = f"{t1 - t0:.5f}"
graph.commit()
graph.close()
del graph

# Open and read
graph = Graph(store, URIRef("http://rdflib.net"))
t0 = time()
graph.open(path, create=False)
t1 = time()
assert len(graph) == ntriples, len(graph)
timings["opening"] = f"{t1 - t0:.5f}"

t0 = time()
assert len(list(graph.triples((None, None, None)))) == ntriples
t1 = time()
timings["length"] = f"{t1 - t0:.5f}"

# Materialise the generator inside the timed section, otherwise only
# the (lazy) generator creation is timed, not the actual lookup
t0 = time()
res = list(graph.subjects(predicate=RDF.type, object=FOAF.Person))
t1 = time()
timings["subjects"] = f"{t1 - t0:.5f}"
assert len(res) == nfoaf
```

Latest results are:
One reason for using a persistence-backed store is that the graph isn't read into memory, which eases things RAM-wise. However, as the results indicate, iterating over the graph becomes increasingly time-consuming as the size of the graph increases, although some stores perform better than others. What isn't shown here (because I haven't yet got round to saving the results) is the disappointing performance of SPARQL queries. The SP²Bench suite is actually a SPARQL query performance benchmark and I have to say that several of those queries just don't return at all, even for graphs with only a few thousand triples.
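For anyone wanting to reproduce the persistence-backed setup, here's a minimal sketch using the BerkeleyDB store plugin (the path, identifier and input file are illustrative, and the `berkeleydb` Python package must be installed):

```python
from rdflib import Graph, URIRef

# Illustrative path and identifier
g = Graph(store="BerkeleyDB", identifier=URIRef("http://rdflib.net"))
g.open("/tmp/rdflib_bdb", create=True)  # creates the DB files on first use
g.parse("input.n3", format="n3")  # triples are persisted to disk, not held wholly in RAM
g.commit()
g.close()
```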
You don't say what your resources actually are, so it's not possible to make a sensible comment on that. You should be aware that large graphs demand appropriate resources. I'm noodling on a project that maps an altcoin blockchain directly to RDF; it generates around 6m triples per year and there are eight years' worth of blocks to represent. fwiw, I didn't even consider using RDFLib native stores for the persistence: they're designed for programming convenience, not performance. I use a Fuseki-backed SPARQLStore for my work and even then, SPARQL querying can be slooow, to the extent that I'm considering splitting it into a graph per year and seeing if that makes it any more tractable. I recommend that you have a browse through the user reports on the W3C wiki LargeTripleStores page; there are some useful hints there.
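By way of illustration, here's a minimal sketch of pointing RDFLib at a Fuseki endpoint via SPARQLStore (rdflib 6.x API; the endpoint URL and dataset name are assumptions):

```python
from rdflib import Graph
from rdflib.plugins.stores.sparqlstore import SPARQLStore

# Assumed local Fuseki instance with a dataset named "ds"
store = SPARQLStore(query_endpoint="http://localhost:3030/ds/sparql")
graph = Graph(store=store)

# Queries are evaluated server-side; only results cross the wire
for row in graph.query("SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"):
    print(row.n)
```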
Happy to engage in further discussion if you can share more details.
-
Okaay, I've updated the results to include the handful of SPARQL queries that a) return and b) don't take all week ...

(Results tables for graphs of 1285, 10303, 100073 and 250128 triples.)
Fuseki balks at accepting 250K triples over the wire; I can't really fault it for that, given that the recommended way of loading graphs is to use the command-line utilities. I'll run a separate test of the queries with a hand-loaded graph (expecting to see the same advantage for q11). Um, thinking about it ... I'm not really checking the query results, merely that `res = graph.query(query)` returns at all. I should record the number of answers and check against that.
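Something like this would close that gap (a sketch; `expected_answers` is a hypothetical lookup of the known answer count for each query, and the timing key is illustrative):

```python
t0 = time()
res = graph.query(query)
nanswers = len(res)  # forces the result set to be materialised
t1 = time()
timings["query"] = f"{t1 - t0:.5f}"
assert nanswers == expected_answers, (nanswers, expected_answers)
```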
-
+1. I do actually use BerkeleyDB here and there as a persistent file-DB back-end, since it works well, but we (my company and I) tend to use RDFLib for pipelines and manipulation of RDF before major DB loading. All our large RDF is in triplestores, many of which we then access via RDFLib-backed APIs. Python's web frameworks are really great these days (FastAPI, for example) and work well with RDFLib.
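As an illustration of that combination, here's a minimal sketch of a FastAPI endpoint backed by an RDFLib graph (the file name, route and query are all assumptions):

```python
from fastapi import FastAPI
from rdflib import Graph

app = FastAPI()
g = Graph().parse("data.ttl")  # assumed local Turtle file, loaded at startup

@app.get("/labels")
def labels():
    # Return the first ten labelled resources as JSON
    q = """
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?s ?label WHERE { ?s rdfs:label ?label } LIMIT 10
    """
    return [{"subject": str(row.s), "label": str(row.label)} for row in g.query(q)]
```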
-
Hi,
I want to host a big RDF triplestore using the native Python RDFLib graph database. I am worried about how this database will scale in terms of memory and CPU, though. My database will consist of millions of triples. I am also going to use owlrl for DeductiveClosure with OWLRL_Semantics.
If memory/CPU is a limitation, what is the best way to go about this? Should I segregate my triplestore across several stores and then use some kind of federation? Any answers are much appreciated :)!
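For concreteness, a minimal sketch of the inference step I have in mind (the input file name is a placeholder):

```python
from rdflib import Graph
import owlrl

g = Graph().parse("ontology.ttl")  # placeholder input file

# Expand the graph in place with the OWL-RL deductive closure;
# this can add a large number of inferred triples
owlrl.DeductiveClosure(owlrl.OWLRL_Semantics).expand(g)

print(len(g), "triples after closure")
```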
Thanks!