RDFLib triplestore CPU/RAM #1784
Replies: 3 comments
-
Your question is timely. @nicholascar has expressed a desire for a comparative review of the various RDFLib persistence stores and I've been working on exactly that, using the SP²Bench SPARQL Performance Benchmark because it has a handy data generator which I can use to generate graphs of arbitrary numbers of triples. I have created an Ansible-driven Vagrant solution so that I can keep it all separate from my usual laptop work configuration and, when it's complete, I'll publish it. For testing, I've been using an updated version of the old store-performance script:

```python
from time import time

from rdflib import Graph, URIRef
from rdflib.namespace import FOAF, RDF

# memgraph, inputloc, path, store, timings, ntriples and nfoaf
# are defined earlier in the script.

# Parse the generated data into an in-memory graph
t0 = time()
memgraph.parse(location=inputloc, format="n3")
t1 = time()
timings["parsing"] = f"{t1 - t0:.5f}"

# Load the parsed triples into the store under test
graph = Graph(store, URIRef("http://rdflib.net"))
graph.open(path, create=(store != "SPARQLUpdateStore"))
graph.remove((None, None, None))  # start from an empty graph
skmemgraph = memgraph.skolemize()  # replace blank nodes with IRIs for stable counting
t0 = time()
graph += skmemgraph
t1 = time()
timings["loading"] = f"{t1 - t0:.5f}"
graph.commit()
graph.close()
del graph

# Open and read
graph = Graph(store, URIRef("http://rdflib.net"))
t0 = time()
graph.open(path, create=False)
t1 = time()
assert len(graph) == ntriples, len(graph)
timings["opening"] = f"{t1 - t0:.5f}"

t0 = time()
assert len(list(graph.triples((None, None, None)))) == ntriples
t1 = time()
timings["length"] = f"{t1 - t0:.5f}"

# Materialise the generator inside the timed section, otherwise only
# the (lazy) generator creation is timed, not the actual lookup
t0 = time()
res = list(graph.subjects(predicate=RDF.type, object=FOAF.Person))
t1 = time()
timings["subjects"] = f"{t1 - t0:.5f}"
assert len(res) == nfoaf
```

Latest results are:
One reason for using a persistence-backed store is that the graph isn't read into memory, which eases things RAM-wise. However, as the results indicate, iterating over the graph becomes increasingly time-consuming as the size of the graph increases, although some stores perform better than others. What isn't shown here (because I haven't yet got round to saving the results) is the disappointing performance of SPARQL queries. The SP²Bench suite is actually a SPARQL query performance benchmark and I have to say that several of those queries just don't return at all, even for graphs with only a few thousand triples.
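For anyone wanting to reproduce the persistence-backed setup, here's a minimal sketch using the BerkeleyDB store plugin (the path, identifier and input file are illustrative, and the `berkeleydb` Python package must be installed):

```python
from rdflib import Graph, URIRef

# Illustrative path and identifier
g = Graph(store="BerkeleyDB", identifier=URIRef("http://rdflib.net"))
g.open("/tmp/rdflib_bdb", create=True)  # creates the DB files on first use
g.parse("input.n3", format="n3")  # triples are persisted to disk, not held wholly in RAM
g.commit()
g.close()
```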
You don't say what your resources actually are, so it's not possible to make a sensible comment on that. You should be aware that large graphs demand appropriate resources. I'm noodling on a project that maps an altcoin blockchain directly to RDF; it generates around 6m triples per year and there are eight years' worth of blocks to represent. fwiw, I didn't even consider using RDFLib native stores for the persistence: they're designed for programming convenience, not performance. I use a Fuseki-backed SPARQLStore for my work and even then, SPARQL querying can be slooow, to the extent that I'm considering splitting it into a graph per year and seeing if that makes it any more tractable. I recommend that you have a browse through the user reports on the W3C wiki LargeTripleStores page; there are some useful hints there.
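By way of illustration, here's a minimal sketch of pointing RDFLib at a Fuseki endpoint via SPARQLStore (rdflib 6.x API; the endpoint URL and dataset name are assumptions):

```python
from rdflib import Graph
from rdflib.plugins.stores.sparqlstore import SPARQLStore

# Assumed local Fuseki instance with a dataset named "ds"
store = SPARQLStore(query_endpoint="http://localhost:3030/ds/sparql")
graph = Graph(store=store)

# Queries are evaluated server-side; only results cross the wire
for row in graph.query("SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"):
    print(row.n)
```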
Happy to engage in further discussion if you can share more details.
-
Okaay, I've updated the results to include the handful of SPARQL queries that a) return and b) don't take all week ...

(Results tables for graphs of 1285, 10303, 100073 and 250128 triples.)
Fuseki balks at accepting 250K triples over the wire; I can't really fault it for that, given that the recommended way of loading graphs is to use the command-line utilities. I'll run a separate test of the queries with a hand-loaded graph (expecting to see the same advantage for q11). Um, thinking about it ... I'm not really checking the query results, merely that `res = graph.query(query)` returns at all. I should record the number of answers and check against that.
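Something like this would close that gap (a sketch; `expected_answers` is a hypothetical lookup of the known answer count for each query, and the timing key is illustrative):

```python
t0 = time()
res = graph.query(query)
nanswers = len(res)  # forces the result set to be materialised
t1 = time()
timings["query"] = f"{t1 - t0:.5f}"
assert nanswers == expected_answers, (nanswers, expected_answers)
```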
-
+1. I do actually use BerkeleyDB here and there as a persistent file-DB back-end, since it works well, but we (my company and I) tend to use RDFLib for pipelines and manipulation of RDF before major DB loading. All our large RDF is in triplestores, many of which we then access via RDFLib-backed APIs. Python's web frameworks are really great these days (FastAPI, for example) and work well with RDFLib.
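As an illustration of that combination, here's a minimal sketch of a FastAPI endpoint backed by an RDFLib graph (the file name, route and query are all assumptions):

```python
from fastapi import FastAPI
from rdflib import Graph

app = FastAPI()
g = Graph().parse("data.ttl")  # assumed local Turtle file, loaded at startup

@app.get("/labels")
def labels():
    # Return the first ten labelled resources as JSON
    q = """
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?s ?label WHERE { ?s rdfs:label ?label } LIMIT 10
    """
    return [{"subject": str(row.s), "label": str(row.label)} for row in g.query(q)]
```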
-
Hi,
I want to host a big RDF triplestore using the native Python RDFLib graph database. I am worried about how this database will scale in terms of memory and CPU, though. My database will consist of millions of triples. I am also going to use owlrl for DeductiveClosure with OWLRL_Semantics.
If memory/CPU is a limitation, what is the best way to go about this? Should I segregate my triplestore across several stores and then use some kind of federation? Any answers are much appreciated :)!
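For concreteness, a minimal sketch of the inference step I have in mind (the input file name is a placeholder):

```python
from rdflib import Graph
import owlrl

g = Graph().parse("ontology.ttl")  # placeholder input file

# Expand the graph in place with the OWL-RL deductive closure;
# this can add a large number of inferred triples
owlrl.DeductiveClosure(owlrl.OWLRL_Semantics).expand(g)

print(len(g), "triples after closure")
```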
Thanks!