Possible memory leak using Oxigraph as RDFlib backend #17

Open
daniel-dona opened this issue Oct 20, 2022 · 3 comments

daniel-dona commented Oct 20, 2022

When an RDFlib graph is used as temporary storage, the Oxigraph backend seems to leak memory; the problem doesn't appear with the default RDFlib backend.

Is there a proper way to clean up all the memory used by the Graph?
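
The most explicit cleanup I can find in rdflib is Graph.close(), which delegates to the backing store's close(). A minimal sketch of what I mean (whether the Oxigraph store actually releases its memory on close is exactly what I'm asking about):

import rdflib

g = rdflib.Graph(store="Oxigraph")
# ... load data into the graph ...
g.close()  # standard rdflib API; forwards to the store's close()
del g      # drop the last Python reference so the object can be collected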

Test code:

from SPARQLWrapper import SPARQLWrapper, RDFXML

import rdflib

from memory_profiler import memory_usage


ENDPOINT = "https://query.wikidata.org/bigdata/namespace/wdq/sparql"

sparql_config = {
    "endpoint": ENDPOINT
}


def query():
    # Fetch a reference graph (10000 "instance of human" triples) from Wikidata
    wrapper = SPARQLWrapper(**sparql_config)
    wrapper.setReturnFormat(RDFXML)
    wrapper.setMethod("POST")
    wrapper.setTimeout(10)

    query = '''
        CONSTRUCT {
            ?item wdt:P31 wd:Q5
        }
        WHERE {
            ?item wdt:P31 wd:Q5 .
        }
        LIMIT 10000
        '''

    wrapper.setQuery(query)
    return wrapper.queryAndConvert()


g_ref = query()


def run():
    #g = rdflib.Graph()
    g = rdflib.Graph(store="Oxigraph")  # requires the oxrdflib plugin
    g += g_ref  # Load data in the graph


for i in range(100):
    run()
    print("Memory: ", memory_usage(-1, interval=.1, timeout=.1)[0], "MiB")

Result:

Memory:  58.4375 MiB
Memory:  66.9765625 MiB
Memory:  69.91015625 MiB
Memory:  75.30859375 MiB
Memory:  77.984375 MiB
Memory:  81.8046875 MiB
Memory:  86.51953125 MiB
Memory:  90.5546875 MiB
Memory:  94.15625 MiB
Memory:  99.5390625 MiB
Memory:  102.65625 MiB
Memory:  106.3984375 MiB
Memory:  110.53125 MiB
Memory:  114.16796875 MiB
Memory:  119.33984375 MiB
Memory:  122.96484375 MiB
Memory:  126.83984375 MiB
Memory:  130.9765625 MiB
Memory:  135.19921875 MiB
Memory:  138.625 MiB
[...]
Memory:  413.24609375 MiB
Memory:  418.140625 MiB
Memory:  421.3046875 MiB
Memory:  425.296875 MiB
Memory:  429.1796875 MiB
Memory:  433.67578125 MiB
Memory:  436.72265625 MiB
Memory:  439.859375 MiB
Memory:  444.72265625 MiB
Memory:  449.33203125 MiB
Memory:  452.96875 MiB
Memory:  457.765625 MiB
Memory:  460.9296875 MiB

And with the default backend:

Memory:  50.94140625 MiB
Memory:  52.47265625 MiB
Memory:  60.80859375 MiB
Memory:  55.734375 MiB
Memory:  62.79296875 MiB
Memory:  70.7734375 MiB
Memory:  51.49609375 MiB
Memory:  60.75390625 MiB
Memory:  54.48046875 MiB
Memory:  62.03515625 MiB
Memory:  55.5703125 MiB
Memory:  63.09765625 MiB
[...]
Memory:  70.7890625 MiB
Memory:  51.35546875 MiB
Memory:  61.03515625 MiB
Memory:  51.76171875 MiB
Memory:  61.03515625 MiB
Memory:  55.32421875 MiB
Memory:  62.87890625 MiB

Tpt commented Oct 20, 2022

Thank you for trying Oxigraph!

This is strange. I tried to force the Python GC to run and it did not help. It's definitely possible that the leak is in the Python <-> Rust bridge, in the Rust code, or in the C++ code of the storage system (RocksDB).

I run the LLVM memory sanitizer, which integrates a leak detection mechanism, as part of Oxigraph's CI, and it does not seem to raise anything, but it does not fully instrument the C++ codebase and does not cover the Python <-> Rust bridge.

A deeper investigation would definitely be useful.
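
One cheap way to narrow it down: Python's tracemalloc only tracks allocations made through the Python allocator, so if the process RSS keeps growing while tracemalloc's total stays flat, the leak is almost certainly on the native (Rust/C++) side. A minimal sketch, reusing the run() function from the report above:

import tracemalloc

from memory_profiler import memory_usage

tracemalloc.start()
for i in range(100):
    run()  # the run() function from the report above
    current, _peak = tracemalloc.get_traced_memory()  # bytes seen by the Python allocator
    rss = memory_usage(-1, interval=.1, timeout=.1)[0]  # process RSS in MiB
    # A flat Python-tracked total with a growing RSS points at native allocations
    print(f"Python-tracked: {current / 2**20:.1f} MiB, RSS: {rss:.1f} MiB")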

daniel-dona (author) commented

Thank you @Tpt for your time and the Oxigraph project!

If I can help in any way to debug this, I'm available to give a helping hand...


Tpt commented Oct 21, 2022

> If I can help in any way to debug this, I'm available to give a helping hand...

Thank you! If you are familiar with chasing memory leaks in C++ and Rust, it would be much appreciated.

Even minimal code like the following seems to leak a bit:

import gc
import pyoxigraph
from memory_profiler import memory_usage

for i in range(100):
    g = pyoxigraph.Store()
    for t in g_ref:  # g_ref: the rdflib graph fetched in the report above
        g.add(_to_ox(t))  # _to_ox: oxrdflib's internal rdflib -> pyoxigraph converter
    del g
    gc.collect()
    print("Memory: ", memory_usage(-1, interval=.1, timeout=.1)[0], "MiB")

Imho it's not a very high priority: the Oxigraph store is aimed at very large datasets, so a small leak during its creation is probably not a huge issue in practice (but it's definitely worth investigating and fixing). Work like improving SPARQL evaluation is likely to help far more with keeping memory usage in check.
