feat: context cache #70

Open · wants to merge 8 commits into master

Conversation

@jeswr (Contributor) commented Oct 27, 2023

Creates a context caching mechanism. The use case for this is parsing numerous JSON-LD objects that all share the same context object, where context parsing is often the main bottleneck; see the performance results below:

Parse a context that has not been cached; and without caching in place x 108 ops/sec ±0.32% (86 runs sampled)
Parse a list of iri contexts that have been cached x 79,985 ops/sec ±0.44% (90 runs sampled)
Parse a context object that has not been cached x 1,950 ops/sec ±1.36% (90 runs sampled)
Parse a context object that has been cached x 7,637 ops/sec ±0.20% (91 runs sampled)
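For illustration, a minimal sketch of what a hash-keyed context cache along these lines could look like (a sketch only, not this PR's actual code: `CachedContextParser` is a hypothetical name, and it assumes the lru-cache package with its v10 named export and a plain JSON.stringify key):

```ts
import { LRUCache } from 'lru-cache';
import { ContextParser, JsonLdContextNormalized } from 'jsonld-context-parser';

// Hypothetical wrapper: caches fully parsed contexts, keyed by a hash of the raw context.
class CachedContextParser {
  private readonly parser = new ContextParser();
  // Bounded LRU cache so old contexts are evicted instead of accumulating forever.
  private readonly cache = new LRUCache<string, JsonLdContextNormalized>({ max: 100 });

  public async parse(context: any): Promise<JsonLdContextNormalized> {
    // JSON.stringify is a cheap but non-canonical key: differing key order causes misses.
    const key = typeof context === 'string' ? context : JSON.stringify(context);
    const cached = this.cache.get(key);
    if (cached) {
      return cached;
    }
    const parsed = await this.parser.parse(context);
    this.cache.set(key, parsed);
    return parsed;
  }
}
```

The hashing step in such a scheme is also where the overhead discussed below comes from: every parse call pays for serializing the context, even on a cache hit.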

@coveralls commented Oct 27, 2023

Pull Request Test Coverage Report for Build 6666516387

  • 53 of 53 (100.0%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.1%) to 99.904%

Totals (Coverage Status):
  • Change from base Build 6492290900: -0.1%
  • Covered Lines: 570
  • Relevant Lines: 570

💛 - Coveralls

@rubensworks (Owner)

This looks very neat, thanks @jeswr!

Before I do an in-depth review, I have some general questions.

  • What is the overall performance impact of this when caching is not very valuable? Might be useful to run the parser perf checks again with and without this cache. I suspect hashing to have some overhead.
  • Do you have insights on memory consumption? My fear is that for parsers that are reused, many (potentially large) contexts can be cached. I know there's the LRU cache, but I wonder if we should consider disabling the cache by default (not sure tbh).
  • Do all JSON-LD spec tests still pass with this change?

Pinging @sdevalk, as this will interest him regarding rubensworks/rdf-dereference.js#48

@jeswr (Contributor, Author) commented Oct 27, 2023

> What is the overall performance impact of this when caching is not very valuable? Might be useful to run the parser perf checks again with and without this cache. I suspect hashing to have some overhead.

In the worst case it seems to be an extra 50% overhead. It can be worked around by trying to work by-reference rather than by hash most of the time (see https://github.com/inrupt/solid-client-vc-js/blob/0f8ce276b6ea8a977b9d2ea189bc92385ef44b48/src/parser/jsonld.ts#L70-L111); however, the danger here is that you accidentally consume a large amount of memory by using contexts as keys, so I'm leaving this for a subsequent piece of work once better benchmarking is in place and we have a good way of pruning them quickly.
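Roughly, a by-reference lookup of that kind could be sketched as below (a hypothetical helper, not the linked code; it assumes callers reuse the exact same context object between parses). The memory danger mentioned above comes from the Map holding strong references to every context ever used as a key:

```ts
import { ContextParser, JsonLdContextNormalized } from 'jsonld-context-parser';

const parser = new ContextParser();
// Keyed on object identity: a hit only happens when the caller passes the *same* object again.
// A plain Map keeps strong references, so every context used as a key stays in memory.
const byReference = new Map<any, Promise<JsonLdContextNormalized>>();

function parseByReference(context: any): Promise<JsonLdContextNormalized> {
  let parsed = byReference.get(context);
  if (!parsed) {
    parsed = parser.parse(context);
    byReference.set(context, parsed);
  }
  return parsed;
}
```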

FYI the custom caching mechanism in https://github.com/inrupt/solid-client-vc-js/blob/0f8ce276b6ea8a977b9d2ea189bc92385ef44b48/src/parser/jsonld.ts reduced the time on e2e tests from 20 minutes to 20 seconds.

> Do you have insights on memory consumption? My fear is that for parsers that are reused, many (potentially large) contexts can be cached. I know there's the LRU cache, but I wonder if we should consider disabling the cache by default (not sure tbh).

I've disabled caching by default in b48cc07, so we will need to update jsonld-streaming-parser to allow custom contexts / caches to be passed in.
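Downstream usage might then look something like the following; this is purely illustrative, and the `cache` option name is hypothetical, since the actual option introduced by this PR isn't shown in this thread:

```ts
import { LRUCache } from 'lru-cache';
import { ContextParser } from 'jsonld-context-parser';

// Illustrative only: with caching off by default, a consumer would opt in explicitly,
// e.g. by handing the parser a bounded cache instance (option name is hypothetical).
const parser = new ContextParser({
  cache: new LRUCache<string, any>({ max: 50 }),
} as any);
```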

> Do all JSON-LD spec tests still pass with this change?

I'm not seeing rdf-test-suite as a dev dependency of this library or any commands to run spec testing. Do you have a script somewhere to run spec tests on this lib?

@rubensworks (Owner)

> danger here is that you accidentally consume a large amount of memory by using contexts as keys, so I'm leaving this for a subsequent piece of work once better benchmarking is in place and we have a good way of pruning them quickly.

I suspect that caching by reference would consume less memory than hashing, as the cache would only hold pointers to shared memory.

> reduced the time on e2e tests from 20 minutes to 20 seconds.

Ooh, nice!

> I'm not seeing rdf-test-suite as a dev dependency of this library or any commands to run spec testing. Do you have a script somewhere to run spec tests on this lib?

This will have to be tested by manually plugging this into jsonld-streaming-parser.

@jeswr (Contributor, Author) commented Oct 27, 2023

> as the cache would only hold pointers to shared memory.

My main concern here is that if we have a poor configuration of the lru-cache then we are preventing GC on context objects that have already been parsed and may no longer be relevant.
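For what it's worth, the retention window can be kept small through the lru-cache options themselves; a tightly bounded configuration (illustrative numbers, a sketch rather than anything this PR prescribes) would look like:

```ts
import { LRUCache } from 'lru-cache';

// Illustrative: at most 20 cached contexts, each proactively purged after 5 minutes,
// so parsed contexts that are no longer used become collectable again.
const contextCache = new LRUCache<string, any>({
  max: 20,
  ttl: 5 * 60 * 1000,
  ttlAutopurge: true,
});
```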

@rubensworks (Owner)

> My main concern here is that if we have a poor configuration of the lru-cache then we are preventing GC on context objects that have already been parsed and may no longer be relevant.

I would suspect lru-cache to be very well battle-tested by now, but I don't know enough of its internals to make any hard claims :-)
