Remove the JSON from Qleverfile #117

Open
tpluscode opened this issue Jan 17, 2025 · 4 comments

@tpluscode
tpluscode commented Jan 17, 2025

I have been looking into Qleverfiles recently and I find the JSON keys a little awkward.

Take the [index] section from the Wikidata example:

[index]
INPUT_FILES      = latest-all.ttl.bz2 latest-lexemes.ttl.bz2 dcatap.nt
MULTI_INPUT_JSON = [{ "cmd": "lbzcat -n 4 latest-all.ttl.bz2", "format": "ttl", "parallel": "true" },
                    { "cmd": "lbzcat -n 1 latest-lexemes.ttl.bz2", "format": "ttl", "parallel": "false" },
                    { "cmd": "cat dcatap.nt", "format": "nt", "parallel": "false" }]
SETTINGS_JSON    = { "languages-internal": [], "prefixes-external": [""], "locale": { "language": "en", "country": "US", "ignore-punctuation": true }, "ascii-prefixes-only": true, "num-triples-per-batch": 5000000 }
STXXL_MEMORY     = 10G

I'd propose allowing "dynamic" sections to break up the complex objects from MULTI_INPUT_JSON and SETTINGS_JSON:

[index]
INPUT_FILES  = latest-all.ttl.bz2 latest-lexemes.ttl.bz2 dcatap.nt
STXXL_MEMORY = 10G

[index.latest-all]
CMD      = lbzcat -n 4 latest-all.ttl.bz2
FORMAT   = ttl
PARALLEL = true

[index.latest-lexemes]
CMD      = lbzcat -n 1 latest-lexemes.ttl.bz2
FORMAT   = ttl
PARALLEL = false

[index.dcatap]
CMD      = cat dcatap.nt
FORMAT   = nt
PARALLEL = false

[index.settings]
LANGUAGES-INTERNAL    = []
PREFIXES-EXTERNAL     = [""]
ASCII-PREFIXES-ONLY   = true
NUM-TRIPLES-PER-BATCH = 5000000

[index.settings.locale]
LANGUAGE           = en
COUNTRY            = US
IGNORE-PUNCTUATION = true
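
To make the proposal concrete, here is a minimal sketch of how such dotted sections could be folded back into the list that MULTI_INPUT_JSON carries today. It assumes an INI parser along the lines of Python's standard configparser. The function name `collect_multi_input`, the section name `index.dcatap`, and the rule "every `[index.<name>]` section except `settings` is one input" are illustrative assumptions, not existing Qleverfile behaviour.

```python
import configparser

# Hypothetical Qleverfile using the proposed "dynamic" sections.
QLEVERFILE = """
[index]
INPUT_FILES  = latest-all.ttl.bz2 latest-lexemes.ttl.bz2 dcatap.nt
STXXL_MEMORY = 10G

[index.latest-all]
CMD      = lbzcat -n 4 latest-all.ttl.bz2
FORMAT   = ttl
PARALLEL = true

[index.latest-lexemes]
CMD      = lbzcat -n 1 latest-lexemes.ttl.bz2
FORMAT   = ttl
PARALLEL = false

[index.dcatap]
CMD      = cat dcatap.nt
FORMAT   = nt
PARALLEL = false

[index.settings]
ASCII-PREFIXES-ONLY   = true
NUM-TRIPLES-PER-BATCH = 5000000
"""


def collect_multi_input(text):
    """Return one dict per [index.<name>] section, in file order."""
    # interpolation=None keeps "%" in shell commands from being interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    inputs = []
    for section in parser.sections():
        parts = section.split(".")
        # Assumed rule: two-part names under "index" are inputs,
        # except the reserved name "settings".
        if len(parts) == 2 and parts[0] == "index" and parts[1] != "settings":
            # configparser lowercases keys by default, which happens to
            # match the "cmd"/"format"/"parallel" keys of the JSON form.
            inputs.append(dict(parser[section]))
    return inputs


multi_input = collect_multi_input(QLEVERFILE)
```

One consequence of this layout: configparser in its default strict mode raises DuplicateSectionError when a section name appears twice, so every input needs a distinct section name.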

Alternatively, you could support YAML instead. That would, for example, allow you to publish a JSON Schema for validation and editor suggestions.

PS: The standard way to write the locale in this case would be en-US. I'm curious why it's split into two keys.

@hannahbast
Member

@tpluscode Thank you for the comment. Can you be more specific about why you find the JSON awkward? By the way, the [...] are not needed for the MULTI_INPUT_JSON; JSONL is fine as well.

@tpluscode
Author

The first thing is readability. JSON also feels alien to the text format around it, like putting binary columns in a relational database.

And the current formatting requires jumping through some hoops to stringify it just right. From JavaScript, this is not exactly what JSON.stringify produces, so some additional code is necessary to get the right output.
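
The same friction shows up outside JavaScript. As a rough sketch in Python (used here only for illustration), the standard serializer emits either one long line or a fully indented document, so reproducing the aligned, one-object-per-line layout of the Wikidata example takes a small helper. The name `format_multi_input` is hypothetical, and the sketch assumes that the aligned layout is the target and that continuation lines only need leading whitespace.

```python
import json

inputs = [
    {"cmd": "lbzcat -n 4 latest-all.ttl.bz2", "format": "ttl", "parallel": "true"},
    {"cmd": "cat dcatap.nt", "format": "nt", "parallel": "false"},
]


def format_multi_input(key, items):
    """Render `key = [...]` with one JSON object per line, aligned under the first."""
    prefix = f"{key} = ["
    # Continuation lines are indented to line up with the first object.
    indent = " " * len(prefix)
    body = (",\n" + indent).join(json.dumps(item) for item in items)
    return prefix + body + "]"


entry = format_multi_input("MULTI_INPUT_JSON", inputs)
print(entry)
```

The value part of the result is still plain JSON, so it can be read back with `json.loads`; only the line breaks and alignment need the extra code.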

@hannahbast
Member

@tpluscode Coming back to this after three stressful weeks. I am not a fan of the "dynamic" sections. But I agree that YAML would be a natural format.

When we started, we wanted something that is as simple as possible and no more complex than necessary. YAML is the way to go when you have multiline stuff, where indentation is important for readability. For most Qleverfiles, this is what we want now (either sequences of commands or SPARQL queries or both).

The switch will be a bit annoying because we will have to support the old format for a while. But that's life.

@ktk
ktk commented Feb 13, 2025

@hannahbast let's talk about how we can do that together. We can contribute validation and the tooling around it so it becomes easier to work with. @tpluscode and @ludovicm67 have ideas and experience with this.
