YAML syntax #119

Open
tpluscode opened this issue Jan 17, 2025 · 0 comments


It looks like the Qleverfile has grown complex enough to warrant a more flexible format. A popular choice would be YAML, which can be validated with JSON Schema and supports arrays and deeply nested structures.

The only problem I see is variable substitution, which I don't think YAML supports out of the box. But since it looks like scripting rather than simple interpolation, I would think it is unrelated to how the file is formatted.
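To illustrate that substitution can live outside the file format, here is a minimal Python sketch that expands `${VAR}` references after parsing. It is not how the QLever CLI does it: the `interpolate` helper and the hard-coded `config` dict (standing in for the output of a YAML parser) are assumptions made for the example. It relies only on the standard library's `string.Template`, which happens to treat `$$` as an escaped `$`, so shell substitutions such as `$$(pwd)` pass through as `$(pwd)`.

```python
from string import Template


def interpolate(node, env):
    """Recursively expand ${VAR} in every string of a parsed config tree."""
    if isinstance(node, str):
        # safe_substitute leaves unknown ${VAR} untouched and turns $$ into $
        return Template(node).safe_substitute(env)
    if isinstance(node, dict):
        return {key: interpolate(value, env) for key, value in node.items()}
    if isinstance(node, list):
        return [interpolate(item, env) for item in node]
    return node  # numbers, booleans, None stay as they are


# Hypothetical stand-in for what a YAML parser would return for part of the file
config = {
    "env": {"GET_DATA_URL": "https://dumps.wikimedia.org/wikidatawiki/entities"},
    "data": {
        "get-data-cmd": (
            "curl -sL ${GET_DATA_URL}/dcatap.rdf "
            "| docker run -i --rm -v $$(pwd):/data stain/jena riot"
        ),
    },
}

expanded = interpolate(config["data"], config["env"])
print(expanded["get-data-cmd"])
```

The expansion runs on the parsed tree, so the same helper would work whether the file is YAML, JSON, or something else.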

Anyway, here's a possible rendition of the Wikidata example:

# Qleverfile for Wikidata, use with the QLever CLI (`pip install qlever`)
#
# qlever get-data  # ~7 hours, ~110 GB (compressed), ~20 billion triples
# qlever index     # ~5 hours, ~20 GB RAM, ~500 GB index size on disk
# qlever start     # a few seconds, adjust MEMORY_FOR_QUERIES as needed
#
# Adding a text index takes an additional ~2 hours and ~50 GB of disk space
#
# Measured on an AMD Ryzen 9 5950X with 128 GB RAM, and NVMe SSD (18.10.2024)

name: &name wikidata

env:
  GET_DATA_URL: https://dumps.wikimedia.org/wikidatawiki/entities
  DATE_WIKIDATA: $$(date -r latest-all.ttl.bz2 +%d.%m.%Y || echo "NO_DATE")
  DATE_WIKIPEDIA: $$(date -r wikipedia-abstracts.nt +%d.%m.%Y || echo "NO_DATE")

data:
  get-data-cmd: >
    curl -LRC - -O ${GET_DATA_URL}/latest-all.ttl.bz2 -O ${GET_DATA_URL}/latest-lexemes.ttl.bz2 2>&1 
    | tee wikidata.download-log.txt && curl -sL ${GET_DATA_URL}/dcatap.rdf 
    | docker run -i --rm -v $$(pwd):/data stain/jena riot --syntax=RDF/XML --output=NT /dev/stdin 
    > dcatap.nt
  description: Full Wikidata dump from ${GET_DATA_URL} (latest-all.ttl.bz2 and latest-lexemes.ttl.bz2, version ${DATE_WIKIDATA})

index:
  input-files: [ latest-all.ttl.bz2, latest-lexemes.ttl.bz2, dcatap.nt ]
  input:
    - cmd: lbzcat -n 4 latest-all.ttl.bz2
      format: ttl
      parallel: true
    - cmd: lbzcat -n 1 latest-lexemes.ttl.bz2
      format: ttl
      parallel: false
    - cmd: cat dcatap.nt
      format: nt
  settings:
    languages-internal: []
    prefixes-external: [""]
    locale:
      language: en
      country: US
      ignore-punctuation: true
    ascii-prefixes-only: true
    num-triples-per-batch: 5000000
  stxxl-memory: 10G

server:
  port: 7001
  access-token: *name
  memory-for-queries: 20G
  cache-max-size: 15G
  cache-max-size-single-entry: 5G
  timeout: 600s

runtime:
  system: docker
  image: adfreiburg/qlever

ui:
  config: wikidata
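As a rough sketch of what the nested `index.input` list buys, the Python snippet below checks each entry and serializes the list to JSON. Everything here is hypothetical: the `normalize_inputs` helper, the rule that `cmd` and `format` are required, and the default of `false` for a missing `parallel` key are assumptions for illustration, not documented QLever behaviour. The `inputs` list is written out by hand to mirror what a YAML parser would return.

```python
import json

# Hand-written stand-in for the parsed `index.input` list from the rendition above
inputs = [
    {"cmd": "lbzcat -n 4 latest-all.ttl.bz2", "format": "ttl", "parallel": True},
    {"cmd": "lbzcat -n 1 latest-lexemes.ttl.bz2", "format": "ttl", "parallel": False},
    {"cmd": "cat dcatap.nt", "format": "nt"},  # no `parallel` key, as in the YAML
]


def normalize_inputs(entries):
    """Check required keys and fill in an assumed default for `parallel`."""
    normalized = []
    for position, entry in enumerate(entries):
        missing = {"cmd", "format"} - entry.keys()
        if missing:
            raise ValueError(f"input #{position} lacks {sorted(missing)}")
        # Assumption: an entry without `parallel` is parsed sequentially
        normalized.append({**entry, "parallel": entry.get("parallel", False)})
    return normalized


inputs_json = json.dumps(normalize_inputs(inputs), indent=2)
print(inputs_json)
```

With a JSON Schema in place, the required-key check could move out of the code and into the schema.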