YAML syntax #119

tpluscode · 2025-01-17T13:43:05Z

It looks like the Qleverfile has grown complex enough to warrant a more flexible format. A popular choice would be YAML, which can be validated against a JSON Schema and supports arrays and deeply nested structures.

The only problem I see is variable substitution, which I don't think YAML supports out of the box. But since it looks more like scripting than simple interpolation, I would think it is independent of how the file is formatted.

Anyway, here's a possible rendition of the Wikidata example:

# Qleverfile for Wikidata, use with the QLever CLI (`pip install qlever`)
#
# qlever get-data  # ~7 hours, ~110 GB (compressed), ~20 billion triples
# qlever index     # ~5 hours, ~20 GB RAM, ~500 GB index size on disk
# qlever start     # a few seconds, adjust MEMORY_FOR_QUERIES as needed
#
# Adding a text index takes an additional ~2 hours and ~50 GB of disk space
#
# Measured on an AMD Ryzen 9 5950X with 128 GB RAM, and NVMe SSD (18.10.2024)

name: &name wikidata

env:
  GET_DATA_URL: https://dumps.wikimedia.org/wikidatawiki/entities
  DATE_WIKIDATA: $$(date -r latest-all.ttl.bz2 +%d.%m.%Y || echo "NO_DATE")
  DATE_WIKIPEDIA: $$(date -r wikipedia-abstracts.nt +%d.%m.%Y || echo "NO_DATE")

data:
  get-data-cmd: >
    curl -LRC - -O ${GET_DATA_URL}/latest-all.ttl.bz2 -O ${GET_DATA_URL}/latest-lexemes.ttl.bz2 2>&1 
    | tee wikidata.download-log.txt && curl -sL ${GET_DATA_URL}/dcatap.rdf 
    | docker run -i --rm -v $$(pwd):/data stain/jena riot --syntax=RDF/XML --output=NT /dev/stdin 
    > dcatap.nt
  description: Full Wikidata dump from ${GET_DATA_URL} (latest-all.ttl.bz2 and latest-lexemes.ttl.bz2, version ${DATE_WIKIDATA})

index:
  input-files: [ latest-all.ttl.bz2, latest-lexemes.ttl.bz2, dcatap.nt ]
  input:
    - cmd: lbzcat -n 4 latest-all.ttl.bz2
      format: ttl
      parallel: true
    - cmd: lbzcat -n 1 latest-lexemes.ttl.bz2
      format: ttl
      parallel: false
    - cmd: cat dcatap.nt
      format: nt
  settings:
    languages-internal: []
    prefixes-external: [""]
    locale:
      language: en
      country: US
      ignore-punctuation: true
    ascii-prefixes-only: true
    num-triples-per-batch: 5000000
  stxxl-memory: 10G

server:
  port: 7001
  access-token: *name
  memory-for-queries: 20G
  cache-max-size: 15G
  cache-max-size-single-entry: 5G
  timeout: 600s

runtime:
  system: docker
  image: adfreiburg/qlever

ui:
  config: wikidata
