language strings missing a language tag #37

Open
pchampin opened this issue Jul 28, 2023 · 8 comments
Labels
test:needs tests (Test suite related: missing test)

Comments

@pchampin
Contributor

It just occurred to me that the following Turtle should be rejected as invalid:

@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.

[] rdfs:label "this is invalid RDF"^^rdf:langString.

Indeed, if accepted, it generates an invalid literal: a literal whose datatype is rdf:langString but which does not have a language tag. This contradicts the "if and only if" of RDF Concepts' definition.

Almost all implementations that I have tested accept it (notable exception: @gkellogg's Ruby implementation).

We probably need a negative test for this, and of course similar tests for other concrete syntaxes.
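
To make the "if and only if" concrete, here is a minimal Python sketch of a literal type that enforces it. The `Literal` class, its field names, and the `accepts` helper are hypothetical and not taken from any RDF library; treating an empty tag the same as a missing one is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional

RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


@dataclass(frozen=True)
class Literal:
    """Hypothetical literal model: lexical form, datatype IRI, optional language tag."""

    lexical_form: str
    datatype: str
    language: Optional[str] = None

    def __post_init__(self) -> None:
        has_tag = bool(self.language)  # assumption: None and "" both mean "no tag"
        is_lang_string = self.datatype == RDF_LANGSTRING
        # The "if and only if": a tag is present exactly when the datatype is rdf:langString.
        if has_tag != is_lang_string:
            raise ValueError(
                "a literal has a language tag if and only if "
                "its datatype is rdf:langString"
            )


def accepts(lexical_form: str, datatype: str, language: Optional[str] = None) -> bool:
    """Return True when the literal can be constructed under the strict rule."""
    try:
        Literal(lexical_form, datatype, language)
        return True
    except ValueError:
        return False
```

Under this strict model, the literal from the Turtle example above cannot be constructed at all, which is the behaviour a negative test would check for.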

@afs
Contributor

afs commented Jul 28, 2023

It can be considered an ill-formed literal according to the datatype, much like "tuesday"^^xsd:integer, which is ill-defined by its datatype because the lexical form is not covered by the datatype mapping.

[] datatype that determines how the lexical form maps to a literal value

The (extended) lexical to value mapping does not exist for the case "this is invalid RDF"^^rdf:langString.

The text below the definition, "A literal is a language-tagged string if the third element is present.", says it isn't a language-tagged string, i.e. it has no value (the "literal value" text).

Special-case datatypes are problematic because they propagate, e.g. to STRDT, and to subclasses/derived types of that datatype.

(Also -- bullet two only refers to lexical form.)

@gkellogg
Member

> Almost all implementations that I have tested accept it (notable exception: @gkellogg's Ruby implementation).

My implementation does, in fact, parse it, even though it's invalid, as it parses "tuesday"^^xsd:integer. It's when serializing to N-Triples that an error is generated (presuming you're using my distiller). If you change the output_format to turtle, you should get the input back out. There's also a validate option that would cause the parse phase to fail with errors.

My interpretation is that it's legal Turtle; as @afs says, it just results in an invalid triple. That would also be the case for language tags which are not valid according to BCP47.

@pchampin
Contributor Author

pchampin commented Aug 2, 2023

> It can be considered an ill-formed literal according to the datatype, much like "tuesday"^^xsd:integer

This is very different, IMO. "tuesday"^^xsd:integer is "a semantic inconsistency but is not syntactically ill-formed" (quoting rdf-concepts). On the other hand, a literal with datatype rdf:langString and no (or an empty) language tag is syntactically wrong.

On a practical level, it means that many implementations will choke or behave strangely with it. As @gkellogg points out, his implementation will crash when trying to serialize it back to N-Triples. It will not crash when serializing it back to Turtle, but will produce invalid Turtle, namely: "this is invalid RDF"@ (empty language tag).
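
The two failure modes described above can be sketched as follows. This is not @gkellogg's code; both functions and the `escape` helper are hypothetical, and they only illustrate how a serializer that assumes a tag is always present for rdf:langString could either emit an empty tag or fail outright.

```python
from typing import Optional

RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"


def escape(text: str) -> str:
    """Escape backslashes and double quotes (simplified for this sketch)."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def naive_turtle_literal(lexical_form: str, datatype: str,
                         language: Optional[str] = None) -> str:
    """Hypothetical serializer that never checks whether the tag is present."""
    quoted = f'"{escape(lexical_form)}"'
    if datatype == RDF_LANGSTRING:
        # A missing tag silently becomes an empty one, which is not valid Turtle.
        return f"{quoted}@{language or ''}"
    return f"{quoted}^^<{datatype}>"


def strict_ntriples_literal(lexical_form: str, datatype: str,
                            language: Optional[str] = None) -> str:
    """Hypothetical serializer that refuses to write an untagged rdf:langString."""
    quoted = f'"{escape(lexical_form)}"'
    if datatype == RDF_LANGSTRING:
        if not language:
            raise ValueError("rdf:langString literal has no language tag")
        return f"{quoted}@{language}"
    return f"{quoted}^^<{datatype}>"
```

In this sketch the literal is accepted at parse time either way; the problem only surfaces when it is written back out.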

@afs
Contributor

afs commented Aug 3, 2023

> means that many implementations will choke or behave strangely with it.

Your testing showed this isn't the case. Nothing has changed from RDF 1.1.

> should be rejected as invalid:

At one level, if it's wrong, then it's outside the spec and we don't define the behaviour. We might suggest a behaviour but that's not the same as requiring a behaviour.

There are good reasons for systems to choose to accept it - if streaming large data, one occurrence far down the input stream, throwing an error and aborting a large load after a few hours is extremely inconvenient.
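
The trade-off for streaming loads can be sketched as a check that either aborts or records a warning and carries on. The `check_stream` function, its `strict` flag, and the tuple layout are hypothetical names chosen for illustration, not the behaviour of any particular system.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"

# Hypothetical layout: (lexical form, datatype IRI, language tag or None).
LiteralTuple = Tuple[str, str, Optional[str]]


def check_stream(literals: Iterable[LiteralTuple],
                 strict: bool = False,
                 warnings: Optional[List[str]] = None) -> Iterator[LiteralTuple]:
    """Pass literals through, flagging rdf:langString literals without a tag."""
    for position, (lexical_form, datatype, language) in enumerate(literals, start=1):
        if datatype == RDF_LANGSTRING and not language:
            message = f"literal #{position}: rdf:langString without a language tag"
            if strict:
                # Strict mode aborts the whole load at the first occurrence.
                raise ValueError(message)
            if warnings is not None:
                # Lenient mode records the problem and keeps streaming.
                warnings.append(message)
        yield (lexical_form, datatype, language)
```

With `strict=False`, a single bad literal late in a long input costs one warning rather than the whole load.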

@pchampin
Contributor Author

pchampin commented Oct 20, 2023

> > means that many implementations will choke or behave strangely with it.

> Your testing showed this isn't the case.

My testing shows that most implementations don't choke on it, that's right. I do expect that they behave strangely down the line, though. E.g., @gkellogg's implementation accepts it during parsing, but then fails to serialize it back...

> Nothing has changed from RDF 1.1.

Indeed. I'm not claiming that this issue is specific to RDF 1.2.

> > should be rejected as invalid:

> At one level, if it's wrong, then it's outside the spec and we don't define the behaviour. We might suggest a behaviour but that's not the same as requiring a behaviour.

I agree that we do not define the behaviour of parsers when they encounter invalid input.
Let me rephrase my suggestion then: we should change the Turtle syntax to make my example above invalid Turtle. As a rule, I think we must not allow for something that is valid in a concrete syntax, but has no counterpart in the abstract syntax. This is what happens here.

An alternative would be to relax the constraints on rdf:langString in the abstract syntax, so that a literal of this datatype could exist (in the abstract syntax) without a language tag (although they would be considered ill-formed, and be therefore inconsistent in the semantics). This would make the example above syntactically (concrete and abstract) valid.
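
Under this alternative, an untagged rdf:langString literal would be handled the same way as "tuesday"^^xsd:integer: it can be represented, but it has no value. A minimal sketch of such a check follows; the `is_ill_formed` function is hypothetical, and it only examines the two datatypes discussed in this thread.

```python
import re
from typing import Optional

RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"

# Lexical space of xsd:integer: an optional sign followed by decimal digits.
_INTEGER = re.compile(r"[+-]?[0-9]+")


def is_ill_formed(lexical_form: str, datatype: str,
                  language: Optional[str] = None) -> bool:
    """Return True when the literal can be represented but has no value."""
    if datatype == XSD_INTEGER:
        return _INTEGER.fullmatch(lexical_form) is None
    if datatype == RDF_LANGSTRING:
        # Relaxed option: a missing (or empty) tag makes the literal ill-formed
        # instead of making it impossible to build.
        return not language
    return False  # other datatypes are not examined in this sketch
```

The difference from the restrictive option is where the error is reported: here nothing is rejected at the syntax level, and the inconsistency is left to the semantics.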

> There are good reasons for systems to choose to accept it - if streaming large data, one occurrence far down the input stream, throwing an error and aborting a large load after a few hours is extremely inconvenient.

Agreed, but this is a very general consideration, not specific to this issue.

@afs
Contributor

afs commented Oct 20, 2023

Of the two options, I prefer the option to relax the abstract syntax and have it be ill-formed, i.e. widen and not restrict existing usage.

JSON-LD already says "and, if the datatype is rdf:langString, an optional language tag" and systems do accept it.

I'd prefer leaving things as they are.

JSON-LD and RDF/XML don't rely on the document parsing for language tags.

(The title of the issue maybe should be "missing language string" for the long term record of the WG's work.)

@pchampin changed the title from "invalid language string" to "language strings missing a language tag" on Oct 20, 2023
@afs
Contributor

afs commented Oct 20, 2023

Here's another corner case for "missing language tag".

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

[] rdfs:label "abc"@--rtl .

Sort of "unknown language but known to be rtl" or "always rtl".
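
A sketch of how a tag with a base direction might be split, and why the example above ends up with no language part. The `LANG_DIR` pattern here is an assumption modelled on the shape of the tag in the example (letters, optional "-subtag" parts, optional "--direction"); the Turtle grammar is the authority on what is actually allowed, and `split_lang_dir` is a hypothetical helper.

```python
import re
from typing import Optional, Tuple

# Assumed shape: language subtags, then an optional "--" plus a direction.
LANG_DIR = re.compile(r"^([a-zA-Z]+(?:-[a-zA-Z0-9]+)*)(?:--([a-zA-Z]+))?$")


def split_lang_dir(tag: str) -> Optional[Tuple[str, Optional[str]]]:
    """Split the text after "@" into (language, direction), or None if it does not match."""
    match = LANG_DIR.match(tag)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

Under this assumed pattern, `split_lang_dir("ar--rtl")` gives a language and a direction, while `split_lang_dir("--rtl")` does not match because the language part is empty.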

@pfps added the "test:needs tests" (Test suite related: missing test) label on Jan 25, 2024
@afs
Contributor

afs commented Jul 1, 2024

Seen in the wild: apache/jena#2555

That's in a result set but the principle is the same.


4 participants