Matching a unicode character without codepoint #90

Open
amblafont opened this issue Feb 7, 2020 · 14 comments · May be fixed by #127

Comments

@amblafont

Currently,
match%sedlex lexbuf with | "ρ" -> ..
does not match, although
match%sedlex lexbuf with | math -> if Sedlexing.Utf8.lexeme lexbuf = "ρ" then ..
does.

Is there any way of making the first variant work, without having to replace "ρ" with its underlying codepoint, so that I can use it as part of a more complex regexp?

@ssfrr

ssfrr commented Jan 12, 2022

just ran into this also. I tried two different ways:

open Printf

let next_tok buf =
  let open Sedlexing.Utf8 in
  let fn = [%sedlex.regexp? Chars "-+×÷"] in
  match%sedlex buf with
  | Chars "+-×÷" -> sprintf "with Chars: %s" (lexeme buf)
  | "+"|"-"|"×"|"÷" -> sprintf "with Bars: %s" (lexeme buf)
  | _ -> failwith (sprintf "Unexpected character: %s" (lexeme buf))

let test_tok =
  Sedlexing.Utf8.from_string "+" |> next_tok |> print_string; print_newline ();
  Sedlexing.Utf8.from_string "÷" |> next_tok |> print_string; print_newline ();

This prints with Chars: + for the first line and then errors out with exception Failure("Unexpected character: ") for the second.

It's not clear to me whether it's a problem with the match cases or the character iteration (or both). It's not printing the unexpected character, which seems to indicate that the lexeme being processed doesn't include the whole Unicode codepoint.

Is there a better way to do this?

@ssfrr

ssfrr commented Jan 12, 2022

looks like I can put the raw codepoints like

  | 0x00D7 | 0x00F7 -> sprintf "with Codepoints: %s" (lexeme buf)

Generating the codepoints for all the characters I need is a bit of a hassle, but I think it's a reasonable workaround.

It's also still puzzling to me why the "unexpected character" case isn't printing the correct character.
edit: it looks like lexeme buf is the empty string here, which makes sense for cases where nothing matched.
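
To take some of the hassle out of the codepoint workaround, the list of codepoints can be generated rather than looked up by hand. The following is a minimal sketch using only the OCaml standard library (it assumes OCaml 4.14 or later for `String.get_utf_8_uchar`). The helper names `codepoints_of_utf8` and `sedlex_alternation` are hypothetical and are not part of sedlex.

```ocaml
(* Requires OCaml >= 4.14 for String.get_utf_8_uchar. *)

(* Decode a UTF-8 encoded string into its list of code points.
   Invalid byte sequences decode to U+FFFD, the replacement character. *)
let codepoints_of_utf8 (s : string) : int list =
  let rec go i acc =
    if i >= String.length s then List.rev acc
    else
      let d = String.get_utf_8_uchar s i in
      let cp = Uchar.to_int (Uchar.utf_decode_uchar d) in
      go (i + Uchar.utf_decode_length d) (cp :: acc)
  in
  go 0 []

(* Hypothetical helper: render the code points as an alternation of
   hexadecimal literals that can be pasted into a match%sedlex case. *)
let sedlex_alternation (s : string) : string =
  codepoints_of_utf8 s
  |> List.map (Printf.sprintf "0x%04X")
  |> String.concat " | "

let () =
  (* "\xC3\x97" and "\xC3\xB7" are the UTF-8 encodings of × and ÷, written
     as byte escapes so the result does not depend on how this source file
     is encoded. *)
  print_endline (sedlex_alternation "+-\xC3\x97\xC3\xB7")
  (* prints: 0x002B | 0x002D | 0x00D7 | 0x00F7 *)
```

The printed line can be copied into the lexer as a pattern, which keeps the readable characters in a small generator script and only the numeric codepoints in the sedlex rules.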

@pmetzger
Member

OCaml does not currently provide a guarantee that Unicode can be embedded without trouble in source files. It would be nice if everyone agreed that source files were encoded as UTF-8, but that is not yet the case.
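
As an illustration of why the source encoding matters here, the sketch below reads the same two bytes once as Latin-1 (one byte per character) and once as UTF-8. It is an assumption, not something stated in this thread, that the failing patterns above were read byte by byte; the function names are hypothetical and only the standard library (OCaml 4.14 or later) is used.

```ocaml
(* Requires OCaml >= 4.14 for String.get_utf_8_uchar. *)

(* Read every byte as one character, as a Latin-1 interpretation would. *)
let codepoints_as_latin1 (s : string) : int list =
  List.map Char.code (List.of_seq (String.to_seq s))

(* Read the bytes as UTF-8; invalid sequences decode to U+FFFD. *)
let codepoints_as_utf8 (s : string) : int list =
  let rec go i acc =
    if i >= String.length s then List.rev acc
    else
      let d = String.get_utf_8_uchar s i in
      go (i + Uchar.utf_decode_length d)
        (Uchar.to_int (Uchar.utf_decode_uchar d) :: acc)
  in
  go 0 []

(* Format a list of code points as "U+00C3 U+00B7". *)
let show (cps : int list) : string =
  String.concat " " (List.map (Printf.sprintf "U+%04X") cps)

let () =
  (* The UTF-8 encoding of ÷ (U+00F7), written as byte escapes. *)
  let division = "\xC3\xB7" in
  Printf.printf "as Latin-1: %s\n" (show (codepoints_as_latin1 division));
  Printf.printf "as UTF-8:   %s\n" (show (codepoints_as_utf8 division))
  (* prints:
     as Latin-1: U+00C3 U+00B7
     as UTF-8:   U+00F7 *)
```

Under the bytewise reading, a character set built from the literal would contain U+00C3 and U+00B7 but not U+00F7, which would be consistent with the "÷" input falling through to the default case.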

@ssfrr

ssfrr commented Jan 12, 2022

ah, I see. Yes that's unfortunate.

Could sedlex just make that assumption and document that any strings that show up in the match%sedlex clause are assumed to be in UTF-8? Alternatively could it use the system's locale to decide?

How does ocamlc interpret source files? If you have a file encoded in UTF-16 would it work?

@pmetzger
Member

Could sedlex just make that assumption

No, it can't, unfortunately. Among other things, as things stand, using js_of_ocaml puts you into a situation where strings are interpreted as UTF-16. The situation is messy.

Alternatively could it use the system's locale to decide?

Not really.

How does ocamlc interpret source files?

It used to assume that things were in ISO Latin-1. Then that got very partially obsoleted but without any definitive move to setting an actual reasonable permanent standard. Most programming languages have now adopted UTF-8 as an encoding for source files, but at this point OCaml hasn't.

A lot of people will claim that if you just use UTF-8 in strings this will work most of the time and should be fine. In fact, it can break in subtle ways. It's important that OCaml adopt an actual policy on what the encoding is, but it hasn't.

For now, just use the codepoint for Sedlex and you'll be much happier.

@ssfrr

ssfrr commented Jan 13, 2022

I see. Thanks for taking the time to explain. I'll use the codepoints.

@hhugo
Contributor

hhugo commented Jan 13, 2022

No, it can't, unfortunately. Among other things, as things stand, using js_of_ocaml puts you into a situation where strings are interpreted as UTF-16. The situation is messy.

I don't think this is correct. JavaScript encodes its strings at runtime as UTF-16, but JavaScript source files are usually UTF-8.
Js_of_ocaml treats OCaml strings as sequences of bytes, and even assumes they are UTF-8 encoded when converting them to JavaScript UTF-16 strings.

@hhugo
Contributor

hhugo commented Jan 13, 2022

What about providing a new constructor Utf8 and treating OCaml strings inside it as UTF-8 encoded?

let next_tok buf =
  let open Sedlexing.Utf8 in
  let fn = [%sedlex.regexp? Chars "-+×÷"] in
  match%sedlex buf with
  | Utf8 (Chars "+-×÷") -> sprintf "with Chars: %s" (lexeme buf)
  | Utf8 ("+"|"-"|"×"|"÷") -> sprintf "with Bars: %s" (lexeme buf)
  | _ -> failwith (sprintf "Unexpected character: %s" (lexeme buf))

@hhugo
Contributor

hhugo commented Jan 13, 2022

I've implemented a PoC in https://github.com/hhugo/sedlex/tree/utf8

@pmetzger
Member

You cannot alter the OCaml lexer through the use of a constructor written at the level of the language. The fact that something "usually" works isn't a guarantee that it will work consistently.

@alainfrisch
Collaborator

I don't think there is a need to change the OCaml lexer: all special characters not interpreted verbatim in string literals are ASCII, and any occurrence of those bytes in a string literal encoded in UTF-8 really corresponds to those ASCII characters.
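
The property this argument relies on is that the UTF-8 encoding of a non-ASCII character never contains a byte below 0x80, so a byte such as `"` or `\` inside a UTF-8 string literal can only be the ASCII character itself. The sketch below checks this exhaustively with the standard library; the function name is hypothetical.

```ocaml
(* Check that the UTF-8 encoding of every non-ASCII Unicode scalar value
   consists only of bytes >= 0x80, i.e. never contains an ASCII byte. *)
let non_ascii_encodings_avoid_ascii_bytes () : bool =
  let buf = Buffer.create 4 in
  let ok = ref true in
  for cp = 0x80 to 0x10FFFF do
    (* Uchar.is_valid rejects the surrogate range U+D800..U+DFFF. *)
    if Uchar.is_valid cp then begin
      Buffer.clear buf;
      Buffer.add_utf_8_uchar buf (Uchar.of_int cp);
      String.iter
        (fun c -> if Char.code c < 0x80 then ok := false)
        (Buffer.contents buf)
    end
  done;
  !ok

let () =
  Printf.printf "no ASCII bytes inside multi-byte sequences: %b\n"
    (non_ascii_encodings_avoid_ascii_bytes ())
  (* prints: no ASCII bytes inside multi-byte sequences: true *)
```

This only covers well-formed UTF-8; it says nothing about files in other encodings.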

@pmetzger
Member

I don't think there is a need to change the OCaml lexer: all special characters not interpreted verbatim in string literals are ASCII,

The lexer itself doesn't need to change much to adopt a policy of UTF-8 encoding throughout, but it currently happily accepts Latin-1, including in identifiers and strings. If you want to hand Unicode code points to code that deals with strings, you want tools that can safely presume they will be given valid UTF-8, which means validating that (for example) input strings are valid UTF-8.
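
For the validation step mentioned here, recent standard libraries already provide a check. The sketch below assumes OCaml 4.14 or later for `String.is_valid_utf_8`; the wrapper `require_utf8` is a hypothetical helper, not an existing compiler or sedlex function.

```ocaml
(* Requires OCaml >= 4.14 for String.is_valid_utf_8. *)

(* Hypothetical helper: accept a string only if it is well-formed UTF-8. *)
let require_utf8 (s : string) : (string, string) result =
  if String.is_valid_utf_8 s then Ok s
  else Error "input is not valid UTF-8"

let describe (r : (string, string) result) : string =
  match r with
  | Ok _ -> "accepted"
  | Error msg -> "rejected: " ^ msg

let () =
  (* ÷ encoded as UTF-8 (two bytes) and as Latin-1 (one byte). A lone 0xF7
     byte is not a valid UTF-8 sequence. *)
  let utf8_division = "\xC3\xB7" in
  let latin1_division = "\xF7" in
  print_endline (describe (require_utf8 utf8_division));
  print_endline (describe (require_utf8 latin1_division))
  (* prints:
     accepted
     rejected: input is not valid UTF-8 *)
```

A check like this rejects the Latin-1 byte outright instead of letting it reach code that expects Unicode code points.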

@pmetzger
Member

(Note that I proposed patches on this a couple of years ago and they got a bunch of pushback. If there's a desire to do this, I'm happy to support it and to get my patches to apply to the current compiler.)

@hhugo hhugo linked a pull request Feb 25, 2023 that will close this issue
@hhugo
Contributor

hhugo commented Oct 29, 2024

@pmetzger we can probably make some progress now that ocaml/ocaml#12664 is merged

5 participants