Matching a unicode character without codepoint #90

amblafont · 2020-02-07T07:41:48Z

Currently,
match%sedlex lexbuf with | "ρ" -> ..
does not match, although
match%sedlex lexbuf with | math -> if Sedlexing.Utf8.lexeme lexbuf = "ρ" then ..
does.

Is there any way of making the first variant work, without having to replace "ρ" with its underlying codepoint, so that I can use it as part of a more complex regexp?


ssfrr · 2022-01-12T21:23:01Z

just ran into this also. I tried two different ways:

open Printf

let next_tok buf =
  let open Sedlexing.Utf8 in
  let fn = [%sedlex.regexp? Chars "-+×÷"] in
  match%sedlex buf with
  | Chars "+-×÷" -> sprintf "with Chars: %s" (lexeme buf)
  | "+"|"-"|"×"|"÷" -> sprintf "with Bars: %s" (lexeme buf)
  | _ -> failwith (sprintf "Unexpected character: %s" (lexeme buf))

let test_tok =
  Sedlexing.Utf8.from_string "+" |> next_tok |> print_string; print_newline ();
  Sedlexing.Utf8.from_string "÷" |> next_tok |> print_string; print_newline ();

This prints "with Chars: +" for the first line and then errors out with exception Failure("Unexpected character: ") for the second.

It's not clear to me whether it's a problem with the match cases or the character iteration (or both). It's not printing the unexpected character, which seems to indicate that the lexeme being processed doesn't include the whole unicode codepoint.

Is there a better way to do this?
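
One possible reading of the failure, sketched under an assumption the thread does not confirm: the literal reaches sedlex as raw UTF-8 bytes and each byte is treated as its own character (as a Latin-1 interpretation would do). In that case the set built from "+-×÷" holds the bytes 0xC3, 0x97 and 0xB7, but not the code point 0xF7 that the input "÷" decodes to. The helper name `bytes_of_literal` is made up for illustration.

```ocaml
(* Hypothetical helper: list the byte values of a string literal,
   i.e. what a one-byte-per-character reading of it would see. *)
let bytes_of_literal (s : string) : int list =
  List.init (String.length s) (fun i -> Char.code s.[i])

(* The four operators, with the two non-ASCII ones written as explicit
   UTF-8 escapes so the example does not depend on how this file is saved. *)
let literal = "+-\xC3\x97\xC3\xB7"

let () =
  let bytes = bytes_of_literal literal in
  List.iter (Printf.printf "0x%02X ") bytes;
  print_newline ();
  (* U+00F7 itself is absent: only its UTF-8 bytes 0xC3 0xB7 appear. *)
  Printf.printf "contains 0xF7: %b\n" (List.mem 0xF7 bytes)
```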

ssfrr · 2022-01-12T21:31:10Z

looks like I can put the raw codepoints like

  | 0x00D7 | 0x00F7 -> sprintf "with Codepoints: %s" (lexeme buf)

It's a bit of a hassle to generate the codepoints for all the characters I need, but I think it's a reasonable workaround.
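
Generating the codepoints can be automated. The following is a minimal sketch, assuming OCaml 4.14 or later (for `String.get_utf_8_uchar`) and input that is already UTF-8 encoded; `pattern_of_utf8` is a hypothetical helper, not part of sedlex.

```ocaml
(* Hypothetical helper: turn a UTF-8 encoded string into an alternation
   of code points in the form used above, e.g. 0x00D7 | 0x00F7.
   Requires OCaml >= 4.14 for String.get_utf_8_uchar. *)
let pattern_of_utf8 (s : string) : string =
  let rec loop i acc =
    if i >= String.length s then List.rev acc
    else begin
      let d = String.get_utf_8_uchar s i in
      if not (Uchar.utf_decode_is_valid d) then
        invalid_arg (Printf.sprintf "invalid UTF-8 at byte %d" i);
      let cp = Uchar.to_int (Uchar.utf_decode_uchar d) in
      loop (i + Uchar.utf_decode_length d) (Printf.sprintf "0x%04X" cp :: acc)
    end
  in
  String.concat " | " (loop 0 [])

let () =
  (* UTF-8 bytes for the multiplication and division signs *)
  print_endline (pattern_of_utf8 "\xC3\x97\xC3\xB7")
```

The printed text can be pasted into a match%sedlex case in place of the string literals.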

~~It's also still puzzling to me why the "unexpected character" case isn't printing the correct character.~~
edit: it looks like lexeme buf is the empty string here, which makes sense for cases where nothing matched.

pmetzger · 2022-01-12T21:52:24Z

OCaml does not currently provide a guarantee that Unicode can be embedded without trouble in source files. It would be nice if everyone agreed that source files were encoded as UTF-8, but that is not yet the case.

ssfrr · 2022-01-12T22:03:59Z

ah, I see. Yes that's unfortunate.

Could sedlex just make that assumption and document that any strings that show up in the match%sedlex clause are assumed to be in UTF8? Alternatively could it use the system's locale to decide?

How does ocamlc interpret source files? If you have a file encoded in UTF16 would it work?

pmetzger · 2022-01-13T01:15:09Z

> Could sedlex just make that assumption

No, it can't, unfortunately. Among other things, as things stand, using js_of_ocaml puts you into a situation where strings are interpreted as UTF-16. The situation is messy.

> Alternatively could it use the system's locale to decide?

Not really.

> How does ocamlc interpret source files?

It used to assume that things were in ISO Latin-1. Then that got very partially obsoleted but without any definitive move to setting an actual reasonable permanent standard. Most programming languages have now adopted UTF-8 as an encoding for source files, but at this point OCaml hasn't.

A lot of people will claim that if you just use UTF-8 in strings this will work most of the time and should be fine. In fact, it can break in subtle ways. It's important that OCaml adopt an actual policy on what the encoding is, but it hasn't.
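
To illustrate the ambiguity, here is a small sketch using only the standard library (`Buffer.add_utf_8_uchar`, available since OCaml 4.06). The same division sign is one byte in Latin-1 and two bytes in UTF-8, and reading one encoding as the other silently produces different characters. The function name `latin1_to_utf8` is chosen for this example.

```ocaml
(* Re-encode a string as UTF-8, reading every input byte as one Latin-1
   character (Latin-1 byte values coincide with code points U+0000-U+00FF). *)
let latin1_to_utf8 (s : string) : string =
  let b = Buffer.create (2 * String.length s) in
  String.iter (fun c -> Buffer.add_utf_8_uchar b (Uchar.of_int (Char.code c))) s;
  Buffer.contents b

let () =
  (* The division sign is one byte in Latin-1 and two bytes in UTF-8. *)
  assert (latin1_to_utf8 "\xF7" = "\xC3\xB7");
  (* Reading those two UTF-8 bytes as Latin-1 yields two unrelated
     characters (U+00C3 and U+00B7) instead of the division sign. *)
  assert (latin1_to_utf8 "\xC3\xB7" = "\xC3\x83\xC2\xB7")
```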

For now, just use the codepoint for Sedlex and you'll be much happier.
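
As a sketch of what that looks like inside a larger regexp, the example below names the codepoint once and then combines it with other patterns. It assumes the sedlex library and its ppx are available at build time; the names `rho`, `digit` and `token` are made up for illustration.

```ocaml
(* Build with the sedlex library and its ppx (for example, in dune:
   (libraries sedlex) (preprocess (pps sedlex.ppx))). *)

(* U+03C1 GREEK SMALL LETTER RHO, given as an integer code point *)
let rho = [%sedlex.regexp? 0x03C1]
let digit = [%sedlex.regexp? '0' .. '9']

(* Returns the matched text as UTF-8, or None when the input does not
   start with rho followed by at least one digit. *)
let token lexbuf =
  match%sedlex lexbuf with
  | rho, Plus digit -> Some (Sedlexing.Utf8.lexeme lexbuf)
  | _ -> None
```

Because the codepoint is bound to a name, it only has to be written once no matter how many rules use it.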

ssfrr · 2022-01-13T03:43:28Z

I see. Thanks for taking the time to explain. I'll use the codepoints.

hhugo · 2022-01-13T07:15:31Z

> No, it can't, unfortunately. Among other things, as things stand, using js_of_ocaml puts you into a situation where strings are interpreted as UTF-16. The situation is messy.

I don't think this is correct. JavaScript encodes its strings at runtime as UTF-16, but JavaScript source files are usually UTF-8.
Js_of_ocaml treats OCaml strings as sequences of bytes, and even assumes they are UTF-8 encoded when converting them to JavaScript UTF-16 strings.
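
The kind of conversion described here can be sketched with the standard library alone. This is not js_of_ocaml's implementation, only an illustration of transcoding UTF-8 bytes to UTF-16 code units; it assumes OCaml 4.14 or later, and `utf8_to_utf16be` is a hypothetical name.

```ocaml
(* Transcode a UTF-8 encoded OCaml string to UTF-16 (big-endian bytes).
   Requires OCaml >= 4.14 for String.get_utf_8_uchar. Invalid input bytes
   decode to U+FFFD, the replacement character. *)
let utf8_to_utf16be (s : string) : string =
  let b = Buffer.create (2 * String.length s) in
  let rec loop i =
    if i < String.length s then begin
      let d = String.get_utf_8_uchar s i in
      Buffer.add_utf_16be_uchar b (Uchar.utf_decode_uchar d);
      loop (i + Uchar.utf_decode_length d)
    end
  in
  loop 0;
  Buffer.contents b

let () =
  (* Two UTF-8 bytes for the division sign become one UTF-16 code unit. *)
  assert (utf8_to_utf16be "\xC3\xB7" = "\x00\xF7")
```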

hhugo · 2022-01-13T07:32:51Z

What about providing a new constructor Utf8 and treating OCaml strings inside it as UTF-8 encoded?

let next_tok buf =
  let open Sedlexing.Utf8 in
  let fn = [%sedlex.regexp? Chars "-+×÷"] in
  match%sedlex buf with
  | Utf8 (Chars "+-×÷") -> sprintf "with Chars: %s" (lexeme buf)
  | Utf8 ("+"|"-"|"×"|"÷") -> sprintf "with Bars: %s" (lexeme buf)
  | _ -> failwith (sprintf "Unexpected character: %s" (lexeme buf))

hhugo · 2022-01-13T08:52:14Z

I've implemented a PoC in https://github.com/hhugo/sedlex/tree/utf8

pmetzger · 2022-01-13T13:11:09Z

You cannot alter the OCaml lexer through the use of a constructor written at the level of the language. The fact that something "usually" works isn't a guarantee that it will work consistently.

alainfrisch · 2022-01-13T13:17:09Z

I don't think there is a need to change the OCaml lexer: all special characters not interpreted verbatim in string literals are ASCII, and any occurrence of those bytes in a string literal encoded in UTF-8 really corresponds to those ASCII characters.

pmetzger · 2022-01-13T15:55:06Z

> I don't think there is a need to change the OCaml lexer, all special characters not interpreted verbatim in string literals are ASCII,

The lexer itself doesn't need to be changed much to adopt a policy of UTF-8 encoding throughout, but it does currently happily take Latin-1, including in identifiers and strings. And if you want to hand Unicode code points to code dealing with strings, you want tools that can safely presume they will be presented with valid UTF-8, meaning one needs to validate that (for example) input strings are valid UTF-8.
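
A validation step of that kind might look like the sketch below. It is not taken from the compiler or from any proposed patch; it assumes OCaml 4.14 or later, and `first_invalid_utf8` is a hypothetical helper.

```ocaml
(* Hypothetical helper: index of the first byte that does not start a
   valid UTF-8 sequence, or None if the whole string is valid UTF-8.
   Requires OCaml >= 4.14 for String.get_utf_8_uchar. *)
let first_invalid_utf8 (s : string) : int option =
  let rec loop i =
    if i >= String.length s then None
    else
      let d = String.get_utf_8_uchar s i in
      if Uchar.utf_decode_is_valid d then loop (i + Uchar.utf_decode_length d)
      else Some i
  in
  loop 0

let () =
  (* Division sign encoded as UTF-8: accepted. *)
  assert (first_invalid_utf8 "a\xC3\xB7" = None);
  (* Division sign encoded as Latin-1: rejected at byte offset 1. *)
  assert (first_invalid_utf8 "a\xF7b" = Some 1)
```

Reporting the offset rather than a plain boolean makes it possible to point at the offending byte in an error message.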

pmetzger · 2022-01-13T15:56:14Z

(Note that I proposed patches on this a couple of years ago and they got a bunch of pushback. If there's a desire to do this, I'm happy to support it and to get my patches to apply to the current compiler.)

hhugo · 2024-10-29T14:16:40Z

@pmetzger we can probably make some progress now that ocaml/ocaml#12664 is merged

hhugo linked a pull request on Feb 25, 2023 that will close this issue: Add utf8 support for string literal #127 (Open)
