Matching a unicode character without codepoint #90
Comments
Just ran into this also. I tried two different ways:

    open Printf
    let next_tok buf =
      let open Sedlexing.Utf8 in
      let fn = [%sedlex.regexp? Chars "-+×÷"] in
      match%sedlex buf with
      | Chars "+-×÷" -> sprintf "with Chars: %s" (lexeme buf)
      | "+"|"-"|"×"|"÷" -> sprintf "with Bars: %s" (lexeme buf)
      | _ -> failwith (sprintf "Unexpected character: %s" (lexeme buf))
    let test_tok =
      Sedlexing.Utf8.from_string "+" |> next_tok |> print_string; print_newline ();
      Sedlexing.Utf8.from_string "÷" |> next_tok |> print_string; print_newline ();

This prints

It's not clear to me whether it's a problem with the match cases or the character iteration (or both). It's not printing the unexpected character, which seems to indicate that the lexeme being processed doesn't include the whole Unicode codepoint. Is there a better way to do this?
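For anyone puzzling over the same thing, here is a rough stdlib-only sketch of what a non-ASCII string literal looks like at the byte level. It assumes the source file is saved as UTF-8 and that whatever consumes the literal reads it byte by byte; that second part is a guess about the cause, not something confirmed in this thread. `bytes_of` is a made-up helper name.

```ocaml
(* Hypothetical helper: list the raw bytes of a string, which is all that
   a byte-oriented reader gets to see. *)
let bytes_of (s : string) : int list =
  List.init (String.length s) (fun i -> Char.code s.[i])

let () =
  (* "\xC3\xB7" is the UTF-8 encoding of U+00F7 (division sign), written
     with escapes so the example does not depend on the file encoding. *)
  List.iter (Printf.printf "0x%02X ") (bytes_of "\xC3\xB7");
  print_newline ()
```

If the literal is consumed one byte at a time, a single division sign would look like two separate characters, which would fit the symptom described above.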
Looks like I can put the raw codepoints like
This is a bit of a hassle to generate the codepoints for all the characters I need, but I think it is a reasonable workaround.
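To take some of the hassle out of generating the codepoints, a small helper can print them. This is only a sketch: it assumes OCaml 4.14 or later (for `String.get_utf_8_uchar`), and `codepoints` and `sedlex_alternation` are made-up names. The idea is to paste the printed integer literals into the sedlex pattern by hand.

```ocaml
(* Decode a UTF-8 encoded string into its list of Unicode code points.
   Assumes OCaml >= 4.14 for String.get_utf_8_uchar. *)
let codepoints (s : string) : int list =
  let rec go i acc =
    if i >= String.length s then List.rev acc
    else
      let d = String.get_utf_8_uchar s i in
      if not (Uchar.utf_decode_is_valid d) then
        invalid_arg (Printf.sprintf "invalid UTF-8 at byte %d" i)
      else
        go (i + Uchar.utf_decode_length d)
          (Uchar.to_int (Uchar.utf_decode_uchar d) :: acc)
  in
  go 0 []

(* Hypothetical convenience: render the code points as hex literals
   separated by bars, ready to paste into a pattern. *)
let sedlex_alternation (s : string) : string =
  codepoints s |> List.map (Printf.sprintf "0x%04X") |> String.concat " | "

let () =
  (* The bytes below are "+-" followed by U+00D7 and U+00F7 in UTF-8. *)
  print_endline (sedlex_alternation "+-\xC3\x97\xC3\xB7")
```

The input is written with byte escapes here so the result does not depend on how the source file itself is encoded.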
OCaml does not currently provide a guarantee that Unicode can be embedded without trouble in source files. It would be nice if everyone agreed that source files were encoded as UTF-8, but that is not yet the case.
Ah, I see. Yes that's unfortunate. Could sedlex just make that assumption and document that any strings that show up in the How does
No, it can't, unfortunately. Among other things, as things stand, using js_of_ocaml puts you into a situation where strings are interpreted as UTF-16. The situation is messy.
Not really.
It used to assume that things were in ISO Latin-1. Then that got very partially obsoleted but without any definitive move to setting an actual reasonable permanent standard. Most programming languages have now adopted UTF-8 as an encoding for source files, but at this point OCaml hasn't. A lot of people will claim that if you just use UTF-8 in strings this will work most of the time and should be fine. In fact, it can break in subtle ways. It's important that OCaml adopt an actual policy on what the encoding is, but it hasn't. For now, just use the codepoint for Sedlex and you'll be much happier.
I see. Thanks for taking the time to explain. I'll use the codepoints.
I don't think this is correct. JavaScript encodes its strings at runtime as UTF-16, but JavaScript source files are usually UTF-8.
What about providing a new constructor Utf8 and treating OCaml strings inside it as UTF-8 encoded?

    let next_tok buf =
      let open Sedlexing.Utf8 in
      let fn = [%sedlex.regexp? Chars "-+×÷"] in
      match%sedlex buf with
      | Utf8 (Chars "+-×÷") -> sprintf "with Chars: %s" (lexeme buf)
      | Utf8 ("+"|"-"|"×"|"÷") -> sprintf "with Bars: %s" (lexeme buf)
      | _ -> failwith (sprintf "Unexpected character: %s" (lexeme buf))
I've implemented a PoC in https://github.com/hhugo/sedlex/tree/utf8
You cannot alter the OCaml lexer through the use of a constructor written at the level of the language. The fact that something "usually" works isn't a guarantee that it will work consistently.
I don't think there is a need to change the OCaml lexer. All the special characters that are not interpreted verbatim in string literals are ASCII, and any occurrence of those bytes in a UTF-8 encoded string literal really does correspond to those ASCII characters.
The lexer itself doesn't need to be changed much to adopt a policy of utf-8 encoding throughout, but it does currently happily take Latin-1, including in identifiers and strings, and if you want to hand Unicode code points to code dealing with strings you want tools that can safely presume valid utf-8 is going to be presented to them, meaning one needs to validate that (for example) input strings are valid utf-8.
(Note that I proposed patches on this a couple of years ago and they got a bunch of pushback. If there's a desire to do this, I'm happy to support it and to get my patches to apply to the current compiler.)
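As a small illustration of the validation point, here is a sketch using only the standard library. It assumes OCaml 4.14 or later (for `String.is_valid_utf_8`), and `describe` is a made-up helper. It compares the same character, U+00F7, stored as a single Latin-1 byte and as a UTF-8 sequence.

```ocaml
(* U+00F7 (division sign) in two encodings, written with byte escapes so
   the example is independent of the encoding of this source file. *)
let latin1_division = "\xF7"     (* one byte, Latin-1 *)
let utf8_division = "\xC3\xB7"   (* two bytes, UTF-8 *)

(* Hypothetical helper: report the byte length and UTF-8 validity. *)
let describe (name : string) (s : string) : unit =
  Printf.printf "%s: %d byte(s), valid UTF-8: %b\n"
    name (String.length s) (String.is_valid_utf_8 s)

let () =
  describe "latin1" latin1_division;
  describe "utf8" utf8_division
```

A tool that wants to presume UTF-8 could run a check like this on each string literal and reject the Latin-1 form instead of silently misreading it.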
@pmetzger we can probably make some progress now that ocaml/ocaml#12664 is merged |
Currently,

    match%sedlex lexbuf with | "ρ" -> ..

does not match, although

    match%sedlex lexbuf with | math -> if Sedlexing.Utf8.lexeme lexbuf = "ρ" then ..

does.
Is there any way of making the first variant work, without having to replace "ρ" with its underlying codepoint, so that I can use it as part of a more complex regexp?