Add support for unicode 16.0.0. #157

Open
wants to merge 9 commits into
base: master

toots
Member

toots commented Sep 16, 2024

This PR adds support for Unicode 16.0.0.

Notes:

  • New non-binary properties were added to DerivedCoreProperties.txt. Those are not currently supported by the library and should be skipped for now.
  • Regression tests report code point 0x1171e missing in mn. This could be intentional.
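
A minimal sketch of how non-binary entries could be skipped while reading DerivedCoreProperties.txt. It assumes that binary properties use two semicolon-separated fields (range ; property) and that non-binary ones such as InCB carry a third value field. The entry type and the parse_range and parse_line helpers are hypothetical and are not taken from gen_unicode.ml.

```ocaml
(* Hypothetical helpers for illustration; not the code used by this PR. *)
type entry = { first : int; last : int; prop : string }

(* Parse "XXXX" or "XXXX..YYYY" (hexadecimal) into an inclusive range. *)
let parse_range s =
  let hex x = int_of_string ("0x" ^ String.trim x) in
  match String.index_opt s '.' with
  | None ->
      let c = hex s in
      (c, c)
  | Some i ->
      let lo = String.sub s 0 i in
      let hi = String.sub s (i + 2) (String.length s - i - 2) in
      (hex lo, hex hi)

(* Return Some entry for a binary property line, None for anything else:
   blank lines, comment lines, and lines with an extra value field. *)
let parse_line line =
  (* Drop the trailing "# ..." comment, if any. *)
  let data =
    match String.index_opt line '#' with
    | None -> line
    | Some i -> String.sub line 0 i
  in
  match List.map String.trim (String.split_on_char ';' data) with
  | [ range; prop ] when range <> "" ->
      let first, last = parse_range range in
      Some { first; last; prop }
  | _ -> None
```

Under these assumptions, a line with three fields (for example an InCB entry) falls through to the last case and is ignored, so the existing handling of binary properties is unaffected.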

There is quite a bit of noise due to the module renaming required to make the new unicode_old.ml file compile in the regression tests.

Otherwise, this is a fairly straightforward update.

@pmetzger
Member

I haven't looked at Unicode 16; what are the non-binary properties for?

@toots
Member Author

toots commented Sep 16, 2024

> I haven't looked at Unicode 16; what are the non-binary properties for?

I think that this documents it: https://www.gnu.org/software/libunistring/manual/html_node/Indic-conjunct-break.html

@@ -46,7 +46,7 @@ let print_elements ch hashtbl cats =
(fun (b, e) -> Printf.sprintf "0x%x, 0x%x" b e)
(Cset.union_list (Hashtbl.find_all hashtbl c) :> (int * int) list)
in
-Printf.fprintf ch " let %s = Sedlex_cset.of_list\n [" c;
+Printf.fprintf ch " let %s = Sedlex_utils.Cset.of_list\n [" c;
Contributor

I don't understand why we need this diff.

Member Author

unicode.ml needs to be compilable both from within src/syntax and as examples/unicode_old.ml.

Ideally, unicode.ml should be copiable to unicode_old.ml without any changes.

If you use Sedlex_ppx.Sedlex_cset, it does not compile in src/syntax.

Happy to find a better solution but, also, I don't think that it matters.

@@ -32,6 +32,7 @@ let compare name (old_l : (int * int) list) (new_l : Sedlex_ppx.Sedlex_cset.t) =
code_points

let test new_l (name, old_l) =
+let old_l = Sedlex_utils.Cset.to_list old_l in
Contributor

I think it would be better to manipulate the cset directly. See #159.

Member Author

It still fails with the code on master (which should be renamed main, too). This call is just a pass-through. It looks like defining the type as private requires it. Again, I'm not sure that this really matters.

Member Author

type t = private (int * int) list

let to_list l = l
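
To illustrate the point about private types, here is a self-contained sketch. The Cset module below is a simplified stand-in, not the real Sedlex_utils.Cset (in particular, its of_list does no normalization). With a private type abbreviation, a value of type t is not accepted where an (int * int) list is expected unless it is coerced explicitly with :> (as the generator does in the first hunk) or passed through an identity function such as to_list.

```ocaml
(* Simplified stand-in for illustration; not the library's implementation. *)
module Cset : sig
  type t = private (int * int) list

  val of_list : (int * int) list -> t
  val to_list : t -> (int * int) list
end = struct
  type t = (int * int) list

  (* Assumption: no normalization here, to keep the sketch short. *)
  let of_list l = l
  let to_list l = l
end

let s = Cset.of_list [ (0x41, 0x5a); (0x61, 0x7a) ]

(* Passing [s] directly to a list function (e.g. [List.length s]) is
   rejected by the type checker, because private abbreviations are not
   expanded implicitly. Either of the following works. *)
let via_coercion = List.length (s :> (int * int) list)
let via_function = List.length (Cset.to_list s)
```

Both forms are free at run time. The coercion avoids adding a function to the interface, while to_list reads more plainly at call sites.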

@@ -1,6 +1,6 @@
(executables
(names tokenizer regressions complement subtraction repeat performance)
-(libraries sedlex sedlex_ppx)
+(libraries sedlex sedlex_ppx sedlex_utils)
Contributor

sedlex_utils only contains the cset implementation, which is already accessible as Sedlex_ppx.Sedlex_cset.

src/generator/gen_unicode.ml Outdated
@@ -38,6 +38,7 @@ let compare name (old_ : CSet.t) (new_ : CSet.t) =
let test new_l (name, old_l) =
(* Cn is for unassigned code points, which are allowed to be
* used in future version. *)
+let old_l = CSet.to_list old_l in
Contributor

This line can be dropped, but you should also drop the lines below:

let old_l =
  List.fold_left
    (fun acc (a, b) -> CSet.union acc (CSet.interval a b))
    CSet.empty old_l
in

Contributor

Then to_list is probably no longer needed.

3 participants