Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What is tools/ for? #140

Open
jbicha opened this issue Oct 25, 2017 · 19 comments
Open

What is tools/ for? #140

jbicha opened this issue Oct 25, 2017 · 19 comments

Comments

@jbicha
Copy link

jbicha commented Oct 25, 2017

What is https://github.com/typesupply/defcon/tree/master/tools ?

It doesn't appear to be used in the build at all.

Is it used to generate https://github.com/typesupply/defcon/blob/master/Lib/defcon/tools/unicodeTools.py ? If so, shouldn't that be part of the build instead of including generated files in source?

I'm asking because I'm working with @medicalwei on packaging this for Debian and the Unicode data is technically under a different license.

@medicalwei
Copy link
Contributor

The file in unicodeTools.py seems to be from here:
ftp://www.unicode.org/Public/9.0.0/ucd/Scripts.txt

So this should be partially attributed to Unicode Consortium. According to this file it seems to be DFSG free as well:
http://www.unicode.org/copyright.html#License

@adrientetar
Copy link
Contributor

Is used to regenerate unicodeTools.py when a new Unicode version comes out, does not need to be packaged.

@adrientetar
Copy link
Contributor

If so, shouldn't that be part of the build instead of including generated files in source?

I guess you could do that indeed.

@medicalwei
Copy link
Contributor

medicalwei commented Oct 26, 2017

I did a rewrite of the loading part of the file: unicodeTools.py
https://paste.debian.net/992778/

Please check if that works as intended.

Note that this is for using the files externally. Feel free if you want to backport it.

@moyogo
Copy link
Collaborator

moyogo commented Oct 26, 2017

/usr/share/unicode/Scripts.txt, etc. are not going to work for everyone.

@medicalwei
Copy link
Contributor

The problem is that, in Debian we need to strip the duplicated files and prefer ones provided in the repository. This does not need to be in the upstream (and that's why I didn't file a pull request.)

However, if it is possible, could you separate the embedded texts from Unicode into some text files? In this way we can replace the files and symlink them to be provided by another package.

@moyogo
Copy link
Collaborator

moyogo commented Oct 26, 2017

How do you guarantee that the file provided by another package is the expected version of Unicode?

@medicalwei
Copy link
Contributor

medicalwei commented Oct 26, 2017 via email

@anthrotype
Copy link
Member

anthrotype commented Oct 26, 2017

The "UnicodeData.txt" file in tools/ folder is used with the script tools/openClosedUniGenerator.py to generate not the whole unicodeTools.py module but only part of it, namely a multi-line string _openClosePairText. However, a comment also says that string has been "tweaked by hand to handle special exceptions".

This is the diff between the _openClosePairText as generated from the tools/openClosedUniGenerator.py script and the data in tools/UnicodeData.txt, and the text which is currently in the Lib/defcon/tools/unicodeTools.py:
https://gist.github.com/anthrotype/3413bb4d92b12494b68b8b14fdc6c531

I don't know why it had to be tweaked, maybe @typesupply knows.

Let me know if I'm understanding this issue correctly.

There's a UnicodeData.txt file in tools; is the problem the fact that the file is there unused, or is it that it doesn't come with an appropriate license file? What do you mean by "DFSG free"? I'm not familiar with these things so any help is welcome.

The unicodeTools.py module embeds the content of "ftp://www.unicode.org/Public/9.0.0/ucd/Scripts.txt" file from Unicode Consortium. You would prefer it to be as a separate data file, because there's already one in Debian repository as a separate package and prefer to avoid duplicating them, correct?

~

btw, this reminded me that there's a pending PR which updates it to Unicode 10 which I forgot to review
#124

@anthrotype
Copy link
Member

I did a rewrite of the loading part of the file: unicodeTools.py

@medicalwei maybe you could send a pull request?

@typesupply
Copy link
Member

I don't know why it had to be tweaked, maybe @typesupply knows.

Because some open characters have closed partner characters that aren't defined in UnicodeData.txt. For example, 201D;RIGHT DOUBLE QUOTATION MARK;Pf is the closed partner to:

  • 201C;LEFT DOUBLE QUOTATION MARK;Pi
  • 201F;DOUBLE HIGH-REVERSED-9 QUOTATION MARK;Pi
  • 201E;DOUBLE LOW-9 QUOTATION MARK;Ps

In UnicodeData.txt, 201D;RIGHT DOUBLE QUOTATION MARK;Pf only appears as a partner to 201C;LEFT DOUBLE QUOTATION MARK;Pi so I had to manually define the other relationships.

I'm open to moving the exceptions to the generator to make this more clear.

@jbicha
Copy link
Author

jbicha commented Oct 26, 2017

The original issue here is that the Unicode data has its own license which wasn't clearly marked here.

DFSG is the Debian Free Software Guidelines. @medicalwei 's comment was that code or content licensed with the Unicode license are acceptable for inclusion in Debian.

Debian has a policy that the same piece of code not be duplicated in Debian if possible. Now, I believe the Unicode data isn't "code" but I thought it was worth asking whether the duplication was necessary here.

Debian updated its version of unicode-data to 10.0.0 very quickly after it was released in June.

@anthrotype
Copy link
Member

The original issue here is that the Unicode data has its own license which wasn't clearly marked here.

would it be enough to include the text of http://www.unicode.org/copyright.html#License in a file called "LICENSE" next to the unicode data files?

whether the duplication was necessary

I don't know. That data file is only used once a year, and I wouldn't like to complicate the setup too much.

Would it be ok if we placed the Scripts.txt and Blocks.txt outside the unicodeTools.py module, and put them as separate text files like the UnicodeData.txt, and then from unicodeTools.py we would have a global variable e.g. UNICODE_DATA_PATH with the path to the embedded package data files; and then you could apply a patch that replaces it with the path to Debian's /usr/share/unicode?

@typesupply
Copy link
Member

Would it be ok if we placed the Scripts.txt and Blocks.txt outside the unicodeTools.py module, and put them as separate text files like the UnicodeData.txt, and then from unicodeTools.py we would have a global variable e.g. UNICODE_DATA_PATH with the path to the embedded package data files; and then you could apply a patch that replaces it with the path to Debian's /usr/share/unicode?

This is fine with me. As long as the module continues to work as is, I have no opinion on where the source data is located.

@medicalwei
Copy link
Contributor

medicalwei commented Oct 27, 2017

Would it be ok if we placed the Scripts.txt and Blocks.txt outside the unicodeTools.py module, and put them as separate text files like the UnicodeData.txt, and then from unicodeTools.py we would have a global variable e.g. UNICODE_DATA_PATH with the path to the embedded package data files; and then you could apply a patch that replaces it with the path to Debian's /usr/share/unicode?

I think upstream can simply move them to a dedicated text files and packagers can replace the files with symbolic links. No need for a global variable. (@jbicha correct me if the policy doesn't allow this.)

But as you stated there are differences for the open-close data from the generated script. I propose doing this with a patch (diff -Naur) from the generated file. We can trigger the generation at build time, and apply the patch right after the generation.

medicalwei added a commit to medicalwei/defcon that referenced this issue Nov 6, 2017
This is to address part of the problems in robotools#140 that some open parentheses does not match the closed ones.
These would only leave special exceptions and ornate parentheses (which is because of reversed order in the data) needed to be appended.
@anthrotype
Copy link
Member

With the new fontTools.unicodedata module in fonttools 3.20.0, I think defcon should simply use that instead of doing its own parsing of UCD data files. Everything needed should be in there, except perhaps for those open/close exceptions Tal mentioned, which can be hard-coded somewhere in unicodeTools.

@typesupply
Copy link
Member

As long as we have backwards compatibility with the functions in defcon I'd be very happy to ditch the UCD data parsing.

@anthrotype
Copy link
Member

About the issue of the built-in unicodedata.category not being in sync with Unicode 10 noted by @andyclymer in #124, the right thing to do instead of parsing data files is to add unicodedata2 (https://github.com/mikekap/unicodedata2) as an install requirement to defcon.
There are pre-compiled binaries installable via pip for all python versions and platforms.
https://pypi.python.org/pypi/unicodedata2/10.0.0.post2

When unicodedata2 is importable, fontTools.unicodedata will use that for category and all the other public functions.

benkiel referenced this issue Nov 1, 2018
…data2 backport

so that defcon.unicodeTools are up-to-date and use Unicode Character Database 11.0
@benkiel
Copy link
Member

benkiel commented Nov 1, 2018

Based on the changes to where unicodedata is being pulled from, this needs to be looked at again to retain the exceptions, but perhaps remove the /tools completely? I'm not 100% sure what openClosedUniGenerator.py is used/needed by now. It seems we could hardcode the exceptions in unicodeTools.py, and remove the outdated unicodedata.txt and the generator. @anthrotype?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants