What is tools/ for? #140

jbicha · 2017-10-25T15:51:25Z

What is https://github.com/typesupply/defcon/tree/master/tools ?

It doesn't appear to be used in the build at all.

Is it used to generate https://github.com/typesupply/defcon/blob/master/Lib/defcon/tools/unicodeTools.py ? If so, shouldn't that be part of the build instead of including generated files in source?

I'm asking because I'm working with @medicalwei on packaging this for Debian and the Unicode data is technically under a different license.

medicalwei · 2017-10-25T16:05:02Z

The data embedded in unicodeTools.py seems to come from here:
ftp://www.unicode.org/Public/9.0.0/ucd/Scripts.txt

So this should be partially attributed to the Unicode Consortium. According to its license, it seems to be DFSG free as well:
http://www.unicode.org/copyright.html#License

adrientetar · 2017-10-25T18:38:21Z

It is used to regenerate unicodeTools.py when a new Unicode version comes out; it does not need to be packaged.

adrientetar · 2017-10-25T18:40:53Z

> If so, shouldn't that be part of the build instead of including generated files in source?

I guess you could do that indeed.

medicalwei · 2017-10-26T02:32:30Z

I did a rewrite of the loading part of the file: unicodeTools.py
https://paste.debian.net/992778/

Please check if that works as intended.

Note that this is for using the files externally. Feel free to backport it if you want.

moyogo · 2017-10-26T08:55:18Z

/usr/share/unicode/Scripts.txt, etc. are not going to work for everyone.

medicalwei · 2017-10-26T09:00:41Z

The problem is that in Debian we need to strip duplicated files and prefer the ones already provided in the repository. This does not need to go upstream (which is why I didn't file a pull request).

However, if possible, could you move the embedded Unicode text into separate text files? That way we can replace those files with symlinks to the ones provided by another package.

moyogo · 2017-10-26T09:18:05Z

How do you guarantee that the file provided by another package is the expected version of Unicode?

medicalwei · 2017-10-26T09:22:00Z

Typically we use package dependencies to guarantee that. However, if the upstream code expects a specific Unicode version, we have to do extra work to upload another version of unicode-data.
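
A package dependency can be complemented by checking the version stamped in the data file itself. The following is a minimal sketch, assuming the file starts with a header line such as `# Scripts-9.0.0.txt` (as the UCD files generally do). The names `ucd_file_version`, `check_version`, and `EXPECTED_UNICODE_VERSION` are hypothetical and not part of defcon.

```python
import re

# Hypothetical: the Unicode version the calling code was written against.
EXPECTED_UNICODE_VERSION = "9.0.0"

# Matches a header such as "# Scripts-9.0.0.txt" and captures the version.
_HEADER_RE = re.compile(r"^#\s*[A-Za-z]+-(\d+\.\d+\.\d+)\.txt")


def ucd_file_version(text):
    """Return the version found in the first line of a UCD file, or None."""
    first_line = text.split("\n", 1)[0]
    match = _HEADER_RE.match(first_line)
    return match.group(1) if match else None


def check_version(text, expected=EXPECTED_UNICODE_VERSION):
    """Raise ValueError if the data file does not carry the expected version."""
    found = ucd_file_version(text)
    if found != expected:
        raise ValueError(
            "expected Unicode %s data, found %s" % (expected, found)
        )
    return found
```

A check like this would fail loudly if a symlinked system file were newer or older than the code expects, instead of silently returning different results.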


anthrotype · 2017-10-26T10:04:01Z

The "UnicodeData.txt" file in the tools/ folder is used with the script tools/openClosedUniGenerator.py to generate only part of the unicodeTools.py module, not the whole thing: namely the multi-line string _openClosePairText. However, a comment also says that this string has been "tweaked by hand to handle special exceptions".

This is the diff between _openClosePairText as generated by tools/openClosedUniGenerator.py from the data in tools/UnicodeData.txt, and the text currently in Lib/defcon/tools/unicodeTools.py:
https://gist.github.com/anthrotype/3413bb4d92b12494b68b8b14fdc6c531

I don't know why it had to be tweaked, maybe @typesupply knows.

Let me know if I'm understanding this issue correctly.

There's a UnicodeData.txt file in tools; is the problem the fact that the file is there unused, or is it that it doesn't come with an appropriate license file? What do you mean by "DFSG free"? I'm not familiar with these things so any help is welcome.

The unicodeTools.py module embeds the content of the "ftp://www.unicode.org/Public/9.0.0/ucd/Scripts.txt" file from the Unicode Consortium. You would prefer it to be a separate data file, because the Debian repository already provides one in a separate package and you want to avoid duplicating it, correct?

~

btw, this reminded me that there's a pending PR which updates it to Unicode 10 which I forgot to review
#124

anthrotype · 2017-10-26T10:06:51Z

> I did a rewrite of the loading part of the file: unicodeTools.py

@medicalwei maybe you could send a pull request?

typesupply · 2017-10-26T11:30:02Z

> I don't know why it had to be tweaked, maybe @typesupply knows.

Because some open characters have closed partner characters that aren't defined in UnicodeData.txt. For example, 201D;RIGHT DOUBLE QUOTATION MARK;Pf is the closed partner to:

201C;LEFT DOUBLE QUOTATION MARK;Pi
201F;DOUBLE HIGH-REVERSED-9 QUOTATION MARK;Pi
201E;DOUBLE LOW-9 QUOTATION MARK;Ps

In UnicodeData.txt, 201D;RIGHT DOUBLE QUOTATION MARK;Pf only appears as a partner to 201C;LEFT DOUBLE QUOTATION MARK;Pi so I had to manually define the other relationships.

I'm open to moving the exceptions to the generator to make this more clear.
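
The relationship described above can be illustrated with the standard library alone. This is a rough sketch, not the actual logic of openClosedUniGenerator.py. It assumes a naive pairing rule (the closer is the next code point, if that code point has a closing category) and a hypothetical `MANUAL_PAIRS` table for the relationships that have to be defined by hand.

```python
import unicodedata

OPEN_CATEGORIES = {"Ps", "Pi"}   # open and initial-quote punctuation
CLOSE_CATEGORIES = {"Pe", "Pf"}  # close and final-quote punctuation

# Hypothetical hand-maintained exceptions: openers whose closer cannot be
# found by the naive rule below.
MANUAL_PAIRS = {
    "\u201F": "\u201D",  # DOUBLE HIGH-REVERSED-9 QUOTATION MARK
    "\u201E": "\u201D",  # DOUBLE LOW-9 QUOTATION MARK
}


def naive_closer(char):
    """Assumed rule: the closer is the next code point, if it closes."""
    if unicodedata.category(char) not in OPEN_CATEGORIES:
        return None
    candidate = chr(ord(char) + 1)
    if unicodedata.category(candidate) in CLOSE_CATEGORIES:
        return candidate
    return None


def closer_for(char):
    """Look up the manual exceptions first, then fall back to the rule."""
    return MANUAL_PAIRS.get(char) or naive_closer(char)
```

Keeping the exceptions in one table next to the rule, as suggested in the comment above, makes it visible which pairs come from the data and which were added by hand.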

jbicha · 2017-10-26T12:59:51Z

The original issue here is that the Unicode data has its own license which wasn't clearly marked here.

DFSG stands for the Debian Free Software Guidelines. @medicalwei's comment was that code or content under the Unicode license is acceptable for inclusion in Debian.

Debian has a policy that the same piece of code not be duplicated in Debian if possible. Now, I believe the Unicode data isn't "code" but I thought it was worth asking whether the duplication was necessary here.

Debian updated its version of unicode-data to 10.0.0 very quickly after it was released in June.

anthrotype · 2017-10-26T14:14:25Z

> The original issue here is that the Unicode data has its own license which wasn't clearly marked here.

would it be enough to include the text of http://www.unicode.org/copyright.html#License in a file called "LICENSE" next to the unicode data files?

> whether the duplication was necessary

I don't know. That data file is only used once a year, and I wouldn't like to complicate the setup too much.

Would it be ok if we placed the Scripts.txt and Blocks.txt outside the unicodeTools.py module, and put them as separate text files like the UnicodeData.txt, and then from unicodeTools.py we would have a global variable e.g. UNICODE_DATA_PATH with the path to the embedded package data files; and then you could apply a patch that replaces it with the path to Debian's /usr/share/unicode?
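
The proposal above could look roughly like the following. This is a sketch under assumptions, not defcon's code: the `data` directory name and the functions `parse_ranges` and `load_scripts` are hypothetical, and only `UNICODE_DATA_PATH` comes from the comment. The parser assumes the usual `start..end ; Value # comment` layout of Scripts.txt and Blocks.txt.

```python
from pathlib import Path

# Directory of this module; falls back to the working directory when the
# code is run without a __file__ (for example in an interactive session).
_HERE = Path(__file__).resolve().parent if "__file__" in globals() else Path.cwd()

# Default: data files shipped inside the package (hypothetical "data" dir).
# A distribution patch could point this at /usr/share/unicode instead.
UNICODE_DATA_PATH = _HERE / "data"


def parse_ranges(text):
    """Parse 'XXXX..YYYY ; Value # comment' lines into (start, end, value)."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        code_points, value = [part.strip() for part in line.split(";", 1)]
        start, _, end = code_points.partition("..")
        ranges.append((int(start, 16), int(end or start, 16), value))
    return ranges


def load_scripts(data_dir=None):
    """Read and parse Scripts.txt from data_dir or UNICODE_DATA_PATH."""
    path = Path(data_dir or UNICODE_DATA_PATH) / "Scripts.txt"
    return parse_ranges(path.read_text(encoding="utf-8"))
```

With this layout, replacing the bundled files with symlinks would also work without touching the variable, since the loader only cares about the directory contents.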

typesupply · 2017-10-26T14:47:04Z

> Would it be ok if we placed the Scripts.txt and Blocks.txt outside the unicodeTools.py module, and put them as separate text files like the UnicodeData.txt, and then from unicodeTools.py we would have a global variable e.g. UNICODE_DATA_PATH with the path to the embedded package data files; and then you could apply a patch that replaces it with the path to Debian's /usr/share/unicode?

This is fine with me. As long as the module continues to work as is, I have no opinion on where the source data is located.

medicalwei · 2017-10-27T02:27:10Z

> Would it be ok if we placed the Scripts.txt and Blocks.txt outside the unicodeTools.py module, and put them as separate text files like the UnicodeData.txt, and then from unicodeTools.py we would have a global variable e.g. UNICODE_DATA_PATH with the path to the embedded package data files; and then you could apply a patch that replaces it with the path to Debian's /usr/share/unicode?

I think upstream can simply move them to dedicated text files, and packagers can replace the files with symbolic links. No need for a global variable. (@jbicha, correct me if the policy doesn't allow this.)

But as you stated, the open-close data differs from what the script generates. I propose handling this with a patch (diff -Naur) against the generated file. We can trigger the generation at build time and apply the patch right after the generation.

This is to address part of the problem in robotools#140, namely that some open parentheses do not match their closed counterparts. This would leave only the special exceptions and the ornate parentheses (which are in reversed order in the data) to be appended.

anthrotype · 2017-11-21T12:31:15Z

With the new fontTools.unicodedata module in fonttools 3.20.0, I think defcon should simply use that instead of doing its own parsing of UCD data files. Everything needed should be in there, except perhaps for those open/close exceptions Tal mentioned, which can be hard-coded somewhere in unicodeTools.

typesupply · 2017-11-21T12:33:24Z

As long as we have backwards compatibility with the functions in defcon I'd be very happy to ditch the UCD data parsing.

anthrotype · 2017-11-21T12:36:21Z

About the issue of the built-in unicodedata.category not being in sync with Unicode 10 noted by @andyclymer in #124, the right thing to do instead of parsing data files is to add unicodedata2 (https://github.com/mikekap/unicodedata2) as an install requirement to defcon.
There are pre-compiled binaries installable via pip for all python versions and platforms.
https://pypi.python.org/pypi/unicodedata2/10.0.0.post2

When unicodedata2 is importable, fontTools.unicodedata will use that for category and all the other public functions.
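
The fallback described above follows a common import pattern. The sketch below shows that pattern in isolation. It is not the fontTools source, and the wrapper name `category` and the constant `UNIDATA_VERSION` are chosen here for illustration.

```python
try:
    # Third-party backport that tracks newer Unicode versions, if installed.
    import unicodedata2 as unicodedata
except ImportError:
    # Standard library module, tied to the Unicode version of the
    # running Python interpreter.
    import unicodedata

# Unicode Character Database version of whichever module was imported.
UNIDATA_VERSION = unicodedata.unidata_version


def category(char):
    """Return the general category (e.g. 'Pi', 'Lu') of a character."""
    return unicodedata.category(char)
```

Either way the calling code stays the same; only the Unicode version behind the answers changes.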


benkiel · 2018-11-01T20:15:57Z

Based on the changes to where unicodedata is being pulled from, this needs to be looked at again to retain the exceptions, but perhaps we can remove /tools completely? I'm not 100% sure what openClosedUniGenerator.py is used or needed for now. It seems we could hardcode the exceptions in unicodeTools.py, and remove the outdated UnicodeData.txt and the generator. @anthrotype?

medicalwei mentioned this issue Nov 6, 2017

fix: use close neighbors in openClosedUniGenerator.py #145

Merged

anthrotype mentioned this issue Nov 21, 2017

use fontTools.unicodedata and require unicodedata2 #150

Closed

benkiel referenced this issue Nov 1, 2018

setup.py: install fonttools with '[unicode]' extra to install unicode…

8469ee3

…data2 backport so that defcon.unicodeTools are up-to-date and use Unicode Character Database 11.0
