use a visually more appealing encoding #261

christian-monch · 2023-03-16T11:13:26Z

Fixes #232

This PR uses Unidecode to transliterate Unicode characters into the ASCII range before applying any Dataverse-specific character quoting.

If unidecode() returns an empty string, the name "__not_representable_<X>" is used, where <X> is the length of the original string.
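The behaviour described above could be sketched roughly as follows. This is not the PR's code: `readable_name` and `ascii_fallback` are hypothetical names, and `ascii_fallback` is a standard-library stand-in so that the sketch runs without extra dependencies. Passing `unidecode.unidecode` as `to_ascii` would give the transliteration the PR actually relies on.

```python
import unicodedata
from typing import Callable


def ascii_fallback(text: str) -> str:
    """Stdlib stand-in for unidecode(): decompose, then drop non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def readable_name(name: str,
                  to_ascii: Callable[[str], str] = ascii_fallback) -> str:
    """Transliterate a name to ASCII, with a placeholder for empty results."""
    result = to_ascii(name)
    if result == "":
        # Nothing survived transliteration: fall back to the placeholder
        # name, suffixed with the length of the original string.
        return f"__not_representable_{len(name)}"
    return result
```

With the stand-in, `readable_name("ä")` yields `a`, while a name that has no ASCII decomposition at all falls through to the placeholder.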


christian-monch · 2023-03-16T11:22:06Z

Note that the result of unidecode() may still contain many characters that are not allowed in Dataverse directory names or Dataverse file names. Those are still encoded in the format -<HEXDIGIT><HEXDIGIT> in the result of mangle_path().
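To illustrate the `-<HEXDIGIT><HEXDIGIT>` format mentioned above, here is a minimal sketch. The allow-list `ALLOWED`, the function name `hex_quote`, and the choice of uppercase hex digits are assumptions made for illustration; the real rules in mangle_path() differ between file and directory names and are not reproduced here.

```python
import string

# Hypothetical allow-list, for illustration only.
ALLOWED = set(string.ascii_letters + string.digits + "_.")


def hex_quote(name: str) -> str:
    """Replace every byte outside ALLOWED with -XX (two hex digits)."""
    out = []
    for byte in name.encode("utf-8"):
        ch = chr(byte)
        # Bytes >= 0x80 never match the ASCII-only allow-list,
        # so they are always hex-encoded.
        out.append(ch if ch in ALLOWED else f"-{byte:02X}")
    return "".join(out)
```

In this sketch the `-` character is quoted as well, which keeps the encoding unambiguous; whether mangle_path() does the same is not stated in this thread.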

This commit ensures that mangle_path is tested with "printable" Unicode characters, e.g. `ä`, that will be converted into ASCII characters by `unidecode()`.

christian-monch · 2023-03-16T11:40:00Z

@mih : I can rebase this PR once PR #257 is merged

mih · 2023-03-16T12:52:05Z

Why is

> If unidecode() returns an empty string, the name "_not_representable" is used

done, rather than only the fallback on the hexcodes?

I cannot see from the test diff alone how it would look. Need to handcraft a test dataset and try.

christian-monch · 2023-03-21T11:53:02Z

> Why is
>
> > If unidecode() returns an empty string, the name "_not_representable" is used
>
> done, rather than only the fallback on the hexcodes?
>
> I cannot see from the test diff alone how it would look. Need to handcraft a test dataset and try.

Good question. The answer is that we aimed for a human-readable representation, and hex codes would probably be confusing. I think it is a good idea, though. If we did that, we would have to decide whether to distinguish hex-code file names that are generated because unidecode() returned nothing from hex-code file names that simply exist in the dataset. To distinguish them reliably, we would need an escape mechanism that signals which of the two classes a file name belongs to. That would also affect "normal" file names that contain the escape character.

We could also leave the interpretation to the user, who might know which file names are "genuine" dataset file names and which are just hex-code representations of names that unidecode() maps to the empty string.

All in all, the simplest approach might be to use hex codes if the unidecode() result is empty, but without any escape mechanism. I think the collision probability is low, comparable to the collision probability of using unidecode() itself.

I will change the code.
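A rough sketch of the proposed change, under stated assumptions: `mangle_name` and `ascii_fallback` are hypothetical names, the hex codes are taken from the UTF-8 bytes of the original name, and `ascii_fallback` is a standard-library stand-in for `unidecode()` so that the example runs on its own. The eventual implementation may differ.

```python
import unicodedata
from typing import Callable


def ascii_fallback(text: str) -> str:
    """Stdlib stand-in for unidecode(): decompose, then drop non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def mangle_name(name: str,
                to_ascii: Callable[[str], str] = ascii_fallback) -> str:
    """Transliterate to ASCII; hex-code the original if nothing survives."""
    ascii_name = to_ascii(name)
    if ascii_name == "":
        # No escape mechanism: the result is indistinguishable from a
        # genuine file name that happens to look like a hex-code sequence.
        return "".join(f"-{b:02X}" for b in name.encode("utf-8"))
    return ascii_name
```

Without an escape mechanism, a name that transliterates to nothing and a genuine file name that already consists of the same hex codes map to the same result, which is the collision case discussed above.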

christian-monch added 2 commits March 16, 2023 12:09

add Unidecode to requirements

344c8a3

improve mangle_path tests

ad77bbb


christian-monch marked this pull request as draft March 17, 2023 10:45
