use a visually more appealing encoding #261

Draft
wants to merge 3 commits into
base: main


christian-monch
Contributor

@christian-monch christian-monch commented Mar 16, 2023

Fixes #232

This PR uses Unidecode to transliterate Unicode characters into the ASCII range before applying any Dataverse-specific character quoting.

If unidecode() returns an empty string, the name "__not_representable_<X>" is used, where <X> is the length of the original string.
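
A minimal sketch of the behavior described above, not the code in this PR. To keep it runnable without third-party packages, `ascii_stand_in()` is a hypothetical standard-library substitute for `unidecode()` (NFKD decomposition, then dropping everything outside ASCII); it covers far fewer characters than Unidecode does. The name `transliterate_name()` is also an assumption made for illustration.

```python
import unicodedata


def ascii_stand_in(text: str) -> str:
    """Hypothetical stand-in for unidecode(): decompose, then drop non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def transliterate_name(name: str, transliterate=ascii_stand_in) -> str:
    """Return an ASCII rendering of `name`, or a placeholder if none exists.

    `transliterate` is any callable mapping str -> str; in the PR this role
    is played by unidecode().
    """
    result = transliterate(name)
    if result == "":
        # Nothing survived transliteration: fall back to the placeholder
        # name, where <X> is the length of the original string.
        return f"__not_representable_{len(name)}"
    return result
```

With the stand-in, `transliterate_name("ä")` gives `"a"`, while a name such as `"日本"` has no ASCII decomposition and becomes `"__not_representable_2"`. Passing `unidecode.unidecode` as `transliterate` would give different (usually richer) results for many inputs.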

This commit uses Unidecode to translate
unicode characters into the ASCII-range
before employing any dataverse-specific
character quotations.

If unidecode returns an empty string,
the name "__not_representable_<X>" is
used, where "<X>" is the length of the
original string.
@christian-monch
Contributor Author

It should be noted that the results of unidecode() might still contain a lot of characters that are not allowed in dataverse directory names or dataverse file names. Those are still encoded in the format -<HEXDIGIT><HEXDIGIT> in the result of mangle_path().
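
To illustrate the `-<HEXDIGIT><HEXDIGIT>` format, here is a rough sketch of such a quoting step. It is not the actual `mangle_path()` implementation: the function name `hex_quote()`, the `ALLOWED` set, the choice to quote `-` itself, and the use of uppercase hex digits are all assumptions made for this example, and the real Dataverse rules differ between directory names and file names.

```python
import string

# Assumption: a stand-in allow-list, NOT Dataverse's actual rule set.
ALLOWED = set(string.ascii_letters + string.digits + "_.")


def hex_quote(name: str) -> str:
    """Replace every character outside ALLOWED with -<HEXDIGIT><HEXDIGIT>.

    Characters are encoded byte-wise (UTF-8), so a non-ASCII character
    yields several -XX groups.
    """
    out = []
    for ch in name:
        if ch in ALLOWED:
            out.append(ch)
        else:
            out.extend(f"-{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)
```

For example, `hex_quote("a b")` yields `"a-20b"`, and because `-` is not in the assumed allow-list, `hex_quote("a-b")` yields `"a-2Db"`.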

This commit ensures that mangle_path
is tested with "printable" unicode
characters, e.g. `ä`, that will be
converted into ascii characters by
`unidecode()`.
@christian-monch
Contributor Author

@mih : I can rebase this PR once PR #257 is merged

@mih
Member

mih commented Mar 16, 2023

Why is

> If unidecode() returns an empty string, the name "_not_representable" is used

done, rather than only the fallback on the hexcodes?

I cannot see from the test diff alone how it would look. Need to handcraft a test dataset and try.

@christian-monch christian-monch marked this pull request as draft March 17, 2023 10:45
@christian-monch
Contributor Author

> Why is
>
> > If unidecode() returns an empty string, the name "_not_representable" is used
>
> done, rather than only the fallback on the hexcodes?
>
> I cannot see from the test diff alone how it would look. Need to handcraft a test dataset and try.

Good question. We were aiming for a human-readable representation, and hex codes would probably be confusing. I think it is a good idea, though. If we did that, we would have to decide whether to distinguish hex-coded file names that are generated because unidecode() returned nothing from hex-coded file names that simply exist in the dataset. To distinguish them reliably, we would need an escape mechanism that signals which of the two classes a file name belongs to. That would also affect "normal" file names that contain the escape character.

We could also leave the interpretation to the user, who might know which file names are "genuine" dataset file names and which are just hex-code representations of names that unidecode() maps to the empty string.

All in all, the simplest approach might be to use hex codes if the unidecode() result is empty, without any escaping mechanism. I think the collision probability is low, comparable to the collision probability of using unidecode() itself.

I will change the code.
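
For reference, a small sketch of what the hex-code fallback without escaping could look like, and of the collision it accepts. This is an illustration under assumptions, not the planned change itself: `to_ascii()` is a hypothetical standard-library stand-in for `unidecode()`, and `hex_fallback()` and `readable_name()` are names invented for this example.

```python
import unicodedata


def to_ascii(name: str) -> str:
    """Hypothetical stand-in for unidecode(): NFKD-decompose, drop non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def hex_fallback(name: str) -> str:
    """Encode every UTF-8 byte of `name` as -<HEXDIGIT><HEXDIGIT>."""
    return "".join(f"-{byte:02X}" for byte in name.encode("utf-8"))


def readable_name(name: str) -> str:
    """Prefer the ASCII transliteration; use hex codes only if it is empty."""
    ascii_name = to_ascii(name)
    return ascii_name if ascii_name else hex_fallback(name)
```

Without an escape mechanism, `readable_name("日")` and a file that is literally named `-E6-97-A5` both end up as `-E6-97-A5`. The transliteration step already has the same property, since `"ä"` and `"a"` both map to `"a"`.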

Successfully merging this pull request may close these issues.

Employ Unidecode for path mangling