Character encoding issues with articles from previous-gen site #118

Open
bjacobel opened this issue Mar 29, 2015 · 4 comments
Comments

@bjacobel

Example here. Fairly common with pre-2010 articles.

This may be unfixable. Have to look and see if they're actually saved that way in the DB, or if this is just a presentation layer thing.
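
One way to tell the two cases apart is to look at the raw bytes of a stored article rather than the rendered page. A minimal sketch, assuming an article body can be fetched from the DB as `bytes`; the function name and the returned labels are hypothetical and not part of this project:

```python
def classify_stored_text(raw: bytes) -> str:
    """Guess whether the damage lives in the stored data or only in rendering."""
    if any(b >= 0x80 for b in raw):
        # Non-ASCII bytes survive, so the original characters may still be
        # recoverable by decoding with the right charset.
        return "non-ascii bytes present"
    if b"?" in raw:
        # Only plain 0x3F bytes: any special characters that were replaced
        # are gone from the data itself, not just mis-rendered.
        return "ascii only, contains literal ?"
    return "ascii only, no ?"
```

If old articles come back as "ascii only, contains literal ?", the problem is in the stored data and a presentation-layer fix will not help.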

@bjacobel
Author

May be related to #21

@tophtucker

i think this is basically unfixable at this point. even by the time i first
saw the db, they were all straight-up ascii question marks. if the browser
were just rendering special chars as ?s we'd be in better shape but that's
not it. :-/

i flirted with whipping up a data cleaning mini-app - i think a simple
regex could spot many of the bad ?s and then a human could choose from
likely possible corrections - prob not worth it, idk
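
A rough sketch of what the "spot the bad ?s, let a human pick" idea could look like. It assumes that a `?` directly followed by a letter or digit is rarely a real question mark; the function name, the context width, and the candidate lists are all hypothetical:

```python
import re

# Assumption: a "?" directly followed by a word character is suspicious.
SUSPECT = re.compile(r"\?(?=\w)")

def find_suspects(text: str, width: int = 10):
    """Return (position, context, candidate replacements) for each suspect ?."""
    results = []
    for m in SUSPECT.finditer(text):
        pos = m.start()
        mid_word = pos > 0 and text[pos - 1].isalnum()
        # Mid-word: most likely an apostrophe. Word-initial: an opening quote.
        candidates = ["\u2019"] if mid_word else ["\u2018", "\u201c"]
        context = text[max(0, pos - width): pos + width + 1]
        results.append((pos, context, candidates))
    return results
```

A reviewer would then be shown each context string with its candidates and choose one, or skip it. Sentence-final question marks are never flagged, because they are not followed by a word character.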

@bjacobel
Author

s/\?(\b\w+\b)/‘$1/g and s/(\b\w+\b)\?/$1’/g would fix a large number of the issues, probably.

@bjacobel
Author

alternately: s/\?(\b\w+\b)\?/“$1”/g

I love regex golf 💃
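
For anyone who wants to try the substitutions from this thread outside of sed, here is a minimal Python sketch. The function name is hypothetical. One deliberate change: the trailing rule is narrowed to a `?` sitting between two word characters (as in `don?t`), since a bare `\w+\?` would also rewrite genuine question marks at the end of sentences. That narrowing is an assumption, not something from the thread.

```python
import re

def repair_question_marks(text: str) -> str:
    """Replace ? characters that look like lost curly quotes or apostrophes."""
    # ?word? -> “word” (the paired pattern, single words only)
    text = re.sub(r"(?<!\w)\?(\w+)\?", "\u201c\\1\u201d", text)
    # word?word -> word’word (apostrophes, e.g. don?t)
    text = re.sub(r"(?<=\w)\?(?=\w)", "\u2019", text)
    # ?word -> ‘word (any remaining opening quotes)
    text = re.sub(r"(?<!\w)\?(?=\w)", "\u2018", text)
    return text
```

The order matters: the paired rule runs first so that `?word?` is not split into an opening single quote plus a stray `?`. Multi-word quotations and closing quotes after a word are left untouched, so this would still need a human pass.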
