Character encoding issues with articles from previous-gen site #118

bjacobel · 2015-03-29T23:16:21Z

Example here. Fairly common with pre-2010 articles.

This may be unfixable. Have to look and see if they're actually saved that way in the DB, or if this is just a presentation layer thing.

bjacobel · 2015-03-29T23:16:31Z

May be related to #21

tophtucker · 2015-03-30T05:48:42Z

i think this is basically unfixable at this point. even by the time i first
saw the db, they were all straight-up ascii question marks. if the browser
were just rendering special chars as ?s we'd be in better shape but that's
not it. :-/

i flirted with whipping up a data cleaning mini-app - i think a simple
regex could spot many of the bad ?s and then a human could choose from
likely possible corrections - prob not worth it, idk
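
A rough sketch of what the "spot the bad ?s" half of that mini-app could look like, in Python. The rules, the candidate characters, and the name `find_suspects` are all hypothetical and have not been checked against the actual DB contents; the idea is only to flag a `?` where a real question mark would be unlikely and leave the final choice to a human.

```python
import re

# Hypothetical heuristics: each rule pairs a pattern for a suspicious "?"
# with the characters it may originally have been. These are guesses,
# not derived from the real article data.
RULES = [
    (re.compile(r"(?<=\w)\?(?=\w)"), ["’"]),               # didn?t  -> apostrophe
    (re.compile(r"(?<!\S)\?(?=\w)"), ["“", "‘"]),           # ?word   -> opening quote
    (re.compile(r"(?<=\s)\?(?=\s)"), ["\u2013", "\u2014"]),  # a ? b   -> en or em dash
]


def find_suspects(text, width=10):
    """Return (offset, context, candidates) for each suspicious '?' in text."""
    suspects = []
    for pattern, candidates in RULES:
        for match in pattern.finditer(text):
            start = max(0, match.start() - width)
            context = text[start:match.end() + width]
            suspects.append((match.start(), context, candidates))
    # Sort by position so a reviewer sees them in reading order.
    return sorted(suspects, key=lambda s: s[0])


if __name__ == "__main__":
    sample = "He said ?hello? and didn?t stop. Really?"
    for offset, context, candidates in find_suspects(sample):
        print(offset, repr(context), candidates)
```

A `?` that directly follows a word and is followed by whitespace or the end of the text (like the one after "hello" or "Really" above) is deliberately not flagged, since it cannot be told apart from a genuine question mark without a person looking at it.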

bjacobel · 2015-03-30T15:35:05Z

s/\?(\b\w+\b)/‘$1/g and s/(\b\w+\b)\?/$1’/g would fix a large number of the issues, probably, though the second would also hit genuine question marks at the end of a word.

bjacobel · 2015-03-30T15:39:45Z

alternately: s/\?(\b\w+\b)\?/“$1”/g

I love regex golf 💃
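
For reference, a minimal Python sketch of applying substitutions like these in one pass. `fix_quotes` is a hypothetical helper, the apostrophe rule is an addition that is not in the regexes above, and Python's `re.sub` writes the backreference as `\1` rather than `$1`. The paired rule runs first so that `?word?` is not consumed by the single-quote rule.

```python
import re


def fix_quotes(text):
    """Best-effort replacement of '?' placeholders with curly quotes."""
    # Paired case first: ?word? -> “word”
    text = re.sub(r"\?(\b\w+\b)\?", r"“\1”", text)
    # Apostrophe inside a word: didn?t -> didn’t (extra rule, an assumption)
    text = re.sub(r"(?<=\w)\?(?=\w)", "’", text)
    # Remaining leading ? before a word: ?word -> ‘word
    text = re.sub(r"\?(\b\w+\b)", r"‘\1", text)
    return text


if __name__ == "__main__":
    print(fix_quotes("She called it ?great? and didn?t say ?why not. Really?"))
```

The trailing-`?` rule is left out of this sketch on purpose: a `?` after a word is usually a real question mark, so that case probably needs human review rather than a blanket replace.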

bjacobel mentioned this issue Apr 27, 2015

Fix character encoding issues BowdoinOrient/bongo#50

Open
