Merge lines using bounding boxes #59

danvk · 2015-04-30T14:43:21Z

I'm currently doing this with the OCR'd text directly, mostly out of expedience. Lines with similar widths are joined.

But it would be better to do this with the bounding boxes from ocropus-gpageseg. For example, in 712393b, the first line of the paragraph is indented. The right edges of the lines in the paragraph are all close to one another, even though the first line has fewer characters.

Vertical gaps between lines could also be used as cues here.
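A rough sketch of how the right-edge and vertical-gap cues might combine. It is not the code in extract_ocropy_text.py. The `(x0, y0, x1, y1)` pixel tuples, the `group_paragraphs` name, and both thresholds are assumptions made for illustration, and the sketch assumes a single column of text with y increasing downward. A line is joined to the previous one when the previous line reaches the right margin and the gap between them is small, so an indented first line still groups with the rest of its paragraph.

```python
from statistics import median


def group_paragraphs(boxes, right_tol=20, gap_factor=0.6):
    """Group line boxes into paragraphs.

    boxes: list of (x0, y0, x1, y1) tuples, sorted top to bottom
           (hypothetical format, not necessarily what ocropus-gpageseg emits).
    Returns a list of paragraphs, each a list of indices into boxes.
    """
    if not boxes:
        return []

    # Typical line height, used to scale the vertical-gap threshold.
    line_height = median(y1 - y0 for _, y0, _, y1 in boxes)
    # Assumed right margin: the right-most edge of any line on the page.
    right_margin = max(x1 for _, _, x1, _ in boxes)

    paragraphs = [[0]]
    for i in range(1, len(boxes)):
        prev, cur = boxes[i - 1], boxes[i]
        gap = cur[1] - prev[3]  # top of this line minus bottom of previous
        prev_is_full = (right_margin - prev[2]) <= right_tol
        if prev_is_full and gap <= gap_factor * line_height:
            paragraphs[-1].append(i)  # text continues from the previous line
        else:
            paragraphs.append([i])  # short previous line or big gap: new paragraph
    return paragraphs
```

Because only the right edge of the previous line is checked, a short final line still joins the paragraph above it, while the line after that short one starts a new paragraph.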

While I'm at it, it would also be better to detect "NO REPRODUCTIONS"-style lines on a per-box basis, since these sometimes get merged with dates or attributions.
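A minimal sketch of per-box filtering, assuming each box arrives paired with its OCR'd text. The `(box, text)` pairing, the `drop_boilerplate` name, and the regular expression are hypothetical. The actual list of boilerplate phrases is not spelled out here, so the pattern only covers the "NO REPRODUCTIONS" example.

```python
import re

# Hypothetical pattern: matches "NO REPRODUCTIONS" / "No Reproduction",
# tolerating extra whitespace that OCR may introduce.
BOILERPLATE_RE = re.compile(r"\bNO\s+REPRODUCTIONS?\b", re.IGNORECASE)


def drop_boilerplate(lines):
    """Return the (box, text) pairs whose text is not a boilerplate line.

    lines: list of (box, text) pairs, one per detected line box.
    """
    return [(box, text) for box, text in lines if not BOILERPLATE_RE.search(text)]
```

Running this before any lines are merged means a date or attribution in its own box is kept even when a boilerplate box sits next to it.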

This would be done in extract_ocropy_text.py.

danvk · 2015-04-30T14:48:43Z

722041f is an interesting case here. The short line ("east side.") between the two paragraphs should be joined to the first paragraph.

danvk added the OCR label Apr 30, 2015

danvk mentioned this issue Apr 30, 2015

Merge consecutive lines in a paragraph #60

Merged
