I'm currently joining lines into paragraphs using the OCR'd text directly, mostly out of expedience. Lines with similar widths are joined.
But it would be better to do this with the bounding boxes from ocropus-gpageseg. For example, in 712393b, the first line of the paragraph is indented. The right edges of the lines in the paragraph are all close to one another, even though the first line has fewer characters.
Vertical gaps between lines could also be used as cues here.
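A rough sketch of how the two cues (aligned right edges and small vertical gaps) might be combined. It assumes the line boxes are already available as `(x0, y0, x1, y1)` values in image coordinates with y increasing downward; how they are read from the ocropus-gpageseg output is not shown. The `Box` type, `group_paragraphs`, and both thresholds are hypothetical names and values chosen for illustration, not anything already in the repo.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Hypothetical line box: left, top, right, bottom in pixels."""
    x0: int
    y0: int
    x1: int
    y1: int


def group_paragraphs(boxes: List[Box], right_tol: int = 20,
                     max_gap_ratio: float = 1.0) -> List[List[Box]]:
    """Join consecutive lines whose right edges align and whose gap is small.

    right_tol and max_gap_ratio are guesses that would need tuning.
    """
    paragraphs: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b.y0):  # top to bottom
        if paragraphs:
            prev = paragraphs[-1][-1]
            gap = box.y0 - prev.y1        # vertical space between the lines
            height = prev.y1 - prev.y0    # height of the previous line
            right_aligned = abs(box.x1 - prev.x1) <= right_tol
            if right_aligned and gap <= max_gap_ratio * height:
                paragraphs[-1].append(box)
                continue
        paragraphs.append([box])
    return paragraphs
```

Because only right edges are compared, an indented first line still joins its paragraph. A short final line would be split off by this version, so a real implementation would probably also need to compare left edges.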
While I'm at it, it would also be better to detect "NO REPRODUCTIONS"-style lines on a per-box basis, since these sometimes get merged with dates or attributions.
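A minimal sketch of per-box filtering, assuming each box's OCR'd text is available as its own string before any lines are merged. The regular expression and the `split_notices` helper are hypothetical; the real set of notice phrasings would have to come from the data.

```python
import re
from typing import List, Tuple

# Assumed pattern for "NO REPRODUCTIONS"-style notices; adjust to the data.
NOTICE_RE = re.compile(r"\bNO\s+REPRODUCTIONS?\b", re.IGNORECASE)


def split_notices(box_texts: List[str]) -> Tuple[List[str], List[str]]:
    """Separate notice lines from the rest, one box at a time."""
    kept: List[str] = []
    notices: List[str] = []
    for text in box_texts:
        if NOTICE_RE.search(text):
            notices.append(text)
        else:
            kept.append(text)
    return kept, notices
```

Filtering before the lines are joined means a notice box is dropped on its own and cannot be merged into a neighbouring date or attribution.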
This would be done in extract_ocropy_text.py.