Code for extracting data from a large number of PDFs, particularly FCC Political Ad documents
Main requirements:
- ImageMagick
- GNU Parallel
- PostgreSQL
- Python
- Tesseract (3.03)
- Ghostscript
(Optional)
- Cuneiform
Python dependencies are listed in requirements.txt:
pip install -r requirements.txt
Some Python dependencies may be installed from Debian packages:
- python-lxml
- python-numpy
- python-scipy
- python-skimage
createdb fcc
psql -d fcc -f schema.sql
CONNECTION = "dbname='fcc' user='<user>' host='localhost' password='<pw>'"
python_bin = '/path/to/scripts/'
raw_data_dir = 'rawdata/'
invalid_data_dir = 'invalid/'
to_validate_dir = 'to_validate/'
# Directory where pdfs and html directories should be located.
basepath = '/path/for/files/'
cd fpa
nosetests
python station_dloader.py
python pq.py --help
python run_parallel.py --help
python run_extract.py --help
# (cropbox.py, merge_clusters.py, standardize.py, threshold.py)
python pq.py markCommonFromLocalText
python pq.py parseTextInvoices
python pq.py parseTextContracts
python pq.py parseTextOrders
./get_bboxes.sh
python run_parallel.py --help
python run_parallel.py -s=35 -r=420 --psm=3 --targetfield=doctype --outfile="doctype.tsv"
python run_parallel.py -s=50 -r=400 --targetfield=contract --crop
python run_extract.py --help
python run_extract.py --targetfield=contract --crop
The commands printed by --dryrun can be used to run extractions on other machines, or without the Python dependencies.
python run_parallel.py <parameters> --dryrun
python run_parallel.py <parameters> --printcmd
python run_parallel.py <parameters> --outputpng --outputcrop --outputchop
./get_pages.sh
CLASSES
Polfile
Process
markCommonFromLocalText
parseTextContracts
parseTextInvoices
parseTextOrders
printParallelParams
class Polfile
Simple representation of a political file with id, paths, and bounding box.
class Process
Base class for processes.
Each implementation is one way of processing a row in the database,
also known as a strategy.
class markCommonFromLocalText(Process)
Process to identify common document types and formats from text based PDFs.
class parseTextContracts(Process)
Process to extract data from a text based contract using Poppler.
class parseTextInvoices(Process)
Process to extract data from a text based invoice using Poppler.
class parseTextOrders(Process)
Process to extract data from a text based order using Poppler.
class printParallelParams(Process)
Process to print parameters needed to extract a PDF with GNU Parallel.
FUNCTIONS
bulkProcess(process, query, download=False, overwrite=True, abortOnError=False)
Run a process on a set of files in a query.
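The relationship between Process and bulkProcess can be pictured with a minimal sketch. Everything in it is illustrative: the `run` method name, the dictionary rows, and the `UppercaseAdvertiser` strategy are assumptions, and the real classes work on database rows rather than an in-memory list. The handling of `abort_on_error` is a guess at what the abortOnError parameter above does.

```python
class Process(object):
    """Base class: one strategy for processing a single row."""

    def run(self, row):  # hypothetical method name
        raise NotImplementedError


class UppercaseAdvertiser(Process):
    """Toy strategy, used only for illustration."""

    def run(self, row):
        return dict(row, advertiser=row["advertiser"].upper())


def bulk_process(process, rows, abort_on_error=False):
    """Apply one strategy to every row, collecting results and failures.

    Assumed semantics: with abort_on_error=True the first exception
    propagates; otherwise the failing row is recorded and skipped.
    """
    results, failures = [], []
    for row in rows:
        try:
            results.append(process.run(row))
        except Exception as error:
            if abort_on_error:
                raise
            failures.append((row, error))
    return results, failures


# The second row has no "advertiser" key, so it ends up in failures.
rows = [{"id": 1, "advertiser": "dem governors assn"}, {"id": 2}]
results, failures = bulk_process(UppercaseAdvertiser(), rows)
```

Keeping the per-row logic in a strategy object means the same loop can drive any of the parse or mark processes without changes.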
DESCRIPTION
Crop an image around the box in the image that is closest to the
specified size and width/height ratio.
example:
Crop around the box that's 50% of the total size and
has a width/height ratio of 1.25:
cat img.png | python cropbox.py .5 1.25 > cropped.png
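The selection step described above can be sketched as follows. This is not the scoring cropbox.py actually uses: it assumes candidate boxes have already been detected as (x, y, w, h) tuples, and it weights the size error and the ratio error equally, which is an arbitrary choice. The function name `closest_box` is hypothetical.

```python
def closest_box(boxes, image_size, target_fraction, target_ratio):
    """Pick the (x, y, w, h) box closest to the requested size and shape.

    target_fraction: box area as a fraction of the image area (e.g. 0.5)
    target_ratio: desired width / height (e.g. 1.25)
    """
    image_w, image_h = image_size
    image_area = float(image_w * image_h)

    def distance(box):
        _, _, w, h = box
        fraction = (w * h) / image_area
        ratio = w / float(h)
        # Relative error on each criterion, weighted equally (an assumption).
        return (abs(fraction - target_fraction) / target_fraction
                + abs(ratio - target_ratio) / target_ratio)

    return min(boxes, key=distance)


# Three made-up candidate boxes in a 1000 x 1000 image.
boxes = [(0, 0, 100, 100), (50, 40, 790, 632), (10, 10, 900, 200)]
best = closest_box(boxes, image_size=(1000, 1000),
                   target_fraction=0.5, target_ratio=1.25)
```

The crop would then be the region from (x, y) to (x + w, y + h) of the selected box.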
DESCRIPTION
Finds equivalent strings that differ only by abbreviations. For each
string, finds the matching string with the highest score. Scores should be in
the second column of the input file.
example:
python merge_clusters.py clusters.tsv
Example input (tab delimited):
Democratic Governors Association	25
Democratic Gov Assoc	2
Comm for Employment Oppty OH	62
Cmte for Empl Oppty FL	6
Committee for Employment Opportunity	85
Dem Governors Assn	8
Output:
Democratic Gov Assoc	Democratic Governors Association
Comm for Employment Oppty OH	Committee for Employment Opportunity
Cmte for Empl Oppty FL	Committee for Employment Opportunity
Dem Governors Assn	Democratic Governors Association
FUNCTIONS
canBePacked(s, target)
canSkip(s, abbrs, skip_words)
getCommon(a, b)
hasCommon(a, b)
highestMatch(fixes, counts)
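The example in the description above can be reproduced with a small sketch. It is one plausible reading of the approach, not the code in merge_clusters.py: a word counts as an abbreviation when it starts with the same letter and its letters appear in order in the full word, and words in a `skip_words` set (here the state codes) are ignored. The names `is_abbreviation`, `name_matches`, and `best_matches` are hypothetical and do not correspond to the functions listed above.

```python
def is_abbreviation(short, full):
    """True if `short` starts like `full` and its letters occur in `full` in order."""
    short, full = short.lower(), full.lower()
    if not short or not full or short[0] != full[0]:
        return False
    remaining = iter(full)
    # `ch in remaining` consumes the iterator, so order is enforced.
    return all(ch in remaining for ch in short)


def name_matches(short_name, full_name, skip_words=frozenset()):
    """True if every word of `short_name` abbreviates the matching word of `full_name`."""
    short_words = [w for w in short_name.split() if w not in skip_words]
    full_words = [w for w in full_name.split() if w not in skip_words]
    if len(short_words) != len(full_words) or short_words == full_words:
        return False
    return all(is_abbreviation(s, f) for s, f in zip(short_words, full_words))


def best_matches(counts, skip_words=frozenset()):
    """Map each abbreviated name to its highest-scoring full form."""
    result = {}
    for name in counts:
        candidates = [other for other in counts
                      if other != name and name_matches(name, other, skip_words)]
        if candidates:
            result[name] = max(candidates, key=counts.get)
    return result


counts = {
    "Democratic Governors Association": 25,
    "Democratic Gov Assoc": 2,
    "Comm for Employment Oppty OH": 62,
    "Cmte for Empl Oppty FL": 6,
    "Committee for Employment Opportunity": 85,
    "Dem Governors Assn": 8,
}
matches = best_matches(counts, skip_words={"OH", "FL"})
```

Names that abbreviate nothing else, such as "Committee for Employment Opportunity", are treated as canonical and do not appear in the result.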
Operating system utilities. Running commands, printing to stderr etc.
FUNCTIONS
getExitStatusFromCmd(command, shell=False, stderr=False)
Execute command and capture exit status.
getStdoutFromCmd(command, shell=False, stderr=False)
Execute command and capture stderr or stdout.
print_stderr(*objs)
Print to standard error.
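Helpers like these are usually thin wrappers around the standard subprocess module. The sketch below is written for Python 3 and shows one way they could look; the snake_case names are hypothetical, the exit-status helper drops the stderr parameter for brevity, and the real implementations may differ.

```python
import subprocess
import sys


def get_stdout_from_cmd(command, shell=False, stderr=False):
    """Run a command and return its stdout (or stderr if stderr=True) as text."""
    proc = subprocess.Popen(command, shell=shell,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            universal_newlines=True)
    out, err = proc.communicate()
    return err if stderr else out


def get_exit_status_from_cmd(command, shell=False):
    """Run a command, discard its output, and return the exit status."""
    proc = subprocess.Popen(command, shell=shell,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    proc.communicate()
    return proc.returncode


def print_stderr(*objs):
    """Print to standard error."""
    print(*objs, file=sys.stderr)


# Use the current interpreter as a portable test command.
text = get_stdout_from_cmd([sys.executable, "-c", "print('ok')"])
status = get_exit_status_from_cmd([sys.executable, "-c", "import sys; sys.exit(3)"])
```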
Command line application for running processes.
DESCRIPTION
Examples:
python pq.py parseTextContracts --limit=10
python pq.py printParallelParams --where="<doc id>" -r 398 -s 54 --targetfield="invoice"
# Parallel should be run with run_parallel.py
python pq.py markCommonFromLocalText
FUNCTIONS
markCommonFromLocalText_cmd(where=None, limit=None)
Command that runs process to identify common document types and formats from text based PDFs.
parseTextContracts_cmd(where=None, limit=None)
Command that runs process to extract data from a text based contract using Poppler.
parseTextInvoices_cmd(where=None, limit=None)
Command that runs process to extract data from a text based invoice using Poppler.
parseTextOrders_cmd(where=None, limit=None)
Command that runs process to extract data from a text based order using Poppler.
printParallelParams_cmd(resolution, targetfield, where=None, limit=None)
Command that runs process to print parameters needed to extract a PDF with GNU Parallel.
Runs files through a series of filters and other image processing applications. See settings.extract_params.
DESCRIPTION
Examples:
python run_extract.py --targetfield=invoice --where="KOCO-TV_14043097411984" --crop --page=2
python run_extract.py --targetfield=contract --limit=12 --crop
Run GNU Parallel on a set of files, converting, cropping, OCR etc.
DESCRIPTION
options:
--resolution, -r Image resolution for conversion of PDF to PNG. The minimum for good
results is 300, but higher values combined with reduction (see --size) produce
better results, especially for large font sizes and poor-quality
input.
--size, -s Percent to resize images.
--targetfield Document format to extract. These are in formatschemas.py.
--inputfile File to use instead of output from the database.
--outfile Output file for extracted text.
--crop Crop the images. Only works for fields that are enclosed in a box on the
page.
--where ID of record to extract.
--limit Number of documents to extract if there are more than this number in the
database.
--page Page to extract.
--psm Page segmentation mode for Tesseract: 3 for text on multiple lines, 7 for
text on a single line.
--rotation Rotate the images before OCR. -90 = rotate counter-clockwise 90 degrees.
--language Language for tesseract. Default is English.
--outputpng Output PNG files without any filters to /tmp.
--outputcrop Output PNG files after cropping with cropbox.py to /tmp.
--outputchop Output PNG files after chopping the header and margin from sides and
bottom.
--maxprocs Maximum number of processors for parallel to use. Default is 8.
--printcmd Do not run the command. Print it to the console.
--limitIMthreads Limit number of threads ImageMagick can use.
--median Apply median filter with a certain radius. Values over 2 will likely reduce
image quality for small fonts.
--sharpen ImageMagick sharpen: 0x2, 0x3 ...
--textcleaner Parameter set for textcleaner.
0 = Trim, sharpen
1 = Trim, enhance normalize
2 = Trim, sharpen, enhance normalize
--nudge Move the area extracted from the PDF by a number of points, given as
(Left, Right, Down, Up). For example, --nudge=(10, 0, 10, 0) moves the area 10 points
left and 10 points down.
--bbox Get HOCR coordinates from tesseract output. Only works for tesseract config
file "hocr."
--targettext Text to search for (with --bbox).
--threshold Threshold the image before OCR. Uses gaussian threshold with 40 pixel
block size.
--deskew Correct small rotation of the text with ImageMagick. 40 is recommended.
--cuneiform Use cuneiform instead of tesseract for OCR.
--whiteThreshold Threshold at which to turn gray pixels white. 97 is a safe value for
scanned paper documents.
--noinfo Don't append info to output like id, parameters used etc.
--dryrun Print all parallel commands without running. Useful for debugging, or
running without python.
examples:
Run single invoice with fields on page 2, resolution 400 dpi, resize to 50%.
python run_parallel.py -s=50 -r=400 --targetfield=invoice --limit=1 --page=2 --crop
Run all contracts in the database.
python run_parallel.py -s=50 -r=400 --targetfield=contract --crop
Run invoice_addresses (addresses don't need to be cropped and span multiple lines).
python run_parallel.py -s=50 -r=400 --targetfield=invoice_addresses
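The interaction of --resolution, --size, and --nudge comes down to simple arithmetic, since one PDF point is 1/72 inch. The sketch below is not taken from run_parallel.py: the function name `bbox_to_pixels` is hypothetical, and it assumes that y increases downward and that the nudge is applied in points before scaling.

```python
def bbox_to_pixels(bbox_pts, resolution, size_percent, nudge=(0, 0, 0, 0)):
    """Convert a bounding box in PDF points to pixel coordinates.

    bbox_pts: (x0, y0, x1, y1) in points, y increasing downward (assumed)
    resolution: DPI used for the PDF-to-PNG conversion (--resolution)
    size_percent: resize percentage applied afterwards (--size)
    nudge: (left, right, down, up) offsets in points (--nudge)
    """
    left, right, down, up = nudge
    dx = right - left
    dy = down - up
    # 1 point is 1/72 inch, so pixels = points * dpi / 72, then resized.
    scale = resolution / 72.0 * size_percent / 100.0
    x0, y0, x1, y1 = bbox_pts
    return tuple(int(round((value + shift) * scale))
                 for value, shift in ((x0, dx), (y0, dy), (x1, dx), (y1, dy)))


# -r=400 -s=50 gives an effective 200 DPI.
pixels = bbox_to_pixels((72, 144, 288, 216), resolution=400, size_percent=50)
nudged = bbox_to_pixels((72, 144, 288, 216), 400, 50, nudge=(10, 0, 10, 0))
```

Rendering at a high resolution and then shrinking keeps the final pixel size manageable while smoothing out scanning noise, which is the trade-off the --resolution description refers to.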
DESCRIPTION
Standardize data. Make addresses and titlecase consistent
for string clustering.
example:
python standardize.py inputfile.tsv validate_address -c 3
options:
-c <int>, --column=<int> Column in the file with the data to be standardized.
-m <str>, --method=<str> Which standardization to use: standardize_address,
validate_address (with geocode), titlecase.
--header Input file has a header line.
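A rough sketch of the titlecase method on one column of a tab-delimited file is shown below. It is only an illustration: the helper names are hypothetical, the column index is 0-based here (the convention used by -c is not stated above), and the address methods are not attempted. Note that Python's `str.title` also capitalizes the letter after an apostrophe, so real data may need a more careful rule.

```python
import csv
import io


def titlecase(value):
    """Collapse runs of whitespace and title-case each word."""
    return " ".join(value.split()).title()


def standardize_column(lines, column, method=titlecase, header=False):
    """Apply `method` to one column (0-based here) of tab-delimited lines."""
    rows = list(csv.reader(lines, delimiter="\t"))
    start = 1 if header else 0
    for row in rows[start:]:
        row[column] = method(row[column])
    return rows


# In-memory stand-in for an input file with a header line.
sample = io.StringIO(u"id\tadvertiser\n1\tDEM   governors ASSN\n")
rows = standardize_column(sample, column=1, header=True)
```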
Threshold an image.
DESCRIPTION
Usage:
python threshold.py (gaussian | median | mean | otsu | yen | iso) [blocksize]
Example:
cat input.png | python threshold.py gaussian 40 > output.png
cat input.png | python threshold.py otsu > output.png
FUNCTIONS
run_cmd(method, block_size=40)
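To show what a global method such as otsu computes, here is a dependency-free sketch of Otsu's method on a flat list of 8-bit gray values. threshold.py presumably relies on an image library for this (python-skimage is among the dependencies), so the function below is an illustration of the technique, not the script's code, and `otsu_threshold` is a hypothetical name.

```python
def otsu_threshold(pixels):
    """Return the gray level that maximizes between-class variance.

    Pixels at or below the returned value form one class, the rest the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of their gray values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mean0 = sum0 / float(w0)
        mean1 = (sum_all - sum0) / float(w1)
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


# Dark text (10) on a light background (200), flattened to a list.
pixels = [10] * 50 + [200] * 50
threshold = otsu_threshold(pixels)
binary = [255 if p > threshold else 0 for p in pixels]
```

The gaussian, median, and mean methods are local rather than global: they compare each pixel against a statistic of its neighborhood, which is where the blocksize argument comes in.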
Utilities
FUNCTIONS
checkDates(dates)
Validate a list of dates.
checkInts(ints)
Validate a list of integers.
dataAtHocrBboxes(bboxes, htmlpath, returnFirstWord=False)
Get text from HOCR within a list of bounding boxes.
downloadBinary(url, filename)
Download a file.
fuzzySplit(s, d)
Split on a delimiter that may be in the wrong place.
hocrWordCoordsMultiple(words, hocr)
Get any of a list of words from HOCR with bounding box.
inside(bbox, word, page)
Check to see if a word is inside a bounding box (lxml style bounding boxes).
pdfToText(filename)
Convert a text-based PDF to HTML.
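The bounding-box helpers above revolve around one test: does a word's box lie within a target box? In hOCR output each word carries a title attribute of the form "bbox x0 y0 x1 y1". The sketch below works on plain (text, title) pairs instead of parsed HTML; the function names are hypothetical, and the real functions take different arguments.

```python
import re

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def parse_hocr_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    match = BBOX_RE.search(title)
    return tuple(int(n) for n in match.groups()) if match else None


def is_inside(outer, inner):
    """True if the `inner` box lies entirely within the `outer` box."""
    ox0, oy0, ox1, oy1 = outer
    ix0, iy0, ix1, iy1 = inner
    return ox0 <= ix0 and oy0 <= iy0 and ix1 <= ox1 and iy1 <= oy1


def words_in_box(words, box):
    """Return the text of (text, title) word pairs that fall inside `box`."""
    result = []
    for text, title in words:
        word_box = parse_hocr_bbox(title)
        if word_box is not None and is_inside(box, word_box):
            result.append(text)
    return result


# Made-up words with hOCR-style title attributes.
words = [("Invoice", "bbox 100 50 220 80; x_wconf 93"),
         ("Total", "bbox 900 700 980 730; x_wconf 88")]
found = words_in_box(words, box=(0, 0, 500, 200))
```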