Commit 2f925cc

Update documentation and pdf2text example (Issue #95)
1 parent 89c2a75 commit 2f925cc

7 files changed

+726
-75
lines changed

CHANGES.md

+1
@@ -12,6 +12,7 @@ v1.5.0 - YYYY-MM-DD
1212
- Added support for using libpng to embed PNG images in PDF output (Issue #90)
1313
- Added support for writing the PCLm subset of PDF (Issue #99)
1414
- Now support opening damaged PDF files (Issue #45)
15+
- Updated documentation (Issue #95)
1516
- Updated the pdf2txt example to support font encodings.
1617
- Fixed a potential heap overflow in the TrueType font code.
1718

doc/pdfio.3

+219-20
@@ -1,4 +1,4 @@
1-
.TH pdfio 3 "pdf read/write library" "2025-02-20" "pdf read/write library"
1+
.TH pdfio 3 "pdf read/write library" "2025-03-06" "pdf read/write library"
22
.SH NAME
33
pdfio \- pdf read/write library
44
.SH Introduction
@@ -34,7 +34,7 @@ PDFio is
3434
.I not
3535
concerned with rendering or viewing a PDF file, although a PDF RIP or viewer could be written using it.
3636
.PP
37-
PDFio is Copyright \[co] 2021\-2024 by Michael R Sweet and is licensed under the Apache License Version 2.0 with an (optional) exception to allow linking against GPL2/LGPL2 software. See the files "LICENSE" and "NOTICE" for more information.
37+
PDFio is Copyright \[co] 2021\-2025 by Michael R Sweet and is licensed under the Apache License Version 2.0 with an (optional) exception to allow linking against GPL2/LGPL2 software. See the files "LICENSE" and "NOTICE" for more information.
3838
.SS Requirements
3939
.PP
4040
PDFio requires the following to build the software:
@@ -52,9 +52,11 @@ A POSIX\-compliant sh program
5252

5353
.IP \(bu 5
5454
.PP
55-
ZLIB (https://www.zlib.net) 1.0 or higher
55+
ZLIB (https://www.zlib.net/) 1.0 or higher
5656

5757

58+
.PP
59+
PDFio will also use libpng 1.6 or higher (https://www.libpng.org/) to provide enhanced PNG image support.
5860
.PP
5961
IDE files for Xcode (macOS/iOS) and Visual Studio (Windows) are also provided.
6062
.SS Installing PDFio
@@ -1097,28 +1099,83 @@ The pdfioinfo.c example program opens a PDF file and prints the title, author, c
10971099
.fi
10981100
.SS Extract Text from PDF File
10991101
.PP
1100-
The pdf2text.c example code extracts non\-Unicode text from a PDF file by scanning each page for strings and text drawing commands. Since it doesn't look at the font encoding or support Unicode text, it is really only useful to extract plain ASCII text from a PDF file. And since it writes text in the order it appears in the page stream, it may not come out in the same order as appears on the page.
1102+
The pdf2text.c example code extracts text from a PDF file and writes it to the standard output. Unlike some other PDF tools, it outputs the text in the order it is seen in each page stream so the output might appear "jumbled" if the PDF producer doesn't output text in reading order. The code is able to handle different font encodings and produces UTF\-8 output.
11011103
.PP
1102-
The pdfioStreamGetToken function is used to read individual tokens from the page streams. Tokens starting with the open parenthesis are text strings, while PDF operators are left as\-is. We use some simple logic to make sure that we include spaces between text strings and add newlines for the text operators that start a new line in a text block:
1104+
The pdfioStreamGetToken function is used to read individual tokens from the page streams:
11031105
.nf
11041106

11051107
pdfio_stream_t *st; // Page stream
1108+
char buffer[1024], // Token buffer
1109+
*bufptr, // Pointer into buffer
1110+
name[256]; // Current (font) name
11061111
bool first = true; // First string on line?
1107-
char buffer[1024]; // Token buffer
1112+
int encoding[256]; // Font encoding to Unicode
1113+
bool in_array = false; // Are we in an array?
11081114

11091115
// Read PDF tokens from the page stream...
11101116
while (pdfioStreamGetToken(st, buffer, sizeof(buffer)))
11111117
{
1112-
if (buffer[0] == '(')
1118+
.fi
1119+
.PP
1120+
Justified text can be found inside arrays ("[ ... ]"), so we look for the array delimiter tokens and any (spacing) numbers inside an array. Experimentation has shown that numbers with an absolute value greater than 100 can be treated as whitespace:
1121+
.nf
1122+
1123+
if (!strcmp(buffer, "["))
1124+
{
1125+
// Start of an array for justified text...
1126+
in_array = true;
1127+
}
1128+
else if (!strcmp(buffer, "]"))
1129+
{
1130+
// End of an array for justified text...
1131+
in_array = false;
1132+
}
1133+
else if (!first && in_array && (isdigit(buffer[0]) || buffer[0] == '\-') && fabs(atof(buffer)) > 100)
1134+
{
1135+
// Whitespace in a justified text block...
1136+
putchar(' ');
1137+
}
1138+
.fi
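The spacing test in the hunk above can be isolated into a small predicate for experimentation. The following is a minimal sketch, not code from pdf2text.c: the helper name `is_word_gap` is hypothetical, and it uses `strtod` with a manual absolute value where the example uses `atof` and `fabs`. The threshold of 100 is the one the documentation reports from experimentation.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

// Hypothetical helper (not part of pdf2text.c): returns true when a token
// read inside an array looks like a spacing number large enough to be
// treated as whitespace between words.
static bool
is_word_gap(const char *token)
{
  double value;                         // Numeric value of the token

  // Same prefix test as the example: a leading digit or minus sign
  if (!isdigit((unsigned char)token[0]) && token[0] != '-')
    return (false);

  // Compare the absolute value against the empirical threshold
  value = strtod(token, NULL);
  if (value < 0.0)
    value = -value;

  return (value > 100.0);
}
```

As in the example, a number written with a leading '+' or '.' is not recognized by this prefix test, so such tokens never produce a space.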
1139+
.PP
1140+
Tokens starting with \'(' or \'<' are text fragments. 8\-bit text starting with \'(' needs to be mapped to Unicode using the current font encoding while hex strings starting with \'<' are UTF\-16 (Unicode) that need to be converted to UTF\-8:
1141+
.nf
1142+
1143+
else if (buffer[0] == '(')
11131144
{
11141145
// Text string using an 8\-bit encoding
1115-
if (first)
1116-
first = false;
1117-
else if (buffer[1] != ' ')
1118-
putchar(' ');
1146+
first = false;
11191147

1120-
fputs(buffer + 1, stdout);
1148+
for (bufptr = buffer + 1; *bufptr; bufptr ++)
1149+
put_utf8(encoding[*bufptr & 255]);
1150+
}
1151+
else if (buffer[0] == '<')
1152+
{
1153+
// Unicode text string
1154+
first = false;
1155+
1156+
puts_utf16(buffer + 1);
1157+
}
1158+
.fi
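The hunk above calls `put_utf8` for each mapped character, but the diff does not show that function. As an illustration of what such a conversion involves, here is a minimal sketch of encoding one Unicode code point as UTF-8. The name `encode_utf8` and its buffer-based signature are assumptions made for this sketch and may differ from the helper in pdf2text.c, which writes to the standard output.

```c
#include <stddef.h>

// Hypothetical helper: encode one Unicode code point as UTF-8 into "out"
// and return the number of bytes written, or 0 if the code point is not a
// valid Unicode scalar value.
static size_t
encode_utf8(int ch, unsigned char out[4])
{
  // Reject negative values, values past U+10FFFF, and UTF-16 surrogates
  if (ch < 0 || ch > 0x10ffff || (ch >= 0xd800 && ch <= 0xdfff))
    return (0);

  if (ch < 0x80)
  {
    // 1 byte: 0xxxxxxx
    out[0] = (unsigned char)ch;
    return (1);
  }
  else if (ch < 0x800)
  {
    // 2 bytes: 110xxxxx 10xxxxxx
    out[0] = (unsigned char)(0xc0 | (ch >> 6));
    out[1] = (unsigned char)(0x80 | (ch & 0x3f));
    return (2);
  }
  else if (ch < 0x10000)
  {
    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out[0] = (unsigned char)(0xe0 | (ch >> 12));
    out[1] = (unsigned char)(0x80 | ((ch >> 6) & 0x3f));
    out[2] = (unsigned char)(0x80 | (ch & 0x3f));
    return (3);
  }

  // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  out[0] = (unsigned char)(0xf0 | (ch >> 18));
  out[1] = (unsigned char)(0x80 | ((ch >> 12) & 0x3f));
  out[2] = (unsigned char)(0x80 | ((ch >> 6) & 0x3f));
  out[3] = (unsigned char)(0x80 | (ch & 0x3f));
  return (4);
}
```

Returning a byte count instead of writing directly to stdout keeps the sketch easy to check in isolation; a caller could pass the result to `fwrite`.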
1159+
.PP
1160+
Simple (8\-bit) fonts include an encoding table that maps the 8\-bit characters to one of 1051 Unicode glyph names. Since each font can use a different encoding, we look for font names starting with \'/' and the "Tf" (set text font) operator token and load that font's encoding using the load_encoding function:
1161+
.nf
1162+
1163+
else if (buffer[0] == '/')
1164+
{
1165+
// Save name...
1166+
strncpy(name, buffer + 1, sizeof(name) \- 1);
1167+
name[sizeof(name) \- 1] = '\\0';
1168+
}
1169+
else if (!strcmp(buffer, "Tf") && name[0])
1170+
{
1171+
// Set font...
1172+
load_encoding(obj, name, encoding);
11211173
}
1174+
.fi
1175+
.PP
1176+
Finally, some text operators start a new line in a text block, so when we see their tokens we output a newline:
1177+
.nf
1178+
11221179
else if (!strcmp(buffer, "Td") || !strcmp(buffer, "TD") || !strcmp(buffer, "T*") ||
11231180
!strcmp(buffer, "\\'") || !strcmp(buffer, "\\""))
11241181
{
@@ -1127,9 +1184,150 @@ The pdfioStreamGetToken function is used to read individual tokens from the page
11271184
first = true;
11281185
}
11291186
}
1187+
.fi
1188+
.PP
1189+
The load_encoding Function
1190+
.PP
1191+
The load_encoding function looks up the named font in the page's "Resources" dictionary. A PDF simple font can contain an "Encoding" dictionary with a base encoding ("WinAnsiEncoding", "MacRomanEncoding", or "MacExpertEncoding") and a differences array that lists character indexes and glyph names for an 8\-bit font.
1192+
.PP
1193+
We start by initializing the encoding array to the default WinANSI encoding and looking up the font object for the named font:
1194+
.nf
1195+
1196+
static void
1197+
load_encoding(
1198+
pdfio_obj_t *page_obj, // I \- Page object
1199+
const char *name, // I \- Font name
1200+
int encoding[256]) // O \- Encoding table
1201+
{
1202+
size_t i, j; // Looping vars
1203+
pdfio_dict_t *page_dict, // Page dictionary
1204+
*resources_dict, // Resources dictionary
1205+
*font_dict; // Font dictionary
1206+
pdfio_obj_t *font_obj, // Font object
1207+
*encoding_obj; // Encoding object
1208+
static int win_ansi[32] = // WinANSI characters from 128 to 159
1209+
{
1210+
...
1211+
};
1212+
static int mac_roman[128] = // MacRoman characters from 128 to 255
1213+
{
1214+
...
1215+
};
1216+
1217+
1218+
// Initialize the encoding to be the "standard" WinAnsi...
1219+
for (i = 0; i < 128; i ++)
1220+
encoding[i] = i;
1221+
for (i = 160; i < 256; i ++)
1222+
encoding[i] = i;
1223+
memcpy(encoding + 128, win_ansi, sizeof(win_ansi));
1224+
1225+
// Find the named font...
1226+
if ((page_dict = pdfioObjGetDict(page_obj)) == NULL)
1227+
return;
1228+
1229+
if ((resources_dict = pdfioDictGetDict(page_dict, "Resources")) == NULL)
1230+
return;
1231+
1232+
if ((font_dict = pdfioDictGetDict(resources_dict, "Font")) == NULL)
1233+
{
1234+
// Font resources not a dictionary, see if it is an object...
1235+
if ((font_obj = pdfioDictGetObj(resources_dict, "Font")) != NULL)
1236+
font_dict = pdfioObjGetDict(font_obj);
1237+
1238+
if (!font_dict)
1239+
return;
1240+
}
1241+
1242+
if ((font_obj = pdfioDictGetObj(font_dict, name)) == NULL)
1243+
return;
1244+
.fi
1245+
.PP
1246+
Once we have found the font we see if it has an "Encoding" dictionary:
1247+
.nf
1248+
1249+
pdfio_dict_t *encoding_dict; // Encoding dictionary
1250+
1251+
if ((encoding_obj = pdfioDictGetObj(pdfioObjGetDict(font_obj), "Encoding")) == NULL)
1252+
return;
1253+
1254+
if ((encoding_dict = pdfioObjGetDict(encoding_obj)) == NULL)
1255+
return;
1256+
.fi
1257+
.PP
1258+
Once we have the encoding dictionary we can get the "BaseEncoding" and "Differences" values:
1259+
.nf
1260+
1261+
const char *base_encoding; // BaseEncoding name
1262+
pdfio_array_t *differences; // Differences array
11301263

1131-
if (!first)
1132-
putchar('\\n');
1264+
// OK, have the encoding object, build the encoding using it...
1265+
base_encoding = pdfioDictGetName(encoding_dict, "BaseEncoding");
1266+
differences = pdfioDictGetArray(encoding_dict, "Differences");
1267+
.fi
1268+
.PP
1269+
If the base encoding is "MacRomanEncoding", we need to reset the upper 128 characters in the encoding array to match it:
1270+
.nf
1271+
1272+
if (base_encoding && !strcmp(base_encoding, "MacRomanEncoding"))
1273+
{
1274+
// Map upper 128
1275+
memcpy(encoding + 128, mac_roman, sizeof(mac_roman));
1276+
}
1277+
.fi
1278+
.PP
1279+
Then we loop through the differences array, keeping track of the current index within the encoding array. A number indicates a new index while a name is the Unicode glyph for the current index:
1280+
.nf
1281+
1282+
typedef struct name_map_s
1283+
{
1284+
const char *name; // Character name
1285+
int unicode; // Unicode value
1286+
} name_map_t;
1287+
1288+
static name_map_t unicode_map[1051]; // List of glyph names
1289+
1290+
if (differences)
1291+
{
1292+
// Apply differences
1293+
size_t count = pdfioArrayGetSize(differences);
1294+
// Number of differences
1295+
const char *name; // Character name
1296+
size_t idx = 0; // Index in encoding array
1297+
1298+
for (i = 0; i < count; i ++)
1299+
{
1300+
switch (pdfioArrayGetType(differences, i))
1301+
{
1302+
case PDFIO_VALTYPE_NUMBER :
1303+
// Get the index of the next character...
1304+
idx = (size_t)pdfioArrayGetNumber(differences, i);
1305+
break;
1306+
1307+
case PDFIO_VALTYPE_NAME :
1308+
// Lookup name and apply to encoding...
1309+
if (idx > 255)
1310+
break;
1311+
1312+
name = pdfioArrayGetName(differences, i);
1313+
for (j = 0; j < (sizeof(unicode_map) / sizeof(unicode_map[0])); j ++)
1314+
{
1315+
if (!strcmp(name, unicode_map[j].name))
1316+
{
1317+
encoding[idx] = unicode_map[j].unicode;
1318+
break;
1319+
}
1320+
}
1321+
idx ++;
1322+
break;
1323+
1324+
default :
1325+
// Do nothing for other values
1326+
break;
1327+
}
1328+
}
1329+
}
1330+
}
11331331
.fi
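The differences loop above can be tried without a PDF file by feeding it a plain C array. The sketch below makes several assumptions: `diff_item_t` is a hypothetical stand-in for the values PDFio returns from a "Differences" array, `apply_differences` is an illustrative name, and the three-entry `glyphs` table replaces the 1051-entry `unicode_map` that the example declares.

```c
#include <stddef.h>
#include <string.h>

// Hypothetical stand-in for one value in a "Differences" array
typedef struct diff_item_s
{
  const char *name;                     // Glyph name, or NULL for a number
  int        number;                    // Character index when name is NULL
} diff_item_t;

// Tiny illustrative glyph table (the example uses a 1051-entry map)
typedef struct glyph_map_s
{
  const char *name;                     // Glyph name
  int        unicode;                   // Unicode value
} glyph_map_t;

static const glyph_map_t glyphs[] =
{
  { "Euro",   0x20ac },
  { "bullet", 0x2022 },
  { "eacute", 0x00e9 }
};

// Apply a differences list to an encoding table: a number sets the current
// index, each name maps the current index and then advances it.
static void
apply_differences(const diff_item_t *items, size_t count, int encoding[256])
{
  size_t i, j;                          // Looping vars
  long   idx = 0;                       // Current index in encoding table

  for (i = 0; i < count; i ++)
  {
    if (!items[i].name)
    {
      // A number starts a new run of character indexes
      idx = items[i].number;
      continue;
    }

    // Ignore names whose index falls outside the 8-bit table
    if (idx < 0 || idx > 255)
      continue;

    for (j = 0; j < (sizeof(glyphs) / sizeof(glyphs[0])); j ++)
    {
      if (!strcmp(items[i].name, glyphs[j].name))
      {
        encoding[idx] = glyphs[j].unicode;
        break;
      }
    }

    idx ++;
  }
}
```

A glyph name that is missing from the table leaves the existing mapping in place but still advances the index, which matches the behavior of the loop in the example.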
11341332
.SS Create a PDF File With Text and an Image
11351333
.PP
@@ -4365,12 +4563,13 @@ bool pdfioStreamGetToken (
43654563
);
43664564
.fi
43674565
.PP
4368-
This function reads a single PDF token from a stream. Operator tokens,
4369-
boolean values, and numbers are returned as-is in the provided string buffer.
4370-
String values start with the opening parenthesis ('(') but have all escaping
4371-
resolved and the terminating parenthesis removed. Hexadecimal string values
4372-
start with the opening angle bracket ('<') and have all whitespace and the
4373-
terminating angle bracket removed.
4566+
This function reads a single PDF token from a stream, skipping all whitespace
4567+
and comments. Operator tokens, boolean values, and numbers are returned
4568+
as-is in the provided string buffer. String values start with the opening
4569+
parenthesis ('(') but have all escaping resolved and the terminating
4570+
parenthesis removed. Hexadecimal string values start with the opening angle
4571+
bracket ('<') and have all whitespace and the terminating angle bracket
4572+
removed.
43744573
.SS pdfioStreamPeek
43754574
Peek at data in a stream.
43764575
.PP
