concerned with rendering or viewing a PDF file, although a PDF RIP or viewer could be written using it.
36
36
.PP
37
-
PDFio is Copyright \[co] 2021\-2024 by Michael R Sweet and is licensed under the Apache License Version 2.0 with an (optional) exception to allow linking against GPL2/LGPL2 software. See the files "LICENSE" and "NOTICE" for more information.
37
+
PDFio is Copyright \[co] 2021\-2025 by Michael R Sweet and is licensed under the Apache License Version 2.0 with an (optional) exception to allow linking against GPL2/LGPL2 software. See the files "LICENSE" and "NOTICE" for more information.
38
38
.SS Requirements
39
39
.PP
40
40
PDFio requires the following to build the software:
@@ -52,9 +52,11 @@ A POSIX\-compliant sh program
52
52
53
53
.IP\(bu5
54
54
.PP
55
-
ZLIB (https://www.zlib.net) 1.0 or higher
55
+
ZLIB (https://www.zlib.net/) 1.0 or higher
56
56
57
57
58
+
.PP
59
+
PDFio will also use libpng 1.6 or higher (https://www.libpng.org/) to provide enhanced PNG image support.
58
60
.PP
59
61
IDE files for Xcode (macOS/iOS) and Visual Studio (Windows) are also provided.
60
62
.SS Installing PDFio
@@ -1097,28 +1099,83 @@ The pdfioinfo.c example program opens a PDF file and prints the title, author, c
1097
1099
.fi
1098
1100
.SS Extract Text from PDF File
1099
1101
.PP
1100
-
The pdf2text.c example code extracts non\-Unicode text from a PDF file by scanning each page for strings and text drawing commands. Since it doesn't look at the font encoding or support Unicode text, it is really only useful to extract plain ASCII text from a PDF file. And since it writes text in the order it appears in the page stream, it may not come out in the same order as appears on the page.
1102
+
The pdf2text.c example code extracts text from a PDF file and writes it to the standard output. Unlike some other PDF tools, it outputs the text in the order it is seen in each page stream so the output might appear "jumbled" if the PDF producer doesn't output text in reading order. The code is able to handle different font encodings and produces UTF\-8 output.
1101
1103
.PP
1102
-
The pdfioStreamGetToken function is used to read individual tokens from the page streams. Tokens starting with the open parenthesis are text strings, while PDF operators are left as\-is. We use some simple logic to make sure that we include spaces between text strings and add newlines for the text operators that start a new line in a text block:
1104
+
The pdfioStreamGetToken function is used to read individual tokens from the page streams:
1103
1105
.nf
1104
1106
1105
1107
pdfio_stream_t *st; // Page stream
1108
+
char buffer[1024], // Token buffer
1109
+
*bufptr, // Pointer into buffer
1110
+
name[256]; // Current (font) name
1106
1111
bool first = true; // First string on line?
1107
-
char buffer[1024]; // Token buffer
1112
+
int encoding[256]; // Font encoding to Unicode
1113
+
bool in_array = false; // Are we in an array?
1108
1114
1109
1115
// Read PDF tokens from the page stream...
1110
1116
while (pdfioStreamGetToken(st, buffer, sizeof(buffer)))
1111
1117
{
1112
-
if (buffer[0] == '(')
1118
+
.fi
1119
+
.PP
1120
+
Justified text can be found inside arrays ("[ ... ]"), so we look for the array delimiter tokens and any (spacing) numbers inside an array. Experimentation has shown that numbers greater than 100 can be treated as whitespace:
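As a minimal sketch of that heuristic, the helper below tracks whether we are inside an array and reports when a spacing number should be written as a space. The function name array_token_is_space is hypothetical and not part of PDFio, and comparing the magnitude of the number (so large negative adjustments also count) is an assumption made for this example:

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical helper (not part of PDFio): update the "in array" state for
// one token and return true when the token is a spacing number that should
// be written as a single space.
static bool
array_token_is_space(const char *token,    // I - Token from the page stream
                     bool       *in_array) // IO - Are we in an array?
{
  if (!strcmp(token, "["))
  {
    // Start of an array...
    *in_array = true;
  }
  else if (!strcmp(token, "]"))
  {
    // End of an array...
    *in_array = false;
  }
  else if (*in_array && (isdigit((unsigned char)token[0]) || token[0] == '-' || token[0] == '.'))
  {
    // Number inside an array, compare its magnitude to the threshold
    // (assumption: the sign is ignored)
    double value = strtod(token, NULL);

    if (value < 0.0)
      value = -value;

    return (value > 100.0);
  }

  return (false);
}
```

A caller would pass each token from the page stream to this helper before the string checks and write a space whenever it returns true.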
Tokens starting with \'(' or \'<' are text fragments. 8\-bit text starting with \'(' needs to be mapped to Unicode using the current font encoding while hex strings starting with \'<' are UTF\-16 (Unicode) that need to be converted to UTF\-8:
1141
+
.nf
1142
+
1143
+
else if (buffer[0] == '(')
1113
1144
{
1114
1145
// Text string using an 8\-bit encoding
1115
-
if (first)
1116
-
first = false;
1117
-
else if (buffer[1] != ' ')
1118
-
putchar(' ');
1146
+
first = false;
1119
1147
1120
-
fputs(buffer + 1, stdout);
1148
+
for (bufptr = buffer + 1; *bufptr; bufptr ++)
1149
+
put_utf8(encoding[*bufptr & 255]);
1150
+
}
1151
+
else if (buffer[0] == '<')
1152
+
{
1153
+
// Unicode text string
1154
+
first = false;
1155
+
1156
+
puts_utf16(buffer + 1);
1157
+
}
1158
+
.fi
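The put_utf8 and puts_utf16 helpers are not shown here. As a rough sketch of the conversions they need to perform, the two functions below encode a Unicode code point as UTF\-8 and combine a UTF\-16 surrogate pair into a single code point. The names encode_utf8 and combine_surrogates are hypothetical and are not the functions used by pdf2text.c:

```c
#include <stddef.h>

// Hypothetical helper: encode one Unicode code point as UTF-8.  Writes up
// to 4 bytes plus a nul terminator to "out" and returns the number of bytes
// written, or 0 for surrogates and values outside the Unicode range.
static size_t
encode_utf8(int  ch,     // I - Unicode code point
            char out[5]) // O - UTF-8 bytes
{
  size_t len = 0;        // Number of bytes written

  if (ch < 0 || ch > 0x10ffff || (ch >= 0xd800 && ch <= 0xdfff))
  {
    // Not a valid code point, write nothing...
  }
  else if (ch < 0x80)
  {
    // 1 byte (ASCII)
    out[len ++] = (char)ch;
  }
  else if (ch < 0x800)
  {
    // 2 bytes
    out[len ++] = (char)(0xc0 | (ch >> 6));
    out[len ++] = (char)(0x80 | (ch & 0x3f));
  }
  else if (ch < 0x10000)
  {
    // 3 bytes
    out[len ++] = (char)(0xe0 | (ch >> 12));
    out[len ++] = (char)(0x80 | ((ch >> 6) & 0x3f));
    out[len ++] = (char)(0x80 | (ch & 0x3f));
  }
  else
  {
    // 4 bytes
    out[len ++] = (char)(0xf0 | (ch >> 18));
    out[len ++] = (char)(0x80 | ((ch >> 12) & 0x3f));
    out[len ++] = (char)(0x80 | ((ch >> 6) & 0x3f));
    out[len ++] = (char)(0x80 | (ch & 0x3f));
  }

  out[len] = '\0';

  return (len);
}

// Hypothetical helper: combine a UTF-16 high surrogate (0xD800-0xDBFF) and
// low surrogate (0xDC00-0xDFFF) into one code point.
static int
combine_surrogates(int high, int low)
{
  return (0x10000 + ((high - 0xd800) << 10) + (low - 0xdc00));
}
```

For 8\-bit strings each byte would first be mapped through the encoding array and then passed to encode_utf8, while hex strings would be decoded four hex digits at a time, combining surrogate pairs before encoding.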
1159
+
.PP
1160
+
Simple (8\-bit) fonts include an encoding table that maps the 8\-bit characters to one of 1051 Unicode glyph names. Since each font can use a different encoding, we look for font names starting with \'/' and the "Tf" (set text font) operator token and load that font's encoding using the load_encoding function:
1161
+
.nf
1162
+
1163
+
else if (buffer[0] == '/')
1164
+
{
1165
+
// Save name...
1166
+
strncpy(name, buffer + 1, sizeof(name) \- 1);
1167
+
name[sizeof(name) \- 1] = '\\0';
1168
+
}
1169
+
else if (!strcmp(buffer, "Tf") && name[0])
1170
+
{
1171
+
// Set font...
1172
+
load_encoding(obj, name, encoding);
1121
1173
}
1174
+
.fi
1175
+
.PP
1176
+
Finally, some text operators start a new line in a text block, so when we see their tokens we output a newline:
@@ -1127,9 +1184,150 @@ The pdfioStreamGetToken function is used to read individual tokens from the page
1127
1184
first = true;
1128
1185
}
1129
1186
}
1187
+
.fi
1188
+
.PP
1189
+
The load_encoding Function
1190
+
.PP
1191
+
The load_encoding function looks up the named font in the page's "Resources" dictionary. Every PDF simple font contains an "Encoding" dictionary with a base encoding ("WinANSI", "MacRoman", or "MacExpert") and a differences array that lists character indexes and glyph names for an 8\-bit font.
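To illustrate how a differences array changes an encoding table, here is a small self\-contained sketch. It assumes the array has already been read into a list of C strings, with numbers as decimal text and glyph names starting with \'/', which is a simplification for this example rather than how PDFio exposes arrays. The glyph_t type, the three\-entry sample_glyphs table, and the lookup_glyph and apply_differences functions are all hypothetical:

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical glyph name entry
typedef struct glyph_s
{
  const char *name;    // Glyph name
  int        unicode;  // Unicode value
} glyph_t;

// Tiny sample table for illustration only
static const glyph_t sample_glyphs[] =
{
  { "Euro",   0x20ac },
  { "bullet", 0x2022 },
  { "eacute", 0x00e9 }
};

// Return the Unicode value for a glyph name, or 0 if it is unknown
static int
lookup_glyph(const char *name)
{
  size_t i;  // Looping var

  for (i = 0; i < sizeof(sample_glyphs) / sizeof(sample_glyphs[0]); i ++)
  {
    if (!strcmp(name, sample_glyphs[i].name))
      return (sample_glyphs[i].unicode);
  }

  return (0);
}

// Apply a differences list to an encoding table.  A number sets the current
// index and each name that follows fills the current index and advances it.
static void
apply_differences(const char * const *items,     // I - Numbers and "/Name" strings
                  size_t             num_items,  // I - Number of items
                  int                encoding[256]) // IO - Encoding table
{
  size_t i,        // Looping var
         idx = 0;  // Current index in the encoding table

  for (i = 0; i < num_items; i ++)
  {
    if (isdigit((unsigned char)items[i][0]))
    {
      // Number: new index
      idx = (size_t)strtoul(items[i], NULL, 10);
    }
    else if (idx < 256)
    {
      // Name: glyph for the current index (unknown names keep the base value)
      int unicode = lookup_glyph(items[i] + (items[i][0] == '/'));

      if (unicode)
        encoding[idx] = unicode;

      idx ++;
    }
  }
}
```

With a base table already filled in, the list "128 /Euro /bullet 233 /eacute" would set entries 128, 129, and 233 and leave every other entry at its base value.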
1192
+
.PP
1193
+
We start by initializing the encoding array to the default WinANSI encoding and looking up the font object for the named font:
1194
+
.nf
1195
+
1196
+
static void
1197
+
load_encoding(
1198
+
pdfio_obj_t *page_obj, // I \- Page object
1199
+
const char *name, // I \- Font name
1200
+
int encoding[256]) // O \- Encoding table
1201
+
{
1202
+
size_t i, j; // Looping vars
1203
+
pdfio_dict_t *page_dict, // Page dictionary
1204
+
*resources_dict, // Resources dictionary
1205
+
*font_dict; // Font dictionary
1206
+
pdfio_obj_t *font_obj, // Font object
1207
+
*encoding_obj; // Encoding object
1208
+
static int win_ansi[32] = // WinANSI characters from 128 to 159
1209
+
{
1210
+
...
1211
+
};
1212
+
static int mac_roman[128] = // MacRoman characters from 128 to 255
1213
+
{
1214
+
...
1215
+
};
1216
+
1217
+
1218
+
// Initialize the encoding to be the "standard" WinAnsi...
Then we loop through the differences array, keeping track of the current index within the encoding array. A number indicates a new index while a name is the Unicode glyph for the current index:
1280
+
.nf
1281
+
1282
+
typedef struct name_map_s
1283
+
{
1284
+
const char *name; // Character name
1285
+
int unicode; // Unicode value
1286
+
} name_map_t;
1287
+
1288
+
static name_map_t unicode_map[1051]; // List of glyph names