Commit dcca5b2

Lucas Faudman authored and committed
add setup.sh, update README
1 parent 3a4046f commit dcca5b2

2 files changed: +14 -6 lines changed

README.md

+9 -2
@@ -1,8 +1,15 @@
 # sans-index-generator
-Generate Indexes from SANS PDFs
+**Generate Indexes from SANS PDFs**
 
 > NOTE: May not work with all SANS PDFs due to different structures. Modify the `fix_text` and `extract_pdf_text` methods in `extractpdfs.py` to match the structure of the PDFs you are working with if errors occur.
 
+## Setup
+Run the following command to clone the repository and run the setup script.
+```bash
+git clone https://github.com/LucasFaudman/sans-index-generator && cd sans-index-generator && chmod +x setup.sh && ./setup.sh
+```
+
+## Usage
 ```bash
 usage: extractpdfs.py [-h] [-P PASSWORD] [-O OUT] [--maxwidth MAXWIDTH]
 [--only-page-order] [--only-alpha]
@@ -43,7 +50,7 @@ optional arguments:
 Save index to file
 ```
 
-### Example Output
+## Example Output
 ```
 560/SEC560-Book1.pdf:

extractpdfs.py

+5 -4
@@ -1,10 +1,11 @@
 import argparse
 import re
 import json
+from sys import stdout, stderr
 from pathlib import Path
-from collections import defaultdict, OrderedDict
+from collections import defaultdict
 from concurrent.futures import ProcessPoolExecutor
-from sys import stdout, stderr
+
 from textwrap import TextWrapper
 from PyPDF2 import PdfReader

@@ -83,7 +84,7 @@ def make_index(file_pages, keep_roadmap=False, keep_toc=False, keep_continuation
     index = defaultdict(dict)
     for filename, pages in file_pages.items():
         for page_num, (header, text, references) in pages.items():
-            if not keep_roadmap and header in ["Course Roadmap", "Course Outline"]:
+            if not keep_roadmap and header.startswith(("Course Roadmap", "Course Outline")):
                 continue
             if not keep_toc and header == "TABLE OF CONTENTS":
                 continue
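
Note on the hunk above: switching from a list membership test to `str.startswith` with a tuple widens the match, so a header that merely begins with "Course Roadmap" or "Course Outline" is now skipped too. A minimal sketch of the difference follows. The helper names `is_roadmap_exact` and `is_roadmap_prefix` and the sample headers are hypothetical illustrations, not code or data from the repository.

```python
ROADMAP_PREFIXES = ("Course Roadmap", "Course Outline")


def is_roadmap_exact(header):
    # Old check: true only when the header equals one of the strings exactly.
    return header in ["Course Roadmap", "Course Outline"]


def is_roadmap_prefix(header):
    # New check: str.startswith accepts a tuple and is true if the header
    # begins with any of the given prefixes.
    return header.startswith(ROADMAP_PREFIXES)


# Hypothetical headers; real SANS page headers may look different.
samples = [
    "Course Roadmap",
    "Course Roadmap (1)",
    "Course Outline: Day 2",
    "Nmap Scanning",
]

for header in samples:
    print(f"{header!r}: exact={is_roadmap_exact(header)} "
          f"prefix={is_roadmap_prefix(header)}")
```

`str.startswith` takes a single string or a tuple of strings. Passing a list raises `TypeError`, which is presumably why the new line wraps the two prefixes in an extra pair of parentheses.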
@@ -122,7 +123,7 @@ def print_index_by_alpha_order(index, stream=None, maxwidth=80):
 
     def sort_fn(x): return x[0].replace(
         'The ', '', 1).replace('A ', '', 1).lower()
-    alpha_index = OrderedDict(sorted(alpha_index.items(), key=sort_fn))
+    alpha_index = dict(sorted(alpha_index.items(), key=sort_fn))
     max_pagestr_len = max(len(": " + ','.join(page_nums))
                           for page_nums in alpha_index.values())

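Note on the `OrderedDict` to `dict` change: a plain `dict` preserves insertion order (a language guarantee since Python 3.7), so building it from a sorted sequence of items keeps the sorted order. The sketch below reuses the sort key from the diff on a small made-up index. The entries and page numbers are hypothetical, and the mapping shape (term to list of page-number strings) is an assumption inferred from the `','.join(page_nums)` call.

```python
def sort_fn(item):
    # Same key as in the diff: remove the first 'The ' and the first 'A '
    # found in the term (wherever they occur), then lowercase it.
    return item[0].replace('The ', '', 1).replace('A ', '', 1).lower()


# Hypothetical index entries: term -> list of page-number strings.
alpha_index = {
    "The Kill Chain": ["12"],
    "nmap": ["40", "41"],
    "A Record": ["7"],
    "Burp Suite": ["88"],
}

# A plain dict keeps the order in which the sorted items are inserted.
alpha_index = dict(sorted(alpha_index.items(), key=sort_fn))

# Width of the longest ": page,page" suffix, as computed in the diff.
max_pagestr_len = max(len(": " + ','.join(page_nums))
                      for page_nums in alpha_index.values())

for term, page_nums in alpha_index.items():
    print(term, page_nums)
print(max_pagestr_len)
```

One difference that may matter elsewhere: `OrderedDict` equality is order-sensitive and the class offers `move_to_end`, while plain `dict` equality ignores order. Neither appears to be needed in the lines shown in this hunk.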