README.md
+35-2
@@ -1,13 +1,23 @@
 ## text-extractor
-Extracts text from PDFfiles using embeded python
+Extracts text from PDF and PPTX files and from images (PNG, JPEG, ...) using embedded Python
 
 
 ## Installation ZPM
 
+1. text-extractor
 ```
 USER>zpm "install text-extractor"
 ```
 
+2. Images (optional)
+This package uses tesseract-ocr to extract text from images. If you plan to extract text from images, you also need to install tesseract-ocr:
+`apt-get install tesseract-ocr`
+
+If the text is in a language other than English, you also need the matching language package, for example tesseract-ocr-fra for French: `apt-get install tesseract-ocr-fra`
+
+3. PDF to Image (optional)
+This package supports several ways to work with PDF files. One of them converts the PDF pages to images first and then extracts text from those images. If you use this approach, you need to install poppler-utils:
+`apt-get install poppler-utils`
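The optional dependencies described in steps 2 and 3 can be verified from a shell before using the image-based extraction. The following is a minimal sketch under stated assumptions, not part of text-extractor: `check_tool` and `has_lang` are hypothetical helper names, `tesseract` and `pdftoppm` are assumed to be the binaries provided by the tesseract-ocr and poppler-utils packages, and `fra` is only the example language from the text above.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; they are not part of text-extractor.

# check_tool NAME: report whether NAME is available on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
        return 1
    fi
}

# has_lang CODE: read language codes on stdin, one per line,
# and succeed if CODE is among them.
has_lang() {
    grep -qxF -- "$1"
}

# Assumption: tesseract comes from tesseract-ocr, pdftoppm from poppler-utils.
check_tool tesseract || echo "  try: apt-get install tesseract-ocr"
check_tool pdftoppm || echo "  try: apt-get install poppler-utils"

# Only query language data when tesseract itself is present.
if command -v tesseract >/dev/null 2>&1; then
    # Capture the list first; older releases print it on stderr.
    langs=$(tesseract --list-langs 2>&1 || true)
    if printf '%s\n' "$langs" | has_lang fra; then
        echo "fra: installed"
    else
        echo "fra: missing, try: apt-get install tesseract-ocr-fra"
    fi
fi
```

If a tool is reported as missing, the matching `apt-get install` command from the steps above applies. The language check matches whole lines only, so the header line printed by `tesseract --list-langs` is ignored.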
 
 ## How to work with it
 
@@ -32,6 +42,29 @@ USER>set pdf = ##class(NSolov.TextExtract.PDF).%New("/full/path/to/file.pdf")
 USER>set string = pdf.Extract(0)
 ```
 
+The examples above ignore images embedded in the .pdf, which may also contain text.
+
+To extract the text and append the text found in embedded images, use:
+```
+USER>set pdf = ##class(NSolov.TextExtract.PDF).%New("/full/path/to/file.pdf")
+USER>set string = pdf.ExtractWithImages(0,"eng")
+```
+
+Another option is to save each .pdf page as an image and then extract the text from those images:
+```
+USER>set pdf = ##class(NSolov.TextExtract.PDF).%New("/full/path/to/file.pdf")
-From Interoperability you can use Business Operation `NSolov.TextExtract.BusinessOperation` with request `NSolov.TextExtract.PDFRequest` for pdf and `NSolov.TextExtract.PPTXRequest` for pptx.
+From Interoperability, you can use the Business Operation `NSolov.TextExtract.BusinessOperation` with the request `NSolov.TextExtract.PDFRequest` for PDF, `NSolov.TextExtract.PPTXRequest` for PPTX, and `NSolov.TextExtract.ImageRequest` for images.