extract figures from pdf #1119
Changes from all commits
a139ebb
09284b1
512529b
180fc3a
ea16226
2098828
9ef8099
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -33,19 +33,28 @@ | |
// or analyze page per page, filter pages | ||
pages.slice(0, 2).forEach((page, i) => { | ||
def(`PAGE_${i}`, page) | ||
}) | ||
``` | ||
|
||
## Rendering to images | ||
## Images and figures | ||
|
||
GenAIScript automatically extracts bitmap images from PDFs and stores them in the data array. You can use these images to generate prompts. The images are encoded as PNG and may be large. | |
|
||
```js | ||
const { data } = await parsers.PDF(env.files[0]) | ||
``` | ||
|
||
## Rendering pages to images | ||
Code block is missing after the new section title.
|
||
|
||
Add the `renderAsImage` option to also render each page to a PNG image (as a buffer). This buffer can be used with a vision model to perform | |
an OCR operation. | ||
|
||
```js wrap | ||
The code snippet is not properly formatted. Ensure consistent use of backticks or triple backticks for code blocks.
|
||
const { images } = await parsers.PDF(env.files[0], | ||
{ renderAsImage: true }) | ||
const { images } = await parsers.PDF(env.files[0], { renderAsImage: true }) | ||
``` | ||
|
||
You can control the quality of the rendered image using the `scale` parameter (default is 3). | ||
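
To get a feel for how `scale` affects image size, the sketch below estimates the pixel dimensions of a rendered page. It assumes that page dimensions are expressed in PDF points (1/72 inch) and that `scale` multiplies both dimensions, which is how PDF renderers commonly work; this is not confirmed for GenAIScript. `estimateRenderedSize` is a hypothetical helper, not part of the GenAIScript API.

```js
// Hypothetical helper (not part of GenAIScript): estimate the pixel size
// of a page rendered at a given scale.
// Assumption: page size is in PDF points and `scale` multiplies both sides.
function estimateRenderedSize(widthPt, heightPt, scale = 3) {
    const width = Math.ceil(widthPt * scale)
    const height = Math.ceil(heightPt * scale)
    // Uncompressed RGBA needs 4 bytes per pixel; the encoded PNG is usually smaller.
    return { width, height, rawBytes: width * height * 4 }
}

// A US Letter page is 612 x 792 points.
const letter = estimateRenderedSize(612, 792)
console.log(letter)

// Halving the scale divides the pixel count by roughly four.
const letterSmall = estimateRenderedSize(612, 792, 1.5)
console.log(letterSmall)
```

Under these assumptions, the default scale of 3 gives roughly 17 MB of raw pixels per Letter page before PNG compression, so a lower scale may be preferable when sending many pages to a vision model.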
|
||
## PDFs are messy | ||
The code block for controlling image quality is missing. It should be included under the "Rendering pages to images" section.
|
||
|
||
The PDF format was never really meant to allow for clean text extraction. The `parsers.PDF` function uses the `pdf-parse` package to extract text from PDFs. This package is not perfect and may fail to extract text from some PDFs. If you have access to the original document, it is recommended to use a more text-friendly format such as markdown or plain text. |
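
One way to cope with failed text extraction is to flag pages whose extracted text is suspiciously short, then handle those pages as rendered images instead. The sketch below assumes `pages` is an array of strings, one per page; `findSparsePages` and the `minChars` threshold are hypothetical and not part of GenAIScript.

```js
// Hypothetical helper (not part of GenAIScript): return the indices of pages
// whose extracted text is shorter than `minChars` after trimming whitespace.
// Such pages are often scanned or image-only.
function findSparsePages(pages, minChars = 40) {
    return pages
        .map((text, index) => ({ index, length: (text || "").trim().length }))
        .filter(({ length }) => length < minChars)
        .map(({ index }) => index)
}

// Sample data standing in for extracted page text.
const samplePages = [
    "Introduction\nThis report covers the quarterly results in detail.",
    "   ",
    "3",
]
const sparse = findSparsePages(samplePages)
console.log(sparse)
```

The threshold is a heuristic: a page that legitimately contains only a short title would also be flagged, so tune `minChars` to the documents at hand.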
Header changed from "Rendering to images" to "Images and figures". Consider keeping the original header for consistency.