extract figures from pdf #1119
LGTM 🚀 with a caveat. The analysis:
No functional safety/security issues or breaking changes that would require fixing in dependent codebase. One potential concern: If components expect Otherwise LGTM.
Enhanced PDF parsing with folder-based caching and image extraction.
Removed unused "salt" in hashing and replaced "-" with "_" in filenames
Added a note on using the scale parameter to improve image rendering.
@@ -36,14 +36,21 @@ pages.slice(0, 2).forEach((page, i) => {
})
```

## Rendering to images
## Images and figures
Header changed from "Rendering to images" to "Images and figures". Consider keeping the original header for consistency.
AI-generated content by pr-docs-review-commit
header_change
may be incorrect
const { data } = await parsers.PDF(env.files[0])
```

## Rendering pages to images
Code block is missing after the new section title.
AI-generated content by pr-docs-review-commit
missing_code_block
may be incorrect
const { data } = await parsers.PDF(env.files[0])
```

## Rendering pages to images

Add the `renderAsImage` option to also render each page to a PNG image (as a buffer). This buffer can be used with a vision model to perform
an OCR operation.

```js wrap
The code snippet is not properly formatted. Ensure consistent use of backticks or triple backticks for code blocks.
AI-generated content by pr-docs-review-commit
code_formatting
may be incorrect
```

You can control the quality of the rendered image using the `scale` parameter (default is 3).

## PDFs are messy
The code block for controlling image quality is missing. It should be included under the "Rendering pages to images" section.
AI-generated content by pr-docs-review-commit
missing_code_block
may be incorrect
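A minimal sketch of the example this comment asks for, in JavaScript to match the docs under review. The `renderAsImage` and `scale` options come from the diff above. The helper names `pdfImageOptions` and `renderPdfPages`, and the `images` field on the result, are assumptions made for illustration and should be checked against the actual parser API before being copied into the docs.

```js
// Sketch only. `parsers` is passed in as an argument so the helper does not
// depend on script-host globals; in a script you would pass the global `parsers`.

// Build the options object for rendering PDF pages to images.
// `scale` defaults to 3, the default stated in the docs text.
function pdfImageOptions(scale = 3) {
    if (!Number.isFinite(scale) || scale <= 0)
        throw new RangeError("scale must be a positive number")
    return { renderAsImage: true, scale }
}

// Hypothetical wrapper: parse a PDF and return the rendered page images.
// Assumption: the parser result exposes them as `images`.
async function renderPdfPages(parsers, file, scale) {
    const result = await parsers.PDF(file, pdfImageOptions(scale))
    return result?.images ?? []
}
```

A higher `scale` should yield larger, sharper page images at the cost of bigger buffers, so a lower value may be enough when the images only feed an OCR pass.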
PDF Parsing Enhancements
Utility Import Expansion: `isUint8Array` and `isUint8ClampedArray` to better handle numeric data in PDFs.
Advanced Filtering Capabilities
Image Processing Support: `parseFigure` and `parsenotated-ocr`, for rendering PDF figures as images and extracting OCR annotations directly in prompt templates.