extract figures from pdf #1119

pelikhan · 2025-02-11T06:09:33Z

PDF Parsing Enhancements

Utility Import Expansion
- Added imports for isUint8Array and isUint8ClampedArray to better handle numeric data in PDFs.

Advanced Filtering Capabilities
- Enhanced PDFTryParse with explicit filtering options to retain or remove specific pages based on criteria, improving output control.

Image Processing Support
- Introduced new functions (parseFigure and parsenotated-ocr) for rendering PDF figures as images and extracting OCR annotations directly in prompt templates.

AI-generated content by pr-describe may be incorrect

github-actions · 2025-02-11T06:23:30Z

LGTM 🚀 with a caveat.

The analysis:

- This appears to be adding documentation/comments inline in a .types file.
- It also adds supporting YAML metadata (which is common in TypeScript/ECMAScript 6 types).
- No new functionality is being added to existing exports (renderAsImage remains).
- The change introduces YAML comments before the exports, which may affect how TypeScript processes them but shouldn't break any existing type definitions.

No functional safety/security issues or breaking changes that would require fixing in dependent codebase.

One potential concern: If components expect renderAsImage to be always true and this export is now conditionally added/removed based on some flag, it could introduce unexpected behavior if not carefully managed elsewhere.

Otherwise LGTM.

AI-generated content by pr-review may be incorrect

pelikhan · 2025-02-11T14:59:18Z

add size, id of the image
introduce id of image in generate text

github-actions · 2025-02-11T16:27:18Z

docs/src/content/docs/reference/scripts/pdf.md

@@ -36,14 +36,21 @@ pages.slice(0, 2).forEach((page, i) => {
 })
 ```

-## Rendering to images
+## Images and figures


Header changed from "Rendering to images" to "Images and figures". Consider keeping the original header for consistency.

AI-generated content by pr-docs-review-commit header_change may be incorrect

github-actions · 2025-02-11T16:27:20Z

docs/src/content/docs/reference/scripts/pdf.md

+const { data } = await parsers.PDF(env.files[0])
+```
+
+## Rendering pages to images


Code block is missing after the new section title.

AI-generated content by pr-docs-review-commit missing_code_block may be incorrect

github-actions · 2025-02-11T16:28:34Z

docs/src/content/docs/reference/scripts/pdf.md

+const { data } = await parsers.PDF(env.files[0])
+```
+
+## Rendering pages to images

 Add the `renderAsImage` option to also reach each page to a PNG image (as a buffer). This buffer can be used with a vision model to perform
 an OCR operation.

 ```js wrap


The code snippet is not properly formatted. Ensure consistent use of backticks or triple backticks for code blocks.

AI-generated content by pr-docs-review-commit code_formatting may be incorrect

github-actions · 2025-02-11T16:28:35Z

docs/src/content/docs/reference/scripts/pdf.md

 ```

+You can control the quality of the rendered image using the `scale` parameter (default is 3).
+
 ## PDFs are messy


The code block for controlling image quality is missing. It should be included under the "Rendering pages to images" section.

AI-generated content by pr-docs-review-commit missing_code_block may be incorrect
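
To make the effect of `scale` concrete, here is a minimal sketch of how it might translate into image size. It assumes the renderer multiplies the page dimensions (in PDF points) by `scale`, which is how PDF viewports commonly behave. `renderedSize` is a hypothetical helper for illustration, not part of the library.

```js
// Hypothetical helper: estimates the pixel size of a rendered page.
// Assumption: rendered pixels = page size in PDF points * scale.
function renderedSize(widthPt, heightPt, scale = 3) {
    if (!(scale > 0)) throw new RangeError("scale must be a positive number")
    return {
        width: Math.round(widthPt * scale),
        height: Math.round(heightPt * scale),
    }
}

// A US Letter page is 612 x 792 points.
const atDefault = renderedSize(612, 792) // uses the documented default of 3
const atHigher = renderedSize(612, 792, 4)
console.log(atDefault, atHigher)
```

Under that assumption, a larger `scale` gives a vision model more pixels to work with, at the cost of a larger buffer per page.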

extract figures from pdf

a139ebb

pelikhan added 2 commits February 11, 2025 07:00

towards image extraction

09284b1

decode pdf figures

512529b

pelikhan added 4 commits February 11, 2025 16:15

✨ Add caching and improve PDF parsing functionality

180fc3a

Enhanced PDF parsing with folder-based caching and image extraction.

♻️ Simplify PDF hashing and filename formatting

ea16226

Removed unused "salt" in hashing and replaced "-" with "_" in filenames
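
The two commits above can be pictured with a rough sketch: a cache folder keyed by a hash of the PDF bytes (no salt), and image filenames that use `_` as the separator. All names here (`pdfCacheFolder`, `imageFilename`, the `page_N_img_M.png` pattern, the SHA-256 choice) are assumptions for illustration and are not taken from the actual change.

```js
const { createHash } = require("node:crypto")
const path = require("node:path")

// Hypothetical: derive a cache folder from the PDF content alone (unsalted),
// so the same file always maps to the same folder.
function pdfCacheFolder(cacheRoot, pdfBytes) {
    const hash = createHash("sha256").update(pdfBytes).digest("hex")
    return path.join(cacheRoot, hash)
}

// Hypothetical naming scheme: underscores instead of dashes.
function imageFilename(pageIndex, imageId) {
    return `page_${pageIndex}_img_${imageId}.png`
}

const folderA = pdfCacheFolder(".cache", Buffer.from("pdf bytes A"))
const folderB = pdfCacheFolder(".cache", Buffer.from("pdf bytes B"))
console.log(path.join(folderA, imageFilename(1, 2)))
console.log(folderA !== folderB)
```

Keying the folder on content means a re-run over an unchanged PDF can reuse previously extracted images instead of decoding them again.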

docs

2098828

✨ feat: add scale parameter to control PDF image quality

9ef8099

Added a note on using the scale parameter to improve image rendering.

github-actions bot reviewed Feb 11, 2025

pelikhan merged commit 7a3a581 into main Feb 11, 2025
15 checks passed

pelikhan deleted the pdffigures branch February 11, 2025 16:41
