extract figures from pdf #1119

Merged
merged 7 commits into from
Feb 11, 2025


pelikhan
Member

@pelikhan pelikhan commented Feb 11, 2025


PDF Parsing Enhancements

  1. Utility Import Expansion

    • Added imports for isUint8Array and isUint8ClampedArray to better handle numeric data in PDFs.
  2. Advanced Filtering Capabilities

    • Enhanced PDFTryParse with explicit filtering options to retain or remove specific pages based on criteria, improving output control.
  3. Image Processing Support

    • Introduced new functions (parseFigure and parsenotated-ocr) for rendering PDF figures as images and extracting OCR annotations directly in prompt templates.

AI-generated content by pr-describe may be incorrect


LGTM 🚀 with a caveat.

The analysis:

  1. This appears to be adding documentation/comments inline in a .types file
  2. It also adds supporting YAML metadata (which is common in TypeScript/ECMAScript 6 types)
  3. No new functionality is being added to existing exports (renderAsImage remains)
  4. The change introduces YAML comments before the exports, which may affect how TypeScript processes them but shouldn't break any existing type definitions.

No functional safety/security issues or breaking changes that would require fixing in dependent codebase.

One potential concern: If components expect renderAsImage to be always true and this export is now conditionally added/removed based on some flag, it could introduce unexpected behavior if not carefully managed elsewhere.

Otherwise LGTM.

AI-generated content by pr-review may be incorrect

@pelikhan
Member Author

  • add size, id of the image
  • introduce id of image in generate text

Enhanced PDF parsing with folder-based caching and image extraction.
Removed unused "salt" in hashing and replaced "-" with "_" in filenames
Added a note on using the scale parameter to improve image rendering.
@@ -36,14 +36,21 @@ pages.slice(0, 2).forEach((page, i) => {
})
```

## Rendering to images
## Images and figures


Header changed from "Rendering to images" to "Images and figures". Consider keeping the original header for consistency.

AI-generated content by pr-docs-review-commit header_change may be incorrect

const { data } = await parsers.PDF(env.files[0])
```

## Rendering pages to images


Code block is missing after the new section title.

AI-generated content by pr-docs-review-commit missing_code_block may be incorrect

const { data } = await parsers.PDF(env.files[0])
```

## Rendering pages to images

Add the `renderAsImage` option to also render each page to a PNG image (as a buffer). This buffer can be used with a vision model to perform
an OCR operation.

```js wrap


The code snippet is not properly formatted. Ensure consistent use of backticks or triple backticks for code blocks.

AI-generated content by pr-docs-review-commit code_formatting may be incorrect

```

You can control the quality of the rendered image using the `scale` parameter (default is 3).

## PDFs are messy


The code block for controlling image quality is missing. It should be included under the "Rendering pages to images" section.

AI-generated content by pr-docs-review-commit missing_code_block may be incorrect
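
For illustration, the kind of snippet this comment asks for might look like the sketch below. It is not taken from the PR. The option names `renderAsImage` and `scale` and the default of 3 come from the docs text in the diff; the `images` property on the result and both helper names (`pdfImageOptions`, `renderPdfPages`) are assumptions made for this example. `parsers` is passed in as an argument so the sketch can be exercised with a stub outside the script runtime.

```js
// Build the options object for parsers.PDF.
// `renderAsImage` and `scale` are named in the docs diff (default scale is 3);
// the validation here is an addition of this sketch, not documented behavior.
function pdfImageOptions(scale = 3) {
    if (!Number.isFinite(scale) || scale <= 0)
        throw new RangeError("scale must be a positive number")
    return { renderAsImage: true, scale }
}

// Render each page of a PDF to an image at the requested scale.
// ASSUMPTION: the parser result exposes the rendered pages as `images`.
// `parsers` is injected (hypothetical signature) so a stub can stand in for it.
async function renderPdfPages(parsers, file, scale) {
    const { images } = await parsers.PDF(file, pdfImageOptions(scale))
    return images ?? []
}
```

Inside a script this would presumably be called as `await renderPdfPages(parsers, env.files[0], 4)`, with a larger `scale` trading memory and time for sharper page images.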

@pelikhan pelikhan merged commit 7a3a581 into main Feb 11, 2025
15 checks passed
@pelikhan pelikhan deleted the pdffigures branch February 11, 2025 16:41