run_project.py #158

Open
lyc728 opened this issue Oct 23, 2024 · 5 comments

Comments

@lyc728

lyc728 commented Oct 23, 2024

Hello, is there a test that covers the end-to-end pipeline? Also, what is the run_project.py script used for?

@wufan-tb
Collaborator

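This is the PDF-to-Markdown script. It combines layout detection, formula detection, formula recognition, and related tasks to extract content from a PDF and convert it to Markdown. For details, see the tutorial: https://pdf-extract-kit.readthedocs.io/zh-cn/latest/project/pdf_extract.html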

@lyc728
Author

lyc728 commented Oct 24, 2024

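Is there a script that does not require assembling all four of these together? For now I don't need formula detection or formula recognition.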

@lyc728
Author

lyc728 commented Oct 25, 2024

(Screenshot attached: 企业微信截图_17298224941980) The generated Markdown concatenates all the text together, so the paragraph breaks are lost.

@wufan-tb
Collaborator

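For more accurate text merging, take a look at MinerU. Its post-processing logic is more complex than the Kit's, and the results are better.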
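
To make the paragraph problem concrete, here is a rough sketch of one simple post-processing heuristic: start a new paragraph whenever the vertical gap between consecutive text blocks is large relative to the line height. This is not MinerU's or PDF-Extract-Kit's actual logic. The `TextBlock` class, the `blocks_to_markdown` function, and the `gap_ratio` threshold are all hypothetical names introduced for illustration, and the sketch assumes a single-column page with coordinates that grow downward.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextBlock:
    """Hypothetical text region: recognized text plus its bounding box."""
    text: str
    x0: float
    y0: float  # top edge (y grows downward, an assumption of this sketch)
    x1: float
    y1: float  # bottom edge


def blocks_to_markdown(blocks: List[TextBlock], gap_ratio: float = 0.6) -> str:
    """Join blocks in reading order, breaking paragraphs on large vertical gaps.

    gap_ratio is an illustrative threshold: a gap larger than
    gap_ratio * (height of the previous block) starts a new paragraph.
    """
    # Single-column assumption: top-to-bottom, then left-to-right.
    ordered = sorted(blocks, key=lambda b: (b.y0, b.x0))
    paragraphs: List[str] = []
    current: List[str] = []
    prev = None
    for block in ordered:
        if prev is not None:
            line_height = prev.y1 - prev.y0
            gap = block.y0 - prev.y1
            if gap > gap_ratio * line_height and current:
                # Large gap: close the current paragraph.
                paragraphs.append(" ".join(current))
                current = []
        current.append(block.text.strip())
        prev = block
    if current:
        paragraphs.append(" ".join(current))
    # A blank line separates paragraphs in Markdown.
    return "\n\n".join(paragraphs)


example_blocks = [
    TextBlock("First line", 0, 0, 100, 10),
    TextBlock("continues here.", 0, 11, 100, 21),
    TextBlock("Second paragraph.", 0, 40, 100, 50),
]
markdown = blocks_to_markdown(example_blocks)
```

With the example blocks above, the first two lines are merged and the third becomes its own paragraph. Real documents (multi-column layouts, indented first lines, captions) need considerably more than this, which is why a fuller post-processing pipeline tends to give better results.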

@lyc728
Author

lyc728 commented Oct 29, 2024

Baidu has now added a layout region detection model. Is there any plan to integrate it here?
