How to ignore images when converting? #1110

Open
NoNarne opened this issue Mar 10, 2025 · 5 comments

Comments

@NoNarne

NoNarne commented Mar 10, 2025

For images in the document, markitdown will convert them into the following form:
![title](url)
Is there any way to make markitdown ignore these images and not generate any content?

@PhiFever

I use this simple tool to ignore images:

import io
import re

from markitdown import MarkItDown


class Html2MarkdownConverter:
    def __init__(self):
        self.converter = MarkItDown()

    def convert(self, html_str: str) -> str:
        # Feed the HTML to MarkItDown as an in-memory byte stream.
        return self.converter.convert_stream(
            io.BytesIO(html_str.encode("utf8")), file_extension=".html"
        ).text_content

    def convert_without_images(self, html_str: str) -> str:
        markdown_text = self.convert(html_str)
        # remove images (pattern: ![alt text](URL))
        no_images = re.sub(r"!\[.*?]\(.*?\)", "", markdown_text)

        # replace links with their text (pattern: [text](URL))
        no_links = re.sub(r"\[(.*?)]\(.*?\)", r"\1", no_images)

        # collapse runs of three or more newlines into a single blank line
        cleaned_text = re.sub(r"\n{3,}", "\n\n", no_links)

        return cleaned_text
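
For anyone who wants to drop images but keep ordinary links, here is a minimal sketch of a post-processing helper that works on the Markdown text alone. The names `strip_images`, `IMAGE_RE`, and `EMPTY_LINK_RE` are made up for illustration and are not part of markitdown. It assumes alt text contains no `]` and URLs contain no `)`, so it is a regex approximation rather than a real Markdown parser.

```python
import re

# Inline Markdown image: ![alt](url) or ![alt](url "title").
# Assumption: alt text has no "]" and the target has no ")".
IMAGE_RE = re.compile(r"!\[[^\]]*\]\([^)]*\)")

# A link left with empty text after its image was removed: [](url)
EMPTY_LINK_RE = re.compile(r"\[\s*\]\([^)]*\)")


def strip_images(markdown_text: str) -> str:
    """Remove inline images but leave normal links untouched."""
    without_images = IMAGE_RE.sub("", markdown_text)
    # Linked images such as [![badge](b.svg)](https://ci) become [](https://ci);
    # drop those empty links as well.
    without_empty_links = EMPTY_LINK_RE.sub("", without_images)
    # Collapse runs of three or more newlines into a single blank line.
    return re.sub(r"\n{3,}", "\n\n", without_empty_links)
```

It can be applied to the `text_content` string of any conversion result, so it does not depend on a particular input format.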

@NoNarne
Author

NoNarne commented Mar 11, 2025

Thanks! The "remove link" step is useful for me too.

So it seems that markitdown does not currently support ignoring images and links? I hope it will be supported soon.

@afourney
Member

Thanks. Some folks are advocating for embedding images via data-uris. Others are advocating for removing images entirely. A third option is to consider them sub-documents, and convert them into text recursively (this would potentially give you image captions and metadata). It's clear that some additional control or configuration is needed.

I want to think some about how to proceed here -- it will likely involve some decisions about how to deal with sub-documents.

In the meantime, I can probably add some conversion options to the Python interface to support this -- but it might end up deprecated once there's a better design around sub-documents. What do you think?
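
To make the three options concrete, here is a rough sketch of how they could look as a post-processing switch over already-converted Markdown. Everything in it is hypothetical: `ImageMode`, `apply_image_mode`, and the `describe` callback are illustrative names, not markitdown API. The `describe` callback only stands in for the "convert the image as a sub-document" idea, and `KEEP` simply leaves whatever the converter emitted, data URIs included.

```python
import re
from enum import Enum
from typing import Callable, Optional


class ImageMode(Enum):
    KEEP = "keep"          # leave image references (or data URIs) as emitted
    REMOVE = "remove"      # drop images entirely
    DESCRIBE = "describe"  # replace each image with text from a callback


# Captures alt text (group 1) and the URL up to the first space (group 2);
# an optional "title" after the URL is matched but ignored.
IMAGE_RE = re.compile(r"!\[([^\]]*)\]\(([^)\s]*)[^)]*\)")


def apply_image_mode(
    markdown_text: str,
    mode: ImageMode,
    describe: Optional[Callable[[str, str], str]] = None,
) -> str:
    """Post-process Markdown images according to the chosen mode."""
    if mode is ImageMode.KEEP:
        return markdown_text
    if mode is ImageMode.REMOVE:
        return IMAGE_RE.sub("", markdown_text)
    if describe is None:
        raise ValueError("ImageMode.DESCRIBE needs a describe(alt, url) callback")
    # Hand alt text and URL to the callback and splice its text in place.
    return IMAGE_RE.sub(lambda m: describe(m.group(1), m.group(2)), markdown_text)
```

A caller might pass something like `lambda alt, url: f"[image: {alt}]"` as `describe`; a real sub-document design would presumably do much more than that.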

@NoNarne
Author

NoNarne commented Mar 11, 2025

I suggest making this optional, for example:

markitdown path-to-file.pdf -o document.md --ignore-images
md = MarkItDown(ignore_images=True)
or
markitdown path-to-file.pdf -o document.md --save-images path_to_save
md = MarkItDown(save_images=path_to_save)
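
Until an option like `save_images` exists, something similar can be approximated after conversion. The sketch below assumes the Markdown already contains images embedded as base64 data URIs, which may not be the case for every input format. `save_embedded_images` and the `image_N` file naming are made up for illustration and are not markitdown features.

```python
import base64
import re
from pathlib import Path

# Inline image whose target is a base64 data URI, e.g.
# ![alt](data:image/png;base64,iVBORw0...)
DATA_URI_IMAGE_RE = re.compile(
    r"!\[([^\]]*)\]\(data:image/([a-zA-Z0-9.+-]+);base64,([A-Za-z0-9+/=]+)\)"
)


def save_embedded_images(markdown_text: str, out_dir: str) -> str:
    """Write data-URI images to out_dir and point the Markdown at the files."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    counter = 0

    def _save(match: re.Match) -> str:
        nonlocal counter
        counter += 1
        alt, subtype, payload = match.groups()
        # Use the MIME subtype as the extension, with SVG as a special case.
        extension = "svg" if subtype == "svg+xml" else subtype
        file_path = target / f"image_{counter}.{extension}"
        file_path.write_bytes(base64.b64decode(payload))
        return f"![{alt}]({file_path.as_posix()})"

    return DATA_URI_IMAGE_RE.sub(_save, markdown_text)
```

Images referenced by ordinary URLs are left alone, since saving them would require downloading.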

@afourney
Member

Yes, for sure it will be optional/configurable.
