How to ignore images when converting? #1110
Comments
I use this simple tool to ignore images:

import io
import re

from markitdown import MarkItDown


class Html2MarkdownConverter:
    def __init__(self):
        self.converter = MarkItDown()

    def convert(self, html_str: str) -> str:
        return self.converter.convert_stream(
            io.BytesIO(html_str.encode("utf8")), file_extension=".html"
        ).text_content

    def convert_without_images(self, html_str: str) -> str:
        markdown_text = self.convert(html_str)
        # remove images (pattern: ![alt](URL))
        no_images = re.sub(r"!\[.*?]\(.*?\)", "", markdown_text)
        # replace links with their text (pattern: [text](URL))
        no_links = re.sub(r"\[(.*?)]\(.*?\)", r"\1", no_images)
        # collapse runs of three or more newlines into one blank line
        cleaned_text = re.sub(r"\n{3,}", "\n\n", no_links)
        return cleaned_text
|
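The three substitutions in the comment above can be tried on plain strings without MarkItDown installed. The following is only a sketch: `strip_images_and_links` is a hypothetical helper name, and the sample inputs are made up for illustration.

```python
import re


def strip_images_and_links(markdown_text: str) -> str:
    """Apply the same three regex passes as convert_without_images."""
    # Drop image syntax entirely: ![alt](URL)
    no_images = re.sub(r"!\[.*?]\(.*?\)", "", markdown_text)
    # Keep only the visible text of links: [text](URL) -> text
    no_links = re.sub(r"\[(.*?)]\(.*?\)", r"\1", no_images)
    # Collapse runs of three or more newlines into one blank line
    return re.sub(r"\n{3,}", "\n\n", no_links)


sample = "Intro ![logo](logo.png) text [site](https://example.com)\n\n\n\nEnd"
print(repr(strip_images_and_links(sample)))

# An image wrapped in a link disappears completely, because the image
# pass leaves "[](URL)" and the link pass then replaces it with "".
print(repr(strip_images_and_links("[![badge](b.svg)](https://ci.example)")))
```

Because the patterns are non-greedy and stop at the first `)`, a URL that itself contains parentheses will leave a fragment behind, so this approach is best treated as a stopgap.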
Thanks! The "remove link" is useful for me too. So it seems that markitdown does not support ignoring images and links now? Hope it will be supported soon. |
Thanks. Some folks are advocating for embedding images via data URIs. Others are advocating for removing images entirely. A third option is to consider them sub-documents, and convert them into text recursively (this would potentially give you image captions and metadata). It's clear that some additional control or configuration is needed. I want to think a bit about how to proceed here -- it will likely involve some decisions about how to deal with sub-documents. In the meantime, I can probably add some conversion options to the Python interface to support this -- but they might end up deprecated once there's a better design around sub-documents. What do you think? |
I suggest making this optional, like:
|
Yes, for sure it will be optional/configurable. |
For images in the document, markitdown will convert them into the following form:

Is there any way to make markitdown ignore these images and not generate any content?
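Until such an option exists, one possible workaround for HTML input (not something the thread confirms, just a sketch) is to drop `<img>` tags before the document reaches MarkItDown, so no image Markdown is generated in the first place. The names `ImgStripper` and `strip_img_tags` below are hypothetical, and the code uses only the standard library's `html.parser`.

```python
from html.parser import HTMLParser


class ImgStripper(HTMLParser):
    """Re-emit HTML as parsed, except that <img> tags are dropped."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> or <img ... />
        if tag != "img":
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != "img":
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")


def strip_img_tags(html_str: str) -> str:
    parser = ImgStripper()
    parser.feed(html_str)
    parser.close()
    return "".join(parser.parts)


html_in = '<p>Hello <img src="a.png" alt="x"> world &amp; more</p>'
print(strip_img_tags(html_in))
```

The cleaned string could then be passed to whatever conversion call is already in use. This only helps for HTML sources; images embedded in other formats would still need support inside the converter itself.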