
Large inline images use a lot of memory #10075

Open · ebeigarts opened this issue Aug 9, 2024 · 8 comments
ebeigarts commented Aug 9, 2024

Here is an example HTML file (10 MB) with one embedded JPEG image (7.6 MB).

Memory usage:

  • html to md uses 2985M
  • md to docx uses 3435M
  • html to docx uses 4350M

Test examples:

pandoc --version
pandoc 3.3

pandoc +RTS -t -RTS -o test.md test.html
# <<ghc: 30556315048 bytes, 3666 GCs, 248986086/1293885312 avg/max bytes residency (13 samples), 2985M in use, 0.001 INIT (0.001 elapsed), 2.773 MUT (2.511 elapsed), 2.892 GC (3.461 elapsed) :ghc>>

pandoc +RTS -t -RTS -o test.docx test.md
# <<ghc: 105686485256 bytes, 12695 GCs, 434087025/1466902032 avg/max bytes residency (19 samples), 3435M in use, 0.002 INIT (0.002 elapsed), 10.349 MUT (10.101 elapsed), 7.732 GC (8.308 elapsed) :ghc>>

pandoc +RTS -t -RTS -o test.docx test.html
# <<ghc: 76105089872 bytes, 9099 GCs, 489199853/1886265928 avg/max bytes residency (20 samples), 4350M in use, 0.002 INIT (0.002 elapsed), 8.025 MUT (7.772 elapsed), 9.163 GC (10.023 elapsed) :ghc>>

OS: macOS 14.14.1, m3/arm
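
For anyone wanting to reproduce this without the attachment, a minimal sketch of how such a file can be generated (the file names and the base64-bytestring dependency are illustrative assumptions, not part of the report):

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Sketch only: wrap a large local JPEG as a base64 data: URI inside a
-- minimal HTML file, similar to the attached test case.
import qualified Data.ByteString as B
import qualified Data.ByteString.Base64 as B64

main :: IO ()
main = do
  img <- B.readFile "test.jpg"                          -- any large JPEG
  let dataUri = "data:image/jpeg;base64," <> B64.encode img
  B.writeFile "test.html" $
    "<html><body><img src=\"" <> dataUri <> "\"></body></html>"
```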

ebeigarts added the bug label Aug 9, 2024
jgm (Owner) commented Aug 9, 2024

This does seem to have to do with inline (base64-encoded) images specifically. I tried the same file but with a linked image, and it only used 30 MB.

jgm (Owner) commented Aug 9, 2024

Odd that even html -> md takes a lot of memory, even though the image URL should just be passed through unchanged.

jgm (Owner) commented Aug 9, 2024

If we do html -> json and then json -> md, that takes 956M for the first step and 810M for the second. So, it's neither a reader nor a writer issue exclusively. json -> html takes 1768M. But json -> json is fast.

I'd need to do some profiling to track this down further.

jgm (Owner) commented Aug 9, 2024

Profiling, first three entries for html -> md:

COST CENTRE             MODULE                           SRC                                                       %time %alloc

parseURI                Network.URI                      Network/URI.hs:301:1-26                                    17.4   21.5
escapeURI               Text.Pandoc.URI                  src/Text/Pandoc/URI.hs:(31,1)-(32,65)                      17.2   21.0
parseURIReference       Network.URI                      Network/URI.hs:308:1-44                                    16.1   21.5

jgm (Owner) commented Aug 9, 2024

parseURI gets called in both the reader and the writer, it seems.
In the reader, as part of canonicalizeUrl.
In the writer, as part of isURI (which is used to check whether an "auto-link" should be used).

It seems that parseURI may be slow and could perhaps be optimized (it's in Network.URI so not part of pandoc).

We could also think about whether we could avoid calling it.
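
For illustration only (this is not pandoc's actual code, and the helper name is hypothetical): since data: URIs from inlined images can be megabytes long, a cheap prefix check could short-circuit before isURI ever parses them.

```haskell
import Data.List (isPrefixOf)
import Network.URI (isURI)

-- Hypothetical guard, not pandoc's actual code: data: URIs from inlined
-- images can be megabytes long, so skip the full URI parse for them and
-- only call Network.URI.isURI on everything else.
shouldAutoLink :: String -> Bool
shouldAutoLink s
  | "data:" `isPrefixOf` s = False
  | otherwise              = isURI s
```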

jgm (Owner) commented Aug 9, 2024

Yeah, here's the problem, in network-uri's Network/URI.hs:

The URI parser parses multiple segments and concatenates them. (Each segment is basically a path component starting with /). But look at the segment parser:

segment :: URIParser String
segment =
    do  { ps <- many pchar
        ; return $ concat ps
        }

This parses many small strings, one for each pchar (usually just one character!) and then concatenates them. I think that allocating thousands or millions of small strings and concatenating them is causing the memory blowup.

This should be pretty easy to optimize. I'm surprised nobody has run into this before, as this is a widely used package!

For reference, here are pchar and the related parsers:

segmentNz :: URIParser String
segmentNz =
    do  { ps <- many1 pchar
        ; return $ concat ps
        }

segmentNzc :: URIParser String
segmentNzc =
    do  { ps <- many1 (uchar "@")
        ; return $ concat ps
        }

pchar :: URIParser String
pchar = uchar ":@"

-- helper function for pchar and friends
uchar :: String -> URIParser String
uchar extras =
        unreservedChar
    <|> escaped
    <|> subDelims
    <|> do { c <- oneOf extras ; return [c] }
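
Just to illustrate the optimization idea (this is a sketch, not the actual network-uri patch): consume whole runs of plain pchar characters at once, so the parser builds one String per run instead of one per character, and fall back to per-character handling only for percent-escapes.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Sketch of the idea, not the actual network-uri patch.
isPlainPChar :: Char -> Bool
isPlainPChar c =
     (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
  || (c >= '0' && c <= '9')
  || c `elem` "-._~!$&'()*+,;=:@"   -- unreserved ++ sub-delims ++ ":@"

segment' :: Parser String
segment' = concat <$> many (run <|> escapedChar)
  where
    run = many1 (satisfy isPlainPChar)   -- one String per run, not per char
    escapedChar = do
      _  <- char '%'
      d1 <- hexDigit
      d2 <- hexDigit
      return ['%', d1, d2]
```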

jgm (Owner) commented Sep 6, 2024

I've made a patch to parseURI, so you'll notice a difference once a new version of network-uri is released; but it's not going to make a HUGE difference, because that function is still fairly inefficient.

We could think about trying a different URI library, maybe uri-bytestring.
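
If we went that route, an isURI-style check could look roughly like this (a sketch assuming uri-bytestring's parseURI/strictURIParserOptions API, not something pandoc has today):

```haskell
import qualified Data.ByteString.Char8 as B8
import Data.Either (isRight)
import URI.ByteString (parseURI, strictURIParserOptions)

-- Sketch assuming uri-bytestring's API: True only if the string parses
-- as an absolute URI under the strict RFC 3986 parser. (B8.pack truncates
-- non-ASCII characters, which is acceptable for a rough check.)
isURI' :: String -> Bool
isURI' = isRight . parseURI strictURIParserOptions . B8.pack
```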

ebeigarts (Author) commented

Thanks @jgm, really nice explanation
