In Python, appending to a string is not guaranteed to be constant-time,
since strings are documented to be immutable. In some corner cases,
CPython is able to make these operations constant-time, but reaching
into ETree objects is not such a case.
As a result, parse time is quadratic in the size of the input text in
pathological cases where parsing outputs a large number of adjacent
text nodes which must be combined (e.g. HTML-escaped values). We would
expect doubling the size of the input to approximately double the
time to parse; instead, we observe quadratic behavior:
```
In [1]: import html5lib
In [2]: %timeit -n1 -r5 html5lib.parse("<" * 200000)
2.99 s ± 269 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
In [3]: %timeit -n1 -r5 html5lib.parse("<" * 400000)
6.7 s ± 242 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
In [4]: %timeit -n1 -r5 html5lib.parse("<" * 800000)
19.5 s ± 1.48 s per loop (mean ± std. dev. of 5 runs, 1 loop each)
```
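The quadratic growth can be made concrete with a rough cost model. The
sketch below assumes the worst case, in which every append allocates a
new string and copies all existing text; the function names are
hypothetical and are not part of html5lib.

```python
def chars_copied_by_concat(chunk_lengths):
    """Characters written if each append rebuilds the whole string.

    Worst-case model: assumes no in-place resize optimization applies.
    """
    total_copied = 0
    current_length = 0
    for length in chunk_lengths:
        current_length += length
        # A new string of `current_length` characters is allocated and filled.
        total_copied += current_length
    return total_copied


def chars_copied_by_join(chunk_lengths):
    """Characters written if chunks are collected and joined once."""
    return sum(chunk_lengths)


if __name__ == "__main__":
    for n in (1000, 2000, 4000):
        ones = [1] * n
        print(n, chars_copied_by_concat(ones), chars_copied_by_join(ones))
```

Under this model, doubling the number of one-character chunks roughly
quadruples the work for repeated concatenation, while the join-based
count only doubles.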
Switch from appending to the internal `str` to appending text to an
array of text chunks, since those appends can be done in constant
time. Using `bytearray` is a similar solution, but it benchmarks
slightly worse because the strings must be encoded before being
appended.
This noticeably improves parsing of text-heavy documents, and parse
time now roughly doubles when the input size doubles:
```
In [1]: import html5lib
In [2]: %timeit -n1 -r5 html5lib.parse("<" * 200000)
2.3 s ± 373 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
In [3]: %timeit -n1 -r5 html5lib.parse("<" * 400000)
3.85 s ± 29.7 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
In [4]: %timeit -n1 -r5 html5lib.parse("<" * 800000)
8.04 s ± 317 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
```
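A minimal sketch of the chunk-accumulation idea follows. It is an
illustration only: `TextBuffer` and its methods are hypothetical names,
a plain Python list stands in for the array of chunks, and the actual
change to the ETree tree builder may be structured differently.

```python
class TextBuffer:
    """Hypothetical text holder that defers string concatenation."""

    def __init__(self):
        # Chunks are kept separately so each append is a cheap list append.
        self._chunks = []

    def append(self, data):
        """Record one piece of text without copying the existing text."""
        self._chunks.append(data)

    def text(self):
        """Return the combined text, joining all chunks in one pass."""
        joined = "".join(self._chunks)
        # Collapse to a single chunk so repeated reads do not re-join.
        self._chunks = [joined] if joined else []
        return joined


if __name__ == "__main__":
    buf = TextBuffer()
    for _ in range(3):
        buf.append("&lt;")
    print(buf.text())
```

The join happens only when the text is read, so a long run of adjacent
text appends costs time proportional to the total text length rather
than to its square.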