Make parsing of text be non-quadratic. #579

Open
wants to merge 1 commit into master

Conversation


@alexmv alexmv commented Feb 27, 2024

In Python, appending strings is not guaranteed to be constant-time, since they are documented to be immutable. In some corner cases, CPython is able to make these operations constant-time, but reaching into ETree objects is not such a case.

This leads to parse times being quadratic in the size of the text in the input in pathological cases where parsing outputs a large number of adjacent text nodes which must be combined (e.g. HTML-escaped values). Specifically, we expect doubling the size of the input to result in approximately doubling the time to parse; instead, we observe quadratic behavior:

```
In [1]: import html5lib

In [2]: %timeit -n1 -r5 html5lib.parse("<" * 200000)
2.99 s ± 269 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

In [3]: %timeit -n1 -r5 html5lib.parse("<" * 400000)
6.7 s ± 242 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

In [4]: %timeit -n1 -r5 html5lib.parse("<" * 800000)
19.5 s ± 1.48 s per loop (mean ± std. dev. of 5 runs, 1 loop each)
```

Switch from appending to the internal `str` to appending text chunks to a list, since list appends are amortized constant time. Using a `bytearray` is a similar solution, but it benchmarks slightly worse because the strings must be encoded before being appended.
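
For illustration only, here is a minimal sketch of the chunk-list idea, with a hypothetical `TextBuffer` helper; the PR's real change lives in html5lib's etree treebuilder, but the principle is the same: append pieces to a list in O(1) amortized time and join them once when the combined text is needed, instead of rebuilding an immutable `str` on every append.

```
# Minimal sketch of the chunk-list technique, not the PR's actual implementation.
# Appending to a list is amortized O(1); rebuilding a str on every append is O(n).

class TextBuffer:
    """Accumulates text chunks and joins them only when the text is read."""

    def __init__(self):
        self._chunks = []

    def append(self, chunk):
        # No string copying here, just a pointer append.
        self._chunks.append(chunk)

    def __str__(self):
        # A single O(total length) join when the combined text is needed.
        return "".join(self._chunks)


# Quadratic pattern this replaces (conceptually):
#     node.text += chunk        # copies the whole accumulated string each time
#
# Linear pattern:
buf = TextBuffer()
for chunk in ["&lt;"] * 100_000:
    buf.append(chunk)
text = str(buf)
```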

This improves parsing of text documents noticeably:

```
In [1]: import html5lib

In [2]: %timeit -n1 -r5 html5lib.parse("<" * 200000)
2.3 s ± 373 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

In [3]: %timeit -n1 -r5 html5lib.parse("<" * 400000)
3.85 s ± 29.7 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

In [4]: %timeit -n1 -r5 html5lib.parse("<" * 800000)
8.04 s ± 317 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
```

Old flamegraph: (image not included)

New flamegraph: (image not included)

@andersk andersk commented Feb 28, 2024

This solution can’t work, as it’s a breaking change to the public API. Before:

```
>>> html5lib.parse("hello")[1].text
'hello'
```

After:

```
>>> html5lib.parse("hello")[1].text
<html5lib.treebuilders.etree.TextBuffer object at 0x7ff2e31268d0>
```

@lopuhin lopuhin commented Mar 10, 2025

From what I can see, there are also plenty of operations in _tokenizer.py which assume that appending a character to a string is O(1). That is often the case in CPython, but not in other implementations, where having a pure-Python parser can be especially valuable. E.g. here:

```
self.currentToken["data"][-1][1] += output
```

@andersk andersk commented Mar 10, 2025

@lopuhin That line is slow even in CPython.

In CPython, appending a character is only O(1) if the string is a local variable inside a function with no other references. It is O(n) for an object property obj.prop or an array element arr[i] (even if the object or array itself is a local variable), or for a global or nonlocal variable—in all of those cases, the string has a refcount of at least 2, which prevents it from being safely mutated in place and forces it to be copied.

```
import timeit

def linear_local(n):
    s = ""
    for i in range(n):
        s += "a"  # fast

def quadratic_object(n):
    class C: pass
    c = C()
    c.s = ""
    for i in range(n):
        c.s += "a"  # slow

def quadratic_array(n):
    a = [""]
    for i in range(n):
        a[0] += "a"  # slow

def quadratic_global(n):
    global s
    s = ""
    for i in range(n):
        s += "a"  # slow

def quadratic_nonlocal(n):
    s = ""
    def inner():
        nonlocal s
        for i in range(n):
            s += "a"  # slow
    inner()

for f in [linear_local, quadratic_object, quadratic_array, quadratic_global, quadratic_nonlocal]:
    for n in [100000, 200000, 400000, 800000]:
        print(f.__name__, n, timeit.timeit(lambda: f(n), number=1))
```

Output with CPython 3.13.2:

```
linear_local 100000 0.006017955995048396
linear_local 200000 0.013165883996407501
linear_local 400000 0.027179232012713328
linear_local 800000 0.052238386997487396
quadratic_object 100000 0.11766406099195592
quadratic_object 200000 0.5580674420052674
quadratic_object 400000 2.6726826040103333
quadratic_object 800000 12.140160495007876
quadratic_array 100000 0.12400677500409074
quadratic_array 200000 0.5755963019910268
quadratic_array 400000 2.642135899004643
quadratic_array 800000 11.990410245998646
quadratic_global 100000 0.12772354800836183
quadratic_global 200000 0.5731496340013109
quadratic_global 400000 2.738810390001163
quadratic_global 800000 12.154955972000607
quadratic_nonlocal 100000 0.1292998229910154
quadratic_nonlocal 200000 0.5955325639952207
quadratic_nonlocal 400000 2.6306100980000338
quadratic_nonlocal 800000 11.95639012400352
```

@lopuhin lopuhin commented Mar 10, 2025

Good point, thank you! Indeed I can reproduce the slowness on a particular HTML document under CPython as well, although the difference is smaller than under GraalPy.
