Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance Issue: fromHtml is 10x Slower than hast-util-from-parse5 #6

Closed
4 tasks done
zolero opened this issue Aug 21, 2024 · 21 comments
Closed
4 tasks done
Labels
🤷 no/invalid This cannot be acted upon 👎 phase/no Post cannot or will not be acted on

Comments

@zolero
Copy link

zolero commented Aug 21, 2024

Initial checklist

Affected packages and versions

hast-util-from-html

Link to runnable example

No response

Steps to reproduce

Set up a benchmark using src/parser.bench.ts.

Load html test: https://gist.github.com/zolero/6ecc0d238595b37fabb42cc86588998b

Run the benchmarks comparing hast-util-from-html and hast-util-from-parse5.
Observe the performance difference.

import path from "node:path";

import { fromHtml } from "hast-util-from-html";
import { fromParse5 } from "hast-util-from-parse5";
import { parse } from "parse5";
import { bench, describe } from "vitest";

describe("performance testing", () => {
    const html = fs.readFileSync(path.join(process.cwd(), `assets/html/real-world/bol-com.html`)).toString();
    bench("performance of HTML to HAST using hast-util-from-html", () => {
        fromHtml(html);
    });
    bench("performance of HTML to HAST using hast-util-from-parse5 + sourceCodeLocationInfo", () => {
        fromParse5(parse(html, { sourceCodeLocationInfo: true }));
    });
    bench("performance of HTML to HAST using hast-util-from-parse5 + w/o sourceCodeLocationInfo", () => {
        fromParse5(parse(html, { sourceCodeLocationInfo: false }));
    });
});

Expected behavior

I expected fromHtml to have similar or better performance compared to hast-util-from-parse5, considering both are intended to perform similar tasks.

Actual behavior

fromHtml is significantly slower than hast-util-from-parse5, which raises concerns about its efficiency in performance-critical applications.

Benchmark Results
Here are the results of my performance tests:

 ✓ src/parser.bench.ts (3) 5453ms
   ✓ performance testing (3) 5451ms
     name                                                                                       hz      min      max     mean      p75      p99     p995     p999     rme  samples   
   · performance of HTML to HAST using hast-util-from-html                                  4.5929   208.73   235.69   217.73   220.69   235.69   235.69   235.69  ±2.64%       10   slowest
   · performance of HTML to HAST using hast-util-from-parse5 + sourceCodeLocationInfo      23.1773  38.8248  49.1888  43.1457  44.3062  49.1888  49.1888  49.1888  ±4.92%       12   
   · performance of HTML to HAST using hast-util-from-parse5 + w/o sourceCodeLocationInfo  32.4982  25.4096  36.3745  30.7710  31.6351  36.3745  36.3745  36.3745  ±4.80%       17   fastest

Additional benchmark data:

 ✓ src/parser.bench.ts (3) 5510ms
   ✓ performance testing (3) 5508ms
     name                                                                                       hz      min      max     mean      p75      p99     p995     p999     rme  samples   
   · performance of HTML to HAST using hast-util-from-html                                  4.4994   211.85   247.17   222.25   221.88   247.17   247.17   247.17  ±3.59%       10   slowest
   · performance of HTML to HAST using hast-util-from-parse5 + sourceCodeLocationInfo      22.4136  41.9222  48.7456  44.6158  45.3911  48.7456  48.7456  48.7456  ±3.13%       12   
   · performance of HTML to HAST using hast-util-from-parse5 + w/o sourceCodeLocationInfo  32.9171  26.1313  35.0066  30.3794  31.1243  35.0066  35.0066  35.0066  ±3.47%       17   fastest

BENCH Summary
hast-util-from-parse5 + w/o sourceCodeLocationInfo is 1.47x faster than hast-util-from-parse5 + sourceCodeLocationInfo.
hast-util-from-parse5 + w/o sourceCodeLocationInfo is 7.32x faster than hast-util-from-html.

Affected runtime and version

v22.6.0

Affected package manager and version

1.22.22

Affected OS and version

Win 11

Build and bundle tools

Rollup

@github-actions github-actions bot added 👋 phase/new Post is being triaged automatically 🤞 phase/open Post is being triaged manually and removed 👋 phase/new Post is being triaged automatically labels Aug 21, 2024
@wooorm
Copy link
Member

wooorm commented Aug 21, 2024

Did you see the source?

https://github.com/syntax-tree/hast-util-from-html/blob/main/lib/index.js

I'm not sure what you're thinking. Like, source's pretty clear. You're probably turning some options on or off and profiling the difference?

@zolero

This comment was marked as off-topic.

@zolero
Copy link
Author

zolero commented Aug 22, 2024

Did you see the source?

https://github.com/syntax-tree/hast-util-from-html/blob/main/lib/index.js

I'm not sure what you're thinking. Like, source's pretty clear. You're probably turning some options on or off and profiling the difference?

And I'm not sure what you're thinking since everything I did is described pretty clearly to me and easy to reproduce. It's actually 10 times slower, using parse5 ^7.1.2.

 ✓ src/ast/html/hast.bench.ts (3) 5274ms
   ✓ performance testing HTML to HAST (3) 5272ms
     name                                                                                       hz      min      max     mean      p75      p99     p995     p999     rme  samples
   · performance of HTML to HAST using hast-util-from-html                                  4.3773   215.06   276.67   228.45   236.65   276.67   276.67   276.67  ±5.84%       10   slowest
   · performance of HTML to HAST using hast-util-from-parse5 + sourceCodeLocationInfo      39.9635  20.7337  30.9236  25.0229  27.1483  30.9236  30.9236  30.9236  ±5.63%       21
   · performance of HTML to HAST using hast-util-from-parse5 + w/o sourceCodeLocationInfo  45.5799  19.1052  24.9510  21.9395  23.3539  24.9510  24.9510  24.9510  ±3.41%       23   fastest
 ✓ src/ast/html/utils/sanitize.bench.ts (1) 616ms
   ✓ performance testing HAST (1) 612ms
     name                                  hz     min     max    mean     p75     p99    p995    p999     rme  samples
   · performance of HAST sanitization  679.29  1.0909  7.5912  1.4721  1.4849  3.3327  3.5710  7.5912  ±3.54%      340


 BENCH  Summary

  performance of HTML to HAST using hast-util-from-parse5 + w/o sourceCodeLocationInfo - src/ast/html/hast.bench.ts > performance testing HTML to HAST
    1.14x faster than performance of HTML to HAST using hast-util-from-parse5 + sourceCodeLocationInfo
    10.41x faster than performance of HTML to HAST using hast-util-from-html

@zolero zolero changed the title Performance Issue: fromHtml is 7x Slower than hast-util-from-parse5 Performance Issue: fromHtml is 10x Slower than hast-util-from-parse5 Aug 22, 2024
@zolero
Copy link
Author

zolero commented Aug 22, 2024

Setting file to true, which results in type a error, but returns an equal hast tree after removing only the position property from root object.

import { fromHtml } from "hast-util-from-html";
import { fromParse5 } from "hast-util-from-parse5";
import { parseFragment } from "parse5";
import { describe, expect, it } from "vitest";

describe("hast parser hast-util-from-html vs hast-util-from-parse5", () => {
    // Deep equal fromHtml and fromParse
    it("should deep equal HAST correctly", async () => {
        const html = `<div class="container"><header><h1>Welcome to My Website</h1><nav><ul><li><a href="/home">Home</a></li><li><a href="/about">About Us</a></li><li><a href="/services">Services</a></li><li><a href="/contact">Contact</a></li></ul></nav></header><main><section class="intro"><h2>About Our Company</h2><p>We are a <strong>leading</strong> company in the industry, with over <em>20 years</em> of experience.</p><p>Our mission is to provide <a href="/services#quality">top-notch services</a> to our clients.</p></section><section class="features"><h2>Our Features</h2><ul><li>Feature 1: High quality</li><li>Feature 2: Reliability</li><li>Feature 3: Affordability</li></ul></section><section class="testimonials"><h2>What Our Clients Say</h2><blockquote><p>"This company exceeded our expectations!"</p><footer>- Jane Doe, CEO of ExampleCorp</footer></blockquote><blockquote><p>"Outstanding service and support!"</p><footer>- John Smith, Manager at TechSolutions</footer></blockquote></section></main><aside><h3>Latest News</h3><article><h4>New Product Launch</h4><p>We are excited to announce the launch of our new product, <strong>ProductX</strong>. <a href="/news/productx">Read more</a></p></article><article><h4>Upcoming Webinar</h4><p>Join us for an exclusive webinar on industry trends. <a href="/webinar">Register now</a></p></article></aside><footer><p>&copy; 2024 My Website. All rights reserved.</p><p><a href="/privacy-policy">Privacy Policy</a> | <a href="/terms-of-service">Terms of Service</a></p></footer></div>`;

        const hast = fromHtml(html, { fragment: true });
        const hast1 = fromParse5(parseFragment(html, { sourceCodeLocationInfo: true }), {
            file: true
        });

        // Remove last position property for deep equal check
        delete (hast as any).position;
        delete (hast1 as any).position;

        expect(hast).toEqual(hast1);
    });
});

@wooorm
Copy link
Member

wooorm commented Aug 22, 2024

maybe it's time to move to Typescript and start having some proper type completions.

Please learn more about TypeScript.
TypeScript is already used in this project.
Type completions are already there.

everything I did is described pretty clearly to me and easy to reproduce

Why stop there? If you are profiling things, you can find out what branches take time?
Hence me asking: look at the code. There are different sets of code. One does more than the other. The first is slower than the second. That makes sense?

@zolero
Copy link
Author

zolero commented Aug 22, 2024

I wish you good luck with the management of your projects @wooorm.

@wooorm
Copy link
Member

wooorm commented Aug 22, 2024

Am I to understand that you are not at all intersted in contributing to open source? You just want to say that there might be a problem, and then expect someone to do work for you?

@zolero
Copy link
Author

zolero commented Aug 22, 2024

@wooorm your attitude demotivates me significantly to even make any contribution to your repositories, it surely motivates me to fork and do a complete recreation of it... other then that, I am done here.

@wooorm
Copy link
Member

wooorm commented Aug 22, 2024

You don‘t understand what open source is. What you describe is exactly what I give you: https://github.com/syntax-tree/hast-util-from-html/blob/main/license. What you don’t get, is free customer service.

@wooorm wooorm closed this as not planned Won't fix, can't repro, duplicate, stale Aug 22, 2024
@wooorm wooorm added the 🤷 no/invalid This cannot be acted upon label Aug 22, 2024

This comment has been minimized.

@github-actions github-actions bot added 👎 phase/no Post cannot or will not be acted on and removed 🤞 phase/open Post is being triaged manually labels Aug 22, 2024
@zolero

This comment was marked as abuse.

@zolero

This comment was marked as abuse.

@wooorm
Copy link
Member

wooorm commented Aug 22, 2024

The benchmark here was indeed benchmarking different things. The comparison would be:

import fs from 'node:fs/promises'
import {VFile} from 'vfile'
import {fromHtml} from 'hast-util-from-html'
import {fromParse5} from 'hast-util-from-parse5'
import {parse} from 'parse5'

const html = await fs.readFile('example.html', 'utf8')
const file = new VFile(html)

console.time('hast-util-from-html')
fromHtml(file)
console.timeEnd('hast-util-from-html')

console.time('parse + hast-util-from-parse5')
fromParse5(parse(html, {sourceCodeLocationInfo: true}), {file})
console.timeEnd('parse + hast-util-from-parse5')

(note the {file} being passed to fromParse5)

There’s still a tiny bit more work done in fromHtml, but this code gives similar results.

@zolero
Copy link
Author

zolero commented Aug 22, 2024

The benchmark here was indeed benchmarking different things. The comparison would be:

import fs from 'node:fs/promises'
import {VFile} from 'vfile'
import {fromHtml} from 'hast-util-from-html'
import {fromParse5} from 'hast-util-from-parse5'
import {parse} from 'parse5'

const html = await fs.readFile('example.html', 'utf8')
const file = new VFile(html)

console.time('hast-util-from-html')
fromHtml(file)
console.timeEnd('hast-util-from-html')

console.time('parse + hast-util-from-parse5')
fromParse5(parse(html, {sourceCodeLocationInfo: true}), {file})
console.timeEnd('parse + hast-util-from-parse5')

(note the {file} being passed to fromParse5)

There’s still a tiny bit more work done in fromHtml, but this code gives similar results.

VFile results in a significant performance reduction...

@zolero
Copy link
Author

zolero commented Aug 22, 2024

Setting file to true, which results in type a error, but returns an equal hast tree after removing only the position property from root object.

import { fromHtml } from "hast-util-from-html";
import { fromParse5 } from "hast-util-from-parse5";
import { parseFragment } from "parse5";
import { describe, expect, it } from "vitest";

describe("hast parser hast-util-from-html vs hast-util-from-parse5", () => {
    // Deep equal fromHtml and fromParse
    it("should deep equal HAST correctly", async () => {
        const html = `<div class="container"><header><h1>Welcome to My Website</h1><nav><ul><li><a href="/home">Home</a></li><li><a href="/about">About Us</a></li><li><a href="/services">Services</a></li><li><a href="/contact">Contact</a></li></ul></nav></header><main><section class="intro"><h2>About Our Company</h2><p>We are a <strong>leading</strong> company in the industry, with over <em>20 years</em> of experience.</p><p>Our mission is to provide <a href="/services#quality">top-notch services</a> to our clients.</p></section><section class="features"><h2>Our Features</h2><ul><li>Feature 1: High quality</li><li>Feature 2: Reliability</li><li>Feature 3: Affordability</li></ul></section><section class="testimonials"><h2>What Our Clients Say</h2><blockquote><p>"This company exceeded our expectations!"</p><footer>- Jane Doe, CEO of ExampleCorp</footer></blockquote><blockquote><p>"Outstanding service and support!"</p><footer>- John Smith, Manager at TechSolutions</footer></blockquote></section></main><aside><h3>Latest News</h3><article><h4>New Product Launch</h4><p>We are excited to announce the launch of our new product, <strong>ProductX</strong>. <a href="/news/productx">Read more</a></p></article><article><h4>Upcoming Webinar</h4><p>Join us for an exclusive webinar on industry trends. <a href="/webinar">Register now</a></p></article></aside><footer><p>&copy; 2024 My Website. All rights reserved.</p><p><a href="/privacy-policy">Privacy Policy</a> | <a href="/terms-of-service">Terms of Service</a></p></footer></div>`;

        const hast = fromHtml(html, { fragment: true });
        const hast1 = fromParse5(parseFragment(html, { sourceCodeLocationInfo: true }), {
            file: true
        });

        // Remove last position property for deep equal check
        delete (hast as any).position;
        delete (hast1 as any).position;

        expect(hast).toEqual(hast1);
    });
});

Also as I've stated here... slows it down by 10 times.

@zolero
Copy link
Author

zolero commented Aug 22, 2024

See the difference by yourself. Note: I've disabled the vFile. So I would have 10 times the performance it should have.

import fs from "node:fs/promises";

import { fromHtml } from "hast-util-from-html";
import { fromParse5 } from "hast-util-from-parse5";
import { parse } from "parse5";
import { VFile } from "vfile";

function calculateStatistics(times) {
    times.sort((a, b) => a - b);
    const sum = times.reduce((a, b) => a + b, 0);
    const mean = sum / times.length;

    const p75 = times[Math.floor(0.75 * times.length)];
    const p99 = times[Math.floor(0.99 * times.length)];
    const p995 = times[Math.floor(0.995 * times.length)];
    const p999 = times[Math.floor(0.999 * times.length)];
    const min = times[0];
    const max = times[times.length - 1];
    const rme = (Math.max(...times.map(t => Math.abs(t - mean))) / mean) * 100;

    return {
        hz: (1000 / mean).toFixed(2), // operations per second
        min: `${min.toFixed(2)}ms`,
        max: `${max.toFixed(2)}ms`,
        mean: `${mean.toFixed(2)}ms`,
        p75: `${p75.toFixed(2)}ms`,
        p99: `${p99.toFixed(2)}ms`,
        p995: `${p995.toFixed(2)}ms`,
        p999: `${p999.toFixed(2)}ms`,
        rme: `${rme.toFixed(2)}%`,
        samples: times.length
    };
}

const html = await fs.readFile("src/example.html", "utf8");
const file = new VFile(html);

let times = [];
console.time("hast-util-from-html");
for (let i = 0; i < 10; i++) {
    const start = performance.now();
    fromHtml(file);
    const end = performance.now();
    times.push(end - start);
}
console.timeEnd("hast-util-from-html");

let stats = calculateStatistics(times);
console.log("hast-util-from-html stats:", stats);

times = [];
console.time("parse + hast-util-from-parse5");
for (let i = 0; i < 10; i++) {
    const start = performance.now();
    fromParse5(parse(html, { sourceCodeLocationInfo: true }), { file: true });
    const end = performance.now();
    times.push(end - start);
}
console.timeEnd("parse + hast-util-from-parse5");

stats = calculateStatistics(times);
console.log("parse + hast-util-from-parse5 stats:", stats);

@ChristianMurphy
Copy link
Member

@zolero

The MIT license grants you the right to use the code freely. However, it does not give you the right to demand maintainer time or to engage in sarcastic or disrespectful behavior.

Performance is important, and there is a shared interest in maintaining it. I encourage you to adopt a more constructive tone in your feedback.

If you're concerned about performance issues, consider generating a flamegraph to identify what is causing the delays on your system: https://nodejs.org/en/learn/diagnostics/flame-graphs. The tool https://github.com/davidmarkclements/0x is particularly effective for this purpose.

@zolero
Copy link
Author

zolero commented Aug 22, 2024

@ChristianMurphy Hey for the record, I was trying to help to state that this vFile is causing significant performance reduction. I'm here being called someone who doesn't know typescript, my code is faulty and that I should read the code first. After that he's telling me I should read the MIT license and don't know about opensource. This @wooorm is being completely indecent with his comments and you expect me to keep my respect towards him?,

@ChristianMurphy
Copy link
Member

I understand that you're frustrated, and I appreciate your intent to help by pointing out the performance issue. However, let's address a few things clearly:

Someone who doesn't know TypeScript

You made a comment about this project not being in TypeScript, when in fact, it uses TypeScript through JSDoc syntax. The comment was dismissive and suggested a lack of familiarity with a feature of TypeScript. We all have gaps in our knowledge—what matters is how we address them. Being rude isn't necessary.

I should read the MIT license and don't know about open source

The point being made, which I support, is that while you're free to use, read, and modify the source code, that doesn't entitle you to free support or to treat others disrespectfully.

And you expect me to keep my respect towards him?

Yes, respect is non-negotiable. Open source relies on the generosity and collaboration of its community. The maintainers are dedicating their time to help, often without compensation. Keeping the discourse respectful is essential for a productive environment.

I understand your frustration, but directing that frustration at the people trying to assist you is counterproductive. Let's work together constructively.

@zolero
Copy link
Author

zolero commented Aug 22, 2024

Yes, respect is non-negotiable. Open source relies on the generosity and collaboration of its community. The maintainers are dedicating their time to help, often without compensation. Keeping the discourse respectful is essential for a productive environment.

I understand your frustration, but directing that frustration at the people trying to assist you is counterproductive. Let's work together constructively.

Can you tell this Titus as well... not the first time he's being disrespectful towards others. If he's not willing to be able receive critics he might be not suited for handling issues.

@ChristianMurphy
Copy link
Member

I hear your concerns, and I want to emphasize that maintaining a respectful tone is important for everyone involved, including Titus. It's clear that tensions are high, but we need to remember that open-source projects thrive on constructive criticism and collaboration.

Titus, as the project owner and primary contributor, has invested a significant amount of time and effort into this project. His directness can sometimes come across as harsh, but his intention is to maintain the quality and integrity of the project. That said, it's crucial for everyone—regardless of their role—to communicate respectfully.

I'll address your feedback with Titus, but I also encourage you to approach these discussions with the same respect you’re asking for. Let’s work on keeping the dialogue productive and focused on the issues at hand.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🤷 no/invalid This cannot be acted upon 👎 phase/no Post cannot or will not be acted on
Development

No branches or pull requests

3 participants