problem after converting a .har -> .warc and importing in webrecorder #1

Closed
wsdookadr opened this issue Sep 30, 2018 · 1 comment

wsdookadr commented Sep 30, 2018

Hi,

Thank you for writing har2warc. Below I describe what I tried, along with a few differences from what I expected that I can't explain.

I pulled the splash docker image published by @scrapinghub and ran it like so:

docker pull scrapinghub/splash
docker run -it -p 8050:8050 --name render_html scrapinghub/splash

Then I rendered a page using splash and exported the resulting .har (as indicated in splash's docs):

curl 'http://localhost:8050/render.har?url=https://www.digitalocean.com/community/tutorials/how-to-secure-haproxy-with-let-s-encrypt-on-centos-7&timeout=10&wait=7&response_body=1' > 1.har
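When automating this step, the target URL may itself contain `?` or `&`, which would collide with the render.har query string unless it is percent-encoded. A minimal sketch of building the same request URL in Python follows; the function name `build_render_har_url` and its defaults are illustrative assumptions, and only the endpoint and the four parameters come from the curl command above.

```python
from urllib.parse import urlencode


def build_render_har_url(target_url, splash_base="http://localhost:8050",
                         timeout=10, wait=7):
    """Return a render.har URL with the target URL percent-encoded.

    The helper name and defaults are hypothetical; the parameters mirror
    the curl command above.
    """
    query = urlencode({
        "url": target_url,       # encoded so its own ?/& do not leak into the query
        "timeout": timeout,
        "wait": wait,
        "response_body": 1,      # same flag as in the curl command
    })
    return f"{splash_base}/render.har?{query}"
```

The returned string can be passed to curl or any HTTP client in place of the hand-written URL.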

Then I converted the resulting .har to .warc:

har2warc 1.har 1.warc
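For more than a handful of files, the same command can be driven from a script. This is only a sketch that shells out to the `har2warc` command exactly as invoked above; the helper names `har2warc_command` and `convert_all` are made up for illustration, and it assumes `har2warc` is on the PATH.

```python
import subprocess
from pathlib import Path


def har2warc_command(har_path):
    """Build the CLI call for one HAR file, e.g. 1.har -> 1.warc."""
    har_path = Path(har_path)
    return ["har2warc", str(har_path), str(har_path.with_suffix(".warc"))]


def convert_all(har_dir):
    """Convert every .har file in har_dir, stopping at the first failure."""
    for har_path in sorted(Path(har_dir).glob("*.har")):
        # check=True raises CalledProcessError if har2warc exits non-zero
        subprocess.run(har2warc_command(har_path), check=True)
```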

After this I imported the 1.warc file into webrecorder.
When I viewed the page as stored in webrecorder, all styling seemed to be missing.

I understand that this does not involve only har2warc; the problem could originate in any of har2warc, splash, or webrecorder. I'm not sure which one to attribute this behaviour to.
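One way to narrow this down is to look at the HAR before conversion: if the stylesheet entries already lack a response body there, the gap predates har2warc and webrecorder. The sketch below relies only on the standard HAR layout (`log.entries[].response.content.text`); the function names are illustrative. An entry without a body is not necessarily a bug, since redirects and 204/304 responses legitimately have none.

```python
import json
from collections import Counter


def load_har(path):
    """Read a HAR file (plain JSON) from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def entries_without_body(har):
    """Return (url, mimeType) for each entry whose response has no body text."""
    missing = []
    for entry in har["log"]["entries"]:
        content = entry["response"].get("content", {})
        if not content.get("text"):
            missing.append((entry["request"]["url"], content.get("mimeType", "")))
    return missing


def summarize(missing):
    """Count body-less responses per MIME type, ignoring charset parameters."""
    return Counter(mime.split(";")[0].strip() for _, mime in missing)
```

Running `summarize(entries_without_body(load_har("1.har")))` and seeing a large `text/css` count would point at the capture step rather than at the conversion or the replay.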

The general use case is automating a large archiving operation whose result is a faithful reproduction of the original website, including websites that contain a lot of JavaScript-rendered content, which is the case for many websites nowadays.

I'd be interested in your thoughts.

Thanks,
Stefan

@wsdookadr (Author)

Me again. I was able to isolate the problem to splash, and I made the following PR: #821. With that change, the pipeline splash -> har2warc -> webrecorder is now fully functional, and all images and styling show up.
I'm going to close this issue.
