-
Notifications
You must be signed in to change notification settings - Fork 513
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG+1] Added option to disable the webkit in-memory and network caches #821
Conversation
On cursory review the code changes are clean and their purpose is clear. Some minor formatting changes to comply with PEP8, and a typo in the documentation changes, but nothing that should delay a merge. I'll +1 it and a make a few inline suggestions; if another maintainer can review and confirm also I expect we can merge it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two minor changes recommended but should not delay a merge. I think this is a valuable feature for some use-cases and adds a level of control that might be useful generally for experimenting and testing.
It would be nice, if the new functionality could be tested, to include tests for it.
…e pep8-compatible
Codecov Report
@@ Coverage Diff @@
## master #821 +/- ##
=========================================
Coverage ? 87.46%
=========================================
Files ? 41
Lines ? 5670
Branches ? 781
=========================================
Hits ? 4959
Misses ? 533
Partials ? 178
|
Hey @wsdookadr! The PR looks good, but could you please add a test case? Check e.g. here how a custom startup option is tested. I guess you'd need to define some resource in splash/splash/tests/mockserver.py Line 975 in a3f7e75
|
@kmike I've pushed another commit with tests for this. However, it seems like the problem is not being captured by the test, in other words both default options and the new
|
splash/tests/mockserver.py
Outdated
@use_chunked_encoding | ||
def render_GET(self, request): | ||
return ("""<html><head> | ||
<link rel="stylesheet" href="subresources-with-caching/style.css" /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the problem could be that related resources don't set any specific cache headers, so the browser doesn't cache them. Could you try setting Cache-Control header for these static resources?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kmike you are correct, I've pushed a new commit that sets that Cache-Control header and now it seems that at least for the image.gif resource, the entire scenario is reproducible.
Used the Cache-Control header to allow the browser to cache the image.gif that is being fetched. This now actually reproduces the behaviour of the browser which would discard the image.gif resource from the HAR if the page/in-memory cache was enabled.
splash/tests/test_request_filters.py
Outdated
# pytest -s --verbosity=6 ./splash/tests/test_request_filters.py -k test_disable_browser_caches | ||
def test_disable_browser_caches(self): | ||
# entries if run with caches enabled | ||
# in this case we should have [[content,content,content],[content,content,no content]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder why isn't it working for css files. Maybe there are some other response headers needed to make it work, or maybe browser uses ETag or something like that, so there is no "full" response, but still a response. It doesn't look good that our code sets cache headers, but then checks that the resource is not cached; test will break if resource start to be cached.
I think there are two ways to fix that:
- Investigate what's going on and make sure css is cached as well;
- remove a check for css file, check only html and gif.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you on that and I went with the 2nd approach, I have removed the css checks. I've pushed a new commit for this.
For this pull request I'm ok with having it this way.
splash/tests/mockserver.py
Outdated
return self.Image() | ||
return self | ||
|
||
class StyleSheet(Resource): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can it be removed now, as well as <link rel=stylesheet>
tag in html?
splash/tests/mockserver.py
Outdated
@@ -901,6 +901,39 @@ def render_GET(self, request): | |||
return base64.decodebytes(b'R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=') | |||
|
|||
|
|||
class SubresourcesWithCaching(Resource): | |||
""" | |||
Embedded css and image. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docstring is no longer correct, as it embeds only image
splash/tests/mockserver.py
Outdated
@use_chunked_encoding | ||
def render_GET(self, request): | ||
return ("""<html><head> | ||
</head> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can remove <head>
, to make an example a bit shorter.
Thanks @wsdookadr! 🎉 |
Hi team, |
Hi, I've been reading through the existing open issues.
This pull request is intended to address #203 , #519 , scrapy-plugins/scrapy-splash#168 , #817 , webrecorder/har2warc#1
Please review
Thanks,
Stefan
Note: To try this out, you can use this docker image:
And then load a web page and generate a .har file from it like so
If you run this multiple times, the amount of data fetched by curl should be about the same (4mb for that url). Before this pull-request, the first time, it would produce around 4mb, and the subsequent ones about ~200-300k (the .har files would be missing resources that are cached by webkit in its page cache, the first time the page was loaded).