Getting "script" content seems to truncate characters after a limit #244

Open
oliv23 opened this issue Aug 6, 2019 · 2 comments


oliv23 commented Aug 6, 2019

Hi,

First of all, thank you for Osmosis! I've been using it for a few years now (yikes) and have only run into a handful of edge-case issues since.

I'm trying to scrape the content of a <script type="text/json"> tag. It works to some degree, but when I save the output to a file, it's clear that the content is incomplete and was truncated at some point.

I've logged the length of the content in the console and apparently the portion of the script I'm able to retrieve is 45825 characters. Now the strange thing is, I just tried this again, and now the length is 45873.

I suspect there is something wrong with the parsing of big strings or a hard-coded limit somewhere? Do you have any idea what this could be?
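
One quick way to confirm the truncation independently of how the output is displayed is to try parsing the scraped text as JSON, since a string cut off mid-document will fail to parse. A minimal sketch, assuming the script tag holds a single JSON document; `checkJsonComplete` is a hypothetical helper, not part of Osmosis, and the sample data is made up:

```javascript
// Hypothetical helper (not part of Osmosis): reports the length of a scraped
// string and whether it parses as one complete JSON document.
function checkJsonComplete(text) {
	try {
		JSON.parse(text);
		return { length: text.length, complete: true, error: null };
	} catch (err) {
		return { length: text.length, complete: false, error: err.message };
	}
}

// Illustrative data only: a full document and the same document cut short.
const full = JSON.stringify({ products: [{ id: 1, name: 'dress' }] });
const chopped = full.slice(0, full.length - 10);

console.log(checkJsonComplete(full).complete);    // true
console.log(checkJsonComplete(chopped).complete); // false
```

Calling the helper on the scraped value inside the `.data()` callback would show whether the string ends mid-document, whatever its length happens to be on a given run.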

Best,
Olivier

Owner

rchipka commented Aug 7, 2019

@oliv23 glad to hear it's been working well for so long.

Could this be an issue with console.log() truncating formatted output or are you outputting the raw string into a file?
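
To rule out display truncation, the raw string can be written to disk and read back, comparing the length on both sides. A minimal sketch using only Node's built-in modules; `roundTripLength` and the temp file name are hypothetical, and the sample string merely stands in for the scraped value:

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Writes the string as-is (no JSON.stringify, no console formatting) and
// reads it back, so the two lengths can be compared directly.
function roundTripLength(text, filePath) {
	fs.writeFileSync(filePath, text, 'utf8');
	return fs.readFileSync(filePath, 'utf8').length;
}

// Illustrative stand-in: a long string, well past the ~45,000 characters
// at which the scraped output was reported to stop.
const sample = 'x'.repeat(200000);
const tmpFile = path.join(os.tmpdir(), 'category_json_raw.txt');

console.log(sample.length === roundTripLength(sample, tmpFile)); // true
```

If the lengths match for the scraped value too, the file write is not the step losing data, and the truncation happened earlier.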

I think we already set XML_PARSE_HUGE, but we might also need to set XML_PARSE_BIG_LINES.

Author

oliv23 commented Aug 8, 2019

@rchipka Thanks for the quick reply! I thought that might be the case, so I wrote the result to a file instead. Same result, unfortunately; I forgot to mention that.

Ok, there might be something there? For reference, the page I'm scraping is this:
https://www.modaoperandi.com/acler-fw19/dalisay-draped-midi-dress?color=white

If you inspect the script's content in the browser console, you will find it is complete; however, when scraping it with Osmosis, it comes back chopped. Here is the code I'm using, for reproducibility:

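var fs = require('fs');
var osmosis = require('osmosis');

osmosis
	.get('https://www.modaoperandi.com/acler-fw19/dalisay-draped-midi-dress?color=white')
	.then(function(context, data, next) {
		// No-op pass-through, kept from the original repro.
		next(context, data);
	})
	.find('#wraps-body-content')
	.set({
		// Text content of the JSON <script> tag that follows the React mount node.
		'category_json': '[data-react-class=SiteComponent] + script[type=text/json]'
	})
	.data(function(product) {
		var path = 'category_json.json';
		fs.writeFileSync(path, JSON.stringify(product.category_json, null, '\t'), 'utf8');

		// Length of the scraped string as returned by Osmosis.
		console.log(product.category_json.length);
	})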

I looked into XML_PARSE_HUGE; the number associated with it is 524288. I'm not sure whether that's a character count or an object size, but if it is the former, it seems to me like it should work: I manually copied the full content from the script and it amounts to 147617 characters in total.
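
For what it's worth, 524288 is 2^19, which matches libxml2 defining XML_PARSE_HUGE as a single bit (1 << 19) in its parser-option bitmask rather than as a character or byte limit. The sketch below only illustrates that arithmetic. The constants are defined locally to mirror libxml2's `xmlParserOption` enum, and how Osmosis actually passes options to the parser is not shown in this thread, so treat the `options` variable as an assumption:

```javascript
// Values mirror libxml2's xmlParserOption enum; defined locally for illustration.
const XML_PARSE_HUGE = 1 << 19;      // 524288
const XML_PARSE_BIG_LINES = 1 << 22; // 4194304

// Options are combined with bitwise OR and tested with bitwise AND.
const options = XML_PARSE_HUGE | XML_PARSE_BIG_LINES;
const hasFlag = (opts, flag) => (opts & flag) !== 0;

console.log(XML_PARSE_HUGE);                               // 524288
console.log(hasFlag(options, XML_PARSE_HUGE));             // true
console.log(hasFlag(options, XML_PARSE_BIG_LINES));        // true
console.log(hasFlag(XML_PARSE_HUGE, XML_PARSE_BIG_LINES)); // false
```

Read this way, 524288 says nothing about how many characters fit in a text node, so comparing it against 147617 does not confirm or rule out a size limit.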

Olivier
