Feature Request: `oa_fetch()` multithreaded? #129
Maybe things could be slightly faster, but at the end of the day it is an API service, so as you guessed, there's a hard limit to speed. From their website (emphasis mine):

OK. Makes sense. Thanks.
Oh - how many API calls are needed for a [...]? As in, if the input to [...]?

Actually, one way to speed up the process is parallelizing the conversion from JSON to data frame. This step is actually slower than requesting, per paper. Maybe we'll revisit the code for this at some point.
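The conversion step mentioned above could, in principle, be parallelized independently of the API calls. A minimal sketch of the idea, assuming the parsed JSON is a list of per-record lists (the field names and `to_df` helper here are made up for illustration, not openalexR internals):

```r
# Sketch: parallelizing a per-record list-to-data-frame conversion.
# `records` stands in for parsed JSON; real OpenAlex records are
# much more deeply nested.
library(parallel)

records <- lapply(1:100, function(i) {
  list(id = paste0("W", i), cited_by_count = i %% 7)
})

# Hypothetical converter for one record.
to_df <- function(rec) {
  data.frame(id = rec$id, cited_by_count = rec$cited_by_count)
}

# mclapply forks workers on Unix-alikes; on Windows it must run
# serially (mc.cores = 1), so we guard for that.
workers <- if (.Platform$OS.type == "windows") 1L else 2L
chunks <- mclapply(records, to_df, mc.cores = workers)
result <- do.call(rbind, chunks)
```

Whether this pays off depends on how expensive each conversion is relative to the forking overhead, which is exactly why profiling first matters.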
OK. Thanks a lot. Any improvement in speed would be great!
Just some notes to myself (since I've actually thought about it a bit too):
Before doing any optimization, I think we need to really pinpoint what the bottleneck is, probably with profvis. Currently, the output list is not that deep, and I think improving [...]

On the side of API calls, I have experienced great speed improvement with OpenAlex Premium. You may want to write to the OpenAlex team to see if you could obtain an API key for a trial period, @rkrug. Still, I agree that the conversion to data frames can be slow.

@rkrug Could you share an example snippet of how you would do snowball for, say, 50 works? There may be a way to retain the output as lists until the very last step. This example would help us better diagnose where the slowness comes from.
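On the "pinpoint the bottleneck first" point: profvis gives a nice interactive view, but base R's sampling profiler works with no extra dependencies. A sketch on a stand-in workload (`slow_build` is invented for illustration, not openalexR code):

```r
# Sketch: locating a hotspot with base R's Rprof/summaryRprof.
# The deliberately slow pattern here (growing a data frame with
# rbind in a loop) should dominate the profile.
slow_build <- function(n) {
  out <- NULL
  for (i in seq_len(n)) {
    out <- rbind(out, data.frame(x = i))  # copies `out` every pass
  }
  out
}

prof_file <- tempfile()
Rprof(prof_file)
res <- slow_build(3000)
Rprof(NULL)

# Functions ranked by total time; rbind should sit near the top.
top <- summaryRprof(prof_file)$by.total
head(top)
```

The same pattern applied around a real `oa_snowball()` call would show how run time splits between waiting on the API and post-processing.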
Nothing special, I would say - calling [...]

So with Premium, I would get faster API access and more requests? Nice. I will look into this.
@trangdata I'll move the performance stuff over to a new issue and do some more digging before I attempt anything, but just to comment on one last thing re: Line 173 in 66f0743.
I profiled it here: https://rpubs.com/yjunechoe/oa_snowball_profvis1. I'm not sure what about that implementation specifically is making it so slow for such a trivial task, but it takes up over 10% of total run time in my toy example ([...]).

Update: just ran the example again with a modified [...]
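For readers without the linked profile: the helper under discussion recursively walks the nested API output. A minimal reconstruction of that kind of recursion (my own sketch for illustration, not the package's actual `simple_rapply()` at Line 173) is a NULL-to-NA sweep over a nested list:

```r
# Sketch: recursively replace NULLs with NA in a nested list.
# Reconstruction for illustration only; openalexR's simple_rapply
# is more general (it applies an arbitrary function).
replace_nulls <- function(x) {
  if (is.list(x)) {
    lapply(x, replace_nulls)   # recurse into sublists
  } else if (is.null(x)) {
    NA                         # leaf: swap NULL for NA
  } else {
    x                          # leaf: keep as-is
  }
}

nested <- list(a = NULL, b = list(c = 1, d = NULL))
clean <- replace_nulls(nested)
```

Per-element recursion like this is cheap on shallow lists but gets called once per leaf, so on thousands of works even a "trivial task" can add up to a visible slice of the profile.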
Amazing! Thanks so much @yjunechoe. 🌻 Surprising indeed! And yes, a new issue would be great!
FYI @yjunechoe I'm revising the code and we may not need `simple_rapply` after all!
Thinking through this a little more: I'm making some significant changes in #132. We'll have to add more tests to make sure that removing [...]

Regarding speed, I think we should keep in mind that a good amount of time is spent getting a response from the API (~ elapsed time - user time?). In #132, I also added an [...]:

```r
# myids is a character vector of work ids Rainer sent me
system.time({
  ilk_snowball <- oa_snowball(
    identifier = myids[1:20],
    verbose = TRUE,
    options = list(select = c("id", "display_name", "authorships", "referenced_works")),
    mailto = "[email protected]"
  )
})
#   user  system elapsed
#  2.795   0.043   5.157

system.time({
  ilk_snowball <- oa_snowball(
    identifier = myids[1:20],
    verbose = TRUE,
    mailto = "[email protected]"
  )
})
#   user  system elapsed
#  6.110   0.161  10.113
```

Seeing this and the result from profvis, I'm not sure the JSON conversion is the bottleneck. I'm leaning toward NOT adding a new dependency on rcppsimdjson atm.

Last point: take caution with memory. It's easy to run out of memory with [...]
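The "elapsed time minus user time" heuristic above can be demonstrated offline: wall-clock (`elapsed`) time accumulates while waiting on I/O, but CPU (`user`) time does not. A toy sketch where `Sys.sleep` stands in for a network round trip:

```r
# Sketch: why "elapsed - user" approximates time spent waiting on
# the API. Sys.sleep burns wall-clock time but almost no CPU time,
# so the difference isolates the waiting.
t <- system.time({
  Sys.sleep(0.5)            # "waiting on the API"
  x <- sum(sqrt(1:1e6))     # a little actual computation
})
wait_estimate <- unname(t["elapsed"] - t["user.self"])
# wait_estimate should come out close to 0.5 seconds
```

In the timings above, roughly half of each run's elapsed time is unaccounted for by user time, which is consistent with the API round trips (not JSON conversion) being a large share of the cost.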
When running `oa_snowball()` on a large number of works, it can take quite some time. Would it be possible (I don't know the limitations of the OpenAlex API) to make this multithreaded? There could actually be more threads than cores used, as the limit is likely the bandwidth and response time of the API?
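The "more threads than cores" intuition holds for I/O-bound work, because workers spend most of their time waiting rather than computing. A sketch with simulated latency (`fake_request` is invented; a real version would have to respect OpenAlex's rate limits):

```r
# Sketch: overlapping I/O-bound "requests" with parallel workers.
# fake_request stands in for one API call; Sys.sleep simulates
# network latency. On Unix-alikes, 4 forked workers overlap
# 8 x 0.2s of waiting into ~0.4s of wall-clock time instead of
# ~1.6s serially. mclapply cannot fork on Windows, hence the guard.
library(parallel)

fake_request <- function(page) {
  Sys.sleep(0.2)            # simulated round-trip latency
  paste0("page-", page)
}

workers <- if (.Platform$OS.type == "windows") 1L else 4L
pages <- mclapply(1:8, fake_request, mc.cores = workers)
```

The same idea would apply to batching snowball requests, but server-side rate limiting (and politeness to a free API) caps how far extra workers can help.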