
Feature Request: oa_fetch() multithreaded? #129

Closed · rkrug opened this issue Jul 19, 2023 · 12 comments

Comments

@rkrug commented Jul 19, 2023

Running snowball() on a large number of works can take quite some time. Would it be possible (I don't know the limitations of the OpenAlex API) to make it multithreaded? More threads than cores could actually be used, as the limit is likely the bandwidth and response time of the API.

@yjunechoe (Collaborator)

Maybe things could be slightly faster, but at the end of the day it is an API service, so, as you guessed, there's a hard limit on speed. From their website (emphasis mine):

The API is limited to 100,000 calls per day. If you need more, simply drop us a line at [email protected]. There is a burst rate limit of 10 requests per second. So calling multiple requests at the same time could lead to errors with code 429.
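
As an aside, this burst limit is why naive multithreading would mostly produce 429s. A minimal sketch (not openalexR's internals; urls is a placeholder vector of request URLs) of what client-side throttling looks like:

# serialize a batch of requests with a delay so we stay under ~10 calls/second
throttled_get <- function(urls, max_per_sec = 10) {
  lapply(urls, function(u) {
    Sys.sleep(1 / max_per_sec)  # parallel bursts past this rate risk HTTP 429
    httr::GET(u)
  })
}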

@rkrug (Author) commented Jul 19, 2023

OK. Makes sense.

Thanks.

@rkrug closed this as completed Jul 19, 2023

@rkrug (Author) commented Jul 19, 2023

Oh - how many API calls are needed for a snowball() of around 2000 works?

@yjunechoe (Collaborator)

As in, if the input to oa_snowball() is 2000 OpenAlex IDs? To be honest, I'm not sure; I think it varies widely depending on how well cited the papers in your set are.

Actually, one way to speed up the process is parallelizing the conversion from JSON to data frame. Per paper, this step is slower than the request itself. Maybe we'll revisit the code for this at some point; the rough idea is sketched below.
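
A hypothetical sketch of that idea (records_json and one_work2df() are placeholders standing in for the per-record conversion, not openalexR functions):

library(parallel)
# parse and convert each paper's JSON to a one-row data frame on separate
# cores, then stack the results
dfs <- mclapply(
  records_json,
  function(x) one_work2df(jsonlite::fromJSON(x, simplifyVector = FALSE)),
  mc.cores = 4
)
works_df <- do.call(rbind, dfs)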

@rkrug (Author) commented Jul 19, 2023

OK. Thanks a lot. Any improvement in speed would be great!

@yjunechoe (Collaborator)

Just some notes to myself (since I've actually thought about this a bit too); rough sketches after the list:

  • We could replace the internal simple_rapply() function with {rrapply}
  • Instead of {jsonlite} and fromJSON(), we could use {RcppSimdJson}, which has much faster parse speeds
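
Hedged sketches of both ideas (assumed equivalents, not tested drop-ins; paper and resp_body are placeholders):

# 1. rrapply() can express what simple_rapply(paper, `%||%`, y = NA) does:
#    recursively replace NULL leaves of a nested list with NA
paper <- rrapply::rrapply(
  paper,
  condition = is.null,
  f = function(x) NA,
  how = "replace"
)

# 2. RcppSimdJson::fparse() as a faster alternative to jsonlite::fromJSON()
parsed <- RcppSimdJson::fparse(resp_body)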

@trangdata (Collaborator)

Before doing any optimization, I think we need to really pinpoint what the bottleneck is, probably with profvis (e.g., the snippet below). Currently, the output list is not that deep, and I think improving simple_rapply would not yield much better speed.
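
A minimal profiling run, using the same toy call that gets profiled further down this thread:

# profile a small snowball end-to-end to see where the time actually goes
profvis::profvis({
  snow <- openalexR::oa_snowball(identifier = "W2589424942")
})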

On the side of API calls, I have experienced a great speed improvement with OpenAlex Premium. You may want to write to the OpenAlex team, @rkrug, to see if you could obtain an API key for a trial period.

Still, I agree that the conversion to dataframes can be slow. @rkrug Could you share an example snippet of how you would do snowball for, say, 50 works? There may be a way to retain the output as lists until the very last step. This example would help us better diagnose where the slowness comes from.

@rkrug (Author) commented Jul 19, 2023

Nothing special, I would say: calling snowball() with around 2500 IDs.

So with Premium, I would get faster API access and more requests? Nice. I will look into this.

@yjunechoe (Collaborator) commented Jul 19, 2023

@trangdata I'll move the performance stuff over to a new issue and do some more digging before I attempt anything, but just to comment one last thing re: simple_rapply: it takes surprisingly long just in this line inside oa2df():

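# recursively replace NULL leaves with NA (x %||% NA keeps x unless x is NULL)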
paper <- simple_rapply(paper, `%||%`, y = NA)

I profiled it here - https://rpubs.com/yjunechoe/oa_snowball_profvis1. I'm not sure what about that implementation specifically is making it so slow for such a trivial task, but it takes up over 10% of total run time in my toy example (oa_snowball("W2589424942"))

[screenshot: profvis flame graph of oa_snowball("W2589424942")]

Update: Just ran the example again with a modified oa2df() where I keep the simple_rapply() version (and just let its output garbage collect) and also test rrapply() for a side-by-side comparison:

[screenshot: side-by-side profvis comparison of simple_rapply() and rrapply()]

@trangdata (Collaborator)

Amazing! Thanks so much @yjunechoe. 🌻 Surprising indeed! And yes, a new issue would be great!

@trangdata (Collaborator)

FYI @yjunechoe, I'm revising the code and we may not need simple_rapply after all!

@trangdata (Collaborator)

Thinking through this a little more:

I'm making some significant changes in #132. We'll have to add more tests to make sure that removing simple_rapply didn't break anything.

Regarding speed, I think we should keep in mind that a good amount of time is spent waiting on a response from the API (roughly, elapsed time minus user time?).

In #132, I also added an options argument to oa_snowball (similar to how you would use it in oa_fetch). This speeds up the data frame conversion a little if you can skip columns you don't need for the plot:

# myids is a character vector of work ids Rainer sent me
system.time({
  ilk_snowball <- oa_snowball(
    identifier = myids[1:20],
    verbose = TRUE,
    options = list(select = c("id", "display_name", "authorships", "referenced_works")),
    mailto = "[email protected]"
  )
})
#  user  system elapsed 
# 2.795   0.043   5.157 

system.time({
  ilk_snowball <- oa_snowball(
    identifier = myids[1:20],
    verbose = TRUE,
    mailto = "[email protected]"
  )
})
#  user  system elapsed 
# 6.110   0.161  10.113

Seeing this and the result from profvis, I'm not sure the JSON conversion is the bottleneck. I'm leaning toward NOT adding a new dependency on {RcppSimdJson} at the moment.

Last point: be careful with memory. It's easy to run out of memory with oa_snowball. So if you can chunk your work, save the output of each step, and then bring everything back together in a new session, I would try that (something like the sketch below). I think there is some caching going on behind the scenes with httr::GET that we can't capture. Related: #95 (comment)
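
One possible chunking pattern (a sketch only; the chunk size and file names are arbitrary, and merging the per-chunk nodes/edges lists at the end is left to you):

# split the IDs into chunks of ~250 and snowball each chunk separately,
# saving every result to disk so it survives a restart
chunks <- split(myids, ceiling(seq_along(myids) / 250))
for (i in seq_along(chunks)) {
  res <- oa_snowball(identifier = chunks[[i]], verbose = TRUE)
  saveRDS(res, sprintf("snowball_chunk_%02d.rds", i))
  rm(res)
  gc()  # release memory before the next chunk
}
# later, in a fresh session:
results <- lapply(list.files(pattern = "^snowball_chunk_.*\\.rds$"), readRDS)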
