Feature Request: `oa_fetch()` multithreaded? #129
Maybe things could be slightly faster, but at the end of the day it is an API service, so as you guessed, there's a hard limit to speed. From their website (emphasis mine):

OK. Makes sense. Thanks.
Oh - how many API calls are needed for a [...]? As in, if the input to [...]?

Actually, one way to speed up the process is parallelizing the conversion from JSON to data frame. This step is actually slower than requesting, per paper. Maybe we'll revisit the code for this at some point.
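The conversion step mentioned above could, in principle, be parallelized independently of the API calls. A minimal sketch of the idea, assuming the parsed JSON is a list of per-record lists (the field names and `to_df` helper here are made up for illustration, not openalexR internals):

```r
# Sketch: parallelizing a per-record list-to-data-frame conversion.
# `records` stands in for parsed JSON; real OpenAlex records are
# much more deeply nested.
library(parallel)

records <- lapply(1:100, function(i) {
  list(id = paste0("W", i), cited_by_count = i %% 7)
})

# Hypothetical converter for one record.
to_df <- function(rec) {
  data.frame(id = rec$id, cited_by_count = rec$cited_by_count)
}

# mclapply forks workers on Unix-alikes; on Windows it must run
# serially (mc.cores = 1), so we guard for that.
workers <- if (.Platform$OS.type == "windows") 1L else 2L
chunks <- mclapply(records, to_df, mc.cores = workers)
result <- do.call(rbind, chunks)
```

Whether this pays off depends on how expensive each conversion is relative to the forking overhead, which is exactly why profiling first matters.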
OK. Thanks a lot. Any improvement in speed would be great!
Just some notes to myself (since I've actually thought about it a bit too):
Before doing any optimization, I think we need to really pinpoint what the bottleneck is, probably with profvis. Currently, the output list is not that deep, and I think improving [...]

On the side of API calls, I have experienced great speed improvement with OpenAlex Premium. You may want to write to the OpenAlex team to see if you could obtain an API key for a trial period, @rkrug. Still, I agree that the conversion to data frames can be slow.

@rkrug Could you share an example snippet of how you would do snowball for, say, 50 works? There may be a way to retain the output as lists until the very last step. This example would help us better diagnose where the slowness comes from.
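On the "pinpoint the bottleneck first" point: profvis gives a nice interactive view, but base R's sampling profiler works with no extra dependencies. A sketch on a stand-in workload (`slow_build` is invented for illustration, not openalexR code):

```r
# Sketch: locating a hotspot with base R's Rprof/summaryRprof.
# The deliberately slow pattern here (growing a data frame with
# rbind in a loop) should dominate the profile.
slow_build <- function(n) {
  out <- NULL
  for (i in seq_len(n)) {
    out <- rbind(out, data.frame(x = i))  # copies `out` every pass
  }
  out
}

prof_file <- tempfile()
Rprof(prof_file)
res <- slow_build(3000)
Rprof(NULL)

# Functions ranked by total time; rbind should sit near the top.
top <- summaryRprof(prof_file)$by.total
head(top)
```

The same pattern applied around a real `oa_snowball()` call would show how run time splits between waiting on the API and post-processing.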
Nothing special, I would say - calling [...]

So with Premium, I would get faster API access and more requests? Nice. I will look into this.
@trangdata I'll move the performance stuff over to a new issue and do some more digging before I attempt anything, but just to comment on one last thing re: Line 173 in 66f0743.
I profiled it here: https://rpubs.com/yjunechoe/oa_snowball_profvis1. I'm not sure what about that implementation specifically is making it so slow for such a trivial task, but it takes up over 10% of total run time in my toy example ([...]).

Update: just ran the example again with a modified [...]
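For readers without the linked profile: the helper under discussion recursively walks the nested API output. A minimal reconstruction of that kind of recursion (my own sketch for illustration, not the package's actual `simple_rapply()` at Line 173) is a NULL-to-NA sweep over a nested list:

```r
# Sketch: recursively replace NULLs with NA in a nested list.
# Reconstruction for illustration only; openalexR's simple_rapply
# is more general (it applies an arbitrary function).
replace_nulls <- function(x) {
  if (is.list(x)) {
    lapply(x, replace_nulls)   # recurse into sublists
  } else if (is.null(x)) {
    NA                         # leaf: swap NULL for NA
  } else {
    x                          # leaf: keep as-is
  }
}

nested <- list(a = NULL, b = list(c = 1, d = NULL))
clean <- replace_nulls(nested)
```

Per-element recursion like this is cheap on shallow lists but gets called once per leaf, so on thousands of works even a "trivial task" can add up to a visible slice of the profile.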
Amazing! Thanks so much @yjunechoe. 🌻 Surprising indeed! And yes, a new issue would be great!
FYI @yjunechoe I'm revising the code and we may not need `simple_rapply` after all!
Thinking through this a little more: I'm making some significant changes in #132. We'll have to add more tests to make sure that removing [...]

Regarding speed, I think we should keep in mind that a good amount of time is spent getting a response from the API (~ elapsed time - user time?). In #132, I also added an [...]:

```r
# myids is a character vector of work ids Rainer sent me
system.time({
  ilk_snowball <- oa_snowball(
    identifier = myids[1:20],
    verbose = TRUE,
    options = list(select = c("id", "display_name", "authorships", "referenced_works")),
    mailto = "[email protected]"
  )
})
#   user  system elapsed
#  2.795   0.043   5.157

system.time({
  ilk_snowball <- oa_snowball(
    identifier = myids[1:20],
    verbose = TRUE,
    mailto = "[email protected]"
  )
})
#   user  system elapsed
#  6.110   0.161  10.113
```

Seeing this and the result from profvis, I'm not sure the JSON conversion is the bottleneck. I'm leaning toward NOT adding a new dependency on rcppsimdjson atm.

Last point: take caution with memory. It's easy to run out of memory with [...]
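The "elapsed time minus user time" heuristic above can be demonstrated offline: wall-clock (`elapsed`) time accumulates while waiting on I/O, but CPU (`user`) time does not. A toy sketch where `Sys.sleep` stands in for a network round trip:

```r
# Sketch: why "elapsed - user" approximates time spent waiting on
# the API. Sys.sleep burns wall-clock time but almost no CPU time,
# so the difference isolates the waiting.
t <- system.time({
  Sys.sleep(0.5)            # "waiting on the API"
  x <- sum(sqrt(1:1e6))     # a little actual computation
})
wait_estimate <- unname(t["elapsed"] - t["user.self"])
# wait_estimate should come out close to 0.5 seconds
```

In the timings above, roughly half of each run's elapsed time is unaccounted for by user time, which is consistent with the API round trips (not JSON conversion) being a large share of the cost.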
When running `oa_snowball()` on a large number of works, it can take quite some time. Would it be possible (I don't know the limitations of the OpenAlex API) to make this multithreaded? There could actually be more threads than cores used, as the limit is likely the bandwidth and response time of the API?
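The "more threads than cores" intuition holds for I/O-bound work, because workers spend most of their time waiting rather than computing. A sketch with simulated latency (`fake_request` is invented; a real version would have to respect OpenAlex's rate limits):

```r
# Sketch: overlapping I/O-bound "requests" with parallel workers.
# fake_request stands in for one API call; Sys.sleep simulates
# network latency. On Unix-alikes, 4 forked workers overlap
# 8 x 0.2s of waiting into ~0.4s of wall-clock time instead of
# ~1.6s serially. mclapply cannot fork on Windows, hence the guard.
library(parallel)

fake_request <- function(page) {
  Sys.sleep(0.2)            # simulated round-trip latency
  paste0("page-", page)
}

workers <- if (.Platform$OS.type == "windows") 1L else 4L
pages <- mclapply(1:8, fake_request, mc.cores = workers)
```

The same idea would apply to batching snowball requests, but server-side rate limiting (and politeness to a free API) caps how far extra workers can help.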