oa_snowball returns Error in if (is.na(so_info)) NA else so_info[[1]] when snowballing large number of cites #95

Open
TimothyElder opened this issue Apr 14, 2023 · 5 comments

@TimothyElder

TimothyElder commented Apr 14, 2023

When I run oa_snowball on all the works that cite one highly cited article, a large number of works are returned and the script takes a long time to run. After returning about 100,000 works, the script fails with this error:

Error in if (is.na(so_info)) NA else so_info[[1]] : 
  argument is of length zero
Calls: oa_snowball -> do.call -> <Anonymous> -> oa2df -> works2df

Looking at the source code, I can't quite make sense of why this error is returned, and I can't think of a more efficient way to return all the works. Here is how I do it now:

library(openalexR)
library(dplyr)

# Returns the citing and cited entities from a focal set of entities
snowball_docs <- oa_snowball(
  identifier = "W2147016542",
  verbose = TRUE,
  is_retracted = FALSE
)

edges <- as.data.frame(snowball_docs$edges)

nodes <- as.data.frame(snowball_docs$nodes)

# Works that cite the focal study
citing_works <- edges %>%
    filter(from != "W2147016542") %>%
    select(from) %>%
    as.vector()

# Node attributes for works that cite focal article
citing_works_df <- nodes %>%
    filter(id %in% citing_works$from)

# return all the articles that cite the citing works
second_docs <- oa_snowball(
  identifier = citing_works_df$id,
  verbose = TRUE,
  cited_by_filter = list(cited_by_count = c(">1000", "<30000"))
)
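
A note on the error message itself: in R, `if` raises "argument is of length zero" whenever its condition has length zero, and that is what `is.na()` returns for `NULL` or an empty list. The sketch below reproduces only that mechanism. It assumes `so_info` ends up empty for at least one returned work, which this thread does not establish, and `safe_first` is a hypothetical helper for illustration, not part of openalexR.

```r
# Assumption: NULL stands in for a missing field; the real value of so_info
# inside works2df for the failing record is not known here.
so_info <- NULL

# is.na(NULL) is logical(0), so `if` has nothing to test and throws an error.
msg <- tryCatch(
  if (is.na(so_info)) NA else so_info[[1]],
  error = function(e) conditionMessage(e)
)
print(msg)  # "argument is of length zero"

# Hypothetical defensive variant: check the length before testing for NA.
safe_first <- function(x) {
  if (length(x) == 0 || all(is.na(x))) NA else x[[1]]
}

safe_first(NULL)            # NA
safe_first(list("a", "b"))  # "a"
```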
@trangdata
Collaborator

Hi @TimothyElder thanks so much for reporting this. 🌱

It is expected that the script takes a while to run because oa_snowball retrieves all works that cite and are cited by the focal work. When your set of focal works exceeds 5,000, this can take a very long time, especially if some of these focal works have a lot of citations.

In this particular case, I think you ran out of memory in R. The following query, which finds works that are cited by a subset of your focal works, somehow ends up using over 7 GiB of memory in the session. I'll keep investigating, but I suggest breaking your identifier = citing_works_df$id into small chunks, e.g., identifier = citing_works_df$id[1:10], writing out the results, then combining them all later. Let me know if that works.

library(openalexR)
ids <- c("W2119340816", "W4285719527", "W4211208840", "W4247785462", "W4211082352", "W2163351155", "W4210992155", "W2103903454", "W2549006299", "W2026141069", "W3126128017", "W2145354914", "W2086643853", "W2085458222", "W1988902102", "W2095880617", "W2139524347", "W2109565845", "W2112652525", "W2137200701", "W2144330816", "W2552595635", "W1996710573", "W2051676630", "W1875373156", "W2761242421", "W2134119471", "W2125665528", "W2111285159", "W2147485520", "W2121875608", "W2561425398", "W4238604577", "W2336794604", "W2106742300", "W4211174791", "W1958810146", "W2184779060", "W2169678441", "W1942996532", "W2165335733", "W2098206882", "W2073051214", "W2168197710", "W2017506719", "W2469676206", "W2094905849", "W2099192919", "W2124028388", "W4248178819")
oa_fetch(
  cited_by = ids,
  verbose = TRUE,
  cited_by_count = c(">1000", "<30000")
)
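
The chunking suggestion above could look roughly like the following. This is a sketch only: `snowball_in_chunks`, `combine_snowballs`, `chunk_size`, `out_dir`, and `fetch_fun` are hypothetical names introduced for illustration and are not part of openalexR. Combining with `rbind` assumes every chunk returns `nodes` and `edges` data frames with the same columns, and deduplicating nodes by `id` simply keeps the first copy of each work.

```r
# Run a snowball-style fetch over small groups of identifiers and write each
# result to disk, so that only one chunk is held in memory at a time.
snowball_in_chunks <- function(ids, chunk_size = 10, out_dir = tempdir(),
                               fetch_fun = openalexR::oa_snowball, ...) {
  # Consecutive groups of at most `chunk_size` identifiers
  chunks <- split(ids, ceiling(seq_along(ids) / chunk_size))
  files <- character(length(chunks))
  for (i in seq_along(chunks)) {
    res <- fetch_fun(identifier = chunks[[i]], ...)
    files[i] <- file.path(out_dir, sprintf("snowball_chunk_%03d.rds", i))
    saveRDS(res, files[i])  # write out the chunk result
    rm(res)                 # drop it before fetching the next chunk
  }
  files
}

# Read the saved chunks back and stack them into one nodes/edges pair.
combine_snowballs <- function(files) {
  results <- lapply(files, readRDS)
  nodes <- do.call(rbind, lapply(results, function(x) x$nodes))
  edges <- do.call(rbind, lapply(results, function(x) x$edges))
  list(
    nodes = nodes[!duplicated(nodes$id), , drop = FALSE],  # one row per work
    edges = unique(edges)
  )
}
```

With the objects from the original post, a call might be `files <- snowball_in_chunks(citing_works_df$id, chunk_size = 10, verbose = TRUE)` followed by `combine_snowballs(files)`; extra arguments such as `cited_by_filter` are passed through to `fetch_fun` unchanged.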

@TimothyElder
Author

@trangdata Thanks!

I kept working on this and found a similar solution to the one that you outlined. Instead of breaking it up by feeding in chunks of the data, I used the citing_filter AND the cited_by_filter (previously I was just using the latter). Like this:

oa_snowball(
  identifier = ids,
  verbose = TRUE,
  citing_filter = list(cited_by_count = c(">500", "<30000")),
  cited_by_filter = list(cited_by_count = c(">500", "<30000")),
  is_retracted = FALSE
)

I then plan on doing a few more passes with oa_snowball, using different combinations of those filters, to get the complete network of snowball docs. Not super efficient, but workable.

@trangdata
Collaborator

@TimothyElder one thing I noticed just now: did you mean for the conditions to be AND instead of OR? I.e., are you looking for works that are cited by over 500 other articles but fewer than 30,000? If so, I think the following is what you want (I know, the syntax is strange, with list elements that share the same name). We need to add more documentation regarding these operators: https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/filter-entity-lists

oa_snowball(
  identifier = ids,
  verbose = TRUE,
  citing_filter = list(cited_by_count = ">500", cited_by_count = "<30000"),
  cited_by_filter = list(cited_by_count = ">500", cited_by_count = "<30000"),
  is_retracted = FALSE
)
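
To make the difference between the two list forms concrete, here is a small sketch of how they could translate into an OpenAlex filter string. Per the OpenAlex filter documentation linked above, values joined by `|` inside one filter are ORed, while separate filters joined by commas are ANDed. `build_filter_string` is a hypothetical helper written for illustration; it assumes a vector value is collapsed with `|` and is not openalexR's actual code.

```r
# Hypothetical helper: turn a named list of filters into a filter string.
# A vector value becomes alternatives separated by "|" (OR);
# separate list elements are joined with "," (AND).
build_filter_string <- function(filters) {
  parts <- vapply(
    seq_along(filters),
    function(i) {
      paste0(names(filters)[i], ":", paste(filters[[i]], collapse = "|"))
    },
    character(1)
  )
  paste(parts, collapse = ",")
}

# One element holding a vector: the two conditions are ORed
or_filter <- build_filter_string(list(cited_by_count = c(">500", "<30000")))
print(or_filter)   # "cited_by_count:>500|<30000"

# Two elements with the same name: the two conditions are ANDed
and_filter <- build_filter_string(
  list(cited_by_count = ">500", cited_by_count = "<30000")
)
print(and_filter)  # "cited_by_count:>500,cited_by_count:<30000"
```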

@TimothyElder
Author

@trangdata Yes!! Very good catch. This was my way of chunking out the process, though now that I look at the code I wrote, I see that there are some mistakes. But yes, I meant for the snowball to return only articles that are cited by more than 500 other articles but fewer than 30,000. I also added the citing_filter with the same parameters, but I now realize that my use doesn't make any sense, if I understand the filter correctly.

For my own clarification the cited_by_filter is used to control articles that are returned by the number of times that article is cited by other articles. The citing_filter, on the other hand, is used to control the number of articles the focal article cites (that is the length of its own bibliography). If that is the case, then my original use of the citing_filter doesn't really make sense, since there are likely no articles that cite more than 500 but fewer than 30,000 other articles.

Sorry in advance if that is confusing; even the OpenAlex documentation is a little confusing about the logical expressions.

@trangdata
Collaborator

trangdata commented Apr 20, 2023

> For my own clarification the cited_by_filter is used to control articles that are returned by the number of times that article is cited by other articles. The citing_filter, on the other hand, is used to control the number of articles the focal article cites (that is the length of its own bibliography).

Yes, you're correct @TimothyElder. 💯 Also, we're open to new PRs if you would like to improve the documentation! 🙏🏽 🪴
