Number of edges from oa_snowball doesn't match cited_by_count #178

Open
TimothyElder opened this issue Oct 16, 2023 · 2 comments

TimothyElder commented Oct 16, 2023

When oa_snowball returns all the works that cite, and are cited by, a focal article, the number of rows in the returned edges data frame that point to the focal article should match the focal article's cited_by_count, but they usually do not.

I am trying to figure out whether this is an artifact in the data or whether I have misunderstood precisely what oa_snowball returns.

Here is an example of where I think the edges should match but they don't:

library(openalexR)

focal_article <- oa_fetch(
  entity = "works",
  doi = c("10.1056/nejmoa1000678"),
  verbose = TRUE
)

snowball_docs <- oa_snowball(
  identifier = focal_article$id,
  verbose = TRUE
)

edges <- snowball_docs$edges

id <- stringr::str_replace(focal_article$id, "https://openalex.org/", "")

# keep only edges that point to the focal work (i.e. works citing it),
# which drops the works the focal work itself cites
edges <- edges |>
  dplyr::filter(to == id)

# Raise error if edges don't match focal_article citation count
tryCatch({
  if(nrow(edges) != focal_article$cited_by_count) {
      stop("Number of edges doesn't match cited by count of focal article!")
  }
}, error = function(e) {
  cat("An error occurred: ", e$message, "\n")
})

yjunechoe (Collaborator) commented Oct 16, 2023

Thanks for the report! Definitely not ideal, but it's likely the same situation also reported in #115

For what it's worth, in my experience with snowball searching it's pretty common to have mismatches between the cited-by number in a paper's records vs. its discoverable connections (even within the same database). You can just think of the number of articles returned by backward-searching in oa_snowball() as the absolute lower bound estimate of cited-by (which doesn't account for older papers, retracted papers, inaccessible papers, etc.).
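
To make the lower-bound reading concrete, here is a minimal sketch in base R on made-up data. The work IDs and the count of 5 are hypothetical, not real OpenAlex output; the only assumption carried over from the report above is that the edges data frame has `from` and `to` columns holding short work IDs.

```r
# Toy edge list in the shape of the `edges` data frame discussed above.
# All IDs are invented for illustration.
edges <- data.frame(
  from = c("W1", "W2", "W3", "W0"),
  to   = c("W0", "W0", "W0", "W9"),
  stringsAsFactors = FALSE
)

focal_id <- "W0"       # hypothetical focal work
cited_by_count <- 5L   # hypothetical count taken from the focal work's record

# Edges pointing at the focal work are the citing works actually retrieved
n_retrieved <- sum(edges$to == focal_id)

# Works counted in the record but absent from the snowball result
n_missing <- cited_by_count - n_retrieved

cat("retrieved:", n_retrieved,
    "reported:", cited_by_count,
    "missing:", n_missing, "\n")
```

On real data, a check like `n_retrieved <= cited_by_count` is probably a more realistic expectation than strict equality.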

TimothyElder (Author) commented

Fortunately, the discrepancy doesn't seem to be sizable: fewer than 10 or so per article, by my estimate. Thanks!

TimothyElder changed the title from "Number of edges from snowball_docs doesn't match cited_by_count" to "Number of edges from oa_snowball doesn't match cited_by_count" on Oct 16, 2023