Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IQSS/11242 ExternalIdentifier fix #11246

Open
wants to merge 14 commits into
base: develop
Choose a base branch
from

Conversation

qqmyers
Copy link
Member

@qqmyers qqmyers commented Feb 12, 2025

What this PR does / why we need it: This PR fixes a bug in the ExternalIdentifier class that caused ORCIDs in URL form to not be recognized, breaking part of the external vocab functionality w.r.t. ORCID.

Which issue(s) this PR closes:

Special notes for your reviewer: As the fix is generic, I reverted changes in the ROR recognition to avoid having two separate versions. Also added tests for formatting or ORCID and ROR.
Also - I noted that it looks like some tests in DatasetFieldValueValidatorTest duplicate those in ExternalIdentifier. If a reviewer confirms that, I'd be happy to remove them (and the isValidAuthorIdentifier() method that is only used in those tests).

Suggestions on how to test this: Verify that publishing a dataset with an ORCID in numeric form or URL form as an author ID, the ORCID appears in the DataCite XML. Verify that either form of ORCID/ROR show up as valid links in the page display (without external vocab scripts turned on).

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?: included.

Additional documentation:

@qqmyers qqmyers added the Size: 3 A percentage of a sprint. 2.1 hours. label Feb 12, 2025
@@ -4,7 +4,7 @@
import java.util.regex.Pattern;

public enum ExternalIdentifier {
ORCID("ORCID", "https://orcid.org/%s", "^\\d{4}-\\d{4}-\\d{4}-(\\d{4}|\\d{3}X)$"),
ORCID("ORCID", "https://orcid.org/%s", "^(https:\\/\\/orcid.org\\/)?\\d{4}-\\d{4}-\\d{4}-(\\d{4}|\\d{3}X)$"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of b9d0146 we say the following:

When entering author identifiers, select the type from the dropdown (e.g. "ORCID") and under "Identifier" enter just the unique identifier (e.g. "0000-0002-1825-0097") rather than the full URL (e.g. "https://orcid.org/0000-0002-1825-0097").

Should this be re-written?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code will support both forms, but if you aren't using the external vocab scripts, the short form may still be preferable? That's why I didn't change it, but happy to change it if we want to allow a choice for manual entry. (AFAIK, there is no validation for input so people can still enter whatever they want.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey all. @pdurbin pinged me about this.

The short form, or the non-URL form, is preferable for ORCIDs that users type in or paste in the Author Identifier field, because that's what makes the link clickable, right? So the same will hold true for RORs that users type in or paste in the Author Identifier field, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The earlier fix changed ROR so the short form would be a link. This PR makes it so that either form of ORCID or ROR should result in a valid link.

Copy link
Contributor

@jggautier jggautier Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay. Also, when you wrote "happy to change it if we want to allow a choice for manual entry", I thought we can already allow a choice for manual entry of identifiers. That's how it's set up on Demo Dataverse and planned for Harvard Dataverse.

Or did you mean when installations configure an external vocab script to allow for manual entry, that is that users can enter the identifier in the Identifier field that appears when they enter their own name instead of selecting one from the suggestions?

About the user guide text being added in b9d0146, it's written to apply to all identifiers, so I'm testing on Demo Dataverse to review how the other identifiers that users enter in the Author Identifier field are displayed, especially when the identifier has a URL form.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created that GitHub issue in the ext-vocab repo. It's at gdcc/dataverse-external-vocab-support#33. There are some details I couldn't be sure of because:

  • I'm testing on Demo Dataverse and the change won't happen on Demo Dataverse's Author fields until v6.6 is released and applied to Demo Dataverse, like how RORs are displayed on the dataset page
  • the Author fields on Demo Dataverse are behaving in ways I didn't expect, like how the dataset page displays an ORCID logo even when the user enters the ORCID by using the Author Identifier Type and Author Identifier fields
  • I'm not sure about what changes were made so that the dataset page displays IDs as links when users enter IDs in Author fields where the cvoc script isn't applied. This is what I'm trying to clarify in my last comment.

I can help update that GitHub issue as I understand more.

Copy link
Member Author

@qqmyers qqmyers Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the identifier types have both a short and URI form (FWIW DAI was not handled before - I've added it). Entering either as the author identifier will cause it to display as a link in the metadata display (except DAI which is not a URL). The DataCite XML generator adds the identifier, in URI form, regardless of the identifier type, so yes they are all included regardless of the short or URI input form (including DAI).

Copy link
Contributor

@jggautier jggautier Feb 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay thanks @qqmyers!

So @pdurbin, so should we rewrite what's written in the User Guides, at b9d0146, about how to enter identifiers in the Author Identifier field?

Or maybe that addition can be removed, since identifiers that Dataverse can display as link will be displayed as links when the cvoc script is used and when it's not used, and when users enter the short forms of the IDs and when they enter the URL forms. If all of the ways that we expect users to enter identifiers in that field will be supported - displayed as links and included in the expected metadata exports - then we don't need to guidance about how to enter identifiers in that field, right?

This PR is "Ready for review" but should it be moved back to "In progress" until we agree on what to do about that addition to the User Guides?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something should be written somewhere about this confusing situation.

What if you run Dataverse for several years with out the cvoc scripts? Your database has lots of "just identifiers" values in it?

Then, you turn on cvoc scripts and run for another couple years. Now your database has a mix of (older) "just identifiers" and newer "full URL" values?

Do I have this right? Is this what we want? Shouldn't the database be consistent?

Apologies if I'm misunderstanding.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For installation managers and curators who expect at least some users to have to enter or paste identifiers into the Author Identifier field, I think they would be happy to know about this change. I imagine that they'll no longer have to tell their users to enter the identifiers in a particular way, since there will be better support for both ways that we expect users to enter identifiers when they have to enter them.

So maybe a release note could help those installation managers and curators know this? If we agree, I'd be happy to write one or help write one.

What if you run Dataverse for several years with out the cvoc scripts? Your database has lots of "just identifiers" values in it?

Then, you turn on cvoc scripts and run for another couple years. Now your database has a mix of (older) "just identifiers" and newer "full URL" values?

Yeah, I agree. And this has been the situation with Dataverse installations for a while, even those whose managers haven't used the cvoc scripts. Their databases have combinations of "just identifiers" (or the "short forms"), "full URL" values, and the wrong Author Identifier Type chosen or no Author Identifier Type chosen.

Do I have this right? Is this what we want? Shouldn't the database be consistent?

From my understanding, the inconsistency in the database has complicated our ability to ensure that when most of the identifiers are displayed on the dataset page, the identifiers are links that users can click on to learn more about the person or organization. And its complicated our ability to include identifiers in metadata exports in order to improve discovery and re-use of those identifiers in Dataverse repositories and in other systems, like DataCite.

From @qqmyers's work in this PR, I've gotten the impression that this complication, having a mix of "just identifiers" and "full URL" values in the database, has been manageable, at least for the Author Identifier field, even if the inconsistency isn't ideal.

In other cases, like licenses and terms of use metadata, I think we've decided that inconsistencies in the database aren't manageable, so we've added field validation, "multiple license support", and included ways for repositories to make existing values in their databases more consistent, like the "Flyway scripts".

I think of field validation and ways to change database values in bulk as ways to address inconsistencies in other parts of the database.

The cvoc scripts, and the ability to associate ORCIDs with more user accounts and prefill Author Identifier fields with those ORCIDs, also address this by making it less likely that users will need to enter anything in the Author Identifier field, and giving installation managers more control over what gets in the database. Those things are new, and won't work all of the time, like in cases where users need and are allowed to use other types of identifiers.

Maybe before considering other ways of ensuring that the identifiers are displayed as links and help make deposits more findable within and outside of Dataverse repositories, it might be helpful to learn how well what's been done so far is working.

@qqmyers qqmyers added this to the 6.6 milestone Feb 12, 2025
@qqmyers qqmyers added the GDCC:ORCID Priority for the ORCID Global Participation Fund Grant label Feb 18, 2025
@coveralls
Copy link

Coverage Status

coverage: 22.745% (+0.004%) from 22.741%
when pulling 5062ece on QualitativeDataRepository:IQSS/11242-ExtVocabFix
into 4b3afe3 on IQSS:develop.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
GDCC:ORCID Priority for the ORCID Global Participation Fund Grant Size: 3 A percentage of a sprint. 2.1 hours.
Projects
Status: Ready for Review ⏩
Development

Successfully merging this pull request may close these issues.

Dataverse External Author Vocabulary error when sending ORCID values to DataCite
4 participants