
Globus support: download optimizations #11125

Open
wants to merge 19 commits into develop

Conversation


@landreev landreev commented Dec 30, 2024

What this PR does / why we need it:

The PR improves the reliability of the Globus download framework, primarily by extending the new task-monitoring system first implemented for uploads in #10781. There are a few other improvements. For example, it fixes a somewhat exotic bug where Globus downloads weren't counted, but only for multi-file downloads and only when a guestbook popup was enabled.
The improvements were implemented based on experience with the NESE data storage integration with Dataverse at IQSS. Most of them are already in production use there, deployed as an experimental beta build.

Which issue(s) this PR closes:

Special notes for your reviewer:

The single most important improvement is that, similarly to what was implemented in #10781 for uploads, ongoing download tasks can now be monitored asynchronously, with their state saved in the database. This makes the management of the temporary access rules more robust, guaranteeing that Dataverse will register the completion of each task even if the server was restarted in the meantime.

Assorted other fixes and improvements were added. For example, it is now possible for a user to have simultaneous downloads running on files from the same dataset (when the task state is saved in the database, it is easy, upon completion of a task, to check whether any other active tasks are using the same access rule and, if so, avoid deleting it). I also fixed something misguided I did in #10781 when I first implemented saving the task state in the database: I misunderstood how the client tokens worked and tried to save the token for each task, thinking that it needed to be reused throughout the life of the task. In reality, the same token can be used for multiple tasks on the same endpoint; also, for a long-running task the saved token has a good chance of expiring, and I didn't have a provision for that. Now the monitoring service simply caches the access tokens for each endpoint that it manages and refreshes them as needed. General error handling, logging, and state checking have been improved.
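Roughly, the per-endpoint token caching described above could look like the sketch below. This is an illustrative outline only (the class and method names are invented here, not the actual Dataverse code), assuming the token request is an OAuth client-credentials call to Globus Auth:

import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only - not the actual Dataverse implementation.
public class EndpointTokenCache {

    // A cached access token plus the time we expect it to expire.
    private record CachedToken(String accessToken, Instant expiresAt) {}

    private final Map<String, CachedToken> tokensByEndpoint = new ConcurrentHashMap<>();

    // Returns a cached token for the endpoint, requesting a fresh one if the
    // cached token is missing, expired, or a forced refresh was requested
    // (e.g., after a Globus call failed with a stale-token error).
    public String getToken(String endpointId, boolean forceRefresh) {
        CachedToken cached = tokensByEndpoint.get(endpointId);
        if (forceRefresh || cached == null || Instant.now().isAfter(cached.expiresAt())) {
            cached = requestNewToken(endpointId);
            tokensByEndpoint.put(endpointId, cached);
        }
        return cached.accessToken();
    }

    // Placeholder for the actual token request; here we fabricate a token
    // valid for ~45 minutes just to keep the sketch self-contained.
    private CachedToken requestNewToken(String endpointId) {
        return new CachedToken("token-for-" + endpointId, Instant.now().plusSeconds(45 * 60));
    }
}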

Suggestions on how to test this:

dataverse-internal has an active Globus configuration tied to a fully-functioning remote storage volume at NESE, identical to our prod. volume there. There are existing datasets with multiple large-ish (100MB+) Globus-stored files (for example: https://dataverse-internal.iq.harvard.edu/dataset.xhtml?persistentId=doi:10.70122/FK2/QZQPQE); more data can be uploaded for testing as needed. QA would involve verifying that such files can be downloaded, with an emphasis on repeating downloads from the same dataset, and perhaps parallel downloads of different files from the same dataset at the same time. For the end user, everything should work as described in the Globus download instructions (written for the prod. users, but they fully apply to the test configuration on internal).

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

@landreev landreev added this to the 6.6 milestone Feb 12, 2025
@landreev landreev self-assigned this Feb 12, 2025
@landreev landreev marked this pull request as ready for review February 25, 2025 16:55
@landreev
Contributor Author

I un-drafted the PR; waiting for a Jenkins test, and will put the PR on the board if it passes.

@landreev landreev removed their assignment Feb 25, 2025
@stevenwinship stevenwinship self-assigned this Feb 25, 2025
@landreev landreev requested a review from qqmyers February 25, 2025 18:44
and fileMetadata.dataFile.released
and not fileMetadata.dataFile.restricted
and not dataFileServiceBean.isActivelyEmbargoed(fileMetadata)
and DatasetPage.isShowQueryButton(fileMetadata.dataFile.id)
Member

isShowQueryButton handles all of these (and the retention period) except for tabularData, which is at least partially handled by registering for specific mime types. (Is the concern that it shows when ingest fails? Does the 'Ask the Data' tool fail if ingest fails, or can it do something with just the file?)
Whatever the correct logic is, perhaps any changes can just go in isShowQueryButton()?

Contributor Author

@landreev landreev Feb 25, 2025

Funny that you noticed this... I put together these logic mods in response to something you pointed out to me a few releases back - namely, that the rendering logic for this button introduced an ungodly number of datafileservice.find() calls (it's a case of 7x JSF renderings); the whole thing was added to a release at the last minute, AND after I had gone to some trouble reducing the number of database lookups on the page. The above did not completely fix it, but at least reduced the lookups by narrowing when the method is called to tabular, released, etc. files (the tool is only supposed to run on public tabular files, etc. - that logic is inside the method; unfortunately, it's only applied there after the db lookups). So I added these band-aid lines to the custom changes I deploy in prod at IQSS... and forgot about it/never made a real PR. These changes made it back into this branch kind of by accident, but when I noticed, I figured this would be a good opportunity to check them in.

Ideally, that method should not be called on the db id of the file in the first place; the page has access to the actual entities. The main reason I never bothered with a cleaner fix was that it felt wrong to invest any work into JSF, since it was allegedly going away any second now.

Was a longer explanation than the issue is worth, for sure...

Contributor Author

(if I have a sec, I'll go ahead and rewrite the DatasetPage.isShowQueryButton() method to be less of a hog, now that I had to remember the whole thing in vivid detail)

Member

Hmm - is that because we use the file id and then look it up in that method? Could we make it use the DataFile as a param, avoid the lookups, and keep it all in the Java code?
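For illustration, the suggested change might look something like the sketch below, assuming the page already holds the FileMetadata/DataFile entities it renders (this is a hedged outline, not the actual method):

// Hypothetical sketch: pass the FileMetadata the page already has, instead of
// the file's database id, so the method never needs an extra find() lookup.
public boolean isShowQueryButton(FileMetadata fileMetadata) {
    DataFile dataFile = fileMetadata.getDataFile();
    return dataFile.isTabularData()
            && dataFile.isReleased()
            && !dataFile.isRestricted()
            && !dataFileServiceBean.isActivelyEmbargoed(fileMetadata);
}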

// we will make an extra attempt to refresh the token and try again
// in the event of an exception indicating the token is stale

String globusClientToken = getClientTokenForStorageDriver(t.getDataset(), false);
Member

Not a big deal, but this section is common to uploads and downloads.
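One way the common "refresh the token and retry once" pattern could be factored into a shared helper is sketched below, with invented names (GlobusCall, GlobusTokenExpiredException) and assuming the boolean argument of getClientTokenForStorageDriver() requests a forced refresh:

// Illustrative only - the functional interface and exception type are invented.
@FunctionalInterface
interface GlobusCall<T> {
    T call(String clientToken) throws GlobusTokenExpiredException;
}

<T> T withTokenRetry(Dataset dataset, GlobusCall<T> action) throws GlobusTokenExpiredException {
    String token = getClientTokenForStorageDriver(dataset, false);
    try {
        return action.call(token);
    } catch (GlobusTokenExpiredException e) {
        // The cached token looks stale; force a refresh and try exactly once more.
        token = getClientTokenForStorageDriver(dataset, true);
        return action.call(token);
    }
}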

// would mean "DO SKIP writing the guestbookResponse", and "false" means
// "DO write ..."
if(isGlobusTransfer) {
// Note that *singe-file* Globus transfers are NOT handled here.
Contributor

typo: should be single-file

}
retries++;
try {
Thread.sleep(3000);
Contributor

You should only sleep if taskState == null and there are retries left. Otherwise this delays the loop an extra 3 seconds every time for no reason.

Contributor Author

Correct, will fix.
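A minimal sketch of the corrected polling loop (illustrative names: lookupTaskState(), MAX_RETRIES, and the GlobusTaskState type stand in for whatever the actual code uses), sleeping only when the state wasn't found and another attempt is coming:

GlobusTaskState taskState = null;
int retries = 0;
while (taskState == null && retries < MAX_RETRIES) {
    taskState = lookupTaskState(taskId);   // illustrative lookup call
    if (taskState == null && retries < MAX_RETRIES - 1) {
        // Only wait when we are actually going to try again.
        try {
            Thread.sleep(3000);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            break;
        }
    }
    retries++;
}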

rulesCache.invalidate(ruleId);
}
} else {
// we'll proceed anyway, under the assumption that we will make
// another attempt to look it up later
Contributor

I'm not crazy about code blocks with no actual code, but OK. Maybe add a logger.fine saying the rule was not found?

Contributor Author

Yeah, I had some logging/debugging lines in that block during development. Will remove.
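Something along the lines of the suggestion, assuming a logger and the ruleId are in scope in that block (illustrative):

} else {
    // Rule not found (yet); log at fine level and proceed, assuming a later
    // attempt will look it up again.
    logger.fine("Globus access rule " + ruleId + " not found; will retry the lookup later.");
}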

retries++;
try {
Thread.sleep(3000);
} catch (InterruptedException ie) {
Contributor

Same as above. This is an extra 3-second delay after the taskState is found.

// for some other reason):
String ruleId = getRuleId(endpoint, principal, "rw");
if (ruleId != null) {
requestPermStatus = 201;
Member

Should this get logged? Something went wrong with a prior removal (which doesn't have to stop this request, but it shouldn't be ignored, as it's a bug/problem if perms aren't going away).

// if other tasks are still using it). If that's the case, we'll
// confirm that the rule does exist and assume that it's ok to
// proceed with the download.
if (FeatureFlags.GLOBUS_USE_EXPERIMENTAL_ASYNC_FRAMEWORK.enabled()) {
Member

So, unlike the upload case, this is still failing if Dataverse failed to delete it after the last download? It's only allowing multiple parallel downloads. (So there's no reason to log here, as this is valid rather than evidence of some problem.)

Contributor Author

With this change it is NOT failing if the rule is already there (whether because Dataverse failed to delete it after a past download, or because it was created for a download that's still running) and the new monitoring mode is in use.

It is these extra lines above that actually make it possible to have multiple downloads from the same dataset. Prior to this PR, Dataverse would try to request an access rule, get back a 409, and pass it to the app, which in turn would tell the user that "download could not be started". Now the code intercepts the 409, verifies that the "r" permission for this user/this folder does exist, then returns 201, and the app proceeds to initiate the download. On completion of the task, the monitoring service code checks whether there are still any other tasks in the database-stored queue using the same rule, and only deletes it if there are none.
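A hedged sketch of the flow described above; the helper names (other than getRuleId, which appears in the diff) are invented for illustration:

// When the "create access rule" request comes back with 409, confirm that an
// equivalent read rule already exists before treating the request as a success.
if (requestPermStatus == 409) {
    String existingRuleId = getRuleId(endpoint, principal, "r");
    if (existingRuleId != null) {
        // The rule is already in place - either a still-running download is
        // using it, or a past download failed to clean it up. Reuse it.
        requestPermStatus = 201;
    }
}

// ... and later, when the monitoring service sees a download task complete,
// only delete the rule if no other active task in the database queue uses it:
if (!anyOtherActiveTasksUsingRule(ruleId, completedTask)) {
    deleteAccessRule(endpoint, ruleId);
}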

Member

Got it. So: a) why the check for the feature flag for downloads and not uploads? And b) can you distinguish between parallel downloads and a prior failure (would the ruleId be in the cache for multiple downloads but not for a failed delete)? As with the upload case, I'm worried that problems that leave a permission in place will get ignored, because they now don't stop a new request and there's no logging.

Contributor Author

@landreev landreev Feb 25, 2025

a) For downloads: if the old-style continuous task monitoring is used, the code falls back to the old behavior, under the assumption that there may be another active task using the rule, in which case it would be impossible to prevent the one that finishes first from deleting the rule and killing the other task(s).
I'm not doing that for uploads, under the assumption that it is impossible to start more than one - since the dataset gets locked. So, if we found an existing rule, it must be from a finished or failed task.
I agree that this needs to be logged; and I should probably add a note to the guide somewhere to keep an eye out for these warnings.
b) I haven't been able to come up with a way to distinguish between a stale vs. still-in-use rule without the database record. I don't think we can rely on the cache for this. We do remove the rule from the cache when we start transfers; that could be modified somehow, but then we are back to relying on the server staying up (we now have real-life cases of downloads, and occasionally uploads, that take days); plus the issue of multiple servers at IQSS and possibly other places.

BTW, do you know/remember why the rules Dataverse creates have no expiration dates? Making them expire/disappear eventually on their own would help, I think.

c) I'm realizing we need a dedicated admin API call for clearing out stale access rules. I've been doing that outside of Dataverse, but it would be trivial to add. Again, I'll only be able to tell for sure whether a rule is stale with the db tracking on. But then this is something I'll likely feel comfortable dropping the "experimental" label from and making the default behavior in 6.7.
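Purely as an illustration of the admin call mentioned in (c), one possible shape for it (the path, method, and service call below are all hypothetical, not existing Dataverse API):

// Hypothetical admin endpoint - none of these names exist in Dataverse yet.
@DELETE
@Path("/admin/globus/staleRules/{storageDriver}")
public Response clearStaleAccessRules(@PathParam("storageDriver") String storageDriver) {
    // With database task tracking enabled, a rule is "stale" if no task in the
    // queue still references it.
    List<String> deleted = globusService.deleteRulesWithNoActiveTasks(storageDriver);
    return ok("Deleted " + deleted.size() + " stale access rule(s).");
}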

Member

re: expiration - I don't know why we don't use them; perhaps no one saw that option? Or it's hard to predict how long you'll need? Seems like some conservative passive expiration would make sense, though.

Member

@qqmyers qqmyers left a comment

Looks good - I suggested a couple of minor changes, but I think the overall logic in the updates makes sense.

@landreev
Contributor Author

I just killed the last Jenkins build (no. 14) because it was triggered by a commit of a typo fix in a comment.


📦 Pushed preview images as

ghcr.io/gdcc/dataverse:11057-globus-downloads
ghcr.io/gdcc/configbaker:11057-globus-downloads

🚢 See on GHCR. Use by referencing the full name as printed above; mind the registry name.

Labels
None yet
Projects
Status: In Review 🔎
Development

Successfully merging this pull request may close these issues.

Globus integration: improve handling of downloads
3 participants