
Globus support: download optimizations #11125

Open
wants to merge 19 commits into develop

Conversation


@landreev landreev commented Dec 30, 2024

What this PR does / why we need it:

The PR improves the reliability of the Globus download framework, primarily by extending the new task-monitoring system first implemented for uploads in #10781. There are a few other improvements. For example, it fixes a somewhat exotic bug where Globus downloads weren't counted, but only for multi-file downloads and only when a guestbook popup was enabled.
The improvements were implemented based on experience with the NESE data storage integration with Dataverse at IQSS. Most of them are already in production use there, deployed as an experimental beta build.

Which issue(s) this PR closes:

Special notes for your reviewer:

The single most important improvement is that, similarly to what was implemented in #10781 for uploads, ongoing download tasks can now be monitored asynchronously, with their state saved in the database. This makes the management of the temporary access rules more robust, guaranteeing that Dataverse will register the completion of each task even if the server was restarted in the meantime.

Assorted other fixes and improvements were added. For example, it is now possible for a user to have simultaneous downloads running on files from the same dataset (when the task state is saved in the database, it is easy, upon completion of a task, to check whether any other active tasks are using the same access rule and, if so, avoid deleting it). I also fixed something misguided I did in #10781 when I first implemented saving the task state in the database: I misunderstood how the client tokens worked and tried to save the token for each task, thinking that it needed to be reused throughout the life of the task. In reality, the same token can be used for multiple tasks on the same endpoint; also, for a long-running task the saved token has a good chance of expiring, and I didn't have a provision for that. Now the monitoring service simply caches the access tokens for each endpoint that it manages and refreshes them as needed. General error handling, logging, and state checking have been improved.
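Roughly, the per-endpoint token caching described above could look like the sketch below. This is an illustrative outline only (the class and method names are invented here, not the actual Dataverse code), assuming the token request is an OAuth client-credentials call to Globus Auth:

import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only - not the actual Dataverse implementation.
public class EndpointTokenCache {

    // A cached access token plus the time we expect it to expire.
    private record CachedToken(String accessToken, Instant expiresAt) {}

    private final Map<String, CachedToken> tokensByEndpoint = new ConcurrentHashMap<>();

    // Returns a cached token for the endpoint, requesting a fresh one if the
    // cached token is missing, expired, or a forced refresh was requested
    // (e.g., after a Globus call failed with a stale-token error).
    public String getToken(String endpointId, boolean forceRefresh) {
        CachedToken cached = tokensByEndpoint.get(endpointId);
        if (forceRefresh || cached == null || Instant.now().isAfter(cached.expiresAt())) {
            cached = requestNewToken(endpointId);
            tokensByEndpoint.put(endpointId, cached);
        }
        return cached.accessToken();
    }

    // Placeholder for the actual token request; here we fabricate a token
    // valid for ~45 minutes just to keep the sketch self-contained.
    private CachedToken requestNewToken(String endpointId) {
        return new CachedToken("token-for-" + endpointId, Instant.now().plusSeconds(45 * 60));
    }
}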

Suggestions on how to test this:

dataverse-internal has an active Globus configuration tied to a fully-functioning remote storage volume at NESE, identical to our prod. volume there. There are existing datasets with multiple large-ish (100MB+) Globus-stored files (for example: https://dataverse-internal.iq.harvard.edu/dataset.xhtml?persistentId=doi:10.70122/FK2/QZQPQE); more data can be uploaded for testing as needed. QA would involve verifying that such files can be downloaded, with an emphasis on repeating downloads from the same dataset, and perhaps parallel downloads of different files from the same dataset at the same time. For the end user, everything should work as described in the Globus download instructions (written for the prod. users, but they fully apply to the test configuration on internal).

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

@landreev landreev added this to the 6.6 milestone Feb 12, 2025
@landreev landreev self-assigned this Feb 12, 2025
@landreev landreev marked this pull request as ready for review February 25, 2025 16:55
@landreev
Contributor Author

I un-drafted the PR; waiting for a Jenkins test, and will put the PR on the board if it passes.

@landreev landreev removed their assignment Feb 25, 2025
@stevenwinship stevenwinship self-assigned this Feb 25, 2025
@landreev landreev requested a review from qqmyers February 25, 2025 18:44
and fileMetadata.dataFile.released
and not fileMetadata.dataFile.restricted
and not dataFileServiceBean.isActivelyEmbargoed(fileMetadata)
and DatasetPage.isShowQueryButton(fileMetadata.dataFile.id)
Member

isShowQueryButton handles all of these (and the retention period) except for tabularData, which is at least partially handled by registering for specific mime types. (Is the concern that it shows when ingest fails? Does the 'Ask the Data' tool fail if ingest fails, or can it do something with just the file?)
Whatever the correct logic is, perhaps any changes can just go in isShowQueryButton()?

Contributor Author

@landreev landreev Feb 25, 2025

Funny that you noticed this... I put together these logic mods in response to something you pointed out to me a few releases back - namely, that the rendering logic for this button introduced an ungodly number of datafileservice.find() calls (it's a case of 7x JSF renderings); the whole thing was added to a release at the last minute, AND after I had gone to some trouble reducing the number of database lookups on the page. The above did not completely fix it, but at least reduced the lookups by narrowing when the method is called to tabular, released, etc. files (the tool is only supposed to run on public tabular files, etc. - that logic is inside the method; unfortunately, it's only applied there after the db lookups). So I added these band-aid lines to the custom changes I deploy in prod at IQSS... and forgot about it/never made a real PR. These changes made it back into this branch kind of by accident, but when I noticed, I figured this would be a good opportunity to check them in.

Ideally, that method should not be called on the db id of the file in the first place; the page has access to the actual entities. The main reason I never bothered with a cleaner fix was that it felt wrong to invest any work into JSF, since it was allegedly going away any second now.

Was a longer explanation than the issue is worth, for sure...

Contributor Author

(if I have a sec, I'll go ahead and rewrite the DatasetPage.isShowQueryButton() method to be less of a hog, now that I had to remember the whole thing in vivid detail)

Member

Hmm - is that because we use the file id and then look it up in that method? Could we make it use the DataFile as a param, avoid the lookups, and keep it all in the Java code?
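For illustration, the suggested change might look something like the sketch below, assuming the page already holds the FileMetadata/DataFile entities it renders (this is a hedged outline, not the actual method):

// Hypothetical sketch: pass the FileMetadata the page already has, instead of
// the file's database id, so the method never needs an extra find() lookup.
public boolean isShowQueryButton(FileMetadata fileMetadata) {
    DataFile dataFile = fileMetadata.getDataFile();
    return dataFile.isTabularData()
            && dataFile.isReleased()
            && !dataFile.isRestricted()
            && !dataFileServiceBean.isActivelyEmbargoed(fileMetadata);
}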

// we will make an extra attempt to refresh the token and try again
// in the event of an exception indicating the token is stale

String globusClientToken = getClientTokenForStorageDriver(t.getDataset(), false);
Member

Not a big deal, but this section is common to uploads and downloads.
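One way the common "refresh the token and retry once" pattern could be factored into a shared helper is sketched below, with invented names (GlobusCall, GlobusTokenExpiredException) and assuming the boolean argument of getClientTokenForStorageDriver() requests a forced refresh:

// Illustrative only - the functional interface and exception type are invented.
@FunctionalInterface
interface GlobusCall<T> {
    T call(String clientToken) throws GlobusTokenExpiredException;
}

<T> T withTokenRetry(Dataset dataset, GlobusCall<T> action) throws GlobusTokenExpiredException {
    String token = getClientTokenForStorageDriver(dataset, false);
    try {
        return action.call(token);
    } catch (GlobusTokenExpiredException e) {
        // The cached token looks stale; force a refresh and try exactly once more.
        token = getClientTokenForStorageDriver(dataset, true);
        return action.call(token);
    }
}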

// would mean "DO SKIP writing the guestbookResponse", and "false" means
// "DO write ..."
if(isGlobusTransfer) {
// Note that *singe-file* Globus transfers are NOT handled here.
Contributor

typo: should be single-file

}
retries++;
try {
Thread.sleep(3000);
Contributor

You should only sleep if taskState == null and there are retries left. Otherwise this delays the loop an extra 3 seconds every time for no reason.

Contributor Author

Correct, will fix.
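A minimal sketch of the corrected polling loop (illustrative names: lookupTaskState(), MAX_RETRIES, and the GlobusTaskState type stand in for whatever the actual code uses), sleeping only when the state wasn't found and another attempt is coming:

GlobusTaskState taskState = null;
int retries = 0;
while (taskState == null && retries < MAX_RETRIES) {
    taskState = lookupTaskState(taskId);   // illustrative lookup call
    if (taskState == null && retries < MAX_RETRIES - 1) {
        // Only wait when we are actually going to try again.
        try {
            Thread.sleep(3000);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            break;
        }
    }
    retries++;
}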

rulesCache.invalidate(ruleId);
}
} else {
// we'll proceed anyway, under the assumption that we will make
// another attempt to look it up later
Contributor

I'm not crazy about code blocks with no actual code, but OK. Maybe add a logger.fine saying the rule was not found?

Contributor Author

Yeah, I had some logging/debugging lines in that block during development. Will remove.
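Something along the lines of the suggestion, assuming a logger and the ruleId are in scope in that block (illustrative):

} else {
    // Rule not found (yet); log at fine level and proceed, assuming a later
    // attempt will look it up again.
    logger.fine("Globus access rule " + ruleId + " not found; will retry the lookup later.");
}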

retries++;
try {
Thread.sleep(3000);
} catch (InterruptedException ie) {
Contributor

Same as above. This is an extra 3-second delay after the taskState is found.

// for some other reason):
String ruleId = getRuleId(endpoint, principal, "rw");
if (ruleId != null) {
requestPermStatus = 201;
Member

Should this get logged? Something went wrong with a prior removal (which doesn't have to stop this request, but it shouldn't be ignored, as it's a bug/problem if perms aren't going away).

// if other tasks are still using it). If that's the case, we'll
// confirm that the rule does exist and assume that it's ok to
// proceed with the download.
if (FeatureFlags.GLOBUS_USE_EXPERIMENTAL_ASYNC_FRAMEWORK.enabled()) {
Member

So, unlike the upload case, this is still failing if Dataverse failed to delete it after the last download? It's only allowing multiple parallel downloads. (So there's no reason to log here, as this is valid rather than evidence of some problem.)

Contributor Author

With this change it is NOT failing if the rule is already there (whether because Dataverse failed to delete it after a past download, or because it was created for a download that's still running) and the new monitoring mode is in use.

It is these extra lines above that actually make it possible to have multiple downloads from the same dataset. Prior to this PR, Dataverse would try to request an access rule, get back a 409, and pass it to the app, which in turn would tell the user that "download could not be started". Now the code intercepts the 409, verifies that the "r" permission for this user/this folder does exist, then returns 201, and the app proceeds to initiate the download. On completion of the task, the monitoring service code checks whether there are still any other tasks in the database-stored queue using the same rule, and only deletes it if there are none.
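A hedged sketch of the flow described above; the helper names (other than getRuleId, which appears in the diff) are invented for illustration:

// When the "create access rule" request comes back with 409, confirm that an
// equivalent read rule already exists before treating the request as a success.
if (requestPermStatus == 409) {
    String existingRuleId = getRuleId(endpoint, principal, "r");
    if (existingRuleId != null) {
        // The rule is already in place - either a still-running download is
        // using it, or a past download failed to clean it up. Reuse it.
        requestPermStatus = 201;
    }
}

// ... and later, when the monitoring service sees a download task complete,
// only delete the rule if no other active task in the database queue uses it:
if (!anyOtherActiveTasksUsingRule(ruleId, completedTask)) {
    deleteAccessRule(endpoint, ruleId);
}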

Member

Got it. So: a) why the check for the feature flag for downloads and not uploads? And b) can you distinguish between parallel downloads and a prior failure (would the ruleId be in the cache for multiple downloads but not for a failed delete)? As with the upload case, I'm worried that problems that leave a permission in place will get ignored, because they now don't stop a new request and there's no logging.

Contributor Author

@landreev landreev Feb 25, 2025

a) For downloads: if the old-style continuous task monitoring is used, the code falls back to the old behavior, under the assumption that there may be another active task using the rule, in which case it would be impossible to prevent the one that finishes first from deleting the rule and killing the other task(s).
I'm not doing that for uploads, under the assumption that it is impossible to start more than one - since the dataset gets locked. So, if we found an existing rule, it must be from a finished or failed task.
I agree that this needs to be logged; and I should probably add a note to the guide somewhere to keep an eye out for these warnings.
b) I haven't been able to come up with a way to distinguish between a stale vs. still-in-use rule without the database record. I don't think we can rely on the cache for this. We do remove the rule from the cache when we start transfers; that could be modified somehow, but then we are back to relying on the server staying up (we now have real-life cases of downloads, and occasionally uploads, that take days); plus the issue of multiple servers at IQSS and possibly other places.

BTW, do you know/remember why the rules Dataverse creates have no expiration dates? Making them expire/disappear eventually on their own would help, I think.

c) I'm realizing we need a dedicated admin API call for clearing out stale access rules. I've been doing that outside of Dataverse, but it would be trivial to add. Again, I'll only be able to tell for sure whether a rule is stale with the db tracking on. But then this is something I'll likely feel comfortable dropping the "experimental" label from and making the default behavior in 6.7.
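Purely as an illustration of the admin call mentioned in (c), one possible shape for it (the path, method, and service call below are all hypothetical, not existing Dataverse API):

// Hypothetical admin endpoint - none of these names exist in Dataverse yet.
@DELETE
@Path("/admin/globus/staleRules/{storageDriver}")
public Response clearStaleAccessRules(@PathParam("storageDriver") String storageDriver) {
    // With database task tracking enabled, a rule is "stale" if no task in the
    // queue still references it.
    List<String> deleted = globusService.deleteRulesWithNoActiveTasks(storageDriver);
    return ok("Deleted " + deleted.size() + " stale access rule(s).");
}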

Member

re: expiration - I don't know why we don't use them; perhaps no one saw that option? Or it's hard to predict how long you'll need? Seems like some conservative passive expiration would make sense, though.

Member

@qqmyers qqmyers left a comment

Looks good - I suggested a couple of minor changes, but I think the overall logic in the updates makes sense.

@landreev
Contributor Author

I just killed the last Jenkins build (no. 14) because it was triggered by a commit of a typo fix in a comment.


📦 Pushed preview images as

ghcr.io/gdcc/dataverse:11057-globus-downloads
ghcr.io/gdcc/configbaker:11057-globus-downloads

🚢 See on GHCR. Use by referencing the full name as printed above; mind the registry name.

Labels
None yet
Projects
Status: In Review 🔎
Development

Successfully merging this pull request may close these issues.

Globus integration: improve handling of downloads
3 participants