Globus support: download optimizations #11125
base: develop
Conversation
…d/counted with guestbook popups enabled. #11057
Plus implements proper checks before deleting access rules, ensuring that no other active tasks are still using them. #11057
I un-drafted the PR; waiting for a Jenkins test, will put the PR on the board if it passes
and fileMetadata.dataFile.released
and not fileMetadata.dataFile.restricted
and not dataFileServiceBean.isActivelyEmbargoed(fileMetadata)
and DatasetPage.isShowQueryButton(fileMetadata.dataFile.id)
isShowQueryButton handles all of these (and the retention period) except for tabularData, which is at least partially handled by registering for specific mime types. (Is the concern that it shows when ingest fails? Does the 'Ask the Data' tool fail if ingest fails, or can it do something with just the file?)
Whatever the correct logic - perhaps any changes can just go in isShowQueryButton() ?
Funny that you noticed this... I put together these logic mods in response to something you pointed out to me a few releases back, namely that the rendering logic for this button introduced an ungodly number of datafileservice.find() calls (it's a case of x7 JSF renderings). The whole thing was added to a release at the last minute, AND after I had gone to some trouble reducing the number of database lookups in the page. The above did not completely fix it, but it at least reduced the lookups by narrowing when the method is called to tabular, released, etc. files (the tool is only supposed to run on public tabular files, etc.; that logic is inside the method, but unfortunately it's only applied there after the db lookups). So I added these bandaid lines to the custom changes I deploy in prod at IQSS... and forgot about it/never made a real PR. These changes made it back into this branch kind of by accident, but when I noticed, I figured this would be a good opportunity to check them in.
Ideally, that method should not be called on the db id of the file in the first place; the page has access to the actual entities. The main reason I never bothered with a cleaner fix was that it felt wrong to invest any work into JSF, since it was allegedly going away any second.
Was a longer explanation than the issue is worth, for sure...
(If I have a sec, I'll go ahead and rewrite the DatasetPage.isShowQueryButton() method to be less of a hog, now that I've had to remember the whole thing in vivid detail.)
Hmm - is that because we use the file id and then look it up in that method? Could we make it take the DataFile as a param, avoid the lookups, and keep it all in the Java code?
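A minimal sketch of that suggestion, with DataFile reduced to a stub of the real entity and all names illustrative: the render-time check operates on fields the page already holds in memory, so no datafileservice.find() calls are needed.

```java
// Sketch: pass the already-loaded entity instead of its database id.
// DataFile here is a minimal stand-in for the real Dataverse entity.
public class QueryButtonSketch {
    static class DataFile {
        boolean released, restricted, tabularData, activelyEmbargoed;
        DataFile(boolean released, boolean restricted, boolean tabularData, boolean embargoed) {
            this.released = released;
            this.restricted = restricted;
            this.tabularData = tabularData;
            this.activelyEmbargoed = embargoed;
        }
    }

    // All render-time conditions evaluated on in-memory state; no db lookups.
    static boolean isShowQueryButton(DataFile df) {
        return df.released
                && !df.restricted
                && !df.activelyEmbargoed
                && df.tabularData;
    }

    public static void main(String[] args) {
        DataFile publicTabular = new DataFile(true, false, true, false);
        DataFile restrictedFile = new DataFile(true, true, true, false);
        System.out.println(isShowQueryButton(publicTabular));  // true
        System.out.println(isShowQueryButton(restrictedFile)); // false
    }
}
```

The real method would still need the mime-type/tool-registration checks discussed above; the point is only that everything can be evaluated on the entity the page already has.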
// we will make an extra attempt to refresh the token and try again
// in the event of an exception indicating the token is stale

String globusClientToken = getClientTokenForStorageDriver(t.getDataset(), false);
Not a big deal, but this section is common to both upload and download.
// would mean "DO SKIP writing the guestbookResponse", and "false" means
// "DO write ..."
if(isGlobusTransfer) {
// Note that *singe-file* Globus transfers are NOT handled here.
typo: should be single-file
}
retries++;
try {
Thread.sleep(3000);
You should only sleep if taskState == null and there are retries left. As written, this delays the loop an extra 3 seconds every iteration for no reason.
Correct, will fix.
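The fix being agreed on here could look roughly like this (the lookup is stubbed out; names are illustrative): the sleep happens only when the state is still unknown AND another retry is actually coming, so a successful lookup or the final attempt never pays the extra 3-second delay.

```java
import java.util.function.Supplier;

// Sketch of the corrected polling loop.
public class RetrySketch {
    static String pollTaskState(Supplier<String> lookup, int maxRetries) {
        String taskState = null;
        int retries = 0;
        while (taskState == null && retries < maxRetries) {
            taskState = lookup.get();
            retries++;
            // Sleep only if we are actually going to retry:
            if (taskState == null && retries < maxRetries) {
                try {
                    Thread.sleep(3000);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        return taskState;
    }
}
```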
rulesCache.invalidate(ruleId);
}
} else {
// we'll proceed anyway, under the assumption that we will make
// another attempt to look it up later
I'm not crazy about code blocks with no actual code, but OK. Maybe add a logger.fine saying the rule was not found?
Yeah, I had some logging/debugging lines in that block during development. Will remove.
retries++;
try {
Thread.sleep(3000);
} catch (InterruptedException ie) {
Same as above. This is an extra 3 second delay after the taskState is found.
// for some other reason):
String ruleId = getRuleId(endpoint, principal, "rw");
if (ruleId != null) {
requestPermStatus = 201;
Should this get logged? Something went wrong with a prior removal, which doesn't have to stop this request, but it shouldn't be ignored: it's a bug/problem if perms aren't going away.
// if other tasks are still using it). If that's the case, we'll
// confirm that the rule does exist and assume that it's ok to
// proceed with the download.
if (FeatureFlags.GLOBUS_USE_EXPERIMENTAL_ASYNC_FRAMEWORK.enabled()) {
So unlike the upload case, this is still failing if Dataverse failed to delete it after the last download? It's only allowing multiple parallel downloads. (So no reason to log here as this is valid versus evidence of some problem).
With this change it is NOT failing if the rule is already there (whether because Dataverse failed to delete it after a past download; or because it was created for a download that's still running) and the new monitoring mode is in use.
It is these extra lines above that actually make it possible to have multiple downloads from the same dataset. Prior to this pr, Dataverse would try to request an access rule, get back a 409, and pass it to the app; which in turn would tell the user that "download could not be started". Now the code intercepts the 409, verifies that the "r" permission for this user/this folder does exist, then returns 201, and the app proceeds to initiate the download. On the completion of the task, the monitoring service code checks if there are still any other tasks in the database-stored queue using the same rule, and only deletes it if there are none.
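The 409 interception described in the previous paragraph could be sketched like this (helper names and the rule lookup are hypothetical; the real code queries Globus for the existing access rules):

```java
import java.util.Map;

// Sketch of intercepting a 409 from the access-rule creation request.
public class RequestPermSketch {
    static int requestReadPermission(int globusStatus,
                                     Map<String, String> existingRules,
                                     String principal, String path) {
        if (globusStatus == 409) {
            // 409: a rule already exists - either left over from a failed
            // cleanup, or created for a download that is still running.
            String perm = existingRules.get(principal + ":" + path);
            if ("r".equals(perm) || "rw".equals(perm)) {
                // A matching read rule really is there; proceed as if it
                // had just been created, so the app can start the download.
                return 201;
            }
            return 409; // no matching rule found after all; report the conflict
        }
        return globusStatus;
    }
}
```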
Got it. So a) why the check for the feature flag for download and not upload? and b) can you distinguish between parallel downloads and a prior failure (would the ruleId be in the cache for multiple downloads but not for a failed delete?)? As with the upload case, I'm worried that problems that leave a permission in place will get ignored because they now don't stop a new request and there's no logging.
a) For downloads: if the old-style continuous task monitoring is used, the code is falling back to the old behavior; under the assumption that there may be another active task using the rule, and then it would be impossible to prevent the one that finishes first from deleting the rule and killing the other task(s).
I'm not doing that for uploads, under the assumption that it is impossible to start more than one - since the dataset gets locked. So, if we found an existing rule, it must be from a finished or failed task.
I agree that this needs to be logged; and I should probably add a warning to the guide somewhere to keep an eye out for these warnings.
b) I haven't been able to come up with a way to distinguish between a stale vs. still in use rule, without the database record. I don't think we can rely on the cache for this. We do remove the rule from the cache when we start transfers; that can be modified somehow, but then we are back to relying on the server staying up (we now have real-life cases of downloads, and occasional uploads that take days); plus the issue of multiple servers at IQSS and possibly other places.
BTW, do you know/remember why the rules Dataverse creates have no expiration dates? - making them expire/disappear eventually on their own would help, I think.
c) I'm realizing we need a dedicated admin api call for clearing up stale access rules. I've been doing that outside of Dataverse, but it would be trivial to add. Again, I'll only be able to tell for sure if it's stale or not with the db tracking on. But then this is something I'll likely feel comfortable dropping the "experimental" label from and making it the default behavior in 6.7.
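The completion-time check described above (only delete the shared rule once no other active task references it) could be sketched like this, with the database-stored task records stubbed as a list and all names illustrative:

```java
import java.util.List;

// Sketch: reference-count the shared access rule via the task records.
public class RuleCleanupSketch {
    record GlobusTask(String taskId, String ruleId, boolean active) {}

    // Safe to delete only if no OTHER active task still uses this rule.
    static boolean safeToDeleteRule(String ruleId, String finishedTaskId,
                                    List<GlobusTask> taskQueue) {
        return taskQueue.stream()
                .filter(GlobusTask::active)
                .filter(t -> !t.taskId().equals(finishedTaskId))
                .noneMatch(t -> ruleId.equals(t.ruleId()));
    }
}
```

This is exactly why the database tracking matters: an in-memory cache would lose the reference counts on a server restart, and wouldn't be shared across multiple servers.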
re: expiration - I don't know why we don't use them - perhaps no one saw that option? Or it's hard to predict how long you need? Seems like some conservative passive expiration would make sense though.
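Passive expiration might look something like the payload below. The "expiration_date" field is assumed to be supported by the Globus Transfer API's access-rule document (worth verifying against the current API reference before relying on it), and the rest of the names are illustrative.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Sketch of an access-rule payload with a conservative passive expiration.
public class RuleExpirySketch {
    // NB: naive string concatenation, no JSON escaping - illustration only.
    static String accessRulePayload(String principal, String path, int ttlDays) {
        String expires = Instant.now().plus(ttlDays, ChronoUnit.DAYS).toString();
        return "{"
                + "\"DATA_TYPE\": \"access\","
                + "\"principal_type\": \"identity\","
                + "\"principal\": \"" + principal + "\","
                + "\"path\": \"" + path + "\","
                + "\"permissions\": \"r\","
                + "\"expiration_date\": \"" + expires + "\""
                + "}";
    }
}
```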
Looks good - I suggested a couple minor changes, but I think the overall logic in the updates makes sense.
I just killed the last Jenkins build (no. 14), because it was triggered by a commit of a typo fix in a comment.
What this PR does / why we need it:
The PR improves the reliability of the Globus download framework, primarily by extending the new task-monitoring system first implemented for uploads in #10781. There are a few other improvements. For example, it fixes a somewhat exotic bug where Globus downloads weren't counted, but only for multi-file downloads and only when a guestbook popup was enabled.
These improvements were implemented based on the experience with the NESE data storage integration with Dataverse at IQSS. Most of them are already in production use there, deployed as an experimental beta build.
Which issue(s) this PR closes:
Special notes for your reviewer:
The single most important improvement is that, similarly to what was implemented in #10781 for uploads, the ongoing download tasks can now be monitored asynchronously, with the state saved in the database. This makes the management of the temporary access rules more robust, guaranteeing that Dataverse will register the completion of each task, even if there was a server restart in between.
Assorted other fixes and improvements were added. For example, it is now possible for a user to have simultaneous downloads running on files from the same dataset (when the task state is saved in the database, upon completion of a task it is easy to check whether any other active tasks are using the same access rule and, if so, avoid deleting it). I also fixed something misguided I did in #10781 when I first implemented saving the task state in the database: I misunderstood how the client tokens worked and tried to save the token for each task, thinking that it needed to be reused throughout the life of the task. In reality, the same token can be used for multiple tasks on the same endpoint; also, for a long-running task, that saved token has a good chance of expiring, and I didn't have a provision for that. Now the monitoring service simply caches the access tokens for each endpoint that it manages, and refreshes them as needed. General error handling, logging, and state checking have been improved.
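The per-endpoint token caching described above could be sketched as follows (the OAuth call to Globus is abstracted as a supplied function, and all names are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch: one cached client token per endpoint, refreshed near expiry,
// instead of persisting a token per task.
public class TokenCacheSketch {
    record CachedToken(String token, long expiresAtMillis) {}

    private final Map<String, CachedToken> cache = new ConcurrentHashMap<>();
    private final Function<String, CachedToken> fetchFreshToken; // stand-in for the Globus OAuth call

    TokenCacheSketch(Function<String, CachedToken> fetchFreshToken) {
        this.fetchFreshToken = fetchFreshToken;
    }

    String getToken(String endpointId) {
        CachedToken t = cache.get(endpointId);
        // Refresh if missing or within a minute of expiring:
        if (t == null || t.expiresAtMillis() - System.currentTimeMillis() < 60_000) {
            t = fetchFreshToken.apply(endpointId);
            cache.put(endpointId, t);
        }
        return t.token();
    }
}
```

The same cached token then serves every task on that endpoint, and a token that goes stale during a days-long transfer is simply re-fetched on the next use.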
Suggestions on how to test this:
dataverse-internal has an active Globus configuration tied to a fully-functioning remote storage volume at NESE, identical to our prod volume there. There are existing datasets with multiple large-ish (100MB+) Globus-stored files (for example: https://dataverse-internal.iq.harvard.edu/dataset.xhtml?persistentId=doi:10.70122/FK2/QZQPQE); more data can be uploaded for testing as needed. QA would involve verifying that such files can be downloaded, with an emphasis on repeating downloads from the same dataset, and perhaps parallel downloads of different files from the same dataset at the same time. For the end user, everything should work as described in the Globus download instructions (written for the prod users, but they fully apply to the test configuration on internal).
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
Is there a release notes update needed for this change?:
Additional documentation: