Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add stream_file_content parameter to upload methods #890

Merged
merged 4 commits into from
Apr 8, 2025

Conversation

lukaszsocha2
Copy link
Contributor

@lukaszsocha2 lukaszsocha2 commented Mar 28, 2025

requests library documentation states that:

In the event you are posting a very large file as a multipart/form-data request, you may want to stream the request. By default, requests does not support this, but there is a separate package which does - requests-toolbelt. (link):

when using Requests’ Multipart upload functionality all the data must be read into memory before being sent to the server. (link):

In the Box Python SDK, we use requests-toolbelt by default to avoid loading the entire file into memory before sending the request. This ensures better efficiency, especially for large files.

However, 307 Temporary Redirects introduce a challenge when using streamed file uploads. Unlike 302 Found, which can change a POST request into a GET, a 307 redirect requires that the method and request body remain unchanged. This becomes an issue when the request body is a file stream, as the stream may already be exhausted when the redirect occurs. Since requests automatically follows redirects, the request is resent, but the file content may no longer be available.

To address this issue we have introduced stream_file_content parameter to upload methods. This allows you to choose between streaming the file (for memory efficiency) or loading it into memory first (to handle redirects correctly).

Fixes: #887

@lukaszsocha2 lukaszsocha2 force-pushed the sdk-4567-fix-dropping-stream-on-307 branch from c1d065a to 920cb03 Compare March 28, 2025 14:07
@coveralls
Copy link

coveralls commented Mar 28, 2025

Pull Request Test Coverage Report for Build 14313605139

Details

  • 7 of 7 (100.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.002%) to 93.609%

Totals Coverage Status
Change from base Build 13813583754: 0.002%
Covered Lines: 3530
Relevant Lines: 3771

💛 - Coveralls

@lukaszsocha2 lukaszsocha2 force-pushed the sdk-4567-fix-dropping-stream-on-307 branch from 8d29fdf to 9c5d09e Compare March 28, 2025 14:09
@lukaszsocha2 lukaszsocha2 force-pushed the sdk-4567-fix-dropping-stream-on-307 branch from 11578e9 to 157d3e9 Compare March 28, 2025 14:15
@metathinker
Copy link
Member

Hi! For my understanding, I have a couple conceptual questions about this particular change:

  1. In our application, the streams we pass to the upload() calls are actual open files, so random access is possible. Could we add an option to have requests rewind the stream and reread it from the start, so that the content can be read twice without having to preload it into memory? Or would that require code changes in requests itself?
  2. Is it possible to make the same change as this one to chunked upload sessions, so that chunk uploads can also follow 307 redirects successfully?

@lukaszsocha2
Copy link
Contributor Author

Hi,

  1. It would be possible, but we would have to disable auto redirects in requests library and handle all redirects manually. I'm not sure if we want to make such a big change in our SDK
  2. Chunks of a file are loaded into memory already in case retry of a chunk is necessary. Did you test it and it didn't work this way?

@metathinker
Copy link
Member

metathinker commented Mar 31, 2025

  1. In our application, the streams we pass to the upload() calls are actual open files, so random access is possible. Could we add an option to have requests rewind the stream and reread it from the start, so that the content can be read twice without having to preload it into memory? Or would that require code changes in requests itself?
  1. It would be possible, but we would have to disable auto redirects in requests library and handle all redirects manually. I'm not sure if we want to make such a big change in our SDK

That makes sense. Actually, after reconsidering, I'm not sure I agree here. requests already knows how to rewind the open file object by default when re-POST-ing after a redirect (psf/requests#3655), as @nicolasferreiralozano pointed out in the original issue #887. So I don't think it's necessary for box-python-sdk to handle redirects manually in order to do the rewind? Perhaps the use of requests-toolbelt to perform the streaming from the open file object makes it harder, though.

  1. Is it possible to make the same change as this one to chunked upload sessions, so that chunk uploads can also follow 307 redirects successfully?
  1. Chunks of a file are loaded into memory already in case retry of a chunk is necessary. Did you test it and it didn't work this way?

Sorry, my mistake - please feel free to ignore my point 2. Our application actually bypasses the UploadSession.upload_part_bytes() method, which behaves as you describe, and instead reaches into the BoxSession to call PUT at a lower level. After a closer reading, it appears to me that our app does this precisely to avoid having to pre-load the chunk into memory. We will have to undo that behavior ourselves.

@lukaszsocha2
Copy link
Contributor Author

lukaszsocha2 commented Apr 1, 2025

  1. When setting stream_file_content to False SDK rollbacks to the default requests library behaviour. I learnt from docs that it keeps body in the memory, but I haven't verified it in any way. So if you say that requests rewinds the stream by its own then I guess setting stream_file_content to False should be the solution for you. So is the change fine to be merged ?

@metathinker
Copy link
Member

metathinker commented Apr 3, 2025

  1. When setting stream_file_content to False SDK rollbacks to the default requests library behaviour. I learnt from docs that it keeps body in the memory, but I haven't verified it in any way. So if you say that requests rewinds the stream by its own then I guess setting stream_file_content to False should be the solution for you. So is the change fine to be merged ?

It's possible that setting stream_file_content=False will indeed rewind the stream even with the SDK's particular configurations, but I feel like in that case, calling the parameter stream_file_content is more confusing than helpful.

Unfortunately, I have not had time to try to plug in this change into our app to see if it works. Hence, I would not support or oppose merging this change as is, but I defer to your judgement.

Copy link
Member

@congminh1254 congminh1254 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but do we need to add docs for this?

@lukaszsocha2 lukaszsocha2 merged commit 0e63c00 into main Apr 8, 2025
12 checks passed
@lukaszsocha2 lukaszsocha2 deleted the sdk-4567-fix-dropping-stream-on-307 branch April 8, 2025 10:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

HTTP message body lost when uploading a file and a redirection occurs
4 participants