
Preliminary fix for Base.unsafe_wrap #583

Open · wants to merge 3 commits into master
Conversation

matinraayai
Contributor

This PR should fix Base.unsafe_wrap, which allocated new memory instead of wrapping the passed pointer.
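For context, this is the contract Base documents for unsafe_wrap: the returned array must alias the given pointer rather than a fresh allocation. A minimal sketch of that aliasing with plain Base Arrays (nothing AMDGPU-specific is assumed here):

julia> x = zeros(Float32, 4, 4);

julia> y = unsafe_wrap(Array, pointer(x), size(x));  # wraps x's memory, no copy

julia> y[1, 1] = 42f0; x[1, 1]  # both arrays alias the same buffer
42.0f0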

@matinraayai matinraayai marked this pull request as draft January 22, 2024
@matinraayai matinraayai marked this pull request as ready for review January 23, 2024
@matinraayai
Contributor Author

@pxl-th can you please review?
Also, are there plans to make a distinction between host-accessible buffers and managed buffers in AMDGPU.jl?

@pxl-th
Collaborator

pxl-th commented Jan 24, 2024

which allocated new memory instead of wrapping the passed pointer

Can you show an example of where it allocates?
Neither HostBuffer nor HIPBuffer allocates when given a pointer and a bytesize.

Wrapping a host array
julia> x = zeros(Float32, 4, 4);

julia> xd = unsafe_wrap(ROCArray, pointer(x), size(x));
:3:hip_context.cpp          :152 : 2775812923 us: [pid:13824 tid:0x7fa8c6f40d00]  hipCtxCreate ( 0x7ffdc05f8250, 0, 0 ) 
:3:hip_context.cpp          :164 : 2775812936 us: [pid:13824 tid:0x7fa8c6f40d00] hipCtxCreate: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :652 : 2775812941 us: [pid:13824 tid:0x7fa8c6f40d00]  hipSetDevice ( 0 ) 
:3:hip_device_runtime.cpp   :656 : 2775812945 us: [pid:13824 tid:0x7fa8c6f40d00] hipSetDevice: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :637 : 2775812955 us: [pid:13824 tid:0x7fa8c6f40d00]  hipGetDeviceCount ( 0x7ffdc05f84e8 ) 
:3:hip_device_runtime.cpp   :639 : 2775812959 us: [pid:13824 tid:0x7fa8c6f40d00] hipGetDeviceCount: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :622 : 2775812963 us: [pid:13824 tid:0x7fa8c6f40d00]  hipGetDevice ( 0x7ffdc05f83f8 ) 
:3:hip_device_runtime.cpp   :630 : 2775812966 us: [pid:13824 tid:0x7fa8c6f40d00] hipGetDevice: Returned hipSuccess : 
:3:hip_stream.cpp           :375 : 2775812975 us: [pid:13824 tid:0x7fa8c6f40d00]  hipStreamCreateWithPriority ( 0x7fa8bc099620, 0, 0 ) 
:3:rocdevice.cpp            :2768: 2775812984 us: [pid:13824 tid:0x7fa8c6f40d00] number of allocated hardware queues with low priority: 0, with normal priority: 0, with high priority: 0, maximum per priority is: 4
:3:rocdevice.cpp            :2846: 2775825071 us: [pid:13824 tid:0x7fa8c6f40d00] created hardware queue 0x7fa89c00a000 with size 16384 with priority 1, cooperative: 0
:3:rocdevice.cpp            :2938: 2775825081 us: [pid:13824 tid:0x7fa8c6f40d00] acquireQueue refCount: 0x7fa89c00a000 (1)
:4:rocdevice.cpp            :2099: 2775825338 us: [pid:13824 tid:0x7fa8c6f40d00] Allocate hsa host memory 0x7fa6e0600000, size 0x100000
:3:devprogram.cpp           :2686: 2775957593 us: [pid:13824 tid:0x7fa8c6f40d00] Using Code Object V5.
:3:hip_stream.cpp           :390 : 2775958610 us: [pid:13824 tid:0x7fa8c6f40d00] hipStreamCreateWithPriority: Returned hipSuccess : stream:0xb7eb70
:3:hip_device_runtime.cpp   :622 : 2775958618 us: [pid:13824 tid:0x7fa8c6f40d00]  hipGetDevice ( 0x7ffdc05f8518 ) 
:3:hip_device_runtime.cpp   :630 : 2775958622 us: [pid:13824 tid:0x7fa8c6f40d00] hipGetDevice: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :622 : 2775958632 us: [pid:13824 tid:0x7fa8c6f40d00]  hipGetDevice ( 0x7ffdc05f8538 ) 
:3:hip_device_runtime.cpp   :630 : 2775958635 us: [pid:13824 tid:0x7fa8c6f40d00] hipGetDevice: Returned hipSuccess : 
:3:hip_memory.cpp           :1233: 2775958655 us: [pid:13824 tid:0x7fa8c6f40d00]  hipHostRegister ( 0x7fa8bcd1bf40, 64, 2 ) 
:4:rocmemory.cpp            :967 : 2775958684 us: [pid:13824 tid:0x7fa8c6f40d00] Locking to pool 0x7802e0, size 0x40, HostPtr = 0x7fa8bcd1bf40, DevPtr = 0x7fa8bcd1bf40
:3:hip_memory.cpp           :1235: 2775958688 us: [pid:13824 tid:0x7fa8c6f40d00] hipHostRegister: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :622 : 2775958692 us: [pid:13824 tid:0x7fa8c6f40d00]  hipGetDevice ( 0x7ffdc05f84d8 ) 
:3:hip_device_runtime.cpp   :630 : 2775958694 us: [pid:13824 tid:0x7fa8c6f40d00] hipGetDevice: Returned hipSuccess : 
:3:hip_memory.cpp           :3350: 2775958702 us: [pid:13824 tid:0x7fa8c6f40d00]  hipHostGetDevicePointer ( 0x7fa8bc099630, 0x7fa8bcd1bf40, 0 ) 
:3:hip_memory.cpp           :3364: 2775958706 us: [pid:13824 tid:0x7fa8c6f40d00] hipHostGetDevicePointer: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :622 : 2775963126 us: [pid:13824 tid:0x7fa8c6f40d00]  hipGetDevice ( 0x7ffdc05f8e38 ) 
:3:hip_device_runtime.cpp   :630 : 2775963133 us: [pid:13824 tid:0x7fa8c6f40d00] hipGetDevice: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :607 : 2775963139 us: [pid:13824 tid:0x7fa8c6f40d00]  hipDeviceSynchronize (  ) 
:3:hip_device_runtime.cpp   :610 : 2775963143 us: [pid:13824 tid:0x7fa8c6f40d00] hipDeviceSynchronize: Returned hipSuccess : 
Wrapping a device array
julia> x = ROCArray{Float32}(undef, 4, 4);
:3:hip_device_runtime.cpp   :622 : 2870790570 us: [pid:13824 tid:0x7fa8c6f40d00]  hipGetDevice ( 0x7ffdc05f8438 ) 
:3:hip_device_runtime.cpp   :630 : 2870790582 us: [pid:13824 tid:0x7fa8c6f40d00] hipGetDevice: Returned hipSuccess : 
:3:hip_mempool.cpp          :59  : 2870790591 us: [pid:13824 tid:0x7fa8c6f40d00]  hipDeviceGetMemPool ( 0x7fa8be5ed090, 0 ) 
:3:hip_mempool.cpp          :64  : 2870790594 us: [pid:13824 tid:0x7fa8c6f40d00] hipDeviceGetMemPool: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :622 : 2870790686 us: [pid:13824 tid:0x7fa8c6f40d00]  hipGetDevice ( 0x7ffdc05f8058 ) 
:3:hip_device_runtime.cpp   :630 : 2870790690 us: [pid:13824 tid:0x7fa8c6f40d00] hipGetDevice: Returned hipSuccess : 
:3:hip_mempool.cpp          :69  : 2870790699 us: [pid:13824 tid:0x7fa8c6f40d00]  hipMallocAsync ( 0x7fa8be5ed110, 64, stream:0xb7eb70 ) 
:4:rocdevice.cpp            :2227: 2870790782 us: [pid:13824 tid:0x7fa8c6f40d00] Allocate hsa device memory 0x7fa6e0200000, size 0x40
:3:rocdevice.cpp            :2266: 2870790786 us: [pid:13824 tid:0x7fa8c6f40d00] device=0xb19f90, freeMem_ = 0x5feffffc0
:3:hip_mempool_impl.cpp     :208 : 2870790791 us: [pid:13824 tid:0x7fa8c6f40d00] Pool AllocMem: 0x7fa6e0200000, 0x10b7250
:3:hip_mempool.cpp          :88  : 2870790795 us: [pid:13824 tid:0x7fa8c6f40d00] hipMallocAsync: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :622 : 2870794261 us: [pid:13824 tid:0x7fa8c6f40d00]  hipGetDevice ( 0x7ffdc05f8e38 ) 
:3:hip_device_runtime.cpp   :630 : 2870794267 us: [pid:13824 tid:0x7fa8c6f40d00] hipGetDevice: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :607 : 2870794271 us: [pid:13824 tid:0x7fa8c6f40d00]  hipDeviceSynchronize (  ) 
:3:hip_device_runtime.cpp   :610 : 2870794275 us: [pid:13824 tid:0x7fa8c6f40d00] hipDeviceSynchronize: Returned hipSuccess : 

julia> xd = unsafe_wrap(ROCArray, pointer(x), size(x); lock=false);
:3:hip_device_runtime.cpp   :622 : 2875471365 us: [pid:13824 tid:0x7fa8c6f40d00]  hipGetDevice ( 0x7ffdc05f8e38 ) 
:3:hip_device_runtime.cpp   :630 : 2875471383 us: [pid:13824 tid:0x7fa8c6f40d00] hipGetDevice: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :607 : 2875471392 us: [pid:13824 tid:0x7fa8c6f40d00]  hipDeviceSynchronize (  ) 
:3:hip_device_runtime.cpp   :610 : 2875471399 us: [pid:13824 tid:0x7fa8c6f40d00] hipDeviceSynchronize: Returned hipSuccess : 

You can see that in both cases the logs do not show any allocation.
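For anyone reproducing these traces: they come from the HIP runtime's own logging, enabled via the AMD_LOG_LEVEL environment variable, which must be set before the runtime initializes. A minimal recipe (assuming a working ROCm setup):

julia> ENV["AMD_LOG_LEVEL"] = "4";  # set before AMDGPU initializes the HIP runtime

julia> using AMDGPU

julia> x = zeros(Float32, 4, 4);

julia> xd = unsafe_wrap(ROCArray, pointer(x), size(x));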

Also, the tests need to be updated; they are failing right now.

Also, are there plans to make a distinction between host-accessible buffers and managed buffers in AMDGPU.jl?

Ideally yes, but I don't have much time at the moment to work on that.

@matinraayai
Contributor Author

matinraayai commented Jan 24, 2024

@pxl-th even if it doesn't allocate, it still doesn't let the caller control ownership of the buffer. The Base definition has a keyword argument own for exactly that (do correct me if I'm wrong).
The lock keyword is not necessary, since the HIP runtime can tell us the attributes of the pointer.
This brings AMDGPU.jl closer to what CUDA.jl (and Base) do for this function here:
https://github.com/JuliaGPU/CUDA.jl/blob/275c9bc8f04b7743c34703207f6f8619317d5633/src/array.jl#L234
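Roughly the shape I have in mind, as a sketch only: hipPointerGetAttributes is a real HIP API, but the Julia-side wrapper names, the attribute field, and the buffer constructors below are illustrative assumptions, not the final code.

function wrap_buffer(ptr::Ptr{Cvoid}, bytesize::Int)
    # Ask the HIP runtime what kind of memory the pointer refers to,
    # instead of requiring the caller to pass `lock`.
    attrs = Ref{HIP.hipPointerAttribute_t}()
    HIP.hipPointerGetAttributes(attrs, ptr)
    if attrs[].memoryType == HIP.hipMemoryTypeHost
        return Mem.HostBuffer(ptr, bytesize, false)  # non-owning
    else
        return Mem.HIPBuffer(ptr, bytesize, false)   # non-owning
    end
end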
I will update the tests once I get confirmation that this can be merged into AMDGPU.jl.

I might be able to contribute managed buffers to AMDGPU.jl if no one else is working on it.

@pxl-th
Collaborator

pxl-th commented Jan 24, 2024

I don't have any objections to these changes (as long as they work).
The lock keyword is more of a remnant from the HSA era.

FYI, the current implementation also does not take ownership: when creating a HostBuffer or HIPBuffer from a pointer and its bytesize, it marks the buffer as non-owning (the last argument), so it won't free the memory itself.
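For reference, the ownership semantics Base documents for plain Arrays, which the own keyword discussed above mirrors (using only Base/Libc APIs):

p = Libc.malloc(4 * sizeof(Float32))
a = unsafe_wrap(Array, Ptr{Float32}(p), 4; own=true)
# own=true: Julia calls free(p) once `a` is garbage-collected;
# with the default own=false, freeing remains the caller's responsibility.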

@luraess this might be breaking for you.

@matinraayai
Contributor Author

matinraayai commented Jan 24, 2024

@pxl-th you are correct; I edited my last comment to reflect this. In its current state, however, the unsafe_wrap implemented by AMDGPU.jl does not conform to Base (it doesn't allow taking ownership) and does not determine the attributes of the pointer automatically (it should not determine the pointer type via the lock argument).

I will update the tests and let CI run this.

In terms of implementing the managed buffer, is there anything on the LLVM side I should fix (e.g. address spaces)? Also, do you know what the exact difference is between managed buffers and host-accessible buffers in HIP/HSA?
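My current understanding, for what it's worth: host-accessible buffers are pinned host memory registered with the device (the hipHostRegister path in the trace above), while managed buffers migrate between host and device on demand. Sketch of the allocation side (hipMallocManaged and hipMemAttachGlobal are real HIP symbols, but the Julia wrapper spelling here is an assumption):

ptr = Ref{Ptr{Cvoid}}(C_NULL)
HIP.hipMallocManaged(ptr, bytesize, HIP.hipMemAttachGlobal)
# The returned pointer is usable from both host and device; the driver
# migrates pages on demand, unlike pinned memory from hipHostRegister.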

@luraess luraess changed the title Preliminary fix for Base.unsafe_warp Preliminary fix for Base.unsafe_wrap Jan 25, 2024
@luraess
Collaborator

luraess commented Jan 25, 2024

@luraess this might be breaking for you.

The one place we currently use unsafe_wrap is https://github.com/eth-cscs/ImplicitGlobalGrid.jl/blob/08003fcf7f1f4d3db07ceff0cebc7aac9784659b/src/AMDGPUExt/shared.jl#L41, which is then used to register memory in https://github.com/eth-cscs/ImplicitGlobalGrid.jl/blob/08003fcf7f1f4d3db07ceff0cebc7aac9784659b/src/AMDGPUExt/update_halo.jl#L86-L90.

It would be good to sync on that.
