The RPC mechanism has unclear use of locks to protect it and of shared_ptrs to hold its resources. The current code in v2.21.2 is not consistent enough for me to infer the intent. Therefore, I ask for clarity of intention by showing the code that concerns me.
I could definitely have missed something in my code review. Please do point out any mistakes I made.
Setup
all compilers, OSes, platforms
depthai-core v2.21.2 and several earlier versions
Code review
pimpl->rpcStream
I see no code that uses pimpl->rpcStream. Why does it exist? The code in question is depthai-core/src/device/DeviceBase.cpp, lines 625 to 629 at 125feb8.
Lines 625-626 construct a shared_ptr<XLinkStream> and assign it to pimpl->rpcStream, which no code will meaningfully use.
Line 627 creates a new local shared_ptr named rpcStream and copy-assigns pimpl->rpcStream to it. So now the XLinkStream has two (2) shared_ptrs pointing to it. Why do we need two?
Line 629's lambda copy-captures rpcStream. So now there are three (3) shared_ptrs pointing to the XLinkStream.
The local auto rpcStream, the one outside the lambda, goes out of scope at the end of init2(). It never needed to exist.
pimpl->rpcStream is eventually set to nullptr in Device::close(). Nothing ever uses it. It never needed to exist either.
Beyond serving no purpose, the extra two refcounts on the shared_ptr add minor unneeded code and CPU use.
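To make the refcount growth concrete, here is a minimal standalone sketch. It is not the actual DeviceBase.cpp code: Stream and Pimpl are made-up stand-ins for XLinkStream and the real pimpl, but the copy/copy-capture pattern matches the walkthrough above.

```cpp
#include <functional>
#include <iostream>
#include <memory>

// Stand-in for XLinkStream (the real class has no default constructor either).
struct Stream {
    explicit Stream(int id) : id(id) {}
    int id;
};

struct Pimpl {
    std::shared_ptr<Stream> rpcStream;   // analogous to pimpl->rpcStream
    std::function<void()> rpcClient;     // analogous to pimpl->rpcClient
};

int main() {
    Pimpl pimpl;

    // "lines 625-626": construct the stream and store it in the pimpl member
    pimpl.rpcStream = std::make_shared<Stream>(42);

    // "line 627": local copy of the member -> second shared_ptr
    auto rpcStream = pimpl.rpcStream;

    // "line 629": lambda copy-captures the local -> third shared_ptr
    pimpl.rpcClient = [rpcStream]() {
        std::cout << "using stream " << rpcStream->id << '\n';
    };

    std::cout << "use_count = " << pimpl.rpcStream.use_count() << '\n';  // prints 3
    // The local rpcStream dies at end of scope; the member copy is never read,
    // so only the lambda's copy actually matters for the stream's lifetime.
}
```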
I suggest...
remove pimpl->rpcStream. I see no need for it.
merge lines 626 and 627 into a single auto rpcStream = ...
change line 629 to be a move capture like rpcStream = std::move(rpcStream) (see the sketch after this list)
The make_shared call could instead go directly in the capture to eliminate the local, if a long/wrapped capture is acceptable; that constructs the shared_ptr in place rather than moving it.
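A sketch of the suggested shape, again with stand-in names rather than the real depthai API: one local shared_ptr, move-captured into the lambda so the lambda is the sole owner.

```cpp
#include <functional>
#include <iostream>
#include <memory>

// Stand-in for XLinkStream.
struct Stream {
    explicit Stream(int id) : id(id) {}
    int id;
};

int main() {
    // Merge of "lines 626 and 627": one local shared_ptr, no pimpl member.
    auto rpcStream = std::make_shared<Stream>(42);

    // "line 629": move-capture so the lambda becomes the sole owner.
    std::function<void()> rpcClient = [rpcStream = std::move(rpcStream)]() {
        std::cout << "using stream " << rpcStream->id
                  << ", use_count = " << rpcStream.use_count() << '\n';  // prints 1
    };

    rpcClient();
    // The XLinkStream stand-in now lives exactly as long as rpcClient does.
}
```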
pimpl->rpcMutex
pimpl->rpcMutex seems to be used to protect RPC activity. It is used in only one place: depthai-core/src/device/DeviceBase.cpp, lines 630 to 632 at 125feb8.
Yet the resources on which the RPC mechanism depends, like rpcStream and rpcClient, are not protected by this mutex. Of most concern is the code in Device::close() at depthai-core/src/device/DeviceBase.cpp, lines 429 to 430 at 125feb8...
The intention of setting both of those to nullptr is unclear. What is it? Yes, part of it relates to the rpcStream discussion above.
I'll explore rpcStream below; pimpl->rpcClient is tracked separately in issue #805.
Setting pimpl->rpcStream = nullptr could mean one of two things:
1. [this is what the code currently does] Release one refcount of shared ownership of the thing pimpl->rpcStream points to.
2. Set the value of the thing pimpl->rpcStream points to, to nullptr or an empty XLinkStream...which doesn't exist, as there is no default constructor for XLinkStream.
I think the first (1) is useless. pimpl->rpcStream is used by no code. The pimpl->rpcClient lambda captured its own shared_ptr to the XLinkStream, so that XLinkStream continues to live; pimpl->rpcStream = nullptr has no effect on its lifetime.
If we instead want to do the second (2)...which is impossible as written...then I would want to lock pimpl->rpcMutex and then set its value with *pimpl->rpcStream = XLinkStream(), and add code into the lambda to check and throw on an invalid XLinkStream.
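For what that intention would look like, here is a hedged sketch of the lock-and-invalidate pattern. Everything below is a stand-in (the real XLinkStream has no default constructor and no such valid flag); it only illustrates the shape: close() takes the same mutex, marks the stream invalid, and the RPC lambda checks before touching XLink.

```cpp
#include <functional>
#include <iostream>
#include <memory>
#include <mutex>
#include <stdexcept>

// Stand-in only: the real XLinkStream has no default constructor or 'valid' flag.
struct Stream {
    bool valid = true;
    void write() { /* XLink write */ }
    void read()  { /* XLink read  */ }
};

struct Pimpl {
    std::mutex rpcMutex;
    std::shared_ptr<Stream> rpcStream = std::make_shared<Stream>();
    std::function<void()> rpcClient;
};

int main() {
    Pimpl pimpl;

    // RPC lambda: lock for the whole write+read, and check validity first.
    pimpl.rpcClient = [&pimpl, rpcStream = pimpl.rpcStream]() {
        std::unique_lock<std::mutex> lock(pimpl.rpcMutex);
        if(!rpcStream->valid) throw std::runtime_error("RPC stream is closed");
        rpcStream->write();
        rpcStream->read();
    };

    pimpl.rpcClient();  // works while the stream is valid

    // close(): take the SAME mutex before invalidating what the lambda uses.
    {
        std::unique_lock<std::mutex> lock(pimpl.rpcMutex);
        pimpl.rpcStream->valid = false;   // analogous to *pimpl->rpcStream = XLinkStream()
    }

    try {
        pimpl.rpcClient();                // now throws instead of touching a dead stream
    } catch(const std::exception& e) {
        std::cout << "RPC rejected after close: " << e.what() << '\n';
    }
}
```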
If we take a step back and blur the intention...there could also be a threading issue. Imagine two threads. There is no predicting which thread is first, last, fastest, or slowest, or when either or both are paused in the middle, etc.
Thread 1
Calls a Device API, which cascades to a call of the RPC lambda containing the above mutex lock. The lambda then uses rpcStream to write and read data to/from XLink. Please note the lambda itself is contained within rpcClient.
Thread 2
Calls Device::close(). This function has the code at depthai-core/src/device/DeviceBase.cpp, lines 429 to 430 at 125feb8...
What happens? 💣 Nothing good. Maybe errors. Maybe a crash. Unpredictable. Why?...
Device::close() #805 crash scenarios (two so far I've identified)
Thread 1 needs a valid rpcStream value during the entire call of the lambda
Thread 2 sets rpcStream = nullptr with no coordination with the RPC lambda via rpcMutex. But what was the intention?
If the intention of Thread 2 was to somehow destruct the rpcStream that Thread 1 uses within the lambda, then this is dangerous without mutex coordination. The null/destruct could happen at the start of the lambda before the XLink write, or between the XLink write and the XLink read.
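To make the feared interleaving concrete, here is a hedged sketch with stand-in types (an atomic flag is used purely so the demo itself is race-free): Thread 2's teardown lands between Thread 1's write and read because the close() path never takes rpcMutex.

```cpp
#include <atomic>
#include <chrono>
#include <iostream>
#include <memory>
#include <mutex>
#include <thread>

// Stand-in for XLinkStream; 'alive' is atomic only so this demo has no UB of its own.
struct Stream {
    std::atomic<bool> alive{true};
};

int main() {
    std::mutex rpcMutex;
    auto rpcStream = std::make_shared<Stream>();

    // Thread 1: the RPC lambda path - locks rpcMutex, then writes and reads.
    std::thread t1([&] {
        std::unique_lock<std::mutex> lock(rpcMutex);
        std::cout << "write, stream alive = " << rpcStream->alive << '\n';
        std::this_thread::sleep_for(std::chrono::milliseconds(50));  // window between write and read
        std::cout << "read,  stream alive = " << rpcStream->alive << '\n';
    });

    // Thread 2: the Device::close() path - tears the stream down WITHOUT taking rpcMutex,
    // so the teardown can land anywhere inside Thread 1's write/read sequence.
    std::thread t2([&] {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
        rpcStream->alive = false;  // analogous to nulling/destructing the stream in close()
    });

    t1.join();
    t2.join();
    // Typical output: alive = 1 at the write, alive = 0 at the read -
    // exactly the mid-call teardown that mutex coordination is supposed to prevent.
}
```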
Temporarily includes minor fixes to DeviceBase to fix the crash-after-close regression and to move RPC resources into the RPC lambda itself
TEST_CASE("Multiple devices created and destroyed in parallel") will readily show USB devices failing to successfully startup with 2 or 3 OAKs attached.
I am trying to discern the XLinkDeviceState_t of the USB devices at the moment of failure. Perhaps there is a clear pattern. It is reproducible now so hopefully I can bring more information so Luxonis can help on the device/firmware side of the issue.
My guess is that no-delay shutdown/reboot/boot/startpipe is failing sometimes due to timeouts/watchdogs. However, I have not eliminated the possibility of a bug in XLink with multithreaded client/host activity.
diablodale added a commit to diablodale/depthai-core that referenced this issue on May 10, 2023