
RPC serialization and its mutex locks and/or shared_ptr use are unclear and likely crash #807

Open
diablodale opened this issue Apr 12, 2023 · 1 comment
Labels
bug Something isn't working


@diablodale
Contributor

The RPC mechanism makes unclear use of locks to protect it and of shared_ptrs to hold its resources. The current code in v2.21.2 is not consistent enough for me to infer the intent. Therefore, I ask for clarity of intention by showing the code that concerns me.

I could certainly have missed something in my code review. Please do point out any mistakes I made.

Setup

  • all compilers, os, platforms
  • depthai-core and several versions earlier too

Code review

pimpl->rpcStream

I see no code that uses pimpl->rpcStream. Why does it exist?

// prepare rpc for both attached and host controlled mode
pimpl->rpcStream = std::make_shared<XLinkStream>(connection, device::XLINK_CHANNEL_MAIN_RPC, device::XLINK_USB_BUFFER_MAX_SIZE);
auto rpcStream = pimpl->rpcStream;
pimpl->rpcClient = std::make_unique<nanorpc::core::client<nanorpc::packer::nlohmann_msgpack>>([this, rpcStream](nanorpc::core::type::buffer request) {

  • line 626 creates a shared_ptr<XLinkStream>
  • line 626 then assigns that shared_ptr to pimpl->rpcStream which no code will meaningfully use.
  • line 627 creates a new shared_ptr rpcStream and copy assigns pimpl->rpcStream to it. So now the XLinkStream has two (2) shared_ptrs pointing to it. Why do we need two?
  • line 629 lambda COPY captures rpcStream. So now there are three (3) shared_ptrs pointing to the XLinkStream.
  • auto rpcStream, the one outside the lambda, goes out of scope at the end of init2(). It serves no purpose.
  • pimpl->rpcStream is eventually set to nullptr in Device::close(). Nothing ever used it. It also serves no purpose.

Besides serving no purpose, the two extra refs on the shared_ptr add minor unneeded code and CPU use.
I suggest...

  • remove pimpl->rpcStream. I see no need for it.
  • merge lines 626 and 627 into a single auto rpcStream = ...
  • change line 629 to a move capture like rpcStream = std::move(rpcStream)
  • The make_shared could instead go directly in the capture initializer to eliminate the local, if a long/wrapped capture is acceptable; it will emplace-construct rather than move.

pimpl->rpcMutex

pimpl->rpcMutex seems to be used to protect RPC activity. It is used in only one place...

// Lock for time of the RPC call, to not mix the responses between calling threads.
// Note: might cause issues on Windows on incorrect shutdown. To be investigated
std::unique_lock<std::mutex> lock(pimpl->rpcMutex);

Yet the resources on which the RPC mechanism depends, like rpcStream and rpcClient, are not protected by this mutex. Of most concern is the code in Device::close()...

pimpl->rpcStream = nullptr;
pimpl->rpcClient = nullptr;

The intention of setting both of those to nullptr is unclear. What is it? It partially relates to the rpcStream discussion above.
I'll explore rpcStream below; pimpl->rpcClient is tracked separately in issue #805. Setting pimpl->rpcStream = nullptr could mean one of two things...

  1. [this is what the code currently does] Release one refcount of shared ownership of the object that pimpl->rpcStream points to.
  2. Set the value of the object that pimpl->rpcStream points to, to nullptr or an empty XLinkStream...which is impossible, as XLinkStream has no default constructor.

I think the first (1) is useless. pimpl->rpcStream is used by no code. The pimpl->rpcClient lambda captured its own shared_ptr to the XLinkStream, so that XLinkStream continues to live; pimpl->rpcStream = nullptr has no effect on its lifetime.

If we instead want the second (2)...which is impossible as written...then I would want to lock pimpl->rpcMutex, set the value with *pimpl->rpcStream = XLinkStream(), and add code to the lambda to check for an invalid XLinkStream and throw.

If we take a step back and blur intention...there could also be a threading issue. Imagine two threads. There is no predicting which thread runs first or last, which is fastest or slowest, or when either or both are paused mid-operation.

Thread 1

Calls a Device api which cascades to a call to the RPC lambda which contains the above mutex lock. The lambda then uses rpcStream to write and read data to/from XLink. Please note the lambda itself is contained within rpcClient.

Thread 2

Calls Device::close(). This function has code...

pimpl->rpcStream = nullptr;
pimpl->rpcClient = nullptr;

What happens? 💣Nothing good. Maybe errors. Maybe a crash. Unpredictable. Why?....

  • Issue [REGRESSION] Device APIs crash application when called after Device::close() #805 crash scenarios (two identified so far)
  • Thread 1 needs a valid rpcStream value during the entire call of the lambda
  • Thread 2 sets rpcStream = nullptr with no coordination with the RPC lambda via rpcMutex. But what was the intention?
  • If the intention of Thread 2 was to somehow destruct the rpcStream used within the lambda, then this is dangerous without mutex coordination. The null/destruct could happen at the start of the lambda before the XLink write, or between the XLink write and the XLink read.
@diablodale diablodale added the bug Something isn't working label Apr 12, 2023
diablodale added a commit to diablodale/depthai-core that referenced this issue Apr 13, 2023
@diablodale
Contributor Author

In PR #809

  • Temporarily includes minor fixes to DeviceBase to fix the crash-after-close regression and to move RPC resources into the RPC lambda itself
  • TEST_CASE("Multiple devices created and destroyed in parallel") will readily show USB devices failing to successfully startup with 2 or 3 OAKs attached.

I am trying to discern the XLinkDeviceState_t of the USB devices at the moment of failure; perhaps there is a clear pattern. It is now reproducible, so hopefully I can bring more information and Luxonis can help on the device/firmware side of the issue.

My guess is that no-delay shutdown/reboot/boot/startpipe is failing sometimes due to timeouts/watchdogs. However, I have not eliminated the possibility of a bug in XLink with multithreaded client/host activity.

diablodale added a commit to diablodale/depthai-core that referenced this issue May 10, 2023
diablodale added a commit to diablodale/depthai-core that referenced this issue May 11, 2023