
Node Issue: crash when current working directory not on same filesystem as data directory #12944

Open
matthewdarwin opened this issue Feb 17, 2025 · 11 comments
Labels: community · investigation required · Node

@matthewdarwin

Contact Details

No response

Node type

Non-Top 100 Validator

Which network are you running?

testnet

What happened?

neard crashes when the client is started from a working directory (e.g. /) that is not on the same filesystem as the data directory (e.g. /var/lib/near).

Version

2.5.0-rc.2

The problem did not exist in 2.4.0.

Relevant log output

with `RUST_BACKTRACE=full`


thread '<unnamed>' panicked at /usr/local/cargo/git/checkouts/nearcore-5bf7818cf2261fd0/380afac/chain/chain/src/runtime/mod.rs:918:26:
StorageInconsistentState("cache write error: Invalid cross-device link (os error 18)")
stack backtrace:
   0:     0x561d082ad889 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::ha4a311b32f6b4ad8
   1:     0x561d06b2f243 - core::fmt::write::h1866771663f62b81
   2:     0x561d08278ee2 - std::io::Write::write_fmt::hb549e7444823135e
   3:     0x561d082aea03 - std::sys::backtrace::BacktraceLock::print::hddd3a9918ce29aa7
   4:     0x561d082aef2c - std::panicking::rust_panic_with_hook::he21644cc2707f2c4
   5:     0x561d082aead8 - std::panicking::begin_panic_handler::{{closure}}::h42f7c414fed3cad9
   6:     0x561d082aea39 - std::sys::backtrace::__rust_end_short_backtrace::ha26cf5766b4e8c65
   7:     0x561d082aea2c - rust_begin_unwind
   8:     0x561d0683b22f - core::panicking::panic_fmt::h74866b78e934b1c0
   9:     0x561d06eda0cd - <near_chain::runtime::NightshadeRuntime as near_chain::types::RuntimeAdapter>::apply_chunk::hde0fc7d681dea438
  10:     0x561d06e00642 - near_chain::update_shard::apply_new_chunk::h8286590111afd797
  11:     0x561d06dffd88 - core::ops::function::FnOnce::call_once{{vtable.shim}}::h52512da70d5712a6
  12:     0x561d06df682c - rayon::iter::plumbing::bridge_producer_consumer::helper::h51882d7238f92ca8
  13:     0x561d06df8870 - <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute::hbed77d61a7837b4d
  14:     0x561d068bb843 - rayon_core::registry::WorkerThread::wait_until_cold::h2d91ac0f35ae41d3
  15:     0x561d08092844 - std::sys::backtrace::__rust_begin_short_backtrace::h5e2862a4c0f395e4
  16:     0x561d08092527 - core::ops::function::FnOnce::call_once{{vtable.shim}}::h1e253b5660f82b30
  17:     0x561d082b0b8b - std::sys::pal::unix::thread::Thread::new::thread_start::h14f1eb868ff90fc9
  18:     0x7fc778fc91c4 - <unknown>
  19:     0x7fc77904985c - <unknown>
  20:                0x0 - <unknown>
Rayon: detected unexpected panic; aborting

Node head info

Node upgrade history

today for the testnet hard fork

DB reset history

several months ago
@VanBarbascu
Contributor

Hi @matthewdarwin,

We operate our nodes in a similar configuration:

  • The home directory (~) is located on the boot disk
  • The data directory (~/.near/data) is located on a separate disk
  • The neard binary is in the home directory
  • neard is executed from a systemd service

The alternative configuration of maintaining the ~/.near folder on a separate disk has been thoroughly tested and also functions correctly.

This error appears to originate from RocksDB. Does your data directory span multiple mount points? For instance, is the state-snapshot directory located on a separate disk?

To debug this issue further, I will require additional information about your machine configuration and the method you use to start your node.

@walnut-the-cat
Contributor

walnut-the-cat commented Mar 7, 2025

Another report from OKX:

[screenshot]

@strokovok

Hi, we (at Aurora Labs) are experiencing exactly the same problem after updating our mainnet nodes to 2.5.1.
In our case nearcore (the indexer, to be exact) is packaged as a Docker image:
the binary is located on the container's root filesystem,
the home directory is a mounted volume,
and the data directory is inside the home directory.

@walnut-the-cat
Contributor

For everyone affected by this issue: have there been any recent changes to your filesystem setup?

@nagisa
Collaborator

nagisa commented Mar 7, 2025

We're still verifying this, but the issue looks most likely to be caused by the custom builds here not honoring the Cargo.lock file that is checked into the repository (did you/we run cargo update?). In particular, the tempfile library is locked to 3.14, which shouldn't have this problem.

The affected code will of course be fixed, meanwhile please make sure to use the Cargo.lock as-is if you're building your own binaries.

EDIT: I can confirm that updating tempfile to the most recent version breaks the node if the current working directory is on a different mountpoint than .near/data/contracts. As an alternative to a rebuild, you can make sure that the working directory is within the same mountpoint. But I would strongly suggest making absolutely sure that Cargo.lock is honored – otherwise you're running code that, in principle, has not been verified to work in integration with all its components.
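[Editor's note] To make the failure mode above concrete: this is not nearcore's actual cache code (which sits in the path shown in the backtrace), but a minimal, self-contained Rust sketch of why a rename(2)-based "persist temp file, then move into place" step fails across mountpoints with EXDEV (os error 18, "Invalid cross-device link"), and what a copy-based fallback looks like. The file names are hypothetical.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

// EXDEV: "Invalid cross-device link" on Linux -- the errno seen in the backtrace.
const EXDEV: i32 = 18;

/// Try to move `tmp` to `dest` with rename(2), which only works within a
/// single filesystem. On EXDEV, fall back to copy + remove, which works
/// across mountpoints. Returns which strategy succeeded.
fn persist(tmp: &Path, dest: &Path) -> io::Result<&'static str> {
    match fs::rename(tmp, dest) {
        Ok(()) => Ok("rename"),
        Err(e) if e.raw_os_error() == Some(EXDEV) => {
            fs::copy(tmp, dest)?;
            fs::remove_file(tmp)?;
            Ok("copy")
        }
        Err(e) => Err(e),
    }
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir();
    let tmp = dir.join("neard_cache_demo.tmp");
    File::create(&tmp)?.write_all(b"cache entry")?;

    // Both paths are in the system temp dir here, so rename succeeds.
    // If `dest` lived on a different filesystem (e.g. tmp under `/`,
    // dest under /var/lib/near), the EXDEV branch would run instead.
    let dest = dir.join("neard_cache_demo.bin");
    println!("persisted via {}", persist(&tmp, &dest)?);
    fs::remove_file(&dest)?;
    Ok(())
}
```

This also illustrates why the working-directory workaround helps: if the temporary file is created relative to the current working directory and that directory shares a mountpoint with .near/data/contracts, rename stays within one filesystem and never hits EXDEV.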

@strokovok

Thank you @nagisa, excellent investigation.
The complication is that we don't build nearcore directly, but use it as a dependency (for indexers),
so I'm not sure there's any way to honor nearcore's Cargo.lock.
Perhaps we need to pin tempfile to 3.14 for our build targets?
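[Editor's note] For a downstream project that depends on nearcore, one way to do this (assuming the newer tempfile is still semver-compatible with nearcore's requirement, and taking 3.14.0 as the exact version from the thread) is to pin the transitive dependency in the downstream project's own lockfile:

```shell
# Run inside the downstream indexer project, not nearcore itself.
# Rewrites the local Cargo.lock so the transitive tempfile dependency
# resolves to the version nearcore's own Cargo.lock was tested with.
cargo update -p tempfile --precise 3.14.0
```

Note that this only pins the lockfile; anyone who runs a plain `cargo update` later would undo it, so it may be worth documenting in the downstream repo as well.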

@strokovok

strokovok commented Mar 7, 2025

> if the current working directory differs from the .near/data/contracts mountpoint

Hm, interestingly, the mountpoint is the same in our case.

@matthewdarwin
Author

matthewdarwin commented Mar 7, 2025

We are also building our own custom indexer binary (I'm on the operations team, not the development team). I don't know the details of how it works; you can check here: https://github.com/streamingfast/near-firehose-indexer/

@maoueh
Contributor

maoueh commented Mar 7, 2025

Same here, we are using the NEAR indexer framework, which is why it went unnoticed.

@strokovok Thanks for the change, I'll adjust it.

@ama31337

ama31337 commented Mar 9, 2025

I hit a similar issue recently while updating from 2.4.0 to 2.5.1:
StorageInconsistentState("cache write error: Permission denied (os error 13)")

My validator has always been running under the near user with full sudo privileges.
After the update, the validator ran for only a few minutes before crashing with the error mentioned above.
Restarting it and downloading a new snapshot didn't help; it crashes immediately with the same error.

However, after changing the User in the systemd service from near to root, while explicitly specifying the old working directory, the validator started successfully and is now running.

This behavior occurs on both nodes, the main and the backup, with identical configurations but on different servers with different hardware specs and from different providers.
