
Node Issue: crash when current working directory not on same filesystem as data directory #12944

Open
matthewdarwin opened this issue Feb 17, 2025 · 11 comments
Labels: community · investigation required · Node

@matthewdarwin

Contact Details

No response

Node type

Non-Top 100 Validator

Which network are you running?

testnet

What happened?

neard crashes when the client is started from a working directory (e.g. /) that is not on the same filesystem as the data directory (e.g. /var/lib/near).

Version

2.5.0-rc.2

The problem did not exist in 2.4.0.

Relevant log output

with `RUST_BACKTRACE=full`


thread '<unnamed>' panicked at /usr/local/cargo/git/checkouts/nearcore-5bf7818cf2261fd0/380afac/chain/chain/src/runtime/mod.rs:918:26:
StorageInconsistentState("cache write error: Invalid cross-device link (os error 18)")
stack backtrace:
   0:     0x561d082ad889 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::ha4a311b32f6b4ad8
   1:     0x561d06b2f243 - core::fmt::write::h1866771663f62b81
   2:     0x561d08278ee2 - std::io::Write::write_fmt::hb549e7444823135e
   3:     0x561d082aea03 - std::sys::backtrace::BacktraceLock::print::hddd3a9918ce29aa7
   4:     0x561d082aef2c - std::panicking::rust_panic_with_hook::he21644cc2707f2c4
   5:     0x561d082aead8 - std::panicking::begin_panic_handler::{{closure}}::h42f7c414fed3cad9
   6:     0x561d082aea39 - std::sys::backtrace::__rust_end_short_backtrace::ha26cf5766b4e8c65
   7:     0x561d082aea2c - rust_begin_unwind
   8:     0x561d0683b22f - core::panicking::panic_fmt::h74866b78e934b1c0
   9:     0x561d06eda0cd - <near_chain::runtime::NightshadeRuntime as near_chain::types::RuntimeAdapter>::apply_chunk::hde0fc7d681dea438
  10:     0x561d06e00642 - near_chain::update_shard::apply_new_chunk::h8286590111afd797
  11:     0x561d06dffd88 - core::ops::function::FnOnce::call_once{{vtable.shim}}::h52512da70d5712a6
  12:     0x561d06df682c - rayon::iter::plumbing::bridge_producer_consumer::helper::h51882d7238f92ca8
  13:     0x561d06df8870 - <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute::hbed77d61a7837b4d
  14:     0x561d068bb843 - rayon_core::registry::WorkerThread::wait_until_cold::h2d91ac0f35ae41d3
  15:     0x561d08092844 - std::sys::backtrace::__rust_begin_short_backtrace::h5e2862a4c0f395e4
  16:     0x561d08092527 - core::ops::function::FnOnce::call_once{{vtable.shim}}::h1e253b5660f82b30
  17:     0x561d082b0b8b - std::sys::pal::unix::thread::Thread::new::thread_start::h14f1eb868ff90fc9
  18:     0x7fc778fc91c4 - <unknown>
  19:     0x7fc77904985c - <unknown>
  20:                0x0 - <unknown>
Rayon: detected unexpected panic; aborting

Node head info

Node upgrade history

today for the testnet hard fork

DB reset history

several months ago
@VanBarbascu
Contributor

Hi @matthewdarwin,

We operate our nodes in a similar configuration:

  • The home directory (~) is located on the boot disk
  • The data directory (~/.near/data) is located on a separate disk
  • The neard binary is in the home directory
  • neard is executed from a systemd service

The alternative configuration of maintaining the ~/.near folder on a separate disk has been thoroughly tested and also functions correctly.

This error appears to originate from RocksDB. Does your data directory span multiple mount points? For instance, is the state-snapshot directory located on a separate disk?

To debug this issue further, I will require additional information about your machine configuration and the method you use to start your node.

@walnut-the-cat
Contributor

walnut-the-cat commented Mar 7, 2025

Another report from OKX:

[screenshot]

@strokovok

Hi, we (at Aurora Labs) are experiencing exactly the same problem after updating our mainnet nodes to 2.5.1.
In our case nearcore (the indexer, to be exact) is packaged as a Docker image:
the binary is located on the container's root filesystem,
the home directory is a mounted volume,
and the data directory is inside the home directory.

@walnut-the-cat
Contributor

For everyone affected by this issue: have there been any recent changes to your filesystem setup?

@nagisa
Collaborator

nagisa commented Mar 7, 2025

We're still verifying this, but the issue looks most likely to be caused by the custom builds here not honoring the Cargo.lock file that is checked into the repository (did you/we run cargo update?). In particular, the tempfile library is locked to 3.14, which shouldn't have this problem.

The affected code will of course be fixed, meanwhile please make sure to use the Cargo.lock as-is if you're building your own binaries.

EDIT: I can confirm that updating tempfile to the most recent version breaks the node if the current working directory is on a different mountpoint than .near/data/contracts. As an alternative to a rebuild, you can make sure that the working directory is within the same mountpoint. But I would strongly suggest making absolutely sure that Cargo.lock is honored – otherwise you're running code that, in principle, has not been verified to work in integration with all its components.
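[Editor's note] To make the failure mode above concrete: this is not nearcore's actual cache code (which sits in the path shown in the backtrace), but a minimal, self-contained Rust sketch of why a rename(2)-based "persist temp file, then move into place" step fails across mountpoints with EXDEV (os error 18, "Invalid cross-device link"), and what a copy-based fallback looks like. The file names are hypothetical.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

// EXDEV: "Invalid cross-device link" on Linux -- the errno seen in the backtrace.
const EXDEV: i32 = 18;

/// Try to move `tmp` to `dest` with rename(2), which only works within a
/// single filesystem. On EXDEV, fall back to copy + remove, which works
/// across mountpoints. Returns which strategy succeeded.
fn persist(tmp: &Path, dest: &Path) -> io::Result<&'static str> {
    match fs::rename(tmp, dest) {
        Ok(()) => Ok("rename"),
        Err(e) if e.raw_os_error() == Some(EXDEV) => {
            fs::copy(tmp, dest)?;
            fs::remove_file(tmp)?;
            Ok("copy")
        }
        Err(e) => Err(e),
    }
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir();
    let tmp = dir.join("neard_cache_demo.tmp");
    File::create(&tmp)?.write_all(b"cache entry")?;

    // Both paths are in the system temp dir here, so rename succeeds.
    // If `dest` lived on a different filesystem (e.g. tmp under `/`,
    // dest under /var/lib/near), the EXDEV branch would run instead.
    let dest = dir.join("neard_cache_demo.bin");
    println!("persisted via {}", persist(&tmp, &dest)?);
    fs::remove_file(&dest)?;
    Ok(())
}
```

This also illustrates why the working-directory workaround helps: if the temporary file is created relative to the current working directory and that directory shares a mountpoint with .near/data/contracts, rename stays within one filesystem and never hits EXDEV.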

@strokovok

Thank you @nagisa, excellent investigation.
The complication is that we don't build nearcore directly, but use it as a dependency (for indexers),
so I'm not sure there's any way to honor nearcore's Cargo.lock.
Perhaps we need to pin tempfile to 3.14 for our build targets?
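[Editor's note] For a downstream project that depends on nearcore, one way to do this (assuming the newer tempfile is still semver-compatible with nearcore's requirement, and taking 3.14.0 as the exact version from the thread) is to pin the transitive dependency in the downstream project's own lockfile:

```shell
# Run inside the downstream indexer project, not nearcore itself.
# Rewrites the local Cargo.lock so the transitive tempfile dependency
# resolves to the version nearcore's own Cargo.lock was tested with.
cargo update -p tempfile --precise 3.14.0
```

Note that this only pins the lockfile; anyone who runs a plain `cargo update` later would undo it, so it may be worth documenting in the downstream repo as well.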

@strokovok

strokovok commented Mar 7, 2025

> if the current working directory differs from the .near/data/contracts mountpoint

Hm, interestingly, the mountpoint is the same in our case.

@matthewdarwin
Author

matthewdarwin commented Mar 7, 2025

We are also building our own custom indexer binary (I'm on the operations team, not the development team). I don't know the details of how it works; you can check here: https://github.com/streamingfast/near-firehose-indexer/

@maoueh
Contributor

maoueh commented Mar 7, 2025

Same here, we are using the NEAR indexer framework, which is why it went unnoticed.

@strokovok Thanks for the change, I'll adjust it.

@ama31337

ama31337 commented Mar 9, 2025

I hit a similar issue recently while updating from 2.4.0 to 2.5.1:
StorageInconsistentState("cache write error: Permission denied (os error 13)")

My validator has always been running under the near user with full sudo privileges.
After the update, the validator ran for only a few minutes before crashing with the error mentioned above.
Restarting it and downloading a new snapshot didn't help; it crashes immediately with the same error.

However, after changing the User in the systemd service from near to root, while explicitly specifying the old working directory, the validator started successfully and is now running.

This behavior occurs on both nodes, the main and the backup, with identical configurations but on different servers with different hardware specs and from different providers.
