
Row dropped with pandas read_csv on linux #1120

Closed
K-Meech opened this issue Feb 20, 2025 · 7 comments · Fixed by #1123

@K-Meech commented Feb 20, 2025

Describe the bug
When using pyfakefs with pandas, a single row is sometimes dropped during a write/read round trip. This only occurs on Linux systems (tested on an Ubuntu laptop); there is no issue on Windows. I totally understand if this issue is out of scope, as there are known issues with pandas listed in the docs!

How To Reproduce
Run the following on a Linux system via pytest:

import pandas as pd

def test_minimal_example(fs):

    fs.create_dir("/TEST")

    n_rows = 46
    df = pd.DataFrame({
        "abcdefghlmnopqrst": [1] * n_rows,
        "abcdef": [1] * n_rows,
        "abcdefghijklm": ['ABCD'] * n_rows,
        "abcdefghijklmnopqrstuvw": [pd.Timestamp('2023-06-13 02:24:46.996459+0000', tz='UTC')] * n_rows,
        "abcdefghijklmnopqrstuv": [pd.Timestamp('2023-06-02 09:20:20+0000', tz='UTC')] * n_rows,
        "abcdefghijklmnopqr": [pd.NaT] * 35 + [pd.Timestamp('2023-06-15 18:00:00+0000', tz='UTC')] * 11,
        "abcdefghijklmn": ['ABCDEFGHIJ'] * n_rows,
        "abcdefghlmnopqr": ['ABCDEFG'] * n_rows,
        "abcdefghijklmnopqrstuvwxyz": ['ABCD'] * n_rows,
        "abcdefghijklmnopqrstuvwxy": ['ABCDEFGHIJK', None] * (n_rows // 2),
        "abcdefghij": ['ABC'] * n_rows,
        "abcdefghi": [pd.Timestamp('2017-01-22 08:01:44.253136+0000', tz='UTC')] * n_rows,
        "abcdefghijk": [pd.Timestamp('2018-10-11 12:03:31.663658+0000', tz='UTC')] * n_rows
    })
    df.to_csv("/TEST/test.csv", index=False)

    read_df = pd.read_csv("/TEST/test.csv")
    assert len(read_df) == len(df)

Once read back, the dataframe has lost one row (46 rows written, 45 read). Changing pretty much anything about this dataframe (e.g. the column names or the number of rows) makes the test pass.

Your environment
I'm running on WSL, but a colleague had the same issue on their Ubuntu system:

Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python 3.11.11 (main, Dec 11 2024, 16:28:39) [GCC 11.2.0]
pyfakefs 5.7.4
pytest 8.3.4

@mrbean-bremen (Member)

Thanks - that's weird... I can even reproduce it under Windows.

@mrbean-bremen (Member)

Thanks for the report - this is certainly a bug in pyfakefs that has never been noticed before. It seems to be related to I/O buffering.

Just for reference: it looks like the last line is not written to the fake filesystem if it crosses the end of the buffer. The default buffer size is 8192 bytes, and the size of your data is a bit over that. There is a difference between Windows and Unix due to the different line endings: the file size under Windows is larger by the number of line endings in the file, which may be why the same data sometimes works under Windows but not under Linux (and vice versa).
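
Just as an illustration (this sketch is not part of the original report, and the exact sizes are assumptions chosen only to cross the boundary), the same condition can be hit without pandas by writing a file whose last line straddles the default 8192-byte buffer:

def test_last_line_crosses_buffer_boundary(fs):
    line = "x" * 100 + "\n"  # 101 bytes per line
    n_lines = 82             # 82 * 101 = 8282 bytes, so the final line starts
                             # before byte 8192 and ends after it
    with open("/buffered.txt", "w") as f:
        for _ in range(n_lines):
            f.write(line)

    # if the boundary-crossing last line were dropped, this count would be short
    with open("/buffered.txt") as f:
        assert len(f.readlines()) == n_lines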

@mrbean-bremen (Member)

@K-Meech - Should be fixed in main now, please check if it works for you.

@K-Meech (Author) commented Feb 24, 2025

Thanks @mrbean-bremen! Just checked, and everything is working on main now 😄

@mrbean-bremen (Member)

Ok - let me know if you need a patch release now, otherwise I will first check a couple of other issues.

@mrbean-bremen (Member)

As an aside: the bug was a simple mistake that has been there for about 5 years now without being noticed - and it shows again how important it is to think about edge cases while writing tests...

@K-Meech (Author) commented Feb 24, 2025

No rush for a release from our side, so feel free to check other issues first. Thanks again for the quick fix!
