
Row dropped with pandas read_csv on linux #1120

Closed
K-Meech opened this issue Feb 20, 2025 · 7 comments · Fixed by #1123

@K-Meech commented Feb 20, 2025

Describe the bug
When using pyfakefs with pandas, a single row is sometimes dropped during a write/read round trip. This only occurs on Linux systems (tested on an Ubuntu laptop); there is no issue on Windows. I totally understand if this issue is out of scope, as there are known issues with pandas listed in the docs!

How To Reproduce
Run the following on a Linux system via pytest:

import pandas as pd

def test_minimal_example(fs):

    fs.create_dir("/TEST")

    n_rows = 46
    df = pd.DataFrame({
        "abcdefghlmnopqrst": [1] * n_rows,
        "abcdef": [1] * n_rows,
        "abcdefghijklm": ['ABCD'] * n_rows,
        "abcdefghijklmnopqrstuvw": [pd.Timestamp('2023-06-13 02:24:46.996459+0000', tz='UTC')] * n_rows,
        "abcdefghijklmnopqrstuv": [pd.Timestamp('2023-06-02 09:20:20+0000', tz='UTC')] * n_rows,
        "abcdefghijklmnopqr": [pd.NaT] * 35 + [pd.Timestamp('2023-06-15 18:00:00+0000', tz='UTC')] * 11,
        "abcdefghijklmn": ['ABCDEFGHIJ'] * n_rows,
        "abcdefghlmnopqr": ['ABCDEFG'] * n_rows,
        "abcdefghijklmnopqrstuvwxyz": ['ABCD'] * n_rows,
        "abcdefghijklmnopqrstuvwxy": ['ABCDEFGHIJK', None] * (n_rows // 2),
        "abcdefghij": ['ABC'] * n_rows,
        "abcdefghi": [pd.Timestamp('2017-01-22 08:01:44.253136+0000', tz='UTC')] * n_rows,
        "abcdefghijk": [pd.Timestamp('2018-10-11 12:03:31.663658+0000', tz='UTC')] * n_rows
    })
    df.to_csv("/TEST/test.csv", index=False)

    read_df = pd.read_csv("/TEST/test.csv")
    assert len(read_df) == len(df)

Once read back, the dataframe has lost one row (46 rows written, 45 read). Changing pretty much anything about this dataframe (e.g. the column names or the number of rows) makes the test pass.

Your environment
I'm running on WSL, but a colleague had the same issue on their Ubuntu system:

Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python 3.11.11 (main, Dec 11 2024, 16:28:39) [GCC 11.2.0]
pyfakefs 5.7.4
pytest 8.3.4

@mrbean-bremen (Member)

Thanks - that's weird... I can even reproduce it under Windows.

@mrbean-bremen (Member)

Thanks for the report - this is certainly a bug in pyfakefs that has never been noticed before. It seems to be related to I/O buffering.

Just for reference: it looks like the last line is not written to the fake filesystem if it crosses the end of the buffer. The default buffer size is 8192 bytes, and the size of your data is a bit over that. There is a difference between Windows and Unix due to the different line endings: the file size under Windows is larger by the number of line endings in the file, which may be why the same data sometimes works under Windows but not under Linux (and vice versa).
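
Just as an illustration (this sketch is not part of the original report, and the exact sizes are assumptions chosen only to cross the boundary), the same condition can be hit without pandas by writing a file whose last line straddles the default 8192-byte buffer:

def test_last_line_crosses_buffer_boundary(fs):
    line = "x" * 100 + "\n"  # 101 bytes per line
    n_lines = 82             # 82 * 101 = 8282 bytes, so the final line starts
                             # before byte 8192 and ends after it
    with open("/buffered.txt", "w") as f:
        for _ in range(n_lines):
            f.write(line)

    # if the boundary-crossing last line were dropped, this count would be short
    with open("/buffered.txt") as f:
        assert len(f.readlines()) == n_lines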

@mrbean-bremen (Member)

@K-Meech - Should be fixed in main now, please check if it works for you.

@K-Meech (Author) commented Feb 24, 2025

Thanks @mrbean-bremen! Just checked, and everything is working on main now 😄

@mrbean-bremen (Member)

Ok - let me know if you need a patch release now, otherwise I will first check a couple of other issues.

@mrbean-bremen (Member)

As an aside: the bug was a simple mistake that has been there for about 5 years now without being noticed - and it shows again how important it is to think about edge cases while writing tests...

@K-Meech (Author) commented Feb 24, 2025

No rush for a release from our side, so feel free to check other issues first. Thanks again for the quick fix!
