Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

printf: accept non-UTF-8 input in FORMAT and #6812

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

andrewliebenow
Copy link
Contributor

ARGUMENT arguments

Other implementations of printf permit arbitrary data to be passed to printf. The only restriction is that a null byte terminates FORMAT and ARGUMENT argument strings (since they are C strings).

The current implementation only accepts FORMAT and ARGUMENT arguments that are valid UTF-8 (this is being enforced by clap).

This commit removes the UTF-8 validation by switching to OsStr and OsString.

This allows users to use printf to transmit or reformat null-safe but not UTF-8-safe data, such as text encoded in an 8-bit text encoding. See the non_utf_8_input test for an example (ISO-8859-1 text).

ARGUMENT arguments

Other implementations of `printf` permit arbitrary data to be passed
to `printf`. The only restriction is that a null byte terminates
FORMAT and ARGUMENT argument strings (since they are C strings).

The current implementation only accepts FORMAT and ARGUMENT
arguments that are valid UTF-8 (this is being enforced by clap).

This commit removes the UTF-8 validation by switching to OsStr
and OsString.

This allows users to use `printf` to transmit or reformat null-safe
but not UTF-8-safe data, such as text encoded in an 8-bit text
encoding. See the `non_utf_8_input` test for an example (ISO-8859-1
text).
@andrewliebenow
Copy link
Contributor Author

Probably needs some refinement. If merged, will resolve #6804 (the echo half of that issue is already resolved).

Copy link

GNU testsuite comparison:

Congrats! The gnu test tests/timeout/timeout is no longer failing!

@@ -126,3 +141,50 @@ fn extract_value<T: Default>(p: Result<T, ParseError<'_, T>>, input: &str) -> T
}
}
}

pub fn bytes_from_os_str(input: &OsStr) -> UResult<&[u8]> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please add a unit test for this function?
and get_str_or_exit_with_error if possible?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit I just pushed cleans up the error handling. I renamed this function to try_get_bytes_from_os_str. I'm not sure how to test this, because on most platforms this operation never fails. I did add a test for this much easier to trigger error:

0f3bc3f#diff-6ea56f426af08b614e48fe26ab7345fd0f2901e87b6b77d5ccf37e4eff6dd3d1

❯ ./target/release/printf '%d' "$(coreutils printf 'Swer an rehte g\xFCete')"
./target/release/printf: invalid (non-UTF-8) argument like 'Swer an rehte g�ete' encountered

Copy link

GNU testsuite comparison:

Congrats! The gnu test tests/timeout/timeout is no longer failing!

Copy link

GNU testsuite comparison:

Congrats! The gnu test tests/timeout/timeout is no longer failing!

@RenjiSann
Copy link
Contributor

I am working on the handling of non-UTF-8 inputs in cksum and I realize we might be duplicating code.
I had to use some conversion utils as well, which I did put in uucore/src/lib/lib.rs, as my usage is not for CLI arguments strictly speaking.

See the related commit

I think we should organize ourselves to not duplicate code.
Is it ok for you to put a common implementation in lib.rs, so it can be used by anything else in the future ?

@andrewliebenow
Copy link
Contributor Author

I am working on the handling of non-UTF-8 inputs in cksum and I realize we might be duplicating code. I had to use some conversion utils as well, which I did put in uucore/src/lib/lib.rs, as my usage is not for CLI arguments strictly speaking.

See the related commit

I think we should organize ourselves to not duplicate code. Is it ok for you to put a common implementation in lib.rs, so it can be used by anything else in the future ?

If you think your PR will be merged soonish, I will convert this PR to a draft, wait for your PR to be merged, and then update this branch to use those conversion functions.

@RenjiSann
Copy link
Contributor

If you think your PR will be merged soonish, I will convert this PR to a draft, wait for your PR to be merged, and then update this branch to use those conversion functions.

It was just merged 👍

@andrewliebenow andrewliebenow force-pushed the printf-allow-non-utf-8 branch 2 times, most recently from 6a041b9 to a65a474 Compare October 24, 2024 17:13
@andrewliebenow
Copy link
Contributor Author

@RenjiSann I've included some changes to the code you just added. Please let me know if you have any issues with the changes.

@RenjiSann
Copy link
Contributor

I'm good with all of it.

The precaution regarding the potentially overlapping cfgs is a bit overkill for this specific case (A and not(A)), but is a good practice overall 👍

Copy link

GNU testsuite comparison:

Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)

Copy link

GNU testsuite comparison:

Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)

@sylvestre
Copy link
Contributor

A few lines are not covered by unit tests (low code coverage)
Could you please improve that? thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants