Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-46236: Add missing PyUnicode_GetDefaultEncoding() doc #130335

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

rruuaanng
Copy link
Contributor

@rruuaanng rruuaanng commented Feb 20, 2025

@bedevere-app bedevere-app bot mentioned this pull request Feb 20, 2025
33 tasks
@rruuaanng
Copy link
Contributor Author

cc @vstinner

@rruuaanng
Copy link
Contributor Author

rruuaanng commented Feb 20, 2025

As usual, the documentation does not require a NEWS (I hope I remember correctly).

I'm aware that this function returns a utf-8. Perhaps describing the function in this way would make it more robust (at the very least, we won't need to update its documentation if we modify it in the future).

@encukou
Copy link
Member

encukou commented Feb 20, 2025

I'd link to sys.getdefaultencoding, copy the doc from there, and add a .. impl-detail:: that it's always utf-8.

Perhaps this should be soft-deprecated.

@rruuaanng
Copy link
Contributor Author

Alright, something interesting happened. sys.getdefaultencoding always returns utf-8, but it's description says:

Return the name of the current default string encoding used by the Unicode implementation.

Should I modify its description? e.g., Return a "utf-8" string constant.

copy the doc from there, and add a .. impl-detail:: that it's always utf-8.

In other words, should I add .. impl-detail:: tag to sys.getdefaultencoding to point it to the PyUnicode_GetDefaultEncoding function? But here's the interesting part, it uses a completely new implementation, like the one below:

static PyObject *
get_utf8_unicode(void)
{
    _Py_DECLARE_STR(utf_8, "utf-8");
    PyObject *ret = &_Py_STR(utf_8);
    return Py_NewRef(ret);
}

Even so, do I still need to add this tag?

@encukou
Copy link
Member

encukou commented Feb 20, 2025

Hm, looking further, looks like the “current default string encoding used by the Unicode implementation” is what str.encode() uses. In Python 2, it was ascii; in Python 3 it's utf-8.

(click for Python 2 behaviour)
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> u'á'.encode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)

>>> import ctypes
>>> ctypes.pythonapi.PyUnicodeUCS4_SetDefaultEncoding("utf-8")  # hack!
0

>>> sys.getdefaultencoding()
'utf-8'
>>> 'á'.encode()
'\xc3\xa1'

It might be good to clarify this in the sys.getdefaultencoding docs, e.g.:

Return "utf-8". This is the name of the default string encoding, used in methods like str.encode.

(So, UTF-8 is not a CPython implementation detail.)

@encukou
Copy link
Member

encukou commented Feb 20, 2025

it uses a completely new implementation

That's needed to get a Python object, as opposed to a char*

@rruuaanng
Copy link
Contributor Author

rruuaanng commented Feb 20, 2025

Hmm, maybe I should open a separate PR to do it.

@encukou Please review! I hope it meets your expectations. #130353

@encukou
Copy link
Member

encukou commented Feb 20, 2025

I'd prefer including the sys.getdefaultencoding change here, so that the PyUnicode_GetDefaultEncoding doc can refer to it.

Comment on lines 617 to 618
Return a ``"utf-8"`` string constant, which corresponds to the
:func:`~sys.getdefaultencoding` function in Python.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest this wording, plus making lifetime explicit since we're returning a buffer.
(Currently the string is static of course, but if it'd ever become dynamic again, it'd need to be borrowed from the interpreter.)

Suggested change
Return a ``"utf-8"`` string constant, which corresponds to the
:func:`~sys.getdefaultencoding` function in Python.
Return the name of the default string encoding, ``"utf-8"``.
See :func:`sys.getdefaultencoding`.
The returned string does not need to be freed, and is valid
until interpreter shutdown.

@rruuaanng rruuaanng requested a review from encukou February 22, 2025 02:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants