From mboxrd@z Thu Jan  1 00:00:00 1970
From: Danny Milosavljevic <dannym@scratchpost.org>
Subject: Re: [PATCH python-tests] gnu: python-2.7: Enable UCS-4 Unicode
	encoding.
Date: Tue, 24 Jan 2017 00:46:04 +0100
Message-ID: <20170124004604.4ba3ad2c@scratchpost.org>
References: <20170122233159.2622-1-dannym@scratchpost.org>
	<87sho968o5.fsf@kirby.i-did-not-set--mail-host-address--so-tickle-me>
	<87r33tl7qb.fsf@gnu.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Return-path: <guix-devel-bounces+gcggd-guix-devel=m.gmane.org@gnu.org>
Received: from eggs.gnu.org ([2001:4830:134:3::10]:42705)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <dannym@scratchpost.org>) id 1cVoJU-0008Ex-A2
	for guix-devel@gnu.org; Mon, 23 Jan 2017 18:46:17 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <dannym@scratchpost.org>) id 1cVoJP-0005wa-Ez
	for guix-devel@gnu.org; Mon, 23 Jan 2017 18:46:16 -0500
In-Reply-To: <87r33tl7qb.fsf@gnu.org>
List-Id: "Development of GNU Guix and the GNU System distribution."
	<guix-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/guix-devel>,
	<mailto:guix-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/guix-devel/>
List-Post: <mailto:guix-devel@gnu.org>
List-Help: <mailto:guix-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/guix-devel>,
	<mailto:guix-devel-request@gnu.org?subject=subscribe>
Errors-To: guix-devel-bounces+gcggd-guix-devel=m.gmane.org@gnu.org
Sender: "Guix-devel" <guix-devel-bounces+gcggd-guix-devel=m.gmane.org@gnu.org>
To: Ludovic =?ISO-8859-1?Q?Court=E8s?= <ludo@gnu.org>
Cc: guix-devel@gnu.org

Hi Ludo,

> > Otherwise LGTM. I checked some other distros and they seem to have this
> > enabled. Thanks!
>
> That means that strings are internally UCS-4-encoded, right?  What’s the
> rationale, and what happens when this flag is omitted?

The CPython C interface changes depending on the flag, and some Python extensions don't work with the narrow UTF-16 Unicode build, which is what you get if you don't specify the flag.
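
To see which kind of build a given interpreter is, one can look at sys.maxunicode. This is only a minimal sketch; the helper name unicode_build_kind is made up for illustration and is not part of any Python API:

```python
import sys

def unicode_build_kind(maxunicode=None):
    """Classify an interpreter as a 'narrow' or 'wide' Unicode build.

    unicode_build_kind is a hypothetical helper name, used here only
    for illustration.
    """
    if maxunicode is None:
        maxunicode = sys.maxunicode
    # Narrow (UTF-16) builds report 0xFFFF here;
    # wide (UCS-4) builds report 0x10FFFF.
    return "narrow" if maxunicode == 0xFFFF else "wide"

print(unicode_build_kind())
```

Note that Python 3.3 and later always report 0x10FFFF, so the distinction only matters for older interpreters such as 2.7.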

The default, UTF-16, is basically just historical baggage from when Unicode had fewer than 65536 codepoints in the standard.

The maximum codepoint nowadays is 1114111 (0x10FFFF).

UCS-4 encoding means that just one 32-bit word encodes one Unicode codepoint (it's 1:1). It's the most straightforward encoding if you don't care about size wastage.

If you *do* care about size wastage, you use UTF-8.

You use UTF-16 or UCS-2 only if you are tied down by some kind of backward-compatibility constraint (UCS-2 has no way AT ALL to encode codepoints over 65535, whereas UTF-16 uses a variable-length encoding to represent those).
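
The size trade-off between the encodings can be illustrated with a small sketch. The function name encoded_sizes and the sample characters are just my choices for illustration; the little-endian codec variants are used so that no byte-order mark is counted:

```python
def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return tuple(len(text.encode(codec))
                 for codec in ("utf-8", "utf-16-le", "utf-32-le"))

# One ASCII letter, one Latin-1 letter, one other BMP character,
# and one codepoint above 65535.
samples = [u"A", u"\u00e9", u"\u20ac", u"\U0001F600"]

for s in samples:
    print(repr(s), encoded_sizes(s))
```

UTF-32 (UCS-4) always takes 4 bytes per codepoint, UTF-8 takes 1 to 4, and UTF-16 takes 2 bytes except for the last sample, which needs a surrogate pair (4 bytes).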

Python builds on Microsoft Windows and Mac OS X usually use UTF-16 for Unicode strings, while on GNU/Linux distributions we usually use UCS-4.

Python 3 (since 3.3) does the obvious thing: it has only one string class and switches the internal string encoding depending on what codepoints are used. That way the user is none the wiser and it still saves space.
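
This can be observed from Python itself, at least on CPython. The following is a rough sketch, assuming CPython 3.3 or later; bytes_per_char is a made-up helper name, and sys.getsizeof reports implementation-specific numbers, so other interpreters may behave differently:

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate per-character storage by comparing two string lengths.

    Subtracting the sizes of two strings of different length cancels
    out the fixed object header.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
    print("U+%04X" % ord(ch), bytes_per_char(ch))
```

On such an interpreter I would expect 1 byte per character for the first two samples, 2 for the third and 4 for the last one.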

But Python 2.7 still has "strings" and "unicode strings", which are separate types, with no such optimizations.

So this patch basically just makes sure that we do the same as other distributions so that all the Python 2.7 extensions work.