From: Danny Milosavljevic
Subject: Re: [PATCH python-tests] gnu: python-2.7: Enable UCS-4 Unicode encoding.
Date: Tue, 24 Jan 2017 00:46:04 +0100
Message-ID: <20170124004604.4ba3ad2c@scratchpost.org>
In-Reply-To: <87r33tl7qb.fsf@gnu.org>
References: <20170122233159.2622-1-dannym@scratchpost.org> <87sho968o5.fsf@kirby.i-did-not-set--mail-host-address--so-tickle-me> <87r33tl7qb.fsf@gnu.org>
To: Ludovic Courtès
Cc: guix-devel@gnu.org

Hi Ludo,

> > Otherwise LGTM.  I checked some other distros and they seem to have this
> > enabled.  Thanks!
>
> That means that strings are internally UCS-4-encoded, right?  What’s the
> rationale, and what happens when this flag is omitted?

The CPython C interface changes depending on the flag and some Python extensions don't work with the narrow UTF-16 Unicode - which is what it would use if you don't specify.

The default, UTF-16, is basically just historical baggage from when Unicode had fewer than 65536 codepoints in the standard.

The max codepoint used nowadays is 1114111.

UCS-4 encoding means that just one 32-bit word encodes one Unicode codepoint (it's 1:1).
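To make the 1:1 point concrete, here is a minimal sketch. It assumes a Python 3 interpreter, and the names MAX_CODEPOINT and code_units are made up for illustration. It counts how many 16-bit and 32-bit units one codepoint needs:

```python
# Assumes Python 3, where chr() accepts any codepoint up to 0x10FFFF.
MAX_CODEPOINT = 0x10FFFF  # 1114111, the highest codepoint Unicode defines


def code_units(codepoint):
    """Return (UTF-16 units, UTF-32 units) needed to encode one codepoint."""
    ch = chr(codepoint)
    utf16_units = len(ch.encode("utf-16-le")) // 2  # count of 16-bit units
    utf32_units = len(ch.encode("utf-32-le")) // 4  # count of 32-bit units
    return utf16_units, utf32_units


# Codepoints below 65536 fit in one 16-bit unit; anything above needs a
# surrogate pair in UTF-16, while UTF-32/UCS-4 always uses exactly one unit.
for cp in (0x41, 0xFFFF, 0x10000, MAX_CODEPOINT):
    print(hex(cp), code_units(cp))
```

Surrogate codepoints (0xD800 to 0xDFFF) are deliberately left out of the loop, since they cannot be encoded on their own.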
UCS-4 is the most straightforward encoding if you don't care about size wastage. If you *do* care about size wastage, you use UTF-8.

Only if you are tied down by some kind of backward compatibility constraints do you use UTF-16 or UCS-2 (the latter has no way AT ALL to encode codepoints over 65535 - but UTF-16 uses a variable-length encoding to represent those).

Python builds on Microsoft Windows and Mac OS X usually use UTF-16 for Unicode strings, while on GNU/Linux distributions we usually use UCS-4.

Python 3 does the obvious thing and has only one string class and (since 3.3) switches the internal string encoding depending on what codepoints are used. That way the user is none the wiser and it still saves space.

But Python 2.7 still has "strings" and "unicode strings", which are distinct types with no such optimizations.

So this patch basically just makes sure that we do the same as other distributions so that all the Python 2.7 extensions work.
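For anyone who wants to check which kind of build they have, here is a rough sketch. The helper names unicode_build and encoded_sizes are hypothetical; only sys.maxunicode and str.encode are real interpreter features. On a narrow Python 2 build sys.maxunicode is 0xFFFF, on a wide build it is 0x10FFFF.

```python
import sys


def unicode_build(maxunicode=None):
    """Classify an interpreter as 'narrow' or 'wide' from sys.maxunicode."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    return "wide" if maxunicode > 0xFFFF else "narrow"


def encoded_sizes(text):
    """Return the byte size of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


print(unicode_build())
print(encoded_sizes(u"hello"))
```

The second helper only illustrates the size trade-off: for plain ASCII text UTF-8 needs one byte per character, where UTF-32/UCS-4 needs four.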