From: Toke Høiland-Jørgensen
Newsgroups: gmane.emacs.devel
Subject: Re: distinguishing multibyte/unibyte ASCII
Date: Fri, 09 Sep 2016 22:17:58 +0200
Message-ID: <87fup87rpl.fsf@toke.dk>
To: Stefan Monnier
Cc: Eli Zaretskii, Alain Schneble, dgutov@yandex.ru, emacs-devel@gnu.org

Stefan Monnier writes:

>> If you just generate an ASCII string from ASCII characters, it will
>> usually be unibyte.  If you take it as a substring from a multibyte
>> buffer, it will usually be multibyte.
>
> And it's arguably a wart in Emacs's handling of chars-vs-bytes.
> But it's kind of hard to fix now.
>
> At some point I tried to change this handling (not exactly fix it) by
> treating multibyte ASCII strings specially (it's easy to recognize by
> checking that the char length is equal to the byte length and both are
> readily available in the "struct Lisp_String" object).  Then when we
> read an ASCII string, instead of making it unibyte, I'd keep it as
> multibyte.  And then change things like "concat" so that those "ASCII
> multibyte" strings don't force the result to be multibyte.
>
> My local Emacs still runs with those changes, but in the end I don't
> think the result is really better (or sufficiently better to justify
> the subtle incompatibilities it introduces).
>
> [ Also, I wouldn't be surprised to hear that such a change causes real
> problems with utf-7 or EBCDIC, or other systems where decoding/encoding
> a string of bytes/chars all <127 is not a no-op. ]

Isn't Unicode fun? :)

-Toke
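
For readers following along, the check Stefan describes (char length equal to byte length) and the relaxed "concat" rule can be sketched with a small model. This is only an illustration in Python, not Emacs code: `ModelString`, `is_ascii_multibyte` and `concat` are hypothetical names, the model assumes multibyte text is stored as UTF-8, and it ignores raw bytes entirely. It is one possible reading of the experiment, not a description of the actual patch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelString:
    """Hypothetical stand-in for a Lisp string: text plus a multibyte flag."""
    text: str
    multibyte: bool

    @property
    def nchars(self) -> int:
        # Length in characters.
        return len(self.text)

    @property
    def nbytes(self) -> int:
        # Assumption: multibyte text is stored as UTF-8;
        # a unibyte string uses exactly one byte per element.
        if self.multibyte:
            return len(self.text.encode("utf-8"))
        return len(self.text)


def is_ascii_multibyte(s: ModelString) -> bool:
    # In UTF-8 only ASCII characters occupy a single byte, so equal
    # character and byte lengths mean the string is pure ASCII.
    return s.multibyte and s.nchars == s.nbytes


def concat(*parts: ModelString) -> ModelString:
    # Sketch of the relaxed rule: only multibyte parts that contain
    # non-ASCII characters force a multibyte result.
    forces_multibyte = any(
        p.multibyte and not is_ascii_multibyte(p) for p in parts
    )
    return ModelString("".join(p.text for p in parts), forces_multibyte)


ascii_mb = ModelString("abc", multibyte=True)
ascii_ub = ModelString("def", multibyte=False)
non_ascii = ModelString("é", multibyte=True)

print(is_ascii_multibyte(ascii_mb))            # True
print(is_ascii_multibyte(non_ascii))           # False
print(concat(ascii_mb, ascii_ub).multibyte)    # False
print(concat(non_ascii, ascii_ub).multibyte)   # True

# The utf-7 caveat: every byte here is below 128, yet decoding
# does not give back the same characters.
print(b"+AOk-".decode("utf-7"))                # é
```

The last line is the kind of case the bracketed note warns about: once the coding system is not ASCII-compatible, "all bytes below 128" no longer means "decoding is a no-op", so treating such strings as interchangeable would be unsafe.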