From: Michal Nazarewicz
Newsgroups: gmane.emacs.bugs
Subject: bug#24425: [PATCH] Don’t cast Unicode to 8-bit when casing unibyte strings
Date: Thu, 15 Sep 2016 16:23:54 +0200
Organization: http://mina86.com/
References: <1473720367-2807-1-git-send-email-mina86@mina86.com> <83mvjb98f5.fsf@gnu.org>
In-Reply-To: <83mvjb98f5.fsf@gnu.org>
To: Eli Zaretskii
Cc: 24425@debbugs.gnu.org

On Tue, Sep 13 2016, Eli Zaretskii wrote:
> Currently, case changes in unibyte characters and strings are only
> well defined for pure ASCII text; if the input or the result is not
> pure ASCII, we produce "undefined behavior".

Would the following (not tested) make sense then:

diff --git a/src/casefiddle.c b/src/casefiddle.c
index 2d32f49..4dc2357 100644
--- a/src/casefiddle.c
+++ b/src/casefiddle.c
@@ -89,23 +89,19 @@ casify_object (enum case_action flag, Lisp_Object obj)
       for (i = 0; i < size; i++)
         {
           c = SREF (obj, i);
-          MAKE_CHAR_MULTIBYTE (c);
           c1 = c;
-          if (inword && flag != CASE_CAPITALIZE_UP)
-            c = downcase (c);
-          else if (!uppercasep (c)
-                   && (!inword || flag != CASE_CAPITALIZE_UP))
-            c = upcase1 (c1);
-          if ((int) flag >= (int) CASE_CAPITALIZE)
-            inword = (SYNTAX (c) == Sword);
-          if (c != c1)
+          if (ASCII_CHAR_P (c))
             {
-              MAKE_CHAR_UNIBYTE (c);
-              /* If the char can't be converted to a valid byte, just don't
-                 change it.  */
-              if (c >= 0 && c < 256)
-                SSET (obj, i, c);
+              if (inword && flag != CASE_CAPITALIZE_UP)
+                c = downcase (c);
+              else if (!uppercasep (c)
+                       && (!inword || flag != CASE_CAPITALIZE_UP))
+                c = upcase1 (c1);
             }
+          if ((int) flag >= (int) CASE_CAPITALIZE)
+            inword = (SYNTAX (c) == Sword);
+          if (c != c1 && ASCII_CHAR_P (c))
+            SSET (obj, i, c);
         }
       return obj;
     }
@@ -230,8 +226,9 @@ casify_region (enum case_action flag, Lisp_Object b, Lisp_Object e)
       else
         {
           c = FETCH_BYTE (start_byte);
-          MAKE_CHAR_MULTIBYTE (c);
           len = 1;
+          if (!ASCII_CHAR_P (c))
+            goto done;
         }
       c2 = c;
       if (inword && flag != CASE_CAPITALIZE_UP)
@@ -239,9 +236,6 @@ casify_region (enum case_action flag, Lisp_Object b, Lisp_Object e)
       else if (!uppercasep (c)
                && (!inword || flag != CASE_CAPITALIZE_UP))
         c = upcase1 (c);
-      if ((int) flag >= (int) CASE_CAPITALIZE)
-        inword = ((SYNTAX (c) == Sword)
-                  && (inword || !syntax_prefix_flag_p (c)));
       if (c != c2)
         {
           last = start;
@@ -250,8 +244,8 @@ casify_region (enum case_action flag, Lisp_Object b, Lisp_Object e)
 
           if (! multibyte)
             {
-              MAKE_CHAR_UNIBYTE (c);
-              FETCH_BYTE (start_byte) = c;
+              if (ASCII_CHAR_P (c))
+                FETCH_BYTE (start_byte) = c;
             }
           else if (ASCII_CHAR_P (c2) && ASCII_CHAR_P (c))
             FETCH_BYTE (start_byte) = c;
@@ -280,6 +274,10 @@ casify_region (enum case_action flag, Lisp_Object b, Lisp_Object e)
                 }
             }
         }
+    done:
+      if ((int) flag >= (int) CASE_CAPITALIZE)
+        inword = ((SYNTAX (c) == Sword)
+                  && (inword || !syntax_prefix_flag_p (c)));
       start++;
       start_byte += len;
     }

If working on non-ASCII characters isn’t supported we might just as
well skip all the logic that handles non-ASCII unibyte characters.

> Properly means that upcasing "istanbul" in the above example will
> produce "İSTANBUL", not "iSTANBUL", and downcasing "IRMA" will produce
> "ırma".

I thought about that but then another corner case is "istanbul\xff"
which is a unibyte string with 8-bit bytes.
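
As an illustration of that corner case, here is a minimal standalone
sketch (not Emacs code, and not part of the patch above) of what
ASCII-only casing of a unibyte buffer amounts to.  The helper name
ascii_upcase_unibyte is made up for this example; it upcases the bytes
a-z in place and leaves everything else, including raw 8-bit bytes
such as \xff, untouched:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of Emacs: upcase only the ASCII
   letters of a unibyte buffer, in place.  Bytes >= 0x80 are treated
   as raw 8-bit bytes and are never modified.  Returns the number of
   bytes that were changed.  */
size_t
ascii_upcase_unibyte (unsigned char *buf, size_t len)
{
  size_t changed = 0;

  assert (buf != NULL || len == 0);
  for (size_t i = 0; i < len; i++)
    {
      unsigned char c = buf[i];

      /* Only 'a'..'z' have an ASCII upper-case counterpart.  */
      if (c >= 'a' && c <= 'z')
        {
          buf[i] = (unsigned char) (c - 'a' + 'A');
          changed++;
        }
    }
  return changed;
}
```

Called on the 9-byte input "istanbul\xff", this rewrites the first
eight bytes to "ISTANBUL" and leaves the trailing \xff alone.  A
Unicode-aware rule that maps "i" to "İ" would have no single-byte
result to store back into such a string.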

I have no strong feelings either way so I’m happy just leaving it as
is as well.

-- 
Best regards
ミハウ “𝓶𝓲𝓷𝓪86” ナザレヴイツ
«If at first you don’t succeed, give up skydiving»