From: Michal Nazarewicz
Newsgroups: gmane.emacs.bugs
Subject: bug#24425: [PATCH] Don’t cast Unicode to 8-bit when casing unibyte strings
Date: Tue, 13 Sep 2016 00:46:07 +0200
Message-ID: <1473720367-2807-1-git-send-email-mina86@mina86.com>
To: 24425@debbugs.gnu.org

Currently, when operating on unibyte strings and buffers, if casing an
ASCII character results in a Unicode character, the result is forcibly
converted to 8-bit by masking all but the eight least significant bits.
This has awkward results such as:

    (let ((table (make-char-table 'case-table)))
      (set-char-table-parent table (current-case-table))
      (set-case-syntax-pair ?I ?ı table)
      (set-case-syntax-pair ?İ ?i table)
      (with-case-table table
        (concat (upcase "istanabul") " " (downcase "IRMA"))))
    => "0STANABUL 1rma"

Change the code so that ASCII characters being cased to Unicode
characters are left unchanged when operating on unibyte data.  In other
words, the aforementioned example will produce:

    => "iSTANABUL Irma"

Arguably this isn’t correct either, but it’s less wrong and there’s not
much we can do when the strings are unibyte.

Note that casify_object had a ‘(c >= 0 && c < 256)’ condition, but since
CHAR_TO_BYTE8 (and thus MAKE_CHAR_UNIBYTE) happily casts Unicode
characters to 8-bit (i.e. c & 0xFF), this never triggered for the case
discussed.

* src/casefiddle.c (casify_object, casify_region): When dealing with
unibyte data, don’t attempt to store Unicode characters in the result.
---
 src/casefiddle.c | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

Unless there are objections, I’ll commit it in a few days.

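For readers wondering where the "0" and "1" in the old output come from, here is a minimal stand-alone sketch of the masking described above.  It is not Emacs code: mask_to_byte and check_turkish_i are hypothetical names made up for illustration, and the function only mimics the "c & 0xFF" behaviour mentioned in the message.

```c
#include <assert.h>

/* Hypothetical stand-in for the old unibyte path: keep only the eight
   least significant bits of a character code.  Not Emacs's own macro.  */
int
mask_to_byte (int c)
{
  return c & 0xFF;
}

/* Check the two code points used by the case table in the example.  */
void
check_turkish_i (void)
{
  assert (mask_to_byte (0x0130) == '0');  /* U+0130 (İ) -> 0x30 */
  assert (mask_to_byte (0x0131) == '1');  /* U+0131 (ı) -> 0x31 */
}
```

U+0130 and U+0131 differ from the ASCII digits 0x30 and 0x31 only in bits above the low eight, which is why upcasing "i" produced "0" and downcasing "I" produced "1".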
diff --git a/src/casefiddle.c b/src/casefiddle.c
index 2d32f49..247cc6f 100644
--- a/src/casefiddle.c
+++ b/src/casefiddle.c
@@ -71,8 +71,8 @@ casify_object (enum case_action flag, Lisp_Object obj)
 	  {
 	    if (! inword)
 	      c = upcase1 (c1);
-	    if (! multibyte)
-	      MAKE_CHAR_UNIBYTE (c);
+	    if (! multibyte && CHAR_BYTE8_P (c))
+	      c = CHAR_TO_BYTE8 (c);
 	    XSETFASTINT (obj, c | flags);
 	  }
       return obj;
@@ -93,18 +93,19 @@ casify_object (enum case_action flag, Lisp_Object obj)
 	  c1 = c;
 	  if (inword && flag != CASE_CAPITALIZE_UP)
 	    c = downcase (c);
-	  else if (!uppercasep (c)
-		   && (!inword || flag != CASE_CAPITALIZE_UP))
-	    c = upcase1 (c1);
+	  else if (!inword || flag != CASE_CAPITALIZE_UP)
+	    c = upcase (c1);
 	  if ((int) flag >= (int) CASE_CAPITALIZE)
 	    inword = (SYNTAX (c) == Sword);
 	  if (c != c1)
 	    {
-	      MAKE_CHAR_UNIBYTE (c);
-	      /* If the char can't be converted to a valid byte, just don't
-		 change it.  */
-	      if (c >= 0 && c < 256)
-		SSET (obj, i, c);
+	      if (CHAR_BYTE8_P (c))
+		c = CHAR_TO_BYTE8 (c);
+	      else if (!ASCII_CHAR_P (c))
+		/* If the char can't be converted to a valid byte, just don't
+		   change it.  */
+		continue;
+	      SSET (obj, i, c);
 	    }
 	}
       return obj;
@@ -250,8 +251,11 @@ casify_region (enum case_action flag, Lisp_Object b, Lisp_Object e)
 
 	  if (! multibyte)
 	    {
-	      MAKE_CHAR_UNIBYTE (c);
-	      FETCH_BYTE (start_byte) = c;
+	      /* If the char can't be converted to a valid byte, just don't
+		 change it.  */
+	      if (ASCII_CHAR_P (c) ||
+		  (CHAR_BYTE8_P (c) && ((c = CHAR_TO_BYTE8 (c)), true)))
+		FETCH_BYTE (start_byte) = c;
 	    }
 	  else if (ASCII_CHAR_P (c2) && ASCII_CHAR_P (c))
 	    FETCH_BYTE (start_byte) = c;
-- 
2.8.0.rc3.226.g39d4020
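
The rule the patch applies to unibyte data (store ASCII as is, convert raw eight-bit characters back to their byte, otherwise leave the position untouched) can be sketched outside Emacs roughly as follows.  This is only an illustration under stated assumptions: sketch_ascii_p, sketch_byte8_p and unibyte_case_result are hypothetical helpers, and the constants assume that Emacs represents raw bytes 0x80..0xFF as characters 0x3FFF80..0x3FFFFF; the real code uses ASCII_CHAR_P, CHAR_BYTE8_P and CHAR_TO_BYTE8.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed layout of Emacs's raw eight-bit character range.  */
enum { BYTE8_OFFSET = 0x3FFF00, BYTE8_MIN = 0x3FFF80, BYTE8_MAX = 0x3FFFFF };

/* Hypothetical counterparts of ASCII_CHAR_P and CHAR_BYTE8_P.  */
bool sketch_ascii_p (int c) { return 0 <= c && c < 0x80; }
bool sketch_byte8_p (int c) { return BYTE8_MIN <= c && c <= BYTE8_MAX; }

/* Byte to keep at a unibyte position.  ORIG is the byte already there,
   CASED is the character the case table produced for it.  */
int
unibyte_case_result (int orig, int cased)
{
  if (sketch_ascii_p (cased))
    return cased;                 /* ASCII fits in a byte as is */
  if (sketch_byte8_p (cased))
    return cased - BYTE8_OFFSET;  /* raw eight-bit char -> its byte */
  return orig;                    /* not representable: leave unchanged */
}
```

With the case table from the example, unibyte_case_result ('i', 0x0130) gives back 'i', which matches the "iSTANABUL" result quoted in the message.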