From: Eli Zaretskii
Newsgroups: gmane.emacs.bugs
Subject: bug#24206: 25.1; Curly quotes generate invalid strings, leading to a segfault
Date: Mon, 15 Aug 2016 19:09:40 +0300
Message-ID: <83popaf1yz.fsf@gnu.org>
References: <8337m7h1dp.fsf@gnu.org> <83zioffew5.fsf@gnu.org>
Reply-To: Eli Zaretskii
To: Paul Eggert
Cc: p.stephani2@gmail.com, johnw@gnu.org, nicolas@petton.fr, 24206@debbugs.gnu.org
In-reply-to: (message from Paul Eggert on Sun, 14 Aug 2016 19:04:42 -0700)

> Cc: p.stephani2@gmail.com, 24206@debbugs.gnu.org, johnw@gnu.org,
>  nicolas@petton.fr
> From: Paul Eggert
> Date: Sun, 14 Aug 2016 19:04:42 -0700
>
> Eli Zaretskii wrote:
> > Its multibyteness is entirely in Emacs's imagination.
>
> Sure, but Emacs should not substitute "\342\200\230" for "`". The point of
> text-quoting-style is to substitute quotes, not byte string encodings of quotes.

I'm not sure.  We never discussed what Emacs should do when
substitute-command-keys is called on a unibyte non-ASCII string which
requires quote substitution.  Other substitutions, including those
that produce ASCII quote characters, previously would leave the
unibyte string unibyte.  But with your changes, any substitution
converts the string into multibyte:

  (multibyte-string-p (substitute-command-keys "\200\\[goto-char]")) => t

I think this might be a subtle regression, because some code might
just find itself mixing multibyte and unibyte strings where previously
there were only unibyte strings.

> >> > More generally, Fsubstitute_command_keys is quite confused about unibyte
> >> > versus multibyte issues. It merges together a number of strings, and
> >> > assumes that they are all multibyte iff the original string is
> >> > multibyte, which is obviously not true in general.
> > Could you please point out the specific places where this is done?
>
> OK, here's a contrived example. Run this code in emacs-25:
>
> (progn
>   (setq km (make-keymap))
>   (define-key km "≠" 'global-set-key)
>   (substitute-command-keys "\200\\<km>\\[global-set-key]"))
>
> This should return a 2-character string equal to "\200≠".

I'm not sure your expectations are correct: as the original string is
unibyte, the output of "\200≠", which is multibyte, might not be what
the users expect.  They might expect "\200\342\211\240" instead.

> But in Emacs 25 it dumps core, at least on my platform (Fedora 23
> x86-64). And in Emacs 24 on my platform it returns a malformed
> string that prints as "\242\1340" but has length 2. I suppose we
> could make Emacs 24 dump core too, though I haven't tried hard to do
> that.

The errors are easily fixed, though.  Below I show 2 patches.  The
first one should go to master (after reverting yours), and IMO is also
safe enough for emacs-25.  But if it is deemed not safe enough for the
release, the second patch is safer.  The second patch doesn't produce
"\200≠" in your test case, but neither did Emacs 24, so this is not a
regression.

Comments?  Let's decide on what to do with emacs-25 first, since that
blocks the release, and then discuss master if needed.

Thanks.

--- src/doc.c~0	2016-06-20 08:49:44.000000000 +0300
+++ src/doc.c	2016-08-15 11:24:07.894579900 +0300
@@ -738,8 +738,9 @@ Otherwise, return a new string.  */)
   unsigned char const *start;
   ptrdiff_t length, length_byte;
   Lisp_Object name;
-  bool multibyte;
+  bool multibyte, pure_ascii;
   ptrdiff_t nchars;
+  Lisp_Object orig_string = Qnil;

   if (NILP (string))
     return Qnil;
@@ -752,6 +753,20 @@ Otherwise, return a new string.  */)
   enum text_quoting_style quoting_style = text_quoting_style ();

   multibyte = STRING_MULTIBYTE (string);
+  /* Pure-ASCII unibyte input strings should produce unibyte strings
+     if substitution doesn't yield non-ASCII bytes, otherwise they
+     should produce multibyte strings.  */
+  pure_ascii = SBYTES (string) == count_size_as_multibyte (SDATA (string),
+                                                           SCHARS (string));
+  /* If the input string is unibyte and includes non-ASCII characters,
+     make a multibyte copy, so as to be able to return the original
+     unibyte string if no substitution eventually happens.  */
+  if (!multibyte && !pure_ascii)
+    {
+      orig_string = string;
+      string = Fstring_make_multibyte (Fcopy_sequence (string));
+      multibyte = true;
+    }
   nchars = 0;

   /* KEYMAP is either nil (which means search all the active keymaps)
@@ -933,8 +948,8 @@ Otherwise, return a new string.  */)

         subst_string:
           start = SDATA (tem);
-          length = SCHARS (tem);
           length_byte = SBYTES (tem);
+          length = SCHARS (tem);
         subst:
           nonquotes_changed = true;
         subst_quote:
@@ -956,8 +971,8 @@ Otherwise, return a new string.  */)
                && quoting_style == CURVE_QUOTING_STYLE)
         {
           start = (unsigned char const *) (strp[0] == '`' ? uLSQM : uRSQM);
-          length = 1;
           length_byte = sizeof uLSQM - 1;
+          length = 1;
           idx = strp - SDATA (string) + 1;
           goto subst_quote;
         }
@@ -995,6 +1010,8 @@ Otherwise, return a new string.  */)
             }
         }
     }
+  else if (!NILP (orig_string))
+    tem = orig_string;
   else
     tem = string;
   xfree (buf);


--- src/doc.c~0	2016-06-20 08:49:44.000000000 +0300
+++ src/doc.c	2016-08-15 11:13:15.132137200 +0300
@@ -738,7 +738,7 @@ Otherwise, return a new string.  */)
   unsigned char const *start;
   ptrdiff_t length, length_byte;
   Lisp_Object name;
-  bool multibyte;
+  bool multibyte, pure_ascii;
   ptrdiff_t nchars;

   if (NILP (string))
@@ -752,6 +752,11 @@ Otherwise, return a new string.  */)
   enum text_quoting_style quoting_style = text_quoting_style ();

   multibyte = STRING_MULTIBYTE (string);
+  /* Pure-ASCII unibyte input strings should produce unibyte strings
+     if substitution doesn't yield non-ASCII bytes, otherwise they
+     should produce multibyte strings.  */
+  pure_ascii = SBYTES (string) == count_size_as_multibyte (SDATA (string),
+                                                           SCHARS (string));
   nchars = 0;

   /* KEYMAP is either nil (which means search all the active keymaps)
@@ -933,8 +938,11 @@ Otherwise, return a new string.  */)

         subst_string:
           start = SDATA (tem);
-          length = SCHARS (tem);
           length_byte = SBYTES (tem);
+          if (multibyte || pure_ascii)
+            length = SCHARS (tem);
+          else
+            length = length_byte;
         subst:
           nonquotes_changed = true;
         subst_quote:
@@ -956,8 +964,11 @@ Otherwise, return a new string.  */)
                && quoting_style == CURVE_QUOTING_STYLE)
         {
           start = (unsigned char const *) (strp[0] == '`' ? uLSQM : uRSQM);
-          length = 1;
           length_byte = sizeof uLSQM - 1;
+          if (multibyte || pure_ascii)
+            length = 1;
+          else
+            length = length_byte;
           idx = strp - SDATA (string) + 1;
           goto subst_quote;
         }
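
For illustration only, the length bookkeeping used in the second patch can
be sketched as standalone C.  This is a sketch under stated assumptions, not
Emacs code: the names sketch_size_as_multibyte, sketch_pure_ascii and
sketch_subst_length are hypothetical, and the first one only approximates
what count_size_as_multibyte is understood to do (one byte for each ASCII
byte, two bytes for each non-ASCII byte).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for count_size_as_multibyte (an assumption about
   its behavior): an ASCII byte needs one byte in the multibyte
   representation, any byte with the high bit set needs two.  */
static size_t
sketch_size_as_multibyte (const unsigned char *data, size_t nbytes)
{
  size_t total = 0;
  for (size_t i = 0; i < nbytes; i++)
    total += data[i] < 0x80 ? 1 : 2;
  return total;
}

/* Same comparison as the pure_ascii test in the patches: the byte count
   is unchanged by conversion to multibyte only if every byte is ASCII.  */
static bool
sketch_pure_ascii (const unsigned char *data, size_t nbytes)
{
  return nbytes == sketch_size_as_multibyte (data, nbytes);
}

/* Length to record for a substituted piece, as in the second patch:
   count characters when the result is multibyte or the input was pure
   ASCII; otherwise count bytes, so that the character count and the
   byte count of a unibyte result stay equal.  */
static size_t
sketch_subst_length (bool multibyte, bool pure_ascii,
                     size_t nchars, size_t nbytes)
{
  return (multibyte || pure_ascii) ? nchars : nbytes;
}
```

With these helpers, a curved quote substituted into a unibyte non-ASCII
string (1 character, 3 bytes in UTF-8) would be recorded with length 3,
while the same substitution into a multibyte string would be recorded with
length 1.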