From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED!not-for-mail From: Paul Eggert Newsgroups: gmane.emacs.bugs Subject: bug#24206: 25.1; Curly quotes generate invalid strings, leading to a segfault Date: Thu, 18 Aug 2016 11:33:36 -0700 Organization: UCLA Computer Science Department Message-ID: <7296d150-48aa-0c39-9421-63fffa499d2e@cs.ucla.edu> References: <8337m7h1dp.fsf@gnu.org> <83zioffew5.fsf@gnu.org> <83popaf1yz.fsf@gnu.org> <87bn0u3rqc.fsf@linux-m68k.org> <83mvkdg91i.fsf@gnu.org> <8b78f23f-4a4f-e568-b760-3350ca7bb8d3@cs.ucla.edu> <83d1l8g3zs.fsf@gnu.org> <4822bfeb-c507-a9ff-93bc-1d27ba93b9d7@cs.ucla.edu> <8360qzfmz3.fsf@gnu.org> <11031c1e-c784-0ba2-4b6c-4fab0cb92354@cs.ucla.edu> <83tweje0ct.fsf@gnu.org> <83lgzudu97.fsf@gnu.org> NNTP-Posting-Host: blaine.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: quoted-printable X-Trace: blaine.gmane.org 1471545323 11888 195.159.176.226 (18 Aug 2016 18:35:23 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Thu, 18 Aug 2016 18:35:23 +0000 (UTC) User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.2.0 Cc: johnw@gnu.org, 24206@debbugs.gnu.org To: Eli Zaretskii Original-X-From: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org Thu Aug 18 20:35:19 2016 Return-path: Envelope-to: geb-bug-gnu-emacs@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by blaine.gmane.org with esmtp (Exim 4.84_2) (envelope-from ) id 1baS9u-0002pn-C6 for geb-bug-gnu-emacs@m.gmane.org; Thu, 18 Aug 2016 20:35:18 +0200 Original-Received: from localhost ([::1]:53923 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1baS9r-0002hJ-Fv for geb-bug-gnu-emacs@m.gmane.org; Thu, 18 Aug 2016 14:35:15 -0400 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:58323) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1baS8l-000250-Jw for bug-gnu-emacs@gnu.org; Thu, 18 Aug 2016 14:34:08 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1baS8g-0000FB-R5 for bug-gnu-emacs@gnu.org; Thu, 18 Aug 2016 14:34:06 -0400 Original-Received: from debbugs.gnu.org ([208.118.235.43]:35902) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1baS8g-0000F7-NV for bug-gnu-emacs@gnu.org; Thu, 18 Aug 2016 14:34:02 -0400 Original-Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1baS8g-0002xl-GZ for bug-gnu-emacs@gnu.org; Thu, 18 Aug 2016 14:34:02 -0400 X-Loop: help-debbugs@gnu.org Resent-From: Paul Eggert Original-Sender: "Debbugs-submit" Resent-CC: bug-gnu-emacs@gnu.org Resent-Date: Thu, 18 Aug 2016 18:34:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 24206 X-GNU-PR-Package: emacs X-GNU-PR-Keywords: Original-Received: via spool by 24206-submit@debbugs.gnu.org id=B24206.147154523111373 (code B ref 24206); Thu, 18 Aug 2016 18:34:02 +0000 Original-Received: (at 24206) by debbugs.gnu.org; 18 Aug 2016 18:33:51 +0000 Original-Received: from localhost ([127.0.0.1]:33614 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1baS8U-0002xM-VV for submit@debbugs.gnu.org; Thu, 18 Aug 2016 14:33:51 -0400 Original-Received: from zimbra.cs.ucla.edu ([131.179.128.68]:57967) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1baS8T-0002x9-0e for 24206@debbugs.gnu.org; Thu, 18 Aug 2016 14:33:49 -0400 Original-Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 3249C160E3C; Thu, 18 Aug 2016 11:33:42 -0700 (PDT) Original-Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id 3ZQ6k1wtxh84; Thu, 18 Aug 2016 11:33:41 -0700 (PDT) Original-Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 31E6E160E3D; Thu, 18 Aug 2016 11:33:41 -0700 (PDT) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Original-Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id jQkRogaiGbrw; Thu, 18 Aug 2016 11:33:41 -0700 (PDT) Original-Received: from [192.168.1.9] (unknown [100.32.155.148]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 11CC0160E3C; Thu, 18 Aug 2016 11:33:41 -0700 (PDT) In-Reply-To: <83lgzudu97.fsf@gnu.org> X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-Received-From: 208.118.235.43 X-BeenThere: bug-gnu-emacs@gnu.org List-Id: "Bug reports for GNU Emacs, the Swiss army knife of text editors" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane.org@gnu.org Original-Sender: "bug-gnu-emacs" Xref: news.gmane.org gmane.emacs.bugs:122374 Archived-At: Eli Zaretskii wrote: > The code that got removed was the easy and intuitive part: it dealt > with processing single-byte strings one byte at a time. The > hard-to-read part of the code is still with us. We have less 'if' > conditionals, but that's hardly the main complication in the original > code. Sure, but removing unnecessary easy stuff lets the reader see the hard st= uff=20 more clearly. > You are missing my point: the code on master now processes a string, > that could be either unibyte or multibyte, using only multibyte > methods. With the flag in place, each kind of string would have used > the method that's natural with it. The way things are now, one has to > think hard about what the code does to convince oneself it's valid. The way things were before it was even harder, because one had to worry n= ot only=20 about processing Emacs-encoded text, one also had to worry about processi= ng=20 unibyte text containing non-ASCII bytes. The code is simpler now, because= it=20 needs only to process Emacs-encoded text. The old code might have flown despite its problems, if all the input data= were=20 consistent (i.e., either all unibyte, or all with Emacs-encoded text). Bu= t=20 inputs need not be consistent, so the old approach simply did not work. As I take it, your principal objection to the new code is not to its inte= rnals:=20 it's that substitute-command-keys can now return a multibyte string even = when=20 all the input data is unibyte. I don't think that's a big deal, but if th= is is=20 the primary reason for our lengthy conversation, I can move things forwar= d by=20 changing the code so that it instead returns a unibyte string when all th= e input=20 data are unibyte. Would that suffice? >> I don't see why it is tricky, we do that in Emacs in other places. >> >> Really? A call to STRING_CHAR_AND_LENGTH followed by a length test fol= lowed by a call to memcpy for length > 1 and a special case inline copy f= or length =3D=3D 1? When copying multibyte data? Where else does Emacs do= that? > > What exactly confuses you in that snippet? Nothing confuses me in that snippet. I know what the snippet does, now th= at I've=20 read and understood it. It is a longwinded and unnecessarily tricky way o= f doing=20 something simple. > The call to > STRING_CHAR_AND_LENGTH itself? we have that in umpteen other places. > The single-byte optimization of not calling memcpy? That's standard > practice in C. I'm not talking about each individual line of code in that snippet. I am = talking=20 about the entire construction. A reader must look at all 14 lines to dedu= ce what=20 it does, deduce that it's unnecessarily complicated, and deduce that the=20 unnecessary complication is not a sign that something unusual is going on= . No=20 place else in Emacs has this construction. > If you need an example for using > STRING_CHAR_AND_LENGTH while copying text, you can find it in > copy_text, for example. copy_text does something quite different. When copying multibyte text, it= does a=20 single memcpy for the entire string. copy_text does not call memcpy for e= ach=20 multibyte character, and nothing in copy_text is particularly close to th= e=20 snippet in question. > you replaced it with a > fall-through, which is harder to follow and easier to introduce subtle > bugs. True, but the fall-through is clearly marked in comments. The new way is = a bit=20 more efficent (smaller code space, more likely to fit in cache); the old = way=20 duplicated character-copying code, a practice that also introduces subtle= bugs=20 -- in addition to temporarily mystifying the reader because the old code=20 duplicates were not identical (itself a sign of cruft). With all that in = mind,=20 it was reasonable to switch from the old way to the new, despite the=20 disadvantage you mention. >> Alan wanted something that he could put into his .emacs that w= ould cause >> (message PERCENTLESS) to output the string PERCENTLESS as-is, = assuming >> PERCENTLESS lacks %. This was the point of his original bug re= port; his original >> example involved ` and ' but he wanted the same behavior for =E2= =80=98 and =E2=80=99, a point >> that became clear during the discussion of Bug#23425. >> >> Then why not for '..' as well? How is that different from =E2=80=98= ..=E2=80=99? >> >> It's not different. Alan wanted the same behavior for '..', and he got= that too. > > But the behavior is not the same: I was referring to Alan's desire to treat all quotes the same (i.e., to n= ot=20 substitute for any of them), which is now supported by setting=20 text-quoting-style to grave. > (let ((text-quoting-style 'curve)) > (substitute-command-keys "'foo'")) > =3D> =E2=80=99foo=E2=80=99 > > but > > (let ((text-quoting-style 'grave)) > (substitute-command-keys "=E2=80=98foo=E2=80=99")) > =3D> =E2=80=98foo=E2=80=99 > > I would have expected the first example to yield 'foo' No, substitute-command-keys works on each grave accent and apostrophe=20 separately, without looking at the others. As I recall it's worked that w= ay and=20 has been documented that way, in both master and emacs-25, ever since the= =20 feature was installed. One could posit a "smarter" form of substitution, = which=20 leaves 'foo' alone but which translates `foo'. Although we considered tha= t=20 possibility during design, we rejected it because it is more complicated = and has=20 more problems and quirks that are a pain to document and would surprise u= sers in=20 other ways.