From mboxrd@z Thu Jan 1 00:00:00 1970
From: David Malcolm
Newsgroups: gmane.emacs.bugs
Subject: bug#25987: 25.2; support gcc fixit notes
Date: Fri, 13 Nov 2020 11:47:18 -0500
Message-ID: <0b88a592c7611d740b9dfa4bd4d853d14264be8d.camel@redhat.com>
References: <87lgsj1jle.fsf@tromey.com> <1521218887.2913.237.camel@redhat.com> <83muz7pyde.fsf@gnu.org> <83o8lf9p68.fsf@gnu.org> <26f277bb345f10efe6340ac4074960905064fc97.camel@redhat.com> <83362i2nul.fsf@gnu.org> <8666386379d22239075d9237f00f40469c5be454.camel@redhat.com> <837drkopuf.fsf@gnu.org> <83mtzmznmw.fsf@gnu.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
User-Agent: Evolution 3.36.5 (3.36.5-1.fc32)
Cc: 25987@debbugs.gnu.org
To: Eli Zaretskii
In-Reply-To: <83mtzmznmw.fsf@gnu.org>

On Thu, 2020-11-12 at 15:54 +0200, Eli Zaretskii wrote:
> > From: David Malcolm
> > Cc: 25987@debbugs.gnu.org
> > Date: Wed, 11 Nov 2020 14:36:49 -0500
> >
> > On Tue, 2020-10-20 at 18:54 +0300, Eli Zaretskii wrote:
> > > > From: David Malcolm
> > > > Cc: 25987@debbugs.gnu.org
> > > > Date: Tue, 20 Oct 2020 10:52:05 -0400
> > > >
> > > > One possible issue: in the final diagnostic, there's a fix-it
> > > > hint with non-ASCII replacement text, replacing "two_pi" with
> > > > "two_π" (where the final char in the latter is GREEK SMALL
> > > > LETTER PI, U+03C0)
> > > >
> > > > This replacement currently expressed as encoded bytes i.e:
> > > >
> > > >   fix-it:"demo.c":{51:10-51:16}:"two_\317\200"
> > > >
> > > > where \317\200 is the octal-escaped representation of the two
> > > > bytes
> > > > of the UTF-8 encoding of the character.
> > > >
> > > > Is this going to work for Emacs?
> > >
> > > You mean, GCC doesn't actually emit the UTF-8 encoding of π, it
> > > emits its ASCII-fied representation?  We'd need to decode that,
> > > but is that really justified?  Why not emit UTF-8?
> >
> > I have an implementation that simply emits UTF-8 in quotes,
> > escaping backslash, tab, newline, and doublequotes as before.  (we
> > have to escape at least newline, given that fix-it hint replacement
> > text can contain them, and we're using newline to terminate the
> > parseable hint).
>
> Sorry, I've lost the context: where did those non-ASCII names come
> from?  are they names of variables in the user's program?

The names are identifiers from the user's program (names of variables,
types, macros, etc), where an error has been issued, typically due to a
misspelling of an identifier.

For example, somewhere there's a declaration of a constant named
"two_π", and later the code erroneously references it as "two_pi"; we
want to emit a diagnostic saying:

  did you mean "two_π"?

and provide a machine-readable fix-it hint suggesting the replacement
of the pertinent source range with "two_π".

GCC converts the source code from any encoding specified by
-finput-charset= to use UTF-8 internally...
  https://gcc.gnu.org/onlinedocs/cpp/Character-sets.html

> If so, in
> what encoding does GCC quote portions of the source code in its
> warning/error messages?
> Does it use the exact byte stream it found in
> the source, or does it perform any conversions of the encoding?

...however there's a bug in GCC in how we print the source code itself,
where we blithely emit the undecoded bytes directly to stderr when
quoting the lines of source.  This GCC bug is
  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93067
(aka PR other/93067).  We ought to encode the source code into UTF-8
when printing it (which may be a no-op for the common case).
The annotation lines we print under the source lines for fix-it hints
and labels are already printed in UTF-8, however.

That said, the above bug is orthogonal to the fix-it hint issue, which
prints the names in a different way (using UTF-8 encoded strings in
GCC's symbol table, rather than scraping them from the filesystem,
which is how the buggy source-quoting routines work).

> > However, the filename also needs to be escaped.  Currently I'm
> > applying the same escaping rules to both filename and replacement
> > text.  What is the encoding of the filename?  What if the bytes in
> > a filename aren't UTF-8 encoded?  How does emacs handle this case?
>
> Emacs has a separate variable for the encoding of file names, which
> gets set from the locale settings.  But this is not necessarily
> relevant to the issue at hand, because we are talking about
> processing output from a sub-process (GCC) which includes both file
> names and other stuff, such as fragments of the source code.  When
> Emacs processes sub-process output, it generally assumes all of it is
> encoded in the same encoding.  So if, for example, you encode
> non-ASCII variables in UTF-8 while the file names are emitted in some
> other encoding (perhaps because the locale's codeset is not UTF-8),
> then there will be complications: we will have to read the output
> from GCC in its raw form, and then decode "by hand" (in Lisp) each
> part of it as appropriate (which means we will need to be able to
> identify each such part).
>
> So it's important to understand the situation and its limitations for
> proposing the best solution.

As far as I can tell GCC handles filenames as raw bytes, and doesn't
make any attempt to decode them, and emits them as bytes again in
diagnostic messages.
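
To make the "decode by hand" scenario concrete, here is a rough sketch
of what a consumer would have to do if the filename and the replacement
text arrive in different encodings.  It is written in Python purely for
illustration (not Emacs Lisp, and not code from GCC or Emacs).  The
line layout is the one quoted earlier in this thread; the names
parse_fixit and unescape, the filename_encoding parameter, and the
exact set of backslash escapes handled are assumptions.

```python
import re

# Assumed layout, taken from the example quoted in this thread:
#   fix-it:"FILE":{LINE:COL-LINE:COL}:"REPLACEMENT"
_FIXIT = re.compile(
    rb'^fix-it:"((?:[^"\\]|\\.)*)":'
    rb'\{(\d+):(\d+)-(\d+):(\d+)\}:'
    rb'"((?:[^"\\]|\\.)*)"$'
)

# A backslash followed by three octal digits (value <= 0377) or by any
# single byte.
_ESCAPE = re.compile(rb'\\([0-3][0-7]{2}|.)', re.DOTALL)

# Assumed spellings for the simple escapes.
_SIMPLE = {b'n': b'\n', b't': b'\t', b'\\': b'\\', b'"': b'"'}


def unescape(raw: bytes) -> bytes:
    """Turn backslash escapes back into the bytes they stand for."""
    def repl(match):
        body = match.group(1)
        if len(body) == 3:                  # octal escape such as \317
            return bytes([int(body, 8)])
        return _SIMPLE.get(body, body)      # \n, \t, \\, \" or unknown
    return _ESCAPE.sub(repl, raw)


def parse_fixit(line: bytes, filename_encoding: str = "utf-8"):
    """Parse one fix-it line read as raw bytes; return None on mismatch.

    The filename is decoded with filename_encoding (undecodable bytes
    are kept via surrogateescape); the replacement is decoded as UTF-8.
    """
    match = _FIXIT.match(line.rstrip(b"\n"))
    if match is None:
        return None
    name, l1, c1, l2, c2, text = match.groups()
    return {
        "file": unescape(name).decode(filename_encoding,
                                      errors="surrogateescape"),
        "start": (int(l1), int(c1)),
        "end": (int(l2), int(c2)),
        "replacement": unescape(text).decode("utf-8"),
    }
```

The point of the sketch is that the two quoted fields are identified
first, on the raw bytes, and only then decoded separately; decoding the
whole line with a single coding system would not work if the filename
bytes are not valid UTF-8.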
> > I tried creating file with the name "byte 0xff" .txt, and with
> > valid UTF-8 non-ascii names and emacs reported them as \377.txt and
> > with the UTF-8 names respectively, so perhaps I should simply emit
> > the bytes and pretend they are UTF-8?
>
> What do you mean by "pretend" in this context?

By "pretend" I mean simply re-emitting the bytes of the filename to
stderr and ignoring encoding issues in them, despite the fact that the
rest of the stream is supposed to be UTF-8-encoded.

Currently the parseable-fixits option uses IS_PRINT on each "char"
(i.e. byte) so that any non-printable bytes get octal-escaped.  Is that
acceptable for filenames?  The other approach, to "pretend they're
UTF-8", would mean to not escape such bytes, so that if they are UTF-8
they are faithfully re-emitted.

I think I like the approach where the filename part of the fixit line
is octal-escaped, and the replacement text is UTF-8, but I don't know
what's going to be best for you.

Hope the above clarifies things.
Dave
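
For illustration only, the split described above (octal-escaped
filename, UTF-8 replacement text) could be sketched as follows.  This
is Python, not GCC's C implementation.  The function names are
hypothetical, "printable" is approximated as the ASCII range 0x20 to
0x7E rather than GCC's actual IS_PRINT, and the spellings of the simple
escapes (\\, \", \n, \t) are assumptions.

```python
def escape_bytes(raw: bytes) -> str:
    """Escape a raw-byte filename so the result is pure ASCII.

    Backslash, doublequote, newline and tab get short escapes; any other
    byte outside printable ASCII (assumed 0x20..0x7E) is octal-escaped.
    """
    out = []
    for b in raw:
        ch = chr(b)
        if ch == "\\":
            out.append("\\\\")
        elif ch == '"':
            out.append('\\"')
        elif ch == "\n":
            out.append("\\n")
        elif ch == "\t":
            out.append("\\t")
        elif 0x20 <= b <= 0x7E:
            out.append(ch)
        else:
            out.append("\\%03o" % b)
    return "".join(out)


def escape_utf8_text(text: str) -> str:
    """Escape only backslash, tab, newline and doublequote.

    Everything else, including non-ASCII characters, passes through
    unchanged and is written out later as UTF-8.
    """
    table = {"\\": "\\\\", "\t": "\\t", "\n": "\\n", '"': '\\"'}
    return "".join(table.get(ch, ch) for ch in text)


def format_fixit(filename: bytes, start, end, replacement: str) -> bytes:
    """Build one fix-it line: ASCII-only filename, UTF-8 replacement."""
    head = 'fix-it:"%s":{%d:%d-%d:%d}:' % (escape_bytes(filename),
                                           *start, *end)
    return (head + '"%s"' % escape_utf8_text(replacement)).encode("utf-8")
```

With this split the whole line is valid UTF-8 whatever bytes the
filename contains, so a consumer could decode the line with one coding
system and then undo the octal escapes in the filename field only.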