From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail
From: David Malcolm <dmalcolm@redhat.com>
Newsgroups: gmane.emacs.bugs
Subject: bug#25987: 25.2; support gcc fixit notes
Date: Fri, 13 Nov 2020 11:47:18 -0500
Message-ID: <0b88a592c7611d740b9dfa4bd4d853d14264be8d.camel@redhat.com>
References: <87lgsj1jle.fsf@tromey.com>
 <1521218887.2913.237.camel@redhat.com> <83muz7pyde.fsf@gnu.org>
 <f3dfa5f31852456d551ec5d330d53921f623265c.camel@redhat.com>
 <83o8lf9p68.fsf@gnu.org>
 <26f277bb345f10efe6340ac4074960905064fc97.camel@redhat.com>
 <83362i2nul.fsf@gnu.org>
 <8666386379d22239075d9237f00f40469c5be454.camel@redhat.com>
 <837drkopuf.fsf@gnu.org>
 <a5181d7c54cec863cc1c25d39154b5d1a2c15741.camel@redhat.com>
 <83mtzmznmw.fsf@gnu.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214";
	logging-data="3688"; mail-complaints-to="usenet@ciao.gmane.io"
User-Agent: Evolution 3.36.5 (3.36.5-1.fc32)
Cc: 25987@debbugs.gnu.org
To: Eli Zaretskii <eliz@gnu.org>
In-Reply-To: <83mtzmznmw.fsf@gnu.org>
Xref: news.gmane.io gmane.emacs.bugs:193240
Archived-At: <http://permalink.gmane.org/gmane.emacs.bugs/193240>

On Thu, 2020-11-12 at 15:54 +0200, Eli Zaretskii wrote:
> > From: David Malcolm <dmalcolm@redhat.com>
> > Cc: 25987@debbugs.gnu.org
> > Date: Wed, 11 Nov 2020 14:36:49 -0500
> > 
> > On Tue, 2020-10-20 at 18:54 +0300, Eli Zaretskii wrote:
> > > > From: David Malcolm <dmalcolm@redhat.com>
> > > > Cc: 25987@debbugs.gnu.org
> > > > Date: Tue, 20 Oct 2020 10:52:05 -0400
> > > > 
> > > > One possible issue: in the final diagnostic, there's a fix-it hint
> > > > with non-ASCII replacement text, replacing "two_pi" with "two_π"
> > > > (where the final char in the latter is GREEK SMALL LETTER PI,
> > > > U+03C0)
> > > > 
> > > > This replacement currently expressed as encoded bytes i.e:
> > > > 
> > > > fix-it:"demo.c":{51:10-51:16}:"two_\317\200"
> > > > 
> > > > where \317\200 is the octal-escaped representation of the two bytes
> > > > of the UTF-8 encoding of the character.
> > > > 
> > > > Is this going to work for Emacs?
> > > 
> > > You mean, GCC doesn't actually emit the UTF-8 encoding of π, it emits
> > > its ASCII-fied representation?  We'd need to decode that, but is that
> > > really justified?  Why not emit UTF-8?
> > 
> > I have an implementation that simply emits UTF-8 in quotes, escaping
> > backslash, tab, newline, and doublequotes as before.  (we have to
> > escape at least newline, given that fix-it hint replacement text can
> > contain them, and we're using newline to terminate the parseable hint).
> 
> Sorry, I've lost the context: where did those non-ASCII names come
> from? are they names of variables in the user's program?  

The names are identifiers from the user's program (names of variables,
types, macros, etc), where an error has been issued, typically due to a
misspelling of an identifier.  For example, somewhere there's a
declaration of a constant named "two_π", and later the code erroneously
references it as "two_pi"; we want to emit a diagnostic saying:
  did you mean "two_π"?
and provide a machine-readable fix-it hint suggesting the replacement
of the pertinent source range with "two_π".

GCC converts the source code from any encoding specified by
-finput-charset= to use UTF-8 internally...

https://gcc.gnu.org/onlinedocs/cpp/Character-sets.html
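
To make that conversion concrete, here is a rough Python sketch (not
GCC's code).  The function name and the choice of ISO-8859-7 as the
input charset are assumptions made purely for illustration:

```python
# Hypothetical illustration: source bytes in an assumed input charset
# (ISO-8859-7 here) are decoded and re-encoded as UTF-8, which is the
# effect -finput-charset= is described as having.
def to_internal_utf8(source_bytes: bytes, input_charset: str) -> bytes:
    """Decode bytes from input_charset and re-encode them as UTF-8."""
    return source_bytes.decode(input_charset).encode("utf-8")


legacy = "two_π".encode("iso8859_7")        # the identifier as stored on disk
internal = to_internal_utf8(legacy, "iso8859_7")
print(internal)                              # b'two_\xcf\x80'
```

The two trailing bytes 0xCF 0x80 are the UTF-8 encoding of U+03C0,
i.e. the \317\200 seen in the octal-escaped form.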

> If so, in
> what encoding does GCC quote portions of the source code in its
> warning/error messages?
>   Does it use the exact byte stream it found in
> the source, or does it perform any conversions of the encoding?

...however there's a bug in GCC in how we print the source code itself,
where we blithely emit the undecoded bytes directly to stderr when
quoting the lines of source.  This GCC bug is 
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93067 (aka PR
other/93067).  We ought to encode the source code into UTF-8 when
printing it (which may be a no-op for the common case).  The annotation
lines we print under the source lines for fix-it hints and labels are
already printed in UTF-8, however.

That said, the above bug is orthogonal to the fix-it hint issue: fix-it
hints get their names a different way (from the UTF-8 encoded strings in
GCC's symbol table, rather than by scraping them from the filesystem,
which is how the buggy source-quoting routines work).

> > However, the filename also needs to be escaped.  Currently I'm applying
> > the same escaping rules to both filename and replacement text.
> > What is the encoding of the filename?  What if the bytes in a filename
> > aren't UTF-8 encoded?  How does emacs handle this case?
> 
> Emacs has a separate variable for the encoding of file names, which
> gets set from the locale settings.  But this is not necessarily
> relevant to the issue at hand, because we are talking about processing
> output from a sub-process (GCC) which includes both file names and
> other stuff, such as fragments of the source code.  When Emacs
> processes sub-process output, it generally assumes all of it is
> encoded in the same encoding.  So if, for example, you encode
> non-ASCII variables in UTF-8 while the file names are emitted in some
> other encoding (perhaps because the locale's codeset is not UTF-8),
> then there will be complications: we will have to read the output from
> GCC in its raw form, and then decode "by hand" (in Lisp) each part of
> it as appropriate (which means we will need to be able to identify
> each such part).
> 
> So it's important to understand the situation and its limitations for
> proposing the best solution.

As far as I can tell, GCC handles filenames as raw bytes: it makes no
attempt to decode them, and emits them as bytes again in diagnostic
messages.

> > I tried creating a file with the name "byte 0xff".txt, and files with
> > valid UTF-8 non-ASCII names; emacs reported them as \377.txt and with
> > the UTF-8 names respectively, so perhaps I should simply emit the
> > bytes and pretend they are UTF-8?
> 
> What do you mean by "pretend" in this context?

By "pretend" I mean simply re-emitting the bytes of the filename to
stderr and ignoring encoding issues in them, despite the fact that the
rest of the stream is supposed to be UTF-8-encoded.
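
To show what a consumer of such a stream is up against, here is a
small Python sketch (an illustration only, nothing to do with how Emacs
decodes process output).  The filename is the "byte 0xff" .txt case;
the use of the "surrogateescape" error handler is my own choice of
example:

```python
# A filename whose bytes are not valid UTF-8 (the "byte 0xff".txt case).
name = b"\xff.txt"

# A strict UTF-8 decoder rejects the byte outright.
try:
    name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A tolerant decoder can smuggle the bad byte through and restore it later.
lossless = name.decode("utf-8", errors="surrogateescape")
round_trip = lossless.encode("utf-8", errors="surrogateescape")
```

Here strict decoding fails, while the tolerant variant round-trips the
original bytes, so "pretending" only works if the reader is prepared
for invalid sequences.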

Currently -fdiagnostics-parseable-fixits uses IS_PRINT on each "char"
(i.e. byte), so any non-printable byte gets octal-escaped.  Is that
acceptable for filenames?  The other approach, to "pretend they're
UTF-8", would mean not escaping such bytes, so that if they are UTF-8
they are faithfully re-emitted.

I think I like the approach where the filename part of the fixit line
is octal-escaped, and the replacement text is UTF-8, but I don't know
what's going to be best for you.
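
For what it's worth, here is how a reader of that mixed format might
look, sketched in Python rather than Lisp.  All the names below are
hypothetical, and the line layout is assumed to be the
fix-it:"FILE":{L:C-L:C}:"TEXT" form shown earlier in the thread:

```python
import re

_SIMPLE = {ord("n"): 0x0A, ord("t"): 0x09, ord('"'): 0x22, ord("\\"): 0x5C}


def unescape(quoted: bytes) -> bytes:
    """Undo \\n, \\t, \\", \\\\ and \\ooo escapes; other bytes pass through."""
    out = bytearray()
    i = 0
    while i < len(quoted):
        b = quoted[i]
        if b != 0x5C:                          # not a backslash
            out.append(b)
            i += 1
        elif quoted[i + 1] in _SIMPLE:
            out.append(_SIMPLE[quoted[i + 1]])
            i += 2
        else:                                  # assume three octal digits
            out.append(int(quoted[i + 1:i + 4], 8))
            i += 4
    return bytes(out)


_FIXIT = re.compile(
    rb'^fix-it:"((?:[^"\\]|\\.)*)":'
    rb'\{(\d+):(\d+)-(\d+):(\d+)\}:'
    rb'"((?:[^"\\]|\\.)*)"$'
)


def parse_fixit(line: bytes):
    """Return (filename_bytes, (line, col), (line, col), text) or None."""
    m = _FIXIT.match(line)
    if m is None:
        return None
    filename = unescape(m.group(1))            # raw bytes, encoding unknown
    start = (int(m.group(2)), int(m.group(3)))
    end = (int(m.group(4)), int(m.group(5)))
    replacement = unescape(m.group(6)).decode("utf-8")
    return filename, start, end, replacement
```

The filename comes back as bytes, since nothing is known about its
encoding, while the replacement text is decoded as UTF-8.  A malformed
octal escape would raise ValueError in this sketch.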

Hope the above clarifies things.

Dave