unofficial mirror of bug-gnu-emacs@gnu.org 
From: David Malcolm <dmalcolm@redhat.com>
To: Eli Zaretskii <eliz@gnu.org>
Cc: 25987@debbugs.gnu.org
Subject: bug#25987: 25.2; support gcc fixit notes
Date: Sat, 14 Nov 2020 14:46:29 -0500
Message-ID: <af94af035a322df0913e258b5845d7461969acb8.camel@redhat.com>
In-Reply-To: <83tutsuihm.fsf@gnu.org>

On Sat, 2020-11-14 at 16:21 +0200, Eli Zaretskii wrote:
> > From: David Malcolm <dmalcolm@redhat.com>
> > Cc: 25987@debbugs.gnu.org
> > Date: Fri, 13 Nov 2020 11:47:18 -0500
> > 
> > The names are identifiers from the user's program (names of variables,
> > types, macros, etc), where an error has been issued, typically due to a
> > misspelling of an identifier.  For example, somewhere there's a
> > declaration of a constant named "two_π", and later the code erroneously
> > references it as "two_pi"; we want to emit a diagnostic saying:
> >   did you mean "two_π"?
> > and provide a machine-readable fix-it hint suggesting the replacement
> > of the pertinent source range with "two_π".
> > 
> > GCC converts the source code from any encoding specified by
> > -finput-charset= to use UTF-8 internally...
> > 
> > https://gcc.gnu.org/onlinedocs/cpp/Character-sets.html
> 
> And then GCC outputs these identifiers in UTF-8?  Or does it convert
> back to the original input-charset?

It emits them as UTF-8 when emitting diagnostics.
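
As a rough illustration of the conversion described above (not GCC's
actual code), here is a minimal Python sketch.  It assumes a source file
saved in ISO-8859-7, the kind of file one might compile with
-finput-charset=ISO-8859-7; the helper name to_utf8 is hypothetical.

```python
def to_utf8(raw: bytes, input_charset: str) -> bytes:
    """Re-encode bytes from the given input charset as UTF-8."""
    return raw.decode(input_charset).encode("utf-8")


# The identifier as it might sit on disk in an ISO-8859-7 source file.
source_bytes = "two_π".encode("iso8859_7")

# The same identifier in the UTF-8 form used for diagnostics.
utf8_bytes = to_utf8(source_bytes, "iso8859_7")

print(source_bytes)  # the π is a single byte in ISO-8859-7
print(utf8_bytes)    # the π is two bytes in UTF-8
```

The point is only that the same identifier has different byte
representations on disk and in the diagnostic text.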

> > ...however there's a bug in GCC in how we print the source code itself,
> > where we blithely emit the undecoded bytes directly to stderr when
> > quoting the lines of source.  This GCC bug is
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93067 (aka PR
> > other/93067).  We ought to encode the source code into UTF-8 when
> > printing it (which may be a no-op for the common case).
> 
> I'm not sure you are right here: I think it is better for GCC to use
> the original bytestream, because the user's locale might not support
> UTF-8 well; it is better to show the source to the user in the
> encoding in which it was written.

This seems to me to lead to a bigger question: what should the encoding
of GCC's stderr be?  Right now I believe we emit a mix of UTF-8 and
other encodings, as noted in my earlier post.
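
To make the consequence for a consumer concrete, here is a small Python
sketch of what a tool reading such a mixed stream might have to do: try
UTF-8 for each line and fall back to another encoding when that fails.
This is only a heuristic under my own assumptions (the function name and
the latin-1 fallback are hypothetical, and this is not what Emacs does);
a line in another encoding can happen to be valid UTF-8 and be silently
misread.

```python
def decode_diagnostic_line(raw: bytes, fallback: str = "latin-1") -> str:
    """Decode one line of compiler output, guessing at its encoding."""
    try:
        # Diagnostic text and fix-it hints are expected to be UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Quoted source lines or filenames may be in some other encoding;
        # the fallback here is an assumption, not something GCC reports.
        return raw.decode(fallback)


message = "did you mean 'two_π'?".encode("utf-8")
quoted_source = "caf\u00e9".encode("latin-1")

print(decode_diagnostic_line(message))
print(decode_diagnostic_line(quoted_source))
```

Having to guess per line like this is what a consistently encoded
stderr would avoid.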

> However, I'm not familiar with GCC internals, so it is not clear to me
> whether the bug report will indeed affect the way source fragments
> will be output: the bug report only talks about converting the input,
> and I don't know enough to understand how that will affect output.
> 
> > The annotation lines we print under the source lines for fix-it
> > hints and labels are already printed in UTF-8, however.
> 
> The annotations are in US English, though, right?  If not, when will
> they include non-ASCII characters?

Annotation lines can contain labels as of GCC 9, and these can contain
identifiers; for example in this C++ type mismatch error, where the
types of the pertinent expressions are labeled:
$ g++ t.cc
t.cc: In function 'int test(const shape&, const shape&)':
t.cc:15:4: error: no match for 'operator+' (operand types are
'boxed_value<double>' and 'boxed_value<double>')
   14 |   return (width(s1) * height(s1)
      |           ~~~~~~~~~~~~~~~~~~~~~~
      |                     |
      |                     boxed_value<[...]>
   15 |    + width(s2) * height(s2));
      |    ^ ~~~~~~~~~~~~~~~~~~~~~~
      |                |
      |                boxed_value<[...]>

where "boxed_value" is an identifier and in theory could have non-ASCII 
characters in it.

> > That said, the above bug is orthogonal to the fix-it hint issue, which
> > prints the names in a different way (using UTF-8 encoded strings in
> > GCC's symbol table, rather than scraping them from the filesystem,
> > which is how the buggy source-quoting routines work).
> > [...]
> > As far as I can tell GCC handles filenames as raw bytes, and doesn't
> > make any attempt to decode them, and emits them as bytes again in
> > diagnostic messages.
> 
> This is okay, but since the other parts are in UTF-8, this will
> complicate things, as I mentioned in my previous message.
> 
> > > > I tried creating a file with the name "byte 0xff".txt, and files
> > > > with valid UTF-8 non-ASCII names, and Emacs reported them as
> > > > \377.txt and with the UTF-8 names respectively, so perhaps I
> > > > should simply emit the bytes and pretend they are UTF-8?
> > > 
> > > What do you mean by "pretend" in this context?
> > 
> > By "pretend" I mean simply re-emitting the bytes of the filename to
> > stderr and ignoring encoding issues in them, despite the fact that the
> > rest of the stream is supposed to be UTF-8-encoded.
> 
> As explained, it will be easier for Emacs to process GCC output if its
> encoding is consistent.

Indeed.  I'll raise this issue on the GCC mailing list.

> > Currently the parseable-fixits option uses IS_PRINT on each "char"
> > (i.e. byte) so that any non-printable bytes get octal-escaped.  Is that
> > acceptable for filenames?  The other approach, to "pretend they're
> > UTF-8", would mean to not escape such bytes, so that if they are UTF-8
> > they are faithfully re-emitted.
> > 
> > I think I like the approach where the filename part of the fixit line
> > is octal-escaped, and the replacement text is UTF-8, but I don't know
> > what's going to be best for you.
> 
> Given your description, it sounds like it will not be simple whatever
> you do.
> 
> I guess we should first try getting the plain-ASCII case to work, as
> that is the most frequent use case anyway.

I added some test cases and posted the patch to the gcc-patches mailing
list here:
  "[PATCH/RFC] Add GCC_EXTRA_DIAGNOSTIC_OUTPUT environment variable for fix-it hints"
  https://gcc.gnu.org/pipermail/gcc-patches/2020-November/559105.html
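
For reference, the per-byte octal escaping of filenames discussed in the
quoted text above can be sketched roughly as follows.  This is a Python
approximation under my own assumptions, not GCC's implementation: the
function name is hypothetical, "printable" is taken to mean ASCII 0x20
to 0x7E, and any special handling of quotes or backslashes is ignored.

```python
def octal_escape(raw: bytes) -> str:
    """Escape every non-printable byte as a three-digit octal sequence."""
    out = []
    for byte in raw:
        if 0x20 <= byte <= 0x7E:
            # Printable ASCII is passed through unchanged.
            out.append(chr(byte))
        else:
            # Anything else, including each byte of a multi-byte UTF-8
            # character, becomes a backslash plus three octal digits.
            out.append("\\%03o" % byte)
    return "".join(out)


print(octal_escape(b"\xff.txt"))
print(octal_escape("two_π.c".encode("utf-8")))
```

Note that a valid UTF-8 filename is escaped byte by byte as well, so a
consumer has to unescape the octal sequences before it can decode the
name at all.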

Thanks
Dave






