unofficial mirror of bug-gnu-emacs@gnu.org 
From: Eli Zaretskii <eliz@gnu.org>
To: David Malcolm <dmalcolm@redhat.com>
Cc: 25987@debbugs.gnu.org
Subject: bug#25987: 25.2; support gcc fixit notes
Date: Sat, 14 Nov 2020 16:21:25 +0200
Message-ID: <83tutsuihm.fsf@gnu.org>
In-Reply-To: <0b88a592c7611d740b9dfa4bd4d853d14264be8d.camel@redhat.com> (message from David Malcolm on Fri, 13 Nov 2020 11:47:18 -0500)

> From: David Malcolm <dmalcolm@redhat.com>
> Cc: 25987@debbugs.gnu.org
> Date: Fri, 13 Nov 2020 11:47:18 -0500
> 
> The names are identifiers from the user's program (names of variables,
> types, macros, etc), where an error has been issued, typically due to a
> misspelling of an identifier.  For example, somewhere there's a
> declaration of a constant named "two_π", and later the code erroneously
> references it as "two_pi"; we want to emit a diagnostic saying:
>   did you mean "two_π"?
> and provide a machine-readable fix-it hint suggesting the replacement
> of the pertinent source range with "two_π".
> 
> GCC converts the source code from any encoding specified by
> -finput-charset= to use UTF-8 internally...
> 
> https://gcc.gnu.org/onlinedocs/cpp/Character-sets.html

And then GCC outputs these identifiers in UTF-8?  Or does it convert
back to the original input-charset?

> ...however there's a bug in GCC in how we print the source code itself,
> where we blithely emit the undecoded bytes directly to stderr when
> quoting the lines of source.  This GCC bug is 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93067 (aka PR
> other/93067).  We ought to encode the source code into UTF-8 when
> printing it (which may be a no-op for the common case).

I'm not sure you are right here: I think it is better for GCC to use
the original bytestream, because the user's locale might not support
UTF-8 well; it is better to show the source to the user in the
encoding in which it was written.

However, I'm not familiar with GCC internals, so it is not clear to me
whether the bug report will indeed affect the way source fragments
will be output: the bug report only talks about converting the input,
and I don't know enough to understand how that will affect output.
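
To make the mismatch concrete, here is a minimal Python sketch (not GCC
code).  It assumes, purely for illustration, that the source file is
written in ISO-8859-7 and declares two_π; the variable names are
hypothetical.  A quoted source line would carry the original bytes,
while a fix-it replacement would carry UTF-8, so one decoder cannot
read both.

```python
# Illustration only: the same identifier in the file's original encoding
# (assumed here to be ISO-8859-7) and in UTF-8.
identifier = "two_π"

original_bytes = identifier.encode("iso8859_7")  # as stored in the source file
utf8_bytes = identifier.encode("utf-8")          # as a UTF-8 fix-it hint would carry it

print(original_bytes)  # b'two_\xf0'
print(utf8_bytes)      # b'two_\xcf\x80'

# The original bytes are not valid UTF-8, so a consumer that assumes UTF-8
# for the whole stream cannot decode the quoted source line.
try:
    original_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

print(decodes_as_utf8)  # False
```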

> The annotation lines we print under the source lines for fix-it
> hints and labels are already printed in UTF-8, however.

The annotations are in US English, though, right?  If not, when will
they include non-ASCII characters?

> That said, the above bug is orthogonal to the fix-it hint issue, which
> prints the names in a different way (using UTF-8 encoded strings in
> GCC's symbol table, rather than scraping them from the filesystem,
> which is how the buggy source-quoting routines work).
> [...]
> As far as I can tell GCC handles filenames as raw bytes, and doesn't
> make any attempt to decode them, and emits them as bytes again in
> diagnostic messages.

This is okay, but since the other parts are in UTF-8, this will
complicate things, as I mentioned in my previous message.

> > > I tried creating a file with the name "byte 0xff" .txt, and with valid
> > > UTF-8 non-ASCII names, and Emacs reported them as \377.txt and with
> > > the UTF-8 names respectively, so perhaps I should simply emit the
> > > bytes and pretend they are UTF-8?
> > 
> > What do you mean by "pretend" in this context?
> 
> By "pretend" I mean simply re-emitting the bytes of the filename to
> stderr and ignoring encoding issues in them, despite the fact that the
> rest of the stream is supposed to be UTF-8-encoded.

As explained, it will be easier for Emacs to process GCC output if its
encoding is consistent.
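
As a rough illustration of the extra work an inconsistent stream puts
on the consumer, here is a Python sketch (this is not how Emacs decodes
process output; decode_mixed and the sample line are hypothetical).  It
decodes a line as UTF-8 and renders any byte that is not valid UTF-8 as
an octal escape, similar to the \377.txt display mentioned above.

```python
def decode_mixed(stream: bytes) -> str:
    """Decode as UTF-8, showing undecodable bytes as octal escapes."""
    # surrogateescape maps each undecodable byte 0xNN to U+DC00 + 0xNN.
    text = stream.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            # Recover the raw byte and show it as a three-digit octal escape.
            out.append("\\%03o" % (code - 0xDC00))
        else:
            out.append(ch)
    return "".join(out)


# Hypothetical line: a raw-byte filename followed by UTF-8 text.
line = b'\xff.txt: did you mean "two_\xcf\x80"?'
shown = decode_mixed(line)
print(shown)  # \377.txt: did you mean "two_π"?
```

The consumer has to guess, byte by byte, which parts of the line were
meant as UTF-8 and which were not; a stream in one encoding avoids that.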

> Currently the parseable-fixits option uses IS_PRINT on each "char"
> (i.e. byte) so that any non-printable bytes get octal-escaped.  Is that
> acceptable for filenames?  The other approach, to "pretend they're
> UTF-8", would mean to not escape such bytes, so that if they are UTF-8
> they are faithfully re-emitted.
> 
> I think I like the approach where the filename part of the fixit line
> is octal-escaped, and the replacement text is UTF-8, but I don't know
> what's going to be best for you.

Given your description, it sounds like it will not be simple whatever
you do.
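
For reference, the octal-escaping approach for filenames could look
roughly like the Python sketch below.  This is not GCC's code: the
function names are hypothetical, the printable range is assumed to be
ASCII 0x20..0x7E, and escaping the backslash itself is my assumption so
that the consumer can reverse the escaping without ambiguity.  Quote
characters are not handled here.

```python
def octal_escape(raw: bytes) -> str:
    """Escape every non-printable byte as a three-digit octal sequence."""
    out = []
    for b in raw:
        if b == 0x5C:
            # Backslash: doubled so that decoding is unambiguous (assumption).
            out.append("\\\\")
        elif 0x20 <= b <= 0x7E:
            out.append(chr(b))
        else:
            out.append("\\%03o" % b)
    return "".join(out)


def octal_unescape(escaped: str) -> bytes:
    """Reverse octal_escape, recovering the original filename bytes."""
    out = bytearray()
    i = 0
    while i < len(escaped):
        if escaped[i] == "\\":
            if escaped[i + 1] == "\\":
                out.append(0x5C)
                i += 2
            else:
                out.append(int(escaped[i + 1:i + 4], 8))
                i += 4
        else:
            out.append(ord(escaped[i]))
            i += 1
    return bytes(out)


# A filename made of a UTF-8 "π", ".c", and one stray 0xff byte.
name = "π.c".encode("utf-8") + b"\xff"
escaped = octal_escape(name)
print(escaped)  # \317\200.c\377
```

With this scheme the filename part is pure ASCII on the wire, and the
consumer gets the exact bytes back with octal_unescape, whether or not
they happen to be valid UTF-8.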

I guess we should first try getting the plain-ASCII case to work, as
that is the most frequent use case anyway.





