From: David Malcolm <dmalcolm@redhat.com>
To: Eli Zaretskii <eliz@gnu.org>
Cc: 25987@debbugs.gnu.org
Subject: bug#25987: 25.2; support gcc fixit notes
Date: Sat, 14 Nov 2020 14:46:29 -0500
Message-ID: <af94af035a322df0913e258b5845d7461969acb8.camel@redhat.com>
In-Reply-To: <83tutsuihm.fsf@gnu.org>

On Sat, 2020-11-14 at 16:21 +0200, Eli Zaretskii wrote:
> > From: David Malcolm <dmalcolm@redhat.com>
> > Cc: 25987@debbugs.gnu.org
> > Date: Fri, 13 Nov 2020 11:47:18 -0500
> >
> > The names are identifiers from the user's program (names of variables,
> > types, macros, etc), where an error has been issued, typically due to a
> > misspelling of an identifier. For example, somewhere there's a
> > declaration of a constant named "two_π", and later the code erroneously
> > references it as "two_pi"; we want to emit a diagnostic saying:
> >   did you mean "two_π"?
> > and provide a machine-readable fix-it hint suggesting the replacement
> > of the pertinent source range with "two_π".
> >
> > GCC converts the source code from any encoding specified by
> > -finput-charset= to use UTF-8 internally...
> >
> > https://gcc.gnu.org/onlinedocs/cpp/Character-sets.html
>
> And then GCC outputs these identifiers in UTF-8? Or does it convert
> back to the original input-charset?

It emits them as UTF-8 when emitting diagnostics.

> > ...however there's a bug in GCC in how we print the source code itself,
> > where we blithely emit the undecoded bytes directly to stderr when
> > quoting the lines of source. This GCC bug is
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93067 (aka PR
> > other/93067). We ought to encode the source code into UTF-8 when
> > printing it (which may be a no-op for the common case).
>
> I'm not sure you are right here: I think it is better for GCC to use
> the original bytestream, because the user's locale might not support
> UTF-8 well; it is better to show the source to the user in the
> encoding in which it was written.

This seems to me to lead to a bigger question: what should the encoding
of GCC's stderr be? Right now I believe we emit a mix of UTF-8 and
other encodings, as noted in my earlier post.
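
To make the mixed-encoding concern concrete, here is a minimal Python
sketch. The byte strings are invented for illustration (they are not
captured GCC output), and decodes_as_utf8 is a hypothetical helper. It
assumes a diagnostic message emitted as UTF-8 followed by a quoted
source line re-emitted verbatim in Latin-1; a consumer that decodes the
whole stream as UTF-8 then fails on the quoted line.

```python
# Made-up bytes standing in for stderr: a UTF-8 diagnostic followed by
# a source line quoted verbatim from a Latin-1 encoded file.
diagnostic = 'error: did you mean "two_π"?\n'.encode("utf-8")
quoted_source = 'const char *s = "café";\n'.encode("latin-1")
stream = diagnostic + quoted_source


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if every byte of `data` forms valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(decodes_as_utf8(diagnostic))     # the UTF-8 part on its own
print(decodes_as_utf8(quoted_source))  # the Latin-1 part on its own
print(decodes_as_utf8(stream))         # the combined stream
```

Only the first check succeeds; the Latin-1 byte 0xE9 for "é" is not
followed by valid continuation bytes, so the stream as a whole cannot be
decoded with a single codec.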

> However, I'm not familiar with GCC internals, so it is not clear to me
> whether the bug report will indeed affect the way source fragments
> will be output: the bug report only talks about converting the input,
> and I don't know enough to understand how will that affect output.
>
> > The annotation lines we print under the source lines for fix-it
> > hints and labels are already printed in UTF-8, however.
>
> The annotations are in US English, though, right? If not, when will
> they include non-ASCII characters?

Annotation lines can contain labels as of GCC 9, and these can contain
identifiers; for example in this C++ type mismatch error, where the
types of the pertinent expressions are labeled:

$ g++ t.cc
t.cc: In function 'int test(const shape&, const shape&)':
t.cc:15:4: error: no match for 'operator+' (operand types are
'boxed_value<double>' and 'boxed_value<double>')
   14 |   return (width(s1) * height(s1)
      |           ~~~~~~~~~~~~~~~~~~~~~~
      |                     |
      |                     boxed_value<[...]>
   15 |    + width(s2) * height(s2));
      |    ^ ~~~~~~~~~~~~~~~~~~~~~~
      |                |
      |                boxed_value<[...]>
where "boxed_value" is an identifier and in theory could have non-ASCII
characters in it.

> > That said, the above bug is orthogonal to the fix-it hint issue, which
> > prints the names in a different way (using UTF-8 encoded strings in
> > GCC's symbol table, rather than scraping them from the filesystem,
> > which is how the buggy source-quoting routines work).
> > [...]
> > As far as I can tell GCC handles filenames as raw bytes, and doesn't
> > make any attempt to decode them, and emits them as bytes again in
> > diagnostic messages.
>
> This is okay, but since the other parts are in UTF-8, this will
> complicate things, as I mentioned in my previous message.
>
> > > > I tried creating file with the name "byte 0xff" .txt, and with valid
> > > > UTF-8 non-ASCII names and emacs reported them as \377.txt and with
> > > > the UTF-8 names respectively, so perhaps I should simply emit the
> > > > bytes and pretend they are UTF-8?
> > >
> > > What do you mean by "pretend" in this context?
> >
> > By "pretend" I mean simply re-emitting the bytes of the filename to
> > stderr and ignoring encoding issues in them, despite the fact that the
> > rest of the stream is supposed to be UTF-8-encoded.
>
> As explained, it will be easier for Emacs to process GCC output if its
> encoding is consistent.

Indeed. I'll raise this issue on the GCC mailing list.

> > Currently the parseable-fixits option uses IS_PRINT on each "char"
> > (i.e. byte) so that any non-printable bytes get octal-escaped. Is that
> > acceptable for filenames? The other approach, to "pretend they're
> > UTF-8", would mean to not escape such bytes, so that if they are UTF-8
> > they are faithfully re-emitted.
> >
> > I think I like the approach where the filename part of the fixit line
> > is octal-escaped, and the replacement text is UTF-8, but I don't know
> > what's going to be best for you.
>
> Given your description, it sounds like it will not be simple whatever
> you do.
>
> I guess we should first try getting the plain-ASCII case to work, as
> that is the most frequent use case anyway.
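
For reference, the octal-escaping of filename bytes discussed above can
be sketched as follows. This is a rough Python approximation, not GCC's
code: octal_escape is a hypothetical helper, it assumes that "printable"
means the ASCII range 0x20 to 0x7E, and it ignores any special handling
of quotes or backslashes that a real implementation would need.

```python
def octal_escape(raw: bytes) -> str:
    """Escape each non-printable byte of `raw` as a 3-digit octal sequence.

    Assumption: "printable" means ASCII 0x20..0x7E. Every other byte,
    including each byte of a multi-byte UTF-8 sequence, is escaped.
    """
    out = []
    for b in raw:
        if 0x20 <= b <= 0x7E:
            out.append(chr(b))         # printable ASCII passes through
        else:
            out.append("\\%03o" % b)   # e.g. 0xff becomes \377
    return "".join(out)


print(octal_escape(b"t.cc"))                   # plain ASCII name
print(octal_escape(b"\xff.txt"))               # the "byte 0xff" filename
print(octal_escape("π.txt".encode("utf-8")))   # valid UTF-8 name
```

Under these assumptions a plain-ASCII name is unchanged, the 0xff name
comes out as \377.txt, and a valid UTF-8 name is escaped byte by byte
(\317\200.txt), which keeps the filename field pure ASCII at the cost of
readability.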

I added some test cases and posted the patch to the gcc-patches mailing
list here:

  "[PATCH/RFC] Add GCC_EXTRA_DIAGNOSTIC_OUTPUT environment variable for
  fix-it hints"
  https://gcc.gnu.org/pipermail/gcc-patches/2020-November/559105.html

Thanks
Dave