bug#37633: Column part interpreted wrong in compilation mode

unofficial mirror of bug-gnu-emacs@gnu.org 

* bug#37633: Column part interpreted wrong in compilation mode
@ 2019-10-05 11:12 Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2019-10-05 16:08 ` Eli Zaretskii
  2022-04-23 13:36 ` Lars Ingebrigtsen
  0 siblings, 2 replies; 23+ messages in thread
From: Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2019-10-05 11:12 UTC (permalink / raw)
  To: 37633; +Cc: anton

[-- Attachment #1: Type: text/plain, Size: 2760 bytes --]

Compilers like gcc and others (e.g. gforth) output file:line:column on each 
error or warning.  However, “column” here is really the byte offset into the 
line (starting at 1).

Problems arise when tabs and UTF-8 glyphs are involved, e.g. compile

---------------test.c---------------
void foo() {
	printf("test %i", b);
	printf("test你好 %i", c);
}
---------------gcc test.c---------------
-*- mode: compilation; default-directory: "~/tmp/" -*-
Compilation started at Sat Oct  5 12:13:23

gcc test.c
test.c: In function ‘foo’:
test.c:2:2: warning: implicit declaration of function ‘printf’ [-Wimplicit-function-declaration]
    2 |  printf("test %i", b);
      |  ^~~~~~
test.c:2:2: warning: incompatible implicit declaration of built-in function ‘printf’
test.c:1:1: note: include ‘<stdio.h>’ or provide a declaration of ‘printf’
  +++ |+#include <stdio.h>
    1 | void foo() {
test.c:2:20: error: ‘b’ undeclared (first use in this function)
    2 |  printf("test %i", b);
      |                    ^
test.c:2:20: note: each undeclared identifier is reported only once for each function it appears in
test.c:3:26: error: ‘c’ undeclared (first use in this function)
    3 |  printf("test你好 %i", c);
      |                          ^

Compilation exited abnormally with code 1 at Sat Oct  5 12:13:23
---------------snip---------------

When you click on test.c:2:20, it gets you to the second t in 'test'; if you 
click on test.c:3:26, you end up on the '%'.  The expected result would be to 
have the cursor on 'b' and 'c'.
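
A minimal Python sketch of the mapping the expected behaviour needs. It is an illustration only, not Emacs code: the helper name is hypothetical, and it assumes the source line is UTF-8 and the reported column is a 1-based byte offset.

```python
def byte_column_to_char_index(line, byte_col, encoding="utf-8", first_column=1):
    """Map a compiler-style byte column to a 0-based character index in LINE."""
    raw = line.encode(encoding)
    # The compiler column is 1-based; clamp the byte offset to the line.
    offset = min(max(byte_col - first_column, 0), len(raw))
    # Decode only the bytes before the target; a partial trailing
    # character (offset inside a multibyte sequence) is dropped.
    return len(raw[:offset].decode(encoding, errors="ignore"))


line2 = '\tprintf("test %i", b);'
line3 = '\tprintf("test你好 %i", c);'
print(line2[byte_column_to_char_index(line2, 20)])
print(line3[byte_column_to_char_index(line3, 26)])
```

With the two lines from test.c this lands on 'b' and 'c', because a tab and each byte of a multibyte glyph count as one column each.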

The problem has been discussed here two years ago:

https://www.reddit.com/r/emacs/comments/5m3i59/ask_remacs_get_compile_mode_to_treat_column/

Suggested solution: Use byte-to-position to calculate the position in 
compilation-move-to-column.

Since debugging environments can also control Emacs e.g. through emacsclient 
+line:column file, I suggest adding a pattern that indicates that column here 
really means byte position, too, e.g. +line/byte or +line,byte or such. Or 
just interpret it as byte position, too.  gedit e.g. counts a tab as 1 if you 
open a file with +line:column options, but counts one UTF-8 glyph also as 1 
(which is not how compilers count).

Some programming languages convert unicode glyphs and other characters into 
internal character types (e.g. JavaScript), and then the gedit behavior or the 
behavior with compilation-error-screen-columns set to nil is probably ok.  
It's just that we need a byte mode here, too. True and false is not enough.

-- 
Bernd Paysan
"If you want it done right, you have to do it yourself"
net2o id: kQusJzA;7*?t=uy@X}1GWr!+0qqp_Cn176t4(dQ*
https://net2o.de/

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* bug#37633: Column part interpreted wrong in compilation mode
  2019-10-05 11:12 bug#37633: Column part interpreted wrong in compilation mode Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
@ 2019-10-05 16:08 ` Eli Zaretskii
  2019-10-05 16:16   ` Eli Zaretskii
  2019-10-05 16:58   ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2022-04-23 13:36 ` Lars Ingebrigtsen
  1 sibling, 2 replies; 23+ messages in thread
From: Eli Zaretskii @ 2019-10-05 16:08 UTC (permalink / raw)
  To: Bernd Paysan; +Cc: 37633, anton

> Cc: anton@mips.complang.tuwien.ac.at
> Date: Sat, 05 Oct 2019 13:12:34 +0200
> From: Bernd Paysan via "Bug reports for GNU Emacs,
>  the Swiss army knife of text editors" <bug-gnu-emacs@gnu.org>
> 
> Suggested solution: Use byte-to-position to calculate the position in 
> compilation-move-to-column.

This only works in UTF-8 locales, and is not 100% even there, so it
isn't the right solution.





^ permalink raw reply	[flat|nested] 23+ messages in thread

* bug#37633: Column part interpreted wrong in compilation mode
  2019-10-05 16:08 ` Eli Zaretskii
@ 2019-10-05 16:16   ` Eli Zaretskii
  2019-10-05 17:05     ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
                       ` (2 more replies)
  2019-10-05 16:58   ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
  1 sibling, 3 replies; 23+ messages in thread
From: Eli Zaretskii @ 2019-10-05 16:16 UTC (permalink / raw)
  To: bernd; +Cc: 37633, anton

> Date: Sat, 05 Oct 2019 19:08:21 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 37633@debbugs.gnu.org, anton@mips.complang.tuwien.ac.at
> 
> > Suggested solution: Use byte-to-position to calculate the position in 
> > compilation-move-to-column.
> 
> This only works in UTF-8 locales, and is not 100% even there, so it
> isn't the right solution.

In general, byte-to-position is meant to be used only for converting
between byte and character positions of text in Emacs buffers.

For byte offsets in external text we have bufferpos-to-filepos, but
that requires us to know the encoding of the external text.  We need
to find a reasonable way of getting that.  Suggestions and patches
welcome.





^ permalink raw reply	[flat|nested] 23+ messages in thread

* bug#37633: Column part interpreted wrong in compilation mode
  2019-10-05 16:08 ` Eli Zaretskii
  2019-10-05 16:16   ` Eli Zaretskii
@ 2019-10-05 16:58   ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
  1 sibling, 0 replies; 23+ messages in thread
From: Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2019-10-05 16:58 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 37633, anton

[-- Attachment #1: Type: text/plain, Size: 687 bytes --]

On Saturday, 5 October 2019, 18:08:21 CEST, Eli Zaretskii wrote:
> > Cc: anton@mips.complang.tuwien.ac.at
> > Date: Sat, 05 Oct 2019 13:12:34 +0200
> > From: Bernd Paysan via "Bug reports for GNU Emacs,
> > 
> >  the Swiss army knife of text editors" <bug-gnu-emacs@gnu.org>
> > 
> > Suggested solution: Use byte-to-position to calculate the position in
> > compilation-move-to-column.
> 
> This only works in UTF-8 locales, and is not 100% even there, so it
> isn't the right solution.

It's at least an improvement, though it's not perfect.

-- 
Bernd Paysan
"If you want it done right, you have to do it yourself"
net2o id: kQusJzA;7*?t=uy@X}1GWr!+0qqp_Cn176t4(dQ*
https://net2o.de/


* bug#37633: Column part interpreted wrong in compilation mode
  2019-10-05 16:16   ` Eli Zaretskii
@ 2019-10-05 17:05     ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2019-10-05 18:53       ` Eli Zaretskii
  2019-10-05 17:34     ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2019-10-06 12:31     ` Anton Ertl
  2 siblings, 1 reply; 23+ messages in thread
From: Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2019-10-05 17:05 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 37633, anton

[-- Attachment #1: Type: text/plain, Size: 1532 bytes --]

On Saturday, 5 October 2019, 18:16:53 CEST, Eli Zaretskii wrote:
> > Date: Sat, 05 Oct 2019 19:08:21 +0300
> > From: Eli Zaretskii <eliz@gnu.org>
> > Cc: 37633@debbugs.gnu.org, anton@mips.complang.tuwien.ac.at
> > 
> > > Suggested solution: Use byte-to-position to calculate the position in
> > > compilation-move-to-column.
> > 
> > This only works in UTF-8 locales, and is not 100% even there, so it
> > isn't the right solution.
> 
> In general, byte-to-position is meant to be used only for converting
> between byte and character positions of text in Emacs buffers.
> 
> For byte offsets in external text we have bufferpos-to-filepos, but
> that requires us to know the encoding of the external text.  We need
> to find a reasonable way of getting that.  Suggestions and patches
> welcome.

We can likely assume that the auto-detected encoding is the correct one, i.e. 
buffer-file-coding-system can be used (the default for the optional 
coding-system parameter of bufferpos-to-filepos and filepos-to-bufferpos).

I.e. go to the line selected, do a bufferpos-to-filepos on that position, add 
the column-1 to that, and do a filepos-to-bufferpos.  Jump there.

Problem with precision: "exact" requires encoding the entire file, so it's 
slow for large files.  Particularly with automatically generated files, this 
is likely not acceptable, so "approximate" could be good enough.
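
The steps above (file position of the line start, plus column-1, mapped back to a character position) can be sketched outside Emacs as follows. This is a Python illustration with a hypothetical function name. It assumes an ASCII-compatible encoding, "\n" line ends, and file contents that are valid in that encoding. Because it decodes the whole prefix of the file, it corresponds to the slow "exact" variant.

```python
def file_byte_position_to_char_position(data, line, byte_col, encoding):
    """Return the 0-based character position in the decoded file contents."""
    lines = data.split(b"\n")
    # Byte offset of the start of the requested (1-based) line.
    line_start = sum(len(l) + 1 for l in lines[:line - 1])
    line_end = line_start + len(lines[line - 1])
    # Add column-1 bytes, but never move past the end of the line.
    target = min(line_start + byte_col - 1, line_end)
    # Map the byte position back to a character position.
    return len(data[:target].decode(encoding, errors="ignore"))


source = 'void foo() {\n\tprintf("test %i", b);\n\tprintf("test你好 %i", c);\n}\n'
data = source.encode("utf-8")
pos = file_byte_position_to_char_position(data, 3, 26, "utf-8")
print(source[pos])
```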

-- 
Bernd Paysan
"If you want it done right, you have to do it yourself"
net2o id: kQusJzA;7*?t=uy@X}1GWr!+0qqp_Cn176t4(dQ*
https://net2o.de/


* bug#37633: Column part interpreted wrong in compilation mode
  2019-10-05 16:16   ` Eli Zaretskii
  2019-10-05 17:05     ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
@ 2019-10-05 17:34     ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2019-10-06 12:31     ` Anton Ertl
  2 siblings, 0 replies; 23+ messages in thread
From: Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2019-10-05 17:34 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 37633, anton

[-- Attachment #1: Type: text/plain, Size: 2131 bytes --]

On Saturday, 5 October 2019, 18:16:53 CEST, Eli Zaretskii wrote:
> > Date: Sat, 05 Oct 2019 19:08:21 +0300
> > From: Eli Zaretskii <eliz@gnu.org>
> > Cc: 37633@debbugs.gnu.org, anton@mips.complang.tuwien.ac.at
> > 
> > > Suggested solution: Use byte-to-position to calculate the position in
> > > compilation-move-to-column.
> > 
> > This only works in UTF-8 locales, and is not 100% even there, so it
> > isn't the right solution.
> 
> In general, byte-to-position is meant to be used only for converting
> between byte and character positions of text in Emacs buffers.
> 
> For byte offsets in external text we have bufferpos-to-filepos, but
> that requires us to know the encoding of the external text.  We need
> to find a reasonable way of getting that.  Suggestions and patches
> welcome.

Ok, first I tried bufferpos-to-filepos.

(defun compilation-move-to-column (col screen)
  "Go to column COL on the current line.
If SCREEN is non-nil, columns are screen columns, otherwise, they are
just char-counts."
  (setq col (- col compilation-first-column))
  (let ((realpos (filepos-to-bufferpos
                  (+ (bufferpos-to-filepos (line-beginning-position)
                                           'approximate)
                     col)
                  'approximate)))
    (goto-char (min realpos (line-end-position)))))

I left out the (if ) with (screen), because I just wanted to test this case.  
For the examples I've used, it works with the 'approximate setting.

I leave the screen part to the Emacs maintainers, because you may want 
a three-case statement: nil for char-count, 't for screen columns, and 
'bytepos for byte-accurate position.  JavaScript (node) is fine with the 
char-count mode.

Second test-case: iso8859-1 encoded file with

void foo() {
	printf("test %i", b);
	printf("testäöü %i", c);
}

...
test-iso.c:3:23: error: ‘c’ undeclared (first use in this function)
    3 |  printf("test��� %i", c);
      |                       ^
...

works when you click there, too.

-- 
Bernd Paysan
"If you want it done right, you have to do it yourself"
net2o id: kQusJzA;7*?t=uy@X}1GWr!+0qqp_Cn176t4(dQ*
https://net2o.de/


* bug#37633: Column part interpreted wrong in compilation mode
  2019-10-05 17:05     ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
@ 2019-10-05 18:53       ` Eli Zaretskii
  2019-10-05 18:54         ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
  0 siblings, 1 reply; 23+ messages in thread
From: Eli Zaretskii @ 2019-10-05 18:53 UTC (permalink / raw)
  To: Bernd Paysan; +Cc: 37633, anton

> From: Bernd Paysan <bernd@net2o.de>
> Cc: 37633@debbugs.gnu.org, anton@mips.complang.tuwien.ac.at
> Date: Sat, 05 Oct 2019 19:05:26 +0200
> 
> We can likely assume that the auto-detected encoding is the correct one, i.e. 
> buffer-file-coding-system can be used (the default for the optional encoding 
> system parameter for bufferpos-to-filepos and filepos-to-bufferpos).

Encoding of subprocess output is generally not auto-detected, it uses
the defaults derived from the locale.  I don't recommend
auto-detecting, because that's quite fragile (and is not needed here
anyway, IMO).

> Problem with precision: "exact" requires encoding the entire file, so it's 
> slow for large files.  Particularly with automatically generated files, this 
> is likely not acceptable, so "approximate" could be good enough.

We cannot use 'exact' here because there's no file per se: we only
have the compiler output.  We must use 'approximate'.





^ permalink raw reply	[flat|nested] 23+ messages in thread

* bug#37633: Column part interpreted wrong in compilation mode
  2019-10-05 18:53       ` Eli Zaretskii
@ 2019-10-05 18:54         ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2019-10-05 19:14           ` Eli Zaretskii
  0 siblings, 1 reply; 23+ messages in thread
From: Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2019-10-05 18:54 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 37633, anton

[-- Attachment #1: Type: text/plain, Size: 656 bytes --]

On Saturday, 5 October 2019, 20:53:02 CEST, Eli Zaretskii wrote:
> > Problem with precision: "exact" requires encoding the entire file, so it's
> > slow for large files.  Particularly with automatically generated files,
> > this is likely not acceptable, so "approximate" could be good enough.
> 
> We cannot use 'exact' here because there's no file per se: we only
> have the compiler output.  We must use 'approximate'.

The buffer that matters is not the compiler output, it's the buffer of the 
source code.

-- 
Bernd Paysan
"If you want it done right, you have to do it yourself"
net2o id: kQusJzA;7*?t=uy@X}1GWr!+0qqp_Cn176t4(dQ*
https://net2o.de/


* bug#37633: Column part interpreted wrong in compilation mode
  2019-10-05 18:54         ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
@ 2019-10-05 19:14           ` Eli Zaretskii
  2019-10-05 19:24             ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
  0 siblings, 1 reply; 23+ messages in thread
From: Eli Zaretskii @ 2019-10-05 19:14 UTC (permalink / raw)
  To: Bernd Paysan; +Cc: 37633, anton

> From: Bernd Paysan <bernd@net2o.de>
> Cc: 37633@debbugs.gnu.org, anton@mips.complang.tuwien.ac.at
> Date: Sat, 05 Oct 2019 20:54:38 +0200
> 
> > We cannot use 'exact' here because there's no file per se: we only
> > have the compiler output.  We must use 'approximate'.
> 
> The buffer that matters is not the compiler output, it's the buffer of the 
> source code.

But the column numbers are counted in the compiler output, and no one
said that the compiler output must be encoded the same as the source
file.





^ permalink raw reply	[flat|nested] 23+ messages in thread

* bug#37633: Column part interpreted wrong in compilation mode
  2019-10-05 19:14           ` Eli Zaretskii
@ 2019-10-05 19:24             ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2019-10-06 17:16               ` Eli Zaretskii
  0 siblings, 1 reply; 23+ messages in thread
From: Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2019-10-05 19:24 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 37633, anton

[-- Attachment #1: Type: text/plain, Size: 1948 bytes --]

On Saturday, 5 October 2019, 21:14:38 CEST, Eli Zaretskii wrote:
> > From: Bernd Paysan <bernd@net2o.de>
> > Cc: 37633@debbugs.gnu.org, anton@mips.complang.tuwien.ac.at
> > Date: Sat, 05 Oct 2019 20:54:38 +0200
> > 
> > > We cannot use 'exact' here because there's no file per se: we only
> > > have the compiler output.  We must use 'approximate'.
> > 
> > The buffer that matters is not the compiler output, it's the buffer of the
> > source code.
> 
> But the column numbers are counted in the compiler output, and no one
> said that the compiler output must be encoded the same as the source
> file.

The column numbers are written as decimal digits in the compiler output.  They 
are not even calculated, they are just extracted.

Indeed, the compiler output can be in a different encoding, but it doesn't 
matter.  The navigation that needs to change is in the source code file.  This 
is compiler output from compiling an iso-latin encoded file, the compiler 
output itself is utf-8:

test-iso.c:3:23: error: ‘c’ undeclared (first use in this function)
    3 |  printf("test��� %i", c);
      |                       ^

The 23 (minus 1) is the number of bytes from the start of the line to the 
missing variable 'c'.  The three � are there because the compilation buffer 
now contains invalid characters: they are iso-latin characters, invalid in 
utf-8.  But this is irrelevant.  All compilation mode does is extract 
test-iso.c (file name), 3 (line number) and 23 (byte index).  Navigation 
happens in test-iso.c, which is a file (the C compiler can't access Emacs 
buffers), and autodetection is pretty reliable.

There might be some corner cases, where the suggested solution is not perfect, 
but it's much better than what we have now.

-- 
Bernd Paysan
"If you want it done right, you have to do it yourself"
net2o id: kQusJzA;7*?t=uy@X}1GWr!+0qqp_Cn176t4(dQ*
https://net2o.de/


* bug#37633: Column part interpreted wrong in compilation mode
  2019-10-05 16:16   ` Eli Zaretskii
  2019-10-05 17:05     ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2019-10-05 17:34     ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
@ 2019-10-06 12:31     ` Anton Ertl
  2019-10-06 17:53       ` Eli Zaretskii
  2 siblings, 1 reply; 23+ messages in thread
From: Anton Ertl @ 2019-10-06 12:31 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 37633, bernd, anton

On Sat, Oct 05, 2019 at 07:16:53PM +0300, Eli Zaretskii wrote:
> For byte offsets in external text we have bufferpos-to-filepos, but
> that requires us to know the encoding of the external text.  We need
> to find a reasonable way of getting that.  Suggestions and patches
> welcome.

It's the encoding that you assumed for the text when you loaded the
file into the buffer.

The assumption may be wrong, which may cause problems elsewhere, but
should not cause problems for interpreting the byte position, because
the byte position does not depend on the encoding (unlike the
character position).

- anton





^ permalink raw reply	[flat|nested] 23+ messages in thread

* bug#37633: Column part interpreted wrong in compilation mode
  2019-10-05 19:24             ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
@ 2019-10-06 17:16               ` Eli Zaretskii
  2019-10-06 17:35                 ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
  0 siblings, 1 reply; 23+ messages in thread
From: Eli Zaretskii @ 2019-10-06 17:16 UTC (permalink / raw)
  To: Bernd Paysan; +Cc: 37633, anton

> From: Bernd Paysan <bernd@net2o.de>
> Cc: 37633@debbugs.gnu.org, anton@mips.complang.tuwien.ac.at
> Date: Sat, 05 Oct 2019 21:24:17 +0200
> 
> > But the column numbers are counted in the compiler output, and no one
> > said that the compiler output must be encoded the same as the source
> > file.
> 
> The column numbers are written as decimal digits in the compiler output.  They 
> are not even calculated, they are just extracted.
> 
> Indeed, the compiler output can be in a different encoding, but it doesn't 
> matter.  The navigation that needs to change is in the source code file.  This 
> is compiler output from compiling an iso-latin encoded file, the compiler 
> output itself is utf-8:
> 
> test-iso.c:3:23: error: ‘c’ undeclared (first use in this function)
>     3 |  printf("test��� %i", c);
>       |                       ^
> 
> The 23(-1) are the numbers of bytes to get from the start of line to the 
> missing variable 'c'.  The three � are there, because the compilation buffer 
> contains invalid characters now.  They are iso-latin characters, invalid in 
> utf-8.  But this is irrelevant.  All the compilation mode does is extract the 
> test-iso.c (file name), 3 (line number) and 23 (byte index).  Navigation 
> happens in test-iso.c, it's a file (the C compiler can't access emacs 
> buffers), autodetection is pretty reliable.

Sorry, now I'm confused.  Does the compiler count bytes in its output
(where a Latin-1 line could be recoded in UTF-8, and thus have a
different number of bytes), or does it count bytes in the original
file (in this case encoded in Latin-1, i.e. 1 byte per character)?





^ permalink raw reply	[flat|nested] 23+ messages in thread

* bug#37633: Column part interpreted wrong in compilation mode
  2019-10-06 17:16               ` Eli Zaretskii
@ 2019-10-06 17:35                 ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2019-10-06 18:54                   ` Eli Zaretskii
  0 siblings, 1 reply; 23+ messages in thread
From: Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2019-10-06 17:35 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 37633, anton

[-- Attachment #1: Type: text/plain, Size: 856 bytes --]

On Sunday, 6 October 2019, 19:16:43 CEST, Eli Zaretskii wrote:
> Sorry, now I'm confused.  Does the compiler count bytes in its output
> (where a Latin-1 line could be recoded in UTF-8, and thus have a
> different number of bytes), or does it count bytes in the original
> file (in this case encoded in Latin-1, i.e. 1 byte per character)?

It counts bytes in its input.  The output is just a copy of the input.  The 
compiler (GCC here) does not even care or know about what encoding the input 
actually is.  It's supposed to be ASCII compatible, the compiler does not try 
to be smart.  C symbols are supposed to be ASCII only, C strings are just byte 
arrays.  Don't overestimate the smartness here.

-- 
Bernd Paysan
"If you want it done right, you have to do it yourself"
net2o id: kQusJzA;7*?t=uy@X}1GWr!+0qqp_Cn176t4(dQ*
https://net2o.de/


* bug#37633: Column part interpreted wrong in compilation mode
  2019-10-06 12:31     ` Anton Ertl
@ 2019-10-06 17:53       ` Eli Zaretskii
  2019-10-06 19:02         ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2019-10-07  7:09         ` Anton Ertl
  0 siblings, 2 replies; 23+ messages in thread
From: Eli Zaretskii @ 2019-10-06 17:53 UTC (permalink / raw)
  To: anton; +Cc: 37633, bernd

> Date: Sun, 6 Oct 2019 14:31:12 +0200
> From: Anton Ertl <anton@mips.complang.tuwien.ac.at>
> Cc: bernd@net2o.de, 37633@debbugs.gnu.org, anton@mips.complang.tuwien.ac.at
> 
> On Sat, Oct 05, 2019 at 07:16:53PM +0300, Eli Zaretskii wrote:
> > For byte offsets in external text we have bufferpos-to-filepos, but
> > that requires us to know the encoding of the external text.  We need
> > to find a reasonable way of getting that.  Suggestions and patches
> > welcome.
> 
> It's the encoding that you assumed for the text when you loaded the
> file into the buffer.

I'm not sure this is correct.  You are saying that the compiler counts
bytes in the original file, not in its output (which might be encoded
differently).  Do we have conclusive evidence that this is always
true?

> the byte position does not depend on the encoding (unlike the
> character position).

??? The same Latin-1 characters encoded in ISO-8859-1 and in UTF-8
will yield a different number of bytes.  So I don't think I understand
how can you say the above.





^ permalink raw reply	[flat|nested] 23+ messages in thread

* bug#37633: Column part interpreted wrong in compilation mode
  2019-10-06 17:35                 ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
@ 2019-10-06 18:54                   ` Eli Zaretskii
  2019-10-06 19:16                     ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
  0 siblings, 1 reply; 23+ messages in thread
From: Eli Zaretskii @ 2019-10-06 18:54 UTC (permalink / raw)
  To: Bernd Paysan; +Cc: 37633, anton

> From: Bernd Paysan <bernd@net2o.de>
> Cc: 37633@debbugs.gnu.org, anton@mips.complang.tuwien.ac.at
> Date: Sun, 06 Oct 2019 19:35:33 +0200
> 
> It counts bytes in its input.

In that case, using the encoding with which we visited the source is
TRT.





^ permalink raw reply	[flat|nested] 23+ messages in thread

* bug#37633: Column part interpreted wrong in compilation mode
  2019-10-06 17:53       ` Eli Zaretskii
@ 2019-10-06 19:02         ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2019-10-06 19:16           ` Eli Zaretskii
  2019-10-07  7:09         ` Anton Ertl
  1 sibling, 1 reply; 23+ messages in thread
From: Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2019-10-06 19:02 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: anton, 37633

[-- Attachment #1: Type: text/plain, Size: 3111 bytes --]

On Sunday, 6 October 2019, 19:53:49 CEST, Eli Zaretskii wrote:
> > Date: Sun, 6 Oct 2019 14:31:12 +0200
> > From: Anton Ertl <anton@mips.complang.tuwien.ac.at>
> > Cc: bernd@net2o.de, 37633@debbugs.gnu.org,
> > anton@mips.complang.tuwien.ac.at
> > 
> > On Sat, Oct 05, 2019 at 07:16:53PM +0300, Eli Zaretskii wrote:
> > > For byte offsets in external text we have bufferpos-to-filepos, but
> > > that requires us to know the encoding of the external text.  We need
> > > to find a reasonable way of getting that.  Suggestions and patches
> > > welcome.
> > 
> > It's the encoding that you assumed for the text when you loaded the
> > file into the buffer.
> 
> I'm not sure this is correct.  You are saying that the compiler counts
> bytes in the original file, not in its output (which might be encoded
> differently).  Do we have conclusive evidence that this is always
> true?

Almost always.  gcc has a gazillion options almost nobody uses.

E.g., you can use -finput-charset=<charset> to transcode input files on 
reading.  It's not a well-tested option, as the output (still iso8859-1) 
shows:

% gcc -finput-charset=iso8859-1 test-iso.c
test-iso.c: In function ‘foo’:
test-iso.c:2:2: warning: implicit declaration of function ‘printf’ [-Wimplicit-function-declaration]
    2 |  printf("test %i", b);
      |  ^~~~~~
test-iso.c:2:2: warning: incompatible implicit declaration of built-in function ‘printf’
test-iso.c:1:1: note: include ‘<stdio.h>’ or provide a declaration of ‘printf’
  +++ |+#include <stdio.h>
    1 | void foo() {
test-iso.c:2:20: error: ‘b’ undeclared (first use in this function)
    2 |  printf("test %i", b);
      |                    ^
test-iso.c:2:20: note: each undeclared identifier is reported only once for each function it appears in
test-iso.c:3:26: error: ‘c’ undeclared (first use in this function)
    3 |  printf("test��� %i", c);
      |                          ^

Here, due to the conversion on read in, the position reported is different (it 
was 3:23 before).
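
The two reported columns can be reproduced with a little arithmetic. The following Python sketch is only an illustration: the helper name is hypothetical, and it assumes gcc counts one column per byte of the line as it sees it (Latin-1 as stored in the file, UTF-8 after -finput-charset conversion).

```python
def byte_column_of(line, needle, encoding):
    """Return the 1-based byte column of the last occurrence of NEEDLE in LINE."""
    char_index = line.rindex(needle)
    # Count the bytes that precede the character in the given encoding.
    return len(line[:char_index].encode(encoding)) + 1


# Line 3 of test-iso.c; the escapes spell the three umlauts a, o, u.
line = '\tprintf("test\u00e4\u00f6\u00fc %i", c);'
latin1_col = byte_column_of(line, "c", "latin-1")
utf8_col = byte_column_of(line, "c", "utf-8")
print(latin1_col, utf8_col)
```

Each umlaut takes one byte in Latin-1 and two in UTF-8, which accounts for the difference of three between the two columns.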

This transparent conversion on reading is rarely used.  In fact, a search 
of all of GitHub turns up no results.

> > the byte position does not depend on the encoding (unlike the
> > character position).
> 
> ??? The same Latin-1 characters encoded in ISO-8859-1 and in UTF-8
> will yield a different number of bytes.  So I don't think I understand
> how can you say the above.

What I'm trying to say: The compiler (unless instructed to convert the file 
on reading) reports the byte position it found in the file.  That's the same 
byte position the editor calculates for that file, regardless of what the 
editor assumed as encoding.  I.e. if the editor mistook a UTF-8 file 
for an iso8859-1 one, it will see a UTF-8 string "äöü" (6 bytes UTF-8) as 
"Ã¤Ã¶Ã¼" (6 bytes iso8859-1).  But it's still 6 bytes.

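The byte counts above can be checked with a short Python sketch. It is an illustration only, with hypothetical variable names. It looks at the bytes of the file, not at how Emacs represents the text inside a buffer.

```python
text = "\u00e4\u00f6\u00fc"  # the string "äöü"
utf8_bytes = text.encode("utf-8")
# Misread the same bytes as ISO 8859-1: one character per byte.
misread = utf8_bytes.decode("latin-1")
print(len(utf8_bytes), len(misread), misread)
# Writing the misread text back as ISO 8859-1 yields the original bytes.
roundtrip = misread.encode("latin-1")
```
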
-- 
Bernd Paysan
"If you want it done right, you have to do it yourself"
net2o id: kQusJzA;7*?t=uy@X}1GWr!+0qqp_Cn176t4(dQ*
https://net2o.de/


* bug#37633: Column part interpreted wrong in compilation mode
  2019-10-06 18:54                   ` Eli Zaretskii
@ 2019-10-06 19:16                     ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
  0 siblings, 0 replies; 23+ messages in thread
From: Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2019-10-06 19:16 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 37633, anton

[-- Attachment #1: Type: text/plain, Size: 481 bytes --]

On Sunday, 6 October 2019, 20:54:28 CEST, Eli Zaretskii wrote:
> > From: Bernd Paysan <bernd@net2o.de>
> > Cc: 37633@debbugs.gnu.org, anton@mips.complang.tuwien.ac.at
> > Date: Sun, 06 Oct 2019 19:35:33 +0200
> > 
> > It counts bytes in its input.
> 
> In that case, using the encoding with which we visited the source is
> TRT.

Yes.

-- 
Bernd Paysan
"If you want it done right, you have to do it yourself"
net2o id: kQusJzA;7*?t=uy@X}1GWr!+0qqp_Cn176t4(dQ*
https://net2o.de/


* bug#37633: Column part interpreted wrong in compilation mode
  2019-10-06 19:02         ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
@ 2019-10-06 19:16           ` Eli Zaretskii
  2019-10-06 19:22             ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
  0 siblings, 1 reply; 23+ messages in thread
From: Eli Zaretskii @ 2019-10-06 19:16 UTC (permalink / raw)
  To: Bernd Paysan; +Cc: anton, 37633

> From: Bernd Paysan <bernd@net2o.de>
> Cc: anton@mips.complang.tuwien.ac.at, 37633@debbugs.gnu.org
> Date: Sun, 06 Oct 2019 21:02:14 +0200
> 
> if the editor mistook a UTF-8 file for an iso8859-1, it will see an
> UTF-8 string "äöü" (6 bytes UTF-8) as "Ã¤Ã¶Ã¼" (6 bytes iso8859-1).
> But it's still 6 bytes.

Not inside the Emacs buffer, it isn't.
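
A minimal sketch of the distinction being drawn here, in Python rather than Emacs Lisp. It assumes, purely for illustration, that the buffer stores its text internally as UTF-8; the variable names are made up. The six bytes on disk stay six bytes, but once they are decoded as iso8859-1 into six separate characters, those characters take more bytes in the modelled internal representation.

```python
# Assumption: the editor's internal text representation is modelled as UTF-8.
on_disk = "äöü".encode("utf-8")            # the 6 bytes stored in the file
misread = on_disk.decode("iso-8859-1")     # the same bytes read as Latin-1
in_buffer = misread.encode("utf-8")        # modelled internal storage

print(len(on_disk))    # 6 bytes in the file
print(misread)         # Ã¤Ã¶Ã¼
print(len(misread))    # 6 characters in the buffer
print(len(in_buffer))  # 12 bytes in the modelled internal representation
```

So "still 6 bytes" holds for the file, while the byte count inside the buffer depends on how the decoded characters are stored.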






* bug#37633: Column part interpreted wrong in compilation mode
  2019-10-06 19:16           ` Eli Zaretskii
@ 2019-10-06 19:22             ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2019-10-06 19:34               ` Eli Zaretskii
  0 siblings, 1 reply; 23+ messages in thread
From: Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2019-10-06 19:22 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: anton, 37633

On Sunday, 6 October 2019, 21:16:47 CEST, Eli Zaretskii wrote:
> > From: Bernd Paysan <bernd@net2o.de>
> > Cc: anton@mips.complang.tuwien.ac.at, 37633@debbugs.gnu.org
> > Date: Sun, 06 Oct 2019 21:02:14 +0200
> > 
> > > if the editor mistook a UTF-8 file for an iso8859-1, it will see a
> > UTF-8 string "äöü" (6 bytes UTF-8) as "Ã¤Ã¶Ã¼" (6 bytes iso8859-1).
> > But it's still 6 bytes.
> 
> Not inside the Emacs buffer, it isn't.

I created a unicode file:

void main() {
	char *b="ha", *c="ho";
	printf("test %i", b);
	printf("testäöü %i", c);
}

I loaded this into emacs, and reverted the buffer using iso8859-1 coding 
(simulating a wrongly detected encoding).

It then looks like this:

void main() {
	char *b="ha", *c="ho";
	printf("test %i", b);
	printf("testÃ¤Ã¶Ã¼ %i", c);
}

I compiled it with gcc -Wall test-utf.c into a compile-mode buffer.

-*- mode: compilation; default-directory: "~/tmp/" -*-
Compilation started at Sun Oct  6 21:18:24

gcc -Wall test-utf.c 
test-utf.c:1:6: warning: return type of ‘main’ is not ‘int’ [-Wmain]
    1 | void main() {
      |      ^~~~
test-utf.c: In function ‘main’:
test-utf.c:3:2: warning: implicit declaration of function ‘printf’ [-Wimplicit-function-declaration]
    3 |  printf("test %i", b);
      |  ^~~~~~
test-utf.c:3:2: warning: incompatible implicit declaration of built-in function ‘printf’
test-utf.c:1:1: note: include ‘<stdio.h>’ or provide a declaration of ‘printf’
  +++ |+#include <stdio.h>
    1 | void main() {
test-utf.c:3:16: warning: format ‘%i’ expects argument of type ‘int’, but argument 2 has type ‘char *’ [-Wformat=]
    3 |  printf("test %i", b);
      |               ~^   ~
      |                |   |
      |                int char *
      |               %s
test-utf.c:4:22: warning: format ‘%i’ expects argument of type ‘int’, but argument 2 has type ‘char *’ [-Wformat=]
    4 |  printf("testäöü %i", c);
      |                     ~^   ~
      |                      |   |
      |                      int char *
      |                     %s

Compilation finished at Sun Oct  6 21:18:24

If I click on the test-utf.c:4:22 label, I get exactly where I want to: On the 
i of %i.

If I revert this buffer with the correct encoding utf-8-unix, then it still 
navigates to the i of %i, so it's all agnostic to whether the encoding 
detected was correct or wrong.
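
The arithmetic behind column 22 can be checked outside Emacs. The sketch below is only a model of counting bytes in the file, not of what compilation mode does; the helper name `locate` is hypothetical. It takes the raw bytes of source line 4 (with its leading TAB), cuts off everything before byte column 22, and decodes that prefix under each encoding to see which character the column points at.

```python
# Raw bytes of source line 4 as stored in the UTF-8 file (leading TAB included).
LINE = '\tprintf("testäöü %i", c);'.encode("utf-8")

def locate(raw: bytes, byte_col: int, encoding: str) -> tuple[int, str]:
    """Return (1-based character column, character) for a 1-based byte column.

    Raises UnicodeDecodeError if byte_col falls inside a multibyte character.
    """
    prefix = raw[:byte_col - 1].decode(encoding)   # text before the target byte
    text = raw.decode(encoding)
    return len(prefix) + 1, text[len(prefix)]

print(locate(LINE, 22, "iso-8859-1"))  # (22, 'i')
print(locate(LINE, 22, "utf-8"))       # (19, 'i')
```

Under iso8859-1 every byte is one character, so byte column and character column coincide at 22. Under UTF-8 the three umlauts collapse to three characters, so the same byte sits at character column 19, but it is the same 'i' either way.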

-- 
Bernd Paysan
"If you want it done right, you have to do it yourself"
net2o id: kQusJzA;7*?t=uy@X}1GWr!+0qqp_Cn176t4(dQ*
https://net2o.de/


* bug#37633: Column part interpreted wrong in compilation mode
  2019-10-06 19:22             ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
@ 2019-10-06 19:34               ` Eli Zaretskii
  2019-10-06 19:35                 ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
  0 siblings, 1 reply; 23+ messages in thread
From: Eli Zaretskii @ 2019-10-06 19:34 UTC (permalink / raw)
  To: Bernd Paysan; +Cc: anton, 37633

> From: Bernd Paysan <bernd@net2o.de>
> Cc: anton@mips.complang.tuwien.ac.at, 37633@debbugs.gnu.org
> Date: Sun, 06 Oct 2019 21:22:20 +0200
> 
> > > > if the editor mistook a UTF-8 file for an iso8859-1, it will see a
> > > UTF-8 string "äöü" (6 bytes UTF-8) as "Ã¤Ã¶Ã¼" (6 bytes iso8859-1).
> > > But it's still 6 bytes.
> > 
> > Not inside the Emacs buffer, it isn't.
> 
> I created a unicode file:
> [...]
> If I revert this buffer with the correct encoding utf-8-unix, then it still 
> navigates to the i of %i, so it's all agnostic to whether the encoding 
> detected was correct or wrong.

Not sure I understand: are you saying that your experiment proves that
my assertion about the number of bytes was incorrect?  Because it
doesn't.

And anyway, I see no reason to argue about this side issue, since we
seem to be in agreement that using the file's encoding is TRT.






* bug#37633: Column part interpreted wrong in compilation mode
  2019-10-06 19:34               ` Eli Zaretskii
@ 2019-10-06 19:35                 ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
  0 siblings, 0 replies; 23+ messages in thread
From: Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors @ 2019-10-06 19:35 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: anton, 37633

On Sunday, 6 October 2019, 21:34:15 CEST, Eli Zaretskii wrote:
> Not sure I understand: are you saying that your experiment proves that
> my assertion about the number of bytes was incorrect?  Because it
> doesn't.

No, the experiment supports your assertion.

> And anyway, I see no reason to argue about this side issue, since we
> seem to be in agreement that using the file's encoding is TRT.

Indeed. Using the file's encoding is TRT.

-- 
Bernd Paysan
"If you want it done right, you have to do it yourself"
net2o id: kQusJzA;7*?t=uy@X}1GWr!+0qqp_Cn176t4(dQ*
https://net2o.de/


* bug#37633: Column part interpreted wrong in compilation mode
  2019-10-06 17:53       ` Eli Zaretskii
  2019-10-06 19:02         ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
@ 2019-10-07  7:09         ` Anton Ertl
  1 sibling, 0 replies; 23+ messages in thread
From: Anton Ertl @ 2019-10-07  7:09 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: anton, bernd, 37633

On Sun, Oct 06, 2019 at 08:53:49PM +0300, Eli Zaretskii wrote:
> > the byte position does not depend on the encoding (unlike the
> > character position).
> 
> ??? The same Latin-1 characters encoded in ISO-8859-1 and in UTF-8
> will yield a different number of bytes.  So I don't think I understand
> how can you say the above.

The same bytes have the same number of bytes, whether you interpret
them as having one encoding or some other encoding.  How many
characters these bytes have depends on the encoding.

Of course, if you have transcoded the bytes into some other encoding,
you have to transcode them back for counting.  So for Emacs this means
converting back to the input encoding, and then counting (i.e., what
you describe as TRT (which I guess means The Right Thing)).
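
As a rough illustration of "transcode back, then count": the sketch below walks the decoded line, re-encodes each character with the file's encoding, and stops once the accumulated byte offset reaches the reported column. The function name and signature are hypothetical, and this is Python, not what Emacs would actually use.

```python
def byte_column_to_char_index(line: str, byte_col: int, file_encoding: str) -> int:
    """Map a 1-based byte column to a 0-based index into the decoded line.

    A column that falls inside a multibyte character resolves to the
    character that follows it; a column past the end yields len(line).
    """
    offset = 1  # 1-based byte column of the character at `index`
    for index, ch in enumerate(line):
        if offset >= byte_col:
            return index
        offset += len(ch.encode(file_encoding))  # bytes this character takes on disk
    return len(line)

# Line 3 of test.c from the original report, reported as test.c:3:26.
line = '\tprintf("test你好 %i", c);'
print(line[byte_column_to_char_index(line, 26, "utf-8")])  # c
```

With the two Chinese characters taking three bytes each in UTF-8, byte column 26 maps to index 21 of the decoded line, which is the undeclared 'c'.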

- anton


* bug#37633: Column part interpreted wrong in compilation mode
  2019-10-05 11:12 bug#37633: Column part interpreted wrong in compilation mode Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
  2019-10-05 16:08 ` Eli Zaretskii
@ 2022-04-23 13:36 ` Lars Ingebrigtsen
  1 sibling, 0 replies; 23+ messages in thread
From: Lars Ingebrigtsen @ 2022-04-23 13:36 UTC (permalink / raw)
  To: Bernd Paysan; +Cc: 37633, anton

Bernd Paysan <bernd@net2o.de> writes:

> Problems arise when tabs and UTF-8 glyphs are involved, e.g. compile
>
> ---------------test.c---------------
> void foo() {
> 	printf("test %i", b);
> 	printf("test你好 %i", c);
> }
> ---------------gcc test.c---------------
> -*- mode: compilation; default-directory: "~/tmp/" -*-
> Compilation started at Sat Oct  5 12:13:23

[...]

> test.c:3:26: error: ‘c’ undeclared (first use in this function)
>     3 |  printf("test你好 %i", c);
>       |                          ^

(I'm going through old bug reports that unfortunately weren't resolved
at the time.)

Amusingly enough, gcc 11.2.0 said this to me

comp.c:4:31: error: 'c' undeclared (first use in this function)
    4 |         printf("test你好 %i", c);
      |                               ^

It's counting the leading TAB character as eight columns...  and then
counting the bytes of Chinese characters individually, ending up with a
column of 31.
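
For reference, the three candidate counts for this line can be worked out with a short sketch. It assumes tab stops every 8 columns and two display columns per East Asian wide character; the function names are made up. Plain byte counting gives 26 (the 2019 output), expanding the TAB and counting each byte of the Chinese characters gives 33, and expanding the TAB while counting each wide character as two display columns gives 31. If memory serves, GCC 11 introduced `-fdiagnostics-column-unit` with `display` as its default, which would match the last of these, but treat that as an assumption to verify against the GCC documentation.

```python
import unicodedata

def byte_column(line: str, index: int, encoding: str = "utf-8") -> int:
    """1-based column counting every byte (TAB included) as one column."""
    return len(line[:index].encode(encoding)) + 1

def tab_expanded_byte_column(line: str, index: int, tab_width: int = 8,
                             encoding: str = "utf-8") -> int:
    """1-based column with TABs expanded and other characters counted in bytes."""
    col = 0
    for ch in line[:index]:
        if ch == "\t":
            col += tab_width - (col % tab_width)   # advance to the next tab stop
        else:
            col += len(ch.encode(encoding))
    return col + 1

def display_column(line: str, index: int, tab_width: int = 8) -> int:
    """1-based column with TABs expanded and wide characters counted as 2."""
    col = 0
    for ch in line[:index]:
        if ch == "\t":
            col += tab_width - (col % tab_width)
        elif unicodedata.east_asian_width(ch) in ("W", "F"):
            col += 2
        else:
            col += 1
    return col + 1

line = '\tprintf("test你好 %i", c);'
target = line.rindex("c")                       # position of the undeclared 'c'
print(byte_column(line, target))                # 26
print(tab_expanded_byte_column(line, target))   # 33
print(display_column(line, target))             # 31
```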

So just using `filepos-to-bufferpos' wouldn't fix the current gcc.  We
could implement gcc's logic fully, but that's changing over time, and
other compilers surely have their own logic.  (I wouldn't be surprised
if other compilers count characters instead of bytes in their
column outputs.)  And -finput-charset doesn't help with the column
calculation in gcc.

Since the issue is as messy as it is, I don't think there's anything
meaningful we can do here on the Emacs side, so I'm closing
this bug report.  (If somebody has ideas that would work in general
here, please respond and we'll reopen.)

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no


Thread overview: 23+ messages
2019-10-05 11:12 bug#37633: Column part interpreted wrong in compilation mode Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
2019-10-05 16:08 ` Eli Zaretskii
2019-10-05 16:16   ` Eli Zaretskii
2019-10-05 17:05     ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
2019-10-05 18:53       ` Eli Zaretskii
2019-10-05 18:54         ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
2019-10-05 19:14           ` Eli Zaretskii
2019-10-05 19:24             ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
2019-10-06 17:16               ` Eli Zaretskii
2019-10-06 17:35                 ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
2019-10-06 18:54                   ` Eli Zaretskii
2019-10-06 19:16                     ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
2019-10-05 17:34     ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
2019-10-06 12:31     ` Anton Ertl
2019-10-06 17:53       ` Eli Zaretskii
2019-10-06 19:02         ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
2019-10-06 19:16           ` Eli Zaretskii
2019-10-06 19:22             ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
2019-10-06 19:34               ` Eli Zaretskii
2019-10-06 19:35                 ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
2019-10-07  7:09         ` Anton Ertl
2019-10-05 16:58   ` Bernd Paysan via Bug reports for GNU Emacs, the Swiss army knife of text editors
2022-04-23 13:36 ` Lars Ingebrigtsen
