unofficial mirror of emacs-devel@gnu.org 
* Repeatable instance of bug#870
@ 2009-01-05  5:03 Juanma Barranquero
  2009-01-05  5:05 ` Juanma Barranquero
  2009-01-05 10:59 ` bug#870: " Jason Rumney
  0 siblings, 2 replies; 11+ messages in thread
From: Juanma Barranquero @ 2009-01-05  5:03 UTC
  To: Emacs Devel; +Cc: 870

Today I was finally able to create a repeatable test case for
bug#870, "Missing ^J in ChangeLog".

The bug manifests itself as one or more ^J chars missing when reading
a text file. AFAIK, it has only happened with ChangeLogs, and only to
a few Windows users (not unexpectedly, as we typically handle many
more CRLF files than people on other systems).

On my setup, the bug can be repeated at will by doing:

   emacs -Q --eval "(desktop-save-mode 1)" ChangeLog.870
   C-x C-f
   y <RET>    ; to save the desktop when asked
   emacs -Q --eval "(desktop-read)"
   C-s C-q C-M

After that, the cursor will be over a ^M char, the remnant of a CRLF
pair whose ^J has disappeared.

If before restarting Emacs you edit .emacs.desktop and remove
"(buffer-file-coding-system . utf-8-dos)" from the ChangeLog.870
entry, the bug does not happen.

The missing ^J is exactly at position #x8000 of the ChangeLog.870
file. If you do remove a character from the file and repeat the test,
the problem does not happen at position #x8000, but another instance
of the same bug does happen at position #x38007. That seems to
indicate some kind of trouble with a 32 KiB buffer.
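
A minimal sketch, not part of the original report, of how the #x8000
observation could be checked mechanically: it lists every LF that sits
exactly on a chunk boundary and directly follows a CR. The function name
`straddling_crlf_offsets`, the default chunk size of 32 KiB, and the
synthetic sample data are all assumptions made for illustration; nothing
here is Emacs code.

```python
def straddling_crlf_offsets(data, chunk_size=0x8000):
    """Return offsets of LF bytes that begin a chunk and follow a CR.

    Hypothetical helper: a CRLF pair "straddles" a boundary when the CR
    is the last byte of one chunk and the LF is the first byte of the next.
    """
    return [
        offset
        for offset in range(chunk_size, len(data), chunk_size)
        if data[offset] == 0x0A and data[offset - 1] == 0x0D
    ]


# Synthetic stand-in for a CRLF file: the CR lands at 0x7fff, the LF at 0x8000.
sample = b"x" * 0x7FFF + b"\r\n" + b"y" * 100

print([hex(o) for o in straddling_crlf_offsets(sample)])
# Dropping one leading byte moves the pair off the boundary.
print(straddling_crlf_offsets(sample[1:]))
```

Running it on the real ChangeLog.870 bytes (read with `open(path, "rb")`)
would show whether the CRLF pair at #x8000 is the only one that lines up
with a 32 KiB boundary.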

I'm attaching a bzipped copy of ChangeLog.870.

Any help in debugging this bug (or even a patch fixing it ;-) will be
much appreciated.

    Juanma

[-- Attachment #2: ChangeLog.870.bz2 --]
[-- Type: application/x-bzip2, Size: 123313 bytes --]


* Re: Repeatable instance of bug#870
  2009-01-05  5:03 Repeatable instance of bug#870 Juanma Barranquero
@ 2009-01-05  5:05 ` Juanma Barranquero
  2009-01-05 10:59 ` bug#870: " Jason Rumney
  1 sibling, 0 replies; 11+ messages in thread
From: Juanma Barranquero @ 2009-01-05  5:05 UTC
  To: Emacs Devel

On Mon, Jan 5, 2009 at 06:03, Juanma Barranquero <lekktu@gmail.com> wrote:

>   emacs -Q --eval "(desktop-save-mode 1)" ChangeLog.870
>   C-x C-f

"C-x C-c", I mean.

>   y <RET>    ; to save the desktop when asked
>   emacs -Q --eval "(desktop-read)"
>   C-s C-q C-M

    Juanma





* Re: bug#870: Repeatable instance of bug#870
  2009-01-05  5:03 Repeatable instance of bug#870 Juanma Barranquero
  2009-01-05  5:05 ` Juanma Barranquero
@ 2009-01-05 10:59 ` Jason Rumney
  2009-01-05 11:12   ` Juanma Barranquero
  2009-01-07  1:07   ` Kenichi Handa
  1 sibling, 2 replies; 11+ messages in thread
From: Jason Rumney @ 2009-01-05 10:59 UTC
  To: Juanma Barranquero, 870; +Cc: Emacs Devel

Juanma Barranquero wrote:
>    emacs -Q --eval "(desktop-save-mode 1)" ChangeLog.870
>   

I can also reproduce the bug with C-x RET r utf-8-dos after visiting the 
file normally.

It appears that there is a bug in all the decode_coding_* functions when 
a CR lies on a CHARBUF_SIZE (0x4000) boundary with a matching LF on the 
other side of the boundary.

They all do something like:

      if (eol_crlf && c1 == '\r')
        ONE_MORE_BYTE (byte_after_cr);

but ONE_MORE_BYTE will abort the decode if it reaches the end of the 
buffer, leaving the CR in limbo between having been read and being added 
to the buffer. Then on decoding the subsequent block, the initial LF 
does not trip the normal CRLF decoding, so it is put into the buffer.
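
A minimal sketch, not from the original message and not Emacs code, of
the general hazard described above: a CRLF-to-LF converter that consumes
its input in fixed-size chunks has to carry a trailing CR over to the
next chunk, or the pair is never seen together. The name
`decode_crlf_chunked` and its parameters are hypothetical. Note that
later messages in this thread dispute this analysis and report that the
actual fix was in a different part of decode_coding_*, so this only
illustrates the class of problem, not the real defect.

```python
CR, LF = 0x0D, 0x0A


def decode_crlf_chunked(data, chunk_size):
    """Convert CRLF to LF while reading `data` in chunk_size pieces.

    A CR that ends a chunk is held back and re-examined together with
    the first byte of the next chunk, so a pair split across a chunk
    boundary is still recognised.
    """
    out = bytearray()
    carry = b""
    for start in range(0, len(data), chunk_size):
        chunk = carry + data[start:start + chunk_size]
        carry = b""
        i = 0
        while i < len(chunk):
            byte = chunk[i]
            if byte == CR:
                if i + 1 == len(chunk):
                    # Lookahead would run off the chunk: defer the CR.
                    carry = bytes([CR])
                    break
                if chunk[i + 1] == LF:
                    out.append(LF)
                    i += 2
                    continue
            out.append(byte)
            i += 1
    out += carry  # A lone CR at the very end of input is kept literally.
    return bytes(out)


text = b"ab\r\ncd\r\nlone\rCR\r"
reference = text.replace(b"\r\n", b"\n")  # whole-input conversion
for size in range(1, len(text) + 1):
    assert decode_crlf_chunked(text, size) == reference
print("chunked decoding matches the whole-input result for every chunk size")
```

The loop at the end compares the chunked result against a whole-input
conversion for every chunk size, which forces the CR onto a boundary in
several of the runs.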






* Re: bug#870: Repeatable instance of bug#870
  2009-01-05 10:59 ` bug#870: " Jason Rumney
@ 2009-01-05 11:12   ` Juanma Barranquero
  2009-01-05 11:22     ` Jason Rumney
  2009-01-07  1:07   ` Kenichi Handa
  1 sibling, 1 reply; 11+ messages in thread
From: Juanma Barranquero @ 2009-01-05 11:12 UTC
  To: Jason Rumney; +Cc: 870, Emacs Devel

On Mon, Jan 5, 2009 at 11:59, Jason Rumney <jasonr@gnu.org> wrote:

> It appears that there is a bug in all the decode_coding_* functions when a
> CR lies on a CHARBUF_SIZE (0x4000) boundary with a matching LF on the other
> side of the boundary.
>
> They all do something like:
>
>     if (eol_crlf && c1 == '\r')
>       ONE_MORE_BYTE (byte_after_cr);
>
> but ONE_MORE_BYTE will abort the decode if it reaches the end of the buffer,
> leaving the CR in limbo between having been read and being added to the
> buffer. Then on decoding the subsequent block, the initial LF does not trip
> the normal CRLF decoding, so it is put into the buffer.

Wouldn't that mean that, on writing the buffer, the file would end up
with extra CRs, instead of missing LFs?

    Juanma





* Re: bug#870: Repeatable instance of bug#870
  2009-01-05 11:12   ` Juanma Barranquero
@ 2009-01-05 11:22     ` Jason Rumney
  2009-01-05 11:31       ` Juanma Barranquero
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Rumney @ 2009-01-05 11:22 UTC
  To: Juanma Barranquero; +Cc: 870, Emacs Devel

Juanma Barranquero wrote:
> On Mon, Jan 5, 2009 at 11:59, Jason Rumney <jasonr@gnu.org> wrote:
>
>   
>> It appears that there is a bug in all the decode_coding_* functions when a
>> CR lies on a CHARBUF_SIZE (0x4000) boundary with a matching LF on the other
>> side of the boundary.
>>
>> They all do something like:
>>
>>     if (eol_crlf && c1 == '\r')
>>       ONE_MORE_BYTE (byte_after_cr);
>>
>> but ONE_MORE_BYTE will abort the decode if it reaches the end of the buffer,
>> leaving the CR in limbo between having been read and being added to the
>> buffer. Then on decoding the subsequent block, the initial LF does not trip
>> the normal CRLF decoding, so it is put into the buffer.
>>     
>
> Wouldn't that mean that, on writing the buffer, the file would end up
> with extra CRs, instead of missing LFs?
>   
The CRs are effectively stripped on reading, since they end up in limbo 
between being read and being added to the decoding buffer. I haven't 
tried writing the file, but I think (from memory and from the way the 
code looks to me) the problem is a missing CR, not a missing LF.






* Re: bug#870: Repeatable instance of bug#870
  2009-01-05 11:22     ` Jason Rumney
@ 2009-01-05 11:31       ` Juanma Barranquero
  2009-01-05 13:50         ` Jason Rumney
  0 siblings, 1 reply; 11+ messages in thread
From: Juanma Barranquero @ 2009-01-05 11:31 UTC
  To: Jason Rumney; +Cc: 870, Emacs Devel

On Mon, Jan 5, 2009 at 12:22, Jason Rumney <jasonr@gnu.org> wrote:

> The CRs are effectively stripped on reading, since they end up in limbo
> between being read and being added to the decoding buffer. I haven't tried
> writing the file, but I think (from memory and from the way the code looks
> to me) the problem is a missing CR, not a missing LF.

That's not what I see.

ChangeLog.870 initially contains:

0000 7ff0 20 74 69 6d 65 2d 73 74  61 6d 70 2e 65 6c 3a 0d   time-stamp.el:.
0000 8000 0a 09 2a 20 74 69 6d 65  2e 65 6c 3a 0d 0a 09 2a  ..* time.el:...*

After rereading the file, in Emacs it shows as:

	* time-stamp.el:^M	* time.el:

which I interpret as if, while reading, the ^M was read without ^J and
so taken literally, while the ^J was missing.

Then, if I write it back, the file on disk contains

0000 7ff0 20 74 69 6d 65 2d 73 74  61 6d 70 2e 65 6c 3a 0d   time-stamp.el:.
0000 8000 09 2a 20 74 69 6d 65 2e  65 6c 3a 0d 0a 09 2a 20  .* time.el:...*

so a LF has gone missing.
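
A minimal sketch, not part of the original message, that checks the two
hex dumps above against each other. The byte values are copied from the
dumps; the variable names and the base offset constant are only
illustrative.

```python
BASE = 0x7FF0  # file offset of the first byte shown in each dump

# Bytes 0x7ff0..0x800f of the original ChangeLog.870, from the first dump.
before = bytes.fromhex(
    "20 74 69 6d 65 2d 73 74 61 6d 70 2e 65 6c 3a 0d "
    "0a 09 2a 20 74 69 6d 65 2e 65 6c 3a 0d 0a 09 2a"
)
# The same range after rereading and writing back, from the second dump.
after = bytes.fromhex(
    "20 74 69 6d 65 2d 73 74 61 6d 70 2e 65 6c 3a 0d "
    "09 2a 20 74 69 6d 65 2e 65 6c 3a 0d 0a 09 2a 20"
)

# Index of the first byte that differs between the two dumps.
first_diff = next(i for i, (a, b) in enumerate(zip(before, after)) if a != b)

print(hex(BASE + first_diff))       # offset of the first difference
print(hex(before[first_diff]))      # the byte present only in the original

# Deleting that single byte from the original reproduces the written file
# over the range both dumps cover.
rebuilt = before[:first_diff] + before[first_diff + 1:]
print(rebuilt == after[:len(rebuilt)])
```

The first difference is at offset 0x8000, the byte there is 0x0a, and
removing just that byte makes the original match the written-back file
over the range shown, which agrees with a single LF having been dropped.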

    Juanma





* Re: bug#870: Repeatable instance of bug#870
  2009-01-05 11:31       ` Juanma Barranquero
@ 2009-01-05 13:50         ` Jason Rumney
  2009-01-05 14:28           ` Juanma Barranquero
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Rumney @ 2009-01-05 13:50 UTC
  To: Juanma Barranquero; +Cc: 870, Emacs Devel

Juanma Barranquero wrote:
> After rereading the file, in Emacs it shows as:
>
> 	* time-stamp.el:^M	* time.el:
>
> which I interpret as if, while reading, the ^M was read without ^J and
> so taken literally, while the ^J was missing.
>
> Then, if I write it back, the file on disk contains
>
> 0000 7ff0 20 74 69 6d 65 2d 73 74  61 6d 70 2e 65 6c 3a 0d   time-stamp.el:.
> 0000 8000 09 2a 20 74 69 6d 65 2e  65 6c 3a 0d 0a 09 2a 20  .* time.el:...*
>
> so a LF has gone missing.
>   

Yes, you're right, it is an LF (^J) that has gone missing - I was 
confused. So maybe I am wrong about exactly what happens in that part of 
the decode functions - maybe the CR does get written to the buffer, but 
the following LF is somehow swallowed.






* Re: bug#870: Repeatable instance of bug#870
  2009-01-05 13:50         ` Jason Rumney
@ 2009-01-05 14:28           ` Juanma Barranquero
  0 siblings, 0 replies; 11+ messages in thread
From: Juanma Barranquero @ 2009-01-05 14:28 UTC
  To: Jason Rumney; +Cc: 870, Emacs Devel

On Mon, Jan 5, 2009 at 14:50, Jason Rumney <jasonr@gnu.org> wrote:

> So
> maybe I am wrong about exactly what happens in that part of the decode
> functions - maybe the CR does get written to the buffer, but the following
> LF is somehow swallowed.

The bug does not happen on encoding (for writing), because it is
already visible after re-decoding (I mean, after desktop.el applies
buffer-file-coding-system, or after the
revert-buffer-with-coding-system call in your example). Once the
buffer has the lone ^M, it's no wonder it ends up in the file after
writing.

I think you're right that the problem is related to decoding a CRLF
when the pair crosses a buffer boundary.

    Juanma





* Re: bug#870: Repeatable instance of bug#870
  2009-01-05 10:59 ` bug#870: " Jason Rumney
  2009-01-05 11:12   ` Juanma Barranquero
@ 2009-01-07  1:07   ` Kenichi Handa
  2009-01-07  6:53     ` Kenichi Handa
  1 sibling, 1 reply; 11+ messages in thread
From: Kenichi Handa @ 2009-01-07  1:07 UTC
  To: Jason Rumney; +Cc: lekktu, 870, emacs-devel

In article <4961E7F7.2000509@gnu.org>, Jason Rumney <jasonr@gnu.org> writes:

> Juanma Barranquero wrote:
> >    emacs -Q --eval "(desktop-save-mode 1)" ChangeLog.870
> >   

> I can also reproduce the bug with C-x RET r utf-8-dos after visiting the 
> file normally.

I can reproduce it by that recipe.

> It appears that there is a bug in all the decode_coding_* functions when 
> a CR lies on a CHARBUF_SIZE (0x4000) boundary with a matching LF on the 
> other side of the boundary.

> They all do something like:

>       if (eol_crlf && c1 == '\r')
>         ONE_MORE_BYTE (byte_after_cr);

> but ONE_MORE_BYTE will abort the decode if it reaches the end of the 
> buffer, leaving the CR in limbo between having been read and being added 
> to the buffer. Then on decoding the subsequent block, the initial LF 
> does not trip the normal CRLF decoding, so it is put into the buffer.

??? decode_coding_* gets bytes from coding->source and
produces characters in CHARBUF.  So, I think the above
analysis is not correct.

As normal visiting of ChangeLog.870 doesn't have the problem
but revisiting it causes the problem, I think the bug is in
Finsert_file_contents; perhaps in the handling of REPLACE.
I'll have a look at it.

---
Kenichi Handa
handa@m17n.org





* Re: bug#870: Repeatable instance of bug#870
  2009-01-07  1:07   ` Kenichi Handa
@ 2009-01-07  6:53     ` Kenichi Handa
  2009-01-07  9:43       ` Juanma Barranquero
  0 siblings, 1 reply; 11+ messages in thread
From: Kenichi Handa @ 2009-01-07  6:53 UTC
  To: Kenichi Handa; +Cc: lekktu, emacs-devel, 870, jasonr

In article <E1LKMsw-0005wG-G6@etlken.m17n.org>, Kenichi Handa <handa@m17n.org> writes:

> > It appears that there is a bug in all the decode_coding_* functions when 
> > a CR lies on a CHARBUF_SIZE (0x4000) boundary with a matching LF on the 
> > other side of the boundary.

> > They all do something like:

> >       if (eol_crlf && c1 == '\r')
> >         ONE_MORE_BYTE (byte_after_cr);

> > but ONE_MORE_BYTE will abort the decode if it reaches the end of the 
> > buffer, leaving the CR in limbo between having been read and being added 
> > to the buffer. Then on decoding the subsequent block, the initial LF 
> > does not trip the normal CRLF decoding, so it is put into the buffer.

> ??? decode_coding_* gets bytes from coding->source and
> produces characters in CHARBUF.  So, I think the above
> analysis is not correct.

> As normal visiting of ChangeLog.870 doesn't have the problem
> but revisiting it causes the problem, I think the bug is in
> Finsert_file_contents; perhaps in the handling of REPLACE.
> I'll have a look at it.

I fixed the bug.  Actually, what was wrong was decode_coding_*,
but in a different place from the one described above.

---
Kenichi Handa
handa@m17n.org





* Re: bug#870: Repeatable instance of bug#870
  2009-01-07  6:53     ` Kenichi Handa
@ 2009-01-07  9:43       ` Juanma Barranquero
  0 siblings, 0 replies; 11+ messages in thread
From: Juanma Barranquero @ 2009-01-07  9:43 UTC
  To: Kenichi Handa; +Cc: emacs-devel, 870, jasonr

On Wed, Jan 7, 2009 at 07:53, Kenichi Handa <handa@m17n.org> wrote:

> I fixed the bug.

Thanks! (I've been suffering from this #$@!&* for the past eight months or so.)

I've added the "(Bug#870)" ref to your ChangeLog entry.

    Juanma




