bug#16286: 24.3.50; insert-file-contents may bring invisible garbage

From: Eli Zaretskii <eliz@gnu.org>
To: Andrey Kotlarski <m00naticus@gmail.com>, Kenichi Handa <handa@gnu.org>
Cc: 16286@debbugs.gnu.org
Subject: bug#16286: 24.3.50; insert-file-contents may bring invisible garbage
Date: Thu, 02 Jan 2014 18:30:30 +0200
Message-ID: <83y52yxs61.fsf@gnu.org>
In-Reply-To: <87sitb4usd.fsf@gmail.com>

> From: Andrey Kotlarski <m00naticus@gmail.com>
> Date: Sun, 29 Dec 2013 16:05:22 +0200
> 
> In trunk, inserting a few bytes from a file may sometimes result in nothing
> visible in the buffer while invisible artifacts are present and may
> affect subsequent operations.  Moreover, there doesn't seem to be a way to
> recover from this.  Here's an example session with emacs -Q:
> 
> (let ((file "test.txt"))
>   (unless (file-exists-p file)
>     (find-file file)
>     (insert "абв")                      ;Cyrillic letters
>     (save-buffer)
>     (kill-buffer))
> 
>   (let ((buf (generate-new-buffer "test")))
>     (switch-to-buffer buf)
>     (insert-file-contents file nil 0 2) ;inserts а
>     (goto-char (point-max))
>     (insert-file-contents file nil 2 3) ;returns 0 bytes inserted, nothing visible in the buffer
>                                         ;but actually there is
>     (erase-buffer)                      ;and still is
>     (insert-file-contents file nil 2 4) ;should insert б, instead let: Wrong type argument: inserted-chars, 1
>     (message "%S" (buffer-string)) ;"бЀ" while buffer is visibly empty
>     ))
> 
> Trying to insert multibyte characters now brings content length issues,
> garbage inserted and at some point Emacs crashes.

Your Emacs is built without --enable-checking; if that configure-time
switch is used, Emacs hits an assertion violation as soon as this sexp
is evaluated:

     (insert-file-contents file nil 2 3)

Also, you are wrong about there being some invisible stuff in the
buffer.  The problem is elsewhere: Emacs gets confused about the
number of characters and the number of bytes in the buffer.  These two
counts should be in sync at all times; once they become
unsynchronized, Emacs will generally crash very soon.

I'm CC'ing Handa-san in the hope that he will be able to suggest a
solution.

The problem happens in decode_coding_gap (called from
insert-file-contents), in this code fragment (note the call to
detect_coding):

  if (CODING_REQUIRE_DETECTION (coding))
    detect_coding (coding);
  attrs = CODING_ID_ATTRS (coding->id);
  if (! disable_ascii_optimization
      && ! coding->src_multibyte
      && ! NILP (CODING_ATTR_ASCII_COMPAT (attrs))
      && NILP (CODING_ATTR_POST_READ (attrs))
      && NILP (get_translation_table (attrs, 0, NULL)))
    {
      chars = coding->head_ascii;
      if (chars < 0)
	chars = check_ascii (coding);
      if (chars != bytes)
	{
	  /* There exists a non-ASCII byte.  */
	  if (EQ (CODING_ATTR_TYPE (attrs), Qutf_8))
	    {
	      if (coding->detected_utf8_chars >= 0)
		chars = coding->detected_utf8_chars;  <<<<<<<<<<<<<<
	      else
		chars = check_utf_8 (coding);

This reuses coding->detected_utf8_chars, the number of characters that
detect_coding_utf_8 (called by detect_coding) found to be valid UTF-8
sequences in the byte stream to be decoded.  In the case in point,
detect_coding_utf_8 finds zero valid UTF-8 sequences, and so 'chars'
becomes zero.  But the number of decoded bytes is not adjusted to fit
that, so it stays at its original value of 1.  Then, decode_coding_gap
does this:

	  coding->produced = bytes;
	  coding->produced_char = chars;
	  insert_from_gap (chars, bytes, 1);

Since 'chars' is zero, but 'bytes' is 1, this causes a mismatch
between the buffer's Z and Z_BYTE values, and from there it's a
slippery slope all the way to an assertion violation during redisplay.
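
As a minimal illustration of the invariant being violated, here is a
standalone C sketch.  It is not Emacs code: counts_consistent and
advance_end are hypothetical names, and the only assumption made is
that every character occupies at least one byte, so a character count
of zero is compatible only with a byte count of zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not an Emacs function: return true if the
   pair (CHARS, BYTES) could describe real multibyte text.  Each
   character occupies at least one byte, so CHARS <= BYTES, and
   CHARS is zero exactly when BYTES is zero.  */
bool
counts_consistent (ptrdiff_t chars, ptrdiff_t bytes)
{
  if (chars < 0 || bytes < 0 || chars > bytes)
    return false;
  return (chars == 0) == (bytes == 0);
}

/* Hypothetical stand-in for the bookkeeping done on insertion:
   advance the end-of-text positions only for a consistent pair.  */
void
advance_end (ptrdiff_t *z, ptrdiff_t *z_byte,
             ptrdiff_t chars, ptrdiff_t bytes)
{
  assert (counts_consistent (chars, bytes));
  *z += chars;
  *z_byte += bytes;
}
```

With the values from this report (chars = 0, bytes = 1), the check
fails before the two positions are updated, rather than later during
redisplay.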

Similar problems happen when insert-file-contents is called to read
some number of bytes that doesn't end at a UTF-8 sequence boundary.

I think I see a potential reason for this in detect_coding_utf_8, near
its end:

      if (nchars < src_end - coding->source)
	/* The found characters are less than source bytes, which
	   means that we found a valid non-ASCII characters.  */
	detect_info->found |= CATEGORY_MASK_UTF_8_AUTO | CATEGORY_MASK_UTF_8_NOSIG;

This misses a case such as this one, where the detection loop
consumed one byte, found it not to be the head byte of a UTF-8
sequence, and then hit the end of the source bytes.  It looks like the
function incorrectly returns a success indication in this case, which
might be part of the problem.
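
For comparison, here is a minimal sketch of a detector that refuses to
report success in that situation.  It is standalone C, not the code of
detect_coding_utf_8; the names utf8_lead_len and utf8_count_strict are
hypothetical, and the check is deliberately simplified (it does not
reject overlong forms, surrogates, or code points above U+10FFFF).

```c
#include <assert.h>
#include <stddef.h>

/* Sequence length implied by lead byte C, or 0 if C cannot start a
   sequence (a continuation byte 0x80..0xBF, or 0xF8..0xFF).  */
size_t
utf8_lead_len (unsigned char c)
{
  if (c < 0x80)
    return 1;
  if ((c & 0xE0) == 0xC0)
    return 2;
  if ((c & 0xF0) == 0xE0)
    return 3;
  if ((c & 0xF8) == 0xF0)
    return 4;
  return 0;
}

/* Return the number of characters in SRC[0..LEN), or -1 if the bytes
   are not entirely made of complete sequences.  A stray byte or a
   sequence cut off by the end of the source is a failure, never
   "zero characters found".  */
ptrdiff_t
utf8_count_strict (const unsigned char *src, size_t len)
{
  size_t i = 0;
  ptrdiff_t nchars = 0;

  assert (src != NULL || len == 0);
  while (i < len)
    {
      size_t need = utf8_lead_len (src[i]);
      if (need == 0 || need > len - i)
        return -1;              /* stray byte or truncated sequence */
      for (size_t k = 1; k < need; k++)
        if ((src[i + k] & 0xC0) != 0x80)
          return -1;            /* missing continuation byte */
      i += need;
      nchars++;
    }
  return nchars;
}
```

With this shape, a one-byte source holding only the first byte of a
two-byte sequence yields -1, so a caller cannot mistake it for a
successful detection with a character count of zero.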

> In release 24.3 and earlier insert-file-contents seems to always insert
> something, be it wrongly decoded or raw eight-bit characters.  But it is
> visible and easy to deal with.  The above example works fine there.
> This is useful for the vlf package (https://github.com/m00natic/vlfi) as
> a way to detect insufficient amount of bytes requested and allows
> further adjustment.

What vlf does is strange and IMO not the best possible solution to
this issue:

        (cond ((vlf-partial-decode-shown-p) ;remove raw bytes from end
               (goto-char (point-max))
               (while (eq (char-charset (preceding-char)) 'eight-bit)
                 (setq shift-end (1- shift-end))
                 (delete-char -1)))
              ((< end vlf-file-size) ;add bytes until new character is displayed
               (let ((position (or position (point-min)))
                     (expected-size (buffer-size)))
                 (while (and (progn
                               (setq shift-end (1+ shift-end)
                                     end (1+ end))
                               (delete-region position (point-max))
                               (goto-char position)
                               (insert-file-contents buffer-file-name
                                                     nil start end)
                               (< end vlf-file-size))
                             (= expected-size (buffer-size))))))))

This seems to have a subtle misfeature of not supporting files with
inconsistent encoding, or files with binary data, because there _all_
characters will belong to the eight-bit charset.  Also, I don't
understand why the removal of raw bytes is conditioned on the Emacs
version.  Why not just remove them unconditionally?  If there are
none, nothing will be removed.

More to the point, I'm not sure whether inserting raw bytes in
insert-file-contents when a portion of a multibyte sequence was read
(i.e. going back to what Emacs 24.3 did) would be good for vlf.  It
sounds to me much better if Emacs would only return complete
characters read from the file, so that applications will not need to
remove those stray bytes.

Finally, it would seem a better design for vlf to always read a few
more bytes than was requested into some scratch buffer, and then
decode them manually to determine just how many to copy to the main
buffer.
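
The "decode manually" step could look roughly like the following
sketch.  It is standalone C rather than anything vlf or Emacs
provides; utf8_lead_len and utf8_complete_prefix are hypothetical
names, and it assumes the scratch bytes are UTF-8 and otherwise valid.
It only trims an incomplete sequence at the end; a chunk that starts
in the middle of a sequence would need a similar adjustment at the
front.

```c
#include <assert.h>
#include <stddef.h>

/* Sequence length implied by lead byte C, or 0 if C cannot start a
   sequence (a continuation byte 0x80..0xBF, or 0xF8..0xFF).  */
size_t
utf8_lead_len (unsigned char c)
{
  if (c < 0x80)
    return 1;
  if ((c & 0xE0) == 0xC0)
    return 2;
  if ((c & 0xF0) == 0xE0)
    return 3;
  if ((c & 0xF8) == 0xF0)
    return 4;
  return 0;
}

/* Return how many leading bytes of BUF[0..LEN) can be copied without
   splitting the final character.  Only the tail is examined: step
   back over at most three continuation bytes to the last lead byte
   and check whether its sequence fits in the remaining bytes.  */
size_t
utf8_complete_prefix (const unsigned char *buf, size_t len)
{
  size_t back = 0;

  assert (buf != NULL || len == 0);
  while (back < 3 && back < len
         && (buf[len - 1 - back] & 0xC0) == 0x80)
    back++;
  if (back == len)
    return len;                 /* no lead byte in the tail at all */

  size_t lead = len - 1 - back;
  if (utf8_lead_len (buf[lead]) > len - lead)
    return lead;                /* last sequence is cut off */
  return len;
}
```

For the test file in this report, whose six bytes are D0 B0 D0 B1 D0
B2, asking about the first three bytes gives 2: only the first letter
is complete, and the dangling D0 stays in the scratch buffer.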

Thread overview: 7+ messages
2013-12-29 14:05 bug#16286: 24.3.50; insert-file-contents may bring invisible garbage Andrey Kotlarski
2014-01-02 16:30 ` Eli Zaretskii [this message]
2014-01-04 22:42   ` Andrey Kotlarski
2014-01-26  0:36 ` Paul Eggert
2014-01-27 15:01   ` K. Handa
2014-01-27 17:01     ` Paul Eggert
2014-01-29 13:40       ` K. Handa
