* bug#16286: 24.3.50; insert-file-contents may bring invisible garbage @ 2013-12-29 14:05 Andrey Kotlarski 2014-01-02 16:30 ` Eli Zaretskii 2014-01-26 0:36 ` Paul Eggert 0 siblings, 2 replies; 7+ messages in thread From: Andrey Kotlarski @ 2013-12-29 14:05 UTC (permalink / raw) To: 16286 In trunk inserting few bytes from file may sometimes result in nothing visible in the buffer while invisible artifacts are present and may affect subsequent operations. Moreover, there doesn't seem to be way to recover from this. Here's example session with emacs -Q: (let ((file "test.txt")) (unless (file-exists-p file) (find-file file) (insert "абв") ;Cyrillic letters (save-buffer) (kill-buffer)) (let ((buf (generate-new-buffer "test"))) (switch-to-buffer buf) (insert-file-contents file nil 0 2) ;inserts а (goto-char (point-max)) (insert-file-contents file nil 2 3) ;returns 0 bytes inserted, nothing visible in the buffer ;but actually there is (erase-buffer) ;and still is (insert-file-contents file nil 2 4) ;should insert б, instead let: Wrong type argument: inserted-chars, 1 (message "%S" (buffer-string)) ;"бЀ" while buffer is visibly empty )) Trying to insert multibyte characters now brings content length issues, garbage inserted and at some point Emacs crashes. In release 24.3 and earlier insert-file-contents seems to always insert something, be it wrongly decoded or raw eight-bit characters. But it is visible and easy to deal with. The above example works fine there. This is useful for the vlf package (https://github.com/m00natic/vlfi) as a way to detect insufficient amount of bytes requested and allows further adjustment. In GNU Emacs 24.3.50.1 (x86_64-pc-linux-gnu) of 2013-12-29 on andrexhe Bzr revision: 115803 eggert@cs.ucla.edu-20131229075253-hmeofd1oihd5n3rk Windowing system distributor `The X.Org Foundation', version 11.0.11405000 Configured using: `configure --build=x86_64-pc-linux-gnu --enable-link-time-optimization --with-x-toolkit=no --with-wide-int --without-toolkit-scroll-bars --without-xaw3d --without-gpm --without-gconf --without-gsettings build_alias=x86_64-pc-linux-gnu 'CFLAGS=-march=native -mtune=native -O2 -pipe'' Important settings: value of $LC_COLLATE: C value of $LANG: bg_BG.UTF-8 locale-coding-system: utf-8-unix Major mode: Fundamental Minor modes in effect: tooltip-mode: t electric-indent-mode: t mouse-wheel-mode: t tool-bar-mode: t menu-bar-mode: t file-name-shadow-mode: t global-font-lock-mode: t blink-cursor-mode: t auto-composition-mode: t auto-encryption-mode: t auto-compression-mode: t line-number-mode: t transient-mark-mode: t Recent input: M-x r e p o - e m - b u <tab> <return> Recent messages: For information about GNU Emacs and the GNU system, type C-h C-a. let: Wrong type argument: inserted-chars, 1 Load-path shadows: None found. Features: (shadow sort gnus-util mail-extr emacsbug message format-spec rfc822 mml easymenu mml-sec mm-decode mm-bodies mm-encode mail-parse rfc2231 mailabbrev gmm-utils mailheader sendmail rfc2047 rfc2045 ietf-drums mm-util help-fns mail-prsvr mail-utils time-date tooltip electric uniquify ediff-hook vc-hooks lisp-float-type mwheel x-win x-dnd tool-bar dnd fontset image regexp-opt fringe tabulated-list newcomment lisp-mode prog-mode register page menu-bar rfn-eshadow timer select scroll-bar mouse jit-lock font-lock syntax facemenu font-core frame cham georgian utf-8-lang misc-lang vietnamese tibetan thai tai-viet lao korean japanese hebrew greek romanian slovak czech european ethiopic indian cyrillic chinese case-table epa-hook jka-cmpr-hook help simple abbrev minibuffer nadvice loaddefs button faces cus-face macroexp files text-properties overlay sha1 md5 base64 format env code-pages mule custom widget hashtable-print-readable backquote make-network-process dbusbind gfilenotify dynamic-setting font-render-setting x multi-tty emacs) ^ permalink raw reply [flat|nested] 7+ messages in thread
* bug#16286: 24.3.50; insert-file-contents may bring invisible garbage 2013-12-29 14:05 bug#16286: 24.3.50; insert-file-contents may bring invisible garbage Andrey Kotlarski @ 2014-01-02 16:30 ` Eli Zaretskii 2014-01-04 22:42 ` Andrey Kotlarski 2014-01-26 0:36 ` Paul Eggert 1 sibling, 1 reply; 7+ messages in thread From: Eli Zaretskii @ 2014-01-02 16:30 UTC (permalink / raw) To: Andrey Kotlarski, Kenichi Handa; +Cc: 16286 > From: Andrey Kotlarski <m00naticus@gmail.com> > Date: Sun, 29 Dec 2013 16:05:22 +0200 > > In trunk inserting few bytes from file may sometimes result in nothing > visible in the buffer while invisible artifacts are present and may > affect subsequent operations. Moreover, there doesn't seem to be way to > recover from this. Here's example session with emacs -Q: > > (let ((file "test.txt")) > (unless (file-exists-p file) > (find-file file) > (insert "абв") ;Cyrillic letters > (save-buffer) > (kill-buffer)) > > (let ((buf (generate-new-buffer "test"))) > (switch-to-buffer buf) > (insert-file-contents file nil 0 2) ;inserts а > (goto-char (point-max)) > (insert-file-contents file nil 2 3) ;returns 0 bytes inserted, nothing visible in the buffer > ;but actually there is > (erase-buffer) ;and still is > (insert-file-contents file nil 2 4) ;should insert б, instead let: Wrong type argument: inserted-chars, 1 > (message "%S" (buffer-string)) ;"бЀ" while buffer is visibly empty > )) > > Trying to insert multibyte characters now brings content length issues, > garbage inserted and at some point Emacs crashes. Your Emacs is built without --enable-checking; if that configure-time switch is used, Emacs hits an assertion violation as soon as this sexp is evaluated: (insert-file-contents file nil 2 3) Also, you are wrong about there being some invisible stuff in the buffer. The problem is elsewhere: Emacs gets confused about the number of characters and the number of bytes in the buffer. These two counts should be in sync at all times; once they become unsynchronized, Emacs will generally crash very soon. I'm CC'ing Handa-san in the hope that he will be able to suggest a solution. The problem happens in decode_coding_gap (called from insert-file-contents), in this code fragment (note the call to detect_coding): if (CODING_REQUIRE_DETECTION (coding)) detect_coding (coding); attrs = CODING_ID_ATTRS (coding->id); if (! disable_ascii_optimization && ! coding->src_multibyte && ! NILP (CODING_ATTR_ASCII_COMPAT (attrs)) && NILP (CODING_ATTR_POST_READ (attrs)) && NILP (get_translation_table (attrs, 0, NULL))) { chars = coding->head_ascii; if (chars < 0) chars = check_ascii (coding); if (chars != bytes) { /* There exists a non-ASCII byte. */ if (EQ (CODING_ATTR_TYPE (attrs), Qutf_8)) { if (coding->detected_utf8_chars >= 0) chars = coding->detected_utf8_chars; <<<<<<<<<<<<<< else chars = check_utf_8 (coding); This reuses the number of characters that are valid UTF-8 sequences in the byte stream to be decoded, stored in coding->detected_utf8_chars, which were found by detect_coding_utf_8, which was called by detect_coding. In the case in point, detect_coding_utf_8 finds zero valid UTF-8 sequences, and so 'chars' becomes zero. But the number of decoded bytes is not adjusted to fit that, so it stays at its original value of 1. Then, decode_coding_gap does this: coding->produced = bytes; coding->produced_char = chars; insert_from_gap (chars, bytes, 1); Since 'chars' is zero, but 'bytes' is 1, this causes a mismatch between buffer's Z and Z_BYTE values, and from there it's a slippery slope all the way to an assertion violation during redisplay. Similar problems happen when insert-file-contents is called to read some number of bytes that doesn't end at a UTF-8 sequence boundary. I think I see a potential reason for this in detect_coding_utf_8, near its end: if (nchars < src_end - coding->source) /* The found characters are less than source bytes, which means that we found a valid non-ASCII characters. */ detect_info->found |= CATEGORY_MASK_UTF_8_AUTO | CATEGORY_MASK_UTF_8_NOSIG; This misses the use case such as this one, where the detection loop consumed one byte, found it not to be the head byte of a UTF-8 sequence, and then hit the end of the source bytes. It looks like the function incorrectly returns a success indication in this case, which might be part of the problem. > In release 24.3 and earlier insert-file-contents seems to always insert > something, be it wrongly decoded or raw eight-bit characters. But it is > visible and easy to deal with. The above example works fine there. > This is useful for the vlf package (https://github.com/m00natic/vlfi) as > a way to detect insufficient amount of bytes requested and allows > further adjustment. What vlf does is strange and IMO not the best possible solution to this issue: (cond ((vlf-partial-decode-shown-p) ;remove raw bytes from end (goto-char (point-max)) (while (eq (char-charset (preceding-char)) 'eight-bit) (setq shift-end (1- shift-end)) (delete-char -1))) ((< end vlf-file-size) ;add bytes until new character is displayed (let ((position (or position (point-min))) (expected-size (buffer-size))) (while (and (progn (setq shift-end (1+ shift-end) end (1+ end)) (delete-region position (point-max)) (goto-char position) (insert-file-contents buffer-file-name nil start end) (< end vlf-file-size)) (= expected-size (buffer-size)))))))) This seems to have a subtle misfeature of not supporting files with inconsistent encoding, or files with binary data, because there _all_ characters will belong to the eight-bit charset. Also, I don't understand why the removal of raw bytes is conditioned on Emacs version: why not just remove them unconditionally: if there are none, nothing will be removed. More to the point, I'm not sure whether inserting raw bytes in insert-file-contents when a portion of a multibyte sequence was read (i.e. go back to what Emacs 24.3 did) will be good for vlf. It sounds to me much better if Emacs would only return complete characters read from the file, so that applications will not need to remove those stray bytes. Finally, it would seem a better design for vlf to always read a few more bytes than was requested into some scratch buffer, and then decode them manually to determine just how many to copy to the main buffer. ^ permalink raw reply [flat|nested] 7+ messages in thread
* bug#16286: 24.3.50; insert-file-contents may bring invisible garbage 2014-01-02 16:30 ` Eli Zaretskii @ 2014-01-04 22:42 ` Andrey Kotlarski 0 siblings, 0 replies; 7+ messages in thread From: Andrey Kotlarski @ 2014-01-04 22:42 UTC (permalink / raw) To: Eli Zaretskii; +Cc: 16286 Thanks a lot for the hints and pointers. [ 2 януари 2014, 18:30 +0200, четвъртък ] Eli Zaretskii: > What vlf does is strange and IMO not the best possible solution to > this issue: > ... > This seems to have a subtle misfeature of not supporting files with > inconsistent encoding, or files with binary data, because there _all_ > characters will belong to the eight-bit charset. There had been changes meanwhile which hopefully address these (no special treatment of eight-bit) along other issues (vlf-base.el). > More to the point, I'm not sure whether inserting raw bytes in > insert-file-contents when a portion of a multibyte sequence was read > (i.e. go back to what Emacs 24.3 did) will be good for vlf. It sounds > to me much better if Emacs would only return complete characters read > from the file, so that applications will not need to remove those > stray bytes. I agree. It would be ideal for vlf if insert-file-contents would also report the number of stray bytes at the end that haven't been utilized. > Finally, it would seem a better design for vlf to always read a few > more bytes than was requested into some scratch buffer, and then > decode them manually to determine just how many to copy to the main > buffer. I see that vlf somehow works only by some accident with current trunk (and --enable-checking disabled), so I'm on it. My initial attempt at naively combining insert-file-contents-literally with decode-coding-inserted-region though often produces wrong decoding where insert-file-contents would be good. ^ permalink raw reply [flat|nested] 7+ messages in thread
* bug#16286: 24.3.50; insert-file-contents may bring invisible garbage 2013-12-29 14:05 bug#16286: 24.3.50; insert-file-contents may bring invisible garbage Andrey Kotlarski 2014-01-02 16:30 ` Eli Zaretskii @ 2014-01-26 0:36 ` Paul Eggert 2014-01-27 15:01 ` K. Handa 1 sibling, 1 reply; 7+ messages in thread From: Paul Eggert @ 2014-01-26 0:36 UTC (permalink / raw) To: 16286-done I installed a patch as trunk bzr 116158, which (at least for me) fixes the reported bug, and am taking the liberty of marking this as done. There may well be a better fix, but at least Emacs shouldn't crash or report nonsense now. ^ permalink raw reply [flat|nested] 7+ messages in thread
* bug#16286: 24.3.50; insert-file-contents may bring invisible garbage 2014-01-26 0:36 ` Paul Eggert @ 2014-01-27 15:01 ` K. Handa 2014-01-27 17:01 ` Paul Eggert 0 siblings, 1 reply; 7+ messages in thread From: K. Handa @ 2014-01-27 15:01 UTC (permalink / raw) To: Paul Eggert; +Cc: 16286-done In article <52E4588D.70004@cs.ucla.edu>, Paul Eggert <eggert@cs.ucla.edu> writes: > I installed a patch as trunk bzr 116158, which (at least for me) fixes > the reported bug, and am taking the liberty of marking this as done. > There may well be a better fix, but at least Emacs shouldn't crash or > report nonsense now. Thank you for working on this bug which I introduced when I made decode_coding_gap optimized for ASCII and UTF-8 only files. Your change is to set CODING_MODE_LAST_BLOCK in coding->mode before calling decode_coding_gap so that detect_coding doesn't detect a file as utf-8 if it has incomplete utf-8 sequence at the tail (as the reported testcase). But, I think it is better that detect_coding detects such a file as utf-8 and treats the trailing garbage as raw bytes. 24.3 does it, and that is why decode_coding_gap sets CODING_MODE_LAST_BLOCK after calling detect_coding. So, I suggest the attached fix instead of yours. What do you think? --- Kenichi Handa handa@gnu.org === modified file 'src/ChangeLog' --- src/ChangeLog 2014-01-26 12:17:55 +0000 +++ src/ChangeLog 2014-01-27 14:53:58 +0000 @@ -1,3 +1,16 @@ +2014-01-27 K. Handa <handa@gnu.org> + + These change are to fix bug#16286 in the different way than what + done by revno:116158. + + * coding.h (struct coding_system): New member detected_utf8_bytes. + + * coding.c (detect_coding_utf_8): Set coding->detected_utf8_bytes. + (decode_coding_gap): Use short cut for UTF-8 file reading only + when coding->detected_utf8_bytes equals to coding->src_bytes. + + * fileio.c (Finsert_file_contents): Cancel the previous change. + 2014-01-26 Jan Djärv <jan.h.d@swipnet.se> * xterm.c (x_focus_changed): Check for non-X terminal-frame (Bug#16540) === modified file 'src/coding.c' --- src/coding.c 2014-01-26 01:20:24 +0000 +++ src/coding.c 2014-01-27 14:47:43 +0000 @@ -1300,6 +1300,7 @@ means that we found a valid non-ASCII characters. */ detect_info->found |= CATEGORY_MASK_UTF_8_AUTO | CATEGORY_MASK_UTF_8_NOSIG; } + coding->detected_utf8_bytes = src_base - coding->source; coding->detected_utf8_chars = nchars; return 1; } @@ -7890,7 +7891,7 @@ coding->dst_multibyte = ! NILP (BVAR (current_buffer, enable_multibyte_characters)); coding->head_ascii = -1; - coding->detected_utf8_chars = -1; + coding->detected_utf8_bytes = coding->detected_utf8_chars = -1; coding->eol_seen = EOL_SEEN_NONE; if (CODING_REQUIRE_DETECTION (coding)) detect_coding (coding); @@ -7907,7 +7908,8 @@ if (chars != bytes) { /* There exists a non-ASCII byte. */ - if (EQ (CODING_ATTR_TYPE (attrs), Qutf_8)) + if (EQ (CODING_ATTR_TYPE (attrs), Qutf_8) + && coding->detected_utf8_bytes == coding->src_bytes) { if (coding->detected_utf8_chars >= 0) chars = coding->detected_utf8_chars; === modified file 'src/coding.h' --- src/coding.h 2014-01-26 01:20:24 +0000 +++ src/coding.h 2014-01-27 14:47:43 +0000 @@ -468,7 +468,9 @@ the eol format. */ ptrdiff_t head_ascii; - ptrdiff_t detected_utf8_chars; + /* How many bytes/chars at the source are detected as valid utf-8 + sequence. Set by detect_coding_utf_8. */ + ptrdiff_t detected_utf8_bytes, detected_utf8_chars; /* Used internally in coding.c. See the comment of detect_ascii. */ int eol_seen; === modified file 'src/fileio.c' --- src/fileio.c 2014-01-26 00:32:30 +0000 +++ src/fileio.c 2014-01-27 14:47:59 +0000 @@ -4298,7 +4298,6 @@ Z_BYTE -= inserted; ZV -= inserted; Z -= inserted; - coding.mode |= CODING_MODE_LAST_BLOCK; decode_coding_gap (&coding, inserted, inserted); inserted = coding.produced_char; coding_system = CODING_ID_NAME (coding.id); ^ permalink raw reply [flat|nested] 7+ messages in thread
* bug#16286: 24.3.50; insert-file-contents may bring invisible garbage 2014-01-27 15:01 ` K. Handa @ 2014-01-27 17:01 ` Paul Eggert 2014-01-29 13:40 ` K. Handa 0 siblings, 1 reply; 7+ messages in thread From: Paul Eggert @ 2014-01-27 17:01 UTC (permalink / raw) To: K. Handa; +Cc: 16286-done Yes, thanks, that looks like a better patch. ^ permalink raw reply [flat|nested] 7+ messages in thread
* bug#16286: 24.3.50; insert-file-contents may bring invisible garbage 2014-01-27 17:01 ` Paul Eggert @ 2014-01-29 13:40 ` K. Handa 0 siblings, 0 replies; 7+ messages in thread From: K. Handa @ 2014-01-29 13:40 UTC (permalink / raw) To: Paul Eggert; +Cc: 16286-done In article <52E690DE.6070101@cs.ucla.edu>, Paul Eggert <eggert@cs.ucla.edu> writes: > Yes, thanks, that looks like a better patch. Ok, I've just committed it. --- Kenichi Handa handa@gnu.org ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2014-01-29 13:40 UTC | newest] Thread overview: 7+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-12-29 14:05 bug#16286: 24.3.50; insert-file-contents may bring invisible garbage Andrey Kotlarski 2014-01-02 16:30 ` Eli Zaretskii 2014-01-04 22:42 ` Andrey Kotlarski 2014-01-26 0:36 ` Paul Eggert 2014-01-27 15:01 ` K. Handa 2014-01-27 17:01 ` Paul Eggert 2014-01-29 13:40 ` K. Handa
Code repositories for project(s) associated with this public inbox https://git.savannah.gnu.org/cgit/emacs.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).