From: Eli Zaretskii
Newsgroups: gmane.emacs.bugs
Subject: bug#16286: 24.3.50; insert-file-contents may bring invisible garbage
Date: Thu, 02 Jan 2014 18:30:30 +0200
Message-ID: <83y52yxs61.fsf@gnu.org>
References: <87sitb4usd.fsf@gmail.com>
In-reply-to: <87sitb4usd.fsf@gmail.com>
To: Andrey Kotlarski, Kenichi Handa
Cc: 16286@debbugs.gnu.org

> From: Andrey Kotlarski
> Date: Sun, 29 Dec 2013 16:05:22 +0200
>
> In trunk inserting few bytes from file may sometimes result in nothing
> visible in the buffer while invisible artifacts are present and may
> affect subsequent operations.  Moreover, there doesn't seem to be way to
> recover from this.  Here's example session with emacs -Q:
>
> (let ((file "test.txt"))
>   (unless (file-exists-p file)
>     (find-file file)
>     (insert "абв") ;Cyrillic letters
>     (save-buffer)
>     (kill-buffer))
>
>   (let ((buf (generate-new-buffer "test")))
>     (switch-to-buffer buf)
>     (insert-file-contents file nil 0 2) ;inserts а
>     (goto-char (point-max))
>     (insert-file-contents file nil 2 3) ;returns 0 bytes inserted, nothing visible in the buffer
>                                         ;but actually there is
>     (erase-buffer) ;and still is
>     (insert-file-contents file nil 2 4) ;should insert б, instead let: Wrong type argument: inserted-chars, 1
>     (message "%S" (buffer-string)) ;"бЀ" while buffer is visibly empty
>     ))
>
> Trying to insert multibyte characters now brings content length issues,
> garbage inserted and at some point Emacs crashes.

Your Emacs is built without --enable-checking; if that configure-time
switch is used, Emacs hits an assertion violation as soon as this sexp
is evaluated:

  (insert-file-contents file nil 2 3)

Also, you are wrong about there being some invisible stuff in the
buffer.  The problem is elsewhere: Emacs gets confused about the number
of characters and the number of bytes in the buffer.  These two counts
should be in sync at all times; once they become unsynchronized, Emacs
will generally crash very soon.

I'm CC'ing Handa-san in the hope that he will be able to suggest a
solution.

The problem happens in decode_coding_gap (called from
insert-file-contents), in this code fragment (note the call to
detect_coding):

  if (CODING_REQUIRE_DETECTION (coding))
    detect_coding (coding);
  attrs = CODING_ID_ATTRS (coding->id);
  if (! disable_ascii_optimization
      && ! coding->src_multibyte
      && ! NILP (CODING_ATTR_ASCII_COMPAT (attrs))
      && NILP (CODING_ATTR_POST_READ (attrs))
      && NILP (get_translation_table (attrs, 0, NULL)))
    {
      chars = coding->head_ascii;
      if (chars < 0)
        chars = check_ascii (coding);
      if (chars != bytes)
        {
          /* There exists a non-ASCII byte.  */
          if (EQ (CODING_ATTR_TYPE (attrs), Qutf_8))
            {
              if (coding->detected_utf8_chars >= 0)
                chars = coding->detected_utf8_chars;  <<<<<<<<<<<<<<
              else
                chars = check_utf_8 (coding);

This reuses the count of valid UTF-8 sequences in the byte stream to be
decoded; that count is stored in coding->detected_utf8_chars by
detect_coding_utf_8, which is called by detect_coding.  In the case in
point, detect_coding_utf_8 finds zero valid UTF-8 sequences, and so
'chars' becomes zero.  But the number of decoded bytes is not adjusted
to fit that, so it stays at its original value of 1.  Then,
decode_coding_gap does this:

  coding->produced = bytes;
  coding->produced_char = chars;
  insert_from_gap (chars, bytes, 1);

Since 'chars' is zero, but 'bytes' is 1, this causes a mismatch between
the buffer's Z and Z_BYTE values, and from there it's a slippery slope
all the way to an assertion violation during redisplay.

Similar problems happen when insert-file-contents is called to read
some number of bytes that doesn't end at a UTF-8 sequence boundary.

I think I see a potential reason for this in detect_coding_utf_8, near
its end:

  if (nchars < src_end - coding->source)
    /* The found characters are less than source bytes, which
       means that we found a valid non-ASCII characters.  */
    detect_info->found |= CATEGORY_MASK_UTF_8_AUTO | CATEGORY_MASK_UTF_8_NOSIG;

This misses a case such as this one, where the detection loop consumed
one byte, found it not to be the head byte of a UTF-8 sequence, and
then hit the end of the source bytes.  It looks like the function
incorrectly returns a success indication in this case, which might be
part of the problem.
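
To make the chars/bytes invariant concrete, here is a minimal
standalone C sketch.  It is not Emacs code: utf8_span, utf8_seq_len
and utf8_complete_prefix are hypothetical names made up for
illustration, and the validation is deliberately simplified (it does
not reject overlong 3- and 4-byte forms or surrogates).  The point it
tries to show is that the character count and the byte count are
produced together, so a truncated trailing sequence contributes to
neither.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: both counts are always reported together. */
struct utf8_span {
    size_t chars;   /* complete UTF-8 sequences found */
    size_t bytes;   /* bytes occupied by exactly those sequences */
};

/* Sequence length implied by a head byte, or 0 if the byte cannot
   start a sequence (continuation byte or invalid head). */
static size_t utf8_seq_len(unsigned char head)
{
    if (head < 0x80) return 1;
    if (head >= 0xC2 && head <= 0xDF) return 2;
    if (head >= 0xE0 && head <= 0xEF) return 3;
    if (head >= 0xF0 && head <= 0xF4) return 4;
    return 0;
}

/* Scan src[0..len) and stop at the first malformed or truncated
   sequence.  Simplified check: only the head byte range and the
   10xxxxxx pattern of continuation bytes are verified. */
static struct utf8_span utf8_complete_prefix(const unsigned char *src,
                                             size_t len)
{
    struct utf8_span span = {0, 0};
    size_t n, i;

    assert(src != NULL || len == 0);
    while (span.bytes < len) {
        n = utf8_seq_len(src[span.bytes]);
        if (n == 0 || n > len - span.bytes)
            break;              /* bad head, or sequence cut off by end */
        for (i = 1; i < n; i++)
            if ((src[span.bytes + i] & 0xC0) != 0x80)
                break;
        if (i < n)
            break;              /* bad continuation byte */
        span.chars++;
        span.bytes += n;        /* both counts advance in lockstep */
    }
    return span;
}
```

With the six bytes of "абв" (D0 B0 D0 B1 D0 B2), asking for the single
byte at offset 2 yields zero characters and zero bytes, rather than
zero characters and one byte as in the failing case described above.
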
> In release 24.3 and earlier insert-file-contents seems to always insert
> something, be it wrongly decoded or raw eight-bit characters.  But it is
> visible and easy to deal with.  The above example works fine there.
> This is useful for the vlf package (https://github.com/m00natic/vlfi) as
> a way to detect insufficient amount of bytes requested and allows
> further adjustment.

What vlf does is strange and IMO not the best possible solution to this
issue:

  (cond ((vlf-partial-decode-shown-p) ;remove raw bytes from end
         (goto-char (point-max))
         (while (eq (char-charset (preceding-char)) 'eight-bit)
           (setq shift-end (1- shift-end))
           (delete-char -1)))
        ((< end vlf-file-size) ;add bytes until new character is displayed
         (let ((position (or position (point-min)))
               (expected-size (buffer-size)))
           (while (and (progn
                         (setq shift-end (1+ shift-end)
                               end (1+ end))
                         (delete-region position (point-max))
                         (goto-char position)
                         (insert-file-contents buffer-file-name
                                               nil start end)
                         (< end vlf-file-size))
                       (= expected-size (buffer-size))))))))

This seems to have a subtle misfeature of not supporting files with
inconsistent encoding, or files with binary data, because there _all_
characters will belong to the eight-bit charset.

Also, I don't understand why the removal of raw bytes is conditioned on
the Emacs version.  Why not just remove them unconditionally?  If there
are none, nothing will be removed.

More to the point, I'm not sure whether inserting raw bytes in
insert-file-contents when a portion of a multibyte sequence was read
(i.e., going back to what Emacs 24.3 did) will be good for vlf.  It
sounds to me much better if Emacs would only return complete characters
read from the file, so that applications will not need to remove those
stray bytes.

Finally, it would seem a better design for vlf to always read a few
more bytes than was requested into some scratch buffer, and then decode
them manually to determine just how many to copy to the main buffer.
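
The "read a few extra bytes, then decide how many to copy" idea can be
sketched in standalone C as follows.  This is only an illustration
under stated assumptions, not vlf or Emacs code: it assumes the file is
UTF-8, the names utf8_skip_head and utf8_trim_tail are hypothetical,
and it inspects only sequence boundaries without validating the bytes
in between.

```c
#include <assert.h>
#include <stddef.h>

/* True for bytes of the form 10xxxxxx. */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Length announced by a head byte; anything that is not a multibyte
   head is treated as a single byte. */
static size_t expected_len(unsigned char head)
{
    if ((head & 0xE0) == 0xC0) return 2;
    if ((head & 0xF0) == 0xE0) return 3;
    if ((head & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Number of leading bytes to skip so the chunk starts on a sequence
   boundary.  At most 3 continuation bytes can belong to a sequence
   that began before the chunk. */
static size_t utf8_skip_head(const unsigned char *buf, size_t len)
{
    size_t skip = 0;

    assert(buf != NULL || len == 0);
    while (skip < len && skip < 3 && is_continuation(buf[skip]))
        skip++;
    return skip;
}

/* Number of bytes of buf[0..len) to keep so the chunk does not end in
   the middle of a sequence. */
static size_t utf8_trim_tail(const unsigned char *buf, size_t len)
{
    size_t back = 0;
    size_t head_pos;

    assert(buf != NULL || len == 0);
    /* Walk back over at most 3 trailing continuation bytes. */
    while (back < len && back < 3 && is_continuation(buf[len - 1 - back]))
        back++;
    if (back == len)
        return len;         /* no head byte in sight: leave unchanged */
    head_pos = len - 1 - back;
    if (expected_len(buf[head_pos]) > back + 1)
        return head_pos;    /* last sequence is cut off: drop it */
    return len;
}
```

A caller would read the requested range plus a small margin into a
scratch area, then copy only the bytes from utf8_skip_head (buf, len)
up to utf8_trim_tail (buf, len) into the main buffer.  Files with mixed
encodings or binary data would need a different policy, since there the
boundary test says nothing useful.
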