bug#25288: 25.1; term, ansi-term, broken output of utf8 text

unofficial mirror of bug-gnu-emacs@gnu.org 

* bug#25288: 25.1; term, ansi-term, broken output of utf8 text
@ 2016-12-28 10:41 Vjacheslav
  2016-12-28 19:10 ` npostavs
  0 siblings, 1 reply; 6+ messages in thread
From: Vjacheslav @ 2016-12-28 10:41 UTC (permalink / raw)
  To: 25288


Running this command in a term or ansi-term buffer running bash:

[fva@localhost ~]$ python -c 'print "ш"*5000'

produces garbage (шшш\321\210шшш) in the output, and the terminal then needs
a reset. This may be the same kind of bug seen on very old Linux systems,
where multibyte characters were broken at buffer boundaries.

default-process-coding-system is OK:

default-process-coding-system is a variable defined in ‘C source code’.
Its value is (utf-8-unix . utf-8-unix)




In GNU Emacs 25.1.1 (x86_64-redhat-linux-gnu, GTK+ Version 3.22.4)
  of 2016-12-15 built on buildvm-30.phx2.fedoraproject.org
Windowing system distributor 'Fedora Project', version 11.0.11900000
Configured using:
  'configure --build=x86_64-redhat-linux-gnu
  --host=x86_64-redhat-linux-gnu --program-prefix=
  --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr
  --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc
  --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64
  --libexecdir=/usr/libexec --localstatedir=/var
  --sharedstatedir=/var/lib --mandir=/usr/share/man
  --infodir=/usr/share/info --with-dbus --with-gif --with-jpeg --with-png
  --with-rsvg --with-tiff --with-xft --with-xpm --with-x-toolkit=gtk3
  --with-gpm=no --with-xwidgets build_alias=x86_64-redhat-linux-gnu
  host_alias=x86_64-redhat-linux-gnu 'CFLAGS=-DMAIL_USE_LOCKF -O2 -g
  -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2
  -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4
  -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1
  -m64 -mtune=generic' LDFLAGS=-Wl,-z,relro
  PKG_CONFIG_PATH=:/usr/lib64/pkgconfig:/usr/share/pkgconfig'

Configured features:
XPM JPEG TIFF GIF PNG RSVG IMAGEMAGICK SOUND DBUS GCONF GSETTINGS NOTIFY
ACL LIBSELINUX GNUTLS LIBXML2 FREETYPE M17N_FLT LIBOTF XFT ZLIB
TOOLKIT_SCROLL_BARS GTK3 X11 XWIDGETS

Important settings:
   value of $LANG: ru_RU.UTF-8
   value of $XMODIFIERS: @im=ibus
   locale-coding-system: utf-8-unix

Major mode: Term

Minor modes in effect:
   show-paren-mode: t
   recentf-mode: t
   delete-selection-mode: t
   global-auto-complete-mode: t
   tooltip-mode: t
   global-eldoc-mode: t
   electric-indent-mode: t
   mouse-wheel-mode: t
   menu-bar-mode: t
   file-name-shadow-mode: t
   global-font-lock-mode: t
   font-lock-mode: t
   blink-cursor-mode: t
   auto-composition-mode: t
   auto-encryption-mode: t
   auto-compression-mode: t
   line-number-mode: t
   transient-mark-mode: t

Recent messages:
Checking 120 files in /usr/share/emacs/25.1/lisp/obsolete...
Checking for load-path shadows...done
Auto-saving...
next-line: End of buffer [2 times]
previous-line: Beginning of buffer [7 times]
Quit
funcall-interactively: End of buffer [4 times]
previous-line: Beginning of buffer [2 times]
mwheel-scroll: Beginning of buffer [2 times]
Making completion list... [2 times]

Load-path shadows:
None found.

Features:
(pp shadow sort mail-extr emacsbug message idna dired format-spec rfc822
mml mml-sec password-cache epg epg-config gnus-util mm-decode mm-bodies
mm-encode mail-parse rfc2231 mailabbrev gmm-utils mailheader sendmail
rfc2047 rfc2045 ietf-drums mm-util mail-prsvr mail-utils thingatpt
help-fns help-mode term disp-table ehelp easy-mmode ropemacs ring pymacs
advice paren recentf tree-widget wid-edit easymenu delsel cus-start
cus-load erlang-start auto-complete-config auto-complete edmacro kmacro
cl-loaddefs pcase cl-lib popup time-date mule-util cyril-util tooltip
eldoc electric uniquify ediff-hook vc-hooks lisp-float-type mwheel x-win
term/common-win x-dnd tool-bar dnd fontset image regexp-opt fringe
tabulated-list newcomment elisp-mode lisp-mode prog-mode register page
menu-bar rfn-eshadow timer select scroll-bar mouse jit-lock font-lock
syntax facemenu font-core frame cl-generic cham georgian utf-8-lang
misc-lang vietnamese tibetan thai tai-viet lao korean japanese eucjp-ms
cp51932 hebrew greek romanian slovak czech european ethiopic indian
cyrillic chinese charscript case-table epa-hook jka-cmpr-hook help
simple abbrev minibuffer cl-preloaded nadvice loaddefs button faces
cus-face macroexp files text-properties overlay sha1 md5 base64 format
env code-pages mule custom widget hashtable-print-readable backquote
dbusbind inotify dynamic-setting system-font-setting font-render-setting
xwidget-internal move-toolbar gtk x-toolkit x multi-tty
make-network-process emacs)

Memory information:
((conses 16 118333 17341)
  (symbols 48 23114 0)
  (miscs 40 145 285)
  (strings 32 22117 5473)
  (string-bytes 1 586321)
  (vectors 16 15669)
  (vector-slots 8 490744 11337)
  (floats 8 203 310)
  (intervals 56 965 1)
  (buffers 976 25))






* bug#25288: 25.1; term, ansi-term, broken output of utf8 text
  2016-12-28 10:41 bug#25288: 25.1; term, ansi-term, broken output of utf8 text Vjacheslav
@ 2016-12-28 19:10 ` npostavs
  2016-12-28 19:31   ` Eli Zaretskii
  0 siblings, 1 reply; 6+ messages in thread
From: npostavs @ 2016-12-28 19:10 UTC (permalink / raw)
  To: Vjacheslav; +Cc: 25288

found 25288 24.5
tags 25288 confirmed
quit

Vjacheslav <fvamail@gmail.com> writes:

> Trying to use this command from terminal running bash:
>
> [fva@localhost ~]$ python -c 'print "ш"*5000'
>
> produces garbage (шшш\321\210шшш) in output. Terminal needs
> reset. Possibly this is a bug which seen in very old linux, (breaks
> multibyte characters on buffer borders).
>
> default-process-coding-system is OK:
>
> default-process-coding-system is a variable defined in ‘C source code’.
> Its value is (utf-8-unix . utf-8-unix)

It looks like the problem is that the process filter function,
term-emulate-terminal, receives the output in chunks of 4096 bytes[1].  The
ш character is encoded in 2 bytes, which means it can be split across
chunks.
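
To make the splitting concrete, here is a rough Python 3 illustration (not
Emacs code).  The 4096-byte chunk size comes from the observation above; the
one-byte "$" prefix is only an assumption, standing in for whatever output
precedes the ш run and shifts it to an odd byte offset.

```python
# Illustration only: decode each chunk of a UTF-8 stream on its own.
CHUNK_SIZE = 4096  # chunk size reported for the process filter


def split_chunks(data, size=CHUNK_SIZE):
    """Cut DATA into consecutive pieces of at most SIZE bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


# "ш" (U+0448) is the two bytes 0xD1 0x88 in UTF-8, i.e. octal \321 \210.
# The "$" prefix is a hypothetical stand-in for earlier output.
payload = ("$" + "ш" * 5000).encode("utf-8")
chunks = split_chunks(payload)

# Decoding chunk by chunk cannot reassemble a character whose two bytes
# ended up in different chunks; each stray byte becomes U+FFFD here.
naive = "".join(c.decode("utf-8", errors="replace") for c in chunks)

print(chunks[0][-1:], chunks[1][:1])  # the two halves of one split "ш"
print(naive.count("\ufffd"), "bytes failed to decode")
```

With this layout both chunk boundaries fall in the middle of a character, so
two of the 5000 characters are lost and four stray bytes are left over.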

Is there a way to recognize incomplete decoding from lisp?  I can't see
any.


[1]: It's getting bytes rather than characters because in term-exec-1 we
have:

	;; The process's output contains not just chars but also binary
	;; escape codes, so we need to see the raw output.  We will have to
	;; do the decoding by hand on the parts that are made of chars.
	(coding-system-for-read 'binary))







* bug#25288: 25.1; term, ansi-term, broken output of utf8 text
  2016-12-28 19:10 ` npostavs
@ 2016-12-28 19:31   ` Eli Zaretskii
  2016-12-29  2:37     ` npostavs
  0 siblings, 1 reply; 6+ messages in thread
From: Eli Zaretskii @ 2016-12-28 19:31 UTC (permalink / raw)
  To: npostavs; +Cc: 25288, fvamail

> From: npostavs@users.sourceforge.net
> Date: Wed, 28 Dec 2016 14:10:30 -0500
> Cc: 25288@debbugs.gnu.org
> 
> Is there a way to recognize incomplete decoding from lisp?  I can't see
> any.

If you know the encoding of the byte stream (and term.el must, since
it evidently decodes it later on), then you could probably use
char-charset after decoding: if you get 'eight-bit, then you've got
an incomplete byte sequence.  But I didn't try that.
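
The same idea can be sketched outside Emacs.  The following Python 3 snippet
is only an analog, not the Emacs API: Python's "surrogateescape" error
handler maps each undecodable byte to a lone surrogate in U+DC80..U+DCFF,
which plays a role similar to the 'eight-bit charset.  The function name
trailing_undecoded is made up for this example.

```python
def trailing_undecoded(chunk, encoding="utf-8"):
    """Return how many bytes at the end of CHUNK did not decode.

    Hypothetical helper for illustration.  Each undecodable byte in
    0x80..0xFF comes back as one code point in U+DC80..U+DCFF, so
    counting those at the end of the decoded text counts the bytes.
    """
    text = chunk.decode(encoding, errors="surrogateescape")
    n = 0
    while n < len(text) and "\udc80" <= text[-1 - n] <= "\udcff":
        n += 1
    return n


complete = "шш".encode("utf-8")
print(trailing_undecoded(complete))            # nothing left over
print(trailing_undecoded(complete + b"\xd1"))  # first byte of a split "ш"
```

Like the char-charset check, this cannot tell a truncated sequence from a
byte that is simply invalid; it only reports that the tail did not decode.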


* bug#25288: 25.1; term, ansi-term, broken output of utf8 text
  2016-12-28 19:31   ` Eli Zaretskii
@ 2016-12-29  2:37     ` npostavs
  2016-12-29 16:06       ` Eli Zaretskii
  0 siblings, 1 reply; 6+ messages in thread
From: npostavs @ 2016-12-29  2:37 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 25288, fvamail

[-- Attachment #1: Type: text/plain, Size: 727 bytes --]

tags 25288 patch
quit

Eli Zaretskii <eliz@gnu.org> writes:

>> From: npostavs@users.sourceforge.net
>> Date: Wed, 28 Dec 2016 14:10:30 -0500
>> Cc: 25288@debbugs.gnu.org
>> 
>> Is there a way to recognize incomplete decoding from lisp?  I can't see
>> any.
>
> If you know the encoding of the byte stream (and term.el must, since
> it evidently decodes it later on), then you could probably use
> char-charset, after decoding: if you get 'eight-bit, then you've got
> incomplete byte sequence.  But I didn't try that.

That should work at least for encodings like utf-8, for which undecoded
bytes are not ASCII.  I guess parsing of escape codes would only work on
such encodings anyway, so it should be fine.  Patch attached.


[-- Attachment #2: patch --]
[-- Type: text/plain, Size: 4666 bytes --]

From 6b052065c60406df5b4cd54f698f78594a010922 Mon Sep 17 00:00:00 2001
From: Noam Postavsky <npostavs@gmail.com>
Date: Wed, 28 Dec 2016 20:13:20 -0500
Subject: [PATCH v1] Handle multibyte chars spanning chunks in term.el

* lisp/term.el (term-terminal-undecoded-bytes): New variable.
(term-mode): Make it buffer local.  Don't make `term-terminal-parameter'
buffer-local twice.
(term-emulate-terminal): Check for bytes of incompletely decoded
characters, and save them until the next call when they can be fully
decoded (Bug#25288).
---
 lisp/term.el | 39 +++++++++++++++++++++++++++++++--------
 1 file changed, 31 insertions(+), 8 deletions(-)

diff --git a/lisp/term.el b/lisp/term.el
index d3d6390..696e39f 100644
--- a/lisp/term.el
+++ b/lisp/term.el
@@ -341,6 +341,7 @@
 (defconst term-protocol-version "0.96")
 
 (eval-when-compile (require 'ange-ftp))
+(eval-when-compile (require 'cl-lib))
 (require 'ring)
 (require 'ehelp)
 
@@ -404,6 +405,7 @@ term-terminal-state
 (defvar term-kill-echo-list nil
   "A queue of strings whose echo we want suppressed.")
 (defvar term-terminal-parameter)
+(defvar term-terminal-undecoded-bytes nil)
 (defvar term-terminal-previous-parameter)
 (defvar term-current-face 'term)
 (defvar term-scroll-start 0 "Top-most line (inclusive) of scrolling region.")
@@ -1015,7 +1017,6 @@ term-mode
 
   ;; These local variables are set to their local values:
   (make-local-variable 'term-saved-home-marker)
-  (make-local-variable 'term-terminal-parameter)
   (make-local-variable 'term-saved-cursor)
   (make-local-variable 'term-prompt-regexp)
   (make-local-variable 'term-input-ring-size)
@@ -1052,6 +1053,7 @@ term-mode
   (make-local-variable 'term-ansi-current-invisible)
 
   (make-local-variable 'term-terminal-parameter)
+  (make-local-variable 'term-terminal-undecoded-bytes)
   (make-local-variable 'term-terminal-previous-parameter)
   (make-local-variable 'term-terminal-previous-parameter-2)
   (make-local-variable 'term-terminal-previous-parameter-3)
@@ -2748,6 +2750,10 @@ term-emulate-terminal
 
 	  (when term-log-buffer
 	    (princ str term-log-buffer))
+          (when term-terminal-undecoded-bytes
+            (setq str (concat term-terminal-undecoded-bytes str))
+            (setq str-length (length str))
+            (setq term-terminal-undecoded-bytes nil))
 	  (cond ((eq term-terminal-state 4) ;; Have saved pending output.
 		 (setq str (concat term-terminal-parameter str))
 		 (setq term-terminal-parameter nil)
@@ -2763,13 +2769,6 @@ term-emulate-terminal
 				       str i))
 		   (when (not funny) (setq funny str-length))
 		   (cond ((> funny i)
-			  ;; Decode the string before counting
-			  ;; characters, to avoid garbling of certain
-			  ;; multibyte characters (bug#1006).
-			  (setq decoded-substring
-				(decode-coding-string
-				 (substring str i funny)
-				 locale-coding-system))
 			  (cond ((eq term-terminal-state 1)
 				 ;; We are in state 1, we need to wrap
 				 ;; around.  Go to the beginning of
@@ -2778,7 +2777,31 @@ term-emulate-terminal
 				 (term-down 1 t)
 				 (term-move-columns (- (term-current-column)))
 				 (setq term-terminal-state 0)))
+			  ;; Decode the string before counting
+			  ;; characters, to avoid garbling of certain
+			  ;; multibyte characters (bug#1006).
+			  (setq decoded-substring
+				(decode-coding-string
+				 (substring str i funny)
+				 locale-coding-system))
 			  (setq count (length decoded-substring))
+                          ;; Check for multibyte characters that ends
+                          ;; before end of string, and save it for
+                          ;; next time.
+                          (when (= funny str-length)
+                            (let ((partial 0))
+                              (while (eq (char-charset (aref decoded-substring
+                                                             (- count 1 partial)))
+                                         'eight-bit)
+                                (cl-incf partial))
+                              (when (> partial 0)
+                                (setq term-terminal-undecoded-bytes
+                                      (substring decoded-substring (- partial)))
+                                (setq decoded-substring
+                                      (substring decoded-substring 0 (- partial)))
+                                (cl-decf str-length partial)
+                                (cl-decf count partial)
+                                (cl-decf funny partial))))
 			  (setq temp (- (+ (term-horizontal-column) count)
 					term-width))
 			  (cond ((or term-suppress-hard-newline (<= temp 0)))
-- 
2.9.3



* bug#25288: 25.1; term, ansi-term, broken output of utf8 text
  2016-12-29  2:37     ` npostavs
@ 2016-12-29 16:06       ` Eli Zaretskii
  2017-01-03 14:05         ` npostavs
  0 siblings, 1 reply; 6+ messages in thread
From: Eli Zaretskii @ 2016-12-29 16:06 UTC (permalink / raw)
  To: npostavs; +Cc: 25288, fvamail

> From: npostavs@users.sourceforge.net
> Cc: 25288@debbugs.gnu.org,  fvamail@gmail.com
> Date: Wed, 28 Dec 2016 21:37:19 -0500
> 
> > If you know the encoding of the byte stream (and term.el must, since
> > it evidently decodes it later on), then you could probably use
> > char-charset, after decoding: if you get 'eight-bit, then you've got
> > incomplete byte sequence.  But I didn't try that.
> 
> That should work at least for encodings like utf-8 for which undecoded
> bytes are not ascii.  I guess parsing of escape codes would only work on
> such encodings anyway, so it should be fine.  Patch attached.

LGTM, thanks.






* bug#25288: 25.1; term, ansi-term, broken output of utf8 text
  2016-12-29 16:06       ` Eli Zaretskii
@ 2017-01-03 14:05         ` npostavs
  0 siblings, 0 replies; 6+ messages in thread
From: npostavs @ 2017-01-03 14:05 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 25288, fvamail

tags 25288 fixed
close 25288 26.1
quit

Eli Zaretskii <eliz@gnu.org> writes:

>> From: npostavs@users.sourceforge.net
>> Cc: 25288@debbugs.gnu.org,  fvamail@gmail.com
>> Date: Wed, 28 Dec 2016 21:37:19 -0500
>> 
>> > If you know the encoding of the byte stream (and term.el must, since
>> > it evidently decodes it later on), then you could probably use
>> > char-charset, after decoding: if you get 'eight-bit, then you've got
>> > incomplete byte sequence.  But I didn't try that.
>> 
>> That should work at least for encodings like utf-8 for which undecoded
>> bytes are not ascii.  I guess parsing of escape codes would only work on
>> such encodings anyway, so it should be fine.  Patch attached.
>
> LGTM, thanks.

Pushed as 134e86b360ca.







Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git
