* bug#25288: 25.1; term, ansi-term, broken output of utf8 text
From: Vjacheslav @ 2016-12-28 10:41 UTC
To: 25288

Trying to use this command from terminal running bash:

[fva@localhost ~]$ python -c 'print "ш"*5000'

produces garbage (шшш\321\210шшш) in output. Terminal needs
reset. Possibly this is a bug which was seen in very old Linux (breaks
multibyte characters on buffer borders).

default-process-coding-system is OK:

default-process-coding-system is a variable defined in ‘C source code’.
Its value is (utf-8-unix . utf-8-unix)

In GNU Emacs 25.1.1 (x86_64-redhat-linux-gnu, GTK+ Version 3.22.4)
 of 2016-12-15 built on buildvm-30.phx2.fedoraproject.org
Windowing system distributor 'Fedora Project', version 11.0.11900000
Configured using:
 'configure --build=x86_64-redhat-linux-gnu
 --host=x86_64-redhat-linux-gnu --program-prefix=
 --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr
 --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc
 --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64
 --libexecdir=/usr/libexec --localstatedir=/var
 --sharedstatedir=/var/lib --mandir=/usr/share/man
 --infodir=/usr/share/info --with-dbus --with-gif --with-jpeg
 --with-png --with-rsvg --with-tiff --with-xft --with-xpm
 --with-x-toolkit=gtk3 --with-gpm=no --with-xwidgets
 build_alias=x86_64-redhat-linux-gnu host_alias=x86_64-redhat-linux-gnu
 'CFLAGS=-DMAIL_USE_LOCKF -O2 -g -pipe -Wall -Werror=format-security
 -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong
 --param=ssp-buffer-size=4 -grecord-gcc-switches
 -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic'
 LDFLAGS=-Wl,-z,relro
 PKG_CONFIG_PATH=:/usr/lib64/pkgconfig:/usr/share/pkgconfig'

Configured features:
XPM JPEG TIFF GIF PNG RSVG IMAGEMAGICK SOUND DBUS GCONF GSETTINGS NOTIFY
ACL LIBSELINUX GNUTLS LIBXML2 FREETYPE M17N_FLT LIBOTF XFT ZLIB
TOOLKIT_SCROLL_BARS GTK3 X11 XWIDGETS

Important settings:
  value of $LANG: ru_RU.UTF-8
  value of $XMODIFIERS: @im=ibus
  locale-coding-system: utf-8-unix

Major mode: Term

Minor modes in effect:
  show-paren-mode: t
  recentf-mode: t
  delete-selection-mode: t
  global-auto-complete-mode: t
  tooltip-mode: t
  global-eldoc-mode: t
  electric-indent-mode: t
  mouse-wheel-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  line-number-mode: t
  transient-mark-mode: t

Recent messages:
Checking 120 files in /usr/share/emacs/25.1/lisp/obsolete...
Checking for load-path shadows...done
Auto-saving...
next-line: End of buffer [2 times]
previous-line: Beginning of buffer [7 times]
Quit
funcall-interactively: End of buffer [4 times]
previous-line: Beginning of buffer [2 times]
mwheel-scroll: Beginning of buffer [2 times]
Making completion list... [2 times]

Load-path shadows:
None found.

Features:
(pp shadow sort mail-extr emacsbug message idna dired format-spec rfc822
mml mml-sec password-cache epg epg-config gnus-util mm-decode mm-bodies
mm-encode mail-parse rfc2231 mailabbrev gmm-utils mailheader sendmail
rfc2047 rfc2045 ietf-drums mm-util mail-prsvr mail-utils thingatpt
help-fns help-mode term disp-table ehelp easy-mmode ropemacs ring pymacs
advice paren recentf tree-widget wid-edit easymenu delsel cus-start
cus-load erlang-start auto-complete-config auto-complete edmacro kmacro
cl-loaddefs pcase cl-lib popup time-date mule-util cyril-util tooltip
eldoc electric uniquify ediff-hook vc-hooks lisp-float-type mwheel x-win
term/common-win x-dnd tool-bar dnd fontset image regexp-opt fringe
tabulated-list newcomment elisp-mode lisp-mode prog-mode register page
menu-bar rfn-eshadow timer select scroll-bar mouse jit-lock font-lock
syntax facemenu font-core frame cl-generic cham georgian utf-8-lang
misc-lang vietnamese tibetan thai tai-viet lao korean japanese eucjp-ms
cp51932 hebrew greek romanian slovak czech european ethiopic indian
cyrillic chinese charscript case-table epa-hook jka-cmpr-hook help
simple abbrev minibuffer cl-preloaded nadvice loaddefs button faces
cus-face macroexp files text-properties overlay sha1 md5 base64 format
env code-pages mule custom widget hashtable-print-readable backquote
dbusbind inotify dynamic-setting system-font-setting font-render-setting
xwidget-internal move-toolbar gtk x-toolkit x multi-tty
make-network-process emacs)

Memory information:
((conses 16 118333 17341)
 (symbols 48 23114 0)
 (miscs 40 145 285)
 (strings 32 22117 5473)
 (string-bytes 1 586321)
 (vectors 16 15669)
 (vector-slots 8 490744 11337)
 (floats 8 203 310)
 (intervals 56 965 1)
 (buffers 976 25))

* bug#25288: 25.1; term, ansi-term, broken output of utf8 text
From: npostavs @ 2016-12-28 19:10 UTC
To: Vjacheslav; +Cc: 25288

found 25288 24.5
tags 25288 confirmed
quit

Vjacheslav <fvamail@gmail.com> writes:

> Trying to use this command from terminal running bash:
>
> [fva@localhost ~]$ python -c 'print "ш"*5000'
>
> produces garbage (шшш\321\210шшш) in output. Terminal needs
> reset. Possibly this is a bug which was seen in very old Linux (breaks
> multibyte characters on buffer borders).
>
> default-process-coding-system is OK:
>
> default-process-coding-system is a variable defined in ‘C source code’.
> Its value is (utf-8-unix . utf-8-unix)

It looks like the problem is that the process filter function,
term-emulate-terminal, receives the output in chunks of 4096 bytes[1].
The ш character is encoded in 2 bytes, which means it can be split
across chunks.

Is there a way to recognize incomplete decoding from lisp? I can't see
any.

[1]: It's getting bytes rather than characters because in term-exec-1 we
have:

    ;; The process's output contains not just chars but also binary
    ;; escape codes, so we need to see the raw output. We will have to
    ;; do the decoding by hand on the parts that are made of chars.
    (coding-system-for-read 'binary))
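
[Editorial illustration, not part of the original message.] The chunking
problem described above can be reproduced outside Emacs. The following is
a minimal Python 3 sketch (the command in the report uses Python 2
syntax). The helper names and the single leading "$" byte are assumptions
made for illustration: the extra byte stands in for whatever earlier
output pushes the 2-byte characters onto odd offsets, so that a 4096-byte
boundary falls in the middle of a character.

```python
from typing import List

CHUNK_SIZE = 4096  # chunk size reported in the thread


def split_into_chunks(data: bytes, size: int = CHUNK_SIZE) -> List[bytes]:
    """Cut a byte stream into fixed-size chunks, ignoring character borders."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def decode_chunks_naively(chunks: List[bytes]) -> str:
    """Decode every chunk on its own; cut-off bytes show up as U+FFFD."""
    return "".join(chunk.decode("utf-8", errors="replace") for chunk in chunks)


# Assumption: one leading ASCII byte misaligns the 2-byte characters.
payload = ("$" + "ш" * 5000).encode("utf-8")
chunks = split_into_chunks(payload)
naive = decode_chunks_naively(chunks)

print([len(c) for c in chunks])                 # sizes of the chunks
print(chunks[0][-1:], chunks[1][:1])            # the two halves of one "ш"
print(naive.count("\ufffd"))                    # number of damaged positions
```

The two bytes printed for the chunk border are 0xD1 and 0x88, which are
the \321 and \210 (octal) visible in the reported garbage.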

* bug#25288: 25.1; term, ansi-term, broken output of utf8 text
From: Eli Zaretskii @ 2016-12-28 19:31 UTC
To: npostavs; +Cc: 25288, fvamail

> From: npostavs@users.sourceforge.net
> Date: Wed, 28 Dec 2016 14:10:30 -0500
> Cc: 25288@debbugs.gnu.org
>
> Is there a way to recognize incomplete decoding from lisp? I can't see
> any.

If you know the encoding of the byte stream (and term.el must, since
it evidently decodes it later on), then you could probably use
char-charset, after decoding: if you get 'eight-bit, then you've got
incomplete byte sequence. But I didn't try that.
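
[Editorial illustration, not part of the original message.] A rough
Python 3 analogue of the check suggested above: decode first, then look
at the end of the result for bytes that were left undecoded. Python has
no 'eight-bit charset; the sketch assumes that the "surrogateescape"
error handler is a fair stand-in, since it maps each undecodable byte to
a code point in U+DC80..U+DCFF. The function name is hypothetical. Note
that a genuinely invalid trailing byte is counted too, not only a
cut-off character.

```python
def count_trailing_undecoded(raw: bytes) -> int:
    """Return how many bytes at the end of raw did not decode as UTF-8."""
    # Each undecodable byte becomes one lone surrogate U+DC80..U+DCFF.
    text = raw.decode("utf-8", errors="surrogateescape")
    partial = 0
    while partial < len(text) and "\udc80" <= text[-1 - partial] <= "\udcff":
        partial += 1
    return partial


whole = "шш".encode("utf-8")                  # b'\xd1\x88\xd1\x88'
print(count_trailing_undecoded(whole))        # complete sequence
print(count_trailing_undecoded(whole[:-1]))   # last character cut in half
```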

* bug#25288: 25.1; term, ansi-term, broken output of utf8 text
From: npostavs @ 2016-12-29 2:37 UTC
To: Eli Zaretskii; +Cc: 25288, fvamail

[-- Attachment #1: Type: text/plain, Size: 727 bytes --]

tags 25288 patch
quit

Eli Zaretskii <eliz@gnu.org> writes:

>> From: npostavs@users.sourceforge.net
>> Date: Wed, 28 Dec 2016 14:10:30 -0500
>> Cc: 25288@debbugs.gnu.org
>>
>> Is there a way to recognize incomplete decoding from lisp? I can't see
>> any.
>
> If you know the encoding of the byte stream (and term.el must, since
> it evidently decodes it later on), then you could probably use
> char-charset, after decoding: if you get 'eight-bit, then you've got
> incomplete byte sequence. But I didn't try that.

That should work at least for encodings like utf-8 for which undecoded
bytes are not ascii. I guess parsing of escape codes would only work on
such encodings anyway, so it should be fine. Patch attached.

[-- Attachment #2: patch --]
[-- Type: text/plain, Size: 4666 bytes --]

From 6b052065c60406df5b4cd54f698f78594a010922 Mon Sep 17 00:00:00 2001
From: Noam Postavsky <npostavs@gmail.com>
Date: Wed, 28 Dec 2016 20:13:20 -0500
Subject: [PATCH v1] Handle multibyte chars spanning chunks in term.el

* lisp/term.el (term-terminal-undecoded-bytes): New variable.
(term-mode): Make it buffer local. Don't make
`term-terminal-parameter' buffer-local twice.
(term-emulate-terminal): Check for bytes of incompletely decoded
characters, and save them until the next call when they can be fully
decoded (Bug#25288).
---
 lisp/term.el | 39 +++++++++++++++++++++++++++++++--------
 1 file changed, 31 insertions(+), 8 deletions(-)

diff --git a/lisp/term.el b/lisp/term.el
index d3d6390..696e39f 100644
--- a/lisp/term.el
+++ b/lisp/term.el
@@ -341,6 +341,7 @@ (defconst term-protocol-version "0.96")
 (eval-when-compile (require 'ange-ftp))
+(eval-when-compile (require 'cl-lib))
 (require 'ring)
 (require 'ehelp)
@@ -404,6 +405,7 @@ term-terminal-state
 (defvar term-kill-echo-list nil
   "A queue of strings whose echo we want suppressed.")
 (defvar term-terminal-parameter)
+(defvar term-terminal-undecoded-bytes nil)
 (defvar term-terminal-previous-parameter)
 (defvar term-current-face 'term)
 (defvar term-scroll-start 0 "Top-most line (inclusive) of scrolling region.")
@@ -1015,7 +1017,6 @@ term-mode
   ;; These local variables are set to their local values:
   (make-local-variable 'term-saved-home-marker)
-  (make-local-variable 'term-terminal-parameter)
   (make-local-variable 'term-saved-cursor)
   (make-local-variable 'term-prompt-regexp)
   (make-local-variable 'term-input-ring-size)
@@ -1052,6 +1053,7 @@ term-mode
   (make-local-variable 'term-ansi-current-invisible)
   (make-local-variable 'term-terminal-parameter)
+  (make-local-variable 'term-terminal-undecoded-bytes)
   (make-local-variable 'term-terminal-previous-parameter)
   (make-local-variable 'term-terminal-previous-parameter-2)
   (make-local-variable 'term-terminal-previous-parameter-3)
@@ -2748,6 +2750,10 @@ term-emulate-terminal
           (when term-log-buffer
             (princ str term-log-buffer))
+          (when term-terminal-undecoded-bytes
+            (setq str (concat term-terminal-undecoded-bytes str))
+            (setq str-length (length str))
+            (setq term-terminal-undecoded-bytes nil))
           (cond ((eq term-terminal-state 4) ;; Have saved pending output.
                  (setq str (concat term-terminal-parameter str))
                  (setq term-terminal-parameter nil)
@@ -2763,13 +2769,6 @@ term-emulate-terminal
                                           str i))
               (when (not funny) (setq funny str-length))
               (cond ((> funny i)
-                     ;; Decode the string before counting
-                     ;; characters, to avoid garbling of certain
-                     ;; multibyte characters (bug#1006).
-                     (setq decoded-substring
-                           (decode-coding-string
-                            (substring str i funny)
-                            locale-coding-system))
                      (cond ((eq term-terminal-state 1)
                             ;; We are in state 1, we need to wrap
                             ;; around. Go to the beginning of
@@ -2778,7 +2777,31 @@ term-emulate-terminal
                             (term-down 1 t)
                             (term-move-columns (- (term-current-column)))
                             (setq term-terminal-state 0)))
+                     ;; Decode the string before counting
+                     ;; characters, to avoid garbling of certain
+                     ;; multibyte characters (bug#1006).
+                     (setq decoded-substring
+                           (decode-coding-string
+                            (substring str i funny)
+                            locale-coding-system))
                      (setq count (length decoded-substring))
+                     ;; Check for multibyte characters that ends
+                     ;; before end of string, and save it for
+                     ;; next time.
+                     (when (= funny str-length)
+                       (let ((partial 0))
+                         (while (eq (char-charset (aref decoded-substring
+                                                        (- count 1 partial)))
+                                    'eight-bit)
+                           (cl-incf partial))
+                         (when (> partial 0)
+                           (setq term-terminal-undecoded-bytes
+                                 (substring decoded-substring (- partial)))
+                           (setq decoded-substring
+                                 (substring decoded-substring 0 (- partial)))
+                           (cl-decf str-length partial)
+                           (cl-decf count partial)
+                           (cl-decf funny partial))))
                      (setq temp (- (+ (term-horizontal-column) count)
                                    term-width))
                      (cond ((or term-suppress-hard-newline
                                 (<= temp 0)))
-- 
2.9.3
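
[Editorial illustration, not part of the original message.] The approach
of the patch (prepend the bytes saved by the previous call, decode, then
hold back any trailing undecoded bytes for the next call) can be sketched
in Python 3 as follows. This is not a translation of term.el: the class
name, the method name and the use of "surrogateescape" in place of the
'eight-bit charset check are assumptions made for illustration, and
escape-sequence handling is left out entirely.

```python
class ChunkDecoder:
    """Decode UTF-8 arriving in arbitrary chunks, carrying cut-off bytes over."""

    def __init__(self) -> None:
        # Plays the role of term-terminal-undecoded-bytes in the patch.
        self.undecoded = b""

    def feed(self, chunk: bytes) -> str:
        # Prepend whatever the previous call could not decode.
        data = self.undecoded + chunk
        self.undecoded = b""
        # Undecodable bytes become lone surrogates U+DC80..U+DCFF, one each.
        text = data.decode("utf-8", errors="surrogateescape")
        # Count the undecoded bytes at the very end of this chunk.
        partial = 0
        while partial < len(text) and "\udc80" <= text[-1 - partial] <= "\udcff":
            partial += 1
        if partial:
            # Save the raw bytes for the next call and drop them from the result.
            self.undecoded = data[-partial:]
            text = text[:-partial]
        return text


# Assumption: a leading ASCII byte misaligns the 2-byte characters so that
# the 4096-byte borders fall inside a character.
expected = "$" + "ш" * 5000
payload = expected.encode("utf-8")
chunks = [payload[i:i + 4096] for i in range(0, len(payload), 4096)]

decoder = ChunkDecoder()
result = "".join(decoder.feed(chunk) for chunk in chunks)
print(result == expected)   # whether the text survived the chunk borders
print(decoder.undecoded)    # bytes still pending after the last chunk
```

Outside Emacs, the incremental decoders in Python's codecs module provide
comparable buffering of incomplete sequences without hand-written
bookkeeping.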

* bug#25288: 25.1; term, ansi-term, broken output of utf8 text
From: Eli Zaretskii @ 2016-12-29 16:06 UTC
To: npostavs; +Cc: 25288, fvamail

> From: npostavs@users.sourceforge.net
> Cc: 25288@debbugs.gnu.org, fvamail@gmail.com
> Date: Wed, 28 Dec 2016 21:37:19 -0500
>
> > If you know the encoding of the byte stream (and term.el must, since
> > it evidently decodes it later on), then you could probably use
> > char-charset, after decoding: if you get 'eight-bit, then you've got
> > incomplete byte sequence. But I didn't try that.
>
> That should work at least for encodings like utf-8 for which undecoded
> bytes are not ascii. I guess parsing of escape codes would only work on
> such encodings anyway, so it should be fine. Patch attached.

LGTM, thanks.

* bug#25288: 25.1; term, ansi-term, broken output of utf8 text
From: npostavs @ 2017-01-03 14:05 UTC
To: Eli Zaretskii; +Cc: 25288, fvamail

tags 25288 fixed
close 25288 26.1
quit

Eli Zaretskii <eliz@gnu.org> writes:

>> From: npostavs@users.sourceforge.net
>> Cc: 25288@debbugs.gnu.org, fvamail@gmail.com
>> Date: Wed, 28 Dec 2016 21:37:19 -0500
>>
>> > If you know the encoding of the byte stream (and term.el must, since
>> > it evidently decodes it later on), then you could probably use
>> > char-charset, after decoding: if you get 'eight-bit, then you've got
>> > incomplete byte sequence. But I didn't try that.
>>
>> That should work at least for encodings like utf-8 for which undecoded
>> bytes are not ascii. I guess parsing of escape codes would only work on
>> such encodings anyway, so it should be fine. Patch attached.
>
> LGTM, thanks.

Pushed as 134e86b360ca.