* bug#23814: 24.5; bug of hz coding-system
From: ynyaaa @ 2016-06-21 12:22 UTC
To: 23814

The hz coding-system should encode chinese-gb2312 characters,
but it may fail to encode text that has no charset property.

current-language-environment
=>"Japanese"

;; wrong
(encode-coding-string "\x4E00" 'hz)
=>"\e$B0l~}"

;; correct
(encode-coding-string (propertize "\x4E00" 'charset 'chinese-gb2312) 'hz)
=>"~{R;~}"

When the second byte of a chinese-gb2312 character equals ?~,
the hz coding-system may fail to decode it.

(encode-coding-string (propertize "\x670D" 'charset 'chinese-gb2312) 'hz)
=>"~{7~~}"

;; wrong
(decode-coding-string "~{7~~}" 'hz)
=>"\300\267"

In GNU Emacs 24.5.1 (i686-pc-mingw32)
 of 2015-04-11 on LEG570
Windowing system distributor `Microsoft Corp.', version 6.0.6002
Configured using:
 `configure --prefix=/c/usr --host=i686-pc-mingw32'

Important settings:
  value of $LANG: JPN
  locale-coding-system: cp932

Major mode: Lisp Interaction

Minor modes in effect:
  tooltip-mode: t
  electric-indent-mode: t
  mouse-wheel-mode: t
  tool-bar-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  line-number-mode: t
  transient-mark-mode: t

Recent messages:

Load-path shadows:
None found.
Features: (network-stream starttls tls mailalias smtpmail auth-source eieio byte-opt bytecomp byte-compile cl-extra cl-loaddefs cl-lib cconv eieio-core password-cache rect warnings china-util misearch multi-isearch pp shadow sort gnus-util mail-extr emacsbug message format-spec rfc822 mml mml-sec mm-decode mm-bodies mm-encode mail-parse rfc2231 mailabbrev gmm-utils mailheader sendmail rfc2047 rfc2045 ietf-drums mm-util mail-prsvr mail-utils help-mode easymenu advice help-fns time-date japan-util tooltip electric uniquify ediff-hook vc-hooks lisp-float-type mwheel dos-w32 ls-lisp w32-common-fns disp-table w32-win w32-vars tool-bar dnd fontset image regexp-opt fringe tabulated-list newcomment lisp-mode prog-mode register page menu-bar rfn-eshadow timer select scroll-bar mouse jit-lock font-lock syntax facemenu font-core frame cham georgian utf-8-lang misc-lang vietnamese tibetan thai tai-viet lao korean japanese hebrew greek romanian slovak czech european ethiopic indian cyrillic chinese case-table epa-hook jka-cmpr-hook help simple abbrev minibuffer nadvice loaddefs button faces cus-face macroexp files text-properties overlay sha1 md5 base64 format env code-pages mule custom widget hashtable-print-readable backquote make-network-process w32notify w32 multi-tty emacs) Memory information: ((conses 8 94845 27098) (symbols 32 19573 0) (miscs 32 77 279) (strings 16 16482 13821) (string-bytes 1 462365) (vectors 8 12746) (vector-slots 4 519456 11240) (floats 8 62 556) (intervals 28 606 13) (buffers 508 18)) ^ permalink raw reply [flat|nested] 16+ messages in thread
* bug#23814: 24.5; bug of hz coding-system
From: Eli Zaretskii @ 2016-06-21 12:58 UTC
To: ynyaaa; +Cc: 23814

> From: ynyaaa@gmail.com
> Date: Tue, 21 Jun 2016 21:22:32 +0900
>
> hz coding-system should encode chinese-gb2312 characters,
> it may fail to encode text without charset property.

This is by design, and mentioned in the doc string of that
coding-system.  Since Emacs is Unicode based, the _only_ way of having
"chinese-gb2312 characters" is by using that text property.

IOW, I don't think this is a bug.
* bug#23814: 24.5; bug of hz coding-system
From: ynyaaa @ 2016-06-22 13:47 UTC
To: Eli Zaretskii; +Cc: 23814

Eli Zaretskii <eliz@gnu.org> writes:
> This is by design, and mentioned in the doc string of that
> coding-system.  Since Emacs is Unicode based, the _only_ way of having
> "chinese-gb2312 characters" is by using that text property.

`encode-hz-region' uses the `iso-2022-7bit' coding-system internally;
replacing it with the coding-system below will work.

(define-coding-system 'iso-2022-cn-gb
  "ISO 2022 based 7bit encoding only for Chinese GB2312."
  :coding-type 'iso-2022
  :mnemonic ?C
  :charset-list '(ascii chinese-gb2312)
  :designation [(ascii chinese-gb2312) nil nil nil]
  :flags '(ascii-at-eol ascii-at-cntl designation 7-bit safe)
  )
* bug#23814: 24.5; bug of hz coding-system 2016-06-22 13:47 ` ynyaaa @ 2016-06-22 15:28 ` Eli Zaretskii 0 siblings, 0 replies; 16+ messages in thread From: Eli Zaretskii @ 2016-06-22 15:28 UTC (permalink / raw) To: ynyaaa; +Cc: 23814 > > From: ynyaaa@gmail.com > Cc: 23814@debbugs.gnu.org > Date: Wed, 22 Jun 2016 22:47:00 +0900 > > Eli Zaretskii <eliz@gnu.org> writes: > > > This is by design, and mentioned in the doc string of that > > coding-system. Since Emacs is Unicode based, the _only_ way of having > > "chinese-gb2312 characters" is by using that text property. > > `encode-hz-region' uses `iso-2022-7bit' coding-system internally, > replacing it with the coding-system below will work. > > (define-coding-system 'iso-2022-cn-gb > "ISO 2022 based 7bit encoding only for Chinese GB2312." > :coding-type 'iso-2022 > :mnemonic ?C > :charset-list '(ascii chinese-gb2312) > :designation [(ascii chinese-gb2312) nil nil nil] > :flags '(ascii-at-eol ascii-at-cntl designation 7-bit safe) > ) What advantages does this change have? ^ permalink raw reply [flat|nested] 16+ messages in thread
* bug#23814: 24.5; bug of hz coding-system
From: ynyaaa @ 2016-06-22 17:04 UTC
To: Eli Zaretskii; +Cc: 23814

Eli Zaretskii <eliz@gnu.org> writes:

>> `encode-hz-region' uses `iso-2022-7bit' coding-system internally,
>> replacing it with the coding-system below will work.
>>
>> (define-coding-system 'iso-2022-cn-gb
>>   "ISO 2022 based 7bit encoding only for Chinese GB2312."
>>   :coding-type 'iso-2022
>>   :mnemonic ?C
>>   :charset-list '(ascii chinese-gb2312)
>>   :designation [(ascii chinese-gb2312) nil nil nil]
>>   :flags '(ascii-at-eol ascii-at-cntl designation 7-bit safe)
>>   )
>
> What advantages does this change have?

`iso-2022-7bit' may encode the same character to different strings
depending on its charset property, while `iso-2022-cn-gb' always
encodes the same character to the same string.

(mapcar (lambda (cs) (encode-coding-string
                      (propertize "\x4e00" 'charset cs)
                      'iso-2022-7bit))
        '(chinese-gb2312 japanese-jisx0208 korean-ksc5601
          chinese-cns11643-1))
=>("\e$AR;\e(B"
   "\e$B0l\e(B"
   "\e$(Cli\e(B"
   "\e$(GD!\e(B")

(mapcar (lambda (cs) (encode-coding-string
                      (propertize "\x4e00" 'charset cs)
                      'iso-2022-cn-gb))
        '(chinese-gb2312 japanese-jisx0208 korean-ksc5601
          chinese-cns11643-1))
=>("\e$AR;\e(B"
   "\e$AR;\e(B"
   "\e$AR;\e(B"
   "\e$AR;\e(B")

`encode-hz-region' expects `chinese-gb2312' characters to be encoded
with "\e$A" sequences, and replaces those sequences with "~{".
* bug#23814: 24.5; bug of hz coding-system 2016-06-22 17:04 ` ynyaaa @ 2016-06-22 17:26 ` Eli Zaretskii 2016-07-09 11:20 ` Eli Zaretskii 0 siblings, 1 reply; 16+ messages in thread From: Eli Zaretskii @ 2016-06-22 17:26 UTC (permalink / raw) To: ynyaaa, Kenichi Handa; +Cc: 23814 > From: ynyaaa@gmail.com > Cc: 23814@debbugs.gnu.org > Date: Thu, 23 Jun 2016 02:04:18 +0900 > > Eli Zaretskii <eliz@gnu.org> writes: > > >> `encode-hz-region' uses `iso-2022-7bit' coding-system internally, > >> replacing it with the coding-system below will work. > >> > >> (define-coding-system 'iso-2022-cn-gb > >> "ISO 2022 based 7bit encoding only for Chinese GB2312." > >> :coding-type 'iso-2022 > >> :mnemonic ?C > >> :charset-list '(ascii chinese-gb2312) > >> :designation [(ascii chinese-gb2312) nil nil nil] > >> :flags '(ascii-at-eol ascii-at-cntl designation 7-bit safe) > >> ) > > > > What advantages does this change have? > > `iso-2022-7bit' may encode same character to various strings, > while `iso-2022-cn-gb' encodes same charcter to same string. > > (mapcar (lambda (cs) (encode-coding-string > (propertize "\x4e00" 'charset cs) > 'iso-2022-7bit)) > '(chinese-gb2312 japanese-jisx0208 korean-ksc5601 > chinese-cns11643-1)) > =>("\e$AR;\e(B" > "\e$B0l\e(B" > "\e$(Cli\e(B" > "\e$(GD!\e(B") > > (mapcar (lambda (cs) (encode-coding-string > (propertize "\x4e00" 'charset cs) > 'iso-2022-cn-gb)) > '(chinese-gb2312 japanese-jisx0208 korean-ksc5601 > chinese-cns11643-1)) > =>("\e$AR;\e(B" > "\e$AR;\e(B" > "\e$AR;\e(B" > "\e$AR;\e(B") > > `encode-hz-region' expects `chinese-gb2312' characters are encoded > with "\e$A" sequences, and replaces them to "~{". I understand, but as I said, I think this is by design, and should not be changed. However, maybe I'm missing something, so I'll CC Handa-san and ask him to comment on this proposal and the issue in general. ^ permalink raw reply [flat|nested] 16+ messages in thread
* bug#23814: 24.5; bug of hz coding-system 2016-06-22 17:26 ` Eli Zaretskii @ 2016-07-09 11:20 ` Eli Zaretskii 2016-07-13 14:12 ` handa 0 siblings, 1 reply; 16+ messages in thread From: Eli Zaretskii @ 2016-07-09 11:20 UTC (permalink / raw) To: handa; +Cc: ynyaaa, 23814 Ping! Could you please comment on this issue? > Date: Wed, 22 Jun 2016 20:26:53 +0300 > From: Eli Zaretskii <eliz@gnu.org> > Cc: 23814@debbugs.gnu.org > > > From: ynyaaa@gmail.com > > Cc: 23814@debbugs.gnu.org > > Date: Thu, 23 Jun 2016 02:04:18 +0900 > > > > Eli Zaretskii <eliz@gnu.org> writes: > > > > >> `encode-hz-region' uses `iso-2022-7bit' coding-system internally, > > >> replacing it with the coding-system below will work. > > >> > > >> (define-coding-system 'iso-2022-cn-gb > > >> "ISO 2022 based 7bit encoding only for Chinese GB2312." > > >> :coding-type 'iso-2022 > > >> :mnemonic ?C > > >> :charset-list '(ascii chinese-gb2312) > > >> :designation [(ascii chinese-gb2312) nil nil nil] > > >> :flags '(ascii-at-eol ascii-at-cntl designation 7-bit safe) > > >> ) > > > > > > What advantages does this change have? > > > > `iso-2022-7bit' may encode same character to various strings, > > while `iso-2022-cn-gb' encodes same charcter to same string. > > > > (mapcar (lambda (cs) (encode-coding-string > > (propertize "\x4e00" 'charset cs) > > 'iso-2022-7bit)) > > '(chinese-gb2312 japanese-jisx0208 korean-ksc5601 > > chinese-cns11643-1)) > > =>("\e$AR;\e(B" > > "\e$B0l\e(B" > > "\e$(Cli\e(B" > > "\e$(GD!\e(B") > > > > (mapcar (lambda (cs) (encode-coding-string > > (propertize "\x4e00" 'charset cs) > > 'iso-2022-cn-gb)) > > '(chinese-gb2312 japanese-jisx0208 korean-ksc5601 > > chinese-cns11643-1)) > > =>("\e$AR;\e(B" > > "\e$AR;\e(B" > > "\e$AR;\e(B" > > "\e$AR;\e(B") > > > > `encode-hz-region' expects `chinese-gb2312' characters are encoded > > with "\e$A" sequences, and replaces them to "~{". > > I understand, but as I said, I think this is by design, and should not > be changed. 
However, maybe I'm missing something, so I'll CC > Handa-san and ask him to comment on this proposal and the issue in > general. ^ permalink raw reply [flat|nested] 16+ messages in thread
* bug#23814: 24.5; bug of hz coding-system
From: handa @ 2016-07-13 14:12 UTC
To: Eli Zaretskii; +Cc: ynyaaa, 23814

In article <83d1mngirw.fsf@gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> Ping!  Could you please comment on this issue?

Sorry, I've overlooked that mail.

> > > >> `encode-hz-region' uses `iso-2022-7bit' coding-system internally,
> > > >> replacing it with the coding-system below will work.
> > > >>
> > > >> (define-coding-system 'iso-2022-cn-gb
> > > >>   "ISO 2022 based 7bit encoding only for Chinese GB2312."
> > > >>   :coding-type 'iso-2022
> > > >>   :mnemonic ?C
> > > >>   :charset-list '(ascii chinese-gb2312)
> > > >>   :designation [(ascii chinese-gb2312) nil nil nil]
> > > >>   :flags '(ascii-at-eol ascii-at-cntl designation 7-bit safe)
> > > >>   )

Right.  But, as there are already so many iso-2022 based coding systems,
I'd like to avoid adding a new one just for encode-hz-region.  I think
the attached patch is sufficient.  Could you please try it?  It also
fixes the problem of incorrect decoding of "~{7~~}".

---
K. Handa
handa@gnu.org

diff --git a/lisp/language/china-util.el b/lisp/language/china-util.el
index e531640..9735bd6 100644
--- a/lisp/language/china-util.el
+++ b/lisp/language/china-util.el
@@ -95,7 +95,9 @@ decode-hz-region
         (goto-char (point-min))
         (while (search-forward "~" nil t)
           (setq ch (following-char))
-          (if (or (= ch ?\n) (= ch ?~)) (delete-char -1)))
+          (if (= ch ?{)
+              (search-forward "~}" nil 'move)
+            (if (or (= ch ?\n) (= ch ?~)) (delete-char -1))))
 
         ;; "^zW...\n" -> Chinese GB2312
         ;; "~{...~}"  -> Chinese GB2312
@@ -141,7 +143,7 @@ encode-hz-region
   (save-excursion
     (save-restriction
       (narrow-to-region beg end)
-
+      (put-text-property beg end 'charset 'chinese-gb2312)
       ;; "~" -> "~~"
       (goto-char (point-min))
       (while (search-forward "~" nil t) (insert ?~))
* bug#23814: 24.5; bug of hz coding-system 2016-07-13 14:12 ` handa @ 2016-07-23 17:47 ` Eli Zaretskii 0 siblings, 0 replies; 16+ messages in thread From: Eli Zaretskii @ 2016-07-23 17:47 UTC (permalink / raw) To: ynyaaa; +Cc: 23814 Ping! Could you please try this patch and see if it solves the problem? > From: handa <handa@gnu.org> > Cc: ynyaaa@gmail.com, 23814@debbugs.gnu.org > Date: Wed, 13 Jul 2016 23:12:47 +0900 > > > > > >> `encode-hz-region' uses `iso-2022-7bit' coding-system internally, > > > > >> replacing it with the coding-system below will work. > > > > >> > > > > >> (define-coding-system 'iso-2022-cn-gb > > > > >> "ISO 2022 based 7bit encoding only for Chinese GB2312." > > > > >> :coding-type 'iso-2022 > > > > >> :mnemonic ?C > > > > >> :charset-list '(ascii chinese-gb2312) > > > > >> :designation [(ascii chinese-gb2312) nil nil nil] > > > > >> :flags '(ascii-at-eol ascii-at-cntl designation 7-bit safe) > > > > >> ) > > Right. But, as there are already so many iso-2022 based coding systems, > I'd like to avoid adding a new one just for encode-hz-region. I think > the attached patch is sufficent. Could you please try it? It also > fixes the problem of incorrect decoding of "~{7~~}". > > --- > K. 
Handa > handa@gnu.org > > diff --git a/lisp/language/china-util.el b/lisp/language/china-util.el > index e531640..9735bd6 100644 > --- a/lisp/language/china-util.el > +++ b/lisp/language/china-util.el > @@ -95,7 +95,9 @@ decode-hz-region > (goto-char (point-min)) > (while (search-forward "~" nil t) > (setq ch (following-char)) > - (if (or (= ch ?\n) (= ch ?~)) (delete-char -1))) > + (if (= ch ?{) > + (search-forward "~}" nil 'move) > + (if (or (= ch ?\n) (= ch ?~)) (delete-char -1)))) > > ;; "^zW...\n" -> Chinese GB2312 > ;; "~{...~}" -> Chinese GB2312 > @@ -141,7 +143,7 @@ encode-hz-region > (save-excursion > (save-restriction > (narrow-to-region beg end) > - > + (put-text-property beg end 'charset 'chinese-gb2312) > ;; "~" -> "~~" > (goto-char (point-min)) > (while (search-forward "~" nil t) (insert ?~)) > > ^ permalink raw reply [flat|nested] 16+ messages in thread
* bug#23814: 24.5; bug of hz coding-system
From: ynyaaa @ 2016-07-24 8:21 UTC
To: Eli Zaretskii; +Cc: 23814

Eli Zaretskii <eliz@gnu.org> writes:

> Ping!  Could you please try this patch and see if it solves the
> problem?

The patch seems to give better results.

But I found other bugs in the decoding of the "~" escape.
"~~" and "~{!!~}" should be encoded and decoded as below.
  "~~" -> "~~~~" -> "~~"
  "~{!!~}" -> "~~{!!~~}" -> "~{!!~}"

In reality they are encoded properly, but decoded incorrectly.
(decode-coding-string (encode-coding-string "~~" 'hz) 'hz)
=> "~"
(decode-coding-string (encode-coding-string "~{!!~}" 'hz) 'hz)
=> #("\x3000" 0 1 (charset chinese-gb2312))

This behavior is not affected by the patch.

>> diff --git a/lisp/language/china-util.el b/lisp/language/china-util.el
>> index e531640..9735bd6 100644
>> --- a/lisp/language/china-util.el
>> +++ b/lisp/language/china-util.el
>> @@ -95,7 +95,9 @@ decode-hz-region
>>          (goto-char (point-min))
>>          (while (search-forward "~" nil t)
>>            (setq ch (following-char))
>> -          (if (or (= ch ?\n) (= ch ?~)) (delete-char -1)))
>> +          (if (= ch ?{)
>> +              (search-forward "~}" nil 'move)
>> +            (if (or (= ch ?\n) (= ch ?~)) (delete-char -1))))
>>
>>          ;; "^zW...\n" -> Chinese GB2312
>>          ;; "~{...~}"  -> Chinese GB2312
>> @@ -141,7 +143,7 @@ encode-hz-region
>>    (save-excursion
>>      (save-restriction
>>        (narrow-to-region beg end)
>> -
>> +      (put-text-property beg end 'charset 'chinese-gb2312)
>>        ;; "~" -> "~~"
>>        (goto-char (point-min))
>>        (while (search-forward "~" nil t) (insert ?~))
* bug#23814: 24.5; bug of hz coding-system
From: handa @ 2016-07-26 15:09 UTC
To: ynyaaa; +Cc: 23814

In article <87twffigzv.fsf@gmail.com>, ynyaaa@gmail.com writes:

> But I found other bugs about decodings of "~" escape.
> "~~" and "~{!!~}" should be encoded and decoded as below.
>   "~~" -> "~~~~" -> "~~"
>   "~{!!~}" -> "~~{!!~~}" -> "~{!!~}"

> In really they are encoded properly, but decoded in wrong way.
> (decode-coding-string (encode-coding-string "~~" 'hz) 'hz)
>>> "~"
> (decode-coding-string (encode-coding-string "~{!!~}" 'hz) 'hz)
>>> #("\x3000" 0 1 (charset chinese-gb2312))

Thank you for finding those bugs.  Could you please try the attached
patch instead?

---
K. Handa
handa@gnu.org

diff --git a/lisp/language/china-util.el b/lisp/language/china-util.el
index e531640..9abdae1 100644
--- a/lisp/language/china-util.el
+++ b/lisp/language/china-util.el
@@ -95,7 +95,12 @@ decode-hz-region
         (goto-char (point-min))
         (while (search-forward "~" nil t)
           (setq ch (following-char))
-          (if (or (= ch ?\n) (= ch ?~)) (delete-char -1)))
+          (if (= ch ?{)
+              (search-forward "~}" nil 'move)
+            (when (or (= ch ?\n) (= ch ?~))
+              (delete-char -1)
+              (put-text-property (point) (1+ (point)) 'hz-decoded t)
+              (forward-char 1))))
 
         ;; "^zW...\n" -> Chinese GB2312
         ;; "~{...~}"  -> Chinese GB2312
@@ -104,6 +109,8 @@ decode-hz-region
         (while (re-search-forward hz/zw-start-gb nil t)
           (setq pos (match-beginning 0)
                 ch (char-after pos))
+          (if (and (= ch ?~) (get-text-property pos 'hz-decoded))
+              (forward-char 1)
           ;; Record the first position to start conversion.
           (or beg (setq beg pos))
           (end-of-line)
@@ -122,9 +129,10 @@ decode-hz-region
                                   t)
                 (delete-char -2))
               (setq end (point))
-              (translate-region pos (point) hz-set-msb-table))))
+              (translate-region pos (point) hz-set-msb-table)))))
         (if beg
             (decode-coding-region beg end 'euc-china)))
+      (remove-text-properties (point-min) (point-max) '(hz-decoded nil))
       (- (point-max) (point-min)))))
 
 ;;;###autoload
@@ -142,6 +150,7 @@ encode-hz-region
     (save-restriction
       (narrow-to-region beg end)
 
+      (put-text-property beg end 'charset 'chinese-gb2312)
       ;; "~" -> "~~"
       (goto-char (point-min))
       (while (search-forward "~" nil t) (insert ?~))
* bug#23814: 24.5; bug of hz coding-system
From: ynyaaa @ 2016-07-29 1:05 UTC
To: handa; +Cc: 23814

handa <handa@gnu.org> writes:

> In article <87twffigzv.fsf@gmail.com>, ynyaaa@gmail.com writes:
>
>> But I found other bugs about decodings of "~" escape.
>> "~~" and "~{!!~}" should be encoded and decoded as below.
>>   "~~" -> "~~~~" -> "~~"
>>   "~{!!~}" -> "~~{!!~~}" -> "~{!!~}"
>
>> In really they are encoded properly, but decoded in wrong way.
>> (decode-coding-string (encode-coding-string "~~" 'hz) 'hz)
>>>> "~"
>> (decode-coding-string (encode-coding-string "~{!!~}" 'hz) 'hz)
>>>> #("\x3000" 0 1 (charset chinese-gb2312))
>
> Thank you for finding those bugs.  Could you please try the attached
> patch instead?
>
> ---
> K. Handa
> handa@gnu.org

If the text contains unencodable characters, encodable characters
may be corrupted as well.
In this example, the second ?\x4E00 character disappears.

(set-language-environment 'Chinese-GB)
(decode-coding-string (encode-coding-string "\x4E00\x00B7\x4E00" 'hz) 'hz)
=> "\x4E00\e\x3048\x6070\x70B3\x11213D\300\273"

There are several ways to avoid this behavior.
(a) While decoding, replace "~{...~}" with "\e$A...\e(B"
    and decode with iso-2022-7bit.
(b) Like (a), replace "~{...~}" with "\e$A...\e(B" while decoding,
    insert "\e$)A" at the beginning of the temp buffer,
    and decode with iso-2022-8bit-ss2.
    (8bit data are decoded as euc-cn.)
(c) While encoding, use euc-cn instead of iso-2022-7bit
    and translate each run of consecutive 8bit data to 7bit data
    prefixed by "~{" and postfixed by "~}".

By the way, RFC1843 says:
   The escape sequence '~\n' is a line-continuation marker to be
   consumed with no output produced.

This form should return "AB".
(decode-coding-string "A~\nB" 'hz) => "A\nB" > diff --git a/lisp/language/china-util.el b/lisp/language/china-util.el > index e531640..9abdae1 100644 > --- a/lisp/language/china-util.el > +++ b/lisp/language/china-util.el > @@ -95,7 +95,12 @@ decode-hz-region > (goto-char (point-min)) > (while (search-forward "~" nil t) > (setq ch (following-char)) > - (if (or (= ch ?\n) (= ch ?~)) (delete-char -1))) > + (if (= ch ?{) > + (search-forward "~}" nil 'move) > + (when (or (= ch ?\n) (= ch ?~)) > + (delete-char -1) > + (put-text-property (point) (1+ (point)) 'hz-decoded t) > + (forward-char 1)))) > > ;; "^zW...\n" -> Chinese GB2312 > ;; "~{...~}" -> Chinese GB2312 > @@ -104,6 +109,8 @@ decode-hz-region > (while (re-search-forward hz/zw-start-gb nil t) > (setq pos (match-beginning 0) > ch (char-after pos)) > + (if (and (= ch ?~) (get-text-property pos 'hz-decoded)) > + (forward-char 1) > ;; Record the first position to start conversion. > (or beg (setq beg pos)) > (end-of-line) > @@ -122,9 +129,10 @@ decode-hz-region > t) > (delete-char -2)) > (setq end (point)) > - (translate-region pos (point) hz-set-msb-table)))) > + (translate-region pos (point) hz-set-msb-table))))) > (if beg > (decode-coding-region beg end 'euc-china))) > + (remove-text-properties (point-min) (point-max) '(hz-decoded nil)) > (- (point-max) (point-min))))) > > ;;;###autoload > @@ -142,6 +150,7 @@ encode-hz-region > (save-restriction > (narrow-to-region beg end) > > + (put-text-property beg end 'charset 'chinese-gb2312) > ;; "~" -> "~~" > (goto-char (point-min)) > (while (search-forward "~" nil t) (insert ?~)) ^ permalink raw reply [flat|nested] 16+ messages in thread
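[Editorial sketch, not part of the original message.]  The round-trip
failures reported in this message can be probed with a small helper
like the one below.  The name `my-hz-round-trip-p' is hypothetical and
does not exist in Emacs; the inputs are the strings quoted in this
thread.  What each call returns depends on the Emacs version and on
which of the patches above is applied, so no results are shown.

```elisp
;; Hypothetical helper for experimenting; not part of Emacs.
(defun my-hz-round-trip-p (string)
  "Return non-nil if STRING survives encoding to and decoding from `hz'.
`equal' ignores text properties, so only the characters are compared."
  (equal string
         (decode-coding-string (encode-coding-string string 'hz) 'hz)))

;; Inputs taken from the reports in this thread.  Each element of the
;; result is (STRING . ROUND-TRIP-OK).
(mapcar (lambda (s) (cons s (my-hz-round-trip-p s)))
        '("~~" "~{!!~}" "\x4E00\x00B7\x4E00" "A~\nB"))
```

Comparing the result list before and after applying a patch is a quick
way to see which of the reported cases a given change addresses.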
* bug#23814: 24.5; bug of hz coding-system
From: handa @ 2016-08-14 11:22 UTC
To: ynyaaa; +Cc: 23814

[-- Attachment #1: Type: text/plain, Size: 1951 bytes --]

Hi, sorry for the late response.  I've just noticed that my reply mail
didn't go out successfully.  I'm trying to re-send it.

I wrote:

> In article <871t2dz22d.fsf@gmail.com>, ynyaaa@gmail.com writes:
> > If there are unencodable characters, encodable characters may be broken.
> > In this example, the second ?\x4E00 character disappears.
> > (set-language-environment 'Chinese-GB)
> > (decode-coding-string (encode-coding-string "\x4E00\x00B7\x4E00" 'hz) 'hz)
> >>> "\x4E00\e\x3048\x6070\x70B3\x11213D\300\273"
>
> How to treat unencodable characters on encoding is a difficult problem.
> As HZ is designed for a 7-bit environment, I think it's important to
> keep the output 7-bit on encoding.  So, the new code uses \uXXXX for
> those characters.  Another way is to use a UTF-8 sequence for them;
> then we can decode it back.  Which do you think is better?
>
> > To avoid this behavior, there are some solutions.
> > (a) While decoding, replace "~{...~}" with "\e$A...\e(B"
> >     and decode with iso-2022-7bit.
> > (b) Like (a), replace "~{...~}" with "\e$A...\e(B" while decoding
> >     and insert "\e$)A" at the beginning of the temp buffer
> >     and decode with iso-2022-8bit-ss2.
> >     (8bit data are decoded as euc-cn.)
> > (c) While encoding, use euc-cn instead of iso-2022-7bit
> >     and translate each consecutive 8bit data to 7bit data
> >     prefixed by "~{" and postfixed by "~}".
>
> I adopted method (a) for decoding, and fixed bugs in the encoding code.
>
> > By the way, RFC1843 describes:
> >    The escape sequence '~\n' is a line-continuation marker to be
> >    consumed with no output produced.
>
> The variable decode-hz-line-continuation controls this feature.  I don't
> remember why the default is nil (i.e. do not decode ~\n), perhaps some
> Chinese people I was discussing with on implementing HZ support
> suggested that.
>
> Attached is the full china-util.el (not a diff).
>
> ---
> K. Handa
> handa@gnu.org

[-- Attachment #2: china-util.el --]
[-- Type: application/emacs-lisp, Size: 6915 bytes --]

;;; china-util.el --- utilities for Chinese  -*- coding: utf-8 -*-

;; Copyright (C) 1995, 2001-2016 Free Software Foundation, Inc.
;; Copyright (C) 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004,
;;   2005, 2006, 2007, 2008, 2009, 2010, 2011
;;   National Institute of Advanced Industrial Science and Technology (AIST)
;;   Registration Number H14PRO021
;; Copyright (C) 2003
;;   National Institute of Advanced Industrial Science and Technology (AIST)
;;   Registration Number H13PRO009

;; Keywords: mule, multilingual, Chinese

;; This file is part of GNU Emacs.

;; GNU Emacs is free software: you can redistribute it and/or modify
;; it under the terms of the GNU General Public License as published by
;; the Free Software Foundation, either version 3 of the License, or
;; (at your option) any later version.

;; GNU Emacs is distributed in the hope that it will be useful,
;; but WITHOUT ANY WARRANTY; without even the implied warranty of
;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
;; GNU General Public License for more details.

;; You should have received a copy of the GNU General Public License
;; along with GNU Emacs.  If not, see <http://www.gnu.org/licenses/>.

;;; Commentary:

;;; Code:

;; Hz/ZW/EUC-TW encoding stuff

;; HZ is an encoding method for Chinese character set GB2312 used
;; widely in Internet.  It is very similar to 7-bit environment of
;; ISO-2022.  The difference is that HZ uses the sequence "~{" and
;; "~}" for designating GB2312 and ASCII respectively, hence, it
;; doesn't use ESC (0x1B) code.

;; ZW is another encoding method for Chinese character set GB2312.  It
;; encodes Chinese characters line by line by starting each line with
;; the sequence "zW".  It also uses only 7-bit as HZ.

;; EUC-TW is similar to EUC-KS or EUC-JP.  Its main character set is
;; plane 1 of CNS 11643; characters of planes 2 to 7 are accessed with
;; a single shift escape followed by three bytes: the first gives the
;; plane, the second and third the character code.  Note that characters
;; of plane 1 are (redundantly) accessible with a single shift escape
;; also.

;; ISO-2022 escape sequence to designate GB2312.
(defvar iso2022-gb-designation "\e$A")
;; HZ escape sequence to designate GB2312.
(defvar hz-gb-designation "~{")
;; ISO-2022 escape sequence to designate ASCII.
(defvar iso2022-ascii-designation "\e(B")
;; HZ escape sequence to designate ASCII.
(defvar hz-ascii-designation "~}")
;; Regexp of ZW sequence to start GB2312.
(defvar zw-start-gb "^zW")
;; Regexp for start of GB2312 in an encoding mixture of HZ and ZW.
(defvar hz/zw-start-gb
  (concat hz-gb-designation "\\|" zw-start-gb "\\|[^\0-\177]"))

(defvar decode-hz-line-continuation nil
  "Flag to tell if we should care line continuation convention of Hz.")

(defconst hz-set-msb-table
  (eval-when-compile
    (let ((chars nil)
          (i 0))
      (while (< i 33)
        (push i chars)
        (setq i (1+ i)))
      (while (< i 127)
        (push (decode-char 'eight-bit (+ i 128))
              chars)
        (setq i (1+ i)))
      (apply 'string (nreverse chars)))))

;;;###autoload
(defun decode-hz-region (beg end)
  "Decode HZ/ZW encoded text in the current region.
Return the length of resulting text."
  (interactive "r")
  (save-excursion
    (save-restriction
      (let (pos ch)
        (narrow-to-region beg end)

        ;; We, at first, convert HZ/ZW to `iso-2022-7bit',
        ;; then decode it.

        ;; "~\n" -> "", "~~" -> "~"
        (goto-char (point-min))
        (while (search-forward "~" nil t)
          (setq ch (following-char))
          (cond ((= ch ?{)
                 (delete-region (1- (point)) (1+ (point)))
                 (setq pos (point))
                 (insert iso2022-gb-designation)
                 (if (looking-at "\\([!-}][!-~]\\)*")
                     (goto-char (match-end 0)))
                 (if (looking-at hz-ascii-designation)
                     (delete-region (match-beginning 0) (match-end 0)))
                 (insert iso2022-ascii-designation)
                 (decode-coding-region pos (point) 'iso-2022-7bit))

                ((= ch ?~)
                 (delete-char 1))

                ((and (= ch ?\n)
                      decode-hz-line-continuation)
                 (delete-region (1- (point)) (1+ (point))))

                (t
                 (forward-char 1)))))

      (- (point-max) (point-min)))))

;;;###autoload
(defun decode-hz-buffer ()
  "Decode HZ/ZW encoded text in the current buffer."
  (interactive)
  (decode-hz-region (point-min) (point-max)))

(defvar hz-category-table nil)

;;;###autoload
(defun encode-hz-region (beg end)
  "Encode the text in the current region to HZ.
Return the length of resulting text."
  (interactive "r")
  (unless hz-category-table
    (setq hz-category-table (make-category-table))
    (with-category-table hz-category-table
      (define-category ?c "hz encodable")
      (map-charset-chars #'modify-category-entry 'ascii ?c)
      (map-charset-chars #'modify-category-entry 'chinese-gb2312 ?c)))
  (save-excursion
    (save-restriction
      (narrow-to-region beg end)
      (with-category-table hz-category-table
        ;; ~ -> ~~
        (goto-char (point-min))
        (while (search-forward "~" nil t) (insert ?~))

        ;; ESC -> ESC ESC
        (goto-char (point-min))
        (while (search-forward "\e" nil t) (insert ?\e))

        ;; Non-ASCII-GB2312 -> \uXXXX
        (goto-char (point-min))
        (while (re-search-forward "\\Cc" nil t)
          (let ((ch (preceding-char)))
            (delete-char -1)
            (insert (format "\\u%04X" ch))))

        ;; Prefer chinese-gb2312 for Chinese characters.
        (put-text-property (point-min) (point-max) 'charset 'chinese-gb2312)
        (encode-coding-region (point-min) (point-max) 'iso-2022-7bit)

        ;; ESC $ A ... ESC ( B  -> ~{ ... ~}
        ;; ESC ESC -> ESC
        (goto-char (point-min))
        (while (search-forward "\e" nil t)
          (if (= (following-char) ?\e)
              ;; ESC ESC -> ESC
              (delete-char 1)
            (forward-char -1)
            (if (looking-at iso2022-gb-designation)
                (progn
                  (delete-region (match-beginning 0) (match-end 0))
                  (insert hz-gb-designation)
                  (search-forward iso2022-ascii-designation nil 'move)
                  (delete-region (match-beginning 0) (match-end 0))
                  (insert hz-ascii-designation))))))
      (- (point-max) (point-min)))))

;;;###autoload
(defun encode-hz-buffer ()
  "Encode the text in the current buffer to HZ."
  (interactive)
  (encode-hz-region (point-min) (point-max)))

;;;###autoload
(defun post-read-decode-hz (len)
  (let ((pos (point))
        (buffer-modified-p (buffer-modified-p))
        last-coding-system-used)
    (prog1
        (decode-hz-region pos (+ pos len))
      (set-buffer-modified-p buffer-modified-p))))

;;;###autoload
(defun pre-write-encode-hz (from to)
  (let ((buf (current-buffer)))
    (set-buffer (generate-new-buffer " *temp*"))
    (if (stringp from)
        (insert from)
      (insert-buffer-substring buf from to))
    (let (last-coding-system-used)
      (encode-hz-region 1 (point-max)))
    nil))

;;
(provide 'china-util)

;;; china-util.el ends here
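[Editorial sketch, not part of the original message.]  The message
above says that the variable `decode-hz-line-continuation' controls
whether "~\n" is consumed on decoding.  A minimal way to try that,
assuming china-util.el is on the load path, is to bind the variable
around a decode call.  The helper name `my-hz-decode-joining-lines' is
hypothetical.  RFC 1843 would have the example yield "AB"; whether it
does depends on the china-util.el in use, so no result is shown.

```elisp
(require 'china-util)   ; defines `decode-hz-line-continuation'

;; Hypothetical helper, for illustration only.
(defun my-hz-decode-joining-lines (string)
  "Decode STRING from `hz' with `decode-hz-line-continuation' bound to t."
  (let ((decode-hz-line-continuation t))
    (decode-coding-string string 'hz)))

;; The RFC 1843 line-continuation case raised earlier in this thread.
(my-hz-decode-joining-lines "A~\nB")
```

Binding the variable with `let' keeps the change local to one call, so
the global default of nil is left untouched.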
* bug#23814: 24.5; bug of hz coding-system
  2016-06-21 12:22 bug#23814: 24.5; bug of hz coding-system ynyaaa
                   ` (4 preceding siblings ...)
  2016-07-29  1:05 ` ynyaaa
@ 2016-08-17  6:33 ` ynyaaa
  2016-08-17 14:43   ` handa
  5 siblings, 1 reply; 16+ messages in thread
From: ynyaaa @ 2016-08-17  6:33 UTC (permalink / raw)
  To: handa; +Cc: 23814

Hi, I tried new china-util.el.  It works very well.

handa <handa@gnu.org> writes:
> Hi, sorry for the late response.  I've just noticed that my reply mail
> didn't go out successfully.  I'm trying to re-send it.

>> How to treat unencodable characters on encoding is a difficult problem.
>> As HZ is designed for 7-bit environment, I think it's important to keep
>> 7-bit on encoding.  So, the new code uses \uXXXX for those characters.
>> Another way is to use UTF-8 sequence for them, then we can decode it
>> back.  Which, do you think, is better?

I prefer 7bit encoding to use only 7bit data, too.
As for elisp, "\u12345" is treated as "\u1234\ 5".

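For readers following along: the point about "\u12345" is that the Emacs Lisp reader takes exactly four hex digits after \u, so an escape written with "\\u%04X" for a character above U+FFFF does not read back as that character. A minimal sketch of the problem follows. The helper name my-hz-u-escape is made up for illustration and is not part of china-util.el.

```elisp
;; Hypothetical helper (not in china-util.el): format CH with
;; "\\u%04X", the way the attached encode-hz-region does.
(defun my-hz-u-escape (ch)
  (format "\\u%04X" ch))

(my-hz-u-escape #x3042)   ; => "\\u3042"   four digits, unambiguous
(my-hz-u-escape #x20000)  ; => "\\u20000"  five digits

;; Read as a string literal, five digits after \u mean a four-digit
;; escape followed by an ordinary character ("\ " contributes nothing):
(equal "\u12345" "\u1234\ 5")  ; => t
(length "\u12345")             ; => 2

;; \U takes exactly eight hex digits, so it can name any character:
(length "\U00020000")          ; => 1
```

Under that reading, an eight-digit \U escape is one way to keep characters outside the BMP unambiguous while staying 7-bit.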
* bug#23814: 24.5; bug of hz coding-system
  2016-08-17  6:33 ` ynyaaa
@ 2016-08-17 14:43   ` handa
  2016-08-17 15:28     ` Eli Zaretskii
  0 siblings, 1 reply; 16+ messages in thread
From: handa @ 2016-08-17 14:43 UTC (permalink / raw)
  To: ynyaaa; +Cc: 23814

In article <87oa4rdhvq.fsf@gmail.com>, ynyaaa@gmail.com writes:

> Hi, I tried new china-util.el.  It works very well.

Thank you for testing it.

> I prefer 7bit encoding to use only 7bit data, too.
> As for elisp, "\u12345" is treated as "\u1234\ 5".

Ah, ok, I changed to encode characters not in BMP to \UXXXXXXXX.

I've just committed the attached change.

---
K. Handa
handa@gnu.org

2016-08-17  handa  <handa@gnu.org>

	* lisp/language/china-util.el (decode-hz-region): Pay attention to
	"~~}" sequence at the end of Chinese character range.
	(hz-category-table): New variable.
	(encode-hz-region): Convert non-encodable characters to \u... and
	\U...  Preserve ESC on encoding.  Put `chinese-gb2312' `charset'
	text property in advance to force iso-2022-encoding to select
	chinese-gb2312 designation.

diff --git a/lisp/language/china-util.el b/lisp/language/china-util.el
index e531640..6505fb8 100644
--- a/lisp/language/china-util.el
+++ b/lisp/language/china-util.el
@@ -88,43 +88,34 @@ decode-hz-region
       (let (pos ch)
         (narrow-to-region beg end)
 
-        ;; We, at first, convert HZ/ZW to `euc-china',
+        ;; We, at first, convert HZ/ZW to `iso-2022-7bit',
         ;; then decode it.
 
-        ;; "~\n" -> "\n", "~~" -> "~"
+        ;; "~\n" -> "", "~~" -> "~"
         (goto-char (point-min))
         (while (search-forward "~" nil t)
           (setq ch (following-char))
-          (if (or (= ch ?\n) (= ch ?~)) (delete-char -1)))
+          (cond ((= ch ?{)
+                 (delete-region (1- (point)) (1+ (point)))
+                 (setq pos (point))
+                 (insert iso2022-gb-designation)
+                 (if (looking-at "\\([!-}][!-~]\\)*")
+                     (goto-char (match-end 0)))
+                 (if (looking-at hz-ascii-designation)
+                     (delete-region (match-beginning 0) (match-end 0)))
+                 (insert iso2022-ascii-designation)
+                 (decode-coding-region pos (point) 'iso-2022-7bit))
+
+                ((= ch ?~)
+                 (delete-char 1))
+
+                ((and (= ch ?\n)
+                      decode-hz-line-continuation)
+                 (delete-region (1- (point)) (1+ (point))))
+
+                (t
+                 (forward-char 1)))))
 
-        ;; "^zW...\n" -> Chinese GB2312
-        ;; "~{...~}"  -> Chinese GB2312
-        (goto-char (point-min))
-        (setq beg nil)
-        (while (re-search-forward hz/zw-start-gb nil t)
-          (setq pos (match-beginning 0)
-                ch (char-after pos))
-          ;; Record the first position to start conversion.
-          (or beg (setq beg pos))
-          (end-of-line)
-          (setq end (point))
-          (if (>= ch 128)               ; 8bit GB2312
-              nil
-            (goto-char pos)
-            (delete-char 2)
-            (setq end (- end 2))
-            (if (= ch ?z)               ; ZW -> euc-china
-                (progn
-                  (translate-region (point) end hz-set-msb-table)
-                  (goto-char end))
-              (if (search-forward hz-ascii-designation
-                                  (if decode-hz-line-continuation nil end)
-                                  t)
-                  (delete-char -2))
-              (setq end (point))
-              (translate-region pos (point) hz-set-msb-table))))
-        (if beg
-            (decode-coding-region beg end 'euc-china)))
       (- (point-max) (point-min)))))
 
 ;;;###autoload
@@ -133,33 +124,57 @@ decode-hz-buffer
   (interactive)
   (decode-hz-region (point-min) (point-max)))
 
+(defvar hz-category-table nil)
+
 ;;;###autoload
 (defun encode-hz-region (beg end)
   "Encode the text in the current region to HZ.
 Return the length of resulting text."
   (interactive "r")
+  (unless hz-category-table
+    (setq hz-category-table (make-category-table))
+    (with-category-table hz-category-table
+      (define-category ?c "hz encodable")
+      (map-charset-chars #'modify-category-entry 'ascii ?c)
+      (map-charset-chars #'modify-category-entry 'chinese-gb2312 ?c)))
   (save-excursion
     (save-restriction
       (narrow-to-region beg end)
+      (with-category-table hz-category-table
+        ;; ~ -> ~~
+        (goto-char (point-min))
+        (while (search-forward "~" nil t) (insert ?~))
+
+        ;; ESC -> ESC ESC
+        (goto-char (point-min))
+        (while (search-forward "\e" nil t) (insert ?\e))
 
-      ;; "~" -> "~~"
-      (goto-char (point-min))
-      (while (search-forward "~" nil t) (insert ?~))
-
-      ;; Chinese GB2312 -> "~{...~}"
-      (goto-char (point-min))
-      (if (re-search-forward "\\cc" nil t)
-          (let (pos)
-            (goto-char (setq pos (match-beginning 0)))
-            (encode-coding-region pos (point-max) 'iso-2022-7bit)
-            (goto-char pos)
-            (while (search-forward iso2022-gb-designation nil t)
-              (delete-char -3)
-              (insert hz-gb-designation))
-            (goto-char pos)
-            (while (search-forward iso2022-ascii-designation nil t)
-              (delete-char -3)
-              (insert hz-ascii-designation))))
+        ;; Non-ASCII-GB2312 -> \uXXXX
+        (goto-char (point-min))
+        (while (re-search-forward "\\Cc" nil t)
+          (let ((ch (preceding-char)))
+            (delete-char -1)
+            (insert (format (if (< ch #x10000) "\\u%04X" "\\U%08X") ch))))
+
+        ;; Prefer chinese-gb2312 for Chinese characters.
+        (put-text-property (point-min) (point-max) 'charset 'chinese-gb2312)
+        (encode-coding-region (point-min) (point-max) 'iso-2022-7bit)
+
+        ;; ESC $ B ... ESC ( B  ->  ~{ ... ~}
+        ;; ESC ESC -> ESC
+        (goto-char (point-min))
+        (while (search-forward "\e" nil t)
+          (if (= (following-char) ?\e)
+              ;; ESC ESC -> ESC
+              (delete-char 1)
+            (forward-char -1)
+            (if (looking-at iso2022-gb-designation)
+                (progn
+                  (delete-region (match-beginning 0) (match-end 0))
+                  (insert hz-gb-designation)
+                  (search-forward iso2022-ascii-designation nil 'move)
+                  (delete-region (match-beginning 0) (match-end 0))
+                  (insert hz-ascii-designation))))))
       (- (point-max) (point-min)))))
 
 ;;;###autoload

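A quick way to exercise the two cases from the original report (a literal "~", and a GB2312 character whose second byte is ?~) is a round trip through encode-hz-region and decode-hz-region. This is only a sketch: the helper name my-hz-round-trip is made up for illustration, and it assumes an Emacs whose china-util.el already contains the change above.

```elisp
(require 'china-util)

;; Hypothetical helper for illustration; not part of china-util.el.
;; Encode STRING to HZ in a temporary buffer, decode it again, and
;; return the pair (ENCODED . DECODED).
(defun my-hz-round-trip (string)
  (with-temp-buffer
    (insert string)
    (encode-hz-region (point-min) (point-max))
    (let ((encoded (buffer-string)))
      (decode-hz-region (point-min) (point-max))
      (cons encoded (buffer-string)))))

;; Literal tilde: should be doubled on encoding and restored on decoding.
(my-hz-round-trip "a~b")

;; U+670D, the character from the report whose GB2312 second byte is ?~.
(my-hz-round-trip "\x670D")
```

If the decoded half of each pair is `equal' to the input, both cases from the report survive the round trip.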
* bug#23814: 24.5; bug of hz coding-system
  2016-08-17 14:43   ` handa
@ 2016-08-17 15:28     ` Eli Zaretskii
  0 siblings, 0 replies; 16+ messages in thread
From: Eli Zaretskii @ 2016-08-17 15:28 UTC (permalink / raw)
  To: handa; +Cc: ynyaaa, 23814

> From: handa <handa@gnu.org>
> Cc: eliz@gnu.org, 23814@debbugs.gnu.org
> Date: Wed, 17 Aug 2016 23:43:13 +0900
>
> In article <87oa4rdhvq.fsf@gmail.com>, ynyaaa@gmail.com writes:
>
> > Hi, I tried new china-util.el.  It works very well.
>
> Thank you for testing it.
>
> > I prefer 7bit encoding to use only 7bit data, too.
> > As for elisp, "\u12345" is treated as "\u1234\ 5".
>
> Ah, ok, I changed to encode characters not in BMP to \UXXXXXXXX.
>
> I've just committed the attached change.

Thanks.  Please close the bug if satisfied with the solution.

end of thread, other threads:[~2016-08-17 15:28 UTC | newest]

Thread overview: 16+ messages
2016-06-21 12:22 bug#23814: 24.5; bug of hz coding-system ynyaaa
2016-06-21 12:58 ` Eli Zaretskii
2016-06-22 13:47 ` ynyaaa
2016-06-22 15:28 ` Eli Zaretskii
2016-06-22 17:04 ` ynyaaa
2016-06-22 17:26 ` Eli Zaretskii
2016-07-09 11:20 ` Eli Zaretskii
2016-07-13 14:12 ` handa
2016-07-23 17:47 ` Eli Zaretskii
2016-07-24  8:21 ` ynyaaa
2016-07-26 15:09 ` handa
2016-07-29  1:05 ` ynyaaa
2016-08-14 11:22 ` handa
2016-08-17  6:33 ` ynyaaa
2016-08-17 14:43 ` handa
2016-08-17 15:28 ` Eli Zaretskii