From mboxrd@z Thu Jan 1 00:00:00 1970
From: handa
Newsgroups: gmane.emacs.bugs
Subject: bug#23814: 24.5; bug of hz coding-system
Date: Sun, 14 Aug 2016 20:22:25 +0900
Message-ID: <87bn0vzjbi.fsf@gnu.org>
References: <877fdiu3xz.fsf@gmail.com>
In-Reply-To: <871t2dz22d.fsf@gmail.com> (ynyaaa@gmail.com)
To: ynyaaa@gmail.com
Cc: 23814@debbugs.gnu.org
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="

--=-=-=
Content-Type: text/plain

Hi, sorry for the late response.  I've just noticed that my reply mail
didn't go out successfully.  I'm trying to re-send it.  I wrote:

> In article <871t2dz22d.fsf@gmail.com>, ynyaaa@gmail.com writes:
>
> > If there are unencodable characters, encodable characters may be broken.
> > In this example, the second ?\x4E00 character disappears.
>
> > (set-language-environment 'Chinese-GB)
> > (decode-coding-string (encode-coding-string "\x4E00\x00B7\x4E00" 'hz) 'hz)
> >>> "\x4E00\e\x3048\x6070\x70B3\x11213D\300\273"
>
> How to treat unencodable characters on encoding is a difficult problem.
> As HZ is designed for 7-bit environments, I think it's important to keep
> the output 7-bit on encoding.  So, the new code uses \uXXXX for those
> characters.  Another way is to use a UTF-8 sequence for them, so that we
> can decode it back.  Which, do you think, is better?
>
> > To avoid this behavior, there are some solutions.
> > (a) While decoding, replace "~{...~}" with "\e$A...\e(B"
> >     and decode with iso-2022-7bit.
> > (b) Like (a), replace "~{...~}" with "\e$A...\e(B" while decoding
> >     and insert "\e$)A" at the beginning of the temp buffer
> >     and decode with iso-2022-8bit-ss2.
> >     (8bit data are decoded as euc-cn.)
> > (c) While encoding, use euc-cn instead of iso-2022-7bit
> >     and translate each consecutive 8bit data to 7bit data
> >     prefixed by "~{" and postfixed by "~}".
>
> I adopted the (a) method for decoding, and fixed bugs in the encoding
> code.
>
> > By the way, RFC1843 describes:
> >   The escape sequence '~\n' is a line-continuation marker to be
> >   consumed with no output produced.
>
> The variable decode-hz-line-continuation controls this feature.
> I don't remember why the default is nil (i.e. do not decode ~\n);
> perhaps some Chinese people with whom I was discussing the
> implementation of HZ support suggested that.
>
> Attached is the full china-util.el (not a diff).
>
> ---
> K. Handa
> handa@gnu.org

--=-=-=
Content-Type: application/emacs-lisp; charset=utf-8
Content-Disposition: attachment; filename=china-util.el

;;; china-util.el --- utilities for Chinese  -*- coding: utf-8 -*-

;; Copyright (C) 1995, 2001-2016 Free Software Foundation, Inc.
;; Copyright (C) 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004,
;;   2005, 2006, 2007, 2008, 2009, 2010, 2011
;;   National Institute of Advanced Industrial Science and Technology (AIST)
;;   Registration Number H14PRO021
;; Copyright (C) 2003
;;   National Institute of Advanced Industrial Science and Technology (AIST)
;;   Registration Number H13PRO009

;; Keywords: mule, multilingual, Chinese

;; This file is part of GNU Emacs.

;; GNU Emacs is free software: you can redistribute it and/or modify
;; it under the terms of the GNU General Public License as published by
;; the Free Software Foundation, either version 3 of the License, or
;; (at your option) any later version.

;; GNU Emacs is distributed in the hope that it will be useful,
;; but WITHOUT ANY WARRANTY; without even the implied warranty of
;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
;; GNU General Public License for more details.

;; You should have received a copy of the GNU General Public License
;; along with GNU Emacs.  If not, see .

;;; Commentary:

;;; Code:

;; Hz/ZW/EUC-TW encoding stuff

;; HZ is an encoding method for the Chinese character set GB2312,
;; widely used on the Internet.  It is very similar to the 7-bit
;; environment of ISO-2022.  The difference is that HZ uses the
;; sequences "~{" and "~}" for designating GB2312 and ASCII
;; respectively; hence, it doesn't use the ESC (0x1B) code.

;; ZW is another encoding method for Chinese character set GB2312.
;; It encodes Chinese characters line by line by starting each line with
;; the sequence "zW".  Like HZ, it uses only 7-bit codes.

;; EUC-TW is similar to EUC-KS or EUC-JP.  Its main character set is
;; plane 1 of CNS 11643; characters of planes 2 to 7 are accessed with
;; a single shift escape followed by three bytes: the first gives the
;; plane, the second and third the character code.  Note that characters
;; of plane 1 are (redundantly) accessible with a single shift escape
;; also.

;; ISO-2022 escape sequence to designate GB2312.
(defvar iso2022-gb-designation "\e$A")
;; HZ escape sequence to designate GB2312.
(defvar hz-gb-designation "~{")
;; ISO-2022 escape sequence to designate ASCII.
(defvar iso2022-ascii-designation "\e(B")
;; HZ escape sequence to designate ASCII.
(defvar hz-ascii-designation "~}")
;; Regexp of ZW sequence to start GB2312.
(defvar zw-start-gb "^zW")
;; Regexp for start of GB2312 in an encoding mixture of HZ and ZW.
(defvar hz/zw-start-gb
  (concat hz-gb-designation "\\|" zw-start-gb "\\|[^\0-\177]"))

(defvar decode-hz-line-continuation nil
  "Flag to tell if we should care about the line continuation convention of HZ.")

(defconst hz-set-msb-table
  (eval-when-compile
    (let ((chars nil)
          (i 0))
      (while (< i 33)
        (push i chars)
        (setq i (1+ i)))
      (while (< i 127)
        (push (decode-char 'eight-bit (+ i 128)) chars)
        (setq i (1+ i)))
      (apply 'string (nreverse chars)))))

;;;###autoload
(defun decode-hz-region (beg end)
  "Decode HZ/ZW encoded text in the current region.
Return the length of resulting text."
  (interactive "r")
  (save-excursion
    (save-restriction
      (let (pos ch)
        (narrow-to-region beg end)

        ;; We, at first, convert HZ/ZW to `iso-2022-7bit',
        ;; then decode it.
        ;; "~\n" -> "", "~~" -> "~"
        (goto-char (point-min))
        (while (search-forward "~" nil t)
          (setq ch (following-char))
          (cond ((= ch ?{)
                 (delete-region (1- (point)) (1+ (point)))
                 (setq pos (point))
                 (insert iso2022-gb-designation)
                 (if (looking-at "\\([!-}][!-~]\\)*")
                     (goto-char (match-end 0)))
                 (if (looking-at hz-ascii-designation)
                     (delete-region (match-beginning 0) (match-end 0)))
                 (insert iso2022-ascii-designation)
                 (decode-coding-region pos (point) 'iso-2022-7bit))

                ((= ch ?~)
                 (delete-char 1))

                ((and (= ch ?\n)
                      decode-hz-line-continuation)
                 (delete-region (1- (point)) (1+ (point))))

                (t
                 (forward-char 1)))))

      (- (point-max) (point-min)))))

;;;###autoload
(defun decode-hz-buffer ()
  "Decode HZ/ZW encoded text in the current buffer."
  (interactive)
  (decode-hz-region (point-min) (point-max)))

(defvar hz-category-table nil)

;;;###autoload
(defun encode-hz-region (beg end)
  "Encode the text in the current region to HZ.
Return the length of resulting text."
  (interactive "r")
  (unless hz-category-table
    (setq hz-category-table (make-category-table))
    (with-category-table hz-category-table
      (define-category ?c "hz encodable")
      (map-charset-chars #'modify-category-entry 'ascii ?c)
      (map-charset-chars #'modify-category-entry 'chinese-gb2312 ?c)))
  (save-excursion
    (save-restriction
      (narrow-to-region beg end)
      (with-category-table hz-category-table
        ;; ~ -> ~~
        (goto-char (point-min))
        (while (search-forward "~" nil t) (insert ?~))

        ;; ESC -> ESC ESC
        (goto-char (point-min))
        (while (search-forward "\e" nil t) (insert ?\e))

        ;; Non-ASCII-GB2312 -> \uXXXX
        (goto-char (point-min))
        (while (re-search-forward "\\Cc" nil t)
          (let ((ch (preceding-char)))
            (delete-char -1)
            (insert (format "\\u%04X" ch))))

        ;; Prefer chinese-gb2312 for Chinese characters.
        (put-text-property (point-min) (point-max) 'charset 'chinese-gb2312)
        (encode-coding-region (point-min) (point-max) 'iso-2022-7bit)

        ;; ESC $ A ... ESC ( B  ->  ~{ ... ~}
        ;; ESC ESC -> ESC
        (goto-char (point-min))
        (while (search-forward "\e" nil t)
          (if (= (following-char) ?\e)
              ;; ESC ESC -> ESC
              (delete-char 1)
            (forward-char -1)
            (if (looking-at iso2022-gb-designation)
                (progn
                  (delete-region (match-beginning 0) (match-end 0))
                  (insert hz-gb-designation)
                  (search-forward iso2022-ascii-designation nil 'move)
                  (delete-region (match-beginning 0) (match-end 0))
                  (insert hz-ascii-designation))))))
      (- (point-max) (point-min)))))

;;;###autoload
(defun encode-hz-buffer ()
  "Encode the text in the current buffer to HZ."
  (interactive)
  (encode-hz-region (point-min) (point-max)))

;;;###autoload
(defun post-read-decode-hz (len)
  (let ((pos (point))
        (buffer-modified-p (buffer-modified-p))
        last-coding-system-used)
    (prog1
        (decode-hz-region pos (+ pos len))
      (set-buffer-modified-p buffer-modified-p))))

;;;###autoload
(defun pre-write-encode-hz (from to)
  (let ((buf (current-buffer)))
    (set-buffer (generate-new-buffer " *temp*"))
    (if (stringp from)
        (insert from)
      (insert-buffer-substring buf from to))
    (let (last-coding-system-used)
      (encode-hz-region 1 (point-max)))
    nil))

;;
(provide 'china-util)

;;; china-util.el ends here

--=-=-=--
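
For anyone who wants to try the round trip from the original report
against the attached file, here is a minimal sketch.  It assumes that
china-util.el (the attached version or the copy shipped with Emacs) is
on `load-path' and provides the `china-util' feature.  The helper name
`my-hz-round-trip' is hypothetical and is not part of china-util.el; it
only calls `encode-hz-region' and `decode-hz-region' in a temporary
buffer.

```elisp
(require 'china-util)   ; assumption: the file is on `load-path'

(defun my-hz-round-trip (string)
  "Encode STRING to HZ, decode it back, and return (ENCODED DECODED).
Hypothetical helper for experimenting; not part of china-util.el."
  (with-temp-buffer
    (insert string)
    ;; Encode the whole buffer in place and keep a copy of the result.
    (encode-hz-region (point-min) (point-max))
    (let ((encoded (buffer-string)))
      ;; Decode the encoded text in place.
      (decode-hz-region (point-min) (point-max))
      (list encoded (buffer-string)))))

;; The characters from the bug report: U+4E00, U+00B7 (not in GB2312),
;; U+4E00.  Inspect both elements of the returned list by hand.
(my-hz-round-trip "\u4E00\u00B7\u4E00")
```

According to the message above, the new encoder writes an unencodable
character as \uXXXX, so the character U+00B7 should show up in that form
in the encoded text and should not be converted back on decoding.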