From mboxrd@z Thu Jan  1 00:00:00 1970
Path: main.gmane.org!not-for-mail
From: Kenichi Handa <handa@m17n.org>
Newsgroups: gmane.emacs.devel
Subject: Re: TUTORIAL.bg and windows-1251
Date: Tue, 25 Nov 2003 08:55:52 +0900 (JST)
Sender: emacs-devel-bounces+emacs-devel=quimby.gnus.org@gnu.org
Message-ID: <200311242355.IAA24563@etlken.m17n.org>
References: <3FB52552.6090302@fmi.uni-sofia.bg>
	<200311170721.QAA11735@etlken.m17n.org>
	<3FBA3F81.4010602@fmi.uni-sofia.bg>
Cc: emacs-devel@gnu.org
Original-To: ogi@fmi.uni-sofia.bg
In-reply-to: <3FBA3F81.4010602@fmi.uni-sofia.bg> (message from Ognyan Kulev on
	Tue, 18 Nov 2003 17:49:21 +0200)

Sorry for the late responses on this thread.  I'm currently
involved in more threads than I can keep up with.

In article <3FBA3F81.4010602@fmi.uni-sofia.bg>, Ognyan Kulev <ogi@fmi.uni-sofia.bg> writes:

> Kenichi Handa wrote:
>>  I think the default handling of cyrillic characters must be
>>  most convenient for native users.  But, there are many
>>  languages that use cyrillic and their requests may conflict.
>>  So I think we must start from adjusting each language
>>  environment.  Once we found most language environments
>>  require the same setting, we can make it the default.

> Can X encoding be adjusted?  Isn't there only two choices for cyrillic: 
> iso10646-1 and iso8859-5?

It seems that the bg_BG locale of glibc, GTK, or XFree86 (I
don't know which of them is responsible) encodes Cyrillic
characters in selections using an extended segment with the
charset name "microsoft-cp1251".

Please try the attached file.  It overrides the ctext
encoder/decoder so that microsoft-cp1251 extended segments are
decoded, and are also generated on encoding in the Bulgarian
language environment.
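
To make the layout of such an extended segment concrete, here is a
minimal sketch of how the header and its two length octets fit
together.  The helper names `my-ctext-extended-segment' and
`my-ctext-extended-segment-parse' are hypothetical and are not part
of the attached ctext.el; the sketch only assumes the layout
ESC % / N M L NAME STX BYTES, where M and L carry the length of
NAME + STX + BYTES in 7-bit halves with the high bit set.

```elisp
;; Hypothetical helpers for illustration only; not part of ctext.el.

(defun my-ctext-extended-segment (encoding-name noctets bytes)
  "Return unibyte BYTES wrapped in a Compound Text extended segment.
ENCODING-NAME is an ASCII string such as \"microsoft-cp1251\",
NOCTETS is the octets-per-character digit (0 to 4), and BYTES is
an already encoded unibyte string."
  ;; The length covers the encoding name, the STX octet and the payload.
  (let ((len (+ (length encoding-name) 1 (length bytes))))
    (concat (unibyte-string ?\e ?% ?/ (+ ?0 noctets)
                            (+ (/ len 128) 128)   ; M: high 7 bits of LEN
                            (+ (% len 128) 128))  ; L: low 7 bits of LEN
            encoding-name
            (unibyte-string 2)                    ; STX terminates the name
            bytes)))

(defun my-ctext-extended-segment-parse (segment)
  "Return (ENCODING-NAME . PAYLOAD) for the unibyte extended SEGMENT."
  (let* ((len (+ (* (- (aref segment 4) 128) 128)
                 (- (aref segment 5) 128)))
         ;; The name starts right after the six header octets.
         (stx (string-match "\002" segment 6)))
    (cons (substring segment 6 stx)
          (substring segment (1+ stx) (+ 6 len)))))
```

For example, wrapping the two octets \300\301 (the first two
Cyrillic capital letters in windows-1251) under the name
"microsoft-cp1251" gives a length of 16 + 1 + 2 = 19, so M is 128
and L is 147.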

[...]
> The negative site of Debian packages is that each encoding of the four 
> above mentioned has its own package.  So people sometimes install only 
> microsoft-cp1251 and iso10646-1 fonts, without koi8-r and iso8859-5 ones.

> Another problem with cronyx-courier is that it doesn't work when it's 
> set in Default in Basic Faces customize group.  I've just posted 
> question to comp.emacs.

> What about the following: when mule-unicode-0100-24ff is used and the 
> used iso10646-1 font doesn't contain wanted character (e.g. cyrillic 
> one), then another font is searched that contains such character.  I 
> think this will often end up in cronyx-courier.  Is this hard to be 
> implemented?

I've implemented it in the emacs-unicode version.  But that
change depends on various pieces of emacs-unicode
infrastructure, so it's very difficult to backport it to HEAD.

Anyway, the attached ctext.el also contains a short piece of
code that lets Emacs display windows-1251 characters with a
microsoft-cp1251 font.  Please try calling
(use-microsoft-cp1251-font).
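
One possible way to try it is sketched below.  The path
"~/ctext.el" is only an assumed place to save the attachment, and
the sketch assumes the windows-1251 coding system (with its
`encode-windows-1251' translation table) is already defined in
your Emacs before the file is loaded.

```elisp
;; Assumed location of the attached file; adjust to where you saved it.
(load-file "~/ctext.el")

;; The language environment whose ctext settings the file overrides.
(set-language-environment "Bulgarian")

;; Route windows-1251 characters to a microsoft-cp1251 font.
(use-microsoft-cp1251-font)
```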

---
Ken'ichi HANDA
handa@m17n.org

--- ctext.el ---
(defvar ctext-non-standard-encodings-database
  '(("big5-0" big5 2 (chinese-big5-1 chinese-big5-2)))
  "Alist of non-standard character set encodings for CTEXT's extended segments.
Each element has the form (ENCODING-NAME CODING-SYSTEM N-OCTET CHARSET)
and provides information about how to use \"extended segments\"
with the encoding name ENCODING-NAME.

CODING-SYSTEM is the coding-system to encode the characters into
an extended segment.

N-OCTET is the number of octets (bytes) that encodes a character
in the segment.  It can be 0 (meaning the number of octets per
character is variable), 1, 2, 3, or 4.

CHARSET is a character set containing characters that are encoded
as ENCODING-NAME.  It may be a list of character sets.  It may
also be a char-table, in which case characters that have non-nil
value in the char-table are the target.

On decoding CTEXT, all encoding names listed here are recognized.

On encoding CTEXT, encoding names in the variable
`ctext-non-standard-encodings-list' and in
`ctext-non-standard-encodings' property of the current language
environment are used.")

(defun ctext-post-read-conversion (len)
  "Decode LEN characters encoded as Compound Text with Extended Segments."
  (save-match-data
    (save-restriction
      (let ((case-fold-search nil)
	    (in-workbuf (string= (buffer-name) " *code-converting-work*"))
	    last-coding-system-used
	    pos bytes)
	(or in-workbuf
	    (narrow-to-region (point) (+ (point) len)))
	(decode-coding-region (point-min) (point-max) 'ctext)
	(if in-workbuf
	    (set-buffer-multibyte t))
	(while (re-search-forward ctext-non-standard-encodings-regexp
				  nil 'move)
	  (setq pos (match-beginning 0))
	  (if (match-beginning 1)
	      ;; ESC % / [0-4] M L --ENCODING-NAME-- \002 --BYTES--
	      (let* ((M (char-after (+ pos 4)))
		     (L (char-after (+ pos 5)))
		     (encoding (match-string 2))
		     (encoding-info (assoc-ignore-case 
				     encoding
				     ctext-non-standard-encodings-database))
		     (coding (if encoding-info
				 (nth 1 encoding-info)
			       (setq encoding (intern (downcase encoding)))
			       (and (coding-system-p encoding)
				    encoding))))
		(setq bytes (- (+ (* (- M 128) 128) (- L 128))
			       (- (point) (+ pos 6))))
		(when coding
		  (delete-region pos (point))
		  (forward-char bytes)
		  (decode-coding-region (- (point) bytes) (point) coding)))
	    ;; ESC % G --UTF-8-BYTES-- ESC % @
	    (setq bytes (- (point) pos))
	    (decode-coding-region (- (point) bytes) (point) 'utf-8))))
      (goto-char (point-min))
      (- (point-max) (point)))))

(defvar ctext-non-standard-encodings-list
  '("big5-0")
  "List of non-standard character set encoding names used in CTEXT.")

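(defun ctext-non-standard-encodings-table ()
  "Return a char-table mapping characters to their extended-segment info.
Each non-nil value is an element of `ctext-non-standard-encodings-database'."
  (let ((table (make-char-table 'translation-table)))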
    (dolist (encoding (reverse
		       (append
			(get-language-info current-language-environment
					   'ctext-non-standard-encodings)
			ctext-non-standard-encodings-list)))
      (let* ((slot (assoc encoding ctext-non-standard-encodings-database))
	     (charset (nth 3 slot)))
	(if charset
	    (cond ((charsetp charset)
		   (aset table (make-char charset) slot))
		  ((listp charset)
		   (dolist (elt charset)
		     (aset table (make-char elt) slot)))
		  ((char-table-p charset)
		   (map-char-table #'(lambda (k v) 
				   (if (and v (> k 128)) (aset table k slot)))
				   charset))))))
    table))

(defun ctext-pre-write-conversion (from to)
  "Encode characters between FROM and TO as Compound Text w/Extended Segments.

If FROM is a string, or if the current buffer is not the one set up for us
by encode-coding-string, generate a new temp buffer, insert the
text, and convert it in the temporary buffer.  Otherwise, convert in-place."
  (save-match-data
    ;; Setup a working buffer if necessary.
    (cond ((stringp from)
	   (let ((buf (current-buffer)))
	     (set-buffer (generate-new-buffer " *temp"))
	     (set-buffer-multibyte (multibyte-string-p from))
	     (insert from)))
	  ((not (string= (buffer-name) " *code-converting-work*"))
	   (let ((buf (current-buffer))
		 (multibyte enable-multibyte-characters))
	     (set-buffer (generate-new-buffer " *temp"))
	     (set-buffer-multibyte multibyte)
	     (insert-buffer-substring buf from to))))

    ;; Now we can encode the whole buffer.
    (let ((encoding-table (ctext-non-standard-encodings-table))
	  last-coding-system-used
	  last-pos last-encoding-info
	  pos encoding-info end-pos)
      (goto-char (setq last-pos (point-min)))
      (setq end-pos (point-marker))
      (while (re-search-forward "[^\000-\177]+" nil t)
	(setq last-pos (match-beginning 0)
	      last-encoding-info (aref encoding-table (char-after last-pos)))
	(set-marker end-pos (match-end 0))
	(goto-char (1+ last-pos))
	(catch 'tag
	  (while t
	    (setq encoding-info
		  (if (< (point) end-pos)
		      (aref encoding-table (following-char))))
	    (unless (eq last-encoding-info encoding-info)
	      (if last-encoding-info
		  (let ((encoding-name (car last-encoding-info))
			(coding-system (nth 1 last-encoding-info))
			(noctets (nth 2 last-encoding-info))
			len)
		    (encode-coding-region last-pos (point) coding-system)
		    (setq len (+ (length encoding-name) 1
				 (- (point) last-pos)))
		    (save-excursion
		      (goto-char last-pos)
		      ;; Terminate the name with STX (\002); LEN already counts it.
		      (insert (string-to-multibyte
			       (format "\e%%/%d%c%c%s\002"
				       noctets
				       (+ (/ len 128) 128)
				       (+ (% len 128) 128)
				       encoding-name)))))
		(encode-coding-region last-pos (point) 'ctext-no-compositions))
	      (setq last-pos (point)
		    last-encoding-info encoding-info))
	    (if (< (point) end-pos)
		(forward-char 1)
	      (throw 'tag nil))))
	(if (< last-pos (point))
	    (encode-coding-region last-pos (point) 'ctext-no-compositions)))
      (set-marker end-pos nil)
      (goto-char (point-min))))
  ;; Must return nil, as build_annotations_2 expects that.
  nil)

;; The following forms override the current settings.

(set-language-info "Bulgarian" 'ctext-non-standard-encodings
		   '("microsoft-cp1251"))

;; Register the microsoft-cp1251 entry in the database, or update it
;; if it is already present.
(let ((elt `("microsoft-cp1251" windows-1251 1
	     ,(get 'encode-windows-1251 'translation-table)))
      (slot (assoc "microsoft-cp1251" ctext-non-standard-encodings-database)))
  (if slot
      (setcdr slot (cdr elt))
    (push elt ctext-non-standard-encodings-database)))

(define-ccl-program ccl-encode-windows-1251-font
  '(0
    ((r1 <<= 7)
     (r1 += r2)
     (translate-character encode-windows-1251 r0 r1)
     )))

(let ((slot (assoc "microsoft-cp1251" font-ccl-encoder-alist)))
  (if slot
      (setcdr slot ccl-encode-windows-1251-font)
    (push '("microsoft-cp1251" . ccl-encode-windows-1251-font)
	  font-ccl-encoder-alist)))

(defun use-microsoft-cp1251-font ()
  "Display windows-1251 characters with a microsoft-cp1251 font."
  (let ((fontspec '(nil . "microsoft-cp1251")))
    (map-char-table
     #'(lambda (k v) 
	 (if (and v (> k 128))
	     (set-fontset-font "fontset-default" k fontspec)))
     (get 'encode-windows-1251 'translation-table))))