From: Kenichi Handa
Newsgroups: gmane.emacs.devel
Subject: Re: decode-coding-string gone awry?
Date: Mon, 14 Feb 2005 10:50:25 +0900 (JST)
Message-ID: <200502140150.KAA29610@etlken.m17n.org>
Original-To: David Kastrup
Cc: emacs-devel@gnu.org
In-reply-to: (message from David Kastrup on Sun, 13 Feb 2005 04:50:49 +0100)
User-Agent: SEMI/1.14.3 (Ushinoya) FLIM/1.14.2 (Yagi-Nishiguchi) APEL/10.2 Emacs/21.3.50 (sparc-sun-solaris2.6) MULE/5.0 (SAKAKI)

David Kastrup writes:

> I have the problem that within preview-latex there is a function that
> assembles UTF-8 strings from single characters.  This function, when
> used manually, mostly works.  It is called within a process sentinel
> and fails rather consistently there with a current CVS Emacs.  I
> include the code here since I don't know what might be involved here:
> regexp-quote, substring, char-to-string etc.  The starting string is
> taken from a buffer containing only ASCII (inserted by a process with
> coding-system 'raw-text).

It seems that you are caught in a trap of automatic
unibyte->multibyte conversion.

> (defun preview-error-quote (string)
>   "Turn STRING with potential ^^ sequences into a regexp.
> To preserve sanity, additional ^ prefixes are matched literally,
> so the character represented by ^^^ preceding extended characters
> will not get matched, usually."
>   (let (output case-fold-search)
>     (while (string-match "\\^\\{2,\\}\\(\\([@-_?]\\)\\|[8-9a-f][0-9a-f]\\)"
>                          string)
>       (setq output
>             (concat output
>                     (regexp-quote (substring string
>                                              0
>                                              (- (match-beginning 1) 2)))

If STRING is taken from a multibyte buffer, it is a multibyte
string.  Thus, the above substring also returns a multibyte
string.

>                     (if (match-beginning 2)
>                         (concat
>                          "\\(?:" (regexp-quote
>                                   (substring string
>                                              (- (match-beginning 1) 2)
>                                              (match-end 0)))
>                          "\\|"
>                          (char-to-string
>                           (logxor (aref string (match-beginning 2)) 64))
>                          "\\)")
>                       (char-to-string
>                        (string-to-number (match-string 1 string) 16))))

But this char-to-string produces a unibyte string.  So, on
concatenating them, this unibyte string is automatically
converted to multibyte by the string-make-multibyte function,
which usually produces a multibyte string containing latin-1
chars.

>             string (substring string (match-end 0))))
>     (setq output (concat output (regexp-quote string)))
>     (if (featurep 'mule)
>         (prog2
>             (message "%S %S " output buffer-file-coding-system)
>             (setq output (decode-coding-string output buffer-file-coding-system))

And this decode-coding-string treats the internal byte
sequence of the multibyte string OUTPUT as utf-8, thus you get
some garbage.

> Unfortunately, when I call this stuff by hand instead from the
> process-sentinel, it mostly works

That is because the string you give to preview-error-quote
is a unibyte string in that case.  The Lisp reader generates
a unibyte string when it sees an ASCII-only string.

Ex: (multibyte-string-p "abc") => nil

This will also return an incorrect string.

(preview-error-quote
 (string-to-multibyte "r Weise $f$ um~$1$ erh^^c3^^b6ht und $e$"))

So, the easiest fix will be to do:

(setq string (string-as-unibyte string))

at the head of preview-error-quote.

---
Ken'ichi HANDA
handa@m17n.org
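
The unibyte/multibyte distinction described above can be checked
directly.  The following is only a minimal sketch, assuming Emacs 23
or later (where unibyte-string is available) rather than the CVS
Emacs discussed in this message.  The name demo-decode-report is
hypothetical and not part of preview-latex.  It shows the intended
case: decode-coding-string is handed a unibyte string holding the
bytes written as ^^c3^^b6 and returns one multibyte character.

```elisp
;; Hypothetical helper for illustration; not part of preview-latex.
;; Assumes Emacs 23 or later, where `unibyte-string' is available.
(defun demo-decode-report ()
  "Return a list describing unibyte input and its UTF-8 decoding."
  (let* ((raw (unibyte-string #xc3 #xb6))   ; the two bytes spelled ^^c3^^b6
         (decoded (decode-coding-string raw 'utf-8)))
    (list (multibyte-string-p "abc")        ; ASCII-only literal from the reader
          (multibyte-string-p raw)          ; raw bytes are a unibyte string
          (multibyte-string-p decoded)      ; decoding yields a multibyte string
          (length raw)                      ; number of bytes going in
          (length decoded)                  ; number of characters coming out
          (aref decoded 0))))               ; code point of the decoded character

(demo-decode-report)
;; => (nil nil t 2 1 246)
```

The last element, 246 (#xf6), is the code point of o-umlaut, i.e. the
character that "erh^^c3^^b6ht" is meant to contain.  Checking
multibyte-string-p on the argument at the top of
preview-error-quote would show which of the two cases the process
sentinel is actually passing in.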