From: David Kastrup
Newsgroups: gmane.emacs.devel
Subject: Re: decode-coding-string gone awry?
Date: Mon, 14 Feb 2005 19:41:19 +0100
References: <874qgf1dkv.fsf-monnier+emacs@gnu.org>
To: Stefan Monnier
Cc: emacs-devel@gnu.org
In-Reply-To: (Stefan Monnier's message of "Mon, 14 Feb 2005 13:12:03 -0500")
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.0.50 (gnu/linux)

Stefan Monnier writes:

>>> instead of being processed directly from the process filter, then
>>> you should also ensure that this buffer is unibyte.
>
>> Yuk.  The problem is that this buffer is not only processed by
>> preview-latex, but also by AUCTeX, and the versions that get combined
>> may be different.  AUCTeX uses the source code buffer's file encoding
>> by default, which is fine for basically unibyte based coding systems.
>
> If you can't change this part, then your best bet might be to do something
> like:
>
> (defun preview-error-quote (string)
>   "Turn STRING with potential ^^ sequences into a regexp.
> To preserve sanity, additional ^ prefixes are matched literally,
> so the character represented by ^^^ preceding extended characters
> will not get matched, usually."
>   (let (output case-fold-search)
>     (while (string-match "\\^*\\(\\^\\^\\(\\([@-_?]\\)\\|[8-9a-f][0-9a-f]\\)\\)+"
>                          string)
>       (setq output
>             (concat output
>                     (regexp-quote (substring string 0 (match-beginning 1)))
>                     (decode-coding-string
>                      (preview-dequote-thingies (substring string
>                                                           (match-beginning 1)
>                                                           (match-end 0)))
>                      buffer-file-coding-system))
>             string (substring string (match-end 0))))
>     (setq output (concat output (regexp-quote string)))
>     output))
>
> BTW, you can use the 3rd arg to string-match to avoid consing strings for
> `string'.
>
> This way you only apply decode-coding-string to the part of the
> string which is still undecoded but not to the rest.
No use.  The gag is precisely that TeX may decide to split a _single_
Unicode character into some bytes that it will let go through
unchanged, and some bytes that it will transcribe into ^^ba notation.
If decode-coding-string is supposed to have a chance of reassembling
this junk, it must only be run at the end, once the byte stream has
been reconstructed.

Yes, this is completely insane.  No, I can't avoid having to deal with
it somehow.

Give me a clue: what happens if a process inserts stuff with 'raw-text
encoding into a multibyte buffer?  'raw-text is a reconstructible
encoding, isn't it, so the stuff will get converted into some prefix
byte indicating "isolated single-byte entity instead of utf-8 char"
and the byte itself or something, right?  And decode-coding-string
does not want to work on something like that?

I have to admit to total cluelessness.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum
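
A minimal sketch of the "dequote first, decode last" order argued for
above.  It is an illustration under assumptions, not preview-latex or
AUCTeX code: the name `my-tex-dequote-and-decode' is hypothetical, the
input is assumed to be a unibyte string, only ^^xx escapes for bytes
80..ff (lowercase hex) are handled, and extra ^ prefixes and control
characters such as ^^M are ignored.  It relies on `unibyte-string',
which postdates the Emacs version used in this thread.

```elisp
;; Hypothetical helper, for illustration only.
(defun my-tex-dequote-and-decode (bytes coding-system)
  "Rebuild the byte stream from BYTES, then decode it once.
BYTES is a unibyte string as TeX wrote it: some bytes raw, others
as ^^xx escapes.  CODING-SYSTEM is the source file's coding system."
  (let ((case-fold-search nil)
        (pos 0)
        chunks)
    ;; The START argument of `string-match' avoids consing a fresh
    ;; tail of BYTES on every iteration.
    (while (string-match "\\^\\^\\([89a-f][0-9a-f]\\)" bytes pos)
      ;; Raw bytes before the escape pass through untouched.
      (push (substring bytes pos (match-beginning 0)) chunks)
      ;; The escape becomes the single byte it stands for.
      (push (unibyte-string (string-to-number (match-string 1 bytes) 16))
            chunks)
      (setq pos (match-end 0)))
    (push (substring bytes pos) chunks)
    ;; Decode only now, when split characters are whole again.
    (decode-coding-string (apply #'concat (nreverse chunks))
                          coding-system)))
```

For instance, (my-tex-dequote-and-decode "\303^^a4" 'utf-8) takes the
UTF-8 sequence for a-umlaut with its first byte raw and its second
byte escaped, joins the two bytes, and only then decodes them.
Decoding either half on its own could not produce that character.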