From mboxrd@z Thu Jan  1 00:00:00 1970
Path: main.gmane.org!not-for-mail
From: David Kastrup <dak@gnu.org>
Newsgroups: gmane.emacs.devel
Subject: Re: decode-coding-string gone awry?
Date: Mon, 14 Feb 2005 19:41:19 +0100
Message-ID: <x53bvz3rxs.fsf@lola.goethe.zz>
References: <x5d5v52k4m.fsf@lola.goethe.zz>
	<874qgf1dkv.fsf-monnier+emacs@gnu.org> <x5hdkf5jzi.fsf@lola.goethe.zz>
	<jwvbranhykt.fsf-monnier+emacs@gnu.org>
	<x5fyzz3vh4.fsf@lola.goethe.zz>
	<jwvu0ofggsu.fsf-monnier+emacs@gnu.org>
NNTP-Posting-Host: main.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: sea.gmane.org 1108406491 26464 80.91.229.2 (14 Feb 2005 18:41:31 GMT)
X-Complaints-To: usenet@sea.gmane.org
NNTP-Posting-Date: Mon, 14 Feb 2005 18:41:31 +0000 (UTC)
Cc: emacs-devel@gnu.org
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Mon Feb 14 19:41:30 2005
Original-Received: from lists.gnu.org ([199.232.76.165])
	by ciao.gmane.org with esmtp (Exim 4.43)
	id 1D0l9s-0008BV-W7
	for ged-emacs-devel@m.gmane.org; Mon, 14 Feb 2005 19:41:21 +0100
Original-Received: from localhost ([127.0.0.1] helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1D0lPT-0005q2-8o
	for ged-emacs-devel@m.gmane.org; Mon, 14 Feb 2005 13:57:27 -0500
Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1D0lOw-0005oi-OL
	for emacs-devel@gnu.org; Mon, 14 Feb 2005 13:56:55 -0500
Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1D0lOu-0005np-L1
	for emacs-devel@gnu.org; Mon, 14 Feb 2005 13:56:53 -0500
Original-Received: from [199.232.76.173] (helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1D0lOu-0005mv-9K
	for emacs-devel@gnu.org; Mon, 14 Feb 2005 13:56:52 -0500
Original-Received: from [199.232.76.164] (helo=fencepost.gnu.org)
	by monty-python.gnu.org with esmtp (Exim 4.34) id 1D0l9q-0007o6-P9
	for emacs-devel@gnu.org; Mon, 14 Feb 2005 13:41:18 -0500
Original-Received: from localhost ([127.0.0.1] helo=lola.goethe.zz)
	by fencepost.gnu.org with esmtp (Exim 4.34)
	id 1D0l5p-0004xw-8u; Mon, 14 Feb 2005 13:37:09 -0500
Original-To: Stefan Monnier <monnier@iro.umontreal.ca>
In-Reply-To: <jwvu0ofggsu.fsf-monnier+emacs@gnu.org> (Stefan Monnier's
	message of "Mon, 14 Feb 2005 13:12:03 -0500")
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.0.50 (gnu/linux)
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
X-MailScanner-To: ged-emacs-devel@m.gmane.org
Xref: main.gmane.org gmane.emacs.devel:33406
X-Report-Spam: http://spam.gmane.org/gmane.emacs.devel:33406

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>>> instead of being processed directly from the process filter, then
>>> you should also ensure that this buffer is unibyte.
>
>> Yuk.  The problem is that this buffer is not only processed by
>> preview-latex, but also by AUCTeX, and the versions that get combined
>> may be different.  AUCTeX uses the source code buffer's file encoding
>> by default, which is fine for basically unibyte based coding systems.
>
> If you can't change this part, then your best bet might be to do something
> like:
>
> (defun preview-error-quote (string)
>   "Turn STRING with potential ^^ sequences into a regexp.
> To preserve sanity, additional ^ prefixes are matched literally,
> so the character represented by ^^^ preceding extended characters
> will not get matched, usually."
>   (let (output case-fold-search)
>     (while (string-match "\\^*\\(\\^\\^\\(\\([@-_?]\\)\\|[8-9a-f][0-9a-f]\\)\\)+"
>                          string)
>       (setq output
>             (concat output
>                     (regexp-quote (substring string 0 (match-beginning 1)))
>                     (decode-coding-string
>                      (preview-dequote-thingies
>                       (substring string (match-beginning 1) (match-end 0)))
>                      buffer-file-coding-system))
>             string (substring string (match-end 0))))
>     (setq output (concat output (regexp-quote string)))
>     output))
>
> BTW, you can use the 3rd arg to string-match to avoid consing strings for
> `string'.
>
> This way you only apply decode-coding-string to the part of the
> string which is still undecoded but not to the rest.

No use.  The gag is precisely that TeX may decide to split a _single_
Unicode character into some bytes that it will let go through
unchanged, and some bytes that it will transcribe into ^^ba notation.
If decode-coding-string is supposed to have a chance of reassembling
this junk, it must only be run at the end of reconstructing the byte
stream.  Yes, this is completely insane.  No, I can't avoid having to
deal with it somehow.
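
To make the ordering concrete, here is a minimal sketch, not the
preview-latex code: it first turns every ^^xx hex escape back into its
raw byte and only then decodes the whole byte string once.  The name
`my-dequote-then-decode' is hypothetical, the ^^@-style control
notation is ignored for brevity, and it assumes a unibyte input string
and an Emacs that provides `unibyte-string'.

```elisp
;; Hypothetical helper, not part of preview-latex or AUCTeX.
;; Assumes STRING is unibyte and that `unibyte-string' is available.
(defun my-dequote-then-decode (string coding)
  "Undo ^^xx hex escapes in unibyte STRING, then decode it with CODING.
Decoding happens exactly once, after the byte stream is complete."
  (let ((bytes (replace-regexp-in-string
                "\\^\\^\\([89a-f][0-9a-f]\\)"
                ;; M is the matched text, e.g. \"^^a4\"; the match data
                ;; seen here refers to M, so group 1 holds the hex digits.
                (lambda (m)
                  (unibyte-string (string-to-number (match-string 1 m) 16)))
                string t t)))
    ;; Only now are all bytes of a split character adjacent again.
    (decode-coding-string bytes coding)))
```

Called on the two UTF-8 bytes of a-umlaut, where the first went
through raw and the second was transcribed as ^^a4, that is
(my-dequote-then-decode "\303^^a4" 'utf-8), this should give back the
single character, which decoding the two fragments separately cannot.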

Give me a clue: what happens if a process inserts stuff with 'raw-text
encoding into a multibyte buffer?  'raw-text is a reconstructible
encoding, isn't it, so the stuff will get converted into some prefix
byte indicating "isolated single-byte entity instead of utf-8 char"
and the byte itself or something, right?  And decode-encoding-string
does not want to work on something like that?

I have to admit to total cluelessness.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum