From mboxrd@z Thu Jan  1 00:00:00 1970
Path: main.gmane.org!not-for-mail
From: David Kastrup <dak@gnu.org>
Newsgroups: gmane.emacs.devel
Subject: How to create a derived encoding?
Date: Tue, 12 Oct 2004 02:10:00 +0200
Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Message-ID: <x5vfdgbxuv.fsf@lola.goethe.zz>
NNTP-Posting-Host: deer.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: sea.gmane.org 1097539823 16300 80.91.229.6 (12 Oct 2004 00:10:23 GMT)
X-Complaints-To: usenet@sea.gmane.org
NNTP-Posting-Date: Tue, 12 Oct 2004 00:10:23 +0000 (UTC)
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Tue Oct 12 02:10:14 2004
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Original-Received: from lists.gnu.org ([199.232.76.165])
	by deer.gmane.org with esmtp (Exim 3.35 #1 (Debian))
	id 1CHAF4-0007AC-00
	for <ged-emacs-devel@m.gmane.org>; Tue, 12 Oct 2004 02:10:14 +0200
Original-Received: from localhost ([127.0.0.1] helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.33)
	id 1CHAM1-0000MW-Mz
	for ged-emacs-devel@m.gmane.org; Mon, 11 Oct 2004 20:17:25 -0400
Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.33)
	id 1CHALu-0000MA-QI
	for emacs-devel@gnu.org; Mon, 11 Oct 2004 20:17:18 -0400
Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.33)
	id 1CHALt-0000Ll-IL
	for emacs-devel@gnu.org; Mon, 11 Oct 2004 20:17:17 -0400
Original-Received: from [199.232.76.173] (helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.33) id 1CHALt-0000Lb-G5
	for emacs-devel@gnu.org; Mon, 11 Oct 2004 20:17:17 -0400
Original-Received: from [199.232.76.164] (helo=fencepost.gnu.org)
	by monty-python.gnu.org with esmtp (Exim 4.34) id 1CHAEt-0006bz-Pe
	for emacs-devel@gnu.org; Mon, 11 Oct 2004 20:10:03 -0400
Original-Received: from localhost ([127.0.0.1] helo=lola.goethe.zz)
	by fencepost.gnu.org with esmtp (Exim 4.34) id 1CHAEq-0006dc-Sd
	for emacs-devel@gnu.org; Mon, 11 Oct 2004 20:10:01 -0400
Original-To: emacs-devel@gnu.org
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/21.3.50 (gnu/linux)
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: main.gmane.org gmane.emacs.devel:28266
X-Report-Spam: http://spam.gmane.org/gmane.emacs.devel:28266


After considerable thinking about the problem, I have arrived at the
conclusion that for efficiency's sake I'd like to have an encoding
like tex-utf-8 which is derived from the normal utf-8 except that
sequences like ^^8a and similar are converted into a corresponding
byte before the bytes are combined into Unicode characters.  It would
be a bonus if such sequences stayed unchanged whenever this sort of
composition does not yield a valid Unicode character, but that is
optional.

The problem is that TeX has no clue about _characters_, but works on
byte streams, and it has the habit of transliterating some byte codes
in the above manner.  Treating the output of TeX sensibly means
converting those transliterations back into bytes _before_ assembling
Unicode characters.
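
To illustrate the order of operations described above, here is a
minimal sketch in Python (not Emacs Lisp, and not an existing coding
system).  The names decode_tex_utf8 and "tex-caret" are made up for
this example, and it assumes that TeX writes each 8-bit byte as ^^
followed by two lowercase hex digits.  Bytes that do not form valid
UTF-8 are written back in ^^xx form, which covers the "bonus" case.

```python
import codecs
import re

# Assumption: TeX transliterates a byte >= 0x80 as ^^ plus two lowercase
# hex digits (e.g. ^^8a).  Other ^^ forms are deliberately left alone.
CARET_HEX = re.compile(rb"\^\^([89a-f][0-9a-f])")


def _caret_fallback(err):
    """Error handler: re-emit undecodable bytes in ^^xx notation."""
    bad = err.object[err.start:err.end]
    return ("".join("^^%02x" % b for b in bad), err.end)


# "tex-caret" is a hypothetical handler name chosen for this sketch.
codecs.register_error("tex-caret", _caret_fallback)


def decode_tex_utf8(raw: bytes) -> str:
    """Turn ^^xx back into bytes first, then assemble UTF-8 characters."""
    restored = CARET_HEX.sub(lambda m: bytes([int(m.group(1), 16)]), raw)
    # Caveat: a raw invalid byte that TeX did not transliterate is also
    # shown as ^^xx here; the sketch does not track where a byte came from.
    return restored.decode("utf-8", errors="tex-caret")
```

For example, decode_tex_utf8(b"f^^c3^^a4r") gives "f" + U+00E4 + "r",
while decode_tex_utf8(b"x^^8ay") gives "x^^8ay" unchanged, because
0x8a on its own is not valid UTF-8.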

The same problem occurs with unibyte non-ASCII encodings like Latin-1.
I already have one (rather inefficient) hack to deal with that in
preview-latex, but it does not extend easily to multibyte.

So if there were a tolerably working way to derive a special encoding
(to be used as a process output encoding) that converts sequences
like the above back into bytes before composing Unicode characters
from the resulting utf-8 stream, this would appear to be by far the
fastest and most convenient way to go about this problem.

Any hints on how to derive a suitably augmented encoding from an existing
one?

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum