From: David Kastrup
Newsgroups: gmane.emacs.devel
Subject: How to create a derived encoding?
Date: Tue, 12 Oct 2004 02:10:00 +0200
To: emacs-devel@gnu.org

After considerable thought about the problem, I have concluded that, for
efficiency's sake, I'd like to have an encoding such as tex-utf-8 that is
derived from the normal utf-8, except that sequences like ^^8a are
converted into the corresponding byte before the bytes are combined into
Unicode characters. It would be a bonus if such sequences stayed
unchanged whenever this composition does not lead to a valid Unicode
character, but that's just a bonus.

The problem is that TeX has no clue about _characters_ but works on byte
streams, and it has the habit of transliterating some byte codes in the
above manner. Treating the output of TeX sensibly means converting those
transliterations back into bytes _before_ assembling Unicode characters.
The same problem occurs with unibyte non-ASCII encodings like Latin-1. I
already have one (rather inefficient) hack to deal with that in
preview-latex, but it does not extend easily to multibyte.

So if there were a tolerably workable way to derive a special encoding
(to be used as a process output encoding) that reconverts control
sequences like the above before composing Unicode characters from the
resulting utf-8 stream, this would appear to be by far the fastest and
most convenient way to go about this problem.

Any hints on how to derive a suitably augmented encoding from an
existing one?

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum
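
To make the intended order of operations concrete, here is a minimal sketch in Python (not Emacs Lisp, and not a coding system definition). It assumes that only the two-digit lowercase-hex form ^^xx for bytes 0x80 and above needs handling; the function names are hypothetical. It turns the escapes back into bytes first, then assembles UTF-8 characters, and leaves an escape untouched when the resulting bytes do not form a valid character.

```python
import re

# Assumption: only ^^xx with two lowercase hex digits and a value >= 0x80
# is treated as a transliterated byte; other ^^ forms are left alone.
_ESCAPE = re.compile(rb"\^\^([89a-f][0-9a-f])")


def _units(data: bytes):
    """Split data into (byte_value, original_source_bytes) pairs."""
    units = []
    pos = 0
    for m in _ESCAPE.finditer(data):
        units.extend((b, bytes([b])) for b in data[pos:m.start()])
        units.append((int(m.group(1), 16), m.group(0)))
        pos = m.end()
    units.extend((b, bytes([b])) for b in data[pos:])
    return units


def _seq_len(lead: int) -> int:
    """Length of the UTF-8 sequence announced by a lead byte, 0 if invalid."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def decode_tex_utf8(data: bytes) -> str:
    """Undo ^^xx escapes, then decode UTF-8; keep escapes that do not fit."""
    units = _units(data)
    out = []
    i = 0
    while i < len(units):
        n = _seq_len(units[i][0])
        chunk = bytes(value for value, _ in units[i:i + n])
        try:
            if n == 0 or len(chunk) < n:
                raise ValueError("not a complete UTF-8 sequence")
            out.append(chunk.decode("utf-8"))  # strict: rejects bad sequences
            i += n
        except ValueError:
            # Fall back to the original text of this unit, e.g. "^^8a".
            # A raw invalid byte (not from an escape) becomes U+FFFD.
            out.append(units[i][1].decode("ascii", errors="replace"))
            i += 1
    return "".join(out)
```

For example, decode_tex_utf8(b"^^c3^^a4") yields the single character U+00E4, while decode_tex_utf8(b"a^^8ab") returns the input text unchanged because 0x8a alone is not a valid UTF-8 sequence.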