From: Héctor Lahoz
Newsgroups: gmane.emacs.help
Subject: Re: How to translate LaTeX into UTF-8 in Elisp?
Date: Tue, 4 Jul 2017 12:23:48 +0200
Message-ID: <20170704102348.GA3579@workstation>
In-Reply-To: <87tw2t1kyf.fsf@jane>
To: help-gnu-emacs@gnu.org

Marcin Borkowski wrote:
> OK, so here is a proof of concept:
>
> --8<---------------cut here---------------start------------->8---
> (defvar TeX-to-Unicode-accents-alist
>   '((?` . "grave")
>     (?' . "acute")
>     (?^ . "circumflex")
>     (?\" . "diaeresis")
>     (?H . "double acute")
>     (?~ . "tilde")
>     (?c . "with cedilla")
>     (?k . "ogonek")
>     (?= . "macron")
>     (?. . "with dot above")
>     (?u . "with breve")
>     (?v . "with caron"))
>   "A mapping from TeX control characters to accent names used in
> Unicode.")
>
> (defun combine-letter-diacritical-mark (letter mark)
>   "Return a Unicode string of LETTER combined with MARK.
> MARK can be any character that can be used in TeX accenting
> commands."
>   (let* ((letter (if (stringp letter)
>                      (string-to-char letter)
>                    letter))
>          (uppercase (= letter
>                        (upcase letter))))
>     (cdr (assoc-string
>           (format "LATIN %s LETTER %c %s"
>                   (if uppercase "CAPITAL" "SMALL")
>                   letter
>                   (cdr (assoc mark TeX-to-Unicode-accents-alist)))
>           ucs-names
>           t))))
> --8<---------------cut here---------------end--------------->8---
>

Great. Perhaps you could consider translating to Unicode combining characters.
I think it is closer to the original TeX idea and could be cleaner:

0300;COMBINING GRAVE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING GRAVE;;;;
0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;;;;
0302;COMBINING CIRCUMFLEX ACCENT;Mn;230;NSM;;;;;N;NON-SPACING CIRCUMFLEX;;;;
0303;COMBINING TILDE;Mn;230;NSM;;;;;N;NON-SPACING TILDE;;;;
0304;COMBINING MACRON;Mn;230;NSM;;;;;N;NON-SPACING MACRON;;;;
0305;COMBINING OVERLINE;Mn;230;NSM;;;;;N;NON-SPACING OVERSCORE;;;;
0306;COMBINING BREVE;Mn;230;NSM;;;;;N;NON-SPACING BREVE;;;;
0307;COMBINING DOT ABOVE;Mn;230;NSM;;;;;N;NON-SPACING DOT ABOVE;;;;
0308;COMBINING DIAERESIS;Mn;230;NSM;;;;;N;NON-SPACING DIAERESIS;;;;
0309;COMBINING HOOK ABOVE;Mn;230;NSM;;;;;N;NON-SPACING HOOK ABOVE;;;;
030A;COMBINING RING ABOVE;Mn;230;NSM;;;;;N;NON-SPACING RING ABOVE;;;;
030B;COMBINING DOUBLE ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING DOUBLE ACUTE;;;;
030C;COMBINING CARON;Mn;230;NSM;;;;;N;NON-SPACING HACEK;;;;
030D;COMBINING VERTICAL LINE ABOVE;Mn;230;NSM;;;;;N;NON-SPACING VERTICAL LINE ABOVE;;;;

See the Wikipedia article on Unicode equivalence:

https://en.wikipedia.org/wiki/Unicode_equivalence

The difference is that Unicode reverses the order: the base character
comes first, followed by all of its combining characters. For example,
\'a would be translated to either

00E1;LATIN SMALL LETTER A WITH ACUTE

or

0061;LATIN SMALL LETTER A
0301;COMBINING ACUTE ACCENT

I don't know the implications of using Unicode combining characters. I
guess the choice depends on the purpose of the output.
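
To make the combining-character idea concrete, here is a rough sketch
of what I have in mind. The names my-tex-accent-combining-alist and
my-tex-accent-to-unicode are made up for this mail; they are not part
of Emacs or of Marcin's code. The code points for cedilla (0327) and
ogonek (0328) are not in the excerpt above; I took them from the same
Unicode block. The only library call is ucs-normalize-NFC-string from
ucs-normalize.el, which is used to get the precomposed form when one
is wanted.

```elisp
;; Sketch only: the two `my-' names below are hypothetical.
(require 'ucs-normalize)

(defvar my-tex-accent-combining-alist
  '((?`  . #x0300)   ; COMBINING GRAVE ACCENT
    (?'  . #x0301)   ; COMBINING ACUTE ACCENT
    (?^  . #x0302)   ; COMBINING CIRCUMFLEX ACCENT
    (?~  . #x0303)   ; COMBINING TILDE
    (?=  . #x0304)   ; COMBINING MACRON
    (?u  . #x0306)   ; COMBINING BREVE
    (?.  . #x0307)   ; COMBINING DOT ABOVE
    (?\" . #x0308)   ; COMBINING DIAERESIS
    (?H  . #x030B)   ; COMBINING DOUBLE ACUTE ACCENT
    (?v  . #x030C)   ; COMBINING CARON
    (?c  . #x0327)   ; COMBINING CEDILLA
    (?k  . #x0328))  ; COMBINING OGONEK
  "Map TeX accent characters to Unicode combining code points.")

(defun my-tex-accent-to-unicode (mark letter &optional compose)
  "Return LETTER followed by the combining character for MARK.
MARK and LETTER are characters.  If COMPOSE is non-nil, normalize
the result to NFC, which yields a precomposed character when
Unicode defines one."
  (let ((combining (cdr (assq mark my-tex-accent-combining-alist))))
    (unless combining
      (error "Unknown TeX accent: %c" mark))
    ;; Unicode order: base character first, combining mark second.
    (let ((decomposed (string letter combining)))
      (if compose
          (ucs-normalize-NFC-string decomposed)
        decomposed))))

;; \'a as a two-character sequence, U+0061 followed by U+0301:
(my-tex-accent-to-unicode ?' ?a)
;; \'a as the single precomposed character U+00E1:
(my-tex-accent-to-unicode ?' ?a t)
```

One thing I like about this is that no lookup by character name is
needed, and it also works for letter/accent pairs that have no
precomposed character: in that case NFC simply leaves the combining
sequence as it is.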