unofficial mirror of help-gnu-emacs@gnu.org
From: "Héctor Lahoz" <hectorlahoz@gmail.com>
To: help-gnu-emacs@gnu.org
Subject: Re: How to translate LaTeX into UTF-8 in Elisp?
Date: Tue, 4 Jul 2017 12:23:48 +0200
Message-ID: <20170704102348.GA3579@workstation>
In-Reply-To: <87tw2t1kyf.fsf@jane>

Marcin Borkowski wrote:
> OK, so here is a proof of concept:
> 
> --8<---------------cut here---------------start------------->8---
> (defvar TeX-to-Unicode-accents-alist
>   '((?` . "with grave")
>     (?' . "with acute")
>     (?^ . "with circumflex")
>     (?\" . "with diaeresis")
>     (?H . "with double acute")
>     (?~ . "with tilde")
>     (?c . "with cedilla")
>     (?k . "with ogonek")
>     (?= . "with macron")
>     (?. . "with dot above")
>     (?u . "with breve")
>     (?v . "with caron"))
>   "A mapping from TeX control characters to accent names used in
> Unicode.")
> 
> (defun combine-letter-diacritical-mark (letter mark)
>   "Return a Unicode string of LETTER combined with MARK.
> MARK can be any character that can be used in TeX accenting
> commands."
>   (let* ((letter (if (stringp letter)
>                      (string-to-char letter)
>                    letter))
>          (uppercase (= letter
>                        (upcase letter))))
>     (cdr (assoc-string
>           (format "LATIN %s LETTER %c %s"
>                   (if uppercase "CAPITAL" "SMALL")
>                   letter
>                   (cdr (assoc mark TeX-to-Unicode-accents-alist)))
>           ucs-names
>           t))))
> --8<---------------cut here---------------end--------------->8---
> 

Great.

Perhaps you could consider translating to Unicode combining characters
instead. I think that is closer to the original TeX idea and could be
cleaner:

0300;COMBINING GRAVE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING GRAVE;;;;
0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;;;;
0302;COMBINING CIRCUMFLEX ACCENT;Mn;230;NSM;;;;;N;NON-SPACING CIRCUMFLEX;;;;
0303;COMBINING TILDE;Mn;230;NSM;;;;;N;NON-SPACING TILDE;;;;
0304;COMBINING MACRON;Mn;230;NSM;;;;;N;NON-SPACING MACRON;;;;
0305;COMBINING OVERLINE;Mn;230;NSM;;;;;N;NON-SPACING OVERSCORE;;;;
0306;COMBINING BREVE;Mn;230;NSM;;;;;N;NON-SPACING BREVE;;;;
0307;COMBINING DOT ABOVE;Mn;230;NSM;;;;;N;NON-SPACING DOT ABOVE;;;;
0308;COMBINING DIAERESIS;Mn;230;NSM;;;;;N;NON-SPACING DIAERESIS;;;;
0309;COMBINING HOOK ABOVE;Mn;230;NSM;;;;;N;NON-SPACING HOOK ABOVE;;;;
030A;COMBINING RING ABOVE;Mn;230;NSM;;;;;N;NON-SPACING RING ABOVE;;;;
030B;COMBINING DOUBLE ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING DOUBLE ACUTE;;;;
030C;COMBINING CARON;Mn;230;NSM;;;;;N;NON-SPACING HACEK;;;;
030D;COMBINING VERTICAL LINE ABOVE;Mn;230;NSM;;;;;N;NON-SPACING VERTICAL LINE ABOVE;;;;
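
A minimal sketch of that idea, using the code points listed above. The
names my-tex-combining-accents-alist and my-tex-accent-to-combining are
hypothetical and not part of the quoted code. Cedilla (\c) and ogonek (\k)
are left out here because their combining characters fall outside the
range shown above:

```elisp
;; Hypothetical names; a sketch only, not part of the code quoted above.
(defvar my-tex-combining-accents-alist
  '((?`  . #x0300)   ; COMBINING GRAVE ACCENT
    (?'  . #x0301)   ; COMBINING ACUTE ACCENT
    (?^  . #x0302)   ; COMBINING CIRCUMFLEX ACCENT
    (?~  . #x0303)   ; COMBINING TILDE
    (?=  . #x0304)   ; COMBINING MACRON
    (?u  . #x0306)   ; COMBINING BREVE
    (?.  . #x0307)   ; COMBINING DOT ABOVE
    (?\" . #x0308)   ; COMBINING DIAERESIS
    (?H  . #x030B)   ; COMBINING DOUBLE ACUTE ACCENT
    (?v  . #x030C))  ; COMBINING CARON
  "A mapping from TeX accent characters to Unicode combining characters.")

(defun my-tex-accent-to-combining (letter mark)
  "Return a string of LETTER followed by the combining character for MARK.
LETTER is a character or a one-character string.  MARK is a TeX
accent character.  Return nil if MARK is not in the alist."
  (let ((base (if (stringp letter)
                  (string-to-char letter)
                letter))
        (combining (cdr (assq mark my-tex-combining-accents-alist))))
    (when combining
      ;; Unicode order: base character first, then the combining mark.
      (string base combining))))
```

With this, (my-tex-accent-to-combining ?a ?') should give a two-character
string: "a" followed by U+0301. No lookup by character name is needed, and
the letter's case does not have to be handled separately.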

See the Wikipedia article on Unicode equivalence:
https://en.wikipedia.org/wiki/Unicode_equivalence

The difference is that Unicode reverses the order: the base character
comes first, followed by all its combining characters. For example, \'a
would be translated to either

00E1;LATIN SMALL LETTER A WITH ACUTE

or

0061;LATIN SMALL LETTER A
0301;COMBINING ACUTE ACCENT

I don't know the implications of using Unicode combining characters.
I guess the choice depends on the purpose of the output.
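
For what it's worth, the two forms are canonically equivalent, and Emacs
can convert between them, so the choice need not be final. A minimal
sketch, assuming the bundled ucs-normalize library is available; the
variable names are hypothetical:

```elisp
(require 'ucs-normalize)

;; Hypothetical variable names, for illustration only.
(defvar my-decomposed (string ?a #x0301))   ; "a" + COMBINING ACUTE ACCENT
(defvar my-precomposed (string #x00E1))     ; LATIN SMALL LETTER A WITH ACUTE

;; Compared as raw strings, the two forms differ.
(equal my-decomposed my-precomposed)                             ; => nil

;; NFC composes base + combining mark into the precomposed character.
(equal (ucs-normalize-NFC-string my-decomposed) my-precomposed)  ; => t

;; NFD goes the other way and decomposes it again.
(equal (ucs-normalize-NFD-string my-precomposed) my-decomposed)  ; => t
```

So one option would be to always emit base plus combining characters and
normalize to NFC afterwards if precomposed characters are wanted.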



