* "Unidecode" functionality in Emacs
@ 2018-03-19 22:04 John Mastro
2018-03-20 4:59 ` Teemu Likonen
2018-03-20 6:20 ` Eli Zaretskii
0 siblings, 2 replies; 6+ messages in thread
From: John Mastro @ 2018-03-19 22:04 UTC (permalink / raw)
To: Help Gnu Emacs mailing list
There are "Unidecode" packages for Perl[1], Python[2], and Emacs[3]
(derived from one another in that order). They each transliterate
Unicode text to ASCII, e.g.:
    (unidecode "Déjà vu")
    ;=> "Deja vu"

    (unidecode "北亰")
    ;=> "Bei Jing "
Does Emacs have equivalent functionality built-in?
[ The context for this is that I recently submitted a change to the
MELPA recipe, and Steve Purcell mentioned[4] that he would be
surprised if Emacs doesn't already have such functionality. ]
Thanks for any pointers
John
[1]: http://search.cpan.org/~sburke/Text-Unidecode-1.30/lib/Text/Unidecode.pm
[2]: https://pypi.python.org/pypi/Unidecode
[3]: https://github.com/sindikat/unidecode
[4]: https://github.com/melpa/melpa/pull/5351#issuecomment-373966218
* Re: "Unidecode" functionality in Emacs
From: Teemu Likonen @ 2018-03-20 4:59 UTC (permalink / raw)
To: John Mastro; +Cc: Help Gnu Emacs mailing list
John Mastro [2018-03-19 15:04:29-07] wrote:
> There are "Unidecode" packages for Perl[1], Python[2], and Emacs[3]
> (derived from one another in that order). They each transliterate
> Unicode text to ASCII, e.g.:
>
> (unidecode "Déjà vu")
> ;=> "Deja vu"
> (unidecode "北亰")
> ;=> "Bei Jing "
>
> Does Emacs have equivalent functionality built-in?
I don't know of any built-in functions, but the external "iconv" tool can
do something similar for Latin scripts. Here's an example Emacs Lisp
wrapper function for "iconv":
    (defun tl-ascii-translit (string)
      (with-temp-buffer
        (insert string)
        (call-process-region (point-min) (point-max)
                             "iconv" t t nil "-t" "ASCII//TRANSLIT")
        (buffer-substring-no-properties (point-min) (point-max))))
Works for Latin scripts:
(tl-ascii-translit "Déjà vu") ;=> "Deja vu"
(tl-ascii-translit "北亰") ;=> "??"
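
A slightly more defensive variant of the wrapper might look like the sketch below. It assumes an "iconv" binary that accepts "-f" and the "//TRANSLIT" suffix; the function name and the nil fallback are hypothetical, chosen only for illustration. The exact replacement characters depend on the iconv implementation, so no output is shown.

```elisp
(defun my-iconv-translit (string)
  "Return STRING converted to ASCII by iconv, or nil if iconv is missing.
Hypothetical helper: the name and the nil fallback are illustrative."
  (when (executable-find "iconv")
    (with-temp-buffer
      (insert string)
      ;; Send and read UTF-8 regardless of the locale's default coding system.
      (let ((coding-system-for-write 'utf-8)
            (coding-system-for-read 'utf-8))
        ;; DELETE = t replaces the buffer text with iconv's output.
        (call-process-region (point-min) (point-max)
                             "iconv" t t nil
                             "-f" "UTF-8" "-t" "ASCII//TRANSLIT"))
      (buffer-string))))
```

Naming the input encoding explicitly avoids relying on the locale that Emacs happens to be started in.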
--
/// Teemu Likonen - .-.. <https://keybase.io/tlikonen> //
// PGP: 4E10 55DC 84E9 DFF6 13D7 8557 719D 69D3 2453 9450 ///
* Re: "Unidecode" functionality in Emacs
From: Eli Zaretskii @ 2018-03-20 6:39 UTC (permalink / raw)
To: help-gnu-emacs
> From: Teemu Likonen <tlikonen@iki.fi>
> Date: Tue, 20 Mar 2018 06:59:34 +0200
> Cc: Help Gnu Emacs mailing list <help-gnu-emacs@gnu.org>
>
> I don't know of any built-in functions, but the external "iconv" tool can
> do something similar for Latin scripts. Here's an example Emacs Lisp
> wrapper function for "iconv":
>
>     (defun tl-ascii-translit (string)
>       (with-temp-buffer
>         (insert string)
>         (call-process-region (point-min) (point-max)
>                              "iconv" t t nil "-t" "ASCII//TRANSLIT")
>         (buffer-substring-no-properties (point-min) (point-max))))
>
> Works for Latin scripts:
>
> (tl-ascii-translit "Déjà vu") ;=> "Deja vu"
> (tl-ascii-translit "北亰") ;=> "??"
iconv's "TRANSLIT" is not the transliteration that's sought here.
It attempts to substitute similar-looking characters when the
original character is not in the target character set (ASCII in the
above snippet). So it's small wonder this only works for European
scripts, because no ASCII character can ever "look like" characters in
other scripts.
* Re: "Unidecode" functionality in Emacs
From: Eli Zaretskii @ 2018-03-20 6:20 UTC (permalink / raw)
To: help-gnu-emacs
> From: John Mastro <john.b.mastro@gmail.com>
> Date: Mon, 19 Mar 2018 15:04:29 -0700
>
> There are "Unidecode" packages for Perl[1], Python[2], and Emacs[3]
> (derived from one another in that order). They each transliterate
> Unicode text to ASCII, e.g.:
>
> (unidecode "Déjà vu")
> ;=> "Deja vu"
> (unidecode "北亰")
> ;=> "Bei Jing "
>
> Does Emacs have equivalent functionality built-in?
It's possible to remove accents (the first example) using the
functionality in ucs-normalize.el. Some transliteration is possible
for scripts for which there exists a "transliteration" input method,
using the code by Michael Welsh Duggan posted here:
http://lists.gnu.org/archive/html/emacs-devel/2018-02/msg00387.html
For example, you can transliterate Cyrillic text using the
cyrillic-translit input method that comes with Emacs. But there are
no general-purpose transliteration capabilities in Emacs, AFAIK.
However, it looks like the Perl package is just a huge database of
precomputed transliterations, in which case doing the same in Emacs
Lisp should be almost trivial.
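
As a rough sketch of the accent-removal idea, one could decompose the string with ucs-normalize.el and then drop the combining marks. The function name below is hypothetical, and treating the general categories Mn, Mc and Me as "accents" is an assumption of this sketch, not something stated in the thread.

```elisp
(require 'ucs-normalize)
(require 'seq)

(defun my-strip-accents (string)
  "Return STRING with combining marks removed after NFD decomposition.
Hypothetical helper, for illustration only."
  (let ((decomposed (ucs-normalize-NFD-string string)))
    ;; `seq-remove' on a string yields a list of characters;
    ;; `concat' turns that list back into a string.
    (concat
     (seq-remove (lambda (ch)
                   (memq (get-char-code-property ch 'general-category)
                         '(Mn Mc Me)))
                 decomposed))))

(my-strip-accents "Déjà vu")
;=> "Deja vu"
```

This only covers the first example: characters without a canonical decomposition, such as "北亰", pass through unchanged.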
* Re: "Unidecode" functionality in Emacs
From: John Mastro @ 2018-03-20 17:23 UTC (permalink / raw)
To: Help Gnu Emacs mailing list
Eli Zaretskii <eliz@gnu.org> wrote:
>> There are "Unidecode" packages for Perl[1], Python[2], and Emacs[3]
>> (derived from one another in that order). They each transliterate
>> Unicode text to ASCII, e.g.:
>>
>> (unidecode "Déjà vu")
>> ;=> "Deja vu"
>> (unidecode "北亰")
>> ;=> "Bei Jing "
>>
>> Does Emacs have equivalent functionality built-in?
>
> It's possible to remove accents (the first example) using the
> functionality in ucs-normalize.el. Some transliteration is possible
> for scripts for which there exists a "transliteration" input method,
> using the code by Michael Welsh Duggan posted here:
>
> http://lists.gnu.org/archive/html/emacs-devel/2018-02/msg00387.html
>
> For example, you can transliterate Cyrillic text using the
> cyrillic-translit input method that comes with Emacs. But there are
> no general-purpose transliteration capabilities in Emacs, AFAIK.
Thanks, I'll take a look at those.
> However, it looks like the Perl package is just a huge database of
> precomputed transliterations, in which case doing the same in Emacs
> Lisp should be almost trivial.
Yep, that's how the Emacs package works too. It boils down to 25 lines
of Lisp[1] plus the database[2].
Thanks
John
[1]: https://github.com/sindikat/unidecode/blob/5502ada9287b4012eabb879f12f5b0a9df52c5b7/unidecode.el#L56-L82
[2]: https://github.com/sindikat/unidecode/tree/5502ada9287b4012eabb879f12f5b0a9df52c5b7/data
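
To illustrate the "lookup code plus database" shape, here is a minimal sketch. It is not the code from the unidecode package: the four-entry table, the names, and the "?" fallback are all made up for this example, and a real database would cover far more of Unicode.

```elisp
(defvar my-translit-table
  (let ((table (make-hash-table :test 'eql)))
    ;; Tiny stand-in for a real transliteration database.
    (dolist (pair '((?é . "e") (?à . "a") (?北 . "Bei ") (?亰 . "Jing ")))
      (puthash (car pair) (cdr pair) table))
    table)
  "Hypothetical map from non-ASCII characters to ASCII replacements.")

(defun my-translit (string)
  "Return STRING with each character replaced via `my-translit-table'.
ASCII passes through; characters missing from the table become \"?\"."
  (mapconcat (lambda (ch)
               (cond ((< ch 128) (string ch))
                     ((gethash ch my-translit-table))
                     (t "?")))
             string ""))

(my-translit "Déjà vu")
;=> "Deja vu"
(my-translit "北亰")
;=> "Bei Jing "
```

All of the real work is in the table; the lookup itself stays a few lines long no matter how large the database grows.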
* Re: "Unidecode" functionality in Emacs
From: Stefan Monnier @ 2018-03-20 20:25 UTC (permalink / raw)
To: help-gnu-emacs
> However, it looks like the Perl package is just a huge database of
> precomputed transliterations, in which case doing the same in Emacs
> Lisp should be almost trivial.
That said, Emacs already comes with several of those databases in the
form of quail input methods and Unicode character metadata, so it should
be possible to get something a bit less trivial which doesn't require any
extra database.
Stefan
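
One way the Unicode character metadata could be used, sketched under assumptions: follow each character's `decomposition' property down to whatever ASCII it contains. The function names and the "?" fallback are hypothetical, and this covers only the metadata half of the suggestion, not quail input methods.

```elisp
(require 'seq)

(defun my-ascii-base (ch)
  "Return an ASCII string derived from CH's decomposition, or nil.
Hypothetical helper, for illustration only."
  (if (< ch 128)
      (string ch)
    ;; The property value may start with a tag symbol such as `compat';
    ;; keep only the characters.
    (let ((parts (seq-filter #'characterp
                             (get-char-code-property ch 'decomposition))))
      ;; A character without decomposition data maps to itself: give up.
      (unless (equal parts (list ch))
        (mapconcat (lambda (c) (or (my-ascii-base c) "")) parts "")))))

(defun my-decompose-to-ascii (string)
  "Return STRING with each character replaced by its ASCII base, or \"?\"."
  (mapconcat (lambda (ch) (or (my-ascii-base ch) "?")) string ""))

(my-decompose-to-ascii "Déjà vu")
;=> "Deja vu"
```

Because compatibility decompositions are followed as well, this should also expand characters such as ligatures, but it does nothing for scripts like Chinese, where an input-method table or an external database would still be needed.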