all messages for Emacs-related lists mirrored at yhetil.org
* "Unidecode" functionality in Emacs
@ 2018-03-19 22:04 John Mastro
  2018-03-20  4:59 ` Teemu Likonen
  2018-03-20  6:20 ` Eli Zaretskii
  0 siblings, 2 replies; 6+ messages in thread
From: John Mastro @ 2018-03-19 22:04 UTC
  To: Help Gnu Emacs mailing list

There are "Unidecode" packages for Perl[1], Python[2], and Emacs[3]
(derived from one another in that order). They each transliterate
Unicode text to ASCII, e.g.:

    (unidecode "Déjà vu")
    ;=> "Deja vu"
    (unidecode "北亰")
    ;=> "Bei Jing "

Does Emacs have equivalent functionality built-in?

[ The context for this is that I recently submitted a change to the
  MELPA recipe, and Steve Purcell mentioned[4] that he would be
  surprised if Emacs doesn't already have such functionality. ]

Thanks for any pointers

        John

[1]: http://search.cpan.org/~sburke/Text-Unidecode-1.30/lib/Text/Unidecode.pm
[2]: https://pypi.python.org/pypi/Unidecode
[3]: https://github.com/sindikat/unidecode
[4]: https://github.com/melpa/melpa/pull/5351#issuecomment-373966218




* Re: "Unidecode" functionality in Emacs
  2018-03-19 22:04 "Unidecode" functionality in Emacs John Mastro
@ 2018-03-20  4:59 ` Teemu Likonen
  2018-03-20  6:39   ` Eli Zaretskii
  2018-03-20  6:20 ` Eli Zaretskii
  1 sibling, 1 reply; 6+ messages in thread
From: Teemu Likonen @ 2018-03-20  4:59 UTC
  To: John Mastro; +Cc: Help Gnu Emacs mailing list

John Mastro [2018-03-19 15:04:29-07] wrote:

> There are "Unidecode" packages for Perl[1], Python[2], and Emacs[3]
> (derived from one another in that order). They each transliterate
> Unicode text to ASCII, e.g.:
>
>     (unidecode "Déjà vu")
>     ;=> "Deja vu"
>     (unidecode "北亰")
>     ;=> "Bei Jing "
>
> Does Emacs have equivalent functionality built-in?

I don't know of any built-in function, but the external "iconv" tool can
do something similar for Latin scripts. Here's an example Emacs Lisp
wrapper function for "iconv":

    (defun tl-ascii-translit (string)
      (with-temp-buffer
        (insert string)
        (call-process-region (point-min) (point-max)
                             "iconv" t t nil "-t" "ASCII//TRANSLIT")
        (buffer-substring-no-properties (point-min) (point-max))))

Works for Latin scripts:

    (tl-ascii-translit "Déjà vu") ;=> "Deja vu"
    (tl-ascii-translit "北亰") ;=> "??"

-- 
/// Teemu Likonen   - .-..   <https://keybase.io/tlikonen> //
// PGP: 4E10 55DC 84E9 DFF6 13D7 8557 719D 69D3 2453 9450 ///


* Re: "Unidecode" functionality in Emacs
  2018-03-19 22:04 "Unidecode" functionality in Emacs John Mastro
  2018-03-20  4:59 ` Teemu Likonen
@ 2018-03-20  6:20 ` Eli Zaretskii
  2018-03-20 17:23   ` John Mastro
  2018-03-20 20:25   ` Stefan Monnier
  1 sibling, 2 replies; 6+ messages in thread
From: Eli Zaretskii @ 2018-03-20  6:20 UTC
  To: help-gnu-emacs

> From: John Mastro <john.b.mastro@gmail.com>
> Date: Mon, 19 Mar 2018 15:04:29 -0700
> 
> There are "Unidecode" packages for Perl[1], Python[2], and Emacs[3]
> (derived from one another in that order). They each transliterate
> Unicode text to ASCII, e.g.:
> 
>     (unidecode "Déjà vu")
>     ;=> "Deja vu"
>     (unidecode "北亰")
>     ;=> "Bei Jing "
> 
> Does Emacs have equivalent functionality built-in?

It's possible to remove accents (the first example) using the
functionality in ucs-normalize.el.  Some transliteration is possible
for scripts for which there exists a "transliteration" input method,
using the code by Michael Welsh Duggan posted here:

  http://lists.gnu.org/archive/html/emacs-devel/2018-02/msg00387.html

For example, you can transliterate Cyrillic text using the
cyrillic-translit input method that comes with Emacs.  But there are
no general-purpose transliteration capabilities in Emacs, AFAIK.

However, it looks like the Perl package is just a huge database of
precomputed transliterations, in which case doing the same in Emacs
Lisp should be almost trivial.
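
A minimal sketch of the accent-removal case, assuming that decomposing
to NFD with ucs-normalize.el and then dropping combining marks is an
acceptable reading of "remove accents". The function name
`my-strip-accents' is hypothetical and not part of Emacs.

```elisp
(require 'seq)
(require 'ucs-normalize)

(defun my-strip-accents (string)
  "Return STRING with combining marks removed after NFD decomposition.
Hypothetical helper for illustration; not part of Emacs."
  (let ((decomposed (ucs-normalize-NFD-string string)))
    ;; NFD splits "é" into "e" plus U+0301.  Dropping every character
    ;; whose general category is a mark (Mn, Mc, Me) leaves the base
    ;; letters.
    (concat
     (seq-remove (lambda (ch)
                   (memq (get-char-code-property ch 'general-category)
                         '(Mn Mc Me)))
                 decomposed))))

(my-strip-accents "Déjà vu")
;=> "Deja vu"
```

This only handles characters that have a canonical decomposition;
text such as "北亰" passes through unchanged.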




* Re: "Unidecode" functionality in Emacs
  2018-03-20  4:59 ` Teemu Likonen
@ 2018-03-20  6:39   ` Eli Zaretskii
  0 siblings, 0 replies; 6+ messages in thread
From: Eli Zaretskii @ 2018-03-20  6:39 UTC
  To: help-gnu-emacs

> From: Teemu Likonen <tlikonen@iki.fi>
> Date: Tue, 20 Mar 2018 06:59:34 +0200
> Cc: Help Gnu Emacs mailing list <help-gnu-emacs@gnu.org>
> 
> I don't know of any built-in function, but the external "iconv" tool can
> do something similar for Latin scripts. Here's an example Emacs Lisp
> wrapper function for "iconv":
> 
>     (defun tl-ascii-translit (string)
>       (with-temp-buffer
>         (insert string)
>         (call-process-region (point-min) (point-max)
>                              "iconv" t t nil "-t" "ASCII//TRANSLIT")
>         (buffer-substring-no-properties (point-min) (point-max))))
> 
> Works for Latin scripts:
> 
>     (tl-ascii-translit "Déjà vu") ;=> "Deja vu"
>     (tl-ascii-translit "北亰") ;=> "??"

iconv's "TRANSLIT" is not the transliteration that's sought here.
It attempts to substitute similar-looking characters when the
original character is not in the target character set (ASCII in the
above snippet).  So it's small wonder this only works for European
scripts, because no ASCII character can ever "look like" characters in
other scripts.




* Re: "Unidecode" functionality in Emacs
  2018-03-20  6:20 ` Eli Zaretskii
@ 2018-03-20 17:23   ` John Mastro
  2018-03-20 20:25   ` Stefan Monnier
  1 sibling, 0 replies; 6+ messages in thread
From: John Mastro @ 2018-03-20 17:23 UTC
  To: Help Gnu Emacs mailing list

Eli Zaretskii <eliz@gnu.org> wrote:
>> There are "Unidecode" packages for Perl[1], Python[2], and Emacs[3]
>> (derived from one another in that order). They each transliterate
>> Unicode text to ASCII, e.g.:
>>
>>     (unidecode "Déjà vu")
>>     ;=> "Deja vu"
>>     (unidecode "北亰")
>>     ;=> "Bei Jing "
>>
>> Does Emacs have equivalent functionality built-in?
>
> It's possible to remove accents (the first example) using the
> functionality in ucs-normalize.el.  Some transliteration is possible
> for scripts for which there exists a "transliteration" input method,
> using the code by Michael Welsh Duggan posted here:
>
>   http://lists.gnu.org/archive/html/emacs-devel/2018-02/msg00387.html
>
> For example, you can transliterate Cyrillic text using the
> cyrillic-translit input method that comes with Emacs.  But there are
> no general-purpose transliteration capabilities in Emacs, AFAIK.

Thanks, I'll take a look at those.

> However, it looks like the Perl package is just a huge database of
> precomputed transliterations, in which case doing the same in Emacs
> Lisp should be almost trivial.

Yep, that's how the Emacs package works too. It boils down to 25 lines
of Lisp[1] plus the database[2].
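
The table-lookup idea can be sketched roughly as follows. This is not
the package's actual code or data layout: `my-translit-table' and
`my-translit' are hypothetical names, and the four entries are only
the ones implied by the examples in this thread.

```elisp
(defvar my-translit-table
  (let ((table (make-hash-table :test #'eql)))
    ;; Illustrative entries only; a real database covers far more
    ;; of Unicode.
    (puthash ?é "e" table)
    (puthash ?à "a" table)
    (puthash ?北 "Bei " table)
    (puthash ?亰 "Jing " table)
    table)
  "Hypothetical map from a non-ASCII character to its ASCII replacement.")

(defun my-translit (string)
  "Replace each non-ASCII character in STRING via `my-translit-table'.
Characters missing from the table are dropped (an assumption made
here for brevity)."
  (mapconcat (lambda (ch)
               (if (< ch 128)
                   (string ch)
                 (gethash ch my-translit-table "")))
             string ""))

(my-translit "北亰")
;=> "Bei Jing "
```

All of the work is in the table; the lookup function itself stays
a few lines long however large the database grows.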

Thanks

        John

[1]: https://github.com/sindikat/unidecode/blob/5502ada9287b4012eabb879f12f5b0a9df52c5b7/unidecode.el#L56-L82
[2]: https://github.com/sindikat/unidecode/tree/5502ada9287b4012eabb879f12f5b0a9df52c5b7/data




* Re: "Unidecode" functionality in Emacs
  2018-03-20  6:20 ` Eli Zaretskii
  2018-03-20 17:23   ` John Mastro
@ 2018-03-20 20:25   ` Stefan Monnier
  1 sibling, 0 replies; 6+ messages in thread
From: Stefan Monnier @ 2018-03-20 20:25 UTC
  To: help-gnu-emacs

> However, it looks like the Perl package is just a huge database of
> precomputed transliterations, in which case doing the same in Emacs
> Lisp should be almost trivial.

This said, Emacs comes with several of those databases already in the
form of quail input methods and unicode char metadata, so it should be
possible to get something a bit less trivial which doesn't require any
extra database.
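
As one possible illustration of the "unicode char metadata" part, the
sketch below follows each character's `decomposition' property back to
its first component. The names `my-base-char' and `my-fold-to-base' are
hypothetical, and keeping only the first component is a simplifying
assumption (it discards the rest of multi-character compatibility
decompositions such as ligatures).

```elisp
(require 'seq)

(defun my-base-char (ch)
  "Follow CH's decomposition to its first component until it stops changing.
Hypothetical helper for illustration; not part of Emacs."
  ;; The property value is a list of characters, optionally preceded
  ;; by a tag symbol such as `compat', so keep only the integers.
  (let ((base (car (seq-filter #'integerp
                               (get-char-code-property ch 'decomposition)))))
    (if (or (null base) (= base ch))
        ch
      (my-base-char base))))

(defun my-fold-to-base (string)
  "Map every character of STRING through `my-base-char'."
  (concat (mapcar #'my-base-char string)))

(my-fold-to-base "Déjà vu")
;=> "Deja vu"
```

Characters with no decomposition, including "北亰", come back
unchanged, so this covers only the accent-folding half of the problem
without an extra database.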


        Stefan




