all messages for Emacs-related lists mirrored at yhetil.org
* Is there a way to "asciify" a string?
@ 2018-05-27  6:22 Marcin Borkowski
  2018-05-27  7:36 ` tomas
                   ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Marcin Borkowski @ 2018-05-27  6:22 UTC (permalink / raw
  To: Help Gnu Emacs mailing list

Hi all,

I want to convert e.g. "żółć" to "zolc", or "Poincaré" to "Poincare"
etc.  IOW, I want to replace all these funny Unicode accented characters
with their ASCII equivalents.

Is there anything for that in Emacs?

TIA,

-- 
Marcin Borkowski
http://mbork.pl



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Is there a way to "asciify" a string?
  2018-05-27  6:22 Is there a way to "asciify" a string? Marcin Borkowski
@ 2018-05-27  7:36 ` tomas
  2018-05-27 12:36   ` Marcin Borkowski
  2018-05-27 14:55 ` Eric Abrahamsen
  2018-05-27 16:00 ` Eli Zaretskii
  2 siblings, 1 reply; 43+ messages in thread
From: tomas @ 2018-05-27  7:36 UTC (permalink / raw
  To: help-gnu-emacs

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Sun, May 27, 2018 at 08:22:20AM +0200, Marcin Borkowski wrote:
> Hi all,
> 
> I want to convert e.g. "żółć" to "zolc", or "Poincaré" to "Poincare"
> etc.  IOW, I want to replace all these funny Unicode accented characters
> with their ASCII equivalents.
> 
> Is there anything for that in Emacs?

I haven't an answer to your direct question, just a warning: without a
language context, you can't do it "correctly". For one illustrative
example, in German "ü" -> "ue", but in Spanish "ü" -> "u" (those diaereses
do have different functions in those languages). Transliterating "ü" with
just "u" in German would be wrong (but the reader might make some sense
of it), transliterating "ü" with "ue" in Spanish would not only be wrong,
but would almost certainly throw off the reader's auto-correction feature
(unless (s)he knows German and can recall that association).

I'm sure there are tons of other examples like that.

Heck, even up- and downcasing is strictly language context dependent
(witness the Turkish dotless I).

Sigh :-)

Cheers
- -- t
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAlsKYA0ACgkQBcgs9XrR2kaEowCfWl/QI+IFq4bt1J9uvTs9HGhW
FOcAn0LaWKiQ7xK7/85R+CaDr9hl0lfP
=eVFI
-----END PGP SIGNATURE-----




* Re: Is there a way to "asciify" a string?
  2018-05-27  7:36 ` tomas
@ 2018-05-27 12:36   ` Marcin Borkowski
  2018-05-27 12:52     ` Teemu Likonen
                       ` (4 more replies)
  0 siblings, 5 replies; 43+ messages in thread
From: Marcin Borkowski @ 2018-05-27 12:36 UTC (permalink / raw
  To: tomas; +Cc: help-gnu-emacs


On 2018-05-27, at 09:36, tomas@tuxteam.de wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On Sun, May 27, 2018 at 08:22:20AM +0200, Marcin Borkowski wrote:
>> Hi all,
>>
>> I want to convert e.g. "żółć" to "zolc", or "Poincaré" to "Poincare"
>> etc.  IOW, I want to replace all these funny Unicode accented characters
>> with their ASCII equivalents.
>>
>> Is there anything for that in Emacs?
>
> I haven't an answer to your direct question, just a warning: without a
> language context, you can't do it "correctly". For one illustrative
> example, in German "ü" -> "ue", but in Spanish "ü" -> "u" (those diaereses
> do have different functions in those languages). Transliterating "ü" with
> just "u" in German would be wrong (but the reader might make some sense
> of it), transliterating "ü" with "ue" in Spanish would not only be wrong,
> but would almost certainly throw off the reader's auto-correction feature
> (unless (s)he knows German and can recall that association).
>
> I'm sure there are tons of other examples like that.
>
> Heck, even up- and downcasing is strictly language context dependent
> (witness the Turkish dotless I).
>
> Sigh :-)

I understand that.

Still, I need something *simple*.  I have a person's name (possibly with
some national characters), and I want to derive a filename from it.  It
doesn't have to be correct in 100% of cases.  It doesn't even have to be
unambiguous (there will be a number for that in the filename, too).

At worst, I might just reimplement `tr' in Emacs and use it to convert
Polish letters to their Latin equivalents (which will cover 99.9% of
cases), but I thought that with the (in)famous char folding etc. Emacs
can handle this out of the box.

(BTW, if there is some command-line utility to do that, that's fine
too.)

Best,

--
Marcin Borkowski
http://mbork.pl




* Re: Is there a way to "asciify" a string?
  2018-05-27 12:36   ` Marcin Borkowski
@ 2018-05-27 12:52     ` Teemu Likonen
  2018-05-27 16:07       ` Eli Zaretskii
  2018-05-27 13:04     ` Yuri Khan
                       ` (3 subsequent siblings)
  4 siblings, 1 reply; 43+ messages in thread
From: Teemu Likonen @ 2018-05-27 12:52 UTC (permalink / raw
  To: Marcin Borkowski; +Cc: help-gnu-emacs


Marcin Borkowski [2018-05-27 14:36:27+02] wrote:

> Still, I need something *simple*. I have a person's name (possibly
> with some national characters), and I want to derive a filename from
> it. It doesn't have to be correct in 100% cases. It doesn't even have
> to be unambiguous (there will be a number for that in the filename,
> too).

> (BTW, if there is some command-line utility to do that, that's fine
> too.)

There is iconv:

    $ echo áàãä | iconv -t ASCII//TRANSLIT
    aaaa

So an Emacs Lisp wrapper function for iconv can be written like this:

    (defun my-iconv-ascii-translit (string)
      (with-temp-buffer
        (insert string)
        (call-process-region (point-min) (point-max)
                             "iconv" t t nil "-t" "ASCII//TRANSLIT")
        (buffer-substring-no-properties (point-min) (point-max))))

-- 
/// Teemu Likonen   - .-..   <https://keybase.io/tlikonen> //
// PGP: 4E10 55DC 84E9 DFF6 13D7 8557 719D 69D3 2453 9450 ///


* Re: Is there a way to "asciify" a string?
  2018-05-27 12:36   ` Marcin Borkowski
  2018-05-27 12:52     ` Teemu Likonen
@ 2018-05-27 13:04     ` Yuri Khan
  2018-05-30 10:14       ` Marcin Borkowski
  2018-05-31  2:03       ` John Mastro
  2018-05-27 19:53     ` tomas
                       ` (2 subsequent siblings)
  4 siblings, 2 replies; 43+ messages in thread
From: Yuri Khan @ 2018-05-27 13:04 UTC (permalink / raw
  To: Marcin Borkowski; +Cc: help-gnu-emacs

On Sun, May 27, 2018 at 7:38 PM Marcin Borkowski <mbork@mbork.pl> wrote:

> I have a person's name (possibly with
> some national characters), and I want to derive a filename from it.  It
> doesn't have to be correct in 100% cases.  It doesn't even have to be
> unambiguous (there will be a number for that in the filename, too).

Technically you could use the name of a person as is, as long as it is
representable in Unicode and contains neither the null character nor the
slash character. But I assume you want a filename that is portable between
file systems, or a filename that can be represented in an URI path segment
without %-encoding, or any combination of the above.

In that case, the Python unidecode library is probably the closest that you
can find. But make very sure that the people involved never see their own
name’s transliteration.




* Re: Is there a way to "asciify" a string?
  2018-05-27  6:22 Is there a way to "asciify" a string? Marcin Borkowski
  2018-05-27  7:36 ` tomas
@ 2018-05-27 14:55 ` Eric Abrahamsen
  2018-05-27 16:00 ` Eli Zaretskii
  2 siblings, 0 replies; 43+ messages in thread
From: Eric Abrahamsen @ 2018-05-27 14:55 UTC (permalink / raw
  To: help-gnu-emacs

Marcin Borkowski <mbork@mbork.pl> writes:

> Hi all,
>
> I want to convert e.g. "żółć" to "zolc", or "Poincaré" to "Poincare"
> etc.  IOW, I want to replace all these funny Unicode accented characters
> with their ASCII equivalents.
>
> Is there anything for that in Emacs?

Check out char-fold.el. I think its main purpose is to go the other way:
to take an ASCII string and return a regexp matching all its likely
Unicode lookalikes, but you might be able to use `char-fold-table' for
the reverse mapping.

Eric





* Re: Is there a way to "asciify" a string?
  2018-05-27  6:22 Is there a way to "asciify" a string? Marcin Borkowski
  2018-05-27  7:36 ` tomas
  2018-05-27 14:55 ` Eric Abrahamsen
@ 2018-05-27 16:00 ` Eli Zaretskii
  2 siblings, 0 replies; 43+ messages in thread
From: Eli Zaretskii @ 2018-05-27 16:00 UTC (permalink / raw
  To: help-gnu-emacs

> From: Marcin Borkowski <mbork@mbork.pl>
> Date: Sun, 27 May 2018 08:22:20 +0200
> 
> I want to convert e.g. "żółć" to "zolc", or "Poincaré" to "Poincare"
> etc.  IOW, I want to replace all these funny Unicode accented characters
> with their ASCII equivalents.
> 
> Is there anything for that in Emacs?

Yes, use ucs-normalize.el functionality to decompose accented
characters, and then remove all the characters that aren't ASCII.
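
A minimal sketch of that two-step approach, assuming the
`ucs-normalize-NFD-string' function from ucs-normalize.el; the helper
name `my-strip-diacritics' is made up for illustration:

```emacs-lisp
(require 'ucs-normalize)

(defun my-strip-diacritics (string)
  "Return STRING with accented letters reduced to their ASCII base letters.
Hypothetical helper: decompose to NFD, then drop every non-ASCII character."
  (replace-regexp-in-string "[^[:ascii:]]" ""
                            (ucs-normalize-NFD-string string)))

;; NFD splits a precomposed letter into a base letter plus a combining
;; mark; the regexp then deletes the combining mark.
(my-strip-diacritics "Poincaré")   ; => "Poincare"
```

Note that a letter with no decomposition is not reduced to a base
letter; the filter simply drops it.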




* Re: Is there a way to "asciify" a string?
  2018-05-27 12:52     ` Teemu Likonen
@ 2018-05-27 16:07       ` Eli Zaretskii
  2018-05-27 16:59         ` Teemu Likonen
  2018-05-27 20:00         ` tomas
  0 siblings, 2 replies; 43+ messages in thread
From: Eli Zaretskii @ 2018-05-27 16:07 UTC (permalink / raw
  To: help-gnu-emacs

> From: Teemu Likonen <tlikonen@iki.fi>
> Date: Sun, 27 May 2018 15:52:58 +0300
> Cc: help-gnu-emacs <help-gnu-emacs@gnu.org>
> 
> > (BTW, if there is some command-line utility to do that, that's fine
> > too.)
> 
> There is iconv:
> 
>     $ echo áàãä | iconv -t ASCII//TRANSLIT
>     aaaa
> 
> So an Emacs Lisp wrapper function for iconv can be written like this:
> 
>     (defun my-iconv-ascii-translit (string)
>       (with-temp-buffer
>         (insert string)
>         (call-process-region (point-min) (point-max)
>                              "iconv" t t nil "-t" "ASCII//TRANSLIT")
>         (buffer-substring-no-properties (point-min) (point-max))))

Come on, crowd, you _must_ know that Emacs has all of iconv's
functionality (and more) built-in, right?  All that encode-coding
stuff etc.?

Btw, ASCII//TRANSLIT doesn't necessarily guarantee it will succeed in
removing all the diacritics, AFAIK.  You need to completely decompose
the characters for that, and AFAIK iconv doesn't have such
capabilities.  (Emacs does.)




* Re: Is there a way to "asciify" a string?
  2018-05-27 16:07       ` Eli Zaretskii
@ 2018-05-27 16:59         ` Teemu Likonen
  2018-05-28  5:24           ` Tak Kunihiro
  2018-05-30 10:12           ` Marcin Borkowski
  2018-05-27 20:00         ` tomas
  1 sibling, 2 replies; 43+ messages in thread
From: Teemu Likonen @ 2018-05-27 16:59 UTC (permalink / raw
  To: Eli Zaretskii; +Cc: help-gnu-emacs


Eli Zaretskii [2018-05-27 19:07:40+03] wrote:

> Come on, crowd, you _must_ know that Emacs has all of the iconv's
> functionality (and more) built-in, right? All that encode-coding stuff
> etc.?

Emacs has loads of stuff that I don't know or don't have time to study.
:-) But thanks for your ucs-normalize hint. An ASCII normalization and
filter could be something like this:


    (defun my-ascii-normalize-filter (string)
      (require 'cl-lib)
      (cl-remove-if (lambda (char)
                      (> char 127))
                    (ucs-normalize-NFKD-string string)))

Maybe one could want to filter out control chars too...

-- 
/// Teemu Likonen   - .-..   <https://keybase.io/tlikonen> //
// PGP: 4E10 55DC 84E9 DFF6 13D7 8557 719D 69D3 2453 9450 ///


* Re: Is there a way to "asciify" a string?
  2018-05-27 12:36   ` Marcin Borkowski
  2018-05-27 12:52     ` Teemu Likonen
  2018-05-27 13:04     ` Yuri Khan
@ 2018-05-27 19:53     ` tomas
  2018-05-28  8:15     ` Philipp Stephani
  2018-05-31 14:23     ` Stefan Monnier
  4 siblings, 0 replies; 43+ messages in thread
From: tomas @ 2018-05-27 19:53 UTC (permalink / raw
  To: Marcin Borkowski; +Cc: help-gnu-emacs

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Sun, May 27, 2018 at 02:36:27PM +0200, Marcin Borkowski wrote:
> 
> On 2018-05-27, at 09:36, tomas@tuxteam.de wrote:

[...]

> > Sigh :-)
> 
> I understand that.

[...]
> (BTW, if there is some command-line utility to do that, that's fine
> too.)

I don't know (yet?) about Emacs, but iconv might do the trick:

  tomas@trotzki:~$ echo Überbäcker | iconv -f utf-8 -t ascii//TRANSLIT
  Uberbacker

Cheers
- -- t
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAlsLDNIACgkQBcgs9XrR2kbzEACfXlbbQCiriqW7ztASEShj5eOc
6woAn09HdxmHgxTU2IK1AlsFS+hCxyBX
=0Mrb
-----END PGP SIGNATURE-----




* Re: Is there a way to "asciify" a string?
  2018-05-27 16:07       ` Eli Zaretskii
  2018-05-27 16:59         ` Teemu Likonen
@ 2018-05-27 20:00         ` tomas
  2018-05-28 18:27           ` Eli Zaretskii
  1 sibling, 1 reply; 43+ messages in thread
From: tomas @ 2018-05-27 20:00 UTC (permalink / raw
  To: help-gnu-emacs

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Sun, May 27, 2018 at 07:07:40PM +0300, Eli Zaretskii wrote:

[...]

> Come on, crowd, you _must_ know that Emacs has all of the iconv's
> functionality (and more) built-in, right?  All that encode-coding
> stuff etc.?

Heh. At least we have the hunch, yes. After all, Emacs has nearly
everything ;-)

I searched the manual (and apropos) for translit and turned up nearly
empty (OK, OK, apropos yields standard-display-cyrillic-translit, but
that doesn't look promising).

How to go about that search?

Cheers
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAlsLDmYACgkQBcgs9XrR2kaf5ACeK89wG6CkEPoLGVjntVn3na4t
Up8AnR0RLCpS4aZs1Q9pQzbZGvOc3E1/
=uqct
-----END PGP SIGNATURE-----




* Re: Is there a way to "asciify" a string?
  2018-05-27 16:59         ` Teemu Likonen
@ 2018-05-28  5:24           ` Tak Kunihiro
  2018-05-30 10:12           ` Marcin Borkowski
  1 sibling, 0 replies; 43+ messages in thread
From: Tak Kunihiro @ 2018-05-28  5:24 UTC (permalink / raw
  To: Teemu Likonen; +Cc: tkk, help-gnu-emacs

Teemu Likonen <tlikonen@iki.fi> writes:

>     (defun my-ascii-normalize-filter (string)
>       (require 'cl-lib)
>       (cl-remove-if (lambda (char)
>                       (> char 127))
>                     (ucs-normalize-NFKD-string string)))

It is cool. I wanted to asciify the name too.  
I put the code into my M-l as shown below.

#+BEGIN_SRC emacs-lisp
(defun downcase-or-normalize-word (arg)
  "Convert to lower case from point to end of word, moving over.
With ARG, normalize a word on point, moving over."
  (interactive "p")
  (require 'cl-lib)
  (if (equal arg 4)
      (let ((point0 (point)) string)
        (forward-word 1)
        (setq string (buffer-substring point0 (point)))
        (delete-region point0 (point))
        (insert
         (cl-remove-if (lambda (char)
                         (> char 127))
                       (ucs-normalize-NFKD-string string))))
    (downcase-word arg)))

(global-set-key [remap downcase-word] 'downcase-or-normalize-word)
#+END_SRC




* Re: Is there a way to "asciify" a string?
  2018-05-27 12:36   ` Marcin Borkowski
                       ` (2 preceding siblings ...)
  2018-05-27 19:53     ` tomas
@ 2018-05-28  8:15     ` Philipp Stephani
  2018-05-28 10:28       ` Marcin Borkowski
  2018-05-31 14:23     ` Stefan Monnier
  4 siblings, 1 reply; 43+ messages in thread
From: Philipp Stephani @ 2018-05-28  8:15 UTC (permalink / raw
  To: Marcin Borkowski; +Cc: help-gnu-emacs

Marcin Borkowski <mbork@mbork.pl> schrieb am So., 27. Mai 2018 um 14:38 Uhr:

>
> On 2018-05-27, at 09:36, tomas@tuxteam.de wrote:
>
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> >
> > On Sun, May 27, 2018 at 08:22:20AM +0200, Marcin Borkowski wrote:
> >> Hi all,
> >>
> >> I want to convert e.g. "żółć" to "zolc", or "Poincaré" to "Poincare"
> >> etc.  IOW, I want to replace all these funny Unicode accented characters
> >> with their ASCII equivalents.
> >>
> >> Is there anything for that in Emacs?
> >
> > I haven't an answer to your direct question, just a warning: without a
> > language context, you can't do it "correctly". For one illustrative
> > example, in German "ü" -> "ue", but in Spanish "ü" -> "u" (those
> diaereses
> > do have different functions in those languages). Transliterating "ü" with
> > just "u" in German would be wrong (but the reader might make some sense
> > of it), transliterating "ü" with "ue" in Spanish would not only be wrong,
> > but would almost certainly throw off the reader's auto-correction feature
> > (unless (s)he knows German and can recall that association).
> >
> > I'm sure there are tons of other examples like that.
> >
> > Heck, even up- and downcasing is strictly language context dependent
> > (witness the Turkish dotless I).
> >
> > Sigh :-)
>
> I understand that.
>
> Still, I need something *simple*.  I have a person's name (possibly with
> some national characters), and I want to derive a filename from it.  It
> doesn't have to be correct in 100% cases.  It doesn't even have to be
> unambiguous (there will be a number for that in the filename, too).


Then why not use only the number, if that's enough to make the filename
unique?
Either the filename is an internal implementation detail, then it doesn't
have to be human-readable.
Or you could ignore portability concerns and use the username as is.



* Re: Is there a way to "asciify" a string?
  2018-05-28  8:15     ` Philipp Stephani
@ 2018-05-28 10:28       ` Marcin Borkowski
  2018-05-28 10:39         ` tomas
  0 siblings, 1 reply; 43+ messages in thread
From: Marcin Borkowski @ 2018-05-28 10:28 UTC (permalink / raw
  To: Philipp Stephani; +Cc: help-gnu-emacs


On 2018-05-28, at 10:15, Philipp Stephani <p.stephani2@gmail.com> wrote:

> Marcin Borkowski <mbork@mbork.pl> schrieb am So., 27. Mai 2018 um 14:38 Uhr:
>
>> I understand that.
>>
>> Still, I need something *simple*.  I have a person's name (possibly with
>> some national characters), and I want to derive a filename from it.  It
>> doesn't have to be correct in 100% cases.  It doesn't even have to be
>> unambiguous (there will be a number for that in the filename, too).
>
> Then why not use only the number, if that's enough to make the filename
> unique?
> Either the filename is an internal implementation detail, then it doesn't
> have to be human-readable.
> Or you could ignore portability concerns and use the username as is.

That's an interesting idea.  However, I disagree.  My idea is to have
a number (because that helps to keep some order, and there may be - and
sometimes are - more than one item relating to the same person), but
a name is _very_ helpful for mnemonic reasons.

Best,

-- 
Marcin Borkowski
http://mbork.pl




* Re: Is there a way to "asciify" a string?
  2018-05-28 10:28       ` Marcin Borkowski
@ 2018-05-28 10:39         ` tomas
  2018-05-28 15:30           ` Yuri Khan
  2018-05-30 10:12           ` Marcin Borkowski
  0 siblings, 2 replies; 43+ messages in thread
From: tomas @ 2018-05-28 10:39 UTC (permalink / raw
  To: help-gnu-emacs

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Mon, May 28, 2018 at 12:28:05PM +0200, Marcin Borkowski wrote:
> 
> On 2018-05-28, at 10:15, Philipp Stephani <p.stephani2@gmail.com> wrote:
> 
> > Marcin Borkowski <mbork@mbork.pl> schrieb am So., 27. Mai 2018 um 14:38 Uhr:
> >
> >> I understand that.
> >>
> >> Still, I need something *simple*.  I have a person's name (possibly with
> >> some national characters), and I want to derive a filename from it [...]

> That's an interesting idea.  However, I disagree.  My idea is to have
> a number (because that helps to keep some order, and there may be - and
> sometimes are - more than one item relating to the same person), but
> a name is _very_ helpful for mnemonic reasons.

One often-used method is to just URL-encode [1] the thing. On the plus
side, it's always ASCII and you have a "lossless" mapping back and forth,
on the down side you lose lexicographic order and (some) readability
(although we trained rats have already learnt to cope with %3D and %2F)

Emacs supports that with the pair of functions `url-hexify-string' and
`url-unhex-string' (see, Eli? Sometimes even /me finds something .-)
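
A small round-trip sketch of that idea; the two wrapper names are
hypothetical, and decoding the result as UTF-8 is an assumption about
how the name was encoded in the first place:

```emacs-lisp
(require 'url-util)

(defun my-name-to-ascii-token (name)
  "Percent-encode NAME so that the result contains only ASCII characters."
  (url-hexify-string name))

(defun my-ascii-token-to-name (token)
  "Invert `my-name-to-ascii-token'.
`url-unhex-string' returns raw bytes, so decode them as UTF-8."
  (decode-coding-string (url-unhex-string token) 'utf-8))

(my-name-to-ascii-token "żółć")   ; => "%C5%BC%C3%B3%C5%82%C4%87"
(my-ascii-token-to-name (my-name-to-ascii-token "żółć"))   ; => "żółć"
```

Every non-ASCII character expands to several %XX triplets, which is
where the loss of readability comes from.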

Cheers
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAlsL3EkACgkQBcgs9XrR2kb2yACfRq604m3nZ+s795yc8amNLG7I
QsgAn16wDP1ywyjRoBsniO40qBftwxgv
=8nAO
-----END PGP SIGNATURE-----




* Re: Is there a way to "asciify" a string?
  2018-05-28 10:39         ` tomas
@ 2018-05-28 15:30           ` Yuri Khan
  2018-05-28 16:02             ` tomas
  2018-05-30 10:12           ` Marcin Borkowski
  1 sibling, 1 reply; 43+ messages in thread
From: Yuri Khan @ 2018-05-28 15:30 UTC (permalink / raw
  To: tomas; +Cc: help-gnu-emacs

On Mon, May 28, 2018 at 5:39 PM <tomas@tuxteam.de> wrote:

> One often-used method is to just URL-encode [1] the thing. On the plus
> side, it's always ASCII and you have a "lossless" mapping back and forth,
> on the down side you lose lexicographic order and (some) readability
> (although we trained rats have already learnt to cope with %3D and %2F)

%3D ‘=’ and %2F ‘/’ are Trained Rat Level 1 material. Under your
suggestion, my name is %D0%AE%D1%80%D0%B8%D0%B9+%D0%A5%D0%B0%D0%BD and it
takes a certain effort to decode that in my head. I seriously doubt Tak
Kunihiro’s urlencoded name could be read at all without decoding.

(On the other hand, I regularly get offended at banks trying to
automatically transliterate my name when issuing a card for me.)




* Re: Is there a way to "asciify" a string?
  2018-05-28 15:30           ` Yuri Khan
@ 2018-05-28 16:02             ` tomas
  0 siblings, 0 replies; 43+ messages in thread
From: tomas @ 2018-05-28 16:02 UTC (permalink / raw
  To: help-gnu-emacs

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Mon, May 28, 2018 at 10:30:38PM +0700, Yuri Khan wrote:
> On Mon, May 28, 2018 at 5:39 PM <tomas@tuxteam.de> wrote:
> 
> > One often-used method is to just URL-encode [1] the thing [...]

[...]

> %3D ‘=’ and %2F ‘/’ are Trained Rat Level 1 material. Under your
> suggestion, my name is %D0%AE%D1%80%D0%B8%D0%B9+%D0%A5%D0%B0%D0%BD and it
> takes a certain effort to decode that in my head. I seriously doubt Tak
> Kunihiro’s urlencoded name could be read at all without decoding.

This is all Enlightened Beaver level. I'm far below that...

> (On the other hand, I regularly get offended at banks trying to
> automatically transliterate my name when issuing a card for me.)

FWIW, I've got one accented letter, and used to give out my name
with that, to see what comes back from the round trip. Hilarity :-)

Things have become much more boring the last fifteen to twenty
years, though.

Cheers
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAlsMKCAACgkQBcgs9XrR2kY3IwCffQEZ5cn/K4rd8Vs9SigmROvx
r4kAnifbFp8deNRQ6p4nJgW4ZIxx3Zmy
=ntf1
-----END PGP SIGNATURE-----




* Re: Is there a way to "asciify" a string?
  2018-05-27 20:00         ` tomas
@ 2018-05-28 18:27           ` Eli Zaretskii
  2018-05-29  6:37             ` tomas
  0 siblings, 1 reply; 43+ messages in thread
From: Eli Zaretskii @ 2018-05-28 18:27 UTC (permalink / raw
  To: help-gnu-emacs

> Date: Sun, 27 May 2018 22:00:38 +0200
> From: <tomas@tuxteam.de>
> 
> I searched the manual (and apropos) for translit and turned up nearly
> empty (OK, OK, apropos yields standard-display-cyrillic-translit, but
> that doesn't look promising).
> 
> How to go about that search?

First, it is wrong to think about this as "transliteration".  The
correct term is "Unicode normalization".

But the real problem is that the facilities in ucs-normalize are not
documented in the ELisp manual.  Something to work on, I guess.




* Re: Is there a way to "asciify" a string?
  2018-05-28 18:27           ` Eli Zaretskii
@ 2018-05-29  6:37             ` tomas
  0 siblings, 0 replies; 43+ messages in thread
From: tomas @ 2018-05-29  6:37 UTC (permalink / raw
  To: help-gnu-emacs

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Mon, May 28, 2018 at 09:27:18PM +0300, Eli Zaretskii wrote:
> > Date: Sun, 27 May 2018 22:00:38 +0200
> > From: <tomas@tuxteam.de>
> > 
> > I searched the manual (and apropos) for translit and turned up nearly
> > empty (OK, OK, apropos yields standard-display-cyrillic-translit, but
> > that doesn't look promising).
> > 
> > How to go about that search?
> 
> First, it is wrong to think about this as "transliteration".  The
> correct term is "Unicode normalization".

Thanks for hinting in the right direction...

> But the real problem is that the facilities in ucs-normalize are not
> documented in the ELisp manual.  Something to work on, I guess.

Days so short :-)

Thanks, Eli
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAlsM9TMACgkQBcgs9XrR2kblrACfWsTEJHlqzmnRpUroYrQ7mb3R
AqUAnj+dmpHdH6/dQGfx71bMQRNfQBqp
=MNzL
-----END PGP SIGNATURE-----




* Re: Is there a way to "asciify" a string?
  2018-05-27 16:59         ` Teemu Likonen
  2018-05-28  5:24           ` Tak Kunihiro
@ 2018-05-30 10:12           ` Marcin Borkowski
  2018-05-30 17:05             ` Eli Zaretskii
  1 sibling, 1 reply; 43+ messages in thread
From: Marcin Borkowski @ 2018-05-30 10:12 UTC (permalink / raw
  To: Teemu Likonen; +Cc: help-gnu-emacs


On 2018-05-27, at 18:59, Teemu Likonen <tlikonen@iki.fi> wrote:

> Eli Zaretskii [2018-05-27 19:07:40+03] wrote:
>
>> Come on, crowd, you _must_ know that Emacs has all of the iconv's
>> functionality (and more) built-in, right? All that encode-coding stuff
>> etc.?
>
> Emacs has loads of stuff that I don't know or don't have time to study.
> :-) But thanks for your ucs-normalize hint. An ASCII normalization and
> filter could be something like this:
>
>
>     (defun my-ascii-normalize-filter (string)
>       (require 'cl-lib)
>       (cl-remove-if (lambda (char)
>                       (> char 127))
>                     (ucs-normalize-NFKD-string string)))
>
> Maybe one could want to filter out control chars too...

Thanks, that's a step in the right direction!

However, (my-ascii-normalize-filter "żółć") gave "zoc" and not
"zolc"...

Best,

-- 
Marcin Borkowski
http://mbork.pl




* Re: Is there a way to "asciify" a string?
  2018-05-28 10:39         ` tomas
  2018-05-28 15:30           ` Yuri Khan
@ 2018-05-30 10:12           ` Marcin Borkowski
  1 sibling, 0 replies; 43+ messages in thread
From: Marcin Borkowski @ 2018-05-30 10:12 UTC (permalink / raw
  To: tomas; +Cc: help-gnu-emacs


On 2018-05-28, at 12:39, tomas@tuxteam.de wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On Mon, May 28, 2018 at 12:28:05PM +0200, Marcin Borkowski wrote:
>> 
>> On 2018-05-28, at 10:15, Philipp Stephani <p.stephani2@gmail.com> wrote:
>> 
>> > Marcin Borkowski <mbork@mbork.pl> schrieb am So., 27. Mai 2018 um 14:38 Uhr:
>> >
>> >> I understand that.
>> >>
>> >> Still, I need something *simple*.  I have a person's name (possibly with
>> >> some national characters), and I want to derive a filename from it [...]
>
>> That's an interesting idea.  However, I disagree.  My idea is to have
>> a number (because that helps to keep some order, and there may be - and
>> sometimes are - more than one item relating to the same person), but
>> a name is _very_ helpful for mnemonic reasons.
>
> One often-used method is to just URL-encode [1] the thing. On the plus
> side, it's always ASCII and you have a "lossless" mapping back and forth,
> on the down side you lose lexicographic order and (some) readability
> (although we trained rats have already learnt to cope with %3D and %2F)
>
> Emacs supports that with the pair of functions `url-hexify-string' and
> `url-unhex-string' (see, Eli? Sometimes even /me finds something .-)

Very interesting (though not acceptable in my use-case).  Thanks,
though!

-- 
Marcin Borkowski
http://mbork.pl




* Re: Is there a way to "asciify" a string?
  2018-05-27 13:04     ` Yuri Khan
@ 2018-05-30 10:14       ` Marcin Borkowski
  2018-05-30 11:51         ` Yuri Khan
  2018-05-31  2:03       ` John Mastro
  1 sibling, 1 reply; 43+ messages in thread
From: Marcin Borkowski @ 2018-05-30 10:14 UTC (permalink / raw
  To: Yuri Khan; +Cc: help-gnu-emacs


On 2018-05-27, at 15:04, Yuri Khan <yurivkhan@gmail.com> wrote:

> On Sun, May 27, 2018 at 7:38 PM Marcin Borkowski <mbork@mbork.pl> wrote:
>
> [...] But make very sure that the people involved never see their own
> name’s transliteration.

Actually, they shouldn't be offended - a similar thing happens in emails
and such all the time.

Best,

-- 
Marcin Borkowski
http://mbork.pl




* Re: Is there a way to "asciify" a string?
  2018-05-30 10:14       ` Marcin Borkowski
@ 2018-05-30 11:51         ` Yuri Khan
  2018-05-30 15:04           ` Marcin Borkowski
  0 siblings, 1 reply; 43+ messages in thread
From: Yuri Khan @ 2018-05-30 11:51 UTC (permalink / raw
  To: Marcin Borkowski; +Cc: help-gnu-emacs

On Wed, May 30, 2018 at 5:15 PM Marcin Borkowski <mbork@mbork.pl> wrote:

> Actually, they shouldn't be offended - a similar thing happens in emails
> and such all the time.

Not very often. Normally, one chooses their preferred transliteration and
puts that in their From: header, then replies go with that form in To:, and
that form is also saved in peoples’ address books so new messages also get
the preferred form. The only time I expect issues is when person A knows
the native form of person B’s name and attempts to transliterate it, and
guesses wrong.




* Re: Is there a way to "asciify" a string?
  2018-05-30 11:51         ` Yuri Khan
@ 2018-05-30 15:04           ` Marcin Borkowski
  0 siblings, 0 replies; 43+ messages in thread
From: Marcin Borkowski @ 2018-05-30 15:04 UTC (permalink / raw
  To: Yuri Khan; +Cc: help-gnu-emacs


On 2018-05-30, at 13:51, Yuri Khan <yurivkhan@gmail.com> wrote:

> On Wed, May 30, 2018 at 5:15 PM Marcin Borkowski <mbork@mbork.pl> wrote:
>
>> Actually, they shouldn't be offended - a similar thing happens in emails
>> and such all the time.
>
> Not very often. Normally, one chooses their preferred transliteration and
> puts that in their From: header, then replies go with that form in To:, and
> that form is also saved in peoples’ address books so new messages also get
> the preferred form. The only time I expect issues is when person A knows
> the native form of person B’s name and attempts to transliterate it, and
> guesses wrong.

In Polish (my use-case 99%+ of the time), there is no possibility of
confusion wrt transliteration: all non-ASCII letters in our alphabet are
ąćęłńóśźż, and they neatly (although non-injectively) map to acelnoszz.

Best,

--
Marcin Borkowski
http://mbork.pl




* Re: Is there a way to "asciify" a string?
  2018-05-30 10:12           ` Marcin Borkowski
@ 2018-05-30 17:05             ` Eli Zaretskii
  2018-05-30 19:38               ` Marcin Borkowski
  0 siblings, 1 reply; 43+ messages in thread
From: Eli Zaretskii @ 2018-05-30 17:05 UTC (permalink / raw
  To: help-gnu-emacs

> From: Marcin Borkowski <mbork@mbork.pl>
> Cc: Eli Zaretskii <eliz@gnu.org>, help-gnu-emacs <help-gnu-emacs@gnu.org>
> Date: Wed, 30 May 2018 12:12:18 +0200
> 
> >     (defun my-ascii-normalize-filter (string)
> >       (require 'cl-lib)
> >       (cl-remove-if (lambda (char)
> >                       (> char 127))
> >                     (ucs-normalize-NFKD-string string)))
> >
> > Maybe one could want to filter out control chars too...
> 
> Thanks, that's a step in the right direction!
> 
> However, (my-ascii-normalize-filter "żółć") gave "zoc" and not
> "zolc"...

That's because ł doesn't have any decompositions.  So it stays
unchanged and is removed because its codepoint is above 127.




* Re: Is there a way to "asciify" a string?
  2018-05-30 17:05             ` Eli Zaretskii
@ 2018-05-30 19:38               ` Marcin Borkowski
  0 siblings, 0 replies; 43+ messages in thread
From: Marcin Borkowski @ 2018-05-30 19:38 UTC (permalink / raw
  To: Eli Zaretskii; +Cc: help-gnu-emacs


On 2018-05-30, at 19:05, Eli Zaretskii <eliz@gnu.org> wrote:

>> From: Marcin Borkowski <mbork@mbork.pl>
>> Cc: Eli Zaretskii <eliz@gnu.org>, help-gnu-emacs <help-gnu-emacs@gnu.org>
>> Date: Wed, 30 May 2018 12:12:18 +0200
>> 
>> >     (defun my-ascii-normalize-filter (string)
>> >       (require 'cl-lib)
>> >       (cl-remove-if (lambda (char)
>> >                       (> char 127))
>> >                     (ucs-normalize-NFKD-string string)))
>> >
>> > Maybe one could want to filter out control chars too...
>> 
>> Thanks, that's a step in the right direction!
>> 
>> However, (my-ascii-normalize-filter "żółć") gave "zoc" and not
>> "zolc"...
>
> That's because ł doesn't have any decompositions.  So it stays
> unchanged and is removed because its codepoint is above 127.

I see.  This means that I'll have to take care of it at an earlier
stage.  Not very elegant - but what could I expect with Unicode? ;-)
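
A minimal sketch of what handling ł at an earlier stage might look like,
assuming the NFKD-and-filter approach quoted above.  The function name
`my-asciify-polish' is hypothetical and not something from this thread;
only ł and Ł are mapped by hand, everything else is left to the
decomposition:

```elisp
(require 'cl-lib)
(require 'ucs-normalize)

(defun my-asciify-polish (string)
  "Return STRING with Polish diacritics reduced to ASCII letters.
Hypothetical helper: ł and Ł have no Unicode decomposition, so they
are mapped by hand before the NFKD step."
  (let ((premapped
         (apply #'string
                (mapcar (lambda (char)
                          (cond ((eq char ?ł) ?l)
                                ((eq char ?Ł) ?L)
                                (t char)))
                        string))))
    ;; Decompose accented letters into base letter + combining mark,
    ;; then drop every character outside the ASCII range.
    (cl-remove-if (lambda (char) (> char 127))
                  (ucs-normalize-NFKD-string premapped))))

(my-asciify-polish "żółć")
;=> "zolc"
```

Any other letter without a decomposition (ø or ß, for instance) would
still be dropped silently, so this only covers the Polish case.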

Thanks,

-- 
Marcin Borkowski
http://mbork.pl




* Re: Is there a way to "asciify" a string?
  2018-05-27 13:04     ` Yuri Khan
  2018-05-30 10:14       ` Marcin Borkowski
@ 2018-05-31  2:03       ` John Mastro
  2018-06-02 18:07         ` Marcin Borkowski
  2018-06-02 18:12         ` Marcin Borkowski
  1 sibling, 2 replies; 43+ messages in thread
From: John Mastro @ 2018-05-31  2:03 UTC (permalink / raw
  To: Help Gnu Emacs mailing list

Yuri Khan <yurivkhan@gmail.com> wrote:
> > I have a person's name (possibly with
> > some national characters), and I want to derive a filename from it.  It
> > doesn't have to be correct in 100% cases.  It doesn't even have to be
> > unambiguous (there will be a number for that in the filename, too).
>
> Technically you could use the name of a person as is, as long as it is
> representable in Unicode and contains neither the null character nor the
> slash character. But I assume you want a filename that is portable between
> file systems, or a filename that can be represented in an URI path segment
> without %-encoding, or any combination of the above.
>
> In that case, the Python unidecode library is probably the closest that you
> can find. But make very sure that the people involved never see their own
> name’s transliteration.

There's also an Emacs Lisp port of unidecode[1]

(unidecode "żółć")
;=> "zolc"

[1]: https://github.com/sindikat/unidecode




* Re: Is there a way to "asciify" a string?
  2018-05-27 12:36   ` Marcin Borkowski
                       ` (3 preceding siblings ...)
  2018-05-28  8:15     ` Philipp Stephani
@ 2018-05-31 14:23     ` Stefan Monnier
  2018-05-31 15:08       ` S. Champailler
                         ` (2 more replies)
  4 siblings, 3 replies; 43+ messages in thread
From: Stefan Monnier @ 2018-05-31 14:23 UTC (permalink / raw
  To: help-gnu-emacs

> Still, I need something *simple*.  I have a person's name (possibly with
> some national characters), and I want to derive a filename from it.

I really strongly recommend you try to solve this problem by doing
nothing: keep the name in its full glory.  Nowadays users *should*
expect this to work.


        Stefan





* Re: Is there a way to "asciify" a string?
  2018-05-31 14:23     ` Stefan Monnier
@ 2018-05-31 15:08       ` S. Champailler
  2018-05-31 22:52         ` Richard Wordingham
  2018-05-31 15:42       ` Marcin Borkowski
       [not found]       ` <mailman.871.1527781438.1292.help-gnu-emacs@gnu.org>
  2 siblings, 1 reply; 43+ messages in thread
From: S. Champailler @ 2018-05-31 15:08 UTC (permalink / raw
  To: help-gnu-emacs, Stefan Monnier

I second that: removing accents and other "nationalities" is much trickier than one might expect (you can look at Java as an example; its Unicode support is quite complete), especially for languages far from English, such as Russian. By "tricky" I mean there are *hundreds* of edge cases. Nevertheless, there are ways to sort of do what you want by playing with things such as "non-spacing combining characters", "normalized strings", etc. If you have the opportunity, just try to do it; the great lesson you'll get from that is that human languages are super complex (and thus super interesting).

Today, everyone should use Unicode; it's much simpler. Many file systems support Unicode.

stF


> On 31 May 2018 at 16:23, Stefan Monnier <monnier@iro.umontreal.ca> wrote:
> 
> 
> > Still, I need something *simple*.  I have a person's name (possibly with
> > some national characters), and I want to derive a filename from it.
> 
> I really strongly recommend you try to solve this problem by doing
> nothing: keep the name in its full glory.  Nowadays users *should*
> expect this to work.
> 
> 
>         Stefan
> 
>




* Re: Is there a way to "asciify" a string?
  2018-05-31 14:23     ` Stefan Monnier
  2018-05-31 15:08       ` S. Champailler
@ 2018-05-31 15:42       ` Marcin Borkowski
  2018-05-31 15:53         ` Eli Zaretskii
       [not found]       ` <mailman.871.1527781438.1292.help-gnu-emacs@gnu.org>
  2 siblings, 1 reply; 43+ messages in thread
From: Marcin Borkowski @ 2018-05-31 15:42 UTC (permalink / raw
  To: Stefan Monnier; +Cc: help-gnu-emacs


On 2018-05-31, at 16:23, Stefan Monnier <monnier@iro.umontreal.ca> wrote:

>> Still, I need something *simple*.  I have a person's name (possibly with
>> some national characters), and I want to derive a filename from it.
>
> I really strongly recommend you try to solve this problem by doing
> nothing: keep the name in its full glory.  Nowadays users *should*
> expect this to work.

It's tempting, but no: these files will eventually be sent to
e.g. people on Windows XP and the like.  I don't want to take risks of
unreadable filenames.

Thanks anyway,

-- 
Marcin Borkowski
http://mbork.pl




* Re: Is there a way to "asciify" a string?
  2018-05-31 15:42       ` Marcin Borkowski
@ 2018-05-31 15:53         ` Eli Zaretskii
  2018-05-31 16:20           ` Yuri Khan
  2018-05-31 19:03           ` Stefan Monnier
  0 siblings, 2 replies; 43+ messages in thread
From: Eli Zaretskii @ 2018-05-31 15:53 UTC (permalink / raw
  To: Marcin Borkowski; +Cc: help-gnu-emacs, monnier

> From: Marcin Borkowski <mbork@mbork.pl>
> Date: Thu, 31 May 2018 17:42:33 +0200
> Cc: help-gnu-emacs@gnu.org
> 
> > I really strongly recommend you try to solve this problem by doing
> > nothing: keep the name in its full glory.  Nowadays users *should*
> > expect this to work.
> 
> It's tempting, but no: these files will eventually be sent to
> e.g. people on Windows XP and the like.  I don't want to take risks of
> unreadable filenames.

Windows nowadays supports the full Unicode range of characters in file
names, with a few exceptions that cannot happen in people's names
(like slash, null, ':', etc.).  Emacs on Windows even supports such
file names (since Emacs 24).




* Re: Is there a way to "asciify" a string?
  2018-05-31 15:53         ` Eli Zaretskii
@ 2018-05-31 16:20           ` Yuri Khan
  2018-05-31 19:03           ` Stefan Monnier
  1 sibling, 0 replies; 43+ messages in thread
From: Yuri Khan @ 2018-05-31 16:20 UTC (permalink / raw
  To: Eli Zaretskii; +Cc: help-gnu-emacs, Stefan Monnier

On Thu, May 31, 2018 at 10:54 PM Eli Zaretskii <eliz@gnu.org> wrote:

> Windows nowadays supports the full Unicode range of characters in file
> names, with a few exceptions that cannot happen in people's names
> (like slash, null, ':', etc.).

Seeing the categorical “cannot”, I simply must post this link here:

https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/

And, of course, https://xkcd.com/327/ too.




* Re: Is there a way to "asciify" a string?
  2018-05-31 15:53         ` Eli Zaretskii
  2018-05-31 16:20           ` Yuri Khan
@ 2018-05-31 19:03           ` Stefan Monnier
  1 sibling, 0 replies; 43+ messages in thread
From: Stefan Monnier @ 2018-05-31 19:03 UTC (permalink / raw
  To: Eli Zaretskii; +Cc: help-gnu-emacs

>> > I really strongly recommend you try to solve this problem by doing
>> > nothing: keep the name in its full glory.  Nowadays users *should*
>> > expect this to work.
>> It's tempting, but no: these files will eventually be sent to
>> e.g. people on Windows XP and the like.  I don't want to take risks of
>> unreadable filenames.
>
> Windows nowadays supports the full Unicode range of characters in file
> names,

To clarify: the "nowadays" includes Windows XP.


        Stefan




* Re: Is there a way to "asciify" a string?
  2018-05-31 15:08       ` S. Champailler
@ 2018-05-31 22:52         ` Richard Wordingham
  0 siblings, 0 replies; 43+ messages in thread
From: Richard Wordingham @ 2018-05-31 22:52 UTC (permalink / raw
  To: help-gnu-emacs

On Thu, 31 May 2018 17:08:47 +0200 (CEST)
"S. Champailler" <schampaillerspam@skynet.be> wrote:

> I second that: removing accents and other "nationalities" is much
> trickier than one might expect (you can look at Java as an example; its
> Unicode support is quite complete), especially for languages far from
> English, such as Russian. By "tricky" I mean there are *hundreds* of
> edge cases. Nevertheless, there are ways to sort of do what you want by
> playing with things such as "non-spacing combining characters",
> "normalized strings", etc. If you have the opportunity, just try to do
> it; the great lesson you'll get from that is that human languages are
> super complex (and thus super interesting).

Make sure you transliterate the string first.  Remember that stripping
out Indic vowels (many of which are gc=Mn) is no more reasonable than
stripping out ASCII vowels.

> Today, everyone should use Unicode; it's much simpler. Many file
> systems support Unicode.

But be warned that some very different strings may compare equal.  The
Unicode Collation algorithm is highly likely *not* to be the default.
Windows XP used to compare strings of Canadian Aboriginal Syllabics of
the same length as equal.  I remember using sort -u to remove duplicates
from a list of words on a Linux distribution, and finding that I only
had one left. I now play safe and do that sort of trick in the C locale.

Richard.




* Re: Is there a way to "asciify" a string?
       [not found]       ` <mailman.871.1527781438.1292.help-gnu-emacs@gnu.org>
@ 2018-05-31 23:23         ` James K. Lowden
  2018-06-01  2:04           ` Stefan Monnier
  2018-06-01  7:02           ` Eli Zaretskii
  0 siblings, 2 replies; 43+ messages in thread
From: James K. Lowden @ 2018-05-31 23:23 UTC (permalink / raw
  To: help-gnu-emacs

On Thu, 31 May 2018 17:42:33 +0200
Marcin Borkowski <mbork@mbork.pl> wrote:

> > I really strongly recommend you try to solve this problem by doing
> > nothing: keep the name in its full glory.  Nowadays users *should*
> > expect this to work.
> 
> It's tempting, but no: these files will eventually be sent to
> e.g. people on Windows XP and the like.  I don't want to take risks of
> unreadable filenames.

It's good advice, though treacherous.  If you use any encoding other
than ASCII, you'll need to indicate the encoding used, and put up with
recipients who don't know what "encoding" is, or can't re-encode the
names to their machine's preferred encoding.  

For instance, if you send UTF-8, you can expect befuddlement from
Windows users, whose system implicitly recognizes UTF-16LE.  

I can hardly blame you for not wanting to do that.  

If Windows's filename rules were the actual constraint, the set of
allowed characters in a Windows filename is well defined.  The
prohibited characters could be URL-encoded or similar.  That would
yield a recognizable, unique name, and the original could be recovered
by reversing the process.  

If I were solving your problem, I'd look for something similar to what
you describe, but wholly reversible.  I'd use ascii//TRANSLIT or similar
to get the "unaccented" version of the character, and insert a
URL-style escape after each one representing the original
Unicode character in hex.  So, 

	Jönköping

becomes

	Jo%F6nko%F6ping

If you escape literal percent signs, too, ("%" becomes "%%25") then
the reversal rule is simply "for every /%[:xdigit:]{2}/, replace the
previous character with the indicated codepoint".  

This approach preserves uniqueness in the filename, so you can dispense
with "uniquifying" it with a meaningless integer.  
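
The scheme can be sketched in Emacs Lisp roughly as follows.  This is
only an illustration under assumptions: the function names are
hypothetical, NFKD decomposition stands in for ascii//TRANSLIT, "_" is
used as the base when no ASCII base letter exists, and since the escape
has two hex digits only code points up to U+00FF are handled.

```elisp
(require 'ucs-normalize)

(defun my-reversible-asciify (string)
  "Encode STRING as ASCII: base letter followed by %XX of the original.
Hypothetical sketch; signals an error for code points above U+00FF."
  (mapconcat
   (lambda (char)
     (cond
      ;; A literal percent sign is escaped like any other character.
      ((= char ?%) "%%25")
      ((< char 128) (string char))
      ((> char #xFF)
       (error "Code point U+%04X does not fit in two hex digits" char))
      (t
       ;; First character of the NFKD form is the base letter, if any.
       (let ((base (aref (ucs-normalize-NFKD-string (string char)) 0)))
         (format "%c%%%02X" (if (< base 128) base ?_) char)))))
   string ""))

(defun my-reversible-deasciify (string)
  "Reverse `my-reversible-asciify'.
Every character followed by %XX is replaced by code point XX."
  (replace-regexp-in-string
   ".%\\([[:xdigit:]]\\{2\\}\\)"
   (lambda (match)
     (string (string-to-number (match-string 1 match) 16)))
   string t t))

(my-reversible-asciify "Jönköping")
;=> "Jo%F6nko%F6ping"
```

Covering names outside Latin-1 would need a longer escape (four or six
hex digits, say), with the reversal pattern widened to match.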

--jkl



* Re: Is there a way to "asciify" a string?
  2018-05-31 23:23         ` James K. Lowden
@ 2018-06-01  2:04           ` Stefan Monnier
  2018-06-01  7:02           ` Eli Zaretskii
  1 sibling, 0 replies; 43+ messages in thread
From: Stefan Monnier @ 2018-06-01  2:04 UTC (permalink / raw
  To: help-gnu-emacs

> It's good advice, though treacherous.  If you use any encoding other
> than ASCII, you'll need to indicate the encoding used, and put up with
> recipients who don't know what "encoding" is

Yes, and this has been known for enough years (and slowly fixed
everywhere) that nowadays users *should* (and from what I can tell,
usually do) expect this to be handled properly.

It may still fail, of course, but this shouldn't be much more common
than the occurrence of any other bug.


        Stefan





* Re: Is there a way to "asciify" a string?
  2018-05-31 23:23         ` James K. Lowden
  2018-06-01  2:04           ` Stefan Monnier
@ 2018-06-01  7:02           ` Eli Zaretskii
  1 sibling, 0 replies; 43+ messages in thread
From: Eli Zaretskii @ 2018-06-01  7:02 UTC (permalink / raw
  To: help-gnu-emacs

> From: "James K. Lowden" <jklowden@speakeasy.net>
> Date: Thu, 31 May 2018 19:23:48 -0400
> 
> It's good advice, though treacherous.  If you use any encoding other
> than ASCII, you'll need to indicate the encoding used, and put up with
> recipients who don't know what "encoding" is, or can't re-encode the
> names to their machine's preferred encoding.  
> 
> For instance, if you send UTF-8, you can expect befuddlement from
> Windows users, whose system implicitly recognizes UTF-16LE.  

As Stefan points out, any reasonable application already knows how to
overcome this difficulty.  Emacs certainly does -- that's what the
various encodings it supports are about.  Using that, you can visit a
file on a system where file names are UTF-8 encoded, then save
that file to a system whose file names are encoded in UTF-16LE.  All
that Emacs needs is for the user to tell it which encoding to use for
file names on what system.

> If Windows's filename rules were the actual constraint, the set of
> allowed characters in a Windows filename is well defined.

Yes, and the "prohibited" characters, while they are more numerous
than on Posix systems, are still very few (and are all below codepoint
127).  Any other non-ASCII character is allowed, be it inside the BMP
or outside it.

So it is quite possible nowadays to keep the original characters and
expect to be able to name files with them.




* Re: Is there a way to "asciify" a string?
  2018-05-31  2:03       ` John Mastro
@ 2018-06-02 18:07         ` Marcin Borkowski
  2018-06-02 18:48           ` tomas
  2018-06-02 22:33           ` Drew Adams
  2018-06-02 18:12         ` Marcin Borkowski
  1 sibling, 2 replies; 43+ messages in thread
From: Marcin Borkowski @ 2018-06-02 18:07 UTC (permalink / raw
  To: John Mastro; +Cc: Help Gnu Emacs mailing list


On 2018-05-31, at 04:03, John Mastro <john.b.mastro@gmail.com> wrote:

> Yuri Khan <yurivkhan@gmail.com> wrote:
>> > I have a person's name (possibly with
>> > some national characters), and I want to derive a filename from it.  It
>> > doesn't have to be correct in 100% cases.  It doesn't even have to be
>> > unambiguous (there will be a number for that in the filename, too).
>>
>> Technically you could use the name of a person as is, as long as it is
>> representable in Unicode and contains neither the null character nor the
>> slash character. But I assume you want a filename that is portable between
>> file systems, or a filename that can be represented in an URI path segment
>> without %-encoding, or any combination of the above.
>>
>> In that case, the Python unidecode library is probably the closest that you
>> can find. But make very sure that the people involved never see their own
>> name’s transliteration.
>
> There's also an Emacs Lisp port of unidecode[1]
>
> (unidecode "żółć")
> ;=> "zolc"
>
> [1]: https://github.com/sindikat/unidecode

Thanks,

and thanks also to all the others for their input.

I didn't really intend to create such a storm.

My use case is much, much simpler than much of the stuff mentioned in
this thread.  99.5% (or more) of the cases are Polish names, where we
have only 9 "offending" letters, all easily asciified.  I thought there
was a simple, general solution (and I learned there isn't and probably
there can't be).

Hence, I'm going to stick with Eli's suggestion (and manual conversion
of "ł" into "l").  And in case I encounter a non-Polish name with some
letters outside the English alphabet (this may very rarely happen),
I can just manually override this simple solution.

IOW, KISS.

But thanks for the opportunity to learn a few things!

--
Marcin Borkowski
http://mbork.pl




* Re: Is there a way to "asciify" a string?
  2018-05-31  2:03       ` John Mastro
  2018-06-02 18:07         ` Marcin Borkowski
@ 2018-06-02 18:12         ` Marcin Borkowski
  1 sibling, 0 replies; 43+ messages in thread
From: Marcin Borkowski @ 2018-06-02 18:12 UTC (permalink / raw
  To: John Mastro; +Cc: Help Gnu Emacs mailing list


On 2018-05-31, at 04:03, John Mastro <john.b.mastro@gmail.com> wrote:

> Yuri Khan <yurivkhan@gmail.com> wrote:
>> > I have a person's name (possibly with
>> > some national characters), and I want to derive a filename from it.  It
>> > doesn't have to be correct in 100% cases.  It doesn't even have to be
>> > unambiguous (there will be a number for that in the filename, too).
>>
>> Technically you could use the name of a person as is, as long as it is
>> representable in Unicode and contains neither the null character nor the
>> slash character. But I assume you want a filename that is portable between
>> file systems, or a filename that can be represented in an URI path segment
>> without %-encoding, or any combination of the above.
>>
>> In that case, the Python unidecode library is probably the closest that you
>> can find. But make very sure that the people involved never see their own
>> name’s transliteration.
>
> There's also an Emacs Lisp port of unidecode[1]
>
> (unidecode "żółć")
> ;=> "zolc"
>
> [1]: https://github.com/sindikat/unidecode

Thanks!

I think I won't use it (I want as few dependencies as possible), but
I'll keep that in mind (and probably blog about it one day).

Best,

--
Marcin Borkowski
http://mbork.pl




* Re: Is there a way to "asciify" a string?
  2018-06-02 18:07         ` Marcin Borkowski
@ 2018-06-02 18:48           ` tomas
  2018-06-07 17:16             ` Marcin Borkowski
  2018-06-02 22:33           ` Drew Adams
  1 sibling, 1 reply; 43+ messages in thread
From: tomas @ 2018-06-02 18:48 UTC (permalink / raw
  To: help-gnu-emacs

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Sat, Jun 02, 2018 at 08:07:50PM +0200, Marcin Borkowski wrote:
[...]

> Thanks,
> 
> and thanks also to all the others for their input.
> 
> I didn't really intend to create such a storm.

Well, it is an interesting topic.

I, for one, learnt quite a couple of things, too. So from me, also
thanks to all (including you, for bringing it up :)

Cheers
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAlsS5mkACgkQBcgs9XrR2kafXgCeKNzXzanfBhIarRbalVZbhllS
Dv8An2Hsax1GrrA6MxzxOdQco6/q441S
=P5EV
-----END PGP SIGNATURE-----




* RE: Is there a way to "asciify" a string?
  2018-06-02 18:07         ` Marcin Borkowski
  2018-06-02 18:48           ` tomas
@ 2018-06-02 22:33           ` Drew Adams
  2018-06-07 17:15             ` Marcin Borkowski
  1 sibling, 1 reply; 43+ messages in thread
From: Drew Adams @ 2018-06-02 22:33 UTC (permalink / raw
  To: Marcin Borkowski, John Mastro; +Cc: Help Gnu Emacs mailing list

> My use case is much, much simpler than much of the stuff mentioned in
> this thread.  99.5% (or more) of the cases are Polish names, where we
> have only 9 "offending" letters, all easily asciified.  I thought there
> was a simple, general solution (and I learned there isn't and probably
> there can't be).
...
> KISS

If you want a really rudimentary, KISS solution along those
lines, just apply a set of mappings of the chars you're
interested in.

I used that approach in the last millennium, in `unaccent.el'
(and I still use it occasionally).

I was interested only in translating letters with trema/umlaut,
circumflex, grave & acute accents, cedilla, tilde, S-zed,
guillemets, ae ligature, slashed O, angstrom, and upside-down
question mark & exclamation point - so my translation alist
contained only those mappings.  But you could use any
mapping, just by redefining variable `reverse-iso-chars-alist'
(not a good name, considering its generality, but that's what
I was using it for).

https://www.emacswiki.org/emacs/download/unaccent.el
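
For the Polish case, a table-driven mapping of that kind might look
like the sketch below.  It is not the code from `unaccent.el'; the
variable and function names are hypothetical, and the table covers only
the nine Polish letters in both cases:

```elisp
(defvar my-polish-ascii-alist
  '((?ą . ?a) (?ć . ?c) (?ę . ?e) (?ł . ?l) (?ń . ?n)
    (?ó . ?o) (?ś . ?s) (?ź . ?z) (?ż . ?z)
    (?Ą . ?A) (?Ć . ?C) (?Ę . ?E) (?Ł . ?L) (?Ń . ?N)
    (?Ó . ?O) (?Ś . ?S) (?Ź . ?Z) (?Ż . ?Z))
  "Hypothetical alist mapping Polish letters to ASCII replacements.")

(defun my-unaccent-string (string &optional alist)
  "Return STRING with each character replaced according to ALIST.
ALIST defaults to `my-polish-ascii-alist'.  Characters without an
entry are kept unchanged."
  (let ((table (or alist my-polish-ascii-alist)))
    (apply #'string
           (mapcar (lambda (char)
                     ;; Fall back to the character itself when unmapped.
                     (alist-get char table char))
                   string))))

(my-unaccent-string "żółć")
;=> "zolc"
```

Unlike the normalize-and-filter approach, nothing is ever dropped here:
a character missing from the table simply passes through, so gaps in
the mapping stay visible in the result.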





* Re: Is there a way to "asciify" a string?
  2018-06-02 22:33           ` Drew Adams
@ 2018-06-07 17:15             ` Marcin Borkowski
  0 siblings, 0 replies; 43+ messages in thread
From: Marcin Borkowski @ 2018-06-07 17:15 UTC (permalink / raw
  To: Drew Adams; +Cc: John Mastro, Help Gnu Emacs mailing list


On 2018-06-03, at 00:33, Drew Adams <drew.adams@oracle.com> wrote:

>> My use case is much, much simpler than much of the stuff mentioned in
>> this thread.  99.5% (or more) of the cases are Polish names, where we
>> have only 9 "offending" letters, all easily asciified.  I thought there
>> was a simple, general solution (and I learned there isn't and probably
>> there can't be).
> ...
>> KISS
>
> If you want a really rudimentary, KISS solution along those
> lines, just apply a set of mappings of the chars you're
> interested in.
>
> I used that approach in the last millennium, in `unaccent.el'
> (and I still use it occasionally).
>
> I was interested only in translating letters with trema/umlaut,
> circumflex, grave & acute accents, cedilla, tilde, S-zed,
> guillemets, ae ligature, slashed O, angstrom, and upside-down
> question mark & exclamation point - so my translation alist
> contained only those mappings.  But you could use any
> mapping, just by redefining variable `reverse-iso-chars-alist'
> (not a good name, considering its generality, but that's what
> I was using it for).
>
> https://www.emacswiki.org/emacs/download/unaccent.el

Thanks, that's a good idea, although I was too lazy to perform all these
mappings myself and used the function mentioned earlier in the thread to
shave off all the accents.

Best,

-- 
Marcin Borkowski
http://mbork.pl




* Re: Is there a way to "asciify" a string?
  2018-06-02 18:48           ` tomas
@ 2018-06-07 17:16             ` Marcin Borkowski
  0 siblings, 0 replies; 43+ messages in thread
From: Marcin Borkowski @ 2018-06-07 17:16 UTC (permalink / raw
  To: tomas; +Cc: help-gnu-emacs


On 2018-06-02, at 20:48, tomas@tuxteam.de wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On Sat, Jun 02, 2018 at 08:07:50PM +0200, Marcin Borkowski wrote:
> [...]
>
>> Thanks,
>> 
>> and thanks also to all the others for their input.
>> 
>> I didn't really intend to create such a storm.
>
> Well, it is an interesting topic.
>
> I, for one, learnt quite a couple of things, too. So from me, also
> thanks to all (including you, for bringing it up :)

You're welcome!

(I _might_ blog about this stuff, but I'm afraid I won't find time for
that within the next few weeks...)

Best,

-- 
Marcin Borkowski
http://mbork.pl




end of thread, other threads:[~2018-06-07 17:16 UTC | newest]

Thread overview: 43+ messages
2018-05-27  6:22 Is there a way to "asciify" a string? Marcin Borkowski
2018-05-27  7:36 ` tomas
2018-05-27 12:36   ` Marcin Borkowski
2018-05-27 12:52     ` Teemu Likonen
2018-05-27 16:07       ` Eli Zaretskii
2018-05-27 16:59         ` Teemu Likonen
2018-05-28  5:24           ` Tak Kunihiro
2018-05-30 10:12           ` Marcin Borkowski
2018-05-30 17:05             ` Eli Zaretskii
2018-05-30 19:38               ` Marcin Borkowski
2018-05-27 20:00         ` tomas
2018-05-28 18:27           ` Eli Zaretskii
2018-05-29  6:37             ` tomas
2018-05-27 13:04     ` Yuri Khan
2018-05-30 10:14       ` Marcin Borkowski
2018-05-30 11:51         ` Yuri Khan
2018-05-30 15:04           ` Marcin Borkowski
2018-05-31  2:03       ` John Mastro
2018-06-02 18:07         ` Marcin Borkowski
2018-06-02 18:48           ` tomas
2018-06-07 17:16             ` Marcin Borkowski
2018-06-02 22:33           ` Drew Adams
2018-06-07 17:15             ` Marcin Borkowski
2018-06-02 18:12         ` Marcin Borkowski
2018-05-27 19:53     ` tomas
2018-05-28  8:15     ` Philipp Stephani
2018-05-28 10:28       ` Marcin Borkowski
2018-05-28 10:39         ` tomas
2018-05-28 15:30           ` Yuri Khan
2018-05-28 16:02             ` tomas
2018-05-30 10:12           ` Marcin Borkowski
2018-05-31 14:23     ` Stefan Monnier
2018-05-31 15:08       ` S. Champailler
2018-05-31 22:52         ` Richard Wordingham
2018-05-31 15:42       ` Marcin Borkowski
2018-05-31 15:53         ` Eli Zaretskii
2018-05-31 16:20           ` Yuri Khan
2018-05-31 19:03           ` Stefan Monnier
     [not found]       ` <mailman.871.1527781438.1292.help-gnu-emacs@gnu.org>
2018-05-31 23:23         ` James K. Lowden
2018-06-01  2:04           ` Stefan Monnier
2018-06-01  7:02           ` Eli Zaretskii
2018-05-27 14:55 ` Eric Abrahamsen
2018-05-27 16:00 ` Eli Zaretskii
