unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#51292: 27.2; Reversing strings with unicode combining characters
@ 2021-10-19 19:16 Howard Melman
  2021-10-19 19:26 ` Lars Ingebrigtsen
  0 siblings, 1 reply; 12+ messages in thread
From: Howard Melman @ 2021-10-19 19:16 UTC
  To: 51292

Reversing a string fails to account for unicode combining characters

    (reverse "nai\u0308ve")
    "ev̈ian"

Note the diaeresis is now on the v and not the i.  s-reverse gets it right:

    (s-reverse "nai\u0308ve")
    "evïan"

I tried on both:

GNU Emacs 27.2 (build 1, x86_64-apple-darwin18.7.0, Carbon Version 158 AppKit 1671.6) of 2021-03-27
GNU Emacs 27.1 (build 1, x86_64-apple-darwin18.7.0, NS appkit-1671.60 Version 10.14.6 (Build 18G95)) of 2020-08-12
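
To illustrate the result above: an Emacs string is a sequence of
characters (code points), and `reverse' reverses that sequence element
by element, so the combining diaeresis U+0308 ends up following the
"v" rather than the "i".  The snippet below is only a sketch for
inspecting the characters involved.

```elisp
;; Characters of the original string: n a i U+0308 v e
(string-to-list "nai\u0308ve")
;; => (110 97 105 776 118 101)

;; After `reverse' the combining mark (776) follows "v" (118),
;; so it is displayed on the "v".
(string-to-list (reverse "nai\u0308ve"))
;; => (101 118 776 105 97 110)
```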

-- 

Howard






* bug#51292: 27.2; Reversing strings with unicode combining characters
  2021-10-19 19:16 bug#51292: 27.2; Reversing strings with unicode combining characters Howard Melman
@ 2021-10-19 19:26 ` Lars Ingebrigtsen
  2021-10-19 20:50   ` Lars Ingebrigtsen
  2021-10-20 11:45   ` Eli Zaretskii
  0 siblings, 2 replies; 12+ messages in thread
From: Lars Ingebrigtsen @ 2021-10-19 19:26 UTC
  To: Howard Melman; +Cc: 51292

Howard Melman <hmelman@gmail.com> writes:

> Reversing a string fails to account for unicode combining characters
>
>     (reverse "nai\u0308ve")
>     "ev̈ian"
>
> Note the diaeresis is now on the v and not the i.  s-reverse gets it right:
>
>     (s-reverse "nai\u0308ve")
>     "evïan"

So I wondered what s-reverse did, and indeed:

(defun s-reverse (s)
  "Return the reverse of S."
  (declare (pure t) (side-effect-free t))
  (save-match-data
    (if (multibyte-string-p s)
        (let ((input (string-to-list s))
              output)
          (require 'ucs-normalize)
          (while input
            ;; Handle entire grapheme cluster as a single unit
            (let ((grapheme (list (pop input))))
              (while (memql (car input) ucs-normalize-combining-chars)
                (push (pop input) grapheme))
              (setq output (nconc (nreverse grapheme) output))))
          (concat output))
      (concat (nreverse (string-to-list s))))))

Emacs has string-reverse, obsolete since 25.1.  Perhaps we should
reintroduce it and use the definition from s?

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no






* bug#51292: 27.2; Reversing strings with unicode combining characters
  2021-10-19 19:26 ` Lars Ingebrigtsen
@ 2021-10-19 20:50   ` Lars Ingebrigtsen
  2021-10-19 21:21     ` Howard Melman
                       ` (2 more replies)
  2021-10-20 11:45   ` Eli Zaretskii
  1 sibling, 3 replies; 12+ messages in thread
From: Lars Ingebrigtsen @ 2021-10-19 20:50 UTC
  To: Howard Melman; +Cc: 51292

Lars Ingebrigtsen <larsi@gnus.org> writes:

> Emacs has string-reverse, obsolete since 25.1.  Perhaps we should
> reintroduce it and use the definition from s?

Or...  well, that might break some people's code, so let's not do that.

And I'm not quite sure that such a function really makes sense.  How
often do you reverse a string for display purposes, anyway?

But it might make sense to add a function to tokenize a string into
grapheme clusters -- I can see that being useful.  Then the caller can
chop and reverse the list of clusters as they wish.

`string-tokenize-graphemes'?
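
A minimal sketch of what such a tokenizer could look like, reusing the
combining-character test from the s-reverse definition quoted earlier
in the thread.  The name `my-string-graphemes' is hypothetical (it is
not an existing Emacs function), and checking membership in
`ucs-normalize-combining-chars' is only an approximation of real
grapheme cluster segmentation.

```elisp
(require 'ucs-normalize)

(defun my-string-graphemes (s)
  "Split S into strings of one base character plus its combining marks.
Hypothetical helper, not part of Emacs.  Only characters listed in
`ucs-normalize-combining-chars' are treated as combining."
  (let ((input (string-to-list s))
        clusters)
    (while input
      ;; Start a cluster with the next character ...
      (let ((cluster (list (pop input))))
        ;; ... and absorb any combining characters that follow it.
        (while (and input
                    (memql (car input) ucs-normalize-combining-chars))
          (push (pop input) cluster))
        ;; CLUSTER was built back to front, so restore its order.
        (push (concat (nreverse cluster)) clusters)))
    (nreverse clusters)))

;; Reverse the clusters rather than the individual characters.
(apply #'concat (reverse (my-string-graphemes "nai\u0308ve")))
```

The caller gets a plain list of strings and can reverse, truncate, or
otherwise rearrange it before joining the pieces again.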

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no






* bug#51292: 27.2; Reversing strings with unicode combining characters
  2021-10-19 20:50   ` Lars Ingebrigtsen
@ 2021-10-19 21:21     ` Howard Melman
  2021-10-20  8:58       ` Lars Ingebrigtsen
  2021-10-19 23:13     ` Stefan Kangas
  2021-10-20 11:50     ` Eli Zaretskii
  2 siblings, 1 reply; 12+ messages in thread
From: Howard Melman @ 2021-10-19 21:21 UTC
  To: 51292

Lars Ingebrigtsen <larsi@gnus.org> writes:

> Lars Ingebrigtsen <larsi@gnus.org> writes:
>
>> Emacs has string-reverse, obsolete since 25.1.  Perhaps we should
>> reintroduce it and use the definition from s?
>
> Or...  well, that might break some people's code, so let's not do that.
>
> And I'm not quite sure that such a function really makes sense.  How
> often do you reverse a string for display purposes,
> anyway?

FWIW, I'm not invested in the outcome.  I haven't had a need
to do this but found the behavior curious.

> But it might make sense to add a function to tokenize a string into
> grapheme clusters -- I can see that being useful.  Then the caller can
> chop and reverse the list of clusters as they wish.
>
> `string-tokenize-graphemes'?

I agree that seems potentially useful.

-- 

Howard







* bug#51292: 27.2; Reversing strings with unicode combining characters
  2021-10-19 20:50   ` Lars Ingebrigtsen
  2021-10-19 21:21     ` Howard Melman
@ 2021-10-19 23:13     ` Stefan Kangas
  2021-10-20  8:11       ` Lars Ingebrigtsen
  2021-10-20 11:50     ` Eli Zaretskii
  2 siblings, 1 reply; 12+ messages in thread
From: Stefan Kangas @ 2021-10-19 23:13 UTC
  To: Lars Ingebrigtsen; +Cc: Howard Melman, 51292

Lars Ingebrigtsen <larsi@gnus.org> writes:

> And I'm not quite sure that such a function really makes sense.  How
> often do you reverse a string for display purposes, anyway?

I guess not often, if I'm reading the results of this GitHub search
right:

https://github.com/search?l=Emacs+Lisp&o=desc&q=s-reverse+-filename%3Aexamples.el+-filename%3As-tests.el&s=&type=Code






* bug#51292: 27.2; Reversing strings with unicode combining characters
  2021-10-19 23:13     ` Stefan Kangas
@ 2021-10-20  8:11       ` Lars Ingebrigtsen
  2021-10-20 13:02         ` Stefan Kangas
  0 siblings, 1 reply; 12+ messages in thread
From: Lars Ingebrigtsen @ 2021-10-20  8:11 UTC
  To: Stefan Kangas; +Cc: Howard Melman, 51292

Stefan Kangas <stefan@marxist.se> writes:

> I guess not often, if I'm reading the results of this GitHub search
> right:
>
> https://github.com/search?l=Emacs+Lisp&o=desc&q=s-reverse+-filename%3Aexamples.el+-filename%3As-tests.el&s=&type=Code

I'm not sure how to read it.  :-)  It says:

 2,671 code results 

and then the page is blank?

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no






* bug#51292: 27.2; Reversing strings with unicode combining characters
  2021-10-19 21:21     ` Howard Melman
@ 2021-10-20  8:58       ` Lars Ingebrigtsen
  0 siblings, 0 replies; 12+ messages in thread
From: Lars Ingebrigtsen @ 2021-10-20  8:58 UTC
  To: Howard Melman; +Cc: 51292

Howard Melman <hmelman@gmail.com> writes:

>> But it might make sense to add a function to tokenize a string into
>> grapheme clusters -- I can see that being useful.  Then the caller can
>> chop and reverse the list of clusters as they wish.
>>
>> `string-tokenize-graphemes'?
>
> I agree that seems potentially useful.

It's not that common to have un-normalised strings, though, and if you
normalise the string first you get

(reverse (ucs-normalize-NFC-string "nai\u0308ve"))
=> "evïan"

as expected.  So I think adding more utility functions here wouldn't be
productive (i.e., I don't think they would actually be useful for
people -- they'd just complicate things for users further).

So I think everything here is basically working as designed in Emacs,
and that the design is fine.  (And that the s-reverse isn't good.)  So
I'm closing this bug report.
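
A small sketch of the normalisation approach described above.  NFC
composes "i" followed by U+0308 into the single character U+00EF, so a
plain `reverse' has nothing left to separate.  One caveat, which is an
addition here and not something stated in the thread: this only helps
where Unicode defines a precomposed character.  A sequence such as "q"
followed by U+0308 has none and stays two characters after NFC.

```elisp
(require 'ucs-normalize)

(let ((s (ucs-normalize-NFC-string "nai\u0308ve")))
  ;; After NFC the string has 5 characters instead of 6.
  (list (length s) (reverse s)))
;; => (5 "evïan")

;; No precomposed form exists for q + U+0308, so NFC leaves two
;; characters and `reverse' would still move the mark.
(length (ucs-normalize-NFC-string "q\u0308"))
;; => 2
```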

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no






* bug#51292: 27.2; Reversing strings with unicode combining characters
  2021-10-19 19:26 ` Lars Ingebrigtsen
  2021-10-19 20:50   ` Lars Ingebrigtsen
@ 2021-10-20 11:45   ` Eli Zaretskii
  1 sibling, 0 replies; 12+ messages in thread
From: Eli Zaretskii @ 2021-10-20 11:45 UTC
  To: Lars Ingebrigtsen; +Cc: hmelman, 51292

> From: Lars Ingebrigtsen <larsi@gnus.org>
> Date: Tue, 19 Oct 2021 21:26:31 +0200
> Cc: 51292@debbugs.gnu.org
> 
> Howard Melman <hmelman@gmail.com> writes:
> 
> > Reversing a string fails to account for unicode combining characters
> >
> >     (reverse "nai\u0308ve")
> >     "ev̈ian"
> >
> > Note the diaeresis is now on the v and not the i.  s-reverse gets it right:
> >
> >     (s-reverse "nai\u0308ve")
> >     "evïan"
> 
> So I wondered what s-reverse did, and indeed:
> 
> (defun s-reverse (s)
>   "Return the reverse of S."
>   (declare (pure t) (side-effect-free t))
>   (save-match-data
>     (if (multibyte-string-p s)
>         (let ((input (string-to-list s))
>               output)
>           (require 'ucs-normalize)
>           (while input
>             ;; Handle entire grapheme cluster as a single unit
>             (let ((grapheme (list (pop input))))
>               (while (memql (car input) ucs-normalize-combining-chars)
>                 (push (pop input) grapheme))
>               (setq output (nconc (nreverse grapheme) output))))
>           (concat output))
>       (concat (nreverse (string-to-list s))))))
> 
> Emacs has string-reverse, obsolete since 25.1.  Perhaps we should
> reintroduce it and use the definition from s?

I don't understand the use case(s) where this could be useful.  If
this is for display, then displaying text needs much more than just
combining accents with the base characters.  E.g., what if the accent
should not combine when the order is reversed, i.e. the composition
rules depend on the following characters as well?  And what if
character composition is not due to normalization rules?  Or what if
the text includes bidirectional scripts, whose reversal rules are
either very complex or simply undefined?

If this is not for display, then where is this useful and why?

If someone can describe real-life use cases, we could reason whether
doing something like that could be useful enough.  Without that, the
code in s-reverse seems like an incomplete semi-feature which supports
some limited use cases that someone needed in some specific situation,
not a useful general feature that handles the issue anywhere close to
completeness.






* bug#51292: 27.2; Reversing strings with unicode combining characters
  2021-10-19 20:50   ` Lars Ingebrigtsen
  2021-10-19 21:21     ` Howard Melman
  2021-10-19 23:13     ` Stefan Kangas
@ 2021-10-20 11:50     ` Eli Zaretskii
  2 siblings, 0 replies; 12+ messages in thread
From: Eli Zaretskii @ 2021-10-20 11:50 UTC
  To: Lars Ingebrigtsen; +Cc: hmelman, 51292

> From: Lars Ingebrigtsen <larsi@gnus.org>
> Date: Tue, 19 Oct 2021 22:50:15 +0200
> Cc: 51292@debbugs.gnu.org
> 
> But it might make sense to add a function to tokenize a string into
> grapheme clusters -- I can see that being useful.  Then the caller can
> chop and reverse the list of clusters as they wish.

We have find-composition; isn't that sufficient?  Especially since we
don't really understand the use case?






* bug#51292: 27.2; Reversing strings with unicode combining characters
  2021-10-20  8:11       ` Lars Ingebrigtsen
@ 2021-10-20 13:02         ` Stefan Kangas
  2021-10-21  2:50           ` Lars Ingebrigtsen
  0 siblings, 1 reply; 12+ messages in thread
From: Stefan Kangas @ 2021-10-20 13:02 UTC
  To: Lars Ingebrigtsen; +Cc: Howard Melman, 51292

Lars Ingebrigtsen <larsi@gnus.org> writes:

> Stefan Kangas <stefan@marxist.se> writes:
>
>> I guess not often, if I'm reading the results of this GitHub search
>> right:
>>
>> https://github.com/search?l=Emacs+Lisp&o=desc&q=s-reverse+-filename%3Aexamples.el+-filename%3As-tests.el&s=&type=Code
>
> I'm not sure how to read it.  :-)  It says:
>
>  2,671 code results
>
> and then the page is blank?

OK, so it's not just here that this happens...

I believe this means that there is no match?  Try removing
e.g. "-filename:examples.el" and you should see matches again, but all
of them are for that file.  Add that restriction again, and you see no
matches.  I think.






* bug#51292: 27.2; Reversing strings with unicode combining characters
  2021-10-20 13:02         ` Stefan Kangas
@ 2021-10-21  2:50           ` Lars Ingebrigtsen
  2021-10-21  3:51             ` Stefan Kangas
  0 siblings, 1 reply; 12+ messages in thread
From: Lars Ingebrigtsen @ 2021-10-21  2:50 UTC
  To: Stefan Kangas; +Cc: Howard Melman, 51292

Stefan Kangas <stefan@marxist.se> writes:

> I believe this means that there is no match?  Try removing
> e.g. "-filename:examples.el" and you should see matches again, but all
> of them are for that file.  Add that restriction again, and you see no
> matches.  I think.

There's something very odd going on with that search.  If I remove
"-filename:examples.el", then I get a bunch of matches from files like
src/gdi/gdiTools.f?  Very odd.

https://github.com/search?q=s-reverse+-filename%3As-tests.el&type=Code

But none of the matches on the first few pages refer to the s.el
s-reverse, so I think the conclusion is the same -- it's not a function
that's actually used for anything.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no






* bug#51292: 27.2; Reversing strings with unicode combining characters
  2021-10-21  2:50           ` Lars Ingebrigtsen
@ 2021-10-21  3:51             ` Stefan Kangas
  0 siblings, 0 replies; 12+ messages in thread
From: Stefan Kangas @ 2021-10-21  3:51 UTC
  To: Lars Ingebrigtsen; +Cc: Howard Melman, 51292

Lars Ingebrigtsen <larsi@gnus.org> writes:

> There's something very odd going on with that search.  If I remove
> "-filename:examples.el", then I get a bunch of matches from files like
> src/gdi/gdiTools.f?  Very odd.

Indeed, the GitHub search is not working very well.

> https://github.com/search?q=s-reverse+-filename%3As-tests.el&type=Code
>
> But none of the matches on the first few pages refer to the s.el

With that link, I think you need to click the "Emacs Lisp" button to see
them?  I do that and see 2743 matches, but all of them are copies of
examples.el from s.el itself.

Ah, on page 10 I see some matches in ensime-completion-util.el, but then
from page 18 or so I start seeing only matches in s.el itself again.  I
didn't go much further, but it doesn't seem to be very popular.  When I
sort by "Recently indexed", I see only matches in s.el again.

(The GitHub user interface is horrible, BTW.)

> s-reverse, so I think the conclusion is the same -- it's not a function
> that's actually used for anything.

I agree.  It was probably added because it is there in Clojure, where it
makes sense as they tend to favour immutable data.  Whereas we just do
`replace-match', etc.






