bug#17130: 24.4.50; Deficient Unicode case folding


* bug#17130: 24.4.50; Deficient Unicode case folding
@ 2014-03-28 12:07 Nathan Trapuzzano
  2014-03-28 15:51 ` Eli Zaretskii
  2019-09-29 14:23 ` Lars Ingebrigtsen
  0 siblings, 2 replies; 17+ messages in thread
From: Nathan Trapuzzano @ 2014-03-28 12:07 UTC (permalink / raw)
  To: 17130

M-: (compare-strings "σ" nil nil "ς" nil nil t)

==> -1  ;; should be t

Can someone that knows a thing about Unicode and emacs case tables speak
to whether the latter could suffice for implementing full Unicode case
folding?






* bug#17130: 24.4.50; Deficient Unicode case folding
  2014-03-28 12:07 bug#17130: 24.4.50; Deficient Unicode case folding Nathan Trapuzzano
@ 2014-03-28 15:51 ` Eli Zaretskii
  2014-03-28 19:31   ` nbtrap
  2019-09-29 14:23 ` Lars Ingebrigtsen
  1 sibling, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2014-03-28 15:51 UTC (permalink / raw)
  To: Nathan Trapuzzano; +Cc: 17130

> From: Nathan Trapuzzano <nbtrap@nbtrap.com>
> Date: Fri, 28 Mar 2014 08:07:20 -0400
> 
> M-: (compare-strings "σ" nil nil "ς" nil nil t)
> 
> ==> -1  ;; should be t

No, because these characters are not a case pair.

> Can someone that knows a thing about Unicode and emacs case tables speak
> to whether the latter could suffice for implementing full Unicode case
> folding?

What is "full Unicode case folding"?






* bug#17130: 24.4.50; Deficient Unicode case folding
  2014-03-28 15:51 ` Eli Zaretskii
@ 2014-03-28 19:31   ` nbtrap
  2014-03-29  6:45     ` Eli Zaretskii
  0 siblings, 1 reply; 17+ messages in thread
From: nbtrap @ 2014-03-28 19:31 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 17130

Eli Zaretskii <eliz@gnu.org> writes:

>> M-: (compare-strings "σ" nil nil "ς" nil nil t)
>> 
>> ==> -1  ;; should be t
>
> No, because these characters are not a case pair.

They're not a case pair in Emacs, but they should compare equally under
Unicode case folding.

>> Can someone that knows a thing about Unicode and emacs case tables speak
>> to whether the latter could suffice for implementing full Unicode case
>> folding?
>
> What is "full Unicode case folding"?

Something that implements this:
http://www.unicode.org/Public/UNIDATA/CaseFolding.txt

And perhaps more.  I don't know, but someone on this list probably does.

If you look about a third of the way down, there's a line saying that
U+03C2 (ς) should fold into U+03C3 (σ).
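
For reference, entries in CaseFolding.txt have the form "code; status; mapping; # name". The following is a minimal sketch of reading such lines in Emacs Lisp. It is not how Emacs imports Unicode data; the function name my-parse-fold-line is hypothetical, and the three sample lines are the file's entries for the characters discussed in this thread.

```elisp
;; Hypothetical helper, for illustration only.
(defun my-parse-fold-line (line)
  "Parse a CaseFolding.txt LINE into (CODE STATUS MAPPING), or nil.
CODE is a character, STATUS a one-letter string, and MAPPING a
list of characters."
  (when (string-match
         "\\`\\([0-9A-F]+\\); \\([CFST]\\); \\([0-9A-F ]+\\);" line)
    ;; Read all match groups before `split-string' clobbers match data.
    (let ((code (string-to-number (match-string 1 line) 16))
          (status (match-string 2 line))
          (hex-list (split-string (match-string 3 line) " " t)))
      (list code status
            (mapcar (lambda (hex) (string-to-number hex 16)) hex-list)))))

(mapcar #'my-parse-fold-line
        '("03A3; C; 03C3; # GREEK CAPITAL LETTER SIGMA"
          "03C2; C; 03C3; # GREEK SMALL LETTER FINAL SIGMA"
          "00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S"))
;; => ((931 "C" (963)) (962 "C" (963)) (223 "F" (115 115)))
```

Status C marks a common folding, F a full folding that may expand to several characters, S a simple single-character folding, and T a Turkic special case. Both Σ and ς fold to σ, while ß only has a multi-character (F) folding in this sample.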






* bug#17130: 24.4.50; Deficient Unicode case folding
  2014-03-28 19:31   ` nbtrap
@ 2014-03-29  6:45     ` Eli Zaretskii
       [not found]       ` <87ob0pnptc.fsf@nbtrap.com>
  0 siblings, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2014-03-29  6:45 UTC (permalink / raw)
  To: nbtrap; +Cc: 17130

> From: nbtrap@nbtrap.com
> Cc: 17130@debbugs.gnu.org
> Date: Fri, 28 Mar 2014 15:31:09 -0400
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> M-: (compare-strings "σ" nil nil "ς" nil nil t)
> >> 
> >> ==> -1  ;; should be t
> >
> > No, because these characters are not a case pair.
> 
> They're not a case pair in Emacs, but they should compare equally under
> Unicode case folding.

Emacs doesn't currently support that.

> >> Can someone that knows a thing about Unicode and emacs case tables speak
> >> to whether the latter could suffice for implementing full Unicode case
> >> folding?
> >
> > What is "full Unicode case folding"?
> 
> Something that implements this:
> http://www.unicode.org/Public/UNIDATA/CaseFolding.txt
> 
> And perhaps more.  I don't know, but someone on this list probably does.
> 
> If you look about a third of the way down, there's a line saying that
> U+03C2 (ς) should fold into U+03C3 (σ).

Patches are welcome to import those tables into Emacs, and make case
folding support them.






[parent not found: <87ob0pnptc.fsf@nbtrap.com>]

* bug#17130: 24.4.50; Deficient Unicode case folding
       [not found]       ` <87ob0pnptc.fsf@nbtrap.com>
@ 2014-03-29 13:15         ` Eli Zaretskii
  2014-03-29 14:03           ` Nathan Trapuzzano
  0 siblings, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2014-03-29 13:15 UTC (permalink / raw)
  To: Nathan Trapuzzano; +Cc: 17130

> From: Nathan Trapuzzano <nbtrap@nbtrap.com>
> Cc: 17130@debbugs.gnu.org
> Date: Sat, 29 Mar 2014 08:37:35 -0400
> 
> Reading through the manual section on case tables, it seems that this
> could be supported via the extra "canonicalize" slot:
> 
>     CANONICALIZE
>       The canonicalize table maps all of a set of case-related
>       characters into a particular member of that set.

Not efficiently, no.  E.g., how will you find ς from σ, using this
method?

Besides, don't we also need to know that ς can only be present at the
end of a word?

Or maybe I'm misunderstanding what you meant?

> If this isn't already used for Unicode case folding, what _is_ it used
> for?

It is used for case-insensitive regexp matching, see search.c.






* bug#17130: 24.4.50; Deficient Unicode case folding
  2014-03-29 13:15         ` Eli Zaretskii
@ 2014-03-29 14:03           ` Nathan Trapuzzano
  2014-03-29 14:45             ` Eli Zaretskii
  0 siblings, 1 reply; 17+ messages in thread
From: Nathan Trapuzzano @ 2014-03-29 14:03 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 17130

Eli Zaretskii <eliz@gnu.org> writes:

>> Reading through the manual section on case tables, it seems that this
>> could be supported via the extra "canonicalize" slot:
>> 
>>     CANONICALIZE
>>       The canonicalize table maps all of a set of case-related
>>       characters into a particular member of that set.
>
> Not efficiently, no.  E.g., how will you find ς from σ, using this
> method?

σ, ς, and Σ would all have σ in the CANONICALIZE slot, since they all
fold to σ.  (By the way, ς should upcase to Σ--that much I know the case
tables can handle.)

> Besides, don't we also need to know that ς can only be present at the
> end of a word?

Don't think so.  AFAIK, Unicode says nothing about ordering except when
it comes to combining characters.  But even if it did prescribe such a
rule, I don't think it would have anything to do with case folding.

>> If this isn't already used for Unicode case folding, what _is_ it used
>> for?
>
> It is used for case-insensitive regexp matching, see search.c.

Right, but what I'm asking is: if Emacs doesn't do Unicode case folding,
what is the purpose of the CANONICALIZE slot except as a kind of
placeholder that gets autofilled?  Are there other kinds of case
folding--other than traditional upper/lower and Unicode--that I'm not
aware of?  I understand that Emacs autofills the CANONICALIZE slot from
the other slots, but only when the CANONICALIZE slot is not already set
to non-nil.  What if the CANONICALIZE slot on ς were set to σ?  I think
that's all that would have to happen for the Unicode folding to work.
It seems the machinery is already in place.


* bug#17130: 24.4.50; Deficient Unicode case folding
  2014-03-29 14:03           ` Nathan Trapuzzano
@ 2014-03-29 14:45             ` Eli Zaretskii
  2014-03-29 15:29               ` Nathan Trapuzzano
  0 siblings, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2014-03-29 14:45 UTC (permalink / raw)
  To: Nathan Trapuzzano; +Cc: 17130

> From: Nathan Trapuzzano <nbtrap@nbtrap.com>
> Cc: 17130@debbugs.gnu.org
> Date: Sat, 29 Mar 2014 10:03:32 -0400
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> Reading through the manual section on case tables, it seems that this
> >> could be supported via the extra "canonicalize" slot:
> >> 
> >>     CANONICALIZE
> >>       The canonicalize table maps all of a set of case-related
> >>       characters into a particular member of that set.
> >
> > Not efficiently, no.  E.g., how will you find ς from σ, using this
> > method?
> 
> σ, ς, and Σ would all have σ in the CANONICALIZE slot, since they all
> fold to σ.

So you would need to search all characters to find those which have σ
in the CANONICALIZE slot -- not very efficient, to say the least.

IOW, what you suggest will provide a one-way mapping, whereas we need
a two-way mapping.

> > Besides, don't we also need to know that ς can only be present at the
> > end of a word?
> 
> Don't think so.  AFAIK, Unicode says nothing about ordering except when
> it comes to combining characters.  But even if it did prescribe such a
> rule, I don't think it would have anything to do with case folding.

Who said this is only about case folding?  Emacs should use this data
for up-casing and down-casing as well, for example, so that M-l
downcases Σ to ς, not σ, when it is at the end of the word.  Wouldn't
users of Greek expect that?

> >> If this isn't already used for Unicode case folding, what _is_ it used
> >> for?
> >
> > It is used for case-insensitive regexp matching, see search.c.
> 
> Right, but what I'm asking is: if Emacs doesn't do Unicode case folding,
> what is the purpose of the CANONICALIZE slot except as a kind of
> placeholder that gets autofilled?

Whenever you need the canonical equivalent of a character, such as in
case-insensitive search, you need that slot.

> Are there other kinds of case folding--other than traditional
> upper/lower and Unicode--that I'm not aware of?

There's "title case", of course.  There are also characters whose case
pair is not a single character, but several, like the upper-case
variant of ß in German.  Basically, any character not marked "C" in
the Unicode CaseFolding.txt is special in some way.

> I understand that Emacs autofills the CANONICALIZE slot from
> the other slots, but only when the CANONICALIZE slot is not already set
> to non-nil.  What if the CANONICALIZE slot on ς were set to σ?  I think
> that's all that would have to happen for the Unicode folding to work.
> It seems the machinery is already in place.

For this case, maybe (and even then it doesn't handle Σ correctly, I think,
when downcased at the end of the word).  For other cases, not
necessarily.

Personally, I think we need an additional slot for what you want, and
code to use it.
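
The context-dependent downcasing mentioned above can be sketched as follows. This is a deliberate simplification and not what M-l does: the helper name my-downcase-final-sigma is hypothetical, the input is assumed to be a single word, the buffer is assumed to use the standard case table, and the real Final_Sigma condition in Unicode's SpecialCasing.txt is more involved than "last character of the string".

```elisp
;; Hypothetical helper; a simplification of the Final_Sigma rule.
(defun my-downcase-final-sigma (word)
  "Downcase WORD, mapping a word-final Σ to ς and any other Σ to σ.
A lone Σ is not treated as final and becomes σ."
  (let ((n (length word))
        (chars '()))
    (dotimes (i n)
      (let ((c (aref word i)))
        (push (if (and (eq c ?Σ) (> n 1) (= i (1- n)))
                  ?ς             ; Σ in final position
                (downcase c))    ; everything else: plain case table
              chars)))
    (concat (nreverse chars))))

(my-downcase-final-sigma "ΟΔΟΣ")   ; the final Σ becomes ς
(my-downcase-final-sigma "ΣΟΦΙΑ")  ; the initial Σ becomes σ
```

The point of the sketch is that the result depends on position within the word, which a per-character table lookup alone cannot express.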


* bug#17130: 24.4.50; Deficient Unicode case folding
  2014-03-29 14:45             ` Eli Zaretskii
@ 2014-03-29 15:29               ` Nathan Trapuzzano
  2014-03-29 17:37                 ` Eli Zaretskii
  0 siblings, 1 reply; 17+ messages in thread
From: Nathan Trapuzzano @ 2014-03-29 15:29 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 17130

Eli Zaretskii <eliz@gnu.org> writes:

>> σ, ς, and Σ would all have σ in the CANONICALIZE slot, since they all
>> fold to σ.
>
> So you would need to search all characters to find those which have σ
> in the CANONICALIZE slot -- not very efficient, to say the least.

Doesn't this already happen?  If not, then what is the CANONICALIZE slot
doing that couldn't be done with the regular upcase/downcase slots by
themselves?

> IOW, what you suggest will provide a one-way mapping, whereas we need
> a two-way mapping.

Not sure I follow.  Seems to me the CANONICALIZE slot is sufficient, at
least in principle.

>> > Besides, don't we also need to know that ς can only be present at the
>> > end of a word?
>> 
>> Don't think so.  AFAIK, Unicode says nothing about ordering except when
>> it comes to combining characters.  But even if it did prescribe such a
>> rule, I don't think it would have anything to do with case folding.
>
> Who said this is only about case folding?

I should have said just "case", not "case folding".

> Emacs should use this data for up-casing and down-casing as well, for
> example, so that M-l downcases Σ to ς, not σ, when it is at the end of
> the word.  Wouldn't users of Greek expect that?

Maybe.  I'm just saying that Unicode itself doesn't prescribe or even
recommend such behavior.  It defines case conversions independently of
ordering.

That said, making M-l downcase terminal Σ to ς would be a nice feature
that could be enabled, e.g., by enabling a minor mode or by modifying
some *-functions variable of functions that get called before the normal
behavior of M-l is applied, etc.  But it shouldn't have anything to do
with Unicode-compliant case-insensitive searching.

>> Right, but what I'm asking is: if Emacs doesn't do Unicode case folding,
>> what is the purpose of the CANONICALIZE slot except as a kind of
>> placeholder that gets autofilled?
>
> Whenever you need the canonical equivalent of a character, such as in
> case-insensitive search, you need that slot.

But there's nothing about the slot that mandates that only _pairs_ can
be case-equivalent under case folding.  Indeed, the manual speaks of
"sets" of characters that might be equivalent under case-folding, hence
my understanding that σ, ς, and Σ can all have σ in their CANONICALIZE
slot, and that's all it would take.

(Btw, I'm using "case-insensitive" to mean the same as "under
case-folding".)

>> Are there other kinds of case folding--other than traditional
>> upper/lower and Unicode--that I'm not aware of?
>
> There's "title case", of course.  

I think title case would require an extra slot in the case table.

> There are also characters whose case pair is not a single character,
> but several, like the upper-case variant of ß in German.

Good point.  "ß" should fold to "ss".  I guess for the CANONICALIZE slot
to suffice, it would have to map to a string, not a code point.

> Personally, I think we need an additional slot for what you want, and
> code to use it.

Given the point about ß, you're probably right.  Unless we can make
entries in the CANONICALIZE slot be strings rather than code points.


* bug#17130: 24.4.50; Deficient Unicode case folding
  2014-03-29 15:29               ` Nathan Trapuzzano
@ 2014-03-29 17:37                 ` Eli Zaretskii
  2014-03-29 18:31                   ` Nathan Trapuzzano
  0 siblings, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2014-03-29 17:37 UTC (permalink / raw)
  To: Nathan Trapuzzano; +Cc: 17130

> From: Nathan Trapuzzano <nbtrap@nbtrap.com>
> Cc: 17130@debbugs.gnu.org
> Date: Sat, 29 Mar 2014 11:29:43 -0400
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> σ, ς, and Σ would all have σ in the CANONICALIZE slot, since they all
> >> fold to σ.
> >
> > So you would need to search all characters to find those which have σ
> > in the CANONICALIZE slot -- not very efficient, to say the least.
> 
> Doesn't this already happen?

No, not when that slot is used for case-insensitive search.  You just
use it to get the canonical equivalent, i.e. use the one-way mapping
that it provides.

> If not, then what is the CANONICALIZE slot doing that couldn't be
> done with the regular upcase/downcase slots by themselves?

If that slot is "trivial", i.e. contains the lower-case variant of the
character, then indeed this slot doesn't add information, I think,
only utility.  But it doesn't have to contain the lower-case variant.

> > IOW, what you suggest will provide a one-way mapping, whereas we need
> > a two-way mapping.
> 
> Not sure I follow.  Seems to me the CANONICALIZE slot is sufficient, at
> least in principle.

It is sufficient for mapping a character to its canonical equivalent,
but not for finding the non-canonical variants of a canonical character.
IOW, it is not well suited to finding ς given just σ.

> > Emacs should use this data for up-casing and down-casing as well, for
> > example, so that M-l downcases Σ to ς, not σ, when it is at the end of
> > the word.  Wouldn't users of Greek expect that?
> 
> Maybe.  I'm just saying that Unicode itself doesn't prescribe or even
> recommend such behavior.  It defines case conversions independently of
> ordering.
> 
> That said, making M-l downcase terminal Σ to ς would be a nice feature
> that could be enabled, e.g., by enabling a minor mode or by modifying
> some *-functions variable of functions that get called before the normal
> behavior of M-l is applied, etc.  But it shouldn't have anything to do
> with Unicode-compliant case-insensitive searching.

For searching, you only need the CANONICALIZE slot.  But what about
replacing the search string while keeping the letter case in the
replacement?  For that, CANONICALIZE alone is not enough, you need the
reverse mapping.

> > Personally, I think we need an additional slot for what you want, and
> > code to use it.
> 
> Given the point about ß, you're probably right.  Unless we can make
> entries in the CANONICALIZE slot be strings rather than code points.

This is Lisp; a vector slot can contain any Lisp object.  But using
CANONICALIZE for what you want would be wrong, I think, because it
will screw up case-insensitive search, which expects to find there a
single character.
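
The one-way versus two-way distinction can be modelled with plain Lisp data. This is a minimal sketch that uses a hypothetical alist, my-canon-alist, as a stand-in for a CANONICALIZE table; it does not touch real case tables. Mapping a character to its canonical form is a single lookup, whereas listing every variant of σ requires either a scan over all entries or a precomputed inverse index.

```elisp
;; Toy stand-in for a CANONICALIZE table: (CHAR . CANONICAL-CHAR).
(defvar my-canon-alist '((?Σ . ?σ) (?ς . ?σ) (?σ . ?σ)
                         (?Α . ?α) (?α . ?α)))

(defun my-canon (char)
  "Return the canonical form of CHAR, or CHAR itself if unlisted."
  (or (cdr (assq char my-canon-alist)) char))

(defun my-build-inverse (alist)
  "Return a hash table mapping each canonical char to its variants."
  (let ((inverse (make-hash-table :test 'eql)))
    (dolist (pair alist)
      (push (car pair) (gethash (cdr pair) inverse)))
    inverse))

(my-canon ?ς)                                   ; forward: one lookup
(gethash ?σ (my-build-inverse my-canon-alist))  ; reverse: needs the index
```

The Emacs Lisp manual also describes an EQUIVALENCES slot in case tables, which maps each character of a case-related set to the next character in that set and so serves a related purpose.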


* bug#17130: 24.4.50; Deficient Unicode case folding
  2014-03-29 17:37                 ` Eli Zaretskii
@ 2014-03-29 18:31                   ` Nathan Trapuzzano
  2014-03-29 18:36                     ` Nathan Trapuzzano
  2014-03-29 19:50                     ` Eli Zaretskii
  0 siblings, 2 replies; 17+ messages in thread
From: Nathan Trapuzzano @ 2014-03-29 18:31 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 17130

Eli Zaretskii <eliz@gnu.org> writes:

>> > So you would need to search all characters to find those which have σ
>> > in the CANONICALIZE slot -- not very efficient, to say the least.
>> 
>> Doesn't this already happen?
>
> No, not when that slot is used for case-insensitive search.  You just
> use it to get the canonical equivalent, i.e. use the one-way mapping
> that it provides.

I still don't get it.  What I say below may explain why.

>> If not, then what is the CANONICALIZE slot doing that couldn't be
>> done with the regular upcase/downcase slots by themselves?
>
> If that slot is "trivial", i.e. contains the lower-case variant of the
> character, then indeed this slot doesn't add information, I think,
> only utility.  But it doesn't have to contain the lower-case variant.

I know.  But if Emacs doesn't do Unicode folding, what is there other
than lower/upper variants?

>> > IOW, what you suggest will provide a one-way mapping, whereas we need
>> > a two-way mapping.
>> 
>> Not sure I follow.  Seems to me the CANONICALIZE slot is sufficient, at
>> least in principle.
>
> It is sufficient for mapping a character to its canonical equivalent,
> but not for finding the non-canonical variants of a canonical character.
> IOW, it is not well suited to finding ς given just σ.

Finding the non-canonical variants is not something that happens (at
least in principle) during case-insensitive matching.  You convert both
the matching string and the string being matched into their canonical
equivalents and see if they match.  You never UNfold.  Case folding is
by definition a one-way operation.

>> That said, making M-l downcase terminal Σ to ς would be a nice feature
>> that could be enabled, e.g., by enabling a minor mode or by modifying
>> some *-functions variable of functions that get called before the normal
>> behavior of M-l is applied, etc.  But it shouldn't have anything to do
>> with Unicode-compliant case-insensitive searching.
>
> For searching, you only need the CANONICALIZE slot.  But what about
> replacing the search string while keeping the letter case in the
> replacement?  For that, CANONICALIZE alone is not enough, you need the
> reverse mapping.

There is no reverse mapping when it comes to folding.  There can't be,
since multiple characters can fold into the same character.

I don't fully understand what "case-replace" does (e.g. case being a
property of characters and not strings, what does it mean to "preserve
case" when replacing a string of length x with a string of length y
where x != y), but I don't think Unicode folding would complicate it.
There are three cases in Unicode: lower, upper, and title.  Upper and
title already overlap for the vast majority of codepoints, so there you
already have problems with a case-preserving replace.  That said "fold"
is not a case in Unicode; it's a one-way mapping of non-overlapping sets
of characters to a canonical equivalent, so it makes no sense to talk
about preserving case with respect to case folding.

Notandum: I was wrong about Unicode saying nothing about character
ordering for non-combining characters.  The "special casing" document
(ftp://ftp.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt) contains
context- and language- dependent case rules for certain characters,
including final sigma.  Notably, the document says that Σ in terminal
position should (or "may"--I'm not really sure about how to interpret
the document) downcase to ς.  That said, the document has _nothing_ to
do with case _folding_, which is always context- and language-
independent.

Rightly interpreted, therefore, case _conversion_ (such as in
case-preserving replace) and case-insensitive _searching_ (i.e. case
folding), according to Unicode, are orthogonal.  We don't have to
address both at the same time.

>> Given the point about ß, you're probably right.  Unless we can make
>> entries in the CANONICALIZE slot be strings rather than code points.
>
> This is Lisp; a vector slot can contain any Lisp object.  But using
> CANONICALIZE for what you want would be wrong, I think, because it
> will screw up case-insensitive search, which expects to find there a
> single character.

Right, that's what I meant.  Putting strings there would break
something.
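
The fold-both-sides-then-compare approach described above can be sketched with a string-valued fold table, which also covers the ß case. This is a minimal sketch under stated assumptions: my-full-fold-alist is a hypothetical three-entry excerpt, characters outside it simply fall back to downcase, and none of this reflects how compare-strings or search.c actually work.

```elisp
;; Toy excerpt of full case folding: (CHAR . FOLDED-STRING).
(defvar my-full-fold-alist '((?Σ . "σ") (?ς . "σ") (?ß . "ss")))

(defun my-fold-string (text)
  "Return TEXT with every character folded.
Characters missing from `my-full-fold-alist' fall back to `downcase'."
  (mapconcat (lambda (c)
               (or (cdr (assq c my-full-fold-alist))
                   (string (downcase c))))
             text ""))

(defun my-fold-equal-p (a b)
  "Return non-nil if A and B are equal after folding both."
  (string= (my-fold-string a) (my-fold-string b)))

(my-fold-equal-p "ΟΔΟΣ" "οδος")      ; => t
(my-fold-equal-p "Straße" "STRASSE") ; => t
```

Nothing is ever unfolded here: both strings are mapped forward and compared, so no reverse mapping is needed for matching. Because a fold may be longer than one character, the table values are strings, which is exactly why a character-valued slot would not suffice for ß.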


* bug#17130: 24.4.50; Deficient Unicode case folding
  2014-03-29 18:31                   ` Nathan Trapuzzano
@ 2014-03-29 18:36                     ` Nathan Trapuzzano
  2014-03-29 19:51                       ` Eli Zaretskii
  2014-03-29 19:50                     ` Eli Zaretskii
  1 sibling, 1 reply; 17+ messages in thread
From: Nathan Trapuzzano @ 2014-03-29 18:36 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 17130

Nathan Trapuzzano <nbtrap@nbtrap.com> writes:

> Rightly interpreted, therefore, case _conversion_ (such as in
> case-preserving replace) and case-insensitive _searching_ (i.e. case
> folding), according to Unicode, are orthogonal.  We don't have to
> address both at the same time.

Er, let me rephrase.  Case _conversion_ (such as in case-preserving
replace) and case _folding_ (such as ought to be used in case-insensitive
searching) are orthogonal.






* bug#17130: 24.4.50; Deficient Unicode case folding
  2014-03-29 18:36                     ` Nathan Trapuzzano
@ 2014-03-29 19:51                       ` Eli Zaretskii
  2014-03-29 20:15                         ` Nathan Trapuzzano
  0 siblings, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2014-03-29 19:51 UTC (permalink / raw)
  To: Nathan Trapuzzano; +Cc: 17130

> From: Nathan Trapuzzano <nbtrap@nbtrap.com>
> Cc: 17130@debbugs.gnu.org
> Date: Sat, 29 Mar 2014 14:36:42 -0400
> 
> Er, let me rephrase.  Case _conversion_ (such as in case-preserving
> replace) and case _folding_ (such as ought to be used in case-insensitive
> searching) are orthogonal.

But they can very well use the same database.






* bug#17130: 24.4.50; Deficient Unicode case folding
  2014-03-29 19:51                       ` Eli Zaretskii
@ 2014-03-29 20:15                         ` Nathan Trapuzzano
  2014-03-30  2:45                           ` Eli Zaretskii
  0 siblings, 1 reply; 17+ messages in thread
From: Nathan Trapuzzano @ 2014-03-29 20:15 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 17130

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Nathan Trapuzzano <nbtrap@nbtrap.com>
>> Cc: 17130@debbugs.gnu.org
>> Date: Sat, 29 Mar 2014 14:36:42 -0400
>> 
>> Er, let me rephrase.  Case _conversion_ (such as in case-preserving
>> replace) and case _folding_ (such as ought to be used in case-insensitive
>> searching) are orthogonal.
>
> But they can very well use the same database.

It's not clear what you mean.

We already have a place to store upper- and lower- case variants.  What
I'm proposing is to use the CANONICALIZE slot as a place to store the
case-folding mapping.  If this would mess up Emacs' case-preserving
replace, then I think that would just mean that case-preserving replace
is broken.  There is no such case as "canonicalize"--you can't say, "Oh,
this string is in the canonical case, so I want to replace it with
this other string in canonical case".  A case-preserving replace should
only consult the upper- and lower-case slots (and perhaps the title-case
slot if it existed).


* bug#17130: 24.4.50; Deficient Unicode case folding
  2014-03-29 20:15                         ` Nathan Trapuzzano
@ 2014-03-30  2:45                           ` Eli Zaretskii
  0 siblings, 0 replies; 17+ messages in thread
From: Eli Zaretskii @ 2014-03-30  2:45 UTC (permalink / raw)
  To: Nathan Trapuzzano; +Cc: 17130

> From: Nathan Trapuzzano <nbtrap@nbtrap.com>
> Cc: 17130@debbugs.gnu.org
> Date: Sat, 29 Mar 2014 16:15:34 -0400
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> From: Nathan Trapuzzano <nbtrap@nbtrap.com>
> >> Cc: 17130@debbugs.gnu.org
> >> Date: Sat, 29 Mar 2014 14:36:42 -0400
> >> 
> >> Er, let me rephrase.  Case _conversion_ (such as in case-preserving
> >> replace) and case _folding_ (such as ought to be used in case-insensitive
> >> searching) are orthogonal.
> >
> > But they can very well use the same database.
> 
> It's not clear what you mean.

You keep asking questions about the purpose of the CANONICALIZE slot,
and I keep trying to explain that purpose.

> We already have a place to store upper- and lower- case variants.  What
> I'm proposing is to use the CANONICALIZE slot as a place to store the
> case-folding mapping.  If this would mess up Emacs' case-preserving
> replace, then I think that would just mean that case-preserving replace
> is broken.  There is no such case as "canonicalize"--you can't say, "Oh,
> this string is in the canonical case, so I want to replace it with
> this other string in canonical case".  A case-preserving replace should
> only consult the upper- and lower-case slots (and perhaps the title-case
> slot if it existed).

Perhaps you should explain what this means in practice, from the POV
of populating the CANONICALIZE slot, and how that content would be
used under your proposal.  That should make the discussion more
useful, I hope.






* bug#17130: 24.4.50; Deficient Unicode case folding
  2014-03-29 18:31                   ` Nathan Trapuzzano
  2014-03-29 18:36                     ` Nathan Trapuzzano
@ 2014-03-29 19:50                     ` Eli Zaretskii
  2014-03-29 20:01                       ` Nathan Trapuzzano
  1 sibling, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2014-03-29 19:50 UTC (permalink / raw)
  To: Nathan Trapuzzano; +Cc: 17130

> From: Nathan Trapuzzano <nbtrap@nbtrap.com>
> Cc: 17130@debbugs.gnu.org
> Date: Sat, 29 Mar 2014 14:31:52 -0400
> 
> >> If not, then what is the CANONICALIZE slot doing that couldn't be
> >> done with the regular upcase/downcase slots by themselves?
> >
> > If that slot is "trivial", i.e. contains the lower-case variant of the
> > character, then indeed this slot doesn't add information, I think,
> > only utility.  But it doesn't have to contain the lower-case variant.
> 
> I know.  But if Emacs doesn't do Unicode folding, what is there other
> than lower/upper variants?

You can make it have whatever you like, because you can set up
buffer-specific tables.

> >> Not sure I follow.  Seems to me the CANONICALIZE slot is sufficient, at
> >> least in principle.
> >
> > It is sufficient for mapping a character to its canonical equivalent,
> > but not for finding the non-canonical variants of a canonical character.
> > IOW, it is not well suited to finding ς given just σ.
> 
> Finding the non-canonical variants is not something that happens (at
> least in principle) during case-insensitive matching.

The case database is not only for searching.

> > For searching, you only need the CANONICALIZE slot.  But what about
> > replacing the search string while keeping the letter case in the
> > replacement?  For that, CANONICALIZE alone is not enough, you need the
> > reverse mapping.
> 
> There is no reverse mapping when it comes to folding.  There can't be,
> since multiple characters can fold into the same character.

You can use the case of the string being replaced as a guideline.
E.g., if the replaced string was capitalized, you can capitalize the
replacement.






* bug#17130: 24.4.50; Deficient Unicode case folding
  2014-03-29 19:50                     ` Eli Zaretskii
@ 2014-03-29 20:01                       ` Nathan Trapuzzano
  0 siblings, 0 replies; 17+ messages in thread
From: Nathan Trapuzzano @ 2014-03-29 20:01 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 17130

Eli Zaretskii <eliz@gnu.org> writes:

>> I know.  But if Emacs doesn't do Unicode folding, what is there other
>> than lower/upper variants?
>
> You can make it have whatever you like, because you can set up
> buffer-specific tables.

Makes me wonder if whoever implemented the CANONICALIZE slot had Unicode
folding in mind.

>> Finding the non-canonical variants is not something that happens (at
>> least in principle) during case-insensitive matching.
>
> The case database is not only for searching.
>
>> There is no reverse mapping when it comes to folding.  There can't be,
>> since multiple characters can fold into the same character.
>
> You can use the case of the string being replaced as a guideline.
> E.g., if the replaced string was capitalized, you can capitalize the
> replacement.

I think you're still conflating case conversion and case folding.  As I
said, there is no case called "fold".  There's just upper, lower, and
title.  And the fact that these three overlap is already a problem for
case-preserving replace.  I spent most of my last email trying to
explain this.






* bug#17130: 24.4.50; Deficient Unicode case folding
  2014-03-28 12:07 bug#17130: 24.4.50; Deficient Unicode case folding Nathan Trapuzzano
  2014-03-28 15:51 ` Eli Zaretskii
@ 2019-09-29 14:23 ` Lars Ingebrigtsen
  1 sibling, 0 replies; 17+ messages in thread
From: Lars Ingebrigtsen @ 2019-09-29 14:23 UTC (permalink / raw)
  To: Nathan Trapuzzano; +Cc: 17130

Nathan Trapuzzano <nbtrap@nbtrap.com> writes:

> M-: (compare-strings "σ" nil nil "ς" nil nil t)
>
> ==> -1  ;; should be t

(compare-strings "σ" nil nil "ς" nil nil t)
=> t

I'm unable to reproduce this in Emacs 27, so I'm going to go ahead and
guess that this has been fixed in the years since this bug was reported,
and I'm closing this bug report.  If this is still a problem, please
reopen.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




