* bug#17130: 24.4.50; Deficient Unicode case folding
From: Nathan Trapuzzano @ 2014-03-28 12:07 UTC
To: 17130

M-: (compare-strings "σ" nil nil "ς" nil nil t)

==> -1 ;; should be t

Can someone that knows a thing about Unicode and emacs case tables speak
to whether the latter could suffice for implementing full Unicode case
folding?

* bug#17130: 24.4.50; Deficient Unicode case folding
From: Eli Zaretskii @ 2014-03-28 15:51 UTC
To: Nathan Trapuzzano; +Cc: 17130

> From: Nathan Trapuzzano <nbtrap@nbtrap.com>
> Date: Fri, 28 Mar 2014 08:07:20 -0400
>
> M-: (compare-strings "σ" nil nil "ς" nil nil t)
>
> ==> -1 ;; should be t

No, because these characters are not a case pair.

> Can someone that knows a thing about Unicode and emacs case tables speak
> to whether the latter could suffice for implementing full Unicode case
> folding?

What is "full Unicode case folding"?

* bug#17130: 24.4.50; Deficient Unicode case folding
From: nbtrap @ 2014-03-28 19:31 UTC
To: Eli Zaretskii; +Cc: 17130

Eli Zaretskii <eliz@gnu.org> writes:

>> M-: (compare-strings "σ" nil nil "ς" nil nil t)
>>
>> ==> -1 ;; should be t
>
> No, because these characters are not a case pair.

They're not a case pair in Emacs, but they should compare equally under
Unicode case folding.

>> Can someone that knows a thing about Unicode and emacs case tables speak
>> to whether the latter could suffice for implementing full Unicode case
>> folding?
>
> What is "full Unicode case folding"?

Something that implements this:
http://www.unicode.org/Public/UNIDATA/CaseFolding.txt

And perhaps more. I don't know, but someone on this list probably does.

If you look about a third of the way down, there's a line saying that
U+03C2 (ς) should fold into U+03C3 (σ).

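The folding rule cited above can be sketched in a few lines of Emacs
Lisp. This is only an illustration, not how Emacs implements
comparison: the names `my-simple-fold-exceptions', `my-simple-fold-char'
and `my-simple-fold-equal-p' are hypothetical, and the exception list
holds just the one CaseFolding.txt entry mentioned in this thread.

```elisp
;; Sketch only: every `my-' name below is hypothetical, not part of Emacs.
(defvar my-simple-fold-exceptions
  '((?ς . ?σ))  ; U+03C2 folds to U+03C3, per the line cited from CaseFolding.txt
  "Alist of (CHAR . FOLDED) pairs that plain downcasing does not cover.")

(defun my-simple-fold-char (ch)
  "Return the folded form of character CH.
Consult the exception list first, then fall back to `downcase'."
  (or (cdr (assq ch my-simple-fold-exceptions))
      (downcase ch)))

(defun my-simple-fold-equal-p (a b)
  "Return non-nil if strings A and B are equal after folding each character."
  (equal (mapcar #'my-simple-fold-char a)
         (mapcar #'my-simple-fold-char b)))

;; Both strings are folded before comparing; neither one is "unfolded".
(my-simple-fold-equal-p "σ" "ς")
(my-simple-fold-equal-p "ΟΔΟΣ" "οδος")
```

Under these assumptions both calls return non-nil, because Σ, σ and ς
all end up as σ before the comparison takes place.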
* bug#17130: 24.4.50; Deficient Unicode case folding
From: Eli Zaretskii @ 2014-03-29 6:45 UTC
To: nbtrap; +Cc: 17130

> From: nbtrap@nbtrap.com
> Cc: 17130@debbugs.gnu.org
> Date: Fri, 28 Mar 2014 15:31:09 -0400
>
> Eli Zaretskii <eliz@gnu.org> writes:
>
> >> M-: (compare-strings "σ" nil nil "ς" nil nil t)
> >>
> >> ==> -1 ;; should be t
> >
> > No, because these characters are not a case pair.
>
> They're not a case pair in Emacs, but they should compare equally under
> Unicode case folding.

Emacs doesn't currently support that.

> >> Can someone that knows a thing about Unicode and emacs case tables speak
> >> to whether the latter could suffice for implementing full Unicode case
> >> folding?
> >
> > What is "full Unicode case folding"?
>
> Something that implements this:
> http://www.unicode.org/Public/UNIDATA/CaseFolding.txt
>
> And perhaps more. I don't know, but someone on this list probably does.
>
> If you look about a third of the way down, there's a line saying that
> U+03C2 (ς) should fold into U+03C3 (σ).

Patches are welcome to import those tables into Emacs, and make case
folding support them.

[parent not found: <87ob0pnptc.fsf@nbtrap.com>]
* bug#17130: 24.4.50; Deficient Unicode case folding
From: Eli Zaretskii @ 2014-03-29 13:15 UTC
To: Nathan Trapuzzano; +Cc: 17130

> From: Nathan Trapuzzano <nbtrap@nbtrap.com>
> Cc: 17130@debbugs.gnu.org
> Date: Sat, 29 Mar 2014 08:37:35 -0400
>
> Reading through the manual section on case tables, it seems that this
> could be supported via the extra "canonicalize" slot:
>
>   CANONICALIZE
>        The canonicalize table maps all of a set of case-related
>        characters into a particular member of that set.

Not efficiently, no. E.g., how will you find ς from σ, using this
method?

Besides, don't we also need to know that ς can only be present at the
end of a word? Or maybe I'm misunderstanding what you meant?

> If this isn't already used for Unicode case folding, what _is_ it used
> for?

It is used for case-insensitive regexp matching, see search.c.

* bug#17130: 24.4.50; Deficient Unicode case folding
From: Nathan Trapuzzano @ 2014-03-29 14:03 UTC
To: Eli Zaretskii; +Cc: 17130

Eli Zaretskii <eliz@gnu.org> writes:

>> Reading through the manual section on case tables, it seems that this
>> could be supported via the extra "canonicalize" slot:
>>
>>   CANONICALIZE
>>        The canonicalize table maps all of a set of case-related
>>        characters into a particular member of that set.
>
> Not efficiently, no. E.g., how will you find ς from σ, using this
> method?

σ, ς, and Σ would all have σ in the CANONICALIZE slot, since they all
fold to σ. (By the way, ς should upcase to Σ--that much I know the case
tables can handle.)

> Besides, don't we also need to know that ς can only be present at the
> end of a word?

Don't think so. AFAIK, Unicode says nothing about ordering except when
it comes to combining characters. But even if it did prescribe such a
rule, I don't think it would have anything to do with case folding.

>> If this isn't already used for Unicode case folding, what _is_ it used
>> for?
>
> It is used for case-insensitive regexp matching, see search.c.

Right, but what I'm asking is: if Emacs doesn't do Unicode case folding,
what is the purpose of the CANONICALIZE slot except as a kind of
placeholder that gets autofilled? Are there other kinds of case
folding--other than traditional upper/lower and Unicode--that I'm not
aware of? I understand that Emacs autofills the CANONICALIZE slot from
the other slots, but only when the CANONICALIZE slot is not already set
to non-nil. What if the CANONICALIZE slot on ς were set to σ? I think
that's all that would have to happen for the Unicode folding to work.
It seems the machinery is already in place.

* bug#17130: 24.4.50; Deficient Unicode case folding
From: Eli Zaretskii @ 2014-03-29 14:45 UTC
To: Nathan Trapuzzano; +Cc: 17130

> From: Nathan Trapuzzano <nbtrap@nbtrap.com>
> Cc: 17130@debbugs.gnu.org
> Date: Sat, 29 Mar 2014 10:03:32 -0400
>
> Eli Zaretskii <eliz@gnu.org> writes:
>
> >> Reading through the manual section on case tables, it seems that this
> >> could be supported via the extra "canonicalize" slot:
> >>
> >>   CANONICALIZE
> >>        The canonicalize table maps all of a set of case-related
> >>        characters into a particular member of that set.
> >
> > Not efficiently, no. E.g., how will you find ς from σ, using this
> > method?
>
> σ, ς, and Σ would all have σ in the CANONICALIZE slot, since they all
> fold to σ.

So you would need to search all characters to find those which have σ
in the CANONICALIZE slot -- not very efficient, to say the least.

IOW, what you suggest will provide a one-way mapping, whereas we need
a two-way mapping.

> > Besides, don't we also need to know that ς can only be present at the
> > end of a word?
>
> Don't think so. AFAIK, Unicode says nothing about ordering except when
> it comes to combining characters. But even if it did prescribe such a
> rule, I don't think it would have anything to do with case folding.

Who said this is only about case folding? Emacs should use this data
for up-casing and down-casing as well, for example, so that M-l
downcases Σ to ς, not σ, when it is at the end of the word. Wouldn't
users of Greek expect that?

> >> If this isn't already used for Unicode case folding, what _is_ it used
> >> for?
> >
> > It is used for case-insensitive regexp matching, see search.c.
>
> Right, but what I'm asking is: if Emacs doesn't do Unicode case folding,
> what is the purpose of the CANONICALIZE slot except as a kind of
> placeholder that gets autofilled?

Whenever you need the canonical equivalent of a character, such as in
case-insensitive search, you need that slot.

> Are there other kinds of case folding--other than traditional
> upper/lower and Unicode--that I'm not aware of?

There's "title case", of course. There are also characters whose case
pair is not a single character, but several, like the upper-case
variant of ß in German. Basically, any character not marked "C" in the
Unicode CaseFolding.txt is special in some way.

> I understand that Emacs autofills the CANONICALIZE slot from
> the other slots, but only when the CANONICALIZE slot is not already set
> to non-nil. What if the CANONICALIZE slot on ς were set to σ? I think
> that's all that would have to happen for the Unicode folding to work.
> It seems the machinery is already in place.

For this case, maybe (and even it doesn't handle Σ correctly, I think,
when downcased at the end of the word). For other cases, not
necessarily.

Personally, I think we need an additional slot for what you want, and
code to use it.

* bug#17130: 24.4.50; Deficient Unicode case folding
From: Nathan Trapuzzano @ 2014-03-29 15:29 UTC
To: Eli Zaretskii; +Cc: 17130

Eli Zaretskii <eliz@gnu.org> writes:

>> σ, ς, and Σ would all have σ in the CANONICALIZE slot, since they all
>> fold to σ.
>
> So you would need to search all characters to find those which have σ
> in the CANONICALIZE slot -- not very efficient, to say the least.

Doesn't this already happen? If not, then what is the CANONICALIZE slot
doing that couldn't be done with the regular upcase/downcase slots by
themselves?

> IOW, what you suggest will provide a one-way mapping, whereas we need
> a two-way mapping.

Not sure I follow. Seems to me the CANONICALIZE slot is sufficient, at
least in principle.

>> > Besides, don't we also need to know that ς can only be present at the
>> > end of a word?
>>
>> Don't think so. AFAIK, Unicode says nothing about ordering except when
>> it comes to combining characters. But even if it did prescribe such a
>> rule, I don't think it would have anything to do with case folding.
>
> Who said this is only about case folding?

I should have said just "case", not "case folding".

> Emacs should use this data for up-casing and down-casing as well, for
> example, so that M-l downcases Σ to ς, not σ, when it is at the end of
> the word. Wouldn't users of Greek expect that?

Maybe. I'm just saying that Unicode itself doesn't prescribe or even
recommend such behavior. It defines case conversions independently of
ordering.

That said, making M-l downcase terminal Σ to ς would be a nice feature
that could be enabled, e.g., by enabling a minor mode or by modifying
some *-functions variable of functions that get called before the normal
behavior of M-l is applied, etc. But it shouldn't have anything to do
with Unicode-compliant case-insensitive searching.

>> Right, but what I'm asking is: if Emacs doesn't do Unicode case folding,
>> what is the purpose of the CANONICALIZE slot except as a kind of
>> placeholder that gets autofilled?
>
> Whenever you need the canonical equivalent of a character, such as in
> case-insensitive search, you need that slot.

But there's nothing about the slot that mandates that only _pairs_ can
be case-equivalent under case folding. Indeed, the manual speaks of
"sets" of characters that might be equivalent under case-folding, hence
my understanding that σ, ς, and Σ can all have σ in their CANONICALIZE
slot, and that's all it would take. (Btw, I'm using "case-insensitive"
to mean the same as "under case-folding".)

>> Are there other kinds of case folding--other than traditional
>> upper/lower and Unicode--that I'm not aware of?
>
> There's "title case", of course.

I think title case would require an extra slot in the case table.

> There are also characters whose case pair is not a single character,
> but several, like the upper-case variant of ß in German.

Good point. "ß" should fold to "ss". I guess for the CANONICALIZE slot
to suffice, it would have to map to a string, not a code point.

> Personally, I think we need an additional slot for what you want, and
> code to use it.

Given the point about ß, you're probably right. Unless we can make
entries in the CANONICALIZE slot be strings rather than code points.

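To make the ß point concrete, here is a rough sketch of a folding whose
result for a single character may be a whole string. It is an
assumption-laden illustration and does not touch the real case tables:
`my-full-fold-exceptions' and `my-full-fold-string' are hypothetical
names, and the exception list contains only the two characters
discussed in this thread.

```elisp
;; Sketch only: every `my-' name below is hypothetical, not part of Emacs.
(defvar my-full-fold-exceptions
  '((?ß . "ss")   ; one character folds to two
    (?ς . "σ"))   ; final sigma folds to ordinary small sigma
  "Alist of (CHAR . STRING) foldings that plain downcasing does not cover.")

(defun my-full-fold-string (s)
  "Return string S with every character replaced by its folded string.
The result may be longer than S, since one character can fold to several."
  (mapconcat (lambda (ch)
               (or (cdr (assq ch my-full-fold-exceptions))
                   (string (downcase ch))))
             s ""))

;; The two spellings have different lengths but fold to the same string.
(string= (my-full-fold-string "Straße")
         (my-full-fold-string "STRASSE"))
```

Because the folded value is a string rather than a single character,
this is exactly the kind of entry that code expecting one character per
slot could not consume as-is.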
* bug#17130: 24.4.50; Deficient Unicode case folding
From: Eli Zaretskii @ 2014-03-29 17:37 UTC
To: Nathan Trapuzzano; +Cc: 17130

> From: Nathan Trapuzzano <nbtrap@nbtrap.com>
> Cc: 17130@debbugs.gnu.org
> Date: Sat, 29 Mar 2014 11:29:43 -0400
>
> Eli Zaretskii <eliz@gnu.org> writes:
>
> >> σ, ς, and Σ would all have σ in the CANONICALIZE slot, since they all
> >> fold to σ.
> >
> > So you would need to search all characters to find those which have σ
> > in the CANONICALIZE slot -- not very efficient, to say the least.
>
> Doesn't this already happen?

No, not when that slot is used for case-insensitive search. You just
use it to get the canonical equivalent, i.e. use the one-way mapping
that it provides.

> If not, then what is the CANONICALIZE slot doing that couldn't be
> done with the regular upcase/downcase slots by themselves?

If that slot is "trivial", i.e. contains the lower-case variant of the
character, then indeed this slot doesn't add information, I think,
only utility. But it doesn't have to contain the lower-case variant.

> > IOW, what you suggest will provide a one-way mapping, whereas we need
> > a two-way mapping.
>
> Not sure I follow. Seems to me the CANONICALIZE slot is sufficient, at
> least in principle.

It is sufficient for mapping a character to its canonical equivalent,
but not finding the non-canonical variants of a canonical character.
IOW, it is not well suited to finding ς given just σ.

> > Emacs should use this data for up-casing and down-casing as well, for
> > example, so that M-l downcases Σ to ς, not σ, when it is at the end of
> > the word. Wouldn't users of Greek expect that?
>
> Maybe. I'm just saying that Unicode itself doesn't prescribe or even
> recommend such behavior. It defines case conversions independently of
> ordering.
>
> That said, making M-l downcase terminal Σ to ς would be a nice feature
> that could be enabled, e.g., by enabling a minor mode or by modifying
> some *-functions variable of functions that get called before the normal
> behavior of M-l is applied, etc. But it shouldn't have anything to do
> with Unicode-compliant case-insensitive searching.

For searching, you only need the CANONICALIZE slot. But what about
replacing the search string while keeping the letter case in the
replacement? For that, CANONICALIZE alone is not enough, you need the
reverse mapping.

> > Personally, I think we need an additional slot for what you want, and
> > code to use it.
>
> Given the point about ß, you're probably right. Unless we can make
> entries in the CANONICALIZE slot be strings rather than code points.

This is Lisp; a vector slot can contain any Lisp object. But using
CANONICALIZE for what you want would be wrong, I think, because it
will screw up case-insensitive search, which expects to find there a
single character.

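The one-way versus two-way distinction argued over here can be shown
with a toy mapping. The sketch below is not based on the real case-table
internals: `my-canon-alist', `my-canonical' and
`my-build-variants-index' are hypothetical names, and the alist is a
three-entry stand-in for a canonicalize table.

```elisp
;; Sketch only: every `my-' name below is hypothetical, not part of Emacs.
(defvar my-canon-alist
  '((?Σ . ?σ) (?σ . ?σ) (?ς . ?σ))
  "Toy canonicalize mapping as (CHAR . CANONICAL) pairs.")

(defun my-canonical (ch)
  "One-way lookup: return the canonical form of CH, or CH itself if unlisted."
  (or (cdr (assq ch my-canon-alist)) ch))

(defun my-build-variants-index (canon-alist)
  "Return a hash table mapping each canonical character to its variants.
Building it takes one pass over CANON-ALIST; afterwards the reverse
lookup no longer needs to scan every character."
  (let ((index (make-hash-table :test 'eql)))
    (dolist (pair canon-alist)
      (puthash (cdr pair)
               (cons (car pair) (gethash (cdr pair) index))
               index))
    index))

;; Forward direction: a single lookup per character.
(my-canonical ?ς)

;; Reverse direction: needs the precomputed index.
(gethash ?σ (my-build-variants-index my-canon-alist))
```

The forward lookup is all that comparing two folded strings requires,
while the reverse lookup only becomes cheap once a second table exists.
The Emacs Lisp manual describes an EQUIVALENCES extra slot in case
tables that plays a similar reverse-lookup role, although whether it
would cover the cases discussed here is not settled in this thread.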
* bug#17130: 24.4.50; Deficient Unicode case folding
From: Nathan Trapuzzano @ 2014-03-29 18:31 UTC
To: Eli Zaretskii; +Cc: 17130

Eli Zaretskii <eliz@gnu.org> writes:

>> > So you would need to search all characters to find those which have σ
>> > in the CANONICALIZE slot -- not very efficient, to say the least.
>>
>> Doesn't this already happen?
>
> No, not when that slot is used for case-insensitive search. You just
> use it to get the canonical equivalent, i.e. use the one-way mapping
> that it provides.

I still don't get it. What I say below may explain why.

>> If not, then what is the CANONICALIZE slot doing that couldn't be
>> done with the regular upcase/downcase slots by themselves?
>
> If that slot is "trivial", i.e. contains the lower-case variant of the
> character, then indeed this slot doesn't add information, I think,
> only utility. But it doesn't have to contain the lower-case variant.

I know. But if Emacs doesn't do Unicode folding, what is there other
than lower/upper variants?

>> > IOW, what you suggest will provide a one-way mapping, whereas we need
>> > a two-way mapping.
>>
>> Not sure I follow. Seems to me the CANONICALIZE slot is sufficient, at
>> least in principle.
>
> It is sufficient for mapping a character to its canonical equivalent,
> but not finding the non-canonical variants of a canonical character.
> IOW, it is not well suited to finding ς given just σ.

Finding the non-canonical variants is not something that happens (at
least in principle) during case-insensitive matching. You convert both
the matching string and the string being matched into their canonical
equivalents and see if they match. You never UNfold. Case folding is
by definition a one-way operation.

>> That said, making M-l downcase terminal Σ to ς would be a nice feature
>> that could be enabled, e.g., by enabling a minor mode or by modifying
>> some *-functions variable of functions that get called before the normal
>> behavior of M-l is applied, etc. But it shouldn't have anything to do
>> with Unicode-compliant case-insensitive searching.
>
> For searching, you only need the CANONICALIZE slot. But what about
> replacing the search string while keeping the letter case in the
> replacement? For that, CANONICALIZE alone is not enough, you need the
> reverse mapping.

There is no reverse mapping when it comes to folding. There can't be,
since multiple characters can fold into the same character.

I don't fully understand what "case-replace" does (e.g. case being a
property of characters and not strings, what does it mean to "preserve
case" when replacing a string of length x with a string of length y
where x != y), but I don't think Unicode folding would complicate it.
There are three cases in Unicode: lower, upper, and title. Upper and
title already overlap for the vast majority of codepoints, so there you
already have problems with a case-preserving replace. That said "fold"
is not a case in Unicode; it's a one-way mapping of non-overlapping sets
of characters to a canonical equivalent, so it makes no sense to talk
about preserving case with respect to case folding.

Notandum: I was wrong about Unicode saying nothing about character
ordering for non-combining characters. The "special casing" document
(ftp://ftp.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt) contains
context- and language-dependent case rules for certain characters,
including final sigma. Notably, the document says that Σ in terminal
position should (or "may"--I'm not really sure about how to interpret
the document) downcase to ς. That said, the document has _nothing_ to
do with case _folding_, which is always context- and
language-independent.

Rightly interpreted, therefore, case _conversion_ (such as in
case-preserving replace) and case-insensitive _searching_ (i.e. case
folding), according to Unicode, are orthogonal. We don't have to
address both at the same time.

>> Given the point about ß, you're probably right. Unless we can make
>> entries in the CANONICALIZE slot be strings rather than code points.
>
> This is Lisp; a vector slot can contain any Lisp object. But using
> CANONICALIZE for what you want would be wrong, I think, because it
> will screw up case-insensitive search, which expects to find there a
> single character.

Right, that's what I meant. Putting strings there would break
something.

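The contrast between context-free folding and context-dependent case
conversion can be sketched as follows. This is a deliberately
simplified illustration of the final-sigma idea, not the exact
condition from SpecialCasing.txt and not what M-l does: the names
`my-letter-p' and `my-downcase-with-final-sigma' are hypothetical, and
"end of word" is approximated as "preceded by a letter and not followed
by one".

```elisp
;; Sketch only: every `my-' name below is hypothetical, not part of Emacs.
(defun my-letter-p (ch)
  "Return t if CH is non-nil and its Unicode general category is a letter."
  (and ch
       (memq (get-char-code-property ch 'general-category)
             '(Lu Ll Lt Lm Lo))
       t))

(defun my-downcase-with-final-sigma (s)
  "Downcase string S character by character.
A capital sigma that follows a letter and is not followed by a letter
becomes final sigma; everything else goes through `downcase'.  Unlike
folding, the result for a character depends on its neighbors."
  (let ((len (length s))
        (out '()))
    (dotimes (i len)
      (let ((ch (aref s i))
            (prev (and (> i 0) (aref s (1- i))))
            (next (and (< (1+ i) len) (aref s (1+ i)))))
        (push (if (and (eq ch ?Σ)
                       (my-letter-p prev)
                       (not (my-letter-p next)))
                  ?ς
                (downcase ch))
              out)))
    (apply #'string (nreverse out))))

;; Word-final sigmas become ς; the word-initial one becomes σ.
(my-downcase-with-final-sigma "ΟΔΟΣ ΣΟΦΟΣ")
```

Under these assumptions the call yields "οδος σοφος". A folding
function never needs to look at `prev' or `next', which is the sense in
which the two operations are independent of each other.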
* bug#17130: 24.4.50; Deficient Unicode case folding
From: Nathan Trapuzzano @ 2014-03-29 18:36 UTC
To: Eli Zaretskii; +Cc: 17130

Nathan Trapuzzano <nbtrap@nbtrap.com> writes:

> Rightly interpreted, therefore, case _conversion_ (such as in
> case-preserving replace) and case-insensitive _searching_ (i.e. case
> folding), according to Unicode, are orthogonal. We don't have to
> address both at the same time.

Er, let me rephrase. Case _conversion_ (such as in case-preserving
replace) and case _folding_ (such as ought be used in case-insensitive
searching) are orthogonal.

* bug#17130: 24.4.50; Deficient Unicode case folding
From: Eli Zaretskii @ 2014-03-29 19:51 UTC
To: Nathan Trapuzzano; +Cc: 17130

> From: Nathan Trapuzzano <nbtrap@nbtrap.com>
> Cc: 17130@debbugs.gnu.org
> Date: Sat, 29 Mar 2014 14:36:42 -0400
>
> Er, let me rephrase. Case _conversion_ (such as in case-preserving
> replace) and case _folding_ (such as ought be used in case-insensitive
> searching) are orthogonal.

But they can very well use the same database.

* bug#17130: 24.4.50; Deficient Unicode case folding
From: Nathan Trapuzzano @ 2014-03-29 20:15 UTC
To: Eli Zaretskii; +Cc: 17130

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Nathan Trapuzzano <nbtrap@nbtrap.com>
>> Cc: 17130@debbugs.gnu.org
>> Date: Sat, 29 Mar 2014 14:36:42 -0400
>>
>> Er, let me rephrase. Case _conversion_ (such as in case-preserving
>> replace) and case _folding_ (such as ought be used in case-insensitive
>> searching) are orthogonal.
>
> But they can very well use the same database.

It's not clear what you mean.

We already have a place to store upper- and lower-case variants. What
I'm proposing is to use the CANONICALIZE slot as a place to store the
case-folding mapping. If this would mess up Emacs' case-preserving
replace, then I think that would just mean that case-preserving replace
is broken. There is no such case as "canonicalize"--you can't say, "Oh,
this string is in the canonical case, so when I want to replace it with
this other string in canonical case". A case-preserving replace should
only consult the upper- and lower-case slots (and perhaps the title-case
slot if it existed).

* bug#17130: 24.4.50; Deficient Unicode case folding
From: Eli Zaretskii @ 2014-03-30 2:45 UTC
To: Nathan Trapuzzano; +Cc: 17130

> From: Nathan Trapuzzano <nbtrap@nbtrap.com>
> Cc: 17130@debbugs.gnu.org
> Date: Sat, 29 Mar 2014 16:15:34 -0400
>
> Eli Zaretskii <eliz@gnu.org> writes:
>
> >> From: Nathan Trapuzzano <nbtrap@nbtrap.com>
> >> Cc: 17130@debbugs.gnu.org
> >> Date: Sat, 29 Mar 2014 14:36:42 -0400
> >>
> >> Er, let me rephrase. Case _conversion_ (such as in case-preserving
> >> replace) and case _folding_ (such as ought be used in case-insensitive
> >> searching) are orthogonal.
> >
> > But they can very well use the same database.
>
> It's not clear what you mean.

You keep asking questions about the purpose of the CANONICALIZE slot,
and I keep trying to explain that purpose.

> We already have a place to store upper- and lower-case variants. What
> I'm proposing is to use the CANONICALIZE slot as a place to store the
> case-folding mapping. If this would mess up Emacs' case-preserving
> replace, then I think that would just mean that case-preserving replace
> is broken. There is no such case as "canonicalize"--you can't say, "Oh,
> this string is in the canonical case, so when I want to replace it with
> this other string in canonical case". A case-preserving replace should
> only consult the upper- and lower-case slots (and perhaps the title-case
> slot if it existed).

Perhaps you should tell what this means in practice, from the POV of
populating the CANONICALIZE slot, and how that content would be used
under your proposal. That should make the discussion more useful, I
hope.

* bug#17130: 24.4.50; Deficient Unicode case folding
From: Eli Zaretskii @ 2014-03-29 19:50 UTC
To: Nathan Trapuzzano; +Cc: 17130

> From: Nathan Trapuzzano <nbtrap@nbtrap.com>
> Cc: 17130@debbugs.gnu.org
> Date: Sat, 29 Mar 2014 14:31:52 -0400
>
> >> If not, then what is the CANONICALIZE slot doing that couldn't be
> >> done with the regular upcase/downcase slots by themselves?
> >
> > If that slot is "trivial", i.e. contains the lower-case variant of the
> > character, then indeed this slot doesn't add information, I think,
> > only utility. But it doesn't have to contain the lower-case variant.
>
> I know. But if Emacs doesn't do Unicode folding, what is there other
> than lower/upper variants?

You can make it have whatever you like, because you can set up
buffer-specific tables.

> >> Not sure I follow. Seems to me the CANONICALIZE slot is sufficient, at
> >> least in principle.
> >
> > It is sufficient for mapping a character to its canonical equivalent,
> > but not finding the non-canonical variants of a canonical character.
> > IOW, it is not well suited to finding ς given just σ.
>
> Finding the non-canonical variants is not something that happens (at
> least in principle) during case-insensitive matching.

The case database is not only for searching.

> > For searching, you only need the CANONICALIZE slot. But what about
> > replacing the search string while keeping the letter case in the
> > replacement? For that, CANONICALIZE alone is not enough, you need the
> > reverse mapping.
>
> There is no reverse mapping when it comes to folding. There can't be,
> since multiple characters can fold into the same character.

You can use the case of the string being replaced as guidelines.
E.g., if the replaced string was capitalized, you can capitalize the
replacement.

* bug#17130: 24.4.50; Deficient Unicode case folding
From: Nathan Trapuzzano @ 2014-03-29 20:01 UTC
To: Eli Zaretskii; +Cc: 17130

Eli Zaretskii <eliz@gnu.org> writes:

>> I know. But if Emacs doesn't do Unicode folding, what is there other
>> than lower/upper variants?
>
> You can make it have whatever you like, because you can set up
> buffer-specific tables.

Makes me wonder if whoever implemented the CANONICALIZE slot had Unicode
folding in mind.

>> Finding the non-canonical variants is not something that happens (at
>> least in principle) during case-insensitive matching.
>
> The case database is not only for searching.
>
>> There is no reverse mapping when it comes to folding. There can't be,
>> since multiple characters can fold into the same character.
>
> You can use the case of the string being replaced as guidelines.
> E.g., if the replaced string was capitalized, you can capitalize the
> replacement.

I think you're still conflating case conversion and case folding. As I
said, there is no case called "fold". There's just upper, lower, and
title. And the fact that these three overlap is already a problem for
case-preserving replace. I spent most of my last email trying to
explain this.

* bug#17130: 24.4.50; Deficient Unicode case folding
From: Lars Ingebrigtsen @ 2019-09-29 14:23 UTC
To: Nathan Trapuzzano; +Cc: 17130

Nathan Trapuzzano <nbtrap@nbtrap.com> writes:

> M-: (compare-strings "σ" nil nil "ς" nil nil t)
>
> ==> -1 ;; should be t

(compare-strings "σ" nil nil "ς" nil nil t)
=> t

I'm unable to reproduce this in Emacs 27, so I'm going to go ahead and
guess that this has been fixed in the years since this bug was reported,
and I'm closing this bug report. If this is still a problem, please
reopen.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no
