* regex and case-fold-search problem
@ 2002-08-23  6:25 Kenichi Handa
  2002-08-23 15:56 ` Eli Zaretskii
                   ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Kenichi Handa @ 2002-08-23  6:25 UTC (permalink / raw)

While working on emacs-unicode, I noticed a very difficult
problem which also exists in the current emacs.

(let ((case-fold-search nil))
  (string-match "[Þ-ß]" "Þ")) => 0
(let ((case-fold-search nil))
  (string-match "[Þß]" "Þ")) => 0

(let ((case-fold-search t))
  (string-match "[Þ-ß]" "Þ")) => nil !!!
(let ((case-fold-search t))
  (string-match "[Þß]" "Þ")) => 0

When you see the output of M-x list-charset-chars RET
latin-iso8859-1 RET, you'll soon find what's going on.

The relevant character codes are as follows:
	Þ (#x8DE)
	ß (#x8DF)
	(downcase ?Þ) == ?þ (#x8FE)
	(downcase ?ß) == ?ß (#x8DF)

This problem is not specific to non-ASCII chars, it's just
rarer to face such a situation in ASCII chars.

(let ((case-fold-search nil))
  (string-match "[A-_]" "A")) => 0
(let ((case-fold-search t))
  (string-match "[A-_]" "A")) => nil
(let ((case-fold-search t))
  (string-match "[A_]" "A")) => 0

In my opinion, specifying ranges by chars is nonsense
because there should be no semantics in the order of
character codes.

But, anyway, we have to decide what to do.

(1) Regard the above case as a bug, and fix it completely.
    As we don't support a range striding over different
    charsets in the current Emacs, I think the fix is
    difficult but not that much.  But, in emacs-unicode, we
    can't have such a restriction, and thus the fix is very
    difficult.

(2) Regard the above case as an (unpleasant) feature, and
    document it.

(3) Signal an error for such a regex (and of course document it).

---
Ken'ichi HANDA
handa@etl.go.jp

^ permalink raw reply	[flat|nested] 40+ messages in thread
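The ASCII case above can be reproduced outside Emacs. The following is a minimal sketch, not code from regex.c: it assumes the default "C" locale, uses `tolower` as a stand-in for the Emacs case table, and the function names `naive_fold_match` and `exact_fold_match` are hypothetical. It shows why downcasing only the two endpoints of `[A-_]` produces the empty range a.._ (#x61..#x5F), while checking each member of the original range does not.

```c
#include <ctype.h>

/* Naive folding: downcase only the two endpoints of the range, then
   test the downcased character against the result.  For LO = 'A' and
   HI = '_' the folded range is 'a'..'_', which is empty.  */
static int
naive_fold_match (unsigned char lo, unsigned char hi, unsigned char c)
{
  int flo = tolower (lo), fhi = tolower (hi), fc = tolower (c);
  return flo <= fc && fc <= fhi;
}

/* Reference behaviour: C matches if it is case-equivalent to some
   member of the original, unfolded range LO..HI.  */
static int
exact_fold_match (unsigned char lo, unsigned char hi, unsigned char c)
{
  int fc = tolower (c);
  for (int m = lo; m <= hi; m++)
    if (tolower (m) == fc)
      return 1;
  return 0;
}
```

With these definitions, `naive_fold_match ('A', '_', 'A')` fails in the same way as the `case-fold-search t` example, whereas `exact_fold_match ('A', '_', 'A')` succeeds.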
* Re: regex and case-fold-search problem 2002-08-23 6:25 regex and case-fold-search problem Kenichi Handa @ 2002-08-23 15:56 ` Eli Zaretskii 2002-08-24 0:51 ` Kenichi Handa 2002-08-23 17:36 ` Stefan Monnier 2002-08-26 21:51 ` Richard Stallman 2 siblings, 1 reply; 40+ messages in thread From: Eli Zaretskii @ 2002-08-23 15:56 UTC (permalink / raw) Cc: emacs-devel > From: Kenichi Handa <handa@etl.go.jp> > Date: Fri, 23 Aug 2002 15:25:42 +0900 (JST) > > This problem is not specific to non-ASCII chars, it's just > rarer to face such a sitution in ASCII chars. > > (let ((case-fold-search nil)) > (string-match "[A-_]" "A")) => 0 > (let ((case-fold-search t)) > (string-match "[A-_]" "A")) => nil > (let ((case-fold-search t)) > (string-match "[A_]" "A")) => 0 Does that happen because under case-fold-search non-nil the characters on the range specification are downcased? > In my opinion, specifying ranges by chars are nonsense > because there should be no semantics in the order of > characters codes. Sorry, I don't understand: how would one specify a range _except_ with two characters and a dash between them? What am I missing? ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: regex and case-fold-search problem 2002-08-23 15:56 ` Eli Zaretskii @ 2002-08-24 0:51 ` Kenichi Handa 2002-08-24 1:03 ` Miles Bader ` (2 more replies) 0 siblings, 3 replies; 40+ messages in thread From: Kenichi Handa @ 2002-08-24 0:51 UTC (permalink / raw) Cc: emacs-devel In article <9003-Fri23Aug2002185625+0300-eliz@is.elta.co.il>, "Eli Zaretskii" <eliz@is.elta.co.il> writes: >> (let ((case-fold-search nil)) >> (string-match "[A-_]" "A")) => 0 >> (let ((case-fold-search t)) >> (string-match "[A-_]" "A")) => nil >> (let ((case-fold-search t)) >> (string-match "[A_]" "A")) => 0 > Does that happen because under case-fold-search non-nil the > characters on the range specification are downcased? Yes. >> In my opinion, specifying ranges by chars are nonsense >> because there should be no semantics in the order of >> characters codes. > Sorry, I don't understand: how would one specify a range _except_ > with two characters and a dash between them? What am I missing? I mean that the concept of character range itself is not good. A character code is just an identifier of a character. We usually don't think about "a range of identifiers" (e.g. "symbols in the range between t and nil" is nonsense). --- Ken'ichi HANDA handa@etl.go.jp ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: regex and case-fold-search problem 2002-08-24 0:51 ` Kenichi Handa @ 2002-08-24 1:03 ` Miles Bader 2002-08-24 9:42 ` Eli Zaretskii 2002-08-24 16:16 ` Andreas Schwab 2002-08-24 9:39 ` Eli Zaretskii 2002-08-25 22:21 ` Kim F. Storm 2 siblings, 2 replies; 40+ messages in thread From: Miles Bader @ 2002-08-24 1:03 UTC (permalink / raw) Cc: eliz, emacs-devel On Sat, Aug 24, 2002 at 09:51:46AM +0900, Kenichi Handa wrote: > I mean that the concept of character range itself is not good. A character > code is just an identifier of a character. We usually don't think about "a > range of identifiers" (e.g. "symbols in the range between t and nil" is > nonsense). Yeah, but character ranges make perfect sense in many local contexts. E.g., [0-9], or [<0>-<9>] where <0> and <9> are `wide' digits from some character set. I think that in cases where the notion of a character range _does_ make sense, that either both ends will be downcase-able, or that both will not, so that perhaps the problem won't actually show up in practice if we just say `only use character ranges when they make sense!' -Miles -- P.S. All information contained in the above letter is false, for reasons of military security. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: regex and case-fold-search problem 2002-08-24 1:03 ` Miles Bader @ 2002-08-24 9:42 ` Eli Zaretskii 2002-08-24 16:16 ` Andreas Schwab 1 sibling, 0 replies; 40+ messages in thread From: Eli Zaretskii @ 2002-08-24 9:42 UTC (permalink / raw) Cc: emacs-devel > Date: Fri, 23 Aug 2002 21:03:07 -0400 > From: Miles Bader <miles@gnu.org> > > I think that in cases where the notion of a character range _does_ make > sense, that either both ends will be downcase-able, or that both will not Yes, but the problem is that downcasing both ends of a range might change the range in some (admittedly a bit rare) cases. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: regex and case-fold-search problem
  2002-08-24  1:03 ` Miles Bader
  2002-08-24  9:42 ` Eli Zaretskii
@ 2002-08-24 16:16 ` Andreas Schwab
  2002-08-26  1:54 ` Miles Bader
  2002-08-26 21:51 ` Richard Stallman
  1 sibling, 2 replies; 40+ messages in thread
From: Andreas Schwab @ 2002-08-24 16:16 UTC (permalink / raw)
  Cc: Kenichi Handa, eliz, emacs-devel

Miles Bader <miles@gnu.org> writes:

|> On Sat, Aug 24, 2002 at 09:51:46AM +0900, Kenichi Handa wrote:
|> > I mean that the concept of character range itself is not good.  A character
|> > code is just an identifier of a character.  We usually don't think about "a
|> > range of identifiers" (e.g. "symbols in the range between t and nil" is
|> > nonsense).
|>
|> Yeah, but character ranges make perfect sense in many local contexts.
|> E.g., [0-9], or [<0>-<9>] where <0> and <9> are `wide' digits from some
|> character set.

What does [A-Z] mean in EBCDIC?  [0-9] is a special case, because ISO C
requires that 0,1,2,3,4,5,6,7,8,9 are consecutive in the execution
character set.  But in many locales the collating sequence <A> - <Z>
contains more than just the upper case letters from the English alphabet.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux AG, Deutschherrnstr. 15-19, D-90429 Nürnberg
Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."
* Re: regex and case-fold-search problem
  2002-08-24 16:16 ` Andreas Schwab
@ 2002-08-26  1:54 ` Miles Bader
  2002-08-26 16:11 ` Stefan Monnier
  2002-08-26 21:51 ` Richard Stallman
  1 sibling, 1 reply; 40+ messages in thread
From: Miles Bader @ 2002-08-26  1:54 UTC (permalink / raw)
  Cc: Kenichi Handa, eliz, emacs-devel

Andreas Schwab <schwab@suse.de> writes:
> |> Yeah, but character ranges make perfect sense in many local contexts.
> |> E.g., [0-9], or [<0>-<9>] where <0> and <9> are `wide' digits from some
> |> character set.
>
> What does [A-Z] mean in EBCDIC?  [0-9] is a special case, because ISO C
> requires that 0,1,2,3,4,5,6,7,8,9 are consecutive in the execution
> character set.  But in many locales the collating sequence <A> - <Z>
> contains more than just the upper case letters from the English alphabet.

The question is not `does [A-Z] make sense?', but rather: `_if_ [A-Z]
makes sense, does [a-z] make sense too?'

That is, we aren't the ones writing [A-Z], it's lisp authors or users
entering regexps or something.  If they want to enter a less-than-useful
character range, that's their prerogative; however, emacs should avoid
making what they enter _less_ meaningful because of the case-fold-search
setting.

My point was that perhaps in practice, the ranges that would get screwed
up by case-fold-search are even less sensible than normal, meaning it's
likely most people wouldn't (or shouldn't) use them, and we really don't
need to worry about the issue.  [ASCII is probably a special case, since
it's so well known that people actually do tend to specify weird ranges]

[but it looks like maybe it will get fixed properly anyway...]

-miles
-- 
`Life is a boundless sea of bitterness'
* Re: regex and case-fold-search problem 2002-08-26 1:54 ` Miles Bader @ 2002-08-26 16:11 ` Stefan Monnier 0 siblings, 0 replies; 40+ messages in thread From: Stefan Monnier @ 2002-08-26 16:11 UTC (permalink / raw) Cc: Andreas Schwab, Kenichi Handa, eliz, emacs-devel > Andreas Schwab <schwab@suse.de> writes: > > |> Yeah, but character ranges make perfect sense in many local contexts. > > |> E.g., [0-9], or [<0>-<9>] where <0> and <9> are `wide' digits from some > > |> character set. > > > > What does [A-Z] mean in EBCDIC? [0-9] is a special case, because ISO C > > requires that 0,1,2,3,4,5,6,7,8,9 are consecutive in the execution > > character set. But in many locales the collating sequence <A> - <Z> > > contains more that just the upper case letters from the English alphabet. > > The question is not `does [A-Z] make sense?', but rather: `_if_ [A-Z] > makes sense, does [a-z] make sense too?' > > That is, we aren't the ones writing [A-Z], it's lisp authors or users > entering regexps or something. If they want to enter a less-than-useful > character range, that's their prerogative; however, emacs should avoid > making what they enter _less_ meaningful because of the case-fold-search > setting. > > My point was that perhaps in practice, the ranges that would get screwed > up by case-fold-search are even less sensible that normal, meaning it's > likely most people wouldn't (or shouldn't) use them, and we really don't > need to worry about the issue. [ASCII is probably a special case, since > it's so well known that people actually do tend to specify wierd ranges] > > [but it looks like maybe it will get fixed properly anyway...] I agree that we shouldn't spend too much time on it. The patch I installed does the following: - Fix a few problems such as ``if the case-table mapped ?* to ?o then "\\(fo\\)*" used to only match "foo"''. Luckily such case-tables are not very common, so nobody noticed the problem. 
- case-fold-search now works correctly for ranges in ASCII
- case-fold-search still doesn't work correctly for ranges in non-ASCII
  but it matches at least as much as when case-fold-search is nil:
  i.e. the range might include some chars which the user didn't expect,
  but it at least includes the chars which the user expected.
  The previous behavior was that the range could include some unexpected
  chars as well and could also omit some expected chars.

The current code matches at least as many strings as the previous one.
I think that's good enough for now,


	Stefan
* Re: regex and case-fold-search problem
  2002-08-24 16:16 ` Andreas Schwab
  2002-08-26  1:54 ` Miles Bader
@ 2002-08-26 21:51 ` Richard Stallman
  1 sibling, 0 replies; 40+ messages in thread
From: Richard Stallman @ 2002-08-26 21:51 UTC (permalink / raw)
  Cc: miles, handa, eliz, emacs-devel

    What does [A-Z] mean in EBCDIC?

Fortunately, we don't need to worry about the question.
Emacs always operates on ASCII or extensions of ASCII.
* Re: regex and case-fold-search problem 2002-08-24 0:51 ` Kenichi Handa 2002-08-24 1:03 ` Miles Bader @ 2002-08-24 9:39 ` Eli Zaretskii 2002-08-26 1:29 ` Kenichi Handa 2002-08-25 22:21 ` Kim F. Storm 2 siblings, 1 reply; 40+ messages in thread From: Eli Zaretskii @ 2002-08-24 9:39 UTC (permalink / raw) Cc: emacs-devel > Date: Sat, 24 Aug 2002 09:51:46 +0900 (JST) > From: Kenichi Handa <handa@etl.go.jp> > > > Does that happen because under case-fold-search non-nil the > > characters on the range specification are downcased? > > Yes. Then perhaps, instead of downcasing the range, we should do the comparison in a case-insensitive manner? Or is that impossible with the current regex code? > I mean that the concept of character range itself is not > good. As Miles wrote, it does make a perfect sense in a context of a specific language. For example, if the characters that designate the range are all Cyrillic characters, the range is sensible. It would IMHO be a pity to lose the ability to specify ranges in such cases. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: regex and case-fold-search problem
  2002-08-24  9:39 ` Eli Zaretskii
@ 2002-08-26  1:29 ` Kenichi Handa
  2002-08-26  2:31 ` Miles Bader
  0 siblings, 1 reply; 40+ messages in thread
From: Kenichi Handa @ 2002-08-26  1:29 UTC (permalink / raw)
  Cc: emacs-devel

In article <9743-Sat24Aug2002123958+0300-eliz@is.elta.co.il>, "Eli Zaretskii" <eliz@is.elta.co.il> writes:
>> > Does that happen because under case-fold-search non-nil the
>> > characters on the range specification are downcased?
>>
>> Yes.

> Then perhaps, instead of downcasing the range, we should do the
> comparison in a case-insensitive manner?  Or is that impossible with
> the current regex code?

Of course, it's not impossible.  It's just not easy.

>> I mean that the concept of character range itself is not
>> good.

> As Miles wrote, it does make a perfect sense in a context of a
> specific language.  For example, if the characters that designate the
> range are all Cyrillic characters, the range is sensible.

It makes sense only when we assume some character set (or
locale).  For instance, in Emacs 21, Cyrillic characters
have the same code order as that of iso-8859-5.  But, in
emacs-unicode, we use Unicode.  So, a Cyrillic char range
that works well in Emacs 21 won't work in emacs-unicode.

> It would IMHO be a pity to lose the ability to specify ranges in such
> cases.

I'm not suggesting removing that ability.  I'm just wondering
if it is worth spending our time (and perhaps users' time)
to make Emacs handle a char range completely correctly,
especially when case-fold-search is t.

I think something like Stefan's compromise method (quoted
below) is good enough.

> For ASCII it's pretty easy to fix.  But for other charsets, it's
> indeed more tricky.  Maybe we can simply use the smallest contiguous
> range of chars that includes all the chars we should match,
> so the behavior is indeed "implementation-defined" (in the sense
> that it's not necessarily obvious to the user what happens) but
> it's at least less confusing (in the sense that (case-fold-search t)
> matches at least as much as (case-fold-search nil)).

---
Ken'ichi HANDA
handa@etl.go.jp
* Re: regex and case-fold-search problem
  2002-08-26  1:29 ` Kenichi Handa
@ 2002-08-26  2:31 ` Miles Bader
  0 siblings, 0 replies; 40+ messages in thread
From: Miles Bader @ 2002-08-26  2:31 UTC (permalink / raw)
  Cc: eliz, emacs-devel

Kenichi Handa <handa@etl.go.jp> writes:
> > As Miles wrote, it does make a perfect sense in a context of a
> > specific language.  For example, if the characters that designate the
> > range are all Cyrillic characters, the range is sensible.
>
> It makes sense only when we assume some character set (or locale).
> For instance, in Emacs 21, Cyrillic characters have the same code order
> as that of iso-8859-5.  But, in emacs-unicode, we use Unicode.  So, a
> Cyrillic char range that works well in Emacs 21 won't work in
> emacs-unicode.

I don't think it really matters.  As I said in a previous message, the
question is not `does [A-Z] make sense?', but rather: `_if_ [A-Z] makes
sense, does [a-z] make sense too?'

If someone writes [<cyrillic-char>-<chinese_char>] then they get
what they deserve; it's not emacs' fault.

-Miles
-- 
.Numeric stability is probably not all that important when you're guessing.
* Re: regex and case-fold-search problem 2002-08-24 0:51 ` Kenichi Handa 2002-08-24 1:03 ` Miles Bader 2002-08-24 9:39 ` Eli Zaretskii @ 2002-08-25 22:21 ` Kim F. Storm 2 siblings, 0 replies; 40+ messages in thread From: Kim F. Storm @ 2002-08-25 22:21 UTC (permalink / raw) Cc: eliz, emacs-devel Kenichi Handa <handa@etl.go.jp> writes: > In article <9003-Fri23Aug2002185625+0300-eliz@is.elta.co.il>, "Eli Zaretskii" <eliz@is.elta.co.il> writes: > >> (let ((case-fold-search nil)) > >> (string-match "[A-_]" "A")) => 0 > >> (let ((case-fold-search t)) > >> (string-match "[A-_]" "A")) => nil > >> (let ((case-fold-search t)) > >> (string-match "[A_]" "A")) => 0 > > > Does that happen because under case-fold-search non-nil the > > characters on the range specification are downcased? > > Yes. > > >> In my opinion, specifying ranges by chars are nonsense > >> because there should be no semantics in the order of > >> characters codes. > > > Sorry, I don't understand: how would one specify a range _except_ > > with two characters and a dash between them? What am I missing? > > I mean that the concept of character range itself is not > good. A character code is just an identifier of a > character. We usually don't think about "a range of > identifiers" (e.g. "symbols in the range between t and nil" > is nonsense). Which is why [[:alpha:]] [[:digit:]] etc were invented for regex's. They are supposed to "look at the locale"... -- Kim F. Storm <storm@cua.dk> http://www.cua.dk ^ permalink raw reply [flat|nested] 40+ messages in thread
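As an illustration of the character classes mentioned above, here is a minimal sketch that uses the C library's POSIX `regcomp`/`regexec` rather than Emacs's own regex.c. The helper name `posix_match` is hypothetical, and the program is assumed to run in the default "C" locale, where `[[:digit:]]` and `[[:alpha:]]` cover only the ASCII digits and letters.

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if PATTERN (a POSIX basic regex) matches somewhere in TEXT,
   0 if it does not, and -1 if PATTERN fails to compile.  */
static int
posix_match (const char *pattern, const char *text)
{
  regex_t re;
  int rc;

  if (regcomp (&re, pattern, REG_NOSUB) != 0)
    return -1;
  rc = regexec (&re, text, 0, NULL, 0);
  regfree (&re);
  return rc == 0;
}
```

A class such as `[[:digit:]]` names the set of characters directly, so its meaning does not depend on where the digits happen to sit in the code table the way a range like `[0-9]` does.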
* Re: regex and case-fold-search problem 2002-08-23 6:25 regex and case-fold-search problem Kenichi Handa 2002-08-23 15:56 ` Eli Zaretskii @ 2002-08-23 17:36 ` Stefan Monnier 2002-08-23 21:52 ` Stefan Monnier ` (2 more replies) 2002-08-26 21:51 ` Richard Stallman 2 siblings, 3 replies; 40+ messages in thread From: Stefan Monnier @ 2002-08-23 17:36 UTC (permalink / raw) Cc: emacs-devel > While working on emacs-unicode, I noticed a very difficult > problem which also exists in the current emacs. > > (let ((case-fold-search nil)) > (string-match "[Þ-ß]" "Þ")) => 0 > (let ((case-fold-search nil)) > (string-match "[Þß]" "Þ")) => 0 > > (let ((case-fold-search t)) > (string-match "[Þ-ß]" "Þ")) => nil !!! > (let ((case-fold-search t)) > (string-match "[Þß]" "Þ")) => 0 > > When you see the output of M-x list-charset-chars RET > latin-iso8859-1 RET, you'll soon find what's going on. > > The relevan character codes are as follows: > Þ (#x8DE) > ß (#x8DF) > (downcase ?Þ) == ?þ (#x8FE) > (downcase ?ß) == ?ß (#x8DF) > > This problem is not specific to non-ASCII chars, it's just > rarer to face such a sitution in ASCII chars. > > (let ((case-fold-search nil)) > (string-match "[A-_]" "A")) => 0 > (let ((case-fold-search t)) > (string-match "[A-_]" "A")) => nil > (let ((case-fold-search t)) > (string-match "[A_]" "A")) => 0 > > In my opinion, specifying ranges by chars are nonsense > because there should be no semantics in the order of > characters codes. Indeed. POSIX basically says the behavior is unclear (it's locale-dependent). But I think that if it works with (case-fold-search nil) it should also work with (case-fold-search t). The current behavior is really counter-intuitive. > But, anyway, we have to decide what to do. > > (1) Regard the above case as a bug, and fix it completely. > As we don't support a range striding over different > charsets by the current Emacs, I think the fix is > difficult but not that much. 
But, in emacs-unicode, we > can't have such a restriction, and thus the fix is very > difficult. For ASCII it's pretty easy to fix. But for other charsets, it's indeed more tricky. Maybe we can simply use the smallest contiguous range of chars that includes all the chars we should match, so the behavior is indeed "implementation-defined" (in the sense that it's not necessarily obvious to the user what happens) but it's at least less confusing (in the sense that (case-fold-search t) matches at least as much as (case-fold-search nil)). > (2) Regard the above case as an (unpleasant) feature, and > document it. I think we should document the fact that char-ranges shouldn't be relied upon too much, especially outside of ASCII. That's true no matter how we deal with the problem. > (3) Signal an error for such a regex (and of course document it). That might be an option as well. Stefan ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: regex and case-fold-search problem
  2002-08-23 17:36 ` Stefan Monnier
@ 2002-08-23 21:52 ` Stefan Monnier
  2002-08-24  1:16 ` Kenichi Handa
  2002-08-24 10:40 ` Kai Großjohann
  2 siblings, 0 replies; 40+ messages in thread
From: Stefan Monnier @ 2002-08-23 21:52 UTC (permalink / raw)
  Cc: emacs-devel

"Stefan Monnier" <monnier+gnu/emacs@rum.cs.yale.edu> wrote:
> For ASCII it's pretty easy to fix.  But for other charsets, it's
> indeed more tricky.  Maybe we can simply use the smallest contiguous
> range of chars that includes all the chars we should match,
> so the behavior is indeed "implementation-defined" (in the sense
> that it's not necessarily obvious to the user what happens) but
> it's at least less confusing (in the sense that (case-fold-search t)
> matches at least as much as (case-fold-search nil)).

How about the patch below ?


	Stefan


Index: regex.c
===================================================================
RCS file: /cvsroot/emacs/emacs/src/regex.c,v
retrieving revision 1.176
diff -u -u -b -r1.176 regex.c
--- regex.c	25 Mar 2002 00:45:48 -0000	1.176
+++ regex.c	23 Aug 2002 21:49:10 -0000
@@ -1914,12 +1914,13 @@
 #define BIT_UPPER	0x10
 #define BIT_MULTIBYTE	0x20
 
-/* Set a range (RANGE_START, RANGE_END) to WORK_AREA.  */
-#define SET_RANGE_TABLE_WORK_AREA(work_area, range_start, range_end)	\
+/* Set a range START..END to WORK_AREA.
+   The range is passed through TRANSLATE, so START and END
+   should be untranslated.  */
+#define SET_RANGE_TABLE_WORK_AREA(work_area, start, end)		\
   do {									\
     EXTEND_RANGE_TABLE_WORK_AREA ((work_area), 2);			\
-    (work_area).table[(work_area).used++] = (range_start);		\
-    (work_area).table[(work_area).used++] = (range_end);		\
+    set_image_of_range (&work_area, start, end, translate);		\
   } while (0)
 
 /* Free allocated memory for WORK_AREA.  */
@@ -2077,6 +2078,31 @@
 }
 #endif
 
+
+
+/* We need to find the image of the range start..end when passed through
+   TRANSLATE.  This is not necessarily TRANSLATE(start)..TRANSLATE(end)
+   and is not even necessarily contiguous.
+   We approximate it with the smallest contiguous range that contains
+   all the chars we need.  */
+static void
+set_image_of_range (work_area, start, end, translate)
+     RE_TRANSLATE_TYPE translate;
+     struct range_table_work_area *work_area;
+     re_wchar_t start, end;
+{
+  re_wchar_t cmin = TRANSLATE (start), cmax = TRANSLATE (end);
+  if (RE_TRANSLATE_P (translate))
+    for (; start <= end; start++)
+      {
+	re_wchar_t c = TRANSLATE (start);
+	cmin = MIN (cmin, c);
+	cmax = MAX (cmax, c);
+      }
+  work_area->table[work_area->used++] = (cmin);
+  work_area->table[work_area->used++] = (cmax);
+}
+
 /* Explicit quit checking is only used on NTemacs.  */
 #if defined WINDOWSNT && defined emacs && defined QUIT
 extern int immediate_quit;
@@ -2525,14 +2551,18 @@
 
 		if (p == pend) FREE_STACK_RETURN (REG_EBRACK);
 
-		PATFETCH (c);
+		/* Don't translate yet.  The range TRANSLATE(X..Y) cannot
+		   always be determined from TRANSLATE(X) and TRANSLATE(Y)
+		   So the translation is done later in a loop.  Example:
+		   (let ((case-fold-search t)) (string-match "[A-_]" "A"))  */
+		PATFETCH_RAW (c);
 
 		/* \ might escape characters inside [...] and [^...].  */
 		if ((syntax & RE_BACKSLASH_ESCAPE_IN_LISTS) && c == '\\')
 		  {
 		    if (p == pend) FREE_STACK_RETURN (REG_EESCAPE);
 
-		    PATFETCH (c);
+		    PATFETCH_RAW (c);
 		    escaped_char = true;
 		  }
 		else
@@ -2636,10 +2668,10 @@
 		  {
 
 		    /* Discard the `-'. */
-		    PATFETCH (c1);
+		    PATFETCH_RAW (c1);
 
 		    /* Fetch the character which ends the range. */
-		    PATFETCH (c1);
+		    PATFETCH_RAW (c1);
 
 		    if (SINGLE_BYTE_CHAR_P (c))
 		      {
@@ -2653,7 +2685,7 @@
 			       starting at the smallest character in
 			       the charset of C1 and ending at C1.  */
 			    int charset = CHAR_CHARSET (c1);
-			    int c2 = MAKE_CHAR (charset, 0, 0);
+			    re_wchar_t c2 = MAKE_CHAR (charset, 0, 0);
 
 			    SET_RANGE_TABLE_WORK_AREA (range_table_work,
 						       c2, c1);
@@ -2672,7 +2704,7 @@
 		  /* ... into bitmap.  */
 		  {
 		    re_wchar_t this_char;
-		    int range_start = c, range_end = c1;
+		    re_wchar_t range_start = c, range_end = c1;
 
 		    /* If the start is after the end, the range is empty.  */
 		    if (range_start > range_end)
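The "smallest contiguous range" idea can be tried in isolation. The sketch below is a standalone approximation and not the patch itself: `tolower` in the default "C" locale stands in for TRANSLATE, and the names `struct char_range` and `image_of_range` are hypothetical. For the range A.._ it yields [..z (#x5B..#x7A), which covers every folded member but also takes in the backquote (#x60), a character the user never asked for.

```c
#include <ctype.h>

struct char_range { int min, max; };

/* Smallest contiguous range containing tolower (c) for every c in
   START..END.  Assumes 0 <= START <= END <= 255.  */
static struct char_range
image_of_range (int start, int end)
{
  struct char_range r = { tolower (start), tolower (end) };

  for (int c = start; c <= end; c++)
    {
      int t = tolower (c);
      if (t < r.min)
        r.min = t;
      if (t > r.max)
        r.max = t;
    }
  return r;
}
```

The over-approximation is the price of keeping a single range: it never drops a character that should match, but it can admit extra ones.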
* Re: regex and case-fold-search problem 2002-08-23 17:36 ` Stefan Monnier 2002-08-23 21:52 ` Stefan Monnier @ 2002-08-24 1:16 ` Kenichi Handa 2002-08-25 18:52 ` Stefan Monnier 2002-08-24 10:40 ` Kai Großjohann 2 siblings, 1 reply; 40+ messages in thread From: Kenichi Handa @ 2002-08-24 1:16 UTC (permalink / raw) Cc: emacs-devel In article <200208231736.g7NHafW02174@rum.cs.yale.edu>, "Stefan Monnier" <monnier+gnu/emacs@rum.cs.yale.edu> writes: > But I think that if it works with (case-fold-search nil) it should > also work with (case-fold-search t). The current behavior is really > counter-intuitive. I agree. >> But, anyway, we have to decide what to do. >> >> (1) Regard the above case as a bug, and fix it completely. >> As we don't support a range striding over different >> charsets by the current Emacs, I think the fix is >> difficult but not that much. But, in emacs-unicode, we >> can't have such a restriction, and thus the fix is very >> difficult. > For ASCII it's pretty easy to fix. But for other charsets, it's > indeed more tricky. Maybe we can simply use the smallest contiguous > range of chars that includes all the chars we should match, > so the behavior is indeed "implementation-defined" (in the sense > that it's not necessarily obvious to the user what happens) but > it's at least less confusing (in the sense that (case-fold-search t) > matches at least as much as (case-fold-search nil)). Ideally, the range "[A-_]" must be converted to "[a-z[-_]". But, it seems that your idea is to convert "[A-_]" to "[_-z]", correct? I agree that it results in less counter-intuitive behaviour. > How about the patch below ? [...] ?? It seems that the patch handles only non-ASCII chars. --- Ken'ichi HANDA handa@etl.go.jp ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: regex and case-fold-search problem 2002-08-24 1:16 ` Kenichi Handa @ 2002-08-25 18:52 ` Stefan Monnier 2002-08-26 1:56 ` Kenichi Handa 0 siblings, 1 reply; 40+ messages in thread From: Stefan Monnier @ 2002-08-25 18:52 UTC (permalink / raw) Cc: monnier+gnu/emacs, emacs-devel > In article <200208231736.g7NHafW02174@rum.cs.yale.edu>, "Stefan Monnier" <monnier+gnu/emacs@rum.cs.yale.edu> writes: > > But I think that if it works with (case-fold-search nil) it should > > also work with (case-fold-search t). The current behavior is really > > counter-intuitive. > > I agree. > > >> But, anyway, we have to decide what to do. > >> > >> (1) Regard the above case as a bug, and fix it completely. > >> As we don't support a range striding over different > >> charsets by the current Emacs, I think the fix is > >> difficult but not that much. But, in emacs-unicode, we > >> can't have such a restriction, and thus the fix is very > >> difficult. > > > For ASCII it's pretty easy to fix. But for other charsets, it's > > indeed more tricky. Maybe we can simply use the smallest contiguous > > range of chars that includes all the chars we should match, > > so the behavior is indeed "implementation-defined" (in the sense > > that it's not necessarily obvious to the user what happens) but > > it's at least less confusing (in the sense that (case-fold-search t) > > matches at least as much as (case-fold-search nil)). > > Ideally, the range "[A-_]" must be converted to "[a-z[-_]". Indeed and the (new) current code does just that for ASCII. > But, it seems that your idea is to convert "[A-_]" to > "[_-z]", correct? I agree that it results in less > counter-intuitive behaviour. Not quite: [_-z] would not include [ \ ] and ^. So instead it's [[-z] which includes all of [a-z[-_] as well as ` (in this particular case). > > How about the patch below ? > [...] > ?? It seems that the patch handles only non-ASCII chars. 
Well, that's because the code for ASCII was already there (just didn't work right because we did PATFETCH instead of PATFETCH_RAW). Stefan ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: regex and case-fold-search problem 2002-08-25 18:52 ` Stefan Monnier @ 2002-08-26 1:56 ` Kenichi Handa 0 siblings, 0 replies; 40+ messages in thread From: Kenichi Handa @ 2002-08-26 1:56 UTC (permalink / raw) Cc: emacs-devel In article <200208251852.g7PIqf121329@rum.cs.yale.edu>, "Stefan Monnier" <monnier+gnu/emacs@rum.cs.yale.edu> writes: >> But, it seems that your idea is to convert "[A-_]" to >> "[_-z]", correct? I agree that it results in less >> counter-intuitive behaviour. > Not quite: [_-z] would not include [ \ ] and ^. > So instead it's [[-z] which includes all of [a-z[-_] > as well as ` (in this particular case). Ah! Right. >> > How about the patch below ? >> [...] >> ?? It seems that the patch handles only non-ASCII chars. > Well, that's because the code for ASCII was already there (just > didn't work right because we did PATFETCH instead of PATFETCH_RAW). I see. I confirmed that with the latest code. Thank you! --- Ken'ichi HANDA handa@etl.go.jp ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: regex and case-fold-search problem
  2002-08-23 17:36 ` Stefan Monnier
  2002-08-23 21:52 ` Stefan Monnier
  2002-08-24  1:16 ` Kenichi Handa
@ 2002-08-24 10:40 ` Kai Großjohann
  2 siblings, 0 replies; 40+ messages in thread
From: Kai Großjohann @ 2002-08-24 10:40 UTC (permalink / raw)
  Cc: Kenichi Handa, emacs-devel

"Stefan Monnier" <monnier+gnu/emacs@rum.cs.yale.edu> writes:

> For ASCII it's pretty easy to fix.  But for other charsets, it's
> indeed more tricky.  Maybe we can simply use the smallest contiguous
> range of chars that includes all the chars we should match,
> so the behavior is indeed "implementation-defined" (in the sense
> that it's not necessarily obvious to the user what happens) but
> it's at least less confusing (in the sense that (case-fold-search t)
> matches at least as much as (case-fold-search nil)).

My first intuition would be to take all the characters in the range
[A-_] (preserving case), then to "double" each character that has an
uppercase and a lowercase variant.

So we are talking about the characters
"ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_" for the given range, and now we make
a case-insensitive variant of this list of characters.

Does this make sense?  Is it feasible to implement?

kai
-- 
A large number of young women don't trust men with beards.  (BFBS Radio)
* Re: regex and case-fold-search problem
  2002-08-23  6:25 regex and case-fold-search problem Kenichi Handa
  2002-08-23 15:56 ` Eli Zaretskii
  2002-08-23 17:36 ` Stefan Monnier
@ 2002-08-26 21:51 ` Richard Stallman
  2002-08-29  8:53 ` Kenichi Handa
  2 siblings, 1 reply; 40+ messages in thread
From: Richard Stallman @ 2002-08-26 21:51 UTC
Cc: emacs-devel

    In my opinion, specifying ranges by chars are nonsense
    because there should be no semantics in the order of
    characters codes.

The fact is, people know the character codes and take advantage of
their knowledge. I don't think this is unreasonable. But that
question is academic, since the feature is used and we need to make it
work.

    Does that happen because under case-fold-search non-nil the
    characters on the range specification are downcased?

It looks that way.

    Maybe we can simply use the smallest contiguous
    > range of chars that includes all the chars we should match,

That isn't right. The range should be equal to the disjunction of all
characters in it; A-_ should be equivalent to []A.....Z[\^_]. With
case folding, that should match A-Z, a-z, and [\]^_. In other words,
the correct behavior is that all character codes that are equivalent
(when you ignore case) to any character in the originally specified
range should match.

Given the whole case table, you can compute this by looping over the
original (non-case-folded) range and finding, for each character, all
the characters that are equivalent to it. Then those could be
assembled into the smallest possible number of ranges.

A faster way, in the usual cases, would be to look for the case where
several consecutive characters have just one case-sibling each,
and the siblings are consecutive too. Each subrange of this kind can
be turned into two subranges, the original and the case-converted.
Also identify subranges of characters that have no case-siblings; each
subrange of this kind just remains as it is. Finally, any unusual
characters that are encountered can be replaced with a list of all the
case-siblings.

This too requires use of the whole case table.
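The first method described above (loop over the original range,
collect every case-equivalent character, then assemble the result
into as few ranges as possible) might look roughly like this in C.
It is a sketch only: it assumes ASCII, uses toupper/tolower in the
"C" locale in place of the real case table, and the names
fold_range_exact and set_to_ranges are hypothetical, not functions
from regex.c.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical sketch, ASCII only: mark in SET (128 entries) every
   character that is case-equivalent to some character in [lo, hi].  */
void
fold_range_exact (int lo, int hi, unsigned char set[128])
{
  int c;

  assert (0 <= lo && lo <= hi && hi < 128);
  memset (set, 0, 128);
  for (c = lo; c <= hi; c++)
    {
      set[c] = 1;
      set[toupper (c)] = 1;
      set[tolower (c)] = 1;
    }
}

/* Collapse SET into maximal runs of consecutive members.  Store up
   to MAX_RUNS [start, end] pairs in RUNS and return the total number
   of runs found (which may exceed MAX_RUNS).  */
int
set_to_ranges (const unsigned char set[128], int runs[][2], int max_runs)
{
  int n = 0, c = 0, start;

  while (c < 128)
    {
      if (!set[c])
        {
          c++;
          continue;
        }
      start = c;
      while (c < 128 && set[c])
        c++;
      if (n < max_runs)
        {
          runs[n][0] = start;
          runs[n][1] = c - 1;
        }
      n++;
    }
  return n;
}
```

For the range A-_ this gives two runs, A-_ and a-z, and leaves `
(code 96) out, which is the behavior the message asks for. The cost
is one pass over the whole range, which is the concern for very large
ranges.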
* Re: regex and case-fold-search problem
  2002-08-26 21:51 ` Richard Stallman
@ 2002-08-29  8:53 ` Kenichi Handa
  2002-08-29 12:33 ` Kim F. Storm
  2002-08-30 19:19 ` Richard Stallman
  0 siblings, 2 replies; 40+ messages in thread
From: Kenichi Handa @ 2002-08-29 8:53 UTC
Cc: emacs-devel

In article <200208262151.g7QLpfA12782@wijiji.santafe.edu>, Richard Stallman <rms@gnu.org> writes:
> The fact is, people know the character codes and take advantage of
> their knowledge. I don't think this is unreasonable. But that
> question is academic, since the feature is used and we need to make it
> work.

People know the character codes that are based on their
familiar charset. So, they can take advantage only when
Emacs internally uses the character representation in which
character code order is the same as that familiar charset.

For instance, those who are familiar with iso-8859-2 charset
can take advantage of their knowledge in Emacs 21. But, if
they write such a regular expression, they'll find it
matches different characters in emacs-unicode.

>     Maybe we can simply use the smallest contiguous
>     > range of chars that includes all the chars we should match,

> That isn't right. The range should be equal to the disjunction of all
> characters in it; A-_ should be equivalent to []A.....Z[\^_]. With
> case folding, that should match A-Z, a-z, and [\]^_. In other words,
> the correct behavior is that all character codes that are equivalent
> (when you ignore case) to any character in the originally specified
> range should match.

I think we all know that is the right behaviour, and at
least for ASCII, the latest code works that way. Perhaps we
should make Emacs work correctly also for Latin-1 chars,
because in emacs-unicode also, they have the same code
order. But...

> Given the whole case table, you can compute this by looping over the
> original (non-case-folded) range and finding, for each character, all
> the characters that are equivalent to it. Then those could be
> assembled into the smallest possible number of ranges.

> A faster way, in the usual cases, would be to look for the case where
> several consecutive characters have just one case-sibling each,
> and the siblings are consecutive too. Each subrange of this kind can
> be turned into two subranges, the original and the case-converted.
> Also identify subranges of characters that have no case-siblings; each
> subrange of this kind just remains as it is. Finally, any unusual
> characters that are encountered can be replaced with a list of all the
> case-siblings.

> This too requires use of the whole case table.

Implementing that for any range of characters consumes our
man-power and makes the running code slower.

Consider the situation that one writes this regexp
    "[\000-\xffff]"
to search only Unicode BMP chars in emacs-unicode. I
suspect that, if we implement the above method, compiling this
regexp when case-fold-search is non-nil takes longer
than people usually expect.

So, I agree with Stefan that his method is good enough.

---
Ken'ichi HANDA
handa@etl.go.jp
* Re: regex and case-fold-search problem
  2002-08-29  8:53 ` Kenichi Handa
@ 2002-08-29 12:33 ` Kim F. Storm
  2002-08-29 13:38 ` Kenichi Handa
  2002-08-30 19:19 ` Richard Stallman
  1 sibling, 1 reply; 40+ messages in thread
From: Kim F. Storm @ 2002-08-29 12:33 UTC
Cc: rms, emacs-devel

> Consider the situation that one writes this regexp
>     "[\000-\xffff]"
> to search only Unicode BMP chars in emacs-unicode. I
> suspect that, if we implement the above method, compiling this
> regexp when case-fold-search is non-nil takes longer
> than people usually expect.
>
> So, I agree with Stefan that his method is good enough.

IMO, it is wrong to handle case-fold-search for regexp ranges by
trying to modify the interpretation of the regex range.

Instead, the regex matcher should try to upcase and lowercase each
character in the string and see if either of these characters is
within the given range.

-- 
Kim F. Storm <storm@cua.dk> http://www.cua.dk
* Re: regex and case-fold-search problem
  2002-08-29 12:33 ` Kim F. Storm
@ 2002-08-29 13:38 ` Kenichi Handa
  2002-08-29 15:00 ` Kim F. Storm
  2002-08-29 16:00 ` Stefan Monnier
  0 siblings, 2 replies; 40+ messages in thread
From: Kenichi Handa @ 2002-08-29 13:38 UTC
Cc: rms, emacs-devel

In article <5x8z2pj13t.fsf@kfs2.cua.dk>, storm@cua.dk (Kim F. Storm) writes:
> IMO, it is wrong to handle case-fold-search for regexp ranges by
> trying to modify the interpretation of the regex range.

> Instead, the regex matcher should try to upcase and lowercase each
> character in the string and see if either of these characters is
> within the given range.

I also reached that idea. It makes regexp compiling
simpler and faster but makes regexp matching a little bit
slower. I don't know if that slowness is tolerable or
not, but it's worth trying.

---
Ken'ichi HANDA
handa@etl.go.jp
* Re: regex and case-fold-search problem
  2002-08-29 13:38 ` Kenichi Handa
@ 2002-08-29 15:00 ` Kim F. Storm
  2002-08-29 16:00 ` Stefan Monnier
  1 sibling, 0 replies; 40+ messages in thread
From: Kim F. Storm @ 2002-08-29 15:00 UTC
Cc: storm, rms, emacs-devel

Kenichi Handa <handa@etl.go.jp> writes:

> In article <5x8z2pj13t.fsf@kfs2.cua.dk>, storm@cua.dk (Kim F. Storm) writes:
> > IMO, it is wrong to handle case-fold-search for regexp ranges by
> > trying to modify the interpretation of the regex range.
>
> > Instead, the regex matcher should try to upcase and lowercase each
> > character in the string and see if either of these characters is
> > within the given range.
>
> I also reached that idea. It makes regexp compiling
> simpler and faster but makes regexp matching a little bit
> slower. I don't know if that slowness is tolerable or
> not, but it's worth trying.

Maybe it can be semi-optimized for a char C as follows:

  MATCH = (C in range)
          || ((UC = uppercase(C)) != C
              ? (UC in range)
              : (lowercase(C) in range))

-- 
Kim F. Storm <storm@cua.dk> http://www.cua.dk
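The match-time test suggested above could be written out in C along
these lines. This is a sketch under assumptions, not the regex.c
matcher: it handles ASCII only, relies on toupper/tolower in the "C"
locale, and in_range and match_range_folded are hypothetical names.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical helper: nonzero if C lies in [lo, hi].  */
int
in_range (int c, int lo, int hi)
{
  return lo <= c && c <= hi;
}

/* Match-time folding: leave the compiled range untouched and test
   the character itself, then its other-case variant.  Assumes ASCII
   and the "C" locale.  */
int
match_range_folded (int c, int lo, int hi)
{
  int uc;

  assert (0 <= c && c < 128);
  if (in_range (c, lo, hi))
    return 1;
  uc = toupper (c);
  /* If C has a distinct upper-case form it was lower case, so try
     that form; otherwise try its lower-case form.  */
  return uc != c ? in_range (uc, lo, hi) : in_range (tolower (c), lo, hi);
}
```

With the range A-_, the character a matches through its upper-case
form, while ` does not match at all. As noted later in the thread,
the real matcher would need access to both case tables for this to
work beyond ASCII.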
* Re: regex and case-fold-search problem
  2002-08-29 13:38 ` Kenichi Handa
  2002-08-29 15:00 ` Kim F. Storm
@ 2002-08-29 16:00 ` Stefan Monnier
  2002-08-30  1:11 ` Kenichi Handa
  1 sibling, 1 reply; 40+ messages in thread
From: Stefan Monnier @ 2002-08-29 16:00 UTC
Cc: storm, rms, emacs-devel

> In article <5x8z2pj13t.fsf@kfs2.cua.dk>, storm@cua.dk (Kim F. Storm) writes:
> > IMO, it is wrong to handle case-fold-search for regexp ranges by
> > trying to modify the interpretation of the regex range.
>
> > Instead, the regex matcher should try to upcase and lowercase each
> > character in the string and see if either of these characters is
> > within the given range.
>
> I also reached that idea. It makes regexp compiling
> simpler and faster but makes regexp matching a little bit
> slower. I don't know if that slowness is tolerable or
> not, but it's worth trying.

Two things:
- Neither `upper(lower(x)) = x' nor `lower(upper(x)) = x' is guaranteed.
- The regexp matcher right now only has access to one of the two tables
  (I believe it's the `lower' but I'm not even sure) and so two chars
  are deemed to match if translate(a) = translate(b).

The first might be a non-issue, I don't know.
The second is more serious because that means that if we want to use
`upper' we'll need to somehow pass that table as well, which requires
changing the interface to the reg-matching functions.

        Stefan
* Re: regex and case-fold-search problem
  2002-08-29 16:00 ` Stefan Monnier
@ 2002-08-30  1:11 ` Kenichi Handa
  2002-08-30 19:19 ` Richard Stallman
  0 siblings, 1 reply; 40+ messages in thread
From: Kenichi Handa @ 2002-08-30 1:11 UTC
Cc: storm, rms, emacs-devel

In article <200208291600.g7TG0NZ11087@rum.cs.yale.edu>, "Stefan Monnier" <monnier+gnu/emacs@rum.cs.yale.edu> writes:
> Two things:
> - Neither `upper(lower(x)) = x' nor `lower(upper(x)) = x' is guaranteed.
> - The regexp matcher right now only has access to one of the two tables
>   (I believe it's the `lower' but I'm not even sure) and so two chars
>   are deemed to match if translate(a) = translate(b).

> The first might be a non-issue, I don't know.

There's an EQUIVALENCES table. It seems that the
documentation of set-case-table says that:
  X and Y match in case-fold-search if:
    equiv(X) == Y
    or equiv(equiv(X)) == Y
    or equiv(equiv(equiv(X))) == Y
    or ...
Correct?

> The second is more serious because that means that if we want to use
> `upper' we'll need to somehow pass that table as well, which requires
> changing the interface to the reg-matching functions.

The TRANSLATE table is passed as the member `translate' of
re_pattern_buffer. Instead of setting it to the lowercase table,
we can set it to the case-table itself that has upcase,
canon, and equiv tables in the extra slots. Or, if we can
use the EQUIVALENCES table as above, what we need is only that
table.

---
Ken'ichi HANDA
handa@etl.go.jp
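The EQUIVALENCES rule stated above can be illustrated with a short C
sketch. It assumes that the table is a permutation in which each
case-equivalence class forms one cycle, so repeated application
always returns to the starting character; the array representation
and the name case_equivalent_p are hypothetical and are not taken
from the Emacs sources.

```c
#include <assert.h>

/* Hypothetical sketch: EQUIV maps each character code below N to the
   next member of its case-equivalence class.  The table is assumed
   to be a permutation, so following it from X eventually returns to
   X; without that property the loop would not terminate.  Returns
   nonzero if Y is reached from X, i.e. X and Y are equivalent.  */
int
case_equivalent_p (const int *equiv, int n, int x, int y)
{
  int c = x;

  assert (0 <= x && x < n && 0 <= y && y < n);
  do
    {
      if (c == y)
        return 1;
      c = equiv[c];
    }
  while (c != x);
  return 0;
}
```

For a table that is the identity everywhere except that A and a map
to each other, A and a are equivalent in both directions, and A is
not equivalent to b. A character with no case-siblings maps to itself
and is equivalent only to itself.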
* Re: regex and case-fold-search problem
  2002-08-30  1:11 ` Kenichi Handa
@ 2002-08-30 19:19 ` Richard Stallman
  0 siblings, 0 replies; 40+ messages in thread
From: Richard Stallman @ 2002-08-30 19:19 UTC
Cc: monnier+gnu/emacs, storm, emacs-devel

This slot in the case table may be useful:

    CANONICALIZE maps each character to a canonical equivalent;
    any two characters that are related by case-conversion have the same
    canonical equivalent character;
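How a canonicalizing map could be used for range matching is sketched
below in C. The sketch assumes ASCII and uses tolower in the "C"
locale as a stand-in for the CANONICALIZE slot; canon,
chars_match_folded and range_matches_folded are hypothetical names,
not part of the Emacs sources.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical stand-in for the CANONICALIZE slot: ASCII only, with
   the lower-case form serving as the canonical equivalent.  */
int
canon (int c)
{
  assert (0 <= c && c < 128);
  return tolower (c);
}

/* Two characters match under case folding if and only if they share
   a canonical equivalent.  */
int
chars_match_folded (int a, int b)
{
  return canon (a) == canon (b);
}

/* Nonzero if C is case-equivalent to some character in [lo, hi].
   This scans the whole range, so its cost grows with the range.  */
int
range_matches_folded (int c, int lo, int hi)
{
  int k;
  int target = canon (c);

  for (k = lo; k <= hi; k++)
    if (canon (k) == target)
      return 1;
  return 0;
}
```

With the range A-_, the character a matches because it shares a
canonical form with A, while ` does not match. The linear scan in
range_matches_folded is only for illustration; it has the same
large-range cost discussed earlier in the thread.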
* Re: regex and case-fold-search problem
  2002-08-29  8:53 ` Kenichi Handa
  2002-08-29 12:33 ` Kim F. Storm
@ 2002-08-30 19:19 ` Richard Stallman
  2002-08-30 20:08 ` Stefan Monnier
  2002-08-31  6:14 ` regex and case-fold-search problem Eli Zaretskii
  1 sibling, 2 replies; 40+ messages in thread
From: Richard Stallman @ 2002-08-30 19:19 UTC
Cc: emacs-devel

    So, I agree with Stefan that his method is good enough.

It is wrong even for ASCII--we definitely must do something better, at
least for ASCII. The only question is, how much more than ASCII?

    I think we all know that is the right behaviour, and at
    least for ASCII, the latest code works that way. Perhaps we
    should make Emacs work correctly also for Latin-1 chars,
    because in emacs-unicode also, they have the same code
    order.

What about for Latin-2 characters? Will those regexp ranges
change their meaning in emacs-unicode?

If so, perhaps we only need to make an effort to support ranges really
right for codes 0-256.

    > A faster way, in the usual cases, would be to look for the case where
    > several consecutive characters have just one case-sibling each,
    > and the siblings are consecutive too. Each subrange of this kind can
    > be turned into two subranges, the original and the case-converted.
    > Also identify subranges of characters that have no case-siblings; each
    > subrange of this kind just remains as it is. Finally, any unusual
    > characters that are encountered can be replaced with a list of all the
    > case-siblings.

    > This too requires use of the whole case table.

    Implementing that for any range of characters consumes our
    man-power and makes the running code slower.

It is not a very hard program to write, I think. I'd guess around 30
lines.

However, you're right about the slowness for large ranges. If we only
do this for codes 0-256 (or, currently, for ASCII and Latin-1), then
it won't be too slow.

    Consider the situation that one writes this regexp
        "[\000-\xffff]"
    to search only Unicode BMP chars in emacs-unicode.

Do you think that is a reasonable kind of range that we should try to
support? If so, there goes my idea that we only need to support
ranges in 0-256 very well.

On the other hand, if we handle \000-\xffff by doing case conversion
carefully only for ASCII and Latin-1, and treat the rest of the range
in a less smart way, we would get the same results in this case.
Is that a good solution?
* Re: regex and case-fold-search problem
  2002-08-30 19:19 ` Richard Stallman
@ 2002-08-30 20:08 ` Stefan Monnier
  2002-09-01 13:15 ` Richard Stallman
  2002-08-31  6:14 ` regex and case-fold-search problem Eli Zaretskii
  1 sibling, 1 reply; 40+ messages in thread
From: Stefan Monnier @ 2002-08-30 20:08 UTC
Cc: handa, emacs-devel

>     So, I agree with Stefan that his method is good enough.
>
> It is wrong even for ASCII

Do you have any evidence to support that claim?

        Stefan
* Re: regex and case-fold-search problem
  2002-08-30 20:08 ` Stefan Monnier
@ 2002-09-01 13:15 ` Richard Stallman
  2002-09-01 16:26 ` Stefan Monnier
  0 siblings, 1 reply; 40+ messages in thread
From: Richard Stallman @ 2002-09-01 13:15 UTC
Cc: handa, emacs-devel

    > It is wrong even for ASCII

    Do you have any evidence to support that claim?

You yourself said it would match characters which
were not case-equivalent to something in the originally specified range.
That means it is wrong.
* Re: regex and case-fold-search problem
  2002-09-01 13:15 ` Richard Stallman
@ 2002-09-01 16:26 ` Stefan Monnier
  2002-09-02 14:54 ` Richard Stallman
  0 siblings, 1 reply; 40+ messages in thread
From: Stefan Monnier @ 2002-09-01 16:26 UTC
Cc: monnier+gnu/emacs, handa, emacs-devel

> > It is wrong even for ASCII
> Do you have any evidence to support that claim?
> You yourself said it would match characters which
> were not case-equivalent to something in the originally specified range.

When was it? I'd guess that was before I installed my patch.

> That means it is wrong.

Thank you for your trust ;-)

        Stefan
* Re: regex and case-fold-search problem
  2002-09-01 16:26 ` Stefan Monnier
@ 2002-09-02 14:54 ` Richard Stallman
  2002-09-02 16:58 ` Stefan Monnier
  0 siblings, 1 reply; 40+ messages in thread
From: Richard Stallman @ 2002-09-02 14:54 UTC
Cc: monnier+gnu/emacs, handa, emacs-devel

    > > It is wrong even for ASCII
    > Do you have any evidence to support that claim?
    > You yourself said it would match characters which
    > were not case-equivalent to something in the originally specified range.

    When was it? I'd guess that was before I installed my patch.

src/ChangeLog does not list any recent changes in regex.c.
Did you install a change and fail to put it in ChangeLog?

Anyway, the change you sent seemed to have the problem of including
excess characters in ASCII ranges. The change can't stay if it has
that problem.
* Re: regex and case-fold-search problem
  2002-09-02 14:54 ` Richard Stallman
@ 2002-09-02 16:58 ` Stefan Monnier
  2002-09-04 14:13 ` Richard Stallman
  0 siblings, 1 reply; 40+ messages in thread
From: Stefan Monnier @ 2002-09-02 16:58 UTC
Cc: monnier+gnu/emacs, handa, emacs-devel

> > > It is wrong even for ASCII
> > Do you have any evidence to support that claim?
> > You yourself said it would match characters which
> > were not case-equivalent to something in the originally specified range.
>
> When was it? I'd guess that was before I installed my patch.
>
> src/ChangeLog does not list any recent changes in regex.c.
> Did you install a change and fail to put it in ChangeLog?

2002-08-23  Stefan Monnier  <monnier@cs.yale.edu>

        * regex.c (PATFETCH): Remove the translating fetch.
        (PATFETCH_RAW): Rename to PATFETCH.
        (set_image_of_range): New fun.
        (SET_RANGE_TABLE_WORK_AREA): Use it.
        (regex_compile): Don't translate the pattern chars so eagerly.
        Only do it when inserting an `exactn' bytecode or when handling
        a char-range.
        (mutually_exclusive_p): Avoid empty statement.

> Anyway, the change you sent seemed to have the problem of including
> excess characters in ASCII ranges.

No, only in non-ASCII chars. The excess is introduced in
set_image_of_range which is only used for non-ASCII chars.

        Stefan
* Re: regex and case-fold-search problem
  2002-09-02 16:58 ` Stefan Monnier
@ 2002-09-04 14:13 ` Richard Stallman
  2002-09-04 16:04 ` Stefan Monnier
  0 siblings, 1 reply; 40+ messages in thread
From: Richard Stallman @ 2002-09-04 14:13 UTC
Cc: monnier+gnu/emacs, handa, emacs-devel

    No, only in non-ASCII chars. The excess is introduced in
    set_image_of_range which is only used for non-ASCII chars.

Does that include Latin-1?

The results of our conversation suggest that we need to fix this
at least for Latin-1.
* Re: regex and case-fold-search problem
  2002-09-04 14:13 ` Richard Stallman
@ 2002-09-04 16:04 ` Stefan Monnier
  2002-09-05 18:02 ` Richard Stallman
  0 siblings, 1 reply; 40+ messages in thread
From: Stefan Monnier @ 2002-09-04 16:04 UTC
Cc: monnier+gnu/emacs, handa, emacs-devel

>     No, only in non-ASCII chars. The excess is introduced in
>     set_image_of_range which is only used for non-ASCII chars.
>
> Does that include Latin-1?

No.

> The results of our conversation suggest that we need to fix this
> at least for Latin-1.

I don't feel an urgent need, so you'll be more quickly served if you ask
someone else to do it. He'll need to improve set_image_of_range.

        Stefan
* Re: regex and case-fold-search problem
  2002-09-04 16:04 ` Stefan Monnier
@ 2002-09-05 18:02 ` Richard Stallman
  2002-09-06  1:00 ` re-search-forward seems to be broken Miles Bader
  0 siblings, 1 reply; 40+ messages in thread
From: Richard Stallman @ 2002-09-05 18:02 UTC
Cc: monnier+gnu/emacs, handa, emacs-devel

    I don't feel an urgent need, so you'll be more quickly served if you ask
    someone else to do it. He'll need to improve set_image_of_range.

I did it.
* re-search-forward seems to be broken
  2002-09-05 18:02 ` Richard Stallman
@ 2002-09-06  1:00 ` Miles Bader
  2002-09-06 20:03 ` Richard Stallman
  0 siblings, 1 reply; 40+ messages in thread
From: Miles Bader @ 2002-09-06 1:00 UTC
Cc: monnier+gnu/emacs, handa, emacs-devel

When I do:

   (re-search-forward "[«»{}()]" nil t)

I get:

   Lisp error: (wrong-type-argument arrayp nil)

I presume this is from the `set_image_of_range' changes.

-Miles
-- 
P.S. All information contained in the above letter is false,
     for reasons of military security.
* Re: re-search-forward seems to be broken
  2002-09-06  1:00 ` re-search-forward seems to be broken Miles Bader
@ 2002-09-06 20:03 ` Richard Stallman
  0 siblings, 0 replies; 40+ messages in thread
From: Richard Stallman @ 2002-09-06 20:03 UTC
Cc: monnier+gnu/emacs, handa, emacs-devel

       (re-search-forward "[«»{}()]" nil t)

    I get:

       Lisp error: (wrong-type-argument arrayp nil)

I fixed this. Thanks.
* Re: regex and case-fold-search problem
  2002-08-30 19:19 ` Richard Stallman
  2002-08-30 20:08 ` Stefan Monnier
@ 2002-08-31  6:14 ` Eli Zaretskii
  2002-09-01 13:14 ` Richard Stallman
  1 sibling, 1 reply; 40+ messages in thread
From: Eli Zaretskii @ 2002-08-31 6:14 UTC
Cc: emacs-devel

> From: Richard Stallman <rms@gnu.org>
> Date: Fri, 30 Aug 2002 15:19:14 -0400
>
>     I think we all know that is the right behaviour, and at
>     least for ASCII, the latest code works that way. Perhaps we
>     should make Emacs work correctly also for Latin-1 chars,
>     because in emacs-unicode also, they have the same code
>     order.
>
> What about for Latin-2 characters? Will those regexp ranges
> change their meaning in emacs-unicode?

Yes. Latin-2 characters have different order in Unicode than in
8859-2. Those characters which are common to Latin-2 and Latin-1 are
in the same order, but those which aren't have different places.

The same goes for all the other Latin-N characters where N != 1.

We could have some code to map a range specified by a Lisp program
into a range of internal character codepoints (in Unicode Emacs, the
latter would be Unicode codepoints). We could make this code depend
on some user variable that states the external ordering meant by the
application. For example, Cyrillic users could tell Emacs that [A-Z]
was intended to work as in KOI8-R or as in 8859-5.

Would something like that work?
* Re: regex and case-fold-search problem
  2002-08-31  6:14 ` regex and case-fold-search problem Eli Zaretskii
@ 2002-09-01 13:14 ` Richard Stallman
  0 siblings, 0 replies; 40+ messages in thread
From: Richard Stallman @ 2002-09-01 13:14 UTC
Cc: emacs-devel

    > What about for Latin-2 characters? Will those regexp ranges
    > change their meaning in emacs-unicode?

    Yes. Latin-2 characters have different order in Unicode than in
    8859-2. Those characters which are common to Latin-2 and Latin-1 are
    in the same order, but those which aren't have different places.

    The same goes for all the other Latin-N characters where N != 1.

This suggests that perhaps there is no need to be careful about
case-folding of ranges outside of ASCII and Latin-1.

    We could have some code to map a range specified by a Lisp program
    into a range of internal character codepoints (in Unicode Emacs, the
    latter would be Unicode codepoints). We could make this code depend
    on some user variable that states the external ordering meant by the
    application. For example, Cyrillic users could tell Emacs that [A-Z]
    was intended to work as in KOI8-R or as in 8859-5.

This is a coherent idea, but since it is a substantial amount of work,
the question is whether it is better to do this or do nothing about
those cases.

I wonder how many programs use ranges of Latin-2 or KOI8-R and depend
on case-folding to work precisely. Probably few or none.