From: "Stefan Monnier"
Newsgroups: gmane.emacs.devel
Subject: Re: regex and case-fold-search problem
Date: Fri, 23 Aug 2002 17:52:37 -0400
Message-ID: <200208232152.g7NLqbe03698@rum.cs.yale.edu>
References: <200208230625.PAA23426@etlken.m17n.org> <200208231736.g7NHafW02174@rum.cs.yale.edu>
Original-To: Kenichi Handa
Cc: emacs-devel@gnu.org

"Stefan Monnier" wrote:
> For ASCII it's pretty easy to fix.  But for other charsets, it's
> indeed more tricky.  Maybe we can simply use the smallest contiguous
> range of chars that includes all the chars we should match,
> so the behavior is indeed "implementation-defined" (in the sense
> that it's not necessarily obvious to the user what happens) but
> it's at least less confusing (in the sense that (case-fold-search t)
> matches at least as much as (case-fold-search nil)).

How about the patch below?


        Stefan


Index: regex.c
===================================================================
RCS file: /cvsroot/emacs/emacs/src/regex.c,v
retrieving revision 1.176
diff -u -u -b -r1.176 regex.c
--- regex.c     25 Mar 2002 00:45:48 -0000      1.176
+++ regex.c     23 Aug 2002 21:49:10 -0000
@@ -1914,12 +1914,13 @@
 #define BIT_UPPER      0x10
 #define BIT_MULTIBYTE  0x20
 
-/* Set a range (RANGE_START, RANGE_END) to WORK_AREA.  */
-#define SET_RANGE_TABLE_WORK_AREA(work_area, range_start, range_end)   \
+/* Set a range START..END to WORK_AREA.
+   The range is passed through TRANSLATE, so START and END
+   should be untranslated.  */
+#define SET_RANGE_TABLE_WORK_AREA(work_area, start, end)               \
   do {                                                                 \
     EXTEND_RANGE_TABLE_WORK_AREA ((work_area), 2);                     \
-    (work_area).table[(work_area).used++] = (range_start);             \
-    (work_area).table[(work_area).used++] = (range_end);               \
+    set_image_of_range (&work_area, start, end, translate);            \
   } while (0)
 
 /* Free allocated memory for WORK_AREA.  */
@@ -2077,6 +2078,31 @@
 }
 #endif
 
+
+
+/* We need to find the image of the range start..end when passed through
+   TRANSLATE.  This is not necessarily TRANSLATE(start)..TRANSLATE(end)
+   and is not even necessarily contiguous.
+   We approximate it with the smallest contiguous range that contains
+   all the chars we need.  */
+static void
+set_image_of_range (work_area, start, end, translate)
+     RE_TRANSLATE_TYPE translate;
+     struct range_table_work_area *work_area;
+     re_wchar_t start, end;
+{
+  re_wchar_t cmin = TRANSLATE (start), cmax = TRANSLATE (end);
+  if (RE_TRANSLATE_P (translate))
+    for (; start <= end; start++)
+      {
+        re_wchar_t c = TRANSLATE (start);
+        cmin = MIN (cmin, c);
+        cmax = MAX (cmax, c);
+      }
+  work_area->table[work_area->used++] = (cmin);
+  work_area->table[work_area->used++] = (cmax);
+}
+
 /* Explicit quit checking is only used on NTemacs.  */
 #if defined WINDOWSNT && defined emacs && defined QUIT
 extern int immediate_quit;
@@ -2525,14 +2551,18 @@
 
             if (p == pend) FREE_STACK_RETURN (REG_EBRACK);
 
-            PATFETCH (c);
+            /* Don't translate yet.  The range TRANSLATE(X..Y) cannot
+               always be determined from TRANSLATE(X) and TRANSLATE(Y)
+               So the translation is done later in a loop.  Example:
+               (let ((case-fold-search t)) (string-match "[A-_]" "A"))  */
+            PATFETCH_RAW (c);
 
             /* \ might escape characters inside [...] and [^...].  */
             if ((syntax & RE_BACKSLASH_ESCAPE_IN_LISTS) && c == '\\')
               {
                 if (p == pend) FREE_STACK_RETURN (REG_EESCAPE);
 
-                PATFETCH (c);
+                PATFETCH_RAW (c);
                 escaped_char = true;
               }
             else
@@ -2636,10 +2668,10 @@
                   {
 
                     /* Discard the `-'.  */
-                    PATFETCH (c1);
+                    PATFETCH_RAW (c1);
 
                     /* Fetch the character which ends the range.  */
-                    PATFETCH (c1);
+                    PATFETCH_RAW (c1);
 
                     if (SINGLE_BYTE_CHAR_P (c))
                       {
@@ -2653,7 +2685,7 @@
                            starting at the smallest character in
                            the charset of C1 and ending at C1.  */
                         int charset = CHAR_CHARSET (c1);
-                        int c2 = MAKE_CHAR (charset, 0, 0);
+                        re_wchar_t c2 = MAKE_CHAR (charset, 0, 0);
 
                         SET_RANGE_TABLE_WORK_AREA (range_table_work,
                                                    c2, c1);
@@ -2672,7 +2704,7 @@
                     /* ... into bitmap.  */
                     {
                       re_wchar_t this_char;
-                      int range_start = c, range_end = c1;
+                      re_wchar_t range_start = c, range_end = c1;
 
                       /* If the start is after the end, the range is empty.  */
                       if (range_start > range_end)
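
For anyone who wants to see the effect outside of regex.c, here is a
minimal standalone sketch of the same idea.  It is not the code from the
patch: fold() is a hypothetical stand-in for TRANSLATE that only does
ASCII case folding via tolower(), image_of_range() and check_example()
are names made up for this illustration, and the character values
assume an ASCII execution character set.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical stand-in for regex.c's TRANSLATE: ASCII case folding only.  */
static int
fold (int c)
{
  return tolower ((unsigned char) c);
}

/* Compute the smallest contiguous range [*cmin, *cmax] that contains
   fold (c) for every c in START..END.  Like the patch, this may cover
   more characters than the exact image does.  */
static void
image_of_range (int start, int end, int *cmin, int *cmax)
{
  int c;

  *cmin = *cmax = fold (start);
  for (c = start; c <= end; c++)
    {
      int t = fold (c);
      if (t < *cmin)
        *cmin = t;
      if (t > *cmax)
        *cmax = t;
    }
}

/* The [A-_] example from the patch comment, assuming ASCII.  */
static void
check_example (void)
{
  int lo, hi;

  /* Translating only the endpoints gives 'a'..'_', which is empty.  */
  assert (fold ('A') > fold ('_'));

  /* Translating every member gives '['..'z', which contains 'a'.  */
  image_of_range ('A', '_', &lo, &hi);
  assert (lo <= 'a' && 'a' <= hi);
}
```

Calling check_example() exercises the "[A-_]" case: the endpoint-only
range is empty after folding, while the per-character image is '['..'z'.
That range also takes in '`', which falls between '_' and 'a' without
being the fold of anything in A.._, so this is the over-approximation
the quoted paragraph calls "implementation-defined".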