From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: Mark H Weaver Newsgroups: gmane.lisp.guile.devel Subject: Re: Using libunistring for string comparisons et al Date: Fri, 18 Mar 2011 08:05:01 -0400 Message-ID: <87zkosiqhe.fsf@netris.org> References: <336042.33326.qm@web37901.mail.mud.yahoo.com> <878vwgmhah.fsf@netris.org> <511668.33680.qm@web37902.mail.mud.yahoo.com> <87sjuokniq.fsf@netris.org> <118142.11911.qm@web37907.mail.mud.yahoo.com> <87ipvjlvgj.fsf@netris.org> <87oc5b8fx3.fsf@gnu.org> <87tyf1kbae.fsf@netris.org> <877hbxwxjj.fsf@gnu.org> <87k4fxk4rx.fsf@netris.org> <87fwql5lvj.fsf@ambire.localdomain> <8762rhjjhn.fsf@netris.org> <87ei64u886.fsf@ambire.localdomain> NNTP-Posting-Host: lo.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Trace: dough.gmane.org 1300449939 10260 80.91.229.12 (18 Mar 2011 12:05:39 GMT) X-Complaints-To: usenet@dough.gmane.org NNTP-Posting-Date: Fri, 18 Mar 2011 12:05:39 +0000 (UTC) Cc: Ludovic =?utf-8?Q?Court=C3=A8s?= , guile-devel@gnu.org To: Thien-Thi Nguyen Original-X-From: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Fri Mar 18 13:05:35 2011 Return-path: Envelope-to: guile-devel@m.gmane.org Original-Received: from lists.gnu.org ([199.232.76.165]) by lo.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1Q0YQs-0003nS-Qj for guile-devel@m.gmane.org; Fri, 18 Mar 2011 13:05:30 +0100 Original-Received: from localhost ([127.0.0.1]:57018 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1Q0YQs-0000kL-7g for guile-devel@m.gmane.org; Fri, 18 Mar 2011 08:05:30 -0400 Original-Received: from [140.186.70.92] (port=39630 helo=eggs.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1Q0YQk-0000k9-FQ for guile-devel@gnu.org; Fri, 18 Mar 2011 08:05:23 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1Q0YQi-0004Qt-6j for guile-devel@gnu.org; Fri, 18 Mar 2011 08:05:22 -0400 Original-Received: from world.peace.net ([96.39.62.75]:44229) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1Q0YQg-0004QE-MA; Fri, 18 Mar 2011 08:05:18 -0400 Original-Received: from 209-6-93-251.c3-0.smr-ubr1.sbo-smr.ma.cable.rcn.com ([209.6.93.251] helo=freedomincluded) by world.peace.net with esmtpa (Exim 4.69) (envelope-from ) id 1Q0YQa-0000SP-Ud; Fri, 18 Mar 2011 08:05:13 -0400 Original-Received: from mhw by freedomincluded with local (Exim 4.69) (envelope-from ) id 1Q0YQP-0002h2-RK; Fri, 18 Mar 2011 08:05:01 -0400 In-Reply-To: <87ei64u886.fsf@ambire.localdomain> (Thien-Thi Nguyen's message of "Fri, 18 Mar 2011 09:46:17 +0100") User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (gnu/linux) X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6 (newer, 3) X-Received-From: 96.39.62.75 X-BeenThere: guile-devel@gnu.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: "Developers list for Guile, the GNU extensibility library" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Original-Sender: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Errors-To: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Xref: news.gmane.org gmane.lisp.guile.devel:11904 Archived-At: Thien-Thi Nguyen writes: > () Mark H Weaver > () Thu, 17 Mar 2011 21:38:28 -0400 > > If we may assume that the searched string is valid UTF-8, and when only > ASCII characters are excluded (e.g. "."), then three additional states > are required in the generated DFA. Let us call them S1, S2, and S3. > > [handling these states] > > When non-ASCII characters are excluded, additional states must be adde= d: > one for each unique prefix of the excluded multibyte characters. It's > quite straightforward. > > I don't understand what "excluded" means here. Is this a property > of the string, the regexp, the (dynamic, environmental) operation, or ...? The excluded characters are the ones listed inside of a [^...] form. For example, [^abc] excludes only the ASCII characters #\a, #\b, and #\c. [^abc=CE=BB] excludes also the non-ASCII character #\=CE=BB. "." is equivalent to a [^...] form that excludes only newlines. I should also note that although UTF-8 causes a few complications when compiling regexps, UTF-32 creates much worse problems that require a different DFA data structure and a slower scan. In the UTF-8 case, each state in the DFA can contain a fixed lookup table with 256 entries (one for each byte) as is used for ASCII or Latin1. However, in the UTF-32 case, it is _not_ practical for each state to contain a lookup table with over 1 million entries (one for each code point). Mark