From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!not-for-mail From: Mark H Weaver Newsgroups: gmane.lisp.guile.devel Subject: Re: Using libunistring for string comparisons et al Date: Thu, 17 Mar 2011 21:38:28 -0400 Message-ID: <8762rhjjhn.fsf@netris.org> References: <336042.33326.qm@web37901.mail.mud.yahoo.com> <878vwgmhah.fsf@netris.org> <511668.33680.qm@web37902.mail.mud.yahoo.com> <87sjuokniq.fsf@netris.org> <118142.11911.qm@web37907.mail.mud.yahoo.com> <87ipvjlvgj.fsf@netris.org> <87oc5b8fx3.fsf@gnu.org> <87tyf1kbae.fsf@netris.org> <877hbxwxjj.fsf@gnu.org> <87k4fxk4rx.fsf@netris.org> <87fwql5lvj.fsf@ambire.localdomain> NNTP-Posting-Host: lo.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Trace: dough.gmane.org 1300412340 643 80.91.229.12 (18 Mar 2011 01:39:00 GMT) X-Complaints-To: usenet@dough.gmane.org NNTP-Posting-Date: Fri, 18 Mar 2011 01:39:00 +0000 (UTC) Cc: Ludovic =?utf-8?Q?Court=C3=A8s?= , guile-devel@gnu.org To: Thien-Thi Nguyen Original-X-From: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Fri Mar 18 02:38:56 2011 Return-path: Envelope-to: guile-devel@m.gmane.org Original-Received: from lists.gnu.org ([199.232.76.165]) by lo.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1Q0OeV-0002n7-Mt for guile-devel@m.gmane.org; Fri, 18 Mar 2011 02:38:55 +0100 Original-Received: from localhost ([127.0.0.1]:33905 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1Q0OeV-0008Q7-Eg for guile-devel@m.gmane.org; Thu, 17 Mar 2011 21:38:55 -0400 Original-Received: from [140.186.70.92] (port=55955 helo=eggs.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1Q0OeN-0008Pm-Au for guile-devel@gnu.org; Thu, 17 Mar 2011 21:38:48 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1Q0OeL-0005p4-VT for guile-devel@gnu.org; Thu, 17 Mar 2011 21:38:46 -0400 Original-Received: from world.peace.net ([96.39.62.75]:59475) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1Q0OeK-0005oT-OS; Thu, 17 Mar 2011 21:38:44 -0400 Original-Received: from 209-6-93-251.c3-0.smr-ubr1.sbo-smr.ma.cable.rcn.com ([209.6.93.251] helo=freedomincluded) by world.peace.net with esmtpa (Exim 4.69) (envelope-from ) id 1Q0OeF-0007m1-TP; Thu, 17 Mar 2011 21:38:39 -0400 Original-Received: from mhw by freedomincluded with local (Exim 4.69) (envelope-from ) id 1Q0Oe4-0002Yt-Rw; Thu, 17 Mar 2011 21:38:28 -0400 In-Reply-To: <87fwql5lvj.fsf@ambire.localdomain> (Thien-Thi Nguyen's message of "Fri, 18 Mar 2011 01:10:40 +0100") User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (gnu/linux) X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6 (newer, 3) X-Received-From: 96.39.62.75 X-BeenThere: guile-devel@gnu.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: "Developers list for Guile, the GNU extensibility library" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Original-Sender: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Errors-To: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Xref: news.gmane.org gmane.lisp.guile.devel:11900 Archived-At: Thien-Thi Nguyen writes: > In unibyte land, "." matches a byte. OK. > > In multibyte land done "bytewise", "." matches ____________. > (What goes in the blank?) "." (and more generally [^...]) is equivalent to (a|b|c|d|...) where every valid UTF-8 character is present in the disjunction except for the excluded characters. However, this case does indeed require special consideration for the sake of efficiency in the regexp compiler. If we may assume that the searched string is valid UTF-8, and when only ASCII characters are excluded (e.g. "."), then three additional states are required in the generated DFA. Let us call them S1, S2, and S3. S1 accepts one arbitrary byte, i.e. it has an outgoing edge for every byte which leads to the final state. S2 accepts any two bytes, i.e. it has an outgoing edge for every byte which leads to S1. Similarly for S3. Now, the start state of "." (or [^...]) has outgoing edges for every byte from 0x80 to 0xFF, leading to S1, S2, or S3, depending on the high bits of the first byte. Specifically, byte values 0xC0 through 0xDF lead to S1, 0xE0 through 0xEF lead to S2, and 0xF0 through 0xF7 lead to S3. When non-ASCII characters are excluded, additional states must be added: one for each unique prefix of the excluded multibyte characters. It's quite straightforward. Regards, Mark