From: Mark H Weaver
Newsgroups: gmane.lisp.guile.devel
Subject: Using libunistring for string comparisons et al
Date: Fri, 11 Mar 2011 17:33:47 -0500
Message-ID: <87aah1qoc4.fsf@netris.org>
In-Reply-To: <486722.32491.qm@web37905.mail.mud.yahoo.com> (Mike Gran's message of "Thu, 10 Mar 2011 16:54:41 -0800 (PST)")
To: Mike Gran
Cc: "guile-devel@gnu.org"

Mike Gran writes:

> [...] But doing the upper->lower operation picks
> up a few more of the corner cases, like U+03C2 GREEK
> SMALL LETTER FINAL SIGMA and U+03C3 GREEK SMALL LETTER SIGMA
> which are the same letter with different representations,
> or U+00B5 MICRO SIGN and U+039C GREEK SMALL LETTER MU
> which are supposed to have the same sort ordering.

Ah, okay.  Makes sense.

> Now that we've pulled in all of libunistring, it might
> be a good idea to see if it has a complete implementation
> of unicode case folding, because upper->lower is also not
> completely correct.

I looked into this.  Indeed, the libunistring documentation mentions
that in some languages (e.g. German), the to_upper and to_lower
conversions cannot be done properly on a per-character basis, because
the number of characters can change.  These operations must be done on
an entire string.  For example:

  (string-upcase "Straße") => "STRASSE"
  (string-foldcase "Straße") => "strasse"

libunistring contains all the necessary functions, including
case-insensitive string comparisons.
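To make the length-changing behaviour concrete, here is a minimal sketch
in Python.  Python is used only as a stand-in because its built-in str
methods apply Unicode full case mappings to whole strings; this is not
Guile or libunistring code, and the helper name caseless_equal is
hypothetical.

```python
# Illustrative stand-in only: Python's built-in str methods, not the
# Guile or libunistring API discussed in this thread.

word = "Straße"

# Whole-string case mapping can change the length: "ß" expands to "SS".
upper = word.upper()
folded = word.casefold()


def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings after full case folding (hypothetical helper)."""
    return a.casefold() == b.casefold()


print(len(word), len(upper))                # 6 7
print(upper, folded)                        # STRASSE strasse
print(caseless_equal(word, "STRASSE"))      # True
print(caseless_equal("\u03c2", "\u03c3"))   # final sigma vs. sigma: True
print(caseless_equal("\u00b5", "\u03bc"))   # micro sign vs. small mu: True
```

Because one character can map to several, a per-character loop has no
correct answer for "ß"; only a string-to-string operation does.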
However, the only string representations supported by these operations
are UTF-8, UTF-16, UTF-32, and locale-encoded strings, and for
comparisons both strings must be in the same encoding.

I'm aware that this proposal will be very controversial, but starting
in Guile 2.2, I think we ought to consider storing strings internally
in UTF-8, as is done in Gauche.  This would of course make string-ref
and string-set! into O(n) operations.  However, I claim that any code
that depends on string-ref and string-set! could be better written