From: Ulrich Mueller
Newsgroups: gmane.emacs.devel
Subject: Re: Case mapping of sharp s
Date: Fri, 20 Nov 2009 09:10:29 +0100
Message-ID: <19206.20213.843972.495981@a1i15.kph.uni-mainz.de>
References: <4B05A11F.5000700@gmx.de> <87iqd6gmpk.fsf@lola.goethe.zz>
In-Reply-To: <87iqd6gmpk.fsf@lola.goethe.zz>
To: David Kastrup
Cc: emacs-devel@gnu.org
Xref: news.gmane.org gmane.emacs.devel:117338

>>>>> On Thu, 19 Nov 2009, David Kastrup wrote:

>> I can guess why it's much slower going backward: the simple search
>> operates on chars rather than bytes. The internal encoding we use
>> (currently based on utf-8) is designed to be easy to parse going
>> forward but not so easy going backward (IIRC our encoding is
>> actually even a bit more painful in this case than pure utf-8).

> I don't think so. The utf-8 _scheme_ can be used to encode 21bits in
> 4 characters.

The original UTF-8 (specified in RFC 2279) was good for encoding of the
full range of 2^31 characters in up to 6 bytes. The limitation to
2^20.1 came later and is artificial.
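
For illustration, the byte counts of the original RFC 2279 scheme can be
sketched in C as follows. The function name rfc2279_len is made up for
this note; it is not an Emacs or libc function. For comparison, the
later Unicode limit of 0x110000 code points is roughly 2^20.09, which is
the "2^20.1" figure above.

```c
#include <stdint.h>

/* Bytes needed to encode code point C under the original UTF-8 of
   RFC 2279 (up to 31 bits in up to 6 bytes).  Returns 0 if C needs
   more than 31 bits.  Hypothetical helper, not taken from Emacs.  */
int rfc2279_len(uint32_t c)
{
    if (c < 0x80u)       return 1;  /*  7 payload bits */
    if (c < 0x800u)      return 2;  /* 11 payload bits */
    if (c < 0x10000u)    return 3;  /* 16 payload bits */
    if (c < 0x200000u)   return 4;  /* 21 payload bits */
    if (c < 0x4000000u)  return 5;  /* 26 payload bits */
    if (c < 0x80000000u) return 6;  /* 31 payload bits */
    return 0;                       /* not encodable */
}
```

With this sketch, rfc2279_len(0x10FFFF) is 4, so the whole of today's
Unicode range fits in 4 bytes; the 5 and 6 byte forms are only needed
above 0x1FFFFF.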

> We stay within that range, in the utf-8 4 character scheme, but
> outside of the Unicode range 2^20+2^16.

character.h says it's up to 22 bits encoded in up to 5 bytes:

,----
| character code  1st byte  byte sequence
| --------------  --------  -------------
|        0-7F     00..7F    0xxxxxxx
|       80-7FF    C2..DF    110xxxxx 10xxxxxx
|      800-FFFF   E0..EF    1110xxxx 10xxxxxx 10xxxxxx
|    10000-1FFFFF F0..F7    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
|   200000-3FFF7F F8        11111000 1000xxxx 10xxxxxx 10xxxxxx 10xxxxxx
|   3FFF80-3FFFFF C0..C1    1100000x 10xxxxxx (for eight-bit-char)
|   400000-...    invalid
`----

>> BM on the other hand works on bytes, so there's no such slowdown.

> With utf-8, I think that apart from character ranges, search forward and
> backward should work perfectly like on 8-bit characters. Exception is
> incomplete character matches, but since the utf-8 scheme can immediately
> tell "is a 7-bit character" "is the first character of a multibyte
> sequence of length n" "is last or intermediate character of multibyte
> sequence" this is not a serious problem.

When the search is for equivalence classes of characters (e.g. case
folding), then I think it must operate on whole characters and
therefore has to find the start of each multibyte sequence.

Ulrich
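
To make the last point concrete, here is a minimal C sketch of finding
the start of a multibyte sequence when scanning backward. It uses only
the byte patterns from the character.h table quoted above. The names
is_trailing_byte, seq_len_from_lead and char_start are invented for
this illustration; Emacs has its own macros for this, which are not
reproduced here.

```c
#include <stddef.h>

/* Every non-leading byte in the table has the form 10xxxxxx.  */
int is_trailing_byte(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Sequence length implied by a leading byte, following the "1st byte"
   column of the table.  Returns 0 for a trailing or invalid byte.  */
int seq_len_from_lead(unsigned char b)
{
    if (b <= 0x7F)              return 1;
    if (b >= 0xC0 && b <= 0xDF) return 2;  /* C0..C1 is eight-bit-char */
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF7) return 4;
    if (b == 0xF8)              return 5;
    return 0;
}

/* Step back from byte offset POS to the first byte of the character
   that contains it.  Assumes BUF holds well-formed text and POS is a
   valid offset into it.  */
size_t char_start(const unsigned char *buf, size_t pos)
{
    while (pos > 0 && is_trailing_byte(buf[pos]))
        pos--;
    return pos;
}
```

For example, sharp s (U+00DF) is stored as the two bytes C3 9F. A
backward search that lands on the 9F byte has to step back one byte to
C3 before it can treat the character as a whole, e.g. for case folding.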