From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Ulrich Mueller <ulm@gentoo.org>
Newsgroups: gmane.emacs.devel
Subject: Re: Case mapping of sharp s
Date: Fri, 20 Nov 2009 09:10:29 +0100
Message-ID: <19206.20213.843972.495981@a1i15.kph.uni-mainz.de>
References: <4B05A11F.5000700@gmx.de> <jwvskcai43z.fsf-monnier+emacs@gnu.org>
	<87iqd6gmpk.fsf@lola.goethe.zz>
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Trace: ger.gmane.org 1258704884 10759 80.91.229.12 (20 Nov 2009 08:14:44 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Fri, 20 Nov 2009 08:14:44 +0000 (UTC)
Cc: emacs-devel@gnu.org
To: David Kastrup <dak@gnu.org>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Fri Nov 20 09:14:37 2009
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([199.232.76.165])
	by lo.gmane.org with esmtp (Exim 4.50)
	id 1NBObT-0000pO-KH
	for ged-emacs-devel@m.gmane.org; Fri, 20 Nov 2009 09:12:27 +0100
Original-Received: from localhost ([127.0.0.1]:40658 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1NBObS-0001ht-VI
	for ged-emacs-devel@m.gmane.org; Fri, 20 Nov 2009 03:12:26 -0500
Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1NBOa9-000131-0h
	for emacs-devel@gnu.org; Fri, 20 Nov 2009 03:11:05 -0500
Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1NBOa4-00010B-G0
	for emacs-devel@gnu.org; Fri, 20 Nov 2009 03:11:04 -0500
Original-Received: from [199.232.76.173] (port=59442 helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1NBOa4-000105-2G
	for emacs-devel@gnu.org; Fri, 20 Nov 2009 03:11:00 -0500
Original-Received: from mx20.gnu.org ([199.232.41.8]:42378)
	by monty-python.gnu.org with esmtps (TLS-1.0:RSA_AES_256_CBC_SHA1:32)
	(Exim 4.60) (envelope-from <ulm@kph.uni-mainz.de>)
	id 1NBOa0-0006Vb-Gz; Fri, 20 Nov 2009 03:10:57 -0500
Original-Received: from a1iwww1.kph.uni-mainz.de ([134.93.134.1])
	by mx20.gnu.org with esmtp (Exim 4.60)
	(envelope-from <ulm@kph.uni-mainz.de>)
	id 1NBOZo-0000HV-4T; Fri, 20 Nov 2009 03:10:44 -0500
Original-Received: from a1i15.kph.uni-mainz.de (a1i15.kph.uni-mainz.de [134.93.134.92])
	by a1iwww1.kph.uni-mainz.de (8.14.0/8.13.4) with ESMTP id
	nAK8AU1P030920; Fri, 20 Nov 2009 09:10:30 +0100
Original-Received: from a1i15.kph.uni-mainz.de (localhost [127.0.0.1])
	by a1i15.kph.uni-mainz.de (8.14.3/8.14.2) with ESMTP id nAK8AT2j021171; 
	Fri, 20 Nov 2009 09:10:29 +0100
Original-Received: (from ulm@localhost)
	by a1i15.kph.uni-mainz.de (8.14.3/8.14.3/Submit) id nAK8ATAh021168;
	Fri, 20 Nov 2009 09:10:29 +0100
In-Reply-To: <87iqd6gmpk.fsf@lola.goethe.zz>
X-Mailer: VM 8.0.12 under 23.1.1 (x86_64-pc-linux-gnu)
X-detected-operating-system: by mx20.gnu.org: GNU/Linux 2.6 (newer, 1)
X-detected-operating-system: by monty-python.gnu.org: GNU/Linux 2.6,
	seldom 2.4 (older, 4)
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <http://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.emacs.devel:117338
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/117338>

>>>>> On Thu, 19 Nov 2009, David Kastrup wrote:

>> I can guess why it's much slower going backward: the simple search
>> operates on chars rather than bytes. The internal encoding we use
>> (currently based on utf-8) is designed to be easy to parse going
>> forward but not so easy going backward (IIRC our encoding is
>> actually even a bit more painful in this case than pure utf-8).

> I don't think so. The utf-8 _scheme_ can be used to encode 21bits in
> 4 characters.

The original UTF-8 (specified in RFC 2279) could encode the full range
of 2^31 characters in up to 6 bytes. The limitation to about 2^20.1
code points (up to U+10FFFF) came later, with RFC 3629, and is
artificial.
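
To illustrate the sequence lengths of the original scheme, here is a
minimal sketch in Python. It is not taken from the RFC or from Emacs,
and the function name rfc2279_length is made up for this example. It
assumes the payload sizes of 7, 11, 16, 21, 26 and 31 bits for
sequences of 1 to 6 bytes.

```python
def rfc2279_length(code):
    """Return how many bytes the original (RFC 2279) UTF-8 scheme
    needs for the character code `code`."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("code is outside the 31-bit range")
    # Payload bits available in sequences of length 1..6.
    for length, bits in enumerate((7, 11, 16, 21, 26, 31), start=1):
        if code < (1 << bits):
            return length


if __name__ == "__main__":
    for code in (0x7F, 0xDF, 0xFFFF, 0x10FFFF, 0x200000, 0x7FFFFFFF):
        print(hex(code), rfc2279_length(code))
```

Everything up to U+10FFFF fits in 4 bytes, so the 5 and 6 byte forms
are only needed beyond the current Unicode range.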

> We stay within that range, in the utf-8 4 character scheme, but
> outside of the Unicode range 2^20+2^16.

Emacs's character.h says it's up to 22 bits encoded in up to 5 bytes:

,----
|    character code	1st byte   byte sequence
|    --------------	--------   -------------
|         0-7F		00..7F	   0xxxxxxx
|        80-7FF		C2..DF	   110xxxxx 10xxxxxx
|       800-FFFF		E0..EF	   1110xxxx 10xxxxxx 10xxxxxx
|     10000-1FFFFF	F0..F7	   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
|    200000-3FFF7F	F8	   11111000 1000xxxx 10xxxxxx 10xxxxxx 10xxxxxx
|    3FFF80-3FFFFF	C0..C1	   1100000x 10xxxxxx (for eight-bit-char)
|    400000-...		invalid
`----
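
The table can be read as an encoder. The following Python sketch
follows the table row by row; it is not the code from character.h, and
the name emacs_encode is hypothetical. For the eight-bit-char row it
assumes that the 7 payload bits are simply the offset of the code
within the range 3FFF80-3FFFFF.

```python
def emacs_encode(code):
    """Encode a character code as a byte sequence, following the
    table quoted from character.h (illustrative sketch only)."""
    def trail(shift):
        # Trailing byte 10xxxxxx carrying 6 bits of the code.
        return 0x80 | ((code >> shift) & 0x3F)

    if code < 0:
        raise ValueError("negative character code")
    if code <= 0x7F:
        return bytes([code])
    if code <= 0x7FF:
        return bytes([0xC0 | (code >> 6), trail(0)])
    if code <= 0xFFFF:
        return bytes([0xE0 | (code >> 12), trail(6), trail(0)])
    if code <= 0x1FFFFF:
        return bytes([0xF0 | (code >> 18), trail(12), trail(6), trail(0)])
    if code <= 0x3FFF7F:
        # 11111000 1000xxxx 10xxxxxx 10xxxxxx 10xxxxxx (4+6+6+6 bits)
        return bytes([0xF8, 0x80 | (code >> 18),
                      trail(12), trail(6), trail(0)])
    if code <= 0x3FFFFF:
        # eight-bit-char: 1100000x 10xxxxxx. Assumption: the payload
        # is the offset within the range 3FFF80-3FFFFF.
        offset = code - 0x3FFF80
        return bytes([0xC0 | (offset >> 6), 0x80 | (offset & 0x3F)])
    raise ValueError("invalid character code")


if __name__ == "__main__":
    for code in (0x41, 0xDF, 0x20AC, 0x200000, 0x3FFF80):
        print(hex(code), emacs_encode(code).hex())
```

For codes up to 10FFFF the result is the same as ordinary UTF-8, so
sharp s (DF) comes out as the two bytes C3 9F. Every byte after the
first has the form 10xxxxxx, including the second byte of the 5-byte
form.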

>> BM on the other hand works on bytes, so there's no such slowdown.

> With utf-8, I think that apart from character ranges, search forward and
> backward should work perfectly like on 8-bit characters.  Exception is
> incomplete character matches, but since the utf-8 scheme can immediately
> tell "is a 7-bit character" "is the first character of a multibyte
> sequence of length n" "is last or intermediate character of multibyte
> sequence" this is not a serious problem.

When the search is for equivalence classes of characters (e.g. case
folding), then I think it must operate on whole characters and
therefore has to find the start of each multibyte sequence.
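
Finding the start of a sequence when going backward could look roughly
like this. It is a Python sketch, not what Emacs's search code does,
and the helper names are invented for the example. It relies only on
the fact that every non-leading byte has the form 10xxxxxx, and it
assumes the buffer holds well-formed sequences.

```python
def is_trailing_byte(byte):
    """True for bytes of the form 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def char_start(buf, pos):
    """Return the index of the first byte of the multibyte sequence
    that contains buf[pos]."""
    while pos > 0 and is_trailing_byte(buf[pos]):
        pos -= 1
    return pos


def chars_backward(buf):
    """Yield (start, end) byte ranges of the characters in buf,
    last character first."""
    end = len(buf)
    while end > 0:
        start = char_start(buf, end - 1)
        yield start, end
        end = start


if __name__ == "__main__":
    text = "Ma\u00df".encode("utf-8")   # "Mass" written with sharp s
    for start, end in chars_backward(text):
        print(start, end, text[start:end])
```

A byte-wise search can skip this step, but a search that compares
case-folded characters has to do something like it for every character
it passes when moving backward.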

Ulrich