From: Ulrich Mueller <ulm@gentoo.org>
To: David Kastrup <dak@gnu.org>
Cc: emacs-devel@gnu.org
Subject: Re: Case mapping of sharp s
Date: Fri, 20 Nov 2009 09:10:29 +0100
Message-ID: <19206.20213.843972.495981@a1i15.kph.uni-mainz.de>
In-Reply-To: <87iqd6gmpk.fsf@lola.goethe.zz>

>>>>> On Thu, 19 Nov 2009, David Kastrup wrote:

>> I can guess why it's much slower going backward: the simple search
>> operates on chars rather than bytes. The internal encoding we use
>> (currently based on utf-8) is designed to be easy to parse going
>> forward but not so easy going backward (IIRC our encoding is
>> actually even a bit more painful in this case than pure utf-8).
> I don't think so. The utf-8 _scheme_ can be used to encode 21bits in
> 4 characters.

The original UTF-8 (specified in RFC 2279) could encode the full range
of 2^31 characters in up to 6 bytes. The limitation to about 2^20.1
characters (U+0000..U+10FFFF, as in RFC 3629) came later and is
artificial.
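
To make the ranges concrete, here is a minimal sketch of the sequence
length that the RFC 2279 scheme assigns to a code point. The function
name rfc2279_length is made up for this illustration. Note that the
later limit U+10FFFF corresponds to 0x110000 = 2^20 + 2^16 code points,
all of which fit in 4 bytes.

```c
#include <stdint.h>

/* Hypothetical helper, for illustration only: number of bytes the
   RFC 2279 form of UTF-8 needs for code point C.
   Returns 0 if C does not fit in 31 bits.  */
int
rfc2279_length (uint32_t c)
{
  if (c < 0x80)        return 1;  /*  7 payload bits */
  if (c < 0x800)       return 2;  /* 11 payload bits */
  if (c < 0x10000)     return 3;  /* 16 payload bits */
  if (c < 0x200000)    return 4;  /* 21 payload bits */
  if (c < 0x4000000)   return 5;  /* 26 payload bits */
  if (c < 0x80000000u) return 6;  /* 31 payload bits */
  return 0;                       /* out of range */
}
```

For example, rfc2279_length (0xDF) (sharp s) gives 2, and the largest
Unicode code point 0x10FFFF still gives 4.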
> We stay within that range, in the utf-8 4 character scheme, but
> outside of the Unicode range 2^20+2^16.

character.h says it's up to 22 bits encoded in up to 5 bytes:
,----
|   character code   1st byte   byte sequence
|   --------------   --------   -------------
|        0-7F        00..7F     0xxxxxxx
|       80-7FF       C2..DF     110xxxxx 10xxxxxx
|      800-FFFF      E0..EF     1110xxxx 10xxxxxx 10xxxxxx
|    10000-1FFFFF    F0..F7     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
|   200000-3FFF7F    F8         11111000 1000xxxx 10xxxxxx 10xxxxxx 10xxxxxx
|   3FFF80-3FFFFF    C0..C1     1100000x 10xxxxxx  (for eight-bit-char)
|   400000-...       invalid
`----
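
Read as code, the table above amounts to roughly the following. This
is only a sketch transcribed from the table; the names
internal_char_bytes and internal_lead_bytes are invented here and are
not the macros that character.h itself defines.

```c
#include <stdint.h>

/* Hypothetical helper: bytes needed for character code C according
   to the table; 0 if C is invalid (400000 and above).  */
int
internal_char_bytes (uint32_t c)
{
  if (c <= 0x7F)     return 1;
  if (c <= 0x7FF)    return 2;
  if (c <= 0xFFFF)   return 3;
  if (c <= 0x1FFFFF) return 4;
  if (c <= 0x3FFF7F) return 5;
  if (c <= 0x3FFFFF) return 2;  /* eight-bit-char, 1st byte C0..C1 */
  return 0;
}

/* Hypothetical helper: sequence length implied by first byte B;
   0 if B cannot start a sequence.  */
int
internal_lead_bytes (uint8_t b)
{
  if (b <= 0x7F)              return 1;
  if (b >= 0xC0 && b <= 0xDF) return 2;  /* C0..C1 and C2..DF */
  if (b >= 0xE0 && b <= 0xEF) return 3;
  if (b >= 0xF0 && b <= 0xF7) return 4;
  if (b == 0xF8)              return 5;
  return 0;  /* 80..BF are trailing bytes, F9..FF are unused */
}
```

So the 5-byte form carries 4 + 3*6 = 22 payload bits, which is where
the "22 bits in up to 5 bytes" comes from.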
>> BM on the other hand works on bytes, so there's no such slowdown.
> With utf-8, I think that apart from character ranges, search forward and
> backward should work perfectly like on 8-bit characters. Exception is
> incomplete character matches, but since the utf-8 scheme can immediately
> tell "is a 7-bit character" "is the first character of a multibyte
> sequence of length n" "is last or intermediate character of multibyte
> sequence" this is not a serious problem.

When the search is for equivalence classes of characters (e.g. case
folding), then I think it must operate on whole characters and
therefore has to find the start of each multibyte sequence.
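
Finding that start when going backward could look like the sketch
below. It assumes only what the table shows, namely that every
trailing byte has the form 10xxxxxx; the name char_start is made up
and this is not the code Emacs actually uses.

```c
#include <stddef.h>

/* Hypothetical helper: return the index of the first byte of the
   multibyte sequence that contains BUF[POS].  Steps backward over
   trailing bytes (10xxxxxx) until a non-trailing byte or the
   beginning of the buffer is reached.  */
size_t
char_start (const unsigned char *buf, size_t pos)
{
  while (pos > 0 && (buf[pos] & 0xC0) == 0x80)
    pos--;
  return pos;
}
```

With the bytes 'a' C3 9F 'b' (sharp s between two ASCII letters),
char_start returns 1 for both position 1 and position 2. The loop is
cheap, but it has to run for every character visited, whereas a
byte-wise search never needs it.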
Ulrich