From: Eli Zaretskii <eliz@gnu.org>
To: MON KEY <monkey@sandpframing.com>
Cc: 6283@debbugs.gnu.org
Subject: bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct?
Date: Tue, 01 Jun 2010 21:38:41 +0300
Message-ID: <831vcqtx72.fsf@gnu.org>
In-Reply-To: <AANLkTinnkWK1YEK5NyDkpvUemykStbN0_Vw64jPURfrJ@mail.gmail.com>
> Date: Mon, 31 May 2010 20:24:00 -0400
> From: MON KEY <monkey@sandpframing.com>
> Cc: 6283@debbugs.gnu.org
>
> If I evaluate the following:
>
> (progn
>   (save-excursion
>     (insert-byte (multibyte-char-to-unibyte 4194221) 1)
>     (insert-byte (multibyte-char-to-unibyte 4194303) 1))
>   (search-forward-regexp "ÿ" nil t))
>
> I don't match.
Because ÿ is a character, whereas `(multibyte-char-to-unibyte 4194303)'
is a raw byte. Emacs can distinguish between the two because it uses
a special multibyte representation for raw bytes, distinct from that
of any Unicode character. See this fragment from the ELisp manual:
  Emacs defines several special character sets. The character set
  `unicode' includes all the characters whose Emacs code points are in
  the range `0..#x10FFFF'. The character set `emacs' includes all ASCII
  and non-ASCII characters. Finally, the `eight-bit' charset includes
  the 8-bit raw bytes; Emacs uses it to represent raw bytes encountered
  in text.
and also this one:
  To support this multitude of characters and scripts, Emacs closely
  follows the "Unicode Standard". The Unicode Standard assigns a unique
  number, called a "codepoint", to each and every character. The range
  of codepoints defined by Unicode, or the Unicode "codespace", is
  `0..#x10FFFF' (in hexadecimal notation), inclusive. Emacs extends this
  range with codepoints in the range `#x110000..#x3FFFFF', which it uses
  for representing characters that are not unified with Unicode and "raw
  8-bit bytes" that cannot be interpreted as characters. Thus, a
  character codepoint in Emacs is a 22-bit integer number.
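To make the numbers concrete, here is a minimal sketch of how the raw
byte #o377 and the character ÿ relate. It uses only
`unibyte-char-to-multibyte', `multibyte-char-to-unibyte' and `max-char';
the variable names are illustrative, and the values in the comments
assume Emacs 23 or later:

```elisp
;; Sketch only: the variable names are illustrative.
(let ((raw-byte (unibyte-char-to-multibyte #o377)) ; eight-bit raw byte
      (y-diaeresis #o377))                         ; U+00FF, the character ÿ
  (list raw-byte                                ; 4194303, i.e. #x3FFFFF
        (= raw-byte (max-char))                 ; t
        (= raw-byte y-diaeresis)                ; nil
        (multibyte-char-to-unibyte raw-byte)))  ; 255
```

Both values correspond to the byte 255, but as Emacs codepoints they are
different integers, so comparing or searching for one does not find the
other.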
> Whereas if I evaluate:
>
> (progn
>   (save-excursion (insert 10 #o377))
>   (search-forward-regexp "ÿ" nil t))
>
> I get a match.
Because `(insert 10 #o377)' inserts LATIN SMALL LETTER Y WITH
DIAERESIS, by design.
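The difference between the two insertion functions can be seen in a
small sketch like the following. It assumes the temporary buffer is
multibyte, which is the default; the values in the comments assume
Emacs 23 or later:

```elisp
;; Sketch only; assumes the temporary buffer is multibyte (the default).
(with-temp-buffer
  (insert #o377)            ; inserts the character ÿ (U+00FF)
  (insert-byte #o377 1)     ; inserts the raw byte \377
  (list (char-after 1)      ; 255
        (char-after 2)))    ; 4194303, i.e. #x3FFFFF
```

The same integer argument thus ends up as two different codepoints in
the buffer, depending on which function inserted it.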
> Likewise, if I evaluate
>
> (progn (save-excursion (insert 10 4194303))
>        (search-forward-regexp "\377" nil t))
>
> I get a match.
>
> Which is to say, given the example regexp from the manual, i.e:
>
> ,----
> | You cannot always match all non-ASCII characters with the regular
> | expression `"[\200-\377]"'
> `----
>
> I am unable to locate the character: ÿ (255, #o377, #xff) e.g.
> LATIN SMALL LETTER Y WITH DIAERESIS
Sounds like a bug to me --- not in the conventions used by the
manual, but rather in regexp search in Emacs. Feel free to file a
separate bug about that.
> To be clear, my issue isn't that I am not able to match `ÿ' but rather
> that I am able to match the raw-byte character representation with a
> visual appearance which coincides with the octal value for the `ÿ'
> character code i.e. #o377 this being otherwise widely understood as
> `octal 0377'.
>
> I hope this is more clear than the previous mail. I apologize if it is not.
I hope my answers make this issue clearer. (Did I say that use of
raw bytes is complicated and full of subtleties?)