From: Eli Zaretskii <eliz@gnu.org>
To: MON KEY <monkey@sandpframing.com>
Cc: 6283@debbugs.gnu.org
Subject: bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct?
Date: Tue, 01 Jun 2010 21:38:41 +0300
Message-ID: <831vcqtx72.fsf@gnu.org>
In-Reply-To: <AANLkTinnkWK1YEK5NyDkpvUemykStbN0_Vw64jPURfrJ@mail.gmail.com>

> Date: Mon, 31 May 2010 20:24:00 -0400
> From: MON KEY <monkey@sandpframing.com>
> Cc: 6283@debbugs.gnu.org
> 
> If I evaluate the following:
> 
>  (progn
>    (save-excursion
>      (insert-byte (multibyte-char-to-unibyte 4194221) 1)
>      (insert-byte (multibyte-char-to-unibyte 4194303) 1))
>    (search-forward-regexp "ÿ" nil t))
> 
> I don't get a match.

Because ÿ is a character, whereas `(multibyte-char-to-unibyte 4194303)'
is a raw byte.  Emacs can distinguish between the two because it
uses a special multibyte representation for raw bytes, which is
different from that of any Unicode character.  See this fragment from
the ELisp manual:

     Emacs defines several special character sets.  The character set
  `unicode' includes all the characters whose Emacs code points are in
  the range `0..#x10FFFF'.  The character set `emacs' includes all ASCII
  and non-ASCII characters.  Finally, the `eight-bit' charset includes
  the 8-bit raw bytes; Emacs uses it to represent raw bytes encountered
  in text.

and also this one:

     To support this multitude of characters and scripts, Emacs closely
  follows the "Unicode Standard".  The Unicode Standard assigns a unique
  number, called a "codepoint", to each and every character.  The range
  of codepoints defined by Unicode, or the Unicode "codespace", is
  `0..#x10FFFF' (in hexadecimal notation), inclusive.  Emacs extends this
  range with codepoints in the range `#x110000..#x3FFFFF', which it uses
  for representing characters that are not unified with Unicode and "raw
  8-bit bytes" that cannot be interpreted as characters.  Thus, a
  character codepoint in Emacs is a 22-bit integer number.
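
For illustration, here is a minimal sketch of that distinction, assuming
an Emacs with the Unicode-based internal representation described in the
excerpts above (Emacs 23 or later).  The variable names are illustrative
only; the functions are standard Emacs primitives.

```elisp
;; Sketch: the character ÿ and the raw byte 0xFF have different codepoints.
(let* ((y-diaeresis #xff)                            ; U+00FF, a Unicode character
       (raw-byte (unibyte-char-to-multibyte #xff)))  ; raw byte 0xFF as a codepoint
  (list y-diaeresis
        raw-byte
        ;; The two codepoints are not equal.
        (= y-diaeresis raw-byte)
        ;; Converting back recovers the byte value 255.
        (multibyte-char-to-unibyte raw-byte)
        ;; The raw byte 0xFF sits at the very top of the 22-bit range.
        (= raw-byte (max-char))))
;; => (255 4194303 nil 255 t)
```

The value 4194303 is #x3FFFFF, the upper end of the extended range
quoted above, which is why it never collides with a Unicode character.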

> Whereas if I evaluate:
> 
>  (progn
>    (save-excursion (insert 10 #o377))
>    (search-forward-regexp "ÿ" nil t))
> 
> I get a match.

Because `(insert 10 #o377)' inserts LATIN SMALL LETTER Y WITH
DIAERESIS, by design.
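
The difference between the two insertion styles can be checked directly
in a temporary buffer.  This is only a sketch, assuming a multibyte
buffer; `my-buffer-codepoints' is a hypothetical helper written for this
example, not an Emacs function.

```elisp
(defun my-buffer-codepoints ()
  "Return a list of the codepoints of all characters in the current buffer.
Hypothetical helper, for illustration only."
  (append (buffer-string) nil))

(with-temp-buffer
  (set-buffer-multibyte t)
  (insert #o377)          ; inserts the character ÿ (codepoint 255)
  (insert-byte #o377 1)   ; inserts one raw byte 0xFF
  (my-buffer-codepoints))
;; => (255 4194303)
```

Both calls are given the same integer, yet the buffer ends up holding
two distinct codepoints.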

> Likewise, if I evaluate
> 
>  (progn (save-excursion (insert 10 4194303))
>         (search-forward-regexp "\377" nil t))
> 
> I get a match.
> 
> Which is to say, given the example regexp from the manual, i.e:
> 
> ,----
> | You cannot always match all non-ASCII characters with the regular
> | expression `"[\200-\377]"'
> `----
> 
> I am unable to locate the character ÿ (255, #o377, #xff), i.e.
> LATIN SMALL LETTER Y WITH DIAERESIS

Sounds like a bug to me --- not in the conventions used by the
manual, but rather in regexp search in Emacs.  Feel free to file a
separate bug about that.

> To be clear, my issue isn't that I am not able to match `ÿ', but rather
> that I am able to match the raw-byte character whose displayed
> form coincides with the octal value of the `ÿ' character code,
> i.e. #o377, which is otherwise widely understood as
> `octal 0377'.
> 
> I hope this is more clear than the previous mail. I apologize if it is not.

I hope my answers make this issue clearer.  (Did I say that use of
raw bytes is complicated and full of subtleties?)
