Making re-search-forward search for \377

unofficial mirror of help-gnu-emacs@gnu.org

* Making re-search-forward search for \377
@ 2008-11-02  7:31 Tyler Spivey
  2008-11-02  8:45 ` Xah
  0 siblings, 1 reply; 9+ messages in thread
From: Tyler Spivey @ 2008-11-02  7:31 UTC (permalink / raw)
  To: help-gnu-emacs

I'm having a hell of a time trying to get re-search-forward to find a
\377 character in my buffer. Here is what I've tried so far, using
*scratch*:

1. C-q 377 RET M-< RET
(re-search-forward "\377") C-x C-e - not found.
(re-search-forward "[press C-q 377]") C-x C-e - this works.
If I turn multibyte off with M-x toggle-enable-multibyte-characters,
none of these work. My eventual goal is to do this in an elisp program,
but I need to get the basics reliably working first. I've tried
string-{as,to,make}-multibyte on the "\377", with no luck. I've read
the info pages on coding systems and such, but I'm not sure what I'm
missing here.


* Re: Making re-search-forward search for \377
  2008-11-02  7:31 Making re-search-forward search for \377 Tyler Spivey
@ 2008-11-02  8:45 ` Xah
  2008-11-02  9:12   ` Tyler Spivey
  0 siblings, 1 reply; 9+ messages in thread
From: Xah @ 2008-11-02  8:45 UTC (permalink / raw)
  To: help-gnu-emacs

On Nov 2, 12:31 am, Tyler Spivey <tspi...@pcdesk.net> wrote:
> I'm having a hell of a time trying to get re-search-forward to find a
> \377 character in my buffer. Here is what I've tried so far, using
> *scratch*:
>
> 1. C-q 377 RET M-< RET
> (re-search-forward "\377") C-x C-e - not found.
> (re-search-forward "[press C-q 377]") C-x C-e - this works.
> If I turn multibyte off with M-X toggle-enable-multibyte-characters,
> none of these work. My eventual goal is to do this in an elisp program,
> but I need to get the basics
> reliably working first. I've tried string-{as,to,make}-multibyte on the
> "\377", with no luck. I've read the info pages on coding systems and
> such, but I'm not sure what I'm missing here.

what's the C-q 377 char?

if i press Ctrl+q 377 Enter, i get this char: ÿ, which is LATIN SMALL
LETTER Y WITH DIAERESIS (unicode U+00FF).

Then if i do:

(re-search-forward "ÿ")
ÿ

it works perfectly.

as far as my experience goes, the ease of programming with unicode in
elisp beats Perl and Python hands down...

  Xah
∑ http://xahlee.org/

☄



* Re: Making re-search-forward search for \377
  2008-11-02  8:45 ` Xah
@ 2008-11-02  9:12   ` Tyler Spivey
  2008-11-02 18:10     ` Kevin Rodgers
                       ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Tyler Spivey @ 2008-11-02  9:12 UTC (permalink / raw)
  To: help-gnu-emacs

Xah <xahlee@gmail.com> writes:

> what's the C-q 377 char?
>
> if i press Ctrl+q 377 Enter, i get this char: ÿ, which is LATIN SMALL
> LETTER Y WITH DIAERESIS (unicode U+00FF).
>
> Then if i do:
>
> (re-search-forward "ÿ")
>
> it works perfectly.
>
> as far as my experience goes, the ease of programing with unicode in
> elisp beats Perl and Python hands down...

I'm probably going to end up working with binary data in a temp
buffer. Doing more research, I want enable-multibyte-characters to be
off. Given that, if we go to *scratch* and run
M-x toggle-enable-multibyte-characters until that variable becomes nil,
doing C-q 377 RET gives 0xff, which is what I want (according to C-x =,
C-u C-x = and M-x describe-char). Now to match it, I try:

(re-search-forward "\xff") - no luck

What did you use to figure out that the multibyte version of that
character was 0x00FF? I found it out accidentally as a lisp error, but
none of the previously described commands (C-x =, M-x describe-char or
C-u C-x =) will show that it is 0x00ff, they just show FF.


* Re: Making re-search-forward search for \377
  2008-11-02  9:12   ` Tyler Spivey
@ 2008-11-02 18:10     ` Kevin Rodgers
  2008-11-02 20:32     ` Xah
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Kevin Rodgers @ 2008-11-02 18:10 UTC (permalink / raw)
  To: help-gnu-emacs

Tyler Spivey wrote:
> Xah <xahlee@gmail.com> writes:
> 
>> what's the C-q 377 char?
>>
>> if i press Ctrl+q 377 Enter, i get this char: ÿ, which is LATIN SMALL
>> LETTER Y WITH DIAERESIS (unicode U+00FF).
>>
>> Then if i do:
>>
>> (re-search-forward "ÿ")
>>
>> it works perfectly.
>>
>> as far as my experience goes, the ease of programing with unicode in
>> elisp beats Perl and Python hands down...
> 
> I'm probably going to end up working with binary data in a temp
> buffer. Doing more research, I want enable-multibyte-characters to be
> off. Given that, if we go to *scratch*
> and run M-X toggle-enable-multibyte-characters until that variable
> becomes nil, doing C-Q 377 RET gives 0xff, which is what I want
> (according to C-x =, C-u C-x = and M-x describe-char). Now to
> match it, I try:
> 
> (re-search-forward "\xff") - no luck
> 
> What did you use to figure out that the multibyte version of that
> character was 0x00FF? I found it out accidentally as a lisp error, but
> none of the previously described commands (C-X =, M-X describe-char or
> C-u C-x =) will show that it is 0x00ff, they just show FF.

`C-u C-x =' shows this:

   character: ÿ (2303, #o4377, #x8ff, U+00FF)

-- 
Kevin Rodgers
Denver, Colorado, USA






* Re: Making re-search-forward search for \377
  2008-11-02  9:12   ` Tyler Spivey
  2008-11-02 18:10     ` Kevin Rodgers
@ 2008-11-02 20:32     ` Xah
  2008-11-02 22:35       ` Tyler Spivey
  2008-11-03  4:21     ` Eli Zaretskii
       [not found]     ` <mailman.2743.1225686066.25473.help-gnu-emacs@gnu.org>
  3 siblings, 1 reply; 9+ messages in thread
From: Xah @ 2008-11-02 20:32 UTC (permalink / raw)
  To: help-gnu-emacs

Xah Lee wrote:
> Xah<xah...@gmail.com> writes:
> > what's the C-q 377 char?
>
> > if i press Ctrl+q 377 Enter, i get this char: ÿ, which is LATIN SMALL
> > LETTER Y WITH DIAERESIS (unicode U+00FF).
>
> > Then if i do:
>
> > (re-search-forward "ÿ")

Tyler Spivey wrote:
> I'm probably going to end up working with binary data in a temp
> buffer. Doing more research, I want enable-multibyte-characters to be
> off. Given that, if we go to *scratch*
> and run M-X toggle-enable-multibyte-characters until that variable
> becomes nil, doing C-Q 377 RET gives 0xff, which is what I want
> (according to C-x =, C-u C-x = and M-x describe-char). Now to
> match it, I try:
>
> (re-search-forward "\xff") - no luck

sorry can't help you much there. ...i don't have much experience
working with binary data.

> What did you use to figure out that the multibyte version of that
> character was 0x00FF? I found it out accidentally as a lisp error, but
> none of the previously described commands (C-X =, M-X describe-char or
> C-u C-x =) will show that it is 0x00ff, they just show FF.

installing a unicode data file is probably what you need.

Q: I have this character α on the screen. How to find out its
unicode's hex value or name?

You can find out a character's decimal, octal, or hex values by
placing your cursor on the character and typing
“Alt+x what-cursor-position” (Ctrl+x =). You can get more info if you
place your cursor on the character, then press “Ctrl+u Ctrl+x =”.

However, if you want the complete unicode info of a character, you
need to download a unicode data file and let emacs know where it is.
The unicode data file can be downloaded at: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt.
After you downloaded it, place the following code in your “~/.emacs”
to let emacs know where it is:

; set unicode data file location. (used by what-cursor-position)
(let ((x "~/Documents/emacs/UnicodeData.txt"))
  (when (file-exists-p x)
    (setq describe-char-unicodedata-file x)))

Then restart emacs. Once you've done this, place your cursor on a
unicode char and do “Ctrl+u Ctrl+x =”; emacs will then give you all
the unicode info about that char, including the code point in decimal,
octal, and hex notations, as well as the unicode character name,
category, the font emacs is using, and others.

For example, here's the output on the character “α”:

      character: α (332721, #o1211661, #x513b1, U+03B1)
        charset: mule-unicode-0100-24ff
                 (Unicode characters of the range U+0100..U+24FF.)
     code point: #x27 #x31
         syntax: w 	which means: word
       category: g:Greek
    buffer code: #x9C #xF4 #xA7 #xB1
      file code: #xCE #xB1 (encoded by coding system mule-utf-8-unix)
        display: by this font (glyph code)
     -apple-symbol-medium-r-normal--14-140-72-72-m-140-mac-symbol
(#x61)
   Unicode data:
           Name: GREEK SMALL LETTER ALPHA
       Category: lowercase letter
Combining class: Spacing
  Bidi category: Left-to-Right
      Uppercase: Α
      Titlecase: Α

There are text properties here:
  fontified            t

this page might help you if you work with unicode.
http://xahlee.org/emacs/emacs_n_unicode.html

  Xah
∑ http://xahlee.org/

☄


* Re: Making re-search-forward search for \377
  2008-11-02 20:32     ` Xah
@ 2008-11-02 22:35       ` Tyler Spivey
  0 siblings, 0 replies; 9+ messages in thread
From: Tyler Spivey @ 2008-11-02 22:35 UTC (permalink / raw)
  To: help-gnu-emacs

Xah <xahlee@gmail.com> writes:

> Xah Lee wrote:
>> Xah<xah...@gmail.com> writes:
>> > what's the C-q 377 char?
>>
>> > if i press Ctrl+q 377 Enter, i get this char: ÿ, which is LATIN SMALL
>> > LETTER Y WITH DIAERESIS (unicode U+00FF).
>>
>> > Then if i do:
>>
>> > (re-search-forward "ÿ")
>
> Tyler Spivey wrote:
>> I'm probably going to end up working with binary data in a temp
>> buffer. Doing more research, I want enable-multibyte-characters to be
>> off. Given that, if we go to *scratch*
>> and run M-X toggle-enable-multibyte-characters until that variable
>> becomes nil, doing C-Q 377 RET gives 0xff, which is what I want
>> (according to C-x =, C-u C-x = and M-x describe-char). Now to
>> match it, I try:
>>
>> (re-search-forward "\xff") - no luck
>
I've done yet more digging, and it seems that I need to use
raw-text-unix encoding. I've sort of got this to work, and this next
example is more like what I'm doing; the smallest part that seems to
fail:
(progn
  (setq re1 "\377\371")
  (setq re2 "\\(\377\371\\)")
  (insert (decode-coding-string "line 1\nline 2\377\371" 'raw-text-unix)))

Evaluate that in an empty buffer, and then run
M-: (re-search-forward re1) RET at the beginning of the text after the
sexp. Then try M-: (re-search-forward re2) RET from just after the sexp.
re1 matches fine, but re2 won't match. What am I missing here? I thought
that putting parens around re1 to get re2 should give me the same
expression but with capturing.

Here are details on my emacs version:
GNU Emacs 23.0.60.1 (x86_64-unknown-linux-gnu, GTK+ Version 2.14.4) of 2008-11-01 on arch1
I tested this in 22.3, and it seems to work. In reading the NEWS file
for 23, I see changes in character set handling. What do I need to do to
make re2 match what re1 does but with capturing? I realize that in this
case I can probably use (match-string 0), but the full RE that I'm going
to eventually be matching on is this:
"\\(\377[\371\357]\\)\\|\\(\n\\)"
Any help would be appreciated.
- Tyler
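
One possible way to test the full regexp outside the live process buffer
is sketched below. It rests on assumptions not confirmed in this thread:
that the data can be kept in a unibyte buffer, so the octal escapes in the
regexp are compared as plain bytes, and that the Emacs in use matches raw
bytes inside bracket expressions in such a buffer. The function name
my/scan-markers and the symbols telnet and newline are made up for
illustration.

```elisp
;; Sketch only: scan a string of raw bytes for telnet markers and newlines.
;; `my/scan-markers' is a hypothetical helper, not part of Emacs.
(defun my/scan-markers (bytes)
  "Return a list naming each match of the marker regexp in BYTES, in order.
Each element is `telnet' (group 1 matched) or `newline' (group 2 matched)."
  (with-temp-buffer
    ;; Assumption: treat the data as raw bytes, not decoded characters.
    (set-buffer-multibyte nil)
    (insert bytes)
    (goto-char (point-min))
    (let ((re "\\(\377[\371\357]\\)\\|\\(\n\\)")
          (kinds '()))
      (while (re-search-forward re nil t)
        ;; `match-beginning' is nil for a group that did not participate.
        (push (if (match-beginning 1) 'telnet 'newline) kinds))
      (nreverse kinds))))
```

With the sample text from the sexp above,
(my/scan-markers "line 1\nline 2\377\371") should report one newline
followed by one telnet marker.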



* Re: Making re-search-forward search for \377
  2008-11-02  9:12   ` Tyler Spivey
  2008-11-02 18:10     ` Kevin Rodgers
  2008-11-02 20:32     ` Xah
@ 2008-11-03  4:21     ` Eli Zaretskii
       [not found]     ` <mailman.2743.1225686066.25473.help-gnu-emacs@gnu.org>
  3 siblings, 0 replies; 9+ messages in thread
From: Eli Zaretskii @ 2008-11-03  4:21 UTC (permalink / raw)
  To: help-gnu-emacs

> From: Tyler Spivey <tspivey@pcdesk.net>
> Date: Sun, 02 Nov 2008 01:12:10 -0800
> 
> I'm probably going to end up working with binary data in a temp
> buffer. Doing more research, I want enable-multibyte-characters to be
> off. Given that, if we go to *scratch*
> and run M-X toggle-enable-multibyte-characters until that variable
> becomes nil, doing C-Q 377 RET gives 0xff, which is what I want
> (according to C-x =, C-u C-x = and M-x describe-char). Now to
> match it, I try:
> 
> (re-search-forward "\xff") - no luck
> 
> What did you use to figure out that the multibyte version of that
> character was 0x00FF? I found it out accidentally as a lisp error, but
> none of the previously described commands (C-X =, M-X describe-char or
> C-u C-x =) will show that it is 0x00ff, they just show FF.

Why are you trying to use re-search-forward with octal codes such as
\377?  What are you trying to do?  Does the buffer you are searching
hold human-readable text, or does it hold binary data, i.e. raw bytes?

In the former case, you need to use characters in the search string,
not literal codes like \377 or \xff, and the buffer should be in the
(default) multibyte mode.  As far as Emacs is concerned, \377 is not a
character code; it's an encoding of some character.  Do _not_ make
the mistake of turning enable-multibyte-characters off and using raw
bytes such as \377 for searching normal text; that way lies madness.
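
For the human-readable case, a minimal sketch of searching for the
character itself in a multibyte buffer might look like the following.
The helper name my/find-y-diaeresis is hypothetical; (string #xFF) is
used only to build the one-character string "ÿ" without depending on how
the source file is encoded.

```elisp
;; Sketch only: search for the character U+00FF in ordinary text.
;; `my/find-y-diaeresis' is a hypothetical helper, not part of Emacs.
(defun my/find-y-diaeresis (text)
  "Insert TEXT into a multibyte temp buffer and search for U+00FF.
Return the position just after the first match, or nil if none."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    ;; The pattern is a one-character multibyte string, not a raw byte.
    (re-search-forward (string #xFF) nil t)))
```

For example, (my/find-y-diaeresis (concat "abc" (string #xFF))) returns
the position after the fourth character of the buffer.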


* Re: Making re-search-forward search for \377
       [not found]     ` <mailman.2743.1225686066.25473.help-gnu-emacs@gnu.org>
@ 2008-11-03  4:54       ` Tyler Spivey
  2008-11-03 19:42         ` Eli Zaretskii
  0 siblings, 1 reply; 9+ messages in thread
From: Tyler Spivey @ 2008-11-03  4:54 UTC (permalink / raw)
  To: help-gnu-emacs

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Tyler Spivey <tspivey@pcdesk.net>
>> Date: Sun, 02 Nov 2008 01:12:10 -0800
>> 
>> I'm probably going to end up working with binary data in a temp
>> buffer. Doing more research, I want enable-multibyte-characters to be
>> off. Given that, if we go to *scratch*
>> and run M-X toggle-enable-multibyte-characters until that variable
>> becomes nil, doing C-Q 377 RET gives 0xff, which is what I want
>> (according to C-x =, C-u C-x = and M-x describe-char). Now to
>> match it, I try:
>> 
>> (re-search-forward "\xff") - no luck
>> 
>> What did you use to figure out that the multibyte version of that
>> character was 0x00FF? I found it out accidentally as a lisp error, but
>> none of the previously described commands (C-X =, M-X describe-char or
>> C-u C-x =) will show that it is 0x00ff, they just show FF.
>
> Why are you trying to use re-search-forward with octal codes such as
> \377?  What are you trying to do? does the buffer you are searching
> hold human-readable text or does it hold binary data, i.e. raw bytes?
>
> In the former case, you need to use characters in the search string,
> not literal codes like \377 or xff, and the buffer should be in the
> (default) multibyte mode.  \377 is not a character code, as far as
> Emacs is concerned, it's an encoding of some character.  Do _not_ make
> a mistake of turning enable-multibyte-characters off and using raw
> bytes such as \377 for searching normal text, that way lies madness.

I think this is partially a problem with emacs, and partially a problem
with what I'm trying to do, or my understanding of regex. I posted to
emacs-devel; maybe someone there knows more. What I'm trying to do is
split text up for use in a mud client, based on the following re:
"\\(\377[\371\357]\\)\\|\\(\n\\)"
The encoding of the process is raw-text-unix.
Manually running M-: (re-search-forward "\\(\377[\371\357]\\)") fails,
but running M-: (re-search-forward "\377\371") works fine. However, I
want it to match the longer re stated above, and running re-search on
that just matches the newlines.

This is mostly text, with telnet control characters thrown in that I
want to use as delimiters of a sort and process on them, while deleting
them from the text. Using a re-search would be perfect for this if I
could figure out how to do it.

In reading section 2.3.8.2 of the manual, we get this:
   You can represent a unibyte non-ASCII character with its character
code, which must be in the range from 128 (0200 octal) to 255 (0377
octal).  If you write all such character codes in octal and the string
contains no other characters forcing it to be multibyte, this produces
a unibyte string.  However, using any hex escape in a string (even for
an ASCII character) forces the string to be multibyte.

I've left enable-multibyte-characters alone, but even searching for
"[\377]\371" fails, while "\377\371" succeeds.
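
The unibyte/multibyte distinction in the manual excerpt can be checked
from Lisp. The sketch below uses a hypothetical helper,
my/describe-string; the functions it calls (multibyte-string-p,
string-bytes, string-to-multibyte) are standard, and the multibyte figures
assume Emacs 23 or later, where raw bytes have their own internal
character codes.

```elisp
;; Sketch only: report how Emacs represents a string.
;; `my/describe-string' is a hypothetical helper, not part of Emacs.
(defun my/describe-string (s)
  "Return a plist describing the representation of string S."
  (list :multibyte (multibyte-string-p s) ; nil for a unibyte string
        :chars (length s)                 ; number of characters
        :bytes (string-bytes s)           ; bytes in the internal form
        :codes (append s nil)))           ; character codes as a list
```

A literal written only with octal escapes, such as "\377\371", is
reported as unibyte with codes 255 and 249, while
(string-to-multibyte "\377\371") is multibyte and carries much larger
codes for the same two bytes.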


* Re: Making re-search-forward search for \377
  2008-11-03  4:54       ` Tyler Spivey
@ 2008-11-03 19:42         ` Eli Zaretskii
  0 siblings, 0 replies; 9+ messages in thread
From: Eli Zaretskii @ 2008-11-03 19:42 UTC (permalink / raw)
  To: help-gnu-emacs

> From: Tyler Spivey <tspivey@pcdesk.net>
> Date: Sun, 02 Nov 2008 20:54:52 -0800
> 
> What I'm trying to do is split text up for use in a mud
> client, based on the following re:
> "\\(\377[\371\357]\\)\\|\\(\n\\)"
> the encoding of the process is raw-text-unix.
> manually running M-: (re-search-forward "\\(\377[\371\357]\\)") fails,
> but
> running M-: (re-search-forward "\377\371") works fine. However, I want
> it to match
> the longer re stated above, but running re-search on that just matches
> the newlines.
> 
> This is mostly text, with telnet control characters thrown in

If it's text, Emacs is unlikely to treat what was \377 etc. in the
file as just an 8-bit byte whose integer value is \377.  Depending on
your locale, Emacs will interpret such bytes as encoded characters and
convert them to its internal representation, which is exposed to you
as a large integer.  (This conversion is called ``decoding''.)

To see what Emacs thinks about those characters, go to one of them and
type "C-u C-x =".

If I'm right, searching for literal \377\371 is unlikely to succeed,
since there's no such character in the buffer after decoding.
Instead, you should search for the codepoints in the internal
representation, as shown to you by "C-u C-x =".  To insert such
characters, the easiest way is to use an ``input method''.  You set an
input method by typing "C-u C-\" and then the name of the input method
you want.  Typing "C-u C-\ TAB" will show the list of available input
methods, and "C-h C-\ METHOD" will describe the named input method.
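
The codes that "C-u C-x =" reports can also be read from Lisp with
char-after. The sketch below is only an illustration: my/char-codes is a
hypothetical helper, and the raw-byte values mentioned afterwards assume
Emacs 23 or later (the Emacs 22 output quoted earlier in this thread
shows different internal codes).

```elisp
;; Sketch only: list the internal character codes of buffer text.
;; `my/char-codes' is a hypothetical helper, not part of Emacs.
(defun my/char-codes (string)
  "Insert STRING into a multibyte temp buffer and return its char codes."
  (with-temp-buffer
    (insert string)
    (goto-char (point-min))
    (let ((codes '()))
      (while (not (eobp))
        ;; `char-after' returns the character at point as an integer.
        (push (char-after) codes)
        (forward-char 1))
      (nreverse codes))))
```

Under that assumption, (my/char-codes (string #xFF)) gives (255) for the
character ÿ, whereas (my/char-codes (string-to-multibyte "\377\371"))
gives (#x3FFFFF #x3FFFF9), the codes Emacs uses for the raw bytes.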

> In reading section 2.3.8.2 of the manual, we get this:
>    You can represent a unibyte non-ASCII character with its character
> code, which must be in the range from 128 (0200 octal) to 255 (0377
> octal).  If you write all such character codes in octal and the string
> contains no other characters forcing it to be multibyte, this produces
> a unibyte string.  However, using any hex escape in a string (even for
> an ASCII character) forces the string to be multibyte.
> 
> I've left enable-multibyte-characters alone, but even searching for
> "[\377]\371" fails, while "\377\371" succeeds.

I don't recommend using unibyte facilities; they are tricky and
treacherous.

