unofficial mirror of bug-gnu-emacs@gnu.org 
* bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct?
@ 2010-05-27 17:28 MON KEY
  2010-05-27 18:10 ` Eli Zaretskii
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: MON KEY @ 2010-05-27 17:28 UTC (permalink / raw)
  To: 6283

In doc/lispref/searching.texi is the following reference to octal code
`0377' correct?

,----
| You cannot always match all non-@acronym{ASCII} characters with the
| regular expression @code{"[\200-\377]"}.  This works when searching a
| unibyte buffer or string (@pxref{Text Representations}), but not in a
| multibyte buffer or string, because many non-@acronym{ASCII}
| characters have codes above octal 0377.  {....}
`---- :FILE doc/lispref/searching.texi  (info "(elisp)Regexp Special")

Shouldn't that be:

"characters have codes above octal #o377"

--
/s_P\






* bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct?
  2010-05-27 17:28 bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct? MON KEY
@ 2010-05-27 18:10 ` Eli Zaretskii
  2010-05-27 22:59   ` MON KEY
       [not found]   ` <AANLkTikjCByug1U69tbhsnmS4c1VXSNzoqAOAxmbt3bI@mail.gmail.com>
  2010-05-31 23:44 ` MON KEY
  2010-06-02 16:06 ` MON KEY
  2 siblings, 2 replies; 19+ messages in thread
From: Eli Zaretskii @ 2010-05-27 18:10 UTC (permalink / raw)
  To: MON KEY; +Cc: 6283

> Date: Thu, 27 May 2010 13:28:16 -0400
> From: MON KEY <monkey@sandpframing.com>
> Cc: 
> 
> In doc/lispref/searching.texi is the following reference to octal code
> `0377' correct?
> 
> ,----
> | You cannot always match all non-@acronym{ASCII} characters with the
> | regular expression @code{"[\200-\377]"}.  This works when searching a
> | unibyte buffer or string (@pxref{Text Representations}), but not in a
> | multibyte buffer or string, because many non-@acronym{ASCII}
> | characters have codes above octal 0377.  {....}
> `---- :FILE doc/lispref/searching.texi  (info "(elisp)Regexp Special")
> 
> Shouldn't that be:
> 
> "characters have codes above octal #o377"

What's the difference between what's written and what you suggest?






* bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct?
  2010-05-27 18:10 ` Eli Zaretskii
@ 2010-05-27 22:59   ` MON KEY
  2010-05-29 14:28     ` Kevin Rodgers
       [not found]   ` <AANLkTikjCByug1U69tbhsnmS4c1VXSNzoqAOAxmbt3bI@mail.gmail.com>
  1 sibling, 1 reply; 19+ messages in thread
From: MON KEY @ 2010-05-27 22:59 UTC (permalink / raw)
  To: 6283

On Thu, May 27, 2010 at 2:10 PM, Eli Zaretskii <eliz@gnu.org> wrote:
>> `---- :FILE doc/lispref/searching.texi  (info "(elisp)Regexp Special")
>>
>> Shouldn't that be:
>>
>> "characters have codes above octal #o377"
>
> What's the difference between what's written and what you suggest?
>

(string-equal "0377"  "#o377")  => nil
(string-equal "0377"  "0377")   => t
(string-equal "#o377" "#o377")  => t

The latter form's read syntax is more in keeping with how the Lisp
reader would interpret what the info docs refer to as `octal
0377', and is in keeping with what is presented in
(info "(elisp)Integer Basics"):

(eval #o377) => 255

What isn't at all clear in the infos in general is that the octal (or
FTM decimal, hex, etc. representations) for the literal raw-byte \255
is arrived at with something more like:

(insert (char-to-string #o17777655))

(insert (char-to-string #x3fffad))

(insert (char-to-string 4194221))

e.g. decimal 4194221 -> octal #o17777655 -> hex #x3fffad

Without knowing what to do with that octal value, simply referencing \255
as octal 0377 or hex X3FFFAD isn't all that informative in itself.
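
The relationship between a byte value and the character code quoted
above can be sketched as follows. This is only an illustration,
assuming Emacs 23 or later; the names `my-raw-byte' and
`my-raw-byte-char' are hypothetical. Note that octal 255 is decimal
173, while decimal 255 is octal 377.

```elisp
;; One byte value in three notations: octal 255 = hex AD = decimal 173.
(defconst my-raw-byte #o255)            ; hypothetical name

;; In multibyte text the byte is stored as an "eight-bit" character
;; whose code is #x3FFF00 plus the byte value.
(defconst my-raw-byte-char (unibyte-char-to-multibyte my-raw-byte))

(list my-raw-byte                                  ; 173
      my-raw-byte-char                             ; 4194221
      (multibyte-char-to-unibyte my-raw-byte-char) ; 173 again
      (format "#o%o #x%x" my-raw-byte-char my-raw-byte-char))
```

The last element prints the same code in octal and hex, which should
line up with the #o17777655 and #x3fffad forms given above.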

FWIW it took me a coupla years to figure out how to frob those
values into a raw-byte, and I still have to relearn it from the docs
whenever I need to manually revert some raw-bytes or improperly
encoded bit-rotted text using regexps.

-- 
/s_P\






* bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct?
       [not found]   ` <AANLkTikjCByug1U69tbhsnmS4c1VXSNzoqAOAxmbt3bI@mail.gmail.com>
@ 2010-05-28  7:15     ` Eli Zaretskii
  2010-05-28 23:20       ` MON KEY
  0 siblings, 1 reply; 19+ messages in thread
From: Eli Zaretskii @ 2010-05-28  7:15 UTC (permalink / raw)
  To: MON KEY; +Cc: 6283

> Date: Thu, 27 May 2010 18:56:51 -0400
> From: MON KEY <monkey@sandpframing.com>
> 
> On Thu, May 27, 2010 at 2:10 PM, Eli Zaretskii <eliz@gnu.org> wrote:
> 
> >> | characters have codes above octal 0377.  {....}
> >> `---- :FILE doc/lispref/searching.texi  (info "(elisp)Regexp Special")
> >>
> >> Shouldn't that be:
> >>
> >> "characters have codes above octal #o377"
> >
> > What's the difference between what's written and what you suggest?
> >
> 
> (string-equal "0377"  "#o377")  => nil
> (string-equal "0377"  "0377")   => t
> (string-equal "#o377" "#o377")  => t

Sorry, I don't see the relevance.  The manual talks about the
_numeric_ code of characters, not about their read syntax.  It uses
"octal 0377" to present values because octal notation of single-byte
characters is something many people are familiar with, regardless of
them being Lisp programmers or not.  After all, that is the codepoint
of the character.
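
The distinction between a numeric code and its read syntax can be
illustrated with a short sketch. The name `my-notations' is
hypothetical; the forms themselves are ordinary Emacs Lisp integer and
character read syntax.

```elisp
;; Six spellings that the Lisp reader turns into the same integer, 255.
(defconst my-notations                  ; hypothetical name
  (list #o377 #xff #b11111111 255 ?\377 ?\xff))

;; Once read, the notation is gone; only the number remains.
(apply #'= my-notations)

;; Printing picks a notation again, independently of how it was read.
(format "%d %o %x" #o377 #o377 #o377)
```

Under this view, "octal 0377" in prose and `#o377' in code name the
same code; only the second is something the reader accepts.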

> What isn't at all clear in the infos in general is that the octal (or
> FTM decimal, hex, etc. representations) for the literal raw-byte \255
> is arrived at with something more like:
> 
> (insert (char-to-string #o17777655))

This is explained in "Non-ASCII Characters".  But we generally try not
to advertise this issue too much, because there should be no good
reason for a Lisp program to create raw bytes.  Emacs is a text
editor, while raw bytes are not text.

> FWIW it took me a coupla years to figure out how to frob those
> values into a raw-byte, and I still have to relearn it from the docs
> whenever I need to manually revert some raw-bytes or improperly
> encoded bit-rotted text using regexps.

It's hard to believe Emacs couldn't handle any such text in some other
way.  What "improper encoding" was that which Emacs couldn't handle?
Could it be that you simply gave up too early and tried to solve the
problem by treating text as bytes, while it really wasn't?







* bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct?
  2010-05-28  7:15     ` Eli Zaretskii
@ 2010-05-28 23:20       ` MON KEY
  2010-05-29  6:45         ` Eli Zaretskii
  0 siblings, 1 reply; 19+ messages in thread
From: MON KEY @ 2010-05-28 23:20 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 6283

On Fri, May 28, 2010 at 3:15 AM, Eli Zaretskii <eliz@gnu.org> wrote:
> Sorry, I don't see the relevance.  The manual talks about the
> _numeric_ code of characters, not about their read syntax.

I must be misunderstanding something.
What is the numeric code of \255 ?

> It uses "octal 0377" to present values because octal notation of
> single-byte characters is something many people are familiar with,

Where is this convention detailed/discussed in the manual?
I don't find it mentioned in the (info "(elisp)Conventions").

Should it be, esp. as 0377 is not a representation exposed by the
Emacs user level interface (at least none that I'm aware of)?

> After all, that is the codepoint of the character.

Of which character?

0377 doesn't have a character that I'm aware of.

> This is explained in "Non-ASCII Characters".  But we generally try not


But this is my point: that section (being the most relevant to
Non-ASCII notation) tends to use the #<radix> notation.

> to advertise this issue too much, because there should be no good
> reason for a Lisp program to create raw bytes.  Emacs is a text
> editor, while raw bytes are not text.

That's just silly. Emacs accommodates noodling w/ raw-bytes because it
is necessary to edit them on occasion. Heck, Emacs w32 distributes
with a dedicated executable just to edit binary data in hexadecimal
form.

>> whenever I need to manually revert some raw-bytes or improperly
>> encoded bit-rotted text using regexps.
>
> It's hard to believe Emacs couldn't handle any such text in some other
> way.

It generally can. However, sometimes file encodings get out of whack
over time and once they are more than a generation away from
rightedness Emacs isn't always able to revert them.

The good thing is Emacs can do this and I'm very glad it does :)

Besides, it's my prerogative how I choose to abuse Emacs into abusing
my data.

> What "improper encoding" was that which Emacs couldn't handle?

The "mixed bag encoding". Not all of my files originated in Emacs. Not
all of them get read into an Emacs buffer without problems.

GIGO c'est la vie.

FWIW I have entire SQL databases of multi-lingual multi-encoding data
that was improperly uploaded into them via a misconfigured PHP script
with a funky encoding declaration which itself got its input from a
certain legacy proprietary w32 web-browser that understood (read
willfully mis-interpreted) UTF-8 according to its own whims and I can
assure you that encodings don't translate perfectly nor are the
mis-translations always easily caught or corrected.

Stuff like this can sometimes happen with system locales too.
Transitioning files from vfat will clobber file names too if you're not careful.

Sometimes I need to find the raw-bytes and replace them with their
character equivalent.

> Could it be that you simply gave up too early and tried to solve the
> problem by treating text as bytes, while it really wasn't?

Nope. I'm usually pretty good about _not_ approaching these problems
with this type of hammer unless it is a last resort.

--
/s_P\






* bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct?
  2010-05-28 23:20       ` MON KEY
@ 2010-05-29  6:45         ` Eli Zaretskii
  2010-05-31  5:35           ` MON KEY
  2010-05-31 14:45           ` MON KEY
  0 siblings, 2 replies; 19+ messages in thread
From: Eli Zaretskii @ 2010-05-29  6:45 UTC (permalink / raw)
  To: MON KEY; +Cc: 6283

> Date: Fri, 28 May 2010 19:20:18 -0400
> From: MON KEY <monkey@sandpframing.com>
> Cc: 6283@debbugs.gnu.org
> 
> On Fri, May 28, 2010 at 3:15 AM, Eli Zaretskii <eliz@gnu.org> wrote:
> > Sorry, I don't see the relevance.  The manual talks about the
> > _numeric_ code of characters, not about their read syntax.
> 
> I must be misunderstanding something.
> What is the numeric code of \255 ?

\255 is not a character, it's the numeric code itself.

> > It uses "octal 0377" to present values because octal notation of
> > single-byte characters is something many people are familiar with,
> 
> Where is this convention detailed/discussed in the manual?

It's not an Emacs convention to represent characters by their
codepoints expressed in octal.  It's a widely accepted practice.  If
we were to describe every convention in the world in the manual, 99%
of the manual would be devoted to describing conventions.

> Should it be, esp. as 0377 is not a representation exposed by the
> Emacs user level interface (at least none that I'm aware of)?

Again, this part of the manual is not about how Emacs represents
characters or reads them.  It's about their codes.

> > After all, that is the codepoint of the character.
> 
> Of which character?
> 
> 0377 doesn't have a character that I'm aware of.

In Unicode, it's a codepoint of LATIN SMALL LETTER Y WITH DIAERESIS.

But the text says "...many non-ASCII characters have codes above octal
0377".  It doesn't talk about a specific character here, just about
which codepoints are below it and which are above it.
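
The "below and above" point can be sketched with the two regexps from
the manual passage. This is a minimal illustration under the
assumption of a multibyte string; `my-sample' is a hypothetical name,
and GREEK SMALL LETTER LAMDA stands in for any character whose code is
above octal 0377.

```elisp
;; A multibyte string: "a" followed by the character with code #x3bb.
(defconst my-sample (string ?a #x3bb))  ; hypothetical name

;; The byte-range class cannot reach code #x3bb, so there is no match.
(string-match-p "[\200-\377]" my-sample)

;; Excluding the ASCII range instead finds the non-ASCII character
;; at index 1.
(string-match-p "[^\000-\177]" my-sample)
```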

> > to advertise this issue too much, because there should be no good
> > reason for a Lisp program to create raw bytes.  Emacs is a text
> > editor, while raw bytes are not text.
> 
> That's just silly. Emacs accommodates noodling w/ raw-bytes because it
> is necessary to edit them on occasion. Heck, Emacs w32 distributes
> with a dedicated executable just to edit binary data in hexadecimal
> form.

I didn't say that we are going to remove these features any time soon.
Just that the manual doesn't talk too much about this, to avoid
confusing users with issues that are both very complicated and very
obscure, and are rarely if at all needed on the Lisp level.

> It generally can. However, sometimes file encodings get out of whack
> over time and once they are more than a generation away from
> rightedness Emacs isn't always able to revert them.
> 
> The good thing is Emacs can do this and I'm very glad it does :)
> 
> Besides, it's my prerogative how I choose to abuse Emacs into abusing
> my data.

Of course.  But why do you expect to find the description of such
abuse in the manual?






* bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct?
  2010-05-27 22:59   ` MON KEY
@ 2010-05-29 14:28     ` Kevin Rodgers
  0 siblings, 0 replies; 19+ messages in thread
From: Kevin Rodgers @ 2010-05-29 14:28 UTC (permalink / raw)
  To: bug-gnu-emacs

MON KEY wrote:
> On Thu, May 27, 2010 at 2:10 PM, Eli Zaretskii <eliz@gnu.org> wrote:
>>> `---- :FILE doc/lispref/searching.texi  (info "(elisp)Regexp Special")
>>>
>>> Shouldn't that be:
>>>
>>> "characters have codes above octal #o377"
>> What's the difference between what's written and what you suggest?
>>
> 
> (string-equal "0377"  "#o377")  => nil
> (string-equal "0377"  "0377")   => t
> (string-equal "#o377" "#o377")  => t

Those strings are not what you seem to think:

(length "0377") ⇒ 4

(length "#o377") ⇒ 5

I think "\377" aka "\xFF" aka "\u00FF" is what you mean.

> The latter form's read syntax is more in keeping with how the Lisp
> reader would interpret what the info docs refer to as `octal
> 0377', and is in keeping with what is presented in
> (info "(elisp)Integer Basics"):
> 
> (eval #o377) => 255
> 
> What isn't at all clear in the infos in general is that the octal (or
> FTM decimal, hex, etc. representations) for the literal raw-byte \255
> is arrived at with something more like:
> 
> (insert (char-to-string #o17777655))
> 
> (insert (char-to-string #x3fffad))
> 
> (insert (char-to-string 4194221))
> 
> e.g. decimal 4194221 -> octal #o17777655 -> hex #x3fffad
> 
> Without knowing what to do with that octal value, simply referencing \255
> as octal 0377 or hex X3FFFAD isn't all that informative in itself.
> 
> FWIW it took me a coupla years to figure out how to frob those
> values into a raw-byte, and I still have to relearn it from the docs
> whenever I need to manually revert some raw-bytes or improperly
> encoded bit-rotted text using regexps.
> 


-- 
Kevin Rodgers
Denver, Colorado, USA







* bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct?
  2010-05-29  6:45         ` Eli Zaretskii
@ 2010-05-31  5:35           ` MON KEY
  2010-05-31 18:49             ` Eli Zaretskii
  2010-05-31 14:45           ` MON KEY
  1 sibling, 1 reply; 19+ messages in thread
From: MON KEY @ 2010-05-31  5:35 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 6283

On Sat, May 29, 2010 at 2:45 AM, Eli Zaretskii <eliz@gnu.org> wrote:
>
> It's not an Emacs convention to represent characters by their
> codepoints expressed in octal.  It's a widely accepted practice.  If
> we were to describe every convention in the world in the manual, 99%
> of the manual would be devoted to describing conventions.
>

That it is widely accepted practice is what makes it a convention.
Within Emacs Lisp it is also widely accepted practice to denote numeric
representations with #<radixN> notation. This is a conflict of
convention. The purpose of demarcating the use of a particular
convention in the stead of another is to clarify when one shall be
used with preference over another. It is unconventional for the manual
to use conflicting conventions without prejudice. This is my concern.

> Again, this part of the manual is not about how Emacs represents
> characters or reads them.  It's about their codes.

This is how I understood this portion of the manual.
Maybe I'm misunderstanding something fundamental about this distinction.

If this is so, I would greatly appreciate it if you could help me to
see it more clearly.

>> 0377 doesn't have a character that I'm aware of.
>
> In Unicode, it's a codepoint of LATIN SMALL LETTER Y WITH DIAERESIS.

I don't understand this.

>
> But the text says "...many non-ASCII characters have codes above octal
> 0377".  It doesn't talk about a specific character here, just about
> which codepoints are below it and which are above it.

Yes, but the regexp is "[\200-\377]".

>
> I didn't say that we are going to remove these features any time soon.
> Just that the manual doesn't talk too much about this, to avoid
> confusing users with issues that are both very complicated and very
> obscure, and are rarely if at all needed on the Lisp level.
>

I certainly agree they are confusing and easily misunderstood.
I disagree however that these issues are all that obscure.
You seem to suggest that the notation "octal 0NNN" is commonplace, yet
I personally find this notation to be obscure.

tomato|potato <-> potato|tomato

>
> Of course.  But why do you expect to find the description of such
> abuse in the manual?
>

I _do_ find them, whereas I don't find any such reference w/re the 0377 convention.
This is, I guess, my concern.

Following is my attempt to come to grips with the distinction between
the numeric codepoint, integer character representations, reader
conventions etc. w/re the manual and particularly their use in
conjunction w/ regexps.  I believe this example illustrates some
reasonable familiarity with aspects of char/code representation.

But maybe this bit of code can help to show if there is something that
I am not getting???

;;; ================================================================

(let (chars-found frob-found)
  (with-temp-buffer
    (save-excursion
      (insert 10 255 10 ?\377 10 "\255" 10 4194221 10 "\377" 10 4194303))
    (while (search-forward-regexp "[\200-\377]" nil t)
      (let* ((md (match-data t))
             (md-char (char-before (cadr md))))
        (push `(,md-char ,(car md) ,(cadr md)) chars-found))))
  (setq chars-found (nreverse chars-found))
  (dolist (cf chars-found
              (setq chars-found
                    `(,(setq frob-found (nreverse frob-found))
                      ,chars-found)))
    (push (car (read-from-string (format "#o%o" (car cf)))) frob-found))
  (setq frob-found nil)
  (dolist (ints (car chars-found)
                (setq chars-found
                      `(,(setq frob-found (nreverse frob-found))
                        ,@chars-found)))
    (push `(,ints . ,(char-to-string ints)) frob-found))
  (setq frob-found nil)
  (dolist (d (car chars-found)
             (setq chars-found
                   `(,(setq frob-found (nreverse frob-found)) ,@chars-found)))
    (let* ((mltb-int (car d))
           (unib-str (cdr d))
           (unib-str->mchar (string-to-char (symbol-name (read unib-str))))
           (mltb-int->uchar (multibyte-char-to-unibyte mltb-int)))
      (push `(:mltb-int ,mltb-int
                        :unib-str ,unib-str
                        :unib-str->mchar ,unib-str->mchar
                        :mltb-int->uchar ,mltb-int->uchar)
            frob-found)))
  (insert 10 (make-string 68 59) 10
          ";; With this regexp:" 10
          ";; \(search-forward-regexp \"[\\200-\\377]\" nil t\)" 10
          ";; Matched these chars:" 10
          255 10 ?\377 10 "\255" 10 4194221 10 "\377" 10 4194303 10
          (make-string 68 59) 10)
  (pp chars-found (current-buffer))
  (insert (make-string 68 59) "\n")
  (let ((cnt 0))
    (dolist (pl (car chars-found))
      (setq cnt (1+ cnt))
      (insert
       10 (make-string 68 59) 10
       (format
        (concat
         ";; :MATCH-DATA-#%d\n"
         "\n(char-to-string (unibyte-char-to-multibyte %d)) ;<-\"%c%d\"\n"
         "\n(insert (char-to-string (unibyte-char-to-multibyte %d))) ;<- multibyte-char\n"
         "\n(insert (identity %S)) ;<- raw-byte\n"
         "\n(insert (string-to-char (identity %S))) ;<- multibyte-char\n"
         "\n(insert-byte %d 1) ;<-raw-byte unibyte-char\n"
         "\n(insert (format \"(insert (identity #o%%o))\" (unibyte-char-to-multibyte %d)))\n")
        cnt
        (plist-get pl :mltb-int->uchar)
        92
        (string-to-number (format "%o" (plist-get pl :mltb-int->uchar)))
        (plist-get pl :mltb-int->uchar)
        (plist-get pl :unib-str)
        (plist-get pl :unib-str)
        (plist-get pl :mltb-int->uchar)
        (plist-get pl :mltb-int->uchar))))))

;;; ================================================================

--
/s_P\






* bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct?
  2010-05-29  6:45         ` Eli Zaretskii
  2010-05-31  5:35           ` MON KEY
@ 2010-05-31 14:45           ` MON KEY
  2010-05-31 18:51             ` Eli Zaretskii
  1 sibling, 1 reply; 19+ messages in thread
From: MON KEY @ 2010-05-31 14:45 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 6283

On Sat, May 29, 2010 at 2:45 AM, Eli Zaretskii <eliz@gnu.org> wrote:
>> Should it be, esp. as 0377 is not a representation exposed by the
>> Emacs user level interface (at least none that I'm aware of)?
>
> Again, this part of the manual is not about how Emacs represents
> characters or reads them.  It's about their codes.
>
I noticed this morning the following section of the manual, which
references hex values in radix notation:

,---- (info "(elisp)Coding Systems")
|
|    The result of encoding, and the input to decoding, are not ordinary
| text.  They logically consist of a series of byte values; that is, a
| series of ASCII and eight-bit characters.  In unibyte buffers and
| strings, these characters have codes in the range 0 through #xFF
| (255).  In a multibyte buffer or string, eight-bit characters have
| character codes higher than #xFF (*note Text Representations::), but
| Emacs transparently converts them to their single-byte values when you
| encode or decode such text.
|
`----

This is, I believe, an example of contradictory convention in the manual.

-- 
/s_P\






* bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct?
  2010-05-31  5:35           ` MON KEY
@ 2010-05-31 18:49             ` Eli Zaretskii
  2010-06-01  0:24               ` MON KEY
  0 siblings, 1 reply; 19+ messages in thread
From: Eli Zaretskii @ 2010-05-31 18:49 UTC (permalink / raw)
  To: MON KEY; +Cc: 6283

> Date: Mon, 31 May 2010 01:35:41 -0400
> From: MON KEY <monkey@sandpframing.com>
> Cc: 6283@debbugs.gnu.org
> 
> > Again, this part of the manual is not about how Emacs represents
> > characters or reads them.  It's about their codes.
> 
> This is how I understood this portion of the manual.
> Maybe I'm misunderstanding something fundamental about this distinction.

A character and its codepoint are not the same thing.  If this
distinction is not clear, I suggest reading the Unicode Technical
Report #17 (http://unicode.org/reports/tr17/).

> >> 0377 doesn't have a character that I'm aware of.
> >
> > In Unicode, it's a codepoint of LATIN SMALL LETTER Y WITH DIAERESIS.
> 
> I don't understand this.

I don't know how to express this more clearly.  Perhaps you could ask
specific questions.






* bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct?
  2010-05-31 14:45           ` MON KEY
@ 2010-05-31 18:51             ` Eli Zaretskii
  0 siblings, 0 replies; 19+ messages in thread
From: Eli Zaretskii @ 2010-05-31 18:51 UTC (permalink / raw)
  To: MON KEY; +Cc: 6283

> Date: Mon, 31 May 2010 10:45:59 -0400
> From: MON KEY <monkey@sandpframing.com>
> Cc: 6283@debbugs.gnu.org
> 
> I noticed this morning the following section of the manual, which
> references hex values in radix notation:
> 
> ,---- (info "(elisp)Coding Systems")
> |
> |    The result of encoding, and the input to decoding, are not ordinary
> | text.  They logically consist of a series of byte values; that is, a
> | series of ASCII and eight-bit characters.  In unibyte buffers and
> | strings, these characters have codes in the range 0 through #xFF
> | (255).  In a multibyte buffer or string, eight-bit characters have
> | character codes higher than #xFF (*note Text Representations::), but
> | Emacs transparently converts them to their single-byte values when you
> | encode or decode such text.
> |
> `----
> 
> This is, I believe, an example of contradictory convention in the manual.

Please be more specific, because I don't see any contradictions here.
Overloaded terminology, maybe, but not contradictions.






* bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct?
  2010-05-27 17:28 bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct? MON KEY
  2010-05-27 18:10 ` Eli Zaretskii
@ 2010-05-31 23:44 ` MON KEY
  2010-06-02 16:06 ` MON KEY
  2 siblings, 0 replies; 19+ messages in thread
From: MON KEY @ 2010-05-31 23:44 UTC (permalink / raw)
  To: 6283

On Mon, May 31, 2010 at 2:51 PM, Eli Zaretskii <eliz@gnu.org> wrote:
> Please be more specific, because I don't see any contradictions here.
> Overloaded terminology, maybe, but not contradictions.

The subject line of the original bugreport is:

"doc/lispref/searching.texi reference to octal code `0377' correct?"

Your position (if I understand correctly) is that the manual's use of
`octal 0NNN' is in keeping with an accepted practice outside of Emacs,
the Emacs manual, and Emacs Lisp reader syntax. You've said:

  "It uses "octal 0377" to present values because octal notation of
  single-byte characters is something many people are familiar with,"

  "It's not an Emacs convention to represent characters by their
  codepoints expressed in octal.  It's a widely accepted practice."

My contention is that w/re the manual's reference to non-ASCII
characters/codes and their non-decimal numeric representations the
manual is internally inconsistent in its application of a `convention'.

Given that this is so (as the excerpts below clearly indicate), my
contention is that the manual should consistently apply the
`#<radix>NNN' notation, as it is what Emacs exposes to the user,
whereas Emacs/Emacs-lisp is unaware of and certainly doesn't expose
`octal 0NNN' notation in any immediate or functionally equivalent
manner to the user.

 ,---- (info "(elisp)Coding Systems")
 |
1 | The result of encoding, and the input to decoding, are not ordinary
2 | text.  They logically consist of a series of byte values; that is, a
3 | series of ASCII and eight-bit characters.  In unibyte buffers and
4 | strings, these characters have codes in the range 0 through #xFF
5 | (255).  In a multibyte buffer or string, eight-bit characters have
6 | character codes higher than #xFF (*note Text Representations::), but
7 | Emacs transparently converts them to their single-byte values when you
8 | encode or decode such text.
 |
 `----

 ,---- (info "(elisp)Regexp Special")
 |
1 | You cannot always match all non-ASCII characters with the regular
2 | expression `"[\200-\377]"'.  This works when searching a unibyte
3 | buffer or string (*note Text Representations::), but not in a
4 | multibyte buffer or string, because many non-ASCII characters have
5 | codes above octal 0377.  However, the regular expression
6 | `"[^\000-\177]"' does match all non-ASCII characters (see below
7 | regarding `^'), in both multibyte and unibyte representations, because
8 | only the ASCII characters are excluded.
 |
 `----

In lines 4 and 6 of the first excerpt alternative non-decimal numeric
references are given in radix notation.

In line 5 of the second example alternative non-decimal numeric
references are given in `octal 0NNN' notation.

Do you not see a contradiction of convention here?

Do you agree there is an intersection of subject scope?

What is the overloaded terminology shared of this intersection?

--
/s_P\






* bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct?
  2010-05-31 18:49             ` Eli Zaretskii
@ 2010-06-01  0:24               ` MON KEY
  2010-06-01 18:38                 ` Eli Zaretskii
  0 siblings, 1 reply; 19+ messages in thread
From: MON KEY @ 2010-06-01  0:24 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 6283

On Mon, May 31, 2010 at 2:49 PM, Eli Zaretskii <eliz@gnu.org> wrote:
>> > In Unicode, it's a codepoint of LATIN SMALL LETTER Y WITH DIAERESIS.
>>
>> I don't understand this.
>
> I don't know how to express this more clearly.  Perhaps you could ask
> specific questions.
>

If you step through the Emacs Lisp example I sent along previously you
may notice that the search doesn't match either of the `ÿ's.

It does however match the character with numeric notations:

 4194303, #o17777777, #x3fffff
 4194221, #o17777655, #x3fffad

E.g. These rawbytes as presented by Emacs as characters:

 (insert-byte (multibyte-char-to-unibyte 4194221) 1)
 (insert-byte (multibyte-char-to-unibyte 4194303) 1)

This is what I don't understand.

If I evaluate the following:

 (progn
   (save-excursion
     (insert-byte (multibyte-char-to-unibyte 4194221) 1)
     (insert-byte (multibyte-char-to-unibyte 4194303) 1))
   (search-forward-regexp "ÿ" nil t))

I don't match.

Whereas if I evaluate:

 (progn
   (save-excursion (insert 10 #o377))
   (search-forward-regexp "ÿ" nil t))

I get a match.

Likewise, if I evaluate

 (progn (save-excursion (insert 10 4194303))
        (search-forward-regexp "\377" nil t))

I get a match.

Which is to say, given the example regexp from the manual, i.e:

,----
| You cannot always match all non-ASCII characters with the regular
| expression `"[\200-\377]"'
`----

I am unable to locate the character: ÿ (255, #o377, #xff) e.g.
LATIN SMALL LETTER Y WITH DIAERESIS

To be clear, my issue isn't that I am not able to match `ÿ' but rather
that I am able to match the raw-byte character representation with a
visual appearance which coincides with the octal value for the `ÿ'
character code i.e. #o377 this being otherwise widely understood as
`octal 0377'.

I hope this is more clear than the previous mail. I apologize if it is not.

--
/s_P\






* bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct?
  2010-06-01  0:24               ` MON KEY
@ 2010-06-01 18:38                 ` Eli Zaretskii
  2010-06-02 19:41                   ` MON KEY
  0 siblings, 1 reply; 19+ messages in thread
From: Eli Zaretskii @ 2010-06-01 18:38 UTC (permalink / raw)
  To: MON KEY; +Cc: 6283

> Date: Mon, 31 May 2010 20:24:00 -0400
> From: MON KEY <monkey@sandpframing.com>
> Cc: 6283@debbugs.gnu.org
> 
> If I evaluate the following:
> 
>  (progn
>    (save-excursion
>      (insert-byte (multibyte-char-to-unibyte 4194221) 1)
>      (insert-byte (multibyte-char-to-unibyte 4194303) 1))
>    (search-forward-regexp "ÿ" nil t))
> 
> I don't match.

Because ÿ is a character, whereas `(multibyte-char-to-unibyte 4194303)'
is a raw byte.  Emacs can distinguish between these two because it
uses a special multibyte representation for raw bytes, which is
different from any other Unicode character.  See this fragment from
the ELisp manual:

     Emacs defines several special character sets.  The character set
  `unicode' includes all the characters whose Emacs code points are in
  the range `0..#x10FFFF'.  The character set `emacs' includes all ASCII
  and non-ASCII characters.  Finally, the `eight-bit' charset includes
  the 8-bit raw bytes; Emacs uses it to represent raw bytes encountered
  in text.

and also this one:

     To support this multitude of characters and scripts, Emacs closely
  follows the "Unicode Standard".  The Unicode Standard assigns a unique
  number, called a "codepoint", to each and every character.  The range
  of codepoints defined by Unicode, or the Unicode "codespace", is
  `0..#x10FFFF' (in hexadecimal notation), inclusive.  Emacs extends this
  range with codepoints in the range `#x110000..#x3FFFFF', which it uses
  for representing characters that are not unified with Unicode and "raw
  8-bit bytes" that cannot be interpreted as characters.  Thus, a
  character codepoint in Emacs is a 22-bit integer number.

> Whereas if I evaluate:
> 
>  (progn
>    (save-excursion (insert 10 #o377))
>    (search-forward-regexp "ÿ" nil t))
> 
> I get a match.

Because `(insert 10 #o377)' inserts LATIN SMALL LETTER Y WITH
DIAERESIS, by design.
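
A minimal sketch of that distinction, assuming a reasonably recent Emacs
(23 or later).  The `demo-' names are hypothetical and exist only for
illustration.  Byte value 255 can appear either as the character ÿ
(codepoint 255) or as a raw byte (codepoint 4194303), and a multibyte
pattern built from the former does not match the latter:

```elisp
;; Hypothetical names, for illustration only.
(defconst demo-char #o377
  "Codepoint of LATIN SMALL LETTER Y WITH DIAERESIS (decimal 255).")

(defconst demo-raw-byte (unibyte-char-to-multibyte #o377)
  "Emacs-internal codepoint of the raw byte \\377.")

demo-raw-byte                               ; => 4194303 (#x3FFFFF)
(multibyte-char-to-unibyte demo-raw-byte)   ; => 255
(char-charset demo-raw-byte)                ; => eight-bit

;; The character matches itself at position 0 ...
(string-match-p (string demo-char) (string demo-char))      ; => 0
;; ... but it does not match the raw byte.
(string-match-p (string demo-char) (string demo-raw-byte))  ; => nil
```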

> Likewise, if I evaluate
> 
>  (progn (save-excursion (insert 10 4194303))
>         (search-forward-regexp "\377" nil t))
> 
> I get a match.
> 
> Which is to say, given the example regexp from the manual, i.e:
> 
> ,----
> | You cannot always match all non-ASCII characters with the regular
> | expression `"[\200-\377]"'
> `----
> 
> I am unable to locate the character: ÿ (255, #o377, #xff) e.g.
> LATIN SMALL LETTER Y WITH DIAERESIS

Sounds like a bug to me --- not in the conventions used by the
manual, but rather in regexp search in Emacs.  Feel free to file a
separate bug about that.

> To be clear, my issue isn't that I am not able to match `ÿ' but rather
> that I am able to match the raw-byte character representation with a
> visual appearance which coincides with the octal value for the `ÿ'
> character code i.e. #o377 this being otherwise widely understood as
> `octal 0377'.
> 
> I hope this is more clear than the previous mail. I apologize if it is not.

I hope my answers make this issue more clear.  (Did I say that use of
raw bytes is complicated and full of subtleties?)







* bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct?
  2010-05-27 17:28 bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct? MON KEY
  2010-05-27 18:10 ` Eli Zaretskii
  2010-05-31 23:44 ` MON KEY
@ 2010-06-02 16:06 ` MON KEY
  2010-06-02 17:30   ` Chong Yidong
  2010-06-02 17:46   ` Eli Zaretskii
  2 siblings, 2 replies; 19+ messages in thread
From: MON KEY @ 2010-06-02 16:06 UTC (permalink / raw)
  To: 6283

On Tue, Jun 1, 2010 at 2:26 PM, Eli Zaretskii <eliz@gnu.org> wrote:
>> Do you not see a contradiction of convention here?
>
> No, I see two different conventions used interchangeably.

Do you recognize that one convention is explicitly recognized by
Emacs/Emacs-elisp whereas the other is not? Do you recognize that each
of these can be readily evaluated from within info by the Emacs lisp
reader and produce an equivalent decimal value which is in keeping
with the context/scope of the presented subject matter:

 #xff => 255
 (identity #xff)  => 255

 #o377            => 255
 (identity #o377) => 255

While the following evaluates to decimal 377 and does not:

 0377                    => 377
 (identity 0377)         => 377
 (identity "octal 0377") => "octal 0377"

Do you see that these two different return values may not be seen as
equivalent by the user?

Do you see that these two different return values may not be seen as
interchangeable by the user?

In either case, do you recognize that while these two separate return
values may be mutually inclusive conventions understood by the
initiated, the user may not have been suitably initiated to have been
made aware of these respective conventions and the mechanics of their
interchangeability?

Can you maybe see how the interchangeable use of these two different
conventions might be confusing to the audience for which the _elisp_
manual was intended (presumably those interested in the conventions of
the Emacs' _elisp_ API where such set of users may not necessarily
represent/reflect the general programming community at large)?

--
/s_P\






* bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct?
  2010-06-02 16:06 ` MON KEY
@ 2010-06-02 17:30   ` Chong Yidong
  2010-06-02 17:46   ` Eli Zaretskii
  1 sibling, 0 replies; 19+ messages in thread
From: Chong Yidong @ 2010-06-02 17:30 UTC (permalink / raw)
  To: MON KEY; +Cc: 6283-done

The manual discussion about how *not* to match non-ascii characters is
silly.  Instead of discussing the bad ways to match non-ascii
characters, I added a note about how to do it properly.

Since the text in question is now deleted, I am closing this bug.






* bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct?
  2010-06-02 16:06 ` MON KEY
  2010-06-02 17:30   ` Chong Yidong
@ 2010-06-02 17:46   ` Eli Zaretskii
  1 sibling, 0 replies; 19+ messages in thread
From: Eli Zaretskii @ 2010-06-02 17:46 UTC (permalink / raw)
  To: MON KEY; +Cc: 6283

> Date: Wed, 2 Jun 2010 12:06:34 -0400
> From: MON KEY <monkey@sandpframing.com>
> Cc: 
> 
> On Tue, Jun 1, 2010 at 2:26 PM, Eli Zaretskii <eliz@gnu.org> wrote:
> >> Do you not see a contradiction of convention here?
> >
> > No, I see two different conventions used interchangeably.
> 
> Do you recognize that one convention is explicitly recognized by
> Emacs/Emacs-elisp whereas the other is not?

Yes, but I don't think that the manual should use only forms that can
be evaluated or read by the Lisp reader.  A manual is intended for
human consumption (except where it shows examples of code), not for
the Lisp reader.

> Can you maybe see how the interchangeable use of these two different
> conventions might be confusing to the audience for which the _elisp_
> manual was intended

No, I don't see how using conventions widely accepted in the
programming world should be confusing to Lisp programmers.  I actually
think that readers who are not too experienced in Emacs Lisp will find
this text easier to understand if it uses conventions they are
familiar with and used to.

Anyway, I think it's time to end this discussion.  It's quite clear we
disagree on this issue, and no repetition of the same arguments will
change that.  Thanks for taking time to make your position clear.






* bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct?
  2010-06-01 18:38                 ` Eli Zaretskii
@ 2010-06-02 19:41                   ` MON KEY
  2010-06-03 14:39                     ` Kevin Rodgers
  0 siblings, 1 reply; 19+ messages in thread
From: MON KEY @ 2010-06-02 19:41 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: 6283

As this bug seems closed I'm replying in reverse for the sake of
brevity w/re others future perusal.

On Tue, Jun 1, 2010 at 2:38 PM, Eli Zaretskii <eliz@gnu.org> wrote:
> I hope my answers make this issue more clear.

Yes, Thank You. I appreciate that you've been generous in sharing time
to help make this distinction more clear.

> (Did I say that use of raw bytes is complicated and full of subtleties?)

Indeed. It is definitely something I've personally had trouble grasping.
Thanks again.

>> I am unable to locate the character: ÿ (255, #o377, #xff) e.g.
>> LATIN SMALL LETTER Y WITH DIAERESIS

> Sounds like a bug to me --- not in the conventions used by the
> manual, but rather in regexp search in Emacs.  Feel free to file a
> separate bug about that.

Given my current trepidations I'm not sure how to characterize the bug
(if any) nor if I am the right person to do so.

Are you able to reproduce this behaviour?

Feel free to reply to the rest of this mail in private should you be
so inclined:

> Because ÿ is a character, whereas `(multibyte-char-to-unibyte 4194303)'
> is a raw byte.

So, would it be reasonable of me to characterize the mechanism of
Emacs regexps as (conceptually) searching over an in memory numeric
representation of character codepoints where a given character has a
numeric value (regardless of the radix notation used to represent it)
which falls within the numerical range of 22-bit numbers represented
by the set of integers encompassed by the return value of (max-char)?

IOW (search-forward-regexp "ÿÿÿ") doesn't match three `ÿ's so much as
it attempts to match against whatever in memory representation Emacs
currently has for the current buffer's character set by moving across
an array of integers (which correspond to the buffer numeric character
values) looking for a particular sequence of integer value(s). That we
aren't matching the character represented by a respective codepoint
but rather the integer value which maps to that character's respective
codepoint according to the current buffer's coding system.

Which is to say in a buffer having the `buffer-file-coding-system'
value utf-8-unix and which contains the characters: "set of ÿÿÿ chars"
the regexp:

 (search-forward-regexp "ÿÿÿ")

is (conceptually) equivalent to searching across this array:

 [115 101 116 32 111 102 32 255 255 255 32 99 104 97 114 115]

for the sequence of consecutive adjacent integers with the value 255.

And, that were this a search for three consecutive raw-byte
characters with the multibyte numeric value 4194303, the regexp:

 (search-forward-regexp "\377\377\377")

is (conceptually) equivalent to searching across this array:

 [115 101 116 32 111 102 32 4194303
  4194303 4194303 32 99 104 97 114 115]

for three consecutive adjacent integers with the value 4194303.
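
The two conceptual arrays above can be reproduced with a short sketch,
assuming a recent Emacs.  The `demo-' names are hypothetical, and the
strings are built with `make-string' so that the example does not depend
on the coding system of the file it is evaluated from:

```elisp
;; Hypothetical names, for illustration only.
(defconst demo-char-text
  (concat "set of " (make-string 3 #xFF) " chars")
  "Text containing three ÿ characters (codepoint 255).")

(defconst demo-byte-text
  (concat "set of " (make-string 3 #x3FFFFF) " chars")
  "Text containing three raw bytes \\377 (codepoint 4194303).")

;; `vconcat' exposes the character codes of each string.
(vconcat demo-char-text)
;; => [115 101 116 32 111 102 32 255 255 255 32 99 104 97 114 115]
(vconcat demo-byte-text)
;; => [115 101 116 32 111 102 32 4194303 4194303 4194303 32 99 104 97 114 115]

;; A pattern of three ÿ characters finds them at index 7 in the first
;; string and finds nothing in the second.
(string-match-p (make-string 3 #xFF) demo-char-text)  ; => 7
(string-match-p (make-string 3 #xFF) demo-byte-text)  ; => nil
```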

With this latter integer (4194303), it so happens, being the decimal
value representing the uppermost of Emacs' internal `codespace'.
Where this `codespace' is understood as the range of the set of
characters which may be represented by the positive numerical range of
the 22-bit number corresponding to the integer return value of
`max-char', e.g.:

 (max-char) => 4194303 (#o17777777, #x3fffff)

Such that `max-char's numerical value (and lesser positive values
thereof) may be presented to the Emacs Lisp reader in various ways
including -- and in addition to decimal (base 10) notation -- those
integer values represented with the reader syntax:

  #<radix>N and #<R>rN

in any of several radixes, including 10, 8, 16, and 2, as follows:

 decimal value     4194303    or #10r4194303

 octal value       #o17777777 or #8r17777777

 hexadecimal value #x3fffff   or #16r3fffff

 binary value      #b01111111111111111111111
                or #2r01111111111111111111111

Where this particular numeric value is more widely understood as:
raw-byte 255

This `raw-byte' being understood more generally as the uppermost in the
so called `octal range': 0200-0377

With the `octal range' being otherwise represented within the Emacs
codespace at its upper bounds as the final range of 128 numeric
character values beginning from the code offset 4194176
(inclusive). Such that the range of raw-bytes 128-255 begins with
the codespace's integer value 4194176 and extends to 4194303 e.g.:

 (cons 4194176 (+ 4194176 (- 255 128)))

And may more generally be represented in Emacs as:

numeric code-point range:  0x80 - 0xFF

decimal range:             4194176 - 4194303

octal range:               #o17777600 - #o17777777

hexadecimal range:         #x3fff80   - #x3fffff

binary range:              #b01111111111111110000000 - #b01111111111111111111111
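
As a rough check of those ranges, the sketch below maps each raw byte in
128..255 to its Emacs codepoint by adding a fixed offset and compares the
result with what `unibyte-char-to-multibyte' returns.  The function name
`demo-raw-byte-codepoint' and the offset constant are assumptions made
for illustration, derived from the figures quoted in this thread:

```elisp
;; Hypothetical helper, for illustration only.
(defun demo-raw-byte-codepoint (byte)
  "Return the Emacs codepoint assumed for raw BYTE (128..255)."
  (+ #x3FFF00 byte))

(demo-raw-byte-codepoint 128)   ; => 4194176
(demo-raw-byte-codepoint 255)   ; => 4194303, the value of (max-char)

;; The range holds 128 codepoints, both ends included.
(1+ (- 4194303 4194176))        ; => 128

;; The same bounds printed in octal and hexadecimal.
(format "#o%o - #o%o" 4194176 4194303)  ; => "#o17777600 - #o17777777"
(format "#x%x - #x%x" 4194176 4194303)  ; => "#x3fff80 - #x3fffff"
```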

--
/s_P\






* bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct?
  2010-06-02 19:41                   ` MON KEY
@ 2010-06-03 14:39                     ` Kevin Rodgers
  0 siblings, 0 replies; 19+ messages in thread
From: Kevin Rodgers @ 2010-06-03 14:39 UTC (permalink / raw)
  To: bug-gnu-emacs

MON KEY wrote:
 >> Because ÿ is a character, whereas `(multibyte-char-to-unibyte 4194303)'
 >> is a raw byte.
 >
 > So, would it be reasonable of me to characterize the mechanism of
 > Emacs regexps as (conceptually) searching over an in memory numeric
 > representation of character codepoints where a given character has a
 > numeric value (regardless of the radix notation used to represent it)
 > which falls within the numerical range of 22-bit numbers represented
 > by the set of integers encompassed by the return value of (max-char)?

Sure.  But it doesn't make sense to me to even consider "the radix notation
used to represent it".  Characters are read, usually from buffers (including
the minibuffer), and the notation is only relevant with respect to the
buffer or keyboard coding system because each character is exactly that:
a character, represented internally as an integer.

 > IOW (search-forward-regexp "ÿÿÿ") doesn't match three `ÿ's so much as
 > it attempts to match against whatever in memory representation Emacs
 > currently has for the current buffer's character set by moving across
 > an array of integers (which correspond to the buffer numeric character
 > values) looking for a particular sequence of integer value(s). That we
 > aren't matching the character represented by a respective codepoint
 > but rather the integer value which maps to that character's respective
 > codepoint according to the current buffer's coding system.

Why does the distinction between the codepoint and the representation matter,
since there is a 1:1 relationship between them?

I think that character sets and coding systems are irrelevant at this point:
the coding system was used to convert the text to the internal representation
when it was read into memory.  The only character set that matters is Unicode,
the only codepoints that matter are Unicode and Emacs' internal representation.

I just verified that like this: Unicode has the same codepoint →
character mappings as ASCII and ISO-8859-1, but ISO-8859-2 has different
characters than Unicode at some codepoints.  For example, codepoint xA1
aka o241 aka 161 is INVERTED EXCLAMATION MARK in Unicode but LATIN
CAPITAL LETTER A WITH OGONEK in ISO-8859-2.

If I have a UTF-8 buffer and an ISO-8859-2 buffer, `M-: (ucs-insert
#x0104)' inserts the same character into both, as expected: LATIN CAPITAL
LETTER A WITH OGONEK.  The only difference in the output from `C-u C-x
=' are the file codes -- the internal buffer codes are the same.

I thought that perhaps C-q 241 would insert different characters into the
buffers, since their coding systems assign different characters to that
codepoint, but they don't: in both cases, it is INVERTED EXCLAMATION
MARK.

So it seems that Unicode is used regardless of buffer-coding-system.  Even
`C-x RET c iso-8859-2 RET C-q 241' inserts INVERTED EXCLAMATION MARK, not
LATIN CAPITAL LETTER A WITH OGONEK.

Perhaps someone can explain how to insert a character using its numeric
codepoint in a specific character set?
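
One possible approach, offered as a sketch rather than a definitive
answer, is `decode-char', which converts a code point in a named charset
to the corresponding Emacs character.  This assumes the `iso-8859-1' and
`iso-8859-2' charsets are defined, as they are in a standard build:

```elisp
;; The same code point, interpreted in two different charsets.
(decode-char 'iso-8859-1 #xA1)  ; => 161 (#xA1), INVERTED EXCLAMATION MARK
(decode-char 'iso-8859-2 #xA1)  ; => 260 (#x104), LATIN CAPITAL LETTER A WITH OGONEK

;; Inserting the ISO-8859-2 interpretation into a buffer.
(with-temp-buffer
  (insert (decode-char 'iso-8859-2 #xA1))
  (char-before))                ; => 260
```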

-- 
Kevin Rodgers
Denver, Colorado, USA







