From: MON KEY
Newsgroups: gmane.emacs.bugs
Subject: bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct?
Date: Wed, 2 Jun 2010 15:41:38 -0400
References: <83vda9md09.fsf@gnu.org> <83sk5cmr8k.fsf@gnu.org> <83sk5btdcu.fsf@gnu.org> <836323ucry.fsf@gnu.org> <831vcqtx72.fsf@gnu.org>
In-Reply-To: <831vcqtx72.fsf@gnu.org>
To: Eli Zaretskii
Cc: 6283@debbugs.gnu.org
Xref: news.gmane.org
 gmane.emacs.bugs:37490

As this bug seems closed, I'm replying in reverse order for the sake of
brevity, with regard to others' future perusal.

On Tue, Jun 1, 2010 at 2:38 PM, Eli Zaretskii wrote:
> I hope my answers make this issue more clear.

Yes, thank you.
I appreciate that you've been generous in sharing time to help make
this distinction more clear.

> (Did I say that use of raw bytes is complicated and full of subtleties?)

Indeed. It is definitely something I've personally had trouble grasping.
Thanks again.

>> I am unable to locate the character: ÿ (255, #o377, #xff) e.g.
>> LATIN SMALL LETTER Y WITH DIAERESIS

> Sounds like a bug to me --- not in the conventions used by the
> manual, but rather in regexp search in Emacs. Feel free to file a
> separate bug about that.

Given my current trepidations I'm not sure how to characterize the bug
(if any), nor whether I am the right person to do so. Are you able to
reproduce this behaviour?

Feel free to reply to the rest of this mail in private should you be
so inclined:

> Because ÿ is a character, whereas `(multibyte-char-to-unibyte 4194303)'
> is a raw byte.

So, would it be reasonable of me to characterize the mechanism of
Emacs regexps as (conceptually) searching over an in-memory numeric
representation of character codepoints, where a given character has a
numeric value (regardless of the radix notation used to represent it)
which falls within the range of 22-bit integers from 0 up to the
return value of (max-char)?

IOW (search-forward-regexp "ÿÿÿ") doesn't match three `ÿ's so much as
it attempts to match against whatever in-memory representation Emacs
currently has for the current buffer's character set, by moving across
an array of integers (which correspond to the buffer's numeric
character values) looking for a particular sequence of integer
value(s).
That is, we aren't matching the character represented by a given
codepoint, but rather the integer value which maps to that character's
codepoint according to the current buffer's coding system.

Which is to say, in a buffer having the `buffer-file-coding-system'
value utf-8-unix and which contains the characters:

 "set of ÿÿÿ chars"

the regexp:

 (search-forward-regexp "ÿÿÿ")

is (conceptually) equivalent to searching across this array:

 [115 101 116 32 111 102 32 255 255 255 32 99 104 97 114 115]

for three consecutive adjacent integers with the value 255.

And, were this a search for three consecutive raw-byte characters with
the multibyte numeric value 4194303, the regexp:

 (search-forward-regexp "\377\377\377")

is (conceptually) equivalent to searching across this array:

 [115 101 116 32 111 102 32 4194303 4194303 4194303 32 99 104 97 114 115]

for three consecutive adjacent integers with the value 4194303.

This latter integer (4194303), it so happens, is the decimal value
representing the uppermost of Emacs' internal `codespace'.
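
The two arrays above can be reproduced with a minimal sketch, assuming
Emacs 23 or later, where `string-to-multibyte' maps each 8-bit byte of
a unibyte string to the corresponding raw-byte character. The helper
name `sketch-codes' is hypothetical, and the sketch only shows how the
strings look as integer vectors; it makes no claim about how the
regexp engine works internally:

```elisp
;; Sketch only; `sketch-codes' is a hypothetical helper name.
(defun sketch-codes (&rest strings)
  "Return the character codes of STRINGS as a single vector."
  (apply #'vconcat strings))

;; `make-string' with a non-ASCII character yields a multibyte string,
;; so this inserts three copies of the character with code 255.
(sketch-codes "set of " (make-string 3 255) " chars")
;; => [115 101 116 32 111 102 32 255 255 255 32 99 104 97 114 115]

;; "\377\377\377" is read as a unibyte string of three bytes;
;; `string-to-multibyte' turns each byte >= #x80 into a raw-byte character.
(sketch-codes (string-to-multibyte "\377\377\377"))
;; => [4194303 4194303 4194303]

;; The character and the raw byte have different codes,
;; even though both correspond to the byte value 255.
(= 255 (unibyte-char-to-multibyte 255))   ; => nil
(multibyte-char-to-unibyte 4194303)       ; => 255
```

Note that the raw-byte string is converted explicitly here, so the
result does not depend on how `insert' treats unibyte text in a
multibyte buffer.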
This `codespace' is understood as the set of characters which may be
represented by the non-negative 22-bit integers up to the return value
of `max-char', e.g.:

 (max-char)
 => 4194303 (#o17777777, #x3fffff)

Such that `max-char's numerical value (and lesser positive values
thereof) may be presented to the Emacs Lisp reader in various ways
including, in addition to plain decimal (base 10) notation, the reader
syntaxes #b, #o, #x, and #NrDIGITS, in any radix N including 10, 8,
16, and 2, as follows:

 decimal value      4194303 or #10r4194303
 octal value        #o17777777 or #8r17777777
 hexadecimal value  #x3fffff or #16r3fffff
 binary value       #b01111111111111111111111 or #2r01111111111111111111111

Where this particular numeric value is more widely understood as:

 raw-byte 255

This `raw-byte' being understood more generally as the uppermost in
the so-called `octal range':

 0200-0377

With the `octal range' being otherwise represented within the Emacs
codespace at its upper bounds as the final range of 128 numeric
character values beginning from the code offset 4194176 (inclusive).

Such that the range of raw bytes 128-255 begins with the codespace's
integer value 4194176 and extends to 4194303, e.g.:

 (cons 4194176 (+ 4194176 (- 255 128)))
 => (4194176 . 4194303)

And may more generally be represented in Emacs as:

 raw-byte value range:  0x80 - 0xFF
 decimal range:         4194176 - 4194303
 octal range:           #o17777600 - #o17777777
 hexadecimal range:     #x3fff80 - #x3fffff
 binary range:          #b01111111111111110000000 - #b01111111111111111111111

--
/s_P\
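
For reference, the radix notations and the raw-byte range discussed
above can be checked with a short sketch. It assumes Emacs 23 or
later, where `unibyte-char-to-multibyte' maps bytes #x80..#xFF to
raw-byte characters; the function names `sketch-max-char-spellings'
and `sketch-raw-byte-range' are hypothetical:

```elisp
;; Sketch only; both function names are hypothetical.
(defun sketch-max-char-spellings ()
  "Return the same integer read from several radix notations."
  (list 4194303 #10r4194303
        #o17777777 #8r17777777
        #x3fffff #16r3fffff
        #b01111111111111111111111 #2r01111111111111111111111))

(defun sketch-raw-byte-range ()
  "Return (FIRST LAST COUNT) for the raw-byte character codes."
  (let ((lo (unibyte-char-to-multibyte #x80))
        (hi (unibyte-char-to-multibyte #xFF)))
    (list lo hi (1+ (- hi lo)))))

;; Every spelling reads as the value returned by `max-char'.
(delete-dups (sketch-max-char-spellings))   ; => (4194303)
(= (max-char) 4194303)                      ; => t

;; Raw bytes #x80..#xFF occupy the top 128 codes of the codespace.
(sketch-raw-byte-range)                     ; => (4194176 4194303 128)
```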