From mboxrd@z Thu Jan 1 00:00:00 1970
From: Kevin Rodgers
Newsgroups: gmane.emacs.bugs
Subject: bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct?
Date: Thu, 03 Jun 2010 08:39:44 -0600
References: <83vda9md09.fsf@gnu.org> <83sk5cmr8k.fsf@gnu.org> <83sk5btdcu.fsf@gnu.org> <836323ucry.fsf@gnu.org> <831vcqtx72.fsf@gnu.org>
To: bug-gnu-emacs@gnu.org
Content-Type: text/plain; charset=UTF-8; format=flowed

MON KEY wrote:
>> Because ÿ is a character, whereas `(multibyte-char-to-unibyte 4194303)'
>> is a raw byte.
>
> So, would it be reasonable of me to characterize the mechanism of
> Emacs regexps as (conceptually) searching over an in memory numeric
> representation of character codepoints where a given character has a
> numeric value (regardless of the radix notation used to represent it)
> which falls within the numerical range of 22-bit numbers represented
> by the set of integers encompassed by the return value of (max-char)?

Sure.  But it doesn't make sense to me to even consider "the radix
notation used to represent it".
Characters are read, usually from buffers (including the minibuffer).
The notation is only relevant with respect to the buffer or keyboard
coding system, because each character is exactly that: a character,
represented internally as an integer.

> IOW (search-forward-regexp "ÿÿÿ") doesn't match three `ÿ's so much as
> it attempts to match against whatever in memory representation Emacs
> currently has for the current buffer's character set by moving across
> an array of integers (which correspond to the buffer numeric character
> values) looking for a particular sequence of integer value(s). That we
> aren't matching the character represented by a respective codepoint
> but rather the integer value which maps to that character's respective
> codepoint according to the current buffer's coding system.

Why does the distinction between the codepoint and the representation
matter, since there is a 1:1 relationship between them?

I think that character sets and coding systems are irrelevant at this
point: the coding system was used to convert the text to the internal
representation when it was read into memory.  The only character set
that matters is Unicode, and the only codepoints that matter are those
of Unicode and of Emacs' internal representation.

I just verified that like this:

Unicode has the same codepoint → character mappings as ASCII and
ISO-8859-1, but ISO-8859-2 has different characters than Unicode at
some codepoints.  For example, codepoint xA1 (aka o241, aka 161) is
INVERTED EXCLAMATION MARK in Unicode but LATIN CAPITAL LETTER A WITH
OGONEK in ISO-8859-2.

If I have a UTF-8 buffer and an ISO-8859-2 buffer,
`M-: (ucs-insert #x0104)' inserts the same character into both, as
expected: LATIN CAPITAL LETTER A WITH OGONEK.  The only difference in
the output from `C-u C-x =' is the file code -- the internal buffer
codes are the same.
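
The same observation can be sketched in Lisp.  This is only an
illustration: `my-file-code' is a hypothetical helper, not an Emacs
function, and it assumes the `iso-8859-2' and `utf-8' coding systems
are available, as they should be in a stock Emacs 23.

```elisp
;; Radix is only notation: all three literals read as the same integer.
(and (= #xA1 #o241) (= #o241 161))

;; Hypothetical helper, not part of Emacs: return the list of bytes that
;; CODING-SYSTEM would write to a file for CHAR.  CHAR is an Emacs
;; character, i.e. an integer.
(defun my-file-code (char coding-system)
  (append (encode-coding-string (string char) coding-system) nil))

;; The same internal character, #x104 (LATIN CAPITAL LETTER A WITH
;; OGONEK), gets a different file code under each coding system.
(my-file-code #x104 'iso-8859-2)        ; one byte:  #xA1
(my-file-code #x104 'utf-8)             ; two bytes: #xC4 #x84
```

The first result corresponds to the file code that `C-u C-x =' reports
in the ISO-8859-2 buffer, the second to the one in the UTF-8 buffer; the
character itself is #x104 in both.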

I thought that perhaps C-q 241 would insert different characters into
the buffers, since their coding systems assign different characters to
that codepoint, but it doesn't: in both cases, it is INVERTED
EXCLAMATION MARK.  So it seems that Unicode is used regardless of
buffer-file-coding-system.  Even `C-x RET c iso-8859-2 RET C-q 241'
inserts INVERTED EXCLAMATION MARK, not LATIN CAPITAL LETTER A WITH
OGONEK.

Perhaps someone can explain how to insert a character using its numeric
codepoint in a specific character set?

--
Kevin Rodgers
Denver, Colorado, USA