From: Eli Zaretskii
Newsgroups: gmane.emacs.bugs
Subject: bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct?
Date: Tue, 01 Jun 2010 21:38:41 +0300
Message-ID: <831vcqtx72.fsf@gnu.org>
References: <83vda9md09.fsf@gnu.org> <83sk5cmr8k.fsf@gnu.org> <83sk5btdcu.fsf@gnu.org> <836323ucry.fsf@gnu.org>
To: MON KEY
Cc: 6283@debbugs.gnu.org

> Date: Mon, 31 May 2010 20:24:00 -0400
> From: MON KEY
> Cc: 6283@debbugs.gnu.org
>
> If I evaluate the following:
>
> (progn
>   (save-excursion
>     (insert-byte (multibyte-char-to-unibyte 4194221) 1)
>     (insert-byte (multibyte-char-to-unibyte 4194303) 1))
>   (search-forward-regexp "ÿ" nil t))
>
> I don't match.

Because ÿ is a character, whereas `(multibyte-char-to-unibyte 4194303)'
is a raw byte.  Emacs can distinguish between these two because it
uses a special multibyte representation for raw bytes, which is
different from the representation of any Unicode character.  See this
fragment from the ELisp manual:

     Emacs defines several special character sets.  The character set
     `unicode' includes all the characters whose Emacs code points are in
     the range `0..#x10FFFF'.  The character set `emacs' includes all ASCII
     and non-ASCII characters.  Finally, the `eight-bit' charset includes
     the 8-bit raw bytes; Emacs uses it to represent raw bytes encountered
     in text.

and also this one:

     To support this multitude of characters and scripts, Emacs closely
     follows the "Unicode Standard".  The Unicode Standard assigns a unique
     number, called a "codepoint", to each and every character.  The range
     of codepoints defined by Unicode, or the Unicode "codespace", is
     `0..#x10FFFF' (in hexadecimal notation), inclusive.  Emacs extends this
     range with codepoints in the range `#x110000..#x3FFFFF', which it uses
     for representing characters that are not unified with Unicode and "raw
     8-bit bytes" that cannot be interpreted as characters.  Thus, a
     character codepoint in Emacs is a 22-bit integer number.

> Whereas if I evaluate:
>
> (progn
>   (save-excursion (insert 10 #o377))
>   (search-forward-regexp "ÿ" nil t))
>
> I get a match.

Because `(insert 10 #o377)' inserts LATIN SMALL LETTER Y WITH
DIAERESIS, by design.
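
As a rough illustration of the distinction described above, the
following sketch puts the two codes side by side.  It assumes Emacs 23
or later, and the function name `my-raw-byte-vs-char' is made up for
this example; it is not part of Emacs.  Going by the manual excerpts
quoted above, the second element of the result should name the
`eight-bit' charset.

```elisp
;; Sketch only; `my-raw-byte-vs-char' is a hypothetical name.
;; Assumes Emacs 23 or later, where raw bytes have their own code points.
(defun my-raw-byte-vs-char ()
  "Return a list comparing raw byte #xFF with the character U+00FF."
  (let ((raw 4194303)   ; #x3FFFFF, the code used in the first quoted example
        (ch  #o377))    ; 255, LATIN SMALL LETTER Y WITH DIAERESIS
    (list (= raw ch)                         ; are the two codes the same?
          (char-charset raw)                 ; charset Emacs assigns to RAW
          (multibyte-char-to-unibyte raw)))) ; byte value RAW stands for

(my-raw-byte-vs-char)
```

Evaluating the last form, e.g. in `*scratch*' or with `M-:', shows that
the byte value and the character code are both 255 even though the two
objects have different internal codes.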

> Likewise, if I evaluate
>
> (progn (save-excursion (insert 10 4194303))
>        (search-forward-regexp "\377" nil t))
>
> I get a match.
>
> Which is to say, given the example regexp from the manual, i.e:
>
> ,----
> | You cannot always match all non-ASCII characters with the regular
> | expression `"[\200-\377]"'
> `----
>
> I am unable to locate the character: ÿ (255, #o377, #xff) e.g.
> LATIN SMALL LETTER Y WITH DIAERESIS

Sounds like a bug to me --- not in the conventions used by the manual,
but rather in regexp search in Emacs.  Feel free to file a separate
bug about that.

> To be clear, my issue isn't that I am not able to match `ÿ' but rather
> that I am able to match the raw-byte character representation with a
> visual appearance which coincides with the octal value for the `ÿ'
> character code i.e. #o377 this being otherwise widely understood as
> `octal 0377'.
>
> I hope this is more clear than the previous mail. I apologize if it is not.

I hope my answers make this issue more clear.  (Did I say that use of
raw bytes is complicated and full of subtleties?)