From: "Drew Adams"
To: "'Chong Yidong'"
Cc: 12054@debbugs.gnu.org
Newsgroups: gmane.emacs.bugs
Subject: bug#12054: 24.1; regression? font-lock no-break-space with nil nobreak-char-display
Date: Sat, 3 Nov 2012 12:01:29 -0700
Message-ID: <0B444DBDD1D14FD7B5EDE10E30ED320D@us.oracle.com>
References: <87mwyzyn76.fsf@gnu.org> <45DEAA69BC6E4630BA8DA0B07A0ECE92@us.oracle.com> <87lieimx9n.fsf@gnu.org>
In-reply-to: <87lieimx9n.fsf@gnu.org>

> > Just why is it that the regexp "[\240]+" does not match this char?
> > Why should a character-alternative expression care whether the
> > representation is unibyte or multibyte? Isn't that a bug?
>
> When \240 occurs in a unibyte string, Emacs recognizes it as an
> eight-bit raw byte. When converting unibyte strings to
> multibyte, Emacs does not "unify" eight-bit raw bytes with
> Unicode characters #x80-#xff; they get their own code points,
> in this case #x3fffa0.

I think I understand this (but I might be misunderstanding). The \240
in the 4-char ASCII regexp string "\240" is interpreted (read?) as a raw
byte, not as the char I wanted. That is, the literal string in my code
is read as a string that contains only a single raw byte of octal 240 in
place of the 4 chars \240 (and instead of as a string with the multibyte
char no-break space). Is that right?

And putting that together with Eli's statement about insertion
("'insert' treats strings such as "\nnn" as unibyte strings"), I
understand that the buffer text after I type `C-q 240' contains a
unibyte raw byte, and not the multibyte char no-break space.

But in that case I do not understand why `C-u C-x =' says that it _is_
the Unicode no-break space char.
And I do not understand why Yidong's font-lock correction also shows
that it is a no-break space char.

So I'm confused about what is actually in the buffer. From the doc and
from Eli's statement, I gather that there is a unibyte raw byte (octal
240) at that position. But `C-u C-x =' and font-lock seem to tell me
that there is a (multibyte) no-break space char there.

If there is in fact a multibyte char there and the literal "\240" in my
font-lock sexp results in a unibyte raw byte search, that would explain
the mismatch.

But I still wonder about this motivation for the treatment of \nnn in
literal strings in Lisp code:

> (One reason for doing this is to allow unibyte strings to
> be specified using string constants in Emacs Lisp source code.)

I can see how that can be useful. But I can also see how it would be
useful to have some way of using octal syntax to match multibyte chars.
Isn't there some reasonable way to allow for both? E.g. can I specify a
multibyte string somehow, starting with octal syntax?

Is there a way, for example, to use octal syntax to provide octal codes
0302 and 0240, which together define U+00A0 for UTF-8? [See below.]

Is there, for example, (or could there be added) a function that one can
apply to the unibyte string for \240 that would convert it to a string
that DTRT wrt multibyte? So I could do something like this (assuming
the function is available for older Emacs versions too), where `foo' is
the function:

  (font-lock-add-keywords
   nil `((,(foo "\240+") (0 'foo t))) 'APPEND)

From the doc, I was thinking that perhaps `string-to-multibyte' would do
the trick, i.e., (string-to-multibyte "\240+") would return "\u00a0+" or
the literal Unicode char in a multibyte string. But it returns "\240+".

I can understand that the actual chars in that input string are all
ASCII, so that makes sense, I guess. But I was thinking from Yidong's
statement above that such a literal string in Lisp code gets read as a
unibyte, raw-byte string.
Since that doesn't seem to be the case here (?), is there a function
that will convert "\240" (4 chars) to a string with just that one
"eight-bit raw byte" char? I tried `read', but that didn't help.

I hope I'm just missing something, and that there is a function (or
combination of functions) to which I can pass the 4-char ASCII string
"\240" (or the 8-char string "\302\240") and that will return the proper
multibyte string containing the Unicode no-break space char. Ideal
would be such a function that works also in older Emacs versions.

... OK, digging some more, it seems that this will do the trick:

  (decode-coding-string "\302\240" 'utf-8)

That allows use of only octal syntax - good. But it still doesn't solve
the problem for older Emacs versions - they raise the error
(coding-system-error utf-8).

Is there a way to use only octal syntax with older Emacs versions, so
the font-locking code highlights such a Unicode char in a file/buffer?

Judging by my current confusion, I am sure that my statements above must
be full of misconceptions. I will be glad to be shown my
misunderstanding and a simple solution.
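
For reference, a minimal sketch of the `decode-coding-string' approach mentioned above, together with a few checks on what the octal literals contain after the Lisp reader has processed them. It assumes Emacs 23 or later, where raw bytes and Unicode characters have separate code points as described in the quoted reply. The names `my-nbsp-regexp' and `my-highlight-nbsp' are hypothetical, and the face `foo' is only the placeholder from the message. Returning nil when `utf-8' is not a known coding system is an illustrative choice and does not solve the older-Emacs question.

```elisp
;; Sketch only; assumes Emacs 23 or later.

(defun my-nbsp-regexp ()
  "Return a regexp matching one or more no-break spaces, or nil.
Hypothetical helper.  The regexp is built from octal escapes only:
the UTF-8 byte pair for U+00A0 is decoded into a multibyte string.
Return nil if `utf-8' is not a known coding system."
  (when (coding-system-p 'utf-8)
    (concat (decode-coding-string "\302\240" 'utf-8) "+")))

(defun my-highlight-nbsp ()
  "Highlight no-break spaces in the current buffer with face `foo'.
Hypothetical helper; `foo' is the placeholder face from the message."
  (let ((re (my-nbsp-regexp)))
    (when re
      (font-lock-add-keywords nil `((,re (0 'foo t))) 'APPEND))))

;; What the reader produces for the octal literal: one raw byte in a
;; unibyte string, not four ASCII characters.
(length "\240")                                      ; => 1
(multibyte-string-p "\240")                          ; => nil

;; `string-to-multibyte' keeps the raw byte as a raw byte, so the
;; result prints as "\240" again even though it is now multibyte.
(aref (string-to-multibyte "\240") 0)                ; => #x3fffa0

;; Decoding the UTF-8 bytes yields the Unicode character instead.
(aref (decode-coding-string "\302\240" 'utf-8) 0)    ; => #xa0
```

The same regexp could be written as "\u00a0+" where that escape is available; the decoding route is shown because it uses only octal syntax, which is what the message asks for.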