From: "Eduardo Ochs"
To: emacs-pretest-bug@gnu.org
Subject: bug#1215: 23.0.60; unibyte->multibyte conversion problem (in search-forward and friends)
Date: Tue, 21 Oct 2008 12:00:58 -0400

Hello,

this may not be exactly a bug, I'm just struggling with an obscure
part of Emacs... anyway, I did my best to make this look like a nice
bug report, and to make the tests clear enough to help other people
who also find unibyte<->multibyte conversions obscure...

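For readers less familiar with the terminology, here is a minimal
sketch of the three kinds of strings that show up in the tests. It
assumes Emacs 23 or later, and `my-string-kinds' is a hypothetical
helper name used only for illustration, not something from Emacs or
from the test files.

```elisp
;; Sketch only; assumes Emacs 23 or later.
;; `my-string-kinds' is a hypothetical name, used just for this example.
(defun my-string-kinds ()
  "Return a list describing three ways to spell an anchor string."
  (let* (;; ASCII plus octal escapes: read as a unibyte string.
         (raw    "\253foo\273")
         ;; Same bytes as a multibyte string; each byte >= 128 becomes
         ;; a "raw byte" character, not a Latin-1 character.
         (raw-mb (string-to-multibyte raw))
         ;; Built from character codes: real guillemet characters.
         (guil   (string 171 ?f ?o ?o 187)))
    (list (multibyte-string-p raw)
          (multibyte-string-p raw-mb)
          (multibyte-string-p guil)
          (aref raw 0)
          (aref raw-mb 0)
          (aref guil 0)
          (equal raw-mb guil))))

(my-string-kinds)
;; ==> (nil t t 171 4194219 171 nil)
```

The point of the sketch is only that a raw byte (4194219, i.e.
#x3fffab) and the guillemet character (171) are different characters
once a string is multibyte, even though both come from the byte #xAB.
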
The short story
===============

Let me refer to strings like "<<foo>>" - where the "<<" and ">>" stand
for guillemets, i.e., the characters that we type with `C-x 8 <' and
`C-x 8 >' - as "anchors". So: if I produce an anchor string in a
unibyte buffer and then I search for an occurrence of that string in
a multibyte buffer, the search fails.

The two small blocks below illustrate this. Instructions: save the
first one to "/tmp/1.txt", the second one to "/tmp/2.txt", and then
run:

  (load-file "/tmp/1.txt")

It will show "uni" in the "*Messages*" buffer, and the search will
fail. The detailed message about the failure of the search will be
like this:

  progn: Search failed: "\302\253foo\302\273"

meaning the anchor string has been incorrectly converted.

;;--------snip,snip--------
;; -*- coding: raw-text-unix -*-
;; (save-this-block-as "/tmp/1.txt")
(progn
  (find-file "/tmp/2.txt")
  (goto-char (point-min))
  (setq anchorstr "«foo»")
  (message (if (multibyte-string-p anchorstr) "multi" "uni"))
  (search-forward anchorstr))
;;--------snip,snip--------

;;--------snip,snip--------
;; -*- coding: latin-1 -*-
;; (save-this-block-as "/tmp/2.txt")
(search-forward "«foo»")
;; «foo»
;;--------snip,snip--------


The long story
==============

Save the block below as "/tmp/3.txt" and follow the instructions in
it. Note that it doesn't have any non-ascii characters - the anchors
are produced by running the "(insert ...)" sexps.

;;--------snip,snip--------
;; -*- coding: latin-1 -*-
;; (save-this-block-as "/tmp/3.txt")
;; Run the "progn" below with C-x C-e.
;; It will create a line like this:
;;   <<anchor>>\253anchor\273\253anchor\273\253anchor\273
;; (but the "<<", ">>", "\253", "\273" are single characters).
;; Don't delete that line, it will be used later.
;;
(progn
  (defun mmb (str) (string-make-multibyte str))
  (defun mub (str) (string-make-unibyte str))
  (insert 171 "anchor" 187)
  (insert "\253anchor\273")
  (insert (mub "\253anchor\273"))
  (insert (mmb (mub "\253anchor\273")))
  )

;; Now try to save this file.
;; Emacs will complain about the "\253"s and "\273"s - it will
;; say that iso-latin-1-unix and utf-8-unix cannot encode them.
;; The "<<" and ">>" are ok, though...
;;
;; So: leave the "<<anchor>>" above, delete the "\253anchor\273"s,
;; save this file, and reload it. DON'T SKIP THIS STEP - the
;; charset properties mentioned below behave differently before
;; and after reloads, and I don't know exactly the mechanics of
;; this... 8-\
;;
;; If we inspect the "<<", ">>", "\253", "\273" with `C-x ='
;; we see this:
;;   Char: << (171, #o253, #xab, file #xAB)
;;   Char: >> (187, #o273, #xbb, file #xBB)
;;   Char: \253 (4194219, #o17777653, #x3fffab, raw-byte)
;;   Char: \273 (4194235, #o17777673, #x3fffbb, raw-byte)
;;
;; Now mark the "<<anchor>>" above and copy it to the top of
;; the kill ring with `M-w'. Let's examine the results of
;; several obvious ways to (re)create the "<<anchor>>"
;; above as a string...
;; Here are some of the results:
;;
;;   "\253anchor\273"              ==> "<<anchor>>"
;;   (mub "\253anchor\273")        ==> "<<anchor>>"
;;   (mmb (mub "\253anchor\273"))  ==> "\253anchor\273"
;;   (car kill-ring)               ==>
;;     #("<<anchor>>" 0 8 (charset iso-8859-1))
;;   (mub (car kill-ring))         ==> "<<anchor>>"
;;   (mmb (mub (car kill-ring)))   ==> "\253anchor\273"

"\253anchor\273"
(mub "\253anchor\273")
(mmb (mub "\253anchor\273"))
(mub (mmb (mub "\253anchor\273")))
(mapcar 'identity "\253anchor\273")
(mapcar 'identity (mub "\253anchor\273"))
(mapcar 'identity (mmb (mub "\253anchor\273")))

(car kill-ring)
(mub (car kill-ring))
(mmb (mub (car kill-ring)))
(mapcar 'identity (car kill-ring))
(mapcar 'identity (mub (car kill-ring)))
(mapcar 'identity (mmb (mub (car kill-ring))))

;; This is the weird part.
;; Let's insert another "<<anchor>>"/"\253anchor\273" pair, and
;; let's try to jump to its "anchors" with `search-backward'.

(insert 171 "anchor" 187 "\n\253anchor\273")

(search-backward "\253anchor\273")
(search-backward (mub "\253anchor\273"))
(search-backward (mmb (mub "\253anchor\273")))
(search-backward (car kill-ring))
(search-backward (mub (car kill-ring)))
(search-backward (mmb (mub (car kill-ring))))

;; Only "(search-backward (car kill-ring))" jumps to
;; "<<anchor>>" - all the others jump to "\253anchor\273".
;; The trick - aha! - is that "(car kill-ring)" holds this
;; string,
;;
;;   (car kill-ring) ==>
;;     #("<<anchor>>" 0 8 (charset iso-8859-1))
;;
;; and the "(charset iso-8859-1)" property is essential...
;;--------snip,snip--------


What is the standard way to convert unibyte strings (for example
anchor strings, generated from code in raw-text-unix ".el" files) to
strings with the right charset property (if needed) and the right
encoding? I couldn't find the functions for that...

  Cheers, thanks in advance,
    Eduardo Ochs
    eduardoochs at gmail.com
    http://angg.twu.net/


P.S.: (emacs-version) ==>
"GNU Emacs 23.0.60.1 (i686-pc-linux-gnu, GTK+ Version 2.8.20)
 of 2008-10-11 on dekooning"