* bug#1215: 23.0.60; unibyte->multibyte conversion problem (in search-forward and friends)
From: Eduardo Ochs @ 2008-10-21 16:00 UTC
  To: emacs-pretest-bug

Hello,

this may not be exactly a bug, I'm just struggling with an obscure
part of Emacs... anyway, I did my best to make this look like a nice
bug report, and to make the tests clear enough to help other people
who also find unibyte<->multibyte conversions obscure...

The short story
===============
Let me refer to strings like "<<tag>>" - where the "<<" and ">>" stand
for guillemets, i.e., the characters that we type with `C-x 8 <' and
`C-x 8 >' - as "anchors". So: if I produce an anchor string in a
unibyte buffer and then I search for an occurrence of that string in
a multibyte buffer, the search fails.

The two small blocks below illustrate this. Instructions: save the
first one to "/tmp/1.txt", the second one to "/tmp/2.txt", and then
run:

  (load-file "/tmp/1.txt")

It will show "uni" in the "*Messages*" buffer, and the search will
fail. The detailed message about the failure of the search will be
like this:

  progn: Search failed: "\302\253foo\302\273"

meaning the anchor string has been incorrectly converted.



;;--------snip,snip--------
;; -*- coding: raw-text-unix -*-
;; (save-this-block-as "/tmp/1.txt")
(progn
  (find-file "/tmp/2.txt")
  (goto-char (point-min))
  (setq anchorstr "«foo»")
  (message (if (multibyte-string-p anchorstr) "multi" "uni"))
  (search-forward anchorstr))
;;--------snip,snip--------

;;--------snip,snip--------
;; -*- coding: latin-1 -*-
;; (save-this-block-as "/tmp/2.txt")
(search-forward "«foo»")
;; «foo»
;;--------snip,snip--------



The long story
==============
Save the block below as "/tmp/3.txt" and follow the instructions in
it. Note that it doesn't have any non-ascii characters - the anchors
are produced by running the "(insert ...)" sexps.



;;--------snip,snip--------
;; -*- coding: latin-1 -*-
;; (save-this-block-as "/tmp/3.txt")

;; Run the "progn" below with C-x C-e.
;; It will create a line like this:
;; <<anchor>>\253anchor\273\253anchor\273\253anchor\273
;; (but the "<<", ">>", "\253", "\273" are single characters).
;; Don't delete that line, it will be used later.
;;
(progn
  (defun mmb (str) (string-make-multibyte str))
  (defun mub (str) (string-make-unibyte   str))
  (insert 171 "anchor" 187)
  (insert           "\253anchor\273")
  (insert      (mub "\253anchor\273"))
  (insert (mmb (mub "\253anchor\273")))
  )


;; Now try to save this file.
;; Emacs will complain about the "\253"s and "\273"s - it will
;; say that iso-latin-1-unix and utf-8-unix cannot encode them.
;; The "<<" and ">>" are ok, though...
;;
;; So: leave the "<<anchor>>" above, delete the "\253anchor\273"s,
;; save this file, and reload it. DON'T SKIP THIS STEP - the
;; charset properties mentioned below behave differently before
;; and after reloads, and I don't know exactly the mechanics of
;; this... 8-\
;;
;; If we inspect the "<<", ">>", "\253", "\273" with `C-x ='
;; we see this:
;; Char: << (171, #o253, #xab, file #xAB)
;; Char: >> (187, #o273, #xbb, file #xBB)
;; Char: \253 (4194219, #o17777653, #x3fffab, raw-byte)
;; Char: \273 (4194235, #o17777673, #x3fffbb, raw-byte)
;;
;; Now mark the "<<anchor>>" above and copy it to the top of
;; the kill ring with `M-w'. Let's examine the results of
;; several obvious ways to (re)create the "<<anchor>>"
;; above as a string...
;; Here are some of the results:
;;
;;               "\253anchor\273"   ==> "<<anchor>>"
;;          (mub "\253anchor\273")  ==> "<<anchor>>"
;;     (mmb (mub "\253anchor\273")) ==> "\253anchor\273"
;;               (car kill-ring)    ==>
;;               #("<<anchor>>" 0 8 (charset iso-8859-1))
;;          (mub (car kill-ring))   ==> "<<anchor>>"
;;     (mmb (mub (car kill-ring)))  ==> "\253anchor\273"

                            "\253anchor\273"
                       (mub "\253anchor\273")
                  (mmb (mub "\253anchor\273"))
             (mub (mmb (mub "\253anchor\273")))
(mapcar 'identity           "\253anchor\273")
(mapcar 'identity      (mub "\253anchor\273"))
(mapcar 'identity (mmb (mub "\253anchor\273")))
                            (car kill-ring)
                       (mub (car kill-ring))
                  (mmb (mub (car kill-ring)))
(mapcar 'identity           (car kill-ring))
(mapcar 'identity      (mub (car kill-ring)))
(mapcar 'identity (mmb (mub (car kill-ring))))


;; This is the weird part.
;; Let's insert another "<<anchor>>"/"\253anchor\273" pair, and
;; let's try to jump to its "anchors" with `search-backward'.

(insert 171 "anchor" 187 "\n\253anchor\273")



(search-backward            "\253anchor\273")
(search-backward       (mub "\253anchor\273"))
(search-backward  (mmb (mub "\253anchor\273")))
(search-backward            (car kill-ring))
(search-backward       (mub (car kill-ring)))
(search-backward  (mmb (mub (car kill-ring))))

;; Only "(search-backward (car kill-ring))" jumps to
;; "<<anchor>>" - all the others jump to "\253anchor\273".
;; The trick - aha! - is that "(car kill-ring)" holds this
;; string,
;;
;;          (car kill-ring)    ==>
;;          #("<<anchor>>" 0 8 (charset iso-8859-1))
;;
;; and the "(charset iso-8859-1)" property is essential...
;;--------snip,snip--------


What is the standard way to convert unibyte strings (for example
anchor strings, generated from code in raw-text-unix ".el" files) to
strings with the right charset property (if needed) and the right
encoding? I couldn't find the functions for that...

  Cheers, thanks in advance,
    Eduardo Ochs
    eduardoochs at gmail.com
    http://angg.twu.net/



P.S.: (emacs-version) ==>
"GNU Emacs 23.0.60.1 (i686-pc-linux-gnu, GTK+ Version 2.8.20)
 of 2008-10-11 on dekooning"







* bug#1215: 23.0.60; unibyte->multibyte conversion problem (in search-forward and friends)
From: Stefan Monnier @ 2008-10-22 14:51 UTC
  To: Eduardo Ochs; +Cc: emacs-pretest-bug, 1215

> Let me refer to strings like "<<tag>>" - where the "<<" and ">>" stand
> for guillemets, i.e., the characters that we type with `C-x 8 <' and
> `C-x 8 >' - as "anchors". So: if I produce an anchor string in a
> unibyte buffer and then I search for an occurrence of that string in
> a multibyte buffer, the search fails.

There are no guillemets in unibyte buffers.
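
A unibyte string only holds bytes, so it has to be decoded before it
can match text in a multibyte buffer.  A minimal sketch, assuming the
bytes were meant as Latin-1 (the function name `sketch-find-anchor' is
hypothetical, it is not part of Emacs):

```elisp
(defun sketch-find-anchor ()
  "Decode a unibyte anchor as Latin-1 and search for it in multibyte text.
Return (UNIBYTE-P DECODED-CODES MATCH-END)."
  (let* ((bytes "\253foo\273")  ; unibyte literal: bytes #xAB f o o #xBB
         ;; Assumption: the bytes are Latin-1 encoded text.
         (text (decode-coding-string bytes 'latin-1)))
    (with-temp-buffer           ; temp buffers are multibyte by default
      ;; Buffer text: ";; " followed by U+00AB, "foo", U+00BB and a newline.
      (insert ";; " ?\u00AB "foo" ?\u00BB "\n")
      (goto-char (point-min))
      (list (not (multibyte-string-p bytes))
            (string-to-list text)
            (search-forward text nil t)))))

(sketch-find-anchor)
;; => (t (171 102 111 111 187) 9)
```

The decoded string contains the characters 171 and 187, i.e. real
guillemets, so the search succeeds and leaves point after the match.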

> ;;--------snip,snip--------
> ;; -*- coding: raw-text-unix -*-
> ;; (save-this-block-as "/tmp/1.txt")
> (progn
>   (find-file "/tmp/2.txt")
>   (goto-char (point-min))
>   (setq anchorstr "«foo»")
>   (message (if (multibyte-string-p anchorstr) "multi" "uni"))
>   (search-forward anchorstr))

There's a bug here, indeed: Emacs should refuse to save such a file,
because raw-text-unix (to which I prefer to refer as `binary') cannot
encode « and ».


        Stefan







* bug#1215: 23.0.60; unibyte->multibyte conversion problem (in search-forward and friends)
From: Juanma Barranquero @ 2009-01-16  0:19 UTC
  To: Stefan Monnier; +Cc: 1215

On Wed, Oct 22, 2008 at 15:51, Stefan Monnier <monnier@iro.umontreal.ca> wrote:

> There's a bug here, indeed: Emacs should refuse to save such a file,
> because raw-text-unix (to which I prefer to refer as `binary') cannot
> encode « and ».

Why not? « is U+00AB and » is U+00BB.

(with-temp-file "/temp/guillemets.txt"
  (set-buffer-multibyte nil)
  (setq buffer-file-coding-system 'raw-text-unix)
  (insert ?« "Test" ?» ?\n))

=>

0000 0000 ab 54 65 73 74 bb 0a                              ½Test╗.

    Juanma







* bug#1215: 23.0.60; unibyte->multibyte conversion problem (in
From: Stefan Monnier @ 2009-01-16  2:47 UTC
  To: Juanma Barranquero; +Cc: 1215

>> There's a bug here, indeed: Emacs should refuse to save such a file,
>> because raw-text-unix (to which I prefer to refer as `binary') cannot
>> encode « and ».
> Why not? « is U+00AB and » is U+00BB.

Neither of which is a byte.  The byte 0xAB is the Emacs character
#x3fffab, as shown by (unibyte-char-to-multibyte #xab).

If you save that file and read it back in, you'll see that its content
has changed.  `save-buffer' should not silently save if it will
lose information.
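
The difference between the two characters can be checked directly.
A small sketch (the function name is made up for illustration):

```elisp
(defun sketch-byte-vs-char ()
  "Compare the raw byte #xAB with the character U+00AB.
Return (RAW-BYTE-CODE GUILLEMET-CODE SAME-P)."
  (let ((raw-byte (unibyte-char-to-multibyte #xab)) ; the byte, as an Emacs character
        (guillemet ?\u00AB))                        ; LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
    (list (format "#x%x" raw-byte)
          (format "#x%x" guillemet)
          (= raw-byte guillemet))))

(sketch-byte-vs-char)
;; => ("#x3fffab" "#xab" nil)
```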


        Stefan







* bug#1215: 23.0.60; unibyte->multibyte conversion problem (in
From: Juanma Barranquero @ 2009-01-16  2:59 UTC
  To: Stefan Monnier; +Cc: 1215

> If you save that file and read it back in, you'll see that its content
> has changed.

Sorry, but I don't see that.

  emacs -Q

then I evaluate this:

  (with-temp-file "/temp/guillemets.txt"
    (set-buffer-multibyte nil)
    (setq buffer-file-coding-system 'raw-text-unix)
    (insert ?« "Test" ?» ?\n))

then

  C-x C-f /temp/guillemets.txt

I get a buffer guillemets.txt with

  «Test»

as a multibyte file in iso-latin-1-unix. I can modify it and save it,
and still the guillemets are bytes 0xab and 0xbb in the resulting
file.

    Juanma







* bug#1215: 23.0.60; unibyte->multibyte conversion problem (in
From: Stefan Monnier @ 2009-01-16  3:37 UTC
  To: Juanma Barranquero; +Cc: 1215

> Sorry, but I don't see that.
>   emacs -Q
> then I evaluate this:
>   (with-temp-file "/temp/guillemets.txt"
>     (set-buffer-multibyte nil)
>     (setq buffer-file-coding-system 'raw-text-unix)
>     (insert ?« "Test" ?» ?\n))

You're cheating: remove the (set-buffer-multibyte nil).
Otherwise you're not actually inserting the ?« char but the #xAB
byte instead.
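
The effect of that call can be seen without touching any file.
A sketch, with a hypothetical helper name:

```elisp
(defun sketch-insert-guillemets (multibyte)
  "Insert U+00AB, \"Test\" and U+00BB into a temporary buffer.
MULTIBYTE non-nil keeps the buffer multibyte, nil makes it unibyte.
Return (MULTIBYTE-P CODES), where CODES are the character codes of the
buffer text after `string-to-multibyte'."
  (with-temp-buffer
    (set-buffer-multibyte multibyte)
    (insert ?\u00AB "Test" ?\u00BB)
    (let ((text (buffer-string)))
      (list (multibyte-string-p text)
            ;; A unibyte string is turned into raw-byte characters here;
            ;; a multibyte string is returned unchanged.
            (string-to-list (string-to-multibyte text))))))

(sketch-insert-guillemets t)
;; => (t (171 84 101 115 116 187))
(sketch-insert-guillemets nil)
;; => (nil (4194219 84 101 115 116 4194235))
```

In the unibyte case the buffer ends up holding the bytes #xAB and
#xBB, not the characters U+00AB and U+00BB.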


        Stefan







* bug#1215: 23.0.60; unibyte->multibyte conversion problem (in
From: Juanma Barranquero @ 2009-01-16 11:08 UTC
  To: Stefan Monnier; +Cc: 1215

On Fri, Jan 16, 2009 at 04:37, Stefan Monnier <monnier@iro.umontreal.ca> wrote:

> You're cheating: remove the (set-buffer-multibyte nil).
> Otherwise you're not actually inserting the ?« char but the #xAB
> byte instead.

OK, I see.

You said:

"There's a bug here, indeed: Emacs should refuse to save such a file,
because raw-text-unix (to which I prefer to refer as `binary') cannot
encode « and »."

but according to raw-text-unix's description:

  t -- raw-text-unix

  Raw text, which means text contains random 8-bit codes.
  Encoding text with this coding system produces the actual byte
  sequence of the text in buffers and strings.  An exception is made for
  eight-bit-control characters.  Each of them is encoded into a single
  byte.

you can save (almost) anything with it. What is the bug?

    Juanma







* bug#1215: 23.0.60; unibyte->multibyte conversion problem (in
From: Stefan Monnier @ 2009-01-16 20:56 UTC
  To: Juanma Barranquero; +Cc: 1215

> but according to raw-text-unix's description:

>   t -- raw-text-unix

>   Raw text, which means text contains random 8-bit codes.
>   Encoding text with this coding system produces the actual byte
>   sequence of the text in buffers and strings.  An exception is made for
>   eight-bit-control characters.  Each of them is encoded into a single
>   byte.

> you can save (almost) anything with it. What is the bug?

The bug is that you can currently save (almost) anything with it.  This is
due to historical reasons, where different notions of "no encoding" were
mixed up.  So on save, raw-text-unix behaves pretty much like
utf-8-mule under Emacs-23 and emacs-mule under Emacs-22.  On load, it
behaves pretty much like `binary'.


        Stefan







* bug#1215: 23.0.60; unibyte->multibyte conversion problem (in
From: Eli Zaretskii @ 2009-01-17 10:10 UTC
  To: Stefan Monnier, 1215; +Cc: lekktu

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Fri, 16 Jan 2009 15:56:44 -0500
> Cc: 1215@emacsbugs.donarmstrong.com
> 
> > but according to raw-text-unix's description:
> 
> >   t -- raw-text-unix
> 
> >   Raw text, which means text contains random 8-bit codes.
> >   Encoding text with this coding system produces the actual byte
> >   sequence of the text in buffers and strings.  An exception is made for
> >   eight-bit-control characters.  Each of them is encoded into a single
> >   byte.
> 
> > you can save (almost) anything with it. What is the bug?
> 
> The bug is that you can currently save (almost) anything with it.  This is
> due to historical reasons, where different notions of "no encoding" were
> mixed up.  So on save, raw-text-unix behaves pretty much like
> utf-8-mule under Emacs-23 and emacs-mule under Emacs-22.  On load, it
> behaves pretty much like `binary'.

I documented this in the ELisp manual.





