From: Ralf Angeli <dev.null@iwi.uni-sb.de>
Subject: Convert unibyte to multibyte on input
Date: Tue, 05 Jul 2005 12:49:59 +0200
Message-ID: <42ca65d2$0$18647$14726298@news.sunsite.dk>

If I understand (info "(elisp)Converting Representations") correctly,
Emacs converts unibyte text to multibyte when it is inserted into a
multibyte buffer.  On Windows, however, I observed that text,
guillemets in particular, copied from the character table and pasted
into Emacs remains in its unibyte representation.  Typing
`C-u C-x =' on a « character gives the following result with a CVS
Emacs checked out and compiled a few days ago:

,----
|   character: « (0253, 171, 0xab)
|     charset: eight-bit-graphic (8-bit graphic char (0xA0..0xFF))
|  code point: 171
|      syntax:   	which means: whitespace
| buffer code: 0xAB
|   file code: 0xAB (encoded by coding system raw-text-dos)
|     display: by display table entry [?«] (see below)
| 
| The display table entry is displayed by these fonts (glyph codes):
| «: -raster-Courier-normal-r-normal-normal-20-120-120-120-c-120-iso8859-1 (0xAB)
|  
| There are text properties here:
|   fontified            t
`----

I would have expected to see the multibyte representation:

,----
|   character: « (04253, 2219, 0x8ab, U+00AB)
|     charset: latin-iso8859-1
| 	     (Right-Hand Part of Latin Alphabet 1 (ISO/IEC 8859-1): ISO-IR-100.)
|  code point: 43
|      syntax: . 	which means: punctuation
|    category: l:Latin  
| buffer code: 0x81 0xAB
|   file code: 0xC2 0xAB (encoded by coding system mule-utf-8-dos)
|     display: by this font (glyph code)
|      -raster-Courier-normal-r-normal-normal-20-120-120-120-c-120-iso8859-1 (0xAB)
| 
| There are text properties here:
|   face                 [font-latex-string-face]
|   fontified            t
`----

(This was the result of pasting into a UTF-8 buffer.)

I am not sure if this is a bug, a user mistake, or something else.  On
GNU/Linux I can simulate the problem by typing `M-: (insert 171) RET'
in a Latin-1 buffer.

My problem now is that I have to compare the guillemet found in the
buffer with another one in Lisp code in order to find the matching
closing one for font locking.  `re-search-forward' evidently finds the
opening guillemet in its unibyte form, but comparing it with a
multibyte guillemet then fails.  (What happens is probably something like
`(string= (string 171) (string 2219))'.)

So I am wondering whether the unibyte strings should be present in the
buffer in the first place[1], or whether I have to explicitly convert the
unibyte strings to multibyte (e.g. with `string-make-multibyte').
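
A minimal sketch of the explicit-conversion option, for illustration
only.  `my-multibyte-string=' is a hypothetical helper name, not part
of Emacs.  It assumes `string-make-multibyte' maps the 8-bit character
to the same multibyte character as the one in the Lisp code, which may
depend on the Emacs version and language environment and is untested
against the Windows case above:

```elisp
;; Hypothetical helper; the name is made up for illustration.
(defun my-multibyte-string= (a b)
  "Return non-nil if strings A and B are equal once both are multibyte."
  ;; Convert only the arguments that are still unibyte, then compare.
  (string= (if (multibyte-string-p a) a (string-make-multibyte a))
           (if (multibyte-string-p b) b (string-make-multibyte b))))
```

It could replace the direct `string=' call when comparing the text
matched by `re-search-forward' with the guillemet in the Lisp code.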


Footnotes: 
[1] Such strings are, BTW, a nice way to shoot yourself in the foot:

(progn
  (find-file "foo.txt")
  (insert 171 "foo" 187 "\n")
  (set-buffer-file-coding-system 'mule-utf-8)
  (save-buffer)
  (kill-buffer (current-buffer))
  (find-file "foo.txt")
  (insert 171 "bar" 187 "\n")
  (set-buffer-file-coding-system 'mule-utf-8)
  (save-buffer)
  (kill-buffer (current-buffer))
  (find-file "foo.txt"))

-- 
Ralf
