* Convert unibyte to multibyte on input
From: Ralf Angeli @ 2005-07-05 10:49 UTC (permalink / raw)
If I understand (info "(elisp)Converting Representations") correctly,
Emacs will convert unibyte text to multibyte if it is inserted into a
multibyte buffer. However, on Windows I could observe that text,
guillemets in particular, copied from the character table and pasted
into Emacs will remain in its unibyte representation. When typing
`C-u C-x =' on a « character one gets the following result with a CVS
Emacs checked out and compiled a few days ago:
,----
| character: « (0253, 171, 0xab)
| charset: eight-bit-graphic (8-bit graphic char (0xA0..0xFF))
| code point: 171
| syntax: which means: whitespace
| buffer code: 0xAB
| file code: 0xAB (encoded by coding system raw-text-dos)
| display: by display table entry [?«] (see below)
|
| The display table entry is displayed by these fonts (glyph codes):
| «: -raster-Courier-normal-r-normal-normal-20-120-120-120-c-120-iso8859-1 (0xAB)
|
| There are text properties here:
| fontified t
`----
I would have expected to see the multibyte representation:
,----
| character: « (04253, 2219, 0x8ab, U+00AB)
| charset: latin-iso8859-1
| (Right-Hand Part of Latin Alphabet 1 (ISO/IEC 8859-1): ISO-IR-100.)
| code point: 43
| syntax: . which means: punctuation
| category: l:Latin
| buffer code: 0x81 0xAB
| file code: 0xC2 0xAB (encoded by coding system mule-utf-8-dos)
| display: by this font (glyph code)
| -raster-Courier-normal-r-normal-normal-20-120-120-120-c-120-iso8859-1 (0xAB)
|
| There are text properties here:
| face [font-latex-string-face]
| fontified t
`----
(This was the result of pasting into a UTF-8 buffer.)
I am not sure if this is a bug, a user mistake, or something else. On
GNU/Linux I can simulate the problem by typing `M-: (insert 171) RET'
in a Latin-1 buffer.
My problem is that I have to compare the guillemet found in the
buffer with another one in Lisp code in order to find the matching
closing one for font locking. `re-search-forward' obviously finds the
opening guillemet in its unibyte form, but comparing it with a
multibyte guillemet then fails. (What happens is probably something like
`(string= (string 171) (string 2219))'.)
So I am wondering whether the unibyte strings should not be present in the
buffer in the first place[1] or whether I have to explicitly convert the
unibyte strings to multibyte (e.g. with `string-make-multibyte').
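A minimal sketch of what such an explicit conversion could look like,
assuming the stray bytes are Latin-1. It uses `decode-coding-string'
with an explicit coding system instead of `string-make-multibyte', so
the target charset is stated in the code. `my-latin-1-to-multibyte' is
a hypothetical helper name, not an Emacs function:

```elisp
;; Hypothetical helper, not part of Emacs.  Assumption: any unibyte
;; non-ASCII bytes in STR are meant to be Latin-1 characters.
(defun my-latin-1-to-multibyte (str)
  "Return STR with unibyte non-ASCII bytes decoded as Latin-1.
Strings that are already multibyte are returned unchanged."
  (if (multibyte-string-p str)
      str
    (decode-coding-string str 'latin-1)))

;; Compare a unibyte guillemet (the single byte 0xAB, written here as
;; the octal escape \253) with a multibyte one.  How the literal "«"
;; is read depends on the coding system of the Lisp file.
(string= (my-latin-1-to-multibyte "\253")
         (my-latin-1-to-multibyte "«"))
```

This only normalizes strings before comparison; it does not change how
the text is stored in the buffer.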
Footnotes:
[1] Such strings are, BTW, a nice way to shoot yourself in the foot:
(progn
  (find-file "foo.txt")
  (insert 171 "foo" 187 "\n")
  (set-buffer-file-coding-system 'mule-utf-8)
  (save-buffer)
  (kill-buffer (current-buffer))
  (find-file "foo.txt")
  (insert 171 "bar" 187 "\n")
  (set-buffer-file-coding-system 'mule-utf-8)
  (save-buffer)
  (kill-buffer (current-buffer))
  (find-file "foo.txt"))
--
Ralf
* Re: Convert unibyte to multibyte on input
From: Ralf Angeli @ 2005-07-06 7:35 UTC (permalink / raw)
* Ralf Angeli (2005-07-05) writes:
> If I understand (info "(elisp)Converting Representations") correctly,
> Emacs will convert unibyte text to multibyte if it is inserted into a
> multibyte buffer. However, on Windows I could observe that text,
> guillemets in particular, copied from the character table and pasted
> into Emacs will remain in its unibyte representation. When typing
> `C-u C-x =' on a « character one gets the following result with a CVS
> Emacs checked out and compiled a few days ago:
>
> ,----
> | character: « (0253, 171, 0xab)
> | charset: eight-bit-graphic (8-bit graphic char (0xA0..0xFF))
> | code point: 171
> | syntax: which means: whitespace
> | buffer code: 0xAB
> | file code: 0xAB (encoded by coding system raw-text-dos)
I think I have identified the cause of this. The problem shows up
particularly in LaTeX files. Those are opened with the raw-text-dos
file coding system, which prevents character code conversion. That
raw-text-dos gets picked is likely the result of missing autoloads
for latexenc.el in the Windows build. I sent a bug report to
emacs-pretest-bugs.
--
Ralf