unofficial mirror of help-gnu-emacs@gnu.org
* problem with editing/decoding utf-8 text
@ 2003-05-23 12:08 Fery
From: Fery @ 2003-05-23 12:08 UTC


Hello there,

I have a UTF-8 text file containing Latin-1 text. When I try to edit it
with Emacs, it does not detect that the file is UTF-8;
describe-coding-system reports 'iso-latin-1-unix'. (And I see the
two-byte representation of the Latin-1 chars, which is fine with me.)

When I save the buffer, it displays an error message:

These default coding systems were tried:
  iso-latin-1-unix
However, none of them safely encodes the target text.

Now, no matter what I choose (raw-text, no-conversion, utf-8), it
modifies all of the UTF-8 chars that do not fit into the ASCII charset.
It seems that it inserts a \201 before every char that is not in the
ASCII charset. I.e. if I just load and save a file, Emacs does not
behave transparently.

Moreover, there is a BUG: if I press ^G at the error message above and
quit without saving the file, it _deletes_ the file, although it leaves
an auto-save file (where the Latin-1 chars are bad).

I have found one solution: opening the file with
universal-coding-system-argument, using even UTF-8 (then I see the chars
correctly, although that is not always important) or e.g. no-conversion.

My questions:

0. What is this \201 byte?

1. Can't I tell a buffer (after loading a file) to interpret it as
binary, and to save exactly the same bytes that it read into the
buffer (i.e. a transparent buffer)?

2. What is the difference between raw-text, no-conversion, and binary?
In some places I can choose any of them, in other places not... This
whole coding system is a nightmare... :(((

3. Can't I tell Emacs to interpret the keyboard input as "raw"? I
have set input-meta to On and convert-meta to Off in .inputrc, and if I
could tell Emacs to "just interpret the bytes from the terminal input
as they are", then I could copy/paste UTF-8 data (in raw format) from
another application. (I run Emacs on Linux, with the 'putty' terminal on
Windows.)

GNU Emacs 21.3.2 on debian unstable linux.

Thanks:
Circum


* Re: problem with editing/decoding utf-8 text
       [not found] <mailman.6635.1053692285.21513.help-gnu-emacs@gnu.org>
@ 2003-05-23 16:50 ` Kai Großjohann
  2003-05-23 19:23   ` Oliver Scholz
  2003-05-23 21:20 ` Stefan Monnier
From: Kai Großjohann @ 2003-05-23 16:50 UTC


Fery <engard.ferenc@innomed.hu> writes:

> I have a UTF-8 text file, containing latin-1 text. When I try to edit it
> with emacs, it does not detect that it is utf-8; the
> describe-coding-system gives back 'iso-latin-1-unix'. (And I see the
> two-byte representation of latin1 chars, which is not bad to me.)

Released versions of Emacs put UTF-8 at a rather low priority for
automatic encoding detection.  So you need to help Emacs by
explicitly specifying the encoding.  Do C-x RET c utf-8 RET before
using C-x C-f to open the file.
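
Why the priority order matters can be sketched outside Emacs. The Python
below is only a toy model, not how Emacs's detector is implemented:
`detect` is a made-up helper that returns the first codec in a priority
list that decodes the bytes without error. Because every byte sequence
is valid Latin-1, UTF-8 only gets a chance if it is tried first.

```python
def detect(data, priority):
    """Return the first codec in `priority` that decodes `data` cleanly.

    Hypothetical helper for illustration only; Emacs's real detection
    logic is more involved than "first codec that does not fail".
    """
    for codec in priority:
        try:
            data.decode(codec)
            return codec
        except UnicodeDecodeError:
            continue
    raise ValueError("no codec in the priority list fits")


utf8_bytes = "é".encode("utf-8")  # two bytes: 0xC3 0xA9

# Latin-1 accepts every byte value, so it wins whenever it is tried first.
print(detect(utf8_bytes, ["latin-1", "utf-8"]))  # latin-1
print(detect(utf8_bytes, ["utf-8", "latin-1"]))  # utf-8
```

With Latin-1 ahead of UTF-8, the file in question is accepted as Latin-1
and each two-byte UTF-8 sequence shows up as two separate characters.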

You can also put utf-8 somewhat earlier in the list for automatic
encoding detection.  I think this can be achieved in the following
way, but I'm not sure.  I'm not a Mule expert.  If anyone knows
better, please help out.

(setq coding-category-list
      (cons 'coding-category-utf-8
            (delq 'coding-category-utf-8
                  coding-category-list)))

> When I save the buffer, it displays an error message:
>
> These default coding systems were tried:
>   iso-latin-1-unix
> However, none of them safely encodes the target text.
>
> Now, no matter what I choose (raw-text, no-conversion, utf-8), it
> modifies all of the utf8 chars which are not fit into the ascii charset.
> It seems, that it inserts a \201 before every char which is not in the
> ascii charset. I.e. if I just load and save a file, emacs does not
> behaves transparently.

You should make sure that UTF-8 is properly recognized when opening
the file, then saving will Just Work.

> I have found one solution: opening the file with
> universal-coding-system-argument, using even UTF-8 (then I see correctly
> the chars, although it is not always important) or e.g. no-conversion.

Do not use no-conversion.  The file is UTF-8, so UTF-8 is the right
encoding to specify.

> My questions:
>
> 0. What is this \201 byte?

Emacs encodes Latin-1 characters internally by a two-byte sequence.
The first byte is \201 (indicating the Latin-1 character set), and
the second byte is the actual character.  \202 stands for Latin-2, as
you might guess.
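
To make the effect concrete, here is a small Python model of the scheme
just described. It is a sketch under assumptions: `internal_latin1` is a
hypothetical helper, it only handles ASCII and the Latin-1 bytes
0xA0-0xFF, and it is not Emacs's real encoder. It shows why a UTF-8 file
that was misread as Latin-1 ends up with a \201 in front of every
non-ASCII byte once the internal form is written out unconverted.

```python
LATIN1_LEADING_BYTE = 0o201  # octal 201 == 0x81, the prefix described above


def internal_latin1(text):
    """Model the internal form: ASCII stays as is, Latin-1 gets a \\201 prefix.

    Hypothetical helper; bytes 0x80-0x9F are rejected because this sketch
    makes no claim about how Emacs represents them.
    """
    out = bytearray()
    for b in text.encode("latin-1"):
        if b < 0x80:
            out.append(b)
        elif b >= 0xA0:
            out += bytes([LATIN1_LEADING_BYTE, b])
        else:
            raise ValueError("byte 0x%02x is outside what this sketch models" % b)
    return bytes(out)


utf8_on_disk = "é".encode("utf-8")        # b'\xc3\xa9'
misread = utf8_on_disk.decode("latin-1")  # two characters instead of one
print(internal_latin1(misread))           # b'\x81\xc3\x81\xa9'
```

The output has the shape reported in the original question: each of the
two bytes of the UTF-8 sequence is preceded by \201.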

> 1. Cannot I tell to a buffer (after the load of a file) that interpet it
> as binary, and save exactly the same bytes what it did read into the
> buffer (i.e. transparent buffer)?

It's not a good idea.  The buffer contents might already be munged at
that point.

> 2. What is the difference between raw-text, no-conversion, binary? On
> some places, I can choose any of them, on other places not... This whole
> coding system is a nightmare... :(((

The differences are rather subtle, I'm afraid.  I think binary is an
alias for no-conversion.  raw-text does EOL conversion, whereas
no-conversion doesn't.

> 3. Cannot I tell to emacs that interpret the keyboard input as
> "raw"? I have set input-meta to On, convert-meta to Off in .inputrc,
> and if I could tell emacs that "just interpret the bytes from the
> terminal input what they are", then I could copy/paste utf-8 data
> (in raw format) from another application. (I run emacs on linux,
> with the 'putty' terminal on windows).

It does not make sense to do that, IMHO.  For example, M-f would
cease to work because Emacs wouldn't know what characters are
represented by the bytes, and so it wouldn't know which characters
are parts of words.

But it seems your terminal uses utf-8, so you can just teach Emacs
about this: C-x RET k utf-8 RET.
-- 
This line is not blank.


* Re: problem with editing/decoding utf-8 text
  2003-05-23 16:50 ` Kai Großjohann
@ 2003-05-23 19:23   ` Oliver Scholz
  2003-05-23 20:53     ` Kai Großjohann
From: Oliver Scholz @ 2003-05-23 19:23 UTC


kai.grossjohann@gmx.net (Kai Großjohann) writes:

> Fery <engard.ferenc@innomed.hu> writes:
>
>> I have a UTF-8 text file, containing latin-1 text. When I try to edit it
>> with emacs, it does not detect that it is utf-8; the
>> describe-coding-system gives back 'iso-latin-1-unix'. (And I see the
>> two-byte representation of latin1 chars, which is not bad to me.)
>
> Released versions of Emacs put UTF-8 at a rather low priority for
> automatic encoding detection.  So you need to help Emacs by
> explicitly specifying the encoding.  Do C-x RET c utf-8 RET before
> using C-x C-f to open the file.
>
> You can also put utf-8 somewhat earlier in the list for automatic
> encoding detection.  I think this can be achieved in the following
> way, but I'm not sure.  I'm not a Mule expert.  If anyone knows
> better, please help out.
>
> (setq coding-category-list
>       (cons 'coding-category-utf-8
>             (delq 'coding-category-utf-8
>                   coding-category-list)))
>

Not 100% sure if this really makes a difference --

(set-coding-priority (list 'coding-category-utf-8))

maybe?

If you want UTF-8 to be the default for new files:

(prefer-coding-system 'utf-8)

[...]
>> 1. Cannot I tell to a buffer (after the load of a file) that interpet it
>> as binary, and save exactly the same bytes what it did read into the
>> buffer (i.e. transparent buffer)?
>
> It's not a good idea.  The buffer contents might already be munged at
> that point.
[...]

Maybe the OP wants to visit files with `M-x find-file-literally'?

    Oliver
-- 
4 Prairial an 211 de la Révolution
Liberté, Egalité, Fraternité!


* Re: problem with editing/decoding utf-8 text
  2003-05-23 19:23   ` Oliver Scholz
@ 2003-05-23 20:53     ` Kai Großjohann
From: Kai Großjohann @ 2003-05-23 20:53 UTC


Oliver Scholz <alkibiades@gmx.de> writes:

> Not 100% if this really makes a difference --
>
> (set-coding-priority (list 'coding-category-utf-8))
>
> maybe?

Thanks!

> If you want UTF-8 to be the default for new files:
> (prefer-coding-system 'utf-8)
>
> [...]
>>> 1. Cannot I tell to a buffer (after the load of a file) that interpet it
>>> as binary, and save exactly the same bytes what it did read into the
>>> buffer (i.e. transparent buffer)?
>>
>> It's not a good idea.  The buffer contents might already be munged at
>> that point.
> [...]
>
> Maybe the OP wants to visit files with `M-x find-file-literally'?

Gnah :-(
-- 
This line is not blank.


* Re: problem with editing/decoding utf-8 text
       [not found] <mailman.6635.1053692285.21513.help-gnu-emacs@gnu.org>
  2003-05-23 16:50 ` Kai Großjohann
@ 2003-05-23 21:20 ` Stefan Monnier
From: Stefan Monnier @ 2003-05-23 21:20 UTC


> Now, no matter what I choose (raw-text, no-conversion, utf-8), it
> modifies all of the utf8 chars which are not fit into the ascii charset.
> It seems, that it inserts a \201 before every char which is not in the
> ascii charset. I.e. if I just load and save a file, emacs does not
> behaves transparently.

Do you also get the \201 if you choose `utf-8' ?
If so, it's definitely a bug.

> 0. What is this \201 byte?

An internal thing that you shouldn't see unless you ask to see it.
Using `raw-text' or `no-conversion' is debatably considered as "asking to
see it", but utf-8 definitely isn't, so if you see it with utf-8, it's
a bug.

> 1. Cannot I tell to a buffer (after the load of a file) that interpet it
> as binary, and save exactly the same bytes what it did read into the
> buffer (i.e. transparent buffer)?

If you save with the same coding-system as when you loaded, yes.
In your case, you loaded with a latin-1 coding-system and then saved with
another, so obviously Emacs had to do some conversion work and you don't
get the same sequence of bytes.
Of course the fact that Emacs happily visited the file in latin-1 but then
refused to save it in latin-1 is a bug.  I vaguely seem to remember that
such a bug has been fixed in Emacs-CVS, but it would be great if you could
either check it or report a precise test case.
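
The byte-level reasoning can be checked with a short Python sketch.
Python codecs stand in for Emacs coding systems here, which is an
assumption of the example, and the sample line is made up. Decoding and
re-encoding with the same codec gives back the original bytes, while
re-encoding with a different one does not.

```python
# A UTF-8 file on disk (sample content invented for this example).
original = "comment = é\n".encode("utf-8")

# The file is misdetected and read as Latin-1.
text = original.decode("latin-1")

same_codec = text.encode("latin-1")  # saved with the codec it was read with
other_codec = text.encode("utf-8")   # saved with a different codec

print(same_codec == original)   # True
print(other_codec == original)  # False
```

In the second case each byte of the original two-byte sequence is
encoded again as a character of its own, so the file grows and no longer
matches what was read.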

> 2. What is the difference between raw-text, no-conversion, binary? On
> some places, I can choose any of them, on other places not... This whole
> coding system is a nightmare... :(((

Yes it is, but it's not all Emacs's fault.  The only alternative would be for
Emacs to say "I only ever support 1 encoding".  The current code is
supposed to work just fine in this "single encoding" situation while also
allowing you to use other encodings if you want to.
Of course, bugs make this dream a bit less sweet.


        Stefan


* Re: problem with editing/decoding utf-8 text
@ 2003-05-26  9:47 Fery
From: Fery @ 2003-05-26  9:47 UTC


Oliver Scholz wrote:
> 
> kai.grossjohann@gmx.net (Kai Großjohann) writes:
> > (setq coding-category-list
> >       (cons 'coding-category-utf-8
> >             (delq 'coding-category-utf-8
> >                   coding-category-list)))
> >
> 
> Not 100% if this really makes a difference --
> 
> (set-coding-priority (list 'coding-category-utf-8))
> 
> maybe?

It helps. Thanks!

> >> 1. Cannot I tell to a buffer (after the load of a file) that interpet it
> >> as binary, and save exactly the same bytes what it did read into the
> >> buffer (i.e. transparent buffer)?
> >
> > It's not a good idea.  The buffer contents might already be munged at
> > that point.

I know, I know, but I am the user, I should know if it is safe... :)

> Maybe the OP wants to visit files with `M-x find-file-literally'?

Yes, this is what I wanted originally. :) But, without the existence of
a 'literal keyboard' it doesn't help _so_ much (although, I can edit the
non-utf-8 part of the file)...

Circum


* Re: problem with editing/decoding utf-8 text
@ 2003-05-26  9:47 Fery
From: Fery @ 2003-05-26  9:47 UTC


Stefan Monnier wrote:
> 
> > Now, no matter what I choose (raw-text, no-conversion, utf-8), it
> > modifies all of the utf8 chars which are not fit into the ascii charset.
> > It seems, that it inserts a \201 before every char which is not in the
> > ascii charset. I.e. if I just load and save a file, emacs does not
> > behaves transparently.
> 
> Do you also get the \201 if you choose `utf-8' ?
> If so, it's definitely a bug.

Yes.

> Of course the fact that Emacs happily visited the file in latin-1 but then
> refused to save it in latin-1 is a bug.  I vaguely seem to remember that
> such a bug has been fixed in Emacs-CVS, but it would be great if you could
> either check it or report a precise test case.

Attached is a small text file which opens as latin-1 for me and refuses
to save.

> > 2. What is the difference between raw-text, no-conversion, binary? On
> > some places, I can choose any of them, on other places not... This whole
> > coding system is a nightmare... :(((
> 
> Yes it is but it's not all Emacs fault.  The only alternative would be for

I know, I just have to look at my 'other OS' :(((

Circum

PS: What about the other bug (losing the file completely)?


* Re: problem with editing/decoding utf-8 text
@ 2003-05-27  8:06 Fery
From: Fery @ 2003-05-27  8:06 UTC


[-- Attachment #1: Type: text/plain, Size: 377 bytes --]

> > Attached a small text file, which opens as latin-1 at me, and refuse to
> > save.
> 
> The attachment was missing (we really need our mailers to check the
> presence of an attachment when the main text mentions the word
> "attachment").

Yes, the choice is whether the program should be smarter or its user.
:))  Next try.

Circum

PS: Sorry Stefan for the two copies... :(

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: test --]
[-- Type: text/plain; charset=us-ascii; name="test", Size: 76 bytes --]

# it is a hungarian word, coded in utf-8...
   comment = Õrzõfejlesztés




* Re: problem with editing/decoding utf-8 text
       [not found] <mailman.6770.1053942670.21513.help-gnu-emacs@gnu.org>
@ 2003-05-27 11:05 ` Oliver Scholz
  2003-05-27 11:41   ` Oliver Scholz
From: Oliver Scholz @ 2003-05-27 11:05 UTC


Fery <engard.ferenc@innomed.hu> writes:
[...]
>> >> 1. Cannot I tell to a buffer (after the load of a file) that interpet it
>> >> as binary, and save exactly the same bytes what it did read into the
>> >> buffer (i.e. transparent buffer)?
>> >
>> > It's not a good idea.  The buffer contents might already be munged at
>> > that point.
>
> I know, I know, but I am the user, I should know if it is safe... :)
>
>> Maybe the OP wants to visit files with `M-x find-file-literally'?
>
> Yes, this is what I wanted originally. :) But, without the existence of
> a 'literal keyboard' it doesn't help _so_ much (although, I can edit the
> non-utf-8 part of the file)...
[...]

I can not parse this. Maybe there's a misunderstanding here?

When you hit a non-ascii key (like "ö" on a German keyboard), then
Emacs inserts the 8bit code for ö in the ISO 8859-1 character encoding
scheme (normally). This would be the octal code 366.

[Note: This applies only to unibyte buffers. A buffer visiting a file
with `find-file-literally' is a unibyte buffer.]

What else do you want? Be aware that Emacs doesn't display this as
some octal escape sequence (or some such) but as the actual glyph "ö".

[If you think this latter thing is bad, then you are not the only one
and I could post some Elisp code, which remedies this.]

Note, however, that the 8bit codes from the right hand part of the
Latin 1 CES are *not* allowed octets in UTF-8. So inserting an ö will
break your UTF-8 encoding. So maybe you want -- on the contrary --
Emacs to insert the correct UTF-8 octets corresponding to the
character on your keyboard (which is definitely not "literally")?
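
A quick Python check of the ö example (a sketch; `is_valid_utf8` is a
helper invented here, and the surrounding word is arbitrary): the single
Latin-1 octet for ö is octal 366, and dropping that octet into otherwise
valid UTF-8 text makes the text undecodable, whereas the two-octet UTF-8
form of ö is fine.

```python
def is_valid_utf8(data):
    """Return True if `data` decodes as UTF-8 (helper made up for this sketch)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


o_latin1 = "ö".encode("latin-1")  # one octet: 0xF6
o_utf8 = "ö".encode("utf-8")      # two octets: 0xC3 0xB6

print(oct(o_latin1[0]))                          # 0o366
print(is_valid_utf8(b"sch" + o_latin1 + b"n"))   # False
print(is_valid_utf8(b"sch" + o_utf8 + b"n"))     # True
```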

    Oliver
-- 
8 Prairial an 211 de la Révolution
Liberté, Egalité, Fraternité!


* Re: problem with editing/decoding utf-8 text
       [not found] <mailman.6818.1054022957.21513.help-gnu-emacs@gnu.org>
@ 2003-05-27 11:10 ` Oliver Scholz
       [not found] ` <3ED37785.CA5A9AD5@innomed.hu>
From: Oliver Scholz @ 2003-05-27 11:10 UTC


Fery <engard.ferenc@innomed.hu> writes:
[...]
>
> PS: Sorry Stefan for the two copies... :(
> --------------100ADB32E276E2933804EED5
> Content-Type: text/plain; charset=us-ascii;
                            ^^^^^^^
>  name="test"
> Content-Transfer-Encoding: 7bit
> Content-Disposition: inline;
>  filename="test"
>
> # it is a hungarian word, coded in utf-8...
>    comment = Ã.rzõfejlesztés

This didn't quite work.

    Oliver
-- 
8 Prairial an 211 de la Révolution
Liberté, Egalité, Fraternité!


* Re: problem with editing/decoding utf-8 text
  2003-05-27 11:05 ` Oliver Scholz
@ 2003-05-27 11:41   ` Oliver Scholz
From: Oliver Scholz @ 2003-05-27 11:41 UTC


Oliver Scholz <alkibiades@gmx.de> writes:
[...]
> Note, however, that the 8bit codes from the right hand part of the
                      ^^^^^^^^
> Latin 1 CES are *not* allowed octetts in UTF-8. 
[...]

Erm, I meant "_some_ of the 8 bit code".

    Oliver
-- 
8 Prairial an 211 de la Révolution
Liberté, Egalité, Fraternité!


* Re: problem with editing/decoding utf-8 text
       [not found]   ` <ubrxnb5m2.fsf@ID-87814.user.dfncis.de>
@ 2003-05-30 12:45     ` Fery
       [not found]     ` <mailman.7046.1054298932.21513.help-gnu-emacs@gnu.org>
From: Fery @ 2003-05-30 12:45 UTC


> You said that Emacs corrupted your files in some cases. Could you
> please tell the exact steps to reproduce this with the file you sent?

$ emacs test

- type 'a' and then backspace
- ^X^S
- ^G
- ^X^C
- n
- yes

After that, 'test' has disappeared, and '#test#' contains something
different from the old test.

Circum


* Re: problem with editing/decoding utf-8 text
       [not found]     ` <mailman.7046.1054298932.21513.help-gnu-emacs@gnu.org>
@ 2003-05-30 13:24       ` Kai Großjohann
From: Kai Großjohann @ 2003-05-30 13:24 UTC


Fery <engard.ferenc@innomed.hu> writes:

>> You said that Emacs corrupted your files in some cases. Could you
>> please tell the exact steps to reproduce this with the file you sent?
>
> $ emacs test
>
> - type 'a' and then backspace
> - ^X^S
> - ^G
> - ^X^C
> - n
> - yes
>
> After that, 'test' is disappeared, and '#test#' contains a different
> file compared to the old test.

The autosave files (#test# is an autosave file) are saved in the
emacs-mule encoding, because we're sure that this encoding can
represent all characters that Emacs can represent.  Using any other
encoding (including UTF-8) may possibly lead to information loss.  At
least, it might have to ask the user, and you surely don't want
auto-saving to ask the user.
-- 
This line is not blank.

