unofficial mirror of bug-gnu-emacs@gnu.org 
From: Vincent Lefevre <vincent@vinc17.net>
To: Eli Zaretskii <eliz@gnu.org>
Cc: 696026-forwarded@bugs.debian.org, 696026@bugs.debian.org,
	rlb@defaultvalue.org, 13505@debbugs.gnu.org
Subject: bug#13505: Bug#696026: emacs24: file corruption on saving
Date: Tue, 22 Jan 2013 03:35:57 +0100
Message-ID: <20130122023557.GA25002@xvii.vinc17.org>
In-Reply-To: <83vcaqo1yv.fsf@gnu.org>

On 2013-01-21 19:55:20 +0200, Eli Zaretskii wrote:
> > Date: Mon, 21 Jan 2013 05:14:10 +0100
> > From: Vincent Lefevre <vincent@vinc17.net>
> > Cc: rlb@defaultvalue.org, handa@gnu.org, 13505@debbugs.gnu.org,
> > 	696026-forwarded@bugs.debian.org, 696026@bugs.debian.org
> > 
> > On 2013-01-21 05:48:14 +0200, Eli Zaretskii wrote:
> > > > You said:
> > > > 
> > > > | The original encoded form of the characters as found on disk at
> > > > | visit time _cannot_ be recovered by saving with raw-text, because
> > > > | that encoded form is lost without a trace when the file is _visited_
> > > >   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > | and decoded into the internal representation.
> > > > 
> > > > This is what lossy is.
> > > 
> > > In that sense, every encoding except no-conversion is lossy.
> > 
> > Even 8-bit encodings such as latin-1?
> 
> Yes.  When latin-1 characters are decoded (as part of visiting a
> file), they are converted to the internal representation, and cease to
> be single 8-bit bytes.

Any example where saving the file without modifying it (see below)
would modify the data (as a sequence of bytes on the disk)?

> > > > On the contrary, the utf-8 encoding doesn't seem to be lossy: Emacs
> > > > seems to handle files with invalid UTF-8 sequences without any loss.
> > > > So, this encoding is safe, even if Emacs wrongly guesses the encoding.
> > > 
> > > No, it isn't, although you could get away with it most of the time.
> > 
> > Could you give an example where one loses data with the utf-8 encoding?
> 
> E.g., in your test file, the byte whose value is 0x80 is converted to
> 0x3fff80 when the file is read into a buffer.

No, there are no problems with this example:

$ printf "\x80" > file
$ hd file
00000000  80                                                |.|
00000001
$ emacs -q file

Here the coding system chosen by Emacs is utf-8-unix. Then I do
  M-: (set-buffer-modified-p t)
to mark the buffer as modified (as in the bug report), then
C-x C-s. Emacs proposes raw-text, which I choose. Then C-x C-c
to quit.

$ hd file
00000000  80                                                |.|
00000001

So, the file has *not* been corrupted.
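
The same before/after comparison can be scripted. The following is only
a minimal sketch in Python: the helper names `snapshot` and `unchanged`
are hypothetical, and the script does not drive Emacs itself. The editor
session is assumed to take place between the two snapshots.

```python
import hashlib
import os
import tempfile


def snapshot(path):
    """Return the SHA-256 digest of the file's raw bytes."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def unchanged(path, before):
    """True if the file's bytes still match the earlier snapshot."""
    return snapshot(path) == before


# Recreate the test file from above: a single 0x80 byte.
path = os.path.join(tempfile.mkdtemp(), "file")
with open(path, "wb") as f:
    f.write(b"\x80")

before = snapshot(path)
# ... visit the file in the editor, mark it modified, save, quit ...
print("file intact:", unchanged(path, before))
```

Comparing digests of the raw bytes is the scripted equivalent of
comparing the two hd dumps by eye.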

> Perhaps by "lossless" you mean "reversible", in the sense that saving
> the same buffer will perform the reverse conversion.

Actually, I don't mind what happens internally. What I mean is
something like: the saved file is identical to the initial file if the
buffer hasn't been modified (as above) and the default coding system(s)
proposed by Emacs are used (when visiting and when saving).
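
This criterion can be stated as a property of a decode/encode pair. The
sketch below uses Python's codecs purely as a stand-in (they are not
Emacs's coding systems and say nothing about its internal
representation), and the helper name `round_trips` is hypothetical.

```python
def round_trips(data: bytes, encoding: str, errors: str = "strict") -> bool:
    """True if decoding then re-encoding reproduces the original bytes."""
    try:
        text = data.decode(encoding, errors)
        return text.encode(encoding, errors) == data
    except UnicodeError:
        # Decoding (or re-encoding) refused the input outright.
        return False


all_bytes = bytes(range(256))

# latin-1 maps every byte to exactly one code point, so it is reversible.
print(round_trips(all_bytes, "latin-1"))

# Strict UTF-8 rejects a lone 0x80 byte instead of decoding it.
print(round_trips(b"\x80", "utf-8"))

# Substituting U+FFFD decodes without error but forgets the original byte.
print(round_trips(b"\x80", "utf-8", "replace"))
```

Under this definition, what matters is only whether the bytes written
back equal the bytes read, whatever form the text takes in between.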

> In that case, even the in-is13194-devanagari-unix is reversible: if
> you type this encoding when Emacs prompts you to select one of the
> coding systems, then you get the same file on disk with no
> corruption whatsoever.

Then this is what Emacs should propose by default on this example!
I suppose that Emacs is able to remember the encoding used to visit
the file, so that this should be possible...

> > > > But Emacs should clearly tell the user what to do after C-x C-s and
> > > > clearly say when there can be data loss.
> > > 
> > > At save time, "data loss" is wrt what's in the buffer.  In that sense,
> > > the encodings Emacs suggested don't lose any data.
> > 
> > "data loss" is the difference between the original file and the saved
> > file.
> 
> But what do you want Emacs to do with this?  When you save the buffer,
> the original file might be different or no longer be available (or not
> accessible even in principle, e.g. if the data came from a
> subprocess).

The file may be different, but in general, the encoding should remain
the same. This is particularly true when Emacs is used as the editor
by some application: if the encoding of the file has been changed by
Emacs, the application will be confused.

> These issues should be detected at file visit time, if at all, not
> at buffer save time.

Possibly (this is something that the end user doesn't have to know if
the goal is to modify a file).

> > > > Then Emacs says: "Select one of the safe coding systems listed below
> > > > [...]", but doesn't say that something has already been lost. So, the
> > > > words "safe coding systems" are really misleading.
> > > 
> > > It's misleading because you misunderstand what is "safe" at buffer
> > > save time.
> > 
> > No, it's misleading because Emacs didn't say that data were lost
> > when visiting the file.
> 
> Let's be constructive here.  Please suggest some practical way for
> Emacs to handle this situation better.
> 
> For the record, here are the various alternative ways Emacs supports
> the use case you described, when a file with inconsistent encoding
> needs to be repaired manually:
> 
>  . Visit the file with "M-x find-file-literally RET".  This yields a
>    unibyte buffer, where each byte stands for itself, and which you
>    can edit without risking en-/decoding issues.

Though the above is possible, the user often opens files with
"emacs <file>".

>  . Visit the file normally, then type "M-x hexl-mode RET" (or use 
>    "M-x hexl-find-file RET" to visit it in the first place).  This
>    revisits (or visits) the file in a unibyte buffer, and in addition
>    lets you edit the binary stuff regardless of its graphic
>    representation.

If Emacs notices a potential problem when visiting the file, this
method can be proposed by Emacs, but it shouldn't be the only way,
because the file may contain mostly ASCII characters and hex-editing
is not the best choice in such a case.

>  . After visiting the file normally and noticing that it contains
>    weird characters, or after being prompted to select a coding system
>    when saving the buffer, type "C-x RET r raw-text RET" to revisit
>    the file in raw-text encoding.  Then edit the bytes and save the
>    file.

But Emacs could propose that directly: instead of immediately decoding
the file into the buffer, Emacs could ask the user which coding system
to use.

One drawback of raw-text is that 8-bit characters are completely
unreadable. I think that there should be, for instance, a utf-8
degraded coding system: correct UTF-8 sequences are decoded using
UTF-8, and invalid sequences are left intact. Emacs can already
do this kind of thing, but there should be two differences from
the current behavior:

* When visiting the file, ask the user what to do in case Emacs
  cannot find a coding system that decodes the file cleanly, for
  instance with a "Select coding system" prompt. (BTW, couldn't hexl
  be regarded as a special coding system at this point? Perhaps
  "coding system" isn't the right term here; "editing mode" might
  be better.) Other settings in .emacs could override that,
  of course, i.e. this would just be the default.

* In the case of the UTF-8 degraded coding system, Emacs should save
  the file in the same UTF-8 degraded coding system. This is a way
  for the user to say: "I know that there are invalid sequences,
  just keep them."

UTF-8 is just an example above. The same kind of thing could be done
with other encodings.
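
An existing analog of such a degraded UTF-8 mode, outside Emacs, is
Python's `surrogateescape` error handler. The sketch below illustrates
the idea only and is not how Emacs implements or would implement it; the
names `decode_degraded` and `encode_degraded` are hypothetical.

```python
def decode_degraded(data: bytes) -> str:
    """Decode valid UTF-8 normally; keep each invalid byte as a lone surrogate."""
    return data.decode("utf-8", errors="surrogateescape")


def encode_degraded(text: str) -> bytes:
    """Inverse of decode_degraded: escaped surrogates become the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


# Valid UTF-8 ("é" is 0xC3 0xA9) followed by a stray 0x80 byte.
original = b"caf\xc3\xa9 \x80 end\n"
text = decode_degraded(original)

# The valid part is readable; the stray byte is carried as U+DC80.
print(ascii(text))

# Saving without changes gives back exactly the original bytes.
print(encode_degraded(text) == original)

# Editing the readable part leaves the invalid byte in place.
edited = text.replace("end", "done")
print(encode_degraded(edited))
```

The point of the design is that the same handler is used in both
directions, so invalid sequences survive both an unmodified save and an
edit elsewhere in the text.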

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)




