From: Eli Zaretskii
To: Vincent Lefevre
Cc: 696026-forwarded@bugs.debian.org, 696026@bugs.debian.org,
    rlb@defaultvalue.org, 13505@debbugs.gnu.org
Newsgroups: gmane.emacs.bugs
Subject: bug#13505: Bug#696026: emacs24: file corruption on saving
Date: Tue, 22 Jan 2013 09:56:44 +0200
Message-ID: <83k3r5odkz.fsf@gnu.org>
In-reply-to: <20130122023557.GA25002@xvii.vinc17.org>

> Date: Tue, 22 Jan 2013 03:35:57 +0100
> From: Vincent Lefevre
> Cc: rlb@defaultvalue.org, handa@gnu.org, 13505@debbugs.gnu.org,
> 696026-forwarded@bugs.debian.org, 696026@bugs.debian.org
>
> > > > > | The original encoded form of the characters as found on disk at
> > > > > | visit time _cannot_ be recovered by saving with raw-text, because
> > > > > | that encoded form is lost without a trace when the file is _visited_
> > > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > > | and decoded into the internal representation.
> > > > >
> > > > > This is what lossy is.
> > > >
> > > > In that sense, every encoding except no-conversion is lossy.
> > >
> > > Even 8-bit encodings such as latin-1?
> >
> > Yes. When latin-1 characters are decoded (as part of visiting a
> > file), they are converted to the internal representation, and cease to
> > be single 8-bit bytes.
>
> Any example where saving the file without modifying it (see below)
> would modify the data (as a sequence of bytes on the disk)?

See above: I was talking about changes at file-visit time.

> > > > > On the opposite, the utf-8 encoding doesn't seem to be lossy: Emacs
> > > > > seems to handle files with invalid UTF-8 sequences without any loss.
> > > > > So, this encoding is safe, even if Emacs wrongly guess the encoding.
> > > >
> > > > No, it isn't, although you could get away with it most of the time.
> > >
> > > Could you give an example where one loses data with the utf-8 encoding?
> >
> > E.g., in your test file, the byte whose value is 0x80 is converted to
> > 0x3fff80 when the file is read into a buffer.
>
> No, there are no problems with this example:

Again, because we are talking about two different things.

> > Perhaps by "lossless" you mean "reversible", in the sense that saving
> > the same buffer will perform the reverse conversion.
>
> Actually I don't mind what occurs internally. What I mean is things
> like: saved file = initial file if it hasn't been modified (as above)
> and with the default encoding(s) proposed by Emacs (when visiting and
> when saving).

That's reversibility.

> > In that case, even the in-is13194-devanagari-unix is reversible: if
> > you type this encoding when Emacs prompts you to select one of the
> > coding systems, then you get the same file on disk with no
> > corruption whatsoever.
>
> Then this is what Emacs should propose by default on this example!

It can't easily do that. There are 2 different use cases here:

  1) A file was visited and its encoding was found to be inconsistent.
     Then it is being saved. This is your use case.

  2) A file was modified by adding to it characters that cannot be
     encoded by the original encoding. For example, you visit a
     Latin-1 encoded file, then add to it characters that are outside
     the coverage of Latin-1. Then you save the file.

What Emacs proposes is biased toward the second use case, because it
is by far the most frequent one. The other use case is supposed to be
treated by other means, those which I mentioned in my previous mail.

Giving instructions for both use cases is not a good idea, IMO,
because it will confuse users who do not necessarily understand what
is going on and in particular don't realize which of the two
situations they are in.

> I suppose that Emacs is able to remember the encoding used to visit
> the file, so that this should be possible...

It does remember. It actually shows it in the "select safe coding
system" prompt. The problem is that its use can do the wrong thing in
the second use case above.
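
To make the two use cases concrete, here is a rough analogy in Python
(not Emacs Lisp, and not how Emacs implements any of this). It assumes
Python's "surrogateescape" error handler as a stand-in for the way
Emacs keeps undecodable bytes in the buffer; can_encode is a
hypothetical helper introduced only for this sketch.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)  # strict mode: raises on unencodable characters
        return True
    except UnicodeEncodeError:
        return False


# Use case 1: the file contains a stray byte (0x80) that is not valid UTF-8.
# The buffer is NOT modified after visiting.
mixed = b"valid \xc3\xa9, stray \x80\n"
visited = mixed.decode("utf-8", errors="surrogateescape")
strict_save_ok_1 = can_encode(visited, "utf-8")
round_trip_1 = visited.encode("utf-8", errors="surrogateescape")

# Use case 2: a clean Latin-1 file gains a character outside Latin-1.
latin1_file = "caf\u00e9".encode("latin-1")
edited = latin1_file.decode("latin-1") + " \u20ac"  # EURO SIGN is not in Latin-1
strict_save_ok_2 = can_encode(edited, "latin-1")

print("use case 1: strict save possible?", strict_save_ok_1)
print("use case 1: reversible with the escape handler?", round_trip_1 == mixed)
print("use case 2: strict save possible?", strict_save_ok_2)
```

In this sketch a strict save with the visiting encoding fails in both
situations, yet only in the first one does a reverse conversion
reproduce the original bytes. A check made at save time sees the same
failure either way, which is the reason one default suggestion cannot
serve both cases well.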
> > > > > But Emacs should clearly tell the user what to do after C-x C-s and
> > > > > clearly say when there can be data loss.
> > > >
> > > > At save time, "data loss" is wrt what's in the buffer. In that sense,
> > > > the encodings Emacs suggested don't lose any data.
> > >
> > > "data loss" is the difference between the original file and the saved
> > > file.
> >
> > But what do you want Emacs to do with this? When you save the buffer,
> > the original file might be different or no longer be available (or not
> > accessible even in principle, e.g. if the data came from a
> > subprocess).
>
> The file may be different, but in general, the encoding should remain
> the same.

That's what Emacs does, as long as it can. But in this case, that
encoding might produce an inconsistently encoded file, so Emacs
doesn't want to do that silently. It has no idea that the file was
inconsistently encoded in the first place, nor that you _want_ it to
continue being inconsistently encoded.

> This is particularly true when Emacs is used as the editor by some
> application: if the encoding of the file has been changed by Emacs,
> the application will be confused.

Again, that's what Emacs does normally, if that encoding can do the
job. Producing inconsistent encoding will certainly confuse those
other programs.

> > These issues should be detected at file visit time, if at all, not
> > at buffer save time.
>
> Possibly (this is something that the end user doesn't have to know if
> the goal is to modify a file).

This use case proves otherwise.

> > . Visit the file with "M-x find-file-literally RET". This yields a
> >   unibyte buffer, where each byte stands for itself, and which you
> >   can edit without risking en-/decoding issues.
>
> Though the above is possible, the user often opens files with
> "emacs ".

Many users have Emacs up and running for the entire session.

> > . Visit the file normally, then type "M-x hexl-mode RET" (or use
> >   "M-x hexl-find-file RET" to visit it in the first place). This
> >   revisits (or visits) the file in a unibyte buffer, and in addition
> >   lets you edit the binary stuff regardless of its graphic
> >   representation.
>
> If Emacs notices a potential problem when visiting the file, this
> method can be proposed by Emacs, but it shouldn't be the only way,
> because the file may contain mostly ASCII characters and hex-editing
> is not the best choice in such a case.

??? Hexl Mode shows the printable characters (at the right side of the
display) in addition to the codes. What exactly is the problem here?

> > . After visiting the file normally and noticing that it contains
> >   weird characters, or after being prompted to select a coding system
> >   when saving the buffer, type "C-x RET r raw-text RET" to revisit
> >   the file in raw-text encoding. Then edit the bytes and save the
> >   file.
>
> But that could be proposed by Emacs directly: instead of decoding the
> file directly in the buffer, Emacs could ask the user which coding
> system he wants to use.

That'd be a nuisance, I think, because more often than not, keeping
the original inconsistent encoding is not what the user wants.

> One drawback of raw-text is that 8-bit characters are completely
> unreadable.

That's why I listed it last.
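
As an illustration of the Hexl Mode point above (codes on the left,
printable characters on the right), here is a small Python sketch of
that kind of display. It only approximates the idea: the column layout
and the hex_dump name are assumptions made for this example, not the
actual output format of hexl-mode.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Render bytes as offset, hex codes, and a printable-ASCII column."""
    pad = width * 3 - 1  # width of a full row of "xx" codes joined by spaces
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        codes = " ".join(f"{b:02x}" for b in chunk)
        # Printable ASCII is shown as itself; every other byte becomes ".".
        text = "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in chunk)
        lines.append(f"{offset:08x}: {codes.ljust(pad)}  {text}")
    return "\n".join(lines)


# A mostly-ASCII file with one stray 8-bit byte stays readable on the right.
print(hex_dump(b"mostly ASCII text, one stray byte: \x80\n"))
```

With a mostly-ASCII file, the right-hand column reads almost like the
plain text, and only the stray byte has to be located through its code
on the left.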