From: Vincent Lefevre
Newsgroups: gmane.emacs.bugs
Subject: bug#13505: Bug#696026: emacs24: file corruption on saving
Date: Tue, 22 Jan 2013 03:35:57 +0100
Message-ID: <20130122023557.GA25002@xvii.vinc17.org>
In-Reply-To: <83vcaqo1yv.fsf@gnu.org>
To: Eli Zaretskii
Cc: 696026-forwarded@bugs.debian.org, 696026@bugs.debian.org,
    rlb@defaultvalue.org, 13505@debbugs.gnu.org

On 2013-01-21 19:55:20 +0200, Eli Zaretskii wrote:
> > Date: Mon, 21 Jan 2013 05:14:10 +0100
> > From: Vincent Lefevre
> > Cc: rlb@defaultvalue.org, handa@gnu.org, 13505@debbugs.gnu.org,
> >     696026-forwarded@bugs.debian.org, 696026@bugs.debian.org
> >
> > On 2013-01-21 05:48:14 +0200, Eli Zaretskii wrote:
> > > > You said:
> > > >
> > > > | The original encoded form of the characters as found on disk at
> > > > | visit time _cannot_ be recovered by saving with raw-text, because
> > > > | that encoded form is lost without a trace when the file is _visited_
> > > >        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > | and decoded into the internal representation.
> > > >
> > > > This is what lossy is.
> > >
> > > In that sense, every encoding except no-conversion is lossy.
> >
> > Even 8-bit encodings such as latin-1?
>
> Yes.  When latin-1 characters are decoded (as part of visiting a
> file), they are converted to the internal representation, and cease to
> be single 8-bit bytes.

Any example where saving the file without modifying it (see below)
would modify the data (as a sequence of bytes on the disk)?

> > > > On the opposite, the utf-8 encoding doesn't seem to be lossy: Emacs
> > > > seems to handle files with invalid UTF-8 sequences without any loss.
> > > > So, this encoding is safe, even if Emacs wrongly guess the encoding.
> > >
> > > No, it isn't, although you could get away with it most of the time.
> >
> > Could you give an example where one loses data with the utf-8 encoding?
>
> E.g., in your test file, the byte whose value is 0x80 is converted to
> 0x3fff80 when the file is read into a buffer.

No, there are no problems with this example:

$ printf "\x80" > file
$ hd file
00000000  80                                                |.|
00000001
$ emacs -q file

Here the encoding chosen by Emacs is utf-8-unix. Then I do
M-: (set-buffer-modified-p t) to mark the buffer as modified
(as in the bug report), then C-x C-s. Emacs proposes raw-text,
which I choose. Then C-x C-c to quit.

$ hd file
00000000  80                                                |.|
00000001

So, the file has *not* been corrupted.

> Perhaps by "lossless" you mean "reversible", in the sense that saving
> the same buffer will perform the reverse conversion.

Actually I don't mind what occurs internally. What I mean is things
like: saved file = initial file if it hasn't been modified (as above)
and with the default encoding(s) proposed by Emacs (when visiting
and when saving).

> In that case, even the in-is13194-devanagari-unix is reversible: if
> you type this encoding when Emacs prompts you to select one of the
> coding systems, then you get the same file on disk with no
> corruption whatsoever.

Then this is what Emacs should propose by default on this example!
I suppose that Emacs is able to remember the encoding used to visit
the file, so that this should be possible...

> > > > But Emacs should clearly tell the user what to do after C-x C-s and
> > > > clearly say when there can be data loss.
> > >
> > > At save time, "data loss" is wrt what's in the buffer.  In that sense,
> > > the encodings Emacs suggested don't lose any data.
> >
> > "data loss" is the difference between the original file and the saved
> > file.
>
> But what do you want Emacs to do with this?  When you save the buffer,
> the original file might be different or no longer be available (or not
> accessible even in principle, e.g. if the data came from a
> subprocess).

The file may be different, but in general, the encoding should remain
the same.
This is particularly true when Emacs is used as the editor by some
application: if the encoding of the file has been changed by Emacs,
the application will be confused.

> These issues should be detected at file visit time, if at all, not
> at buffer save time.

Possibly (this is something that the end user doesn't have to know
if the goal is to modify a file).

> > > > Then Emacs says: "Select one of the safe coding systems listed below
> > > > [...]", but doesn't say that something has already been lost. So, the
> > > > words "safe coding systems" are really misleading.
> > >
> > > It's misleading because you misunderstand what is "safe" at buffer
> > > save time.
> >
> > No, it's misleading because Emacs didn't say that data were lost
> > when visiting the file.
>
> Let's be constructive here.  Please suggest some practical way for
> Emacs to handle this situation better.
>
> For the record, here are the various alternative ways Emacs supports
> the use case you described, when a file with inconsistent encoding
> needs to be repaired manually:
>
>  . Visit the file with "M-x find-file-literally RET".  This yields a
>    unibyte buffer, where each byte stands for itself, and which you
>    can edit without risking en-/decoding issues.

Though the above is possible, the user often opens files with
"emacs ".

>  . Visit the file normally, then type "M-x hexl-mode RET" (or use
>    "M-x hexl-find-file RET" to visit it in the first place).  This
>    revisits (or visits) the file in a unibyte buffer, and in addition
>    lets you edit the binary stuff regardless of its graphic
>    representation.

If Emacs notices a potential problem when visiting the file, this
method can be proposed by Emacs, but it shouldn't be the only way,
because the file may contain mostly ASCII characters and hex-editing
is not the best choice in such a case.

>  . After visiting the file normally and noticing that it contains
>    weird characters, or after being prompted to select a coding system
>    when saving the buffer, type "C-x RET r raw-text RET" to revisit
>    the file in raw-text encoding.  Then edit the bytes and save the
>    file.

But that could be proposed by Emacs directly: instead of decoding
the file directly in the buffer, Emacs could ask the user which
coding system they want to use.

One drawback of raw-text is that 8-bit characters are completely
unreadable. I think that there should be, for instance, a "UTF-8
degraded" coding system: correct UTF-8 sequences are decoded using
UTF-8, and invalid sequences are left intact. Emacs can already do
this kind of thing, but there should be 2 differences from the
current behavior:

* When visiting the file, ask the user what to do in case Emacs
  cannot select a coding system that decodes the file cleanly. For
  instance, a "Select coding system" prompt. (BTW, couldn't hexl
  be regarded as a special coding system at this point? Perhaps
  "coding system" isn't the right term here, "editing mode" might
  be better.) Other settings in .emacs could override that, of
  course, i.e. this would just be the default.

* In case of the UTF-8 degraded coding system, Emacs should save the
  file in the same UTF-8 degraded coding system. This is a way for
  the user to say: "I know that there are invalid sequences, just
  keep them."

UTF-8 is just an example above. The same kind of thing could be
done with other encodings.

-- 
Vincent Lefèvre - Web:
100% accessible validated (X)HTML - Blog:
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
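
P.S. To make the "UTF-8 degraded" idea concrete, here is a minimal
sketch in Python rather than Emacs Lisp. It does not describe how Emacs
works internally; it only relies on Python's standard "surrogateescape"
error handler, which decodes valid UTF-8 normally and maps each invalid
byte to a lone surrogate so that it can be written back unchanged. The
function names decode_degraded and encode_degraded are hypothetical and
chosen for this illustration only.

```python
# Illustration only (not Emacs code): a byte-preserving "UTF-8 degraded"
# round trip built on Python's standard "surrogateescape" error handler.


def decode_degraded(data: bytes) -> str:
    """Decode valid UTF-8 normally; keep each invalid byte as a lone surrogate."""
    return data.decode("utf-8", errors="surrogateescape")


def encode_degraded(text: str) -> bytes:
    """Reverse of decode_degraded: escaped bytes are written back unchanged."""
    return text.encode("utf-8", errors="surrogateescape")


# A valid two-byte sequence (e-acute) followed by a stray 0x80 byte,
# similar to the one-byte test file used earlier in this message.
original = b"caf\xc3\xa9 \x80 end"

text = decode_degraded(original)       # readable text plus one escaped byte
edited = text.replace("end", "fin")    # edit only the readable part
saved = encode_degraded(edited)        # the stray 0x80 byte is kept as-is

# Saving without any modification gives back the initial bytes.
unchanged = encode_degraded(decode_degraded(original))
```

With this scheme, "saved file = initial file" holds whenever the text
is not modified, and an edit touches only the bytes of the characters
that were actually changed.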