* bug#13505: Bug#696026: emacs24: file corruption on saving [not found] <20121215223809.GA7549@xvii.vinc17.org> @ 2013-01-20 4:09 ` Rob Browning 2013-01-20 16:49 ` Eli Zaretskii 0 siblings, 1 reply; 16+ messages in thread From: Rob Browning @ 2013-01-20 4:09 UTC (permalink / raw To: 13505; +Cc: 696026-forwarded, Vincent Lefevre, 696026 (If possible, please preserve the *-forwarded address in any replies.) The following bug was reported to Debian. I've tested both the Debian emacs24 package, and current upstream emacs-24, as of: Author: Leo Liu <sdl.web@gmail.com> Date: Sat Jan 19 02:35:44 2013 +0800 Prune erroneous values in dired-get-marked-files In both cases, I was able to reproduce the reported issue. Please let me know if I can provide further information. Vincent Lefevre <vincent@vinc17.net> writes: > Package: emacs24 > Version: 24.2+1-1 > Severity: grave > Justification: causes non-serious data loss > > The file "file1" (attached) has the following contents: > > 00000000 6c e2 80 99 c3 a9 0a 74 65 73 74 e9 0a |l......test..| > > 1. Open "file1" with "emacs -Q". It is regarded as > an in-is13194-devanagari-unix file. > > 2. Type M-: (set-buffer-modified-p t) to mark the buffer as modified > (so that one can save it). > > 3. Save the file with C-x C-s. It is proposed: > > [...] > Select one of the safe coding systems listed below, > or cancel the writing with C-g and edit the buffer > to remove or modify the problematic characters, > or specify any other coding system (and risk losing > the problematic characters). > > raw-text emacs-mule no-conversion > > 4. Choose raw-text (the default) or no-conversion. One can assume > that the file will not be modified. But it gets corrupted: one > obtains a file "file2" (attached) with the following contents: > > 00000000 6c e0 a5 88 80 99 e0 a4 a5 e0 a4 8a 0a 74 65 73 |l............tes| > 00000010 74 e0 a4 bc 0a |t....| > > Note: Actually "file1" has mixed UTF-8 and ISO-8859-1 contents due to > a user error. 
But due to this bug, an attempt to fix the problem with > Emacs makes things even worse! BTW, I had the same problem in the past > when attempting to edit an mbox file with Emacs (in this case, having > mixed UTF-8 and ISO-8859-1 contents is normal). How Emacs interprets > such contents doesn't matter, but by default, it mustn't corrupt the > file on saving. > > There is no such problem with GNU Emacs 23.4.1 (Debian package > emacs23 23.4+1-4). > > -- System Information: > Debian Release: 7.0 > APT prefers unstable > APT policy: (500, 'unstable'), (500, 'testing'), (500, 'stable'), (1, 'experimental') > Architecture: amd64 (x86_64) > > Kernel: Linux 3.5-trunk-amd64 (SMP w/2 CPU cores) > Locale: LANG=POSIX, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8) > Shell: /bin/sh linked to /bin/dash > > Versions of packages emacs24 depends on: > ii emacs24-bin-common 24.2+1-1 > ii gconf-service 3.2.5-1+build1 > ii libasound2 1.0.25-4 > ii libatk1.0-0 2.4.0-2 > ii libc6 2.13-37 > ii libcairo2 1.12.2-2 > ii libdbus-1-3 1.6.8-1 > ii libfontconfig1 2.9.0-7.1 > ii libfreetype6 2.4.9-1 > ii libgconf-2-4 3.2.5-1+build1 > ii libgdk-pixbuf2.0-0 2.26.1-1 > ii libgif4 4.1.6-10 > ii libglib2.0-0 2.33.12+really2.32.4-3 > ii libgnutls26 2.12.20-2 > ii libgomp1 4.7.2-4 > ii libgpm2 1.20.4-6 > ii libgtk2.0-0 2.24.10-2 > ii libice6 2:1.0.8-2 > ii libjpeg8 8d-1 > ii libm17n-0 1.6.3-2 > ii libmagickcore5 8:6.7.7.10-5 > ii libmagickwand5 8:6.7.7.10-5 > ii libncurses5 5.9-10 > ii libotf0 0.9.12-2 > ii libpango1.0-0 1.30.0-1 > ii libpng12-0 1.2.49-3 > ii librsvg2-2 2.36.1-1 > ii libselinux1 2.1.9-5 > ii libsm6 2:1.2.1-2 > ii libtiff4 3.9.6-9 > ii libtinfo5 5.9-10 > ii libx11-6 2:1.5.0-1 > ii libxft2 2.3.1-1 > ii libxml2 2.8.0+dfsg1-7 > ii libxpm4 1:3.5.10-1 > ii libxrender1 1:0.9.7-1 > ii zlib1g 1:1.2.7.dfsg-13 > > emacs24 recommends no packages. 
> > Versions of packages emacs24 suggests: > ii emacs24-common-non-dfsg 24.2+1-1 > > -- no debconf information -- Rob Browning rlb @defaultvalue.org and @debian.org GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4 ^ permalink raw reply [flat|nested] 16+ messages in thread
* bug#13505: Bug#696026: emacs24: file corruption on saving 2013-01-20 4:09 ` bug#13505: Bug#696026: emacs24: file corruption on saving Rob Browning @ 2013-01-20 16:49 ` Eli Zaretskii 2013-01-20 17:31 ` Rob Browning ` (3 more replies) 0 siblings, 4 replies; 16+ messages in thread From: Eli Zaretskii @ 2013-01-20 16:49 UTC (permalink / raw To: Rob Browning, Kenichi Handa; +Cc: 696026-forwarded, vincent, 696026, 13505 > From: Rob Browning <rlb@defaultvalue.org> > Date: Sat, 19 Jan 2013 22:09:28 -0600 > Cc: 696026-forwarded@bugs.debian.org, Vincent Lefevre <vincent@vinc17.net>, > 696026@bugs.debian.org > > Vincent Lefevre <vincent@vinc17.net> writes: > > > Package: emacs24 > > Version: 24.2+1-1 > > Severity: grave > > Justification: causes non-serious data loss > > > > The file "file1" (attached) has the following contents: > > > > 00000000 6c e2 80 99 c3 a9 0a 74 65 73 74 e9 0a |l......test..| > > > > 1. Open "file1" with "emacs -Q". It is regarded as > > an in-is13194-devanagari-unix file. > > > > 2. Type M-: (set-buffer-modified-p t) to mark the buffer as modified > > (so that one can save it). > > > > 3. Save the file with C-x C-s. It is proposed: > > > > [...] > > Select one of the safe coding systems listed below, > > or cancel the writing with C-g and edit the buffer > > to remove or modify the problematic characters, > > or specify any other coding system (and risk losing > > the problematic characters). > > > > raw-text emacs-mule no-conversion > > > > 4. Choose raw-text (the default) or no-conversion. One can assume > > that the file will not be modified. But it gets corrupted: one > > obtains a file "file2" (attached) with the following contents: > > > > 00000000 6c e0 a5 88 80 99 e0 a4 a5 e0 a4 8a 0a 74 65 73 |l............tes| > > 00000010 74 e0 a4 bc 0a |t....| > > > > Note: Actually "file1" has mixed UTF-8 and ISO-8859-1 contents due to > > a user error. But due to this bug, an attempt to fix the problem with > > Emacs makes things even worse! 
BTW, I had the same problem in the past > > when attempting to edit an mbox file with Emacs (in this case, having > > mixed UTF-8 and ISO-8859-1 contents is normal). How Emacs interprets > > such contents doesn't matter, but by default, it mustn't corrupt the > > file on saving. > > > > There is no such problem with GNU Emacs 23.4.1 (Debian package > > emacs23 23.4+1-4). First, this isn't really a regression: Emacs 23 has the same "problem". It's just that Emacs 23 doesn't autodetect in-is13194-devanagari in this file, while Emacs 24 does. If you say "C-x RET c raw-text RET C-x C-f" to visit this file in Emacs 24, the problem will be gone, which is exactly what happens in Emacs 23, because it visits the file in raw-text to begin with. Conversely, if you use "C-x RET c in-is13194-devanagari RET C-x C-f" to visit the file in Emacs 23, you will get the same "problem" saving it. I didn't research the reason why Emacs 24 autodetects this encoding, and whether this is on purpose. Perhaps Handa-san could tell. More to the point: there seems to be a fundamental misunderstanding here regarding the effect of selecting an encoding at save time. It sounds like the OP thought that selecting a "literal" encoding, such as raw-text, which is supposed to leave the binary stream unaltered (apart of the EOL format), will ensure that a buffer will be saved exactly as it was originally found on disk. But this is false. What raw-text and no-conversion do is to write out the _internal_ representation of each character without any conversions. The original encoded form of the characters as found on disk at visit time _cannot_ be recovered by saving with raw-text, because that encoded form is lost without a trace when the file is _visited_ and decoded into the internal representation. The only information that's left is the coding-system used to decode the characters. 
But since the file's encoding in this case is inconsistent, that coding-system cannot be used to save it back (Emacs will not let you do so, as demonstrated in the report), and therefore the original form cannot be recovered this way. What the user should do to avoid this data loss is prevent the incorrect decoding of the file's contents when the file is visited. To this end, the file should be visited with no-conversion or raw-text, using "C-x RET c raw-text RET C-x C-f". Then it will be possible to repair the file and write it back using the same raw-text encoding. If the fact that the file's encoding is inconsistent is not realized until some time after the file is visited, the user should use "C-x RET r raw-text RET" to re-visit the file using raw-text. IOW, only selecting the appropriate encoding _at_visit_time_ can prevent data loss in these cases. The expectation that "Emacs mustn't corrupt the file on saving" when the file has inconsistent encoding and was decoded with anything but raw-text or no-conversion is unjustified. Personally, I don't think there's a bug here. It's a cockpit error. ^ permalink raw reply [flat|nested] 16+ messages in thread
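The visit-time versus save-time distinction can be illustrated outside Emacs. What follows is a minimal Python sketch, not Emacs code: Python codecs stand in for Emacs coding systems purely for illustration, and `visit_then_save` is a hypothetical helper name. The bytes are those of "file1" from the report's hex dump. Latin-1 is used only because it maps every byte to exactly one code point, which makes it a convenient stand-in for a byte-preserving visit.

```python
# Bytes of "file1", copied from the hex dump in the report.
FILE1 = bytes.fromhex("6c e2 80 99 c3 a9 0a 74 65 73 74 e9 0a")


def visit_then_save(data: bytes, visit_codec: str, save_codec: str) -> bytes:
    """Hypothetical helper: decode on 'visit', re-encode on 'save'."""
    return data.decode(visit_codec).encode(save_codec)


# Visit and save with the same byte-preserving codec: the file is unchanged.
unchanged = visit_then_save(FILE1, "latin-1", "latin-1")

# Visit with one codec, save with another: the decoded characters are
# written out faithfully, but the bytes on disk no longer match the original.
changed = visit_then_save(FILE1, "latin-1", "utf-8")

print(unchanged == FILE1)
print(changed == FILE1)
print(changed.hex(" "))
```

The second call loses nothing relative to the decoded text, yet the saved file differs from the original, which is the situation described in the report.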
* bug#13505: Bug#696026: emacs24: file corruption on saving
From: Rob Browning @ 2013-01-20 17:31 UTC
To: Eli Zaretskii; +Cc: 696026-forwarded, vincent, 696026, 13505

Eli Zaretskii <eliz@gnu.org> writes:

> More to the point: there seems to be a fundamental misunderstanding
> here regarding the effect of selecting an encoding at save time. It
> sounds like the OP thought that selecting a "literal" encoding, such
> as raw-text, which is supposed to leave the binary stream unaltered
> (apart of the EOL format), will ensure that a buffer will be saved
> exactly as it was originally found on disk. But this is false. What
> raw-text and no-conversion do is to write out the _internal_
> representation of each character without any conversions. The
> original encoded form of the characters as found on disk at visit time
> _cannot_ be recovered by saving with raw-text, because that encoded
> form is lost without a trace when the file is _visited_ and decoded
> into the internal representation. The only information that's left is
> the coding-system used to decode the characters. But since the file's
> encoding in this case is inconsistent, that coding-system cannot be
> used to save it back (Emacs will not let you do so, as demonstrated in
> the report), and therefore the original form cannot be recovered this
> way.

Ahh, right; that makes sense to me.

--
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4
* bug#13505: Bug#696026: emacs24: file corruption on saving
From: Glenn Morris @ 2013-01-20 20:24 UTC
To: Eli Zaretskii; +Cc: 13505

Eli Zaretskii wrote:

> Personally, I don't think there's a bug here. It's a cockpit error.

Does this also apply to

http://debbugs.gnu.org/cgi/bugreport.cgi?bug=13377

I already marked that important, forwarded it, and suggested it was
related to the original Debian report in this issue, but have received
no response. Please follow up to 13377 if you have any info.
* bug#13505: Bug#696026: emacs24: file corruption on saving
From: Vincent Lefevre @ 2013-01-20 21:25 UTC
To: Eli Zaretskii; +Cc: 696026-forwarded, 696026, Rob Browning, 13505

On 2013-01-20 18:49:38 +0200, Eli Zaretskii wrote:
> Personally, I don't think there's a bug here. It's a cockpit error.

Perhaps it isn't a bug at save time. But then, selecting a lossy
encoding by default when visiting the file is the bug (and really
a regression), particularly if this isn't clearly told to the user.

Actually this is related, since the lossy encoding becomes a real
problem only at save time (and for copy-paste I assume, though the
file doesn't get overwritten by that).

--
Vincent Lefèvre <vincent@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
* bug#13505: Bug#696026: emacs24: file corruption on saving
From: Eli Zaretskii @ 2013-01-20 21:40 UTC
To: Vincent Lefevre; +Cc: 696026-forwarded, 696026, rlb, 13505

> Date: Sun, 20 Jan 2013 22:25:08 +0100
> From: Vincent Lefevre <vincent@vinc17.net>
> Cc: Rob Browning <rlb@defaultvalue.org>, Kenichi Handa <handa@gnu.org>,
>     13505@debbugs.gnu.org, 696026-forwarded@bugs.debian.org,
>     696026@bugs.debian.org
>
> On 2013-01-20 18:49:38 +0200, Eli Zaretskii wrote:
> > Personally, I don't think there's a bug here. It's a cockpit error.
>
> Perhaps it isn't a bug at save time. But then, selecting a lossy
> encoding by default when visiting the file is the bug (and really
> a regression), particularly if this isn't clearly told to the user.

The encoding isn't lossy.

In any case, I don't really understand your proposal. Suppose the
file was indeed encoded in in-is13194-devanagari, would you argue then
that selecting it would be incorrect or undesirable behavior?

> Actually this is related, since the lossy encoding becomes a real
> problem only at save time (and for copy-paste I assume, though the
> file doesn't get overwritten by that).

It is only a problem when you try to save or otherwise output it
(e.g., send in an email).

But what you should do then is "C-x RET r raw-text RET", and recover.
That is the only way to avoid corruption in files that use
inconsistent encoding.
* bug#13505: Bug#696026: emacs24: file corruption on saving 2013-01-20 21:40 ` Eli Zaretskii @ 2013-01-20 22:10 ` Vincent Lefevre 2013-01-20 22:22 ` Vincent Lefevre 2013-01-21 3:48 ` Eli Zaretskii 0 siblings, 2 replies; 16+ messages in thread From: Vincent Lefevre @ 2013-01-20 22:10 UTC (permalink / raw To: Eli Zaretskii; +Cc: 696026-forwarded, 696026, rlb, 13505 On 2013-01-20 23:40:14 +0200, Eli Zaretskii wrote: > > Date: Sun, 20 Jan 2013 22:25:08 +0100 > > From: Vincent Lefevre <vincent@vinc17.net> > > Cc: Rob Browning <rlb@defaultvalue.org>, Kenichi Handa <handa@gnu.org>, > > 13505@debbugs.gnu.org, 696026-forwarded@bugs.debian.org, > > 696026@bugs.debian.org > > > > On 2013-01-20 18:49:38 +0200, Eli Zaretskii wrote: > > > Personally, I don't think there's a bug here. It's a cockpit error. > > > > Perhaps it isn't a bug at save time. But then, selecting a lossy > > encoding by default when visiting the file is the bug (and really > > a regression), particularly if this isn't clearly told to the user. > > The encoding isn't lossy. You said: | The original encoded form of the characters as found on disk at | visit time _cannot_ be recovered by saving with raw-text, because | that encoded form is lost without a trace when the file is _visited_ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | and decoded into the internal representation. This is what lossy is. On the opposite, the utf-8 encoding doesn't seem to be lossy: Emacs seems to handle files with invalid UTF-8 sequences without any loss. So, this encoding is safe, even if Emacs wrongly guess the encoding. > In any case, I don't really understand your proposal. Suppose the > file was indeed encoded in in-is13194-devanagari, would you argue then > that selecting it would be incorrect or undesirable behavior? If Emacs modifies the contents when saving the file, it would be incorrect. 
> > Actually this is related, since the lossy encoding becomes a real
> > problem only at save time (and for copy-paste I assume, though the
> > file doesn't get overwritten by that).
>
> It is only a problem when you try to save or otherwise output it
> (e.g., send in an email).
>
> But what you should do then is "C-x RET r raw-text RET", and recover.
> That is the only way to avoid corruption in files that use
> inconsistent encoding.

But Emacs should clearly tell the user what to do after C-x C-s and
clearly say when there can be data loss. Currently it says:

"These default coding systems were tried to encode text
in the buffer `file1':
  (in-is13194-devanagari-unix (2 . 2376) (3 . 4194176) (4 . 4194201)
  (5 . 2341) (6 . 2314) (12 . 2364))
  (utf-8-unix (3 . 4194176) (4 . 4194201))
However, each of them encountered characters it couldn't encode:
  in-is13194-devanagari-unix cannot encode these: [...]
  utf-8-unix cannot encode these: [...]"

This shouldn't be regarded as a problem by the user, because if Emacs
could read and interpret the file (and such characters have not been
added by the user), it should be able to save it.

Then Emacs says: "Select one of the safe coding systems listed below
[...]", but doesn't say that something has already been lost. So, the
words "safe coding systems" are really misleading.

--
Vincent Lefèvre <vincent@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
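The pairs in the Emacs message quoted above appear to be (position . character code) entries for the decoded buffer. As a rough cross-check, the Python sketch below rebuilds the 13-character buffer from those codes and writes it out, then compares the result with the hex dump of "file2" from the original report. Two assumptions are made and are not taken from Emacs sources: codes from 0x3FFF80 upward stand for raw bytes 0x80 to 0xFF (note that 4194176 = 0x3FFF80), and every other character is written as its UTF-8 encoding. The helper name `write_out` is hypothetical.

```python
RAW_BYTE_BASE = 0x3FFF00  # assumption: code 0x3FFF80 stands for raw byte 0x80

# (position . code) pairs from the quoted message; positions start at 1.
reported = {2: 2376, 3: 4194176, 4: 4194201, 5: 2341, 6: 2314, 12: 2364}
# The remaining positions of file1 are plain ASCII and decode unchanged.
ascii_positions = {1: "l", 7: "\n", 8: "t", 9: "e", 10: "s", 11: "t", 13: "\n"}

codes = [
    reported[pos] if pos in reported else ord(ascii_positions[pos])
    for pos in range(1, 14)
]


def write_out(char_codes):
    """Hypothetical model of the save: raw bytes as-is, the rest as UTF-8."""
    out = bytearray()
    for code in char_codes:
        if code >= RAW_BYTE_BASE + 0x80:
            out.append(code - RAW_BYTE_BASE)
        else:
            out += chr(code).encode("utf-8")
    return bytes(out)


# Hex dump of "file2" from the original report.
FILE2 = bytes.fromhex(
    "6c e0 a5 88 80 99 e0 a4 a5 e0 a4 8a 0a 74 65 73 74 e0 a4 bc 0a"
)

print(write_out(codes) == FILE2)
```

Under these assumptions the saved bytes are exactly the buffer's characters written out, with no trace of the original on-disk bytes e2, c3, a9 and e9.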
* bug#13505: Bug#696026: emacs24: file corruption on saving
From: Vincent Lefevre @ 2013-01-20 22:22 UTC
To: Eli Zaretskii; +Cc: 696026-forwarded, 696026, rlb, 13505

On 2013-01-20 23:10:08 +0100, Vincent Lefevre wrote:
> But Emacs should clearly tell the user what to do after C-x C-s and
> clearly say when there can be data loss. Currently it says:
[...]

In fact, I fear that this may not be sufficient, because some data
loss silently occurs when visiting the file. If after the decoding,
it appears that there are no problematic characters (is this
possible?), the user would be able to save the file without any
message from Emacs.

--
Vincent Lefèvre <vincent@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
* bug#13505: Bug#696026: emacs24: file corruption on saving
From: Eli Zaretskii @ 2013-01-21 3:49 UTC
To: Vincent Lefevre; +Cc: 696026-forwarded, 696026, rlb, 13505

> Date: Sun, 20 Jan 2013 23:22:11 +0100
> From: Vincent Lefevre <vincent@vinc17.net>
> Cc: rlb@defaultvalue.org, handa@gnu.org, 13505@debbugs.gnu.org,
>     696026-forwarded@bugs.debian.org, 696026@bugs.debian.org
>
> On 2013-01-20 23:10:08 +0100, Vincent Lefevre wrote:
> > But Emacs should clearly tell the user what to do after C-x C-s and
> > clearly say when there can be data loss. Currently it says:
> [...]
>
> In fact, I fear that this may not be sufficient, because some data
> loss silently occurs when visiting the file.

Exactly!

> If after the decoding, it appears that there are no problematic
> characters (is this possible?), the user would be able to save the
> file without any message from Emacs.

I don't know how to do that within the framework of Emacs handling of
non-ASCII text.
* bug#13505: Bug#696026: emacs24: file corruption on saving 2013-01-20 22:10 ` Vincent Lefevre 2013-01-20 22:22 ` Vincent Lefevre @ 2013-01-21 3:48 ` Eli Zaretskii 2013-01-21 4:14 ` Vincent Lefevre 1 sibling, 1 reply; 16+ messages in thread From: Eli Zaretskii @ 2013-01-21 3:48 UTC (permalink / raw To: Vincent Lefevre; +Cc: 696026-forwarded, 696026, rlb, 13505 > Date: Sun, 20 Jan 2013 23:10:08 +0100 > From: Vincent Lefevre <vincent@vinc17.net> > Cc: rlb@defaultvalue.org, handa@gnu.org, 13505@debbugs.gnu.org, > 696026-forwarded@bugs.debian.org, 696026@bugs.debian.org > > On 2013-01-20 23:40:14 +0200, Eli Zaretskii wrote: > > > Date: Sun, 20 Jan 2013 22:25:08 +0100 > > > From: Vincent Lefevre <vincent@vinc17.net> > > > Cc: Rob Browning <rlb@defaultvalue.org>, Kenichi Handa <handa@gnu.org>, > > > 13505@debbugs.gnu.org, 696026-forwarded@bugs.debian.org, > > > 696026@bugs.debian.org > > > > > > On 2013-01-20 18:49:38 +0200, Eli Zaretskii wrote: > > > > Personally, I don't think there's a bug here. It's a cockpit error. > > > > > > Perhaps it isn't a bug at save time. But then, selecting a lossy > > > encoding by default when visiting the file is the bug (and really > > > a regression), particularly if this isn't clearly told to the user. > > > > The encoding isn't lossy. > > You said: > > | The original encoded form of the characters as found on disk at > | visit time _cannot_ be recovered by saving with raw-text, because > | that encoded form is lost without a trace when the file is _visited_ > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > | and decoded into the internal representation. > > This is what lossy is. In that sense, every encoding except no-conversion is lossy. > On the opposite, the utf-8 encoding doesn't seem to be lossy: Emacs > seems to handle files with invalid UTF-8 sequences without any loss. > So, this encoding is safe, even if Emacs wrongly guess the encoding. No, it isn't, although you could get away with it most of the time. 
> But Emacs should clearly tell the user what to do after C-x C-s and > clearly say when there can be data loss. At save time, "data loss" is wrt what's in the buffer. In that sense, the encodings Emacs suggested don't lose any data. > Then Emacs says: "Select one of the safe coding systems listed below > [...]", but doesn't say that something has already been lost. So, the > words "safe coding systems" are really misleading. It's misleading because you misunderstand what is "safe" at buffer save time. ^ permalink raw reply [flat|nested] 16+ messages in thread
* bug#13505: Bug#696026: emacs24: file corruption on saving 2013-01-21 3:48 ` Eli Zaretskii @ 2013-01-21 4:14 ` Vincent Lefevre 2013-01-21 17:55 ` Eli Zaretskii 0 siblings, 1 reply; 16+ messages in thread From: Vincent Lefevre @ 2013-01-21 4:14 UTC (permalink / raw To: Eli Zaretskii; +Cc: 696026-forwarded, 696026, rlb, 13505 On 2013-01-21 05:48:14 +0200, Eli Zaretskii wrote: > > You said: > > > > | The original encoded form of the characters as found on disk at > > | visit time _cannot_ be recovered by saving with raw-text, because > > | that encoded form is lost without a trace when the file is _visited_ > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > | and decoded into the internal representation. > > > > This is what lossy is. > > In that sense, every encoding except no-conversion is lossy. Even 8-bit encodings such as latin-1? > > On the opposite, the utf-8 encoding doesn't seem to be lossy: Emacs > > seems to handle files with invalid UTF-8 sequences without any loss. > > So, this encoding is safe, even if Emacs wrongly guess the encoding. > > No, it isn't, although you could get away with it most of the time. Could you give an example where one loses data with the utf-8 encoding? > > But Emacs should clearly tell the user what to do after C-x C-s and > > clearly say when there can be data loss. > > At save time, "data loss" is wrt what's in the buffer. In that sense, > the encodings Emacs suggested don't lose any data. "data loss" is the difference between the original file and the saved file. > > Then Emacs says: "Select one of the safe coding systems listed below > > [...]", but doesn't say that something has already been lost. So, the > > words "safe coding systems" are really misleading. > > It's misleading because you misunderstand what is "safe" at buffer > save time. No, it's misleading because Emacs didn't say that data were lost when visiting the file. 
--
Vincent Lefèvre <vincent@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
* bug#13505: Bug#696026: emacs24: file corruption on saving 2013-01-21 4:14 ` Vincent Lefevre @ 2013-01-21 17:55 ` Eli Zaretskii 2013-01-22 2:35 ` Vincent Lefevre 0 siblings, 1 reply; 16+ messages in thread From: Eli Zaretskii @ 2013-01-21 17:55 UTC (permalink / raw To: Vincent Lefevre; +Cc: 696026-forwarded, 696026, rlb, 13505 > Date: Mon, 21 Jan 2013 05:14:10 +0100 > From: Vincent Lefevre <vincent@vinc17.net> > Cc: rlb@defaultvalue.org, handa@gnu.org, 13505@debbugs.gnu.org, > 696026-forwarded@bugs.debian.org, 696026@bugs.debian.org > > On 2013-01-21 05:48:14 +0200, Eli Zaretskii wrote: > > > You said: > > > > > > | The original encoded form of the characters as found on disk at > > > | visit time _cannot_ be recovered by saving with raw-text, because > > > | that encoded form is lost without a trace when the file is _visited_ > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > | and decoded into the internal representation. > > > > > > This is what lossy is. > > > > In that sense, every encoding except no-conversion is lossy. > > Even 8-bit encodings such as latin-1? Yes. When latin-1 characters are decoded (as part of visiting a file), they are converted to the internal representation, and cease to be single 8-bit bytes. > > > On the opposite, the utf-8 encoding doesn't seem to be lossy: Emacs > > > seems to handle files with invalid UTF-8 sequences without any loss. > > > So, this encoding is safe, even if Emacs wrongly guess the encoding. > > > > No, it isn't, although you could get away with it most of the time. > > Could you give an example where one loses data with the utf-8 encoding? E.g., in your test file, the byte whose value is 0x80 is converted to 0x3fff80 when the file is read into a buffer. Perhaps by "lossless" you mean "reversible", in the sense that saving the same buffer will perform the reverse conversion. 
In that case, even the in-is13194-devanagari-unix is reversible: if you type this encoding when Emacs prompts you to select one of the coding systems, then you get the same file on disk with no corruption whatsoever. > > > But Emacs should clearly tell the user what to do after C-x C-s and > > > clearly say when there can be data loss. > > > > At save time, "data loss" is wrt what's in the buffer. In that sense, > > the encodings Emacs suggested don't lose any data. > > "data loss" is the difference between the original file and the saved > file. But what do you want Emacs to do with this? When you save the buffer, the original file might be different or no longer be available (or not accessible even in principle, e.g. if the data came from a subprocess). These issues should be detected at file visit time, if at all, not at buffer save time. > > > Then Emacs says: "Select one of the safe coding systems listed below > > > [...]", but doesn't say that something has already been lost. So, the > > > words "safe coding systems" are really misleading. > > > > It's misleading because you misunderstand what is "safe" at buffer > > save time. > > No, it's misleading because Emacs didn't say that data were lost > when visiting the file. Let's be constructive here. Please suggest some practical way for Emacs to handle this situation better. For the record, here are the various alternative ways Emacs supports the use case you described, when a file with inconsistent encoding needs to be repaired manually: . Visit the file with "M-x find-file-literally RET". This yields a unibyte buffer, where each byte stands for itself, and which you can edit without risking en-/decoding issues. . Visit the file normally, then type "M-x hexl-mode RET" (or use "M-x hexl-find-file RET" to visit it in the first place). This revisits (or visits) the file in a unibyte buffer, and in addition lets you edit the binary stuff regardless of its graphic representation. . 
After visiting the file normally and noticing that it contains weird characters, or after being prompted to select a coding system when saving the buffer, type "C-x RET r raw-text RET" to revisit the file in raw-text encoding. Then edit the bytes and save the file. These alternatives are listed in the descending order of priority (IMO). There are more ways to deal with this, but the rest are more complicated and dangerous, so I don't mention them here. (It is also possible that you will find the second alternative more convenient than the 1st one.) ^ permalink raw reply [flat|nested] 16+ messages in thread
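For comparison, the byte-level repair that these alternatives enable can be sketched outside Emacs. The Python snippet below is only an illustration of editing a buffer in which each byte stands for itself; it is not how Emacs implements unibyte buffers. It assumes the file was meant to be UTF-8 throughout and that the lone byte e9 is a stray Latin-1 "é" (the report says only that the file mixes UTF-8 and ISO-8859-1), and the offset used is specific to this test file.

```python
# Bytes of "file1", copied from the hex dump in the report.
FILE1 = bytes.fromhex("6c e2 80 99 c3 a9 0a 74 65 73 74 e9 0a")

data = bytearray(FILE1)  # every element is one byte of the file

# Offset 11 (counting from 0) holds the lone byte e9. Replace it with the
# two-byte UTF-8 form of the same character, c3 a9.
assert data[11] == 0xE9
data[11:12] = "\u00e9".encode("utf-8")

repaired = bytes(data)
text = repaired.decode("utf-8")  # strict decoding succeeds after the repair

print(repaired.hex(" "))
```

Because nothing is decoded before the edit, the only bytes that change are the ones deliberately replaced.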
* bug#13505: Bug#696026: emacs24: file corruption on saving 2013-01-21 17:55 ` Eli Zaretskii @ 2013-01-22 2:35 ` Vincent Lefevre 2013-01-22 7:56 ` Eli Zaretskii 0 siblings, 1 reply; 16+ messages in thread From: Vincent Lefevre @ 2013-01-22 2:35 UTC (permalink / raw To: Eli Zaretskii; +Cc: 696026-forwarded, 696026, rlb, 13505 On 2013-01-21 19:55:20 +0200, Eli Zaretskii wrote: > > Date: Mon, 21 Jan 2013 05:14:10 +0100 > > From: Vincent Lefevre <vincent@vinc17.net> > > Cc: rlb@defaultvalue.org, handa@gnu.org, 13505@debbugs.gnu.org, > > 696026-forwarded@bugs.debian.org, 696026@bugs.debian.org > > > > On 2013-01-21 05:48:14 +0200, Eli Zaretskii wrote: > > > > You said: > > > > > > > > | The original encoded form of the characters as found on disk at > > > > | visit time _cannot_ be recovered by saving with raw-text, because > > > > | that encoded form is lost without a trace when the file is _visited_ > > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > > | and decoded into the internal representation. > > > > > > > > This is what lossy is. > > > > > > In that sense, every encoding except no-conversion is lossy. > > > > Even 8-bit encodings such as latin-1? > > Yes. When latin-1 characters are decoded (as part of visiting a > file), they are converted to the internal representation, and cease to > be single 8-bit bytes. Any example where saving the file without modifying it (see below) would modify the data (as a sequence of bytes on the disk)? > > > > On the opposite, the utf-8 encoding doesn't seem to be lossy: Emacs > > > > seems to handle files with invalid UTF-8 sequences without any loss. > > > > So, this encoding is safe, even if Emacs wrongly guess the encoding. > > > > > > No, it isn't, although you could get away with it most of the time. > > > > Could you give an example where one loses data with the utf-8 encoding? > > E.g., in your test file, the byte whose value is 0x80 is converted to > 0x3fff80 when the file is read into a buffer. 
No, there are no problems with this example:

  $ printf "\x80" > file
  $ hd file
  00000000  80                                                |.|
  00000001
  $ emacs -q file

Here the encoding by Emacs is utf-8-unix. Then I do
M-: (set-buffer-modified-p t) to mark the buffer as modified (as in
the bug report), then C-x C-s. Emacs proposes raw-text, which I
choose. Then C-x C-c to quit.

  $ hd file
  00000000  80                                                |.|
  00000001

So, the file has *not* been corrupted.

> Perhaps by "lossless" you mean "reversible", in the sense that saving
> the same buffer will perform the reverse conversion.

Actually I don't mind what occurs internally. What I mean is things
like: saved file = initial file if it hasn't been modified (as above)
and with the default encoding(s) proposed by Emacs (when visiting and
when saving).

> In that case, even the in-is13194-devanagari-unix is reversible: if
> you type this encoding when Emacs prompts you to select one of the
> coding systems, then you get the same file on disk with no
> corruption whatsoever.

Then this is what Emacs should propose by default on this example!
I suppose that Emacs is able to remember the encoding used to visit
the file, so that this should be possible...

> > > > But Emacs should clearly tell the user what to do after C-x C-s and
> > > > clearly say when there can be data loss.
> > >
> > > At save time, "data loss" is wrt what's in the buffer. In that sense,
> > > the encodings Emacs suggested don't lose any data.
> >
> > "data loss" is the difference between the original file and the saved
> > file.
>
> But what do you want Emacs to do with this? When you save the buffer,
> the original file might be different or no longer be available (or not
> accessible even in principle, e.g. if the data came from a
> subprocess).

The file may be different, but in general, the encoding should remain
the same. This is particularly true when Emacs is used as the editor
by some application: if the encoding of the file has been changed by
Emacs, the application will be confused.

> These issues should be detected at file visit time, if at all, not
> at buffer save time.

Possibly (this is something that the end user doesn't have to know if
the goal is to modify a file).

> > > > Then Emacs says: "Select one of the safe coding systems listed below
> > > > [...]", but doesn't say that something has already been lost. So, the
> > > > words "safe coding systems" are really misleading.
> > > >
> > > It's misleading because you misunderstand what is "safe" at buffer
> > > save time.
> >
> > No, it's misleading because Emacs didn't say that data were lost
> > when visiting the file.
>
> Let's be constructive here. Please suggest some practical way for
> Emacs to handle this situation better.
>
> For the record, here are the various alternative ways Emacs supports
> the use case you described, when a file with inconsistent encoding
> needs to be repaired manually:
>
>  . Visit the file with "M-x find-file-literally RET". This yields a
>    unibyte buffer, where each byte stands for itself, and which you
>    can edit without risking en-/decoding issues.

Though the above is possible, the user often opens files with
"emacs <file>".

>  . Visit the file normally, then type "M-x hexl-mode RET" (or use
>    "M-x hexl-find-file RET" to visit it in the first place). This
>    revisits (or visits) the file in a unibyte buffer, and in addition
>    lets you edit the binary stuff regardless of its graphic
>    representation.

If Emacs notices a potential problem when visiting the file, this
method can be proposed by Emacs, but it shouldn't be the only way,
because the file may contain mostly ASCII characters and hex-editing
is not the best choice in such a case.

>  . After visiting the file normally and noticing that it contains
>    weird characters, or after being prompted to select a coding system
>    when saving the buffer, type "C-x RET r raw-text RET" to revisit
>    the file in raw-text encoding. Then edit the bytes and save the
>    file.

But that could be proposed by Emacs directly: instead of decoding the
file directly in the buffer, Emacs could ask the user which coding
system he wants to use.

One drawback of raw-text is that 8-bit characters are completely
unreadable. I think that there should be, for instance, a utf-8
degraded coding system: correct UTF-8 sequences are decoded using
UTF-8, and invalid sequences are left intact. Emacs can already do
this kind of thing, but there should be 2 differences from the
current behavior:

  * When visiting the file, ask the user what to do in case Emacs
    cannot select a clean coding system without any problem.
    For instance, a "Select coding system" prompt. (BTW, couldn't
    hexl be regarded as a special coding system at this point?
    Perhaps "coding system" isn't the right term here, "editing
    mode" might be better.) Other settings in .emacs could override
    that, of course, i.e. this would just be the default.

  * In case of UTF-8 degraded coding system, Emacs should save the
    file in the same UTF-8 degraded coding system. This is a way
    for the user to say: "I know that there are invalid sequences,
    just keep them."

UTF-8 is just an example above. There could be the same kind of
things with other encodings.

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
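
Editorial illustration: the "utf-8 degraded" behavior proposed above (decode valid UTF-8, leave invalid bytes intact, write them back unchanged) can be sketched outside Emacs. The Python example below uses the standard `surrogateescape` error handler as a stand-in. It is only an analogy and makes no claim about how Emacs coding systems are implemented; the function names `visit` and `save` are hypothetical. The byte string is the "file1" content from the original report.

```python
# Bytes of "file1" from the bug report: valid UTF-8 text, then "test" followed
# by a lone ISO-8859-1 byte 0xE9, which is not valid UTF-8 at that position.
FILE1 = bytes.fromhex("6c e2 80 99 c3 a9 0a 74 65 73 74 e9 0a")


def visit(data: bytes) -> str:
    """Decode valid UTF-8 normally; keep each invalid byte as an escape."""
    return data.decode("utf-8", errors="surrogateescape")


def save(text: str) -> bytes:
    """Encode back to UTF-8, turning each escape into its original byte."""
    return text.encode("utf-8", errors="surrogateescape")


text = visit(FILE1)
print(ascii(text))  # readable characters plus one escaped invalid byte

# Saving an unmodified buffer reproduces the file byte for byte.
assert save(text) == FILE1

# Editing the valid part leaves the invalid byte untouched on save.
edited = save(text.replace("test", "TEST"))
print(edited.hex(" "))
```

With this scheme the valid sequences stay readable while editing, unlike a raw-text view, and the invalid byte survives both an unmodified save and an edit elsewhere in the text.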
* bug#13505: Bug#696026: emacs24: file corruption on saving
  2013-01-22 2:35 ` Vincent Lefevre
@ 2013-01-22 7:56 ` Eli Zaretskii
  0 siblings, 0 replies; 16+ messages in thread
From: Eli Zaretskii @ 2013-01-22 7:56 UTC
To: Vincent Lefevre; +Cc: 696026-forwarded, 696026, rlb, 13505

> Date: Tue, 22 Jan 2013 03:35:57 +0100
> From: Vincent Lefevre <vincent@vinc17.net>
> Cc: rlb@defaultvalue.org, handa@gnu.org, 13505@debbugs.gnu.org,
>     696026-forwarded@bugs.debian.org, 696026@bugs.debian.org
>
> > > > > | The original encoded form of the characters as found on disk at
> > > > > | visit time _cannot_ be recovered by saving with raw-text, because
> > > > > | that encoded form is lost without a trace when the file is _visited_
> > > > >                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > > | and decoded into the internal representation.
> > > > >
> > > > > This is what lossy is.
> > > > >
> > > > In that sense, every encoding except no-conversion is lossy.
> > >
> > > Even 8-bit encodings such as latin-1?
> >
> > Yes. When latin-1 characters are decoded (as part of visiting a
> > file), they are converted to the internal representation, and cease to
> > be single 8-bit bytes.
>
> Any example where saving the file without modifying it (see below)
> would modify the data (as a sequence of bytes on the disk)?

See above: I was talking about changes at file-visit time.

> > > > > On the contrary, the utf-8 encoding doesn't seem to be lossy: Emacs
> > > > > seems to handle files with invalid UTF-8 sequences without any loss.
> > > > > So, this encoding is safe, even if Emacs wrongly guesses the encoding.
> > > > >
> > > > No, it isn't, although you could get away with it most of the time.
> > >
> > > Could you give an example where one loses data with the utf-8 encoding?
> >
> > E.g., in your test file, the byte whose value is 0x80 is converted to
> > 0x3fff80 when the file is read into a buffer.
>
> No, there are no problems with this example:

Again, because we are talking about two different things.

> > Perhaps by "lossless" you mean "reversible", in the sense that saving
> > the same buffer will perform the reverse conversion.
>
> Actually I don't mind what occurs internally. What I mean is things
> like: saved file = initial file if it hasn't been modified (as above)
> and with the default encoding(s) proposed by Emacs (when visiting and
> when saving).

That's reversibility.

> > In that case, even the in-is13194-devanagari-unix is reversible: if
> > you type this encoding when Emacs prompts you to select one of the
> > coding systems, then you get the same file on disk with no
> > corruption whatsoever.
>
> Then this is what Emacs should propose by default on this example!

It can't easily do that. There are 2 different use cases here:

  1) A file was visited and its encoding was found to be inconsistent.
     Then it is being saved. This is your use case.

  2) A file was modified by adding to it characters that cannot be
     encoded by the original encoding. For example, you visit a
     Latin-1 encoded file, then add to it characters that are outside
     the coverage of Latin-1. Then you save the file.

What Emacs proposes is biased for the second use case, because it is
by far the most frequent one. The other use case is supposed to be
treated by other means, those which I mentioned in my previous mail.

Giving instructions to both use cases is not a good idea, IMO, because
it will confuse users who do not necessarily understand what is going
on and in particular don't realize which of the two situations they
are in.

> I suppose that Emacs is able to remember the encoding used to visit
> the file, so that this should be possible...

It does remember. It actually shows it in the "select safe coding
system" prompt. The problem is that its use can do the wrong thing in
the second use case above.
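
Editorial illustration: the second use case above can be reproduced in miniature. The Python sketch below is an analogy only (the helper name `can_save_as` is hypothetical, and nothing here is Emacs API). It shows text read as Latin-1 that can no longer be written back as Latin-1 once a character outside its coverage is added, which is the situation the save-time prompt is biased toward.

```python
def can_save_as(text: str, encoding: str) -> bool:
    """Return True if every character of text is encodable in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Visit: a small Latin-1 encoded file (0xE9 is e-acute in ISO-8859-1).
original = b"caf\xe9\n".decode("latin-1")

# Edit: append EURO SIGN (U+20AC), which ISO-8859-1 does not cover.
edited = original + "\u20ac\n"

print(can_save_as(original, "latin-1"))  # unmodified text still fits
print(can_save_as(edited, "latin-1"))    # the visit-time encoding no longer fits
print(can_save_as(edited, "utf-8"))      # a wider encoding is needed
```

In this situation, silently reusing the visit-time encoding would drop or mangle the new character, so asking the user for a different coding system is the safer default.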

> > > > > But Emacs should clearly tell the user what to do after C-x C-s and
> > > > > clearly say when there can be data loss.
> > > > >
> > > > At save time, "data loss" is wrt what's in the buffer. In that sense,
> > > > the encodings Emacs suggested don't lose any data.
> > >
> > > "data loss" is the difference between the original file and the saved
> > > file.
> >
> > But what do you want Emacs to do with this? When you save the buffer,
> > the original file might be different or no longer be available (or not
> > accessible even in principle, e.g. if the data came from a
> > subprocess).
>
> The file may be different, but in general, the encoding should remain
> the same.

That's what Emacs does, as long as it can. But in this case, that
encoding might produce an inconsistently encoded file, so Emacs doesn't
want to do that silently. It has no idea that the file was
inconsistently encoded in the first place, nor that you _want_ it to
continue being inconsistently encoded.

> This is particularly true when Emacs is used as the editor by some
> application: if the encoding of the file has been changed by Emacs,
> the application will be confused.

Again, that's what Emacs does normally, if that encoding can do the
job. Producing inconsistent encoding will certainly confuse those
other programs.

> > These issues should be detected at file visit time, if at all, not
> > at buffer save time.
>
> Possibly (this is something that the end user doesn't have to know if
> the goal is to modify a file).

This use case proves otherwise.

> >  . Visit the file with "M-x find-file-literally RET". This yields a
> >    unibyte buffer, where each byte stands for itself, and which you
> >    can edit without risking en-/decoding issues.
>
> Though the above is possible, the user often opens files with
> "emacs <file>".

Many users have Emacs up and running for the entire session.

> >  . Visit the file normally, then type "M-x hexl-mode RET" (or use
> >    "M-x hexl-find-file RET" to visit it in the first place). This
> >    revisits (or visits) the file in a unibyte buffer, and in addition
> >    lets you edit the binary stuff regardless of its graphic
> >    representation.
>
> If Emacs notices a potential problem when visiting the file, this
> method can be proposed by Emacs, but it shouldn't be the only way,
> because the file may contain mostly ASCII characters and hex-editing
> is not the best choice in such a case.

??? Hexl Mode shows the printable characters (at the right side of the
display) in addition to the codes. What exactly is the problem here?

> >  . After visiting the file normally and noticing that it contains
> >    weird characters, or after being prompted to select a coding system
> >    when saving the buffer, type "C-x RET r raw-text RET" to revisit
> >    the file in raw-text encoding. Then edit the bytes and save the
> >    file.
>
> But that could be proposed by Emacs directly: instead of decoding the
> file directly in the buffer, Emacs could ask the user which coding
> system he wants to use.

That'd be a nuisance, I think, because more often than not, keeping
the original inconsistent encoding is not what the user wants.

> One drawback of raw-text is that 8-bit characters are completely
> unreadable.

That's why I listed it last.
* bug#13505: Bug#696026: emacs24: file corruption on saving
  2013-01-20 16:49 ` Eli Zaretskii
  ` (2 preceding siblings ...)
  2013-01-20 21:25 ` Vincent Lefevre
@ 2013-01-20 23:01 ` Andreas Schwab
  2013-01-20 23:27 ` bug#13505: Bug#696026: " Rob Browning
  3 siblings, 1 reply; 16+ messages in thread
From: Andreas Schwab @ 2013-01-20 23:01 UTC
To: Eli Zaretskii; +Cc: 696026, 696026-forwarded, 13505, vincent, Rob Browning

Eli Zaretskii <eliz@gnu.org> writes:

> I didn't research the reason why Emacs 24 autodetects this encoding,
> and whether this is on purpose.

It's a bug, fixed now.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."
* bug#13505: Bug#696026: bug#13505: Bug#696026: emacs24: file corruption on saving
  2013-01-20 23:01 ` Andreas Schwab
@ 2013-01-20 23:27 ` Rob Browning
  0 siblings, 0 replies; 16+ messages in thread
From: Rob Browning @ 2013-01-20 23:27 UTC
To: Andreas Schwab; +Cc: 696026, control, 696026-forwarded, 13505, vincent

Andreas Schwab <schwab@linux-m68k.org> writes:

> Eli Zaretskii <eliz@gnu.org> writes:
>
>> I didn't research the reason why Emacs 24 autodetects this encoding,
>> and whether this is on purpose.
>
> It's a bug, fixed now.

Great, and thanks.

-- 
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4
end of thread, other threads:[~2013-01-22 7:56 UTC | newest]

Thread overview: 16+ messages
[not found] <20121215223809.GA7549@xvii.vinc17.org>
2013-01-20 4:09 ` bug#13505: Bug#696026: emacs24: file corruption on saving Rob Browning
2013-01-20 16:49 ` Eli Zaretskii
2013-01-20 17:31 ` Rob Browning
2013-01-20 20:24 ` Glenn Morris
2013-01-20 21:25 ` Vincent Lefevre
2013-01-20 21:40 ` Eli Zaretskii
2013-01-20 22:10 ` Vincent Lefevre
2013-01-20 22:22 ` Vincent Lefevre
2013-01-21 3:49 ` Eli Zaretskii
2013-01-21 3:48 ` Eli Zaretskii
2013-01-21 4:14 ` Vincent Lefevre
2013-01-21 17:55 ` Eli Zaretskii
2013-01-22 2:35 ` Vincent Lefevre
2013-01-22 7:56 ` Eli Zaretskii
2013-01-20 23:01 ` Andreas Schwab
2013-01-20 23:27 ` bug#13505: Bug#696026: " Rob Browning