* docs for insert-file-contents use 'bytes' @ 2008-09-29 19:58 Ted Zlatanov 2008-09-29 20:12 ` Eli Zaretskii 0 siblings, 1 reply; 15+ messages in thread From: Ted Zlatanov @ 2008-09-29 19:58 UTC (permalink / raw) To: emacs-devel The docs for insert-file-contents say the range is in bytes, but that function does decoding of the contents. Can it, therefore, read from an undesirable position (e.g. the middle of a UTF-8 sequence)? How does Emacs handle that? Or does "bytes" really mean position by character? Either way the docs need to state the operation mode clearly. I'd do it but I don't know enough :) Thanks Ted ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: docs for insert-file-contents use 'bytes' 2008-09-29 19:58 docs for insert-file-contents use 'bytes' Ted Zlatanov @ 2008-09-29 20:12 ` Eli Zaretskii 2008-09-29 21:04 ` Ted Zlatanov 0 siblings, 1 reply; 15+ messages in thread From: Eli Zaretskii @ 2008-09-29 20:12 UTC (permalink / raw) To: Ted Zlatanov; +Cc: emacs-devel > From: Ted Zlatanov <tzz@lifelogs.com> > Date: Mon, 29 Sep 2008 14:58:17 -0500 > > The docs for insert-file-contents say the range is in bytes, but that > function does decoding of the contents. Can it, therefore, read from an > undesirable position (e.g. the middle of a UTF-8 sequence)? The range _is_ in bytes (you will see in fileio.c that Emacs uses `lseek' to get to the required file positions). Yes, reading a part of a multibyte sequence is a possibility. > How does Emacs handle that? Like with any other random bytes, I think: it will produce eight-bit-* characters in the buffer. IOW, you get garbled text. > Either way the docs need to state the operation mode clearly. Assuming I don't miss anything, and the above is indeed correct, what would you like the doc string to say, exactly? ^ permalink raw reply [flat|nested] 15+ messages in thread
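The behaviour Eli describes can be reproduced outside Emacs. The following is a minimal sketch in Python, not Emacs code: the helper name `read_decoded_range` is hypothetical, and the `backslashreplace` error handler is only a rough stand-in for the eight-bit-* characters Emacs would put in the buffer.

```python
import os
import tempfile


def read_decoded_range(path, start, end):
    """Seek to byte offset `start`, read up to `end`, then decode as UTF-8.

    Undecodable bytes are shown as \\xNN escapes, loosely mimicking the
    raw eight-bit characters Emacs would display (an analogy only).
    """
    with open(path, "rb") as f:
        f.seek(start)              # a plain byte offset, as with lseek
        raw = f.read(end - start)
    return raw.decode("utf-8", errors="backslashreplace")


def demo():
    # "é" is two bytes in UTF-8 (0xC3 0xA9), so byte offset 1 is mid-character.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write("été".encode("utf-8"))
        path = tmp.name
    try:
        aligned = read_decoded_range(path, 0, 5)
        misaligned = read_decoded_range(path, 1, 5)
        return aligned, misaligned
    finally:
        os.remove(path)
```

Starting at offset 0 gives back the original text; starting at offset 1 leaves a stray continuation byte at the front, which is the "garbled text" case under discussion.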
* Re: docs for insert-file-contents use 'bytes' 2008-09-29 20:12 ` Eli Zaretskii @ 2008-09-29 21:04 ` Ted Zlatanov 2008-09-30 6:06 ` Miles Bader 2008-09-30 7:19 ` Eli Zaretskii 0 siblings, 2 replies; 15+ messages in thread From: Ted Zlatanov @ 2008-09-29 21:04 UTC (permalink / raw) To: emacs-devel On Mon, 29 Sep 2008 23:12:58 +0300 Eli Zaretskii <eliz@gnu.org> wrote: >> From: Ted Zlatanov <tzz@lifelogs.com> >> Date: Mon, 29 Sep 2008 14:58:17 -0500 >> >> The docs for insert-file-contents say the range is in bytes, but that >> function does decoding of the contents. Can it, therefore, read from an >> undesirable position (e.g. the middle of a UTF-8 sequence)? EZ> The range _is_ in bytes (you will see in fileio.c that Emacs uses EZ> `lseek' to get to the required file positions). Yes, reading a part EZ> of a multibyte sequence is a possibility. >> How does Emacs handle that? EZ> Like with any other random bytes, I think: it will produce eight-bit-* EZ> characters in the buffer. IOW, you get garbled text. This is not a safe operation mode with multibyte sequences; is there a way to DTRT? I'm specifically thinking about a paged buffer mode where you only see a small portion of the file (for editing large files, as we discussed in another newsgroup a while ago). >> Either way the docs need to state the operation mode clearly. EZ> Assuming I don't miss anything, and the above is indeed correct, what EZ> would you like the doc string to say, exactly? Maybe add: "Warning: this is not safe with variable-length multibyte encodings such as UTF-8, because it works by byte offset without encoding awareness, so you may get garbled data. See ??? instead." I don't know if this is the right wording, but it's a pretty essential operation so it should give some warning about this common (nowadays) case. Ted ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: docs for insert-file-contents use 'bytes' 2008-09-29 21:04 ` Ted Zlatanov @ 2008-09-30 6:06 ` Miles Bader 2008-09-30 7:19 ` Eli Zaretskii 1 sibling, 0 replies; 15+ messages in thread From: Miles Bader @ 2008-09-30 6:06 UTC (permalink / raw) To: Ted Zlatanov; +Cc: emacs-devel Ted Zlatanov <tzz@lifelogs.com> writes: > EZ> Like with any other random bytes, I think: it will produce eight-bit-* > EZ> characters in the buffer. IOW, you get garbled text. > > This is not a safe operation mode with multibyte sequences; is there a > way to DTRT? I'm specifically thinking about a paged buffer mode where > you only see a small portion of the file (for editing large files, as we > discussed in another newsgroup a while ago). Why is it "not safe"? How would you do things differently? In conjunction with _file_ contents, a byte offset seems certainly the most natural thing. An "encoded character offset", for instance, would be far less efficient, much more complex to implement (and thus buggier), and harder to use in general. -Miles -- Sabbath, n. A weekly festival having its origin in the fact that God made the world in six days and was arrested on the seventh. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: docs for insert-file-contents use 'bytes' 2008-09-29 21:04 ` Ted Zlatanov 2008-09-30 6:06 ` Miles Bader @ 2008-09-30 7:19 ` Eli Zaretskii 2008-09-30 13:48 ` Ted Zlatanov 1 sibling, 1 reply; 15+ messages in thread From: Eli Zaretskii @ 2008-09-30 7:19 UTC (permalink / raw) To: Ted Zlatanov; +Cc: emacs-devel > From: Ted Zlatanov <tzz@lifelogs.com> > Date: Mon, 29 Sep 2008 16:04:13 -0500 > > This is not a safe operation mode with multibyte sequences; is there a > way to DTRT? I'm specifically thinking about a paged buffer mode where > you only see a small portion of the file (for editing large files, as we > discussed in another newsgroup a while ago). How about this idea: read a bit more than you want, then find safe place to end this page-full? > I don't know if this is the right wording, but it's a pretty essential > operation so it should give some warning about this common (nowadays) > case. Is it really a common case that insert-file-contents is used to read a portion of a file? Where is this used? ^ permalink raw reply [flat|nested] 15+ messages in thread
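One way to read Eli's "read a bit more, then find a safe place to end" idea, sketched in Python and assuming the file is valid UTF-8 and that `start` already falls on a character boundary. The function name `read_page` is hypothetical, and an in-memory `bytes` object stands in for the file.

```python
def read_page(data, start, size):
    """Return (text, next_start) for a page of roughly `size` bytes.

    Assumes `data` is valid UTF-8 and `start` is a character boundary.
    """
    slack = 3  # a UTF-8 character has at most 3 continuation bytes
    raw = data[start:start + size + slack]   # read a bit more than wanted
    end = min(size, len(raw))
    # Move the end forward past continuation bytes (bit pattern 10xxxxxx)
    # so the page never stops in the middle of a character.
    while end < len(raw) and (raw[end] & 0xC0) == 0x80:
        end += 1
    return raw[:end].decode("utf-8"), start + end
```

Feeding `next_start` back in as the next `start` walks the file page by page without ever splitting a character. This relies on UTF-8 being self-synchronizing and would not carry over to stateful encodings.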
* Re: docs for insert-file-contents use 'bytes' 2008-09-30 7:19 ` Eli Zaretskii @ 2008-09-30 13:48 ` Ted Zlatanov 2008-09-30 15:58 ` Stefan Monnier ` (2 more replies) 0 siblings, 3 replies; 15+ messages in thread From: Ted Zlatanov @ 2008-09-30 13:48 UTC (permalink / raw) To: emacs-devel On Tue, 30 Sep 2008 10:19:26 +0300 Eli Zaretskii <eliz@gnu.org> wrote: >> From: Ted Zlatanov <tzz@lifelogs.com> >> Date: Mon, 29 Sep 2008 16:04:13 -0500 >> >> This is not a safe operation mode with multibyte sequences; is there a >> way to DTRT? I'm specifically thinking about a paged buffer mode where >> you only see a small portion of the file (for editing large files, as we >> discussed in another newsgroup a while ago). EZ> How about this idea: read a bit more than you want, then find safe EZ> place to end this page-full? How do I find the next safe position in the byte flow? >> I don't know if this is the right wording, but it's a pretty essential >> operation so it should give some warning about this common (nowadays) >> case. EZ> Is it really a common case that insert-file-contents is used to read a EZ> portion of a file? Where is this used? I want to use it to implement a paged view of large files. We discussed this in emacs-help and you suggested using insert-file-contents IIRC. Anyhow, the point is the docs don't mention this issue, let's fix that first. I mention one possible way to do the code below. On Tue, 30 Sep 2008 15:06:17 +0900 Miles Bader <miles.bader@necel.com> wrote: MB> Ted Zlatanov <tzz@lifelogs.com> writes: EZ> Like with any other random bytes, I think: it will produce eight-bit-* EZ> characters in the buffer. IOW, you get garbled text. >> >> This is not a safe operation mode with multibyte sequences; is there a >> way to DTRT? I'm specifically thinking about a paged buffer mode where >> you only see a small portion of the file (for editing large files, as we >> discussed in another newsgroup a while ago). MB> Why is it "not safe"? 
Because the text will be corrupted if you seek in the middle of a multibyte sequence, and there's no way to know in advance if a position is safe without at least some scanning. MB> How would you do things differently? I don't know, I'm just saying the docs don't mention the possibility of corrupted text. Can we fix that, if possible? The docs just need to warn, not solve the issue. MB> In conjunction with _file_ contents, a byte offset seems certainly the MB> most natural thing. An "encoded character offset", for instance, would MB> be far less efficient, much more complex to implement (and thus MB> buggier), and harder to use in general. Agreed. Still, encoding schemes like UTF-8 are so popular today that the docs should at least warn about careless seeking to a byte offset. There could be a insert-file-decoded-contents that seeks to a byte position and gets the next character at or after that position. That's not too hard to implement and it's fast. Ted ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: docs for insert-file-contents use 'bytes' 2008-09-30 13:48 ` Ted Zlatanov @ 2008-09-30 15:58 ` Stefan Monnier 2008-09-30 16:29 ` Eli Zaretskii 2008-10-01 0:44 ` Kenichi Handa 2 siblings, 0 replies; 15+ messages in thread From: Stefan Monnier @ 2008-09-30 15:58 UTC (permalink / raw) To: Ted Zlatanov; +Cc: emacs-devel >>> This is not a safe operation mode with multibyte sequences; is there a >>> way to DTRT? I'm specifically thinking about a paged buffer mode where >>> you only see a small portion of the file (for editing large files, as we >>> discussed in another newsgroup a while ago). EZ> How about this idea: read a bit more than you want, then find safe EZ> place to end this page-full? > How do I find the next safe position in the byte flow? It's a difficult problem for everyone. Which is why Emacs doesn't do it for you, basically: I don't think anyone has made serious use of that feature yet, so nobody has gone to the trouble of coming up with a good solution. Maybe you can simply look at the end of the previous insertion, count the number of eight-bit-* chars that were inserted (these correspond to bytes that belong to the char that straddles the boundary) so as to find the end of the last complete char you encountered. > I want to use it to implement a paged view of large files. We discussed > this in emacs-help and you suggested using insert-file-contents IIRC. This is a very good application indeed. > Because the text will be corrupted if you seek in the middle of a > multibyte sequence, and there's no way to know in advance if a position > is safe without at least some scanning. It's not exactly "corrupted" in the sense that, while it is not displayed correctly, it should be correctly saved back so no information is lost. Basically, some of the bytes are decoded with the wrong coding-system, but this coding system is supposed to be safe. No doubt that it's not "good enough" in general. 
> There could be a insert-file-decoded-contents that seeks to a byte > position and gets the next character at or after that position. That's > not too hard to implement and it's fast. It wouldn't be good enough for your application because you might then lose the chars that straddle a boundary. Stefan ^ permalink raw reply [flat|nested] 15+ messages in thread
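Stefan's two points (count the raw bytes left at the end of the previous insertion, and note that they survive a save unchanged) have a close analogue in Python's `surrogateescape` error handler. The sketch below is only that analogue, not what Emacs does internally; both function names are hypothetical.

```python
def decode_keeping_raw_bytes(raw):
    """Decode UTF-8, mapping each undecodable byte to a lone surrogate.

    This plays the role of Emacs's eight-bit characters: the byte is not
    displayed meaningfully, but it is not lost either.
    """
    return raw.decode("utf-8", errors="surrogateescape")


def trailing_raw_byte_count(text):
    """Count undecoded bytes (U+DC80..U+DCFF) at the end of `text`."""
    count = 0
    for ch in reversed(text):
        if 0xDC80 <= ord(ch) <= 0xDCFF:
            count += 1
        else:
            break
    return count


def last_complete_char_end(chunk):
    """Byte offset within `chunk` where the last complete character ends."""
    text = decode_keeping_raw_bytes(chunk)
    return len(chunk) - trailing_raw_byte_count(text)
```

The offset returned by `last_complete_char_end` is where the next chunk should start. Encoding the decoded text back with the same error handler reproduces the original bytes, which matches the "no information is lost" observation.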
* Re: docs for insert-file-contents use 'bytes' 2008-09-30 13:48 ` Ted Zlatanov 2008-09-30 15:58 ` Stefan Monnier @ 2008-09-30 16:29 ` Eli Zaretskii 2008-10-01 0:44 ` Kenichi Handa 2 siblings, 0 replies; 15+ messages in thread From: Eli Zaretskii @ 2008-09-30 16:29 UTC (permalink / raw) To: Ted Zlatanov; +Cc: emacs-devel > From: Ted Zlatanov <tzz@lifelogs.com> > Date: Tue, 30 Sep 2008 08:48:28 -0500 > > EZ> How about this idea: read a bit more than you want, then find safe > EZ> place to end this page-full? > > How do I find the next safe position in the byte flow? In C or in ELisp? ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: docs for insert-file-contents use 'bytes' 2008-09-30 13:48 ` Ted Zlatanov 2008-09-30 15:58 ` Stefan Monnier 2008-09-30 16:29 ` Eli Zaretskii @ 2008-10-01 0:44 ` Kenichi Handa 2008-10-01 16:54 ` Ted Zlatanov 2 siblings, 1 reply; 15+ messages in thread From: Kenichi Handa @ 2008-10-01 0:44 UTC (permalink / raw) To: Ted Zlatanov; +Cc: emacs-devel In article <8663od68yb.fsf@lifelogs.com>, Ted Zlatanov <tzz@lifelogs.com> writes: > There could be a insert-file-decoded-contents that seeks to a byte > position and gets the next character at or after that position. That's > not too hard to implement and it's fast. It's not that easy. Some encoding requires to seek back an escape sequence to get the next character. And, for UTF-16 with BOM, we have to check the first 2-byte. --- Kenichi Handa handa@ni.aist.go.jp ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: docs for insert-file-contents use 'bytes' 2008-10-01 0:44 ` Kenichi Handa @ 2008-10-01 16:54 ` Ted Zlatanov 2008-10-02 1:33 ` Kenichi Handa 2008-10-02 1:58 ` Kenichi Handa 0 siblings, 2 replies; 15+ messages in thread From: Ted Zlatanov @ 2008-10-01 16:54 UTC (permalink / raw) To: emacs-devel On Wed, 01 Oct 2008 09:44:31 +0900 Kenichi Handa <handa@m17n.org> wrote: KH> In article <8663od68yb.fsf@lifelogs.com>, Ted Zlatanov <tzz@lifelogs.com> writes: >> There could be a insert-file-decoded-contents that seeks to a byte >> position and gets the next character at or after that position. That's >> not too hard to implement and it's fast. KH> It's not that easy. Some encoding requires to seek back an KH> escape sequence to get the next character. And, for UTF-16 KH> with BOM, we have to check the first 2-byte. OK. Does it ever require going more than N*2 (where N = max sequence length for the encoding) bytes back? Is N ever bigger than 10? If not, it may be complicated code but at least it will be fairly fast. The semantics could be (given N as above): 1) jump to character number C: scan from beginning of file and count characters up to C if the encoding has a variable length. Otherwise the offset is obvious. 2) jump to character around/at byte B: jump to B-N*2 and scan characters forward until you find the one that straddles or begins at B. Also should have a way to report that character's actual starting byte position. 3) jump to byte: operate as now, just a fseek For my purposes (2) is most useful, but I can use (3) and bypass encodings. (1) is not good for me, since the application is to view large files, but (1) is OK for small files. On Tue, 30 Sep 2008 11:58:12 -0400 Stefan Monnier <monnier@IRO.UMontreal.CA> wrote: >> I want to use it to implement a paged view of large files. We discussed >> this in emacs-help and you suggested using insert-file-contents IIRC. SM> This is a very good application indeed. 
It's on my TODO list :) I'm more concerned with fixing the docs, though, as that was the original intent of my post. Can they just warn about the issue, as I suggested? >> Because the text will be corrupted if you seek in the middle of a >> multibyte sequence, and there's no way to know in advance if a position >> is safe without at least some scanning. SM> It's not exactly "corrupted" in the sense that, while it is not SM> displayed correctly, it should be correctly saved back so no information SM> is lost. Basically, some of the bytes are decoded with the wrong SM> coding-system, but this coding system is supposed to be safe. SM> No doubt that it's not "good enough" in general. Right, it's the interpretation of the data, not the data itself, that's incorrect. Variable-length encodings are annoying. >> There could be a insert-file-decoded-contents that seeks to a byte >> position and gets the next character at or after that position. That's >> not too hard to implement and it's fast. SM> It wouldn't be good enough for your application because you might then SM> lose the chars that straddle a boundary. That's fine. I would just seek back N bytes and resynchronize (see above). I could also make the mode only work with bytes, bypassing encoding schemes. This would be a good thing, actually, for cases where the user might page or search through many megabytes of data quickly. Ted ^ permalink raw reply [flat|nested] 15+ messages in thread
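Ted's option (2), finding the character that straddles or begins at byte B and reporting its real starting offset, is straightforward for UTF-8 specifically, because continuation bytes are recognizable on their own. A minimal Python sketch under that assumption follows; the name `char_at_byte` is hypothetical, and it does not address encodings that carry state.

```python
def char_at_byte(data, b):
    """Return (start, char) for the UTF-8 character containing byte offset `b`.

    Assumes `data` is valid UTF-8 and 0 <= b < len(data).
    """
    start = b
    # Back up over continuation bytes (10xxxxxx); a valid character has
    # at most 3 of them, so the scan is bounded.
    while start > 0 and b - start < 3 and (data[start] & 0xC0) == 0x80:
        start -= 1
    lead = data[start]
    if lead < 0x80:
        length = 1
    elif lead >> 5 == 0b110:
        length = 2
    elif lead >> 4 == 0b1110:
        length = 3
    elif lead >> 3 == 0b11110:
        length = 4
    else:
        raise ValueError("no UTF-8 lead byte within 3 bytes of offset")
    return start, data[start:start + length].decode("utf-8")
```

The backward scan never exceeds three bytes here, which is the "fairly fast" case Ted hopes for; whether a similar bound exists for other encodings is exactly the open question in this message.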
* Re: docs for insert-file-contents use 'bytes' 2008-10-01 16:54 ` Ted Zlatanov @ 2008-10-02 1:33 ` Kenichi Handa 2008-10-02 13:42 ` Ted Zlatanov 2008-10-02 1:58 ` Kenichi Handa 1 sibling, 1 reply; 15+ messages in thread From: Kenichi Handa @ 2008-10-02 1:33 UTC (permalink / raw) To: Ted Zlatanov; +Cc: emacs-devel In article <868wt845op.fsf@lifelogs.com>, Ted Zlatanov <tzz@lifelogs.com> writes: KH> It's not that easy. Some encoding requires to seek back an KH> escape sequence to get the next character. And, for UTF-16 KH> with BOM, we have to check the first 2-byte. > OK. Does it ever require going more than N*2 (where N = max sequence > length for the encoding) bytes back? Is N ever bigger than 10? If not, > it may be complicated code but at least it will be fairly fast. N can be much much longer than 10. For instance, the following is the byte sequence of iso-2022-jp for a Japanese sentence (ESC code is represented by "^["). ^[$BA0$N2hLL$H<!$N2hLL$H$G$O!I=<($5$l$kFbMF$K2?9T$+$N=E$J$j$,$$j$^$9!#$3^[(B ^[$B$l$O!I=<($5$l$F$$$kFbMF$,OB3$7$F$$$k$3$H$,$9$0H=$k$h$&$K$9$k$?$a$G$9!#^[(B We must search back the sequence ^[$B or ^[(B for iso-2022-jp. Which pattern to search depends on the coding-system. > The semantics could be (given N as above): > 1) jump to character number C: scan from beginning of file and count > characters up to C if the encoding has a variable length. Otherwise the > offset is obvious. > 2) jump to character around/at byte B: jump to B-N*2 and scan characters > forward until you find the one that straddles or begins at B. Also > should have a way to report that character's actual starting byte > position. > 3) jump to byte: operate as now, just a fseek > For my purposes (2) is most useful, but I can use (3) and bypass > encodings. (1) is not good for me, since the application is to view > large files, but (1) is OK for small files. As you now see from the above example, implementing (2) is very difficult. And, for small files, we don't need (1). 
We can just read the whole file. --- Kenichi Handa handa@ni.aist.go.jp ^ permalink raw reply [flat|nested] 15+ messages in thread
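Handa's point about iso-2022-jp can be checked with Python's `iso2022_jp` codec. This is a rough sketch: the helper `decode_from_last_escape` is hypothetical, it only locates the escape sequence and does not handle alignment inside two-byte characters, and the backward search has no fixed bound because a run of Japanese text between escapes can be arbitrarily long.

```python
ESC = b"\x1b"


def decode_from_last_escape(data, offset, encoding="iso2022_jp"):
    """Decode `data` starting at the last escape sequence at or before `offset`.

    Returns (start, text). The search may walk back arbitrarily far.
    """
    start = data.rfind(ESC, 0, offset + 1)
    if start == -1:
        start = 0  # no escape seen: still in the initial ASCII state
    return start, data[start:].decode(encoding)


def demo():
    text = "日本語"
    data = text.encode("iso2022_jp")
    # Drop the 3-byte escapes at both ends and decode the payload alone:
    # without the leading escape the decoder stays in ASCII mode.
    naive = data[3:-3].decode("iso2022_jp")
    start, resynced = decode_from_last_escape(data, 5)
    return data, naive, start, resynced
```

Decoding the payload without its leading escape yields six unrelated ASCII characters, while decoding from the escape recovers the original three. The same bytes mean different things depending on state set earlier in the stream.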
* Re: docs for insert-file-contents use 'bytes' 2008-10-02 1:33 ` Kenichi Handa @ 2008-10-02 13:42 ` Ted Zlatanov 2008-10-02 18:55 ` Stefan Monnier 0 siblings, 1 reply; 15+ messages in thread From: Ted Zlatanov @ 2008-10-02 13:42 UTC (permalink / raw) To: emacs-devel On Thu, 02 Oct 2008 10:33:49 +0900 Kenichi Handa <handa@m17n.org> wrote: KH> In article <868wt845op.fsf@lifelogs.com>, Ted Zlatanov <tzz@lifelogs.com> writes: >> That's fine. I would just seek back N bytes and resynchronize (see >> above). I could also make the mode only work with bytes, bypassing >> encoding schemes. This would be a good thing, actually, for cases where >> the user might page or search through many megabytes of data quickly. KH> How about reading a file in a unibyte buffer with KH> no-conversion, and decode one page by one into a view KH> buffer. Except for UTF-16 encoding, it is safe to set the KH> decoding boundary at newline positions. That could work, but I'd have to grab more than one page every time, so the math could get tricky. I'll have to play with this when it comes up on my TODO list :) Thanks Ted ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: docs for insert-file-contents use 'bytes' 2008-10-02 13:42 ` Ted Zlatanov @ 2008-10-02 18:55 ` Stefan Monnier 2008-10-03 13:55 ` Ted Zlatanov 0 siblings, 1 reply; 15+ messages in thread From: Stefan Monnier @ 2008-10-02 18:55 UTC (permalink / raw) To: Ted Zlatanov; +Cc: emacs-devel KH> How about reading a file in a unibyte buffer with KH> no-conversion, and decode one page by one into a view KH> buffer. Except for UTF-16 encoding, it is safe to set the KH> decoding boundary at newline positions. > That could work, but I'd have to grab more than one page every time, so > the math could get tricky. I'll have to play with this when it comes up > on my TODO list :) I'm not sure why you'd need such a trick anyway. Supposedly when you need to read a new chunk, it's because the user bumped into the end of the previous chunk, so you've read the previous chunk and from that read you should be able to compute the byte-position of the last complete character. If that doesn't work in some cases (e.g. because the encoding has state so you can't just start reading from the byte position of the last complete char), maybe insert-file-contents should return a "decoder state" object which can then be passed back in. Stefan ^ permalink raw reply [flat|nested] 15+ messages in thread
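The "decoder state" object Stefan proposes does not exist in insert-file-contents as discussed here, but the idea itself can be illustrated with Python's incremental decoders, whose `getstate` and `setstate` methods carry pending bytes from one chunk to the next. The wrapper names below are hypothetical, and this is an analogy rather than a proposal for the Emacs API.

```python
import codecs


def decode_chunk(chunk, state=None, encoding="utf-8"):
    """Decode one chunk, resuming from `state` if given.

    Returns (text, new_state); pass new_state to the next call.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    if state is not None:
        decoder.setstate(state)    # restore bytes left over from last time
    text = decoder.decode(chunk)   # incomplete trailing bytes are held back
    return text, decoder.getstate()


def decode_in_chunks(data, chunk_size, encoding="utf-8"):
    """Decode `data` in fixed-size byte chunks, threading the state through."""
    state, pieces = None, []
    for pos in range(0, len(data), chunk_size):
        text, state = decode_chunk(data[pos:pos + chunk_size], state, encoding)
        pieces.append(text)
    return pieces
```

With the state threaded through, the chunk boundaries can fall anywhere, including inside a character, and sequential reads still decode correctly. Random access into the middle of a file would still need some other way to obtain a valid state.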
* Re: docs for insert-file-contents use 'bytes' 2008-10-02 18:55 ` Stefan Monnier @ 2008-10-03 13:55 ` Ted Zlatanov 0 siblings, 0 replies; 15+ messages in thread From: Ted Zlatanov @ 2008-10-03 13:55 UTC (permalink / raw) To: emacs-devel On Thu, 02 Oct 2008 14:55:10 -0400 Stefan Monnier <monnier@IRO.UMontreal.CA> wrote: KH> How about reading a file in a unibyte buffer with KH> no-conversion, and decode one page by one into a view KH> buffer. Except for UTF-16 encoding, it is safe to set the KH> decoding boundary at newline positions. >> That could work, but I'd have to grab more than one page every time, so >> the math could get tricky. I'll have to play with this when it comes up >> on my TODO list :) SM> I'm not sure why you'd need such trick anyway. Supposedly when you need SM> to read a new chunk, it's because the user bumped into the end of the SM> previous chunk, so you've read the previous chunk and from that read you SM> should be able to compute the byte-position of the last SM> complete character. SM> If that doesn't work in some cases (e.g. because the encoding has state SM> so you can't just start reading from the byte position of the last SM> complete char), maybe insert-file-contents should return a "decoder SM> state" object when can then be passed back in. You're right, usually the user will ask for sequential access to large files so these optimizations should work. It needs to be as fast as possible regardless, because searching through large amounts of data is painful (especially if the search fails) and I expect searching to be one of the most common needs when viewing large files. Editing is another really nasty problem, of course, so I plan to make the mode read-only at first. Ted ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: docs for insert-file-contents use 'bytes' 2008-10-01 16:54 ` Ted Zlatanov 2008-10-02 1:33 ` Kenichi Handa @ 2008-10-02 1:58 ` Kenichi Handa 1 sibling, 0 replies; 15+ messages in thread From: Kenichi Handa @ 2008-10-02 1:58 UTC (permalink / raw) To: Ted Zlatanov; +Cc: emacs-devel In article <868wt845op.fsf@lifelogs.com>, Ted Zlatanov <tzz@lifelogs.com> writes: > That's fine. I would just seek back N bytes and resynchronize (see > above). I could also make the mode only work with bytes, bypassing > encoding schemes. This would be a good thing, actually, for cases where > the user might page or search through many megabytes of data quickly. How about reading a file in a unibyte buffer with no-conversion, and decode one page by one into a view buffer. Except for UTF-16 encoding, it is safe to set the decoding boundary at newline positions. --- Kenichi Handa handa@ni.aist.go.jp ^ permalink raw reply [flat|nested] 15+ messages in thread
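Handa's suggestion (keep the raw bytes undecoded, then decode one page at a time with boundaries placed at newlines) might look roughly like this in Python. The function name `pages_at_newlines` is hypothetical, and a `bytes` object stands in for the unibyte buffer. As the message notes, this is not safe for UTF-16, where the byte 0x0A can occur inside a code unit.

```python
def pages_at_newlines(raw, page_size, encoding="utf-8"):
    """Split raw bytes into pages of at least `page_size` bytes, each
    ending just after a newline, and decode every page separately.

    Not suitable for UTF-16, where 0x0A may be half of a code unit.
    """
    pages = []
    start = 0
    while start < len(raw):
        # First newline at or after the nominal page end.
        nl = raw.find(b"\n", start + page_size)
        end = len(raw) if nl == -1 else nl + 1
        pages.append(raw[start:end].decode(encoding))
        start = end
    return pages
```

Because the page ends are found in the raw bytes before any decoding happens, no multibyte character is ever split, at the cost of pages whose sizes vary with line length.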