jka-compr.el doesn't recognise gzipped files from their magic bytes

unofficial mirror of emacs-devel@gnu.org 

* jka-compr.el doesn't recognise gzipped files from their magic bytes
@ 2007-09-17 19:28 Chris Moore
  2007-09-17 20:29 ` Eli Zaretskii
  2007-09-17 21:50 ` Stefan Monnier
  0 siblings, 2 replies; 17+ messages in thread
From: Chris Moore @ 2007-09-17 19:28 UTC (permalink / raw)
  To: emacs-pretest-bug

I'm using an application which saves natively in .sif format, or
optionally in .sifz format, which is the same as .sif, only gzipped.

Opening a .sifz file in Emacs shows binary junk - Emacs doesn't
recognise that it's really a gzipped file, even though it does know
the 'magic' string that identifies all .gz files (see the value of
jka-compr-compression-info-list).

So shouldn't jka-compr look for the magic strings, and identify
gzipped files that way?

I notice that '.svgz', the common extension for gzipped .svg vector
graphics files, has been hardcoded into jka-cmpr-hook.el, instead of
general support being added for gzipped files, whatever the extension.
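
A minimal sketch of the check being asked for, written in Python rather
than Emacs Lisp and not taken from jka-compr. It assumes only that every
gzip file begins with the two bytes 0x1f 0x8b, as the gzip format
specifies; the helper name looks_gzipped and the file names are
hypothetical.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member (RFC 1952)


def looks_gzipped(path):
    """Return True if the file at PATH starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(len(GZIP_MAGIC)) == GZIP_MAGIC


# Demo: a gzipped file with an unrelated extension is still recognised,
# while the uncompressed variant is not.
tmpdir = tempfile.mkdtemp()
sifz_path = os.path.join(tmpdir, "drawing.sifz")
sif_path = os.path.join(tmpdir, "drawing.sif")
with gzip.open(sifz_path, "wb") as f:
    f.write(b"<canvas/>\n")
with open(sif_path, "wb") as f:
    f.write(b"<canvas/>\n")

print(looks_gzipped(sifz_path), looks_gzipped(sif_path))
```

The check depends on the file contents only, so the extension plays no
part in the decision.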


* Re: jka-compr.el doesn't recognise gzipped files from their magic bytes
  2007-09-17 19:28 jka-compr.el doesn't recognise gzipped files from their magic bytes Chris Moore
@ 2007-09-17 20:29 ` Eli Zaretskii
  2007-09-18  3:35   ` Stephen J. Turnbull
  2007-09-17 21:50 ` Stefan Monnier
  1 sibling, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2007-09-17 20:29 UTC (permalink / raw)
  To: Chris Moore; +Cc: emacs-pretest-bug

> Date: Mon, 17 Sep 2007 21:28:24 +0200
> From: "Chris Moore" <christopher.ian.moore@gmail.com>
> Cc: 
> 
> So shouldn't jka-compr look for the magic strings, and identify
> gzipped files that way?

It would be a performance hit: Emacs will need to open a file and read
its first bytes just to know what to do with it.  Then it will need to
close the file and reopen it again (or else jump through the hoops to
keep the first several bytes around for the second attempt).
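
For comparison, a rough Python sketch of the second option, where the
stream is rewound after the magic bytes are read instead of being closed
and reopened. It says nothing about how Emacs reads files internally; the
function name read_maybe_gzipped is hypothetical, and the stream is
assumed to be seekable (a regular file, not a pipe).

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"


def read_maybe_gzipped(stream):
    """Read all of STREAM (seekable, binary), gunzipping if it has gzip magic."""
    head = stream.read(len(GZIP_MAGIC))
    stream.seek(0)  # rewind in place; no close and reopen
    if head == GZIP_MAGIC:
        # Closing the GzipFile does not close the underlying stream.
        with gzip.GzipFile(fileobj=stream, mode="rb") as gz:
            return gz.read()
    return stream.read()


plain = b"hello\n"
packed = gzip.compress(plain)
print(read_maybe_gzipped(io.BytesIO(packed)) == plain)
print(read_maybe_gzipped(io.BytesIO(plain)) == plain)
```

The extra cost in this sketch is one two-byte read and one seek per file.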

I say, as long as we need to add a few more extensions, let's do that
and be done.


* Re: jka-compr.el doesn't recognise gzipped files from their magic bytes
  2007-09-17 19:28 jka-compr.el doesn't recognise gzipped files from their magic bytes Chris Moore
  2007-09-17 20:29 ` Eli Zaretskii
@ 2007-09-17 21:50 ` Stefan Monnier
  1 sibling, 0 replies; 17+ messages in thread
From: Stefan Monnier @ 2007-09-17 21:50 UTC (permalink / raw)
  To: Chris Moore; +Cc: emacs-pretest-bug

> So shouldn't jka-compr look for the magic strings, and identify
> gzipped files that way?

jka-compr uses a file-name-handler which (as the name indicates) works at
the file-name level to detect the need for special behavior.

So what you're suggesting is a different feature, which works more like
a special major mode.  That might make a lot of sense.
Contributions welcome.  Note that a major-mode would also have some
disadvantages: e.g. when saving a file to some new name ".sifz" Emacs
wouldn't know to compress it.

        Stefan


* Re: jka-compr.el doesn't recognise gzipped files from their magic bytes
  2007-09-17 20:29 ` Eli Zaretskii
@ 2007-09-18  3:35   ` Stephen J. Turnbull
  2007-09-18  4:14     ` Eli Zaretskii
  0 siblings, 1 reply; 17+ messages in thread
From: Stephen J. Turnbull @ 2007-09-18  3:35 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-pretest-bug, Chris Moore

Eli Zaretskii writes:

 > It would be a performance hit: Emacs will need to open a file and read
 > its first bytes just to know what to do with it.

This is precisely what an autodetecting coding system does.  You've
already got that code, why not use it? ;-)

According to some ancient notes from Ben Wing this could be a little
tricky for gzip, but I can't imagine it's actually *hard*.


* Re: jka-compr.el doesn't recognise gzipped files from their magic bytes
  2007-09-18  3:35   ` Stephen J. Turnbull
@ 2007-09-18  4:14     ` Eli Zaretskii
  2007-09-18  5:49       ` Stephen J. Turnbull
  0 siblings, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2007-09-18  4:14 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: emacs-pretest-bug, christopher.ian.moore

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: "Chris Moore" <christopher.ian.moore@gmail.com>,
>     emacs-pretest-bug@gnu.org
> Date: Tue, 18 Sep 2007 12:35:27 +0900
> 
> Eli Zaretskii writes:
> 
>  > It would be a performance hit: Emacs will need to open a file and read
>  > its first bytes just to know what to do with it.
> 
> This is precisely what an autodetecting coding system does.

No, that's not true; at least not in GNU Emacs.  Autodetecting reads
the file in its entirety.


* Re: jka-compr.el doesn't recognise gzipped files from their magic bytes
  2007-09-18  4:14     ` Eli Zaretskii
@ 2007-09-18  5:49       ` Stephen J. Turnbull
  2007-09-18 19:32         ` Eli Zaretskii
  0 siblings, 1 reply; 17+ messages in thread
From: Stephen J. Turnbull @ 2007-09-18  5:49 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-pretest-bug, christopher.ian.moore

Eli Zaretskii writes:

 > > From: "Stephen J. Turnbull" <stephen@xemacs.org>
 > > Cc: "Chris Moore" <christopher.ian.moore@gmail.com>,
 > >     emacs-pretest-bug@gnu.org
 > > Date: Tue, 18 Sep 2007 12:35:27 +0900
 > > 
 > > Eli Zaretskii writes:
 > > 
 > >  > It would be a performance hit: Emacs will need to open a file and read
 > >  > its first bytes just to know what to do with it.
 > > 
 > > This is precisely what an autodetecting coding system does.
 > 
 > No, that's not true; at least not in GNU Emacs.  Autodetecting reads
 > the file in its entirety.

Well, if you're going to open and convert the file anyway, this isn't
really a loss.

Furthermore, there must be some scheme for buffering for process
coding systems.


* Re: jka-compr.el doesn't recognise gzipped files from their magic bytes
  2007-09-18  5:49       ` Stephen J. Turnbull
@ 2007-09-18 19:32         ` Eli Zaretskii
  2007-09-19  3:55           ` Stephen J. Turnbull
  0 siblings, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2007-09-18 19:32 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: emacs-pretest-bug, christopher.ian.moore

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: christopher.ian.moore@gmail.com,
>     emacs-pretest-bug@gnu.org
> Date: Tue, 18 Sep 2007 14:49:17 +0900
> 
> Furthermore, there must be some scheme for buffering for process
> coding systems.

I'm not sure what buffering you had in mind.  AFAIK, Emacs reads into
a buffer whatever portion of a process's output is available in the
OS-maintained pipe, decodes that portion, and passes the decoded text
to the filter or sentinel function.

If I understand your intent correctly, there's no buffering on the
Emacs side in this case.


* Re: jka-compr.el doesn't recognise gzipped files from their magic bytes
  2007-09-18 19:32         ` Eli Zaretskii
@ 2007-09-19  3:55           ` Stephen J. Turnbull
  2007-09-19 18:43             ` Eli Zaretskii
  0 siblings, 1 reply; 17+ messages in thread
From: Stephen J. Turnbull @ 2007-09-19  3:55 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-pretest-bug, christopher.ian.moore

Eli Zaretskii writes:

 > > From: "Stephen J. Turnbull" <stephen@xemacs.org>
 > > Cc: christopher.ian.moore@gmail.com,
 > >     emacs-pretest-bug@gnu.org
 > > Date: Tue, 18 Sep 2007 14:49:17 +0900
 > > 
 > > Furthermore, there must be some scheme for buffering for process
 > > coding systems.
 > 
 > I'm not sure what buffering you had in mind.  AFAIK, Emacs reads into
 > a buffer whatever portion of a process's output is available in the
 > OS-maintained pipe, decodes that portion, and passes the decoded text
 > to the filter or sentinel function.

Decodes with what? is the question.  If the process output coding
system is nil, Emacs must "guess" what it is.  This requires reading
all of the input and collecting some "statistics about it", then
making a decision about which coding system to use, and finally it
*goes back to the beginning* and decodes according to that coding
system.  Which is exactly the situation we are describing here.

There may be no buffer there that Emacs can access under normal
circumstances; however, there is some mechanism that can be used by
coding systems.  For example, CCL could be used.


* Re: jka-compr.el doesn't recognise gzipped files from their magic bytes
  2007-09-19  3:55           ` Stephen J. Turnbull
@ 2007-09-19 18:43             ` Eli Zaretskii
  2007-09-20 18:53               ` Stephen J. Turnbull
  0 siblings, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2007-09-19 18:43 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: emacs-pretest-bug, christopher.ian.moore

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: emacs-pretest-bug@gnu.org,
>     christopher.ian.moore@gmail.com
> Date: Wed, 19 Sep 2007 12:55:50 +0900
> 
> If the process output coding system is nil, Emacs must "guess" what
> it is.

Yes.

> This requires reading all of the input

No.


* Re: jka-compr.el doesn't recognise gzipped files from their magic bytes
  2007-09-19 18:43             ` Eli Zaretskii
@ 2007-09-20 18:53               ` Stephen J. Turnbull
  2007-09-20 19:06                 ` Stefan Monnier
  0 siblings, 1 reply; 17+ messages in thread
From: Stephen J. Turnbull @ 2007-09-20 18:53 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-pretest-bug, christopher.ian.moore

Eli Zaretskii writes:
 > > From: "Stephen J. Turnbull" <stephen@xemacs.org>
 > > Cc: emacs-pretest-bug@gnu.org,
 > >     christopher.ian.moore@gmail.com
 > > Date: Wed, 19 Sep 2007 12:55:50 +0900
 > > 
 > > If the process output coding system is nil, Emacs must "guess" what
 > > it is.
 > 
 > Yes.
 > 
 > > This requires reading all of the input
 > 
 > No.

OK, you win; I'm not going to waste time trying to *make* you
understand.  Feel free to ask what I'm trying to say some time when
you're in a better mood.


* Re: jka-compr.el doesn't recognise gzipped files from their magic bytes
  2007-09-20 18:53               ` Stephen J. Turnbull
@ 2007-09-20 19:06                 ` Stefan Monnier
  2007-09-20 23:24                   ` Stephen J. Turnbull
  0 siblings, 1 reply; 17+ messages in thread
From: Stefan Monnier @ 2007-09-20 19:06 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: emacs-pretest-bug, Eli Zaretskii, christopher.ian.moore

> OK, you win; I'm not going to waste time trying to *make* you
> understand.  Feel free to ask what I'm trying to say some time when
> you're in a better mood.

Let me try to answer: I don't know how it really works, but IIUC as long as
we only get ASCII bytes without end of line, the coding system is left as
`undecided' and on the first packet we receive with an LF or CR or a byte
of 128 or above, the coding system is decided based on this packet and this
packet only.  Tho, I guess the decision on EOL is orthogonal, so we may go
from `undecided' to `undecided-unix' on one packet (or to `latin-undecided')
and only get to `latin1-unix' on a later packet.

Anyway, this is only based on the mental model I created to explain to
myself what I see (also based on a vague impression of how I'd have tried to
code it if I had had to), but I haven't actually looked at the relevant
code.  So maybe it works completely differently.
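
The mental model above can be written down as a toy state machine. This
is a Python illustration of the model only, not of what Emacs actually
does: the class PacketDetector, the "try UTF-8, else Latin-1" rule, and
the result names are all assumptions made for the example.

```python
class PacketDetector:
    """Toy model: decide text coding and EOL style independently, per packet."""

    def __init__(self):
        self.text = "undecided"
        self.eol = None  # stays None until a packet contains CR or LF

    def feed(self, packet):
        # Decide the text coding from the first packet that contains a
        # non-ASCII byte, looking at that packet and that packet only.
        if self.text == "undecided" and any(b >= 128 for b in packet):
            try:
                packet.decode("utf-8")
                self.text = "utf-8"
            except UnicodeDecodeError:
                self.text = "latin-1"
        # Decide the EOL style from the first packet that contains CR or LF.
        if self.eol is None:
            if b"\r\n" in packet:
                self.eol = "dos"
            elif b"\r" in packet:
                self.eol = "mac"
            elif b"\n" in packet:
                self.eol = "unix"
        return self.name()

    def name(self):
        return self.text if self.eol is None else f"{self.text}-{self.eol}"


detector = PacketDetector()
states = [detector.feed(p) for p in (b"abc", b"def\n", b"caf\xe9")]
print(states)  # ['undecided', 'undecided-unix', 'latin-1-unix']
```

A packet boundary that falls inside a multibyte sequence would make this
toy guess wrong, which is one reason to be wary of deciding from a single
packet.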

        Stefan


* Re: jka-compr.el doesn't recognise gzipped files from their magic bytes
  2007-09-20 19:06                 ` Stefan Monnier
@ 2007-09-20 23:24                   ` Stephen J. Turnbull
  2007-09-21  3:35                     ` Stefan Monnier
  0 siblings, 1 reply; 17+ messages in thread
From: Stephen J. Turnbull @ 2007-09-20 23:24 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-pretest-bug, Eli Zaretskii, christopher.ian.moore

Stefan Monnier writes:

 > Let me try to answer: I don't know how it really works, but IIUC as long as
 > we only get ASCII bytes without end of line, the coding system is left as
 > `undecided' and on the first packet we receive with an LF or CR or a byte
 > of 128 or above, the coding system is decided based on this packet and this
 > packet only.  Tho, I guess the decision on EOL is orthogonal, so we may go
 > from `undecided' to `undecided-unix' on one packet (or to `latin-undecided')
 > and only get to `latin1-unix' on a later packet.

Could be, although I really wouldn't want to make a decision based on
a very few non-ASCII bytes.  Point is, there has to be a buffer
holding that packet that the coding system has access to.  Perhaps
even an Emacs buffer in binary coding system or buffer-as-unibyte
mode.  It is analyzed and then the process seeks back to where the
non-ASCII stuff started (or, more likely, the beginning of the
buffer), and decodes it.  Then further input is read.

The same thing can surely be done with magic numbers to identify
images, zipfiles, and the like.  There is no need to open, close, and
reopen the stream, and none of the inefficiency that Eli was claiming.
The exception is if you're going to process it through an external
process (such as /bin/gzip) anyway, in which case the detection phase
is pretty small overhead compared to the convenience of doing the
detection in Emacs.


* Re: jka-compr.el doesn't recognise gzipped files from their magic bytes
  2007-09-20 23:24                   ` Stephen J. Turnbull
@ 2007-09-21  3:35                     ` Stefan Monnier
  2007-09-21  5:00                       ` Stephen J. Turnbull
  0 siblings, 1 reply; 17+ messages in thread
From: Stefan Monnier @ 2007-09-21  3:35 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: emacs-pretest-bug, Eli Zaretskii, christopher.ian.moore

> Could be, although I really wouldn't want to make a decision based on
> a very few non-ASCII bytes.

I can't see there being much opportunity for choice.

> Point is, there has to be a buffer holding that packet that the coding
> system has access to.

Well, there's obviously some "char*" array into which the data is read.
But it's not like there's a buffer where we keep data for later
processing, AFAIK, because that'd be difficult to handle correctly
(how long would we wait until processing it?).


        Stefan


* Re: jka-compr.el doesn't recognise gzipped files from their magic bytes
  2007-09-21  3:35                     ` Stefan Monnier
@ 2007-09-21  5:00                       ` Stephen J. Turnbull
  2007-09-21  5:10                         ` Kenichi Handa
  0 siblings, 1 reply; 17+ messages in thread
From: Stephen J. Turnbull @ 2007-09-21  5:00 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-pretest-bug, Eli Zaretskii, christopher.ian.moore

Stefan Monnier writes:

 > But it's not like there's a buffer where we keep data for later
 > processing, AFAIK, because that'd be difficult to handle correctly
 > (how long would we wait until processing it?).

CCL has access to that stream.  Lisp has access to CCL.  There will be
a buffer there, and you can get to it via CCL.

Ask Handa-san, he'll know.


* Re: jka-compr.el doesn't recognise gzipped files from their magic bytes
  2007-09-21  5:00                       ` Stephen J. Turnbull
@ 2007-09-21  5:10                         ` Kenichi Handa
  2007-09-21  6:21                           ` Stephen J. Turnbull
  0 siblings, 1 reply; 17+ messages in thread
From: Kenichi Handa @ 2007-09-21  5:10 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: emacs-pretest-bug, eliz, christopher.ian.moore, monnier

In article <871wcs394l.fsf@uwakimon.sk.tsukuba.ac.jp>, "Stephen J. Turnbull" <stephen@xemacs.org> writes:

> Stefan Monnier writes:
> But it's not like there's a buffer where we keep data for later
> processing, AFAIK, because that'd be difficult to handle correctly
> (how long would we wait until processing it?).

> CCL has access to that stream.  Lisp has access to CCL.  There will be
> a buffer there, and you can get to it via CCL.

> Ask Handa-san, he'll know.

Sorry, I have not followed this thread.  Could someone
please explain how CCL is related here?

---
Kenichi Handa
handa@m17n.org


* Re: jka-compr.el doesn't recognise gzipped files from their magic bytes
  2007-09-21  5:10                         ` Kenichi Handa
@ 2007-09-21  6:21                           ` Stephen J. Turnbull
  2007-09-21  7:41                             ` Kenichi Handa
  0 siblings, 1 reply; 17+ messages in thread
From: Stephen J. Turnbull @ 2007-09-21  6:21 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: emacs-pretest-bug, eliz, christopher.ian.moore, monnier

Kenichi Handa writes:

 > Sorry, I have not followed this thread.  Could someone
 > please explain how CCL is related here?

I'm drawing an analogy between (1) looking at a text stream (process
or file) with autodetection, then decoding the text according to the
detected coding system, and (2) looking at the magic in a file, and
decoding it according to that magic.  The example of (2) discussed
here is a gzipped stream.  The problem to solve is that jka-compr
doesn't recognize such a stream unless it's a file with a name with
.gz extension.

Eli claims that looking at magic and then decoding the file would
necessarily introduce a lot of overhead, and I say that that is not
true, because it's not true for autodetected coding systems.

I also claim that for coding that can be implemented inside of Emacs,
the coding system framework could be adapted to this (or perhaps even
used directly; Ben Wing has demonstrated this in XEmacs with zlib, but
XEmacs is a quite different implementation from current Emacs).

The point about CCL is simply that a CCL coding system is completely
separate from the detector, and so must work on buffered input.  You
have to be able to seek to the beginning of the input stream and then
translate it.


* Re: jka-compr.el doesn't recognise gzipped files from their magic bytes
  2007-09-21  6:21                           ` Stephen J. Turnbull
@ 2007-09-21  7:41                             ` Kenichi Handa
  0 siblings, 0 replies; 17+ messages in thread
From: Kenichi Handa @ 2007-09-21  7:41 UTC (permalink / raw)
  To: Stephen J. Turnbull
  Cc: emacs-pretest-bug, eliz, christopher.ian.moore, monnier, handa

In article <87wsuk1quo.fsf@uwakimon.sk.tsukuba.ac.jp>, "Stephen J. Turnbull" <stephen@xemacs.org> writes:

> I'm drawing an analogy between (1) looking at a text stream (process
> or file) with autodetection, then decoding the text according to the
> detected coding system, and (2) looking at the magic in a file, and
> decoding it according to that magic.  The example of (2) discussed
> here is a gzipped stream.  The problem to solve is that jka-compr
> doesn't recognize such a stream unless it's a file with a name with
> .gz extension.

> Eli claims that looking at magic and then decoding the file would
> necessarily introduce a lot of overhead, and I say that that is not
> true, because it's not true for autodetected coding systems.

I think it's possible to implement what you want by
modifying Finsert_file_contents as follows:

In the case of visiting a regular file, it reads the whole
file into a unibyte buffer, then calls set-auto-coding
(Lisp).  We can make it read only the first 1K and the last 3K
bytes into a unibyte buffer and call set-auto-coding.  And
set-auto-coding can be modified to handle a magic header
(perhaps, if the gzip magic is found, replace the buffer with
the gunzipped data, and then do the normal work of
set-auto-coding).  Finsert_file_contents can tell whether the
buffer already contains the full data by checking whether the
buffer has been modified.
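
The partial read proposed above might look roughly like this. It is a
Python sketch, not the C code of Finsert_file_contents; the 1K and 3K
sizes come from the message, and everything else (the function name
read_head_and_tail, the returned tuple, the demo file) is hypothetical.

```python
import os
import tempfile

HEAD_BYTES = 1024      # "the first 1K"
TAIL_BYTES = 3 * 1024  # "the last 3K"
GZIP_MAGIC = b"\x1f\x8b"


def read_head_and_tail(path):
    """Return (head, tail, complete) for the file at PATH.

    COMPLETE is True when the file is small enough that HEAD already
    holds all of its data, in which case TAIL is empty.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        if size <= HEAD_BYTES + TAIL_BYTES:
            return f.read(), b"", True
        head = f.read(HEAD_BYTES)
        f.seek(size - TAIL_BYTES)
        tail = f.read(TAIL_BYTES)
        return head, tail, False


# Demo: a 10000-byte file that begins with the gzip magic bytes.
demo_path = os.path.join(tempfile.mkdtemp(), "big.bin")
with open(demo_path, "wb") as f:
    f.write(GZIP_MAGIC + b"x" * 9998)

head, tail, complete = read_head_and_tail(demo_path)
print(len(head), len(tail), complete, head.startswith(GZIP_MAGIC))
```

The caller can test the head for a magic header before deciding whether
the rest of the file needs to be read at all.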

Anyway, my long-term todo list includes reimplementing
insert-file-contents in Lisp and changing the current
Finsert_file_contents to do only basic file reading (as in
Emacs 19).  Then we can improve insert-file-contents more
easily.

---
Kenichi Handa
handa@m17n.org
