Reading portions of large files

unofficial mirror of help-gnu-emacs@gnu.org

* Reading portions of large files
@ 2003-01-09 15:45 Gerald.Jean
  0 siblings, 0 replies; 21+ messages in thread
From: Gerald.Jean @ 2003-01-09 15:45 UTC (permalink / raw)


Hello,

I have very large files, sometimes over 1G, from which I would like to edit
very small portions, the headers or trailers for example.  Emacs won't open
those files, it complains about them being too big.  Is it possible to
edit, and save back after editing, only small portions of such files?

Thanks,

Gérald Jean
Analyste-conseil (statistiques), Actuariat
télephone            : (418) 835-4900 poste (7639)
télecopieur          : (418) 835-6657
courrier électronique: gerald.jean@spgdag.ca

"In God we trust all others must bring data"  W. Edwards Deming


[parent not found: <mailman.100.1042135372.21513.help-gnu-emacs@gnu.org>]

* Re: Reading portions of large files
       [not found] <mailman.100.1042135372.21513.help-gnu-emacs@gnu.org>
@ 2003-01-09 18:20 ` David Kastrup
  2003-01-10 19:21   ` Eli Zaretskii
       [not found]   ` <mailman.153.1042230313.21513.help-gnu-emacs@gnu.org>
  2003-01-10 16:27 ` Eric Pement
  2003-01-10 17:16 ` Brendan Halpin
  2 siblings, 2 replies; 21+ messages in thread
From: David Kastrup @ 2003-01-09 18:20 UTC (permalink / raw)

Gerald.Jean@spgdag.ca writes:

> Hello,
> 
> I have very large files, sometimes over 1G, from which I would like to edit
> very small portions, the headers or trailers for example.  Emacs won't open
> those files, it complains about them being too big.  Is it possible to
> edit, and save back after editing, only small portions of such files.

insert-file-contents is a built-in function.
(insert-file-contents FILENAME &optional VISIT BEG END REPLACE)

Insert contents of file FILENAME after point.
Returns list of absolute file name and number of bytes inserted.
If second argument VISIT is non-nil, the buffer's visited filename
and last save file modtime are set, and it is marked unmodified.
If visiting and the file does not exist, visiting is completed
before the error is signaled.
The optional third and fourth arguments BEG and END
specify what portion of the file to insert.

[...]

As to writing?  No idea at the moment.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum


* Re: Reading portions of large files
  2003-01-09 18:20 ` David Kastrup
@ 2003-01-10 19:21   ` Eli Zaretskii
       [not found]   ` <mailman.153.1042230313.21513.help-gnu-emacs@gnu.org>
  1 sibling, 0 replies; 21+ messages in thread
From: Eli Zaretskii @ 2003-01-10 19:21 UTC (permalink / raw)


> From: David Kastrup <dak@gnu.org>
> Newsgroups: gnu.emacs.help
> Date: 09 Jan 2003 19:20:06 +0100
> 
> > I have very large files, sometimes over 1G, from which I would like to edit
> > very small portions, the headers or trailers for example.  Emacs won't open
> > those files, it complains about them being too big.  Is it possible to
> > edit, and save back after editing, only small portions of such files.
> 
> insert-file-contents is a built-in function.
> (insert-file-contents FILENAME &optional VISIT BEG END REPLACE)

I don't think this will help the OP, since BEG and END need to be
representable as Lisp integers, so they still are subject to the same
128-MB limit.


[parent not found: <mailman.153.1042230313.21513.help-gnu-emacs@gnu.org>]

* Re: Reading portions of large files
       [not found]   ` <mailman.153.1042230313.21513.help-gnu-emacs@gnu.org>
@ 2003-01-10 20:51     ` David Kastrup
  2003-01-11  8:51       ` Eli Zaretskii
                         ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: David Kastrup @ 2003-01-10 20:51 UTC (permalink / raw)

"Eli Zaretskii" <eliz@is.elta.co.il> writes:

> > From: David Kastrup <dak@gnu.org>
> > Newsgroups: gnu.emacs.help
> > Date: 09 Jan 2003 19:20:06 +0100
> > 
> > > I have very large files, sometimes over 1G, from which I would
> > > like to edit very small portions, the headers or trailers for
> > > example.  Emacs won't open those files, it complains about them
> > > being too big.  Is it possible to edit, and save back after
> > > editing, only small portions of such files.
> > 
> > insert-file-contents is a built-in function.
> > (insert-file-contents FILENAME &optional VISIT BEG END REPLACE)
> 
> I don't think this will help the OP, since BEG and END need to be
> representable as Lisp integers, so they still are subject to the same
> 128-MB limit.

Oops, I forgot.  In that case it would probably be best to run dd
from or to pipes with appropriate options for writing and reading
pieces from a big file.

BTW, would it be terribly complicated to extend the range of Lisp
integers to 31bit?  Integers don't need any garbage collection or tag
bits per se.  One could still use, say, the upper byte (or a smaller
unit) as a tag byte, only that the first or last 128 values would all
signify "integer".

Emacs has a most-positive-fixnum of 134217727, while XEmacs has
1073741823, more than 8 times as much.  So it would appear to be
possible in theory.
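
The two limits quoted above are each one less than a power of two, which a quick check in shell arithmetic confirms. This is only an illustrative sketch: the variable names are made up, and the bit widths (27 and 30 value bits plus a sign bit) are inferred from the numbers themselves rather than taken from either editor's source.

```shell
# Largest value of a signed integer with 27 value bits (28 bits with sign).
emacs_max=$(( (1 << 27) - 1 ))
# Largest value of a signed integer with 30 value bits (31 bits with sign).
xemacs_max=$(( (1 << 30) - 1 ))

# Express each limit as a buffer size in MiB (1 MiB = 1048576 bytes).
emacs_mib=$(( (emacs_max + 1) / 1048576 ))
xemacs_mib=$(( (xemacs_max + 1) / 1048576 ))

echo "Emacs:  max $emacs_max, just under $emacs_mib MiB"
echo "XEmacs: max $xemacs_max, just under $xemacs_mib MiB"
```

So three extra value bits move the ceiling from 128 MiB to 1 GiB, which would still not cover files "sometimes over 1G".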

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum


* Re: Reading portions of large files
  2003-01-10 20:51     ` David Kastrup
@ 2003-01-11  8:51       ` Eli Zaretskii
       [not found]       ` <mailman.169.1042278925.21513.help-gnu-emacs@gnu.org>
  2003-01-12 20:38       ` Stefan Monnier <foo@acm.com>
  2 siblings, 0 replies; 21+ messages in thread
From: Eli Zaretskii @ 2003-01-11  8:51 UTC (permalink / raw)


> From: David Kastrup <dak@gnu.org>
> Newsgroups: gnu.emacs.help
> Date: 10 Jan 2003 21:51:49 +0100
> 
> BTW, would it be terribly complicated to extend the range of Lisp
> integers to 31bit?

It's not terribly hard, but IIRC the current consensus among the Emacs
maintainers is that it's not important enough to do that because
before long all machines will have 64-bit compilers.

Perhaps this should be discussed again on the developers' list.

> Integers don't need any garbage collection or tag bits per se.

They need to be distinguishable from other Lisp types, so their tag
bitfield cannot have an arbitrary bit pattern.

> Emacs has a most-positive-fixnum of 134217727, while XEmacs has
> 1073741823, more than 8 times as much.  So it would appear to be
> possible in theory.

IIRC, the XEmacs way requires extensive changes in how Emacs works,
but I don't remember the details.


[parent not found: <mailman.169.1042278925.21513.help-gnu-emacs@gnu.org>]

* Re: Reading portions of large files
       [not found]       ` <mailman.169.1042278925.21513.help-gnu-emacs@gnu.org>
@ 2003-01-11 10:42         ` David Kastrup
  0 siblings, 0 replies; 21+ messages in thread
From: David Kastrup @ 2003-01-11 10:42 UTC (permalink / raw)


"Eli Zaretskii" <eliz@is.elta.co.il> writes:

> > From: David Kastrup <dak@gnu.org>
> > Newsgroups: gnu.emacs.help
> > Date: 10 Jan 2003 21:51:49 +0100
> > 
> > BTW, would it be terribly complicated to extend the range of Lisp
> > integers to 31bit?
> 
> It's not terribly hard, but IIRC the current consensus among the Emacs
> maintainers is that it's not important enough to do that because
> before long all machines will have 64-bit compilers.
> 
> Perhaps this should be discussed again on the developers' list.
> 
> > Integers don't need any garbage collection or tag bits per se.
> 
> They need to be distinguishable from other Lisp types, so their tag
> bitfield cannot have an arbitrary bit pattern.

Yes, but a single bit is sufficient for that distinction.  This could
even speed up operations, since the sign bit is a candidate that can
be rather quickly checked.

Something like
  if (x < 0)
will establish that something is an integer,
  (x + 0x40000000)
will yield the value of the integer, and
  (x | 0x80000000)
will convert an integer back to a Lisp number.

I don't know whether an integer Lisp object needs to be identical to
an integer.  If it does, then the above needs an offset of 0x40000000
everywhere, of course.

> > Emacs has a most-positive-fixnum of 134217727, while XEmacs has
> > 1073741823, more than 8 times as much.  So it would appear to be
> > possible in theory.
> 
> IIRC, the XEmacs way requires extensive changes in how Emacs works,
> but I don't remember the details.

No clue about that.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum


* Re: Reading portions of large files
  2003-01-10 20:51     ` David Kastrup
  2003-01-11  8:51       ` Eli Zaretskii
       [not found]       ` <mailman.169.1042278925.21513.help-gnu-emacs@gnu.org>
@ 2003-01-12 20:38       ` Stefan Monnier <foo@acm.com>
  2003-01-13  7:40         ` Miles Bader
  2003-01-20  7:50         ` Lee Sau Dan
  2 siblings, 2 replies; 21+ messages in thread
From: Stefan Monnier <foo@acm.com> @ 2003-01-12 20:38 UTC (permalink / raw)


> BTW, would it be terribly complicated to extend the range of Lisp
> integers to 31bit?

Currently a cons cell takes 2 words.  Each word has 3 tag bits and
1 mark bit.  When marking a cons cell, the GC sets the mark bit of the
first word of the cell.  The mark bit of the second word is unused
(i.e. wasted).

Since at least 1 bit of tag is needed, that means that to get 31bit
integers we'd need to move the mark bit somewhere else.  XEmacs decided to
use 3-word cons cells (and I know they're still regularly wondering
whether it was a good idea).  Another approach is to use a separate mark-bit
array.

Lots of trade offs, a fair bit of coding, even more testing, ...
Anybody interested is welcome to try it out.  My opinion is that maybe it
would be nice, but since the only application I'm aware of is "editing files
between 128MB and 1GB on 32bit systems", I don't think it's worth
the trouble.


        Stefan


* Re: Reading portions of large files
  2003-01-12 20:38       ` Stefan Monnier <foo@acm.com>
@ 2003-01-13  7:40         ` Miles Bader
  2003-01-13  7:42           ` Miles Bader
  2003-01-20  7:50         ` Lee Sau Dan
  1 sibling, 1 reply; 21+ messages in thread
From: Miles Bader @ 2003-01-13  7:40 UTC (permalink / raw)


"Stefan Monnier <foo@acm.com>" <monnier+gnu.emacs.help/news/@flint.cs.yale.edu> writes:
> Since at least 1 bit of tag is needed, that means that to get 31bit
> integers we'd need to move the mark bit somewhere else.

Hmmm?  I thought only boxed objects had to have a mark bit, in which case
integers don't need one.  [Indeed, looking at the current garbage
collector, it doesn't seem to mark integers]

I'd also like to have low-bit tags so I can stack-allocate lisp objects...

-Miles
-- 
Is it true that nothing can be known?  If so how do we know this?  -Woody Allen


* Re: Reading portions of large files
  2003-01-13  7:40         ` Miles Bader
@ 2003-01-13  7:42           ` Miles Bader
  2003-01-13  7:55             ` David Kastrup
  0 siblings, 1 reply; 21+ messages in thread
From: Miles Bader @ 2003-01-13  7:42 UTC (permalink / raw)


Miles Bader <miles@gnu.org> writes:
> Hmmm?  I thought only boxed object had to have a mark bit, in which case
> integers don't need one.  [Indeed, looking at the current garbage
> collector, it doesn't seem to mark integers]

Oh wait, I was confused, it does need a mark-bit for cons cells...

Sorry for the noise...

-miles
-- 
Freedom's just another word, for nothing left to lose   --Janis Joplin


* Re: Reading portions of large files
  2003-01-13  7:42           ` Miles Bader
@ 2003-01-13  7:55             ` David Kastrup
  2003-01-13  8:05               ` Miles Bader
  0 siblings, 1 reply; 21+ messages in thread
From: David Kastrup @ 2003-01-13  7:55 UTC (permalink / raw)


Miles Bader <miles@gnu.org> writes:

> Miles Bader <miles@gnu.org> writes:
> > Hmmm?  I thought only boxed object had to have a mark bit, in which case
> > integers don't need one.  [Indeed, looking at the current garbage
> > collector, it doesn't seem to mark integers]
> 
> Oh wait, I was confused, it does need a mark-bit for cons cells...
> 
> Sorry for the noise...

Cons cells are not integers.  Care to explain for somebody dull?

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum


* Re: Reading portions of large files
  2003-01-13  7:55             ` David Kastrup
@ 2003-01-13  8:05               ` Miles Bader
  0 siblings, 0 replies; 21+ messages in thread
From: Miles Bader @ 2003-01-13  8:05 UTC (permalink / raw)


David Kastrup <dak@gnu.org> writes:
> Cons cells are not integers.  Care to explain for somebody dull?

Cons cells don't have a header, so they need to use the mark-bit of one
of their components, meaning that anything you can store into a
cons-cell needs a mark-bit.

I wonder how feasible it would be to use another sort of GC, like
stop-and-copy, which doesn't need mark-bits...

-Miles
-- 
"1971 pickup truck; will trade for guns"


* Re: Reading portions of large files
  2003-01-12 20:38       ` Stefan Monnier <foo@acm.com>
  2003-01-13  7:40         ` Miles Bader
@ 2003-01-20  7:50         ` Lee Sau Dan
  2003-01-24  7:55           ` Mac
  2003-01-27 14:44           ` Stefan Monnier <foo@acm.com>
  1 sibling, 2 replies; 21+ messages in thread
From: Lee Sau Dan @ 2003-01-20  7:50 UTC (permalink / raw)


>>>>> "Stefan" == "Stefan Monnier <foo@acm.com>" <monnier+gnu.emacs.help/news/@flint.cs.yale.edu> writes:

    Stefan> Since at least 1 bit of tag is needed, that means that to
    Stefan> get 31bit integers we'd need to move the mark bit
    Stefan> somewhere else.  XEmacs decided to use 3-word cons cells
    Stefan> (and I know they're still regularly wondering whether it
    Stefan> was a good idea).  Another approach is to use a separate
    Stefan> mark-bit array.

I think the separate mark-bit  array would be cleaner.  You don't need
to access  the mark  bits unless  you're doing gc.   Why let  that bit
stick  there in  the  _main_ working  set  all the  time?  Wouldn't  a
separate mark-bit array also improve locality (important for caching)?

Then, in theory, the tag bits  can also be kept separately, giving the
full  32 bits to  integers (represented  as machine-native  words).  I
think  we only  need 1  tag bit  in the  separate tag-bit  array.  Its
function is  to indicate whether  the corresponding memory word  is an
integer or not.  If not, then  the remaining tag bits are found in the
word itself.  And integer arithmetic can certainly be faster!

Would this implementation be more efficient or worse?


    Stefan> Lots of trade offs, a fair bit of coding, even more
    Stefan> testing, ...  Anybody interested is welcome to tried it
    Stefan> out.  My opinion is that maybe it would be nice, but since
    Stefan> the only application I'm aware of is "editing files
    Stefan> between 128MB and 1GB on 32bit systems", I don't think
    Stefan> it's worth the trouble.

Yeah.  I share this last point with you.  >128MB text files are simply
weird.  And for binary files, a real hex editor (or 'xxd', which I just
discovered) is a more appropriate tool, or just 'dd'.


-- 
Lee Sau Dan                     李守敦(Big5)                    ~{@nJX6X~}(HZ) 

E-mail: danlee@informatik.uni-freiburg.de
Home page: http://www.informatik.uni-freiburg.de/~danlee


* Re: Reading portions of large files
  2003-01-20  7:50         ` Lee Sau Dan
@ 2003-01-24  7:55           ` Mac
  2003-01-27 14:44           ` Stefan Monnier <foo@acm.com>
  1 sibling, 0 replies; 21+ messages in thread
From: Mac @ 2003-01-24  7:55 UTC (permalink / raw)


On 20 Jan 2003, Lee Sau Dan wrote:
> 
>     Stefan> Lots of trade offs, a fair bit of coding, even more
>     Stefan> testing, ...  Anybody interested is welcome to tried it
>     Stefan> out.  My opinion is that maybe it would be nice, but
>     Stefan> since the only application I'm aware of is "editing
>     Stefan> files between 128MB and 1GB on 32bit systems", I don't
>     Stefan> think it's worth the trouble.
> 
> Yeah.  I share this last point with you.  >128MB text files are
> simply weird.  And for binary file, a real hex editor (or 'xxd',
> which I just discovered) is a more appropriate tool, or just 'dd'.

Well, it is a weird world. When working with hardware development,
file sizes over 128MB are very common (netlists, sdf-files,
logfiles...), although what you do with these huge files is
limited. It's mainly search and replace (occur, query-replace-regexp
etc).


/mac


* Re: Reading portions of large files
  2003-01-20  7:50         ` Lee Sau Dan
  2003-01-24  7:55           ` Mac
@ 2003-01-27 14:44           ` Stefan Monnier <foo@acm.com>
  1 sibling, 0 replies; 21+ messages in thread
From: Stefan Monnier <foo@acm.com> @ 2003-01-27 14:44 UTC (permalink / raw)


> think  we only  need 1  tag bit  in the  separate tag-bit  array.  Its
> function is  to indicate whether  the corresponding memory word  is an
> integer or not.  If not, then  the remaining tag bits are found in the
> word itself.  And integer arithmetic can certainly be faster!

Integer arithmetic performance is a complete non-issue in Emacs (and most
other tagged programming languages, as a matter of fact).


        Stefan


* Re: Reading portions of large files
       [not found] <mailman.100.1042135372.21513.help-gnu-emacs@gnu.org>
  2003-01-09 18:20 ` David Kastrup
@ 2003-01-10 16:27 ` Eric Pement
  2003-01-10 17:16 ` Brendan Halpin
  2 siblings, 0 replies; 21+ messages in thread
From: Eric Pement @ 2003-01-10 16:27 UTC (permalink / raw)


Gerald.Jean@spgdag.ca wrote in message news:<mailman.100.1042135372.21513.help-gnu-emacs@gnu.org>...
> Hello,
> 
> I have very large files, sometimes over 1G, from which I would like to 
> edit
> very small portions, the headers or trailers for example.  Emacs won't 
> open
> those files, it complains about them being too big.

  The Emacs FAQ says that Emacs 20 and above can be compiled "on some
64-bit systems" to handle files of up to 550 million Gigabytes. However,
it looks a bit dated and it would be handier if this section of the
GNU Emacs FAQ were brought more up-to-date.

   If you use Windows editors, Vedit (http://www.vedit.com) will edit
files of up to 2 Gigs in size, though it may take some time to load
files of this size.

   And for just over-the-top accommodation, PDT-Windows claims to
handle filesizes of up to 18 Exabytes (that's 18 billion Gigs)!  I
wonder if that is larger than the aggregate storage of all disks on
the Internet? On a more realistic plane, their website says they
have "easily edited files of 3 - 5 gigabytes".

   I've downloaded the eval version, and this editor is intended
for editing large databases or binary files. It won't work well for
plaintext or concatenated.tar program code. If you're interested,
the URL is http://www.pro-central.com/pdt_win.htm

HTH.

--
Eric Pement


* Re: Reading portions of large files
       [not found] <mailman.100.1042135372.21513.help-gnu-emacs@gnu.org>
  2003-01-09 18:20 ` David Kastrup
  2003-01-10 16:27 ` Eric Pement
@ 2003-01-10 17:16 ` Brendan Halpin
  2003-01-10 20:35   ` Benjamin Riefenstahl
  2003-01-20  7:50   ` Lee Sau Dan
  2 siblings, 2 replies; 21+ messages in thread
From: Brendan Halpin @ 2003-01-10 17:16 UTC (permalink / raw)


Gerald.Jean@spgdag.ca writes:

> I have very large files, sometimes over 1G, from which I would like to edit
> very small portions, the headers or trailers for example.  Emacs won't open
> those files, it complains about them being too big.  Is it possible to
> edit, and save back after editing, only small portions of such files.

Use head and tail to split the file into the header-to-be-edited
and the-rest. Edit the header-to-be-edited in emacs, save, then
concatenate the-rest onto it.

Assuming all editing is within the first 2000 bytes (not tested):

head -c 2000 bigfile > header-to-be-edited
tail -c +2001 bigfile > the-rest
# edit header-to-be-edited in Emacs and save it
cat header-to-be-edited the-rest > new-big-file

Even if the file is not too big to fit in Emacs, this should be
faster for very big files where the editing is all in a small
section. 
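
The recipe above can be tried end to end on a small stand-in file. The sketch below is only an illustration under stated assumptions: the file names, the 20-byte header, and the printf that replaces the header stand in for a real big file and a real Emacs session. It relies on head -c and tail -c +N, which GNU and BSD userlands provide.

```shell
workdir=$(mktemp -d)
big="$workdir/bigfile"

# Stand-in for the huge file: a 20-byte header line plus 5000 payload bytes.
printf 'title=old-name-here\n' > "$big"
head -c 5000 /dev/zero | tr '\0' 'x' >> "$big"

n=20   # number of leading bytes to split off for editing

# Split: the first n bytes go to the header, everything after to the rest.
head -c "$n" "$big" > "$workdir/header"
tail -c +"$((n + 1))" "$big" > "$workdir/rest"

# Stand-in for the editing session; the header is free to change length.
printf 'title=new\n' > "$workdir/header"

# Reassemble the edited header and the untouched remainder.
cat "$workdir/header" "$workdir/rest" > "$workdir/new-big-file"

rest_size=$(wc -c < "$workdir/rest" | tr -d ' ')
new_size=$(wc -c < "$workdir/new-big-file" | tr -d ' ')
first_line=$(head -n 1 "$workdir/new-big-file")
echo "$first_line ($new_size bytes, $rest_size of them untouched)"
```

Note that both the-rest and new-big-file are full-size copies, so the approach needs roughly twice the size of the original file in free disk space.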

Brendan
-- 
Brendan Halpin,  Department of Sociology,  University of Limerick,  Ireland
Tel: w +353-61-213147 f +353-61-202569 h +353-61-390476;  Room F2-025 x 3147
<mailto:brendan.halpin@ul.ie>        <http://wivenhoe.staff8.ul.ie/~brendan>


* Re: Reading portions of large files
  2003-01-10 17:16 ` Brendan Halpin
@ 2003-01-10 20:35   ` Benjamin Riefenstahl
  2003-01-11 10:25     ` Klaus Berndl
  2003-01-20  7:50     ` Lee Sau Dan
  2003-01-20  7:50   ` Lee Sau Dan
  1 sibling, 2 replies; 21+ messages in thread
From: Benjamin Riefenstahl @ 2003-01-10 20:35 UTC (permalink / raw)

Brendan Halpin <brendan.halpin@ul.ie> writes:
> Use head and tail to split the file into the header-to-be-edited and
> the-rest. Edit the header-to-be-edited in emacs, save, then
> concatenated the-rest onto it.
> 
> Assuming all editing is within the first 2000 bytes (not tested):
> 
> head -c2000 bigfile > header-to-be-edited
> tail -c+2001 bigfile > the-rest
> (edit header-to-be-edited, save)
> cat header-to-be-edited the-rest > new-big-file

This assumes a) Unix, b) that you have the space and time ;-) to deal
with the large temporary files.

If you can assume Unix, dd is a little better, I think.  I recently
had success with using it for extracting and later re-inserting a bit
in a large file.  Getting the options right is a bit of a pain, but
the main thing was getting the direction (extract and re-insert) right
and using conv=notrunc for re-insertion.  And then dd is oriented
towards blocks of bytes, not lines, of course.  And you cannot change
the size of the block to be edited, but then large files are usually
binary files, where you don't want to change byte offsets anyway.
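
A minimal sketch of the extract and re-insert cycle described here, run against a small stand-in file. The offset, length, and file names are made up for illustration, and the printf stands in for the editing step. All dd operands used (if=, of=, bs=, skip=, seek=, count=, conv=notrunc) are standard.

```shell
workdir=$(mktemp -d)
big="$workdir/bigfile"
piece="$workdir/piece"

# Stand-in for the huge file: 1024 'A' bytes.
head -c 1024 /dev/zero | tr '\0' 'A' > "$big"

offset=512   # byte offset of the region to edit
length=16    # size of the region in bytes; the edit must not change it

# Extract: with bs=1, skip= and count= are plain byte counts (slow but simple).
dd if="$big" of="$piece" bs=1 skip="$offset" count="$length" 2>/dev/null

# Stand-in for the editing step: same size, new content.
printf 'BBBBBBBBBBBBBBBB' > "$piece"

# Re-insert: seek= positions the write inside the output file, and
# conv=notrunc keeps dd from truncating the big file after the written bytes.
dd if="$piece" of="$big" bs=1 seek="$offset" count="$length" conv=notrunc 2>/dev/null

size_after=$(wc -c < "$big" | tr -d ' ')
patched=$(dd if="$big" bs=1 skip="$offset" count="$length" 2>/dev/null)
echo "size=$size_after patched=$patched"
```

Only the 16-byte piece is ever copied, which is the advantage over the head/tail approach. Without conv=notrunc the second dd would cut the big file off right after the re-inserted bytes.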

so long, benny


* Re: Reading portions of large files
  2003-01-10 20:35   ` Benjamin Riefenstahl
@ 2003-01-11 10:25     ` Klaus Berndl
  2003-01-20  7:50     ` Lee Sau Dan
  1 sibling, 0 replies; 21+ messages in thread
From: Klaus Berndl @ 2003-01-11 10:25 UTC (permalink / raw)


On 10 Jan 2003, Benjamin Riefenstahl wrote:



>  Brendan Halpin <brendan.halpin@ul.ie> writes:
> > Use head and tail to split the file into the header-to-be-edited and
> > the-rest. Edit the header-to-be-edited in emacs, save, then
> > concatenated the-rest onto it.
> > 
> > Assuming all editing is within the first 2000 bytes (not tested):
> > 
> > head -c2000 bigfile > header-to-be-edited
> > tail -c+2001 bigfile > the-rest
> > (edit header-to-be-edited, save)
> > cat header-to-be-edited the-rest > new-big-file
>  
>  This assumes a) Unix, b) that you have the space and time ;-) to deal
>  with the large temporary files.

Assumption a) is not necessary or correct because there is the cygwin-suite
for Windows available - IMHO a must for using Emacs on Windows-systems ;-)
Cygwin contains tail and head!

Klaus


-- 
Klaus Berndl			mailto: klaus.berndl@sdm.de
sd&m AG				http://www.sdm.de
software design & management
Thomas-Dehler-Str. 27, 81737 München, Germany
Tel +49 89 63812-392, Fax -220


* Re: Reading portions of large files
  2003-01-10 20:35   ` Benjamin Riefenstahl
  2003-01-11 10:25     ` Klaus Berndl
@ 2003-01-20  7:50     ` Lee Sau Dan
  2003-01-20 12:46       ` Benjamin Riefenstahl
  1 sibling, 1 reply; 21+ messages in thread
From: Lee Sau Dan @ 2003-01-20  7:50 UTC (permalink / raw)

>>>>> "Benjamin" == Benjamin Riefenstahl <Benjamin.Riefenstahl@epost.de> writes:

    >> Assuming all editing is within the first 2000 bytes (not
    >> tested):
    >> 
    >> head -c2000 bigfile > header-to-be-edited 
    >> tail -c+2001 bigfile > the-rest
    >>   (edit header-to-be-edited, save)
    >> cat header-to-be-edited the-rest > new-big-file

    Benjamin> This assumes a) Unix, b) that you have the space and
    Benjamin> time ;-) to deal with the large temporary files.

(b) is assumed even if you use another method.  Most *text* editors
save files by first writing a temporary copy of the new version and
then renaming the new version to the old name.  So, in case of a
crash, you don't lose everything: either the old version or the new
version should survive intact.

So, if you don't have the extra disk space, you can't do the editing
either.

Time?  It doesn't take much time to 'split' and 'cat'.  Moreover,
running the editor on smaller pieces does save time on loading and
saving the file fragments, and the editor doesn't need as much RAM
when editing the file.

    Benjamin> If you can assume Unix, dd is a little better, I think.

Why not 'split'?

    Benjamin> I recently had success with using it for extracting and
    Benjamin> later re-inserting a bit in a large file.  

Only when the extracted and re-inserted blocks are of the same size.
This is the case for hex editing, but not *text* editing.  If you're
doing hex editing, you shouldn't be using a text editor in the first
place.  There are hex editors which don't need to load the whole
file into memory.

    Benjamin> Getting the options right is a bit of a pain, 

No.  That  is true  only when  you're using 'dd'  for the  first time.
After a few times, it's easy to remember what options to use.  Most of
the  time, I  only  need  "if=", "of=",  "bs=",  "skip=", "seek="  and
"count=".  These option names are quite easy to remember once you know
the  basic principle  that 'dd'  works by  transferring blocks  of the
input file to output file.
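
To make the block-transfer principle concrete, here is a small sketch using those operands on a 12-byte stand-in file. The file names and the 4-byte block size are arbitrary choices for the illustration, and conv=notrunc is added so the output file keeps its length.

```shell
workdir=$(mktemp -d)
src="$workdir/input"
dst="$workdir/output"

# Input made of three 4-byte blocks: AAAA BBBB CCCC.
printf 'AAAABBBBCCCC' > "$src"

# bs=4 sets the block size; skip=1 skips one input block; count=1 copies one.
second_block=$(dd if="$src" bs=4 skip=1 count=1 2>/dev/null)

# seek= is the output-side counterpart of skip=: it skips output blocks
# before writing.  Here input block 2 (CCCC) lands on output block 1.
printf 'xxxxxxxxxxxx' > "$dst"
dd if="$src" of="$dst" bs=4 skip=2 seek=1 count=1 conv=notrunc 2>/dev/null
result=$(cat "$dst")

echo "second block: $second_block, output: $result"
```

skip=, seek=, and count= are all measured in units of bs=, so a larger block size gives faster transfers at the cost of coarser offsets.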

    Benjamin> but the main thing was getting the direction (extract
    Benjamin> and re-insert) right and using conv=notrunc for
    Benjamin> re-insertion.  And than dd is oriented towards blocks of
    Benjamin> bytes, not lines, of course.  

This  is the  down side.   For line-oriented  operations,  use 'head',
'tail', 'cat', 'sed', or even 'awk' and 'perl'.

    Benjamin> And you can not change the size of the block to be
    Benjamin> edited, but than large files are usually binary files,
    Benjamin> where you don't want to change byte offsets anyway.

Then, find a hex editor.  *Text* editors are simply not the right tool
to  edit  huge  *binary*  files.    In  theory,  hex  editors  can  be
implemented very efficiently using mmap().

-- 
Lee Sau Dan                     李守敦(Big5)                    ~{@nJX6X~}(HZ) 

E-mail: danlee@informatik.uni-freiburg.de
Home page: http://www.informatik.uni-freiburg.de/~danlee


* Re: Reading portions of large files
  2003-01-20  7:50     ` Lee Sau Dan
@ 2003-01-20 12:46       ` Benjamin Riefenstahl
  0 siblings, 0 replies; 21+ messages in thread
From: Benjamin Riefenstahl @ 2003-01-20 12:46 UTC (permalink / raw)


Hi,


> [attribution cut off]
>     >> head -c2000 bigfile > header-to-be-edited 
>     >> tail -c+2001 bigfile > the-rest
>     >> [...]

> >>>>> "Benjamin" == Benjamin Riefenstahl <Benjamin.Riefenstahl@epost.de>
> writes:
>     Benjamin> This assumes a) Unix, b) that you have the space and
>     Benjamin> time ;-) to deal with the large temporary files.

Lee Sau Dan <danlee@informatik.uni-freiburg.de> writes:
> (b)  is assumed even  if you  use other  method.

With something like the dd method I don't ever have to copy the whole
file.  Makes a difference when your file is a CD image of 600 MB and
all you want to do is patch the partition table.

> Time?  It doesn't take much time to 'split' and 'cat'.

It takes several minutes on my machine with the mentioned file.

> Why not 'split'?

I didn't think of that one before.  But it also copies the whole
file. 

> There are hex editors which doesn't need to load the whole file into
> memory.

I'm not aware of a commonly used hex editor on Unix.  Do you have a
recommendation?


so long, benny


* Re: Reading portions of large files
  2003-01-10 17:16 ` Brendan Halpin
  2003-01-10 20:35   ` Benjamin Riefenstahl
@ 2003-01-20  7:50   ` Lee Sau Dan
  1 sibling, 0 replies; 21+ messages in thread
From: Lee Sau Dan @ 2003-01-20  7:50 UTC (permalink / raw)


>>>>> "Brendan" == Brendan Halpin <brendan.halpin@ul.ie> writes:

    Brendan> Use head and tail to split the file into the
    Brendan> header-to-be-edited and the-rest. Edit the
    Brendan> header-to-be-edited in emacs, save, then concatenated
    Brendan> the-rest onto it.

    Brendan> Assuming all editing is within the first 2000 bytes (not
    Brendan> tested):

    Brendan> head -c2000 bigfile > header-to-be-edited 
    Brendan> tail -c+2001 bigfile > the-rest
    Brendan>  (edit header-to-be-edited, save) 
    Brendan> cat header-to-be-edited the-rest > new-big-file

Why not use 'split'?  :)
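
For comparison, a sketch of what the 'split' variant could look like, again on a small stand-in file with made-up names. One difference from head/tail is that split -b cuts the whole file into equal-sized pieces, not just into a header and a remainder, so for a really big file the piece size would have to be chosen much larger than 2000 bytes to keep the number of pieces manageable.

```shell
workdir=$(mktemp -d)
big="$workdir/bigfile"

# Stand-in for the huge file: 5000 bytes.
head -c 5000 /dev/zero | tr '\0' 'x' > "$big"

# Cut the file into 2000-byte pieces named part-aa, part-ab, part-ac.
split -b 2000 "$big" "$workdir/part-"
pieces=$(ls "$workdir"/part-* | wc -l | tr -d ' ')

# part-aa would be edited here; this sketch leaves it unchanged.

# Glue the pieces back together; the shell expands the glob in sorted
# order, which matches the order in which split wrote the pieces.
cat "$workdir"/part-* > "$workdir/rejoined"

if cmp -s "$big" "$workdir/rejoined"; then same=yes; else same=no; fi
echo "pieces=$pieces identical=$same"
```

Like the head/tail recipe, this copies the whole file twice, once when splitting and once when rejoining.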


-- 
Lee Sau Dan                     李守敦(Big5)                    ~{@nJX6X~}(HZ) 

E-mail: danlee@informatik.uni-freiburg.de
Home page: http://www.informatik.uni-freiburg.de/~danlee
