From: Xah Lee <xah@xahlee.org>
To: help-gnu-emacs@gnu.org
Subject: Re: opening large files (few hundred meg)
Date: Tue, 29 Jan 2008 08:34:13 -0800 (PST)
Message-ID: <36295bce-c40b-4faf-ae7d-04eb90796da8@q39g2000hsf.googlegroups.com>
In-Reply-To: 87ejc1m21x.fsf@lion.rapttech.com.au

Tim X wrote:

> Personally, I'd use something like Perl or one of the many
> other scripting languages that are ideal for (and largely designed for)
> this sort of problem.

An interesting thing about wanting to use elisp to open a large file,
for me, is this:

Recently i discovered that emacs lisp is probably the most powerful
language for processing text, far more so than Perl. That is because
emacs has the “buffer” infrastructure, which allows one to move a
point back and forth, delete, insert, regex search, etc., with
literally a few thousand text-processing functions built in to help
with the task.

While in perl or python, typically one either reads the file one line
at a time and processes it one line at a time, or reads the whole file
in one shot but basically still processes it one line at a time. The
gist is that any function you might want to apply to the text is only
applied to one line at a time, and it can't see what's before or after
that line. (One could write it so that it “buffers” the neighboring
lines, but that's rather unusual and involves more code.
Alternatively, one could read in one char at a time and move
the index back and forth, but that loses all the regex power, and
dealing with files as raw bytes and file pointers is extremely
painful.)
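
To illustrate the “buffers the neighboring lines” idea mentioned above, here is a minimal Python sketch. The helper name `with_neighbors` is made up for this example and is not from any library; it only shows that seeing the previous and next line takes some extra bookkeeping compared to a plain line loop.

```python
from typing import Iterable, Iterator, Optional, Tuple

def with_neighbors(
    lines: Iterable[str],
) -> Iterator[Tuple[Optional[str], str, Optional[str]]]:
    """Yield (previous, current, next) for every line.

    Only three lines are held in memory at once, so this also works
    on a file object for a file that is too big to load whole.
    """
    it = iter(lines)
    try:
        cur = next(it)
    except StopIteration:
        return  # empty input: nothing to yield
    prev = None
    for nxt in it:
        yield prev, cur, nxt
        prev, cur = cur, nxt
    yield prev, cur, None  # last line has no successor

# Hypothetical usage: flag lines that repeat the line before them.
sample = ["a", "a", "b"]
repeats = [cur for prev, cur, _ in with_neighbors(sample) if cur == prev]
```

Anything that needs more context than one line on each side (a multi-line tag, say) needs a wider window, which is where this approach starts to get awkward.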

The problem with processing one line at a time is that, for much data,
the file is a tree structure (such as HTML/XML or Mathematica source
code). In a tree structure such as XML, there is a root tag that opens
at the beginning of the file and closes at the end, and most tree
branches span multiple lines, so processing it line by line is almost
useless. So, in perl, the typical solution is to read in the whole
file and apply a regex to the whole content. This really puts stress on
the regex, and basically the regex won't work unless the processing
needed is really simple.
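
A small Python sketch of the difference described above, using a made-up XML fragment: applied per line, a regex misses any element that spans several lines, while the same regex applied to the whole text (with `re.DOTALL`) finds it.

```python
import re

# Made-up sample data: the first <item> spans several lines.
text = "<root>\n  <item>\n    one\n  </item>\n  <item>two</item>\n</root>\n"

ITEM = r"<item>(.*?)</item>"

# Line by line: no single line holds both the opening and closing tag
# of the first item, so only the one-line item is found.
per_line = [m for line in text.splitlines() for m in re.findall(ITEM, line)]

# Whole content: DOTALL lets "." match newlines, so both items match.
whole = [m.strip() for m in re.findall(ITEM, text, flags=re.DOTALL)]

print(per_line)
print(whole)
```

This only works because the sample is simple. With nested `<item>` tags the non-greedy match would pair the wrong closing tag, which is the kind of stress on the regex mentioned above.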

An alternative solution for processing a tree-structured file such as
XML is to use a proper parser (e.g. javascript/DOM, or a
library/module). However, when using a parser, the nature of the
programming ceases to be text processing and becomes structural
manipulation. In general, the program becomes more complex and
difficult. Also, if one uses an XML parser and DOM, the formatting of
the file will be lost (i.e. all your original line endings and indents
will be gone).

This is a major reason why i think emacs lisp is far more versatile:
it can read the XML into emacs's buffer infrastructure, and then the
programmer can move a point back and forth, freely using regex to
search or replace text. For complex XML processing such as tree
transformation (e.g. XSLT), an XML/DOM parser/model is still more
suitable, but for most simple manipulation (such as processing HTML
files), using elisp's buffer and treating the file as text is far
easier and more flexible. Also, if one so wishes, she can use an
XML/DOM parser/model written in elisp, just as in other languages.

So, last year i switched all new text processing tasks from Perl to
elisp.

But now i have a problem, which i “discovered” this week: what to do
when the file is huge? Normally, one can still just handle huge files,
since these days memory comes in a few gigs. But in my particular
case, my file happens to be 0.5 gig, and i couldn't even open it in
emacs (presumably because i need a 64-bit OS/hardware. Thanks). So,
given the situation, i'm thinking, perhaps there is a way to use
emacs lisp to read the file line by line, just as in perl or python.
(The file is just an apache log file and can be processed line by line,
can be split, and can be fed to sed/awk/grep with pipes. The reason i
want to open it in emacs and process it using elisp is more an
exploration than a practical need.)
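
For reference, the “line by line just as perl or python” pattern looks roughly like the Python sketch below. It is not an elisp answer to the question. The function names and the sample log lines are made up for illustration, and the regex assumes the common Apache log layout, where the status code follows the quoted request.

```python
import re
from collections import Counter
from typing import Iterable

# Assumption: the 3-digit status code comes right after the closing
# quote of the request, as in Apache's common/combined log formats.
STATUS_RE = re.compile(r'" (\d{3}) ')

def count_statuses(lines: Iterable[str]) -> Counter:
    """Tally HTTP status codes, looking at one line at a time."""
    counts: Counter = Counter()
    for line in lines:
        m = STATUS_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

def count_statuses_in_file(path: str) -> Counter:
    """Stream a log file; a file object yields lines lazily, so memory
    use stays small no matter how large the file is."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return count_statuses(f)

# Made-up sample lines standing in for a real log.
sample = [
    '127.0.0.1 - - [29/Jan/2008:08:34:13 -0800] "GET /a.html HTTP/1.1" 200 2326',
    '127.0.0.1 - - [29/Jan/2008:08:34:14 -0800] "GET /b.html HTTP/1.1" 404 512',
    '127.0.0.1 - - [29/Jan/2008:08:34:15 -0800] "GET /a.html HTTP/1.1" 200 2326',
]
totals = count_statuses(sample)
```

Because each line is discarded after it is examined, a 0.5 gig log needs no more memory than a small one.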

  Xah
  xah@xahlee.org
∑ http://xahlee.org/

☄

On Jan 29, 1:08 am, Tim X <t...@nospam.dev.null> wrote:
> It's not that uncommon to encounter text files over half a gig in size. A
> place I worked had systems that would generate logs in excess of 1Gb per
> day (and that was with minimal logging). When I worked with Oracle,
> there were some operations which involved multi Gb files that you needed
> to edit (which I did using sed rather than a text editor).
>
> However, it seems ridiculous to attempt to open a text file of the size
> Xah is talking about inside an editor. Like others, I have to wonder
> why his log file isn't rotated more often so that it is in manageable
> chunks. It's obvious that nobody would read all of a text file that was
> that large (especially not every week). More than likely, you would use
> existing tools to select 'interesting' parts of the log and then deal
> with them. Personally, I'd use something like Perl or one of the many
> other scripting languages that are ideal for (and largely designed for)
> this sort of problem.
>
> Tim
>
> --
> tcross (at) rapttech dot com dot au
