From: Xah Lee
Newsgroups: gmane.emacs.help
Subject: Re: opening large files (few hundred meg)
Date: Tue, 29 Jan 2008 08:34:13 -0800 (PST)
Message-ID: <36295bce-c40b-4faf-ae7d-04eb90796da8@q39g2000hsf.googlegroups.com>
References: <1f94fef6-a335-4ce5-8d4b-7e87025a28dc@e32g2000prn.googlegroups.com> <87r6g1esga.fsf@gmx.de> <2adae7bb-c775-4a6e-bf83-66a8618b326d@s12g2000prg.googlegroups.com> <87ejc1m21x.fsf@lion.rapttech.com.au>
To: help-gnu-emacs@gnu.org

Tim X wrote:
> Personally, I'd use something like Perl or one of the many
> other scripting languages that are ideal for (and largely designed for)
> this sort of problem.

An interesting thing about wanting to use elisp to open a large file, for me, is this: recently I discovered that Emacs Lisp is probably the most powerful language for processing text, far more so than Perl. In Emacs there is the “buffer” infrastructure, which lets you move a point back and forth, delete, insert, regex search, and so on, with literally a few thousand built-in text-processing functions to help with the task. In Perl or Python, by contrast, you typically either read the file one line at a time and process it one line at a time, or read the whole file in one shot but still basically process it one line at a time.
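For example, a typical elisp text-processing job looks roughly like this (an untested sketch; the file name and regex are just made-up examples):

;; untested sketch; the file name and regex are made-up examples
(with-temp-buffer
  (insert-file-contents "my-page.html")
  (goto-char (point-min))
  ;; point can move anywhere; the search is not confined to one line
  (while (re-search-forward "<b>\\([^<]+\\)</b>" nil t)
    (replace-match "<strong>\\1</strong>" t))
  (write-region nil nil "my-page-new.html"))

The whole file sits in a buffer, and the search/replace runs over all of it, across line boundaries.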
The gist of the line-at-a-time approach is that any function you might want to apply to the text sees only one line at a time; it can't see what comes before or after that line. (One could write the code so that it buffers the neighboring lines, but that is rather unusual and takes more code. Alternatively, one could read one character at a time and move the file index back and forth, but then all the regex power is lost, and dealing with files as raw bytes and file pointers is extremely painful.)

The problem with processing one line at a time is that for much data the file is a tree structure (HTML/XML, Mathematica source code, and so on). In XML, for example, the root tag opens at the beginning of the file and closes at the end, and most branches of the tree span multiple lines, so processing it line by line is almost useless.

So in Perl the typical solution is to read in the whole file and apply a regex to the whole content. This puts a lot of stress on the regex, and basically the regex won't work unless the processing needed is really simple.

An alternative way to process a tree-structured file such as XML is to use a proper parser (e.g. JavaScript/DOM, or a library/module). However, with a parser the nature of the programming ceases to be text processing and becomes structural manipulation; in general the program gets more complex and difficult. Also, if you use an XML parser and DOM, the formatting of the file is lost (i.e. all your original line endings and indentation are gone).

This is a major reason why I think Emacs Lisp is far more versatile: it can read the XML into Emacs's buffer infrastructure, and then the programmer can move a point back and forth, freely using regexes to search or replace text. For complex XML processing such as tree transformation (XSLT and the like), an XML/DOM parser/model is still more suitable, but for most simple manipulation (such as processing HTML files), using elisp's buffer and treating the file as text is far easier and more flexible. Also, if you wish, you can use an XML/DOM parser/model written in elisp, just as in other languages.

So last year I switched all new text-processing tasks from Perl to elisp. But now I have a problem, which I “discovered” this week: what to do when the file is huge? Normally you can still just open huge files, since these days memory comes in gigabytes. But in my particular case the file happens to be 0.5 gig, and I couldn't even open it in Emacs (presumably because I need a 64-bit OS/hardware. Thanks).

So, given the situation, I'm thinking perhaps there is a way to use Emacs Lisp to read the file line by line, or chunk by chunk, just as in Perl or Python. (The file is just an Apache log file and can be processed line by line, can be split, can be fed to sed/awk/grep with pipes. The reason I want to open it in Emacs and process it with elisp is more exploration than practical need.)
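One idea I'm toying with is to use the optional BEG and END byte offsets of insert-file-contents to pull the file into a temp buffer one chunk at a time, so the whole half gig never has to be in memory at once. A rough, untested sketch (the function name, the chunk size, and the regex handling are just illustrative assumptions):

;; Rough untested sketch. Reads FILE in 1 MB chunks using the optional
;; BEG and END byte offsets of insert-file-contents, and counts the
;; matches of REGEXP. The chunk size is arbitrary, and a real version
;; would have to handle lines that straddle a chunk boundary.
(defun my-count-matches-in-big-file (file regexp)
  (let ((size (nth 7 (file-attributes file)))  ; file size in bytes
        (chunk 1000000)
        (offset 0)
        (count 0))
    (while (< offset size)
      (with-temp-buffer
        (insert-file-contents file nil offset (min size (+ offset chunk)))
        (goto-char (point-min))
        (while (re-search-forward regexp nil t)
          (setq count (1+ count))))
      (setq offset (+ offset chunk)))
    count))

For an Apache log that would at least let me run elisp over the whole file without loading it all at once.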
  Xah
  xah@xahlee.org
∑ http://xahlee.org/

☄

On Jan 29, 1:08 am, Tim X wrote:
> It's not that uncommon to encounter text files over half a gig in size. A
> place I worked had systems that would generate logs in excess of 1Gb per
> day (and that was with minimal logging). When I worked with Oracle,
> there were some operations which involved multi Gb files that you needed
> to edit (which I did using sed rather than a text editor).
>
> However, it seems ridiculous to attempt to open a text file of the size
> Xah is talking about inside an editor. Like others, I have to wonder
> why his log file isn't rotated more often so that it is in manageable
> chunks. It's obvious that nobody would read all of a text file that was
> that large (especially not every week). More than likely, you would use
> existing tools to select 'interesting' parts of the log and then deal
> with them. Personally, I'd use something like Perl or one of the many
> other scripting languages that are ideal for (and largely designed for)
> this sort of problem.
>
> Tim
>
> --
> tcross (at) rapttech dot com dot au