From: Xah Lee
Newsgroups: gmane.emacs.help
Subject: Re: opening large files (few hundred meg)
Date: Tue, 29 Jan 2008 08:34:13 -0800 (PST)
Message-ID: <36295bce-c40b-4faf-ae7d-04eb90796da8@q39g2000hsf.googlegroups.com>
References: <1f94fef6-a335-4ce5-8d4b-7e87025a28dc@e32g2000prn.googlegroups.com> <87r6g1esga.fsf@gmx.de> <2adae7bb-c775-4a6e-bf83-66a8618b326d@s12g2000prg.googlegroups.com> <87ejc1m21x.fsf@lion.rapttech.com.au>
To: help-gnu-emacs@gnu.org

Tim X wrote:
> Personally, I'd use something like Perl or one of the many
> other scripting languages that are ideal for (and largely designed for)
> this sort of problem.

An interesting thing about wanting to use elisp to open a large file, for me, is this: recently I discovered that Emacs Lisp is probably the most powerful language for processing text, far more so than Perl. In Emacs there is the “buffer” infrastructure, which lets you move a point back and forth, delete, insert, regex search, and so on, with literally a few thousand built-in text-processing functions to help with the task. In Perl or Python, by contrast, you typically either read the file one line at a time and process it one line at a time, or read the whole file in one shot but still basically process it one line at a time.
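For example, a typical elisp text-processing job looks roughly like this (an untested sketch; the file name and regex are just made-up examples):

;; untested sketch; the file name and regex are made-up examples
(with-temp-buffer
  (insert-file-contents "my-page.html")
  (goto-char (point-min))
  ;; point can move anywhere; the search is not confined to one line
  (while (re-search-forward "<b>\\([^<]+\\)</b>" nil t)
    (replace-match "<strong>\\1</strong>" t))
  (write-region nil nil "my-page-new.html"))

The whole file sits in a buffer, and the search/replace runs over all of it, across line boundaries.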
The gist of the line-at-a-time approach is that any function you might want to apply to the text sees only one line at a time; it can't see what comes before or after that line. (One could write the code so that it buffers the neighboring lines, but that is rather unusual and takes more code. Alternatively, one could read one character at a time and move the file index back and forth, but then all the regex power is lost, and dealing with files as raw bytes and file pointers is extremely painful.)

The problem with processing one line at a time is that for much data the file is a tree structure (HTML/XML, Mathematica source code, and so on). In XML, for example, the root tag opens at the beginning of the file and closes at the end, and most branches of the tree span multiple lines, so processing it line by line is almost useless.

So in Perl the typical solution is to read in the whole file and apply a regex to the whole content. This puts a lot of stress on the regex, and basically the regex won't work unless the processing needed is really simple.

An alternative way to process a tree-structured file such as XML is to use a proper parser (e.g. JavaScript/DOM, or a library/module). However, with a parser the nature of the programming ceases to be text processing and becomes structural manipulation; in general the program gets more complex and difficult. Also, if you use an XML parser and DOM, the formatting of the file is lost (i.e. all your original line endings and indentation are gone).

This is a major reason why I think Emacs Lisp is far more versatile: it can read the XML into Emacs's buffer infrastructure, and then the programmer can move a point back and forth, freely using regexes to search or replace text. For complex XML processing such as tree transformation (XSLT and the like), an XML/DOM parser/model is still more suitable, but for most simple manipulation (such as processing HTML files), using elisp's buffer and treating the file as text is far easier and more flexible. Also, if you wish, you can use an XML/DOM parser/model written in elisp, just as in other languages.

So last year I switched all new text-processing tasks from Perl to elisp. But now I have a problem, which I “discovered” this week: what to do when the file is huge? Normally you can still just open huge files, since these days memory comes in gigabytes. But in my particular case the file happens to be 0.5 gig, and I couldn't even open it in Emacs (presumably because I need a 64-bit OS/hardware. Thanks).

So, given the situation, I'm thinking perhaps there is a way to use Emacs Lisp to read the file line by line, or chunk by chunk, just as in Perl or Python. (The file is just an Apache log file and can be processed line by line, can be split, can be fed to sed/awk/grep with pipes. The reason I want to open it in Emacs and process it with elisp is more exploration than practical need.)
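One idea I'm toying with is to use the optional BEG and END byte offsets of insert-file-contents to pull the file into a temp buffer one chunk at a time, so the whole half gig never has to be in memory at once. A rough, untested sketch (the function name, the chunk size, and the regex handling are just illustrative assumptions):

;; Rough untested sketch. Reads FILE in 1 MB chunks using the optional
;; BEG and END byte offsets of insert-file-contents, and counts the
;; matches of REGEXP. The chunk size is arbitrary, and a real version
;; would have to handle lines that straddle a chunk boundary.
(defun my-count-matches-in-big-file (file regexp)
  (let ((size (nth 7 (file-attributes file)))  ; file size in bytes
        (chunk 1000000)
        (offset 0)
        (count 0))
    (while (< offset size)
      (with-temp-buffer
        (insert-file-contents file nil offset (min size (+ offset chunk)))
        (goto-char (point-min))
        (while (re-search-forward regexp nil t)
          (setq count (1+ count))))
      (setq offset (+ offset chunk)))
    count))

For an Apache log that would at least let me run elisp over the whole file without loading it all at once.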
  Xah
  xah@xahlee.org
∑ http://xahlee.org/

☄

On Jan 29, 1:08 am, Tim X wrote:
> It's not that uncommon to encounter text files over half a gig in size. A
> place I worked had systems that would generate logs in excess of 1Gb per
> day (and that was with minimal logging). When I worked with Oracle,
> there were some operations which involved multi Gb files that you needed
> to edit (which I did using sed rather than a text editor).
>
> However, it seems ridiculous to attempt to open a text file of the size
> Xah is talking about inside an editor. Like others, I have to wonder
> why his log file isn't rotated more often so that it is in manageable
> chunks. It's obvious that nobody would read all of a text file that was
> that large (especially not every week). More than likely, you would use
> existing tools to select 'interesting' parts of the log and then deal
> with them. Personally, I'd use something like Perl or one of the many
> other scripting languages that are ideal for (and largely designed for)
> this sort of problem.
>
> Tim
>
> --
> tcross (at) rapttech dot com dot au