unofficial mirror of guile-user@gnu.org 
 help / color / mirror / Atom feed
* How to read integers from file faster?
@ 2013-08-31  3:55 Darren Hoo
  2013-08-31  5:46 ` Mike Gran
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Darren Hoo @ 2013-08-31  3:55 UTC (permalink / raw)
  To: guile-user


It is way too slow to read numbers from a file simply by using `read' 

for example a txt file contains 10,000,000 line of numbers:

    (define (gen-sample max k file)
      (with-output-to-file 
          file
        (lambda ()
          (let lp ((k k))
    	(when (> k 0)
    	  (display (random max)) (newline)
    	  (lp (- k 1) ))))))
         
    (gen-sample 999999999 10000000 "rnd.txt")
    
 
And read the numbers in 

    (define (read-test1)
      (with-input-from-file
          "rnd.txt"
        (lambda ()
          (let lp ((i (read)))
    	(if (not (eof-object? i))
    	    (lp (read)))))))
    
scheme@(guile-user)> ,time (read-test1)
;; 37.348000s real time, 37.340000s run time.  0.450000s spent in GC.


with rdelim's read-line, it's better but still slow.

    (import (ice-9 rdelim))
    (define (read-test2)
      (with-input-from-file
          "rnd.txt"
        (lambda ()
          (let lp ((i (read-line)))
    	(if (not (eof-object? i))
    	    (begin 
    	      (string->number i)
    	      (lp (read-line))))))))
    
scheme@(guile-user)> ,time (read-test2)
;; 11.943000s real time, 11.930000s run time.  0.890000s spent in GC.


it only takes 1.8 seconds by using fscanf

    FILE *f = fopen("rnd.txt", "r");
    if (f == NULL) {
      printf("open failed!");
      exit(1);
    }
    long long i;
    while (fscanf(f, "%lld", &i) != EOF) {
      
    }

$ time ./read 

real	0m1.844s
user	0m1.803s
sys	0m0.032s

Are there any primitives in Guile that is equivalent to C's scanf?




^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: How to read integers from file faster?
  2013-08-31  3:55 How to read integers from file faster? Darren Hoo
@ 2013-08-31  5:46 ` Mike Gran
  2013-08-31  8:59 ` Mark H Weaver
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Mike Gran @ 2013-08-31  5:46 UTC (permalink / raw)
  To: Darren Hoo, guile-user@gnu.org

> From: Darren Hoo <darren.hoo@gmail.com>

>It is way too slow to read numbers from a file simply by using `read' 


The problem mostly lies in dealing with the locale or doing utf-8
conversion.  Also, your code creates and destroys a lot of strings.


You could push data through much more quickly if it were binary
and if you didn't create new objects.


(define (read-test1)
  (let* ((p (open-input-file "rnd.txt"))
         (buf (make-bytevector 10 0)))
    (let lp ((i (get-bytevector-n! p buf 0 10)))

      (if (not (eof-object? i))
          (lp (get-bytevector-n p 10))))))

But, alas, knowing that isn't really a solution to your problem.


-Mike




^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: How to read integers from file faster?
  2013-08-31  3:55 How to read integers from file faster? Darren Hoo
  2013-08-31  5:46 ` Mike Gran
@ 2013-08-31  8:59 ` Mark H Weaver
  2013-08-31  9:19 ` Andy Wingo
  2013-09-01 20:55 ` Pascal J. Bourguignon
  3 siblings, 0 replies; 9+ messages in thread
From: Mark H Weaver @ 2013-08-31  8:59 UTC (permalink / raw)
  To: Darren Hoo; +Cc: guile-user

[-- Attachment #1: Type: text/plain, Size: 2101 bytes --]

Darren Hoo <darren.hoo@gmail.com> writes:
> It is way too slow to read numbers from a file simply by using `read' 

Indeed.  Looking at the code, I can see a lot of room for improvement,
especially in 'string->number' (which is used also by 'read').  I've
added an item to my TODO list to optimize this.  Hopefully by 2.0.10 or
2.0.11, it will be faster.

> Are there any primitives in Guile that is equivalent to C's scanf?

I'm afraid there's currently nothing that will read integers very fast.
I experimented with using the Dynamic FFI to wrap various C functions,
but was unable to achieve more than a 2x improvement this way.

In the meantime, please remember that part of Guile's philosophy
(inherited from Emacs) is to use C extensions where necessary for
performance critical bits.  For the basics of how to do this, see:

  https://www.gnu.org/software/guile/manual/html_node/C-Extensions.html

As an example, I've attached a small C extension that uses 'fscanf' to
read integers from a file until EOF is reached, and returns them as a
list.  Here's how to compile and load it on GNU/Linux:

--8<---------------cut here---------------start------------->8---
mhw@tines:~$ gcc -shared -o read_integers_from_file.so -fPIC read_integers_from_file.c $(pkg-config --cflags guile-2.0)
mhw@tines:~$ guile
GNU Guile 2.0.9
Copyright (C) 1995-2013 Free Software Foundation, Inc.

Guile comes with ABSOLUTELY NO WARRANTY; for details type `,show w'.
This program is free software, and you are welcome to redistribute it
under certain conditions; type `,show c' for details.

Enter `,help' for help.
scheme@(guile-user)> (load-extension "./read_integers_from_file.so" "scm_init_read_integers_from_file")
scheme@(guile-user)> (use-modules (read-integers-from-file))
scheme@(guile-user)> (define lst (read-integers-from-file "rnd.txt"))
--8<---------------cut here---------------end--------------->8---

Note that although this code does partial error checking, it cannot cope
with integers that don't fit in a C long, and it also assumes that the
file is properly formatted.

      Regards,
        Mark


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: read_integers_from_file.c --]
[-- Type: text/x-csrc, Size: 938 bytes --]

#include <stdio.h>
#include <libguile.h>

static SCM
read_integers_from_file (SCM filename)
{
  char *fname = scm_to_locale_string (filename);
  FILE *f = fopen (fname, "r");
  SCM result = SCM_EOL;
  int err = 0;
  long num;

  if (f == NULL)
    err = 1;
  else
    {
      while (fscanf (f, "%dl", &num) == 1)
        result = scm_cons (scm_from_long (num), result);
      if (ferror (f))
        err = 1;
      if (fclose (f))
        err = 1;
    }
  free (fname);
  if (err)
    scm_syserror ("read-integers-from-file");
  return scm_reverse_x (result, SCM_EOL);
}

void
init_read_integers_from_file (void *unused)
{
  scm_c_define_gsubr ("read-integers-from-file", 1, 0, 0,
                      read_integers_from_file);
  scm_c_export ("read-integers-from-file", NULL);
}

void
scm_init_read_integers_from_file ()
{
  scm_c_define_module ("read-integers-from-file",
                       init_read_integers_from_file, NULL);
}


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: How to read integers from file faster?
  2013-08-31  3:55 How to read integers from file faster? Darren Hoo
  2013-08-31  5:46 ` Mike Gran
  2013-08-31  8:59 ` Mark H Weaver
@ 2013-08-31  9:19 ` Andy Wingo
  2013-08-31 10:19   ` Ludovic Courtès
  2013-09-01 20:55 ` Pascal J. Bourguignon
  3 siblings, 1 reply; 9+ messages in thread
From: Andy Wingo @ 2013-08-31  9:19 UTC (permalink / raw)
  To: Darren Hoo; +Cc: guile-user

On Sat 31 Aug 2013 05:55, Darren Hoo <darren.hoo@gmail.com> writes:

>     (define (read-test1)
>       (with-input-from-file
>           "rnd.txt"
>         (lambda ()
>           (let lp ((i (read)))
>     	(if (not (eof-object? i))
>     	    (lp (read)))))))
>     
> scheme@(guile-user)> ,time (read-test1)
> ;; 37.348000s real time, 37.340000s run time.  0.450000s spent in GC.

What this does is read, character by character, ready to read any
S-expression.  Then when it sees a character that can start a number, it
collects the characters into a string up to the next delimiter, and
tries to parse the string as a number.  If it's not a number it's
treated as a symbol.

>     (import (ice-9 rdelim))
>     (define (read-test2)
>       (with-input-from-file
>           "rnd.txt"
>         (lambda ()
>           (let lp ((i (read-line)))
>     	(if (not (eof-object? i))
>     	    (begin 
>     	      (string->number i)
>     	      (lp (read-line))))))))

This is faster because you have chosen the delimiter (newline) and you
are only looking for numbers.

> it only takes 1.8 seconds by using fscanf
>
>     FILE *f = fopen("rnd.txt", "r");
>     if (f == NULL) {
>       printf("open failed!");
>       exit(1);
>     }
>     long long i;
>     while (fscanf(f, "%lld", &i) != EOF) {
>       
>     }

Likewise, but here we don't cons any strings, and you're looking only
for signed integers in a particular range, not any kind of number.

> Are there any primitives in Guile that is equivalent to C's scanf?

No, for better or for worse :)

I just took a look at your program, which ran in 40s on my machine.
Under callgrind it turned out that we were doing a lot of iconv stuff
that we didn't need to do.  With some patches that are now in master,
I'm down to 6 seconds for your test.  I think that with some
optimizations we could get down to 3 seconds or so -- run Guile under
valgrind to see.

Writing a read-integer in Scheme would eventually be faster because it
knows what kinds of characters it's looking for, and really can avoid
garbage.  But for now the VM overhead is too much.  I'm working on that
but it will take longer :)

Andy
-- 
http://wingolog.org/



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: How to read integers from file faster?
  2013-08-31  9:19 ` Andy Wingo
@ 2013-08-31 10:19   ` Ludovic Courtès
  2013-09-01  8:45     ` Andy Wingo
  0 siblings, 1 reply; 9+ messages in thread
From: Ludovic Courtès @ 2013-08-31 10:19 UTC (permalink / raw)
  To: guile-user

Andy Wingo <wingo@pobox.com> skribis:

> I just took a look at your program, which ran in 40s on my machine.
> Under callgrind it turned out that we were doing a lot of iconv stuff
> that we didn't need to do.

It’s often the case that I/O is faster if you explicitly say that the
port is UTF-8-encoded, because there’s a fast path for that (not using
iconv) in 2.0:

  (with-fluids ((%default-port-encoding "UTF-8"))
    (call-with-input-file file
      ...))

Ludo’.




^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: How to read integers from file faster?
  2013-08-31 10:19   ` Ludovic Courtès
@ 2013-09-01  8:45     ` Andy Wingo
  2013-09-01 13:33       ` Ludovic Courtès
  0 siblings, 1 reply; 9+ messages in thread
From: Andy Wingo @ 2013-09-01  8:45 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: guile-user

On Sat 31 Aug 2013 12:19, ludo@gnu.org (Ludovic Courtès) writes:

> Andy Wingo <wingo@pobox.com> skribis:
>
>> I just took a look at your program, which ran in 40s on my machine.
>> Under callgrind it turned out that we were doing a lot of iconv stuff
>> that we didn't need to do.
>
> It’s often the case that I/O is faster if you explicitly say that the
> port is UTF-8-encoded, because there’s a fast path for that (not using
> iconv) in 2.0:
>
>   (with-fluids ((%default-port-encoding "UTF-8"))
>     (call-with-input-file file
>       ...))

Yep.  In this particular case, it turned out that even in UTF-8 locales,
scm_ungetc_unlocked and scm_from_port_string were going through iconv.
Fixing that sped up the test by around 600% (!!).

Andy
-- 
http://wingolog.org/



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: How to read integers from file faster?
  2013-09-01  8:45     ` Andy Wingo
@ 2013-09-01 13:33       ` Ludovic Courtès
  0 siblings, 0 replies; 9+ messages in thread
From: Ludovic Courtès @ 2013-09-01 13:33 UTC (permalink / raw)
  To: Andy Wingo; +Cc: guile-user

Andy Wingo <wingo@pobox.com> skribis:

> On Sat 31 Aug 2013 12:19, ludo@gnu.org (Ludovic Courtès) writes:
>
>> Andy Wingo <wingo@pobox.com> skribis:
>>
>>> I just took a look at your program, which ran in 40s on my machine.
>>> Under callgrind it turned out that we were doing a lot of iconv stuff
>>> that we didn't need to do.
>>
>> It’s often the case that I/O is faster if you explicitly say that the
>> port is UTF-8-encoded, because there’s a fast path for that (not using
>> iconv) in 2.0:
>>
>>   (with-fluids ((%default-port-encoding "UTF-8"))
>>     (call-with-input-file file
>>       ...))
>
> Yep.  In this particular case, it turned out that even in UTF-8 locales,
> scm_ungetc_unlocked and scm_from_port_string were going through iconv.

Aah, ungetc, OK.

Ludo’.



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: How to read integers from file faster?
  2013-08-31  3:55 How to read integers from file faster? Darren Hoo
                   ` (2 preceding siblings ...)
  2013-08-31  9:19 ` Andy Wingo
@ 2013-09-01 20:55 ` Pascal J. Bourguignon
  3 siblings, 0 replies; 9+ messages in thread
From: Pascal J. Bourguignon @ 2013-09-01 20:55 UTC (permalink / raw)
  To: guile-user

Darren Hoo <darren.hoo@gmail.com> writes:

> It is way too slow to read numbers from a file simply by using `read' 
>
> for example a txt file contains 10,000,000 line of numbers:

To get speed, I would:

1- open a binary file,

2- perform double-buffering I/O,  (ie. read a buffer while you process
   another),

3- assume the input format is as specified, ie. no check.
   …
   (let ((byte (vector-ref buffer i)))
     (if (= byte 10)
        (begin (set! values (cons value values))
               (set! value 0))
        (set! value (+ (* value 10) (mod byte 16)))))
   …


Actually, only using buffered I/O matters.  For big files, what slows
down is I/O, not the handling of characters or checking of the syntax or
anything that's done in memory.  If you want to read your data faster,
keep it in RAM, or else in SSD.

-- 
__Pascal Bourguignon__
http://www.informatimago.com/




^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: How to read integers from file faster?
       [not found] <mailman.97.1378137628.11215.guile-user@gnu.org>
@ 2013-09-02 17:32 ` Daniel Llorens
  0 siblings, 0 replies; 9+ messages in thread
From: Daniel Llorens @ 2013-09-02 17:32 UTC (permalink / raw)
  To: guile-user


> Date: Sun, 01 Sep 2013 22:55:56 +0200
> From: "Pascal J. Bourguignon" <pjb@informatimago.com>
> To: guile-user@gnu.org
> 
> Darren Hoo <darren.hoo@gmail.com> writes:
> 

>> It is way too slow to read numbers from a file simply by using `read' 
>> 
>> for example a txt file contains 10,000,000 line of numbers:
> 
> To get speed, I would:
> 
> 1- open a binary file,
> 
> 2- perform double-buffering I/O,  (ie. read a buffer while you process
>   another),
> 
> 3- assume the input format is as specified, ie. no check.
>   ?
>   (let ((byte (vector-ref buffer i)))
>     (if (= byte 10)
>        (begin (set! values (cons value values))
>               (set! value 0))
>        (set! value (+ (* value 10) (mod byte 16)))))
>   ?
> 
> 
> Actually, only using buffered I/O matters.  For big files, what slows
> down is I/O, not the handling of characters or checking of the syntax or
> anything that's done in memory.  If you want to read your data faster,
> keep it in RAM, or else in SSD.

I know this is not exactly a Scheme solution, but if you can choose your file format, consider using HDF5. You can read directly into a typed (srfi4) array, and a lot of languages have bindings (e.g. h5py), so it's a portable format. I've found it to be fast.

I haven't found bindings for Guile on the web (the ones I use depend on local stuff), but it shouldn't be a lot of work, and it's more flexible than a one-off integer reader. Of course, if that's all you need, then disregard this.

If you or someone else is interested in doing proper bindings using the ffi, I can lend a hand and test.

	Daniel





^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2013-09-02 17:32 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-08-31  3:55 How to read integers from file faster? Darren Hoo
2013-08-31  5:46 ` Mike Gran
2013-08-31  8:59 ` Mark H Weaver
2013-08-31  9:19 ` Andy Wingo
2013-08-31 10:19   ` Ludovic Courtès
2013-09-01  8:45     ` Andy Wingo
2013-09-01 13:33       ` Ludovic Courtès
2013-09-01 20:55 ` Pascal J. Bourguignon
     [not found] <mailman.97.1378137628.11215.guile-user@gnu.org>
2013-09-02 17:32 ` Daniel Llorens

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).