all messages for Guix-related lists mirrored at yhetil.org
 help / color / mirror / code / Atom feed
From: Christopher Baines <mail@cbaines.net>
To: "Ludovic Courtès" <ludo@gnu.org>
Cc: guix-devel@gnu.org
Subject: Re: [PATCH] git-download: Speed up 'git-predicate'.
Date: Thu, 8 Jun 2017 21:43:31 +0100	[thread overview]
Message-ID: <cd27f372-6ea0-1e87-2daa-44672f7bfe3a@cbaines.net> (raw)
In-Reply-To: <87zidk2c4g.fsf@gnu.org>


[-- Attachment #1.1: Type: text/plain, Size: 5180 bytes --]

On 07/06/17 13:52, Ludovic Courtès wrote:
> Christopher Baines <mail@cbaines.net> skribis:
> 
>> Adjust 'git-predicate' to use data structures that perform better when used
>> with git repositories with a large number of files.
>>
>> Previously when matching either a regular file or directory, 'git-predicate'
>> would search a list with a length equal to the number of files in the
>> repository. As a search operation happens for roughly every file in the
>> repository, this meant that the time taken to use 'git-predicate' to traverse
>> all the files in a repository was roughly exponential with respect to the
>> number of files in the repository.
>>
>> Now, for matching regular files or symlinks, 'git-predicate' uses a vhash
>> using the inode value as the key. This should perform roughly in constant
>> amount of time, instead of linear with respect to the number of files in the
>> repository.
>>
>> For matching directories, 'git-predicate' now uses a tree structure stored in
>> association lists. To check if a directory is in the tree, the tree is
>> traversed from the root. The time complexity of this depends on the shape of
>> the tree, but it should be an improvement on searching through the list of all
>> files.
> 
> Great, more than welcome it seems.  :-)
> 
> Do you know how much the inode optimization vs. the tree structure
> contributes to the performance.

I'll find out the actual times for without both optimizations, and with
just one of them and get back to you.

>> * guix/git-download.scm (git-predicate): Use different data structures to
>>   speed up 'git-predicate' with a large number of files.
> 
> [...]
> 
>> +  (define (create-directory-tree files)
>> +    (define (directory-lists->tree directory-lists)
>> +      (map (lambda (top-level-dir)
>> +             (cons top-level-dir
>> +                   (directory-lists->tree
>> +                    (filter-map
>> +                     (lambda (directory-list)
>> +                       (if (eq? (length directory-list) 1)
>> +                           #f
>> +                           (cdr directory-list)))
>> +                     ;; Find all the directory lists under this top-level-dir
>> +                     (filter
>> +                      (lambda (directory-list)
>> +                        (equal? (car directory-list)
>> +                                top-level-dir))
>> +                      directory-lists)))))
>> +           (delete-duplicates
>> +            (map car directory-lists))))
>> +
>> +    (directory-lists->tree
>> +     (filter-map (lambda (path)
>> +                   (let ((split-path (string-split path #\/)))
>> +                     ;; If this is a file in the top of the repository?
>> +                     (if (eq? (length split-path) 1)
>> +                         #f
>> +                         ;; drop-right to remove the filename, as it's
>> +                         ;; just the directory tree that's important
>> +                         (drop-right (string-split path #\/) 1))))
>> +                 files)))
>> +
>> +  (define (directory-in-tree? directory tree)
>> +    (define (directory-list-in-tree? directory-list tree)
>> +      (if (eq? (length directory-list) 1)
>> +          (list? (member (car directory-list)
>> +                         (map car tree)))
>> +          (and=> (find (match-lambda
>> +                         ((top-level-dir . subtree)
>> +                          (equal? top-level-dir
>> +                                  (car directory-list))))
>> +                       tree)
>> +                 (match-lambda
>> +                   ((top-level-dir . subtree)
>> +                    (directory-list-in-tree? (cdr directory-list)
>> +                                             subtree))))))
>> +
>> +    (directory-list-in-tree? (string-split directory #\/)
>> +                             tree))
> 
> Note that ‘length’ and ‘list?’ are O(n).  I think ‘directory-in-tree?’
> should be written using ‘match’, which would avoid that altogether.
> Likewise, the (map car …) call for ‘match’.  :-)

That's interesting about length and list, I'll give using match a go and
send another patch.

> I find the tree implementation hard to grasp.  Perhaps it would help to
> move it outside of the ‘git-predicate’ function and perhaps decompose it
> a bit more?  Thoughts?

Yeah, the create-directory-tree and directory-in-tree? functions are
both general purpose, would you suggest this module, or somewhere else?

>> +         (inodes-vhash   (alist->vhash
>> +                          (map
>> +                           (lambda (file)
>> +                             (let ((stat
>> +                                    (lstat (string-append directory "/" file))))
>> +                               (cons (stat:ino stat) (stat:dev stat))))
>> +                           files)))
> 
> I would call it ‘inodes’ simply.  Also, we could use ‘list->set’ from
> (guix sets) here.

Ah, cool, I'll have a look at sending a updated patch shortly.



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 858 bytes --]

  reply	other threads:[~2017-06-08 20:43 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-06-02  7:08 [PATCH] git-download: Speed up 'git-predicate' Christopher Baines
2017-06-02  7:34 ` Christopher Baines
2017-06-07 12:40   ` Ludovic Courtès
2017-06-07 18:12     ` Christopher Baines
2017-06-07 12:52 ` Ludovic Courtès
2017-06-08 20:43   ` Christopher Baines [this message]
2017-06-19  7:14   ` Christopher Baines
2017-06-21 21:44     ` Ludovic Courtès
2017-07-16 10:42       ` Christopher Baines
2017-07-25 21:26         ` Ludovic Courtès
2017-07-26  9:58           ` Christopher Baines
2017-06-19  7:24   ` Christopher Baines
2017-06-21 21:17     ` Ludovic Courtès

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=cd27f372-6ea0-1e87-2daa-44672f7bfe3a@cbaines.net \
    --to=mail@cbaines.net \
    --cc=guix-devel@gnu.org \
    --cc=ludo@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this external index

	https://git.savannah.gnu.org/cgit/guix.git

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.