From: Christopher Baines <mail@cbaines.net>
To: "Ludovic Courtès" <ludo@gnu.org>
Cc: guix-devel@gnu.org
Subject: Re: [PATCH] git-download: Speed up 'git-predicate'.
Date: Thu, 8 Jun 2017 21:43:31 +0100 [thread overview]
Message-ID: <cd27f372-6ea0-1e87-2daa-44672f7bfe3a@cbaines.net> (raw)
In-Reply-To: <87zidk2c4g.fsf@gnu.org>
[-- Attachment #1.1: Type: text/plain, Size: 5180 bytes --]
On 07/06/17 13:52, Ludovic Courtès wrote:
> Christopher Baines <mail@cbaines.net> skribis:
>
>> Adjust 'git-predicate' to use data structures that perform better when used
>> with git repositories with a large number of files.
>>
>> Previously when matching either a regular file or directory, 'git-predicate'
>> would search a list with a length equal to the number of files in the
>> repository. As a search operation happens for roughly every file in the
>> repository, this meant that the time taken to use 'git-predicate' to traverse
>> all the files in a repository was roughly exponential with respect to the
>> number of files in the repository.
>>
>> Now, for matching regular files or symlinks, 'git-predicate' uses a vhash
>> using the inode value as the key. This should perform roughly in constant
>> amount of time, instead of linear with respect to the number of files in the
>> repository.
>>
>> For matching directories, 'git-predicate' now uses a tree structure stored in
>> association lists. To check if a directory is in the tree, the tree is
>> traversed from the root. The time complexity of this depends on the shape of
>> the tree, but it should be an improvement on searching through the list of all
>> files.
>
> Great, more than welcome it seems. :-)
>
> Do you know how much the inode optimization vs. the tree structure
> contributes to the performance.
I'll find out the actual times for without both optimizations, and with
just one of them and get back to you.
>> * guix/git-download.scm (git-predicate): Use different data structures to
>> speed up 'git-predicate' with a large number of files.
>
> [...]
>
>> + (define (create-directory-tree files)
>> + (define (directory-lists->tree directory-lists)
>> + (map (lambda (top-level-dir)
>> + (cons top-level-dir
>> + (directory-lists->tree
>> + (filter-map
>> + (lambda (directory-list)
>> + (if (eq? (length directory-list) 1)
>> + #f
>> + (cdr directory-list)))
>> + ;; Find all the directory lists under this top-level-dir
>> + (filter
>> + (lambda (directory-list)
>> + (equal? (car directory-list)
>> + top-level-dir))
>> + directory-lists)))))
>> + (delete-duplicates
>> + (map car directory-lists))))
>> +
>> + (directory-lists->tree
>> + (filter-map (lambda (path)
>> + (let ((split-path (string-split path #\/)))
>> + ;; If this is a file in the top of the repository?
>> + (if (eq? (length split-path) 1)
>> + #f
>> + ;; drop-right to remove the filename, as it's
>> + ;; just the directory tree that's important
>> + (drop-right (string-split path #\/) 1))))
>> + files)))
>> +
>> + (define (directory-in-tree? directory tree)
>> + (define (directory-list-in-tree? directory-list tree)
>> + (if (eq? (length directory-list) 1)
>> + (list? (member (car directory-list)
>> + (map car tree)))
>> + (and=> (find (match-lambda
>> + ((top-level-dir . subtree)
>> + (equal? top-level-dir
>> + (car directory-list))))
>> + tree)
>> + (match-lambda
>> + ((top-level-dir . subtree)
>> + (directory-list-in-tree? (cdr directory-list)
>> + subtree))))))
>> +
>> + (directory-list-in-tree? (string-split directory #\/)
>> + tree))
>
> Note that ‘length’ and ‘list?’ are O(n). I think ‘directory-in-tree?’
> should be written using ‘match’, which would avoid that altogether.
> Likewise, the (map car …) call for ‘match’. :-)
That's interesting about length and list, I'll give using match a go and
send another patch.
> I find the tree implementation hard to grasp. Perhaps it would help to
> move it outside of the ‘git-predicate’ function and perhaps decompose it
> a bit more? Thoughts?
Yeah, the create-directory-tree and directory-in-tree? functions are
both general purpose, would you suggest this module, or somewhere else?
>> + (inodes-vhash (alist->vhash
>> + (map
>> + (lambda (file)
>> + (let ((stat
>> + (lstat (string-append directory "/" file))))
>> + (cons (stat:ino stat) (stat:dev stat))))
>> + files)))
>
> I would call it ‘inodes’ simply. Also, we could use ‘list->set’ from
> (guix sets) here.
Ah, cool, I'll have a look at sending a updated patch shortly.
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 858 bytes --]
next prev parent reply other threads:[~2017-06-08 20:43 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-06-02 7:08 [PATCH] git-download: Speed up 'git-predicate' Christopher Baines
2017-06-02 7:34 ` Christopher Baines
2017-06-07 12:40 ` Ludovic Courtès
2017-06-07 18:12 ` Christopher Baines
2017-06-07 12:52 ` Ludovic Courtès
2017-06-08 20:43 ` Christopher Baines [this message]
2017-06-19 7:14 ` Christopher Baines
2017-06-21 21:44 ` Ludovic Courtès
2017-07-16 10:42 ` Christopher Baines
2017-07-25 21:26 ` Ludovic Courtès
2017-07-26 9:58 ` Christopher Baines
2017-06-19 7:24 ` Christopher Baines
2017-06-21 21:17 ` Ludovic Courtès
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=cd27f372-6ea0-1e87-2daa-44672f7bfe3a@cbaines.net \
--to=mail@cbaines.net \
--cc=guix-devel@gnu.org \
--cc=ludo@gnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this external index
https://git.savannah.gnu.org/cgit/guix.git
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.