unofficial mirror of notmuch@notmuchmail.org
* [PATCH] new: Don't scan unchanged directories with no sub-directories
@ 2013-10-24 20:33 Austin Clements
  2013-10-24 21:08 ` Austin Clements
  0 siblings, 1 reply; 9+ messages in thread
From: Austin Clements @ 2013-10-24 20:33 UTC (permalink / raw)
  To: notmuch

This can substantially reduce the cost of notmuch new in some
situations, such as when the file system cache is cold or when the
Maildir is on NFS.
---
 notmuch-new.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/notmuch-new.c b/notmuch-new.c
index faa33f1..364c73a 100644
--- a/notmuch-new.c
+++ b/notmuch-new.c
@@ -323,6 +323,26 @@ add_files (notmuch_database_t *notmuch,
     }
     db_mtime = directory ? notmuch_directory_get_mtime (directory) : 0;
 
+    /* If the directory is unchanged from our last scan and has no
+     * sub-directories, then return without scanning it at all.  In
+     * some situations, skipping the scan can substantially reduce the
+     * cost of notmuch new, especially since the huge numbers of files
+     * in Maildirs make scans expensive, but all files live in leaf
+     * directories.
+     *
+     * To check for sub-directories, we borrow a trick from find,
+     * kpathsea, and many other UNIX tools: since a directory's link
+     * count is the number of sub-directories (specifically, their
+     * '..' entries) plus 2 (the link from the parent and the link for
+     * '.').  This check is safe even on weird file systems, since
+     * file systems that can't compute this will return 0 or 1.  This
+     * is safe even on *really* weird file systems like HFS+ that
+     * mistakenly return the total number of directory entries, since
+     * that only inflates the count beyond 2.
+     */
+    if (directory && fs_mtime == db_mtime && st.st_nlink == 2)
+	goto DONE;
+
     /* If the database knows about this directory, then we sort based
      * on strcmp to match the database sorting. Otherwise, we can do
      * inode-based sorting for faster filesystem operation. */
-- 
1.8.4.rc3


* Re: [PATCH] new: Don't scan unchanged directories with no sub-directories
  2013-10-24 20:33 [PATCH] new: Don't scan unchanged directories with no sub-directories Austin Clements
@ 2013-10-24 21:08 ` Austin Clements
  2013-10-24 21:38   ` [PATCH v2] " Austin Clements
  0 siblings, 1 reply; 9+ messages in thread
From: Austin Clements @ 2013-10-24 21:08 UTC (permalink / raw)
  To: notmuch

There might be a problem with this patch.  Directory entries that are
*symlinks* to other directories do not increase the containing
directory's link count, but we do count them as directories in
add_files pass 1 and traverse into them.  Hence, if you had a
directory that contained no sub-directories, but did contain symlinks
to other directories, we would fail to notice changes in the symlinked
directories.

We could check if the database thinks there are sub-directories and
only bail early if the directory is unchanged and *both* the file
system and the database think there are no sub-directories.

Quoth myself on Oct 24 at  4:33 pm:
> This can substantially reduce the cost of notmuch new in some
> situations, such as when the file system cache is cold or when the
> Maildir is on NFS.
> ---
>  notmuch-new.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
> 
> diff --git a/notmuch-new.c b/notmuch-new.c
> index faa33f1..364c73a 100644
> --- a/notmuch-new.c
> +++ b/notmuch-new.c
> @@ -323,6 +323,26 @@ add_files (notmuch_database_t *notmuch,
>      }
>      db_mtime = directory ? notmuch_directory_get_mtime (directory) : 0;
>  
> +    /* If the directory is unchanged from our last scan and has no
> +     * sub-directories, then return without scanning it at all.  In
> +     * some situations, skipping the scan can substantially reduce the
> +     * cost of notmuch new, especially since the huge numbers of files
> +     * in Maildirs make scans expensive, but all files live in leaf
> +     * directories.
> +     *
> +     * To check for sub-directories, we borrow a trick from find,
> +     * kpathsea, and many other UNIX tools: since a directory's link
> +     * count is the number of sub-directories (specifically, their
> +     * '..' entries) plus 2 (the link from the parent and the link for
> +     * '.').  This check is safe even on weird file systems, since
> +     * file systems that can't compute this will return 0 or 1.  This
> +     * is safe even on *really* weird file systems like HFS+ that
> +     * mistakenly return the total number of directory entries, since
> +     * that only inflates the count beyond 2.
> +     */
> +    if (directory && fs_mtime == db_mtime && st.st_nlink == 2)
> +	goto DONE;
> +
>      /* If the database knows about this directory, then we sort based
>       * on strcmp to match the database sorting. Otherwise, we can do
>       * inode-based sorting for faster filesystem operation. */


* [PATCH v2] new: Don't scan unchanged directories with no sub-directories
  2013-10-24 21:08 ` Austin Clements
@ 2013-10-24 21:38   ` Austin Clements
  2013-10-25 11:46     ` Tomi Ollila
                       ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Austin Clements @ 2013-10-24 21:38 UTC (permalink / raw)
  To: notmuch

This can substantially reduce the cost of notmuch new in some
situations, such as when the file system cache is cold or when the
Maildir is on NFS.
---

This should fix the problem with directories containing symlinks to
other directories, but no actual sub-directories.

 notmuch-new.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/notmuch-new.c b/notmuch-new.c
index faa33f1..ba05cb4 100644
--- a/notmuch-new.c
+++ b/notmuch-new.c
@@ -323,6 +323,35 @@ add_files (notmuch_database_t *notmuch,
     }
     db_mtime = directory ? notmuch_directory_get_mtime (directory) : 0;
 
+    /* If the directory is unchanged from our last scan and has no
+     * sub-directories, then return without scanning it at all.  In
+     * some situations, skipping the scan can substantially reduce the
+     * cost of notmuch new, especially since the huge numbers of files
+     * in Maildirs make scans expensive, but all files live in leaf
+     * directories.
+     *
+     * To check for sub-directories, we borrow a trick from find,
+     * kpathsea, and many other UNIX tools: since a directory's link
+     * count is the number of sub-directories (specifically, their
+     * '..' entries) plus 2 (the link from the parent and the link for
+     * '.').  This check is safe even on weird file systems, since
+     * file systems that can't compute this will return 0 or 1.  This
+     * is safe even on *really* weird file systems like HFS+ that
+     * mistakenly return the total number of directory entries, since
+     * that only inflates the count beyond 2.
+     */
+    if (directory && fs_mtime == db_mtime && st.st_nlink == 2) {
+	/* There's one catch: pass 1 below considers symlinks to
+	 * directories to be directories, but these don't increase the
+	 * file system link count.  So, only bail early if the
+	 * database agrees that there are no sub-directories. */
+	db_subdirs = notmuch_directory_get_child_directories (directory);
+	if (!notmuch_filenames_valid (db_subdirs))
+	    goto DONE;
+	notmuch_filenames_destroy (db_subdirs);
+	db_subdirs = NULL;
+    }
+
     /* If the database knows about this directory, then we sort based
      * on strcmp to match the database sorting. Otherwise, we can do
      * inode-based sorting for faster filesystem operation. */
-- 
1.8.4.rc3


* Re: [PATCH v2] new: Don't scan unchanged directories with no sub-directories
  2013-10-24 21:38   ` [PATCH v2] " Austin Clements
@ 2013-10-25 11:46     ` Tomi Ollila
  2013-10-25 11:59       ` Vladimir Marek
  2013-10-26  0:13     ` David Bremner
  2013-10-28 20:00     ` David Bremner
  2 siblings, 1 reply; 9+ messages in thread
From: Tomi Ollila @ 2013-10-25 11:46 UTC (permalink / raw)
  To: Austin Clements, notmuch

On Fri, Oct 25 2013, Austin Clements <amdragon@MIT.EDU> wrote:

> This can substantially reduce the cost of notmuch new in some
> situations, such as when the file system cache is cold or when the
> Maildir is on NFS.
> ---

LGTM. The creation and destruction of child directories happens
only if there are symlinks to directories in otherwise leaf directories.

Tomi

>
> This should fix the problem with directories containing symlinks to
> other directories, but no actual sub-directories.
>
>  notmuch-new.c | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
>
> diff --git a/notmuch-new.c b/notmuch-new.c
> index faa33f1..ba05cb4 100644
> --- a/notmuch-new.c
> +++ b/notmuch-new.c
> @@ -323,6 +323,35 @@ add_files (notmuch_database_t *notmuch,
>      }
>      db_mtime = directory ? notmuch_directory_get_mtime (directory) : 0;
>  
> +    /* If the directory is unchanged from our last scan and has no
> +     * sub-directories, then return without scanning it at all.  In
> +     * some situations, skipping the scan can substantially reduce the
> +     * cost of notmuch new, especially since the huge numbers of files
> +     * in Maildirs make scans expensive, but all files live in leaf
> +     * directories.
> +     *
> +     * To check for sub-directories, we borrow a trick from find,
> +     * kpathsea, and many other UNIX tools: since a directory's link
> +     * count is the number of sub-directories (specifically, their
> +     * '..' entries) plus 2 (the link from the parent and the link for
> +     * '.').  This check is safe even on weird file systems, since
> +     * file systems that can't compute this will return 0 or 1.  This
> +     * is safe even on *really* weird file systems like HFS+ that
> +     * mistakenly return the total number of directory entries, since
> +     * that only inflates the count beyond 2.
> +     */
> +    if (directory && fs_mtime == db_mtime && st.st_nlink == 2) {
> +	/* There's one catch: pass 1 below considers symlinks to
> +	 * directories to be directories, but these don't increase the
> +	 * file system link count.  So, only bail early if the
> +	 * database agrees that there are no sub-directories. */
> +	db_subdirs = notmuch_directory_get_child_directories (directory);
> +	if (!notmuch_filenames_valid (db_subdirs))
> +	    goto DONE;
> +	notmuch_filenames_destroy (db_subdirs);
> +	db_subdirs = NULL;
> +    }
> +
>      /* If the database knows about this directory, then we sort based
>       * on strcmp to match the database sorting. Otherwise, we can do
>       * inode-based sorting for faster filesystem operation. */
> -- 
> 1.8.4.rc3
>


* Re: Re: [PATCH v2] new: Don't scan unchanged directories with no sub-directories
  2013-10-25 11:46     ` Tomi Ollila
@ 2013-10-25 11:59       ` Vladimir Marek
  0 siblings, 0 replies; 9+ messages in thread
From: Vladimir Marek @ 2013-10-25 11:59 UTC (permalink / raw)
  To: Tomi Ollila; +Cc: notmuch

Thank you both for your help!
-- 
	Vlad


* Re: [PATCH v2] new: Don't scan unchanged directories with no sub-directories
  2013-10-24 21:38   ` [PATCH v2] " Austin Clements
  2013-10-25 11:46     ` Tomi Ollila
@ 2013-10-26  0:13     ` David Bremner
  2013-10-26 11:52       ` David Bremner
  2013-10-28 20:00     ` David Bremner
  2 siblings, 1 reply; 9+ messages in thread
From: David Bremner @ 2013-10-26  0:13 UTC (permalink / raw)
  To: Austin Clements, notmuch

Austin Clements <amdragon@MIT.EDU> writes:

> This can substantially reduce the cost of notmuch new in some
> situations, such as when the file system cache is cold or when the
> Maildir is on NFS.

On my desktop at home (a Core i7 950) with spinning rust disks (and LVM
on LUKS) this patch yields about a 7% slowdown in the initial new perf
test

from

			Wall(s)	Usr(s)	Sys(s)	Res(K)	In/Out(512B)
  Initial notmuch new   579.60	348.86	14.26	217188	5330266/3501272

to

			Wall(s)	Usr(s)	Sys(s)	Res(K)	In/Out(512B)
  Initial notmuch new   620.51	368.62	15.48	217156	5330354/3416456

On an SSD I don't detect a significant difference (<0.5% speedup).

d


* Re: [PATCH v2] new: Don't scan unchanged directories with no sub-directories
  2013-10-26  0:13     ` David Bremner
@ 2013-10-26 11:52       ` David Bremner
  0 siblings, 0 replies; 9+ messages in thread
From: David Bremner @ 2013-10-26 11:52 UTC (permalink / raw)
  To: Austin Clements, notmuch


David Bremner <david@tethera.net> writes:

> Austin Clements <amdragon@MIT.EDU> writes:
>
>> This can substantially reduce the cost of notmuch new in some
>> situations, such as when the file system cache is cold or when the
>> Maildir is on NFS.
>
> On my desktop at home (a Core i7 950) with spinning rust disks (and LVM
> on LUKS) this patch yields about a 7% slowdown in the initial new perf
> test
>
> from
>
> 			Wall(s)	Usr(s)	Sys(s)	Res(K)	In/Out(512B)
>   Initial notmuch new   579.60	348.86	14.26	217188	5330266/3501272
>
> to
>
> 			Wall(s)	Usr(s)	Sys(s)	Res(K)	In/Out(512B)
>   Initial notmuch new   620.51	368.62	15.48	217156	5330354/3416456
>
> On an SSD I don't detect a significant difference (<0.5% speedup).

Seems like a false alarm. Averaging over 10 repetitions, the patched
version is about 1% faster. Unfortunately it points out that our
performance test suite should really do more than one repetition for
each test.


[-- Attachment #2: crunch.sh --]
[-- Type: text/x-sh, Size: 246 bytes --]

#!/bin/bash

test_description='notmuch new'

. ./perf-test-lib.sh

time_start
for i in $(seq 1 10); do
    rm -rf "${MAIL_DIR}/.notmuch"
    sudo /home/bremner/config/scripts/drop-caches
    time_run "notmuch new #$i" 'notmuch new'
done

time_done

[-- Attachment #3: drop-caches --]
[-- Type: application/octet-stream, Size: 112 bytes --]

sync
sync
echo 1 > /proc/sys/vm/drop_caches
echo 2 > /proc/sys/vm/drop_caches
echo 3 > /proc/sys/vm/drop_caches


* Re: [PATCH v2] new: Don't scan unchanged directories with no sub-directories
  2013-10-24 21:38   ` [PATCH v2] " Austin Clements
  2013-10-25 11:46     ` Tomi Ollila
  2013-10-26  0:13     ` David Bremner
@ 2013-10-28 20:00     ` David Bremner
  2013-10-28 20:46       ` Vladimir Marek
  2 siblings, 1 reply; 9+ messages in thread
From: David Bremner @ 2013-10-28 20:00 UTC (permalink / raw)
  To: Austin Clements, notmuch

Austin Clements <amdragon@MIT.EDU> writes:

> This can substantially reduce the cost of notmuch new in some
> situations, such as when the file system cache is cold or when the
> Maildir is on NFS.
> ---

pushed as commit 516efb7807

d


* Re: Re: [PATCH v2] new: Don't scan unchanged directories with no sub-directories
  2013-10-28 20:00     ` David Bremner
@ 2013-10-28 20:46       ` Vladimir Marek
  0 siblings, 0 replies; 9+ messages in thread
From: Vladimir Marek @ 2013-10-28 20:46 UTC (permalink / raw)
  To: David Bremner; +Cc: notmuch

> > This can substantially reduce the cost of notmuch new in some
> > situations, such as when the file system cache is cold or when the
> > Maildir is on NFS.
> > ---
> 
> pushed as commit 516efb7807

Muchas gracias!
-- 
	Vlad


end of thread, other threads:[~2013-10-28 20:46 UTC | newest]

Thread overview: 9+ messages
2013-10-24 20:33 [PATCH] new: Don't scan unchanged directories with no sub-directories Austin Clements
2013-10-24 21:08 ` Austin Clements
2013-10-24 21:38   ` [PATCH v2] " Austin Clements
2013-10-25 11:46     ` Tomi Ollila
2013-10-25 11:59       ` Vladimir Marek
2013-10-26  0:13     ` David Bremner
2013-10-26 11:52       ` David Bremner
2013-10-28 20:00     ` David Bremner
2013-10-28 20:46       ` Vladimir Marek
