unofficial mirror of notmuch@notmuchmail.org
 help / color / mirror / code / Atom feed
* Patch for xapian defect #250
@ 2009-12-10  7:00 Kan-Ru Chen
  2009-12-11  5:13 ` Carl Worth
  0 siblings, 1 reply; 2+ messages in thread
From: Kan-Ru Chen @ 2009-12-10  7:00 UTC (permalink / raw)
  To: notmuch

[-- Attachment #1: Type: text/plain, Size: 8725 bytes --]


The termlist is already sorted, so this is the patch trying to minimize
the modification of database as suggested in the comment and Carl's
TODO file.

My poor profiling shows not much, but some improvement.

*Before*
 
% time notmuch tag +test id:hfntnu+gotv@eGroups.com   
MOD_PLISTS: 368
notmuch tag +test id:hfntnu+gotv@eGroups.com  0.05s user 0.03s system 11% cpu 0.673 total

% time notmuch tag -test id:hfntnu+gotv@eGroups.com   
MOD_PLISTS: 368
notmuch tag -test id:hfntnu+gotv@eGroups.com  0.06s user 0.01s system 10% cpu 0.681 total
 
*After*
 
% time notmuch tag +test id:hfntnu+gotv@eGroups.com                
MOD_PLIST: 1
notmuch tag +test id:hfntnu+gotv@eGroups.com  0.01s user 0.02s system 6% cpu 0.436 total

% time notmuch tag -test id:hfntnu+gotv@eGroups.com                
MOD_PLIST: 1
notmuch tag -test id:hfntnu+gotv@eGroups.com  0.01s user 0.01s system 5% cpu 0.383 total


% time notmuch tag +test tag:notmuch
notmuch tag +test tag:notmuch  1.71s user 0.03s system 65% cpu 2.632 total

% time notmuch tag -test tag:notmuch
notmuch tag -test tag:notmuch  1.61s user 0.02s system 73% cpu 2.204 total

% notmuch count tag:notmuch
682

--- flint_database.cc	2009-12-08 13:34:24.790284881 +0800
+++ flint_database.cc	2009-12-10 14:22:14.493653956 +0800
@@ -1188,7 +1188,7 @@
 
 	termlist.next();
 	while (!termlist.at_end()) {
-	    string tname = termlist.get_termname();
+            string tname = termlist.get_termname();
 	    position_table.delete_positionlist(did, tname);
 	    termcount wdf = termlist.get_wdf();
 
@@ -1278,20 +1278,50 @@
 	}
   
 	if (!modifying || document.internal->terms_modified()) {
-	    // FIXME - in the case where there is overlap between the new
-	    // termlist and the old termlist, it would be better to compare the
-	    // two lists, and make the minimum set of modifications required.
-	    // This would lead to smaller changesets for replication, and
-	    // probably be faster overall.
-
-	    // First, add entries to remove the postings in the underlying record.
 	    Xapian::Internal::RefCntPtr<const FlintWritableDatabase> ptrtothis(this);
 	    FlintTermList termlist(ptrtothis, did);
+            Xapian::TermIterator term = document.termlist_begin();
+	    Xapian::TermIterator term_end = document.termlist_end();
+            flint_doclen_t new_doclen = termlist.get_doclength();
+            string old_tname, new_tname;
+            
+            total_length -= new_doclen;
+            
+            termlist.next();
+            while (true) {
+              bool identical = false;
+              int cmp;
+              if (termlist.at_end() && term == term_end)
+                break;
+              if (!termlist.at_end() && term != term_end) {
+                old_tname = termlist.get_termname();
+                new_tname = *term;
+                cmp = old_tname.compare(new_tname);
+
+                // Check postlist to see whether they are identical
+                if (cmp == 0) {
+                  int new_count = term.positionlist_count();
+                  int old_count = termlist.positionlist_count();
+                  if (old_count == new_count) {
+                    PositionIterator it = term.positionlist_begin();
+                    PositionIterator it_end = term.positionlist_end();
+                    PositionIterator old = termlist.positionlist_begin();
+                    if (equal(it, it_end, old))
+                      identical = true;
+                  }
+                }
+              } else if (termlist.at_end()) {
+                cmp = 2;
+                new_tname = *term;
+              } else {
+                cmp = -2;
+                old_tname = termlist.get_termname();
+              }
 
-	    termlist.next();
-	    while (!termlist.at_end()) {
-		string tname = termlist.get_termname();
+              if (cmp < 0) {
+                const string& tname = old_tname;
 		termcount wdf = termlist.get_wdf();
+                new_doclen -= wdf;
 
 		map<string, pair<termcount_diff, termcount_diff> >::iterator i;
 		i = freq_deltas.find(tname);
@@ -1318,58 +1348,62 @@
 		    // Modifying a document we added/modified since the last flush.
 		    k->second = make_pair('D', 0u);
 		}
-
-		termlist.next();
-	    }
-
-	    total_length -= termlist.get_doclength();
-
-	    flint_doclen_t new_doclen = 0;
-	    Xapian::TermIterator term = document.termlist_begin();
-	    Xapian::TermIterator term_end = document.termlist_end();
-	    for ( ; term != term_end; ++term) {
-		// Calculate the new document length
+              } else if (!identical) {
+                const string& tname = new_tname;
 		termcount wdf = term.get_wdf();
-		new_doclen += wdf;
-
-		string tname = *term;
-		if (tname.size() > MAX_SAFE_TERM_LENGTH)
-		    throw Xapian::InvalidArgumentError("Term too long (> "STRINGIZE(MAX_SAFE_TERM_LENGTH)"): " + tname);
-		map<string, pair<termcount_diff, termcount_diff> >::iterator i;
-		i = freq_deltas.find(tname);
-		if (i == freq_deltas.end()) {
-		    freq_deltas.insert(make_pair(tname, make_pair(1, termcount_diff(wdf))));
-		} else {
-		    ++i->second.first;
-		    i->second.second += wdf;
-		}
+                new_doclen += wdf;
+                
+                if (cmp > 0) {
+                  if (tname.size() > MAX_SAFE_TERM_LENGTH)
+                    throw Xapian::InvalidArgumentError("Term too long (> "STRINGIZE(MAX_SAFE_TERM_LENGTH)"): " + tname);
+                  map<string, pair<termcount_diff, termcount_diff> >::iterator i;
+                  i = freq_deltas.find(tname);
+                  if (i == freq_deltas.end()) {
+                    freq_deltas.insert(make_pair(tname, make_pair(1, termcount_diff(wdf))));
+                  } else {
+                    ++i->second.first;
+                    i->second.second += wdf;
+                  }
+
+                  // Add did to tname's postlist
+                  map<string, map<docid, pair<char, termcount> > >::iterator j;
+                  j = mod_plists.find(tname);
+                  if (j == mod_plists.end()) {
+                    map<docid, pair<char, termcount> > m;
+                    j = mod_plists.insert(make_pair(tname, m)).first;
+                  }
+                  map<docid, pair<char, termcount> >::iterator k;
+                  k = j->second.find(did);
+                  if (k != j->second.end()) {
+                    Assert(k->second.first == 'D');
+                    k->second.first = 'M';
+                    k->second.second = wdf;
+                  } else {
+                    j->second.insert(make_pair(did, make_pair('A', wdf)));
+                  }
+                }
+
+                PositionIterator it = term.positionlist_begin();
+                PositionIterator it_end = term.positionlist_end();
+                if (it != it_end) {
+                  position_table.set_positionlist(
+                                                  did, tname, it, it_end);
+                } else {
+                  position_table.delete_positionlist(did, tname);
+                }
+              }
+              if (termlist.at_end())
+                ++term;
+              else if (term == term_end)
+                termlist.next();
+              else {
+                if (cmp >= 0)
+                  ++term;
+                if (cmp <= 0)
+                  termlist.next();
+              }
+            }
 
-		// Add did to tname's postlist
-		map<string, map<docid, pair<char, termcount> > >::iterator j;
-		j = mod_plists.find(tname);
-		if (j == mod_plists.end()) {
-		    map<docid, pair<char, termcount> > m;
-		    j = mod_plists.insert(make_pair(tname, m)).first;
-		}
-		map<docid, pair<char, termcount> >::iterator k;
-		k = j->second.find(did);
-		if (k != j->second.end()) {
-		    Assert(k->second.first == 'D');
-		    k->second.first = 'M';
-		    k->second.second = wdf;
-		} else {
-		    j->second.insert(make_pair(did, make_pair('A', wdf)));
-		}
-
-		PositionIterator it = term.positionlist_begin();
-		PositionIterator it_end = term.positionlist_end();
-		if (it != it_end) {
-		    position_table.set_positionlist(
-			did, tname, it, it_end);
-		} else {
-		    position_table.delete_positionlist(did, tname);
-		}
-	    }
 	    LOGLINE(DB, "Calculated doclen for replacement document " << did << " as " << new_doclen);
 
 	    // Set the termlist


-- 
Kan-Ru Chen | http://kanru.info

Q: Why are my replies five sentences or less?
A: http://five.sentenc.es/

[-- Attachment #2: Type: application/pgp-signature, Size: 835 bytes --]

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: Patch for xapian defect #250
  2009-12-10  7:00 Patch for xapian defect #250 Kan-Ru Chen
@ 2009-12-11  5:13 ` Carl Worth
  0 siblings, 0 replies; 2+ messages in thread
From: Carl Worth @ 2009-12-11  5:13 UTC (permalink / raw)
  To: Kan-Ru Chen, notmuch

[-- Attachment #1: Type: text/plain, Size: 1195 bytes --]

On Thu, 10 Dec 2009 15:00:42 +0800, Kan-Ru Chen <kanru@kanru.info> wrote:
> The termlist is already sorted, so this is the patch trying to minimize
> the modification of database as suggested in the comment and Carl's
> TODO file.

Fantastic, Kan-Ru!

> My poor profiling shows not much, but some improvement.

Now you're just understating for sake of the pun. A 5-6x performance
improvement looks great. And I see that as well in my testing:

Before:

	$ time notmuch tag +foo tag:sent
	real 3m18.067s
	$ time notmuch tag -foo tag:sent
	real 3m12.940s

After:

	$ time notmuch tag +foo tag:sent
        real 0m31.497s
	$ time notmuch tag -foo tag:sent
	real 0m28.722s

I didn't flush the OS caches between runs, but a subsequent run of the
"before" code still performed similarly slow:

	$ time notmuch tag +foo tag:sent
	real 3m7.172s

And if I *had* used cold caches for every run the benefit of the patch
would have looked even better.

Anyway, we should get this upstream to the Xapian folks straight
away. I expect they'll want to see a patch to the chert backend as well
as the flint backend, (but fortunately the relevant code looks very
similar if not identical).

Thanks again,

-Carl

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2009-12-11  5:13 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-12-10  7:00 Patch for xapian defect #250 Kan-Ru Chen
2009-12-11  5:13 ` Carl Worth

Code repositories for project(s) associated with this public inbox

	https://yhetil.org/notmuch.git/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).