* [PATCH 0/3] Making the search indexes optional
@ 2018-07-17 23:27 Eric W. Biederman
2018-07-17 23:30 ` [PATCH 1/3] SearchIdx.pm: Make indexing search positions optional Eric W. Biederman
` (3 more replies)
0 siblings, 4 replies; 17+ messages in thread
From: Eric W. Biederman @ 2018-07-17 23:27 UTC (permalink / raw)
To: Eric Wong; +Cc: meta
Here is the code to make the Xapian search indexes optional.
The first patch makes the term position database optional.
The second patch makes anything in Xapian optional.
Finally the last patch adds a config option.
At the end of the day it all looks simple and straight forward so I feel
good about the code. At the very least it looks like a good starting
point.
What this code does not do is make the Xapian code modules optional. As
that is more involved, and there is not much reward for that. With a
little cleverness in moving around code that is probably possible in a
follow change.
Eric W. Biederman (3):
SearchIdx.pm: Make indexing search positions optional
SearchIdx: Add the mechanism for making all Xapian indexing optional
SearchIdx: Allow the amount of indexing be configured
lib/PublicInbox/Config.pm | 2 +-
lib/PublicInbox/SearchIdx.pm | 255 +++++++++++++++++++++++--------------------
2 files changed, 137 insertions(+), 120 deletions(-)
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 1/3] SearchIdx.pm: Make indexing search positions optional
2018-07-17 23:27 [PATCH 0/3] Making the search indexes optional Eric W. Biederman
@ 2018-07-17 23:30 ` Eric W. Biederman
2018-07-17 23:30 ` [PATCH 2/3] SearchIdx: Add the mechanism for making all Xapian indexing optional Eric W. Biederman
` (2 subsequent siblings)
3 siblings, 0 replies; 17+ messages in thread
From: Eric W. Biederman @ 2018-07-17 23:30 UTC (permalink / raw)
To: Eric Wong; +Cc: meta, Eric W. Biederman
About half the size of the Xapian search index turns out to be search
positions. The search positions are only used in a very narrow set of
queries. Make the search positions optional so people don't need to
pay the cost of queries they will never make.
This also makes public-inbox more approachable for light hacking as
generating all of the indexes is time consuming.
The way this is done is to add a method to SearchIdx called index_text
that wraps the call of the term generator method index_text. The new
index_text method takes care of calling both index_text and
increase_termpos (the two functions that are responsible for position
data).
Then index_users, index_diff_inc, index_old_diff_fn, index_diff,
index_body are made proper methods that calls the new index_text.
Callers of the new index_text are slightly simplified as they don't
need to call increase_termpos as well.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
lib/PublicInbox/SearchIdx.pm | 94 +++++++++++++++++++-----------------
1 file changed, 49 insertions(+), 45 deletions(-)
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 0e0796c12c12..cc92c389a152 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -51,6 +51,7 @@ sub new {
my $git_dir = $mainrepo;
my ($altid, $git);
my $version = 1;
+ my $indexlevel = 'positions';
if (ref $ibx) {
$mainrepo = $ibx->{mainrepo};
$altid = $ibx->{altid};
@@ -72,6 +73,7 @@ sub new {
git => $ibx->git,
-altid => $altid,
version => $version,
+ indexlevel => $indexlevel,
}, $class;
$ibx->umask_prepare;
if ($version == 1) {
@@ -118,34 +120,42 @@ sub add_val ($$$) {
$doc->add_value($col, $num);
}
+sub index_text ($$$$)
+{
+ my ($self, $field, $n, $text) = @_;
+ my $tg = $self->term_generator;
+
+ if ($self->{indexlevel} eq 'positions') {
+ $tg->index_text($field, $n, $text);
+ $tg->increase_termpos;
+ } else {
+ $tg->index_text_without_positions($field, $n, $text);
+ }
+}
+
sub index_users ($$) {
- my ($tg, $smsg) = @_;
+ my ($self, $smsg) = @_;
my $from = $smsg->from;
my $to = $smsg->to;
my $cc = $smsg->cc;
- $tg->index_text($from, 1, 'A'); # A - author
- $tg->increase_termpos;
- $tg->index_text($to, 1, 'XTO') if $to ne '';
- $tg->increase_termpos;
- $tg->index_text($cc, 1, 'XCC') if $cc ne '';
- $tg->increase_termpos;
+ $self->index_text($from, 1, 'A'); # A - author
+ $self->index_text($to, 1, 'XTO') if $to ne '';
+ $self->index_text($cc, 1, 'XCC') if $cc ne '';
}
sub index_diff_inc ($$$$) {
- my ($tg, $text, $pfx, $xnq) = @_;
+ my ($self, $text, $pfx, $xnq) = @_;
if (@$xnq) {
- $tg->index_text(join("\n", @$xnq), 1, 'XNQ');
- $tg->increase_termpos;
+ $self->index_text(join("\n", @$xnq), 1, 'XNQ');
@$xnq = ();
}
- $tg->index_text($text, 1, $pfx);
- $tg->increase_termpos;
+ $self->index_text($text, 1, $pfx);
}
sub index_old_diff_fn {
- my ($tg, $seen, $fa, $fb, $xnq) = @_;
+ my ($self, $seen, $fa, $fb, $xnq) = @_;
# no renames or space support for traditional diffs,
# find the number of leading common paths to strip:
@@ -156,7 +166,7 @@ sub index_old_diff_fn {
$fb = join('/', @fb);
if ($fa eq $fb) {
unless ($seen->{$fa}++) {
- index_diff_inc($tg, $fa, 'XDFN', $xnq);
+ $self->index_diff_inc($fa, 'XDFN', $xnq);
}
return 1;
}
@@ -167,22 +177,22 @@ sub index_old_diff_fn {
}
sub index_diff ($$$) {
- my ($tg, $lines, $doc) = @_;
+ my ($self, $lines, $doc) = @_;
my %seen;
my $in_diff;
my @xnq;
my $xnq = \@xnq;
foreach (@$lines) {
if ($in_diff && s/^ //) { # diff context
- index_diff_inc($tg, $_, 'XDFCTX', $xnq);
+ $self->index_diff_inc($_, 'XDFCTX', $xnq);
} elsif (/^-- $/) { # email signature begins
$in_diff = undef;
} elsif (m!^diff --git ("?a/.+) ("?b/.+)\z!) {
my ($fa, $fb) = ($1, $2);
my $fn = (split('/', git_unquote($fa), 2))[1];
- $seen{$fn}++ or index_diff_inc($tg, $fn, 'XDFN', $xnq);
+ $seen{$fn}++ or $self->index_diff_inc($fn, 'XDFN', $xnq);
$fn = (split('/', git_unquote($fb), 2))[1];
- $seen{$fn}++ or index_diff_inc($tg, $fn, 'XDFN', $xnq);
+ $seen{$fn}++ or $self->index_diff_inc($fn, 'XDFN', $xnq);
$in_diff = 1;
# traditional diff:
} elsif (m/^diff -(.+) (\S+) (\S+)$/) {
@@ -190,26 +200,26 @@ sub index_diff ($$$) {
push @xnq, $_;
# only support unified:
next unless $opt =~ /[uU]/;
- $in_diff = index_old_diff_fn($tg, \%seen, $fa, $fb,
+ $in_diff = $self->index_old_diff_fn(\%seen, $fa, $fb,
$xnq);
} elsif (m!^--- ("?a/.+)!) {
my $fn = (split('/', git_unquote($1), 2))[1];
- $seen{$fn}++ or index_diff_inc($tg, $fn, 'XDFN', $xnq);
+ $seen{$fn}++ or $self->index_diff_inc($fn, 'XDFN', $xnq);
$in_diff = 1;
} elsif (m!^\+\+\+ ("?b/.+)!) {
my $fn = (split('/', git_unquote($1), 2))[1];
- $seen{$fn}++ or index_diff_inc($tg, $fn, 'XDFN', $xnq);
+ $seen{$fn}++ or $self->index_diff_inc($fn, 'XDFN', $xnq);
$in_diff = 1;
} elsif (/^--- (\S+)/) {
$in_diff = $1;
push @xnq, $_;
} elsif (defined $in_diff && /^\+\+\+ (\S+)/) {
- $in_diff = index_old_diff_fn($tg, \%seen, $in_diff, $1,
+ $in_diff = $self->index_old_diff_fn(\%seen, $in_diff, $1,
$xnq);
} elsif ($in_diff && s/^\+//) { # diff added
- index_diff_inc($tg, $_, 'XDFB', $xnq);
+ $self->index_diff_inc($_, 'XDFB', $xnq);
} elsif ($in_diff && s/^-//) { # diff removed
- index_diff_inc($tg, $_, 'XDFA', $xnq);
+ $self->index_diff_inc($_, 'XDFA', $xnq);
} elsif (m!^index ([a-f0-9]+)\.\.([a-f0-9]+)!) {
my ($ba, $bb) = ($1, $2);
index_git_blob_id($doc, 'XDFPRE', $ba);
@@ -219,7 +229,7 @@ sub index_diff ($$$) {
# traditional diff w/o -p
} elsif (/^@@ (?:\S+) (?:\S+) @@\s*(\S+.*)$/) {
# hunk header context
- index_diff_inc($tg, $1, 'XDFHH', $xnq);
+ $self->index_diff_inc($1, 'XDFHH', $xnq);
# ignore the following lines:
} elsif (/^(?:dis)similarity index/ ||
/^(?:old|new) mode/ ||
@@ -238,25 +248,23 @@ sub index_diff ($$$) {
}
}
- $tg->index_text(join("\n", @xnq), 1, 'XNQ');
- $tg->increase_termpos;
+ $self->index_text(join("\n", @xnq), 1, 'XNQ');
}
sub index_body ($$$) {
- my ($tg, $lines, $doc) = @_;
+ my ($self, $lines, $doc) = @_;
my $txt = join("\n", @$lines);
if ($doc) {
# does it look like a diff?
if ($txt =~ /^(?:diff|---|\+\+\+) /ms) {
$txt = undef;
- index_diff($tg, $lines, $doc);
+ $self->index_diff($lines, $doc);
} else {
- $tg->index_text($txt, 1, 'XNQ');
+ $self->index_text($txt, 1, 'XNQ');
}
} else {
- $tg->index_text($txt, 0, 'XQUOT');
+ $self->index_text($txt, 0, 'XQUOT');
}
- $tg->increase_termpos;
@$lines = ();
}
@@ -284,18 +292,15 @@ sub add_message {
my $tg = $self->term_generator;
$tg->set_document($doc);
- $tg->index_text($subj, 1, 'S') if $subj;
- $tg->increase_termpos;
-
- index_users($tg, $smsg);
+ $self->index_text($subj, 1, 'S') if $subj;
+ $self->index_users($smsg);
msg_iter($mime, sub {
my ($part, $depth, @idx) = @{$_[0]};
my $ct = $part->content_type || 'text/plain';
my $fn = $part->filename;
if (defined $fn && $fn ne '') {
- $tg->index_text($fn, 1, 'XFN');
- $tg->increase_termpos;
+ $self->index_text($fn, 1, 'XFN');
}
return if $ct =~ m!\btext/x?html\b!i;
@@ -318,27 +323,26 @@ sub add_message {
my @lines = split(/\n/, $body);
while (defined(my $l = shift @lines)) {
if ($l =~ /^>/) {
- index_body($tg, \@orig, $doc) if @orig;
+ $self->index_body(\@orig, $doc) if @orig;
push @quot, $l;
} else {
- index_body($tg, \@quot, 0) if @quot;
+ $self->index_body(\@quot, 0) if @quot;
push @orig, $l;
}
}
- index_body($tg, \@quot, 0) if @quot;
- index_body($tg, \@orig, $doc) if @orig;
+ $self->index_body(\@quot, 0) if @quot;
+ $self->index_body(\@orig, $doc) if @orig;
});
foreach my $mid (@$mids) {
- $tg->index_text($mid, 1, 'XM');
+ $self->index_text($mid, 1, 'XM');
# because too many Message-IDs are prefixed with
# "Pine.LNX."...
if ($mid =~ /\w{12,}/) {
my @long = ($mid =~ /(\w{3,}+)/g);
- $tg->index_text(join(' ', @long), 1, 'XM');
+ $self->index_text(join(' ', @long), 1, 'XM');
}
- $tg->increase_termpos;
}
$smsg->{to} = $smsg->{cc} = '';
PublicInbox::OverIdx::parse_references($smsg, $mid0, $mids);
--
2.17.1
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH 2/3] SearchIdx: Add the mechanism for making all Xapian indexing optional
2018-07-17 23:27 [PATCH 0/3] Making the search indexes optional Eric W. Biederman
2018-07-17 23:30 ` [PATCH 1/3] SearchIdx.pm: Make indexing search positions optional Eric W. Biederman
@ 2018-07-17 23:30 ` Eric W. Biederman
2018-07-17 23:30 ` [PATCH 3/3] SearchIdx: Allow the amount of indexing be configured Eric W. Biederman
2018-07-18 10:17 ` [PATCH 0/3] Making the search indexes optional Eric Wong
3 siblings, 0 replies; 17+ messages in thread
From: Eric W. Biederman @ 2018-07-17 23:30 UTC (permalink / raw)
To: Eric Wong; +Cc: meta, Eric W. Biederman
Create a new method index_message that holds all of the code to create
Xapian indexes. The creation of this method simpliy involved
idenitifying the relevant code and moving it from add_message.
A call is added to index_message from add_message to keep everything
working as it currently does. The new call is made conditional upon
index levels of 'position' and 'terms' The two things public-inbox
uses Xapian to index.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
lib/PublicInbox/SearchIdx.pm | 171 ++++++++++++++++++-----------------
1 file changed, 88 insertions(+), 83 deletions(-)
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index cc92c389a152..deb87db3f88a 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -268,10 +268,94 @@ sub index_body ($$$) {
@$lines = ();
}
+sub index_message ($$$$$) {
+ my ($self, $mime, $num, $oid, $mids, $mid0) = @_;
+ my $smsg = PublicInbox::SearchMsg->new($mime);
+ my $doc = $smsg->{doc};
+ my $subj = $smsg->subject;
+ add_val($doc, PublicInbox::Search::TS(), $smsg->ts);
+ my @ds = gmtime($smsg->ds);
+ my $yyyymmdd = strftime('%Y%m%d', @ds);
+ add_val($doc, PublicInbox::Search::YYYYMMDD(), $yyyymmdd);
+ my $dt = strftime('%Y%m%d%H%M%S', @ds);
+ add_val($doc, PublicInbox::Search::DT(), $dt);
+
+ my $tg = $self->term_generator;
+
+ $tg->set_document($doc);
+ $self->index_text($subj, 1, 'S') if $subj;
+ $self->index_users($smsg);
+
+ msg_iter($mime, sub {
+ my ($part, $depth, @idx) = @{$_[0]};
+ my $ct = $part->content_type || 'text/plain';
+ my $fn = $part->filename;
+ if (defined $fn && $fn ne '') {
+ $self->index_text($fn, 1, 'XFN');
+ }
+
+ return if $ct =~ m!\btext/x?html\b!i;
+
+ my $s = eval { $part->body_str };
+ if ($@) {
+ if ($ct =~ m!\btext/plain\b!i) {
+ # Try to assume UTF-8 because Alpine
+ # seems to do wacky things and set
+ # charset=X-UNKNOWN
+ $part->charset_set('UTF-8');
+ $s = eval { $part->body_str };
+ $s = $part->body if $@;
+ }
+ }
+ defined $s or return;
+
+ my (@orig, @quot);
+ my $body = $part->body;
+ my @lines = split(/\n/, $body);
+ while (defined(my $l = shift @lines)) {
+ if ($l =~ /^>/) {
+ $self->index_body(\@orig, $doc) if @orig;
+ push @quot, $l;
+ } else {
+ $self->index_body(\@quot, 0) if @quot;
+ push @orig, $l;
+ }
+ }
+ $self->index_body(\@quot, 0) if @quot;
+ $self->index_body(\@orig, $doc) if @orig;
+ });
+
+ foreach my $mid (@$mids) {
+ $self->index_text($mid, 1, 'XM');
+
+ # because too many Message-IDs are prefixed with
+ # "Pine.LNX."...
+ if ($mid =~ /\w{12,}/) {
+ my @long = ($mid =~ /(\w{3,}+)/g);
+ $self->index_text(join(' ', @long), 1, 'XM');
+ }
+ }
+ $smsg->{to} = $smsg->{cc} = '';
+ PublicInbox::OverIdx::parse_references($smsg, $mid0, $mids);
+ my $data = $smsg->to_doc_data($oid, $mid0);
+ $doc->set_data($data);
+ if (my $altid = $self->{-altid}) {
+ foreach my $alt (@$altid) {
+ my $pfx = $alt->{xprefix};
+ foreach my $mid (@$mids) {
+ my $id = $alt->mid2alt($mid);
+ next unless defined $id;
+ $doc->add_boolean_term($pfx . $id);
+ }
+ }
+ }
+ $doc->add_boolean_term('Q' . $_) foreach @$mids;
+ $self->{xdb}->replace_document($num, $doc);
+}
+
sub add_message {
# mime = Email::MIME object
my ($self, $mime, $bytes, $num, $oid, $mid0) = @_;
- my $doc_id;
my $mids = mids($mime->header_obj);
$mid0 = $mids->[0] unless defined $mid0; # v1 compatibility
unless (defined $num) { # v1
@@ -279,98 +363,19 @@ sub add_message {
$num = index_mm($self, $mime);
}
eval {
- my $smsg = PublicInbox::SearchMsg->new($mime);
- my $doc = $smsg->{doc};
- my $subj = $smsg->subject;
- add_val($doc, PublicInbox::Search::TS(), $smsg->ts);
- my @ds = gmtime($smsg->ds);
- my $yyyymmdd = strftime('%Y%m%d', @ds);
- add_val($doc, PublicInbox::Search::YYYYMMDD(), $yyyymmdd);
- my $dt = strftime('%Y%m%d%H%M%S', @ds);
- add_val($doc, PublicInbox::Search::DT(), $dt);
-
- my $tg = $self->term_generator;
-
- $tg->set_document($doc);
- $self->index_text($subj, 1, 'S') if $subj;
- $self->index_users($smsg);
-
- msg_iter($mime, sub {
- my ($part, $depth, @idx) = @{$_[0]};
- my $ct = $part->content_type || 'text/plain';
- my $fn = $part->filename;
- if (defined $fn && $fn ne '') {
- $self->index_text($fn, 1, 'XFN');
- }
-
- return if $ct =~ m!\btext/x?html\b!i;
-
- my $s = eval { $part->body_str };
- if ($@) {
- if ($ct =~ m!\btext/plain\b!i) {
- # Try to assume UTF-8 because Alpine
- # seems to do wacky things and set
- # charset=X-UNKNOWN
- $part->charset_set('UTF-8');
- $s = eval { $part->body_str };
- $s = $part->body if $@;
- }
- }
- defined $s or return;
-
- my (@orig, @quot);
- my $body = $part->body;
- my @lines = split(/\n/, $body);
- while (defined(my $l = shift @lines)) {
- if ($l =~ /^>/) {
- $self->index_body(\@orig, $doc) if @orig;
- push @quot, $l;
- } else {
- $self->index_body(\@quot, 0) if @quot;
- push @orig, $l;
- }
- }
- $self->index_body(\@quot, 0) if @quot;
- $self->index_body(\@orig, $doc) if @orig;
- });
-
- foreach my $mid (@$mids) {
- $self->index_text($mid, 1, 'XM');
-
- # because too many Message-IDs are prefixed with
- # "Pine.LNX."...
- if ($mid =~ /\w{12,}/) {
- my @long = ($mid =~ /(\w{3,}+)/g);
- $self->index_text(join(' ', @long), 1, 'XM');
- }
+ if ($self->{indexlevel} =~ m/(positions|terms)/) {
+ $self->index_message($mime, $num, $oid, $mids, $mid0)
}
- $smsg->{to} = $smsg->{cc} = '';
- PublicInbox::OverIdx::parse_references($smsg, $mid0, $mids);
- my $data = $smsg->to_doc_data($oid, $mid0);
- $doc->set_data($data);
- if (my $altid = $self->{-altid}) {
- foreach my $alt (@$altid) {
- my $pfx = $alt->{xprefix};
- foreach my $mid (@$mids) {
- my $id = $alt->mid2alt($mid);
- next unless defined $id;
- $doc->add_boolean_term($pfx . $id);
- }
- }
- }
-
if (my $over = $self->{over}) {
$over->add_overview($mime, $bytes, $num, $oid, $mid0);
}
- $doc->add_boolean_term('Q' . $_) foreach @$mids;
- $self->{xdb}->replace_document($doc_id = $num, $doc);
};
if ($@) {
warn "failed to index message <".join('> <',@$mids).">: $@\n";
return undef;
}
- $doc_id;
+ $num;
}
# returns begin and end PostingIterator
--
2.17.1
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH 3/3] SearchIdx: Allow the amount of indexing be configured
2018-07-17 23:27 [PATCH 0/3] Making the search indexes optional Eric W. Biederman
2018-07-17 23:30 ` [PATCH 1/3] SearchIdx.pm: Make indexing search positions optional Eric W. Biederman
2018-07-17 23:30 ` [PATCH 2/3] SearchIdx: Add the mechanism for making all Xapian indexing optional Eric W. Biederman
@ 2018-07-17 23:30 ` Eric W. Biederman
2018-07-18 10:22 ` Eric Wong
2018-07-18 10:17 ` [PATCH 0/3] Making the search indexes optional Eric Wong
3 siblings, 1 reply; 17+ messages in thread
From: Eric W. Biederman @ 2018-07-17 23:30 UTC (permalink / raw)
To: Eric Wong; +Cc: meta, Eric W. Biederman
This adds a new inbox configuration option 'indexlevel' that can take
the values 'positions', 'terms', and 'over'.
When set to 'positions' everything is indexed including the positions
of all terms.
When set to 'terms' everything except the positions of terms is
indexed.
When set to 'over' terms and positions are not indexed. Just the
Overview database for NNTP is created. Which is still quite good and
allows searching for messages by Message-ID. But there are no indexes to support
searching inside the email messages themselves.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
lib/PublicInbox/Config.pm | 2 +-
lib/PublicInbox/SearchIdx.pm | 8 ++++++++
2 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/lib/PublicInbox/Config.pm b/lib/PublicInbox/Config.pm
index 289c36a6ccf1..78586560cf0f 100644
--- a/lib/PublicInbox/Config.pm
+++ b/lib/PublicInbox/Config.pm
@@ -158,7 +158,7 @@ sub _fill {
foreach my $k (qw(mainrepo filter url newsgroup
infourl watch watchheader httpbackendmax
- replyto feedmax nntpserver)) {
+ replyto feedmax nntpserver indexlevel)) {
my $v = $self->{"$pfx.$k"};
$rv->{$k} = $v if defined $v;
}
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index deb87db3f88a..c42821f813a7 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -47,6 +47,7 @@ sub git_unquote ($) {
sub new {
my ($class, $ibx, $creat, $part) = @_;
+ my $levels = qr/(positions|terms|over)/;
my $mainrepo = $ibx; # for "public-inbox-index" w/o entry in config
my $git_dir = $mainrepo;
my ($altid, $git);
@@ -62,6 +63,13 @@ sub new {
PublicInbox::AltId->new($ibx, $_);
} @$altid ];
}
+ if ($ibx->{indexlevel}) {
+ if ($ibx->{indexlevel} =~ $levels) {
+ $indexlevel = $ibx->{indexlevel};
+ } else {
+ die("Invalid indexlevel $ibx->{indexlevel}\n");
+ }
+ }
} else { # v1
$ibx = { mainrepo => $git_dir, version => 1 };
}
--
2.17.1
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH 0/3] Making the search indexes optional
2018-07-17 23:27 [PATCH 0/3] Making the search indexes optional Eric W. Biederman
` (2 preceding siblings ...)
2018-07-17 23:30 ` [PATCH 3/3] SearchIdx: Allow the amount of indexing be configured Eric W. Biederman
@ 2018-07-18 10:17 ` Eric Wong
3 siblings, 0 replies; 17+ messages in thread
From: Eric Wong @ 2018-07-18 10:17 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: meta
"Eric W. Biederman" <ebiederm@xmission.com> wrote:
> Here is the code to make the Xapian search indexes optional.
>
> The first patch makes the term position database optional.
> The second patch makes anything in Xapian optional.
> Finally the last patch adds a config option.
>
> At the end of the day it all looks simple and straight forward so I feel
> good about the code. At the very least it looks like a good starting
> point.
Thanks for your work on this. I have a few minor comments in
a separate message. I expect you'll have tests and maybe
an option for public-inbox-index eventually?
> What this code does not do is make the Xapian code modules optional. As
> that is more involved, and there is not much reward for that. With a
> little cleverness in moving around code that is probably possible in a
> follow change.
Not required; but it could be nice-to-have for portability and
ease-of-installation (probably for older distros and
non-GNU/Linux platforms).
Off the top of my head, there's also some places in NNTP.pm
where it needlessly uses Search.pm wrappers instead of using
Over.pm interfaces directly.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 3/3] SearchIdx: Allow the amount of indexing be configured
2018-07-17 23:30 ` [PATCH 3/3] SearchIdx: Allow the amount of indexing be configured Eric W. Biederman
@ 2018-07-18 10:22 ` Eric Wong
2018-07-18 16:00 ` Eric W. Biederman
0 siblings, 1 reply; 17+ messages in thread
From: Eric Wong @ 2018-07-18 10:22 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: meta
"Eric W. Biederman" <ebiederm@xmission.com> wrote:
> This adds a new inbox configuration option 'indexlevel' that can take
> the values 'positions', 'terms', and 'over'.
The names of these user-facing configuration variables aren't
obviously "levels" at all; especially to people not familiar
with Xapian or public-inbox internals.
As for "over", at least it should be spelled out "overview";
but really, I would much prefer something which wouldn't require
consulting the manual for explanations, such as:
'full', 'medium', 'minimal'
Where it's obvious which one sits relative to the others.
That wouldn't tie our user-facing configuration to our internal
choices or terminology used by Xapian, either. I'm pretty happy
with Xapian; but it may be worth exploring other search engines
at some point...
> --- a/lib/PublicInbox/SearchIdx.pm
> +++ b/lib/PublicInbox/SearchIdx.pm
> @@ -47,6 +47,7 @@ sub git_unquote ($) {
>
> sub new {
> my ($class, $ibx, $creat, $part) = @_;
> + my $levels = qr/(positions|terms|over)/;
Please anchor matches so they match expected strings exactly.
It lets typos be caught and makes life easier for 3rd-party
tools and implementations if we're stricter in what we accept.
Captures aren't necessary, so '?:' can be used:
qr/\A(?:full|medium|minimal)\z/
Same comment applies to patch 2/3
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 3/3] SearchIdx: Allow the amount of indexing be configured
2018-07-18 10:22 ` Eric Wong
@ 2018-07-18 16:00 ` Eric W. Biederman
2018-07-18 16:31 ` Eric Wong
0 siblings, 1 reply; 17+ messages in thread
From: Eric W. Biederman @ 2018-07-18 16:00 UTC (permalink / raw)
To: Eric Wong; +Cc: meta
Eric Wong <e@80x24.org> writes:
> "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>> This adds a new inbox configuration option 'indexlevel' that can take
>> the values 'positions', 'terms', and 'over'.
>
> The names of these user-facing configuration variables aren't
> obviously "levels" at all; especially to people not familiar
> with Xapian or public-inbox internals.
>
> As for "over", at least it should be spelled out "overview";
> but really, I would much prefer something which wouldn't require
> consulting the manual for explanations, such as:
>
> 'full', 'medium', 'minimal'
Do you mind the config option name indexlevel?
I don't mind changing the names I just needed some name and
those names were present.
> Where it's obvious which one sits relative to the others.
>
> That wouldn't tie our user-facing configuration to our internal
> choices or terminology used by Xapian, either. I'm pretty happy
> with Xapian; but it may be worth exploring other search engines
> at some point...
>
>> --- a/lib/PublicInbox/SearchIdx.pm
>> +++ b/lib/PublicInbox/SearchIdx.pm
>> @@ -47,6 +47,7 @@ sub git_unquote ($) {
>>
>> sub new {
>> my ($class, $ibx, $creat, $part) = @_;
>> + my $levels = qr/(positions|terms|over)/;
>
> Please anchor matches so they match expected strings exactly.
> It lets typos be caught and makes life easier for 3rd-party
> tools and implementations if we're stricter in what we accept.
> Captures aren't necessary, so '?:' can be used:
>
> qr/\A(?:full|medium|minimal)\z/
>
> Same comment applies to patch 2/3
Good point. I wish I knew a way so I didn't have to repeat the test
so often. But getting the user space interface correct is the first
step then we can optimize if need be.
Eric
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 3/3] SearchIdx: Allow the amount of indexing be configured
2018-07-18 16:00 ` Eric W. Biederman
@ 2018-07-18 16:31 ` Eric Wong
2018-07-18 16:52 ` [PATCH v2 1/3] Making the search indexes optional Eric W. Biederman
0 siblings, 1 reply; 17+ messages in thread
From: Eric Wong @ 2018-07-18 16:31 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: meta
"Eric W. Biederman" <ebiederm@xmission.com> wrote:
> Eric Wong <e@80x24.org> writes:
>
> > "Eric W. Biederman" <ebiederm@xmission.com> wrote:
> >> This adds a new inbox configuration option 'indexlevel' that can take
> >> the values 'positions', 'terms', and 'over'.
> >
> > The names of these user-facing configuration variables aren't
> > obviously "levels" at all; especially to people not familiar
> > with Xapian or public-inbox internals.
> >
> > As for "over", at least it should be spelled out "overview";
> > but really, I would much prefer something which wouldn't require
> > consulting the manual for explanations, such as:
> >
> > 'full', 'medium', 'minimal'
>
> Do you mind the config option name indexlevel?
'indexlevel' is fine. I originally had something along the
lines of 'type' in mind (e.g. 'indextype'); but maybe 'level'
is more obvious and requires less documentation.
> I don't mind changing the names I just needed some name and
> those names were present.
>
> > Where it's obvious which one sits relative to the others.
> >
> > That wouldn't tie our user-facing configuration to our internal
> > choices or terminology used by Xapian, either. I'm pretty happy
> > with Xapian; but it may be worth exploring other search engines
> > at some point...
> >
> >> --- a/lib/PublicInbox/SearchIdx.pm
> >> +++ b/lib/PublicInbox/SearchIdx.pm
> >> @@ -47,6 +47,7 @@ sub git_unquote ($) {
> >>
> >> sub new {
> >> my ($class, $ibx, $creat, $part) = @_;
> >> + my $levels = qr/(positions|terms|over)/;
> >
> > Please anchor matches so they match expected strings exactly.
> > It lets typos be caught and makes life easier for 3rd-party
> > tools and implementations if we're stricter in what we accept.
> > Captures aren't necessary, so '?:' can be used:
> >
> > qr/\A(?:full|medium|minimal)\z/
> >
> > Same comment applies to patch 2/3
>
> Good point. I wish I knew a way so I didn't have to repeat the test
> so often. But getting the user space interface correct is the first
> step then we can optimize if need be.
As in repeating the string comparison? Perhaps it could be
mapped to different subroutine calls:
sub do_index_text { ... }
sub do_index_text_without_positions { ... }
sub do_overview { ... }
my %INDEX_LEVEL = (
full => *do_index_text,
medium => *do_index_text_without_positions,
minimal => *do_overview,
);
$self->{index_cb} = $INDEX_LEVEL{$ibx->{indexlevel}};
defined $self->{index_cb} or die "invalid indexlevel\n";
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH v2 1/3] Making the search indexes optional
2018-07-18 16:31 ` Eric Wong
@ 2018-07-18 16:52 ` Eric W. Biederman
2018-07-18 16:53 ` [PATCH v2 1/3] SearchIdx.pm: Make indexing search positions optional Eric W. Biederman
` (4 more replies)
0 siblings, 5 replies; 17+ messages in thread
From: Eric W. Biederman @ 2018-07-18 16:52 UTC (permalink / raw)
To: Eric Wong; +Cc: meta
This is my respin of these patches. I have used the levels:
full, medium, basic.
I think basic conveys the message that it is ok to run with and you can
expect most things to work, better than minimal where it feels like
you don't know what will fail.
I have tweaked the reindex tests to run with all 3 different levels
so at least these code paths get exercised.
Eric W. Biederman (3):
SearchIdx.pm: Make indexing search positions optional
SearchIdx: Add the mechanism for making all Xapian indexing optional
SearchIdx: Allow the amount of indexing be configured
lib/PublicInbox/Config.pm | 2 +-
lib/PublicInbox/SearchIdx.pm | 256 +++++++++++++++++++++++--------------------
t/v1reindex.t | 43 +++++++-
t/v2reindex.t | 40 +++++++
4 files changed, 220 insertions(+), 121 deletions(-)
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH v2 1/3] SearchIdx.pm: Make indexing search positions optional
2018-07-18 16:52 ` [PATCH v2 1/3] Making the search indexes optional Eric W. Biederman
@ 2018-07-18 16:53 ` Eric W. Biederman
2018-07-18 16:53 ` [PATCH v2 2/3] SearchIdx: Add the mechanism for making all Xapian indexing optional Eric W. Biederman
` (3 subsequent siblings)
4 siblings, 0 replies; 17+ messages in thread
From: Eric W. Biederman @ 2018-07-18 16:53 UTC (permalink / raw)
To: Eric Wong; +Cc: meta, Eric W. Biederman
About half the size of the Xapian search index turns out to be search
positions. The search positions are only used in a very narrow set of
queries. Make the search positions optional so people don't need to
pay the cost of queries they will never make.
This also makes public-inbox more approachable for light hacking as
generating all of the indexes is time consuming.
The way this is done is to add a method to SearchIdx called index_text
that wraps the call of the term generator method index_text. The new
index_text method takes care of calling both index_text and
increase_termpos (the two functions that are responsible for position
data).
Then index_users, index_diff_inc, index_old_diff_fn, index_diff,
index_body are made proper methods that calls the new index_text.
Callers of the new index_text are slightly simplified as they don't
need to call increase_termpos as well.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
lib/PublicInbox/SearchIdx.pm | 94 +++++++++++++++++++-----------------
1 file changed, 49 insertions(+), 45 deletions(-)
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 0e0796c12c12..b19618c71508 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -51,6 +51,7 @@ sub new {
my $git_dir = $mainrepo;
my ($altid, $git);
my $version = 1;
+ my $indexlevel = 'full';
if (ref $ibx) {
$mainrepo = $ibx->{mainrepo};
$altid = $ibx->{altid};
@@ -72,6 +73,7 @@ sub new {
git => $ibx->git,
-altid => $altid,
version => $version,
+ indexlevel => $indexlevel,
}, $class;
$ibx->umask_prepare;
if ($version == 1) {
@@ -118,34 +120,42 @@ sub add_val ($$$) {
$doc->add_value($col, $num);
}
+sub index_text ($$$$)
+{
+ my ($self, $field, $n, $text) = @_;
+ my $tg = $self->term_generator;
+
+ if ($self->{indexlevel} eq 'full') {
+ $tg->index_text($field, $n, $text);
+ $tg->increase_termpos;
+ } else {
+ $tg->index_text_without_positions($field, $n, $text);
+ }
+}
+
sub index_users ($$) {
- my ($tg, $smsg) = @_;
+ my ($self, $smsg) = @_;
my $from = $smsg->from;
my $to = $smsg->to;
my $cc = $smsg->cc;
- $tg->index_text($from, 1, 'A'); # A - author
- $tg->increase_termpos;
- $tg->index_text($to, 1, 'XTO') if $to ne '';
- $tg->increase_termpos;
- $tg->index_text($cc, 1, 'XCC') if $cc ne '';
- $tg->increase_termpos;
+ $self->index_text($from, 1, 'A'); # A - author
+ $self->index_text($to, 1, 'XTO') if $to ne '';
+ $self->index_text($cc, 1, 'XCC') if $cc ne '';
}
sub index_diff_inc ($$$$) {
- my ($tg, $text, $pfx, $xnq) = @_;
+ my ($self, $text, $pfx, $xnq) = @_;
if (@$xnq) {
- $tg->index_text(join("\n", @$xnq), 1, 'XNQ');
- $tg->increase_termpos;
+ $self->index_text(join("\n", @$xnq), 1, 'XNQ');
@$xnq = ();
}
- $tg->index_text($text, 1, $pfx);
- $tg->increase_termpos;
+ $self->index_text($text, 1, $pfx);
}
sub index_old_diff_fn {
- my ($tg, $seen, $fa, $fb, $xnq) = @_;
+ my ($self, $seen, $fa, $fb, $xnq) = @_;
# no renames or space support for traditional diffs,
# find the number of leading common paths to strip:
@@ -156,7 +166,7 @@ sub index_old_diff_fn {
$fb = join('/', @fb);
if ($fa eq $fb) {
unless ($seen->{$fa}++) {
- index_diff_inc($tg, $fa, 'XDFN', $xnq);
+ $self->index_diff_inc($fa, 'XDFN', $xnq);
}
return 1;
}
@@ -167,22 +177,22 @@ sub index_old_diff_fn {
}
sub index_diff ($$$) {
- my ($tg, $lines, $doc) = @_;
+ my ($self, $lines, $doc) = @_;
my %seen;
my $in_diff;
my @xnq;
my $xnq = \@xnq;
foreach (@$lines) {
if ($in_diff && s/^ //) { # diff context
- index_diff_inc($tg, $_, 'XDFCTX', $xnq);
+ $self->index_diff_inc($_, 'XDFCTX', $xnq);
} elsif (/^-- $/) { # email signature begins
$in_diff = undef;
} elsif (m!^diff --git ("?a/.+) ("?b/.+)\z!) {
my ($fa, $fb) = ($1, $2);
my $fn = (split('/', git_unquote($fa), 2))[1];
- $seen{$fn}++ or index_diff_inc($tg, $fn, 'XDFN', $xnq);
+ $seen{$fn}++ or $self->index_diff_inc($fn, 'XDFN', $xnq);
$fn = (split('/', git_unquote($fb), 2))[1];
- $seen{$fn}++ or index_diff_inc($tg, $fn, 'XDFN', $xnq);
+ $seen{$fn}++ or $self->index_diff_inc($fn, 'XDFN', $xnq);
$in_diff = 1;
# traditional diff:
} elsif (m/^diff -(.+) (\S+) (\S+)$/) {
@@ -190,26 +200,26 @@ sub index_diff ($$$) {
push @xnq, $_;
# only support unified:
next unless $opt =~ /[uU]/;
- $in_diff = index_old_diff_fn($tg, \%seen, $fa, $fb,
+ $in_diff = $self->index_old_diff_fn(\%seen, $fa, $fb,
$xnq);
} elsif (m!^--- ("?a/.+)!) {
my $fn = (split('/', git_unquote($1), 2))[1];
- $seen{$fn}++ or index_diff_inc($tg, $fn, 'XDFN', $xnq);
+ $seen{$fn}++ or $self->index_diff_inc($fn, 'XDFN', $xnq);
$in_diff = 1;
} elsif (m!^\+\+\+ ("?b/.+)!) {
my $fn = (split('/', git_unquote($1), 2))[1];
- $seen{$fn}++ or index_diff_inc($tg, $fn, 'XDFN', $xnq);
+ $seen{$fn}++ or $self->index_diff_inc($fn, 'XDFN', $xnq);
$in_diff = 1;
} elsif (/^--- (\S+)/) {
$in_diff = $1;
push @xnq, $_;
} elsif (defined $in_diff && /^\+\+\+ (\S+)/) {
- $in_diff = index_old_diff_fn($tg, \%seen, $in_diff, $1,
+ $in_diff = $self->index_old_diff_fn(\%seen, $in_diff, $1,
$xnq);
} elsif ($in_diff && s/^\+//) { # diff added
- index_diff_inc($tg, $_, 'XDFB', $xnq);
+ $self->index_diff_inc($_, 'XDFB', $xnq);
} elsif ($in_diff && s/^-//) { # diff removed
- index_diff_inc($tg, $_, 'XDFA', $xnq);
+ $self->index_diff_inc($_, 'XDFA', $xnq);
} elsif (m!^index ([a-f0-9]+)\.\.([a-f0-9]+)!) {
my ($ba, $bb) = ($1, $2);
index_git_blob_id($doc, 'XDFPRE', $ba);
@@ -219,7 +229,7 @@ sub index_diff ($$$) {
# traditional diff w/o -p
} elsif (/^@@ (?:\S+) (?:\S+) @@\s*(\S+.*)$/) {
# hunk header context
- index_diff_inc($tg, $1, 'XDFHH', $xnq);
+ $self->index_diff_inc($1, 'XDFHH', $xnq);
# ignore the following lines:
} elsif (/^(?:dis)similarity index/ ||
/^(?:old|new) mode/ ||
@@ -238,25 +248,23 @@ sub index_diff ($$$) {
}
}
- $tg->index_text(join("\n", @xnq), 1, 'XNQ');
- $tg->increase_termpos;
+ $self->index_text(join("\n", @xnq), 1, 'XNQ');
}
sub index_body ($$$) {
- my ($tg, $lines, $doc) = @_;
+ my ($self, $lines, $doc) = @_;
my $txt = join("\n", @$lines);
if ($doc) {
# does it look like a diff?
if ($txt =~ /^(?:diff|---|\+\+\+) /ms) {
$txt = undef;
- index_diff($tg, $lines, $doc);
+ $self->index_diff($lines, $doc);
} else {
- $tg->index_text($txt, 1, 'XNQ');
+ $self->index_text($txt, 1, 'XNQ');
}
} else {
- $tg->index_text($txt, 0, 'XQUOT');
+ $self->index_text($txt, 0, 'XQUOT');
}
- $tg->increase_termpos;
@$lines = ();
}
@@ -284,18 +292,15 @@ sub add_message {
my $tg = $self->term_generator;
$tg->set_document($doc);
- $tg->index_text($subj, 1, 'S') if $subj;
- $tg->increase_termpos;
-
- index_users($tg, $smsg);
+ $self->index_text($subj, 1, 'S') if $subj;
+ $self->index_users($smsg);
msg_iter($mime, sub {
my ($part, $depth, @idx) = @{$_[0]};
my $ct = $part->content_type || 'text/plain';
my $fn = $part->filename;
if (defined $fn && $fn ne '') {
- $tg->index_text($fn, 1, 'XFN');
- $tg->increase_termpos;
+ $self->index_text($fn, 1, 'XFN');
}
return if $ct =~ m!\btext/x?html\b!i;
@@ -318,27 +323,26 @@ sub add_message {
my @lines = split(/\n/, $body);
while (defined(my $l = shift @lines)) {
if ($l =~ /^>/) {
- index_body($tg, \@orig, $doc) if @orig;
+ $self->index_body(\@orig, $doc) if @orig;
push @quot, $l;
} else {
- index_body($tg, \@quot, 0) if @quot;
+ $self->index_body(\@quot, 0) if @quot;
push @orig, $l;
}
}
- index_body($tg, \@quot, 0) if @quot;
- index_body($tg, \@orig, $doc) if @orig;
+ $self->index_body(\@quot, 0) if @quot;
+ $self->index_body(\@orig, $doc) if @orig;
});
foreach my $mid (@$mids) {
- $tg->index_text($mid, 1, 'XM');
+ $self->index_text($mid, 1, 'XM');
# because too many Message-IDs are prefixed with
# "Pine.LNX."...
if ($mid =~ /\w{12,}/) {
my @long = ($mid =~ /(\w{3,}+)/g);
- $tg->index_text(join(' ', @long), 1, 'XM');
+ $self->index_text(join(' ', @long), 1, 'XM');
}
- $tg->increase_termpos;
}
$smsg->{to} = $smsg->{cc} = '';
PublicInbox::OverIdx::parse_references($smsg, $mid0, $mids);
--
2.17.1
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH v2 2/3] SearchIdx: Add the mechanism for making all Xapian indexing optional
2018-07-18 16:52 ` [PATCH v2 1/3] Making the search indexes optional Eric W. Biederman
2018-07-18 16:53 ` [PATCH v2 1/3] SearchIdx.pm: Make indexing search positions optional Eric W. Biederman
@ 2018-07-18 16:53 ` Eric W. Biederman
2018-07-18 16:53 ` [PATCH v2 3/3] SearchIdx: Allow the amount of indexing be configured Eric W. Biederman
` (2 subsequent siblings)
4 siblings, 0 replies; 17+ messages in thread
From: Eric W. Biederman @ 2018-07-18 16:53 UTC (permalink / raw)
To: Eric Wong; +Cc: meta, Eric W. Biederman
Create a new method add_xapian that holds all of the code to create
Xapian indexes. The creation of this method simpliy involved
idenitifying the relevant code and moving it from add_message.
A call is added to add_xapian from add_message to keep everything
working as it currently does. The new call is made conditional upon
index levels of 'full' and 'medium'. The index levels that index
positions and terms the two things public-inbox uses Xapian to index.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
lib/PublicInbox/SearchIdx.pm | 172 ++++++++++++++++++-----------------
1 file changed, 89 insertions(+), 83 deletions(-)
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index b19618c71508..8978914ab087 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -268,10 +268,95 @@ sub index_body ($$$) {
@$lines = ();
}
+sub add_xapian ($$$$$) {
+ my ($self, $mime, $num, $oid, $mids, $mid0) = @_;
+ my $smsg = PublicInbox::SearchMsg->new($mime);
+ my $doc = $smsg->{doc};
+ my $subj = $smsg->subject;
+ add_val($doc, PublicInbox::Search::TS(), $smsg->ts);
+ my @ds = gmtime($smsg->ds);
+ my $yyyymmdd = strftime('%Y%m%d', @ds);
+ add_val($doc, PublicInbox::Search::YYYYMMDD(), $yyyymmdd);
+ my $dt = strftime('%Y%m%d%H%M%S', @ds);
+ add_val($doc, PublicInbox::Search::DT(), $dt);
+
+ my $tg = $self->term_generator;
+
+ $tg->set_document($doc);
+ $self->index_text($subj, 1, 'S') if $subj;
+ $self->index_users($smsg);
+
+ msg_iter($mime, sub {
+ my ($part, $depth, @idx) = @{$_[0]};
+ my $ct = $part->content_type || 'text/plain';
+ my $fn = $part->filename;
+ if (defined $fn && $fn ne '') {
+ $self->index_text($fn, 1, 'XFN');
+ }
+
+ return if $ct =~ m!\btext/x?html\b!i;
+
+ my $s = eval { $part->body_str };
+ if ($@) {
+ if ($ct =~ m!\btext/plain\b!i) {
+ # Try to assume UTF-8 because Alpine
+ # seems to do wacky things and set
+ # charset=X-UNKNOWN
+ $part->charset_set('UTF-8');
+ $s = eval { $part->body_str };
+ $s = $part->body if $@;
+ }
+ }
+ defined $s or return;
+
+ my (@orig, @quot);
+ my $body = $part->body;
+ my @lines = split(/\n/, $body);
+ while (defined(my $l = shift @lines)) {
+ if ($l =~ /^>/) {
+ $self->index_body(\@orig, $doc) if @orig;
+ push @quot, $l;
+ } else {
+ $self->index_body(\@quot, 0) if @quot;
+ push @orig, $l;
+ }
+ }
+ $self->index_body(\@quot, 0) if @quot;
+ $self->index_body(\@orig, $doc) if @orig;
+ });
+
+ foreach my $mid (@$mids) {
+ $self->index_text($mid, 1, 'XM');
+
+ # because too many Message-IDs are prefixed with
+ # "Pine.LNX."...
+ if ($mid =~ /\w{12,}/) {
+ my @long = ($mid =~ /(\w{3,}+)/g);
+ $self->index_text(join(' ', @long), 1, 'XM');
+ }
+ }
+ $smsg->{to} = $smsg->{cc} = '';
+ PublicInbox::OverIdx::parse_references($smsg, $mid0, $mids);
+ my $data = $smsg->to_doc_data($oid, $mid0);
+ $doc->set_data($data);
+ if (my $altid = $self->{-altid}) {
+ foreach my $alt (@$altid) {
+ my $pfx = $alt->{xprefix};
+ foreach my $mid (@$mids) {
+ my $id = $alt->mid2alt($mid);
+ next unless defined $id;
+ $doc->add_boolean_term($pfx . $id);
+ }
+ }
+ }
+ $doc->add_boolean_term('Q' . $_) foreach @$mids;
+ $self->{xdb}->replace_document($num, $doc);
+}
+
sub add_message {
# mime = Email::MIME object
my ($self, $mime, $bytes, $num, $oid, $mid0) = @_;
- my $doc_id;
+ my $xapianlevels = qr/\A(?:full|medium)\z/;
my $mids = mids($mime->header_obj);
$mid0 = $mids->[0] unless defined $mid0; # v1 compatibility
unless (defined $num) { # v1
@@ -279,98 +364,19 @@ sub add_message {
$num = index_mm($self, $mime);
}
eval {
- my $smsg = PublicInbox::SearchMsg->new($mime);
- my $doc = $smsg->{doc};
- my $subj = $smsg->subject;
- add_val($doc, PublicInbox::Search::TS(), $smsg->ts);
- my @ds = gmtime($smsg->ds);
- my $yyyymmdd = strftime('%Y%m%d', @ds);
- add_val($doc, PublicInbox::Search::YYYYMMDD(), $yyyymmdd);
- my $dt = strftime('%Y%m%d%H%M%S', @ds);
- add_val($doc, PublicInbox::Search::DT(), $dt);
-
- my $tg = $self->term_generator;
-
- $tg->set_document($doc);
- $self->index_text($subj, 1, 'S') if $subj;
- $self->index_users($smsg);
-
- msg_iter($mime, sub {
- my ($part, $depth, @idx) = @{$_[0]};
- my $ct = $part->content_type || 'text/plain';
- my $fn = $part->filename;
- if (defined $fn && $fn ne '') {
- $self->index_text($fn, 1, 'XFN');
- }
-
- return if $ct =~ m!\btext/x?html\b!i;
-
- my $s = eval { $part->body_str };
- if ($@) {
- if ($ct =~ m!\btext/plain\b!i) {
- # Try to assume UTF-8 because Alpine
- # seems to do wacky things and set
- # charset=X-UNKNOWN
- $part->charset_set('UTF-8');
- $s = eval { $part->body_str };
- $s = $part->body if $@;
- }
- }
- defined $s or return;
-
- my (@orig, @quot);
- my $body = $part->body;
- my @lines = split(/\n/, $body);
- while (defined(my $l = shift @lines)) {
- if ($l =~ /^>/) {
- $self->index_body(\@orig, $doc) if @orig;
- push @quot, $l;
- } else {
- $self->index_body(\@quot, 0) if @quot;
- push @orig, $l;
- }
- }
- $self->index_body(\@quot, 0) if @quot;
- $self->index_body(\@orig, $doc) if @orig;
- });
-
- foreach my $mid (@$mids) {
- $self->index_text($mid, 1, 'XM');
-
- # because too many Message-IDs are prefixed with
- # "Pine.LNX."...
- if ($mid =~ /\w{12,}/) {
- my @long = ($mid =~ /(\w{3,}+)/g);
- $self->index_text(join(' ', @long), 1, 'XM');
- }
+ if ($self->{indexlevel} =~ $xapianlevels) {
+ $self->add_xapian($mime, $num, $oid, $mids, $mid0)
}
- $smsg->{to} = $smsg->{cc} = '';
- PublicInbox::OverIdx::parse_references($smsg, $mid0, $mids);
- my $data = $smsg->to_doc_data($oid, $mid0);
- $doc->set_data($data);
- if (my $altid = $self->{-altid}) {
- foreach my $alt (@$altid) {
- my $pfx = $alt->{xprefix};
- foreach my $mid (@$mids) {
- my $id = $alt->mid2alt($mid);
- next unless defined $id;
- $doc->add_boolean_term($pfx . $id);
- }
- }
- }
-
if (my $over = $self->{over}) {
$over->add_overview($mime, $bytes, $num, $oid, $mid0);
}
- $doc->add_boolean_term('Q' . $_) foreach @$mids;
- $self->{xdb}->replace_document($doc_id = $num, $doc);
};
if ($@) {
warn "failed to index message <".join('> <',@$mids).">: $@\n";
return undef;
}
- $doc_id;
+ $num;
}
# returns begin and end PostingIterator
--
2.17.1
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH v2 3/3] SearchIdx: Allow the amount of indexing be configured
2018-07-18 16:52 ` [PATCH v2 1/3] Making the search indexes optional Eric W. Biederman
2018-07-18 16:53 ` [PATCH v2 1/3] SearchIdx.pm: Make indexing search positions optional Eric W. Biederman
2018-07-18 16:53 ` [PATCH v2 2/3] SearchIdx: Add the mechanism for making all Xapian indexing optional Eric W. Biederman
@ 2018-07-18 16:53 ` Eric W. Biederman
2018-07-19 21:51 ` [PATCH] tests: fixup indexlevel setting in tests Eric Wong
2018-07-18 17:32 ` [PATCH v2 3/4] public-inbox-init: Initialize indexlevel Eric W. Biederman
2018-07-19 3:52 ` [PATCH v2 1/3] Making the search indexes optional Eric Wong
4 siblings, 1 reply; 17+ messages in thread
From: Eric W. Biederman @ 2018-07-18 16:53 UTC (permalink / raw)
To: Eric Wong; +Cc: meta, Eric W. Biederman
This adds a new inbox configuration option 'indexlevel' that can take
the values 'full', 'medium', and 'basic'.
When set to 'full' everything is indexed including the positions
of all terms.
When set to 'medium' everything except the positions of terms is
indexed.
When set to 'basic' terms and positions are not indexed. Just the
Overview database for NNTP is created. Which is still quite good and
allows searching for messages by Message-ID. But there are no indexes to support
searching inside the email messages themselves.
Update the reindex tests to exercise the full medium and basic code paths
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
lib/PublicInbox/Config.pm | 2 +-
lib/PublicInbox/SearchIdx.pm | 8 +++++++
t/v1reindex.t | 43 +++++++++++++++++++++++++++++++++++-
t/v2reindex.t | 40 +++++++++++++++++++++++++++++++++
4 files changed, 91 insertions(+), 2 deletions(-)
diff --git a/lib/PublicInbox/Config.pm b/lib/PublicInbox/Config.pm
index 289c36a6ccf1..78586560cf0f 100644
--- a/lib/PublicInbox/Config.pm
+++ b/lib/PublicInbox/Config.pm
@@ -158,7 +158,7 @@ sub _fill {
foreach my $k (qw(mainrepo filter url newsgroup
infourl watch watchheader httpbackendmax
- replyto feedmax nntpserver)) {
+ replyto feedmax nntpserver indexlevel)) {
my $v = $self->{"$pfx.$k"};
$rv->{$k} = $v if defined $v;
}
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 8978914ab087..04e853069f06 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -47,6 +47,7 @@ sub git_unquote ($) {
sub new {
my ($class, $ibx, $creat, $part) = @_;
+ my $levels = qr/\A(?:full|medium|basic)\z/;
my $mainrepo = $ibx; # for "public-inbox-index" w/o entry in config
my $git_dir = $mainrepo;
my ($altid, $git);
@@ -62,6 +63,13 @@ sub new {
PublicInbox::AltId->new($ibx, $_);
} @$altid ];
}
+ if ($ibx->{indexlevel}) {
+ if ($ibx->{indexlevel} =~ $levels) {
+ $indexlevel = $ibx->{indexlevel};
+ } else {
+ die("Invalid indexlevel $ibx->{indexlevel}\n");
+ }
+ }
} else { # v1
$ibx = { mainrepo => $git_dir, version => 1 };
}
diff --git a/t/v1reindex.t b/t/v1reindex.t
index 0df36d3fee89..ff32750fc3c3 100644
--- a/t/v1reindex.t
+++ b/t/v1reindex.t
@@ -20,6 +20,7 @@ my $ibx_config = {
mainrepo => $mainrepo,
name => 'test-v1reindex',
-primary_address => 'test@example.com',
+ indexlevel => 'full',
};
my $ibx = PublicInbox::Inbox->new($ibx_config);
my $mime = PublicInbox::MIME->create(
@@ -74,15 +75,32 @@ is_deeply([ $ibx->mm->minmax ], $minmax, 'minmax unchanged');
ok(unlink "$mainrepo/public-inbox/msgmap.sqlite3", 'remove msgmap');
remove_tree($xap);
+ok(!-d $xap, 'Xapian directories removed again');
+
$rw = PublicInbox::SearchIdx->new($ibx, 1);
+{
+ my @warn;
+ local $SIG{__WARN__} = sub { push @warn, @_ };
+ eval { $rw->index_sync({reindex => 1}) };
+ is($@, '', 'no error from reindexing without msgmap');
+ is(scalar(@warn), 0, 'no warnings from reindexing');
+ $im->done;
+ ok(-d $xap, 'Xapian directories recreated');
+ delete $ibx->{mm};
+ is_deeply([ $ibx->mm->minmax ], $minmax, 'minmax unchanged');
+}
+ok(unlink "$mainrepo/public-inbox/msgmap.sqlite3", 'remove msgmap');
+remove_tree($xap);
ok(!-d $xap, 'Xapian directories removed again');
+
+$rw = PublicInbox::SearchIdx->new($ibx, 1);
{
my @warn;
local $SIG{__WARN__} = sub { push @warn, @_ };
eval { $rw->index_sync({reindex => 1}) };
is($@, '', 'no error from reindexing without msgmap');
- is(scalar(@warn), 0, 'no warnings from reindexing');
+ is_deeply(\@warn, [], 'no warnings');
$im->done;
ok(-d $xap, 'Xapian directories recreated');
delete $ibx->{mm};
@@ -91,9 +109,31 @@ ok(!-d $xap, 'Xapian directories removed again');
ok(unlink "$mainrepo/public-inbox/msgmap.sqlite3", 'remove msgmap');
remove_tree($xap);
+ok(!-d $xap, 'Xapian directories removed again');
+
+$ibx_config->{index_level} = 'medium';
+$ibx = PublicInbox::Inbox->new($ibx_config);
$rw = PublicInbox::SearchIdx->new($ibx, 1);
+{
+ my @warn;
+ local $SIG{__WARN__} = sub { push @warn, @_ };
+ eval { $rw->index_sync({reindex => 1}) };
+ is($@, '', 'no error from reindexing without msgmap');
+ is_deeply(\@warn, [], 'no warnings');
+ $im->done;
+ ok(-d $xap, 'Xapian directories recreated');
+ delete $ibx->{mm};
+ is_deeply([ $ibx->mm->minmax ], $minmax, 'minmax unchanged');
+}
+
+ok(unlink "$mainrepo/public-inbox/msgmap.sqlite3", 'remove msgmap');
+remove_tree($xap);
ok(!-d $xap, 'Xapian directories removed again');
+
+$ibx_config->{index_level} = 'basic';
+$ibx = PublicInbox::Inbox->new($ibx_config);
+$rw = PublicInbox::SearchIdx->new($ibx, 1);
{
my @warn;
local $SIG{__WARN__} = sub { push @warn, @_ };
@@ -106,4 +146,5 @@ ok(!-d $xap, 'Xapian directories removed again');
is_deeply([ $ibx->mm->minmax ], $minmax, 'minmax unchanged');
}
+
done_testing();
diff --git a/t/v2reindex.t b/t/v2reindex.t
index 8af30991f858..20903967bf25 100644
--- a/t/v2reindex.t
+++ b/t/v2reindex.t
@@ -19,6 +19,7 @@ my $ibx_config = {
name => 'test-v2writable',
version => 2,
-primary_address => 'test@example.com',
+ indexlevel => 'full',
};
my $ibx = PublicInbox::Inbox->new($ibx_config);
my $mime = PublicInbox::MIME->create(
@@ -95,4 +96,43 @@ ok(!-d $xap, 'Xapian directories removed again');
is_deeply([ $ibx->mm->minmax ], $minmax, 'minmax unchanged');
}
+ok(unlink "$mainrepo/msgmap.sqlite3", 'remove msgmap');
+remove_tree($xap);
+ok(!-d $xap, 'Xapian directories removed again');
+
+$ibx_config->{index_level} = 'medium';
+$ibx = PublicInbox::Inbox->new($ibx_config);
+$im = PublicInbox::V2Writable->new($ibx);
+{
+ my @warn;
+ local $SIG{__WARN__} = sub { push @warn, @_ };
+ eval { $im->index_sync({reindex => 1}) };
+ is($@, '', 'no error from reindexing without msgmap');
+ is_deeply(\@warn, [], 'no warnings');
+ $im->done;
+ ok(-d $xap, 'Xapian directories recreated');
+ delete $ibx->{mm};
+ is_deeply([ $ibx->mm->minmax ], $minmax, 'minmax unchanged');
+}
+
+
+ok(unlink "$mainrepo/msgmap.sqlite3", 'remove msgmap');
+remove_tree($xap);
+ok(!-d $xap, 'Xapian directories removed again');
+
+$ibx_config->{index_level} = 'basic';
+$ibx = PublicInbox::Inbox->new($ibx_config);
+$im = PublicInbox::V2Writable->new($ibx);
+{
+ my @warn;
+ local $SIG{__WARN__} = sub { push @warn, @_ };
+ eval { $im->index_sync({reindex => 1}) };
+ is($@, '', 'no error from reindexing without msgmap');
+ is_deeply(\@warn, [], 'no warnings');
+ $im->done;
+ ok(-d $xap, 'Xapian directories recreated');
+ delete $ibx->{mm};
+ is_deeply([ $ibx->mm->minmax ], $minmax, 'minmax unchanged');
+}
+
done_testing();
--
2.17.1
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH v2 3/4] public-inbox-init: Initialize indexlevel
2018-07-18 16:52 ` [PATCH v2 1/3] Making the search indexes optional Eric W. Biederman
` (2 preceding siblings ...)
2018-07-18 16:53 ` [PATCH v2 3/3] SearchIdx: Allow the amount of indexing be configured Eric W. Biederman
@ 2018-07-18 17:32 ` Eric W. Biederman
2018-07-19 3:52 ` [PATCH v2 1/3] Making the search indexes optional Eric Wong
4 siblings, 0 replies; 17+ messages in thread
From: Eric W. Biederman @ 2018-07-18 17:32 UTC (permalink / raw)
To: Eric Wong; +Cc: meta
If indexlevel is specified on the command line prefer that.
If indexlevel is specified in the config file prefer that.
If indexlevel is not specified anywhere default to full.
This should make indexlevel somewhat approachable.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
I believe this is the other piece of the user interface needed
to make indexlevel accessible.
script/public-inbox-init | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)
diff --git a/script/public-inbox-init b/script/public-inbox-init
index 3ef6c3bdba2d..5e961c803203 100755
--- a/script/public-inbox-init
+++ b/script/public-inbox-init
@@ -16,7 +16,10 @@ use Cwd qw/abs_path/;
sub x { system(@_) and die join(' ', @_). " failed: $?\n" }
sub usage { print STDERR "Usage: $usage\n"; exit 1 }
my $version = undef;
-my %opts = ( 'V|version=i' => \$version );
+my $indexlevel = undef;
+my %opts = ( 'V|version=i' => \$version,
+ 'L|indexlevel=s' => \$indexlevel,
+);
GetOptions(%opts) or usage();
my $name = shift @ARGV or usage();
my $mainrepo = shift @ARGV or usage();
@@ -64,8 +67,16 @@ if (-e $pi_config) {
}
exit(1) if $conflict;
+
+ my $ibx = $cfg->lookup_name($name);
+ if ($ibx) {
+ if (!defined($indexlevel) && $ibx->{indexlevel}) {
+ $indexlevel = $ibx->{indexlevel};
+ }
+ }
}
close $fh or die "failed to close $pi_config_tmp: $!\n";
+$indexlevel ||= 'full';
my $pfx = "publicinbox.$name";
my @x = (qw/git config/, "--file=$pi_config_tmp");
@@ -114,6 +125,7 @@ foreach my $addr (@address) {
}
x(@x, "$pfx.url", $http_url);
x(@x, "$pfx.mainrepo", $mainrepo);
+x(@x, "$pfx.indexlevel", $indexlevel);
rename $pi_config_tmp, $pi_config or
die "failed to rename `$pi_config_tmp' to `$pi_config': $!\n";
--
2.17.1
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH v2 1/3] Making the search indexes optional
2018-07-18 16:52 ` [PATCH v2 1/3] Making the search indexes optional Eric W. Biederman
` (3 preceding siblings ...)
2018-07-18 17:32 ` [PATCH v2 3/4] public-inbox-init: Initialize indexlevel Eric W. Biederman
@ 2018-07-19 3:52 ` Eric Wong
2018-07-19 18:47 ` Eric W. Biederman
4 siblings, 1 reply; 17+ messages in thread
From: Eric Wong @ 2018-07-19 3:52 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: meta
"Eric W. Biederman" <ebiederm@xmission.com> wrote:
>
> This is my respin of these patches. I have used the levels:
> full, medium, basic.
>
> I think basic conveys the message that it is ok to run with and you can
> expect most things to work, better than minimal where it feels like
> you don't know what will fail.
Agreed.
> I have tweaked the reindex tests to run with all 3 different levels
> so at least these code paths get exercised.
Thanks, all four pushed to 'master' on https://public-inbox.org/
along with my flush threshold env check.
I think being able to transition from an existing "basic" to
(full|medium) can be straightforward, too, with the untested
patch below.
But most other transitions between levels will require an expensive
--reindex. (full|medium) => basic would be easy, too, but
it'll be up to the user to remove the Xapian files.
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index bb60506..620f24d 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -714,7 +714,7 @@ sub _index_sync {
}
$dbh->commit;
}
- if ($mkey && $newest) {
+ if ($mkey && $newest && $self->{indexlevel} ne 'basic') {
my $cur = $xdb->get_metadata($mkey);
if (need_update($self, $cur, $newest)) {
$xdb->set_metadata($mkey, $newest);
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH v2 1/3] Making the search indexes optional
2018-07-19 3:52 ` [PATCH v2 1/3] Making the search indexes optional Eric Wong
@ 2018-07-19 18:47 ` Eric W. Biederman
2018-07-20 6:58 ` [PATCH] v1: allow upgrading indexlevel=basic to 'medium' or 'full' Eric Wong
0 siblings, 1 reply; 17+ messages in thread
From: Eric W. Biederman @ 2018-07-19 18:47 UTC (permalink / raw)
To: Eric Wong; +Cc: meta
Eric Wong <e@80x24.org> writes:
> "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>>
>> This is my respin of these patches. I have used the levels:
>> full, medium, basic.
>>
>> I think basic conveys the message that it is ok to run with and you can
>> expect most things to work, better than minimal where it feels like
>> you don't know what will fail.
>
> Agreed.
>
>> I have tweaked the reindex tests to run with all 3 different levels
>> so at least these code paths get exercised.
>
> Thanks, all four pushed to 'master' on https://public-inbox.org/
> along with my flush threshold env check.
>
> I think being able to transition from an existing "basic" to
> (full|medium) can be straightforward, too, with the untested
> patch below.
>
> But most other transitions between levels will require an expensive
> --reindex. (full|medium) => basic would be easy, too, but
> it'll be up to the user to remove the Xapian files.
I would still use "$self->{indexlevel} =~ m/(full|medium)/" instead of
ne 'basic' just in case we start recognizing another level.
Otherwise your change seems to make sense. I have not really dug into
the reindex case where the point is to pick up things that were
previously missed. I can definitely see that code should be guarded
with an indexlevel check.
Eric
> diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
> index bb60506..620f24d 100644
> --- a/lib/PublicInbox/SearchIdx.pm
> +++ b/lib/PublicInbox/SearchIdx.pm
> @@ -714,7 +714,7 @@ sub _index_sync {
> }
> $dbh->commit;
> }
> - if ($mkey && $newest) {
> + if ($mkey && $newest && $self->{indexlevel} ne 'basic') {
> my $cur = $xdb->get_metadata($mkey);
> if (need_update($self, $cur, $newest)) {
> $xdb->set_metadata($mkey, $newest);
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH] tests: fixup indexlevel setting in tests
2018-07-18 16:53 ` [PATCH v2 3/3] SearchIdx: Allow the amount of indexing be configured Eric W. Biederman
@ 2018-07-19 21:51 ` Eric Wong
0 siblings, 0 replies; 17+ messages in thread
From: Eric Wong @ 2018-07-19 21:51 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: meta
"Eric W. Biederman" <ebiederm@xmission.com> wrote:
> +$ibx_config->{index_level} = 'medium';
Oops, I missed that underscore in my initial review.
(and I dislike the lack of '_' in git-config naming conventions, even).
----------8<--------
Subject: [PATCH] tests: fixup indexlevel setting in tests
The correct field is underscore-less for consistency with
git-config naming conventions. While we're at it, beef up
the v2 tests with actual size checks, too.
I also noticed phrase searching still seems to work for
the limited test case, so I left it documented; but the
size checking verifies the space savings.
---
t/v1reindex.t | 4 ++--
t/v2reindex.t | 26 +++++++++++++++++++++++---
2 files changed, 25 insertions(+), 5 deletions(-)
diff --git a/t/v1reindex.t b/t/v1reindex.t
index ff32750..d97938d 100644
--- a/t/v1reindex.t
+++ b/t/v1reindex.t
@@ -111,7 +111,7 @@ ok(unlink "$mainrepo/public-inbox/msgmap.sqlite3", 'remove msgmap');
remove_tree($xap);
ok(!-d $xap, 'Xapian directories removed again');
-$ibx_config->{index_level} = 'medium';
+$ibx_config->{indexlevel} = 'medium';
$ibx = PublicInbox::Inbox->new($ibx_config);
$rw = PublicInbox::SearchIdx->new($ibx, 1);
{
@@ -131,7 +131,7 @@ ok(unlink "$mainrepo/public-inbox/msgmap.sqlite3", 'remove msgmap');
remove_tree($xap);
ok(!-d $xap, 'Xapian directories removed again');
-$ibx_config->{index_level} = 'basic';
+$ibx_config->{indexlevel} = 'basic';
$ibx = PublicInbox::Inbox->new($ibx_config);
$rw = PublicInbox::SearchIdx->new($ibx, 1);
{
diff --git a/t/v2reindex.t b/t/v2reindex.t
index 2090396..1543309 100644
--- a/t/v2reindex.t
+++ b/t/v2reindex.t
@@ -81,6 +81,7 @@ ok(!-d $xap, 'Xapian directories removed again');
is_deeply([ $ibx->mm->minmax ], $minmax, 'minmax unchanged');
}
+my %sizes;
ok(unlink "$mainrepo/msgmap.sqlite3", 'remove msgmap');
remove_tree($xap);
ok(!-d $xap, 'Xapian directories removed again');
@@ -94,13 +95,16 @@ ok(!-d $xap, 'Xapian directories removed again');
ok(-d $xap, 'Xapian directories recreated');
delete $ibx->{mm};
is_deeply([ $ibx->mm->minmax ], $minmax, 'minmax unchanged');
+ my $mset = $ibx->search->query('"hello world"', {mset=>1});
+ isnt(0, $mset->size, "phrase search succeeds on indexlevel=full");
+ for (<"$xap/*/*">) { $sizes{$ibx->{indexlevel}} += -s _ if -f $_ }
}
ok(unlink "$mainrepo/msgmap.sqlite3", 'remove msgmap');
remove_tree($xap);
ok(!-d $xap, 'Xapian directories removed again');
-$ibx_config->{index_level} = 'medium';
+$ibx_config->{indexlevel} = 'medium';
$ibx = PublicInbox::Inbox->new($ibx_config);
$im = PublicInbox::V2Writable->new($ibx);
{
@@ -113,14 +117,26 @@ $im = PublicInbox::V2Writable->new($ibx);
ok(-d $xap, 'Xapian directories recreated');
delete $ibx->{mm};
is_deeply([ $ibx->mm->minmax ], $minmax, 'minmax unchanged');
-}
+ if (0) {
+ # not sure why, but Xapian seems to fallback to terms and
+ # phrase searches still work
+ delete $ibx->{search};
+ my $mset = $ibx->search->query('"hello world"', {mset=>1});
+ is(0, $mset->size, 'phrase search does not work on medium');
+ }
+
+ my $mset = $ibx->search->query('hello world', {mset=>1});
+ isnt(0, $mset->size, "normal search works on indexlevel=medium");
+ for (<"$xap/*/*">) { $sizes{$ibx->{indexlevel}} += -s _ if -f $_ }
+ ok($sizes{full} > $sizes{medium}, 'medium is smaller than full');
+}
ok(unlink "$mainrepo/msgmap.sqlite3", 'remove msgmap');
remove_tree($xap);
ok(!-d $xap, 'Xapian directories removed again');
-$ibx_config->{index_level} = 'basic';
+$ibx_config->{indexlevel} = 'basic';
$ibx = PublicInbox::Inbox->new($ibx_config);
$im = PublicInbox::V2Writable->new($ibx);
{
@@ -133,6 +149,10 @@ $im = PublicInbox::V2Writable->new($ibx);
ok(-d $xap, 'Xapian directories recreated');
delete $ibx->{mm};
is_deeply([ $ibx->mm->minmax ], $minmax, 'minmax unchanged');
+ my $mset = $ibx->search->query('hello', {mset=>1});
+ is(0, $mset->size, "search fails on indexlevel='basic'");
+ for (<"$xap/*/*">) { $sizes{$ibx->{indexlevel}} += -s _ if -f $_ }
+ ok($sizes{medium} > $sizes{basic}, 'basic is smaller than medium');
}
done_testing();
--
EW
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH] v1: allow upgrading indexlevel=basic to 'medium' or 'full'
2018-07-19 18:47 ` Eric W. Biederman
@ 2018-07-20 6:58 ` Eric Wong
0 siblings, 0 replies; 17+ messages in thread
From: Eric Wong @ 2018-07-20 6:58 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: meta
"Eric W. Biederman" <ebiederm@xmission.com> wrote:
> Eric Wong <e@80x24.org> writes:
> > I think being able to transition from an existing "basic" to
> > (full|medium) can be straightforward, too, with the untested
> > patch below.
> >
> > But most other transitions between levels will require an expensive
> > --reindex. (full|medium) => basic would be easy, too, but
> > it'll be up to the user to remove the Xapian files.
>
> I would still use "$self->{indexlevel} =~ m/(full|medium)/" instead of
> ne 'basic' just in case we start recognizing another level.
Good catch.
> Otherwise your change seems to make sense. I have not really dug into
> the reindex case where the point is to pick up things that were
> previously missed. I can definitely see that code should be guarded
> with an indexlevel check.
Unfortunately, this only works with v1 right now; and supporting
it for v2 would not be as easy... (I wouldn't consider it a priority,
either).
--------8<------
Subject: [PATCH] v1: allow upgrading indexlevel=basic to 'medium' or 'full'
For v1 repos, we don't need to write any metadata to Xapian
and changing from 'basic' to 'medium' or 'full' will work.
For v2, the metadata for indexing is stored in msgmap (because
the Xapian databases are partitioned for parallelism), so a
reindex is required.
---
lib/PublicInbox/SearchIdx.pm | 5 +++--
t/v1reindex.t | 21 ++++++++++++++++++++-
2 files changed, 23 insertions(+), 3 deletions(-)
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index bb60506..1d259a8 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -27,6 +27,8 @@ use constant {
DEBUG => !!$ENV{DEBUG},
};
+my $xapianlevels = qr/\A(?:full|medium)\z/;
+
my %GIT_ESC = (
a => "\a",
b => "\b",
@@ -365,7 +367,6 @@ sub add_xapian ($$$$$) {
sub add_message {
# mime = Email::MIME object
my ($self, $mime, $bytes, $num, $oid, $mid0) = @_;
- my $xapianlevels = qr/\A(?:full|medium)\z/;
my $mids = mids($mime->header_obj);
$mid0 = $mids->[0] unless defined $mid0; # v1 compatibility
unless (defined $num) { # v1
@@ -714,7 +715,7 @@ sub _index_sync {
}
$dbh->commit;
}
- if ($mkey && $newest) {
+ if ($mkey && $newest && $self->{indexlevel} =~ $xapianlevels) {
my $cur = $xdb->get_metadata($mkey);
if (need_update($self, $cur, $newest)) {
$xdb->set_metadata($mkey, $newest);
diff --git a/t/v1reindex.t b/t/v1reindex.t
index d97938d..75380f0 100644
--- a/t/v1reindex.t
+++ b/t/v1reindex.t
@@ -124,9 +124,10 @@ $rw = PublicInbox::SearchIdx->new($ibx, 1);
ok(-d $xap, 'Xapian directories recreated');
delete $ibx->{mm};
is_deeply([ $ibx->mm->minmax ], $minmax, 'minmax unchanged');
+ my $mset = $ibx->search->query('hello world', {mset=>1});
+ isnt(0, $mset->size, 'got Xapian search results');
}
-
ok(unlink "$mainrepo/public-inbox/msgmap.sqlite3", 'remove msgmap');
remove_tree($xap);
ok(!-d $xap, 'Xapian directories removed again');
@@ -144,7 +145,25 @@ $rw = PublicInbox::SearchIdx->new($ibx, 1);
ok(-d $xap, 'Xapian directories recreated');
delete $ibx->{mm};
is_deeply([ $ibx->mm->minmax ], $minmax, 'minmax unchanged');
+ my $mset = $ibx->search->reopen->query('hello world', {mset=>1});
+ is(0, $mset->size, "no Xapian search results");
}
+# upgrade existing basic to medium
+# note: changing indexlevels is not yet supported in v2,
+# and may not be without more effort
+$ibx_config->{indexlevel} = 'medium';
+$ibx = PublicInbox::Inbox->new($ibx_config);
+$rw = PublicInbox::SearchIdx->new($ibx, 1);
+# no removals
+{
+ my @warn;
+ local $SIG{__WARN__} = sub { push @warn, @_ };
+ eval { $rw->index_sync };
+ is($@, '', 'no error from indexing');
+ is_deeply(\@warn, [], 'no warnings');
+ my $mset = $ibx->search->reopen->query('hello world', {mset=>1});
+ isnt(0, $mset->size, 'search OK after basic -> medium');
+}
done_testing();
^ permalink raw reply related [flat|nested] 17+ messages in thread
end of thread, other threads:[~2018-07-20 6:58 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-07-17 23:27 [PATCH 0/3] Making the search indexes optional Eric W. Biederman
2018-07-17 23:30 ` [PATCH 1/3] SearchIdx.pm: Make indexing search positions optional Eric W. Biederman
2018-07-17 23:30 ` [PATCH 2/3] SearchIdx: Add the mechanism for making all Xapian indexing optional Eric W. Biederman
2018-07-17 23:30 ` [PATCH 3/3] SearchIdx: Allow the amount of indexing be configured Eric W. Biederman
2018-07-18 10:22 ` Eric Wong
2018-07-18 16:00 ` Eric W. Biederman
2018-07-18 16:31 ` Eric Wong
2018-07-18 16:52 ` [PATCH v2 1/3] Making the search indexes optional Eric W. Biederman
2018-07-18 16:53 ` [PATCH v2 1/3] SearchIdx.pm: Make indexing search positions optional Eric W. Biederman
2018-07-18 16:53 ` [PATCH v2 2/3] SearchIdx: Add the mechanism for making all Xapian indexing optional Eric W. Biederman
2018-07-18 16:53 ` [PATCH v2 3/3] SearchIdx: Allow the amount of indexing be configured Eric W. Biederman
2018-07-19 21:51 ` [PATCH] tests: fixup indexlevel setting in tests Eric Wong
2018-07-18 17:32 ` [PATCH v2 3/4] public-inbox-init: Initialize indexlevel Eric W. Biederman
2018-07-19 3:52 ` [PATCH v2 1/3] Making the search indexes optional Eric Wong
2018-07-19 18:47 ` Eric W. Biederman
2018-07-20 6:58 ` [PATCH] v1: allow upgrading indexlevel=basic to 'medium' or 'full' Eric Wong
2018-07-18 10:17 ` [PATCH 0/3] Making the search indexes optional Eric Wong
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).