unofficial mirror of meta@public-inbox.org
From: Eric Wong <e@80x24.org>
To: meta@public-inbox.org
Subject: [PATCH 51/52] extsearchidx: support --batch-size checkpoints
Date: Tue, 27 Oct 2020 07:54:52 +0000
Message-ID: <20201027075453.19163-52-e@80x24.org>
In-Reply-To: <20201027075453.19163-1-e@80x24.org>

This is needed to limit the RSS of indexing processes and to
ensure the data stored in over.sqlite3 and the Xapian DBs stays
consistent if indexing is interrupted.  Without checkpoints,
indexing lore causes shard workers to take several GB of memory
and thrash/OOM smaller systems.
---
 lib/PublicInbox/ExtSearchIdx.pm | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)
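
The byte accounting described in the commit message can be sketched
in isolation as below.  This is only an illustration under stated
assumptions: {transact_bytes} and {batch_bytes} are the field names
used by the patch, but account_bytes(), checkpoint(), the message
sizes and the 1000-byte limit are hypothetical, and resetting the
counter at checkpoint time is assumed rather than shown in this patch.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the indexer object; only the two field
# names are taken from the patch.
my $self = { transact_bytes => 0, batch_bytes => 1000 };
my $need_checkpoint = 0;
my @checkpoints; # index of the message after which a checkpoint ran

# Same accounting as check_batch_limit, minus the $req plumbing:
# accumulate bytes and raise the flag once the batch limit is reached.
sub account_bytes {
	my ($self, $bytes, $flag_ref) = @_;
	my $n = $self->{transact_bytes} += $bytes;
	$$flag_ref = 1 if $n >= $self->{batch_bytes};
}

# Hypothetical checkpoint: real code would commit the SQLite and
# Xapian transactions here; this one only records and resets state.
sub checkpoint {
	my ($self, $flag_ref, $i) = @_;
	push @checkpoints, $i;
	$self->{transact_bytes} = 0; # assumed: counter restarts per batch
	$$flag_ref = 0;
}

my @sizes = (400, 400, 400, 100, 900, 50); # made-up message sizes
for my $i (0..$#sizes) {
	account_bytes($self, $sizes[$i], \$need_checkpoint);
	checkpoint($self, \$need_checkpoint, $i) if $need_checkpoint;
}
print "checkpoints after messages: @checkpoints\n";
```

Passing the flag by reference mirrors ${$req->{need_checkpoint}} in
the patch: the code that adds a message only raises the flag, and the
caller driving the loop decides when it is safe to commit.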

diff --git a/lib/PublicInbox/ExtSearchIdx.pm b/lib/PublicInbox/ExtSearchIdx.pm
index 050c4252..9d576adb 100644
--- a/lib/PublicInbox/ExtSearchIdx.pm
+++ b/lib/PublicInbox/ExtSearchIdx.pm
@@ -106,6 +106,18 @@ sub is_bad_blob ($$$$) {
 	$size == 0 ? 1 : 0; # size == 0 means purged
 }
 
+sub check_batch_limit ($) {
+	my ($req) = @_;
+	my $self = $req->{self};
+	my $new_smsg = $req->{new_smsg};
+
+	# {raw_bytes} may be unset, so just use {bytes}
+	my $n = $self->{transact_bytes} += $new_smsg->{bytes};
+
+	# set flag for PublicInbox::V2Writable::index_todo:
+	${$req->{need_checkpoint}} = 1 if $n >= $self->{batch_bytes};
+}
+
 sub do_xpost ($$) {
 	my ($req, $smsg) = @_;
 	my $self = $req->{self};
@@ -118,6 +130,7 @@ sub do_xpost ($$) {
 		my $xnum = $req->{xnum};
 		$self->{oidx}->add_xref3($docid, $xnum, $oid, $xibx->eidx_key);
 		$idx->shard_add_eidx_info($docid, $oid, $xibx, $eml);
+		check_batch_limit($req);
 	} else { # 'd'
 		$self->{oidx}->remove_xref3($docid, $oid, $xibx->eidx_key);
 		$idx->shard_remove_eidx_info($docid, $oid, $xibx, $eml);
@@ -141,6 +154,7 @@ sub index_unseen ($) {
 	my $ibx = delete $req->{ibx} or die 'BUG: {ibx} unset';
 	$self->{oidx}->add_xref3($docid, $req->{xnum}, $oid, $ibx->eidx_key);
 	$idx->index_raw(undef, $eml, $new_smsg, $ibx);
+	check_batch_limit($req);
 }
 
 sub do_finalize ($) {
@@ -301,7 +315,11 @@ sub eidx_sync { # main entry point
 	local $SIG{__WARN__} = sub {
 		$warn_cb->($self->{current_info}, ': ', @_);
 	};
-	_sync_inbox($self, $opt, $_) for (@{$self->{ibx_list}});
+
+	# don't use $_ here, it'll get clobbered by reindex_checkpoint
+	for my $ibx (@{$self->{ibx_list}}) {
+		_sync_inbox($self, $opt, $ibx);
+	}
 
 	$self->{oidx}->rethread_done($opt);
 

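The loop change in eidx_sync guards against a general Perl hazard:
a bare `for (@list)` aliases the global $_ to each element, so any
callee that assigns to $_ overwrites both the loop variable and the
array element.  A minimal sketch follows, with a hypothetical
clobber() standing in for whatever reindex_checkpoint does to $_
(its body is not part of this patch):

```perl
use strict;
use warnings;

# Hypothetical callee that overwrites the global $_, as a nested
# helper might do with a bare `$_ = ...` or `while (<$fh>)`.
sub clobber { $_ = 'clobbered' }

my @implicit = qw(a b c);
my @seen_implicit;
for (@implicit) { # global $_ is aliased to each element
	clobber();
	push @seen_implicit, $_;
}

my @lexical = qw(a b c);
my @seen_lexical;
for my $item (@lexical) { # lexical loop variable, $_ is not involved
	clobber();
	push @seen_lexical, $item;
}

print "implicit: @seen_implicit (array now: @implicit)\n";
print "lexical:  @seen_lexical (array now: @lexical)\n";
```

With the implicit loop, both the values seen and the array itself end
up as 'clobbered'; with the lexical loop variable, they stay a, b, c.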
Thread overview: 57+ messages
2020-10-27  7:54 [PATCH 00/52] detached external index: mostly Eric Wong
2020-10-27  7:54 ` [PATCH 01/52] doc/standards: add RFCs for URL schemes Eric Wong
2020-11-05  7:54   ` [PATCH v2] " Eric Wong
2020-10-27  7:54 ` [PATCH 02/52] search: hoist out _xdb_sharded for v2 inboxes Eric Wong
2020-10-27  7:54 ` [PATCH 03/52] extsearch: start mocking out Eric Wong
2020-10-27  7:54 ` [PATCH 04/52] searchidx: expose INDEXLEVELS as `our' Eric Wong
2020-10-27  7:54 ` [PATCH 05/52] v2writable: add git method Eric Wong
2020-10-27  7:54 ` [PATCH 06/52] v2writable: make OO calls to last_commit-related methods Eric Wong
2020-10-27  7:54 ` [PATCH 07/52] search: xdb_sharded: make this a public method for ExtSearch Eric Wong
2020-10-27  7:54 ` [PATCH 08/52] searchidx: introduce "xref3" concept Eric Wong
2020-10-27  7:54 ` [PATCH 09/52] v2writable: prepare initialization for external indices Eric Wong
2020-10-27  7:54 ` [PATCH 10/52] v2writable: hoist out write_alternates Eric Wong
2020-10-27  7:54 ` [PATCH 11/52] searchidxshard: allow msgref to be undef Eric Wong
2020-10-27  7:54 ` [PATCH 12/52] v2writable: idx_shard: simplify callers Eric Wong
2020-10-27  7:54 ` [PATCH 13/52] v2writable: count_shards: allow working without {ibx} Eric Wong
2020-10-27  7:54 ` [PATCH 14/52] overidx: introduce changes for external index Eric Wong
2020-10-27  7:54 ` [PATCH 15/52] v2: some changes for ExtSearchIdx compatibility Eric Wong
2020-10-27  7:54 ` [PATCH 16/52] inboxwritable: eidx_key for external index Eric Wong
2020-10-27  7:54 ` [PATCH 17/52] v2writable: rename remaining "remote" terminology Eric Wong
2020-10-27  7:54 ` [PATCH 18/52] v2writable: checkpoint: account for lack of {mm} Eric Wong
2020-10-27  7:54 ` [PATCH 19/52] extsearchidx: initial implementation Eric Wong
2020-10-27  7:54 ` [PATCH 20/52] searchidx: index eidx_key as a boolean term Eric Wong
2020-10-27  7:54 ` [PATCH 21/52] searchidx: xref3 delete support Eric Wong
2020-10-27  7:54 ` [PATCH 22/52] searchidxshard: special init for eidx Eric Wong
2020-10-27  7:54 ` [PATCH 23/52] searchidx: put {ibx} into $sync state Eric Wong
2020-10-27  7:54 ` [PATCH 24/52] searchidx: log2stack: simplify callers Eric Wong
2020-10-27  7:54 ` [PATCH 25/52] v2writable: more generic sync setup code Eric Wong
2020-10-27  7:54 ` [PATCH 26/52] v2writable: allow OO method references Eric Wong
2020-10-27  7:54 ` [PATCH 27/52] v2writable: rename {v2w} field to {self} Eric Wong
2020-10-27  7:54 ` [PATCH 28/52] v2writable: make *last_commits and sync_prepare OO methods Eric Wong
2020-10-27  7:54 ` [PATCH 29/52] v2writable: move size check init to sync_prepare Eric Wong
2020-10-27  7:54 ` [PATCH 30/52] extsearchidx: more compatibility with V2Writable callers Eric Wong
2020-10-27  7:54 ` [PATCH 31/52] v2writable: reduce scope of epoch-aware code Eric Wong
2020-10-27  7:54 ` [PATCH 32/52] extsearchidx: remove {unindex_range} field Eric Wong
2020-10-27  7:54 ` [PATCH 33/52] v2writable: pass oid to uindex_oid Eric Wong
2020-10-27  7:54 ` [PATCH 34/52] extsearchidx: sync unit updates Eric Wong
2020-10-27  7:54 ` [PATCH 35/52] searchidx: export prepare_stack Eric Wong
2020-10-27  7:54 ` [PATCH 36/52] extsearchidx: sync updates Eric Wong
2020-10-27  7:54 ` [PATCH 37/52] searchidx: reduce inbox-dependency, wrap ->with_umask Eric Wong
2020-10-27  7:54 ` [PATCH 38/52] searchidx: favor $sync->{ibx} (over $self->{ibx}) Eric Wong
2020-10-27  7:54 ` [PATCH 39/52] Makefile.PL: do not build manpage if POD is missing Eric Wong
2020-10-27  7:54 ` [PATCH 40/52] script: add preliminary eindex implementation Eric Wong
2020-10-27  7:54 ` [PATCH 41/52] index: eindex wiring Eric Wong
2020-10-27  7:54 ` [PATCH 42/52] over: store xref3 data in over.sqlite3 Eric Wong
2020-10-27  7:54 ` [PATCH 43/52] searchidx: remove xref3 support for Xapian Eric Wong
2020-10-27  7:54 ` [PATCH 44/52] t/extsearch.t: verify results and xref3 ordering Eric Wong
2020-10-27  7:54 ` [PATCH 45/52] t/v2writable: remove pointless ->barrier call Eric Wong
2020-10-27  7:54 ` [PATCH 46/52] extsearch: wire up smsg_eml Eric Wong
2020-10-27  7:54 ` [PATCH 47/52] extsearchidx: handle edits Eric Wong
2020-10-27  7:54 ` [PATCH 48/52] extsearch: wire up remaining Inbox-like methods for WWW Eric Wong
2020-10-27  7:54 ` [PATCH 49/52] searchidx: ignore exceptions from ->remove_term Eric Wong
2020-10-27  7:54 ` [PATCH 50/52] extsearchidx: set current_info in warning callbacks Eric Wong
2020-10-27  7:54 ` Eric Wong [this message]
2020-11-03  9:08   ` [PATCH 51/52] extsearchidx: support --batch-size checkpoints Eric Wong
2020-10-27  7:54 ` [PATCH 52/52] searchidxshard: make warnings with eidx_key less confusing Eric Wong
2020-10-27 12:08 ` [PATCH 00/52] detached external index: mostly Konstantin Ryabitsev
2020-11-10 18:53 ` detached external index: performance note Eric Wong
