From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-ASN:
X-Spam-Status: No, score=-4.2 required=3.0 tests=ALL_TRUSTED,BAYES_00,
	DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF shortcircuit=no
	autolearn=ham autolearn_force=no version=3.4.6
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id 374641F428
	for ; Fri, 24 Mar 2023 10:40:22 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=80x24.org;
	s=selector1; t=1679654422;
	bh=bMv6MbT6CK7Uc2SzToVfJglftKyWbE7uirxZC8rkDB8=;
	h=Date:From:To:Subject:References:In-Reply-To:From;
	b=kzfBIDEA90RXcKXE8feN4Y5I0V7/Ubk8XlUSge15BqoIJjw9uEXGLcrWLw0FUxcpk
	 alC66HU5MGKuCT59wE4o0ihWvg1UNAJcK8vbuXS/jfXtxjoxTj/q1sIy8EAVZ/k9y/
	 zrG4xRBLhBLex/U0GqA1Eqw7jiWPgnkKuNkbZfz8=
Date: Fri, 24 Mar 2023 10:40:22 +0000
From: Eric Wong
To: meta@public-inbox.org
Subject: [PATCH 29/28] cindex: --prune checkpoints to avoid OOM
Message-ID: <20230324104022.M110416@dcvr>
References: <20230321230701.3019936-1-e@80x24.org>
 <20230321230743.3020032-1-e@80x24.org>
 <20230321230743.3020032-28-e@80x24.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20230321230743.3020032-28-e@80x24.org>
List-Id:

Having many ->delete_document calls in a transaction still
causes Xapian to eat up a large amount of memory and OOM on
my system.

I may reimplement --prune to avoid blocking ongoing updates,
but this is a simple fix for swapping and OOMs for now.
---
 lib/PublicInbox/CodeSearchIdx.pm | 23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/lib/PublicInbox/CodeSearchIdx.pm b/lib/PublicInbox/CodeSearchIdx.pm
index 704baa9c..e353f452 100644
--- a/lib/PublicInbox/CodeSearchIdx.pm
+++ b/lib/PublicInbox/CodeSearchIdx.pm
@@ -622,12 +622,21 @@ sub scan_git_dirs ($) {
 
 sub prune_cb { # git->check_async callback
 	my ($hex, $type, undef, $self_id) = @_;
-	if ($type ne 'commit') {
-		my ($self, $id) = @$self_id;
-		progress($self, "$hex $type");
-		++$self->{pruned};
-		$self->{xdb}->delete_document($id);
-	}
+	return if $type eq 'commit';
+	my ($self, $id) = @$self_id;
+	my $len = $self->{xdb}->get_doclength($id);
+	progress($self, "$hex $type (doclength=$len)");
+	++$self->{pruned};
+	$self->{xdb}->delete_document($id);
+
+	# all math around batch_bytes calculation is pretty fuzzy,
+	# but need a way to regularly flush output to avoid OOM,
+	# so assume the average term + position overhead is the
+	# answer to everything: 42
+	return if ($self->{batch_bytes} -= ($len * 42)) > 0;
+	cidx_ckpoint($self, "[$self->{shard}] $self->{pruned}");
+	$self->{batch_bytes} = $self->{-opt}->{batch_size} //
+				$PublicInbox::SearchIdx::BATCH_BYTES;
 }
 
 sub shard_prune { # via wq_io_do
@@ -639,6 +648,8 @@ sub shard_prune { # via wq_io_do
 	my $cur = $xdb->postlist_begin('Tc');
 	my $end = $xdb->postlist_end('Tc');
 	my ($id, @cmt, $oid);
+	local $self->{batch_bytes} = $self->{-opt}->{batch_size} //
+				$PublicInbox::SearchIdx::BATCH_BYTES;
 	local $self->{pruned} = 0;
 	for (; $cur != $end && !$DO_QUIT; $cur++) {
 		@cmt = xap_terms('Q', $xdb, $id = $cur->get_docid);
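
The budget countdown that prune_cb uses can be tried outside of
public-inbox. What follows is a minimal standalone sketch and not the
project's code: new_pruner, note_delete and the checkpoint callback are
hypothetical names, and the callback merely stands in for whatever
cidx_ckpoint does. Only the multiplier of 42 and the "reset the budget
after each checkpoint" logic are taken from the patch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same fuzzy per-doclength cost estimate as the patch.
use constant COST_PER_LEN => 42;

# Hypothetical helper: bundle the byte budget and a checkpoint callback.
sub new_pruner {
	my ($batch_bytes, $ckpoint) = @_;
	return {
		budget => $batch_bytes,      # bytes left in the current batch
		batch_bytes => $batch_bytes, # value the budget is reset to
		pruned => 0,                 # documents deleted so far
		ckpoint => $ckpoint,         # stand-in for a real checkpoint
	};
}

# Account for one deleted document of length $doclen and
# checkpoint once the estimated budget is used up.
sub note_delete {
	my ($st, $doclen) = @_;
	++$st->{pruned};
	return if ($st->{budget} -= $doclen * COST_PER_LEN) > 0;
	$st->{ckpoint}->($st->{pruned});
	$st->{budget} = $st->{batch_bytes}; # start a fresh batch
}

# Demo: a tiny 1000-byte budget and made-up document lengths.
my @ckpoints;
my $st = new_pruner(1000, sub { push @ckpoints, $_[0] });
note_delete($st, $_) for (10, 10, 10, 5, 30);
print "checkpoints after: @ckpoints\n"; # prints "checkpoints after: 3 5"
```

With a 1000-byte budget, the first three deletions cost 3 * 420 bytes
and trigger a checkpoint after the third document; the last two cost
210 + 1260 bytes and trigger another after the fifth. A single large
document can overshoot the budget, which is in line with the patch
describing its own math as fuzzy.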