Date: Fri, 26 Aug 2022 12:51:29 -0400
From: Konstantin Ryabitsev
To: Eric Wong
Cc: meta@public-inbox.org
Subject: Re: extindex for git? [was: an even bigger git show than before...]
Message-ID: <20220826165129.j2p4upnyomhwjpwj@meerkat.local>
References: <20220822023346.938859-1-e@80x24.org> <20220822193413.M902704@dcvr> <20220825213442.M620858@dcvr>
In-Reply-To: <20220825213442.M620858@dcvr>

On Thu, Aug 25, 2022 at 09:34:42PM +0000, Eric Wong wrote:
> > I wanted to add search to git repos ages ago, but it was silly
> > expensive in terms of space. That was before extindex...
> >
> > extindex ought to be able to offer space savings across forks
> > and similar documents (commits vs patch mails).
> >
> > At least dfpre/dfpost/dfn/subject may be enough, even...
>
> And I'm also thinking extindexing coderepos can make
> auto-association with inboxes possible.
>
> Right now, configuring coderepos on a large scale is a huge PITA
> given the M:N associations between inboxes and coderepos.
>
> Being able to do fuzzy JOIN-ish operations based on
> blobs/filenames/subjects would allow extindex to automatically
> associate coderepos with inboxes and vice-versa.

I wonder how well this would work in the presence of many forks? E.g.
most of the content on git.kernel.org consists of thin forks of
linux.git, so matching by blobs/filenames/subjects across all of them
would return too many hits, and some kind of priority ordering would be
required, I think.

Overall, though, I do agree that this would be really handy.

-K
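
To make the fork concern concrete, here is a minimal sketch of one possible priority ordering. It is only an illustration, not anything public-inbox or extindex implements: the function name `rank_coderepos`, the repo names, and the short blob IDs are all made up, and the rarity weighting (a key found in every repo counts for nothing, a key found in one repo counts the most) is just one assumed way to keep thin forks from producing equally strong hits.

```python
from collections import Counter
from math import log


def rank_coderepos(inbox_keys, repo_keys):
    """Rank repos by rarity-weighted overlap with an inbox's keys.

    inbox_keys: set of keys (e.g. blob IDs or filenames) seen in an inbox
    repo_keys:  dict mapping repo name -> set of keys found in that repo
    Both are hypothetical inputs for this sketch.
    """
    n_repos = len(repo_keys)
    # Document frequency: in how many repos does each key appear?
    df = Counter(key for keys in repo_keys.values() for key in keys)

    scores = {}
    for name, keys in repo_keys.items():
        # A key shared by every repo contributes log(1) == 0,
        # so content common to all forks cannot dominate the ranking.
        scores[name] = sum(log(n_repos / df[k]) for k in inbox_keys & keys)

    # Highest score first; break ties by name for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up data: two thin forks that each add one blob on top of linux.git.
repos = {
    "linux.git": {"b1", "b2", "b3"},
    "fork-net.git": {"b1", "b2", "b3", "n1"},
    "fork-mm.git": {"b1", "b2", "b3", "m1"},
}
net_inbox = {"b1", "n1"}  # keys referenced by patches in a hypothetical inbox
ranking = rank_coderepos(net_inbox, repos)
print(ranking[0][0])
```

With this weighting, an inbox whose patches only touch blobs common to every fork yields an all-zero tie, so a real ordering would likely need a further signal (for example, which repo is the upstream) that this sketch does not attempt.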