* Git-only operation mode
@ 2019-09-25 18:24 Konstantin Ryabitsev
2019-09-25 19:45 ` Eric Wong
0 siblings, 1 reply; 9+ messages in thread
From: Konstantin Ryabitsev @ 2019-09-25 18:24 UTC (permalink / raw)
To: meta
Hello:
Is there a way to run just the archiver component of public-inbox --
just writing to git repos without any of the indexing/frontend bits? One
of the idle conversations I had with vger.kernel.org folks was to see if
we can shift the source of truth archive generation to happen at their
end. We would then clone repositories from them and provide the
frontend/search bits on lore.kernel.org. From my cursory looking, it
would seem that the watch/delivery tools always expect to be taking care
of xapian/indexing, but I think being able to decouple git bits from
search/frontend bits would be a useful mode of operation.
Best,
-K
* Re: Git-only operation mode
2019-09-25 18:24 Git-only operation mode Konstantin Ryabitsev
@ 2019-09-25 19:45 ` Eric Wong
2019-09-25 19:58 ` Konstantin Ryabitsev
0 siblings, 1 reply; 9+ messages in thread
From: Eric Wong @ 2019-09-25 19:45 UTC (permalink / raw)
To: Konstantin Ryabitsev; +Cc: meta
Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> Hello:
>
> Is there a way to run just the archiver component of public-inbox -- just
> writing to git repos without any of the indexing/frontend bits? One of the
> idle conversations I had with vger.kernel.org folks was to see if we can
> shift the source of truth archive generation to happen at their end. We
> would then clone repositories from them and provide the frontend/search bits
> on lore.kernel.org. From my cursory looking, it would seem that the
> watch/delivery tools always expect to be taking care of xapian/indexing, but
> I think being able to decouple git bits from search/frontend bits would be a
> useful mode or operation.
v1 was git-only (that led to scalability problems from big trees).
v2 needs SQLite to do dedupe with indexlevel=basic, but not Xapian,
anymore. We could get rid of dedupe for v2, but I'm not sure it's
worth it...
* Re: Git-only operation mode
2019-09-25 19:45 ` Eric Wong
@ 2019-09-25 19:58 ` Konstantin Ryabitsev
2019-09-25 22:45 ` Eric Wong
0 siblings, 1 reply; 9+ messages in thread
From: Konstantin Ryabitsev @ 2019-09-25 19:58 UTC (permalink / raw)
To: Eric Wong; +Cc: meta
On Wed, Sep 25, 2019 at 07:45:03PM +0000, Eric Wong wrote:
>> Is there a way to run just the archiver component of public-inbox --
>> just
>> writing to git repos without any of the indexing/frontend bits? One of the
>> idle conversations I had with vger.kernel.org folks was to see if we can
>> shift the source of truth archive generation to happen at their end. We
>> would then clone repositories from them and provide the frontend/search bits
>> on lore.kernel.org. From my cursory looking, it would seem that the
>> watch/delivery tools always expect to be taking care of xapian/indexing, but
>> I think being able to decouple git bits from search/frontend bits would be a
>> useful mode or operation.
>
>v1 was git-only (that led to scalability problems from big trees).
>v2 needs SQLite to do dedupe with indexlevel=basic, but not Xapian,
>anymore. We could get rid of dedupe for v2, but I'm not sure it's
>worth it...
Needing sqlite is not a big deal -- compared to the size of the repos,
that's reasonably small (e.g. all of lkml git trees are 8.2GB, while
msgmap.sqlite3 is 600MB).
Is there an easy way to exclude xapian indexes from being generated
during watch/mda runs then?
A follow-up to that -- is running "public-inbox-index" on the repository
after it's been updated enough to update the xapian db? It would be easy
to do so as part of the grok-pull post-update hook.
Best,
-K
* Re: Git-only operation mode
2019-09-25 19:58 ` Konstantin Ryabitsev
@ 2019-09-25 22:45 ` Eric Wong
2019-09-26 0:23 ` Eric W. Biederman
2019-09-26 20:52 ` Konstantin Ryabitsev
0 siblings, 2 replies; 9+ messages in thread
From: Eric Wong @ 2019-09-25 22:45 UTC (permalink / raw)
To: Konstantin Ryabitsev; +Cc: meta
Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Wed, Sep 25, 2019 at 07:45:03PM +0000, Eric Wong wrote:
> > > Is there a way to run just the archiver component of public-inbox --
> > > just
> > > writing to git repos without any of the indexing/frontend bits? One of the
> > > idle conversations I had with vger.kernel.org folks was to see if we can
> > > shift the source of truth archive generation to happen at their end. We
> > > would then clone repositories from them and provide the frontend/search bits
> > > on lore.kernel.org. From my cursory looking, it would seem that the
> > > watch/delivery tools always expect to be taking care of xapian/indexing, but
> > > I think being able to decouple git bits from search/frontend bits would be a
> > > useful mode or operation.
> >
> > v1 was git-only (that led to scalability problems from big trees).
> > v2 needs SQLite to do dedupe with indexlevel=basic, but not Xapian,
> > anymore. We could get rid of dedupe for v2, but I'm not sure it's
> > worth it...
>
> Needing sqlite is not a big deal -- compared to the size of the repos,
> that's reasonably small (e.g. all of lkml git trees are 8.2GB, while
> msgmap.sqlite3 is 600MB).
Right, it'll also need xap15/over.sqlite* but that's still not too big.
> Is there an easy way to exclude xapian indexes from being generated during
> watch/mda runs then?
public-inbox-init --indexlevel=basic <usual args>
Or setting publicinbox.$INBOX_NAME.indexlevel=basic in the
config file after-the-fact. You should also be able to remove
any non-SQLite files from xap15 after-the-fact, if you already
generated them, too (but I haven't tested that).
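A minimal sketch of the after-the-fact route, assuming the config file is in git-config format so that `git config --file` can edit it. The inbox name `lkml`, the helper name `set_indexlevel`, and the scratch file are made up for illustration; a real setup would point at its own config (by default ~/.public-inbox/config, or whatever PI_CONFIG names).

```shell
# Set the index level for one inbox in a git-config-format file.
# usage: set_indexlevel CONFIG_FILE INBOX_NAME LEVEL   (hypothetical helper)
set_indexlevel() {
    git config --file "$1" "publicinbox.$2.indexlevel" "$3"
}

# Illustration against a scratch file; "lkml" is an assumed inbox name.
cfg=$(mktemp)
set_indexlevel "$cfg" lkml basic

# Read the key back to confirm what was written.
git config --file "$cfg" --get publicinbox.lkml.indexlevel
```

Going the other way later (basic to medium or full) would be the same call with a different LEVEL, followed by a public-inbox-index run.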
I started working on a public-inbox-init manpage the other day,
still need to finish that...
> A follow-up to that -- is running "public-inbox-index" on the repository
> after it's been updated enough to update the xapian db? It would be easy to
> do so as part of the grok-pull post-update hook.
Yes, on a fresh clone. You'll need to change indexlevel to
medium or full if it was set up using basic.
I haven't figured out how to use a grok-pull post-update hook to
run index on my clone of erol, since there's multiple epochs
per-inbox to deal with.
* Re: Git-only operation mode
2019-09-25 22:45 ` Eric Wong
@ 2019-09-26 0:23 ` Eric W. Biederman
2019-09-26 20:52 ` Konstantin Ryabitsev
1 sibling, 0 replies; 9+ messages in thread
From: Eric W. Biederman @ 2019-09-26 0:23 UTC (permalink / raw)
To: Eric Wong; +Cc: Konstantin Ryabitsev, meta
Eric Wong <e@80x24.org> writes:
> Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
>> On Wed, Sep 25, 2019 at 07:45:03PM +0000, Eric Wong wrote:
>> > > Is there a way to run just the archiver component of public-inbox --
>> > > just
>> > > writing to git repos without any of the indexing/frontend bits? One of the
>> > > idle conversations I had with vger.kernel.org folks was to see if we can
>> > > shift the source of truth archive generation to happen at their end. We
>> > > would then clone repositories from them and provide the frontend/search bits
>> > > on lore.kernel.org. From my cursory looking, it would seem that the
>> > > watch/delivery tools always expect to be taking care of xapian/indexing, but
>> > > I think being able to decouple git bits from search/frontend bits would be a
>> > > useful mode or operation.
>> >
>> > v1 was git-only (that led to scalability problems from big trees).
>> > v2 needs SQLite to do dedupe with indexlevel=basic, but not Xapian,
>> > anymore. We could get rid of dedupe for v2, but I'm not sure it's
>> > worth it...
>>
>> Needing sqlite is not a big deal -- compared to the size of the repos,
>> that's reasonably small (e.g. all of lkml git trees are 8.2GB, while
>> msgmap.sqlite3 is 600MB).
>
> Right, it'll also need xap15/over.sqlite* but that's still not too
> big.
For linux-kernel my copy looks to be about 2.4G while the git repos
run 9.1G.
>> Is there an easy way to exclude xapian indexes from being generated during
>> watch/mda runs then?
>
> public-inbox-init --indexlevel=basic <usual args>
>
> Or setting publicinbox.$INBOX_NAME.indexlevel=basic in the
> config file after-the-fact. You should also be able to remove
> any non-SQLite files from xap15 after-the-fact, if you already
> generated them, too (but I haven't tested that).
>
> I started working on a public-inbox-init manpage the other day,
> still need to finish that...
>
>> A follow-up to that -- is running "public-inbox-index" on the repository
>> after it's been updated enough to update the xapian db? It would be easy to
>> do so as part of the grok-pull post-update hook.
>
> Yes, on a fresh clone. You'll need to change indexlevel to
> medium or full if it was setup using basic.
>
> I haven't figured out how to use a grok-pull post-update hook to
> run index on my clone of erol, since there's multiple epochs
> per-inbox to deal with.
I have a perl script I use, which boils down to:
    git remote update
    public-inbox-index
That is enough to get things up to date.
The tricky bit is when you have an archive like linux-kernel that uses
multiple git repos.
Given that article numbers are stable (except in the case of bugs), it
should be completely possible to do this.
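For an inbox split across several git repos, the same two steps could be looped roughly as follows. This is only a sketch: it assumes the `$INBOX_DIR/git/N.git` layout seen elsewhere in this thread and that each epoch already has a remote configured, and `update_inbox` is a made-up helper name, not part of public-inbox.

```shell
# Fetch every epoch of a multi-repo inbox, then index the inbox once.
# usage: update_inbox INBOX_DIR   (hypothetical helper name)
update_inbox() {
    for epoch in "$1"/git/*.git; do
        [ -d "$epoch" ] || continue      # glob matched nothing
        git --git-dir="$epoch" remote update || return 1
    done
    # Index only after all epochs have been fetched.
    public-inbox-index "$1"
}
```

It would be called as `update_inbox /path/to/inbox`; a failed fetch stops the run before indexing.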
The nasty case is when someone rebases the git history. I have been
meaning to report this after tracking it down. To the best of my
knowledge, public-inbox-index throws out all of the history that was
rebased, which can be expensive. For me it meant I had to drop from
indexlevel=full to indexlevel=basic on linux-kernel, because my laptop
could not handle the reindexing of all of those messages.
Given that the message numbers remain stable in an event like that, it
should be possible to optimize and only reindex things if the blob in
git for a particular message number has changed. Maybe we already try
and even that is too expensive. I haven't re-read that code since I
noticed the problem.
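The comparison step of that idea can be sketched as below. This is not how public-inbox-index works today; the input format (one "ARTICLE_NUMBER BLOB_OID" pair per line, captured before and after the rebase) and the name `changed_articles` are assumptions made up for illustration.

```shell
# Print the article numbers whose blob differs between two maps,
# plus any numbers that only appear in the new map.
# usage: changed_articles OLD_MAP NEW_MAP
# Each map file holds "ARTICLE_NUMBER BLOB_OID" per line (hypothetical format).
changed_articles() {
    awk 'FILENAME == ARGV[1] { old[$1] = $2; next }
         !($1 in old) || old[$1] != $2 { print $1 }' "$1" "$2"
}
```

Only the numbers printed would need reindexing; everything else keeps its existing index entry.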
Eric
* Re: Git-only operation mode
2019-09-25 22:45 ` Eric Wong
2019-09-26 0:23 ` Eric W. Biederman
@ 2019-09-26 20:52 ` Konstantin Ryabitsev
2019-09-26 21:10 ` Eric Wong
2019-10-07 0:07 ` Eric Wong
1 sibling, 2 replies; 9+ messages in thread
From: Konstantin Ryabitsev @ 2019-09-26 20:52 UTC (permalink / raw)
To: Eric Wong; +Cc: meta
On Wed, Sep 25, 2019 at 10:45:00PM +0000, Eric Wong wrote:
>> A follow-up to that -- is running "public-inbox-index" on the
>> repository
>> after it's been updated enough to update the xapian db? It would be easy to
>> do so as part of the grok-pull post-update hook.
>
>Yes, on a fresh clone. You'll need to change indexlevel to
>medium or full if it was setup using basic.
>
>I haven't figured out how to use a grok-pull post-update hook to
>run index on my clone of erol, since there's multiple epochs
>per-inbox to deal with.
Theoretically, shouldn't be that difficult. The post-update hook fires
on clone/update with the full path to the repo that got updated, e.g.
post-update-hook.sh /var/lib/public-inbox/lkml/git/7.git
Here's a quick and dirty start to the post-update-hook that I came up
with:
-----
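#!/bin/bash
# $1 is the full path to the epoch repo that grok-pull just updated,
# e.g. /var/lib/public-inbox/lkml/git/7.git
# Strip the trailing /git/N.git to get the inbox directory
topdir=$(echo "$1" | sed 's|/git/[[:digit:]]*\.git$||g')
pidir=$(basename "$topdir")
url="http://localhost:8080/${pidir}"
cd "$topdir/.." || exit 1
# No msgmap.sqlite3 yet means the inbox has not been initialized
if [[ ! -f $pidir/msgmap.sqlite3 ]]; then
    # Derive the list address from the List-Id of the latest message
    listid=$(git --git-dir="$1" show master:m | grep -i '^List-Id:' | sed 's|.*:.*<\(.*\)>$|\1|g')
    email=$(echo "$listid" | sed 's|\.|@|')
    public-inbox-init -V2 "$pidir" "$pidir/" "$url" "$email"
    # Need logic here for adding to the config file
fi
public-inbox-index "$pidir"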
-----
It needs some kind of a template entry for adding to the config file
post-init, but this should at least do the right thing for running
public-inbox-index on repo updates.
-K
* Re: Git-only operation mode
2019-09-26 20:52 ` Konstantin Ryabitsev
@ 2019-09-26 21:10 ` Eric Wong
2019-09-26 21:44 ` Konstantin Ryabitsev
2019-10-07 0:07 ` Eric Wong
1 sibling, 1 reply; 9+ messages in thread
From: Eric Wong @ 2019-09-26 21:10 UTC (permalink / raw)
To: Konstantin Ryabitsev; +Cc: meta
Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Wed, Sep 25, 2019 at 10:45:00PM +0000, Eric Wong wrote:
> > > A follow-up to that -- is running "public-inbox-index" on the
> > > repository
> > > after it's been updated enough to update the xapian db? It would be easy to
> > > do so as part of the grok-pull post-update hook.
> >
> > Yes, on a fresh clone. You'll need to change indexlevel to
> > medium or full if it was setup using basic.
> >
> > I haven't figured out how to use a grok-pull post-update hook to
> > run index on my clone of erol, since there's multiple epochs
> > per-inbox to deal with.
>
> Theoretically, shouldn't be that difficult. The post-update hook fires on
> clone/update with the full path to the repo that got updated, e.g.
>
> post-update-hook.sh /var/lib/public-inbox/lkml/git/7.git
>
> Here's a quick and dirty start to the post-update-hook that I came up with:
>
> -----
> #!/bin/bash
>
> topdir=$(echo $1 | sed 's|/git/[[:digit:]]*\.git$||g')
> pidir=$(basename $topdir)
> url="http://localhost:8080/${pidir}"
>
> cd $topdir/..
>
> if [[ ! -f $pidir/msgmap.sqlite3 ]]; then
> listid=$(git --git-dir=$1 show master:m | grep -i '^List-Id:' | sed 's|.*:.*<\(.*\)>$|\1|g')
> email=$(echo $listid | sed 's|\.|@|')
> public-inbox-init -V2 $pidir $pidir/ $url $email
If grok-pull is using multiple threads, there can be a race
there because parallel runs of public-inbox-init can clobber
each other (which needs to be fixed :x)
> # Need logic here for adding to the config file
Yeah, I've been meaning to add something like "$INBOX_URL/_/text/config"
so some of the config keys can be easily cloned, too.
Not sure if it's something that can be stuffed in manifest.js.gz
or better as a separate file... Probably separate file?
> fi
>
> public-inbox-index $pidir
> -----
>
> It needs some kind of a template entry for adding to the config file
> post-init, but this should at least do the right thing for running
> public-inbox-index on repo updates.
>
> -K
* Re: Git-only operation mode
2019-09-26 21:10 ` Eric Wong
@ 2019-09-26 21:44 ` Konstantin Ryabitsev
0 siblings, 0 replies; 9+ messages in thread
From: Konstantin Ryabitsev @ 2019-09-26 21:44 UTC (permalink / raw)
To: Eric Wong; +Cc: meta
On Thu, Sep 26, 2019 at 09:10:25PM +0000, Eric Wong wrote:
>If grok-pull is using multiple threads, there can be a race
>there because parallel runs of public-inbox-init can clobber
>each other (which needs to be fixed :x)
Yes, this is true -- there should be a lockfile in the hook to avoid
multiple post-update hooks operating on the same pidir.
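One way such a lock could look, as a rough sketch rather than a tested hook: it uses an atomic `mkdir` as the lock instead of a lockfile proper (flock(1) would be another option), and the helper name `with_inbox_lock`, the `.hook-lock` suffix, and the 60-second limit are all made up for illustration.

```shell
# Run a command while holding a per-inbox lock.
# usage: with_inbox_lock INBOX_DIR COMMAND [ARGS...]
# mkdir is atomic, so only one caller can create the lock directory.
# The body runs in a subshell so its variables stay local.
with_inbox_lock() (
    lock="$1.hook-lock"     # hypothetical lock path next to the inbox
    shift
    tries=0
    until mkdir "$lock" 2>/dev/null; do
        tries=$((tries + 1))
        if [ "$tries" -ge 60 ]; then
            echo "timed out waiting for $lock" >&2
            exit 1
        fi
        sleep 1
    done
    status=0
    "$@" || status=$?
    rmdir "$lock"           # release the lock even if the command failed
    exit "$status"
)
```

In the hook this would wrap the init and index calls, e.g. `with_inbox_lock "$pidir" public-inbox-index "$pidir"`. A hook killed while holding the lock would leave the directory behind, which this sketch does not handle.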
>> # Need logic here for adding to the config file
>
>Yeah, I've been meaning to add something like "$INBOX_URL/_/text/config"
>so some of the config keys can be easily cloned, too.
>
>Not sure if it's something that can be stuffed in manifest.js.gz
>or better as a separate file... Probably separate file?
Right -- the manifest only deals with very basic repository details, so
it's easier to pass this info in some other way. Either via a remote
URL, or perhaps via a file in a special ref in the repo itself
(refs/meta/config?). We use this trick on git.kernel.org to let people
tweak cgit parameters for their repos, see
https://korg.wiki.kernel.org/userdoc/cgit-meta-data.
-K
* Re: Git-only operation mode
2019-09-26 20:52 ` Konstantin Ryabitsev
2019-09-26 21:10 ` Eric Wong
@ 2019-10-07 0:07 ` Eric Wong
1 sibling, 0 replies; 9+ messages in thread
From: Eric Wong @ 2019-10-07 0:07 UTC (permalink / raw)
To: Konstantin Ryabitsev; +Cc: meta
Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Wed, Sep 25, 2019 at 10:45:00PM +0000, Eric Wong wrote:
> > > A follow-up to that -- is running "public-inbox-index" on the
> > > repository
> > > after it's been updated enough to update the xapian db? It would be easy to
> > > do so as part of the grok-pull post-update hook.
> >
> > Yes, on a fresh clone. You'll need to change indexlevel to
> > medium or full if it was setup using basic.
> >
> > I haven't figured out how to use a grok-pull post-update hook to
> > run index on my clone of erol, since there's multiple epochs
> > per-inbox to deal with.
>
> Theoretically, shouldn't be that difficult. The post-update hook fires on
> clone/update with the full path to the repo that got updated, e.g.
>
> post-update-hook.sh /var/lib/public-inbox/lkml/git/7.git
>
> Here's a quick and dirty start to the post-update-hook that I came up with:
>
> -----
> #!/bin/bash
>
> topdir=$(echo $1 | sed 's|/git/[[:digit:]]*\.git$||g')
> pidir=$(basename $topdir)
> url="http://localhost:8080/${pidir}"
>
> cd $topdir/..
>
> if [[ ! -f $pidir/msgmap.sqlite3 ]]; then
> listid=$(git --git-dir=$1 show master:m | grep -i '^List-Id:' | sed 's|.*:.*<\(.*\)>$|\1|g')
> email=$(echo $listid | sed 's|\.|@|')
> public-inbox-init -V2 $pidir $pidir/ $url $email
> # Need logic here for adding to the config file
> fi
>
> public-inbox-index $pidir
Running public-inbox-index blindly there can be
dangerous/surprising when multiple epochs are initially cloned in
non-sequential order.
The example I sent out won't index unless there's messages
in msgmap:
https://public-inbox.org/meta/20191006235651.5725-1-e@80x24.org/
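A different guard from the msgmap check in the linked example, sketched here only to illustrate the ordering concern: skip indexing until every epoch from 0 up to the highest one present has been cloned. It assumes the `$INBOX_DIR/git/N.git` layout used in this thread; `epochs_contiguous` is a made-up name, and it cannot know about epochs that have not started cloning yet.

```shell
# Succeed only if INBOX_DIR/git holds every epoch from 0.git up to the
# highest-numbered one present, i.e. no gaps from an out-of-order clone.
# usage: epochs_contiguous INBOX_DIR   (hypothetical helper)
# The body runs in a subshell so its variables stay local.
epochs_contiguous() (
    max=-1
    count=0
    for d in "$1"/git/*.git; do
        [ -d "$d" ] || continue
        n=${d##*/}
        n=${n%.git}
        case $n in
            ''|*[!0-9]*) continue ;;    # ignore non-numeric names
        esac
        count=$((count + 1))
        if [ "$n" -gt "$max" ]; then
            max=$n
        fi
    done
    [ "$count" -gt 0 ] && [ "$count" -eq $((max + 1)) ]
)
```

A hook could then run `epochs_contiguous "$pidir" && public-inbox-index "$pidir"` so that an inbox with, say, only 0.git and 2.git is left alone until 1.git arrives.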
> -----
>
> It needs some kind of a template entry for adding to the config file
> post-init, but this should at least do the right thing for running
> public-inbox-index on repo updates.
I tried to use the $INBOX_URL/_/text/config/raw endpoint, which
fell down when I tried to clone erol.kernel.org :x (but works on
lore)
I don't have List-Id: as a fallback, yet... Not sure if it's
really worth the effort, but it just creates a bogus
$inbox_name@$$.$(hostname).example.com address if curl fails on
the config URL.