unofficial mirror of meta@public-inbox.org
* WIP: searching all of lore
@ 2020-11-26 19:45 Eric Wong
  2020-11-28 22:34 ` Eric Wong
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Eric Wong @ 2020-11-26 19:45 UTC (permalink / raw)
  To: workflows; +Cc: meta

Requires Tor, for now:

http://rskvuqcfnfizkjg6h5jvovwb3wkikzcwskf54lfpymus6mxrzw67b5ad.onion/all/
http://lore.czquwvybam4bgbro.onion/all/

It seems v3 onions, with their longer URLs, are more secure these days,
but they require a newer Tor.

On Debian or RH-based systems, it's as easy as:
	<apt|yum> install tor torsocks

	# assuming w3m is installed:
	torsocks w3m http://rskvuqcfnfizkjg6h5jvovwb3wkikzcwskf54lfpymus6mxrzw67b5ad.onion/all/

Other browsers can configure SOCKS5 proxies to 127.0.0.1:9050
or be wrapped via torsocks (which uses LD_PRELOAD)

Disclaimers:

  I don't know much about Tor security (or security in general).
  I see Tor as an alternative to paying corrupt organizations
  (ICANN) and lets me self-host without a static IP address.

  I've also had numerous ISP and power outages this year,
  probably because more neighbors are home due to the pandemic,
  so don't expect 99.999% uptime, either.  And I'm a klutz who
  always trips over cables on (relatively) good days, and I've
  only had bad days since March :<

How to replicate
----------------

	# I'm using the following to update my mirror from lore.kernel.org
	# (see grokmirror docs for more).  Old command:
	grok-pull -v -c repos.conf
	public-inbox-index --all # update per-inbox indices

	# add "-L basic" or "-L medium" to either the -extindex or
	# -index command to reduce space requirements

	# The new command, not finalized yet:
	public-inbox-extindex --all -v /path/to/ALL


The following changes go in an otherwise boring ~/.public-inbox/config
(or whatever $PI_CONFIG is set to):

; not yet stable or finalized:
; this section allows, "all" is a special case, currently
[extindex "all"]
	topdir = /path/to/ALL

; these are already documented in public-inbox-config(5)
[publicinbox]
	; 'all' ignores domain name matching,
	; useful for inboxes served via multiple domains
	wwwlisting = all
	grokManifest = all

	; users with larger machines may want to bump this,
	; the default is for machines w/ 256MB RAM
	; (which I still use, sometimes)
	indexBatchSize = 100m
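
As a rough illustration of what a value like "100m" amounts to, here
is a minimal sketch.  It assumes indexBatchSize accepts git-config
style size suffixes (k, m, g as powers of 1024); that is an
assumption to check against public-inbox-config(5).  parse_size is a
hypothetical helper, not part of public-inbox.

```python
def parse_size(value):
    """Convert a size string such as "100m" to bytes.

    Assumption: suffixes follow the git-config convention, where
    k, m and g are powers of 1024.  A bare number is taken as bytes.
    """
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    value = value.strip().lower()
    if value and value[-1] in units:
        return int(value[:-1]) * units[value[-1]]
    return int(value)


batch_bytes = parse_size("100m")
```

Under that assumption, "100m" works out to 104857600 bytes.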

# I'm using the following to update from lore.kernel.org
# (see grokmirror docs for more)
grok-pull -v -c repos.conf
public-inbox-index --all # update per-inbox indices
public-inbox-extindex --all -v /path/to/ALL # index [extindex "all"]

This is running commit 95cb3e48fc5c4e847cdc111c2c8c9f0b70bdea56
git clone https://public-inbox.org/public-inbox.git

More changes coming (JMAP, speedups), and there's probably still
lots of stuff that's broken and needs fixing (including my brain :<)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: WIP: searching all of lore
  2020-11-26 19:45 WIP: searching all of lore Eric Wong
@ 2020-11-28 22:34 ` Eric Wong
  2020-12-05 20:07   ` Eric Wong
  2020-12-01 14:00 ` Konstantin Ryabitsev
  2021-03-17  7:11 ` Eric Wong
  2 siblings, 1 reply; 12+ messages in thread
From: Eric Wong @ 2020-11-28 22:34 UTC (permalink / raw)
  To: workflows; +Cc: meta

Eric Wong <e@80x24.org> wrote:
> Requires Tor, for now:

There's also an NNTP .onion and the ->ALL search speeds
up Xref for cross-post detection at:

  nntp://rskvuqcfnfizkjg6h5jvovwb3wkikzcwskf54lfpymus6mxrzw67b5ad.onion

  torsocks w3m nntp://rskvuqcfnfizkjg6h5jvovwb3wkikzcwskf54lfpymus6mxrzw67b5ad.onion/org.kernel.vger.workflows

v3 .onion URLs are annoyingly long :<

* Re: WIP: searching all of lore
  2020-11-26 19:45 WIP: searching all of lore Eric Wong
  2020-11-28 22:34 ` Eric Wong
@ 2020-12-01 14:00 ` Konstantin Ryabitsev
  2020-12-01 18:48   ` Eric Wong
  2021-03-17  7:11 ` Eric Wong
  2 siblings, 1 reply; 12+ messages in thread
From: Konstantin Ryabitsev @ 2020-12-01 14:00 UTC (permalink / raw)
  To: Eric Wong; +Cc: workflows, meta

On Thu, Nov 26, 2020 at 07:45:43PM +0000, Eric Wong wrote:
> Requires Tor, for now:
> 
> http://rskvuqcfnfizkjg6h5jvovwb3wkikzcwskf54lfpymus6mxrzw67b5ad.onion/all/
> http://lore.czquwvybam4bgbro.onion/all/

Thanks for this work, Eric, things are looking good in my tests, though
I uncovered a bunch of problems with b4 when used with torsocks. :)

When grabbing t.mbox.gz threads from /all, it appears to properly
reconstitute follow-ups from multiple mailing lists, correct? Is there a
way to "weight" different sources, so that when the same message-id
exist in multiple places, we can prefer one source over another? For
example, this is useful when we're trying to do DKIM validation and some
lists are known to mess that up, while others do the right thing.

Thanks again,
Konstantin

* Re: WIP: searching all of lore
  2020-12-01 14:00 ` Konstantin Ryabitsev
@ 2020-12-01 18:48   ` Eric Wong
  0 siblings, 0 replies; 12+ messages in thread
From: Eric Wong @ 2020-12-01 18:48 UTC (permalink / raw)
  To: workflows, meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Thu, Nov 26, 2020 at 07:45:43PM +0000, Eric Wong wrote:
> > Requires Tor, for now:
> > 
> > http://rskvuqcfnfizkjg6h5jvovwb3wkikzcwskf54lfpymus6mxrzw67b5ad.onion/all/
> > http://lore.czquwvybam4bgbro.onion/all/
> 
> Thanks for this work, Eric, things are looking good in my tests, though
> I uncovered a bunch of problems with b4 when used with torsocks. :)
> 
> When grabbing t.mbox.gz threads from /all, it appears to properly
> reconstitute follow-ups from multiple mailing lists, correct?

Yup, though some duplicates appear due to different mailing list-added
trailers.  Maybe some of the PublicInbox::Filter::* stuff (currently
only for -mda + -watch) can be applied to the indexing phase to better
dedupe and drop trailers
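
A minimal sketch of the "drop trailers" idea, not taken from
PublicInbox::Filter::*: it assumes a list footer starts at a line
made up only of underscores, as Mailman-style footers commonly do.
The pattern and the drop_list_trailer name are hypothetical.

```python
import re

# Assumption: a list-added footer begins with a line of 10+ underscores.
TRAILER_START = re.compile(r"^_{10,}[ \t]*$", re.MULTILINE)


def drop_list_trailer(body):
    """Return body with everything from the footer separator onward removed."""
    match = TRAILER_START.search(body)
    if match is None:
        return body  # no recognizable footer, leave the body untouched
    return body[:match.start()]


original = "patch text\n"
via_list = original + "_" * 47 + "\nfoo mailing list\n"
stripped = drop_list_trailer(via_list)
```

With the footer removed, both copies have the same body again and
would compare equal in a later content-based dedupe step.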

> Is there a
> way to "weight" different sources, so that when the same message-id
> exist in multiple places, we can prefer one source over another?

It indexes based on the order it iterates through the inboxes
and messages.  That usually follows the order in the config file,
especially if indexing is delayed.  Of course it's possible a
message can show up in a low-priority source first due to
network latency or outages (something I'm too familiar with :<).
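
The first-seen-wins behavior described above can be sketched as
follows.  This is only an illustration of the ordering effect, not
the extindex code; the inbox names, Message-IDs and both function
names are made up.

```python
def batch_arrivals(config_order, contents):
    """Delayed (batch) indexing: walk inboxes in config-file order."""
    return [(inbox, mid) for inbox in config_order for mid in contents[inbox]]


def index_first_seen(arrivals):
    """Map each Message-ID to the first inbox that delivered it."""
    chosen = {}
    for inbox, mid in arrivals:
        chosen.setdefault(mid, inbox)  # later copies never replace the first
    return chosen


# Hypothetical inbox names and Message-IDs, for illustration only.
contents = {
    "preferred": ["<a@example.com>"],
    "mangling": ["<a@example.com>", "<b@example.com>"],
}

# Batch run in config order: the inbox listed first wins for <a@example.com>.
batch = index_first_seen(batch_arrivals(["preferred", "mangling"], contents))

# Incremental run where the low-priority copy happened to arrive first.
live = index_first_seen([
    ("mangling", "<a@example.com>"),
    ("preferred", "<a@example.com>"),
])
```

In the batch case "preferred" is recorded for the shared Message-ID;
in the incremental case "mangling" is, purely because of arrival
order.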

I have an idea to fix that via --reindex which *might*
allow performance improvements on the Xapian side, too.

--reindex is another mind twister when dealing with multiple
histories compared to normal inboxes and will need a new
approach.  Been working on that and my head hurts :x

> For
> example, this is useful when we're trying to do DKIM validation and some
> lists are known to mess that up, while others do the right thing.

Right, though I think it's somewhat less necessary given how sensitive
PublicInbox::ContentHash is compared to just using the Message-ID to
dedupe...

One bad thing about it being too sensitive is NNTP speedups couldn't rely
solely on contents hashing because of mailing list trailers yesterday:

https://public-inbox.org/meta/20201130194201.GA6687@dcvr/
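
To illustrate the sensitivity trade-off: a rough sketch comparing
Message-ID dedupe with a content digest.  This is NOT the
PublicInbox::ContentHash algorithm; the choice of headers, the use
of SHA-256 and the content_key name are assumptions made only for
this example.

```python
import hashlib
from email import message_from_string


def content_key(raw):
    """Digest a few headers plus the body (illustrative, not ContentHash)."""
    msg = message_from_string(raw)
    digest = hashlib.sha256()
    for name in ("Subject", "From", "Message-ID"):
        digest.update(f"{name}:{msg.get(name, '')}\n".encode())
    digest.update(msg.get_payload().encode())
    return digest.hexdigest()


# Two made-up copies of one message; the second carries a list trailer.
BASE = (
    "From: a@example.com\n"
    "Subject: hi\n"
    "Message-ID: <1@example.com>\n"
    "\n"
    "body text\n"
)
WITH_TRAILER = BASE + "_______________\nfoo mailing list\n"

by_message_id = {message_from_string(r)["Message-ID"] for r in (BASE, WITH_TRAILER)}
by_content = {content_key(r) for r in (BASE, WITH_TRAILER)}
```

Message-ID dedupe sees one message here, while the content digest
sees two because of the trailer.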

> Thanks again,

You're welcome :>

* Re: WIP: searching all of lore
  2020-11-28 22:34 ` Eric Wong
@ 2020-12-05 20:07   ` Eric Wong
  2020-12-08 14:01     ` Konstantin Ryabitsev
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Wong @ 2020-12-05 20:07 UTC (permalink / raw)
  To: workflows; +Cc: meta

Eric Wong <e@80x24.org> wrote:
> Eric Wong <e@80x24.org> wrote:
> > Requires Tor, for now:

An IMAP .onion running the latest per-inbox subset search patches(*)

X=rskvuqcfnfizkjg6h5jvovwb3wkikzcwskf54lfpymus6mxrzw67b5ad.onion
torsocks mutt -f imap://$X/org.kernel.vger.workflows.0
torsocks mutt -f imap://$X/org.kernel.vger.linux-kernel.76

The ".0" and ".76" at the end are the mailbox slices; LKML is
currently at 77 slices of ~50K messages each (I decided to cap IMAP
mailbox size due to MUA/Maildir slowness on giant directories).
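
The slice numbering can be sketched like this.  It assumes a slice
holds exactly 50,000 messages and that a message number maps to a
slice by integer division; both are assumptions for illustration,
and the helper names and the message numbers used are hypothetical.

```python
SLICE_SIZE = 50_000  # assumption: "~50K" taken as exactly 50,000


def slice_mailbox(newsgroup, msg_num):
    """Name of the IMAP mailbox slice that would hold msg_num."""
    return f"{newsgroup}.{msg_num // SLICE_SIZE}"


def slice_count(total_messages):
    """Number of slices needed for total_messages (ceiling division)."""
    return -(-total_messages // SLICE_SIZE)


workflows = slice_mailbox("org.kernel.vger.workflows", 123)
lkml = slice_mailbox("org.kernel.vger.linux-kernel", 3_800_000)
lkml_slices = slice_count(3_800_001)
```

Under those assumptions, an inbox with a little over 3.8 million
messages ends up with 77 slices, numbered .0 through .76.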

Per-inbox search also uses subset search, so all the existing
inboxes should be searchable on an individual level, not just /all:
http://rskvuqcfnfizkjg6h5jvovwb3wkikzcwskf54lfpymus6mxrzw67b5ad.onion/lkml/
http://rskvuqcfnfizkjg6h5jvovwb3wkikzcwskf54lfpymus6mxrzw67b5ad.onion/workflows/
http://rskvuqcfnfizkjg6h5jvovwb3wkikzcwskf54lfpymus6mxrzw67b5ad.onion/netdev/
...

(*) https://public-inbox.org/meta/20201205101138.11973-1-e@80x24.org/

* Re: WIP: searching all of lore
  2020-12-05 20:07   ` Eric Wong
@ 2020-12-08 14:01     ` Konstantin Ryabitsev
  2020-12-08 18:02       ` Eric Wong
  0 siblings, 1 reply; 12+ messages in thread
From: Konstantin Ryabitsev @ 2020-12-08 14:01 UTC (permalink / raw)
  To: Eric Wong; +Cc: workflows, meta

On Sat, Dec 05, 2020 at 08:07:17PM +0000, Eric Wong wrote:
> Per-inbox search also uses subset search, so all the existing
> inboxes should be searchable on an individual level, not just /all:
> http://rskvuqcfnfizkjg6h5jvovwb3wkikzcwskf54lfpymus6mxrzw67b5ad.onion/lkml/
> http://rskvuqcfnfizkjg6h5jvovwb3wkikzcwskf54lfpymus6mxrzw67b5ad.onion/workflows/
> http://rskvuqcfnfizkjg6h5jvovwb3wkikzcwskf54lfpymus6mxrzw67b5ad.onion/netdev/

So, are things to the point where we only need a single xapian db for 
all lists, or do we still need to keep individual list indexes?

-K

* Re: WIP: searching all of lore
  2020-12-08 14:01     ` Konstantin Ryabitsev
@ 2020-12-08 18:02       ` Eric Wong
  2020-12-08 18:11         ` Konstantin Ryabitsev
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Wong @ 2020-12-08 18:02 UTC (permalink / raw)
  To: workflows, meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Sat, Dec 05, 2020 at 08:07:17PM +0000, Eric Wong wrote:
> > Per-inbox search also uses subset search, so all the existing
> > inboxes should be searchable on an individual level, not just /all:
> > http://rskvuqcfnfizkjg6h5jvovwb3wkikzcwskf54lfpymus6mxrzw67b5ad.onion/lkml/
> > http://rskvuqcfnfizkjg6h5jvovwb3wkikzcwskf54lfpymus6mxrzw67b5ad.onion/workflows/
> > http://rskvuqcfnfizkjg6h5jvovwb3wkikzcwskf54lfpymus6mxrzw67b5ad.onion/netdev/
> 
> So, are things to the point where we only need a single xapian db for 
> all lists, or do we still need to keep individual list indexes?

Only indexlevel=basic (sqlite) for individual lists.  This saves
a bunch of FDs and provides ~60G overall space savings (not compacted).

For all of lore:

	git + {over,msgmap}.sqlite3:		51G
	extindex "all"	over.sqlite3 + Xapian:	193G

* Re: WIP: searching all of lore
  2020-12-08 18:02       ` Eric Wong
@ 2020-12-08 18:11         ` Konstantin Ryabitsev
  0 siblings, 0 replies; 12+ messages in thread
From: Konstantin Ryabitsev @ 2020-12-08 18:11 UTC (permalink / raw)
  To: Eric Wong; +Cc: workflows, meta

On Tue, Dec 08, 2020 at 06:02:32PM +0000, Eric Wong wrote:
> > So, are things to the point where we only need a single xapian db 
> > for all lists, or do we still need to keep individual list indexes?
> 
> Only indexlevel=basic (sqlite) for individual lists.  This saves
> a bunch of FDs and provides ~60G overall space savings (not compacted).
> 
> For all of lore:
> 
> 	git + {over,msgmap}.sqlite3:		51G
> 	extindex "all"	over.sqlite3 + Xapian:	193G

Sweet, now I'm getting excited all over again. :)

-K

* Re: WIP: searching all of lore
  2020-11-26 19:45 WIP: searching all of lore Eric Wong
  2020-11-28 22:34 ` Eric Wong
  2020-12-01 14:00 ` Konstantin Ryabitsev
@ 2021-03-17  7:11 ` Eric Wong
  2021-03-17 13:27   ` Konstantin Ryabitsev
  2 siblings, 1 reply; 12+ messages in thread
From: Eric Wong @ 2021-03-17  7:11 UTC (permalink / raw)
  To: workflows; +Cc: meta

Eric Wong <e@80x24.org> wrote:
> Requires Tor, for now:
> 
> http://rskvuqcfnfizkjg6h5jvovwb3wkikzcwskf54lfpymus6mxrzw67b5ad.onion/all/
> http://lore.czquwvybam4bgbro.onion/all/

Also available without Tor:

	https://yhbt.net/lore/all/ + https://80x24.org/lore/all/
        (but no more reliable, since it's via ssh tunnels)

Still ironing out some UI bits, but at least "help" redirects
to the correct place.

* Re: WIP: searching all of lore
  2021-03-17  7:11 ` Eric Wong
@ 2021-03-17 13:27   ` Konstantin Ryabitsev
  2021-03-17 18:18     ` Eric Wong
  0 siblings, 1 reply; 12+ messages in thread
From: Konstantin Ryabitsev @ 2021-03-17 13:27 UTC (permalink / raw)
  To: Eric Wong; +Cc: workflows, meta

On Wed, Mar 17, 2021 at 01:11:16AM -0600, Eric Wong wrote:
> Eric Wong <e@80x24.org> wrote:
> > Requires Tor, for now:
> > 
> > http://rskvuqcfnfizkjg6h5jvovwb3wkikzcwskf54lfpymus6mxrzw67b5ad.onion/all/
> > http://lore.czquwvybam4bgbro.onion/all/
> 
> Also available without Tor:
> 
> 	https://yhbt.net/lore/all/ + https://80x24.org/lore/all/
>         (but no more reliable, since it's via ssh tunnels)

Looking good! I noticed that it doesn't "uniquify" the results. E.g. searching
for "lists.linux.dev" (just some uncommon wording I could think of) returns
multiple hits for the same message sent to multiple lists:

https://yhbt.net/lore/all/?q=lists.linux.dev

Is that intentional, or can this be tweaked to show a single result for the
same message-id?

-K

* Re: WIP: searching all of lore
  2021-03-17 13:27   ` Konstantin Ryabitsev
@ 2021-03-17 18:18     ` Eric Wong
  2021-03-17 18:37       ` Konstantin Ryabitsev
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Wong @ 2021-03-17 18:18 UTC (permalink / raw)
  To: workflows, meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> Looking good! I noticed that it doesn't "uniquify" the results. E.g. searching
> for "lists.linux.dev" (just some uncommon wording I could think of) returns
> multiple hits for the same message sent to multiple lists:
> 
> https://yhbt.net/lore/all/?q=lists.linux.dev
> 
> Is that intentional, or can this be tweaked to show a single result for the
> same message-id?

Not really.  At least for the summary search results, it makes
no sense:

	https://public-inbox.org/meta/20210317181408.9124-1-e@80x24.org/

The underlying cause that can be seen in
https://yhbt.net/lore/all/20210316102311.182375-1-gregkh@linuxfoundation.org/
is the Mailman-added signature for one of the posts.

I've been considering adding a "diff view" to more easily pick
out differences between messages with identical Message-ID with
subtly different content, but it could be expensive for PSGI...
I will probably prototype it in lei, first.
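
A minimal sketch of what such a "diff view" could boil down to,
using Python's difflib rather than anything from public-inbox or
lei.  The two message bodies and the diff_copies name are made up
for illustration.

```python
import difflib


def diff_copies(copy_a, copy_b, label_a="copy from list A",
                label_b="copy from list B"):
    """Unified diff between two copies of a message with the same Message-ID."""
    return list(difflib.unified_diff(
        copy_a.splitlines(),
        copy_b.splitlines(),
        fromfile=label_a,
        tofile=label_b,
        lineterm="",
    ))


plain = "Hello\n"
with_footer = "Hello\n_____\nlist footer\n"
diff = diff_copies(plain, with_footer)
```

Identical copies produce an empty diff, so only the list-added
lines show up as "+" entries.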

* Re: WIP: searching all of lore
  2021-03-17 18:18     ` Eric Wong
@ 2021-03-17 18:37       ` Konstantin Ryabitsev
  0 siblings, 0 replies; 12+ messages in thread
From: Konstantin Ryabitsev @ 2021-03-17 18:37 UTC (permalink / raw)
  To: Eric Wong; +Cc: workflows, meta

On Wed, Mar 17, 2021 at 08:18:43PM +0200, Eric Wong wrote:
> > Is that intentional, or can this be tweaked to show a single result for the
> > same message-id?
> 
> Not really.  At least for the summary search results, it makes
> no sense:
> 
> 	https://public-inbox.org/meta/20210317181408.9124-1-e@80x24.org/
> 
> The underlying cause that can be seen in
> https://yhbt.net/lore/all/20210316102311.182375-1-gregkh@linuxfoundation.org/
> is the Mailman-added signature for one of the posts.

Ack, hopefully we'll get all Mailman list managers to clue in and stop
mangling subject/bodies. It's kind of required for DKIM-signed messages
anyway.

> I've been considering adding a "diff view" to more easily pick
> out differences between messages with identical Message-ID with
> subtly different content, but it could be expensive for PSGI...
> I will probably prototype it in lei, first.

I was going to suggest "show the one for which the DKIM signature is valid"
but this is even more expensive. ;)

Thanks for your help.

-K
