unofficial mirror of meta@public-inbox.org
 help / color / mirror / Atom feed
* Occasional web view corruption (extra html escapes)
@ 2024-09-03 17:52 Konstantin Ryabitsev
  2024-09-03 18:42 ` Eric Wong
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Konstantin Ryabitsev @ 2024-09-03 17:52 UTC (permalink / raw)
  To: meta

Hello:

I've observed this a couple of times now when accessing individual message
views. For example, accessing this link:

https://lore.kernel.org/all/20240829184830.5861-1-djahchankoike@gmail.com/

Renders the output as follows in the webpage:

    From: Diogo Jahchan Koike <djahchankoike@gmail.com>
    To: "David S. Miller" <davem@davemloft.net>,
        Eric Dumazet <edumazet@google.com>,
        Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
        Christophe Leroy <christophe.leroy@csgroup.eu>,
        Maxime Chevallier <maxime.chevallier@bootlin.com>
    Cc: Diogo Jahchan Koike <djahchankoike@gmail.com>,
        syzbot+ec369e6d58e210135f71@syzkaller.appspotmail.com,
        netdev@vger.kernel.org, linux-kernel@vger.kernel.org
    Subject: [patch net-next v5] ethtool: pse-pd: move pse validation into set
    Date: Thu, 29 Aug 2024 15:48:27 -0300	[thread overview]
    raw)

In the HTML source, I see:

	<a
	href="?t=20240829184845"></a>From<a
	href="?t=20240829184845"></a>: <a
	href="?t=20240829184845"></a>Diogo<a
	href="?t=20240829184845"></a> <a
	href="?t=20240829184845"></a>Jahchan<a
	href="?t=20240829184845"></a> <a
	href="?t=20240829184845"></a>Koike<a
	href="?t=20240829184845"></a> &<a
	href="?t=20240829184845"></a>lt<a
	href="?t=20240829184845"></a>;<a
	href="?t=20240829184845"></a>djahchankoike<a
	href="?t=20240829184845"></a>@<a
	href="?t=20240829184845"></a>gmail<a
	href="?t=20240829184845"></a>.<a
	href="?t=20240829184845"></a>com<a
	href="?t=20240829184845"></a>&<a
	href="?t=20240829184845"></a>gt<a
	href="?t=20240829184845"></a>;

This is only fixed by terminating the public-inbox-httpd process and letting it respawn.
This affects *all* web views for all messages, so likely some kind of state bug?

Happy to help troubleshoot it further next time I catch it happening.

-K

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Occasional web view corruption (extra html escapes)
  2024-09-03 17:52 Occasional web view corruption (extra html escapes) Konstantin Ryabitsev
@ 2024-09-03 18:42 ` Eric Wong
  2024-09-03 19:18   ` Konstantin Ryabitsev
  2024-09-03 19:11 ` Eric Wong
  2024-09-06 22:20 ` Filip Hejsek
  2 siblings, 1 reply; 11+ messages in thread
From: Eric Wong @ 2024-09-03 18:42 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> This is only fixed by terminating the public-inbox-httpd process and letting it respawn.
> This affects *all* web views for all messages, so likely some kind of state bug?

Very strange, yes.  Is it a frequent occurence?  It only started
hapening recently?  It probably happened with some of the memory
fragmentation reduction changes...

Will investigate, not something I've ever seen...

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Occasional web view corruption (extra html escapes)
  2024-09-03 17:52 Occasional web view corruption (extra html escapes) Konstantin Ryabitsev
  2024-09-03 18:42 ` Eric Wong
@ 2024-09-03 19:11 ` Eric Wong
  2024-09-03 19:22   ` Konstantin Ryabitsev
  2024-09-06 22:20 ` Filip Hejsek
  2 siblings, 1 reply; 11+ messages in thread
From: Eric Wong @ 2024-09-03 19:11 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> In the HTML source, I see:
> 
> 	<a
> 	href="?t=20240829184845"></a>From<a

By any chance, do you have empty `address =' entries in the
public-inbox config file?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Occasional web view corruption (extra html escapes)
  2024-09-03 18:42 ` Eric Wong
@ 2024-09-03 19:18   ` Konstantin Ryabitsev
  0 siblings, 0 replies; 11+ messages in thread
From: Konstantin Ryabitsev @ 2024-09-03 19:18 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

On Tue, Sep 03, 2024 at 06:42:52PM GMT, Eric Wong wrote:
> Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> > This is only fixed by terminating the public-inbox-httpd process and letting it respawn.
> > This affects *all* web views for all messages, so likely some kind of state bug?
> 
> Very strange, yes.  Is it a frequent occurence?  It only started
> hapening recently?  It probably happened with some of the memory
> fragmentation reduction changes...

I've recently migrated all lore.kernel.org instances to RHEL9, switching to
the latest public-inbox master in the process -- so the best and the least
useful answer I can give you is "it started happening after I've changed
pretty much everything."

Thus far, I've seen this happen twice, both times on the same node.

-K

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Occasional web view corruption (extra html escapes)
  2024-09-03 19:11 ` Eric Wong
@ 2024-09-03 19:22   ` Konstantin Ryabitsev
  2024-09-04 13:14     ` Eric Wong
  0 siblings, 1 reply; 11+ messages in thread
From: Konstantin Ryabitsev @ 2024-09-03 19:22 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

On Tue, Sep 03, 2024 at 07:11:51PM GMT, Eric Wong wrote:
> Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> > In the HTML source, I see:
> > 
> > 	<a
> > 	href="?t=20240829184845"></a>From<a
> 
> By any chance, do you have empty `address =' entries in the
> public-inbox config file?

Doesn't appear to be the case. The config is:

    [include]
      path = /etc/public-inbox/config.include
    [publicinbox "tools"]
      address = tools@linux.kernel.org
      url = tools
      inboxdir = /srv/public-inbox/lore.kernel.org/tools
      indexlevel = basic
      newsgroup = org.kernel.linux.tools
      boost = 10
      listid = tools.linux.kernel.org
    (repeat for every list)

This is config.include:

    [publicinbox]
      wwwlisting=all
      grokmanifest=all
      indexBatchSize=128m
      nntpserver=nntp.lore.kernel.org
      css=/usr/local/share/publicinbox/contrib/css/216light.css media='"screen,print"'
      css=/usr/local/share/publicinbox/contrib/css/216dark.css media='"screen and (prefers-color-scheme:dark)"'
    [extindex "all"]
      topdir=/srv/public-inbox/extindex
      indexheader=boolean_term:forpatchid:X-For-Patch-ID:raw=1
      indexheader=boolean_term:changeid:X-Change-ID:raw=1

-K

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Occasional web view corruption (extra html escapes)
  2024-09-03 19:22   ` Konstantin Ryabitsev
@ 2024-09-04 13:14     ` Eric Wong
  2024-09-04 14:42       ` Konstantin Ryabitsev
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Wong @ 2024-09-04 13:14 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Tue, Sep 03, 2024 at 07:11:51PM GMT, Eric Wong wrote:
> > Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> > > In the HTML source, I see:
> > > 
> > > 	<a
> > > 	href="?t=20240829184845"></a>From<a
> > 
> > By any chance, do you have empty `address =' entries in the
> > public-inbox config file?
>
> Doesn't appear to be the case. The config is:
> 
>     [include]
>       path = /etc/public-inbox/config.include
>     [publicinbox "tools"]
>       address = tools@linux.kernel.org
>       url = tools
>       inboxdir = /srv/public-inbox/lore.kernel.org/tools
>       indexlevel = basic
>       newsgroup = org.kernel.linux.tools
>       boost = 10
>       listid = tools.linux.kernel.org
>     (repeat for every list)

OK, and nothing malformed or commented out in the address=
fields?  And no stray semi-colons or hash marks in any address
fields which comment out only the value?

Because having an empty address= and empty url= field in my
publicinbox.lkml section gives me something close (but not
exactly) to what you have:

<a
href="/?t=20240829184845"></a> &<a
href="/?t=20240829184845"></a>lt<a
href="/?t=20240829184845"></a>;<a

Note the href has a leading slash in my near-reproduction
but yours did not.

It's not a double-escaping problem, but the substituion is
is breaking already -escaped `&lt;' and `&gt;' apart with <a>
tags in between characters.  but I'm not sure how this is
happening to you if all your address fields look OK.

I'll filter out blank addresses in our config reader, but
I also wonder if there's anything else going on...

And all your writes to the config are via git-config?
(or public-inbox-config)  IOW, it's not possible for a
partially-written config file to be read because git-config
writes atomically via rename(2).

Fwiw, the buggy code would be in addr2urlmap called by
_msg_page_prepare in PublicInbox::View.
addr2urlmap() will escape any regexp metacharacters present in
addresses via quotemeta, so there's no chance of regexp
injection from a config.

(hopefully coherent, running on fumes due to real life things)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Occasional web view corruption (extra html escapes)
  2024-09-04 13:14     ` Eric Wong
@ 2024-09-04 14:42       ` Konstantin Ryabitsev
  0 siblings, 0 replies; 11+ messages in thread
From: Konstantin Ryabitsev @ 2024-09-04 14:42 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

On Wed, Sep 04, 2024 at 01:14:43PM GMT, Eric Wong wrote:
> OK, and nothing malformed or commented out in the address=
> fields?  And no stray semi-colons or hash marks in any address
> fields which comment out only the value?

Nope.

> Because having an empty address= and empty url= field in my
> publicinbox.lkml section gives me something close (but not
> exactly) to what you have:
> 
> <a
> href="/?t=20240829184845"></a> &<a
> href="/?t=20240829184845"></a>lt<a
> href="/?t=20240829184845"></a>;<a
> 
> Note the href has a leading slash in my near-reproduction
> but yours did not.
> 
> It's not a double-escaping problem, but the substituion is
> is breaking already -escaped `&lt;' and `&gt;' apart with <a>
> tags in between characters.  but I'm not sure how this is
> happening to you if all your address fields look OK.
> 
> I'll filter out blank addresses in our config reader, but
> I also wonder if there's anything else going on...

I think so, because the timestamp I have on the config file is Aug 20, and the
corruption manifested itself on Sep 2 or 3. Previously this happened on Aug
26, which is the last time I would have restarted the public-inbox-httpd
process. The only notable thing I currently see about these events is that
they were on a Monday 1 week apart, but it could have been an accident.

> And all your writes to the config are via git-config?

Yes, they are written as part of public-inbox-init. This is the chunk of code
where that happens:
https://git.kernel.org/pub/scm/utils/grokmirror/grokmirror.git/tree/grokmirror/pi_indexer.py#n164

> Fwiw, the buggy code would be in addr2urlmap called by
> _msg_page_prepare in PublicInbox::View.
> addr2urlmap() will escape any regexp metacharacters present in
> addresses via quotemeta, so there's no chance of regexp
> injection from a config.
> 
> (hopefully coherent, running on fumes due to real life things)

One thing I can do next time this happens is take that node out of rotation
and keep it in that state for additional debugging.

Thanks for your help!

-K

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Occasional web view corruption (extra html escapes)
  2024-09-03 17:52 Occasional web view corruption (extra html escapes) Konstantin Ryabitsev
  2024-09-03 18:42 ` Eric Wong
  2024-09-03 19:11 ` Eric Wong
@ 2024-09-06 22:20 ` Filip Hejsek
  2024-09-06 22:42   ` Eric Wong
  2 siblings, 1 reply; 11+ messages in thread
From: Filip Hejsek @ 2024-09-06 22:20 UTC (permalink / raw)
  To: Konstantin Ryabitsev, Eric Wong, meta

Hello,

I have figured out why this happens.

I have reproduced the bug by doing the following:
1. setup an instance of public-inbox and import some message into it
2. create extindex named all and add it to config
3. start public-inbox-httpd
4. directly open http://<server-address>/all/<msg-id>/

The server will enter the broken state if this is the first page loaded
from the server.

The issue occurs because of the following sequence of events:
1. WWW->call is called
2. after matching the URL, msg_page is called
3. msg_page validates the inbox name by calling invalid_inbox_mid
4. invalid_inbox_mid calls invalid_inbox
5. the name is looked up with lookup_name
6. because there is no inbox with that name, undef is returned
7. the name is looked uo with lookup_ei
8. the lookup succeeds
9. get_mid_html is called to generate the HTML page
10. addr2urlmap is used to construct a regex of known addresses
11. because no inbox has been instantiated, $cfg->{-by_addr} is empty
12. because of that, $re will an empty string
13. so the final regex is /\b()\b/, which matches every word boundary

 - Filip Hejsek


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Occasional web view corruption (extra html escapes)
  2024-09-06 22:20 ` Filip Hejsek
@ 2024-09-06 22:42   ` Eric Wong
  2024-09-06 23:29     ` Eric Wong
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Wong @ 2024-09-06 22:42 UTC (permalink / raw)
  To: Filip Hejsek; +Cc: Konstantin Ryabitsev, meta

Filip Hejsek <filip.hejsek@gmail.com> wrote:
> Hello,
> 
> I have figured out why this happens.
> 
> I have reproduced the bug by doing the following:
> 1. setup an instance of public-inbox and import some message into it
> 2. create extindex named all and add it to config
> 3. start public-inbox-httpd
> 4. directly open http://<server-address>/all/<msg-id>/

OK, but I wasn't able to reproduce it on my setup since I
had coderepos configured :x

Coderepos were hiding the bug from me.

> The server will enter the broken state if this is the first page loaded
> from the server.
> 
> The issue occurs because of the following sequence of events:
> 1. WWW->call is called
> 2. after matching the URL, msg_page is called
> 3. msg_page validates the inbox name by calling invalid_inbox_mid
> 4. invalid_inbox_mid calls invalid_inbox
> 5. the name is looked up with lookup_name
> 6. because there is no inbox with that name, undef is returned
> 7. the name is looked uo with lookup_ei
> 8. the lookup succeeds
> 9. get_mid_html is called to generate the HTML page
> 10. addr2urlmap is used to construct a regex of known addresses
> 11. because no inbox has been instantiated, $cfg->{-by_addr} is empty
> 12. because of that, $re will an empty string
> 13. so the final regex is /\b()\b/, which matches every word boundary

Thanks for the diagnosis and excellent explanation!

The reason I didn't hit this problem on my instance was because
configuring coderepos causes CodeSearch->load_coderepos to call
lookup_eidx_key, which calls _lookup_fill and populates {-by_addr}

So I think adding a guard to prevent using regexp from empty -by_addr
is a first step; but additional checks + preload fixes should
probably be needed...

Fixes coming...

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Occasional web view corruption (extra html escapes)
  2024-09-06 22:42   ` Eric Wong
@ 2024-09-06 23:29     ` Eric Wong
  2024-09-07  1:24       ` Konstantin Ryabitsev
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Wong @ 2024-09-06 23:29 UTC (permalink / raw)
  To: Filip Hejsek; +Cc: Konstantin Ryabitsev, meta

Eric Wong <e@80x24.org> wrote:
> Coderepos were hiding the bug from me.

coderepos + cindex, that is.

> So I think adding a guard to prevent using regexp from empty -by_addr
> is a first step; but additional checks + preload fixes should
> probably be needed...

This should fix the immediate issue on its own:
-----8<-----
Subject: [PATCH] view: fix addr2url mapping corruption

We must avoid generating a qr/\b()\b/ regexp which matches
every word boundary.  This is caused by a particular set of
circumstances for WWW instances:

1. extindex must be in use
2. cindex must NOT be in use OR WWW->preload wasn't used
   (custom .psgi or non-p-i-{httpd,netd} users)
3. first HTTP request hits /$EXTINDEX/$MSGID/
   (where $EXTINDEX is typically `all')

On extindex-using instances without a cindex configured, the
first HTTP request hitting the extindex encounters an empty
{-by_addr} hash table.  This empty {-by_addr} hash table causes
View->addr2urlmap() to return an all-matching regexp which
corrupts HTML when attempting address substitutions.

cindex-using instances avoid the problem by triggering
_fill_all() during PublicInbox::WWW->preload and ensuring
{-by_addr} of the PublicInbox::Config object is populated.

Thanks to Konstantin for the initial report and Filip for the
immensely helpful explanation of the problem.

Helped-by: Filip Hejsek <filip.hejsek@gmail.com>
Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Link: https://public-inbox.org/meta/20240903-brainy-lionfish-of-saturation-71ae1a@lemur/
Fixes: 48cbe0c3c8dc4d26 (www: linkify inbox addresses in To/Cc headers, 2024-01-09)
---
 lib/PublicInbox/View.pm | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/lib/PublicInbox/View.pm b/lib/PublicInbox/View.pm
index bc093a20..154e7537 100644
--- a/lib/PublicInbox/View.pm
+++ b/lib/PublicInbox/View.pm
@@ -203,8 +203,12 @@ sub addr2urlmap ($) {
 		my $tmp = $ctx->{www}->{pi_cfg}->{-addr2urlmap};
 		my @k = keys %$tmp; # random order
 		delete @$tmp{@k[0..3]} if scalar(@k) > 7;
-		my $re = join('|', map { quotemeta } keys %addr2url);
-		$tmp->{$key} = [ qr/\b($re)\b/i, \%addr2url ];
+		if (scalar keys %addr2url) {
+			my $re = join('|', map { quotemeta } keys %addr2url);
+			$tmp->{$key} = [ qr/\b($re)\b/i, \%addr2url ];
+		} else { # nothing? NUL should never match:
+			[ qr/(\0)/, { "\0" => './' } ];
+		}
 	};
 	@$ent;
 }

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: Occasional web view corruption (extra html escapes)
  2024-09-06 23:29     ` Eric Wong
@ 2024-09-07  1:24       ` Konstantin Ryabitsev
  0 siblings, 0 replies; 11+ messages in thread
From: Konstantin Ryabitsev @ 2024-09-07  1:24 UTC (permalink / raw)
  To: Eric Wong; +Cc: Filip Hejsek, meta

On Fri, Sep 06, 2024 at 11:29:03PM GMT, Eric Wong wrote:
> > So I think adding a guard to prevent using regexp from empty -by_addr
> > is a first step; but additional checks + preload fixes should
> > probably be needed...
> 
> This should fix the immediate issue on its own:

Deployed, thank you both!

-K

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2024-09-07  1:24 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-09-03 17:52 Occasional web view corruption (extra html escapes) Konstantin Ryabitsev
2024-09-03 18:42 ` Eric Wong
2024-09-03 19:18   ` Konstantin Ryabitsev
2024-09-03 19:11 ` Eric Wong
2024-09-03 19:22   ` Konstantin Ryabitsev
2024-09-04 13:14     ` Eric Wong
2024-09-04 14:42       ` Konstantin Ryabitsev
2024-09-06 22:20 ` Filip Hejsek
2024-09-06 22:42   ` Eric Wong
2024-09-06 23:29     ` Eric Wong
2024-09-07  1:24       ` Konstantin Ryabitsev

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).