* [PATCH] nntp: regexp always consumes rbuf if "\n" exists
  From: Eric Wong @ 2019-09-08 10:45 UTC
  To: meta

We don't want to get stuck in a state where we see "\n" via
index(), yet cannot consume rbuf in the while loop.  So tweak
the regexp to ensure we always consume rbuf.

I suspect this is what causes occasional 100% CPU usage of
-nntpd, but reproducing it has been difficult...
---
 lib/PublicInbox/NNTP.pm | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/PublicInbox/NNTP.pm b/lib/PublicInbox/NNTP.pm
index 41f1e9c..c049685 100644
--- a/lib/PublicInbox/NNTP.pm
+++ b/lib/PublicInbox/NNTP.pm
@@ -995,7 +995,7 @@ sub event_step {
         my $off = bytes::length($$rbuf);
         $r = $self->do_read($rbuf, LINE_MAX, $off) or return;
     }
-    while ($r > 0 && $$rbuf =~ s/\A[ \t\r\n]*([^\r\n]*)\r?\n//) {
+    while ($r > 0 && $$rbuf =~ s/\A[ \t]*([^\n]*?)\r?\n//) {
         my $line = $1;
         return $self->close if $line =~ /[[:cntrl:]]/s;
         my $t0 = now();
-- 
EW

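For anyone following the regexp change, below is a rough standalone sketch
(not part of the patch; the buffer contents are made up for illustration)
of the failure mode the old pattern appears to allow: a stray bare CR ahead
of the LF makes the substitution fail even though index() keeps finding
"\n", so event_step() keeps requeueing without ever consuming the buffer.

#!/usr/bin/perl -w
# illustration only: compare the old and new NNTP.pm line-consuming
# regexps against a buffer holding a bare CR (hypothetical input)
use strict;

my $old = qr/\A[ \t\r\n]*([^\r\n]*)\r?\n/;
my $new = qr/\A[ \t]*([^\n]*?)\r?\n/;

my $rbuf = "GROUP foo\rbar\r\n";   # note the stray "\r" mid-line

print "index finds \\n at: ", index($rbuf, "\n"), "\n";

my $b1 = $rbuf;
if ($b1 =~ s/$old//) {
    print "old regexp consumed, line=<$1>\n";
} else {
    print "old regexp did NOT consume; rbuf stays full => busy loop\n";
}

my $b2 = $rbuf;
if ($b2 =~ s/$new//) {
    # the captured line still contains "\r", so the real server
    # would then close the connection via the [[:cntrl:]] check
    print "new regexp consumed, line=<$1>\n";
}
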
* trying to figure out 100% CPU usage in nntpd...
  From: Eric Wong @ 2019-09-08 10:52 UTC
  To: meta

I've been having trouble reproducing this bug (but maybe summer
weather has been taking its toll, certainly has on my HW).

And it won't pin the process to any client, so other clients
continue to be served while it happens.  It'll just end up being
a bunch of epoll_wait(... timeout=0) calls (and no other
syscalls), which can drive up power bills :<

This doesn't show up with .onion instances either, since those
only see connections from tor running on localhost.  The problem
only happens on news.public-inbox.org, so it could have something
to do with less-reliable clients triggering our long_response
code path.

While debugging this, I've isolated it to the last ->requeue call
in NNTP.pm event_step by using write(2) syscalls to /dev/null so
I can strace things as-needed.  Showing "git diff -w" output so
the entire old event_step sub is shown.

The problem is that, somehow, $$rbuf has "\n" in it, so there's no
do_read call happening.  Yet the while loop regexp is never
consuming $$rbuf...  So I'm pretty sure the old regexp is bogus,
and the patch: 'nntp: regexp always consumes rbuf if "\n" exists'
will fix things.

diff --git a/lib/PublicInbox/NNTP.pm b/lib/PublicInbox/NNTP.pm
index 7c3f68a..ef050b9 100644
--- a/lib/PublicInbox/NNTP.pm
+++ b/lib/PublicInbox/NNTP.pm
@@ -976,44 +976,54 @@ sub err ($$;@) {
 sub out ($$;@) {
     my ($self, $fmt, @args) = @_;
     printf { $self->{nntpd}->{out} } $fmt."\n", @args;
 }
 
+use Data::Dumper;
+my $null;
+BEGIN {
+    open $null, '>>', '/dev/null' or die;
+    $null->autoflush(1);
+};
+
 # callback used by PublicInbox::DS for any (e)poll (in/out/hup/err)
 sub event_step {
     my ($self) = @_;
 
     return unless $self->flush_write && $self->{sock};
 
     update_idle_time($self);
     # only read more requests if we've drained the write buffer,
     # otherwise we can be buffering infinitely w/o backpressure
 
     my $rbuf = $self->{rbuf} // (\(my $x = ''));
     my $r = 1;
 
     if (index($$rbuf, "\n") < 0) {
         my $off = bytes::length($$rbuf);
         $r = $self->do_read($rbuf, LINE_MAX, $off) or return;
     }
 
     while ($r > 0 && $$rbuf =~ s/\A[ \t\r\n]*([^\r\n]*)\r?\n//) {
         my $line = $1;
         return $self->close if $line =~ /[[:cntrl:]]/s;
         my $t0 = now();
         my $fd = fileno($self->{sock});
         $r = eval { process_line($self, $line) };
         my $pending = $self->{wbuf} ? ' pending' : '';
         out($self, "[$fd] %s - %0.6f$pending", $line, now() - $t0);
     }
 
     return $self->close if $r < 0;
     my $len = bytes::length($$rbuf);
     return $self->close if ($len >= LINE_MAX);
     $self->rbuf_idle($rbuf);
     update_idle_time($self);
 
     # maybe there's more pipelined data, or we'll have
     # to register it for socket-readiness notifications
-    $self->requeue unless $self->{wbuf};
+    unless ($self->{wbuf}) {
+        print $null Dumper({r=>$r, rbuf=>$rbuf});
+        $self->requeue;
+    }
 }
 
 sub not_idle_long ($$) {

* Re: trying to figure out 100% CPU usage in nntpd...
  From: Konstantin Ryabitsev @ 2019-09-09 10:05 UTC
  To: Eric Wong
  Cc: meta

On Sun, Sep 08, 2019 at 10:52:43AM +0000, Eric Wong wrote:
> I've been having trouble reproducing this bug (but maybe summer
> weather has been taking its toll, certainly has on my HW).

I did notice the same thing after updating to what was the latest
master last week -- the nntpd process was always at 100% load or above
in htop.  I just applied the latest updates, so we'll see if there is
any change.

There also was a weird problem a couple of days ago where one of the
httpd daemons started returning "Internal Server Error" to all
requests.  Restarting public-inbox-httpd fixed the problem, but I am
not sure how I would troubleshoot the causes if it happens next time --
is there a way to enable error logging for the httpd daemon?

-K

* Re: trying to figure out 100% CPU usage in nntpd...
  From: Eric Wong @ 2019-09-09 17:53 UTC
  To: Konstantin Ryabitsev
  Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> There also was a weird problem a couple of days ago where one of the
> httpd daemons started returning "Internal Server Error" to all requests.
> Restarting public-inbox-httpd fixed the problem, but I am not sure how I
> would troubleshoot the causes if it happens next time -- is there a way
> to enable error logging for the httpd daemon?

That's a new one... I haven't seen any problems from -httpd myself
in ages.  So -httpd could not handle requests at all?

The daemons already spit errors to stderr, which typically ends up
in syslog via systemd.  So, that's the first place to look
(also "systemctl status $SERVICE"); anything in there?

I can usually figure everything out from strace/lsof on a worker
process and hitting it with some requests (SIGTTOU to decrement
workers down to one).

That said, out-of-FD/memory conditions might not always be logged
correctly to stderr and we need to fix that.  Also, right now the code
considers git-cat-file to be reliable, but I guess that wouldn't be the
case in disk failures, and perhaps timeouts will be necessary.

Maybe nginx/varnish logs would have something, too; but more likely
syslog.

Also, I would be curious how memory usage improves for you with some
of the new changes.  I don't think I've exceeded 100MB/worker this
year, but Email::MIME can be a pig with giant messages.

* Re: trying to figure out 100% CPU usage in nntpd...
  From: Konstantin Ryabitsev @ 2019-09-10 8:38 UTC
  To: Eric Wong
  Cc: meta

On Mon, Sep 09, 2019 at 05:53:41PM +0000, Eric Wong wrote:
> Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> > There also was a weird problem a couple of days ago where one of the
> > httpd daemons started returning "Internal Server Error" to all requests.
> > Restarting public-inbox-httpd fixed the problem, but I am not sure how I
> > would troubleshoot the causes if it happens next time -- is there a way
> > to enable error logging for the httpd daemon?
>
> That's a new one... I haven't seen any problems from -httpd myself
> in ages.  So -httpd could not handle requests at all?
>
> The daemons already spit errors to stderr, which typically ends up
> in syslog via systemd.  So, that's the first place to look
> (also "systemctl status $SERVICE"); anything in there?

Bah, I hadn't even considered looking in syslog.  I blame jet lag.  Well,
the explanation is simple, then:

Sep  7 22:22:35 lore.kernel.org public-inbox-httpd: pipe failed: Too many open files at /usr/local/share/perl5/PublicInbox/Git.pm line 210.
Sep  7 22:22:36 lore.kernel.org public-inbox-httpd: pipe: Too many open files
Sep  7 22:23:01 lore.kernel.org public-inbox-httpd: pipe: Too many open files
Sep  7 22:23:02 lore.kernel.org public-inbox-httpd: pipe: Too many open files
Sep  7 22:23:02 lore.kernel.org public-inbox-httpd: pipe: Too many open files

I see that there's already a LimitNOFILE=30000 -- is exhausting 30,000
open files a reasonable sort of thing for a system as busy as
lore.kernel.org, or should we look for a leak?  Currently:

publici+ 13125     1  0  52297  25096  0 Sep09 ?  00:00:00 /usr/bin/perl -w /usr/local/bin/public-inbox-httpd -W4 -1 /var/log/public-inbox/httpd.out.log
publici+ 13128 13125  1 200056 591376  4 Sep09 ?  00:21:06 /usr/bin/perl -w /usr/local/bin/public-inbox-httpd -W4 -1 /var/log/public-inbox/httpd.out.log
publici+ 13129 13125  1 117789 262000  2 Sep09 ?  00:16:09 /usr/bin/perl -w /usr/local/bin/public-inbox-httpd -W4 -1 /var/log/public-inbox/httpd.out.log
publici+ 13130 13125  0 103569 204772  9 Sep09 ?  00:08:06 /usr/bin/perl -w /usr/local/bin/public-inbox-httpd -W4 -1 /var/log/public-inbox/httpd.out.log
publici+ 13131 13125  0 101510 199268  1 Sep09 ?  00:03:33 /usr/bin/perl -w /usr/local/bin/public-inbox-httpd -W4 -1 /var/log/public-inbox/httpd.out.log

# ls /proc/13125/fd/ | wc -l
9
# ls /proc/13128/fd/ | wc -l
17776
# ls /proc/13129/fd/ | wc -l
3937
# ls /proc/13130/fd/ | wc -l
648
# ls /proc/13131/fd/ | wc -l
68

17776 since Sep 09 seems... fairly excessive.  Most of them are pipes:

# lsof -p 13128 | grep pipe | wc -l
14741

a bunch of them are to (deleted) entries in /tmp:

# lsof -p 13128 | grep deleted | wc -l
3015

(e.g.):

public-in 13128 publicinbox  45u  REG 259,4  984  4200094 /tmp/PerlIO_yd43mN (deleted)
public-in 13128 publicinbox  46r FIFO   0,9  0t0 21818751 pipe
public-in 13128 publicinbox  47u  REG 259,4  984  4200096 /tmp/PerlIO_siY3Pb (deleted)
public-in 13128 publicinbox  48r FIFO   0,9  0t0 21818758 pipe
public-in 13128 publicinbox  49u  REG 259,4  984  4200100 /tmp/PerlIO_WyIaaC (deleted)

It does seem like there's perhaps a leak somewhere?

-K

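As an aside, tallying a worker's descriptors by type is quicker than
repeated lsof greps.  Here is a small throwaway Perl sketch (not part of
public-inbox; run it as root or as the same user, with the worker PID
you want to inspect):

#!/usr/bin/perl -w
# rough helper: classify /proc/$pid/fd entries (pipes, sockets,
# deleted tmpfiles, everything else)
use strict;

my $pid = shift or die "usage: $0 PID\n";
my %count;
opendir(my $dh, "/proc/$pid/fd") or die "opendir: $!\n";
for my $fd (grep { /\A[0-9]+\z/ } readdir($dh)) {
    my $dst = readlink("/proc/$pid/fd/$fd") // next;
    if ($dst =~ /\Apipe:/) { $count{pipe}++ }
    elsif ($dst =~ /\Asocket:/) { $count{socket}++ }
    elsif ($dst =~ /\(deleted\)\z/) { $count{'deleted file'}++ }
    else { $count{other}++ }
}
printf "%6d  %s\n", $count{$_}, $_ for sort keys %count;
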
* Re: trying to figure out 100% CPU usage in nntpd...
  From: Eric Wong @ 2019-09-10 18:12 UTC
  To: Konstantin Ryabitsev
  Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Mon, Sep 09, 2019 at 05:53:41PM +0000, Eric Wong wrote:
> > Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> > > There also was a weird problem a couple of days ago where one of the
> > > httpd daemons started returning "Internal Server Error" to all requests.
> > > Restarting public-inbox-httpd fixed the problem, but I am not sure how I
> > > would troubleshoot the causes if it happens next time -- is there a way
> > > to enable error logging for the httpd daemon?
> >
> > That's a new one... I haven't seen any problems from -httpd myself
> > in ages.  So -httpd could not handle requests at all?
> >
> > The daemons already spit errors to stderr, which typically ends up
> > in syslog via systemd.  So, that's the first place to look
> > (also "systemctl status $SERVICE"); anything in there?
>
> Bah, I hadn't even considered looking in syslog.  I blame jet lag.  Well,
> the explanation is simple, then:
>
> Sep  7 22:22:35 lore.kernel.org public-inbox-httpd: pipe failed: Too many open files at /usr/local/share/perl5/PublicInbox/Git.pm line 210.
> Sep  7 22:22:36 lore.kernel.org public-inbox-httpd: pipe: Too many open files
> Sep  7 22:23:01 lore.kernel.org public-inbox-httpd: pipe: Too many open files
> Sep  7 22:23:02 lore.kernel.org public-inbox-httpd: pipe: Too many open files
> Sep  7 22:23:02 lore.kernel.org public-inbox-httpd: pipe: Too many open files
>
> I see that there's already a LimitNOFILE=30000 -- is exhausting 30,000
> open files a reasonable sort of thing for a system as busy as
> lore.kernel.org, or should we look for a leak?  Currently:
>
> publici+ 13125     1  0  52297  25096  0 Sep09 ?  00:00:00 /usr/bin/perl -w /usr/local/bin/public-inbox-httpd -W4 -1 /var/log/public-inbox/httpd.out.log
> publici+ 13128 13125  1 200056 591376  4 Sep09 ?  00:21:06 /usr/bin/perl -w /usr/local/bin/public-inbox-httpd -W4 -1 /var/log/public-inbox/httpd.out.log
> publici+ 13129 13125  1 117789 262000  2 Sep09 ?  00:16:09 /usr/bin/perl -w /usr/local/bin/public-inbox-httpd -W4 -1 /var/log/public-inbox/httpd.out.log
> publici+ 13130 13125  0 103569 204772  9 Sep09 ?  00:08:06 /usr/bin/perl -w /usr/local/bin/public-inbox-httpd -W4 -1 /var/log/public-inbox/httpd.out.log
> publici+ 13131 13125  0 101510 199268  1 Sep09 ?  00:03:33 /usr/bin/perl -w /usr/local/bin/public-inbox-httpd -W4 -1 /var/log/public-inbox/httpd.out.log

The overall limit of open files could be higher (it's just limited by
RAM), but the balance of pipes/unlinked/other (sockets) seems way off.
I wonder if it's some bad interaction with hitting pipe-user-pages-soft
(see bottom), too...

> # ls /proc/13125/fd/ | wc -l
> 9
> # ls /proc/13128/fd/ | wc -l
> 17776
> # ls /proc/13129/fd/ | wc -l
> 3937
> # ls /proc/13130/fd/ | wc -l
> 648
> # ls /proc/13131/fd/ | wc -l
> 68

OK, there is a correlation between CPU time and open FDs.  The poor
load distribution (based on CPU time) between processes sucks, but
that's an unrelated problem [1].

> 17776 since Sep 09 seems... fairly excessive.
> Most of them are pipes:
>
> # lsof -p 13128 | grep pipe | wc -l
> 14741
>
> a bunch of them are to (deleted) entries in /tmp:
>
> # lsof -p 13128 | grep deleted | wc -l
> 3015
>
> (e.g.):
>
> public-in 13128 publicinbox  45u  REG 259,4  984  4200094 /tmp/PerlIO_yd43mN (deleted)
> public-in 13128 publicinbox  46r FIFO   0,9  0t0 21818751 pipe
> public-in 13128 publicinbox  47u  REG 259,4  984  4200096 /tmp/PerlIO_siY3Pb (deleted)
> public-in 13128 publicinbox  48r FIFO   0,9  0t0 21818758 pipe
> public-in 13128 publicinbox  49u  REG 259,4  984  4200100 /tmp/PerlIO_WyIaaC (deleted)
>
> It does seem like there's perhaps a leak somewhere?

Probably.  Not seeing any of that on my (smaller) instances;
but those -httpd haven't been restarted in weeks/months.

The "PerlIO_" prefix is created from open(..., '+>', undef), so
it has to be one of:

1. POST bodies for git http-backend
2. git cat-file --batch-check stderr
3. experimental ViewVCS|SolverGit which isn't configured on lore :)

= git-http-backend

Looking at the size (984), PerlIO_* prefix and proximity of FD
numbers, I think those lines above are git-http-backend POST bodies.

Any git-http-backend stuck from people fetching/cloning?  If so,
fd=0 of those should correspond to the inodes of those unlinked
PerlIO files.

This is -httpd writing to varnish, still, right?  Anything from
varnish error logs?

3K git-http-backend processes is a lot; the Qspawn mechanism limits it
to 32 processes by default (same as git-daemon), so it shouldn't hit
RLIMIT_NPROC.  Qspawn also won't create the pipe until right before
spawning git-http-backend.  The unlinked file is created once the POST
request hits.  And 3K git-http-backend would only be a 1:1 ratio of
pipes to unlinked files, though, so that's over 10K pipes still
unaccounted for...

= git cat-file

There can be two cat-file processes (--batch and --batch-check) for
every inbox in every -httpd worker.  Every cat-file process uses 2
pipes (public-inbox closes the ends it hands to git), and the
--batch-check ones use an additional unlinked regular file for stderr.

So: 80 inboxes * 2 cat-file * 2 pipes is only 320 pipes in -httpd.
And I'm not sure if 160 cat-file processes would cause RLIMIT_NPROC
problems; probably not...

Fwiw, while the new DS code also uses temporary files for output
buffering, those have the "wbuf-" prefix and not PerlIO_*, since
'open(FH, "+>>", undef)' didn't work until recently:
https://rt.perl.org/Ticket/Display.html?id=134221

I've been meaning to expand dtas-graph to include temporary file and
socket connectivity, too, but it's already pretty useful at graphing
pipe connectivity between processes:
https://80x24.org/dtas-all/b243086/s/?b=perl/dtas-graph
And it can help find unconnected pipes.

= /proc/sys/fs/pipe-user-pages-soft

I also just remembered this knob, which you could hit from just
cat-file, and I seem to remember some bugs in older kernels when pipe
limits are hit.

For httpd with 80 inboxes:

  160 pipes * 4 workers * 16 pages = 10240 pages for --batch
  160 pipes * 4 workers *  1 page  =   640 pages for --batch-check

That's already 10880 pages for cat-file alone, despite only
F_SETPIPE_SZ-ing to a single page for batch-check (the default is 16
pages for 65536 bytes).  And there are still pipes for nntpd (not sure
how many workers there), and various internal self-pipe stuff
(single-page), some of which can be made eventfd...

/proc/sys/fs/pipe-user-pages-soft seems to default to 16384 regardless
of system RAM, and I've needed to increase it in other places for
different projects.
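To check that knob against reality, here is a rough sketch (not part of
public-inbox, and only an estimate): it counts the distinct pipes owned
by one user's processes and compares a worst-case page estimate against
pipe-user-pages-soft.  It assumes the default 16-page pipe size, so
pipes shrunk via F_SETPIPE_SZ are over-counted.

#!/usr/bin/perl -w
# rough estimate: distinct pipe inodes across a user's processes vs.
# the fs.pipe-user-pages-soft limit (run as root for full visibility)
use strict;

my $user = shift or die "usage: $0 USERNAME\n";
my $uid = getpwnam($user) // die "no such user: $user\n";
my %pipes; # pipe inode => 1

for my $pid (glob '/proc/[0-9]*') {
    my @st = stat $pid or next;
    next unless $st[4] == $uid;        # only this user's processes
    for my $fd (glob "$pid/fd/*") {
        my $dst = readlink($fd) // next;
        $pipes{$1} = 1 if $dst =~ /\Apipe:\[([0-9]+)\]\z/;
    }
}

open(my $fh, '<', '/proc/sys/fs/pipe-user-pages-soft') or die $!;
chomp(my $soft = <$fh>);

my $est = keys(%pipes) * 16;           # worst case: 16 pages (64K) per pipe
printf "%d pipes, ~%d pages worst-case (soft limit: %s pages)\n",
    scalar keys %pipes, $est, $soft;
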
For anybody using SQLite, I could see about supporting a single
cat-file process in a repo where every inbox is an alternate of a
main one...

[1] maybe a multiaccept knob can improve load distribution:
    https://public-inbox.org/meta/20190513025610.30378-1-e@80x24.org/

* httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]
  From: Eric Wong @ 2019-09-11 2:22 UTC
  To: Konstantin Ryabitsev
  Cc: meta

Btw, some of the changes in public-inbox to use EPOLLET /
EPOLLONESHOT would make it more sensitive to kernel bugs,
and missed wakeups could cause resource exhaustion...

I just noticed this now:
https://lore.kernel.org/lkml/20190904175303.918854334@linuxfoundation.org/

And I remember I found another mm-compaction-related issue
several years back (early 3.x) which caused epoll wakeups to be
missed...

* Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]
  From: Konstantin Ryabitsev @ 2019-09-11 10:24 UTC
  To: Eric Wong
  Cc: meta

On Wed, Sep 11, 2019 at 02:22:15AM +0000, Eric Wong wrote:
> Btw, some of the changes in public-inbox to use EPOLLET /
> EPOLLONESHOT would make it more sensitive to kernel bugs,
> and missed wakeups could cause resource exhaustion...
>
> I just noticed this now:
> https://lore.kernel.org/lkml/20190904175303.918854334@linuxfoundation.org/
>
> And I remember I found another mm-compaction-related issue
> several years back (early 3.x) which caused epoll wakeups to be
> missed...

To give some more data points, downgrading to f4f0a3be still shows a
number of /tmp/PerlIO* (deleted) entries, but the number of pipes stays
the same over time.  If I switch to the latest master, the number of
broken pipes grows steadily following each git pull (in the hundreds
after 10 minutes of running).

I've downgraded lore.kernel.org to f4f0a3be for now to avoid the
resource exhaustion.

-K

* Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]
  From: Eric Wong @ 2019-09-11 17:12 UTC
  To: Konstantin Ryabitsev
  Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> To give some more data points, downgrading to f4f0a3be still shows a
> number of /tmp/PerlIO* (deleted) entries, but the number of pipes stays
> the same over time.  If I switch to the latest master, the number of
> broken pipes grows steadily following each git pull (in the hundreds
> after 10 minutes of running).

Thanks for that info, did those deleted entries eventually go
away (perhaps after several minutes)?

Kinda difficult to tie those to the corresponding TCP connection atm,
though.  I should probably put the TCP socket inode number into the
pre-unlink tempfile name to make that relationship traceable.

> I've downgraded lore.kernel.org to f4f0a3be for now to avoid the
> resource exhaustion.

I'll try to reproduce it later today or this week.

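A rough sketch of that tempfile-naming idea (illustration only; the
helper name and call site are hypothetical, not the eventual
public-inbox change): on Linux, stat() on the client socket returns the
same inode number that /proc and lsof show as socket:[INODE], so
folding it into the tempfile name lets lsof output be matched back to a
TCP connection.

# illustration: embed the client socket's inode in the tempfile name
# so "lsof" output can be tied back to a TCP connection
use File::Temp ();

sub input_tmpfile_for {
    my ($client_sock) = @_;
    my $ino = (stat($client_sock))[1]; # matches socket:[$ino] in /proc
    File::Temp->new(TEMPLATE => "git-http.input.sock-$ino-XXXX",
                    TMPDIR => 1, UNLINK => 1);
}
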
* Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]
  From: Konstantin Ryabitsev @ 2019-09-11 17:36 UTC
  To: Eric Wong
  Cc: meta

On Wed, Sep 11, 2019 at 05:12:50PM +0000, Eric Wong wrote:
> Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> > To give some more data points, downgrading to f4f0a3be still shows a
> > number of /tmp/PerlIO* (deleted) entries, but the number of pipes stays
> > the same over time.  If I switch to the latest master, the number of
> > broken pipes grows steadily following each git pull (in the hundreds
> > after 10 minutes of running).
>
> Thanks for that info, did those deleted entries eventually go
> away (perhaps after several minutes)?

They don't appear to go away -- since 10:23 UTC earlier today, they
accumulated over 2,400 entries:

# ls -al /proc/{2103,2104,2105,2106}/fd | grep deleted | wc -l
2427
# ls -al /proc/{2103,2104,2105,2106}/fd | grep pipe | wc -l
26

Curiously, I also have this datapoint that may or may not be making
things more confusing. :)

# ls -al /proc/{2103,2104,2105,2106}/fd | grep deleted | awk '{print $8}' | sort | uniq -c
    695 10:24
    356 10:27
    843 14:45
    175 14:46
      6 14:50
    372 17:19
      4 17:20
     11 17:23

So, they appear to show up in chunks and hang around together.

For nginx configs, this is the relevant part:

server {
    listen *:80 default_server;
    proxy_buffering off;

    location ~ ^(.*/(HEAD|info/refs|objects/info/[^/]+|git-(upload|receive)-pack))$ {
        proxy_pass http://localhost:8080;
        proxy_read_timeout 90;
        proxy_connect_timeout 90;
        proxy_redirect off;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $http_x_forwarded_proto;
        proxy_set_header X-Forwarded-Port $server_port;
    }
}

When setting up lore, I also considered serving git via git-http-backend
instead of passing it via public-inbox-httpd -- the only reason we
didn't try that was because it complicated SELinux bits a bit.  Can you
think of a reason why we shouldn't serve git requests directly via
git-http-backend?

Thanks for your help.

-K

* Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]
  From: Eric Wong @ 2019-09-12 0:05 UTC
  To: Konstantin Ryabitsev
  Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Wed, Sep 11, 2019 at 05:12:50PM +0000, Eric Wong wrote:
> > Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> > > To give some more data points, downgrading to f4f0a3be still shows a
> > > number of /tmp/PerlIO* (deleted) entries, but the number of pipes stays
> > > the same over time.  If I switch to the latest master, the number of
> > > broken pipes grows steadily following each git pull (in the hundreds
> > > after 10 minutes of running).
> >
> > Thanks for that info, did those deleted entries eventually go
> > away (perhaps after several minutes)?
>
> They don't appear to go away -- since 10:23 UTC earlier today, they
> accumulated over 2,400 entries:
>
> # ls -al /proc/{2103,2104,2105,2106}/fd | grep deleted | wc -l
> 2427
> # ls -al /proc/{2103,2104,2105,2106}/fd | grep pipe | wc -l
> 26
>
> Curiously, I also have this datapoint that may or may not be making
> things more confusing. :)
>
> # ls -al /proc/{2103,2104,2105,2106}/fd | grep deleted | awk '{print $8}' | sort | uniq -c
>     695 10:24
>     356 10:27
>     843 14:45
>     175 14:46
>       6 14:50
>     372 17:19
>       4 17:20
>      11 17:23
>
> So, they appear to show up in chunks and hang around together.

Any other cronjobs or events which may fire around then?
Anything in the nginx error logs?

I've never seen anything like this.  It should not matter, as HTTP
should be HTTP; but I've always had -httpd talking to varnish (mainly
out of laziness -- I've never used nginx with public-inbox myself --
but also for ease of documentation, since the varnish-4.vcl example
teaches Varnish to only cache GET and HEAD requests).

> When setting up lore, I also considered serving git via git-http-backend
> instead of passing it via public-inbox-httpd -- the only reason we
> didn't try that was because it complicated SELinux bits a bit.  Can you
> think of a reason why we shouldn't serve git requests directly via
> git-http-backend?

I didn't realize nginx added the ability to spawn CGI processes
(but I haven't really looked at nginx this decade).

public-inbox-httpd has the qspawn/limiter stuff and the same
32-process default limit as git-daemon.  I actually limit it to 1-2
processes depending on the size of the inbox, given the tiny VPS which
powers public-inbox.org.  It's all documented in the -config manpage
("NAMED LIMITER"/"httpbackendmax").  My favorite feature is being able
to configure different inboxes for different queues, so clone requests
for large inboxes won't starve requests to small inboxes.

My goal is to allow -httpd to do HTTPS and proxy to varnish soonish
(eliminating the need for nginx or similar), and eventually eliminate
the need for varnish entirely.

> Thanks for your help.

No problem; thanks for the extra info.  Will try to finish my current
(non-public) tasks so I can dig into this ASAP.

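For reference, the named-limiter setup mentioned above looks roughly
like the following in ~/.public-inbox/config.  The inbox name and the
limit of 2 are placeholders; see public-inbox-config(5) for the
authoritative syntax.

[limiter "clones"]
	; clone/fetch traffic queued on this limiter runs at most
	; 2 git-http-backend processes instead of the default 32
	max = 2

[publicinbox "lkml"]
	httpbackendmax = clones
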
* Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]
  From: Eric Wong @ 2019-09-12 2:49 UTC
  To: Konstantin Ryabitsev
  Cc: meta

One more thing, are you running any extra middlewares in the
.psgi file?  Thanks.

* Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]
  From: Eric Wong @ 2019-09-12 8:35 UTC
  To: Konstantin Ryabitsev
  Cc: meta

Eric Wong <e@80x24.org> wrote:
> One more thing, are you running any extra middlewares in the
> .psgi file?  Thanks.

That's probably not it, I suspected the non-fileno path was
being hit, but I just tested a debug change on top of
b7cfd5aeff4b6b316b61b327af9c144776d77225 (branch: "unlink")
("tmpfile: support O_APPEND and use it in DS::tmpio")
to fake the presence of a middleware wrapping psgi.input.

I've also pushed out an "unlink-fix" branch on top of
f4f0a3be0864721d90f9557ffe1c513b0289a74b which only features one
bugfix (which wouldn't affect you), as well as the tmpfile change to
give more meaningful names:

  solvergit: ensure Qspawn usage doesn't unref update-index stdin
  tmpfile: give temporary files meaningful names

Note: if reusing an existing working tree, be sure to rerun
"perl Makefile.PL", since new .pm files won't get picked up otherwise.

# force creation of extra temporary file for psgi.input
# in case some middleware overrides the psgi.input we
# set in HTTP.pm:
--------8<------
diff --git a/lib/PublicInbox/GitHTTPBackend.pm b/lib/PublicInbox/GitHTTPBackend.pm
index a8337035..2eaf221e 100644
--- a/lib/PublicInbox/GitHTTPBackend.pm
+++ b/lib/PublicInbox/GitHTTPBackend.pm
@@ -216,7 +216,7 @@ sub input_prepare {
     my $input = $env->{'psgi.input'};
     my $fd = eval { fileno($input) };
-    if (defined $fd && $fd >= 0) {
+    if (0 && defined $fd && $fd >= 0) {
         return { 0 => $fd };
     }
     my $id = "git-http.input.$env->{REMOTE_HOST}:$env->{REMOTE_PORT}";

* Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]
  From: Konstantin Ryabitsev @ 2019-09-12 11:37 UTC
  To: Eric Wong
  Cc: meta

On Thu, Sep 12, 2019 at 08:35:03AM +0000, Eric Wong wrote:
> Eric Wong <e@80x24.org> wrote:
> > One more thing, are you running any extra middlewares in the
> > .psgi file?  Thanks.

No, it's just vanilla what comes with the source.

> That's probably not it, I suspected the non-fileno path was
> being hit, but I just tested a debug change on top of
> b7cfd5aeff4b6b316b61b327af9c144776d77225 (branch: "unlink")
> ("tmpfile: support O_APPEND and use it in DS::tmpio")
> to fake the presence of a middleware wrapping psgi.input.

I sent you a dump of lsof -p of all 4 processes after about 20 minutes
of running.  For another data point, the daemon was running in
SELinux-permissive mode, to make sure unlinks aren't failing because of
any permission errors.

Let me know if you would like any further info.

Best,
-K

* Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]
  From: Eric Wong @ 2019-09-13 3:12 UTC
  To: Konstantin Ryabitsev
  Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Thu, Sep 12, 2019 at 08:35:03AM +0000, Eric Wong wrote:
> > Eric Wong <e@80x24.org> wrote:
> > > One more thing, are you running any extra middlewares in the
> > > .psgi file?  Thanks.
>
> No, it's just vanilla what comes with the source.

OK, and Perl 5.16.3 from CentOS 7?  (4:5.16.3-294.el7_6 RPM)

> > That's probably not it, I suspected the non-fileno path was
> > being hit, but I just tested a debug change on top of
> > b7cfd5aeff4b6b316b61b327af9c144776d77225 (branch: "unlink")
> > ("tmpfile: support O_APPEND and use it in DS::tmpio")
> > to fake the presence of a middleware wrapping psgi.input.
>
> I sent you a dump of lsof -p of all 4 processes after about 20 minutes
> of running.  For another data point, the daemon was running in
> SELinux-permissive mode, to make sure unlinks aren't failing because of
> any permission errors.

It looks like there's a Perl reference leak (cycle) of some sort
holding on to FDs, since you have lots of input files and pipes, yet
only one established IPv4 connection.  And the inodes encoded into the
filenames don't even point to the connected socket...

However, I'm not able to reproduce it on my CentOS 7 VM, which has
nginx 1.12.2.  I don't think nginx is a factor in this, since
public-inbox-httpd is clearly not even holding TCP sockets open.

Not at all familiar with SELinux, but I'm just using the defaults
CentOS comes with and running both nginx + public-inbox-httpd as a
regular user.

That "if (0..." GitHTTPBackend patch definitely isn't needed for
testing anymore and only makes FD exhaustion happen sooner.

> Let me know if you would like any further info.

If there's a reference leak somewhere, this could also be part of the
high memory use you showed us a few months ago.  Dunno if you had many
FDs back then.  I could see about adding explicit close() calls in a
few places, but that would make a corresponding memory leak even
harder to notice.

I pushed out two patches to the "unlink" branch which may be able to
reproduce the issue on your end (I see nothing out of the ordinary on
my slow CentOS 7 VM or Debian machines/VMs):

  * [PATCH] t/httpd-corner: check for leaking FDs and pipes
  * [RFC] t/git-http-backend: add MANY_CLONE test

  # no pipes should be present in -httpd with -W0
  prove -lv t/httpd-corner.t

  # unrelated note: there's 4 pipes from -W1 (the default),
  # but I think 2 can be closed, actually...
  GIANT_GIT_DIR=/path/to/git.git MANY_CLONE=1 prove -lv t/git-http-backend.t

If those updated test cases can't reproduce the problem, can you
reproduce this on erol or any other machines, perhaps with a different
Perl?

Thanks.

* Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]
  From: Eric Wong @ 2019-09-13 7:03 UTC
  To: Konstantin Ryabitsev
  Cc: meta

Eric Wong <e@80x24.org> wrote:
> Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> > I sent you a dump of lsof -p of all 4 processes after about 20 minutes
> > of running.  For another data point, the daemon was running in
> > SELinux-permissive mode, to make sure unlinks aren't failing because of
> > any permission errors.

Oh, another curiosity: it would be good to see
/proc/$pid/fdinfo/$fd_of_eventpoll, to know what state the pipes in
the epoll descriptor are in and to tie those to the lsof output.
It won't help with deleted regular files, though...

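In case it saves a round trip, here is a small throwaway Perl sketch
(not part of public-inbox) that locates a process's epoll descriptor
via /proc and dumps its fdinfo; each "tfd:" line there names a watched
fd and its event mask, which can then be matched against lsof output.

#!/usr/bin/perl -w
# dump /proc/$pid/fdinfo for every epoll descriptor a process holds
use strict;

my $pid = shift or die "usage: $0 PID\n";
opendir(my $dh, "/proc/$pid/fd") or die "opendir: $!\n";
for my $fd (sort { $a <=> $b } grep { /\A[0-9]+\z/ } readdir($dh)) {
    my $dst = readlink("/proc/$pid/fd/$fd") // next;
    next unless $dst eq 'anon_inode:[eventpoll]';
    print "== fd=$fd ($dst) ==\n";
    open(my $fh, '<', "/proc/$pid/fdinfo/$fd") or next;
    print while <$fh>;
}
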
* Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]
  From: Eric Wong @ 2019-09-13 9:01 UTC
  To: Konstantin Ryabitsev
  Cc: meta

Eric Wong <e@80x24.org> wrote:
> # unrelated note: there's 4 pipes from -W1 (the default),
> # but I think 2 can be closed, actually...

Nevermind on the extra pipes; I was running lsof on the master :x.
"[RFC 2/1] t/git-http-backend: disable worker processes" cuts it down
to a single process (no master/worker distinction) so lsof is seeing
the right pid.

> GIANT_GIT_DIR=/path/to/git.git MANY_CLONE=1 prove -lv t/git-http-backend.t

And now no pipes are visible at all, so either that test can't trigger
it, or neither of the systems I'm testing on can trigger it...

* Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]
  From: Konstantin Ryabitsev @ 2019-09-13 18:07 UTC
  To: Eric Wong
  Cc: meta

On Fri, Sep 13, 2019 at 03:12:12AM +0000, Eric Wong wrote:
> > No, it's just vanilla what comes with the source.
>
> OK, and Perl 5.16.3 from CentOS 7?  (4:5.16.3-294.el7_6 RPM)

Correct, though another thing that may help is the git version -- we
install git2u from IUS, so the git version is not what is in standard
CentOS7 (it's too ancient).

# rpm -q git2u
git2u-2.16.5-1.ius.centos7.x86_64

See if you have more luck reproducing the problem when you get that in
place.  You should be able to find what you need by searching for
"git2u ius centos".

My access over the next 10 days is going to be spotty due to travel,
but I'll try to catch up at least once a day.

Thanks again for your help!

-K

* Re: httpd 502s [was: trying to figure out 100% CPU usage in nntpd...]
  From: Eric Wong @ 2019-09-14 5:25 UTC
  To: meta
  Cc: Konstantin Ryabitsev

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Fri, Sep 13, 2019 at 03:12:12AM +0000, Eric Wong wrote:
> > > No, it's just vanilla what comes with the source.
> >
> > OK, and Perl 5.16.3 from CentOS 7?  (4:5.16.3-294.el7_6 RPM)
>
> Correct, though another thing that may help is the git version -- we
> install git2u from IUS, so the git version is not what is in standard
> CentOS7 (it's too ancient).
>
> # rpm -q git2u
> git2u-2.16.5-1.ius.centos7.x86_64
>
> See if you have more luck reproducing the problem when you get that in
> place.  You should be able to find what you need by searching for
> "git2u ius centos".

Tried git2u; it didn't help.  I wouldn't expect external processes such
as nginx or git-http-backend to have any impact on leaks unless being
slower/faster triggers a race in public-inbox.

At this point I suspect it's timing-related in our code, but maybe not
something my hardware/VM is capable of reproducing.  What level of
parallelism is grokmirror using?  I am very curious if anybody else
sees it, too...

> My access over the next 10 days is going to be spotty due to travel, but
> I'll try to catch up at least once a day.

The /proc fdinfo of the eventpoll descriptor + lsof at the same time
would be helpful if you get a moment.  But it's understandable if you
don't :>

> Thanks again for your help!

No problem.  I've gone through the code again but nothing jumped out
at me.  Will add some comments and dig into it again after I do some
other stuff.  Enjoy your travels!

* Re: trying to figure out 100% CPU usage in nntpd...
  From: Konstantin Ryabitsev @ 2019-09-11 9:44 UTC
  To: Eric Wong
  Cc: meta

On Tue, Sep 10, 2019 at 06:12:24PM +0000, Eric Wong wrote:
> > It does seem like there's perhaps a leak somewhere?
>
> Probably.  Not seeing any of that on my (smaller) instances;
> but those -httpd haven't been restarted in weeks/months.
>
> The "PerlIO_" prefix is created from open(..., '+>', undef), so
> it has to be one of:
>
> 1. POST bodies for git http-backend
> 2. git cat-file --batch-check stderr
> 3. experimental ViewVCS|SolverGit which isn't configured on lore :)
>
> = git-http-backend
>
> Looking at the size (984), PerlIO_* prefix and proximity of FD
> numbers, I think those lines above are git-http-backend POST bodies.

Pretty sure that's the culprit.  This is how we replicate from
lore.kernel.org to erol.kernel.org:

- once a minute, two nodes that are behind erol.kernel.org grab the
  newest manifest.js.gz
- if there are changes, each updated repository is pulled from
  lore.kernel.org, so if there were 5 repository updates, there would
  be 10 "git pull" requests

I switched the replication nodes to pull once every 5 minutes instead
of once every minute, and I see a direct correlation between when those
processes run and the number of broken pipes and "/tmp/PerlIO_*
(deleted)" entries showing up and hanging around.  Not every run
produces these, but the increases come in roughly 5-minute intervals.

On the first run after a public-inbox-httpd restart, the correlation is
direct.  This is from one of the mirroring nodes:

[82807] 2019-09-11 09:30:02,044 - INFO - Updating 18 repos from https://lore.kernel.org

and this is on lore.kernel.org after the run is completed:

# ls -al /proc/{16212,16213,16214,16215}/fd | grep deleted | wc -l
36

> Any git-http-backend stuck from people fetching/cloning?

No, all git processes seem to exit cleanly on both ends.

> This is -httpd writing to varnish, still, right?

We bypass varnish for git requests, since this is not generally useful.
Nginx goes straight to public-inbox-httpd for those.

I did run some updates on lore.kernel.org on Thursday, including kernel
(3.10.0-957.27.2), nginx (1.16.1) and public-inbox updates.  For the
latter, it went from f4f0a3be to what was latest master at the time
(d327141c).

Hope this helps, and thanks for your help!

-K

* Re: trying to figure out 100% CPU usage in nntpd...
  From: Eric Wong @ 2019-09-11 17:12 UTC
  To: Konstantin Ryabitsev
  Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Tue, Sep 10, 2019 at 06:12:24PM +0000, Eric Wong wrote:
> > Any git-http-backend stuck from people fetching/cloning?
>
> No, all git processes seem to exit cleanly on both ends.
>
> > This is -httpd writing to varnish, still, right?
>
> We bypass varnish for git requests, since this is not generally useful.
> Nginx goes straight to public-inbox-httpd for those.
>
> I did run some updates on lore.kernel.org on Thursday, including kernel
> (3.10.0-957.27.2), nginx (1.16.1) and public-inbox updates.  For the
> latter, it went from f4f0a3be to what was latest master at the time
> (d327141c).

I also noticed "tcp: remove empty skb from write queue in error cases"
(64a2a93ba3856d16765e5c8a00c133fa8750879a), which is likely to affect
public-inbox using edge-triggered epoll.  Not sure if Linux
3.10.0-957.27.2 has that backported...

Also, what nginx buffer settings are you using?  In particular, is
proxy_buffering off or on?  Also, does that version of nginx do
persistent connections to backends (and if so, is it enabled)?

It doesn't seem pipe-user-pages related, but what I wrote could
definitely be a problem in the future with the number of inboxes you
have, and I should be able to mitigate that by using alternates and
sharing a cat-file process.
