Date: Thu, 6 Jun 2019 22:10:09 +0000
From: Eric Wong
To: Konstantin Ryabitsev
Cc: meta@public-inbox.org
Subject: Re: how's memory usage on public-inbox-httpd?
Message-ID: <20190606221009.y4fe2e2rervvq3z4@dcvr>
In-Reply-To: <20190606214509.GA4087@chatter.i7.local>
References: <20181201194429.d5aldesjkb56il5c@dcvr>
 <20190606190455.GA17362@chatter.i7.local>
 <20190606203752.7wpdla5ynemjlshs@dcvr>
 <20190606214509.GA4087@chatter.i7.local>

Konstantin Ryabitsev wrote:
> On Thu, Jun 06, 2019 at 08:37:52PM +0000, Eric Wong wrote:
> > Do you have commit 7d02b9e64455831d3bda20cd2e64e0c15dc07df5?
> > ("view: stop storing all MIME objects on large threads")
> > That was most significant.
>
> Yes. We're running 743ac758 with a few cherry-picked patches on top
> of that (like the epoch roll-over fix).
>
> > Otherwise it's probably a combination of several things...
> > httpd and nntpd both support streaming, arbitrarily large
> > endpoints (all.mbox.gz, and /T/, /t/, /t.mbox.gz threads with
> > thousands of messages, giant NNTP BODY/ARTICLE ranges).
> >
> > All those endpoints should detect backpressure from a slow
> > client (varnish/nginx in your case) using the ->getline method.
>
> Wouldn't that spike up and down? The size I'm seeing stays pretty
> constant without any significant changes across requests.

Nope. That's the thing with glibc malloc: it avoids trimming the heap
so it looks good in benchmarks.

You could also try starting with MALLOC_MMAP_THRESHOLD_=131072 in the
environment (or some smaller/larger number of bytes) to force it to
use mmap in more cases instead of sbrk.

> > Also, are you only using the default of -W/--worker-process=1
> > on a 16-core machine? Just checked public-inbox-httpd(8), the
> > -W switch is documented :) You can use SIGTTIN/TTOU to
> > increase/decrease workers w/o restarting, too.
>
> D'oh, yes... though it's not been a problem yet. :) I'm not sure I
> want to bump that up, though, if that means we're going to have
> multiple 19GB-sized processes instead of one. :)

You'd probably end up with several smaller processes totalling up to
19GB. In any case, killing individual workers with QUIT/INT/TERM is
graceful and won't drop connections if memory use on one goes awry.

> > Do you have any stats on the number of simultaneous connections
> > public-inbox-httpd/nginx/varnish handles (and logging of that
> > info at peak)? (perhaps running "ss -tan" periodically)(*)
>
> We don't collect that info, but I'm not sure it's the number of
> concurrent connections that's the culprit, as there is no fluctuation
> in RSS size based on the number of responses.

Without concurrent connections, I can't see that happening unless
there's a single message which is gigabytes in size.

I'm already irked that Email::MIME requires slurping entire emails
into memory, but it should not be keeping more than one Email::MIME
object in memory at a time for a single client.

Is there anything in the varnish/nginx logs suggesting they can't
keep up for some reason?
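To illustrate the ->getline backpressure I mean, here's a rough sketch
of a pull-style PSGI body (not the actual public-inbox code;
LazyMboxBody and load_message are made-up names, and the stub below
just fabricates a message):

    package LazyMboxBody;
    use strict;
    use warnings;
    use Email::MIME;

    sub new {
        my ($class, $msg_nums) = @_;   # message numbers to stream
        bless { msgs => [ @$msg_nums ] }, $class;
    }

    # the PSGI server calls ->getline only when it's ready to write
    # more, so a slow client throttles how fast messages get loaded:
    sub getline {
        my ($self) = @_;
        my $num = shift @{$self->{msgs}};
        return undef unless defined $num;  # undef => end of body
        my $eml = load_message($num);      # ONE message in memory at a time
        $eml->as_string;
    }

    sub close { delete $_[0]->{msgs} }

    sub load_message {  # stub standing in for the real DB/blob lookup
        my ($num) = @_;
        Email::MIME->new("Subject: message $num\n\nbody\n");
    }

    1;

The server keeps calling ->getline until it returns undef, then calls
->close; nothing gets buffered beyond the message currently being
written to the client.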
Come to think of it, nginx proxy buffering might be redundant and even
harmful if varnish is already doing it. Perhaps "proxy_buffering off"
in nginx is worth trying... I use yahns instead of nginx; it does lazy
buffering (but the scary experimental-Ruby-server warning applies :x).
Last I checked, nginx either buffers the entire response before
sending the first byte or does no buffering at all (the latter is
probably fine behind varnish).

> To answer the questions in your follow-up:
>
> It would appear to be all in anon memory. Mem_usage [1] reports:
>
> # ./Mem_usage 18275
> Backed by file:
>   Executable                r-x  16668
>   Write/Exec (jump tables)  rwx  0
>   RO data                   r--  106908
>   Data                      rw-  232
>   Unreadable                ---  94072
>   Unknown                        0
> Anonymous:
>   Writable code (stack)     rwx  0
>   Data (malloc, mmap)       rw-  19988892
>   RO data                   r--  0
>   Unreadable                ---  0
>   Unknown                        12
>
> I've been looking at lsof -p of that process and I see sqlite and
> xapian showing up and disappearing. The lkml ones are being accessed
> almost all the time, but even there I see them showing up with
> different FD entries, so they are being closed and reopened properly.

Yep, that's expected. It's to better detect DB changes in case of
compact/copydatabase/xcpdb for Xapian.

It might not be strictly necessary for SQLite, but maybe somebody
could be running VACUUM offline, then flock-ing inbox.lock and
rename-ing the result into place or something (and retrying/restarting
the VACUUM if the DB changed in the meantime, seq_lock style).
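Something along these lines, maybe (a completely untested sketch of
that idea; the file names and the mtime-based staleness check are my
assumptions, and VACUUM INTO needs sqlite3 >= 3.27):

    use strict;
    use warnings;
    use Fcntl qw(LOCK_EX);

    my $db = 'msgmap.sqlite3';        # assumed name of the DB to shrink
    for (;;) {
        my $before = (stat($db))[9];  # mtime before the offline VACUUM
        unlink("$db.new");
        system('sqlite3', $db, "VACUUM INTO '$db.new'") == 0
            or die "VACUUM INTO failed\n";

        open my $lk, '>>', 'inbox.lock' or die "open inbox.lock: $!";
        flock($lk, LOCK_EX) or die "flock: $!";
        if ((stat($db))[9] == $before) {  # nothing wrote to it meanwhile
            rename("$db.new", $db) or die "rename: $!";
            last;
        }
        close $lk;                    # it changed under us: redo the VACUUM
        unlink("$db.new");
    }

Using mtime as the "did it change?" check is crude; a real version
would want something better (a sequence counter, say).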