From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-ASN:
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,BAYES_00
	shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id 9AFF21F8C3
	for ; Sat, 9 May 2020 08:27:38 +0000 (UTC)
From: Eric Wong
To: meta@public-inbox.org
Subject: [PATCH 2/3] eml: speed up common LF-only emails
Date: Sat, 9 May 2020 08:27:37 +0000
Message-Id: <20200509082738.23602-3-e@yhbt.net>
In-Reply-To: <20200509082738.23602-1-e@yhbt.net>
References: <20200509082738.23602-1-e@yhbt.net>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
List-Id:

Emails from a *nix MTA are typically LF-only, so we don't need
the complexity of the RE engine when a simple index() works.

We still need to ensure there's no "\r\n\r\n" before the first
"\n\n", but two calls to index() is still faster than a RE match.

This gives a 2-5% speedup in some informal tests and saves
~30MB when scanning a 30MB spam message on newer versions of
Perl.  I'll have to diagnose why Perl wastes so much memory
doing RE matches on giant strings, though.
---
 lib/PublicInbox/Eml.pm | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/lib/PublicInbox/Eml.pm b/lib/PublicInbox/Eml.pm
index 80e7c1af..f022516c 100644
--- a/lib/PublicInbox/Eml.pm
+++ b/lib/PublicInbox/Eml.pm
@@ -71,10 +71,18 @@ sub re_memo ($) {
 # compatible with our uses of Email::MIME
 sub new {
 	my $ref = ref($_[1]) ? $_[1] : \(my $cpy = $_[1]);
-	if ($$ref =~ /\r?\n(\r?\n)/s) { # likely
-		# This can modify $$ref in-place and to avoid memcpy/memmove
-		# on a potentially large $$ref. It does need to make a
-		# copy for $hdr, though. Idea stolen from Email::Simple
+	# substr() can modify the first arg in-place and to avoid
+	# memcpy/memmove on a potentially large scalar. It does need
+	# to make a copy for $hdr, though. Idea stolen from Email::Simple.
+
+	# We also prefer index() on common LFLF emails since it's faster
+	# and re scan can bump RSS by length($$ref) on big strings
+	if (index($$ref, "\r\n") < 0 && (my $pos = index($$ref, "\n\n")) >= 0) {
+		# likely on *nix
+		my $hdr = substr($$ref, 0, $pos + 2, ''); # sv_chop on $$ref
+		chop($hdr); # lower SvCUR
+		bless { hdr => \$hdr, crlf => "\n", bdy => $ref }, __PACKAGE__;
+	} elsif ($$ref =~ /\r?\n(\r?\n)/s) {
 		my $hdr = substr($$ref, 0, $+[0], ''); # sv_chop on $$ref
 		substr($hdr, -(length($1))) = ''; # lower SvCUR
 		bless { hdr => \$hdr, crlf => $1, bdy => $ref }, __PACKAGE__;
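
For readers who want to try the two code paths outside of
PublicInbox::Eml, here is a minimal standalone sketch of the same
header/body split. split_hdr_body() is a hypothetical helper, not
part of the patch, and its final fallback for input with no blank
line is an assumption, since the hunk above does not show that
branch.

```perl
use strict;
use warnings;

# Hypothetical helper (not in PublicInbox::Eml): takes a scalar ref,
# chops the header off $$ref in place and returns
# ($hdr, $crlf, $body_ref).
sub split_hdr_body {
	my ($ref) = @_;
	if (index($$ref, "\r\n") < 0 &&
			(my $pos = index($$ref, "\n\n")) >= 0) {
		# fast path: no CRLF anywhere, so the first LFLF
		# must be the end of the header
		my $hdr = substr($$ref, 0, $pos + 2, '');
		chop($hdr); # drop the blank-line LF, keep the last header LF
		return ($hdr, "\n", $ref);
	} elsif ($$ref =~ /\r?\n(\r?\n)/s) {
		# slow path: CRLF or mixed line endings
		my $hdr = substr($$ref, 0, $+[0], '');
		substr($hdr, -(length($1))) = '';
		return ($hdr, $1, $ref);
	}
	# assumption: with no blank line, treat everything as header
	my $hdr = $$ref;
	$$ref = '';
	return ($hdr, "\n", $ref);
}

my $msg = "Subject: hi\nFrom: a\n\nbody\n";
my ($hdr, $crlf, $bdy) = split_hdr_body(\$msg);
print length($hdr), " header bytes, ", length($$bdy), " body bytes\n";
```

Note that the fast path is only taken when "\r\n" appears nowhere
in the message, which is stricter than checking up to the first
"\n\n"; a message with an LF-only header and a CRLF body goes
through the RE path and gets the same split.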