From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail From: Mattias =?UTF-8?Q?Engdeg=C3=A5rd?= Newsgroups: gmane.emacs.bugs Subject: bug#68244: hash-table improvements Date: Mon, 8 Jan 2024 19:26:21 +0100 Message-ID: <19265EA5-E6F3-446C-AD9B-763693CF0A48@gmail.com> References: <170438379722.3921.9312235725296561206@vcs2.savannah.gnu.org> <20240104155642.B4A99C00344@vcs2.savannah.gnu.org> <8d49ebdc-9da7-4e70-a080-d8e892b980b6@gutov.dev> <08314177-5AE9-4352-94A0-641830B4094D@gmail.com> Mime-Version: 1.0 (Mac OS X Mail 14.0 \(3654.120.0.1.15\)) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: quoted-printable Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214"; logging-data="22029"; mail-complaints-to="usenet@ciao.gmane.io" Cc: Dmitry Gutov , Eli Zaretskii , 68244@debbugs.gnu.org To: Stefan Monnier Original-X-From: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org Mon Jan 08 19:27:46 2024 Return-path: Envelope-to: geb-bug-gnu-emacs@m.gmane-mx.org Original-Received: from lists.gnu.org ([209.51.188.17]) by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1rMuLg-0005Ti-Sd for geb-bug-gnu-emacs@m.gmane-mx.org; Mon, 08 Jan 2024 19:27:45 +0100 Original-Received: from localhost ([::1] helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1rMuKv-0001Dp-WB; Mon, 08 Jan 2024 13:26:58 -0500 Original-Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1rMuKt-0000uf-V2 for bug-gnu-emacs@gnu.org; Mon, 08 Jan 2024 13:26:56 -0500 Original-Received: from debbugs.gnu.org ([2001:470:142:5::43]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1rMuKt-0004CW-Kg for bug-gnu-emacs@gnu.org; Mon, 08 Jan 2024 13:26:55 -0500 Original-Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1rMuKz-0005IP-N2 for bug-gnu-emacs@gnu.org; Mon, 08 Jan 2024 13:27:01 -0500 X-Loop: help-debbugs@gnu.org Resent-From: Mattias =?UTF-8?Q?Engdeg=C3=A5rd?= Original-Sender: "Debbugs-submit" Resent-CC: bug-gnu-emacs@gnu.org Resent-Date: Mon, 08 Jan 2024 18:27:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 68244 X-GNU-PR-Package: emacs Original-Received: via spool by 68244-submit@debbugs.gnu.org id=B68244.170473840220331 (code B ref 68244); Mon, 08 Jan 2024 18:27:01 +0000 Original-Received: (at 68244) by debbugs.gnu.org; 8 Jan 2024 18:26:42 +0000 Original-Received: from localhost ([127.0.0.1]:37548 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1rMuKf-0005Hr-U1 for submit@debbugs.gnu.org; Mon, 08 Jan 2024 13:26:42 -0500 Original-Received: from mail-lf1-x130.google.com ([2a00:1450:4864:20::130]:42083) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1rMuKa-0005Ha-9P for 68244@debbugs.gnu.org; Mon, 08 Jan 2024 13:26:40 -0500 Original-Received: by mail-lf1-x130.google.com with SMTP id 2adb3069b0e04-50eab4bf47aso1971675e87.0 for <68244@debbugs.gnu.org>; Mon, 08 Jan 2024 10:26:29 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1704738384; x=1705343184; darn=debbugs.gnu.org; h=to:references:message-id:content-transfer-encoding:cc:date :in-reply-to:from:subject:mime-version:sender:from:to:cc:subject :date:message-id:reply-to; bh=hbvj/+aLkpBWYT7TAvewJZqDs4B9qdDKVbWu71vRsE4=; b=dIaetEH+5S5sBVT2lVbdlbCyZICc+nzXihSRlHV4yMvnr+rT5SWyOo/1xUuXemaGPO PndckKntH/70cjbh1KEm2w7IHfDtfYSV76LZPBterxzeKdYer8FT5lqahxzrtFbNgWrR nJkG1s+74e6Hvg90U6i3zFlaTu2T5kaQvwpW4ufzdu5SWmlGzKx2C4OwiV4GbDIaZenW 767qrm+yBqVH8g1/3eV1zPExzDaZ1wcjDQ6xaN21QLcZtLRo4IiUY9l5jTyIdJV02Pza lhEYeZjo7XlExGcEDA+n2GS9fd/BMSvXa1ibX4e4eirp+ZuNkcykBij/fntIrTEfk8Ju jGKQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1704738384; x=1705343184; h=to:references:message-id:content-transfer-encoding:cc:date :in-reply-to:from:subject:mime-version:sender:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=hbvj/+aLkpBWYT7TAvewJZqDs4B9qdDKVbWu71vRsE4=; b=IiSUGbz879K6XhBADVHEUlOogURzOmfUZQbmJrCO5OmwxQ34iDb1qSg6D2x5Xs9r5A pDWG4lz0bAssDpNy06o4bUxIO/sI/+CAG+Fk796+bTH5ILsAHpG81blKaGgB9SiVAzvs vFG8hKfn1Up8kxVtUIxDKA6e0XVwsqZZ2kWSjFWBMy65spaWep6neKTEYbqh7P9MArLc XMzuTkIoz4BH4Ki+V0InZKkyBhavo+EOlq0zO4PQi2fYGpJRsLB3gbbDVoGq3dkJ0yAI RR2WkVl9U+z9ZyQqA9tYPWArdm6eGNx3YaTJ8umlAmWeQWBRWVj17+aT5ODw2jkLLheB jv1g== X-Gm-Message-State: AOJu0YyBjt+eF/9mCmqtlSTys0S2Kqa1oTBwuEXLqJ8+/Z1ifFK2gS4G mhQYMb/KDOFD8zxUkVbIqH8= X-Google-Smtp-Source: AGHT+IHE8f3P6kqEkGXRpn/yrQVdmCIsjiU605ItJCeJ52Bl8JdM+iHy48aLuZNQzdeoUuNnSbPVIw== X-Received: by 2002:a05:6512:b88:b0:50b:f231:d444 with SMTP id b8-20020a0565120b8800b0050bf231d444mr102574lfv.7.1704738383678; Mon, 08 Jan 2024 10:26:23 -0800 (PST) Original-Received: from smtpclient.apple (c80-217-1-132.bredband.tele2.se. [80.217.1.132]) by smtp.gmail.com with ESMTPSA id s29-20020a19771d000000b0050e74d04c8asm38775lfc.132.2024.01.08.10.26.22 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 08 Jan 2024 10:26:22 -0800 (PST) In-Reply-To: X-Mailer: Apple Mail (2.3654.120.0.1.15) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list X-BeenThere: bug-gnu-emacs@gnu.org List-Id: "Bug reports for GNU Emacs, the Swiss army knife of text editors" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org Original-Sender: bug-gnu-emacs-bounces+geb-bug-gnu-emacs=m.gmane-mx.org@gnu.org Xref: news.gmane.io gmane.emacs.bugs:277570 Archived-At: 7 jan. 2024 kl. 20.10 skrev Stefan Monnier : > The change gives good results for small tables but less so for big = ones. > I don't have a good intuition for why that would be: none of the > operations directly involved seem to be more costly for large tables, = so > my best guess is that it leads to more collisions somehow, tho I don't > have a good idea about why that would be. Yes, I wondered about that too. It could simply be bad sampling. The = benchmarks with bad results used sequential numbers (0, 1, ...) as keys = so let's start with that: | count | size idxsiz 1st% avg max | size idxsiz 1st% avg max |--------+------------------------------+----------------------------- | 10000 | 12466 15343 100.0 1.00 1 | 12288 16384 100.0 1.00 1 | 20000 | 28048 34523 100.0 1.00 1 | 24576 32768 98.8 1.01 2 | 30000 | 42072 51781 100.0 1.00 1 | 49152 65536 99.1 1.01 2 | 40000 | 42072 51781 100.0 1.00 1 | 49152 65536 94.5 1.06 2 | 50000 | 63108 77671 100.0 1.00 1 | 98304 131072 100.0 1.00 1 | 60000 | 63108 77671 100.0 1.00 1 | 98304 131072 100.0 1.00 1 | 70000 | 94662 116507 100.0 1.00 1 | 98304 131072 100.0 1.00 1 | 80000 | 94662 116507 100.0 1.00 1 | 98304 131072 96.0 1.04 2 | 90000 | 94662 116507 100.0 1.00 1 | 98304 131072 89.3 1.11 2 | 100000 | 141993 174761 100.0 1.00 1 | 196608 262144 92.9 1.07 2 | 110000 | 141993 174761 100.0 1.00 1 | 196608 262144 90.9 1.09 2 | 120000 | 141993 174761 100.0 1.00 1 | 196608 262144 89.3 1.11 2 | 130000 | 141993 174761 100.0 1.00 1 | 196608 262144 87.9 1.12 2 | 140000 | 141993 174761 100.0 1.00 1 | 196608 262144 86.7 1.13 2 | 150000 | 212989 262141 100.0 1.00 1 | 196608 262144 85.7 1.14 2 | 160000 | 212989 262141 100.0 1.00 1 | 196608 262144 84.8 1.15 2 | 170000 | 212989 262141 100.0 1.00 1 | 196608 262144 84.0 1.16 2 | 180000 | 212989 262141 100.0 1.00 1 | 196608 262144 83.3 1.17 2 | 190000 | 212989 262141 100.0 1.00 1 | 196608 262144 82.7 1.17 2 | 200000 | 212989 262141 100.0 1.00 1 | 393216 524288 100.0 1.00 1 The left columns are for the standard hash tables with remainder-based = range reduction, the right ones with Knuth reduction. 'size' is the table size, 'idxsiz' the index vector size, '1st%' the = portion of entries accessed with a single lookup, 'avg' the average = number of accesses and 'max' the maximum number of accesses required. The old code looks perfect (no collisions!), but the new shiny code is a = disappointment. Then again, sequential numbers are best-case for remainder-based = indexing: Emacs hashes (smallish) fixnums to themselves so we are = guaranteed a minimum number of collisions, actually zero since the index = vector is always larger than the number of entries. But sequential keys would be a somewhat unusual use of hash tables; a = vector is a lot more efficient. Let's try with random fixnums instead, = using the same seed for each table: | count | size idxsiz 1st% avg max | size idxsiz 1st% avg max |--------+-----------------------------+---------------------------- | 10000 | 12466 15343 73.5 1.32 6 | 12288 16384 75.4 1.30 6 | 20000 | 28048 34523 75.7 1.29 6 | 24576 32768 75.2 1.30 6 | 30000 | 42072 51781 75.8 1.29 5 | 49152 65536 80.6 1.22 5 | 40000 | 42072 51781 69.6 1.39 7 | 49152 65536 75.0 1.30 6 | 50000 | 63108 77671 73.6 1.32 6 | 98304 131072 83.5 1.19 6 | 60000 | 63108 77671 69.6 1.39 7 | 98304 131072 80.5 1.23 6 | 70000 | 94662 116507 75.2 1.30 6 | 98304 131072 77.5 1.27 6 | 80000 | 94662 116507 72.4 1.34 7 | 98304 131072 75.0 1.30 7 | 90000 | 94662 116507 69.8 1.38 7 | 98304 131072 72.6 1.34 7 | 100000 | 141993 174761 76.2 1.29 6 | 196608 262144 83.3 1.19 6 | 110000 | 141993 174761 74.2 1.32 6 | 196608 262144 81.8 1.21 6 | 120000 | 141993 174761 72.4 1.34 7 | 196608 262144 80.3 1.23 6 | 130000 | 141993 174761 70.6 1.37 7 | 196608 262144 78.8 1.25 6 | 140000 | 141993 174761 68.8 1.40 7 | 196608 262144 77.6 1.27 6 | 150000 | 212989 262141 76.1 1.29 6 | 196608 262144 76.2 1.29 6 | 160000 | 212989 262141 74.8 1.31 6 | 196608 262144 74.9 1.30 6 | 170000 | 212989 262141 73.5 1.33 6 | 196608 262144 73.6 1.32 6 | 180000 | 212989 262141 72.3 1.35 7 | 196608 262144 72.3 1.34 6 | 190000 | 212989 262141 71.1 1.36 7 | 196608 262144 71.1 1.36 7 | 200000 | 212989 262141 69.9 1.38 7 | 393216 524288 83.1 1.19 5 Here the new code looks a bit better, but it could just be the bigger = index vectors. Not sure what to make of this. We could try switching to a high-quality hash function (or family = thereof), like Murmur or Jenkins. Then range reduction is just a matter = of masking off the required number of bits.