From: "Mattias Engdegård" <mattias.engdegard@gmail.com>
To: Stefan Monnier <monnier@iro.umontreal.ca>
Cc: Dmitry Gutov <dmitry@gutov.dev>, Eli Zaretskii <eliz@gnu.org>,
68244@debbugs.gnu.org
Subject: bug#68244: hash-table improvements
Date: Mon, 8 Jan 2024 19:26:21 +0100 [thread overview]
Message-ID: <19265EA5-E6F3-446C-AD9B-763693CF0A48@gmail.com> (raw)
In-Reply-To: <jwvv885dktt.fsf-monnier+emacs@gnu.org>
7 jan. 2024 kl. 20.10 skrev Stefan Monnier <monnier@iro.umontreal.ca>:
> The change gives good results for small tables but less so for big ones.
> I don't have a good intuition for why that would be: none of the
> operations directly involved seem to be more costly for large tables, so
> my best guess is that it leads to more collisions somehow, tho I don't
> have a good idea about why that would be.
Yes, I wondered about that too. It could simply be bad sampling. The benchmarks with bad results used sequential numbers (0, 1, ...) as keys so let's start with that:
| count | size idxsiz 1st% avg max | size idxsiz 1st% avg max
|--------+------------------------------+-----------------------------
| 10000 | 12466 15343 100.0 1.00 1 | 12288 16384 100.0 1.00 1
| 20000 | 28048 34523 100.0 1.00 1 | 24576 32768 98.8 1.01 2
| 30000 | 42072 51781 100.0 1.00 1 | 49152 65536 99.1 1.01 2
| 40000 | 42072 51781 100.0 1.00 1 | 49152 65536 94.5 1.06 2
| 50000 | 63108 77671 100.0 1.00 1 | 98304 131072 100.0 1.00 1
| 60000 | 63108 77671 100.0 1.00 1 | 98304 131072 100.0 1.00 1
| 70000 | 94662 116507 100.0 1.00 1 | 98304 131072 100.0 1.00 1
| 80000 | 94662 116507 100.0 1.00 1 | 98304 131072 96.0 1.04 2
| 90000 | 94662 116507 100.0 1.00 1 | 98304 131072 89.3 1.11 2
| 100000 | 141993 174761 100.0 1.00 1 | 196608 262144 92.9 1.07 2
| 110000 | 141993 174761 100.0 1.00 1 | 196608 262144 90.9 1.09 2
| 120000 | 141993 174761 100.0 1.00 1 | 196608 262144 89.3 1.11 2
| 130000 | 141993 174761 100.0 1.00 1 | 196608 262144 87.9 1.12 2
| 140000 | 141993 174761 100.0 1.00 1 | 196608 262144 86.7 1.13 2
| 150000 | 212989 262141 100.0 1.00 1 | 196608 262144 85.7 1.14 2
| 160000 | 212989 262141 100.0 1.00 1 | 196608 262144 84.8 1.15 2
| 170000 | 212989 262141 100.0 1.00 1 | 196608 262144 84.0 1.16 2
| 180000 | 212989 262141 100.0 1.00 1 | 196608 262144 83.3 1.17 2
| 190000 | 212989 262141 100.0 1.00 1 | 196608 262144 82.7 1.17 2
| 200000 | 212989 262141 100.0 1.00 1 | 393216 524288 100.0 1.00 1
The left columns are for the standard hash tables with remainder-based range reduction, the right ones with Knuth reduction.
'size' is the table size, 'idxsiz' the index vector size, '1st%' the portion of entries accessed with a single lookup, 'avg' the average number of accesses and 'max' the maximum number of accesses required.
The old code looks perfect (no collisions!), but the new shiny code is a disappointment.
Then again, sequential numbers are best-case for remainder-based indexing: Emacs hashes (smallish) fixnums to themselves so we are guaranteed a minimum number of collisions, actually zero since the index vector is always larger than the number of entries.
But sequential keys would be a somewhat unusual use of hash tables; a vector is a lot more efficient. Let's try with random fixnums instead, using the same seed for each table:
| count | size idxsiz 1st% avg max | size idxsiz 1st% avg max
|--------+-----------------------------+----------------------------
| 10000 | 12466 15343 73.5 1.32 6 | 12288 16384 75.4 1.30 6
| 20000 | 28048 34523 75.7 1.29 6 | 24576 32768 75.2 1.30 6
| 30000 | 42072 51781 75.8 1.29 5 | 49152 65536 80.6 1.22 5
| 40000 | 42072 51781 69.6 1.39 7 | 49152 65536 75.0 1.30 6
| 50000 | 63108 77671 73.6 1.32 6 | 98304 131072 83.5 1.19 6
| 60000 | 63108 77671 69.6 1.39 7 | 98304 131072 80.5 1.23 6
| 70000 | 94662 116507 75.2 1.30 6 | 98304 131072 77.5 1.27 6
| 80000 | 94662 116507 72.4 1.34 7 | 98304 131072 75.0 1.30 7
| 90000 | 94662 116507 69.8 1.38 7 | 98304 131072 72.6 1.34 7
| 100000 | 141993 174761 76.2 1.29 6 | 196608 262144 83.3 1.19 6
| 110000 | 141993 174761 74.2 1.32 6 | 196608 262144 81.8 1.21 6
| 120000 | 141993 174761 72.4 1.34 7 | 196608 262144 80.3 1.23 6
| 130000 | 141993 174761 70.6 1.37 7 | 196608 262144 78.8 1.25 6
| 140000 | 141993 174761 68.8 1.40 7 | 196608 262144 77.6 1.27 6
| 150000 | 212989 262141 76.1 1.29 6 | 196608 262144 76.2 1.29 6
| 160000 | 212989 262141 74.8 1.31 6 | 196608 262144 74.9 1.30 6
| 170000 | 212989 262141 73.5 1.33 6 | 196608 262144 73.6 1.32 6
| 180000 | 212989 262141 72.3 1.35 7 | 196608 262144 72.3 1.34 6
| 190000 | 212989 262141 71.1 1.36 7 | 196608 262144 71.1 1.36 7
| 200000 | 212989 262141 69.9 1.38 7 | 393216 524288 83.1 1.19 5
Here the new code looks a bit better, but it could just be the bigger index vectors.
Not sure what to make of this.
We could try switching to a high-quality hash function (or family thereof), like Murmur or Jenkins. Then range reduction is just a matter of masking off the required number of bits.
next prev parent reply other threads:[~2024-01-08 18:26 UTC|newest]
Thread overview: 80+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <170438379722.3921.9312235725296561206@vcs2.savannah.gnu.org>
[not found] ` <20240104155642.B4A99C00344@vcs2.savannah.gnu.org>
2024-01-04 17:32 ` scratch/hash-table-perf 2d28042f56a 19/35: Use non-Lisp allocation for internal hash-table vectors Dmitry Gutov
2024-01-05 10:33 ` bug#68244: hash-table improvements Mattias Engdegård
2024-01-05 15:41 ` Dmitry Gutov
2024-01-06 11:34 ` Mattias Engdegård
2024-01-06 11:51 ` Eli Zaretskii
2024-01-07 3:13 ` Dmitry Gutov
2024-01-07 5:26 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2024-01-07 15:39 ` Dmitry
2024-01-07 18:36 ` Mattias Engdegård
2024-01-07 19:10 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2024-01-08 18:26 ` Mattias Engdegård [this message]
2024-01-09 0:33 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2024-01-09 10:26 ` Mattias Engdegård
2024-01-13 20:06 ` Mattias Engdegård
2024-01-04 16:27 ` Mattias Engdegård
2024-01-04 16:39 ` Eli Zaretskii
2024-01-04 17:02 ` Mattias Engdegård
2024-01-04 17:45 ` Eli Zaretskii
2024-01-05 11:34 ` Mattias Engdegård
2024-01-05 17:14 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2024-01-06 11:46 ` Mattias Engdegård
2024-01-09 21:51 ` Stefan Kangas
2024-01-12 15:42 ` Mattias Engdegård
2024-01-14 22:08 ` Andy Moreton
2024-01-15 12:31 ` Eli Zaretskii
2024-01-15 13:26 ` Mattias Engdegård
2024-01-18 18:13 ` Mattias Engdegård
2024-01-15 20:01 ` Andy Moreton
2024-01-15 20:21 ` Eli Zaretskii
2024-01-16 21:57 ` Andy Moreton
2024-01-17 3:31 ` Eli Zaretskii
2024-01-18 20:29 ` Andy Moreton
2024-01-19 6:37 ` Eli Zaretskii
2024-01-20 20:20 ` Andy Moreton
2024-01-21 5:11 ` Eli Zaretskii
2024-01-21 13:03 ` Basil L. Contovounesios via Bug reports for GNU Emacs, the Swiss army knife of text editors
2024-01-22 9:17 ` João Távora
2024-01-22 9:18 ` João Távora
2024-01-23 9:44 ` Basil L. Contovounesios via Bug reports for GNU Emacs, the Swiss army knife of text editors
2024-01-14 5:25 ` Gerd Möllmann
2024-01-14 14:42 ` Mattias Engdegård
2024-01-21 12:41 ` Stefan Kangas
2024-02-08 9:46 ` Mattias Engdegård
2024-02-08 14:19 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2024-02-08 14:36 ` Gerd Möllmann
2024-02-08 14:42 ` Mattias Engdegård
2024-02-08 15:13 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2024-02-08 17:29 ` Mattias Engdegård
2024-02-08 17:49 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2024-02-12 12:16 ` Mattias Engdegård
2024-02-12 13:36 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2024-02-13 9:05 ` Gerd Möllmann
2024-02-13 10:12 ` Mattias Engdegård
2024-02-13 12:12 ` Gerd Möllmann
2024-02-13 12:43 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2024-02-14 12:37 ` Mattias Engdegård
2024-02-14 13:05 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2024-02-14 13:21 ` Mattias Engdegård
2024-02-17 19:50 ` Mattias Engdegård
2024-02-20 10:21 ` Mattias Engdegård
2024-02-20 14:00 ` Eli Zaretskii
2024-02-20 16:11 ` Mattias Engdegård
2024-02-20 17:12 ` Eli Zaretskii
2024-02-21 12:59 ` Eli Zaretskii
2024-02-21 20:13 ` Andrea Corallo
2024-02-23 12:16 ` Mattias Engdegård
2024-02-24 9:45 ` Mattias Engdegård
2024-02-24 10:30 ` Eli Zaretskii
2024-02-24 10:53 ` Mattias Engdegård
2024-02-24 11:03 ` Eli Zaretskii
2024-02-24 17:20 ` Dmitry Gutov
2024-02-24 17:43 ` Mattias Engdegård
2024-02-24 17:48 ` Dmitry Gutov
2024-02-24 17:53 ` Mattias Engdegård
2024-02-24 18:08 ` Eli Zaretskii
2024-02-24 18:31 ` Dmitry Gutov
2024-02-24 17:54 ` Stefan Monnier via Bug reports for GNU Emacs, the Swiss army knife of text editors
2024-02-24 17:14 ` Dmitry Gutov
2024-02-24 2:46 ` Dmitry Gutov
[not found] ` <20240104155642.6326FC00344@vcs2.savannah.gnu.org>
2024-01-04 18:52 ` scratch/hash-table-perf 3e9e68333ae 16/35: Remove rehash-threshold and rehash-size struct members Dmitry Gutov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=19265EA5-E6F3-446C-AD9B-763693CF0A48@gmail.com \
--to=mattias.engdegard@gmail.com \
--cc=68244@debbugs.gnu.org \
--cc=dmitry@gutov.dev \
--cc=eliz@gnu.org \
--cc=monnier@iro.umontreal.ca \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this external index
https://git.savannah.gnu.org/cgit/emacs.git
https://git.savannah.gnu.org/cgit/emacs/org-mode.git
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.