From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail From: =?utf-8?Q?Gerd_M=C3=B6llmann?= Newsgroups: gmane.emacs.devel Subject: Re: Some experience with the igc branch Date: Fri, 27 Dec 2024 10:45:19 +0100 Message-ID: References: <87o713wwsi.fsf@telefonica.net> <86o7112rnq.fsf@gnu.org> <867c7p2nz4.fsf@gnu.org> <861pxx2lh7.fsf@gnu.org> <86ldw40xbo.fsf@gnu.org> <868qs329kj.fsf@gnu.org> <864j2r25hg.fsf@gnu.org> <861pxv234y.fsf@gnu.org> <86zfkjznat.fsf@gnu.org> <86ldw2zy6s.fsf@gnu.org> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="=-=-=" Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214"; logging-data="24482"; mail-complaints-to="usenet@ciao.gmane.io" User-Agent: Gnus/5.13 (Gnus v5.13) Cc: Eli Zaretskii , pipcet@protonmail.com, ofv@wanadoo.es, emacs-devel@gnu.org, eller.helmut@gmail.com, acorallo@gnu.org To: Stefan Kangas Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Fri Dec 27 10:45:49 2024 Return-path: Envelope-to: ged-emacs-devel@m.gmane-mx.org Original-Received: from lists.gnu.org ([209.51.188.17]) by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1tR6uj-0006DW-5J for ged-emacs-devel@m.gmane-mx.org; Fri, 27 Dec 2024 10:45:49 +0100 Original-Received: from localhost ([::1] helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1tR6uP-0007Fm-2k; Fri, 27 Dec 2024 04:45:29 -0500 Original-Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1tR6uN-0007FT-GB for emacs-devel@gnu.org; Fri, 27 Dec 2024 04:45:27 -0500 Original-Received: from mail-ej1-x62b.google.com ([2a00:1450:4864:20::62b]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1tR6uK-00054i-Qq; Fri, 27 Dec 2024 04:45:27 -0500 Original-Received: by mail-ej1-x62b.google.com with SMTP id a640c23a62f3a-aa67333f7d2so1047237466b.0; Fri, 27 Dec 2024 01:45:23 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1735292721; x=1735897521; darn=gnu.org; h=mime-version:user-agent:message-id:date:references:in-reply-to :subject:cc:to:from:from:to:cc:subject:date:message-id:reply-to; bh=xlXKq7tXZLtjhpN3kuOaXjFnjm9Q+vuZhqprUSF7B60=; b=Acdp70F62ktJOI3+GgmOnchmAL+TetK4F7s1XVmLWCDjnSNidHWbM2pPS5paj+AGHF 9tVGiDu0pwGev9kmNE2tCkyfNFKoigxvZrzbiqNpxW+mz8PBc3zLORD0nVzwVD4jLEeA /efQVXIzGPUkTE5weZ/zhkM/3LGn01fHje8wRT+DNcPPar9JXKipVXLSkgjie83Wyz6J LuShMSfs8zO+IFQeRkxVyZpnHAUaEgbN9sMGluqjGCf0jOmMML6l20Td7ok/SnThgLyT VuQg8l96PuxcoXnFkGbIpqHcvbb2PkiDmKj/khVJ0ZDgjQrFXiY6KwXJK0ZfuBLJ53Oj K4uQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1735292721; x=1735897521; h=mime-version:user-agent:message-id:date:references:in-reply-to :subject:cc:to:from:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=xlXKq7tXZLtjhpN3kuOaXjFnjm9Q+vuZhqprUSF7B60=; b=q1/uqtG9bEcCPFP75sNXgrbVb3N9u9GQy8x6zAPSgdxE/bUzPxwr7MEU7yL0vaqMRP jZ+FC1KpAvQ8Y8b/RoECB/tgwHg78pcembmKwYuSG7hVzrdvQoQ210YpYBlwWWjIfCHh cmJ0p1kPUt9V2XbOvtwnDoV8k2xfBduHDdnSJ17YZDPUYkAlYalC0IHsDHe5ik8tGHZs av1VTxPQ5J0CAqHkh7jszmj5i/B8yEXNMhrxIViTwnhAgMPgulm4FEOsxg3vAHwlrf4T hHe3qNCNo7tiieqMGgCjDFow4jeQ4iFXkQQnfKDXLER3Nn9Mf0ugvgATMgdyf+zG1uS/ na6Q== X-Forwarded-Encrypted: i=1; AJvYcCU1ChFrGnLYHFrMjPnqd7DW2FkaqgxkEIXOx5X0gUEInk7V3OFHvZ3JUVVaQHBUgQH67Z11B86zNg==@gnu.org, AJvYcCXAY4667EGVPAydBuxe3hR8F/hoSlirqNq6Uxpjl8GOWJwIcbpn8MbXEqDSbujWrefSCjsSO2JeFfPhW8A=@gnu.org X-Gm-Message-State: AOJu0Yw4OqEvj2kK6sRwMRdp3L73Tl84mGgohQglydleOAjEIzzI+Lqj EQQLpTeOSzMN4tcbhTJyYppZkiUZyta60SYxFZRVE4QIYSb12aFt2ADXJQ== X-Gm-Gg: ASbGncv1rLvVr15S/lPliYRzjphHFywatysidYOJm+Aa2KrvbicrWQdE2UF+pgVZGpU VwIf7KRkZZBcX9B6OPuJHYxPNL5ZuKp6F+zdHaw2ZrywTbYzNc2Jz+TzHc4VSRDcKKaRkG/Y8si tXhErS7+SM3F/9er1ELe9c/lQ1+P4vdi9/rxH8+4ngNUzaXQNi17/N/fiXR9gF+lKW3RlohWxuK aSxvuAgUpFpCqRQWRYTo0nHrjO9GhYDxqtwa2uYy8f6i3imSIOBdKOIKJnXYGXavB/k16cApXu8 m6QjKBQkrExirM/ELOnEn7rB/1vfTfR3YNFuiBqylc9TID6WIqpAUlj3FZBOam40Mw== X-Google-Smtp-Source: AGHT+IF09ss+nJv4OEYq9hxi3F/Mxtekdgalco4Ce/Kas6hOCwHqQ4BboFmVDypZPbiGj38fe+KJ6w== X-Received: by 2002:a17:907:9803:b0:aaf:1604:bc5f with SMTP id a640c23a62f3a-aaf1604bcd1mr599913766b.21.1735292720930; Fri, 27 Dec 2024 01:45:20 -0800 (PST) Original-Received: from pro2 (p200300e0b74e5900bde19f98ca9a60df.dip0.t-ipconnect.de. [2003:e0:b74e:5900:bde1:9f98:ca9a:60df]) by smtp.gmail.com with ESMTPSA id a640c23a62f3a-aac0e898689sm1086180866b.78.2024.12.27.01.45.19 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 27 Dec 2024 01:45:20 -0800 (PST) In-Reply-To: ("Gerd =?utf-8?Q?M=C3=B6llmann=22'?= =?utf-8?Q?s?= message of "Thu, 26 Dec 2024 20:51:10 +0100") Received-SPF: pass client-ip=2a00:1450:4864:20::62b; envelope-from=gerd.moellmann@gmail.com; helo=mail-ej1-x62b.google.com X-Spam_score_int: -20 X-Spam_score: -2.1 X-Spam_bar: -- X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001, RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Xref: news.gmane.io gmane.emacs.devel:327182 Archived-At: --=-=-= Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Gerd M=C3=B6llmann writes: > Stefan Kangas writes: > >> Gerd M=C3=B6llmann writes: >> >>> Coincidentally, I wrote this today, which might help, and would also be >>> interesting to get some feedback for. >> >> This text looks good to me as a general introduction, especially if you >> extend it to include these points also: > > Thanks. > >>> * Initialization >>> - Roots >>> - Threads >>> - Pools, Object formats >>> * Lisp Object Allocation >>> Object Header? >>> * Malloc with roots >>> * Memory barriers >> >> Do you intend to install this on the scratch/igc branch? > > Yes, when I'm a bit further. I think Eli wants a comment in igc.c like > the one at xdisp.c starts with. I've reached the point where I don't know what else to explain. Could always be improved, of course, and so on. Please find attached, with request for feedback. --=-=-= Content-Type: text/x-org; charset=utf-8 Content-Disposition: attachment; filename=20241225165202-icg_doc_commebt.org Content-Transfer-Encoding: quoted-printable Content-Description: Doc :PROPERTIES: :ID: 0B687D90-0DE7-4BC3-B92E-1D582C175737 :END: #+title: icg doc commebt * Introduction / Garbage collection Implementing a programming language like Lisp requires automatic memory management which frees the memory of Lisp objects like conses, strings etc. when they are no longer in use. The automatic memory management used for Lisp is called garbage collection (GC), which is performed by a garbage collector. Objects that are no longer used are called garbage or dead, objects that are in use or called live objects. The process of getting rid of garbage is called the GC. A large number of GC algorithms and implementations exist which differ in various dimensions. Emacs has two GC implementations which can be chosen at compile-time. The traditional (old) GC, which was the only one until recently, is a so-called mark-sweep, non-copying collector. The new GC implementation in this file is an incremental, generational, concurrent (igc) collector based on the MPS library from Ravenbrook. It is a so-called copying collector. The terms used here will become clearer in the following. Emacs' traditional mark-sweep GC works in two phases: 1. The mark phase Start with a set of so-called (GC) roots that are known to contain live objects. Examples of roots in Emacs are - the bytecode stacks=20 - the specpdl (binding) stacks - Lisp variables defined in C with DEFVAR - the C stacks (aka control stacks) - ... Roots in a general sense are contiguous areas of memory, and they are either "exact" or "ambiguous". If we know that a root always contains only valid Lisp object references, the root is called exact. An Example for an exact root is the root for a DEFVAR. The DEFVAR variable always contains a valid reference to a Lisp object. The memory range of the root is the Lisp_Object C variable. If we only know that a root contains potential references to Lisp objects, the root is called ambiguous. An example for an ambiguous root is the C stack. The C stack can contain integers that look like a Lisp_Object or pointer to a Lisp object, but actually are just random integers. Ambiguous roots are said to be scanned conservatively: we make the conservative assumption that we really found a reference, so that we don't discard objects that are only referenced from the C stack. The mark phase of the traditional GC scans all roots, conservatively or exactly, and marks the Lisp objects found referenced in the roots live, plus all Lisp objects reachable from them, recursively. In the end, all live objects are marked, and all objects not marked are dead, i.e. garbage. 2. The sweep phase Sweeping means to free all objects that are not marked live. The collector iterates over all allocated objects, live or dead, and frees the dead ones so that their memory can be reused. The traditional mark-sweep GC implementation is - Not concurrent. Emacs calls GC explicitly in various places, and proceeds only when the GC is done. - Not incremental. The GC is not done in steps. - Not generational. The GC doesn't take advantage of the so-called generational hypothesis, which says that most objects used by a program die young, i.e. are only used for a short period of time. Other GCs take this hypothesis into account to reduce pause times. - Not copying. Lisp objects are never copied by the GC. They stay at the addresses where their initial allocation puts them. Special facilities like allocation in blocks are implemented to avoid memory fragmentation. In contrast, the new igc collector, using MPS, is - Concurrent. The GC runs in its own thread. There are no explicit calls to start GC, and Emacs doesn't have to wait for the GC to complete. - Incremental. The GC is done in steps. - Generational. The GC takes advantage of the so-called generational hypothesis. - Copying. Lisp objects are copied by the GC, avoiding memory fragmentation. Special care must be taken to take into account that the same Lisp object can have different addresses at different times. The goal of igc is to reduce pause times when using Emacs interactively. The properties listed above which come from leveraging MPS make that possible. * MPS (Memory Pool System) The MPS (Memory Pool System) is a C library developed by Ravenbrook Inc. over several decades. It is used, for example, by the OpenDylan project. A guide and a reference manual for MPS can be found at https://memory-pool-system.readthedocs.io. The following can only be a rough and incomplete overview. It is recommended to read at least the MPS Guide. MPS uses "arenas" which consist of "pools". The pools represent memory to allocate objects from that are subject to GC. Different types of pools implement different GC strategies. Pools have "object formats" that an application must supply when it creates a pool. The most important part of an object format are a handful of functions that an application must implement to perform lowest-level operations on behalf of the MPS. These functions make it possible to implement MPS without it having to know low-level details about what application objects look like. The most important MPS pool type is named AMC (Automatic Mostly Copying). AMC implements a variant of a copying collector. Objects allocated from AMS pools can therefore change their memory addresses. When copying objects, a marker is left in the original object pointing to its copy. This marker is also called a "tombstone". A "memory barrier" is placed on the original object. Memory barrier means the memory is read and/or write protected (e.g. with mprotect). The barrier leads to MPS being invoked when an old object is accessed. The whole process is called "object forwarding". MPS makes sure that references to old objects are updated to refer to their new addresses. Functions defined in the object format are used by MPS to perform the lowest-level tasks of object forwarding, so that MPS doesn't have to know application-specific details of how objects look like. In the end, copying/forwarding is transparent to the application. AMC implements a "mostly-copying" collector, where "mostly" refers to the fact that it supports ambiguous references. Ambiguous references are those from ambiguous roots, where we can't tell if a reference is real or not. If we would copy such an object, we wouldn't be able to update their address in all references because we can't tell if the ambiguous reference is real or just some random integer, and changing it would have unforeseeable consequences. Ambiguously referenced objects are therefore never copied, and their address does not change. * The registry The MPS shields itself from knowing application details, for example which GC roots the application has, which threads, how objects look like and so on. MPS has an internal model instead which describes these details. Emacs creates this model using the MPS API. For example, MPS cannot know what Emacs' GC roots are. We tell MPS about Emacs' roots by calling an MPS API function an MPS-internal model for the root, and get back a handle that stands for the root. This handle later used to refer to the model object. For example, if we want to delete a root later, because we don't need it anymore, we call an MPS function giving it the handle for the root we no longer need. All other model objects are handled in the same way, threads, arenas, pools, object formats and so on. Igc collects all these MPS handles in a 'struct igc'. This "registry" of MPS handles is found in the global variable 'global_igc' and thus can be accessed from anywhere. * Root scan functions MPS allows us to specify roots having tailor-made scan functions that Emacs implements. Scanning here refers to the process of finding references in the memory area of the root, and telling MPS about the references. The function scan_specpdl is an example. We know the structure of a bindings stack, so we can tell where references to Lisp objects can be. This is generally better than letting MPS do the scanning itself, because MPS can only scan the whole block word for word, ambiguously or exactly. All such scan functions in igc have the prefix scan_. * Lisp object scan functions Igc tells MPS how to scan Lisp objects allocated via MPS by specifying a scan function for that purpose in an object format. This function is 'dflt_scan' in igc.c, which dispatches to various subroutines for different Lisp object types. The lower-level functions and macros igc defines in the call tree of dflt_scan have names starting with 'fix'_ or 'FIX_', because they use the MPS_FIX1 and MPS_FIX2 API to do their job. Please refer to the MPS reference for details of MPS_FIX1/2. * Initialization Before we can use MPS, we must define and create various things that MPS needs to know, i.e we create an MPS' internal model of Emacs. This is done at application startup, and all objects are added to the registry. - Pools. We tell MPS which pools we want to use, and what the object formats are, i.e. which callbacks in Emacs MPS can use. - Threads. We define which threads Emacs has, and add their C stacks as roots. This includes Emacs' main thread. - Roots. We tell MPS about the various roots in Emacs, the DEFVARs, the byte code stack, staticpro, etc. - ... When we done all that, we tell MPS it can start doing its job. =20=20 * Lisp Object Allocation All of Emacs' Lisp object allocation ultimately ends up done in igc's 'alloc_impl' function. MPS allocation from pools is thread-specific, using so-called "allocation points". These allocation points optimize allocation by reducing thread-contention. Allocation points are associated with pools, and there is one allocation point per thread. The function 'thread_ap' in igc determines which allocation point to use for the current thread and depending on the type of Lisp object to allocate. * Malloc with roots In a number of places, Emacs allocates memory with its 'xmalloc' function family and then stores references to Lisp objects there, pointers or Lisp_Objects. With the traditional GC, frequently, inadvertently collecting such objects is prevented by inhibiting GC. With igc, we do things differently. We don't want to temporarily stop the GC thread to inhibit GC, as a design decision. Instead, we make the malloc'd memory a root. The root is destroyed when the memory is freed. igc provides a number if functions for doing such allocations. For example 'igc_xzalloc_ambig', 'igc_xpalloc_exact' and so on. Freeing the memory must be done with 'igc_xfree'. These functions are used throughout Emacs in =C3=A4ifdef HAVE_MPS. In general, it's an erro to put references to Lisp objects in malloc'd memory and not use the igc functions. --=-=-=--