From: Daniel Fleischer
Subject: Re: [NonGNU ELPA] New package: llm
Newsgroups: gmane.emacs.devel
Date: Mon, 21 Aug 2023 09:36:42 +0300
To: Jim Porter
Cc: Richard Stallman, ahyatt@gmail.com, emacs-devel@gnu.org

Jim Porter writes:

> The link says that this model has been pretrained, which is certainly
> useful for the average person who doesn't want (or doesn't have the
> resources) to perform the training themselves, but from the
> documentation, it's not clear how I *would* perform the training
> myself if I were so inclined. (I've only toyed with LLMs, so I'm not
> an expert at more "advanced" cases like this.)

When I say people can train models themselves, I mean "fine-tuning":
taking an existing model and teaching it a specific task by showing it
a small number of examples, sometimes as few as 1000. There are
techniques that do this by modifying only a small percentage of the
model's weights, and that kind of training can be done in a few hours
on a laptop. See https://huggingface.co/docs/peft/index for a tool
that does exactly that.
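To make that concrete, here is a minimal sketch of LoRA-style
parameter-efficient fine-tuning with the `peft` library. The model
name, target modules, and hyper-parameters below are illustrative
assumptions, not a recipe; a real run would add a dataset and a
training loop on top of this:

# Minimal sketch: attach small low-rank adapters (LoRA) to a frozen
# base model.  Model name and hyper-parameters are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b", trust_remote_code=True)

lora = LoraConfig(
    r=8,                      # rank of the low-rank update matrices
    lora_alpha=16,            # scaling applied to the update
    target_modules=["Wqkv"],  # which matrices get adapters (model-specific)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()
# Reports that well under 1% of the parameters are trainable; the base
# weights stay frozen, which is why this fits on modest hardware.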
> I do see that the documentation mentions the training datasets used,
> but it also says that "great efforts have been taken to clean the
> pretraining data". Am I able to access the cleaned datasets? I looked
> over their blog post[1], but I didn't see anything describing this in
> detail.
>
> While I certainly appreciate the effort people are making to produce
> LLMs that are more open than OpenAI (a low bar), I'm not sure if
> providing several gigabytes of model weights in binary format is
> really providing the *source*. It's true that you can still edit
> these models in a sense by fine-tuning them, but you could say the
> same thing about a project that only provided the generated output
> from GNU Bison, instead of the original input to Bison.

To a large degree, the model is the weights. Today's models mostly
share a single architecture, the transformer decoder. Once you specify
the architecture and a few hyper-parameters in a config file, the
model is entirely determined by its weights; see e.g.

https://huggingface.co/mosaicml/mpt-7b/blob/main/config.json

Put differently, today's models differ mainly in their weights, not in
their architecture.
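The config link makes the point nicely: strip away the weights and
only a handful of numbers remain. A rough illustration (a sketch that
assumes the `transformers` library and access to the Hugging Face hub;
the parameter-count formula is the usual back-of-envelope estimate for
this architecture, not an official figure):

# The config is a short list of hyper-parameters; everything else
# about the model lives in the weights.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("mosaicml/mpt-7b", trust_remote_code=True)
print(cfg)  # layers, width, heads, vocab size, ...

# Rule of thumb: a decoder block with a 4x MLP has ~12 * d_model^2
# parameters; add the token embeddings on top.
d, n, v = cfg.d_model, cfg.n_layers, cfg.vocab_size
params = 12 * n * d * d + v * d
print(f"~{params / 1e9:.1f}B parameters, "
      f"~{params * 2 / 1e9:.0f} GB as 16-bit floats")

For MPT-7B this estimate lands around 6.6 billion parameters, close to
the advertised 7B, i.e. roughly 13 GB of 16-bit floats: the "several
gigabytes of model weights in binary format" under discussion.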
As for reproducibility, the truth is that one cannot reproduce these
models, in theory or in practice. On the theory side, a model contains
7, 14, 30, or 60 billion parameters, all floating-point numbers, and
reproducing them exactly is impossible because the training process
has many sources of randomness (weight initialization, data shuffling,
even the nondeterministic ordering of parallel floating-point
additions on GPUs). On the practical side, pretraining is expensive:
it requires hundreds of GPUs, and training costs run from roughly
$100,000 for small models up to millions of dollars for larger ones.
Some models do release their training data; see e.g.

https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T

A side note: we are at a stage where our theoretical understanding is
lacking while practical applications are flourishing. Things move very,
very fast, and there is a strong drive to productize this technology,
so people and companies are investing a lot of resources in it.
However, the open-source aspect is amazing: the architecture, the code,
and the insights are shared among everyone, and some companies even
share the models they pretrained under open licenses, taking the high
cost of training upon themselves. That is a huge win for everyone,
including the open-source and scientific communities, because now
innovation can come from anywhere.

--
Daniel Fleischer