From: Andrew Hyatt
Newsgroups: gmane.emacs.devel
Subject: Re: [NonGNU ELPA] New package: llm
Date: Mon, 21 Aug 2023 01:12:38 -0400
To: Jim Porter
Cc: Daniel Fleischer, Richard Stallman, emacs-devel@gnu.org
On Mon, Aug 21, 2023 at 12:48 AM Jim Porter <jporterbugs@gmail.com> wrote:

> On 8/17/2023 10:08 AM, Daniel Fleischer wrote:
> > That is not accurate; LLMs can definitely run locally on your machine.
> > Models can be downloaded and run using Python. Here is an LLM released
> > under an Apache 2 license [0]. There are "black-box" models, served in
> > the cloud, but the revolution we're seeing is precisely because many
> > models are released freely and can be run (and trained) locally, even
> > on a laptop.
> >
> > [0] https://huggingface.co/mosaicml/mpt-7b
>
> The link says that this model has been pretrained, which is certainly
> useful for the average person who doesn't want (or doesn't have the
> resources) to perform the training themselves, but from the
> documentation, it's not clear how I *would* perform the training myself
> if I were so inclined. (I've only toyed with LLMs, so I'm not an expert
> at more "advanced" cases like this.)

The training of these models is fairly straightforward, at least if you
are familiar with the area. The code implementing transformers from the
original "Attention is All You Need" paper is at
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py
under an Apache License, and the LLM we are talking about here uses this
technique to train and execute, changing some parameters and adding
things like more attention heads, but keeping the fundamental
architecture the same.

I'm not an expert, but I believe that due to the use of stochastic
processes in training, even if you had the exact code, parameters, and
data used in training, you would never be able to reproduce the model
they make available. It should be equivalent in quality, perhaps, but
not the same.
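To make "the fundamental architecture" concrete, here is a minimal sketch
of the attention mechanism written directly from the paper's equations in
plain NumPy. It is not the tensor2tensor or MPT-7B code; the function
names, toy shapes, and random weights are mine, purely for illustration.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.shape[-1]
    scores = q @ k.swapaxes(-2, -1) / np.sqrt(d_k)
    return softmax(scores) @ v

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """Split the model dimension into n_heads, attend per head, recombine.

    Adding more heads (as newer LLMs do) changes n_heads and the weight
    shapes, but not this overall structure.
    """
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def project(w):
        # Project, then split into heads: (n_heads, seq_len, d_head)
        return (x @ w).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = project(w_q), project(w_k), project(w_v)
    heads = scaled_dot_product_attention(q, k, v)   # (n_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o                             # final output projection

# Toy usage with random weights standing in for trained parameters.
rng = np.random.default_rng(0)
d_model, seq_len, n_heads = 64, 10, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (0.1 * rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads).shape)  # (10, 64)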
> I do see that the documentation mentions the training datasets used, but
> it also says that "great efforts have been taken to clean the
> pretraining data". Am I able to access the cleaned datasets? I looked
> over their blog post[1], but I didn't see anything describing this in
> detail.
>
> While I certainly appreciate the effort people are making to produce
> LLMs that are more open than OpenAI (a low bar), I'm not sure if
> providing several gigabytes of model weights in binary format is really
> providing the *source*. It's true that you can still edit these models
> in a sense by fine-tuning them, but you could say the same thing about a
> project that only provided the generated output from GNU Bison, instead
> of the original input to Bison.

To me, I believe it should be about freedom. Not absolute freedom, but
relative freedom: do you, the user, have the same amount of freedom as
anyone else, including the creator? For LLMs like the ones on Hugging
Face and many other research LLMs, the answer is yes. You do have the
freedom to fine-tune the model, as does the creator. You cannot change
the base model in any meaningful way, but neither can the creator,
because no one knows how to do that yet. You cannot understand the
model, but neither can the creator, because while some progress has been
made in understanding simple things about simpler LLMs like GPT-2, the
modern LLMs are too complex for anyone to make sense of.
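To make the fine-tuning point concrete, here is roughly what it looks
like with the Hugging Face "transformers" and "datasets" libraries. This
is a sketch I have not run against MPT-7B specifically; the corpus file
name, sequence length, and hyperparameters are placeholders, and it says
nothing about the considerable hardware a 7B-parameter model needs.

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "mosaicml/mpt-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # some tokenizers ship without one
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Any plain-text corpus will do; "my_corpus.txt" is a stand-in.
raw = load_dataset("text", data_files={"train": "my_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM

args = TrainingArguments(
    output_dir="mpt-7b-finetuned",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=2e-5,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized, data_collator=collator)
trainer.train()
trainer.save_model("mpt-7b-finetuned")

Nothing in that script is available to the model's creators and not to
you; pointed at their own data, it is the same lever they have.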
>
> (Just to be clear, I don't mean any of the above to be leading
> questions. I really don't know the answers, and using analogies to
> previous cases like Bison can only get us so far. I truly hope there
> *is* a freedom-respecting way to interface with LLMs, but I also think
> it's worth taking some extra care at the beginning so we can choose the
> right path forward.)
>
> [1] https://www.mosaicml.com/blog/mpt-7b