From: Daniel Fleischer <danflscr@gmail.com>
To: Jim Porter <jporterbugs@gmail.com>
Cc: Richard Stallman <rms@gnu.org>,  ahyatt@gmail.com,  emacs-devel@gnu.org
Subject: Re: [NonGNU ELPA] New package: llm
Date: Mon, 21 Aug 2023 09:36:42 +0300	[thread overview]
Message-ID: <m2zg2kgav9.fsf@gmail.com> (raw)
In-Reply-To: <705ab838-142a-b3cc-8cc8-6f4d143c4341@gmail.com> (Jim Porter's message of "Sun, 20 Aug 2023 21:48:06 -0700")

Jim Porter <jporterbugs@gmail.com> writes:

> The link says that this model has been pretrained, which is certainly
> useful for the average person who doesn't want (or doesn't have the
> resources) to perform the training themselves, but from the
> documentation, it's not clear how I *would* perform the training
> myself if I were so inclined. (I've only toyed with LLMs, so I'm not
> an expert at more "advanced" cases like this.)

When I say people can train models themselves, I mean "fine-tuning":
the process of taking an existing model and teaching it a specific task
by showing it a small number of examples, sometimes as few as 1,000.
There are also more advanced techniques that train a model by modifying
only a small percentage of its weights; this kind of training can be
done in a few hours on a laptop. See
https://huggingface.co/docs/peft/index for a tool that does this.
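
For illustration, here is a minimal sketch of what parameter-efficient
fine-tuning looks like with that library. The model name and the target
module names are just assumptions for the example, not something taken
from this thread:

  # Sketch only: attach low-rank adapters (LoRA) to a pretrained model so
  # that only a small fraction of the weights is trained. Requires the
  # transformers and peft Python packages.
  from transformers import AutoModelForCausalLM
  from peft import LoraConfig, get_peft_model, TaskType

  base = "mosaicml/mpt-7b"  # any causal LM on the Hub would do
  model = AutoModelForCausalLM.from_pretrained(base, trust_remote_code=True)

  # Which modules to adapt depends on the architecture; "Wqkv" matches
  # MPT's fused attention projection, other models use different names.
  lora = LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16,
                    lora_dropout=0.05, target_modules=["Wqkv"])
  model = get_peft_model(model, lora)
  model.print_trainable_parameters()  # typically well under 1% of the weights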

> I do see that the documentation mentions the training datasets used,
> but it also says that "great efforts have been taken to clean the
> pretraining data". Am I able to access the cleaned datasets? I looked
> over their blog post[1], but I didn't see anything describing this in
> detail.
>
> While I certainly appreciate the effort people are making to produce
> LLMs that are more open than OpenAI (a low bar), I'm not sure if
> providing several gigabytes of model weights in binary format is
> really providing the *source*. It's true that you can still edit these
> models in a sense by fine-tuning them, but you could say the same
> thing about a project that only provided the generated output from GNU
> Bison, instead of the original input to Bison.

To a large degree, the model is the weights. Today's models mostly
share a single architecture, the transformer decoder. Once you specify
the architecture and a few hyper-parameters in a config file, the model
is entirely determined by its weights.

https://huggingface.co/mosaicml/mpt-7b/blob/main/config.json

Put differently, today's models differ mainly in their weights, not in
their architecture.
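
As a rough sketch (assuming the Hugging Face transformers library and
that particular model), the split between the small config file and the
large weight files looks like this:

  # Sketch only: the config fixes the architecture and hyper-parameters;
  # everything else lives in the (multi-gigabyte) weight files.
  from transformers import AutoConfig, AutoModelForCausalLM

  repo = "mosaicml/mpt-7b"
  config = AutoConfig.from_pretrained(repo, trust_remote_code=True)
  print(config.d_model, config.n_heads, config.n_layers)  # a handful of numbers

  # Downloading the weights fills in the billions of parameters.
  model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)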

As for reproducibility, the truth is that these models cannot be
reproduced, either theoretically or practically. They contain 7, 14,
30, or 60 billion parameters, all floating-point numbers; it is
impossible to reproduce them exactly because there are many sources of
randomness in the training process. Practically, pretraining is
expensive: it requires hundreds of GPUs, and training costs run around
$100,000 for small models and up to millions of dollars for larger
ones.
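
Just to give a sense of scale (rough arithmetic, not figures from any
particular model card):

  # Back-of-the-envelope: size of the weights alone for a 7B-parameter model
  n_params = 7_000_000_000
  bytes_per_param = 2                              # 16-bit floats (fp16/bf16)
  print(n_params * bytes_per_param / 1e9, "GB")    # ~14 GB, weights only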

Some models do release their training data; see, for example,
https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T
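
For what it's worth, that dataset can be pulled with the Hugging Face
datasets library; the subset name here ("arxiv") is just an assumption
for the example, and streaming avoids downloading the full corpus:

  # Sketch only: stream one slice of RedPajama rather than fetching ~1T tokens.
  from datasets import load_dataset

  ds = load_dataset("togethercomputer/RedPajama-Data-1T", "arxiv",
                    split="train", streaming=True)
  sample = next(iter(ds))
  print(sample["text"][:200])  # field names may differ per subset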

A side note: we are at a stage where our theoretical understanding is
lagging while practical applications are flourishing. Things move very
fast, and there is a strong drive to productize this technology, so
people and companies are investing a lot of resources in it. The open
source aspect, however, is remarkable: the architecture, code and
insights are shared with everyone, and some companies even release the
models they pretrained under open licenses (taking on the high cost of
training themselves). That is a huge win for everyone, including the
open source and scientific communities, because now innovation can
come from anywhere.

-- 
Daniel Fleischer



Thread overview: 68+ messages
2023-08-07 23:54 [NonGNU ELPA] New package: llm Andrew Hyatt
2023-08-08  5:42 ` Philip Kaludercic
2023-08-08 15:08   ` Spencer Baugh
2023-08-08 15:09   ` Andrew Hyatt
2023-08-09  3:47 ` Richard Stallman
2023-08-09  4:37   ` Andrew Hyatt
2023-08-13  1:43     ` Richard Stallman
2023-08-13  1:43     ` Richard Stallman
2023-08-13  2:11       ` Emanuel Berg
2023-08-15  5:14       ` Andrew Hyatt
2023-08-15 17:12         ` Jim Porter
2023-08-17  2:02           ` Richard Stallman
2023-08-17  2:48             ` Andrew Hyatt
2023-08-19  1:51               ` Richard Stallman
2023-08-19  9:08                 ` Ihor Radchenko
2023-08-21  1:12                   ` Richard Stallman
2023-08-21  8:26                     ` Ihor Radchenko
2023-08-17 17:08             ` Daniel Fleischer
2023-08-19  1:49               ` Richard Stallman
2023-08-19  8:15                 ` Daniel Fleischer
2023-08-21  1:12                   ` Richard Stallman
2023-08-21  4:48               ` Jim Porter
2023-08-21  5:12                 ` Andrew Hyatt
2023-08-21  6:03                   ` Jim Porter
2023-08-21  6:36                 ` Daniel Fleischer [this message]
2023-08-22  1:06                 ` Richard Stallman
2023-08-16  2:30         ` Richard Stallman
2023-08-16  5:11           ` Tomas Hlavaty
2023-08-18  2:10             ` Richard Stallman
2023-08-27  1:07       ` Andrew Hyatt
2023-08-27 13:11         ` Philip Kaludercic
2023-08-28  1:31           ` Richard Stallman
2023-08-28  2:32             ` Andrew Hyatt
2023-08-28  2:59               ` Jim Porter
2023-08-28  4:54                 ` Andrew Hyatt
2023-08-31  2:10                 ` Richard Stallman
2023-08-31  9:06                   ` Ihor Radchenko
2023-08-31 16:29                     ` chad
2023-09-01  9:53                       ` Ihor Radchenko
2023-09-04  1:27                     ` Richard Stallman
2023-09-06 12:25                       ` Ihor Radchenko
2023-09-06 12:51                       ` Is ChatGTP SaaSS? (was: [NonGNU ELPA] New package: llm) Ihor Radchenko
2023-09-06 16:59                         ` Andrew Hyatt
2023-09-09  0:37                           ` Richard Stallman
2023-09-06 22:52                         ` Emanuel Berg
2023-09-07  7:28                           ` Lucien Cartier-Tilet
2023-09-07  7:57                             ` Emanuel Berg
2023-09-09  0:38                         ` Richard Stallman
2023-09-09 10:28                           ` Collaborative training of Libre LLMs (was: Is ChatGTP SaaSS? (was: [NonGNU ELPA] New package: llm)) Ihor Radchenko
2023-09-09 11:19                             ` Jean Louis
2023-09-10  0:22                             ` Richard Stallman
2023-09-10  2:18                               ` Debanjum Singh Solanky
2023-09-04  1:27                     ` [NonGNU ELPA] New package: llm Richard Stallman
2023-08-27 18:36         ` Jim Porter
2023-08-28  0:19           ` Andrew Hyatt
2023-09-04  1:27           ` Richard Stallman
2023-09-04  5:18             ` Andrew Hyatt
2023-09-07  1:21               ` Richard Stallman
2023-09-12  4:54                 ` Andrew Hyatt
2023-09-12  9:57                   ` Philip Kaludercic
2023-09-12 15:05                   ` Stefan Kangas
2023-09-19 16:26                     ` Andrew Hyatt
2023-09-19 16:34                       ` Philip Kaludercic
2023-09-19 18:19                         ` Andrew Hyatt
2023-09-04  1:27         ` Richard Stallman
2023-08-09  3:47 ` Richard Stallman
2023-08-09  4:06   ` Andrew Hyatt
2023-08-12  2:44     ` Richard Stallman
