From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail From: Lynn Winebarger Newsgroups: gmane.emacs.devel Subject: Re: Grammar checking Date: Thu, 6 Apr 2023 08:29:15 -0400 Message-ID: References: <87sfdnyuxc.fsf@posteo.de> <83sfdl2z26.fsf@gnu.org> <58158ae49808189da7b2@heytings.org> <83mt3t2xz1.fsf@gnu.org> <86jzyxxqir.fsf@gmail.com> <58158ae4986fa602fe47@heytings.org> Mime-Version: 1.0 Content-Type: multipart/alternative; boundary="000000000000e6b45405f8aa0fb1" Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214"; logging-data="37696"; mail-complaints-to="usenet@ciao.gmane.io" Cc: emacs-devel To: Richard Stallman Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Thu Apr 06 14:30:38 2023 Return-path: Envelope-to: ged-emacs-devel@m.gmane-mx.org Original-Received: from lists.gnu.org ([209.51.188.17]) by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1pkOlB-0009dO-RV for ged-emacs-devel@m.gmane-mx.org; Thu, 06 Apr 2023 14:30:38 +0200 Original-Received: from localhost ([::1] helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1pkOkM-0007tt-AJ; Thu, 06 Apr 2023 08:29:46 -0400 Original-Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1pkOkL-0007tj-FM for emacs-devel@gnu.org; Thu, 06 Apr 2023 08:29:45 -0400 Original-Received: from mail-pg1-x535.google.com ([2607:f8b0:4864:20::535]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1pkOkI-00010y-Bm; Thu, 06 Apr 2023 08:29:44 -0400 Original-Received: by mail-pg1-x535.google.com with SMTP id s19so23663867pgi.0; Thu, 06 Apr 2023 05:29:40 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; t=1680784179; 
h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=dpqCN4oj4ipxP+JYIS1N8vaOIEY8kSDjHAUTWPpHjLw=; b=IW9Lp0juHZOHzuYJ5xxB12DmLexfoDOGHRWlO2obfmHqxWAD1PYrYg9fy+C5xna/l+ +0OdQDlQP+R5Qjl9EDX7Enyaxa9cgrMr88l4QlHnlEG6JiRZfzGgHKHRAeM54sDrXeiw /KPj7qe6uvfka3uJdAAmThXdfApOZ+TpD7Qt3EriZnaWJ2x9qBidG85yWKsz7uOrpaRb 8CE5goJTxWitZndnCoBvGskV8NyImrW6LbhDfFsBPhLLGEm4qWfd5+2xhwRu4lixZFGx gu5r+NkBKeicu+5sFzCdkSos9r2GBS1AgohSIkGwgsE8JIKEbLG8FXn3lATmA6f2opXo iKWQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; t=1680784179; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=dpqCN4oj4ipxP+JYIS1N8vaOIEY8kSDjHAUTWPpHjLw=; b=4F075oaFG701gUZvBO6EtEbO0jzui1Lj4m2nM/blGRB3G11aBV6ksQ8pg9yhzqLYWK HlOZfrU7ijBE/9+0zmgNUFu7nvuJvqL78OFLIrd0AdCUvbiZRcQUIRJguFfWY+bkag4a iAcqPXptRod0VMoL887kb9YyZCsFJkFszlZmeozguAnqLBPUx8TVgDlbEOIBWQOZdTzD ZQL5yIDR/GoiYL2WtiiLUsd/H8qVWYJfk4iKbcQLT8oOuOEVtaauQpCh3NAv7dTjwEUv 1CBAdI3jt7i6fk+ybc4XcPlpIzf6uahhdQv0ns8ziVQbFAZApCtvRidzg+O03sJfiwX9 aVww== X-Gm-Message-State: AAQBX9f9v10qMpTKbo3D9teXGhw1bpmffrbFOTmuoA2fBLIK9gk3e+P2 705KD5OvA/SAMCbmXmHzi9CchvEGxdXFjCSNiFj+S7laPa0= X-Google-Smtp-Source: AKy350ZPV6trVZgO2EJTndZHnHAev0Bar/TJJ6RmO6wvojUQ99Sqe1kjylSy0IlI2IvHzKeo0aB9HnnBRlpV7vMny/Q= X-Received: by 2002:a63:e912:0:b0:513:c20d:9508 with SMTP id i18-20020a63e912000000b00513c20d9508mr3142423pgh.4.1680784178458; Thu, 06 Apr 2023 05:29:38 -0700 (PDT) In-Reply-To: Received-SPF: pass client-ip=2607:f8b0:4864:20::535; envelope-from=owinebar@gmail.com; helo=mail-pg1-x535.google.com X-Spam_score_int: -20 X-Spam_score: -2.1 X-Spam_bar: -- X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham 
autolearn_force=no X-Spam_action: no action X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Original-Sender: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Xref: news.gmane.io gmane.emacs.devel:305143 Archived-At:

--000000000000e6b45405f8aa0fb1
Content-Type: text/plain; charset="UTF-8"

On Sun, Apr 2, 2023, 11:05 PM Richard Stallman wrote:

> [[[ To any NSA and FBI agents reading my email: please consider    ]]]
> [[[ whether defending the US Constitution against all enemies,     ]]]
> [[[ foreign or domestic, requires you to follow Snowden's example. ]]]
>
> > > If the released (and free) LanguageTool _program_ gives adequate
> > > results, we could make Emacs support working with that.  But we should
> > > take pains _not_ to support the kind of communication that that SaaSS
> > > server offers.
>
> > They may not make it easy, see this complaint on their forum:
>
> Would you please spell out what it is
> that they "may not make easy"?
>
> > https://forum.languagetool.org/t/about-the-premium-version-of-languagetool/8469
>
> I looked at that page, but lacking the context, I can't understand it
> well enough to divine the point that your message hints at.
>

I may have been mistaken in my first reading.  I read the forum message
as saying that any process using the free service would receive an
advertisement of how many corrections would be found by the premium
service.  I assume that, at the least, Emacs maintainers would want to
filter that out by default.  The forum message may refer only to using
the web user interface for checking sample text, though.

If my first reading is correct, it could be difficult to ensure such
advertising is always filtered.  That depends on the owners of the
service, who can change over time.

> > * The process for contributing "rules" to the free version is to go
> > through the SaaSS's forum sites.
> > https://community.languagetool.org/rule/list?lang=en shows 5919 rules
> > for english, presumably in the basic version.
>

I found a more on-point reference addressing my concern, i.e. how
contributions that replicate rules implemented in the premium version
will be treated by the project developers:
https://forum.languagetool.org/t/free-lt-premium-for-contributors/8639

Since the exact nature of those premium rules is presumably not
disclosed just by virtue of having a premium subscription, I can only
guess this reverse engineering would happen by following a process like:

1) Take a large corpus of texts with known grammatical errors, e.g.
   https://www.cl.cam.ac.uk/research/nl/bea2019st/ or
   https://ai.googleblog.com/2021/08/the-c4200m-synthetic-dataset-for.html?m=1
2) Record the results produced by the free and premium versions on each
   test case
3) Formulate rules that specifically fix issues found by the premium
   version and not the free version.

Perhaps the LanguageTool.org owners would consider this a violation of
their service's terms and conditions, and use that as a justification
for not accepting contributions of source code to the project.

OTOH, if an Emacs developer or user simply wants to systematically
improve the free version of LanguageTool, the most obvious method for
doing so would be:

1) Take a large corpus of texts with known grammatical errors, see above
2) Record the results produced by the free rule set
3) Formulate rules that specifically address the errors it missed,
   prioritizing them by some measure of expected frequency in real text

Presumably the additional rules in the premium version have been added
according to just such a measure of expected frequency, possibly by
analysis of real-world text from users over the years the service has
been available.
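
To make steps 2 and 3 of the second process a little more concrete,
here is a minimal Python sketch.  Everything in it is an assumption made
for illustration: the corpus format (a text plus a set of annotated
errors), the `check` callable standing in for the free rule set, and the
category labels are all hypothetical and do not correspond to any actual
LanguageTool interface.

```python
from collections import Counter
from typing import Callable, Iterable

# Hypothetical annotation: (start offset, end offset, category label).
Error = tuple[int, int, str]


def rank_missed_categories(
    corpus: Iterable[tuple[str, set[Error]]],
    check: Callable[[str], set[tuple[int, int]]],
) -> list[tuple[str, int]]:
    """Count known errors the checker fails to flag, grouped by category."""
    missed = Counter()
    for text, known_errors in corpus:
        flagged = check(text)  # spans the checker reported for this text
        for start, end, category in known_errors:
            if (start, end) not in flagged:
                missed[category] += 1
    # Most frequently missed categories first: candidates for new rules.
    return missed.most_common()


def toy_check(text: str) -> set[tuple[int, int]]:
    # Stand-in for the free rule set: flags only the doubled word "the the".
    i = text.find("the the")
    return {(i, i + 7)} if i >= 0 else set()


# Tiny made-up corpus with hand-annotated errors.
corpus = [
    ("I saw the the cat.", {(6, 13, "repeated-word")}),
    ("He go home.", {(3, 5, "agreement")}),
    ("They goes out.", {(5, 9, "agreement")}),
]

print(rank_missed_categories(corpus, toy_check))  # [('agreement', 2)]
```

Note that nothing here consults the premium service: the ranking comes
only from the annotated corpus and the free rule set's own output.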

It would be surprising if any successful attempt to systematically
improve the rules in LanguageTool did not overlap significantly with the
rules found in the premium version, just due to the definition of
"successful" in statistical terms and the assumption that the premium
rule set is likewise "successful".

> We could consider forking that code in a limited way: adding new rules.
>
> In general, we should cooperate with upstream developers, but we don't
> have to jump through hoops to do so.
>

I'm not personally very pure in the software I use, so I'm surprised at
how much the issues I perceive seem to bother me.  I've been an Emacs
user since the 90s, and it would never have occurred to me that I would
ever be concerned about contributing code to improve Emacs, whether
directly to the Emacs project or indirectly through one of its
dependencies.  From what I see now, that will no longer be the case if
grammar checking support is added that depends on LanguageTool.

I suppose there's another, even more abstract concern with open source
software that is developed specifically in conjunction with a SaaSS
business, which is: To what extent does data from users of the SaaSS
drive development, or even get incorporated in some (aggregated or
statistical) form in the source code?  For example, what if a grammar
checker incorporated a "deep learning" system that had been trained on
such data?  In most cases, it would be impossible to reconstruct the
training data set starting from the data specifying the trained model.
But would it be acceptable for a GNU software project to depend on such
software?  I don't know the answer, but I think it's a real question
when dealing with open source software from projects like LanguageTool.
I also don't know or allege that there's anything like that in
LanguageTool, but neither can I be certain that there is not.
I can't help but think this business model - maintaining an open source
version as a loss leader for a proprietary or SaaSS version - is only
going to continue growing, hence the need to address it, in the GNU
coding manual section 8 or otherwise.

> > Looking at the java code makes it appear there are
> > many hard-coded rules, but I don't know if that is really the case.
> > That is whether the code for the rules are some generic implementation
> > of the rules coded in XML, or if the XML rule sets are being
> > translated into java code at some point in the build process.
>
> I can only guess at the context this is about, but it sounds like
> you're suggesting that it may not be clear what form of the code is
> the real source code.  Do they not say?  Does their source release
> include the XML?  Does it include Make rules to translate the XML into
> Java?

I don't do a lot of Java coding, and it was a cursory examination.  I
did eventually find the XML rule sets linked to from
https://dev.languagetool.org/languages, which is classified as "user
documentation".  It appears most rules in well-supported languages are
in XML, with some coded in Java.  Whether the coding in Java is for
speed or to overcome limitations of the semantics of rules expressed in
XML, I have no idea.

I'm going to leave my concerns at that.  I've already spent too much
time on this as it is.  I just thought the last-minute hair-pulling
discussion of tree-sitter grammar files, which frankly seem to have much
less ethical baggage, should not be repeated after grammar checking
support depending on LanguageTool is already implemented and adopted.

Lynn

--000000000000e6b45405f8aa0fb1--