From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!.POSTED.blaine.gmane.org!not-for-mail
From: Titus von der Malsburg <malsburg@posteo.de>
Newsgroups: gmane.emacs.devel
Subject: Re: Changing dictionary while flyspell-buffer is running
Date: Fri, 22 Feb 2019 10:57:29 +0100
Message-ID: <87a7iocfhy.fsf@posteo.de>
References: <874l8ztmgk.fsf@posteo.de> <E1gwf0K-00072s-KC@fencepost.gnu.org>
	<838sy9hkgk.fsf@gnu.org> <87k1htsfpa.fsf@posteo.de>
	<83zhqpfb0t.fsf@gnu.org> <87ftsh3p42.fsf@fastmail.fm>
	<83mumogaza.fsf@gnu.org> <87ef80dekm.fsf@posteo.de>
	<83lg28fgdg.fsf@gnu.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Injection-Info: blaine.gmane.org; posting-host="blaine.gmane.org:195.159.176.226";
	logging-data="243065"; mail-complaints-to="usenet@blaine.gmane.org"
User-Agent: mu4e 1.1.0; emacs 26.1.91
Cc: joostkremers@fastmail.fm, rms@gnu.org, emacs-devel@gnu.org
To: Eli Zaretskii <eliz@gnu.org>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org Fri Feb 22 11:04:35 2019
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane.org
Original-Received: from lists.gnu.org ([209.51.188.17])
	by blaine.gmane.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:256)
	(Exim 4.89)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1gx7h4-00115H-JD
	for ged-emacs-devel@m.gmane.org; Fri, 22 Feb 2019 11:04:34 +0100
Original-Received: from localhost ([127.0.0.1]:47958 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>)
	id 1gx7h3-0005Fv-Go
	for ged-emacs-devel@m.gmane.org; Fri, 22 Feb 2019 05:04:33 -0500
Original-Received: from eggs.gnu.org ([209.51.188.92]:43420)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <malsburg@posteo.de>) id 1gx7aK-0000Q2-W3
	for emacs-devel@gnu.org; Fri, 22 Feb 2019 04:57:38 -0500
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <malsburg@posteo.de>) id 1gx7aJ-0001Ze-Ch
	for emacs-devel@gnu.org; Fri, 22 Feb 2019 04:57:36 -0500
Original-Received: from mout01.posteo.de ([185.67.36.65]:51878)
	by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32)
	(Exim 4.71) (envelope-from <malsburg@posteo.de>) id 1gx7aI-0001Sd-8M
	for emacs-devel@gnu.org; Fri, 22 Feb 2019 04:57:35 -0500
Original-Received: from submission (posteo.de [89.146.220.130]) 
	by mout01.posteo.de (Postfix) with ESMTPS id 01B75160062
	for <emacs-devel@gnu.org>; Fri, 22 Feb 2019 10:57:30 +0100 (CET)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=posteo.de; s=2017;
	t=1550829451; bh=qrvZWauvwz8MIeRms7Tp1/NoQeV83o2d062UP/5Ry8g=;
	h=From:To:Cc:Subject:Date:From;
	b=ge4BJBJWyBweUhy1F7VTuLkKTqeQEobPG4C+9D7zkqL4DsUwNL3bewIBIeYtbMoAL
	75Tt5YxEGffuzO208rzgzb4avHd0GzxW3wpaOiI6vU0ruG+FxsPjTgSB9nql8eX1ca
	EmTqTTPZxK4OQ2yXaLE4YgO0QV9HDoRNS014NR3HkgA/+th5zEyhFNUhoDPce9IbNp
	hPegJl5ZDsIpGIceROEnpjyZeNzs41VKSgNkeBBSnGRyMgMrN+XN0tu1GJ8jg/KqBh
	oP0P0ckf2LGwX5ecUQxbNrA6XZR+rwV3gQSOjFfuyUDUNgKN+fK+xcmE2NZp5/vCUS
	SuAV8V0e9c9Pg==
Original-Received: from customer (localhost [127.0.0.1])
	by submission (posteo.de) with ESMTPSA id 445RbG0vb1z6tmL;
	Fri, 22 Feb 2019 10:57:29 +0100 (CET)
In-reply-to: <83lg28fgdg.fsf@gnu.org>
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic]
X-Received-From: 185.67.36.65
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/emacs-devel/>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
	<mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org
Original-Sender: "Emacs-devel" <emacs-devel-bounces+ged-emacs-devel=m.gmane.org@gnu.org>
Xref: news.gmane.org gmane.emacs.devel:233530
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/233530>


On 2019-02-22 Fri 08:10, Eli Zaretskii wrote:
>> From: Titus von der Malsburg <malsburg@posteo.de>
>> Cc: Joost Kremers <joostkremers@fastmail.fm>, emacs-devel@gnu.org, rms@gnu.org
>> Date: Thu, 21 Feb 2019 22:19:53 +0100
>>
>> > It needs at least 30 letters to guess right, which is quite a few.
>>
>> The number of letters depends on the configured languages, it could be
>> less than 30 when the scripts are different but for English, Dutch,
>> and German 30 works well in my experience and languages don’t get much
>> more similar than that (except if you want to distinguish between US
>> English and UK English).
>
> The minimum number also depends on the expected reliability of
> language detection, of course.

Of course.  I should say that I didn’t come up with the algorithm.  It’s
a standard approach to language detection used in many contexts.  Its
selling points are high accuracy, low computational complexity, and that
only a small amount of language data is required.  For most languages,
we need only 1.2Kb of data.
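
For readers unfamiliar with the technique: one common family of such
detectors compares character trigram frequency profiles.  Below is a
minimal sketch in Python, assuming that this is the kind of approach
meant here.  All names (trigram_profile, out_of_place, guess_language)
and the toy training sentences are illustrative and not taken from any
package; real profiles are built from much larger corpora.

```python
from collections import Counter

PROFILE_SIZE = 300  # assumed cap on trigrams kept per profile


def trigram_profile(text, size=PROFILE_SIZE):
    """Return the `size` most frequent character trigrams, most frequent first."""
    # Lower-case and collapse whitespace so word boundaries become single spaces.
    cleaned = " " + " ".join(text.lower().split()) + " "
    counts = Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))
    return [gram for gram, _ in counts.most_common(size)]


def out_of_place(sample, reference):
    """Sum of rank differences; trigrams absent from `reference` cost PROFILE_SIZE."""
    rank = {gram: i for i, gram in enumerate(reference)}
    return sum(
        abs(i - rank[gram]) if gram in rank else PROFILE_SIZE
        for i, gram in enumerate(sample)
    )


def guess_language(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    sample = trigram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(sample, profiles[lang]))


# Toy training data (hypothetical), one sentence per language.
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and then it runs "
          "through the woods with the others",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann "
          "läuft er durch den wald mit den anderen",
    "nl": "de snelle bruine vos springt over de luie hond en dan rent hij "
          "door het bos met de anderen",
}
PROFILES = {lang: trigram_profile(text) for lang, text in TRAINING.items()}

print(guess_language("the dog runs through the woods", PROFILES))
```

Only the ranked trigram list has to be stored for each language, which
is why the data footprint of this kind of detector stays small.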

[More below.]

>> I just tried it and noticed one downside: Flyspell offers possible
>> corrections for unknown words and when multiple languages are
>> configured, these suggestions come from all configured dictionaries.
>
> Of course, but what would you expect?

I would expect to get only suggestions from the language that I’m
currently typing in.

> And how is that a downside?

If I have to pick the correct word from a list that contains many
irrelevant words, it will take more time.  The suggestions are just less
relevant on average.

> Hunspell doesn't try to guess the language at all, it just looks in
> all loaded dictionaries one by one.

That’s the problem. :)
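
To illustrate the difference, here is a small Python sketch that
contrasts suggestions pooled from every loaded dictionary with
suggestions restricted to one language.  The word lists and the
suggest function are hypothetical, and difflib's similarity matching
merely stands in for a real speller's suggestion logic; this is not
how Hunspell itself computes suggestions.

```python
from difflib import get_close_matches

# Toy word lists (hypothetical); real spellers ship far larger dictionaries.
DICTIONARIES = {
    "en": ["house", "mouse", "horse", "hose"],
    "de": ["Haus", "Maus", "Hase", "Hose"],
}


def suggest(word, languages, limit=5):
    """Collect close matches from each listed dictionary, one after another."""
    suggestions = []
    for lang in languages:
        suggestions += get_close_matches(
            word, DICTIONARIES[lang], n=limit, cutoff=0.6
        )
    return suggestions


pooled = suggest("hause", DICTIONARIES)   # every loaded dictionary
english_only = suggest("hause", ["en"])   # only the detected language

print(pooled)
print(english_only)
```

The pooled list contains everything in the restricted list plus the
German candidates, so the user has more entries to scan before finding
the intended word.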

>> Many of them are of course not relevant because they are not in the
>> language of the paragraph.
>
> There's no "language of the paragraph" in this method, you can freely
> mix words from different languages in the same paragraph.  There are
> important use cases for that, like editing a message translation
> catalog or text that explains in-line the meaning of words in
> another language.

If the use case is working with paragraphs that mix languages, the user
is free to use Hunspell.  However, there is the other use case, the one
that I’m interested in, where the document contains whole paragraphs
each in its own language.  Plus the use case where the document is in
just one language and I don’t want the spell-checker to suggest words in
some other language.

Note that automatic language detection has other applications beyond
changing dictionaries for spell checkers.  For instance, it can
automatically switch the voice used by the Festival speech synthesizer,
which is useful for blind people working with text in multiple
languages.  It can also switch the typographical conventions used by
typo-mode (e.g., use quote symbols that are appropriate for the current
language).  It could also switch the language of dictionaries, thesauri,
and text completion packages such as company-ngrams.

>> Flyspell also has an autocorrection feature (which I’m not using)
>> and this feature would also largely stop being useful with multiple
>> dictionaries.
>
> It will only become less useful if the first correction is off in a
> significant number of cases.  Which is not at all expected, certainly
> not when each language uses a different script.
>
>> I think that this makes the Hunspell solution less appealing.
>
> I think you are slightly biased ;-).  As am I, most probably.  Both
> solutions have their advantages and disadvantages, and the user should
> choose which one better suits his/her needs in each case.

Exactly, and that’s why I never said that people should be prevented
from using Hunspell with multiple dictionaries if that’s the best
solution for them.  :)

> I mentioned Hunspell because I think few people even know about this
> feature, which is quite unique among spellers supported by Emacs.

That is true.  I certainly didn’t know about that feature.  Hunspell is
fairly impressive, especially for languages like German that can freely
compose new words.  Following this conversation, I might actually switch.

In sum, I don’t want to push my package on anyone.  I said I would be
happy to contribute it to Emacs/ELPA /if/ there is interest.  But I’m
perfectly happy with keeping it in MELPA where it currently lives.

Regarding my initial question: I had a closer look at how
flyspell-buffer works internally and I’m afraid there is no easy way to
make it switch languages half-way through the document.  The hook for
incorrect words is called only when the spell-checker has already
finished its work.  It will be necessary to write a new function that
processes the document paragraph by paragraph.
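
The overall shape of such a function can be sketched in Python,
independent of Flyspell's internals: split the text into paragraphs,
detect each paragraph's language, then check its words against that
language's dictionary only.  Everything here is a hypothetical
stand-in: check_document, the toy detector, and the tiny word sets are
placeholders for a real detector and a real spell-checker.

```python
import re


def split_paragraphs(text):
    """Split on blank lines, dropping empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def check_document(text, detect, dictionaries):
    """Return (paragraph_index, language, unknown_words) for each paragraph.

    `detect` maps a paragraph to a language code; `dictionaries` maps
    language codes to sets of known lower-case words.
    """
    results = []
    for index, paragraph in enumerate(split_paragraphs(text)):
        lang = detect(paragraph)          # language is chosen per paragraph
        known = dictionaries[lang]        # only that dictionary is consulted
        words = re.findall(r"[^\W\d_]+", paragraph.lower())
        unknown = [w for w in words if w not in known]
        results.append((index, lang, unknown))
    return results


# Toy stand-ins (hypothetical), just to make the sketch runnable.
WORDS = {
    "en": {"the", "dog", "runs", "fast"},
    "de": {"der", "hund", "läuft", "schnell"},
}


def toy_detect(paragraph):
    """Pick the language whose word list covers the most words."""
    words = paragraph.lower().split()
    return max(WORDS, key=lambda lang: sum(w in WORDS[lang] for w in words))


DOCUMENT = "The dog runs fsat\n\nDer Hund läuft schnel"
REPORT = check_document(DOCUMENT, toy_detect, WORDS)
print(REPORT)
```

Because the language is fixed before any word of a paragraph is
checked, the dictionary never has to change in the middle of a
checking pass.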

Thanks for all the suggestions.

  Titus


--
Dr. Titus von der Malsburg
Department of Linguistics
University of Potsdam, Germany
https://tmalsburg.github.io