From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail
From: Eli Zaretskii <eliz@gnu.org>
Newsgroups: gmane.emacs.devel
Subject: Re: CSV parsing and other issues (Re: LC_NUMERIC)
Date: Thu, 10 Jun 2021 19:57:33 +0300
Message-ID: <83lf7hbqte.fsf@gnu.org>
References: <20210606233638.v7b7rwbufay5ltn7@E15-2016.optimum.net>
 <83a6o1hn9l.fsf@gnu.org>
 <20210608004510.usj7rw2i6tmx6qnw@E15-2016.optimum.net>
 <83h7i9f5ij.fsf@gnu.org> <73df2202-081b-5e50-677d-e4498b6782d4@gmail.com>
 <83eedcdw8k.fsf@gnu.org> <ea575f9a-5d90-d13f-4d8c-58552541034d@gmail.com>
Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214";
	logging-data="11880"; mail-complaints-to="usenet@ciao.gmane.io"
Cc: boruch_baum@gmx.com, emacs-devel@gnu.org
To: Maxim Nikulin <manikulin@gmail.com>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Thu Jun 10 18:59:12 2021
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane-mx.org
Original-Received: from lists.gnu.org ([209.51.188.17])
	by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
	(Exim 4.92)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org>)
	id 1lrO1Q-0002yj-2R
	for ged-emacs-devel@m.gmane-mx.org; Thu, 10 Jun 2021 18:59:12 +0200
Original-Received: from localhost ([::1]:49698 helo=lists1p.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.90_1)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org>)
	id 1lrO1P-0006vV-4M
	for ged-emacs-devel@m.gmane-mx.org; Thu, 10 Jun 2021 12:59:11 -0400
Original-Received: from eggs.gnu.org ([2001:470:142:3::10]:46534)
 by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <eliz@gnu.org>) id 1lrO05-00059d-Hh
 for emacs-devel@gnu.org; Thu, 10 Jun 2021 12:57:49 -0400
Original-Received: from fencepost.gnu.org ([2001:470:142:3::e]:59522)
 by eggs.gnu.org with esmtp (Exim 4.90_1)
 (envelope-from <eliz@gnu.org>)
 id 1lrO04-0006vL-TB; Thu, 10 Jun 2021 12:57:48 -0400
Original-Received: from 84.94.185.95.cable.012.net.il ([84.94.185.95]:2730
 helo=home-c4e4a596f7)
 by fencepost.gnu.org with esmtpsa (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <eliz@gnu.org>)
 id 1lrO04-0008Jn-G6; Thu, 10 Jun 2021 12:57:48 -0400
In-Reply-To: <ea575f9a-5d90-d13f-4d8c-58552541034d@gmail.com> (message from
 Maxim Nikulin on Thu, 10 Jun 2021 23:28:59 +0700)
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
 <mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <https://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
 <mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org
Original-Sender: "Emacs-devel"
 <emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org>
Xref: news.gmane.io gmane.emacs.devel:270655
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/270655>

> From: Maxim Nikulin <manikulin@gmail.com>
> Date: Thu, 10 Jun 2021 23:28:59 +0700
> Cc: boruch_baum@gmx.com
> 
> > We already use nl_langinfo in locale-info, so what exactly
> > are you suggesting here? adding more items?  You don't
> > really expect Lisp programs to format numbers such as
> > 123,456 by hand after learning from locale-info that the
> > thousands separator is a comma, do you?
> 
> I have hijacked Boruch's thread and changed the subject to "CSV 
> parsing".

That explains part of my confusion.  Please try not to hijack
discussions; instead, start a separate thread, to avoid such
confusion.

For processing CSV, if there's a need to know whether the locale uses
the comma as a decimal separator, we could indeed extend locale-info.
But such an extension is almost trivial and doesn't even touch on the
significant problems in the rest of the discussion.

> I was trying to support Boruch that buffer-local variables may be
> important part of locale context, more precise than global settings,

They are more precise, but they don't support mixed languages in the
same buffer, something that happens in Emacs very frequently.  Which
means they are not precise enough.  So my POV is that we should look
for a way to be able to specify the language of some span of text, in
which case buffers that use a single language will be a special case.

> > And then we have conceptual problems.  For example, in a
> > multilingual editor such as Emacs, the notion of a "buffer
> > language" does not always make sense, you'd need to support
> > portions of text that have different language properties.
> > Imagine switching locales as Emacs processes adjacent
> > stretches of text and other complications.  For example,
> > changing letter-case for a stretch of Turkish text is
> > supposed to be different from the English or German text.
> > I'm all ears for ideas how to design such "language
> > support".  It definitely isn't easy, so if you have ideas,
> > please voice them!
> 
> I never have a consistent vision nor see a conceptual problem. 

Here's a trivial example:

  (insert (downcase (buffer-substring POS1 POS2)))

Contrast with

  (insert (downcase "FOO"))

The function 'downcase' gets a Lisp string, but it has no way of
knowing whether the string is actually a portion of the current buffer's
text.  So how can it apply the correct letter-case conversions, even
if some buffer-local setting specifies that this should be done using
some specific language's rules?

IOW, one of the non-trivial problems is how to process Lisp strings
correctly for these purposes.  Buffers can have local variables, but
what about strings?
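
To make this concrete, here is a minimal sketch.  The variable
'my-buffer-language' and the function 'my-downcase-demo' are
hypothetical names used only for illustration; nothing like them
exists in Emacs.  The point is only that, once the text has been
extracted, 'downcase' receives the same kind of object in both cases:

```elisp
;; Hypothetical buffer-local language setting; NOT an existing Emacs
;; variable.  It stands in for whatever per-buffer setting one might
;; propose.
(defvar-local my-buffer-language nil)

(defun my-downcase-demo ()
  "Compare a string taken from a buffer with an equal literal string.
Return a list (SAME-STRING SAME-RESULT)."
  (with-temp-buffer
    ;; Pretend this buffer holds Turkish text.
    (setq my-buffer-language 'turkish)
    (insert "FOO")
    (let ((from-buffer (buffer-substring (point-min) (point-max)))
          (literal "FOO"))
      ;; By the time `downcase' is called, FROM-BUFFER is just a Lisp
      ;; string; it carries no record of the buffer it came from, so
      ;; `downcase' cannot tell it apart from LITERAL.
      (list (equal from-buffer literal)
            (equal (downcase from-buffer) (downcase literal))))))
```

Both elements of the returned list are non-nil: the two strings are
indistinguishable to 'downcase', whatever the buffer-local value says.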

> > If you are suggesting that we introduce ICU as a dependency,
> > we could discuss the pros and cons.
> 
> I consider it as the most complete available implementation.  Do you 
> know a comparable alternative?

Yes: what we have already in Emacs.  That covers a lot of the same
Unicode turf that ICU handles, because we import and use the same
Unicode files and tables.  The question is what is best for the
future development of Emacs in this area: to depend on ICU (which would
mean rewriting lots of code that is working well), or to extend
what we have to support more Unicode features.  One not-so-trivial
aspect of this is efficiency of fetching character properties (Emacs
has char-tables for that, which are efficient both CPU- and
memory-wise).  Another aspect is support for raw bytes in buffers and
strings.  And there are probably some others.
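
As a rough illustration of what we already have, the sketch below
looks up a Unicode property through the existing Lisp API and checks
that the case-conversion data is kept in a char-table.  The function
name 'my-char-props' is made up for this example; the calls it makes
('get-char-code-property', 'standard-case-table', 'char-table-p')
are existing Emacs functions:

```elisp
(defun my-char-props (char)
  "Return a list (GENERAL-CATEGORY CASE-TABLE-IS-CHAR-TABLE) for CHAR."
  (list
   ;; Unicode General_Category of CHAR, taken from the tables that
   ;; Emacs builds from the Unicode data files.
   (get-char-code-property char 'general-category)
   ;; The standard case table, consulted by `downcase' and `upcase',
   ;; is itself a char-table.
   (char-table-p (standard-case-table))))
```

For example, (my-char-props ?a) reports the general category of a
lowercase Latin letter together with the char-table check.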

It is not a simple decision.