From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eli Zaretskii
Newsgroups: gmane.emacs.devel
Subject: Re: CSV parsing and other issues (Re: LC_NUMERIC)
Date: Thu, 10 Jun 2021 19:57:33 +0300
Message-ID: <83lf7hbqte.fsf@gnu.org>
References: <20210606233638.v7b7rwbufay5ltn7@E15-2016.optimum.net>
 <83a6o1hn9l.fsf@gnu.org>
 <20210608004510.usj7rw2i6tmx6qnw@E15-2016.optimum.net>
 <83h7i9f5ij.fsf@gnu.org>
 <73df2202-081b-5e50-677d-e4498b6782d4@gmail.com>
 <83eedcdw8k.fsf@gnu.org>
Cc: boruch_baum@gmx.com, emacs-devel@gnu.org
To: Maxim Nikulin
In-Reply-To: (message from Maxim Nikulin on Thu, 10 Jun 2021 23:28:59 +0700)
List-Id: "Emacs development discussions."
Xref: news.gmane.io gmane.emacs.devel:270655

> From: Maxim Nikulin
> Date: Thu, 10 Jun 2021 23:28:59 +0700
> Cc: boruch_baum@gmx.com
>
> > We already use nl_langinfo in locale-info, so what exactly
> > are you suggesting here? adding more items? You don't
> > really expect Lisp programs to format numbers such as
> > 123,456 by hand after learning from locale-info that the
> > thousands separator is a comma, do you?
>
> I have hijacked Boruch's thread and changed the subject to "CSV
> parsing".

That explains part of my confusion. Please try not to hijack
discussions; instead, start a separate thread, to avoid such
confusion.

For processing CSV, if there's a need to know whether the locale uses
the comma as a decimal separator, we could indeed extend locale-info.
But such an extension is almost trivial and doesn't even touch on the
significant problems in the rest of the discussion.

> I was trying to support Boruch that buffer-local variables may be
> important part of locale context, more precise than global settings,

They are more precise, but they don't support mixed languages in the
same buffer, something that happens in Emacs very frequently. Which
means they are not precise enough.

So my POV is that we should look for a way to be able to specify the
language of some span of text, in which case buffers that use a single
language will be a special case.

> > And then we have conceptual problems.
> > For example, in a multilingual editor such as Emacs, the notion
> > of a "buffer language" not always makes sense, you'd need to
> > support portions of text that have different language properties.
> > Imagine switching locales as Emacs processes adjacent
> > stretches of text and other complications. For example,
> > changing letter-case for a stretch of Turkish text is
> > supposed to be different from the English or German text.
> > I'm all ears for ideas how to design such "language
> > support". It definitely isn't easy, so if you have ideas,
> > please voice them!
>
> I never have a consistent vision nor see a conceptual problem.

Here's a trivial example:

  (insert (downcase (buffer-substring POS1 POS2)))

Contrast with

  (insert (downcase "FOO"))

The function 'downcase' gets a Lisp string, but it has no way of
knowing whether the string is actually a portion of the current
buffer's text. So how can it apply the correct letter-case
conversions, even if some buffer-local setting specifies that this
should be done using some specific language's rules?

IOW, one of the non-trivial problems is how to process Lisp strings
correctly for these purposes. Buffers can have local variables, but
what about strings?

> > If you are suggesting that we introduce ICU as a dependency,
> > we could discuss the pros and cons.
>
> I consider it as the most complete available implementation. Do you
> know a comparable alternative?

Yes: what we have already in Emacs. That covers a lot of the same
Unicode turf that ICU handles, because we import and use the same
Unicode files and tables.

The question is: what is best for the future development of Emacs in
this area: depend on ICU (which would mean we need to rewrite lots of
code that is working well), or extend what we have to support more
Unicode features? One not-so-trivial aspect of this is efficiency of
fetching character properties (Emacs has char-tables for that, which
are efficient both CPU- and memory-wise).
Another aspect is support for raw bytes in buffers and strings. And there are probably some others. It is not a simple decision.
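
To make the 'downcase' problem above concrete, here is a minimal
sketch. It assumes the default case table is in effect (no
Turkish-specific case conversion has been enabled), and the helper
name my-downcase-both-ways is made up for this illustration; it is not
an existing Emacs function.

```elisp
;; Hypothetical helper, for illustration only: downcase the same text
;; once after extracting it from a buffer and once as a plain string.
(defun my-downcase-both-ways (text)
  "Return a list (FROM-BUFFER FROM-STRING) of TEXT downcased two ways."
  (with-temp-buffer
    (insert text)
    (list
     ;; Here the argument was extracted from buffer text...
     (downcase (buffer-substring (point-min) (point-max)))
     ;; ...and here it is just a Lisp string.  'downcase' is handed a
     ;; string in both cases and cannot tell where it came from.
     (downcase text))))

;; Under the stated assumption, both elements of the result are the
;; same string, with ASCII "i" for "I".
(my-downcase-both-ways "ISTANBUL")
```

Turkish rules would call for a dotless lowercase i (U+0131) here, but
nothing in the string itself says that the text is Turkish, so both
calls necessarily treat it the same way.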