From: Maxim Nikulin
Newsgroups: gmane.emacs.devel
Subject: Re: CSV parsing and other issues (Re: LC_NUMERIC)
Date: Thu, 10 Jun 2021 23:28:59 +0700
References: <20210606233638.v7b7rwbufay5ltn7@E15-2016.optimum.net>
 <83a6o1hn9l.fsf@gnu.org>
 <20210608004510.usj7rw2i6tmx6qnw@E15-2016.optimum.net>
 <83h7i9f5ij.fsf@gnu.org>
 <73df2202-081b-5e50-677d-e4498b6782d4@gmail.com>
 <83eedcdw8k.fsf@gnu.org>
To: emacs-devel@gnu.org
Cc: boruch_baum@gmx.com
In-Reply-To: <83eedcdw8k.fsf@gnu.org>

On 09/06/2021 01:52, Eli Zaretskii wrote:
> From: Maxim Nikulin
> Date: Tue, 8 Jun 2021 23:35:51 +0700

I have reordered some parts of the discussion.

>> I just have realized that nl_langinfo(3) (and
>> nl_langinfo_l(3) as well) from libc accepts RADIXCHAR
>> (decimal dot) and THOUSEP (group separator)
>> arguments. They are good candidates for `locale-info'
>> extension.
>
> We already use nl_langinfo in locale-info, so what exactly
> are you suggesting here? adding more items? You don't
> really expect Lisp programs to format numbers such as
> 123,456 by hand after learning from locale-info that the
> thousands separator is a comma, do you?

I have hijacked Boruch's thread and changed the subject to "CSV
parsing". There are plenty of CSV dialects.
If the decimal separator is ",", office software uses ";" instead of
a comma as the cell (field) separator. So to parse a CSV file it is
necessary to know the decimal separator of the specified locale.
RADIXCHAR as an argument of nl_langinfo(3) is a first step towards a
better user experience with CSV files. Unfortunately, it only allows
getting a reasonable visual representation. Taking advantage of Org
spreadsheet calculations requires parsing cell contents, and thus
parsing numbers (and maybe dates).

I mentioned earlier https://debbugs.gnu.org/47885 and a part of the
discussion that is missing from the bug tracker:
https://lists.gnu.org/archive/html/emacs-orgmode/2021-05/msg00693.html

I have seen nl_langinfo without RADIXCHAR in the Emacs sources:
http://git.savannah.gnu.org/cgit/emacs.git/tree/src/w32proc.c#n3258
http://git.savannah.gnu.org/cgit/emacs.git/tree/lib-src/ntlib.c#n520

Originally, during the discussion on emacs-orgmode, I did not plan to
raise the question of number formatting and parsing, since I had no
hope of any positive outcome without a consistent proposal. By chance
I noticed Boruch's message and decided to add another use case.

>> On 08/06/2021 09:35, Eli Zaretskii wrote:
>> > From: Boruch Baum
>> >> No? If an Emacs user has two buffers in two separate languages, the
>> >> buffer-local settings aren't / won't be respected?
>> >
>> > First, language is different from locale. And second, we don't even
>> > have a buffer-local notion of language yet.
>>
>> Certainly locale is more precise than just language since it includes
>> region and other variants, moreover it can be granularly tuned (date,
>> numbers, sorting can be adjusted independently), but I still think that
>> all these properties can be sometimes broadly referred to as language.
>
> No, they cannot, not in general. A locale comes with a whole database
> of different settings: language, encoding (a.k.a. "codeset"), formats
> of date and time, names of days of the week and of the months, rules
> for collation and capitalization, etc. etc. You can easily find
> several locales whose language is English, but some/many/all of the
> other locale-dependent settings are different. It isn't a coincidence
> that a locale's name includes more than just the language part.

I wrote almost the same about locale variants and components, so I
sense some confusion and cannot find its origin. I was trying to
support Boruch's point that buffer-local variables may be an important
part of the locale context: more precise than global settings, and a
fallback when no locale is specified for a particular span of text.
With respect to such a hierarchy, the difference between language and
locale does not matter.

>> Low level functions can accept explicit locale.
>
> Which ones? Most libc routines don't, they use the locale
> as a global identifier. And many libc's (with the prominent
> exception of glibc) don't support efficient change of a
> locale in the middle of a program, they assume that the
> program's locale is set once at program startup.

I meant hypothetical functions in a new Elisp API, maybe relying on
some external library. I believed you agreed that the global
LC_NUMERIC must be "C" to avoid various sorts of problems with data
exchange. I am not aware of libc functions for number formatting or
parsing that take an explicit locale (I have seen such a feature in
the C++ standard library, Qt, and other languages). The totalitarian
approach of libc, with its single locale facet and single time zone,
imposes limitations too hard for its functions to be considered useful
and reliable in a more or less complex application. Its API is
suitable for simple tools that quickly do their work and do not
involve any conversion. A more flexible base layer is required when a
mix of environments is expected. Full support of locale features
requires a lot of work; that is why I am asking whether some external
library can be used instead.
>> Higher level API can obtain it implicitly from
>> buffer-local variables and global locale. For example the
>> LOCALE argument of `string-collate-lessp' is optional
>> one. I can even anticipate that locale may be stored in
>> text properties some times. A random message from recent
>> "About multilingual documents" thread at emacs-orgmode
>> mail list:
>> https://lists.gnu.org/archive/html/emacs-orgmode/2021-05/msg00252.html
>
> That's mostly about input methods and org-export, I don't
> see how it's relevant to what Boruch asked.

I added this link to show that the demand for multilingual documents
is real. Notice that problems with spell checking were mentioned in
that discussion. Earlier I saw suggestions to switch the ispell
language together with the input method. In my opinion that is
ridiculous. Personally I need a combined dictionary rather than
explicitly marked text regions. I expect that new features will be
used more widely once it becomes possible to use them.

>> At first basic functionality may be implemented. The
>> problem is to choose extensible API.
>
> No, the problem is to have a design that would allow an
> efficient implementation. Given what the underlying libc
> does, it isn't easy.

That is why I am looking for an alternative to libc. Previously you
wrote "locale switching". I would rather say constructing and
destroying locales on demand. Switching may not behave so well when
threads are involved.

> And then we have conceptual problems. For example, in a
> multilingual editor such as Emacs, the notion of a "buffer
> language" not always makes sense, you'd need to support
> portions of text that have different language properties.
> Imagine switching locales as Emacs processes adjacent
> stretches of text and other complications. For example,
> changing letter-case for a stretch or Turkish text is
> supposed to be different from the English or German text.
> I'm all ears for ideas how to design such "language
> support". It definitely isn't easy, so if you have ideas,
> please voice them!

I neither have a consistent vision nor see a conceptual problem.
Buffer-local settings are just more specific than global ones. That
is why I mentioned text properties as even more precise in my previous
message. Maybe even the current mode can help to build a proper
hierarchy of locale contexts. HTML has the "lang" attribute, there is
"\foreignlanguage" in LaTeX, etc. I have heard that a special case
exists in Turkish, but I was not curious enough to find the details
and the rules for when and how it should be applied.

> If you are suggesting that we introduce ICU as a dependency,
> we could discuss the pros and cons.

I consider it the most complete implementation available. Do you know
a comparable alternative? I have realized that, since Emacs supports
dynamic modules, it is possible to create a prototype with bindings to
an external library without rebuilding Emacs.

> I don't think the problem is the API.

I think that introducing features gradually will be more of a headache
for developers of external packages than no support at all. The API
determines the scope of such features.

>> E.g. I was completely unaware that negative sign may be
>> represented by parenthesis
>
> Really? it's standard in financial applications.

Is it really so standard? Maybe I have seen such a format, and even
guessed from the context that, e.g., a table column with such numbers
holds negative values, or that it marks a discount entry. At least I
did not recognize such a format as a general rule.

new Intl.NumberFormat('de-DE', {style: 'currency', currency: 'USD',
    currencySign: 'accounting', signDisplay: 'always'}).format(-3500);
"-3.500,00 $"
new Intl.NumberFormat('es-ES', {style: 'currency', currency: 'USD',
    currencySign: 'accounting', signDisplay: 'always'}).format(-3500);
"-3500,00 US$"
new Intl.NumberFormat('fr-FR', {style: 'currency', currency: 'USD',
    currencySign: 'accounting', signDisplay: 'always'}).format(-3500);
"(3 500,00 $US)"
new Intl.NumberFormat('ru-RU', {style: 'currency', currency: 'USD',
    currencySign: 'accounting', signDisplay: 'always'}).format(-3500);
"-3 500,00 $"

>> I expect enough surprises and unexpected "discoveries"
>> during implementation of better locale support. That is
>> why I would consider adapting some more or less
>> established API for this purpose.
>
> I don't think "consider" cuts it. We have already a lot of
> stuff in Emacs; what we don't have needs serious design and
> comparison of available implementation options. Emacs's
> needs are quite special and unlike those of most other
> programs.

I still think that the expectations of users around the globe are more
special than Emacs's needs, at least with respect to the format of
numbers.
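
To make the CSV point from the beginning of this message more
concrete, here is a minimal sketch in the same language as the
examples above. It is only an illustration, not a proposal for the
Emacs API. All function names (separatorsFor, fieldSeparatorFor,
splitCsvRecord, parseLocalizedNumber) are hypothetical helpers made up
for this message. The rule "decimal comma implies ';' between fields"
is the office-software convention described above, taken here as an
assumption. The locale is always passed as an explicit argument, so
nothing global is switched.

```javascript
// Sketch only: every function below is a hypothetical helper,
// not part of Emacs, libc, ICU, or any existing library.

// Ask Intl for the separators of an explicitly named locale.
// The locale is just an argument; no global state is changed.
function separatorsFor(localeTag) {
  const parts = new Intl.NumberFormat(localeTag).formatToParts(1234567.5);
  const valueOf = (type) => {
    const part = parts.find((p) => p.type === type);
    return part ? part.value : '';
  };
  return { decimal: valueOf('decimal'), group: valueOf('group') };
}

// Assumption from the message: ";" separates fields when the decimal
// separator is ",", and "," separates them otherwise.
function fieldSeparatorFor(decimal) {
  return decimal === ',' ? ';' : ',';
}

// Split one CSV record, honouring double-quoted fields and "" escapes.
function splitCsvRecord(line, fieldSeparator) {
  const fields = [];
  let current = '';
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') {
        current += '"'; // escaped quote inside a quoted field
        i++;
      } else if (ch === '"') {
        inQuotes = false;
      } else {
        current += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === fieldSeparator) {
      fields.push(current);
      current = '';
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}

// Parse a plain decimal number written with the given separators.
// Returns NaN for anything else: no currency signs, no parentheses
// for negative values, no U+2212 minus sign.
function parseLocalizedNumber(text, { decimal, group }) {
  let s = text.trim();
  if (group) s = s.split(group).join('');
  if (decimal && decimal !== '.') {
    if (s.includes('.')) return NaN; // "." is not the decimal mark here
    s = s.split(decimal).join('.');
  }
  return /^[+-]?(\d+(\.\d*)?|\.\d+)$/.test(s) ? Number(s) : NaN;
}

// Separators written out by hand; on a runtime with full ICU data,
// separatorsFor('de-DE') is expected to give the same pair.
const german = { decimal: ',', group: '.' };
const cells = splitCsvRecord('Widget;"3.500,25";-1,5',
                             fieldSeparatorFor(german.decimal));
console.log(cells.slice(1).map((c) => parseLocalizedNumber(c, german)));
// prints [ 3500.25, -1.5 ]
```

The same record split on "," instead of ";" gives different cells,
which is why the decimal separator has to be known before parsing
starts. The parser is intentionally strict: accounting formats like
"(3 500,00 $US)" above are rejected rather than guessed.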