From: Maxim Nikulin
Newsgroups: gmane.emacs.devel
Subject: CSV parsing and other issues (Re: LC_NUMERIC)
Date: Thu, 3 Jun 2021 21:44:08 +0700
Message-ID: <921965d7-af86-6d2e-8b48-3d0b9b51998e@gmail.com>
References: <20210602185441.nhvhirdffamahgfy@E15-2016.optimum.net>
In-Reply-To: <20210602185441.nhvhirdffamahgfy@E15-2016.optimum.net>
To: emacs-devel@gnu.org
Cc: Utkarsh Singh

On 03/06/2021 01:54, Boruch Baum wrote:
> Please consider having the elisp 'format' function adopt the
> single-quote and 'I' flags. Each is already implemented in both the GNU
> C printf command and the linux printf command. The single-quote flag is
> part of the 'Single UNIX Specification' and the 'I' flag has been part
> of glibc since version 2.2 [ref: man(3) printf].
>
> If function 'format' uses 'printf' as its backend, this would seem to be
> a matter of exposing an existing feature.

I do not know the history of why Emacs does not support locale-aware
number formats, but I suspect that relying on libc opens a can of
worms. Once setlocale(LC_NUMERIC, "") has been called, one can never be
sure whether printf- and scanf-like functions deal with the default "C"
representation or with numbers formatted according to the current
locale. Some numbers, such as those used in communication protocols,
must always be formatted using the "C" locale.

I do not remember whether it happened with XFree86 or with Xorg, but at
a certain moment users experienced problems.
X11 could not start at all due to invalid configs. The source of the
problem was "," as the decimal separator in some locales, together with
wrong expectations about how numbers are written in config files.

Recently I found the following in the fixup_locale function:
http://git.savannah.gnu.org/cgit/emacs.git/tree/src/emacs.c#n2861

    setlocale (LC_NUMERIC, "C");

I was surprised that it is impossible to determine the current decimal
separator from elisp. At the same time, e.g. `string-collate-lessp' has
a LOCALE argument.

A month ago some patches were submitted to Org mode with the intention
of improving the import of tables, see https://debbugs.gnu.org/47885
Part of the discussion is missing from the bug tracker:
https://lists.gnu.org/archive/html/emacs-orgmode/2021-05/msg00693.html
Org mode has a piece of code that tries to guess whether a file uses
commas or tabs as the field separator (CSV or TSV format). The
suggested change adds e.g. the semicolon. (Side note: csv-mode is
probably a better place than org-mode for such code.) The problem is
that office software uses the semicolon in locales where the comma
serves as the decimal separator for floating-point numbers (e.g. de_DE,
es_ES, fr_FR, ru_RU, etc.):

A;1,2;3,4

So the semicolon should be tried with higher priority than the comma if
numbers in the current locale are represented as e.g. 1,2.
Unfortunately, the only way to get such information from Emacs is to
call some external application. Maintaining our own mapping of locales
to separators is an unnecessary burden.

Besides office software, there is equipment that always uses "C" number
formatting, so a user can have a mix of files in various dialects of
CSV. Thus locale info is not enough; some heuristics are required
anyway.

More subtle questions arise at the next step. Org allows calculations
on table cells (and there is calc). Should numbers be converted to the
"C" locale representation during import? Should conversion happen when
cell content is passed as an argument, with the result converted back
to the current locale?
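
To make the separator priority described above concrete, here is a
minimal sketch in Python (chosen only for brevity; this is neither the
Org mode code nor any Emacs API). The function name guess_separator and
its decimal_point parameter are hypothetical, and the acceptance test
(every non-empty line contains the same non-zero number of separators)
is an assumed heuristic that ignores quoted fields.

```python
def guess_separator(lines, decimal_point="."):
    """Guess the field separator of delimited text (hypothetical helper).

    Candidates are tried in priority order. When the decimal separator
    is a comma, the semicolon is tried before the comma, because a
    comma inside a cell is then more likely part of a number.
    """
    if decimal_point == ",":
        candidates = ["\t", ";", ","]
    else:
        candidates = ["\t", ",", ";"]

    # Ignore blank lines; they carry no information about the layout.
    rows = [line for line in lines if line.strip()]

    for sep in candidates:
        # Assumed heuristic: accept a candidate only if it occurs the
        # same non-zero number of times on every row. Quoting is not
        # handled here.
        counts = {row.count(sep) for row in rows}
        if len(counts) == 1 and counts != {0}:
            return sep
    return None


# Made-up sample in the style of the example line above.
sample = ["A;1,2;3,4", "B;5,6;7,8"]
print(repr(guess_separator(sample, decimal_point=",")))
print(repr(guess_separator(sample, decimal_point=".")))
```

With a comma as the decimal separator the sample is split on ";". With
"." as the decimal separator the same sample is misread as
comma-separated, which is the failure mode described above. A file
written with "C" number formatting, such as "A,1.2,3.4", is still split
on "," in either case, because the semicolon never occurs in it.
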
I anticipate that a buffer-local setting will be requested. There was
even a discussion of mixed-language documents on the emacs-orgmode
mailing list, although numbers were not mentioned.

So locale-aware number formatting would be a great improvement for
Emacs. On the other hand, it should be implemented with great care, so
that localized numbers can be avoided where they are not wanted. Maybe
a locale argument should be passed to functions that deal with numbers.
Formatting of integers is not enough; floating-point numbers should be
handled as well. Parsing of numbers formatted according to locale rules
should be addressed too. A function similar to `locale-info' is highly
desirable for getting properties of a locale (e.g. decimal_point from
the result of localeconv). A decision is also required on whether
calc & Co. should operate on localized numbers.
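
As a rough sketch of the helpers wished for above, the following Python
code reads decimal_point from localeconv for a named locale and then
restores LC_NUMERIC, and it parses and formats numbers with an explicit
separator instead of consulting the process-wide locale. The names
decimal_point_for, parse_number and format_number are hypothetical;
nothing like them is claimed to exist in Emacs. Only the decimal
separator is handled; digit grouping is left out.

```python
import locale


def decimal_point_for(locale_name):
    """Return the decimal separator of LOCALE_NAME, or None if the
    locale cannot be selected (hypothetical helper)."""
    saved = locale.setlocale(locale.LC_NUMERIC)  # query, do not change
    try:
        locale.setlocale(locale.LC_NUMERIC, locale_name)
    except locale.Error:
        return None
    try:
        return locale.localeconv()["decimal_point"]
    finally:
        # Always switch back, so that other code keeps seeing the
        # numeric locale it had before the call.
        locale.setlocale(locale.LC_NUMERIC, saved)


def parse_number(text, decimal_point="."):
    """Parse TEXT written with DECIMAL_POINT as decimal separator."""
    text = text.strip()
    if decimal_point != "." and "." in text:
        # Refuse "C"-style input when another separator was requested,
        # rather than silently accepting both dialects.
        raise ValueError("unexpected '.' in %r" % text)
    return float(text.replace(decimal_point, "."))


def format_number(value, decimal_point="."):
    """Format VALUE using DECIMAL_POINT as decimal separator."""
    return repr(float(value)).replace(".", decimal_point)


print(decimal_point_for("C"))
print(parse_number("1,2", decimal_point=","))
print(format_number(1.5, decimal_point=","))
```

Passing the separator explicitly keeps "C" formatting as the default
everywhere else. Note that decimal_point_for still changes the
process-global locale for a moment, which is the very hazard discussed
above, so a real implementation would need a safer way to query a
locale.
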