unofficial mirror of emacs-devel@gnu.org 
From: Maxim Nikulin <manikulin@gmail.com>
To: emacs-devel@gnu.org
Cc: boruch_baum@gmx.com
Subject: Re: CSV parsing and other issues (Re: LC_NUMERIC)
Date: Thu, 10 Jun 2021 23:28:59 +0700
Message-ID: <ea575f9a-5d90-d13f-4d8c-58552541034d@gmail.com>
In-Reply-To: <83eedcdw8k.fsf@gnu.org>

On 09/06/2021 01:52, Eli Zaretskii wrote:
> From: Maxim Nikulin Date: Tue, 8 Jun 2021 23:35:51 +0700

I have reordered some parts of discussion.

>> I just have realized that nl_langinfo(3) (and
>> nl_langinfo_l(3) as well) from libc accepts RADIXCHAR
>> (decimal dot) and THOUSEP (group separator)
>> arguments. They are good candidates for `locale-info'
>> extension.
>
> We already use nl_langinfo in locale-info, so what exactly
> are you suggesting here? adding more items?  You don't
> really expect Lisp programs to format numbers such as
> 123,456 by hand after learning from locale-info that the
> thousands separator is a comma, do you?

I have hijacked Boruch's thread and changed the subject to "CSV
parsing".  There are plenty of CSV dialects.  If the decimal separator
is ",", office software uses ";" instead of a comma as the cell (field)
separator.  So to parse a CSV file it is necessary to know the decimal
separator of the specified locale.  RADIXCHAR as an argument of
nl_langinfo(3) is a first step to a better user experience with CSV
files.
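
To illustrate the idea outside of libc, here is a minimal JavaScript
sketch of the same lookup.  It is not an Emacs or libc API:
`numberSeparators` and `csvFieldSeparator` are hypothetical helpers, and
the rule "decimal comma implies ';' as field separator" is only the
office-software convention described above, not a standard.

```javascript
// Hypothetical helper: read a locale's decimal and group separators
// from Intl.NumberFormat, roughly what nl_langinfo(RADIXCHAR) and
// nl_langinfo(THOUSEP) report in C.
function numberSeparators(locale) {
  const parts = new Intl.NumberFormat(locale).formatToParts(1234567.5);
  const valueOf = (type) => {
    const part = parts.find((p) => p.type === type);
    return part ? part.value : '';
  };
  return { decimal: valueOf('decimal'), group: valueOf('group') };
}

// Assumed convention (from the message, not from any specification):
// locales with a decimal comma use ";" between CSV fields.
function csvFieldSeparator(locale) {
  return numberSeparators(locale).decimal === ',' ? ';' : ',';
}

console.log(numberSeparators('en-US'), csvFieldSeparator('en-US'));
console.log(numberSeparators('de-DE'), csvFieldSeparator('de-DE'));
```

The results depend on the locale data shipped with the JavaScript
runtime, so a build without full locale data may fall back to English.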

Unfortunately it only helps to get a reasonable visual representation.
Taking advantage of Org spreadsheet calculations requires parsing cell
contents, and thus parsing numbers (and maybe dates).
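
A rough sketch of what such cell parsing involves, again in JavaScript
and under loud assumptions: `parseLocalizedNumber` and `parseRow` are
hypothetical names, the separators are passed in explicitly, quoted
fields are not handled, and the position of group separators is not
validated.

```javascript
// Hypothetical parser: turn a localized number string into a Number,
// given the locale's separators.  Group separators are simply dropped,
// so "1.5" with group "." is read as 15.
function parseLocalizedNumber(text, { decimal, group }) {
  let s = text.trim();
  if (group) s = s.split(group).join('');
  if (decimal !== '.') s = s.replace(decimal, '.');
  return /^[-+]?(\d+\.?\d*|\.\d+)$/.test(s) ? Number(s) : NaN;
}

// Naive row splitter (no quoting support): numeric cells become
// numbers, everything else stays a string.
function parseRow(line, fieldSeparator, separators) {
  return line.split(fieldSeparator).map((cell) => {
    const n = parseLocalizedNumber(cell, separators);
    return Number.isNaN(n) ? cell : n;
  });
}

const german = { decimal: ',', group: '.' };
console.log(parseRow('Widget;1.234,56;3,5', ';', german));
```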

I mentioned earlier https://debbugs.gnu.org/47885 and a part of the
discussion that is missing from the bug tracker:
https://lists.gnu.org/archive/html/emacs-orgmode/2021-05/msg00693.html

I have seen nl_langinfo without RADIXCHAR in the Emacs sources:
http://git.savannah.gnu.org/cgit/emacs.git/tree/src/w32proc.c#n3258
http://git.savannah.gnu.org/cgit/emacs.git/tree/lib-src/ntlib.c#n520

Originally, during the discussion on emacs-orgmode, I did not plan to
raise the question of number formatting and parsing, since I had no
hope of any positive outcome without a consistent proposal.  By chance
I noticed Boruch's message and decided to add another use case.

>> On 08/06/2021 09:35, Eli Zaretskii wrote:
>>  > From: Boruch Baum
>>  >> No? If an Emacs user has two buffers in two separate languages, the
>>  >> buffer-local settings aren't / won't be respected?
>>  >
>>  > First, language is different from locale.  And second, we don't even
>>  > have a buffer-local notion of language yet.
>>
>> Certainly locale is more precise than just language since it includes
>> region and other variants, moreover it can be granularly tuned (date,
>> numbers, sorting can be adjusted independently), but I still think that
>> all these properties can be sometimes broadly referred to as language.
>
> No, they cannot, not in general.  A locale comes with a whole database
> of different settings: language, encoding (a.k.a. "codeset"), formats
> of date and time, names of days of the week and of the months, rules
> for collation and capitalization, etc. etc.  You can easily find
> several locales whose language is English, but some/many/all of the
> other locale-dependent settings are different.  It isn't a coincidence
> that a locale's name includes more than just the language part.

I wrote almost the same concerning locale variants and components, so I
feel some confusion and cannot find its origin.  I was trying to
support Boruch's point that buffer-local variables may be an important
part of the locale context, more precise than global settings, and a
fallback if the locale is not specified for a particular span of text.
With respect to such a hierarchy, the language vs. locale difference
does not matter.

>> Low level functions can accept explicit locale.
>
> Which ones?  Most libc routines don't, they use the locale
> as a global identifier.  And many libc's (with the prominent
> exception of glibc) don't support efficient change of a
> locale in the middle of a program, they assume that the
> program's locale is set once at program startup.

Hypothetical functions in a new elisp API, maybe relying on some
external libraries.  I believed you agreed that the global LC_NUMERIC
must be "C" to avoid various sorts of problems with data exchange.  I am
not aware of libc functions for number formatting or parsing that can
take an explicit locale (I have seen such a feature in the C++ standard
library, Qt, and other languages).  The totalitarian approach of libc,
with a single global locale and a single timezone, imposes limitations
too hard to consider some libc functions useful and reliable in a more
or less complex application.  Its API is suitable for simple tools that
quickly do their work and do not assume any conversion.  A more flexible
base layer is required when a mix of environments is expected.  Full
support of locale features requires a lot of work, which is why I am
asking whether some external library can be used instead.

>> Higher level API can obtain it implicitly from
>> buffer-local variables and global locale. For example the
>> LOCALE argument of `string-collate-lessp' is optional
>> one. I can even anticipate that locale may be stored in
>> text properties some times. A random message from recent
>> "About multilingual documents" thread at emacs-orgmode
>> mail list:
>> https://lists.gnu.org/archive/html/emacs-orgmode/2021-05/msg00252.html
>
> That's mostly about input methods and org-export, I don't
> see how it's relevant to what Boruch asked.

I added this link to show you that the demand for multilingual documents
is real.  Notice that problems with spell checking were mentioned in
that discussion.  Earlier I saw suggestions to switch the ispell
language together with the input method.  In my opinion that is
ridiculous.  Personally I need a combined dictionary rather than
explicitly marked text regions.

I expect that new features will be used more widely once it becomes
possible to use them.

>> At first basic functionality may be implemented. The
>> problem is to choose extensible API.
>
> No, the problem is to have a design that would allow an
> efficient implementation.  Given what the underlying libc
> does, it isn't easy.

That is why I am looking for an alternative to libc.  Previously you
wrote "locale switching".  I would rather say constructing and
destroying locales on demand.  Switching may not behave so well when
threads are involved.
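
As an illustration of "constructing on demand" rather than switching,
here is a small JavaScript sketch.  Each Intl.NumberFormat object
carries its own locale, so no global state changes between calls.  The
cache and the name `formatterFor` are hypothetical and only show the
shape of such an API.

```javascript
// Formatter objects keyed by locale; each is built the first time it
// is requested and reused afterwards.  No global locale is touched.
const formatters = new Map();

function formatterFor(locale) {
  if (!formatters.has(locale)) {
    formatters.set(locale, new Intl.NumberFormat(locale));
  }
  return formatters.get(locale);
}

// Adjacent values formatted for different locales, with no "switch"
// in between.
const mixed = [['en-US', 1234.5], ['de-DE', 1234.5]].map(
  ([locale, value]) => formatterFor(locale).format(value)
);
console.log(mixed);
```

Dropping an entry from the map is the counterpart of destroying a
locale object.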

> And then we have conceptual problems.  For example, in a
> multilingual editor such as Emacs, the notion of a "buffer
> language" not always makes sense, you'd need to support
> portions of text that have different language properties.
> Imagine switching locales as Emacs processes adjacent
> stretches of text and other complications.  For example,
> changing letter-case for a stretch of Turkish text is
> supposed to be different from the English or German text.
> I'm all ears for ideas how to design such "language
> support".  It definitely isn't easy, so if you have ideas,
> please voice them!

I neither have a consistent vision nor see a conceptual problem.
Buffer-local settings are just more specific than global ones.  That is
why I mentioned text properties as even more precise in my previous
message.  Maybe even the current mode can help to build a proper
hierarchy of locale contexts.  HTML has the "lang" attribute, there is
"\foreignlanguage" in LaTeX, etc.
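
The hierarchy I have in mind can be sketched as a simple fallback
chain.  This is only an illustration in JavaScript with hypothetical
names (`resolveLocale`, `textProperty`, `bufferLocal`, `global`); it
says nothing about how Emacs would actually store these settings.

```javascript
// Pick the most specific locale that is set: a text property on the
// span wins over a buffer-local value, which wins over the global one.
function resolveLocale({ textProperty, bufferLocal, global }) {
  for (const candidate of [textProperty, bufferLocal, global]) {
    if (candidate) return candidate;
  }
  return undefined;
}

console.log(resolveLocale({ textProperty: 'tr-TR', bufferLocal: 'de-DE', global: 'en-US' }));
console.log(resolveLocale({ bufferLocal: 'de-DE', global: 'en-US' }));
console.log(resolveLocale({ global: 'en-US' }));
```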

I have heard that a special case exists in Turkish, but I was not
curious enough to find the details and rules for when and how it should
be applied.

> If you are suggesting that we introduce ICU as a dependency,
> we could discuss the pros and cons.

I consider it the most complete available implementation.  Do you
know a comparable alternative?

I have realized that since Emacs supports dynamic modules, it is
possible to create a prototype with bindings to an external library
without rebuilding Emacs.

> I don't think the problem is the API.

I think introducing features gradually will be more of a headache for
developers of external packages than no support at all.  The API
determines the scope of such features.

>> E.g. I was completely unaware that negative sign may be
>> represented by parenthesis
>
> Really? it's standard in financial applications.

Is it really so standard?  Maybe I have seen such a format and even
guessed from context that e.g. a table column with such numbers should
hold negative values, or e.g. in a discount entry.  At least I did not
recognize such a format as a general rule.
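
For reference, the parenthesized form does appear when the
"accounting" currency sign is requested for en-US.  A minimal sketch,
where `formatAccounting` and `negativeUsesParentheses` are hypothetical
helpers and the result depends on the locale data of the JavaScript
runtime:

```javascript
// Format an amount using the locale's accounting convention.
function formatAccounting(locale, amount, currency) {
  return new Intl.NumberFormat(locale, {
    style: 'currency',
    currency,
    currencySign: 'accounting',
  }).format(amount);
}

// Probe whether a locale wraps negative accounting amounts in
// parentheses instead of using a minus sign.
function negativeUsesParentheses(locale) {
  const sample = formatAccounting(locale, -1, 'USD');
  return sample.startsWith('(') && sample.endsWith(')');
}

console.log(formatAccounting('en-US', -3500, 'USD'));
console.log(negativeUsesParentheses('en-US'));
```

A parser for financial data would need the same kind of probe to know
whether "(3,500.00)" should be read as a negative value.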

JavaScript Intl.NumberFormat output for several locales:

  new Intl.NumberFormat('de-DE', {style: 'currency', currency: 'USD',
    currencySign: 'accounting', signDisplay: 'always'}).format(-3500);
  // "-3.500,00 $"

  new Intl.NumberFormat('es-ES', {style: 'currency', currency: 'USD',
    currencySign: 'accounting', signDisplay: 'always'}).format(-3500);
  // "-3500,00 US$"

  new Intl.NumberFormat('fr-FR', {style: 'currency', currency: 'USD',
    currencySign: 'accounting', signDisplay: 'always'}).format(-3500);
  // "(3 500,00 $US)"

  new Intl.NumberFormat('ru-RU', {style: 'currency', currency: 'USD',
    currencySign: 'accounting', signDisplay: 'always'}).format(-3500);
  // "-3 500,00 $"

>> I expect enough surprises and unexpected "discoveries"
>> during implementation of better locale support. That is
>> why I would consider adapting some more or less
>> established API for this purpose.
>
> I don't think "consider" cuts it.  We have already a lot of
> stuff in Emacs; what we don't have needs serious design and
> comparison of available implementation options.  Emacs's
> needs are quite special and unlike those of most other
> programs.

I still think that the expectations of users around the globe are more
special than Emacs' needs, at least with respect to number formats.



