CSV parsing and other issues (Re: LC

unofficial mirror of emacs-devel@gnu.org 

* CSV parsing and other issues (Re: LC_NUMERIC)
  2021-06-02 18:54 LC_NUMERIC formatting [FEATURE REQUEST] Boruch Baum
@ 2021-06-03 14:44 ` Maxim Nikulin
  2021-06-03 15:01   ` Eli Zaretskii
  0 siblings, 1 reply; 33+ messages in thread
From: Maxim Nikulin @ 2021-06-03 14:44 UTC (permalink / raw)
  To: emacs-devel; +Cc: Utkarsh Singh

On 03/06/2021 01:54, Boruch Baum wrote:
> Please consider having the elisp 'format' function adopt the
> single-quote and 'I' flags. Each is already implemented in both the GNU
> C printf command and the linux printf command. The single-quote flag is
> part of the 'Single UNIX Specification' and the 'I' flag has been part
> of glibc since version 2.2 [ref: man(3) printf].
> 
> If function 'format' uses 'printf' as its backend, this would seem to be
> a matter of exposing an existing feature.

I do not know the history of why Emacs does not support locale-aware
number formats, but I suspect that relying on libc opens a can of worms.
Once setlocale(LC_NUMERIC, "") is invoked, one is never sure whether
printf- and scanf-like functions deal with the default "C" representation
or with numbers formatted according to the current locale. Some numbers
related to communication protocols must always be formatted using the "C"
locale. I do not remember whether it happened with XFree86 or with Xorg,
but at a certain moment users experienced problems: X11 could not start
at all due to invalid configs. The source of the problem was "," as the
decimal separator in some locales and wrong expectations concerning
numbers in config files.

Recently I found the following fixup_locale function:
http://git.savannah.gnu.org/cgit/emacs.git/tree/src/emacs.c#n2861

     setlocale (LC_NUMERIC, "C");

I was surprised that it is impossible to determine the current decimal
separator from elisp. At the same time, e.g. `string-collate-lessp' has
a LOCALE argument.

A month ago some patches were submitted to Org mode with the intention
of improving import of tables, see https://debbugs.gnu.org/47885. A part
of the discussion is missing from the bug tracker:
https://lists.gnu.org/archive/html/emacs-orgmode/2021-05/msg00693.html
Org mode has a piece of code that tries to guess whether the file has
commas or tabs as the field separator (CSV or TSV format). The suggested
change adds e.g. the semicolon. (Sidenote: csv-mode is probably a better
place than org-mode for such code.)

The problem is that office software uses the semicolon as the field
separator in locales where the comma serves as the decimal separator for
floating-point numbers (e.g. de_DE, es_ES, fr_FR, ru_RU, etc.):

     A;1,2;3,4

So the semicolon should be tried with higher priority than the comma if
numbers in the current locale are represented as e.g. 1,2. Unfortunately,
the only way to get such information from Emacs is to call some external
application. Maintaining our own mapping of locale to separator is an
unnecessary burden.

Besides office software, there is some equipment that always uses "C"
number formatting, so a user can have a mix of files with various
dialects of CSV. Thus locale info is not enough; some heuristics are
required anyway.
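
A rough sketch of this priority rule, in Python rather than elisp. It is
not the code from the Org patches; the name `guess_delimiter` and the
simple counting heuristic are made up for illustration, and the decimal
separator is assumed to be known already.

```python
def guess_delimiter(sample_lines, decimal_point="."):
    """Guess the field separator of CSV-like text.

    Hypothetical helper: candidates are tried in priority order, and the
    semicolon comes before the comma when the decimal separator of the
    locale is a comma.
    """
    if decimal_point == ",":
        candidates = [";", "\t", ","]
    else:
        candidates = [",", "\t", ";"]
    lines = [line for line in sample_lines if line.strip()]
    for delimiter in candidates:
        counts = [line.count(delimiter) for line in lines]
        # Accept a candidate only if it occurs in every line and the
        # number of occurrences is the same on each line.
        if counts and counts[0] > 0 and len(set(counts)) == 1:
            return delimiter
    return None
```

With the sample line above plus a second one of the same shape, passing
decimal_point="," selects ";", while the default "." selects "," and so
splits every number in two. That is the misparse described here, and it
also shows why the locale alone cannot replace a heuristic.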

More subtle questions arise at the next step. Org allows performing
calculations on table cells (and there is calc). Should numbers be
converted to the "C" locale representation during import? Should the
conversion happen when passing cell content as an argument, with the
result converted back to the current locale? I anticipate that a
buffer-local setting will be requested. There was even a discussion of
mixed-language documents on the emacs-orgmode mailing list; however,
numbers were not mentioned.

So locale-aware number formatting would be a great improvement for
Emacs. On the other hand, it should be implemented with great care to
avoid localized numbers in some cases. Maybe a locale argument should be
passed to functions that deal with numbers. Formatting of integer
numbers is not enough; floating point numbers should be handled as well.
Parsing numbers formatted according to locale rules should be
addressed too. A function similar to `locale-info' is highly desirable to
get properties of a locale (e.g. decimal_point from the result of
localeconv). Some decision is required on whether calc & Co should
operate with localized numbers.
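
To make the parsing side concrete, here is a minimal sketch in Python.
The function `parse_localized_number` is hypothetical; the separators
are passed in explicitly instead of being taken from global locale
state, in the spirit of the "locale argument" idea above.

```python
def parse_localized_number(text, decimal_point=".", thousands_sep=""):
    """Convert a locale-formatted number string to a float.

    Hypothetical helper: the caller supplies the separators, for example
    values obtained from localeconv().
    """
    cleaned = text.strip()
    if thousands_sep:
        cleaned = cleaned.replace(thousands_sep, "")
    if decimal_point != ".":
        # After removing group separators no "C"-style dot may remain.
        if "." in cleaned:
            raise ValueError(f"unexpected '.' in {text!r}")
        cleaned = cleaned.replace(decimal_point, ".")
    return float(cleaned)
```

For instance, "12.345,67" with decimal_point="," and thousands_sep="."
yields 12345.67, while "1.5" with decimal_point="," and no group
separator is rejected instead of being silently read as a "C" number.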

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: CSV parsing and other issues (Re: LC_NUMERIC)
  2021-06-03 14:44 ` CSV parsing and other issues (Re: LC_NUMERIC) Maxim Nikulin
@ 2021-06-03 15:01   ` Eli Zaretskii
  2021-06-04 16:31     ` Maxim Nikulin
  0 siblings, 1 reply; 33+ messages in thread
From: Eli Zaretskii @ 2021-06-03 15:01 UTC (permalink / raw)
  To: Maxim Nikulin; +Cc: utkarsh190601, emacs-devel

> From: Maxim Nikulin <manikulin@gmail.com>
> Date: Thu, 3 Jun 2021 21:44:08 +0700
> Cc: Utkarsh Singh <utkarsh190601@gmail.com>
> 
> So locale-aware number formatting would be a great improvement for 
> Emacs. On the other hand, it should be implemented with great care to 
> avoid localized numbers in some cases. Maybe locale argument should be 
> passed to functions that deal with numbers. Formatting of integer 
> numbers is not enough, floating point numbers should be handled as well. 
> Parsing numbers formatted accordingly to locale rules should be 
> addressed too. A function similar to `locale-info' is highly desired to 
> get properties of locale (e.g. decimal_point from result of localeconv). 
> Some decision is required whether calc & Co should operate with
> localized numbers.

Setting a locale globally in Emacs is a non-starter, for the reasons
that you point out and others.  Text processing in Emacs is generally
separate from the current locale's rules, mainly to have Emacs work
the same in any locale.  So passing a locale argument to functions
that produce output, with the intent to request some behavior to be
tailored to that locale, is the only reasonable way to have this kind
of functionality in Emacs.  The problem with that, of course, is
that not every supported platform can dynamically change the locale,
let alone do that efficiently.


* Re: CSV parsing and other issues (Re: LC_NUMERIC)
  2021-06-03 15:01   ` Eli Zaretskii
@ 2021-06-04 16:31     ` Maxim Nikulin
  2021-06-04 19:17       ` Eli Zaretskii
  0 siblings, 1 reply; 33+ messages in thread
From: Maxim Nikulin @ 2021-06-04 16:31 UTC (permalink / raw)
  To: emacs-devel; +Cc: utkarsh190601

On 03/06/2021 22:01, Eli Zaretskii wrote:
>> From: Maxim Nikulin
>> Date: Thu, 3 Jun 2021 21:44:08 +0700
>>
>> So locale-aware number formatting would be a great improvement for
>> Emacs. On the other hand, it should be implemented with great care to
>> avoid localized numbers in some cases. Maybe locale argument should be
>> passed to functions that deal with numbers. Formatting of integer
>> numbers is not enough, floating point numbers should be handled as well.
>> Parsing numbers formatted accordingly to locale rules should be
>> addressed too. A function similar to `locale-info' is highly desired to
>> get properties of locale (e.g. decimal_point from result of localeconv).
>> Some decision is required whether calc & Co should operate with
>> localized numbers.
> 
> Setting a locale globally in Emacs is a non-starter, for the reasons
> that you point out and others.  Text processing in Emacs is generally
> separate from the current locale's rules, mainly to have Emacs work
> the same in any locale.  So passing a locale argument to functions
> that produce output, with the intent to request some behavior to be
> tailored to that locale, is the only reasonable way to have this kind
> of functionalities in Emacs.  The problem with that, of course, is
> that not every supported platform can dynamically change the locale,
> let alone do that efficiently.

I do not think it is efficient to require users to fight with number
formatting themselves. Here are some links from my browser history, from
when I was trying to figure out how to get the locale-specific decimal
separator in elisp:

https://stackoverflow.com/questions/35661173/how-to-format-table-fields-as-currency-in-org-mode
https://www.emacswiki.org/emacs/AddCommasToNumbers
https://www.reddit.com/r/emacs/comments/61mhyx/creating_a_function_to_add_commasseparators_to/

Do you mean that it is necessary to create a new implementation of a
number formatter specially for Emacs? Something like

https://unicode.org/reports/tr35/tr35-numbers.html
Unicode Locale Data Markup Language (LDML) Part 3: Numbers

Actually it is an almost random link. I do not know which source is
currently considered the best collection of wisdom related to number
formatting. Outside of the Emacs world, the last time I needed numbers
formatted according to various locales, I was lucky enough to use code
similar to the following and did not have to care about the details:

#include <cstdio>
#include <QLocale>
#include <QTextStream>

void test(QTextStream& stream, const char *loc_name) {
	QLocale loc(QString::fromLocal8Bit(loc_name));
	stream << "point: " << loc.decimalPoint()
		<< " " << loc.toString(12345.67)
		<< " " << loc.toString(1234567890) << "\n";
}
int main(int argc, char *argv[]) {
	QTextStream stream(stdout);
	for (int i = 1; i < argc; ++i) {
		test(stream, argv[i]);
	}
	return 0;
}

./qtloc de_DE en_GB fa_IR
point: , 12.345,7 1.234.567.890
point: . 12,345.7 1,234,567,890
point: ٫ ۱۲٬۳۴۵٫۷ ۱٬۲۳۴٬۵۶۷٬۸۹۰

Surprisingly, it works even though I have not generated the de and fa locales.

On linux I see that Emacs is linked with ICU

ldd /usr/bin/emacs | grep -i icu
	libicuuc.so.66 => /usr/lib/x86_64-linux-gnu/libicuuc.so.66 (0x00007f457c799000)
	libicudata.so.66 => /usr/lib/x86_64-linux-gnu/libicudata.so.66 (0x00007f457a61c000)

I am not familiar with the ICU API, but I expect that it may be utilized:
https://github.com/unicode-org/icu/blob/main/icu4c/source/samples/numfmt/capi.c

Do you have a bright idea for implementing a parser-formatter for
numbers with reasonable effort?





* Re: CSV parsing and other issues (Re: LC_NUMERIC)
  2021-06-04 16:31     ` Maxim Nikulin
@ 2021-06-04 19:17       ` Eli Zaretskii
  0 siblings, 0 replies; 33+ messages in thread
From: Eli Zaretskii @ 2021-06-04 19:17 UTC (permalink / raw)
  To: Maxim Nikulin; +Cc: utkarsh190601, emacs-devel

> Cc: utkarsh190601@gmail.com
> From: Maxim Nikulin <manikulin@gmail.com>
> Date: Fri, 4 Jun 2021 23:31:13 +0700
> 
> > Setting a locale globally in Emacs is a non-starter, for the reasons
> > that you point out and others.  Text processing in Emacs is generally
> > separate from the current locale's rules, mainly to have Emacs work
> > the same in any locale.  So passing a locale argument to functions
> > that produce output, with the intent to request some behavior to be
> > tailored to that locale, is the only reasonable way to have this kind
> > of functionalities in Emacs.  The problem with that, of course, is
> > that not every supported platform can dynamically change the locale,
> > let alone do that efficiently.
> 
> I do not think it is efficient to require from users to fight with 
> number formatting themselves.

I didn't suggest that.  I was talking about the design of the APIs
that need to be able to provide locale-specific formatting.  The
implementation should be provided by Emacs core, of course.

> Do you mean that it is necessary to create new implementation of number 
> formatter specially for Emacs?

Either that, or use the underlying C library if it can accept a locale
specifier, or if it supports efficient dynamic change of the locale,
like we do in some of the implementations of string-collate-lessp.

> On linux I see that Emacs is linked with ICU

It isn't.  It's either HarfBuzz or maybe libc that pulls in the ICU
library.  Emacs doesn't use it directly.

> Do you have a bright idea concerning implementation of parser-formatter 
> for numbers with reasonable efforts?

See above.




* Re: CSV parsing and other issues (Re: LC_NUMERIC)
@ 2021-06-06 23:36 Boruch Baum
  2021-06-07 12:28 ` Eli Zaretskii
  0 siblings, 1 reply; 33+ messages in thread
From: Boruch Baum @ 2021-06-06 23:36 UTC (permalink / raw)
  To: Emacs-Devel List; +Cc: Maxim Nikulin, Eli Zaretskii

I wasn't cc'ed (and I don't subscribe to the list), so I only now saw
the continuation of my post.

1] @Maxim: You seemed to indicate that the default emacs locale is 'C'.
   That may be true, and I may be mixing up two separate things, but my
   observation is that I get 'nil' when I check for any related
   environment variable using function `getenv', and in practice I need
   to temporarily manually use function setenv to set LC_COLLATE=C in
   order to offer several sorting options in package diredc. Note though
   that feature isn't performing the sort within emacs; it's temporarily
   setting a shell environment and having the external ls program
   perform the sort for emacs-core dired. Thus, my experience has been
   that the default has been something other than C, at least for
   LC_COLLATE. I suspect that's true for ALL emacs users.

2] @Eli: You wrote

> > The problem with that, of course, is that not every supported
> > platform can dynamically change the locale, let alone do that
> > efficiently.

   I have no idea to what actual supported platform you're referring.

3] @Eli: You wrote

> > Text processing in Emacs is generally separate from the current
> > locale's rules,
> > ...
> > So passing a locale argument to functions that produce output,
> > with the intent to request some behavior to be tailored to that
> > locale, is the only reasonable way to have this kind

   Agreed. My input here is that there should be clear documentation of
   how to retrieve a value for that argument from a buffer's context,
   (maybe the same way that flyspell does?).

I see also that I created room for confusion in asking actually for TWO
features (single-quote and upper-case I) because the two will behave
differently in an expected default condition. The single quote format
(for the thousands separator) can be expected to produce a result always
for all conditions of locale, while I expect most locale cases won't
produce any special output for the upper-case I format option.

--
hkp://keys.gnupg.net
CA45 09B5 5351 7C11 A9D1  7286 0036 9E45 1595 8BC0


* Re: CSV parsing and other issues (Re: LC_NUMERIC)
  2021-06-06 23:36 CSV parsing and other issues (Re: LC_NUMERIC) Boruch Baum
@ 2021-06-07 12:28 ` Eli Zaretskii
  2021-06-08  0:45   ` Boruch Baum
  0 siblings, 1 reply; 33+ messages in thread
From: Eli Zaretskii @ 2021-06-07 12:28 UTC (permalink / raw)
  To: Boruch Baum; +Cc: manikulin, emacs-devel

> Date: Sun, 6 Jun 2021 19:36:38 -0400
> From: Boruch Baum <boruch_baum@gmx.com>
> Cc: Maxim Nikulin <manikulin@gmail.com>, Eli Zaretskii <eliz@gnu.org>
> 
> 1] @Maxim: You seemed to indicate that the default emacs locale is 'C'.
>    That may be true

That's only true for LC_NUMERIC category.

> 2] @Eli: You wrote
> 
> > > The problem with that, of course, is that not every supported
> > > platform can dynamically change the locale, let alone do that
> > > efficiently.
> 
>    I have no idea to what actual supported platform you're referring.

GNU/Linux is the only one I know of that can efficiently switch
locales dynamically (and even that in recent versions of libc, AFAIR).

> > > Text processing in Emacs is generally separate from the current
> > > locale's rules,
> > > ...
> > > So passing a locale argument to functions that produce output,
> > > with the intent to request some behavior to be tailored to that
> > > locale, is the only reasonable way to have this kind
> 
>    Agreed. My input here is that there should be clear documentation of
>    how to retrieve a value for that argument from a buffer's context,
>    (maybe the same way that flyspell does?).

Sorry, I don't see the relevance.  I was talking about calling
functions, so how does some buffer enter this picture?  Buffers don't
have anything to do with the locale used by library functions called
by Emacs.




* Re: CSV parsing and other issues (Re: LC_NUMERIC)
  2021-06-07 12:28 ` Eli Zaretskii
@ 2021-06-08  0:45   ` Boruch Baum
  2021-06-08  2:35     ` Eli Zaretskii
  0 siblings, 1 reply; 33+ messages in thread
From: Boruch Baum @ 2021-06-08  0:45 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: manikulin, emacs-devel

On 2021-06-07 15:28, Eli Zaretskii wrote:
> Sorry, I don't see the relevance.  I was talking about calling
> functions, so how does some buffer enter this picture?  Buffers don't
> have anything to do with the locale used by library functions called
> by Emacs.

No? If an Emacs user has two buffers in two separate languages, the
buffer-local settings aren't / won't be respected?

--
hkp://keys.gnupg.net
CA45 09B5 5351 7C11 A9D1  7286 0036 9E45 1595 8BC0




* Re: CSV parsing and other issues (Re: LC_NUMERIC)
  2021-06-08  0:45   ` Boruch Baum
@ 2021-06-08  2:35     ` Eli Zaretskii
  2021-06-08 15:35       ` Stefan Monnier
  2021-06-08 16:35       ` Maxim Nikulin
  0 siblings, 2 replies; 33+ messages in thread
From: Eli Zaretskii @ 2021-06-08  2:35 UTC (permalink / raw)
  To: Boruch Baum; +Cc: manikulin, emacs-devel

> Date: Mon, 7 Jun 2021 20:45:10 -0400
> From: Boruch Baum <boruch_baum@gmx.com>
> Cc: emacs-devel@gnu.org, manikulin@gmail.com
> 
> On 2021-06-07 15:28, Eli Zaretskii wrote:
> > Sorry, I don't see the relevance.  I was talking about calling
> > functions, so how does some buffer enter this picture?  Buffers don't
> > have anything to do with the locale used by library functions called
> > by Emacs.
> 
> No? If an Emacs user has two buffers in two separate languages, the
> buffer-local settings aren't / won't be respected?

First, language is different from locale.  And second, we don't even
have a buffer-local notion of language yet.  What we can support (but
seldom if ever do) is to have a buffer-local case-conversion table,
which is a very small part of language- or locale-dependent settings.

So no, buffer-local aspects in general don't affect what you have in
mind, not yet anyway.




* Re: CSV parsing and other issues (Re: LC_NUMERIC)
  2021-06-08  2:35     ` Eli Zaretskii
@ 2021-06-08 15:35       ` Stefan Monnier
  2021-06-08 16:35       ` Maxim Nikulin
  1 sibling, 0 replies; 33+ messages in thread
From: Stefan Monnier @ 2021-06-08 15:35 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Boruch Baum, manikulin, emacs-devel

> First, language is different from locale.  And second, we don't even
> have a buffer-local notion of language yet.  What we can support (but
> seldom if ever do) is to have buffer-local case-conversion table,
> which is a very small part of language- or locale-dependent settings.
>
> So no, buffer-local aspects in general don't affect what you have in
> mind, not yet anyway.

Worse: it's not uncommon to run code which doesn't really care about its
current-buffer, so it's not always correct to presume that the settings
of the current-buffer should be used.  We already suffer from such
problems in some corner cases with code that uses `\<` or `\_<` in
regexps matching on strings (rather than buffer content) where the
result can unexpectedly depend on the buffer which happens to
be current.

        Stefan


* Re: CSV parsing and other issues (Re: LC_NUMERIC)
  2021-06-08  2:35     ` Eli Zaretskii
  2021-06-08 15:35       ` Stefan Monnier
@ 2021-06-08 16:35       ` Maxim Nikulin
  2021-06-08 18:52         ` Eli Zaretskii
  1 sibling, 1 reply; 33+ messages in thread
From: Maxim Nikulin @ 2021-06-08 16:35 UTC (permalink / raw)
  To: emacs-devel; +Cc: emacs-devel

On 08/06/2021 09:35, Eli Zaretskii wrote:
 > From: Boruch Baum
 >> No? If an Emacs user has two buffers in two separate languages, the
 >> buffer-local settings aren't / won't be respected?
 >
 > First, language is different from locale.  And second, we don't even
 > have a buffer-local notion of language yet.

Certainly locale is more precise than just language since it includes 
region and other variants, moreover it can be granularly tuned (date, 
numbers, sorting can be adjusted independently), but I still think that 
all these properties can be sometimes broadly referred to as language.

Are we not discussing a feature request? Low-level functions can accept
an explicit locale. A higher-level API can obtain it implicitly from
buffer-local variables and the global locale. For example, the LOCALE
argument of `string-collate-lessp' is optional. I can even anticipate
that the locale may sometimes be stored in text properties. A random
message from the recent "About multilingual documents" thread on the
emacs-orgmode mailing list:
https://lists.gnu.org/archive/html/emacs-orgmode/2021-05/msg00252.html

Basic functionality may be implemented first. The problem is choosing
an extensible API.

On 07/06/2021 06:36, Boruch Baum wrote:
 > I get 'nil' when I check for any related
 > environment variable using function `getenv'

Do not confuse setlocale and setenv. setenv affects
later calls to setlocale (with "" as the locale argument)
and child processes. setlocale deals with the current
process; it can take into account or override the values
of environment variables. setlocale is not exposed to elisp.
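
The difference can be observed from any language that exposes both
calls. A small sketch in Python (the helper name `locale_after_setenv`
is made up; `locale.setlocale` and `os.environ` are the standard library
wrappers around setlocale and the process environment):

```python
import locale
import os


def locale_after_setenv(value):
    """Set LC_NUMERIC in the environment and report the active locale."""
    saved = os.environ.get("LC_NUMERIC")
    # Start from a known state: the process-wide LC_NUMERIC is "C".
    locale.setlocale(locale.LC_NUMERIC, "C")
    os.environ["LC_NUMERIC"] = value  # the setenv step
    try:
        # Calling setlocale without a locale name only queries the
        # current setting; the environment is not consulted.
        return locale.setlocale(locale.LC_NUMERIC)
    finally:
        # Put the environment back the way it was.
        if saved is None:
            del os.environ["LC_NUMERIC"]
        else:
            os.environ["LC_NUMERIC"] = saved
```

Here locale_after_setenv("de_DE.UTF-8") still reports "C": the changed
variable would only matter to a later setlocale call with "" as the
locale name, or to a child process started afterwards.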

> The single quote format (for the thousands separator) can be expected
> to produce a result always for all conditions of locale, while I
> expect most locale cases won't produce any special output for the
> upper-case I format option.

I still think that the "'" and "I" formats are tightly bound. Grouping
style is locale-dependent, so the representation of digits is just
another property of the locale.

LC_NUMERIC=C.UTF-8 /usr/bin/printf "%'d\n" 1234567890
1234567890
LC_NUMERIC=en_US.UTF-8 /usr/bin/printf "%'d\n" 1234567890
1,234,567,890
LC_NUMERIC=es_ES.UTF-8 /usr/bin/printf "%'d\n" 1234567890
1.234.567.890
LC_NUMERIC=ru_RU.UTF-8 /usr/bin/printf "%'d\n" 1234567890
1 234 567 890

Even group size is not always 3

LC_NUMERIC=en_IN.UTF-8 /usr/bin/printf "%'d\n" 1234567890
1,23,45,67,890

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/NumberFormat
"India uses thousands/lakh/crore separators"

I have just realized that nl_langinfo(3) (and nl_langinfo_l(3) as well)
from libc accept RADIXCHAR (decimal point) and THOUSEP (group separator)
arguments. They are good candidates for a `locale-info' extension.
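
The separator alone does not determine the grouping, as the en_IN
example above shows; the group sizes are a locale property too. A sketch
of the grouping step in Python, assuming the sizes are given in the list
form returned by Python's `locale.localeconv()` (the function name
`group_digits` is made up for illustration):

```python
import locale


def group_digits(digits, separator, grouping):
    """Insert group separators into a string of digits.

    `grouping` lists group sizes from right to left. A trailing 0 means
    "repeat the previous size"; locale.CHAR_MAX means "stop grouping".
    """
    groups = []
    rest = digits
    size = 0
    for item in grouping:
        if item == locale.CHAR_MAX:
            break  # no further grouping
        if item == 0:
            # Repeat the previous size for all remaining digits.
            while size and len(rest) > size:
                groups.append(rest[-size:])
                rest = rest[:-size]
            break
        size = item
        if len(rest) <= size:
            break
        groups.append(rest[-size:])
        rest = rest[:-size]
    groups.append(rest)
    return separator.join(reversed(groups))
```

With grouping [3, 2, 0] and "," this turns "1234567890" into
"1,23,45,67,890"; with [3, 0] it gives "1,234,567,890"; and with an
empty list, as in the "C" locale, the digits are returned unchanged.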

On 05/06/2021 02:17, Eli Zaretskii wrote:
 >> From: Maxim Nikulin Date: Fri, 4 Jun 2021 23:31:13 +0700
 >> On linux I see that Emacs is linked with ICU
 >
 > It isn't.  It's either HarfBuzz or maybe libc that pulls in the ICU
 > library.  Emacs doesn't use it directly.

Actually Qt links my example with other libraries from ICU. My point was
that since Emacs links with this library anyway (indirectly), the
dependency may not be so heavy. My personal requirements for number
formatting have been quite modest so far. I expect that other languages
(CJK, right-to-left scripts, etc.) may require quite special treatment,
so an implementation in Emacs (and its further maintenance) may require
a lot of work. At the least, the ICU API should be studied for
inspiration about which features will be necessary for users from other
regions.

E.g. I was completely unaware that a negative sign may be represented by
parentheses (JavaScript, may be executed in browser developer tools):

new Intl.NumberFormat('en-GB', {
   style: 'currency',
   currency: 'USD',
   currencySign: 'accounting',
   signDisplay: 'always'
}).format(-3500);
"(US$3,500.00)"

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/NumberFormat/NumberFormat

I do not know whether the Intl API is really convenient. I see that
there is no direct way to get the decimal separator. However, it can
serve as another source of inspiration.

I expect enough surprises and unexpected "discoveries" during 
implementation of better locale support. That is why I would consider 
adapting some more or less established API for this purpose.

P.S.

On 07/06/2021 06:36, Boruch Baum wrote:
> and in practice I need to temporarily manually use function setenv to
> set LC_COLLATE=C in order to offer several sorting options in package
> diredc.

Ideally you should avoid this and use the envp argument of the execve(2)
system call. Otherwise it could interfere with other packages, especially
if threads are involved. I am not sure whether Emacs currently provides
such an API option.
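
For comparison, this is how the same idea looks in Python, where the
`env` argument of `subprocess.run` plays the role of envp. It is only an
illustration of the technique, not a statement about what Emacs offers;
the helper name `run_with_collate_c` is made up.

```python
import os
import subprocess
import sys


def run_with_collate_c(argv):
    """Run a command with LC_COLLATE=C without touching os.environ."""
    # Copy the parent's environment and override one variable in the
    # copy only. Note that LC_ALL, if set, takes precedence over
    # LC_COLLATE in the child.
    child_env = dict(os.environ, LC_COLLATE="C")
    result = subprocess.run(argv, env=child_env, capture_output=True,
                            text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # The child reports the value it received.
    code = "import os; print(os.environ['LC_COLLATE'])"
    print(run_with_collate_c([sys.executable, "-c", code]), end="")
```

The parent's own environment is never modified, so nothing has to be
restored afterwards and other code running in the same process is not
affected.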

> it's temporarily setting a shell environment and having the external
> ls program perform the sort for emacs-core dired.

I am unsure whether "ls" may be reliably used at all. File names may
contain e.g. newlines, various control characters, or parts that look
rather similar to ls output. I am not familiar with dired internals. At
first my intention was to create an issue for diredc, but skimming through
its code I did not find a direct "ls" invocation. Some problems with ls:
https://mywiki.wooledge.org/BashPitfalls?highlight=%28%5CbCategoryShell%5Cb%29#for_f_in_.24.28ls_.2A.mp3.29
Bash Pitfalls: item #1


* Re: CSV parsing and other issues (Re: LC_NUMERIC)
  2021-06-08 16:35       ` Maxim Nikulin
@ 2021-06-08 18:52         ` Eli Zaretskii
  2021-06-10 16:28           ` Maxim Nikulin
  0 siblings, 1 reply; 33+ messages in thread
From: Eli Zaretskii @ 2021-06-08 18:52 UTC (permalink / raw)
  To: Maxim Nikulin; +Cc: boruch_baum, emacs-devel

> Cc: emacs-devel@gnu.org
> From: Maxim Nikulin <manikulin@gmail.com>
> Date: Tue, 8 Jun 2021 23:35:51 +0700
> 
> On 08/06/2021 09:35, Eli Zaretskii wrote:
>  > From: Boruch Baum
>  >> No? If an Emacs user has two buffers in two separate languages, the
>  >> buffer-local settings aren't / won't be respected?
>  >
>  > First, language is different from locale.  And second, we don't even
>  > have a buffer-local notion of language yet.
> 
> Certainly locale is more precise than just language since it includes 
> region and other variants, moreover it can be granularly tuned (date, 
> numbers, sorting can be adjusted independently), but I still think that 
> all these properties can be sometimes broadly referred to as language.

No, they cannot, not in general.  A locale comes with a whole database
of different settings: language, encoding (a.k.a. "codeset"), formats
of date and time, names of days of the week and of the months, rules
for collation and capitalization, etc. etc.  You can easily find
several locales whose language is English, but some/many/all of the
other locale-dependent settings are different.  It isn't a coincidence
that a locale's name includes more than just the language part.

> Low level functions can accept explicit locale.

Which ones?  Most libc routines don't, they use the locale as a global
identifier.  And many libc's (with the prominent exception of glibc)
don't support efficient change of a locale in the middle of a program,
they assume that the program's locale is set once at program startup.

> Higher level API can obtain it implicitly from 
> buffer-local variables and global locale. For example the LOCALE 
> argument of `string-collate-lessp' is optional one. I can even 
> anticipate that locale may be stored in text properties some times. A 
> random message from recent "About multilingual documents" thread at 
> emacs-orgmode mail list:
> https://lists.gnu.org/archive/html/emacs-orgmode/2021-05/msg00252.html

That's mostly about input methods and org-export, I don't see how it's
relevant to what Boruch asked.

> At first basic functionality may be implemented. The problem is to 
> choose extensible API.

No, the problem is to have a design that would allow an efficient
implementation.  Given what the underlying libc does, it isn't easy.

And then we have conceptual problems.  For example, in a multilingual
editor such as Emacs, the notion of a "buffer language" does not always
make sense; you'd need to support portions of text that have
different language properties.  Imagine switching locales as Emacs
processes adjacent stretches of text and other complications.  For
example, changing letter-case for a stretch of Turkish text is
supposed to be different from the English or German text.  I'm all
ears for ideas how to design such "language support".  It definitely
isn't easy, so if you have ideas, please voice them!

> I just have realized that nl_langinfo(3) (and nl_langinfo_l(3) as well) 
> from libc accepts RADIXCHAR (decimal dot) and THOUSEP (group separator) 
> arguments. They are good candidates for `locale-info' extension.

We already use nl_langinfo in locale-info, so what exactly are you
suggesting here? adding more items?  You don't really expect Lisp
programs to format numbers such as 123,456 by hand after learning from
locale-info that the thousands separator is a comma, do you?

> Actually Qt links my example with other libraries from ICU. My point was 
> that since Emacs anyway (indirectly) links with this library, the 
> dependency may be not so heavy.

If you are suggesting that we introduce ICU as a dependency, we could
discuss the pros and cons.  It isn't a simple decision, because ICU
comes with a lot of baggage that we already have implemented in Emacs,
so whether we throw away what we have and use ICU instead, or just add
what we miss without depending on ICU, requires good thought and good
acquaintance with the ICU internals (to make sure it does what we want
in Emacs, and doesn't break existing features).

> My personal requirements for number 
> formatting were quite modest so far, I expect that other languages (CJK, 
> right-to-left scripts, etc.) may require quite special treatment, so 
> implementation in Emacs (and further maintenance) may require a lot of 
> work. At least API of ICU should be studied to get some inspiration what 
> features will be necessary for users from other regions.

I don't think the problem is the API.

> E.g. I was completely unaware that negative sign may be represented by 
> parenthesis

Really? it's standard in financial applications.

> I expect enough surprises and unexpected "discoveries" during 
> implementation of better locale support. That is why I would consider 
> adapting some more or less established API for this purpose.

I don't think "consider" cuts it.  We have already a lot of stuff in
Emacs; what we don't have needs serious design and comparison of
available implementation options.  Emacs's needs are quite special and
unlike those of most other programs.


* Re: CSV parsing and other issues (Re: LC_NUMERIC)
  2021-06-08 18:52         ` Eli Zaretskii
@ 2021-06-10 16:28           ` Maxim Nikulin
  2021-06-10 16:57             ` Eli Zaretskii
  2021-06-10 21:10             ` Stefan Monnier
  0 siblings, 2 replies; 33+ messages in thread
From: Maxim Nikulin @ 2021-06-10 16:28 UTC (permalink / raw)
  To: emacs-devel; +Cc: boruch_baum

On 09/06/2021 01:52, Eli Zaretskii wrote:
> From: Maxim Nikulin Date: Tue, 8 Jun 2021 23:35:51 +0700

I have reordered some parts of discussion.

>> I just have realized that nl_langinfo(3) (and
>> nl_langinfo_l(3) as well) from libc accepts RADIXCHAR
>> (decimal dot) and THOUSEP (group separator)
>> arguments. They are good candidates for `locale-info'
>> extension.
>
> We already use nl_langinfo in locale-info, so what exactly
> are you suggesting here? adding more items?  You don't
> really expect Lisp programs to format numbers such as
> 123,456 by hand after learning from locale-info that the
> thousands separator is a comma, do you?

I have hijacked Boruch's thread and changed the subject to "CSV 
parsing".  There are plenty of CSV dialects. If the decimal separator is 
"," then office software uses ";" instead of a comma as the cell (field) 
separator.  So to parse a CSV file it is necessary to know the decimal 
separator in a specified locale. RADIXCHAR as an argument of 
nl_langinfo(3) is a first step to a better user experience with CSV files.

Unfortunately it only allows one to get a reasonable visual 
representation. Taking advantage of Org spreadsheet calculations requires 
parsing cell contents, and thus parsing of numbers (and maybe dates).

I mentioned earlier https://debbugs.gnu.org/47885 and a part of 
discussion that is missed in the bug tracker:
https://lists.gnu.org/archive/html/emacs-orgmode/2021-05/msg00693.html

I have seen nl_langinfo without RADIXCHAR in emacs sources
http://git.savannah.gnu.org/cgit/emacs.git/tree/src/w32proc.c#n3258
http://git.savannah.gnu.org/cgit/emacs.git/tree/lib-src/ntlib.c#n520

Originally, during the discussion on emacs-orgmode, I did not plan to
raise the question concerning number formatting and parsing, since I had
no hope for any positive outcome without a consistent proposal.  By chance
I noticed Boruch's message and decided to add another use case.

>> On 08/06/2021 09:35, Eli Zaretskii wrote:
>>  > From: Boruch Baum
>>  >> No? If an Emacs user has two buffers in two separate languages, the
>>  >> buffer-local settings aren't / won't be respected?
>>  >
>>  > First, language is different from locale.  And second, we don't even
>>  > have a buffer-local notion of language yet.
>>
>> Certainly locale is more precise than just language since it includes
>> region and other variants, moreover it can be granularly tuned (date,
>> numbers, sorting can be adjusted independently), but I still think that
>> all these properties can be sometimes broadly referred to as language.
>
> No, they cannot, not in general.  A locale comes with a whole database
> of different settings: language, encoding (a.k.a. "codeset"), formats
> of date and time, names of days of the week and of the months, rules
> for collation and capitalization, etc. etc.  You can easily find
> several locales whose language is English, but some/many/all of the
> other locale-dependent settings are different.  It isn't a coincidence
> that a locale's name includes more than just the language part.

I wrote almost the same concerning locale variants and components, so I 
feel some sort of confusion and cannot see its origin.  I was trying to 
support Boruch that buffer-local variables may be an important part of 
the locale context, more precise than global settings, and a fallback if 
a locale is not specified for a particular span of text.  With respect to 
such a hierarchy, the language vs. locale difference does not matter.

>> Low level functions can accept explicit locale.
>
> Which ones?  Most libc routines don't, they use the locale
> as a global identifier.  And many libc's (with the prominent
> exception of glibc) don't support efficient change of a
> locale in the middle of a program, they assume that the
> program's locale is set once at program startup.

Hypothetical functions in a new elisp API, maybe relying on some external
libraries.  I believed you agreed that the global LC_NUMERIC must be "C" 
to avoid various sorts of problems with data exchange. I am not aware of 
libc functions for number formatting or parsing that can take an explicit 
locale (I have seen such a feature in the C++ standard library, Qt, and 
other languages).  The totalitarian approach of libc, with a single 
locale and a single timezone, imposes limitations too hard to consider 
some libc functions useful and reliable in a more or less complex 
application. Its API is suitable for simple tools that can quickly do 
their work and do not assume any conversion. A more flexible base layer 
is required when a mix of environments is expected. Full support of 
locale features requires a lot of work, that is why I am asking if some 
external library can be used instead.

>> Higher level API can obtain it implicitly from
>> buffer-local variables and global locale. For example the
>> LOCALE argument of `string-collate-lessp' is optional
>> one. I can even anticipate that locale may be stored in
>> text properties some times. A random message from recent
>> "About multilingual documents" thread at emacs-orgmode
>> mail list:
>> https://lists.gnu.org/archive/html/emacs-orgmode/2021-05/msg00252.html
>
> That's mostly about input methods and org-export, I don't
> see how it's relevant to what Boruch asked.

I added this link to show you that the demand for multilanguage documents 
is real. Notice that problems with spell checking were mentioned in that 
discussion. Earlier I saw suggestions to switch the ispell language along 
with the input method. In my opinion it is ridiculous.  Personally I would 
rather have a combined dictionary than explicitly marked text regions.

I expect that new features will be more widely used once it becomes 
possible to use them.

>> At first basic functionality may be implemented. The
>> problem is to choose extensible API.
>
> No, the problem is to have a design that would allow an
> efficient implementation.  Given what the underlying libc
> does, it isn't easy.

That is why I am looking for an alternative to libc. Previously you wrote
"locale switching". I would rather say constructing and destroying
locales on demand. Switching may not behave so well when threads are 
involved.

> And then we have conceptual problems.  For example, in a
> multilingual editor such as Emacs, the notion of a "buffer
> language" not always makes sense, you'd need to support
> portions of text that have different language properties.
> Imagine switching locales as Emacs processes adjacent
> stretches of text and other complications.  For example,
> changing letter-case for a stretch of Turkish text is
> supposed to be different from the English or German text.
> I'm all ears for ideas how to design such "language
> support".  It definitely isn't easy, so if you have ideas,
> please voice them!

I neither have a consistent vision nor see a conceptual problem. 
Buffer-local settings are just more specific than global ones.  That is 
why I mentioned text properties as even more precise in my previous 
message. Maybe even the current mode can help to build a proper hierarchy 
of locale contexts. HTML has the "lang" attribute, there is 
"\foreignlanguage" in LaTeX, etc.

I have heard that special case exists in Turkish, but I was not curious
enough to find details and rules when and how it should be applied.

> If you are suggesting that we introduce ICU as a dependency,
> we could discuss the pros and cons.

I consider it as the most complete available implementation.  Do you 
know a comparable alternative?

I have realized that since Emacs has support of dynamic modules, it is
possible to create a prototype with bindings to external library without
rebuilding of Emacs.

> I don't think the problem is the API.

I think introducing features gradually will be more of a headache for 
developers of external packages than the absence of support altogether.  
The API determines the scope of such features.

>> E.g. I was completely unaware that negative sign may be
>> represented by parenthesis
>
> Really? it's standard in financial applications.

Is it really so standard? Maybe I have seen such a format, and even 
guessed from context that e.g. a table column with such numbers should 
hold negative values, or e.g. in a discount entry.  At least I did not 
recognize such a format as a general rule.

new Intl.NumberFormat('de-DE', {style: 'currency', currency: 'USD', 
currencySign: 'accounting', signDisplay: 'always'}).format(-3500);
"-3.500,00 $"
new Intl.NumberFormat('es-ES', {style: 'currency', currency: 'USD', 
currencySign: 'accounting', signDisplay: 'always'}).format(-3500);
"-3500,00 US$"
new Intl.NumberFormat('fr-FR', {style: 'currency', currency: 'USD', 
currencySign: 'accounting', signDisplay: 'always'}).format(-3500);
"(3 500,00 $US)"
new Intl.NumberFormat('ru-RU', {style: 'currency', currency: 'USD', 
currencySign: 'accounting', signDisplay: 'always'}).format(-3500);
"-3 500,00 $"

>> I expect enough surprises and unexpected "discoveries"
>> during implementation of better locale support. That is
>> why I would consider adapting some more or less
>> established API for this purpose.
>
> I don't think "consider" cuts it.  We have already a lot of
> stuff in Emacs; what we don't have needs serious design and
> comparison of available implementation options.  Emacs's
> needs are quite special and unlike those of most other
> programs.

I still think that the expectations of users around the globe are more
special than Emacs' needs, at least with respect to the format of numbers.


* Re: CSV parsing and other issues (Re: LC_NUMERIC)
  2021-06-10 16:28           ` Maxim Nikulin
@ 2021-06-10 16:57             ` Eli Zaretskii
  2021-06-10 18:01               ` Boruch Baum
  2021-06-11 16:58               ` Maxim Nikulin
  2021-06-10 21:10             ` Stefan Monnier
  1 sibling, 2 replies; 33+ messages in thread
From: Eli Zaretskii @ 2021-06-10 16:57 UTC (permalink / raw)
  To: Maxim Nikulin; +Cc: boruch_baum, emacs-devel

> From: Maxim Nikulin <manikulin@gmail.com>
> Date: Thu, 10 Jun 2021 23:28:59 +0700
> Cc: boruch_baum@gmx.com
> 
> > We already use nl_langinfo in locale-info, so what exactly
> > are you suggesting here? adding more items?  You don't
> > really expect Lisp programs to format numbers such as
> > 123,456 by hand after learning from locale-info that the
> > thousands separator is a comma, do you?
> 
> I have hijacked Boruch's thread and changed the subject to "CSV 
> parsing".

That explains part of my confusion.  Please try not to hijack
discussions; instead, start a separate thread, to avoid such
confusion.

For processing CSV, if there's a need to know whether the locale uses
the comma as a decimal separator, we could indeed extend locale-info.
But such an extension is almost trivial and doesn't even touch on the
significant problems in the rest of the discussion.

> I was trying to support Boruch that buffer-local variables may be
> important part of locale context, more precise than global settings,

They are more precise, but they don't support mixed languages in the
same buffer, something that happens in Emacs very frequently.  Which
means they are not precise enough.  So my POV is that we should look
for a way to be able to specify the language of some span of text, in
which case buffers that use a single language will be a special case.

> > And then we have conceptual problems.  For example, in a
> > multilingual editor such as Emacs, the notion of a "buffer
> > language" not always makes sense, you'd need to support
> > portions of text that have different language properties.
> > Imagine switching locales as Emacs processes adjacent
> > stretches of text and other complications.  For example,
> > changing letter-case for a stretch of Turkish text is
> > supposed to be different from the English or German text.
> > I'm all ears for ideas how to design such "language
> > support".  It definitely isn't easy, so if you have ideas,
> > please voice them!
> 
> I never have a consistent vision nor see a conceptual problem. 

Here's a trivial example:

  (insert (downcase (buffer-substring POS1 POS2)))

Contrast with

  (insert (downcase "FOO"))

The function 'downcase' gets a Lisp string, but it has no way of
knowing whether the string is actually a portion of current buffer's
text.  So how can it apply the correct letter-case conversions, even
if some buffer-local setting specifies that this should be done using
some specific language's rules?

IOW, one of the non-trivial problems is how to process Lisp strings
correctly for these purposes.  Buffers can have local variables, but
what about strings?

> > If you are suggesting that we introduce ICU as a dependency,
> > we could discuss the pros and cons.
> 
> I consider it as the most complete available implementation.  Do you 
> know a comparable alternative?

Yes: what we have already in Emacs.  That covers a lot of the same
Unicode turf that ICU handles, because we import and use the same
Unicode files and tables.  The question is: what is best for the
future development of Emacs in this area: depend on ICU (which would
mean we need to rewrite lots of code that is working well), or extend
what we have to support more Unicode features?  One not-so-trivial
aspect of this is efficiency of fetching character properties (Emacs
has char-tables for that, which are efficient both CPU- and
memory-wise).  Another aspect is support for raw bytes in buffers and
strings.  And there are probably some others.

It is not a simple decision.


* Re: CSV parsing and other issues (Re: LC_NUMERIC)
  2021-06-10 16:57             ` Eli Zaretskii
@ 2021-06-10 18:01               ` Boruch Baum
  2021-06-10 18:50                 ` Eli Zaretskii
  2021-06-11 16:58               ` Maxim Nikulin
  1 sibling, 1 reply; 33+ messages in thread
From: Boruch Baum @ 2021-06-10 18:01 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Maxim Nikulin, emacs-devel

On 2021-06-10 19:57, Eli Zaretskii wrote:
>
> It is not a simple decision.

My request at the beginning of the (original) thread was much more
limited in scope and still seems to me in fact to be a simple decision,
and with no side effects. Paraphrased:

  Please consider exposing to the elisp `format' function the
  single-quote and upper-case 'I' format specifiers of the libc (or
  other) `printf' command.

Doing this will just offer an elisp programmer a new option.

--
hkp://keys.gnupg.net
CA45 09B5 5351 7C11 A9D1  7286 0036 9E45 1595 8BC0


* Re: CSV parsing and other issues (Re: LC_NUMERIC)
  2021-06-10 18:01               ` Boruch Baum
@ 2021-06-10 18:50                 ` Eli Zaretskii
  2021-06-10 19:04                   ` Boruch Baum
  0 siblings, 1 reply; 33+ messages in thread
From: Eli Zaretskii @ 2021-06-10 18:50 UTC (permalink / raw)
  To: Boruch Baum; +Cc: manikulin, emacs-devel

> Date: Thu, 10 Jun 2021 14:01:45 -0400
> From: Boruch Baum <boruch_baum@gmx.com>
> Cc: Maxim Nikulin <manikulin@gmail.com>, emacs-devel@gnu.org
> 
>   Please consider exposing to the elisp `format' function the
>   single-quote and upper-case 'I' format specifiers of the libc (or
>   other) `printf' command.
> 
> Doing this will just offer an elisp programmer a new option.

That would make the output of 'format' dependent on the current
locale, unless we do something else to allow Lisp programs to take
control on what those specifiers produce.  That "something else" is
what I was talking about.  It is true that I was talking about larger
range of issues, but still, even this small feature touches on some of
them.  And I don't think you had any ideas for how to resolve those
issues, or did I miss something?


* Re: CSV parsing and other issues (Re: LC_NUMERIC)
  2021-06-10 18:50                 ` Eli Zaretskii
@ 2021-06-10 19:04                   ` Boruch Baum
  2021-06-10 19:23                     ` Eli Zaretskii
  0 siblings, 1 reply; 33+ messages in thread
From: Boruch Baum @ 2021-06-10 19:04 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: manikulin, emacs-devel

On 2021-06-10 21:50, Eli Zaretskii wrote:
> > Date: Thu, 10 Jun 2021 14:01:45 -0400
> > From: Boruch Baum <boruch_baum@gmx.com>
> > Cc: Maxim Nikulin <manikulin@gmail.com>, emacs-devel@gnu.org
> >
> >   Please consider exposing to the elisp `format' function the
> >   single-quote and upper-case 'I' format specifiers of the libc (or
> >   other) `printf' command.
> >
> > Doing this will just offer an elisp programmer a new option.
>
> That would make the output of 'format' dependent on the current
> locale

That's the elisp programmer's business, not your responsibility.

> ...
> And I don't think you had any ideas for how to resolve those issues,
> or did I miss something?

Yes, that I haven't invested in responding about those issues because I
don't see any of them as relevant.

  + Elisp function `format' exists.

  + Elisp function `format' uses `printf' format specifiers.

  + Elisp function `format' doesn't expose two of them.

--
hkp://keys.gnupg.net
CA45 09B5 5351 7C11 A9D1  7286 0036 9E45 1595 8BC0




* Re: CSV parsing and other issues (Re: LC_NUMERIC)
  2021-06-10 19:04                   ` Boruch Baum
@ 2021-06-10 19:23                     ` Eli Zaretskii
  2021-06-10 20:20                       ` Boruch Baum
  2021-06-11 13:56                       ` Filipp Gunbin
  0 siblings, 2 replies; 33+ messages in thread
From: Eli Zaretskii @ 2021-06-10 19:23 UTC (permalink / raw)
  To: Boruch Baum; +Cc: manikulin, emacs-devel

> Date: Thu, 10 Jun 2021 15:04:53 -0400
> From: Boruch Baum <boruch_baum@gmx.com>
> Cc: manikulin@gmail.com, emacs-devel@gnu.org
> 
> > That would make the output of 'format' dependent on the current
> > locale
> 
> That's the elisp programmer's business, not your responsibility.

What could the Lisp programmer do in this situation?

>   + Elisp function `format' uses `printf' format specifiers.

Only for some of the 'format's capabilities, not for all of them.

>   + Elisp function `format' doesn't expose two of them.

I don't think it's TRT for Emacs to expose locale-dependent features
that cannot be controlled from Lisp, sorry.  We need to find a better
way.  For example, there could be a Lisp variable that specifies the
group separator character, and then 'format' could use that character
when the format spec includes %'.  Which means we'd need to implement
that in our own code; patches welcome.
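
As a rough illustration of that idea (a sketch only, not Emacs code: the
name `format_grouped' and its signature are hypothetical), grouping can be
done without consulting the locale at all if the separator is passed in
explicitly.  This version assumes fixed groups of three digits, so it would
not cover Indian-style grouping such as 1,23,45,67,890:

```c
#include <stdio.h>
#include <string.h>

/* Write VALUE into OUT with SEP inserted between groups of three digits.
   The separator is an explicit argument, so the result does not depend
   on LC_NUMERIC.  Returns the length written, or -1 if OUT is too small. */
int format_grouped(long long value, const char *sep, char *out, size_t out_size)
{
    char digits[32];
    /* Work on the magnitude so that the most negative value is safe. */
    unsigned long long mag = value < 0 ? 0ULL - (unsigned long long) value
                                       : (unsigned long long) value;
    int ndigits = snprintf(digits, sizeof digits, "%llu", mag);
    size_t sep_len = strlen(sep);
    size_t needed = (size_t) (value < 0) + (size_t) ndigits
                    + sep_len * (size_t) ((ndigits - 1) / 3) + 1;
    if (needed > out_size)
        return -1;

    char *p = out;
    if (value < 0)
        *p++ = '-';
    for (int i = 0; i < ndigits; i++) {
        /* A separator precedes every digit whose distance to the end
           of the number is a positive multiple of three. */
        if (i > 0 && (ndigits - i) % 3 == 0) {
            memcpy(p, sep, sep_len);
            p += sep_len;
        }
        *p++ = digits[i];
    }
    *p = '\0';
    return (int) (p - out);
}
```

A Lisp-level variable holding the separator would then simply be handed to
such a routine whenever the format spec includes %'.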


* Re: CSV parsing and other issues (Re: LC_NUMERIC)
  2021-06-10 19:23                     ` Eli Zaretskii
@ 2021-06-10 20:20                       ` Boruch Baum
  2021-06-11  6:19                         ` Eli Zaretskii
  2021-06-11 13:56                       ` Filipp Gunbin
  1 sibling, 1 reply; 33+ messages in thread
From: Boruch Baum @ 2021-06-10 20:20 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: manikulin, emacs-devel

On 2021-06-10 22:23, Eli Zaretskii wrote:
> > Date: Thu, 10 Jun 2021 15:04:53 -0400
> > From: Boruch Baum <boruch_baum@gmx.com>
> > Cc: manikulin@gmail.com, emacs-devel@gnu.org
> >
> > > That would make the output of 'format' dependent on the current
> > > locale
> >
> > That's the elisp programmer's business, not your responsibility.
>
> What could the Lisp programmer do in this situation?

It's not your responsibility.

I can say that in the use-case that prompted my request, I'm confident
it will *never* be an issue. I ask format to give me a string and I
display it. End of story. Whether just 99% or 99.99%, the overwhelming
majority of cases will be the same. Your concerns are total non-issues.

> >   + Elisp function `format' uses `printf' format specifiers.
>
> Only for some of the 'format's capabilities, not for all of them.

  [Commentary: 'Some' isn't a number or a percentage.]

  [Commentary: I see all format specifiers supported but the two
               requested.]

> >   + Elisp function `format' doesn't expose two of them.
>
> I don't think it's TRT for Emacs to expose locale-dependent features
> that cannot be controlled from Lisp

Then don't make them locale specific. Implement the single-quote
specifier the same way you currently handle the floating-point specifier
'%f', a locale-specific format that has existed in emacs without
complaint since ...

--
hkp://keys.gnupg.net
CA45 09B5 5351 7C11 A9D1  7286 0036 9E45 1595 8BC0




* Re: CSV parsing and other issues (Re: LC_NUMERIC)
  2021-06-10 16:28           ` Maxim Nikulin
  2021-06-10 16:57             ` Eli Zaretskii
@ 2021-06-10 21:10             ` Stefan Monnier
  2021-06-12 14:41               ` Maxim Nikulin
  1 sibling, 1 reply; 33+ messages in thread
From: Stefan Monnier @ 2021-06-10 21:10 UTC (permalink / raw)
  To: Maxim Nikulin; +Cc: emacs-devel, boruch_baum

> There are plenty of CSV dialects. If decimal separator is "," then office
> software uses ";" instead of comma as cell (field) separator.

But there's no reason to presume that a given CSV file was generated in
the same locale as the one we're currently using.

So the locale could be one ingredient in the machinery used to guess
which separator was used, but I'm not sure it would be of much help.
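
To make that concrete, here is a minimal sketch of such guessing (the
function names are hypothetical and the heuristic is an assumption, not an
existing implementation): count each candidate separator outside quoted
fields on one line, and let the locale-derived hint decide only when the
counts tie.

```c
#include <stddef.h>

/* Count occurrences of SEP in LINE that fall outside double-quoted fields. */
static size_t count_unquoted(const char *line, char sep)
{
    size_t n = 0;
    int in_quotes = 0;
    for (const char *p = line; *p != '\0'; p++) {
        if (*p == '"')
            in_quotes = !in_quotes;  /* a doubled "" toggles twice: no net change */
        else if (*p == sep && !in_quotes)
            n++;
    }
    return n;
}

/* Guess the field separator of one CSV line.  LOCALE_HINT (e.g. ';' when
   the locale's decimal separator is ',') wins only when counts are tied. */
char guess_csv_separator(const char *line, char locale_hint)
{
    static const char candidates[] = { ',', ';', '\t' };
    char best = locale_hint;
    size_t best_count = count_unquoted(line, locale_hint);
    for (size_t i = 0; i < sizeof candidates; i++) {
        size_t c = count_unquoted(line, candidates[i]);
        if (c > best_count) {
            best = candidates[i];
            best_count = c;
        }
    }
    return best;
}
```

A line such as 1,5;2,5;3 has as many commas as semicolons, so only the hint
settles it, and a line with more decimal commas than fields would still be
misread.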

[ BTW, I'll take the opportunity to advocate for the use of TSV
  instead, which is slightly less ill-defined.  ]

        Stefan


* Re: CSV parsing and other issues (Re: LC_NUMERIC)
  2021-06-10 20:20                       ` Boruch Baum
@ 2021-06-11  6:19                         ` Eli Zaretskii
  2021-06-11  8:18                           ` Boruch Baum
  2021-06-11 16:51                           ` Maxim Nikulin
  0 siblings, 2 replies; 33+ messages in thread
From: Eli Zaretskii @ 2021-06-11  6:19 UTC (permalink / raw)
  To: Boruch Baum; +Cc: manikulin, emacs-devel

> Date: Thu, 10 Jun 2021 16:20:45 -0400
> From: Boruch Baum <boruch_baum@gmx.com>
> Cc: manikulin@gmail.com, emacs-devel@gnu.org
> 
> On 2021-06-10 22:23, Eli Zaretskii wrote:
> > > Date: Thu, 10 Jun 2021 15:04:53 -0400
> > > From: Boruch Baum <boruch_baum@gmx.com>
> > > Cc: manikulin@gmail.com, emacs-devel@gnu.org
> > >
> > > > That would make the output of 'format' dependent on the current
> > > > locale
> > >
> > > That's the elisp programmer's business, not your responsibility.
> >
> > What could the Lisp programmer do in this situation?
> 
> It's not your responsibility.

It is my responsibility to make sure we don't add to Emacs features
that are not very useful, or are against the Emacs philosophy and/or
design principles.

> I can say that in the use-case that prompted my request, I'm confident
> it will *never* be an issue. I ask format to give me a string and I
> display it. End of story. Whether just 99% or 99.99%, the overwhelming
> majority of cases will be the same. Your concerns are total non-issues.

You can always write a module to implement this feature, if you want
it for your own purposes.  Or you could change Emacs to support that
directly and maintain that change locally.  There's no need to
introduce into Emacs features that are useful for a few people.

>   [Commentary: I see all format specifiers supported but the two
>                requested.]

You are overlooking some aspects of the code if that is your
conclusion.

> > I don't think it's TRT for Emacs to expose locale-dependent features
> > that cannot be controlled from Lisp
> 
> Then don't make them locale specific. Implement the single-quote
> specifier the same way you currently handle the floating-point specifier
> '%f', a locale-specific format that has existed in emacs without
> complaint since ...

That was my suggestion, more or less.  Patches are welcome to
implement that.




* Re: CSV parsing and other issues (Re: LC_NUMERIC)
  2021-06-11  6:19                         ` Eli Zaretskii
@ 2021-06-11  8:18                           ` Boruch Baum
  2021-06-11 16:51                           ` Maxim Nikulin
  1 sibling, 0 replies; 33+ messages in thread
From: Boruch Baum @ 2021-06-11  8:18 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: manikulin, emacs-devel

On 2021-06-11 09:19, Eli Zaretskii wrote:
> You can always write a module to implement this feature, if you want
> it for your own purposes.

Done and published and on MELPA before my first post here. And I wasn't
the first; there are other code examples available elsewhere.

> There's no need to introduce into Emacs features that are useful for a
> few people.

??? But it's clear that you're set in your decision. I think I've done
more than enough to try and benefit others on this one.

--
hkp://keys.gnupg.net
CA45 09B5 5351 7C11 A9D1  7286 0036 9E45 1595 8BC0




* Re: CSV parsing and other issues (Re: LC_NUMERIC)
  2021-06-10 19:23                     ` Eli Zaretskii
  2021-06-10 20:20                       ` Boruch Baum
@ 2021-06-11 13:56                       ` Filipp Gunbin
  2021-06-11 14:10                         ` Eli Zaretskii
  1 sibling, 1 reply; 33+ messages in thread
From: Filipp Gunbin @ 2021-06-11 13:56 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: manikulin, Boruch Baum, emacs-devel

On 10/06/2021 22:23 +0300, Eli Zaretskii wrote:

> I don't think it's TRT for Emacs to expose locale-dependent features
> that cannot be controlled from Lisp, sorry.  We need to find a better
> way.  For example, there could be a Lisp variable that specifies the
> group separator character, and then 'format' could use that character
> when the format spec includes %'.  Which means we'd need to implement
> that in our own code; patches welcome.

Maybe an alternative set of specifiers, which output data in
locale-specific format.  Then a single variable to let-bind around
format, which instructs what locale to use.  Very simple...

Filipp




* Re: CSV parsing and other issues (Re: LC_NUMERIC)
  2021-06-11 13:56                       ` Filipp Gunbin
@ 2021-06-11 14:10                         ` Eli Zaretskii
  2021-06-11 18:52                           ` Filipp Gunbin
  0 siblings, 1 reply; 33+ messages in thread
From: Eli Zaretskii @ 2021-06-11 14:10 UTC (permalink / raw)
  To: Filipp Gunbin; +Cc: manikulin, boruch_baum, emacs-devel

> From: Filipp Gunbin <fgunbin@fastmail.fm>
> Cc: Boruch Baum <boruch_baum@gmx.com>,  manikulin@gmail.com,
>   emacs-devel@gnu.org
> Date: Fri, 11 Jun 2021 16:56:34 +0300
> 
> On 10/06/2021 22:23 +0300, Eli Zaretskii wrote:
> 
> > I don't think it's TRT for Emacs to expose locale-dependent features
> > that cannot be controlled from Lisp, sorry.  We need to find a better
> > way.  For example, there could be a Lisp variable that specifies the
> > group separator character, and then 'format' could use that character
> > when the format spec includes %'.  Which means we'd need to implement
> > that in our own code; patches welcome.
> 
> Maybe an alternative set of specifiers, which output data in
> locale-specific format.  Then a single variable to let-bind around
> format, which instructs what locale to use.  Very simple...

Sorry, I don't think I understand what you propose.  Please elaborate
on the "alternative set of specifiers, which output data in
locale-specific format".




* Re: CSV parsing and other issues (Re: LC_NUMERIC)
  2021-06-11  6:19                         ` Eli Zaretskii
  2021-06-11  8:18                           ` Boruch Baum
@ 2021-06-11 16:51                           ` Maxim Nikulin
  1 sibling, 0 replies; 33+ messages in thread
From: Maxim Nikulin @ 2021-06-11 16:51 UTC (permalink / raw)
  To: emacs-devel

Eli, Boruch, you are overreacting (both).

On 11/06/2021 13:19, Eli Zaretskii wrote:
> There's no need to
> introduce into Emacs features that are useful for a few people.

I think that the expectations of users and developers with respect to 
locale support evolve over time. Proper formatting of numbers is useful 
to many more than a few people.

Boruch, until your last messages I believed that you were convinced that 
adding support for "'" and "I" is not so easy.

Support of locale-dependent format specifiers through printf looks 
attractive but it can not be directly used by `format' or other elisp 
functions in a safe way.

Some code calling `format' implicitly expects that it generates 
locale-independent numbers, so changing its behavior is not backward 
compatible.

libc can only work with a single global locale at any moment. I expect 
that attempts to "temporarily" call setlocale(LC_NUMERIC, "") will be a 
permanent source of bugs: a forgotten reverting call, a call of a function 
that needs the universal format in a locale-specific context, threads 
started at an inappropriate moment, etc.

Another implementation of locale functions is necessary, with the ability 
to perform parsing and formatting without touching global variables.

Personally I expect basic level functions with explicit locale context 
(random names):

     (locale-format-number-with-ctx
      (locale-get-current-context :group-separator 'suppress)
      1234567890)

or with explicit locale instead of `locale-get-current-context'. It is 
better to add some convenience helpers that inspect text properties, 
buffer-local and global settings to determine context:

     (locale-format-number 1234567890)

and maybe even `locale-format[-with-ctx]' that accepts printf-like 
format string.

On 11/06/2021 03:20, Boruch Baum wrote:
 > Then don't make them locale specific. Implement the
 > single-quote specifier the same way you currently handle the
 > floating-point specifier '%f', a locale-specific format that
 > has existed in emacs without complaint since ...

You are confusing something. "%f" is not locale-specific inside Emacs,
it uses "universal" format with dot "." as decimal separator even in
locales with "," in this role. At the same time "'" is highly
locale-dependent in libc. Group sizes and group separator widely
vary. I posted this example earlier:

LC_NUMERIC=C.UTF-8 /usr/bin/printf "%'d\n" 1234567890
1234567890
LC_NUMERIC=en_US.UTF-8 /usr/bin/printf "%'d\n" 1234567890
1,234,567,890
LC_NUMERIC=es_ES.UTF-8 /usr/bin/printf "%'d\n" 1234567890
1.234.567.890
LC_NUMERIC=ru_RU.UTF-8 /usr/bin/printf "%'d\n" 1234567890
1 234 567 890
LC_NUMERIC=en_IN.UTF-8 /usr/bin/printf "%'d\n" 1234567890
1,23,45,67,890

 > It's not your responsibility.
 >
 > I can say that in the use-case that prompted my request, I'm
 > confident it will *never* be an issue. I ask format to give
 > me a string and I display it. End of story. Whether just 99%
 > or 99.99%, the overwhelming majority of cases will be the
 > same. Your concerns are total non-issues.

I would prefer to avoid idiosyncrasy when "%'d" is locale-dependent but 
"%f" is not.

P.S.

With some limitations (the printf binary must be available and you do not 
need to work with floating point numbers), you can leverage libc 
formatting facilities with the following crutch:

(shell-command-to-string
 (format "/usr/bin/printf \"%%'d\" %d" 1234567890))


* Re: CSV parsing and other issues (Re: LC_NUMERIC)
  2021-06-10 16:57             ` Eli Zaretskii
  2021-06-10 18:01               ` Boruch Baum
@ 2021-06-11 16:58               ` Maxim Nikulin
  2021-06-11 18:04                 ` Eli Zaretskii
  1 sibling, 1 reply; 33+ messages in thread
From: Maxim Nikulin @ 2021-06-11 16:58 UTC (permalink / raw)
  To: emacs-devel; +Cc: boruch_baum

On 10/06/2021 23:57, Eli Zaretskii wrote:
 >> From: Maxim Nikulin Date: Thu, 10 Jun 2021 23:28:59 +0700
 >
 > For processing CSV, if there's a need to know whether the
 > locale uses the comma as a decimal separator, we could
 > indeed extend locale-info.  But such an extension is almost
 > trivial and doesn't even touch on the significant problems
 > in the rest of the discussion.
 >

You forgot `setlocale(LC_NUMERIC, "C")', didn't you?

#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

int main() {
	setlocale(LC_ALL, "");
	printf("%c", *nl_langinfo(RADIXCHAR));
	setlocale(LC_NUMERIC, "C");
	printf("%c\n", *nl_langinfo(RADIXCHAR));
	return 0;
}

Output is ",.". There is nl_langinfo_l(3), but it requires more work.
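
A minimal sketch of that extra work, assuming a POSIX.1-2008 libc such as
glibc (the helper name `locale_radix' is made up for illustration): build a
locale object with newlocale(3), query it with nl_langinfo_l(3), and free
it, so that the global LC_NUMERIC stays "C" the whole time.

```c
#define _POSIX_C_SOURCE 200809L
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Copy the decimal separator of LOCALE_NAME ("" means the locale from the
   environment) into OUT.  Returns 0 on success, -1 if the locale cannot
   be created or OUT is too small.  The global locale is never modified. */
int locale_radix(const char *locale_name, char *out, size_t out_size)
{
    locale_t loc = newlocale(LC_NUMERIC_MASK, locale_name, (locale_t) 0);
    if (loc == (locale_t) 0)
        return -1;
    /* The returned string may become invalid after freelocale,
       so copy it before releasing the locale object. */
    const char *radix = nl_langinfo_l(RADIXCHAR, loc);
    int ok = strlen(radix) < out_size;
    if (ok)
        strcpy(out, radix);
    freelocale(loc);
    return ok ? 0 : -1;
}
```

The result is copied as a string rather than a single char because the
separator is not guaranteed to be one byte in every locale.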

After parsing of rows to cells, it may be necessary to parse numbers 
("2,34" to 2.34). That is why quality of CSV file import is tightly 
related to handling of number formats.
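
As an illustration only, here is a sketch of such a cell parser with an
explicit decimal separator. The name parse_decimal is hypothetical. It
deliberately ignores exponents and grouping separators and does not
depend on the current LC_NUMERIC value:

```c
#include <assert.h>
#include <stddef.h>

/* Parse an optionally signed decimal number such as "2,34" whose
   decimal separator is SEP.  Stores the value in *OUT and returns 0,
   or returns -1 if S is empty or contains anything else.  */
int parse_decimal(const char *s, char sep, double *out) {
    assert(s != NULL && out != NULL);
    int negative = 0;
    if (*s == '+' || *s == '-')
        negative = (*s++ == '-');
    double mantissa = 0.0, scale = 1.0;
    int digits = 0, seen_sep = 0;
    for (; *s != '\0'; s++) {
        if (*s >= '0' && *s <= '9') {
            mantissa = mantissa * 10.0 + (*s - '0');
            if (seen_sep)
                scale *= 10.0;  /* one more fractional digit */
            digits++;
        } else if (*s == sep && !seen_sep) {
            seen_sep = 1;
        } else {
            return -1;          /* unexpected character */
        }
    }
    if (digits == 0)
        return -1;
    *out = (negative ? -mantissa : mantissa) / scale;
    return 0;
}
```

With ',' as SEP the cell "2,34" is accepted while "2.34" is rejected,
which is the kind of decision the importer has to make per dialect.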

 >> I was trying to support Boruch that buffer-local variables
 >> may be important part of locale context, more precise than
 >> global settings,
 >
 > They are more precise, but they don't support mixed
 > languages in the same buffer, something that happens in
 > Emacs very frequently.

In some cases I would prefer to have uniform format of numbers and dates
despite alternating language in the buffer, e.g. for my private notes.

 > Here's a trivial example:
 >
 >     (insert (downcase (buffer-substring POS1 POS2)))
 >
 > Contrast with
 >
 >     (insert (downcase "FOO"))

Either `set-text-properties' should be called on "FOO" before passing it 
to `downcase' or `locale-downcase' with LOCALE first argument should be 
added. Moreover, such `locale-downcase' function may be used to 
implement higher level functions working with implicit locales.  LOCALE 
may assume some hierarchy with user overrides for particular call, text 
properties, buffer variables, global settings.

 > Yes: what we have already in Emacs.  That covers a lot of
 > the same Unicode turf that ICU handles, because we import
 > and use the same Unicode files and tables.

There are plenty of xml files in cldr-common-39.0.zip 
(common/main/*.xml) https://www.unicode.org/Public/cldr/39/ in addition 
to Unicode data in Emacs sources.  They include rules for number 
formatting https://unicode.org/reports/tr35/tr35-numbers.html
Of course, human-style number formatting, currencies, financial style, 
etc. may be discarded and implementation may be limited to grouping and 
decimal separators (leaving other features to further requests).  There 
is newlocale(3) function in glibc to obtain minimal subset of 
properties. I am not familiar with other platforms.


* Re: CSV parsing and other issues (Re: LC_NUMERIC)
  2021-06-11 16:58               ` Maxim Nikulin
@ 2021-06-11 18:04                 ` Eli Zaretskii
  2021-06-14 16:38                   ` Maxim Nikulin
  0 siblings, 1 reply; 33+ messages in thread
From: Eli Zaretskii @ 2021-06-11 18:04 UTC (permalink / raw)
  To: Maxim Nikulin; +Cc: boruch_baum, emacs-devel

> From: Maxim Nikulin <manikulin@gmail.com>
> Date: Fri, 11 Jun 2021 23:58:24 +0700
> Cc: boruch_baum@gmx.com
> 
> On 10/06/2021 23:57, Eli Zaretskii wrote:
>  >> From: Maxim Nikulin Date: Thu, 10 Jun 2021 23:28:59 +0700
>  >
>  > For processing CSV, if there's a need to know whether the
>  > locale uses the comma as a decimal separator, we could
>  > indeed extend locale-info.  But such an extension is almost
>  > trivial and doesn't even touch on the significant problems
>  > in the rest of the discussion.
>  >
> 
> You forgot `setlocale(LC_NUMERIC, "C")', didn't you?

No, I didn't.  Adding a call to setlocale to locale-info, even if we
want to add an argument for the caller to control the locale, is
trivial.

>  > Here's a trivial example:
>  >
>  >     (insert (downcase (buffer-substring POS1 POS2)))
>  >
>  > Contrast with
>  >
>  >     (insert (downcase "FOO"))
> 
> Either `set-text-properties' should be called on "FOO" before passing it 
> to `downcase'

Which property will help here?  We don't have such properties.  They
need to be designed and implemented.

> or `locale-downcase' with LOCALE first argument should be 
> added.

How would you implement locale-downcase?  Are you familiar with how
Emacs case tables work?

And even if we had locale-downcase, which locale would you pass to it
in any given use case?

Please note that I'm not saying these issues cannot be solved -- they
can.  I'm saying that designing them requires non-trivial thought,
something we didn't yet do.


* Re: CSV parsing and other issues (Re: LC_NUMERIC)
  2021-06-11 14:10                         ` Eli Zaretskii
@ 2021-06-11 18:52                           ` Filipp Gunbin
  2021-06-11 19:34                             ` Eli Zaretskii
  0 siblings, 1 reply; 33+ messages in thread
From: Filipp Gunbin @ 2021-06-11 18:52 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: manikulin, boruch_baum, emacs-devel

On 11/06/2021 17:10 +0300, Eli Zaretskii wrote:

>> From: Filipp Gunbin <fgunbin@fastmail.fm>
>> Cc: Boruch Baum <boruch_baum@gmx.com>,  manikulin@gmail.com,
>>   emacs-devel@gnu.org
>> Date: Fri, 11 Jun 2021 16:56:34 +0300
>>
>> On 10/06/2021 22:23 +0300, Eli Zaretskii wrote:
>>
>> > I don't think it's TRT for Emacs to expose locale-dependent features
>> > that cannot be controlled from Lisp, sorry.  We need to find a better
>> > way.  For example, there could be a Lisp variable that specifies the
>> > group separator character, and then 'format' could use that character
>> > when the format spec includes %'.  Which means we'd need to implement
>> > that in our own code; patches welcome.
>>
>> Maybe an alternative set of specifiers, which output data in
>> locale-specific format.  Then a single variable to let-bound around
>> format, which instructs what locale to use.  Very simple...
>
> Sorry, I don't think I understand what you propose.  Please elaborate
> on the "alternative set of specifiers, which output data in
> locale-specific format".

I mean that for every specifier which could be affected by locale (but
isn't), there could be additional specifier, which takes locale into
account.  Less awkward, there could be an explicit modifier which says
"use locale for this specifier in format".  Something like `O' or `E'
modifier in "format-time-string".

This way only given format call is affected, without surprises somewhere
below in the call stack.

Then, a locale to use could be let-bound around this format call, thus
overriding the default which came from env vars or from somewhere else.

Filipp




* Re: CSV parsing and other issues (Re: LC_NUMERIC)
  2021-06-11 18:52                           ` Filipp Gunbin
@ 2021-06-11 19:34                             ` Eli Zaretskii
  0 siblings, 0 replies; 33+ messages in thread
From: Eli Zaretskii @ 2021-06-11 19:34 UTC (permalink / raw)
  To: Filipp Gunbin; +Cc: manikulin, boruch_baum, emacs-devel

> From: Filipp Gunbin <fgunbin@fastmail.fm>
> Cc: boruch_baum@gmx.com,  manikulin@gmail.com,  emacs-devel@gnu.org
> Date: Fri, 11 Jun 2021 21:52:57 +0300
> 
> I mean that for every specifier which could be affected by locale (but
> isn't), there could be additional specifier, which takes locale into
> account.  Less awkward, there could be an explicit modifier which says
> "use locale for this specifier in format".  Something like `O' or `E'
> modifier in "format-time-string".

That could work, but if we rely on libc functions for the
locale-dependent behavior, it could be slow, because switching a
locale could be expensive.
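
For comparison, a minimal sketch of grouping done in our own code with
an explicitly supplied separator, so that no locale switch is involved.
The helper format_grouped is hypothetical, not an existing Emacs
function, and it assumes groups of three digits and a single-byte
separator:

```c
#include <assert.h>
#include <stdio.h>

/* Write VALUE into BUF with SEP inserted between groups of three
   digits.  Returns 0 on success, -1 if BUF is too small.  */
int format_grouped(long long value, char sep, char *buf, size_t size) {
    char digits[32];
    /* Plain "%lld" without the ' flag does not depend on the locale.  */
    int len = snprintf(digits, sizeof digits, "%lld", value);
    assert(len > 0 && (size_t)len < sizeof digits);
    int start = (digits[0] == '-') ? 1 : 0;     /* skip the sign */
    int ndigits = len - start;
    size_t needed = (size_t)len + (size_t)((ndigits - 1) / 3) + 1;
    if (needed > size)
        return -1;
    char *p = buf;
    if (start)
        *p++ = '-';
    for (int i = 0; i < ndigits; i++) {
        /* A separator precedes every digit that has a multiple of
           three digits left, itself included.  */
        if (i > 0 && (ndigits - i) % 3 == 0)
            *p++ = sep;
        *p++ = digits[start + i];
    }
    *p = '\0';
    return 0;
}
```

Passing the separator as an argument keeps the effect limited to the
one call, which is the property asked for above.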




* Re: CSV parsing and other issues (Re: LC_NUMERIC)
  2021-06-10 21:10             ` Stefan Monnier
@ 2021-06-12 14:41               ` Maxim Nikulin
  0 siblings, 0 replies; 33+ messages in thread
From: Maxim Nikulin @ 2021-06-12 14:41 UTC (permalink / raw)
  To: emacs-devel

On 11/06/2021 04:10, Stefan Monnier wrote:
 >> There are plenty of CSV dialects. If decimal separator is
 >> "," then office software uses ";" instead of comma as cell
 >> (field) separator.
 >
 > But there's no reason to presume that a given CSV file was
 > generated in the same locale as the one we're currently
 > using.
 >
 > So the locale could be one ingredient in the machinery used
 > to guess which separator was used, but I'm not sure it would
 > be of much help.

You are right. My expectation is still that ";" is mostly used in 
locales with comma as decimal separator, and in such cases it must be 
tried with higher priority, because records may contain plenty of 
both characters.

     1,2;3,45;56,789
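
A rough sketch of that priority rule, under the assumption that only
";" and "," are candidates. The function guess_field_separator is
hypothetical and is not taken from the Org patches:

```c
#include <assert.h>
#include <stddef.h>

/* Guess whether LINE uses ';' or ',' as field separator.  DECIMAL_SEP
   is the decimal separator of the locale the file is assumed to come
   from.  */
char guess_field_separator(const char *line, char decimal_sep) {
    assert(line != NULL);
    size_t commas = 0, semicolons = 0;
    for (; *line != '\0'; line++) {
        if (*line == ',')
            commas++;
        else if (*line == ';')
            semicolons++;
    }
    /* With ',' as decimal separator the commas are likely to belong
       to numbers, so any ';' takes priority.  */
    if (decimal_sep == ',' && semicolons > 0)
        return ';';
    /* Otherwise fall back to plain frequency.  */
    return semicolons > commas ? ';' : ',';
}
```

For the record above a pure frequency count picks "," (three commas
against two semicolons), while the locale hint yields ";".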

Originally the question was raised precisely in the context of an attempt 
to improve guessing of the separator:
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=47885 The patches have
other problems, however. Advanced options for table import are likely 
more suitable e.g. for csv-mode and may become an unnecessary burden in 
org-mode (especially if kill-yank would work well in both directions).

Certainly users should have opportunity to explicitly specify the 
dialect of the files they are going to import.

 > [ BTW, I'll take the opportunity to advocate for the use of
 >   TSV instead, which is slightly less ill-defined.  ]

In the real world one often does not have full control of the file formats 
one has to deal with. In simple cases I can use space separated columns of 
numbers having fixed width. On the other hand, downloaded bank statements 
are precisely CSV with ";" as delimiter and in a legacy windows 8-bit 
encoding (and such files have a kind of header with varying column number 
distinct from the following table).

So the ability to get the decimal separator for the current locale may 
slightly improve user experience with import of CSV files, at least in 
Org mode. However it is just one aspect of support for locale-aware 
number formats in Emacs.


* Re: CSV parsing and other issues (Re: LC_NUMERIC)
  2021-06-11 18:04                 ` Eli Zaretskii
@ 2021-06-14 16:38                   ` Maxim Nikulin
  2021-06-14 17:19                     ` Eli Zaretskii
  0 siblings, 1 reply; 33+ messages in thread
From: Maxim Nikulin @ 2021-06-14 16:38 UTC (permalink / raw)
  To: emacs-devel

On 12/06/2021 01:04, Eli Zaretskii wrote:
 >> From: Maxim Nikulin Date: Fri, 11 Jun 2021 23:58:24 +0700
>> On 10/06/2021 23:57, Eli Zaretskii wrote:
>>  >> From: Maxim Nikulin Date: Thu, 10 Jun 2021 23:28:59 +0700
>>  >
>>  > For processing CSV, if there's a need to know whether the
>>  > locale uses the comma as a decimal separator, we could
>>  > indeed extend locale-info.  But such an extension is almost
>>  > trivial and doesn't even touch on the significant problems
>>  > in the rest of the discussion.
>>
>> You forgot `setlocale(LC_NUMERIC, "C")', didn't you?
> 
> No, I didn't.  Adding a call to setlocale to locale-info, even if we
> want to add an argument for the caller to control the locale, is
> trivial.

I would avoid such manipulations and the reason is not efficiency of 
a particular implementation. Locale is not thread local, so changing it in 
a *getter* is a source of rare but really obscure, hardly reproducible 
problems. I do not like such output

1234.567890
1234,567890
1234.567890

of the following program changing locale in a parallel thread

   #include <locale.h>
   #include <pthread.h>
   #include <stdio.h>
   #include <time.h>

   #define DELAY_NS 40000000

   /* Prints the same number three times: before, during and after
      the window in which main() has switched LC_NUMERIC.  */
   void* other_thread(void *arg) {
           (void)arg;
           struct timespec delay = { 0, DELAY_NS/2 };
           nanosleep(&delay, NULL);
           printf("%f\n", 1234.56789);
           delay.tv_nsec = DELAY_NS;
           nanosleep(&delay, NULL);
           printf("%f\n", 1234.56789);
           nanosleep(&delay, NULL);
           printf("%f\n", 1234.56789);
           return NULL;
   }

   int main(void) {
           setlocale(LC_NUMERIC, "C");
           pthread_t thread_id;
           pthread_create(&thread_id, NULL, &other_thread, NULL);
           struct timespec delay = { 0, DELAY_NS };
           nanosleep(&delay, NULL);
           /* Temporary switch, as a getter would do it.  */
           setlocale(LC_NUMERIC, "");
           nanosleep(&delay, NULL);
           /* Restore the value the rest of the program expects.  */
           setlocale(LC_NUMERIC, "C");
           void *res;
           pthread_join(thread_id, &res);
           return 0;
   }

Explicit locale objects decoupled from application-wide global 
preferences are safer and more flexible.
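
A minimal sketch of what I mean, assuming POSIX.1-2008 newlocale(3) and
uselocale(3). The wrapper format_double_in_locale is a hypothetical
name. uselocale() changes the locale of the calling thread only, so a
printf in another thread is not affected:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>
#include <stdio.h>

/* Format VALUE with "%f" using the LC_NUMERIC rules of the locale
   NAME.  Returns 0 on success, -1 if the locale cannot be created or
   BUF is too small.  setlocale() is never called.  */
int format_double_in_locale(const char *name, double value,
                            char *buf, size_t size) {
    assert(name != NULL && buf != NULL && size > 0);
    locale_t loc = newlocale(LC_NUMERIC_MASK, name, (locale_t)0);
    if (loc == (locale_t)0)
        return -1;
    locale_t old = uselocale(loc);      /* this thread only */
    int len = snprintf(buf, size, "%f", value);
    uselocale(old);                     /* restore before freeing */
    freelocale(loc);
    return (len >= 0 && (size_t)len < size) ? 0 : -1;
}
```

A forgotten uselocale(old) would still be a bug, but its effect stays
confined to one thread instead of the whole process.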

>> > Here's a trivial example:
>>  >
>>  >     (insert (downcase (buffer-substring POS1 POS2)))
>>  >
>>  > Contrast with
>>  >
>>  >     (insert (downcase "FOO"))
>>
>> Either `set-text-properties' should be called on "FOO" before passing it 
>> to `downcase'
> 
> Which property will help here? we don't have such properties.  they
> need to be designed and implemented.

Let's name it "locale". Its value is some object that represents either 
a "solid" locale such as de_DE or combined LC_NUMERIC=en_GB + 
LC_TIME=de_DE + default fr_FR. Data required for particular operations 
may be loaded on demand.
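
At the libc level such a combined object can be sketched with
newlocale(3), which accepts an existing locale object as a base and
overrides selected categories. The function compose_locale is a
hypothetical illustration and assumes a POSIX.1-2008 libc:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>
#include <stddef.h>

/* Build a locale object from DEFAULT_NAME, overriding LC_NUMERIC and
   LC_TIME when the corresponding name is not NULL.  Returns
   (locale_t)0 on failure.  The caller releases the result with
   freelocale().  */
locale_t compose_locale(const char *default_name,
                        const char *numeric_name,
                        const char *time_name) {
    assert(default_name != NULL);
    locale_t loc = newlocale(LC_ALL_MASK, default_name, (locale_t)0);
    if (loc == (locale_t)0)
        return loc;
    const char *names[] = { numeric_name, time_name };
    const int masks[] = { LC_NUMERIC_MASK, LC_TIME_MASK };
    for (int i = 0; i < 2; i++) {
        if (names[i] == NULL)
            continue;
        /* On success newlocale() takes over LOC and returns the
           modified object; on failure LOC is left valid.  */
        locale_t next = newlocale(masks[i], names[i], loc);
        if (next == (locale_t)0) {
            freelocale(loc);
            return next;
        }
        loc = next;
    }
    return loc;
}
```

The combination from the paragraph above would then be
compose_locale("fr_FR", "en_GB", "de_DE"), provided these locales are
installed (possibly under names with a codeset suffix).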

>> or `locale-downcase' with LOCALE first argument should be 
>> added.
> 
> How would you implement locale-downcase?  Are you familiar with how
> Emacs case tables work?

No, I am not familiar with Emacs internals dealing with case conversion. 
I already wrote I am even unaware how to properly handle Turkish.  For 
the scripts I am familiar with, it is enough to have default table for 
normalizing and conversion. I can admit that sometimes conversion may 
depend on language and the language can not be determined from code 
point. In such cases I expect additional override table that has higher 
priority than the default one.

 > And even if we had locale-downcase, which locale would you
 > pass to it in any given use case?

I already mentioned responsibility chain: explicit value or set of 
overrides passed by user, text property for particular span of 
characters, buffer-local variables, global environment variables. Locale 
may be instantiated from its name "it_IT". Convenience functions to 
obtain locale at point likely will be useful as well.  (Actually I am 
assuming number parsing-formatting rather than case conversion.)
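
The chain can be sketched as a lookup that returns the first level that
is set. Everything here is hypothetical (the struct, its field names and
the fallback to "C"); it only illustrates the order of precedence:

```c
#include <assert.h>
#include <stddef.h>

/* Candidate locale names, from most to least specific.  NULL means
   "not set at this level".  */
struct locale_chain {
    const char *explicit_override;  /* passed by the caller */
    const char *text_property;      /* attached to a span of characters */
    const char *buffer_local;       /* per-buffer setting */
    const char *global_default;     /* from the environment */
};

/* Return the most specific locale name that is set, or "C" if no
   level provides one.  */
const char *resolve_locale(const struct locale_chain *chain) {
    assert(chain != NULL);
    const char *candidates[] = {
        chain->explicit_override,
        chain->text_property,
        chain->buffer_local,
        chain->global_default,
    };
    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++)
        if (candidates[i] != NULL)
            return candidates[i];
    return "C";
}
```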





* Re: CSV parsing and other issues (Re: LC_NUMERIC)
  2021-06-14 16:38                   ` Maxim Nikulin
@ 2021-06-14 17:19                     ` Eli Zaretskii
  2021-06-16 17:27                       ` Maxim Nikulin
  0 siblings, 1 reply; 33+ messages in thread
From: Eli Zaretskii @ 2021-06-14 17:19 UTC (permalink / raw)
  To: Maxim Nikulin; +Cc: emacs-devel

> From: Maxim Nikulin <manikulin@gmail.com>
> Date: Mon, 14 Jun 2021 23:38:19 +0700
> 
> >> You forgot `setlocale(LC_NUMERIC, "C")', didn't you?
> > 
> > No, I didn't.  Adding a call to setlocale to locale-info, even if we
> > want to add an argument for the caller to control the locale, is
> > trivial.
> 
> I would avoid such manipulations and the reason is not efficiency of 
> particular implementation.

But we already do that in locale-info, for locale categories other
than LC_NUMERIC.

> >> > Here's a trivial example:
> >>  >
> >>  >     (insert (downcase (buffer-substring POS1 POS2)))
> >>  >
> >>  > Contrast with
> >>  >
> >>  >     (insert (downcase "FOO"))
> >>
> >> Either `set-text-properties' should be called on "FOO" before passing it 
> >> to `downcase'
> > 
> > Which property will help here? we don't have such properties.  they
> > need to be designed and implemented.
> Let's name it "locale". Its value is some object that represents either 
> a "solid" locale such as de_DE or combined LC_NUMERIC=en_GB + 
> LC_TIME=de_DE + default fr_FR. Data required for particular operations 
> may be loaded on demand.

How do you associate such an object with text of a buffer or a string
such that different parts of the text could have different "locales"
(as required for a multi-lingual editor such as Emacs)?

> > How would you implement locale-downcase?  Are you familiar with how
> > Emacs case tables work?
> 
> No, I am not familiar with Emacs internals dealing with case conversion. 
> I already wrote I am even unaware how to properly handle Turkish.  For 
> the scripts I am familiar with, it is enough to have default table for 
> normalizing and conversion. I can admit that sometimes conversion may 
> depend on language and the language can not be determined from code 
> point. In such cases I expect additional override table that has higher 
> priority than the default one.
> 
>  > And even if we had locale-downcase, which locale would you
>  > pass to it in any given use case?
> 
> I already mentioned responsibility chain: explicit value or set of 
> overrides passed by user, text property for particular span of 
> characters, buffer-local variables, global environment variables. Locale 
> may be instantiated from its name "it_IT". Convenience functions to 
> obtain locale at point likely will be useful as well.  (Actually I am 
> assuming number parsing-formatting rather than case conversion.)

What you describe doesn't exist, not even in its design stage.  We are
back where we started: I said at the very beginning that this
infrastructure is missing.  It is futile to discuss solutions which
rely on infrastructure that doesn't exist.




* Re: CSV parsing and other issues (Re: LC_NUMERIC)
  2021-06-14 17:19                     ` Eli Zaretskii
@ 2021-06-16 17:27                       ` Maxim Nikulin
  2021-06-16 17:36                         ` Eli Zaretskii
  0 siblings, 1 reply; 33+ messages in thread
From: Maxim Nikulin @ 2021-06-16 17:27 UTC (permalink / raw)
  To: emacs-devel

On 15/06/2021 00:19, Eli Zaretskii wrote:
 >> From: Maxim Nikulin Date: Mon, 14 Jun 2021 23:38:19 +0700
 >>>> You forgot `setlocale(LC_NUMERIC, "C")', didn't you?
 >>>
 >>> No, I didn't.  Adding a call to setlocale to locale-info, even if we
 >>> want to add an argument for the caller to control the locale, is
 >>> trivial.
 >>
 >> I would avoid such manipulations and the reason is not efficiency of
 >> particular implementation.
 >
 > But we already do that in locale-info, for locale categories other
 > than LC_NUMERIC.

I have seen such a call for collation. It may have been reasonable in the 
past (e.g. as quick plumbing), but I think such things should be avoided 
for the sake of thread safety. Moreover, you are complaining that 
implementations other than glibc are inefficient.

Proper instruments for concurrency and parallel execution may alleviate
issues like the following:
https://lists.gnu.org/archive/html/emacs-devel/2021-05/msg01297.html
 > I hear quite a few people run at least two instances of
 > Emacs, for example if they don't want Gnus fetching new
 > articles and email to freeze the interactive session for
 > prolonged times.

>>> Which property will help here? we don't have such properties.  they
>>> need to be designed and implemented.
>> Let's name it "locale". Its value is some object that represents either 
>> a "solid" locale such as de_DE or combined LC_NUMERIC=en_GB + 
>> LC_TIME=de_DE + default fr_FR. Data required for particular operations 
>> may be loaded on demand.
> 
> How do you associate such an object with text of a buffer or a string
> such that different parts of the text could have different "locales"
> (as required for a multi-lingual editor such as Emacs)?

I already suggested some variants and you did not argue.

Technically it can be done through `set-text-properties'. If there are 
no such text properties then it may be assumed that no fine-grained tuning 
is required, so buffer-local variables or the global environment are used. 
Language may be guessed from code points of characters. Particular modes 
may either inhibit localization for program code or extract necessary 
information from HTML lang attributes, arguments of the LaTeX 
\foreignlanguage macro, etc.

In my opinion, Emacs is not really multi-lingual yet due to limitations 
and inconveniences. Some other software has demonstrated significantly 
greater progress during the last decade. Maybe achieving the current level 
was so painful that you prefer to avoid touching the related code for any 
reason, not to speak of various improvements.

>  > And even if we had locale-downcase, which locale would you
>  > pass to it in any given use case?
> 
> I already mentioned responsibility chain: explicit value or set of 
> overrides passed by user, text property for particular span of 
> characters, buffer-local variables, global environment variables. Locale 
> may be instantiated from its name "it_IT". Convenience functions to 
> obtain locale at point likely will be useful as well.  (Actually I am 
> assuming number parsing-formatting rather than case conversion.)

I am aware that such features do not exist yet. Only libc is available, 
but we consider it inappropriate (you due to performance issues, me 
due to thread safety and possible bugs due to missed calls restoring the 
old state). You are against using CLDR detailed info for locales through 
ICU because an alternative implementation of Unicode character tables 
(another part of ICU) already exists in Emacs. At the same time you are 
refusing any attempts to discuss possible extensions from any side: low 
level base functions taking locale as an explicit argument, or high level 
requirements for what interface can be useful to "implicitly" derive the 
locale of a particular part of text (actually text prepared for 
intelligent handling of locales).

Certainly with the position "locale-aware formatting can not be implemented 
because Emacs has no necessary infrastructure and such feature is needed 
by only a handful of users" there is no way to improve anything.


* Re: CSV parsing and other issues (Re: LC_NUMERIC)
  2021-06-16 17:27                       ` Maxim Nikulin
@ 2021-06-16 17:36                         ` Eli Zaretskii
  0 siblings, 0 replies; 33+ messages in thread
From: Eli Zaretskii @ 2021-06-16 17:36 UTC (permalink / raw)
  To: Maxim Nikulin; +Cc: emacs-devel

> From: Maxim Nikulin <manikulin@gmail.com>
> Date: Thu, 17 Jun 2021 00:27:49 +0700
> 
> Certainly with the position "locale-aware formatting can not be implemented 
> because Emacs has no necessary infrastructure and such feature is needed 
> by only a handful of users" there is no way to improve anything.

Please see how many changes I committed over the years to Emacs, some
of them quite revolutionary (bidirectional editing comes to mind), and
I'm sure you will realize that the above is a gross misunderstanding
of what I meant.




end of thread, other threads:[~2021-06-16 17:36 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
2021-06-06 23:36 CSV parsing and other issues (Re: LC_NUMERIC) Boruch Baum
2021-06-07 12:28 ` Eli Zaretskii
2021-06-08  0:45   ` Boruch Baum
2021-06-08  2:35     ` Eli Zaretskii
2021-06-08 15:35       ` Stefan Monnier
2021-06-08 16:35       ` Maxim Nikulin
2021-06-08 18:52         ` Eli Zaretskii
2021-06-10 16:28           ` Maxim Nikulin
2021-06-10 16:57             ` Eli Zaretskii
2021-06-10 18:01               ` Boruch Baum
2021-06-10 18:50                 ` Eli Zaretskii
2021-06-10 19:04                   ` Boruch Baum
2021-06-10 19:23                     ` Eli Zaretskii
2021-06-10 20:20                       ` Boruch Baum
2021-06-11  6:19                         ` Eli Zaretskii
2021-06-11  8:18                           ` Boruch Baum
2021-06-11 16:51                           ` Maxim Nikulin
2021-06-11 13:56                       ` Filipp Gunbin
2021-06-11 14:10                         ` Eli Zaretskii
2021-06-11 18:52                           ` Filipp Gunbin
2021-06-11 19:34                             ` Eli Zaretskii
2021-06-11 16:58               ` Maxim Nikulin
2021-06-11 18:04                 ` Eli Zaretskii
2021-06-14 16:38                   ` Maxim Nikulin
2021-06-14 17:19                     ` Eli Zaretskii
2021-06-16 17:27                       ` Maxim Nikulin
2021-06-16 17:36                         ` Eli Zaretskii
2021-06-10 21:10             ` Stefan Monnier
2021-06-12 14:41               ` Maxim Nikulin
  -- strict thread matches above, loose matches on Subject: below --
2021-06-02 18:54 LC_NUMERIC formatting [FEATURE REQUEST] Boruch Baum
2021-06-03 14:44 ` CSV parsing and other issues (Re: LC_NUMERIC) Maxim Nikulin
2021-06-03 15:01   ` Eli Zaretskii
2021-06-04 16:31     ` Maxim Nikulin
2021-06-04 19:17       ` Eli Zaretskii
