unofficial mirror of emacs-devel@gnu.org 
* per-buffer language environments
@ 2010-12-11 15:25 Werner LEMBERG
  2010-12-11 19:00 ` Eli Zaretskii
  0 siblings, 1 reply; 27+ messages in thread
From: Werner LEMBERG @ 2010-12-11 15:25 UTC (permalink / raw)
  To: emacs-devel


According to the documentation, set-language-environment acts globally.
However, at least for CJK documents, it would be very helpful if this
could be controlled on a per-buffer basis[1].  For example, on my
GNU/Linux box with latin-1 as the default language environment, while
editing some Japanese text, I see the katakana glyphs from the
`simsun' font which look particularly ugly.  Assuming that I edit a
Chinese text in parallel, a Japanese font from a Japanese language
environment would miss most of the Chinese characters, causing
fallback character substitution which looks ugly again...


   Werner


[1] An extension of this would be enriched text which supports
multiple language environments within a buffer.




* Re: per-buffer language environments
  2010-12-11 15:25 per-buffer language environments Werner LEMBERG
@ 2010-12-11 19:00 ` Eli Zaretskii
  2010-12-12  6:25   ` Werner LEMBERG
  0 siblings, 1 reply; 27+ messages in thread
From: Eli Zaretskii @ 2010-12-11 19:00 UTC (permalink / raw)
  To: Werner LEMBERG; +Cc: emacs-devel

> Date: Sat, 11 Dec 2010 16:25:03 +0100 (CET)
> From: Werner LEMBERG <wl@gnu.org>
> 
> According to the documentation, set-language-environment acts globally.
> However, at least for CJK documents, it would be very helpful if this
> could be controlled on a per-buffer basis[1].  For example, on my
> GNU/Linux box with latin-1 as the default language environment, while
> editing some Japanese text, I see the katakana glyphs from the
> `simsun' font which look particularly ugly.  Assuming that I edit a
> Chinese text in parallel, a Japanese font from a Japanese language
> environment would miss most of the Chinese characters, causing
> fallback character substitution which looks ugly again...

But font selection is just one part of the language environment.  Are
there any other aspects of the language environment that would make
sense to have on per-buffer basis?

If font selection is the only part, then doesn't the fontset
definition feature (see "(emacs)Defining Fontsets") do what you want?




* Re: per-buffer language environments
  2010-12-11 19:00 ` Eli Zaretskii
@ 2010-12-12  6:25   ` Werner LEMBERG
  2010-12-13  7:56     ` Kenichi Handa
  0 siblings, 1 reply; 27+ messages in thread
From: Werner LEMBERG @ 2010-12-12  6:25 UTC (permalink / raw)
  To: eliz; +Cc: emacs-devel


>> According to the documentation, set-language-environment acts
>> globally.  However, at least for CJK documents, it would be very
>> helpful if this could be controlled on a per-buffer basis[1].  For
>> example, on my GNU/Linux box with latin-1 as the default language
>> environment, while editing some Japanese text, I see the katakana
>> glyphs from the `simsun' font which look particularly ugly.
>> Assuming that I edit a Chinese text in parallel, a Japanese font
>> from a Japanese language environment would miss most of the Chinese
>> characters, causing fallback character substitution which looks
>> ugly again...
> 
> But font selection is just one part of the language environment.  Are
> there any other aspects of the language environment that would make
> sense to have on per-buffer basis?

For CJK language environments, I'm not aware of other aspects, but
probably Ken'ichi-san knows more.

> If font selection is the only part, then doesn't the fontset
> definition feature (see "(emacs)Defining Fontsets") do what you
> want?

If you tell me how to do that, this would be fine.  Note that the
`CHARSET:FONT' feature within a fontset is not appropriate since it
helps only if there are different charsets.  However, in the discussed
problem all buffer encodings are using Unicode.

On the other hand, I think it is not the right solution to specify a
fontset as a file variable.  I really want to say that file `foo'
contains Chinese; Emacs parses this information somehow and then
forwards this information to the font selection engine.


    Werner




* Re: per-buffer language environments
  2010-12-12  6:25   ` Werner LEMBERG
@ 2010-12-13  7:56     ` Kenichi Handa
  2010-12-13  9:27       ` Werner LEMBERG
                         ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Kenichi Handa @ 2010-12-13  7:56 UTC (permalink / raw)
  To: Werner LEMBERG; +Cc: eliz, emacs-devel

In article <20101212.072550.527160732.wl@gnu.org>, Werner LEMBERG <wl@gnu.org> writes:
> > But font selection is just one part of the language environment.  Are
> > there any other aspects of the language environment that would make
> > sense to have on per-buffer basis?

> For CJK language environments, I'm not aware of other aspects, but
> probably Ken'ichi-san knows more.

* Which input method to turn on by C-\.

* Which coding system to use on writing when the current
  buffer contains a character that can't be encoded by
  buffer-file-coding-system.

* Which coding systems have higher priority when inserting a
  file in the current buffer.

* The locale of the program invoked by shell-command-on-region.

---
Kenichi Handa
handa@m17n.org




* Re: per-buffer language environments
  2010-12-13  7:56     ` Kenichi Handa
@ 2010-12-13  9:27       ` Werner LEMBERG
  2010-12-13 10:59         ` Kenichi Handa
  2010-12-13 11:47       ` Eli Zaretskii
  2010-12-18 17:03       ` Per Starbäck
  2 siblings, 1 reply; 27+ messages in thread
From: Werner LEMBERG @ 2010-12-13  9:27 UTC (permalink / raw)
  To: handa; +Cc: eliz, emacs-devel

>> > Are there any other aspects of the language environment that
>> > would make sense to have on per-buffer basis?
> 
>> For CJK language environments, I'm not aware of other aspects, but
>> probably Ken'ichi-san knows more.
> 
> * Which input method to turn on by C-\.
> 
> * Which coding system to use on writing when the current
>   buffer contains a character that can't be encoded by
>   buffer-file-coding-system.
> 
> * Which coding systems have higher priority when inserting a
>   file in the current buffer.
> 
> * The locale of the program invoked by shell-command-on-region.

Thanks for the list.  IMHO, this adds more arguments for
per-buffer language environments.


    Werner




* Re: per-buffer language environments
  2010-12-13  9:27       ` Werner LEMBERG
@ 2010-12-13 10:59         ` Kenichi Handa
  2010-12-13 12:15           ` Werner LEMBERG
  0 siblings, 1 reply; 27+ messages in thread
From: Kenichi Handa @ 2010-12-13 10:59 UTC (permalink / raw)
  To: Werner LEMBERG; +Cc: eliz, emacs-devel

In article <20101213.102709.409649500.wl@gnu.org>, Werner LEMBERG <wl@gnu.org> writes:
> > * Which input method to turn on by C-\.
> > 
> > * Which coding system to use on writing when the current
> >   buffer contains a character that can't be encoded by
> >   buffer-file-coding-system.
> > 
> > * Which coding systems have higher priority when inserting a
> >   file in the current buffer.
> > 
> > * The locale of the program invoked by shell-command-on-region.

> Thanks for the list.  IMHO, this adds more arguments for
> per-buffer language environments.

Yes, but deciding exactly how they should work is not that
straightforward.  For instance, how should the command
prefer-coding-system work when invoked in a buffer
for which you locally changed the language environment?
Should it change the preference globally, or for the current
buffer only, or for all buffers that have the same language
environment?

---
Kenichi Handa
handa@m17n.org




* Re: per-buffer language environments
  2010-12-13  7:56     ` Kenichi Handa
  2010-12-13  9:27       ` Werner LEMBERG
@ 2010-12-13 11:47       ` Eli Zaretskii
  2010-12-14 11:38         ` Stephen J. Turnbull
  2010-12-18 17:03       ` Per Starbäck
  2 siblings, 1 reply; 27+ messages in thread
From: Eli Zaretskii @ 2010-12-13 11:47 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: emacs-devel

> From: Kenichi Handa <handa@m17n.org>
> Cc: eliz@gnu.org, emacs-devel@gnu.org
> Date: Mon, 13 Dec 2010 16:56:08 +0900
> 
> In article <20101212.072550.527160732.wl@gnu.org>, Werner LEMBERG <wl@gnu.org> writes:
> > > But font selection is just one part of the language environment.  Are
> > > there any other aspects of the language environment that would make
> > > sense to have on per-buffer basis?
> 
> > For CJK language environments, I'm not aware of other aspects, but
> > probably Ken'ichi-san knows more.
> 
> * Which input method to turn on by C-\.
> 
> * Which coding system to use on writing when the current
>   buffer contains a character that can't be encoded by
>   buffer-file-coding-system.
> 
> * Which coding systems have higher priority when inserting a
>   file in the current buffer.

I could understand how the font selection and the default input method
are related to the language, but what do encodings have to do with
that?  The preferred encoding is generally an attribute of a locale,
not of a language.  The fact that we mix them is because Emacs had
language environments before it had locale environments.

It's high time to make the distinction, IMO.  The language environment
should be derived from the language(s) of the text we are editing, and
is internal to Emacs, in the sense that it is defined by internal
Emacs logic for its purposes.  The locale environment is derived from
the environment outside Emacs, and expresses the preferences of the
outside world.

> * The locale of the program invoked by shell-command-on-region.

This is _definitely_ not related to the language.  It may be the case
that to force an external program DTRT for a certain language, you
need to set some LC_* variable in the environment of that program, but
that's an implementation detail, IMO.
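
The LC_* point can be made concrete outside Emacs.  What follows is a
minimal sketch in Python, not Emacs code: it assumes a hypothetical
per-buffer locale string and overrides LC_ALL for the child process
only, leaving the parent's environment untouched.  The helper name
`run_with_locale' and the locale value are illustrative assumptions.

```python
import os
import subprocess
import sys


def run_with_locale(argv, locale_name, input_text=""):
    """Run argv with LC_ALL overridden for the child process only."""
    child_env = dict(os.environ)       # copy; the parent environment is not modified
    child_env["LC_ALL"] = locale_name  # hypothetical per-buffer locale
    result = subprocess.run(
        argv,
        input=input_text,
        capture_output=True,
        text=True,
        env=child_env,
        check=True,
    )
    return result.stdout


# The child simply reports the LC_ALL value it was started with.
child = [sys.executable, "-c", "import os; print(os.environ['LC_ALL'])"]
reported = run_with_locale(child, "ja_JP.UTF-8").strip()
```

Seen this way, the locale is a property of the subprocess call rather
than of the text being edited, which is the distinction drawn above.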





* Re: per-buffer language environments
  2010-12-13 10:59         ` Kenichi Handa
@ 2010-12-13 12:15           ` Werner LEMBERG
  0 siblings, 0 replies; 27+ messages in thread
From: Werner LEMBERG @ 2010-12-13 12:15 UTC (permalink / raw)
  To: handa; +Cc: eliz, emacs-devel


> [...] deciding exactly how they should work is not that
> straightforward.  For instance, how should the command
> prefer-coding-system work when invoked in a buffer for which you
> locally changed the language environment?  Should it change the
> preference globally, or for the current buffer only, or for all
> buffers that have the same language environment?

Perhaps we should start with items which are agreed on, that is, the
possibility to set a language environment buffer-wise so that Emacs
can benefit by looking up the right font in case the buffer encoding
is Unicode.  The same for the default input encoding.


    Werner




* Re: per-buffer language environments
  2010-12-13 11:47       ` Eli Zaretskii
@ 2010-12-14 11:38         ` Stephen J. Turnbull
  2010-12-14 15:14           ` Eli Zaretskii
  0 siblings, 1 reply; 27+ messages in thread
From: Stephen J. Turnbull @ 2010-12-14 11:38 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel, Kenichi Handa

Eli Zaretskii writes:

 > > * Which coding systems have higher priority when inserting a
 > >   file in the current buffer.
 > 
 > I could understand how the font selection and the default input method
 > are related to the language, but what do encodings have to do with
 > that?  The preferred encoding is generally an attribute of a locale,
 > not of a language.

Note the word "insert", which implies "read".  It is certainly true
that a locale may specify an encoding.  However, if the person is
Japanese, they may specify ja_JP.UTF-8 for their locale and strongly
prefer that files be written with that encoding, yet still need to
read files in other encodings.  The locale encoding of UTF-8 is no
help in distinguishing an EUC-JP file from an ISO-8859-1 file, let
alone an EUC-CN file.  OTOH, somebody with a Hebrew language
environment and a locale specifying UTF-8 as the encoding almost
certainly prefers that a file containing 8-bit-set octets inconsistent
with UTF-8 be recognized as ISO-8859-8 rather than EUC-JP, no?

 > The fact that we mix them is because Emacs had
 > language environments before it had locale environments.

What's a "locale environment"?  AFAIK Emacsen use the locale as a
heuristic for determining the language environment unless otherwise
specified, but it seems like you mean something else.





* Re: per-buffer language environments
  2010-12-14 11:38         ` Stephen J. Turnbull
@ 2010-12-14 15:14           ` Eli Zaretskii
  2010-12-15  4:51             ` Stephen J. Turnbull
  0 siblings, 1 reply; 27+ messages in thread
From: Eli Zaretskii @ 2010-12-14 15:14 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: emacs-devel, handa

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: Kenichi Handa <handa@m17n.org>,
>     emacs-devel@gnu.org
> Date: Tue, 14 Dec 2010 20:38:43 +0900
> 
> Eli Zaretskii writes:
> 
>  > > * Which coding systems have higher priority when inserting a
>  > >   file in the current buffer.
>  > 
>  > I could understand how the font selection and the default input method
>  > are related to the language, but what do encodings have to do with
>  > that?  The preferred encoding is generally an attribute of a locale,
>  > not of a language.
> 
> Note the word "insert", which implies "read".  It is certainly true
> that a locale may specify an encoding.  However, if the person is
> Japanese, they may specify ja_JP.UTF-8 for their locale and strongly
> prefer that files be written with that encoding, yet still need to
> read files in other encodings.  The locale encoding of UTF-8 is no
> help in distinguishing an EUC-JP file from an ISO-8859-1 file, let
> alone an EUC-CN file.  OTOH, somebody with a Hebrew language
> environment and a locale specifying UTF-8 as the encoding almost
> certainly prefers that a file containing 8-bit-set octets inconsistent
> with UTF-8 be recognized as ISO-8859-8 rather than EUC-JP, no?

Those are all valid concerns, but they are just the tip of an iceberg.
There's an almost infinite number of combinations of a language and
the preferred encoding, and it's impossible to fold them all, or even
their significant fraction, in a reasonably usable user-level
interface.  We shouldn't even try, IMO; we already have
prefer-coding-system, the coding: cookies, the .dir-locals.el metadata,
etc. to cover the situations where the user knows what encoding should
be preferred/used, even though her language and locale say otherwise.

set-language-environment accepts a single string, which should be a
language name, as its argument.  (There are some "languages" that we
recognize, such as "Chinese-GB18030", which sneak in the encoding as
well, but that's an anomaly, I think, which goes back to when Emacs
didn't have locale environments to express that.  Now that we do, we
could get rid of that, at least in principle.)  Therefore, a language
environment should set the defaults suitable for the language, and
that doesn't include the encoding, or at least does not have to fit
each minor cultural variant of the language.

>  > The fact that we mix them is because Emacs had
>  > language environments before it had locale environments.
> 
> What's a "locale environment"?

See set-locale-environment.

> AFAIK Emacsen use the locale as a heuristic for determining the
> language environment

There's no heuristic involved, AFAIR.  Emacs has a database of
languages _and_encodings_ suitable for the known locale names.
set-locale-environment uses that database to get the language and the
preferred encoding(s), then calls set-language-environment with the
language, and sets the priorities of the encodings according to the
encoding preferences.
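
The lookup described above is a plain table match.  As a rough
illustration only, here is a toy model in Python (not Emacs code); the
table entries and the name `lookup_locale' are assumptions made for the
example and do not reproduce Emacs's actual database.

```python
import re

# Illustrative entries only; these are assumptions, not Emacs's actual table.
LOCALE_TABLE = [
    (re.compile(r"^ja"), "Japanese", "euc-jp"),
    (re.compile(r"^he"), "Hebrew", "iso-8859-8"),
    (re.compile(r"^zh_CN"), "Chinese-GB", "gb2312"),
]


def lookup_locale(locale_name):
    """Return (language, preferred_encoding) for a locale name, or None."""
    # Split "ja_JP.UTF-8" into the name part and an optional codeset.
    name, _, codeset = locale_name.partition(".")
    for pattern, language, default_encoding in LOCALE_TABLE:
        if pattern.match(name):
            # An explicit codeset in the locale name wins over the table default.
            encoding = codeset.lower() if codeset else default_encoding
            return language, encoding
    return None
```

With these sample entries, "ja_JP.UTF-8" maps to Japanese with utf-8
preferred, while a bare "he_IL" falls back to the table's default
encoding for Hebrew.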




* Re: per-buffer language environments
  2010-12-14 15:14           ` Eli Zaretskii
@ 2010-12-15  4:51             ` Stephen J. Turnbull
  2010-12-15  6:47               ` Eli Zaretskii
  0 siblings, 1 reply; 27+ messages in thread
From: Stephen J. Turnbull @ 2010-12-15  4:51 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: handa, emacs-devel

Eli Zaretskii writes:
 > > From: "Stephen J. Turnbull" <stephen@xemacs.org>
 > > Cc: Kenichi Handa <handa@m17n.org>,
 > >     emacs-devel@gnu.org
 > > Date: Tue, 14 Dec 2010 20:38:43 +0900
 > > 
 > > Eli Zaretskii writes:
 > > 
 > >  > > * Which coding systems have higher priority when inserting a
 > >  > >   file in the current buffer.
 > >  > 
 > >  > I could understand how the font selection and the default input method
 > >  > are related to the language, but what do encodings have to do with
 > >  > that?  The preferred encoding is generally an attribute of a locale,
 > >  > not of a language.
 > > 
 > > Note the word "insert", which implies "read".  It is certainly true
 > > that a locale may specify an encoding.  However, if the person is
 > > Japanese, they may specify ja_JP.UTF-8 for their locale and strongly
 > > prefer that files be written with that encoding, yet still need to
 > > read files in other encodings. [more examples snipped]
 > 
 > Those are all valid concerns, but they are just the tip of an
 > iceberg.

No, they *are* the iceberg, at least as far as the autopilot is
concerned.  After that, you *must* ask the user.

 > There's an almost infinite number of combinations of a language and
 > the preferred encoding

Sure, but given a language and the set of encoding features Emacs
knows how to detect *when reading from a stream*, there remains
substantial ambiguity.  Setting the priority list can remove almost
all of that ambiguity, leaving what's left for the user.  That is what
the priority lists are for, and it is a useful feature of the language
environment.  All problems with the language environment that I know
of stem from its global nature applying to all buffers and the
application itself, not from appropriate use in a given buffer.  IOW,
it's just the defects of the POSIX_ME_HARDER locale mirrored into
Emacs itself.

The preferred encoding, OTOH, is a heuristic for the encoding of files
read, and the default for the encoding of files written.

These two are independent in principle, but of course "preferred
encoding for writing" = "highest priority encoding for reading" is a
very valuable heuristic.

 > , and it's impossible to fold them all, or even their significant
 > fraction,

Of course a significant fraction is possible.  That's precisely what
the priority lists have been achieving since the early 1990s.  If your
complaint is that we should do better, "patches welcome" is the only
thing I can think of to say.  But it does a pretty damn good job
already, and buffer-local language environments should cut current
damage by 80% or more; your work is cut out for you.

 > in a reasonably usable user-level interface.  We shouldn't even
 > try, IMO; we already have prefer-coding-system

Huh?  prefer-coding-system has two effects: it promotes a certain
coding-system to highest priority in its category, and it promotes
that category to highest priority in case of ambiguity.  IOW, it's a
user override of the priority setting that comes from the language
environment.  A completely different purpose (handling exceptions)
from the language environment itself (handling the unmarked case).
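
The two effects just described can be modelled in a few lines.  This is
a toy model in Python, not the Emacs implementation; the class name,
the category names, and the default values are all hypothetical.

```python
class CodingPriorities:
    """Toy model: one chosen coding system per category, plus a category order."""

    def __init__(self, category_order, chosen):
        self.category_order = list(category_order)  # highest priority first
        self.chosen = dict(chosen)                  # category -> coding system

    def prefer(self, coding_system, category):
        # Effect 1: the coding system becomes the choice within its category.
        self.chosen[category] = coding_system
        # Effect 2: that category moves to the front of the order.
        if category in self.category_order:
            self.category_order.remove(category)
        self.category_order.insert(0, category)

    def priority_list(self):
        """Coding systems in the order they would be tried."""
        return [self.chosen[c] for c in self.category_order if c in self.chosen]


# Hypothetical defaults, standing in for what a language environment sets.
priorities = CodingPriorities(
    ["utf-8", "euc", "charset"],
    {"utf-8": "utf-8", "euc": "euc-jp", "charset": "iso-8859-1"},
)
priorities.prefer("euc-cn", "euc")
```

After the call, the "euc" category holds euc-cn and is tried first; the
other categories keep their relative order.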

Are you sure you have any idea what you're talking about?  (That's an
honest question; the way you are going, I have to wonder.  If you say
"yes", I'll trust you, but I'd appreciate an explanation of what
you're talking about that refers to real bugs in the current system,
rather than general features that offend your sense of design.)

 > , the coding: cookies, the .dir-locals.el metadata,

Speaking of *my* sense of design, two features that are an offense
against Man and a stench in the nostrils of God.  But I digress.

 > etc. to cover the situations where the user knows what encoding should
 > be preferred/used, even though her language and locale say otherwise.
 > 
 > set-language-environment accepts a single string, which should be a
 > language name, as its argument.  (There are some "languages" that we
 > recognize, such as "Chinese-GB18030", which sneak in the encoding as
 > well, but that's an anomaly, I think, which goes back to when Emacs
 > didn't have locale environments to express that.  Now that we do, we
 > could get rid of that, at least in principle.)  Therefore, a language
 > environment should set the defaults suitable for the language, and
 > that doesn't include the encoding, or at least does not have to fit
 > each minor cultural variant of the language.

That's not what coding priority settings are for.  They are to remove
ambiguities like "we have EUC, but which one?" and "we have
Windows-125x, but which one?" and "since ISO-8859-1 allows all 256
bytes, if we want to give priority to Chinese or Japanese, that had
better come late in the list!"
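
These ambiguities are easy to reproduce.  The sketch below is Python,
not Emacs code, and it assumes a much cruder detector than Emacs has:
it simply returns the first encoding in a priority list that decodes
the bytes without error.  The function name `detect_encoding' is
illustrative.

```python
def detect_encoding(data, priority):
    """Return the first encoding in priority that decodes data without error."""
    for encoding in priority:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None


sample = "日本語".encode("euc_jp")

# "We have EUC, but which one?": the bytes are valid in both, so the list decides.
japanese_first = detect_encoding(sample, ["utf-8", "euc_jp", "gb2312", "latin-1"])
chinese_first = detect_encoding(sample, ["utf-8", "gb2312", "euc_jp", "latin-1"])

# latin-1 accepts every byte, so placing it early hides the CJK candidates.
latin_first = detect_encoding(sample, ["utf-8", "latin-1", "euc_jp"])
```

The same six bytes come out as EUC-JP, as GB2312, or as Latin-1
depending only on the order of the list, which is why the order has to
come from somewhere that knows the language of the text.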

 > >  > The fact that we mix them is because Emacs had
 > >  > language environments before it had locale environments.
 > > 
 > > What's a "locale environment"?
 > 
 > See set-locale-environment.

"[No match]"

YAGNI, apparently.  (For values of "you" == "me", obviously.  YMMV. :-)

 > > AFAIK Emacsen use the locale as a heuristic for determining the
 > > language environment
 > 
 > There's no heuristic involved, AFAIR.  Emacs has a database of
 > languages _and_encodings_ suitable for the known locale names.

You're confusing "algorithmic" with "non-heuristic".  Of course it's
possible to have a heuristic algorithm.

And of course in this case, locale is a heuristic.  *Emacs is a
multilingual* (well, technically, multiscript) *application*, and any
setting of the language environment that doesn't take into account the
current text we're working with is surely heuristic.

 > set-locale-environment uses that database to get the language and the
 > preferred encoding(s), then calls set-language-environment with the
 > language, and sets the priorities of the encodings according to the
 > encoding preferences.

That's an unnecessary API, ISTM.  (set-language-environment nil)
should do that.  Perhaps there should be a `set-locale' command to
override the POSIX_ME_HARDER locale taken from the environment, but
the POSIX_ME_HARDER locale is an abomination in a multilingual
application and should be buried as deeply as we can manage.  It is,
of course, a useful heuristic for the user's preferred language
environment for *scratch*, but that's about as far as we can take that.






* Re: per-buffer language environments
  2010-12-15  4:51             ` Stephen J. Turnbull
@ 2010-12-15  6:47               ` Eli Zaretskii
  2010-12-15  7:45                 ` Werner LEMBERG
                                   ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Eli Zaretskii @ 2010-12-15  6:47 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: handa, emacs-devel

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: emacs-devel@gnu.org,
>     handa@m17n.org
> Date: Wed, 15 Dec 2010 13:51:40 +0900
> 
>  > Those are all valid concerns, but they are just the tip of an
>  > iceberg.
> 
> No, they *are* the iceberg, at least as far as the autopilot is
> concerned.  After that, you *must* ask the user.

As long as we agree that there _is_ an iceberg, I won't argue.

>  > There's an almost infinite number of combinations of a language and
>  > the preferred encoding
> 
> Sure, but given a language and the set of encoding features Emacs
> knows how to detect *when reading from a stream*, there remains
> substantial ambiguity.

The emphasis on *reading* takes what I originally wrote out of its
context.  I didn't comment on reading alone, I commented on the entire
issue of coding-systems being tied up to the language:

> > * Which coding system to use on writing when the current
> >   buffer contains a character that can't be encoded by
> >   buffer-file-coding-system.
> > 
> > * Which coding systems have higher priority when inserting a
> >   file in the current buffer.
> 
> I could understand how the font selection and the default input method
> are related to the language, but what do encodings have to do with
> that?  The preferred encoding is generally an attribute of a locale,
> not of a language.

If the ambiguity you are talking about is that there are more settings
than just for reading, then I was originally talking about those, too.
If the ambiguity is about something else, please tell what that is.

> All problems with the language environment that I know
> of stem from its global nature applying to all buffers and the
> application itself, not from appropriate use in a given buffer.

I agree that it would be useful to have a language as per-buffer
setting.  This discussion is about what that should include.

> IOW, it's just the defects of the POSIX_ME_HARDER locale mirrored
> into Emacs itself.

I also stated quite clearly (I think) that I think we should
distinguish between the locale and the language, as far as their
effects on Emacs are concerned.

>  > , and it's impossible to fold them all, or even their significant
>  > fraction,
> 
> Of course a significant fraction is possible.  That's precisely what
> the priority lists have been achieving since the early 1990s.

Evidently, your examples try to show that the fraction is not
significant enough.

> If your complaint is that we should do better, "patches welcome" is
> the only thing I can think of to say.

No, I'm saying we shouldn't try to do better _automatically_.  Users
have enough facilities to affect the defaults according to their
specific use-cases.

>  > in a reasonably usable user-level interface.  We shouldn't even
>  > try, IMO; we already have prefer-coding-system
> 
> Huh?  prefer-coding-system has two effects: it promotes a certain
> coding-system to highest priority in its category, and it promotes
> that category to highest priority in case of ambiguity.  IOW, it's a
> user override of the priority setting that comes from the language
> environment.

Exactly my point: the user can override the automated selections if
she needs.  So the current automation doesn't need to do better.

> A completely different purpose (handling exceptions)
> from the language environment itself (handling the unmarked case).

Except that set-language-environment calls prefer-coding-system under
the hood to do most of its job...

> Are you sure you have any idea what you're talking about?

I think I do.  I'm not sure we are talking about the same thing,
though.

> That's an honest question; the way you are going, I have to wonder.

Knowing me for as long as you do, I wonder how such a question can be
honest.  But I digress.

> If you say "yes", I'll trust you, but I'd appreciate an explanation
> of what you're talking about that refers to real bugs in the current
> system, rather than general features that offend your sense of
> design.

I wasn't talking about any bugs at all.  Werner suggested adding a new
_feature_; I was talking about what that feature should and shouldn't
include.

> [coding priority settings] are to remove ambiguities like "we have
> EUC, but which one?" and "we have Windows-125x, but which one?" and
> "since ISO-8859-1 allows all 256 bytes, if we want to give priority
> to Chinese or Japanese, that had better come late in the list!"

I don't think I said anything to the contrary.  I would add, though,
that the priority settings also deal with "we have some encoding that
uses 8-bit bytes, but which encoding is that?"

>  > > AFAIK Emacsen use the locale as a heuristic for determining the
>  > > language environment
>  > 
>  > There's no heuristic involved, AFAIR.  Emacs has a database of
>  > languages _and_encodings_ suitable for the known locale names.
> 
> You're confusing "algorithmic" with "non-heuristic".

Please take a look at the database.  I stand by what I wrote: there's
no heuristic anywhere in sight.

> And of course in this case, locale is a heuristic.  *Emacs is a
> multilingual* (well, technically, multiscript) *application*, and any
> setting of the language environment that doesn't take into account the
> current text we're working with is surely heuristic.

If so, it's a heuristic that is external to Emacs.  Emacs just abides
by it, because users expect that.  Anyway, this aspect is entirely
unrelated to the issue at hand.

>  > set-locale-environment uses that database to get the language and the
>  > preferred encoding(s), then calls set-language-environment with the
>  > language, and sets the priorities of the encodings according to the
>  > encoding preferences.
> 
> That's an unnecessary API, ISTM.  (set-language-environment nil)
> should do that.

So we basically agree: the (not entirely complete) equivalence between
these 2 APIs is not TRT and it should go away.  We may disagree which
API should be dropped and which one retained, but that's just a naming
issue (and maybe a consequence of the fact that you didn't know about
set-locale-environment before).

But this is not the main issue I wanted to discuss.  The main issue is
what constitutes a "language environment" as far as Emacs is
concerned, after we factor out the effects of the locale?  If we are
going to implement per-buffer language environments, we need to decide
that first and foremost.

Perhaps a useful starting point would be to ask: what exactly is a
"language name" string? should it specify only a language, or should
it also try to specify the preferred encodings?

> the POSIX_ME_HARDER locale is an abomination in a multilingual
> application and should be buried as deeply as we can manage.  It is,
> of course, a useful heuristic for the user's preferred language
> environment for *scratch*, but that's about as far as we can take that.

I'm not sure it's as black and white as you make it sound.  For
example, users of the same language on GNU/Linux and on MS-Windows
might very well disagree wrt the preferred encodings.  So some
aspects of the locale still affect language-specific choices.  But
again, I think talking about the locale just muddies the waters in
this discussion.




* Re: per-buffer language environments
  2010-12-15  6:47               ` Eli Zaretskii
@ 2010-12-15  7:45                 ` Werner LEMBERG
  2010-12-16 21:10                 ` Stephen J. Turnbull
  2010-12-17  0:51                 ` Kenichi Handa
  2 siblings, 0 replies; 27+ messages in thread
From: Werner LEMBERG @ 2010-12-15  7:45 UTC (permalink / raw)
  To: eliz; +Cc: stephen, emacs-devel, handa


> I wasn't talking about any bugs at all.  Werner suggested adding a
> new _feature_; I was talking about what that feature should and
> shouldn't include.

Perhaps it makes sense to provide typical use cases instead of
theorizing a priori.  Hopefully, others provide real-life scenarios
too.

  My case:

    file language: Chinese or Japanese or Korean
    file encoding: UTF-8 (or any other flavour of Unicode)

    Wish:
      Emacs should select a proper font based on a file language tag.
      The fonts should be specified by the user, to be configured as a
      preference list in `.emacs'.

    Reason:
      It is not possible to automatically decide whether a given font
      like SimSun is really suitable for a given language; this
      concept is missing in the OpenType specification, contrary to,
      say, CID-keyed fonts.  A hint might be the presence of a
      specific script and language tag in the font's OpenType tables
      (`HANI' and `CHN', respectively, for SimSun), but there are many
      TrueType fonts which don't have advanced OpenType features.
      Since SimSun contains Katakana, Hiragana, and CJK glyphs – this
      might be deduced from the OS/2 table, and FontConfig checks that
      also – it *can* be used for Japanese, but it doesn't *suit*.

      This problem is really important for CJK fonts; however, even
      European languages can be affected.  For example, the right way
      in Romanian is to write `ş' (s with cedilla), but it should be
      displayed as `ș' (s with comma below).  Recent OpenType fonts
      often contain proper language tags so that a language specific
      mapping can be done, but many, many Type 1 fonts don't; they
      contain the glyph name `scedilla', but the real glyph displayed
      is s with comma below.
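
    A minimal sketch of such a preference list in `.emacs', assuming a
    recent GNU Emacs.  The names `my-language-font-alist' and
    `my-apply-language-font' are hypothetical, and the font families
    are only placeholders for whatever is installed locally:

```elisp
;; Hypothetical names throughout; none of this is part of Emacs itself.
(defvar my-language-font-alist
  '((japanese . ("IPAMincho" "Sazanami Mincho"))
    (chinese  . ("AR PL UMing CN" "SimSun"))
    (korean   . ("UnBatang")))
  "Alist mapping a language tag to font families, most preferred first.")

(defun my-apply-language-font (language)
  "Remap the default face in this buffer to a font suited to LANGUAGE.
LANGUAGE is a symbol that appears as a key in `my-language-font-alist'."
  (interactive
   (list (intern (completing-read
                  "Language: "
                  (mapcar (lambda (entry) (symbol-name (car entry)))
                          my-language-font-alist)
                  nil t))))
  (let ((family nil))
    ;; Pick the first configured family that is actually installed.
    (dolist (candidate (cdr (assq language my-language-font-alist)))
      (when (and (null family)
                 (find-font (font-spec :family candidate)))
        (setq family candidate)))
    (if family
        ;; `face-remap-add-relative' only affects the current buffer.
        (face-remap-add-relative 'default :family family)
      (message "No font configured for %s is installed" language))))
```

    `M-x my-apply-language-font' then changes the `default' face in the
    current buffer only.  Whether that family is really chosen for Han
    characters still depends on the fontset configuration, so this is
    a starting point rather than a complete solution.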


    Werner


* Re: per-buffer language environments
  2010-12-15  6:47               ` Eli Zaretskii
  2010-12-15  7:45                 ` Werner LEMBERG
@ 2010-12-16 21:10                 ` Stephen J. Turnbull
  2010-12-17 11:51                   ` Eli Zaretskii
  2010-12-17  0:51                 ` Kenichi Handa
  2 siblings, 1 reply; 27+ messages in thread
From: Stephen J. Turnbull @ 2010-12-16 21:10 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel, handa

Eli Zaretskii writes:

 > The emphasis on *reading* takes what I originally wrote out of its
 > context.  I didn't comment on reading alone, I commented on the entire
 > issue of coding-systems being tied up to the language:

I know you were talking about something else, but I can't figure out
what or why.  You said, "don't associate coding priorities with
language," I gave a good reason why there should be coding priorities
associated with language.  The rest of what you write is irrelevant,
since none of it points out a real problem with that association.

 > If the ambiguity you are talking about is that there are more settings
 > than just for reading,

Of course that's not the ambiguity I'm talking about.  The ambiguity
I'm talking about is in the reading, and that is sufficient reason to
associate priorities with language.

 > I agree that it would be useful to have a language as per-buffer
 > setting.  This discussion is about what should that include.

It should include priorities for encoding detection.

 > > Of course a significant fraction is possible.  That's precisely what
 > > the priority lists have been achieving since the early 1990s.
 > 
 > Evidently, your examples try to show that the fraction is not
 > significant enough.

No, my examples show what you will lose by removing the association of
encoding priority with language environment.

 > > If your complaint is that we should do better, "patches welcome" is
 > > the only thing I can think of to say.
 > 
 > No, I'm saying we shouldn't try to do better _automatically_.  Users
 > have enough facilities to affect the defaults according to their
 > specific use-cases.

Handa-san was not talking about trying to do better.  He was talking
about how we achieve the success rates we currently get.  Removing the
association of language with encoding priority would drastically
decrease that for anybody who needs to deal with multiple languages
and multiple associated encodings in their environment.

 > Exactly my point: the user can override the automated selections if
 > she needs.  So the current automation doesn't need to do better.

Well, your point is just plain wrong, then, because nobody is
proposing a change w.r.t. the current automation.  All that has been
suggested is that we keep doing the same things we've been doing to
achieve a reasonable degree of automatic recognition for people in
environments with multiple encodings.

 > > A completely different purpose (handling exceptions)
 > > from the language environment itself (handling the unmarked case).
 > 
 > Except that set-language-environment calls prefer-coding-system under
 > the hood to do most of its job...

Yes, this works for Europeans, Arabs, and Israelis, because basically
what you need to do is disambiguate ISO-8859-X, and just putting the
right ISO coding system (or perhaps a Windows-125x coding system) at
the head of the list (ie, just using prefer-coding-system) does what
you need.  It's not good enough for Han users because they need to
disambiguate EUC from each other and from 8-bit ISO, and among
Microsoft bogus encodings (Shift JIS and Big5).  That means
manipulating the priority lists at positions other than head of list.
I'm not sure about Cyrillic users.
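
A minimal sketch of the difference, using the GNU Emacs function names
(`prefer-coding-system', `set-coding-system-priority',
`coding-system-priority-list'); the XEmacs equivalents differ, and the
coding systems picked here are only an assumed example for a
Japanese-heavy environment:

```elisp
;; Evaluate one of the two settings below, not both in sequence.

;; Enough for many ISO-8859-X users: move a single coding system to
;; the head of the detection priority list.
(prefer-coding-system 'iso-8859-2)

;; A Japanese user may want to order several candidates at once.
;; Per its docstring, only the first coding system of each category
;; is honoured, so the arguments should belong to distinct categories.
(set-coding-system-priority 'utf-8 'euc-jp 'iso-2022-jp 'japanese-shift-jis)

;; Inspect the resulting order (most preferred first).
(coding-system-priority-list)
```

Both calls act on the whole session, not on a single buffer, and
`prefer-coding-system' also changes the defaults used for new files
and subprocess I/O, so it does more than reorder the list.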

 > > That's an honest question; the way you are going, I have to wonder.
 > 
 > Knowing me for as long as you do, I wonder how can such a question be
 > honest.  But I digress.

Usually you don't miss a point like "nobody is proposing anything new
here for how language environments work".  (All that is being proposed
is making them buffer-local.)  Since you did miss it, I have to wonder
if you know anything about how encoding detection works internally.

 > I wasn't talking about any bugs at all.  Werner suggested to add a new
 > _feature_; I was talking about what that feature should and shouldn't
 > include.

Well, you're wrong about manipulating the coding priorities.  It is
not new, and it is needed.

 > > And of course in this case, locale is a heuristic.  *Emacs is a
 > > multilingual* (well, technically, multiscript) *application*, and any
 > > setting of the language environment that doesn't take into account the
 > > current text we're working with is surely heuristic.
 > 
 > If so, it's a heuristic that is external to Emacs.  Emacs just abides
 > by it, because users expect that.  Anyway, this aspect is entirely
 > unrelated to the issue at hand.

Of course it's not unrelated.  Referring to the locale is an external
heuristic and therefore unreliable.  If the user sets a language
environment, that is surely better information than what you get from
the locale.  However, it's probably a good idea to merge information
from the new language environment with that from the old one, giving
precedence to the new.

 > But this is not the main issue I wanted to discuss.  The main issue is
 > what constitutes a "language environment" as far as Emacs is
 > concerned, after we factor out the effects of the locale?

What are you talking about, "factor out"?  If the user sets a language
environment, that will override the locale on all points where it
specifies behavior.

 > Perhaps a useful starting point would be to ask: what exactly is a
 > "language name" string? should it specify only a language, or should
 > it also try to specify the preferred encodings?

It should specify only the language, IMO.  Determining the preferred
encodings is complex but fairly mature at this point.  If the user
doesn't want the default priorities associated with a language, I
don't see why they shouldn't use prefer-coding-system or
set-coding-priority-list rather than piggyback on the language
environment itself.

 > I'm not sure it's as black and white as you make it sound.  For
 > example, users of the same language on GNU/Linux and on MS-Windows
 > might very well disagree wrt the preferred encodings.  So some
 > aspects of the locale still affect language-specific choices.

Huh?  That's not "locale", that's system convention.  Locale is
something else entirely.  It's true that you can override that
heuristic via locale, but (at least in XEmacs) we take the system type
into account when computing the startup priorities, even if the locale
specifies an encoding.  I would imagine Emacs does the same.

 > But again, I think talking about the locale just muddies the waters
 > in this discussion.

Then why do you keep talking about it?

Can we agree that it's a good heuristic for (1) the initial language
environment for *scratch* and (2) when an encoding is specified in the
locale, it should be prefer-coding-system'd, and (3) after doing (1)
and (2) we don't care about the locale any more?







* Re: per-buffer language environments
  2010-12-15  6:47               ` Eli Zaretskii
  2010-12-15  7:45                 ` Werner LEMBERG
  2010-12-16 21:10                 ` Stephen J. Turnbull
@ 2010-12-17  0:51                 ` Kenichi Handa
  2010-12-17  2:48                   ` Stephen J. Turnbull
  2010-12-17 11:05                   ` Eli Zaretskii
  2 siblings, 2 replies; 27+ messages in thread
From: Kenichi Handa @ 2010-12-17  0:51 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: stephen, emacs-devel

I should join this discussion, but sorry, I don't have
time at the moment.  I'd like to say one thing:

In article <E1PSl9O-0001wu-GB@fencepost.gnu.org>, Eli Zaretskii <eliz@gnu.org> writes:

> Perhaps a useful starting point would be to ask: what exactly is a
> "language name" string? should it specify only a language, or should
> it also try to specify the preferred encodings?

The reason why I chose the term "language environment"
instead of simple "language" was to make it provide a set of
various good settings (i.e. environment).  So my intention
of "language environment" name was not a language name but
an environment name.  This concept was close to "locale" but
when I made it, it was not clear what the system locale can
do.

Anyway, for that, "language" is just one aspect, and thus
there are variants of Chinese-* (which specify both language
and encoding) and variants of Latin-* (which specify only
encoding).

A while ago, I proposed a more dynamic way of specifying
a language environment, which allows the user to freely name an
environment by any combination of language and encoding, and
Emacs automatically generates a proper "language environment"
object associated with the specified name.  The name can
have this syntax:
  LANGUAGE-[ENCODING[-CHARSET[-INPUT_METHOD]]]
or any other convenient syntax (e.g. keyword-value pairs).

But that idea was rejected because it's overkill.
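
For illustration only, the keyword-value variant could be sketched as
a plist applied to one buffer.  The function name
`my-apply-buffer-environment' and the key names are assumptions made
for this sketch, not the rejected proposal itself:

```elisp
(defun my-apply-buffer-environment (spec)
  "Apply SPEC, a plist of environment settings, to the current buffer.
Keys recognized in this sketch: :encoding and :input-method."
  (let ((encoding (plist-get spec :encoding))
        (input-method (plist-get spec :input-method)))
    (when encoding
      ;; Coding system used the next time the buffer is saved.
      (set-buffer-file-coding-system encoding))
    (when input-method
      ;; Make this the method that `toggle-input-method' (C-\) turns
      ;; on in this buffer, without activating it right away.
      (setq-local default-input-method input-method))))

;; Example call; :language is carried only as a label here.
(my-apply-buffer-environment
 '(:language "Japanese" :encoding euc-jp :input-method "japanese"))
```

The sketch deliberately leaves detection priorities and fonts alone;
it only shows that language, encoding, and input method can be named
independently of each other.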

---
Kenichi Handa
handa@m17n.org




* Re: per-buffer language environments
  2010-12-17  0:51                 ` Kenichi Handa
@ 2010-12-17  2:48                   ` Stephen J. Turnbull
  2010-12-17 11:05                   ` Eli Zaretskii
  1 sibling, 0 replies; 27+ messages in thread
From: Stephen J. Turnbull @ 2010-12-17  2:48 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: Eli Zaretskii, emacs-devel

Kenichi Handa writes:

 > I should join this discussion, but sorry, I don't have
 > time at the moment.  I'd like to say one thing:

Well, it *is* shiwasu, after all.[1]

I think it's best if Eli and I defer the discussion until you can
join, then.

Footnotes: 
[1]  To let everybody in on the joke, it's a Japanese name for
December which literally means "[even] teachers run".





* Re: per-buffer language environments
  2010-12-17  0:51                 ` Kenichi Handa
  2010-12-17  2:48                   ` Stephen J. Turnbull
@ 2010-12-17 11:05                   ` Eli Zaretskii
  1 sibling, 0 replies; 27+ messages in thread
From: Eli Zaretskii @ 2010-12-17 11:05 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: stephen, emacs-devel

> From: Kenichi Handa <handa@m17n.org>
> Cc: stephen@xemacs.org, emacs-devel@gnu.org
> Date: Fri, 17 Dec 2010 09:51:23 +0900
> 
> The reason why I chose the term "language environment"
> instead of simple "language" was to make it provide a set of
> various good settings (i.e. environment).  So my intention
> of "language environment" name was not a language name but
> an environment name.

That's okay, but it's not clear whether what Werner wants still fits
this model, which was originally invented to be global for the entire
Emacs session.  Since now the issue (as I understand it) is to provide
a more fine-grained feature, we should analyze again whether it is
still appropriate to specify the whole thing (language, input method,
default encodings for all of their varieties, etc.) with a single
string argument, or indeed with a single API/UI.




* Re: per-buffer language environments
  2010-12-16 21:10                 ` Stephen J. Turnbull
@ 2010-12-17 11:51                   ` Eli Zaretskii
  2010-12-18  6:29                     ` Werner LEMBERG
  2010-12-18  9:30                     ` Stephen J. Turnbull
  0 siblings, 2 replies; 27+ messages in thread
From: Eli Zaretskii @ 2010-12-17 11:51 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: emacs-devel, handa

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: handa@m17n.org,
>     emacs-devel@gnu.org
> Date: Fri, 17 Dec 2010 06:10:44 +0900
> 
> I know you were talking about something else, but I can't figure out
> what or why.

Sorry for not making myself clear.  Let me try again:

  . The issue is what it means to have a separate buffer-local
    "language environment".

  . The current machinery of language environments was invented and
    evolved to its current form as a global session-wide setting.  I'm
    not sure the same set of heuristics, or even the extent of what
    "language environment" means and what settings it affects, are
    still correct for a buffer-local setting.

  . There's any number of possible use-cases for needing this kind of
    feature.  They are all quite rare (if they weren't, we would have
    many complaints about not having such a feature, which we don't).
    The current heuristics encoded in the global language environment
    do not cover rare and marginal use-cases well, being just that
    -- a set of heuristics.  It is therefore quite probable that just
    making the language environment buffer-local and keeping all the
    rest of its machinery and semantics would do the wrong thing for a
    large portion of the use-cases which need such a buffer-local
    feature.

  . IMO, the way we set priorities for selecting an encoding based on
    the language runs the highest risk of being inappropriate for this
    kind of buffer-local "language environment".  That's because
    selection of an appropriate encoding depends on factors that have
    nothing to do with the language, for those languages which have
    several alternative encodings.  These factors include the locale,
    the filesystem on which the buffer's file lives (which could be
    local or remote), the purpose of the text that is edited (it could
    be a text file, or a program source, or an email message meant to
    be sent, or text to be sent to a subsidiary program or copy/pasted
    through a selection), and possibly some more.  Setting the
    language can surely identify a small set of appropriate encodings,
    but I very much doubt that it can correctly select The Right One.

  . Therefore, I think that buffer-local "language environments"
    should not automatically select the encodings given just the
    language name, but instead let the user specify them separately
    when she selects the buffer-local language.

>  > > That's an honest question; the way you are going, I have to wonder.
>  > 
>  > Knowing me for as long as you do, I wonder how can such a question be
>  > honest.  But I digress.
> 
> Usually you don't miss a point like "nobody is proposing anything new
> here for how language environments work".  (All that is being proposed
> is making them buffer-local.)  Since you did miss it, I have to wonder
> if you know anything about how encoding detection works internally.

Since you have the logs to get you straight about the degree of my
knowledge in that issue, you should rather wonder whether I'm missing
your point because I misunderstood what you are saying or because you
failed to explain it clearly.




* Re: per-buffer language environments
  2010-12-17 11:51                   ` Eli Zaretskii
@ 2010-12-18  6:29                     ` Werner LEMBERG
  2010-12-18  9:30                     ` Stephen J. Turnbull
  1 sibling, 0 replies; 27+ messages in thread
From: Werner LEMBERG @ 2010-12-18  6:29 UTC (permalink / raw)
  To: eliz; +Cc: stephen, handa, emacs-devel


>   . There's any number of possible use-cases for needing this kind
>     of feature.  They are all quite rare (if they weren't, we would
>     have many complaints about not having such a feature, which we
>     don't).

I would rather say that these use-cases are all quite new.
Previously, Emacs didn't do too well with Unicode, and only recently
have free CJK fonts (especially for Japanese) become available as
TrueType fonts, so that the user can now choose among well-crafted
fonts.

>     The current heuristics encoded in the global language environment
>     do not cover rare and marginal use-cases well, being just that
>     -- a set of heuristics.  It is therefore quite probable that just
>     making the language environment buffer-local and keeping all the
>     rest of its machinery and semantics would do the wrong thing for a
>     large portion of the use-cases which need such a buffer-local
>     feature.

I don't think so; however, people on this list should submit more
use-cases if possible so that we can decide this issue more easily.

>     Setting the language can surely identify a small set of
>     appropriate encodings, but I very much doubt that it can
>     correctly select The Right One.

Note that I'm specifically talking about Unicode.  IMHO, handling of
all other encodings should stay as-is since they will be extinct soon
anyway.


    Werner




* Re: per-buffer language environments
  2010-12-17 11:51                   ` Eli Zaretskii
  2010-12-18  6:29                     ` Werner LEMBERG
@ 2010-12-18  9:30                     ` Stephen J. Turnbull
  2010-12-21 18:39                       ` Eli Zaretskii
  2010-12-21 21:16                       ` Werner LEMBERG
  1 sibling, 2 replies; 27+ messages in thread
From: Stephen J. Turnbull @ 2010-12-18  9:30 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel, handa

Eli Zaretskii writes:

 >   . The issue is what it means to have a separate buffer-local
 >     "language environment".

Please, let's postpone this until Handa-san has some time to work on
it.

I have two comments to make to try to avoid misunderstanding later.

First, please note that what Werner needs has little or nothing to do
with this discussion of modifying the coding priority list.  Werner is
in the fairly small set of users for whom encoding selection is a
solved problem.  If Emacs gets it wrong, he knows what to do about it.

*His* problems are harder, and more deeply tied to language itself.
The current language environment mechanism is good at what it does and
will be somewhat improved by being made buffer-local, but to be really
useful to Werner a number of additional attributes need to be added,
as well as some functionality that I don't yet really know how to
implement (eg, his Romanian s-with-comma-below vs. s-cedilla issue).

Second, you wrote:

 > Since you have the logs to get you straight about the degree of my
 > knowledge in that issue, you should rather wonder whether I'm missing
 > your point because I misunderstood what you are saying or because you
 > failed to explain it clearly.

*sigh*  OK, then, let me make things perfectly clear.  I have so
wondered, but the words you have written make me believe that you have
a fundamental misunderstanding of how buffer-file-coding-system gets
set in Emacsen, and specifically you do not understand the role of the
priority lists.  Since the details matter (and I believe now differ
across the Emacsen), I recommend you look at the source rather than
have me explain.  But I'll tell you how the general scheme works in
XEmacs if you want, and later Handa-san can clarify any differences.




* Re: per-buffer language environments
  2010-12-13  7:56     ` Kenichi Handa
  2010-12-13  9:27       ` Werner LEMBERG
  2010-12-13 11:47       ` Eli Zaretskii
@ 2010-12-18 17:03       ` Per Starbäck
  2010-12-19 13:54         ` Stefan Monnier
  2 siblings, 1 reply; 27+ messages in thread
From: Per Starbäck @ 2010-12-18 17:03 UTC (permalink / raw)
  To: Kenichi Handa; +Cc: eliz, emacs-devel

I've "always" wanted Emacs to know what natural language a buffer is
in, at least text buffers, maybe as a minor mode. Here are two things
I think haven't been mentioned in the thread yet, which I would
like:

* automatically ispell-change-dictionary
* have language-specific abbrevs

If there's a hook for changing from and to specific languages that
could be used for other things as well, like changing the values of
some sentence-end-* variables.
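
A minimal sketch of such a hook.  All `my-' names are hypothetical,
and the dictionary names "svenska" and "american" are assumptions that
depend on the locally installed spell checker:

```elisp
(defvar my-buffer-language-hook nil
  "Hypothetical hook run after the buffer's language has been changed.")

(defvar-local my-buffer-language nil
  "Hypothetical buffer-local language name, such as \"swedish\".")

(defun my-set-buffer-language (language)
  "Record LANGUAGE for the current buffer and run the language hook."
  (interactive "sLanguage: ")
  (setq my-buffer-language language)
  (run-hooks 'my-buffer-language-hook))

(defun my-language-spelling-setup ()
  "Adjust spelling and sentence settings to `my-buffer-language'."
  (cond
   ((equal my-buffer-language "swedish")
    ;; Without a prefix argument this changes the dictionary for the
    ;; current buffer only.
    (ispell-change-dictionary "svenska")
    (setq-local sentence-end-double-space nil))
   ((equal my-buffer-language "english")
    (ispell-change-dictionary "american")
    (setq-local sentence-end-double-space t))))

(add-hook 'my-buffer-language-hook #'my-language-spelling-setup)
```

With this in place, `M-x my-set-buffer-language RET swedish RET'
switches the dictionary and the sentence convention for that buffer;
language-specific abbrevs could hang off the same hook.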




* Re: per-buffer language environments
  2010-12-18 17:03       ` Per Starbäck
@ 2010-12-19 13:54         ` Stefan Monnier
  2010-12-19 21:05           ` Dimitri Fontaine
  0 siblings, 1 reply; 27+ messages in thread
From: Stefan Monnier @ 2010-12-19 13:54 UTC (permalink / raw)
  To: Per Starbäck; +Cc: eliz, emacs-devel, Kenichi Handa

> * automatically ispell-change-dictionary

Yes, that would be nice.  It'd have to work "on the fly" to be useful
for me (typically I want/need it for email messages, which start empty).
I was thinking that (fly|i)spell could detect "too many misspellings" and
trigger an auto-detect.


        Stefan




* Re: per-buffer language environments
  2010-12-19 13:54         ` Stefan Monnier
@ 2010-12-19 21:05           ` Dimitri Fontaine
  0 siblings, 0 replies; 27+ messages in thread
From: Dimitri Fontaine @ 2010-12-19 21:05 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Per Starbäck, eliz, Kenichi Handa, emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> * automatically ispell-change-dictionary
>
> Yes, that would be nice.  It'd have to work "on the fly" to be useful
> for me (typically I want/need it for email messages, which start empty).
> I was thinking that (fly|i)spell could detect "too many misspellings" and
> trigger an auto-detect.

See:

  http://git.naquadah.org/?p=flyguess.git;a=summary
  http://www.emacswiki.org/emacs-fr/CategorySpelling

Regards,
-- 
dim




* Re: per-buffer language environments
  2010-12-18  9:30                     ` Stephen J. Turnbull
@ 2010-12-21 18:39                       ` Eli Zaretskii
  2010-12-21 21:16                       ` Werner LEMBERG
  1 sibling, 0 replies; 27+ messages in thread
From: Eli Zaretskii @ 2010-12-21 18:39 UTC (permalink / raw)
  To: Stephen J. Turnbull; +Cc: emacs-devel, handa

> From: "Stephen J. Turnbull" <stephen@xemacs.org>
> Cc: handa@m17n.org,
>     emacs-devel@gnu.org
> Date: Sat, 18 Dec 2010 18:30:39 +0900
> 
> the words you have written make me believe that you have
> a fundamental misunderstanding of how buffer-file-coding-system gets
> set in Emacsen, and specifically you do not understand the role of the
> priority lists.

You are wrong.




* Re: per-buffer language environments
  2010-12-18  9:30                     ` Stephen J. Turnbull
  2010-12-21 18:39                       ` Eli Zaretskii
@ 2010-12-21 21:16                       ` Werner LEMBERG
  2010-12-22  6:52                         ` Stephen J. Turnbull
  1 sibling, 1 reply; 27+ messages in thread
From: Werner LEMBERG @ 2010-12-21 21:16 UTC (permalink / raw)
  To: stephen; +Cc: eliz, handa, emacs-devel


> *His* problems are harder, and more deeply tied to language itself.
> The current language environment mechanism is good at what it does
> and will be somewhat improved by being made buffer-local, but to be
> really useful to Werner a number of additional attributes need to be
> added, as well as some functionality that I don't yet really know
> how to implement (eg, his Romanian s-with-comma-below vs. s-cedilla
> issue).

IMHO, the Romanian functionality is nothing Emacs should take care of at
all.  It should simply forward a `language environment' to the font
library which has to take care of using the proper glyph.  Today, most
of the good multilingual OpenType fonts have support for that
mechanism.  However, for CJK stuff, the situation is very different.
Virtually *no* font supports different glyphs for Chinese, Japanese,
and Korean.  Ken Lunde from Adobe has analyzed the problem in detail,
and according to him, it would be necessary to add about 40% more
glyphs, making huge fonts even larger.


    Werner




* Re: per-buffer language environments
  2010-12-21 21:16                       ` Werner LEMBERG
@ 2010-12-22  6:52                         ` Stephen J. Turnbull
  2010-12-22  7:42                           ` Werner LEMBERG
  0 siblings, 1 reply; 27+ messages in thread
From: Stephen J. Turnbull @ 2010-12-22  6:52 UTC (permalink / raw)
  To: Werner LEMBERG; +Cc: eliz, emacs-devel, handa

Werner LEMBERG writes:

 > IMHO, the Romanian functionality is nothing Emacs should take care of at
 > all.  It should simply forward a `language environment' to the font
 > library which has to take care of using the proper glyph.  Today, most
 > of the good multilingual OpenType fonts have support for that
 > mechanism.

It's not obvious to me that that is a generally correct solution (see
below for why I don't think it appropriate for CJK), but if it does
work for European (and probably many other) languages, that's great.

BTW, did you mean to say good *free* multilingual OpenType fonts, and
just assume freedom, or was the omission prompted by reality?  Freedom
matters to Emacsen, of course.

 > However, for CJK stuff, the situation is very different.  Virtually
 > *no* font supports different glyphs for Chinese, Japanese, and
 > Korean.

It's not obvious to me that they should.  If you look at the multiple
Chinese languages, Japanese, Korean, and Vietnamese, you see that
there are clearly Chinese styles (and I suspect differences among
Taiwanese, Cantonese, and Mandarin styles), clearly Japanese styles,
etc. with respect to stroke endings, attitude of slanted strokes,
contact points, and extensions at join points.  I don't think that
people from different East Asian culture/languages would find
compromise fonts acceptable, except perhaps in the very simplest of
Gothic and Maru Gothic faces (Japanese names for font styles basically
equivalent to sans-serif upright faces for Latin characters).  Eg, in
Emacs, even as one who learned Japanese late in life, I've gotten used
to distinguishing Chinese spam from Japanese spam via such stylistic
differences (strictly speaking, it's unnecessary as the presence of
kana is normally decisive).  I have to wonder if such stylistic fine
points might not be very important to the comfort level of someone who
is bilingual in Chinese and Japanese.

But as a practical matter, today if Emacs wants to display Chinese
attractively (maybe even "correctly"), it cannot use a Japanese font
and compromise fonts with multilingual support basically don't exist.
So even if support in Emacs for choosing appropriate fonts based on
language is not needed for Romanian, it is needed for Han-based
languages.





* Re: per-buffer language environments
  2010-12-22  6:52                         ` Stephen J. Turnbull
@ 2010-12-22  7:42                           ` Werner LEMBERG
  0 siblings, 0 replies; 27+ messages in thread
From: Werner LEMBERG @ 2010-12-22  7:42 UTC (permalink / raw)
  To: stephen; +Cc: eliz, emacs-devel, handa


> BTW, did you mean to say good *free* multilingual OpenType fonts,
> and just assume freedom, or was the omission prompted by reality?
> Freedom matters to Emacsen, of course.

Today, virtually any new multilingual font, be it `free' or not,
supports that.  For example, the TeX Gyre project provides extended
glyph sets for the URW clones of the 35 standard PS fonts, and of
course the Romanian case is handled by its OpenType tables (script
`latn', language `ROM', table `locl').

> > However, for CJK stuff, the situation is very different.
> > Virtually *no* font supports different glyphs for Chinese,
> > Japanese, and Korean.
> 
> It's not obvious to me that they should.  If you look at the
> multiple Chinese languages, Japanese, Korean, and Vietnamese, you
> see that there are clearly Chinese styles (and I suspect differences
> among Taiwanese, Cantonese, and Mandarin styles), clearly Japanese
> styles, etc. with respect to stroke endings, attitude of slanted
> strokes, contact points, and extensions at join points.  I don't
> think that people from different East Asian culture/languages would
> find compromise fonts acceptable, except perhaps in the very
> simplest of Gothic and Maru Gothic faces (Japanese names for font
> styles basically equivalent to sans-serif upright faces for Latin
> characters).  Eg, in Emacs, even as one who learned Japanese late in
> life, I've gotten used to distinguishing Chinese spam from Japanese
> spam via such stylistic differences (strictly speaking, it's
> unnecessary as the presence of kana is normally decisive).  I have
> to wonder if such stylistic fine points might not be very important
> to the comfort level of someone who is bilingual in Chinese and
> Japanese.

You've probably misunderstood me.  The idea of the script/language
tags within OpenType fonts is that you can map the input character
codes to script or language specific glyphs.  If you do so for CJK
fonts, you need about 60% more glyphs *to get locale specific correct
shapes*, as Ken Lunde has analyzed (unfortunately, his presentation is
not available on the net).  A great number of glyphs (about 40%),
however, *can* be shared among the CJK locales – just think of the
characters 一二三四五, which always have the same shape.

In other words, the technical problems to have a single font with
support for multiple CJK locales have been solved, but there is no
such font (neither free nor non-free, AFAIK) which incorporates this
technique.

> But as a practical matter, today if Emacs wants to display Chinese
> attractively (maybe even "correctly"), it cannot use a Japanese font
> and compromise fonts with multilingual support basically don't
> exist.  So even if support in Emacs for choosing appropriate fonts
> based on language is not needed for Romanian, it is needed for
> Han-based languages.

With `no support in Emacs needed' I mean that the burden of handling
the issue can be transferred to a library (e.g. libotf).  In case the
library says "sorry, I can't provide a locale specific font", Emacs
should do nothing for Romanian (since the user can easily install a
font with proper support), but should actually handle the case for
CJK.


    Werner


