* How to compare strings?
@ 2007-04-29 16:23 David Kastrup
  2007-04-29 19:38 ` Eli Zaretskii
       [not found] ` <mailman.2692.1177876391.7795.help-gnu-emacs@gnu.org>
  0 siblings, 2 replies; 16+ messages in thread
From: David Kastrup @ 2007-04-29 16:23 UTC
  To: help-gnu-emacs


Hi,

how do I compare strings in the sort order of the current language
environment?  Does Emacs have a concept of sort order depending on
language?  If not, why not?

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

* Re: How to compare strings?
  2007-04-29 16:23 How to compare strings? David Kastrup
@ 2007-04-29 19:38 ` Eli Zaretskii
  2007-04-29 20:06   ` Lennart Borgman (gmail)
       [not found] ` <mailman.2692.1177876391.7795.help-gnu-emacs@gnu.org>
  1 sibling, 1 reply; 16+ messages in thread
From: Eli Zaretskii @ 2007-04-29 19:38 UTC
  To: help-gnu-emacs

> From: David Kastrup <dak@gnu.org>
> Date: Sun, 29 Apr 2007 18:23:09 +0200
> 
> how do I compare strings in the sort order of the current language
> environment?

I don't understand the question.  I'm sure you are aware that in the
Emacs internal representation of strings, each character has a
distinct codepoint.  That is, unlike outside Emacs, where the same
code can stand for different characters depending on the locale
(because each locale assumes a certain default encoding of text),
inside Emacs Latin-1 è and Latin-2 č are two different characters
represented by two different codes, even though their respective 8-bit
encodings are identical (\350 or hex E8).  In the above example, these
two internal codes are 2280 and 2408 decimal.  (In Emacs 23, these
codes will change, but will still be different.)

Thus, as long as the string was decoded correctly, comparing such
strings is a simple matter of using string< and its ilk.

> Does Emacs have a concept of sort order depending on language?  If
> not, why not?

Because characters that have different order depending on the language
have different codepoints inside Emacs, and thus the issue doesn't
exist.

Or am I missing something?

* Re: How to compare strings?
  2007-04-29 19:38 ` Eli Zaretskii
@ 2007-04-29 20:06   ` Lennart Borgman (gmail)
  2007-04-29 20:52     ` Maciej Katafiasz
       [not found]     ` <mailman.2696.1177880336.7795.help-gnu-emacs@gnu.org>
  0 siblings, 2 replies; 16+ messages in thread
From: Lennart Borgman (gmail) @ 2007-04-29 20:06 UTC
  To: Eli Zaretskii; +Cc: help-gnu-emacs

Eli Zaretskii wrote:

> Because characters that have different order depending on the language
> have different codepoints inside Emacs, and thus the issue doesn't
> exist.
> 
> Or am I missing something?


I think that sorting differs more than that between different languages. 
Or at least it used to do that. Perhaps things have changed today, I am 
not sure.

* Re: How to compare strings?
       [not found] ` <mailman.2692.1177876391.7795.help-gnu-emacs@gnu.org>
@ 2007-04-29 20:39   ` Joost Kremers
  2007-04-29 21:31     ` sigvaldi
                       ` (3 more replies)
  2007-04-29 22:25   ` David Kastrup
  1 sibling, 4 replies; 16+ messages in thread
From: Joost Kremers @ 2007-04-29 20:39 UTC
  To: help-gnu-emacs

Eli Zaretskii wrote:
>> From: David Kastrup <dak@gnu.org>
>> Does Emacs have a concept of sort order depending on language?  If
>> not, why not?
>
> Because characters that have different order depending on the language
> have different codepoints inside Emacs, and thus the issue doesn't
> exist.
>
> Or am I missing something?

Well, in German dictionaries you will generally find words with ö
interspersed with those with o, but within the letter O, o sorts before ö.
So both "Ode" and "öde" appear under O, but the former before the latter.
Both, however, appear before "oder".
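
A minimal sketch of the German dictionary rule just described, written in
Python rather than Emacs Lisp.  The names `german_key` and `MARK_RANK` are
made up for illustration and are not something Emacs provides.  Letters are
compared first with case folded and diacritics stripped; the diacritics only
break ties:

```python
import unicodedata

# Illustrative tie-break ranks for combining marks: ring above, then diaeresis.
MARK_RANK = {"\u030a": 1, "\u0308": 2}


def german_key(word):
    """Two-level key: base letters first, diacritics only as a tie-break."""
    base = []
    marks = []
    for ch in unicodedata.normalize("NFD", word.casefold()):
        if unicodedata.combining(ch):
            # Record the mark's rank against the preceding base letter.
            if marks:
                marks[-1] = MARK_RANK.get(ch, 9)
        else:
            base.append(ch)
            marks.append(0)
    return ("".join(base), tuple(marks))


words = ["oder", "öde", "Ode"]
print(sorted(words))                  # ['Ode', 'oder', 'öde'] (code point order)
print(sorted(words, key=german_key))  # ['Ode', 'öde', 'oder'] (dictionary order)
```

Plain comparison puts "öde" last because ö has a higher code point than any
ASCII letter; the two-level key restores the order found in the dictionary.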

Yet, other languages that use ö may well alphabetise it as a completely
separate letter. IIRC this is done for example in Hungarian dictionaries,
where O and Ö are different sections of the dictionary, Ö following after
O. In Icelandic I think (may well be wrong, though), that it is customary
to sort words with Ö at the end, that is, even *after* Z.


-- 
Joost Kremers                                      joostkremers@yahoo.com
Selbst in die Unterwelt dringt durch Spalten Licht
EN:SiS(9)

* Re: How to compare strings?
  2007-04-29 20:06   ` Lennart Borgman (gmail)
@ 2007-04-29 20:52     ` Maciej Katafiasz
       [not found]     ` <mailman.2696.1177880336.7795.help-gnu-emacs@gnu.org>
  1 sibling, 0 replies; 16+ messages in thread
From: Maciej Katafiasz @ 2007-04-29 20:52 UTC
  To: help-gnu-emacs

On Sun, 29 Apr 2007 22:06:29 +0200, Lennart Borgman (gmail) wrote:

> Eli Zaretskii wrote:
> 
>> Because characters that have different order depending on the language
>> have different codepoints inside Emacs, and thus the issue doesn't
>> exist.
>> 
>> Or am I missing something?
> 
> I think that sorting differs more than that between different languages. 
> Or at least it used to do that. Perhaps things have changed today, I am 
> not sure.

It does.

Swedish:
a ae o oe å ä ö

German:
a å ä ae o ö oe

Cheers,
Maciej
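
The Swedish ordering listed above can be reproduced with a key that ranks
each letter by its position in the alphabet.  This is only a sketch in
Python, assuming precomposed (NFC) input; `SWEDISH_ALPHABET`, `SWEDISH_RANK`
and `swedish_key` are illustrative names, not an existing API:

```python
import unicodedata

# å, ä and ö are letters of their own and come after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
SWEDISH_RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def swedish_key(word):
    """Rank each letter by its position in the Swedish alphabet."""
    folded = unicodedata.normalize("NFC", word).casefold()
    # Characters outside the alphabet sort after all letters, by code point.
    return [(SWEDISH_RANK.get(ch, len(SWEDISH_ALPHABET)), ch) for ch in folded]


items = ["ö", "oe", "ä", "o", "å", "ae", "a"]
print(" ".join(sorted(items, key=swedish_key)))  # a ae o oe å ä ö
```

A single alphabet table is enough here because Swedish treats å, ä and ö as
separate letters; the German ordering needs a different kind of key, since
there the diacritics only matter as a tie-break.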

* Re: How to compare strings?
  2007-04-29 20:39   ` Joost Kremers
@ 2007-04-29 21:31     ` sigvaldi
  2007-04-29 21:47     ` Harald Hanche-Olsen
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 16+ messages in thread
From: sigvaldi @ 2007-04-29 21:31 UTC
  To: help-gnu-emacs


Joost Kremers wrote:
> Eli Zaretskii wrote:
> >> From: David Kastrup <dak@gnu.org>
> >> Does Emacs have a concept of sort order depending on language?  If
> >> not, why not?
> >
> > Because characters that have different order depending on the language
> > have different codepoints inside Emacs, and thus the issue doesn't
> > exist.
> >
> > Or am I missing something?
>
> Well, in German dictionaries you will generally find words with ö
> interspersed with those with o, but within the letter O, o>ö. So both "Ode"
> and "öde" appear under O, but the former before the latter. Both, however,
> appear before "oder".
>
> Yet, other languages that use ö may well alphabetise it as a completely
> separate letter. IIRC this is done for example in Hungarian dictionaries,
> where O and Ö are different sections of the dictionary, Ö following after
> O. In Icelandic I think (may well be wrong, though), that it is customary
> to sort words with Ö at the end, that is, even *after* Z.
>

The Icelandic alphabet ends in XYZÞÆÖ

* Re: How to compare strings?
  2007-04-29 20:39   ` Joost Kremers
  2007-04-29 21:31     ` sigvaldi
@ 2007-04-29 21:47     ` Harald Hanche-Olsen
  2007-04-29 21:56     ` Lennart Borgman (gmail)
       [not found]     ` <mailman.2701.1177884177.7795.help-gnu-emacs@gnu.org>
  3 siblings, 0 replies; 16+ messages in thread
From: Harald Hanche-Olsen @ 2007-04-29 21:47 UTC
  To: help-gnu-emacs

+ Joost Kremers <joostkremers@yahoo.com>:

| Yet, other languages that use ö may well alphabetise it as a
| completely separate letter. IIRC this is done for example in
| Hungarian dictionaries, where O and Ö are different sections of the
| dictionary, Ö following after O. In Icelandic I think (may well be
| wrong, though), that it is customary to sort words with Ö at the
| end, that is, even *after* Z.

Swedish too:  The Swedish alphabet ends ...XYZÅÄÖ.
And the Danish and Norwegian, ...XYZÆØÅ.
(The Danish/Norwegian use of Ø corresponds roughly to the Swedish Ö,
and Æ to Ä.)

-- 
* Harald Hanche-Olsen     <URL:http://www.math.ntnu.no/~hanche/>
- It is undesirable to believe a proposition
  when there is no ground whatsoever for supposing it is true.
  -- Bertrand Russell

* Re: How to compare strings?
  2007-04-29 20:39   ` Joost Kremers
  2007-04-29 21:31     ` sigvaldi
  2007-04-29 21:47     ` Harald Hanche-Olsen
@ 2007-04-29 21:56     ` Lennart Borgman (gmail)
  2007-04-29 22:22       ` Jesper Harder
       [not found]       ` <mailman.2702.1177885779.7795.help-gnu-emacs@gnu.org>
       [not found]     ` <mailman.2701.1177884177.7795.help-gnu-emacs@gnu.org>
  3 siblings, 2 replies; 16+ messages in thread
From: Lennart Borgman (gmail) @ 2007-04-29 21:56 UTC
  To: Joost Kremers; +Cc: help-gnu-emacs

Joost Kremers wrote:
> Eli Zaretskii wrote:
>>> From: David Kastrup <dak@gnu.org>
>>> Does Emacs have a concept of sort order depending on language?  If
>>> not, why not?
>> Because characters that have different order depending on the language
>> have different codepoints inside Emacs, and thus the issue doesn't
>> exist.
>>
>> Or am I missing something?
> 
> Well, in German dictionaries you will generally find words with ö
> interspersed with those with o, but within the letter O, o>ö. So both "Ode"
> and "öde" appear under O, but the former before the latter. Both, however,
> appear before "oder".


But I think there are completely different problems too. Don't some
languages sort partly depending on the phonetics instead of the spelling?

* Re: How to compare strings?
       [not found]     ` <mailman.2701.1177884177.7795.help-gnu-emacs@gnu.org>
@ 2007-04-29 22:08       ` Joost Kremers
  2007-04-30  7:50         ` Harald Hanche-Olsen
  0 siblings, 1 reply; 16+ messages in thread
From: Joost Kremers @ 2007-04-29 22:08 UTC
  To: help-gnu-emacs

Lennart Borgman (gmail) wrote:
> But I think there are completely different problems too. Don't some
> languages sort partly depending on the phonetics instead of the spelling?

TBH i have no idea what you mean by that... could you give an example?


-- 
Joost Kremers                                      joostkremers@yahoo.com
Selbst in die Unterwelt dringt durch Spalten Licht
EN:SiS(9)

* Re: How to compare strings?
  2007-04-29 21:56     ` Lennart Borgman (gmail)
@ 2007-04-29 22:22       ` Jesper Harder
       [not found]       ` <mailman.2702.1177885779.7795.help-gnu-emacs@gnu.org>
  1 sibling, 0 replies; 16+ messages in thread
From: Jesper Harder @ 2007-04-29 22:22 UTC
  To: help-gnu-emacs

"Lennart Borgman (gmail)" <lennart.borgman@gmail.com> writes:

> But I think there are completely different problems too. Don't some
> languages sort partly depending on the phonetics instead of the spelling?

Yes. In Danish 'aa' is alphabetized according to how it's
pronounced. 

If it is pronounced as two vowels (e.g. ekstraarbejde), it's
alphabetized as two a's. If it is pronounced as one vowel
(e.g. afrikaans), it's alphabetized as å (the last letter in the Danish
alphabet).
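
A rough Python sketch of that rule.  Since pronunciation cannot be derived
from spelling, the sketch assumes a hand-maintained exception list;
`TWO_VOWEL_WORDS`, `DANISH_ALPHABET`, `DANISH_RANK` and `danish_key` are
hypothetical names for illustration only:

```python
DANISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
DANISH_RANK = {ch: i for i, ch in enumerate(DANISH_ALPHABET)}

# Hypothetical exception list: words in which "aa" is two separate vowels.
# It has to be looked up, because spelling alone does not tell the cases apart.
TWO_VOWEL_WORDS = {"ekstraarbejde"}


def danish_key(word):
    """Sort key that treats the digraph 'aa' as å unless the word is listed."""
    folded = word.casefold()
    if folded not in TWO_VOWEL_WORDS:
        folded = folded.replace("aa", "å")
    # Characters outside the alphabet sort after all letters, by code point.
    return [(DANISH_RANK.get(ch, len(DANISH_ALPHABET)), ch) for ch in folded]


words = ["afrikaans", "ekstrem", "afrikaner", "ekstraarbejde"]
print(sorted(words, key=danish_key))
# ['afrikaner', 'afrikaans', 'ekstraarbejde', 'ekstrem']
```

"afrikaans" moves behind "afrikaner" because its "aa" is ranked as å, while
"ekstraarbejde" keeps its two a's and stays ahead of "ekstrem".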

* Re: How to compare strings?
       [not found] ` <mailman.2692.1177876391.7795.help-gnu-emacs@gnu.org>
  2007-04-29 20:39   ` Joost Kremers
@ 2007-04-29 22:25   ` David Kastrup
  2007-04-30  5:30     ` Stefan Monnier
  2007-04-30 19:28     ` Eli Zaretskii
  1 sibling, 2 replies; 16+ messages in thread
From: David Kastrup @ 2007-04-29 22:25 UTC
  To: help-gnu-emacs

Eli Zaretskii <eliz@gnu.org> writes:

>> From: David Kastrup <dak@gnu.org>
>> Date: Sun, 29 Apr 2007 18:23:09 +0200
>> 
>> how do I compare strings in the sort order of the current language
>> environment?
>
> I don't understand the question.  I'm sure you are aware that in the
> Emacs internal representation of strings, each character has a
> distinct codepoint.  That is, unlike outside Emacs, where the same
> code can stand for different characters depending on the locale
> (because each locale assumes a certain default encoding of text),
> inside Emacs Latin-1 è and Latin-2 č are two different characters
> represented by two different codes, even though their respective 8-bit
> encodings are identical (\350 or hex E8).

And?

> In the above example, these two internal codes are 2280 and 2408
> decimal.  (In Emacs 23, these codes will change, but will still be
> different.)
>
> Thus, as long as the string was decoded correctly, comparing such
> strings is a simple matter of using string< and its ilk.

But it does not establish the sort order of a language, but rather the
sort order of Unicode (or MULE) code points.  Something entirely
different.

>> Does Emacs have a concept of sort order depending on language?  If
>> not, why not?
>
> Because characters that have different order depending on the
> language have different codepoints inside Emacs, and thus the issue
> doesn't exist.
>
> Or am I missing something?

You are seemingly talking about something entirely different.  I can't
even make sense of your explanations.

Different languages have different orders of sorting characters.  Look
up the man pages of strcoll and strxfrm.  Pick up some telephone
directories or dictionaries of such languages.  Please note that this
is only partly related to the coding scheme (utf-8/latin-1 etc).

For example, in some languages, accented letters will be right behind
the corresponding unaccented letter.
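
For anyone who wants to try strcoll/strxfrm without writing C, Python's
`locale` module wraps both.  The following is a sketch only: the helper name
`sort_in_locale` is made up, and the locale names in the loop are assumptions
about what happens to be installed on a given system:

```python
import locale


def sort_in_locale(words, locale_name):
    """Sort words by the C library's collation rules for locale_name.

    Returns None if that locale is not installed on this system.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        # locale.strxfrm wraps strxfrm(3): its keys compare like strcoll(3).
        return sorted(words, key=locale.strxfrm)
    finally:
        # The locale is process-global state, so put it back.
        locale.setlocale(locale.LC_COLLATE, previous)


words = ["oder", "öde", "Ode"]
for name in ("C", "de_DE.UTF-8", "sv_SE.UTF-8"):  # availability is an assumption
    print(name, sort_in_locale(words, name))
```

The result depends entirely on the locale data shipped with the C library,
so the same word list may come back in a different order for each name, or
as None where the locale is missing.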

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

* Re: How to compare strings?
       [not found]       ` <mailman.2702.1177885779.7795.help-gnu-emacs@gnu.org>
@ 2007-04-29 23:06         ` Joost Kremers
  0 siblings, 0 replies; 16+ messages in thread
From: Joost Kremers @ 2007-04-29 23:06 UTC
  To: help-gnu-emacs

Jesper Harder wrote:
> "Lennart Borgman (gmail)" <lennart.borgman@gmail.com> writes:
>
>> But I think there are completely different problems too. Don't some
>> languages sort partly depending on the phonetics instead of the spelling?
>
> Yes. In Danish 'aa' is alphabetized according to how it's
> pronounced. 
>
> If it is pronounced as two vowels (e.g. ekstraarbejde), it's
> alphabetized as two a's. If it is pronounced as one vowel
> (e.g. afrikaans), it's alphabetized as å (the last letter in the Danish
> alphabet).

technically, this is not (if i understand things correctly, i don't speak
danish) a case of alphabetising according to pronunciation. when 'aa' is,
as you put it, pronounced as one vowel, it is technically a digraph, i.e. a
combination of two letters that indicate a single sound.

many languages have digraphs, e.g. english has th, ch, ph and ng, and quite
a few vowel combinations that are pronounced as one vowel (or diphthong);
dutch has quite a few vowel digraphs (with pronunciations that are somewhat
more regular than in english ;-), e.g. oe, eu, ui, au, ou, ei and ij.

in some languages, digraphs are treated as single letters for
alphabetisation. the 'aa' case in danish above is an example. sometimes,
digraphs present particularly interesting problems. in dutch dictionaries,
the digraph ij is treated as two letters, so words starting with ij appear
under i, but in phone books and the like, it's often treated as equivalent
to y, so that names starting with ij appear intermingled with y.

and then there's the case of nahuatl, which has a bunch of consonant
digraphs (ch, cu/uc, hu/uh, qu, tl, tz). dictionaries often (though not
always, there's no "standard" here), have separate sections for words
starting with these digraphs, but for the rest treat them as two separate
letters for alphabetisation within a section. (well, there's of course the
whole issue of roots vs. stems and the fact that cu/uc and hu/uh change
based on the position of the word they're in, but let's not get into
that. ;-)


-- 
Joost Kremers                                      joostkremers@yahoo.com
Selbst in die Unterwelt dringt durch Spalten Licht
EN:SiS(9)

* Re: How to compare strings?
  2007-04-29 22:25   ` David Kastrup
@ 2007-04-30  5:30     ` Stefan Monnier
  2007-04-30 19:28     ` Eli Zaretskii
  1 sibling, 0 replies; 16+ messages in thread
From: Stefan Monnier @ 2007-04-30  5:30 UTC
  To: help-gnu-emacs

>> Or am I missing something?
> You are seemingly talking about something entirely different.  I can't
> even make sense of your explanations.

It was just a roundabout way to say "No, Emacs does not support
language-dependent sort order".


        Stefan

* Re: How to compare strings?
  2007-04-29 22:08       ` Joost Kremers
@ 2007-04-30  7:50         ` Harald Hanche-Olsen
  0 siblings, 0 replies; 16+ messages in thread
From: Harald Hanche-Olsen @ 2007-04-30  7:50 UTC
  To: help-gnu-emacs

+ Joost Kremers <joostkremers@yahoo.com>:

| Lennart Borgman (gmail) wrote:
|> But I think there are completely different problems too. Don't some
|> languages sort partly depending on the phonetics instead of the spelling?
|
| TBH i have no idea what you mean by that... could you give an example?

It's true, at least in Norwegian phone books.  Before the letter Å
entered our alphabet, Aa was used instead.  You don't find that in
regular words anymore, but the practice survives in many family
names.  So a name like Aarnes would be alphabetized as if it were
Årnes.  And just to make matters really confusing, the rule is
not supposed to be followed for foreign names where the aa really
does not correspond to the letter å, so an algorithmic solution is
impossible.  (I strongly suspect that Norwegian phone books
consistently alphabetize aa as å, though, regardless of the origin of
the name.)

-- 
* Harald Hanche-Olsen     <URL:http://www.math.ntnu.no/~hanche/>
- It is undesirable to believe a proposition
  when there is no ground whatsoever for supposing it is true.
  -- Bertrand Russell

* Re: How to compare strings?
  2007-04-29 22:25   ` David Kastrup
  2007-04-30  5:30     ` Stefan Monnier
@ 2007-04-30 19:28     ` Eli Zaretskii
  1 sibling, 0 replies; 16+ messages in thread
From: Eli Zaretskii @ 2007-04-30 19:28 UTC
  To: help-gnu-emacs

> From: David Kastrup <dak@gnu.org>
> Date: Mon, 30 Apr 2007 00:25:59 +0200
> 
> You are seemingly talking about something entirely different.  I can't
> even make sense of your explanations.

I simply didn't understand what you were asking (I actually said so
right at the beginning of my response).  I thought you were asking
about script-specific sorting, and that is what I responded to.

But you in fact asked about language-specific sorting that goes beyond
script.  Emacs doesn't currently support any language-specific
features (not just sorting, _any_ features), unless they happen to
coincide with script-specific features.  Adding such language-specific
features should probably be part of the agenda for the Unicode based
Emacs (a.k.a. Emacs 23), and I expect it to be a lot of work.

* Re: How to compare strings?
       [not found]     ` <mailman.2696.1177880336.7795.help-gnu-emacs@gnu.org>
@ 2007-05-01 13:19       ` Malte Spiess
  0 siblings, 0 replies; 16+ messages in thread
From: Malte Spiess @ 2007-05-01 13:19 UTC
  To: help-gnu-emacs

Maciej Katafiasz <mathrick@gmail.com> writes:

> On Sun, 29 Apr 2007 22:06:29 +0200, Lennart Borgman (gmail) wrote:
>
>> Eli Zaretskii wrote:
>> 
>>> Because characters that have different order depending on the language
>>> have different codepoints inside Emacs, and thus the issue doesn't
>>> exist.
>>> 
>>> Or am I missing something?
>> 
>> I think that sorting differs more than that between different languages. 
>> Or at least it used to do that. Perhaps things have changed today, I am 
>> not sure.
>
> It does.
>
> Swedish:
> a ae o oe å ä ö
>
> German:
> a å ä ae o ö oe

Well, in Estonian it's even worse, since here the z is between the s and
the t (going r s z t u v) - so even with normal letters the sorting is
different. I should add that "z" is not part of the normal Estonian
alphabet, but you sort in foreign words like this.

> Cheers,
> Maciej

Greetings
Malte

