all messages for Emacs-related lists mirrored at yhetil.org
* Comparing non-English strings for sorting
@ 2009-02-10  6:31 spamfilteraccount
  2009-02-10 10:47 ` spamfilteraccount
  0 siblings, 1 reply; 3+ messages in thread
From: spamfilteraccount @ 2009-02-10  6:31 UTC
  To: help-gnu-emacs

Hi,

I see Emacs doesn't have built-in support for sorting non-English (UTF,
Unicode) strings in proper order.

Has anyone written a comparison function which can handle sorting such
strings if the character order is provided?

For example, in my case I'd supply the Hungarian alphabetical order as
a string ("aábcdeéfghijklmnoóöőpqrstuúüűxyvz") and the string
comparison function would use the character positions in this string
when comparing two strings to determine which is the lesser.

It couldn't handle all kinds of Unicode strings, of course, but it
would be an adequately simple solution for most of the Western
languages.

Someone may have already written this for some package, only I don't
know where to look. Any pointers?




* Re: Comparing non-English strings for sorting
  2009-02-10  6:31 Comparing non-English strings for sorting spamfilteraccount
@ 2009-02-10 10:47 ` spamfilteraccount
  2009-02-12 23:57   ` Thien-Thi Nguyen
  0 siblings, 1 reply; 3+ messages in thread
From: spamfilteraccount @ 2009-02-10 10:47 UTC
  To: help-gnu-emacs

On Feb 10, 7:31 am, "spamfilteracco...@gmail.com"
<spamfilteracco...@gmail.com> wrote:
> Hi,
>
> I see Emacs doesn't have built-in support for sorting non-English (UTF,
> Unicode) strings in proper order.
>
> Has anyone written a comparison function which can handle sorting such
> strings if the character order is provided?

I wrote my own func. Wasn't that hard. Let me know if you spot some
error in it or know a better way:


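(require 'cl-lib)  ; `cl' is obsolete; cl-lib provides `cl-some' and `cl-position'

;; Declare `order' special so the comparator sees the `let' binding below
;; even under `lexical-binding'.
(defvar order nil "Collation order: a string listing characters in sorted order.")

(defun my-case-insensitive-nonenglish-string-comparator (str1 str2)
  "Return non-nil if STR1 sorts before STR2 according to `order'."
  ;; Find the first position where the downcased strings differ.
  (let ((diff (cl-some (lambda (char1 char2)
                         (and (not (equal char1 char2))
                              (cons char1 char2)))
                       (vconcat (downcase str1))
                       (vconcat (downcase str2)))))
    (if diff
        (let* ((char1 (car diff))
               (char2 (cdr diff))
               (pos1 (cl-position char1 order))
               (pos2 (cl-position char2 order)))
          ;; Use positions in `order' when both are listed, else code points.
          (if (and pos1 pos2)
              (< pos1 pos2)
            (< char1 char2)))
      ;; No difference within the common prefix: the shorter string sorts first.
      (< (length str1) (length str2)))))

(let ((l '("str1" "str2" ...))
      (order "aábcdeéfghijklmnoóöőpqrstuúüűxyvz"))
  (sort l 'my-case-insensitive-nonenglish-string-comparator))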
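
For readers who want the same position-lookup idea outside Emacs, here is
a minimal sketch in C over wide-character strings.  The names `ORDER',
`rank_of' and `order_cmp_sketch' are hypothetical and not taken from any
library; the order string is the one from the post, and the sketch
assumes the input is already lowercase.

```c
#include <stddef.h>
#include <wchar.h>

/* Collation order copied from the post (lowercase only); hypothetical name. */
static const wchar_t ORDER[] =
    L"a\u00e1bcde\u00e9fghijklmno\u00f3\u00f6\u0151pqrstu\u00fa\u00fc\u0171xyvz";

/* Index of c in ORDER, or -1 when c is not listed there. */
static long rank_of(wchar_t c)
{
    const wchar_t *p = (c != L'\0') ? wcschr(ORDER, c) : NULL;
    return p ? (long)(p - ORDER) : -1;
}

/* Returns a negative, zero, or positive value, like wcscmp. */
int order_cmp_sketch(const wchar_t *s1, const wchar_t *s2)
{
    for (; *s1 && *s2; s1++, s2++) {
        if (*s1 == *s2)
            continue;
        long r1 = rank_of(*s1), r2 = rank_of(*s2);
        /* Both characters listed: compare positions in ORDER. */
        if (r1 >= 0 && r2 >= 0)
            return r1 < r2 ? -1 : 1;
        /* Otherwise fall back to comparing code points. */
        return *s1 < *s2 ? -1 : 1;
    }
    /* Common prefix exhausted: the shorter string sorts first. */
    return (*s1 != L'\0') - (*s2 != L'\0');
}
```

With this order, L"\u00e1" (á) compares before L"b" even though its code
point is larger, and L"ab" compares before L"abc".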



* Re: Comparing non-English strings for sorting
  2009-02-10 10:47 ` spamfilteraccount
@ 2009-02-12 23:57   ` Thien-Thi Nguyen
  0 siblings, 0 replies; 3+ messages in thread
From: Thien-Thi Nguyen @ 2009-02-12 23:57 UTC
  To: help-gnu-emacs

() "spamfilteraccount@gmail.com" <spamfilteraccount@gmail.com>
() Tue, 10 Feb 2009 02:47:41 -0800 (PST)

                       (vconcat (downcase str1))
                       (vconcat (downcase str2)))))

If all the strings you wish to compare are composed entirely of
the characters in `order', this (unconditional case smashing) is
sufficient.  Otherwise, comparing a downcased character in that
set with a "downcased" character outside that set (where the
result is equal to the input) can be problematic.

Consider the ASCII character set (ascii(7)), specifically, the
six indices between ?Z and ?a (here, we use ?_, decimal 95).

 (downcase ?_) => 95  ;; no change
 (downcase ?a) => 97  ;; no change
 (downcase ?A) => 97  ;; smashed (numerically "upward", hee hee)
           ?A  => 65  ;; originally

Using unconditional case smashing in a hypothetical analog of
`my-case-insensitive-nonenglish-string-comparator', we'd see:

 (string-ci-lessp "_" "a") => t
 (string-ci-lessp "_" "A") => t
 (string-lessp "_" "a")    => t
 (string-lessp "_" "A")    => nil

Perhaps the reason behind the difference between the 2nd and 4th
results being "one is case-insensitive and the other isn't" does
indeed satisfy you.  It doesn't, me.  What is the case of the
underscore and why should my (in)sensitivity to it matter at all?
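
The four results above can be reproduced with plain ASCII arithmetic.
A minimal sketch in C, assuming the "C" locale; `cmp_raw' and
`cmp_smashed' are hypothetical helpers (neither is from Emacs or Guile),
with `tolower' from <ctype.h> standing in for `downcase':

```c
#include <ctype.h>

/* Sign of a difference: -1, 0 or 1. */
static int sign(int d) { return (d > 0) - (d < 0); }

/* Hypothetical: compare the character codes as they are. */
int cmp_raw(int x, int y) { return sign(x - y); }

/* Hypothetical: smash case on both sides unconditionally, then compare. */
int cmp_smashed(int x, int y) { return sign(tolower(x) - tolower(y)); }
```

Here cmp_smashed ('_', 'A') is -1 while cmp_raw ('_', 'A') is 1; for
'_' against 'a' both return -1.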

Appended is what i think is a more rational algorithm (expressed
in C, not Emacs Lisp, because it is part of an upcoming Guile
release (which is implemented (like Emacs) in C)).  It allows for
the (properly phrased ;-) mu answer.

thi

______________________________________
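#include <ctype.h>

int
scm_i_ccmp_ci (int x, int y)
{
  int d, lx = 0, ly = 0, ux = 0, uy = 0;

  /* ISLOWER and ISUPPER yield the 1-based letter index of C,
     or 0 if C is not a letter of that case.  */
#define ISLOWER(c)  (islower (c) ? (1 + (c) - 'a') : 0)
#define ISUPPER(c)  (isupper (c) ? (1 + (c) - 'A') : 0)
#define ALPHA(c)    ((l ## c = ISLOWER (c)) || (u ## c = ISUPPER (c)))
  /* GOOD is not defined in the posted excerpt; "positive" is assumed.  */
#define GOOD(d)     ((d) > 0)

  d = (!ALPHA (x) || !ALPHA (y))
    /* Subtract directly.  */
    ? (x - y)
    /* Subtract in one domain or another.  */
    : (lx
       ? (lx - (ly
                ? ly
                : uy))
       : (ux - (uy
                ? uy
                : ly)));
  return !d
    ? 0
    : (GOOD (d)
       ?  1
       : -1);

#undef GOOD
#undef ALPHA
#undef ISUPPER
#undef ISLOWER
}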
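
To see how a comparator of this shape behaves on whole strings, here is
a minimal restatement without macros.  It is a sketch, not the Guile
code: `letter_index', `ccmp_ci_sketch' and `strcmp_ci_sketch' are
hypothetical names, the "C" locale is assumed, and a positive difference
is assumed to map to 1.

```c
#include <ctype.h>

/* 1-based position in the alphabet for ASCII letters, 0 for anything else. */
static int letter_index(int c)
{
    if (islower(c)) return 1 + c - 'a';
    if (isupper(c)) return 1 + c - 'A';
    return 0;
}

/* Fold case only when both sides are letters; otherwise compare raw codes. */
int ccmp_ci_sketch(int x, int y)
{
    int ix = letter_index(x), iy = letter_index(y);
    int d = (ix && iy) ? ix - iy : x - y;
    return (d > 0) - (d < 0);
}

/* String-level wrapper: the first differing character decides,
   and a proper prefix sorts before the longer string. */
int strcmp_ci_sketch(const char *s1, const char *s2)
{
    for (; *s1 && *s2; s1++, s2++) {
        int r = ccmp_ci_sketch((unsigned char)*s1, (unsigned char)*s2);
        if (r != 0)
            return r;
    }
    return (*s1 != '\0') - (*s2 != '\0');
}
```

With these definitions 'a' and 'A' compare equal, '_' compares below 'a'
and above 'A', and "Emacs" equals "emacs".  One consequence worth keeping
in mind before handing this to a sort routine: 'A' sorts before '_' and
'_' before 'a', yet 'A' and 'a' compare equal, so the relation is not
transitive across letters and non-letters.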




