* Ispell and unibyte characters @ 2012-03-17 18:46 Eli Zaretskii 2012-03-26 17:39 ` Agustin Martin 0 siblings, 1 reply; 25+ messages in thread From: Eli Zaretskii @ 2012-03-17 18:46 UTC (permalink / raw) To: emacs-devel The doc string of ispell-dictionary-alist says, inter alia: Each element of this list is also a list: (DICTIONARY-NAME CASECHARS NOT-CASECHARS OTHERCHARS MANY-OTHERCHARS-P ISPELL-ARGS EXTENDED-CHARACTER-MODE CHARACTER-SET) ... CASECHARS, NOT-CASECHARS, and OTHERCHARS must be unibyte strings containing bytes of CHARACTER-SET. In addition, if they contain a non-ASCII byte, the regular expression must be a single `character set' construct that doesn't specify a character range for non-ASCII bytes. Why the restriction to unibyte character sets? This is quite a serious limitation, given that the modern spellers (aspell and hunspell) use UTF-8 as their default encoding. The only reason for this limitation I could find is in ispell-process-line, which assumes that the byte offsets returned by the speller can be used to compute character position of the misspelled word in the buffer. Are there any other places in ispell.el that assume unibyte characters? If ispell-process-line is the only place, then it should be easy to extend it so it handles correctly UTF-8 in addition to unibyte character sets. In any case, I see no reason to specify CASECHARS, NOT-CASECHARS, and OTHERCHARS as ugly unibyte escapes, since their usage is entirely consistent with multibyte characters: they are used to construct regular expressions and match buffer text against those regexps. Did I miss something important? Any comments and pointers to my blunders are welcome. ^ permalink raw reply [flat|nested] 25+ messages in thread
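For reference, entries of the shape that doc string describes look like the following sketch (illustrative values only, wrapped in add-to-list so the forms can be evaluated; these are not verbatim ispell.el defaults). The first uses the unibyte-escape style under discussion, with \267 being the raw latin-1 middledot byte; the second is the [[:alpha:]]/UTF-8 style argued for below:

  ;; Illustrative entries only -- names, flags and regexps are examples,
  ;; not ispell.el defaults.
  (require 'ispell)
  ;; Unibyte-escape style: OTHERCHARS carries a raw byte of CHARACTER-SET.
  (add-to-list 'ispell-dictionary-alist
               '("catala-example"
                 "[A-Za-z]" "[^A-Za-z]" "['\267-]" nil
                 ("-B" "-d" "catalan") nil iso-8859-1))
  ;; Character-class style: one regexp covers any alphabetic script,
  ;; with UTF-8 as the connection encoding.
  (add-to-list 'ispell-dictionary-alist
               '("utf8-example"
                 "[[:alpha:]]" "[^[:alpha:]]" "['-]" nil
                 ("-d" "en_US") nil utf-8))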
* Re: Ispell and unibyte characters 2012-03-17 18:46 Ispell and unibyte characters Eli Zaretskii @ 2012-03-26 17:39 ` Agustin Martin 2012-03-26 20:08 ` Eli Zaretskii 0 siblings, 1 reply; 25+ messages in thread From: Agustin Martin @ 2012-03-26 17:39 UTC (permalink / raw) To: emacs-devel On Sat, Mar 17, 2012 at 08:46:54PM +0200, Eli Zaretskii wrote: > The doc string of ispell-dictionary-alist says, inter alia: > > Each element of this list is also a list: > > (DICTIONARY-NAME CASECHARS NOT-CASECHARS OTHERCHARS MANY-OTHERCHARS-P > ISPELL-ARGS EXTENDED-CHARACTER-MODE CHARACTER-SET) > ... > CASECHARS, NOT-CASECHARS, and OTHERCHARS must be unibyte strings > containing bytes of CHARACTER-SET. In addition, if they contain > a non-ASCII byte, the regular expression must be a single > `character set' construct that doesn't specify a character range > for non-ASCII bytes. > > Why the restriction to unibyte character sets? This is quite a > serious limitation, given that the modern spellers (aspell and > hunspell) use UTF-8 as their default encoding. Hi Eli, At least for aspell ispell.el already uses utf8 as default communication encoding and [:alpha:] as CASECHARS (and ^[:alpha:] as NOT-CASECHARS). OTHERCHARS is guessed from aspell .dat file for given dictionary. Since currently it is not possible to ask hunspell for installed dictionaries (hunspell -D does not return control to the console) no one tried something similar for hunspell. > The only reason for this limitation I could find is in > ispell-process-line, which assumes that the byte offsets returned by > the speller can be used to compute character position of the > misspelled word in the buffer. Are there any other places in > ispell.el that assume unibyte characters? Not sure if using utf8 and [:alpha:] has caused some problem for aspell, I do not remember reports about this. > If ispell-process-line is the only place, then it should be easy to > extend it so it handles correctly UTF-8 in addition to unibyte > character sets. > > In any case, I see no reason to specify CASECHARS, NOT-CASECHARS, and > OTHERCHARS as ugly unibyte escapes, since their usage is entirely > consistent with multibyte characters: they are used to construct > regular expressions and match buffer text against those regexps. IIRC, the reason to use octal escapes is mostly that they are encoding independent. Otherwise a .emacs file may have mixed unibyte/multibyte encodings. Current limitation in docstring may be only something left from old times. I will try to look with recent ispell american dict, which can be called in utf8. Will let you know. Regards, -- Agustin ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Ispell and unibyte characters 2012-03-26 17:39 ` Agustin Martin @ 2012-03-26 20:08 ` Eli Zaretskii 2012-03-26 22:07 ` Lennart Borgman 2012-03-28 19:18 ` Agustin Martin 0 siblings, 2 replies; 25+ messages in thread From: Eli Zaretskii @ 2012-03-26 20:08 UTC (permalink / raw) To: Agustin Martin; +Cc: emacs-devel > Date: Mon, 26 Mar 2012 19:39:12 +0200 > From: Agustin Martin <agustin.martin@hispalinux.es> > > Hi Eli, Thanks for responding, I was beginning to think that no one is interested. In general, I find that ispell.el is in sore need of modernization; at least that's my conclusion so far from playing with hunspell (with which I want to replace my aging collection of Ispell and its dictionaries that I use for many years). > At least for aspell ispell.el already uses utf8 as default communication > encoding and [:alpha:] as CASECHARS (and ^[:alpha:] as NOT-CASECHARS). > OTHERCHARS is guessed from aspell .dat file for given dictionary. The question is, why isn't this done for any modern speller. The only one I know of that cannot handle UTF-8 is Ispell. OTHERCHARS are not very important anyway, at least for languages I'm interested in. > Since currently it is not possible to ask hunspell for installed > dictionaries (hunspell -D does not return control to the console) > no one tried something similar for hunspell. In what version do you have problems with -D? In any case, hunspell supports multiple dictionaries in the same session. One can invoke it with, e.g., "-d en_US,de_DE,ru_RU,he_IL" and have it spell-check mixed text that uses all these languages in the same buffer (at least in theory; I didn't yet try that in my experiments). Clearly, this can only be done with UTF-8 or some such as the encoding. So I think we should deprecate usage of the unibyte characters in the ispell.el defaults, and simply use [:alpha:] for all languages. As a bonus, we can then get rid of the ridiculously long and hard to maintain customization of each new dictionary you add to your repertory. Just one entry will serve almost any language, or at least supply an excellent default. > > The only reason for this limitation I could find is in > > ispell-process-line, which assumes that the byte offsets returned by > > the speller can be used to compute character position of the > > misspelled word in the buffer. Are there any other places in > > ispell.el that assume unibyte characters? > > Not sure if using utf8 and [:alpha:] has caused some problem for aspell, > I do not remember reports about this. Since I wrote that, I found that the problem was due to a bug in hunspell (which I fixed in my copy): it reported byte offsets of the misspelled words, rather than character offsets. After fixing that bug, there's no issue here anymore and nothing to fix in ispell.el. There's a bug report with a patch about that in the hunspell bug tracker, so there's reason to believe this bug will be fixed in a future release. > IIRC, the reason to use octal escapes is mostly that they are encoding > independent. They aren't; their encoding is guessed by Emacs based on the locale. Using them is asking for trouble, IMO. We specifically discourage use of unibyte text in Emacs manuals, and yet we ourselves use them in a package that is part of Emacs! > Otherwise a .emacs file may have mixed unibyte/multibyte encodings. I was talking about ispell.el, first and foremost. There's no problem with having ispell.el encoded in UTF-8, if needed (but I don't think there's a need, see above). 
^ permalink raw reply [flat|nested] 25+ messages in thread
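In Emacs Lisp terms, the single-entry setup described above would look roughly like this (a sketch only, not code from ispell.el; the dictionary names are whatever the local hunspell installation provides, and note the caveat about shared .aff files that comes up later in the thread):

  ;; Sketch, assuming a hunspell with en_US, de_DE, ru_RU and he_IL installed:
  ;; one entry, [[:alpha:]] as CASECHARS and UTF-8 on the pipe.
  (setq ispell-program-name "hunspell")
  (add-to-list 'ispell-dictionary-alist
               '("multi-example"
                 "[[:alpha:]]" "[^[:alpha:]]" "['-]" t
                 ("-d" "en_US,de_DE,ru_RU,he_IL") nil utf-8))
  (setq ispell-dictionary "multi-example")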
* Re: Ispell and unibyte characters 2012-03-26 20:08 ` Eli Zaretskii @ 2012-03-26 22:07 ` Lennart Borgman 2012-03-28 19:18 ` Agustin Martin 1 sibling, 0 replies; 25+ messages in thread From: Lennart Borgman @ 2012-03-26 22:07 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Agustin Martin, emacs-devel On Mon, Mar 26, 2012 at 22:08, Eli Zaretskii <eliz@gnu.org> wrote: > > > Date: Mon, 26 Mar 2012 19:39:12 +0200 > > From: Agustin Martin <agustin.martin@hispalinux.es> > > > > Hi Eli, > > Thanks for responding, I was beginning to think that no one is > interested. In general, I find that ispell.el is in sore need of I am interested, but I have just given up on this since I found I did not have time to fix it. On w32 it was all a mess. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Ispell and unibyte characters 2012-03-26 20:08 ` Eli Zaretskii 2012-03-26 22:07 ` Lennart Borgman @ 2012-03-28 19:18 ` Agustin Martin 2012-03-29 18:06 ` Eli Zaretskii 2012-04-10 19:08 ` Agustin Martin 1 sibling, 2 replies; 25+ messages in thread From: Agustin Martin @ 2012-03-28 19:18 UTC (permalink / raw) To: emacs-devel On Mon, Mar 26, 2012 at 04:08:06PM -0400, Eli Zaretskii wrote: > > Date: Mon, 26 Mar 2012 19:39:12 +0200 > > From: Agustin Martin <agustin.martin@hispalinux.es> > > > > Hi Eli, > > Thanks for responding, I was beginning to think that no one is > interested. In general, I find that ispell.el is in sore need of > modernization; at least that's my conclusion so far from playing with > hunspell (with which I want to replace my aging collection of Ispell > and its dictionaries that I use for many years). > > > At least for aspell ispell.el already uses utf8 as default communication > > encoding and [:alpha:] as CASECHARS (and ^[:alpha:] as NOT-CASECHARS). > > OTHERCHARS is guessed from aspell .dat file for given dictionary. > > The question is, why isn't this done for any modern speller. The only > one I know of that cannot handle UTF-8 is Ispell. I think the only real remaining reason is for XEmacs compatibility. AFAIK XEmacs does not support [:alpha:]. I thought about filtering ispell-dictionary-base-alist when used from FSF Emacs, so it uses [:alpha:] and still keeps compatibility. I am currently a bit busy, but at some time I may try this for Debian and see what happens. For XEMACS in Debian GNU/* even changing to [:alpha:] should have a reduced impact, strings provided by dictionary maintainers take precedence, but better if I can easily do the above anyway, so [:alpha:] is used if available. Once release happens, I'd like to commit some other changes to decrease XEmacs incompatibilities in ispell.el and flyspell.el, so my changes for Debian GNU/* become smaller. > OTHERCHARS are not very important anyway, at least for languages I'm > interested in. > > > Since currently it is not possible to ask hunspell for installed > > dictionaries (hunspell -D does not return control to the console) > > no one tried something similar for hunspell. > > In what version do you have problems with -D? Hunspell 1.3.2. Does not return control until I press ^C. This may be useful if someone wants to know about installed hunspell dictionaries and prepare something to play with that info, in a way similar to what is currently done for aspell in ispell.el. > In any case, hunspell supports multiple dictionaries in the same > session. One can invoke it with, e.g., "-d en_US,de_DE,ru_RU,he_IL" > and have it spell-check mixed text that uses all these languages in > the same buffer (at least in theory; I didn't yet try that in my > experiments). Clearly, this can only be done with UTF-8 or some such > as the encoding. Right. > So I think we should deprecate usage of the unibyte characters in the > ispell.el defaults, and simply use [:alpha:] for all languages. As a > bonus, we can then get rid of the ridiculously long and hard to > maintain customization of each new dictionary you add to your > repertory. Just one entry will serve almost any language, or at least > supply an excellent default. > > > > The only reason for this limitation I could find is in > > > ispell-process-line, which assumes that the byte offsets returned by > > > the speller can be used to compute character position of the > > > misspelled word in the buffer. 
Are there any other places in > > > ispell.el that assume unibyte characters? > > > > Not sure if using utf8 and [:alpha:] has caused some problem for aspell, > > I do not remember reports about this. > > Since I wrote that, I found that the problem was due to a bug in > hunspell (which I fixed in my copy): it reported byte offsets of the > misspelled words, rather than character offsets. After fixing that > bug, there's no issue here anymore and nothing to fix in ispell.el. > There's a bug report with a patch about that in the hunspell bug > tracker, so there's reason to believe this bug will be fixed in a > future release. You mean http://sourceforge.net/tracker/?func=detail&aid=3178449&group_id=143754&atid=756395 I filed that bug one year ago and received no reply from hunspell maintainers. This year I received a followup with a proposed change, but there is still no reply to it. There is other problem that mostly hits re-using ispell default entries under hunspell http://sourceforge.net/tracker/?func=detail&aid=2617130&group_id=143754&atid=756395 [~ prefixed strings are treated as words in pipe mode] that now stands for three years. I have waited in the hope this is fixed, but I think I will soon commit to Emacs the same change I use for Debian, making sure extended-character-mode is nil for hunspell. I do not think extended-character-mode pseudo-charsets will ever be implemented in hunspell. -- Agustin ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Ispell and unibyte characters 2012-03-28 19:18 ` Agustin Martin @ 2012-03-29 18:06 ` Eli Zaretskii 2012-03-29 21:13 ` Andreas Schwab 2012-04-26 9:54 ` Eli Zaretskii 2012-04-10 19:08 ` Agustin Martin 1 sibling, 2 replies; 25+ messages in thread From: Eli Zaretskii @ 2012-03-29 18:06 UTC (permalink / raw) To: Agustin Martin; +Cc: emacs-devel > Date: Wed, 28 Mar 2012 21:18:21 +0200 > From: Agustin Martin <agustin.martin@hispalinux.es> > > > OTHERCHARS are not very important anyway, at least for languages I'm > > interested in. > > > > > Since currently it is not possible to ask hunspell for installed > > > dictionaries (hunspell -D does not return control to the console) > > > no one tried something similar for hunspell. > > > > In what version do you have problems with -D? > > Hunspell 1.3.2. Does not return control until I press ^C. This may be useful > if someone wants to know about installed hunspell dictionaries and prepare > something to play with that info, in a way similar to what is currently done > for aspell in ispell.el. Well, to be fair to the Hunspell developers, the documentation doesn't say that -D should exit after displaying the available dictionaries. And the code really doesn't do that. However, with a simple 2-liner (below) I can make it do what you want. > > Since I wrote that, I found that the problem was due to a bug in > > hunspell (which I fixed in my copy): it reported byte offsets of the > > misspelled words, rather than character offsets. After fixing that > > bug, there's no issue here anymore and nothing to fix in ispell.el. > > There's a bug report with a patch about that in the hunspell bug > > tracker, so there's reason to believe this bug will be fixed in a > > future release. > > You mean > > http://sourceforge.net/tracker/?func=detail&aid=3178449&group_id=143754&atid=756395 Yes. > I filed that bug one year ago and received no reply from hunspell > maintainers. This year I received a followup with a proposed change, but > there is still no reply to it. I simply fixed this. This _is_ Free Software, isn't it? > There is other problem that mostly hits re-using ispell default entries > under hunspell > > http://sourceforge.net/tracker/?func=detail&aid=2617130&group_id=143754&atid=756395 > > [~ prefixed strings are treated as words in pipe mode] Another easy fix (the feature is not implemented, so the code should simply ignore such lines). > that now stands for three years. I have waited in the hope this is fixed, It's true that development seems to be slow, but then aspell development is not exactly vibrant, either: both spellers hadn't a release in many months. Anyway, to me, Hunspell is a better tool, because of its support for multiple dictionaries, which fixes the most annoying inconvenience in Emacs spell-checking: the need to switch dictionaries according to the language -- this is really a bad thing when you use Flyspell. With multiple dictionaries, with very rare exceptions, one needs a single entry in ispell-dictionary-alist, having all of the dictionaries for languages one normally uses, [[:alpha:]] as CASECHARS, and UTF-8 as the encoding. > but I think I will soon commit to Emacs the same change I use for Debian, > making sure extended-character-mode is nil for hunspell. Probably a good idea. 
--- src/tools/hunspell.cxx~0	2011-01-21 19:01:29.000000000 +0200
+++ src/tools/hunspell.cxx	2012-03-21 16:40:31.255690500 +0200
@@ -1756,6 +1763,7 @@ int main(int argc, char** argv)
 	fprintf(stderr, gettext("SEARCH PATH:\n%s\n"), path);
 	fprintf(stderr, gettext("AVAILABLE DICTIONARIES (path is not mandatory for -d option):\n"));
 	search(path, NULL, NULL);
+	if (arg_files==-1) exit(0);
     }
 
     if (!privdicname) privdicname = mystrdup(getenv("WORDLIST"));

^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Ispell and unibyte characters 2012-03-29 18:06 ` Eli Zaretskii @ 2012-03-29 21:13 ` Andreas Schwab 2012-03-30 6:28 ` Eli Zaretskii 2012-04-26 9:54 ` Eli Zaretskii 1 sibling, 1 reply; 25+ messages in thread From: Andreas Schwab @ 2012-03-29 21:13 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Agustin Martin, emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> Date: Wed, 28 Mar 2012 21:18:21 +0200 >> From: Agustin Martin <agustin.martin@hispalinux.es> >> >> > OTHERCHARS are not very important anyway, at least for languages I'm >> > interested in. >> > >> > > Since currently it is not possible to ask hunspell for installed >> > > dictionaries (hunspell -D does not return control to the console) >> > > no one tried something similar for hunspell. >> > >> > In what version do you have problems with -D? >> >> Hunspell 1.3.2. Does not return control until I press ^C. This may be useful >> if someone wants to know about installed hunspell dictionaries and prepare >> something to play with that info, in a way similar to what is currently done >> for aspell in ispell.el. > > Well, to be fair to the Hunspell developers, the documentation doesn't > say that -D should exit after displaying the available dictionaries. > And the code really doesn't do that. However, with a simple 2-liner > (below) I can make it do what you want. You can just redirect from /dev/null instead. Andreas. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 25+ messages in thread
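From Emacs Lisp the same trick could look like the sketch below (a hypothetical helper, not part of ispell.el). call-process with a nil INFILE already connects standard input to the null device, and hunspell prints the dictionary list on stderr (see the fprintf calls in the patch above), hence the '(t t) destination:

  ;; Hypothetical helper: run "hunspell -D" with stdin on the null device so
  ;; the process exits instead of waiting for input; collect stdout and
  ;; stderr (the dictionary list is written to stderr) into a temp buffer.
  (defun my-hunspell-dicts-output ()
    "Return the raw output of \"hunspell -D\" as a string."
    (with-temp-buffer
      (call-process "hunspell" nil '(t t) nil "-D")
      (buffer-string)))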
* Re: Ispell and unibyte characters 2012-03-29 21:13 ` Andreas Schwab @ 2012-03-30 6:28 ` Eli Zaretskii 0 siblings, 0 replies; 25+ messages in thread From: Eli Zaretskii @ 2012-03-30 6:28 UTC (permalink / raw) To: Andreas Schwab; +Cc: agustin.martin, emacs-devel > From: Andreas Schwab <schwab@linux-m68k.org> > Cc: Agustin Martin <agustin.martin@hispalinux.es>, emacs-devel@gnu.org > Date: Thu, 29 Mar 2012 23:13:19 +0200 > > > Well, to be fair to the Hunspell developers, the documentation doesn't > > say that -D should exit after displaying the available dictionaries. > > And the code really doesn't do that. However, with a simple 2-liner > > (below) I can make it do what you want. > > You can just redirect from /dev/null instead. Right, thanks. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Ispell and unibyte characters 2012-03-29 18:06 ` Eli Zaretskii 2012-03-29 21:13 ` Andreas Schwab @ 2012-04-26 9:54 ` Eli Zaretskii 1 sibling, 0 replies; 25+ messages in thread From: Eli Zaretskii @ 2012-04-26 9:54 UTC (permalink / raw) To: agustin.martin; +Cc: emacs-devel > Date: Thu, 29 Mar 2012 20:06:17 +0200 > From: Eli Zaretskii <eliz@gnu.org> > CC: emacs-devel@gnu.org > > Anyway, to me, Hunspell is a better tool, because of its support for > multiple dictionaries, which fixes the most annoying inconvenience in > Emacs spell-checking: the need to switch dictionaries according to the > language -- this is really a bad thing when you use Flyspell. > > With multiple dictionaries, with very rare exceptions, one needs a > single entry in ispell-dictionary-alist, having all of the > dictionaries for languages one normally uses, [[:alpha:]] as > CASECHARS, and UTF-8 as the encoding. Unfortunately, I have to take that back. Hunspell _does_ support multiple dictionaries, but only if they can use the same .aff file. When you invoke Hunspell with several dictionaries, as in hunspell -d "foo,bar,baz" only the first dictionary is loaded with its .aff file; the rest use that same .aff file. Therefore, it is practically impossible to use Hunspell to spell multi-lingual buffers without switching dictionaries. This feature _is_ useful when you want to add specialized dictionaries (e.g., for terminology in some specific field of knowledge or discipline) to the general dictionary of the same language, though. Sorry for posting misleading information. ^ permalink raw reply [flat|nested] 25+ messages in thread
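The use that does remain valid according to this message can be sketched as follows (dictionary names are hypothetical; both dictionaries must be able to share the base language's .aff file):

  ;; Sketch: stacking a specialized wordlist on top of the base dictionary
  ;; of the same language (hypothetical dictionary names).
  (add-to-list 'ispell-dictionary-alist
               '("english-medical-example"
                 "[[:alpha:]]" "[^[:alpha:]]" "[']" t
                 ("-d" "en_US,en_US-medical") nil utf-8))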
* Re: Ispell and unibyte characters
  2012-03-28 19:18             ` Agustin Martin
  2012-03-29 18:06               ` Eli Zaretskii
@ 2012-04-10 19:08                 ` Agustin Martin
  2012-04-10 19:11                   ` Eli Zaretskii
  1 sibling, 1 reply; 25+ messages in thread
From: Agustin Martin @ 2012-04-10 19:08 UTC (permalink / raw)
  To: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 1655 bytes --]

On Wed, Mar 28, 2012 at 09:18:21PM +0200, Agustin Martin wrote:
> On Mon, Mar 26, 2012 at 04:08:06PM -0400, Eli Zaretskii wrote:
> > > Date: Mon, 26 Mar 2012 19:39:12 +0200
> > > From: Agustin Martin <agustin.martin@hispalinux.es>
> > >
> > > Hi Eli,
> >
> > Thanks for responding, I was beginning to think that no one is
> > interested. In general, I find that ispell.el is in sore need of
> > modernization; at least that's my conclusion so far from playing with
> > hunspell (with which I want to replace my aging collection of Ispell
> > and its dictionaries that I use for many years).
> >
> > > At least for aspell ispell.el already uses utf8 as default communication
> > > encoding and [:alpha:] as CASECHARS (and ^[:alpha:] as NOT-CASECHARS).
> > > OTHERCHARS is guessed from aspell .dat file for given dictionary.
> >
> > The question is, why isn't this done for any modern speller. The only
> > one I know of that cannot handle UTF-8 is Ispell.
>
> I think the only real remaining reason is for XEmacs compatibility. AFAIK
> XEmacs does not support [:alpha:].
>
> I thought about filtering ispell-dictionary-base-alist when used from FSF
> Emacs, so it uses [:alpha:] and still keeps compatibility. I am currently a
> bit busy, but at some time I may try this for Debian and see what happens.

For the records, I am attaching what I am currently trying, post-processing
global dictionary list while leaving local definitions at ~/.emacs
unmodified. This should also deal with [#11200: ispell.el sets incorrect
encoding for the default dictionary]. I would like to test this a bit more
and commit if there are no problems.

--
Agustin

[-- Attachment #2: ispell.el_alpha-regexp.2.diff --]
[-- Type: text/x-diff, Size: 2073 bytes --]

--- ispell.el.orig	2012-04-10 20:02:51.422092761 +0200
+++ ispell.el	2012-04-10 20:18:27.464680054 +0200
@@ -783,6 +783,12 @@
 (make-obsolete-variable 'ispell-aspell-supports-utf8
                         'ispell-encoding8-command "23.1")
 
+(defvar ispell-emacs-alpha-regexp
+  (if (string-match "^[[:alpha:]]+$" "abcde")
+      "[[:alpha:]]"
+    nil)
+  "[[:alpha:]] if Emacs supports [:alpha:] regexp, nil
+otherwise (current XEmacs does not support it).")
 
 ;;; **********************************************************************
 ;;; The following are used by ispell, and should not be changed.
@@ -1179,8 +1185,7 @@
 	   (error nil))
 	 ispell-really-aspell
 	 ispell-encoding8-command
-	 ;; XEmacs does not like [:alpha:] regexps.
-	 (string-match "^[[:alpha:]]+$" "abcde")
+	 ispell-emacs-alpha-regexp)
     (unless ispell-aspell-dictionary-alist
       (ispell-find-aspell-dictionaries)))
@@ -1204,8 +1209,27 @@
 	  ispell-dictionary-base-alist))
       (unless (assoc (car dict) all-dicts-alist)
 	(add-to-list 'all-dicts-alist dict)))
-    (setq ispell-dictionary-alist all-dicts-alist))))
+    (setq ispell-dictionary-alist all-dicts-alist))
+  ;; If Emacs flavor supports [:alpha:] use it for global dicts. If
+  ;; spellchecker also supports UTF-8 via command-line option use it
+  ;; in communication. This does not affect definitions in ~/.emacs.
+  (if ispell-emacs-alpha-regexp
+      (let (tmp-dicts-alist)
+	(dolist (adict ispell-dictionary-alist)
+	  (add-to-list 'tmp-dicts-alist
+		       (list
+			(nth 0 adict)  ; dict name
+			"[[:alpha:]]"  ; casechars
+			"[^[:alpha:]]" ; not-casechars
+			(nth 3 adict)  ; otherchars
+			(nth 4 adict)  ; many-otherchars-p
+			(nth 5 adict)  ; ispell-args
+			(nth 6 adict)  ; extended-character-mode
+			(if ispell-encoding8-command
+			    'utf-8
+			  (nth 7 adict)))))
+	(setq ispell-dictionary-alist tmp-dicts-alist)))))
 
 (defun ispell-valid-dictionary-list ()
   "Return a list of valid dictionaries.

^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Ispell and unibyte characters 2012-04-10 19:08 ` Agustin Martin @ 2012-04-10 19:11 ` Eli Zaretskii 2012-04-12 14:36 ` Agustin Martin 0 siblings, 1 reply; 25+ messages in thread From: Eli Zaretskii @ 2012-04-10 19:11 UTC (permalink / raw) To: Agustin Martin; +Cc: emacs-devel > Date: Tue, 10 Apr 2012 21:08:03 +0200 > From: Agustin Martin <agustin.martin@hispalinux.es> > > For the records, I am attaching what I am currently trying, post-processing > global dictionary list while leaving local definitions at ~/.emacs > unmodified. This should also deal with [#11200: ispell.el sets incorrect > encoding for the default dictionary]. I would like to test this a bit more > and commit if there are no problems. Thanks, looks good to me. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Ispell and unibyte characters
  2012-04-10 19:11 ` Eli Zaretskii
@ 2012-04-12 14:36   ` Agustin Martin
  2012-04-12 19:01     ` Eli Zaretskii
  0 siblings, 1 reply; 25+ messages in thread
From: Agustin Martin @ 2012-04-12 14:36 UTC (permalink / raw)
  To: emacs-devel

On Tue, Apr 10, 2012 at 10:11:38PM +0300, Eli Zaretskii wrote:
> > Date: Tue, 10 Apr 2012 21:08:03 +0200
> > From: Agustin Martin <agustin.martin@hispalinux.es>
> >
> > For the records, I am attaching what I am currently trying, post-processing
> > global dictionary list while leaving local definitions at ~/.emacs
> > unmodified. This should also deal with [#11200: ispell.el sets incorrect
> > encoding for the default dictionary]. I would like to test this a bit more
> > and commit if there are no problems.
>
> Thanks, looks good to me.

Just some info, this is taking longer than expected, I am still dealing with
an open issue here. Some languages have non 7bit wordchars, like Catalan
middledot, and it should be converted to UTF-8 if default communication
encoding is changed to UTF-8.

I have looked at the encoding stuff and I am currently trying something like

  (if ispell-encoding8-command
      ;; Convert non 7bit otherchars to utf-8 if needed
      (encode-coding-string
       (decode-coding-string (nth 3 adict) (nth 7 adict))
       'utf-8)
    (nth 3 adict)) ; otherchars

to get new UTF-8 string where

  (nth 7 adict) -> dict-coding-system
  (nth 3 adict) -> Original otherchars

but get a sgml-lexical-context error. Need to look more carefully, so this
will take longer. I am far from expert in handling encodings, so comments
are welcome.

--
Agustin

^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Ispell and unibyte characters 2012-04-12 14:36 ` Agustin Martin @ 2012-04-12 19:01 ` Eli Zaretskii 2012-04-13 15:25 ` Agustin Martin 0 siblings, 1 reply; 25+ messages in thread From: Eli Zaretskii @ 2012-04-12 19:01 UTC (permalink / raw) To: Agustin Martin; +Cc: emacs-devel > Date: Thu, 12 Apr 2012 16:36:57 +0200 > From: Agustin Martin <agustin.martin@hispalinux.es> > > I am still dealing with an open issue here. Some languages have non 7bit > wordchars, like Catalan middledot, and it should be converted to UTF-8 if > default communication language is changed to UTF-8. Sorry, I don't understand: do you mean "non 8-bit wordchars"? I don't think 7 bits is assumed anywhere. Assuming you did mean 8-bit, then why not use UTF-8 for Catalan from the get-go? Only some languages can use single-byte encodings, and evidently Catalan is not one of them. For that matter, why shouldn't aspell and hunspell use UTF-8 by default (something I already asked)? > I have looked at the encoding stuff and I am currently trying something > like > > (if ispell-encoding8-command > ;; Convert non 7bit otherchars to utf-8 if needed > (encode-coding-string > (decode-coding-string (nth 3 adict) (nth 7 adict)) > 'utf-8) > (nth 3 adict)) ; otherchars > > to get new UTF-8 string where > > (nth 7 adict) -> dict-coding-system > (nth 3 adict) -> Original otherchars > > but get a sgml-lexical-context error. Need to look more carefuly, so this > will take longer. I am far from expert in handling encodings, so comments > are welcome. I don't understand what are you trying to accomplish by encoding OTHERCHARS in UTF-8. What exactly is the problem with them being encoded in some 8-bit encoding? Please explain. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Ispell and unibyte characters
  2012-04-12 19:01 ` Eli Zaretskii
@ 2012-04-13 15:25   ` Agustin Martin
  2012-04-13 15:53     ` Eli Zaretskii
  2012-04-13 17:51     ` Stefan Monnier
  0 siblings, 2 replies; 25+ messages in thread
From: Agustin Martin @ 2012-04-13 15:25 UTC (permalink / raw)
  To: emacs-devel

On Thu, Apr 12, 2012 at 10:01:30PM +0300, Eli Zaretskii wrote:
> I wrote:
> > I am still dealing with an open issue here. Some languages have non 7bit
> > wordchars, like Catalan middledot, and it should be converted to UTF-8 if
> > default communication encoding is changed to UTF-8.
>
> Sorry, I don't understand: do you mean "non 8-bit wordchars"? I don't
> think 7 bits is assumed anywhere.

I mean wordchars that cannot be represented in 7bit encoding, like Catalan
middledot (available in 8bit latin1).

> Assuming you did mean 8-bit, then why not use UTF-8 for Catalan from
> the get-go? Only some languages can use single-byte encodings, and
> evidently Catalan is not one of them. For that matter, why shouldn't
> aspell and hunspell use UTF-8 by default (something I already asked)?

[...]

> I don't understand what are you trying to accomplish by encoding
> OTHERCHARS in UTF-8. What exactly is the problem with them being
> encoded in some 8-bit encoding? Please explain.

Imagine a fake entry in the general list, either in ispell.el or provided
through `ispell-base-dicts-override-alist' (no accented chars for simplicity)

  ("catala8"
   "[A-Za-z]" "[^A-Za-z]" "['\267-]" nil ("-B" "-d" "catalan") nil iso-8859-1)

Unless emacs knows the encoding for \267 (middledot "·") it cannot decode it
properly. I prefer to not use UTF-8 here, because I want the entry to also be
useful for ispell (and also be XEmacs incompatible). The best approach here
seems to decode the otherchars regexp according to provided coding-system.

I have noticed that there seems to be no need to encode the resulting string
in UTF-8, Emacs will know what to do with the decoded string.

I tested something like

  (dolist (adict ispell-dictionary-alist)
    (add-to-list 'tmp-dicts-alist
                 (list
                  (nth 0 adict)  ; dict name
                  "[[:alpha:]]"  ; casechars
                  "[^[:alpha:]]" ; not-casechars
                  (if ispell-encoding8-command
                      ;; Decode 8bit otherchars if needed
                      (decode-coding-string (nth 3 adict) (nth 7 adict))
                    (nth 3 adict)) ; otherchars
                  (nth 4 adict)  ; many-otherchars-p
                  (nth 5 adict)  ; ispell-args
                  (nth 6 adict)  ; extended-character-mode
                  (if ispell-encoding8-command
                      'utf-8
                    (nth 7 adict)))))

and seems to work well.

> I wrote:
> > but get a sgml-lexical-context error. Need to look more carefully, so this
> > will take longer.

I have tested further and this seems to be an unrelated problem. Some time
ago I already noticed some problems with flyspell.el and sgml mode (in
particular psgml) regarding sgml-lexical-context error

  sgml-lexical-context: Wrong type argument: stringp, nil

sometimes when running flyspell-buffer after enabling flyspell-mode. I am
also seeing something like

  Error in post-command-hook (flyspell-post-command-hook): (wrong-type-argument stringp nil)

when enabling flyspell-mode from the beginning of my sgml buffer. Cannot
reproduce with emacs -Q, still trying to find where this comes from. Both
problems tested with emacs-snapshot_20120410.

For Debian I do not use sgml-lexical-context, but an improved version of the
old regexp to try keeping things compatible with XEmacs. This seems to work
well and has some advantages over sgml-lexical-context:

  1) It is compatible with XEmacs.
  2) It is twice as fast as sgml-lexical-context when using flyspell-buffer.
  3) It does not trigger the above error.

I am considering using this improved regexp instead of sgml-lexical-context
for the above reasons, but this is another issue.

--
Agustin

^ permalink raw reply [flat|nested] 25+ messages in thread
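As a minimal illustration of the decoding step discussed above (a toy example, not ispell.el code): the \267 escape only becomes the middledot character once it is decoded with the coding system the entry declares, and only the decoded form is useful as a regexp against multibyte buffer text:

  ;; Toy example: OTHERCHARS as raw bytes vs. as decoded characters.
  (let* ((raw   "['\267-]")                              ; unibyte, as in the alist entry
         (chars (decode-coding-string raw 'iso-8859-1))) ; real characters: "['·-]"
    (list (multibyte-string-p raw)          ; => nil
          (multibyte-string-p chars)        ; => t
          (string-match chars "col·legi"))) ; => 3, the middledot matches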
* Re: Ispell and unibyte characters 2012-04-13 15:25 ` Agustin Martin @ 2012-04-13 15:53 ` Eli Zaretskii 2012-04-13 16:38 ` Agustin Martin 2012-04-13 17:51 ` Stefan Monnier 1 sibling, 1 reply; 25+ messages in thread From: Eli Zaretskii @ 2012-04-13 15:53 UTC (permalink / raw) To: Agustin Martin; +Cc: emacs-devel > Date: Fri, 13 Apr 2012 17:25:25 +0200 > From: Agustin Martin <agustin.martin@hispalinux.es> > > > I don't understand what are you trying to accomplish by encoding > > OTHERCHARS in UTF-8. What exactly is the problem with them being > > encoded in some 8-bit encoding? Please explain. > > Imagine a fake entry in the general list, either in ispell.el or provided > through `ispell-base-dicts-override-alist' (no accented chars for simplicity) > > ("catala8" > "[A-Za-z]" "[^A-Za-z]" "['\267-]" nil ("-B" "-d" "catalan") nil iso-8859-1) > > Unless emacs knows the encoding for \267 (middledot "·") it cannot decode it > properly. I prefer to not use UTF-8 here, because I want the entry to also be > useful for ispell (and also be XEmacs incompatible). The best approach here > seems to decode the otherchars regexp according to provided coding-system. > > I have noticed that there seems to be no need to encode the resulting string > in UTF-8, Emacs will know what to do with the decoded string. > > I tested something like > > (dolist (adict ispell-dictionary-alist) > (add-to-list 'tmp-dicts-alist > (list > (nth 0 adict) ; dict name > "[[:alpha:]]" ; casechars > "[^[:alpha:]]" ; not-casechars > (if ispell-encoding8-command > ;; Decode 8bit otherchars if needed > (decode-coding-string (nth 3 adict) (nth 7 adict)) > (nth 3 adict)) ; otherchars > (nth 4 adict) ; many-otherchars-p > (nth 5 adict) ; ispell-args > (nth 6 adict) ; extended-character-mode > (if ispell-encoding8-command > 'utf-8 > (nth 7 adict))))) > > and seems to work well. So you are taking the Catalan dictionary spec written for Ispell and convert it to a spec that could be used to support more characters by using UTF-8, is that right? If so, I find this a bit kludgey. How about having a completely separate spec instead? More generally, why not separate ispell-dictionary-alist into 2 alists, one to be used with Ispell, the other to be used with aspell and hunspell? I think this would be cleaner, don't you agree? ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Ispell and unibyte characters 2012-04-13 15:53 ` Eli Zaretskii @ 2012-04-13 16:38 ` Agustin Martin 0 siblings, 0 replies; 25+ messages in thread From: Agustin Martin @ 2012-04-13 16:38 UTC (permalink / raw) To: emacs-devel On Fri, Apr 13, 2012 at 06:53:57PM +0300, Eli Zaretskii wrote: > > Date: Fri, 13 Apr 2012 17:25:25 +0200 > > From: Agustin Martin <agustin.martin@hispalinux.es> > > > > > I don't understand what are you trying to accomplish by encoding > > > OTHERCHARS in UTF-8. What exactly is the problem with them being > > > encoded in some 8-bit encoding? Please explain. > > > > Imagine a fake entry in the general list, either in ispell.el or provided > > through `ispell-base-dicts-override-alist' (no accented chars for simplicity) > > > > ("catala8" > > "[A-Za-z]" "[^A-Za-z]" "['\267-]" nil ("-B" "-d" "catalan") nil iso-8859-1) > > > > Unless emacs knows the encoding for \267 (middledot "·") it cannot decode it > > properly. I prefer to not use UTF-8 here, because I want the entry to also be > > useful for ispell (and also be XEmacs incompatible). The best approach here > > seems to decode the otherchars regexp according to provided coding-system. > > > > I have noticed that there seems to be no need to encode the resulting string > > in UTF-8, Emacs will know what to do with the decoded string. > > > > I tested something like > > > > (dolist (adict ispell-dictionary-alist) > > (add-to-list 'tmp-dicts-alist > > (list > > (nth 0 adict) ; dict name > > "[[:alpha:]]" ; casechars > > "[^[:alpha:]]" ; not-casechars > > (if ispell-encoding8-command > > ;; Decode 8bit otherchars if needed > > (decode-coding-string (nth 3 adict) (nth 7 adict)) > > (nth 3 adict)) ; otherchars > > (nth 4 adict) ; many-otherchars-p > > (nth 5 adict) ; ispell-args > > (nth 6 adict) ; extended-character-mode > > (if ispell-encoding8-command > > 'utf-8 > > (nth 7 adict))))) > > > > and seems to work well. > > So you are taking the Catalan dictionary spec written for Ispell and > convert it to a spec that could be used to support more characters by > using UTF-8, is that right? If so, I find this a bit kludgey. I think differently and like above approach because I find it way more versatile for general definitions. This is not a matter of ispell blind reuse. In particular I noticed this problem in Debian with the catalan spec written for aspell (automatically created after info provided by aspell-ca package). That info is written that way to also be useful for XEmacs, but with above post-processing it can work way better for Emacs. > How > about having a completely separate spec instead? More generally, why > not separate ispell-dictionary-alist into 2 alists, one to be used > with Ispell, the other to be used with aspell and hunspell? I think > this would be cleaner, don't you agree? As a matter of fact that is what we do in Debian from info provided by ispell, aspell and hunspell dicts maintainers. The difference is that the provided info is supposed to be valid for both Emacs and XEmacs, so I find post-processing as above very useful, because it helps to take the best for Emacs. Global dicts alist is built from (dolist (dict (append found-dicts-alist ispell-base-dicts-override-alist ispell-dictionary-base-alist)) where first found wins. `found-dicts-alist' has the result of automatic search (currently used only for aspell) and has higher priority, `ispell-dictionary-base-alist' is the fallback alist having the lower priority. 
Depending on the spellchecker `ispell-base-dicts-override-alist' is set to an alist corresponding to ispell, aspell or hunspell dictionaries (they are handled independently) I do not think that maintaining separate hardcoded dict lists in ispell.el for ispell, aspell and hunspell worths. For hunspell, in the future I'd go for some sort of parsing mechanism like current one for aspell. -- Agustin ^ permalink raw reply [flat|nested] 25+ messages in thread
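A toy rendering of the "first found wins" merge described here (the alist contents are fake; the variable names are the ones mentioned in the message, bound locally for the example):

  ;; Fake data, same merge logic: earlier lists shadow later ones by name.
  (let ((found-dicts-alist                '(("english" "[[:alpha:]]")))
        (ispell-base-dicts-override-alist '(("english" "[A-Za-z]")
                                            ("catala"  "[A-Za-z]")))
        (ispell-dictionary-base-alist     '(("catala"  "[A-Za-z]")))
        all-dicts-alist)
    (dolist (dict (append found-dicts-alist
                          ispell-base-dicts-override-alist
                          ispell-dictionary-base-alist))
      (unless (assoc (car dict) all-dicts-alist)
        (push dict all-dicts-alist)))
    (nreverse all-dicts-alist))
  ;; => (("english" "[[:alpha:]]") ("catala" "[A-Za-z]"))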
* Re: Ispell and unibyte characters 2012-04-13 15:25 ` Agustin Martin 2012-04-13 15:53 ` Eli Zaretskii @ 2012-04-13 17:51 ` Stefan Monnier 2012-04-13 18:44 ` Agustin Martin 1 sibling, 1 reply; 25+ messages in thread From: Stefan Monnier @ 2012-04-13 17:51 UTC (permalink / raw) To: emacs-devel > ("catala8" > "[A-Za-z]" "[^A-Za-z]" "['\267-]" nil ("-B" "-d" "catalan") nil iso-8859-1) > Unless emacs knows the encoding for \267 (middledot "·") it cannot decode it > properly. I prefer to not use UTF-8 here, because I want the entry to also be > useful for ispell (and also be XEmacs incompatible). The best approach here > seems to decode the otherchars regexp according to provided coding-system. There's something I don't understand here: If you want a middle dot, why don't you put a middle dot? I mean why write "['\267-]" rather than ['·-]? I think this is related to your saying "I prefer to not use UTF-8 here", but again I don't know what you mean by "use UTF-8", because using a middle dot character in the source file does not imply using UTF-8 anywhere (the file can be saved in any encoding that includes the middle dot). For me notations like \267 should be used exclusively to talk about *bytes*, not about *chars*. So it might make sense to use those for things like matching particular bytes in [ia]spell's output, but it makes no sense to match chars in the buffer being spell-checked since the buffer does not contain bytes but chars. Stefan ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Ispell and unibyte characters 2012-04-13 17:51 ` Stefan Monnier @ 2012-04-13 18:44 ` Agustin Martin 2012-04-14 1:57 ` Stefan Monnier 0 siblings, 1 reply; 25+ messages in thread From: Agustin Martin @ 2012-04-13 18:44 UTC (permalink / raw) To: emacs-devel On Fri, Apr 13, 2012 at 01:51:15PM -0400, Stefan Monnier wrote: > > ("catala8" > > "[A-Za-z]" "[^A-Za-z]" "['\267-]" nil ("-B" "-d" "catalan") nil iso-8859-1) > > > Unless emacs knows the encoding for \267 (middledot "·") it cannot decode it > > properly. I prefer to not use UTF-8 here, because I want the entry to also be > > useful for ispell (and also be XEmacs incompatible). The best approach here > > seems to decode the otherchars regexp according to provided coding-system. > > There's something I don't understand here: > > If you want a middle dot, why don't you put a middle dot? > I mean why write "['\267-]" rather than ['·-]? The problem is that in a dictionary alist you can have dictionaries with different unibyte encodings, if you happen to have two of that chars in different encodings I'd expect problems. I really should have gone in more detail about the system where I noticed this, even if it is a bit Debian specific. I noticed this problem in aspell catalan entry provided by Debian aspell-ca package. In Debian for the different aspell {and ispell and hunspell} dictionaries alists are created on dictionary installation and stored in a file (for the curious /var/cache/dictionaries-common/emacsen-ispell-dicts.el). Some maintainers provide \xxx, some provide explicit chars in different encodings, and all that info it put together in dict alist form in that file, so it cannot be loaded with a given unique encoding but as 'raw-text, and that implies loading as bytes rather than as chars. > I think this is related to your saying "I prefer to not use UTF-8 here", > but again I don't know what you mean by "use UTF-8", because using > a middle dot character in the source file does not imply using UTF-8 > anywhere (the file can be saved in any encoding that includes the > middle dot). > > For me notations like \267 should be used exclusively to talk about > *bytes*, not about *chars*. So it might make sense to use those for > things like matching particular bytes in [ia]spell's output, but it > makes no sense to match chars in the buffer being spell-checked since > the buffer does not contain bytes but chars. That is why I want to decode those bytes into actual chars to be used in spellchecking, and make sure that they are decoded from correct coding-system. Otherwise if process coding-system is changed to UTF-8 and that stays as bytes matching the wrong encoding things may not work well. If there is a consensus that I should not go the decode- way for otherchars, I will not commit that part. For Debian I can simply keep loading emacsen-ispell-dicts.el as raw-text and do the decode- processing on its contents, before they are passed to ispell.el through `ispell-base-dicts-override-alist', so this last contains chars more that bytes. I however think that is better to keep the decode- stuff for more general use. I will wait at least a couple of days before committing so is clear what to do. Thanks all for your comments, -- Agustin ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Ispell and unibyte characters 2012-04-13 18:44 ` Agustin Martin @ 2012-04-14 1:57 ` Stefan Monnier 2012-04-15 0:02 ` Agustin Martin 0 siblings, 1 reply; 25+ messages in thread From: Stefan Monnier @ 2012-04-14 1:57 UTC (permalink / raw) To: emacs-devel >> If you want a middle dot, why don't you put a middle dot? >> I mean why write "['\267-]" rather than ['·-]? > The problem is that in a dictionary alist you can have dictionaries with > different unibyte encodings, if you happen to have two of that chars in > different encodings I'd expect problems. I still don't understand. Can you be more specific? >> For me notations like \267 should be used exclusively to talk about >> *bytes*, not about *chars*. So it might make sense to use those for >> things like matching particular bytes in [ia]spell's output, but it >> makes no sense to match chars in the buffer being spell-checked since >> the buffer does not contain bytes but chars. > That is why I want to decode those bytes into actual chars to be used in If I understand correctly what you mean by "those bytes", then using "·" instead of "\267" gives you the decoded form right away without having to do extra work. > spellchecking, and make sure that they are decoded from correct > coding-system. Otherwise if process coding-system is changed to UTF-8 and > that stays as bytes matching the wrong encoding things may not work well. I lost you here. I agree that "if it stays as bytes" you're going to suffer, which is why I propose to use chars instead. Stefan ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Ispell and unibyte characters 2012-04-14 1:57 ` Stefan Monnier @ 2012-04-15 0:02 ` Agustin Martin 2012-04-16 2:40 ` Stefan Monnier 0 siblings, 1 reply; 25+ messages in thread From: Agustin Martin @ 2012-04-15 0:02 UTC (permalink / raw) To: emacs-devel El día 14 de abril de 2012 03:57, Stefan Monnier <monnier@iro.umontreal.ca> escribió: >>> If you want a middle dot, why don't you put a middle dot? >>> I mean why write "['\267-]" rather than ['·-]? >> The problem is that in a dictionary alist you can have dictionaries with >> different unibyte encodings, if you happen to have two of that chars in >> different encodings I'd expect problems. > > I still don't understand. Can you be more specific? Imagine Catalan dictionary with iso-8859-1 "·" in otherchars and other dictionary (I am guessing the possibility to be more general, do not actually have a real example of something different from our Debian file with all info put together) with another upper char in otherchars, but in a different encoding (e.g., koi8r). The only possibility to have both coexist as chars in the same file is to use multibyte UTF-8 chars instead of mixed unibyte iso-8859-1 and koi8r, so Emacs properly gets chars when reading the file (if properly guessing file coding-system). XEmacs seems to be a bit more tricky regarding UTF-8, but I'd expect things to work once proper decoding is done. The traditional possibility is to use octal codes to represent bytes matching the char in dict declared charset. Using UTF-8 is actually what eliz proposed in the beginning of this thread. While one of ispell.el docstrings claims that only unibyte chars can be put here, I'd expect this to work, at least for Emacs. As a matter of fact, when I was first trying the (encode- (decode ..)) way I actually got UTF-8 (that was decoded again by `ispell-get-otherchars' according to new 'utf-8 coding-system) and seemed to work (apart from the psgml/sgml-lexical-context problem) for Emacs. At that time I did not notice that once Emacs loads something as char, encodings only matter when writing it (Yes, I am really learning all this encode-* decode-* stuff in more depth in this thread). I'd however use this only in personal ~/.emacs files and if needed. >>> For me notations like \267 should be used exclusively to talk about >>> *bytes*, not about *chars*. So it might make sense to use those for >>> things like matching particular bytes in [ia]spell's output, but it >>> makes no sense to match chars in the buffer being spell-checked since >>> the buffer does not contain bytes but chars. >> That is why I want to decode those bytes into actual chars to be used in > > If I understand correctly what you mean by "those bytes", then using "·" > instead of "\267" gives you the decoded form right away without having > to do extra work. That is true for files with a single encoding. However, the problem happens when a file has mixed encodings like in the Debian example I mentioned. I know, this will not happen in real manually edited files, but can happen and happens in aggregates like the one I mentioned. If file is loaded with a given coding-system-for-read chars in that coding-system will be properly interpreted by Emacs when reading, but not the others. 
Something like that happened with iso-8859-1/iso-8859-15 chars in http://bugs.debian.org/337214 and the simple way to avoid the mess was to read as 'raw-text, and that indeed reads upper chars as pure bytes although they were originally written as chars (I mean not through octal codes), no implicit on the fly "decoding/interpretation" at all. Not a big problem, we know the encoding for every single dict, so things can be properly decoded (\xxx + coding-system gives a char). If we later change default encoding for communication for entries in that file, we need to decode the bytes obtained from 'raw-text read to actual char so is internally handled as desired char. Changing it also to UTF-8 (and expecting ispell-get-otherchars to decode again to char) seems to work in Emacs, but also seems absolutely not needed. I am getting more and more convinced that this is a Debian-only problem because of the way we create that file, so I should handle this special case as Debian-only and do needed decoding there, not in ispell.el. -- Agustin ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Ispell and unibyte characters 2012-04-15 0:02 ` Agustin Martin @ 2012-04-16 2:40 ` Stefan Monnier 2012-04-20 15:25 ` Agustin Martin 0 siblings, 1 reply; 25+ messages in thread From: Stefan Monnier @ 2012-04-16 2:40 UTC (permalink / raw) To: Agustin Martin; +Cc: emacs-devel > Imagine Catalan dictionary with iso-8859-1 "·" in otherchars and other > dictionary (I am guessing the possibility to be more general, do not > actually have a real example of something different from our Debian > file with all info put together) with another upper char in > otherchars, but in a different encoding (e.g., koi8r). You're still living in Emacs-21/22: since Emacs-23, basically chars aren't associated with their encoding (actually charset) any more. > The only possibility to have both coexist as chars in the same file is > to use multibyte UTF-8 chars instead of mixed unibyte iso-8859-1 and > koi8r, so Emacs properly gets chars when reading the file (if properly > guessing file coding-system). Not at all, there are many encodings which cover the superset of iso-8859-* and koi8-*. UTF-8 is the more fashionable one nowadays, but not anywhere close to the only one. e.g. there's also iso-2022, emacs-mule, and then some. > I'd however use this only in personal ~/.emacs files and if needed. Why? It would make the code more clear and simpler. > That is true for files with a single encoding. However, the problem > happens when a file has mixed encodings like in the Debian example I > mentioned. I know, this will not happen in real manually edited files, > but can happen and happens in aggregates like the one I mentioned. That's an old solved problem. > If file is loaded with a given coding-system-for-read chars in that > coding-system will be properly interpreted by Emacs when reading, but > not the others. Something like that happened with > iso-8859-1/iso-8859-15 chars in That was then. Not any more. Stefan ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Ispell and unibyte characters 2012-04-16 2:40 ` Stefan Monnier @ 2012-04-20 15:25 ` Agustin Martin 2012-04-20 15:36 ` Eli Zaretskii 0 siblings, 1 reply; 25+ messages in thread From: Agustin Martin @ 2012-04-20 15:25 UTC (permalink / raw) To: emacs-devel [-- Attachment #1: Type: text/plain, Size: 4404 bytes --] On Sun, Apr 15, 2012 at 10:40:29PM -0400, Stefan Monnier wrote: > > Imagine Catalan dictionary with iso-8859-1 "·" in otherchars and other > > dictionary (I am guessing the possibility to be more general, do not > > actually have a real example of something different from our Debian > > file with all info put together) with another upper char in > > otherchars, but in a different encoding (e.g., koi8r). > > You're still living in Emacs-21/22: since Emacs-23, basically chars aren't > associated with their encoding (actually charset) any more. Not once inside Emacs, but when reading a file, encoding matters and in some corner cases with mixed charsets Emacs may get wrong chars (with no sane way to make it automatically get the right ones), see attached file. BTW, I do not even have Emacs-21/22 installed, I am testing this in Emacs23/24 (together with XEmacs to check that I do not introduce additional incompatibilities) > > The only possibility to have both coexist as chars in the same file is > > to use multibyte UTF-8 chars instead of mixed unibyte iso-8859-1 and > > koi8r, so Emacs properly gets chars when reading the file (if properly > > guessing file coding-system). > > Not at all, there are many encodings which cover the superset of > iso-8859-* and koi8-*. UTF-8 is the more fashionable one nowadays, but > not anywhere close to the only one. e.g. there's also iso-2022, > emacs-mule, and then some. Sorry, should have written something like supersets, put UTF-8 as an example. > > I'd however use this only in personal ~/.emacs files and if needed. > > Why? It would make the code more clear and simpler. To make my Debian changes minimal I prefer to keep compatibility with XEmacs when possible. That makes my life easier when adapting changes in FSF Emacs repo to Debian. Seems that XEmacs has very recently added support for automatic on-the-fly UTF-8 parsing, so my POV may change, but I admit I am currently biassed to the 7bit \xxx strings. Since Emacs should now (for some days) use [:alpha:] in "Casechars" and "Not-Casechars" for global dicts, I think we should not worry very much about this from Emacs side, just for Otherchars in the very few cases it contains an upper char (none in current ispell.el). And for that I still personally prefer keep using for now the 7bit string "\xxx". > > That is true for files with a single encoding. However, the problem > > happens when a file has mixed encodings like in the Debian example I > > mentioned. I know, this will not happen in real manually edited files, > > but can happen and happens in aggregates like the one I mentioned. > > That's an old solved problem. May be we are speaking about different things, but as I understand this, it does not seem so. And I do not think this can be solved in a robust enough way for all files. See attached file and comments below. > > If file is loaded with a given coding-system-for-read chars in that > > coding-system will be properly interpreted by Emacs when reading, but > > not the others. Something like that happened with > > iso-8859-1/iso-8859-15 chars in > > That was then. Not any more. I think you mean that iso-8859-* chars are currently unified. 
I am aware of that, but I am speaking about something different, mixed encodings, also discussed in that thread together with the iso-8859-1/iso-8859-15 problems. See attached file. It contains middledot in two encodings, UTF-8 in first line and latin1 in the second, together with something that was originally written as iso-8859-7 lowercase greek zeta. In my iso-8859-1 box emacs24 (emacs-snapshot_20120410) reads it as -- · · æ -- so gets the wrong char both for UTF-8 and for greek lowercase zeta. In a different environment Emacs may have guessed that first line is UTF-8, but I do not see a robust enough to properly guess all the mixed encodings for a small file like this. That is the kind of things I am now dealing with for Debian. Currently not a big problem for Emacs after [:alpha:] changes for casechars/not casechars (the chance that a new dict adds otherchars in incompatible charsets is small), however this can still happen us for XEmacs in that aggregated file. Sorry if I did not make that clear enough and helped making this thread this long. Regards, -- Agustin [-- Attachment #2: test.txt --] [-- Type: text/plain, Size: 10 bytes --] · · æ ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Ispell and unibyte characters 2012-04-20 15:25 ` Agustin Martin @ 2012-04-20 15:36 ` Eli Zaretskii 2012-04-20 16:17 ` Agustin Martin 0 siblings, 1 reply; 25+ messages in thread From: Eli Zaretskii @ 2012-04-20 15:36 UTC (permalink / raw) To: Agustin Martin; +Cc: emacs-devel > Date: Fri, 20 Apr 2012 17:25:32 +0200 > From: Agustin Martin <agustin.martin@hispalinux.es> > > > That was then. Not any more. > > I think you mean that iso-8859-* chars are currently unified. I am aware of > that, but I am speaking about something different, mixed encodings, also > discussed in that thread together with the iso-8859-1/iso-8859-15 problems. > > See attached file. It contains middledot in two encodings, UTF-8 in first > line and latin1 in the second, together with something that was originally > written as iso-8859-7 lowercase greek zeta. Why should we care about files that mix encodings? We were talking about dictionary definitions in ispell.el, and that file will surely NOT mix encodings. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Ispell and unibyte characters 2012-04-20 15:36 ` Eli Zaretskii @ 2012-04-20 16:17 ` Agustin Martin 2012-04-21 2:17 ` Stefan Monnier 0 siblings, 1 reply; 25+ messages in thread From: Agustin Martin @ 2012-04-20 16:17 UTC (permalink / raw) To: emacs-devel On Fri, Apr 20, 2012 at 06:36:40PM +0300, Eli Zaretskii wrote: > > Date: Fri, 20 Apr 2012 17:25:32 +0200 > > From: Agustin Martin <agustin.martin@hispalinux.es> > > > > > That was then. Not any more. > > > > I think you mean that iso-8859-* chars are currently unified. I am aware of > > that, but I am speaking about something different, mixed encodings, also > > discussed in that thread together with the iso-8859-1/iso-8859-15 problems. > > > > See attached file. It contains middledot in two encodings, UTF-8 in first > > line and latin1 in the second, together with something that was originally > > written as iso-8859-7 lowercase greek zeta. > > Why should we care about files that mix encodings? We were talking > about dictionary definitions in ispell.el, and that file will surely > NOT mix encodings. Was just trying to explain why I was dealing with this for Debian. Changes committed to Emacs bzr repo did not try to deal with this. -- Agustin ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Ispell and unibyte characters 2012-04-20 16:17 ` Agustin Martin @ 2012-04-21 2:17 ` Stefan Monnier 0 siblings, 0 replies; 25+ messages in thread From: Stefan Monnier @ 2012-04-21 2:17 UTC (permalink / raw) To: emacs-devel > Was just trying to explain why I was dealing with this for Debian. > Changes committed to Emacs bzr repo did not try to deal with this. I still really have no clue what problem you're talking about. ispell.el operates on buffers, so file encodings do not affect it, and the only file involved is ispell.el itself, where we can choose the encoding to be "non mixed". And all of that should apply to Debian as well as to any other environment where ispell.el might be used/distributed. Stefan ^ permalink raw reply [flat|nested] 25+ messages in thread