From: Agustin Martin
Newsgroups: gmane.emacs.devel
Subject: Re: Ispell and unibyte characters
Date: Fri, 13 Apr 2012 17:25:25 +0200
Message-ID: <20120413152525.GA14949@agmartin.aq.upm.es>
References: <83aa3f2hgh.fsf@gnu.org>
 <20120326173912.GA22306@agmartin.aq.upm.es>
 <20120328191821.GA6266@agmartin.aq.upm.es>
 <20120410190803.GA13517@agmartin.aq.upm.es>
 <83ty0r5rmd.fsf@gnu.org>
 <20120412143657.GA18352@agmartin.aq.upm.es>
 <83d37c4vw5.fsf@gnu.org>
In-Reply-To: <83d37c4vw5.fsf@gnu.org>
To: emacs-devel@gnu.org

On Thu, Apr 12, 2012 at 10:01:30PM +0300, Eli Zaretskii wrote:
> I wrote:
> > I am still dealing with an open issue here. Some languages have non 7bit
> > wordchars, like Catalan middledot, and it should be converted to UTF-8 if
> > default communication language is changed to UTF-8.
>
> Sorry, I don't understand: do you mean "non 8-bit wordchars"? I don't
> think 7 bits is assumed anywhere.

I mean wordchars that cannot be represented in 7bit encoding, like Catalan
middledot (available in 8bit latin1)

> Assuming you did mean 8-bit, then why not use UTF-8 for Catalan from
> the get-go? Only some languages can use single-byte encodings, and
> evidently Catalan is not one of them. For that matter, why shouldn't
> aspell and hunspell use UTF-8 by default (something I already asked)?

[...]

> I don't understand what are you trying to accomplish by encoding
> OTHERCHARS in UTF-8.
> What exactly is the problem with them being
> encoded in some 8-bit encoding? Please explain.

Imagine a fake entry in the general list, either in ispell.el or provided
through `ispell-base-dicts-override-alist' (no accented chars for simplicity)

("catala8"
 "[A-Za-z]" "[^A-Za-z]" "['\267-]" nil ("-B" "-d" "catalan") nil iso-8859-1)

Unless emacs knows the encoding for \267 (middledot "·") it cannot decode it
properly. I prefer to not use UTF-8 here, because I want the entry to also be
useful for ispell (and also be XEmacs incompatible). The best approach here
seems to be to decode the otherchars regexp according to the provided
coding-system.

I have noticed that there seems to be no need to encode the resulting string
in UTF-8; Emacs will know what to do with the decoded string. I tested
something like

(dolist (adict ispell-dictionary-alist)
  (add-to-list 'tmp-dicts-alist
               (list
                (nth 0 adict)   ; dict name
                "[[:alpha:]]"   ; casechars
                "[^[:alpha:]]"  ; not-casechars
                (if ispell-encoding8-command
                    ;; Decode 8bit otherchars if needed
                    (decode-coding-string (nth 3 adict) (nth 7 adict))
                  (nth 3 adict)) ; otherchars
                (nth 4 adict)   ; many-otherchars-p
                (nth 5 adict)   ; ispell-args
                (nth 6 adict)   ; extended-character-mode
                (if ispell-encoding8-command
                    'utf-8
                  (nth 7 adict)))))

and it seems to work well.

> I wrote:
> > but get a sgml-lexical-context error. Need to look more carefuly, so this
> > will take longer.

I have tested further and this seems to be an unrelated problem. Some time
ago I already noticed some problems with flyspell.el and sgml mode (in
particular psgml) regarding a sgml-lexical-context error

sgml-lexical-context: Wrong type argument: stringp, nil

sometimes when running flyspell-buffer after enabling flyspell-mode. I am
also seeing something like

Error in post-command-hook (flyspell-post-command-hook): (wrong-type-argument stringp nil)

when enabling flyspell-mode from the beginning of my sgml buffer.
I cannot reproduce this with emacs -Q and am still trying to find where it
comes from. Both problems were tested with emacs-snapshot_20120410.

For Debian I do not use sgml-lexical-context, but an improved version of the
old regexp to try keeping things compatible with XEmacs. This seems to work
well and has some advantages over sgml-lexical-context

1) Is compatible with XEmacs
2) Is twice as fast as sgml-lexical-context when using flyspell-buffer
3) Does not trigger the above error.

I am considering using this improved regexp instead of sgml-lexical-context
for the above reasons, but this is another issue.

-- 
Agustin