bug#8308: 23.3; Use utf-8 for writing abbrev file


* bug#8308: 23.3; Use utf-8 for writing abbrev file
@ 2011-03-21  6:22 Leo
  2011-03-21  9:00 ` Eli Zaretskii
  2011-03-21 14:50 ` Stefan Monnier
  0 siblings, 2 replies; 17+ messages in thread
From: Leo @ 2011-03-21  6:22 UTC (permalink / raw)
  To: 8308

Is it OK to change the encoding for the abbrev file to utf-8?

=== modified file 'lisp/abbrev.el'
--- a/lisp/abbrev.el	2011-03-21 05:49:12 +0000
+++ b/lisp/abbrev.el	2011-03-21 06:20:36 +0000
@@ -225,9 +225,9 @@
 		    abbrev-file-name)))
   (or (and file (> (length file) 0))
       (setq file abbrev-file-name))
-  (let ((coding-system-for-write 'emacs-mule))
+  (let ((coding-system-for-write 'utf-8))
     (with-temp-file file
-      (insert ";;-*-coding: emacs-mule;-*-\n")
+      (insert ";;-*-coding: utf-8;-*-\n")
       (dolist (table
                ;; We sort the table in order to ease the automatic
                ;; merging of different versions of the user's abbrevs


Leo






* bug#8308: 23.3; Use utf-8 for writing abbrev file
  2011-03-21  6:22 bug#8308: 23.3; Use utf-8 for writing abbrev file Leo
@ 2011-03-21  9:00 ` Eli Zaretskii
  2011-03-21 10:01   ` Leo
  2011-03-21 14:50 ` Stefan Monnier
  1 sibling, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2011-03-21  9:00 UTC (permalink / raw)
  To: Leo; +Cc: 8308

> From: Leo <sdl.web@gmail.com>
> Date: Mon, 21 Mar 2011 14:22:24 +0800
> Cc: 
> 
> Is it OK to change the encoding for abbrev file to utf-8?

What will that do to characters that are not unified into the range of
valid Unicode code points?

Can you tell what is the purpose of this change?







* bug#8308: 23.3; Use utf-8 for writing abbrev file
  2011-03-21  9:00 ` Eli Zaretskii
@ 2011-03-21 10:01   ` Leo
  2011-03-21 10:54     ` Eli Zaretskii
  0 siblings, 1 reply; 17+ messages in thread
From: Leo @ 2011-03-21 10:01 UTC (permalink / raw)
  To: bug-gnu-emacs

On 2011-03-21 17:00 +0800, Eli Zaretskii wrote:
>> From: Leo <sdl.web@gmail.com>
>> Date: Mon, 21 Mar 2011 14:22:24 +0800
>> Cc: 
>> 
>> Is it OK to change the encoding for abbrev file to utf-8?
>
> What will that do to characters that are not unified into the range of
> valid Unicode code points?

That's a valid concern. But

,----
| M -- emacs-mule
| 
| Emacs 21 internal format used in buffer and string.
| Type: emacs-mule (Emacs 21 internal encoding)
| EOL type: Automatic selection from:
| 	[emacs-mule-unix emacs-mule-dos emacs-mule-mac]
| This coding system can encode all emacs-mule charsets.
| 
| [back]
`----

,----[ (info "(elisp)Text Representations") ]
|    (1) This internal representation is based on one of the encodings
| defined by the Unicode Standard, called "UTF-8", for representing any
| Unicode codepoint, but Emacs extends UTF-8 to represent the additional
| codepoints it uses for raw 8-bit bytes and characters not unified with
| Unicode.
`----

Would you agree to using utf-8-emacs instead, which covers all characters?

>
> Can you tell what is the purpose of this change?

To make the abbrev file editable in other editors.

Leo


* bug#8308: 23.3; Use utf-8 for writing abbrev file
  2011-03-21 10:01   ` Leo
@ 2011-03-21 10:54     ` Eli Zaretskii
  2011-03-21 11:26       ` Andreas Röhler
  0 siblings, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2011-03-21 10:54 UTC (permalink / raw)
  To: Leo; +Cc: bug-gnu-emacs

> From: Leo <sdl.web@gmail.com>
> Date: Mon, 21 Mar 2011 18:01:17 +0800
> Cc: 
> 
> Would you agree to use utf-8-emacs instead, which covers all characters.

That's better, but the characters outside Unicode are still going to
do bad things to any software except Emacs.  AFAIK, emacs-mule is a
superset of iso-2022 in the same way as utf-8-emacs is a superset of
utf-8.

> > Can you tell what is the purpose of this change?
> 
> Make abbrev file editable to other editors.

If we are really keen on making the abbrev files editable to other
editors, we should make sure they are encoded in some encoding that
these other editors will understand.  That probably calls for using
utf-8 for everything that's covered by Unicode, and using other
appropriate encodings for characters outside Unicode.
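
The superset relationship can be illustrated for text that is fully
covered by Unicode. The following is a minimal sketch, not part of any
patch in this thread: the helper name `my-same-bytes-p' is hypothetical,
while `encode-coding-string' is the standard function. It only compares
the bytes that two coding systems produce for one sample string and says
nothing about characters outside Unicode.

```elisp
;; Hypothetical helper, for illustration only.
(defun my-same-bytes-p (string coding-a coding-b)
  "Return non-nil if STRING encodes to the same bytes under CODING-A and CODING-B."
  (equal (encode-coding-string string coding-a)
         (encode-coding-string string coding-b)))

;; "café" is built from character codes so that the example does not
;; depend on how this source text itself is encoded.
(my-same-bytes-p (string ?c ?a ?f #xE9) 'utf-8 'utf-8-emacs)
```

For a string like this one, which contains only Unicode characters, the
call is expected to return non-nil, i.e. another editor reading the file
as utf-8 would see the same bytes either way.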


* bug#8308: 23.3; Use utf-8 for writing abbrev file
  2011-03-21 10:54     ` Eli Zaretskii
@ 2011-03-21 11:26       ` Andreas Röhler
  0 siblings, 0 replies; 17+ messages in thread
From: Andreas Röhler @ 2011-03-21 11:26 UTC (permalink / raw)
  To: bug-gnu-emacs

Am 21.03.2011 11:54, schrieb Eli Zaretskii:
>> From: Leo<sdl.web@gmail.com>
>> Date: Mon, 21 Mar 2011 18:01:17 +0800
>> Cc:
>>
>> Would you agree to use utf-8-emacs instead, which covers all characters.
>
> That's better, but the characters outside Unicode are still going to
> do bad things to any software except Emacs.  AFAIK, emacs-mule is a
> superset of iso-2022 in the same way as utf-8-emacs is a superset of
> utf-8.
>
>>> Can you tell what is the purpose of this change?
>>
>> Make abbrev file editable to other editors.
>
> If we are really keen on making the abbrev files editable to other
> editors, we should make sure they are encoded in some encoding that
> these other editors will understand.  That probably calls for using
> utf-8 for everything that's covered by Unicode, and using other
> appropriate encodings for characters outside Unicode.
>
>
>
>

Hi,

this sounds interesting to me, since AFAIU not just other editors are at
stake, but also auto-generated abbrevs produced by programs.

These might be topic-specific, covering terms from medicine, law, etc.
Modes could then offer preloaded abbrevs suited to the subject being written about.

Regards,

Andreas











* bug#8308: 23.3; Use utf-8 for writing abbrev file
  2011-03-21  6:22 bug#8308: 23.3; Use utf-8 for writing abbrev file Leo
  2011-03-21  9:00 ` Eli Zaretskii
@ 2011-03-21 14:50 ` Stefan Monnier
  2011-03-21 15:37   ` Leo
  2011-03-21 18:24   ` Andreas Röhler
  1 sibling, 2 replies; 17+ messages in thread
From: Stefan Monnier @ 2011-03-21 14:50 UTC (permalink / raw)
  To: Leo; +Cc: 8308

> Is it OK to change the encoding for abbrev file to utf-8?
> === modified file 'lisp/abbrev.el'
> --- a/lisp/abbrev.el	2011-03-21 05:49:12 +0000
> +++ b/lisp/abbrev.el	2011-03-21 06:20:36 +0000
> @@ -225,9 +225,9 @@
>  		    abbrev-file-name)))
>    (or (and file (> (length file) 0))
>        (setq file abbrev-file-name))
> -  (let ((coding-system-for-write 'emacs-mule))
> +  (let ((coding-system-for-write 'utf-8))
>      (with-temp-file file
> -      (insert ";;-*-coding: emacs-mule;-*-\n")
> +      (insert ";;-*-coding: utf-8;-*-\n")
>        (dolist (table
>                 ;; We sort the table in order to ease the automatic
>                 ;; merging of different versions of the user's abbrevs

Sounds good in general, but I'm wondering whether we should worry about
the presence of abbrevs which include bytes (aka eight-bit-chars).
Using `utf-8-emacs' should fix those issues, but would then bump into
the problem that such abbrev files wouldn't be compatible with Emacs-22.


        Stefan






* bug#8308: 23.3; Use utf-8 for writing abbrev file
  2011-03-21 14:50 ` Stefan Monnier
@ 2011-03-21 15:37   ` Leo
  2011-03-21 18:45     ` Eli Zaretskii
  2011-03-21 18:24   ` Andreas Röhler
  1 sibling, 1 reply; 17+ messages in thread
From: Leo @ 2011-03-21 15:37 UTC (permalink / raw)
  To: bug-gnu-emacs

On 2011-03-21 22:50 +0800, Stefan Monnier wrote:
> Sounds good in general, but I'm wondering whether we should worry about
> the presence of abbrevs which include bytes (aka eight-bit-chars).
> Using `utf-8-emacs' should fix those issues, but would then bump into
> the problem that such abbrev files wouldn't be compatible with Emacs-22.

I think we should just use utf-8-emacs. What do other people think?

Leo







* bug#8308: 23.3; Use utf-8 for writing abbrev file
  2011-03-21 14:50 ` Stefan Monnier
  2011-03-21 15:37   ` Leo
@ 2011-03-21 18:24   ` Andreas Röhler
  2011-03-21 18:53     ` Eli Zaretskii
  1 sibling, 1 reply; 17+ messages in thread
From: Andreas Röhler @ 2011-03-21 18:24 UTC (permalink / raw)
  To: bug-gnu-emacs

Am 21.03.2011 15:50, schrieb Stefan Monnier:
>> Is it OK to change the encoding for abbrev file to utf-8?
>> === modified file 'lisp/abbrev.el'
>> --- a/lisp/abbrev.el	2011-03-21 05:49:12 +0000
>> +++ b/lisp/abbrev.el	2011-03-21 06:20:36 +0000
>> @@ -225,9 +225,9 @@
>>   		    abbrev-file-name)))
>>     (or (and file (>  (length file) 0))
>>         (setq file abbrev-file-name))
>> -  (let ((coding-system-for-write 'emacs-mule))
>> +  (let ((coding-system-for-write 'utf-8))
>>       (with-temp-file file
>> -      (insert ";;-*-coding: emacs-mule;-*-\n")
>> +      (insert ";;-*-coding: utf-8;-*-\n")
>>         (dolist (table
>>                  ;; We sort the table in order to ease the automatic
>>                  ;; merging of different versions of the user's abbrevs
>
> Sounds good in general, but I'm wondering whether we should worry about
> the presence of abbrevs which include bytes (aka eight-bit-chars).
> Using `utf-8-emacs' should fix those issues, but would then bump into
> the problem that such abbrev files wouldn't be compatible with Emacs-22.
>
>
>          Stefan
>

Hi,

so maybe don't hard-code it, but rather have a variable?

Andreas






* bug#8308: 23.3; Use utf-8 for writing abbrev file
  2011-03-21 15:37   ` Leo
@ 2011-03-21 18:45     ` Eli Zaretskii
  2011-03-22  1:00       ` Leo
  0 siblings, 1 reply; 17+ messages in thread
From: Eli Zaretskii @ 2011-03-21 18:45 UTC (permalink / raw)
  To: Leo; +Cc: bug-gnu-emacs

> From: Leo <sdl.web@gmail.com>
> Date: Mon, 21 Mar 2011 23:37:41 +0800
> Cc: 
> 
> I think we should just use utf-8-emacs.

Why do you think so?






* bug#8308: 23.3; Use utf-8 for writing abbrev file
  2011-03-21 18:24   ` Andreas Röhler
@ 2011-03-21 18:53     ` Eli Zaretskii
  0 siblings, 0 replies; 17+ messages in thread
From: Eli Zaretskii @ 2011-03-21 18:53 UTC (permalink / raw)
  To: Andreas Röhler; +Cc: bug-gnu-emacs

> Date: Mon, 21 Mar 2011 19:24:16 +0100
> From: Andreas Röhler <andreas.roehler@easy-emacs.de>
> Cc: 
> 
> so maybe not hard-code it, rather have a variable?

A constant encoding will never DTRT in all cases.







* bug#8308: 23.3; Use utf-8 for writing abbrev file
  2011-03-21 18:45     ` Eli Zaretskii
@ 2011-03-22  1:00       ` Leo
  2011-03-22  2:48         ` Stefan Monnier
  0 siblings, 1 reply; 17+ messages in thread
From: Leo @ 2011-03-22  1:00 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: bug-gnu-emacs

On 2011-03-22 02:45 +0800, Eli Zaretskii wrote:
>> I think we should just use utf-8-emacs.
>
> Why do you think so?

24.1 will be released 1-2 years from now, and by then there will be two
major stable releases that work with utf-8-emacs, which should be
backward-compatible enough. But I'm not sure, so I'll forget about this
bug and let the gurus figure it out.

Leo


* bug#8308: 23.3; Use utf-8 for writing abbrev file
  2011-03-22  1:00       ` Leo
@ 2011-03-22  2:48         ` Stefan Monnier
  2011-03-22  3:47           ` Leo
  0 siblings, 1 reply; 17+ messages in thread
From: Stefan Monnier @ 2011-03-22  2:48 UTC (permalink / raw)
  To: Leo; +Cc: bug-gnu-emacs

>>> I think we should just use utf-8-emacs.
>> Why do you think so?
> By the time 24.1 is released, it will be 1-2 years from now and there
> will be two major stable releases that work with utf-8-emacs, which are
> backward-compatible enough. But I don't know so I'll forget about this
> bug and let the gurus figure it out.

I think it might be OK to do it for Emacs-25, but since Emacs-22 can't
handle utf-8-emacs, I think it's a bit early to switch to it in
Emacs-24.  If utf-8 is sufficient, OTOH it's the best choice.  So maybe
we should check the buffer first to see if utf-8 is safe, and only fall
back to emacs-mule if utf-8 is not safe.
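
The check described above can be sketched as follows. This is only an
illustration under stated assumptions: the helper name
`my-abbrev-coding-for-buffer' and its FALLBACK argument are hypothetical
and not part of abbrev.el, while `unencodable-char-position' is the
standard function, which returns nil when every character in the region
can be encoded by the given coding system.

```elisp
;; Hypothetical helper, not part of abbrev.el.
(defun my-abbrev-coding-for-buffer (&optional fallback)
  "Return `utf-8' if the current buffer is encodable in it, else FALLBACK.
FALLBACK defaults to `emacs-mule'."
  (if (unencodable-char-position (point-min) (point-max) 'utf-8)
      ;; Some character cannot be written as plain utf-8.
      (or fallback 'emacs-mule)
    'utf-8))

;; Usage: a buffer holding only ASCII text is safe to write as utf-8.
(with-temp-buffer
  (insert "(\"teh\" \"the\")")
  (my-abbrev-coding-for-buffer))
```

Deciding after the buffer has been filled, rather than fixing the coding
system up front, keeps the common case readable by other editors and
reserves the Emacs-specific encoding for files that really need it.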

        Stefan


* bug#8308: 23.3; Use utf-8 for writing abbrev file
  2011-03-22  2:48         ` Stefan Monnier
@ 2011-03-22  3:47           ` Leo
  2011-03-22  5:24             ` Stefan Monnier
  0 siblings, 1 reply; 17+ messages in thread
From: Leo @ 2011-03-22  3:47 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: bug-gnu-emacs

On 2011-03-22 10:48 +0800, Stefan Monnier wrote:
> I think it might be OK to do it for Emacs-25, but since Emacs-22 can't
> handle utf-8-emacs, I think it's a bit early to switch to it in
> Emacs-24.  If utf-8 is sufficient, OTOH it's the best choice.  So maybe
> we should check the buffer first to see if utf-8 is safe, and only fall
> back to emacs-mule if utf-8 is not safe.

I think defaulting to utf-8 is good; it is sufficient for most people.
Any comments on the following patch? I don't know how to introduce a
char that is unencodable in utf-8 into the abbrevs, so it is only
partially tested.


=== modified file 'lisp/abbrev.el'
--- lisp/abbrev.el	2011-01-25 04:08:28 +0000
+++ lisp/abbrev.el	2011-03-22 03:30:52 +0000
@@ -225,21 +225,29 @@
 		    abbrev-file-name)))
   (or (and file (> (length file) 0))
       (setq file abbrev-file-name))
-  (let ((coding-system-for-write 'emacs-mule))
-    (with-temp-file file
-      (insert ";;-*-coding: emacs-mule;-*-\n")
+  (let ((coding-system-for-write 'utf-8))
+    (with-temp-buffer
       (dolist (table
-               ;; We sort the table in order to ease the automatic
-               ;; merging of different versions of the user's abbrevs
-               ;; file.  This is useful, for example, for when the
-               ;; user keeps their home directory in a revision
-               ;; control system, and is therefore keeping multiple
-               ;; slightly-differing copies loosely synchronized.
-               (sort (copy-sequence abbrev-table-name-list)
-                     (lambda (s1 s2)
-                       (string< (symbol-name s1)
-                                (symbol-name s2)))))
-	(insert-abbrev-table-description table nil)))))
+	       ;; We sort the table in order to ease the automatic
+	       ;; merging of different versions of the user's abbrevs
+	       ;; file.  This is useful, for example, for when the
+	       ;; user keeps their home directory in a revision
+	       ;; control system, and is therefore keeping multiple
+	       ;; slightly-differing copies loosely synchronized.
+	       (sort (copy-sequence abbrev-table-name-list)
+		     (lambda (s1 s2)
+		       (string< (symbol-name s1)
+				(symbol-name s2)))))
+	(insert-abbrev-table-description table nil))
+      (when (unencodable-char-position (point-min) (point-max) 'utf-8)
+	(setq coding-system-for-write
+	      (if (> emacs-major-version 24)
+		  'utf-8-emacs
+		;; For compatibility with Emacs 22
+		'emacs-mule)))
+      (goto-char (point-min))
+      (insert (format ";;-*-coding: %s;-*-\n" coding-system-for-write))
+      (write-region nil nil file nil 0))))
 \f
 (defun add-mode-abbrev (arg)
   "Define mode-specific abbrev for last word(s) before point.







* bug#8308: 23.3; Use utf-8 for writing abbrev file
  2011-03-22  3:47           ` Leo
@ 2011-03-22  5:24             ` Stefan Monnier
  2011-03-22 10:41               ` Leo
  0 siblings, 1 reply; 17+ messages in thread
From: Stefan Monnier @ 2011-03-22  5:24 UTC (permalink / raw)
  To: Leo; +Cc: bug-gnu-emacs

> I think default to utf-8 is good, which is sufficient for most people.
> Any comments on the following patch? I don't know how to introduce a
> char unencodable with utf-8 to the abbrevs. So it is only partially
> tested.

(unibyte-string 129) returns a string containing an unencodable char.
So you can test with it.
The patch looks good.


        Stefan






* bug#8308: 23.3; Use utf-8 for writing abbrev file
  2011-03-22  5:24             ` Stefan Monnier
@ 2011-03-22 10:41               ` Leo
  2011-03-22 18:27                 ` Stefan Monnier
  0 siblings, 1 reply; 17+ messages in thread
From: Leo @ 2011-03-22 10:41 UTC (permalink / raw)
  To: bug-gnu-emacs

On 2011-03-22 13:24 +0800, Stefan Monnier wrote:
> (unibyte-string 129) returns a string containing an unencodable char.
> So you can test with it.

I still cannot get any byte into the abbrevs. For example,
(unibyte-string 129) returns the byte \201, but when it is written to the
abbrev file by write-abbrev-file, it is changed to \ 2 0 1, so utf-8
appears sufficient even for bytes.

Leo







* bug#8308: 23.3; Use utf-8 for writing abbrev file
  2011-03-22 10:41               ` Leo
@ 2011-03-22 18:27                 ` Stefan Monnier
  2011-03-23  0:42                   ` Leo
  0 siblings, 1 reply; 17+ messages in thread
From: Stefan Monnier @ 2011-03-22 18:27 UTC (permalink / raw)
  To: Leo; +Cc: bug-gnu-emacs

>> (unibyte-string 129) returns a string containing an unencodable char.
>> So you can test with it.
> I still cannot get any byte into the abbrevs. For example,
> (unibyte-string 129) returns byte \201 but when it is written to abbrev
> file by write-abbrev-file, it is changed to \ 2 0 1, so utf-8 appear
> sufficient even for bytes.

Good.  In any case your unencodable-foo test would trigger if there were
eight-bit-chars in there, so it works correctly in this respect.
Please install your patch.


        Stefan






* bug#8308: 23.3; Use utf-8 for writing abbrev file
  2011-03-22 18:27                 ` Stefan Monnier
@ 2011-03-23  0:42                   ` Leo
  0 siblings, 0 replies; 17+ messages in thread
From: Leo @ 2011-03-23  0:42 UTC (permalink / raw)
  To: 8308-done

Version: 24.1.






end of thread, other threads:[~2011-03-23  0:42 UTC | newest]

Thread overview: 17+ messages
2011-03-21  6:22 bug#8308: 23.3; Use utf-8 for writing abbrev file Leo
2011-03-21  9:00 ` Eli Zaretskii
2011-03-21 10:01   ` Leo
2011-03-21 10:54     ` Eli Zaretskii
2011-03-21 11:26       ` Andreas Röhler
2011-03-21 14:50 ` Stefan Monnier
2011-03-21 15:37   ` Leo
2011-03-21 18:45     ` Eli Zaretskii
2011-03-22  1:00       ` Leo
2011-03-22  2:48         ` Stefan Monnier
2011-03-22  3:47           ` Leo
2011-03-22  5:24             ` Stefan Monnier
2011-03-22 10:41               ` Leo
2011-03-22 18:27                 ` Stefan Monnier
2011-03-23  0:42                   ` Leo
2011-03-21 18:24   ` Andreas Röhler
2011-03-21 18:53     ` Eli Zaretskii
