string-as-unibyte

unofficial mirror of emacs-devel@gnu.org 

* string-as-unibyte
@ 2005-07-18 21:33 Stefan Monnier
  2005-07-18 22:41 ` string-as-unibyte YAMAMOTO Mitsuharu
  0 siblings, 1 reply; 8+ messages in thread
From: Stefan Monnier @ 2005-07-18 21:33 UTC
  Cc: emacs-devel

Could you explain the need for the change below:

2005-07-16  YAMAMOTO Mitsuharu  <mituharu@math.s.chiba-u.ac.jp>

	* mac.c [TARGET_API_MAC_CARBON] (Fmac_code_convert_string):
	Use Fstring_as_unibyte instead of string_make_unibyte.

My experience is that string-as-unibyte is extremely rarely the right answer
to solve a problem.  If you described your motivation, I could add a comment
in the code making it clear why this is needed here (or else come up with
a better solution).

        Stefan
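For context, the two C functions named in the ChangeLog entry have Lisp-level counterparts, `string-make-unibyte' and `string-as-unibyte', and the difference shows up on a string holding a genuine non-ASCII character. The following is only an illustrative sketch, not code from the thread. It assumes Emacs 23 or later, where the internal representation is UTF-8 based; in the Emacs 22 sources under discussion the internal bytes were emacs-mule, so the second result would differ. Both functions have been marked obsolete since Emacs 26.1, and `demo-unibyte-views' is a hypothetical helper name.

```elisp
;; Illustrative sketch only; `demo-unibyte-views' is a hypothetical helper.
(defun demo-unibyte-views (string)
  "Return a list (MAKE AS) of two unibyte conversions of STRING."
  (list (string-make-unibyte string)   ; keeps the low 8 bits of each character
        (string-as-unibyte string)))   ; exposes the internal byte sequence

;; U+00E9 is built with `string' so the example does not depend on the
;; encoding of the file it is pasted into.
(demo-unibyte-views (string #xE9))
;; On Emacs 23 or later this evaluates to ("\351" "\303\251").
```

Neither result is "the" right one in general; which conversion is appropriate depends on what the caller believes the multibyte string contains, which is the question raised above.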


* Re: string-as-unibyte
  2005-07-18 21:33 string-as-unibyte Stefan Monnier
@ 2005-07-18 22:41 ` YAMAMOTO Mitsuharu
  2005-07-18 23:52   ` string-as-unibyte YAMAMOTO Mitsuharu
  2005-07-19  2:56   ` string-as-unibyte Kenichi Handa
  0 siblings, 2 replies; 8+ messages in thread
From: YAMAMOTO Mitsuharu @ 2005-07-18 22:41 UTC
  Cc: emacs-devel

>>>>> On Mon, 18 Jul 2005 17:33:02 -0400, Stefan Monnier <monnier@iro.umontreal.ca> said:

> Could you explain the need for the change below:

> 2005-07-16 YAMAMOTO Mitsuharu <mituharu@math.s.chiba-u.ac.jp>

> 	* mac.c [TARGET_API_MAC_CARBON] (Fmac_code_convert_string):
> Use Fstring_as_unibyte instead of string_make_unibyte.

It is at the preparation stage of code conversion.  So I think the
following comment in decode_coding_string (coding.c) is also
applicable to this case.

  if (STRING_MULTIBYTE (str))
    {
      /* Decoding routines expect the source text to be unibyte.  */
      str = Fstring_as_unibyte (str);

> My experience is that string-as-unibyte is extremely rarely the
> right answer to solve a problem.  If you described your motivation,
> I could add a comment in the code making it clear why this is needed
> here (or else come up with a better solution).

I was trying to make a coding system that almost works as utf-8, but
additionally does "HFS+ composition" (canonical composition with some
exclusions) on decoding.

				     YAMAMOTO Mitsuharu
				mituharu@math.s.chiba-u.ac.jp

;; For the Carbon port, Mac OS X 10.2 or later.
(make-coding-system
 'mac-hfs+
 0
 (coding-system-mnemonic 'utf-8)
 "Like utf-8, but additionally does Mac HFS+ composition on decoding."
 (coding-system-flags 'utf-8)
 (list (cons 'safe-charsets (coding-system-get 'utf-8 'safe-charsets))
       '(post-read-conversion . mac-hfs+-post-read-conversion)
       '(pre-write-conversion . mac-hfs+-pre-write-conversion)))

(defun mac-hfs+-post-read-conversion (length)
  (save-excursion
    (save-restriction
      (narrow-to-region (point) (+ (point) length))
      (let ((str (mac-code-convert-string (buffer-string)
					  'utf-8 'utf-8 'HFS+C)))
	(when str
	  (erase-buffer)
	  (insert (if enable-multibyte-characters
		      (string-as-multibyte str) str)))
	(setq length (decode-coding-region (point-min) (point-max) 'utf-8))
	;; We are inside a post-read-conversion function, so the
	;; original post-read-conversion for utf-8 is not
	;; automatically called.
	(goto-char (point-min))
	(funcall (or (coding-system-get 'utf-8 'post-read-conversion)
		     'identity)
		 length)))))

(defun mac-hfs+-pre-write-conversion (beg end)
  (funcall (or (coding-system-get 'utf-8 'pre-write-conversion) 'ignore)
	   beg (+ beg (encode-coding-region beg end 'utf-8))))

(setq default-file-name-coding-system 'mac-hfs+)


* Re: string-as-unibyte
  2005-07-18 22:41 ` string-as-unibyte YAMAMOTO Mitsuharu
@ 2005-07-18 23:52   ` YAMAMOTO Mitsuharu
  2005-07-19  2:50     ` string-as-unibyte Kenichi Handa
  2005-07-19  2:56   ` string-as-unibyte Kenichi Handa
  1 sibling, 1 reply; 8+ messages in thread
From: YAMAMOTO Mitsuharu @ 2005-07-18 23:52 UTC
  Cc: emacs-devel

>>>>> On Tue, 19 Jul 2005 07:41:33 +0900, YAMAMOTO Mitsuharu <mituharu@math.s.chiba-u.ac.jp> said:

> I was trying to make a coding system that almost works as utf-8, but
> additionally does "HFS+ composition" (canonical composition with
> some exclusions) on decoding.

> 	(when str
> 	  (erase-buffer)
> 	  (insert (if enable-multibyte-characters
> 		      (string-as-multibyte str) str)))

Maybe this part should have been string-to-multibyte.

BTW, I noticed that backslashes in the docstring of
string-as-multibyte are stripped off as follows:

If you're not sure, whether to use `string-as-multibyte' or
`string-to-multibyte', use `string-to-multibyte'.  Beware:
   (aref (string-as-multibyte "201") 0) -> 129 (aka ?201)
   (aref (string-as-multibyte "300") 0) -> 192 (aka ?300)
   (aref (string-as-multibyte "300201") 0) -> 192 (aka ?300)
   (aref (string-as-multibyte "300201") 1) -> 129 (aka ?201)
but
   (aref (string-as-multibyte "201300") 0) -> 2240
   (aref (string-as-multibyte "201300") 1) -> <error>

				     YAMAMOTO Mitsuharu
				mituharu@math.s.chiba-u.ac.jp


* Re: string-as-unibyte
  2005-07-18 23:52   ` string-as-unibyte YAMAMOTO Mitsuharu
@ 2005-07-19  2:50     ` Kenichi Handa
  0 siblings, 0 replies; 8+ messages in thread
From: Kenichi Handa @ 2005-07-19  2:50 UTC
  Cc: monnier, emacs-devel

In article <wl3bqbisf1.wl%mituharu@math.s.chiba-u.ac.jp>, YAMAMOTO Mitsuharu <mituharu@math.s.chiba-u.ac.jp> writes:

> BTW, I noticed that backslashes in the docstring of
> string-as-multibyte are stripped off as follows:

> If you're not sure, whether to use `string-as-multibyte' or
> `string-to-multibyte', use `string-to-multibyte'.  Beware:
>    (aref (string-as-multibyte "201") 0) -> 129 (aka ?201)
[...]

Thank you for finding this bug.  I've just installed a fix.

---
Kenichi Handa
handa@m17n.org


* Re: string-as-unibyte
  2005-07-18 22:41 ` string-as-unibyte YAMAMOTO Mitsuharu
  2005-07-18 23:52   ` string-as-unibyte YAMAMOTO Mitsuharu
@ 2005-07-19  2:56   ` Kenichi Handa
  2005-07-19  3:49     ` string-as-unibyte YAMAMOTO Mitsuharu
  2005-07-19 14:26     ` string-as-unibyte Stefan Monnier
  1 sibling, 2 replies; 8+ messages in thread
From: Kenichi Handa @ 2005-07-19  2:56 UTC
  Cc: monnier, emacs-devel

In article <wl4qarag9u.wl%mituharu@math.s.chiba-u.ac.jp>, YAMAMOTO Mitsuharu <mituharu@math.s.chiba-u.ac.jp> writes:

>>>>>>  On Mon, 18 Jul 2005 17:33:02 -0400, Stefan Monnier <monnier@iro.umontreal.ca> said:
>>  Could you explain the need for the change below:

>>  2005-07-16 YAMAMOTO Mitsuharu <mituharu@math.s.chiba-u.ac.jp>

>>  	* mac.c [TARGET_API_MAC_CARBON] (Fmac_code_convert_string):
>>  Use Fstring_as_unibyte instead of string_make_unibyte.

> It is at the preparation stage of code conversion.  So I think the
> following comment in decode_coding_string (coding.c) is also
> applicable to this case.

>   if (STRING_MULTIBYTE (str))
>     {
>       /* Decoding routines expect the source text to be unibyte.  */
>       str = Fstring_as_unibyte (str);

If a multibyte string is given to mac-code-convert-string,
and the string was made multibyte by string-to-multibyte from
a raw-byte sequence (e.g. by inserting a file as raw-text into a
multibyte buffer and extracting a string with
buffer-substring), using Fstring_as_unibyte is correct.
Please note that we don't have Fstring_to_unibyte because it
should work the same way as Fstring_as_unibyte.

---
Kenichi Handa
handa@m17n.org
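The scenario described above can be sketched in Lisp as follows. This is a hedged illustration rather than code from mac.c: `demo-raw-bytes-round-trip' is a hypothetical name, and the sketch assumes an Emacs where `string-as-unibyte' is still available (it is obsolete since 26.1). It shows that a multibyte string holding only raw-byte characters converts back to exactly the bytes it started from.

```elisp
;; Illustrative sketch; `demo-raw-bytes-round-trip' is a hypothetical name.
(defun demo-raw-bytes-round-trip (bytes)
  "Round-trip unibyte BYTES through a multibyte buffer."
  (with-temp-buffer
    (set-buffer-multibyte t)
    ;; Each byte >= #x80 becomes one eight-bit character in the buffer.
    (insert (string-to-multibyte bytes))
    (let ((extracted (buffer-substring (point-min) (point-max))))
      ;; EXTRACTED is multibyte, yet holds only single-byte characters,
      ;; so `string-as-unibyte' gives back the original bytes.
      (string-as-unibyte extracted))))

;; The UTF-8 encoding of U+00E9, decoded once the bytes are recovered.
(decode-coding-string (demo-raw-bytes-round-trip "\303\251") 'utf-8)
```

The final form should evaluate to a one-character multibyte string containing U+00E9, since the recovered bytes are what a decoding routine expects as input.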


* Re: string-as-unibyte
  2005-07-19  2:56   ` string-as-unibyte Kenichi Handa
@ 2005-07-19  3:49     ` YAMAMOTO Mitsuharu
  2005-07-19 14:26     ` string-as-unibyte Stefan Monnier
  1 sibling, 0 replies; 8+ messages in thread
From: YAMAMOTO Mitsuharu @ 2005-07-19  3:49 UTC
  Cc: monnier, emacs-devel

>>>>> On Tue, 19 Jul 2005 11:56:37 +0900, Kenichi Handa <handa@m17n.org> said:

> If a multibyte string is given to mac-code-convert-string, and the
> string was made multibyte by string-to-multibyte from a raw-byte
> sequence (e.g. by inserting a file as raw-text into a multibyte buffer
> and extracting a string with buffer-substring), using
> Fstring_as_unibyte is correct.

That's the case for mac-code-convert-string.  Thanks for clarifying.

As for the `mac-hfs+' coding system shown in my previous mail, its
coding-system-type should have been 5 (raw-text) rather than 0
(emacs-mule), so that the leading bytes for eight-bit-control and
eight-bit-graphic are not eaten.  I also had to call
(set-buffer-multibyte t) explicitly for the case of
decode-coding-string.  Maybe " *code-converting-work*" in
ctext-post-read-conversion should be " *code-conversion-work*"?

				     YAMAMOTO Mitsuharu
				mituharu@math.s.chiba-u.ac.jp

(make-coding-system
 'mac-hfs+
 5
 (coding-system-mnemonic 'utf-8)
 "Like utf-8, but additionally does Mac HFS+ composition on decoding."
 (coding-system-flags 'utf-8)
 (list (cons 'safe-charsets (coding-system-get 'utf-8 'safe-charsets))
       '(post-read-conversion . mac-hfs+-post-read-conversion)
       '(pre-write-conversion . mac-hfs+-pre-write-conversion)))

(defun mac-hfs+-post-read-conversion (length)
  (save-excursion
    (save-restriction
      (narrow-to-region (point) (+ (point) length))
      (let ((in-workbuf (string= (buffer-name) " *code-conversion-work*"))
	    (str (mac-code-convert-string (buffer-string)
					  'utf-8 'utf-8 'HFS+C)))
	(when str
	  (erase-buffer)
	  (insert (if enable-multibyte-characters
		      (string-to-multibyte str) str)))
	(if in-workbuf
	    (set-buffer-multibyte t))
	(setq length (decode-coding-region (point-min) (point-max) 'utf-8))
	;; We are inside a post-read-conversion function, so the
	;; original post-read-conversion for utf-8 is not
	;; automatically called.
	(goto-char (point-min))
	(funcall (or (coding-system-get 'utf-8 'post-read-conversion)
		     'identity)
		 length)))))

(defun mac-hfs+-pre-write-conversion (beg end)
  (funcall (or (coding-system-get 'utf-8 'pre-write-conversion) 'ignore)
	   beg (+ beg (encode-coding-region beg end 'utf-8))))

(setq default-file-name-coding-system 'mac-hfs+)


* Re: string-as-unibyte
  2005-07-19  2:56   ` string-as-unibyte Kenichi Handa
  2005-07-19  3:49     ` string-as-unibyte YAMAMOTO Mitsuharu
@ 2005-07-19 14:26     ` Stefan Monnier
  2005-07-20  0:36       ` string-as-unibyte Kenichi Handa
  1 sibling, 1 reply; 8+ messages in thread
From: Stefan Monnier @ 2005-07-19 14:26 UTC
  Cc: YAMAMOTO Mitsuharu, emacs-devel

> If a multibyte string is given to mac-code-convert-string,
> and the string was made multibyte by string-to-multibyte from
> a raw-byte sequence (e.g. by inserting a file as raw-text into a
> multibyte buffer and extracting a string with
> buffer-substring), using Fstring_as_unibyte is correct.

Indeed, but sadly so.  In order to make it clear that the multibyte string
is expected to only contain single-byte chars (including eight-bit-* chars),
a comment is in order.

> Please note that we don't have Fstring_to_unibyte because it
> should work the same way as Fstring_as_unibyte.

Actually no.  string-to-unibyte should signal an error if it encounters
a non-ascii non-eight-bit-* char.


        Stefan
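The strict behaviour described here can be sketched as below. This is an illustration under assumptions, not part of the original message: it presumes an Emacs recent enough to provide `string-to-unibyte' (the function did not exist in the sources being discussed), and `demo-strict-unibyte' is a hypothetical helper name.

```elisp
;; Illustrative sketch; `demo-strict-unibyte' is a hypothetical helper.
(defun demo-strict-unibyte (string)
  "Return STRING converted by `string-to-unibyte', or the symbol `rejected'."
  (condition-case nil
      (string-to-unibyte string)
    (error 'rejected)))

;; Only eight-bit characters: the strict conversion succeeds.
(demo-strict-unibyte (string-to-multibyte "\300\201"))   ; => "\300\201"

;; A genuine non-ASCII character (U+00E9): the strict conversion refuses.
(demo-strict-unibyte (string #xE9))                      ; => rejected
```

A caller that expects only single-byte characters can use the error to catch a violated assumption, whereas `string-as-unibyte' would silently hand back the internal byte sequence.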


* Re: string-as-unibyte
  2005-07-19 14:26     ` string-as-unibyte Stefan Monnier
@ 2005-07-20  0:36       ` Kenichi Handa
  0 siblings, 0 replies; 8+ messages in thread
From: Kenichi Handa @ 2005-07-20  0:36 UTC
  Cc: mituharu, emacs-devel

In article <87wtnmg9iw.fsf-monnier+emacs@gnu.org>, Stefan Monnier <monnier@iro.umontreal.ca> writes:

>>  If a multibyte string is given to mac-code-convert-string,
>>  and the string was made multibyte by string-to-multibyte from
>>  a raw-byte sequence (e.g. by inserting a file as raw-text into a
>>  multibyte buffer and extracting a string with
>>  buffer-substring), using Fstring_as_unibyte is correct.

> Indeed, but sadly so.  In order to make it clear that the multibyte string
> is expected to only contain single-byte chars (including eight-bit-* chars),
> a comment is in order.

I agree.

>>  Please note that we don't have Fstring_to_unibyte because it
>>  should work the same way as Fstring_as_unibyte.

> Actually no.  string-to-unibyte should signal an error if it encounters
> a non-ascii non-eight-bit-* char.

Ah, hmmm, you are right.  I'll remember to implement it in
emacs-unicode.

---
Kenichi Handa
handa@m17n.org
