unofficial mirror of guile-devel@gnu.org 
 help / color / mirror / Atom feed
* `mike-port-encodings' branch (bug #29643)
       [not found] <E1OL1A2-00071t-DR@vcs-noshell.in.savannah.gnu.org>
@ 2010-06-08 22:29 ` Ludovic Courtès
  2010-06-10 18:16   ` Mike Gran
  0 siblings, 1 reply; 3+ messages in thread
From: Ludovic Courtès @ 2010-06-08 22:29 UTC (permalink / raw)
  To: Michael Gran; +Cc: guile-devel

Hello Mike,

Thanks for working on this!

Here’s a review of the ‘mike-port-encodings’ branch.

"Michael Gran" <spk121@yahoo.com> writes:

> commit 26e8fd7459581f09857cccc8ddf1c366d2caea87
> Author: Michael Gran <spk121@yahoo.com>
> Date:   Sat Jun 5 13:40:32 2010 -0700
>
>     Simplify scm_i_scan_for_encoding
>     
>     Note especially that the variable 'i' has two different uses in this
>     function, and they get confused.
>     
>     * libguile/read.c (scm_i_scan_for_encoding): cleanup

-  while (pos + i - header <= SCM_ENCODING_SEARCH_SIZE 
-         && pos + i - header < bytes_read
-	 && (isalnum ((int) pos[i]) || strchr ("_-.:/,+=()", pos[i]) != NULL))
+  while (encoding_start + i - header <= SCM_ENCODING_SEARCH_SIZE
+         && encoding_start + i - header < bytes_read
+	 && (isalnum ((int) encoding_start[i])
+	     || strchr ("_-.:/,+=()", encoding_start[i]) != NULL))

Hmm, “simplification”?  :-)

Is this commit really needed?  What’s the operational effect?

> commit 44c705d173d52cb485d761a51425b4d8e310a67a
> Author: Michael Gran <spk121@yahoo.com>
> Date:   Sat Jun 5 14:26:14 2010 -0700
>
>     read-line should use port's encoding, not locale's encoding
>     
>     * libguile/rdelim.c (scm_read_line): modified, use port's encoding
>

OK.  Would it be possible to write a test case, perhaps with a string
port?

+	  line = scm_from_stringn (s, slen-1, enc, hndl);

Please leave spaces around binary operators.

> commit 62abf139798af3f35ca3c89200389dc75d51a159
> Author: Michael Gran <spk121@yahoo.com>
> Date:   Sat Jun 5 14:19:37 2010 -0700
>
>     open-file should handle binary mode and coding declarations
>     
>     The open-file port should use the binary-friendly ISO-8859-1 encoding when
>     a file is opened using mode "b".  Also, it should honor a "coding:"
>     declaration at the top of a file when reading files where it is present.
>     
>     * libguile/fports.c (scm_open_file): modified
>     * test-suite/tests/ports.test: more tests for open-file
>     * doc/ref/api-io.texi (File Ports): more documentation for open-file

  --- a/doc/ref/api-io.texi
  +++ b/doc/ref/api-io.texi
  @@ -130,8 +130,8 @@ this fluid otherwise.

   @deffn {Scheme Procedure} port-encoding port
   @deffnx {C Function} scm_port_encoding
  -Returns, as a string, the character encoding that @var{port} uses to
  -interpret its input and output.
  +Returns, as a string, the character encoding that @var{port} uses to interpret
  +its input and output.  The value @code{#f} is equivalent to @code{"ISO-8859-1"}.

Instead of having the #f special case, I’ve been thinking that it would
be nicer and easier to have ‘port-encoding’ always return a string
(similar for ‘set-port-encoding!’), which would be "ISO-8859-1" instead
of #f.  Perhaps it’s a bit late, though.

What do you think?

  +Also, open the file using the binary-compatible character encoding
  +"ISO-8859-1", ignoring any coding declaration or port encoding.

I dislike this whole notion of “binary-compatibility”.  What is meant
here is that for binary I/O, you’d better choose ISO-8859-1 as the
port’s encoding because non-ASCII byte values will go through as is,
right?  (I suppose that’s true of any 8-bit encoding.)

OTOH, the (rnrs io ports) API does binary I/O as expected regardless of
the port’s encoding, so that’s what users should choose when doing
binary I/O, I think.

Thus I’d suggest removing any references to “binary-compatibility”.
 
  +When the file is opened, this procedure will scan for a coding
  +declaration (@pxref{Character Encoding of Source Files}). If present
  +will use that encoding for interpreting the file.  Otherwise, the
  +port's encoding will be used.

How about adding: “, which defaults to the value of
@code{%default-port-encoding} (@pxref{Ports,
@code{%default-port-encoding}}).”

  +the file in binary mode and then set the port encoding explicitly
  +using @code{set-port-encoding!}.

Again, there’s no real binary mode.

              "The following additional characters can be appended:\n"
              "@table @samp\n"
              "@item b\n"
  -	    "Open the underlying file in binary mode, if supported by the operating system. "
  +	    "Open the underlying file in binary mode, if supported by the system.\n"
  +	    "Also, open the file using the binary-compatible character encoding\n"
  +	    "\"ISO-8859-1\", ignoring the port's encoding and the coding declaration\n"
  +	    "at the top of the input file, if any.\n"

OK, except for “binary compatible”.

  +;;; binary mode ignores port encoding
  +(with-fluids ((%default-port-encoding "UTF-8"))
  +  (let* ((filename (test-file))
  +         (port (open-file filename "w"))
  +         (test-string "一二三")
  +         (binary-test-string
  +          (apply string
  +                 (map integer->char
  +                      (uniform-vector->list (string->utf8 test-string))))))
  +    (write-line test-string port)
  +    (close-port port)
  +    (let* ((in-port (open-file filename "rb"))
  +           (line (read-line in-port)))
  +      (close-port in-port)
  +      (pass-if "file: binary mode ignores port encoding"
  +               (string=? line binary-test-string)))))

Rather put all the test under ‘pass-if’ so that errors are attributed to
this test.  There should also be a (delete-file filename) as the last
thunk of a ‘dynamic-wind’ (otherwise this could break ‘make distcheck’.)

Other than that the tests look good to me.

Thanks,
Ludo’.



^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: `mike-port-encodings' branch (bug #29643)
  2010-06-08 22:29 ` `mike-port-encodings' branch (bug #29643) Ludovic Courtès
@ 2010-06-10 18:16   ` Mike Gran
  2010-06-16 21:13     ` Ludovic Courtès
  0 siblings, 1 reply; 3+ messages in thread
From: Mike Gran @ 2010-06-10 18:16 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: guile-devel

> From: Ludovic Courtès <ludo@gnu.org>
> Hello Mike,

> Thanks for working on this!

> Here’s a review of the ‘mike-port-encodings’ branch.


> Hmm, “simplification”?  :-)

> Is this commit really needed?  What’s the operational effect?

That commit has no net effect; however, the code
uses the 'i' variable in a for loop as an iterator and then
uses its value after the loop.  Using a for loop iterator
variable after the loop is complete is bad style, IMHO.

> OK.  Would it be possible to write a test case, 
> perhaps with a string port?

Sure.

>>  +Returns, as a string, the character encoding that 
>> @var{port} uses to interpret
>>  +its input and output.  The value @code{#f} is equivalent to 
>> @code{"ISO-8859-1"}.

> Instead of having the #f special case, I’ve been 
> thinking that it would be nicer and easier to have ‘port-encoding’ always 
> return a string (similar for ‘set-port-encoding!’), which would be 
> "ISO-8859-1" instead of #f.  Perhaps it’s a bit late, 
> though.

> What do you think?

I have no problem with that.  But, there is some small optimization that
comes from using NULL and SCM_BOOL_F as shorthand for ISO-8859-1,
because is eliminates a strcmp operation before each block of port text
is read or written.

[...]

> I dislike this whole notion of “binary-compatibility”.  What is meant
> here is that for binary I/O, 
> you’d better choose ISO-8859-1 as theport’s encoding because non-ASCII byte 
> values will go through as is,right?  (I suppose that’s true of any 
> 8-bit encoding.)

In the documentation, (rnrs io ports) could be recommended for binary
ports.  No problem.

But do you agree with the idea that Guile should force 8-bit encoding when
a legacy binary port is opened?  Or should it still inherit %default-port-encoding?
Which if those does not violate the 'principle of least surprise'?

-Mike



^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: `mike-port-encodings' branch (bug #29643)
  2010-06-10 18:16   ` Mike Gran
@ 2010-06-16 21:13     ` Ludovic Courtès
  0 siblings, 0 replies; 3+ messages in thread
From: Ludovic Courtès @ 2010-06-16 21:13 UTC (permalink / raw)
  To: guile-devel

Hi Mike,

Mike Gran <spk121@yahoo.com> writes:

>> Hmm, “simplification”?  :-)
>
>> Is this commit really needed?  What’s the operational effect?
>
> That commit has no net effect; however, the code
> uses the 'i' variable in a for loop as an iterator and then
> uses its value after the loop.  Using a for loop iterator
> variable after the loop is complete is bad style, IMHO.

Hmm right.  Looking again at this part, it looks like the new variable
names are indeed more descriptive than before, and that ‘i’ is not
reused throughout may help, so OK.

Please change the commit log from “cleanup” to something that briefly
describes the actual change, though.

>>>  +Returns, as a string, the character encoding that 
>>> @var{port} uses to interpret
>>>  +its input and output.  The value @code{#f} is equivalent to 
>>> @code{"ISO-8859-1"}.
>
>> Instead of having the #f special case, I’ve been 
>> thinking that it would be nicer and easier to have ‘port-encoding’ always 
>> return a string (similar for ‘set-port-encoding!’), which would be 
>> "ISO-8859-1" instead of #f.  Perhaps it’s a bit late, 
>> though.
>
>> What do you think?
>
> I have no problem with that.  But, there is some small optimization that
> comes from using NULL and SCM_BOOL_F as shorthand for ISO-8859-1,
> because is eliminates a strcmp operation before each block of port text
> is read or written.

Hmm, right.  Then forget about what I said.

However, it would be nice if the documentation would state the
equivalence of #f and ISO-8859-1 in a single place.

>> I dislike this whole notion of “binary-compatibility”.  What is meant
>> here is that for binary I/O, 
>> you’d better choose ISO-8859-1 as theport’s encoding because non-ASCII byte 
>> values will go through as is,right?  (I suppose that’s true of any 
>> 8-bit encoding.)
>
> In the documentation, (rnrs io ports) could be recommended for binary
> ports.  No problem.

OK, with an xref to that part of the doc right after the description of
the ‘b’ flag.  :-)

> But do you agree with the idea that Guile should force 8-bit encoding when
> a legacy binary port is opened?

Yes.

>   Or should it still inherit %default-port-encoding?  Which if those
> does not violate the 'principle of least surprise'?

ISO-8859-1 is probably best in that respect.

Thanks,
Ludo’.




^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2010-06-16 21:13 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <E1OL1A2-00071t-DR@vcs-noshell.in.savannah.gnu.org>
2010-06-08 22:29 ` `mike-port-encodings' branch (bug #29643) Ludovic Courtès
2010-06-10 18:16   ` Mike Gran
2010-06-16 21:13     ` Ludovic Courtès

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).