* [PATCH] Have string ports honor ‘%default-port-encoding’
@ 2010-01-05 0:18 Ludovic Courtès
2010-01-05 2:47 ` Mike Gran
0 siblings, 1 reply; 5+ messages in thread
From: Ludovic Courtès @ 2010-01-05 0:18 UTC (permalink / raw)
To: guile-devel
[-- Attachment #1: Type: text/plain, Size: 204 bytes --]
Hello,
The attached patch fixes string ports so that their encoding defaults to
‘%default-port-encoding’ (as for other ports) instead of UTF-8.
Mike: can you review it?
Thanks,
Ludo’.
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: the patch --]
[-- Type: text/x-patch, Size: 14135 bytes --]
From 7228c59977106d6b87f2347ffb77466073f832a9 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ludovic=20Court=C3=A8s?= <ludo@gnu.org>
Date: Tue, 5 Jan 2010 01:10:12 +0100
Subject: [PATCH] Have string ports honor `%default-port-encoding'.
* libguile/strports.c (scm_i_mkstrport): Remove.
(scm_mkstrport): Don't change the port's encoding to UTF-8; convert
STR to the default port encoding.
(scm_strport_to_string): Fix documentation & indentation.
* libguile/strports.h (scm_i_mkstrport): Remove.
* test-suite/lib.scm (exception:encoding-error): New variable.
(format-test-name): Set `%default-port-encoding' to "UTF-8".
* test-suite/tests/ports.test ("string ports")["%default-port-encoding
is honored", "suitable encoding [latin-1]", "suitable encoding
[latin-3]", "wrong encoding"]: New tests.
* test-suite/tests/r6rs-ports.test ("7.2.11 Binary
Output")["put-bytevector with UTF-16 string port", "put-bytevector
with wrong-encoding string port"]: New tests.
* test-suite/tests/reader.test (read-string): Set
`%default-port-encoding' to `#f'.
("reading")["unprintable symbol"]: Use a string that doesn't contain
zeros.
---
libguile/strports.c | 80 +++++++++++++------------------------
libguile/strports.h | 4 +-
test-suite/lib.scm | 26 ++++++++-----
test-suite/tests/ports.test | 41 ++++++++++++++++++-
test-suite/tests/r6rs-ports.test | 26 ++++++++++--
test-suite/tests/reader.test | 11 +++--
6 files changed, 110 insertions(+), 78 deletions(-)
diff --git a/libguile/strports.c b/libguile/strports.c
index 95e93c9..75c3773 100644
--- a/libguile/strports.c
+++ b/libguile/strports.c
@@ -1,4 +1,4 @@
-/* Copyright (C) 1995,1996,1998,1999,2000,2001,2002, 2003, 2005, 2006, 2009 Free Software Foundation, Inc.
+/* Copyright (C) 1995,1996,1998,1999,2000,2001,2002, 2003, 2005, 2006, 2009, 2010 Free Software Foundation, Inc.
*
* This library is free software; you can redistribute it and/or
* modify it under the terms of the GNU Lesser General Public License
@@ -289,84 +289,60 @@ st_truncate (SCM port, scm_t_off length)
pt->write_pos = pt->read_end;
}
-SCM
-scm_i_mkstrport (SCM pos, const char *utf8_str, size_t str_len, long modes, const char *caller)
+SCM
+scm_mkstrport (SCM pos, SCM str, long modes, const char *caller)
{
- SCM z, str;
+ SCM z;
scm_t_port *pt;
- size_t c_pos;
- char *buf;
-
- /* Because ports are inherently 8-bit, strings need to be converted
- to a locale representation for storage. But, since string ports
- rely on string functionality for their memory management, we need
- to create a new string that has the 8-bit locale representation
- of the underlying string.
+ size_t str_len, c_pos;
+ char *buf, *c_str;
- locale_str is already in the locale of the port. */
- str = scm_i_make_string (str_len, &buf);
- memcpy (buf, utf8_str, str_len);
-
- c_pos = scm_to_unsigned_integer (pos, 0, str_len);
+ SCM_ASSERT (scm_is_string (str), str, SCM_ARG1, caller);
+ c_pos = scm_to_unsigned_integer (pos, 0, scm_i_string_length (str));
if (!((modes & SCM_WRTNG) || (modes & SCM_RDNG)))
scm_misc_error ("scm_mkstrport", "port must read or write", SCM_EOL);
scm_i_scm_pthread_mutex_lock (&scm_i_port_table_mutex);
+
z = scm_new_port_table_entry (scm_tc16_strport);
pt = SCM_PTAB_ENTRY(z);
SCM_SETSTREAM (z, SCM_UNPACK (str));
- SCM_SET_CELL_TYPE(z, scm_tc16_strport|modes);
- pt->write_buf = pt->read_buf = (unsigned char *) scm_i_string_chars (str);
+ SCM_SET_CELL_TYPE (z, scm_tc16_strport | modes);
+
+ /* Create a copy of STR in the encoding of Z. */
+ buf = scm_to_stringn (str, &str_len, pt->encoding,
+ SCM_FAILED_CONVERSION_ERROR);
+ /* FIXME: strdup doesn't do the right thing if BUF contains zeros, but we
+ don't know the size in bytes of STR. */
+ c_str = scm_gc_strdup (buf, "strport");
+ free (buf);
+
+ pt->write_buf = pt->read_buf = (unsigned char *) c_str;
pt->read_pos = pt->write_pos = pt->read_buf + c_pos;
pt->write_buf_size = pt->read_buf_size = str_len;
pt->write_end = pt->read_end = pt->read_buf + pt->read_buf_size;
pt->rw_random = 1;
+
scm_i_pthread_mutex_unlock (&scm_i_port_table_mutex);
- /* ensure write_pos is writable. */
+ /* Ensure WRITE_POS is writable. */
if ((modes & SCM_WRTNG) && pt->write_pos == pt->write_end)
st_flush (z);
- scm_i_set_port_encoding_x (z, "UTF-8");
scm_i_set_conversion_strategy_x (z, SCM_FAILED_CONVERSION_ERROR);
return z;
}
-SCM
-scm_mkstrport (SCM pos, SCM str, long modes, const char *caller)
-{
- SCM z;
- size_t str_len;
- char *buf;
-
- SCM_ASSERT (scm_is_string (str), str, SCM_ARG1, caller);
-
- /* Because ports are inherently 8-bit, strings need to be converted
- to a locale representation for storage. But, since string ports
- rely on string functionality for their memory management, we need
- to create a new string that has the 8-bit locale representation
- of the underlying string. This violates the guideline that the
- internal encoding of characters in strings is in unicode
- codepoints. */
-
- /* String ports are are always initialized with "UTF-8" as their
- encoding. */
- buf = scm_to_stringn (str, &str_len, "UTF-8", SCM_FAILED_CONVERSION_ERROR);
- z = scm_i_mkstrport (pos, buf, str_len, modes, caller);
- free (buf);
- return z;
-}
-
-/* Create a new string from a string port's buffer, converting from
- the port's 8-bit locale-specific representation to the standard
- string representation. */
-SCM scm_strport_to_string (SCM port)
+/* Create a new string from the buffer of PORT, a string port, converting from
+ PORT's encoding to the standard string representation. */
+SCM
+scm_strport_to_string (SCM port)
{
- scm_t_port *pt = SCM_PTAB_ENTRY (port);
SCM str;
-
+ scm_t_port *pt = SCM_PTAB_ENTRY (port);
+
if (pt->rw_active == SCM_PORT_WRITE)
st_flush (port);
diff --git a/libguile/strports.h b/libguile/strports.h
index d93266a..3a9c3ec 100644
--- a/libguile/strports.h
+++ b/libguile/strports.h
@@ -3,7 +3,7 @@
#ifndef SCM_STRPORTS_H
#define SCM_STRPORTS_H
-/* Copyright (C) 1995,1996,2000,2001,2002, 2006, 2008 Free Software Foundation, Inc.
+/* Copyright (C) 1995,1996,2000,2001,2002, 2006, 2008, 2010 Free Software Foundation, Inc.
*
* This library is free software; you can redistribute it and/or
* modify it under the terms of the GNU Lesser General Public License
@@ -44,8 +44,6 @@ SCM_API scm_t_bits scm_tc16_strport;
\f
SCM_API SCM scm_mkstrport (SCM pos, SCM str, long modes, const char * caller);
-SCM_INTERNAL SCM scm_i_mkstrport (SCM pos, const char *locale_str, size_t str_len,
- long modes, const char *caller);
SCM_API SCM scm_strport_to_string (SCM port);
SCM_API SCM scm_object_to_string (SCM obj, SCM printer);
SCM_API SCM scm_call_with_output_string (SCM proc);
diff --git a/test-suite/lib.scm b/test-suite/lib.scm
index e5b7a08..a2390da 100644
--- a/test-suite/lib.scm
+++ b/test-suite/lib.scm
@@ -1,5 +1,5 @@
;;;; test-suite/lib.scm --- generic support for testing
-;;;; Copyright (C) 1999, 2000, 2001, 2004, 2006, 2007, 2009 Free Software Foundation, Inc.
+;;;; Copyright (C) 1999, 2000, 2001, 2004, 2006, 2007, 2009, 2010 Free Software Foundation, Inc.
;;;;
;;;; This program is free software; you can redistribute it and/or
;;;; modify it under the terms of the GNU Lesser General Public
@@ -30,6 +30,7 @@
exception:numerical-overflow
exception:struct-set!-denied
exception:system-error
+ exception:encoding-error
exception:miscellaneous-error
exception:string-contains-nul
exception:read-error
@@ -267,6 +268,8 @@ with-locale with-locale*
(cons 'misc-error "^set! denied for field"))
(define exception:system-error
(cons 'system-error ".*"))
+(define exception:encoding-error
+ (cons 'misc-error "(cannot convert to output locale|input locale conversion error)"))
(define exception:miscellaneous-error
(cons 'misc-error "^.*"))
(define exception:read-error
@@ -389,15 +392,18 @@ with-locale with-locale*
;;;; Turn a test name into a nice human-readable string.
(define (format-test-name name)
- (call-with-output-string
- (lambda (port)
- (let loop ((name name)
- (separator ""))
- (if (pair? name)
- (begin
- (display separator port)
- (display (car name) port)
- (loop (cdr name) ": ")))))))
+ ;; Choose a Unicode-capable encoding so that the string port can contain any
+ ;; valid Unicode character.
+ (with-fluids ((%default-port-encoding "UTF-8"))
+ (call-with-output-string
+ (lambda (port)
+ (let loop ((name name)
+ (separator ""))
+ (if (pair? name)
+ (begin
+ (display separator port)
+ (display (car name) port)
+ (loop (cdr name) ": "))))))))
;;;; For a given test-name, deliver the full name including all prefixes.
(define (full-name name)
diff --git a/test-suite/tests/ports.test b/test-suite/tests/ports.test
index 312467d..72dcb63 100644
--- a/test-suite/tests/ports.test
+++ b/test-suite/tests/ports.test
@@ -1,7 +1,7 @@
-;;;; ports.test --- test suite for Guile I/O ports -*- scheme -*-
+;;;; ports.test --- Guile I/O ports. -*- coding: utf-8; mode: scheme; -*-
;;;; Jim Blandy <jimb@red-bean.com> --- May 1999
;;;;
-;;;; Copyright (C) 1999, 2001, 2004, 2006, 2007, 2009 Free Software Foundation, Inc.
+;;;; Copyright (C) 1999, 2001, 2004, 2006, 2007, 2009, 2010 Free Software Foundation, Inc.
;;;;
;;;; This library is free software; you can redistribute it and/or
;;;; modify it under the terms of the GNU Lesser General Public
@@ -307,7 +307,42 @@
(string-set! text 0 #\a)
(string-set! text (- len 1) #\b)
(pass-if "output check"
- (string=? text result))))
+ (string=? text result)))
+
+ (pass-if "%default-port-encoding is honored"
+ (let ((encodings '("UTF-8" "UTF-16" "ISO-8859-1" "ISO-8859-3")))
+ (equal? (map (lambda (e)
+ (with-fluids ((%default-port-encoding e))
+ (call-with-output-string
+ (lambda (p)
+ (display (port-encoding p) p)))))
+ encodings)
+ encodings)))
+
+ (pass-if "suitable encoding [latin-1]"
+ (let ((str "hello, world"))
+ (with-fluids ((%default-port-encoding "ISO-8859-1"))
+ (equal? str
+ (with-output-to-string
+ (lambda ()
+ (display str)))))))
+
+ (pass-if "suitable encoding [latin-3]"
+ (let ((str "Äu bone?"))
+ (with-fluids ((%default-port-encoding "ISO-8859-3"))
+ (equal? str
+ (with-output-to-string
+ (lambda ()
+ (display str)))))))
+
+ (pass-if-exception "wrong encoding"
+ exception:encoding-error
+ (let ((str "Äu bone?"))
+ ;; Latin-1 cannot represent âÄâ.
+ (with-fluids ((%default-port-encoding "ISO-8859-1"))
+ (with-output-to-string
+ (lambda ()
+ (display str)))))))
(with-test-prefix "call-with-output-string"
diff --git a/test-suite/tests/r6rs-ports.test b/test-suite/tests/r6rs-ports.test
index eb60cf3..1d60991 100644
--- a/test-suite/tests/r6rs-ports.test
+++ b/test-suite/tests/r6rs-ports.test
@@ -1,6 +1,6 @@
-;;;; r6rs-ports.test --- Exercise the R6RS I/O port API.
+;;;; r6rs-ports.test --- R6RS I/O port tests. -*- coding: iso-8859-1; -*-
;;;;
-;;;; Copyright (C) 2009 Free Software Foundation, Inc.
+;;;; Copyright (C) 2009, 2010 Free Software Foundation, Inc.
;;;; Ludovic Courtès
;;;;
;;;; This library is free software; you can redistribute it and/or
@@ -219,7 +219,25 @@
(port (%make-void-port "w")))
(close-port port)
- (put-bytevector port bv))))
+ (put-bytevector port bv)))
+
+ (pass-if "put-bytevector with UTF-16 string port"
+ (let* ((str "hello, world")
+ (bv (string->utf16 str)))
+ (equal? str
+ (with-fluids ((%default-port-encoding "UTF-16BE"))
+ (call-with-output-string
+ (lambda (port)
+ (put-bytevector port bv)))))))
+
+ (pass-if-exception "put-bytevector with wrong-encoding string port"
+ exception:encoding-error
+ (let* ((str "hello, world")
+ (bv (string->utf16 str)))
+ (with-fluids ((%default-port-encoding "UTF-32"))
+ (call-with-output-string
+ (lambda (port)
+ (put-bytevector port bv)))))))
\f
(with-test-prefix "7.2.7 Input Ports"
@@ -452,8 +470,6 @@
(not eof?)
(bytevector=? sink source)))))
-
;;; Local Variables:
-;;; coding: latin-1
;;; mode: scheme
;;; End:
diff --git a/test-suite/tests/reader.test b/test-suite/tests/reader.test
index 2ee21c1..b819e63 100644
--- a/test-suite/tests/reader.test
+++ b/test-suite/tests/reader.test
@@ -1,6 +1,6 @@
-;;;; reader.test --- Exercise the reader. -*- Scheme -*-
+;;;; reader.test --- Reader test. -*- coding: iso-8859-1; mode: scheme -*-
;;;;
-;;;; Copyright (C) 1999, 2001, 2002, 2003, 2007, 2008, 2009 Free Software Foundation, Inc.
+;;;; Copyright (C) 1999, 2001, 2002, 2003, 2007, 2008, 2009, 2010 Free Software Foundation, Inc.
;;;; Jim Blandy <jimb@red-bean.com>
;;;;
;;;; This library is free software; you can redistribute it and/or
@@ -41,7 +41,8 @@
(define (read-string s)
- (with-input-from-string s (lambda () (read))))
+ (with-fluids ((%default-port-encoding #f))
+ (with-input-from-string s (lambda () (read)))))
(define (with-read-options opts thunk)
(let ((saved-options (read-options)))
@@ -110,8 +111,8 @@
(pass-if "unprintable symbol"
;; The reader tolerates unprintable characters for symbols.
- (equal? (string->symbol "\001\002\003")
- (read-string "\001\002\003")))
+ (equal? (string->symbol "\x01\x02\x03")
+ (read-string "\x01\x02\x03")))
(pass-if "CR recognized as a token delimiter"
;; In 1.8.3, character 0x0d was not recognized as a delimiter.
--
1.6.4.2
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH] Have string ports honor ‘%default-port-encoding’
2010-01-05 0:18 [PATCH] Have string ports honor ‘%default-port-encoding’ Ludovic Courtès
@ 2010-01-05 2:47 ` Mike Gran
2010-01-05 9:14 ` Ludovic Courtès
0 siblings, 1 reply; 5+ messages in thread
From: Mike Gran @ 2010-01-05 2:47 UTC (permalink / raw)
To: Ludovic Courtès, guile-devel
> From: Ludovic Courtès <ludo@gnu.org>
>
> Hello,
>
> The attached patch fixes string ports so that their encoding defaults to
> ‘%default-port-encoding’ (as for other ports) instead of UTF-8.
>
> Mike: can you review it?
Hi Ludovic,
It is a bit hard for me to review this as I don't have access to my machine
right now, but, let me see what I can do.
When you write...
+ /* Create a copy of STR in the encoding of Z. */
+ buf = scm_to_stringn (str, &str_len, pt->encoding,
+ SCM_FAILED_CONVERSION_ERROR);
+ /* FIXME: strdup doesn't do the right thing if BUF contains zeros, but we
+ don't know the size in bytes of STR. */
+ c_str = scm_gc_strdup (buf, "strport");
+ free (buf);
... isn't the returned value str_len the length in bytes of buf? I think
you could avoid the strdup call, since it could fail, for example, for
UTF-32 strings of more than one character.
Also, in the big scheme of things, I wonder if the name "string port"
is misleading now. Strings can contain the whole codepoint range.
But string ports can't store the whole range depending on their encoding.
(That's what the "UTF-8" hack was about.)
Thanks,
Mike Gran
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] Have string ports honor ‘%default-port-encoding’
2010-01-05 2:47 ` Mike Gran
@ 2010-01-05 9:14 ` Ludovic Courtès
2010-01-05 18:21 ` Mike Gran
0 siblings, 1 reply; 5+ messages in thread
From: Ludovic Courtès @ 2010-01-05 9:14 UTC (permalink / raw)
To: Mike Gran; +Cc: guile-devel
Hello,
Mike Gran <spk121@yahoo.com> writes:
> When you write...
>
> + /* Create a copy of STR in the encoding of Z. */
> + buf = scm_to_stringn (str, &str_len, pt->encoding,
> + SCM_FAILED_CONVERSION_ERROR);
> + /* FIXME: strdup doesn't do the right thing if BUF contains zeros, but we
> + don't know the size in bytes of STR. */
> + c_str = scm_gc_strdup (buf, "strport");
> + free (buf);
>
> ... isn't the returned value str_len the length in bytes of buf?
The (undocumented) ‘scm_to_stringn ()’ returns the number of characters,
AFAICS.
> I think you could avoid the strdup call, since it could fail, for
> example, for UTF-32 strings of more than one character.
Yes, that sucks. Probably we need a function to known the number of
bytes of a string. Thoughts?
> Also, in the big scheme of things, I wonder if the name "string port"
> is misleading now. Strings can contain the whole codepoint range.
> But string ports can't store the whole range depending on their encoding.
> (That's what the "UTF-8" hack was about.)
Yes, it’s tricky. The problem is that currently we can send both
textual and binary data to a given port (unlike the R6RS port API, which
judiciously distinguishes textual and binary ports.) Because of that, I
think string ports can’t just use a fixed encoding.
What do you think?
Thanks,
Ludo’.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] Have string ports honor ‘%default-port-encoding’
2010-01-05 9:14 ` Ludovic Courtès
@ 2010-01-05 18:21 ` Mike Gran
2010-01-07 10:37 ` Ludovic Courtès
0 siblings, 1 reply; 5+ messages in thread
From: Mike Gran @ 2010-01-05 18:21 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: guile-devel
> Ludo sez:
> The (undocumented) ‘scm_to_stringn ()’ returns the number of characters,
> AFAICS.
I'll try to get a doc patch in this weekend.
The second parameter to scm_to_stringn gets filled with the result buffer
length from either mem_iconveh or u32_conv_to_encoding, so it is definitely
the length in bytes of the locale-encoded string.
Maybe I can also make a unit test for this function.
> > Also, in the big scheme of things, I wonder if the name "string port"
> > is misleading now. Strings can contain the whole codepoint range.
> > But string ports can't store the whole range depending on their encoding.
> > (That's what the "UTF-8" hack was about.)
>
> Yes, it’s tricky. The problem is that currently we can send both
> textual and binary data to a given port (unlike the R6RS port API, which
> judiciously distinguishes textual and binary ports.) Because of that, I
> think string ports can’t just use a fixed encoding.
>
> What do you think?
I'm fine with having the string ports operate this way. I think the
parallelism to other ports is a good thing.
I know that I'm 'splitting hairs', but there are a couple of places in
the docs that refer to string ports being "ports *on* a scheme string", when
in truth they are ports initialized by strings or that output to strings.
That is a trivial point of nomenclature, though.
Thanks,
Mike
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] Have string ports honor ‘%default-port-encoding’
2010-01-05 18:21 ` Mike Gran
@ 2010-01-07 10:37 ` Ludovic Courtès
0 siblings, 0 replies; 5+ messages in thread
From: Ludovic Courtès @ 2010-01-07 10:37 UTC (permalink / raw)
To: guile-devel
Hi,
Mike Gran <spk121@yahoo.com> writes:
>> The (undocumented) ‘scm_to_stringn ()’ returns the number of characters,
>> AFAICS.
>
> I'll try to get a doc patch in this weekend.
As a start, I added a small description in the source:
http://git.savannah.gnu.org/cgit/guile.git/commit/?id=29bcdbb05948a5f12d2d8cb36a0c3c582e738be3
>> > Also, in the big scheme of things, I wonder if the name "string port"
>> > is misleading now. Strings can contain the whole codepoint range.
>> > But string ports can't store the whole range depending on their encoding.
>> > (That's what the "UTF-8" hack was about.)
>>
>> Yes, it’s tricky. The problem is that currently we can send both
>> textual and binary data to a given port (unlike the R6RS port API, which
>> judiciously distinguishes textual and binary ports.) Because of that, I
>> think string ports can’t just use a fixed encoding.
>>
>> What do you think?
>
> I'm fine with having the string ports operate this way. I think the
> parallelism to other ports is a good thing.
Committed this:
http://git.savannah.gnu.org/cgit/guile.git/commit/?id=7b0419128bce68f48a158292430ed4a7202aa1b1
Thanks for the review!
Ludo’.
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2010-01-07 10:37 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-01-05 0:18 [PATCH] Have string ports honor ‘%default-port-encoding’ Ludovic Courtès
2010-01-05 2:47 ` Mike Gran
2010-01-05 9:14 ` Ludovic Courtès
2010-01-05 18:21 ` Mike Gran
2010-01-07 10:37 ` Ludovic Courtès
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).