unofficial mirror of guile-devel@gnu.org 
 help / color / mirror / Atom feed
* Making custom binary input ports unbuffered
@ 2014-01-14 23:00 Ludovic Courtès
  2014-01-15  5:43 ` Mark H Weaver
  0 siblings, 1 reply; 11+ messages in thread
From: Ludovic Courtès @ 2014-01-14 23:00 UTC (permalink / raw)
  To: guile-devel; +Cc: Andy Wingo, Mark H Weaver

[-- Attachment #1: Type: text/plain, Size: 1770 bytes --]

Hi!

As discussed on IRC, our R6 custom binary input ports (CBIPs) are
currently buffered.  The buffer is hard-coded and setvbuf doesn’t work
on non-file ports.

Having a buffer can be problematic for several reasons.

  1. The user’s ‘get-position’ will always point past what the port’s
     user sees.

     This could be worked around by subtracting read_pos to what
     ‘get-position’ returns, but that feels wrong: conceptually, the
     port’s position is something under the CBIP implementor’s control.
     (I wonder how fopencookie/fseek deal with this.)

  2. Some applications want no buffering.

     My use case was that I read from a byte stream, and at some point I
     want to compute a hash over a delimited part of that stream.  To do
     that, I intuitively wanted to have a CBIP that wraps some other
     input port, and do the hash computation in that CBIP.  But that
     only works if we can guarantee that the CBIP doesn’t read more than
     what was actually asked.

     Same for the delimited port in (web response).

The patch below makes CBIPs unbuffered (see the tests for the properties
it gives.)  It works thanks to the optimization in ‘scm_c_read’ for
unbuffered binary ports.

This is going to be a performance hit for applications that read things
byte by byte, *or* via textual procedures (‘scm_getc’, ‘get-string’,
etc.)  But the assumption is that people rather use ‘get-bytevector-n’
(or similar) to get a chunk of data.

However!  There are places in (web ...) where CBIPs are used for mixed
binary/textual input.  When that happens, all the accesses end up being
unbuffered, which really sucks.


So, what do we do?  :-)

Thanks,
Ludo’.


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: the patch --]
[-- Type: text/x-patch, Size: 9200 bytes --]

From ef60588d6e76d6ad0ae09197043c6d7371beb1b7 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ludovic=20Court=C3=A8s?= <ludo@gnu.org>
Date: Tue, 14 Jan 2014 21:54:51 +0100
Subject: [PATCH] Make custom binary input ports unbuffered.

* libguile/r6rs-ports.c (make_cbip): Leave read_{pos,end,buf_size}
  unchanged and call 'scm_port_non_buffer'.
  (cbip_fill_input): Change to use PORT's associated bytevector as an
  intermediate copy passed to READ_PROC and then copied back into PORT's
  own buffer.  Reallocate a new bytevector when it's smaller than the
  current 'read_buf_size'.  Don't loop back to 'again' label and remove
  it.
* test-suite/tests/r6rs-ports.test ("7.2.7 Input Ports")["custom binary
  input & 'port-position'", "custom binary input & 'read!' calls"]: New
  tests.
---
 libguile/r6rs-ports.c            | 77 ++++++++++++++++++++++++++--------------
 test-suite/tests/r6rs-ports.test | 60 +++++++++++++++++++++++++++++++
 2 files changed, 110 insertions(+), 27 deletions(-)

diff --git a/libguile/r6rs-ports.c b/libguile/r6rs-ports.c
index 790c24c..f2a654e 100644
--- a/libguile/r6rs-ports.c
+++ b/libguile/r6rs-ports.c
@@ -280,13 +280,17 @@ cbp_close (SCM port)
 
 static scm_t_bits custom_binary_input_port_type = 0;
 
-/* Size of the buffer embedded in custom binary input ports.  */
+/* Initial size of the buffer embedded in custom binary input ports.  */
 #define CBIP_BUFFER_SIZE  4096
 
 /* Return the bytevector associated with PORT.  */
 #define SCM_CBIP_BYTEVECTOR(_port)				\
   SCM_SIMPLE_VECTOR_REF (SCM_PACK (SCM_STREAM (_port)), 4)
 
+/* Set BV as the bytevector associated with PORT.  */
+#define SCM_SET_CBIP_BYTEVECTOR(_port, _bv)				\
+  SCM_SIMPLE_VECTOR_SET (SCM_PACK (SCM_STREAM (_port)), 4, (_bv))
+
 /* Return the various procedures of PORT.  */
 #define SCM_CBIP_READ_PROC(_port)				\
   SCM_SIMPLE_VECTOR_REF (SCM_PACK (SCM_STREAM (_port)), 0)
@@ -297,15 +301,11 @@ make_cbip (SCM read_proc, SCM get_position_proc,
 	   SCM set_position_proc, SCM close_proc)
 {
   SCM port, bv, method_vector;
-  char *c_bv;
-  unsigned c_len;
   scm_t_port *c_port;
   const unsigned long mode_bits = SCM_OPN | SCM_RDNG;
 
-  /* Use a bytevector as the underlying buffer.  */
-  c_len = CBIP_BUFFER_SIZE;
-  bv = scm_c_make_bytevector (c_len);
-  c_bv = (char *) SCM_BYTEVECTOR_CONTENTS (bv);
+  /* Pre-allocate a bytevector to be passed to the 'read!' method.  */
+  bv = scm_c_make_bytevector (CBIP_BUFFER_SIZE);
 
   /* Store the various methods and bytevector in a vector.  */
   method_vector = scm_c_make_vector (5, SCM_BOOL_F);
@@ -326,10 +326,13 @@ make_cbip (SCM read_proc, SCM get_position_proc,
   /* Attach it the method vector.  */
   SCM_SETSTREAM (port, SCM_UNPACK (method_vector));
 
-  /* Have the port directly access the buffer (bytevector).  */
-  c_port->read_pos = c_port->read_buf = (unsigned char *) c_bv;
-  c_port->read_end = (unsigned char *) c_bv;
-  c_port->read_buf_size = c_len;
+  /* Make it unbuffered.  This is necessary to guarantee that (1) users can
+     actually implement GET_POSITION_PROC correctly, and that (2) each
+     'get-bytevector-*' call has exactly one corresponding READ_PROC call.
+     The latter is necessary in some applications, typically when wrapping
+     another port where we don't want to consume more than what was
+     actually asked for.  */
+  scm_port_non_buffer (c_port);
 
   /* Mark PORT as open, readable and unbuffered (hmm, how elegant...).  */
   SCM_SET_CELL_TYPE (port, custom_binary_input_port_type | mode_bits);
@@ -346,34 +349,54 @@ cbip_fill_input (SCM port)
   int result;
   scm_t_port *c_port = SCM_PTAB_ENTRY (port);
 
- again:
   if (c_port->read_pos >= c_port->read_end)
     {
       /* Invoke the user's `read!' procedure.  */
-      unsigned c_octets;
+      size_t c_octets, c_requested;
       SCM bv, read_proc, octets;
 
-      /* Use the bytevector associated with PORT as the buffer passed to the
+      read_proc = SCM_CBIP_READ_PROC (port);
+
+      /* Attempt to pass the bytevector associated with PORT to the
 	 `read!' procedure, thereby avoiding additional allocations.  */
       bv = SCM_CBIP_BYTEVECTOR (port);
-      read_proc = SCM_CBIP_READ_PROC (port);
 
-      /* The assumption here is that C_PORT's internal buffer wasn't changed
-	 behind our back.  */
-      assert (c_port->read_buf ==
-	      (unsigned char *) SCM_BYTEVECTOR_CONTENTS (bv));
-      assert ((unsigned) c_port->read_buf_size
-	      == SCM_BYTEVECTOR_LENGTH (bv));
+      /* When called via the 'get-bytevector-*' procedures, and thus via
+	 'scm_c_read', we are passed the caller-provided buffer, so we need
+	 to check its size.  */
+      c_requested = c_port->read_buf_size;
+
+      if (SCM_BYTEVECTOR_LENGTH (bv) < c_requested)
+	{
+	  /* Bad luck: we have to make another allocation.  Save that
+	     bytevector for later reuse, in the hope that the application
+	     has regular access patterns.  */
+	  bv = scm_c_make_bytevector (c_requested);
+	  SCM_SET_CBIP_BYTEVECTOR (port, bv);
+	}
+
+      /* READ_PROC must always be called with a strictly positive number of
+	 bytes to read; otherwise it is forced to return 0, which is used
+	 to indicate EOF.  */
+      if (SCM_LIKELY (c_requested > 0))
+	{
+	  octets = scm_call_3 (read_proc, bv, SCM_INUM0,
+			       scm_from_size_t (c_requested));
+	  c_octets = scm_to_size_t (octets);
+	}
+      else
+	c_octets = 0;
 
-      octets = scm_call_3 (read_proc, bv, SCM_INUM0,
-			   SCM_I_MAKINUM (CBIP_BUFFER_SIZE));
-      c_octets = scm_to_uint (octets);
+      if (SCM_UNLIKELY (c_octets > c_requested))
+	scm_out_of_range (FUNC_NAME, octets);
 
-      c_port->read_pos = (unsigned char *) SCM_BYTEVECTOR_CONTENTS (bv);
+      /* Copy the data back to the original buffer.  */
+      memcpy ((char *) c_port->read_pos, SCM_BYTEVECTOR_CONTENTS (bv),
+	      c_octets);
       c_port->read_end = (unsigned char *) c_port->read_pos + c_octets;
 
-      if (c_octets > 0)
-	goto again;
+      if (c_octets != 0 || c_requested == 0)
+	result = (int) *c_port->read_pos;
       else
 	result = EOF;
     }
diff --git a/test-suite/tests/r6rs-ports.test b/test-suite/tests/r6rs-ports.test
index eaae29f..41b46b1 100644
--- a/test-suite/tests/r6rs-ports.test
+++ b/test-suite/tests/r6rs-ports.test
@@ -447,6 +447,66 @@ not `set-port-position!'"
                          (u8-list->bytevector
                           (map char->integer (string->list "Port!")))))))
 
+  (pass-if-equal "custom binary input & 'port-position'"
+      '(0 2 5 11)
+    ;; Check that the value returned by 'port-position' is correct, and
+    ;; that each 'port-position' call leads one call to the
+    ;; 'get-position' method.
+    (let* ((str    "Hello Port!")
+           (output (make-bytevector (string-length str)))
+           (source (with-fluids ((%default-port-encoding "UTF-8"))
+                     (open-string-input-port str)))
+           (read!  (lambda (bv start count)
+                     (let ((r (get-bytevector-n! source bv start count)))
+                       (if (eof-object? r)
+                           0
+                           r))))
+           (pos     '())
+           (get-pos (lambda ()
+                      (let ((p (port-position source)))
+                        (set! pos (cons p pos))
+                        p)))
+           (port    (make-custom-binary-input-port "the port" read!
+                                                   get-pos #f #f)))
+
+      (and (= 0 (port-position port))
+           (begin
+             (get-bytevector-n! port output 0 2)
+             (= 2 (port-position port)))
+           (begin
+             (get-bytevector-n! port output 2 3)
+             (= 5 (port-position port)))
+           (let ((bv (string->utf8 (get-string-all port))))
+             (bytevector-copy! bv 0 output 5 (bytevector-length bv))
+             (= (string-length str) (port-position port)))
+           (bytevector=? output (string->utf8 str))
+           (reverse pos))))
+
+  (pass-if-equal "custom binary input & 'read!' calls"
+      `((2 "He") (3 "llo") (42 " Port!"))
+    (let* ((str    "Hello Port!")
+           (source (with-fluids ((%default-port-encoding "UTF-8"))
+                     (open-string-input-port str)))
+           (reads  '())
+           (read!  (lambda (bv start count)
+                     (set! reads (cons count reads))
+                     (let ((r (get-bytevector-n! source bv start count)))
+                       (if (eof-object? r)
+                           0
+                           r))))
+           (port   (make-custom-binary-input-port "the port" read!
+                                                  #f #f #f)))
+
+      (let ((ret (list (get-bytevector-n port 2)
+                       (get-bytevector-n port 3)
+                       (get-bytevector-n port 42))))
+        (zip (reverse reads)
+             (map (lambda (obj)
+                    (if (bytevector? obj)
+                        (utf8->string obj)
+                        obj))
+                  ret)))))
+
   (pass-if "custom binary input port `close-proc' is called"
     (let* ((closed?  #f)
            (read!    (lambda (bv start count) 0))
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2014-01-21 17:37 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-01-14 23:00 Making custom binary input ports unbuffered Ludovic Courtès
2014-01-15  5:43 ` Mark H Weaver
2014-01-15 11:48   ` Ludovic Courtès
2014-01-15 22:51   ` Ludovic Courtès
2014-01-16  0:15     ` Mark H Weaver
2014-01-16 23:00       ` Ludovic Courtès
2014-01-18 21:57         ` Ludovic Courtès
2014-01-21  7:41         ` Mark H Weaver
2014-01-21 10:50           ` Ludovic Courtès
2014-01-21 14:46             ` Mark H Weaver
2014-01-21 17:37               ` Ludovic Courtès

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).