Converting a part of byte vector to UTF-8 string

unofficial mirror of guile-user@gnu.org 

* Converting a part of byte vector to UTF-8 string
@ 2014-01-13 23:17 Panicz Maciej Godek
  2014-01-15  4:59 ` Nala Ginrut
  0 siblings, 1 reply; 4+ messages in thread
From: Panicz Maciej Godek @ 2014-01-13 23:17 UTC (permalink / raw)
  To: guile-user@gnu.org

Hi,
what would be the best way to convert
only a part of a byte vector (interpreted as
UTF-8) to string?

Let's say that I have a big buffer,
(define buffer (make-u8vector 1024))

which contains some message
(define n (recv! sock buffer))

I'd like to get only the first n bytes of buffer.
I initially thought that this would do:

(utf8->string (make-shared-array buffer list `(0 ,(- n 1))))

(the utf8->string comes from ((rnrs) #:version (6)) module)

However, it failed: utf8->string expects a bytevector, and
make-shared-array returns a generic array.

Another option would be to use
(substring (utf8->string buffer) 0 n)

This one works, but according to the manual, the
string is "newly allocated", so it's unnecessary overhead.
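
A minimal sketch of one caveat with the substring approach, assuming the
message may contain non-ASCII text: n is a byte count, while substring
counts characters, so the two only agree for pure ASCII data. The buffer
contents and the names message and by-chars are made up for illustration.

```scheme
(use-modules (rnrs bytevectors))

;; Hypothetical stand-in for the receive buffer: 16 zero bytes with a
;; 7-byte UTF-8 message (three 2-byte letters followed by "w") in front.
(define buffer (make-bytevector 16 0))
(define message
  (u8-list->bytevector '(#xC5 #xBC #xC3 #xB3 #xC5 #x82 #x77)))
(define n (bytevector-length message))   ; 7, as recv! would report
(bytevector-copy! message 0 buffer 0 n)

;; Decode the whole buffer, then keep the first n *characters*.
(define by-chars (substring (utf8->string buffer) 0 n))

(display (string-length (utf8->string message))) (newline) ; 4
(display (string-length by-chars)) (newline)               ; 7
```

The second result is longer because it keeps three of the padding NUL
characters that follow the four real ones.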

What would be the best solution?

TIA
M

* Re: Converting a part of byte vector to UTF-8 string
  2014-01-13 23:17 Converting a part of byte vector to UTF-8 string Panicz Maciej Godek
@ 2014-01-15  4:59 ` Nala Ginrut
  2014-01-15 15:27   ` Panicz Maciej Godek
  0 siblings, 1 reply; 4+ messages in thread
From: Nala Ginrut @ 2014-01-15  4:59 UTC (permalink / raw)
  To: Panicz Maciej Godek; +Cc: guile-user@gnu.org

hi there!

On Tue, 2014-01-14 at 00:17 +0100, Panicz Maciej Godek wrote:
> Another option would be to use
> (substring (utf8->string buffer) 0 n)
> 
> This one works, but according to the manual, the
> string is "newly allocated", so it's unnecessary overhead.
> 

Actually, substring is COW (copy-on-write), so you don't have to
worry. You may also try substring/shared, which keeps sharing the
characters with the original string instead of copying them.
But please be careful about the side effects in your context ;-)
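
A small sketch of the side effect in question, assuming a mutable source
string; the names original, cow and shared are made up for illustration:

```scheme
;; String literals are read-only in Guile, so start from a mutable copy.
(define original (string-copy "hello world"))

(define cow    (substring original 0 5))         ; copy-on-write
(define shared (substring/shared original 0 5))  ; keeps sharing storage

(string-set! cow 0 #\J)       ; cow gets its own copy; original untouched
(display original) (newline)  ; hello world

(string-set! shared 0 #\Y)    ; writes through to the original
(display original) (newline)  ; Yello world
```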

> What would be the best solution?
> 

IMO, no matter whether you use substring or substring/shared in this
context, you have to allocate a new string. The reason is that we don't
have something like bytevector/shared.

But IIRC a bytevector in Guile is similar to a C array, which means you
can avoid any allocation when you slice a bytevector, as long as you
handle the array pointer properly.
So one may take advantage of it.

!!But I can't say you can avoid allocation when you convert bytevector
to string, because either utf8->string or pointer->string will allocate
anyway.

(Anyone correct me please if I'm wrong!)

Here's my black magic:
-------------------------------cut------------------------------
(use-modules (system foreign)) ; to handle the C pointer

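(define* (bv->string/partly bv #:optional (start 0)
                                          (end #f)
                                          (size 1)
                                          (encoding "utf-8"))
  ;; START and END count characters of SIZE bytes each, so this is only
  ;; correct when every character in the range has the same encoded width.
  (let ((len (if end
                 (* size (- end start))
                 (- (bytevector-length bv) (* size start))))
        ;; Raw address of the first byte to decode.
        (addr (+ (pointer-address (bytevector->pointer bv))
                 (* size start))))
    ;; pointer->string still allocates a fresh string for the result.
    (pointer->string (make-pointer addr) len encoding)))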
-------------------------------end--------------------------------

(define bv (string->utf8 "我了个去啊"))
;; NOTE: each of these Chinese characters takes 3 bytes in UTF-8, so size==3
(bv->string/partly bv 2 4 3)
==> "个去"

;; And for common Latin characters, whose size==1
(define bv2 (string->utf8 "hello world"))
(bv->string/partly bv2 0 5)
==> "hello"

But I have to give a warning again: when you try to avoid allocation
overhead, you have to face the risk of side effects. As for me, I'd
prefer the purely functional way. ;-P

> TIA
> M

* Re: Converting a part of byte vector to UTF-8 string
  2014-01-15  4:59 ` Nala Ginrut
@ 2014-01-15 15:27   ` Panicz Maciej Godek
  2014-01-15 18:29     ` Mark H Weaver
  0 siblings, 1 reply; 4+ messages in thread
From: Panicz Maciej Godek @ 2014-01-15 15:27 UTC (permalink / raw)
  To: Nala Ginrut; +Cc: guile-user@gnu.org

hello :)

[...]

> But I have to give a warning again: when you try to avoid allocation
> overhead, you have to face the risk of side effects. As for me, I'd
> prefer the purely functional way. ;-P
>
>
Your solution seems reasonable, but I have found another way, which led me
to some new problems.
I realised that since sockets are ports in guile, I could process them with
the plain "read" (which is what I have been using them for anyway).

However, this approach caused some new problems. The thing is that if I'm
trying to read some message from port, and that message does not end with a
delimiter (like a whitespace or a balancing, closing parenthesis), then the
read would wait forever, possibly gluing its arguments.

The solution I came up with is through soft ports. The idea is to have a
port proxy, that -- if it would block -- would return an eof-object instead.

The current implementation is rather straightforward:
(define (nonblocking port)
  "returns a port proxy that returns eof-object on read attempt \
if a read would block"
  (make-soft-port
   (vector
    ;; 0. procedure accepting one character for output
    ;;    (write-char, because write would emit the #\c notation)
    (lambda (c) (write-char c port))
    ;; 1. procedure accepting a string for output
    (lambda (s) (display s port))
    ;; 2. thunk for flushing output
    (lambda () (force-output port))
    ;; 3. thunk for getting one character
    ;;    (a #f result is treated as end-of-file by the soft port)
    (lambda () (and (char-ready? port)
                    (read-char port)))
    ;; 4. thunk for closing port (not by garbage collection)
    (lambda () (close-port port))
    ;; 5. (if present and not `#f') thunk for computing the number of
    ;;    characters that can be read from the port without blocking
    (lambda () (if (char-ready? port)
                   1
                   0)))
   (string-append (if (input-port? port) "r" "")
                  (if (output-port? port) "w" ""))))

One problem is that two messages, if not formatted properly, can still be
glued together (although they no longer cause read to hang).

The other thing that puzzles me is the last function provided to the soft
port vector -- the one that computes the number of characters that can be
read.

Its only public interface I know of is through the "char-ready?" procedure.
So there is no way for me to check the number of characters available in
the original port. This is strange.

The other thing is that I would like to have some means to make sure that
an eof object is emitted after reading each packet sent through the socket,
so that it wouldn't be possible to glue together data sent in two separate
packets.

I could of course create a more sophisticated soft-port, that would be
implemented entirely using send and recv!, but I wonder if there's any
simpler way.
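
One possible sketch of that idea without a soft port, assuming each recv!
call returns exactly one complete message (which a stream socket does not
guarantee): decode the received bytes and run read over a string port, so
that read sees a real end-of-file at the end of the chunk. The helper name
chunk->datums and the sample message are hypothetical.

```scheme
(use-modules (rnrs bytevectors))

;; Decode the first N bytes of BUFFER and read every datum found there.
;; The string port ends where the chunk ends, so READ can neither block
;; nor run on into data from a later chunk.
(define (chunk->datums buffer n)
  (let ((bv (make-bytevector n)))
    (bytevector-copy! buffer 0 bv 0 n)
    (call-with-input-string (utf8->string bv)
      (lambda (port)
        (let loop ((datum (read port)) (acc '()))
          (if (eof-object? datum)
              (reverse acc)
              (loop (read port) (cons datum acc))))))))

;; Stand-in for (recv! sock buffer): a 1024-byte buffer with one message.
(define buffer (make-bytevector 1024 0))
(define message (string->utf8 "(move 1 2) stop"))
(bytevector-copy! message 0 buffer 0 (bytevector-length message))

(write (chunk->datums buffer (bytevector-length message)))
(newline)   ; prints ((move 1 2) stop)
```

A chunk that ends in the middle of a datum makes read signal an error
rather than wait, so the caller has to decide how to handle that case.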

Thanks!

* Re: Converting a part of byte vector to UTF-8 string
  2014-01-15 15:27   ` Panicz Maciej Godek
@ 2014-01-15 18:29     ` Mark H Weaver
  0 siblings, 0 replies; 4+ messages in thread
From: Mark H Weaver @ 2014-01-15 18:29 UTC (permalink / raw)
  To: Panicz Maciej Godek; +Cc: guile-user

Panicz Maciej Godek <godek.maciek@gmail.com> writes:

> Your solution seems reasonable, but I have found another way, which
> led me to some new problems.
> I realised that since sockets are ports in guile, I could process them
> with the plain "read" (which is what I have been using them for
> anyway).
>
> However, this approach caused some new problems. The thing is that if
> I'm trying to read some message from port, and that message does not
> end with a delimiter (like a whitespace or a balancing, closing
> parenthesis), then the read would wait forever, possibly gluing its
> arguments.
>
> The solution I came up with is through soft ports. The idea is to have
> a port proxy, that -- if it would block -- would return an eof-object
> instead.

This is terribly inefficient, and also not robust.  Guile's native soft
ports do not support efficient reading, because everything is one
character at a time.  Also, Guile's 'char-ready?' currently does the job
of 'u8-ready?', i.e. it only checks if a _byte_ is available, not a
whole character, so the 'read-char' might still block.  Anyway, if this
is a socket, what if the data isn't available simply because of network
latency?  Then you'll generate a spurious EOF.


To offer my own answer to your original question: R7RS-small provides an
API that does precisely what you asked for.  Its 'utf8->string'
procedure accepts optional 'start' and 'end' byte positions.  I
implemented this on the 'r7rs-wip' branch of Guile git as follows:

http://git.savannah.gnu.org/gitweb/?p=guile.git;a=blob;f=module/scheme/base.scm;h=f110d4c2b241ec0941b4223cece05c309db5308a;hb=r7rs-wip#l327

  (import (rename (rnrs bytevectors)
                  (utf8->string      r6rs-utf8->string)
                  (string->utf8      r6rs-string->utf8)
                  (bytevector-copy   r6rs-bytevector-copy)
                  (bytevector-copy!  r6rs-bytevector-copy!)))

  [...]

  (define bytevector-copy
    (case-lambda
      ((bv)
       (r6rs-bytevector-copy bv))
      ((bv start)
       (let* ((len (- (bytevector-length bv) start))
              (result (make-bytevector len)))
         (r6rs-bytevector-copy! bv start result 0 len)
         result))
      ((bv start end)
       (let* ((len (- end start))
              (result (make-bytevector len)))
         (r6rs-bytevector-copy! bv start result 0 len)
         result))))

  (define utf8->string
    (case-lambda
      ((bv) (r6rs-utf8->string bv))
      ((bv start)
       (r6rs-utf8->string (bytevector-copy bv start)))
      ((bv start end)
       (r6rs-utf8->string (bytevector-copy bv start end)))))
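
Applied to the original question, the same approach can be sketched with
plain (rnrs bytevectors) procedures. The helper name utf8->string/range
and the sample payload are hypothetical, and the recv! call is replaced
by a copy into the buffer so that the example runs on its own:

```scheme
(use-modules (rnrs bytevectors))

;; Hypothetical helper taking the same (bv start end) byte positions
;; as the R7RS utf8->string.
(define (utf8->string/range bv start end)
  (let* ((len (- end start))
         (slice (make-bytevector len)))
    (bytevector-copy! bv start slice 0 len)
    (utf8->string slice)))

;; Stand-in for the 1024-byte receive buffer from the original question.
(define buffer (make-bytevector 1024 0))
(define payload (string->utf8 "hello world"))
(define n (bytevector-length payload))   ; what recv! would return
(bytevector-copy! payload 0 buffer 0 n)

(display (utf8->string/range buffer 0 n)) (newline)  ; hello world
(display (utf8->string/range buffer 6 n)) (newline)  ; world
```

This still copies the n bytes once before decoding, but it never decodes
the unused remainder of the buffer.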


