From: Vijay Marupudi
Newsgroups: gmane.lisp.guile.devel
Subject: Re: [PATCH] Enable utf8->string to take a range
Date: Fri, 21 Jan 2022 20:21:44 -0500
Message-ID: <87pmokmuon.fsf@vijaymarupudi.com>
References: <87h79x6abc.fsf@vijaymarupudi.com> <0f4ce6f8ddbdd9456dcc0063b206bf8c76d71da6.camel@telenet.be> <87bl046dss.fsf@vijaymarupudi.com>
To: Maxime Devos, guile-devel@gnu.org
List-Id: "Developers list for Guile, the GNU extensibility library"
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="

--=-=-=
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

> It would be nice to check multibyte characters as well,
> to verify that byte indices and not character indices are used.
>
> E.g., (utf8->string #vu8(195 169) 0 2) should return "é".
>
> Another nice test: (utf8->string #vu8(195 169) 0 1) should raise
> a 'decoding-error', even though #vu8(195 169) is valid UTF-8.
>
> And (utf8->string #vu8(0 32 196) 0 2) should return "\x00 " even
> though #vu8(0 32 196) is invalid UTF-8 -- and as a bonus, it checks
> that the nul character is supported -- which can be easily forgotten
> because Guile is implemented in C which usually terminates strings
> by zero instead of using a length field.

Thank you for the suggestions. I have added all the tests you suggested
to the test suite, and they all pass.

> Overall, the patch you sent seems a reasonable approach to me, though
> I didn't verify the details. I find myself at times copying a part of
> a bytevector to a new bytevector because some procedure doesn't allow
> specifying byte ranges ...

I'm glad it will be useful for you!

In addition to those tests, I have added the range functionality to both
utf16->string and utf32->string. I have updated the documentation, and
the tests pass.

I have also changed the names of the functions to emphasize that the
range is on the bytevector (not the string). The new C functions are the
following:

SCM scm_utf8_range_to_string (SCM, SCM, SCM);
SCM scm_utf16_range_to_string (SCM, SCM, SCM, SCM);
SCM scm_utf32_range_to_string (SCM, SCM, SCM, SCM);

In a separate patch, I have removed the wrapper function for R7RS
compatibility and exported the updated utf8->string function directly.
In the process, I also removed a function that was not being used
anywhere.

I have attached the edited patch, and the new R7RS patch.
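
For readers following along, the byte-versus-character point behind the
suggested tests can be sketched in standalone C. This is only an
illustration under stated assumptions, not the code in the attached patch:
the names check_utf8_byte_range, range_in_bounds, utf8_well_formed and the
range_status enum are hypothetical, the status codes stand in for the
Scheme exceptions Guile would raise, and the validator checks only the
lead/continuation byte structure (it does not reject overlong forms or
surrogates).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes for this sketch.  Guile itself signals
   out-of-range and decoding-error exceptions instead. */
enum range_status
{
  RANGE_OK = 0,
  RANGE_OUT_OF_BOUNDS = 1,
  RANGE_DECODING_ERROR = 2
};

/* The bounds rule documented in the patch: 0 <= start <= end <= len.
   All three values are byte offsets, never character offsets. */
static int
range_in_bounds (size_t len, size_t start, size_t end)
{
  return start <= end && end <= len;
}

/* Simplified structural check: every lead byte must be followed by the
   right number of continuation bytes inside the given slice. */
static int
utf8_well_formed (const unsigned char *buf, size_t len)
{
  size_t i = 0;
  while (i < len)
    {
      unsigned char b = buf[i];
      size_t extra;

      if (b < 0x80)
        extra = 0;
      else if ((b & 0xE0) == 0xC0)
        extra = 1;
      else if ((b & 0xF0) == 0xE0)
        extra = 2;
      else if ((b & 0xF8) == 0xF0)
        extra = 3;
      else
        return 0;               /* stray continuation or invalid lead byte */

      if (extra > len - i - 1)
        return 0;               /* sequence is cut off by the end of the slice */

      for (size_t k = 1; k <= extra; k++)
        if ((buf[i + k] & 0xC0) != 0x80)
          return 0;

      i += extra + 1;
    }
  return 1;
}

/* Validate the byte range first, then decode-check only the slice.
   The length is carried explicitly, so embedded NUL bytes are fine. */
static enum range_status
check_utf8_byte_range (const unsigned char *buf, size_t len,
                       size_t start, size_t end)
{
  assert (buf != NULL);
  if (!range_in_bounds (len, start, end))
    return RANGE_OUT_OF_BOUNDS;
  if (!utf8_well_formed (buf + start, end - start))
    return RANGE_DECODING_ERROR;
  return RANGE_OK;
}
```

With the bytes 195 169, the range 0..2 covers the whole two-byte sequence
and is accepted, while 0..1 cuts the sequence in half and is reported as a
decoding error. With the bytes 0 32 196, the range 0..2 is accepted even
though the full buffer is not valid UTF-8, because only the slice is
examined and the NUL byte does not end it.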

~ Vijay

--=-=-=
Content-Type: text/x-patch; charset=utf-8
Content-Disposition: inline; filename=0001-Allow-utf8-string-utf16-string-utf32-string-to-take-.patch
Content-Transfer-Encoding: quoted-printable

>From c6be127b4818d43a0244592c18a52de113d3ff08 Mon Sep 17 00:00:00 2001
From: Vijay Marupudi
Date: Thu, 20 Jan 2022 22:19:25 -0500
Subject: [PATCH 1/2] Allow utf8->string, utf16->string, utf32->string to take ranges

Added the following new functions, that behave like substring, but for
bytevector to string conversion.

scm_utf8_range_to_string (SCM, SCM, SCM);
scm_utf16_range_to_string (SCM, SCM, SCM, SCM);
scm_utf32_range_to_string (SCM, SCM, SCM, SCM);

* doc/ref/api-data.texi: Updated documentation to reflect new function
  and range constraints
* libguile/bytevectors.c: Added new function.
* libguile/bytevectors.h: Added new function declaration.
* test-suite/tests/bytevectors.test: Added tests for exceptions and
  behavior for edge cases
---
 doc/ref/api-data.texi             |  15 +++-
 libguile/bytevectors.c            | 144 +++++++++++++++++++++++-------
 libguile/bytevectors.h            |   3 +
 test-suite/tests/bytevectors.test |  37 ++++++++
 4 files changed, 164 insertions(+), 35 deletions(-)

diff --git a/doc/ref/api-data.texi b/doc/ref/api-data.texi
index b6c2c4d61..44b64454f 100644
--- a/doc/ref/api-data.texi
+++ b/doc/ref/api-data.texi
@@ -7139,16 +7139,25 @@ UTF-32 (aka. UCS-4) encoding of @var{str}. For UTF-16 and UTF-32,
 it defaults to big endian.
@end deffn =20 -@deffn {Scheme Procedure} utf8->string utf -@deffnx {Scheme Procedure} utf16->string utf [endianness] -@deffnx {Scheme Procedure} utf32->string utf [endianness] +@deffn {Scheme Procedure} utf8->string utf [start [end]] +@deffnx {Scheme Procedure} utf16->string utf [endianness [start [end]]] +@deffnx {Scheme Procedure} utf32->string utf [endianness [start [end]]] @deffnx {C Function} scm_utf8_to_string (utf) +@deffnx {C Function} scm_utf8_range_to_string (utf, start, end) @deffnx {C Function} scm_utf16_to_string (utf, endianness) +@deffnx {C Function} scm_utf16_range_to_string (utf, endianness, start, en= d) @deffnx {C Function} scm_utf32_to_string (utf, endianness) +@deffnx {C Function} scm_utf32_range_to_string (utf, endianness, start, en= d) + Return a newly allocated string that contains from the UTF-8-, UTF-16-, or UTF-32-decoded contents of bytevector @var{utf}. For UTF-16 and UTF-32, @var{endianness} should be the symbol @code{big} or @code{little}; when om= itted, it defaults to big endian. + +@var{start} and @var{end}, when provided, must be exact integers +satisfying: + +0 <=3D @var{start} <=3D @var{end} <=3D @code{(bytevector-length @var{utf})= }. @end deffn =20 @node Bytevectors as Arrays diff --git a/libguile/bytevectors.c b/libguile/bytevectors.c index f42fbb427..12d299042 100644 --- a/libguile/bytevectors.c +++ b/libguile/bytevectors.c @@ -2061,25 +2061,46 @@ SCM_DEFINE (scm_string_to_utf32, "string->utf32", =20 /* Produce the body of a function that converts a UTF-encoded bytevector t= o a string. 
*/ -#define UTF_TO_STRING(_utf_width) \ +#define UTF_TO_STRING(_utf_width, utf, endianness, start, end) \ SCM str =3D SCM_BOOL_F; \ int err; \ char *c_str =3D NULL; \ char c_utf_name[MAX_UTF_ENCODING_NAME_LEN]; \ char *c_utf; \ - size_t c_strlen =3D 0, c_utf_len =3D 0; \ + size_t c_strlen =3D 0, c_utf_len, c_start, c_end; \ \ - SCM_VALIDATE_BYTEVECTOR (1, utf); \ - if (scm_is_eq (endianness, SCM_UNDEFINED)) \ - endianness =3D sym_big; \ + SCM_VALIDATE_BYTEVECTOR (1, (utf)); \ + if (scm_is_eq ((endianness), SCM_UNDEFINED)) \ + (endianness) =3D sym_big; \ else \ - SCM_VALIDATE_SYMBOL (2, endianness); \ + SCM_VALIDATE_SYMBOL (2, (endianness)); \ \ - c_utf_len =3D SCM_BYTEVECTOR_LENGTH (utf); \ - c_utf =3D (char *) SCM_BYTEVECTOR_CONTENTS (utf); \ - utf_encoding_name (c_utf_name, (_utf_width), endianness); \ + c_utf_len =3D SCM_BYTEVECTOR_LENGTH ((utf)); \ + c_utf =3D (char *) SCM_BYTEVECTOR_CONTENTS ((utf)); \ + utf_encoding_name (c_utf_name, (_utf_width), (endianness)); \ + \ + if (!scm_is_eq ((start), SCM_UNDEFINED)) \ + { \ + c_start =3D scm_to_unsigned_integer ((start), 0, c_utf_len); \ + } \ + else \ + { \ + c_start =3D 0; \ + } \ + \ + if (!scm_is_eq ((end), SCM_UNDEFINED)) \ + { \ + c_end =3D scm_to_unsigned_integer ((end), 0, c_utf_len); \ + } \ + else \ + { \ + c_end =3D c_utf_len; \ + } \ + \ + validate_bytevector_range(FUNC_NAME, c_utf_len, c_start, c_end); \ + \ \ - err =3D mem_iconveh (c_utf, c_utf_len, \ + err =3D mem_iconveh (c_utf + c_start, c_end - c_start, \ c_utf_name, "UTF-8", \ iconveh_question_mark, NULL, \ &c_str, &c_strlen); \ @@ -2094,46 +2115,105 @@ SCM_DEFINE (scm_string_to_utf32, "string->utf32", return (str); =20 =20 -SCM_DEFINE (scm_utf8_to_string, "utf8->string", - 1, 0, 0, - (SCM utf), - "Return a newly allocate string that contains from the UTF-8-" - "encoded contents of bytevector @var{utf}.") -#define FUNC_NAME s_scm_utf8_to_string +static inline void +validate_bytevector_range(const char* function_name, size_t len, size_t st= art, 
size_t end) { + if (SCM_UNLIKELY (start > len)) + { + scm_out_of_range (function_name, scm_from_size_t(start)); + } + if (SCM_UNLIKELY (end > len)) + { + scm_out_of_range (function_name, scm_from_size_t(end)); + } + if (SCM_UNLIKELY(end < start)) + { + scm_out_of_range (function_name, scm_from_size_t(end)); + } +} + + +SCM_DEFINE (scm_utf8_range_to_string, "utf8->string", + 1, 2, 0, + (SCM utf, SCM start, SCM end), + "Return a newly allocate string that contains from the UTF-8-" + "encoded contents of bytevector @var{utf}.") +#define FUNC_NAME s_scm_utf8_range_to_string { SCM str; const char *c_utf; - size_t c_utf_len =3D 0; + size_t c_start; + size_t c_end; + size_t c_len; =20 SCM_VALIDATE_BYTEVECTOR (1, utf); - - c_utf_len =3D SCM_BYTEVECTOR_LENGTH (utf); c_utf =3D (char *) SCM_BYTEVECTOR_CONTENTS (utf); - str =3D scm_from_utf8_stringn (c_utf, c_utf_len); + c_len =3D SCM_BYTEVECTOR_LENGTH(utf); + + if (!scm_is_eq (start, SCM_UNDEFINED)) + { + c_start =3D scm_to_unsigned_integer (start, 0, c_len); + } + else + { + c_start =3D 0; + } =20 + if (!scm_is_eq (end, SCM_UNDEFINED)) + { + c_end =3D scm_to_unsigned_integer (end, 0, c_len); + } + else + { + c_end =3D c_len; + } + + validate_bytevector_range(FUNC_NAME, c_len, c_start, c_end); + str =3D scm_from_utf8_stringn (c_utf + c_start, c_end - c_start); return (str); } #undef FUNC_NAME =20 -SCM_DEFINE (scm_utf16_to_string, "utf16->string", - 1, 1, 0, - (SCM utf, SCM endianness), - "Return a newly allocate string that contains from the UTF-16-" - "encoded contents of bytevector @var{utf}.") +SCM +scm_utf8_to_string(SCM utf) +#define FUNC_NAME s_scm_utf8_to_string +{ + return scm_utf8_range_to_string(utf, SCM_UNDEFINED, SCM_UNDEFINED); +} +#undef FUNC_NAME + +SCM_DEFINE (scm_utf16_range_to_string, "utf16->string", + 1, 3, 0, + (SCM utf, SCM endianness, SCM start, SCM end), + "Return a newly allocate string that contains from the UTF-8-" + "encoded contents of bytevector @var{utf}.") +#define FUNC_NAME 
s_scm_utf16_range_to_string +{ + UTF_TO_STRING(16, utf, endianness, start, end); +} +#undef FUNC_NAME + +SCM scm_utf16_to_string (SCM utf, SCM endianness) #define FUNC_NAME s_scm_utf16_to_string { - UTF_TO_STRING (16); + return scm_utf16_range_to_string(utf, endianness, SCM_UNDEFINED, SCM_UND= EFINED); } #undef FUNC_NAME =20 -SCM_DEFINE (scm_utf32_to_string, "utf32->string", - 1, 1, 0, - (SCM utf, SCM endianness), - "Return a newly allocate string that contains from the UTF-32-" - "encoded contents of bytevector @var{utf}.") +SCM_DEFINE (scm_utf32_range_to_string, "utf32->string", + 1, 3, 0, + (SCM utf, SCM endianness, SCM start, SCM end), + "Return a newly allocate string that contains from the UTF-8-" + "encoded contents of bytevector @var{utf}.") +#define FUNC_NAME s_scm_utf32_range_to_string +{ + UTF_TO_STRING(32, utf, endianness, start, end); +} +#undef FUNC_NAME + +SCM scm_utf32_to_string (SCM utf, SCM endianness) #define FUNC_NAME s_scm_utf32_to_string { - UTF_TO_STRING (32); + return scm_utf32_range_to_string(utf, endianness, SCM_UNDEFINED, SCM_UND= EFINED); } #undef FUNC_NAME =20 diff --git a/libguile/bytevectors.h b/libguile/bytevectors.h index 980d6e267..63d8e3119 100644 --- a/libguile/bytevectors.h +++ b/libguile/bytevectors.h @@ -113,8 +113,11 @@ SCM_API SCM scm_string_to_utf8 (SCM); SCM_API SCM scm_string_to_utf16 (SCM, SCM); SCM_API SCM scm_string_to_utf32 (SCM, SCM); SCM_API SCM scm_utf8_to_string (SCM); +SCM_API SCM scm_utf8_range_to_string (SCM, SCM, SCM); SCM_API SCM scm_utf16_to_string (SCM, SCM); +SCM_API SCM scm_utf16_range_to_string (SCM, SCM, SCM, SCM); SCM_API SCM scm_utf32_to_string (SCM, SCM); +SCM_API SCM scm_utf32_range_to_string (SCM, SCM, SCM, SCM); =20 =20 diff --git a/test-suite/tests/bytevectors.test b/test-suite/tests/bytevecto= rs.test index 732aadb3e..f8c6a8df1 100644 --- a/test-suite/tests/bytevectors.test +++ b/test-suite/tests/bytevectors.test @@ -558,6 +558,43 @@ exception:decoding-error (utf8->string #vu8(104 105 239 191 
50))) =20 + (pass-if "utf8->string range: start provided" + (let* ((utf8 (string->utf8 "gnu guile")) + (str (utf8->string utf8 4))) + (string=3D? str "guile"))) + + (pass-if "utf8->string range: start and end provided" + (let* ((utf8 (string->utf8 "gnu guile")) + (str (utf8->string utf8 4 7))) + (string=3D? str "gui"))) + + (pass-if "utf8->string range: start =3D end =3D 0" + (let* ((utf8 (string->utf8 "gnu guile")) + (str (utf8->string utf8 0 0))) + (string=3D? str ""))) + + (pass-if-exception "utf8->string range: start > len" + exception:out-of-range + (let* ((utf8 (string->utf8 "four"))) + ;; 4 as start is expected to return an empty string, in congruence + ;; with `substring'. + (utf8->string utf8 5))) + + (pass-if-exception "utf8->string range: end < start" + exception:out-of-range + (let* ((utf8 (string->utf8 "gnu guile"))) + (utf8->string utf8 1 0))) + + (pass-if "utf8->string range: multibyte characters" + (string=3D? (utf8->string #vu8(195 169 67) 0 2) "=C3=A9")) + + (pass-if-exception "utf8->string range: decoding error for invalid range" + exception:decoding-error + (utf8->string #vu8(195 169) 0 1)) + + (pass-if "utf8->string range: null byte non-termination" + (string=3D? (utf8->string #vu8(0 32 196) 0 2) "\x00 ")) + (pass-if "utf16->string" (let* ((utf16 (uint-list->bytevector (map char->integer (string->list "hello, world= ")) --=20 2.34.1 --=-=-= Content-Type: text/x-patch Content-Disposition: inline; filename=0002-Re-export-utf8-string-instead-of-wrapper-with-extra-.patch >From 94bbaaf6dc4760eb22bb4b2594648b1f8dbf83a5 Mon Sep 17 00:00:00 2001 From: Vijay Marupudi Date: Fri, 21 Jan 2022 20:17:42 -0500 Subject: [PATCH 2/2] Re-export utf8->string instead of wrapper with extra allocation. In addition, remove the redundant function that was dead code. module/scheme/base.scm: Deleted wrapped and exported the default utf8->string. 
--- module/scheme/base.scm | 12 +----------- 1 file changed, 1 insertion(+), 11 deletions(-) diff --git a/module/scheme/base.scm b/module/scheme/base.scm index c6a73c092..28adb2e32 100644 --- a/module/scheme/base.scm +++ b/module/scheme/base.scm @@ -60,7 +60,6 @@ vector-append vector-for-each vector-map (r7:bytevector-copy . bytevector-copy) (r7:bytevector-copy! . bytevector-copy!) - (r7:utf8->string . utf8->string) square (r7:expt . expt) boolean=? symbol=? @@ -109,7 +108,7 @@ string-copy string-copy! string-fill! string-for-each string-length string-ref string-set! string<=? string=? string>? string? substring symbol->string - symbol? syntax-error syntax-rules truncate + symbol? utf8->string syntax-error syntax-rules truncate truncate-quotient truncate-remainder truncate/ (char-ready? . u8-ready?) unless @@ -494,9 +493,6 @@ (bytevector-copy! bv start out 0 mlen) out) -(define (%subbytevector1 bv start) - (%subbytevector bv start (bytevector-length bv))) - (define r7:bytevector-copy! (case-lambda* ((to at from #:optional @@ -512,12 +508,6 @@ ((bv start #:optional (end (bytevector-length bv))) (%subbytevector bv start end)))) -(define r7:utf8->string - (case-lambda* - ((bv) (utf8->string bv)) - ((bv start #:optional (end (bytevector-length bv))) - (utf8->string (%subbytevector bv start end))))) - (define (square x) (* x x)) (define (r7:expt x y) -- 2.34.1 --=-=-=--