From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail From: Vijay Marupudi Newsgroups: gmane.lisp.guile.bugs Subject: bug#54141: [PATCH] Allow utf[8/16/32]->string functions to take start and ends bounds Date: Thu, 24 Feb 2022 08:38:38 -0500 Message-ID: <875yp45qqp.fsf@vijaymarupudi.com> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="=-=-=" Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214"; logging-data="22249"; mail-complaints-to="usenet@ciao.gmane.io" To: 54141@debbugs.gnu.org Original-X-From: bug-guile-bounces+guile-bugs=m.gmane-mx.org@gnu.org Thu Feb 24 16:09:21 2022 Return-path: Envelope-to: guile-bugs@m.gmane-mx.org Original-Received: from lists.gnu.org ([209.51.188.17]) by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1nNFk8-0005UU-HG for guile-bugs@m.gmane-mx.org; Thu, 24 Feb 2022 16:09:21 +0100 Original-Received: from localhost ([::1]:38456 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nNFk7-0003v1-6S for guile-bugs@m.gmane-mx.org; Thu, 24 Feb 2022 10:09:19 -0500 Original-Received: from eggs.gnu.org ([209.51.188.92]:33002) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nNFk0-0003tz-ED for bug-guile@gnu.org; Thu, 24 Feb 2022 10:09:12 -0500 Original-Received: from debbugs.gnu.org ([209.51.188.43]:56227) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1nNFjq-0002SE-7W for bug-guile@gnu.org; Thu, 24 Feb 2022 10:09:12 -0500 Original-Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.84_2) (envelope-from ) id 1nNFjq-0000PV-3Q for bug-guile@gnu.org; Thu, 24 Feb 2022 10:09:02 -0500 X-Loop: help-debbugs@gnu.org Resent-From: Vijay Marupudi Original-Sender: "Debbugs-submit" Resent-CC: bug-guile@gnu.org Resent-Date: Thu, 24 Feb 2022 15:09:01 +0000 Resent-Message-ID: 
Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: report 54141 X-GNU-PR-Package: guile X-GNU-PR-Keywords: patch X-Debbugs-Original-To: bug-guile@gnu.org Original-Received: via spool by submit@debbugs.gnu.org id=B.16457152871455 (code B ref -1); Thu, 24 Feb 2022 15:09:01 +0000 Original-Received: (at submit) by debbugs.gnu.org; 24 Feb 2022 15:08:07 +0000 Original-Received: from localhost ([127.0.0.1]:50110 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1nNFiw-0000NN-Jx for submit@debbugs.gnu.org; Thu, 24 Feb 2022 10:08:07 -0500 Original-Received: from lists.gnu.org ([209.51.188.17]:46462) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1nNEKm-0003iz-6A for submit@debbugs.gnu.org; Thu, 24 Feb 2022 08:39:04 -0500 Original-Received: from eggs.gnu.org ([209.51.188.92]:34974) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nNEKl-0000jK-6M for bug-guile@gnu.org; Thu, 24 Feb 2022 08:39:03 -0500 Original-Received: from mailtransmit05.runbox.com ([185.226.149.38]:35406) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nNEKh-0001vJ-Ts for bug-guile@gnu.org; Thu, 24 Feb 2022 08:39:02 -0500 Original-Received: from mailtransmit03.runbox ([10.9.9.163] helo=aibo.runbox.com) by mailtransmit05.runbox.com with esmtps (TLS1.2) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.93) (envelope-from ) id 1nNEKZ-00ElZG-Bb for bug-guile@gnu.org; Thu, 24 Feb 2022 14:38:51 +0100 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=vijaymarupudi.com; s=selector1; h=Content-Type:MIME-Version:Message-ID:Date :Subject:To:From; bh=MNRHBuYOLrn4a1+HIizPArIm58ofOpeotSonEVSyeTA=; b=COT81XMk kCR0913Dw6YMnDOV0JenerzfmY4qBRX+hXoOKmHPjjGNGilZApLebL9/Gfi2zSQbeMnnZNnwzkARF dUqBLzcAKRxayCmE64M9yGBzfR5zZvdN1WOO6Qb25HGjwCFCIuaBMNqL+2L4NkpYwfcYz4gHMEqBK fAv5Bp6T4/zS0xYUGlisiMAsDjc6N9dr8fvgQQdQNndmPzZRMyMOKZZ3UB6nNwmJnMMEoXze9VsIk 
XzY39I1PDRMgoXnr9e2i2bL1EcFCxzO6B4uETyOb6koFdpGB+lGzRyPiG41MBbC06le1mXiTuoraB ElTWbwdHlhvXJz65qONM/XIBCA==; Original-Received: from [10.9.9.72] (helo=submission01.runbox) by mailtransmit03.runbox with esmtp (Exim 4.86_2) (envelope-from ) id 1nNEKY-00027Y-Tp for bug-guile@gnu.org; Thu, 24 Feb 2022 14:38:51 +0100 Original-Received: by submission01.runbox with esmtpsa [Authenticated ID (1028486)] (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) id 1nNEKR-0002mn-M6 for bug-guile@gnu.org; Thu, 24 Feb 2022 14:38:44 +0100 Received-SPF: pass client-ip=185.226.149.38; envelope-from=vijay@vijaymarupudi.com; helo=mailtransmit05.runbox.com X-Spam_score_int: -20 X-Spam_score: -2.1 X-Spam_bar: -- X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_MSPIKE_H5=0.001, RCVD_IN_MSPIKE_WL=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-Mailman-Approved-At: Thu, 24 Feb 2022 10:08:05 -0500 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list X-BeenThere: bug-guile@gnu.org List-Id: "Bug reports for GUILE, GNU's Ubiquitous Extension Language" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: bug-guile-bounces+guile-bugs=m.gmane-mx.org@gnu.org Original-Sender: "bug-guile" Xref: news.gmane.io gmane.lisp.guile.bugs:10249 Archived-At: --=-=-= Content-Type: text/plain Hello, I have attached a patch that extends the bytevector->string functions (utf8->string, utf16->string, utf32->string) to take optional start and end bounds. The second patch changes the r7rs compatibility layer to use this new functionality instead of making a new bytevector as an intermediate step. 
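As a rough illustration of the range semantics (this is not code from the attached patches), the constraint 0 <= start <= end <= length and the resulting half-open byte slice [start, end) can be sketched in standalone C without libguile. The names range_is_valid and slice_bytes are hypothetical, made up for this sketch; the patch itself performs the check in validate_bytevector_range and signals an out-of-range error instead of returning false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only: true when
   0 <= start <= end <= len.  size_t is unsigned, so the lower
   bound on start holds automatically.  */
static bool
range_is_valid (size_t len, size_t start, size_t end)
{
  return start <= end && end <= len;
}

/* Hypothetical helper, for illustration only: on success, point *out at
   the first byte of the half-open range [start, end) inside buf and
   store the number of bytes in *out_len.  No copy of buf is made.
   Returns false (leaving the outputs untouched) for an invalid range.  */
static bool
slice_bytes (const unsigned char *buf, size_t len,
             size_t start, size_t end,
             const unsigned char **out, size_t *out_len)
{
  assert (out != NULL && out_len != NULL);

  if (!range_is_valid (len, start, end))
    return false;

  *out = buf + start;
  *out_len = end - start;
  return true;
}
```

With the nine bytes of "gnu guile", slice_bytes (buf, 9, 4, 7, &p, &n) would yield the three bytes "gui", which corresponds to the start-and-end test case in the first patch. A start equal to the length is accepted and gives an empty slice, as with substring.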
~ Vijay --=-=-= Content-Type: text/x-patch; charset=utf-8 Content-Disposition: attachment; filename=0001-Allow-utf8-string-utf16-string-utf32-string-to-take-.patch Content-Transfer-Encoding: quoted-printable >From c6be127b4818d43a0244592c18a52de113d3ff08 Mon Sep 17 00:00:00 2001 From: Vijay Marupudi Date: Thu, 20 Jan 2022 22:19:25 -0500 Subject: [PATCH 1/2] Allow utf8->string, utf16->string, utf32->string to ta= ke ranges Added the following new functions, that behave like substring, but for bytevector to string conversion. scm_utf8_range_to_string (SCM, SCM, SCM); scm_utf16_range_to_string (SCM, SCM, SCM, SCM); scm_utf32_range_to_string (SCM, SCM, SCM, SCM); * doc/ref/api-data.texi: Updated documentation to reflect new function and range constraints * libguile/bytevectors.c: Added new function. * libguile/bytevectors.h: Added new function declaration. * test-suite/tests/bytevectors.test: Added tests for exceptions and behavior for edge cases --- doc/ref/api-data.texi | 15 +++- libguile/bytevectors.c | 144 +++++++++++++++++++++++------- libguile/bytevectors.h | 3 + test-suite/tests/bytevectors.test | 37 ++++++++ 4 files changed, 164 insertions(+), 35 deletions(-) diff --git a/doc/ref/api-data.texi b/doc/ref/api-data.texi index b6c2c4d61..44b64454f 100644 --- a/doc/ref/api-data.texi +++ b/doc/ref/api-data.texi @@ -7139,16 +7139,25 @@ UTF-32 (aka. UCS-4) encoding of @var{str}. For UTF= -16 and UTF-32, it defaults to big endian. 
@end deffn =20 -@deffn {Scheme Procedure} utf8->string utf -@deffnx {Scheme Procedure} utf16->string utf [endianness] -@deffnx {Scheme Procedure} utf32->string utf [endianness] +@deffn {Scheme Procedure} utf8->string utf [start [end]] +@deffnx {Scheme Procedure} utf16->string utf [endianness [start [end]]] +@deffnx {Scheme Procedure} utf32->string utf [endianness [start [end]]] @deffnx {C Function} scm_utf8_to_string (utf) +@deffnx {C Function} scm_utf8_range_to_string (utf, start, end) @deffnx {C Function} scm_utf16_to_string (utf, endianness) +@deffnx {C Function} scm_utf16_range_to_string (utf, endianness, start, en= d) @deffnx {C Function} scm_utf32_to_string (utf, endianness) +@deffnx {C Function} scm_utf32_range_to_string (utf, endianness, start, en= d) + Return a newly allocated string that contains from the UTF-8-, UTF-16-, or UTF-32-decoded contents of bytevector @var{utf}. For UTF-16 and UTF-32, @var{endianness} should be the symbol @code{big} or @code{little}; when om= itted, it defaults to big endian. + +@var{start} and @var{end}, when provided, must be exact integers +satisfying: + +0 <=3D @var{start} <=3D @var{end} <=3D @code{(bytevector-length @var{utf})= }. @end deffn =20 @node Bytevectors as Arrays diff --git a/libguile/bytevectors.c b/libguile/bytevectors.c index f42fbb427..12d299042 100644 --- a/libguile/bytevectors.c +++ b/libguile/bytevectors.c @@ -2061,25 +2061,46 @@ SCM_DEFINE (scm_string_to_utf32, "string->utf32", =20 /* Produce the body of a function that converts a UTF-encoded bytevector t= o a string. 
*/ -#define UTF_TO_STRING(_utf_width) \ +#define UTF_TO_STRING(_utf_width, utf, endianness, start, end) \ SCM str =3D SCM_BOOL_F; \ int err; \ char *c_str =3D NULL; \ char c_utf_name[MAX_UTF_ENCODING_NAME_LEN]; \ char *c_utf; \ - size_t c_strlen =3D 0, c_utf_len =3D 0; \ + size_t c_strlen =3D 0, c_utf_len, c_start, c_end; \ \ - SCM_VALIDATE_BYTEVECTOR (1, utf); \ - if (scm_is_eq (endianness, SCM_UNDEFINED)) \ - endianness =3D sym_big; \ + SCM_VALIDATE_BYTEVECTOR (1, (utf)); \ + if (scm_is_eq ((endianness), SCM_UNDEFINED)) \ + (endianness) =3D sym_big; \ else \ - SCM_VALIDATE_SYMBOL (2, endianness); \ + SCM_VALIDATE_SYMBOL (2, (endianness)); \ \ - c_utf_len =3D SCM_BYTEVECTOR_LENGTH (utf); \ - c_utf =3D (char *) SCM_BYTEVECTOR_CONTENTS (utf); \ - utf_encoding_name (c_utf_name, (_utf_width), endianness); \ + c_utf_len =3D SCM_BYTEVECTOR_LENGTH ((utf)); \ + c_utf =3D (char *) SCM_BYTEVECTOR_CONTENTS ((utf)); \ + utf_encoding_name (c_utf_name, (_utf_width), (endianness)); \ + \ + if (!scm_is_eq ((start), SCM_UNDEFINED)) \ + { \ + c_start =3D scm_to_unsigned_integer ((start), 0, c_utf_len); \ + } \ + else \ + { \ + c_start =3D 0; \ + } \ + \ + if (!scm_is_eq ((end), SCM_UNDEFINED)) \ + { \ + c_end =3D scm_to_unsigned_integer ((end), 0, c_utf_len); \ + } \ + else \ + { \ + c_end =3D c_utf_len; \ + } \ + \ + validate_bytevector_range(FUNC_NAME, c_utf_len, c_start, c_end); \ + \ \ - err =3D mem_iconveh (c_utf, c_utf_len, \ + err =3D mem_iconveh (c_utf + c_start, c_end - c_start, \ c_utf_name, "UTF-8", \ iconveh_question_mark, NULL, \ &c_str, &c_strlen); \ @@ -2094,46 +2115,105 @@ SCM_DEFINE (scm_string_to_utf32, "string->utf32", return (str); =20 =20 -SCM_DEFINE (scm_utf8_to_string, "utf8->string", - 1, 0, 0, - (SCM utf), - "Return a newly allocate string that contains from the UTF-8-" - "encoded contents of bytevector @var{utf}.") -#define FUNC_NAME s_scm_utf8_to_string +static inline void +validate_bytevector_range(const char* function_name, size_t len, size_t st= art, 
size_t end) { + if (SCM_UNLIKELY (start > len)) + { + scm_out_of_range (function_name, scm_from_size_t(start)); + } + if (SCM_UNLIKELY (end > len)) + { + scm_out_of_range (function_name, scm_from_size_t(end)); + } + if (SCM_UNLIKELY(end < start)) + { + scm_out_of_range (function_name, scm_from_size_t(end)); + } +} + + +SCM_DEFINE (scm_utf8_range_to_string, "utf8->string", + 1, 2, 0, + (SCM utf, SCM start, SCM end), + "Return a newly allocate string that contains from the UTF-8-" + "encoded contents of bytevector @var{utf}.") +#define FUNC_NAME s_scm_utf8_range_to_string { SCM str; const char *c_utf; - size_t c_utf_len =3D 0; + size_t c_start; + size_t c_end; + size_t c_len; =20 SCM_VALIDATE_BYTEVECTOR (1, utf); - - c_utf_len =3D SCM_BYTEVECTOR_LENGTH (utf); c_utf =3D (char *) SCM_BYTEVECTOR_CONTENTS (utf); - str =3D scm_from_utf8_stringn (c_utf, c_utf_len); + c_len =3D SCM_BYTEVECTOR_LENGTH(utf); + + if (!scm_is_eq (start, SCM_UNDEFINED)) + { + c_start =3D scm_to_unsigned_integer (start, 0, c_len); + } + else + { + c_start =3D 0; + } =20 + if (!scm_is_eq (end, SCM_UNDEFINED)) + { + c_end =3D scm_to_unsigned_integer (end, 0, c_len); + } + else + { + c_end =3D c_len; + } + + validate_bytevector_range(FUNC_NAME, c_len, c_start, c_end); + str =3D scm_from_utf8_stringn (c_utf + c_start, c_end - c_start); return (str); } #undef FUNC_NAME =20 -SCM_DEFINE (scm_utf16_to_string, "utf16->string", - 1, 1, 0, - (SCM utf, SCM endianness), - "Return a newly allocate string that contains from the UTF-16-" - "encoded contents of bytevector @var{utf}.") +SCM +scm_utf8_to_string(SCM utf) +#define FUNC_NAME s_scm_utf8_to_string +{ + return scm_utf8_range_to_string(utf, SCM_UNDEFINED, SCM_UNDEFINED); +} +#undef FUNC_NAME + +SCM_DEFINE (scm_utf16_range_to_string, "utf16->string", + 1, 3, 0, + (SCM utf, SCM endianness, SCM start, SCM end), + "Return a newly allocate string that contains from the UTF-16-" + "encoded contents of bytevector @var{utf}.") +#define FUNC_NAME 
s_scm_utf16_range_to_string +{ + UTF_TO_STRING(16, utf, endianness, start, end); +} +#undef FUNC_NAME + +SCM scm_utf16_to_string (SCM utf, SCM endianness) #define FUNC_NAME s_scm_utf16_to_string { - UTF_TO_STRING (16); + return scm_utf16_range_to_string(utf, endianness, SCM_UNDEFINED, SCM_UND= EFINED); } #undef FUNC_NAME =20 -SCM_DEFINE (scm_utf32_to_string, "utf32->string", - 1, 1, 0, - (SCM utf, SCM endianness), - "Return a newly allocate string that contains from the UTF-32-" - "encoded contents of bytevector @var{utf}.") +SCM_DEFINE (scm_utf32_range_to_string, "utf32->string", + 1, 3, 0, + (SCM utf, SCM endianness, SCM start, SCM end), + "Return a newly allocate string that contains from the UTF-32-" + "encoded contents of bytevector @var{utf}.") +#define FUNC_NAME s_scm_utf32_range_to_string +{ + UTF_TO_STRING(32, utf, endianness, start, end); +} +#undef FUNC_NAME + +SCM scm_utf32_to_string (SCM utf, SCM endianness) #define FUNC_NAME s_scm_utf32_to_string { - UTF_TO_STRING (32); + return scm_utf32_range_to_string(utf, endianness, SCM_UNDEFINED, SCM_UND= EFINED); } #undef FUNC_NAME =20 diff --git a/libguile/bytevectors.h b/libguile/bytevectors.h index 980d6e267..63d8e3119 100644 --- a/libguile/bytevectors.h +++ b/libguile/bytevectors.h @@ -113,8 +113,11 @@ SCM_API SCM scm_string_to_utf8 (SCM); SCM_API SCM scm_string_to_utf16 (SCM, SCM); SCM_API SCM scm_string_to_utf32 (SCM, SCM); SCM_API SCM scm_utf8_to_string (SCM); +SCM_API SCM scm_utf8_range_to_string (SCM, SCM, SCM); SCM_API SCM scm_utf16_to_string (SCM, SCM); +SCM_API SCM scm_utf16_range_to_string (SCM, SCM, SCM, SCM); SCM_API SCM scm_utf32_to_string (SCM, SCM); +SCM_API SCM scm_utf32_range_to_string (SCM, SCM, SCM, SCM); =20 =20 diff --git a/test-suite/tests/bytevectors.test b/test-suite/tests/bytevecto= rs.test index 732aadb3e..f8c6a8df1 100644 --- a/test-suite/tests/bytevectors.test +++ b/test-suite/tests/bytevectors.test @@ -558,6 +558,43 @@ exception:decoding-error (utf8->string #vu8(104 105 239 191 
50))) =20 + (pass-if "utf8->string range: start provided" + (let* ((utf8 (string->utf8 "gnu guile")) + (str (utf8->string utf8 4))) + (string=3D? str "guile"))) + + (pass-if "utf8->string range: start and end provided" + (let* ((utf8 (string->utf8 "gnu guile")) + (str (utf8->string utf8 4 7))) + (string=3D? str "gui"))) + + (pass-if "utf8->string range: start =3D end =3D 0" + (let* ((utf8 (string->utf8 "gnu guile")) + (str (utf8->string utf8 0 0))) + (string=3D? str ""))) + + (pass-if-exception "utf8->string range: start > len" + exception:out-of-range + (let* ((utf8 (string->utf8 "four"))) + ;; 4 as start is expected to return an empty string, in congruence + ;; with `substring'. + (utf8->string utf8 5))) + + (pass-if-exception "utf8->string range: end < start" + exception:out-of-range + (let* ((utf8 (string->utf8 "gnu guile"))) + (utf8->string utf8 1 0))) + + (pass-if "utf8->string range: multibyte characters" + (string=3D? (utf8->string #vu8(195 169 67) 0 2) "=C3=A9")) + + (pass-if-exception "utf8->string range: decoding error for invalid range" + exception:decoding-error + (utf8->string #vu8(195 169) 0 1)) + + (pass-if "utf8->string range: null byte non-termination" + (string=3D? (utf8->string #vu8(0 32 196) 0 2) "\x00 ")) + (pass-if "utf16->string" (let* ((utf16 (uint-list->bytevector (map char->integer (string->list "hello, world= ")) --=20 2.34.1 --=-=-= Content-Type: text/x-patch Content-Disposition: attachment; filename=0002-Re-export-utf8-string-instead-of-wrapper-with-extra-.patch >From 94bbaaf6dc4760eb22bb4b2594648b1f8dbf83a5 Mon Sep 17 00:00:00 2001 From: Vijay Marupudi Date: Fri, 21 Jan 2022 20:17:42 -0500 Subject: [PATCH 2/2] Re-export utf8->string instead of wrapper with extra allocation. In addition, remove the redundant function that was dead code. module/scheme/base.scm: Deleted wrapped and exported the default utf8->string. 
--- module/scheme/base.scm | 12 +----------- 1 file changed, 1 insertion(+), 11 deletions(-) diff --git a/module/scheme/base.scm b/module/scheme/base.scm index c6a73c092..28adb2e32 100644 --- a/module/scheme/base.scm +++ b/module/scheme/base.scm @@ -60,7 +60,6 @@ vector-append vector-for-each vector-map (r7:bytevector-copy . bytevector-copy) (r7:bytevector-copy! . bytevector-copy!) - (r7:utf8->string . utf8->string) square (r7:expt . expt) boolean=? symbol=? @@ -109,7 +108,7 @@ string-copy string-copy! string-fill! string-for-each string-length string-ref string-set! string<=? string=? string>? string? substring symbol->string - symbol? syntax-error syntax-rules truncate + symbol? utf8->string syntax-error syntax-rules truncate truncate-quotient truncate-remainder truncate/ (char-ready? . u8-ready?) unless @@ -494,9 +493,6 @@ (bytevector-copy! bv start out 0 mlen) out) -(define (%subbytevector1 bv start) - (%subbytevector bv start (bytevector-length bv))) - (define r7:bytevector-copy! (case-lambda* ((to at from #:optional @@ -512,12 +508,6 @@ ((bv start #:optional (end (bytevector-length bv))) (%subbytevector bv start end)))) -(define r7:utf8->string - (case-lambda* - ((bv) (utf8->string bv)) - ((bv start #:optional (end (bytevector-length bv))) - (utf8->string (%subbytevector bv start end))))) - (define (square x) (* x x)) (define (r7:expt x y) -- 2.34.1 --=-=-=--