From mboxrd@z Thu Jan 1 00:00:00 1970
Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail
From: Vijay Marupudi
Newsgroups: gmane.lisp.guile.devel
Subject: Re: [PATCH] Enable utf8->string to take a range
Date: Wed, 09 Mar 2022 09:50:23 -0500
Message-ID: <87bkyfyyc0.fsf@vijaymarupudi.com>
References: <87h79x6abc.fsf@vijaymarupudi.com>
 <0f4ce6f8ddbdd9456dcc0063b206bf8c76d71da6.camel@telenet.be>
 <87bl046dss.fsf@vijaymarupudi.com>
 <87pmokmuon.fsf@vijaymarupudi.com>
 <9e8b7145e02b643dea86a0ddfd415d950558b882.camel@telenet.be>
 <71847b347864b3eaba8d2214b30355984286dcef.camel@telenet.be>
 <67c1b5fa2605da70cbc01bc19499cf20800012d4.camel@telenet.be>
In-Reply-To: <67c1b5fa2605da70cbc01bc19499cf20800012d4.camel@telenet.be>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
To: Maxime Devos, guile-devel@gnu.org
X-BeenThere: guile-devel@gnu.org
Precedence: list
List-Id: "Developers list for Guile, the GNU extensibility library"
Errors-To: guile-devel-bounces+guile-devel=m.gmane-mx.org@gnu.org
Original-Sender: "guile-devel"
Xref: news.gmane.io gmane.lisp.guile.devel:21168

--=-=-=
Content-Type: text/plain

Maxime Devos writes:

> Nevermind, seems like I misinterpreted a comment and #vu8(97 0 98) is
> valid UTF-8 after all, it's just not possible to encode it as a
> zero-terminated string.

Thanks for the catch on the typo in the docstrings. I've attached the
updated versions of the patches that fix that typo. This assumes that
no changes are required for the null byte comments.

Hoping this can get merged soon!

~ Vijay

--=-=-=
Content-Type: text/x-patch; charset=utf-8
Content-Disposition: attachment;
 filename=0001-Allow-utf8-string-utf16-string-utf32-string-to-take-.patch
Content-Transfer-Encoding: quoted-printable

>From 6f287255456dbefb75a1c2242904c8f0046ad5bb Mon Sep 17 00:00:00 2001
From: Vijay Marupudi
Date: Thu, 20 Jan 2022 22:19:25 -0500
Subject: [PATCH 1/2] Allow utf8->string, utf16->string, utf32->string to ta=
ke ranges

Added the following new functions, that behave like substring, but for
bytevector to string conversion.

scm_utf8_range_to_string (SCM, SCM, SCM);
scm_utf16_range_to_string (SCM, SCM, SCM, SCM);
scm_utf32_range_to_string (SCM, SCM, SCM, SCM);

* doc/ref/api-data.texi: Updated documentation to reflect new functions
  and range constraints
* libguile/bytevectors.c: Added new functions.
* libguile/bytevectors.h: Added new function declarations.
* test-suite/tests/bytevectors.test: Added tests for exceptions and behavior for edge cases --- doc/ref/api-data.texi | 15 +++- libguile/bytevectors.c | 144 +++++++++++++++++++++++------- libguile/bytevectors.h | 3 + test-suite/tests/bytevectors.test | 37 ++++++++ 4 files changed, 164 insertions(+), 35 deletions(-) diff --git a/doc/ref/api-data.texi b/doc/ref/api-data.texi index b6c2c4d61..44b64454f 100644 --- a/doc/ref/api-data.texi +++ b/doc/ref/api-data.texi @@ -7139,16 +7139,25 @@ UTF-32 (aka. UCS-4) encoding of @var{str}. For UTF= -16 and UTF-32, it defaults to big endian. @end deffn =20 -@deffn {Scheme Procedure} utf8->string utf -@deffnx {Scheme Procedure} utf16->string utf [endianness] -@deffnx {Scheme Procedure} utf32->string utf [endianness] +@deffn {Scheme Procedure} utf8->string utf [start [end]] +@deffnx {Scheme Procedure} utf16->string utf [endianness [start [end]]] +@deffnx {Scheme Procedure} utf32->string utf [endianness [start [end]]] @deffnx {C Function} scm_utf8_to_string (utf) +@deffnx {C Function} scm_utf8_range_to_string (utf, start, end) @deffnx {C Function} scm_utf16_to_string (utf, endianness) +@deffnx {C Function} scm_utf16_range_to_string (utf, endianness, start, en= d) @deffnx {C Function} scm_utf32_to_string (utf, endianness) +@deffnx {C Function} scm_utf32_range_to_string (utf, endianness, start, en= d) + Return a newly allocated string that contains from the UTF-8-, UTF-16-, or UTF-32-decoded contents of bytevector @var{utf}. For UTF-16 and UTF-32, @var{endianness} should be the symbol @code{big} or @code{little}; when om= itted, it defaults to big endian. + +@var{start} and @var{end}, when provided, must be exact integers +satisfying: + +0 <=3D @var{start} <=3D @var{end} <=3D @code{(bytevector-length @var{utf})= }. 
@end deffn =20 @node Bytevectors as Arrays diff --git a/libguile/bytevectors.c b/libguile/bytevectors.c index f42fbb427..4c1f4ce42 100644 --- a/libguile/bytevectors.c +++ b/libguile/bytevectors.c @@ -2061,25 +2061,46 @@ SCM_DEFINE (scm_string_to_utf32, "string->utf32", =20 /* Produce the body of a function that converts a UTF-encoded bytevector t= o a string. */ -#define UTF_TO_STRING(_utf_width) \ +#define UTF_TO_STRING(_utf_width, utf, endianness, start, end) \ SCM str =3D SCM_BOOL_F; \ int err; \ char *c_str =3D NULL; \ char c_utf_name[MAX_UTF_ENCODING_NAME_LEN]; \ char *c_utf; \ - size_t c_strlen =3D 0, c_utf_len =3D 0; \ + size_t c_strlen =3D 0, c_utf_len, c_start, c_end; \ \ - SCM_VALIDATE_BYTEVECTOR (1, utf); \ - if (scm_is_eq (endianness, SCM_UNDEFINED)) \ - endianness =3D sym_big; \ + SCM_VALIDATE_BYTEVECTOR (1, (utf)); \ + if (scm_is_eq ((endianness), SCM_UNDEFINED)) \ + (endianness) =3D sym_big; \ else \ - SCM_VALIDATE_SYMBOL (2, endianness); \ + SCM_VALIDATE_SYMBOL (2, (endianness)); \ \ - c_utf_len =3D SCM_BYTEVECTOR_LENGTH (utf); \ - c_utf =3D (char *) SCM_BYTEVECTOR_CONTENTS (utf); \ - utf_encoding_name (c_utf_name, (_utf_width), endianness); \ + c_utf_len =3D SCM_BYTEVECTOR_LENGTH ((utf)); \ + c_utf =3D (char *) SCM_BYTEVECTOR_CONTENTS ((utf)); \ + utf_encoding_name (c_utf_name, (_utf_width), (endianness)); \ + \ + if (!scm_is_eq ((start), SCM_UNDEFINED)) \ + { \ + c_start =3D scm_to_unsigned_integer ((start), 0, c_utf_len); \ + } \ + else \ + { \ + c_start =3D 0; \ + } \ + \ + if (!scm_is_eq ((end), SCM_UNDEFINED)) \ + { \ + c_end =3D scm_to_unsigned_integer ((end), 0, c_utf_len); \ + } \ + else \ + { \ + c_end =3D c_utf_len; \ + } \ + \ + validate_bytevector_range(FUNC_NAME, c_utf_len, c_start, c_end); \ + \ \ - err =3D mem_iconveh (c_utf, c_utf_len, \ + err =3D mem_iconveh (c_utf + c_start, c_end - c_start, \ c_utf_name, "UTF-8", \ iconveh_question_mark, NULL, \ &c_str, &c_strlen); \ @@ -2094,46 +2115,105 @@ SCM_DEFINE (scm_string_to_utf32, 
"string->utf32", return (str); =20 =20 -SCM_DEFINE (scm_utf8_to_string, "utf8->string", - 1, 0, 0, - (SCM utf), - "Return a newly allocate string that contains from the UTF-8-" - "encoded contents of bytevector @var{utf}.") -#define FUNC_NAME s_scm_utf8_to_string +static inline void +validate_bytevector_range(const char* function_name, size_t len, size_t st= art, size_t end) { + if (SCM_UNLIKELY (start > len)) + { + scm_out_of_range (function_name, scm_from_size_t(start)); + } + if (SCM_UNLIKELY (end > len)) + { + scm_out_of_range (function_name, scm_from_size_t(end)); + } + if (SCM_UNLIKELY(end < start)) + { + scm_out_of_range (function_name, scm_from_size_t(end)); + } +} + + +SCM_DEFINE (scm_utf8_range_to_string, "utf8->string", + 1, 2, 0, + (SCM utf, SCM start, SCM end), + "Return a newly allocate string that contains from the UTF-8-" + "encoded contents of bytevector @var{utf}.") +#define FUNC_NAME s_scm_utf8_range_to_string { SCM str; const char *c_utf; - size_t c_utf_len =3D 0; + size_t c_start; + size_t c_end; + size_t c_len; =20 SCM_VALIDATE_BYTEVECTOR (1, utf); - - c_utf_len =3D SCM_BYTEVECTOR_LENGTH (utf); c_utf =3D (char *) SCM_BYTEVECTOR_CONTENTS (utf); - str =3D scm_from_utf8_stringn (c_utf, c_utf_len); + c_len =3D SCM_BYTEVECTOR_LENGTH(utf); + + if (!scm_is_eq (start, SCM_UNDEFINED)) + { + c_start =3D scm_to_unsigned_integer (start, 0, c_len); + } + else + { + c_start =3D 0; + } =20 + if (!scm_is_eq (end, SCM_UNDEFINED)) + { + c_end =3D scm_to_unsigned_integer (end, 0, c_len); + } + else + { + c_end =3D c_len; + } + + validate_bytevector_range(FUNC_NAME, c_len, c_start, c_end); + str =3D scm_from_utf8_stringn (c_utf + c_start, c_end - c_start); return (str); } #undef FUNC_NAME =20 -SCM_DEFINE (scm_utf16_to_string, "utf16->string", - 1, 1, 0, - (SCM utf, SCM endianness), - "Return a newly allocate string that contains from the UTF-16-" - "encoded contents of bytevector @var{utf}.") +SCM +scm_utf8_to_string(SCM utf) +#define FUNC_NAME 
s_scm_utf8_to_string +{ + return scm_utf8_range_to_string(utf, SCM_UNDEFINED, SCM_UNDEFINED); +} +#undef FUNC_NAME + +SCM_DEFINE (scm_utf16_range_to_string, "utf16->string", + 1, 3, 0, + (SCM utf, SCM endianness, SCM start, SCM end), + "Return a newly allocate string that contains from the UTF-16-" + "encoded contents of bytevector @var{utf}.") +#define FUNC_NAME s_scm_utf16_range_to_string +{ + UTF_TO_STRING(16, utf, endianness, start, end); +} +#undef FUNC_NAME + +SCM scm_utf16_to_string (SCM utf, SCM endianness) #define FUNC_NAME s_scm_utf16_to_string { - UTF_TO_STRING (16); + return scm_utf16_range_to_string(utf, endianness, SCM_UNDEFINED, SCM_UND= EFINED); } #undef FUNC_NAME =20 -SCM_DEFINE (scm_utf32_to_string, "utf32->string", - 1, 1, 0, - (SCM utf, SCM endianness), - "Return a newly allocate string that contains from the UTF-32-" - "encoded contents of bytevector @var{utf}.") +SCM_DEFINE (scm_utf32_range_to_string, "utf32->string", + 1, 3, 0, + (SCM utf, SCM endianness, SCM start, SCM end), + "Return a newly allocate string that contains from the UTF-32-" + "encoded contents of bytevector @var{utf}.") +#define FUNC_NAME s_scm_utf32_range_to_string +{ + UTF_TO_STRING(32, utf, endianness, start, end); +} +#undef FUNC_NAME + +SCM scm_utf32_to_string (SCM utf, SCM endianness) #define FUNC_NAME s_scm_utf32_to_string { - UTF_TO_STRING (32); + return scm_utf32_range_to_string(utf, endianness, SCM_UNDEFINED, SCM_UND= EFINED); } #undef FUNC_NAME =20 diff --git a/libguile/bytevectors.h b/libguile/bytevectors.h index 980d6e267..63d8e3119 100644 --- a/libguile/bytevectors.h +++ b/libguile/bytevectors.h @@ -113,8 +113,11 @@ SCM_API SCM scm_string_to_utf8 (SCM); SCM_API SCM scm_string_to_utf16 (SCM, SCM); SCM_API SCM scm_string_to_utf32 (SCM, SCM); SCM_API SCM scm_utf8_to_string (SCM); +SCM_API SCM scm_utf8_range_to_string (SCM, SCM, SCM); SCM_API SCM scm_utf16_to_string (SCM, SCM); +SCM_API SCM scm_utf16_range_to_string (SCM, SCM, SCM, SCM); SCM_API SCM 
scm_utf32_to_string (SCM, SCM); +SCM_API SCM scm_utf32_range_to_string (SCM, SCM, SCM, SCM); =20 =20 diff --git a/test-suite/tests/bytevectors.test b/test-suite/tests/bytevecto= rs.test index 732aadb3e..f8c6a8df1 100644 --- a/test-suite/tests/bytevectors.test +++ b/test-suite/tests/bytevectors.test @@ -558,6 +558,43 @@ exception:decoding-error (utf8->string #vu8(104 105 239 191 50))) =20 + (pass-if "utf8->string range: start provided" + (let* ((utf8 (string->utf8 "gnu guile")) + (str (utf8->string utf8 4))) + (string=3D? str "guile"))) + + (pass-if "utf8->string range: start and end provided" + (let* ((utf8 (string->utf8 "gnu guile")) + (str (utf8->string utf8 4 7))) + (string=3D? str "gui"))) + + (pass-if "utf8->string range: start =3D end =3D 0" + (let* ((utf8 (string->utf8 "gnu guile")) + (str (utf8->string utf8 0 0))) + (string=3D? str ""))) + + (pass-if-exception "utf8->string range: start > len" + exception:out-of-range + (let* ((utf8 (string->utf8 "four"))) + ;; 4 as start is expected to return an empty string, in congruence + ;; with `substring'. + (utf8->string utf8 5))) + + (pass-if-exception "utf8->string range: end < start" + exception:out-of-range + (let* ((utf8 (string->utf8 "gnu guile"))) + (utf8->string utf8 1 0))) + + (pass-if "utf8->string range: multibyte characters" + (string=3D? (utf8->string #vu8(195 169 67) 0 2) "=C3=A9")) + + (pass-if-exception "utf8->string range: decoding error for invalid range" + exception:decoding-error + (utf8->string #vu8(195 169) 0 1)) + + (pass-if "utf8->string range: null byte non-termination" + (string=3D? 
(utf8->string #vu8(0 32 196) 0 2) "\x00 ")) + (pass-if "utf16->string" (let* ((utf16 (uint-list->bytevector (map char->integer (string->list "hello, world= ")) --=20 2.35.1 --=-=-= Content-Type: text/x-patch Content-Disposition: attachment; filename=0002-Re-export-utf8-string-instead-of-wrapper-with-extra-.patch >From febfaed2c5fc681fe014805c901026bdab8ea7cd Mon Sep 17 00:00:00 2001 From: Vijay Marupudi Date: Fri, 21 Jan 2022 20:17:42 -0500 Subject: [PATCH 2/2] Re-export utf8->string instead of wrapper with extra allocation. In addition, remove the redundant function that was dead code. * module/scheme/base.scm: Deleted the wrapper and exported the default utf8->string. --- module/scheme/base.scm | 12 +----------- 1 file changed, 1 insertion(+), 11 deletions(-) diff --git a/module/scheme/base.scm b/module/scheme/base.scm index c6a73c092..28adb2e32 100644 --- a/module/scheme/base.scm +++ b/module/scheme/base.scm @@ -60,7 +60,6 @@ vector-append vector-for-each vector-map (r7:bytevector-copy . bytevector-copy) (r7:bytevector-copy! . bytevector-copy!) - (r7:utf8->string . utf8->string) square (r7:expt . expt) boolean=? symbol=? @@ -109,7 +108,7 @@ string-copy string-copy! string-fill! string-for-each string-length string-ref string-set! string<=? string=? string>? string? substring symbol->string - symbol? syntax-error syntax-rules truncate + symbol? utf8->string syntax-error syntax-rules truncate truncate-quotient truncate-remainder truncate/ (char-ready? . u8-ready?) unless @@ -494,9 +493,6 @@ (bytevector-copy! bv start out 0 mlen) out) -(define (%subbytevector1 bv start) - (%subbytevector bv start (bytevector-length bv))) - (define r7:bytevector-copy! 
   (case-lambda*
     ((to at from #:optional
@@ -512,12 +508,6 @@
     ((bv start #:optional (end (bytevector-length bv)))
      (%subbytevector bv start end))))
 
-(define r7:utf8->string
-  (case-lambda*
-    ((bv) (utf8->string bv))
-    ((bv start #:optional (end (bytevector-length bv)))
-     (utf8->string (%subbytevector bv start end)))))
-
 (define (square x) (* x x))
 
 (define (r7:expt x y)
-- 
2.35.1

--=-=-=--