From mboxrd@z Thu Jan 1 00:00:00 1970
Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail
From: Vijay Marupudi
Newsgroups: gmane.lisp.guile.devel
Subject: [PATCH] Enable utf8->string to take a range
Date: Thu, 20 Jan 2022 22:23:51 -0500
Message-ID: <87h79x6abc.fsf@vijaymarupudi.com>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
To: guile-devel@gnu.org
List-Id: "Developers list for Guile, the GNU extensibility library"
Xref: news.gmane.io gmane.lisp.guile.devel:21052

--=-=-=
Content-Type: text/plain

Hello,

Attached is a patch that allows `utf8->string' to take additional
parameters indicating the start and end indices of the bytevector that
it is converting.
This preserves backwards compatibility by adding the new functionality
to an `scm_utf8_to_string_range' function (open to any alternative name
suggestions) and letting the old `scm_utf8_to_string' function call it.

For my work, I am currently handling bytevectors with large strings
embedded as part of the bytevector. This patch would reduce the need
for spurious allocations and copying to convert part of a bytevector to
a string using pure Scheme.

It would also make R7RS compatibility easier, since the current
compatibility module involves copies to a fresh bytevector. For example,
from module/scheme/base.scm

> (define (%subbytevector bv start end)
>   (define mlen (- end start))
>   (define out (make-bytevector mlen))
>   (bytevector-copy! bv start out 0 mlen)
>   out)
>
> (define (%subbytevector1 bv start)
>   (%subbytevector bv start (bytevector-length bv)))
>
> (define r7:utf8->string
>   (case-lambda*
>     ((bv) (utf8->string bv))
>     ((bv start #:optional (end (bytevector-length bv)))
>      (utf8->string (%subbytevector bv start end)))))

Would appreciate any thoughts and feedback.

~ Vijay

--=-=-=
Content-Type: text/x-patch
Content-Disposition: inline;
 filename=0001-Enable-utf8-string-to-take-a-range.patch

>From 695c2a6189458a292819df8fba659ea488dc0b4e Mon Sep 17 00:00:00 2001
From: Vijay Marupudi
Date: Thu, 20 Jan 2022 22:19:25 -0500
Subject: [PATCH] Enable utf8->string to take a range

Additionally, adds a scm_utf8_to_string_range function for access from C.
---
 doc/ref/api-data.texi  |  3 ++-
 libguile/bytevectors.c | 48 +++++++++++++++++++++++++++++++++++-------
 libguile/bytevectors.h |  1 +
 3 files changed, 43 insertions(+), 9 deletions(-)

diff --git a/doc/ref/api-data.texi b/doc/ref/api-data.texi
index b6c2c4d61..1bdd1f7ed 100644
--- a/doc/ref/api-data.texi
+++ b/doc/ref/api-data.texi
@@ -7139,10 +7139,11 @@ UTF-32 (aka. UCS-4) encoding of @var{str}. For UTF-16 and UTF-32,
 it defaults to big endian.
 @end deffn
 
-@deffn {Scheme Procedure} utf8->string utf
+@deffn {Scheme Procedure} utf8->string utf [start [end]]
 @deffnx {Scheme Procedure} utf16->string utf [endianness]
 @deffnx {Scheme Procedure} utf32->string utf [endianness]
 @deffnx {C Function} scm_utf8_to_string (utf)
+@deffnx {C Function} scm_utf8_to_string_range (utf, start, end)
 @deffnx {C Function} scm_utf16_to_string (utf, endianness)
 @deffnx {C Function} scm_utf32_to_string (utf, endianness)
 Return a newly allocated string that contains from the UTF-8-, UTF-16-,
diff --git a/libguile/bytevectors.c b/libguile/bytevectors.c
index f42fbb427..44a062257 100644
--- a/libguile/bytevectors.c
+++ b/libguile/bytevectors.c
@@ -2094,27 +2094,59 @@ SCM_DEFINE (scm_string_to_utf32, "string->utf32",
   return (str);
 
 
-SCM_DEFINE (scm_utf8_to_string, "utf8->string",
-            1, 0, 0,
-            (SCM utf),
+SCM_DEFINE (scm_utf8_to_string_range, "utf8->string",
+            1, 2, 0,
+            (SCM utf, SCM start, SCM end),
             "Return a newly allocate string that contains from the UTF-8-"
             "encoded contents of bytevector @var{utf}.")
-#define FUNC_NAME s_scm_utf8_to_string
+#define FUNC_NAME s_scm_utf8_to_string_range
 {
   SCM str;
   const char *c_utf;
-  size_t c_utf_len = 0;
+  size_t c_start = 0;
+  size_t c_end;
+  size_t c_len;
 
   SCM_VALIDATE_BYTEVECTOR (1, utf);
-
-  c_utf_len = SCM_BYTEVECTOR_LENGTH (utf);
   c_utf = (char *) SCM_BYTEVECTOR_CONTENTS (utf);
-  str = scm_from_utf8_stringn (c_utf, c_utf_len);
+  c_len = SCM_BYTEVECTOR_LENGTH(utf);
+  c_end = c_len;
+
+  if (!scm_is_eq (start, SCM_UNDEFINED))
+    {
+      c_start = scm_to_size_t (start);
+      if (SCM_UNLIKELY (c_start >= c_len))
+        {
+          scm_out_of_range (FUNC_NAME, start);
+        }
+
+      if (!scm_is_eq (end, SCM_UNDEFINED))
+        {
+          c_end = scm_to_size_t (end);
+          if (SCM_UNLIKELY (c_end > c_len))
+            scm_out_of_range (FUNC_NAME, end);
+        }
+    }
+
+  if (SCM_UNLIKELY(c_end < c_start)) {
+    scm_out_of_range (FUNC_NAME, end);
+  }
+
+  str = scm_from_utf8_stringn (c_utf + c_start, c_end - c_start);
 
   return (str);
 }
 #undef FUNC_NAME
 
+SCM
+scm_utf8_to_string(SCM utf)
+#define FUNC_NAME s_scm_utf8_to_string
+{
+  return scm_utf8_to_string_range(utf, SCM_UNDEFINED, SCM_UNDEFINED);
+}
+#undef FUNC_NAME
+
+
 SCM_DEFINE (scm_utf16_to_string, "utf16->string",
             1, 1, 0,
             (SCM utf, SCM endianness),
diff --git a/libguile/bytevectors.h b/libguile/bytevectors.h
index 980d6e267..82a66ee5e 100644
--- a/libguile/bytevectors.h
+++ b/libguile/bytevectors.h
@@ -113,6 +113,7 @@ SCM_API SCM scm_string_to_utf8 (SCM);
 SCM_API SCM scm_string_to_utf16 (SCM, SCM);
 SCM_API SCM scm_string_to_utf32 (SCM, SCM);
 SCM_API SCM scm_utf8_to_string (SCM);
+SCM_API SCM scm_utf8_to_string_range (SCM, SCM, SCM);
 SCM_API SCM scm_utf16_to_string (SCM, SCM);
 SCM_API SCM scm_utf32_to_string (SCM, SCM);
 
-- 
2.34.1

--=-=-=--
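
The bounds checks in the patch may be easier to review in isolation. The
following is a minimal standalone sketch of the same logic, assuming plain
C with no libguile: `resolve_range' is a hypothetical helper name, a NULL
pointer stands in for SCM_UNDEFINED, and a false return stands in for the
call to scm_out_of_range. It is meant only as an illustration of the
checks, not as code from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of libguile.  Resolves optional START and
   END byte offsets against a buffer of LEN bytes using the same checks as
   the patch.  A NULL pointer means "argument not supplied".  Returns false
   in the cases where the patch would signal an out-of-range error.  */
static bool
resolve_range (size_t len, const size_t *start, const size_t *end,
               size_t *out_start, size_t *out_end)
{
  size_t c_start = 0;
  size_t c_end = len;

  if (start != NULL)
    {
      c_start = *start;
      /* As in the patch, start == len is rejected too.  */
      if (c_start >= len)
        return false;

      /* END is only examined when START was supplied.  */
      if (end != NULL)
        {
          c_end = *end;
          if (c_end > len)
            return false;
        }
    }

  if (c_end < c_start)
    return false;

  *out_start = c_start;
  *out_end = c_end;
  return true;
}
```

With a 10-byte buffer, start 2 and end 5 resolve to the half-open range
[2, 5), and omitting both arguments resolves to [0, 10). Under these checks
a start equal to the length is rejected even though it would denote an
empty range, and so is an explicit start of 0 on an empty buffer; whether
that is the intended behaviour may be worth raising in review.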