From: Vijay Marupudi <vijay@vijaymarupudi.com>
To: guile-devel@gnu.org
Subject: [PATCH] Enable utf8->string to take a range
Date: Thu, 20 Jan 2022 22:23:51 -0500
Message-ID: <87h79x6abc.fsf@vijaymarupudi.com>
Hello,
Attached is a patch that allows `utf8->string' to take
additional parameters indicating the start and end indices of the
range of the bytevector that it converts.
This preserves backwards compatibility by adding the new functionality
to an `scm_utf8_to_string_range' function (open to any alternative name
suggestions) and letting the old `scm_utf8_to_string' function call it.
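
The range rules can be sketched in plain C without libguile. This is an illustration only, not the code in the attached patch: `resolve_range' and its `has_start'/`has_end' flags are hypothetical names standing in for the `SCM_UNDEFINED' checks, and the sketch assumes the half-open convention 0 <= start <= end <= len. Under that assumption `start' may equal the length (yielding an empty range), which is slightly more permissive than the `c_start >= c_len' check in the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not part of libguile): resolve optional START and
   END arguments against a buffer of LEN bytes, using the half-open range
   [start, end).  A missing START defaults to 0 and a missing END defaults
   to LEN.  Returns false if the resulting range is invalid; otherwise
   stores the offset and byte count of the range.  */
static bool
resolve_range (size_t len, bool has_start, size_t start,
               bool has_end, size_t end,
               size_t *out_start, size_t *out_count)
{
  size_t s = has_start ? start : 0;
  size_t e = has_end ? end : len;

  /* Assumed convention: 0 <= s <= e <= len.  */
  if (e > len || s > e)
    return false;

  *out_start = s;
  *out_count = e - s;
  return true;
}
```

With both arguments omitted the helper yields the whole buffer, which mirrors how the old one-argument entry point can simply forward to the range-taking one.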
For my work, I am currently handling bytevectors that have large strings
embedded in them. Converting part of a bytevector to a string in pure
Scheme currently requires allocating a fresh bytevector and copying the
range into it; this patch would remove that extra allocation and copy.
It would also make R7RS compatibility easier, since the current
compatibility module copies the range into a fresh bytevector.
For example, from module/scheme/base.scm:
> (define (%subbytevector bv start end)
>   (define mlen (- end start))
>   (define out (make-bytevector mlen))
>   (bytevector-copy! bv start out 0 mlen)
>   out)
>
> (define (%subbytevector1 bv start)
>   (%subbytevector bv start (bytevector-length bv)))
>
> (define r7:utf8->string
>   (case-lambda*
>     ((bv) (utf8->string bv))
>     ((bv start #:optional (end (bytevector-length bv)))
>      (utf8->string (%subbytevector bv start end)))))
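
The difference between the two approaches can be sketched in standalone C. This is a rough illustration under stated assumptions, not libguile code: `slice_by_copy' and `slice_in_place' are hypothetical names, and both assume the caller has already validated that start <= end <= the buffer length.

```c
#include <stdlib.h>
#include <string.h>

/* Copying approach, analogous to %subbytevector: allocate a fresh buffer
   holding bytes [start, end) of BUF.  The caller frees the result.
   Returns NULL if allocation fails.  */
static unsigned char *
slice_by_copy (const unsigned char *buf, size_t start, size_t end)
{
  size_t count = end - start;
  /* Request at least one byte so an empty range still gets a valid
     pointer on platforms where malloc (0) may return NULL.  */
  unsigned char *out = malloc (count ? count : 1);
  if (out != NULL)
    memcpy (out, buf + start, count);
  return out;
}

/* Copy-free approach, analogous to handing `c_utf + c_start' to the
   decoder: point into the existing buffer, no allocation needed.  */
static const unsigned char *
slice_in_place (const unsigned char *buf, size_t start)
{
  return buf + start;
}
```

Both functions expose the same bytes to a decoder; the second one does so without the intermediate allocation, which is the saving the patch is after.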
I would appreciate any thoughts and feedback.
~ Vijay
[-- Attachment #2: 0001-Enable-utf8-string-to-take-a-range.patch --]
[-- Type: text/x-patch, Size: 3635 bytes --]
From 695c2a6189458a292819df8fba659ea488dc0b4e Mon Sep 17 00:00:00 2001
From: Vijay Marupudi <vijay@vijaymarupudi.com>
Date: Thu, 20 Jan 2022 22:19:25 -0500
Subject: [PATCH] Enable utf8->string to take a range
Additionally, adds a scm_utf8_to_string_range function for access from
C.
---
doc/ref/api-data.texi | 3 ++-
libguile/bytevectors.c | 48 +++++++++++++++++++++++++++++++++++-------
libguile/bytevectors.h | 1 +
3 files changed, 43 insertions(+), 9 deletions(-)
diff --git a/doc/ref/api-data.texi b/doc/ref/api-data.texi
index b6c2c4d61..1bdd1f7ed 100644
--- a/doc/ref/api-data.texi
+++ b/doc/ref/api-data.texi
@@ -7139,10 +7139,11 @@ UTF-32 (aka. UCS-4) encoding of @var{str}. For UTF-16 and UTF-32,
it defaults to big endian.
@end deffn
-@deffn {Scheme Procedure} utf8->string utf
+@deffn {Scheme Procedure} utf8->string utf [start [end]]
@deffnx {Scheme Procedure} utf16->string utf [endianness]
@deffnx {Scheme Procedure} utf32->string utf [endianness]
@deffnx {C Function} scm_utf8_to_string (utf)
+@deffnx {C Function} scm_utf8_to_string_range (utf, start, end)
@deffnx {C Function} scm_utf16_to_string (utf, endianness)
@deffnx {C Function} scm_utf32_to_string (utf, endianness)
Return a newly allocated string that contains from the UTF-8-, UTF-16-,
diff --git a/libguile/bytevectors.c b/libguile/bytevectors.c
index f42fbb427..44a062257 100644
--- a/libguile/bytevectors.c
+++ b/libguile/bytevectors.c
@@ -2094,27 +2094,59 @@ SCM_DEFINE (scm_string_to_utf32, "string->utf32",
return (str);
-SCM_DEFINE (scm_utf8_to_string, "utf8->string",
- 1, 0, 0,
- (SCM utf),
+SCM_DEFINE (scm_utf8_to_string_range, "utf8->string",
+ 1, 2, 0,
+ (SCM utf, SCM start, SCM end),
"Return a newly allocate string that contains from the UTF-8-"
"encoded contents of bytevector @var{utf}.")
-#define FUNC_NAME s_scm_utf8_to_string
+#define FUNC_NAME s_scm_utf8_to_string_range
{
SCM str;
const char *c_utf;
- size_t c_utf_len = 0;
+ size_t c_start = 0;
+ size_t c_end;
+ size_t c_len;
SCM_VALIDATE_BYTEVECTOR (1, utf);
-
- c_utf_len = SCM_BYTEVECTOR_LENGTH (utf);
c_utf = (char *) SCM_BYTEVECTOR_CONTENTS (utf);
- str = scm_from_utf8_stringn (c_utf, c_utf_len);
+ c_len = SCM_BYTEVECTOR_LENGTH(utf);
+ c_end = c_len;
+
+ if (!scm_is_eq (start, SCM_UNDEFINED))
+ {
+ c_start = scm_to_size_t (start);
+ if (SCM_UNLIKELY (c_start >= c_len))
+ {
+ scm_out_of_range (FUNC_NAME, start);
+ }
+
+ if (!scm_is_eq (end, SCM_UNDEFINED))
+ {
+ c_end = scm_to_size_t (end);
+ if (SCM_UNLIKELY (c_end > c_len))
+ scm_out_of_range (FUNC_NAME, end);
+ }
+ }
+
+ if (SCM_UNLIKELY(c_end < c_start)) {
+ scm_out_of_range (FUNC_NAME, end);
+ }
+
+ str = scm_from_utf8_stringn (c_utf + c_start, c_end - c_start);
return (str);
}
#undef FUNC_NAME
+SCM
+scm_utf8_to_string(SCM utf)
+#define FUNC_NAME s_scm_utf8_to_string
+{
+ return scm_utf8_to_string_range(utf, SCM_UNDEFINED, SCM_UNDEFINED);
+}
+#undef FUNC_NAME
+
+
SCM_DEFINE (scm_utf16_to_string, "utf16->string",
1, 1, 0,
(SCM utf, SCM endianness),
diff --git a/libguile/bytevectors.h b/libguile/bytevectors.h
index 980d6e267..82a66ee5e 100644
--- a/libguile/bytevectors.h
+++ b/libguile/bytevectors.h
@@ -113,6 +113,7 @@ SCM_API SCM scm_string_to_utf8 (SCM);
SCM_API SCM scm_string_to_utf16 (SCM, SCM);
SCM_API SCM scm_string_to_utf32 (SCM, SCM);
SCM_API SCM scm_utf8_to_string (SCM);
+SCM_API SCM scm_utf8_to_string_range (SCM, SCM, SCM);
SCM_API SCM scm_utf16_to_string (SCM, SCM);
SCM_API SCM scm_utf32_to_string (SCM, SCM);
--
2.34.1