From: Maxime Devos
Newsgroups: gmane.lisp.guile.devel
Subject: Re: [PATCH] Enable utf8->string to take a range
Date: Fri, 21 Jan 2022 23:08:25 +0100
To: Vijay Marupudi, guile-devel@gnu.org
In-Reply-To: <87bl046dss.fsf@vijaymarupudi.com>

Vijay Marupudi wrote on Fri 21-01-2022 at 15:20 [-0500]:
> +  (pass-if-exception "utf8->string range: end < start"
> +      exception:out-of-range
> +      (let* ((utf8 (string->utf8 "gnu guile")))
> +        (utf8->string utf8 1 0)))
> +  [other tests]

It would be nice to check multibyte characters as well,
to verify that byte indices, not character indices, are used.

E.g., (utf8->string #vu8(195 169) 0 2) should return "é".
Another nice test: (utf8->string #vu8(195 169) 0 1) should raise
a 'decoding-error', because the range stops in the middle of the
two-byte encoding of é, even though #vu8(195 169) as a whole is
valid UTF-8.
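
To make the byte-versus-character distinction concrete, here is a
small sketch that runs on an unpatched Guile using only
(rnrs bytevectors). It does not call the ranged utf8->string from the
patch. Instead, the literal #vu8(195) stands in for the byte range 0..1
of #vu8(195 169). The helper name decoding-error? is made up for this
example.

```scheme
(use-modules (rnrs bytevectors))

;; The UTF-8 encoding of U+00E9 (e with acute accent): one character,
;; two bytes.
(define e-acute #vu8(195 169))

;; Hypothetical helper: #t if THUNK throws 'decoding-error,
;; #f if it returns normally.
(define (decoding-error? thunk)
  (catch 'decoding-error
    (lambda () (thunk) #f)
    (lambda (key . args) #t)))

(display (bytevector-length e-acute)) (newline)             ; 2 bytes
(display (string-length (utf8->string e-acute))) (newline)  ; 1 character

;; The first byte alone is a lead byte without its continuation byte,
;; so decoding it is expected to fail.
(display (decoding-error? (lambda () (utf8->string #vu8(195)))))
(newline)
```

A range of 0..1 interpreted as character indices would cover the whole
character, whereas interpreted as byte indices it covers half of it,
which is why the two tests above tell the interpretations apart.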

And (utf8->string #vu8(0 32 196) 0 2) should return "\x00 " even
though #vu8(0 32 196) as a whole is invalid UTF-8. As a bonus, it
checks that the nul character is supported, which is easily
forgotten because Guile is implemented in C, where strings are
usually terminated by a zero byte instead of carrying a length
field.
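
A rough sketch of what this case is checking, again on literal
bytevectors rather than on the ranged procedure from the patch:
#vu8(0 32) plays the role of bytes 0..2 of #vu8(0 32 196). Only
(rnrs bytevectors) and Guile's false-if-exception are used; the
variable name prefix is just for illustration.

```scheme
(use-modules (rnrs bytevectors))

;; Bytes 0..2 of #vu8(0 32 196): a nul byte followed by a space.
(define prefix (utf8->string #vu8(0 32)))

;; The nul must survive decoding; a C-style strlen on the same bytes
;; would report a length of 0.
(display (string-length prefix)) (newline)                 ; 2
(display (char->integer (string-ref prefix 0))) (newline)  ; 0
(display (char->integer (string-ref prefix 1))) (newline)  ; 32

;; Decoding all three bytes fails: 196 starts a two-byte sequence that
;; is never completed.  false-if-exception turns the error into #f.
(display (false-if-exception (utf8->string #vu8(0 32 196))))
(newline)
```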

Overall, the patch you sent seems like a reasonable approach to me,
though I didn't verify the details.  I find myself at times copying
part of a bytevector to a new bytevector because some procedure
doesn't allow specifying byte ranges ...
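
For reference, the copying workaround looks roughly like the sketch
below. The name utf8-range->string is hypothetical; it is neither an
existing Guile procedure nor the implementation in the patch. It
assumes END is exclusive and signals 'out-of-range for a bad range, as
the quoted test expects of the patched utf8->string.

```scheme
(use-modules (rnrs bytevectors))

;; Hypothetical helper: decode bytes START (inclusive) to END
;; (exclusive) of BV as UTF-8, at the cost of one temporary copy.
(define* (utf8-range->string bv #:optional (start 0)
                             (end (bytevector-length bv)))
  (unless (<= 0 start end (bytevector-length bv))
    (scm-error 'out-of-range "utf8-range->string"
               "invalid range ~S..~S" (list start end)
               (list start end)))
  (let ((copy (make-bytevector (- end start))))
    ;; R6RS argument order: source, source-start, target,
    ;; target-start, count.
    (bytevector-copy! bv start copy 0 (- end start))
    (utf8->string copy)))

(display (utf8-range->string (string->utf8 "gnu guile") 4 9))
(newline)                                                   ; guile
```

The temporary bytevector is exactly the allocation a ranged
utf8->string would make unnecessary.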

Greetings,
Maxime