From: divoplade
Newsgroups: gmane.lisp.guile.user
Subject: Re: Guile Hacker Handbook - Character sets
Date: Fri, 19 Feb 2021 00:15:15 +0100
To: Jérémy Korwin-Zmijowski, Mailing list Guile User

Hello,

On Thursday, 18 February 2021 at 20:54 +0100, Jérémy Korwin-Zmijowski wrote:
> I happily managed to find some time to write a new chapter for the
> Guile Hacker Handbook !
>
> https://jeko.frama.io/en/char-sets.html
>
> It deals with char-sets, something new to me. The exercise was fun, I
> liked how convenient it is to play with these data type.

Unicode makes it tempting to think that each thing you can index in a string is a character. This works most of the time, but it fails in some cases, typically with non-English text. This remark is general and applies in many situations, including the previous chapter about characters. I suggest reading: https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/

Fortunately, very few problems in international software need to look at the individual characters of a string. Your password rules example is arguably one of them, although it may make non-Latin users angry (the upper case / lower case distinction does not exist in Chinese, as far as I know). The other example that I'm aware of is limiting the size of a message so that the reader does not get bored (so, not for storage reasons).
One website famously limits the number of Unicode code points in a message, although the actual rule is much more complex and opinionated than you would expect (https://developer.twitter.com/en/docs/counting-characters).

I think that demonstrating general code that works with Latin text except for "special characters" is rude to the rest of the world, and should not be put in such a strategic place as the Guile Hacker Handbook. For your example, I suggest switching to something that has more structure and is purposely Latin, for instance checking the validity of IBAN accounts, car license plates in an applicable country, maybe your grocery store's customer ID... You can also invent your own.

The previous chapter about characters gives a lot of importance to letter intervals, which is even more difficult: the locale order would put 'é' after 'e' and before 'f', but the char>=? predicate compares code points and would put it after every ASCII letter. So this does not even work for all Latin text. And if you use the locale order, then you won't even have meaningful character ranges anymore.

Unicode is a very complex beast, with very few general use cases. Don't let that discourage you. Fortunately, most everyday computing tasks can be solved without going down to Unicode character semantics. As a general idea, I would suggest staying away from characters and starting with strings.
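
To make the remarks about indexing and letter intervals concrete, here is a small Guile sketch. It assumes Guile 2.0 or later; the variable names are mine and do not come from the handbook. It only shows one easy case (a precomposed letter versus the same letter written with a combining accent); normalization does not help for every script.

```scheme
;; Illustrative names only; nothing here is taken from the handbook.
(define composed   (string (integer->char #xE9)))       ; "é" as one code point
(define decomposed (string #\e (integer->char #x301)))  ; "e" + combining acute

;; Both strings display as "é", but indexing sees code points.
(display (string-length composed))   (newline)          ; 1
(display (string-length decomposed)) (newline)          ; 2
(display (string=? composed decomposed)) (newline)      ; #f
(display (string=? composed (string-normalize-nfc decomposed)))
(newline)                                               ; #t

;; Code point order puts é (U+00E9) after z (U+007A),
;; so an a-z interval does not contain it.
(define a-to-z (ucs-range->char-set 97 123))  ; upper bound is exclusive
(display (char>? (integer->char #xE9) #\z)) (newline)               ; #t
(display (char-set-contains? a-to-z (integer->char #xE9))) (newline) ; #f
```

A locale-aware comparison would need the (ice-9 i18n) module and an installed locale, so I left it out of the sketch.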
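
As an idea of what the IBAN suggestion could look like with char-sets, here is a rough sketch, not a complete validator. The names iban-chars, char->iban-number and iban-valid? are hypothetical. It assumes the usual mod-97 check (move the first four characters to the end, replace A..Z by 10..35, and the resulting number must be 1 modulo 97), and it only checks the generic 15 to 34 length range, not the per-country length.

```scheme
;; Hypothetical helper names; uses only procedures built into Guile.
(define iban-chars
  (char-set-union (ucs-range->char-set 48 58)    ; 0-9
                  (ucs-range->char-set 65 91)))  ; A-Z

(define (char->iban-number c)
  ;; Digits map to 0..9 and upper-case letters to 10..35.
  ;; Only called on characters already known to be in iban-chars.
  (let ((code (char->integer c)))
    (if (<= 48 code 57) (- code 48) (- code 55))))

(define (iban-valid? str)
  (let ((s (string-delete #\space str)))         ; allow printed grouping
    (and (<= 15 (string-length s) 34)
         (string-every iban-chars s)             ; purposely ASCII only
         (let* ((rearranged (string-append (string-drop s 4)
                                           (string-take s 4)))
                (digits (string-concatenate
                         (map (lambda (c)
                                (number->string (char->iban-number c)))
                              (string->list rearranged)))))
           ;; Guile integers are arbitrary precision, so no chunking needed.
           (= 1 (modulo (string->number digits) 97))))))

(display (iban-valid? "GB82 WEST 1234 5698 7654 32")) (newline)  ; #t
(display (iban-valid? "GB82 WEST 1234 5698 7654 33")) (newline)  ; #f
```

The char-set here is restrictive on purpose: the format is defined over digits and upper-case ASCII letters, so rejecting everything else is not a statement about anyone's language.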