From: Thien-Thi Nguyen
Newsgroups: gmane.lisp.guile.user
Subject: Re: survey: string external representation
Date: Fri, 27 Jan 2012 11:27:30 +0100
Message-ID: <87k44dbfu5.fsf@gnuvola.org>
References: <87wr8edhac.fsf@gnuvola.org>
In-Reply-To: <87wr8edhac.fsf@gnuvola.org> (Thien-Thi Nguyen's message of
 "Thu, 26 Jan 2012 09:00:59 +0100")
To: guile-user@gnu.org
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.0.92 (gnu/linux)
List-Id: General Guile related discussions
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="

--=-=-=
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

Thanks to everyone who responded. Based on the collected
information, i've cobbled together a runtime check for
‘sql-quote’. It and some tests are in the attached program.
To play:

  guile -s normalize.scm
  guile -s normalize.scm stupid

The code assumes Guile 2 DTRT, but if you have doubts, you can

  sed -i 's/guile-2/&-not-really/' normalize.scm

to disable that assumption. In any case, the program should
exit successfully, indicating smooth ‘write’ / ‘read’
round-tripping. This is so (both w/ and w/o "stupid") for
Guile 1.4.1.124 and 1.8.7.

___________________________________________

--=-=-=
Content-Type: text/x-scheme; charset=utf-8
Content-Disposition: inline; filename=normalize.scm
Content-Transfer-Encoding: 8bit

;; -*- mode: scheme; coding: utf-8 -*-

(define EXIT-VALUE #t)                  ; optimism

(define STUPID? (false-if-exception
                 (string=? "stupid" (cadr (command-line)))))

;; PostgreSQL groks ‘\xXX’ as an octet w/ hex value XX.
;; It also groks raw octets.  This is all fine and good.
;; The problem arises when there is a mix of contiguous
;; raw and \x representations, intended to represent a
;; UTF-8 (say) encoded character.
;;
;; It seems Guile
;; - 1.4 DTRT by doing nothing;
;; - 1.6 ???;
;; - 1.8 fails by \x-escaping inconsistently;
;; - 2.0 doesn't have this problem.

(cond-expand
 (guile-2
  (define normalize identity))
 (else
  (use-modules (srfi srfi-13) (srfi srfi-14))
  (define normalize
    (or (let* ((ego (char-set
                     ;; These are not strictly necessary for
                     ;; PostgreSQL, but we include them for
                     ;; (Scheme-only) round-trip testing.
                     ;; Doubtlessly, what doubtful ego!
                     #\" #\\))
               (ugh (ucs-range->char-set #o177 #o400 #t ego)))
          (and (not (char-set-every
                     (lambda (ch)
                       ;; Does the octet xrep unmolested?
                       (char=? ch (string-ref
                                   (object->string (string ch))
                                   1)))
                     (char-set-difference ugh ego)))
               (or (not STUPID?)
                   (begin (set! ugh ego)
                          #t))
               ;; Lame.
               (lambda (s)

                 (define backslash-x
                   (let ((v (make-vector 256)))
                     (char-set-for-each
                      (lambda (ch)
                        (let ((i (char->integer ch)))
                          (vector-set! v i (string-append
                                            "\\x"
                                            (number->string i 16)))))
                      ugh)
                     ;; backslash-x
                     (lambda (ch)
                       (vector-ref v (char->integer ch)))))

                 (let loop ((start 0) (acc '()))
                   (cond ((string-index s ugh start)
                          => (lambda (idx)
                               (loop (1+ idx)
                                     (cons* (backslash-x (string-ref s idx))
                                            (substring/shared s start idx)
                                            acc))))
                         ((zero? start)
                          s)
                         (else
                          (string-concatenate-reverse
                           acc (substring/shared s start))))))))
        ;; Cool.
        identity))))

(define (try s)
  (simple-format
   #t "ORIG:\t~S~%NORM:\t~S~%=>\t~A~%~%"
   s (normalize s)
   (let ((round (with-input-from-string
                    (with-output-to-string
                      (lambda ()
                        (if (eq? identity normalize)
                            (write s)
                            (begin (display #\")
                                   (display (normalize s))
                                   (display #\")))))
                  read)))
     (cond ((equal? s round) 'SAME)
           (else (set! EXIT-VALUE #f)   ;-O
                 (string-append
                  "DIFF: ["
                  (number->string (string-length round))
                  "]|" round "|"))))))

(simple-format #t "Guile ~A~% LANG: ~S~% normalize: ~S~A~%~%"
               (version)
               (getenv "LANG")
               (procedure-name normalize)
               (if (and STUPID? (not (eq? normalize identity)))
                   " (but we stupidly revert to degeneracy)"
                   ""))

(try "")
(try (list->string (map integer->char (iota 256))))
(try "U+2002: | | (utf-8: E2 80 82)")
(try "U+232C: |⌬| (utf-8: E2 8C AC)")
(try "U+1D7FF: |𝟿| (utf-8: F0 9D 9F BF)")
(try "U+2F9B2: |䕫| (utf-8: F0 AF A6 B2)")
(try "U+2F9BC: |蜨| (utf-8: F0 AF A6 BC)")

(exit EXIT-VALUE)

--=-=-=--