From: Mark H Weaver
Newsgroups: gmane.lisp.guile.devel
Subject: Re: byte-order marks
Date: Tue, 29 Jan 2013 03:22:17 -0500
Message-ID: <87y5fcjt52.fsf@tines.lan>
References: <87boc956j2.fsf@pobox.com>
In-Reply-To: <87boc956j2.fsf@pobox.com> (Andy Wingo's message of "Mon, 28 Jan 2013 22:42:09 +0100")
To: Andy Wingo
Cc: guile-devel

Hi Andy,

Andy Wingo writes:
> What do people think about this attached patch?

I'm strongly opposed to making 'open-input-file' any more clever than it
already is. Furthermore, I strongly believe that it should be much less
clever than it is now.

Our basic textual I/O should be robust by default, and should not
second-guess the specified encoding based on flimsy heuristics that work
99% of the time.

IMO, our default behavior should allow portable Scheme code to write an
arbitrary string of characters to a file in some encoding, and later
read it back, without having to worry about whether the string starts
with something that looks like a BOM, or contains a string that looks
like a coding declaration. The string might be from a network, and thus
potentially from a malicious source.

Frankly, I consider this to be a potential source of security flaws in
software built using Guile, and on that basis would advocate removing
the existing cleverness from 'open-input-file' in stable-2.0. At the
very least it should be removed from master.

Regarding byte-order marks, my preference is that users should
explicitly consume BOMs if that's what they want (ideally using some
convenience procedure provided by Guile). Sometimes consuming the BOM is
the wrong thing. For example, if the user is copying a file to another
file, or to a socket, it may be important to preserve the BOM.
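To make "explicitly consume" concrete, here is a rough sketch of what such a convenience procedure might look like. It is written in Python only because that makes it easy to run; it is not Guile's API. The names 'consume_bom' and 'BOMS' are made up for the illustration, and the stream is assumed to be a seekable binary stream. The helper never guesses an encoding: it skips a leading BOM only when the bytes are the BOM of the encoding the caller names, and otherwise leaves the stream untouched.

```python
import io

# BOM byte sequences keyed by encoding name (hypothetical table for this sketch).
BOMS = {
    "utf-8": b"\xef\xbb\xbf",
    "utf-16-be": b"\xfe\xff",
    "utf-16-le": b"\xff\xfe",
}


def consume_bom(stream, encoding):
    """Skip a leading BOM only if it is the BOM of `encoding`.

    `stream` is assumed to be a seekable binary stream positioned at its
    start. Returns True if a BOM was consumed, False otherwise.
    """
    bom = BOMS.get(encoding.lower())
    if bom is None:
        return False  # no BOM known for this encoding: do nothing
    start = stream.tell()
    if stream.read(len(bom)) == bom:
        return True
    stream.seek(start)  # not this encoding's BOM: put the bytes back
    return False


# The caller opts in; nothing is stripped unless this is called.
port = io.BytesIO(b"\xef\xbb\xbfhello")
consume_bom(port, "utf-8")
payload = port.read()
```

A program that copies bytes from one file to another simply never calls the helper, so the BOM survives the copy.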
If others feel strongly that BOMs should be consumed by default, then
the following compromise is about as far as I'd (reluctantly) consider
going:

* 'open-input-file' could perhaps auto-consume a BOM at the beginning of
  the stream, but *only* if the BOM is already in the encoding specified
  by the user (possibly via an explicit call to 'file-encoding'). For
  example, if the specified port encoding is UTF-8, then EF BB BF would
  be consumed, but FE FF or FF FE would be left alone.

* BOMs absolutely should *not* be used to determine the encoding unless
  the user has explicitly asked for coding auto-detection.

Having said all this, if 'open-input-file' is changed to no longer call
'scm_i_scan_for_file_encoding', then I think it's a fine idea to add
BOMs to its list of heuristics, though I tend to agree with Mike that a
coding declaration should take precedence, for the reasons he described.

However, I strongly believe that 'scm_i_scan_for_file_encoding' is the
wrong place to consume BOMs.

What do you think?

    Mark