From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: bug#48114: Disarchive occasionally fails tests
Date: Mon, 3 May 2021 08:19:50 +0200
From: Bengt Richter
To: Timothy Sample
Cc: 48114@debbugs.gnu.org
Reply-To: Bengt Richter
Message-ID: <20210503061950.GA26660@LionPure>
In-Reply-To: <8735v4ea7y.fsf@ngyro.com>
References: <87v984gkhn.fsf@inria.fr> <87pmybeen3.fsf@ngyro.com> <874kfk6h8o.fsf@gnu.org> <87a6pceerf.fsf@ngyro.com> <8735v4ea7y.fsf@ngyro.com>
List-Id: Bug reports for GNU Guix
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
User-Agent: Mutt/1.10.1 (2018-07-13)

Hi Timothy, Ludo,

On +2021-05-03 00:02:09 -0400, Timothy Sample wrote:
> Timothy Sample writes:
>
> > I’m still looking into this, but I wanted to quickly post this
> > reproducer for the Guile bug:
> >
> >     (use-modules (ice-9 regex))
> >     (define str
> >       "\U101514\U103ab0\U0f6e6e\U02e278\U01d9eb\U10b996\U1089b5\uea15\U0fa074\U101e41\U02e330\u0177\u2492")
> >     (match:substring (string-match "[0-8]+" str))
> >
> > This triggers the out-of-range error when run with “LC_ALL=C”.
>
> It turns out that all that’s needed is the last code point, which is
> “Number Eleven Full Stop”, or ‘⒒’.  When Guile converts this to an ASCII
> C string using ‘u32_conv_from_encoding’, it becomes “11.”.  The regex
> (“[0-8]+”) matches the “11” part with start index 0 and end index 2.
> The ‘fixup_multibyte_match’ function does nothing (it only matters when
> the locale encoding is multibyte) [1].  Guile then builds the match
> vector with the original string but keeps the ASCII offsets.  In other
> words, it thinks the match substring goes from 0 to 2 in a single code
> point string:
>
>     ,use (ice-9 regex)
>     (string-match "11" "\u2492")
>     => #("\u2492" (0 . 2))
>
> I’m not sure there’s any way to solve this nicely in Guile.  It would be
> clearer if the match vector included the string as libc matched it, but
> it’s still surprising that the match happens with a different string.
>
> In Disarchive, I can rewrite the generator without regex.  I’ll do that
> and see what I can do about the “Gave up!” issue.
>
> [1] It works on the converted-to-ASCII C string, which means that the
> byte offsets and code point offsets are the same.  Hence, it has nothing
> to do.
>
>
> -- Tim
>

What happens with these? (code points in decimal)

 8554 _Ⅺ_ "ROMAN NUMERAL ELEVEN"
 8570 _ⅺ_ "SMALL ROMAN NUMERAL ELEVEN"
 9322 _⑪_ "CIRCLED NUMBER ELEVEN"
 9342 _⑾_ "PARENTHESIZED NUMBER ELEVEN"
 9362 _⒒_ "NUMBER ELEVEN FULL STOP"
 9451 _⓫_ "NEGATIVE CIRCLED NUMBER ELEVEN"
13155 _㍣_ "IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR ELEVEN"
13290 _㏪_ "IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY ELEVEN"

I would argue that none of these should be "decoded" into ASCII
polyglyphs, since each is an atomic character glyph. IMO it is an
over-eager transformation to turn them into ASCII polyglyphs.

/Super/sub/-script placement metadata is another thing to consider --
"decode" to ASCII art?? ;-)

Unicode characters representing mathematical values in other languages
are different. Those are subject to natural language translation with
locale-dependent semantics.

These might be candidates for that?: (code points in decimal)

 8544 _Ⅰ_ "ROMAN NUMERAL ONE"
 8545 _Ⅱ_ "ROMAN NUMERAL TWO"
 8546 _Ⅲ_ "ROMAN NUMERAL THREE"
 8547 _Ⅳ_ "ROMAN NUMERAL FOUR"
 8548 _Ⅴ_ "ROMAN NUMERAL FIVE"
 8549 _Ⅵ_ "ROMAN NUMERAL SIX"
 8550 _Ⅶ_ "ROMAN NUMERAL SEVEN"
 8551 _Ⅷ_ "ROMAN NUMERAL EIGHT"
 8552 _Ⅸ_ "ROMAN NUMERAL NINE"
 8553 _Ⅹ_ "ROMAN NUMERAL TEN"
 8554 _Ⅺ_ "ROMAN NUMERAL ELEVEN"
 8555 _Ⅻ_ "ROMAN NUMERAL TWELVE"
 8556 _Ⅼ_ "ROMAN NUMERAL FIFTY"
 8557 _Ⅽ_ "ROMAN NUMERAL ONE HUNDRED"
 8558 _Ⅾ_ "ROMAN NUMERAL FIVE HUNDRED"
 8559 _Ⅿ_ "ROMAN NUMERAL ONE THOUSAND"
 8560 _ⅰ_ "SMALL ROMAN NUMERAL ONE"
 8561 _ⅱ_ "SMALL ROMAN NUMERAL TWO"
 8562 _ⅲ_ "SMALL ROMAN NUMERAL THREE"
 8563 _ⅳ_ "SMALL ROMAN NUMERAL FOUR"
 8564 _ⅴ_ "SMALL ROMAN NUMERAL FIVE"
 8565 _ⅵ_ "SMALL ROMAN NUMERAL SIX"
 8566 _ⅶ_ "SMALL ROMAN NUMERAL SEVEN"
 8567 _ⅷ_ "SMALL ROMAN NUMERAL EIGHT"
 8568 _ⅸ_ "SMALL ROMAN NUMERAL NINE"
 8569 _ⅹ_ "SMALL ROMAN NUMERAL TEN"
 8570 _ⅺ_ "SMALL ROMAN NUMERAL ELEVEN"
 8571 _ⅻ_ "SMALL ROMAN NUMERAL TWELVE"
 8572 _ⅼ_ "SMALL ROMAN NUMERAL FIFTY"
 8573 _ⅽ_ "SMALL ROMAN NUMERAL ONE HUNDRED"
 8574 _ⅾ_ "SMALL ROMAN NUMERAL FIVE HUNDRED"
 8575 _ⅿ_ "SMALL ROMAN NUMERAL ONE THOUSAND"
 8576 _ↀ_ "ROMAN NUMERAL ONE THOUSAND C D"
 8577 _ↁ_ "ROMAN NUMERAL FIVE THOUSAND"
 8578 _ↂ_ "ROMAN NUMERAL TEN THOUSAND"
 8579 _Ↄ_ "ROMAN NUMERAL REVERSED ONE HUNDRED"
 8581 _ↅ_ "ROMAN NUMERAL SIX LATE FORM"
 8582 _ↆ_ "ROMAN NUMERAL FIFTY EARLY FORM"
 8583 _ↇ_ "ROMAN NUMERAL FIFTY THOUSAND"
 8584 _ↈ_ "ROMAN NUMERAL ONE HUNDRED THOUSAND"
12321 _〡_ "HANGZHOU NUMERAL ONE"
12322 _〢_ "HANGZHOU NUMERAL TWO"
12323 _〣_ "HANGZHOU NUMERAL THREE"
12324 _〤_ "HANGZHOU NUMERAL FOUR"
12325 _〥_ "HANGZHOU NUMERAL FIVE"
12326 _〦_ "HANGZHOU NUMERAL SIX"
12327 _〧_ "HANGZHOU NUMERAL SEVEN"
12328 _〨_ "HANGZHOU NUMERAL EIGHT"
12329 _〩_ "HANGZHOU NUMERAL NINE"
12344 _〸_ "HANGZHOU NUMERAL TEN"
12345 _〹_ "HANGZHOU NUMERAL TWENTY"
12346 _〺_ "HANGZHOU NUMERAL THIRTY"

Just my intuitive reaction, no academic creds to back it up ;)
--
Regards,
Bengt Richter
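
PS: For anyone who wants to poke at the "eleven" characters without a
Guile build at hand, here is a rough Python sketch. It is not Guile,
and it uses NFKC compatibility normalization as a stand-in for whatever
transliteration the C-locale conversion really applies, so treat the
output as indicative only. The names ELEVENS, expand and
stale_match_span are made up for this illustration.

```python
import re
import unicodedata

# Decimal code points of the "eleven" characters listed in this mail.
ELEVENS = [8554, 8570, 9322, 9342, 9362, 9451, 13155, 13290]


def expand(cp):
    """Return the NFKC compatibility expansion of a single code point.

    Assumption: NFKC only approximates the locale-driven conversion
    discussed in this thread; the two may disagree for some characters.
    """
    return unicodedata.normalize("NFKC", chr(cp))


def stale_match_span(text, pattern="[0-8]+"):
    """Search the expanded form of `text` and return (start, end), or None.

    The offsets index the expanded string, not `text`, which is the kind
    of mismatch described in the quoted analysis.
    """
    m = re.search(pattern, unicodedata.normalize("NFKC", text))
    return m.span() if m else None


if __name__ == "__main__":
    for cp in ELEVENS:
        ch = chr(cp)
        # ascii() keeps the output printable even under LC_ALL=C.
        print(cp, unicodedata.name(ch), ascii(expand(cp)), stale_match_span(ch))
```

For 9362 the expansion is "11." and the span is (0, 2), although the
original string is only one code point long. Python slicing would
silently clip such a span; Guile reports the out-of-range error quoted
above.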