From: Richard Wordingham
Newsgroups: gmane.emacs.help
Subject: Re: Is there a way to "asciify" a string?
Date: Thu, 31 May 2018 23:52:07 +0100
Message-ID: <20180531235207.7e65aa35@JRWUBU2>
To: help-gnu-emacs@gnu.org

On Thu, 31 May 2018 17:08:47 +0200 (CEST) "S.
Champailler" wrote:

> I second that, removing accents and other "nationalities" is much
> trickier than one might expect (you can look at Java example, the
> Java unicode support is quite complete), especially for languages far
> away from English such as Russian. By "tricky" I mean there are
> *hundreds* of edge cases. Nevertheless, there are ways to sort of do
> what you want by playing with things such as "non spacing combining
> characters", "normalized strings", etc. If you have the opportunity,
> just try to do it, the great lesson you'll get of that is that human
> languages are super complex (and thus super interesting).

Make sure you transliterate the string first. Remember that stripping
out Indic vowels (many of which are gc=Mn, i.e. nonspacing marks) is no
more reasonable than stripping out ASCII vowels.

> Today, everyone should use Unicode, it's much simpler. Many file
> systems support unicode.

But be warned that some very different strings may compare equal. The
Unicode Collation Algorithm is highly likely *not* to be the default.
Windows XP used to compare strings of Canadian Aboriginal Syllabics of
the same length as equal. I remember using sort -u to remove duplicates
from a list of words on a Linux distribution, and finding that I only
had one left. I now play safe and do that sort of trick in the C locale.

Richard.
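
The gc=Mn point can be illustrated with a minimal Python sketch. It is
not from this thread: the helper name strip_nonspacing_marks is
hypothetical, and it uses only the standard unicodedata module. It
applies the naive "normalize, then drop nonspacing marks" recipe and
shows that the same step that turns an accented Latin word into ASCII
also deletes the vowels from a Devanagari word.

```python
import unicodedata


def strip_nonspacing_marks(text: str) -> str:
    """Naive accent stripping (hypothetical helper, for illustration only).

    Decomposes the text to NFD, then drops every character whose
    Unicode General_Category is Mn (nonspacing mark).
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Latin: U+00E9 decomposes to "e" + U+0301 (combining acute, gc=Mn),
# so dropping the mark leaves plain ASCII.
latin = strip_nonspacing_marks("caf\u00e9")  # "cafe"

# Devanagari "guru": GA, VOWEL SIGN U, RA, VOWEL SIGN U.
# U+0941 (vowel sign U) is gc=Mn, so both vowels are deleted and only
# the consonants GA + RA remain.
indic = strip_nonspacing_marks("\u0917\u0941\u0930\u0941")  # "\u0917\u0930"

print(latin.isascii(), indic.isascii())  # True False
```

The Devanagari result is still not ASCII and has lost its vowels, which
is why transliterating first matters. Note also that not every Indic
vowel sign is gc=Mn (some are gc=Mc), so this filter would strip vowels
inconsistently even within one script.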