From mboxrd@z Thu Jan 1 00:00:00 1970
Newsgroups: gmane.emacs.help
Subject: Re: "split-sentences"?
Date: Sat, 23 Jan 2021 09:41:37 +0100
Message-ID: <20210123084136.GA2306@tuxteam.de>
References: <87zh109r2d.fsf@zoho.eu> <87v9bo9myu.fsf@zoho.eu>
In-Reply-To: <87v9bo9myu.fsf@zoho.eu>
To: help-gnu-emacs@gnu.org
List-Id: Users list for the GNU Emacs text editor

On Sat, Jan 23, 2021 at 07:38:49AM +0100, moasenwood--- via Users list for the GNU Emacs text editor wrote:
> moasenwood--- via Users list for the GNU Emacs text editor wrote:
>
> > Can I parse/split a string into sentences based on
> > human-language punctuation?
> >
> > Did anyone do that already?
>
> I mean very mechanically is fine, no linguistics or anything.
>
> So this
>
> "'This sentence is spoken by Mr. W. E. B Dubois, Esq.!' played
> through amazon.com alexa speakers?"
>
> would be
>
> ("'" "This sentence is spoken by Mr" "." "W" "." "E" "." "B
> Dubois" "," "Esq" "." "!" "'" "played through amazon" "."
> "com" "alexa "speakers" "?")

Not exactly your result, but this comes close:

(split-string
 "'This sentence is spoken by Mr. W. E. B Dubois, Esq.!' played through amazon.com alexa speakers?"
 "[[:punct:]][[:space:]]*")

=> ("" "This sentence is spoken by Mr" "W" "E" "B Dubois" "Esq" "" ""
    "played through amazon" "com alexa speakers" "")

You can adjust the results by tweaking the regexp (try word
boundaries like '\<' and '\>' if you want to keep punctuation)
or by using split-string's other optional parameters (e.g. a non-nil
OMIT-NULLS drops the empty strings).

Cheers
- t