From: Tomas Hlavaty
Newsgroups: gmane.emacs.help
Subject: Re: "split-sentences"?
Date: Sat, 23 Jan 2021 10:07:06 +0100
Message-ID: <87lfcknhs5.fsf@logand.com>
References: <87zh109r2d.fsf@zoho.eu> <87v9bo9myu.fsf@zoho.eu> <20210123084136.GA2306@tuxteam.de>
In-Reply-To: <20210123084136.GA2306@tuxteam.de>
To: help-gnu-emacs@gnu.org

On Sat 23 Jan 2021 at 09:41, wrote:
> On Sat, Jan 23, 2021 at 07:38:49AM +0100, moasenwood--- via Users list for the GNU Emacs text editor wrote:
>> Can I parse/split a string into sentences based on
>> human-language punctuation?

Not easily.

>> Did anyone do that already?

https://www.unicode.org/reports/tr29/#Sentence_Boundaries

Does Emacs expose Unicode text functions, for example to classify
characters or to determine graphemes, words, sentences, line breaks
etc.?

>> I mean very mechanically is fine, no linguistics or anything.
>>
>> So this
>>
>> "'This sentence is spoken by Mr. W. E. B Dubois, Esq.!' played
>> through amazon.com alexa speakers?"
>>
>> would be
>>
>> ("'" "This sentence is spoken by Mr" "." "W" "." "E" "." "B
>> Dubois" "," "Esq" "." "!" "'" "played through amazon" "."
>> "com" "alexa "speakers" "?")

That is not really split-sentences: the result is a list of fragments
cut at every punctuation mark, not a list of sentences. The example
has two sentences. Moreover, the first sentence is the subject of the
second. This would be represented something like this:

   (sentence
    (sentence "This sentence is spoken by Mr. W. E. B Dubois, Esq.!")
    "played through amazon.com alexa speakers?")

but it depends on what you want to achieve.
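
To make the difference concrete, here is a rough sketch of both readings. It is written in Python purely for illustration; it is not an Emacs or Unicode API, and the names `mechanical_split` and `nest_quoted` are hypothetical. The first function does the purely mechanical cut at punctuation asked for above. The second assumes the text starts with one single-quoted sentence and nests it inside the outer one, roughly mirroring the `(sentence (sentence ...) ...)` form. Neither implements the UAX #29 sentence-boundary rules.

```python
import re

# Hypothetical helpers for illustration only; not an Emacs or Unicode API.
PUNCT = re.compile(r"([.,!?'])")  # capturing group keeps the marks in the result


def mechanical_split(text):
    """Cut at every punctuation mark, keeping each mark as its own token."""
    parts = PUNCT.split(text)
    # Drop empty pieces and surrounding whitespace left over from the cut.
    return [p.strip() for p in parts if p.strip()]


def nest_quoted(text):
    """Nest a leading single-quoted sentence inside the outer sentence.

    Assumes at most one quoted span, located at the start of the text.
    """
    m = re.match(r"\s*'(.*)'\s*(.*)$", text, re.S)
    if m is None:
        return ("sentence", text.strip())
    return ("sentence", ("sentence", m.group(1)), m.group(2))


EXAMPLE = ("'This sentence is spoken by Mr. W. E. B Dubois, Esq.!' played "
           "through amazon.com alexa speakers?")

if __name__ == "__main__":
    print(mechanical_split(EXAMPLE))
    print(nest_quoted(EXAMPLE))
```

On the example string, the mechanical cut breaks "Mr. W. E. B" and "amazon.com" apart at every period and keeps "com alexa speakers" as one token, since no punctuation separates those words. The nested form keeps the abbreviations intact only because it never looks inside the quoted span.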