unofficial mirror of help-gnu-emacs@gnu.org
* "split-sentences"?
From: moasenwood--- via Users list for the GNU Emacs text editor @ 2021-01-23  5:10 UTC
  To: help-gnu-emacs

Can I parse/split a string into sentences based on
human-language punctuation?

Did anyone do that already?

TIA

-- 
underground experts united
http://user.it.uu.se/~embe8573
https://dataswamp.org/~incal





* Re: "split-sentences"?
From: moasenwood--- via Users list for the GNU Emacs text editor @ 2021-01-23  6:38 UTC
  To: help-gnu-emacs

moasenwood--- via Users list for the GNU Emacs text editor wrote:

> Can I parse/split a string into sentences based on
> human-language punctuation?
>
> Did anyone do that already?

I mean very mechanically is fine, no linguistics or anything.

So this

"'This sentence is spoken by Mr. W. E. B Dubois, Esq.!' played
through amazon.com alexa speakers?"

would be

("'" "This sentence is spoken by Mr" "." "W" "." "E" "."
 "B Dubois" "," "Esq" "." "!" "'" "played through amazon" "."
 "com alexa speakers" "?")


* Re: "split-sentences"?
From: tomas @ 2021-01-23  8:41 UTC
  To: help-gnu-emacs

On Sat, Jan 23, 2021 at 07:38:49AM +0100, moasenwood--- via Users list for the GNU Emacs text editor wrote:
> moasenwood--- via Users list for the GNU Emacs text editor wrote:
> 
> > Can I parse/split a string into sentences based on
> > human-language punctuation?
> >
> > Did anyone do that already?
> 
> I mean very mechanically is fine, no linguistics or anything.
> 
> So this
> 
> "'This sentence is spoken by Mr. W. E. B Dubois, Esq.!' played
> through amazon.com alexa speakers?"
>
> would be
> 
> ("'" "This sentence is spoken by Mr" "." "W" "." "E" "."
>  "B Dubois" "," "Esq" "." "!" "'" "played through amazon" "."
>  "com alexa speakers" "?")

Not exactly your result, but this comes close:

  (split-string
    "'This sentence is spoken by Mr. W. E. B Dubois, Esq.!' played through amazon.com alexa speakers?"
    "[[:punct:]][[:space:]]*")

=>

  (""
   "This sentence is spoken by Mr"
   "W"
   "E"
   "B Dubois"
   "Esq"
   ""
   ""
   "played through amazon"
   "com alexa speakers"
   "")

You can adjust the results by tweaking the regexp (try word
boundaries like '\<' and '\>' if you want to keep punctuation)
or split-string's other optional params (e.g. OMIT-NULLS to
drop the empty strings).
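
A minimal sketch of both tweaks, assuming Emacs 25 or later. The
first call passes t as split-string's OMIT-NULLS argument. The second
defines a hypothetical helper, my-split-keep-punct (the name is made
up here, it is not part of Emacs), that walks the string with
string-match and keeps each punctuation character as its own token:

```elisp
(require 'subr-x) ; string-trim and string-empty-p on older Emacs versions

;; Same regexp as above, but OMIT-NULLS is t, so empty strings are dropped.
(split-string
 "'This sentence is spoken by Mr. W. E. B Dubois, Esq.!' played through amazon.com alexa speakers?"
 "[[:punct:]][[:space:]]*"
 t)
;; => ("This sentence is spoken by Mr" "W" "E" "B Dubois" "Esq"
;;     "played through amazon" "com alexa speakers")

;; Hypothetical helper: punctuation characters become tokens of their own.
(defun my-split-keep-punct (string)
  "Split STRING into runs of non-punctuation and single punctuation chars.
Surrounding whitespace is trimmed and empty tokens are dropped."
  (let ((pos 0)
        (tokens '()))
    ;; Each match is either one punctuation char or a run of other chars,
    ;; so every match is non-empty and the loop always advances.
    (while (string-match "[[:punct:]]\\|[^[:punct:]]+" string pos)
      (let ((tok (string-trim (match-string 0 string))))
        (unless (string-empty-p tok)
          (push tok tokens)))
      (setq pos (match-end 0)))
    (nreverse tokens)))

(my-split-keep-punct "Esq.!' played through amazon.com")
;; => ("Esq" "." "!" "'" "played through amazon" "." "com")
```

This stays purely mechanical: "Mr." and "amazon.com" are still cut at
every dot.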

Cheers
 - t


* Re: "split-sentences"?
From: Tomas Hlavaty @ 2021-01-23  9:07 UTC
  To: help-gnu-emacs

On Sat 23 Jan 2021 at 09:41, <tomas@tuxteam.de> wrote:
> On Sat, Jan 23, 2021 at 07:38:49AM +0100, moasenwood--- via Users list for the GNU Emacs text editor wrote:
>> Can I parse/split a string into sentences based on
>> human-language punctuation?

not easily

>> Did anyone do that already?

https://www.unicode.org/reports/tr29/#Sentence_Boundaries

Does emacs expose unicode text functions?  For example to classify
characters, determine graphemes, words, sentences, line breaks etc?

>> I mean very mechanically is fine, no linguistics or anything.
>> 
>> So this
>> 
>> "'This sentence is spoken by Mr. W. E. B Dubois, Esq.!' played
>> through amazon.com alexa speakers?"
>>
>> would be
>> 
>> ("'" "This sentence is spoken by Mr" "." "W" "." "E" "."
>>  "B Dubois" "," "Esq" "." "!" "'" "played through amazon" "."
>>  "com alexa speakers" "?")

That is not really split-sentences.

The example has two sentences.  Moreover, the first sentence is the subject
of the second.

This would be represented something like this:

(sentence
  (sentence "This sentence is spoken by Mr. W. E. B Dubois, Esq.!")
  "played through amazon.com alexa speakers?")

but it depends on what you want to achieve.




* Re: "split-sentences"?
From: moasenwood--- via Users list for the GNU Emacs text editor @ 2021-01-23  9:32 UTC
  To: help-gnu-emacs

Tomas Hlavaty wrote:

>> I mean very mechanically is fine, no linguistics
>> or anything.
>> 
>> So this
>> 
>> "'This sentence is spoken by Mr. W. E. B Dubois, Esq.!'
>> played through amazon.com alexa speakers?"
>>
>> would be
>> 
>> ("'" "This sentence is spoken by Mr" "." "W" "." "E" "."
>>  "B Dubois" "," "Esq" "." "!" "'" "played through amazon" "."
>>  "com alexa speakers" "?")
>
> That is not really split-sentences.
>
> The example has two sentences.  Moreover, the first sentence is the subject
> of the second.
>
> This would be represented something like this:
>
> (sentence
>   (sentence "This sentence is spoken by Mr. W. E. B Dubois, Esq.!")
>   "played through amazon.com alexa speakers?")
>
> but it depends on what you want to achieve.

I found this post, which even provides an example:

  https://lists.gnu.org/archive/html/help-gnu-emacs/2021-01/msg00385.html


* Re: "split-sentences"?
From: moasenwood--- via Users list for the GNU Emacs text editor @ 2021-01-23  9:35 UTC
  To: help-gnu-emacs

tomas wrote:

> Not exactly your result, but this comes close:
>
>   (split-string "'This sentence is spoken by Mr. W. E.
>     B Dubois, Esq.!' played through amazon.com alexa
>     speakers?" "[[:punct:]][[:space:]]*")
>
> =>
>
>   (""
>    "This sentence is spoken by Mr"
>    "W"
>    "E"
>    "B Dubois"
>    "Esq"
>    ""
>    ""
>    "played through amazon"
>    "com alexa speakers"
>    "")
>
> You can adjust the results by tweaking the regexp (try word
> boundaries like '\<' and '\>'

*scratches my head*

> if you want to keep punctuation) or the other split-string's
> optional params (e.g. drop the empty matches, etc.).

Well, that's a start, for sure. Thanks :)

Silly me, I already used `split-string' 10 times...


* Re: "split-sentences"?
From: Eli Zaretskii @ 2021-01-23  9:48 UTC
  To: help-gnu-emacs

> From: Tomas Hlavaty <tom@logand.com>
> Date: Sat, 23 Jan 2021 10:07:06 +0100
> 
> Does emacs expose unicode text functions?

It does expose some of them, although not necessarily under the names
used by the UCS.

> For example to classify characters, determine graphemes, words,
> sentences, line breaks etc?

We have get-char-code-property for Unicode character properties and
find-composition for finding grapheme clusters (Emacs doesn't care
about graphemes, unless you use these two terms as aliases).  For
words, sentences, and line breaks, we use our own definitions, and
generally don't support the Unicode delimiters like U+2028.
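
As a small illustration of get-char-code-property, the sketch below
reads the Unicode general category of a few characters.  The helper
my-punctuation-char-p is a hypothetical name introduced only for this
example; it treats every "P*" general category as punctuation, which
is an assumption, not an Emacs definition of a sentence boundary:

```elisp
;; The `general-category' property is returned as a symbol.
(get-char-code-property ?. 'general-category)   ; => Po
(get-char-code-property ?a 'general-category)   ; => Ll
(get-char-code-property ?¿ 'general-category)   ; => Po

;; Hypothetical helper (not part of Emacs).
(defun my-punctuation-char-p (char)
  "Return non-nil if CHAR has a Unicode punctuation general category."
  (memq (get-char-code-property char 'general-category)
        '(Pc Pd Ps Pe Pi Pf Po)))

(my-punctuation-char-p ?!)   ; non-nil
(my-punctuation-char-p ?x)   ; => nil
```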




* Re: "split-sentences"?
From: tomas @ 2021-01-23 13:10 UTC
  To: help-gnu-emacs

On Sat, Jan 23, 2021 at 10:35:51AM +0100, moasenwood--- via Users list for the GNU Emacs text editor wrote:
> tomas wrote:
> 
> > Not exactly your result, but this comes close:

[...]

> > You can adjust the results by tweaking the regexp (try word
> > boundaries like '\<' and '\>'
> 
> *scratches my head*

A candidate for a sentence boundary is a word boundary
(plus some other conditions). This was at least my thought
process leading to that suggestion. It might be a bad
suggestion, though.

> > if you want to keep punctuation) or the other split-string's
> > optional params (e.g. drop the empty matches, etc.).
> 
> Well, that's a start, for sure. Thanks :)

You're welcome. Note that [:punct:] may be too broad a category:
does a sentence end with a comma? A semi-colon? A colon? What
about question and exclamation marks? What about the latter in
a language like Spanish, where they're parenthetical: "Ella
me preguntó ¿qué quieres?" (the parenthetical things make it
much easier to embed a question or an exclamation into something
else).
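
To make the "too broad" point concrete, here is a sketch that splits
only on runs of ".", "!" and "?".  Restricting the separators to those
three characters is an assumption that suits English-like text; the
helper name my-split-on-terminators is hypothetical:

```elisp
;; Hypothetical helper (not part of Emacs).
(defun my-split-on-terminators (string)
  "Split STRING after runs of `.', `!' or `?', dropping empty pieces."
  (split-string string "[.!?]+[[:space:]]*" t))

(my-split-on-terminators "Hello there. How are you? Fine!")
;; => ("Hello there" "How are you" "Fine")

;; Commas now survive, but abbreviations and domain names are still cut:
(my-split-on-terminators "Mr. W. E. B Dubois, Esq.! Try amazon.com")
;; => ("Mr" "W" "E" "B Dubois, Esq" "Try amazon" "com")
```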

As always, the really interesting questions are left as exercises to
the reader... until you end with Natural Language Processing :-)

Possibly this is the danger Tomas Hlavaty is hinting at elsethread.

> Silly me, I already used `split-string' 10 times...

C'mon. Wetware caches are like that. Mine too.

Cheers
 - t


* Re: "split-sentences"?
From: Eric Abrahamsen @ 2021-01-23 17:46 UTC
  To: help-gnu-emacs

<tomas@tuxteam.de> writes:

> On Sat, Jan 23, 2021 at 10:35:51AM +0100, moasenwood--- via Users list
> for the GNU Emacs text editor wrote:
>> tomas wrote:
>> 
>> > Not exactly your result, but this comes close:
>
> [...]
>
>> > You can adjust the results by tweaking the regexp (try word
>> > boundaries like '\<' and '\>'
>> 
>> *scratches my head*
>
> A candidate for a sentence boundary is a word boundary
> (plus some other conditions). This was at least my thought
> process leading to that suggestion. It might be a bad
> suggestion, though.

You can see what Emacs thinks would make a good sentence boundary with
the (sentence-end) function.
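
A rough sketch of how that could be turned into a splitter: put the
string in a temporary buffer and step through it with
forward-sentence, which uses the regexp returned by (sentence-end).
The function name my-split-sentences is hypothetical, and binding
sentence-end-double-space to nil (so that one space after the period
is enough) is an assumption made for this example:

```elisp
(require 'subr-x) ; string-trim and string-empty-p on older Emacs versions

;; Hypothetical helper (not part of Emacs).
(defun my-split-sentences (string)
  "Split STRING into sentences using `forward-sentence'."
  (with-temp-buffer
    (insert string)
    (goto-char (point-min))
    (let ((sentence-end-double-space nil) ; assumption: single-space style
          (sentences '()))
      (while (not (eobp))
        (let ((start (point)))
          (forward-sentence 1)
          ;; Guard against getting stuck, e.g. on trailing whitespace.
          (when (<= (point) start)
            (goto-char (point-max)))
          (let ((s (string-trim
                    (buffer-substring-no-properties start (point)))))
            (unless (string-empty-p s)
              (push s sentences)))))
      (nreverse sentences))))

(my-split-sentences "One. Two! Three?")
;; => ("One." "Two!" "Three?")
```

Like the regexp approaches, this knows nothing about abbreviations, so
"Mr. W." would still be split in single-space mode.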





* Re: "split-sentences"?
From: tomas @ 2021-01-23 20:56 UTC
  To: help-gnu-emacs

On Sat, Jan 23, 2021 at 09:46:40AM -0800, Eric Abrahamsen wrote:
> <tomas@tuxteam.de> writes:

[...]

> > A candidate for a sentence boundary [...]

> You can see what Emacs thinks would make a good sentence boundary with
> the (sentence-end) function.

Wow-wow-wow! Even more functions! Thanks for the hint :)

Cheers
 - t


* Re: "split-sentences"?
From: Tomas Hlavaty @ 2021-01-24  0:39 UTC
  To: help-gnu-emacs

On Sat 23 Jan 2021 at 11:48, Eli Zaretskii <eliz@gnu.org> wrote:
> We have get-char-code-property for Unicode character properties and
> find-composition for finding grapheme clusters (Emacs doesn't care
> about graphemes, unless you use these two terms as aliases).  For
> words, sentences, and line breaks, we use our own definitions, and
> generally don't support the Unicode delimiters like U+2028.

thanks, I'll have a look


