unofficial mirror of notmuch@notmuchmail.org
* Add (extracted) attachment text to the search index?
@ 2017-03-01  2:58 Olaf TNSB
  2017-03-01 11:34 ` David Bremner
  0 siblings, 1 reply; 6+ messages in thread
From: Olaf TNSB @ 2017-03-01  2:58 UTC (permalink / raw)
  To: notmuch

Hi,

I was wondering if it was possible to add the text extracted from an
attachment to the search index?

For the moment let's leave aside the important issues like security,
buffer overflows, and clients having to install
doc2text/pandoc/pdftotext/whatever...


I *think* I'm trying to ask - How can I take a lump of text (e.g. from an
attachment) and associate it with a message ID so I can then search for it?

Is this a notmuch command, or a Xapian command?


Thanks!

* Re: Add (extracted) attachment text to the search index?
  2017-03-01  2:58 Add (extracted) attachment text to the search index? Olaf TNSB
@ 2017-03-01 11:34 ` David Bremner
  2017-03-01 17:55   ` Steven Allen
  0 siblings, 1 reply; 6+ messages in thread
From: David Bremner @ 2017-03-01 11:34 UTC (permalink / raw)
  To: Olaf TNSB, notmuch

Olaf TNSB <still.another.person@gmail.com> writes:

> Hi,
>
> I was wondering if it was possible to add the text extracted from an
> attachment to the search index?
>
> For the moment let's leave aside the important issues like security,
> buffer overflows, and clients having to install
> doc2text/pandoc/pdftotext/whatever...
>
>
> I *think* I'm trying to ask - How can I take a lump of text (e.g. from an
> attachment) and associate it with a message ID so I can then search for it?
>
> Is this a notmuch command, or a Xapian command?

This would require some modifications of notmuch. Either modifying
lib/index.cc to add the terms at indexing (notmuch new/insert) time, or
providing some way of adding the terms later. The former actually sounds
simpler to me.

d

* Re: Add (extracted) attachment text to the search index?
  2017-03-01 11:34 ` David Bremner
@ 2017-03-01 17:55   ` Steven Allen
  2017-03-01 22:45     ` Olaf TNSB
  0 siblings, 1 reply; 6+ messages in thread
From: Steven Allen @ 2017-03-01 17:55 UTC (permalink / raw)
  To: notmuch; +Cc: David Bremner, Olaf TNSB


David Bremner <david@tethera.net> writes:
> This would require some modifications of notmuch. Either modifying
> lib/index.cc to add the terms at indexing (notmuch new/insert) time, or
> providing some way of adding the terms later. The former actually sounds
> simpler to me.

To do this correctly, you'd want to be able to run an external text
extraction tool (for PDFs, word documents, etc.) so I think the latter
would be better in the long run (it would allow the user to index
attachments in the hooks).
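
As a rough illustration of the hook idea, the Python sketch below pulls
attachment parts out of a raw message with the standard library email
package and pipes one of them through an external command. Nothing here
is notmuch API: attachment_parts and run_extractor are hypothetical
helper names, and the extractor command is whatever tool the user has
installed. How the resulting text would get into the index is the open
question in this thread.

```python
import subprocess
from email import message_from_bytes, policy


def attachment_parts(raw_message: bytes):
    """Yield (content_type, filename, payload_bytes) for each attachment.

    Hypothetical helper: only parts carrying
    "Content-Disposition: attachment" are reported.
    """
    msg = message_from_bytes(raw_message, policy=policy.default)
    for part in msg.walk():
        if part.is_attachment():
            yield (
                part.get_content_type(),
                part.get_filename(),
                part.get_payload(decode=True),  # transfer-decoded bytes
            )


def run_extractor(argv, data: bytes, timeout: float = 30.0) -> str:
    """Feed `data` to an external command on stdin and return its stdout.

    Assumption: the tool reads from stdin and writes text to stdout.
    Raises CalledProcessError if the tool exits with a non-zero status.
    """
    result = subprocess.run(
        argv,
        input=data,
        stdout=subprocess.PIPE,
        check=True,
        timeout=timeout,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Whether a given extraction tool can read from stdin varies, so a real
hook might have to write the attachment to a temporary file first.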

* Re: Add (extracted) attachment text to the search index?
  2017-03-01 17:55   ` Steven Allen
@ 2017-03-01 22:45     ` Olaf TNSB
  2017-03-02  0:41       ` David Bremner
  0 siblings, 1 reply; 6+ messages in thread
From: Olaf TNSB @ 2017-03-01 22:45 UTC (permalink / raw)
  To: Steven Allen; +Cc: notmuch, David Bremner

On Thu, Mar 2, 2017 at 4:55 AM, Steven Allen <steven@stebalien.com> wrote:
>
>
> David Bremner <david@tethera.net> writes:
> > This would require some modifications of notmuch. Either modifying
> > lib/index.cc to add the terms at indexing (notmuch new/insert) time, or
> > providing some way of adding the terms later. The former actually sounds
> > simpler to me.
>
> To do this correctly, you'd want to be able to run an external text
> extraction tool (for PDFs, word documents, etc.) so I think the latter
> would be better in the long run (it would allow the user to index
> attachments in the hooks).

(As a non-dev...) I agree.  The ability to add (and delete!) content
post-insert sounds more desirable.  I don't want to have to re-index all my
email as the next version of <horrible-binary-object>-to-text gets
released.  I'd like to be able to (search-for-attachment)-(delete)-(re-add).


I was thinking a really hacky solution would be to fake up a new email with
the same headers but with the attachment text as the body, do a notmuch
new/insert, and then replace the file on disk of the new email with a link
to the original message (not sure if that will trigger notmuch new, I don't
think so).  Doesn't feel robust, but...


What do ya reckon?
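
For illustration only, the first step of that hack (a new message with
the same headers whose body is the attachment text) might look like this
in Python, using only the standard library. build_shadow_message and the
list of copied headers are made-up names and choices, not anything
notmuch provides, and the sketch does not address how notmuch treats two
files that share a Message-ID.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# Assumption: these are the headers worth carrying over to the fake message.
COPIED_HEADERS = ("From", "To", "Cc", "Subject", "Date", "Message-ID")


def build_shadow_message(original_bytes: bytes, attachment_text: str) -> EmailMessage:
    """Return a new message with the original's headers and a text body.

    Hypothetical helper: the body is the text extracted from an attachment.
    """
    original = message_from_bytes(original_bytes, policy=policy.default)
    shadow = EmailMessage()
    for name in COPIED_HEADERS:
        value = original[name]
        if value is not None:
            shadow[name] = str(value)
    # set_content() adds the text/plain Content-Type and MIME-Version headers.
    shadow.set_content(attachment_text)
    return shadow
```

The result of shadow.as_bytes() is what would be written into the mail
store before running notmuch new.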

* Re: Add (extracted) attachment text to the search index?
  2017-03-01 22:45     ` Olaf TNSB
@ 2017-03-02  0:41       ` David Bremner
  2017-03-02  0:55         ` Olaf TNSB
  0 siblings, 1 reply; 6+ messages in thread
From: David Bremner @ 2017-03-02  0:41 UTC (permalink / raw)
  To: Olaf TNSB, Steven Allen; +Cc: notmuch

Olaf TNSB <still.another.person@gmail.com> writes:

> On Thu, Mar 2, 2017 at 4:55 AM, Steven Allen <steven@stebalien.com> wrote:
>>
>>
>> David Bremner <david@tethera.net> writes:
>> > This would require some modifications of notmuch. Either modifying
>> > lib/index.cc to add the terms at indexing (notmuch new/insert) time, or
>> > providing some way of adding the terms later. The former actually sounds
>> > simpler to me.
>>
>> To do this correctly, you'd want to be able to run an external text
>> extraction tool (for PDFs, word documents, etc.) so I think the latter
>> would be better in the long run (it would allow the user to index
>> attachments in the hooks).
>
> (As a non-dev...) I agree.  The ability to add (and delete!) content
> post-insert sounds more desirable.  I don't want to have to re-index all my
> email as the next version of <horrible-binary-object>-to-text gets
> released.  I'd like to be able to (search-for-attachment)-(delete)-(re-add).

There have been some patches (related to encrypted email) that reindex
individual messages. So that would be a clean fix for that. I haven't
really thought about reindexing parts of messages, which is what seems
to be the proposal here.

d

* Re: Add (extracted) attachment text to the search index?
  2017-03-02  0:41       ` David Bremner
@ 2017-03-02  0:55         ` Olaf TNSB
  0 siblings, 0 replies; 6+ messages in thread
From: Olaf TNSB @ 2017-03-02  0:55 UTC (permalink / raw)
  To: David Bremner; +Cc: Steven Allen, notmuch

On Thu, Mar 2, 2017 at 11:41 AM, David Bremner <david@tethera.net> wrote:
> There have been some patches (related to encrypted email) that reindex
> individual messages. So that would be a clean fix for that. I haven't
> really thought about reindexing parts of messages, which is what seems
> to be the proposal here.

Oooh.  That's brilliant!  I would never have thought of treating
attachments as a re-indexing problem, rather than treating encrypted email
and emails with (PDF/TXT/DOCX) attachments as separate cases.

Then the issue collapses down to two different components:
1 - re-indexing messages
2 - plugins/external tools for different attachment types  - hello
mimetypes!
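
The second component could be as small as a table keyed by MIME type. A
minimal sketch in Python, where register, extract and the text/plain
handler are hypothetical names invented for this example; real handlers
for PDF or DOCX would presumably wrap external tools:

```python
from typing import Callable, Dict, Optional

Extractor = Callable[[bytes], str]

# Registry of text extractors, keyed by lower-cased MIME type.
_EXTRACTORS: Dict[str, Extractor] = {}


def register(content_type: str):
    """Decorator that registers a function as the extractor for a MIME type."""
    def decorator(func: Extractor) -> Extractor:
        _EXTRACTORS[content_type.lower()] = func
        return func
    return decorator


@register("text/plain")
def plain_text(data: bytes) -> str:
    # Assumption: plain-text attachments are UTF-8; bad bytes are replaced.
    return data.decode("utf-8", errors="replace")


def extract(content_type: str, data: bytes) -> Optional[str]:
    """Return extracted text, or None when no extractor is registered."""
    handler = _EXTRACTORS.get(content_type.lower())
    return handler(data) if handler is not None else None
```

Returning None for unknown types lets the caller skip attachments it
cannot handle instead of failing the whole message.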

Simple!  So you'll have a patch and next release by this weekend?   :-P
