From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail
From: Jean Louis <bugs@gnu.support>
Newsgroups: gmane.emacs.help
Subject: Re: Org tag generator?
Date: Thu, 26 Dec 2024 23:44:21 +0300
Message-ID: <Z23AJSV_-RduneGM@lco2>
References: <87zfktueqr.fsf@librehacker.com> <Z2VfjsJSq4HBDwWS@lco2>
 <878qs6ia9d.fsf@librehacker.com> <Z2nGULkpAl9FUiU5@lco2>
 <87r05xf3bb.fsf@librehacker.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214";
	logging-data="10565"; mail-complaints-to="usenet@ciao.gmane.io"
User-Agent: Mutt/2.2.12 (2023-09-09)
Cc: Help Gnu Emacs Mailing List <help-gnu-emacs@gnu.org>
To: Christopher Howard <christopher@librehacker.com>
Original-X-From: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane-mx.org@gnu.org Thu Dec 26 21:44:53 2024
Return-path: <help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane-mx.org@gnu.org>
Envelope-to: geh-help-gnu-emacs@m.gmane-mx.org
Original-Received: from lists.gnu.org ([209.51.188.17])
	by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
	(Exim 4.92)
	(envelope-from <help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane-mx.org@gnu.org>)
	id 1tQuiy-0002dp-T7
	for geh-help-gnu-emacs@m.gmane-mx.org; Thu, 26 Dec 2024 21:44:52 +0100
Original-Received: from localhost ([::1] helo=lists1p.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.90_1)
	(envelope-from <help-gnu-emacs-bounces@gnu.org>)
	id 1tQuic-0006gb-1b; Thu, 26 Dec 2024 15:44:30 -0500
Original-Received: from eggs.gnu.org ([2001:470:142:3::10])
 by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <bugs@gnu.support>) id 1tQuia-0006gR-G1
 for help-gnu-emacs@gnu.org; Thu, 26 Dec 2024 15:44:28 -0500
Original-Received: from stw1.rcdrun.com ([217.170.207.13])
 by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <bugs@gnu.support>) id 1tQuiY-0006lx-QU
 for help-gnu-emacs@gnu.org; Thu, 26 Dec 2024 15:44:28 -0500
Original-Received: from localhost ([::ffff:41.75.190.33])
 (AUTH: PLAIN admin, TLS: TLS1.3,256bits,ECDHE_RSA_AES_256_GCM_SHA384)
 by stw1.rcdrun.com with ESMTPSA
 id 000000000007DC20.00000000676DC028.00123343; Thu, 26 Dec 2024 13:44:23 -0700
Mail-Followup-To: Christopher Howard <christopher@librehacker.com>,
 Help Gnu Emacs Mailing List <help-gnu-emacs@gnu.org>
Content-Disposition: inline
In-Reply-To: <87r05xf3bb.fsf@librehacker.com>
Received-SPF: pass client-ip=217.170.207.13; envelope-from=bugs@gnu.support;
 helo=stw1.rcdrun.com
X-Spam_score_int: -18
X-Spam_score: -1.9
X-Spam_bar: -
X-Spam_report: (-1.9 / 5.0 requ) BAYES_00=-1.9,
 RCVD_IN_VALIDITY_RPBL_BLOCKED=0.001, RCVD_IN_VALIDITY_SAFE_BLOCKED=0.001,
 SPF_HELO_PASS=-0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no
X-Spam_action: no action
X-BeenThere: help-gnu-emacs@gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Users list for the GNU Emacs text editor <help-gnu-emacs.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/help-gnu-emacs>,
 <mailto:help-gnu-emacs-request@gnu.org?subject=unsubscribe>
List-Archive: <https://lists.gnu.org/archive/html/help-gnu-emacs>
List-Post: <mailto:help-gnu-emacs@gnu.org>
List-Help: <mailto:help-gnu-emacs-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/help-gnu-emacs>,
 <mailto:help-gnu-emacs-request@gnu.org?subject=subscribe>
Errors-To: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane-mx.org@gnu.org
Original-Sender: help-gnu-emacs-bounces+geh-help-gnu-emacs=m.gmane-mx.org@gnu.org
Xref: news.gmane.io gmane.emacs.help:149004
Archived-At: <http://permalink.gmane.org/gmane.emacs.help/149004>

* Christopher Howard <christopher@librehacker.com> [2024-12-24 19:35]:
> Now you are recommending a system whereby I myself have to manually
> recognize and record all specific words that lead me to assign a
> specific tag. This works, in a sense, but does nothing to reduce my
> work load. There might be, for example, thousands of plant-related
> words that lead me to apply the tag :botany:. I don't want to have
> to figure out what those words are specifically and make a record of
> them. I just want to intuitively apply the tag :botany: and then the
> software figures out, after observing many nodes, that the words
> "rainforest" or "flower" should generate the tag :botany:.

I have found a deterministic way of getting keywords, with some
manual preparation.

I am using a PostgreSQL database. I am aware that this may sound
hard, though it is not.

Full documents can easily be imported into PostgreSQL, and a tsvector
for full-text search can then be generated from them.

I have done that.

I have a table hyobjects with a column hyobjects_tokens of type
tsvector, built for English (or another language).

Let us say I need some two-word terms from the database, ranked by
how often they occur; then it is a breeze:

SELECT 'Ugandan ' || sub_words[1] AS term, COUNT(*) AS count
FROM hyobjects
CROSS JOIN LATERAL regexp_matches(hyobjects_text, '\mUgandan\s+([a-zA-Z]+)', 'g') AS sub_words
WHERE NOT EXISTS (
    SELECT 1 
    FROM stopwords 
    WHERE LOWER(stopwords_name) = LOWER(sub_words[1])
)
GROUP BY sub_words[1]
ORDER BY count DESC;

         term         | count 
----------------------+-------
 Ugandan company      |    82
 Ugandan shillings    |    58
 Ugandan coffee       |    26
 Ugandan miners       |    12
 Ugandan people       |    12
 Ugandan shilling     |     8
 Ugandan Coffee       |     8
 Ugandan gold         |     7
 Ugandan Constitution |     6
 Ugandan government   |     6

I like this fully deterministic approach, so I will use it for
automatic referencing and tagging, to make sales and marketing easier.
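
For anyone who wants to try the idea on a handful of texts before
setting up a database, the query above can be approximated in plain
Python. This is only a minimal sketch under stated assumptions: the
function name two_word_terms, the STOPWORDS set (a stand-in for the
stopwords table) and the sample sentences are all hypothetical and not
part of the database described here. Like the SQL, it is case
sensitive on the second word, so "coffee" and "Coffee" are counted
separately.

```python
import re
from collections import Counter

# Hypothetical stand-in for the stopwords table (stopwords_name column).
STOPWORDS = {"and", "the", "of", "in", "is", "are", "who", "that"}


def two_word_terms(texts, head="Ugandan", stopwords=STOPWORDS):
    """Count 'head + following word' pairs across texts, skipping stopwords."""
    # Roughly mirrors the PostgreSQL pattern '\mUgandan\s+([a-zA-Z]+)':
    # the head must start a word and be followed by whitespace and letters.
    pattern = re.compile(r"\b" + re.escape(head) + r"\s+([a-zA-Z]+)")
    counts = Counter()
    for text in texts:
        for word in pattern.findall(text):
            # Same test as the NOT EXISTS subquery: case-insensitive match.
            if word.lower() not in stopwords:
                counts[f"{head} {word}"] += 1
    # Most frequent first, like ORDER BY count DESC.
    return counts.most_common()


# Made-up sample sentences, for illustration only.
sample = [
    "Ugandan coffee is exported. The Ugandan company buys Ugandan coffee.",
    "A Ugandan company sells Ugandan coffee to the Ugandan people.",
    "Ugandan and Kenyan traders met.",
]

for term, count in two_word_terms(sample):
    print(f"{term:<20} | {count}")
```

The terms that come out on top are then candidates for tags or
cross-references; deciding which ones to keep is still the manual
preparation step.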

If you need help generating keywords this way, I am available 😄

-- 
Jean Louis