From: Jean Louis
Newsgroups: gmane.emacs.help
Subject: Re: Org tag generator?
Date: Thu, 26 Dec 2024 23:44:21 +0300
To: Christopher Howard
Cc: Help Gnu Emacs Mailing List

* Christopher Howard [2024-12-24 19:35]:
> Now you are recommending a system whereby I myself have to manual
> recognize and record all specific words that lead me to assign a
> specific tag. This works, in a sense, but does nothing to reduce my
> work load. There might be, for example, thousands of plant-related
> words that lead me to apply the tag :botany:. I don't want to have
> to figure out what those words are specifically and make a record of
> them. I just want to intuitively apply the tag :botany: and then the
> software figures out, after observing many nodes, that the words
> "rainforest" or "flower" or should generate the tag :botany:.

I have found a deterministic way of getting keywords with some manual
preparation. I am using a PostgreSQL database; I am aware that this may
sound hard, though it is not.

Full documents can easily be imported into PostgreSQL, and a tsvector
for full-text search generated from them. I have done that. I have a
table hyobjects with a column hyobjects_tokens of type tsvector for
English (or another language).
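
As a rough picture of what such a token column holds, here is a minimal
Python sketch. It is only an illustration under stated assumptions, not
PostgreSQL's implementation: the function name simple_token_vector and
the tiny stopword list are hypothetical, and no stemming is done. It
lower-cases words, drops stopwords, and records the word positions of
whatever remains, which is loosely the kind of information a tsvector
stores.

```python
import re

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"a", "an", "the", "on", "in", "of", "and", "it"}


def simple_token_vector(text):
    """Map each lower-cased non-stopword to its 1-based word positions.

    Stopwords are left out of the result but still advance the position
    counter. No stemming is performed, so "flower" and "flowers" stay
    separate tokens in this sketch.
    """
    vector = {}
    words = re.findall(r"[A-Za-z]+", text)
    for position, word in enumerate(words, start=1):
        token = word.lower()
        if token in STOPWORDS:
            continue
        vector.setdefault(token, []).append(position)
    return vector


if __name__ == "__main__":
    print(simple_token_vector("a fat cat sat on a mat"))
```

The point of the sketch is that the result depends only on the text and
the stopword list, so the same document always yields the same tokens.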

Let us say I need some two-word terms from the database, ranked by
frequency; then it is a breeze:

SELECT 'Ugandan ' || sub_words[1] AS term,
       COUNT(*) AS count
  FROM hyobjects
 CROSS JOIN LATERAL
       regexp_matches(hyobjects_text,
                      '\mUgandan\s+([a-zA-Z]+)', 'g') AS sub_words
 WHERE NOT EXISTS (
         SELECT 1
           FROM stopwords
          WHERE LOWER(stopwords_name) = LOWER(sub_words[1])
       )
 GROUP BY sub_words[1]
 ORDER BY count DESC;

         term         | count
----------------------+-------
 Ugandan company      |    82
 Ugandan shillings    |    58
 Ugandan coffee       |    26
 Ugandan miners       |    12
 Ugandan people       |    12
 Ugandan shilling     |     8
 Ugandan Coffee       |     8
 Ugandan gold         |     7
 Ugandan Constitution |     6
 Ugandan government   |     6

I like this very deterministic approach, so I will use it for automatic
referencing and tagging, to make sales and marketing easier.

If you happen to need help generating keywords this way, I am
available 😄.

-- 
Jean Louis