From: Juri Linkov
Newsgroups: gmane.emacs.devel
Subject: Re: Language identification
Date: Fri, 28 Aug 2009 22:08:28 +0300
Organization: JURTA
Message-ID: <87k50noj5r.fsf@mail.jurta.org>
References: <87skfczqc8.fsf@mail.jurta.org> <87my5kl9ld.fsf@alexott.dev.webwasher.com>
In-Reply-To: <87my5kl9ld.fsf@alexott.dev.webwasher.com> (Alex Ott's message of "Fri, 28 Aug 2009 08:46:06 +0200")
To: Alex Ott
Cc: joakim@verona.se, Emacs Development
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.1.50 (x86_64-pc-linux-gnu)
Xref: news.gmane.org gmane.emacs.devel:114782

>>> In `auto-mode-alist' you can see that with the exception of
>>> `archive-mode', `doc-view-mode' and `image-mode', all remaining
>>> modes are programming text modes.  It would be more useful
>>> to identify file types for these modes, which libmagic can't do.
>>> Do you know a library that identifies programming languages?
>>> Such a library might be implemented using a Bayesian classifier
>>> trained on a sufficiently large corpus of different programming
>>> languages.
>>
>> N-gram algorithms could be used to identify languages - they are
>> simpler than Bayes and require a smaller database.
>
> Sorry, I missed that this was about programming languages, not real
> languages.

It would be interesting to try using N-gram algorithms for programming
languages and see how well they perform.  For example, a frequently
occurring bigram such as "/*" points to C, a frequently occurring
trigram such as ";;;" points to Lisp, etc.

-- 
Juri Linkov
http://www.jurta.org/emacs/
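
The N-gram idea above could be tried out with something like the
following.  This is only a minimal sketch in Python, not an existing
library or anything proposed in the thread: the function names, the
two tiny training snippets and the use of cosine similarity are all
assumptions made for illustration.  A real experiment would need a
much larger corpus per language.

```python
from collections import Counter
import math


def ngram_profile(text, n_values=(2, 3)):
    """Count character n-grams (bigrams and trigrams by default) in text."""
    counts = Counter()
    for n in n_values:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count vectors."""
    # Counter returns 0 for missing keys, so unseen n-grams add nothing.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify_language(text, profiles):
    """Return the key of the profile whose n-gram counts best match text."""
    sample = ngram_profile(text)
    return max(profiles,
               key=lambda lang: cosine_similarity(sample, profiles[lang]))


# Hypothetical toy "corpus": one short snippet per language.
TRAINING_SAMPLES = {
    "c": "/* add two ints */\nint add(int a, int b) {\n    return a + b;\n}\n",
    "lisp": ";;; add two numbers\n(defun add (a b)\n  (+ a b))\n",
}

PROFILES = {lang: ngram_profile(src)
            for lang, src in TRAINING_SAMPLES.items()}

if __name__ == "__main__":
    print(identify_language("/* comment */ int main(void) { return 0; }",
                            PROFILES))
    print(identify_language(";;; comment\n(defun foo (x) (car x))",
                            PROFILES))
```

Cosine similarity is used here only because it is short to write down;
a rank-based comparison of the most frequent N-grams would be another
way to compare profiles.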