From: Jean-Christophe Helary
Newsgroups: gmane.emacs.help
Subject: Re: Emacs as a translator's tool
Date: Sat, 30 May 2020 12:12:08 +0900
To: Emanuel Berg
Cc: help-gnu-emacs@gnu.org

> On May 30, 2020, at 10:33, Emanuel Berg via Users list for the GNU Emacs text editor wrote:
>
> Can't we compile a list of what the commercial CATs
> offer? M Helary and Mr Abrahamsen?

x commercial → ○ professional, if you don't mind :)

OmegaT is very much a professional tool and certainly not a "commercial" one.

My view, based on 20 years of practice but otherwise not very technically informed, is the following:

1) CAT tools extract translatable contents from various file formats into an easy-to-handle format, and put the translated contents back into the original format.
That way the translator does not have to worry *too much* about the idiosyncrasies of the original format.

→ File filters are a core part of a CAT tool *but*, as was suggested in the thread, it is possible to rely on an external filter that outputs contents in a standard localization "intermediate" format (current "industry" standards are PO and XLIFF). Such filters provide export and import functions so that the translated files are converted back to the original format.

File filters can also accept rules for not outputting non-translatable text (the current standard is ITS).

The PO format can be handled by po4a (Perl), translate-toolkit (Python) and the Okapi Framework tools (Java).

XLIFF has the Okapi Framework, OpenXLIFF (Electron/Node) and the translate-toolkit. All are top-notch pro-grade free software and, in the case of Okapi and OpenXLIFF, have been developed by people who have participated in the standardization process (XLIFF/TMX/SRX/ITS/TBX, etc.).

→ Emacs could rely on such external filters and only specialize in one "intermediate" format. The po-mode already does that for PO files.

2) Once the text is extracted, it needs to be segmented. Basic "no" segmentation usually means paragraph-based segmentation. Paragraphs are defined differently depending on the original format (1 or 2 line breaks for a text file, a block tag for XML-based formats, etc.).

Fine-grained segmentation is obtained by using a set of native-language-based regexes that include break rules and no-break rules. A simple example for English: break after a "period followed by a space", but don't break after "Mr. ".

→ File filters usually handle the segmentation part based on user specifications. Once the file is segmented into the intermediate format, it is not structurally trivial to "split" or "merge" segments because the tool needs to remember what will go back into the original file structure.
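
The break / no-break rule idea described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the rule lists and the `segment` function are made up for the example, and real rule sets (SRX files, for instance) are much richer and language-specific.

```python
import re

# Hypothetical rules for English, for illustration only.
# Break rule: a sentence-ending mark followed by whitespace.
BREAK_RULE = re.compile(r"[.!?]\s+")
# No-break rules: abbreviations after which a break must not happen.
NO_BREAK_RULES = [re.compile(r"\b(?:Mr|Mrs|Dr)\.\s+$")]


def segment(paragraph):
    """Split a paragraph into segments using break and no-break rules."""
    segments = []
    start = 0
    for match in BREAK_RULE.finditer(paragraph):
        candidate = paragraph[start:match.end()]
        # Skip this break point if the text up to it ends in a no-break pattern.
        if any(rule.search(candidate) for rule in NO_BREAK_RULES):
            continue
        segments.append(candidate.strip())
        start = match.end()
    # Whatever follows the last accepted break is the final segment.
    tail = paragraph[start:].strip()
    if tail:
        segments.append(tail)
    return segments


print(segment("Mr. Smith opened the file. He translated it."))
```

With these two rules the sample paragraph is split after "file." but not after "Mr.". The sketch keeps no record of where each segment came from, which is exactly the bookkeeping that makes later splitting or merging non-trivial.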
→ Emacs could rely on the external filters to handle the segmentation.

3) The real strength of a CAT tool shows where it helps the translator handle all the resources needed in the translation. Let me list potential resources:

- Legacy translations, called "translation memories" (TM), usually in multilingual "aligned" files where a given segment has equivalents in various languages. Translated PO files are used as TMs; the XML standard is TMX.
- Glossaries, usually in a similar but simpler format, sometimes only TSV, sometimes CSV; the XML-based standard is TBX.
- Internal translations, which are produced by the translator while translating, each translated segment adding to the project "memory".
- Dictionaries, a more global form of glossaries, usually monolingual; the format varies.
- External files, either local documents or web documents, in various formats, usually monolingual (otherwise they'd be aligned and used as TMs).

→ Each resource format needs a way to be parsed, memorized, fetched, and recycled efficiently during the translation.

4) Usually the process is the following:

- the translator "enters" a segment
- the tool displays "matches" from the resources that relatively closely correspond to the segment contents
- the translator inserts or modifies the matches
- when no matches are produced the translator enters a translation from scratch
- the translator can add glossary items to the project glossary
- the new translation is added to the "internal" memory set
- the translator moves to the next segment

5) The matching is usually some sort of Levenshtein distance-based algorithm. The "tokens" that are used in the "distance" calculation are usually produced by native-language-based tokenizers (the Lucene tokenizers are quite popular).

The better the match, the more efficient the tool is at helping the translator recycle resources.
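
As a rough sketch of token-based Levenshtein matching against a translation memory, here is a small Python example. Everything in it is an assumption made for illustration: the naive regex `tokenize` stands in for a real language-aware tokenizer, the percentage formula in `match_score` is one plausible normalization rather than what OmegaT or any other tool actually uses, and the sample memory is made up.

```python
import re


def tokenize(text):
    """Naive stand-in for a language-aware tokenizer: lowercase words and punctuation."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def levenshtein(a, b):
    """Edit distance between two token lists (insert, delete, substitute cost 1)."""
    previous = list(range(len(b) + 1))
    for i, token_a in enumerate(a, start=1):
        current = [i]
        for j, token_b in enumerate(b, start=1):
            cost = 0 if token_a == token_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def match_score(segment, candidate):
    """Similarity in percent; 100 means identical token sequences."""
    a, b = tokenize(segment), tokenize(candidate)
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - levenshtein(a, b) / longest))


def best_matches(segment, memory, threshold=50):
    """Return (score, source, target) tuples at or above threshold, best first."""
    scored = [(match_score(segment, src), src, tgt) for src, tgt in memory]
    return sorted((m for m in scored if m[0] >= threshold), reverse=True)


# Made-up translation memory (English source, French target).
memory = [
    ("Open the file.", "Ouvrez le fichier."),
    ("Close the window.", "Fermez la fenêtre."),
]
print(best_matches("Open the window.", memory))
```

Comparing tokens rather than characters means a changed word counts as one edit whatever its length. This sketch scans the whole memory for every segment, so it would not scale to very large TMs without some form of indexing or pre-filtering.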
The matching process/quality is where tools profoundly differ (OmegaT is generally considered to have excellent quality matches, sometimes better than expensive commercial tools). Some tools propose "context" matches, where the previous and next segments are also taken into account; some tools propose "subsegment" matches, where significant subparts can match even if the whole segment won't; etc.

The matching process must sometimes apply to extremely big resources (like many million lines of multilingual TMs in the case of the EU legal corpora) and must thus be able to handle the data quickly regardless of the set size.

6) Goodies that are time savers include:

- history-based autocompletion
- glossary/TM/dictionary-based autocompletion
- MT services access
- shortcuts that auto-insert predefined text chunks
- spell-checking/grammar checking
- QA checks against glossary terms, completeness/length of the translation, integrity of the format structure, numbers used, etc. (QA checks are also available as external processes in some of the solutions mentioned above, or related solutions.)

> I'll read thru this thread tomorrow (today)
> God willing but I don't understand everything, in
> particular examples would nice to get the exact
> meaning of the desired functionality...

Go ahead if you have questions.

> With examples we can also see if Emacs already can do
> it. And if not: Elisp contest :)

:)

> Some features are probably silly, we don't have to
> list or do them, or everything in the CATs, just what
> really makes sense and is useful on an every-day basis.

A lot of the heavy-duty tasks can be handled by external processes.

> When we are done, we put it in the wiki or in a pack.
>
> We can't have that Emacs doesn't have a firm grip on
> this issue. Because translation is a very common task
> with text!
>
> Also, let's compile a list of what Emacs already has
> to this end. It doesn't matter if some of that stuff
> already appears somewhere else, modularity is
> our friend.

:)

-- 
Jean-Christophe Helary @brandelune
http://mac4translators.blogspot.com