Computational lexicography links
A collection of links related to computational lexicography and terminology management
Dictionary writing systems
Lexicography software is an area of software applications where one size doesn’t fit all. Most dictionary writing software is developed by dictionary publishers in-house or on demand for a specific purpose, and tends to be too specialized to be useful to anyone else. The following three are exceptions: they aim to be generic products that can be deployed on any dictionary project.
- Lexique Pro A freeware dictionary editor, developed by SIL (Summer Institute of Linguistics), widely used by field linguists and lexicographers working with under-documented languages.
- TshwaneLex A very powerful and customizable dictionary writing and publishing system from the South African company TshwaneDJe. Not free but various discounts are available.
- IDM The dictionary production system from the French company IDM is not actually a product you can buy; it is more of a solution that the company customizes for each individual dictionary project. Solutions by IDM are used by many high-street dictionary publishers such as Longman, Macmillan and Oxford University Press.
Terminology management systems
Unlike dictionary writing, terminology management tends to be viewed as a corporate activity, and as such it is supported by the following two software packages.
- Multiterm Probably the best known terminology database system of them all, previously owned by Trados, a company known mainly for its translation memory products. Trados has since been acquired by SDL International.
- TermStar and WebTerm Terminology management databases from Trados’ main rival, Star.
Both SDL and Star are mainly producers of computer-aided translation software (such as translation memory tools), and their terminology solutions often come bundled with those.
Major ISO standards
The International Organization for Standardization (ISO) has been very active in producing standards for the interchange of language data. Within ISO, the unit responsible for language-related standards is Technical Committee 37 (TC 37). The following is not a complete list of all standards released by TC 37 but only a selection of the standards most relevant to lexicography and terminology.
Non-ISO interchange formats
- OLIF (Open Lexicon Interchange Format), a broadly-scoped interchange standard maintained by the vendor-neutral OLIF Consortium.
- TBX (TermBase Exchange) An XML-based standard for exchanging data between terminology databases. Implements ISO 16642. Developed by LISA (the Localization Industry Standards Association). Not to be confused with TMX, LISA’s other standard for exchanging data between translation memories.
- The Text Encoding Initiative (TEI) has produced XML DTDs and guidelines for encoding dictionaries (Chapter 12) and terminological databases (Chapter 13). As is common in TEI, the guidelines are highly suitable for the mark-up of free-form text and are not as rigid as OLIF and TBX.
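To give a rough idea of what XML-based terminology interchange looks like in practice, here is a minimal Python sketch that reads a TBX-style fragment and groups terms by concept and language. The sample is heavily simplified and is my own illustration: the element names (martif, termEntry, langSet, tig, term) follow the basic TBX structure as I understand it, while the sample data and the function name read_concepts are hypothetical. Real TBX files also carry a header and much richer metadata.

```python
import xml.etree.ElementTree as ET

# Simplified TBX-style fragment (assumption: real files also have a header,
# a DOCTYPE declaration and descriptive metadata on every level).
SAMPLE = """<martif type="TBX" xml:lang="en">
  <text>
    <body>
      <termEntry id="c1">
        <langSet xml:lang="en"><tig><term>dictionary</term></tig></langSet>
        <langSet xml:lang="ga"><tig><term>foclóir</term></tig></langSet>
      </termEntry>
    </body>
  </text>
</martif>"""

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_concepts(xml_text):
    """Return {concept id: {language code: [terms]}} (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    concepts = {}
    for entry in root.iter("termEntry"):        # one termEntry per concept
        by_lang = {}
        for lang_set in entry.iter("langSet"):  # one langSet per language
            lang = lang_set.get(XML_LANG)
            terms = [term.text for term in lang_set.iter("term")]
            by_lang.setdefault(lang, []).extend(terms)
        concepts[entry.get("id")] = by_lang
    return concepts


print(read_concepts(SAMPLE))
```

The point of the sketch is the concept-oriented layout: each entry stands for one concept, and the terms in the various languages hang off it, which is quite different from the headword-oriented layout of a typical dictionary entry.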
Large and/or important terminology databases
- IATE is the European Union’s centralized terminology database, used internally by its translators and also accessible to the public. The database covers all of the EU’s official languages (currently 23) as well as Latin. According to this press release, IATE contains 8.7 million terms. I suspect that this is actually the total number of terms across all 24 languages, and that the number of concepts is somewhere around 1 million. Either way, IATE is probably the largest terminology database in the world.
- Termium is the Government of Canada Translation Bureau’s terminology database covering English, French and, to a lesser extent, Spanish. It claims to contain 3.5 million terms but again, I am not sure whether that is the number of terms in all three languages or the number of concepts. Presuming that each concept in the Termium database is designated on average by three terms (one English, one French and possibly one Spanish), the number of concepts in the database would be somewhere around 1 million (3.5 million divided by 3 is roughly 1.2 million), similar to IATE. Termium is only available for a subscription fee. You used to be able to get a trial account for a month but that option does not seem to be available any more.
- EuroTermBank is a consortium of terminology organizations from several European countries. It has created something like a meta-database that combines many smaller terminology databases and dictionaries in 25 European languages, although some languages are represented more numerously than others. It claims to provide access to 1.5 million terms (but how many concepts is that?) from its own database as well as from several interconnected databases.
- focal.ie is Ireland’s national terminology database, covering mainly Irish and English with some terms in Latin (plant and animal names) and a very small number in several other languages. I am including focal.ie here because I have had the privilege to be involved in its creation and maintenance from the very start. With some 140,000 concepts and 280,000 terms it is not the largest in the world but it surely is very large for such a small country, and I rather think the user interface is by far the most elegant and user-friendly of them all - but then again, I am biased!
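The back-of-the-envelope estimates in the list above all rest on the same relation between terms and concepts, which can be sketched in a few lines of Python. The function names are hypothetical, and the averages fed into them are my own guesses from the text above, not figures published by the databases themselves.

```python
def estimate_concepts(term_count, terms_per_concept):
    """Estimate a concept count from a term count and an assumed average."""
    if terms_per_concept <= 0:
        raise ValueError("terms_per_concept must be positive")
    return term_count / terms_per_concept


def average_terms_per_concept(term_count, concept_count):
    """Work out the average when both figures are known (or guessed)."""
    if concept_count <= 0:
        raise ValueError("concept_count must be positive")
    return term_count / concept_count


# Termium: 3.5 million terms, assuming three terms per concept.
print(round(estimate_concepts(3_500_000, 3)))           # 1166667

# focal.ie: both figures are known, so the average is exact.
print(average_terms_per_concept(280_000, 140_000))      # 2.0

# IATE: 8.7 million terms against my guess of 1 million concepts.
print(average_terms_per_concept(8_700_000, 1_000_000))  # 8.7
```

In other words, my guess of 1 million concepts for IATE implies that a concept is designated by fewer than nine terms on average, which seems plausible given that not every concept has a term in every one of the 24 languages.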