|Type of site||freeware linguistic engineering development environment|
|Created by||Max Silberztein|
|Current status||Modules for: Arabic, Armenian, Bulgarian, Catalan, Chinese, Croatian, English, French, Hebrew, Hungarian, Italian, Polish, Portuguese and Spanish|
NooJ is a development environment used to construct large-coverage, formalized descriptions of natural languages and to apply them to large corpora in real time.
NooJ is under continuous development and is updated daily by Professor Max Silberztein.
Professor Max Silberztein constructed his first package of "Finite State tools for Natural Language Processing", along with the French DELAC-DELACF dictionaries of compound words as part of his Ph.D. research from 1986 to 1989 at the LADL (University of Paris 7-CNRS) under the supervision of Prof. Maurice Gross.
From 1993 to 2002, he developed a software application called INTEX, which was used at the LADL and at various affiliated laboratories to build DELA dictionaries and perform automatic lexical analysis on texts. See http://intex.univ-fcomte.fr for more details on INTEX.
Since 2002, he has been working on NooJ.
NooJ is a freeware, linguistic-engineering development environment for formalizing various types of textual phenomena (orthography, lexical and productive morphology, local, structural and transformational syntax). It integrates a broad spectrum of computational technology – from finite-state automata to augmented/recursive transition networks.
Included tools can construct, test, debug, maintain and accumulate large sets of linguistic resources, and can describe:
- Inflectional and derivational morphology,
- Variations in spelling and terminology,
- Vocabularies (simple words, multi-word units and fixed expressions),
- Semi-fixed phenomena (local grammars),
- Syntax (grammars for phrases and full sentences) and
- Semantics (named-entity recognition and transformational analysis).
NooJ can also be used as a corpus-processing system, making it possible to process sets of (thousands of) text files in many ways, including:
- Indexing morpho-syntactic patterns,
- Cataloging fixed or semi-fixed expressions (e.g. technical expressions),
- Creation of lemmatized concordances, and
- Statistical analysis of the results.
Modules for several languages are currently available for free download: Arabic, Armenian, Bulgarian, Catalan, Chinese, Croatian, English, French, German, Hebrew, Hungarian, Italian, Polish, Portuguese and Spanish. Several other modules are under development.
NooJ's most distinctive characteristics are:
- Ability to process texts in more than 100 file formats, including HTML, PDF, MS Office, all variants of Unicode, ASCII, etc. It can import information from XML documents and export annotations back to them.
- An annotation system that allows any level of grammar to be applied, yet leaves original text unmodified. This allows linguists to formalize various phenomena independently and to apply the corresponding grammars in cascade. For instance, by combining inflection, derivation and syntactic data, NooJ can perform Zellig Harris-type transformations.
NooJ can be used as a linguistic-engineering development platform, a corpus processor, an information-extraction system, a terminology extractor, a machine-translation development tool, as well as to teach Linguistics and Computational Linguistics.
The author followed a component-based software approach to build NooJ. He originally used the Java/J2EE framework, then switched to the C#/.NET framework, which gave NooJ a number of additional capabilities: the automatic management of hundreds of text encodings and formats; native XML compatibility, both for parsing XML documents and for storing objects (XML/SOAP); the ASP.NET library, which allows NooJ to be easily transformed into a web server application; and .NET Services and Remoting technology, which allows NooJ’s functionality to be made available as independent agents that run in parallel.
NooJ is a .NET application. It currently runs under Windows 95-98-ME, Windows NT-2000, Windows XP and Windows Vista, although some of its functionalities (e.g. Unicode and XML support) are only available with Windows 2000, Windows XP and Windows Vista. As with any application, it is strongly advised that you update both your operating system and the .NET Framework by downloading their latest “Service Pack”.
The MONO and DOTGNU projects aim to build a .NET computing environment (i.e. a virtual machine) for LINUX, FreeBSD, Mac OS X and several variants of UNIX. So far, noojapply.exe has been successfully tested on MONO, but NooJ.exe does not yet run on MONO. For more information, see: http://www.mono-project.com and http://www.dotgnu.org
The minimum requirements for a computer to run NooJ on small texts (less than one megabyte) are not very high: 512 MB of RAM and 1 GB available on the hard drive.
If you plan to use NooJ to parse large corpora (hundreds or thousands of text files), or to compile large-coverage dictionaries (tens of thousands of entries or more), the minimum configuration should be higher: PC with Pentium 4 or equivalent, 2 GB RAM or more.
If you are planning to use NooJ to develop large sets of local grammars (hundreds of graphs), a good screen is necessary: at least a 19 inch screen, with a 1600×1024 16-bit resolution, and a minimum of 80 Hz refresh rate.
NooJ's linguistic engine includes several computational devices used both to formalize linguistic phenomena and to parse texts.
- Finite-State Transducers (FST in general)
- A Finite-State Transducer (FST) is a graph that represents a set of text sequences and then associates each recognized sequence with some analysis result. The text sequences are described in the input part of the FST; the corresponding results are described in the output part of the FST. Typically, a syntactic FST represents word sequences, and then produces linguistic information (such as its phrasal structure). A morphological FST represents sequences of letters that spell a word form, and then produces lexical information (such as a part of speech, a set of morphological, syntactic and semantic codes).
- Finite-State Automata (FSA in general)
- In NooJ, Finite-State Automata are a special case of Finite-State Transducers that do not produce any result (i.e. they have no output). NooJ's users typically use FSA to locate morpho-syntactic patterns in corpora, and extract the matching sequences to build indices, concordances, etc.
- Recursive Transition Networks (RTNs in general)
- Recursive Transition Networks are grammars that contain more than one graph; graphs can be FSTs or FSAs, and can also include references to other, embedded graphs; these embedded graphs may in turn contain references to the same graph or to other graphs. Generally, RTNs are used in NooJ to build libraries of graphs from the bottom up: simple graphs are designed first; they are then reused in more general graphs, which are in turn reused, and so on.
- Enhanced Recursive Transition Networks (ERTNs in general)
- Enhanced Recursive Transition Networks are RTNs that contain variables; these variables typically store parts of the matching sequences, and then are used to perform some operation with them (e.g. put their content in the plural, etc.), and then produce the resulting output. Because variables can be duplicated, inserted and/or displaced in the output, ERTNs give NooJ the power of performing linguistic transformations on texts. Examples of transformations include negation, passivization, nominalization, etc.
- Regular Expressions (RegEx in general)
- Regular Expressions also constitute a quick way to enter simple queries without having to construct grammars. When the sequence to be located consists of a few words, it is much quicker to enter these words directly into a regular expression. However, as the query becomes more complex, as is usually the case in Linguistics, one should build a grammar.
- Context-Free Grammars (CFGs in general)
- In NooJ, CFGs constitute an alternative means to enter morphological or syntactic grammars. For instance, NooJ includes an inflectional/derivational module that is associated with its dictionaries, so that it can automatically link dictionary entries with their corresponding forms that occur in corpora (this functionality allows NooJ to dispense with INTEX's full-form dictionaries such as DELAF and DELACF). NooJ dictionaries generally associate each lexical entry with an inflectional and/or derivational paradigm. For instance, all the verbs that conjugate like "aimer" are linked to the paradigm "+FLX=AIMER"; all the verbs that accept the "-able" suffix are linked to the paradigm "+DRV=ABLE", etc. Paradigms such as "AIMER" or "ABLE" are described either graphically in RTNs or by CFGs in text files.
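As a rough illustration of the finite-state devices listed above, the following Python sketch implements a tiny deterministic transducer. It is not NooJ's engine or file format: the class `TinyFST`, the state numbering and the output notation `cat,N+plural` are assumptions made only for this example. It shows a morphological FST that reads a word form letter by letter and emits lexical information, and how an FSA-style query reduces to asking whether the input is accepted at all.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

# (current state, input symbol) -> (next state, output fragment)
Transitions = Dict[Tuple[int, str], Tuple[int, str]]


class TinyFST:
    """Deterministic finite-state transducer (illustrative, not NooJ's engine)."""

    def __init__(self, transitions: Transitions, start: int, finals: Set[int]) -> None:
        self.transitions = transitions
        self.start = start
        self.finals = finals

    def transduce(self, symbols: Iterable[str]) -> Optional[str]:
        """Return the concatenated output, or None if the input is not recognized."""
        state = self.start
        output = []
        for symbol in symbols:
            step = self.transitions.get((state, symbol))
            if step is None:  # no transition for this symbol: reject
                return None
            state, fragment = step
            output.append(fragment)
        return "".join(output) if state in self.finals else None

    def accepts(self, symbols: Iterable[str]) -> bool:
        """FSA view: only recognition matters, the output is ignored."""
        return self.transduce(symbols) is not None


# Hypothetical morphological FST recognizing the forms "cat" and "cats".
CAT_FST = TinyFST(
    transitions={
        (0, "c"): (1, ""),
        (1, "a"): (2, ""),
        (2, "t"): (3, "cat,N"),
        (3, "s"): (4, "+plural"),
    },
    start=0,
    finals={3, 4},
)

print(CAT_FST.transduce("cats"))  # cat,N+plural
print(CAT_FST.transduce("cat"))   # cat,N
print(CAT_FST.accepts("ca"))      # False
```

An RTN would extend this idea by letting a transition refer to another graph instead of a single symbol, and an ERTN by storing matched fragments in variables that can be reordered in the output; neither is modelled in this sketch.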
With NooJ, linguists build, test and maintain two basic types of linguistic resources:
- Dictionaries ( .dic files)
- usually associate words or expressions with a set of information, such as:
- a category (e.g. “Verb”),
- one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to nominalize them),
- one or more syntactic properties (e.g. “+transitive” or +N0VN1PREPN2),
- one or more semantic properties (e.g. distributional classes such as “+Human”, domain classes such as “+Politics”).
- Lexical properties can be binary, such as “+plural”, or can be expressed as an attribute-value pair, such as “+number=plural”.
- Values can belong to the meta-language, such as in “+number=plural”, to the input language, such as in “+synonym=pencil”, or to another language, such as in “+FR=crayon”.
- NooJ’s dictionaries constitute a converged and enhanced version of the DELA-type dictionaries that were used in INTEX: a NooJ dictionary can include
- simple words (like a DELAS),
- multi-word units (like a DELAC) and
- can link lexical entries to a canonical form (like a DELAV).
- Contrary to INTEX, NooJ does not need full inflected form dictionaries (no more DELAF or DELACF).
- NooJ’s ability to type pieces of information (e.g. “masculine” is a value of the “gender” property) allows it to process lexicon-grammar tables as well. Indeed, NooJ can display any dictionary in a “list” form or in a “table” form.
- Grammars
- are used to represent a wide range of linguistic phenomena, from the orthographical and the morphological levels up to the syntagmatic and transformational syntactic levels.
- NooJ has three types of grammars:
- Inflectional and derivational grammars ( .nof files) are used to represent the inflection (e.g. conjugation) or the derivation (e.g. nominalization) properties of lexical entries. These descriptions can be entered either graphically or in the form of rules.
- Lexical, orthographical, morphological or terminological grammars ( .nom files) are used to represent sets of word forms, and associate them with lexical information, e.g. to standardize the spelling of word or term variants, to recognize and tag neologisms, to link synonymous expressions together;
- Syntactic or semantic grammars ( .nog files) are used to recognize and annotate expressions in texts, e.g. to tag noun phrases, certain syntactic constructs or idiomatic expressions, to extract certain expressions of interest (names of companies, expressions of dates, addresses, etc.), or to disambiguate words by filtering out some lexical or syntactic annotations in the text.
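The lexical properties described for dictionaries above (binary properties such as “+transitive”, attribute-value pairs such as “+FLX=AIMER”) lend themselves to mechanical processing. The Python sketch below parses one entry into a lemma, a category and a property map. The one-line layout `lemma,Category+property+attribute=value` and the function name `parse_entry` are assumptions made for illustration; the authoritative .dic syntax is the one defined in the NooJ manual.

```python
from typing import Dict, Tuple, Union

Properties = Dict[str, Union[bool, str]]


def parse_entry(line: str) -> Tuple[str, str, Properties]:
    """Split a hypothetical 'lemma,Category+prop+attr=value' entry into its parts."""
    lemma, _, info = line.strip().partition(",")
    category, *raw_properties = info.split("+")
    properties: Properties = {}
    for raw in raw_properties:
        name, separator, value = raw.partition("=")
        # "+transitive" is binary (True); "+FLX=AIMER" is an attribute-value pair.
        properties[name] = value if separator else True
    return lemma, category, properties


lemma, category, properties = parse_entry("aimer,V+FLX=AIMER+transitive")
print(lemma, category, properties)
# aimer V {'FLX': 'AIMER', 'transitive': True}
```

Keeping binary properties and attribute-value pairs in one typed map mirrors the idea that a value such as “masculine” belongs to a named property, which is what makes a table view of a dictionary possible.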
Using NooJ functionalities
In its Standard edition, NooJ’s functions are available via a command-line program, noojapply.exe, which is stored in NooJ’s _App directory alongside NooJ.exe.
noojapply.exe can be called either directly from a “SHELL” script, or from more sophisticated programs written in Perl, C++, Java, etc.
noojapply.exe allows users to apply dictionaries and grammars to texts and corpora automatically.
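Calling noojapply.exe from another program can be sketched as follows in Python. The command-line arguments that noojapply.exe expects are not described here, so this sketch does not assume any: the caller supplies them as an opaque list, and the helper names `build_command` and `run_noojapply` are hypothetical. The optional `mono` prefix reflects the fact that noojapply.exe has been tested on MONO.

```python
import subprocess
from typing import List, Sequence, Tuple


def build_command(exe_path: str, args: Sequence[str], use_mono: bool = False) -> List[str]:
    """Assemble the command line; on non-Windows systems the runtime is 'mono'."""
    command = ["mono"] if use_mono else []
    command.append(str(exe_path))
    command.extend(str(arg) for arg in args)
    return command


def run_noojapply(exe_path: str, args: Sequence[str], use_mono: bool = False) -> Tuple[int, str]:
    """Run the tool and return (exit code, captured standard output)."""
    result = subprocess.run(
        build_command(exe_path, args, use_mono),
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode output as text
        check=False,          # let the caller inspect the exit code
    )
    return result.returncode, result.stdout
```

The argument list passed to `run_noojapply` should be taken from the NooJ manual; passing it as a list rather than a single string avoids shell quoting problems with file names that contain spaces.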
If you are planning to use NooJ’s functionalities in a professional environment (e.g. build a linguistic research engine), note that they are also available via:
- a .NET dynamic library, noojengine.dll, consisting of a set of public object classes and methods. These classes and methods can be used by any .NET application, in any .NET programming language. noojengine.dll allows users to build sophisticated applications such as web services, and can be used to build much more efficient NLP applications than noojapply.exe.
- a noojservice.exe / noojclient.exe client-server application, based on a Windows service, that provides the functionality of NooJ’s morphological and syntactic parsers in a multi-agent system, which can be used to build a massively parallel NLP application.
NooJ can be freely downloaded.
Most laboratories and academic centers use NooJ as a research or educational tool: some users are interested in its corpus-processing functionalities (analyzing literary texts, searching and extracting information from newspapers or technical corpora, etc.); others use NooJ to formalize certain linguistic phenomena (e.g. to describe a language’s morphology); others use it for computational applications (automatic text analysis), etc.
Among NooJ users, some are actively helping the NooJ project by giving away some of their linguistic resources, projects or demos, labs, tutorials or documentation. These users, who constitute “NooJ’s community”, should be considered NooJ’s “co-authors”. The Community Edition of the NooJ application (which is also free) is an extended version of NooJ that gives full access to its internal functionalities as well as privileged access to the sources of its linguistic resources.
NooJ users meet once a year at the NooJ conference. NooJ tutorials and workshops are regularly organized during the year.
- NooJ 2011, Dubrovnik, Croatia
- NooJ 2010, Komotini, Greece
- NooJ 2009, Tozeur, Tunisia
- NooJ 2008, Budapest, Hungary
- NooJ 2007, Barcelona, Spain
- NooJ 2006, Belgrade, Serbia
- NooJ 2005, Besançon, France
- Max Silberztein: NooJ manual
- Abdelmajid Ben Hamadou, Slim Mesfar, Max Silberztein (Eds): Finite State Language Engineering: NooJ 2009 International Conference and Workshop (Tozeur), Centre de Publication Universitaire, 2010.
- Xavier Blanco, Max Silberztein (Eds): Proceedings of the 2007 International NooJ Conference (Barcelona), Cambridge Scholars Publishing (18 selected papers, 296 pages), 2008.
- Svetla Koeva, Denis Maurel, Max Silberztein (Eds): Formaliser les langues avec l'ordinateur : de INTEX à NooJ, Cahiers de la MSH Ledoux, Presses Universitaires de Franche-Comté (23 articles, 438 pages), 2007.