RDRPOSTagger
A Rule-based Part-of-Speech and Morphological Tagging Toolkit
http://rdrpostagger.sourceforge.net
Copyright © 2013-2016 by Dat Quoc Nguyen, Dai Quoc Nguyen, Dang Duc Pham, and Son Bao Pham
2. Train RDRPOSTagger on a gold standard training corpus
3. Use pre-trained POS and morphological tagging models
4. Combine RDRPOSTagger with an external initial tagger
5. Speed up tagging process with an implementation in Java
News:
· 17/10/2016: release version 1.2.3 with improved tagging speed in Python
· 18/05/2016: release version 1.2.2 including the pre-trained Universal POS tagging models for 40 languages
§ The pre-trained Universal POS tagging models are learned using the training data from the Universal Dependencies (UD) project (version 1.3). The tagging accuracies on the UD v1.3 test sets can be found here.
· 21/12/2015: release version 1.2.1 with improved tagging speed in Python
· 18/11/2015: release version 1.2
§ Yields improved tagging accuracy, especially on morphologically rich languages. See experimental results for 13 languages in our AI Communications article.
§ Includes the pre-trained Part-of-Speech (POS) and morphological tagging models for Bulgarian, Czech, Dutch, English, French, German, Hindi, Italian, Portuguese, Spanish, Swedish, Thai and Vietnamese.
· 14/05/2014: release version 1.1.3
RDRPOSTagger is a robust, easy-to-use and
language-independent toolkit for POS and morphological tagging. It employs an
error-driven approach to automatically construct tagging rules in the form of a
binary tree. The main properties of RDRPOSTagger are
as follows:
· RDRPOSTagger is fast in both learning and tagging. For example, on the English Penn WSJ sections 22-24, RDRPOSTagger achieved tagging speeds of 8K and 90K words/second for the single-threaded implementations in Python and Java, respectively, on a computer with a Core 2 Duo 2.4GHz CPU and 3GB of memory. See more results in our AI Communications article.
· RDRPOSTagger achieves accuracy that is very competitive with state-of-the-art results.
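As a rough illustration of what "tagging rules in the form of a binary tree" can look like, the sketch below evaluates a toy rule tree in Python. It assumes the usual single-classification ripple-down-rules semantics: when a node's condition holds, its conclusion is kept and the search continues in the "except" child, otherwise it continues in the "if-not" child. The Node class, the classify function and the three rules are hypothetical and made up for illustration; they are not the toolkit's own classes or learned rules.

```python
class Node(object):
    """One rule in a toy ripple-down-rules tree (hypothetical structure)."""

    def __init__(self, condition, conclusion, except_child=None, if_not_child=None):
        self.condition = condition        # function: case -> bool
        self.conclusion = conclusion      # tag to assign, or None to keep the initial tag
        self.except_child = except_child  # visited when the condition holds
        self.if_not_child = if_not_child  # visited when the condition fails


def classify(root, case):
    """Return the conclusion of the last rule that fired on the path through the tree."""
    conclusion = None
    node = root
    while node is not None:
        if node.condition(case):
            conclusion = node.conclusion
            node = node.except_child
        else:
            node = node.if_not_child
    # Fall back to the initial tag when no rule proposed a change.
    return conclusion if conclusion is not None else case["tag"]


# Made-up rules: an exception rule refines the rule above it.
r3 = Node(lambda c: c["next_tag"] == "NN", "JJ")
r2 = Node(lambda c: c["tag"] == "NN" and c["prev_tag"] == "TO", "VB")
r1 = Node(lambda c: c["tag"] == "VB" and c["prev_tag"] == "DT", "NN",
          except_child=r3, if_not_child=r2)
root = Node(lambda c: True, None, except_child=r1)  # default rule: keep the initial tag

case = {"word": "walk", "tag": "VB", "prev_tag": "DT", "next_tag": "IN"}
result = classify(root, case)
```

Here the default rule fires first, then r1 fires and proposes NN, and its exception r3 does not apply, so the word ends up tagged NN.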
The general architecture and experimental results of RDRPOSTagger can be found in the following papers:
Dat Quoc Nguyen, Dai Quoc Nguyen, Dang Duc Pham and Son Bao Pham. RDRPOSTagger:
A Ripple Down Rules-based Part-Of-Speech Tagger. In Proceedings of the Demonstrations at the 14th Conference of the
European Chapter of the Association for Computational Linguistics (EACL),
pp. 17-20, 2014. [.PDF] [.bib]
Dat Quoc Nguyen, Dai Quoc Nguyen, Dang Duc Pham and Son Bao Pham. A Robust
Transformation-Based Learning Approach Using Ripple Down
Rules for Part-Of-Speech Tagging. AI
Communications (AICom), vol. 29, no. 3, pp. 409-422, 2016. [.PDF] [.bib]
Please cite either the EACL or the AICom
paper whenever RDRPOSTagger is used to produce
published results or incorporated into other software.
· RDRPOSTagger is available to download (10MB .zip file) at:
https://sourceforge.net/projects/rdrpostagger/files/RDRPOSTagger_v1.2.3.zip
· RDRPOSTagger is now also available to download at: https://github.com/datquocnguyen/RDRPOSTagger
We would highly appreciate your bug reports, comments and suggestions about RDRPOSTagger.
As a free open-source implementation, RDRPOSTagger is
distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
ANY KIND, either express or implied.
2. Train RDRPOSTagger on a gold standard training corpus
Notices:
· In terms of implementation, the training process has been implemented in Python, while the tagging process has been implemented in both Python and Java. The Python implementation has a multi-threaded mode with faster tagging speed. Additionally, see Section 5 for details of using the Java implementation.
· RDRPOSTagger requires an initial tagger. The internal initial tagger developed within RDRPOSTagger uses a lexicon to assign a tag to each word. See Section 4 for combining RDRPOSTagger with an external initial tagger.
· RDRPOSTagger assumes that each line in the gold standard training corpus is a sequence of WORD/TAG pairs separated by white space characters. See sample training and test sets in the data directory.
The commands below assume that Python 2.x can already be run from the command line or terminal (e.g. by adding Python to the environment variable 'path' in Windows OS).
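To make the corpus format and the idea of a lexicon-based initial tagger concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: splitting each token at its last "/" (so that words containing a slash stay intact), choosing the most frequent tag per word, and the default tag for unknown words are all assumptions, and the function names are hypothetical rather than part of the toolkit.

```python
from collections import Counter, defaultdict


def parse_tagged_line(line):
    """Split one corpus line into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Assumption: the tag follows the LAST slash, so "1/2/CD" is word "1/2", tag "CD".
        word, sep, tag = token.rpartition("/")
        if not sep or not word or not tag:
            raise ValueError("not a WORD/TAG pair: %r" % token)
        pairs.append((word, tag))
    return pairs


def build_lexicon(lines):
    """Map each word to its most frequent tag (an assumed, simplified lexicon)."""
    counts = defaultdict(Counter)
    for line in lines:
        for word, tag in parse_tagged_line(line):
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def initial_tags(words, lexicon, default_tag="NN"):
    """Assign an initial tag to each word; default_tag is a hypothetical fallback."""
    return [lexicon.get(word, default_tag) for word in words]


corpus = [
    "The/DT run/NN ended/VBD ./.",
    "They/PRP run/VBP fast/RB ./.",
    "The/DT run/NN 1/2/CD",
]
lexicon = build_lexicon(corpus)
tags = initial_tags(["The", "run", "unseen"], lexicon)
```

In this toy corpus "run" occurs twice as NN and once as VBP, so the sketch assigns NN to it; the word missing from the lexicon receives the fallback tag.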
· We train RDRPOSTagger on the gold standard training corpus by executing:
pSCRDRtagger$ python RDRPOSTagger.py train PATH-TO-GOLD-STANDARD-TRAINING-CORPUS
Example 1: pSCRDRtagger$ python RDRPOSTagger.py train ../data/goldTrain
Note that the actual command starts from python. Here pSCRDRtagger$ simply denotes that the command is run from the pSCRDRtagger source package.
A .DICT lexicon file and an .RDR trained model file, for example goldTrain.DICT and goldTrain.RDR, will be generated in the same directory as the gold standard training corpus.
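When training is driven from a script, it can help to know in advance where the generated files will appear. The following sketch builds the documented training command and derives the expected output paths. It assumes, based on the goldTrain example above, that the output names are the corpus path with ".DICT" and ".RDR" appended; the helper names are hypothetical.

```python
import os


def train_command(gold_corpus):
    """Argument list mirroring: python RDRPOSTagger.py train PATH."""
    return ["python", "RDRPOSTagger.py", "train", gold_corpus]


def expected_training_outputs(gold_corpus):
    """Assumed naming: lexicon and model are written next to the training corpus."""
    return gold_corpus + ".DICT", gold_corpus + ".RDR"


def training_finished(gold_corpus):
    """True once both expected output files exist on disk."""
    return all(os.path.isfile(path) for path in expected_training_outputs(gold_corpus))


cmd = train_command("../data/goldTrain")
lexicon_path, model_path = expected_training_outputs("../data/goldTrain")
# The command could then be run from the pSCRDRtagger directory, e.g. with
# subprocess.call(cmd, cwd="pSCRDRtagger"), where the cwd value is an assumption.
```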
· To employ the trained model for POS tagging on a raw unlabeled text corpus, we perform:
pSCRDRtagger$ python RDRPOSTagger.py tag PATH-TO-TRAINED-MODEL PATH-TO-LEXICON PATH-TO-RAW-TEXT-CORPUS
Example 2: pSCRDRtagger$ python RDRPOSTagger.py tag ../data/goldTrain.RDR ../data/goldTrain.DICT ../data/rawTest
A .TAGGED file, in this case rawTest.TAGGED, will be generated in the same directory as the raw text corpus.
To speed up tagging in Python, set a higher value for the "NUMBER_OF_PROCESSES" variable in the "Config.py" module in the "Utility" package. The value should not be larger than the number of CPU cores in your computer.
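One way to pick a safe value for NUMBER_OF_PROCESSES is to cap the desired number at the core count reported by Python. This is a small sketch, not part of the toolkit; choose_number_of_processes is a hypothetical helper, and the resulting value would still be entered into Config.py by hand.

```python
import multiprocessing


def choose_number_of_processes(requested):
    """Cap the requested process count at the number of CPU cores (minimum 1)."""
    cores = multiprocessing.cpu_count()
    return max(1, min(requested, cores))


# Example: ask for 8 processes, but never more than the machine has cores.
print(choose_number_of_processes(8))
```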
· To evaluate tagging accuracy, we can employ the Eval.py module in the Utility package:
Utility$ python Eval.py PATH-TO-TAGGED-TEST-CORPUS PATH-TO-GOLD-TEST-CORPUS
Example 3: Utility$ python Eval.py ../data/rawTest.TAGGED ../data/goldTest
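For readers who want to see what such an evaluation computes, the sketch below measures token-level tagging accuracy between a tagged corpus and a gold corpus. It is not the code of Eval.py; it assumes that both corpora use the WORD/TAG format with one sentence per line and identical tokenization, and the function names are hypothetical.

```python
def tags_of(line):
    """Return the tags of one WORD/TAG line, taking the text after the last slash."""
    return [token.rpartition("/")[2] for token in line.split()]


def tagging_accuracy(tagged_lines, gold_lines):
    """Percentage of tokens whose predicted tag equals the gold tag."""
    correct = 0
    total = 0
    for tagged, gold in zip(tagged_lines, gold_lines):
        predicted_tags = tags_of(tagged)
        gold_tags = tags_of(gold)
        if len(predicted_tags) != len(gold_tags):
            raise ValueError("token count mismatch between tagged and gold sentence")
        total += len(gold_tags)
        correct += sum(1 for p, g in zip(predicted_tags, gold_tags) if p == g)
    return 100.0 * correct / total if total else 0.0


gold = ["The/DT dog/NN barks/VBZ ./."]
tagged = ["The/DT dog/NN barks/NNS ./."]
accuracy = tagging_accuracy(tagged, gold)  # 3 of 4 tags match
```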
· When retraining tagging models for English with Penn Treebank POS tags or for Vietnamese with VietTreebank/VLSP POS tags, use RDRPOSTagger4En.py or RDRPOSTagger4Vn.py, respectively.
3. Use pre-trained POS and morphological tagging models
· Pre-trained Universal POS tagging models for 40 languages are listed here.
· Pre-trained POS tagging models with fine-grained POS tags:
| Language | Corpus | Model | Lexicon |
| English | Penn WSJ sections 00-18 [M93] | ../Models/POS/English.RDR | ../Models/POS/English.DICT |
| French | French Treebank [A03] | ../Models/POS/French.RDR | ../Models/POS/French.DICT |
| German | TIGER Corpus [B04] | ../Models/POS/German.RDR | ../Models/POS/German.DICT |
| Hindi | Hindi Treebank [P09] | ../Models/POS/Hindi.RDR | ../Models/POS/Hindi.DICT |
| Italian | ISDT Treebank [B13] | ../Models/POS/Italian.RDR | ../Models/POS/Italian.DICT |
| Thai | ORCHID Corpus [S97] | ../Models/POS/Thai.RDR | ../Models/POS/Thai.DICT |
| Vietnamese | VietTreebank [N09] | ../Models/POS/Vietnamese.RDR | ../Models/POS/Vietnamese.DICT |
· Pre-trained models for combined POS and morphological (POS+MORPH) tagging:
| Language | Corpus | Model | Lexicon |
| Bulgarian | BulTreeBank-Morph [S04] | ../Models/MORPH/Bulgarian.RDR | ../Models/MORPH/Bulgarian.DICT |
| Czech | Prague Dependency Treebank 2.5 [B12] | ../Models/MORPH/Czech.RDR | ../Models/MORPH/Czech.DICT |
| Dutch | Lassy Small Corpus [N13] | ../Models/MORPH/Dutch.RDR | ../Models/MORPH/Dutch.DICT |
| French | French Treebank [A03] | ../Models/MORPH/French.RDR | ../Models/MORPH/French.DICT |
| German | TIGER Corpus [B04] | ../Models/MORPH/German.RDR | ../Models/MORPH/German.DICT |
| Portuguese | Tycho Brahe Corpus [G10] | ../Models/MORPH/Portuguese.RDR | ../Models/MORPH/Portuguese.DICT |
| Spanish | IULA LSP Treebank [M12] | ../Models/MORPH/Spanish.RDR | ../Models/MORPH/Spanish.DICT |
| Swedish | Stockholm-Umeå Corpus 3.0 [S12] | ../Models/MORPH/Swedish.RDR | ../Models/MORPH/Swedish.DICT |
· To use a pre-trained model, we perform:
pSCRDRtagger$ python RDRPOSTagger.py tag PATH-TO-PRETRAINED-MODEL PATH-TO-LEXICON PATH-TO-RAW-TEXT-CORPUS
Example 4: pSCRDRtagger$ python RDRPOSTagger.py tag ../Models/POS/German.RDR ../Models/POS/German.DICT ../data/GermanRawTest
Example 5: pSCRDRtagger$ python RDRPOSTagger.py tag ../Models/MORPH/German.RDR ../Models/MORPH/German.DICT ../data/GermanRawTest
NOTE that each line in the input raw text corpus represents a tokenized/word-segmented sentence. For programming with RDRPOSTagger, please follow code lines 92-98 in the RDRPOSTagger.py module of the pSCRDRtagger package. Here is an example:
# Assumes the RDRPOSTagger class and readDictionary are in scope, as they are inside RDRPOSTagger.py
r = RDRPOSTagger()
# Load the POS tagging model for French
r.constructSCRDRtreeFromRDRfile("../Models/POS/French.RDR")
# Load the lexicon for French
DICT = readDictionary("../Models/POS/French.DICT")
# Tag a tokenized/word-segmented sentence
r.tagRawSentence(DICT, "Cette annonce a fait l' effet d' une véritable bombe .")
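Since the pre-trained model and lexicon files follow a regular naming pattern in the tables above, a script can derive both paths from a language name. The sketch below does this; the function name, the kind argument and the models_dir default are hypothetical, and the language sets simply restate the file names listed in the two tables.

```python
POS_LANGUAGES = {"English", "French", "German", "Hindi", "Italian", "Thai", "Vietnamese"}
MORPH_LANGUAGES = {"Bulgarian", "Czech", "Dutch", "French", "German",
                   "Portuguese", "Spanish", "Swedish"}


def pretrained_paths(language, kind="POS", models_dir="../Models"):
    """Return (model_path, lexicon_path) for a listed pre-trained model."""
    if kind == "POS":
        available = POS_LANGUAGES
    elif kind == "MORPH":
        available = MORPH_LANGUAGES
    else:
        raise ValueError("kind must be 'POS' or 'MORPH'")
    if language not in available:
        raise ValueError("no %s model listed for %s" % (kind, language))
    base = "%s/%s/%s" % (models_dir, kind, language)
    return base + ".RDR", base + ".DICT"


model_path, lexicon_path = pretrained_paths("German", "MORPH")
```

The two returned paths can be passed as the model and lexicon arguments of the tag command shown in Examples 4 and 5.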
· For the pre-trained English and Vietnamese POS tagging models with fine-grained POS tags, use RDRPOSTagger4En.py and RDRPOSTagger4Vn.py, respectively, instead of RDRPOSTagger.py.
4. Combine RDRPOSTagger with an external initial tagger
· To train RDRPOSTagger using the output of an external initial POS or POS+MORPH tagger:
pSCRDRtagger$ python ExtRDRPOSTagger.py train PATH-TO-GOLD-STANDARD-TRAINING-CORPUS PATH-TO-TRAINING-CORPUS-INITIALIZED-BY-EXTERNAL-TAGGER
Example 6: pSCRDRtagger$ python ExtRDRPOSTagger.py train ../data/goldTrain ../data/initTrain
Here the initialized training corpus initTrain is generated by running the external initial tagger (for POS or POS+MORPH tagging) on the raw text extracted from the gold standard training corpus goldTrain.
An .RDR trained model file, for example initTrain.RDR, will be generated in the same directory as the initialized training corpus.
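The raw text that is fed to the external tagger can be obtained by stripping the tags from the gold standard corpus. The sketch below shows one possible way to do this in Python; it assumes the tag follows the last "/" of each token, and the function names and the output file name are hypothetical.

```python
import io


def raw_sentence(tagged_line):
    """Drop the /TAG suffix from every token of one WORD/TAG line."""
    words = []
    for token in tagged_line.split():
        word, sep, _tag = token.rpartition("/")
        # Tokens without any slash are kept unchanged.
        words.append(word if sep else token)
    return " ".join(words)


def write_raw_corpus(gold_path, raw_path):
    """Write the raw text of a gold corpus, one sentence per line."""
    with io.open(gold_path, encoding="utf-8") as gold, \
            io.open(raw_path, "w", encoding="utf-8") as raw:
        for line in gold:
            raw.write(raw_sentence(line) + u"\n")


example = raw_sentence("The/DT dog/NN ate/VBD 1/2/CD ./.")
```

A call such as write_raw_corpus("../data/goldTrain", "../data/rawTrain") would then produce the input for the external tagger, where rawTrain is an assumed file name.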
· To use the trained model for POS or POS+MORPH tagging on a test corpus whose words have already been tagged by the external initial tagger:
pSCRDRtagger$ python ExtRDRPOSTagger.py tag PATH-TO-TRAINED-MODEL PATH-TO-TEST-CORPUS-INITIALIZED-BY-EXTERNAL-TAGGER
Example 7: pSCRDRtagger$ python ExtRDRPOSTagger.py tag ../data/initTrain.RDR ../data/initTest
5. Speed up tagging process with an implementation in Java
· To use a pre-trained model for tagging a raw text corpus:
jSCRDRTagger$ java RDRPOSTagger PATH-TO-PRETRAINED-MODEL PATH-TO-LEXICON PATH-TO-RAW-TEXT-CORPUS
Example 8: jSCRDRTagger$ java RDRPOSTagger ../Models/POS/German.RDR ../Models/POS/German.DICT ../data/GermanRawTest
Example 9: jSCRDRTagger$ java RDRPOSTagger ../Models/MORPH/German.RDR ../Models/MORPH/German.DICT ../data/GermanRawTest
The Java implementation accepts two additional parameters, en and vn, specialized for POS tagging in English and Vietnamese with fine-grained POS tags:
Example 10: jSCRDRTagger$ java RDRPOSTagger en ../Models/POS/English.RDR ../Models/POS/English.DICT ../data/en/rawTest
Example 11: jSCRDRTagger$ java RDRPOSTagger vn ../Models/POS/Vietnamese.RDR ../Models/POS/Vietnamese.DICT ../data/vn/rawTest
· In case of using an external initial POS or POS+MORPH tagger:
jSCRDRTagger$ java RDRPOSTagger ex PATH-TO-TRAINED-MODEL PATH-TO-TEST-CORPUS-INITIALIZED-BY-EXTERNAL-TAGGER
Example 12: jSCRDRTagger$ java RDRPOSTagger ex ../data/initTrain.RDR ../data/initTest
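The Java invocations above differ only in an optional mode word and in whether a lexicon is passed. A wrapper script might assemble them as in the following Python sketch. The function and its argument names are hypothetical; the argument order simply follows Examples 8 to 12.

```python
def java_tag_command(model, corpus, lexicon=None, mode=None):
    """Build the argument list for the Java tagger as shown in Examples 8-12."""
    if mode not in (None, "en", "vn", "ex"):
        raise ValueError("mode must be None, 'en', 'vn' or 'ex'")
    command = ["java", "RDRPOSTagger"]
    if mode is not None:
        command.append(mode)
    command.append(model)
    if mode == "ex":
        # The examples with an external initial tagger pass no lexicon.
        if lexicon is not None:
            raise ValueError("mode 'ex' does not take a lexicon")
    else:
        if lexicon is None:
            raise ValueError("a lexicon path is required unless mode is 'ex'")
        command.append(lexicon)
    command.append(corpus)
    return command


english = java_tag_command("../Models/POS/English.RDR", "../data/en/rawTest",
                           lexicon="../Models/POS/English.DICT", mode="en")
external = java_tag_command("../data/initTrain.RDR", "../data/initTest", mode="ex")
```

Returning a list rather than a single string keeps paths with spaces intact when the command is later handed to a process launcher.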
· Recompile if there is any problem: jSCRDRTagger$ javac -encoding UTF-8 RDRPOSTagger.java
[M93] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993. http://www.cis.upenn.edu/~treebank/
[A03] A. Abeillé, L. Clément, and F. Toussenel. Building a Treebank for French. In Treebanks, volume 20 of Text, Speech and Language Technology, pages 165–187. 2003. http://www.llf.cnrs.fr/en/Gens/Abeille/French-Treebank-fr.php
[B04] S. Brants, S. Dipper, P. Eisenberg, S. Hansen-Schirra, E. König, W. Lezius, C. Rohrer, G. Smith, and H. Uszkoreit. TIGER: Linguistic Interpretation of a German Corpus. Research on Language and Computation, 2(4):597–620, 2004. http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.en.html
[P09] M. Palmer, R. Bhatt, B. Narasimhan, O. Rambow, D. M. Sharma, and F. Xia. Hindi Syntax: Annotating Dependency, Lexical Predicate-Argument Structure, and Phrase Structure. In Proceedings of 7th International Conference on Natural Language Processing, pages 261–268, 2009. http://verbs.colorado.edu/hindiurdu/index.html
[B13] C. Bosco, S. Montemagni, and M. Simi. Converting Italian Treebanks: Towards an Italian Stanford Dependency Treebank. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 61–69, 2013. http://medialab.di.unipi.it/wiki/ISDT
[S97] V. Sornlertlamvanich, T. Charoenporn, and H. Isahara. ORCHID: Thai Part-Of-Speech Tagged Corpus, 1997. http://culturelab.in.th/files/orchid.html
[N09] P. T. Nguyen, X. L. Vu, T. M. H. Nguyen, V. H. Nguyen, and H. P. Le. Building a Large Syntactically-Annotated Corpus of Vietnamese. In Proceedings of the Third Linguistic Annotation Workshop, pages 182–185, 2009. http://vlsp.vietlp.org:8080/
[S04] K. Simov, P. Osenova, A. Simov, and M. Kouylekov. Design and Implementation of the Bulgarian HPSG-based Treebank. Research on Language and Computation, 2:495–522, 2004. http://www.bultreebank.org
[B12] E. Bejček, J. Panevová, J. Popelka, P. Straňák, M. Ševčíková, J. Štěpánek, and Z. Žabokrtský. Prague Dependency Treebank 2.5 - a Revisited Version of PDT 2.0. In Proceedings of 24th International Conference on Computational Linguistics, pages 231–246, 2012. https://ufal.mff.cuni.cz/pdt2.5/
[N13] G. Noord, G. Bouma, F. Eynde, D. Kok, J. Linde, I. Schuurman, E. Sang, and V. Vandeghinste. Large Scale Syntactic Annotation of Written Dutch: Lassy. In Essential Speech and Language Technology for Dutch, Theory and Applications of Natural Language Processing, pages 147–164, 2013. http://www.let.rug.nl/~vannoord/Lassy/
[G10] C. Galves and P. Faria. Tycho Brahe Parsed Corpus of Historical Portuguese, 2010. http://www.tycho.iel.unicamp.br/~tycho/corpus/en/index.html
[M12] M. Marimon, B. Fisas, N. Bel, M. Villegas, J. Vivaldi, S. Torner, M. Lorente, and S. Vázquez. The IULA Treebank. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, pages 1920–1926, 2012. https://www.iula.upf.edu/recurs01_tbk_uk.htm
[S12] SUC-3.0. The Stockholm-Umeå Corpus (SUC) 3.0, 2012. http://spraakbanken.gu.se/eng/resource/suc3
Last updated: October 17, 2016