CoNLL-X evaluation script:
[perl] eval.pl [OPTIONS] -g <gold standard> -s <system output>
This script evaluates a system output with respect to a gold standard.
Both files should be in UTF-8 encoded CoNLL-X tabular format.
Punctuation tokens (those where all characters have the Unicode
category property "Punctuation") are ignored for scoring (unless the
-p flag is used).
The output breaks down the errors according to their type and context.
Optional parameters:
-o FILE : output: print output to FILE (default is standard output)
-q : quiet: only print overall performance, without the details
-b : evalb: produce output in a format similar to evalb
(http://nlp.cs.nyu.edu/evalb/); use together with -q
-p : punctuation: also score on punctuation (default is not to score on it)
-v : version: show the version number
-h : help: print this help text and exit
Download latest release of eval.pl (version 1.8)
This is the official CoNLL-X shared task evaluation script. It computes the official scoring metric "labeled attachment score" and also provides details useful for error analysis. It was first released on 9 January 2006.
An improved version was released on 22 January 2006. The first release required Perl v5.8. However, that version of Perl still contains bugs with respect to Unicode handling. The new release of the evaluation script therefore requires at least Perl v5.8.1. If you have at least Perl v5.8.1, the new release of the evaluation script gives identical scores to the first release. However, it provides more output.
A version with additional output for error analysis was released on 8 February 2006.
A version with a new option for significance testing (-b), with "label accuracy score" and one more error analysis table (thanks to Prokopis Prokopidis) was released on 12 March 2006. Use the -b option as follows:
perl eval.pl -b -q -g GOLD_FILE -s SYSTEM_FILE1 > system1.txt
perl eval.pl -b -q -g GOLD_FILE -s SYSTEM_FILE2 > system2.txt
perl compare.pl system1.txt system2.txt
where compare.pl is Dan Bikel's Randomized Parsing Evaluation Comparator (Statistical Significance Tester for evalb Output). Its output talks about "recall" and "precision" but for the output of eval.pl these are really "unlabeled attachment" and "labeled attachment" respectively.
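The three scores mentioned above (labeled attachment, unlabeled attachment and label accuracy) can be illustrated with a minimal Python sketch. It is not taken from eval.pl: the function name attachment_scores and the (head, deprel) pair representation are assumptions made for this example, and it assumes that non-scoring tokens have already been removed and that gold and system tokens are aligned one to one.

```python
def attachment_scores(gold_rows, system_rows):
    """Return (labeled attachment, unlabeled attachment, label accuracy) in percent.

    Each row is a (head, deprel) pair for one scoring token. This is an
    illustrative sketch, not the logic of eval.pl itself.
    """
    if len(gold_rows) != len(system_rows):
        raise ValueError("gold and system must contain the same number of tokens")
    total = len(gold_rows)
    if total == 0:
        raise ValueError("no scoring tokens")

    head_ok = label_ok = both_ok = 0
    for (gold_head, gold_rel), (sys_head, sys_rel) in zip(gold_rows, system_rows):
        same_head = gold_head == sys_head
        same_rel = gold_rel == sys_rel
        head_ok += same_head              # counts towards unlabeled attachment
        label_ok += same_rel              # counts towards label accuracy
        both_ok += same_head and same_rel  # counts towards labeled attachment

    return (
        100.0 * both_ok / total,
        100.0 * head_ok / total,
        100.0 * label_ok / total,
    )


gold = [(2, "SBJ"), (0, "ROOT"), (2, "OBJ"), (3, "NMOD")]
system = [(2, "SBJ"), (0, "ROOT"), (2, "NMOD"), (2, "NMOD")]
las, uas, label_accuracy = attachment_scores(gold, system)
```

In this toy input, two of the four tokens have both the correct head and the correct label, three have the correct head, and three have the correct label.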
Here are some questions and answers about the definition of non-scoring tokens:
The evaluation script defines a non-scoring token as a token where all characters have the Unicode category property "Punctuation" (see "man perlunicode"). As the underscore character also has the "Punctuation" property, the non-last inflection groups in our Turkish data (with FORM '_') and the punctuation tokens in our Arabic data (with FORMs such as '._.' where the part before the underscore is in Arabic script while the part after the underscore is in Latin script) are all non-scoring tokens, as intended. There might be a few cases where one does not agree with the Unicode category. For example, Unicode classifies the percentage sign (%) as punctuation. We chose not to try to improve upon the Unicode definition and do not think that this will substantially influence results. However, a detailed study of this issue might be useful in the future.
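The definition above can be sketched in Python as follows. This is only an approximation of what the Perl script does: the function name is_non_scoring is made up for this example, the treatment of an empty FORM is an assumption, and Python's unicodedata tables may correspond to a different Unicode version than the one used by a given Perl installation.

```python
import unicodedata


def is_non_scoring(form):
    """True if every character of FORM has a Unicode punctuation category.

    All punctuation categories (Pc, Pd, Ps, Pe, Pi, Pf, Po) start with "P".
    An empty FORM is treated as scoring here; that choice is an assumption.
    """
    return bool(form) and all(
        unicodedata.category(char).startswith("P") for char in form
    )


# The underscore (category Pc) and the percentage sign (category Po) both
# count as punctuation, as discussed above; the dollar sign (Sc) does not.
examples = ["_", "%", "._.", "U.S.", "$"]
flags = [is_non_scoring(form) for form in examples]
```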
That was indeed our first idea. However, there are two problem cases. The Swedish treebank has a POS tag for punctuation but it is not always used for punctuation. For example, punctuation that is part of a multiword gets the multiword tag. Also, some punctuation (e.g. some commas) is tagged as "coordinating conjunction". The Arabic treebank also has a POS tag for punctuation but it is not used in one of the four subcorpora (instead, punctuation, together with numbers and other material, has the POS tag "non-alphabetic material").
In addition, there is a more fundamental issue. Parsing as defined for this shared task assumes that the gold-standard POS tags are given. However, this is rarely the case in real applications. Typically, input text is untagged (and untokenized, but that is a different issue). One would therefore either apply a POS tagger to the data and use the output of the tagger as the input to the parser, or use a parser that assigns POS tags directly during parsing (possibly preceded by a morphological component for highly inflected languages). We would like to be able to compare the results for the shared task definition of parsing to results on this more general definition of parsing. However, a meaningful comparison requires that the same set of tokens is scored in both set-ups. If the definition of a non-scoring token relied on the POS tags, it would mean that during parsing in the general set-up the parser does not know which tokens it will be scored upon. It would also mean that each gold standard for parsing must include POS tags. Both consequences might be problematic.
This is the most fundamental question and we do not have a satisfactory answer. Some treebanks (e.g. Alpino for Dutch) do not attach punctuation at all. Although we could (and in fact did) attach punctuation during the conversion to the shared task format, it does not sound like a good idea to then score on it, as parsers would be punished if they fail to reproduce an attachment which we introduced. In addition, punctuation is also normally ignored for the scoring of constituent-based parsers, and it might not even exist for (parts of) treebanks that are based on transcriptions of speech. Finally, the attachment of most punctuation is irrelevant for most parsing applications.
However, there are a few important exceptions. Some treebanks (e.g. the Prague Dependency Treebanks) analyze some punctuation tokens as the syntactic heads of other tokens. For example, in the absence of a coordinating conjunction, a comma might be analyzed as the coordinator, and therefore as the head of the conjuncts. The attachment of the comma therefore encodes the information what the whole coordinated phrase refers to. If we ignore this comma for scoring purposes, a parser might get an attachment score of 100% despite the fact that it misattached the coordinated phrase. Clearly, this is problematic.
We feel that the problem of whether or not to score some or all punctuation for some or all treebanks is a complicated one and we will leave it to the research community to find a satisfactory answer in the future. For the shared task, we exclude punctuation from scoring. It remains to be seen whether this influences results in any way.
usage:
validateFormat.py [options] FILES
purpose:
checks whether files are in CoNLL-X shared task format
args:
FILES input files
options:
--version show program's version number and exit
-h, --help show this help message and exit
-d STRING, --discard_problems=STRING
-e STRING, --encoding=STRING
output character encoding (default is utf-8)
-i STRING, --input_separator=STRING
regular expression for column separator in
input (default is one tab, i.e. '\t')
-p STRING, --punctuation=STRING
use given regular expression to identify
punctuation (by matching with the CPOSTAG column)
and check that nothing links to and that a sentence
contains more than just punctuation (default: turned
off)
-r STRING, --root_deprel=STRING
designated root label: check that there is exactly
one token with that label and that its HEAD is 0
(default: not specified)
-s STRING, --silence_warnings=STRING
don't warn about certain types of
problems (default is to warn about every problem);
possible choices: cycle punct whitespace root other
-t STRING, --type=STRING
type of the data to be tested: train,
test_blind, system (default: train)
This script can be used to check files for compliance with the CoNLL-X shared
task format. It prints detailed warnings and error messages to STDERR. The
returned status code indicates whether the files passed the test (status 0) or
not (status 1). The requirements for training data (-t train) are stricter than
for system produced output (-t system): Errors in the (P)HEAD and (P)DEPREL
columns cause status 1 for training data but not for system output (STDERR
messages are the same). System output is allowed but not required to have the
PHEAD and PDEPREL column.
You can suppress warnings by using the -s option. E.g. if you already know that your system sometimes predicts cycles in the dependency structure, you could call the script with:
./validateFormat.py -t system -s cycle systems_output.conll
You cannot suppress error messages.
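To make the notion of a cycle in the dependency structure concrete, here is a minimal Python sketch of one way to detect it from the HEAD column of a single sentence. It is not the check implemented in validateFormat.py: the function name find_cycle and the list-based input are assumptions made for this example.

```python
def find_cycle(heads):
    """Return the token IDs of one cycle, or None if there is none.

    heads[i] is the HEAD value of the token with ID i + 1; HEAD 0 denotes
    the artificial root. Illustrative sketch only.
    """
    n = len(heads)
    for head in heads:
        if not 0 <= head <= n:
            raise ValueError("HEAD value out of range: %r" % (head,))

    # 0 = not visited yet, 1 = on the path being followed, 2 = known cycle-free
    state = [0] * (n + 1)
    state[0] = 2  # the artificial root has no head
    for start in range(1, n + 1):
        path = []
        node = start
        while state[node] == 0:
            state[node] = 1
            path.append(node)
            node = heads[node - 1]  # follow the HEAD link
        if state[node] == 1:
            # We came back to a token on the current path: that is a cycle.
            return path[path.index(node):]
        for visited in path:
            state[visited] = 2
    return None


tree = [2, 0, 2]      # tokens 1 and 3 attach to token 2, which is the root
cyclic = [0, 3, 2]    # tokens 2 and 3 attach to each other
```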
Download validateFormat.py (version 1.4)
Download SharedTaskCommon.py, which is needed by validateFormat.py
Note: I have fixed a bug in version 1.2 that caused validateFormat.py to complain if a file followed the Windows end-of-line convention (of using \r\n).
usage:
tabs2blanks.py [options] <INFILE >OUTFILE
purpose:
Converts tabs to blanks in an attempt to align the column content.
Reads from standard input and writes to standard output.
Expects input in tabular format with columns separated by tabs.
Spaces in column content are replaced by tabs (by default).
options:
--version show program's version number and exit
-h, --help show this help message and exit
-b STRING, --blank-replace=STRING
replacement for blanks in column content (default is
tab)
-e STRING, --encoding=STRING
input and output character encoding (default is utf-8)
-m INT, --max-width=INT
maximum width of a column (default is unlimited)
Download tabs2blanks.py
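The alignment idea described in the help text above can be sketched in a few lines of Python. This is not the code of tabs2blanks.py: the function name align_columns is hypothetical, the --max-width option is not modelled, and column width is measured with len(), which ignores wide or combining characters.

```python
def align_columns(lines, blank_replacement="\t"):
    """Pad tab-separated columns with blanks so that they line up.

    Blanks inside a cell are first replaced (by a tab, as in the documented
    default), so that the original columns can still be told apart.
    Empty lines (sentence separators) are passed through unchanged.
    """
    rows = [
        [cell.replace(" ", blank_replacement) for cell in line.split("\t")]
        if line else []
        for line in lines
    ]

    # Width of each column = length of its longest cell.
    widths = {}
    for row in rows:
        for index, cell in enumerate(row):
            widths[index] = max(widths.get(index, 0), len(cell))

    aligned = []
    for row in rows:
        padded = [cell.ljust(widths[index]) for index, cell in enumerate(row)]
        aligned.append(" ".join(padded).rstrip(" "))
    return aligned


sample = ["1\tThe\tDT", "10\tcats\tNNS", ""]
result = align_columns(sample)
```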
usage:
blanks2tabs.py [options] <INFILE >OUTFILE
purpose:
Converts each sequence of blanks to a single tab.
Reads from standard input and writes to standard output.
Expects input in tabular format with columns separated by one or more blanks.
Tabs in column content are replaced by blanks (by default).
options:
--version show program's version number and exit
-h, --help show this help message and exit
-e STRING, --encoding=STRING
input and output character encoding (default is utf-8)
-t STRING, --tab-replace=STRING
replacement for tabs in column content (default is a
single blank)
Download blanks2tabs.py
usage:
conlltab2dot.py [options] <INFILE >OUTFILE
purpose:
Converts dependency structures in CoNLL-X tabular format
to dot graph specifications.
Reads from standard input and writes to standard output.
examples:
To generate postscript output on Linux, try:
./conlltab2dot.py <sample.conll | \
dot -Tps2 > /tmp/sample.ps && \
gv /tmp/sample.ps
To generate PDF output on Mac OS X, try:
./conlltab2dot.py <sample.conll | \
/Applications/Graphviz.app/Contents/MacOS/dot -Tepdf | \
open -f -a preview
options:
--version show program's version number and exit
-h, --help show this help message and exit
-e STRING, --encoding=STRING
character encoding (default is utf-8)
-r STRING, --range=STRING
sentence range as comma-separated list of sentence
numbers, optionally using n-, -n, or n-m to denote
inclusive ranges (default is all sentences)
-s CHAR, --shape=CHAR
shape of graph where 'h' means hierarchical and 'l'
means linear (default is 'h')
Download conlltab2dot.py
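The sentence range syntax accepted by --range (comma-separated numbers, with n-, -n and n-m as inclusive ranges) can be illustrated with the following Python sketch. It is not the parser used in conlltab2dot.py: the function name parse_range and the parameter last are made up for this example, and numbering sentences from 1 is an assumption.

```python
def parse_range(spec, last):
    """Expand a range specification into a sorted list of sentence numbers.

    spec is a string such as "1,3-5,8-"; last is the number of the final
    sentence, needed to resolve open-ended ranges like "8-".
    Sentences are assumed to be numbered from 1.
    """
    selected = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-", 1)
            start = int(low) if low else 1      # "-n" starts at the first sentence
            end = int(high) if high else last   # "n-" runs to the last sentence
        else:
            start = end = int(part)
        selected.update(range(max(start, 1), min(end, last) + 1))
    return sorted(selected)


sentences = parse_range("1,3-5,8-", last=10)
```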
For the shared task, 13 treebanks were converted from their original formats to the data format used in the shared task. The software to do that was developed by several different people and for practical reasons, no effort was made to standardize it. We provide this software here without any warranty but hope that it will be useful to other researchers. For general questions about this page, please contact conll06st@uvt.nl. For questions about specific software, however, please contact the respective author directly.
Note: Now that the shared task is over, we (the organizers) will probably not develop this software any further. However, everybody is invited to make improvements to it and to release those on the Depparse Wiki page.
Download tarred and zipped software: dubey-software.tar.bz2.
This tarball is zipped using bzip2, and can be unpacked with
either 'tar xjf filename' or 'tar xyf filename'
(depending on your version of tar).
This software is written in OCaml. The tarball contains library functions for the conversion of other treebanks as well.
See tools/nlX4/README.CONLL for general information.
You can contact Amit Dubey at "adubey at inf dot ed dot ac dot uk"
Download tarred and zipped software: erwins-software.tar.bz2.
This tarball is zipped using bzip2, and can be unpacked with
either 'tar xjf filename' or 'tar xyf filename'
(depending on your version of tar).
The software is organized in the same way as the training and test data.
See the files data/<language>/<treebank>/README for general information.
The treebank-specific conversion software is in data/<language>/<treebank>/tools/.
Some general software is in tools/.
The shell scripts that control the conversion processes are data/<language>/<treebank>/tools/build.sh (go there and execute build.sh).
The main conversion scripts are data/<language>/<treebank>/tools/<treebank>2tab.py (written in Python).
The treebank files are expected in data/<language>/<treebank>/treebank/.
The resulting files can be found in data/<language>/<treebank>/dist/.
There are also some log files that were created during the original conversion. They might be useful for comparing against your log files as a sanity check that your conversion worked the same way as ours.
The Dutch Alpino treebank was not only converted but also retagged. The software used for tagging is MBT. You will need to get the following files and put them into these locations:
data/dutch/alpino/tools/mbt-nl/gen-optimal-mbt
data/dutch/alpino/tools/mbt-nl/Mbt
data/dutch/alpino/tools/mbt-nl/wotan.all.tag.5paxes
data/dutch/alpino/tools/mbt-nl/wotan.all.tag.known.ddwfWawa
data/dutch/alpino/tools/mbt-nl/wotan.all.tag.lex
data/dutch/alpino/tools/mbt-nl/wotan.all.tag.lex.ambi.20
data/dutch/alpino/tools/mbt-nl/wotan.all.tag.settings
data/dutch/alpino/tools/mbt-nl/wotan.all.tag.top50
data/dutch/alpino/tools/mbt-nl/wotan.all.tag.unknown.chnppddwFawasss
Download tarred and zipped software: other-software.tar.bz2.
This tarball is zipped using bzip2, and can be unpacked with
either 'tar xjf filename' or 'tar xyf filename'
(depending on your version of tar).
The software is organized in the same way as the training and test data.
The treebank-specific conversion software is in data/<language>/<treebank>/tools/.
Some general software is in tools/.
The Makefiles that control the conversion processes are data/<language>/<treebank>/Makefile (go there and type "make"). They also contain some comments about how the training-test-split was determined.
The conversion scripts for PDT, Bosque, and Metu-Sabanci are data/<language>/<treebank>/tools/<treebank>2MALT.py (written in Python). The other conversions were done by different people, so there is no pattern in the naming.
The Bosque, SDT and BulTreeBank treebank files are expected in data/<language>/<treebank>/treebank/. The Metu-Sabanci treebank files are expected in data/<language>/<treebank>/tb_corrected because that's what "tb_corrected_versionConll.zip" (the version of the treebank that we used) expands to. For PDT, you have to modify the Makefile to point it to the location of your PDT CD. The PADT Makefile does not control the complete conversion process: you will have to convert the treebank files individually (using data/<language>/<treebank>/tools/padt2tab.py) and put the results into the directories data/<language>/<treebank>/train/ and data/<language>/<treebank>/test/, respectively.
The resulting files can be found in data/<language>/<treebank>/dist/.
There are also some log files that were created during the original conversion. They might be useful for comparing against your log files as a sanity check that your conversion worked the same way as ours.