Scoring and error analysis script
Proposed name: "evald"
General
option for character encoding, default: UTF-8
option for TABs versus blanks as separator
options specifying which columns are present and in what position
input format: two files in Malt-TAB or CoNLL-X shared task format (in specified character encoding)
output format: plain text (in specified character encoding)
required (?) argument: treebank-specific settings file (in specified character encoding)
check that the two input files actually contain the same sentences (i.e. the same FORM values)
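The sentence check above could be sketched as follows. This is only an illustration, assuming CoNLL-X input with the standard ten tab-separated columns and blank lines between sentences; the names read_sentences and check_same_forms are hypothetical, and the COLUMNS list would be replaced by whatever the column options specify (e.g. for Malt-TAB).

```python
import io

# CoNLL-X column names in their standard order (assumption: CoNLL-X input).
COLUMNS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
           "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]


def read_sentences(stream, separator="\t"):
    """Yield each sentence as a list of {column name: value} dicts."""
    sentence = []
    for line in stream:
        line = line.rstrip("\r\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(dict(zip(COLUMNS, line.split(separator))))
    if sentence:
        yield sentence


def check_same_forms(gold, predicted):
    """Return (sentence number, problem) pairs; an empty list means the files agree."""
    problems = []
    if len(gold) != len(predicted):
        problems.append((0, "different number of sentences"))
    for number, (g, p) in enumerate(zip(gold, predicted), start=1):
        if [t["FORM"] for t in g] != [t["FORM"] for t in p]:
            problems.append((number, "FORM values differ"))
    return problems


# Tiny in-memory example; real input would come from open(path, encoding="utf-8").
gold_text = "1\tJohn\t_\tN\tN\t_\t2\tsubj\t_\t_\n2\tsleeps\t_\tV\tV\t_\t0\tROOT\t_\t_\n\n"
pred_text = "1\tJohn\t_\tN\tN\t_\t2\tsubj\t_\t_\n2\tsleep\t_\tV\tV\t_\t0\tROOT\t_\t_\n\n"
gold = list(read_sentences(io.StringIO(gold_text)))
predicted = list(read_sentences(io.StringIO(pred_text)))
print(check_same_forms(gold, predicted))
```

Reporting the sentence number rather than stopping at the first mismatch is a design choice of this sketch; the script could equally abort immediately.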
Settings file
allows specifying options (see above and below)
allows excluding tokens from scoring based on their FORM, LEMMA, CPOSTAG, POSTAG or FEATS
to be decided: allow exclusion of tokens where the set of FEATS contains a specific value or only allow exclusion of fully specified sets?
lower priority: allows excluding tokens from scoring based on their HEAD (less or greater than specified threshold) or DEPREL
to be decided: how to compute precision when excluding some DEPRELs; proposal: do the same as MaltEval (what does it do?)
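One possible shape for the exclusion rules is sketched below. The rule representation as (column, value) pairs, the function name is_excluded and the feats_match switch are all assumptions made for illustration; the switch merely shows both sides of the open FEATS question (whole set versus single value), assuming FEATS uses the CoNLL-X "|" separator.

```python
def is_excluded(token, rules, feats_match="exact"):
    """Return True if any rule matches the token.

    token: dict mapping column names (FORM, LEMMA, ...) to values.
    rules: iterable of (column, value) pairs, e.g. ("CPOSTAG", "Punc").
    feats_match: "exact" compares the whole FEATS string;
                 "contains" matches if value is one of the "|"-separated features.
    """
    for column, value in rules:
        field = token.get(column, "_")
        if column == "FEATS" and feats_match == "contains":
            if value in field.split("|"):
                return True
        elif field == value:
            return True
    return False


# Hypothetical token and rules, for illustration only.
comma = {"FORM": ",", "CPOSTAG": "Punc", "FEATS": "_"}
noun = {"FORM": "casa", "CPOSTAG": "N", "FEATS": "num=sg|gen=f"}
print(is_excluded(comma, [("CPOSTAG", "Punc")]))
print(is_excluded(noun, [("FEATS", "num=sg")], feats_match="contains"))
```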
Output
Bookkeeping information
maybe: date, time
maybe: names of input files
definitely: settings used
Essential
labeled attachment score (over non-excluded tokens)
unlabeled attachment score (over non-excluded tokens)
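As a minimal sketch of the two essential scores: the labeled attachment score counts tokens whose HEAD and DEPREL are both correct, the unlabeled one counts tokens whose HEAD is correct. The function name attachment_scores and the excluded callback are hypothetical; the callback stands in for whatever the settings file excludes.

```python
def attachment_scores(gold_tokens, predicted_tokens, excluded=lambda token: False):
    """Return (LAS, UAS) as percentages over non-excluded tokens.

    Both arguments are parallel lists of dicts with HEAD and DEPREL keys.
    Exclusion is decided on the gold token.
    """
    scored = labeled = unlabeled = 0
    for g, p in zip(gold_tokens, predicted_tokens):
        if excluded(g):
            continue
        scored += 1
        if g["HEAD"] == p["HEAD"]:
            unlabeled += 1
            if g["DEPREL"] == p["DEPREL"]:
                labeled += 1
    if scored == 0:
        return 0.0, 0.0
    return 100.0 * labeled / scored, 100.0 * unlabeled / scored


gold = [{"HEAD": "2", "DEPREL": "subj"}, {"HEAD": "0", "DEPREL": "ROOT"},
        {"HEAD": "2", "DEPREL": "obj"}, {"HEAD": "2", "DEPREL": "punct"}]
pred = [{"HEAD": "2", "DEPREL": "subj"}, {"HEAD": "0", "DEPREL": "ROOT"},
        {"HEAD": "2", "DEPREL": "adjunct"}, {"HEAD": "3", "DEPREL": "punct"}]
print(attachment_scores(gold, pred))  # (50.0, 75.0)
```

Returning 0.0 when every token is excluded is an arbitrary choice of this sketch; the script might prefer to report an error instead.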
For error analysis
percentage of tokens with correct DEPREL (regardless of HEAD value)
precision and recall per DEPREL
precision and recall per direction of head (to left, right or undefined (to root))
precision and recall per distance to head (binned, e.g. ? (to root), 1, 2, 3-5, >5 or similar)
(un)labeled attachment score computed only on sentences shorter than z tokens (let z be specified through an option)
accuracy per POSTAG
accuracy per CPOSTAG
if punctuation has been excluded: give accuracy on punctuation separately, and accuracy on tokens including punctuation?
if set of relation labels is small (<10?): confusion table for DEPREL
if set of relation labels is bigger: for each DEPREL value:
list of top x predicted DEPREL values (with percentage) for each gold standard DEPREL value
list of top x gold standard DEPREL values (with percentage) for each predicted DEPREL value
let x be specified through an option
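The two top-x lists could be produced along these lines. This is a sketch with hypothetical names (deprel_confusions, top_x); it assumes the input is one (gold DEPREL, predicted DEPREL) pair per scored token, and it keeps the correct label in the lists rather than showing errors only.

```python
from collections import Counter, defaultdict


def deprel_confusions(pairs):
    """Count predicted labels per gold label and gold labels per predicted label.

    pairs: iterable of (gold DEPREL, predicted DEPREL), one pair per scored token.
    """
    by_gold = defaultdict(Counter)
    by_predicted = defaultdict(Counter)
    for gold, predicted in pairs:
        by_gold[gold][predicted] += 1
        by_predicted[predicted][gold] += 1
    return by_gold, by_predicted


def top_x(counter, x):
    """Return the x most frequent labels, each with its percentage of the total."""
    total = sum(counter.values())
    return [(label, 100.0 * count / total) for label, count in counter.most_common(x)]


pairs = [("subj", "subj"), ("subj", "subj"), ("subj", "obj"), ("obj", "obj")]
by_gold, by_predicted = deprel_confusions(pairs)
for gold_label, counter in sorted(by_gold.items()):
    print(gold_label, top_x(counter, 2))
```

The same nested counts also contain a full confusion table, so the small-label-set case needs no separate bookkeeping.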
frame-based error analysis (acknowledge me in implementation!)
a frame is a local tree, i.e. a head and all its (gold or predicted) children
a frame can be represented minimally like:
subj *ROOT* obj adjunct
where the element marked by asterisks (or other character chosen by user through option) is the head
possible extension: adding the (C)POSTAG of the head
algorithm: for each token: extract gold standard frame and predicted frame; if not identical: increment counter for this frame confusion
output top y most frequent frame confusions
let y be specified through an option
possible extension: also count as a frame confusion those cases where a child's DEPREL value is correct but the wrong token was chosen as child, e.g.:
subj *ROOT* obj adjunct | subj> *ROOT* obj adjunct
if a "subj" child was predicted, but to the right of the gold standard "subj" child
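The basic frame algorithm (without the possible extensions) might look roughly like this. It rests on one interpretation that the notes do not spell out: a frame is rendered as the DEPREL of each child in surface order, with the head's own DEPREL wrapped in the marker character. The function names are hypothetical, and tokens are assumed to be numbered from 1 within each sentence, as in CoNLL-X.

```python
from collections import Counter


def frame(tokens, head_index, marker="*"):
    """Return the frame of the token at head_index (1-based) as a string.

    Lists, in surface order, the DEPREL of every child of the head and the
    head's own DEPREL wrapped in the marker character.
    """
    parts = []
    for index, token in enumerate(tokens, start=1):
        if index == head_index:
            parts.append(marker + token["DEPREL"] + marker)
        elif int(token["HEAD"]) == head_index:
            parts.append(token["DEPREL"])
    return " ".join(parts)


def frame_confusions(gold_sentences, predicted_sentences, marker="*"):
    """Count (gold frame, predicted frame) pairs that differ, over all tokens."""
    counts = Counter()
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        for head_index in range(1, len(gold) + 1):
            g = frame(gold, head_index, marker)
            p = frame(predicted, head_index, marker)
            if g != p:
                counts[(g, p)] += 1
    return counts


gold = [[{"HEAD": "2", "DEPREL": "subj"}, {"HEAD": "0", "DEPREL": "ROOT"},
         {"HEAD": "2", "DEPREL": "obj"}, {"HEAD": "2", "DEPREL": "adjunct"}]]
pred = [[{"HEAD": "2", "DEPREL": "subj"}, {"HEAD": "0", "DEPREL": "ROOT"},
         {"HEAD": "2", "DEPREL": "obj"}, {"HEAD": "3", "DEPREL": "adjunct"}]]
y = 10  # number of confusions to report; would come from an option
for (g, p), count in frame_confusions(gold, pred).most_common(y):
    print(count, g, "|", p)
```

In this example the misattached adjunct produces two confusions: one for the frame that lost a child and one for the frame that gained it.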
to be decided: How to enable the user to find the sentences that exhibit the errors reported on?
Maybe have an additional option (or several?) so that after each error report the program also prints up to i identifiers of tokens (sentence ID or #, token ID) that exhibit such an error (let i be specified through an option)
Or for each sentence, print its ID, followed by a list of all errors it exhibits (which can be many!)
For statistical significance tests
produce output similar to evalb, to allow use of Dan Bikel's Randomized Parsing Evaluation Comparator
Comments/additions welcome,
Sabine Buchholz