This page describes known bugs and problems with the data and software from the CoNLL-X Shared Task.
|
ID: |
1 |
|
Entry date: |
12 January 2006 |
|
Reported by: |
Ryan McDonald |
|
Description: |
Starting on lines 415455 of the Dutch data, there is a series of entries that do not conform to the data format (they have too few columns). Their POS tags also differ in case from those of the other examples in that data set. I should note that many of these appear to be repeated entries. |
|
Assigned to: |
Erwin |
|
Status: |
fixed |
|
Update: |
conll06_data_dutch_alpino_train_v1.4.tar.bz2 |
|
Released: |
13 January 2006 |
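
A minimal sketch of how a report like this could be reproduced, assuming the standard CoNLL-X layout of 10 tab-separated columns per token line, with blank lines between sentences. The function name `find_malformed_lines` is hypothetical and is not part of the shared task software.

```python
# Assumption: CoNLL-X token lines have exactly 10 tab-separated columns
# (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL).
EXPECTED_COLUMNS = 10


def find_malformed_lines(lines):
    """Return (line_number, column_count) for every token line
    that does not have exactly EXPECTED_COLUMNS columns."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # blank lines separate sentences
        count = len(line.split("\t"))
        if count != EXPECTED_COLUMNS:
            problems.append((number, count))
    return problems
```

The function accepts any iterable of lines, so an open file handle on the training data can be passed directly.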
|
ID: |
2 |
|
Entry date: |
19 January 2006 |
|
Reported by: |
Svetoslav Marinov |
|
Description: |
I think there are at least three types of bugs in the Bulgarian data: 1) a POS tag that is a whitespace; 2) whitespace separating the parts of multiword expressions; 3) the "A" and "Am" POS tags occur in both UTF-8 and ASCII form and are therefore duplicated if one creates a POS tag file (for example, to train with Malt). The same discrepancy occurs in the dependency relations. Addition: There are exactly 2693 whitespaces in the Bulgarian treebank which should not be there. As I said earlier, there are "whitespaced" POS tags, as well as " N" POS tags, in addition to multiword expressions (e.g. "taka che", "kakto i"), which should be joined with underscores instead. |
|
Assigned to: |
Sabine |
|
Status: |
fixed |
|
Update: |
conll06_data_bulgarian_bultreebank_train_v1.2.tar.bz2 |
|
Released: |
22 January 2006 |
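
The whitespace problems described in this entry can be located with a simple scan. The following is a sketch only, assuming tab-separated columns as in the CoNLL-X format; `find_whitespace_fields` is a hypothetical helper, and it does not address the duplicated "A"/"Am" tags.

```python
def find_whitespace_fields(lines):
    """Return (line_number, column_number, field) for every field that is
    empty or contains any whitespace character (columns are 1-based)."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # blank lines separate sentences
        for column, field in enumerate(line.split("\t"), start=1):
            if field == "" or any(ch.isspace() for ch in field):
                problems.append((number, column, field))
    return problems
```

Because tabs are consumed by the split, any whitespace left inside a field (a space-only POS tag, " N", or "taka che") is flagged.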
|
ID: |
3 |
|
Entry date: |
30 January 2006 |
|
Reported by: |
Masayuki Asahara |
|
Description: |
We found some strange entries in the DEPREL field of the Chinese data: 1) original POSs in upper case 2) open parenthesis without close parenthesis 3) "themr" and "ttheme" (maybe typos of "theme") 4) "locationl" and "ocation" (maybe typos of "location") 5) "propert" and "propertyr" (maybe typos of "property") 6) "resaon" (maybe typo of "reason") 7) "epstemics baa", "mannerDh" and "quantityNeqa" (concatenated with POS?) 8) "topic" with [] (it may not be noise) 9) "%" and "%(evaluation" 10) "Headt" (maybe typo of "Head"). The attached file lists all types of DEPREL in the Chinese data. Low-frequency entries may be errors or noise. |
|
Assigned to: |
Amit |
|
Status: |
fixed 1) and 9) by excluding sentences marked with "%"; 8) is not an error; 7) replaced the space in "epstemics baa" with an underscore; the rest is due to typos in the original treebank and was therefore not fixed; also introduced proper CPOSTAGs and included a README in the tarball |
|
Update: |
conll06_data_chinese_sinica_train_v1.2.tar.bz2 |
|
Released: |
1 February 2006 |
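
A label inventory of the kind attached to this report can be produced by counting DEPREL values. This is a rough sketch, assuming DEPREL is the 8th tab-separated field; `deprel_counts` and `rare_labels` are hypothetical names, and the frequency threshold is arbitrary.

```python
from collections import Counter

DEPREL_INDEX = 7  # field 8 (DEPREL), zero-based


def deprel_counts(lines):
    """Count how often each DEPREL value occurs in the token lines."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) > DEPREL_INDEX:  # skips blank and truncated lines
            counts[fields[DEPREL_INDEX]] += 1
    return counts


def rare_labels(counts, max_count=1):
    """Return the labels seen at most max_count times, sorted alphabetically.
    Rare labels are candidates for typos, not proven errors."""
    return sorted(label for label, n in counts.items() if n <= max_count)
```

A low count only suggests a problem; as item 8) shows, an unusual-looking label may be legitimate.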
|
ID: |
4 |
|
Entry date: |
6 March 2006 (reported 2 March) |
|
Reported by: |
Deniz Yuret, Eckhard Bick |
|
Description: |
In the Spanish data (newest version, from Jan 9th), 54635 edge labels in field 8 (DEPREL) are left blank (i.e. _), although this field was never supposed to be blank. |
|
Assigned to: |
Amit |
|
Status: |
Amit confirmed it's a bug and not intended. However, it's too late now to change the data. '_' will be treated as a normal DEPREL label, i.e. if it matches with the parser's prediction, that's counted as correct, otherwise not. |
|
Update: |
None |
|
Released: |
- |
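
The decision above means that '_' receives no special handling during scoring. The sketch below illustrates this with a simplified labeled attachment score over (HEAD, DEPREL) pairs. It is not the official evaluation script: `labeled_attachment_score` is a hypothetical function, and any other rules the official scorer applies are ignored here.

```python
def labeled_attachment_score(gold, predicted):
    """Fraction of tokens whose (head, deprel) pair matches the gold pair.

    gold and predicted are equal-length lists of (head, deprel) tuples.
    The label '_' is compared like any other string, so a predicted '_'
    is correct exactly when the gold label is also '_'.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        return 0.0
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)
```

Under this treatment, a parser that learns to emit '_' where the training data has it is rewarded, and one that predicts a real label for such a token is counted as wrong.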
|
ID: |
5 |
|
Entry date: |
6 March 2006 |
|
Reported by: |
Sabine Buchholz |
|
Description: |
Spanish data: Space following or preceding the lemma in four cases. |
|
Assigned to: |
Amit |
|
Status: |
Caused by extra spaces in the original treebank. However, it's too late now to change the data. |
|
Update: |
None |
|
Released: |
- |