For the moment, we set aside the note classification problem and return to CRF learning on corpus 1. After a detailed evaluation and revision of the automatic reference annotation, we now try to rebuild a CRF model on the revised learning data. As noted in the completeness verification report, the main corrections to the learning and test data concerned the title and book title fields, so we expect the rebuilt CRF model to separate titles from book titles better than before. The following table shows the estimation result of this newly constructed CRF model.
This result is not what we expected. Despite the data revision, overall accuracy actually decreased by about 1% compared to the final version on corpus 1. Since a 1% difference is within the variation we would accept from a change of learning data, we conclude that the automatic annotation quality has not really changed.
Small errors in the manual annotation do not influence the automatic annotation quality. Instead, the limits of the current version of the Bilbo system should be overcome either with more informative features or by modifying the existing CRF model itself. Named entity lists are one way to pursue the former approach, and in the next post we will show the results of experiments that use such lists.
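To illustrate the idea, here is a minimal sketch of how a named entity list (a gazetteer) could be turned into an extra token-level feature for CRF learning. The list contents, the feature names, and the dictionary-per-token format are illustrative assumptions, not Bilbo's actual feature set; they follow the general convention of CRF toolkits that accept per-token feature mappings.

```python
# Hypothetical sketch: a named-entity list used as a CRF feature.
# SURNAME_LIST and the feature names below are toy examples, not
# the actual lists or features used in the Bilbo system.

SURNAME_LIST = {"smith", "dupont", "martin"}  # toy gazetteer

def gazetteer_features(tokens):
    """Return one feature dict per token. Alongside basic surface
    features, each token gets an "in_name_list" feature indicating
    membership in the named-entity list."""
    features = []
    for tok in tokens:
        features.append({
            "word": tok.lower(),
            "is_capitalized": tok[:1].isupper(),
            # the additional, list-based feature:
            "in_name_list": tok.lower() in SURNAME_LIST,
        })
    return features

tokens = "J. Smith , Digital Libraries , 2001".split()
feats = gazetteer_features(tokens)
```

With such features, the learner can weight "this token appears in a surname list" directly, which is exactly the kind of extra information a surface-form feature alone cannot provide.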