| Tanl Linguistic Pipeline |
Extract features for NER.
#include <NerFeatureExtractor.h>
Public Member Functions | |
| NerFeatureExtractor (Resources &resources) | |
| void | analyze (Sentence *sent, int zone) |
| Set the sentence from which to extract features. | |
| void | extract (Classifier::Features &feats, const int &pos) |
| Extract features at position pos in sentence and put them into feats. | |
| void | reset () |
| Reset to initial state. | |
| void | classified (int position, char const *className) |
| Record that a token at given position has been classified in the given class. | |
Public Attributes | |
| Resources & | resources |
Protected Attributes | |
| bool | insideQuotes |
| TokenCategorizer | tokenCategorizer |
| extracts token type | |
| std::vector< EntityType > | tokenTypes |
| types of sentence tokens | |
| Sentence * | sentence |
| sentence being analyzed | |
| unordered_map< string, bool > | capitalized |
| Words that appeared previously as capitalized. | |
| std::vector< Tanl::Text::NormWordSet > | prevClass |
| List of words previously designated as given class. | |
| std::vector< Tanl::Text::NormWordSet > | otherFirst |
| other word in Cap sequence is in First Words | |
| std::vector< Tanl::Text::NormWordSet > | otherLast |
| other word in Cap sequence is in Last Words | |
| Tanl::Text::NormWordSet | acronyms |
| List of previously found acronyms. | |
Extract features for NER.
void Tanl::NER::NerFeatureExtractor::analyze ( Sentence * sent, int zone )
Set the sentence from which to extract features.
References Tanl::NER::TokenCategorizer::analyze(), Tanl::Token::form, Tanl::Token::get(), sentence, Tanl::Token::set(), tokenCategorizer, and tokenTypes.
Referenced by Tanl::NER::NerEventStream::analyze().
void Tanl::NER::NerFeatureExtractor::extract ( Classifier::Features & feats, const int & pos ) [virtual]
Extract features at position pos in sentence and put them into feats.
Local features include:
Implements Tanl::Classifier::FeatureExtractor< Classifier::Features, const int >.
References acronyms, Tanl::Token::attrIndex(), capitalized, Tanl::Text::NormWordSet::contains(), Tanl::Token::form, Tanl::Token::get(), Tanl::Text::NormWordSet::insert(), otherFirst, otherLast, prevClass, Tanl::Text::RegExp::Pattern::replace(), sentence, Tanl::Text::RegExp::Pattern::test(), and Tanl::Text::to_upper().
Referenced by Tanl::NER::NerEventStream::next().
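The capitalized member is documented as holding words that appeared previously as capitalized. The following is a minimal sketch of how such a document-level memory could feed a feature, not the Tanl implementation: the class CapitalizationMemory, the lowercase key normalization, and the feature names "Capitalized" and "SeenCapitalized" are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in, not the Tanl class: remembers which word forms
// have already been seen capitalized in the current document.
class CapitalizationMemory {
public:
  // Return the (assumed) feature names for `form`, then record the form
  // if it starts with an uppercase letter.
  std::vector<std::string> featuresFor(const std::string& form) {
    std::vector<std::string> feats;
    const std::string key = lower(form);
    if (seen.count(key))
      feats.push_back("SeenCapitalized");  // assumed feature name
    if (!form.empty() && std::isupper(static_cast<unsigned char>(form[0]))) {
      feats.push_back("Capitalized");      // assumed feature name
      seen[key] = true;
    }
    return feats;
  }

  // Forget everything, e.g. at a document boundary.
  void reset() { seen.clear(); }

private:
  // Lowercase copy used as the lookup key (an assumption of this sketch).
  static std::string lower(std::string s) {
    for (char& c : s)
      c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return s;
  }

  // Same container type as the documented `capitalized` member.
  std::unordered_map<std::string, bool> seen;
};
```

With this sketch, a lowercase occurrence of a word receives "SeenCapitalized" only if a capitalized occurrence came earlier in the same document.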
void Tanl::NER::NerFeatureExtractor::reset ( ) [virtual]
Reset to initial state.
Useful when reading several documents with the same tagger instance.
Reimplemented from Tanl::Classifier::FeatureExtractor< Classifier::Features, const int >.
References acronyms, capitalized, otherFirst, otherLast, and prevClass.
Referenced by Tanl::NER::NerEventStream::reset().
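The call order suggested by the member descriptions (analyze a sentence, extract features per position, report each decision through classified, and reset between documents) can be sketched as follows. This is an illustration under stated assumptions, not the Tanl code: SketchExtractor, the Sentence and Features aliases, the "PrevEntity" feature, the "O" label for non-entities, and the toy capitalization rule that replaces a trained classifier are all hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-ins for illustration only; not the Tanl types.
using Sentence = std::vector<std::string>;
using Features = std::vector<std::string>;

class SketchExtractor {
public:
  // Set the sentence to work on; `zone` is accepted but unused here.
  void analyze(const Sentence* sent, int /*zone*/) { sentence = sent; }

  // Fill `feats` with features for the token at `pos`.
  void extract(Features& feats, int pos) const {
    feats.clear();
    const std::string& form = (*sentence)[pos];
    feats.push_back("W=" + form);
    if (prevEntity.count(form))
      feats.push_back("PrevEntity");  // assumed feature name
  }

  // Remember tokens that received an entity class ("O" assumed = none).
  void classified(int pos, const char* className) {
    if (std::string(className) != "O")
      prevEntity.insert((*sentence)[pos]);
  }

  // Drop document-level state.
  void reset() {
    prevEntity.clear();
    sentence = nullptr;
  }

private:
  const Sentence* sentence = nullptr;
  std::unordered_set<std::string> prevEntity;
};

// Process one document, then reset so the next document starts clean.
std::vector<Features> runDocument(SketchExtractor& fx,
                                  const std::vector<Sentence>& doc) {
  std::vector<Features> out;
  for (const Sentence& sent : doc) {
    fx.analyze(&sent, 0);
    for (int pos = 0; pos < static_cast<int>(sent.size()); ++pos) {
      Features feats;
      fx.extract(feats, pos);
      out.push_back(feats);
      // Toy decision rule standing in for a trained classifier.
      const bool cap = !sent[pos].empty() &&
          std::isupper(static_cast<unsigned char>(sent[pos][0]));
      fx.classified(pos, cap ? "PER" : "O");
    }
  }
  fx.reset();  // forget document-level state before the next document
  return out;
}
```

Without the reset call at the end of runDocument, entity memory from one document would leak into the features of the next, which is the situation the reset description warns about.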
unordered_map<string, bool> Tanl::NER::NerFeatureExtractor::capitalized [protected]
Words that appeared previously as capitalized.