| Tanl Linguistic Pipeline |
Public Member Functions | |
| Resources (char const *POStag, char const *NEtag) | |
| Resources (std::string &locale, char const *POStag, char const *NEtag) | |
| Resources (std::string &resourceDir, std::string &locale, char const *POStag, char const *NEtag) | |
| size_t | typesCount () |
| char const * | typeName (EntityType et) |
| void | load (std::string &resourceDir) |
| Load all resources from the given directory. | |
| template<class WordSet > | |
| void | load (std::vector< WordSet > &sets, char const *file) |
| Load a group of WordSets from a file. | |
| template<class WordSet > | |
| void | load (vector< WordSet > &sets, char const *file) |
| template<class WordSet > | |
| void | load (WordSet *sets, char const *file) |
Public Attributes | |
| Text::WordIndex | classId |
| Maps class names to class IDs. | |
| char const * | language |
| char const * | POStag |
| char const * | NEtag |
| TagSet | prevTokenType |
| TagSet | nextTokenType |
| std::vector< Tanl::Text::NormWordSet > | dict |
| Tanl::Text::NormWordSet | moneyDict |
| Tanl::Text::NormWordSet | namesDict |
| Tanl::Text::NormWordSet | timeDict |
| Tanl::Text::NormWordSet | prodDict |
| Tanl::Text::NormWordSet | FWL |
| FWL (Frequent Word List): words that occur in more than 5 documents. | |
| std::vector< Tanl::Text::NormWordSet > | designators |
| CPW (Common Preceding Words): 20 words that most often precede names of a certain class. | |
| std::vector< Tanl::Text::NormWordSet > | preBigrams |
| CPB (Common Preceding Bigrams): bigrams that often precede names of a certain class. | |
| std::vector< Tanl::Text::NormWordSet > | prefixes |
| PRE (Prefix for Class): common 3-letter prefix for each class. | |
| std::vector< Tanl::Text::Suffixes > | suffixes |
| SUF (Suffix for Class): common 3-letter suffix for each class. | |
| std::vector< Tanl::Text::NormWordSet > | firstWords |
| EFW (Entity First Words): list of words starting an entity. | |
| std::vector< Tanl::Text::NormWordSet > | lastWords |
| ELW (Entity Last Words): list of words terminating an entity. | |
| std::vector < Tanl::Text::NormWordSet > | lowerInterm |
| NAW (Name After Words): list of words after an entity. | |
Static Public Attributes | |
| static IXE::conf_set< std::string > | entityTypes |
| The entity type names. | |
| void Tanl::NER::Resources::load | ( | std::vector< WordSet > & sets, char const * file | ) | [inline] |
Load a group of WordSets from a file.
The file contains one entry per line in the format "class word", where class is an entity type such as LOC, MISC, ORG or PER.
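The file format described above can be illustrated with a short, self-contained sketch. It is not the Tanl implementation: `WordSet` here is a plain `std::unordered_set` standing in for the real word set type, `loadWordSets` and the `classIds` map are hypothetical names, and treating everything after the class name as the word (so that multi-word entries survive) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a word set; the real Tanl type may differ.
using WordSet = std::unordered_set<std::string>;

// Read "class word" lines and add each word to the set of its class.
// classIds maps a class name (e.g. "LOC") to an index into `sets`.
// Blank lines, lines without a word, and unknown classes are skipped.
// Returns the number of entries stored.
inline std::size_t loadWordSets(std::istream& in,
                                const std::unordered_map<std::string, std::size_t>& classIds,
                                std::vector<WordSet>& sets)
{
    std::size_t loaded = 0;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string cls, word;
        if (!(fields >> cls))
            continue;                          // blank line
        std::getline(fields >> std::ws, word); // assumption: rest of line is the word
        if (word.empty())
            continue;                          // class name without a word
        auto it = classIds.find(cls);
        if (it == classIds.end())
            continue;                          // unknown entity type
        assert(it->second < sets.size());
        sets[it->second].insert(word);
        ++loaded;
    }
    return loaded;
}
```

Indexing the sets by class ID mirrors the way the per-class attributes of this class are declared as vectors of word sets; whether the library resolves class names through classId in the same way is not stated here.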
| void Tanl::NER::Resources::load | ( | std::string & | resourceDir | ) |
Load all resources from the given directory.
Referenced by Tanl::NER::NER::NER().
| Text::WordIndex Tanl::NER::Resources::classId |
Maps class names to class IDs.
| IXE::conf_set< std::string > Tanl::NER::Resources::entityTypes [static] |
The entity type names.
Referenced by Tanl::NER::NER::tag().
| Tanl::Text::NormWordSet Tanl::NER::Resources::FWL |
FWL (Frequent Word List): words that occur in more than 5 documents.
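As an illustration of the document-frequency criterion, here is a minimal sketch of how such a list could be derived from a tokenized corpus. The names `Document` and `frequentWords` are hypothetical and not part of Tanl; the page does not describe how the shipped list was actually built.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// One tokenized document (hypothetical representation).
using Document = std::vector<std::string>;

// Return the words that occur in more than `minDocs` documents.
// A word is counted once per document, however often it repeats there.
inline std::unordered_set<std::string>
frequentWords(const std::vector<Document>& corpus, std::size_t minDocs = 5)
{
    std::unordered_map<std::string, std::size_t> docFreq;
    for (const Document& doc : corpus) {
        // Deduplicate within the document before counting.
        std::unordered_set<std::string> seen(doc.begin(), doc.end());
        for (const std::string& word : seen)
            ++docFreq[word];
    }
    std::unordered_set<std::string> result;
    for (const auto& entry : docFreq)
        if (entry.second > minDocs)
            result.insert(entry.first);
    return result;
}
```

Counting documents rather than raw occurrences keeps a word that is repeated many times in a single text from entering the list.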
| std::vector<Tanl::Text::NormWordSet> Tanl::NER::Resources::lowerInterm |
NAW (Name After Words): list of words after an entity, e.g.: center, museum, square, street.
LIW (Lowercase Intermediate Words): list of lowercase words appearing within a sequence, e.g.:
PER: "van der", "de", "of"
ORG: al, in, zonder, vor, for
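To show what a list of lowercase intermediate words is good for, the sketch below extends an entity span across such words when a capitalized token follows them. It is only an illustration under stated assumptions: `extendEntity` and `isCapitalized` are hypothetical helpers, `WordSet` is a plain `std::unordered_set` rather than the Tanl type, and the tagger in this library may use the list differently (for example, as a classifier feature).

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a word set; the real Tanl type may differ.
using WordSet = std::unordered_set<std::string>;

inline bool isCapitalized(const std::string& token)
{
    return !token.empty() &&
           std::isupper(static_cast<unsigned char>(token[0])) != 0;
}

// Given an entity starting at tokens[start], return the index one past its
// last token. Capitalized tokens extend the span; tokens found in `liw` are
// skipped over and only kept if a later capitalized token follows them.
inline std::size_t extendEntity(const std::vector<std::string>& tokens,
                                std::size_t start, const WordSet& liw)
{
    assert(start < tokens.size());
    std::size_t end = start + 1;
    for (std::size_t i = end; i < tokens.size(); ++i) {
        if (isCapitalized(tokens[i]))
            end = i + 1;                 // span now includes tokens[i]
        else if (liw.count(tokens[i]) == 0)
            break;                       // ordinary lowercase word ends the span
    }
    return end;
}
```

With the PER list above, the tokens Ludwig, van, Beethoven, was, born yield a span of three tokens, while a trailing intermediate word with no capitalized token after it is left outside the span.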