| Tanl Linguistic Pipeline |
The task of the suffix guesser is to predict a tag-distribution based on the suffix of the word.
#include <SuffixGuesser.h>
Public Member Functions | |
| void | serialize (std::ostream &out) |
| Serializes a SuffixGuesser object. | |
| void | serialize (std::istream &in) |
| Deserializes a SuffixGuesser object. | |
| void | add_word (int n, std::string &word, TagID tag, int count) |
| Adds a word to the suffix trie. | |
| double | tagprob (std::string &word, int tagid) |
| TODO. | |
| double | tagprobs (std::string &word, std::vector< double > &probs) |
| TODO. | |
Static Public Member Functions | |
| static double | calculate_theta (std::vector< double > &apriori_tag_probs) |
| TODO. | |
Public Attributes | |
| double | theta |
| Theta used in the interpolation process. | |
| TrieNode | trie |
| Trie of suffixes. | |
| Counts | empty_counts |
| Empty Counts object. | |
The task of the suffix guesser is to predict a tag-distribution based on the suffix of the word.
In the training phase, it counts how often each suffix occurs in the corpus, both in total and separately for each tag. Consider a word ending in ABCDE. During prediction, the guesser linearly interpolates the looked-up predictions for the suffixes ABCDE, BCDE, CDE, DE, E, and "" (the empty suffix). Interpolation is done successively with weights 1 and theta, so the resulting weights are essentially powers of 1/(1+theta), with the shorter suffix getting the larger weight.
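The successive interpolation described above can be sketched as follows. This is a minimal illustration, not code taken from SuffixGuesser: the names `SuffixFreqs`, `suffixes_longest_first`, and `interpolate` are hypothetical, the recurrence `acc = (acc + theta * freq) / (1 + theta)` applied from the longest suffix to the empty one is only one reading that matches the description, and treating an unseen suffix as frequency 0 is a further assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical lookup table: suffix -> relative frequency of one fixed tag
// among corpus words ending in that suffix. Not part of the Tanl API.
using SuffixFreqs = std::unordered_map<std::string, double>;

// Suffixes of `word` from the longest (capped at max_len) down to "".
std::vector<std::string> suffixes_longest_first(const std::string& word,
                                                std::size_t max_len) {
    std::vector<std::string> out;
    std::size_t longest = word.size() < max_len ? word.size() : max_len;
    for (std::size_t len = longest;; --len) {
        out.push_back(word.substr(word.size() - len));
        if (len == 0) break;
    }
    return out;
}

// Assumed recurrence: acc <- (1 * acc + theta * freq) / (1 + theta),
// applied from the longest suffix to the empty one. A suffix of length i
// then ends up with weight theta / (1 + theta)^(i + 1).
double interpolate(const std::string& word, std::size_t max_len,
                   const SuffixFreqs& freqs, double theta) {
    assert(theta > 0.0);
    double acc = 0.0;
    for (const std::string& s : suffixes_longest_first(word, max_len)) {
        auto it = freqs.find(s);
        // Assumption: a suffix never seen in training contributes 0.
        double freq = (it == freqs.end()) ? 0.0 : it->second;
        acc = (acc + theta * freq) / (1.0 + theta);
    }
    return acc;
}
```

Under this reading the empty suffix is folded in last and is therefore divided by (1+theta) only once, which is why the shorter suffix ends up with the larger weight.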
| void Tanl::POS::SuffixGuesser::add_word (int n, std::string &word, TagID tag, int count) |
Adds a word to the suffix trie.
| n | Max suffix size. | |
| word | String to be added to the trie. | |
| tag | Tag identifier used to tag the word we are trying to add. | |
| count | Number of times the word was tagged with tag in the corpus. |
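To illustrate what adding a word to a suffix trie with these parameters might involve, here is a minimal sketch. It assumes a trie keyed on the word's characters read from the end, with each node holding a total count and per-tag counts. `SketchNode`, `sketch_add_word`, and `sketch_find` are hypothetical stand-ins; the real TrieNode and Counts types may be organized differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for TrieNode plus Counts; not the Tanl types.
struct SketchNode {
    int total = 0;                  // occurrences of this suffix overall
    std::map<int, int> by_tag;      // occurrences per tag identifier
    std::map<char, std::unique_ptr<SketchNode>> children;
};

// Records `count` occurrences of (word, tag) for every suffix of
// length 0..n, walking the word from its last character backwards.
void sketch_add_word(SketchNode& root, int n, const std::string& word,
                     int tag, int count) {
    assert(n >= 0 && count >= 0);
    SketchNode* node = &root;
    node->total += count;           // the empty suffix matches every word
    node->by_tag[tag] += count;
    std::size_t limit = std::min<std::size_t>(n, word.size());
    for (std::size_t i = 0; i < limit; ++i) {
        char c = word[word.size() - 1 - i];
        std::unique_ptr<SketchNode>& child = node->children[c];
        if (!child) child = std::make_unique<SketchNode>();
        node = child.get();
        node->total += count;
        node->by_tag[tag] += count;
    }
}

// Returns the node for `suffix`, or nullptr if it was never inserted.
const SketchNode* sketch_find(const SketchNode& root,
                              const std::string& suffix) {
    const SketchNode* node = &root;
    for (auto it = suffix.rbegin(); it != suffix.rend(); ++it) {
        auto child = node->children.find(*it);
        if (child == node->children.end()) return nullptr;
        node = child->second.get();
    }
    return node;
}
```

With n = 2, adding "walked" stores counts for "", "d", and "ed" only, so a lookup of "ked" finds nothing.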
| void Tanl::POS::SuffixGuesser::serialize (std::istream &in) |
Deserializes a SuffixGuesser object.
| in | The stream from which the object will be read |
References Tanl::POS::TrieNode::serialize(), serialize(), theta, and trie.
| void Tanl::POS::SuffixGuesser::serialize (std::ostream &out) |
Serializes a SuffixGuesser object.
| out | The stream to which the object will be written |
References Tanl::POS::TrieNode::serialize(), theta, and trie.
Referenced by serialize().