Tanl Linguistic Pipeline
This is a language model that computes P(C | A, B) with linear interpolation.
#include <linear_interpolated_lm.h>
Public Types

  typedef std::tr1::unordered_map< int, Context< CT > * > CMap
  typedef std::tr1::unordered_map< CT, double > WordFreq
Public Member Functions

  void serialize (std::ostream &out)
  void serialize (std::istream &in)
  void add_word (std::vector< int >::const_iterator context, int n, CT &word)
      Add token word in context context.
  WordFreq & get_words ()
  unsigned total_context_freq ()
  size_t word_count_at_context ()
  std::vector< double > calculate_lambdas (int level)
  void counts_to_prob (std::vector< double > &lambdas)
      Translate frequencies to log probabilities.
  double wordprob (CT const &word, std::vector< int > &context)
Public Attributes

  unsigned freq
      Total word occurrences.
  CMap childs
      Context map.
  WordFreq words
      Word map.
This is a language model that computes P(C | A, B) with linear interpolation, i.e.

P(C | A B) = l3 ML(C | A B) + l2 ML(C | B) + l1 ML(C) + l0

The calculation of the lambdas is the same as in Brants (2000).
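The lambda estimation in Brants (2000) is deleted interpolation: each trigram's count is credited to the order whose "deleted" ML estimate is largest, and the three totals are then normalized. A minimal sketch of that procedure follows; the map layout, the function name `estimate_lambdas`, and the use of `std::map` are illustrative assumptions, not the Tanl API.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <vector>

// Deleted-interpolation lambda estimation as in Brants (2000).
// For every trigram (t1, t2, t3), compare three ML estimates computed with
// the trigram itself removed, and add its count f to the lambda of the
// winning order. The count maps below are hypothetical stand-ins.
std::vector<double> estimate_lambdas(
    const std::map<std::vector<int>, int>& tri,  // f(t1, t2, t3)
    const std::map<std::vector<int>, int>& bi,   // f(t1, t2)
    const std::map<int, int>& uni,               // f(t)
    int N)                                       // tokens in the corpus
{
    std::vector<double> l(3, 0.0);               // l[0]=l1, l[1]=l2, l[2]=l3
    auto safe = [](double num, double den) { return den > 0 ? num / den : 0.0; };
    for (const auto& [t, f] : tri) {
        int t1 = t[0], t2 = t[1], t3 = t[2];
        double c3 = safe(f - 1,                bi.at({t1, t2}) - 1);
        double c2 = safe(bi.at({t2, t3}) - 1,  uni.at(t2) - 1);
        double c1 = safe(uni.at(t3) - 1,       N - 1);
        if (c3 >= c2 && c3 >= c1) l[2] += f;     // trigram estimate wins
        else if (c2 >= c1)        l[1] += f;     // bigram estimate wins
        else                      l[0] += f;     // unigram estimate wins
    }
    double s = l[0] + l[1] + l[2];
    if (s > 0) for (double& x : l) x /= s;       // normalize so they sum to 1
    return l;
}
```

The same pass over the counts yields all three lambdas at once, which is why the documented `calculate_lambdas` can run after counting and before `counts_to_prob`.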
The data structure is similar to the one used by SRILM: a context tree whose root holds the null context (unigrams), with a node B for bigrams starting with B, and a node B->A for trigrams starting with A B. Every context node has a word map storing the frequency of each word given that context.
The item type is parametric: int for tags, string for words.
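The node layout described above can be sketched roughly as follows, using the documented member names and typedefs (modernized here to `std::unordered_map` instead of `std::tr1`; the real class also carries the serialization and probability methods):

```cpp
#include <string>
#include <unordered_map>

// Rough sketch of one context-tree node; CT is int for tags or
// std::string for words. Member names follow the documented attributes.
template <typename CT>
struct Context {
    unsigned freq = 0;                          // total word occurrences here
    std::unordered_map<int, Context*> childs;   // context map: one child per item
    std::unordered_map<CT, double> words;       // word map: frequency per word
    ~Context() {
        for (auto& [key, child] : childs) delete child;  // free the subtree
    }
};
```

Each deeper node narrows the context by one item, so the tree depth equals the n-gram order minus one.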
This module first counts frequencies, then computes the lambda parameters of the linear interpolation, and finally transforms the frequencies into probabilities.
TODO: the freq counting and the lambda calculation should be separated.
void Tanl::POS::Context< CT >::add_word (std::vector< int >::const_iterator context, int n, CT &word) [inline]
Add token word in context context.
Consider as contexts the n-grams [context, context+i], 0 <= i < n. For example, to add the trigram A B C, walk down the tree and add the word C at every level.
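The descent just described can be re-created as a short recursion. This is a sketch under assumptions, not the Tanl source: the node type `Node`, the traversal order of the context items, and the counting details are all illustrative.

```cpp
#include <unordered_map>
#include <vector>

// Minimal node mirroring the documented attributes (freq, childs, words).
template <typename CT>
struct Node {
    unsigned freq = 0;
    std::unordered_map<int, Node*> childs;
    std::unordered_map<CT, double> words;
};

// Record `word` at this node (its context so far), then descend one more
// context item and repeat, so every prefix of the context gets a count.
// The order in which `context` items are consumed is an assumption here.
template <typename CT>
void add_word(Node<CT>& node, std::vector<int>::const_iterator context,
              int n, const CT& word) {
    ++node.freq;                                 // one more token seen here
    node.words[word] += 1.0;                     // count word under this context
    if (n == 0) return;                          // deepest context reached
    Node<CT>*& child = node.childs[*context];    // child for next context item
    if (!child) child = new Node<CT>();
    add_word(*child, context + 1, n - 1, word);  // descend one level
}
```

One call thus updates the unigram, bigram, and trigram counts for the word in a single downward pass.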
References Tanl::POS::Context< CT >::add_word(), Tanl::POS::Context< CT >::childs, Tanl::POS::Context< CT >::freq, and Tanl::POS::Context< CT >::words.
Referenced by Tanl::POS::Context< CT >::add_word().
std::vector< double > Tanl::POS::Context< CT >::calculate_lambdas (int level) [inline]
void Tanl::POS::Context< CT >::counts_to_prob (std::vector< double > &lambdas) [inline]
Translate frequencies to log probabilities.
At level n we need the probability of the word at level n-1. For example,

P(C | A B) = l3 ML(C | A B) + l2 ML(C | B) + l1 ML(C) + l0

can be rewritten as

P(C | A B) = l3 ML(C | A B) + P(C | B)

where P(C | B) = l2 ML(C | B) + l1 ML(C) + l0 is calculated first.
ML is the maximum likelihood probability:

ML(C) = f(C) / N
ML(C | B) = f(B, C) / f(B)
ML(C | A B) = f(A, B, C) / f(A, B)

where N is the number of tokens in the corpus.
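The decomposition can be checked numerically: the trigram probability equals the top-level term plus the already-computed bigram interpolation, so each level only ever needs the level below it. All the lambda and ML values below are made-up toy numbers, not output of the Tanl model.

```cpp
#include <cassert>
#include <cmath>

// Verify that l3 ML(C|A B) + l2 ML(C|B) + l1 ML(C) + l0 equals
// l3 ML(C|A B) + P(C|B) when P(C|B) is interpolated first.
double interp_example() {
    double l3 = 0.6, l2 = 0.25, l1 = 0.1, l0 = 0.05;  // toy lambdas
    double ml_tri = 0.5, ml_bi = 0.3, ml_uni = 0.1;   // toy ML estimates
    double p_bi  = l2 * ml_bi + l1 * ml_uni + l0;     // P(C | B), done first
    double p_tri = l3 * ml_tri + p_bi;                // P(C | A B)
    return p_tri;
}
```

This is why `counts_to_prob` processes the tree bottom-up: each node reuses the interpolated probability of the shorter context.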
References Tanl::POS::Context< CT >::childs.