改进的迭代尺度法
改进的迭代尺度法(improved iterative scaling,IIS)是一种最大熵模型学习的最优化算法。
已知最大熵模型为 \[P_{\pmb{w}} (y|\pmb{x}) = \frac{1}{Z_{\pmb{w}}(\pmb{x})} \exp \left( \sum_{i=1}^n w_i f_i(\pmb{x},y) \right)\] 其中 \[Z_{\pmb{w}} (\pmb{x}) = \sum_{\pmb{y}} \exp \left( \sum_{i=1}^n w_i f_i(\pmb{x},y) \right)\] 对数似然函数为 \[L(\pmb{w}) = \sum_{\pmb{x},y} \tilde{P} (\pmb{x},y) \sum_{i=1}^n w_i f_i(\pmb{x},y) - \sum_{\pmb{x}} \tilde{P}(\pmb{x}) \log Z_{\pmb{w}} (\pmb{x})\] 目标是通过极大似然估计学习模型参数,即求对数似然函数的极大值\(\hat{\pmb{w}}\)。
IIS的想法:假设最大熵模型当前的参数向量是\(\pmb{w} = (w_1,w_2,\cdots,w_n)^T\),我们希望找到一个新的参数向量\(\pmb{w}+\pmb{\delta} = (w_1+\delta_1,w_2+\delta_2,\cdots,w_n+\delta_n)^T\),使得模型的对数似然函数值增大。如果能有这样一种参数向量更新的方法\(\tau:\pmb{w} \to \pmb{w} + \pmb{\delta}\),那么就可以重复使用这一方法,直至找到对数似然函数的最大值。
对于给定的经验分布\(\tilde{P}(\pmb{x},y)\),模型参数从\(\pmb{w}\)到\(\pmb{w}+\pmb{\delta}\),对数似然函数的改变量是 \[\begin{align} L(\pmb{w}+\pmb{\delta}) - L(\pmb{w}) &= \sum_{\pmb{x},y} \tilde{P}(\pmb{x},y) \log P_{\pmb{w}+\pmb{\delta}}(y|\pmb{x}) - \sum_{\pmb{x},y} \tilde{P}(\pmb{x},y) \log P_{\pmb{w}} (y|\pmb{x}) \nonumber \\ &= \sum_{\pmb{x},y} \tilde{P}(\pmb{x},y) \sum_{i=1}^n \delta_i f_i(\pmb{x},y) - \sum_{\pmb{x}} \tilde{P}(\pmb{x}) \log \frac{Z_{\pmb{w}+\pmb{\delta}}(\pmb{x})}{Z_{\pmb{w}}(\pmb{x})} \nonumber \end{align}\]
利用不等式 \[- \log \alpha \geqslant 1 - \alpha, \quad \alpha > 0\] 建立对数似然函数改变量的下界: \[\begin{align} L(\pmb{w}+\pmb{\delta}) - L(\pmb{w}) & \geqslant \sum_{\pmb{x},y} \tilde{P}(\pmb{x},y) \sum_{i=1}^n \delta_i f_i(\pmb{x},y) + 1 - \sum_{\pmb{x}} \tilde{P}(\pmb{x}) \log \frac{Z_{\pmb{w}+\pmb{\delta}}(\pmb{x})}{Z_{\pmb{w}}(\pmb{x})} \nonumber \\ &= \sum_{\pmb{x},y} \tilde{P}(\pmb{x},y) \sum_{i=1}^n \delta_i f_i(\pmb{x},y) + 1 - \sum_{\pmb{x}} \tilde{P}(\pmb{x}) \sum_{\pmb{y}} P_{\pmb{w}} (\pmb{y}|\pmb{x}) \exp \sum_{i=1}^n \delta_i f_i(\pmb{x},y) \nonumber \end{align}\] 将右端记为 \[A(\pmb{\delta}|\pmb{w}) = \sum_{\pmb{x},y} \tilde{P}(\pmb{x},y) \sum_{i=1}^n \delta_i f_i(\pmb{x},y) + 1 - \sum_{\pmb{x}} \tilde{P}(\pmb{x}) \sum_{\pmb{y}} P_{\pmb{w}} (\pmb{y}|\pmb{x}) \exp \sum_{i=1}^n \delta_i f_i(\pmb{x},y)\] 于是有 \[L(\pmb{w}+\pmb{\delta}) - L(\pmb{w}) \geqslant A(\pmb{\delta}|\pmb{w})\] 即\(A(\pmb{\delta}|\pmb{w})\)是对数似然函数改变量的一个下界。
如果能找到适当的\(\pmb{\delta}\)使下界\(A(\pmb{\delta}|\pmb{w})\)提高,那么对数似然函数也会提高。然而,函数\(A(\pmb{\delta}|\pmb{w})\)中的\(\pmb{\delta}\)是一个向量,含有多个变量,不易同时优化。IIS试图一次只优化其中一个变量\(\delta_i\),而固定其他变量\(\delta_j,i \neq j\)。
为了达到这一目的,IIS进一步降低下界\(A(\pmb{\delta}|\pmb{w})\)。具体地,IIS引进一个量\(f^{\sharp}(\pmb{x},y)\), \[f^{\sharp}(\pmb{x},y) = \sum_i f_i (\pmb{x},y)\] 因为\(f_i\)是二值函数,故\(f^{\sharp}(\pmb{x},y)\)表示所有特征在\((\pmb{x},y)\)出现的次数。这样,\(A(\pmb{\delta}|\pmb{w})\)可以改写为 \[A(\pmb{\delta}|\pmb{w}) = \sum_{\pmb{x},y} \tilde{P}(\pmb{x},y) \sum_{i=1}^n \delta_i f_i(\pmb{x},y) + 1 - \sum_{\pmb{x}} \tilde{P}(\pmb{x}) \sum_{\pmb{y}} P_{\pmb{w}} (\pmb{y}|\pmb{x}) \exp \left( f^{\sharp}(\pmb{x},y) \sum_{i=1}^n \frac{\delta_i f_i(\pmb{x},y)}{f^{\sharp}(\pmb{x},y)} \right) \tag{6.30}\]
利用指数函数的凸性以及对任意\(i\),有\(\frac{f_i(\pmb{x},y)}{f^{\sharp}(\pmb{x},y)} \geqslant 0\)且\(\sum_{i=1}^n \frac{f_i(\pmb{x},y)}{f^{\sharp}(\pmb{x},y)} = 1\)这一事实,根据Jensen不等式,得到 \[\exp \left( \sum_{i=1}^n \frac{f_i(\pmb{x},y)}{f^{\sharp}(\pmb{x},y)} \delta_i f^{\sharp}(\pmb{x},y)\right) \leqslant \sum_{i=1}^n \frac{f_i(\pmb{x},y)}{f^{\sharp}(\pmb{x},y)} \exp (\delta_i f^{\sharp}(\pmb{x},y))\] 则公式(6.30)可改写为 \[A(\pmb{\delta}|\pmb{w}) \geqslant \sum_{\pmb{x},y} \tilde{P}(\pmb{x},y) \sum_{i=1}^n \delta_i f_i(\pmb{x},y) + 1 - \sum_{\pmb{x}} \tilde{P}(\pmb{x}) \sum_{\pmb{y}} P_{\pmb{w}} (\pmb{y}|\pmb{x}) \sum_{i=1}^n \left( \frac{f_i(\pmb{x},y)}{f^{\sharp} (\pmb{x},y)} \right) \exp (\delta_i f^{\sharp}(\pmb{x},y))\]
将不等式右端记为 \[B(\pmb{\delta}|\pmb{w}) = \sum_{\pmb{x},y} \tilde{P}(\pmb{x},y) \sum_{i=1}^n \delta_i f_i(\pmb{x},y) + 1 - \sum_{\pmb{x}} \tilde{P}(\pmb{x}) \sum_{\pmb{y}} P_{\pmb{w}} (\pmb{y}|\pmb{x}) \sum_{i=1}^n \left( \frac{f_i(\pmb{x},y)}{f^{\sharp} (\pmb{x},y)} \right) \exp (\delta_i f^{\sharp}(\pmb{x},y))\] 于是得到 \[L(\pmb{w}+\pmb{\delta}) - L(\pmb{w}) \geqslant B(\pmb{\delta}|\pmb{w})\]
这里,\(B(\pmb{\delta}|\pmb{w})\)是对数似然函数改变量一个新的(相对不紧的)下界。
求\(B(\pmb{\delta}|\pmb{w})\)对\(\delta_i\)的偏导数: \[\frac{\partial B(\pmb{\delta}|\pmb{w})}{\partial \delta_i} = \sum_{\pmb{x},y} \tilde{P}(\pmb{x},y)f_i(\pmb{x},y) - \sum_{\pmb{x}} \tilde{P}(\pmb{x}) \sum_y P_{\pmb{w}}(y|\pmb{x})f_i(\pmb{x},y) \exp(\delta_i f^{\sharp}(\pmb{x},y))\] 上式中,除\(\delta_i\)外不含任何其他变量,令偏导数为0得到 \[\sum_{\pmb{x},y} \tilde{P}(\pmb{x}) P_{\pmb{w}}(y|\pmb{x})f_i(\pmb{x},y) \exp(\delta_i f^{\sharp}(\pmb{x},y)) = E_{\tilde{P}}(f_i)\] 于是,依次对\(\delta_i\)求解上述方程可以求出\(\delta\)。
算法如下:
算法关键的一步是(a),即求解方程中的\(\delta_i\),如果\(f^{\sharp}(\pmb{x},y)\)是常数,即对任意\(\pmb{x},y\),有\(f^{\sharp}(\pmb{x},y) = M\),那么\(\delta_i\)可以显示地表示为 \[\delta_i = \frac{1}{M} \log \frac{E_{\tilde{P}}(f_i)}{E_P(f_i)}\] 如果\(f^{\sharp}(\pmb{x},y)\)不是常数,那么必须通过数值计算求\(\delta_i\),简单有效的方法是牛顿法,以\(g(\delta_i) = 0\)表示偏导数方程,牛顿法通过迭代求得\(\delta_i^*\),使得\(g(\delta_i^*) = 0\)。迭代公式是 \[\delta_i^{(k+1)} = \delta_i^{(k)} - \frac{g(\delta_i^{(k)})}{g'(\delta_i^{(k)})}\] 只要适当的选取初始值\(\delta_i^{(0)}\),由于\(\delta_i\)的方程有单根,因此牛顿法恒收敛,而且收敛速度很快。