总结深度学习中的word2vec模型。

word2vec

比较skip-gram和CBOW

Performance

The training speed can be significantly improved by using parallel training on multiple-CPU machine (use the switch '-threads N'). The hyper-parameter choice is crucial for performance (both speed and accuracy), however varies for different applications. The main choices to make are:

性能问题可通过并行计算、设置核实的超参数解决，但不同应用也需要选择不同的方案

architecture

architecture: skip-gram (slower, better for infrequent words) vs CBOW (fast)

模型结构：
- skip-gram:低频词上效果更好
- CBOW：速度快

the training algorithm:

hierarchical softmax (better for infrequent words) vs negative sampling (better for frequent words, better with low dimensional vectors)

优化方法：
- hierarchical softmax:低频词上效果更好
- negative sampling：高频词上效果更好，词向量维度低更好

其他超参数设置

- sub-sampling of frequent words: can improve both accuracy and speed for large data sets (useful values are in range 1e-3 to 1e-5)
- dimensionality of the word vectors: usually more is better, but not always
- context (window) size: for skip-gram usually around 10, for CBOW around 5

根据Word2Vec作者Mikolov的说法，两者的比较如下：

Skip-gram: works well with small amount of the training data, represents well even rare words or phrases.

CBOW: several times faster to train than the skip-gram, slightly better accuracy for the frequent words. 这里给出我对以上比较的理解。

关于训练集大小：Skip-gram在数据集较小时能取得更好的效果；原因我认为有两点：1. 数据集较小时，低频词较多，Skip-gram对低频词的表示效果较好（原因后面会说）；2. 同样的语料库，Skip-gram能产生更多(子)样本。

关于训练速度：CBOW训练速度更快。这是因为，一个batch中，CBOW包含batch_size个中心词样本，Skip-gram包含batch_size/num_skips个中心词样本，同样的batch_size遍历一次样本集，CBOW需要更少的step。

关于低频高频词的表示效果：Skip-gram对低频词的表示效果优于CBOW，CBOW对高频词的表示效果略优于Skip-gram。原因的话，我从两个方面进行阐述：1. 直观上来理解，CBOW通过上下文预测中心词，如知识__命运，类似于完形填空，选择概率最高的一个词填入其中，这就要求词向量对于概率最高的那些词有足够的「理解」；而Skip-gram则是根据中心词填周围的词，如改变_，要在中心词周围填词，词向量需要对该词有足够的「理解」才能填词，包括低频词。2. 从反向传播更新词向量的角度来看，在反向传播更新词向量时，CBOW使用相同的梯度更新所有输入词向量，如对于（[知识，命运]，改变）这个样本，知识和命运的词向量使用同样的梯度进行更新，这就使得低频词的词向量更新被高频词「平滑」掉了。而Skip-gram则不存在这个问题。

buaawht

word2vec模型

word2vec

比较skip-gram和CBOW

Search

Table of Contents