One-hot encoding

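One discrete representation of text encodes each word as a one-hot vector.
One-hot vector: a sparse vector with a single 1 and 0 everywhere else.
In this discrete one-hot representation all word vectors are orthogonal, which is a serious problem: one-hot vectors carry no notion of similarity, and their dimensionality (equal to the vocabulary size) is very large.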
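
To make the orthogonality problem concrete, here is a minimal sketch. The three-word vocabulary and the variable names are made up for illustration; the point is only that two different one-hot vectors always have dot product 0, however related the words are.

```python
import numpy as np

# Hypothetical toy vocabulary, chosen only for illustration.
vocab = ["hotel", "motel", "banana"]
word_to_index = {w: i for i, w in enumerate(vocab)}

# Each row of the identity matrix is the one-hot vector of one word.
one_hot = np.eye(len(vocab))

hotel = one_hot[word_to_index["hotel"]]
motel = one_hot[word_to_index["motel"]]

# Distinct words are orthogonal: the dot product carries no similarity signal.
dot_hotel_motel = float(hotel @ motel)
# A word compared with itself gives 1.
dot_hotel_hotel = float(hotel @ hotel)

print(dot_hotel_motel, dot_hotel_hotel)
```

With a realistic vocabulary the vectors would also have hundreds of thousands of dimensions, which is the second problem noted above.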

Word2vec

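Goal: build a dense vector for each word such that it is similar to the vectors of words that appear in similar contexts.
Word vectors are sometimes called word embeddings or word representations.
Dense word vectors are a distributed representation.
Word2Vec is an iterative model: it learns from text step by step and ends up encoding, in the word vectors, the probability of a word given its context, rather than computing and storing global statistics over a huge dataset (possibly billions of sentences).
The idea is to design a model whose parameters are the word vectors. The model is trained against an objective function: at each iteration we compute the error and use an optimization algorithm to adjust the parameters (the word vectors) so that the loss decreases, which eventually yields the learned word vectors. In neural networks the gradients for these updates are computed by backpropagation; the simpler the model and the task, the faster it trains. Iteration-based methods capture word co-occurrence one example at a time, rather than capturing all co-occurrence counts at once as the SVD approach does.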

Two algorithms: continuous bag-of-words (CBOW) and skip-gram

CBOW

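CBOW predicts the center word from the context words around it.
Procedure: context one-hot vectors -> multiply by matrix V to get the input word vectors -> average them -> multiply by U -> softmax, which gives a probability distribution over the center word.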
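
The CBOW procedure above can be sketched in NumPy as a single forward pass. This is only an illustration under assumed conditions: the sizes, the random initialization, the word indices, and the function name `cbow_forward` are all hypothetical, and V and U are stored here with one row per word.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 6, 3  # illustrative sizes, not from the notes

V = rng.normal(size=(vocab_size, embed_dim))  # input (context) word vectors
U = rng.normal(size=(vocab_size, embed_dim))  # output (center) word vectors

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(z - z.max())
    return e / e.sum()

def cbow_forward(context_ids):
    one_hots = np.eye(vocab_size)[context_ids]  # [n_context, vocab_size]
    context_vecs = one_hots @ V                 # [n_context, embed_dim]
    h = context_vecs.mean(axis=0)               # averaged context, [embed_dim]
    scores = U @ h                              # one score per word, [vocab_size]
    return softmax(scores)                      # distribution over the center word

probs = cbow_forward([1, 3])        # hypothetical context word indices
center_id = 2                       # hypothetical true center word
loss = -np.log(probs[center_id])    # cross-entropy against the true center word
```

Multiplying a one-hot vector by V just selects a row of V, so in practice the product is replaced by an index lookup.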

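skip-gram

skip-gram does the opposite: it predicts the probability distribution of the surrounding context words from the center word.
① Input: the one-hot vector of the center word.
② Multiply the input by the center-word matrix to get the word vector.
③ Multiply the word vector by a second matrix, the context-word matrix, to get a "similarity" score for every word in the vocabulary.
④ Apply softmax to the similarity scores to get probabilities, then compare them with the true context words to compute the loss.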
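
The four skip-gram steps above can be sketched as one forward pass in NumPy. The sizes, random weights, word indices, and the names `W_center`, `W_context`, and `skipgram_forward` are assumptions made for illustration, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 6, 3  # illustrative sizes, not from the notes

W_center = rng.normal(size=(vocab_size, embed_dim))   # one row per center word
W_context = rng.normal(size=(vocab_size, embed_dim))  # one row per context word

def skipgram_forward(center_id):
    x = np.eye(vocab_size)[center_id]           # step 1: one-hot, [vocab_size]
    v_c = x @ W_center                          # step 2: center word vector, [embed_dim]
    scores = W_context @ v_c                    # step 3: dot-product scores, [vocab_size]
    exp_scores = np.exp(scores - scores.max())  # step 4: numerically stable softmax
    return exp_scores / exp_scores.sum()

probs = skipgram_forward(center_id=2)           # hypothetical center word index
true_context_ids = [1, 3]                       # hypothetical observed context words
# Loss: sum of negative log-probabilities of the observed context words.
loss = -np.log(probs[true_context_ids]).sum()
```

The same distribution is reused for every context position; only the target word changes from one (center, context) pair to the next.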

Code implementation

# %%
# code by Tae Hwan Jung @graykode
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt

def random_batch():
    random_inputs = []
    random_labels = []
    random_index = np.random.choice(range(len(skip_grams)), batch_size, replace=False)

    for i in random_index:
        random_inputs.append(np.eye(voc_size)[skip_grams[i][0]])  # target
        random_labels.append(skip_grams[i][1])  # context word

    return random_inputs, random_labels

# Model
class Word2Vec(nn.Module):
    def __init__(self):
        super(Word2Vec, self).__init__()
        # W and WT are independent weight matrices, not transposes of each other
        self.W = nn.Linear(voc_size, embedding_size, bias=False) # voc_size > embedding_size Weight
        self.WT = nn.Linear(embedding_size, voc_size, bias=False) # embedding_size > voc_size Weight

    def forward(self, X):
        # X : [batch_size, voc_size]
        hidden_layer = self.W(X) # hidden_layer : [batch_size, embedding_size]
        output_layer = self.WT(hidden_layer) # output_layer : [batch_size, voc_size]
        return output_layer

if __name__ == '__main__':
    batch_size = 2 # mini-batch size
    embedding_size = 2 # embedding size

    sentences = ["apple banana fruit", "banana orange fruit", "orange banana fruit",
                 "dog cat animal", "cat monkey animal", "monkey dog animal"]

    word_sequence = " ".join(sentences).split()
    word_list = sorted(set(word_sequence))  # sorted so word indices are reproducible across runs
    word_dict = {w: i for i, w in enumerate(word_list)}
    voc_size = len(word_list)

    # Make skip gram of one size window
    skip_grams = []
    for i in range(1, len(word_sequence) - 1):
        target = word_dict[word_sequence[i]]
        context = [word_dict[word_sequence[i - 1]], word_dict[word_sequence[i + 1]]]
        for w in context:
            skip_grams.append([target, w])

    model = Word2Vec()

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    # Training
    for epoch in range(6000):
        input_batch, target_batch = random_batch()
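        # Stack the list of one-hot arrays into a single ndarray first; building a
        # tensor directly from a list of ndarrays is slow and triggers a warning
        input_batch = torch.Tensor(np.array(input_batch))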
        target_batch = torch.LongTensor(target_batch)

        optimizer.zero_grad()
        output = model(input_batch)

        # output : [batch_size, voc_size], target_batch : [batch_size] (LongTensor, not one-hot)
        loss = criterion(output, target_batch)
        if (epoch + 1) % 1000 == 0:
            print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss.item()))

        loss.backward()
        optimizer.step()

    # model.W.weight has shape [embedding_size, voc_size]; column i is the 2-D embedding of word i
    W = model.W.weight.detach()
    for i, label in enumerate(word_list):
        x, y = W[0][i].item(), W[1][i].item()
        plt.scatter(x, y)
        plt.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points', ha='right', va='bottom')
    plt.show()

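Additional notes
The more similar two vectors are, the larger their dot product, and hence the larger the probability obtained after normalization.
Training aims precisely at giving words with similar contexts similar vectors.
The dot product is a simple way to measure similarity; attention mechanisms often use it to compute the score (see Seq2Seq sequence models and the attention mechanism).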

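The "soft" and "max" in the name softmax (used all the time in deep learning) can be explained as follows:
max: because it amplifies the probability of the largest value
soft: because it still assigns some probability to the smaller values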
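
A small sketch may help show both halves of the name at once. The scores below are arbitrary illustrative numbers, and the function is a plain-Python version written for this note rather than a library call.

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.0, 2.0, 5.0]  # arbitrary example scores
probs = softmax(scores)

# "max": the largest score takes most of the probability mass.
# "soft": the smaller scores still get a nonzero share.
print(probs)
```

A hard arg-max would instead put all the mass on the last entry and leave the others at exactly 0.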

Derivation of the gradient descent training details in word2vec

Omitted for now.

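Word vectors via SVD dimensionality reduction

Building a word co-occurrence matrix and factorizing it with SVD is one way to construct word embeddings (i.e. word vectors).

First, go through a large dataset and accumulate the word co-occurrence count matrix.
Then apply SVD to that matrix.
Finally, use the rows of the matrix U as the word vectors for all words in the vocabulary. Why??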
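
On why the rows of U are used: SVD factors the co-occurrence matrix as X = U S V^T with singular values sorted from largest to smallest, and keeping only the first k singular directions gives the best rank-k approximation of X. Row i of the truncated U is therefore a k-dimensional summary of word i's co-occurrence pattern. The sketch below illustrates the three steps on a tiny corpus; the corpus, the window size of 1, and k = 2 are assumptions chosen for illustration.

```python
import numpy as np

# Toy corpus (an assumption for illustration); co-occurrence window of 1 word.
corpus = ["i like deep learning", "i like nlp", "i enjoy flying"]
tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Step 1: symmetric co-occurrence counts of adjacent words.
X = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for a, b in zip(sent, sent[1:]):
        X[index[a], index[b]] += 1
        X[index[b], index[a]] += 1

# Step 2: SVD of the count matrix; S holds singular values in descending order.
U, S, Vt = np.linalg.svd(X)

# Step 3: keep the first k columns of U; row i is the k-dimensional vector of word i.
k = 2
word_vectors = U[:, :k]

for w in vocab:
    print(w, np.round(word_vectors[index[w]], 3))
```

Some variants scale the kept columns by the singular values (U[:, :k] * S[:k]); which choice works better is an empirical question.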

CS224N