One-hot vectors with a fixed vocabulary

Tags: data-mining, one-hot-encoding, bag-of-words

2022-03-10 05:52:27

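Suppose we are given a vocabulary $V$ with $|V|=4$, for example V = {I, want, this, cat}.

What does the bag-of-words representation look like for the following example sentences, using this vocabulary and one-hot encoding?

You are the dog here
I am fifty
Cat cat cat

I think it looks like this: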

$V_1 = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \\ \end{pmatrix}$
$V_2 = \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \\ \end{pmatrix}$
$V_3=\begin{pmatrix} 0 \\ 0 \\ 0 \\ 1 \\ \end{pmatrix}$

But what is the point of this representation? Does it expose a weakness of one-hot encoding with a fixed vocabulary, or am I missing something?

1 Answer

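library(quanteda)

# Build the fixed vocabulary from the reference text.
# Recent quanteda versions expect tokens() before dfm().
mytext <- c(oldtext = "I want this cat")
dtm_old <- dfm(tokens(mytext))
dtm_old

# Build a document-feature matrix for a new sentence
newtext <- c(newtext = "You are the dog here")
dtm_new <- dfm(tokens(newtext))
dtm_new

# Keep only the features of the fixed vocabulary;
# words outside the vocabulary are dropped
dtm_matched <- dfm_match(dtm_new, featnames(dtm_old))
dtm_matched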

$V_1$

Document-feature matrix of: 1 document, 4 features (100.0% sparse).
         features
docs      i want this cat
  newtext 0    0    0   0

$V_2$

Document-feature matrix of: 1 document, 4 features (75.0% sparse).
         features
docs      i want this cat
  newtext 1    0    0   0

$V_3$

Document-feature matrix of: 1 document, 4 features (75.0% sparse).
         features
docs      i want this cat
  newtext 0    0    0   3

Of course, with a "one-hot" vectorizer, the entry for "cat" in $V_3$ would be 1 (rather than the count).
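
To make the difference between counts and binary indicators concrete, here is a minimal sketch in base R that reproduces the three vectors from the question. The helper `encode_binary` and the whitespace-based tokenization are assumptions made for illustration; they are not part of quanteda, and a real tokenizer would also handle punctuation.

```r
# Fixed vocabulary from the question, lowercased to match the output above
vocab <- c("i", "want", "this", "cat")

# Hypothetical helper: binary bag-of-words vector over a fixed vocabulary
encode_binary <- function(sentence, vocab) {
  # Lowercase and split on whitespace (a simplification of real tokenization)
  tokens <- strsplit(tolower(sentence), "\\s+")[[1]]
  # 1 if the vocabulary word occurs at least once, 0 otherwise
  as.integer(vocab %in% tokens)
}

sentences <- c(V1 = "You are the dog here",
               V2 = "I am fifty",
               V3 = "Cat cat cat")

# One row per sentence, one column per vocabulary word
encoded <- t(sapply(sentences, encode_binary, vocab = vocab))
colnames(encoded) <- vocab
print(encoded)
```

In this sketch the first sentence maps to an all-zero row because none of its words are in the vocabulary, which is the weakness the question points at: anything outside the fixed vocabulary leaves no trace in the encoding.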
