One-hot vectors with a fixed vocabulary

data-mining one-hot-encoding bag-of-words
2022-03-10 05:52:27

Suppose we are given a vocabulary of size |V| = 4, for example V = {I, want, this, cat}.

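What do the bag-of-words representations of the following example sentences look like when using this vocabulary and one-hot encoding?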

  1. You are the dog here
  2. I am fifty
  3. cat cat cat

I think they look like this:

  1. V1=(0000)

  2. V2=(1000)

  3. V3=(0001)

But what is the point of such a representation? Does it expose a weakness of one-hot encoding with a fixed vocabulary, or am I missing something?

1 Answer
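library(quanteda)

# Build the document-feature matrix that defines the fixed vocabulary.
# Recent quanteda versions expect a tokens object, so tokenize first.
mytext <- c(oldtext = "I want this cat")
dtm_old <- dfm(tokens(mytext))
dtm_old

# A new sentence that shares no words with the vocabulary
newtext <- c(newtext = "You are the dog here")
dtm_new <- dfm(tokens(newtext))
dtm_new

# Keep only the features of the old vocabulary; features absent from the new text become 0
dtm_matched <- dfm_match(dtm_new, featnames(dtm_old))
dtm_matched

# Repeat with "I am fifty" and "cat cat cat" to obtain V2 and V3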

V1

Document-feature matrix of: 1 document, 4 features (100.0% sparse).
         features
docs      i want this cat
  newtext 0    0    0   0

V2

Document-feature matrix of: 1 document, 4 features (75.0% sparse).
         features
docs      i want this cat
  newtext 1    0    0   0

V3

Document-feature matrix of: 1 document, 4 features (75.0% sparse).
         features
docs      i want this cat
  newtext 0    0    0   3

Of course, with a true one-hot vectorizer the entry for "cat" in V3 would be 1 rather than the count 3.
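To make the difference between counts and one-hot (binary) entries concrete, here is a minimal base R sketch that does not depend on quanteda. The helper `bow_counts` and its simple letters-only tokenization are assumptions made for illustration, not part of any package.

```r
# Fixed vocabulary from the question, lower-cased to match the output above
vocab <- c("i", "want", "this", "cat")

# Hypothetical helper: count how often each vocabulary word occurs in a sentence
bow_counts <- function(sentence, vocab) {
  # Lower-case, then split on anything that is not a letter
  words <- strsplit(tolower(sentence), "[^a-z]+")[[1]]
  # A factor with fixed levels drops out-of-vocabulary words and keeps zero counts
  counts <- table(factor(words, levels = vocab))
  setNames(as.integer(counts), vocab)
}

sentences <- c("You are the dog here", "I am fifty", "cat cat cat")

# One row per sentence, one column per vocabulary word
count_vecs <- t(sapply(sentences, bow_counts, vocab = vocab))

# Binary version: 1 if the word occurs at all, 0 otherwise
binary_vecs <- (count_vecs > 0) * 1L

print(count_vecs)
print(binary_vecs)
```

The count matrix reproduces V1 to V3 as printed above, while the binary matrix matches the vectors proposed in the question. In quanteda itself, `dfm_weight(dtm_matched, scheme = "boolean")` should perform the same conversion on a dfm.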