What is a good way to identify outliers in exam data?

outliers psychometrics
2022-04-11 20:36:22

I give my students an exam with 8 questions. Each question relates to a specific topic. The exam is assembled by randomly selecting, for each topic, 1 question from a pool of questions on that topic. Each pool contains 20 questions. I am concerned that some pools may contain outlier questions (i.e., questions that are noticeably easier or harder than the others).

I want to know whether the questions within each pool are essentially equivalent in difficulty, or whether a pool contains particular questions that are clearly harder or easier than the rest. I have scores from about 300 students.

Can anyone suggest a method that would let me rank each question by difficulty, based on how students performed on it relative to the other questions on their exam instance?

As requested in the comments, here is my current naive approach:

Assume the exam consists of $n$ questions, each drawn from a specific pool. The questions form a set of elements of the form $q_{p,i}$, where $p$ is the pool from which the question is drawn and $i$ is the instance of the question within that pool. For ease of notation, assume every pool has $m$ instances. So each exam $e \subset \{q_{p,i} \mid 0 \le p < n,\; 0 \le i < m\}$, and with $s$ students we have $s$ exams $\{e_1, \ldots, e_s\}$. I want to verify that, for fixed $p$, the difficulty of all $q_{p,i}$, $0 \le i < m$, is roughly similar.

To determine the relative difficulty of $q_{p,j}$, I would look at all exams that contain $q_{p,j}$ and compare each such student's score on $q_{p,j}$ with their score on the rest of their exam, i.e. $r = \sum_{x=0,\; x \neq p}^{x < n} q_{x,i_x}$, where $i_x$ denotes the instance that particular student was given. I would then aggregate the differences $d_{p,j} = q_{p,j} - r$ over all students who saw $q_{p,j}$. Finally, I would compare the $d_{p,j}$ within a given pool: if one particular $|d_{p,j}|$ is much larger than the others (more than 1 standard deviation?), I would adjust that question's weight.
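A minimal sketch of this difference-score idea in Python, assuming the responses sit in a long-format table (the file name and the columns "student", "pool", "item", "score" are hypothetical placeholders, not part of the question):

```python
import pandas as pd

# Hypothetical long-format data: one row per (student, question) with
# columns "student", "pool", "item" (instance index within a pool),
# and "score" (points earned on that question).
df = pd.read_csv("exam_responses.csv")

# r: each student's score on the rest of their exam, i.e. the exam
# total minus the question currently under consideration.
exam_total = df.groupby("student")["score"].transform("sum")
rest = exam_total - df["score"]

# d_{p,j}: for each pool instance, the mean over students of
# (score on q_{p,j}) - (score on the rest of the exam).
d = (df["score"] - rest).groupby([df["pool"], df["item"]]).mean()

# Within each pool, flag instances whose d lies more than one
# standard deviation from the pool's mean d.
pool_mean = d.groupby(level="pool").transform("mean")
pool_std = d.groupby(level="pool").transform("std")
print(d[(d - pool_mean).abs() > pool_std])
```

Note that because each question is compared against the sum over the other questions, the $d_{p,j}$ are only comparable within a pool, which is why the flagging above is done per pool.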

Suggestions? Comments?

2 Answers

Have you considered an approach based on Item Response Theory? IRT is designed exactly for purposes like this.

A simple example is the Rasch model, which lets you estimate both student ability and question difficulty within a single generalized linear model (or generalized linear mixed model). With a binary answer format, the Rasch model can be written as

$$P(X_{ij}=1)=\frac{\exp(\theta_i-\beta_j)}{1+\exp(\theta_i-\beta_j)}$$

where a correct response ($X_{ij}=1$) by the $i$-th student to the $j$-th test item is modeled as a function of the student's ability $\theta_i$ and the item's difficulty $\beta_j$. There are also models that consider more than one item parameter (e.g. item discrimination, guessing), models for items with polytomous answer formats, and models that include additional explanatory or grouping variables. In measuring students' abilities the model "weights" test items by their difficulty, so you don't have to worry about the fact that some items are easier and some are harder. If you are interested in item difficulty, you can check the $\beta_j$ values. There are also additional tools for measuring item- and person-fit that may be helpful for identifying item- and person-outliers.
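To make this concrete, here is a minimal sketch of fitting a Rasch-type model as an ordinary logistic GLM with statsmodels; dedicated IRT packages (e.g. eRm or mirt in R) would be the more standard route, and the file and column names below are assumptions carried over from the sketch above:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per (student, question),
# with "correct" coded 0/1 and "question" identifying the item.
df = pd.read_csv("exam_responses.csv")

# Rasch model as a logistic GLM: one ability parameter per student
# and one difficulty parameter per question, estimated jointly.
fit = smf.logit("correct ~ C(student) + C(question)", data=df).fit()

# Question coefficients (relative to a reference question): sorting
# them orders items from hardest to easiest, and extreme values flag
# potential outlier items.
item_effects = fit.params.filter(like="C(question)").sort_values()
print(item_effects)
```

Jointly estimating one fixed effect per student this way is statistically crude (the IRT literature prefers conditional or marginal maximum likelihood), but it conveys the idea in a single model fit.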

This model is suited for "static" tests with a finite number of items, but a missing-data design is also possible, where you treat the non-answered questions as missing data and impute their answers using the model. Usually the EM algorithm is used for estimation, but for more complicated designs a Bayesian approach is more suitable.

Those methods are really design-specific, so it is hard to give a single answer. There are multiple books available, e.g. the nice introduction Item Response Theory for Psychologists by Susan E. Embretson and Steven P. Reise.

You might try fitting a logistic model. The response is whether the answer was correct or not. You could add a (random?) effect for the student and a fixed effect for each question. You could then develop a ranking based on the coefficients for each question.
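A hedged sketch along those lines, using statsmodels' variational Bayes mixed GLM with a random intercept per student and a fixed effect per question (the data layout and names are assumed, as in the sketches above):

```python
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# Hypothetical long-format data with 0/1 "correct", plus "student"
# and "question" identifiers.
df = pd.read_csv("exam_responses.csv")

# Logistic model: fixed effect per question, random effect per student.
model = BinomialBayesMixedGLM.from_formula(
    "correct ~ 0 + C(question)",      # one fixed effect per question
    {"student": "0 + C(student)"},    # student as a variance component
    data=df,
)
result = model.fit_vb()  # variational Bayes estimation

# Posterior means of the question effects; sorting gives the ranking.
ranking = pd.Series(result.fe_mean, index=model.exog_names).sort_values()
print(ranking)
```

Treating students as a random effect rather than 300 fixed effects keeps the parameter count down and shrinks the student estimates, which is usually what you want when the students themselves are not the object of interest.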