Thursday, October 10, 2013

10-17 W. John Wilbur and Won Kim. Improving a gold standard: treating human relevance judgments of MEDLINE document pairs


  1. 1. In what ways do the authors try to avoid overtraining in their machine learning method? Do you think that these were successful?

    2. What is the standard that the authors use to measure if a method has "better prediction" (p. 6)? Since they only had limited relevance judgments from the human judges, can this be accurate?

    3. How does the title “Improving a Gold Standard” describe this article? On p. 7, the authors write, “Since we do not have true labels and true labels are philosophically inconsistent with our data … we would not be able to evaluate such an application of these models to our data except in how well they could predict judgments of held out judges.” If the models are just trying to predict how human assessors would react, how is this creating a gold standard? Would the models used in this paper be a reliable baseline for further experiments?

  2. 1. The paper presents several methods for predicting future judges based on current judges. In this study, the authors use all judges on the assumption that their judgments are all legitimate and valuable. As we all know, in real IR studies, especially those relying on crowdsourcing tools, there are many potential errors and malicious results. How can we apply the proposed methods to those studies? Also, the observation that method 5 performs poorly when predicting the judgments of judge 4 argues against the idea that all judgments should be considered legitimate and valuable.

    2. Different methods were compared using the single metric Ave. As we know, introducing more free parameters will always make a model fit the observed data better, and this improved performance can be taken as evidence of a better model. In this study method M23 is simply an addition of M2 and M3, and as a result it will fit the data better. Why weren’t standard model evaluation methods such as the Akaike information criterion (AIC) used in the comparison? For method M3, the conditional probability p(c’ assigned by any judge ≠ j|c assigned by j) doesn’t seem to match the real situation. How can a judgment by one judge affect the judgments of the other judges on the same topic?

    3. My last question is about the general application of this study. Several methods were proposed to predict judgments by future judges, but what are the applications of these methods in real cases? We are probably more interested in extracting correct judgments from currently available data, which are not necessarily the judgments of future judges.

  3. The authors use all available judgment data to predict the probabilities that different relevance categories would be assigned by some new, unseen judge to a query. To answer the important question of how the quality of these probabilities should be measured, the authors leave out one judge's judgments and measure quality by comparing how well those judgments match the probabilities. I think it is not safe to use one judge's judgments as a gold standard, since this might lead to biased results. The authors apparently assume that every single judge's judgments are valid, which is dangerous to the validity of the entire experiment.

    The authors use two panels of judges to get judgment data. As the documents were extracted from MEDLINE documents dealing with aspects of molecular biology, I am concerned about the validity of using the second panel of judges, who are untrained in molecular biology (the first panel of judges is trained in the area). If these judges possibly have no basic understanding of molecular biology, how can they make close-to-accurate judgments? Including those amateur judges in the pool is problematic.

    The authors use six different methods to predict relevance judgments for the unseen judge. The results show that Method 5 would have been the best had it not done very poorly predicting for the fourth judge. The authors' further analysis shows that the problem was too small a regularization parameter. My first question is why the other methods do not suffer from this overtraining. Based on this, the authors then propose using a single parameter. I wonder whether the results are better because of another kind of overtraining, introduced by using only one parameter instead of a parameter set?

  4. 1. From the judgment data we learn that the “…pooled results of the untrained judges were almost as good as the pooled results for the trained judges…” (pg.3) Don’t you think it would be interesting to use a survey, like in the Chilton and Horton paper, and study how the untrained judges made their decisions? Then could their method be tested and used to train crowdsourced participants? For example, if they say they looked out for keywords, could that strategy be used to improve results when untrained /crowdsourced participants are used to make judgments?

    2. The researchers say that for one of their methods they “assume there are weights that represent the value of the individual judges” (pg. 2). I’m still a little confused about how these weights, and thus the judges' value, are determined. In method 2 they indicate that it’s an arbitrary assignment, but shouldn’t training influence weight and value? Or have they established that training does not necessarily affect judgment, that untrained judges can do almost as good a job as trained judges, and that the two groups thus have equivalent value?

    3. What do the researchers mean by overtraining (pg. 3), and how do they define “optimal training” – or do they have such a standard? Are these terms different for the two groups of judges participating in this experiment?

  5. 1. I was surprised that the pooled judgments from the untrained judges were almost as good as the pooled judgments from trained judges. If the corpus had not been MEDLINE, perhaps this would still have been believable, but hasn't it been reported in other work that in more specialized domains like these, expert judgments far outweigh non-expert judgments?
    2. This leads me to another, more obvious question. Does 'untrained' mean a judge who is untrained in good search or a judge who is untrained in the subject matter? We've seen that both matter. In this context, I couldn't find a clear definition of what the authors meant by untrained.
    3. The authors keep mentioning (all the way to the conclusion) that their algorithms are different from standard machine learning because they don't have definitive gold standards. However, this statement seems a little disingenuous to me. The authors are, in a strong sense, treating the relevance assigned by the judge whose judgment they are trying to predict as analogous to a 'true label'. By doing this over all queries and all judges, they are essentially trying to make a separate classifier for each judge. Therefore, I think the situation warrants comparing their methods to a standard classifier on hold-out data that neither had access to at any point in time. Why is this problem different from any other multi-class classification problem, with a separate classifier trained for each judge?

  6. 1. My first question is about the data used here as well as the method this paper used to generate query-document pairs. The dataset is originally from a million MEDLINE documents dealing with aspects of molecular biology. This narrows the whole search down to one topic. Can this be representative of other topics? The method the paper used to generate query-document pairs is the most basic cosine retrieval algorithm, which just calculates the similarity of the bag-of-words representations of two documents. What if we used another ranking mechanism for each query? How would this affect the experiment?

    2. My second question is about the Method M23. In this paper it says it is a combination of M2 and M3. They combined M2 and M3 as a mixture of weighted terms coming from each method plus the smoothing. But there is no information about how they did the smoothing. Without that information, we can’t know the actual algorithm (or the formula), and we can’t know how that part affected the experiment results.

    3. My third question is about the explanation of the experiment results. Method M5 gave the best results on twelve of the thirteen judges, but made a bad error on one judge. The paper is pretty vague in proposing a way to improve M5. It said it is possible that such a large error could be prevented by setting a lower limit for the regularization parameter which is followed regardless of training. However, there is neither information about how to set a lower limit, nor information about how this would affect overall performance. How low must the limit be to improve the worst result? How does this change affect the overall performance?

  7. 1. On page 7, the authors mention that the single parameter optimization for Method M4 does not work. However, the results shown in Table 3 indicate an increase in score (the Table 2 M4 value is -7240 and the Table 3 M4 value is -7207), implying improved performance. Isn’t this contradictory?

    2. In equation 9 of Method M4 (Intrinsic judgments from a weighted average), why is U(k,i)(dp) and not U(k)(dp) chosen? Is judge-‘i’ in the equation a cross-validated held out member?

    3. It is a little confusing that the single parameter optimization equations produce better results than the optimal parameter (*) approach, especially when the latter avoids overtraining and generalizes better for given data. Why is this happening?

    1. The font was not supported here. The square in question 3 was 'phi'.

  8. 1. In method 2, the authors state that they assign an arbitrary positive weight to each judge. It can be understood that adding weights will improve probability estimation, but it remains to be seen how the arbitrary weights are assigned. How does the model choose weights for judges? If it is based on judges’ past judgments, how can that be done?

    2. The authors describe models and methods to predict future judgments with the caveat that they have an increasing dependence on the available judgments. From previous studies and from intuition, we know that judgments are prone to biases and errors. Did the authors verify the judgments before using them to predict future judgments? Is there any way that they can relax the extent to which the prediction depends on the current judgments by introducing some error, without compromising too much on accuracy? (E.g., like the latent variables problem in graphical models.)

    3. M5 seems to be the best of the three methods even without taking into account the one judge where it failed with a large error. It clearly produces the best result for 6 judges. Why do the authors then state that M23 is the best among the methods and go on to clarify why M5 is the best? I believe that this point needs some discussion.

  9. 1. In the probabilistic estimation methodologies the idea is to estimate the relevance of every category and every query-document pair. However, in the query formulation, every term in the query is treated as independent of the others. While this may work well for certain cases, it does not account for the semantic gap and the dependencies among the terms that form the query. How can we resolve this issue and estimate the various parameters of the query that need to be analysed? Further, how do we hope to represent the documents and queries through a dependence model which takes into account the syntactic as well as the semantic structure of the phrases?

    2. The paper proposes the use of Maximum Entropy classification, wherein the data points are classified according to the query-document pairs. It is known that certain distributions cannot be represented through binary assessments. Since our human judgements continue to be binary, how do we hope to represent such models by making use of a maximum entropy classifier? How do we hope to modify the query-document pairs in these cases so that we do manage to get a result through maximum entropy classification?

    3. I am curious about how we would go about automatically identifying different MEDLINE abstracts that are related in meaning. Further, if there are two sentences which are related in meaning, their importance would depend on the relative importance that the user places on each of them. So, how can we account for this difference in importance, which may also correspond to a difference in relatedness? And finally, how can we cater for the loss of context due to loss of focus when dealing with human judgements?

  10. The predicted relevance judgments are produced by an algorithm; so, is there any need to show that these predicted relevance judgments actually fit real users’ model of relevance judgment?

    In this experiment, since each judge assesses many different query-document pairs, the results for the query-document pairs are very likely to be correlated to some degree. However, in many statistical methods, variables are assumed to be independent, as in the correlation test. So, will this actually influence the results in this experiment?

    In this paper, weighting parameters have been included in the methods. Here, I don’t understand how these weighting parameters are set and how they are actually generated in the experiment.

  11. Wilbur and Kim use a graded relevance scale in which the judges were asked to rank from 0-4. For this they also assigned a probability aspect to each ranking in order to help show the relevance of a document. How useful is this practice in assigning .25, .5, .75, or 1.00 relevance vs. the numbering system or simple system of not relevant up to highly relevant?

    On page 8, the authors identify M23 as being the best overall method for predicting the judgments but also point out that M5 did the best except for a single judge, on which it did terribly. Knowing that missing by a wide margin is possible, would it still be feasible to use M5 even with that risk looming overhead, or would M23 be the best choice since it eliminates the outlier problem?

    In suggesting further investigations, Wilbur and Kim point out that their method could lead to instances of active learning which might involve controlling judges. What exactly do they mean by this? Do they mean simply training judges in a certain manner, using the predictive application to focus them on a gold-standard type of mentality?

  12. 1. When laying out the different methods of estimation, the authors describe M5, which uses a maximum entropy classifier. In order to set up the equation, different judges are set aside for training, testing, and determining values within the equation itself. The number of judges, at least in the study presented in this paper, is limited. Removing three or four judges from the evaluation seems like it would eat into that limited resource. The measure goes on to have a pretty good evaluation apart from really failing to predict one judge’s answers. Is this due to the limited information? Is it wise to take the effort to have an evaluation technique that requires so many different judges to be set aside to make the measure valid? One thing the papers we have read this semester, including this one, point out is the high cost of getting judgments. Considering that the likelihood of having several judges judge the same information is low, I would think the evaluation measures should avoid putting a strain on the number of judges.

    2. When comparing the different estimation equations, they find M5 performs the best for predicting the actions of all judges except one. In addition, M5 fails badly enough on the one judge to cause the M23 measure to be regarded as the best estimation technique. Since M5 is the most accurate if the one judge is disregarded, would it not still be better to use this technique over the others? The authors hypothesize potential ways to try to eliminate this error in the estimate. This seems worth investigating since the measure seemed promising. However, there were really only a small number of judges to set up and evaluate the measure against. Even though the study pinpoints potentially successful evaluation measures, is 12 judgments even realistic for drawing reliable conclusions about the usability of any of these measures?

    3. There were a handful of features related to the judges that I was uncertain about after reading this paper. The authors use multiple judgments for each pair to account for the natural difference in human relevance judgments. From the various papers we have read so far, other factors potentially influence relevance judgments. The first is whether the judges are experts in the material. The second recurring theme in past papers was the training applied to the assessors. I do not remember the paper outlining whether either of these conditions applies in this study. If the judges are all trained to regard documents in the same fashion, would this make their behavior easier to predict because they would be more inclined to act the same way? Would equations derived from trained assessors lose applicability?

  13. What exactly was the background of both the trained and untrained judges? Did the untrained judges have familiarity with MEDLINE, but not with the subject matter? Did they have familiarity with or expertise in the subject matter, but were not trained for relevance assessment? I find it hard to believe that the average person (or group of six people, as in this study) would even be comfortable enough with the PubMed interface to perform an effective search. While this is of course alleviated in this study by having the potentially-relevant documents previously retrieved by a cosine similarity-based algorithm (from Salton, 1988), I am still suspicious that these untrained judges may have still had subject matter expertise. This is because the results do not seem to fit with the intuition that SMEs would have judgments dissimilar from or not predicted by non-experts.

    I am a bit confused by their approach to weighting judgments in Method 4. Without a gold standard, they appear to use the community mean in order to judge assessor error and to generate the judges' weights with a sort of Markovian chain. This ignores multimodal relevance judgment similarity (e.g., having justifiable, unique beliefs or preferences) which other literature has seemed to embrace. Furthermore, if both trained and untrained judges were used to determine this mean, are not the data biased in terms of determining the "gold standard" (and the judges' weights) in the first place? Lastly, the authors state that these weights converge. Should they?

    Arguing from the other side, with a suitably large pool of judges and judgments, would any of these methods have advantages over collaborative filtering? Assuming that we have previous data for a highly similar judge or group of judges on the same query (as we probably would from search engine data for queries that are not extremely unique), can we simply use that individual or group's judgment as a heavily weighted predictor?

  14. 1) In the introduction of the paper, the authors state that they consider each judgment to be legitimate and valuable. However, as we discussed in class are they not still catering to “popular” judgments? All of the methods involve some sort of predictive behavior that is based on consensus. Doesn’t this mean that outlying judgments will be inherently highly unlikely to be chosen?

    2) The authors mention using both untrained and trained judges for both their training and test data. They go on to discuss which set or subset was “better.” I’m confused about the context for this. Are the untrained judges “bad” at being predicted, or are they bad at making good relevance judgments in general?

    3) Is there a lower limit, in terms of number of previous judges necessary to predict the outcome of a single new judge? It would be interesting to see results for sets of judges ranging from 3 - 12, in addition to the 13 used for the experiment in the paper. Along the same lines, is there an upper bound where the predictions no longer become better?

  15. 1. The authors mention that the untrained judges pool performed better than any single trained judge. What was the background of these "untrained" judges? What about these people qualified them to participate in this experiment as untrained?

    2. In the discussion of the results, the authors mention M5's failure to predict the judgments of judge 4, while it performed really well with the other judges. What about judge 4 made it difficult for M5 to predict? They don't seem to elaborate on the issue.

    3. M5 is said to generally be the best method that they tested, but is subject to occasional large errors in prediction. Given their medical area of interest, wouldn't any sort of large error in predicted relevance be a larger concern than they make it out to be?

  16. 1. The work was based on MEDLINE. However, we have no clear picture of what the data is, especially whether it has any special features that would affect the results presented in the paper.
    2. On page 5, last paragraph, it is said that the performance of all methods was almost always better than the random level on each judge. What is the cost of achieving such a result? In other words, can the methods described in this paper be extended and applied to other research work?
    3. On page 7, left column, first paragraph, the paper discusses M4 and says “the reason for this failure is not clear”. Does this indicate a limitation of M4, and imply that this method may fail when applied to other fields?

  17. 1. There was a term “overtraining” under equation 4 on page 3. What is the meaning of “overtraining”? Especially in this paper, what is the harm of “overtraining”?
    2. It’s mentioned that judges 0 and 12 were somewhat special when discussing the results. Why did that happen?
    3. M23 achieved the best result in this paper. However, it is the only combined method in the paper. Is it possible to combine other methods? Or, is there any constraint blocking such a combination?

  18. 1. What is a motivating reason to combine judge pools? It would have been interesting to look at them separately and discuss common properties. This is clearly a specialized task for an untrained crowd.

    2. The experiment collected calibrated relevance judgments (induced because of presenting a relevant document). Hence, I feel the variability among judges is reduced. Will we see similar results under the usual relevance setting? A good extension might be to identify a pool of relevant documents per topic (gold) and then make relevance judgments on the rest.

    3. The M5 feature representation is not clear. What is the impact when we have different feature dimensions?

  19. The authors speak to the concept of overtraining, but don't define what that is, or how it differs from being trained. This concept is glossed over and I wonder why this was brought into the paper without adequate explanation.

    On page 2 the authors talk about estimating classes of probabilities for each document, but then don't go into a lot of detail about what these classes are in order to compute maximum likelihood estimates. I wish the writers had defined these classes a little more clearly than just trained and untrained judges.

    What is MEDLINE and how is it different from TREC?

  20. 1) I am having a difficult time understanding how method 1 and similar ones avoid having a 0 probability for a category. Can you elaborate on this?

    2) Since the goal is to improve relevance judgments, is their choice of using a suboptimal solution for beta in method 4 an acceptable one? Especially since M4 came out to be “somewhere in between” the best methods.

    3) Part of the future work for this paper involves exploring merging techniques for classifiers. However, Wilbur warns that the relevance judgment problem is fundamentally distinct from the classification problem. How does this affect the use of merging techniques like boosting?

  21. 1. What is the appropriate way to view the probability measures presented in Table 2? It is said that the random baseline has a value of -8047.19, while most of the methods tested have values in the -7200 to -7400 range. Is this actually a significant improvement over the random baseline? Why are the success measures not normalized to make the table more readable?

    2. Do the methods described in the paper improve with more training data? At what rate does prediction success increase with more training data? Does this paper actually describe how these methods might be best operationalized in an active learning scenario?

    3. Might trained judges and untrained judges have different predictive success rates? How would the methods perform if the training set were restricted to just one judge type?

  22. 1. In this article the authors attempt to predict how judges will act based on how other judges judged the same articles. However this method does not take into account whether or not the judgments were correct just if they were similar to the consensus of other judges’ judgments. Is it really useful to create a method of predicting judgments that does not account for situations when correct judgments may be in the minority?
    2. In this article the authors have their judges rate a document’s relevance on a scale from 0 to 4. They state that they told the judges that each level corresponded with an exact percentage that the document would be considered relevant to the query with 0 being not relevant, 2 being 50% relevant and 4 being totally relevant. Is this method of assigning each relevance level an exact value any better than other methods we have seen previously where the levels would be very relevant or somewhat relevant?
    3. In this article the authors rate the effectiveness of their 6 methods by averaging the scores that they received using the evaluation equation that they devised. However, we have seen in the past that there can be a large difference between the relevance judgments that separate judges make, due to the subjective nature of the process. We even see in their results that initially M23 seems better than M5 until they account for the error that judge 4 causes for Method 5, at which point M5 looks better. Considering this, do you think that averaging the scores provides a false positive because it assumes that each of the judgments is equal? If so, what other methods could be used to get a better picture of how these methods operate?

  23. 1. The paper aims to study a collection of judgment data to prove that it is possible to make predictions of future human judgments given the current and prior judgments about an object. What is the applicability of this experiment in an IR system? How useful and accurate can it fit the generic cases in a retrieval system, which is fraught with many unpredictable behaviors?

    2. While describing the second approach to the problem of predicting future judgments, the authors mention that weights were assigned to the features of the judgments by training and learning. It is stated that when the training is completed, the learned weights are then suitable for prediction. How do they know when to stop training? Identifying the stopping criterion is critical, as this approach is based on training and assigning weights to the features of the judgments.

    3. In Method M5, the Maximum Entropy Classifier, all labels provided by the judges are in turn treated as true and are not connected to the judges who produced them. This would not help in assessing an individual assessor’s quality. Since it is always assumed that labeling is done correctly, the method may not be accurate, as error rates are not considered as a contributing factor in the evaluation. How effectively can it help in evaluating a retrieval system?

  24. 1-The entire basis for this research is the ‘popularity contest’ method. The authors do not take into account or address the issue that just because a judge disagrees with most of the other judges does not necessarily make that judge’s opinion less valuable. Some acknowledgement of that would be helpful.

    2- I am confused about the use of ‘an arbitrary positive weight’ assigned to each judge in method two. Why should this weight be arbitrary? Why does adding an arbitrary value to the calculations improve them?

    3- One of the most interesting things this article suggested is that modeling relevance judgements can be used to predict how another random assessor might judge a document, and that assessor could be the end user. In order to really test that, I would like to see the methods stacked up against some real data pulled from user logs. Then it would be more possible to see how closely users are being modeled and whether the judges are doing a good job of predicting user behavior at all.
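
A small worked example may help with two of the questions above: how a method can avoid assigning zero probability to a category (20.1), and how to read scores around -8047 (21.1). The sketch below is only an illustration under stated assumptions, not the authors' implementation. It assumes the score is a sum of natural-log probabilities assigned to a held-out judge's labels, and it uses plain additive smoothing (the hypothetical parameter `alpha`) in place of whatever smoothing the paper applies. All function names are made up for this example.

```python
import math
from collections import Counter

NUM_CATEGORIES = 5  # relevance grades 0..4, as described in the questions


def smoothed_probs(labels, alpha=1.0, k=NUM_CATEGORIES):
    """Category probabilities from the other judges' labels.

    Additive smoothing (an assumption, not the paper's formula) adds
    `alpha` to every count, so no category ever gets probability zero.
    """
    counts = Counter(labels)
    total = len(labels) + alpha * k
    return [(counts.get(c, 0) + alpha) / total for c in range(k)]


def held_out_score(judgments, held_out, alpha=1.0):
    """Sum of log probabilities given to the held-out judge's labels.

    `judgments` maps a judge name to a list of labels, one per
    query-document pair, in the same pair order for every judge.
    """
    others = [j for j in judgments if j != held_out]
    score = 0.0
    for i, label in enumerate(judgments[held_out]):
        probs = smoothed_probs([judgments[j][i] for j in others], alpha)
        score += math.log(probs[label])
    return score


def random_baseline(n_pairs, k=NUM_CATEGORIES):
    """Score of a uniform guess over k categories on n_pairs pairs."""
    return n_pairs * math.log(1.0 / k)


# Toy data: three hypothetical judges, two query-document pairs.
toy = {"a": [0, 4], "b": [0, 4], "c": [0, 3]}
print(held_out_score(toy, "c"))
print(round(random_baseline(5000), 2))
```

Under these assumptions, a uniform guess over five categories on 5000 pairs scores 5000 · ln(1/5) ≈ -8047.19, which matches the random baseline quoted in question 21.1. If the number of scored judgments is indeed 5000, scores in the -7200 to -7400 range would correspond to an average log probability of roughly -1.44 to -1.48 per judgment, against about -1.61 for guessing.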