Saturday, November 9, 2013

11-14 Falk Scholer et al. Quantifying test collection quality based on the consistency of relevance judgements.


  1. 1. In Section 2, it has been described that some ground truth data could be inserted into the test collection to detect assessor errors by crowd-sourced workers and hence if workers failed to mark up data correctly, it could be extrapolated to all other items in the collection. How is this method of testing a sample of workers on a sample of ground truth data against the entire test collection over all workers justified? It can be a hypothesis but can it be generalized to assess their behavior and thus the quality of judgments? How does this relate to quantifying the test collection quality?

    2. In an experiment, the number of similar documents seen prior to judging a document influenced judgments. Furthermore when similar documents were seen, they seemed to improve the consistency of judgments. Were the similar documents placed successively in the collection? Can consistency be attributed to relevance judgments as in this case it is seen as a good measure to have consistency of judgments for a given test collection and sometimes otherwise can be a source of assessor errors due to their biased opinion or optimistic model of assessing?

    3. In most of our earlier studies there were ideas to discard the duplicates in the test collection, as they did not improve the relevance from a user perspective. But in this study they have suggested including duplicates in the test collection in order to maintain consistency in assessors' relevance judgments. Would it not be more apt if the assessors were asked to assess unique test collection sets and then study the impact of consistency among their judgments?

  2. 1. My first question is about checking relevance assessors in crowdsourcing (such as Mechanical Turk). This is discussed in the related work section, but I think it is an interesting question. As noted in the paper, one common approach to detecting errors by crowdsourced workers is to insert ground truth data into the stream of items to be judged. If workers fail to mark up such data correctly, their judgments on other items can be assumed to be similarly mistaken. However, I am wondering whether we should throw out all data from a worker who made mistakes on the ground truth data, or instead assign their input some kind of probability as a judgment.

    2. My second question is about Section 3.2, Duplicate Documents. This section discusses duplicate documents in datasets. It lists some datasets and then says that because the 0.9 threshold was found to be a reliable identifier of duplicates, it was used throughout their experiments. Although they gave an explanation for the 0.9 threshold, I still have questions about it. It seems to me that they arrived at the 0.9 threshold just by checking several datasets, which does not necessarily mean that it is a universal rule for all datasets. Plus, it does not cover all kinds of datasets. For example, the paper also discusses crowdsourced data, which may exhibit different characteristics.

    3. My last question is about the usage of the conclusions drawn from this paper. This paper is about evaluation, and its aim is to encourage use of this analysis method, which can further enhance the quality of collection-based evaluation of IR systems. Does this necessarily mean that we need to do this kind of evaluation each time we are dealing with relevance judgments on various datasets, or can we use some test collections that are of “perfect” quality?

  3. 1. Is it possible that by only using duplicate documents to examine the phenomenon of disagreement, the researchers are introducing a bias into their findings? Are there other ways that researchers could re-test these hypotheses without new experimental data?

    2. There is a large amount of variance between the distances of different topics. The researchers do not control for inertia when considering distance, which may alter the findings in the distance section. However, I am still curious as to whether or not there are other factors not taken into consideration in the paper. This is a troubling question to research, but how might we begin to understand the relationship between the topics and assessor agreement on relevance, if there is one?

    3. It sounds like randomizing the documents in qrels before having them assessed is an important procedure. Do none of the TREC procedures do this? In addition, how might we prevent inertia in relevance assessments? The researchers mention time between judgments as one factor that may reduce inertia in judgments. Are there others?

  4. 1. I was confused about the varying numbers of duplicate documents in each test collection. If the goal was to measure the consistency of each assessor's judgments with him or herself across randomly occurring duplicate documents, wouldn't it have been a good idea to control the frequency of these documents before the experiment took place? Or were the test creators interested to see what the difference would look like between collections with high versus low numbers of duplicates?

    2. In their description of the test collections, the authors cite Carterette and Soboroff for the fact that "more time is spent judging relevant documents than irrelevant documents" (p. 2). How might this concept relate to the idea of Need for Cognition, as described by Scholer, Kelly & Webber?

    3. The comparison of Interactive Search and Judge (ISJ) versus DocID seemed too brief a discussion to be included in this study. Should this have been its own study or paper? How does it fit in with the major findings which consider consistent duplicate document assessments and intra-assessor evaluation?

  5. This comment has been removed by the author.

  6. The authors use ground truth data to determine the level of error made by relevance assessors. Though the approach is correct, the authors use duplication as the ground truth data in the paper instead of using the assessment made by the authority for the topics examined. I believe duplication might not be able to reflect assessment errors, as there are potentially many other contributing factors besides assessment errors (e.g., systematic bias of the assessors for some specific topics, which would create consistent judgements for the duplicate documents). To use duplication as the ground truth is biased and incomplete.

    In the experiments, the authors do not choose to inject duplicate or near-duplicate documents into test qrels. Instead they use whatever duplicate documents are present in the test collections. This is problematic, as the number of tests that could be run is very restricted and might not be representative.

    The authors state that the ISJ (Interactive Search and Judge) approach can lower the time period between seeing duplicates compared with the original DocID-sorted linear order in the TREC test collections. Though it might lower the inconsistency rates for duplicate documents, at the same time it provides a kind of learning effect which might introduce bias into the judgement (e.g., two documents are similar and presented in sequence, but only the first one is relevant; the second is close to the first one but not relevant. The assessors might regard the second one as relevant as well due to the learning effect).

  7. 1. The authors state that partially relevant documents were more likely to have inter-assessor disagreements and say that these problematic topics can be removed when creating test collections (p. 1071). Do you agree that problematic topics should be removed? If these topics are removed, how could you build and evaluate a search system that works on difficult topics?

    2. In their evaluation of results, it appeared that raters sometimes forgot or disregarded the criteria on which they were to be judging documents (p. 1067). How could you set up the assessment process to remind the assessors of the instructions? Would reminding the assessors of the criteria distort the results, since real users would not have such set criteria for rating documents?

    3. The authors write that “The number of pairs where assessors are consistent is largest for partially relevant, and lowest for highly relevant” (p. 1066). Why do you think there is such a gap? Are the queries specific enough to make such discrete relevance judgments? Could the gap come from assessors trying to make relevance judgments that can be generalized to a larger audience?

  8. 1. This paper presents an important study on evaluating the inconsistencies of TREC assessors. Central to it is the identification of “similar” documents based on cosine scores. However, is the cosine score enough to reflect the similarity of documents, since it does not take word order into account? Even if manual inspection shows that these documents are similar, it is still possible that they have significantly different purposes because of differences in just a few words. How can we rule out this possibility?

    2. The authors model inconsistency by the time (distance) between two similar documents, and suggest that the assessor might forget the judgment of the first one as time goes on. However, since relevance judgment is dynamic, is it possible that perceiving other novel documents in between the two similar documents changes the assessor's internal relevance model for the topic? This could also explain why inconsistency tends to arise with increasing distance, because a longer distance means more documents and a higher chance of a change in the internal relevance model. In this case, both of the judgments are reasonable. How can this possibility be ruled out?

    3. Another analysis focuses on the similar documents placed in between the two compared documents. First, the measurement of “similarity” of documents is questionable, as suggested in question 1. Second, the effect of these “reminder” documents is evaluated in combination with distance. How can we separate the effects of more “reminder” documents and longer distance, since they may both contribute to inconsistency?

  9. 1. One takeaway from early on in the paper is that trinary scales produce more inconsistency than binary scales. One would reason therefore that it is appropriate to stick to binary scales rather than present more choices to assessors, at risk of deepening inconsistency and degrading quality. Are there any studies, in fact, showing that finer granularities in relevance assessments have positive impact on agreement?

    2. The authors even show that the greatest inconsistency is between partially relevant and non-relevant. One would assume therefore that assessors have a hard time deciding whether to rank a document partially relevant or non-relevant. Is it because the document contains some useful information about the broader topic but not about the specific topic the assessors were supposed to be looking for? From personal experience, it seems easier to grade something partially relevant than non-relevant for a case like that. What's curious though is that these assessors are trained, yet some kind of bias is evidently showing up. It should be interesting to identify what that bias is that's causing this inconsistency.

    3. The finding that the probability of judging a document relevant is higher right after having seen another relevant document is unsurprising yet disturbing, given these were trained judges. What are ways of reducing this bias? Is it possible to seed the documents as presented to the judges so that non-relevant documents are randomly sampled and inserted in strategic locations in the list? Would the overhead be worth the potential gain in accuracy?

  10. 1. The methodology employed by the authors in section 4.3 to measure the effect of inertia seems confounding. They measure the probability of the (i+1)th document being judged relevant given that the assessor had judged the ith document as relevant. Do you believe that this is a good measure? Are the authors not ignoring the fact that two consecutive documents can actually be relevant? Would it not have been accurate if the authors had devised a measure where the assessors assessed the documents as relevant in spite of them being irrelevant?

    2. What will lead to a more efficient system: a system where all the duplicates are removed, or a system where all the assessors are devoid of any inertia or other factors affecting judgment? The answer to this question might also answer the biggest question of what is more important: a well-picked document collection or the assessors' skill.

    3. The authors state that partially relevant documents contributed disproportionately to the inconsistency among the assessors and suggest the removal of topics with high levels of inconsistency. In the case of multiple levels of relevance, there might be a lot more inconsistency in assessment and such an effort to remove inconsistent topics could prove disastrous. Do you think that re-assessment of such documents can result in a better solution, instead of removing them altogether?

  11. 1. Previous work put forth that one could objectively measure the rate at which relevance assessors make mistakes by comparing their judgments to the topic authority (pg. 2). Do you think this is really an objective measure, considering the topic authority is a human being too? Is there an alternative / less subjective method of creating a topic authority, or is that not possible?

    2. The researchers suggest that extremely long topics reduce assessor inconsistency (they do say that there is insufficient evidence for the same), but do you agree with this assumption and what may be the factors that work in symbiosis with length to limit inconsistencies?

    3. How long does the inertia effect caused by a document judged as relevant last, and what does it take to break this effect? / With respect to the clustering effect, is there any research on the distance between the first document judged relevant and the first document judged not relevant?

  12. 1. We've read papers in this course that have suggested that discrepancies between assessors' relevance judgements don't actually hurt evaluations' results. Stated another way, the evaluations' ranking of search engines is robust to disagreements between relevance judges. This paper explores intra-assessor disagreements, but what reason is there to think that these are any different than inter-assessor disagreements? This paper seems to exaggerate the importance of these findings without explaining how exactly they impact the evaluation of search engines.

    2. The authors acknowledge their assumption that distance between judgements can be inferred from the ordering of qrels. An interesting follow-up might be to test this assumption by observing an assessor at work or by creating a system that can actually track this (which does not seem difficult).

    3. In the future work they mention the idea of inserting duplicate documents in order to characterize intra-assessor disagreement. These documents would have to be ignored when evaluating the search engines though, correct? It seems like introducing these would disrupt the results, especially considering the impact it might have on assessor inertia.

  13. In this research, the time of judging the documents has been used as a criterion to estimate the distance between the different documents. However, I can’t understand why time should be considered the criterion there. Will the time spent assessing directly impact the relevance judgment?

    In the paper, it seems that trinary judgments are more likely to increase the number of inconsistencies. So, I am wondering whether the inconsistencies result from the trinary judgments themselves or are affected by other factors.

    At the end of this paper, it is suggested that topics which are difficult to judge be deleted. However, I think this suggestion is problematic since some of the topics may be very important in real life. If these topics were deleted, would the evaluation of information retrieval systems be influenced?

  14. 1. In IR - we are required to work with the premise that information can be represented through multiple queries. But, is it fair to assume that all queries are capable of polyrepresentation? Also, wouldn't the task of mapping these multiple query representations to a single cohesive document ranking get tedious especially when we still have to deal with an incomplete set of relevance judgments and have no metric to calibrate the assessor's consistency across a bunch of relevance documents?

    2. When working on quantifying a test collection, is it alright to take into consideration only topic length and topic effects as the two parameters that need to be considered? Wouldn't it be prudent to also include documents which are from varied sources but are based on the same topic? For instance, draw conclusions on the user's intent on the basis of whether he/she seems to prefer scholarly (depth-based) articles as opposed to newspaper/article readings, which would provide more breadth?

    3. I do not seem to completely grasp the concept of the 'Reminder document' that has been elaborated on in this paper. We have read quite a few papers which focussed on how there is a definite elevation in the learning curve of the searcher and how this affects the relevance judgements of documents placed lower on the ranking. Doesn't the reminder document support a hypothesis which is contradictory to what we have seen in the past where this learning curve of the searcher seems to be undermined?

  15. 1. The authors mention using a graded relevance scale and not just a binary scale. In class, we mentioned that users feel more comfortable when they have multiple levels of relevance presented to them. When looking at agreement among assessors, does a graded relevance mask the true amount of agreement? When people are uncertain about the relevance of a document, they can fall back on a middle ground. In addition, in one of the other papers this week, people commented on marking documents in the middle range of relevance when they did not understand the document. Although these assessors are putting the same score, they are not truly agreeing on the relevance of the document.

    2. In class, we talked about how one of the weaknesses of the Kendall’s tau correlation metric is that it is not sensitive to the position of the swaps. Instead, the measure is only concerned with the number of swaps. Therefore, for a search engine, it would not be the best correlation measure because a higher swap is more meaningful than a lower swap. For this experiment, the authors use Kendall’s tau correlation to look at the discrepancies between two different relevance assessors. Although the documents appear in a ranked list, the purpose behind the ordering is not the main focus of the task. Instead, the focus is on the relative difference between the two lists. Therefore, are these concerns not important and is Kendall’s tau the standard correlation metric to use in these cases? Is there another metric that would have been a better reflection?

    3. In the final evaluation of their results, the authors summarize the different answers to their research questions. The authors note that they did observe differences in system performance. At the same time, the authors acknowledge that they do not have reasonable proof to establish the difference in performance is related to the inconsistency of relevance judgments. If they can not reach the conclusion which motivated their experiment, then was the experiment successful? Some of their measurements yielded no significant difference, but then other measures did.

  16. Q. The author is assuming that duplicate documents will be ranked similarly. "Since these are retrieved based on a similarity function, one would expect that duplicate documents occurred close to each other in such lists. This suggests that the time period between seeing duplicates was lower than in the linear presentation case, and so inconsistency rates were also likely to be lower.”
    This doesn’t seem to be a safe assumption because duplicate documents have been seen to be judged differently in the past. The author himself says: “It seems reasonable to assume that when an assessor encountered a duplicate of a previously judged document, but gave a different judgement than previously, that they forgot their previous judgement, and perhaps even forgot the document itself.”
    Q. The author has used only duplicates as a way to determine the unreliability of the workers, which itself is questionable. We have seen in the past too that in IR, a single measure can never be used to draw a complete and reliable conclusion.
    Q. This paper indicated that assessors found it hard to differentiate between non-relevant and partially relevant documents. However, in class last week we discussed that it is difficult to determine whether a document is partially relevant or not, but the decision with regard to a document being non-relevant is quick. These differences in conclusions seem to indicate that the results of the tests being done in IR cannot always be relied upon.

  17. With regard to topic length, Scholer et al. note the relatively small difference when judging duplicates in collections which contained long topics versus collections with fewer words used in their topics. Even if the differences were considered small, would it be beneficial to use longer topics in an effort to further reduce errors in relevance assessments?

    Section 4.2 mentions distance between assessments in the form of time. Is there any way to measure whether a judge changed their measure of relevance or simply forgot the criteria, as Scholer et al. suggest? Or could it be a mixture of both events happening as judges move through a test collection, which would then mean that distance will always be a factor no matter the reasoning behind consistency?

    The conclusion mentions partially relevant documents as being a key factor in judging inconsistency. One of their suggestions is to remove those documents entirely or ensure that the judges look into each document carefully in an attempt to determine the actual relevance. Wouldn’t removing the partially relevant documents remove the need for graded relevance entirely? And would it be possible to assign relevance judgments to sections of a document versus the whole body for instances of partial relevance?

  18. 1) In their discussion of disagreements on graded judgments, the authors mention that inconsistencies between relevant and partially relevant are most common. Is this inherently due to the fact that there are three levels, and the “difference” between levels is not even? I wonder what the results would have been if they did a study on the same data using binary relevance. If a weighted average of discrepancy between 0-2 and 0-1 in the trinary case was compared to 0-1 in the binary case, and found to be different, their findings about the trinary case might have stronger meaning.

    2) It seems odd to only consider topics that have large numbers of duplicates. Doesn’t this restrict the results to “popular” documents? It would be interesting to crowdsource such an experiment, with an even distribution of duplicates for each document. This would allow the results to be meaningful for all topics. Further, there would be no question about how closely one document is a duplicate of another.

    3) I’m not sure I buy the inertia factor. Isn’t it possible that most relevant documents appear close together in a list? If this was the case, then obviously the conditional probability would be higher. What constitutes “next” in this calculation? It would be a stronger conclusion, if they explicitly stated that assessors were shown documents purely randomly.

  19. 1. Many studies have shown that errors in the test collection have little impact on the measured performance of an IR system. This paper still focuses on this topic. Though it is valuable from a theoretical perspective, what is its practical value?
    2. When discussing topic effects, the paper mentions that “one might question the value of including such topics in a test collection.” However, from another point of view, why not question whether the narrative of the topic is clear enough?
    3. This paper criticizes presenting documents to assessors in the DocID order of the qrels. The ISJ sample can be treated as another ordering of the DocIDs. Therefore, one of the key points of this paper is how to present documents to assessors in a reasonable order.

  20. 1. The authors mention two topics in particular that led to a large number of duplicates in the results. What about these topics caused so many more duplicates than the others?

    2. The authors discuss their analysis of the trinary relevance, but still take the time to compute the data of the trinary relevance when it was folded into binary like other TREC relevance. What information does this provide that the original assessment of the data didn't?

    3. The authors mention the "critical point" in the judging process. How would that point be determined in order to insert the duplicates?

  21. 1. Do differing judgments on near-duplicate documents reflect exposure or issues with calibration? If we had to present assessors with a burn-in period, how would we do that, since we use pooling to decide which documents to judge? Will a second run induce dependence?

    2. Where in the ordered list do assessors encounter these duplicate documents? And how does this affect disagreement? Maybe duplicates, which are spaced but lower in the ranking, can have smaller disagreement rates.

    3. Testing the hypothesis is fine but what is the solution to consecutive judgments being dependent? Is this something we can get out of? If we cannot guarantee explicit independence how can we do so explicitly? If we have dependence anyway then why is it that we do not adopt 2 runs of assignments?

  22. 1. TREC is grouped into 3 blocks in section 3.1. Why? What is the criterion?
    2. “Zettair” is mentioned in section 3.2. What are the features of this search engine? Why do the authors use it in their work?
    3. When discussing Reminder Documents, the authors conclude that “consistent pairs were more likely to have a document, close to the second ...”. Is there anything that influences that reminder document?

  23. 1. In this article the authors discuss the different factors that may influence when an assessor may judge two duplicate documents differently. One of the factors they discuss is the difficulty of the topic. They state that some topics might be more difficult and therefore have an increased chance of the assessor judging two duplicate documents differently. However one factor that they do not talk about that may lead to intra-assessor disagreement is similar to this. Rather than the topic causing trouble, the assessor’s skill level at judging might cause this problem. Do you think that charting an assessor’s treatment of duplicate documents is an effective measure of their skill at judging?
    2. Another factor that the authors discuss that may lead to intra-assessor disagreement is the distance between the two duplicate documents. The further the two documents are from each other the more likely that they will be judged differently. They argue that the assessor forgetting the first document when reading the second causes this. However we have seen in previous articles that as assessors judge more documents they become fatigued. What role do you think that fatigue may have in the intra-assessor disagreement that is observed here?
    3. In this article the authors discuss an alternate method of producing qrels for documents aside from the traditional linear method used by TREC. This method was put forth by Cormack et al. and is titled Interactive Search and Judge. It involves assessors searching for relevant documents by running several similar queries on a search engine. Do you think that this method is a better method than the linear method used by TREC? What type of bias do you think that this method could cause?

  24. (1) Much of the work concerns itself with looking at inconsistent annotator behavior across duplicate documents. In the data collected from the graded collections (wt10g, gov2) it is shown that the greatest proportion of inconsistent judgments come from when a document is being judged from not relevant to somewhat relevant. While it is possible to explain this simply as inconsistent annotator behavior, isn’t it also possible that the inconsistency could be the product of deficiency of choices for the annotator (i.e. assessors were not given enough choices in terms of describing a document’s relevance)?

    (2) One of the ways to reduce inconsistent judgments on duplicates was described as an interactive approach, where assessors search and are presented with similar documents at one time. While this has the effect of greater consistency, are more consistent relevance judgments necessarily more correct?

    (3) Is it possible that implicit measures of relevance (e.g. eye tracking) could be put to use on the problem of annotator inconsistency? Do we expect identical reading effort (from a cognitive perspective) on identical documents and might something like inertia manifest itself purely in terms of a decision making step?

  25. 1) An interesting point from this paper is the fact that assessors are more consistent in somewhat relevant than in highly relevant. Furthermore, we know from other literature that having different assessors does not influence greatly the metric results. However, we know that most metrics give more weight to higher ranked documents (i.e. the highly relevant ones). How is this possible?

    2) Isn't the inertia characteristic inversely related to the reminder documents? When reminder documents are closely clustered, the judging is more consistent. However, inertia is also created, which makes the judging inconsistent in the long run. Considering this, is the DocID sorting approach inferior to methods like ISJ?

    3) In order to measure intra-assessor consistency, Scholer et al. use duplicate documents. However, as their experiment shows, there are collections with only a small number of them. An approach they will explore is the manual injection of duplicates. However, considering there has to be a substantial number of duplicates in the collection in order to make a significant evaluation, is the benefit of intra-assessor evaluation really worth the added overhead cost of additional judgments?

  26. In the TREC dataset, in order to objectively measure the rate at which assessors make mistakes, the authors state that a topic authority’s judgments are treated as the gold set. However, aren’t we dealing with the same problem again, namely the bias of the judge designated as the topic authority? What about the topic authority’s temporal difference in judgments? And the training the judge received? Wouldn’t each of these contribute some bias that should be countered?

    The authors report that the outliers were not considered in the experimental inputs. But it is surprising that only two topics, incidentally one from each test collection, were anomalous. Were they just ‘bad’ topics in the collections that happened to have a lot of duplicate URLs?

    The ISJ approach mentioned in the paper appears to be a good method for reducing the assessment disagreement for duplicates. However, given the scale of the web, this may not be a feasible solution. Moreover, having the same words in a document may not necessarily mean that the documents have the same intent. A bag of words model is one of the counterexamples to this. Any similar approaches which do not pose the above problem could be a good solution for creating good test collections.
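
The bag-of-words concern raised in comments 8 and 26 can be made concrete with a small sketch. It assumes raw term counts, whitespace tokenisation, and the 0.9 cosine threshold quoted in comment 2; the paper's actual weighting and tokenisation are not described in this thread, and the function names and example sentences are made up for illustration.

```python
import math
from collections import Counter

# Assumption: 0.9 is the duplicate threshold quoted in the comments above.
DUPLICATE_THRESHOLD = 0.9


def bag_of_words(text):
    """Lowercased, whitespace-split term counts; word order is discarded."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between the term-count vectors of two texts."""
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_duplicate(a, b, threshold=DUPLICATE_THRESHOLD):
    """Flag a pair as duplicates when the cosine score reaches the threshold."""
    return cosine(a, b) >= threshold


# Hypothetical documents: same words, opposite meaning.
doc_a = "the court ruled that the company must pay the union"
doc_b = "the court ruled that the union must pay the company"
# Hypothetical unrelated document.
doc_c = "rainfall totals were below average across the region"

print(cosine(doc_a, doc_b), is_duplicate(doc_a, doc_b))
print(cosine(doc_a, doc_c), is_duplicate(doc_a, doc_c))
```

Because doc_a and doc_b contain exactly the same terms with the same counts, their vectors are identical and the pair is flagged as a duplicate even though the sentences say opposite things. This is the situation where an assessor could reasonably give two "duplicates" different judgments without being inconsistent.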