Friday, October 4, 2013

10-10 P Organisciak, M Efron, K Fenlon, M Senseney. Evaluating rater quality and rating difficulty in online annotation activities. ASIS&T’12.


  1. 1. My first question is about the DATA section. It says this data was rated with three-rater redundancy, so for each document there were three raters. The label for each task is determined by majority vote. Although the authors try to demonstrate the robustness of majority vote among three raters, there are still problematic issues that are not discussed here. For example, what if all three raters are proven to be low-quality raters? What if two of the three are proven to be low-quality raters, and they happen to both be on the majority side?
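
The failure modes raised in this question can be quantified. Below is a quick sketch (the rater accuracies are hypothetical, not figures from the paper) that enumerates all vote patterns to get the exact probability that a three-rater majority vote is wrong:

```python
from itertools import product

def p_majority_wrong(accuracies):
    """Probability that the majority vote of independent binary raters is wrong.

    accuracies: per-rater probability of choosing the correct label.
    """
    total = 0.0
    for outcome in product([True, False], repeat=len(accuracies)):
        # outcome[i] is True when rater i votes correctly
        p = 1.0
        for correct, acc in zip(outcome, accuracies):
            p *= acc if correct else (1.0 - acc)
        if sum(outcome) <= len(accuracies) // 2:  # majority voted wrong
            total += p
    return total

# Three reliable raters: the majority is rarely wrong.
print(round(p_majority_wrong([0.9, 0.9, 0.9]), 3))    # → 0.028
# Two low-quality raters can dominate the vote.
print(round(p_majority_wrong([0.9, 0.55, 0.55]), 3))  # → 0.252
```

Under these assumed accuracies, two near-random raters raise the majority's error rate almost tenfold, which is exactly the scenario the question worries about.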

    2. My second question is about the RATER AGREEMENT AND TASK DIFFICULTY section, which discusses the ability of rater agreement and task difficulty to discern the accuracy of ratings. For task difficulty, the paper addresses the problem with the proposed iterative algorithm. I think an alternative is a pre-processing step that picks out all questions that are inherently difficult to agree on, and then rates them separately from the relatively easy questions. This could give us a more comprehensive idea of how task difficulty affects the rating process.

    3. My third question is about the labels for each task. In this paper, the datasets all use binary labels, which is one reason why only three raters are needed per task for a majority vote. I am wondering how we can apply these methods to multi-category labels, for example, five-star ratings? The situation would be more complex. Which steps would we need to change?

  2. 1. The paper presents an important topic, the evaluation of online rater quality. My first question is: why did the authors use their own judgments as the gold set? The authors mention that they were of known reliability and had a close understanding of both the tasks and the data, but using self-judged results to evaluate different methods is very biased, since it makes the evaluation and the standard dependent on each other.

    2. The finding that a rater who answers the first question correctly is likely to make more accurate judgments on the remaining ones makes sense. How can we improve our data collection based on this? Is there a study that has tried enforcing a mandatory time for reading instructions, or rejecting raters who answer the first question incorrectly, to see whether there are significant improvements in the relevance judgments? Another possible explanation for this finding is that raters who answer the first question incorrectly are facing more difficult tasks, and are thus more likely to make mistakes in the following judgments; is there any study evaluating this possibility?

    3. The value 0.67 was used as the threshold to remove problem workers, which is fairly high. The value is derived from a simulation, which is, however, based on a binary relevance judgment. Is the high agreement between random and real raters simply due to the lack of a relevance judgment scale? In the real world there are relevance scales with more levels; is it too stringent to use this value for removing real-world workers?
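
As a side note on this question, the sensitivity of such a threshold to simulation assumptions is easy to demonstrate. The toy model below is my own, not the authors' simulation: under a model of three uniformly random binary raters, a random rater's agreement with the majority vote comes out near 0.75 rather than 0.67, which underlines why the provenance and generality of the threshold matter.

```python
import random

def random_rater_agreement(n_items=100_000, n_raters=3, seed=0):
    """Fraction of items on which one random rater matches the majority
    vote of n_raters independent, uniformly random binary raters."""
    rng = random.Random(seed)
    agree = 0
    for _ in range(n_items):
        votes = [rng.randint(0, 1) for _ in range(n_raters)]
        majority = 1 if sum(votes) * 2 > n_raters else 0
        agree += votes[0] == majority
    return agree / n_items

print(random_rater_agreement())  # ≈ 0.75 under this particular toy model
```

Changing the number of raters, the label distribution, or the definition of "agreement" shifts this number substantially, so a 0.67 cutoff tuned on one binary dataset may not transfer to graded relevance scales.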

  3. In Figure 4, which shows the distribution of rater performance by dwell time, why do the counts of both correct and incorrect ratings begin to decrease after a certain time? It is understandable that the count of correct ratings may decrease, because raters may lose interest or feel tired during these tasks. However, why does the count of incorrect ratings decrease as well? Also, why do the authors want to reject the null hypothesis of equal distributions? What are equal distributions?

    Compared to the IMLS DCC, the starting accuracy for tweets is much lower. Why? Is it because assessors have four options to choose from when they attempt to classify the tone of political tweets?

    Regarding the learning curve: after the first rating, there are few significant differences among the other ratings in a set. Why? Is it because the coding manual's description is too difficult for raters to understand, so that raters actually guess at some judgments?

  4. In Figure 3, it looks like 50 raters contributed the majority of the ratings, although there are 157 unique raters. Doesn't this figure point to a generality (diversity) issue, since two-thirds of the raters are not well represented in the study and these 50 raters might share something in common?

    In Figure 4, raters made the most correct ratings at around 5 seconds, but the same group of users also made the most incorrect ones in the same time frame. The frequency distribution of the average time users spent on tasks they rated incorrectly is also largely in line with the one for tasks they rated correctly. Does this imply that time has no influence on rating quality? If so, I really doubt the validity of the experiment; isn't it common sense that the more time you spend rating, the more accurate the rating should be?

    In the subsection “Replacing Problem Workers”, the authors state that “the removal and blocking of low-agreement raters can be automated fairly easily, making it possible to incorporate in real time within a rating interface”. I think the automation is not easy, as raters can be assigned an unfair share of difficult tasks, or a few good raters might be grouped alongside numerous low-quality raters. In either case, the automation risks removing those good raters while leaving bad raters intact.

  5. 1. The paper uses Expectation Maximization as the algorithm to distinguish the latent variable of rater quality from the difficulty of the task. The EM algorithm is known to face convergence problems, as it can take a long time to converge. What mechanisms are proposed to deal with this situation? How do we go about selecting the local maximizer for the function when the EM algorithm does not provide an estimate of the covariance matrix? And finally, how the EM algorithm adapts to the evaluation of metadata, reduces noise, and improves data reliability has not been elaborated on in the paper.

    2. The paper states that a user's early performance is the key indicator of future performance. I am still unclear on how we would calibrate this user performance metric, especially for sentiment analysis. The paper speaks of differences among the workers, but I am curious how we would handle differences among workers in terms of the quality of the annotations provided. Is it possible to identify genuinely ambiguous examples in cases of annotator disagreement? And if so, how should we resolve these disagreements?

    3. When making use of microtask labour, the assumption is that we are dealing with large amounts of data and users who have some profile information on MTurk. How would we modify this implementation to include new users who do not have any profile information? What collaborative mechanism is used in such cases? Say we have a new user with a time constraint: how do we ensure an efficient and effective microtask in a situation where we need to compensate for sparse data?

  6. Accuracy is defined as the ratio of the number of correct classifications to the total number of classifications. Most of the evaluation (including rater reliability) is measured using accuracy. It was not mentioned why accuracy would be the most appropriate evaluative measure for what the authors were doing. What other measures could be employed in the evaluation of rater judgments?

    Figure 5 of the paper shows the distribution of accuracy ratings across users' lifetime experience. From the plot, the authors conclude that lifetime experience does not make a user more reliable. However, when evaluating the nth task in a simple consensus-based voting process, wouldn't the performance of the rater depend on the experience of the other raters? As an example, consider a rater, rater1, who is performing his 100th task. The other two raters for the task are inexperienced and produce a label that opposes rater1's judgment. Rater1 would still not be rewarded for producing a correct label (his accuracy for that task goes down). Can a generalization such as the authors' conclusion about lifetime experience be made in that case?

    The authors describe the iterative algorithm as having two phases. In the first phase, the expected vote is calculated based on 'extra', previously unavailable, information. The second phase progresses by selectively picking raters to improve the overall accuracy. How exactly is this done? What information is used to compute the expected votes? Roughly how many iterations, as the authors discuss, are required for convergence?
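
For readers puzzling over the two phases, here is one possible reconstruction (my own sketch, not the authors' pseudocode) of an iterative weighted-voting loop of the kind described: phase 1 computes each item's consensus label from reliability-weighted votes, and phase 2 re-estimates each rater's reliability from agreement with that consensus.

```python
def iterative_label_estimation(ratings, n_iters=20):
    """Alternate between estimating item labels and rater reliability.

    ratings: dict mapping item -> {rater: 0/1 label}.
    """
    raters = {r for votes in ratings.values() for r in votes}
    weight = {r: 1.0 for r in raters}  # start by trusting everyone equally
    labels = {}
    for _ in range(n_iters):
        # Phase 1: reliability-weighted expected vote per item
        for item, votes in ratings.items():
            score = sum(weight[r] * (1 if v == 1 else -1) for r, v in votes.items())
            labels[item] = 1 if score >= 0 else 0
        # Phase 2: rater reliability = agreement with the current consensus
        for r in raters:
            judged = [(item, votes[r]) for item, votes in ratings.items() if r in votes]
            weight[r] = sum(v == labels[item] for item, v in judged) / len(judged)
    return labels, weight

# Rater "c" disagrees with the consensus on two of three items,
# so the loop down-weights c's votes.
ratings = {
    "doc1": {"a": 1, "b": 1, "c": 0},
    "doc2": {"a": 0, "b": 0, "c": 1},
    "doc3": {"a": 1, "b": 1, "c": 1},
}
labels, weights = iterative_label_estimation(ratings)
print(labels)        # {'doc1': 1, 'doc2': 0, 'doc3': 1}
print(weights["c"])  # roughly 0.33
```

On this toy input the loop stabilizes after one pass; how many iterations a real dataset needs depends on how entangled rater quality and item labels are, which is presumably what the authors mean by the iteration count varying.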

  7. 1. “Simply by including redundant ratings it is possible to match the quality of expert rating”. If this were based on consensus among multi-user ratings, what strategies could have led to combining them and rating them on the same scale? If not, how can redundancy match expert ratings?

    2. How can the judgments made by crowdsource workers or non-expert raters be classified on the basis of topic relevance (as in our earlier reading by Huang and Soergel)? If only the authors' oracle judgments were considered as the set of ground-truth judgments, how can one mitigate bias and portray the raters' quality in such cases?

    3. “To safeguard against the presence of cheaters and their strengthened influence in low-consensus tasks, a less naïve decision making process would be valuable”. Following a naive decision-making process would again result in ambiguity, considering that there would be very little consensus in relevance agreement among the workers. To safeguard against cheaters, could we not simply combine these crowdsource workers' assessments with a GWAP-based approach?

    4. The authors put forth three kinds of labels, namely Relevant, Non-Relevant, and I Don't Know. Isn't the label "I don't know" a little ambiguous? Does it reflect the user's inability to judge the document as either relevant or non-relevant? Or was it for the rater to acknowledge his ignorance of the document's topic? How helpful can this label be in assessing the rater's performance?

  8. 1. If aggregating votes from multiple earnest raters matches the quality of an expert, why not just include 5 experts versus 5x non-expert workers? Wouldn't that be more efficient from a data collection and analysis viewpoint as well? Or is cost the only / major / primary concern?

    2. With respect to the Expectation Maximization algorithm, I wish the researchers had elaborated a little more about what considerations go into deciding if a rater is good or bad. While performance over time is an important metric, does it account for (or inspect user profiles for) aspects like background knowledge about a topic, education level, user's location, etc.? Basically does it evaluate how much of a "non-expert" a non expert is? Are these valid concerns, or do they complicate the approach?

    3. Do you think average dwell time (considered for a group of raters) could be a good way of evaluating task difficulty? Ideally, a longer dwell time would mean that a rater took longer to read the question or provide an answer, both of which reflect some difficulty with the task. Right?

  9. Given the need for, or apparent creation of, a “negotiated truth” or “ground truth” among the raters, how does this influence the nature of the judgments made by raters whose experience and ratings evolve? Does the negotiated truth continue to change over time, or does it remain static?

    In the data section the article mentions using a different dataset which includes deciding the tone of political tweets. Due to the difference between this dataset and that of the IMLS DCC, how effective is the comparison of the two datasets in terms of crowdsourced ratings considering the lack of temporal information and multiple categories for the twitter decisions vs. the IMLS DCC decisions?

    Organisciak et al. note in the conclusion that the time spent on completing the first task has a connection to the performance of the overall task. Could this be used as a benchmark to prevent some of the malicious or mischievous raters who might be likely to partake in rating tasks? Or would it bias the overall sets in a specific way tied directly to the instructions provided by the presenters of the tasks?

  10. Human-Machine Assessor Hybridization for Further Incremental Progress
    1. How might automatically detectable textual features and language processing improve untrained judgments if the two were to be used in tandem? For example, using bag-of-words or language models? How might such methods compare to both trained and untrained assessors? How much closer (or further) to trained assessor judgments can we get?

    Human-Human Hybridization
    2. We have previously discussed issues with untrained workers such as those on MTurk. This paper re-emphasizes the distinction between them and trained assessors. Have any attempts been made to hybridize relevance assessment using both? For instance, using one trained judge and an army of untrained judges, and updating the rankings using a biased prior?

    Interest and Performance on MTurk
    3. How much would untrained assessor performance change if they were permitted to select the topic of the query? The paper by Chouldechova and Mease indicated that previously searching for a topic (which I understood to suggest either familiarity or interest in a topic) improves assessor performance. Even if MTurk workers do not apparently improve in accuracy with more experience, would they instead improve if they had previously used a similar query?

  11. 1. The authors discuss the label options for the raters who participated in the study. They opted to use a three-point relevance scale with the third option being 'I don't know'. Documents labeled as such were not considered in the dataset. Given the studies on relevance scales, why did the authors choose to go with a modified binary relevance scale?

    2. The authors mention serving fewer than ten tasks to raters when ten tasks were not available. Given their interest in time and how it related to user reliability, wouldn't giving different numbers of tasks to different raters skew the data? A longer task list looks different to the rater and could affect their mindset compared to, say, a three-task list.

    3. The authors used a set of oracle judgments to compare against the judgments they obtained from the raters. Given the studies on random relevance, is it really necessary to obtain oracle judgments for all of the tasks? Isn't that an expense that could have been spared?

  12. a. Is it possible to categorize people based on their personal details and make judgments about their accuracy on that basis? It is possible that people from a younger age group approach these tasks more casually than others, so the system might evaluate younger people a bit more strictly. How does diversity then play a role in these interactions? Some datasets might make more sense only to a particular group of people, so are the topics for evaluation given to people with the same backgrounds? Could the process be enhanced by making use of the fact that some people have more knowledge about certain topics? A vote made by an expert could be weighted more heavily than others'.

    b. The authors state that raters who make a correct rating on the first item are much more reliable on the rest of the rating set. This cannot be extended if an evaluator receives questions from different scopes; in that case we cannot be sure whether the rater will grade the question accurately. Does this imply that an evaluator cannot be trusted at all unless he has worked on 30 tasks pertaining to a single type of query? Isn't that a huge loss of time and energy? Also, couldn't an experienced judge be found who could deliver the same efficiency, saving us time?

    c. Tweets are the opinions and experiences of people, and every individual is entitled to their own opinion. Why were political tweets chosen as one of the datasets to be tested? How can ground truth be established for tweets?

  13. 1. The performance of basic reliability decay is surprising, especially without normalization; I did not expect it to work as well as it did across datasets. Maybe down-weighting inexperienced annotators would result in a similar improvement.

    2. Would have been interesting to incorporate findings from task design, i.e. more redundancy for initial queries to better estimate reliability. This along with exponential decay may improve results.

    3. Additionally, accuracy may not be the best metric; it would have been useful to discuss class distributions and model bias.
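
The reliability-decay idea raised in points 1 and 2 can be sketched as follows (a minimal illustration under my own assumptions, not the paper's implementation): recent outcomes are weighted more heavily via an exponential decay factor, so a rater whose quality has drifted is flagged sooner than a plain lifetime average would allow.

```python
def decayed_reliability(history, decay=0.9):
    """Reliability estimate that weights recent outcomes more heavily.

    history: chronological list of 0/1 task outcomes, oldest first.
    decay: per-step weight multiplier (< 1); the most recent task gets
    weight 1, the one before it decay, then decay**2, and so on.
    """
    num = den = 0.0
    w = 1.0
    for outcome in reversed(history):
        num += w * outcome
        den += w
        w *= decay
    return num / den if den else 0.5  # uninformed prior for an empty history

# A rater who was accurate early but has slipped recently scores
# below the plain mean of 4/6 ≈ 0.67:
print(decayed_reliability([1, 1, 1, 1, 0, 0]))
```

Combining this with heavier redundancy on a rater's initial tasks, as point 2 suggests, would give the decayed estimate a better-calibrated starting point.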

  14. 1. The authors are essentially proposing an iterative EM algorithm. Although they have discussed the algorithm in words, it was hard to get a clear picture because of the lack of pseudocode. Moreover, the authors did not explicitly describe their algorithm as EM, simply that it was a 'similar approach' (see the Related Works section). EM is based on some nice mathematical constructs but that is hard to see in this algorithm. It leaves the question open as to whether the algorithm was the main contribution of the paper after all, or if it was the findings in the first half of the paper.

    2. The IMLS DCC dataset that the authors used doesn't seem to have been used by many other authors. Is this dataset openly available? If not, how do we check this algorithm's validity or replicate the work? If yes, why has it only been used in this work or in a few others?

    3. I don't quite see why the random rater should be correct 67% of the time. There is no justification for this number, just that it is what the authors found on the primary dataset. Is this just an empirical finding (an improvement over the 50% we would expect), or can we also prove it in theory? A bigger question: why is a cheater, on random data, more often correct than not, at least on this dataset? And has this number been found to be similarly high by other authors on other datasets?

  15. 1. The study looks at whether online annotators are a reliable source for generating relevancy judgments. In order to assess rater quality, the study compares annotator assessments to each other as well as to an oracle set generated by the author. Is this an appropriate metric for assessing quality?

    2. What is the basis for using an iterative algorithm to assess rater quality? How are all of the variables mentioned actually used in the algorithm? It is said that 'info about the document', 'possible labels for the document', and info about the 'raters evaluating the document' are used (pg 7), but not really how they are used. Since this is not explained, I am unsure what the correct interpretation of Table 1 is (pg 9). How should the table be interpreted?

    3. I think most of the conclusions in the paper are well evidenced, except for one. What is the basis for saying that “high disagreement among non-expert raters is not necessarily indicative of problematic results”? In what way does the data reported show this?

  16. What is 'ground truth data' that the authors mention on page 3?

    I have a hard time accepting the general argument that raters don't get better over time. Was it just the duration of this experiment? Was there any mention of how long, or what training took place? Aren't raters getting trained? Do they have refresher sessions? Wouldn't you expect a rater to get better over time, even if only slightly? Is there no ongoing training? I would imagine there would be, especially at companies that pay their raters. In addition, do raters ever converse with one another, go to conferences, or have online communities? I would imagine this would give long-term raters opportunity for growth and improvement rather than staying static over time.

    What makes a 'problem worker'? If it is true that raters do not improve over time, then how can search engines seek out raters with good 'gut instincts' (for lack of a better word)? Or is it due to intelligence, or a large expert knowledge base?

  17. 1. As can be observed in Table 1, the accuracy rates of the iterative algorithms on Twitter sentiment are pretty low. As an explanation, the authors state that the raters showed an aversion to administrative categories by being unwilling to mark tweets as incoherent or spam. Doesn't this have more to do with the fact that Twitter contains more natural language and imposes character constraints (140 characters at most)? Do you think the accuracy would improve if the 'spam' and 'incoherent' labels were correlated with the 'I don't know' ratings?

    2. Figure 7 shows that the average accuracy of raters increased significantly with experience after a rater's 30th task on a given query. It can also be observed from Figure 3 that the average contribution per rater follows an inverse power distribution and that only very few raters contributed more than 30 tasks. How significant is this result, given that the probability of a rater giving more than 30 ratings for a given query is very low? When the data is not sufficient, the results are not necessarily accurate.

    3. Since there is no visible correlation between the lifetime experience of a rater and the rater's performance, do you think performance would improve significantly if a single query topic were allocated to a given rater? His experience with the query would then increase, which has been shown to improve accuracy.

  18. 1. On page 4, first paragraph, it says “... derived from the actual dataset, they are not completely reliable”. Majority voting is applied in much research work and has proven to be an effective method for crowdsourcing. Why do the authors hold this view? Is it just a hypothesis?
    2. When discussing the temporality and experience factors, it is common sense that a learning curve involves both. Usually, the more time spent, the more experience gained. So is it reasonable to treat these two factors separately here?
    3. In the Iterative Optimization Algorithm subsection, it is mentioned that “the number of iterations … varies”. What factors determine the number of iterations in a given scenario?

  19. 1. This paper uses accuracy to measure everything. Why did the authors not use precision, recall, or F-measure?
    2. The authors mentioned the work of Donmez et al, where they found the quality of raters changed over time. If this is true, does it mean a rater’s behavior is unstable and the rating quality is also not solid?
    3. The Y axis is labeled as “density” in Figure 6, what does it mean?
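
For question 1, the standard alternatives are straightforward to compute. The sketch below uses the textbook definitions (not code from the paper) and also shows why accuracy alone can mislead when the class distribution is skewed, the concern raised in comment 13.3 as well.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Standard binary precision, recall, and F1 relative to a positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# With a skewed class distribution, always predicting the majority class
# gets 0.9 accuracy while never finding the positive class:
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(precision_recall_f1(y_true, y_pred))  # → (0.0, 0.0, 0.0)
```

If relevant documents are rare in the rated pools, accuracy would reward raters who mostly answer "non-relevant", which is one concrete reason the question about alternative measures matters.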

  20. 1) In Figure 2, Organisciak et al. state that agreement with oracle ratings is a conservative approach. Can you elaborate more about this?

    2) When discussing lifetime experience, Organisciak et al. state that their hypothesis of quality increasing over time was incorrect. However, they previously mentioned that there was a clear distinction between raters who spent time “reading carefully” the task description and those who didn't. Wouldn't it have been reasonable to consider only the raters who read the instructions carefully in this evaluation?

    3) In Figure 7, there is a sharp increase in the accuracy rate after 30 tasks. Why? How come such a drastic increase in accuracy?