Crowdsourcing for Book Search Evaluation: Impact of HIT Design on Comparative System Ranking - Kazai et al.

Summary: The authors test the effectiveness of both crowdsourced workers and HIT designs in performing the tasks of "Prove It," a program which "aims to investigate effective ways to retrieve relevant parts of books that can aid a user in confirming or refuting a given factual claim" (p. 2). The program is motivated by the creation of academic citations to support claims in research. The authors compare two HIT designs against a gold standard, the INEX 2010 Book Track. They describe their various quality-control methods (trap questions, qualifying questions, timing conditions, captcha, redundancy, and more) and their pooling techniques for reducing the amount of information crowdworkers must sort through (top-n pool, rank-based pool, and answer-based pool). Their two types of HITs are given different quality controls as well, and the authors go into depth describing the "full design" and "simple design" methods. In their findings they discover a number of interesting things: HITs need to successfully engage the worker, the full design is better than the simple design, a consensus of multiple judgments is necessary, it is best to weed out workers with low label accuracy from the beginning, and researchers should not depend on only one metric in system evaluation.

Questions:

1. Once assessors have chosen to assess a specific book, they must not only go through the pages marked as relevant but are "encouraged to find further relevant pages" (p. 2) which might have been overlooked in the initial pooling process. How else could crowdworkers help improve pooling?

2. In the section on "Factors Improving Accuracy," the authors explain that the "total number of hits completed by a worker provides no clues about accuracy...similarly, the average time on a HIT is only weakly correlated with accuracy" (p. 7). This seems contradictory to other articles we have read. Is this because this study takes into account confounding variables such as the quality of the HITs themselves, and so takes some of the blame off of the crowdworkers?

3. When the authors note that, using the practices of this study, crowdworkers can be weeded out based on a specific gold standard and HITs can be measured for their practical performance, they are directly defining the use of the study as it applies to the improvement of IR systems. Why don't more of the studies we read directly explain their practical or future uses, even in their conclusions?
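The pooling techniques mentioned in the summary can be illustrated with a small sketch. This is a minimal, assumed implementation of top-n pooling only (not the paper's rank-based or answer-based variants); the function and run names are illustrative, not from the paper.

```python
# Sketch of top-n pooling: the judging pool is the union of the top-n
# results contributed by each retrieval system, so crowd workers only
# ever see documents that at least one system ranked highly.

def top_n_pool(runs, n):
    """runs: dict mapping run name -> list of book ids, best first.
    Returns the set of book ids to be sent out for crowd judging."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:n])
    return pool

runs = {
    "systemA": ["b1", "b2", "b3", "b4"],
    "systemB": ["b2", "b5", "b1", "b6"],
}
print(sorted(top_n_pool(runs, 2)))  # ['b1', 'b2', 'b5']
```

The overlap between runs ("b2" here) is judged once rather than per system, which is the labor saving the authors are after.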
On Aggregating Labels from Multiple Crowd Workers to Infer Relevance of Documents - Hosseini et al.

Summary: This research looks at methods of collecting relevance judgments through crowdsourcing, i.e., from a mostly untrained population producing possibly noisy judgments. The authors look at two methods, majority voting (MV) and expectation maximization (EM), and study how crowd workers have labeled a set of documents. The research goal is to estimate the true relevance of the documents. The MV method follows a "majority wins" protocol. The EM method jointly estimates the reliability of the workers and the "true relevance" of the documents. The reliability of a worker can be interpreted as accuracy, and is determined by comparing the number of times a worker's label matches the true relevance value of a document against the total number of judgments that worker made.

Questions:

1. I wish the researchers had elaborated a little more on how the "true relevance value" (pg. 185) is established, and by whom. This is an important factor to understand, since it is also used to establish the accuracy/reliability of the workers. So, with respect to the equation on pg. 185, how is the value of "k" established?

2. The INEX 2010 Book Search evaluation track allowed users to choose from a 4-grade judgment (0, 1, 2, 3). This experiment converts the 4-grade scale into a binary system, mapping (1, 2, 3) to 1. Do you think this is an accurate conversion? Should 1 be converted to 0 instead?

3. The research shows that the EM method is better than the MV method. But in evaluating the EM method, aren't they essentially comparing against a gold standard that is itself defined via an MV protocol?
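The two aggregation methods in the summary can be sketched concretely. This is a minimal, assumed implementation for binary labels with a single accuracy parameter per worker (a simplified Dawid-Skene-style scheme); the paper's exact model and update equations may differ.

```python
from collections import defaultdict

def majority_vote(labels):
    """labels: list of (worker, doc, label in {0,1}). Returns doc -> label.
    Ties break toward relevant (1)."""
    votes = defaultdict(list)
    for _, doc, lab in labels:
        votes[doc].append(lab)
    return {doc: int(sum(v) >= len(v) / 2) for doc, v in votes.items()}

def em_aggregate(labels, iters=20):
    """Alternately estimate P(doc relevant) and each worker's accuracy."""
    docs = {d for _, d, _ in labels}
    workers = {w for w, _, _ in labels}
    prob = {d: 0.5 for d in docs}    # current belief that doc is relevant
    acc = {w: 0.7 for w in workers}  # worker reliability, seeded uniformly
    for _ in range(iters):
        # E-step: posterior relevance of each doc given worker accuracies.
        for d in docs:
            p1 = p0 = 1.0
            for w, d2, lab in labels:
                if d2 != d:
                    continue
                p1 *= acc[w] if lab == 1 else 1 - acc[w]
                p0 *= acc[w] if lab == 0 else 1 - acc[w]
            prob[d] = p1 / (p1 + p0)
        # M-step: accuracy = expected fraction of a worker's labels that
        # match the inferred truth (clamped to keep probabilities nonzero).
        for w in workers:
            match = total = 0.0
            for w2, d, lab in labels:
                if w2 != w:
                    continue
                match += prob[d] if lab == 1 else 1 - prob[d]
                total += 1
            acc[w] = min(max(match / total, 0.01), 0.99)
    return {d: int(p >= 0.5) for d, p in prob.items()}, acc

labels = [("w1", "d1", 1), ("w2", "d1", 1), ("w3", "d1", 0),
          ("w1", "d2", 0), ("w2", "d2", 0), ("w3", "d2", 1)]
print(majority_vote(labels))   # {'d1': 1, 'd2': 0}
agg, acc = em_aggregate(labels)
print(agg)                     # {'d1': 1, 'd2': 0}
```

On this tiny example MV and EM agree, but EM additionally drives down the estimated reliability of "w3", who disagrees with the consensus on both documents, so a further conflicting vote from "w3" would count for less.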
Crowd Prefers the Middle Path: A New IAA Metric for Crowdsourcing Reveals Turkers' Biases in Query Segmentation

Summary: The author explores whether crowdsourcing query segmentation is a reasonable practice. To evaluate this, he sets up a series of crowdsourced tasks, covering a range of query types in order to capture enough information to draw general conclusions about crowd worker behavior. For each task, the author calculates an IAA (inter-annotator agreement) value. The IAA metric is based on a community-accepted standard for diverse judges, extended by the author to account for the types of segmentations being evaluated. In the end, the author concludes that crowd workers are subject to four strong biases and thus are not good candidates for providing query segmentations.

Questions:

1. In previous papers, authors have addressed the issue of crowdsourced workers who are only concerned with making a quick buck. It seems well known in the IR community that a portion of crowdsourced workers are only after money; therefore, researchers tend to have a series of safeguards in place. For this paper, the author argues that the natural lack of agreement between experts means a "gold" standard cannot be used to weed out nefarious workers. The only safeguard in place is for nested segmentation formations. Is this enough to prevent money-driven workers from impacting the study? Researchers using the crowd to gather relevance judgments also run into the issue of disagreement among experts, and as a result they give workers a 65% accuracy requirement. It seems this same approach could be used to ensure that the right workers' efforts are the ones counted.

2. When setting up the formulas used to evaluate the study results, the author outlines two different distance metrics, d1 and d2. After providing their respective formulas, the author notes that for flat segmentation the two measures are equivalent, and that d2 was created for nested segmentation evaluations. Yet flat and nested segmentations are evaluated over both distance measures. What is the point of evaluating both? When outlining the biases, the author never directly refers to d1 or d2, but instead looks at the overall results, and there does not seem to be a unique reason to calculate d1 at all. The added work seems unnecessary, since the author never reveals which measure is better or more impactful.

3. The author proposes four different biases he believes crowd workers are prone to. The first two seem to highlight an intrinsic human behavior: splitting queries into equal portions or producing balanced trees. The author makes a side comment that the experts also seem to show some of this bias, but the point is not drawn out further. Given that the author is arguing that crowd workers are not a good fit for query segmentation, isn't the existence of bias in the already-used experts a point worth expanding on? I feel the author cannot disregard crowd workers unless their bias is much more severe than the experts', but this is never demonstrated.
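The paper's exact d1 and d2 formulas are not reproduced in this summary, so the sketch below is only an assumed illustration of the general idea: one common way to compare two flat segmentations of the same query is to count the break positions on which the annotators disagree (the symmetric difference of their boundary sets). The `|` notation and function names are mine, not the paper's.

```python
# Assumed flat-segmentation distance: segmentations are written with "|"
# between segments, e.g. "new york | travel guide".

def breaks(segmentation):
    """Return the set of word positions after which a boundary occurs."""
    pos, out = 0, set()
    for seg in segmentation.split("|"):
        pos += len(seg.split())
        out.add(pos)
    out.discard(pos)  # the end of the query is not a real boundary choice
    return out

def flat_distance(a, b):
    """Number of break positions on which the two segmentations disagree."""
    return len(breaks(a) ^ breaks(b))

print(flat_distance("new york | travel guide",
                    "new york travel | guide"))  # 2
```

Identical segmentations score 0; the example scores 2 because each annotator placed one boundary the other did not. A nested-segmentation distance (the role d2 plays in the paper) would additionally have to compare tree structure, not just boundary sets.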
Summary: As the first step in understanding queries, query segmentation is performed. The author mentions the two usual ways in which this segmentation is done: nested segmentation and flat segmentation. He then explores the effectiveness of crowdsourcing for evaluating query segmentation results. In order to evaluate the results of the test, he extends a metric, Inter-Annotator Agreement. Based on the test results, the author identifies four ways in which human behaviour becomes biased when splitting queries.

Questions:

1. We are trying to break the query into segments to analyse user intent. Is it possible to drive out the inaccuracies introduced by user bias by testing the effectiveness of the search engine both when the query is used as a whole and when a crowdsource worker splits the query into segments, and then comparing the outputs? After all, our ultimate goal is to check how well the search engine behaves when the query is split in different manners.

2. "Note that we intentionally kept definitions of flat and nested segmentation fuzzy because (a) it would require very long instruction manuals to cover all possible cases and (b) Turkers do not tend to read verbose and complex instructions." But because the definitions were not clear to the AMT workers, the results were bound not to match expectations. A basic set of examples could have produced different, and perhaps more reliable, results. Maybe the requirements were not clear enough for the evaluators, and that is why the results are not what was expected.

3. The author describes four types of bias to which crowd workers are very prone when splitting a query, e.g.: "There exist very strong biases amongst annotators to divide a query into two roughly equal parts that result in misleadingly high agreements." But the author has also indicated this is something all human beings are prone to. Doesn't this imply that creating a gold set for this problem by using humans as evaluators is a pointless effort?
[Lydia Chilton et al. Task search in a human computation market]

In this paper, Kazai describes a social game that is used to do different kinds of relevance assessments for books. In the pilot study, the game used the INEX 2008 Book Track and anyone over 16 could participate. However, the participation of those outside the benefited groups was short-lived. Regardless, the results of the pilot showed high-quality assessments.

1) While this might be considered a failure on the social-game side, since everyone outside the interest groups left, the environment created by the game seems to foster good-quality assessments. Is it possible for a similar system to work in either a TREC-like or a crowd environment?

2) Another interesting aspect of this system is the potential for extracting relevance assessments tied to a specific user profile. What are some of the benefits of having relevance assessments for a specific group of users, compared with the idea that one topic judged by a single assessor represents a user group? Also, how can these assessments be extracted?

3) While this system provides load balancing of the work of assessing books, is it really a solution for the demand for judgments?