Friday, October 4, 2013

10-10 Optional Alternative Reading


  1. Quality through Flow and Immersion: Gamifying Crowdsourced and Relevance Assessments
    Summary: The authors investigate how to get high-quality results from crowdsourcing. One of the biggest issues with using crowdsourced workers to provide relevance assessments is that some workers try to cheat the system to earn money. The authors classify workers into two types: money-driven and entertainment-driven. They hypothesize that tailoring the experience to attract entertainment-driven workers will produce the best results. In their experiments they ran a traditional HIT task and, in addition, a game version of the same task. Comparing the two approaches, they found, in general, several benefits to using a game over a traditional HIT. From a financial standpoint, the authors found that many users returned to play outside of the crowdsourced assignment, meaning they were not paid for playing the game again.

    1. The authors designed the game so that in each round the speed of the falling items increases. As one would expect, they note that errors increased with each round. To account for this, the authors try to quantify the error rate and adjust the accuracy of each round. In their experiments the authors have a ground truth, since they created both tasks out of TREC. However, this will not always be the case. Using this approach on new data, would it be worth keeping results from the later rounds of the game? Whoever designs the task could use estimates to predict the accuracy of judgments in each round, but they would have no way of knowing explicitly which individual judgments were wrong because of the increase in speed. The results of the game seemed positive overall, with users returning multiple times even without payment. Would it be better to take into account only the first-round results? Or the first two rounds?

    2. The game is set up so that the player matches a falling keyword to a category. For relevance assessments, the topic titles are the categories, a keyword from the document is the falling item, and a box shows the keyword's context in the document. The player never reads the actual document. As a result, their relevance assessment is based solely on the context displayed. A keyword may be referenced several times in the document with varying degrees of relevance. The player is making a judgment based on a limited view that the game designer chooses. Is this game design a valid method for obtaining relevance judgments for an entire document?

    3. The authors take the time to explore which demographic their game appealed to the most, as well as the quality of results across that division. They found a significant difference in preferences by gender. As a whole, men preferred the game task over the conventional task more than women did. This result is not surprising given the well-known demographic of the gaming industry as a whole. When it comes to relevance judgments, is this something that can affect the applicability of the game approach? Although some topics may be more relevant to one gender than the other, crowdsourcing isn’t controlled for gender distribution in the first place. Therefore, I would not think it would be a negative aspect, but it does mean the approach is, by nature, already biased.

  2. News Vertical Search: When and What to Display to Users

    Summary: In this study the authors tried to find out whether it is effective to supplement search results for breaking news stories with content from newswire, blogs, and Twitter. They set up a study using assessors from Amazon Mechanical Turk to compare a page with normal web results against a page of web results with the additional content. Through the study they found that adding the additional content does make the results more useful, but the type of content to add depends on how long after the event the user is searching.

    1. Most of the crowdsourced workers in the study were from India or the United States. Since the authors are evaluating the usefulness of news content, would the types of news events displayed affect the results? Should test designers take into account the demographics of their participants when creating a study?

    2. In the display of the query results the authors provided additional topic descriptions on the significance of the query. Does providing a written description of query intent help assessors create more accurate relevance judgments? How does a written topic description bias the results?

    3. In this study, the assessor is shown a lot of information, including the topic, a description of the topic, and a time stamp. In making a judgment, most assessors completed the task in under 20 seconds. How closely do you think an assessor would look at the timestamps since they made their judgment so quickly? Could this affect the results? How do you design effective crowdsourcing experiments? Did the designers make crowdsourced workers do too much for this task?

  3. Summary:
    The authors of this paper explore other ways to evaluate search engines. They set out to establish a relationship between queries and the web pages a user is looking for. Because doing this for web pages at large would be very costly, they tested their idea by developing a game and then used the findings from that game to show how this relationship between a query and a web page can be used to compare the performance of search engines.
    They introduce a new concept, findability, which is a measure of how easy or difficult it is for a page to be retrieved by a search engine. They propose that the performance of two different search engines can be compared by comparing findability across the data set. They then show how they used bitext matching to extract phrases from the queries entered by users, to see what text retrieves a given page and what combination of phrases is needed to find it. Finally, they categorize the queries into four types based on this bitext matching.

    Questions:
    a. Won't gathering results from a game introduce uncertainty about data quality? When we have graders or evaluators, they have some responsibility for the work they are doing because they are paid. And yet results gathered in that way can still be questioned in numerous ways, as we have discussed in class many times. In this case, the whole data set was gathered from people whose primary focus is enjoyment. Can data gathered in this way be considered reliable?

    b. The authors are very vague about how bitext matching would be implemented for the evaluation of search engines. One application could be testing whether the search engine is able to cater to all types of queries. Also, given that categories 1 and 4 mentioned on page 8 are the most widely used queries, it makes sense to encourage people to test search engines with these types of queries, to make sure that the search engine can actually cater to all types of requests. But are there any other ways in which bitext matching can be used in the evaluation of search engines?

    c. Users tend to change their query based on the result set. What if a page is retrieved by one search engine and not by another, but the queries being used by users are different? It is also possible that users (in our case the gamers) have adapted so that they are able to mould their queries better. But the search engines will be used by people who do not have this kind of expertise. So how much of the results derived from this testing can be considered reliable? Also, the authors mention that people who started ranking among the highest scorers tended to play more to maintain their scores. Doesn't that introduce a bias in the test results?

  4. Optional Reading: Ma et al: Improving Search Engines using Human Computation Games

    Summary: The authors attempt to tackle the problem of which documents are returned for a particular query from the other end. They ask: given a document (in this case a web page), what query would return it? They develop a game called Page Hunt (as well as two multiplayer variations) that presents a user with a web page and asks the user to type a query into the search engine that returns this page in the top n results. The game had a limited run, and the following preliminary findings were established: 1) the game seemed to be fun (important for player retention and engagement); 2) the winning queries reasonably matched real-life queries; 3) how easily a search engine can find a URL is directly related to how long the URL is; 4) the variations and mixed abbreviations used in queries provided data for a bitext matching/expansion method to improve search results.

    1- How long did the game run for? I noticed it is no longer running. A figure of 10,000+ web users was quoted. How many of those were unique vs. repeat users? Why was the game removed if it was providing useful data?

    2- After discussing, at length, the issues with other multiplayer games, why develop not one but two multiplayer variations of Page Hunt? What part of the data presented was gathered from the multiplayer games? I assumed that all of the formalized findings were from single-player Page Hunt, but I wasn’t sure. A clear dividing line should be established, and the differences between the data from the multiplayer games and the single-player game should be discussed. I was actually curious what the findings of the multiplayer games were. Does multiplayer double the results because two inputs are recorded per page, or halve them because only half as many pages are shown? In the collaborative game, what data specifically was being collected?

  5. Optional reading: Improving Search Engines Using Human Computation Games by Ma et al.

    1. In this article the authors describe using a game called Page Hunt to identify possible queries for web pages, checking each query against the Bing search engine by seeing whether the web page showed up in the top 10 Bing results for that query. One problem that the authors did not address is bias. What happens when a user who is playing this game to win learns how to structure their queries so that Bing will respond with the correct page? Wouldn’t this behavior just end up supporting the search engine’s current algorithm, because the user is essentially learning how to use it better?
    2. In this article the authors point out that the quality of the game that they were creating depended on the webpages used, as some were better than others. What different types of pages would you use if you were choosing which webpages to use in this game and why would you choose these pages?
    3. In the game described in this article the developers used only Bing as the search engine to test the queries that players submitted. The main reason for this seems to be that two of the researchers were from Microsoft and wanted to use this work to improve the Bing engine. However, is it possible to create a version of this game that runs the query on several different search engines at once? What kind of results could you gain from creating such a game, and do you think that multiple search engines would subtract from the fun of the game?

  6. Paper: Crowd Sourcing Local Search Relevance

    Summary: Similarly to other papers we have read in class, the authors use crowd sourcing (through Amazon Mechanical Turk) as a cheaper / more efficient alternative to human judges for determining relevance, specifically targeting local search relevance. They compare “trained labelers” (professional judges, usually domain experts) and “workers” (casual raters through AMT) with the aid of a metric that they call Interannotator Agreement (ITA). For each query-location pair they select a service using the (yellow pages) search engine, forming a query-location-service triplet. 4 trained labelers are used for each triplet to determine the relevance of the service. Similarly, each HIT (AMT task) is presented to 5 workers for relevance judgment. In all cases a 3 point scale of relevance is used. ITA is computed by taking the Pearson correlation of one worker’s label value with the average of the labels from the other 4 workers. They discovered that using approximately 30% of the best workers (according to their ITAs) yields an average ITA equivalent to using trained judges, which is still a large savings considering the cost of trained judges vs AMT workers. Lastly, they separated the topical relevance task and distance task, and noted that the distance task had a much lower ITA. This suggests that topical relevance is a more important factor.
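
    To make the ITA computation described above concrete, here is a minimal sketch in Python. It is not the authors' code. It assumes, for simplicity, that the same fixed set of workers labels every item, which will not generally hold on AMT, and the function names (pearson, ita_per_worker) are made up for illustration. Returning 0.0 when a worker gives the same label to every item is also an assumption of this sketch, not something the paper specifies.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Assumption: correlation is undefined for constant labels;
        # this sketch treats that case as "no agreement".
        return 0.0
    return cov / sqrt(var_x * var_y)


def ita_per_worker(labels):
    """labels[w][i] is the label worker w gave to item i.

    For each worker, correlate their labels with the average of the
    labels given by all the other workers on the same items.
    """
    n_workers = len(labels)
    n_items = len(labels[0])
    scores = []
    for w in range(n_workers):
        others_mean = [
            sum(labels[o][i] for o in range(n_workers) if o != w) / (n_workers - 1)
            for i in range(n_items)
        ]
        scores.append(pearson(labels[w], others_mean))
    return scores


if __name__ == "__main__":
    # Hypothetical labels on a 3-point scale: 5 workers, 4 triplets.
    example = [
        [1, 2, 3, 1],
        [1, 2, 3, 1],
        [1, 2, 3, 2],
        [1, 3, 3, 1],
        [3, 2, 1, 3],  # a worker who disagrees with everyone else
    ]
    for worker, score in enumerate(ita_per_worker(example)):
        print(worker, round(score, 3))
```

    Sorting workers by this score and keeping only the top fraction is one way the "best 30% of workers" selection mentioned in the summary could be carried out.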

    1) I thought the idea of separately analyzing the location relevance part of the search was interesting, but the way they went about it heavily influenced the result. Specifically, why did they think the right question to pose for the distance task was, “Is this service close enough to the query location?” Even with the guidelines they gave (i.e., pizza in Manhattan vs furniture in a rural area), this is a very subjective question. Moreover, it seems like queries in urban areas should almost always have objective answers. If I search for “Pizza in Manhattan,” the result is either in Manhattan or it isn’t. Were they specifically targeting triplets where the service is not exactly within the bounds of the location (e.g., San Mateo is not within San Francisco)?

    2) Assuming that ITA on distance task vs ITA on topical relevance task accurately represent the correlation of relevance judgments for these two tasks, is the right conclusion that topical relevance is a more important factor than geographical aboutness? Just because the task leads to more subjective judgments and less correlation does not mean that people care about it less.

    3) ITA seems like a reasonable metric for comparing relevance between different types of judges. However, what makes it special or better suited for local search relevance? Which techniques that work for general relevance fail for local search relevance? Aside from the final section, which separates the two pieces (topic and location) and does a short comparison, it seems like the paper could have been called, “Crowd Sourcing Search Relevance.”

  7. "Improving Search Engines Using Human Computation Games" -Hao Ma, et al

    Summary: This article focuses on the creation of "Page Hunt," the first single-player game designed to help improve search engines. Prior to this, games had at least two players and fell into one of three categories: output agreement, inversion problem, and input agreement. A problem with two-player games is that if there is an odd number of players, one will have to be simulated. In Page Hunt, the player is given a webpage and must generate a query which will lead to it. Some precautions are put in place to ensure that this process generates useful queries, not just long strings of exact text. For this purpose, players cannot copy and paste from the website. The search engine's preexisting limit on query length is also helpful. Some of the findings were that players thought the game was fun (which was one of the researchers' goals) and that the longer the URL, the less findable the website. In the future, the researchers hope to incorporate eye-tracking into the study.

    1. Both this article and another we have read quote the statistic that "By age 21...the average American has spent 10,000 hours"(p. 1). Are they proposing using minors in testing these types of games? I have read studies that say middle-aged women are one of the largest demographics for online game play. Who would be a good target group to test these games?

    2. The authors say that "this can be valuable in training people how to query better"(p. 2). Is this the goal of the experiment? How is this useful for real-life user needs and IR improvement?

    3. The queries generated were separated into the simple categories of "Ok, over-specified, underspecified." What would be a good way to broaden this scale and find more and more meaningful categories? In further studies, what else can be learned by expanding these classifications?