1. The authors state that order within the first two pages doesn't matter. In what other searches does order on the page not matter? How would this affect the design of the algorithm? How would this affect the interface design?
2. In the section titled “Future Work,” the authors write that the method does not account for keyword searching or sorting by qualifications (p. 10). How could you build on the methods of this study to examine how MTurk workers use keywords to find tasks? What other methods would you need to use to examine how keyword searching affects the ability of MTurk workers to find tasks?
3. On p. 3 the authors write that “price is not indicative of work quality.” Since MTurk task posters cannot rely on a higher price to get higher quality work, what other methods could they use? How could academics work to make their task more visible to the right MTurk workers?
1. The authors' strategy of using "publicly available data" and setting up a statistical measurement program to track that data seems like a good idea. How else could this strategy be implemented to improve IR systems or track search engine practices? Could it be used to track how long a top result returned for a query remains in this highly ranked position?
2. It seems like there could be a symbiotic relationship between academics and commercial establishments, as we see in this article, in which academics work with Mechanical Turk, and in "Difference in Search Engine Evaluation Between Query Owners and Non-Owners," between researchers and Google. Researchers using this kind of data produce results that are useful for the academic field, but also for the companies. Would proposing studies to companies like Google and Amazon be a way to potentially receive funding for a project? Or would the companies want too much control over the study?
3. When reviewing the results of this study, the authors note their surprise that workers were willing to go very deep past the first pages of HITs results. What are the benefits of skipping the first results pages when the relevance or general appeal may decrease significantly? We briefly started talking about this in our last class, addressing the fact that on Google, the first results page is sponsor-oriented. I would like to continue that discussion.
4. Extra credit? I found two typos in this article. On page 3, paragraph 3, "they" should be "that" in the last sentence. And on page 4, paragraph 3, "group specific-effect" should be "group-specific effect" (as they use the hyphenation later in that same paragraph).
1. Understanding workers’ search performance is very important for IR tasks using crowdsourcing. In the first study, the HIT disappearance rate is used as a metric for studying the effects of different factors on task search. However, the HIT disappearance rate is affected not only by the time users spend finding the HIT, but also by the time it takes to finish the task. In other words, the reason that some tasks have a high disappearance rate might simply be that they take less time to finish. So how do the authors rule out this possibility?
2. The authors proposed a model incorporating group-specific random effects, which might partly explain the task-specific properties. Why was the normal distribution chosen for the group-specific random effect? The results show that this model performs worse than the pooled model because of possible gaming behavior. Is there any way to separate the effects of gaming behavior from the real nature of the task (difficulty, content-specific interests)?
3. The paper presents an essential study that tries to understand the search performance of workers on AMT. Do the factors that affect the disappearance rate of HITs also affect the quality of the work? For example, the “Best-case” postings may have the lowest accuracy, while the “Worst-case” postings have the highest accuracy. Considering that only a survey is used in this study, is there any other study that tries to address the quality problem?
The authors state that “We find that workers look mostly at the first page of the most recently posted tasks and the first two pages of the tasks with the most available instance but in both categories the position on the result page is unimportant to workers”. And then later the authors confirm “that a favourable position in the search results do matter; our task with favourable positioning was completed 30 times faster...”. Are these two findings contradictory, or am I missing something? The authors state that “Our premise for observing search behavior is that search methods which return HITs with higher rates of disappearance are the search methods which workers use more”. Do the authors imply that search methods are the independent variables and the rate of disappearance of HITs is the dependent variable, which would further imply that search methods solely determine the rate of disappearance of HITs? If so, is that valid? What happened to possible confounding variables like the difficulty level of HITs, the amount rewarded, the risk involved, and the required qualifications? How do the authors rule out the possible impact of these variables? The authors assume that “the disappearance of a particular HIT during some interval of time is an unknown function of that HIT's attributes and its page placement and page position”. And the authors further assume that the HIT's attributes are constant. This is a dangerous assumption: from my personal experience with Mechanical Turk, users can give live feedback on HITs, which might influence their disappearance.
1. The researchers put forth that “requesters pay workers for acceptable work”. How is acceptable work defined in the MTurk environment, or on any other crowdsourcing platform that accommodates a diverse crowd and where requesters cannot set or do not have a standard set of expectations? Then, does completion equal acceptable?
2. I am a little confused about HIT Type (1), where workers can perform a single task multiple times. I see how this can be monetarily beneficial to workers, making it easier for them to complete the task the second time, having gone through it before. But how do the requesters benefit from the same person performing the task more than once?
3. I appreciate the effort the researchers took to manipulate where the surveys were placed for this experiment. But do you agree that surveys “provide more detailed explanations for worker behavior than statistics”? Don’t you think more unobtrusive methods, like query logs or performance statistics, are more effective metrics? (Here I have to say that I agree that surveys might be better for gathering subjective estimations, purely because they directly ask for user input.)
The parameters used for scraping were selected without any justification. Why were only 3 pages of results scraped? Was it to reduce the volume of data? Or was there data supporting that workers do not go beyond page 3? Or was it because most of the worker activity is seen in the top 3 pages? Particularly when workers were drilling into pages as deep as 15 (as seen in some of the sorting methods), is the 3-page limit for retrieval enough? In the group-specific random effects model, how was the attractiveness value of a group computed? The authors claim that adding this value to the model solves the static characteristic problem. Is it computed by averaging the empirical data for various observed static parameters? For the title (A-Z) sorted order, the first three pages are almost always constant. However, there might be some newly added HITs whose titles position them in the first three pages. Each such HIT will replace an older HIT. This does not mean that tasks disappear because the workers completed the HITs. What was this sorting order intended to achieve? Does it not violate the underlying assumption about the disappearance of HITs?
1. My first question is about one observation from this paper. One conclusion of the paper is that workers search by almost all the possible categories and look more than 10 pages deep. It reminds me of my personal experience with job search engines like Indeed. When I was searching for jobs on Indeed, although I was looking for a software developer internship, I still wanted to check for other opportunities, like testing, QA, and anything else that might be directly or indirectly related to software development. I was trying to search as widely as possible so as not to miss any valuable information. I am wondering how we can make use of this desire for completeness to improve search results?
2. My second question is about the premise of the first method, which observed the disappearance rate of tasks across the key ways Mechanical Turk provides for workers to sort tasks. This implies that if a task disappears, then it has already been done by a sufficient number of workers. I question this based on the following two concerns. The first is that a task’s disappearance does not necessarily mean that it has been finished. Other reasons, such as cancellation by the employer, are also possible. My second concern is that it does not take into account the number of workers the task requires. For example, a task that requires only 10 people is likely to disappear much faster than a task that requires 1000 people.
3. My third question is about the second method, which hired paid workers for self-reported information on how they searched for tasks. I have two questions here. The first one is whether we should distinguish between these workers by searching ability. Some workers may have more searching experience on Mechanical Turk than others; should we divide them into different groups by their searching experience?
My second question concerns the result that, among 257 unique workers, 70 were recruited by the best-case posting and 58 were recruited by the worst-case posting. The difference between the number of people in these two groups is not very big. The worst-case posting is done by making the survey HIT appear in the middle of the pack, which is hard to find. Does this mean that the ranking mechanism is relatively unimportant for job search?
1. One of the workers states that “It is difficult to move through the pages to find a HIT because it automatically takes you to page 1 once you finish a HIT”. In this context, isn’t MTurk itself biased towards forcing workers to choose from the first few pages, since workers might settle for a HIT that is readily available rather than search for one, which takes more time and effort? If the above question makes sense, then is the result that workers tend to choose more from the first two pages a bit skewed too?
2. From figure 3, it can be observed that the a-z favored posting consistently attracted more workers than the newest favored posting. The authors claim that this is because the work is more likely to stay in the same place under a-z, while under the time-sorted order it might soon be replaced by a newer task. The result, though, seems to indicate that users prefer searching for tasks in the a-z order rather than in chronological order. Are both these inferences consistent with each other? Does it not make sense to model the user behavior in a better manner than described in the paper?
3. The authors state in section 4.5 that the results represented by the pooled model are precise but less reliable. But one can observe from section (b) that there is no observable correlation or pattern between the reward and the expected HIT disappearing event per scrape iteration. This seems a bit counter-intuitive. If the authors had done a study including ‘money-favored’ tasks, as discussed in question 2, and observed the worker behavior, would it not have answered most of the questions objectively? Has anyone done that already? An additional point to note is that in the best-case posting in method B, the workers chose the best posting although it gave a reward of just $.01.
1. When they scrape for tasks and notice that a task disappears, how do they know that it wasn't just removed by the requester?
2. This paper states:
> The “search problem” is particularly challenging in labor markets because both jobs and workers are unique, which means that there is no single prevailing price for a unit of labor, nevermind a commodified unit of labor.
But isn't the price for a "commodified unit of labor" going to be easier to standardize based on the increased population from which to sample?
3. There are several factors that make their scraping-based measurement of task popularity questionable, including the bias that HIT characteristics that are likely to affect a HIT's popularity also affect its page placement, the uncertainty about whether a HIT was removed because it was taken or because it was deleted by the requester, and the tendency of requesters to re-post HITs so that they stay higher in the list of 'newest' tasks. It seems like when these effects combine, they produce very noisy results, so can we really trust their model?
1. The paper does provide a pretty neat way to study search behaviour by reviewing MTurk's search features. However, when dealing with MTurk, the issue I see as most evident is the selection bias that exists. MTurk workers have prior, repeated exposure to tasks of a particular nature, such as data entry, research or transcription, so is it fair to automatically extrapolate their results to all MTurk workers? Wouldn't MTurk users themselves have a bias towards a particular search pattern, developed over time from the prior tasks they have performed?
2. We have seen in Efron's paper how raters who spend more time on their first rating of a task set are significantly better performers on the task. However, MTurk functions on the premise of fast and cheap data. From the worker's perspective, does this not translate into how fast they can complete the HITs? And so, are we not compromising on effectiveness and quality through this principle? Further, how is it justified to always use simple majority voting to determine correct responses in studies that require analysing variability in data?
3. MTurk does not delve into how the distribution can get skewed when a worker reruns an experiment. There currently seems to be no efficient mechanism in MTurk that prevents a worker who has already performed a HIT from rerunning it. This principle has so far only been applied to surveys. Wouldn't the absence of such a check result in the overemphasis of a particular output simply due to reruns of the experiment? Also, isn't this likely to be the cause of a discrepancy in the results that have been tabulated? How can such deception be avoided?
Lydian creates a unique way to investigate how crowd workers search for tasks and what factors actually influence their decisions to take certain tasks. In attempting to study the influence of page placement and page position on the disappearance rate of HITs, Lydian uses a “natural experiment” in her first study; however, since it is too difficult to control the variables in this method, I think it is not very suitable for testing a conceptual model. Moreover, since the model Lydian proposes is very simple due to the limitations of this method, its results are unlikely to be persuasive. In Lydian’s second study, is it possible that the participants are actually prone to taking survey tasks? In other words, can these participants represent most of the workers on Mechanical Turk? Perhaps the workers taking other tasks have different ways of searching for tasks. In figure 3, it’s obvious that the survey task with the best-case posting is completed in the shortest time. However, how about other kinds of tasks? Will the results be similar if the types of tasks are quite different from survey tasks?
1. Rather than find a one-size-fits-all model, the authors took a more practical approach of seeing where the group-specific effects model and pooled model were weak and strong. As it turns out, the pooled model does a good job of modeling for the 'newest' sort, while the group-specific effects model does a good job with 'most available'. It seems like this is because the latter models worker behavior correctly, while the former models requester behavior correctly (in that they're trying to game the system). Or is this too broad a conclusion?
2. I'm surprised that requesters are able to game the system the way it's described in the paper, by simply reposting and trying to maintain their dominance in the rankings in the 'newest' category. If that's indeed the case and crowd workers know of this trend, doesn't it make sense for them to go beyond the first page and look for tasks that are genuinely new, and are more likely to show up in later pages? Yet the study seems to conclude that when it comes to the 'newest' category, workers tend to exclusively focus on the first page.
3. Perhaps the most illuminating comment came right at the end where it was reported that many workers would like MTurk to not report requests from certain requesters. This shows that there is at least one non-trivial bias that needs to be accounted for. Specifically, if a worker knows to avoid a requester, then it doesn't matter what position that requester occupies since it will be avoided. I was wondering if any of the models the authors presented compensate for these kinds of biases.
In section 4, the researchers point out that there are certain aspects of HITs that are essentially unobservable and assume that certain “effects” remain the same over time. Does it have any influence on the experiment and its results if certain aspects of how HITs function are assumed to remain constant? Section 4.4 states that there is a drop-off in position effects two and a half pages in when searching using the availability function. The researchers assume this drop-off results from workers abandoning their searches and present this as a spot for future investigation. Based upon what we know about the time/cost concepts workers seem to utilize in dealing with HITs, how likely is it that they are abandoning their searches at this point? In the results section, 5.2, the HIT workers surveyed mentioned the problems associated with moving through multiple pages, and oftentimes they stopped at around page 10. Is this a result of the usability of the search pages on the Turk more than anything else? It seems like most of the “free-form comments” point in this direction, and as a result, changes to the interface might yield different results for the entire experiment.
1. The authors develop two different models for evaluation. The first is the group random effect model, which looks at HIT groups individually to account for a number of factors that can impact their disappearance rate. The second is a pooled model, which looks at all tasks as a whole. The two different granularities try to capture different aspects of each task. The authors state that group random effect models lead to more credible results while pooled models lead to more precise results. This seems like almost a contradictory separation; one would want results that are both credible and precise. Does this indicate that both measures should be used to draw conclusions or that neither model is desirable?
2. In their evaluation, the authors discovered a tricky case in which the results are impacted by people posting tasks trying to game the system. People would repost their task in order to keep it high on the “newest” list, which many people use for finding new tasks. Using the group random effect model, the results indicate that it is not good to be among the first few results on the “newest” page because these tasks do not appear to disappear. In fact, people are gaming the system such that these results would not change. However, the measure is not able to capture this. At the same time, the pooled model is able to capture this exchange and considers being at the top of the “newest” page desirable. As a result, the authors conclude that the pooled model is best suited to this situation. However, I would think that people try to game all aspects of MTurk to get their task seen by the most people. Should evaluations just follow the pooled model approach?
3. Based on the survey conducted, the authors discovered a few bugs in MTurk’s search interface. Specifically, when specific search criteria were specified, they were not always upheld. This bug first appeared in the responses to the best-case criteria used to post the survey.
People searched for HITs with a minimum above that of the best-case criteria and yet the task still appeared to them. The authors draw some conclusions from the number of responses from the four different criteria they used to display their survey task. However, doesn’t a bug in the interface that displays tasks that should have been filtered out impede the ability to draw any conclusions? The authors’ inferences about search patterns based on typed responses would all still be valid, but the rate of response may not reflect the actual desired behavior of the user.
In terms of worker reviews, do highly-ranked MTurk workers seek out tasks in different manners than low-ranked MTurk workers? Are highly-ranked HIT sources often found through different search strategies than low-ranked sources? It would be interesting to learn about the relationship between worker attributes (in terms of both their rank and other criteria) and search strategies. Why did the "random" vs. "fixed" effects distinction become less important with more data? Wouldn't more data generally support one approach over the other? Also, why not update the prior assumptions based on the user behavior? I might be misreading it, but the model in the first section seems a bit simplistic in this regard. In what other ways might both companies and users "game" MTurk? It seems that these gaming activities clearly impact the findings of the research task, as mentioned on several pages and for both studies conducted. Moreover, I imagine that gaming is fairly pervasive and nuanced on both sides (workers and posters). How exactly do we separate "gaming" from normal strategic economic behavior, and is there any further research on gaming and strategizing by both users and posters?
1) The authors find significant correlation between favorable positions and time/cost savings, but their investigation focuses on HIT groups that have lots of available HITs. They mention that “1-HIT Wonders” are still often completed (albeit after more time). What would motivate users to select these sorts of groups? Were there significant financial benefits? The authors do not really discuss it but I was curious.
2) I was struck by the fact that high reward categories were so unpopular. Is it a cost-to-work ratio issue, where workers can tell the task is too time-consuming and labor-intensive for the money? I’d like to see some analysis that qualifies, for example, the first ten pages of highest reward HITs based on how workers feel about the workload. Even just qualitative feedback/comments (like towards the end of the paper) would be interesting to hear.
3) At what point did Amazon Mechanical Turk start gaining popularity? I’m wondering specifically because of the complaints from users about navigating the pages. Do they intentionally always place users back where they started? It seems fairly simple to maintain the page from the previous HIT, and odd that this has not been implemented.
1. The authors proposed a formula where Xi referred to all factors. However, that formula seems not to be fully utilized, since only factors like sort category, page placement, and page position were covered. If there are other unknown factors, what is the value of this paper, given that the authors constrained themselves to the factors mentioned above?
2. In section 4.4, the paper concluded that workers actively sorted by the most available HITs. However, the most available sorting is the default sorting provided by MTurk if the workers have no preference to change the sorting category. So, this conclusion seems arbitrary.
3. Section 4.5 mentions the HITs with high reward. Those HITs might be difficult or require special skills. In that case, difficulty and skill requirements should be additional factors in the research, which this paper seems to ignore.
1. The authors mentioned in the related work that there are similarities between many kinds of online search activities, and that they focused only on the domain of workers searching for HITs on MTurk, with web search behavior as a guide. Beyond this point, what else in this paper's research can be applied to web search?
2. When discussing method A, the authors listed several independent variables to be observed. Are there any other factors influencing user behavior?
3. In the formula on page 3, the authors state that “our key research goal was to … position of i -- while keeping Xi constant -- affect ...”. However, how to keep Xi constant is questionable, since the factors in Xi are not under the authors’ control and might change, e.g. the requester might change the content of a HIT.
In section 4.1.1, the authors talk about how the outcomes can be challenged by users deleting a large number of HITs at one time, but my question is, don't most people go through their inboxes and delete all the unimportant items at one time? The authors mention how this would affect the data, and that seems to be a part of user interaction that should be studied further. Can we discuss the 'group random effect' vs the pooled model effect, and how the random effects model suggests no effects on the uptake pattern? What are the ethical implications of using Mechanical Turk and internet workers? Does this paper talk about where these MTurk workers come from? What country? I wonder if some of these results would be different depending on who is filling out these questions, and where the data is coming from.
1. The authors have stated that in MTurk, on each worker’s account summary page, there are 10 suggested HITs selected based upon high rewards, and also that they are unaware of any services to intelligently recommend HITs to workers. Isn't this similar to a recommendation engine? A worker's preferences could be saved and different categories of HIT groups could be recommended to him/her, which may not necessarily be high-rewarding ones but may make the worker enthusiastic about the task.
2. There are two types of HITs that are mentioned as encompassing the tasks the requesters post: Type 1, such as image labeling, and Type 2, such as surveys. The Type 2 HIT groups are such that the task only appears as a single available HIT to each worker. What would motivate the workers to choose Type 2 HITs over Type 1? If it were money, then would it not result in fatigue or laziness because of similar tasks being repeated over time? A limit could be imposed on the tasks they choose so that they are forced to work on varied data tasks. This could probably maintain their enthusiasm, leading to better accuracy in the ratings.
3. In Method A, where inferences were obtained from observed data, the authors say that when a HIT disappears faster after being moved from one position to another, they attribute this difference to the popularity of the new position and not to the nature of the HIT. But they later found that a fun HIT that is relatively highly paid always disappeared quickly when moved from the first to the third page of results. These two statements contradict each other. The second statement shows that the nature of the HIT (fun) also played an important role in the HIT disappearance rate.
1. The paper makes the assumption that AMT strictly follows a push strategy (tasks are pushed by requesters). It is well known that requesters commonly adopt a pull strategy (directing work by requesters). The model fails to account for such behavior. It is critical to understand dependence on the search tool.
2. Understanding search patterns in this paper was applied to enable faster task completion. It would also have been interesting to measure performance as task accuracy. There may be scope to model searching behavior with worker reliability, or at least have this as a feature.
3. Generalizing findings to other crowdsourcing platforms is not evident. In that sense the scope of the work seems to be limited.
1. How does worker traffic come into play on MTurk? The authors don't explicitly mention the flow of workers, but I'd imagine that, just as with any online application, different populations of people use the site at different times of day. How would this affect the rate at which HITs are completed?
2. The authors mentioned gaming and how it is used to manipulate the order of displayed HITs to workers. Surely frequent users of MTurk are aware of the presence of the same HITs several searches in a row. Even if they don't know the exact mechanisms used, couldn't workers counter these practices by being biased against positions that are typically filled with vying "gaming" postings?
3. How does the quality of the worker factor into these results? The authors don't (and possibly can't) look into the ratings of the workers that participated in their survey. Wouldn't a frequent worker be more likely to dig deeper into the searches and look for specific types of HITs that they prefer?
Because the same ad was posted repeatedly as part of the test, wouldn't the order in which the postings were made have affected the test results? If a person sees the same HIT again and again, he might start clicking it even though it was not the ad's position that affected his decision to go for the HIT. The testing done here is very specific about why MTurk is not the best place for a human computer to look for a HIT, but the author has failed to extend the results in a more general way, for example to how they could be used for other search engines. The author could have defined how such surveys should be carried out, so that people creating platforms similar to MTurk could use these results. The author indicates that workers realise that the popularity of a HIT can be a bit skewed, and because of that they tend to go to much later pages to search for HITs that might suit them. This shows the unreliability of MTurk in the eyes of workers. Doesn't this affect the testing being done in this paper?
1. In this article the authors examine how users choose HITs on Amazon’s Mechanical Turk. In doing this the authors state that, due to the nature of the program, there are several factors in how HITs are chosen that they cannot account for. They state that this is not a problem because these factors would be consistent over time for any specific group of HITs. Do you agree with this assumption? If not, how would you go about measuring the effect of these factors?
2. In this article the authors propose two different models for examining the results they acquired from their scrape of the results of different sorts on MTurk. These two models were the group random effects model and the pooled model. When analyzing these two models they find that there are benefits and drawbacks to each model. Do you think that there is a better model, or do you think that both models should be used for the analysis of MTurk-related data?
3. In this article the authors discuss a survey that they put out on Mechanical Turk with 4 different methods. These methods were meant to aim the placement of these surveys on different sortings of HITs. One was meant to represent the best placing across all sorts, another was meant to be placed in the middle to avoid both ascending and descending sides of all sorts, while the other two were meant to be on the top of the a-z name sort and the newest sort. In the results of this study the authors were surprised to find that 25% of the people who took the survey found it after page 10 of the results. Do you think that this is due to the fact that they orchestrated their survey setup in such a manner that at least 25% of their surveys could only be found on pages after page 10 of the results? If so, do you think that this method oversampled people that would look beyond the first 10 pages and that this could be the reason for the surprising results?
1. One way in which the group-specific and pooled models diverge with respect to HIT disappearance is on the importance of the ‘newest’ sort. The authors explain the divergence as the result of some employers ‘gaming’ the system. What does an employer gaming the system actually do? Repeatedly take down and repost a task?
2. One of the interesting findings in the paper is that the position of a task can self-select for different types of Turkers (p. 9). With respect to IR tasks, how might the behavior of a person motivated by money differ from someone who wants a maximum number of HITs? Is either group more desirable for some tasks?
3. Are this article’s findings still applicable? How often do the interface and search features built into the Mechanical Turk website change? Is there anything important about this paper with regard to IR search behavior?
1- Things I am thinking about: Will publishing these results cause MTurk to take steps to fix how easy it is to manipulate task position? Does MTurk care? Should they? Will publishing these results inspire task authors/requesters to work harder at manipulating their task’s position? It seems that a certain amount of ‘gaming’ is already going on. Does this matter?
2- Considering the fact that the discovered bugs/usability issues of MTurk affected the results of the survey (and presumably a fair amount of the scraped evidence), I feel the authors should perhaps alert MTurk to their findings and re-run some of their experiment if MTurk fixes some of them. Failing that (or in addition to that), a similar experiment should be re-run on another crowdsourcing site to see if results or problems are consistent.
3- I do not understand the Group Random Effect. Some numerical value was assigned to a subset of tasks based on the perceived attractiveness? To control for perceived attractiveness? How was this number developed? How did the authors choose which subset was assigned what number? This seems like a subjective task.