Thursday, October 17, 2013

10-24 Moffat, Alistair, Falk Scholer, and Paul Thomas. Models and metrics: IR evaluation as a user process.


  1. 1. My first question is about Section 5, USER-INSPIRED ADAPTATION. To illustrate their proposed algorithm, the authors consider a user undertaking an informational query with an initial (unvoiced and unexpressed) anticipation of finding perhaps 10 documents. This seems a little odd to me. When I search for something, I do not assume in advance how many documents I want to find; I just want to get the information I need. Their algorithm relies on this assumption, so I am wondering whether the assumption is reasonable.

    2. My second question is about Table 3, which lists hypotheses about user search behavior and divides users' intentions into navigational and informational. Intuitively the conclusions may be right. For example, when people search for navigational information, they will be either satisfied or dissatisfied quickly; otherwise they will look further down to check more results. However, this is not always true. People searching for navigational information may still look further down for more, and the same applies to informational searches. So the observations listed in Table 3 alone cannot be taken as rigorous proof or relied on substantially.

    3. My third question is about the experiments in this paper. Figure 2 shows the results, with T set to 1, 5 and 25, where T is the number of relevant documents the user wants. Since T is meant to represent the user's intention, there should be some explanation of why 1, 5 and 25 were chosen. Here 1 stands for the top-1 answer, but there are no other details about the choice of parameters. So I assume they either picked the values arbitrarily or got better results with them. I am wondering whether there is a better way to handle this parameter choice, since the parameter should reflect users' intentions.

  2. 1. At the end of the article, the authors write that they will conduct user studies to better measure search behavior. Why do they choose to develop the metric before conducting the user studies? What benefits or disadvantages are there? Do you think the experimental designs proposed by the authors will be useful in helping create metrics to better model user behavior?
    2. Why did the authors state that it was encouraging that INSQ agrees with the other metrics? Did the authors fulfill their goal to create a metric based on user behavior, even though they haven't tested or tuned it with real user behavior? How could the metric be tested to make sure it corresponded with real user experience with a search system?
    3. INSQ depends on an estimate of the number of documents necessary to fulfill an information need. Is this need always expressed as a number of documents? Wouldn’t the number depend on how much information is contained in each document? Do you think you could provide a reasonable estimate of the number of documents you need for a certain project?

  3. 1. The motivation in the section on user-inspired adaptation does not account for information diversity, i.e., the notion of skipping documents from different facets. For such an analysis, would the properties of the whole collection need to be represented? My concern is that the proposed metric follows largely from the existing metrics.

    2. The paper derives its motivation for T in the proposed metric from the adaptive measures. However, T is yet another parameter the metric depends on, and it is not evident how one goes about choosing it. T also does not capture duplicate information sources; that is, T should reflect information gain.

    3. The naive disagreement measure adopted can be misleading. It does not capture where in the ranking a disagreement occurs or how large the displacement is. Capturing this would have better motivated the need for more measures. Is there a reason for not doing so?

  4. As far as I can self-assess, whenever I search, there are generally two potential T-values, based on two types of searches: I am searching for an infinite number of relevant documents, or I am searching for one relevant document. However, I do not search forever, and I generally do not even exhaust the results that seem relevant. How are users expected to know the number of results they seek? Are there unknown values that may be used as approximations? The researchers do not take on this question in their paper, but it would be immensely helpful to develop such "latent" parameters in order to avoid relying too heavily on user observation during further (adaptive) metric design research (part of the goal of this paper).

    Has INSQ been further examined or used in other research? While certainly interesting from a theoretical perspective, more data are needed before anybody can make assertions regarding its actual use. A conditional weighting metric is not necessarily extremely difficult to build, but a good one probably would be.

    The conditional weighting metric offers much opportunity for further nuance in relevance judgments. How could such a weighting metric interact with documents of unknown relevance (e.g., breaking news articles)? Would it be possible to improve upon unsupervised relevance judgments of unjudged documents using a weighting parameter that takes particular metadata into consideration along with user-based tendencies or preferences? (e.g., having a tag "documentSource:")

  5. In attempting to explain the results in Table 1 and Table 2, the authors claim that these agreement scores represent the outcomes of “real” batch-mode IR experiments. Is it really possible for them to represent the outcomes of “real” batch-mode IR experiments? Moreover, regarding confidence levels, they mention that “a researcher who conducted the same IR experiment, but measured the outcomes using a metric other than AP, would have rejected the results as being not significant (and hence uninteresting) around 19% to 38% of the time.” Intuitively, I think this claim may be statistically problematic.

    According to Table 1 and Table 2, the results for each metric in Table 1 are much higher than those in Table 2; for instance, the discriminative power of RR in Table 1 is 55.6, which is higher than that in Table 2 (49.9). So why are the results in Table 1 clearly higher than those in Table 2? Is it because of the difference between the TREC-8 Adhoc Track and the TREC-10 Web Track, or because of the difference in the number of systems between the two tables?

    In the discussion of user-inspired adaptation, the anticipated number of answers (T) is included in the model. However, how is T to be determined? I think T may be quite different for different individuals. Moreover, even for the same person, the T they anticipate may differ across search tasks.

  6. 1. The authors claim that users are more inclined to end their search as relevant documents are identified. Though this might seem intuitive, there might also be other reasons why the user is inclined to end their search; for example, the user may be inherently impatient and want to end the search as soon as possible, or the user might decide to reformulate the query to get better results because they are not satisfied with the current set. Do you think that the authors provide a convincing quantitative measure to support their claim? This is all the more important as this is the central claim of the authors in this paper.

    2. How do the authors expect users to know how many relevant documents they are expecting from a search? When I perform a search (navigational or informational) on a topic that I have little expertise in, I do not always know how many relevant documents to expect. In addition to the above assumption, the authors also assume the user’s rationale for undertaking a particular search, which might be completely subjective. Are these assumptions reasonable?

    3. The authors start the paper by describing an experiment that will establish the strength of the linkage between models and metrics. But the metric proposed in Section 5 is a reformulation of the INSQ metric proposed earlier. The authors have not attempted to show whether the resulting metric is convergent (as INSQ is not convergent). Do you also believe that the authors convince the reader as to how their experiment establishes any relation between their metric and the user model?

  7. 1. I'm not too sure about the assumption the authors make in Table 3 when they state that for navigational queries, users are satisfied when they receive 'many' answers and don't reformulate. Actually, the opposite may be true. As a motivating example, when I search for a place on Google maps, if ten replies come up, and I'm confused about where I really want to go, I try to reformulate so that I only get one convincing location. The same goes if I'm searching for a company's website and many things pop up (including the correct answer, somewhere at the top but not exactly the top, for example) then I would try reformulating till the results were relatively unambiguous. This leads to the larger issue of whether the authors constructed Table 3 based purely on intuition. Surely there are some studies that could have guided these user-modeling decisions? I didn't see any citation/rigor in the way that table was constructed.
    2. There are other places where such choices show up but don't seem to be justified in practice. For example, the authors mention as criterion 2 in the User-inspired adaptation section that weights should be non-negligible at ranks of 100 and beyond. While the motivation for a smooth function without sudden truncations is understandable, the choice of 100 does not seem very practical as a 'soft' cutoff point. Other studies have shown that for real users, 10 is more indicative, for instance. So what justifies picking 100?
    3. The results in Tables 4 and 5 for percentage agreements in the SSA category don't seem to be too different from those observed for ordinary INSQ in the earlier table. So can we consider this to be the primary contribution of the paper? For my part, I thought the paper did a much better job succinctly describing the different metrics and showing the correlations between the existing metrics from the TREC 8 and 10 sets. Do you think that was actually the true point of the paper, and the authors were presenting the new/modified metric as a baby step towards a more reliable metric that also takes user experience into account? What are the future avenues for improving upon this metric so that there is even better correlation?

  8. 1. This paper presents an overview of various IR metrics and their associated user models. My first question is about their critique of RBP. They argue that the halting probability needs to go down with depth, since users are less likely to stop as they go further down the search results, which is not very clear to me. As users check more results, they are more likely to get tired and more likely to have found the results they want. Both cases suggest that users become more inclined to stop. Is there a real user study to support their claim?

    2. In the “comparing metrics” section, what is the difference between SN and NS? According to the definitions these seem to be identical, so what is the point of defining them separately? And what is the meaning of the equations defined for the class agreement, and why is there a ‘2’ for some of them (SSA, NN) but not the others (NS, SN)?

    3. Lastly, the model proposed by the authors requires parameters such as the number of results the user wants and has already found, which in reality are more often than not inaccessible to the search system. This information is difficult to infer: first, users’ queries usually do not contain it; second, users probably have no idea how many results they are expecting; and last, even if users know how many results they are looking for, the number might change depending on the results they have seen. So how can this model be applied in real cases, and how are these problems addressed?

  9. 1. In the section on sequencing user models, the paper proposes using qrels to generate the run. Especially with conversational speech, wouldn't mapping these multiple query representations to a single cohesive document ranking get tedious? Isn't there an underlying assumption that all queries are capable of polyrepresentation? How do we hope to deal with this issue given that we still have an incomplete set of relevance judgements?

    2. When evaluating user satisfaction, the importance of trading off robustness against speech variation has not been delved into. Wouldn't just capturing the speech characteristics of the training set result in generalization error? And so, how would we deal with the development of adaptive learning and a discriminative training methodology? Further, how can we define the assessors' tolerance of error when evaluating a conversational speech model?

    3. The paper explains how hard it is to choose one metric over another, as the real difference between the metrics is unknown. This confuses me: until we have a threshold for comparing these metrics, how would we be able to choose one metric over another when dealing with a specific conversational speech retrieval task? Further, how would we be able to gauge the performance of the system even if we introduce user adaptability?

  10. Users are not sure what exactly they are looking for, because they do not have an idea of what is out there. Saying that a user will stop as soon as he finds what he is looking for is very subjective. A person may have found the relevant information, but could still look at further results just to get an idea of what else he might find. Or he might not have found any relevant result and could still stop his search, and might decide to change his query. Thus the assumption made by the author does not seem representative of the complete user scenario.

    None of the models discussed by the author takes into account that search results shown on the same page have a higher tendency to be clicked than the ones on the next page. A user might have the idea that going to the next page implies a significant drop in relevance. I mean that search results 20 and 21 are hugely different from a user's perspective.

    The author describes their future testing technique as "Subjects will also be shown a set of similarly-categorized information needs, and asked (without performing any searching) to estimate the number of documents they think they would need to locate in order to satisfy those information needs." But the author has not indicated what purpose this testing is meant to serve. Do they wish to see which sets of documents complement one another to provide a complete set of information?

  11. As far as I remember, this is the first paper to investigate the link between users' ability to complete a search task and numeric effectiveness metrics. We often conduct user studies to monitor the behavior of users, and use quantitative metrics to measure retrieval system effectiveness. It looks like these two sets of measurements run largely in parallel. From this perspective, it is novel and interesting to investigate the linkage between models and metrics.

    In Section 2, the authors state “All of these four static metrics – P@k, SDCG@k, RBP and INSQ – can be criticized”. As for the underlying reasons, the authors say “But the real failing of static metrics is that, in terms of a user model, none of them take into account what it is that the user is experiencing as they step down the ranking”. I am wondering what the authors imply. Do the authors suggest a weak link between the static metrics and the user models? If so, what further implications do the authors intend (e.g., do not use them)?

    Having argued that existing static and adaptive metrics are flawed in terms of reflecting user models, the authors look into user-inspired adaptation. To develop a user model, the authors hypothesize that “the conditional probability of a user continuing their search having reached some depth i in the ranking is a combination of three factors: the depth in the ranking that has been reached; the anticipated number of answers; and the number of answers that have been identified so far through to that depth”. It looks quite reasonable at first glance, but the problem is that the hypothesis is not supported anywhere in the paper. What if there are significant confounding variables that also contribute to the probability of a user continuing their search (e.g., temporal and spatial aspects)?

  12. 1. In Section 3.1, when the authors talk about RR, they criticize it “... regardless of how deeply it appears...”. Is this a practical concern? Generally, modern search engines are powerful enough that users’ scanning would not go that far. So it seems to be just a theoretical statement.
    2. In the last paragraph of Section 4, do the authors intend to find a way to make all the metrics mentioned before comparable? From my perspective, each metric has its own features and scope of application. We cannot say the gap among them is problematic, because it is their nature. Whether they perform well or not depends on the data and scenario.
    3. In Table 3, it is interesting that there is no “no answers” item in the Initial Expectation column. Such an expectation is plausible, e.g. in patent search, where inventors would like to know whether their own ideas are innovative and no similar work exists.

  13. Mentioning the types of user intents (navigational, informational) the authors lay out their observation that different intents result in varying number of relevant documents looked up. They try to model it using the parameter ‘T’. However the value of T cannot be found in practical situations as it differs from user to user. Prior probability distributions can be used to compute the values of T. However no such thing was done here. Thus I am not totally convinced by the values of T chosen here.

    That the different metrics operate in their own spectra, which also are not well defined, was shown well in the paper. Table 2(b) shows how the active agreement and active disagreement values vary for the same pair of measures. It would have been more interesting and appealing to readers if the authors had taken their experiment in the direction of normalizing all the evaluation measures (to the same range).

    The related work section raises more doubts regarding the work done in the paper. For example, Smucker and Clarke’s work on time-biased gain seems more intuitive than estimating the number of documents a user would look through. Al-Maskari’s work on shallow metrics provided better correlation with the experience reported by users. Though the paper has given useful insights into the gaps between the various metrics, the INSQ metric's performance in the general scenario is still unclear to me.

  14. In the Halting and Continuing section, the researchers state that it is “assumed that the user scans the items in the result listing from top to bottom, and stops at some point and abandons that query.” While this assumption makes sense for most users and for weighing the difference in depth, how does this work with tail queries and the lack of depth associated with them? Does the assumption simply cover that case, in which a user abandons the query for a more favorable one?

    On page 6 the article mentions that a “user’s mental state changes...anticipating finding further answers relatively quickly, after the early wins already attained.” However, the article states that navigational query users behave differently, looking for one specific answer that will be sufficient. How are multiple route suggestions and navigation-related relevant documents, such as road closures, counted in this behavior?

    In Table 3, the initial expectations for navigational queries assume that some answers will possibly satisfy a user and that many answers will result in a quickly satisfied user. Do these expectations change for mobile users (who possibly issue more navigational queries) versus users at a desktop initiating a navigational search in anticipation of a trip, etc.?

  15. 1. To argue his point, the author outlines a handful of static methods and a handful of adaptive methods. Based on the results from these different methods, the author draws conclusions about the usefulness of all static user models. Are the static models chosen considered to be the best of their kind? For instance I do not remember any time in which I searched with the intention of going through exactly 50 results before stopping regardless of the content of the 50 results. However, that is the behavior model in one of the static user models. It would be easy to use bad static models to draw conclusions when comparing to top adaptive user models.

    2. The author provides data to facilitate comparisons between all of the different user models, both static and adaptive. The author makes a valid point that these models are all flawed, even if you look beyond the data provided and consider the actions of the user they represent. When written in simple English, it is increasingly obvious that there are limitations to the models. People do not commit to searching 100 documents, or keep searching until they find the first relevant document even if it's 50 documents in. In the end, the whole point of a model is to try to capture human behavior, but it will never be perfect. The author concludes that the best model is one that is tailored to the search task and the user himself. However, at that point, you would have to make a unique model for every situation, which does not use resources well. In addition, the model will inherently be limited. The argument for adaptive over static models does seem promising, but instead of concluding that every situation needs its own model, would it not be better to determine broad adaptive models based on common search situations?

    3. One of the adaptive models, average precision, is based on a term we have heard often in our discussions in class. From MAP and GMAP measures to this model, average precision seems to be a common calculation in information retrieval evaluations. Is average precision prevalent in information retrieval because it is based on a good mathematical foundation for the task or because it is a commonly used metric? The author pinpoints a number of different issues with the average precision user model. In addition, there are issues related to MAP that GMAP tries to address. Therefore, is average precision a common metric because of popularity when alternatives may be better? Even this paper mentions the limitations of the average precision model, but then goes on to state the model is still popular.

  16. 1) It seems like the user model of inverse squares is similar to RBP except that inverse squares has a decreasing halting probability. What are the differences and benefits of such model in comparison to RBP? Seems like RBP is more realistic than inverse squares.

    2) Why is termination at a specific point a weakness? In the worst case, a user that fails to satisfy his information need will stop examining documents at that level. In the best case, if the user were to satisfy his information need and thus stop sooner than the stopping point, the contribution of the remaining points would be overhead, hence, we can use the values as upper bounds.

    3) Why is T a simple parameter? In order to have a realistic model, shouldn't T be a function of the past? With T held constant, it seems that the overhead of its assumptions outweighs the realism that it provides.
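
The contrast raised in question 1) can be made concrete with a small sketch. It assumes the usual RBP user model, in which the user continues from each rank with a fixed probability p, and it assumes inverse-squares weights proportional to 1/i², so that the probability of continuing from rank i to rank i + 1 is the ratio of successive weights, (i/(i+1))². The function names and the value p = 0.8 are illustrative choices, not taken from the paper.

```python
def rbp_continuation(i: int, p: float = 0.8) -> float:
    """Probability of moving from rank i to rank i + 1 under RBP (constant in i)."""
    return p


def insq_continuation(i: int) -> float:
    """Ratio W(i + 1) / W(i) for weights assumed proportional to 1 / i**2."""
    return (i / (i + 1)) ** 2


if __name__ == "__main__":
    # Halting probability is one minus the continuation probability.
    for rank in (1, 2, 5, 10, 50):
        halt_rbp = 1 - rbp_continuation(rank)
        halt_insq = 1 - insq_continuation(rank)
        print(f"rank {rank:>2}: RBP halt = {halt_rbp:.3f}, "
              f"inverse squares halt = {halt_insq:.3f}")
```

Under these assumptions the RBP halting probability stays flat while the inverse-squares halting probability shrinks with depth, which is the difference the question asks about.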

  17. 1. The authors pointed out the failing of static metrics in the last paragraph of Section 2. However, these metrics are still widely used in system assessment, and they play their own roles in certain scenarios. We cannot expect any metric to be a panacea.
    2. What is the meaning of “Class Agreement” in Table 1 and 2? There is no clear definition or explicit explanation for this term.
    3. Why do the authors build their metric on INSQ in Section 5? Why not choose RBP or DCG?

  18. 1. Is there a relationship between halting and the nature of the query? So, do generic queries like “President Obama” that probably have sufficient information in the top-ranked answers, have an early halting versus more specific queries? Is halting different for location-based versus image-based versus text queries?

    2. I understand the researchers’ critique of static methods, in that they do not take user experience into account. So, if we are going to focus on UX, are we going to adapt to/consider different user types (trained versus untrained, novice versus subject specialists) as well? Isn’t that a consideration?

    3. Reciprocal Rank: The hypothesis of this confuses me. Do users really stop their search at the first relevant document? Don’t they validate their answers? Based on some user feedback sessions I’ve conducted, people said that they always validate any answers or information just because they have the option to do so.

  19. The authors say that in usability studies, users need to empathize with the information-seeking tasks, but no other author up to this point has commented on participants' ability or willingness to empathize. Most researchers and authors focus their testers and evaluators more on user intent than on empathy. How can the subtle differences between these two concepts affect how tests are designed and evaluated?

    How can you realistically stick to and defend a static model without adjusting it, either in pre-tests or during testing, by incorporating user-adaptive models and techniques?

    Does the halting and continuing mathematical equation take into account redundant query results offered as different rankings?

  20. 1. The metrics Reciprocal Rank and Average Precision are widely used and have been found to be standard metrics in some of our earlier readings. However, this paper states that neither the Reciprocal Rank nor the Average Precision metric works if there are no relevant documents for the query. With the huge gap between the user models and the measurements that these two metrics propose, how important is it for the user models to be realistic? Since these two metrics come with constraints such as the user's knowledge of the number of relevant documents in the ranking, how are they considered reliable and repeatable in retrieval experimentation?

    2. “The gap between the metric behaviors is problematic because there is currently no principled way in which to choose one evaluation metric over another.” – This can be attributed to the fact that the system under experiment is not a stable one that produces the same result every time it is tested. It is known to be variable and volatile, in that multiple users can manipulate the results in multiple ways. So are the authors trying to find one metric that bridges this gap? Is it realistic? How can a user model be made realistic?

    3. In Section 5, “User-Inspired Adaptation”, they consider a case where a user undertaking an informational query is most likely to continue looking down the ranking until he finds the number of relevant documents he is interested in. Furthermore, they also assume that the users were never going to stop after just one document. How can this assumption be considered valid? Would the user not immediately refine his/her query? The probability of a user continuing his search farther down the ranking list until he finds a relevant document seems unrealistic and does not capture the real-time scenario.
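
The concern in question 1 about queries with no relevant documents can be illustrated with a minimal sketch of the two metrics over a binary relevance list. It assumes the textbook definitions (RR is the reciprocal of the rank of the first relevant document; AP averages precision at each relevant rank over the number of relevant documents). Returning 0.0 when nothing relevant exists is a convention assumed here, since both quantities are otherwise undefined; the function and parameter names are hypothetical.

```python
from typing import Optional, Sequence


def reciprocal_rank(rels: Sequence[int]) -> float:
    """Return 1 / rank of the first relevant document, or 0.0 if none is found."""
    for rank, rel in enumerate(rels, start=1):
        if rel:
            return 1.0 / rank
    return 0.0  # assumed convention: the user model has no stopping point here


def average_precision(rels: Sequence[int],
                      total_relevant: Optional[int] = None) -> float:
    """Average of precision@rank over the ranks that hold relevant documents.

    total_relevant is the number of relevant documents for the query (R).
    If omitted, the count found in the ranking is used as a stand-in.
    """
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    denom = total_relevant if total_relevant is not None else hits
    if denom == 0:
        return 0.0  # assumed convention: the true value would be 0 / 0
    return precision_sum / denom


if __name__ == "__main__":
    print(reciprocal_rank([0, 0, 1, 1]), average_precision([1, 0, 1, 0]))
    print(reciprocal_rank([0, 0, 0]), average_precision([0, 0, 0]))
```

The `total_relevant` argument also shows the dependence the question points to: AP needs R, a quantity a real user would not know.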

  21. 1. In this article the authors link several different effectiveness metrics with the user model that the metric describes. However when initially describing the different metrics the authors do not explicitly describe the user model that the metric INSQ describes. It can be inferred that they are stating that INSQ describes a user model similar to another metric RBP. What type of user model do you think that their initial INSQ describes and do you think that it is similar to RBP’s?
    2. The authors, when describing their own user model and metric, describe two different information-seeking behaviors that are relevant to their study. These behaviors are navigational interactions and informational interactions. Do you think that these two behaviors are the only types of information-seeking behavior there are? If not, what other types of information-seeking behavior can you think of, and would that behavior be relevant to what the authors are trying to do?
    3. In this article the authors try to connect user behavior with metrics that are used to evaluate information retrieval. However one thing that they never consider is how a user would react to the various models of their own behavior that are being created. Do you as a user of search engines think that any of the models presented here accurately describe your search behaviors? Can you think of how you would go about creating a study that would compare user reactions to a search engine with the scores that various metrics give that search engine? Would you consider getting the users’ opinions on which evaluation metric they agreed with?

  22. 1) In the Halting and continuing model, I’m confused as to what is the metric exactly? Based on their description it seems that the metric is minimizing how far down the list the user goes before they stop. How would they model this? Do they assign a probability to each result for stopping at it based on its relevance?

    2) The Halting model is viewed as static and the reciprocal model is viewed as adaptive, but isn’t the actual emulated user behavior the same in both? I considered an adaptive model to incorporate a change in user behavior based on the varying degrees of relevance seen in the document. How are these models adaptive if all they do is quantify accumulated utility, but never allow the user to “adapt” his behavior?

    3) The authors mention an interesting point that although an evaluation system might be good at differentiating between systems, if the metric it uses is irrelevant to the usefulness, then the differentiation does not really say anything. Is it possible to determine the usefulness of a metric without some prior user study?

  23. 1- I’m not sure I agree with the author’s suggestion that “halting probabilities decrease as a function of depth”. This would seem to suggest that there exists a depth that, if reached by a specific user, that user would never stop looking at results. I find this improbable. I think it can be one aspect of some searches through a certain depth. It is probably more likely that a user at depth 20 will view results through 30 than that a user at depth 10 will view through depth 20. But I think it is far more likely that a person at depth 2 will view document 3 than that a person at 92 will examine 93, as the second user must surely be closer to satisfying their information need. I would certainly not agree that the authors’ idea ‘seems natural’.

    2- Their fifth factor in developing a metric is very interesting (that a system should adapt to the user’s idea of how many documents they need). However, their model is still somewhat static. According to them, a user knows they want either a few or many documents. But I think many times a searcher may think they will only need a few documents and only in the midst of their search realize they need more. Or they suddenly realize they need related documents that were brought to their attention by the search.

    3- Regarding related work: The authors spend a lot of time in the beginning describing common metrics and their flaws. Then they spend only a little time at the end discussing what seem to be potentially very relevant solutions to some of the problems they identified in the first place. I am slightly less convinced of the wonder of their new idea since they have not sought to integrate or compare it well with current useful (by their standards) metrics.

  24. 1.

    > A benefit of the use of the geometric distribution is that it converges, and
    > hence the RBP@k metric is monotonic as the depth of evaluation k is
    > increased.

    What does that mean? Does this mean that at some point the probability of a
    result being relevant is treated as 0?

    2. The paper introduces a parameter T in their proposed effectiveness measure,
    which they say indicates the number of relevant documents the user expects to
    find. Does this reflect the difference between navigational and information
    searches? In particular, would a navigational search imply T=1, since the
    user's search needs are satisfied as soon as they locate their destination?

    3. Table 3 does a good job of summarizing the authors' hypothesis that the
    user's behavior changes depending on how relevant the results are.
    Specifically, if the user sees bad results early, then they are more likely to
    quickly reformulate their search. What in their proposed formula for
    effectiveness reflects this change in behavior?
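
On question 1 above, a short sketch may help show what convergence and monotonicity mean for RBP@k. It assumes the standard geometric RBP weights, (1 - p) * p^(i - 1) at rank i, with binary relevance; the function name and p = 0.8 are illustrative. Under that assumption the weights shrink with depth but never reach zero, so the score can only grow (or stay level) as k increases, and it stays below 1 because the weights through rank k sum to 1 - p^k.

```python
from typing import Sequence


def rbp_at_k(rels: Sequence[int], k: int, p: float = 0.8) -> float:
    """Rank-biased precision truncated at depth k, with geometric weights."""
    return (1 - p) * sum(
        rel * p ** (rank - 1)
        for rank, rel in enumerate(rels[:k], start=1)
    )


if __name__ == "__main__":
    ranking = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]  # made-up relevance labels
    for depth in (1, 2, 5, 10):
        score = rbp_at_k(ranking, depth)
        # Weight not yet assigned past this depth (assumed geometric model).
        residual = 0.8 ** depth
        print(f"k = {depth:>2}: RBP@k = {score:.4f}, unseen weight = {residual:.4f}")
```

Each deeper evaluation adds a non-negative term, which is why the truncated score never decreases; the unseen weight shows how much the score could still rise if every later document were relevant.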