Thursday, October 10, 2013

10-17 Azzah Al‐Maskari and Mark Sanderson. A review of factors influencing user satisfaction in information retrieval


  1. 1. The authors state that an experiment should not focus on only one factor of user satisfaction (p. 867). What factors should you measure in an experiment to get a sense of human behavior? How do you get a comprehensive view of all the factors that go into user satisfaction?

    2. In this experiment, the authors found that user characteristics did not have an effect on user satisfaction; however, these results were not conclusive (p. 866). What user characteristics did they consider and how were these defined? What further studies do you think need to be done to prove that familiarity does or does not have an effect?

    3. In this study, the authors found that familiarity with a search topic had no impact on effectiveness. However, in this study users had to create queries to find documents on TREC topics, so they were query creators, but not topic creators. How does this fit with Chouldechova and Mease’s study, which showed that query owners were better at assessing relevance because they had some special knowledge of the topic? How does it affect the study that the users do not have total knowledge about the topic? Or is there no relation because the previous studies were focused on relevance and this study explores user satisfaction?

  2. 1. In the system effectiveness chapter, the superior and inferior systems were chosen based on the number of relevant documents retrieved, which is similar to the definition of recall. If so, why do the authors also mention that the systems were chosen based on AvP score? And how exactly is system effectiveness measured? Is it the number of relevant documents retrieved or the number of retrieved documents?

    2. My second question is about the overall design of the experiment. Why are the effects of the four features (system effectiveness, user effectiveness, user effort, user characteristics) measured with respect to satisfaction? Intuitively, these features are not independent of each other. High user effectiveness or strong user characteristics may lead to minimal user effort. So why use these four features when they have overlapping effects on satisfaction? Also, since satisfaction is strongly affected by emotions, would it be better to measure a baseline of satisfaction, that is, the initial emotional state, and compare satisfaction before and after the search task?

    3. About the experimental design, why is a 7-minute limit applied to the search task? It might bias users, making them pay more attention to time and, as a result, report more satisfaction when they use less time. Tests with the same format as TREC's were used here to define relevance. Does this suppress other real effects on satisfaction, such as novel or interesting content in the documents?
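The distinction raised in question 1 above (number of relevant documents retrieved versus an average precision score) can be made concrete with a small sketch. The document ids, the two rankings, and the function name `precision_recall_ap` are hypothetical and are not taken from the paper; the formula is the usual non-interpolated average precision, which may differ in detail from the AvP computation the authors used.

```python
def precision_recall_ap(ranking, relevant):
    """Return (precision, recall, average precision) for one ranked list.

    ranking  -- document ids in the order the system returned them
    relevant -- set of ids judged relevant (hypothetical judgments here)
    """
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    precision = hits / len(ranking) if ranking else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    avg_precision = precision_sum / len(relevant) if relevant else 0.0
    return precision, recall, avg_precision


# Two hypothetical systems that retrieve the SAME relevant documents,
# but place them at different ranks.
relevant = {"d1", "d2", "d3"}
system_a = ["d1", "d2", "x1", "x2"]  # relevant documents first
system_b = ["x1", "x2", "d1", "d2"]  # relevant documents last

result_a = precision_recall_ap(system_a, relevant)
result_b = precision_recall_ap(system_b, relevant)
print("A: precision, recall, AP =", result_a)
print("B: precision, recall, AP =", result_b)
```

Both lists contain two of the three relevant documents, so precision and recall are equal, yet average precision is higher for the list that ranks them first. Under these assumptions, counting relevant documents and scoring by average precision are related but not interchangeable ways of choosing a "superior" system.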

  3. I found this paper quite illuminating, as the factors it identifies as influencing user satisfaction can also be applied to user studies in other research domains. Those factors should be taken into consideration when designing a user study and discussing its validity.

    In discussing the result, the authors state “As illustrated in Table 2, in general, users were significantly more satisfied with the results of the superior system than those of the inferior system.” But in Table 2, I can see the difference is 0.3 (2.04 vs. 2.34). I am not sure this counts as a significant difference.

    In testing hypothesis 4 (The effect of user characteristics on user satisfaction), the authors found no support for it. I think the authors only took into account “topic familiarity” and “online search experience”. There are many other user characteristics (e.g., education level, personality) that the authors fail to take into consideration, which might lead to a type II error.
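On the question of whether a 0.3 difference in mean satisfaction can be significant: that depends on the spread of the ratings and the number of participants, not on the size of the gap alone. The sketch below uses invented ratings (not the paper's data) and Welch's t statistic for two independent samples; the test the authors actually applied is not shown here, and the function name `welch_t` is only illustrative.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples."""
    standard_error = sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    return (mean(sample_a) - mean(sample_b)) / standard_error


# Hypothetical ratings: in both pairs the second group is shifted by 0.3.
tight_a = [2.0, 2.1] * 5                 # ratings clustered closely
tight_b = [x + 0.3 for x in tight_a]
wide_a = [1.0, 2.0, 3.0] * 3             # ratings spread widely
wide_b = [x + 0.3 for x in wide_a]

t_tight = welch_t(tight_a, tight_b)
t_wide = welch_t(wide_a, wide_b)
print("same 0.3 gap, tight ratings: t =", round(t_tight, 2))
print("same 0.3 gap, wide ratings:  t =", round(t_wide, 2))
```

With tightly clustered ratings the same 0.3 gap yields a large t statistic, while with widely spread ratings it yields a small one. So the means in Table 2 cannot settle the question without the variances and sample sizes behind them.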

  4. 1. According to this article, previous thought in the field of IR presented success of a search as more important to satisfaction. Here, the authors think of satisfaction as linked to many of the elements of a successful return on a query. To me, it seems as if the authors are working mainly to redefine "success" with this article. How do we define this term in the field of IR today?

    2. In the section on "Testing the Hypotheses," under the "User Characteristics" section on p. 862, the authors predict that as a user's searching experience increases, so does his or her satisfaction. (Even though, by the conclusion, they say that experienced and inexperienced searchers claim to be equally satisfied, p. 865.) This reminded me of the study which had users start with a document and find a query which would lead to it (an optional reading from last week from Hao Ma, et al). This led to improvements in query owners themselves. How can better searching skills for individuals help lead to improvements in search engines? Maybe tutorials or searching tips on a search engine could lead to greater satisfaction from users?

    3. The authors use sources which seem quite old- dating back even to the 1980s. What should be the cutoff when we are doing our own literature reviews? How quickly does this kind of data become irrelevant?

  5. 1. What is the basis behind the authors’ definition of the factors influencing user satisfaction as system effectiveness, user effectiveness, user effort and user characteristics? Are these parameters sufficient for the study? There are other factors which we have discussed in previous classes, like search diversity, completeness, etc., which are not covered in this study but which can represent user satisfaction in one way or another.

    2. The authors have used Spearman’s correlation as evidence. But the burning question that I have had over the past few weeks is how correlation (Spearman’s and Kendall’s) can give us the information that we really need. These measures tell us only whether the two sets of values agree, but the main question at hand is ‘why’ they agree (or not). Are there any measures which can be used to analyse the factors which lead to correlation? Or should we always infer the factors through intuition?

    3. There are numerous latent variables like the user’s past experience with the system and the other variables which are not exactly observed. Are the claims of the authors consistent with the unobserved variables which can also be considered to be a part of one of the four factors? If so, how can we measure them?
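As background for question 2 above, here is a minimal sketch of how Spearman's rank correlation can be computed, written in plain Python so that each step is visible. The data are invented, and the helper names `average_ranks` and `spearman` are illustrative rather than anything from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: the values differ wildly, but the ordering agrees.
x = [1, 2, 3, 4, 5]
y_same_order = [10, 200, 3000, 40000, 500000]
y_reversed = [5.0, 4.5, 4.0, 2.0, 1.0]

print(spearman(x, y_same_order))  # orderings agree completely
print(spearman(x, y_reversed))    # orderings are exactly opposite
```

The coefficient depends only on how the two variables order the observations, so it can reach its maximum even when the values themselves are far from identical. It describes the strength and direction of a monotone association and says nothing about why the association exists.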

  6. 1. It seems like the authors tried to correlate everything with user satisfaction. While notable in itself, I think the study would have been even more interesting if the authors had also shown what kind of correlations showed up between the different factors themselves (e.g. user effectiveness and effort). One can only wonder which pairs would have exhibited the strongest relationship.
    2. It's an interesting observation that user characteristics don't seem to have an impact on user satisfaction. The authors themselves admit that this isn't conclusive and that at least one other study has arrived at a different conclusion. One would have to assume that the criteria chosen by the authors to measure user characteristics (like domain expertise and search experience) are more volatile metrics than were known before. Assuming we had the resources to conduct an experiment, what metrics might we choose other than those two to express user characteristics?
    3. The authors did a good job of evaluating user satisfaction across four dimensions. I was wondering if there have been other studies along other dimensions, e.g. seeing how well a good interface correlates with user performance.

  7. The paper cites Soergel's 1976 work as concluding that user satisfaction is not
    a good measure of systems because of factors such as the "user-distraction
    phenomenon," where non-relevant results distract a user from choosing relevant
    ones. This seems like an attempt to evade the complexity of information
    retrieval -- as though it's really hard to build systems which factor this
    behavior in, so why try? As Tim Carmody puts it, "computers are for
    , so is there serious
    support of the idea that users are somehow wrong for their tendencies to
    exhibit this behavior?

    This paper examines four hypotheses which investigate the relationship between
    user satisfaction and factors influencing user satisfaction. It seems to claim
    that these represent a comprehensive survey of aspects that influence the
    user's satisfaction, but that seems overly simplistic. Is it really possible
    that something as complex as the user's satisfaction can be boiled down to just
    these concepts?

    The technique this paper uses to determine user satisfaction is basically a
    survey at the end of each experiment. Aren't these surveys susceptible to many
    of the same biases that relevance judgements are? For example, there may be
    overly optimistic or pessimistic users that throw off the measurement of

  8. 1. Cooper proposed a metric, “Expected Search Length (ESL)”, for measuring user effort. ESL is stated as the average number of documents “examined” to retrieve a given number of relevant documents. How is the ESL computed? What does "examined" mean? Does it refer to skimming through the results, or was "eye-tracking" used to know if the user looked at the links, or does it indicate the user's navigation through the pages of search results? How accurately can this measurement indicate the user’s effort? If ESL is large, does it indicate that the user is less satisfied with the results, which may not be true in most real-world scenarios?

    2. For Hypothesis 1, which states that system effectiveness influences user satisfaction, multiple experiments have been carried out and most of them contradicted each other. Sandore’s experiment in particular found that users were often satisfied with low-precision search results, but Su’s results varied when conducted with different participants. These contradictory results could be attributed to real-world conditions, where the audience of a retrieval system is varied, with different kinds of expertise and different intentions at different times of search. Given that users' intentions are unpredictable, or rather cannot be generalized, how can a factor as volatile as "user satisfaction" contribute towards the evaluation of an IR system? Why does it matter when it is evident from this case that it may not reflect the system’s performance but is strongly dependent on the individual?

    3. In Hypothesis 4, where the authors analyzed the effect of user characteristics on user satisfaction, it was expected that users with previous searching experience were likely to be more satisfied with results than less familiar users. In my opinion, this is an invalid and incomplete statement. Experience cannot be related to satisfaction criteria. It is invalid because the poor satisfaction scores of less familiar users may have been due to a mismatch with the users' intent, to hard queries, or to several other factors stated in the previous reading, “The Effect of Assessor Errors on IR System Evaluation”, such as laziness, fatigue and many more. It is incomplete because the authors do not state how they chose the participants or under which criteria the participants were classified as familiar or unfamiliar.
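To make question 1 above more concrete, the sketch below follows the wording used in that question: it counts the documents a user examines, reading strictly top-down, until a wanted number of relevant documents has been found, and then averages that count over several searches. This is a simplification under stated assumptions ("examined" means read in rank order, and the ranking has no ties); it is not claimed to be Cooper's exact formulation, and the names and data are hypothetical.

```python
def search_length(ranking, relevant, wanted):
    """Documents examined, top-down, to collect `wanted` relevant ones.

    Returns None when the ranking holds fewer than `wanted` relevant documents.
    """
    found = 0
    for examined, doc in enumerate(ranking, start=1):
        if doc in relevant:
            found += 1
            if found == wanted:
                return examined
    return None


def mean_search_length(runs, wanted):
    """Average search length over (ranking, relevant-set) pairs that succeed."""
    lengths = [search_length(ranking, rel, wanted) for ranking, rel in runs]
    completed = [n for n in lengths if n is not None]
    return sum(completed) / len(completed) if completed else None


# Hypothetical searches: the same relevant set, ranked well and ranked badly.
runs = [
    (["a", "x", "b", "y"], {"a", "b"}),  # 2 relevant found after 3 documents
    (["x", "y", "a", "b"], {"a", "b"}),  # 2 relevant found after 4 documents
]
esl_like = mean_search_length(runs, wanted=2)
print(esl_like)
```

A measure of this kind records only position in the list. It cannot tell whether a user skimmed, read closely, or skipped a document, which is exactly the ambiguity in "examined" that the question points out.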

  9. In Al-Maskari and Sanderson’s experiment, how was satisfaction measured? There’s no detailed information about the scale used in this experiment. Is it possible that this scale can hardly reflect the level of satisfaction users felt while completing their tasks?

    When calculating TimeFRD, the authors only recorded the time taken by the users to locate the first relevant document matching the TREC relevance assessment. The problem here is that users may consider some documents very relevant which are, however, irrelevant in TREC’s judgment. In this case, the TimeFRD scores may be inaccurate.

    In measuring user characteristics, the authors tried to evaluate users’ familiarity and search experience; however, there was little information indicating how the data about users’ characteristics were gathered. Also, the paper gives little information about the significance tests for the correlation coefficients described in this experiment.

  10. 1. In this paper's investigation of the correlation between user and system effectiveness, a factor which I'm left unsure about is the variation in the personality as well as the cognitive processes of the participants. For instance, we have established that a more experienced user of IR would be capable of making more efficient and informed decisions regarding the relevance of a document than a novice searcher. Thus, wouldn't the user's perceptual speed as well as his/her prior search experience affect user effectiveness and thereby the system effectiveness? Having said that, wouldn't the fact that experienced searchers have a preconceived notion about what to expect from a query cause them to not be easily satisfied with the relevant documents returned by the IR system? How do we strike a balance between the two cases?

    2. The research conducted uses TREC relevance judgments and computes the average precision for each system on each topic. We have seen that user satisfaction is rather tedious to compute as a metric, as it is purely at the user's discretion whether satisfaction correlates highly with higher precision, higher recall or a more complete list of documents. Given that we can only work towards a single goal when computing the IR system's efficiency, how do we propose to truly calibrate user satisfaction, which is based on multiple effectiveness metrics? Would a metric like the F measure, which combines both precision and recall, then be a better metric to analyse user satisfaction?

    3. While personalization does come across as a useful tool which will help boost user satisfaction, how do we hope to balance effectiveness versus personalization while also ensuring diversity? For instance, a person who has a search history on a specific topic may now wish to broaden his search. Say a user A has a history of browsing through scientific papers on Abstract State Machine (ASM). However, he now wishes to know more about the Advertising of Sales & Marketing (also ASM), which involves a more diverse coverage. In such a case, when searching through acronyms, how can we hope to account for user satisfaction while also trying to maximize personalization, since the search engine would continue to provide the user with what the user considered relevant documents in the past? And wouldn't this impede user satisfaction?
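For reference on question 2 above, the F measure is the weighted harmonic mean of precision and recall. The sketch below uses the standard F-beta formula; the example precision and recall values are invented, and whether such a score tracks user satisfaction is precisely what the question leaves open.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score: weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily, beta < 1 weights precision more.
    """
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Hypothetical system: very precise, but it finds few of the relevant documents.
precision, recall = 0.9, 0.1
f1 = f_measure(precision, recall)
f_precision_heavy = f_measure(precision, recall, beta=0.5)
f_recall_heavy = f_measure(precision, recall, beta=2.0)
print(f1, f_precision_heavy, f_recall_heavy)
```

Because the harmonic mean is pulled towards the smaller of the two values, a single F score still hides which of precision or recall a given user cares about. The choice of beta has to come from knowledge of the task.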

  11. 1. This paper talks about different factors that may affect user satisfaction. Altogether 4 factors are discussed in this paper. Are there any other factors that may affect user satisfaction? I think so. For example, the layout of the search page, the system interface and the representation of search results should all affect user satisfaction. What other factors may affect user satisfaction?

    2. My second question is about the relationship between system effectiveness and user effort. From my point of view, system effectiveness and user effort should be related to each other. If a system is easy to use and provides accurate information that meets the user’s need, then users do not have to spend a lot of time searching for related information. This paper isolated these factors and ran the experiments separately. So I am wondering what the experimental result would be if we took them into account jointly.

    3. My third question is about how we can make use of the conclusion from this paper. This paper mainly examined four factors that would affect user satisfaction. I am wondering how we can integrate all these factors to improve the system effectiveness. Can we make the IR system perform better based on these factors?

  12. This article mentions a Steffey and Meyer article that showed how users' opinions of the technology influenced their satisfaction regarding retrieval systems. Would this kind of influence, which appears to be centered around the “newness” of a particular system, work the same way with software or simply changing a UI to a more “user friendly” version?

    The participants of Experiment 4 vary in age by a good amount being from age 19 to 40. However, age does not necessarily equate to search experience, which is a factor the experimenters included in their hypothesis. What factors did they use to determine search experience? A self-reporting system or some kind of benchmark/test possibly?

    Al-Maskari and Sanderson note that user effort was shaped by the number of queries submitted while searching a topic. On page 864, they state that a modified query counted as an additional query. If this is the case, would satisfaction have differed with search experience, between those who use search engines on a regular basis and expect spelling errors not to have a huge effect on retrieval, and new users who might not be aware that search engines offer “did you mean” suggestions?

  13. Applications to Last Week: The authors cite Marchionini (1995) as explaining that every individual has a set of unique IR skills: domain expertise, system expertise, and search expertise. These make up what they mean by the term "user characteristics", as defined on pg. 861. How might these user characteristics have played a role in the paper we read last week about "owners" vs. "non-owners"? Does ownership imply domain expertise? System expertise? Search expertise? Some combination of the three? Moreover, this paper states that "there was no relationship between familiarity and satisfaction" (865). What does this suggest about the findings from last week's paper? What were the differences in the approaches that could have led to the divergent results?

    User Diversity: System effectiveness, user effectiveness, user effort, and user characteristics all seem to ignore the cultural diversity of users. Users from different backgrounds (and of different ages, nationalities, and education levels, among a potentially infinite array of other factors) may search for particular things in different ways, or for different aspects of the same topic, or for different topics altogether. Do these metrics address such diversity, and how? Are all of these contained under "user characteristics," and thereby within the categories of domain expertise, system expertise, and search expertise?

    The Participants: Are we to assume that the participants in the experiment in this paper were graduate students? How might the results have varied had the researchers drawn their participants from another population? How would graduate students compare to trained assessors? What about compared to crowd workers?

  14. I feel that the rank position of the first relevant document according to the TREC assessment (RankFRD), which is used as an evaluative metric for user effort, is the same as the TimeFRD metric of user effectiveness. This is because the larger the RankFRD value, the more time is needed to reach that position. Thus a document further down the list implies more time spent by the user going through the ordered list. It is thus, to me, not surprising that user effectiveness and user effort both show correlation with user satisfaction.

    In the experiment carried out, the term ‘UserDocs’ refers to the total number of relevant documents obtained by the user. For queries created by the users (modified TREC queries), how are the relevance judgments obtained? Are we assuming the same document judgments as for the original query?

    For the parameters (domain expertise, search expertise) chosen to represent user characteristics, it is not counter-intuitive to find no correlation with user satisfaction. For some queries, the domain experts could be looking for specific information before they classify a document as relevant. If the retrieved sets are not extremely relevant, the non-experts would report more satisfaction than the domain experts. Similarly, for hard topics and queries and a good retrieval system, domain experts would report higher satisfaction levels than the non-experts, who cannot discern which documents are specifically relevant. Thus, it seems that the correlation depends a lot on the query and topic difficulty.

  15. 1) I found that even the description of the four “factors” that compose user satisfaction showed just how subjective the idea is. For example, “system effectiveness measures how well a given IR system achieves its objective.” Couldn’t you argue that the objective is user satisfaction, making even the definition circular? We talked about finding the correct result vs the popular result, and which is really more relevant. If we say the more popular one, then user satisfaction seems to be the objective.

    2) The authors mention that in various studies user satisfaction sometimes correlates with precision and other times with recall. Precision and recall are noted as pieces of the overall system effectiveness. Specifically, in the case of correlation with precision, how much of this can be attributed to user effort? It seems feasible that in many cases, the reason users are not satisfied with recall and no precision is that they have to spend lots more time and clicks to find what they want. Does this make satisfaction with precision/recall a function of user effort instead of system effectiveness?

    3) One thought that caught me during the paper arose from the idea that there is no consistent answer that user satisfaction predicts actual search effectiveness. Specifically, 10 years ago, people were arguably quite satisfied with Google since it was already the most popular search engine. However, it is arguably more effective now, so are people more satisfied now? This sort of question is what makes me feel satisfaction is not possible to measure.

  16. 1. As a user becomes more familiar with an IR system, wouldn't they be less satisfied with their experience the more they use it? The novelty of a successful query would decrease into a sort of expectation that the system should just perform as it normally does.

    2. The authors chose to include the time it took for users to complete the task in user effectiveness, but not in the effort the user put forth to complete the task. Isn't the amount of time spent by the user an indication of how much effort they have to put forth to complete a task?

    3. The ages of the participants were mentioned by the authors, but nothing more about their backgrounds. Were the 56 participants given any sort of training before undertaking the tasks for the experiment?

  17. 1. One would expect a number of the variables being studied to strongly depend on one another (i.e. system effectiveness measures seem intimately related to user effectiveness measures). Isn’t it probable that certain variables used in the study are not independent of one another, and doesn’t that diminish the usefulness of using Spearman’s correlation?

    2. It isn’t clear from the study whether the users actually read each document or what the manner of presentation of the information was. Were users presented with options in terms of quickly filtering documents? If they were not, could this explain why negligible satisfaction effects were found with respect to user experience?

    3. One thing that was not explored directly was the importance of document rank on relevance scores. Instead, the researchers relied on indirect measures, like the number of queries issued and the rank of the first tagged relevant document (i.e. what are called user effort factors). Is it valid to say that the effects seen with these measures are indicative of the importance of system ranking?

  18. 1. Hildreth “questioned the reliability of the satisfaction criterion as a measure due to its lack of independence from other influential factors in the retrieval procedure”. (p.860) Is “independence” a prerequisite for a variable to be used as a measure when evaluating the performance of information retrieval?
    2. Based on Marchionini’s research, every individual has a unique set of IR skills which consist of three components: domain expertise, system expertise and search expertise. What’s the difference between system expertise and search expertise?
    3. The authors mentioned several researchers and conflicting conclusions when talking about the correlation between satisfaction and precision and recall. (p.861) Shouldn’t it be more meaningful if we discuss this question in the context of what tasks are to be performed? For example, if the users are doing fact-finding tasks or question-answer tasks, they will be satisfied once they find the answer and won’t care how many results are retrieved; while if the users are doing information-gathering tasks, they might care more about recall.

  19. 1. Soergel’s research is mentioned in this review. (p.860) He cited the “user-distraction” phenomenon and recommended that helping users in completing their search task successfully should take priority over seeking their satisfaction. However, if a user still expresses satisfaction when he receives an irrelevant document from the IR system in response to his/her search operation, it means that the information system might accommodate exploratory search. Is this an attribute which should also be taken into consideration when evaluating retrieval performance?
    2. It is stated in the review that the less time spent searching, the greater the satisfaction. I wonder how this is measured. How much time a user needs to search for certain information is to a great extent decided by his/her ability to acquire information, which means a less able user may feel satisfied even if he spends more time searching, while a more able user may feel dissatisfied even if he spends less time finding the information. Is this taken into consideration by the researchers?
    3. TimeFRD is tested as one of the key variables in the experiment conducted by the authors of this review. Why is the time taken by the user to locate the first relevant document considered here, rather than the time used by the user to complete the whole task? (p.863-864)

  20. The authors state: "Turpin and Hersh (2001) did not substantiate a relationship between system effectiveness and user satisfaction. Twenty-four users were involved and required to identify a number of factual answers to eight questions from two systems with different effectiveness with MAP2 scores of 0.27 and 0.35. Despite the systems exhibiting quite different retrieval effectiveness, there was no significant difference in user satisfaction with the results retrieved from the systems."
    It seems far-fetched to say this. The way the authors describe it seems different from what the paper actually says. The paper says that "Users of the baseline system had to “work harder”, however, to satisfy the requirements of the tasks by issuing more queries and reading more documents than users of the improved system", which doesn't mean that people had the same user experience.

    The authors say: "IR system evaluators should consider all factors listed above in measuring user satisfaction." That is, system effectiveness, user effectiveness, user effort and user characteristics. If we are saying that user satisfaction is a combination of everything, then shouldn't we try to examine each factor's level individually rather than comprehensively?

    The authors have tried to categorise user satisfaction into these four subjective measures, but they have not defined how to quantify them. And are these four categories broad enough to include every aspect of user satisfaction?

  21. 1) It seems like there are many studies about user satisfaction evaluation, what is the benefit of having another study in material which has already been debated intensively? Given the amount of effort put into analyzing a hypothesis that seems to be in a stalemate, I felt the value and contributions of this paper were relatively small.

    2) When discussing effectiveness, Al-Maskari mentions that one of the reasons for discrepancies is that some users are more generous than others. Isn't this the same as Hufnagel's argument that unsuccessful users blame the system for their failure?

    One of the things I like about this paper is the idea that user satisfaction might be inflated when users for the first time are exposed to some feature of the system. Hence, user satisfaction will change over time as the inflation disappears.

  22. 1. All the results in this paper and prior work are based on findings from a small group. Prior work also indicates that results may not be generalizable in most cases. Reproducing results over time and across groups for the same set of experiments would add a lot more stability to the findings. Why don’t we see this?

    2. The measures for each hypothesis are not independent. What are we to make of the results when there is no experiment which measures the correlation between them?

    3. The experiment to validate hypothesis 4 was not set up as defined. There was no indication that the knowledge of participants was independently assessed. Hence, what are we to make of the conclusion? Also, a point made in prior work is the inflation of results when evaluating new systems. As described, this appears to be systematic in nature, so can it not be accounted for?

  23. I thought the Cooper vs. Soergel argument was interesting. Assigning a monetary value makes sense, and initially seems like a good way to place value on information retrieval, but as Soergel points out, many users could fall victim to 'user-distraction' where a user, upon receiving an irrelevant document might still express satisfaction. Can IR users ever 'know' that they're getting the best results for their query? It seems like many users who are doing a great deal of research refine their query terms over time in order to tailor their results to fit their needs better. So when do users 'know' they are getting good IR results?

    When talking about Hypothesis 1, System Effectiveness Influences User Satisfaction, the authors point out that there is a strong correlation between the 'relevance of results and the user satisfactions using navigational and nonnavigational queries'. What is a nonnavigational query? Can you give an example?

    Early on in the paper, the authors cite Marchionini's IR skills (domain expertise, system expertise, and search expertise) and state that they would test/use those same sets, but then later (pg 8) under the User Characteristics section, they only seem to be interested in domain expertise and search expertise. Why are they no longer interested in system expertise?

  24. 1. In this article the authors discussed how four different factors interacted with user satisfaction of IR systems. The four different factors were system effectiveness, user effort, user effectiveness, and user characteristics. However one thing that they did not examine was how these different factors interacted with each other. For example, how do the factors, user effectiveness and user effort, affect each other? Do you think that such a comparison would have any value considering the results of this study?
    2. In this paper the authors state that they found no correlation between user characteristics and user satisfaction. However they argue that there should have been some correlation between the two because previous work has shown that there was some. What could be the reason that no correlation was found in this experiment and yet previous experiments did show correlation?
    3. In this article the authors cite a previous study by Hufnagel where Hufnagel states that there is a bias in the users’ perspective of system effectiveness. Hufnagel explains that this bias exists because users tend to focus on the failures of an IR system and ignore its successes. How would you go about accounting for this kind of bias in an IR study? Do you think that the authors of this study needed to account for this type of bias and if so did they do so adequately?

  25. 1- Firstly I have an issue with the measure of “user characteristics”. Namely user familiarity with a topic and experience with searches in general. No explanation was given for how those things were established. Did they rely on participants self-reporting or was a more objective method used? How can a reader understand the validity of the findings if they don’t know that?

    2- The authors then have a whole section about how since their findings about user characteristics did not correlate with what they expected and some prior research that they encountered (basically they found that user characteristics had no correlation to satisfaction) their findings are ‘not conclusive and further investigation is needed on this matter’. It is pretty dangerous to assume that evidence that does not line up with your expectations merits further investigation but evidence from the same experiment that does line up with expectations is acceptably conclusive. If they are going to need further investigation on this matter then they ought to further investigate every finding here. Unless the inconclusiveness stems from the fact that the methods of collecting user character information are sub-par. The authors do not say. See my question 1.

    3- I would be curious about the authors’ suggestion for taking user satisfaction into account under less controlled circumstances. This would surely be easier but is it advisable? Should satisfaction ratings be qualified as coming from experienced users (or effective users) to make the judgments useful?