1. My first question is about the usage of the proposed method. Although tracking users’ eye movements is useful for interpreting their intentions, I wonder how this kind of method can be integrated into a search engine. The whole procedure seems to be device-intensive and labor-intensive: the eye tracking relies on a Tobii 1750 eye tracker, which consists of an infrared LED and two cameras mounted on the frame of a computer screen. 2. My second question is about the experiments. To fit each document on the screen, the documents were truncated to at most 11 lines of text. The paper also reports that the answers given by the users agreed with the true labels in 97% of the cases. These conditions are far from a realistic search process. I wonder how the results drawn from this experiment relate to real-life search experience. 3. My third question is about the participants in the experiments. Altogether there were ten participants, who were volunteer postgraduate students and senior researchers from the Department of Information and Computer Science, Helsinki University of Technology. The participants seem to be experienced searchers. There is no information about their prior knowledge of the topics in the experiments. I think the authors should have taken these factors into consideration.
1. This paper presents an interesting study on the development of a proactive information retrieval system based on eye tracking. Features of eye movement were extracted, the parameters λ were trained, and an SVM was used as the classifier. Why is SVM used here? Intuitively, since there are two sets of unknown parameters, λ and gw(d), it seems more natural to use an algorithm such as EM with a corresponding GMM, which estimates the two sets of parameters iteratively. Also, at the end of the paper the authors found that some of the eye features are not useful. If so, why not use another machine learning algorithm, such as an artificial neural network (ANN), which can automatically assign weights to these features? 2. The BOW (bag of words) model, which counts word frequencies, was used to represent documents. Four textual features, such as “length of word”, were used. In all of these analyses only features of single words were considered. Words are usually not independent of each other, as with human names, sports teams, etc. Why were bigram or even trigram models, which might represent the documents better, not used? Also, for TFIDF the word frequency vector was used; is there any study that incorporates the order of these words? 3. Different models with various features were compared. Why was MAP used as the single metric for evaluating the different models? In Table 4, the model using eye features was not better than the random models, which is probably because eye features are not indicative of relevance, or are indicative only for some of the topics. Is there any study that focuses on the effects of eye features on specific topics?
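The question about MAP as the single metric may be easier to discuss with the computation written out. What follows is only a minimal sketch, not the paper's evaluation code. It assumes each ranking is given as a list of binary relevance labels in ranked order and that every relevant document appears somewhere in that list. The function names `average_precision` and `mean_average_precision` are hypothetical.

```python
def average_precision(ranked_labels):
    """Average precision for one ranked list of binary labels (1 = relevant).

    Assumption: all relevant documents appear in the list, so the sum of
    precisions is divided by the number of relevant documents found.
    """
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the per-ranking average precisions (0.0 for no rankings)."""
    if not rankings:
        return 0.0
    return sum(average_precision(r) for r in rankings) / len(rankings)
```

Because MAP collapses each ranking into one number, two models can reach similar MAP while behaving very differently on individual topics, which is one reason a per-topic breakdown might be informative.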
In this research, Wikipedia was used to generate the document corpus. However, why was Wikipedia chosen? I think the structure of information on Wikipedia is very different from that of most other websites, so I doubt that similar results would be obtained in other settings. Most of the paper discusses the different models proposed, while little of it discusses how to analyze the eye movement data collected in this research. From this paper, it is hard to find the criteria Ajanki employed to distinguish the parts of the gaze patterns related to relevance judgment. Finally, I have a question about training: how were the ground-truth queries generated or collected? If the ground-truth queries hardly represent most common queries, I am afraid the outcome of the training in this research will be problematic.
1. It seems that the investigation assumes, and to an extent even requires, a level of technical and linguistic expertise far beyond what can reasonably be expected of the average user. Is this a justified assumption? Web users have varied expertise, different language backgrounds and distinct intents during search. How can we account for this diversity? 2. Reading rates differ between individuals. It is therefore likely that some searchers focus on a particular word for longer than others purely because of the time they take to read. Does this analysis use a baseline for each user's reading speed? If not, how does it avoid mistaking the time spent reading and understanding a sentence for a sign that the information is relevant? 3. When using gaze movements as a metric for identifying pertinent information, aren't we biasing the results toward visually appealing information? For instance, the user may happen to focus on a pop-up advertisement or some other distraction that causes the searcher's gaze to fluctuate. How can we take these distractions into account?
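One way to picture the per-user baseline asked about in question 2 is to standardize each reader's fixation durations against that reader's own mean and standard deviation. This is a minimal sketch of that idea only; the paper is not known to do this, and the function name `normalize_per_reader` and the sample durations are hypothetical.

```python
from statistics import mean, pstdev


def normalize_per_reader(durations_ms):
    """Convert one reader's fixation durations (ms) to z-scores.

    Assumption: durations_ms is a non-empty list from a single reader.
    A z-score says how unusual a fixation is for this reader,
    regardless of whether the reader is fast or slow overall.
    """
    mu = mean(durations_ms)
    sigma = pstdev(durations_ms)
    if sigma == 0:
        # All fixations equally long: nothing stands out.
        return [0.0 for _ in durations_ms]
    return [(d - mu) / sigma for d in durations_ms]


# Hypothetical readers: the second reads exactly twice as slowly.
fast_reader = normalize_per_reader([100, 200, 300])
slow_reader = normalize_per_reader([200, 400, 600])
```

Under this normalization the two hypothetical readers produce the same scores, so a long fixation counts as evidence only when it is long for that particular reader.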
An important assumption of the paper is that there is a direct relationship between eye movements and the importance of a word for the query. I think this assumption is not very safe: different people have different reading habits, and even the same person may use different reading strategies for different information sources (e.g. skimming casual reading material but reading research papers very thoroughly). In Fig. 4, the average precision of Wtext(4) is much better than that of Weye+text(26) for the Astronomy category. Though it is the only exception, the authors have not explained why this happens, which might lead to quite interesting findings. My last question concerns the experiment, in which the font size and the space between characters had to be unusually large so that the eye tracking device could work. Given this limitation, I wonder how practical it is to apply eye tracking to the most common information retrieval systems, which typically output a lot of content. Also, how can eye tracking help with multimedia information?
1. On page 308, 2nd paragraph, the authors say “query-based searches are only possible if the user knows her information need”. I disagree with this point. Not all queries have a clear purpose. If a user has just heard some term and wants to clarify what it refers to, he might have no idea exactly what information he wants. In other words, without knowing what he wants, the user can still invoke a query-based search. 2. In section 3.2.3, the function f is divided into 2 parts. It seems these 2 parts are independent of each other. How the two lambda parameters are defined and obtained is not clear in the paper. What is the semantic meaning of these two parameters? 3. The experiments in the paper used documents from Wikipedia. Why did the authors not use a standard data set such as TREC or CLEF? What are the advantages and disadvantages of using Wikipedia documents in this paper?
The use of an MRI scanner forced the authors into the experimental setup they adopted (participants in a supine position). The results found by this method can only be used to model ‘ideal’ users. This is because most cognitive processes are short-term, with fleeting attention spans. Thus any study that tries to analyze them should be quick enough in its evaluation, which fMRI clearly is not. Moreover, this method of relevance determination seems impractical to scale with current technology. The authors mention that their analysis shows three regions with greater activation for relevant stimuli than for non-relevant stimuli. However, they fail to define what “greater” means. Was it a conspicuous difference or an insignificant change? What would have happened if the documents presented to the participants were not images? The authors mention that, following the pilot study with 4 participants, a number of changes were incorporated into the experimental setup, resulting in the one described in the paper. These changes, if mentioned, would have given the reader some insight into the challenges the authors faced while creating this new experiment.
1. This paper assumes “there is a link between relevance or interest and eye movements, and that this link is, to a reasonable extent at least, independent of the actual topic and query”. (p.309) Is this assumption solid and reasonable? There are multiple reasons for eye movements, and whether they are independent of the topic and query is still questionable. 2. The authors mention the Support Vector Machine (SVM) on page 312. What is an SVM? How does it work? Compared with other methods, what unique feature of SVM makes it the most suitable for the work in this paper? 3. Only 4 text features were identified in 3.2.4, all of which were called “query-independent features”. Are there other possible features not covered in the paper? Why did the authors choose only these 4 features?
1. The authors mentioned that they split the corpus into training, validation and testing sets, but they left out whether they used different training and testing sets for each user or pooled all the data together so that the user did not matter. Early on, they mentioned topical relevance, so perhaps it was the latter. If user relevance were being considered instead, would a separate set have been used for each user? 2. The authors used TF-IDF vectors to represent their documents, but we now know that this is a baseline and that more sophisticated feature choices are available these days. Is it possible to extend these methods to deal with more advanced feature spaces? 3. Since we have discussed baselines recently, it seems that the authors of this paper chose really weak baselines too. For example, one of their baselines was a random baseline that ranked test documents randomly. Isn't that a little unfair? What would have been a stronger baseline to compare their methods against?
Assuming that eye movement and time spent on a particular word imply the importance of that word, what kind of noise does simply not knowing a word introduce into the weights given to each word? For example, foreign words, biology-related terms, and even rarely seen words might draw more attention in actual practice. In section 4, the researchers deliberately changed the size of the words and the number of lines of text visible at any given time, along with several other changes, to limit “unwanted eye movements.” Would an experiment that varied these environmental and text settings have been a good addition? I feel that including the “bells and whistles” normally associated with searches would help further identify significant eye movements related to the texts. In the discussion section, a lot of emphasis is placed on the practical application of eye tracking. How feasible is a practical application at this time, given the restrictions placed on the experiment in this instance?
1. Do all readers display the same distributions of eye tracking features? Would there be a necessity to ‘tune’ the algorithm to the specifics of an individual’s reading behavior? 2. Is this paper modeling relevance in the same sense as other IR studies? Given the topics and experimental methods, would it be fair to say that they are constructing a model for interesting documents, as opposed to documents relevant for a particular query? 3. The value of the feedback methods seems to hinge on the level of relevance bias in the underlying document set. Is the intent for this type of approach to be used after running a general IR technique like TFIDF, or for general exploration of a document set?
This eye tracking technique seems good in theory, such as in a controlled environment, but I don't know how useful it would be 'in the wild' of the internet--with ads, popups (both user allowed/requested alerts, and spam), and other distractions. The authors cite Maglio and Campbell and how they monitored eye movement in order to determine whether the user is reading or just browsing. But what if the user is interested in the topic, but there is no text? What if it's video or photographs, such as when searching for clothing or other products? The reported agreement (97%) seems high, given what we have read about how most assessors agree only a small percentage of the time when judging document relevance.
There is a difference between something being "eye-catching" and something being relevant. Using eye tracking as a sole source of data is, due to this difference, a bad idea. But asking users for relevance feedback, which the authors mention as a more effective strategy in the paper, is impractical in a non-lab setting. Furthermore, it is unclear how exactly eye tracking can improve upon direct user feedback re: relevance judgments. Does eye tracking help reveal when a user is uncertain? As an alternative, what would it mean for eyes to linger on a document, and for the document to subsequently be clicked - is this an accurate proxy for relevance? The authors state on pg. 324: "...in a more realistic setup we can’t assume such a large fraction of the seen documents would be relevant, and therefore models using just the textual content of the documents are likely to perform much worse." This refers to a 50% "positive" rate. Is this actually that optimistic? Perhaps the issue is that I am struggling to understand a previous statement in this section: "If the precision of the search results is good enough, then as a result of the bias even the learning document set as such is a reasonably good query--without any explicit or implicit relevance feedback!" What does it mean to make the learning document set a query? Is it a good idea to focus on "fixation" as opposed to something else? Some people are slow readers, some people get distracted by dirt on their screen, some people stop and think while staring off absent-mindedly, etc. I wonder whether, supposing that tracking capabilities improve to the point that webcams may be used for granular eye tracking, it might be better to focus more on something like the user's expression, or the dilation of their pupils, or their posture, heart rate, breathing, or something else.
1) The authors’ algorithm for proactive IR focuses on using eye movement features that indicate fixation on a particular term. Is this necessarily an indication of relevance? How do the authors take into account that this might be the result of the user not understanding the text at that point? I often reread segments of papers that include topics that I do not totally understand. Whether or not they are relevant to me is not clear until I have a better understanding of the text. 2) Could this sort of system be used to help users not only find documents relevant to their initial query, but also determine queries that could increase their knowledge in general? Many technical papers, all relevant to a single query, use different models. If my eyes are fixated on the name or description of a specific model, the system could reformulate a query and forward me to a document that explains that model more thoroughly. 3) Their final results were a bit disappointing, since their models barely performed better than the random model (I realize there was a statistically significant improvement, but it still seemed to be by a small margin). Is there an alternative metric that could be used for evaluation which would yield better results? Perhaps eye tracking should not be used to gauge relevance directly, but instead to aid users in reformulating their queries in ways that lead them to accomplish their task, even if that means reading documents that are not immediately “relevant.”
1. In this article the authors use eye movements to attempt to create implicit relevance feedback. They capture eye movements by noting the areas on a screen where the eye stops or fixates. They decide what counts as a fixation based on how long the eye lingers on one spot. The cutoff that they use in this article is 100 ms. However, they mention that the manual for the eye tracker recommended a fixation interval of 40 ms, and that when they ran the analysis with 40 ms the results were similar to those for the 100 ms interval. Why would they use the 100 ms results instead of the 40 ms results? What would be the benefits and drawbacks of either set of results? 2. In creating the collection used in this experiment the authors used 750 documents taken from Wikipedia entries. They then truncated these documents manually so that they were 11 lines long and still maintained the meaning of the document. How much bias would this process introduce, given that the meaning of the document would have to be assumed by the person who truncated it? Would this bias have any effect on their results? 3. In this article the authors compared the implicit feedback created from eye movements with explicit feedback. While they knew that the explicit feedback would produce better results than the implicit feedback, they wanted to see whether the implicit feedback could improve the explicit feedback when the two were combined. The unbiased results showed that there was no change at all in the performance of the two models. Given these results, why do you think that someone building an IR system would choose to use eye movement based implicit feedback when explicit feedback provides better results?
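To make the 40 ms versus 100 ms question in item 1 more concrete, here is a minimal sketch of how a duration cutoff changes which gaze events count as fixations. It assumes the gaze data have already been grouped into (word, duration in ms) pairs. The function name `filter_fixations` and the sample values are hypothetical, not taken from the paper. Real fixation detection typically also constrains how far the gaze may wander, which is omitted here.

```python
def filter_fixations(gaze_events, min_duration_ms):
    """Keep only gaze events lasting at least min_duration_ms.

    gaze_events: list of (word, duration_ms) pairs (illustrative format).
    """
    return [(word, d) for (word, d) in gaze_events if d >= min_duration_ms]


# Made-up sample data for illustration only.
sample_events = [("eye", 45), ("tracking", 120), ("the", 60), ("relevance", 250)]

at_40_ms = filter_fixations(sample_events, 40)    # keeps all four events
at_100_ms = filter_fixations(sample_events, 100)  # keeps the two longest
```

A lower cutoff keeps more short events, which may add noise from brief glances, while a higher cutoff discards them at the risk of losing real but quick fixations. If the downstream results are similar either way, as the authors report, the choice may matter less than it first appears.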