Summary: The authors argue that progress in creating ranking algorithms that learn from training data has stalled, and that the main culprit is researchers' experimental design. Even when they use the same test collection, researchers extract features and sample data differently; in addition, there are a number of different test collections to choose from. As a result, empirical findings across papers cannot be compared. To make progress, the authors argue the field needs a standard dataset and standard conditions under which such comparisons are possible. To that end, they created a benchmark dataset named LETOR, then used it to compare two ranking algorithms under a set of community-accepted measures: MAP, NDCG, and precision at rank n. Based on these results, the authors conclude their benchmark is viable and can help drive improvements in the field.
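For concreteness, the three measures the summary names can be sketched as follows. This is a minimal, hedged illustration of the standard definitions (binary relevance, log2 discounting for NDCG); the ranked list is invented, not LETOR data.

```python
import math

def precision_at_n(relevances, n):
    """Fraction of the top-n results that are relevant (1 = relevant, 0 = not)."""
    return sum(relevances[:n]) / n

def average_precision(relevances):
    """Mean of precision@k taken at each rank k where a relevant document appears.
    MAP is this quantity averaged over all queries."""
    hits, total = 0, 0.0
    for k, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0

def ndcg_at_n(gains, n):
    """DCG of the ranking divided by DCG of the ideal (descending-sorted) ranking."""
    def dcg(gs):
        return sum(g / math.log2(k + 1) for k, g in enumerate(gs, start=1))
    denom = dcg(sorted(gains, reverse=True)[:n])
    return dcg(gains[:n]) / denom if denom else 0.0

# One query's ranked list: relevant docs at ranks 1, 3, and 4.
ranked = [1, 0, 1, 1, 0]
p_at_3 = precision_at_n(ranked, 3)   # 2 of the top 3 are relevant
ap = average_precision(ranked)       # mean of precision at ranks 1, 3, 4
ndcg3 = ndcg_at_n(ranked, 3)
```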

1. The overall idea is to standardize what data should be produced for each document in a test collection, resulting in a common set of attributes for ranking algorithms to consider. LETOR comprises documents from TREC as well as documents from MEDLINE, so the approach is meant to span multiple test collections. However, TREC and MEDLINE are two collections that receive a great deal of attention; as a result, it is easy for the authors to have a range of data available and documented alongside each document and relevance judgment. Other test collections may not have the same amount of information available. In addition, the authors outline the elements extracted from TREC, but do not show that they all translate to the MEDLINE-based collection. Is this idea limited to a fixed set of elements per test collection? Is the process of extracting this information likewise limited by the format of the test collection, or is it more robust than it appears? If the benchmark's data do not carry over to other settings, can its concepts be extended to new collections easily?

2. The authors evaluate the success of their benchmark by comparing two ranking algorithms. After extensive data collection and analysis, they conclude the benchmark is promising, while noting, as most research does, that there is always room for improvement. Did the authors really provide a strong enough evaluation of their benchmark to support the conclusions they draw about which types of ranking algorithms are better? The two algorithms they explore come from two different algorithm classes. Are these algorithms representative of their respective classes? Are they the best in their class? The authors further claim that, provided the benchmark is used in evaluation, it will enable comparisons across experiments by anyone working on ranking-algorithm improvements. Is comparing just two algorithms really a sufficient evaluation to claim they have developed a good benchmark?

3. The whole point of the paper is to provide a benchmark that anyone writing a learned-ranking-algorithm paper can use, so that advancement of the field can be measured. A test collection consists of a large number of documents and their corresponding relevance assessments. Previous papers have addressed the problem of incomplete relevance assessments through pooling, and others have noted the financial constraints on information retrieval evaluation and the uneven funding across its parts. If researchers run their experiments on a subset of the TREC documents due to financial, time, or machine constraints, will the benchmark still allow results to be compared, or does segmenting the documents break its assumptions? If the benchmark is that sensitive, its quality can have a large impact on evaluation. And is this benchmark of MEDLINE and TREC documents really representative of the kinds of documents information retrieval systems should be evaluated on?

Paper: Agreement Among Statistical Significance Tests for Information Retrieval Evaluation at Varying Sample Sizes - By Mark Smucker, James Allan and Ben Carterette

Summary: The paper examines agreement among statistical significance tests for information retrieval evaluation, reporting results from experiments conducted at varying sample sizes (from 10 to 50 topics). The authors analyze three tests: the randomization test, Student's paired t-test, and the bootstrap test. Using root mean square error between the tests' p-values as the measure of agreement, they find that the smaller the number of topics, the larger the disagreement among the p-values. They also find that even with a small number of topics, the t-test appears suitable compared to the others.
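To make the comparison concrete, two of the three tests can be sketched on paired per-topic scores (e.g. average precision for systems A and B on the same topics). This is a hedged, minimal illustration of the standard test mechanics, not the paper's implementation; the scores below are invented.

```python
import math
import random

def paired_t_statistic(a, b):
    """Student's paired t statistic on per-topic differences.
    (A full test converts t to a p-value via the t distribution;
    scipy.stats.ttest_rel does both.)"""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)

def randomization_test(a, b, trials=10000, seed=0):
    """Two-sided randomization (permutation) test on paired scores:
    randomly flip the sign of each per-topic difference and count how
    often the permuted mean is at least as extreme as the observed mean."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs) / len(diffs))
    extreme = 0
    for _ in range(trials):
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(permuted) / len(permuted)) >= observed:
            extreme += 1
    return extreme / trials

# Ten topics' scores for two hypothetical systems.
sys_a = [0.42, 0.31, 0.55, 0.48, 0.39, 0.61, 0.27, 0.50, 0.44, 0.36]
sys_b = [0.38, 0.29, 0.49, 0.45, 0.40, 0.52, 0.25, 0.47, 0.41, 0.33]
t_stat = paired_t_statistic(sys_a, sys_b)
p_rand = randomization_test(sys_a, sys_b)
```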

Questions: 1. The randomization test is recommended even though the t-test appears suitable even when the number of topics is small. If the results show the t-test is suitable, why would the authors recommend the randomization test? Although they state that the randomization test tended to produce smaller p-values than the t-test, they do not clearly quantify the margin of difference. What is the advantage of the randomization/permutation test over the t-test when the number of topics is small?

2. The tests show increasing disagreement as the number of topics decreases. Does a decrease in the number of topics imply a decrease in the number of documents? Shouldn't the number of documents determine the strength of the tests' results? Fewer topics could mean fewer documents, but if the topics were broad and common ones, they could still carry a huge set of documents. The larger the number of topics (the sample size), the more likely an observed difference reflects a real one. So why did the authors experiment with as few as 10 topics when comparing the significance tests?

3. From Figure 1, which shows the pairwise comparison of p-values at 10 topics, it is evident that the bootstrap test consistently gives smaller p-values than the t-test. Can the bootstrap test's results be attributed to a systematic bias toward smaller p-values? Is this the case only when a small number of topics is chosen, or also when the number of topics is substantially larger?
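For reference, the mechanics of a bootstrap significance test on paired per-topic scores can be sketched as follows. This is a minimal illustration of one common variant (resampling differences centred under the null hypothesis), not necessarily the paper's exact procedure; the data are invented.

```python
import random

def bootstrap_test(a, b, trials=10000, seed=0):
    """Two-sided bootstrap test: centre the per-topic differences at zero
    (the null hypothesis of no difference), resample with replacement,
    and count how often the resampled mean is at least as extreme as the
    observed mean."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    observed = sum(diffs) / n
    centred = [d - observed for d in diffs]
    extreme = 0
    for _ in range(trials):
        sample = [rng.choice(centred) for _ in range(n)]
        if abs(sum(sample) / n) >= abs(observed):
            extreme += 1
    return extreme / trials

# Ten topics' scores for two hypothetical systems.
sys_a = [0.42, 0.31, 0.55, 0.48, 0.39, 0.61, 0.27, 0.50, 0.44, 0.36]
sys_b = [0.38, 0.29, 0.49, 0.45, 0.40, 0.52, 0.25, 0.47, 0.41, 0.33]
p_boot = bootstrap_test(sys_a, sys_b)
```

With only 10 topics, the resampled distribution is built from very few distinct values, which is one plausible mechanism for the instability the question raises.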

Zhai -- “A Brief Review of Information Retrieval Models” Summary: In this article Zhai divides current IR models into three camps: Similarity (Vector Space) Models, Probabilistic Relevance Models, and Probabilistic Inference Models. Similarity models do not directly incorporate notions of document relevance into their frameworks; instead, they assess similarity between queries and documents. Probabilistic Relevance Models form the second category: here, models attempt to learn the importance of features from relevance-assessed documents. These probabilistic models further break down into two camps, query-generation techniques and document-generation techniques, a delineation driven by the ease of calculating certain probability functions. The third group, Probabilistic Inference Models, resembles query-generation techniques in that it attempts to model a query based on a distribution over documents. This is said to be very difficult to operationalize in practice, so inference-based techniques must make a number of sometimes unintuitive assumptions about the relationship between document and query. Probabilistic and Vector Space Models are both described as the current state of the art.

1. One thing the article discusses is the difference between Vector Space (similarity) Models and Probabilistic Relevance Models. It is remarked that Vector Space Models do not assess document relevance directly, while probabilistic techniques attempt to learn features directly from relevant documents. Is there always a dichotomy between the techniques? It seems as though a Vector Space Model could use weight vectors whose values are derived using probabilistic techniques.
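The hybrid this question suggests can be sketched as a toy example: a cosine-similarity vector space model whose term weights use a probabilistically motivated idf (the Robertson–Spärck Jones form, clamped at zero) rather than the classic log(N/df). The corpus and query are invented for illustration.

```python
import math

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def rsj_idf(term):
    """Probabilistic idf: log((N - df + 0.5) / (df + 0.5)), clamped at 0
    so that very common terms get zero weight rather than negative."""
    df = sum(term in doc for doc in tokenized)
    return max(0.0, math.log((N - df + 0.5) / (df + 0.5)))

def weight_vector(tokens, vocab):
    """Term-frequency times probabilistic idf, in a fixed vocabulary order."""
    return [tokens.count(t) * rsj_idf(t) for t in vocab]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0

query = "cat mat".split()
vocab = sorted({t for doc in tokenized for t in doc} | set(query))
qv = weight_vector(query, vocab)
scores = [cosine(qv, weight_vector(doc, vocab)) for doc in tokenized]
```

The ranking machinery here is pure vector space (cosine over weight vectors), while the weights come from a probabilistic relevance argument, illustrating that the two camps can be combined rather than opposed.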

2. It is mentioned that one challenging part of building a probabilistic model is appropriate feature selection. What features have been found to be important for IR models?

3. Okapi BM25 is said to be inspired by a probabilistic model, but does not itself appear to carry any probabilistic parameters. What is the basis for calling BM25 a probabilistic model?
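For context on this question, BM25's scoring function can be sketched as below. The k1 and b values are the commonly cited defaults and the tiny corpus is invented; the probabilistic pedigree shows up in the idf term, a smoothed variant of the Robertson–Spärck Jones relevance weight (the "+1" inside the log keeps it positive for common terms).

```python
import math

def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Okapi BM25 score of one document for a bag-of-words query."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        df = sum(term in d for d in corpus)
        if df == 0:
            continue
        # Smoothed Robertson-Sparck Jones idf; the +1 keeps it positive.
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        tf = doc_tokens.count(term)
        # Saturating tf with document-length normalization.
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_tokens) / avgdl))
        score += idf * norm
    return score

corpus = [
    "information retrieval models survey".split(),
    "probabilistic models of retrieval".split(),
    "vector space model basics".split(),
]
q = ["probabilistic", "retrieval"]
scores = [bm25_score(q, d, corpus) for d in corpus]
```

The formula itself contains no probability estimates at scoring time; its "probabilistic" label comes from its derivation, since the idf and tf-saturation components were motivated as approximations to the probabilistic relevance framework.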
