Thursday, October 17, 2013

10-24 Oard et al. Building an information retrieval test collection for spontaneous conversational speech. SIGIR 2004, pp. 41-48.


  1. 1. This paper mainly talks about building a test collection for spontaneous conversational speech that can be used for information retrieval. It is a good corpus for speech recognition. I am wondering how we can decide on the granularity of the search results in different situations. For example, in prevailing search engines, we return the whole web page. We also see research about returning part of a document, which can provide more specific information. How do we decide on the granularity based on different needs? How do we reflect that in the test collection?

    2. My second question is about Section 5, USING THE TEST COLLECTION. They indexed the ASR results using word-spanning character n-grams. Their experiment showed that 5-grams yielded MAP values comparable to those obtained at Maryland and IBM using stemming, and that these results were slightly better than those obtained with 4-grams. Based on this observation alone, they conclude that character n-grams exhibited little or no benefit from their potential to conflate acoustically confusable words. I am wondering why they drew this conclusion based solely on 4-grams and 5-grams and then generalized it to character n-grams of any length?

    3. My third question is about Section 4.2, Topic Construction. From over 250 topic-oriented written requests for materials from the collection, they selected 70 requests that they felt were representative of the types of requests and the types of subject contained in the topical requests. However, there is no detailed information about what rules or principles they followed to select the requests. Was it based on topic, or did they group similar requests together based on the content? Without this information, I am not sure how the 70 requests can be representative of the total 250 requests.
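
For readers unfamiliar with the indexing unit raised in the second question above, here is a minimal sketch of word-spanning character n-grams. It assumes that whitespace is collapsed to single spaces and that the space is kept as an ordinary character, so that n-grams can straddle word boundaries; the paper's exact tokenization may differ, and the function name `char_ngrams` is purely illustrative.

```python
def char_ngrams(text, n):
    """Return overlapping character n-grams that may span word boundaries.

    Assumption (not taken from the paper): text is lowercased, runs of
    whitespace are collapsed to one space, and the space counts as a character.
    """
    normalized = " ".join(text.lower().split())
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


if __name__ == "__main__":
    # Compare the index terms produced for the same phrase at n=4 and n=5.
    for n in (4, 5):
        print(n, char_ngrams("labor camp", n))
```

Comparing the two outputs shows how each extra character makes the terms more specific, which is why results for 4-grams and 5-grams alone say little about shorter or longer n-grams.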

  2. 1. The authors allow relevance assessors to provide information about why they made each relevance judgment. What benefit does this have in evaluating the system? Do you think that the extra work was worth the effort in this case? How does this improve the collection as a “gold standard”?

    2. In developing this test collection, the assessors use relevance judgments with 5 grades of relevance, the lowest being weighted “0” and defined as “provides pointer to a source of information.” What if the segment does not fit any of these categories and is completely irrelevant for the search? How would that affect an experiment? Could that be the case in a collection with such a focused topic?

    3. In order to use automatic speech recognition (ASR), a large set of training data is needed. In this experiment, the authors used the metadata and transcriptions that were done manually, so the training data matched up closely with the videos. For other experiments, how close to the actual videos does training data need to be in order to achieve accurate results? Since the retrieval with just the metadata and summaries was more effective than ASR, are manual descriptions a better way to retrieve audio or video material? What improvements have been made since this article was written in 2004?

  3. The writers discuss the potentially profound benefits of recording, documenting, and tagging speech for IR purposes. However, where these benefits could have helped tailor their research strategy, they did not appear to have done so. A few questions related to this: how might user intent influence the effectiveness or utility of annotation strategies (for instance, topic segmentation-based annotation vs. time-based annotation, as mentioned in the "Conclusions" section) going forward? How could their relevance judgment procedure be honed in order to provide better coverage for different user intents? How might a retrieval systems designer take these data into consideration when handling multiple user intents for a single query?

    The authors state: "The topical segment boundaries defined by the indexers were adjusted to the nearest significant silence (2 seconds or longer), and the words produced by ASR were treated as the text of that segment, resulting in 9,947 segments with an average length of 380 words." Why was this approach used for topical segmentation? Topics can change in conversation without "significant silence" in between them, and they often do; in addition, there are frequently significant silences during natural human speech WITHOUT topic change. Why use this strategy?

    A pair of shorter questions: a) Why did ASR seem to perform so poorly? Were the ASR resources used not of high quality, or are those relatively poor results the state of the art? b) Would using a binary relevance judgment strategy have worsened the results? Would it have made a difference?

  4. To create relevance judgments, this research used five categories. I’m wondering how well these criteria can be used to judge relevance. There is actually little evidence to support the validity of these categories.

    The paper mentions that the assessors conducted extensive research. In this case, the assessors have a very comprehensive understanding of the research, which may introduce a bias: their judgments may be influenced by their expectations about the results.

    For Figure 1, it’s pointed out that the topic tends to affect the relative effectiveness of ASR and manual indexing. Why? Why are some terms difficult to recognize?

  5. 1. Introducing 5 different categories, and then further asking judges to rate on a scale of 5 for each of those categories (not to mention brevity assessments), seems at first glance to be an excessive dose of relevance information, with a lot of potential for noise. Was the decision to introduce such levels of detail motivated by the fact that most of the judges were experienced searchers (and graduate students)? Or is it a good thing to ask for more details even from ordinary judges on similar IR tasks (in other words, what is the tradeoff here between noise and redundancy)?
    2. It seems like the problem of retrieval in domains like these, that are not pure text, is tied closely to the progress in other fields, especially AI related. For example, ASR, if it worked at human accuracy levels, would have saved a lot of indexing effort which took thousands of dollars per interview. The same probably goes for image tasks, except the cost seems to be cheaper there. This brings up the intriguing issue of resource allocation: what's a good way to decide when to plug costs for manual labeling of samples (for improving accuracy on the AI subsystem) and redirect them instead to, say, more relevance judgments? Can we weight these costs to come up with a better way of resource allocation? In principle, such weighting can be analyzed by looking at crowdsourced tasks that fall within one of these two categories. I'm wondering if anyone has attempted that kind of analysis yet.
    3. The authors give some details about search guided relevance judgments, and suggest it has been used in other IR tasks as well. However, we don't seem to have read about these so far (I could be wrong about this, but either way, it seems to have been dwarfed by pooled assessments). I'm still not clear what the scope of search guided assessment is. The authors said that relevance judgments are obtained by experienced searchers through painstaking sessions where many queries are reformulated, and documents analyzed intelligently rather than exhaustively or by majority. Later, towards the end of the paper, the authors said that pooled judgments would become more common as more people requested their data. The conflict/pros-cons between search guided assessment and pooled assessments are still not all that clear to me, though. Can they work in tandem, for instance?

  6. 1. The authors state that the cost of interviewing, digitizing and data entry was $2000/interview and that the cost of indexing was another $2000/interview. Are the costs involved per interview justified given that they had managed to collect 17 MB of data? Do you think that this cost can be reduced by using MTurk or other crowdsourcing options, as we discussed in the last class?

    2. According to the authors, the main aim of the paper is to build an information retrieval test collection for spontaneous conversational speech. The authors have used ‘interviews’ as the only type of conversation. But aren’t there other types of conversation besides interviews? The authors do not seem to have covered other informal ways of communication which might be more useful for practical purposes. E.g., an informal conversation will be more useful as test data for a ‘Siri’-like application than a formal interview.

    3. I am also curious to know about the current gold standard of the conversational information retrieval systems. With greater advancements in Speech recognition as well as natural language processing after 2004, we have systems like Siri and Google Voice which perform highly accurately. Does the work of the system end with accurately transcribing the speech into words so that the words can be used as a query?

  7. 1. This paper discusses the interesting topic of creating a test collection of conversational data. My first question is about the design of the experiment. Why is MAP the only metric used for evaluating the IR systems? In reality, users are more sensitive to highly ranked bad results, and since speech materials will probably take more time to evaluate, users will probably be more discouraged by highly ranked irrelevant documents. Why were metrics more sensitive to poor results, such as GMAP, not used?

    2. A multi-valued scale was used and collapsed to binary judgments for standard IR metric computation. Scores 2, 3 and 4 were defined as relevant. Why is score 1 defined as irrelevant? There are five different categories for each judgment; how is relevance defined for them? Are they defined separately? The two assessors’ judgments were made either independently or one after the other for adjudication. What are the differences in performance between the two techniques?

    3. The performance of ASR is really bad compared with manual judgment, and one of the limitations is the ability of detecting domain specific vocabularies. This raises the question of the potential bias on this test collection. Since the vocabulary detection is the major bottleneck (with at least 40% WER), the potential effectiveness of different IR systems will be ignored, and the performance of IR systems on this test collection will totally depend on the ability of the system to recognize special words. Is it better to construct a test collection with fewer domain specific terms?
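
To make the metric questions above concrete, here is a minimal sketch of collapsing graded judgments to binary and then computing MAP and GMAP. The threshold of 2 follows the collapse described in the second question; the floor of 1e-5 inside the geometric mean is a common convention and an assumption here, and all function names are illustrative rather than taken from the paper.

```python
import math


def average_precision(ranked_grades, total_relevant, threshold=2):
    """Average precision over one ranked list of graded judgments.

    A grade counts as relevant when it is >= threshold (assumed collapse rule).
    total_relevant is the number of relevant items known for the topic.
    """
    hits = 0
    precision_sum = 0.0
    for rank, grade in enumerate(ranked_grades, start=1):
        if grade >= threshold:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant if total_relevant else 0.0


def mean_average_precision(ap_values):
    """Arithmetic mean of per-topic average precision."""
    return sum(ap_values) / len(ap_values)


def geometric_map(ap_values, floor=1e-5):
    """Geometric mean of per-topic average precision.

    Each value is floored (assumed 1e-5) so a zero does not make the log undefined.
    """
    logs = [math.log(max(ap, floor)) for ap in ap_values]
    return math.exp(sum(logs) / len(logs))


if __name__ == "__main__":
    per_topic = [0.8, 0.0]  # hypothetical scores: one good topic, one failed topic
    print(mean_average_precision(per_topic), geometric_map(per_topic))
```

With one strong topic and one failed topic, the arithmetic mean stays moderate while the geometric mean collapses toward zero, which is the sense in which GMAP penalizes poorly performing topics more heavily than MAP.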

  8. 1. Having done oral history transcription myself, I wonder how ASR can ever be used for initial transcriptions of collections such as these. It is a responsibility of the transcriber to slightly clean up the language of the speaker without changing the nature of what they are relating. For example, it would be embarrassing for the speaker, and distracting from the topic (especially one so delicate as surviving the Holocaust) to leave in too many "uh"s and "um"s, or have it look anything remotely like the example on page 5. How can ASR work towards striking this balance between accuracy and respectful representation?

    2. The sampling rates they use when digitizing the audio (44 kHz, which is CD quality, and then later they say they had to downsample to 16 kHz) are far from standard preservation quality for an audio collection (96 kHz). What is the minimum sampling rate recognizable by ASR, and why did they have to take the quality down so low?

    3. Another article I read while doing research for the final project ("Unnamed things: Creating a controlled vocabulary for the description of animated moving image content"- Randal Luckow) claimed that in order to generate sufficient tags or topics, it is necessary to incorporate a historical knowledge of your topic (for his purposes, the subject was animation and the example of historical knowledge was vaudevillian or slapstick terminology). How can we build a thorough enough thesaurus to automate this process?

  9. 1. When working towards the evaluation of an IR system for conversational speech, how do we hope to account for voice attenuation when we are gathering speech collections? Since real-time conversational speech is sure to contain noise and distortion, I am unsure how the model hopes to capture all the information in every utterance by taking into account the acoustics, the semantics and the structure of the speech. Additionally, how does the model hope to handle differences in grammar, articulation, and accent when working with conversational speech in the presence of so much variability?

    2. Human conversational speech is unstructured, so how do the relevance assessors in a conversational speech model know where to place the boundaries of a document and assess whether a result is relevant? How do we capture this unstructured data in a structured format given that we just have the classification labels? Also, the paper proposes making use of boolean queries, so wouldn't that again equate to a binary assessment? And finally, the paper states that 5 categories were used as the basis on which users assign relevance; however, the importance of these categories has not been stated in the paper.

    3. While the paper has focused on some attempts to reduce distortion and noise, it has not addressed the out-of-vocabulary problem. For instance, the vocabulary of the speech recognizer and the large number of highly unpredictable lexical forms produced in spontaneous conversational speech need to be taken into consideration, and this has not been elaborated in the paper. How do we hope to solve this problem, and what vocabulary adaptation techniques would be used?

  10. The authors discuss their creation of a conversational and spontaneous speech test collection, but they opt to use a collection of formal interviews done with witnesses to and survivors of the Holocaust. If they were trying to create a test collection of spontaneous and conversational speech, wouldn't it have been more fitting to use a collection of informal or more casual speech? What led them to choose these formal interviews as the source of their data?

    2. Related to my first question, given that their dataset is on a specific type of conversation (a formal interview) and on a very specific subject, how useful is the collection that they created? How do results using this collection scale to more general search?

    3. The authors spend time discussing the costs involved with the creation and processing of this collection of interviews. Given the high costs involved with the processing side of the task, could something like a crowdsourcing tool be used to help lower the cost?

  11. The author says: "The most widely reported retrieval effectiveness measures are based on binary relevance judgments, but assessors generally report greater confidence in their judgments when they can express the degree of relevance of an item on a multi-valued scale. Rong et al suggest five-point and seven-point scales [17], and collapsing multi-point scales to binary values has been shown to give stable rankings of systems [2]. That is the approach that we adopted for the work reported here." But isn't this a loss of information? Wouldn't evaluating with multi-level judgments give a better evaluation perspective?

    They defined five categories of relevance: Provides direct evidence, Provides indirect/circumstantial evidence, Provides context, Useful as a basis for comparison, and Provides pointer to a source of information. How, then, did they collapse the graded relevance to binary relevance when their scale of relevance doesn't have any kind of marker which says not relevant? Did they categorise the pointer to the information as being non-relevant?

    Getting relevance judgments from multiple assessors allows us to see various interpretations of the same query. But here there were four assessors who worked closely together, rather than individually. Once a piece had been judged by someone, the next assessor looked at the notes by the first assessor to make his judgment. Isn't that introducing a bias into the evaluation process?

    The author mentions that "The interviews were first manually subdivided into topically coherent segments by indexers with professional training appropriate to the subject matter." Topic segmentation makes sense for broadcast news materials, but how exactly was it achieved for conversational speech? We also need to take into account the fact that a piece spoken in some context could have meant one thing, and when segmented might mean something else.

  12. The authors adopted a method called search-guided assessment for dealing with the document collection in this paper. Pooling could not be used because the systems used were limited in number. However, there was no mention of the advantages and disadvantages of using it. They describe it as a process of iteration between topic research, relevance assessment, and interactive query reformulation by the assessors. It should be noted that each of the steps mentioned in the above sentence is itself vaguely defined. If this and pooling produced more or less the same alternate pair orders, why is pooling preferred even though pooling involves more work?

    The relevance categories based on the notion of evidence seem to be overlapping. Apart from the direct evidence category, it appears that all are more or less not distinct. For example, provides context and provides pointer to source of information may not be so distinct for many of the cases. Why was this set of categories chosen? Is it because they can be easily collapsed?

    Going by the reading and the effort put into the making of relevance judgments (Pg 4 – second column) this is what perhaps would make a good test collection. Notes on why a particular judgement is assigned are handy as in future when the perceived judgment changes, these notes can act as a reference. The notes also give us insights into how search (speech/written) evolves and will help us predict the future of the field.

  13. This comment has been removed by the author.

  14. 1. During the relevance judgment, four graduate students did the work, and some data was removed. Why did they not include the topics with fewer than 5 relevant segments?
    2. On page 4, last paragraph, they split 28 topics into 2 groups with different assessment strategies. Why did they split the group, and why did they adopt 2 different strategies? They also asked assessors to meet to adjudicate cases. Why did they use these methods? What about other methods?
    3. ASR has a lower bound of WER which means ASR cannot recognize everything. Does it mean that some precision and MAP measurements in the experiments also have some bounds?

  15. In Section 4, the authors state “A subset of the collection has been digitized and manually indexed”. It looks like indexing is a key step in creating a test collection. Since user queries can target different aspects of the document collection, I wonder what the point of document indexing is. For instance, in the paper, the indexers used person names and an average of five controlled vocabulary terms to index all segments from an interview. But in the Topic Construction section, the sample query has nothing to do with these index terms.

    In Section 4.3, the relevance judgments were created using search-guided assessment. In the process, “assessors first conduct detailed topic research and then iteratively search the collection for relevant documents”. Is this biased? What if there are potential topics not included in the initial topic research?

    In Section 4.4, the authors state “Our interviews contain natural speech filled with dis-fluencies, heavy accents, age-related coarticulations, uncued speaker and language switching, and emotional speech”. It is good to have diversity, but it would be more helpful to have more details about the speakers (e.g., the percentage in each of these categories, ages and sex). I believe this is essential to boost confidence in the applicability of the test collection.

  16. Without having the original query owners or experts in specific topics, it seems like search-guided assessments offer an interesting way to make relevance judgments. However, for a topic such as the one discussed in this paper, would it have been better to have relevance judgments made by historians with a solid background in Holocaust history and then apply the team adjudication?

    The article mentions that the indexers had “professional training appropriate to the subject matter” (pg. 2) and used four graduate students with history or LIS backgrounds as the judges. How did the indexer notes influence the judges’ relevance judgments, particularly on a graded scale, if, as the article points out, the judges were able to check a box which specified that the judgment was “based on indexer’s notes”?

    On page 5, Oard et al. briefly talk about the use of an independent review. What exactly is the independent review? Is it the same as the reviewer who looked over the higher scoring judgments? Or is this a separate entity from the second assessor who went over the other 14 topics?

  17. 1. The relevance assignment does not follow the Cranfield model. Rather, the model allows dependence between segments. This makes sense: since these are continuous recordings, relevance can clearly propagate over many segments. However, since the propagation is limited to only the previous segment, does this not pollute the judgments?

    2. How are the grades calibrated? The concern is that since relevance propagates over segments, if the first segment is marked as highly relevant, should the second be marked one level lower?

    3. What does it mean to disagree? The paper talks about marking the segments within the 3-minute clip that were significant to the decision. So does disagreement extend to this degree, or is it at the level of the whole segment?

  18. 1. Most of the research papers we have read in class have described how assessors often do not agree with each other on relevance. In addition, an assessor does not always agree with himself. One of the first papers we read described the process of making a new test collection, which was described as a very expensive process. A good portion of the cost resides in relevance judgments. For this paper, the test collection is built upon the idea of obtaining multiple relevance judgments. The author makes some arguments related to the alternate approach they used to pooling to get a good amount of relevance judgments. In the end, they still underwent a significant effort and cost to get the relevance judgments. With all of the other costs associated with creating a robust speech-based test collection, is the investment in multiple judgments really worth it? Would it not have been better to spend the time collecting more judgments over each topic?

    2. To obtain judgments, four different graduate students were used. The students were studying history and were allowed to research the topics as much as they desired before giving a relevance judgment. In past papers, it has appeared that experts are better at making relevance judgments, which aligns with the process the author used for his test collection relevance judgments. At the same time, other papers have noted that averaging non-experts’ judgments leads to performance similar to that of an expert. Given the author’s desire to collect multiple relevance judgments, could the author have saved time and money by using any number of people to make relevance judgments?

    3. Most of the papers we read focused on binary relevance judgments. Specifically, those with detailed test collections such as TREC employed a binary relevance scheme. In addition, a lot of authors referred to relevance as independent. For this speech test collection, the author used a 5 point relevance scale. The scale is not based on vague terms but instead on direct and indirect references. A 5 point scale is a common feature that arises in the few papers we have read that employ a multi-point relevance scale. Is there a particular psychological reason for this sweet spot? On top of this, the author explains relevance by noting the segments may refer to more than one topic with more than one degree of relevance. It is interesting to see that judgments are still considered independent but the speech segments under evaluation are not treated as isolated based on topic.

  19. 1) Given that they determined 5 categories of relevance, why would they only consider the union of 2 (direct and indirect) as relevant? Shouldn't they use all of them during their experiments?

    2) When describing Figure 1, it is explained that there are 3 clearly distinct groups ([0-5], [8-36] and [36-100]) when comparing ASR and manual. However, isn't this breakdown a little too optimistic? Shouldn't a considerable portion of the second group also be considered total failure because of the lack of usability?

    3) Does Figure 1 show overfitting problems that signal issues in this experiment? If there are problems, what are they?

  20. 1. It is said in section 2 that the result of Switchboard “is an exceptional degree of focus on the selected topics, hardly a representative condition for IR.” Why?
    2. It is said in section 3 that they made the data collection based on 4 systems, which were insufficient to use pooled assessment. What is the threshold number of systems needed to adopt pooled assessment?
    3. The authors said in section 4.2 that they selected the “requests that they felt were representative of the types of requests and the types of subject contained in the topical requests”. However, these requests were not originally prepared for this test collection. Was there any bias in making such a selection?

  21. 1. The assessors conduct detailed topic research and then interactively search the collection for relevant documents. Don’t you think the research needs to elaborate on the details about how they were chosen? Also, while building the test collection the researchers say they built their collection based on questions posed by “serious users” – what do they mean?

    2. The method outlined in this research seems pretty complicated and sophisticated, involving judgments about brevity. What do you think about this method, which seems to rely on trained and qualified assessors, when crowdsourcing is becoming popular? Would it work as efficiently if crowdsourced workers were involved?

    3. This research focuses on audio files that have a common topic, the Holocaust, and stories told by one single person. How would the topic construction change (if at all) for mixed collections like newscasts that have multiple speakers and topics?

  22. I like that these topics come from actual user requests vs. the Eskevich paper which picked types of text, that 'might be useful in aiding in retrieval in the future'. This leads me to wonder, which works better in designing IR tests and new systems, starting with the query, or starting with an idea or concept that 'might be useful'?

    If advanced word sequences and word spotting work best with personal dictation and recorded news broadcasts that incorporate good articulation and allow IR systems that utilize optimized and trained models for query processing, how can these be adjusted for natural language?

    There were five categories that were derived from historical information seeking processes, and then these were refined, but there is no mention of how they were refined. How was this done?

  23. 1. In this article the authors constructed their topics for their test collection by taking requests from a variety of different agencies and organizations that wanted information from the collection that they were using. They expanded the scope of several of the more specific requests based on their own understanding of the work and information requests in general. Wouldn’t this type of behavior create some sort of bias as the authors are making general assumptions of the information needs of specific users? Do you agree with what they did or not?
    2. In judging the relevance of the collection that they were creating the authors chose five different relevance categories that were each judged on a five-point scale. The categories were created from their own understanding of information seeking and from feedback given by their assessors in a two-week pilot study. Do you agree with the categories that the authors defined in this article? Do you think that they should have used information from assessors in creating these categories or that they should have contacted potential users and gotten their ideas?
    3. In this study the authors used two methods to create relevance judgments. One method had only a single assessor judge 14 topics and the other method had two different assessors judge a different 14 topics. In the second case when one of the assessors gave a high score the two met and decided together what score to give the topic whereas all other discrepancies in score were merely averaged together and rounded up. What reason can you think of for the two different methods of adjudication being used for different scores? Also why did the authors round up when they averaged scores and what type of bias would that introduce?
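
As a small illustration of the rounding question in item 3 above, here is a sketch that averages two assessors' scores, rounds up, and measures the average upward shift this introduces. It assumes integer grades from 0 to 4 and, purely for illustration, that every pair of grades is equally likely; the function names are hypothetical and the paper's actual procedure may differ in detail.

```python
import math
from itertools import product


def combine_round_up(score_a, score_b):
    """Average two integer grades and round any half point upward."""
    return math.ceil((score_a + score_b) / 2)


def mean_upward_shift(grades=range(5)):
    """Mean difference between the rounded-up score and the exact average.

    Assumption: all ordered pairs of grades are equally likely, which real
    assessor data will not satisfy; this only shows the direction of the bias.
    """
    pairs = list(product(grades, repeat=2))
    shift = sum(combine_round_up(a, b) - (a + b) / 2 for a, b in pairs)
    return shift / len(pairs)


if __name__ == "__main__":
    print(combine_round_up(1, 2))
    print(mean_upward_shift())
```

Rounding up only changes the result when the two grades differ by an odd amount, and in each such case it favors the more generous assessor by half a point.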

  24. 1) If we assume that modern day ASR systems are highly reliable in accurately transcribing spoken words, and thus leave us with what is essentially a script, is the associated retrieval task any different than other text retrieval tasks? Since at this point it’s just another document, do all the IR systems we have work just as before?

    2) How might the judges’ determination of relevance change if they were given scripts instead of audio/video clips to perform their judgments on? It seems like this might go faster, since reading a script is usually faster than hearing the actual conversation.

    3) The authors note that ASR systems have trouble with uncommon words that likely would not be part of their lexicon (unfamiliar names, places, words from other languages). While this may be the case, is there any way to determine how much of an impact such words might have on the actual topic? And if they do have an impact, would it not be relatively easy to train an ASR system on these words through just a few manual iterations?

  25. 1- Building this test collection was immensely expensive, requiring up to $4,000 an interview in some cases, not to mention the 700 hours of work by graduate researchers. Do the authors anticipate this kind of expense for establishing other spoken word collections? What is their idea for cost mitigation, given that their idea of ‘changing the world in which we live’ through spoken word collections will be severely impaired if each interview costs $4,000?

    2- I was extremely impressed by their system of generating relevance judgements. It featured multi-level judgements on several categories (much like Google’s assessors), multiple assessors, and required assessors to provide evidence and reasons to support their decisions. This and other documentation became a part of the collection as a whole. This is the most thorough assessment model I remember seeing. It appears to limit many of the problems with relevance assessors. So my question is: why aren’t TREC and other industry standards supporting or running a model like this?

    3- I am curious about how actual searches are conducted in this collection. Are results based on the copious amounts of metadata available or are segments run through a computerized listener to identify key words? This is more of a technology question- is the audio file searched or just the text around it? How are these two aspects weighed?

  26. 1. The method proposed to build the test collection for spontaneous conversational speech is the use of topic segmentation. This made me wonder how can spontaneous speeches be categorized? Can there be any particular single category that the speech can fit in? If not, how can the multi-category issue be addressed?

    2. The authors have mentioned that it proved unaffordable to scale the process of manually building an entire test collection. They even tried paying a couple of people to call each other or talk in order to be recorded. Instead, one could use GWAP techniques to build the collection, which would be a fun and also a cost-effective approach. Alternatively, what is the scope of crowdsourcing in this task?

    3. In section 4.3, “Creating Relevance Judgments”, it has been stated, “For indirect relevance, the assessors considered the strength of the inferential connection between the segment and the phenomenon of interest.” How was the phenomenon of interest measured and evaluated?

  27. 1. This paper seems to confirm some of the observations in Allan (2002) in that it is noted that short query ASR tasks are very bad for IR performance. What can be done to alleviate the problems of short queries for ASR? Might it be possible to do as Allan suggested and search not just the categorized lexemes, but also all potential lexemes? How does this affect performance?

    2. What are the particular reasons that InQuery would fail so spectacularly on the ASR terms? In what ways were interviews segmented into documents? Are there IR algorithms that work particularly well for ASR data, and if so how do they manage?

    3. The IBM Okapi BM25 variant generally does quite poorly on the collection, yet it manages better performance than InQuery on the raw ASR terms thanks to “Blind Relevance Feedback”. Are there any particular reasons why Blind Relevance Feedback would help in an ASR setting? Might it be due to query expansion mechanisms?