Optional reading: video lecture http://videolectures.net/mlmi04uk_oard_ra/ "Searching Speech: A Research Agenda" - Douglas Oard

Summary: This lecture is part of a series of lectures from various presenters on IR and spoken word collections. This study in particular uses manual segmentation to break up lengthy oral histories from Holocaust survivors, recorded in various languages. This is a preexisting data set, and metadata for it has already been generated in various ways, including human-generated transcriptions and handwritten notes from an indexer. The researchers aim to compare how humans understand and search for data with how these tasks are accomplished by automated programs. To examine their data, they use types of relevance judgments (direct, indirect, context, comparison, pointer) and find that results from the humans were four times better than those from the speech recognition program. An interesting thing to note from this study is that large data sets are typically standard and necessary in IR experiments, but with oral history collections this is not possible, and small collections have to suffice. Also, working with real data such as this collection raises a number of privacy issues. And speech recognition needs to be not only fast, as it is now, but much more robust than it currently is.

Questions:
1. Oard explains that if we were to change metadata standards, we could solve many of the problems encountered in IR at the initial point of data collection, but he did not go into detail about this. What are some ways this could be possible?
2. Like the article from last week dealing with Mechanical Turk, this study also uses publicly available data. But, unlike that article, issues with privacy arise. What are some cases where privacy issues arise when using data from crowdsourcing sites?
3. I was especially interested in this lecture because of the data set used. Digitized archival collections of oral history and other media are being searched by users, but we have not read any articles which address these kinds of collections until now -- why not?
Article: James Allan. Perspectives on Information Retrieval and Speech. Information Retrieval Techniques for Speech Applications, 323-326, 2002.

Summary: In this article, Allan discusses previous work done with ASR in the IR field, including tracks run at TREC and TDT. He discusses how IR systems have worked with ASR in the past and how ASR error rates affected the effectiveness of those systems. Allan hypothesizes that ASR error rates have a smaller than expected effect on IR systems because of the "repetition of important words" in the text and because additional words provide context for the query. He then describes how the findings from the TREC and TDT tests support this hypothesis. The last part of the article discusses how ASR could impact future work in IR and in what areas there is still work to be done with ASR in the IR field.

Questions:
1. Allan mentions that the effectiveness of detecting new events in the TDT 1998 evaluation was rather poor. The results including ASR assistance in the new event detection process were also poor, and as a result ASR errors were found to have a "substantial" effect on the process. If the performance of the system was already seen as poor, how can they be sure that ASR errors really had that bad an effect on the system and weren't simply magnified by the already poor performance?
2. Allan mentions that researchers controlled the ASR error rate in some of their experiments to see the effects it had on effectiveness. In what way did these controls affect the error rates? How did they alter the ASR to produce these errors, and what kind of errors were they?
3. How might personalized search be worked into ASR? Just as Google uses personalization to learn what and how a user searches, couldn't ASR systems learn how a user speaks and forms queries? How would this kind of development affect the error rates we see with current ASR?
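Allan's redundancy hypothesis can be illustrated with a toy simulation (my own sketch, not from the article; the vocabulary, document lengths, and error model are all made up): corrupt a long document and a short query with simulated ASR substitution errors, then compare how much a simple term-overlap score degrades in each case.

```python
import random

def corrupt(words, wer, vocab, rng):
    # Simulate ASR substitution errors: each word is replaced
    # with a random vocabulary word with probability `wer`.
    return [rng.choice(vocab) if rng.random() < wer else w for w in words]

def overlap_score(query, doc):
    # Bag-of-words match: fraction of query terms found in the document.
    doc_set = set(doc)
    return sum(w in doc_set for w in query) / len(query)

rng = random.Random(0)
vocab = [f"w{i}" for i in range(1000)]
# A long "document": 500 words drawn from a small topical sub-vocabulary,
# so important terms repeat many times -- Allan's "repetition of important words".
topic = vocab[:50]
doc = [rng.choice(topic) for _ in range(500)]
query = [rng.choice(topic) for _ in range(5)]

# Corrupting the long document barely hurts the match: a term with
# many repetitions survives even a 30% word error rate.
noisy_doc = corrupt(doc, 0.3, vocab, rng)
print(overlap_score(query, doc), overlap_score(query, noisy_doc))

# Corrupting the short query is far more damaging: each of the five
# words carries 20% of the evidence, with no redundancy to fall back on.
noisy_query = corrupt(query, 0.3, vocab, rng)
print(overlap_score(noisy_query, doc))
```

The asymmetry matches the article's conclusion: document-side redundancy absorbs errors, while query-side errors remove a large share of the available evidence at once.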
Perspectives on Information Retrieval and Speech

Summary: The author comments on the current state of speech retrieval tasks and enumerates new directions to be taken in the future. He starts by addressing what he feels is a "red herring" problem: the errors created by automatic speech recognition (ASR) transcription. The author points out that two different evaluations, TREC and TDT, had information retrieval tasks based on speech instead of strictly text. Both produced the same conclusion: errors made by ASR systems do not significantly impact the retrieval tasks. The author argues that the errors did not impact the retrieval results because long speeches provide enough context to overcome errors, even if a keyword is disrupted. Upon further evaluation, he finds that the shorter the document transcribed by an ASR system, the more impactful errors become. To conclude the paper, the author outlines multiple directions to take speech information retrieval research beyond the context of document retrieval.

1. In class a few weeks ago, we discussed a paper that made multiple comparisons to support the claim that query owners are better than non-query owners at determining relevance. One of the evaluation tasks involved corrupting the query by changing the third word. Based on the conclusions this author draws, there should not be a great impact if the query is long enough. In class, our discussion turned into a question of whether the number of evaluations in the paper was necessary. Several people voiced that changing the third word could completely change the meaning of the query and that it would be easy to tell the difference between the good and bad results. The work summarized in this paper could indicate that we were jumping to the wrong conclusion.
2. One of the new directions the author spends some time evaluating is when the speech under consideration is a query. With modern cell phones, people use ASR systems to search on their phones, so improvements in this area definitely seem important. The author notes that most people consider the errors of ASR systems a solved problem, but further investigation reveals that if a query has fewer than 28 words, the errors can be harmful. The author suggests a change to the user interface that encourages the user to speak longer. However, we have discussed in class how people currently structure search queries. Would implementing a change in this fashion actually benefit the current users of such a system? Or are users more likely to give detailed queries when using a speech interface?
3. The author mentions several new ideas for moving forward with capturing user context, and several ideas related to the domain of the speech for countering errors in the ASR system. However, all of this left me wondering about the relevance to attach to a speech. If multiple people are talking in a recording, do ASR systems detect and note the change in speaker? My sister did research on determining the height of a speaker from an audio clip, so I am aware that ASR systems can have multiple ways of noting a change in speaker. The author addresses some general issues related to the field, but I am curious what context is captured for speeches. For instance, the speaker himself could be context. Is this a known and solved problem, prompting the author to dwell on other areas?
Perspectives on Information Retrieval and Speech – James Allan

Summary: The article gives an overview of the different TREC tracks and experiments concerning speech recognition and information retrieval before offering some analysis of future directions for this segment of IR. The author notes a common perception that speech recognition error isn't very important, because retrieval performance wasn't significantly impacted even when automatic speech recognition (ASR) error topped 50%. This claim is explored in greater detail and found to be somewhat true, but only in specific speech IR scenarios. In particular, ASR error may be less important when very long documents serve as the document pool. The reason is that larger documents contain a lot of redundancy and more contextual information. Nevertheless, IR performance loss is argued to be significant when ASR error occurs in the query, particularly in short queries. Here, getting individual words right really matters, because misrecognizing even a single word derails a significant portion of the query terms. The final conclusion seems to be that ASR error actually does matter; it just matters a lot more in certain document/query contexts.

Questions:
1. How is the error of speech recognition systems measured? Is error measured as failure to recognize at the level of lexeme, morpheme, phoneme, or phone?
2. Around what kinds of lexical items does speech recognition error tend to cluster? How are named entities and slang/contractions handled by these systems?
3. Several of the experiments discussed created audio recordings of newspaper articles. In what ways is the language of a newspaper article different from that of prototypical conversational speech? It seems like more natural speech settings would have a lot more anaphora and as a result would be a lot more challenging for a dual ASR-IR approach. Can we define a set of speech contexts where we would expect problematic lexical items?
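On the first question: the standard metric in the ASR literature is word error rate (WER), which is measured at the word level rather than the morpheme or phoneme level. It counts substitutions, deletions, and insertions from an edit-distance alignment between the reference transcript and the system's hypothesis, divided by the reference length. A minimal sketch of the computation (my own illustration, not code from the article):

```python
def wer(reference, hypothesis):
    # Word error rate: (substitutions + deletions + insertions) / reference length,
    # computed via word-level Levenshtein edit distance.
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# Two deleted words out of six in the reference: WER = 2/6.
print(wer("the cat sat on the mat", "the cat sat mat"))
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why the "50% error" figures quoted in the article refer to roughly half the words being wrong, not an upper bound.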