1) Electronic Textbook and PubMed Central Indexing
Current processing of the electronic textbook material involves several steps designed to identify the most meaningful phrases in the text for use as reference points. The first task is to identify grammatically reasonable phrases. For part-of-speech tagging we use a version of the Brill transformation-based tagger, rewritten in C++; its output forms the basis for determining grammatically reasonable phrases. A significant post-processing step then removes phrases that involve inappropriate references to context (e.g., different cells, final mutation). After finding grammatically reasonable phrases, we attempt to eliminate those that are too common or generic to be useful (e.g., significant result, short time). The next step is to compare each phrase with previously rated phrases that have been collected over the life of the project. The final stage is to estimate the importance of a phrase in the textbook passage where it is found. This estimate is based on the frequency of the phrase in the passage and the size of the passage, compared with the frequency of the phrase throughout the book and the overall size of the book. To improve the estimate, we attempt to count not only the phrase itself but also any phrase that represents the same concept. For this purpose we use the UMLS Metathesaurus together with stemming, and combine the two approaches into a consistent picture of the concept as it occurs in the text. The result of this processing is a scored list of (phrase, book section) pairs for each textbook. These pairs guide the response to general searches in the books: when a user types in a phrase that is on our curated list, the first results given are the highly rated book sections for that phrase. We are now applying a similar indexing scheme to the text of articles in PubMed Central. This allows us to give a list of highly rated phrases for each article as an enhanced reference point for searchers.
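The passage-level importance estimate described above can be illustrated with a minimal Python sketch. The exact formula used in the project is not given here, so the log ratio of occurrence rates below is only an assumption, and the function name `passage_importance` and its parameters are hypothetical. Counts are assumed to already include variants mapped to the same concept.

```python
import math


def passage_importance(phrase_in_passage, passage_tokens,
                       phrase_in_book, book_tokens):
    """Log ratio of the phrase's rate in the passage to its rate in the book.

    Hypothetical scoring function: a positive value means the phrase is more
    concentrated in the passage than in the book as a whole.
    """
    if passage_tokens <= 0 or book_tokens <= 0:
        raise ValueError("passage and book sizes must be positive")
    if phrase_in_passage == 0 or phrase_in_book == 0:
        return 0.0  # no evidence either way
    passage_rate = phrase_in_passage / passage_tokens
    book_rate = phrase_in_book / book_tokens
    return math.log(passage_rate / book_rate)


# Illustrative numbers only: 4 occurrences in a 500-token passage,
# 20 occurrences in a 100,000-token book.
score = passage_importance(4, 500, 20, 100_000)
print(round(score, 3))
```

A ratio of rates is used here so that long passages are not favored simply for being long; other normalizations would be equally reasonable.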
2) A significant fraction of queries in PubMed are multiterm queries, and PubMed generally handles them as a Boolean conjunction of the terms. However, analysis of queries in PubMed indicates that many such queries are meaningful phrases rather than simply collections of terms. We have examined whether it makes a difference, in terms of retrieval quality, if such queries are interpreted as a phrase or as a conjunction of query terms and, if it does, what the optimal way of searching with such queries is. To address these questions, we developed an automated retrieval evaluation method, based on machine learning techniques, that enables us to evaluate and compare various retrieval outcomes. We show that the classes of records that contain all the search terms, but not the phrase, differ qualitatively from the class of records containing the phrase. We also show that the difference is systematic, depending on the proximity of the query terms to each other within the record. Based on these results, one can establish the best retrieval order for the records. Our findings are consistent with studies in proximity searching. The important insight here for indexing is that in some cases where the words of a phrase occur in text, but not as the phrase, the phrase may still be an appropriate concept to use in indexing the text.
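One way to picture a proximity-based retrieval order is the short Python sketch below. It is an illustration under stated assumptions, not the evaluation method described above: it assumes records containing the exact phrase come first and the remaining records are ordered by the smallest token window covering all query terms. The names `min_span`, `contains_phrase`, and `rank_records`, and the whitespace tokenization, are hypothetical.

```python
def min_span(tokens, terms):
    """Smallest window (in tokens) covering every term, or None if one is missing."""
    needed = set(terms)
    last_seen = {}
    best = None
    for i, tok in enumerate(tokens):
        if tok in needed:
            last_seen[tok] = i
            if len(last_seen) == len(needed):
                # Shortest covering window that ends at position i.
                span = i - min(last_seen.values()) + 1
                if best is None or span < best:
                    best = span
    return best


def contains_phrase(tokens, terms):
    """True if the terms occur consecutively and in order."""
    n = len(terms)
    return any(tokens[i:i + n] == terms for i in range(len(tokens) - n + 1))


def rank_records(records, query):
    """Order record ids: exact phrase first, then by increasing term span.

    Records missing any query term are dropped, mirroring a Boolean conjunction.
    """
    terms = query.lower().split()
    scored = []
    for rec_id, text in records.items():
        tokens = text.lower().split()
        span = min_span(tokens, terms)
        if span is None:
            continue
        scored.append((not contains_phrase(tokens, terms), span, rec_id))
    return [rec_id for _, _, rec_id in sorted(scored)]


# Made-up records for illustration only.
records = {
    "a": "lung cancer screening in adults",
    "b": "cancer of the lung and screening",
    "c": "lung function declines while skin cancer rates rise",
    "d": "breast cancer outcomes",
}
print(rank_records(records, "lung cancer"))  # ['a', 'b', 'c']
```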
3) Currently we are studying how good phrases can be recognized by their characteristics, such as frequency, tendency to be repeated in documents where they occur, and other numerical properties. These features allow one to predict which phrases are of high quality. We have found such predictions to be useful in studying the different kinds of terms that may appear in text and how an ontology might be extracted from text.
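As a rough illustration of the kind of numerical properties mentioned above, the following Python sketch computes a phrase's total frequency, its document frequency, and its mean number of occurrences per document in which it appears (one simple reading of "tendency to be repeated"). The feature definitions, the function names, and the whitespace tokenization are assumptions made for this example; how the project actually defines or combines its features is not specified here.

```python
def count_phrase(tokens, phrase_tokens):
    """Number of positions where the phrase occurs as consecutive tokens."""
    n = len(phrase_tokens)
    return sum(tokens[i:i + n] == phrase_tokens
               for i in range(len(tokens) - n + 1))


def phrase_features(docs, phrase):
    """Simple frequency-based features for one phrase over a document list."""
    phrase_tokens = phrase.lower().split()
    counts = [count_phrase(doc.lower().split(), phrase_tokens) for doc in docs]
    total = sum(counts)
    doc_freq = sum(1 for c in counts if c > 0)
    return {
        "total_frequency": total,
        "document_frequency": doc_freq,
        # Average occurrences within the documents that contain the phrase.
        "mean_repeats_per_document": total / doc_freq if doc_freq else 0.0,
    }


# Made-up documents for illustration only.
docs = [
    "gene expression varies and gene expression is measured",
    "we measured gene expression once",
    "a short time passed",
]
print(phrase_features(docs, "gene expression"))
print(phrase_features(docs, "short time"))
```

Features like these could then be fed to any standard classifier to predict phrase quality; that step is left out because the source does not describe it.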