| Search by Word Combinations | |
|
Alexandr Savinov |
|
Introduction |
|
| The general idea of this
technology comes from the following observation: the word interpretation
depends on the context where it is used, i.e., other words in the
current context may significantly change its meaning. It is a well known
and quite natural fact described in numerous sources and reformulated in
very different forms. Formally, it is not very important to use words as
an elementary unit. In more general form it is important that for any
set of elementary units we have always to take into account their
context in order to produce correct interpretation. For example, if we
consider symbols of an alphabet as elementary units then it is clear
that individually they give us almost no information on the meaning of
the text and deeper understanding requires to process their
combinations. Thus one source of inspiration for this work was the
desire to built a formal framework or a mechanism for analyzing meaning
hidden behind such combinations of various elements. It is important
that such combinations may have different levels starting from symbol
combinations at the low level and phrase or keyword combinations at the
high level. Another observation was that a good search technique has to take into account the notion of generality. It means that we need to understand how general or how specific a document or request is. It is also well known that the contemporary search engine suffer from one-word search requests like "algorithm" because the search result will include a huge number of all documents involving the specified word. The problem is that only a small portion of them are really relevant while the rest is too specific or irrelevant at all. Irrelevant documents include this words but are not devoted to this notion. While too specific documents are devoted to some very concrete specialization of this notion. For example, the document about "iterative proportional fitting algorithm" is not about algorithms. As a correct response we would expect documents providing definition of the notion of algorithm, documents about general properties of algorithms etc. |
|
Search for Documents |
|
| The described approach
supposes that all texts are equivalent, i.e., we do not distinguish
between documents and queries. The main function consists in determining
the matching factor between any two texts, which describes numerically
how two texts are close. In order to calculate this distance between two
texts we represent them as a sequence of words. Each word in this
sequence is a lemma obtained from the corresponding word in the original
text.
To perform a search we specify a parameter "text window width", which is a number of words considered at any one moment of analysis. It means that each text at each moment is projected onto this window and only the specified number of words is analyzed. The window can move along the text so that we can retrieve all possible subtexts of the specified width. In the case of queries and documents these parameters may be different for them while for comparing equivalent documents it can be one and the same. Essentially this parameter specifies cardinality of the whole process and the higher the window width the more complex dependencies we can analyze. For example, if it is 1 then the process is reduced to word to word comparing with word combinations at all (a kind of word frequency counting). If it is 2 then we can compare pairs of words. If the query size is reasonably limited then we can always set this parameter to the size of the query so that all word combinations in the query are considered. The idea of the search consists in comparing all windows from the first documents with all windows from the second one. Suppose that the first window is equal to the query size. In this case we have only one window consisting of all words from the query. To find the match factor we move the window along the document and compare it with the query. As we move we get various values which reflect the proximity of the query to the chosen window in the document. This sequence is then aggregated along the while document (over all windows) and the final matching factor is produced. For example, we might take the average value or some more complex aggregation function. If the query includes more windows then each of them is compared with all the windows from the source document. To compare two windows we calculate the number of words from the first one included into the second one weighted by distance and permutation. The highest weight is where the whole sequence of words is included in the precisely the same order. If the sequence of words from the first window is included into the other document but there exist intermediate words then the weight is decreased. Each permutation also decreases the weight. |
|
Implementation |
|
| The approach has been
implemented as a document search system CombiS
(accessible only from intranet). It provides a web interface and
allows for finding relevant documents in the internet by providing
queries. We decided not to retrieve, index and store all the documents
from the internet. Instead we use Google
service as an indexed database of all documents to be search
through.
When a query is formulated the system sends it to Google as an external document database. Here it might be any other document database, e.g., a file system with documents. The obtained result is considered as a very approximate and unordered answer to the query. In other words we consider the set of documents retrieved from Google as all approximately relevant documents. This set has to be further narrowed down and ordered according our approach. When a set of relevant documents has been obtained we calculate for each of them the matching factor. As we described in the previous section, the idea is to get documents with high coincidence of word combinations rather than simply by counting word frequencies. The word combinations are taken in a generalized form by allowing for permutations, intermediate words and considering lemmas instead of the original words. We used Enterprise Java Beans to implement the search. Each query is sent to a session bean, which retrieves the set of all relevant documents, then calculates the matching factor for each of them and returns the ordered result. The result is formatted and presented by means of Java Server Pages. We used JBoss as an application server with embedded Jetty servlet and JSP engine. In the result we show first (in red) our own rank and then (in blue) the rank by Google. Only 10 result docuements is retrieved and shown for simplicity. To produce lemmas from the compared documents we used RML lemmatizer working with English, German and Russian dictionaries (LGPL license). This library originally created for Linux has been modified and recompiled for Windows and used from Java via JNI. |
|
Conclusion |
|
| We described a very simple implementation of the approach to search for text documents by word combinations. The main idea is that the meaning of texts is hidden behind higher order combinations of elementary units of representation. In this system we simply detect such common word combinations occurrences and count them (in a generalized form) instead of word counting. Yet it is much more interesting to detect the important (informative) combinations themselves from any corpus of documents and them use them as a higher level coordinate system to represent and analyze documents. | |