Information Retrieval

This page and those that follow it are concerned with the flow, preservation, and retrieval of information in society.

Obviously the internet, the World Wide Web, the various search engines, and the browsers individual people use are (among other things) mechanisms for information retrieval, but they are in many ways inadequate. The information people seek may be out there, but it can be very hard to find. To solve this problem, many people have suggested meta-content information, or content-descriptors, often based on what librarians have called descriptor languages. I have been working on descriptor languages for many years, but what is most distinctive about my approach is its relationship to natural human languages.

This interest in descriptor languages began many years ago with the idea of a language composed entirely of acronyms.

I remember sitting on the floor surrounded by little pieces of paper on which I was writing definitions, pasting them onto a large piece of cardboard, and at the time I must have thought the task easy enough that this kind of cut-and-paste engineering would be feasible. I find it quite hard to remember more about that time, but I seem to recall the idea was to sort words by meaning into 26 clusters and then subdivide those clusters in some similar way.

Melvil Dewey mentioned having experimented with a classification involving 26 classes while working on what eventually became his Decimal System, based on 10 classes, and though he doesn’t go into any detail I suspect those were classes labelled with the letters of the alphabet. I see I’ve just used “class” and “cluster” interchangeably. It will help to keep the terminology straight, and to do that I’ll discuss these things in an entirely different context.

At MDA (the acronym for the R&D firm where I used to work), I worked on the Meridian project, which created an image analysis system, and I was responsible for cluster analysis, signature management, and classification software. Classification in library science is not that different from classification in image analysis, though librarians might be surprised at the idea.

At MDA our images usually came from the Landsat or SPOT remote sensing satellites. Each pixel in one of those images represented a small square of land or sea, and we ran cluster analysis and classification on those images to find out something about those small squares:

— were they land or were they sea, and if sea, how deep?
— were they bare rock or soil, planted with wheat, or forested?
— were they deciduous forest, evergreen forest, or clear-cut?

The librarian’s task is somewhat similar: each book has to be classified into fiction or non-fiction, and according to subject, and in various other ways, so it can be put on a shelf near similar books or at least retrieved easily by the librarian at a library user’s request.

While I was out of town a few years ago, my ex-sister-in-law Cheryl and niece Marika took it upon themselves to clean up my room, a task of unthinkable proportions, and to arrange my books on my shelves. When I returned I found such anomalies as a book on “Small Groups” placed between a book on “Combinatorial Group Theory” and one on “Applied Group Theory”, an interesting and perhaps serendipitous conjunction that caused me many hours of thought, since the “Small Groups” book was actually about the psychology of small-group interaction.

One of the most interesting library classification schemes, and one that seems almost forgotten (a long web search last year found almost nothing about it), is Ranganathan’s Colon Classification, which is quite literally an ideal language, much more so than Dewey’s. It is aimed mostly at the classification of technical books and published papers, which often have titles that state clearly what they are about, and when that is the case Ranganathan’s rules can be applied to translate the title into his descriptor language.

If that were done blindly it might well result in the same kind of amusing conjunction as Cheryl and Marika produced, so obviously librarians had to supplement the system with their own knowledge, but the basic idea is still a linguistic one.
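To make the linguistic idea concrete, here is a minimal sketch of facet-style classification in the spirit of the Colon Classification, which builds a descriptor out of facets (Ranganathan’s PMEST: Personality, Matter, Energy, Space, Time) joined by punctuation marks. The mini-schedule, its codes, and the sample title below are all invented for illustration; the real system works from published schedules for each subject.

    # A toy facet-based descriptor in the spirit of Colon Classification.
    # The facet codes below are invented; real CC uses published schedules.

    FACET_ORDER = ["personality", "matter", "energy", "space", "time"]  # PMEST

    # Hypothetical mini-schedule mapping title words to (facet, code) pairs.
    SCHEDULE = {
        "cotton":  ("personality", "M7"),   # the thing studied
        "weaving": ("energy", ":4"),        # the process applied to it
        "india":   ("space", ".44"),        # where
        "1950s":   ("time", "'N5"),         # when
    }

    def describe(title):
        """Translate a title into a facet-ordered descriptor string."""
        found = {}
        for word in title.lower().replace(",", " ").split():
            if word in SCHEDULE:
                facet, code = SCHEDULE[word]
                found[facet] = code
        # Emit the codes in the fixed PMEST facet order.
        return "".join(found[f] for f in FACET_ORDER if f in found)

    print(describe("Weaving of cotton in India, 1950s"))  # -> M7:4.44'N5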

Regardless of what classification scheme is used, the actual technique employed to classify the books can vary widely. I have quite a thick volume called “Anglo-American Cataloguing Rules” which sets out in excruciating detail the process of classifying a book, a process that involves reading tables of contents, looking at indices, and doing selected reading for content.

In contrast to that, another interesting book on my shelves, “On Retrieval System Theory” by B. C. Vickery, dating from the mid-1960s, sketches ways to manage a library and do information retrieval using computers, an idea quite new at the time.

Lately I’ve moved beyond counting words to full-text translation, but I’d like to say more about the old word-counting idea, which figures prominently in Vickery’s book. The nice thing about word-counting schemes is that they produce lists of numbers, easily handled by computer programs.
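As a minimal sketch of what those lists of numbers look like (the vocabulary and the two scraps of text are invented examples): counting occurrences of each vocabulary word turns a text into a vector that programs can store and compare directly.

    from collections import Counter

    def count_vector(text, vocabulary):
        """Represent a text as word counts over a fixed vocabulary."""
        counts = Counter(text.lower().split())
        return [counts[word] for word in vocabulary]

    vocab = ["oxygen", "carbon", "causation", "impressions"]
    chemistry = "carbon bonds with oxygen and carbon and hydrogen"
    hume = "impressions precede ideas and causation is habit"

    print(count_vector(chemistry, vocab))  # [1, 2, 0, 0]
    print(count_vector(hume, vocab))       # [0, 0, 1, 1]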

In many ways what I used to do in classifying (pixels from) satellite imagery is similar to what has been used so often to classify books and papers. Some things are clearly different: we used to have only a few numbers per pixel, each corresponding to a single spectral band, and we’d classify perhaps a billion pixels into a relatively small number of classes, such as “ground-cover” classes for vegetation growing on the land.

Automatic book classification schemes use many numbers per book, perhaps thousands, quite different from the half-dozen numbers per pixel in image classification; but the number of books or articles is usually smaller than the number of pixels, and the number of classes usually much larger: books can be classified into thousands of distinct classes, whereas image pixels may fit into perhaps a hundred different classes. But all those differences considered, the basic mathematics is pretty much the same.

It is common to use the term signature to describe the most typical numbers expected of items in a class. Vegetation is brighter in the green and near-infrared bands than rock or soil, so the numbers for those spectral bands will be higher in the signature for vegetation classes than in the signatures for rock or bare-soil classes. Similarly, the numbers corresponding to the frequencies of names of chemical elements will be higher in the signatures of chemistry books than in the signatures of books on Hume’s philosophy.
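In these terms a signature is just the mean vector of known members of a class, and the simplest classifier assigns a new item to the class with the nearest signature, the minimum-distance-to-means technique familiar from remote sensing. The band values below are invented for illustration:

    import math

    def signature(samples):
        """Class signature: the mean of the sample vectors."""
        n = len(samples)
        return [sum(band) / n for band in zip(*samples)]

    def classify(item, signatures):
        """Assign item to the class whose signature is nearest."""
        return min(signatures, key=lambda c: math.dist(item, signatures[c]))

    # Invented pixel values in three bands: [green, red, near-infrared].
    sigs = {
        "vegetation": signature([[80, 40, 150], [90, 45, 160]]),
        "bare soil":  signature([[65, 70, 90], [60, 75, 85]]),
    }
    print(classify([85, 42, 155], sigs))  # -> vegetation

The same function would classify a book if the vector held word counts instead of band brightnesses, which is the sense in which the basic mathematics is the same.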

Such signatures will not be collected for all words, but only for words outside a “stop-list” of common words like ‘he’, ‘it’, ‘should’, ‘are’, ‘been’, and so on.
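Applying a stop-list is just a filter run before counting; a sketch, with the word list abbreviated:

    STOP_WORDS = {"he", "it", "should", "are", "been", "the", "have", "and"}

    def content_words(text):
        """Drop stop-list words before counting."""
        return [w for w in text.lower().split() if w not in STOP_WORDS]

    print(content_words("It should have been the oxygen"))  # ['oxygen']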

I seem to be getting into unnecessary detail here, but there is one important point that has to be made: information retrieval software often uses word-counts to classify and retrieve texts, and one aspect of what I do is what mathematicians call the “dual” problem, using text samples to classify words. You may have encountered this notion before in logic: the ‘and’ and ‘or’ operators are duals, as are the ‘union’ and ‘intersection’ operators in set theory.

Theorems proved for the “primal” operator (whichever one you are most interested in) often turn out to be true for the dual operator as well. In logic the theorems about disjunctive normal forms also hold true for conjunctive normal forms, and in optimization theory a very powerful method is to do as much as you can to optimize the primal form of a problem, then switch to solving the dual form, and so on, back and forth, each operation in one direction opening up new possibilities in the dual direction.

So, even though I am now interested in full-text translation followed by automatic summarization as a method for information retrieval, I still have an enormous interest in word-counting. But rather than counting the words in various texts to find texts similar to one another, I count the words in various texts to find words similar to one another: the dual problem.
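In matrix terms the duality is just a transpose. If each row of a count matrix is a text and each column a word, comparing rows finds similar texts (the usual retrieval problem) and comparing columns finds similar words (the dual). A minimal sketch with invented counts:

    def cosine(a, b):
        """Cosine similarity between two count vectors."""
        def norm(v):
            return sum(x * x for x in v) ** 0.5
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (norm(a) * norm(b))

    words = ["acid", "base", "cause", "habit"]
    counts = [                # one row per text, one column per word
        [4, 3, 0, 1],         # text 0: chemistry
        [5, 2, 1, 0],         # text 1: chemistry
        [0, 1, 6, 4],         # text 2: Hume
    ]

    # Primal problem: compare rows to find similar texts.
    print(cosine(counts[0], counts[1]))      # high: both chemistry

    # Dual problem: transpose, then compare word rows.
    by_word = list(zip(*counts))             # one row per word
    print(cosine(by_word[0], by_word[1]))    # 'acid' vs 'base': high
    print(cosine(by_word[0], by_word[2]))    # 'acid' vs 'cause': low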

In the course of the Interpedia project I wrote a lot about abstract spaces which were implicitly continuous spaces — that is to say a point in one of these spaces would be described by specifying coordinate values that could be any real number, or perhaps any complex number, and one could move about the space in microscopic increments by just specifying more decimal places for more precision.

I have for a long time been interested in Glenn Gould’s fantasy about a music box with knobs on, allowing changes not just in loudness but in tempo, style, instrumentation, and so on. The knobs were to be continuously variable controls, and that must surely be the right fantasy — Gould certainly didn’t mean push buttons or toggle switches that had only two positions: fast or slow, baroque or romantic, piano or symphony orchestra.

So why the interest in binary?

It is partly an implementation matter, and there is a user interface question too, but the main reason has to do with information retrieval. Gould’s music box is presumably a generator of music, not a retrieval tool for finding music. I think we could actually implement Gould’s fantasy with some piece of software capable of controlling a synthesizer, and if the software is generating the music it’s not too hard to imagine it generating it with more or less of some musical properties.

But if this device was instead trying to find some appropriate piece of music by searching the internet, then there is a problem, because there may not be, anywhere, a piece of music with exactly the requested properties.

The one fundamental difficulty with continuous spaces for information retrieval is that they are mostly empty. Suppose you have a million different things, spaced out from one another in some abstract space. In a one-dimensional space all million points would lie on a single straight line, which you can visualize as quite tightly packed. But in a two-dimensional space there would be roughly 1000 points along each axis, much less densely packed, and a three-dimensional space would have only 100 points along each axis.
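The arithmetic behind that thinning out: n points spread evenly through a d-dimensional cube leave only about n^(1/d) points along each axis, so the spacing between neighbours grows rapidly with dimension. A quick check:

    # Points per axis when a million items fill a d-dimensional space evenly.
    n = 1_000_000
    for d in (1, 2, 3, 6, 20):
        print(d, n ** (1 / d))
    # d=1: 1000000, d=2: 1000, d=3: 100, d=6: 10, d=20: about 2

With the thousands of dimensions suggested by per-word counts, such a space is almost entirely empty.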

Let me bore you with a story I’ve told before, which I only repeat because I think it makes the point so well. Years ago, while working for MDA, I found myself at BWI, as everyone calls it; I never heard anyone say it in full: Baltimore Washington International airport is a lot of syllables, even though the names of both cities have somewhere between one and two syllables apiece in the local dialect.

Well anyway, since I was going to be working in the area for several months I rented a car at the airport, a Toy(ota) Turtle, I think it was.  I could live with the rather poor performance, 0 to 60 eventually, but I just couldn’t live with the car radio. It had a tuning knob which you had to turn in small increments while driving down the beltway and keeping your eye on six lanes of traffic.

So after about a week of trying to find a station playing music — any kind of music — instead of the usual drive-to-work chatter, I drove back to BWI, took the car back, said I didn’t like it and asked for another one. The woman at the counter gave me a dubious glare and asked “What seems to be the problem sir?” in a cold tone. I said I didn’t like the radio, and asked for a car with push-button tuning.

The woman told me I couldn’t exchange cars just because I didn’t like the radio, but at the airport all the rent-a-car places are lined up along one wall, and when I said I’d just return the car and asked her to recommend one of their competitors, she changed her mind.

The point should be obvious enough — radio stations are scattered along the spectrum with more or less empty space between them (not very empty, in that crowded environment), and turning a knob to find one is not much fun, especially in traffic.

As I said above, there are also internal implementation problems with continuous spaces, and strictly from the point of view of a programmer trying to get code to work, binary is better. But since I am not only writing the code but performing the systems analyst role, I could not in good conscience allow a programmer’s preference for binary over continuous to affect the user’s view of the system.

On the third hand, so to speak, I should probably tell you about the implementation details, especially when there is some doubt about whether the kind of system we are talking about can be implemented at all, ever.

What we are really talking about here is what I will always think of as high-technology, and one rule of thumb I’ve encountered says that if you can understand how it works it’s not high technology.

In particular, the World-Wide Web is not really high technology.

It doesn’t take a lot of magic to simply go out after web pages when the user clicks on a link. Search engines are somewhat more magical, but it’s not too hard to understand how they work, and there is no real mathematics used in them. That’s almost a tautology, because if you can understand how something works there’s probably not much math inside it, and if there is real math inside, you (and I) will have trouble understanding how it works.

This seems like a rather silly thing to say, but when I think of the information retrieval or filtering problem I am more sure that it must have heavy mathematical wizardry inside it than I am of anything else about it. It’s not just that I am fascinated by mathematics and want to see it used more, though I am, and it should be; I have a very strong feeling that the core problem is so difficult that only such strong magic can make it work.

But what is the core problem?

Just in the past day or so I saw on the tube a review of a book called “Data Smog”, the author’s phrase for what he saw as the pollution of our intellectual environment by too much data. I got the impression the author thought this was something of a new idea, but somebody long ago called it “drinking from the firehose”, and many people are working on information filtering schemes precisely to filter out the overflow of irrelevant data.

But the fact that this author seemed to imagine he had a novel idea perhaps proves that, original or not, there is some truth in what he is saying: the flood of data and lack of filters has kept him from seeing all the information-filtering discussions and software projects that are trying hard to do something about that data smog.

In a better world anyone writing a book on some subject like that would have no trouble finding what he missed, and might have some trouble avoiding it, even if he never looked for it.

Information retrieval is one aspect of the core problem — the ability to get what you want, and perhaps what you need or should want, even if you don’t quite know enough to want it. Information filtering is another aspect of the core problem — the ability to avoid what you don’t want, perhaps even when you don’t know enough to know you don’t want it.

There are various other ways of describing the core problem, none of them quite right yet. One concept I have used a lot is that of “true bandwidth”, or “true baud-rate” by which I mean NOT the amount of information that can flow across some interface and be sensed (seen or heard) by you, but rather the amount of information you actually absorb and convert into knowledge or wisdom.

Another, related concept is that of “true throughput”, by which I mean NOT the number of bytes of e-mail read and answered, or the number of pages removed from your in-basket and commented on or answered by something in your outbox, but something more fundamental, something very hard to define.

Imagine two people with office jobs that require reading and answering messages: e-mail, printed memos, faxes, perhaps even telephone messages. In terms of bits and bytes the “apparent throughput” of the two might be the same, just as many bytes of e-mail or pages of memos, but one of these people might be writing the same or similar sentences over and over again, really just responding to familiar stimuli with known responses.

Meanwhile the other person may be actually understanding the messages he receives, including both what is explicit and what can be read between the lines, and actually coming up with truly appropriate and original responses. Whereas his colleague’s conscious intelligence may be almost “out of the loop”, his responses just the typical responses such stimuli usually receive, this person’s conscious mind is “in the loop”, consciously applying his intelligence to the best of his ability.

So by “true throughput” I don’t mean just the number of bytes of e-mail received and answered, but the number of bytes which have actually been put through the person’s conscious intelligence. Even the second of our two people probably lets a little through while thinking about other things, so his true throughput is less than his nominal or apparent throughput, but unlike his colleague he is actually processing the data stream that is put through him.

Part of this fantasy is that there are ways to increase our “true input bandwidth”, so more of the bits we get are actually converted into knowledge and wisdom, and there are also ways to increase our “true throughput” so that we do engage our conscious intelligence in what we do. So this approach considers information retrieval as one side of a larger problem, and that side involves not just finding what you want, but finding what you should want or would want if you knew any better.

There are actually two very different ways to do this.

One is by better software, including much better information retrieval software and much better filtering software, and a much better user interface, which might include an ideal language (which may or may not be visible to the user).  The Interpedia Project was for me an attempt to create such software and use it for indexing the Internet.

The other way is by changing the structures of human society itself, to use people as near-perfect interfaces for each other, through the use of combinatorial optimization to match people with truly compatible friends, co-workers, and mates. Take a look at my home page and follow the Social Technology link for more on this.
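As a minimal sketch of the kind of combinatorial optimization I have in mind, here is the classical assignment problem solved with SciPy’s implementation of it; the compatibility scores are invented, and in a real system producing them would itself be the hard part:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    # Invented compatibility scores between people and jobs (higher = better).
    compat = np.array([
        [9, 2, 5],
        [4, 8, 7],
        [6, 3, 9],
    ])

    # Maximize total compatibility by minimizing the negated scores.
    people, jobs = linear_sum_assignment(-compat)
    for p, j in zip(people, jobs):
        print(f"person {p} -> job {j} (score {compat[p, j]})")
    # Optimal total here is 9 + 8 + 9 = 26.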


Copyright © 1998 Douglas P. Wilson    




