Wednesday, October 3, 2007

Data mining and "Neural Nets"


The phrase "data mining" doesn't refer to anything specific, but rather to a collection of techniques that enables one to go through a very large collection of data stored on a computer to find patterns. This is very different from simply going through the data and asking for examples of a particular correlation or statistical relationship (e.g. How many owners of Fords traded them in to buy Chevrolets?).

Data mining is not so much a mathematical idea as a collection of computer science techniques, some of which have mathematical origins, usually in statistics but also in combinatorics (the study of arrangements and selections from finite sets of objects), mathematical logic and mathematical linguistics. If the data is numerical in nature, other math techniques may be drawn in as well, such as pattern recognition, Fourier analysis and autocorrelation (sophisticated mathematical techniques for analyzing images and sound patterns).

The first step in data mining is to prepare the set of information or database. This can be most anything, but it has to be storable on a computer in a readily accessible way. This database can consist of spreadsheets, standard database files, tables, text, even pictures, sounds and charts.

Next, you need software that can find patterns by comparing all sorts of aspects of the data. For example, you can have a chart of India, with some states colored blue and others red. You can also have tables that list states, their populations, divorce rates, frequencies of teenage pregnancy, average income, education, number of licensed motor vehicles and state flowers. The software then scans the chart and the tables and tries to find correlations. It may -- in fact it does -- find that states colored "red" have the highest rates for divorce and teenage pregnancy, while state flower correlates with nothing interesting. No one has "asked" the software to find these correlations or lack of correlation: it simply looks through all parameters, checks all possible relationships within the data, and reports the strongest correlations it finds.
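To make "checks all possible relationships and reports the strongest" concrete, here is a minimal sketch in Python. It assumes every column has already been turned into numbers, and it uses the ordinary Pearson correlation as the measure of strength; real data mining software uses many other measures. The column names and all the numbers are made up for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column correlates with nothing
    return cov / sqrt(var_x * var_y)


def strongest_correlations(table, top=3):
    """Score every pair of columns and return the strongest ones first."""
    scored = []
    for a, b in combinations(table, 2):
        r = pearson(table[a], table[b])
        scored.append((abs(r), a, b, r))
    scored.sort(reverse=True)
    return [(a, b, r) for _, a, b, r in scored[:top]]


# Hypothetical table: one entry per state, values invented for the example.
table = {
    "is_red": [1, 1, 0, 0, 1, 0],
    "divorce_rate": [5.1, 4.8, 3.0, 3.2, 4.9, 2.9],
    "flower_code": [3, 1, 2, 3, 2, 1],
}

for a, b, r in strongest_correlations(table):
    print(f"{a} vs {b}: r = {r:.2f}")
```

Nobody tells the program which pair to look at: it tries every pair and ranks them. With these invented numbers the color and divorce-rate columns come out on top, while the flower column lands near zero against both.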

How does the software do this enormous chore? It uses very sophisticated techniques involving analysis of existing links in data, statistics, and algorithms from a fairly recent (several decades old) field called "Artificial Intelligence" or AI. One of the developments in AI is a simulation of how part of our nervous system, including the brain, works. This simulation is called Neural Networks, Neural Networking, or simply Neural Nets.

The basic physical building blocks of our thinking or cognitive system are cells called neurons. These are like nodes in a vastly complicated interlocking web. They are connected to each other by physical wiring, which transmits impulses that are both electrical and chemical in nature. These connections transmit signals of varying strength or level which go from one neuron to another. When an input signal to a neuron reaches a certain high enough level or threshold, the neuron responds by sending out a signal to other neurons to which it is connected. Thus, a stimulus of sufficient strength to one neuron (say a detection of the color red in some area to which it is connected, perhaps a region in the retina) will result in a cascade of signals to lots of other neurons, eventually to those in the brain. The interconnections of the neurons in the brain then allow us to string together many of these signals into parts of a pattern which our memory may link to form a thought, such as: "I am seeing part of a fire engine" or "that's blood!" In the case of the human mind, we are only beginning to understand how this data is processed: it is nowhere near as simple as was hoped early in the development of the subject.

Neural Nets is an attempt to make a miniature nervous system on a computer. Regions of the computer's memory are set aside as neurons, and certain links are programmed to connect them. These links are assigned numerical "thresholds", so that they will allow connections through neurons only when enough "evidence" (stimulus) has built up from the data to exceed this threshold.
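The threshold idea can be sketched in a few lines of Python. This is only a toy, not how any particular data mining package works. It follows the common convention of putting a numerical weight on each link and a threshold on each neuron, which is an assumption on my part; the function names and the numbers are invented for the example.

```python
def neuron_fires(inputs, weights, threshold):
    """Return True when the weighted evidence reaches the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return total >= threshold


def run_layer(inputs, layer):
    """Feed the same inputs to several neurons; each outputs 1 or 0."""
    return [
        1 if neuron_fires(inputs, weights, threshold) else 0
        for weights, threshold in layer
    ]


# Hypothetical "red detector": two incoming signals, the first counts more.
red_weights = [1.0, 0.5]
red_threshold = 0.8

print(neuron_fires([0.9, 0.2], red_weights, red_threshold))  # strong stimulus
print(neuron_fires([0.3, 0.2], red_weights, red_threshold))  # weak stimulus

# A layer is just a list of (weights, threshold) pairs.
layer = [(red_weights, red_threshold), ([0.2, 0.2], 0.5)]
print(run_layer([0.9, 0.2], layer))
```

The strong stimulus adds up to 1.0 and passes the 0.8 threshold, so the signal goes on; the weak one adds up to 0.4 and is filtered out. Chaining layers, so that the outputs of one become the inputs of the next, gives the cascade described above.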

Here's an example. You have a database consisting of facts about a commercial garden: the layout of the rows (type of plants in each), the spaces between the plants in the row, the insects found in various places and the frequency and type of irrigation and fertilization. It also holds the date of maturity of various crops, last year's data from the garden, the cost, yield and health of each type of crop planted, as well as other economic data. We then turn our data mining software loose on this database. It may make various crops and/or conditions into neurons, with connections to each other based on position in the garden, fertilizer, or economic variables. Statistical correlations built on numerical data determine the signals sent on these pathways, and possible thresholds are tried. The system can then see which signals move through which pathways and which are filtered out because they don't meet the threshold conditions. The resulting flow of signals enables the system to "draw conclusions." For example, it may "deduce" that more profit can be made by putting more fertilizer into tomatoes; or it may decide that growing eggplants near potatoes is good for the potatoes and bad for the eggplants (eggplants happen to attract potato beetles). We don't necessarily know which pathways -- hence which deductions -- will emerge from the flow of signals in this complicated system; however, the system will report back to us exactly what happens, and give a complete statistical analysis of the relationships.

This whole set-up is somewhat problematic when it comes to solving a particular crime. It usually takes quite a while to prepare the database for data mining, and the algorithms have to be fine-tuned. This may take from weeks to years, depending on the data and the sophistication and accuracy required.

Finally, one must remember that any sufficiently complex system will have "bugs": things that make it go wrong, giving unpredictably false answers and/or correlations. Remember "Jurassic Park", the system that couldn't fail and the dinosaurs that simply couldn't reproduce because they were all female? Of course that was just a movie...
