SPIN! Spatial Data Mining System

Alexandr Savinov
Fraunhofer Institute for Autonomous Intelligent Systems
Schloss Birlinghoven, Sankt-Augustin, D-53754 Germany
tel: +49-2241-142629, fax: +49-2241-142072
savinov@ais.fraunhofer.de, http://www.ais.fraunhofer.de/~savinov/


Introduction
The SPIN! data mining system has a component-based architecture in which each component encapsulates a specific piece of functionality, such as a data source, an analysis algorithm, or a visualization. Individual components can be visually linked within one workspace to solve different data mining tasks. The user-friendly interface of SPIN! and its flexible underlying component architecture provide an integrated environment for the main tasks of a typical data mining cycle: data preparation, analysis, and visualization.

Component Architecture
The SPIN! data mining system has a component architecture. This means that it provides only an infrastructure and environment, while all the system functionality comes from separate software modules called components. Components can be easily plugged into the system, which makes it possible to extend its capabilities; the importance of such extensibility for data mining was stressed in [5]. In this sense it is very similar to a general-purpose environment such as Eclipse. Each component is developed as an independent module for solving one task or a limited number of tasks. For example, there may be components for data access, analysis, or visualization. To solve complex problems, components need to cooperate by using one another.

All components are implemented on the basis of the CoCon Common Connectivity Framework, a set of generic interfaces and objects in Java that allows components to communicate within one workspace. The cooperation of components is based on the idea that they can be linked by means of different types of connections. Currently there are three connection types: visual, hierarchical, and user-defined. Visual connections link a component to its view (similar to the Model-View-Controller architecture). Hierarchical connections compose parent-child relationships among components within one workspace, e.g., between a folder and its elements or between a knowledge base and its patterns. The third and main type is the user connection, which is used to link components in the workspace arbitrarily, according to the task to be solved (similar to Clementine). It is important that components explicitly declare their connectivity capabilities, i.e., the system knows how they can be connected and with which other components they can work. In particular, SPIN! configures itself according to the available components by exposing their functions in the menu and toolbar.
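
The idea of explicitly declared connectivity can be illustrated with a minimal Java sketch. It is not the CoCon API: the names ConnectivitySketch, Component, ConnectionType, accepts and canConnectTo are hypothetical and chosen only for this illustration. Each component declares which kinds of targets it accepts for each of the three connection types, and a link is allowed only if it was declared.

```java
import java.util.Collections;
import java.util.EnumMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ConnectivitySketch {

    /** The three connection types named in the text. */
    enum ConnectionType { VISUAL, HIERARCHICAL, USER }

    /** Hypothetical component that declares which kinds of targets it accepts. */
    static class Component {
        final String kind;
        private final Map<ConnectionType, Set<String>> accepted =
                new EnumMap<>(ConnectionType.class);

        Component(String kind) {
            this.kind = kind;
        }

        /** Declares that this component may be linked to targets of the given kind. */
        Component accepts(ConnectionType type, String targetKind) {
            accepted.computeIfAbsent(type, t -> new HashSet<>()).add(targetKind);
            return this;
        }

        /** Returns true only if a link of this type to the target was declared. */
        boolean canConnectTo(ConnectionType type, Component target) {
            Set<String> kinds =
                    accepted.getOrDefault(type, Collections.<String>emptySet());
            return kinds.contains(target.kind);
        }
    }

    public static void main(String[] args) {
        // Kind names such as "query" and "ruleBase" are illustrative assumptions.
        Component algorithm = new Component("algorithm")
                .accepts(ConnectionType.USER, "query")
                .accepts(ConnectionType.USER, "ruleBase")
                .accepts(ConnectionType.VISUAL, "algorithmView");
        Component query = new Component("query");
        Component folder = new Component("folder");

        System.out.println(algorithm.canConnectTo(ConnectionType.USER, query));  // declared link
        System.out.println(algorithm.canConnectTo(ConnectionType.USER, folder)); // undeclared link
    }
}
```

A registry of such declarations is also one plausible way for an environment to decide which menu and toolbar entries to expose, although the text does not say how SPIN! does this internally.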


Workspace Management
A workspace is a set of components and the connections among them. It can be stored in or retrieved from persistent storage such as a file or a database. The workspace appears in two views: a tree view and a graph view (upper and middle left windows in Fig. 1). In the tree view the hierarchical structure of the workspace is visualized with components as individual nodes, which can be expanded or collapsed. In the graph view components are visualized as nodes of a graph, while user connections are its edges.
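
The two views can be thought of as two projections of the same workspace model. The following Java sketch is an illustration under that assumption and not the SPIN! implementation; the classes Node and Edge and the methods treeView and graphView are hypothetical. Hierarchical connections are kept as parent-child lists and rendered as an indented outline, while user connections are kept as directed edges and rendered one per line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WorkspaceSketch {

    /** Hypothetical workspace component; children model hierarchical connections. */
    static class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** A user connection, drawn as a directed edge in the graph view. */
    static class Edge {
        final Node source;
        final Node target;

        Edge(Node source, Node target) {
            this.source = source;
            this.target = target;
        }
    }

    /** Tree view: an indented outline of the hierarchical structure. */
    static String treeView(Node root) {
        StringBuilder out = new StringBuilder();
        appendTree(root, 0, out);
        return out.toString();
    }

    private static void appendTree(Node node, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(node.name).append('\n');
        for (Node child : node.children) {
            appendTree(child, depth + 1, out);
        }
    }

    /** Graph view: one "source -> target" line per user connection. */
    static List<String> graphView(List<Edge> edges) {
        List<String> lines = new ArrayList<>();
        for (Edge edge : edges) {
            lines.add(edge.source.name + " -> " + edge.target.name);
        }
        return lines;
    }

    public static void main(String[] args) {
        Node query = new Node("Database Query");
        Node algorithm = new Node("Algorithm");
        Node ruleBase = new Node("Rule Base").add(new Node("Rule 1"));
        Node root = new Node("Workspace").add(query).add(algorithm).add(ruleBase);

        System.out.print(treeView(root));
        // Edge directions here are illustrative only.
        List<Edge> edges = Arrays.asList(new Edge(query, algorithm),
                                         new Edge(algorithm, ruleBase));
        graphView(edges).forEach(System.out::println);
    }
}
```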

Components can be added to the workspace by choosing them either in the menu or in the component bar. After a component has been added, it should be connected with other relevant components. A simple way to do this is to draw an arrow from the source component to the target one. For example, a data source can be specified for an algorithm by linking the two components in the graph view. When connections are added, SPIN! uses the connectivity information declared by the components, so that only components that are able to cooperate can actually be connected.

Figure 1. The SPIN! data mining system client interface: workspace (upper and middle left windows), rule base (upper right window), database connection (lower left window), database query and algorithm (lower middle and right windows).

Each component has an appropriate view, which is itself a connectable component. Each component can be opened in a separate window so that the user can use its functions. When a workspace component is opened, the system automatically creates a view, connects it with the model, and then displays it in an internal window.


Figure 2. The SPIN! data mining system client interface: the subgroup discovery algorithm finds interesting subsets of spatial objects and highlights them on the map when a subgroup is selected in the list. The same map simultaneously displays a set of clusters discovered by another algorithm.


Executing Data Mining Algorithms
Typical data mining tasks include data preprocessing, analysis, and visualization. For data access the SPIN! system includes the Database Connection and Database Query components (lower left and middle windows in Fig. 1). The Database Connection represents the database where the data is stored. To use the database, this component should be connected to some other component, e.g., via the graph view. The Database Query component describes one query, i.e., how the result set is generated from tables in the database. Essentially this component is an SQL query design tool, which allows the user to describe a result set by choosing tables, columns, restrictions, functions, etc. Note that neither the Database Connection nor the Database Query component works by itself; it is always some other component that makes use of them. Such encapsulation of functionality and the use of connections to configure different analysis strategies has been one of the main design goals of the SPIN! data mining system.
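
To make the notion of a query description concrete, the following Java sketch shows a hypothetical object that only stores the chosen tables, columns, and restrictions and renders them as SQL text on request. It is an assumption-laden illustration, not the Database Query component itself: the class QueryDescription, its methods, and the table and column names are invented for this example, and no identifier quoting or escaping is attempted.

```java
import java.util.ArrayList;
import java.util.List;

public class QueryDescriptionSketch {

    /** Hypothetical description of one query; it holds no data and no connection. */
    static class QueryDescription {
        private final List<String> columns = new ArrayList<>();
        private final List<String> tables = new ArrayList<>();
        private final List<String> restrictions = new ArrayList<>();

        QueryDescription column(String name) {
            columns.add(name);
            return this;
        }

        QueryDescription table(String name) {
            tables.add(name);
            return this;
        }

        QueryDescription restriction(String condition) {
            restrictions.add(condition);
            return this;
        }

        /** Renders the description as SQL text; another component would execute it. */
        String toSql() {
            if (tables.isEmpty()) {
                throw new IllegalStateException("At least one table is required");
            }
            StringBuilder sql = new StringBuilder("SELECT ");
            sql.append(columns.isEmpty() ? "*" : String.join(", ", columns));
            sql.append(" FROM ").append(String.join(", ", tables));
            if (!restrictions.isEmpty()) {
                sql.append(" WHERE ").append(String.join(" AND ", restrictions));
            }
            return sql.toString();
        }
    }

    public static void main(String[] args) {
        // Table and column names are made up for the example.
        String sql = new QueryDescription()
                .column("name")
                .column("population")
                .table("district")
                .restriction("population > 10000")
                .toSql();
        System.out.println(sql);
        // prints: SELECT name, population FROM district WHERE population > 10000
    }
}
```

Keeping the description separate from execution mirrors the point made above: the query component does nothing by itself, and a consumer such as an algorithm decides when and where the query is run.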

Any knowledge discovery task includes a data analysis step, in which the dataset obtained from the preprocessing step is processed by some data mining algorithm. The SPIN! system currently includes several data mining algorithm components, e.g., subgroup discovery [1], rule induction based on empty intervals in data [4], spatial association rules [2], spatial cluster analysis, and Bayesian analysis. To use an algorithm, say, Optimist rule induction [4], we add this component to the workspace and connect it both to the Database Query from which the data is loaded and to the Rule Base component in which the result is stored.

The algorithm can be started by pressing the Start button in its view (lower right window in Fig. 1). It then runs in its own separate thread, either on the client or on the server within an Enterprise Java Bean container [3]. The rules generated by the algorithm are stored in the Rule Base component connected to the algorithm. The rules can be visualized and further analyzed by opening this component in a separate view (upper right window in Fig. 1).
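
The client-side case, in which an algorithm runs in its own thread and writes its results into a connected rule base, can be sketched in plain Java as follows. This is only an illustration: the classes RuleBase and Algorithm are hypothetical, and the "rule" produced for each input row is a placeholder string, not the output of any of the mining algorithms mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class AlgorithmRunSketch {

    /** Hypothetical rule base; thread-safe so a view could read while rules arrive. */
    static class RuleBase {
        private final List<String> rules = new CopyOnWriteArrayList<>();

        void add(String rule) {
            rules.add(rule);
        }

        List<String> rules() {
            return new ArrayList<>(rules);
        }
    }

    /** Hypothetical algorithm component connected to its input rows and a rule base. */
    static class Algorithm implements Runnable {
        private final List<String> rows;
        private final RuleBase ruleBase;

        Algorithm(List<String> rows, RuleBase ruleBase) {
            this.rows = rows;
            this.ruleBase = ruleBase;
        }

        @Override
        public void run() {
            // Placeholder logic: a real algorithm would induce rules from the data.
            for (String row : rows) {
                ruleBase.add("IF " + row + " THEN interesting");
            }
        }

        /** Counterpart of pressing Start: launch the work on a separate thread. */
        Thread start() {
            Thread worker = new Thread(this, "algorithm-worker");
            worker.start();
            return worker;
        }
    }

    /** Runs the algorithm on its own thread and waits for the result. */
    static List<String> runToCompletion(List<String> rows) {
        RuleBase ruleBase = new RuleBase();
        Thread worker = new Algorithm(rows, ruleBase).start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return ruleBase.rules();
    }

    public static void main(String[] args) {
        System.out.println(runToCompletion(Arrays.asList("area = north", "income = high")));
    }
}
```

In an interactive client the calling thread would normally not block on join() as runToCompletion does; it is done here only so that the example has a deterministic result.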


N-tier EJB-based Architecture
The general SPIN! architecture is shown in Figure 3. It is an n-tier client/server architecture based on Enterprise Java Beans for the server-side components. A major advantage of using Enterprise Java Beans is that tasks such as controlling and maintaining user access rights, handling multi-user access, pooling database connections, caching, handling persistence, and managing transactions are delegated to the EJB container. The architecture has the following major subsystems: the client, application servers each with one or more EJB containers, one or more database servers, and optional compute servers.


Figure 3. SPIN! platform architecture. Main components are a Java-based client, an Enterprise Java Beans Container and one or more databases serving spatial and non-spatial data.

The SPIN! client is a standalone Java application. It always creates one server-side representative in the form of a session bean. The methods of the session bean are accessed through the corresponding remote reference via the Java RMI or CORBA IIOP protocol. The client session bean executes various server-side tasks on behalf of the client. In particular, workspace objects may be loaded from or saved to persistent storage through it. The client is based on the component connectivity framework, which is implemented in Java as a connectivity library (CoCon). The idea is that the workspace consists of components, each of which is considered a store for a set of parameters and pieces of functionality such as algorithms. The system functionality is determined by the set of available components.

The application server is an Enterprise Java Bean container. It manages the client workspace, analysis tasks, data access, and persistence. There may be multiple containers running simultaneously on one or more servers. Among other things, this means that different algorithms and alternative tasks can be executed on different computers under different restrictions. The SPIN! system uses an EJB container for making workspaces persistent in the database and for remote computations. For the first task the client creates a special session bean, which is responsible for workspace persistence and access on the server side. Specifically, if the client needs to load or save a workspace, it delegates this task to this session bean. For the second task the client creates one remote object for each analysis task that is to be run, so that data can be transferred directly from the database to the algorithm. After the analysis is finished, the result is transferred to the client for visualization.

User data are stored in the primary data storage, which is a relational database system accessed via JDBC, which is part of the J2EE standard. The database can reside on the same machine as the application server, on the client machine, or on a separate dedicated computer. Optionally, there may be one or more secondary databases. In addition, data can be loaded from other sources such as other databases, ASCII files in the file system, or Excel files. For remote computations it is important that server data be transferred directly into the remote algorithm, bypassing the client. Only a set of components (a subgraph of the workspace) is transferred between the application server and the client. In enterprise applications the amount of data processed may be quite large, so it is very important to avoid unnecessary network traffic. In particular, if the data is going to be processed remotely on the application server, it should be transferred directly from the storage to this computing server. This is precisely what is done in the SPIN! system, where the workspace stores only a description of the data. This description is transferred to the application server, which then loads the data itself directly from the specified database for processing.
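
The principle that only a data description crosses the client/server boundary can be sketched in plain Java without any EJB or JDBC dependencies. Everything named here is an assumption made for illustration: DataDescription, DataStore, RemoteAlgorithm, toWire and fromWire are hypothetical, Java serialization stands in for the remote call, and an in-memory DataStore stands in for the database that the server would query.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DataDescriptionSketch {

    /** What the workspace stores and sends: where the data is and how to select it. */
    static class DataDescription implements Serializable {
        private static final long serialVersionUID = 1L;
        final String source; // hypothetical identifier of a database known to the server
        final String query;  // query text to be executed on the server side

        DataDescription(String source, String query) {
            this.source = source;
            this.query = query;
        }
    }

    /** Stand-in for a database reachable from the server. */
    interface DataStore {
        List<String> execute(String query);
    }

    /** Client side: serialize the description only, never the rows. */
    static byte[] toWire(DataDescription description) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(description);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static DataDescription fromWire(byte[] wire) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(wire))) {
            return (DataDescription) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Server side: resolve the description and load the data locally. */
    static class RemoteAlgorithm {
        private final Map<String, DataStore> stores;

        RemoteAlgorithm(Map<String, DataStore> stores) {
            this.stores = stores;
        }

        /** Placeholder analysis: counts the rows selected by the description. */
        int countRows(byte[] wire) {
            DataDescription description = fromWire(wire);
            DataStore store = stores.get(description.source);
            if (store == null) {
                throw new IllegalArgumentException("Unknown source: " + description.source);
            }
            return store.execute(description.query).size();
        }
    }

    public static void main(String[] args) {
        Map<String, DataStore> stores = new HashMap<>();
        stores.put("primary", query -> Arrays.asList("row 1", "row 2", "row 3"));

        byte[] wire = toWire(new DataDescription("primary", "SELECT * FROM district"));
        int rows = new RemoteAlgorithm(stores).countRows(wire);
        System.out.println("Rows analyzed on the server: " + rows);
    }
}
```

The size of the serialized description depends only on the source name and query text, not on the number of rows selected, which is the property the paragraph above relies on to avoid unnecessary network traffic.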


References

[1]  Klösgen, W., May, M. Spatial Subgroup Mining Integrated in an Object-Relational Spatial Database. PKDD 2002, Helsinki, Finland, August 2002, 275-286.

[2]  Lisi, F.A., Malerba, D. SPADA: A Spatial Association Discovery System. In: A. Zanasi, C.A. Brebbia, N.F.F. Ebecken and P. Melli (eds.), Data Mining III, Series: Management Information Systems, Vol. 6, WIT Press, 2002, 157-166.

[3]  May, M., Savinov, A. An Integrated Platform for Spatial Data Mining and Interactive Visual Analysis. Data Mining 2002, Third International Conference on Data Mining Methods and Databases for Engineering, Finance and Other Fields, Bologna, Italy, 25-27 September 2002, 51-60.

[4]  Savinov, A. Mining Interesting Possibilistic Set-Valued Rules. In: Da Ruan and Etienne E. Kerre (eds.), Fuzzy If-Then Rules in Computational Intelligence: Theory and Applications, Kluwer, 2000, 107-133.

[5]  Wrobel, S., Wettschereck, D., Sommer, E., Emde, W. Extensibility in Data Mining Systems. In: Proceedings of KDD’96, 2nd International Conference on Knowledge Discovery and Data Mining, AAAI Press, 1996, 214-219.