| SPIN! Spatial Data Mining System | |
|
Alexandr Savinov |
|
Introduction |
|
| The SPIN! data mining system has a component-based architecture in which each component encapsulates a specific piece of functionality, such as a data source, an analysis algorithm or a visualization. Individual components can be visually linked within one workspace to solve different data mining tasks. The user-friendly interface and the flexible underlying component architecture of SPIN! provide an integrated environment for executing the main tasks of a typical data mining cycle: data preparation, analysis, and visualization. | |
Component Architecture |
|
| The SPIN! data mining system
has a component architecture. This means that it provides only an
infrastructure and environment, while all the system functionality comes
from separate software modules called components. Components can be
easily plugged into the system, which makes it possible to extend its
capabilities; the importance of such extensibility for data mining was
stressed in [5]. In this sense it is very similar to a general-purpose
environment such as Eclipse. Each component is developed as an
independent module for solving one or a limited number of tasks. For
example, there may be components for data access, analysis or
visualization. In order to solve complex problems, components need to
cooperate by using each other's functionality.
All components are implemented on the
basis of the CoCon Common Connectivity Framework, a set of generic
interfaces and objects in Java that allows components to communicate
within one workspace. The cooperation of components is based on the idea
that they can be linked by means of different types of connections.
Currently there are three types of connections: visual, hierarchical and
user-defined. Visual connections are used to link a component to its
view (similar to the Model-View-Controller architecture). Hierarchical
connections are used to compose parent-child relationships among
components within one workspace, e.g., between a folder and its elements
or a knowledge base and its patterns. The third and main type is the
user connection, which is used to link components in the workspace
arbitrarily, according to the task to be solved (similar to Clementine).
It is important that components explicitly declare their connectivity
capabilities, i.e., the system knows how they can be connected and with
what other components they can work. In particular, the SPIN! system
configures itself according to the available components by exposing
their functions in the menu and toolbar. |
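The idea of declared connectivity can be illustrated with a minimal sketch. It is written in Python for brevity and is not the CoCon API (CoCon is a Java library); the names `Component`, `ConnectionType`, `accepts` and `connect` are assumptions introduced only for this illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class ConnectionType(Enum):
    VISUAL = "visual"              # links a component to its view
    HIERARCHICAL = "hierarchical"  # links a parent to a child
    USER = "user"                  # task-specific link chosen by the user


@dataclass(eq=False)
class Component:
    """Hypothetical component that declares what it can be linked to."""
    name: str
    kind: str
    # Assumed form of a connectivity declaration: the kinds of components
    # this component accepts as targets of a user connection.
    accepts: frozenset = frozenset()
    links: list = field(default_factory=list)

    def can_connect(self, target: "Component") -> bool:
        return target.kind in self.accepts

    def connect(self, target: "Component",
                ctype: ConnectionType = ConnectionType.USER) -> None:
        # Only user connections are checked against the declaration here.
        if ctype is ConnectionType.USER and not self.can_connect(target):
            raise ValueError(f"{self.kind} cannot be connected to {target.kind}")
        self.links.append((ctype, target))


query = Component("query", "DatabaseQuery")
algorithm = Component("optimist", "Algorithm",
                      accepts=frozenset({"DatabaseQuery", "RuleBase"}))
algorithm.connect(query)            # allowed by the declaration
print(algorithm.can_connect(query), query.can_connect(algorithm))
```

Because each component carries its own declaration, an environment can decide which links to offer without knowing anything else about the component.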
|
Workspace Management |
|
| A workspace is a set of
components and the connections among them. It can be stored in or
retrieved from persistent storage such as a file or a database. A
workspace appears in two views: tree view and graph view (upper and
middle left windows in Fig. 1). In tree view, the hierarchical structure
of the workspace is visualized with components as individual nodes,
which can be expanded or collapsed. In graph view, components are
visualized as nodes of a graph, while user connections are its edges.
Components can be added to the
workspace by choosing them from either the menu or the component bar.
After a component has been added, it should be connected to other
relevant components. The simplest way to do this is to draw an arrow
from the source component to the target one. For example, we can specify
a data source for an algorithm by linking the two components in graph
view. When connections are added between components, the SPIN! system
uses information about their connectivity so that only components that
are able to cooperate can actually be connected.
Each component has a corresponding view, which is itself a connectable component. A component can be opened in a separate window so that the user can access its functions. When a workspace component is opened, the system automatically creates a view, connects it to the model and then displays it in an internal window.
|
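A workspace of this kind can be sketched as a small data structure holding both the hierarchy and the user connections. The sketch below is in Python and purely illustrative: the class `Workspace`, its methods and the component kinds are hypothetical, not part of SPIN! or CoCon.

```python
class Workspace:
    """Toy workspace: components plus hierarchical and user connections."""

    def __init__(self, allowed_links):
        self.kinds = {}      # component name -> component kind
        self.children = {}   # parent name -> child names (hierarchy)
        self.edges = []      # (source, target) user connections
        # Assumed connectivity information: (source kind, target kind) pairs.
        self.allowed_links = set(allowed_links)

    def add(self, name, kind, parent=None):
        self.kinds[name] = kind
        self.children[name] = []
        if parent is not None:
            self.children[parent].append(name)

    def link(self, source, target):
        """Add a user connection only if the two kinds can cooperate."""
        pair = (self.kinds[source], self.kinds[target])
        if pair not in self.allowed_links:
            raise ValueError(f"cannot connect {pair[0]} to {pair[1]}")
        self.edges.append((source, target))

    def tree_view(self, root, depth=0):
        """Return the hierarchy below root as indented lines."""
        lines = ["  " * depth + root]
        for child in self.children[root]:
            lines.extend(self.tree_view(child, depth + 1))
        return lines

    def graph_view(self):
        """Return the user connections as 'source -> target' strings."""
        return [f"{source} -> {target}" for source, target in self.edges]


ws = Workspace({("DatabaseQuery", "Algorithm"), ("Algorithm", "RuleBase")})
ws.add("workspace", "Folder")
ws.add("query", "DatabaseQuery", parent="workspace")
ws.add("optimist", "Algorithm", parent="workspace")
ws.add("rules", "RuleBase", parent="workspace")
ws.link("query", "optimist")
ws.link("optimist", "rules")
print("\n".join(ws.tree_view("workspace")))
print(ws.graph_view())
```

The tree view walks only the hierarchical connections and the graph view lists only the user connections, mirroring the two windows described above.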
|
Executing Data Mining Algorithms |
|
| Typical data mining
tasks include data preprocessing, analysis and visualization. For data
access, the SPIN! system includes Database Connection and Database Query
components (lower left and middle windows in Fig. 1). The Database
Connection component represents the database where the data is stored.
To use the database, this component should be connected to some other
component, e.g., via graph view. The Database Query component describes
one query, i.e., how the result set is generated from tables in the
database. Essentially, this component is an SQL query design tool, which
allows the user to describe a result set by choosing tables, columns,
restrictions, functions etc. Note also that neither the Database
Connection nor the Database Query component works by itself; it is
always some other component that makes use of them. Such encapsulation
of functionality and the use of connections to configure different
analysis strategies have been among the main design goals of the SPIN!
data mining system.
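As an illustration of a component that only describes a result set, the following Python sketch turns a choice of table, columns and restriction into an SQL statement and leaves execution to whoever uses it. The class `QueryDescription`, the `districts` table and its rows are invented for this example; SQLite stands in for the database.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class QueryDescription:
    """Hypothetical stand-in for the settings of a Database Query component."""
    table: str
    columns: tuple
    restriction: str = ""   # optional WHERE body with '?' placeholders
    params: tuple = ()

    def to_sql(self) -> str:
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.restriction:
            sql += f" WHERE {self.restriction}"
        return sql

    def run(self, connection):
        # The description does nothing by itself; another component
        # supplies the connection and asks for the rows.
        return connection.execute(self.to_sql(), self.params).fetchall()


# Made-up demo data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE districts (name TEXT, population INTEGER)")
conn.executemany("INSERT INTO districts VALUES (?, ?)",
                 [("A", 1200), ("B", 300), ("C", 4500)])

query = QueryDescription("districts", ("name",), "population > ?", (1000,))
print(query.to_sql())
print(query.run(conn))
```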
Any knowledge discovery task includes
a data analysis step, in which the dataset obtained from the
preprocessing step is processed by some data mining algorithm. The SPIN!
system currently includes several data mining algorithm components,
e.g., subgroup discovery [1], rule induction based on empty intervals in
data [4], spatial association rules [2], spatial cluster analysis and
Bayesian analysis. To use an algorithm, say, Optimist rule induction
[4], we need to add this component to the workspace and connect it to a
Database Query component, from which the data is loaded, and to a Rule
Base component, in which the result is stored. The algorithm can be
started by pressing the Start button in its view (lower right window in
Fig. 1). It then runs in its own separate thread, either on the client
or on the server within an Enterprise Java Bean container [3]. The rules
generated by the algorithm are stored in the Rule Base component
connected to the algorithm. The rules can be visualized and further
analyzed by opening this component in a separate view (upper right
window in Fig. 1).
|
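The flow described above (a data source feeds an algorithm, which runs in its own thread and writes into a rule base) can be sketched as follows. This is a Python illustration only: the classes `Algorithm` and `RuleBase` are hypothetical, and the "analysis" is a trivial frequency count standing in for a real algorithm such as Optimist.

```python
import threading
from collections import Counter


class RuleBase:
    """Toy result store standing in for a Rule Base component."""

    def __init__(self):
        self.rules = []


class Algorithm:
    """Toy algorithm component wired to a data source and a rule base."""

    def __init__(self, load_rows, rule_base):
        self.load_rows = load_rows    # stands in for the Database Query link
        self.rule_base = rule_base    # stands in for the Rule Base link
        self._thread = None

    def _run(self):
        rows = self.load_rows()
        # Placeholder analysis: count how often each value occurs.
        for value, count in sorted(Counter(rows).items()):
            self.rule_base.rules.append((value, count))

    def start(self):
        """Counterpart of pressing Start: run the analysis in its own thread."""
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def wait(self):
        self._thread.join()


rule_base = RuleBase()
algorithm = Algorithm(lambda: ["a", "b", "a"], rule_base)
algorithm.start()
algorithm.wait()
print(rule_base.rules)
```

Running the analysis in a separate thread keeps the caller free while the algorithm works; the results become available in the connected rule base once the thread has finished.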
|
N-tier EJB-based Architecture |
|
| The general SPIN!
architecture is shown in Figure 3. It is an n-tier
client/server architecture based on Enterprise
Java Beans for the server-side components. A major advantage of
using Enterprise Java Beans is that tasks such as controlling and
maintaining user access rights, handling multi-user access, pooling
database connections, caching, handling persistence and managing
transactions are delegated to the EJB container. The architecture has the
following major subsystems: the client, application servers each with one
or more EJB containers, one or more database servers and optional compute
servers.
The SPIN! client is a standalone Java application. It always creates one server-side representative in the form of a session bean. The methods of the session bean are accessed through the corresponding remote reference via Java RMI or the CORBA IIOP protocol. This client session bean executes various server-side tasks on behalf of the client. In particular, workspace objects may be loaded from or saved to their persistent state. The client is based on a component connectivity framework, which is implemented in Java as a connectivity library (CoCon). The idea is that the workspace consists of components, each of which is considered a store for a set of parameters and pieces of functionality such as algorithms. The system functionality is determined by the set of available components.
The application server is an Enterprise Java Bean container. It manages the client workspace, analysis tasks, data access and persistence. Multiple containers may run simultaneously on one or more servers. Among other things, this means that different algorithms and alternative tasks can be executed on different computers under different restrictions. The SPIN! system uses an EJB container for making workspaces persistent in the database and for remote computations. For the first task, the client creates a special session bean, which is responsible on the server side for workspace persistence and access. Specifically, if the client needs to load or save a workspace, it delegates this task to this session bean. For the second task, the client creates one remote object for each analysis task that is to be run, so that data can be transferred directly from the database to the algorithm. After the analysis is finished, the result is transferred to the client for visualization.
User data are stored in the primary data storage, which is a relational database system accessed via JDBC, which is part of the J2EE standard. The database can reside on the same machine as the application server, on the client machine, or on a separate dedicated computer. Optionally, there may be one or more secondary databases. In addition, data can be loaded from other sources such as other databases, ASCII files in the file system or Excel files.
For remote computations, it is important that server data be transferred directly into the remote algorithm, bypassing the client. Only a set of components (a subgraph of the workspace) is transferred between the application server and the client. In enterprise applications the amount of data processed may be quite large, so it is very important to avoid unnecessary network traffic. In particular, if the data is going to be processed remotely on an application server, it should be transferred directly from the storage to this computing server. This is precisely what is done in the SPIN! system, where the workspace stores only a data description. This data description is transferred to the application server, which then loads the data itself directly from the specified database for processing. |
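The principle of shipping a data description instead of the data can be sketched in a few lines. The example is in Python with SQLite and is only an illustration under stated assumptions: `DataDescription`, `run_on_server`, the `measurements` table and its contents are hypothetical, and an ordinary function call stands in for the remote invocation that SPIN! performs through EJB.

```python
import os
import sqlite3
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class DataDescription:
    """What the workspace stores: where the data is, not the data itself."""
    database_path: str
    sql: str


def run_on_server(description: DataDescription) -> int:
    """Server-side step: read the data directly, return only a small result."""
    connection = sqlite3.connect(description.database_path)
    try:
        rows = connection.execute(description.sql).fetchall()
    finally:
        connection.close()
    return len(rows)   # placeholder for a real analysis result


def make_demo_database(path: str, n_rows: int) -> None:
    """Create made-up demo data; not part of the described system."""
    connection = sqlite3.connect(path)
    connection.execute("CREATE TABLE measurements (value INTEGER)")
    connection.executemany("INSERT INTO measurements VALUES (?)",
                           [(i,) for i in range(n_rows)])
    connection.commit()
    connection.close()


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo.db")
    make_demo_database(path, 1000)
    # The "client" holds and sends only this small description object.
    description = DataDescription(
        path, "SELECT value FROM measurements WHERE value >= 900")
    print(run_on_server(description))
```

Only the description and the final result cross the client boundary in this sketch; the rows themselves are read where the computation takes place.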
|
References |
|
|
[1] Klösgen, W., May, M.: Spatial Subgroup Mining Integrated in an Object-Relational Spatial Database. PKDD 2002, Helsinki, Finland, August 2002, 275-286.
[2] Lisi, F.A., Malerba, D.: SPADA: A Spatial Association Discovery System. In: A. Zanasi, C.A. Brebbia, N.F.F. Ebecken and P. Melli (eds.), Data Mining III, Series: Management Information Systems, Vol. 6, WIT Press, 2002, 157-166.
[3] May, M., Savinov, A.: An integrated platform for spatial data mining and interactive visual analysis. Data Mining 2002, Third International Conference on Data Mining Methods and Databases for Engineering, Finance and Other Fields, 25-27 September 2002, Bologna, Italy, 51-60.
[4] Savinov, A.: Mining Interesting Possibilistic Set-Valued Rules. In: Da Ruan and Etienne E. Kerre (eds.), Fuzzy If-Then Rules in Computational Intelligence: Theory and Applications, Kluwer, 2000, 107-133.
[5] Wrobel, S., Wettschereck, D., Sommer, E., Emde, W.: Extensibility in Data Mining Systems. In: Proceedings of KDD’96, 2nd International Conference on Knowledge Discovery and Data Mining, AAAI Press, 1996, 214-219. |
|