Class Comparison Methods in Data Mining
In many applications, users are less interested in a description of a single class or concept than in a description that compares or distinguishes one class (or concept) from other comparable classes (or concepts). Class discrimination or comparison (hereafter referred to as class comparison) mines descriptions that distinguish a target class from its contrasting classes. The target and contrasting classes must be comparable in the sense that they share similar dimensions and attributes. For example, the three classes person, address, and item are not comparable.
The discussion of class characterization in the previous sections covers multilevel data summarization and characterization within a single class. Some classes, however, are comparable: the sales in each of the last three years are comparable classes, and so are computer science students and physics students. The techniques developed for characterization can be extended to handle class comparison across several comparable classes.
For example, the attribute generalization process described for class characterization can be modified so that the generalization is performed synchronously among all the classes compared. This allows the attributes in all classes to be generalized to the same levels of abstraction. Suppose that we are given the AllElectronics data for sales in 2003 and sales in 2004 and would like to compare these two classes. Consider the dimension location, with abstractions at the city, province or state, and country levels. Each class of data should be generalized to the same location level: all classes are synchronously generalized to the city level, to the province or state level, or to the country level. This is more useful than comparing, say, the sales in Vancouver in 2003 with the sales in the United States in 2004 (i.e., where each set of sales data is generalized to a different level). Users should nevertheless have the option to override such an automated, synchronous comparison with their own choices when preferred.
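The idea of generalizing every class to the same level can be sketched in a few lines of Python. This is a minimal illustration only: the concept hierarchy, the function names, and the sales figures are all made up for the example and are not part of the AllElectronics data.

```python
# Hypothetical concept hierarchy for the location dimension:
# city -> (province_or_state, country).
LOCATION_HIERARCHY = {
    "Vancouver": ("British Columbia", "Canada"),
    "Toronto": ("Ontario", "Canada"),
    "Seattle": ("Washington", "USA"),
    "Chicago": ("Illinois", "USA"),
}
LEVELS = ("city", "province_or_state", "country")


def generalize_location(city, level):
    """Map a city to its ancestor at the requested abstraction level."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    if level == "city":
        return city
    province, country = LOCATION_HIERARCHY[city]
    return province if level == "province_or_state" else country


def generalize_synchronously(classes, level):
    """Generalize every class to the SAME level and sum sales per location."""
    result = {}
    for class_name, rows in classes.items():
        totals = {}
        for city, amount in rows:
            key = generalize_location(city, level)
            totals[key] = totals.get(key, 0) + amount
        result[class_name] = totals
    return result


# Made-up sales figures, for illustration only.
sales = {
    "sales_2003": [("Vancouver", 100), ("Toronto", 150), ("Seattle", 200)],
    "sales_2004": [("Vancouver", 120), ("Chicago", 180), ("Seattle", 210)],
}
by_country = generalize_synchronously(sales, "country")
print(by_country)
```

Because a single `level` argument is applied to every class, the two years can never end up at different abstraction levels; a user override would amount to calling the function with a different level.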
Class Comparison Methods and Implementation
The general procedure for class comparison is as follows:
- Data collection: The set of relevant data in the database or data warehouse is collected by query processing and partitioned into a target class and one or more contrasting classes.
- Dimension relevance analysis: If there are many dimensions and analytical comparisons are desired, then dimension relevance analysis should be performed. Only the highly relevant dimensions are included in the further analysis.
- Synchronous generalization: Generalization is performed on the target class to the level controlled by a user- or expert-specified dimension threshold, which results in a prime target class relation or cuboid. The concepts in the contrasting class or classes are generalized to the same level as those in the prime target class relation or cuboid, forming the prime contrasting class relation or cuboid.
- Presentation of the derived comparison: The resulting class comparison description can be visualized in the form of tables, charts, and rules. This presentation usually includes a “contrasting” measure (such as count%) that reflects the comparison between the target and contrasting classes. As desired, the user can adjust the comparison description by applying drill-down, roll-up, and other OLAP operations to the target and contrasting classes.
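The data collection and presentation steps above can be sketched as follows. This is a rough illustration under stated assumptions: the student tuples are invented and assumed to be already generalized, and `partition` and `count_percent` are hypothetical helper names, not functions from any data mining library.

```python
from collections import Counter


def partition(rows, class_attr, target_value):
    """Split rows into a target class and a single contrasting class."""
    target = [r for r in rows if r[class_attr] == target_value]
    contrasting = [r for r in rows if r[class_attr] != target_value]
    return target, contrasting


def count_percent(rows, dims):
    """Return count% of each generalized tuple within one class."""
    counts = Counter(tuple(r[d] for d in dims) for r in rows)
    total = sum(counts.values())
    return {key: 100.0 * n / total for key, n in counts.items()}


# Hypothetical, already-generalized student tuples.
students = [
    {"status": "graduate", "major": "Science", "gpa": "good"},
    {"status": "graduate", "major": "Science", "gpa": "good"},
    {"status": "graduate", "major": "Business", "gpa": "excellent"},
    {"status": "graduate", "major": "Science", "gpa": "excellent"},
    {"status": "undergraduate", "major": "Science", "gpa": "good"},
    {"status": "undergraduate", "major": "Business", "gpa": "good"},
]
dims = ("major", "gpa")
target, contrasting = partition(students, "status", "graduate")
target_pct = count_percent(target, dims)
contrast_pct = count_percent(contrasting, dims)

# Side-by-side table of the contrasting measure for both classes.
for key in sorted(set(target_pct) | set(contrast_pct)):
    print(key, target_pct.get(key, 0.0), contrast_pct.get(key, 0.0))
```

Each row of the printed table pairs one generalized tuple with its count% in the target class and in the contrasting class, which is the kind of contrasting measure the presentation step refers to.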
As an example, suppose the task is to compare graduate and undergraduate students using a discriminant rule, with the task specified as a DMQL query.
From this task specification, the following inputs can be identified:
- attributes = name, gender, program, birth_place, birth_date, residence, phone_no, and GPA.
- Gen(a_i) = concept hierarchies on attributes a_i.
- U_i = attribute analytical thresholds for attributes a_i.
- T_i = attribute generalization thresholds for attributes a_i.
- R = attribute relevance threshold.
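One way the attribute relevance threshold R might be applied is sketched below. The text does not say how relevance is measured; this sketch assumes information gain with respect to the class label, which is one common choice, and the rows, attribute values, and function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attr, class_attr):
    """Reduction in class entropy obtained by splitting rows on attr."""
    labels = [r[class_attr] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[class_attr])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def relevant_attributes(rows, attrs, class_attr, threshold):
    """Keep attributes whose information gain is at least the threshold R."""
    return [a for a in attrs
            if information_gain(rows, a, class_attr) >= threshold]


# Made-up rows: program separates the classes perfectly, gender not at all.
rows = [
    {"status": "graduate", "program": "MSc", "gender": "F"},
    {"status": "graduate", "program": "MSc", "gender": "M"},
    {"status": "undergraduate", "program": "BSc", "gender": "F"},
    {"status": "undergraduate", "program": "BSc", "gender": "M"},
]
kept = relevant_attributes(rows, ["program", "gender"], "status", threshold=0.5)
print(kept)
```

In this toy data only `program` survives the filter, so only that dimension would be carried into the generalization step. Any other relevance measure could be substituted for `information_gain` without changing the thresholding logic.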
Presentation of Class Comparison Descriptions
As with class characterizations, class comparisons can be presented to the user in various forms, including generalized relations, crosstabs, bar charts, pie charts, curves, and rules. Except for logic rules, these forms are used in the same way for characterization as for comparison. This section discusses the visualization of class comparisons in the form of discriminant rules.
As with characterization descriptions, the discriminative features of the target and contrasting classes of a comparison can be described quantitatively by a quantitative discriminant rule, which associates a statistical interestingness measure, the d-weight, with each generalized tuple in the description.
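A short sketch may help make the d-weight concrete. It assumes the usual definition: for a generalized tuple, the d-weight for a class is the number of tuples from that class covered by the generalized tuple, divided by the number of tuples from all compared classes that it covers. The function name and the counts below are hypothetical.

```python
def d_weights(counts_by_class):
    """Compute the d-weight of one generalized tuple for every class.

    counts_by_class maps a class name to the number of tuples in that
    class that are covered by the generalized tuple.
    """
    total = sum(counts_by_class.values())
    if total == 0:
        raise ValueError("the generalized tuple covers no tuples")
    return {name: n / total for name, n in counts_by_class.items()}


# Hypothetical counts for a single generalized tuple,
# e.g. (major="Science", gpa="good").
weights = d_weights({"graduate": 90, "undergraduate": 210})
for name, w in weights.items():
    print(f"{name}: d-weight = {w:.2%}")
```

With these counts the generalized tuple has a d-weight of 30% for graduate students and 70% for undergraduate students. A high d-weight in the target class suggests the tuple mostly describes target-class data, while a low d-weight suggests it is mainly drawn from the contrasting classes.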