Sample Results with GritBot™
- Medical Data
- Telecommunications Churn
- Crystallography Data
- Marine Biology Data
- Genetics Data
- Agricultural Data
- Now Read On ...
This page illustrates GritBot's ability to find possible anomalies in data. Like all good [ro]bots, GritBot does its job without direction -- the user does not have to specify the nature of anomalies or how to find them. This is illustrated with data from several application areas. Times are for an Intel Core i7 920 (2.67GHz) PC running Linux.
A possible anomaly exists if a case's value for one attribute is surprising when compared with corresponding values from a subset of cases. Such an anomaly is reported in the following pattern:
    case identification: [significance]
        anomalous value (N cases, reason)
            condition 1
            condition 2
            . . .
            condition K
Here
- The case is identified by its index in the application's data or test file. If a label attribute is defined, its value is also shown here.
- The significance value estimates how likely it is that the anomalous value could occur by chance rather than error. Lower significance values imply greater certainty that a real anomaly has been found.
- The second line identifies the anomalous value and indicates why it is out of line with respect to the N cases in the subset. This does not necessarily mean that this value itself is incorrect -- the case's value for one of the attributes defining the subset may be faulty.
- The subset itself is defined as the N cases satisfying all of the K conditions. Each condition refers to a single attribute and restricts the value of a numeric attribute or specifies one or more possible values for a discrete attribute. If there are no conditions, the value is anomalous with respect to the entire dataset.
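The report pattern above is regular enough to be read by a program. The following is a minimal sketch, assuming each report entry is laid out on separate lines exactly as in the pattern (header line, anomalous-value line, then one line per condition). The names `Anomaly` and `parse_anomaly` are hypothetical helpers written for this illustration and are not part of GritBot.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Header line, e.g. "test case 373: (label 769) [0.006]" (the label is optional).
HEADER = re.compile(
    r"^(?P<case>.*?case \d+):\s*(?:\(label (?P<label>.*)\)\s*)?\[(?P<sig>[\d.]+)\]$"
)
# Second line, e.g. "T3 = 7.6 (602 cases, mean 2.08, 99.8% <= 4)".
VALUE = re.compile(
    r"^(?P<attr>.+?) = (?P<value>.+?) \((?P<n>\d+) cases, (?P<reason>.*)\)$"
)


@dataclass
class Anomaly:
    case: str                 # case identification, e.g. "test case 373"
    label: Optional[str]      # value of the label attribute, if one is defined
    significance: float       # lower means greater certainty
    attribute: str            # attribute holding the anomalous value
    value: str                # the anomalous value itself
    n_cases: int              # N, the size of the subset
    reason: str               # why the value is out of line
    conditions: List[str] = field(default_factory=list)  # the K conditions


def parse_anomaly(block: str) -> Anomaly:
    """Parse one report entry given as a multi-line string."""
    lines = [ln.strip() for ln in block.strip().splitlines() if ln.strip()]
    if len(lines) < 2:
        raise ValueError("an entry needs at least a header and a value line")
    head = HEADER.match(lines[0])
    val = VALUE.match(lines[1])
    if head is None or val is None:
        raise ValueError("entry does not follow the assumed layout")
    return Anomaly(
        case=head["case"],
        label=head["label"],
        significance=float(head["sig"]),
        attribute=val["attr"],
        value=val["value"],
        n_cases=int(val["n"]),
        reason=val["reason"],
        conditions=lines[2:],  # empty: anomalous w.r.t. the entire dataset
    )


entry = """
test case 373: (label 769) [0.006]
    T3 = 7.6 (602 cases, mean 2.08, 99.8% <= 4)
        TT4 > 83 and <= 155 [120]
        T4U > 0.99 and <= 1.12 [1.04]
"""
anomaly = parse_anomaly(entry)
print(anomaly.attribute, anomaly.value, anomaly.n_cases, len(anomaly.conditions))
```

Conditions are kept as raw strings here because they take several forms (numeric ranges, single values, sets of values); a fuller parser would need to handle each form separately.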
Medical Data
The data for this application come from a thyroid assay screening service and concern one aspect (hypothyroidism) of thyroid disease diagnosis. The attributes are a mixture of measured values and information obtained from the referring physician. Here are a few examples:

    Attribute                    Assay 1    Assay 2            Assay 3                .....
    age                          32         63                 19
    sex                          F          M                  M
    on thyroxine                 t          f                  f
    query on thyroxine           f          f                  f
    on antithyroid medication    f          f                  f
    sick                         f          f                  f
    pregnant                     t          N/A                N/A
    thyroid surgery              f          f                  f
    I131 treatment               f          f                  f
    query hypothyroid            f          f                  t
    query hyperthyroid           t          f                  f
    lithium                      f          f                  f
    tumor                        f          f                  f
    goitre                       f          f                  f
    hypopituitary                f          f                  f
    psych                        f          f                  f
    TSH                          0.025      108                9
    T3                           3.7        .4                 2.2
    TT4                          139        14                 117
    T4U                          1.34       .98                -
    FTI                          104        14                 -
    referral source              other      SVI                other
    diagnosis                    negative   primary hypothyr   compensated hypothyr
This application's data and test files contain 3772 cases in total. GritBot takes 0.4 seconds to identify five possible anomalies:
    data case 1365: (label 861) [0.002]
        age = 455 (3771 cases, mean 52, 99.97% <= 94)

    test case 373: (label 769) [0.006]
        T3 = 7.6 (602 cases, mean 2.08, 99.8% <= 4)
            TT4 > 83 and <= 155 [120]
            T4U > 0.99 and <= 1.12 [1.04]

    data case 2215: (label 2676) [0.008]
        TSH = 8.5 (35 cases, mean 1.061, 34 <= 2.9)
            FTI > 120.75 and <= 121.8 [121]
            diagnosis in {secondary, negative} [negative]

    data case 2224: (label 1562) [0.014]
        age = 75 (53 cases, mean 32, 51 <= 42)
            pregnant = t

    data case 1610: (label 3023) [0.016]
        age = 73 (53 cases, mean 32, 51 <= 42)
            pregnant = t
The first possible anomaly concerns case number 1365 in this application's data file. There are no third or subsequent lines, so all 3771 cases with known values of the attribute age are relevant. This case has a patient age of 455 (!), whereas 99.97% of the cases -- all cases except this one -- have age values not exceeding 94.
The last two possible anomalies concern the 53 patients noted as being pregnant. Two of them are aged in their seventies whereas the average age of pregnant women is 32 and all the others are no older than 42. Note that the value of either "age" or "pregnant" could be faulty for each case, and there is no way to decide which is the culprit.
The other two possible anomalies concern thyroid assay values. Expert endocrinological knowledge would be needed to judge whether these values are truly anomalous; the other three clearly are!
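The pregnant-patient anomalies show why the subset matters: an age in the seventies is unremarkable across all patients but surprising among pregnant ones. The sketch below illustrates that idea only. It is not GritBot's algorithm, which this page does not describe; it swaps in a simple median/MAD rule, and the ages are synthetic values invented for the illustration. The function name `flag_high` and the threshold `k` are likewise assumptions.

```python
from statistics import median


def flag_high(cases, attr, k=5.0):
    """Return cases whose `attr` lies more than k MADs above the median.

    MAD is the median absolute deviation, a spread estimate that is not
    inflated by the very outliers being looked for.
    """
    values = [c[attr] for c in cases]
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to compare against
    return [c for c in cases if (c[attr] - med) / mad > k]


# Synthetic ages (NOT the thyroid data): one pregnant case aged 75.
pregnant_ages = [24, 27, 29, 30, 31, 32, 33, 35, 38, 42, 75]
other_ages = [19, 45, 52, 58, 63, 66, 70, 72, 78, 81, 84, 90]
cases = ([{"age": a, "pregnant": "t"} for a in pregnant_ages]
         + [{"age": a, "pregnant": "f"} for a in other_ages])

# The subset is defined by a single condition, pregnant = t.
subset = [c for c in cases if c["pregnant"] == "t"]

print(flag_high(cases, "age"))   # whole dataset: nothing stands out
print(flag_high(subset, "age"))  # subset: the 75-year-old is flagged
```

As in the report above, a flagged case says only that the combination of values is surprising; the check cannot tell whether "age" or "pregnant" is the faulty value.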
Telecommunications Churn
The MLC++ site at SGI contains several interesting datasets including simulated telecommunications churn data. ("Churn" here has nothing to do with making butter -- it's about customers changing providers.)
The training and test files contain a total of 5000 cases, each described by 21 attributes. GritBot analyzes them in 0.9 seconds and finds two possible anomalies:
    test case 1570: [0.001]
        voice mail plan = yes (3678 cases, 99.97% `no')
            number vmail messages <= 0 [0]

    data case 15: [0.016]
        class = 0 (75 cases, 99% `1')
            total day minutes <= 135 [120.7]
            number customer service calls > 3 [4]
The first highlights someone paying for a voice mail plan who has received no voice mail messages. The second describes a non-churning customer who is a light user but has numerous service calls.
Crystallography Data
The data for this example were provided by Dr John Rodgers of National Research Council Canada. The data and test files contain a total of 34,641 cases, each describing 122 properties of a substance such as the number of atoms of each element that it contains, the number of atoms belonging to each periodic table family, density, crystal structure group, and whether it is magnetic. GritBot requires 5.1 seconds to identify just one potential anomaly:

    test case 4190: (label AL2562/Al8 Dy Fe4/MN12 Th/tI26) [0.006]
        Magnetic = neg (352 cases, 99.4% `pos')
            Fe > 3 [4]
            Group = tI26
GritBot has found a subset of 352 cases, most of them magnetic, among which this non-magnetic case stands out. Since only 7% of the cases in the entire dataset are noted as being magnetic, this potential anomaly is indeed interesting.
Marine Biology Data
This database of measurements on the abalone comes from the Marine Resources Division of the Tasmanian Department of Primary Industry and Fisheries. There are 4177 cases divided between data and test files. Nine attributes describe each case's sex (abalones have three!), physical dimensions, whole weight and weights of some parts, and its age (measured by rings).
GritBot requires 0.2 seconds to find 23 possible anomalies. Most of these are clear errors since the highlighted cases violate common-sense constraints, e.g. by the weight of a part being greater than the whole weight, or the maximum dimension (length) being less than other dimensions (e.g. diameter). Note that we did not tell GritBot about these constraints -- the anomalies were apparent in the data themselves. Here are a few examples:
    data case 2052: [0.000]
        Height = 1.13 (36 cases, mean 0.129, 35 <= 0.16)
            Whole weight > 0.5938 and <= 0.6018 [0.594]

    data case 2628: [0.000]
        Shucked weight = 0.495 (37 cases, mean 0.0476, 35 <= 0.059)
            Whole weight > 0.105 and <= 0.1198 [0.1055]

    data case 1211: [0.003]
        Length = 0.185 (189 cases, mean 0.479, 99.5% >= 0.435)
            Diameter > 0.363 and <= 0.377 [0.375]

    data case 648: [0.016]
        Whole weight = 0.777 (58 cases, mean 0.5108, 57 <= 0.578)
            Shucked weight <= 0.2238 [0.216]
            Viscera weight > 0.113 [0.13]
            Shell weight > 0.1258 and <= 0.191 [0.17]
            Rings <= 10 [9]
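For contrast, here is what checking the common-sense constraints by hand might look like. GritBot was not given these rules; this sketch only makes explicit the two constraints mentioned above (no part heavier than the whole, no dimension larger than the length). The function `constraint_violations` is a hypothetical helper, and the two sample records carry only the attribute values shown in the report for cases 2628 and 1211.

```python
# Attribute names follow the report above.
PART_WEIGHTS = ("Shucked weight", "Viscera weight", "Shell weight")
OTHER_DIMENSIONS = ("Diameter", "Height")


def constraint_violations(case):
    """List the hand-written constraints that a case violates.

    Attributes missing from `case` are skipped rather than guessed.
    """
    problems = []
    whole = case.get("Whole weight")
    length = case.get("Length")
    for part in PART_WEIGHTS:
        weight = case.get(part)
        if whole is not None and weight is not None and weight > whole:
            problems.append(f"{part} exceeds Whole weight")
    for dim in OTHER_DIMENSIONS:
        size = case.get(dim)
        if length is not None and size is not None and size > length:
            problems.append(f"{dim} exceeds Length")
    return problems


case_2628 = {"Whole weight": 0.1055, "Shucked weight": 0.495}
case_1211 = {"Length": 0.185, "Diameter": 0.375}

print(constraint_violations(case_2628))  # ['Shucked weight exceeds Whole weight']
print(constraint_violations(case_1211))  # ['Diameter exceeds Length']
```

Rules like these catch only the errors someone thought to write down, which is the point of the comparison: the report above surfaced such cases from the data alone.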
Genetics Data
This application's data, assembled by Towell, Noordewier, and Shavlik, concern splicing sites in genes. There are 3190 cases, each described by 61 attributes representing a "window" of 60 residues (nucleotides, normally A, G, T, or C) and information on whether the center of the window is a splice junction (intron-exon or exon-intron) or not.
GritBot finds two possible anomalies in these cases (in 0.6 seconds):
    case 550: [0.009]
        A30 = C (657 cases, 99.8% `G')
            A34 = G
            class = EI

    case 839: [0.010]
        A28 = T (606 cases, 99.8% `A')
            A27 = C
            class = IE
There are 657 exon-intron junction cases that have G in position A34, and all of them except case 550 also have G in position A30. Similarly, among the 606 cases that are intron-exon junctions and for which the residue in position A27 is C, all except case 839 have A in position A28.
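For discrete attributes, the reasoning just described amounts to finding a subset in which one value is nearly universal and then listing the exceptions. A toy sketch of that check follows. It is not GritBot's method (GritBot discovers the conditions itself, whereas here they are supplied), the function name `exceptions_in_subset` and the `min_share` threshold are assumptions, and the cases are synthetic.

```python
from collections import Counter


def exceptions_in_subset(cases, target, conditions, min_share=0.99):
    """Return cases in the subset whose `target` value differs from the majority.

    The subset contains the cases that satisfy every attribute = value pair
    in `conditions`. Nothing is returned unless the majority value covers at
    least `min_share` of the subset, or if the subset is unanimous.
    """
    subset = [c for c in cases
              if all(c[attr] == val for attr, val in conditions.items())]
    counts = Counter(c[target] for c in subset)
    if not counts:
        return []
    majority, n_major = counts.most_common(1)[0]
    if n_major == len(subset) or n_major / len(subset) < min_share:
        return []
    return [c for c in subset if c[target] != majority]


# Synthetic cases (NOT the splice-junction data).
cases = [{"id": i, "A30": "G", "A34": "G", "class": "EI"} for i in range(200)]
cases.append({"id": 200, "A30": "C", "A34": "G", "class": "EI"})
cases += [{"id": 201 + i, "A30": "C", "A34": "T", "class": "IE"} for i in range(50)]

odd = exceptions_in_subset(cases, "A30", {"A34": "G", "class": "EI"})
print([c["id"] for c in odd])  # [200]
```

Without the conditions, the value C for A30 is common in this synthetic data (51 of 251 cases), so the exception only becomes visible inside the subset.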
Agricultural Data
The last application is one of the classic datasets in Machine Learning. Assembled by Ryszard Michalski, it contains data on 683 diseased soybean plants. There are 35 discrete-valued attributes providing information about planting, weather, disease symptoms, and diagnosis.
In 0.1 seconds, GritBot identifies three possible anomalies in these well-studied data:
    case 78: [0.009]
        leafspot-size = gt-1/8 (221 cases, 99.5% `dna')
            leafspots-halo = absent

    case 614: [0.012]
        stem = norm (191 cases, 99.5% `abnorm')
            stem-cankers = above-sec-nde

    case 558: [0.015]
        seed = abnorm (405 cases, 99.8% `norm')
            fruiting-bodies = absent
            mold-growth = absent
            seed-discolor = absent
The first case has large leafspots but their halo is shown as absent; the second notes the presence of stem cankers, but the stem is stated to be normal.
Now Read On ...
All the examples above were run using GritBot's default option settings. GritBot provides mechanisms
- to select the attributes that will be checked for anomalous values;
- to restrict the number of conditions that can be used to describe subsets of cases; and
- to vary the anomaly reporting level from more complete (with perhaps many false positives) to succinct (perhaps missing some possible anomalies).
After a dataset has been analyzed by GritBot, the regularities that it discovered can be saved and used to inspect new data. Furthermore, the types of potential anomalies identified in new data can be quite different from those found in the original data!
If you would like to learn more about how to use GritBot, please see the tutorial.
© RULEQUEST RESEARCH 2015 | Last updated September 2015 |