Welcome to C5.0, a system that extracts informative patterns from data. The following sections show how to prepare data files for C5.0 and illustrate the options for using the system.
In this tutorial, file names and C5.0 input appear in blue fixed-width font, while file extensions and other general forms are shown highlighted in green.
We will illustrate C5.0 using a medical application -- mining a database of thyroid assays from the Garvan Institute of Medical Research, Sydney, to construct diagnostic rules for hypothyroidism. Each case concerns a single referral and contains information on the source of the referral, assays requested, patient data, and referring physician's comments. Here are three examples:
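| Attribute | Case 1 | Case 2 | Case 3 | ..... |
| age | 41 | 23 | 46 | |
| sex | F | F | M | |
| on thyroxine | f | f | f | |
| query on thyroxine | f | f | f | |
| on antithyroid medication | f | f | f | |
| sick | f | f | f | |
| pregnant | f | f | not applicable | |
| thyroid surgery | f | f | f | |
| I131 treatment | f | f | f | |
| query hypothyroid | f | f | f | |
| query hyperthyroid | f | f | f | |
| lithium | f | f | f | |
| tumor | f | f | f | |
| goitre | f | f | f | |
| hypopituitary | f | f | f | |
| psych | f | f | f | |
| TSH | 1.3 | 4.1 | 0.98 | |
| T3 | 2.5 | 2 | unknown | |
| TT4 | 125 | 102 | 109 | |
| T4U | 1.14 | unknown | 0.91 | |
| FTI | 109 | unknown | unknown | |
| referral source | SVHC | other | other | |
| diagnosis | negative | negative | negative | |
| ID | 3733 | 1442 | 2965 | |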
This is exactly the sort of task for which C5.0 was designed. Each case belongs to one of a small number of mutually exclusive classes (negative, primary, secondary, compensated). Properties of every case that may be relevant to its class are provided, although some cases may have unknown or non-applicable values for some attributes. There are 24 attributes in this example, but C5.0 can deal with any number of attributes.
C5.0's job is to find how to predict a case's class from the values of the other attributes. C5.0 does this by constructing a classifier that makes this prediction. As we will see, C5.0 can construct classifiers expressed as decision trees or as sets of rules.
We use the filestem hypothyroid for this illustration.
All files read or written by C5.0 for an application have names of the form filestem.extension, where filestem identifies the application and extension describes the contents of the file.
Here is a summary table of the extensions used by C5.0 (to be described in later sections):
| names | description of the application's attributes | [required] |
| data | cases used to generate a classifier | [required] |
| test | unseen cases used to test a classifier | [optional] |
| cases | cases to be classified subsequently | [optional] |
| costs | differential misclassification costs | [optional] |
| tree | decision tree classifier produced by C5.0 | [output] |
| rules | ruleset classifier produced by C5.0 | [output] |
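The first required file is the names file (e.g. hypothyroid.names) that describes the attributes and classes.
There are two important subgroups of attributes: explicitly-defined attributes, whose values are given directly in the data, and implicitly-defined attributes, whose values are specified by a formula.
The file hypothyroid.names looks like this: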
diagnosis. | the target attribute
age: continuous.
sex: M, F.
on thyroxine: f, t.
query on thyroxine: f, t.
on antithyroid medication: f, t.
sick: f, t.
pregnant: f, t.
thyroid surgery: f, t.
I131 treatment: f, t.
query hypothyroid: f, t.
query hyperthyroid: f, t.
lithium: f, t.
tumor: f, t.
goitre: f, t.
hypopituitary: f, t.
psych: f, t.
TSH: continuous.
T3: continuous.
TT4: continuous.
T4U: continuous.
FTI:= TT4 / T4U.
referral source: WEST, STMW, SVHC, SVI, SVHD, other.
diagnosis: primary, compensated, secondary, negative.
ID: label.
Special characters (including the comma, period, and vertical bar `|') can appear in names and values, but must be prefixed by the escape character `\'.
For example, the name "Filch, Grabbit, and Co." would be written as `Filch\, Grabbit\, and Co\.'.
Unless it is escaped in this way, the vertical bar `|' causes the remainder of the line to be ignored and is handy for including comments.
This use of `|' should not occur inside a value.
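The escaping rule above can be sketched as a small helper. This is only an illustration: the function name `escape_name` is hypothetical, and the exact set of characters that C5.0 treats as special is an assumption here (comma, period, colon, and vertical bar).

```python
# Assumed set of special characters; the tutorial's example shows the comma
# and period being escaped, and the vertical bar is named explicitly.
SPECIAL_CHARS = ",.:|"


def escape_name(name: str) -> str:
    """Prefix each special character with a backslash for use in a names file."""
    escaped = []
    for ch in name:
        if ch in SPECIAL_CHARS:
            escaped.append("\\")
        escaped.append(ch)
    return "".join(escaped)


print(escape_name("Filch, Grabbit, and Co."))
```

The printed string is the escaped form shown in the example above.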
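The first entry in the names file specifies the classes. It can be a list of class names, such as
primary, compensated, secondary, negative.
or the name of a discrete target attribute that contains the class value, such as
diagnosis.
or the name of a continuous attribute followed by one or more thresholds. A hypothetical entry
age: 12, 19.
would define three classes: age <= 12, 12 < age <= 19, and age > 19.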
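As a rough illustration of how thresholds on a continuous attribute carve out classes, the following sketch maps a value to one of the ranges defined by an entry such as `age: 12, 19.` The helper names are hypothetical and this is not C5.0's own code.

```python
from bisect import bisect_left


def class_labels(name, thresholds):
    """Describe the ranges defined by increasing thresholds on one attribute."""
    labels = [f"{name} <= {thresholds[0]}"]
    for low, high in zip(thresholds, thresholds[1:]):
        labels.append(f"{low} < {name} <= {high}")
    labels.append(f"{name} > {thresholds[-1]}")
    return labels


def class_index(value, thresholds):
    """Index of the range containing value (a value equal to a threshold falls in the lower range)."""
    return bisect_left(thresholds, value)


labels = class_labels("age", [12, 19])
for age in (12, 13, 19, 20):
    print(age, "->", labels[class_index(age, [12, 19])])
```

With t thresholds there are t+1 ranges, so two thresholds give the three classes listed above.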
This first entry defining the classes is followed by definitions of the attributes in the order that they will be given for each case.
The name of each explicitly-defined attribute is followed by a colon `:' and a description of the values taken by the attribute.
The attribute name is arbitrary, except that each attribute must have a distinct name, and case weight is reserved for setting weights for individual cases.
There are eight possibilities for the description of attribute values:
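continuous
The attribute takes numeric values.
date
The attribute's values are dates, written in a form such as 2005/09/30 or 2005-09-30.
time
The attribute's values are times of day between 00:00:00 and 23:59:59.
timestamp
The attribute's values combine a date and a time, e.g. 2005-09-30 15:04:00. (Note that there is a space separating the date and time.)
a comma-separated list of names
The attribute takes discrete values, and these are the allowable values. The list may be prefaced by [ordered] to indicate that the values are given in a meaningful ordering; otherwise they will be taken as unordered. For instance, the values low, medium, high are ordered, while meat, poultry, fish, vegetables are not. The former might be declared as
grade: [ordered] low, medium, high.
If the attribute values have a natural order, it is better to declare them as such so that C5.0 can exploit the ordering. (NB: The target attribute should not be declared as ordered.)
discrete N for some integer N
The attribute has discrete values that are assembled from the data rather than listed in advance; N is the maximum number of such values.
ignore
The values of the attribute are ignored.
label
The attribute contains an identifying label for each case, such as ID in the example above.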
The name of each implicitly-defined attribute is followed by `:=' and then a formula defining the attribute value.
The formula is written in the usual way, using parentheses where needed, and may refer to any attribute defined up to this point.
Constants in the formula can be numbers written in decimal notation, dates, times, and discrete attribute values enclosed in string quotes `"'.
The operators and functions that can be used in the formula are
+, -, *,
/, % (mod),
^ (meaning `raised to the power')
>, >=, <,
<=, =, <> or
!= (both meaning `not equal')
and, or
sin(...),
cos(...),
tan(...),
log(...),
exp(...),
int(...) (meaning `integer part of')
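The value of an implicitly-defined attribute is either continuous or true/false, depending on its formula. The attribute defined above as
FTI:= TT4 / T4U.
is continuous since its value is obtained by dividing one number by another. The value of a hypothetical attribute such as
strange := referral source = "WEST" or age > 40.
would be either t or f since the value given by the formula is either true or false.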
If the value of the formula cannot be determined for a particular case because one or more of the attributes appearing in the formula have unknown or non-applicable values, the value of the implicitly-defined attribute is unknown.
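The behaviour described above, where an unknown input makes the derived value unknown, can be mimicked in a few lines. This is a sketch only: unknown and non-applicable values are both represented by `None` here, which is a simplification, and the function names are hypothetical.

```python
def fti(tt4, t4u):
    """Mirror of FTI := TT4 / T4U, returning None when an input is unknown."""
    if tt4 is None or t4u is None:
        return None
    return tt4 / t4u


def strange(referral_source, age):
    """Mirror of strange := referral source = "WEST" or age > 40 (values t or f)."""
    if referral_source is None or age is None:
        return None
    return "t" if (referral_source == "WEST" or age > 40) else "f"


print(fti(125, 1.14))       # TT4 and T4U of the first example case
print(fti(102, None))       # T4U unknown, so the result is unknown
print(strange("SVHC", 41))
```

Following the rule stated above, the sketch reports an unknown result whenever any attribute in the formula is unknown, even in cases where one operand alone would settle an `or`.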
Dates are stored by C5.0 as day numbers, so some operations on dates make sense. Thus, if we have attributes
d1: date.
d2: date.
we could define
interval := d2 - d1.
gap := d1 <= d2 - 7.
d1-day-of-week := (d1 + 1) % 7 + 1.
interval then represents the number of days from d1 to d2 (non-inclusive) and gap would have a true/false value signaling whether d1 is at least a week before d2.
The last definition is a slightly non-obvious way of determining the day of the week on which d1 falls, with values ranging from 1 (Monday) to 7 (Sunday).
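The date arithmetic above can be imitated with Python's `datetime` module. This is a sketch under an assumption: Python's proleptic ordinal is used as the day number, which need not match the starting point C5.0 uses internally, so the `(d1 + 1) % 7 + 1` formula is not reproduced; `isoweekday()` is used instead because it yields the same 1 (Monday) to 7 (Sunday) range.

```python
from datetime import date


def interval(d1: date, d2: date) -> int:
    """Number of days from d1 to d2, like interval := d2 - d1."""
    return d2.toordinal() - d1.toordinal()


def gap(d1: date, d2: date) -> bool:
    """True when d1 is at least a week before d2, like gap := d1 <= d2 - 7."""
    return d1.toordinal() <= d2.toordinal() - 7


def day_of_week(d: date) -> int:
    """1 (Monday) to 7 (Sunday)."""
    return d.isoweekday()


d1, d2 = date(2005, 9, 23), date(2005, 9, 30)
print(interval(d1, d2), gap(d1, d2), day_of_week(d2))
```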
Similarly, times are stored as the number of seconds since midnight.
If the names file includes
start: time.
finish: time.
elapsed := finish - start.
the value of elapsed is the number of seconds from start to finish.
Timestamps are a little more complex. A timestamp is rounded to
the nearest minute, but limitations on the precision of floating-point
numbers mean that the values stored for timestamps from more than
thirty years ago are approximate.
If the names file includes
departure: timestamp.
arrival: timestamp.
flight time := arrival - departure.
the value of flight time is the number of minutes from departure to arrival.
An optional entry in the names file controls which attributes may be used in classifiers. This entry takes one of the forms
attributes included:
attributes excluded:
followed by a comma-separated list of attribute names. The first form restricts the attributes used in classifiers to those specifically named; the second form specifies that classifiers must not use any of the named attributes.
Excluding an attribute from classifiers is not the same as ignoring the
attribute (see `ignore' above).
As an example, suppose that numeric attributes A
and B
are defined in the data, but background knowledge suggests that
only their difference is important.
The names file might then contain the following entries:
. . .
A: continuous.
B: continuous.
Diff := A - B.
. . .
attributes excluded: A, B.
In this example the attributes A and B could not be defined as ignore because the definition of Diff would then be invalid.
The second required file, the data file (e.g. hypothyroid.data), provides information on the training cases from which C5.0 will extract patterns.
The entry for each case consists of one or more lines that give
the values for all explicitly-defined attributes. If the classes are listed
in the first line of the names file,
the attribute values are followed by the case's class value.
Values are separated by commas and the entry is optionally terminated by
a period.
Once again, anything on a line after a vertical bar `|'
is ignored.
(If the information for a case occupies more than one line, make sure
that the line breaks occur after commas.)
For example, the first three cases from file hypothyroid.data are:
41,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,1.3,2.5,125,1.14,SVHC,negative,3733
23,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,4.1,2,102,?,other,negative,1442
46,M,f,f,f,f,N/A,f,f,f,f,f,f,f,f,f,0.98,?,109,0.91,other,negative,2965
Don't forget the commas between values! If you leave them out, C5.0 will not be able to process your data.
Notice that
`?' is used to denote a value that is missing or unknown.
Similarly, `N/A' denotes a value that is not applicable for
a particular case.
Also note
that the cases do not contain values for the attribute FTI
since its values are computed from other attribute values.
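To make the data-file conventions concrete, here is a minimal reader for a single one-line entry. It is a simplified sketch, not C5.0's parser: it assumes no escaped characters, does not join entries that span several lines, and the names `parse_case` and `NOT_APPLICABLE` are hypothetical.

```python
NOT_APPLICABLE = "N/A"


def parse_case(line: str) -> list:
    """Split one data-file line into values; '?' becomes None, 'N/A' is kept as a marker."""
    line = line.split("|", 1)[0].strip()  # anything after a vertical bar is a comment
    if line.endswith("."):                # the entry may optionally end with a period
        line = line[:-1]
    values = []
    for field in line.split(","):
        field = field.strip()
        if field == "?":
            values.append(None)           # missing or unknown value
        else:
            values.append(field)          # includes the literal marker "N/A"
    return values


case = parse_case("46,M,f,f,f,f,N/A,f,f,f,f,f,f,f,f,f,0.98,?,109,0.91,other,negative,2965")
print(len(case), case[6], case[17], case[-1])
```

Each line yields 23 values, one per explicitly-defined attribute; FTI is absent because it is computed from TT4 and T4U.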
The third kind of file used
by C5.0 consists of new test
cases (e.g. hypothyroid.test) on which the classifier
can be evaluated.
This file is optional and, if used, has
exactly the same format as the data file.
Another optional file, the cases file
(e.g. hypothyroid.cases),
differs from a test file only in allowing the cases'
classes to be unknown (`?').
The cases file is used primarily with
the public source code
described later on.
The costs file (e.g. hypothyroid.costs) is also optional and sets out differential misclassification costs.
In some applications
there is a much higher penalty for certain types of mistakes.
In this application, a prediction that hypothyroidism is not present
could be very costly if in fact it is.
On the other hand, predicting incorrectly that a patient is
hypothyroid
may be a less serious error.
C5.0 allows different misclassification
costs to be associated with each combination of real class and
predicted class. We will return to this topic near the end of the
tutorial.
The general form of the Unix command is
c5.0 -f filestem [options]
This invokes C5.0 with the -f
option that identifies the application name
(here hypothyroid).
If no filestem is specified using this option, C5.0 uses a default
filestem that is probably incorrect.
(Moral: always use the -f option!)
There are several options that affect the type of classifier that
C5.0 produces and the way that it is constructed.
Many of the options have default values that should be satisfactory
for most applications.
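When runs are scripted, it can help to assemble the command line in one place. The sketch below only builds an argument list from options named in this tutorial (-f, -r, -s, -t, -w, -p, -u); the function and parameter names are hypothetical, and nothing is executed.

```python
def build_command(filestem, rules=False, subsets=False, trials=None,
                  winnow=False, soft_thresholds=False, utility_bands=None):
    """Return an argument list for a c5.0 run; always passes -f explicitly."""
    cmd = ["c5.0", "-f", filestem]
    if rules:
        cmd.append("-r")                       # rulesets instead of decision trees
    if subsets:
        cmd.append("-s")                       # group discrete values into subsets
    if trials is not None:
        cmd += ["-t", str(trials)]             # boosting with up to this many trials
    if winnow:
        cmd.append("-w")                       # winnow attributes before building
    if soft_thresholds:
        cmd.append("-p")                       # soften thresholds
    if utility_bands is not None:
        cmd += ["-u", str(utility_bands)]      # order rules by utility, in bands
    return cmd


print(" ".join(build_command("hypothyroid", rules=True, trials=10)))
```

The resulting list could be passed to `subprocess.run` on a machine where the `c5.0` executable is installed and on the search path.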
When C5.0 is invoked with the command
c5.0 -f hypothyroid
it constructs a decision tree and generates output like this:
C5.0 [Release 2.05] Fri Oct 26 09:32:14 2007
-------------------
Options:
Application `hypothyroid'
Class specified by attribute `diagnosis'
Read 2772 cases (24 attributes) from hypothyroid.data
Decision tree:
TSH <= 6: negative (2472/2)
TSH > 6:
:...FTI <= 65:
    :...thyroid surgery = t:
    :   :...FTI <= 36.1: negative (2.1)
    :   :   FTI > 36.1: primary (2.1/0.1)
    :   thyroid surgery = f:
    :   :...TT4 <= 61: primary (51/3.7)
    :       TT4 > 61:
    :       :...referral source in {WEST,SVHD}: primary (0)
    :           referral source in {STMW,SVHC,SVI}: primary (4.9/0.8)
    :           referral source = other:
    :           :...TSH <= 22: negative (6.4/2.7)
    :               TSH > 22: primary (5.8/0.8)
    FTI > 65:
    :...on thyroxine = t: negative (37.7)
        on thyroxine = f:
        :...thyroid surgery = t: negative (6.8)
            thyroid surgery = f:
            :...TT4 > 153: negative (6/0.1)
                TT4 <= 153:
                :...TT4 <= 37: primary (2.5/0.2)
                    TT4 > 37: compensated (174.6/24.8)
Evaluation on training data (2772 cases):
Decision Tree
----------------
Size Errors
12 7( 0.3%) <<
   (a)   (b)   (c)   (d)    <-classified as
  ----  ----  ----  ----
    60     3                (a): class primary
         153           1    (b): class compensated
                       2    (c): class secondary
           1        2552    (d): class negative
Attribute usage:
90% TSH
18% thyroid surgery
17% on thyroxine
14% TT4
13% T4U
13% FTI
7% referral source
Evaluation on test data (1000 cases):
Decision Tree
----------------
Size Errors
12 4( 0.4%) <<
(a) (b) (c) (d) <-classified as
---- ---- ---- ----
31 1 (a): class primary
1 39 (b): class compensated
(c): class secondary
2 926 (d): class negative
Time: 0.1 secs
(Since hardware platforms can differ in floating point precision
and rounding, the output that you see might not be exactly the same
as the above.)
The first part identifies the version of C5.0, the run date,
and the options with which the system was invoked.
C5.0 constructs a decision tree from the 2772 training cases
in the file hypothyroid.data, and this appears next.
Although it may not look much like a tree, this output can be
paraphrased as:
if TSH is less than or equal to 6 then negative
else
if TSH is greater than 6 then
    if FTI is less than or equal to 65 then
        if thyroid surgery equals t then
            if FTI is less than or equal to 36.1 then negative
            else
            if FTI is greater than 36.1 then primary
        else
        if thyroid surgery equals f then
            if TT4 is less than or equal to 61 then primary
            else
            if TT4 is greater than 61 then
            . . . .
and so on.
The tree employs a case's attribute values to map it
to a leaf designating one of the classes.
Every leaf of the tree is followed by a cryptic (n) or
(n/m).
For instance, the last leaf of the decision tree
is compensated (174.6/24.8), for which n is 174.6 and
m is 24.8.
The value of n is the number of cases in the file
hypothyroid.data
that are mapped to this leaf, and m (if it appears) is the number of
them that are classified incorrectly by the leaf.
(A non-integral number of cases can arise because, when the value of
an attribute in the tree is not known, C5.0 splits the case
and sends a fraction down each branch.)
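The idea of splitting a case with an unknown value can be sketched on a one-test tree. This is an illustration, not C5.0's implementation: the tree encoding is hypothetical, the branch counts (3 and 1) are invented for the example, and dividing the case in proportion to the training cases on each branch is an assumption.

```python
def classify(node, case, weight=1.0, votes=None):
    """Return {class: weight} for a case, sending fractions down every branch
    when the tested attribute value is unknown (None or absent)."""
    if votes is None:
        votes = {}
    if "leaf" in node:
        votes[node["leaf"]] = votes.get(node["leaf"], 0.0) + weight
        return votes
    value = case.get(node["attribute"])
    branches = node["branches"]  # (test, training cases on branch, subtree)
    if value is None:
        total = sum(count for _, count, _ in branches)
        for _, count, child in branches:
            classify(child, case, weight * count / total, votes)
    else:
        for test, _, child in branches:
            if test(value):
                classify(child, case, weight, votes)
                break
    return votes


# Hypothetical stump with made-up branch counts.
STUMP = {
    "attribute": "TSH",
    "branches": [
        (lambda v: v <= 6, 3, {"leaf": "negative"}),
        (lambda v: v > 6, 1, {"leaf": "primary"}),
    ],
}

print(classify(STUMP, {"TSH": 7.0}))
print(classify(STUMP, {"TSH": None}))
```

Summing such fractional weights over all training cases is what produces non-integral counts like 174.6 at a leaf.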
The next section covers the evaluation of this decision tree shown in the second part of the output. Before we leave this output, though, its final line states the elapsed time for the run. (This differs from early releases of C5.0 which gave the CPU time.) The construction of a decision tree is usually completed quickly, even when there are many thousands of cases. Some of the options described later, such as ruleset generation and boosting, can slow things down considerably.
The progress of C5.0 on long runs can be monitored by examining the
last few lines of the temporary
file filestem.tmp
(e.g. hypothyroid.tmp).
This file displays the stage that C5.0 has reached and, for most stages,
gives an indication of progress within that stage.
Results of the decision tree on the cases in
hypothyroid.data are:
Decision Tree
----------------
Size Errors
12 7( 0.3%) <<
Size is the number of non-empty leaves on the
tree and
Errors shows
the number and percentage of cases misclassified.
The tree, with 12 leaves, misclassifies 7 of the 2772 given cases, an error
rate of 0.3%.
This might seem inconsistent with the errors recorded at the leaves --
the leaf mentioned above shows 24.8 errors! The discrepancy arises because
parts of a case split as a result of unknown attribute values can
be misclassified and yet, when the votes from all the parts are aggregated,
the correct class can still be chosen.
When there are no more than twenty classes, performance on the training cases is further analyzed in a confusion matrix that pinpoints the kinds of errors made.
   (a)   (b)   (c)   (d)    <-classified as
  ----  ----  ----  ----
    60     3                (a): class primary
         153           1    (b): class compensated
                       2    (c): class secondary
           1        2552    (d): class negative
In this example, the decision tree misclassifies
three primary cases as compensated,
one compensated case as negative,
both secondary cases as negative, and
one negative case as compensated.
When the number of classes is larger than twenty, a summary of performance broken down by class is shown instead. The entry for each class shows the number of cases for that class and the numbers of false positives and false negatives. A false positive for class C is a case of another class that is classified as C, while a false negative for C is a case of class C that is classified as some other class. Of course, the total number of errors must come to half the sum of the numbers of false positives and false negatives, since each error is counted twice--as a false negative for its true class, and as a false positive for the predicted class.
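The relationship between errors, false positives, and false negatives can be checked on the training confusion matrix above. The sketch below is illustrative; rows are true classes and columns are predicted classes, with cell positions taken from the description of the matrix in the text.

```python
CLASSES = ["primary", "compensated", "secondary", "negative"]

# Rows: true class; columns: predicted class (training data, decision tree).
MATRIX = [
    [60, 3, 0, 0],
    [0, 153, 0, 1],
    [0, 0, 0, 2],
    [0, 1, 0, 2552],
]


def summarize(matrix):
    """Return (total errors, false positives per class, false negatives per class)."""
    n = len(matrix)
    errors = sum(matrix[i][j] for i in range(n) for j in range(n) if i != j)
    false_neg = [sum(matrix[i]) - matrix[i][i] for i in range(n)]
    false_pos = [sum(matrix[r][c] for r in range(n)) - matrix[c][c] for c in range(n)]
    return errors, false_pos, false_neg


errors, false_pos, false_neg = summarize(MATRIX)
for name, fp, fn in zip(CLASSES, false_pos, false_neg):
    print(f"{name}: {fp} false positives, {fn} false negatives")
print("errors:", errors)
```

Each error appears once among the false negatives and once among the false positives, so the two lists together sum to twice the error count.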
For some applications, especially those with many attributes, it may be useful to know how the individual attributes contribute to the classifier. This is shown in the next section:
Attribute usage:
90% TSH
18% thyroid surgery
17% on thyroxine
14% TT4
13% T4U
13% FTI
7% referral source
The figure before each attribute is the percentage of training cases
in hypothyroid.data for which the value of that
attribute is known and is used in predicting a class. The second
entry, for instance, shows that the decision tree uses a known
value of thyroid surgery when classifying 18% of the
training cases. Attributes for which this value is less than 1%
are not shown. Note that use of an attribute such as FTI that is defined by a formula also counts as using any attributes involved in its definition (here TT4 and T4U).
If there are optional unseen test cases, the classifier's performance on these cases is summarized in a format similar to that for the training cases.
Decision Tree
----------------
Size Errors
12 4( 0.4%) <<
(a) (b) (c) (d) <-classified as
---- ---- ---- ----
31 1 (a): class primary
1 39 (b): class compensated
(c): class secondary
2 926 (d): class negative
A very simple majority classifier predicts
that every new case belongs to the most common class in the
training data.
In this example, 2553 of the 2772 training cases belong to class
negative so that
a majority classifier would always opt for negative.
The 1000 test cases from file hypothyroid.test
include 928 belonging to class negative, so a simple
majority classifier would have an error rate of 7.2%.
The decision tree has a lower error rate of 0.4% on the new
cases, but notice that
this is higher than its error rate on the training cases.
The confusion matrix (or false positive/false negative summary if
there are more than twenty classes) for the test cases again provides
more details on correct and incorrect classifications.
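The majority-classifier baseline is simple enough to verify directly. In this sketch the per-class counts are the row totals of the confusion matrices shown in this tutorial; the function name is hypothetical.

```python
TRAIN_COUNTS = {"primary": 63, "compensated": 154, "secondary": 2, "negative": 2553}
TEST_COUNTS = {"primary": 32, "compensated": 40, "secondary": 0, "negative": 928}


def majority_baseline_error(train_counts, test_counts):
    """Percentage of test cases misclassified by always predicting the
    most common training class."""
    majority = max(train_counts, key=train_counts.get)
    total = sum(test_counts.values())
    return majority, 100 * (total - test_counts[majority]) / total


majority, error_pct = majority_baseline_error(TRAIN_COUNTS, TEST_COUNTS)
print(majority, round(error_pct, 1))
```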
By default, a test on
a discrete attribute has a separate branch for
each of its values that is present in the data.
Tests with a high fan-out can have the undesirable side-effect
of fragmenting the data during construction of the decision tree.
C5.0 has an option -s
that can mitigate this fragmentation to some
extent: attribute values are grouped into subsets and each subtree
is associated with a subset rather than with a single value.
In the hypothyroid example, invoking this option by the command
c5.0 -f hypothyroid -s
merely simplifies part of the tree as
referral source in {WEST,STMW,SVHC,SVI,SVHD}: primary (4.9/0.8)
without affecting classification performance on either the training
or test data.
Although it does not help much for this application, the
-s
option is recommended when there are important discrete
attributes that have more than four or five values.
Decision trees can sometimes be quite difficult to understand. An important feature of C5.0 is its ability to generate classifiers called rulesets that consist of unordered collections of (relatively) simple if-then rules.
The option -r causes
classifiers to be expressed as rulesets rather than decision trees.
The command
c5.0 -f hypothyroid -r
gives the following:
C5.0 [Release 2.05] Fri Oct 26 10:00:25 2007
-------------------
Options:
Application `../hypothyroid'
Focus on errors (ignore costs file)
Rule-based classifiers
Class specified by attribute `diagnosis'
Read 2772 cases (24 attributes) from ../hypothyroid.data
Rules:
Rule 1: (31, lift 42.7)
thyroid surgery = f
TSH > 6
TT4 <= 37
-> class primary [0.970]
Rule 2: (63/6, lift 39.3)
TSH > 6
FTI <= 65
-> class primary [0.892]
Rule 3: (270/116, lift 10.3)
TSH > 6
-> class compensated [0.570]
Rule 4: (2225/2, lift 1.1)
TSH <= 6
-> class negative [0.999]
Rule 5: (296, lift 1.1)
on thyroxine = t
FTI > 65
-> class negative [0.997]
Rule 6: (240, lift 1.1)
TT4 > 153
-> class negative [0.996]
Rule 7: (29, lift 1.1)
thyroid surgery = t
FTI > 65
-> class negative [0.968]
Default class: negative
Evaluation on training data (2772 cases):
Rules
----------------
No Errors
7 14( 0.5%) <<
(a) (b) (c) (d) <-classified as
---- ---- ---- ----
60 3 (a): class primary
1 153 (b): class compensated
2 (c): class secondary
5 3 2545 (d): class negative
Attribute usage:
90% TSH
20% TT4
14% T4U
14% FTI
11% on thyroxine
2% thyroid surgery
Evaluation on test data (1000 cases):
Rules
----------------
No Errors
7 5( 0.5%) <<
(a) (b) (c) (d) <-classified as
---- ---- ---- ----
32 (a): class primary
1 39 (b): class compensated
(c): class secondary
1 3 924 (d): class negative
Time: 0.1 secs
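Each rule consists of:
A rule number, which serves only to identify the rule.
Statistics (n, lift x) or (n/m, lift x) that summarize the performance of the rule. Similarly to a leaf, n is the number of training cases covered by the rule and m, if it appears, shows how many of them do not belong to the class predicted by the rule. The rule's accuracy is estimated by the Laplace ratio (n-m+1)/(n+2). The lift x is the result of dividing the rule's estimated accuracy by the relative frequency of the predicted class in the training set.
One or more conditions that must all be satisfied if the rule is to be applicable.
The class predicted by the rule.
A value between 0 and 1, shown in square brackets, that indicates the confidence with which this prediction is made.
There is also a default class, here negative, that is used when none of the rules apply.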
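The rule statistics can be recomputed from n and m. In this sketch the Laplace ratio is taken to be (n-m+1)/(n+2), and lift is assumed to be that estimate divided by the relative frequency of the predicted class in the training data; both assumptions reproduce the figures printed for the rules above. Function names are hypothetical.

```python
def laplace_accuracy(n, m=0):
    """Estimated accuracy of a rule covering n cases, m of them wrongly."""
    return (n - m + 1) / (n + 2)


def lift(n, m, class_cases, total_cases):
    """Estimated accuracy relative to the training frequency of the predicted class."""
    return laplace_accuracy(n, m) / (class_cases / total_cases)


# Rule 1 covers 31 cases with no errors and predicts primary;
# 63 of the 2772 training cases belong to class primary.
print(round(laplace_accuracy(31), 3), round(lift(31, 0, 63, 2772), 1))

# Rule 3 covers 270 cases, 116 wrongly, and predicts compensated (154 of 2772).
print(round(laplace_accuracy(270, 116), 3), round(lift(270, 116, 154, 2772), 1))
```

The results agree with the bracketed confidence values and lift figures shown for Rule 1 and Rule 3.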
Rulesets are generally easier to understand than trees since each rule describes a specific context associated with a class. Furthermore, a ruleset generated from a tree usually has fewer rules than the tree has leaves, another plus for comprehensibility. (In this example, the first decision tree with 12 leaves is reduced to seven rules.)
Another advantage of ruleset classifiers is that they
are often more accurate predictors than decision
trees -- a point not illustrated here, since the ruleset has an
error rate of 0.5% on the test cases.
For very large datasets, however, generating rules with the
-r option
can require considerably more computer time.
For a given application, the attribute usage shown for a decision tree and for a ruleset can be a bit different. In the case of the tree, the attribute at the root is always used (provided its value is known) while an attribute further down the tree is used less frequently. For a ruleset, an attribute is used to classify a case if it is referenced by a condition of at least one rule that applies to that case; the order in which attributes appear in a ruleset is not relevant.
Rules can instead be ordered by their importance to predictive accuracy using the -u
option. Under this option, the rule that most reduces the error rate
appears first and the rule that contributes least appears last.
Furthermore, results are reported in a selected number of
bands so that the predictive accuracies of the more
important subsets of rules are also estimated. For example,
if the option -u 4 is selected, the hypothyroid rules are reordered as:
Rule 1: (2225/2, lift 1.1)
TSH <= 6
-> class negative [0.999]
Rule 2: (270/116, lift 10.3)
TSH > 6
-> class compensated [0.570]
Rule 3: (63/6, lift 39.3)
TSH > 6
FTI <= 65
-> class primary [0.892]
Rule 4: (296, lift 1.1)
on thyroxine = t
FTI > 65
-> class negative [0.997]
Rule 5: (240, lift 1.1)
TT4 > 153
-> class negative [0.996]
Rule 6: (29, lift 1.1)
thyroid surgery = t
FTI > 65
-> class negative [0.968]
Rule 7: (31, lift 42.7)
thyroid surgery = f
TSH > 6
TT4 <= 37
-> class primary [0.970]
The rules are divided into four bands of roughly equal sizes
and a further summary is generated for both training and test cases.
Here is the output for test cases:
Evaluation on test data (1000 cases):
Rules
----------------
No Errors
7 5( 0.5%) <<
(a) (b) (c) (d) <-classified as
---- ---- ---- ----
32 (a): class primary
1 39 (b): class compensated
(c): class secondary
1 3 924 (d): class negative
Rule utility summary:
Rules Errors
----- ------
1-2 56( 5.6%)
1-4 10( 1.0%)
1-5 6( 0.6%)
This shows that,
when only the first two rules are used,
the error rate on the test cases is 5.6%,
dropping to 1.0% when the first four rules are used,
and so on. The performance of the entire ruleset is not
repeated since it is shown above the utility summary.
Rule utility orderings are not given for cross-validations (see below).
Another innovation incorporated in C5.0 is adaptive boosting, based on the work of Rob Schapire and Yoav Freund. The idea is to generate several classifiers (either decision trees or rulesets) rather than just one. When a new case is to be classified, each classifier votes for its predicted class and the votes are counted to determine the final class.
But how can we generate several classifiers from a single dataset?
As the first step, a single decision tree or ruleset is constructed
as before from the training data (e.g. hypothyroid.data).
This classifier will usually make mistakes on some cases in the data;
the first decision tree, for instance, gives the wrong class
for 7 cases in hypothyroid.data.
When the second classifier is constructed, more attention is paid
to these cases in an attempt to get them right.
As a consequence, the second classifier will generally be different
from the first. It also will make errors on some cases,
and these become the focus of attention during construction
of the third classifier.
This process continues for a pre-determined number of iterations
or trials, but stops if the most recent classifier is
either extremely accurate or inaccurate.
The option -t x instructs C5.0 to
construct up to x
classifiers in this manner; an alternative option -b
is equivalent to -t 10.
In this example, the command
c5.0 -f hypothyroid -b
causes ten decision trees to be generated. The summary of the trees' individual and aggregated performance on the 1000 test cases is:
Trial Decision Tree
----- ----------------
Size Errors
0 12 4( 0.4%)
1 7 52( 5.2%)
2 11 9( 0.9%)
3 15 21( 2.1%)
4 7 12( 1.2%)
5 10 7( 0.7%)
6 8 8( 0.8%)
7 13 13( 1.3%)
8 12 12( 1.2%)
9 16 54( 5.4%)
boost 2( 0.2%) <<
(a) (b) (c) (d) <-classified as
---- ---- ---- ----
32 (a): class primary
40 (b): class compensated
(c): class secondary
2 926 (d): class negative
(Again, different hardware can lead to slightly different
results.)
The performance of the classifier constructed at each trial
is summarized on a separate line, while the line labeled
boost
shows the result of voting all the classifiers.
The decision tree constructed on Trial 0 is identical to that
produced without the -b option.
Some of the subsequent trees produced by paying more attention
to certain cases
have relatively high overall error rates. Nevertheless, when the
trees are combined by voting,
the final predictions have a lower error rate of 0.2% on the test cases.
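The voting step can be sketched as follows. This is a deliberately simplified illustration: each classifier is modelled as a plain function, every vote counts equally (C5.0 may weight votes differently), and the three toy classifiers with their thresholds are invented for the example.

```python
from collections import Counter


def vote(predictions):
    """Return the class predicted by the most classifiers."""
    return Counter(predictions).most_common(1)[0][0]


def boosted_predict(classifiers, case):
    """Ask every classifier for its class and combine the answers by voting."""
    return vote([classifier(case) for classifier in classifiers])


# Three hypothetical classifiers that disagree on some cases.
classifiers = [
    lambda case: "negative" if case["TSH"] <= 6 else "primary",
    lambda case: "negative" if case["TSH"] <= 8 else "primary",
    lambda case: "negative" if case["FTI"] > 65 else "primary",
]

print(boosted_predict(classifiers, {"TSH": 7.0, "FTI": 109}))
```

Individual members can be wrong on a case as long as enough of the others outvote them, which is why the combined error rate can be lower than that of any single trial.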
The decision trees and rulesets constructed by C5.0 do not generally use all of the attributes. The hypothyroid application has 22 predictive attributes (plus a class and a label attribute) but only six of them appear in the tree and the ruleset. This ability to pick and choose among the predictors is an important advantage of tree-based modeling techniques.
Some applications, however, have an abundance of attributes! For instance, one approach to text classification describes each passage by the words that appear in it, so there is a separate attribute for each different word in a restricted dictionary.
When there are numerous alternatives for each test in the tree or ruleset, it is likely that at least one of them will appear to provide valuable predictive information. In applications like these it can be useful to pre-select a subset of the attributes that will be used to construct the decision tree or ruleset. The C5.0 mechanism to do this is called "winnowing" by analogy with the process for separating wheat from chaff (or, here, useful attributes from unhelpful ones).
Winnowing is not obviously relevant for the hypothyroid application
since there are relatively few attributes. To illustrate the idea,
however, here are the results when the
-w
option is invoked:
C5.0 [Release 2.05] Fri Oct 26 09:46:36 2007
-------------------
Options:
Application `hypothyroid'
Winnow attributes
Class specified by attribute `diagnosis'
Read 2772 cases (24 attributes) from hypothyroid.data
14 attributes winnowed
Estimated importance of remaining attributes:
990% TSH
270% FTI
200% on thyroxine
30% thyroid surgery
<1% age
<1% T3
<1% TT4
<1% referral source
Decision tree:
TSH <= 6: negative (2472/2)
TSH > 6:
:...FTI <= 65:
    :...thyroid surgery = t:
    :   :...FTI <= 36.1: negative (2.1)
    :   :   FTI > 36.1: primary (2.1/0.1)
    :   thyroid surgery = f:
    :   :...TT4 <= 61: primary (51/3.7)
    :       TT4 > 61:
    :       :...referral source in {WEST,SVHD}: primary (0)
    :           referral source in {STMW,SVHC,SVI}: primary (4.9/0.8)
    :           referral source = other:
    :           :...TSH <= 22: negative (6.4/2.7)
    :               TSH > 22: primary (5.8/0.8)
    FTI > 65:
    :...on thyroxine = t: negative (37.7)
        on thyroxine = f:
        :...thyroid surgery = t: negative (6.8)
            thyroid surgery = f:
            :...TT4 > 153: negative (6/0.1)
                TT4 <= 153:
                :...TT4 <= 37: primary (2.5/0.2)
                    TT4 > 37: compensated (174.6/24.8)
Evaluation on training data (2772 cases):
Decision Tree
----------------
Size Errors
12 7( 0.3%) <<
   (a)   (b)   (c)   (d)    <-classified as
  ----  ----  ----  ----
    60     3                (a): class primary
         153           1    (b): class compensated
                       2    (c): class secondary
           1        2552    (d): class negative
Attribute usage:
90% TSH
18% thyroid surgery
17% on thyroxine
14% TT4
13% T4U
13% FTI
7% referral source
Evaluation on test data (1000 cases):
Decision Tree
----------------
Size Errors
12 4( 0.4%) <<
(a) (b) (c) (d) <-classified as
---- ---- ---- ----
31 1 (a): class primary
1 39 (b): class compensated
(c): class secondary
2 926 (d): class negative
Time: 0.0 secs
After analyzing the training cases and before the decision
tree is built, C5.0 winnows 14 of the 22 predictive attributes.
This has the same effect as marking the attributes as excluded
by an entry in the names file; winnowed attributes
can still be used in the definition of other attributes.
In this example, T4U is winnowed but is still available
for use in the definition of FTI.
The remaining attributes are then listed in order of importance, C5.0's estimate of the factor by which the true error rate or misclassification cost would increase if that attribute were excluded. If TSH were excluded, for example, C5.0 expects the error rate on unseen test cases to increase to 4% (990% of the current rate of 0.4%). This estimate is intended only as a rough guide and should not be taken too literally!
We then see the decision tree that is constructed from the reduced set of attributes. In this case it is identical to the original tree, but winnowing will usually lead to a different classifier.
Since winnowing the attributes can be a time-consuming process, it is recommended primarily for large applications (10,000 cases or more) where there is reason to suspect that many of the attributes have at best marginal relevance to the classification task.
The top of our initial decision tree tests whether
the value of the attribute TSH is less than or
equal to, or greater than, 6. If the former holds, we go no further
and predict that the case's class is negative, while
if it does not we look at other information before making a decision.
Thresholds like this are sharp by default, so that a case with
a hypothetical value of 5.99 for TSH is treated
quite differently from one with a value of 6.01.
For some domains, this sudden change is quite appropriate -- for instance, there are hard-and-fast cutoffs for bands of the income tax table. For other applications, though, it is more reasonable to expect classification decisions to change more slowly with changes in attribute values.
C5.0 contains an option
-p
to `soften' thresholds such as 6 above.
When this is invoked, each threshold
is broken into three ranges -- let us denote them
by a lower bound lb, an upper bound ub, and a
central value t. If the attribute value in question
is below lb or above ub, classification is
carried out using the single branch
corresponding to the `<=' or `>' result respectively.
If the value lies between lb and ub, both
branches of the tree are investigated and the results combined
probabilistically.
The values of lb and ub are determined by C5.0
based on an analysis of the apparent sensitivity of classification
to small changes in the threshold. They need not be symmetric --
a fuzzy threshold can be sharper on one side than on the other.
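One way to picture a soft threshold is as a weight for the `<=' branch that falls from 1 at lb to 0 at ub. The sketch below assumes piecewise-linear interpolation passing through 0.5 at the central value t; the tutorial says only that the two branches are combined probabilistically, so the interpolation and the function names are assumptions.

```python
def low_branch_weight(value, lb, t, ub):
    """Weight given to the '<=' branch: 1 at or below lb, 0 at or above ub,
    and (by assumption) linear on each side of t, where it equals 0.5."""
    if value <= lb:
        return 1.0
    if value >= ub:
        return 0.0
    if value <= t:
        return 1.0 - 0.5 * (value - lb) / (t - lb)
    return 0.5 * (ub - value) / (ub - t)


def combine(prob_low, prob_high, weight):
    """Blend the class probabilities returned by the two branches."""
    classes = set(prob_low) | set(prob_high)
    return {c: weight * prob_low.get(c, 0.0) + (1 - weight) * prob_high.get(c, 0.0)
            for c in classes}


# Hypothetical asymmetric threshold: lb = 64, t = 65.35, ub = 66.
for fti in (60, 65.35, 70):
    print(fti, low_branch_weight(fti, 64, 65.35, 66))
```

Because lb and ub are placed independently around t, the weight can fall off faster on one side than on the other.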
The command
c5.0 -f hypothyroid -p
gives the following decision tree:
TSH <= 6 (6.05): negative (2472/2)
TSH >= 6.1 (6.05):
:...FTI <= 64 (65.35):
    :...thyroid surgery = t:
    :   :...FTI <= 24 (38.25): negative (2.1)
    :   :   FTI >= 52.5 (38.25): primary (2.1/0.1)
    :   thyroid surgery = f:
    :   :...TT4 <= 59 (61.5): primary (51/3.7)
    :       TT4 >= 63 (61.5):
    :       :...referral source in {WEST,SVHD}: primary (0)
    :           referral source in {STMW,SVHC,SVI}: primary (4.9/0.8)
    :           referral source = other:
    :           :...TSH <= 19 (22.5): negative (6.4/2.7)
    :               TSH >= 26 (22.5): primary (5.8/0.8)
    FTI >= 66 (65.35):
    :...on thyroxine = t: negative (37.7)
        on thyroxine = f:
        :...thyroid surgery = t: negative (6.8)
            thyroid surgery = f:
            :...TT4 >= 159 (153): negative (6/0.1)
                TT4 <= 146 (153):
                :...TT4 <= 14 (37.5): primary (2.5/0.2)
                    TT4 >= 61 (37.5): compensated (174.6/24.8)
Each threshold is now of the form
<= lb (t)
or
>= ub (t).
In this example, most
of the thresholds are still relatively tight, but notice
the asymmetric threshold values for the test FTI <= 64.
For this application, soft thresholds slightly improve the classifier's
accuracy on both training and test data.
A final point: soft thresholds affect only decision tree classifiers -- they do not change the interpretation of rulesets.
Three further options enable aspects of the classifier-generation process to be tweaked. These are best regarded as advanced options that should be used sparingly (if at all), so that this section can be skipped without much loss.
C5.0 constructs decision trees in two phases. A large tree is first grown to fit the data closely and is then `pruned' by removing parts that are predicted to have a relatively high error rate. This pruning process is first applied to every subtree to decide whether it should be replaced by a leaf or sub-branch, and then a global stage looks at the performance of the tree as a whole.
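The option -g disables the global pruning stage.
Doing so can be beneficial for some applications, particularly when rulesets are generated.
The option -c CF sets the CF value used for pruning trees.
The option -m cases sets the minimum number of cases for at least two branches of a split.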
Even though C5.0 is relatively fast, building classifiers from large numbers of cases can take an inconveniently long time, especially when options such as boosting are employed. C5.0 incorporates a facility to extract a random sample from a dataset, construct a classifier from the sample, and then test the classifier on a disjoint collection of cases. By using a smaller set of training cases in this way, the process of generating a classifier is expedited, but at the cost of a possible reduction in the classifier's predictive performance.
The option -S x uses a sample of x% of the cases for training and a disjoint sample for testing.
In the hypothyroid example,
using a sample of 60% would cause a classifier to be constructed
from a randomly-selected 1663 of the 2772 cases in
hypothyroid.data, then tested on the
remaining 1109 cases.
By default, the random sample changes every time that
a classifier is constructed, so that
successive runs of C5.0 with sampling will
usually produce different results.
This re-sampling can be avoided by the option
-I seed, which sets the sampling seed value.
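The effect of sampling with a fixed seed can be sketched in Python as follows. This is not C5.0's own sampling code: the function name is hypothetical, Python's random generator stands in for whatever C5.0 uses, and rounding the training size to the nearest whole case is an assumption that happens to reproduce the 1663/1109 split quoted above.

```python
import random

def split_sample(n_cases, percent, seed=None):
    """Return (training, test) index lists for a random percent% sample.

    Assumption: the training size is rounded to the nearest whole case.
    """
    rng = random.Random(seed)          # a fixed seed gives a repeatable sample
    indices = list(range(n_cases))
    rng.shuffle(indices)
    n_train = round(n_cases * percent / 100)
    return sorted(indices[:n_train]), sorted(indices[n_train:])

train, test = split_sample(2772, 60, seed=1)
print(len(train), len(test))           # 1663 1109
```

Calling split_sample again with the same seed returns the same two lists, while seed=None gives a fresh sample on each run.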
As we saw earlier, the performance of a classifier on the training
cases from which it was constructed gives a poor estimate of
its accuracy on new cases.
The true predictive accuracy of the classifier can be estimated
by sampling, as above, or by using a separate test file;
either way, the classifier is evaluated on cases that were
not used to build it.
However, this estimate can be unreliable unless the numbers of
cases used to build and evaluate the classifier are both large.
If the cases in hypothyroid.data and
hypothyroid.test were to be shuffled
and divided into a new 2772-case training set and a 1000-case test set,
C5.0 might construct a different classifier with a lower or higher error
rate on the test cases.
One way to get a more reliable estimate of predictive accuracy is by f-fold cross-validation. The cases (including those in the test file, if it exists) are divided into f blocks of roughly the same size and class distribution. For each block in turn, a classifier is constructed from the cases in the remaining blocks and tested on the cases in the hold-out block. In this way, each case is used just once as a test case. The error rate of a classifier produced from all the cases is estimated as the ratio of the total number of errors on the hold-out cases to the total number of cases.
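The procedure just described can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than C5.0's implementation: the way cases are dealt into blocks is our own choice, and build_majority is a deliberately trivial stand-in learner, not C5.0.

```python
import random
from collections import Counter, defaultdict

def stratified_blocks(labels, f, seed=0):
    """Deal case indices into f blocks of similar size and class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    blocks = [[] for _ in range(f)]
    position = 0
    for members in by_class.values():
        rng.shuffle(members)
        for i in members:
            blocks[position % f].append(i)
            position += 1
    return blocks

def cross_validate(cases, labels, f, build, seed=0):
    """Estimate the error rate as total hold-out errors / total cases."""
    blocks = stratified_blocks(labels, f, seed)
    errors = 0
    for held_out in blocks:
        held = set(held_out)
        train = [i for i in range(len(cases)) if i not in held]
        predict = build([cases[i] for i in train], [labels[i] for i in train])
        errors += sum(predict(cases[i]) != labels[i] for i in held_out)
    return errors / len(cases)

def build_majority(train_cases, train_labels):
    """Stand-in learner (not C5.0): always predict the commonest class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda case: majority

labels = ["negative"] * 90 + ["primary"] * 10
cases = list(range(100))               # dummy attribute values
print(cross_validate(cases, labels, 10, build_majority))  # 0.1
```

Each case appears in exactly one block, so it is tested exactly once, by a classifier that never saw it during construction.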
The option -X f
runs such an f-fold cross-validation.
For example, the command
c5.0 -f hypothyroid -X 10 -r
selects 10-fold cross-validation using rulesets.
After giving details of the individual rulesets,
the output shows a summary like this:
Fold        Rules
----   ----------------
         No      Errors
   0      7        0.8%
   1      7        0.3%
   2      7        0.5%
   3      7        0.3%
   4      8        0.8%
   5      7        0.5%
   6      7        0.3%
   7      6        0.5%
   8      7        0.5%
   9      7        0.8%
Mean    7.0        0.5%
SE      0.1        0.1%
This estimates the error rate of the rulesets
produced from the
3772 cases in hypothyroid.data and hypothyroid.test
at 0.5%.
The SE figures (the standard errors of the means)
provide an estimate of the variability of these results.
As with sampling above, each cross-validation run will normally use
a different random division of the data into blocks, unless this
is prevented by using the -I option.
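The Mean and SE rows for the error column can be reproduced approximately from the per-fold figures in the summary. The Python sketch below assumes the standard error is the sample standard deviation divided by the square root of the number of folds; it works from the rounded percentages shown, so it is a rough check rather than C5.0's exact computation.

```python
import statistics

fold_errors = [0.8, 0.3, 0.5, 0.3, 0.8, 0.5, 0.3, 0.5, 0.5, 0.8]  # percent

mean = statistics.mean(fold_errors)
# Standard error of the mean: sample standard deviation / sqrt(number of folds).
se = statistics.stdev(fold_errors) / len(fold_errors) ** 0.5

print(f"Mean {mean:.1f}%  SE {se:.1f}%")   # Mean 0.5%  SE 0.1%
```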
The cross-validation procedure can be repeated for different random partitions of the cases into blocks. The average error rate from these distinct cross-validations is then an even more reliable estimate of the error rate of the single classifier produced from all the cases.
A shell script and associated programs for carrying out multiple
cross-validations are included with C5.0.
The shell script xval is invoked with any combination of C5.0
options and some further options that describe the cross-validations
themselves:
F=folds      specifies the number of cross-validation folds (default 10)
R=repeats    causes the cross-validation to be repeated repeats times (default 1)
+suffix      adds the identifying suffix +suffix to all files
+d           retains the files output by individual runs
If detailed results are retained via the +d option,
they appear in files named
filestem.ox[+suffix]
where x is the cross-validation number
(0 to repeats-1).
A summary of the cross-validations is written to file
filestem.res[+suffix].
As an example, the command
xval -f hypothyroid -r R=10 +r
has the effect of running ten 10-fold cross-validations
using ruleset classifiers (i.e., 100 classifiers in all).
File hypothyroid.res+r contains the following summary:
XVal        Rules
----   ----------------
         No      Errors
   0    7.2        0.5%
   1    7.2        0.6%
   2    7.2        0.5%
   3    7.2        0.6%
   4    7.1        0.8%
   5    7.0        0.5%
   6    7.1        0.6%
   7    7.3        0.6%
   8    7.2        0.6%
   9    7.2        0.6%
Mean    7.2        0.6%
SE      0.0        0.0%
Since every cross-validation fold uses only part of the application's data, running a cross-validation does not cause a classifier to be saved. To save a classifier for later use, simply run C5.0 without employing cross-validation.
Up to this point, all errors have been treated as equal -- we have simply counted the number of errors made by a classifier to summarize its performance. Let us now turn to the situation in which the `cost' associated with a classification error depends on the predicted and true class of the misclassified case.
C5.0 allows costs to be assigned to any combination of predicted and
true class via entries in the optional file
filestem.costs.
Each entry has the form
predicted class,true class:cost
where cost is any non-negative value. The file may contain any number of entries; if a particular combination is not specified explicitly, its cost is taken to be 0 if the predicted class is correct and 1 otherwise.
To illustrate the idea, suppose that it was a much more serious
error to classify a hypothyroid patient as negative
than the converse.
A hypothetical costs file hypothyroid.costs
might look like this:
negative, primary: 5
negative, secondary: 5
negative, compensated: 5
This specifies that the cost of misclassifying any
primary,
secondary, or
compensated
patient as negative is 5 units.
Since they are not given explicitly, all other errors
have cost 1 unit.
In other words, the first kind of error is five times more costly.
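The entry format and the default costs can be illustrated with a small Python sketch. It is not C5.0's own reader: the function names are hypothetical, and it assumes one entry per line with no commas or colons inside class names.

```python
def parse_costs(text):
    """Parse entries of the form 'predicted class,true class:cost'."""
    costs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        pair, cost = line.rsplit(":", 1)
        predicted, true = (name.strip() for name in pair.split(","))
        costs[(predicted, true)] = float(cost)
    return costs

def cost_of(costs, predicted, true):
    """Look up a cost, falling back to the defaults described in the text."""
    default = 0.0 if predicted == true else 1.0
    return costs.get((predicted, true), default)

example = """\
negative, primary: 5
negative, secondary: 5
negative, compensated: 5
"""
costs = parse_costs(example)
print(cost_of(costs, "negative", "primary"))   # 5.0
print(cost_of(costs, "primary", "negative"))   # 1.0
print(cost_of(costs, "negative", "negative"))  # 0.0
```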
A costs file is automatically read by C5.0 unless the
system is told to ignore it.
(The option -e
causes any costs file to be ignored and instructs C5.0
to focus only on errors.)
The command
c5.0 -f hypothyroid
now gives the following output:
C5.0 [Release 2.05] Fri Oct 26 09:52:25 2007
-------------------
Options:
Application `../hypothyroid'
Class specified by attribute `diagnosis'
Read 2772 cases (24 attributes) from hypothyroid.data
Read misclassification costs from hypothyroid.costs
Decision tree:
TSH <= 6: negative (2472/2)
TSH > 6:
:...FTI > 65:
    :...on thyroxine = t: negative (37.7)
    :   on thyroxine = f:
    :   :...thyroid surgery = t: negative (6.8)
    :       thyroid surgery = f:
    :       :...TT4 > 153: negative (6/0.1)
    :           TT4 <= 153:
    :           :...TT4 <= 37: primary (2.5/0.2)
    :               TT4 > 37: compensated (174.6/24.8)
    FTI <= 65:
    :...thyroid surgery = t:
        :...FTI <= 36.1: negative (2.1)
        :   FTI > 36.1: primary (2.1/0.1)
        thyroid surgery = f:
        :...TT4 <= 61: primary (51/3.7)
            TT4 > 61:
            :...referral source in {WEST,SVHD}: primary (0)
                referral source in {STMW,SVHC,SVI}: primary (4.9/0.8)
                referral source = other:
                :...TSH > 22: primary (5.8/0.8)
                    TSH <= 22:
                    :...T3 <= 2.3: compensated (3.4/0.9)
                        T3 > 2.3: negative (3/0.2)
Evaluation on training data (2772 cases):
    Decision Tree
  -----------------------
  Size      Errors   Cost
    13    8( 0.3%)   0.01   <<
   (a)   (b)   (c)   (d)    <-classified as
  ----  ----  ----  ----
    60     3                (a): class primary
         154                (b): class compensated
                       2    (c): class secondary
     1     2        2550    (d): class negative
Attribute usage:
90% TSH
18% thyroid surgery
17% on thyroxine
14% TT4
13% T4U
13% FTI
7% referral source
1% T3
Evaluation on test data (1000 cases):
    Decision Tree
  -----------------------
  Size      Errors   Cost
    13    5( 0.5%)   0.01   <<
   (a)   (b)   (c)   (d)    <-classified as
  ----  ----  ----  ----
    31     1                (a): class primary
     1    39                (b): class compensated
                            (c): class secondary
     1     2         925    (d): class negative
Time: 0.0 secs
This new decision tree has a higher error rate
for both the training and test cases
than the decision tree generated without costs, and might therefore appear
entirely inferior to it.
The real difference comes when we compare the total cost of misclassified
training cases for the two trees.
The first decision tree has a total cost of 19 (4x1 + 3x5)
for the misclassified training cases in hypothyroid.data.
The corresponding value for the new tree is 16 (6x1 + 2x5).
The total misclassification cost on the test data is
8 (3x1 + 1x5) for the original tree and 5 (5x1) for the
new tree.
That is, the new tree's total misclassification cost is lower than
the original tree's on both the training and test cases.
The new "Cost" column in the output shows the average misclassification cost, i.e. the total cost divided by the number of cases. For the new tree, the average cost is 16/2772 for the training cases and 5/1000 for the test cases.
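The totals and averages quoted above follow from simple arithmetic, reproduced here as a short Python sketch. The helper name is ours; the error counts are those given in the text.

```python
def total_cost(errors_by_cost):
    """Sum cost * count over a {cost: number_of_errors} mapping."""
    return sum(cost * count for cost, count in errors_by_cost.items())

# Error counts quoted in the text, grouped by unit cost.
original_train = total_cost({1: 4, 5: 3})   # 19
new_train = total_cost({1: 6, 5: 2})        # 16
original_test = total_cost({1: 3, 5: 1})    # 8
new_test = total_cost({1: 5})               # 5

print(new_train / 2772)   # average cost per training case, about 0.0058
print(new_test / 1000)    # average cost per test case, 0.005
```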
It is sometimes useful to attach different weights to cases depending on some measure of their importance. An application predicting whether a customer is likely to "churn," for example, might weight training cases by the size of the account.
C5.0 accommodates this by allowing a special attribute that contains
the weight of each case. The attribute name must be
case weight and
it must be of type continuous. The relative weight
assigned to each case is its value of this attribute divided by
the average value; if the value is undefined ("?"),
not applicable ("N/A"), or is less than or equal to zero,
the case's relative weight is set to 1.
The case weight attribute itself is not used in the classifier!
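The rule for relative weights can be sketched in Python as below. The text does not say which cases contribute to the average, so this sketch assumes it is taken over the valid (positive) values only; the function name and sample values are hypothetical.

```python
def relative_weights(raw_values):
    """Convert raw 'case weight' values to relative weights.

    Values that are unknown ("?"), not applicable ("N/A"), or <= 0 get
    relative weight 1. Assumption: the average is taken over the valid
    (positive) values only; the text does not say which cases it covers.
    """
    def valid(v):
        return isinstance(v, (int, float)) and v > 0

    valid_values = [v for v in raw_values if valid(v)]
    if not valid_values:
        return [1.0] * len(raw_values)
    average = sum(valid_values) / len(valid_values)
    return [v / average if valid(v) else 1.0 for v in raw_values]

print(relative_weights([59, 77, "?", 54, -3, "N/A"]))
```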
Our sample hypothyroid application does not have any natural case-by-case
weighting, since all patients are equal. For the purpose of illustration,
though, we will add an implicitly-defined attribute to
hypothyroid.names as follows:
case weight := 100-age.
TSH <= 6: negative (2458.3/2.3)
TSH > 6:
:...FTI <= 64.5: primary (69.5/12.8)
    FTI > 64.5:
    :...on thyroxine = t: negative (39.5)
        on thyroxine = f:
        :...thyroid surgery = t: negative (9.2)
            thyroid surgery = f:
            :...TT4 > 153: negative (6.2/0.2)
                TT4 <= 153:
                :...TT4 > 61: compensated (178.6/28.9)
                    TT4 <= 61:
                    :...TSH <= 35: compensated (2.8/0.3)
                        TSH > 35: primary (3.2/0.3)
The case counts at the leaves now reflect the relative weights
of the cases. (The counts associated with rules are affected similarly.)
A cautionary note: The use of case weighting does not guarantee that the classifier will be more accurate for unseen cases with higher weights. Predictive accuracy on more important cases is likely to be improved only when cases with similar values of the predictor attributes also have similar values of the case weight attribute, i.e. when relatively important cases "clump together." Without this property, case weighting can introduce an unhelpful element of randomness into the classifier generation process.
Once a classifier has been constructed, an interactive interpreter can be used to predict the classes to which new cases belong. The command to do this is
predict
whose options are:
-f filestem   to identify the application
-r            uses ruleset classifiers rather than decision trees
-p            prints the classifier as a reminder
This is illustrated in the following dialog that uses the first decision
tree to predict the class of a case. Input from the user
is shown in bold face and the enter key as
¤.
TSH: 12¤
TT4: 110¤
T4U: 1.14¤
on thyroxine: f¤
thyroid surgery: f¤
-> compensated [0.86]
negative [0.13]
primary [0.01]
Retry, new case or quit [r,n,q]: r¤
TSH [12]: ¤
TT4 [110]: ?¤
T4U [1.14]: ¤
thyroid surgery [f]: ¤
referral source: WEST¤
on thyroxine [f]: ¤
-> compensated [0.64]
primary [0.22]
negative [0.15]
Retry, new case or quit [r,n,q]: q¤
The values of some attributes might not affect the classification, so
predict prompts for the values of those attributes that
are required. The reply `?' indicates
that a requested attribute value is unknown.
(Similarly, use `N/A' for non-applicable values.)
When all the relevant information has been entered, the most likely
class (or classes) are printed, each with a confidence value.
Next, predict asks whether the same case is to be tried
again with changed attribute values (a kind of `what if'
scenario), a new case is to be classified, or all cases are complete.
If a case is retried, each prompt for an attribute value shows
the previous value in square brackets.
A new value can be entered, followed by the enter key, or
the enter key alone can be used to indicate that the value is unchanged.
Classifiers can also be used in batch mode. The sample application provided in the public source code reads cases from a cases file and shows the predicted class and the confidence for each.
Linux users who have installed a recent version of
Wine can invoke a
slightly simplified version of the See5 user interface.
The executable program gui starts the graphical
user interface whose main window is similar to See5's, with
five buttons.
The graphical interface calls C5.0 directly, so use of the GUI has minimal impact on performance when generating a classifier.
Please note: C5.0 should be run for the first time from the command-line interface, not the GUI. The first run installs the licence in C5.0 -- after that, C5.0 can be used from either interface.
The classifiers generated by C5.0 are retained in files
filestem.tree (for decision trees) and
filestem.rules (for rulesets).
Free C source code is available
to read these classifier files and to make predictions with them,
enabling you to use C5.0 classifiers in other
programs. As an example, the source includes a program to read cases
from a cases file, and to show how each is classified
by boosted or single trees or rulesets.
The public source code is distributed as a gzipped tar file.
-f filestem   select the application
-s            partition discrete values into subsets
-r            generate rule-based classifiers
-u bands      sort rules by their utility into bands
-b            use boosting with 10 trials
-t trials     use boosting with the specified number of trials
-w            winnow the attributes before constructing a classifier
-p            use soft thresholds
-g            do not use global tree pruning
-c CF         set the CF value for pruning trees
-m cases      set the minimum cases for at least two branches of a split
-S x          use a sample of x% for training and a disjoint sample for testing
-I seed       set the sampling seed value
-X folds      carry out a cross-validation
-e            ignore any costs file
-h            print a short summary of the options
© RULEQUEST RESEARCH 2008. Last updated July 2008.