An Overview of Cubist

Cubist is a tool for generating rule-based predictive models from data. Whereas its sister system See5/C5.0 produces classification models that predict categories, Cubist models predict numeric values. This short tutorial introduces Cubist's capabilities and explains how to use the system effectively.


Preparing Data for Cubist

We will illustrate Cubist using a simple application -- modeling automobiles' fuel consumption using data published in 2002 by the US Department of Energy and the US Environmental Protection Agency. Each data point concerns one automobile and the attributes or properties capture (possibly) relevant information as follows:

	Attribute                      Case 1     Case 2     Case 3   .....

	class                          SUBCOMPACT MIDSZ WAG  MIDSZ CAR
	manufacturer                   PONTIAC    FORD       LINC-MERC
	model                          SUNFIRE    FOCUS WAG  LS
	displacement (l)               2.2        2.0        3.9
	cylinders                      4          4          8
	drive                          Auto       Manual     Auto
	gears                          4          5          5
	transmission                   FWD        FWD        RWD
	fuel                           regular    regular    premium
	charger (super- or turbo-)     none       none       none
	valves/cylinder                2          2          4
	displacement/cylinder          0.55       0.5        0.49

	city mpg                       26.1       31.5       18.6

Each case has a target attribute or dependent variable -- here the city miles per gallon achieved by the automobile -- and the other attributes provide information that may help to predict this value, although some automobiles may have unknown values for some attributes. There are only thirteen attributes in this example including the target attribute, but Cubist can deal with thousands of attributes if necessary.

Cubist's job is to find how to estimate a case's target value in terms of its attribute values -- here, to relate miles per gallon to the other information provided for the automobile. Cubist does this by building a model containing one or more rules, where each rule is a conjunction of conditions associated with a linear expression. The meaning of a rule is that, if a case satisfies all the conditions, then the linear expression is appropriate for predicting the target value. A Cubist model thus resembles a piecewise linear model, except that the rules can overlap. As we will see, Cubist can also construct multiple models and can combine rule-based models with instance-based (nearest neighbor) models.

Application files

Every Cubist application has a short name called a filestem; we will use the filestem mpg2001 for this illustration. All files read or written by Cubist for an application look like filestem.extension, where filestem identifies the application and extension describes the contents of the file.

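Here is a summary of the main extensions used by Cubist (each is described in a later section):

	Extension   Status      Contents

	names       required    definitions of the attributes
	data        required    training cases used to construct a model
	test        optional    unseen cases used to evaluate the model
	cases       optional    further cases, used mainly with the public source code
	pred        output      actual and predicted values for the test cases
	tmp         output      progress of the current run (temporary file)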

All files for an application must be kept together in one directory but several applications can share the same directory.

Names file

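The first essential file is the names file (e.g. mpg2001.names) that defines the attributes used to describe each case. There are two important subgroups of attributes: explicitly-defined attributes, whose values are given directly in the data, and implicitly-defined attributes, whose values are computed from other attributes by a formula.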

The file mpg2001.names looks like this:

| Data extracted from the site http://www.fueleconomy.gov provided by
| the US Department of Energy and the US Environmental Protection Agency.

city mpg.

class:		COMPACT CARS, LARGE CARS, MIDSIZE CARS, MIDSIZE STATION WAGONS,
		MINICOMPACT CARS, SMALL PICKUP TRUCKS, SMALL STATION WAGONS,
		SPEC PURP VEH - MINIVAN, SPEC PURP VEH - SUV,
		STANDARD PICKUP TRUCKS, SUBCOMPACT CARS, TWO SEATERS,
		VANS CARGO TYPE, VANS PASSENGER TYPE.
manufacturer:	ACURA, AUDI, BENTLEY, BMW, BUICK, CADILLAC, CHEVROLET, CHRYSLER,
		DAEWOO, DODGE, FERRARI, FORD, GMC, HONDA, HYUNDAI, IMPCO,
		INFINITI, ISUZU, JAGUAR, JEEP, KIA, LAMBORGHINI, LAND ROVER,
		LEXUS, LINCOLN, LINCOLN-MERCURY, MAZDA, MERCEDES-BENZ, MERCURY,
		MITSUBISHI, NISSAN, OLDSMOBILE, PLYMOUTH, PONTIAC, PORSCHE,
		R-R MTR CARS LTD, SAAB, SATURN, SUBARU, SUZUKI, TOYOTA,
		VOLKSWAGEN, VOLVO.
model:		label.
displ:		continuous.		| liters/litres
cylinders:	continuous.
drive:		Auto, Manual.
gears:		continuous.
transmission:	F, R, 4.		| front, rear, 4WD
city mpg:	continuous.		| unadjusted
fuel:		R, P, D, C, E.		| regular, premium, diesel, gas, ethanol
charger:	T, S, -.		| turbo, super, none
valves/cyl:	continuous.

displ/cyl := displ / cylinders.
Some of the attribute names have been abbreviated and the target attribute city mpg appears among the others rather than at the end.

What's in a name?

Names, labels, and discrete values are represented by arbitrary strings of characters, with some fine print. Whitespace (blank lines, spaces, and tab characters) is ignored except inside a name or value and can be used to improve legibility. Unless it is escaped (by prefixing it with the escape character `\'), the vertical bar `|' causes the remainder of the line to be ignored and is handy for including comments. When used in this way, `|' should not occur inside a value.

The first important entry of the names file identifies the attribute that contains the target value -- the value to be modeled in terms of the other attributes -- here, city mpg. This attribute must be of type continuous or an implicitly-defined attribute that has numeric values (see below).

Following this entry, all attributes are defined in the order that they will be given for each case.

Explicitly-defined attributes

The name of each explicitly-defined attribute is followed by a colon `:' and a description of the values taken by the attribute. The attribute name is arbitrary, except that each attribute must have a distinct name, and case weight is reserved for setting weights for individual cases. There are eight possibilities:
continuous
The attribute takes numeric values.
date
The attribute's values are dates in the form YYYY/MM/DD or YYYY-MM-DD, e.g. 1999/09/30 or 1999-09-30. Valid dates range from the year 1601 to the year 4000.
time
The attribute's values are times in the form HH:MM:SS with values between 00:00:00 and 23:59:59.
timestamp
The attribute's values are times in the form YYYY/MM/DD HH:MM:SS or YYYY-MM-DD HH:MM:SS, e.g. 1999-09-30 15:04:00. (Note that there is a space separating the date and time.)
a comma-separated list of names
The attribute takes discrete values, and these are the allowable values. The values may be prefaced by [ordered] to indicate that they are listed in a meaningful order, otherwise they will be taken as unordered. For instance, the values low, medium, high are ordered, while meat, poultry, fish, vegetables are not. If the attribute values have a natural order, it is better to declare them as ordered so that this information can be exploited by Cubist.
discrete N for some integer N
The attribute also takes discrete values, but the values are assembled from the data itself; N is the maximum number of such values. (This is not recommended, since the data cannot be checked, but it can be handy for discrete attributes with many values.)
ignore
The values of the attribute should be ignored.
label
This attribute contains an identifying label for each case, such as an account number or an order code. The value of the attribute is ignored when models are constructed, but is used when referring to individual cases. A label attribute can make it easier to locate errors in the data and also helps with cross-referencing of results to individual cases. If there are two or more label attributes, only the last is used.

Attributes defined by formulas

The name of each implicitly-defined attribute is followed by `:=' and then a formula defining the attribute value. The formula is written in the usual way, using parentheses where needed, and may refer to any attribute that has been defined before this one. Constants in the formula can be `?' (meaning unknown), `N/A' (meaning not applicable), numbers (written in decimal notation), dates, times, and discrete attribute values enclosed in string quotes `"'. The operators and functions available for use in formulas include arithmetic operators such as +, -, / and %, comparisons such as = and <=, the logical connective and, and functions such as log(...); all of these appear in the examples in this tutorial.

The value of an implicitly-defined attribute is either numeric or true/false depending on the formula. This example includes one implicitly-defined attribute, engine displacement per cylinder (displ/cyl). This is a numeric attribute since its value is a ratio of two explicitly-defined numeric attributes. The value of a hypothetical attribute
	small := cylinders = 4 and class = "COMPACT CARS".
would be either t or f since the value given by the formula is either true or false.

If the value of the formula cannot be determined for a particular case, the value of the implicitly-defined attribute is unknown. For example, consider a car with a value `?' for the attribute cylinders. It is then impossible to find the value of engine displacement per cylinder so this attribute would also have an unknown value.
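
The way an unknown value propagates through a formula can be sketched in Python. This is only an illustration and not Cubist's own code: the function name displ_per_cyl is hypothetical, and None stands in for the `?' value.

```python
from typing import Optional


def displ_per_cyl(displ: Optional[float],
                  cylinders: Optional[float]) -> Optional[float]:
    """Mimic `displ/cyl := displ / cylinders.`, with None playing the role of `?'."""
    if displ is None or cylinders is None:
        # The formula cannot be evaluated, so the derived value is unknown too.
        return None
    return displ / cylinders


# Case 1 from the table above: 2.2 litres spread over 4 cylinders.
print(displ_per_cyl(2.2, 4))     # 0.55
# The same car with an unknown number of cylinders.
print(displ_per_cyl(2.2, None))  # None
```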

Dates, times, and timestamps

Dates are stored by Cubist as the number of days since a particular starting point so some operations on dates make sense. Thus, if we have attributes
	d1: date.
	d2: date.
we could define
	interval := d2 - d1.
	gap := d1 <= d2 - 7.
	d1-day-of-week := (d1 + 1) % 7 + 1.
interval then represents the number of days from d1 to d2 (non-inclusive) and gap would have a true/false value signaling whether d1 is at least a week before d2. The last definition is a slightly non-obvious way of determining the day of the week on which d1 falls, with values ranging from 1 (Monday) to 7 (Sunday).
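
The same three quantities can be reproduced with Python's standard datetime module, as a rough cross-check of what the definitions mean. The two dates are hypothetical. Because this tutorial does not state the starting point from which Cubist counts days, the sketch uses isoweekday() instead of imitating the modulo formula.

```python
from datetime import date, timedelta

# Hypothetical values for the attributes d1 and d2.
d1 = date(1999, 9, 30)
d2 = date(1999, 10, 12)

interval = (d2 - d1).days            # number of days from d1 to d2
gap = d1 <= d2 - timedelta(days=7)   # is d1 at least a week before d2?
day_of_week = d1.isoweekday()        # 1 (Monday) to 7 (Sunday)

print(interval, gap, day_of_week)    # 12 True 4
```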

Similarly, times are stored as the number of seconds since midnight. If the names file includes

	start: time.
	finish: time.
	elapsed := finish - start.
the value of elapsed is the number of seconds from start to finish.

Timestamps are a little more complex. A timestamp is rounded to the nearest minute, but limitations on the precision of floating-point numbers mean that the values stored for timestamps from more than thirty years ago are approximate. If the names file includes

	departure: timestamp.
	arrival: timestamp.
	flight time := arrival - departure.
the value of flight time is the number of minutes from departure to arrival.
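
A small Python sketch shows the units involved: seconds for times and minutes for timestamps. The sample values and the helper name seconds_since_midnight are hypothetical. The sketch rounds the difference between two timestamps to whole minutes, which matches rounding each timestamp first only when the seconds fields are zero, as they are here.

```python
from datetime import datetime


def seconds_since_midnight(hhmmss: str) -> int:
    """Convert a time written as HH:MM:SS to seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# elapsed := finish - start, measured in seconds.
elapsed = seconds_since_midnight("15:04:00") - seconds_since_midnight("14:30:30")

# flight time := arrival - departure, measured in minutes.
fmt = "%Y-%m-%d %H:%M:%S"
departure = datetime.strptime("1999-09-30 15:04:00", fmt)
arrival = datetime.strptime("1999-09-30 17:49:00", fmt)
flight_time = round((arrival - departure).total_seconds() / 60)

print(elapsed, flight_time)  # 2010 165
```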

Selecting the attributes that can appear in models

An optional final entry in the names file affects the way that Cubist constructs models. This entry takes one of the forms
	attributes included:
	attributes excluded:
followed by a comma-separated list of attribute names. The first form restricts the attributes used in models to those specifically named; the second form specifies that models must not use any of the named attributes.

Excluding an attribute from models is not the same as ignoring the attribute (see `ignore' above). As an example, suppose that a numeric attribute A is defined in the data, but background knowledge suggests that only the logarithm of A should appear in models. The names file might then contain the following entries:

	   . . .
	A: continuous.
	LogA := log(A).
	   . . .
	attributes excluded: A.
In this example the attribute A could not be defined as ignore because the definition of LogA would then be invalid.

The same pattern could be used if the goal was to model the log of A rather than the value of A itself. In this case the target attribute would be given as LogA and the exclusion of A would be necessary to prevent the value of A being used in the model for LogA.

Data file

The second essential file, the application's data file (here mpg2001.data), provides information on the training cases that Cubist will use to construct a model. The entry for each case consists of one or more lines that give the values for all explicitly-defined attributes. Values are separated by commas and the entry for each case is optionally terminated by a period. Once again, anything on a line after a vertical bar is ignored. (If the information for a case occupies more than one line, make sure that the line breaks occur after commas.)

The first three cases from file mpg2001.data are:

SUBCOMPACT CARS,PONTIAC,SUNFIRE,2.2,4,Auto,4,F,26.1,R,-,2
MIDSIZE STATION WAGONS,FORD,FOCUS STATION WAGON,2,4,Manual,5,F,31.5,R,-,2
MIDSIZE CARS,LINCOLN-MERCURY,LS,3.9,8,Auto,5,R,18.6,P,-,4
Notice that the value of the implicitly-defined attribute displ/cyl is not given for each case since it is computed from other attribute values. Don't forget the commas between values! If you leave them out, Cubist will not be able to process your data.

A value that is missing or unknown is entered as `?'. Similarly, `N/A' denotes a value that is not applicable for a particular case.
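
To make the format concrete, here is a minimal Python sketch that reads one single-line entry of mpg2001.data into a dictionary. It is not part of Cubist: the names ATTRIBUTES, CONTINUOUS and parse_case are hypothetical, and the sketch assumes that no value contains an escaped character, that each case fits on one line, and that the optional terminating period is absent.

```python
# Explicitly-defined attributes, in the order given in mpg2001.names.
ATTRIBUTES = ["class", "manufacturer", "model", "displ", "cylinders", "drive",
              "gears", "transmission", "city mpg", "fuel", "charger",
              "valves/cyl"]
CONTINUOUS = {"displ", "cylinders", "gears", "city mpg", "valves/cyl"}


def parse_case(line: str) -> dict:
    """Split one data line into attribute values; `?' becomes None."""
    line = line.split("|", 1)[0]                 # drop any trailing comment
    values = [value.strip() for value in line.split(",")]
    if len(values) != len(ATTRIBUTES):
        raise ValueError(f"expected {len(ATTRIBUTES)} values, got {len(values)}")
    case = {}
    for name, value in zip(ATTRIBUTES, values):
        if value == "?":
            case[name] = None                    # missing or unknown
        elif name in CONTINUOUS and value != "N/A":
            case[name] = float(value)
        else:
            case[name] = value                   # discrete value, label or N/A
    return case


first = parse_case("SUBCOMPACT CARS,PONTIAC,SUNFIRE,2.2,4,Auto,4,F,26.1,R,-,2")
print(first["model"], first["city mpg"])         # SUNFIRE 26.1
```

A missing comma changes the number of values on the line, which is why the sketch checks the count before doing anything else.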

Test and cases files (optional)

Of course, the value of predictive models lies in their ability to make accurate predictions! It is difficult to judge the accuracy of a model by measuring how well it does on the cases used in its construction; the performance of the model on new cases is much more informative.

The third kind of file used by Cubist is a test file of new cases (here mpg2001.test) on which the model can be evaluated. This file is optional and has exactly the same format as the data file. In this application the 852 cases have been split randomly 70%:30% into data and test files containing 596 and 256 cases respectively.

Another optional file, the cases file (e.g. mpg2001.cases), has the same format as the data and test files. The cases file is used primarily with the public source code described later on.

Constructing Models

Once the names, data, and optional files have been set up, everything is ready to use Cubist.

The general form of the Unix command is

        cubist -f filestem [options]
This invokes Cubist with the -f option that identifies the application name (here mpg2001). If no filestem is specified using this option, Cubist uses a default filestem that is probably incorrect. (Moral: always use the -f option!)

There are several options that affect the type of model that Cubist produces and the way that it is constructed. In this section we will examine each of them, starting with the simpler situations.

Rule-based models

When Cubist is invoked with only the -f option, as
        cubist -f mpg2001
it constructs a rule-based model and produces output like this:
Cubist [Release 2.05]  Tue Mar 11 12:51:25 2008
---------------------

    Options:
        Application `mpg2001'

    Target attribute `city mpg'

    Replacing unknown attribute values:
        `gears' by 4.5

Read 596 cases (13 attributes) from mpg2001.data

Model:

  Rule 1: [56 cases, mean 15.74, range 12.4 to 19.8, est err 0.66]

    if
        class in {MIDSIZE STATION WAGONS, MINICOMPACT CARS,
                  SMALL STATION WAGONS, SPEC PURP VEH - SUV,
                  STANDARD PICKUP TRUCKS, VANS CARGO TYPE, VANS PASSENGER TYPE}
        manufacturer in {ACURA, BMW, CHEVROLET, CHRYSLER, FORD, GMC, HYUNDAI,
                         JAGUAR, JEEP, LEXUS, LINCOLN-MERCURY, MAZDA,
                         MERCEDES-BENZ, MERCURY, MITSUBISHI, SAAB, SATURN,
                         SUBARU, TOYOTA, VOLVO}
        cylinders > 6
    then
        city mpg = 31.68 - 2.87 displ - 0.54 valves/cyl - 0.04 cylinders

  Rule 2: [73 cases, mean 17.07, range 13.2 to 24.7, est err 0.73]

    if
        class in {MIDSIZE STATION WAGONS, MINICOMPACT CARS,
                  SMALL STATION WAGONS, SPEC PURP VEH - SUV,
                  STANDARD PICKUP TRUCKS, VANS CARGO TYPE, VANS PASSENGER TYPE}
        manufacturer in {AUDI, DODGE, HONDA, INFINITI, ISUZU, LAND ROVER,
                         LINCOLN, NISSAN, PORSCHE, SUZUKI, VOLKSWAGEN}
        displ > 2
    then
        city mpg = 35.53 - 2.28 cylinders - 20.5 displ/cyl + 1.53 displ
                   + 0.5 gears

  Rule 3: [43 cases, mean 17.37, range 12.3 to 20.8, est err 0.85]

    if
        class in {MIDSIZE STATION WAGONS, MINICOMPACT CARS,
                  SMALL STATION WAGONS, SPEC PURP VEH - SUV,
                  STANDARD PICKUP TRUCKS, VANS CARGO TYPE, VANS PASSENGER TYPE}
        manufacturer in {ACURA, BMW, CHEVROLET, CHRYSLER, FORD, GMC, HYUNDAI,
                         JEEP, LINCOLN-MERCURY, MAZDA, MITSUBISHI, SAAB, TOYOTA,
                         VOLVO}
        displ > 2
        cylinders <= 6
        transmission = 4
    then
        city mpg = 24.48 - 2.81 displ + 7 displ/cyl - 0.19 cylinders

  Rule 4: [31 cases, mean 18.00, range 9.4 to 23.5, est err 1.13]

    if
        class in {COMPACT CARS, LARGE CARS, MIDSIZE CARS, SMALL PICKUP TRUCKS,
                  SPEC PURP VEH - MINIVAN, SUBCOMPACT CARS, TWO SEATERS}
        manufacturer in {ACURA, AUDI, BENTLEY, DAEWOO, FERRARI, GMC,
                         LAMBORGHINI, MERCURY, PORSCHE, R-R MTR CARS LTD, SAAB}
        displ > 2
    then
        city mpg = 28.08 - 1.47 displ - 0.71 cylinders

  Rule 5: [71 cases, mean 19.83, range 15.1 to 24.9, est err 1.09]

    if
        class in {MIDSIZE STATION WAGONS, MINICOMPACT CARS,
                  SMALL STATION WAGONS, SPEC PURP VEH - SUV,
                  STANDARD PICKUP TRUCKS, VANS CARGO TYPE, VANS PASSENGER TYPE}
        manufacturer in {ACURA, BMW, CHEVROLET, CHRYSLER, FORD, GMC, HYUNDAI,
                         JEEP, LINCOLN-MERCURY, MAZDA, MITSUBISHI, SAAB, TOYOTA,
                         VOLVO}
        displ > 2
        cylinders <= 6
        transmission in {F, R}
    then
        city mpg = 22.22 - 3.26 displ + 11.7 displ/cyl + 0.48 valves/cyl

  Rule 6: [176 cases, mean 21.35, range 12.5 to 28.4, est err 1.05]

    if
        class in {COMPACT CARS, LARGE CARS, MIDSIZE CARS, SMALL PICKUP TRUCKS,
                  SPEC PURP VEH - MINIVAN, SUBCOMPACT CARS, TWO SEATERS}
        manufacturer in {BMW, BUICK, CADILLAC, CHEVROLET, CHRYSLER, DODGE, FORD,
                         HONDA, HYUNDAI, IMPCO, INFINITI, ISUZU, JAGUAR, LEXUS,
                         LINCOLN-MERCURY, MAZDA, MERCEDES-BENZ, MITSUBISHI,
                         NISSAN, OLDSMOBILE, PONTIAC, SATURN, SUBARU, TOYOTA,
                         VOLKSWAGEN, VOLVO}
        displ > 2
    then
        city mpg = 30.19 - 0.9 cylinders - 0.77 displ - 0.25 valves/cyl

  Rule 7: [12 cases, mean 22.31, range 18.2 to 26.4, est err 1.37]

    if
        class in {MIDSIZE STATION WAGONS, MINICOMPACT CARS,
                  SMALL STATION WAGONS, SPEC PURP VEH - SUV,
                  STANDARD PICKUP TRUCKS, VANS CARGO TYPE, VANS PASSENGER TYPE}
        manufacturer in {LEXUS, MERCEDES-BENZ, MERCURY, SATURN, SUBARU}
        displ > 2
        cylinders <= 6
    then
        city mpg = 44.73 - 2.47 cylinders - 17.5 displ/cyl

  Rule 8: [45 cases, mean 24.90, range 20.6 to 31.4, est err 1.59]

    if
        manufacturer in {AUDI, DAEWOO, HYUNDAI, KIA, SAAB, VOLKSWAGEN}
        displ <= 2
        fuel in {R, P}
    then
        city mpg = 42.22 + 13.22 displ - 84.2 displ/cyl - 0.79 valves/cyl

  Rule 9: [20 cases, mean 25.46, range 24.3 to 28, est err 0.59]

    if
        manufacturer in {ACURA, CHEVROLET, DODGE, FORD, HONDA, INFINITI,
                         LINCOLN-MERCURY, MAZDA, MITSUBISHI, NISSAN, PLYMOUTH,
                         SATURN, SUZUKI, TOYOTA, VOLVO}
        displ <= 2
        transmission in {R, 4}
    then
        city mpg = 38.18 - 6.49 displ

  Rule 10: [19 cases, mean 28.27, range 24.1 to 35.3, est err 1.42]

    if
        manufacturer in {ACURA, DODGE, INFINITI, LINCOLN-MERCURY, MAZDA,
                         MITSUBISHI, NISSAN, VOLVO}
        displ <= 2
        transmission = F
    then
        city mpg = 48.25 - 101.7 displ + 362.5 displ/cyl + 0.28 valves/cyl
                   - 0.18 cylinders

  Rule 11: [23 cases, mean 29.43, range 24.7 to 35.6, est err 1.76]

    if
        manufacturer in {CHEVROLET, FORD, HONDA, PLYMOUTH, SATURN, SUZUKI,
                         TOYOTA}
        displ <= 2
        gears <= 4
        transmission = F
    then
        city mpg = 50.7 - 85.09 displ + 295.3 displ/cyl - 0.18 cylinders
                   + 0.2 valves/cyl

  Rule 12: [22 cases, mean 33.99, range 25.7 to 67.4, est err 2.32]

    if
        manufacturer in {CHEVROLET, FORD, HONDA, PLYMOUTH, SATURN, SUZUKI,
                         TOYOTA}
        displ <= 2
        gears > 4
        transmission = F
    then
        city mpg = 97.56 - 119.69 displ + 410.6 displ/cyl - 6.73 gears

  Rule 13: [5 cases, mean 43.10, range 38 to 46.5, est err 5.10]

    if
        fuel = D
    then
        city mpg = 46.5


Evaluation on training data (596 cases):

    Average  |error|               0.96
    Relative |error|               0.24
    Correlation coefficient        0.97


        Attribute usage:
          Conds  Model

           99%           manufacturer
           90%    99%    displ
           78%           class
           33%           transmission
           31%    96%    cylinders
            8%           fuel
            8%    16%    gears
                  65%    valves/cyl
                  52%    displ/cyl


Evaluation on test data (256 cases):

    Average  |error|               1.15
    Relative |error|               0.32
    Correlation coefficient        0.93


Time: 0.1 secs
The first part identifies the version of Cubist, the run date, the options with which the system was invoked, and the attribute that contains the target value.

Now we come to the training data. Some attribute values might be missing; if so, Cubist replaces them by the most probable values. Missing values of continuous attributes are replaced by the mean of the known values for that attribute, while the replacement for missing discrete values is the most frequent attribute value. Any such replacements are noted on the output. Here gears is the only explicitly-defined attribute whose value is missing for some cases in mpg2001.data; those cases are given the average value (a rather unrealistic 4.5). The same values are also used to replace missing values in any test cases, although the messages are not repeated.

Cubist constructs a model from the 596 training cases in the file mpg2001.data, and this appears next. A model consists of an unordered collection of rules, each of the form

    if conditions then linear model
A rule indicates that, whenever a case satisfies all the conditions, the linear model is appropriate for predicting the value of the target attribute. (If two or more rules apply to a case, then the values are averaged to arrive at a final prediction.) Each rule also carries some descriptive information: the number of training cases that satisfy the rule's conditions, their target values' mean and range, and a (rather erratic) estimate of the expected error magnitude of predictions made by the rule. Within the linear model, the attributes are ordered in decreasing relevance to the result.

Let's illustrate all this on Rule 1 above. There are three conditions:

	class in {MIDSIZE STATION WAGONS, MINICOMPACT CARS,
	      SMALL STATION WAGONS, SPEC PURP VEH - SUV,
	      STANDARD PICKUP TRUCKS, VANS CARGO TYPE, VANS PASSENGER TYPE}
	manufacturer in {ACURA, BMW, CHEVROLET, CHRYSLER, FORD, GMC, HYUNDAI,
		     JAGUAR, JEEP, LEXUS, LINCOLN-MERCURY, MAZDA,
		     MERCEDES-BENZ, MERCURY, MITSUBISHI, SAAB, SATURN,
		     SUBARU, TOYOTA, VOLVO}
	cylinders > 6
Among the 596 training cases there are 56 that satisfy all three conditions and their mpg values range from 12.4 to 19.8 with an average value of 15.74. Cubist finds that the target value of these or other cases satisfying the conditions can be modeled by the formula
	city mpg = 31.68 - 2.87 displ - 0.54 valves/cyl - 0.04 cylinders
with an estimated error of 0.66. For cases covered by this rule, displ has the most effect on fuel consumption, valves/cyl a lesser effect, and cylinders the least effect.
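
The way overlapping rules combine can be sketched in Python using Rules 9 and 13 from the output above. This is a simplified illustration rather than Cubist's implementation: it averages the linear models of all applicable rules, as described earlier, and the function names, the default argument and the sample case are hypothetical.

```python
RULE_9_MAKERS = {"ACURA", "CHEVROLET", "DODGE", "FORD", "HONDA", "INFINITI",
                 "LINCOLN-MERCURY", "MAZDA", "MITSUBISHI", "NISSAN", "PLYMOUTH",
                 "SATURN", "SUZUKI", "TOYOTA", "VOLVO"}


def rule_9_applies(case):
    return (case["manufacturer"] in RULE_9_MAKERS
            and case["displ"] <= 2
            and case["transmission"] in {"R", "4"})


def rule_9_model(case):
    return 38.18 - 6.49 * case["displ"]


def rule_13_applies(case):
    return case["fuel"] == "D"


def rule_13_model(case):
    return 46.5


RULES = [(rule_9_applies, rule_9_model), (rule_13_applies, rule_13_model)]


def predict(case, rules=RULES, default=None):
    """Average the linear models of every rule whose conditions all hold."""
    estimates = [model(case) for applies, model in rules if applies(case)]
    if not estimates:
        return default          # no rule covers this case
    return sum(estimates) / len(estimates)


# A hypothetical diesel car that satisfies both rules.
car = {"manufacturer": "FORD", "displ": 2.0, "transmission": "R", "fuel": "D"}
print(round(predict(car), 2))   # average of 25.2 (Rule 9) and 46.5 (Rule 13)
```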

There may appear to be something wrong with the last rule:

  Rule 13: [5 cases, mean 43.10, range 38 to 46.5, est err 5.10]

    if
        fuel = D
    then
        city mpg = 46.5
The constant value predicted for these cases is 46.5, but the mean value for the cases is 43.10! This is not an error -- Cubist attempts to minimize average error magnitude, and so uses the median value rather than the mean.
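
A short Python check shows why the median is the better constant when the goal is a small average error magnitude. The five target values are hypothetical: they were chosen only to match the mean (43.10) and range (38 to 46.5) reported for Rule 13, since the actual cases are not listed here.

```python
from statistics import mean, median

# Hypothetical city mpg values consistent with Rule 13's summary line.
values = [38.0, 38.0, 46.5, 46.5, 46.5]


def mean_abs_error(prediction, targets):
    """Average error magnitude when one constant is predicted for every case."""
    return sum(abs(t - prediction) for t in targets) / len(targets)


print(mean(values), median(values))
print(mean_abs_error(mean(values), values))     # error when predicting the mean
print(mean_abs_error(median(values), values))   # error when predicting the median
```

With these values, predicting the median gives an average error magnitude of about 3.4, against about 4.08 for the mean.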

The next section covers the evaluation of this model shown in the second part of the output. Before we leave this output, though, the final line states the elapsed time for the run. For small applications such as this, with only a few hundred training cases and a handful of attributes, a model is produced quite quickly. Model construction can take much longer for larger applications with many thousands of cases and tens or hundreds of attributes. The progress of Cubist on long runs can be monitored by examining the last few lines of the temporary file filestem.tmp (e.g. mpg2001.tmp). This file displays the stage that Cubist has reached and, for most stages, gives an indication of the fraction of the stage that has been completed.

Evaluation

Models constructed by Cubist are evaluated on the training data from which they were generated, and also on a separate file of unseen test cases if this is present. (Evaluation by cross-validation is discussed elsewhere.) Results on the cases in mpg2001.data are:

Evaluation on training data (596 cases):

    Average  |error|               0.96
    Relative |error|               0.24
    Correlation coefficient        0.97
The average error magnitude is straightforward enough. The relative error magnitude is the ratio of the average error magnitude to the error magnitude that would result from always predicting the mean value; for useful models, this should be less than 1! The correlation coefficient measures the agreement between the cases' actual values of the target attribute and those values predicted by the model.
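
The three measures can be written out in a few lines of Python. This sketch rests on two assumptions that the text does not settle: the mean used as the baseline for the relative error is taken over the same cases being evaluated, and the correlation coefficient is the usual Pearson coefficient. The function names are hypothetical.

```python
from math import sqrt


def average_error(actual, predicted):
    """Average error magnitude: mean of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def relative_error(actual, predicted):
    """Average error divided by the error of always predicting the mean."""
    baseline = sum(actual) / len(actual)
    baseline_error = sum(abs(a - baseline) for a in actual) / len(actual)
    return average_error(actual, predicted) / baseline_error


def correlation(actual, predicted):
    """Pearson correlation between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / sqrt(var_a * var_p)


# Small made-up example: errors of 2, 2 and 3 around a mean of 20.
actual = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
print(average_error(actual, predicted))
print(relative_error(actual, predicted))
print(correlation(actual, predicted))
```

A relative error of 1 means that the model is no better than always predicting the mean, which is why useful models should score below 1.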

For some applications, particularly those with many attributes, it may be useful to know how individual attributes contribute to the model. This is shown in the next part of the output:

        Attribute usage:
          Conds  Model

           99%           manufacturer
           90%    99%    displ
           78%           class
           33%           transmission
           31%    96%    cylinders
            8%           fuel
            8%    16%    gears
                  65%    valves/cyl
                  52%    displ/cyl
The first column shows the approximate percentage of cases for which the attribute concerned appears in a condition of an applicable rule, while the second column gives the percentage of cases for which the attribute appears in the linear model of an applicable rule. The second entry, for example, says that displ is used in the condition part of rules that cover 90% of cases and in the models of rules that cover 99% of cases. Attributes for which both these values are less than 1% are not shown.
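
One plausible reading of these two columns can be sketched in Python. Cubist describes its own figures as approximate and does not spell out the calculation, so the function attribute_usage below is an assumption rather than Cubist's procedure. The rules are an abridged Rule 9 and Rule 13, and the five cases are hypothetical.

```python
from collections import Counter

RULES = [
    {   # abridged form of Rule 9 (manufacturer condition omitted)
        "applies": lambda c: c["displ"] <= 2 and c["transmission"] in {"R", "4"},
        "cond_attrs": {"displ", "transmission"},
        "model_attrs": {"displ"},
    },
    {   # Rule 13: constant model, so no attribute appears in the model
        "applies": lambda c: c["fuel"] == "D",
        "cond_attrs": {"fuel"},
        "model_attrs": set(),
    },
]

CASES = [  # hypothetical cases
    {"displ": 1.9, "transmission": "R", "fuel": "D"},   # both rules apply
    {"displ": 1.8, "transmission": "4", "fuel": "R"},   # Rule 9 only
    {"displ": 1.9, "transmission": "F", "fuel": "D"},   # Rule 13 only
    {"displ": 3.0, "transmission": "F", "fuel": "R"},   # no rule applies
    {"displ": 2.5, "transmission": "R", "fuel": "D"},   # Rule 13 only
]


def attribute_usage(rules, cases):
    """Percentage of cases whose applicable rules mention each attribute."""
    conds, model = Counter(), Counter()
    for case in cases:
        applicable = [r for r in rules if r["applies"](case)]
        # Count an attribute once per case, however many rules mention it.
        conds.update(set().union(*(r["cond_attrs"] for r in applicable)))
        model.update(set().union(*(r["model_attrs"] for r in applicable)))
    total = len(cases)
    return ({a: 100 * n / total for a, n in conds.items()},
            {a: 100 * n / total for a, n in model.items()})


conds_pct, model_pct = attribute_usage(RULES, CASES)
print(conds_pct)
print(model_pct)
```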

If a test file is present, Cubist produces a summary similar to that for the training cases:

Evaluation on test data (256 cases):

    Average  |error|               1.15
    Relative |error|               0.32
    Correlation coefficient        0.93
Cubist also generates a file filestem.pred (here mpg2001.pred) that shows the actual and predicted value for each test case. The first few lines of this file generated from the run above are:
(Default value 21.26)

   Actual  Predicted    Case
    Value      Value
 --------  ---------    ----
     28.0      28.70    SW
     21.5      19.85    BOXSTER
     25.2      24.18    GTI
     22.3      21.86    COUGAR
     21.7      20.64    E320 4MATIC (WAGON)
Notice that each case is identified by its value of the label attribute; if there is no such attribute, the case number in the .test file is used instead.

Composite models

For some applications, the predictive accuracy of a rule-based model can be improved by combining it with an instance-based or nearest-neighbor model. The latter predicts the target value of a new case by finding the n most similar cases in the training data, and averaging their target values.

Cubist employs an unusual method for combining rule-based and instance-based models. Cubist finds the n training cases that are "nearest" (most similar) to the case in question. Then, rather than averaging their target values directly, Cubist first adjusts these values using the rule-based model. Here's how it works:

Suppose that x is the case whose unknown target value is to be predicted, and y is one of x's nearest neighbors in the training data. The target value of y is known: let us call it T(y). The rule-based model can be used to predict target values for any case, so let its predictions for x and y be M(x) and M(y) respectively. The model then predicts that the difference between the target values of x and y is M(x)-M(y). The value of x predicted by neighbor y is adjusted to reflect this difference, so that Cubist uses T(y)+M(x)-M(y) instead of y's raw target value. (This is described in more detail in the paper "Combining instance-based and model-based learning", Proceedings of the Tenth International Conference on Machine Learning, pages 236-243, Morgan Kaufmann Publishers, San Francisco, 1993.)
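
The adjustment can be written as a few lines of Python. This sketch assumes that the adjusted values are combined by a plain unweighted mean; any weighting of neighbors that Cubist may apply is not described here. The function name and the numbers are hypothetical.

```python
def composite_prediction(model_x, neighbors):
    """Combine neighbors' targets after adjusting them with the rule-based model.

    model_x   -- M(x), the rule-based prediction for the new case x
    neighbors -- list of (T(y), M(y)) pairs, one per nearest neighbor y
    """
    adjusted = [t_y + model_x - m_y for t_y, m_y in neighbors]
    return sum(adjusted) / len(adjusted)


# Two hypothetical neighbors: the model underestimates the first by 1.0
# and overestimates the second by 1.0.
print(composite_prediction(21.0, [(20.0, 19.0), (22.0, 23.0)]))  # 21.0
```

Each neighbor effectively contributes its residual T(y)-M(y) as a correction to M(x), so a neighbor that the rule-based model fits perfectly leaves the prediction at M(x).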

The option -i instructs Cubist to use composite models of this type. Alternatively, the option -a allows the decision regarding which kind of model to use -- rule-based or composite -- to be left to Cubist itself. In the latter case, Cubist derives from the training data a heuristic estimate of the accuracy of each type of model, and chooses the form that appears more accurate. The derivation of these estimates requires quite a lot of computation, so leaving the decision to Cubist can result in a noticeable increase in the time required to build a model.

Now for the value of n, the number of nearest neighbors to be used. The option -n neighbors sets the number directly; the allowable range is from 1 to 9. If the value is not specified in this way, Cubist will choose an appropriate value in the range.

To continue the illustration: when Cubist is allowed to choose a model type on the basis of the 596 training cases and the number of nearest neighbors is not specified, it opts for a composite model using a single nearest neighbor. The rule-based model itself is unchanged, but the composite model gives different results on the training and test cases, the latter being

Evaluation on test data (256 cases):

    Average  |error|               0.89
    Relative |error|               0.25
    Correlation coefficient        0.95
The performance of the composite model on the test cases in mpg2001.test thus improves upon that of the rule-based model alone, average error magnitude falling from 1.15 to 0.89.

Nearest neighbor models are adversely affected by the presence of irrelevant attributes. All attributes are taken into account when evaluating the similarity of two cases and irrelevant attributes introduce a random factor into this measurement. As a result, composite models are most effective when the number of attributes is relatively small and all attributes are relevant to the prediction task.

Committee models

In addition to the composite rule-based/nearest neighbor models discussed above, Cubist can also generate committee models made up of several rule-based models. Each member of the committee predicts the target value for a case and the members' predictions are averaged to give a final prediction.

The first member of a committee model is always exactly the same as the model generated without the committee option. The second member is a rule-based model designed to correct the predictions of the first member; if the first member's prediction is too low for a case, the second member will attempt to compensate by predicting a higher value. The third member tries to correct the predictions of the second member, and so on. The recommended number of members is five, a value that balances the benefits of the committee approach against the cost of generating extra models.
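
The averaging step is simple enough to show directly. The sketch below does not show how Cubist trains each member, which is not detailed here; it only shows how the members' predictions are combined. The function name and all numbers are hypothetical.

```python
def committee_prediction(member_predictions):
    """Average the predictions made by the members of a committee."""
    if not member_predictions:
        raise ValueError("a committee needs at least one member")
    return sum(member_predictions) / len(member_predictions)


# Hypothetical case whose true value is 25.0: the first member predicts too
# low, the second compensates by predicting higher, and so on.
members = [23.0, 26.4, 24.9, 25.3, 25.1]
print(round(committee_prediction(members), 2))
```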

The option -C members causes Cubist to construct a model committee and specifies the number of committee members. When this option is invoked with five members, the results show a smaller improvement than that obtained with composite models:

Evaluation on test data (256 cases):

    Average  |error|               1.00
    Relative |error|               0.28
    Correlation coefficient        0.93

Committee models are of most benefit when single models are reasonably accurate, so they are more useful for fine-tuning good models than for overcoming the deficiencies of poor models. Finally, committee models can be used in conjunction with composite models if desired.

Simplicity-accuracy trade-off

Cubist employs heuristics that try to simplify models without substantially reducing their predictive accuracy. In some applications, however, it might be desirable to generate simpler models -- for instance, when the models must be very easy to understand. Of course, over-simplified models usually have lower predictive accuracy so there is a trade-off between simplicity and utility.

The complexity of a model can be controlled by restricting the number of rules that it may contain. The option -r rules sets the maximum number of rules that may be used in a model. For the mpg2001 application, setting the maximum number of rules to 5 gives a simpler model:

  Rule 1: [182 cases, mean 18.15, range 12.3 to 26.4, est err 1.10]

    if
        class in {MIDSIZE STATION WAGONS, MINICOMPACT CARS,
                  SMALL STATION WAGONS, SPEC PURP VEH - SUV,
                  STANDARD PICKUP TRUCKS, VANS CARGO TYPE, VANS PASSENGER TYPE}
        manufacturer in {ACURA, BMW, CHEVROLET, CHRYSLER, FORD, GMC, HYUNDAI,
                         JAGUAR, JEEP, LEXUS, LINCOLN-MERCURY, MAZDA,
                         MERCEDES-BENZ, MERCURY, MITSUBISHI, SAAB, SATURN,
                         SUBARU, TOYOTA, VOLVO}
        displ > 2
    then
        city mpg = 35.96 - 1.59 cylinders - 12.6 displ/cyl

  Rule 2: [89 cases, mean 18.49, range 13.2 to 30.1, est err 1.20]

    if
        class in {MIDSIZE STATION WAGONS, MINICOMPACT CARS,
                  SMALL STATION WAGONS, SPEC PURP VEH - SUV,
                  STANDARD PICKUP TRUCKS, VANS CARGO TYPE, VANS PASSENGER TYPE}
        manufacturer in {AUDI, DODGE, HONDA, INFINITI, ISUZU, LAND ROVER,
                         LINCOLN, NISSAN, PORSCHE, SUZUKI, VOLKSWAGEN}
    then
        city mpg = 35.53 - 2.28 cylinders - 20.5 displ/cyl + 1.53 displ
                   + 0.5 gears

  Rule 3: [97 cases, mean 18.96, range 9.4 to 29.1, est err 1.26]

    if
        manufacturer in {ACURA, AUDI, BENTLEY, DAEWOO, FERRARI, GMC,
                         LAMBORGHINI, MERCURY, PORSCHE, R-R MTR CARS LTD, SAAB}
    then
        city mpg = 28.08 - 1.47 displ - 0.71 cylinders

  Rule 4: [176 cases, mean 21.35, range 12.5 to 28.4, est err 1.05]

    if
        class in {COMPACT CARS, LARGE CARS, MIDSIZE CARS, SMALL PICKUP TRUCKS,
                  SPEC PURP VEH - MINIVAN, SUBCOMPACT CARS, TWO SEATERS}
        manufacturer in {BMW, BUICK, CADILLAC, CHEVROLET, CHRYSLER, DODGE, FORD,
                         HONDA, HYUNDAI, IMPCO, INFINITI, ISUZU, JAGUAR, LEXUS,
                         LINCOLN-MERCURY, MAZDA, MERCEDES-BENZ, MITSUBISHI,
                         NISSAN, OLDSMOBILE, PONTIAC, SATURN, SUBARU, TOYOTA,
                         VOLKSWAGEN, VOLVO}
        displ > 2
    then
        city mpg = 30.19 - 0.9 cylinders - 0.77 displ - 0.25 valves/cyl

  Rule 5: [134 cases, mean 28.41, range 20.6 to 67.4, est err 3.39]

    if
        displ <= 2
    then
        city mpg = 50.26 - 46.8 displ/cyl - 0.56 cylinders + 0.56 displ

The downside in this example is that the average error magnitude on the test cases increases from 1.15 to 1.34.
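
To make the form of these rules concrete, the following Python sketch evaluates the linear model of Rule 5 for one case. The case values (a 2.0-litre, four-cylinder car) are chosen only for illustration, the function names are hypothetical, and the sketch ignores every other rule and any later adjustment that Cubist may apply to the computed value.

```python
from typing import Optional


def rule5_applies(displ: float) -> bool:
    """Condition of Rule 5 in the five-rule model: displ <= 2."""
    return displ <= 2


def rule5_city_mpg(displ: float, cylinders: int, displ_per_cyl: float) -> float:
    """Linear model attached to Rule 5, copied from the listing above."""
    return 50.26 - 46.8 * displ_per_cyl - 0.56 * cylinders + 0.56 * displ


# Illustrative case: 2.0 litres spread over 4 cylinders gives 0.5 litres per cylinder.
case = {"displ": 2.0, "cylinders": 4, "displ_per_cyl": 0.5}

estimate: Optional[float] = None
if rule5_applies(case["displ"]):
    estimate = rule5_city_mpg(**case)
    print(f"Rule 5 estimate: {estimate:.2f} mpg")
```

For this case the linear model gives about 25.74 mpg, which lies inside the range of 20.6 to 67.4 recorded for Rule 5.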

Extrapolation

The extrapolation parameter controls the extent to which predictions made by Cubist's linear models can fall outside the range of values seen in the training data. Extrapolation is inherently more risky than interpolation, where predictions must lie between the lowest and highest observed value.

The option -e extrapolation sets this extrapolation factor in the form of a percentage. Each rule records the highest and lowest target value of the training cases satisfying that rule's conditions. When the target value of a new case is predicted using the rule, the value computed from the linear model may fall outside this range. The extrapolation parameter limits the degree to which new values can lie above or below the values seen in the training data, expressed as a percentage of the range (default 10%).

For example, the lowest target value among the 182 training cases covered by Rule 1 above is 12.3 and the highest is 26.4. The range is therefore 14.1 and, under the default extrapolation limit of 10%, the value predicted by this rule for a new case cannot be lower than 10.9 (12.3 - 1.4) or higher than 27.8 (26.4 + 1.4). Any computed value that lies outside these bounds is changed to the nearer bound. If the linear model associated with Rule 1 were to predict a value of 10, say, then this would be adjusted to 10.9.

Extrapolation may be constrained even further in two situations. When all the training cases covered by a rule have target values greater than or equal to zero, the rule will never predict a value less than zero. (This restriction prevents Cubist from making silly predictions such as negative miles-per-gallon values.) Similarly, when a rule covers cases whose target values are all less than or equal to zero, the predicted value from the rule will never be positive.
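
The limits described in this section can be written out as a short Python sketch. The function name limit_prediction and its parameters are hypothetical; the logic follows the description above, using the exact 10% margin of 1.41 rather than the rounded 1.4 used in the text.

```python
def limit_prediction(value: float, lowest: float, highest: float,
                     extrapolation: float = 10.0) -> float:
    """Restrict a rule's computed value as described in the text.

    lowest and highest are the extreme target values among the training
    cases covered by the rule; extrapolation is a percentage of their range.
    """
    margin = (highest - lowest) * extrapolation / 100.0
    lower, upper = lowest - margin, highest + margin
    if lowest >= 0:        # all covered targets are non-negative
        lower = max(lower, 0.0)
    if highest <= 0:       # all covered targets are non-positive
        upper = min(upper, 0.0)
    # Any value outside the bounds is moved to the nearer bound.
    return min(max(value, lower), upper)


# Rule 1 above: targets range from 12.3 to 26.4, default limit of 10%.
adjusted = limit_prediction(10.0, 12.3, 26.4)
print(f"{adjusted:.2f}")
```

Here a computed value of 10 is raised to the lower bound of about 10.89, which the text rounds to 10.9.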

Sampling from large datasets

Even though Cubist is relatively fast, building models from a large number of cases can take an inconveniently long time. Cubist incorporates a facility to extract a random sample from a dataset, construct a model from the sample, and then test the model on a disjoint collection of cases. By using a smaller set of training cases in this way, the process of generating a model is expedited, but at the cost of a possible reduction in the model's predictive performance.

The option -S x has two consequences. Firstly, a random sample containing x% of the cases in the application's data file is used to construct the model. Secondly, the model is evaluated on a non-overlapping set of test cases consisting of another (disjoint) sample of the same size as the training set (if x is less than 50%), or all cases that were not used in the training set (if x is greater than or equal to 50%).

As an example, suppose that the application's data file contains 100,000 cases. If a sample of 10% is used, the model will be constructed from a sample of 10,000 cases and tested on a disjoint sample of 10,000 cases. Alternatively, selecting sampling with 60% will cause the model to be constructed from 60,000 cases and tested on the remaining 40,000 cases.
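
The arithmetic behind these two examples is sketched below in Python. The function sample_sizes is hypothetical, and rounding the sample size to the nearest whole case is an assumption, since the text does not say how fractional cases are handled.

```python
from typing import Tuple


def sample_sizes(total_cases: int, percent: float) -> Tuple[int, int]:
    """Return (training cases, test cases) for a sampling percentage x."""
    # Assumption: the sample size is rounded to the nearest whole case.
    train = round(total_cases * percent / 100)
    if percent < 50:
        test = train                 # a disjoint sample of the same size
    else:
        test = total_cases - train   # every case not used for training
    return train, test


print(sample_sizes(100_000, 10))
print(sample_sizes(100_000, 60))
```

The two calls reproduce the figures in the paragraph above: 10,000 training and 10,000 test cases for a 10% sample, and 60,000 training and 40,000 test cases for a 60% sample.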

By default, the random sample changes every time that a model is constructed, so that successive runs of Cubist with sampling will usually produce different results. This re-sampling can be avoided with the option -I seed, which uses the integer seed to initialize the sampling. Runs with the same value of the seed and the same sampling percentage will always use the same training cases.

Cross-validation trials

As we saw earlier, the performance of a model on the training cases from which it was constructed gives a poor estimate of its accuracy on new cases. The true predictive accuracy of the model can be estimated by sampling, as above, or by using a separate test file; either way, the model is evaluated on cases that were not used to build it. However, this estimate can be unreliable unless the numbers of cases used to build and evaluate the model are both large. If the cases in mpg2001.data and mpg2001.test were to be shuffled and divided into new training and test sets, Cubist would probably construct a different model whose accuracy on the test cases might vary considerably.

One way to get a more reliable estimate of predictive accuracy is by f-fold cross-validation. The cases (including those in the test file, if it exists) are divided into f blocks of roughly the same size and target value distribution. For each block in turn, a model is constructed from the cases in the remaining blocks and tested on the cases in the hold-out block. In this way, each case is used just once as a test case. The accuracy of a model produced from all the cases is estimated by averaging results on the hold-out cases.
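
The division into blocks can be sketched as follows in Python. The text does not describe how Cubist balances the target value distribution across blocks, so this sketch substitutes a simple stand-in: cases are sorted by target value and dealt out to the blocks in turn. The names make_blocks and cross_validation_splits are hypothetical.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def make_blocks(targets: Sequence[float], folds: int, seed: int = 0) -> List[List[int]]:
    """Divide case indices into blocks of roughly equal size and target spread."""
    rng = random.Random(seed)
    order = list(range(len(targets)))
    rng.shuffle(order)                      # random order among equal targets
    order.sort(key=lambda i: targets[i])    # stable sort by target value
    blocks: List[List[int]] = [[] for _ in range(folds)]
    for position, index in enumerate(order):
        blocks[position % folds].append(index)   # deal cases out in turn
    return blocks


def cross_validation_splits(blocks: List[List[int]]) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training indices, hold-out indices) for each block in turn."""
    for held_out, test in enumerate(blocks):
        train = [i for b, block in enumerate(blocks) if b != held_out for i in block]
        yield train, test


targets = [float(t) for t in range(23)]
blocks = make_blocks(targets, folds=5)
for train, test in cross_validation_splits(blocks):
    print(len(train), len(test))
```

Each case index appears in exactly one hold-out block, so every case is used just once as a test case.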

The option -X f runs such an f-fold cross-validation. For example, the command

	cubist -f mpg2001 -X 10

selects 10-fold cross-validation. After reporting on the model produced at each fold, the output shows a summary like this:

Summary:

    Average  |error|               1.31
    Relative |error|               0.34
    Correlation coefficient        0.89

The file filestem.pred once again contains a case-by-case record of the actual and predicted values on the unseen cases.

As with sampling above, each cross-validation run will normally use a different random division of the data into blocks, unless this is prevented by using the -I option.

The cross-validation procedure can be repeated for different random partitions of the cases into blocks. The average error from these distinct cross-validations is then an even more reliable estimate of the error of the model produced from all the cases. A shell script and associated programs for carrying out multiple cross-validations are included with Cubist. The shell script xval is invoked with any combination of Cubist options and some further options that describe the cross-validations themselves:

If detailed results are retained via the +d option, they appear in files named filestem.oi[+suffix] where i is the cross-validation number (0 to repeats-1). A summary of the cross-validations is written to file filestem.res[+suffix].

As an example, the command

	xval -f mpg2001 -a R=10 +new

runs ten complete 10-fold cross-validations (and so constructs 100 models in all), allowing Cubist to choose between rule-based and composite models, and gives the following results in file mpg2001.res+new:

Summary:
--------

    Average  |error|               0.99
    Relative |error|               0.26
    Correlation coefficient        0.93

Since a single cross-validation fold uses only part of the application's data, running a cross-validation does not cause a model to be saved. To save a model for later use, simply run Cubist without employing cross-validation.

Weighting individual cases

By default, all training cases are treated equally when a model is constructed. In some applications, however, it may be desirable to assign different importance to the cases. Cubist achieves this by recognizing an optional attribute that gives the weight of each case. The attribute name must be case weight and it must be of type continuous. The relative weight assigned to each case is its value of this attribute divided by the average value; if the value is undefined ("?"), not applicable ("N/A"), or is less than or equal to zero, the case's relative weight is set to 1.
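
The rule for relative weights can be illustrated with a short Python sketch. The function relative_weights is hypothetical, None stands for an unknown ("?") or not applicable ("N/A") value, and taking the average over the usable (positive) values only is an assumption, since the text does not say which values enter the average.

```python
from typing import List, Optional, Sequence


def relative_weights(values: Sequence[Optional[float]]) -> List[float]:
    """Convert raw case weight values into relative weights."""
    # Assumption: only usable (known and positive) values enter the average.
    usable = [v for v in values if v is not None and v > 0]
    if not usable:
        return [1.0] * len(values)
    average = sum(usable) / len(usable)
    return [v / average if v is not None and v > 0 else 1.0 for v in values]


# Two usable values, one unknown value and one non-positive value.
weights = relative_weights([2.0, 4.0, None, -1.0])
print(weights)
```

Under this assumption the two usable values are divided by their average of 3, while the unknown and non-positive values both receive a relative weight of 1.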

The case weight attribute itself is not used in the model!

To illustrate the idea, we will apply case weights to our example by adding an implicitly-defined attribute to mpg2001.names as follows:

case weight := displ.

This means that the importance of a training case varies as the total displacement of that case. Cubist will now attempt to minimize weighted error, so cars with higher displacements should have more influence on the new model. The following table shows the results from the case-weighted and original (unweighted) models for the seven vehicles in the unseen test cases that have the highest displacement:

The average error of the case-weighted model is 0.08, somewhat lower than the average error of 0.43 given by the original model.

A cautionary note: The use of case weighting does not guarantee that the model will be more accurate for unseen cases with higher weights. Predictive accuracy on more important cases is likely to be improved only when cases with similar values of the predictor attributes also have similar values of the case weight attribute, i.e. when relatively important cases "clump together." Without this property, case weighting can introduce an unhelpful element of randomness into the model generation process.

Linux GUI

Linux users who have installed a recent version of Wine can invoke a slightly simplified version of the Windows user interface. The executable program gui starts the graphical user interface, whose main window has five buttons:

Locate Data
invokes a browser to find the files for your application, or to change the current application;
Build Model
selects the type of model to be constructed and sets other options;
Stop
interrupts the model-generating process;
Review Output
re-displays the output from the most recent model (if any), saved automatically in a file filestem.out; and
Cross-Reference
shows how cases in training or test data relate to (parts of) a model and vice versa.
For more details on these, please see the Windows tutorial.

The graphical interface calls Cubist directly, so use of the GUI has minimal impact on the time taken to construct a Cubist model.

Please note: Cubist should be run for the first time from the command-line interface, not the GUI. The first run installs the licence ID; after that has been done, Cubist can be used from either interface.

Linking to Other Programs

The most recent model generated by Cubist is saved in file filestem.model. Free C source code is available to read these model files and to make predictions with them, enabling you to use Cubist models in other programs.

As an example, the source includes a program sample.c that reads a saved model file and then prints the model's predicted value (with optional error bounds) for each case in a cases file. Please see the comments at the beginning of sample.c for information on compilation and usage.

The C source code can be downloaded as a gzipped tar file.


Appendix: Summary of Options

© RULEQUEST RESEARCH 2008 Last updated March 2008

