Notes from Previous Releases

Release 2.08

Rule utility analysis

Release 2.08 calculates the utility of each rule as measured by the additional error on the training set that would result from removing that rule. Rules with insignificant utility are deleted.

Previous releases ordered model rules by the average target value of the cases covered by each rule. Release 2.08 instead shows rules starting with the most useful and ending with the least.
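
As a rough illustration of the idea, the following Python sketch computes a utility for each rule as the increase in mean absolute training error when that rule is removed, then lists the rules from most to least useful. The rule representation, the averaging of overlapping rules, and the fallback to a default value are all assumptions made for this example; they are not taken from Cubist itself.

```python
from statistics import mean

# Hypothetical rule representation: (name, condition, linear model).
def predict(rules, case, default):
    """Average the predictions of all rules covering the case (an assumption)."""
    preds = [model(case) for _, cond, model in rules if cond(case)]
    return mean(preds) if preds else default

def mean_abs_error(rules, cases, default):
    return mean(abs(predict(rules, c, default) - c["y"]) for c in cases)

def rule_utilities(rules, cases, default):
    """Utility of a rule = training error without it minus training error with it."""
    base = mean_abs_error(rules, cases, default)
    utils = {}
    for i, (name, _, _) in enumerate(rules):
        reduced = rules[:i] + rules[i + 1:]
        utils[name] = mean_abs_error(reduced, cases, default) - base
    return utils

cases = [{"x": 1, "y": 2.0}, {"x": 2, "y": 4.0},
         {"x": 8, "y": 9.0}, {"x": 9, "y": 10.0}]
rules = [
    ("low", lambda c: c["x"] <= 5, lambda c: 2.0 * c["x"]),
    ("high", lambda c: c["x"] > 5, lambda c: c["x"] + 1.0),
    ("dup", lambda c: c["x"] > 5, lambda c: c["x"] + 1.0),  # redundant copy of "high"
]
default = mean(c["y"] for c in cases)

utilities = rule_utilities(rules, cases, default)
ranked = sorted(utilities, key=utilities.get, reverse=True)
print(utilities)
print(ranked)
```

In this toy data the two identical rules each have zero utility, because removing either one leaves the other to cover the same cases; such rules are the kind that a utility analysis would mark as candidates for deletion.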

Although the utility analysis requires additional computation, Release 2.08 contains efficiency improvements sufficient to cover this cost for most datasets.

New default option values

Some default options have been changed in line with the larger datasets to which Cubist is often applied. The default extrapolation parameter has been reduced from 10% to 5%, and the default maximum rules parameter increased from 100 to 500. Of course, these new default values can still be overridden by the user.

Note: A revision of this release was issued in August 2015 to address problems when the target attribute has non-zero values of small magnitude (less than 0.0001). For applications with many such values, Cubist could overlook important patterns or display small values with too few significant digits.

Release 2.07

64-bit Windows support

This release includes 64-bit versions of Cubist and CubistX (the batch executable). These versions allow the use of more than 2GB of memory, as required by some extremely large data mining tasks.

The 32-bit release of Cubist will run under either 32-bit or 64-bit Windows, so there is no need to change unless your tasks may use more than 2GB of memory. The 64-bit version of Cubist will run only under 64-bit Windows XP, Windows Vista, or Windows 7.

The network version of Cubist includes both 32-bit and 64-bit versions for installation on client PCs. A client PC running 64-bit Windows XP/Vista/7 can install and use the 64-bit version, even if the server runs 32-bit Windows.

Cubist continues to be available in both 32-bit and 64-bit versions for Linux.

New option: unbiased rules

By default, Cubist rules attempt to minimize absolute error on unseen cases. This necessitates minimizing the median rather than the mean residual, so a Cubist rule is generally biased -- its mean prediction differs from the mean of the training cases that it covers. This new option leads to approximately unbiased rules but also to greater absolute error.

This option is recommended for applications where there are many cases with the same target value (such as zero). Unbiased rules will usually give more variation in predicted values near this common value.
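
The trade-off can be seen in a small Python sketch that uses a constant prediction in place of a rule's linear model. The data are invented for illustration: the median minimizes absolute error but is biased, while the mean is unbiased at the cost of greater absolute error.

```python
from statistics import mean, median

# Invented target values for the cases covered by one rule;
# many cases share the same value (zero).
targets = [0.0, 0.0, 0.0, 1.0, 9.0]

def mean_abs_error(constant, values):
    return mean(abs(v - constant) for v in values)

by_median = median(targets)  # minimizes absolute error
by_mean = mean(targets)      # unbiased: matches the mean of the covered cases

mae_median = mean_abs_error(by_median, targets)
mae_mean = mean_abs_error(by_mean, targets)
bias_median = by_median - mean(targets)
bias_mean = by_mean - mean(targets)

print(by_median, mae_median, bias_median)
print(by_mean, mae_mean, bias_mean)
```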

Changes to composite models

Release 2.07 is faster when it generates composite models for large applications. Cubist's procedures for setting the number of nearest neighbors have also been changed, and distances are now calculated to a higher precision.

Improved public code

Rulequest provides public source code that enables models generated by Cubist to be employed in users' programs.

  • The public code now runs in multiple threads and so can benefit from modern multi-core processors.
  • The calculation of error limits has been modified slightly.

Warning: The public code for Release 2.07 cannot be used with models produced by previous Cubist releases.

Bug fixes

For composite models, the distance to a neighbor could be under-estimated under some rarely-occurring circumstances.

The scatterplot of real versus predicted values could fail for very large applications with hundreds of thousands of cases. In such situations Cubist now shows a scatterplot for only a sample of cases, although the statistics displayed are still computed over all cases.

The public code sometimes incorrectly flagged a case as having a value outside the range observed in the training cases.

When the public code was used with the -i option and without a label attribute, incorrect nearest neighbors were shown.


Release 2.06

Changes to public code
We provide public source code to enable models generated by Cubist to be employed in users' programs. The sample program included with the code that reads cases and predicts their values has been improved, e.g.:

  • The program now flags any case whose value of a relevant attribute lies outside the range observed in the training data.
  • A new option for composite models shows the nearest neighbors for each case and their distances from the case.
  • This option can be used by itself or in conjunction with the option to estimate the error bounds of the prediction for each case.

Revised documentation
The tutorial is now based on a different application with the aim of better explaining the effects of Cubist's options for building models.

Faster committee models
Additional multi-threading allows Cubist to construct committee models more quickly for large applications.

Bug fix: composite models
Release 2.06 rectifies a quirk that could allow a composite model to make a prediction outside the permitted extrapolation range.


Release 2.05

More accurate models
Cubist models should now have somewhat lower average absolute error on unseen cases.

Improved multi-threading
Another bottleneck in Cubist's model-building algorithm has been parallelized and will now run on multiple CPUs or cores. Assignment of some tasks to processors has also been adjusted to balance loads better and so reduce the time taken to process larger applications.

Faster composite models
When Cubist constructs a composite model for applications with hundreds of thousands of training cases, a significant proportion of the total run time is taken up by calculating the accuracy of the model on these same cases. Instead, Release 2.05 uses a large sample of the training cases to estimate this accuracy.


Release 2.04

Attribute usage
A new summary highlights the usage of attributes that appear in a Cubist model. This shows, for each attribute, the percentage of training cases for which that attribute is used in the conditions of an applicable rule, and the percentage of cases for which it is used in an associated linear model.
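
The two percentages could be computed along the following lines. This Python sketch uses a made-up rule structure (a coverage test plus the sets of attributes appearing in the rule's conditions and in its linear model); it illustrates the definition above and is not Cubist's own code.

```python
def attribute_usage(rules, cases):
    """Return (% of cases using each attribute in conditions, % in linear models)."""
    cond_counts, model_counts = {}, {}
    for case in cases:
        applicable = [r for r in rules if r["covers"](case)]
        in_conds = set().union(*(r["cond_attrs"] for r in applicable))
        in_models = set().union(*(r["model_attrs"] for r in applicable))
        for attr in in_conds:
            cond_counts[attr] = cond_counts.get(attr, 0) + 1
        for attr in in_models:
            model_counts[attr] = model_counts.get(attr, 0) + 1
    as_pct = lambda counts: {a: 100.0 * n / len(cases) for a, n in counts.items()}
    return as_pct(cond_counts), as_pct(model_counts)

# Hypothetical rules and cases, for illustration only.
rules = [
    {"covers": lambda c: c["age"] <= 30,
     "cond_attrs": {"age"}, "model_attrs": {"income"}},
    {"covers": lambda c: c["age"] > 30 and c["income"] > 50,
     "cond_attrs": {"age", "income"}, "model_attrs": {"age"}},
]
cases = [
    {"age": 25, "income": 40},
    {"age": 45, "income": 80},
    {"age": 50, "income": 20},  # covered by no rule
    {"age": 28, "income": 90},
]

conds, models = attribute_usage(rules, cases)
print(conds)
print(models)
```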

Enhanced multi-threading
Release 2.04 will now use up to four processors and so will run faster on the new quad-core CPUs and computers with two dual-core processors.

Linux GUI
For Linux users who have installed a recent version of Wine, the new release includes an optional graphical interface with many features of the Windows version. (The cross-reference facility in particular provides information that is not available from the command-line version.) The Linux GUI calls the native Linux version of Cubist, so there is no appreciable performance penalty.

Minor improvements to models
There have been a few changes to Cubist's model-building algorithms that often lead to more compact models with slightly higher predictive accuracy.

Bug fix (Windows only)
Release 2.04a fixes a problem in the Windows GUI version, Cubist.exe, which could sometimes freeze during cross-validation. The batch mode executable CubistX.exe was not affected.


Release 2.03

Improved accuracy on larger datasets
Cubist's model-simplification heuristics have been revised, with the effect that it will usually produce more rules from large numbers of training cases. On the positive side, this increase in complexity is usually matched by increased accuracy on new cases. On one dataset containing approximately 30,000 cases, for example, cross-validated average error on unseen cases was half that of release 2.02.

Weighting individual cases
By default, all training cases are treated as equal, but particular applications may need to emphasize some data more than others. Release 2.03 provides an optional attribute that specifies the importance of each case; when this is used, Cubist attempts to minimize case-weighted error or, in other words, pays more attention to fitting more important cases.
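
One plausible reading of case-weighted error is a weighted mean of absolute errors, sketched below in Python. The normalization by the sum of the weights is an assumption made for this example; the notes above do not say how Cubist normalizes.

```python
def weighted_abs_error(predictions, actuals, weights):
    """Weighted mean absolute error; heavier cases contribute more."""
    total = sum(w * abs(p - a) for p, a, w in zip(predictions, actuals, weights))
    return total / sum(weights)

predictions = [1.0, 2.0, 3.0]
actuals = [1.0, 4.0, 3.0]

equal = weighted_abs_error(predictions, actuals, [1, 1, 1])
emphasized = weighted_abs_error(predictions, actuals, [1, 3, 1])
print(equal, emphasized)
```

Giving the badly predicted second case three times the weight raises the error measure, so a model fitted to minimize it would be pushed to fit that case more closely.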

Simpler control of model complexity
Previous releases provided two user-configurable parameters that affected model complexity. One of these (minimum case cover) could sometimes prevent Cubist from finding a good model and has been dropped. User control over model complexity is now achieved by a straightforward limit on the number of rules in a model, with default value 100.

Error bound option for public code
The free C code for reading and interpreting Cubist 2.03 models has an option to estimate error bounds. If this option is invoked, each predicted value is shown as value +- error, where value is the predicted value and error is the (nominally 95%) absolute error; i.e., approximately 95% of the time, the real value should lie between value - error and value + error. Now, "95%" should not be interpreted too rigorously and will certainly vary from application to application. When this option is used, the public code also shows the actual percentage of cases whose real values lie within the estimated bounds.
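
The reported percentage is simply the share of cases whose real value falls inside its own interval. A minimal Python sketch with invented numbers (the public code itself is written in C and is not reproduced here):

```python
def coverage_percent(predictions, errors, actuals):
    """Percentage of cases with value - error <= real value <= value + error."""
    inside = sum(
        1 for p, e, a in zip(predictions, errors, actuals) if p - e <= a <= p + e
    )
    return 100.0 * inside / len(actuals)

predictions = [10.0, 20.0, 30.0, 40.0]
errors = [2.0, 2.0, 2.0, 2.0]
actuals = [11.0, 23.0, 29.0, 40.0]  # the second case lies outside its bounds

pct = coverage_percent(predictions, errors, actuals)
print(pct)
```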


Release 2.02

Faster composite models
Composite models use both nearest-neighbor and rule-based prediction. When there are many training cases as potential neighbors, finding the nearest n of them can be slow. Release 2.02 is considerably faster in this regard, thanks to more powerful indexing methods and to its use of dual processors, dual cores, or Intel hyper-threading where these are available.

Smaller memory footprint
Release 2.02 requires less memory for applications with very many attributes/predictors. (You probably won't notice this, though, unless your application has hundreds of them.)

Bug fixes (Windows version)
The Windows version of Release 2.01 sometimes crashed when model construction was interrupted via the Stop button. Release 2.02 should simply stop as it's supposed to.

Release 2.02 also recovers all previous settings when an application is re-run. In Release 2.01, some of the previous settings reverted to their defaults.

Adaptation to Microsoft bug-fix (network version only)
To improve security, Microsoft Windows updates have disabled a feature that clients use to read on-line help. The client installation program has been modified to set appropriate registry entries on the client, and it also leaves a local copy of the help (CubistHelp.chm) in the Cubist folder as a workaround in case new Windows updates affect HTMLHelp.


Release 2.01

Multi-threading
The core of Cubist has been rewritten so that it can take advantage of computers with dual processors or Intel PCs with Hyper-Threading Technology. This can significantly reduce the time taken to process very large datasets.

64-bit Linux version
Cubist is now available in a 64-bit Linux version for AMD PCs with Athlon64 and Opteron CPUs, and Intel PCs with Extended Memory 64 Technology.

Bug fix
In previous releases, use of the cross-reference facility in the Windows version of Cubist could cause problems when there were errors in the cases file. These errors are now reported via pop-up messages.

New distribution format for Windows
Cubist Release 2.01 is distributed as a self-contained Inno executable.


Release 1.13

Simpler models
The push towards simpler models for large datasets that was begun in Release 1.12 has been continued in Release 1.13. The goal is still to make the models easier to understand without impairing their predictive accuracy.

On an oceanographic application with 178,000 cases and 12 attributes, for instance, the model produced by Release 1.11 has 58 rules, Release 1.12's has 46, and Release 1.13's has only 38.

Less memory, faster
The memory required to analyze larger datasets has been reduced, with the added benefit that computation is also speedier. Run times for the oceanographic application mentioned above have decreased from 23.6 seconds for Release 1.12 to 18.0 seconds for 1.13 (both measured on a 3GHz Pentium IV).

Bug fixes
For some applications with many attributes, Release 1.12 sometimes produced rules whose linear sub-models had large coefficients and these were saved to the model file with insufficient precision. Both problems have been addressed in Release 1.13.

Windows on-line help rewritten
The on-line help for the Windows version has been updated to the more modern HtmlHelp format and now corresponds closely to the tutorial available on the web.


Release 1.12

Target attribute can be defined by formula
In previous releases of Cubist, the target attribute was required to be one of the explicitly-defined attributes. Release 1.12 allows the target to be defined as a function of other attributes. As a simple example, if the data contain values for an attribute X, the target value to be modeled might be log(X).

Simpler models
Release 1.12 attempts to simplify models even further by reducing the numbers of rules. This mechanism is most noticeable with larger datasets; on an oceanographic application (178,000 cases, 12 attributes), the model produced by Release 1.11 has 58 rules while Release 1.12's has 46.

Further speed improvements
Larger datasets are processed faster by Release 1.12. Run times for the application cited above have been reduced from Release 1.11's 28.0 seconds (3GHz Pentium IV) to 23.6 seconds.

Bug fix
In previous releases, attributes with constant values could sometimes appear in linear models associated with rules. This bug had only a cosmetic effect since the values computed by the linear models were still correct, but it certainly impeded interpretation of the rules.


Release 1.11

Speed improvement
Release 1.11 is noticeably faster when processing larger datasets. For instance, the previous release took 64 seconds on a 1GHz PC to build a model from 30,000 cases with 36 attributes; Release 1.11 does the same job in just over 33 seconds.

GUI (Windows version)
There have been some small changes to the Windows GUI. In particular, the cross-reference window shows predicted values to one decimal place more than the precision of the target values.


Release 1.10

Timestamp attributes
Attributes can now have timestamp values consisting of a date and a time, e.g. 2001-04-30 13:21:10. Timestamps are accurate to the nearest minute and subtracting one timestamp from another gives the number of minutes between them.
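
The subtraction can be mimicked in Python with the standard datetime module. This is only an illustration of the arithmetic; rounding the difference to whole minutes is an assumption, since the notes do not say exactly how seconds are handled.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"

def minutes_between(later, earlier):
    """Number of minutes from one timestamp to another, rounded to a whole minute."""
    delta = datetime.strptime(later, TIMESTAMP_FORMAT) - datetime.strptime(
        earlier, TIMESTAMP_FORMAT
    )
    return round(delta.total_seconds() / 60)

gap = minutes_between("2001-04-30 15:51:10", "2001-04-30 13:21:10")
print(gap)
```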

More direct control of model complexity
The old `brevity' control has been superseded by an optional parameter specifying the maximum number of rules in a Cubist model. Models with a restricted number of rules are easier to understand but, of course, they may also have lower predictive accuracy than unrestricted models.

Estimated model error
Each rule in a Cubist model gives an estimate of the expected error when the rule is used to predict values for new cases. A bug that produced erroneously high values for this estimate has been fixed.

Simpler linear models
Linear models generated by Release 1.10 generally have fewer, simpler coefficients.


Release 1.09

Control of attributes used in models
The .names file now has a facility to restrict the attributes that can appear in models.

This allows attributes to be used in formulas defining other attributes but not directly in a model. For example, suppose that the data contain two numeric attributes A and B but background knowledge suggests that only their difference is important. It is now possible to define a new attribute

    Diff := A - B.

without allowing A or B themselves to appear in any model.

This same facility makes it much easier to experiment with restricted subsets of the attributes.

Time attributes
An attribute declared to be a `time' takes values in the form HH:MM:SS. As with dates, attributes defined by formulas can subtract one time from another to give an interval (in seconds).

Composite models
The instance-based component of composite models has been extensively revised. Two changes that you will notice are:
  • Previous releases always used five nearest neighbors. The number of neighbors can now be set to any value from 1 to 9 or, alternatively, Cubist will determine an appropriate value in this range.
  • Instances are indexed using kd-trees so that a case's nearest neighbors can now be found more quickly. The kd-tree indexing has also been incorporated into the public code.

Speed!
Several key components of Cubist have been optimized to improve, for example, their cache performance. The benefits are particularly noticeable on larger datasets -- Release 1.09 can be more than twice as fast as 1.08.


Release 1.08

Committee models
A new option is available to generate committee models. As the name implies, a committee model consists of several distinct Cubist models, all generated from the same training data. When a prediction is to be made, each model is consulted and the results from all models are averaged.

Committee models are of most value in applications for which single Cubist models are already pretty accurate.
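
The averaging step is straightforward, as the Python sketch below shows. The member models here are hypothetical stand-in functions; how Cubist derives the individual members from the training data is not covered.

```python
def committee_predict(models, case):
    """Consult every member model and average the results."""
    return sum(model(case) for model in models) / len(models)

# Three stand-in member models that disagree slightly.
members = [
    lambda c: 2.0 * c["x"],
    lambda c: 2.0 * c["x"] + 1.0,
    lambda c: 2.0 * c["x"] - 4.0,
]

prediction = committee_predict(members, {"x": 3.0})
print(prediction)
```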

New data values
A new value N/A can be used when the value of an attribute is not applicable to a case. For example, consider the attributes `purchased ticket?' with values `yes' and `no', and `ticket cost' with numeric values. If a case's value of the former is `no', the appropriate value for the latter is now `N/A'.

Dates can now be entered as either YYYY/MM/DD or YYYY-MM-DD.

Changes to saved models
Up to Release 1.07, Cubist models have been stored as binary files. From this release .model files have been changed to ASCII format, so that models generated on one machine type may be deployed on machines of another type. The source code that facilitates such deployment has also changed substantially.

To ease the changeover, Cubist and the new public code will still read model files generated by Release 1.07.

New Unix option
Cross-validation has now been incorporated directly into Cubist rather than being available only through the xval script. The -X option invokes cross-validation and specifies the number of folds.

The xval script is still used for multiple cross-validations. However, the option +d that preserves detailed outputs now saves one file for each cross-validation rather than one file for each Cubist run.

Improved error messages
Problems with application files (.names, .data, .test etc) can be corrected more easily because the error message identifies the line number of the file in question.


Release 1.07

New data types
Dates are input and output in the form YYYY/MM/DD and can be used with implicitly defined attributes to determine, for instance, the number of days between two dates or the day of the week on which a date falls.
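
For readers who want to check such values by hand, the same two quantities can be computed in Python. This is an illustration of the arithmetic only and says nothing about the formula syntax used in the names file.

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def parse_date(text):
    """Parse a date written as YYYY/MM/DD."""
    year, month, day = (int(part) for part in text.split("/"))
    return date(year, month, day)

start = parse_date("2001/04/30")
end = parse_date("2001/05/10")

days_between = (end - start).days
day_of_week = DAY_NAMES[start.weekday()]
print(days_between, day_of_week)
```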

Ordered discrete values are nominal values that have a natural ordering, such as small, medium, large, XL, XXL. When an attribute's discrete values are noted as ordered, Cubist exploits this information to test subranges of the values, e.g. [large-XXL]. This tends to produce more compact models with higher predictive accuracy.
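
A subrange test of this kind amounts to comparing positions in the declared ordering. A minimal Python sketch, in which the function name and list are invented for the example:

```python
SIZES = ["small", "medium", "large", "XL", "XXL"]  # declared order

def in_subrange(value, low, high, order=SIZES):
    """True if value lies between low and high (inclusive) in the declared order."""
    return order.index(low) <= order.index(value) <= order.index(high)

print(in_subrange("XL", "large", "XXL"))
print(in_subrange("medium", "large", "XXL"))
```

A single condition such as [large-XXL] replaces what would otherwise need a separate test for each of the three values, which is one reason the resulting models can be more compact.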

New Unix option
The random number seed can now be set, with the result that runs with sampling etc. are repeatable.

Improvements to the Windows GUI
The output window is now more readable, and can be copied and printed directly (without having to switch to WordPad).

A new button on the toolbar allows the previous output to be redisplayed.

Revision of source code
The source code for reading and interpreting models constructed by Cubist has been further revised.


Release 1.06

Attributes defined by formulas
It is sometimes convenient to define the value of an attribute as a function of other attribute values rather than by giving the value explicitly in data files. Release 1.06a allows such implicitly-defined attributes to be described by formulas in an application's names file. The formulas need not be simple -- both numeric and logical values can be introduced in this way.

New parameter controlling simplicity/accuracy tradeoff
The issue of simplicity versus accuracy is one of those things that will always be with us in data mining. Simpler models are easier to understand and convey more insight, but some applications require all the accuracy they can get and insight is not an issue.

Release 1.06a contains a new parameter that influences this tradeoff. When the brevity factor is set to a high value, Cubist will emphasize simplicity (usually at some expense to accuracy). Similarly, a low value puts a premium on accuracy, but may substantially increase model complexity. The choice is now yours!

Further improvements in models
Some fundamental changes to the model-building mechanism mean that Cubist models are now noticeably improved. The rules tend to overlap more, with the result that predictions change more smoothly as attribute values of a case are varied.

Changes to the Cubist GUI
There have been several improvements in line with suggestions made by users (and please keep them coming!):
  • A new Edit menu brings up the names file in WordPad, making it easier to change this file.
  • The model construction settings last used with an application are stored and are reset whenever that application is selected again. (There's also a new button on the dialog box to reset all of them to their default values.)
  • The main window can be clicked on top of the output window.


Release 1.05

Models that are easier to understand
The linear models associated with rules are now ordered so that attributes with a higher differential effect on the model values appear before less-important factors.

More accurate models
Cubist models have also been changed somewhat to improve their predictive accuracy. Those generated by R1.05 will sometimes have more rules than those from previous releases.

Global extrapolation limits
The predictions made by a rule are limited to an extension of the range of values observed for training cases that match the rule (see the extrapolation parameter). The same restriction now applies to "global" predictions that use instances and rules.
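
A simple way to picture the limit is as a clamp. The Python sketch below assumes that the extension on each side is the extrapolation percentage multiplied by the width of the observed range; that reading is an assumption for illustration, and the 5% figure is the default quoted for Release 2.08.

```python
def limit_prediction(raw, covered_targets, extrapolation=0.05):
    """Clamp a raw prediction to the observed target range plus a margin."""
    low, high = min(covered_targets), max(covered_targets)
    margin = extrapolation * (high - low)  # assumed form of the extension
    return max(low - margin, min(high + margin, raw))

observed = [20.0, 60.0, 120.0]  # target values of the cases matching a rule

print(limit_prediction(130.0, observed))  # above the permitted range
print(limit_prediction(50.0, observed))   # inside the range, unchanged
print(limit_prediction(0.0, observed))    # below the permitted range
```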

Improved model construction dialog box (Windows version)
This has been revised so that options can be specified more easily from the keyboard.


Release 1.04

New attribute type label
In some applications, each case has an identifying code or serial number; this information can be recorded in a label attribute. A label attribute does not affect models in any way, but its value is displayed where possible with information about the case such as error messages, cross-referencing results etc.

Sample locking (Windows version)
The sampling option introduced in Release 1.03 allows random train/test splits of an application's data to be generated automatically. In some situations, for instance when investigating alternative model construction options, it is desirable to be able to `lock in' a particular sample, and an additional option on the model construction dialog box is now provided for this purpose.

Saving cross-referencing results (Windows version)
Cubist's cross-referencing facility is a powerful tool for finding the cases covered by particular model rules, and rules relevant to particular cases. The information in the cross-reference window at any point in time can now be saved as a text file.


Release 1.03

Sampling option
Cubist now includes an option to sample from large datasets. This enables a fixed percentage of the cases in a data file to be used for training. As an added convenience, models constructed using the sampling option are now automatically evaluated on a disjoint set of test cases.
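
The effect is similar to the following Python sketch, which draws a fixed percentage of the cases for training and keeps the remainder as a disjoint test set. The function name and the use of a seed are choices made for this example, not a description of Cubist's internal sampling.

```python
import random

def sample_split(cases, percent, seed=0):
    """Split cases into a training sample of the given percentage and a disjoint test set."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    shuffled = list(cases)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * percent / 100)
    return shuffled[:n_train], shuffled[n_train:]

train, test = sample_split(range(100), percent=70)
print(len(train), len(test))
```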

Batch-mode version of Cubist for Windows
GUIs are great, but it's sometimes useful to be able to run Cubist non-interactively from a MS-DOS command window. Cubist Release 1.03 includes an additional program CubistX that can be executed as a console application. Options for CubistX are set by command-line parameters in exactly the same way as for the Unix version. (Not included with the free demonstration download.)


Release 1.02

Model Presentation
The rules in a Cubist model are now ordered by the average target value of the cases covered. Rules that tend to predict low values appear before rules that predict high values. This is solely to make the models more intelligible -- the order of rules does not affect the value predicted.

Speed improvements
Cubist is now considerably faster for large datasets, particularly when using composite models.

© RULEQUEST RESEARCH 2016 Last updated January 2016

