Zsoft: Glossary



Accuracy:
The measure of
a model's ability to correctly label a previously unseen test case.
If the label is categorical (classification), accuracy is commonly
reported as the rate at which a case will be labeled with the right
category. For example, a model may be said to predict whether a
customer responds to a promotional campaign with 85.5% accuracy. If
the label is continuous, accuracy is commonly reported as the
average distance between the predicted label and the correct value.
For example, a model may be said to predict the amount a customer
will spend in a given month to within $55. See also Accuracy
Estimation, Classification, Estimation, and Statistical
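
The two ways of reporting accuracy described above can be sketched in a few lines of Python. This is a minimal illustration, not part of any product mentioned in this glossary; the function names and the sample values are assumptions chosen for the example.

```python
def classification_accuracy(actual, predicted):
    """Rate at which cases are labeled with the right category."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


def mean_absolute_error(actual, predicted):
    """Average distance between the predicted and the correct value."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical test cases: campaign response labels and monthly spend.
print(classification_accuracy(["yes", "no", "no", "yes"],
                              ["yes", "no", "yes", "yes"]))
print(mean_absolute_error([100, 200], [150, 140]))
```

The first call reports a rate (classification), the second an average distance in the units of the label (estimation).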

Accuracy Estimation:
The use of
a validation process to approximate the true value of a model's
accuracy based on a data sample. See also Accuracy,
RMS, Resampling
and Validation.

Affinity Modeling:
The generation of a model that predicts which products or services sell

  • Increases sales
    by channeling the right products to the right customer, e.g.
    shelf layout.

  • Decreases lost
    revenue from items that are unnecessarily placed together on
    a price-based promotion.

transaction data (transaction id, item, price,
Measurements: Maximization of an accuracy function (usually made up
of confidence and support sub-measures). Often the maximization
is limited to the rules that cover a particular product
segment, such as the top 20% revenue/profit generators.
Techniques: n-way
Correlation, Fisher (F) Statistic, Association Rules. In DMM
see co-occurrence module.
Issues: one-to-one affinities are the most common to be
reported. Many-to-one and many-to-many affinities are also
See also: Cross Sell Modeling, Data Mining Task, Diapers and Beer,
Market Basket Analysis, and Price Elasticity Modeling.
References:
  • Kitts, B. and
    Donnelly, J. "Point of Sales Data Mining Functions". May,

  • Berry,
    M. and Linoff, G. "Data Mining Techniques". 1997. "Chapter 8
    - Market Basket

products placement layout

Algorithm:
A well-specified sequence of
steps that accepts an input and produces an output. See also Data
Mining Algorithm.

Analysis data
This file contains the necessary information about the
type of processing that will be performed on the data file (e.g.
target column, cost-matrix value). The file is generated the
first time a data file is analyzed, and it can be reused to speed up
the definition of the process information when another data
file with exactly the same data structure is analyzed later.

Artificial Intelligence (AI):
The science of algorithms that exhibit
intelligent behaviour. See Data Mining, Expert Systems, Machine
Learning, Heuristics, and Pattern Matching.

Association:
When one data item is found
to be closely related to another data item, or cause another data
item, we say that they are associated. Association refers to finding
those associated data items. Note that association does not
necessarily mean that one data item causes the other data

Automated Binning:
Binning which
sets the number of bins based on the range of a numeric value.
Therefore, the user is not required to specify the number of bins.
However, certain values may be 'lost' from the decision tree because
of automatic binning, which is not the case with intelligent
binning. See Binning, Discretization.

Binning :
Choosing the
number of bins into which a numeric range is split. For example, if
salaries range from $20,000 to $100,000, the values must be binned
into some number of groups, probably between eight and twenty. Many
data mining products require the user to manually set binning. See
Discretization.
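
As an illustration of the salary example above, the following Python sketch assigns each value to one of a fixed number of equal-width bins. Equal-width binning is only one possible scheme and is an assumption here; the glossary does not say which scheme a given product uses, and the function name is hypothetical.

```python
def equal_width_bins(values, num_bins):
    """Return the 0-based bin index of each value using equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    indexes = []
    for value in values:
        # A zero width means all values are identical: use a single bin.
        index = int((value - low) / width) if width > 0 else 0
        # The maximum value would fall just past the last bin, so clamp it.
        indexes.append(min(index, num_bins - 1))
    return indexes


# Salaries between $20,000 and $100,000 split into eight bins of $10,000.
print(equal_width_bins([20000, 100000, 60000, 35000], 8))
```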

Black Box:
Any technology, and especially any
algorithm, that does not explain how it achieved its results. This
renders some data mining technologies unsuitable for many business
applications. See Neural Nets.

Bootstrap:
A resampling technique used to
estimate a model's accuracy. Bootstrap performs b experiments
with a training set that is randomly sampled from the data set.
Finally, the technique reports the average and standard deviation of
the accuracy achieved on each of the b runs. Bootstrap
differs from cross-validation in that test sets across experiments
will likely share some rows, while cross-validation is guaranteed
to test each row in the data set once and only once. See also accuracy
and cross-validation.
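
A minimal Python sketch of the bootstrap procedure follows. It makes two assumptions that the entry above does not spell out: the training set is drawn with replacement and has the same size as the data set, and the rows that were not drawn form the test set of that experiment. `train_fn` and `majority_trainer` are hypothetical names used only for this illustration.

```python
import random
import statistics


def bootstrap_accuracy(rows, labels, train_fn, b=20, seed=0):
    """Run b experiments and return (mean, standard deviation) of accuracy."""
    rng = random.Random(seed)
    n = len(rows)
    accuracies = []
    for _ in range(b):
        # Training set: n rows sampled at random with replacement.
        train_idx = [rng.randrange(n) for _ in range(n)]
        drawn = set(train_idx)
        # Test set (assumption): the rows that were never drawn.
        test_idx = [i for i in range(n) if i not in drawn]
        if not test_idx:
            continue  # nothing left to test on in this experiment
        predict = train_fn([rows[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        hits = sum(1 for i in test_idx if predict(rows[i]) == labels[i])
        accuracies.append(hits / len(test_idx))
    return statistics.mean(accuracies), statistics.pstdev(accuracies)


def majority_trainer(train_rows, train_labels):
    """Placeholder model: always predicts the most common training label."""
    most_common = max(set(train_labels), key=train_labels.count)
    return lambda row: most_common
```

Any function that takes training rows and labels and returns a `predict(row)` callable can be passed in place of `majority_trainer`.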

C4.5:
A decision
tree algorithm
which was developed from ID3, originally created
by Ross Quinlan. C4.5 can process both discrete and continuous data
and makes classifications. C4.5 implements post-pruning, also known
as backward pruning. C4.5 was developed over 20 years ago.
Therefore, it is a tried and tested algorithm. See ID3, Pruning,

Campaign Response Modeling:
This model
predicts the people that will most likely respond to a promotional

The marketing department can achieve its new business
quota with fewer resources by concentrating its resources on
the better prospects.
Data of
previous promotional campaigns. Can make use of information
from a Segmentation and Profiling Model of the current
customer base.
Measurements: First
model to reach a pre-specified quota (e.g. 85%) of total
expected respondents. Often visualized with a lift
Techniques: Non-Parametric Classification Algorithms with
posterior probability estimates.
Issues: (tbd)
See also: Cross Sell Modeling, Data Mining Task, and Gain Chart.
References:
  • Kennedy, R. L. et al, "Solving Data Mining Problems
    Through Pattern Recognition", Prentice Hall/Unica
    Technologies. pages 4.2 - 4.4, 1998.
  • Berry,
    M. and Linoff, G. "Data Mining Techniques for Marketing,
    Sales and Customer Support". Figure 3.1, pages 107-108.

CART:
A chi squared statistical regression
algorithm used for classical statistical analysis. CART stands for
classification and regression trees. CART can be used to build
decision trees, in which case it can also use the Gini index. CART
can only process numeric values effectively. See statistics,
numerics, symbolics.

Causal Factor:
Any data item which
drives, influences, or causes another data item. For example, if
customer credit limit drives how profitable a customer is likely to
be, it is called a causal factor. See Discriminating Factor.

CHAID:
A hybrid algorithm which grafts a
chi squared statistics formula onto AID (heuristics), in an attempt
to handle both numerics and symbolics. While CHAID is reliable, it
is slow and limited in power. See AID, CART, Gini, statistics,

Chi Square Distribution:
A mathematical distribution with positive skew. The shape
depends on the degrees of freedom (df). The skew decreases as the
degrees of freedom increase. The distribution is used directly or indirectly
in many tests of significance. See also Chi
Square Test.

Lane, D. HyperStat Online Textbook. 1999. Chapter 16: Chi

Chi Square Test:
A significance test used on contingency tables
to determine the relationship between two variables. The chi
square test assumes that the data distribution follows the chi
square distribution.
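
As a hedged illustration, the sketch below computes the chi square statistic and the degrees of freedom for a contingency table of observed counts, using expected counts derived from the row and column totals. This is the standard Pearson form of the statistic, which is an assumption here because the entry does not give a formula. Turning the statistic into a significance level requires the chi square distribution and is left out.

```python
def chi_square_statistic(table):
    """Return (statistic, degrees_of_freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical 2x2 table: rows = received coupon or not, columns = bought or not.
print(chi_square_statistic([[10, 20], [20, 10]]))
```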

Classification:
The act of labeling a test case into one of a finite number of
output classes. A model that classifies is sometimes referred to as
a "classifier". Commonly a classifier's performance is measured by
its ability to correctly label unseen test cases, that is, its
"accuracy". Inversely, a classifier's performance may be measured by
its "error rate". A more detailed insight into a classifier's
performance is given by the Confusion Matrix structure because it
captures how well the classifier predicts each of the available
classes. If a Cost-Benefit Matrix is available then the classifier's
performance is measured by the product of the Confusion and
Cost-Benefit matrices. See also: Accuracy, Confusion Matrix,
Cost-Benefit Matrix, Estimation, and Type I and Type II Errors.


Classification Algorithm:
An algorithm that performs classification. Some
algorithms first construct a model that then can be used to classify
(e.g. Decision Tree, Logistic Regression), while other algorithms
perform the labeling directly (e.g. k-Nearest-Neighbor). See also Decision
Tree, k-Nearest-Neighbor, and Logistic Regression.

Cluster:
A set of similar cases.

Clustering:
The development of a model that
labels a new instance as a member of a group of similar records (a
cluster). See clustering algorithms. For example, clustering could
be used by a company to group customers according to income, age, and
prior purchase behavior. Cluster detection rarely provides
actionable information, but rather feeds information to other data
mining tasks. See also Clustering Algorithms, Segmentation and Profiling.

Reference: Berry, M. and Linoff, G. Data Mining Techniques.
1997. "Chapter 10 - Automatic Cluster Detection".

Clustering Algorithms:
Given a data set these algorithms induce a model that classifies
a new instance into a group of similar instances. Commonly the
algorithms require that the number of (c) clusters to be identified
is prespecified. E.g. find the c=10 best clusters. Given a distance
metric, these algorithms will try to find groups of records that
have low distances within the cluster but large distances with the
records of other clusters. See also Agglomerative Clustering
Algorithms, Clustering,
Divisive Clustering Algorithms, K-means Algorithm, and Unsupervised

Reference: Hair, J. F. et al, (1998) "Multivariate Data
Analysis", 5th edition, Chapter 9, pages 469-517.


Confidence:
The confidence of the rule "B given A" is a measure of how
likely it is that B occurs when A has occurred. It is expressed as a
percentage, with 100% meaning B always occurs if A has occurred. Statisticians
refer to this as the conditional probability of B given A. When used with
association rules, the term confidence is observational rather than predictive.
(Statisticians also use this term in an unrelated way. There are ways
to estimate an interval, and the probability that the interval contains the
true value of a parameter is called the interval confidence. So a 95% confidence
interval for the mean has a probability of .95 of covering the true value of the
mean.)
Confidence Window or Level:

A statistical measurement of how sure one
can be that a certain result is true. The window or level describes
how close the value is likely to be to the exact result. See statistical


Confounding:
(from the Latin confundere: to mix together) A
distortion of an association between an intervention (I) and
response (R) brought about by an extraneous cofactor (C). This
problem occurs when the intervention is associated with C and C is
an independent factor for the response. For example, ____ (C)
confounds the relationship between ______ (R) and __couponing__ (I),
since R and C are related, and C is an independent risk factor for
When the
differences between the treatment and control groups other than the
treatment produce differences in response that are not
distinguishable from the effect of the treatment, those differences
between the groups are said to be confounded with the effect of the
treatment (if any). For example, prominent statisticians questioned
whether differences between individuals that led some to smoke and
others not to (rather than the act of smoking itself) were
responsible for the observed difference in the frequencies with
which smokers and non-smokers contract various illnesses. If that
were the case, those factors would be confounded with the effect of
smoking. Confounding is quite likely to affect observational studies
and experiments that are not randomized. Confounding tends to be
decreased by randomization. See also Simpson's Paradox.

Reference: http://www.stat.berkeley.edu/users/stark/SticiGui/Text/gloss.htm

Confusion Matrix:
A table that
illustrates how well a classifier predicts. Instead of a simple
misclassification error rate the table highlights where the model
encounters difficulties. For each of the c output classes, the table
presents an algorithm's likelihood of predicting each one of c
classes. The sample confusion matrix below shows a classifier's
accuracy on a problem with the three (c=3) output classes: cans,
produce and dairy. The test set used to evaluate the algorithm
contained 100 cases with a distribution of 30 cans, 35 produce and
35 dairy. A perfect classifier would have only made predictions
along the diagonal, but the results below show that the algorithm
was only correct on (20+25+24)/100 = 69% of the cases. The matrix
also shows that the classifier often confuses dairy for cans (11
incorrect) and cans for dairy (9 wrong).

See also classification.
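
The sample matrix itself did not survive in this copy of the glossary, so the Python sketch below rebuilds one that is consistent with the figures quoted in the entry: class totals of 30, 35 and 35, a diagonal of 20, 25 and 24, and the 11 and 9 confusions between dairy and cans. Everything else is an assumption made for illustration, including the remaining off-diagonal cells and the choice of rows as actual classes and columns as predicted classes.

```python
CLASSES = ["cans", "produce", "dairy"]

# Rows: actual class; columns: predicted class (orientation is an assumption).
# The diagonal and the 9 and 11 cells come from the text; the rest is made up
# so that each row adds up to the stated class totals.
CONFUSION = [
    [20, 1, 9],    # actual cans    (30 cases)
    [5, 25, 5],    # actual produce (35 cases)
    [11, 0, 24],   # actual dairy   (35 cases)
]


def accuracy(matrix):
    """Share of cases on the diagonal, i.e. predicted correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)


def worst_confusion(matrix, classes):
    """Return (actual, predicted, count) for the largest off-diagonal cell."""
    cells = [(matrix[i][j], classes[i], classes[j])
             for i in range(len(matrix))
             for j in range(len(matrix)) if i != j]
    count, actual, predicted = max(cells)
    return actual, predicted, count


print(accuracy(CONFUSION))
print(worst_confusion(CONFUSION, CLASSES))
```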

Contingency Tables:
Used to
examine the relationship between two continuous or categorical
variables. The chi square test
is used to test the significance between the column
and the row frequencies, that is, whether the frequencies of one of
the variables depend on the other.

Control Group Study (a.k.a.
Randomized Controlled Study):
A model of evaluation in which the
performance of cases who experience an intervention (the treatment
group) is compared to the performance of cases (the control group)
who did not experience the intervention in question. In medical
studies where the intervention is the administration of drugs, for
example, the control group is known as the placebo group because a
neutral substance (placebo) is administered to the control group
without the subjects (or researchers) knowing if it is an active
drug or not. Typically, the intervention is considered successful if
its performance exceeds that of the control group's by a
statistically significant amount. When assignment to control and
treatment groups is made at random, and no other factors enter into
the assignment into control or treatment, any differences between
the two groups are due either to the treatment or to random
variation. When a given difference between the two groups is
observed, say in spending on a particular set of items, it is
possible to calculate the probability of this difference arising
purely by chance. If the probability of an observed difference is
very small (generally less than 5 percent but more stringent rules
can be adopted) the observed difference is said to be due to the
treatment.

Correlation Coefficient (also Pearson's Product Moment Correlation
Coefficient):

A correlation coefficient is a number
between -1 and 1 that measures the degree to which two continuous
columns are related. Usually the term refers to Pearson's
Product Moment Correlation Coefficient, usually denoted by r, which
measures the linear association between two variables. If there is a
perfect linear relationship with positive slope between the two
variables, we have a correlation coefficient of 1; if there is
positive correlation, whenever one variable has a high (low) value,
so does the other. If there is a perfect linear relationship with
negative slope between the two variables, we have a correlation
coefficient of -1; if there is negative correlation, whenever one
variable has a high (low) value, the other has a low (high) value. A
correlation of 0 means that there is no linear relationship between
the variables. See also Spearman
Rank Correlation Coefficient.

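
A short Python sketch of Pearson's r, written from the textbook definition (covariance divided by the product of the standard deviations). The function name and sample columns are assumptions for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson's product moment correlation coefficient of two columns."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    # Undefined (division by zero) if either column is constant.
    return covariance / (spread_x * spread_y)


print(pearson_r([1, 2, 3], [2, 4, 6]))   # perfect positive linear relationship
print(pearson_r([1, 2, 3], [6, 4, 2]))   # perfect negative linear relationship
```
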
Cost-Benefit Matrix:
A cost-benefit matrix is an input to the
modeling process that allows predictive modelers to describe the
costs and the benefits associated with each possible prediction. By
default the cost-benefit matrix has a value of one (1.0) for correct
predictions and zero (0.0) for incorrect predictions. This
configuration asks that the predictive model optimize raw accuracy.
In most real-world situations, however, an incorrect prediction has
a net monetary cost (less than zero), and a correct prediction has a
positive benefit. The correct or incorrect values that are chosen
affect the values chosen for the matrix. The default cost matrix
assumes no weighting for each output possibility. When the
cost-benefit matrix has new non-default values assigned, the model
optimizes the net benefit (profit) associated with each prediction.
The cost-benefit matrix input is essential for businesses that want
to optimize their return on investment. PredictionWorks supports the
use of a cost-benefit matrix.
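
To illustrate how a cost-benefit matrix changes what is being optimized, the sketch below multiplies each cell of a confusion matrix by the matching cell of a cost-benefit matrix and sums the results. This cell-by-cell reading of "the product of the Confusion and Cost-Benefit matrices" is an assumption, as are the function name and all numbers; it is not a description of how PredictionWorks computes the figure.

```python
def net_benefit(confusion, cost_benefit):
    """Sum of count * value over matching cells of the two matrices."""
    return sum(count * value
               for count_row, value_row in zip(confusion, cost_benefit)
               for count, value in zip(count_row, value_row))


# Hypothetical 2-class problem: rows = actual, columns = predicted,
# class order = (legitimate, fraud).
confusion = [[50, 10],
             [5, 35]]

default_matrix = [[1.0, 0.0],    # 1 for correct, 0 for incorrect predictions
                  [0.0, 1.0]]
custom_matrix = [[0.0, -1.0],    # a false alarm costs 1
                 [-20.0, 10.0]]  # a missed fraud costs 20, a caught fraud gains 10

print(net_benefit(confusion, default_matrix))  # number of correct predictions
print(net_benefit(confusion, custom_matrix))   # net benefit in monetary units
```

With the default matrix the result is simply the count of correct predictions, which is why that configuration optimizes raw accuracy.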

Cross Sell Modeling:
The generation of a model that predicts which products a specific
customer would likely buy, or that predicts which customers would
likely buy a specific product. This task is similar to Affinity
Modeling and Campaign Response Modeling except that the resulting
model is customer-centric and targets existing customers instead of
new prospects.

  • Maximize sales
    to the company's existing customers.

  • Increase
    customer satisfaction from the avoidance of clearly
    unappealing offers.

labeled transaction data. Can make use of demographic data,
Value Drivers model and Customer Valuation model.
Measurements: Minimize
average misclassification error rate.
Techniques: Non-Parametric Classification Algorithms with
posterior probability estimates.
Issues: (tbd)
See also: Data Mining Task and Campaign Response Modeling.
References: Kennedy,
R. L. et al, "Solving Data Mining Problems Through Pattern
Recognition", Prentice Hall/Unica Technologies. pages 4.4 -
4.6, 1998.

who will buy another unrelated product

Cross-validation:
A resampling
technique used to estimate a model's accuracy. Cross-validation
first segments the data rows into n nearly equally sized
folds (F_1..F_n). Once the
segmentation is accomplished, n experiments are run, each
using F_i as a test set and the other n-1
folds appended together to form the training set. Finally, the
technique reports the average and standard deviation of the accuracy
achieved on each of the n runs. Too small a value for
n will not achieve a confident accuracy estimate, while too
large a value for n will increase the variance of the
estimate and will require increased computation. Empirical
investigation into this technique has suggested the value of n=10
(10-fold cross-validation) to achieve useful results. See accuracy
and bootstrap.
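
The segmentation step can be sketched as follows. The round-robin assignment of rows to folds is an assumption chosen for brevity (in practice rows are often shuffled first), and the function name is hypothetical. Each experiment yields the row indexes of its training set and its test set.

```python
def cross_validation_folds(num_rows, n):
    """Yield (train_indexes, test_indexes) for each of the n experiments."""
    # Deal the rows out round-robin so the folds are nearly equal in size.
    folds = [list(range(i, num_rows, n)) for i in range(n)]
    for i in range(n):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, test


for train, test in cross_validation_folds(10, 5):
    print("train:", train, "test:", test)
```

Every row index appears in exactly one test set, which is the guarantee that distinguishes cross-validation from bootstrap.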

Customer Relationship Management (CRM):

The business processes that strengthen the
relationship between a seller and their customers. To ensure
positive contacts a CRM requires the measurement of each customer's
value to the enterprise, the storing of all relevant transactional
(behavioral) data, and the ability to predict future customer
behavior. The implementation of a CRM process requires a significant
technological investment in computing hardware and software,
personnel and customer contact (touch point) systems.

Worldwide revenues in the customer relationship management
(CRM) services markets will increase at a compound annual growth
rate of 29 percent from $34.4 billion in 1999 to $125.2 billion in
2004, according to International Data Corp. (IDC). META Group
predicts a 50 percent annual growth rate for the global CRM market
and projects it will grow from more than $13 billion in 2000 to $67
billion in 2004.

CRM (customer relationship management) is an information industry term
for methodologies, software, and usually Internet capabilities that
help an enterprise manage customer relationships in an organized
way. For example, an enterprise might build a database about its
customers that described relationships in sufficient detail so that
management, salespeople, people providing service, and perhaps the
customer directly could access information, match customer needs
with product plans and offerings, remind customers of service
requirements, know what other products a customer had purchased, and
so forth. According to one industry view, CRM consists of:

  1. Helping an enterprise to enable its marketing departments
    to identify and target their best customers, manage marketing
    campaigns with clear goals and objectives, and generate quality
    leads for the sales team.
  2. Assisting the organization to improve telesales, account,
    and sales management by optimizing information shared by multiple
    employees, and streamlining existing processes (for example,
    taking orders using mobile devices).
  3. Allowing the formation of individualized relationships
    with customers, with the aim of improving customer satisfaction
    and maximizing profits; identifying the most profitable customers
    and providing them the highest level of service.
  4. Providing employees with the information and processes
    necessary to know their customers, understand their needs, and
    effectively build relationships between the company, its customer
    base, and distribution partners.

Customer Value Modeling:
The generation of a model that forecasts a customer's future
spending in general or within specific business areas.

other models by providing these models with a list of high
long term value (LTV) customers or possibly those with low
labeled transaction data. Can make use of demographic
Measurements: Minimized RMS average between actual and predicted
Techniques: Non-Parametric Estimation Algorithms
Issues: (tbd)
See also: Data Mining Task, Long Term Value (LTV), REVPAR, and

B., Hetherington, K, and Donnelly, J. (1998/04) "Data Mining
Algorithms for Customer Analysis and Prediction". Pages 23-24,
33-38. DataSage deliverable for Sheraton project.

what is a customer's LTV?


See Data

Data Mining:
The automatic
detection of trends and associations hidden in data, often in very
structured data. Data Mining is sometimes thought of as a single
phase of a larger process that includes Data Selection, Data
Cleansing, Data Transformation, Data Mining, and Evaluation.
See Data Mining Algorithms, Machine Learning, Statistics.


Locate and resolve problems with the dataset:

  • repeated rows, non-unique keys

  • gaps in time

  • missing data

  • columns dominated by one value

  • columns with a large number of categorical values

  • discretizing or numeralizing columns

See Data

Data Mining Algorithm:
An algorithm that accepts structured data and returns a model of the
relationships within a data set. The algorithm's performance is
measured by its accuracy, training/testing time, training/testing
resource requirements and the model's understandability. See also Accuracy,
, and Non-Parametric

Data Mining Task:
A general
problem that data mining is called in to generate a model for.
In this glossary data mining tasks are described according to the
following template:

actionable information and hopefully how it will positively
impact the company's return on investment.
common sources of information required to generate this model
and possibly a high-level metadata example.
Measurements: the
performance gauge that will identify the winning model. In the
end what really counts is the return on investment, so it is
beneficial to tie the model's measurement with real
Techniques: some
commonly used data mining algorithms to induce this category
of model. This section should focus on algorithms that exist
within DataSage Mining Manager (DMM).
Issues: common
stumbling blocks specific to this type of
See also: Affinity Modeling, Campaign Response Modeling, Cross Sell
Modeling, Customer Value Modeling, Demand Modeling, Fraud Detection,
Price Elasticity Modeling, Retention Modeling, Segmentation and
Profiling, Total Wallet Modeling, and Value Drivers Modeling.

a list of
fuller write-ups on the topic, preferably from past projects
but also from our library.

Decision Tree:
A model made up of
a root, branches and leaves. Decision trees are similar to
organization charts, with statistical information presented at each
node. See Axis Parallel Representations.

Decision Tree Algorithm:
An algorithm that generates classification or estimation models
from the fields of Machine Learning and Statistics. The basic
approach of the algorithm is to use a splitting criterion to
determine the most predictive factor and place it as the first
decision point in the tree (the root), and continually perform this
search for predictive factors to build the branches of the tree
until there is no more data to continue with. Tree pruning raises
accuracy on noisy data and can be performed as the tree is being
constructed (pre-pruning), or after the construction
(post-pruning). The algorithm is commonly used for classification
problems that require the model to be represented in a human-readable
form. PredictionWorks has several implementations of the Decision
Tree Algorithm. Two of them use different splitting criteria (gini
and entropy), and C4.5 is an implementation of a well-known
algorithm by J.R. Quinlan. See also: Classification,
Estimation, C4.5, Entropy, and Gini.

Demand Modeling:
The generation of a model that forecasts when an item will be ordered
and how large the order will be.

Overstocking and understocking have their own
associated costs. Overstocking increases interest expenses and
the impact of product spoilage. Understocking increases lost
sales and damages the customer experience.
data (item, qty, tstamp). Spoilage data if appropriate.
Delivery schedules, push (promotion) orders.
Measurements: Minimization of RMS between actual and predicted item
order size. Variance of error to determine the minimum
allowable days of safety stock required. Success will likely
also include the optimization of several submeasures that
reflect how the client measures its own success, such as
inventory levels, turnover, and other measures that relate to
warehouse usage and how much capital is tied up.
Techniques: Time-series Forecasting.
Issues: (tbd)
See also: Data Mining Task

B., (1998). The Complete Short Cycle Forecasting System.
Walmart (SCFS) project.
Short Cycle Forecasting System -
Phase I (see Murtuza's binder).

are we under/over stocked?

Diapers and Beer:
A popular
anecdote used to illustrate the unexpected but useful patterns
discovered by data mining. The anecdote (probably apocryphal)
recounts that a large supermarket chain used data mining to discover
that customers often bought diapers and beer at the same time. When
the retailer displayed the two items together, sales increased for both

Finding unexpected but
useful trends and associations hidden in data. See modeling,

Discriminating Factor:
A measure of how important a causal factor is, used by decision trees
to build the tree. See decision tree, causal factor.

Entropy:
In data mining, a measure of
the relative difference between two or more data partitions based on
information theory. See also Gini.

Entropy Heuristic:
Use of entropy to determine the information
of a particular attribute (predictor) in the Decision
Tree Algorithm. The attribute with the greatest entropy
reduction is chosen as the test attribute in a Decision
Tree model, because splitting on the attribute produces the
purest (most uniform) data distribution. The purity of the data
distribution affects the ultimate accuracy of the resulting model.
The attribute with the greatest entropy reduction is also the
attribute with the highest information gain. See also Gini.

Reference: Han, J. and Kamber, M. Data Mining: Concepts and
Techniques. 2000. Chapter 7: Classification
and Prediction.
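
The entropy reduction described above can be sketched in Python as follows. The formulas are the standard Shannon entropy and information gain, which the entry names but does not write out, so their use here is an assumption; the row format (a dict per row) and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    # Weighted average entropy of the partitions produced by the split.
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder


rows = [
    {"region": "north", "gender": "f"},
    {"region": "north", "gender": "m"},
    {"region": "south", "gender": "f"},
    {"region": "south", "gender": "m"},
]
labels = ["yes", "yes", "no", "no"]

# The attribute with the larger gain would be chosen as the test attribute.
print(information_gain(rows, labels, "region"))
print(information_gain(rows, labels, "gender"))
```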

Estimation:
The act of
labeling a test case with a continuous value. A model or algorithm
that estimates is sometimes referred to as a "regressor". Commonly a
regressor's performance is measured by its ability to predict a
value that is near to the actual value, such as with a correlation
coefficient. See also Classification and Estimation Algorithm.


Estimation Algorithm:
An algorithm that performs estimation. Some
algorithms first construct a model that then can be used to estimate
(e.g. Decision Tree, Linear Regression), while other algorithms
perform the labeling directly (e.g. k-Nearest-Neighbor). See also:
Estimation.

Euclidean Distance:
A measure of the distance between two points. For any two n-dimensional
points a = (a_1, a_2, ..., a_n) and b = (b_1, b_2, ..., b_n),
the distance between a and b is equal to

sqrt( (a_1 - b_1)^2 + (a_2 - b_2)^2 + ... + (a_n - b_n)^2 )
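
The formula translates directly into Python. The function name is an assumption for this illustration; on Python 3.8 and later, the standard library function `math.dist` computes the same quantity.

```python
import math


def euclidean_distance(a, b):
    """Distance between two n-dimensional points given as sequences."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


print(euclidean_distance((0, 0), (3, 4)))
```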

Forecasting:
Adapting data mining
techniques to forecast future trends with statistical reliability.
Forecasting is often confused with prediction, but is usually much
more complex.
See time
series analysis/forecasting
, what-if analysis, neural

Fraud Detection:
A modeling procedure that predicts infrequent events that bear a large
financial penalty. This type of modeling is commonly used to detect
criminal activity such as credit card fraud, insurance claim fraud,
and Internet/wireless hacking. Each type of fraud detection requires
a slightly different technique. Generally, anomalous events that do
not fit the normal usage patterns trigger fraud detection alarms.
The main challenges to these tasks are due to the low frequency of
the undesirable events, usually under one percent (1%). Usage of the
cost-benefit matrix is critical to properly weigh the benefits of
correct and incorrect predictions. These conditions often mean that
1) The lack of examples of fraudulent events makes it difficult to
discriminate between legitimate and fraudulent behavior. 2) The
overwhelming number of legitimate events leaves little room for lift
on the already high accuracy from the simple model that simply
predicts all events to be legitimate. 3) The existence of a
cost-benefit matrix allows us to dismiss the simple model described
in condition 2. Incorporating cost-benefit matrix information into
predictive modeling, however, is a new concept.

Gini:
A modern decision tree index
algorithm developed by Ron Bryman. Gini handles both numbers and
text, and offers good processing speed. See also C4.5 and CHAID.

Gini Heuristic:
Use of Gini to
determine the information
of a particular attribute (predictor) in the Decision
Tree Algorithm. See also Entropy.
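
The entry does not give a formula, so the sketch below assumes the commonly used Gini impurity, 1 minus the sum of squared class proportions, and scores a candidate split by the weighted impurity of the partitions it produces. The function names and the row format are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """1 - sum(p_k^2) over the class proportions p_k (0 means a pure set)."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def gini_of_split(rows, labels, attribute):
    """Weighted Gini impurity after splitting on one attribute (lower is better)."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    return sum(len(part) / len(labels) * gini_impurity(part)
               for part in partitions.values())


print(gini_impurity(["yes", "yes", "no", "no"]))
print(gini_impurity(["yes", "yes", "yes", "yes"]))
```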

Heuristic:
A rule of thumb that
speeds up the locating of an optimal result.

Horizon Effect:
The event
where Decision Tree
construction is halted prematurely because no further
benefit seemed apparent. Usually happens as a result of pre-pruning.

ID3:
The first algorithm which was designed
to build decision trees. ID3 was invented by Ross Quinlan at the
University of Sydney, Australia. ID3 was
followed by ID4, ID6 and See5. See C4.5, Gini, CHAID, CART.

Information Gain:
A measurement
used to select the test attribute (predictor) at each node during
Decision Tree model construction. See also Attribute Selection
Measure or Splitting Criterion.

Instance-based Learning:
A learning technique in which training datasets are stored in their entirety
and a distance function is used to make predictions. See also: KNN

k-Nearest-Neighbor (kNN) Algorithm:
An algorithm from the field of Pattern Recognition that generates both
estimation and classification models. The algorithm assumes that
similar cases behave similarly. The most common proximity measure is
based on the Euclidean distance between two vectors. For
classification problems the prediction is based on the statistical
mode (most common value) of the response value for the k closest cases.
For example, to predict the target value of a test case when k is set
to seven (k=7), the seven cases most similar to the
test case are fetched and the most common value from among the
seven is used to make the prediction. If the problem is an
estimation problem, the average of the seven is used
for the prediction. This algorithm works with most datatypes but is
most effective in the presence of continuous columns where the
Euclidean distance can be calculated.
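
The procedure described above can be sketched in Python as follows. This is a brute-force illustration that assumes numeric columns and Euclidean distance (`math.dist`, Python 3.8+); the function name, parameters and sample data are hypothetical and do not describe any particular product's implementation.

```python
import math
from collections import Counter


def knn_predict(train_rows, train_targets, test_row, k=7, classify=True):
    """Predict the target of test_row from its k closest training cases."""
    by_distance = sorted(zip(train_rows, train_targets),
                         key=lambda pair: math.dist(pair[0], test_row))
    nearest = [target for _, target in by_distance[:k]]
    if classify:
        # Classification: the statistical mode of the k nearest targets.
        return Counter(nearest).most_common(1)[0][0]
    # Estimation: the average of the k nearest targets.
    return sum(nearest) / len(nearest)


rows = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
segments = ["a", "a", "a", "b", "b", "b"]
spend = [10, 12, 14, 100, 110, 120]

print(knn_predict(rows, segments, (1.5, 1.5), k=3))
print(knn_predict(rows, spend, (1.5, 1.5), k=3, classify=False))
```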

Least Squares:
A method used to
find the line that goes through the datapoints with the smallest sum of
squares of distances between the datapoints and this line.
Gain Chart (Lift Curve):
A method
from direct marketing that helps to visualize a classifier's
accuracy on binary (positive/negative) problems. Gain charts are
commonly used in promotion campaign response modeling (responded/did
not respond) to present how well a model performs when compared to a
random mailing. The x-axis represents the percentage of the
total population covered, say a city of 100,000. The y-axis
presents the cumulative percentage of positive cases captured,
say of the 30,000 who would respond if they received the mailout. The
chart should include the performance of a random case selection (a
straight line from [0%,0%] to [100%,100%]) and the performance of
the model under investigation. Other possible lines in the chart
include the performance of other competing models, the performance
of a perfect classifier, and the quota to be achieved. In the
chart for this example, kNN (k-Nearest-Neighbor)
reaches the 85% quota faster than the DT
(Decision Tree) model.
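
The points of such a chart can be computed as in the following sketch. It assumes that the model outputs a score per case (higher meaning more likely positive) and that actual outcomes are coded 1 for positive and 0 for negative; the function name cumulative_gains and its parameters are hypothetical.

```python
def cumulative_gains(scores, actuals, steps=10):
    """Return (pct_population, pct_positives_captured) points for a gain chart."""
    # Order the actual outcomes from the highest-scored case to the lowest.
    ranked = [actual for _, actual in
              sorted(zip(scores, actuals), key=lambda pair: pair[0], reverse=True)]
    total = len(ranked)
    total_positives = sum(ranked)
    points = [(0.0, 0.0)]
    for step in range(1, steps + 1):
        cutoff = round(total * step / steps)
        captured = sum(ranked[:cutoff])
        points.append((100.0 * cutoff / total, 100.0 * captured / total_positives))
    return points
```

A random selection would yield points on the diagonal, so the further the returned points lie above it, the better the model ranks the positive cases.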

Linear Regression:
An algorithm from the field of Statistics that generates estimation models.
The algorithm assumes that a linear relationship exists
between the independent variables and the response variable.
PredictionWorks uses Least Squares as a measure of model fitness.
See also Parametric Modeling and Least Squares.

Log-Likelihood:
The logarithm of a likelihood equation. It is used when the logarithm is
easier to work with than the equation itself and the outcome is

Logistic Regression:
An algorithm from the field of Statistics that generates classification models.
The target column may be non-binary, but most
implementations are limited to two-value (binary) predictions.
PredictionWorks' logistic regression algorithm uses maximum
likelihood estimation to determine the parameters of the regression
equation. Forward stepwise selection is used to find the most
predictive columns from which to build the model. The stopping
criterion is based on a chi-square distribution. See also: Classification
and Log-Likelihood.

Machine Learning:
The
research area within AI that studies algorithms which induce models
from a set of data. Machine learning differs from statistics and
pattern recognition in its focus on knowledge representation,
symbolic data, automated discovery, and computational complexity

Majority Model Algorithm:
A classification algorithm that simply predicts the most common value
found in the target column. For example, if 87% of a data set's rows
contain the value "Did_Not_Respond" in the target column, then the
Majority Model will simply predict "Did_Not_Respond" for every row,
and on average achieve a raw accuracy of 87%. If one single value
(class) is very prevalent in the data set, say greater than 95%,
then this model's raw accuracy will be difficult to improve on. When
this is the case, the situation usually requires a Cost-Benefit
Matrix to represent the fact that predicting the majority class may
have little value in the real world. For example, predicting that
no customer will respond to a direct mail campaign may be
accurate, but that model will generate no revenue. See also: Classification,
Minority Model Algorithm, Mean Model Algorithm, and Naïve Model Algorithm.
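
The 87% example can be reproduced with the short sketch below. The function names (fit_majority_model, raw_accuracy) are hypothetical and the data is synthetic.

```python
from collections import Counter

def fit_majority_model(target_values):
    """Return the most common value in the target column."""
    return Counter(target_values).most_common(1)[0][0]

def raw_accuracy(predicted_value, target_values):
    """Fraction of rows for which a constant prediction is correct."""
    hits = sum(1 for value in target_values if value == predicted_value)
    return hits / len(target_values)

# Synthetic target column: 87 non-responders and 13 responders.
targets = ["Did_Not_Respond"] * 87 + ["Responded"] * 13
prediction = fit_majority_model(targets)
accuracy = raw_accuracy(prediction, targets)
```

Here prediction is "Did_Not_Respond" and accuracy is 0.87, even though the model never identifies a single responder.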

Market Basket Analysis:
A technique, used in large retail chains, which studies every purchase
made by customers to find out which sales are most commonly made
together. See Diapers and Beer.

Maximum Likelihood:
A method of
choosing parameters for a regression equation that maximizes the
likelihood of observing the target value.

Mean Model Algorithm:
An estimation algorithm that simply predicts the
mean (average) value found in the target column. For example, if the
average value of a data set's target column is 7.875, then the Mean
Model will simply predict 7.875 for every row. Often the central
tendency of a data set is better captured by the median value;
however, calculating a column's median is significantly more complex
(requires sorting) than calculating the mean value. See also: Naïve
Model Algorithm.
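
A minimal sketch of the mean and median variants follows, assuming the target column is a plain list of numbers. The function names are hypothetical; the median uses Python's standard statistics module, which sorts the data internally.

```python
import statistics

def fit_mean_model(target_values):
    """Return the constant prediction of a Mean Model: the column average."""
    return sum(target_values) / len(target_values)

def fit_median_model(target_values):
    """Return the column median, which is less sensitive to extreme values."""
    return statistics.median(target_values)

mean_prediction = fit_mean_model([5.0, 7.5, 8.0, 11.0])       # 7.875
skewed = [1, 2, 3, 4, 100]
skewed_mean = fit_mean_model(skewed)                          # pulled up by 100
skewed_median = fit_median_model(skewed)                      # stays at 3
```

The skewed column shows why the median can capture the central tendency better: a single extreme value moves the mean to 22 while the median remains 3.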

Meta data file:
This file
contains the necessary information about the data structure (i.e.
delimiter, data type for each column). The file is generated the
first time a data file is analyzed, and it can be reused to speed up the
process of defining the data structure when analyzing another data file
that has exactly the same structure.

Minority Model Algorithm:
A classification algorithm that simply predicts
the least common value found in the target column. PredictionWorks
implicitly supports this model through the Naïve Model Algorithm by
creating a separate model for every value in the target column. This
model is of interest when a Cost-Benefit Matrix biases the value of
predicting against the majority value (class). See also: Naïve
Model Algorithm.

Missing Value:
A data
value that is not known or does not exist. Three common reasons for
missing values are unfilled optional features, data entry
malfunctions, and non-applicability of a column. Some algorithms
require that missing values be filled in (imputed). PredictionWorks
imputes values based on the mean if the column is numeric or the
mode if the column is categorical.
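
The mean/mode rule can be sketched as follows. This is an illustration only, not PredictionWorks' code: the function name impute_column is hypothetical, missing values are assumed to be represented by None, and each column is assumed to contain at least one known value.

```python
from collections import Counter

def impute_column(values):
    """Fill None entries with the mean (numeric column) or mode (categorical)."""
    known = [v for v in values if v is not None]
    if all(isinstance(v, (int, float)) for v in known):
        fill = sum(known) / len(known)               # numeric: mean
    else:
        fill = Counter(known).most_common(1)[0][0]   # categorical: mode
    return [fill if v is None else v for v in values]
```

For example, impute_column([1.0, None, 3.0]) fills the gap with 2.0, and impute_column(["red", None, "red", "blue"]) fills it with "red".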

Mode:
In mathematics, the most common class
in a set.

Modeling:
Building a model from a set
of data. See also Model, Non-Parametric Modeling, and Parametric Modeling.

Model:
A structure that labels a case
based on the case's characteristics. A model is measured by its
accuracy and speed, and sometimes by its ability to clearly justify
its prediction. See Accuracy.

Naïve Bayes:
An algorithm that generates classification models
from the field of Statistics. The basic approach is to use
probabilities estimated from the training data (class priors and
per-attribute conditional probabilities) as the predictive model. The primary constraint
of this algorithm is that it assumes independence of the attributes.
The algorithm is generally faster and less accurate than other
model-based algorithms. PredictionWorks' implementation uses a forward
selection approach.

Naïve Model Algorithm:
An algorithm that generates both classification and estimation
predictions. The approach of the algorithm is simply to make
predictions based on the distribution of the target column. There
are several possible versions of this algorithm, each making different
assumptions. PredictionWorks implements the Majority Model and
Single Class Model. See also: Majority Model Algorithm, Minority
Model Algorithm, Mean Model Algorithm, Single Class Model Algorithm,
classification/estimation models, domain, and target column.

Neural Networks:
A predictive
modeling technology, which attempts to mimic the complex functions
of the brain. The main problem with neural nets is that the tools do
not explain how they determined their results. Another limitation is
that only skilled professionals can successfully use them. See black box.

Non-Parametric Modeling:
The development of a model with the use of algorithms that do
not make many prior assumptions about the population from which the
data is drawn. This approach requires less understanding of the
specific relationships in the underlying system than parametric
modeling. See also Neural Networks, k-Nearest-Neighbor, Decision
Tree Algorithms, and data-driven modeling.

Overfitting:
The tendency to
mistake noise in data for trends.

Parametric Modeling:
The development of a model with the use of strong assumptions about the
underlying behavior within the population that the data is drawn
from. The use of these algorithms generally requires a thorough
understanding of their specific assumptions and how to test for
their validity. If the assumptions hold, the generated model
will generally be more accurate than a model generated by a non-parametric
modeling algorithm. See Linear Regression and Non-Parametric Modeling.

Partitioning:
Choosing the data which
is most interesting for mining. This is typically at least eighty
percent of the work of data mining. See sampling.

Post-Pruning:
A type of Pruning in Decision
Tree Algorithms
where the pruning process is done after the model
is fully constructed. Post-pruning removes anomalies by removing
branches from a "fully grown" tree. See also Pre-Pruning,
Pruning and Overfitting.

Predictive Modeling:
Modeling that emphasizes accuracy on unseen data.

Prediction:
Using existing data to predict how
other factors will behave, assuming that some facts are known about
the new factor. Making a credit check of new customers by using data
on existing customers is a good example of prediction. See What-If
analysis, time series analysis/forecasting, and forecasting.

Pre-Pruning:
A type of Pruning in Decision
Tree Algorithms
where tree construction is halted early, before
the model is fully constructed. This approach is generally faster
than Post-Pruning, but will be less accurate due to the horizon
effect. See also Post-Pruning,
Pruning and Overfitting.

Price Elasticity Modeling:
The generation of a model that forecasts a
product's sales volume based on its price. For example, if the price
of milk is increased from $3.50/g to $4/g, the model may forecast that
sales of this product will fall from 683/week to 546/week.
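
Using the milk figures from the example above, a simple point elasticity can be worked out as in the sketch below. The formula (percentage change in quantity divided by percentage change in price, both relative to the starting values) is one common convention and is an assumption here, as are the function and variable names.

```python
def price_elasticity(old_price, new_price, old_qty, new_qty):
    """Percentage change in quantity divided by percentage change in price."""
    pct_qty_change = (new_qty - old_qty) / old_qty
    pct_price_change = (new_price - old_price) / old_price
    return pct_qty_change / pct_price_change

elasticity = price_elasticity(3.50, 4.00, 683, 546)   # roughly -1.40
revenue_before = 3.50 * 683                            # weekly revenue at the old price
revenue_after = 4.00 * 546                             # weekly revenue at the new price
```

An elasticity below -1 means volume falls proportionally more than price rises, so in this example weekly revenue drops from 2390.50 to 2184.00 despite the higher price.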

Set the price for a product that optimizes either the
revenue or profit that can be drawn from this product. Price
is one of the most important factors that a retailer can use
to improve profits.
Data Sources: Transaction data (item, price, mean(qty), cost, date). Can
also make use of competitors' pricing and seasonal code.
Measurements: Minimized distance between actual and predicted sales
volume. Ideally the model that achieves the largest increase
in profit (and/or revenue) over all products would be
Techniques: A linear
model has been found to be useful within the area that profit
and revenue are maximized. Non-linear models may be better at
predicting the outcomes of extreme price points like
'barn-yard' sales. In DMM (tbd)
Issues: (tbd)
See Also: Cannibalization, Data
Mining Task, Linear Regression, Price Elasticity, and Sensitivity
  • Kitts, B and
    Donnelly, J. (1998/05) "Point of Sales Data Mining

  • Melli, G.
    (1998/08) "Price, Demand, Data Mining and Profit".

  • Melli, G.
    (1998/11) "Strategic Hotel Capital - Price Sensitivity

    • this report
      includes an analysis of a price survey.

  • Dolan,
    R and Simon H. (1996) "Power Pricing . How Managing Price
    Transforms the Bottom Line", The Free Press. (in our


Optimize rev/profit

Pruning:
The process of removing parts
of a Decision Tree
model that describe anomalies rather than the true features
of the underlying model. There are two types of pruning process, Pre-Pruning and
Post-Pruning. See also Overfitting.

Posterior Probability (a
posteriori probability) Models:

A classification model which
estimates the probability that input x belongs to class
C, denoted by P(C|x). With these
estimates a set of inputs can be ranked from most to least certain
so that resources can be focused on the top prospects. See also Campaign
Response Modeling and Gain Chart.

Random Error:
Can be thought of as
either sampling variation or as the sum total of other unobservable
chance deviations that occur with sampled data. The key to working
with random error is that it tends to balance out in predictable
ways over the long run. It can thus be dealt with using routine
methods of estimation and hypothesis testing.

Random Model

An algorithm that generates classification
predictions. The basic approach is to randomly predict one of the
classes without consideration for its frequency.

Resampling Techniques:
An empirical approach to model accuracy estimation based on the
training and testing on multiple samples of the data set. See also
Cross-Validation and Bootstrap.

Retention Modeling:
The generation of a model that forecasts the probability that a customer
will significantly reduce the business they bring in, or defect altogether.


Increase the long
term value (LTV) of its customer base.

  • Try to win back
    customers before they defect because it costs more to bring
    in a new customer than it does to hold on to an existing one.
    In combination with a Customer Value model, the Retention
    model can focus on high value customers.

  • Survey potential
    defectors to stay on top of service/product shortfalls.

Data Sources: labeled transaction data. Can also make use of customer
satisfaction questionnaires, demographic data, Value Drivers
model, and Customer Valuation model.
Measurements:
  • For simple
    defect/non-defect binary classification, minimize the
    average misclassification error rate. Include a confusion
    matrix.

  • For business
    reduction estimation minimize the average RMS between actual
    and predicted decrease in business.

Techniques: Non-Parametric
Issues: (tbd)
See Also: Attrition, Churn, and Data
Mining Task
  • Kitts, B.,
    Hetherington, K, and Donnelly, J. (1998/04) "Data Mining
    Algorithms for Customer Analysis and Prediction". Pages
    25-27. DataSage deliverable for Sheraton project.

  • Berry, M. and
    Linoff, G. (1997) "Data Mining Techniques for Marketing,
    Sales and Customer Support". Figure 3.1, pages

  • Reichheld, F. and Sasser, W. (1990) "Zero
    Defections: Quality Comes to Services", Harvard Business
    Review, Sept-Oct 90 reprint 90508.


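Root-mean-squared Error (RMS error):
A measure of
an estimation model's error. Given a data set, it is defined as the
square root of the mean of the squared differences between the true values
and the model's predicted values. For a data set with n records, true
values $y_i$ and predicted values $\hat{y}_i$:
$\mathrm{RMS} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$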
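
A minimal sketch of the calculation, assuming the standard definition (square root of the mean squared difference) and two equal-length lists of numbers; the function name rms_error is hypothetical.

```python
import math

def rms_error(true_values, predicted_values):
    """Square root of the mean squared difference between truth and prediction."""
    n = len(true_values)
    squared_errors = [(t - p) ** 2 for t, p in zip(true_values, predicted_values)]
    return math.sqrt(sum(squared_errors) / n)
```

For true values [3, 5, 7] and predictions [2, 5, 9] the squared errors are 1, 0 and 4, so the RMS error is the square root of 5/3, about 1.29.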


Rule Induction:
A method of performing discovery by inducing
rules about data. Rule induction tests given values in the data set
to see which other data are its strongest associated factors. See decision
tree and discovery.

Sampling:
Taking a random sample of
data in order to reduce the number of records to mine. Sampling is
statistically complicated, but can be done in an RDBMS by use of a
simple random number generator and column in the database. See partitioning.

Scoring:
The process of applying a model to a database or

A well-defined set of cases.

Segmentation and

The generation of a model that groups like-minded
customers into 'prototypical' customer types.

Improves predictiveness of other models. In
combination with customer valuation model the profiling can be
performed on customers with high long term value. This profile
can then be passed on to the marketing department to acquire
new customers with high LTV.
Data Sources: labeled transaction data. Can make use of demographic data.
Measurements: (tbd)
Techniques: Clustering
Issues: (tbd)
See Also: Clustering and Data
Mining Task
  • Berry,
    M. and Linoff, G. (1997) "Data Mining Techniques for
    Marketing, Sales and Customer Support". Figure 3.1, pages
    187-215 (Chapter 10 - Automatic Cluster Detection).


Simpson's Paradox:
What is true
for the parts is not necessarily true for the whole.

Single Class Model Algorithm:
An algorithm that generates classification
predictions. The approach of the algorithm is to choose a single
class and predict that class every time. This algorithm is simpler
than Majority Model because it does not need to know the frequency
of the values in the target column. It can only be better than Majority Model when a
Cost Matrix is used to give priority to a minority class. See also: Majority
Model Algorithm, Target Column, and Cost-Benefit Matrix.

Statistical significance:
A measure of the statistical likelihood that a given numerical
value is true. See confidence window or level.

Statistics:
The field in mathematics which
studies the quantification of variance. One of the basic building
blocks of data mining. See Heuristics.


Systematic Error:
In contrast to random error, systematic error is less easily managed.
According to modern epidemiologic theory, systematic error (or bias,
as it is called there) can result from: information bias (due to errors in
measurement), selection bias (due to flaws in the sampling), and
confounding (due to the damaging / biasing effects of extraneous
factors).


Support:
The measure of how often the collection of items in an
association occur together, as a percentage of all the transactions.
For example, "In 2% of the purchases at the hardware store, both a pick
and a shovel were bought."
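
The hardware-store example can be sketched as follows, assuming each transaction is represented as a collection of item names; the function name support and the synthetic baskets are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    wanted = set(itemset)
    hits = sum(1 for basket in transactions if wanted <= set(basket))
    return hits / len(transactions)

# Synthetic data: 50 purchases, exactly one containing both a pick and a shovel.
baskets = [{"pick", "shovel"}] + [{"pick"}] * 10 + [{"nails"}] * 39
pick_and_shovel = support(baskets, {"pick", "shovel"})   # 1 of 50 purchases
```

Here pick_and_shovel is 0.02, which corresponds to the 2% in the example.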

Support Vector Machines:

Support Vector Machines were invented by Vladimir Vapnik.
They are a method for creating functions from a set of labeled training data.
The function can be a classification function (the output is binary: is the input in a category)
or the function can be a general regression function. For classification, SVMs operate by finding
a hypersurface in the space of possible inputs. This hypersurface will attempt to split the
positive examples from the negative examples. The split will be chosen to have the largest
distance from the hypersurface to the nearest of the positive and negative examples. Intuitively,
this makes the classification correct for testing data that is near, but not identical to the training data.

Target Column: (a.k.a. dependent variable/attribute, label)
The column
whose values a predictive model has to accurately predict. If the
target column is categorical the modeling challenge is referred to
as classification; if the target column is numeric the
challenge is referred to as an estimation problem. See also test
file, train file, classifier,
and estimation.

Test file:
The dataset file
on which predictions will be made. The best predictive model
discovered during the training stage will be used to predict the
values of the target column for each of the rows in the file. The
file should be in the same format as the train file, although the values
in the target column may be represented by a missing value such as a
question mark. See also target column.

Time Series

A complicated technology which is used
to give statistically accurate forecasting. This is often confused
with prediction or simple forecasting, but time series
analysis/forecasting is much more difficult, and mathematically
based. See forecasting.

Total Wallet Modeling:
The generation of a model that forecasts a customer's total
spending for a particular product or service.

Increased sales through more effective product
targeting. In combination with a Customer Value model, the
Total Wallet model can estimate the penetration into each
customer's wallet and therefore identify which customers have
the largest upsell opportunities. A central tenet of
one-to-one marketing is to focus on "wallet share" and market
Data Sources: Survey questionnaire, customer labeled transaction data. Can make use
of demographic data.
Measurements: Minimized RMS between actual and predicted wallet
Techniques: Non-Parametric
for Estimation.
Issues: The return on investment from this model depends on
how much the company already takes up of its customers'
See Also: Data Mining Task

Train file:
The dataset
file against which many algorithms will be tested to discover
the best predictive model. One of the file's columns is selected as
the target (dependent) column. The PredictionWorks web service
accepts comma, tab or single-space delimited files. If the
discovered model will be used to predict future behavior, then the
predictor (independent) columns in the train file must contain
information that was available before the value of the target column
was known, and possibly further back. For example, if the target
column records whether a banner was clicked on or not, then the
predictor columns must contain data that was available before
the person saw the banner. In the example of a mailed-out coupon,
the predictor columns must contain data that was available not just
before the coupon was redeemed but at the time the
coupon was mailed out.

Threshold:
The minimum
percentage of occurrence of a class needed to choose that class.
E.g. if you have a dataset consisting of blue socks and red socks
and your threshold is 0.6, you will need at least 60% of one colour
of sock to choose that colour. See also: classification
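
The sock example can be sketched as below. The function name choose_class is hypothetical, and returning None when no class reaches the threshold is an assumption, since the entry does not say what happens in that case.

```python
from collections import Counter

def choose_class(values, threshold=0.6):
    """Return the most common class if its share reaches the threshold, else None."""
    label, count = Counter(values).most_common(1)[0]
    share = count / len(values)
    return label if share >= threshold else None
```

With seven blue and three red socks the blue share is 70%, so "blue" is chosen; with five of each, neither colour reaches 60% and no class is chosen.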

Type I and Type II Errors:
Predictions based on two-valued (binary) predictive models may be in error for
two reasons:

  • Type I (False Positive) errors occur when the difference
    from the null hypothesis is judged significant (due to factors other than
    chance) when in fact it is not. The probability of this type of
    error is the same as the significance level of the test. Many
    domains consider this the most serious type of error to make. It
    is equivalent to a judge finding an innocent suspect guilty.
  • Type II (False Negative) errors occur when the difference
    from the null hypothesis is judged to be due to chance when in fact it is not.
    Some domains, such as direct marketing, represent this type of
    error as lost income because the person who would have responded
    positively was never contacted.

    See also Cost-Benefit Matrix.
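
For a binary model, the two error types can be counted as in this sketch. It assumes actual outcomes and predictions are given as booleans (True meaning positive); the function name count_binary_errors is hypothetical.

```python
def count_binary_errors(actuals, predictions):
    """Return (false_positives, false_negatives) for paired boolean outcomes."""
    pairs = list(zip(actuals, predictions))
    # Type I: predicted positive although the case is actually negative.
    false_positives = sum(1 for actual, predicted in pairs if predicted and not actual)
    # Type II: predicted negative although the case is actually positive.
    false_negatives = sum(1 for actual, predicted in pairs if actual and not predicted)
    return false_positives, false_negatives
```

In a direct marketing setting each false negative is a responder who was never contacted, which is why the two counts are usually weighted differently rather than simply added together.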

Value Drivers Modeling:
The generation of a model that predicts the top reasons for a
customer to continue to do business with the company.

other forecasting models to channel the right service to the
right customer.
Data Sources: Survey questionnaire, customer labeled transaction
data. Can make use of demographic data.
Measurements: Minimize
average misclassification error rate on top x drivers for each
customer. If the accuracy of predicting the exact x drivers is
low for all models (likely if there are a substantial number
of drivers to choose from) then it is preferable to measure
accuracy of detecting at least one value driver. Possibly also
report the associated confusion matrix.
Techniques: Non-Parametric Classification Algorithms with
Posterior Probabilities.
Issues: (tbd)
See Also: Data
Mining Task

Kitts, B., Hetherington, K, and Donnelly, J. (1998/04)
"Data Mining Algorithms for Customer Analysis and Prediction".
Pages 39-44. DataSage deliverable for Sheraton project.

Validation:
The testing of a model with unseen data.

Visualization:
Visual representation of
discovered patterns.
