Accuracy: The measure of a model's ability to correctly label a previously unseen test case. If the label is categorical (classification), accuracy is commonly reported as the rate at which a case will be labeled with the right category. For example, a model may be said to predict whether a customer responds to a promotional campaign with 85.5% accuracy. If the label is continuous, accuracy is commonly reported as the average distance between the predicted label and the correct value. For example, a model may be said to predict the amount a customer will spend in a given month to within $55. See also Accuracy Estimation, Classification, Estimation, Model, and Statistical Significance. 
Accuracy Estimation: The use of a validation process to approximate the true value of a model's accuracy based on a data sample. See also Accuracy, RMS, Resampling Techniques and Validation. 
Affinity Modeling: The generation of a model that predicts which products or services sell together. Business Benefit:  (tbd)  Data Sources:  Raw transaction data (transaction id, item, price, qty).  Measurements:  Maximization of an accuracy function (usually made up of confidence and support submeasures). Often the maximization is limited to the rules that cover a particular product segment, such as the top 20% revenue/profit generators.  Techniques:  n-way Correlation, Fisher (F) Statistic, Association Rules. In DMM see the co-occurrence module.  Issues:  One-to-one affinities are the most commonly reported. Many-to-one and many-to-many affinities are also available.  See Also:  Association, Cross Sell Modeling, Data Mining Task, Diapers and Beer, Market Basket Analysis, and Price Elasticity Modeling.  References:  Kitts, B. and Donnelly, J. (1998/05) "Point of Sales Data Mining Functions". Berry, M. and Linoff, G. (1997) "Data Mining Techniques", Chapter 8: Market Basket Analysis.  E.g., product placement layout.

Algorithm: A well specified sequence of steps that accepts an input and produces an output. See also Data Mining Algorithm. 
Analysis data file: This file contains the information needed about the processing to be performed on the data file (i.e., target column, cost-matrix values). The file is generated the first time a data file is analyzed, and it can be reused to speed up the definition of the processing information when analyzing another data file that has exactly the same data structure. 
Artificial Intelligence (AI): The science of algorithms that exhibit intelligent behaviour. See Data Mining, Expert Systems, Machine Learning, Heuristics, and Pattern Matching. 
Association: When one data item is found to be closely related to another data item, or to cause another data item, we say that they are associated. Association refers to finding those associated data items. Note that association does not necessarily mean that one data item causes the other. 
Automated Binning: Binning which sets the number of bins based on the range of a numeric value. Therefore, the user is not required to specify the number of bins. However, certain values may be 'lost' from the decision tree because of automatic binning, which is not the case with intelligent binning. See Binning, Discretization. 

Binning: Choosing the number of bins into which a numeric range is split. For example, if salaries range from $20,000 to $100,000, the values must be binned into some number of groups, probably between eight and twenty. Many data mining products require the user to manually set binning. See Automated Binning, Discretization. 
Black Box: Any technology, and especially any algorithm, that does not explain how it achieved its results. This renders some data mining technologies unsuitable for many business applications. See Neural Nets. 
Bootstrap: A resampling technique used to estimate a model's accuracy. Bootstrap performs b experiments, each with a training set that is randomly sampled, with replacement, from the data set. Finally, the technique reports the average and standard deviation of the accuracy achieved on each of the b runs. Bootstrap differs from cross-validation in that test sets across experiments will likely share some rows, while cross-validation is guaranteed to test each row in the data set once and only once. See also Accuracy, Resampling Techniques, and Cross-validation. 
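A minimal sketch of the procedure, assuming a hypothetical train_and_score(train_rows, test_rows) helper that fits a model and returns its accuracy on the test rows (neither the helper nor the data layout comes from this glossary):

    import numpy as np

    def bootstrap_accuracy(data, train_and_score, b=30, seed=0):
        rng = np.random.default_rng(seed)
        n = len(data)
        scores = []
        for _ in range(b):
            # Sample n rows with replacement for training; rows never drawn
            # (roughly a third of the data on average) form the test set.
            idx = rng.integers(0, n, size=n)
            test_mask = np.ones(n, dtype=bool)
            test_mask[idx] = False
            scores.append(train_and_score(data[idx], data[test_mask]))
        # Report the average and standard deviation over the b runs.
        return np.mean(scores), np.std(scores)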

C4.5: A decision tree algorithm developed from ID3, originally created by Ross Quinlan. C4.5 can process both discrete and continuous data and makes classifications. C4.5 implements post-pruning, also known as backward pruning. C4.5 was developed over 20 years ago and is therefore a tried and tested algorithm. See ID3, Pruning, Gini. 
Campaign Response Modeling: This model predicts the people most likely to respond to a promotional campaign. Business Benefit:  The marketing department can achieve its new-business quota with fewer resources by concentrating them on the better prospects.  Data Sources:  Data from previous promotional campaigns. Can make use of information from a Segmentation and Profiling model of the current customer base.  Measurements:  First model to reach a pre-specified quota (e.g., 85%) of total expected respondents. Often visualized with a lift chart.  Techniques:  Non-Parametric Classification Algorithms with posterior probability estimates.  Issues:  (tbd)  See Also:  Cross Sell Modeling, Data Mining Task, and Gain Chart.  References:  Kennedy, R. L. et al. (1998) "Solving Data Mining Problems Through Pattern Recognition", Prentice Hall/Unica Technologies, pages 4.2-4.4. Berry, M. and Linoff, G. (1997) "Data Mining Techniques for Marketing, Sales and Customer Support", Figure 3.1, pages 107-108. 
CART: A statistical decision tree algorithm used for classical statistical analysis. CART stands for Classification And Regression Trees. CART can be used to build decision trees, in which case it can also use the Gini index. CART can only process numeric values effectively. See Statistics, CHAID, Gini, numerics, symbolics. 
Causal Factor: Any data item which drives, influences, or causes another data item. For example, if customer credit limit drives how profitable a customer is likely to be, it is called a causal factor. See Discriminating Factor. 
CHAID: A hybrid algorithm which grafts a chi-squared statistical formula onto AID (heuristics), in an attempt to handle both numerics and symbolics. While CHAID is reliable, it is slow and limited in power. See AID, CART, Gini, Statistics, Heuristics. 
Chi Square Distribution: A mathematical distribution with positive skew. Its shape depends on the degrees of freedom (df); the skew decreases as the degrees of freedom increase. The distribution is used directly or indirectly in many tests of significance. See also Chi Square Test. Reference: Lane, D. (1999) HyperStat Online Textbook, Chapter 16: Chi Square. 
Chi Square Test: The significance test used on a contingency table to determine the relationship between two variables. The chi square test assumes that the test statistic follows the chi square distribution. 
Classification/Classifier: The act of labeling a test case as one of a finite number of output classes. A model that classifies is sometimes referred to as a "classifier". Commonly a classifier's performance is measured by its ability to correctly label unseen test cases, that is, its "accuracy". Inversely, a classifier's performance may be measured by its "error rate". A more detailed insight into a classifier's performance is given by the Confusion Matrix structure, because it captures how well the classifier predicts each of the available classes. If a Cost-Benefit Matrix is available, then the classifier's performance is measured by the product of the Confusion and Cost-Benefit matrices. See also Accuracy, Classification Algorithm, Confusion Matrix, Cost-Benefit Matrix, Estimation, Model, and Type I and Type II Errors. 
Classification Algorithm: An algorithm that performs classification. Some algorithms first construct a model that can then be used to classify (e.g., Decision Tree, Logistic Regression), while other algorithms perform the labeling directly (e.g., k-Nearest-Neighbor). See also Decision Tree, k-Nearest-Neighbor, and Logistic Regression. 
Cluster: A set of similar cases. 
Clustering: The development of a model that labels a new instance as a member of a group of similar records (a cluster). For example, clustering could be used by a company to group customers according to income, age, and prior purchase behavior. Cluster detection rarely provides actionable information directly, but rather feeds information to other data mining tasks. See also Clustering Algorithms, Segmentation and Profiling. Reference: Berry, M. and Linoff, G. (1997) "Data Mining Techniques", Chapter 10: Automatic Cluster Detection. 
Clustering Algorithms: Given a data set, these algorithms induce a model that classifies a new instance into a group of similar instances. Commonly the algorithms require that the number of clusters (c) to be identified is pre-specified, e.g., find the c=10 best clusters. Given a distance metric, these algorithms will try to find groups of records that have low distances within the cluster but large distances to the records of other clusters. See also Agglomerative Clustering Algorithms, Clustering, Divisive Clustering Algorithms, K-means Algorithm, and Unsupervised Learning. Reference: Hair, J. F. et al. (1998) "Multivariate Data Analysis", 5th edition, Chapter 9, pages 469-517. 
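A minimal sketch of clustering with a pre-specified c, using scikit-learn's KMeans purely as an illustration (the glossary names no specific library); the feature matrix is hypothetical:

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(1000, 4)       # hypothetical customer feature rows
    # Find c=10 clusters by minimizing within-cluster distances.
    model = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
    print(model.labels_[:20])         # cluster membership of the first rows
    print(model.predict(X[:1]))       # classify a new instance into a cluster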
Confidence: The confidence of the rule "B given A" is a measure of how much more likely it is that B occurs when A has occurred. It is expressed as a percentage, with 100% meaning B always occurs if A has occurred. Statisticians refer to this as the conditional probability of B given A. When used with association rules, the term confidence is observational rather than predictive. (Statisticians also use this term in an unrelated way: there are ways to estimate an interval, and the probability that the interval contains the true value of a parameter is called the interval confidence. So a 95% confidence interval for the mean has a probability of .95 of covering the true value of the mean.) 
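Expressed as a formula consistent with this entry and with the Support entry below (the notation is supplied here, not taken from the original):

    confidence(A => B) = P(B | A) = support(A and B) / support(A)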
Confidence Window or Level: A statistical measurement of how sure one can be that a certain result is true. The window or level describes how close the value is likely to be to the exact result. See Statistical Significance. 
Confounding Factor: (from the Latin confundere, to mix together) A distortion of an association between an intervention (I) and response (R) brought about by an extraneous cofactor (C). This problem occurs when the intervention is associated with C and C is an independent factor for the response. For example, ____ (C) confounds the relationship between ______ (R) and __couponing__ (I), since R and C are related, and C is an independent risk factor for R. When differences between the treatment and control groups other than the treatment produce differences in response that are not distinguishable from the effect of the treatment, those differences between the groups are said to be confounded with the effect of the treatment (if any). For example, prominent statisticians questioned whether differences between individuals that led some to smoke and others not to (rather than the act of smoking itself) were responsible for the observed difference in the frequencies with which smokers and nonsmokers contract various illnesses. If that were the case, those factors would be confounded with the effect of smoking. Confounding is quite likely to affect observational studies and experiments that are not randomized. Confounding tends to be decreased by randomization. See also Simpson's Paradox. Reference: http://www.stat.berkeley.edu/users/stark/SticiGui/Text/gloss.htm 
Confusion Matrix: A table that illustrates how well a classifier predicts. Instead of a simple misclassification error rate, the table highlights where the model encounters difficulties. For each of the c output classes, the table presents the algorithm's likelihood of predicting each one of the c classes. As an example, consider a classifier evaluated on a problem with three (c=3) output classes: cans, produce and dairy, using a test set of 100 cases with a distribution of 30 cans, 35 produce and 35 dairy. A perfect classifier would only make predictions along the diagonal of the matrix; in this example the algorithm was correct on only (20+25+24)/100 = 69% of the cases. The matrix would also show that the classifier often confuses dairy for cans (11 incorrect) and cans for dairy (9 wrong). See also Classification. 
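A minimal sketch of how such a table can be tallied from predictions; the three class names match the example above, but the individual rows below are hypothetical:

    from collections import Counter

    classes = ["cans", "produce", "dairy"]
    actual    = ["cans", "dairy", "produce", "dairy", "cans"]
    predicted = ["cans", "cans",  "produce", "dairy", "dairy"]

    # Count (actual, predicted) pairs; diagonal entries are correct labels.
    counts = Counter(zip(actual, predicted))
    for a in classes:
        print(a, [counts.get((a, p), 0) for p in classes])

    accuracy = sum(counts.get((c, c), 0) for c in classes) / len(actual)
    print("accuracy:", accuracy)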
Contingency Tables: Used to examine the relationship between two categorical (or discretized continuous) variables. A chi square test is used to test the significance of the relationship between the column and row frequencies, that is, whether the frequencies of one of the variables depend on the other. 
Control Group Study (a.k.a. Randomized Controlled Study): A model of evaluation in which the performance of cases who experience an intervention (the treatment group) is compared to the performance of cases (the control group) who did not experience the intervention in question. In medical studies where the intervention is the administration of drugs, for example, the control group is known as the placebo group because a neutral substance (placebo) is administered to the control group without the subjects (or researchers) knowing whether it is an active drug or not. Typically, the intervention is considered successful if its performance exceeds that of the control group by a statistically significant amount. When assignment to control and treatment groups is made at random, and no other factors enter into the assignment, any differences between the two groups are due either to the treatment or to random variation. When a given difference between the two groups is observed, say in spending on a particular set of items, it is possible to calculate the probability of this difference arising purely by chance. If the probability of an observed difference is very small (generally less than 5 percent, though more stringent rules can be adopted), the observed difference is said to be due to the treatment. 
Correlation Coefficient (also Pearson's Product Moment Correlation Coefficient): A correlation coefficient is a number, usually between -1 and 1, that measures the degree to which two continuous columns are related. Usually the term refers to Pearson's Product Moment Correlation Coefficient, denoted by r, which measures the linear association between two variables. If there is a perfect linear relationship with positive slope between the two variables, the correlation coefficient is 1; if there is positive correlation, whenever one variable has a high (low) value, so does the other. If there is a perfect linear relationship with negative slope between the two variables, the correlation coefficient is -1; if there is negative correlation, whenever one variable has a high (low) value, the other has a low (high) value. A correlation of 0 means that there is no linear relationship between the variables. See also Spearman Rank Correlation Coefficient. 
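For two columns x and y of length n with means x_bar and y_bar, Pearson's r can be written as (formula supplied here for reference, not from the original):

    r = sum_{i=1..n}( (x_{i} - x_bar)(y_{i} - y_bar) ) / sqrt( sum_{i=1..n}(x_{i} - x_bar)^{2} * sum_{i=1..n}(y_{i} - y_bar)^{2} )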
Cost-Benefit Matrix: A cost-benefit matrix is an input to the modeling process that allows predictive modelers to describe the costs and the benefits associated with each possible prediction. By default the cost-benefit matrix has a value of one (1.0) for correct predictions and zero (0.0) for incorrect predictions; this configuration asks that the predictive model optimize raw accuracy, with no weighting for any output possibility. In most real-world situations, however, an incorrect prediction has a net monetary cost (less than zero), and a correct prediction has a positive benefit. When the cost-benefit matrix is assigned new, non-default values, the model optimizes the net benefit (profit) associated with each prediction. The cost-benefit matrix input is essential for businesses that want to optimize their return on investment. PredictionWorks supports the use of a cost-benefit matrix. 
Cross Sell Modeling: The generation of a model that predicts which products a specific customer would likely buy, or which customers would likely buy a specific product. This task is similar to Affinity Modeling and Campaign Response Modeling except that the resulting model is customer-centric and targets existing customers instead of new prospects. Business Benefit:  (tbd)  Data Sources:  Customer-labeled transaction data. Can make use of demographic data, a Value Drivers model, and a Customer Valuation model.  Measurements:  Minimize average misclassification error rate.  Techniques:  Non-Parametric Classification Algorithms with posterior probability estimates.  Issues:  (tbd)  See Also:  Affinity Modeling, Data Mining Task, and Campaign Response Modeling.  References:  Kennedy, R. L. et al. (1998) "Solving Data Mining Problems Through Pattern Recognition", Prentice Hall/Unica Technologies, pages 4.4-4.6.  E.g., who will buy another, unrelated product? 
Cross-validation: A resampling technique used to estimate a model's accuracy. Cross-validation first segments the data rows into n nearly equally sized folds (F_{1}..F_{n}). Once the segmentation is accomplished, n experiments are run, each using F_{i} as the test set and the other n-1 folds appended together to form the train set. Finally, the technique reports the average and standard deviation of the accuracy achieved on each of the n runs. Too small a value for n will not achieve a confident accuracy estimate, while too large a value for n will increase the variance of the estimate and will require increased computation. Empirical investigation into this technique has suggested the value of n=10 (10-fold cross-validation) to achieve useful results. See Accuracy, Resampling Techniques, and Bootstrap. 
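A minimal sketch of the n-fold procedure, reusing the hypothetical train_and_score(train_rows, test_rows) helper assumed in the Bootstrap entry:

    import numpy as np

    def cross_validate(data, train_and_score, n=10, seed=0):
        rng = np.random.default_rng(seed)
        order = rng.permutation(len(data))
        folds = np.array_split(order, n)     # F_1 .. F_n, nearly equal sizes
        scores = []
        for i in range(n):
            # Fold i is the test set; the other n-1 folds form the train set,
            # so every row is tested exactly once across the n runs.
            test_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(n) if j != i])
            scores.append(train_and_score(data[train_idx], data[test_idx]))
        return np.mean(scores), np.std(scores)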
Customer Relationship Management (CRM): The business processes that strengthen the relationship between a seller and its customers. To ensure positive contacts, CRM requires the measurement of each customer's value to the enterprise, the storing of all relevant transactional (behavioral) data, and the ability to predict future customer behavior. The implementation of a CRM process requires a significant technological investment in computing hardware and software, personnel, and customer contact (touch point) systems.
Worldwide revenues in the CRM services markets will increase at a compound annual growth rate of 29 percent, from $34.4 billion in 1999 to $125.2 billion in 2004, according to International Data Corp. (IDC). META Group predicts a 50 percent annual growth rate for the global CRM market and projects it will grow from more than $13 billion in 2000 to $67 billion in 2004. CRM is an information industry term for methodologies, software, and usually Internet capabilities that help an enterprise manage customer relationships in an organized way. For example, an enterprise might build a database about its customers that describes relationships in sufficient detail so that management, salespeople, people providing service, and perhaps the customer directly could access information, match customer needs with product plans and offerings, remind customers of service requirements, know what other products a customer has purchased, and so forth. According to one industry view, CRM consists of: helping an enterprise to enable its marketing departments to identify and target their best customers, manage marketing campaigns with clear goals and objectives, and generate quality leads for the sales team; assisting the organization to improve telesales, account, and sales management by optimizing information shared by multiple employees and streamlining existing processes (for example, taking orders using mobile devices); allowing the formation of individualized relationships with customers, with the aim of improving customer satisfaction and maximizing profits, identifying the most profitable customers, and providing them the highest level of service; and providing employees with the information and processes necessary to know their customers, understand their needs, and effectively build relationships between the company, its customer base, and distribution partners. 
Customer Value Modeling: The generation of a model that forecasts a customer's future spending, in general or within specific business areas. Business Benefit:  Supports other models by providing them with a list of customers with high long term value (LTV), or possibly those with low LTV.  Data Sources:  Customer-labeled transaction data. Can make use of demographic data.  Measurements:  Minimized RMS error between actual and predicted spending.  Techniques:  Non-Parametric Estimation Algorithms.  Issues:  (tbd)  See Also:  Data Mining Task, Long Term Value (LTV), REVPAR, and REVPAC.  References:  Kitts, B., Hetherington, K., and Donnelly, J. (1998/04) "Data Mining Algorithms for Customer Analysis and Prediction", pages 23-24, 33-38. DataSage deliverable for Sheraton project.  E.g., what is a customer's LTV? 

Data Selection: See Data Mining. 
Data Mining: The automatic detection of trends and associations hidden in data, often in very structured data. Data Mining is sometimes thought of as a single phase of a larger process that includes Data Selection, Data Cleansing, Data Transformation, Data Mining, and Evaluation. See Data Mining Algorithms, Machine Learning, Statistics. 
Data Cleansing: Locating and resolving problems with a dataset: repeated rows; non-unique keys; gaps in time; missing data; columns dominated by one value; columns with a large number of categorical values; discretizing or numeralizing columns. See Data Mining. 
Data Mining Algorithm: An algorithm that accepts structured data and returns a model of the relationships within the data set. The algorithm's performance is measured by its accuracy, training/testing time, training/testing resource requirements, and the model's understandability. See also Accuracy, Algorithm, Classification, Estimation, Parametric Modeling, and Non-Parametric Modeling. 
Data Mining Task: A general problem for which data mining is called in to generate a model. In this glossary, data mining tasks are described according to the following template:  Business Benefit:  the actionable information and, hopefully, how it will positively impact the company's return on investment.  Data Sources:  the common sources of information required to generate this model, and possibly a high-level metadata example.  Measurements:  the performance gauge that will identify the winning model. In the end what really counts is the return on investment, so it is beneficial to tie the model's measurement to real dollars.  Techniques:  some commonly used data mining algorithms to induce this category of model. This section should focus on algorithms that exist within DataSage Mining Manager (DMM).  Issues:  common stumbling blocks specific to this type of modeling.  See Also:  Affinity Modeling, Campaign Response Modeling, Cross Sell Modeling, Customer Value Modeling, Demand Modeling, Fraud Detection, Price Elasticity Modeling, Retention Modeling, Segmentation and Profiling, Total Wallet Modeling, and Value Drivers Modeling.  References:  a list of fuller write-ups on the topic, preferably from past projects but also from our library.  
Decision Tree: A model made up of a root, branches and leaves. Decision trees are similar to organization charts, with statistical information presented at each node. See Axis Parallel Representations. 
Decision Tree Algorithm: An algorithm, from the fields of Machine Learning and Statistics, that generates classification or estimation models. The basic approach of the algorithm is to use a splitting criterion to determine the most predictive factor and place it as the first decision point in the tree (the root), and to continually perform this search for predictive factors to build the branches of the tree until there is no more data to continue with. Tree pruning raises accuracy on noisy data and can be performed as the tree is being constructed (pre-pruning) or after the construction (post-pruning). The algorithm is commonly used for classification problems that require the model to be represented in a human-readable form. PredictionWorks has several implementations of the Decision Tree Algorithm: two of them use different splitting criteria (gini and entropy), and C4.5 is an implementation of the well-known algorithm by J.R. Quinlan. See also Classification Algorithm, Estimation Algorithm, C4.5, Entropy, and Gini. 
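A minimal sketch of inducing trees with the two splitting criteria named above, using scikit-learn as an illustrative stand-in (the entry describes PredictionWorks' own implementations, which are not shown here); the data is hypothetical:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    X = np.random.rand(200, 3)               # hypothetical predictor columns
    y = (X[:, 0] > 0.5).astype(int)          # hypothetical target column

    for criterion in ("gini", "entropy"):    # the two splitting criteria
        tree = DecisionTreeClassifier(criterion=criterion, max_depth=3).fit(X, y)
        print(criterion, "->", tree.score(X, y))
    print(export_text(tree))                 # human-readable form of the model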
Demand Modeling: The generation of a model that forecasts when an item will be ordered and how large the order will be. Business Benefit:  Overstocking and understocking have their own associated costs. Overstocking increases interest expenses and the impact of product spoilage. Understocking increases lost sales and damages the customer experience.  Data Sources:  Order data (item, qty, tstamp). Spoilage data if appropriate. Delivery schedules, push (promotion) orders.  Measurements:  Minimization of RMS error between actual and predicted item order size. Variance of error to determine the minimum allowable days of safety stock required. Success will likely also include the optimization of several submeasures that reflect how the client measures its own success, such as inventory levels, turnover, and other measures that relate to warehouse usage and how much capital is tied up.  Techniques:  Time-series Forecasting.  Issues:  (tbd)  See Also:  Data Mining Task.  References:  Kitts, B. (1998) "The Complete Short Cycle Forecasting System", Walmart (SCFS) project. Short Cycle Forecasting System, Phase I (see Murtuza's binder).  E.g., are we under/over stocked? 
Diapers and Beer: A popular anecdote used to illustrate the unexpected but useful patterns discovered by data mining. The anecdote (probably apocryphal) recounts that a large supermarket chain used data mining to discover that customers often bought diapers and beer at the same time. When the retailer displayed the two items together, sales increased for both. 
Discovery: Finding unexpected but useful trends and associations hidden in data. See modeling, associations. 
Discriminating Factor: A measure of how important a causal factor is, used by decision trees to build the tree. See decision trees, causal factor. 

Entropy: In data mining, a measure of the relative difference between two or more data partitions based on information theory. See also Gini. 
Entropy Heuristic: The use of entropy to determine the information gain of a particular attribute (predictor) in a Decision Tree Algorithm. The attribute with the greatest entropy reduction is chosen as the test attribute in a Decision Tree model, because splitting on that attribute produces the purest (most uniform) data distribution. The purity of the data distribution affects the ultimate accuracy of the resulting model. The attribute with the greatest entropy reduction is also the attribute with the highest information gain. See also Gini Heuristic. Reference: Han, J. and Kamber, M. (2000) Data Mining: Concepts and Techniques, Chapter 7: Classification and Prediction. 
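The standard information-theoretic formulas behind this heuristic, supplied here for reference (the original gives none): for a data set S whose c classes occur with proportions p_{1}..p_{c}, and an attribute A whose values v split S into subsets S_{v}:

    entropy(S) = - sum_{i=1..c} p_{i} * log_{2}(p_{i})
    gain(S, A) = entropy(S) - sum_{v} ( |S_{v}| / |S| ) * entropy(S_{v})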
Estimation/Regressor: The act of labeling a test case with a continuous value. A model or algorithm that estimates is sometimes referred to as a "regressor". Commonly a regressor's performance is measured by its ability to predict a value that is near to the actual value, such as with a correlation coefficient. See also Classification, Correlation Coefficient, and Estimation Algorithm. 
Estimation Algorithm: An algorithm that performs estimation. Some algorithms first construct a model that can then be used to estimate (e.g., Decision Tree, Linear Regression), while other algorithms perform the labeling directly (e.g., k-Nearest-Neighbor). See also Decision Tree, Estimation, k-Nearest-Neighbor, Linear Regression. 
Euclidean Distance: A measure of the distance between two points. For any two n-dimensional points a = (a_{1}, a_{2}, ..., a_{n}) and b = (b_{1}, b_{2}, ..., b_{n}), the distance between a and b is equal to: sqrt( (a_{1} - b_{1})^{2} + (a_{2} - b_{2})^{2} + ... + (a_{n} - b_{n})^{2} ) 

Forecasting: Adapting data mining techniques to forecast future trends with statistical reliability. Forecasting is often confused with prediction, but is usually much more complex. See Time Series Analysis/Forecasting, What-If Analysis, Neural Nets. 
Fraud Detection: This modeling procedure predicts infrequent events that bear a large financial penalty. This type of modeling is commonly used to detect criminal activity such as credit card fraud, insurance claim fraud, and Internet/wireless hacking. Each type of fraud detection requires a slightly different technique. Generally, anomalous events that do not fit the normal usage patterns trigger fraud detection alarms. The main challenges to these tasks are due to the low frequency of the undesirable events, usually under one percent (1%). Usage of the cost-benefit matrix is critical to properly weigh the benefits of correct and incorrect predictions. These conditions often mean that: 1) the lack of examples of fraudulent events makes it difficult to discriminate between legitimate and fraudulent behavior; 2) the overwhelming number of legitimate events leaves little room for lift on the already high accuracy of the simple model that predicts all events to be legitimate; 3) the existence of a cost-benefit matrix allows us to dismiss the simple model described in condition 2. The use of cost-benefit matrix information in predictive modeling, however, is a new concept. 

Gini: A modern decision tree splitting index, popularized by Leo Breiman and colleagues through their CART work. Gini handles both numbers and text, and offers good processing speed. See also C4.5 and CHAID. 
Gini Heuristic: The use of the Gini index to determine the information gain of a particular attribute (predictor) in a Decision Tree Algorithm. See also Entropy Heuristic. 
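The usual form of the index, supplied here for reference (not given in the original): for a data set S whose c classes occur with proportions p_{1}..p_{c}:

    gini(S) = 1 - sum_{i=1..c} p_{i}^{2}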

Heuristics: A rule of thumb that speeds up the search for an optimal result. 
Horizon Effect: The event where Decision Tree construction is halted prematurely because no further benefit seems apparent. Usually happens as a result of Pre-Pruning. 

ID3: One of the first algorithms designed to build decision trees, invented by Ross Quinlan at the University of Sydney, Australia. ID3 was followed by ID4, ID6, and See5. See C4.5, Gini, CHAID, CART. 
Information Gain: A measurement used to select the test attribute (predictor) at each node during Decision Tree model construction. See also Attribute Selection Measure or Splitting Criterion. 
Instance-based Learning: A machine learning technique in which the training datasets are stored in their entirety and a distance function is used to make predictions. See also kNN. 

k-Nearest-Neighbor (kNN) Algorithm: An algorithm from the field of Pattern Recognition that generates both estimation and classification models. The algorithm assumes that similar cases behave similarly. The most common proximity measure is based on the Euclidean distance between two vectors. For classification problems the prediction is based on the statistical mode (most common value) of the response value for the k closest cases. For example, to predict the target value of a test case with k set to seven (k=7), the seven cases most similar to the test case are fetched and the most common value among the seven is used to make the prediction. If the problem is an estimation challenge, then the average of the seven is used for the prediction. This algorithm works with most datatypes but is most effective in the presence of continuous columns, where the Euclidean distance can be calculated. 
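A minimal sketch of the k=7 prediction step described above; the function name and data layout are supplied here for illustration:

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x, k=7, classify=True):
        # Euclidean distance from the test case x to every training case.
        dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        nearest = np.argsort(dists)[:k]          # the k closest cases
        values = y_train[nearest]
        if classify:
            # Classification: statistical mode of the k response values.
            return Counter(values).most_common(1)[0][0]
        return values.mean()                     # estimation: their average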

Least Squares: A method used to find the line through the datapoints that minimizes the sum of the squared distances between the datapoints and the line. 
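For points (x_{1},y_{1})..(x_{n},y_{n}) and a candidate line y = a + b*x, least squares chooses the a and b that minimize (formula supplied here for reference):

    sum_{i=1..n} ( y_{i} - (a + b*x_{i}) )^{2}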
Gain Chart (Lift Curve): A method from direct marketing that helps to visualize a classifier's accuracy on binary (positive/negative) problems. Lift charts are commonly used in promotion campaign response modeling (responded/did not respond) to present how well a model performs when compared to a random mailing. The x-axis represents the percentage of the total population covered, say a city of 100,000. The y-axis presents the cumulative percentage of correctly classified positive cases, say the 30,000 who would respond if they received the mail-out. The chart should include the performance of a random case selection (a straight line from [0%,0%] to [100%,100%]) and the performance of the model under investigation. Other possible lines in the chart include the performance of other competing models, the performance of a perfect classifier, and the quota to be achieved. In one such chart, for example, kNN (k-Nearest Neighbor) might be seen to reach an 85% quota faster than a DT (Decision Tree). 
Linear Regression: An algorithm from the field of Statistics that generates estimation models. The algorithm assumes that a linear relationship exists between the independent variables and the response variable. PredictionWorks uses Least Squares as a measure of model fitness. See also Parametric Modeling, and Least Squares. 
Log-Likelihood: The logarithm of a likelihood equation. It is used when the logarithm is easier to work with than the equation itself and the outcome is unaffected. 
Logistic Regression: An algorithm from the field of Statistics that generates classification models. The target column may be non-binary, but most implementations are limited to two-value (binary) predictions. PredictionWorks' logistic regression algorithm uses maximum likelihood estimation to determine the parameters of the regression equation. Forward stepwise selection is used to find the most predictive columns from which to build the model. The stopping criterion is based on a chi-square distribution. See also Classification Algorithm and Log-Likelihood. 
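For reference, the regression equation whose parameters b_{0}..b_{k} are fit by maximum likelihood has the standard logistic form (supplied here, not from the original):

    P(target = 1 | x_{1}..x_{k}) = 1 / ( 1 + e^{-(b_{0} + b_{1}*x_{1} + ... + b_{k}*x_{k})} )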

Machine Learning: The research area within AI that studies algorithms which induce models from a set of data. Machine learning differs from statistics and pattern recognition in its focus on knowledge representation, symbolic data, automated discovery, and computational complexity analysis. 
Majority Model Algorithm: A classification algorithm that simply predicts the most common value found in the target column. For example, if 87% of a data set's rows contain the value "Did_Not_Respond" in the target column, then the Majority Model will simply predict "Did_Not_Respond" for every row, and on average achieve a raw accuracy of 87%. If one single value (class) is very prevalent in the data set, say greater than 95%, then this model's raw accuracy will be difficult to improve on. When this is the case, the situation usually requires a Cost-Benefit Matrix to represent the fact that predicting the majority class may have little value in the real world. For example, predicting that all customers will not respond to a direct mail campaign may be accurate, but that model will generate no revenue. See also Classification Algorithm, Minority Model Algorithm, Mean Model Algorithm, and Naïve Model Algorithm. 
Market Basket Analysis: A technique, used in large retail chains, which studies every purchase made by customers to find out which sales are most commonly made together. See Diapers and Beer. 
Maximum Likelihood Estimation: A method of choosing parameters for a regression equation that maximizes the likelihood of observing the target value. 
Mean Model Algorithm: An estimation algorithm that simply predicts the mean (average) value found in the target column. For example, if the average value of a data set's target column is 7.875, then the Mean Model will simply predict 7.875 for every row. Often the central tendency of a data set is better captured by the median value; however, calculating a column's median is significantly more complex (it requires sorting) than calculating the mean. See also Naïve Model Algorithm. 
Meta data file: This file contains the necessary information about the data structure (i.e., delimiter, data type for each column). The file is generated the first time a data file is analyzed, and it can be reused to speed up the definition of the data structure when analyzing another data file that has exactly the same structure. 
Minority Model Algorithm: A classification algorithm that simply predicts the least common value found in the target column. PredictionWorks implicitly supports this model through the Naïve Model Algorithm by creating a separate model for every value in the target class. This model is of interest when a Cost-Benefit Matrix biases the value of predicting against the majority value (class). See also Naïve Model Algorithm. 
Missing Value: A data value that is not known or does not exist. Common reasons for missing values are unfilled optional features, data entry malfunctions, and non-applicability of a column. Some algorithms require that missing values be filled in (imputed). PredictionWorks imputes values based on the mean if the column is numeric, or the mode if the column is categorical. 
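A minimal sketch of mean/mode imputation as described; pandas is used purely as an illustration, and the small frame below is hypothetical:

    import pandas as pd

    df = pd.DataFrame({"income": [40000, None, 55000],
                       "region": ["east", None, "east"]})
    df["income"] = df["income"].fillna(df["income"].mean())     # numeric -> mean
    df["region"] = df["region"].fillna(df["region"].mode()[0])  # categorical -> mode
    print(df)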
Mode: In mathematics, the most common class in a set. 
Modeling: Building a model from a set of data. See also Model, Non-Parametric Modeling, and Parametric Modeling. 
Model: A structure that labels a case based on the case's characteristics. A model is measured by its accuracy and speed, and sometimes by its ability to clearly justify its prediction. See Accuracy, Modeling. 

Naïve Bayes Algorithm: An algorithm from the field of Statistics that generates classification models. The basic approach is simply to use prior probabilities as the predictive model. The primary constraint of this algorithm is that it assumes independence of the attributes. The algorithm is generally faster and less accurate than other model-based algorithms. PredictionWorks' implementation uses a forward selection approach. 
Naïve Model Algorithm: An algorithm that generates both classification and estimation predictions. The approach of the algorithm is simply to make predictions based on the distribution of the target column. There are several possible versions of this algorithm, each making different assumptions. PredictionWorks implements the Majority Model and the Single Class Model. See also Majority Model Algorithm, Minority Model Algorithm, Mean Model Algorithm, Single Class Model Algorithm, classification/estimation models, and target column. 
Neural Networks: A predictive modeling technology that attempts to mimic the complex functions of the brain. The main problem with neural nets is that the tools do not explain how they determined their results. Another limitation is that only skilled professionals can successfully use them. See Black Box. 
Non-Parametric Modeling: The development of a model with the use of algorithms that do not make many prior assumptions about the population from which the data is drawn. This approach requires less understanding of the specific relationships in the underlying system than parametric modeling. See also Neural Networks, k-Nearest-Neighbor, Decision Tree Algorithms, and data-driven modeling. 

Overfitting: The tendency to mistake noise in data for trends. 

Parametric Modeling: The development of a model with the use of strong assumptions about the underlying behavior within the population from which the data is drawn. The use of these algorithms generally requires a thorough understanding of their specific assumptions and how to test for their validity. If the assumptions are correct, the generated model will be more accurate than a model generated by a non-parametric modeling algorithm. See Linear Regression, Non-Parametric Modeling. 
Partitioning: Choosing data which is most interesting for mining. This is typically at least eighty percent of the work of data mining. See sampling. 
Post-Pruning: A type of pruning in the Decision Tree Algorithm where the pruning process is done after the model is fully constructed. Post-pruning removes anomalies by removing branches from a "fully grown" tree. See also Pre-Pruning, Pruning, and Overfitting. 
Predictive Modeling: Modeling that emphasizes accuracy on unseen data. 
Prediction: Using existing data to predict how a new case will behave, assuming that some facts are known about the new case. Making a credit check of new customers by using data on existing customers is a good example of prediction. See What-If Analysis, Time Series Analysis/Forecasting, Forecasting. 
Pre-Pruning: A type of pruning in the Decision Tree Algorithm where tree construction is halted early, before the model is fully constructed. This approach is generally faster than Post-Pruning, but will be less accurate due to the horizon effect. See also Post-Pruning, Pruning, and Overfitting. 
Price Elasticity Modeling: The generation of a model that forecasts a product's sales volume based on its price. For example, if the price of milk is increased from $3.50/g to $4/g, the sales of this product may fall from 683/week to 546/week. Business Benefit:  Set the price for a product that optimizes either the revenue or profit that can be drawn from this product. Price is one of the most important factors that a retailer can use to improve profits.  Data Sources:  Transaction data (item, price, mean(qty), cost, date). Can also make use of competitors' pricing and seasonal codes.  Measurements:  Minimized distance between actual and predicted sales volume. Ideally the model that achieves the largest increase in profit (and/or revenue) over all products would be chosen.  Techniques:  A linear model has been found to be useful within the area where profit and revenue are maximized. Non-linear models may be better at predicting the outcomes of extreme price points like 'barnyard' sales. In DMM (tbd).  Issues:  (tbd)  See Also:  Affinity Modeling, Cannibalization, Data Mining Task, Linear Regression, Price Elasticity, and Sensitivity Analysis.  References:  Kitts, B. and Donnelly, J. (1998/05) "Point of Sales Data Mining Functions". Melli, G. (1998/08) "Price, Demand, Data Mining and Profit". Melli, G. (1998/11) "Strategic Hotel Capital: Price Sensitivity Analysis". Dolan, R. and Simon, H. (1996) "Power Pricing: How Managing Price Transforms the Bottom Line", The Free Press (in our library).  E.g., optimize revenue/profit. 
Pruning: The process of removing parts of a Decision Tree model that describe anomalies rather than the true features of the underlying model. There are two types of pruning process: Pre-Pruning and Post-Pruning. See also Overfitting. 
Posterior Probability (a posteriori probability) Models: A classification model which estimates the probability that input x belongs to class C, denoted by P(C|x). With these estimates a set of inputs can be ranked from most to least certain, so that resources can be focused on the top prospects. See also Campaign Response Modeling and Gain Chart. 

Random Error: Can be thought of as either sampling variation or as the sum total of other unobservable chance deviations that occur with sampled data. The key to working with random error is that it tends to balance out in predictable ways over the long run. It can thus be dealt with using routine methods of estimation and hypothesis testing. 
Random Model Algorithm: An algorithm that generates classification predictions. The basic approach is to randomly predict one of the classes without consideration for its frequency. 
Resampling Techniques: An empirical approach to model accuracy estimation based on training and testing on multiple samples of the data set. See also Accuracy Estimation, Cross-Validation, and Bootstrap. 
Retention Modeling: The generation of a model that forecasts the probability that a customer will significantly reduce the business they bring in, or defect altogether. Business Benefit:  Increase the long term value (LTV) of the customer base. Try to win back customers before they defect, because it costs more to bring in a new customer than it does to hold on to an existing one. In combination with a Customer Value model, the Retention model can focus on high value customers. Survey potential defectors to stay on top of service/product shortfalls.  Data Sources:  Customer-labeled transaction data. Can also make use of customer satisfaction questionnaires, demographic data, a Value Drivers model, and a Customer Valuation model.  Measurements:  For simple defect/non-defect binary classification, minimize the average misclassification error rate and include a confusion matrix report. For business-reduction estimation, minimize the average RMS error between actual and predicted decrease in business.  Techniques:  Non-Parametric Modeling.  Issues:  (tbd)  See Also:  Attrition, Churn, and Data Mining Task.  References:  Kitts, B., Hetherington, K., and Donnelly, J. (1998/04) "Data Mining Algorithms for Customer Analysis and Prediction", pages 25-27. DataSage deliverable for Sheraton project. Berry, M. and Linoff, G. (1997) "Data Mining Techniques for Marketing, Sales and Customer Support", Figure 3.1, pages 439-442. Reichheld, F. and Sasser, W. (1990) "Zero Defections: Quality Comes to Services", Harvard Business Review, Sept-Oct 1990, reprint 90508.  
Root-mean-squared Error (RMS error): A measure of an estimation model's error. Given a data set with n records, it is defined as the square root of the mean of the squared differences between the true values y_{i} and the model's predicted values p_{i}: RMS = sqrt( (1/n) * ( (y_{1} - p_{1})^{2} + ... + (y_{n} - p_{n})^{2} ) ) 
Rule Induction: A method of performing discovery by inducing rules about data. Rule induction tests given values in the data set to see which other data are its strongest associated factors. See decision trees, discovery, causal factor. 

Sampling: Taking a random sample of data in order to reduce the number of records to mine. Sampling is statistically complicated, but can be done in an RDBMS by use of a simple random number generator and a column in the database. See Partitioning. 
Scoring: The process of applying a model to a database or list. 
Segment: A welldefined set of cases. 
Segmentation and Profiling: The generation of a model that groups like-minded customers into 'prototypical' customer types. Business Benefit:  Improves the predictiveness of other models. In combination with a customer valuation model, the profiling can be performed on customers with high long term value. This profile can then be passed on to the marketing department to acquire new customers with high LTV.  Data Sources:  Customer-labeled transaction data. Can make use of demographic data.  Measurements:  (tbd)  Techniques:  Clustering Algorithms.  Issues:  (tbd)  See Also:  Clustering and Data Mining Task.  References:  Berry, M. and Linoff, G. (1997) "Data Mining Techniques for Marketing, Sales and Customer Support", pages 187-215 (Chapter 10: Automatic Cluster Detection).  
Simpson's Paradox: What is true for the parts is not necessarily true for the whole. 
Single Class Model Algorithm: An algorithm that generates classification predictions. The approach of the algorithm is to choose a single class and predict that class every time. This algorithm is simpler than the Majority Model because it does not need to know the frequency distribution of the target column. It can only be better than the Majority Model when a Cost-Benefit Matrix gives priority to a minority class. See also Majority Model Algorithm, Target Column, and Cost-Benefit Matrix. 
Statistical significance: A measure of the statistical likelihood that a given numerical value is true. See Confidence Window or Level. 
Statistics: The field in mathematics which studies the quantification of variance. One of the basic building blocks of data mining. See Heuristics, Machine Learning. 
Systematic Error: In contrast to random error, systematic error is less easily managed. According to modern epidemiologic theory, systematic error (or bias) can result from: information bias (due to errors in measurement), selection bias (due to flaws in the sampling), and confounding (due to the biasing effects of extraneous factors). 
Support: The measure of how often the collection of items in an association occur together as a percentage of all the transactions. For example, "In 2% of the purchases at the hardware store, both a pick and a shovel were bought." 
Support Vector Machines: Support Vector Machines (SVMs) were invented by Vladimir Vapnik. They are a method for creating functions from a set of labeled training data. The function can be a classification function (the output is binary: is the input in a category?) or a general regression function. For classification, SVMs operate by finding a hypersurface in the space of possible inputs. This hypersurface attempts to split the positive examples from the negative examples. The split is chosen to have the largest distance from the hypersurface to the nearest of the positive and negative examples. Intuitively, this makes the classification correct for testing data that is near, but not identical, to the training data. 
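A minimal sketch of such a maximum-margin classifier, using scikit-learn's SVC as one widely available implementation (the entry itself names no library); the toy points are hypothetical:

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0, 0], [1, 1], [2, 2], [8, 8], [9, 9], [10, 10]])
    y = np.array([0, 0, 0, 1, 1, 1])

    # With a linear kernel, the separating hypersurface is a hyperplane
    # placed to maximize the margin to the nearest examples.
    clf = SVC(kernel="linear").fit(X, y)
    print(clf.predict([[3, 3], [7, 7]]))   # near-but-not-identical test points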

Target Column: (a.k.a. dependent variable/attribute, label) the column whose values a predictive model has to accurately predict. If the target column is categorical the modeling challenge is referred to as classification, otherwise if the target column is numeric the challenge is referred to as an estimation problem. See also test file, train file, classifier, and estimation. 
Test file: The dataset file on which predictions will be made. The best predictive model discovered during the training stage will be used to predict the values of the target column for each of the rows in the file. The file should be the same format as the train file although the values in the target column may be represented by a missing value such as a question mark. See also target column. 
Time Series Analysis/Forecasting: A complicated technology which is used to give statistically accurate forecasting. This is often confused with prediction or simple forecasting, but time series analysis/forecasting is much more difficult and mathematically based. See Forecasting. 
Total Wallet Modeling: The generation of a model that forecasts a customer's total spending for a particular product or service. Business Benefit:  Increased sales through more effective product targeting. In combination with a Customer Value model, the Total Wallet model can estimate the penetration into each customer's wallet and therefore identify which customers present the largest up-sell opportunities. A central tenet of one-to-one marketing is to focus on "wallet share" rather than market share.  Data Sources:  Survey questionnaires, customer-labeled transaction data. Can make use of demographic data.  Measurements:  Minimized RMS error between actual and predicted wallet size.  Techniques:  Non-Parametric Algorithms for Estimation.  Issues:  The return on investment from this model depends on how much of its customers' wallets the company already takes up.  See Also:  Data Mining Task.  References:  (tbd) 
Train file: The dataset file that will be tested against many algorithms to discover the best predictive model. One of the file's columns is selected as the target (dependent) column. The PredictionWorks web service accepts comma-, tab-, or single-space-delimited files. If the discovered model will be used to predict future behavior, then the predictor (independent) columns in the train file must contain information that was available before the value of the target column was known, and possibly further back. For example, if the target value records whether a banner was clicked on or not, then the predictor columns must contain data that was available before the person saw the banner. In the example of a mailed-out coupon, the predictor columns must contain data that was available not just before the coupon was redeemed, but data that was available when the coupon was mailed out. 
Threshold: The minimum percentage of occurrence of a class needed to choose that class. E.g., if you have a dataset consisting of blue socks and red socks and your threshold is 0.6, you will need at least 60% of one colour of sock to choose that colour. See also Classification. 
Type I and Type II Errors: Decisions based on two-valued (binary) predictive models may be in error for two reasons. Type I (False Positive) errors occur when the difference with the null hypothesis is judged significant, due to factors other than chance, when in fact it is not. The probability of this type of error is the same as the significance level of the test. Many domains consider this the most serious type of error to make; it is equivalent to a judge finding an innocent suspect guilty. Type II (False Negative) errors occur when the difference with the null hypothesis is judged due to chance, when in fact it is not. Some domains, such as direct marketing, experience this type of error as lost income, because a person who would have responded positively was never contacted. See also Cost-Benefit Matrix. 

Value Drivers Modeling: The generation of a model that predicts the top reasons for a customer to continue to do business with the company. Business Benefit:  Supports other forecasting models to channel the right service to the right customer.  Data Sources:  Survey questionnaires, customer-labeled transaction data. Can make use of demographic data.  Measurements:  Minimize the average misclassification error rate on the top x drivers for each customer. If the accuracy of predicting the exact x drivers is low for all models (likely if there are a substantial number of drivers to choose from), then it is preferable to measure the accuracy of detecting at least one value driver. Possibly also report the associated confusion matrix.  Techniques:  Non-Parametric Classification Algorithms with Posterior Probabilities.  Issues:  (tbd)  See Also:  Data Mining Task.  References:  Kitts, B., Hetherington, K., and Donnelly, J. (1998/04) "Data Mining Algorithms for Customer Analysis and Prediction", pages 39-44. DataSage deliverable for Sheraton project.  
Verification: The testing of a model with unseen data. 
Visualization: Visual representation of discovered patterns. 
