Predictive Model Markup Language

The Predictive Model Markup Language (PMML) is an XML-based markup language developed by the Data Mining Group (DMG) to provide a way for applications to define models related to predictive analytics and data mining and to share those models between PMML-compliant applications.

PMML provides applications a vendor-independent method of defining models so that proprietary issues and incompatibilities are no longer a barrier to the exchange of models between applications. It allows users to develop models within one vendor's application and use other vendors' applications to visualize, analyze, evaluate or otherwise use the models. Previously, this was very difficult, but with PMML, the exchange of models between compliant applications is straightforward.

Since PMML is an XML-based standard, the specification comes in the form of an XML schema.
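
Because a PMML file is ordinary XML, any general-purpose XML parser can read it. The following is a minimal sketch in Python, using only the standard library. The document text, the names "Example Corp" and "ExampleTool", and the field names are made up for illustration. The namespace URI is assumed to be the one used for PMML 4.1, and the skeleton contains no model element.

```python
import xml.etree.ElementTree as ET

# Hypothetical PMML skeleton; all names and values are illustrative only.
PMML_DOC = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header copyright="Example Corp" description="Toy model">
    <Application name="ExampleTool" version="1.0"/>
  </Header>
  <DataDictionary numberOfFields="2">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="risk" optype="categorical" dataType="string"/>
  </DataDictionary>
</PMML>"""

# Assumed namespace for PMML 4.1 documents.
NS = {"p": "http://www.dmg.org/PMML-4_1"}


def summarize(pmml_text):
    """Return the PMML version, generating application, and declared fields."""
    root = ET.fromstring(pmml_text)
    app = root.find("p:Header/p:Application", NS)
    fields = {
        f.get("name"): (f.get("optype"), f.get("dataType"))
        for f in root.findall("p:DataDictionary/p:DataField", NS)
    }
    return {
        "version": root.get("version"),
        "application": app.get("name"),
        "fields": fields,
    }


summary = summarize(PMML_DOC)
print(summary)
```

A consuming application would normally also validate the file against the XML schema for the stated version; that step is omitted here.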

PMML Components

PMML follows an intuitive structure to describe a data mining model, be it an artificial neural network or a logistic regression model.

Sequentially, it can be described by the following components:[1][2]

  • Header: contains general information about the PMML document, such as copyright information for the model, its description, and the name and version of the application used to generate the model. It can also carry a timestamp that records when the model was created.
  • Data Dictionary: contains definitions for all the possible fields used by the model. It is here that a field is defined as continuous, categorical, or ordinal (attribute optype). Depending on this definition, the appropriate value ranges are then defined, as well as the data type (such as string or double).
  • Data Transformations: transformations allow for the mapping of user data into a more desirable form to be used by the mining model. PMML defines several kinds of simple data transformations.
    • Normalization: map values to numbers; the input can be continuous or discrete.
    • Discretization: map continuous values to discrete values.
    • Value mapping: map discrete values to discrete values.
    • Functions: derive a value by applying a function to one or more parameters.
    • Aggregation: used to summarize or collect groups of values.
  • Model: contains the definition of the data mining model. For example, a multi-layered feedforward neural network is the most common neural network representation in contemporary applications, given the popularity and efficacy of its training algorithm, backpropagation. Such a network is represented in PMML by a "NeuralNetwork" element, which contains attributes such as:
    • Model Name (attribute modelName)
    • Function Name (attribute functionName)
    • Algorithm Name (attribute algorithmName)
    • Activation Function (attribute activationFunction)
    • Number of Layers (attribute numberOfLayers)

These attributes are followed by three kinds of elements that specify the architecture of the neural network model being represented in the PMML document: NeuralInputs, NeuralLayer, and NeuralOutputs. Besides neural networks, PMML allows for the representation of many other data mining models, including support vector machines, association rules, naive Bayes classifiers, clustering models, text models, decision trees, and different regression models.
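
The layout of a "NeuralNetwork" element can be illustrated with a small fragment. The sketch below is hypothetical: the model name, weights, and field names are invented, the XML namespace is left out to keep the lookups short, and the mining schema that a complete model element would also contain is omitted. It reads the attributes listed above and counts the entries in NeuralInputs, NeuralLayer, and NeuralOutputs.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; not schema-complete (no MiningSchema, no namespace).
NN_FRAGMENT = """<NeuralNetwork modelName="ToyNet" functionName="regression"
    algorithmName="backpropagation" activationFunction="logistic"
    numberOfLayers="2">
  <NeuralInputs numberOfInputs="1">
    <NeuralInput id="0">
      <DerivedField optype="continuous" dataType="double">
        <FieldRef field="x"/>
      </DerivedField>
    </NeuralInput>
  </NeuralInputs>
  <NeuralLayer numberOfNeurons="1">
    <Neuron id="1" bias="0.5"><Con from="0" weight="1.5"/></Neuron>
  </NeuralLayer>
  <NeuralLayer numberOfNeurons="1" activationFunction="identity">
    <Neuron id="2" bias="0.0"><Con from="1" weight="2.0"/></Neuron>
  </NeuralLayer>
  <NeuralOutputs numberOfOutputs="1">
    <NeuralOutput outputNeuron="2">
      <DerivedField optype="continuous" dataType="double">
        <FieldRef field="y"/>
      </DerivedField>
    </NeuralOutput>
  </NeuralOutputs>
</NeuralNetwork>"""

ATTRIBUTES = ("modelName", "functionName", "algorithmName",
              "activationFunction", "numberOfLayers")


def describe_network(xml_text):
    """Collect the top-level attributes and the size of each section."""
    net = ET.fromstring(xml_text)
    layers = net.findall("NeuralLayer")
    return {
        "attributes": {name: net.get(name) for name in ATTRIBUTES},
        "inputs": len(net.findall("NeuralInputs/NeuralInput")),
        "neurons_per_layer": [len(layer.findall("Neuron")) for layer in layers],
        "outputs": len(net.findall("NeuralOutputs/NeuralOutput")),
    }


network = describe_network(NN_FRAGMENT)
print(network)
```

In this fragment each neuron lists its incoming connections (Con elements) by the id of the neuron or input they come from, so the layer order plus the connection ids are enough to reconstruct the network graph.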

  • Mining Schema: the mining schema lists all fields used in the model. This can be a subset of the fields as defined in the data dictionary. It contains specific information about each field, such as:
    • Name (attribute name): must refer to a field in the data dictionary
    • Usage type (attribute usageType): defines the way a field is to be used in the model. Typical values are: active, predicted, and supplementary. Predicted fields are those whose values are predicted by the model.
    • Outlier Treatment (attribute outliers): defines the outlier treatment to be used. In PMML, outliers can be treated as missing values, as extreme values (based on the definition of high and low values for a particular field), or as is.
    • Missing Value Replacement Policy (attribute missingValueReplacement): if this attribute is specified, then a missing value is automatically replaced by the given value.
    • Missing Value Treatment (attribute missingValueTreatment): indicates how the missing value replacement was derived (e.g. as value, mean or median).
  • Targets: allow for post-processing of the predicted value in the form of scaling if the output of the model is continuous. Targets can also be used for classification tasks. In this case, the attribute priorProbability specifies a default probability for the corresponding target category. It is used if the prediction logic itself did not produce a result. This can happen, e.g., if an input value is missing and there is no other method for treating missing values.
  • Output: this element can be used to name all the desired output fields expected from the model. These are features of the predicted field and so are typically the predicted value itself, the probability, cluster affinity (for clustering models), standard error, etc.
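
The mining schema attributes described above can be read as instructions for preparing each input value before scoring. The following Python sketch shows one possible interpretation for numeric fields. The field names and limits are invented, the function prepare_input is hypothetical, and the choice to apply missing value replacement to outliers that are treated as missing is an assumption of this sketch rather than a statement of the specification. The attribute values asIs, asMissingValues, and asExtremeValues, and the attributes lowValue and highValue, are used as this sketch understands them from the PMML schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical mining schema; names and limits are illustrative only.
SCHEMA = """<MiningSchema>
  <MiningField name="age" usageType="active" outliers="asExtremeValues"
               lowValue="18" highValue="90"
               missingValueReplacement="40" missingValueTreatment="asMean"/>
  <MiningField name="income" usageType="active" outliers="asMissingValues"
               lowValue="0" highValue="500000"/>
  <MiningField name="risk" usageType="predicted"/>
</MiningSchema>"""


def prepare_input(field, value):
    """Apply one MiningField's settings to a numeric value (None = missing)."""

    def replace_missing():
        # missingValueTreatment only documents how the replacement was
        # derived; the replacement itself comes from missingValueReplacement.
        replacement = field.get("missingValueReplacement")
        return float(replacement) if replacement is not None else None

    if value is None:
        return replace_missing()

    low, high = field.get("lowValue"), field.get("highValue")
    is_outlier = ((low is not None and value < float(low)) or
                  (high is not None and value > float(high)))
    treatment = field.get("outliers", "asIs")
    if not is_outlier or treatment == "asIs":
        return value
    if treatment == "asMissingValues":
        # Assumption: an outlier treated as missing is then replaced
        # like any other missing value.
        return replace_missing()
    # asExtremeValues: clamp to the nearest declared limit.
    if low is not None:
        value = max(value, float(low))
    if high is not None:
        value = min(value, float(high))
    return value


fields = {f.get("name"): f for f in ET.fromstring(SCHEMA)}
active = [n for n, f in fields.items() if f.get("usageType") == "active"]
print(active, prepare_input(fields["age"], 120))
```

Categorical fields and invalid values would need additional handling that is not shown here.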

PMML 4.0 and 4.1

The previous version of PMML, 4.0, was released on June 16, 2009.[3][4][5]

Examples of new features included:

  • Improved Pre-Processing Capabilities: Additions to built-in functions include a range of Boolean operations and an If-Then-Else function.
  • Model Explanation: Saving of evaluation and model performance measures to the PMML file itself.
  • Multiple Models: Capabilities for model composition, ensembles, and segmentation (e.g., combining of regression and decision trees).
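
As an illustration of the pre-processing additions, an If-Then-Else expression can be written as nested Apply elements. The Python sketch below evaluates such an expression for a single record. The expression, the field name "age", and the evaluate function are hypothetical, and only the functions "if", "greaterThan", "and", "or", and "not" are handled. Unlike a production scoring engine, this sketch evaluates both branches of the "if" before choosing one.

```python
import xml.etree.ElementTree as ET

# Hypothetical expression: "senior" if age > 65, otherwise "adult".
EXPRESSION = """<Apply function="if">
  <Apply function="greaterThan">
    <FieldRef field="age"/>
    <Constant dataType="double">65</Constant>
  </Apply>
  <Constant dataType="string">senior</Constant>
  <Constant dataType="string">adult</Constant>
</Apply>"""


def evaluate(node, record):
    """Recursively evaluate a small subset of PMML expressions."""
    if node.tag == "Constant":
        is_number = node.get("dataType") == "double"
        return float(node.text) if is_number else node.text
    if node.tag == "FieldRef":
        return record[node.get("field")]
    if node.tag == "Apply":
        args = [evaluate(child, record) for child in node]
        function = node.get("function")
        if function == "greaterThan":
            return args[0] > args[1]
        if function == "and":
            return all(args)
        if function == "or":
            return any(args)
        if function == "not":
            return not args[0]
        if function == "if":
            # args: condition, then-value, else-value
            return args[1] if args[0] else args[2]
    raise ValueError("unsupported expression: " + node.tag)


expression = ET.fromstring(EXPRESSION)
label_old = evaluate(expression, {"age": 70})
label_young = evaluate(expression, {"age": 30})
print(label_old, label_young)
```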

The latest version of PMML, 4.1, was released on December 31, 2011.[6][7]

New features include:

  • New model elements for representing Scorecards, k-Nearest Neighbors (KNN) and Baseline Models.
  • Simplification of multiple models. In PMML 4.1, the same element is used to represent model segmentation, ensemble, and chaining.
  • Overall definition of field scope and field names.
  • A new attribute that identifies for each model element if the model is ready or not for production deployment.
  • Enhanced post-processing capabilities.

Release history

Version 0.7 July 1997
Version 0.9 July 1998
Version 1.0 August 1999
Version 1.1 August 2000
Version 2.0 August 2001
Version 2.1 March 2003
Version 3.0 October 2004
Version 3.1 December 2005
Version 3.2 May 2007
Version 4.0 June 2009
Version 4.1 December 2011

PMML Products

A range of products are being offered to produce and consume PMML:

  • Angoss StrategyBuilder (a standard module in KnowledgeSEEKER and KnowledgeSTUDIO): produces PMML 3.2 for decision trees (used to represent strategy trees).
  • IBM InfoSphere Warehouse: produces PMML 3.0 and 3.1 for sequences-only models. Consumes (scores and visualizes) PMML 3.1 and earlier.
  • IBM SPSS Modeler: produces and scores PMML 3.2 and 4.0 for a variety of models.
  • KNIME: produces and consumes PMML 4.0 for neural networks, decision trees, clustering models, regression models, and support vector machines. As of release 2.4.0, KNIME also offers extensive pre-processing support in PMML, including the ability to edit existing PMML code.[8]
  • KXEN: produces PMML 3.2 for regression models (including mining models) and clustering.
  • Open Data Group's Augustus: produces PMML 4.0 for tree, naive Bayes and ruleset models. It consumes PMML 4.0 tree, naive Bayes, ruleset and regression models. Older versions produce and consume PMML 3.0 regression, tree and naive Bayes models.
  • Oracle Data Mining: supports the core features of PMML 3.1 for regression models. The imported models become native Oracle Data Mining (ODM) models capable of Exadata offload.
  • RapidMiner: exports several types of models to PMML using the free PMML extension.
  • Zementis PMML Converter: validates, corrects, and converts PMML files expressed in versions 2.0, 2.1, 3.0, 3.1, 3.2, and 4.0.[9]
  • Zementis Universal PMML Plug-in for Hadoop: scores PMML 2.0, 2.1, 3.0, 3.1, 3.2, and 4.0 for the Datameer Analytics Solution (DAS), an end-to-end BI solution that includes data source integration, an analytics engine, visualization and dashboarding. As its back-end storage and processing engine, DAS uses Apache Hadoop, a Java-based framework that supports the parallel storage and processing of large data sets in a distributed environment, to scale to 4000 servers and petabytes of data.

Transformations Generator

PMML provides a variety of data transformations, including value mapping, normalization, and discretization. It also offers several built-in functions as well as arithmetic and logical operators which can be combined to represent complex pre-processing steps. With the Transformations Generator tool, one can graphically design a transformation and obtain the respective PMML code.
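
To make the transformations concrete, the Python sketch below evaluates a normalization and a discretization for a single input value. The derived field names, bin labels, and interval limits are invented, and the helper functions norm_continuous and discretize are hypothetical. The normalization is simplified to linear interpolation between exactly two LinearNorm points; handling of more points or of values outside the declared range is not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical transformations; names and limits are illustrative only.
TRANSFORMATIONS = """<TransformationDictionary>
  <DerivedField name="age_norm" optype="continuous" dataType="double">
    <NormContinuous field="age">
      <LinearNorm orig="18" norm="0"/>
      <LinearNorm orig="90" norm="1"/>
    </NormContinuous>
  </DerivedField>
  <DerivedField name="age_group" optype="categorical" dataType="string">
    <Discretize field="age">
      <DiscretizeBin binValue="young">
        <Interval closure="closedOpen" leftMargin="0" rightMargin="40"/>
      </DiscretizeBin>
      <DiscretizeBin binValue="old">
        <Interval closure="closedOpen" leftMargin="40" rightMargin="200"/>
      </DiscretizeBin>
    </Discretize>
  </DerivedField>
</TransformationDictionary>"""


def norm_continuous(element, x):
    """Linear interpolation between the two LinearNorm points."""
    points = [(float(p.get("orig")), float(p.get("norm")))
              for p in element.findall("LinearNorm")]
    (x0, y0), (x1, y1) = points[0], points[1]
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)


def discretize(element, x):
    """Return the binValue of the first interval containing x, else None."""
    for bin_ in element.findall("DiscretizeBin"):
        interval = bin_.find("Interval")
        low = float(interval.get("leftMargin"))
        high = float(interval.get("rightMargin"))
        closure = interval.get("closure")  # e.g. closedOpen, openClosed
        left_ok = x >= low if closure.startswith("closed") else x > low
        right_ok = x <= high if closure.endswith("Closed") else x < high
        if left_ok and right_ok:
            return bin_.get("binValue")
    return None


root = ET.fromstring(TRANSFORMATIONS)
norm = root.find("DerivedField[@name='age_norm']/NormContinuous")
bins = root.find("DerivedField[@name='age_group']/Discretize")
print(norm_continuous(norm, 54.0), discretize(bins, 54.0))
```

A graphical tool such as the one described above would emit XML of this general shape; the evaluation logic belongs to the consuming application.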

References

  1. ^ A. Guazzelli, M. Zeller, W. Chen, and G. Williams. PMML: An Open Standard for Sharing Models. The R Journal, Volume 1/1, May 2009.
  2. ^ A. Guazzelli, W. Lin, T. Jena (2010). PMML in Action: Unleashing the Power of Open Standards for Data Mining and Predictive Analytics. CreateSpace.
  3. ^ Data Mining Group website | PMML 4.0 - Changes from PMML 3.2
  4. ^ Zementis website | PMML 4.0 is here!
  5. ^ R. Pechter. What's PMML and What's New in PMML 4.0? The ACM SIGKDD Explorations Newsletter, Volume 11/1, July 2009.
  6. ^ Data Mining Group website | PMML 4.1 - Changes from PMML 4.0
  7. ^ Predictive Analytics Info website | PMML 4.1 is here!
  8. ^ D. Morent, K. Stathatos, W. Lin, M. R. Berthold. Comprehensive PMML Preprocessing in KNIME. In Proceedings of the 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, August 2011.
  9. ^ A. Guazzelli, T. Jena, W. Lin, M. Zeller. The PMML Path Towards True Interoperability in Data Mining. In Proceedings of the 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, August 2011.
  10. ^ K. K. Das, E. Fratkin, A. Gorajek, K. Stathatos, M. Gajjar. Massively Parallel In-Database Predictions using PMML. In Proceedings of the 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, August 2011.
