User:WillWare/Using Jena

From Wikipedia, the free encyclopedia

The article on production systems gives a definition but it is far more instructive to work with a real production system on one's own computer and get a sense of it for oneself. Jena is a Java framework for building Semantic Web applications. It includes a very capable production system.

This topic will assume that you can read Java and understand inheritance and interfaces. This should be regarded as only a preliminary taste of Jena to get you started. To learn much more about it you'll need to consult the documentation. Hopefully after this introduction that will be a little easier.

This topic will include a lot of discussion of the semantic web. This is not intended as an advertisement for the semantic web, though I personally believe it's a good idea. The point here is that lately the semantic web community has been producing well-designed open-source software that makes production systems widely available and easy to use.

Jena basics

Here are some important Jena webpages.

Project homepage: http://jena.sourceforge.net/
Docs: http://jena.sourceforge.net/documentation.html
http://jena.sourceforge.net/inference/
http://jena.sourceforge.net/ontology/
http://jena.sourceforge.net/ARQ/Tutorial/
Javadoc: http://jena.sourceforge.net/javadoc/index.html

IBM has posted some great on-line tutorials for Jena, and for the semantic web in general.

http://www.ibm.com/developerworks/xml/library/j-jena/
http://www.ibm.com/developerworks/xml/library/j-sparql/
http://www.ibm.com/developerworks/library/x-semweb.html
http://www.ibm.com/developerworks/xml/library/x-plansemantic/
http://www.ibm.com/developerworks/web/library/wa-semweb/
http://www.ibm.com/developerworks/xml/library/x-wikiquery/
http://www.ibm.com/developerworks/opensource/library/os-php-crud/

The semantic web represents knowledge as a directed graph. The nodes of the graph are different things, and the edges of the graph express relationships between those things. This is most easily illustrated with a diagram.

Example of a semantic network

Here, the things are "Animal", "Mammal", "Fish", and so on. The relationships are "is a", "has" and "lives in". All this can be represented in three-word sentences called "triples". This representation is called RDF, or "Resource Description Framework". The corresponding three-word sentences appear below, written in N3, a human-readable formal language used by the semantic web community.

   @prefix : <#> .
   :Cat :has :Fur .
   :Bear :has :Fur .
   :Cat :is-a :Mammal .
   :Bear :is-a :Mammal .
   :Mammal :has :Vertebrae .
   :Whale :is-a :Mammal .
   :Whale :lives-in :Water .
   :Mammal :is-a :Animal .
   :Fish :is-a :Animal .
   :Fish :lives-in :Water .
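
To make the idea of a triple concrete before turning to Jena, here is a minimal plain-Java sketch that reads the simplified N3 lines above into subject/predicate/object strings. It does not use Jena at all. The class name TripleSketch and its methods are made up for illustration, and it assumes one statement per line in the tiny N3 subset shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hand-rolled reader for the tiny N3 subset used above.
// Real N3 is far richer; in practice a real parser such as Jena's is needed.
class TripleSketch {
    // Parse lines such as ":Cat :has :Fur ." into {subject, predicate, object}.
    static List<String[]> parse(String text) {
        List<String[]> triples = new ArrayList<String[]>();
        for (String line : text.split("\n")) {
            line = line.trim();
            // Skip blank lines and directives such as "@prefix : <#> ."
            if (line.isEmpty() || line.startsWith("@")) continue;
            String[] parts = line.split("\\s+");
            // Expect exactly: subject predicate object "."
            if (parts.length != 4 || !parts[3].equals(".")) continue;
            triples.add(new String[] {
                strip(parts[0]), strip(parts[1]), strip(parts[2]) });
        }
        return triples;
    }

    // Drop the leading ":" of the default prefix.
    static String strip(String term) {
        return term.startsWith(":") ? term.substring(1) : term;
    }

    public static void main(String[] args) {
        String n3 = "@prefix : <#> .\n"
                  + ":Cat :has :Fur .\n"
                  + ":Cat :is-a :Mammal .\n"
                  + ":Whale :lives-in :Water .\n";
        for (String[] t : parse(n3)) {
            System.out.println(t[0] + " " + t[1] + " " + t[2]);
        }
    }
}
```

Each triple is just three strings here. Jena wraps the same idea in richer Resource, Property and RDFNode objects.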

What are prefixes about?

The folks at the W3C who brought us the semantic web (after bringing us the web we're already using) are concerned with interoperability and internationalization. What that means is, they don't want to find a solution that only works in Silicon Valley, or only works in the United States, or only works in the EU. They want technology that works everywhere, in all time zones and human languages, and with all kinds of computers and all kinds of networks.

So they've been very careful in defining protocols like XML and RDF to make sure that they spell out everything necessary to make that happen. The prefixes used in RDF are part of that.

The prefixes also allow people to agree on concepts. If I decide to merge my graph of triples with your graph of triples, it will be helpful if you and I agree on common concepts like "dog" and "car" and "building". Our computers are too dumb to know that a dog in Korea is the same thing as a dog in Ecuador, unless we establish conventions that make such facts explicit. Some common shared vocabularies are:

Dublin Core: http://dublincore.org/
Friend-of-a-friend, or FOAF: http://www.foaf-project.org/
Semantically-interlinked online communities: http://sioc-project.org/

So you shouldn't be too shocked to see an N3 document that starts with a long list of prefixes.

@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rss:    <http://purl.org/rss/1.0/> .
@prefix foaf:   <http://xmlns.com/foaf/0.1/> .
@prefix wn:     <http://xmlns.com/wordnet/1.6/> .
@prefix dc:     <http://purl.org/dc/elements/1.1/> .
@prefix rfc:    <http://www.rfc-editor.org/rfc/> .
@prefix w3c-tr: <http://www.w3.org/TR/> .
@prefix genid:  <http://id.ninebynine.org/people/> .

Elsewhere in the document, when you see "dc:description", you'll know it's an abbreviation for

http://purl.org/dc/elements/1.1/description
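
The expansion just described is simple string substitution. The following plain-Java sketch shows one way it could work. It is not Jena's implementation, and the class PrefixSketch and its methods declare and expand are hypothetical names chosen for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: expands prefixed names using a table of declared prefixes.
class PrefixSketch {
    private final Map<String, String> prefixes =
        new LinkedHashMap<String, String>();

    // Record a declaration such as: @prefix dc: <http://purl.org/dc/elements/1.1/> .
    void declare(String prefix, String namespace) {
        prefixes.put(prefix, namespace);
    }

    // Expand "dc:description" to its full URI. The input is returned
    // unchanged if it has no colon or its prefix has not been declared.
    String expand(String qname) {
        int colon = qname.indexOf(':');
        if (colon < 0) return qname;
        String ns = prefixes.get(qname.substring(0, colon));
        return ns == null ? qname : ns + qname.substring(colon + 1);
    }

    public static void main(String[] args) {
        PrefixSketch p = new PrefixSketch();
        p.declare("dc", "http://purl.org/dc/elements/1.1/");
        p.declare("foaf", "http://xmlns.com/foaf/0.1/");
        System.out.println(p.expand("dc:description"));
        System.out.println(p.expand("foaf:name"));
    }
}
```

Two documents that declare different prefix labels for the same namespace still expand to the same URIs, which is what lets their graphs be merged.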

When RDF is serialized as XML, you'll see something more like this.

<rdf:RDF
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
   xmlns:foaf="http://xmlns.com/foaf/0.1/"
   xmlns:wot="http://xmlns.com/wot/0.1/"
   xmlns:dc="http://purl.org/dc/elements/1.1/"
   xmlns:owl="http://www.w3.org/2002/07/owl#"
   xmlns:dcterm="http://purl.org/dc/terms/"
   xmlns:cc="http://web.resource.org/cc/" 
   xml:base="http://schema.menow.org/">

Let's see some code already

In Jena, a Model is one of these collections of triples. Having constructed it (or loaded it from a file on the internet or on your own computer), you can apply rules that allow you to draw conclusions. First we need to read in the file.

   // Needs java.io.File, java.io.FileReader, java.io.FileNotFoundException
   // and Jena's Model class (com.hp.hpl.jena.rdf.model.Model in Jena 2.x).
   private static final String baseUri =
       "file:///home/wware/wware-autosci/semweb/java/simpleNet.n3#";
   private static void modelReadFile(String filename, Model model) {
       try {
           // With no syntax argument, Model.read expects RDF/XML.
           FileReader fr = new FileReader(new File(filename));
           model.read(fr, baseUri);
       } catch (FileNotFoundException e) {
           e.printStackTrace();
       }
   }

We'll call that method from our main method. I personally find it appalling that the graph above fails to recognize that fish have vertebrae, so we'll add a triple for that.

   public static void main(String[] args) {
       Model model = ModelFactory.createDefaultModel();
       modelReadFile("simpleNet.rdf", model);
       // Fish have vertebrae too!!
       model.createResource(baseUri + "Fish")
                .addProperty(model.createProperty(
                                 baseUri + "has"),
                             model.createResource(
                                 baseUri + "Vertebrae"));
       // Let's print out everything we've got so far.
       StmtIterator iter = model.listStatements();
       while (iter.hasNext()) {
           Statement stmt = iter.nextStatement();
           Resource subject = stmt.getSubject();
           Resource predicate = stmt.getPredicate();
           RDFNode obj = stmt.getObject();
           // Resources print as their local name; literals print as-is.
           String objstr = obj.isResource()
                   ? ((Resource) obj).getLocalName()
                   : obj.toString();
           System.out.println(subject.getLocalName() + " "
                   + predicate.getLocalName() + " " + objstr);
       }
   }

The RDF file from which the model is read was translated from the N3 above, using CWM.

 <rdf:RDF xmlns="file:///home/wware/wware-autosci/semweb/java/simpleNet.n3#"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
     <rdf:Description rdf:about="#Bear">
         <has rdf:resource="#Fur"/>
         <is-a rdf:resource="#Mammal"/>
     </rdf:Description>
     <rdf:Description rdf:about="#Cat">
         <has rdf:resource="#Fur"/>
         <is-a rdf:resource="#Mammal"/>
     </rdf:Description>
     <rdf:Description rdf:about="#Fish">
         <is-a rdf:resource="#Animal"/>
         <lives-in rdf:resource="#Water"/>
     </rdf:Description>
     <rdf:Description rdf:about="#Mammal">
         <has rdf:resource="#Vertebrae"/>
         <is-a rdf:resource="#Animal"/>
     </rdf:Description>
     <rdf:Description rdf:about="#Whale">
         <is-a rdf:resource="#Mammal"/>
         <lives-in rdf:resource="#Water"/>
     </rdf:Description>
 </rdf:RDF>

There is a Model.write(OutputStream) method, so we can write a model directly to a file instead of stepping through the triples explicitly.

Hacking inference

So how about some actual reasoning? We should be able to conclude that a cat is an animal, and has vertebrae. This will require that we define two rules of inference, rule1 and rule2 below.

      String rules =
           "[ rule1: (?x " + baseUri+"is-a ?y) " +
                    "(?y " + baseUri+"is-a ?z) -> " +
                    "(?x " + baseUri+"is-a ?z) ] " +
           "[ rule2: (?x " + baseUri+"is-a ?y) " +
                    "(?y " + baseUri+"has ?z) -> " +
                    "(?x " + baseUri+"has ?z) ]";
       Reasoner reasoner = new
           GenericRuleReasoner(Rule.parseRules(rules));
       reasoner.setDerivationLogging(true);
       InfModel inf =
           ModelFactory.createInfModel(reasoner, model);

Simply creating the InfModel is enough to draw all the relevant inferences. The Reasoner's setDerivationLogging method tells the model to remember the derivations that led to any new conclusions, and these derivations can be examined for debugging purposes.

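       PrintWriter out = new PrintWriter(System.out);
       // List every statement in the inference model; pass a Selector to
       // listStatements() instead if only some statements are of interest.
       for (StmtIterator i = inf.listStatements();
                    i.hasNext(); ) {
           Statement s = i.nextStatement();
           // Print the chain of rule firings that produced this statement.
           for (Iterator<Derivation> id = inf.getDerivation(s);
                    id.hasNext(); ) {
               Derivation deriv = id.next();
               deriv.printTrace(out, true);
           }
       }
       out.flush();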
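
To get a feel for what rule1 and rule2 compute, it may help to see the same two rules applied by hand. The following plain-Java sketch uses naive forward chaining: it keeps applying both rules until no new triples appear. It is not how Jena's rule engine is implemented, and the class ForwardChainSketch and its methods are hypothetical names for this example only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: naive forward chaining for the two rules above.
class ForwardChainSketch {
    // Repeatedly apply rule1 and rule2 until no new triples appear.
    static Set<List<String>> infer(Set<List<String>> facts) {
        Set<List<String>> known = new LinkedHashSet<List<String>>(facts);
        boolean changed = true;
        while (changed) {
            changed = false;
            // Iterate over a copy so that new triples can be added safely.
            List<List<String>> snapshot = new ArrayList<List<String>>(known);
            for (List<String> a : snapshot) {
                if (!a.get(1).equals("is-a")) continue;        // (?x is-a ?y)
                for (List<String> b : snapshot) {
                    if (!b.get(0).equals(a.get(2))) continue;  // (?y ... ?z)
                    String pred = b.get(1);
                    if (pred.equals("is-a") || pred.equals("has")) {
                        // rule1 yields (?x is-a ?z); rule2 yields (?x has ?z)
                        if (known.add(triple(a.get(0), pred, b.get(2)))) {
                            changed = true;
                        }
                    }
                }
            }
        }
        return known;
    }

    static List<String> triple(String s, String p, String o) {
        return Arrays.asList(s, p, o);
    }

    public static void main(String[] args) {
        Set<List<String>> facts = new LinkedHashSet<List<String>>();
        facts.add(triple("Cat", "is-a", "Mammal"));
        facts.add(triple("Mammal", "is-a", "Animal"));
        facts.add(triple("Mammal", "has", "Vertebrae"));
        for (List<String> t : infer(facts)) {
            System.out.println(t.get(0) + " " + t.get(1) + " " + t.get(2));
        }
    }
}
```

Starting from the three facts in main, the loop adds "Cat is-a Animal" and "Cat has Vertebrae" and then stops, because a further pass produces nothing new.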

RDFS Reasoner

OWL Reasoner

Querying with SPARQL

One way to print the contents of a model would be to explicitly step through each triple, get its parts, format them as strings, and print them. This gives a lot of control over how they're formatted, but it's a little bulky.

   private static void printModel(Model model) {
       StmtIterator iter = model.listStatements();
       while (iter.hasNext()) {
           Statement stmt = iter.nextStatement();
           Resource subject = stmt.getSubject();
           Resource predicate = stmt.getPredicate();
           RDFNode obj = stmt.getObject();
           // Resources print as their local name; literals print as-is.
           String objstr = obj.isResource()
                   ? ((Resource) obj).getLocalName()
                   : obj.toString();
           System.out.println(subject.getLocalName() + " "
                   + predicate.getLocalName() + " " + objstr);
       }
   }

SPARQL is an SQL-like language used to perform queries on a model. The IBM developerWorks SPARQL tutorial listed above (the j-sparql link) has a good discussion of it.

Here is a very simple query that simply prints the contents of a model.

   private static void printModel(Model model) {
       String queryString = 
           "SELECT ?x ?y ?z " +
           "WHERE {" +
           "    ?x ?y ?z . " +
           "}";
       Query query = QueryFactory.create(queryString);
       QueryExecution qe = QueryExecutionFactory.create(query, model);
       ResultSet results = qe.execSelect();
       ResultSetFormatter.out(System.out, results, query);
       qe.close();
   }
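
The WHERE clause above is a single triple pattern in which all three positions are variables, so it matches every triple. As a rough illustration of what matching one such pattern involves, here is a plain-Java sketch. It is not ARQ's algorithm, and PatternMatchSketch and match are hypothetical names. Terms beginning with "?" are treated as variables, and everything else must match exactly.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: matches one triple pattern against a list of triples.
class PatternMatchSketch {
    // Returns one map of variable bindings per matching triple.
    static List<Map<String, String>> match(String[] pattern,
                                           List<String[]> triples) {
        List<Map<String, String>> results =
            new ArrayList<Map<String, String>>();
        for (String[] t : triples) {
            Map<String, String> binding = new LinkedHashMap<String, String>();
            boolean ok = true;
            for (int i = 0; i < 3 && ok; i++) {
                if (pattern[i].startsWith("?")) {
                    // A variable used twice in the pattern must bind to the same value.
                    String prev = binding.put(pattern[i], t[i]);
                    ok = (prev == null) || prev.equals(t[i]);
                } else {
                    ok = pattern[i].equals(t[i]);
                }
            }
            if (ok) results.add(binding);
        }
        return results;
    }

    public static void main(String[] args) {
        List<String[]> triples = new ArrayList<String[]>();
        triples.add(new String[] { "Cat", "has", "Fur" });
        triples.add(new String[] { "Cat", "is-a", "Mammal" });
        triples.add(new String[] { "Whale", "is-a", "Mammal" });
        triples.add(new String[] { "Fish", "is-a", "Animal" });
        // Which things are mammals?
        System.out.println(
            match(new String[] { "?x", "is-a", "Mammal" }, triples));
    }
}
```

A real SPARQL engine joins the bindings from several such patterns. This sketch handles only one pattern at a time.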

Defining your own actions

If you define a Rule programmatically, then you can define your own actions.

  • The rule has a body (antecedents) and a head (sequence of actions).
  • Write a class that extends BaseBuiltin, overriding the headAction method. You'll be passed a RuleContext, which lets you assert or remove triples to/from the graph. The RuleContext.getEnv() method returns a BindingEnvironment whose getGroundVersion() method will return the value of a bound variable.
  • Define a Functor and call its setImplementor(Builtin) method to assign it your builtin.
  • Use the Functor in the List<ClauseEntry> argument that the Rule constructor uses for head actions.

This example is rather convoluted. To work out some of the details, I had to hunt for a copy of the Jena source code. I can't imagine why the Jena developers don't post the source on Sourceforge, since they clearly want people to download binaries from there.

Anyway, what this example does is to take every "X has Y" statement in the model and replace Y with "Wings". So cats and bears will no longer have fur, now they'll have wings instead. I can't speak for bears but I am pretty certain my own cat would consider this a raw deal, especially in the winter months.

One slightly tricky point: the action returns early if Y is already "Wings". Otherwise you would get an infinite regress, as the inference engine repeatedly replaced "X has Wings" with a new copy of "X has Wings".

       Node_RuleVariable S = new Node_RuleVariable("x", 0);
       Node P = Node.createURI(baseUri + "has");
       Node_RuleVariable O = new Node_RuleVariable("y", 1);
       Functor myFunctor =
           new Functor("f", new Node[] { S, P, O });
       myFunctor.setImplementor(new BaseBuiltin() {
           public String getName() { return null; };
           public void headAction(Node[] args, int length,
                                  RuleContext context) {
               BindingEnvironment be = context.getEnv();
               String uriWings = baseUri + "Wings";
               Node S = be.getGroundVersion(args[0]);
               Node P = args[1];
               Node O = be.getGroundVersion(args[2]);
               if (uriWings.equals(O.getURI()))
                   return;
               Triple tr = new Triple(S, P, O);
               context.remove(tr);
               O = Node.createURI(uriWings);
               tr = new Triple(S, P, O);
               context.add(tr);
           }
       });
       Rule everybodyGetsWings = new Rule("Everybody gets wings",
           new ClauseEntry[] { myFunctor },
           new ClauseEntry[] { new TriplePattern(S, P, O) });
       List<Rule> rules = new ArrayList<Rule>();
       rules.add(everybodyGetsWings);
       Reasoner reasoner = new GenericRuleReasoner(rules);
       reasoner.setDerivationLogging(true);
       InfModel inf = ModelFactory.createInfModel(reasoner, model);
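
The need for the early-return guard is easier to see outside the rule engine. The following plain-Java sketch applies the same rewrite to a set of triples and counts how many times the rule fires. It is only an analogy for the Jena code above, and WingsRuleSketch and rewrite are hypothetical names.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: "X has Y" becomes "X has Wings", with a termination guard.
class WingsRuleSketch {
    // Rewrites the set in place and returns the number of rule firings.
    static int rewrite(Set<List<String>> triples) {
        int firings = 0;
        boolean fired = true;
        while (fired) {
            fired = false;
            for (List<String> t : triples) {
                if (!t.get(1).equals("has")) continue;
                // Guard: without this test, "X has Wings" would match again
                // after every replacement and the loop would never end.
                if (t.get(2).equals("Wings")) continue;
                triples.remove(t);
                triples.add(Arrays.asList(t.get(0), "has", "Wings"));
                firings++;
                fired = true;
                break;  // the set changed, so restart the scan
            }
        }
        return firings;
    }

    public static void main(String[] args) {
        Set<List<String>> triples = new LinkedHashSet<List<String>>();
        triples.add(Arrays.asList("Cat", "has", "Fur"));
        triples.add(Arrays.asList("Bear", "has", "Fur"));
        triples.add(Arrays.asList("Cat", "is-a", "Mammal"));
        int n = rewrite(triples);
        System.out.println(n + " firings, result: " + triples);
    }
}
```

Removing the guard line turns this into an endless loop, which is the plain-Java counterpart of the infinite regress described above.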

Applications

Business

Many businesses routinely collect a lot of data. Online businesses record transactions, supermarkets track product bar codes, and so forth. Generally all this data goes into databases such as MySQL or Oracle.

Often businesses are keenly interested in the patterns of information buried in all that data. They want to extract those patterns and convert them into pictures or short explanations that are understandable at a glance.

Typical uses include payroll, inventory tracking, analysis of shopping trends and preferences, and data-mining of web traffic.

Bayesian inference

Bayesian inference is the use of Bayes' theorem to infer the value of a variable that cannot be directly observed, using observations of some other variable. The simplest case is two boolean variables, A and B, with A being unobservable and having some causal influence on B. Because of that influence, we can specify two conditional probabilities.

  • Pr(B|A) is the probability that B is true given that A is true.
  • Pr(B|~A) is the probability that B is true given that A is false.

We also assume Pr(A), an a-priori estimate of the probability that A is true, regardless of B. From these we can calculate the probabilities of all four combinations of values for A and B.

P11 = Pr(A&B) = Pr(A) Pr(B|A)
P10 = Pr(A&~B) = Pr(A) (1-Pr(B|A))
P01 = Pr(~A&B) = (1-Pr(A)) Pr(B|~A)
P00 = Pr(~A&~B) = (1-Pr(A)) (1-Pr(B|~A))

and from these we can compute Pr(A|B), the probability that A is true given that B is true, and Pr(A|~B), the probability that A is true given that B is false.

Pr(A|B) = Pr(A&B) / Pr(B) = Pr(A&B) / (Pr(A&B) + Pr(~A&B))
= P11/(P11+P01)
Pr(A|~B) = P10/(P10+P00)

When we observe B to be true or false, we replace our estimate of Pr(A) with Pr(A|B) or Pr(A|~B) respectively. This can be done iteratively.
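
The update can be written as a few lines of plain Java before any RDF is involved. This is a minimal sketch of the formulas above. The class name BayesUpdateSketch, the method update, and the numbers used in main (a prior of 0.3, Pr(B|A) = 0.8, Pr(B|~A) = 0.1) are all hypothetical values chosen for illustration.

```java
// Illustrative only: one step of the two-variable Bayesian update above.
class BayesUpdateSketch {
    // Returns Pr(A | observation of B), given the prior Pr(A) and the two
    // conditional probabilities Pr(B|A) and Pr(B|~A).
    static double update(double prA, double prBgivenA,
                         double prBgivenNotA, boolean observedB) {
        double p11 = prA * prBgivenA;                     // Pr(A&B)
        double p10 = prA * (1.0 - prBgivenA);             // Pr(A&~B)
        double p01 = (1.0 - prA) * prBgivenNotA;          // Pr(~A&B)
        double p00 = (1.0 - prA) * (1.0 - prBgivenNotA);  // Pr(~A&~B)
        return observedB ? p11 / (p11 + p01) : p10 / (p10 + p00);
    }

    public static void main(String[] args) {
        double prA = 0.3;  // hypothetical prior estimate of Pr(A)
        boolean[] observations = { true, true, false };
        // Iterative use: each posterior becomes the prior for the next step.
        for (boolean b : observations) {
            prA = update(prA, 0.8, 0.1, b);
            System.out.println("observed B=" + b + ", Pr(A) is now " + prA);
        }
    }
}
```

With these assumed numbers, a single observation of B being true moves the estimate of Pr(A) from 0.3 to 0.24/0.31, or about 0.774. A single observation of B being false moves it to 0.06/0.69, or about 0.087.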

In order to perform these computations with Jena, we need an RDF representation of the variables and the relationship between them. The file "bayes.rdf" provides RDF resources for "link1", the linking of two boolean variables in this way, and "boolean", indicating that a variable is of boolean type. It also defines five predicates, "p11", "p01", "cause", "effect", and "probability". The N3 for such a representation looks like this.

@prefix  rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix  rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix  bayes:  <file:///home/wware/wware-autosci/semweb/java/bayes.rdf#> .
@prefix  :       <#> .
:L rdf:type bayes:link1 .
:L bayes:p11 0.654 .
:L bayes:p01 0.123 .
:L bayes:cause :A .
:L bayes:effect :B .
:A rdfs:label "A" .
:A rdf:type bayes:boolean .
:A bayes:probability 0.3 .
:B rdfs:label "B" .
:B rdf:type bayes:boolean .

The code for performing the necessary update looks like this. Note that because we are replacing one "?x probability ?y" triple with another, we need a flag to avoid infinite regress. This arrangement (which is not thread-safe) ensures that the headAction() method does only one update for each method call.

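   // probA, PrBgivenA and PrBgivenNotA are not declared in this excerpt; they
   // are assumed to be double fields of the enclosing BayesianInference class.
   private boolean avoidInfiniteRegress;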
   public Model updateProb(Model model, final boolean bvalue) {
       final BayesianInference bi = this;
       bi.avoidInfiniteRegress = false;
       final Node_RuleVariable L = new Node_RuleVariable("x", 0);
       final Node_RuleVariable C = new Node_RuleVariable("y", 1);
       final Node_RuleVariable E = new Node_RuleVariable("z", 2);
       final Node probPred = Node.createURI(baseUri + "probability");
       final Node causePred = Node.createURI(baseUri + "cause");
       final Node effectPred = Node.createURI(baseUri + "effect");
       Functor myFunctor = new Functor("f", new Node[] { });
       myFunctor.setImplementor(new BaseBuiltin() {
           public String getName() { return null; };
           public void headAction(Node[] args, int length,
                                  RuleContext context) {
               if (bi.avoidInfiniteRegress) return;
               bi.avoidInfiniteRegress = true;
               double p11 = probA * PrBgivenA;
               double p10 = probA * (1. - PrBgivenA);
               double p01 = (1.0 - probA) * PrBgivenNotA;
               double p00 = (1.0 - probA) * (1. - PrBgivenNotA);
               if (bvalue) probA = p11 / (p11 + p01);
               else        probA = p10 / (p10 + p00);
               BindingEnvironment be = context.getEnv();
               Node cause = be.getGroundVersion(C);
               Triple tr = new Triple(cause, probPred, Node.ANY);
               context.remove(tr);
               Node newprob = Node.createLiteral(Double.toString(probA));
               tr = new Triple(cause, probPred, newprob);
               context.add(tr);
           }
       });
       Rule updateProbability = new Rule("update probability",
           new ClauseEntry[] { myFunctor },
           new ClauseEntry[] {
               new TriplePattern(L, causePred, C),
               new TriplePattern(L, effectPred, E)
               });
       List<Rule> rules = new ArrayList<Rule>();
       rules.add(updateProbability);
       Reasoner reasoner = new GenericRuleReasoner(rules);
       reasoner.setDerivationLogging(true);
       return ModelFactory.createInfModel(reasoner, model);
   }

Given the simplicity of the calculation to be done, this might seem ridiculously cumbersome. If there were only one pair of boolean variables to track, it's likely some much simpler approach would be preferred.

Where this becomes useful is in an enormous graph tracking hundreds or thousands of probabilistic boolean variables, perhaps all related to each other in complicated ways. These might represent a complex system with a lot of moving parts, such as the stock market or the planet's climate.

Additional elaborations are certainly possible. Bayesian inference can be done with continuous variables as well as discrete variables, and many variables will be influenced by several others at a time rather than just one other. At that point, the means of determining appropriate conditional probabilities to represent those relationships may itself become a considerable challenge.

From the Jena javadoc

Model

An RDF model is a set of Statements. Methods are provided for creating resources, properties and literals and the Statements which link them, for adding statements to and removing them from a model, for querying a model and set operations for combining models.

Models may create Resources [URI nodes and bnodes]. Creating a Resource does not make the Resource visible to the model; Resources are only "in" Models if Statements about them are added to the Model. Similarly the only way to "remove" a Resource from a Model is to remove all the Statements that mention it.

When a Resource or Literal is created by a Model, the Model is free to re-use an existing Resource or Literal object with the correct values, or it may create a fresh one. (All Jena RDFNodes and Statements are immutable, so this is generally safe.)

This interface defines a set of primitive methods. A set of convenience methods which extends this interface, e.g. performing automatic type conversions and support for enhanced resources, is defined in ModelCon.

Graph

The interface to be satisfied by implementations maintaining collections of RDF triples. The core interface is small (add, delete, find, contains) and is augmented by additional classes to handle more complicated matters such as reification, query handling, bulk update, event management, and transaction handling.

For add(Triple) see GraphAdd.

Triple

Triples are the basis for RDF statements; they have a subject, predicate, and object field (all nodes) and express the notion that the relationship named by the predicate holds between the subject and the object.

Resource

An RDF Resource.

Resource instances when created may be associated with a specific model. Resources created by a model will refer to that model, and support a range of methods, such as getProperty() and addProperty() which will access or modify that model. This enables the programmer to write code in a compact and easy style.

Resources created by ResourceFactory will not refer to any model, and will not permit operations which require a model. Such resources are useful as general constants.

This interface provides methods supporting typed literals. This means that methods are provided which will translate a built in type, or an object to an RDF Literal. This translation is done by invoking the toString() method of the object, or its built in equivalent. The reverse translation is also supported. This is built in for built in types. Factory objects, provided by the application, are used for application objects.

This interface provides methods for supporting enhanced resources. An enhanced resource is a resource to which the application has added behaviour. RDF containers are examples of enhanced resources built in to this package. Enhanced resources are supported by encapsulating a resource created by an implementation in another class which adds the extra behaviour. Factory objects are used to construct such enhanced resources.

Literal

An RDF Literal.

In RDF2003 literals can be typed. If typed then the literal comprises a datatype, a lexical form and a value (together with an optional xml:lang string). Old style literals have no type and are termed "plain" literals.

Implementations of this interface should be able to support both plain and typed literals. In the case of typed literals the primitive accessor methods such as getInt() determine if the literal value can be coerced to an appropriate java wrapper class. If so then the class is unwrapped to extract the primitive value returned. If the coercion fails then a runtime DatatypeFormatException is thrown.

In the case of plain literals, the primitive accessor methods duplicate the behaviour of jena1. The literal is internally stored in lexical form, but accessor methods such as getInt will attempt to parse the lexical form and, if successful, will return the primitive value.

Object (i.e. non-primitive) values are supported. In the case of typed literals, a global TypeMapper registry determines what datatype representation to use for a given Object type. In the case of plain literals, the object will be stored in the lexical form given by its toString method. Factory objects, provided by the application, are needed in that case to convert the lexical form back into the appropriate object type.

Rule

Constructors
Rule(java.util.List<ClauseEntry> head, java.util.List<ClauseEntry> body)
Rule(java.lang.String name, ClauseEntry[] head, ClauseEntry[] body)
Rule(java.lang.String name, java.util.List<ClauseEntry> head, java.util.List<ClauseEntry> body)

Representation of a generic inference rule.

This represents the rule specification but most engines will compile this specification into an abstract machine or processing graph.

The rule specification comprises a list of antecedents (body) and a list of consequents (head). If there is more than one consequent then a backchainer should regard this as a shorthand for several rules, all with the same body but with a singleton head.

Each element in the head or body can be a TriplePattern, a Functor or a Rule. A TriplePattern is just a triple of Nodes but the Nodes can represent variables, wildcards and embedded functors - as well as constant uri or literal graph nodes. A functor comprises a functor name and a list of arguments. The arguments are Nodes of any type except functor nodes (there is no functor nesting). The functor name can be mapped into a registered java class that implements its semantics. Functors play three roles -

  • in heads they represent actions (procedural attachment)
  • in bodies they represent builtin predicates
  • in TriplePatterns they represent embedded structured literals that are used to cache matched subgraphs such as restriction specifications.

The equality contract for rules is that two rules are equal if each of their terms (ClauseEntry objects) are equal and they have the same name, if any.

We include a trivial, recursive descent parser but this is just there to allow rules to be embedded in code. External rule syntax based on N3 and RDF could be developed. The embedded syntax supports rules such as:

[ (?C rdf:type *), guard(?C, ?P)  -> (?c rb:restriction some(?P, ?D)) ].
[ (?s owl:foo ?p) -> [ (?s owl:bar ?a) -> (?s ?p ?a) ] ].
[name: (?s owl:foo ?p) -> (?s ?p ?a)].

Only built-in namespaces are recognized as such; * is a wildcard node, ?c is a variable, name(node ... node) is a functor, (node node node) is a triple pattern, [..] is an embedded rule, and commas are ignored and can be freely used as separators. Functor names may not end in ':'.