Sunday, 30 August 2015

Dependency Parsing in Stanford CoreNLP


If you are working on Natural Language Processing, this post will be useful for triplet extraction from documents.
It assumes basic knowledge of concepts such as part-of-speech tagging and tokens. Let's discuss dependency parsing first.

Stanford Dependency Parsing:
Stanford dependencies provide a representation of the grammatical relations between words in a sentence. Each dependency is a triplet: the name of the relation, the governor and the dependent.
Here is an example sentence:
Bell, based in Los Angeles, makes and distributes electronic, computer and building products.

We can see, for instance, that the subject of the verb "distributes" is "Bell". For the above sentence, the Stanford dependencies (SD) representation is:

     nsubj(makes-8, Bell-1)
     nsubj(distributes-10, Bell-1)
     vmod(Bell-1, based-3)
     nn(Angeles-6, Los-5)
     prep_in(based-3, Angeles-6)
     root(ROOT-0, makes-8)
     conj_and(makes-8, distributes-10)
     amod(products-16, electronic-11)
     conj_and(electronic-11, computer-13)
     amod(products-16, computer-13)
     conj_and(electronic-11, building-15)
     amod(products-16, building-15)
     dobj(makes-8, products-16)





In the above representation, the first term is the dependency tag, which names the relation between the governor (second term) and the dependent (third term). The number after each word is its position in the sentence.
The available dependency tags are listed in the Stanford Dependencies manual.
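To make the three parts of the notation concrete, here is a minimal sketch in plain Java that splits one printed dependency into its relation, governor and dependent. The class DependencyTriplet and its parse method are hypothetical helpers written for this post, not part of CoreNLP, and the pattern assumes the simple relation(word-index, word-index) form shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a CoreNLP class): holds one parsed dependency.
class DependencyTriplet {
    final String relation;
    final String governor;
    final int governorIndex;
    final String dependent;
    final int dependentIndex;

    // Matches text such as "nsubj(makes-8, Bell-1)".
    private static final Pattern FORMAT =
            Pattern.compile("\\s*(\\w+)\\((.+)-(\\d+),\\s*(.+)-(\\d+)\\)\\s*");

    DependencyTriplet(String relation, String governor, int governorIndex,
                      String dependent, int dependentIndex) {
        this.relation = relation;
        this.governor = governor;
        this.governorIndex = governorIndex;
        this.dependent = dependent;
        this.dependentIndex = dependentIndex;
    }

    // Splits the printed form into relation, governor and dependent.
    static DependencyTriplet parse(String text) {
        Matcher m = FORMAT.matcher(text);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a dependency: " + text);
        }
        return new DependencyTriplet(
                m.group(1),
                m.group(2), Integer.parseInt(m.group(3)),
                m.group(4), Integer.parseInt(m.group(5)));
    }

    public static void main(String[] args) {
        DependencyTriplet t = parse("nsubj(makes-8, Bell-1)");
        // prints: nsubj | makes | Bell
        System.out.println(t.relation + " | " + t.governor + " | " + t.dependent);
    }
}
```

When you work with CoreNLP directly you would read these parts from its own dependency objects rather than from printed text; the sketch only illustrates how the notation is laid out.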

The dependencies come in two kinds of representation:
  •  Basic/Non-collapsed: The basic representation forms a tree over the sentence; the non-collapsed representation additionally gives the extra dependencies (which break the tree structure). Neither involves any collapsing or propagation of conjuncts. E.g.
                prep(based-3, in-4)
                pobj(in-4, Angeles-6)
  •  Collapsed: In the collapsed representation, dependencies involving prepositions, conjuncts, as well as information about the referent of relative clauses are collapsed to get direct dependencies between content words. For instance, the dependencies involving the preposition "in" in the above example will be collapsed into one single relation:
               prep(based-3, in-4)
               pobj(in-4, Angeles-6)
         will become:  prep_in(based-3, Angeles-6)
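The collapsing step for prepositions can be sketched in a few lines of plain Java. This is only an illustration under simplifying assumptions: each dependency is a {relation, governor, dependent} string array, the token indices follow the example sentence at the top of the post, and the class CollapseSketch is a hypothetical name. CoreNLP does the real collapsing internally and handles many more cases (conjunctions, multi-word prepositions, relative clauses).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only, not CoreNLP code.
class CollapseSketch {

    // Each dependency is {relation, governor, dependent},
    // e.g. {"prep", "based-3", "in-4"}.
    static List<String[]> collapsePrepositions(List<String[]> deps) {
        List<String[]> out = new ArrayList<>();
        for (String[] d : deps) {
            if (d[0].equals("prep")) {
                // Look for the object governed by this preposition.
                String[] obj = find(deps, "pobj", 1, d[2]);
                if (obj != null) {
                    String name = "prep_" + word(d[2]).toLowerCase();
                    out.add(new String[] {name, d[1], obj[2]});
                    continue;
                }
            } else if (d[0].equals("pobj") && find(deps, "prep", 2, d[1]) != null) {
                // Already merged into a prep_* relation above.
                continue;
            }
            out.add(d);
        }
        return out;
    }

    // Returns the first dependency with the given relation whose
    // governor (slot 1) or dependent (slot 2) equals token, or null.
    static String[] find(List<String[]> deps, String relation, int slot, String token) {
        for (String[] d : deps) {
            if (d[0].equals(relation) && d[slot].equals(token)) {
                return d;
            }
        }
        return null;
    }

    // Strips the position suffix: "in-4" becomes "in".
    static String word(String token) {
        int dash = token.lastIndexOf('-');
        return dash < 0 ? token : token.substring(0, dash);
    }

    public static void main(String[] args) {
        List<String[]> basic = Arrays.asList(
                new String[] {"prep", "based-3", "in-4"},
                new String[] {"pobj", "in-4", "Angeles-6"});
        for (String[] d : collapsePrepositions(basic)) {
            // prints: prep_in(based-3, Angeles-6)
            System.out.println(d[0] + "(" + d[1] + ", " + d[2] + ")");
        }
    }
}
```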

Now let's see how to get these using Java code.

import java.util.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

class ParserDemo {
    public static void main(String[] args) {
        // Load the English PCFG model (it must be on the classpath).
        LexicalizedParser lp = LexicalizedParser
                .loadModel("edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");
        lp.setOptionFlags(new String[] { "-maxLength", "80",
                "-retainTmpSubcategories" });

        // Parse an already tokenized sentence into a phrase-structure tree.
        // Note: in newer CoreNLP releases this helper class is named SentenceUtils.
        String[] sent = { "This", "is", "an", "easy", "sentence", "." };
        List<CoreLabel> rawWords = Sentence.toCoreLabelList(sent);
        Tree parse = lp.apply(rawWords);
        parse.pennPrint();
        System.out.println();

        // Convert the tree into typed dependencies
        // (collapsed, with conjunct dependencies propagated).
        TreebankLanguagePack tlp = new PennTreebankLanguagePack();
        GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
        GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
        List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
        System.out.println(tdl);

        // Print the tree and the collapsed dependencies in one go.
        TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
        tp.printTree(parse);
    }
}


Now you can easily extract the triplets from a document. You can find the example code in the GitHub repo.
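As a rough illustration of what triplet extraction can look like once you have the dependencies, the sketch below pairs each nsubj with a dobj that shares the same governor to form (subject, verb, object) triples. Reading "triplet" as subject-verb-object is an assumption made for this example, and TripletExtractor is a hypothetical class in plain Java that takes the dependencies as {relation, governor, dependent} string arrays rather than CoreNLP objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only, not CoreNLP code.
class TripletExtractor {

    // Builds "(subject, verb, object)" strings from nsubj/dobj pairs
    // that share the same governor (the verb).
    static List<String> extract(List<String[]> deps) {
        List<String> triples = new ArrayList<>();
        for (String[] subj : deps) {
            if (!subj[0].equals("nsubj")) {
                continue;
            }
            for (String[] obj : deps) {
                if (obj[0].equals("dobj") && obj[1].equals(subj[1])) {
                    triples.add("(" + word(subj[2]) + ", " + word(subj[1])
                            + ", " + word(obj[2]) + ")");
                }
            }
        }
        return triples;
    }

    // Strips the position suffix: "makes-8" becomes "makes".
    static String word(String token) {
        int dash = token.lastIndexOf('-');
        return dash < 0 ? token : token.substring(0, dash);
    }

    public static void main(String[] args) {
        // A subset of the dependencies listed for the example sentence.
        List<String[]> deps = Arrays.asList(
                new String[] {"nsubj", "makes-8", "Bell-1"},
                new String[] {"nsubj", "distributes-10", "Bell-1"},
                new String[] {"dobj", "makes-8", "products-16"});
        // prints: [(Bell, makes, products)]
        System.out.println(extract(deps));
    }
}
```

Because the listed output has no dobj for "distributes", only the triple for "makes" is produced; a fuller extractor would also need to decide how to treat conjoined verbs and modifiers.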

