This weekend I entered LexHacks 2015 (June 6–7), a hackathon focused on the legal industry, and my team won (yay!) the LexPredict challenge: build a parser that could scan a mix of unstructured contracts and identify and extract the names of the parties, the effective date, and the termination date or clause. The hackathon involved a series of specific challenges posted by the event sponsors. Attendees formed teams and, after the two days, submitted a solution.
I decided to focus on a $500 challenge posted by LexPredict, which involved scanning a corpus of over 30,000 unstructured contract documents and trying to extract the parties and certain dates. It seemed obvious from the beginning that the “proper” solution to the problem was to use natural language processing (NLP) methods to extract the data. However, to have something usable within the two-day window of the hackathon, my team mixed some basic NLP techniques with standard text parsing in Python.
The typical NLP workflow would have involved several steps, including sentence segmentation, word tokenization, part-of-speech tagging, named entity extraction, and relationship filtering. It would also have required establishing a training set (i.e., a subset of documents tagged with correct answers) and “training” the algorithms used for several of the steps (sentence segmentation, chunking/tokenizing, etc.). I quickly settled on trying to use the Natural Language Toolkit (NLTK 3) as part of my solution. I had previously (years ago) taken the Coursera NLP course offered by Stanford professors, but I certainly am no NLP pro. While I expected to avoid using the NLP toolkit in favor of custom parsing rules in some places, I also expected to be able to rely on it for some items that were not especially unique to contracts (e.g., identifying names of people).
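To make the idea of a tagged subset concrete, here is a minimal sketch of how extracted values could be scored against hand-labeled answers. It is not what we built at the hackathon; the document IDs, field names, and values are all made up for illustration, and exact string matching is a simplifying assumption.

```python
def score(predicted, gold):
    """Return (precision, recall) for two collections of
    (doc_id, field, value) triples, using exact matching."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical hand-labeled answers for two contracts.
gold = {
    ("doc1", "party", "Acme Widgets LLC"),
    ("doc1", "party", "John Smith"),
    ("doc2", "party", "Jane Doe"),
}

# Hypothetical extractor output: one false positive, one missed party.
predicted = {
    ("doc1", "party", "Acme Widgets LLC"),
    ("doc1", "party", "Services Agreement"),
    ("doc2", "party", "Jane Doe"),
}

precision, recall = score(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Even a few dozen labeled contracts would be enough to tell whether a change to a parsing rule helps or hurts, long before any model training comes into play.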
Luckily, I was able to form a team that included at least three other lawyers and two other developers. In the end, what we understood to be a difficult problem only seemed more so in practice. However, we managed to use the NLP toolkit to tokenize each contract and crafted several token parsing patterns that were able to find and extract some parties and dates. We found that, without further parser training, the NLP toolkit did not extract people's names as reliably as we wanted (too many false positives). The wide variety of documents in the corpus was a frequent source of pain when writing customized rules. Everyone on the team seemed to walk away with a deeper understanding of this type of NLP problem and their own ideas for how they might proceed if they had more time.
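The general shape of a token parsing pattern can be sketched as follows. This is an illustrative reconstruction, not our hackathon code: it swaps in a simple regular-expression tokenizer in place of the NLP toolkit so that it runs with the standard library alone, and the function names, the cue words, and the sample sentence are all assumptions. It only handles "Month D, YYYY" dates and a "between X and Y" party clause.

```python
import re
from datetime import date

# Words, numbers, and single punctuation marks each become one token.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:[.'&-][A-Za-z]+)*|\d+|[^\w\s]")

MONTHS = {name: number for number, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}


def tokenize(text):
    return TOKEN_RE.findall(text)


def find_dates(tokens):
    """Yield (index, date) for token runs shaped like 'June 1, 2015'."""
    for i, tok in enumerate(tokens):
        month = MONTHS.get(tok.lower())
        rest = tokens[i + 1:i + 4]
        if month is None or len(rest) < 2 or not rest[0].isdigit():
            continue
        j = 2 if rest[1] == "," else 1  # the comma is optional
        if j < len(rest) and rest[j].isdigit() and len(rest[j]) == 4:
            try:
                yield i, date(int(rest[j]), month, int(rest[0]))
            except ValueError:  # e.g. "June 31, 2015"
                pass


def find_effective_date(tokens):
    """Return the first date introduced by 'as of', 'effective', or 'dated'."""
    for i, found in find_dates(tokens):
        before = [t.lower() for t in tokens[max(0, i - 2):i]]
        if before == ["as", "of"] or (before and before[-1] in {"effective", "dated"}):
            return found
    return None


def take_name(tokens, i):
    """Collect a run of capitalized tokens; return (name, next_index)."""
    j = i
    while j < len(tokens) and tokens[j][0].isupper():
        j += 1
    return " ".join(tokens[i:j]), j


def skip_parenthetical(tokens, i):
    """Skip a '( ... )' group such as a defined-term clause, if present."""
    if i < len(tokens) and tokens[i] == "(":
        while i < len(tokens) and tokens[i] != ")":
            i += 1
        i += 1
    return i


def find_parties(tokens):
    """Return [first, second] from a 'between X and Y' clause, else []."""
    for i, tok in enumerate(tokens):
        if tok.lower() != "between":
            continue
        first, j = take_name(tokens, i + 1)
        j = skip_parenthetical(tokens, j)
        if first and j < len(tokens) and tokens[j].lower() == "and":
            second, _ = take_name(tokens, j + 1)
            if second:
                return [first, second]
    return []


sample = ('This Services Agreement is entered into as of June 1, 2015 by and '
          'between Acme Widgets LLC (the "Company") and John Smith (the "Contractor").')
tokens = tokenize(sample)
print(find_parties(tokens), find_effective_date(tokens))
```

A rule like this breaks quickly on real contracts (names containing commas, such as "Acme, Inc.", or dates written as "1st day of June"), which is the kind of variety that made customized rules painful across a large corpus.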
By the end of the event, everyone seemed happy about what we were able to accomplish in only two days. Overall, it was an interesting event at which I met a lot of new people. Winning our challenge was just an extra bonus.
The team members were: Edward Bryant, Chase Hertel, Tomek Rabczak, Tetyana Rabczak, Bharat Lavania, and Jon Riley.