Tag Archives: information retrieval

Deleting the Law

Back in May, Christine Kirchberger posted an interesting quote from 1968 relating to the growth in the size of the law. Ms. Kirchberger goes on to briefly argue that perhaps a formal system of identifying and deleting “non-relevant legislation and case-law” could help improve the performance of legal information retrieval (IR) systems (i.e., something akin to the delete movement in the area of privacy regulation).

Although I understand the frustration of dealing with an ever-growing mountain of data, I think the solution to this challenge lies in improving IR technology, not in forcibly reducing the amount of content to be indexed. Further, the assumption that non-current legal information would be excluded from IR systems is simply wrong. In a common law system especially, case law is never really non-relevant, no matter how much time has passed without it being cited or referenced. In addition, there are numerous research scenarios in which historical information (what the law used to be at a particular time in the past) is the goal (e.g., auditing or litigation over past actions). While I admit many laws could be simplified or reduced in size, much of the growth in the law is more likely due to an increasingly complex society and the incremental way in which the law grows.

The Competency of the Legal Information Industry at the “Entity Web”

Matthew Hurst (@ Data Mining) recently posted about the concept of the “entity web” to describe companies involved in web-based information retrieval that are evolving into more than search engines for retrieving textual documents. Hurst speculates about the corporate skill set that will be needed to deliver on this concept, which he terms the three competencies: understanding (1) the Web (e.g., HTML, CSS, AJAX, and other web technologies); (2) the world (i.e., the real-world relationships between data points, such as that a song has an artist); and (3) Web presence (e.g., how entities appear and interact on the web). Of course, competencies (2) and (3) include the ability to record and use this knowledge in some structured model. I characterize what Hurst is discussing as integrating semantic data into existing textual search services. I also think the term “entity” is a bit limited because it implies the data is focused only on the actors (people, organizations, websites, document sources, etc.) when the user’s information needs may not be focused on entities at all (e.g., asking a system how photosynthesis functions or the answer to 1 + 1). Whatever you label it, Hurst is right about the direction in which we seem to be headed, and when you think about how the traditional legal information industry measures up on these competencies, things do not look very good.

the Web.— Hurst comments that this is an area that the broad market players (Google, Facebook, etc.) have largely mastered (but have room to improve). On the legal side, I would say that large legal publishers have suffered from many of the same problems as other older companies when it comes to embracing web technologies: namely, they tend to lag too far behind in adopting the newest web technologies. They also have a hard time building institutional knowledge in this area because they often outsource this type of work to vendors and let some departments have too much influence (e.g., marketing and communication, public relations). Overall, I would say the legal information industry is obviously not as competent as the big tech companies in this area, but it generally does well with deploying established web technologies and is on par with other older companies when it comes to adopting the newest technologies.

the World.— This is probably the area in which the traditional legal information industry is the most competent, but even here I think there are many reasons to be worried about the future. There is a high degree of competency in this area because traditional legal publishers have spent a long time developing institutional knowledge related to all the intricacies of government data and distribution. Other than perhaps law libraries, there are very few places that foster this kind of knowledge. I think this institutional knowledge is, however, at risk because many legal publishers have increasingly outsourced or automated the very functions that gave rise to this knowledge-building.

Web Presence.— This is probably the area in which traditional legal publishers are the weakest. In the legal field, a complete understanding of web presence would involve how all the various actors interact on the web (e.g., legislators, courts, state and federal agencies, lawyers, etc.). Although traditional legal publishers are most familiar with official entities involved in issuing documents (legislatures, courts, etc.), they are much less familiar with entities that discuss or debate the legal content (blogs/blawgs, social networking sites, law firms, political and legal discussions by non-professionals, etc.). A future entity web information retrieval system might need to track these sources to know that ‘Obamacare’ refers to the Affordable Care Act or that, while a particular judge has not ruled on an issue, his wife belongs to a group on Facebook against the issue.

Judicial Notice in the Information Age

Prof. Jon M. Garon @ the Chase Law & Informatics Institute blog recently posted about the evolution of courts’ use of judicial notice now that society is accustomed to finding any information on the Internet. In short, judicial notice is when a court accepts some factual assertion as true, and Garon is clearly worried that courts are using it too liberally and not critically assessing the source of the information. Garon cites the Federal Rules of Evidence as a typical standard used for judicial notice:

(b) Kinds of Facts That May Be Judicially Noticed. The court may judicially notice a fact that is not subject to reasonable dispute because it:

  (1) is generally known within the trial court’s territorial jurisdiction; or
  (2) can be accurately and readily determined from sources whose accuracy cannot reasonably be questioned.

(Fed. Rule of Evid. 201(b))

This made me reflect on how I perceived the idea of judicial notice when I was first introduced to it in law school (approx. 1997) compared to how I perceive it today. I have to admit that it seemed like a sensible and rather boring concept back then, but after 10 years working in an information industry I cannot think of many examples that I would say meet (2) above. While Garon is worried that the information age has made us less critical of the source of information, I think it has made me more critical.

This debate ties together two of my recent posts: one on the idea that consumers’ reliance on information retrieval tools has made them less critical of the source of the information, and another where I thought lawyers may not always be able to ethically assume the reliability of information from legal research services. Although I view data critically, I am not sure whether this view is common or whether it is shaped by my particular work experience. I would sum up my view as a belief that all data of a significant size contains errors and, therefore, should be taken with a grain of salt. In the example of a lawyer’s ethical responsibility, I think this means not focusing on whether errors exist but on what processes exist to prevent or correct them. In the context of judicial notice, I think this means allowing room to argue that, no matter how reliable a particular source is, errors happen. So, if the location of an address is judicially noticed by relying on Google Maps in 99 prior cases, don’t preclude the argument that Google Maps is wrong in the 100th case.

Being Precise About Precision

As I am reading Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze (Cambridge University Press, 2009), I have some issues with how the authors have discussed precision. Admittedly, I have not yet covered the Evaluation in Information Retrieval chapter, where I expect more detailed discussion of the concept of precision. However, the authors have defined precision as the fraction of the returned results that are relevant to the information need.

My initial reaction to this definition was that precision cannot really be measured if it is compared to an abstract information need of the user rather than the more concrete search query used to express that need. In my mind these things can be quite different, and I think the distinction is important. Sure enough, the issue came up again later in the book when the authors attempted to illustrate the ineffectiveness of lemmatizers (tools that reduce words to their base forms). They claim that a harm to precision occurs because

“a sentence with the words operate and system is not a good match for the query operating and system.”

First, you will notice that the authors are not comparing the result to the information need; they are comparing the result to the query, which violates their own definition. Further, the example only works on the assumption that a query for operating and system represents an information need for results mentioning computer operating systems. I am not comfortable basing an assessment of a system’s precision on such assumptions. I would challenge the authors to distinguish a user who is seeking information on computer operating systems (and who queries operating and system) from a user who is seeking information on robotic systems used in operating rooms (and who might equally use the query operating and system). I think a sentence with operate and system is a good match in the second scenario.
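To make the two readings concrete, here is a minimal Python sketch. The `normalize` function is a hypothetical toy suffix-stripper that I made up for illustration; it is not a real lemmatizer and not anything taken from the book. The sample sentence and the relevance judgments are likewise my own assumptions. The point is only that normalization decides whether the sentence matches the query, while the assumed information need decides whether that match counts as relevant.

```python
def normalize(token: str) -> str:
    """Toy normalizer (hypothetical): lowercase and strip one common suffix."""
    token = token.lower()
    for suffix in ("ing", "es", "ed", "e", "s"):
        # Only strip when at least four characters of the word remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def matches(query: str, sentence: str) -> bool:
    """True if every normalized query term occurs in the normalized sentence."""
    sentence_terms = {normalize(t) for t in sentence.split()}
    return all(normalize(t) in sentence_terms for t in query.split())


query = "operating system"
sentence = "Surgeons operate the robotic system remotely"  # made-up example

# Without normalization the sentence is not retrieved at all.
exact_match = set(query.split()) <= set(sentence.lower().split())

# With normalization, "operating" and "operate" collapse to the same term.
normalized_match = matches(query, sentence)

# Whether the retrieved sentence is relevant is an assumption about the
# user's information need, not something the query itself tells us.
relevant_under_need = {
    "computer operating systems": False,
    "robotic systems used in operating rooms": True,
}

print("exact match:", exact_match)
print("normalized match:", normalized_match)
for need, relevant in relevant_under_need.items():
    print(f"need = {need!r}: relevant = {relevant}")
```

The same retrieved sentence hurts precision under the first assumed need and helps it under the second, even though the query is identical in both cases.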

I frequently see these assumptions in the context of legal data as well. The most common examples are scenarios in which a user searches for a legal concept that is relevant at the state and federal level and the corpus includes one document addressing each specific jurisdiction. Which documents are relevant? Which documents should be ranked first? What does a high-precision result list look like? The truth is that the query is ambiguous, and it’s hard to measure precision unless you make potentially incorrect assumptions about the user’s actual information need (assume federal, etc.) or you measure precision by the actual query used, in which case returning all 51 documents is 100% precision.
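The difference between the two ways of measuring can be sketched in a few lines of Python. The corpus here is hypothetical: 51 made-up document IDs, one federal and 50 state (the IDs are placeholders), all of which match the query. Precision is computed as the fraction of returned results judged relevant, and the only thing that changes between the two calls is the relevance judgment.

```python
def precision(returned: list[str], relevant: set[str]) -> float:
    """Fraction of the returned results that are judged relevant."""
    if not returned:
        return 0.0
    hits = sum(1 for doc in returned if doc in relevant)
    return hits / len(returned)


# Hypothetical corpus: one document per jurisdiction, all matching the query.
federal_docs = ["federal"]
state_docs = [f"state-{n:02d}" for n in range(1, 51)]
returned = federal_docs + state_docs  # the system returns all 51 documents

# Judgment 1: relevance is measured against the query as written.
relevant_to_query = set(returned)

# Judgment 2: we assume (possibly wrongly) that the user wanted federal law only.
relevant_to_federal_need = set(federal_docs)

p_query = precision(returned, relevant_to_query)
p_federal = precision(returned, relevant_to_federal_need)

print(f"measured against the query:        {p_query:.3f}")
print(f"measured against a federal need:   {p_federal:.3f}")
```

The result list is identical in both cases; the score moves from perfect to about 2% purely because of what the evaluator assumed the user wanted.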

A Glossary of Legal Data Terminology

As part of my own self-study to expand my knowledge of issues related to the use of legal data, I started to create a glossary of common terminology to help me absorb concepts and organize my thoughts. Since the glossary would likely include terminology from various related fields (computer science, web development, law, legal publishing, statistics, data visualization, etc.), I thought I would post it here in the hope that someone else may find it useful (or, even better, offer suggestions and feedback, hint, hint). My intent is to build the glossary incrementally as I encounter terms during my job, while blogging, or while reading outside materials. The first terms added will likely be basic IR terms because I am currently reading Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze (Cambridge University Press, 2009).

I will post here as new terms are added.

UPDATE: The glossary has been posted and can be accessed through the header navigation bar. The glossary includes the following terms: Boolean Logic, Corpus, Document Unit, Grepping, Indexing, Information Need, Information Retrieval, Relevance, Stop List, Stop Word, Tokenization.