The High Costs of Search Illiteracy for Lawyers

“Search Literacy” is the term I have decided to use to refer to a person’s ability to use search technology and interpret queries and results. Search technology has become so integrated into our everyday lives that being search literate should be considered a basic skill. A recent post by Christopher Danzig @ Above the Law touched on this concept, which has also come up in a few of my recent posts. Danzig’s post highlights a recent decision by a federal District Judge that allowed a large pharmaceutical company to avoid the enormous costs it would have faced for its own lack of search literacy. According to Danzig:

During discovery, I-Med agreed to a forensic keyword search of its computer networks, servers, and storage devices. I-Med made the mistake of not limiting the search to active files or particular time periods. The company also allowed search of “unallocated space,” where deleted and temporary files are kept.

The uber-broad search criteria turned out to be a problem, unsurprisingly. Just in the unallocated space alone, the term generated more than 64 million hits, which represented somewhere around 95 million pages of data. Yeesh. I-Med realized they shouldn’t have agreed to the search conditions, because conducting privileged review on that scale would cost so much money and time. The company asked the magistrate judge in the case for relief from the stipulation.

The company was allowed to avoid the consequences of its lack of search literacy, but the debacle obviously highlights the risk of a lawyer not being search literate. I don’t know if the lawyers responsible were held accountable in some way, but I would have liked to see the company face some consequences to create an incentive for others not to make such obvious mistakes. If a lawyer cost a client money because they failed to understand the accounting involved in a settlement agreement, I doubt a judge would be so understanding. The judge in the case, however, sounded more search literate than the lawyers for either party; his decision includes a footnote complaining about the way the parties made their arguments. Specifically, the judge noted that

“[I]t is troubling that the parties refer to the number of raw hits as though each represented a separate document. Given the volume of hits and search terms used … it stands to reason that at least some files mentioning product lines would make reference to more than one at the same time. Consequently, the [c]ourt is left to wonder whether the total hit and estimated page numbers are genuinely correct.”
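The judge’s complaint can be made concrete with a small sketch (the documents and terms below are invented for illustration): counting every raw keyword hit overstates the number of documents that actually need review, because a single document can match several terms several times.

```python
# Hypothetical documents and search terms; in e-discovery, "hits" figures
# often count every occurrence of every term rather than unique documents.
docs = {
    "doc1": "the gel product and the lens product were shipped together",
    "doc2": "lens lens lens",
    "doc3": "unrelated memo",
}
terms = ["gel", "lens", "product"]

# Raw hits: every occurrence of every term in every document.
raw_hits = sum(doc.split().count(t) for doc in docs.values() for t in terms)

# Documents to review: each document counted once, no matter how many hits.
matching_docs = {name for name, doc in docs.items()
                 if any(t in doc.split() for t in terms)}

print(raw_hits)            # 7 raw hits...
print(len(matching_docs))  # ...but only 2 documents to review
```

Scaled up to 64 million hits, the gap between hit counts and document counts is exactly what leaves the court “to wonder whether the total hit and estimated page numbers are genuinely correct.”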

This incident and the issues of search literacy raised in my previous posts seem to focus on two types of people with search literacy problems. The first type are those who tend to be younger and might be overly reliant on search technology and, therefore, lack some other research skills, such as using proper judgment in assessing the source of information. The second group (likely those in the example above) are those who tend to be older and might suffer from a lack of familiarity with search technology. I previously mentioned ways I think research platforms and other information retrieval systems can aid the first group, but features that specifically aid the second group are somewhat tougher to envision. It seems entirely likely that the lawyers who actually signed off on the above e-discovery stipulation were senior lawyers who tend to delegate legal research and, therefore, have even more limited exposure to search technology than the average lawyer.

The Launch of the Legal Language Explorer

Prof. Daniel M. Katz @ the Computational Legal Studies blog recently posted about the beta pre-release of the Legal Language Explorer. The site allows users to query one or more comma-separated phrases and returns a time series plot of instances where each phrase appears in the decisions of the U.S. Supreme Court (coverage may expand to include the federal appellate courts). Katz also posted a link to his paper on SSRN explaining the work behind the site.

legal language explorer tool

I have not read the paper yet (I hope to have the time soon) but I did play around with the site a little and had a few quick observations. It is interesting to see the prevalence of certain phrases within the court’s decisions, especially where they are tied to cultural or legal changes. I tried a couple of example queries, such as “id, see, ibid” as a way to visualize how the court’s use of citations has changed over time. The time plot displays the raw number of instances of the phrases, but the advanced features allow you to switch this to an average count per case (without this, the increase in the number of cases would make it appear that nearly every term is used more often after the 1940s). I think it would be better if the site had this normalization feature on by default, but that’s a small criticism. Overall, I think it’s an interesting tool and will try to read the whole paper soon.
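The per-case normalization I describe above can be sketched in a few lines (the yearly figures below are toy numbers, not actual Supreme Court data): dividing each year’s raw phrase count by the number of cases decided that year removes the effect of the court’s growing caseload.

```python
# Toy yearly data: (raw phrase count, number of cases decided).
# These figures are invented purely for illustration.
yearly = {
    1900: (30, 300),
    1950: (90, 600),
    2000: (120, 1200),
}

# Normalize each year's raw count by that year's caseload.
per_case = {year: count / cases for year, (count, cases) in yearly.items()}

print(per_case)  # {1900: 0.1, 1950: 0.15, 2000: 0.1}
```

Note how the raw counts rise steadily (30, 90, 120) while the per-case rate actually peaks in 1950, which is why plotting raw counts alone can be misleading.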

Judicial Notice in the Information Age

Prof. Jon M. Garon @ the Chase Law & Informatics Institute blog recently posted about the evolution of courts’ use of judicial notice now that society is accustomed to finding any information on the Internet. In short, judicial notice is when a court accepts some factual assertion as true, and Garon is clearly worried that courts are using it too liberally and not critically assessing the source of the information. Garon cites the Federal Rules of Evidence as a typical standard used for judicial notice:

(b) Kinds of Facts That May Be Judicially Noticed. The court may judicially notice a fact that is not subject to reasonable dispute because it:

  (1) is generally known within the trial court’s territorial jurisdiction; or
  (2) can be accurately and readily determined from sources whose accuracy cannot reasonably be questioned.

(Fed. Rule of Evid. 201(b))

This made me reflect on my perceptions of the idea of judicial notice when I was first introduced to it in law school (approx. 1997) compared to today. I have to admit that it seemed like a sensible and rather boring concept back then, but after 10 years working in an information industry I cannot think of many examples that I would say meet (2) above. While Garon is worried that the information age has made us less critical of the source of information, I think it has made me more critical.

This debate ties together two of my recent posts: one on the idea that consumers’ reliance on information retrieval tools has made them less critical of the source of the information, and another where I argued lawyers may not always be able to ethically assume the reliability of information from legal research services. Although I view data critically, I am not sure if this view is common or whether it is shaped by my particular work experience. I would sum up my view as a belief that all data of a significant size contains errors and, therefore, should be taken with a grain of salt. In the example of a lawyer’s ethical responsibility, I think this means not focusing on whether errors exist but on what processes exist to prevent or correct them. In the context of judicial notice, I think this means allowing room to argue that no matter how reliable a particular source is — errors happen. So, if the location of an address is judicially noticed by relying on Google Maps in 99 prior cases, don’t preclude the argument that Google Maps is wrong in the 100th case.

New Year’s Eve and the Effort for Usable Government Data

With New Year’s Eve fast approaching, it seems like an especially appropriate time to reflect on the past. Amid the various efforts to make basic government-published legal data more accessible (especially in open formats that are machine readable), I think it helps to also remember how far things have come. Over at the company blog for Citation Technologies, a long-time employee, David Gottlieb, recounts the difficulties the company faced as a legal publishing start-up back in the mid-nineties. David’s post contains various anecdotes, including how the company’s only server was thrown off the roof of the office in a business dispute, but it is also a reminder of the common thread between the challenges of the mid-nineties and today — access to usable information.

Although you commonly hear complaints about government entities that publish information in not-so-open proprietary formats such as PDF files, and how a standardized XML markup would be so much better, David’s post explains the difficulties of obtaining any electronic version of state regulations back in the nineties. He also mentions manually correcting bad OCR scans, navigating magnetic tapes, and how a southern state once informed them that there was no way they had regulations in any electronic format because they had only just gotten electric typewriters. It is easy to think that the push for usable government data is new, but it’s actually a long-term effort in which the meaning of “usable” keeps evolving.

I hope a few years from now we will be arguing about how the government should provide deeper semantic markup and look back on the days of PDFs and Word documents in the same way we look at the magnetic tapes and OCR corrections in David Gottlieb’s post.

Glossary Update

The glossary has been updated with the following terms: Disposition, Caption, Case-Folding, Truecasing, Normalization, Stemming, Lemmatization, Heuristic, Porter’s Algorithm, Phrase Queries, Phrase Index, Biword Index, Positional Index.

Companies and Projects List

Even as someone working in this field, I sometimes have a hard time keeping track of the various companies and projects that are involved in the use of legal data. So, I decided to start accumulating a list (not-so-surprisingly titled “companies and projects”) among the other links in the right-hand column of this blog. If you don’t see something you think should be listed, just leave a comment.

Debate on Ethical Rules Governing Users of Legal Information

Carolyn Elefant @ the My Shingle blog has kicked up an interesting discussion regarding proposed changes to ABA Ethics Model Rule 5.3. In short, the ABA is seeking to expand an existing rule, which requires lawyers to supervise the work of nonlawyer assistants, to include services provided by computer-based services (i.e., “nonlawyer assistance” rather than “assistants”). The intention of the change appears to be aimed at the security and confidentiality issues surrounding cloud computing services, where a lawyer may store client information. However, Elefant argues that the change inadvertently imposes additional ethical responsibilities on lawyers over how they might supervise online legal research services used in work for clients. Elefant speculates on the ways in which lawyers might exert greater supervision over such services, including requesting disclosure of search algorithms, requiring certain training for legal publisher employees who write headnotes or other information, and scrutinizing the use of search query data, which may contain client information.

Andrew Perlman, the chief reporter to the ABA Commission on Ethics 20/20 (the body overseeing the proposed rule change), jumped into the comments section of Elefant’s blog to argue that she was misinterpreting the proposed rule; however, I am far from convinced it’s a misinterpretation. Notably, Perlman doesn’t argue that the proposed rule would not apply to online legal research services (or other providers of legal information or software, for that matter) — he simply argues that this isn’t a problem because lawyers can rely on the reputation of the provider. Although there doesn’t seem to be any language in the rule to justify a reliance on reputation, he claims:

Lexis and Westlaw have outstanding reputations, and there is no reason to think that the information that is generated on those services is somehow erroneous or incomplete or that the research conducted there would be any less complete than if the lawyer performed the research the old-fashioned way.

There are several problems with Perlman’s assertions about the proposed rule. First, any provider of a significantly large amount of data understands there are always errors and gaps in the data, so claiming that there is no reason to think these exist is, frankly, naive. Also, there is a universe of difference between having “no reason to think” data is unreliable and verifying the reliability of data. Second, Lexis and Westlaw are not the only providers in the market and several questions can be raised about using corporate reputation as a proxy for data quality. Some follow-up questions I would ask include:

  • does a provider’s reputation carry over to new products or services?
  • what if the provider is new to the market and has no pre-existing reputation?
  • is it more reputable to be a commercial provider or a free service?
  • what about providers that provide non-legal research services?
  • what happens if events reduce a provider’s reputation (data breaches, etc.)?
  • what about preexisting reputation in light of major changes to editorial process (i.e., the use of auto-summarization, out-sourcing, or off-shoring to replace the use of professional editors)?

The whole idea of using something as abstract and subjective as reputation to fulfill one’s ethical responsibility to ensure the reliability of information you use to aid a client should be anathema to anyone with critical-thinking skills. Third, the standard of comparison Perlman assumes a lawyer acting ethically under the rule would use — the “old-fashioned way” — sets the bar extremely low. I think information providers, lawyers, and clients all have higher expectations than an “it’s as good as print was” standard would produce.

Elefant’s post and Perlman’s comments also raised several interesting questions about how information providers use search queries that might include sensitive client information, which I will address in a separate post (stay tuned).

Being Precise About Precision

As I am reading Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze (Cambridge University Press, 2009), I have some issues with how the authors have discussed precision. Admittedly, I have not yet covered the Evaluation in Information Retrieval chapter, where I expect more detailed discussion of the concept of precision. However, the authors have defined precision as the fraction of the returned results that are relevant to the information need.

My initial reaction to this definition was that precision cannot really be measured if it is compared to an abstract information need of the user rather than the more concrete search query used to express that need. In my mind these things can be quite different, and I think the distinction is important. Sure enough, the issue came up again further in the book. When the authors attempt to illustrate the ineffectiveness of lemmatizers (which reduce words to their base forms), they claim that a harm to precision occurs because

“a sentence with the words operate and system is not a good match for the query operating and system.”

First, you will notice that the authors are not comparing the result to the information need, they are comparing the result to the query — violating their own definition. Further, the example only works on the assumption that a query for operating and system represents an information need for results mentioning computer operating systems. I am not comfortable basing an assessment of a system’s precision on such assumptions. I would challenge the authors to distinguish between a user who is seeking information on computer operating systems (who might query operating and system) and a user who is seeking information on robotic systems used in operating rooms (who might equally use the query operating and system). I think a sentence with operate and system is a good match in the second scenario.

I frequently see these assumptions in the context of legal data as well. The most common examples are scenarios in which a user searches for a legal concept that is relevant at the state and federal level but the corpus includes one document addressing each specific jurisdiction. Which documents are relevant? Which documents should be ranked first? What does a high precision result list look like? The truth is the query is ambiguous, and it’s hard to measure precision unless you make potentially incorrect assumptions about the user’s actual information need (assume federal, etc.) or you measure precision by the actual query used — returning all 51 documents is 100% precision.
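The 51-document scenario can be worked through numerically (the corpus below is hypothetical: one document per state plus one federal document). The same result list scores very differently depending on which notion of relevance you measure against.

```python
def precision(returned, relevant):
    """Fraction of the returned results that are relevant."""
    returned = set(returned)
    return len(returned & set(relevant)) / len(returned)

# Hypothetical corpus: one document per jurisdiction, all matching the query.
returned = ["federal"] + [f"state_{i}" for i in range(50)]  # all 51 returned

# Measured against the literal query, every returned document matches:
print(precision(returned, returned))  # 1.0

# Measured against an assumed information need ("the user meant federal"),
# the very same result list scores terribly:
print(precision(returned, ["federal"]))  # 1/51, about 0.02
```

The numbers make the ambiguity concrete: nothing about the system changed between the two calculations, only the assumption about what the user actually needed.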

Lexis Advance Launch, Feedback

Robert Ambrogi @ the Law Sites blog has posted some good information about the new Lexis Advance platform, which was formally launched yesterday. Meanwhile, Greg Lambert @ 3 Geeks and a Law Blog posted about some of what he thinks Lexis is doing right, along with an explanation of how he thinks Lexis Advance represents the integration of the company’s legal platform with their High-Performance Cluster Computing (HPCC) analytics tool.

A lot of the discussion about the new Lexis platform seems focused on how the company plans to transition its users, but I am more interested in how popular some of the new user interface features, such as visual Shepards or the new search history, will be with users. Lexis does intend to charge users extra for the visual Shepards feature, but it appears it will be free for current customers during the transition. While I think there are many interesting new ways to visually present legal data to users, I don’t agree with the idea that an improved visual display is something for which you should charge extra. If a user pays for access to the Shepard’s citator data, different presentations of it should not cost extra. Just think of the ridiculous scenarios that would result if you applied this logic to print — is readable typography or layout a feature?

Visual Shepards Display

new search history display