“Search Literacy” is the term I use for a person’s ability to use search technology and to interpret queries and results. Search technology has become so integrated into our everyday lives that being search literate should be considered a basic skill. A recent post by Christopher Danzig at Above the Law touched on this concept, which has also come up in a few of my recent posts. Danzig’s post highlights a recent decision by a federal district judge that allowed a large pharmaceutical company to avoid the enormous costs it would otherwise have faced because of its own lack of search literacy. According to Danzig:
During discovery, I-Med agreed to a forensic keyword search of its computer networks, servers, and storage devices. I-Med made the mistake of not limiting the search to active files or particular time periods. The company also allowed search of “unallocated space,” where deleted and temporary files are kept.
The uber-broad search criteria turned out to be a problem, unsurprisingly. Just in the unallocated space alone, the term generated more than 64 million hits, which represented somewhere around 95 million pages of data. Yeesh. I-Med realized they shouldn’t have agreed to the search conditions, because conducting privileged review on that scale would cost so much money and time. The company asked the magistrate judge in the case for relief from the stipulation.
The company was allowed to avoid the consequences of its lack of search literacy, but the debacle obviously highlights the risk of a lawyer not being search literate. I don’t know whether the lawyers responsible were held accountable in some way, but I would have liked to see the company face some consequences, to give others an incentive not to make such obvious mistakes. If a lawyer cost a client money because they failed to understand the accounting involved in a settlement agreement, I doubt a judge would be so understanding. The judge in the case, however, sounded more search literate than the lawyers for either party: his decision includes a footnote complaining about the way the parties made their arguments. Specifically, the judge noted that
“[I]t is troubling that the parties refer to the number of raw hits as though each represented a separate document. Given the volume of hits and search terms used … it stands to reason that at least some files mentioning product lines would make reference to more than one at the same time. Consequently, the [c]ourt is left to wonder whether the total hit and estimated page numbers are genuinely correct.”
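To make the judge's point concrete, here is a rough Python sketch of the difference between raw hits and documents containing hits. The file names, text, and search terms are invented for illustration and have nothing to do with the actual case; the matching rule (whole word, case-insensitive) is also just an assumption.

```python
import re

# Hypothetical mini-corpus and search terms, invented purely for illustration.
documents = {
    "memo_1.txt": "Alpha pricing and Beta pricing were reviewed. Alpha shipped.",
    "memo_2.txt": "Gamma launch delayed.",
    "memo_3.txt": "Lunch menu for Friday.",
}
search_terms = ["alpha", "beta", "gamma"]


def count_hits(docs, terms):
    """Return (raw_hits, documents_with_hits) using whole-word,
    case-insensitive matching (an assumed matching rule)."""
    raw_hits = 0
    matching_docs = set()
    for name, text in docs.items():
        for term in terms:
            pattern = rf"\b{re.escape(term)}\b"
            n = len(re.findall(pattern, text, flags=re.IGNORECASE))
            raw_hits += n
            if n:
                matching_docs.add(name)
    return raw_hits, len(matching_docs)


raw_hits, doc_count = count_hits(documents, search_terms)
print(f"{raw_hits} raw hits across {doc_count} documents")
# prints "4 raw hits across 2 documents"
```

Even in this tiny example the raw hit count is double the number of documents that would need review, because one memo mentions two product names and repeats one of them.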
This incident and the issues of search literacy raised in my previous posts point to two types of people with search literacy problems. The first type tends to be younger and may be overly reliant on search technology and, therefore, lack some other research skills, such as using proper judgment in assessing the source of information. The second type (likely including the lawyers in the example above) tends to be older and may suffer from a lack of familiarity with search technology. I previously mentioned ways I think research platforms and other information retrieval systems can aid the first group, but features that specifically aid the second group are somewhat tougher to envision. It seems entirely likely that the lawyers who actually signed off on the above e-discovery stipulation were senior lawyers who tend to delegate legal research and, therefore, have even more limited exposure to search technology than the average lawyer.
As I am reading Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze (Cambridge University Press, 2009), I have some issues with how the authors have discussed precision. Admittedly, I have not yet covered the Evaluation in Information Retrieval chapter, where I expect more detailed discussion of the concept of precision. However, the authors have defined precision as the fraction of the returned results that are relevant to the information need.
My initial reaction to this definition was that precision cannot really be measured if results are compared to the user's abstract information need rather than to the more concrete search query used to express that need. In my mind these things can be quite different, and I think the distinction is important. Sure enough, the issue came up again further in the book, when the authors attempted to illustrate the ineffectiveness of lemmatizers (tools that reduce words to their base forms). The authors claim that a harm to precision occurs because
“a sentence with the words operate and system is not a good match for the query operating and system.”
First, you will notice that the authors are not comparing the result to the information need; they are comparing the result to the query, which violates their own definition. Further, the example only works on the assumption that a query for operating and system represents an information need for results mentioning computer operating systems. I am not comfortable basing an assessment of a system’s precision on such assumptions. I would challenge the authors to distinguish between a user who is seeking information on computer operating systems (and who queries operating and system) and a user who is seeking information on robotic systems used in operating rooms (and who might equally use the query operating and system). I think a sentence with operate and system is a good match in the second scenario.
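A small Python sketch of the two scenarios may help. It uses a toy lookup table in place of a real lemmatizer, and the example sentences are my own inventions, so treat every name here as hypothetical.

```python
import re

# Toy lookup table standing in for a real lemmatizer (an assumption for
# this sketch; a real lemmatizer is far more thorough).
TOY_LEMMAS = {"operating": "operate", "operates": "operate", "systems": "system"}


def normalize(text, lemmatize):
    """Lowercase, split into alphabetic tokens, and optionally map each
    token to its base form via the toy table."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if lemmatize:
        tokens = [TOY_LEMMAS.get(t, t) for t in tokens]
    return set(tokens)


def matches_all(query_terms, sentence, lemmatize):
    """Boolean AND match: every query term must appear in the sentence."""
    wanted = normalize(" ".join(query_terms), lemmatize)
    return wanted <= normalize(sentence, lemmatize)


query = ["operating", "system"]
sentences = [
    "Install the operating system before use.",                   # computing
    "Surgeons operate the robotic system during the procedure.",  # surgery
]

for lemmatize in (False, True):
    results = [matches_all(query, s, lemmatize) for s in sentences]
    print(f"lemmatize={lemmatize}: {results}")
# lemmatize=False: [True, False]
# lemmatize=True: [True, True]
```

With lemmatization on, the surgery sentence is returned as well. Whether that counts as lost precision depends entirely on which of the two users typed the query, and the code has no way of knowing that.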
I frequently see these assumptions in the context of legal data as well. The most common examples are scenarios in which a user searches for a legal concept that is relevant at the state and federal level and the corpus includes one document addressing each jurisdiction. Which documents are relevant? Which documents should be ranked first? What does a high-precision result list look like? The truth is that the query is ambiguous, and it's hard to measure precision unless you either make potentially incorrect assumptions about the user’s actual information need (assume federal, etc.) or measure precision against the actual query used, in which case returning all 51 documents is 100% precision.
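The arithmetic can be sketched in a few lines of Python. I am assuming the 51 documents are one federal document plus one per state, and the document labels are made up; the function simply applies the definition of precision as the fraction of returned results that are relevant, under two different ideas of what "relevant" means.

```python
def precision(returned, relevant):
    """Fraction of returned documents that are in the relevant set.
    Returns 0.0 for an empty result list (a convention chosen for this
    sketch; precision is strictly undefined in that case)."""
    if not returned:
        return 0.0
    return sum(1 for doc in returned if doc in relevant) / len(returned)


# Hypothetical labels: one federal document plus one per state (assumed).
jurisdictions = ["federal"] + [f"state_{i:02d}" for i in range(1, 51)]
returned = jurisdictions  # the system returns all 51 documents

# Judgment 1: relevant means "matches the query as written".
relevant_by_query = set(jurisdictions)
# Judgment 2: relevant means "federal", an assumed information need.
relevant_if_federal = {"federal"}

print(f"{precision(returned, relevant_by_query):.3f}")    # prints 1.000
print(f"{precision(returned, relevant_if_federal):.3f}")  # prints 0.020
```

The same result list scores perfectly or very poorly depending only on which relevance judgment the evaluator picks, which is the ambiguity I am complaining about.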