Conference on Internet Privacy, Social Networks and Data Aggregation (Mar. 23, 2012)

I recently attended the Conference on Internet Privacy, Social Networks, and Data Aggregation, which was held at my old law school, Illinois Institute of Technology (IIT), Chicago-Kent College of Law. The conference was hosted by the Center for Information, Society, and Policy on Friday, March 23, 2012. There were a number of interesting speakers, some of whom I have listed below (see the complete conference agenda), along with some of my thoughts on the issues they covered.


Posted in Events

Data Science Chicago Meetup (Mar. 22, 2012)

Today, I attended a presentation about government data hosted by Data Science Chicago, a Chicago-based meetup group. The presentation was interesting both because of the personal background of the speaker, Brett Goldstein, and because of the number of interesting projects discussed that use open government data. Before joining the Chicago Police Department, Mr. Goldstein was the IT director for OpenTable.

During his presentation, he explained how his role as a police officer led him to found the Chicago Police Department’s Predictive Analytics Group, an effort to use patterns in incident-level crime data to predict future incidents of crime. According to Mr. Goldstein, the group’s predictions were able to focus police patrols on 1-2% of the city (down to the census block level) in which murders or other violent crimes were likely to occur.

Mr. Goldstein is now the City of Chicago’s Chief Data Officer and, at the event, he talked about the city’s effort to make government data open to the public and about a number of projects using that data. According to Mr. Goldstein, the city’s data portal has already released incident-level crime data going back 10 years, the biggest such collection of open data in the world. His more recent efforts have focused on using MongoDB for spatially focused time series data and on the release of the city’s 311 data. The speaker also touched on a number of related topics, including the use of regression analysis, the treatment effect, the need for geographical boundaries more useful than census blocks, and advice for aspiring data scientists on the skills needed to be effective.

Overall, it was an interesting presentation, and it made me want to take a closer look at the data sets available through these government portals. While I was already familiar with data.gov for federal-level data, I was surprised to find so much data available at the city, county, and state levels.

Posted in Events

Natural Language Processing Course

Back in December, I signed up for a free online class on natural language processing, taught by two Stanford University professors, Dan Jurafsky and Chris Manning, and offered through Coursera. While it was originally set to start in January, it was delayed and will now begin March 12, 2012 (registration is still open, I believe). I’ll likely post about some of the topics covered in the class, especially how they may relate to applications using legal content.

Posted in Things of Interest

SOPA, PIPA, and the Legal Content Industry

In honor of today’s web strike against two bills pending in the U.S. Congress, namely the Stop Online Piracy Act (“SOPA”) in the House and the Protect IP Act (“PIPA”) in the Senate, I thought I would write a little about how those bills would endanger the legal content provider market in particular. The uninformed would be forgiven for thinking these bills would not impact the publishing of legal information, because much of that information comes from the government and is not copyrighted, but they would be very wrong.

First, there are numerous situations in which the public domain status of government documents is not clear. Examples include Arkansas’ publishing its laws electronically only through a commercial publisher, claims of copyright made by creators of model laws, as well as the copyrightability of foreign laws, pagination, digitization, copyrighted material republished by the government (e.g., in an appendix to a decision), documents produced by government contractors, and many similar issues. One of the major problems with bills like SOPA and PIPA is that they assume copyright is obvious. Someone may claim a copyright over something, but that claim doesn’t resolve the many open questions and doesn’t even begin to deal with the situation in which an accused infringer claims fair use. These problems have always plagued copyright law, but they become devastating when combined with remedies that do not allow due process or that impose liability for linking to infringing material. In short, even information most people assume to be safe from infringement claims (state codes, bills, court opinions, etc.) would not be safe.

Posted in Commentary, The Legal Data Market

Paper Addresses the Evolution of Legal Language, Legal Memetics

I finally got around to reading Legal N-Grams? A Simple Approach to Track the ‘Evolution’ of Legal Language, which I mentioned in an earlier post about the launch of LegalLanguageExplorer.com, and I think the authors raise a number of interesting issues. The focus of the research and the related Legal Language Explorer tool seems to be the practical application of legal memetics, i.e., the study of the evolution and adoption of language and concepts in the legal culture. The authors cite, as an example, the development of Justice Holmes’ famous phrase “Clear and Present Danger” from Schenck v. United States, 249 U.S. 47 (1919).
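To make the idea of tracking a phrase over time concrete, here is a minimal Python sketch that counts how often an n-gram appears per year. It is only an illustration of the general technique, not the authors’ method or the way the Legal Language Explorer is built. The sample records are made-up snippets (not quotations from any opinion), and the function names are my own.

```python
from collections import defaultdict

# Hypothetical (year, text) records; illustrative snippets, not real opinion text.
opinions = [
    (1919, "a clear and present danger that they will bring about"),
    (1925, "no clear and present danger was shown"),
    (1925, "the danger was neither clear nor present"),
]


def ngrams(text, n):
    """Split text on whitespace and return every run of n consecutive words."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def phrase_counts_by_year(records, phrase):
    """Count exact occurrences of `phrase` (as an n-gram) in each year."""
    n = len(phrase.split())
    target = phrase.lower()
    counts = defaultdict(int)
    for year, text in records:
        counts[year] += ngrams(text, n).count(target)
    return dict(counts)


counts = phrase_counts_by_year(opinions, "clear and present danger")
print(counts)
```

A real system would need proper tokenization (punctuation, citations, hyphenation) and would normalize counts by the volume of text in each year, but the core bookkeeping is about this simple.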

The Legal Language Explorer appears to be the authors’ first step toward a more comprehensive system for the empirical study of legal memetics. The most interesting part of the paper was the authors’ speculation on a number of possible future directions in which this research could proceed. It seems to me there are two items that should be at the top of the authors’ list (one easy, one not-so-much): including data on judicial authorship and including data from secondary law sources.

Authorship: Judicial authorship is something the authors specifically mention is available through the Supreme Court Database (at least for the period after the start of the Vinson Court). Integrating data on which judge authored particular opinions would more accurately relate legal concepts and language to the person actually using them. By the authors’ own description, the evolution of legal terminology and concepts is tied to jurists, not to the issuing body behind the document.

Secondary sources: I see one of the major problems of legal memetics as the fact that much of the cultural development of particular concepts and terms occurs in materials protected by intellectual property rights. Law journals, law reviews, treatises, and model laws are just a few examples. Next to court opinions, these sources are the places where legal memes are born or evolve. For example, how can you study the development of concepts and phrases related to product liability without tracking usage in the Restatement (Second) of Torts (a model law published by a private company)? Obviously, the authors may not be able to offer a solution to this problem, but I would have liked the paper to acknowledge the exclusion of such sources and the impact of using only court opinions to study legal memetics.

Posted in Commentary

The Competency of the Legal Information Industry at the “Entity Web”

Matthew Hurst (@ Data Mining) recently posted about the concept of the “entity web” to describe companies involved in web-based information retrieval that are evolving into more than search engines for retrieving textual documents. Hurst speculates about the corporate skill set that will be needed to deliver on this concept, which he terms the three competencies: understanding (1) the Web (e.g., HTML, CSS, AJAX, and other web technologies); (2) the world (i.e., the real world relationships between data points, such as that a song has an artist); and (3) Web presence (e.g., how entities appear and interact on the web). Of course, competencies (2) and (3) include the ability to record and use this knowledge in some structured model. I characterize what Hurst is discussing as integrating semantic data into existing textual search services. I also think the term “entity” is a bit limited because it implies the data is focused only on the actors (people, organizations, websites, document sources, etc.) when the user’s information needs may not be focused on entities at all (e.g., asking a system how photosynthesis functions or the answer to 1 + 1). Whatever you label it, Hurst is right about the direction in which we seem to be headed, and when you think about how the traditional legal information industry measures up on these competencies, things do not look very good.

the Web.— Hurst comments that this is an area that the broad market players (Google, Facebook, etc.) have largely mastered (but have room to improve). On the legal side, I would say that large legal publishers have suffered from many of the same problems as other older companies when it comes to embracing web technologies. Namely, they tend to lag too far behind in adopting the newest web technologies. They also have a hard time building institutional knowledge in this area because they often outsource this type of work to vendors and let some departments have too much influence (e.g., marketing and communication, public relations). Overall, I would say the legal information industry is obviously not as competent as the big tech companies in this area, but it generally does well with deploying established web technologies and is on par with other older companies when it comes to adopting the newest technologies.

the World.— This is probably the area in which the traditional legal information industry is the most competent, but even here I think there are many reasons to worry about the future. There is a high degree of competency in this area because traditional legal publishers have spent a long time developing institutional knowledge related to all the intricacies of government data and distribution. Other than perhaps law libraries, there are very few places that foster this kind of knowledge. I think this institutional knowledge is, however, at risk because many legal publishers have increasingly outsourced or automated the very functions that gave rise to this knowledge-building.

Web Presence.— This is probably the area in which traditional legal publishers are the weakest. In the legal field, a complete understanding of web presence would involve how all the various actors interact on the web (e.g., legislators, courts, state and federal agencies, lawyers, etc.). Although traditional legal publishers are most familiar with official entities involved in issuing documents (legislatures, courts, etc.), they are much less familiar with entities that discuss or debate the legal content (blogs/blawgs, social networking sites, law firms, political and legal discussions by nonprofessionals, etc.). A future entity web information retrieval system might need to track these sources to know that ‘Obamacare’ refers to the Affordable Care Act or that, while a particular judge has not ruled on an issue, his wife belongs to a group on Facebook against the issue.

Posted in The Legal Data Market

The High Costs of Search Illiteracy for Lawyers

“Search Literacy” is the term I have decided to use to refer to a person’s ability to use search technology and interpret queries and results. Search technology has become so integrated into our everyday lives that being search literate should be considered a basic skill. A recent post by Christopher Danzig @ Above the Law touched on this concept, which has also come up in a few of my recent posts. Danzig’s post highlights a recent decision by a federal district judge that allowed a large pharmaceutical company to avoid the enormous costs it would have faced for its own lack of search literacy. According to Danzig:

During discovery, I-Med agreed to a forensic keyword search of its computer networks, servers, and storage devices. I-Med made the mistake of not limiting the search to active files or particular time periods. The company also allowed search of “unallocated space,” where deleted and temporary files are kept.

The uber-broad search criteria turned out to be a problem, unsurprisingly. Just in the unallocated space alone, the term generated more than 64 million hits, which represented somewhere around 95 million pages of data. Yeesh. I-Med realized they shouldn’t have agreed to the search conditions, because conducting privileged review on that scale would cost so much money and time. The company asked the magistrate judge in the case for relief from the stipulation.

The company was allowed to avoid the consequences of its lack of search literacy, but the debacle obviously highlights the risk of a lawyer not being search literate. I don’t know if the lawyers responsible were held accountable in some way, but I would have liked to see the company face some consequences to create an incentive for others not to make such obvious mistakes. If a lawyer cost a client money because they failed to understand the accounting involved in a settlement agreement, I doubt a judge would be so understanding. The judge in the case, however, sounded more search literate than the lawyers for either party: his decision included a footnote complaining about the way the parties made their arguments. Specifically, the judge noted that

“[I]t is troubling that the parties refer to the number of raw hits as though each represented a separate document. Given the volume of hits and search terms used … it stands to reason that at least some files mentioning product lines would make reference to more than one at the same time. Consequently, the [c]ourt is left to wonder whether the total hit and estimated page numbers are genuinely correct.”
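The judge’s point, that raw hits are not the same thing as documents, is easy to show with a minimal Python sketch. Everything in it is hypothetical: the file names, the text, and the search terms are made up for illustration and have nothing to do with the actual case or the forensic tools used in it.

```python
import re

# Hypothetical mini-corpus (file name -> text); not data from the case.
documents = {
    "memo_a.txt": "The Alpha line and the Beta line ship together. Alpha is delayed.",
    "memo_b.txt": "Beta pricing is attached.",
    "memo_c.txt": "Lunch is at noon.",
}

# Hypothetical search terms standing in for product-line names.
search_terms = ["alpha", "beta"]


def count_hits(docs, terms):
    """Return (raw_hits, distinct_files) for whole-word, case-insensitive terms."""
    raw_hits = 0
    files_with_hits = set()
    for name, text in docs.items():
        for term in terms:
            pattern = rf"\b{re.escape(term)}\b"
            matches = re.findall(pattern, text, flags=re.IGNORECASE)
            raw_hits += len(matches)      # every occurrence counts as a hit
            if matches:
                files_with_hits.add(name)  # a file is counted only once
    return raw_hits, len(files_with_hits)


raw_hits, distinct_files = count_hits(documents, search_terms)
print(f"raw hits: {raw_hits}, distinct files: {distinct_files}")
```

In this toy corpus the two terms produce four raw hits but only two files that would actually need review, because one memo mentions both product lines and mentions one of them twice. Treating each hit as a separate document would overstate the review burden.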

This incident and the issues of search literacy raised in my previous posts seem to focus on two types of people with search literacy problems. The first group consists of those who tend to be younger and might be overly reliant on search technology and who, therefore, lack some other research skills, such as using proper judgment in assessing the source of information. The second group (likely those in the example above) consists of those who tend to be older and might suffer from a lack of familiarity with search technology. I previously mentioned ways I think research platforms and other information retrieval systems can aid the first group, but features that specifically aid the second group are somewhat tougher to envision. It seems entirely likely that the lawyers who actually signed off on the above e-discovery stipulation were senior lawyers who tend to delegate legal research and, therefore, have even more limited exposure to search technology than the average lawyer.

Posted in Commentary, Things of Interest