Text Mining Revolutionizes Academic Research

The benefits of big data analytics extend well beyond their use by businesses and governments. As the following examples illustrate, the form of data analytics known as text mining has become an essential part of how the scholarly and scientific community conducts its research.

A well-known research finding, cited in the recent HathiTrust decision, illustrates the benefits of text mining in literary and historical research. By comparing how often authors used “is” rather than “are” to refer to the United States, researchers were able to conclude that it was only in the second half of the 19th century that we began to think of our nation as a single, indivisible entity.
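The kind of comparison described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the method the researchers used: the sample sentences, the 1850 dividing year, and the helper names verb_counts_by_period and share_singular are all assumptions made up for this example.

```python
import re
from collections import Counter

# Hypothetical mini-corpus of (publication year, text) pairs.
# These sentences are invented for illustration; they are not real data.
SAMPLE_CORPUS = [
    (1820, "The United States are bound by treaty. The United States are at peace."),
    (1845, "The United States are resolved; yet the United States is growing."),
    (1880, "The United States is a great nation. The United States is prosperous."),
    (1895, "The United States is one country, though some say the United States are many."),
]

# Match the verb that immediately follows "United States".
PATTERN = re.compile(r"\bUnited States (is|are)\b", re.IGNORECASE)


def verb_counts_by_period(corpus, split_year=1850):
    """Count 'is' and 'are' after 'United States', before and after split_year."""
    counts = {"before": Counter(), "after": Counter()}
    for year, text in corpus:
        period = "before" if year < split_year else "after"
        for match in PATTERN.finditer(text):
            counts[period][match.group(1).lower()] += 1
    return counts


def share_singular(counter):
    """Fraction of matches that used the singular verb 'is'."""
    total = counter["is"] + counter["are"]
    return counter["is"] / total if total else 0.0


counts = verb_counts_by_period(SAMPLE_CORPUS)
for period in ("before", "after"):
    print(period, dict(counts[period]), round(share_singular(counts[period]), 2))
```

A real study would run the same count over millions of digitized pages and would need to handle variant phrasings, which is exactly why access to large bodies of text matters.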

A recent New York Times piece highlighted further examples of how big data analytics can be used to ferret out hidden patterns in literary works. One study by Matthew Jockers found that Jane Austen and Sir Walter Scott had the greatest effect on other 19th century authors in terms of writing style and themes. This conclusion was based on an analysis of 3,592 works published from 1780 to 1900. Professor Jockers also identified the dominant themes in The Last of the Mohicans and Moby-Dick, and compared them with themes in all 10,000 novels published in the 19th century. He documents this fascinating style of literary detective work in his forthcoming book Macroanalysis: Methods for Digital Literary History.

In principle, Professor Jockers and other researchers could have read all those works and used the subtle skills of traditional literary criticism to detect the commonalities among the authors. But the volume of text is simply too large for these traditional skills. As a practical matter, this kind of analysis would never happen without text mining.

Text mining differs from data mining in that it works with unstructured data. Data mining can uncover interesting patterns in databases where information is uniformly formatted. It can be used, for example, to discover fraud patterns in credit card data, or to detect which purchases typically go together (a flashlight and batteries, for example). Text mining works with unstructured natural language text (which comprises about 80% of the data on the Internet) and extracts useful information and insights that can be used for a wide variety of purposes in business, government and university research. A further example of this technique in the research context is the text mining of scientific journals, which has allowed scientists to hypothesize causes of rare diseases by looking for indirect links in different subsets of the bioscience literature.
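The idea of looking for indirect links can be illustrated with a toy example. The sketch below assumes each journal abstract has already been reduced to a set of key terms, and it uses placeholder terms such as disease_x and compound_a rather than real medical findings. The function names build_cooccurrence and indirect_links are hypothetical, and this is one simple way to approach the problem, not a description of any particular research group's method.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical abstracts, each reduced to a set of key terms.
# All terms are placeholders; nothing here is a real medical claim.
ABSTRACTS = [
    {"disease_x", "symptom_b"},
    {"disease_x", "symptom_c"},
    {"symptom_b", "compound_a"},
    {"symptom_c", "compound_a"},
    {"symptom_b", "compound_d"},
    {"disease_y", "compound_d"},
]


def build_cooccurrence(abstracts):
    """Map each term to the set of terms it appears with in some abstract."""
    neighbors = defaultdict(set)
    for terms in abstracts:
        for left, right in combinations(sorted(terms), 2):
            neighbors[left].add(right)
            neighbors[right].add(left)
    return neighbors


def indirect_links(neighbors, start):
    """Find terms never mentioned alongside `start` but linked via a shared term.

    Returns (candidate, bridging_terms) pairs, most bridging terms first.
    """
    direct = neighbors[start]
    bridges = defaultdict(set)
    for middle in direct:
        for candidate in neighbors[middle]:
            if candidate != start and candidate not in direct:
                bridges[candidate].add(middle)
    return sorted(bridges.items(), key=lambda item: (-len(item[1]), item[0]))


links = indirect_links(build_cooccurrence(ABSTRACTS), "disease_x")
for candidate, via in links:
    print(candidate, "via", sorted(via))
```

In this toy data no single abstract mentions both disease_x and compound_a, yet two separate bridging terms connect them. That is the kind of lead a researcher might then investigate by reading the underlying papers.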

What are the implications for public policy? One question is whether companies and researchers are getting the access to text that they need for legitimate analytical purposes. Are there roadblocks that need to be overcome? Some have suggested exceptions to copyright law in some cases to enable text and data mining. There was a hint of this idea in the European Commission’s recent announcement of its copyright reform initiative.

Marketplace participants (researchers, publishers, data aggregators and analytics companies) are well positioned to work out satisfactory arrangements to ensure the flow of text to important analytical uses. In principle, these voluntary arrangements should satisfy all parties and bring the discipline of market mechanisms to bear in making sure that text is put to its best uses. And the marketplace is well on its way toward allowing parties to reach such arrangements. According to the U.K. Publishers Association, for example, over 90% of publishers already grant mining requests based on research across academic and professional publications, and a third already allow any kind of mining of their content without restrictions.

Governments should not override these voluntary market mechanisms that seem to be working to provide the access to text information needed by researchers and other organizations.


Mark MacCarthy, Vice President, Public Policy at SIIA, directs SIIA’s public policy initiatives in the areas of intellectual property enforcement, information privacy, cybersecurity, cloud computing and the promotion of educational technology. Follow the SIIA Public Policy team on Twitter at @SIIAPolicy