Hard and tricky work: information mining
Jinfo Blog

17th April 2012

By Joanna Ptolomey

Abstract

Data and text mining techniques are used for analysing competitor behaviours and workflows as well as customer and market needs. These methods could also be applied to open access content; however, the full potential is not being realised because the available data is limited, there is no established user community, and licensing presents issues.

Item

A few weeks back on the LiveWire I pondered the possible copyright changes to UK law that would allow text and data mining for non-commercial purposes. With the huge amounts of new information and data that are generated on a daily basis, increasing at a rate of almost 40% per annum according to JISC, what is being done to dig out these nuggets?

Currently, data and text mining techniques are used for tasks such as analysing competitor behaviours and workflows, as well as customer and market needs. For example, the pharmaceutical industry captures patent information and research evidence to develop and improve pipeline workplans.

New knowledge areas can also be discovered by analysing large existing datasets. Open access repositories could be interrogated in the same way, for example PubMed Central (PMC), the world's largest repository of full text open access biomedical journals, with around 2.4 million articles that can be freely downloaded.
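To give a flavour of what "mining" downloaded full text means at its simplest, here is a minimal sketch: a term-frequency count over a sample passage. The passage and the stop-word list are invented for illustration and are not PMC data or any particular tool's method:

```python
import re
from collections import Counter

# Illustrative sample text only -- not a real PMC article.
abstract = (
    "Text mining of open access articles can reveal associations "
    "between genes and diseases. Mining full text, rather than "
    "abstracts alone, surfaces associations that abstracts omit."
)

# A tiny illustrative stop-word list; real pipelines use much larger ones.
STOP_WORDS = {"of", "can", "and", "that", "the", "than", "rather", "alone"}

def term_frequencies(text):
    """Lower-case the text, tokenise on letters, drop stop words, count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

freqs = term_frequencies(abstract)
print(freqs.most_common(3))
```

Real full-text mining pipelines add much more on top (sentence parsing, entity recognition, relation extraction), but the core re-use step is the same: turn freely downloadable text into countable, queryable data.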

Open access should not only concern itself with free access and download, but also with re-use. In a recent blog post Casey Bergman reported evidence of very few projects that had applied text and data mining tools to the entire set of open access articles from PMC. He makes a few general comments, which I paraphrase here, that are worth generalising beyond his immediate sphere.

  1. The open access set of PMC, although large in absolute numbers, is still a very small percentage of the overall published literature, and therefore not a large enough pool for research purposes. This reinforces the need for a move towards a more open and transparent research process, as I reported in my last post.
  2. Full text mining research is difficult and there is no established community of users. Are there yet enough useful tools in the marketplace, and do professionals have the skills to make best use of them?
  3. The English language of, for example, Medline abstracts is simple compared with DNA sequence data, which is much more challenging to deal with, though Casey does note that there is a growing community of biocurators and bioinformaticians. A product such as VantagePoint is an example of a text mining tool used with patent and literature databases.
  4. Are we overselling the open access movement if we are not re-using this information? The market could decide for itself and pull funding for open access.
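The biocuration point above (item 3) can be made concrete with a toy sketch of its simplest starting technique: dictionary-based entity matching, where a curated lexicon of names is matched against sentences from the literature. The gene lexicon and the sentence below are invented for illustration, not drawn from any real curation resource:

```python
import re

# Invented toy lexicon -- real biocuration draws on curated
# nomenclature resources maintained by the community.
GENE_LEXICON = {"BRCA1", "TP53", "EGFR"}

def find_gene_mentions(sentence):
    """Return gene symbols from the lexicon appearing as whole words."""
    found = []
    for token in re.findall(r"[A-Za-z0-9]+", sentence):
        if token.upper() in GENE_LEXICON:
            found.append(token.upper())
    return found

sentence = "Mutations in BRCA1 and TP53 were observed in the cohort."
print(find_gene_mentions(sentence))  # -> ['BRCA1', 'TP53']
```

Even this crude matching hints at why a community of users matters: the value comes less from the code than from the shared, curated lexicons and the skills to apply them at scale.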

A look at the higher education sector provides further evidence of why text mining is not being developed in a big way. The recent JISC report The Value and Benefits of Text Mining ultimately finds that the availability of material for text mining is limited, mainly due to access and costs (both transaction and entry costs). Licensing agreements also present issues.

With such widely reported benefits of text and data mining, it seems we are only scratching the surface of the possibilities. This is one trend I will be following closely.
