Text Mining INSPIRE Conference Contributions

Posted by Martin Seiler on Tue, Jun 17, 2014
In GI, Data Science, EN
Tags inspire text mining

The INSPIRE Conference 2014 (#inspire_eu2014) starts tomorrow, after two days of intensive workshops. For me this poses the challenge of deciding which of the parallel sessions to attend. As I have been experimenting with R lately, I decided to use some text mining techniques, instead of reading through all the abstracts, to get an idea of hot topics, trends and potentially interesting sessions.

Here are some of my ‘results’. More on the methodology below.

To get a first impression, I look at terms that appear frequently (15+) in the contributions’ titles:

[Image not available: frequent terms in the titles]

And the same for terms in the abstracts (150+):

[Image not available: frequent terms in the abstracts]
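For readers who want to see what such a frequency threshold amounts to, here is a minimal sketch in Python (standard library only). It is not the R code behind the charts above. The function name `frequent_terms` and the toy titles are made up for illustration, and no stemming or stopword removal is applied here.

```python
import re
from collections import Counter

def frequent_terms(texts, min_count):
    """Return terms whose total count across all texts is >= min_count."""
    counts = Counter()
    for text in texts:
        # Lower-case and keep alphabetic tokens only.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return {term: n for term, n in counts.items() if n >= min_count}

# Hypothetical toy titles, not the real conference data.
titles = [
    "INSPIRE data services",
    "Metadata for INSPIRE services",
    "Open data and INSPIRE",
]

print(frequent_terms(titles, min_count=2))
```

With the real titles the threshold would be 15 instead of 2. In the tm package, `findFreqTerms` plays a comparable role on a term-document matrix.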

Also from the abstracts, a nicer-looking word cloud (100+):

[Image not available: word cloud of terms in the abstracts]

Now I’d like to identify contributions that deal with topics of interest (e.g. “benefits” (2+), “health” (1+) or “metadata” (5+)):

[Images not available: contributions matching “benefits”, “health” and “metadata”]
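A lookup of this kind can be sketched as follows, reading the thresholds as minimum occurrence counts per abstract (an assumption on my part). This is illustrative Python, not the R code used for the post; the contribution IDs and texts are invented, and terms are matched literally without stemming.

```python
import re

def documents_with_term(abstracts, term, min_count=1):
    """Map contribution ID -> count for abstracts where `term` occurs >= min_count times."""
    hits = {}
    for doc_id, text in abstracts.items():
        n = re.findall(r"[a-z]+", text.lower()).count(term)
        if n >= min_count:
            hits[doc_id] = n
    return hits

# Hypothetical IDs and texts, for illustration only.
abstracts = {
    "101": "Metadata editors and metadata validation.",
    "102": "Health data and environmental benefits.",
    "103": "Benefits of metadata; benefits for users.",
}

print(documents_with_term(abstracts, "benefits", min_count=2))
print(documents_with_term(abstracts, "health", min_count=1))
```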

Taking the ‘contribution ID’ (just the number), I can access the full abstract:

http://inspire.ec.europa.eu/events/conferences/inspire_2014/schedule/submissions/****.html

Besides that, the tm package offers a lot of functionality to analyse the datasets further. For example, I can identify terms that are correlated with a specific term. The terms that are correlated (0.5+) with “wfs”, considering all abstracts, are:

[Image not available: terms correlated with “wfs”]
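To make the idea of term correlation concrete, here is a small Python sketch that computes the Pearson correlation between per-document term counts and keeps the terms above a threshold. It only illustrates the principle and is not the tm implementation; the helper names and the four toy documents are assumptions.

```python
import math
import re

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def pearson(x, y):
    """Pearson correlation of two equal-length count vectors (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def associated_terms(docs, target, min_corr):
    """Return terms whose per-document counts correlate with `target` at >= min_corr."""
    tokens = [tokenize(d) for d in docs]
    vocab = {w for doc in tokens for w in doc}
    target_counts = [doc.count(target) for doc in tokens]
    result = {}
    for term in vocab - {target}:
        r = pearson(target_counts, [doc.count(term) for doc in tokens])
        if r >= min_corr:
            result[term] = round(r, 2)
    return result

# Hypothetical toy documents, not the conference abstracts.
docs = [
    "wfs download service",
    "wfs download",
    "metadata catalogue",
    "metadata service",
]

print(associated_terms(docs, "wfs", min_corr=0.5))
```

In this toy corpus only “download” co-occurs with “wfs” consistently enough to pass the 0.5 threshold. In tm, `findAssocs` provides this kind of query on a term-document matrix.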

So a few words on what I did.

Getting ready:

  • download the abstracts (wget)
  • remove headlines, HTML, blank lines and line breaks (sed, tr)
  • extract abstracts and titles (sed)
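The clean-up step above was done with sed and tr. As a rough equivalent, the sketch below strips tags, decodes HTML entities and collapses line breaks and blank lines in Python. The function name `html_to_text` and the sample page are hypothetical; the download itself and the extraction of titles and abstracts are left out because they depend on the real page layout, which is not shown here.

```python
import html
import re

def html_to_text(page):
    """Strip tags, decode entities, and collapse line breaks and blank lines."""
    no_tags = re.sub(r"<[^>]+>", " ", page)    # drop anything that looks like a tag
    decoded = html.unescape(no_tags)           # e.g. &amp; -> &
    return re.sub(r"\s+", " ", decoded).strip()

# Hypothetical page fragment, not an actual submission page.
page = """<html><body>
<h1>Title: Sample &amp; Demo</h1>

<p>First line
second line.</p>
</body></html>"""

print(html_to_text(page))
```

A regular expression is a blunt tool for HTML; for anything beyond simple pages a proper HTML parser would be the safer choice.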

r-project/tm

  • convert all characters to lower-case
  • remove numbers, punctuation and extra whitespace
  • remove URLs
  • remove stopwords
  • apply word stemming
  • apply stem completion
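The preprocessing steps listed above can be sketched in plain Python as follows. This is an illustration, not the tm code used for the post. The stopword list is a tiny made-up sample, `toy_stem` is a deliberately naive stand-in for a real stemmer, and the completion rule (pick the most frequent original word that starts with the stem) is one possible strategy among several. URLs are removed before punctuation in this sketch, since they would be hard to recognise once the punctuation is gone.

```python
import re
import string
from collections import Counter

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "of", "and", "for", "a", "in", "to", "is", "at"}

def clean_tokens(text):
    """Lower-case, drop URLs, numbers, punctuation, extra whitespace and stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # URLs first, while they are still intact
    text = re.sub(r"\d+", " ", text)            # numbers
    punct = string.punctuation
    text = text.translate(str.maketrans(punct, " " * len(punct)))
    return [t for t in text.split() if t not in STOPWORDS]

def toy_stem(word):
    """Naive stemmer: strip one trailing 's'. A stand-in for e.g. the Porter stemmer."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def complete_stems(stems, original_words):
    """Replace each stem by the most frequent original word starting with it."""
    freq = Counter(original_words)
    completed = []
    for stem in stems:
        candidates = [w for w in freq if w.startswith(stem)]
        completed.append(max(candidates, key=lambda w: freq[w]) if candidates else stem)
    return completed

# Hypothetical sample sentence.
text = "The datasets at http://example.org: 25 datasets, 1 dataset and 3 schemas."
tokens = clean_tokens(text)
stems = [toy_stem(t) for t in tokens]
completed = complete_stems(stems, tokens)

print(tokens)
print(stems)
print(completed)
```

Stemming merges “datasets” and “dataset” into one term for counting, and stem completion turns the stem back into a readable word for charts and word clouds.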

Now the set of documents is ready for the analyses.