So the INSPIRE Conference 2014 (#inspire_eu2014) starts tomorrow, after two days of intensive workshops. For me this poses the challenge of deciding which of the parallel sessions to attend. As I have been experimenting with the R framework lately, I decided to use some text mining techniques instead of reading through all the abstracts to get an idea of hot topics, trends and potentially interesting sessions.
Here are some of my ‘results’. More on the methodology below.
To get a first impression I take a look at terms that appear frequently (15+ times) in the contributions’ titles:
And the same for terms in the abstracts (150+ times):
Also from the abstracts, a nicer-looking word cloud (terms appearing 100+ times):
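The frequency threshold used above can be illustrated with a minimal sketch. It is plain Python rather than the tm code behind the figures, and the function name `frequent_terms` and the toy titles are hypothetical, chosen only for illustration.

```python
import re
from collections import Counter


def frequent_terms(texts, min_count):
    """Return the terms that occur at least min_count times across all texts."""
    counts = Counter()
    for text in texts:
        # Very simple tokenisation: lower-case runs of letters.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return sorted(term for term, n in counts.items() if n >= min_count)


# Hypothetical toy titles, not taken from the conference programme.
titles = [
    "INSPIRE data and metadata",
    "Metadata for INSPIRE services",
    "Data services in practice",
]
print(frequent_terms(titles, 2))
```

With real titles the threshold would be 15, and 150 for the abstracts, as in the lists above.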
Now I’d like to identify contributions that deal with topics of interest, e.g. “benefits” (mentioned 2+ times), “health” (1+) or “metadata” (5+):
Taking the ‘contribution ID’ (just the number), I can access the full abstract:
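As a rough sketch of this lookup, assuming the abstracts are held in a dictionary keyed by contribution ID: the code below is plain Python, not the tm code used for the post, and the IDs, texts and the name `contributions_with_term` are made up for illustration.

```python
import re


def contributions_with_term(abstracts, term, min_count=1):
    """Return the IDs of abstracts in which term occurs at least min_count times."""
    hits = []
    for contribution_id, text in abstracts.items():
        tokens = re.findall(r"[a-z]+", text.lower())
        if tokens.count(term) >= min_count:
            hits.append(contribution_id)
    return sorted(hits)


# Hypothetical abstracts keyed by a made-up contribution ID.
abstracts = {
    101: "Metadata quality matters. Good metadata improves discovery.",
    102: "Health data and the benefits of sharing it.",
    103: "Benefits for users; benefits for providers; some metadata too.",
}

ids = contributions_with_term(abstracts, "benefits", min_count=2)
print(ids)
# The ID then gives direct access to the full abstract.
print(abstracts[ids[0]])
```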
Besides that, the tm package offers a lot of functionality to analyse the datasets further. For example, I can identify terms that are correlated with a specific term. For instance, the terms that are correlated (0.5+) with “wfs” (considering all abstracts) are:
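To make the idea of term correlation concrete, here is a minimal sketch in plain Python. It is not the tm implementation; it assumes that correlation means the Pearson correlation between per-document counts of two terms, and the names `correlated_terms`, `pearson` and `term_counts` and the toy documents are hypothetical.

```python
import math
import re


def term_counts(texts, term):
    """Count how often term occurs in each text."""
    return [re.findall(r"[a-z]+", t.lower()).count(term) for t in texts]


def pearson(x, y):
    """Pearson correlation of two equally long count vectors (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def correlated_terms(texts, target, min_corr):
    """Return terms whose per-document counts correlate with target at min_corr or above."""
    vocabulary = {tok for t in texts for tok in re.findall(r"[a-z]+", t.lower())}
    base = term_counts(texts, target)
    result = {}
    for term in vocabulary - {target}:
        r = pearson(base, term_counts(texts, term))
        if r >= min_corr:
            result[term] = round(r, 2)
    return result


# Hypothetical toy documents, not real abstracts.
docs = [
    "wfs download service",
    "wfs download",
    "metadata catalogue",
    "metadata service",
]
print(correlated_terms(docs, "wfs", 0.5))
```

In this toy set only “download” clears the 0.5 threshold, because it appears in exactly the same documents as “wfs”.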
So a few words on what I did.
- download the abstracts (wget)
- remove headlines, html, blank lines, line breaks (sed, tr)
- extract abstracts, titles (sed)
- convert all characters to lower-case
- remove numbers, punctuation, whitespace
- remove URLs
- remove stopwords
- apply word stemming
- apply stem completion
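The cleaning steps in the second half of this list can be sketched roughly as follows. This is plain Python, not the tm transformations used for the post, and several parts are assumptions: the stopword list is a tiny illustrative one, `naive_stem` is a crude suffix stripper standing in for a real stemmer, URLs are removed before punctuation so that they are still recognisable, and stem completion is left out.

```python
import re
import string

# Tiny illustrative stopword list (an assumption, far shorter than a real one).
STOPWORDS = {"the", "a", "an", "and", "of", "for", "in", "is", "to", "on"}

# Translation table that turns every punctuation character into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def naive_stem(word):
    """Crude stand-in for a stemmer: strip a few common English suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lower-case, drop URLs, numbers, punctuation and stopwords, then stem."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs first
    text = re.sub(r"\d+", " ", text)                     # remove numbers
    text = text.translate(PUNCT_TO_SPACE)                # remove punctuation
    tokens = [t for t in text.split() if t not in STOPWORDS]  # split() also collapses whitespace
    return [naive_stem(t) for t in tokens]


print(preprocess("Sharing 25 INSPIRE services: see http://example.org/wfs for the details!"))
```

The stems this produces (“servic”, “shar”) are not readable words, which is the gap the stem completion step in the list is meant to close.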
Now the set of documents is ready for the analyses.