One of the problems with text and data mining is access to content. Do you feel that content was missing in your case?

Well, I looked for instance at the Times, and that was a commercial dataset. But I think there is still too little data available. Look at the National Library of the Netherlands: they only have 15% of the newspapers digitally available. And in general, the available content is heavy on American and English material. With German material you often have the problem that it is heavily protected by copyright. And now we’re only talking about newspapers; think about books, administrative records, handwritten documents… There are still hundreds of kilometers of archives that haven’t been digitized yet. Luckily, historians are used to the data not being there, so we can still work with it.

What tools did you use for your research? Could you use existing tools off the shelf?

During the research fellowship I learned to code myself, so I can understand what a developer is doing, and I started doing simple things myself. Digital humanities requires that you know what coding is; otherwise there are a lot of things you cannot do. So I learned Python, for example, which was very interesting and fun. It helps me to adapt existing software and to write code to do simple things. What I do is radical simplicity: I keep it simple, but in a way that can give new results. My colleague at the National Library also developed a tool that everyone can now use. And then there are some tools I came across, like stuff from GitHub. Open access software. And SPSS: what many people don’t know is that, next to the statistics module, it now also has a modeler that allows you to do text analysis, sentiment analysis for example. I did not find it very useful as a historian, but it was useful for cleaning data. One very complicated tool I used for the Translantis project is called Texcavator. It allows you to make word clouds and timelines and to construct databases.
Now we also have a model in ShiCo, which does very complicated things, like word embeddings. A lot of text mining has to do with the distribution of words: are words in each other’s proximity? If relationships between words occur in different texts, it means something. Say you have 10,000 texts, which is big data in the humanities. Is there a pattern of words occurring in each other’s proximity? ShiCo uses the principle of distributions in different ways, in three-dimensional space actually. It requires very advanced programming skills that I don’t have. But at least now I do understand how it works.

Why was it necessary to modify tools and develop new ones?

A lot of code has already been written, and a lot of money went into this. But lots of it cannot be used in a field like history. There is no one-size-fits-all solution. I think in general there are three categories. 1. There is a lot of software that isn’t useful, for instance because it’s too experimental. 2. There are very useful bits and pieces on GitHub that just need a bit of adjusting. 3. Sometimes software is too complicated, and I need a developer. But I do want to be able to understand how it works myself, at least the basics.
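
Editorial illustration: the question raised earlier in the interview, whether words occur in each other’s proximity across many texts, can be sketched in a few lines of Python. This is not how ShiCo works (the interview describes it as using word embeddings); it is a much simpler co-occurrence count. The function name `cooccurrence_counts`, the `window` parameter, and the two sample sentences are all assumptions made up for this sketch.

```python
import re
from collections import Counter


def cooccurrence_counts(texts, window=5):
    """Count how often two different words appear within `window` tokens of each other.

    Hypothetical helper for illustration only; not taken from ShiCo or Texcavator.
    """
    counts = Counter()
    for text in texts:
        # Very rough tokenization: lowercase and keep runs of word characters.
        tokens = re.findall(r"\w+", text.lower())
        for i, word in enumerate(tokens):
            # Look only at the next `window` tokens so each pair is counted once.
            for other in tokens[i + 1 : i + 1 + window]:
                if other != word:
                    # Sort the pair so ("steam", "engine") and ("engine", "steam") match.
                    counts[tuple(sorted((word, other)))] += 1
    return counts


# Invented sample texts standing in for a newspaper corpus.
texts = [
    "The steam engine changed the factory.",
    "A new steam engine arrived at the factory.",
]
pairs = cooccurrence_counts(texts, window=2)
print(pairs.most_common(5))
```

With a real corpus of thousands of articles, pairs that keep turning up together across many texts are the kind of pattern the interview refers to; the window size decides how close two words must be to count as being in each other’s proximity.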

Source: FutureTDM