Thursday, 14 May 2015

Is data-mining the new genealogy?

I might ramble off topic here, but I have been thinking recently, so let's see where this goes...

A few months ago I had the privilege of serving as a community teaching assistant (CTA) for an on-line course on data mining. This was a very interesting and rewarding experience, interacting with the students and acting as one of the intermediaries between course staff and students.

Reading through some of the student introductions in the "Getting to Know You" forum, I came across two students from different countries who shared a goal that caught my eye. Both said in their introductions that they wanted to learn enough about data mining to apply it to genealogy. This got me thinking: could data mining techniques be applied to genealogy and, if so, how?

The (brief) argument these students put forward consisted of pointing out how much material was being digitised and indexed these days and "wouldn't it be nice if" we could use data mining to trawl through this material to find our matches more easily and (hopefully) even create our trees automatically. My gut reaction was that it wouldn't be as easy as these students seemed to think it would be. Just in my paternal line there are enough periods where there are so many Amos Bannisters from the same small town that matching the various events to the "correct" Amos Bannister has been a very taxing task - could a data mining algorithm really handle such a complex task?

Does that mean data mining is simply not up to the task? Or could we still use data mining in genealogy as an intermediate tool to help sort the wheat from the chaff? I think the latter is closer to reality - data mining can be a useful tool and it is being used (albeit often imperfectly) by the big online databases to serve up related records. When searching Ancestry (for example) you will notice a handful of possibly related records listed on the side of the page - these are presumably the results of some data mining. Even the search results involve some data mining - pretty much any fuzzy searching of a large database will involve data mining techniques. Similarly, the little shaky leaf hints on your Ancestry tree are probably the result of Ancestry's data mining efforts. So the real question is not "can data mining be useful in genealogy?", rather it is "how can we (the end users) harness data mining to help our research?" And this is where the future looks interesting.
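To make "fuzzy searching" a little more concrete, here is a minimal sketch in Python using the standard library's difflib. It is only an illustration of the general idea of ranking names by similarity; I have no knowledge of how Ancestry or any other provider actually implements their search, and the function names, the sample spellings and the 0.8 threshold are all my own assumptions.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Return a similarity score between 0.0 and 1.0 for two names (case-insensitive)."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def fuzzy_search(query, names, threshold=0.8):
    """Return the names that look similar enough to the query, best match first.

    The threshold is an arbitrary, assumed cut-off; a real index would tune it.
    """
    scored = [(name_similarity(query, name), name) for name in names]
    scored.sort(reverse=True)
    return [name for score, name in scored if score >= threshold]


# Hypothetical index entries: variant spellings plus an unrelated surname.
INDEXED_SURNAMES = ["Banister", "Bannester", "Smith"]
matches = fuzzy_search("Bannister", INDEXED_SURNAMES)
```

With these made-up entries the two variant spellings are returned and the unrelated surname is filtered out. A plain character comparison like this knows nothing about phonetics or transcription errors, which is part of why the real thing is harder than it looks.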

Imagine a tool where, instead of simple transcriptions of sources, the sources were further processed to extract all the various claims. A birth certificate might record claims such as: name of the child; date of birth of the child; place of birth of the child; name of the father; place of residence of the father; occupation of the father; age of the father; name of the mother; place of residence of the mother; occupation of the mother; age of the mother; etc. One simple source document can produce a massive array of related claims which could apply to a number of distinct persons. Extract all the possible claims from every source you have and suddenly you have a massive database of possible claims. Now apply some data mining techniques to this database to find which claims from different documents might relate to the same individuals - applying some heuristics to place some reasonable bounds on date ranges (don't bother looking for a marriage before the groom was born, for instance) and localities to restrict the scale of the search. Now you have a great starting point for your research into these individuals. You can check each of the sources to look for further clues to determine if they really are related.
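As a rough sketch of the idea, assuming each claim has already been extracted by hand into a small record, the matching step might look something like the Python below. Everything here is hypothetical: the Claim fields, the age and lifespan limits, and the sample records (which are invented, not real people) are only there to show how simple date and locality heuristics can prune the candidate pairs.

```python
from dataclasses import dataclass
from itertools import combinations

MAX_LIFESPAN = 100      # assumed upper bound on years between two events for one person
MIN_MARRIAGE_AGE = 14   # assumed lower bound on age at marriage


@dataclass(frozen=True)
class Claim:
    source: str   # the document the claim was extracted from
    name: str     # the person's name as written in the source
    event: str    # "birth", "marriage", ...
    year: int
    place: str


def plausible_pair(a, b):
    """Return True if two claims could plausibly refer to the same person."""
    if a.source == b.source:
        return False  # only link claims across different documents
    if a.name.casefold() != b.name.casefold():
        return False
    if a.place.casefold() != b.place.casefold():
        return False  # restrict the search to one locality
    if abs(a.year - b.year) > MAX_LIFESPAN:
        return False
    if a.event == "birth" and b.event == "birth" and a.year != b.year:
        return False  # one person cannot have two different birth years
    for birth, other in ((a, b), (b, a)):
        if birth.event == "birth" and other.event == "marriage":
            if other.year - birth.year < MIN_MARRIAGE_AGE:
                return False  # no marriage before (or too soon after) the birth
    return True


def candidate_links(claims):
    """Return every pair of claims worth checking by hand."""
    return [(a, b) for a, b in combinations(claims, 2) if plausible_pair(a, b)]


# Invented sample claims, for illustration only.
SAMPLE_CLAIMS = [
    Claim("birth cert A", "John Smith", "birth", 1820, "Exampleton"),
    Claim("marriage reg A", "John Smith", "marriage", 1845, "Exampleton"),
    Claim("marriage reg B", "John Smith", "marriage", 1825, "Exampleton"),
    Claim("birth cert B", "John Smith", "birth", 1850, "Exampleton"),
]
links = candidate_links(SAMPLE_CLAIMS)
```

Of the six possible pairs in the sample, only two survive the heuristics: the 1820 birth with the 1845 marriage, and the two marriages with each other (one person might marry twice). The output is a shortlist to investigate, not a conclusion; the sources still need to be checked to decide whether the claims really belong together.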

The extraction of the claims is going to become more critical. Current indexing and transcription projects employed by the big providers try to minimise the data recorded. They use strict templates extracting only the "core" information from a document and much of the extra information is simply not being indexed. This is why there are several new tools becoming more widely available, typically targeted at "evidence-based genealogy". Source analysis and claim extraction tools are starting to come to the fore. Using these tools you can analyse your source documents to extract all the possible claims, then you can analyse those claims to more accurately piece together the people in your tree. It would be awesome if the big databases could extract all the claims from their documents, but I fear this would be a task that is simply too large for anyone to tackle completely. Instead we will have to use personal tools on our own (smaller) datasets and do our data mining in the small.

But wouldn't it be nice if the new breed of developers could work together to agree on a way of sharing all this new data? ;^)
