Thursday, 14 May 2015

Is data-mining the new genealogy?

I might ramble off topic here, but I have been thinking about this recently, so let's see where it goes...

A few months ago I had the privilege of serving as a community teaching assistant (CTA) for an online course on data mining. It was a very interesting and rewarding experience, interacting with the students and acting as one of the intermediaries between them and the course staff.

Reading through some of the student introductions in the "Getting to Know You" forum I came across two students from different countries who shared a goal that caught my eye. Both said in their introductions that they wanted to learn enough about data mining to apply it to genealogy. This got me thinking: could data mining techniques be applied to genealogy, and if so, how?

The (brief) argument these students put forward consisted of pointing out how much material is being digitised and indexed these days and "wouldn't it be nice if" we could use data mining to trawl through this material to find our matches more easily and (hopefully) even build our trees automatically. My gut reaction was that it wouldn't be as easy as these students seemed to think. Just in my paternal line there are periods with so many Amos Bannisters in the same small town that matching the various events to the "correct" Amos Bannister has been a very taxing task - could a data mining algorithm really handle something that complex?

Does that mean data mining is simply not up to the task? Or could we still use data mining in genealogy as an intermediate tool to help sort the wheat from the chaff? I think the latter is closer to reality - data mining can be a useful tool and it is already being used (albeit often imperfectly) by the big online databases to serve up related records. When searching Ancestry (for example) you will notice a handful of possibly related records listed on the side of the page - these are presumably the results of some data mining. Even the search results involve some data mining - pretty much any fuzzy searching of a large database will involve data mining techniques. Similarly, the little shaky leaf hints on your Ancestry tree are probably the result of Ancestry's data mining efforts. So the real question is not "can data mining be useful in genealogy?" but rather "how can we (the end users) harness data mining to help our research?" And this is where the future looks interesting.
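To make that a little more concrete, here is a minimal sketch of the kind of fuzzy name matching that could sit behind a "possibly related records" suggestion. It uses only Python's standard library; the records, names and similarity threshold are all invented for illustration and are certainly not how Ancestry (or anyone else) actually does it.

```python
from difflib import SequenceMatcher

# Invented records, purely for illustration.
records = [
    {"id": 1, "name": "Amos Bannister",   "event": "birth",    "year": 1842},
    {"id": 2, "name": "Amos Banister",    "event": "marriage", "year": 1867},
    {"id": 3, "name": "Amelia Bannister", "event": "birth",    "year": 1845},
]

def similarity(a, b):
    """Return a rough 0..1 similarity score between two names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def possible_matches(query_name, threshold=0.85):
    """Return records whose name is 'close enough' to the query."""
    return [r for r in records
            if similarity(query_name, r["name"]) >= threshold]

# Picks up the variant spelling "Banister" but not "Amelia Bannister".
print(possible_matches("Amos Bannister"))
```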

Imagine a tool where, instead of simple transcriptions of sources, the sources were further processed to extract all the various claims. A birth certificate might record claims such as: name of the child; date of birth of the child; place of birth of the child; name of the father; place of residence of the father; occupation of the father; age of the father; name of the mother; place of residence of the mother; occupation of the mother; age of the mother; and so on. One simple source document can produce a massive array of related claims which could apply to a number of distinct persons. Extract all the possible claims from every source you have and suddenly you have a massive database of possible claims. Now apply some data mining techniques to this database to find which claims from different documents might relate to the same individuals, applying heuristics to place reasonable bounds on date ranges (don't bother looking for a marriage before the groom was born, for instance) and localities to restrict the scale of the search. Now you have a great starting point for your research into these individuals: you can check each of the sources for further clues to determine whether they really are related.
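As a rough illustration of the idea, here is a sketch of claim extraction and a cheap pre-filtering heuristic. The field names and the 110-year lifespan bound are my own assumptions for the example, not a description of any existing tool.

```python
def claims_from_birth_certificate(doc_id, rec):
    """Break one transcribed birth certificate into individual claims."""
    return [
        {"doc": doc_id, "role": "child",  "name": rec["child"],  "year": rec["year"]},
        {"doc": doc_id, "role": "father", "name": rec["father"], "year": None},
        {"doc": doc_id, "role": "mother", "name": rec["mother"], "year": None},
    ]

def could_be_same_person(a, b, max_lifespan=110):
    """Cheap sanity check applied before any expensive matching."""
    if a["name"] != b["name"]:
        return False
    if a["year"] is not None and b["year"] is not None:
        return abs(a["year"] - b["year"]) <= max_lifespan
    return True

claims = []
claims += claims_from_birth_certificate(
    "cert-1", {"child": "Amos Bannister", "father": "John Bannister",
               "mother": "Mary Bannister", "year": 1842})
claims += claims_from_birth_certificate(
    "cert-2", {"child": "Jane Bannister", "father": "Amos Bannister",
               "mother": "Sarah Bannister", "year": 1868})

# Candidate links between claims taken from *different* documents.
candidates = [(a, b) for a in claims for b in claims
              if a["doc"] < b["doc"] and could_be_same_person(a, b)]
for a, b in candidates:
    print(a["doc"], a["role"], "<->", b["doc"], b["role"], ":", a["name"])
```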

The extraction of claims is going to become more critical. Current indexing and transcription projects employed by the big providers try to minimise the data recorded. They use strict templates, extracting only the "core" information from a document, and much of the extra information is simply not indexed. This is why several new tools are becoming more widely available, typically targeted at "evidence-based genealogy". Source analysis and claim extraction tools are starting to come to the fore. Using these tools you can analyse your source documents to extract all the possible claims, then analyse those claims to more accurately piece together the people in your tree. It would be awesome if the big databases could extract all the claims from their documents, but I fear this is a task that is simply too large for anyone to tackle completely. Instead we will have to use personal tools on our own (smaller) datasets and do our data mining in the small.

But wouldn't it be nice if the new breed of developers could work together to agree on a way of sharing all this new data? ;^)

Saturday, 9 May 2015

On GEDCOM, its uses and abuses...

Louis Kessler recently posted about an interesting exchange on the FHISO mailing list concerning GEDCOM's handling of sources. Louis raises some interesting points and shows that GEDCOM is quite a detailed and complex standard which can cope with a wide range of situations, if only the developers of genealogy apps would study it. Sure, it is not an easy standard to comprehend and it does not directly cater for all possible scenarios, but it can definitely cope with a lot more than developers give it credit for.

Personally, I am not a huge fan of GEDCOM - not for what it can or can't do, but for how it has been used and abused by some sloppy developers. There have been many blog posts and articles about the incompatibilities between different implementations of GEDCOM import and export in different software packages. I have personally used genealogy software which couldn't even correctly import GEDCOM created by itself! Surely if a developer has gone to the trouble of creating GEDCOM export functionality, it is not too much to expect that they have fully tested that the software can import its own GEDCOM files.
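A round-trip smoke test along these lines is cheap to write. In the sketch below, parse_gedcom and write_gedcom are hypothetical stand-ins for whatever import and export routines a given application actually provides; the embedded (heavily trimmed) GEDCOM 5.5.1 fragment just shows the shape of the data being round-tripped.

```python
# parse_gedcom / write_gedcom are hypothetical stand-ins for an
# application's real import/export routines.
SAMPLE = """\
0 HEAD
1 GEDC
2 VERS 5.5.1
0 @I1@ INDI
1 NAME Amos /Bannister/
1 BIRT
2 DATE 12 MAR 1842
2 SOUR @S1@
3 PAGE entry 47
0 @S1@ SOUR
1 TITL Parish baptism register
0 TRLR
"""

def parse_gedcom(text):
    """Toy importer: split each line into a (level, remainder) pair."""
    return [(int(line.split(" ", 1)[0]), line.split(" ", 1)[1])
            for line in text.splitlines() if line.strip()]

def write_gedcom(records):
    """Toy exporter: the inverse of parse_gedcom above."""
    return "".join(f"{level} {rest}\n" for level, rest in records)

# The bare minimum: an application should be able to re-import its own export.
original = parse_gedcom(SAMPLE)
round_tripped = parse_gedcom(write_gedcom(original))
assert round_tripped == original
print("round trip OK")
```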

So why is there this problem? There could be many reasons - sloppiness, laziness, rushed implementations, not referring back to the specification documents - but one of the main reasons is that we all let the developers get away with these half-hearted implementations. If users (and other developers) keep accepting poor GEDCOM support, then we will keep getting poor GEDCOM support. I have voiced my opinion to the developers of the two main genealogy programs I use on my Mac and I have noticed some (minor) improvements over the years.

There is something about this that reminds me of the early days of web development, when we had the infamous "browser wars" between Microsoft's Internet Explorer and Netscape's Navigator. If you are old enough to remember those days, it was not uncommon to see "Optimised for Internet Explorer" or "Optimised for Netscape Navigator" badges on websites. Both companies were in such a mad rush to stuff new features into their browsers that they didn't bother spending the time to ensure their browsers supported the relevant internet standards (HTML, CSS, etc.), and the result was that the two browsers supported different parts of the standards and the implementations were incompatible. Some of the incompatibilities were caused by cutting-edge technology not covered by any existing standards, some were caused by misinterpreting the standards and some were deliberately introduced to break the other browser.

The browser wars were a very trying time for web developers. If you could control the platform (for example, a company intranet) it was easy: you chose a browser (IE or Netscape) and coded to that browser's quirks. If you didn't have control over the platform, your code was a mess of conditionals, browser sniffing, and CSS hacks as you tried to serve up the "correct" version of your page depending on the user's browser. Even when you did have control it wasn't always easy: as browsers got upgraded you had to make sure your code still worked in case the vendor had fixed some of their incompatibilities.

Coding to the published standards wasn't really a solution. Both vendors picked and chose which parts of the standards they would implement, so developers had to refer to tables and charts detailing which parts of the standards were supported by which browser and where they interpreted the standards differently. The CSS box model was a never-ending source of frustration for me - you could get your code working nicely in IE, but it would look terrible in Netscape. Or vice versa.

Over time web developers started to fight back. I attended a number of developer conferences and at every one there was a vocal crowd of web developers demanding to know why the standards were being ignored. Microsoft's response was that they weren't creating browsers for developers, they were creating browsers for end users, and they (Microsoft) knew better than we developers which parts of the standards were important enough to implement. More and more developers pushed back: various test suites were created to test the implementation of the standards and push the limits, showcases of innovative web sites highlighted the standards support issues, and some smaller browser developers started building "standards-compliant" browsers. Slowly but surely the focus of the browser wars changed from unique features to standards support. It eventually became "cool" for browser implementors to tout their standards compliance, and finally standards support just became an expected feature of a browser.

So how does this relate to GEDCOM? It probably doesn't, except insofar as the level of complaints about poor support will determine what, if anything, is done to rectify the situation. GEDCOM is an old standard. A very old standard. It probably needs a revamp to bring it more in line with "modern" genealogical practices, especially when it comes to the handling of sources and evidence-based genealogy. There have been several attempts to improve GEDCOM, or to develop a complete replacement, yet none of these efforts has amounted to much of note. To be honest though, it doesn't matter if one group or another comes up with a better GEDCOM unless there is buy-in from the major genealogy developers. Small developers can support all the alternative standards they want, but if the big guys don't support them, what's the point?

Inertia is a huge force in software development. Everybody supports GEDCOM because everyone else supports GEDCOM. No one bothers to implement GEDCOM properly, because no one else implements it properly. No one bothers to change this status quo because the users (and other developers) aren't demanding it.

For my software I have made the decision not to implement a GEDCOM import. I can get away without a GEDCOM import because I am not writing a "family tree" tool, I am writing a source management and analysis tool. There are enough bad and inconsistent implementations out there that it would really be too much work to try to import badly formed GEDCOM files. I will (eventually) provide a GEDCOM export and I will make darned sure I get it right. (Or at least as right as I can make it.) I will also be exporting my data in a variety of other formats, including exposing a web service or API for other developers to work with.
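To give a flavour of what "a variety of other formats" might look like in practice, here is a sketch of a source and its extracted claims serialised as JSON. The field names and structure are purely hypothetical and not a commitment to any particular format.

```python
import json

# Purely hypothetical data shape, just to show the principle.
export = {
    "source": {"id": "cert-1",
               "title": "Birth certificate, Amos Bannister, 1842"},
    "claims": [
        {"role": "child",  "name": "Amos Bannister", "born": 1842},
        {"role": "father", "name": "John Bannister"},
    ],
}

print(json.dumps(export, indent=2))
```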