Monday, 18 January 2016

On Sources...

In genealogy we deal with a large number of documents, normally referred to as "sources". This is where we find the data that drives our research, and good genealogists will be careful to study their sources to determine the veracity and accuracy of the data contained in them. Most genealogy software treats an entry in an online database (say, Ancestry, FamilySearch, The Genealogist, etc.) as a single source, so you can end up with a large number of "sources" that all refer back to the same original document. If you use multiple sites, the same data from the same original document can end up being treated as five or six distinct sources, which tends to skew the results when you are trying to decide what the "correct" data should be for a particular fact. Then you have the added wrinkle of copies of original documents, such as the Bishop's Transcripts of original parish registers.

Take for example the Stretford, St Matthew parish registers. These registers are (mostly) digitised and images are available on many of the major online databases. The registers have also been transcribed several times, and indexes are widely available. In some cases there are two or three separate transcriptions on FamilySearch (without images), plus one or two transcriptions of the corresponding Bishop's Transcripts on FamilySearch as well. This means a search for the baptism of a particular person can return up to five different records on FamilySearch. The major databases, such as Ancestry, have (with permission) copied some of FamilySearch's indexes of these registers as well as conducting their own digitisation and transcription programs. The net result is that there are now potentially 10-12 records for the one event on just two sites, all originating from two actual documents, one being a copy of the other. If you had 10-12 pieces of evidence supporting a fact, you might think that the fact was sound and your work was done, but is it really?

All these "sources" derive from one original document, being the parish register. The Bishops Transcript is simply the first copy of this document, then the information has been transcribed multiple times from photos/scans of these two physical documents. At most you have two source documents, but really there is only one true source.

What if software treated all these transcriptions and indexes as versions of the one original? When you look at how many supporting sources you have for a fact, the software would tell you that you have one (or two) sources and 10-12 transcriptions. If the transcriptions vary (and they often do) the software could present all the variations and allow you to choose a preferred version - after all, each transcription is just another attempt at reading the one source document. When determining the correct details for a fact (say, date of birth or place of residence) truly distinct sources can then be presented without the "fog" of multiple transcriptions, allowing the user to have more confidence that they have correctly interpreted the data.
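As a rough sketch of what I mean (the provider names, field values, and code here are purely illustrative, not how any existing site stores its data):

```ruby
# Hypothetical sketch only: many index entries collapse into one source with
# variant readings per field, instead of being counted as separate sources.
Transcription = Struct.new(:provider, :fields, keyword_init: true)

transcriptions = [
  Transcription.new(provider: "FamilySearch index A",
                    fields: { name: "Jno. Smith", baptism_date: "12 Mar 1850" }),
  Transcription.new(provider: "FamilySearch index B",
                    fields: { name: "John Smith", baptism_date: "12 Mar 1850" }),
  Transcription.new(provider: "Ancestry index",
                    fields: { name: "John Smyth", baptism_date: "12 Mar 1850" }),
]

# Collect the variant readings for each field across all transcriptions of the
# one underlying register entry.
variants = Hash.new { |hash, key| hash[key] = Hash.new(0) }
transcriptions.each do |t|
  t.fields.each { |field, value| variants[field][value] += 1 }
end

variants.each do |field, readings|
  puts "#{field}: 1 source, #{readings.size} variant reading(s)"
  readings.each { |value, count| puts "  #{value} (#{count} transcription(s))" }
end
# The user (or a simple majority rule) then marks one reading as preferred;
# the underlying event still counts as a single source.
```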

That's one of the things I am trying to do with my own software. I want to move to a more source-centric process, but take it a step or two further than other source-centric software. The data structures are slowly coalescing, but it hasn't been easy. I need to be able to link multiple transcriptions to original documents and to multiple data repositories (database providers, archives, libraries, etc.), and then try to distill the facts from the cloud of transcribed versions. I think I am close to understanding where I want to get to with all this, if only I had the time to implement it all. ;^)
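For what it's worth, the rough shape I keep coming back to looks something like this ActiveRecord-style sketch - all the model and column names are placeholders, not my final design:

```ruby
require "active_record"

# Placeholder models: a physical document can be a copy of another (e.g. a
# Bishop's Transcript of a register), and each transcription belongs to both
# a document and the repository (provider/archive) that hosts it.
class Repository < ActiveRecord::Base          # Ancestry, FamilySearch, an archive...
  has_many :transcriptions
end

class SourceDocument < ActiveRecord::Base      # the register or Bishop's Transcript
  belongs_to :original, class_name: "SourceDocument", foreign_key: :original_id
  has_many   :copies,   class_name: "SourceDocument", foreign_key: :original_id
  has_many   :transcriptions
end

class Transcription < ActiveRecord::Base       # one provider's reading/index of a document
  belongs_to :source_document
  belongs_to :repository
end
```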

Tuesday, 3 November 2015

Back to it!

Due to a number of factors (family drama, health issues, etc) I had to put aside development of my genealogy log tool. Things have (mostly) settled down now, so I guess it is time to dust off the codebase and get cracking again.

First task off the rank - perform a code and technology review: have I chosen the right tools/platform for the genealogy log? I originally opted for a web-based tool, but would it be better to create a native iOS/Android app, with an optional web-based data store? This is a question I flip-flopped on for a while before I started and it is one I think I should revisit before I restart my development efforts. One of the key questions is whether or not users will have internet connections where they want to use the apps.

What are your thoughts? Would you prefer an iPad/Android app (with the data being backed up to the cloud) or a web-based app that might not have so many bells and whistles?

Thursday, 14 May 2015

Is data-mining the new genealogy?

I might ramble off topic here, but I have been thinking recently, let's see where this goes...

A few months ago I had the privilege of serving as a community teaching assistant (CTA) for an on-line course on data mining. This was a very interesting and rewarding experience, interacting with the students and acting as one of the intermediaries between course staff and students.

Reading through some of the student introductions in the "Getting to Know You" forum, I came across two students from different countries whose shared goal caught my eye. Both said in their introductions that they wanted to learn enough about data mining to apply it to genealogy. This got me thinking: could data mining techniques be applied to genealogy and, if so, how?

The (brief) argument these students put forward consisted of pointing out how much material is being digitised and indexed these days and "wouldn't it be nice if" we could use data mining to trawl through this material to find our matches more easily and (hopefully) even create our trees automatically. My gut reaction was that it wouldn't be as easy as these students seemed to think. Just in my paternal line there are periods with so many Amos Bannisters in the same small town that matching the various events to the "correct" Amos Bannister has been a very taxing task - could a data mining algorithm really handle such a complex task?

Does that mean data mining is simply not up to the task? Or could we still use data mining in genealogy as an intermediate tool to help sort the wheat from the chaff? I think the latter is closer to reality - data mining can be a useful tool and it is being used (albeit often imperfectly) by the big online databases to serve up related records. When searching Ancestry (for example) you will notice a handful of possibly related records listed on the side of the page - these are presumably the results of some data mining. Even the search results involve some data mining - pretty much any fuzzy searching of a large database will involve data mining techniques. Similarly, the little shaky leaf hints on your Ancestry tree are probably the result of Ancestry's data mining efforts. So the real question is not "can data mining be useful in genealogy?", rather it is "how can we (the end users) harness data mining to help our research?" And this is where the future looks interesting.
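To be clear, I have no idea what Ancestry's algorithms actually look like internally, but even something as simple as an edit-distance check gives a flavour of the fuzzy matching I mentioned above - a throwaway Ruby example catching spelling variants of a surname:

```ruby
# Toy illustration only: a standard Levenshtein edit distance used to flag
# likely spelling variants of a surname. Real systems are far more involved.
def levenshtein(a, b)
  a = a.downcase
  b = b.downcase
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = (ca == cb) ? 0 : 1
      curr << [curr[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

["Bannister", "Banister", "Bannester", "Bancroft"].each do |candidate|
  d = levenshtein("Bannister", candidate)
  puts "#{candidate}: edit distance #{d} #{d <= 2 ? '(likely variant)' : '(probably not)'}"
end
```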

Imagine a tool where, instead of simple transcriptions of sources, the sources were further processed to extract all the various claims. A birth certificate might record claims such as: name of the child; date of birth of the child; place of birth of the child; name of the father; place of residence of the father; occupation of the father; age of the father; name of the mother; place of residence of the mother; occupation of the mother; age of the mother; etc. One simple source document can produce a large array of related claims which could apply to a number of distinct persons. Extract all the possible claims from every source you have and suddenly you have a massive database of possible claims. Now apply some data mining techniques to this database to find which claims from different documents might relate to the same individuals, applying some heuristics to place reasonable bounds on date ranges (don't bother looking for a marriage before the groom was born, for instance) and localities to restrict the scale of the search. Now you have a great starting point for your research into these individuals. You can check each of the sources to look for further clues to determine whether they really are related.
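A back-of-the-envelope sketch of the idea (the source IDs, names, and age/locality thresholds are all invented for illustration):

```ruby
require "date"

# Hypothetical "claims" pulled from one source, plus a crude plausibility
# filter. Everything here is a placeholder, not a real extraction pipeline.
Claim = Struct.new(:source_id, :person_role, :attribute, :value, keyword_init: true)

# Claims extracted from a single birth certificate (source "BC-1023").
claims = [
  Claim.new(source_id: "BC-1023", person_role: :child,  attribute: :name,       value: "Amos Bannister"),
  Claim.new(source_id: "BC-1023", person_role: :child,  attribute: :birth_date, value: Date.new(1861, 4, 2)),
  Claim.new(source_id: "BC-1023", person_role: :father, attribute: :name,       value: "Amos Bannister"),
  Claim.new(source_id: "BC-1023", person_role: :father, attribute: :residence,  value: "Stretford"),
]

# A candidate person assembled from other sources.
candidate = { name: "Amos Bannister", birth_date: Date.new(1830, 7, 14), residence: "Stretford" }

# Simple heuristics: the candidate must be a plausible age at the child's
# birth, and the localities should line up.
def plausible_father?(candidate, child_birth_date)
  age = child_birth_date.year - candidate[:birth_date].year
  age.between?(15, 70)
end

child_birth      = claims.find { |c| c.person_role == :child  && c.attribute == :birth_date }.value
father_residence = claims.find { |c| c.person_role == :father && c.attribute == :residence  }.value

if plausible_father?(candidate, child_birth) && candidate[:residence] == father_residence
  puts "Candidate is worth a closer look as the father on #{claims.first.source_id}"
end
```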

The extraction of the claims is going to become more critical. Current indexing and transcription projects employed by the big providers try to minimise the data recorded. They use strict templates extracting only the "core" information from a document and much of the extra information is simply not being indexed. This is why there are several new tools becoming more widely available, typically targeted at "evidence-based genealogy". Source analysis and claim extraction tools are starting to come to the fore. Using these tools you can analyse your source documents to extract all the possible claims, then you can analyse those claims to more accurately piece together the people in your tree. It would be awesome if the big databases could extract all the claims from their documents, but I fear this would be a task that is simply too large for anyone to tackle completely. Instead we will have to use personal tools on our own (smaller) datasets and do our data mining in the small.

But wouldn't it be nice if the new breed of developers could work together to agree on a way of sharing all this new data? ;^)

Saturday, 9 May 2015

On GEDCOM, its uses and abuses...

Louis Kessler recently posted about an interesting exchange on the FHISO mailing list about GEDCOM's handling of sources. Louis raises some interesting points and shows that GEDCOM is quite a detailed and complex standard which can cope with a wide range of situations, if only the developers of genealogy apps would study the standard. Sure, it is not an easy standard to comprehend and it does not directly cater for all possible scenarios, but it can definitely cope with a lot more than developers give it credit for.
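To give a feel for the detail GEDCOM can already carry, here is a simplified GEDCOM 5.5.1-style fragment (identifiers invented, structure abridged) tying a baptism to a source record via a citation:

```
0 @S1@ SOUR
1 TITL Stretford, St Matthew, parish registers (baptisms)
1 REPO @R1@
0 @I1@ INDI
1 NAME John /Smith/
1 BAPM
2 DATE 12 MAR 1850
2 PLAC Stretford, Lancashire, England
2 SOUR @S1@
3 PAGE Baptism register, 1850, entry 23
3 QUAY 3
```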

Personally, I am not a huge fan of GEDCOM - not for what it can or can't do, but for how it has been used and abused by some sloppy developers. There have been many blog posts and articles about the incompatibilities between different implementations of GEDCOM imports and exports in different software packages. I have personally used genealogy software which couldn't even correctly import GEDCOM created by itself! Surely if a developer has gone to the trouble of creating GEDCOM export functionality, it is not too much to expect that they have fully tested that the software can import its own GEDCOM files.

So why is there this problem? There could be many reasons - sloppiness, laziness, rushed implementations, not referring back to the specification documents - but one of the main reasons is that we all let the developers get away with these half-hearted implementations. If users (and other developers) keep accepting poor GEDCOM support, then we will keep getting poor GEDCOM support. I have voiced my opinion to the developers of the two main genealogy programs I use on my Mac and I have noticed some (minor) improvements over the years.

There is something about this that reminds me of the early days of web development, when we had the infamous "browser wars" between Microsoft's Internet Explorer and Netscape's Navigator. If you are old enough to remember those days, it was not uncommon to see "Optimised for Internet Explorer" or "Optimised for Netscape Navigator" badges on websites. Both companies were in a mad rush to stuff so many new features into their browsers that they didn't spend the time to ensure their browsers supported the relevant internet standards (HTML, CSS, etc), and the result was that the two browsers supported different parts of the standards and their implementations were incompatible. Some of the incompatibilities were caused by cutting-edge technology not covered by any existing standard, some were caused by misinterpreting the standards, and some were deliberately introduced to break the other browser.

The browser wars were a very trying time for web developers. If you could control the platform (for example, a company intranet) it was easy: you chose a browser (IE or Netscape) and you could code to that browser's quirks. If you didn't have control over the platform, your code was a mess of conditionals, browser sniffing, and CSS hacks as you tried to serve up the "correct" version of your page depending on the user's browser. Even when you did have control it wasn't always easy - as browsers got upgraded you had to make sure your code still worked in case the vendor had fixed some of their incompatibilities.

Coding to the published standards wasn't really a solution. Both vendors picked and chose which parts of the standards they would implement, so developers had to refer to tables and charts detailing which parts of the standards were supported by which browser and where they interpreted the standards differently. The CSS box model was a never-ending source of frustration for me - you could get your code working nicely in IE, but it would look terrible in Netscape. Or vice versa.

Over time, web developers started to fight back. I attended a number of developer conferences and at every one there was a vocal crowd of web developers demanding to know why the standards were being ignored. Microsoft's response was that they weren't creating browsers for developers, they were creating browsers for end users, and they (Microsoft) knew better than we developers which parts of the standards were important enough to implement. More and more developers pushed back, and various test suites were created to exercise the standards implementations and push their limits. Showcases of innovative web sites highlighted the standards-support issues, some smaller developers started building "standards-compliant" browsers, and slowly but surely the focus of the browser wars changed from unique features to standards support. It eventually became "cool" for browser implementors to tout their standards compliance, and finally standards support just became an expected feature of a browser.

So how does this relate to GEDCOM? It probably doesn't, except insofar as the level of complaints about poor support will determine what, if anything, is done to rectify the situation. GEDCOM is an old standard. A very old standard. It probably needs a revamp to bring it more in line with "modern" genealogical practices, especially when it comes to the handling of sources and evidence-based genealogy. There have been several attempts to improve GEDCOM, or to develop a complete replacement, yet none of these efforts has amounted to much of note. To be honest though, it doesn't matter if one group or another comes up with a better GEDCOM unless there is buy-in from the major genealogy developers. Small developers can support all the alternative standards they want, but if the big guys don't support them, what's the point?

Inertia is a huge force in software development. Everybody supports GEDCOM because everyone else supports GEDCOM. No one bothers to implement GEDCOM properly, because no one else implements it properly. And no one bothers to change this status quo because the users (and other developers) aren't demanding it.

For my software I have made the decision not to implement a GEDCOM import. I can get away without one because I am not writing a "family tree" tool, I am writing a source management and analysis tool, and there are enough bad and inconsistent implementations out there that it would really be too much work to try to import badly formed GEDCOM files. I will (eventually) provide a GEDCOM export and I will make darned sure I get it right. (Or at least as right as I can make it.) I will also be exporting my data in a variety of other formats, including exposing a web service or API for other developers to work with.
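The export side is largely about being pedantic with the line format. Here is a minimal Ruby sketch of the idea - placeholder code, not my actual exporter:

```ruby
# A very rough sketch of the pedantic part of a GEDCOM export: every line is
# "LEVEL [@XREF@] TAG [VALUE]", and values should never contain raw newlines
# (long text is meant to be split across CONT/CONC lines instead).
def gedcom_line(level, tag, value = nil, xref: nil)
  parts = [level.to_s]
  parts << xref if xref
  parts << tag
  parts << value.to_s unless value.nil? || value.to_s.empty?
  parts.join(" ")
end

record = []
record << gedcom_line(0, "SOUR", xref: "@S1@")
record << gedcom_line(1, "TITL", "Stretford, St Matthew, parish registers")
record << gedcom_line(1, "REPO", "@R1@")
puts record.join("\n")   # the spec is fairly relaxed about line terminators
```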

Sunday, 12 April 2015

I am fail!

Bah! I fail on at least two counts: 1) I have not been keeping this blog updated; and 2) I have decided to throw out my first attempt at a genealogy log and start from scratch.

I have been working with my log tool for a little while and it just isn't cutting it for me. There were a number of problems with my initial approach, not least of which was search - it was too cumbersome to search through past research efforts to find what had been found or not found. I really need to rethink how and what data I am recording so that it will be easy to search past sessions. Using the tool for a few weeks did let me tweak the data I was capturing for each session, so I'll treat it as a proof-of-concept prototype and rebuild something more appropriate to my needs.

I didn't end up going down the NoSQL path, primarily because I saw no issues with a SQL solution for a project like this. SQL is one of my strengths, so why not leverage my knowledge? Yes, I do want to learn NoSQL, but I am already branching out with Rails on this project, so I don't feel the need to introduce too many new technologies at once.

So where to now? Well, I have archived the code of this attempt and will start with a clean slate. The web-based approach felt right when I was using it - I could access the app from home or out at the library on my iPad, so I got that part right at least. Responsiveness was okay too, so I will continue with Rails for Log Mark II. I have been thinking that there is a lot of overlap between a research log and a research planner, so the two might become one, however for now I will just implement the log and leave planning for a later version.

I learned quite a lot doing this first draft - I hope I get more things right the second time around. 8^)

Wednesday, 4 March 2015

Database choice - to SQL or NoSQL?

An interesting thought came to me just now. For most of my professional development career, I have been an SQL adherent. I have worked on some large and small SQL projects on a variety of platforms. I understand and am comfortable with SQL databases and it has become second nature to turn to a SQL database when I need to store any reasonable amount of data. Performance tuning and optimisation are two of my strong suits and over the years I have been called in to help speed up and clean up SQL databases. So it was natural that when I started developing my genealogy tools I turned to an SQL database for my data store. For the genealogy log I am using MySQL for local development and PostgreSQL on the server. But are these the right choices?

I was recently approached by the University of Illinois at Urbana-Champaign (UIUC) to be a Community Teaching Assistant (CTA) for two of their online courses on Coursera, one on Data Mining and the other on Cloud Computing Concepts. The cloud computing course has been quite interesting and has given me a better understanding of the theory and internals of NoSQL databases. In some respects, some NoSQL approaches are not too dissimilar to systems that I have worked on in the past - key-value stores, column-based databases, and graph databases. It would be an interesting exercise to put some of the theory I have been learning into practice and try using a NoSQL database (or two?) for my genealogy tools. The UIUC cloud computing course has been quite theoretical and is just an introduction to the concepts of cloud computing, but it has whetted my appetite.

Would a NoSQL database make sense for a genealogy log? What about a source/evidence management app? What about a full-blown genealogy tool? If I decide to switch to NoSQL, which flavour? Maybe a mix of databases would be best. A key-value store might work well for the log. A column-based database might fit well with storing claims extracted from sources. Linking claims together to provide evidence for events and people could be modelled with a graph-based database, and a family tree or pedigree seems like a natural fit for a graph database too. So many possibilities!
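To illustrate that last point, a pedigree really is just a graph - here is a toy Ruby example (with made-up names) rather than anything a real graph database would store:

```ruby
# A toy illustration of why a pedigree maps naturally onto a graph: people are
# nodes, parent-child links are edges, and "find all ancestors" is just a walk.
parents = {
  "Amos Bannister (b. 1861)" => ["Amos Bannister (b. 1830)", "Mary Bannister"],
  "Amos Bannister (b. 1830)" => ["James Bannister", "Ann Bannister"],
}

def ancestors(person, parents, found = [])
  (parents[person] || []).each do |parent|
    found << parent
    ancestors(parent, parents, found)
  end
  found
end

puts ancestors("Amos Bannister (b. 1861)", parents)
```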

I know one system I would love to use for these tools - it was a cross between a column-based database and a graph database, but it was big and clunky and sat uneasily on top of a SQL database. We effectively created what we called a "universal data model" in SQL. It was very flexible and powerful, but wasn't exactly the easiest of systems to understand and could be very cumbersome to use. I might have to do a little research into the current crop of NoSQL offerings to see if there is something that might be a close match.
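For anyone who hasn't met that style of design, the flavour was roughly entity-attribute-value rows plus typed links between entities - this is a generic illustration from memory, not the actual schema:

```ruby
# Roughly the "universal data model" flavour: every fact becomes a generic
# (entity, attribute, value) row, and relationships are typed links between
# entities. Very flexible, but every query has to reassemble the rows.
facts = [
  { entity: "person:1", attribute: "name",       value: "Amos Bannister" },
  { entity: "person:1", attribute: "birth_year", value: "1830" },
  { entity: "person:2", attribute: "name",       value: "Amos Bannister" },
  { entity: "person:2", attribute: "birth_year", value: "1861" },
]
links = [
  { from: "person:2", relationship: "child_of", to: "person:1" },
]

# Reassembling even one "record" means filtering and pivoting the generic rows.
person_1 = facts.select { |f| f[:entity] == "person:1" }
                .map    { |f| [f[:attribute], f[:value]] }
                .to_h
puts person_1.inspect
puts links.inspect
```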

If anyone has any experience with NoSQL systems, feel free to chime in. ;^)

Monday, 2 March 2015

Update and plans

My transcription tasks have taken longer than anticipated, but I have finally completed the two wills I recently received. I now have an idea for a transcription tool that would (semi-)automatically scroll the original document in a frame/window while you type into a lower frame, with a highlight bar showing the line you are working on. I doubt this would be suitable for a web app, so it would more likely be a desktop app should I decide to move forward with this idea.

I have also been jotting down ideas for the real meat of my genealogy app. I am not yet ready to share details of what I have in mind, but the picture of what I want is solidifying. I find that when I can picture a new project in my mind as a completed product, it means I am on the right track. Sometimes the picture comes to me in a dream where I am using the app, and that is what happened a short while ago: I saw my app and how it worked. I want to mock something up tomorrow to make sure it feels right and also so I don't forget how I am currently picturing it.

Once I finish my mock app (which should only take a day or so) I can get back into my research log. I really want to get this finished and ready for testing before I move on to a research plan tool. The research plan and research log tools will work closely together as they are really both sides of the same coin. The sooner I finish these two tools, the sooner I can get started on the larger plan. 8^)