Tuesday, 3 November 2015

Back to it!

Due to a number of factors (family drama, health issues, etc) I had to put aside development of my genealogy log tool. Things have (mostly) settled down now, so I guess it is time to dust off the codebase and get cracking again.

First task off the rank - perform a code and technology review: have I chosen the right tools/platform for the genealogy log? I originally opted for a web-based tool, but would it be better to create a native iOS/Android app, with an optional web-based data store? This is a question I flip-flopped on for a while before I started and it is one I think I should revisit before I restart my development efforts. One of the key questions is whether or not users will have internet connections where they want to use the apps.

What are your thoughts? Would you prefer an iPad/Android app (with the data being backed up to the cloud) or a web-based app that might not have so many bells and whistles?

Thursday, 14 May 2015

Is data-mining the new genealogy?

I might ramble off topic here, but I have been thinking recently, let's see where this goes...

A few months ago I had the privilege of serving as a community teaching assistant (CTA) for an on-line course on data mining. This was a very interesting and rewarding experience, interacting with the students and acting as one of the intermediaries between course staff and students.

Reading through some of the student introductions in the "Getting to Know You" forum, I came across two students from different countries who shared a goal that caught my eye. Both said in their introductions that they wanted to learn enough about data mining to apply it to genealogy. This got me thinking: could data mining techniques be applied to genealogy and, if so, how?

The (brief) argument these students put forward consisted of pointing out how much material is being digitised and indexed these days and "wouldn't it be nice if" we could use data mining to trawl through this material to find our matches more easily and (hopefully) even create our trees automatically. My gut reaction was that it wouldn't be as easy as these students seemed to think. Just in my paternal line there are periods with so many Amos Bannisters from the same small town that matching up the various events with the "correct" Amos Bannister has been a very taxing task - could a data mining algorithm really handle something so complex?

Does that mean data mining is simply not up to the task? Or could we still use data mining in genealogy as an intermediate tool to help sort the wheat from the chaff? I think the latter is closer to reality - data mining can be a useful tool and it is being used (albeit often imperfectly) by the big online databases to serve up related records. When searching Ancestry (for example) you will notice a handful of possibly related records listed on the side of the page - these are presumably the results of some data mining. Even the search results involve some data mining - pretty much any fuzzy searching of a large database will involve data mining techniques. Similarly, the little shaky leaf hints on your Ancestry tree are probably the result of Ancestry's data mining efforts. So the real question is not "can data mining be useful in genealogy?", rather it is "how can we (the end users) harness data mining to help our research?" And this is where the future looks interesting.

Imagine a tool where, instead of simple transcriptions of sources, the sources were further processed to extract all the various claims. A birth certificate might record claims such as: name of the child; date of birth of the child; place of birth of the child; name of the father; place of residence of the father; occupation of the father; age of the father; name of the mother; place of residence of the mother; occupation of the mother; age of the mother; etc. One simple source document can produce a massive array of related claims which could apply to a number of distinct persons. Extract all the possible claims from every source you have and suddenly you have a massive database of possible claims. Now apply some data mining techniques to this database to find which claims from different documents might relate to the same individuals - applying some heuristics to place reasonable bounds on date ranges (don't bother looking for a marriage before the groom was born, for instance) and localities to restrict the scale of the search. Now you have a great starting point for your research into these individuals. You can check each of the sources for further clues to determine whether they really are related.
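To make the idea concrete, here is a minimal sketch (in Ruby, since that is what I am using elsewhere) of what a claim record and one of those bounding heuristics might look like. All of the class names, fields and sample values here are invented for illustration:

```ruby
# A claim is one assertion extracted from one source document.
Claim = Struct.new(:source_id, :subject_role, :kind, :value, keyword_init: true)

# Claims extracted from a single (imaginary) birth certificate.
birth_cert_claims = [
  Claim.new(source_id: "S1", subject_role: :child,  kind: :name,       value: "Amos Bannister"),
  Claim.new(source_id: "S1", subject_role: :child,  kind: :birth_year, value: 1842),
  Claim.new(source_id: "S1", subject_role: :father, kind: :name,       value: "John Bannister"),
  Claim.new(source_id: "S1", subject_role: :father, kind: :occupation, value: "weaver")
]

# Heuristic: a marriage claim can only belong to the same person as a
# birth claim if the marriage falls a plausible number of years later.
def plausible_marriage?(birth_year, marriage_year, min_age: 14, max_age: 80)
  (marriage_year - birth_year).between?(min_age, max_age)
end

plausible_marriage?(1842, 1865)  # => true  (married at 23)
plausible_marriage?(1842, 1850)  # => false (married at 8? ruled out)
```

The real heuristics would of course be far richer - localities, name variants, date precision - but even a crude age bound like this prunes an enormous number of impossible matches before any serious analysis starts.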

The extraction of the claims is going to become more critical. Current indexing and transcription projects employed by the big providers try to minimise the data recorded. They use strict templates extracting only the "core" information from a document and much of the extra information is simply not being indexed. This is why there are several new tools becoming more widely available, typically targeted at "evidence-based genealogy". Source analysis and claim extraction tools are starting to come to the fore. Using these tools you can analyse your source documents to extract all the possible claims, then you can analyse those claims to more accurately piece together the people in your tree. It would be awesome if the big databases could extract all the claims from their documents, but I fear this would be a task that is simply too large for anyone to tackle completely. Instead we will have to use personal tools on our own (smaller) datasets and do our data mining in the small.

But wouldn't it be nice if the new breed of developers could work together to agree on a way of sharing all this new data? ;^)

Saturday, 9 May 2015

On GEDCOM, its uses and abuses...

Louis Kessler recently posted about an interesting exchange on the FHISO mailing list concerning GEDCOM's handling of sources. Louis raises some interesting points and shows that GEDCOM is quite a detailed and complex standard which can cope with a wide range of situations, if only the developers of genealogy apps would study the standard. Sure, it is not an easy standard to comprehend and it does not directly cater for all possible scenarios, but it can definitely cope with a lot more than developers give it credit for.

Personally, I am not a huge fan of GEDCOM - not for what it can/can't do, but for how it has been used and abused by some sloppy developers. There have been many blog posts and articles about the incompatibilities between different implementations of GEDCOM imports and exports in different software packages. I have personally used genealogy software which couldn't even correctly import GEDCOM created by itself! Surely, if a developer has gone to the trouble of creating GEDCOM export functionality, it is not too much to expect them to have fully tested that the software can import its own GEDCOM files.

So why is there this problem? There could be many reasons - sloppiness, laziness, rushed implementations, not referring back to the specification documents - but one of the main reasons is that we all let the developers get away with these half-hearted implementations. If users (and other developers) keep accepting poor GEDCOM support, then we will keep getting poor GEDCOM support. I have voiced my opinion to the developers of the two main genealogy programs I use on my Mac and I have noticed some (minor) improvements over the years.

There is something about this that reminds me of the early days of web development, when we had the infamous "browser wars" between Microsoft's Internet Explorer and Netscape's Navigator. If you are old enough to remember those days, it was not uncommon to see "Optimised for Internet Explorer" or "Optimised for Netscape Navigator" badges on websites. Both companies were in such a mad rush to stuff new features into their browsers that they didn't bother spending the time to ensure their browsers supported the relevant internet standards (HTML, CSS, etc) and the result was that both browsers supported different parts of the standards and the implementations were incompatible. Some of the incompatibilities were caused by cutting-edge technology not covered by any existing standards, some were caused by misinterpreting the standards and some were deliberately introduced to break the other browser.

The browser wars were a very trying time for web developers. If you could control the platform (for example, a company intranet) it was easy: you chose a browser (IE or Netscape) and you could code to that browser's quirks. If you didn't have control over the platform, your code was a mess of conditionals, browser sniffing, and CSS hacks as you tried to serve up the "correct" version of your page depending on the user's browser. Even when you did have control it wasn't always easy: as browsers got upgraded you had to make sure your code still worked in case the vendor had fixed some of their incompatibilities.

Coding to the published standards wasn't really a solution. Both vendors picked and chose which parts of the standards they would implement, so developers had to refer to tables and charts detailing which parts of the standards were supported by which browser and where they interpreted the standards differently. The CSS box model was a never-ending source of frustration for me - you could get your code working nicely in IE, but it would look terrible in Netscape. Or vice versa.

Over time web developers started to fight back. I attended a number of developer conferences and at every one there was a vocal crowd of web developers demanding to know why the standards were being ignored. Microsoft's response was that they weren't creating browsers for developers, they were creating browsers for end users, and they (Microsoft) knew better than we developers which parts of the standards were important enough to implement. More and more developers pushed back, various test suites were created to test the implementation of the standards and push the limits, and showcases of innovative web sites highlighted the standards-support issues. Some smaller browser developers started building "standards-compliant" browsers, and slowly but surely the focus of the browser wars changed from unique features to standards support. It eventually became "cool" for browser implementors to tout their standards compliance, and finally standards simply became an expected feature of a browser.

So how does this relate to GEDCOM? It probably doesn't, except insofar as the level of complaints about poor support will determine what, if anything, is done to rectify the situation. GEDCOM is an old standard. A very old standard. It probably needs a revamp to bring it more in line with "modern" genealogical practices, especially when it comes to the handling of sources and evidence-based genealogy. There have been several attempts to improve GEDCOM, or to develop a complete replacement, yet none of these efforts has amounted to much of note. To be honest though, it doesn't matter if one group or another comes up with a better GEDCOM unless there is buy-in from the major genealogy developers. Small developers can support all the alternative standards they want, but if the big guys don't support them, what's the point?

Inertia is a huge force in software development. Everybody supports GEDCOM because everyone else supports GEDCOM. No one bothers to implement GEDCOM properly, because no one else implements it properly. And no one bothers to change this status quo, because the users (and other developers) aren't demanding it.

For my software I have made the decision not to implement a GEDCOM import. I can get away without one because I am not writing a "family tree" tool, I am writing a source management and analysis tool. There are enough bad and inconsistent implementations out there that it would really be too much work to try to import badly formed GEDCOM files. I will (eventually) provide a GEDCOM export and I will make darned sure I get it right. (Or at least as right as I can make it.) I will also be exporting my data in a variety of other formats, including exposing a web service or API for other developers to work with.
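For the curious, GEDCOM itself is just a plain-text format of "level tag value" lines, so the skeleton of an export is not complicated - getting all the details right is where the work lies. A much-simplified Ruby sketch (a real export needs proper header fields, record cross-references, character-set handling and much more):

```ruby
# Build the lines for a minimal GEDCOM source (SOUR) record.
# The id and title here are just sample values.
def gedcom_source(id, title)
  ["0 @#{id}@ SOUR", "1 TITL #{title}"]
end

# Assemble a tiny GEDCOM file: header, one source record, trailer.
lines  = ["0 HEAD", "1 GEDC", "2 VERS 5.5.1", "1 CHAR UTF-8"]
lines += gedcom_source("S1", "Will of Amos Bannister, 1893")
lines << "0 TRLR"

puts lines.join("\n")
```

Easy enough to emit - the hard part, as the browser wars taught us, is making sure what you emit (and consume) actually matches what the specification says.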

Sunday, 12 April 2015

I am fail!

Bah! I fail on at least two counts: 1) I have not been keeping this blog updated; and 2) I have decided to throw out my first attempt at a genealogy log and start from scratch.

I have been working with my log tool for a little while and it just isn't cutting it for me. There were a number of problems with my initial approach, not least of which was search - it was too cumbersome to search through past research efforts to find what had been found or not found. I really need to rethink how and what data I am recording so that it will be easy to search past sessions. Using this tool for a few weeks I was able to tweak the data I was capturing for each session, so I'll use this as a proof of concept prototype and rebuild something more appropriate to my needs.

I didn't end up going down the NoSQL path, primarily because I saw no issues with a SQL solution for a project like this. SQL is one of my strengths, so why not leverage my knowledge? Yes, I do want to learn NoSQL, but I am already branching out with Rails on this project, so I don't feel the need to introduce too many new technologies at once.

So where to now? Well, I have archived the code of this attempt and will start with a clean slate. The web-based approach felt right when I was using it - I could access the app from home or out at the library on my iPad, so I got that part right at least. Responsiveness was okay too, so I will continue with Rails for Log Mark II. I have been thinking that there is a lot of overlap between a research log and a research planner, so the two might become one, however for now I will just implement the log and leave planning for a later version.

I learned quite a lot doing this first draft - I hope I get more things right the second time around. 8^)

Wednesday, 4 March 2015

Database choice - to SQL or NoSQL?

An interesting thought came to me just now. For most of my professional development career, I have been an SQL adherent. I have worked on some large and small SQL projects on a variety of platforms. I understand and am comfortable with SQL databases and it has become second nature to turn to a SQL database when I need to store any reasonable amount of data. Performance tuning and optimisation are two of my strong suits and over the years I have been called in to help speed up and clean up SQL databases. So it was natural that when I started developing my genealogy tools I turned to an SQL database for my data store. For the genealogy log I am using MySQL for local development and PostgreSQL on the server. But are these the right choices?

I was recently approached by the University of Illinois at Urbana-Champaign (UIUC) to be a Community Teaching Assistant (CTA) for two of their online courses on Coursera, one on Data Mining and the other on Cloud Computing Concepts. The cloud computing course has been quite interesting and has given me a better understanding of the theory and internals of NoSQL databases. In some respects, some NoSQL approaches are not too dissimilar to systems that I have worked on in the past - key-value stores, column-based databases, and graph databases. It would be an interesting exercise to put some of the theory I have been learning into practice and try using a NoSQL database (or two?) for my genealogy tools. The UIUC cloud computing course has been quite theoretical and is just an introduction to the concepts of cloud computing, but it has whetted my appetite.

Would a NoSQL database make sense for a genealogy log? What about a source/evidence management app? What about a full-blown genealogy tool? If I decide to switch to NoSQL, which flavour? Maybe a mix of databases would be best. A key-value store might work well for the log. A column-based database might fit well with storing claims extracted from sources. Linking claims together to provide evidence for events and people could be modelled with a graph database. A family tree or pedigree seems like a natural fit for a graph database, too. So many possibilities!
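As a toy illustration of why a pedigree maps so naturally onto a graph: people are nodes, parent-child links are edges, and walking up the tree is just following edges. This little in-memory Ruby sketch (the class and the names are all made up; a real graph database would replace the hash) shows the idea:

```ruby
# A pedigree as a directed graph: each child points to its known parents.
class PedigreeGraph
  def initialize
    @parents = Hash.new { |h, k| h[k] = [] }  # child => [parents]
  end

  def add_parent(child, parent)
    @parents[child] << parent
  end

  # Walk upwards, collecting every known ancestor of a person.
  def ancestors(person)
    @parents[person].flat_map { |p| [p] + ancestors(p) }
  end
end

tree = PedigreeGraph.new
tree.add_parent("Amos", "John")
tree.add_parent("John", "William")
tree.ancestors("Amos")  # => ["John", "William"]
```

Queries like "all descendants of X" or "how are these two people connected?" are exactly the traversals graph databases are built for, which is what makes them so tempting here.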

I know one system I would love to use for these tools - it was a cross between a column-based database and a graph database, but it was big and clunky and sat uneasily on top of a SQL database. We effectively created what we called a "universal data model" in SQL. It was very flexible and powerful, but wasn't exactly the easiest of systems to understand and could be very cumbersome to use. I might have to do a little research into the current crop of NoSQL offerings to see if there is something that might be a close match.

If anyone has any experience with NoSQL systems, feel free to chime in. ;^)

Monday, 2 March 2015

Update and plans

My transcription tasks have taken longer than anticipated, but I have finally completed the two wills I recently received. I now have an idea for a transcription tool that would (semi-)automatically scroll the original document in a frame/window while you type into a lower frame, with a highlight bar showing the line you are working on. I doubt this would be suitable for a web app, so it would more likely be a desktop app should I decide to move forward with the idea.

I have also been jotting down ideas for the real meat of my genealogy app. I am not yet ready to share details of what I have in mind, but the picture of what I want is solidifying. I find that when I can picture a new project in my mind as a completed product, it means I am on the right track. Sometimes it comes to me in a dream where I am using the app, and that is what happened a short while ago: I saw my app and how it worked. I want to mock something up tomorrow to make sure it feels right and also so I don't forget how I am currently picturing it.

Once I finish my mock app (which should only take a day or so) I can get back into my research log. I really want to get this finished and ready for testing before I move on to a research plan tool. The research plan and research log tools will work closely together as they are really both sides of the same coin. The sooner I finish these two tools, the sooner I can get started on the larger plan. 8^)

Sunday, 15 February 2015

Bah! Temporary hold

The genealogy log is temporarily on hold for a week or so while I transcribe some documents for my father. We recently received a couple of wills and a book about our ancestral hometown, and dad wants me to transcribe (and translate) the wills so he can read and understand them better. This is something I knew I would have to do and had been trying to put off until later, but a grumpy dad is not something I want to deal with at the moment. ;^)

Friday, 6 February 2015

Initial progress report for the genealogy log tool

I will prolly copy over any previous development blog posts from my other blog, Tracing the Footsteps of Amos, just so I can keep everything in one place, but for now I'll give a brief run-down of my progress to date. (BTW, this domain name is only a couple of days old and I already have received several emails from spammers offering to "register" my site with "all the major search engines" for a modest fee. How thoughtful of them!)

I have decided that the first tool I'll create will be a web-based genealogy log tool. I have a few spreadsheet templates that I had been using for my own log, so the initial feature set will be based on a mix of these. Users will be able to register with the site; once registered (and their email address is verified) they can log in and will be presented with a list of previous research sessions and a fresh, blank session ready to fill in. Each session will record the location of the research session (i.e., local library, archives, or a website) along with the date, time and a session goal. Within each session the user can record what searches they have undertaken and the results of those searches. Sessions and search results will be searchable, so users can quickly check whether they have performed the same search previously and, if so, what the results were.
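To show the shape of the data I have in mind - the real app will use Rails models, so these plain Ruby structs and sample values are just placeholders - a session might boil down to something like this:

```ruby
# Placeholder data shapes for the research log (the Rails models will differ).
Search  = Struct.new(:terms, :result, keyword_init: true)
Session = Struct.new(:location, :date, :goal, :searches, keyword_init: true)

sessions = [
  Session.new(location: "local library", date: "2015-02-01",
              goal: "Find Amos Bannister's burial record",
              searches: [Search.new(terms: "Amos Bannister burial 1890-1900",
                                    result: "no matches in parish registers")])
]

# The key feature: have I run a search like this before, and what happened?
def previous_searches(sessions, keyword)
  sessions.flat_map(&:searches).select { |s| s.terms.include?(keyword) }
end

previous_searches(sessions, "Bannister").map(&:result)
# => ["no matches in parish registers"]
```

Capturing the *negative* results is the whole point - knowing a search came up empty last year saves repeating it.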

One thing I am hoping to include in this tool is a bit of smarts for reverse-engineering search terms for the major genealogy websites. This will allow the user to simply copy and paste the search URL from Ancestry.com for example, and the log will attempt to pull out the exact search criteria used. Of course the tool will also be capturing as much source citation information as possible, to help make relocating the results as easy as possible.
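The guts of that reverse engineering is really just query-string parsing. A first stab in Ruby, with made-up parameter names standing in for whatever keys each site actually uses (every site would need its own mapping from query keys to friendly labels):

```ruby
require "uri"

# Pull the raw search criteria back out of a pasted search URL.
# The query keys in the example are purely illustrative.
def extract_search_terms(url)
  query = URI.parse(url).query or return {}
  URI.decode_www_form(query).to_h
end

extract_search_terms("https://example.com/search?gsln=Bannister&gsfn=Amos")
# => {"gsln"=>"Bannister", "gsfn"=>"Amos"}
```

The fiddly part won't be the parsing, it will be keeping the per-site key mappings up to date as the big providers tweak their search pages.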

I am developing this tool using Ruby on Rails, mainly as an excuse to roll up my sleeves and get my hands dirty on a Rails project. I have the initial sign up and login pages working and the session list is part-way complete. My aim for a usable beta version is sometime in early March, so I will be putting out a call for volunteers to test the app in about a month.

Thursday, 5 February 2015

Welcome to the Geneatools.com blog!

Welcome to the Geneatools.com blog! On this blog I will be discussing the Geneatools.com suite of genealogical research tools, how they are used, how they are being developed and any plans to extend or enhance the range of tools available.

I have had an interest in genealogy for over 10 years, but the bug really bit about four years ago. I started finding my way in the world of genealogy by searching the internet for interesting resources, before discovering Ancestry.com, FamilySearch.org and a host of other great sites. Most of my early "research" was merely collecting names and dates, with very little (okay, no) rigour as I was excited at the prospects of filling in the boxes in my family tree software. As my tree grew I started to notice some irregularities in my data, for example my cousin had one date of death for an ancestor, but I found a completely different one - which of us was correct? I reached a point where my tree had diverged slightly from my cousin's tree and more significantly from other trees I had discovered on the internet.

Eventually I decided "enough is enough", put my tree aside and started from scratch, this time trying to be a bit more diligent with my data. I kept using family tree software, however. The second attempt at my family tree went somewhat more smoothly, and I was able to collect some more data that helped me deal with the inconsistencies I found, but I was becoming increasingly frustrated with the software I was using. I felt constrained by the Person -> Event -> Source structure of the various family tree apps I tried and none of them seemed to be particularly useful when dealing with conflicting data.

One particular niggle was that traditional family tree software appeared to have no way of dealing with "working data". I found several source documents referring to the death of a potential ancestor, but I couldn't pinpoint exactly which ancestor they referred to - was it my gggg-grandfather, or his son, or his nephew? Using traditional family tree software, I would attach this event to multiple people, but it was messy and when I refined my data more and found more sources, it was a pain to delete the event from people and add new data, not to mention the history of my research was being lost - sure I could delete the "wrong" event from a person, but where was the traceback showing that I had considered this possibility?

Then there was the fact that I was amassing a large collection of source documents - parish register scans; census documents; birth, marriage, death certificates; newspaper articles; etc - but really had no way of managing these documents, nor any way of managing the data I could extract from these documents.

I was starting to get very fed up with the research tools provided by traditional family tree software vendors. Really, most of the research is done outside the software and all the family tree software expects is for you to enter names, dates and places. I thought there must be a better way.

So I started to embark on a self-education campaign - trying to learn as much as I could about "real" genealogy. During this time I learned about proper citation of sources (something I had been incredibly lax about), the genealogical proof standard and "evidence-based genealogy". I had been thinking "there must be a better way" and now I had discovered that there was! What I had been doing up until then was "conclusion-based genealogy" or "person-based genealogy", where the focus is on filling in the boxes in your family tree software, but what I wanted to do was "evidence-based genealogy" or "source-based genealogy", where the sources become the star and are analysed and interpreted to prove the conclusions you make about the people in your tree. Weighing up various pieces of evidence, determining the likely accuracy or veracity of the information in your sources, considering conflicting information and coming up with sound reasoning for your conclusions sounded like a better way of doing research.

Armed with a new way of thinking, I decided it was time to restart my restart of my tree, only this time with a lot more rigour. Being a software developer, I also decided to scratch an itch and create my own tools to help with my research efforts. Tools that will help, not hinder, rigorous genealogical research. I have in mind a whole suite of tools, ranging from research planners and logs to source analysis tools and source transcription and management tools. Over time I will reveal my plans for each of these tools, but for now I am starting with a simple research log tool.

My genealogy research log tool will be a web-based app that will help users track their research sessions. I am developing the tool using Ruby on Rails and hope to have a preview version available in a month or so (time willing); it will be hosted on the Geneatools.com website. When the log tool is finished I will start on another tool in the suite, and eventually I hope to have a fully featured toolset that other genealogists can use to aid their own research.