Drupal is my database -- importing 80,000 convicts with Table Wizard and Migrate

17 Feb 2010

One of the goals of the Founders and Survivors project (the one that concerns me the most) is to compile and publish data about the Van Diemen’s Land convicts from a variety of sources, and to link or cross-reference records from different sources that relate to the same individual (or, potentially, the same family). Our sources include convict indents carried on the convict ships, conduct records, police gazettes, and registers of births, deaths and marriages. In addition, we have been collecting biographies and family histories submitted by descendants of convicts, which are valuable sources of information about the lives of convicts after they left the penal system.

Biographies have been submitted by members of the public through our Drupal website as a custom content type using the CCK module. The other data exists in paper records which are being transcribed and in some cases photographed. Transcriptions have been entered in a variety of desktop-oriented data collection systems, including FileMaker Pro, Microsoft Access and Microsoft Excel. My goal is to make this data available to other researchers and members of the public in open formats and through the web. As I have got to know Drupal better, I have come to see that it is a powerful way of representing structured data as well as free-form documents on the web, and it can also handle the privacy and access control we need to prevent unethical use of data.

The Archives of Tasmania have an online Index to Tasmanian Convicts and I was given a copy of this data in Microsoft Access. This index will form the basis of the entire Founders and Survivors database, so that a record of a convict in the index will contain links to all of the other data we have on that individual. After exporting the index data to a comma-separated text file (so that I don’t have to touch Access again) I hacked together a Perl script to extract data from this file and insert it directly into the node and content tables in Drupal’s database. It looked as if the script did what I wanted it to do, but I wasn’t confident about it and didn’t run it on our production database before I left for LCA2010 and DrupalSouth, where Angie Byron gave me a better solution. (Why yes, I do lose years off my life every time I attempt a bulk migration!)

Angie’s article is a good tutorial on using the Table Wizard and Migrate modules as a generic data migration process. This is the full process I needed to set up migration for the Archive index (which is still underway – 80,000 records don’t appear instantaneously).

First I created a content type in Drupal with the same data structure as the index. Every node must have a title, and I wanted the index records to have a composite title containing the convict’s index number and name and the name of the ship on which he or she arrived in Van Diemen’s Land. My first solution was to use the Automatic Nodetitles module, which did just this for manually entered records. However, after running the migration process on a sample, I found that every record in a migrated batch was given an automatic title made up of components of the first record in the batch. (I should probably report this as an issue and even try to fix it, but I needed to find a more immediate solution.) Instead I used the Rules module (which, like Automatic Nodetitles, depends on the Token module) to update the title of each node after it is created.
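To make the title logic concrete, here is a minimal sketch in Python rather than in Drupal itself. It is not the Rules configuration I used; the field names (`index_number`, `name`, `ship`) and the sample records are made up for illustration. The point is only that the title has to be computed from each record's own fields, not once per batch.

```python
def composite_title(index_number, name, ship):
    """Join index number, name and ship into one title, skipping blank parts."""
    parts = [str(index_number).strip(), name.strip(), ship.strip()]
    return " ".join(part for part in parts if part)


# Made-up sample records; the real index uses its own column names.
records = [
    {"index_number": 101, "name": "Smith, John", "ship": "Example Ship"},
    {"index_number": 102, "name": "Doe, Jane", "ship": ""},
]

# Each title is built from that record's own fields.
titles = [composite_title(**record) for record in records]

for title in titles:
    print(title)
```

A record with no ship recorded simply gets a shorter title rather than a trailing space.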

[Image: rule to update titles]

Table Wizard makes any database table available to Drupal Views. It looks for tables in the default Drupal database (or another database) so I dumped the contents of the CSV file into its own table in the database. The Table Wizard administration page admin/content/tw lists tables managed by Table Wizard and other tables in the database that can be added to Table Wizard. Each exposed table has an analysis link which provides information about each field.
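For anyone wondering what "dumping the CSV into its own table" looks like, here is a rough sketch. It uses Python's built-in `sqlite3` as a stand-in so that it runs anywhere; it is not the database or the tool I actually used, and the table name `convict_index`, the column names and the two sample rows are all invented for the example.

```python
import csv
import io
import sqlite3

# Invented sample data standing in for the exported index file.
SAMPLE_CSV = """index_number,name,ship
101,"Smith, John",Example Ship
102,"Doe, Jane",Another Ship
"""


def load_csv(conn, table, csv_file):
    """Create a table from the CSV header row and insert every data row."""
    reader = csv.reader(csv_file)
    header = next(reader)
    # Every column is stored as text; types can be tightened up later.
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    rows = list(reader)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
count = load_csv(conn, "convict_index", io.StringIO(SAMPLE_CSV))
print(count, "rows loaded")
```

Using the `csv` module rather than splitting on commas matters here, because names in the form "Surname, Forename" contain commas of their own.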

[Image: table analysis]

If you only want to view data in an external table, Table Wizard will suffice. The Migrate module is used to map the structure of the external table to data structures within Drupal and import the data into Drupal so that it can be searched, viewed and modified like any other Drupal node. The Migrate Extras module is needed to migrate to fields created in CCK. Under ‘Add a content set’ on the Migrate dashboard admin/content/migrate, I selected the index record type as the destination and the Table Wizard table as the source. Clicking on this content set allows me to map the source to destination fields and change other migration settings.

[Image: content set]

There are two ways to execute the migration: from the Migrate dashboard, or using drush, the Drupal Shell. From the dashboard I can import samples of data or clear all imported data; this is useful for testing the import settings.

[Image: migrate dashboard]

By default, new nodes imported by Migrate are not published. After testing the migration process on small samples, I cleared the imported nodes, went back to edit the content set and set the default value for Node: Published to 1 so that imported nodes are published.

The index data is now being migrated at a rate of about 12 records per minute. It runs out of memory after (usually) 136 records and attempts to start a new batch, but terminates instead. (These issues require more investigation on my part. It seems to be related to drush permissions.) I am running drush migrate from a cron job so that the migration process can continue unattended. At this rate the index might all be online in a week.
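The "week" estimate is a back-of-envelope figure. The sketch below works it through using only the numbers quoted above (80,000 records, about 12 per minute, about 136 per batch), and it assumes the migration runs continuously, which mine does not because each batch has to be restarted by cron.

```python
import math

TOTAL_RECORDS = 80_000
RECORDS_PER_MINUTE = 12
RECORDS_PER_BATCH = 136


def estimate(total, per_minute, per_batch):
    """Return running time and batch count for an uninterrupted migration."""
    minutes = total / per_minute
    return {
        "hours": minutes / 60,
        "days": minutes / (60 * 24),
        "batches": math.ceil(total / per_batch),
    }


result = estimate(TOTAL_RECORDS, RECORDS_PER_MINUTE, RECORDS_PER_BATCH)
print(f"{result['hours']:.0f} hours, {result['days']:.1f} days, "
      f"{result['batches']} batches")
```

That comes to a little under five days of pure import time spread over nearly six hundred batches, so a week is plausible once the gaps between cron runs are added in.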

Humanities/Drupal talk for LUV March 2010

04 Feb 2010

Some thoughts:

Mashup of LCA2010 and DrupalSouth talks

What I did on my holidays: NZ/LCA/Drupal

Learning about Drupal modules

Unlocking the ivory tower; we are all historians

History: boring –> relevant

Digitisation, analysis, collaboration

‘Libraries’ and ‘museums’ (not such a huge difference between literary and historical archives?)

FAS: paper databases

Drupal (yay for tw/migrate)

Highlights of LCA2010 and DrupalSouth

31 Jan 2010

I am still buzzing one week after my happiest linux.conf.au ever, which was followed by the immensely rewarding DrupalSouth; but I go back to work tomorrow, and while I’ll be putting the fruits of the previous weeks to good use, this may be my last chance in a while to reflect on the highlights of the conferences and what they indicate to me about the current state of the free software community.

And that is one of the key elements that I picked up this year… this is a tightly bound and intricately networked community. I guess I did recognise this at previous LCAs, but on those occasions I still felt very much outside the community. I did participate in free software gatherings and had friends and supporters, but there was a sense of disconnection between this community and my day-to-day work and study. The gap has narrowed drastically now – not just in reality, but in my perception of it – so now I can say that I really belong to this community (even if I do sit in a somewhat unusual or eccentric corner of it).

I’ve been telling my academic and church friends and acquaintances that I attended a conference, which might be misleading, because events like LCA are not exactly conferences as academics know them. They are run by volunteers, attended by people who contribute to free software as volunteers (even if some of them also get paid for this work), and the informal ‘hallway track’ and social events are at least as important as the formal programme. LCA reminds me that using and contributing to free software isn’t just a job or hobby, it’s not even ‘just’ a philosophy; for many people it is a way of life.

(Church geeks can probably understand this… it’s another area where both ‘professionals’ and ‘amateurs’ participate in intense, demanding, emotionally draining and rewarding activities that may not seem to bring any financial or material gain, and are difficult to explain or justify to people outside the community.)

For me, the running highlight of LCA2010 (including the Linuxchix miniconference) and DrupalSouth¹ was meeting, hearing, socialising and hacking² with some of the women who have been contributing to the development of Linux and free software over the last ten years, and raising a fuss about the place of women in this community. I first attended LCA in 2007, partly because, for the first time, a Linuxchix miniconference was part of the programme. That year, a psychological barrier was broken: 10% of attendees were women. That proportion has been steadily increasing, and an unofficial estimate for 2010 is 15% female attendance. (It was also a noticeably child- and family-friendly event.) This is still an unjustifiably male-dominated field, but to me it felt that we were no longer a painful minority. We still need to work for greater equality and inclusion, but we also have many successes to celebrate.

I was most inspired by meeting and hearing from: Liz Henry, who has been fighting this fight for years, and still maintains the rage while also radiating joy and compassion; and Angela Byron (webchick), whose enthusiasm for Drupal is infectious (which is just as well, as she is coordinating the upcoming Drupal 7 release). It was also great to meet (even if briefly) other women such as Emma Jane Hogbin and Selena Deckelmann, and to reconnect with other Australian Linuxchix.

Particular highlights of the programme were:³

Liz’s talks on Code of our own (i.e. why hacking is still a feminist issue) and on assistive technology (which brought home to me that hacking is ultimately not really about computers, it’s about finding resourceful solutions to problems of any kind).
Seeing Angela demonstrate Drupal 7 twice, once in an LCA tutorial and again at DrupalSouth.
Emma Jane Hogbin’s talk on version control, followed by Sara Falamaki’s on happy hackers, underscored how much of a difference one’s tools and working environment can make to one’s creativity and efficiency. For the seven-odd years before I moved to my current job, I was forced to develop database and statistical analysis systems with proprietary software mandated by my employers. It has made a huge difference to be able to research and use the best tools for my job. A happy hacker can choose her own tools.

Other good talks included Tim McNamara’s timely presentation on the Sahana disaster management system being used in Haiti, and Paul Fenwick’s geek standup comedy routine on the world’s worst inventions.

Other people’s highlights that I missed include:

Angela’s talk on getting your feet wet in contributing to free software, which would probably have been useful and timely for me, but as there were seven miniconfs running in parallel on both Monday and Tuesday, schedule clashes were inevitable – on Monday I mostly attended the Linuxchix miniconf, but sometimes ducked out to attend talks on business and graphics.
The hackfest following the Linuxchix miniconf, to work on the http://geekspeakr.com/ website. On Monday I was still hesitant about LCA burnout and was pacing myself, so I had a quiet dinner with a friend and an early night. I think I missed a great hackfest and might even have been able to contribute something, but five late nights in a row would have had other negative consequences.
I managed to miss two talks on documentation – by Lana Brindley and Emma Jane Hogbin – due to schedule clashes. The Friday after-lunch slot was particularly galling as Emma Jane (documentation) was scheduled against Liz (assistive tech); I was not the only one who was annoyed about not being able to get to both. On the other hand, the fact that there were so many women giving such a diverse range of talks is something to celebrate.

Our story continues on the Geek Feminism wiki and blog.

¹ One followed immediately after the other, and quite a few of the speakers and attendees at DrupalSouth had been to LCA, so I’m mentally conflating the two.
² That is, ‘doing interesting and creative things with technology’, not ‘breaking into other people’s computers’.
³ It’s late and I haven’t put up links to individual abstracts. You can find them on the programmes for the Linuxchix miniconf, LCA2010 main programme and DrupalSouth.

Claudine Chionh
