<?xml version='1.0' encoding='UTF-8'?><rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/" xmlns:blogger="http://schemas.google.com/blogger/2008" xmlns:georss="http://www.georss.org/georss" xmlns:gd="http://schemas.google.com/g/2005" xmlns:thr="http://purl.org/syndication/thread/1.0" version="2.0"><channel><atom:id>tag:blogger.com,1999:blog-1144483264485712287</atom:id><lastBuildDate>Fri, 01 Nov 2024 10:35:25 +0000</lastBuildDate><category>Functionality</category><category>Implementation</category><category>Development</category><category>Finding Aids</category><category>Performance</category><category>Tips and Tricks</category><category>Collections Collaborative</category><category>General</category><category>Reporting</category><category>Subjects</category><category>Sustainability</category><title>AT @ Yale</title><description>one repository’s experience implementing the Archivists’ Toolkit</description><link>http://atatyale.blogspot.com/</link><managingEditor>noreply@blogger.com (Daniel Hartwig)</managingEditor><generator>Blogger</generator><openSearch:totalResults>28</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1144483264485712287.post-9106254565351759659</guid><pubDate>Fri, 11 Jun 2010 15:47:00 +0000</pubDate><atom:updated>2010-06-23T07:10:22.885-07:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Functionality</category><category domain="http://www.blogger.com/atom/ns#">Reporting</category><category domain="http://www.blogger.com/atom/ns#">Tips and Tricks</category><title>Connecting Excel to AT/MySQL</title><description>Given the problems associated with creating and running custom Jasper reports within the AT, we decided to simply use Excel as a front-end application for data analysis. 
The beauty in this approach is that data can be pulled from the AT tables, or from MySQL queries, and then further analyzed, graphed, etc., in Excel. Pivot tables can even be used to create more robust reporting options. What&#39;s more, with a little know-how, you can even use Excel for batch updates.&lt;br /&gt;&lt;br /&gt;Here&#39;s how to connect Excel to the AT:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Download and install the ODBC driver for MySQL here: &lt;a href=&quot;http://dev.mysql.com/downloads/connector/odbc/3.51.html&quot;&gt;http://dev.mysql.com/downloads/connector/odbc/3.51.html&lt;/a&gt;. &lt;/li&gt;&lt;li&gt;Configure the ODBC Data Source (Control Panel &gt; Administrative Tools &gt; Data Sources (ODBC) &gt; Add &gt; MySQL ODBC 3.51 Driver)&lt;/li&gt;&lt;li&gt;Enter your AT database connection settings&lt;/li&gt;&lt;li&gt;Open Excel&lt;/li&gt;&lt;li&gt;Click on Data &gt; From Other Sources &gt; From Data Connection Wizard &gt; ODBC DSN &gt; [the database/connection you just set up in step 3] &lt;/li&gt;&lt;li&gt;Select what you want to import&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;Voila! It&#39;s as easy as that. &lt;/p&gt;Right now we&#39;re working on constructing more complicated MySQL queries typically used for end-of-year analysis and reporting, which we will then similarly connect to Excel to allow our staff to manipulate the data as needed. Here is where a little MySQL knowledge goes a long way!&lt;br /&gt;&lt;br /&gt;In the end, we thought this was a much more efficient and effective means of providing our staff with customizable reporting functionality from the AT. For one, our staff are more comfortable with, and regular users of, Excel. Second, there are far more documentation and training options available for Excel and its many advanced features.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;UPDATE (6/23/2010)&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;We ran into a bit of a dead end trying to connect custom MySQL queries to Excel.
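As a sketch of the sort of custom query involved (sqlite3 stands in for MySQL here so the example is self-contained; the Resources table and column names follow our AT database as referenced elsewhere on this blog, but treat them as assumptions):

```python
import sqlite3

# Stand-in for the AT MySQL database so this sketch runs anywhere; a real
# run would go through the ODBC DSN configured above. The table and column
# names (Resources, eadFaUniqueIdentifier, findingAidStatus) are assumptions
# based on our own AT database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Resources (resourceId INTEGER, "
             "eadFaUniqueIdentifier TEXT, findingAidStatus TEXT)")
conn.executemany("INSERT INTO Resources VALUES (?, ?, ?)", [
    (1, "mssa.ms.0001", "Completed"),
    (2, "mssa.ms.0002", "In_process"),
    (3, "mssa.ru.9019", "Completed"),
])

# The kind of summary staff could then pivot or graph in Excel:
rows = conn.execute(
    "SELECT findingAidStatus, COUNT(*) FROM Resources "
    "GROUP BY findingAidStatus ORDER BY findingAidStatus"
).fetchall()
print(rows)  # [('Completed', 2), ('In_process', 1)]
```

The same SELECT, pasted into a Navicat query, produces a table ready for further analysis.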
It was easy enough to connect the tables, but connecting to custom queries proved troublesome. Instead, we are now using SharePoint, which does allow you to create custom SQL queries and stored procedures. If interested, more information on how to set this up can be found here: &lt;a href=&quot;http://office.microsoft.com/en-ca/sharepoint-designer-help/add-a-database-as-a-data-source-HA010100908.aspx#BM6&quot;&gt;http://office.microsoft.com/en-ca/sharepoint-designer-help/add-a-database-as-a-data-source-HA010100908.aspx#BM6&lt;/a&gt;.</description><link>http://atatyale.blogspot.com/2010/06/connecting-excel-to-atmysql.html</link><author>noreply@blogger.com (Daniel Hartwig)</author><thr:total>4</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1144483264485712287.post-613877332123667135</guid><pubDate>Wed, 02 Jun 2010 16:29:00 +0000</pubDate><atom:updated>2010-06-02T10:14:53.786-07:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Finding Aids</category><category domain="http://www.blogger.com/atom/ns#">Functionality</category><title>PDF &amp; HTML Stylesheets for Resource Reports</title><description>As with perhaps most other users, we wanted to devise a means for customizing the AT&#39;s EAD output to comply with our local EAD implementation guidelines. Mike Rush at the Beinecke created and maintains a stylesheet that converts our AT EAD output to conform with Yale&#39;s EAD best practices. This now works very well for us. We export EAD, apply the transformation, validate it, and then save for upload to our finding aids database. That is all well and good. Our staff, though, wanted to be able to perform a similar operation, really a print preview, when editing a resource. This was possible in our previous EAD editor (XMetaL) thanks to built-in customizations. Although the AT&#39;s PDF &amp;amp; HTML resource reports do allow for such a print preview, many of our staff wanted a finding aid that looked more like our own. 
Thankfully, the AT allows you to swap stylesheets (see the &lt;a href=&quot;http://archiviststoolkit.org/faq&quot;&gt;AT&#39;s FAQ&lt;/a&gt; &gt; Import &amp;amp; Export Questions for instructions) to address such needs. We found a few problems, however, that you may need to take into consideration when swapping stylesheets.&lt;br /&gt;&lt;br /&gt;First, make sure to check the path for any &amp;lt;xsl:include&amp;gt; or subordinate stylesheets you utilize. If you&#39;re saving your stylesheets in the AT&#39;s report folder (i.e. C:\Program Files\Archivists Toolkit 2.0\reports\Resources\eadToPdf) make sure to use the path \reports\Resources\eadToPdf\[filename]. Otherwise, if you&#39;re pointing to a URL, make sure to use the full URL. This was all that was needed to make our PDF stylesheets run in the AT.&lt;br /&gt;&lt;br /&gt;Second, especially for HTML stylesheets, make sure that any parameters specified include default values. Missing defaults were what caused errors for us.&lt;br /&gt;&lt;br /&gt;With these two simple tweaks we are now able to apply our PDF and HTML stylesheets when staff generate resource reports. Ideally, we would like to apply our AT to Yale BPG stylesheet prior to running the resource PDF &amp;amp; HTML reports, perhaps via a plug-in. I&#39;m sure others would like to modify this process as well. 
For the time being, though, we&#39;re satisfied with the current output, which allows our staff to easily preview their work in a format similar to our own finding aids.</description><link>http://atatyale.blogspot.com/2010/06/pdf-html-stylesheets-for-resource.html</link><author>noreply@blogger.com (Daniel Hartwig)</author><thr:total>1</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1144483264485712287.post-2763968390832106620</guid><pubDate>Mon, 24 May 2010 20:20:00 +0000</pubDate><atom:updated>2010-06-23T07:24:30.749-07:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Development</category><category domain="http://www.blogger.com/atom/ns#">Finding Aids</category><category domain="http://www.blogger.com/atom/ns#">Functionality</category><category domain="http://www.blogger.com/atom/ns#">Performance</category><title>Command Line Interface (CLI)</title><description>We have been testing the developer preview of AT Version 2.0 Update 6, which now features support for the command line interface. Based on work from Cyrus Farajpour at USC, the new version and corresponding ATCLI plug-in developed by Cyrus and Nathan Stevens allow for development of handy command line instructions such as batch export of resources--something we and perhaps others with lots of finding aids are excited about. After initial testing, the plug-in has been modified to address memory consumption issues (a new session needs to be opened for each record; otherwise the records get cached and use up memory) and allows for specification of the output directory.&lt;br /&gt;&lt;br /&gt;Our first batch export test of our 3000+ resources was performed on my desktop (2.99 GHz processor, 2 GB RAM) under normal workload. It took approximately 96 hours for the process to complete! 
Given these results, we&#39;ve tweaked our approach quite a bit, running it from a better computer and adjusting the amount of memory assigned to the atcli.lax file (see &lt;a href=&quot;http://archiviststoolkit.org/faq&quot;&gt;AT&#39;s FAQs&lt;/a&gt; for instructions). This seems to help a bit. The main problem for us, though, is the sheer number and size of our finding aids. No matter what we do, it&#39;s going to take some time. In addition, we have several finding aids that don&#39;t need to get exported. Ideally, we&#39;d like to be able to select which finding aids to export (i.e. those not marked internal only or those updated since a given date). Perhaps further modifications to the plug-in will allow such customization.&lt;br /&gt;&lt;br /&gt;For the time being, we&#39;d like to thank Cyrus and Nathan for developing this key tool and look forward to working with others to improve its design and functionality.&lt;br /&gt;&lt;br /&gt;&lt;span style=&quot;font-weight:bold;&quot;&gt;UPDATE&lt;/span&gt; (6/23/2010)&lt;br /&gt;Good news. Cyrus has modified the &lt;a href=&quot;http://github.com/smoil/AT-CLI-Export-Plugin/raw/master/CLIExportPlugin.zip&quot;&gt;command line plug-in&lt;/a&gt; to incorporate parameters, allowing you to specify which subset of resources/EAD you want to export from the AT. The parameters include: findingAidStatus, author, eadFaUniqueIdentifier, resourceIdentifier1, resourceIdentifier2, resourceIdentifier3, resourceIdentifier4, internalOnly, and lastUpdated. Plus, you can use any combination of these to easily fine-tune the export process as necessary. 
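To illustrate how those parameters combine, here is a sketch of the selection logic only (not the plug-in's actual code; the toy records and field values are made up, though the field names mirror the parameter names above):

```python
from datetime import date

# Toy records standing in for AT resources; field names mirror the plug-in's
# parameter names (resourceIdentifier1, internalOnly, lastUpdated), but the
# data and this function are illustrative only.
resources = [
    {"resourceIdentifier1": "MS 1", "internalOnly": False, "lastUpdated": date(2010, 6, 1)},
    {"resourceIdentifier1": "MS 2", "internalOnly": True,  "lastUpdated": date(2010, 6, 10)},
    {"resourceIdentifier1": "RU 9", "internalOnly": False, "lastUpdated": date(2009, 12, 31)},
]

def to_export(records, since):
    # Keep records that are not marked internal only and were updated since a date.
    return [r["resourceIdentifier1"] for r in records
            if not r["internalOnly"] and r["lastUpdated"] >= since]

print(to_export(resources, date(2010, 1, 1)))  # ['MS 1']
```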
For more information on setting up and using the plug-in check out his github site: &lt;a href=&quot;http://github.com/smoil/AT-CLI-Export-Plugin&quot;&gt;http://github.com/smoil/AT-CLI-Export-Plugin&lt;/a&gt;</description><link>http://atatyale.blogspot.com/2010/05/command-line-interface-cli.html</link><author>noreply@blogger.com (Daniel Hartwig)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1144483264485712287.post-1171133379780499728</guid><pubDate>Fri, 12 Mar 2010 20:26:00 +0000</pubDate><atom:updated>2010-03-12T13:41:25.202-08:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Functionality</category><category domain="http://www.blogger.com/atom/ns#">Tips and Tricks</category><title>Locations</title><description>Creating locations in the AT is fairly straightforward, especially for simple numeric ranges. AT&#39;s batch location creation tool is very handy here. Creating more complex locations, especially those with alphanumeric components, however, cannot be done in the AT. This was unfortunate for us because although the vast majority of our material is stored offsite, the number of distinct onsite locations we have is still close to 3000 given the unique combination of rooms, ranges, sections, and shelves. Rather than try to generate these manually in the AT, I chose to create them in Excel and then paste directly into the AT tables. Here&#39;s how I did it.&lt;br /&gt;&lt;br /&gt;First (after having surveyed all of our locations to come up with room numbers, ranges, sections, and shelves), I opened the LocationsTable in Navicat and copied the last record into Excel. [note: we had entered in some locations already.] I then used this record to structure and format subsequent location entries and assign sequential locationIds. Excel&#39;s fill-in feature is extremely helpful here, especially for alphanumeric data. 
Simply typing in 1A1, for example, and then dragging the fill-in cross-hairs to the appropriate point, allows you to quickly generate values 1A1, 1A2,...1AX. Once finished adding in all your data, your spreadsheet should look like this: &lt;a href=&quot;http://www.library.yale.edu/mssa/at/documentation/locations.xls&quot;&gt;sample&lt;/a&gt; (.xls).&lt;br /&gt;&lt;br /&gt;The final step is to paste the records into the LocationsTable via Navicat. [note: as I explain in my post on creators, you need to make sure you are the only one using the AT when pasting in data or looking up ids. Again, this is to prevent overlap or duplication of work.] Once finished, check your work in the Locations module (Tools &gt; Locations), making sure you can create additional (i.e. test) locations.</description><link>http://atatyale.blogspot.com/2010/03/locations.html</link><author>noreply@blogger.com (Daniel Hartwig)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1144483264485712287.post-6750794918738544403</guid><pubDate>Thu, 11 Mar 2010 21:49:00 +0000</pubDate><atom:updated>2010-03-12T12:48:17.338-08:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Finding Aids</category><category domain="http://www.blogger.com/atom/ns#">Functionality</category><title>AT Issues: Creators</title><description>Another pesky problem we&#39;ve encountered with the AT EAD import process is that finding aids not encoded with a &amp;lt;corpname/&amp;gt; or &amp;lt;persname/&amp;gt; or &amp;lt;famname/&amp;gt; in &amp;lt;origination label=&quot;creator&quot;/&amp;gt; do not get names created with the assigned function of creator in resource records. Unfortunately for us, this amounted to the vast majority of our finding aids, some 1600+ records. Rather than fix these one at a time in the AT, we came up with the following workaround.&lt;br /&gt;&lt;br /&gt;1. 
Generate list of filenames and creators from our EAD files that lack &amp;lt;corpname/&amp;gt; or &amp;lt;persname/&amp;gt; or &amp;lt;famname/&amp;gt; in &amp;lt;origination/&amp;gt;. To do this, Mark Matienzo, our new digital archivist, wrote a script for us that parsed our EAD files and dumped this data into an Excel file.&lt;br /&gt;&lt;br /&gt;2. Format filenames as needed in separate spreadsheet (e.g. remove .xml extension, add quotes) to create list as follows:&lt;br /&gt;&lt;br /&gt;&quot;mssa.ms.0001&quot;,&lt;br /&gt;&quot;mssa.ms.0002&quot;,&lt;br /&gt;...&lt;br /&gt;&lt;br /&gt;3. Use Navicat to run query generating list of resourceIds based on formatted filenames (eadFaUniqueIdentifier in AT Resources table):&lt;br /&gt;&lt;br /&gt;SELECT eadFaUniqueIdentifier, resourceId FROM Resources WHERE eadFaUniqueIdentifier IN (&quot;mssa.ms.0001&quot;, &quot;mssa.ms.0002&quot;, &quot;mssa.ru.9019&quot;) ORDER BY eadFaUniqueIdentifier;&lt;br /&gt;&lt;br /&gt;4. Copy resourceIds back into Excel file&lt;br /&gt;&lt;br /&gt;5. Format names in separate spreadsheet (add quotes) to create list as follows:&lt;br /&gt;&lt;br /&gt;&quot;Yale University. Office of the Secretary.&quot;,&lt;br /&gt;&quot;Yale University. Office of the President.&quot;,&lt;br /&gt;...&lt;br /&gt;&lt;br /&gt;6. Create query in Navicat to retrieve nameIds for any names that did make it into the AT:&lt;br /&gt;&lt;br /&gt;SELECT nameId, sortName FROM Names WHERE sortName IN (&quot;Yale University. Office of the Secretary.&quot;, &quot;Yale University. Office of the President.&quot;, &quot;Yale University. Office of the Provost&quot;) ORDER BY sortName;&lt;br /&gt;&lt;br /&gt;7. Paste nameIds into master file with resourceIds and corpnames, famnames and persnames.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;For those records with nameIds, proceed to step 13. Otherwise, for those names not present in the name table, proceed to step 8.&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;8. 
Open Names table in Navicat and proceed to last record&lt;br /&gt;&lt;br /&gt;9. Copy last record in Names table into Excel to serve as model, noting column names. It is important to note that corpnames and persnames use different columns/fields, so make sure to examine records in Names table for formatting both persnames and corpnames.&lt;br /&gt;&lt;br /&gt;10. Copy contents of your master original file with persnames, famnames and corpnames into the new spreadsheet according to the Name table structure. Use Excel&#39;s fill-in feature to fill in data as needed and assign sequential nameIds from last record in table. Make sure to format the lastUpdated and created fields as text in Excel so as to mirror date encoding in the AT. Here is what your spreadsheet should look like: &lt;a href=&quot;http://www.library.yale.edu/mssa/at/documentation/name_additions.xls&quot;&gt;sample&lt;/a&gt; (.xls).&lt;br /&gt;&lt;br /&gt;11. Paste records into Names table using Navicat&lt;br /&gt;&lt;br /&gt;12. Copy nameIds for newly created names back into your master file of resourceIds and corpnames and persnames&lt;br /&gt;&lt;br /&gt;13. Open ArchDescriptionNames table in Navicat&lt;br /&gt;&lt;br /&gt;14. Go to last record and copy into new Excel file to serve as model for formatting data&lt;br /&gt;&lt;br /&gt;15. Copy contents of master file with resourceIds, corpnames, famnames, and persnames, and nameIds into Excel file mirroring the structure of sample record. Use Excel&#39;s fill-in feature to format data and assign sequential ids from last record in table. Here is what your spreadsheet should look like: &lt;a href=&quot;http://www.library.yale.edu/mssa/at/documentation/names_add.xls&quot;&gt;sample&lt;/a&gt; (.xls).&lt;br /&gt;&lt;br /&gt;16. Paste contents from Excel file into ArchDescriptionNames table using Navicat&lt;br /&gt;&lt;br /&gt;A couple of caveats. 
First, it is important that these steps be taken by someone comfortable with the aforementioned programs, as well as MySQL. Second, and most important, id creation and pasting of data directly into the AT tables should be done &lt;strong&gt;when no one else is using the AT&lt;/strong&gt; to prevent accidental overlap or duplication of work. Finally, make sure to test the results, including creating new names and resource creator links in the AT client just to make sure everything is OK. If something does go haywire, you can always delete the records from the AT tables you just pasted in.</description><link>http://atatyale.blogspot.com/2010/03/at-issues-creators.html</link><author>noreply@blogger.com (Daniel Hartwig)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1144483264485712287.post-4506953501054480708</guid><pubDate>Tue, 23 Feb 2010 00:32:00 +0000</pubDate><atom:updated>2010-02-22T18:24:01.095-08:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Implementation</category><title>AT Transition Update</title><description>I apologize for the long delay in posting to the blog. I have spent the better part of the last several months addressing several (thousand) issues raised by our legacy data migration and plug-in development. Today, though, I am happy to announce that the day has finally come where we, Manuscripts and Archives, have made the jump. We are now fully in and committed to the AT! Unfortunately, much still remains to be worked out.&lt;br /&gt;&lt;br /&gt;First off, we have to fix thousands of errors reported out from our programmatic efforts to match legacy instance information to resources in the AT. These were mostly the result of errors in our finding aids and, less often, errors in our location database. 
These have mostly been addressed, leaving only the truly disparate errors that need a good deal of research into how collections were processed and, oftentimes, examining/verifying box contents.&lt;br /&gt;&lt;br /&gt;Another major category of clean-up work stems from our QC of data sent back from our consultant, where we found that, with some overlap with our other error logs, several thousand barcodes from our locations database did not come into the AT. Thankfully, many of these can be easily diagnosed with our Box Lookup plug-in and fixed with our Assign Container Information plug-in. Others require more in-depth problem-solving. For those large collections where things just went haywire, or for those small collections with only a few boxes, we&#39;ve decided to delete the resource, re-import and then re-assign instance information using the Assign Container Information plug-in. Other collections will require deleting one or more components, importing those components in a dummy resource, transferring said components back into the original resource, and then re-assigning container information.&lt;br /&gt;&lt;br /&gt;A third major challenge is cleaning up restriction flags we&#39;ve assigned to instances based on notes in our locations database. Our locations database had a variety of notes both at the collection/series/accession level and at the item level. Since these notes were wildly inconsistent and could not be easily parsed, we created blanket restrictions for instances based on the notes. As a result, we have to review the restrictions assigned, verifying that those that need to be restricted are, and fixing those that should be open. Thankfully, these errors can easily be fixed with our Box Lookup and Assign Container Information plug-ins. &lt;br /&gt;&lt;br /&gt;Aside from these data errors, which are our first priority, we also have to finalize workflows, procedures, and documentation for accessioning, arrangement and description, and collections management. 
Although equally critical to our day-to-day operations, these were put off until we were in the AT so that we could fully model what needed to be done.&lt;br /&gt;&lt;br /&gt;So, although we&#39;ve made such great progress up until this point, much remains to be done, much needs to be resolved. This is more or less the lasting impression of the project. For other large institutions planning similar migration projects, I can&#39;t say enough just how much work is involved and how important it is to get staff, especially technical staff, on board. For those institutions without technical support and dedicated staff, it is probably best to hire a consultant, especially when it comes to legacy data (e.g. instance) migration and customizations to the AT.</description><link>http://atatyale.blogspot.com/2010/02/at-transition-update.html</link><author>noreply@blogger.com (Daniel Hartwig)</author><thr:total>2</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1144483264485712287.post-4074432039475225044</guid><pubDate>Sun, 25 Oct 2009 19:30:00 +0000</pubDate><atom:updated>2009-10-25T13:38:22.511-07:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Functionality</category><category domain="http://www.blogger.com/atom/ns#">Tips and Tricks</category><title>AT Tips &amp; Tricks: Transfer Components</title><description>One of the key weaknesses, in our opinion, of the early AT releases was the inability to import and attach EAD for an accession or addition to an existing resource. The Yale University Archives uses an &lt;a href=&quot;http://www.library.yale.edu/mssa/InventoryTemplate.xls&quot;&gt;inventory template&lt;/a&gt;, which offices fill out and email to us with each new accession. The template, an Excel spreadsheet, incorporates EAD tags and allows us to simply copy and paste encoded description directly into a finding aid, requiring only minimal clean-up. 
Without a means of importing this partial EAD into early versions of the AT, we either had to re-enter the information into the AT or delete the resource and re-import it with the addition. The problem with the latter is that deleting the resource also deletes all the location and instance information tied to the resource, which, in our case, is sizable. With the addition of Transfer Components functionality in v.1.5, however, the AT allowed us to import partial EAD without losing information assigned in the AT. Here&#39;s how it works.&lt;br /&gt;&lt;br /&gt;First, create a dummy EAD finding aid with the addition/accession EAD as a component. I generally add the new accession EAD to the existing finding aid for a resource and delete all other components. Second, import said finding aid into the AT. Third, open the resource in the AT into which you want to transfer components. Fourth, click on the Transfer button and select the dummy resource to import components from. Fifth, update the resource as necessary (e.g. extent, dates, etc.). Lastly, delete the dummy resource from the AT. That&#39;s it. It&#39;s that simple.&lt;br /&gt;&lt;br /&gt;We originally thought about creating a plug-in to allow us and others to import and append EAD components into a resource, but ultimately decided/realized that the Transfer function was sufficient and free. 
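The first step, building the dummy finding aid, can also be scripted. The skeleton below is a pared-down EAD 2002 shell; it is illustrative only and may need additional elements to satisfy the AT importer:

```python
# Wrap a new accession's <c01> fragment in a minimal dummy EAD finding aid
# for import into the AT. The skeleton is pared down and illustrative; the
# AT importer may require further elements, and the eadid/unitid values
# here are made up.
DUMMY_EAD = """<ead xmlns="urn:isbn:1-931666-22-9">
  <eadheader><eadid>dummy.transfer</eadid>
    <filedesc><titlestmt><titleproper>Dummy resource for transfer</titleproper></titlestmt></filedesc>
  </eadheader>
  <archdesc level="collection">
    <did><unitid>DUMMY-1</unitid><unittitle>Dummy resource</unittitle></did>
    <dsc>{component}</dsc>
  </archdesc>
</ead>"""

def wrap_component(c01_fragment):
    # Drop the accession's encoded component into the dummy shell.
    return DUMMY_EAD.format(component=c01_fragment)

doc = wrap_component('<c01 level="series"><did><unittitle>2009 accession</unittitle></did></c01>')
print(doc)
```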
Sometimes you just have to be a little creative in your approach to maximize the AT&#39;s functionality.</description><link>http://atatyale.blogspot.com/2009/10/at-tips-tricks-transfer-components.html</link><author>noreply@blogger.com (Daniel Hartwig)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1144483264485712287.post-6199672678793517623</guid><pubDate>Tue, 15 Sep 2009 18:07:00 +0000</pubDate><atom:updated>2009-09-15T13:44:21.509-07:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Functionality</category><category domain="http://www.blogger.com/atom/ns#">Implementation</category><title>AT Issues: Box Ranges</title><description>The most recent issue we&#39;ve encountered, and one I&#39;m sure others out there have as well, concerns box ranges. When EAD is imported into the AT with a range in the &amp;lt;container type=&quot;Box&quot;&amp;gt; tag (e.g. &amp;lt;container type=&quot;Box&quot;&amp;gt;1-2&amp;lt;/container&amp;gt; or &amp;lt;container type=&quot;Box&quot;&amp;gt;3, 256&amp;lt;/container&amp;gt;), the AT creates a single instance for that component (e.g. Box 1-2 or Box 3, 256), rather than separate instances for each. The problem is that each instance is likely to have a separate barcode, box type and perhaps even location. When you click on Manage Locations, for example, you are presented with a single instance to which multiple, separate values need to be assigned. There are a few options at this point to address the issue.&lt;br /&gt;&lt;br /&gt;First, and perhaps least desirable, is to fix your EAD to eliminate box ranges. Aside from the considerable labor/programming involved, your only option is to create separate components (i.e. clones) for each instance rather than one component with multiple instances. This is because although AT (1.5.9) allows you to create and export components with multiple instances (i.e. 
&amp;lt;c0x&amp;gt; with multiple &amp;lt;container type=&quot;Box&quot;&amp;gt; tags), it does not allow you to import such in the same fashion. Instead, each (and only up to 3) is imported as a separate container type into a single instance. All subsequent instances tied to a component are lost. Fortunately, I am told that version 2.0 will support import of multiple instances if parent/id attributes are used for each container tag.&lt;br /&gt;&lt;br /&gt;Second, you can fix (i.e. break apart) the ranges in the AT. You can do this in two different ways depending on how you want to characterize each instance. One, you can create separate components (i.e. clones) for each instance. Two, you can create multiple instances within a single component. The problem with the first option is that your resource/finding aid loses scannability, includes somewhat redundant info, and may grow quite large if you have several sizable ranges. The problem with the second approach is that you will need a style sheet to customize display of instances, perhaps turning them back into a range if you so choose.&lt;br /&gt;&lt;br /&gt;We&#39;ve decided to address the box range issue in a combination approach, fixing some instances in EAD, addressing most programmatically in the AT with code we&#39;re developing to clone components that are part of ranges. Hopefully this code can be added to one of our existing plug-ins that assigns other instance info, allowing others the option of creating clones for components that are part of a range. 
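The range-splitting step behind that cloning code can be sketched as follows (our actual plug-in code may differ; this just shows the expansion of the container values discussed above):

```python
import re

# Expand a <container type="Box"> value such as "1-2" or "3, 256" into the
# separate box numbers that each need their own instance (and barcode).
def expand_box_range(value):
    boxes = []
    for part in re.split(r",\s*", value.strip()):
        m = re.fullmatch(r"(\d+)-(\d+)", part)
        if m:  # a run like "1-2" becomes 1, 2
            boxes.extend(range(int(m.group(1)), int(m.group(2)) + 1))
        else:  # a single box number
            boxes.append(int(part))
    return boxes

print(expand_box_range("1-2"), expand_box_range("3, 256"))  # [1, 2] [3, 256]
```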
More to come on that soon.</description><link>http://atatyale.blogspot.com/2009/09/at-issues-box-ranges.html</link><author>noreply@blogger.com (Daniel Hartwig)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1144483264485712287.post-6426290267425572251</guid><pubDate>Mon, 03 Aug 2009 18:15:00 +0000</pubDate><atom:updated>2009-08-09T13:25:04.544-07:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Development</category><category domain="http://www.blogger.com/atom/ns#">Functionality</category><title>Yale AT plug-in development: status report</title><description>This post provides an update on MSSA&#39;s plug-in development efforts.&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;strong&gt;Revised Analog Instance module/view&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;To increase the AT&#39;s collections management functionality and usability MSSA has asked for user-defined fields (two strings, two booleans) to be added to the Analog Instance table/module in version 1.7. We will use these fields to facilitate interaction with our ILS (Voyager), using them to capture Voyager Bib and Holdings numbers, box type, and flags for item-level restrictions and tracking of exported items. 
Here is a screenshot of the revised Analog Instance view our plug-in presents when an individual instance is clicked in the main Resource module.&lt;/li&gt;&lt;br /&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkruTdpUBxqno2P0hNyZI1n1mnMkZvJ6cY3AEBVK2gcerfJC2JspyuS9cCsrMN0BTmh_ICJdu6Apm3-nvVyQYdrCfiPP4Wqrj0yeYDltAnfjrshwfJesoxSqOo5kaf9ZZEdpZYR1HP5UQ/s1600-h/analog_instance_revised.png&quot;&gt;&lt;img style=&quot;TEXT-ALIGN: center; MARGIN: 0px auto 10px; WIDTH: 320px; DISPLAY: block; HEIGHT: 277px; CURSOR: hand&quot; id=&quot;BLOGGER_PHOTO_ID_5368036278285069938&quot; border=&quot;0&quot; alt=&quot;&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkruTdpUBxqno2P0hNyZI1n1mnMkZvJ6cY3AEBVK2gcerfJC2JspyuS9cCsrMN0BTmh_ICJdu6Apm3-nvVyQYdrCfiPP4Wqrj0yeYDltAnfjrshwfJesoxSqOo5kaf9ZZEdpZYR1HP5UQ/s320/analog_instance_revised.png&quot; /&gt;&lt;/a&gt;&lt;br /&gt;Status: design finished; user-defined fields will be present in version 1.7 release.&lt;br /&gt;&lt;br /&gt;&lt;li&gt;&lt;strong&gt;Assign Container Information plug-in&lt;br /&gt;&lt;/strong&gt;&lt;br /&gt;A second functional improvement MSSA is contributing is enhanced batch container information assignment via an Assign Container Information plug-in. 
&lt;/li&gt;&lt;br /&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRVaDbJ33i1ftW5f7EowPzJx_-g9U2y2zRsK5jOduHJ_hY-4kvRa1iJyWuKWren5JDG0qx1mIXCMd-iI34tNkC66sL8MLKvhrKGn7zDUj9SZGdQdE0LqRmQN9ACi_cI7lLnLazaO1LhoE/s1600-h/assign_locations.png&quot;&gt;&lt;img style=&quot;TEXT-ALIGN: center; MARGIN: 0px auto 10px; WIDTH: 320px; DISPLAY: block; HEIGHT: 281px; CURSOR: hand&quot; id=&quot;BLOGGER_PHOTO_ID_5368034037551799490&quot; border=&quot;0&quot; alt=&quot;&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRVaDbJ33i1ftW5f7EowPzJx_-g9U2y2zRsK5jOduHJ_hY-4kvRa1iJyWuKWren5JDG0qx1mIXCMd-iI34tNkC66sL8MLKvhrKGn7zDUj9SZGdQdE0LqRmQN9ACi_cI7lLnLazaO1LhoE/s320/assign_locations.png&quot; /&gt;&lt;/a&gt; Much more efficient than data entry via RDE, which requires assignment one instance at a time, the Assign Container Information plug-in allows batch box type, restriction flag, and ILS export assignment to selected instances all at once.&lt;br /&gt;&lt;br /&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgemAvOlOx_dPR1ZrXGUp9mE1c5h7f1JLlNPdCCKkeAVJblKyoHHya10q_25fohRiF0yYlhok04F7-_dxHbltJ_buhHp5UNmE5TyJyLAAmLWjv_NVITSOvFATWK8ahOeTGjgE4-SDcc7xg/s1600-h/assign_container_info.png&quot;&gt;&lt;img style=&quot;TEXT-ALIGN: center; MARGIN: 0px auto 10px; WIDTH: 320px; DISPLAY: block; HEIGHT: 270px; CURSOR: hand&quot; id=&quot;BLOGGER_PHOTO_ID_5368034252712016290&quot; border=&quot;0&quot; alt=&quot;&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgemAvOlOx_dPR1ZrXGUp9mE1c5h7f1JLlNPdCCKkeAVJblKyoHHya10q_25fohRiF0yYlhok04F7-_dxHbltJ_buhHp5UNmE5TyJyLAAmLWjv_NVITSOvFATWK8ahOeTGjgE4-SDcc7xg/s320/assign_container_info.png&quot; /&gt;&lt;/a&gt;&lt;br /&gt;Additionally, we&#39;ve designed a Rapid Barcode Entry module to allow use of barcode scanners to quickly wand in barcodes for containers.&lt;br /&gt;&lt;br /&gt;&lt;a 
href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiaDQoLiqxSVpEozJN0TkfZsVSlK83ZAZG16qL9Ir0YhPcpzWaZl7Or7T6OmPhpiF2EKEm52Q9itVT0jJ53Cl9wGxsMuCvOYrG5kk4Y5MPu0Mu44pO0y5ZwI_0GQ7h6sAfvdeSM4a92_LY/s1600-h/assign_barcode.png&quot;&gt;&lt;img style=&quot;TEXT-ALIGN: center; MARGIN: 0px auto 10px; WIDTH: 320px; DISPLAY: block; HEIGHT: 286px; CURSOR: hand&quot; id=&quot;BLOGGER_PHOTO_ID_5368034415997591954&quot; border=&quot;0&quot; alt=&quot;&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiaDQoLiqxSVpEozJN0TkfZsVSlK83ZAZG16qL9Ir0YhPcpzWaZl7Or7T6OmPhpiF2EKEm52Q9itVT0jJ53Cl9wGxsMuCvOYrG5kk4Y5MPu0Mu44pO0y5ZwI_0GQ7h6sAfvdeSM4a92_LY/s320/assign_barcode.png&quot; /&gt;&lt;/a&gt;&lt;br /&gt;The Rapid Barcode Entry interface allows you to either select a specific container or an arbitrary number of containers. If you select one container then it will be the starting point, allowing you to progress numerically through the boxes until finished. If several containers are selected, then that will be the group that the interface will assign barcodes to. Setting your barcode scanner to include a return/break will allow you to input barcodes as fast as you can scan them.&lt;br /&gt;&lt;br /&gt;Status: design finished; should be ready for use with release of version 1.7.&lt;br /&gt;&lt;br /&gt;&lt;li&gt;&lt;strong&gt;EAD export plug-in&lt;br /&gt;&lt;/strong&gt;&lt;br /&gt;Yale EAD instances validate against a Yale-specific RNG &lt;a href=&quot;http://www.library.yale.edu/facc/schemas/yale.ead2002.rng&quot;&gt;schema&lt;/a&gt; informed by Yale&#39;s EAD best practices &lt;a href=&quot;http://www.library.yale.edu/facc/bpgs.html&quot;&gt;guidelines&lt;/a&gt;. Because this schema differs from the EAD 2002 schema, we need to develop an EAD export plug-in to modify data in the AT to validate against the Yale schema for easy ingest into our finding aids database. 
Such a plug-in will allow others to modify their data to meet their various output needs.&lt;br /&gt;&lt;br /&gt;Status: not yet started.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Partial EAD import plug-in&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Because MSSA accessions several hundred additions to collections each year, many arriving with inventories and EAD ready to paste into a finding aid, deleting and re-importing entire finding aids (and their corresponding instance information)--currently the only way to input EAD aside from direct entry into the AT--is impractical for us. Hence, we will be developing a partial EAD importer, allowing import of new addition/accession EAD (e.g. &lt;c01&gt;...&lt;/c01&gt;) to append to an existing AT resource.&lt;br /&gt;&lt;br /&gt;Status: not yet started.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Lookup/read-only plug-in&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;To facilitate collections management and public services activities in MSSA, we need to develop a quick look-up or browse plug-in that will allow us to select specific data fields (e.g. accession number) to retrieve certain information (e.g. barcode) without the need to build the entire resource, accession, or digital object record.&lt;br /&gt;&lt;br /&gt;Status: not yet started.&lt;/li&gt;&lt;br /&gt;&lt;br /&gt;&lt;li&gt;&lt;strong&gt;Export to ILS plug-in&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;To facilitate easy export from the AT to an ILS, we will be developing a text file export plug-in. The plug-in will allow you to format data entered into the AT for easy export/ingest into an ILS.
Because we are, I believe, one of only two institutions with a Voyager import API, the plug-in will in our case export the text file to an intermediary application that will in turn ingest the AT data into Voyager.&lt;br /&gt;&lt;br /&gt;Status: not yet started.&lt;/li&gt;&lt;br /&gt;&lt;br /&gt;&lt;li&gt;&lt;strong&gt;Revised Digital Object Instance module/view&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;To facilitate batch digital object creation, we plan to modify the Digital Object Instance module in a similar fashion to what we have done with analog instances, allowing bulk creation and metadata assignment. Specifically, we want the ability to create and assign metadata for an arbitrary number of items via a &#39;+ n&#39; button, which allows you to enter the number of items you want to create with assigned values.&lt;br /&gt;&lt;br /&gt;Status: not yet started.&lt;/li&gt;&lt;/ol&gt;</description><link>http://atatyale.blogspot.com/2009/08/yale-at-plug-in-development-status.html</link><author>noreply@blogger.com (Daniel Hartwig)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkruTdpUBxqno2P0hNyZI1n1mnMkZvJ6cY3AEBVK2gcerfJC2JspyuS9cCsrMN0BTmh_ICJdu6Apm3-nvVyQYdrCfiPP4Wqrj0yeYDltAnfjrshwfJesoxSqOo5kaf9ZZEdpZYR1HP5UQ/s72-c/analog_instance_revised.png" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1144483264485712287.post-7881473011643280476</guid><pubDate>Mon, 20 Jul 2009 13:27:00 +0000</pubDate><atom:updated>2009-07-20T18:28:26.303-07:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Development</category><category domain="http://www.blogger.com/atom/ns#">Functionality</category><category domain="http://www.blogger.com/atom/ns#">Implementation</category><category domain="http://www.blogger.com/atom/ns#">Performance</category><title>Collaborative AT Instances: Pros &amp; Cons</title><description>This post examines
the pros and cons of consortial or collaborative AT instances. My comments are based on my experience administering the Yale University Library Collections Collaborative AT project and MSSA&#39;s AT development.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;The central benefit of a consortial/collaborative AT instance is the consolidation of systems, procedures, practices, and resources. Having one system, one set of procedures, one course of training across multiple repositories not only conserves resources, but also greatly facilitates consistency and efficiency across the institution. Such a configuration inherently allows for enhanced understanding of each other’s collections, and provides faster and more consistent access to collection information, as well as the possibility of—from one location—getting an overview of the special collections holdings across diverse repositories. In Manuscripts and Archives alone, implementation of the AT will result in the consolidation of numerous databases, centralizing collection information and reducing ongoing systems maintenance for these unintegrated databases.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Centralizing collection and other archival information leads to increased security, as potentially sensitive information will no longer be scattered across databases, electronic office files, and often, paper logs.
&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt; &lt;/p&gt;&lt;ol&gt;&lt;li&gt;The primary challenge facing collaborative/consortial instances is that the AT does not scale well for large, complex data sets and hence causes noticeable performance issues. Owing to an acknowledged design flaw, the AT&#39;s performance and functionality degrade beyond a certain point (see my previous post on &lt;a href=&quot;http://atatyale.blogspot.com/2009/07/at-issues-resource-handling.html&quot;&gt;Resource loading&lt;/a&gt;). As a result, especially for large institutions with multiple repositories or, more importantly, with extensive legacy data, consolidating multiple systems in a single AT instance will result in a slow-performing system. Unfortunately, there is currently no plan to alter the design of the AT to address this issue in the 2.0 release, but the potential exists for a third round of grant support (to merge the AT with Archon) that would allow for it. The only alternatives at this point are for one of our institutions to pay a consultant to do the redesign work, which might be costly, or to develop lookup tools/plug-ins that query the database without having to, say, build an entire Resource.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Sustainability is another major issue. With two major releases left, development of the AT is coming to a close. Unfortunately, this is happening just as many of us are finally getting around to evaluating and adopting the AT. Thankfully, with version 1.5.9 the AT team introduced plug-in functionality as a viable means for the community to customize and extend the AT.
In addition, it&#39;s exciting to hear that the opportunity exists for a third round of grant support to merge the AT with Archon. Beyond greatly expanding the functionality and usability of AT/Archon, this would allow for finishing any existing development commitments still on the table when AT development ends.&lt;br /&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;A third issue for consortial/collaborative instances is that any modifications or customizations made by the superuser (e.g. to default values, lookup lists, user-defined fields, etc.) apply to the instance as a whole and cannot be applied to just a single repository. Extensive modification of lookup lists, user-defined fields, and default values may thus result in a cluttered, hard-to-use interface or may even impede efficiency and performance. Plug-ins may offer a solution to this problem, though, as they can be used to alter appearance and workflow.&lt;br /&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;A fourth issue is that the only way to restore an instance (e.g. after a crash or failure) is to import an entire MySQL dump.
Hence, if one repository or individual screws something up, you&#39;ll have to go back to the last backup of the entire database, which may mean re-entering a lot of data.&lt;br /&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;A fifth challenge is the difficulty of migrating legacy data, especially instance or box information (e.g. location, box type, barcode, etc.), into the AT. Migration can also be difficult for institutions that do not have EAD 2002 or that lack the expertise to export, format, and import their legacy data. For institutions like us with a significant amount of complex legacy data, the only real option is to hire a consultant and develop a custom import process/plug-in.&lt;br /&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;A sixth shortcoming of the AT is its inability to support the full EAD tag set, meaning that additional tools (e.g. stylesheets) or systems may be necessary to fully manage an institution&#39;s finding aids. On the bright side, especially for smaller institutions or those that lack a finding aid display/delivery system, the proposed AT-Archon merger might address the issue of full EAD support.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;A seventh issue is the inability to batch edit data in the AT.
For those with a little know-how, though, a MySQL database administration tool such as Navicat can be used to query and update data in the AT&#39;s MySQL tables. This is definitely beyond the capability of the typical user, so you may want to address this gap via a plug-in.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;All told, it might seem that the AT has more than its fair share of cons. This is not what I want to get across. True, there are some issues to take into consideration, especially when creating a collaborative instance, but, as MSSA is easily the largest user of the AT, none of what we&#39;ve encountered thus far is a real deal-breaker. True, we&#39;d like it to perform better, but there are options at this point to at least sidestep this issue until the AT/Archon redesign. And with plug-in functionality now available and soon to be expanded, the means for addressing the AT&#39;s shortcomings is now in the hands of the community.
We just need to step up.&lt;/p&gt;</description><link>http://atatyale.blogspot.com/2009/07/collaborative-at-instances-pros-cons.html</link><author>noreply@blogger.com (Daniel Hartwig)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1144483264485712287.post-955331040133010779</guid><pubDate>Mon, 06 Jul 2009 13:31:00 +0000</pubDate><atom:updated>2009-07-06T12:38:38.162-07:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Functionality</category><category domain="http://www.blogger.com/atom/ns#">Performance</category><title>AT Issues: Resource Handling</title><description>We have run into a significant performance issue at the halfway point of our legacy data migration efforts. We have found that Resources are taking much too long to load--small ones take at least 15 seconds, medium resources up to 3 minutes, and our largest upwards of 5 minutes. Again, since we&#39;re only half-finished migrating our legacy data, the problem is likely to get much worse, especially as we begin to create/describe the thousands of digital objects we&#39;ve been waiting to use the AT for.&lt;br /&gt;&lt;br /&gt;We think the reason for the problem is how the AT loads Resources, i.e. recursively issuing SQL commands for each component, rather than issuing a single command for all resource components at once. Hence, given a large, complex finding aid (e.g. one with several levels of nested components and associated instance and location information), the AT grinds to a standstill as it tries to systematically open the entire hierarchy. The main issue here appears to be the number and depth of nested components. Resources with only one or two levels of nested components do seem to open right up, the exception being two-level resources that have larger numbers of sub-components.
However, resources that have three or more levels of nested hierarchy always open slowly, even if they don&#39;t have many actual associated instances.&lt;br /&gt;&lt;br /&gt;We have tried to address the problem by maximizing the memory assigned to the client (see &lt;a href=&quot;https://wikis.nyu.edu/lm1394/ArchivistsToolkit/index.php/FAQ/ATUsageQuestions#memoryAllocation&quot;&gt;AT Wiki&lt;/a&gt;), boosting the database server&#39;s use of system memory, increasing the key buffer size, and engaging Memory Lock to discourage swapping--all to no avail. These steps did, however, boost performance of the MySQL database administration tool Navicat, which we use to access the MySQL database. This points to the problem lying in the AT and its Java code. This hypothesis was further supported when we set up a slow query log (i.e. anything over 2 seconds) on the database and tested resource loading, finding that no such queries were logged. As a result, all of our efforts suggest that while we can indeed speed up operation of the MySQL database, we cannot speed up the Java code that interacts with the database. Since this, however, lies in the core of the AT code, it would have to be addressed either by the AT development team and put into the next/final release, or developed externally and then somehow offered back to the AT community. We are not sure yet how to proceed. An alternative that we and perhaps others might be happy with would be for only the top-level components (i.e. c01s) to be built and displayed within the main Resources window, with further sub-components built when clicked. After all, you don&#39;t need to see/retrieve the whole Resource to make an edit to a specific part.&lt;br /&gt;&lt;br /&gt;One of the questions this problem has raised for us is why no one has noticed or reported this performance issue. Was this borne out in scale testing?
Surely there are other repositories who have loaded large finding aids, or who are working with complex resources or digital objects. Are they not using/editing these Resources and therefore not noticing the lag in performance? Perhaps they haven&#39;t ingested the amount of instance and location information that we have? Or, is there an upper data limit to the AT&#39;s functionality? That seems short-sighted. Regardless, others are bound to run into the problem sooner or later as they begin to ingest more and more data into the AT. As a result, we hope that progress can be made to address this critical problem.</description><link>http://atatyale.blogspot.com/2009/07/at-issues-resource-handling.html</link><author>noreply@blogger.com (Daniel Hartwig)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1144483264485712287.post-8961011380479540317</guid><pubDate>Thu, 18 Jun 2009 00:15:00 +0000</pubDate><atom:updated>2009-06-20T14:08:14.433-07:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Finding Aids</category><category domain="http://www.blogger.com/atom/ns#">Implementation</category><title>Finding Aid Clean-Up: Box Numbers</title><description>As we proceed with our AT development we are spending considerable time cleaning up and standardizing our finding aids. Aside from the work I&#39;ve mentioned previously to create consistent dates, extent statements, and subjects, the main focus of our latest efforts is the standardization of container information (i.e. box numbers). The reason for all this work is to allow us to programmatically hook our location information (e.g. box type, barcode, vault, shelf, etc.) to our finding aids in the AT, the key to which, it turns out, is the box number.&lt;br /&gt;&lt;br /&gt;Like many repositories, our arrangement and descriptive practices have waxed and waned over the years.
Although our collection numbers and accession numbers have more or less been consistently applied, our box numbers have not, particularly those used in connection with our practice of housing small quantities of odd-sized materials in common containers. Formerly, we housed such items, especially folios or slides, in what we called common folio or common slide boxes (i.e. containers housing materials from multiple collections in a single or communal box), assigning a box number for the common folio/box and a folder number for the individual folder. Aside from the clear practical issues involved in administering such common containers, we&#39;ve run into problems as we try to tie the box numbers we&#39;ve assigned these items in our locator database to our finding aid data in the AT. More specifically, as our descriptive practice has varied over the years, the assignment of box numbers and box number extensions (e.g. an alphanumeric character used to indicate a use copy or duplicating master of a particular item) for these items has been inconsistent, unfortunately differing a great deal from the box/container info in our EAD. For example, what appears in the finding aid (i.e. &lt;container type=&quot;Box&quot;&gt;) as &quot;MS Common Folio 10&quot; is entered in our locator database with Box &#39;1&#39; and BoxNumberExtension &#39;CF1F10&#39;. As a result, we&#39;ve had to manually edit data both in our location database and in our finding aids/AT for all these items.&lt;br /&gt;&lt;br /&gt;This is a short-term solution.
These items really need to be rehoused and all such common containers need to be done away with, not only due to the issues at hand, but also to facilitate, say, the creation of future use copies of these materials.</description><link>http://atatyale.blogspot.com/2009/06/finding-aid-clean-up-box-numbers.html</link><author>noreply@blogger.com (Daniel Hartwig)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1144483264485712287.post-168637164060078273</guid><pubDate>Sun, 31 May 2009 20:29:00 +0000</pubDate><atom:updated>2009-05-31T13:43:23.925-07:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Finding Aids</category><category domain="http://www.blogger.com/atom/ns#">Functionality</category><title>Finding Aid Handling</title><description>&lt;p&gt;This post will examine issues we&#39;ve encountered with the AT concerning finding aids.&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;strong&gt;EAD Import&lt;br /&gt;&lt;/strong&gt;&lt;br /&gt;We&#39;ve run into five issues importing EAD into the AT. First, as I mentioned in a &lt;a href=&quot;http://atatyale.blogspot.com/2009/05/at-issues-large-finding-aids.html&quot;&gt;previous post&lt;/a&gt;, we&#39;ve had problems importing both large EAD files (6+ MB) and large numbers of files. Even with increasing the memory assigned to the AT (see &lt;a href=&quot;https://wikis.nyu.edu/lm1394/ArchivistsToolkit/index.php/FAQ/ATUsageQuestions#memoryAllocation&quot;&gt;Memory Allocation&lt;/a&gt;) and installing the client on the same machine as the database, importing these files has crashed the import process, forcing us to import in small batches or one file at a time as needed. Second, we&#39;ve encountered issues with &lt;extent&gt; handling for those finding aids with a parallel extent or container summary.
Here we&#39;ve found that, although encoded correctly, the AT handles it inconsistently, sometimes assigning a blank extentNumber, other times combining extentNumber and containerSummary in extentType, or, most often, assigning a blank extentNumber and extentType and throwing everything into a general physical description note. This calls for a lot of manual clean-up depending upon how you encode &lt;extent&gt;. Third, again despite correct encoding, we and other Yale repositories have found inconsistent languageCode assignment, most often resulting in blank values. Fourth, as perhaps many others of you have experienced, we’ve had problems with Names and Subjects, both in the creation of duplicate values and in names getting imported as subjects (and vice versa). This is likely due to how the AT treats names as subjects—a complicated concept which may or may not be revised for 2.0. Fifth, we’ve found inconsistent handling of internal references/pointers, which sometimes get assigned new target values and sometimes do not. Whether or not new target values are created following merging of components is another issue on the table for future investigation.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;DAO handling&lt;br /&gt;&lt;/strong&gt;&lt;br /&gt;Unfortunately, one of the EAD elements currently not handled by the AT is &lt;daogrp&gt;, which our &lt;a href=&quot;http://www.library.yale.edu/facc/bpgs.html&quot;&gt;Yale EAD Best Practices Guidelines&lt;/a&gt; and common EAD authoring tool (XMetal) have set up to handle all Digital Archival Objects. As a result, none of our DAOs can be imported into the AT.
This is a major issue that needs to be addressed in the 2.0 DAO module revision.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EAD Export/Stylesheet&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Although it is possible to modify or replace the stylesheet the AT uses to export EAD or PDF (see &lt;a href=&quot;https://wikis.nyu.edu/lm1394/ArchivistsToolkit/index.php/FAQ/ImportAndExportQuestions#loadingStylesheet&quot;&gt;Loading Stylesheet&lt;/a&gt;), it’s not currently possible to select or utilize multiple stylesheets, which may be important for multiple-repository installations and/or developing flexible export possibilities.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Batch editing&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;I’ve mentioned previously in &lt;a href=&quot;http://atatyale.blogspot.com/2009/05/batch-editing-data-clean-up.html&quot;&gt;another post&lt;/a&gt; that one of the key weaknesses of the AT is the inability to batch edit. Given the lack of batch editing functionality, a repository will either have to commit to greater planning and development prior to integration, or use a database administration tool such as &lt;a href=&quot;http://www.navicat.com/&quot;&gt;Navicat&lt;/a&gt; for MySQL to clean up data once in the AT.&lt;/li&gt;&lt;/ol&gt;</description><link>http://atatyale.blogspot.com/2009/05/finding-aid-handling.html</link><author>noreply@blogger.com (Daniel Hartwig)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1144483264485712287.post-6929255769175988631</guid><pubDate>Wed, 27 May 2009 01:56:00 +0000</pubDate><atom:updated>2009-05-27T06:59:42.122-07:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Functionality</category><category domain="http://www.blogger.com/atom/ns#">Implementation</category><category domain="http://www.blogger.com/atom/ns#">Subjects</category><title>Subject Handling</title><description>We in MSSA have chosen not to
utilize the AT to manage subject terms for three reasons.&lt;br /&gt;&lt;br /&gt;First and foremost, subject handling in the AT is not as robust or as functional as that of our current cataloging system.&lt;br /&gt;&lt;br /&gt;Second, only a portion of the finding aids we imported contained controlled access points. This is because subjects have generally only been assigned to our master collection records, which are our MARC records. Only for a short period of time were controlled access points used on the manuscripts side. Furthermore, given the differences in content and purpose of MARC and EAD, trying to consolidate the two into a single system presented obvious practical issues. So, rather than try to programmatically marry our MARC subjects to our finding aids, we decided to maintain access points in MARC until the AT could serve as our master record--a scenario still some ways off, though made much simpler with the introduction of plugin functionality and custom import/export tools. In fact, with plugin functionality we might revisit the possibility of at least importing our subjects from MARC and attaching them to our resource records in the AT.&lt;br /&gt;&lt;br /&gt;The third reason we chose not to use the AT to manage subjects was the difficulty the AT has round-tripping data, especially from one format to another, and the concomitant need to develop tools to clean up this data for easy import into Voyager.</description><link>http://atatyale.blogspot.com/2009/05/subject-handling.html</link><author>noreply@blogger.com (Daniel Hartwig)</author><thr:total>2</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1144483264485712287.post-1369774194530007953</guid><pubDate>Fri, 22 May 2009 02:44:00 +0000</pubDate><atom:updated>2009-05-21T19:48:30.410-07:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Development</category><category domain="http://www.blogger.com/atom/ns#">Functionality</category><title>AT 1.5.9
Plugin Functionality</title><description>With the release of 1.5.9 and plugin functionality, the AT team has introduced a much-needed mechanism for customizing and expanding functionality in an extensible fashion. In keeping with current software development trends centered around a core system that can be modified or expanded via external applications, the beauty of AT plugin functionality is that customizations made and data imported via plugins are not affected by future AT upgrades, as the plugin code lies outside the main AT source code, stored in a separate plugin folder.&lt;br /&gt;&lt;br /&gt;Plugins will thus offer repositories the capability to do many of the things that have perhaps, up until this point, kept them from fully adopting the AT. This includes creation of custom importers, export modules, record editors, and display/workflow screens. We in MSSA plan to take full advantage of plugin functionality to create a custom import mechanism for marrying our existing container information (e.g. box type, barcode, VoyagerBib, and VoyagerHolding) to instance records in the AT, a custom exporter to send data from the AT to Voyager, and a custom import mechanism for importing EAD or partial EAD for an existing resource record (e.g. a new accession).&lt;br /&gt;&lt;br /&gt;As with the AT itself, plugin development requires knowledge of Java programming. For more information on developing plugins, check out the &lt;a href=&quot;http://archiviststoolkit.org/forDevelopers/plugins/ATPlugin_DevGuide.pdf&quot;&gt;Plugin Development Guide&lt;/a&gt; (.pdf).
A list of current plugins can be found here: &lt;a href=&quot;http://archiviststoolkit.org/forDevelopers/plugins.shtml&quot;&gt;http://archiviststoolkit.org/forDevelopers/plugins.shtml&lt;/a&gt;.</description><link>http://atatyale.blogspot.com/2009/05/at-159-plugin-functionality.html</link><author>noreply@blogger.com (Daniel Hartwig)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1144483264485712287.post-6081219433355903384</guid><pubDate>Thu, 14 May 2009 23:31:00 +0000</pubDate><atom:updated>2009-05-14T16:32:14.984-07:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Development</category><category domain="http://www.blogger.com/atom/ns#">Sustainability</category><title>Sustainability &amp; the AT</title><description>Perhaps the biggest question facing the AT is its sustainability. With only one or two releases left, the AT is soon approaching the end of development. What does the future hold?&lt;br /&gt;&lt;br /&gt;At SAA in San Francisco last year, the AT group reported that it had begun to work with a business consultant to formulate a business plan for the AT after Phase 2 of development ends in 2009. Options at that point included institutional affiliation, subscription-based service, or the pursuit of another Mellon grant. Also briefly discussed was the related need to develop a community governance model for guiding the direction and future development of the AT. Although initial steps have been made to address the governance model thanks in part to the SAA AT roundtable, little has been said yet as to the AT business plan. This post will explore the pros and cons of the three options on the table at this point.&lt;br /&gt;&lt;br /&gt;Institutional affiliation/hosting is one option for the AT. 
The main challenge of this approach is the reality that few institutions are capable of taking on this responsibility, not only due to limited technical expertise and infrastructure, but also because the current economic situation is precluding many of us from taking on outside projects. Without a commitment of significant resources it seems likely that this model will only allow for ongoing support and not any additional development. Given some of the needs addressed in the &lt;a href=&quot;http://www.archiviststoolkit.org/AT%20User%20Group%20SurveyResultsFD.pdf&quot;&gt;AT user survey&lt;/a&gt;, ATUG-listserv postings, and other forums, however, this option seems in the end the least beneficial to the archival community as a whole.&lt;br /&gt;&lt;br /&gt;A second option is subscription or fee-based AT support and development, either by individual consultants or a single software firm. Although this option would incur costs on individual institutions, this approach does allow for continued development, especially for institution-specific needs. The central challenge for this approach is that not all institutions have the resources to commit to development and so might not have a say in the future direction of the AT. Bigger institutions with bigger budgets would drive the agenda. To prevent this disparity an effective governance model would have to evolve to lobby for community interests and manage ongoing development for all interested parties.&lt;br /&gt;&lt;br /&gt;The third option is to pursue another round of Mellon grant support. This would obviously allow for continued development of specific needs voiced by the archival community in the AT user survey and other requests posted to the ATUG-l. Given the demands on the project staff, our current economic situation, and the little traction this option seems to have garnered up to this point, it seems unlikely that this will happen.&lt;br /&gt;&lt;br /&gt;So where does that leave us? 
The AT project is coming to an end, sooner perhaps than we would like. Sure, we&#39;d all like a few more bells and whistles before then, but whatever business model is ultimately adopted, the AT will at least be supported for the next few years. Will &lt;a href=&quot;http://www.archon.org/&quot;&gt;Archon&lt;/a&gt;, &lt;a href=&quot;http://ica-atom.org/&quot;&gt;ICA-AtoM&lt;/a&gt;, or other products, open-source or proprietary, sufficiently evolve? It&#39;s hard to tell at this point. The one thing we at least in MSSA know is that we will be much, much, MUCH! better off for having undertaken the work to get into the AT. No amount of my blogging can sell this enough. I&#39;m sure many institutions would agree. And we thank the AT for that.</description><link>http://atatyale.blogspot.com/2009/05/sustainability-at.html</link><author>noreply@blogger.com (Daniel Hartwig)</author><thr:total>1</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1144483264485712287.post-9209581606303220702</guid><pubDate>Thu, 07 May 2009 12:27:00 +0000</pubDate><atom:updated>2009-05-11T11:21:35.348-07:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Functionality</category><category domain="http://www.blogger.com/atom/ns#">Implementation</category><category domain="http://www.blogger.com/atom/ns#">Tips and Tricks</category><title>Batch Editing &amp;amp; Data Clean-Up</title><description>One of the key weaknesses of the AT is the inability to batch edit data. The need for batch edit functionality was ranked as highly important in the &lt;a href=&quot;http://www.archiviststoolkit.org/AT%20User%20Group%20SurveyResultsFD.pdf&quot;&gt;AT user group survey&lt;/a&gt; and will hopefully be added in a future release. What then is a repository to do in the meantime? 
I suggest two possible options: 1) batch edit data prior to import; 2) manipulate the MySQL tables directly, or use a database administration tool such as &lt;a href=&quot;http://www.navicat.com/&quot;&gt;Navicat&lt;/a&gt; for MySQL to connect to the AT&#39;s MySQL database and perform queries/updates in the tables.&lt;br /&gt;&lt;br /&gt;As I have described before, over time our collection management data has been created according to a number of different ad hoc and de facto standards. We in MSSA have tried as much as possible to batch edit and standardize our data prior to import into the AT. This was straightforward for our accessions and location information, which was already stored in a database and thus easy to identify and manipulate. The one problem that did exist with this data was a tendency by MSSA staff to combine data belonging to a number of different fields, as defined by EAD or the AT, into a single catch-all free-text note field (a place to find everything). Options for handling this data included exporting our legacy data to another file and modifying it there, or importing into the AT and then editing. We chose the former, performing batch operations to format the files according to the AT import maps. Although this was largely successful, we still encountered edits that needed to be made once in the AT. The options at that point were either to edit in the AT one record at a time, or to perform another round of edits, delete the data in the AT, and then reimport. We chose instead to perform batch edits in the AT using Navicat, saving a considerable amount of time and effort.&lt;br /&gt;&lt;br /&gt;The biggest challenge we faced in standardizing our data prior to import came with our finding aids. Because they&#39;re not in a database and therefore not easily comparable, it&#39;s hard to see what changes need to be made until they&#39;re actually in the AT. 
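For the curious, the kind of batch edit we run through Navicat is nothing more than an ordinary SQL UPDATE statement. Here is a rough sketch of the pattern, with Python's built-in SQLite standing in for MySQL; the table and column names are made up for illustration and do not match the AT's actual schema:

```python
import sqlite3

# In-memory SQLite stands in for the AT's MySQL database;
# the table and column names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Accessions (accessionNumber TEXT, container1Type TEXT)")
conn.executemany(
    "INSERT INTO Accessions VALUES (?, ?)",
    [("2009-M-001", "Carton"), ("2009-M-002", "carton"), ("2009-M-003", "Box")],
)

# Batch edit: normalize an inconsistently entered container type in one
# statement instead of editing record by record in the AT client.
conn.execute(
    "UPDATE Accessions SET container1Type = 'Box' "
    "WHERE lower(container1Type) = 'carton'"
)
conn.commit()

print(conn.execute(
    "SELECT accessionNumber, container1Type FROM Accessions "
    "ORDER BY accessionNumber"
).fetchall())
```

With real data the same statement run in Navicat (or the mysql client) touches every matching row at once, which is what makes this so much faster than one-at-a-time edits.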
No matter the number of iterative batch edits we ran on our finding aids, we still came across edits that needed to be made. Importing them into the AT, however, would still likely require a further round of edits. Running edits outside of the AT, deleting all the data, and then reimporting would be a huge burden, particularly with over 2500 finding aids. We chose instead to run batch operations in the AT using Navicat and then run find/replace separately on the finding aids.</description><link>http://atatyale.blogspot.com/2009/05/batch-editing-data-clean-up.html</link><author>noreply@blogger.com (Daniel Hartwig)</author><thr:total>1</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1144483264485712287.post-3617212813192066996</guid><pubDate>Tue, 05 May 2009 17:42:00 +0000</pubDate><atom:updated>2009-05-05T12:03:28.301-07:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Finding Aids</category><category domain="http://www.blogger.com/atom/ns#">Tips and Tricks</category><title>Tips and Tricks: Resources</title><description>In migrating your legacy data into the AT, your repository might have collections that lack both a finding aid and a MARC record yet still need to get into the AT as resource records. Rather than take the time to manually create these resource records in the AT one by one, it would be helpful to come up with a means for importing this information in a batch process. MSSA had over 250 such records that we had to bring into the AT. Here is how we did it.&lt;br /&gt;&lt;br /&gt;The legacy data that we had to address, consisting mostly of what can loosely be described as deaccessions (i.e. collections that were transferred, destroyed, never came in, and/or who knows what), was in a Paradox database. 
Given the orderly data structure, we decided to generate simple EAD finding aids for each collection using a mail merge and then batch import them into the AT.&lt;br /&gt;&lt;br /&gt;The first step was to export the data from our Paradox database into Excel. We then filtered the information to select only the specific collection records needed and deleted any unnecessary data elements. We then modified the data a bit, in this case creating a column for filename in addition to the other data elements present (e.g. collection number, title, note, and disposition). This served as the data source for the merge.&lt;br /&gt;&lt;br /&gt;&lt;img src=&quot;http://www.library.yale.edu/mssa/at/images/data_source.png&quot; width=&quot;90%&quot; height=&quot;90%&quot;/&gt;&lt;br /&gt;&lt;br /&gt;Next we modified our existing EAD template to include only the basics needed for import into the AT (namely level, resource identifier, title, dates, extent, and language), as well as the information present in our Paradox database to distinguish it as a deaccession (e.g. note and disposition). We then opened the EAD template in Word and set up a mail merge, inserting the elements from our Excel data source into the appropriate places in the Word document. Here is a partial view of the Word EAD template:&lt;br /&gt;&lt;br /&gt;&lt;img src=&quot;http://www.library.yale.edu/mssa/at/images/ead_template.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;Then we completed the merge of the data elements in Word and saved the individual finding aids with appropriate filenames. Here is a partial view of one of the resulting finding aids:&lt;br /&gt;&lt;br /&gt;&lt;img src=&quot;http://www.library.yale.edu/mssa/at/images/ead_results.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;The final step was to import the finding aids into the AT. 
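For those who would rather script the merge than run it through Word, here is a rough sketch of the same idea in Python. The template and the column names are simplified stand-ins, not valid EAD or our actual data source, and csv stands in for the Excel spreadsheet:

```python
import csv
import io

# Skeletal stand-in for the EAD template; a real template would carry the
# full <ead>/<eadheader> apparatus required by the AT importer.
TEMPLATE = """<ead>
  <archdesc level="collection">
    <did>
      <unitid>{number}</unitid>
      <unittitle>{title}</unittitle>
    </did>
    <note><p>{note} Disposition: {disposition}</p></note>
  </archdesc>
</ead>
"""

# One row per deaccessioned collection, as exported from Paradox;
# the filename column drives the name of each generated file.
SOURCE = """filename,number,title,note,disposition
ru0001.xml,RU 1,Transferred records,Never accessioned.,Transferred
ru0002.xml,RU 2,Destroyed records,Water damaged.,Destroyed
"""

finding_aids = {}
for row in csv.DictReader(io.StringIO(SOURCE)):
    filename = row.pop("filename")
    # Merge the remaining columns into the template, one file per record.
    finding_aids[filename] = TEMPLATE.format(**row)

print(sorted(finding_aids))
```

In practice each entry would be written to disk with its filename and then batch imported, exactly as the Word merge produced 250 separate files.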
To distinguish these resources in the AT, we checked them as Internal Only in the Basic Description tab and modified appropriate note fields as needed.&lt;br /&gt;&lt;br /&gt;All in all, the process proved very easy and was much faster than trying to enter this data manually. The only real drawback was having to save 250 separate finding aids.</description><link>http://atatyale.blogspot.com/2009/05/tips-and-tricks-resources.html</link><author>noreply@blogger.com (Daniel Hartwig)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1144483264485712287.post-4078674536894351631</guid><pubDate>Fri, 01 May 2009 15:10:00 +0000</pubDate><atom:updated>2009-05-01T10:48:39.980-07:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Finding Aids</category><category domain="http://www.blogger.com/atom/ns#">Functionality</category><category domain="http://www.blogger.com/atom/ns#">Performance</category><title>AT Issues: Large Finding Aids</title><description>Manuscripts and Archives encountered several performance issues when loading our finding aids into the AT. First, given the large number of finding aids in question (2500+), we encountered load time issues. Our first attempt to batch import our finding aids lasted the better part of a weekend, ultimately crashing with the dreaded Java heap space error. Adding insult to injury, no log was generated to indicate the cause of the crash or the status of the import (e.g. which files were successfully imported). Our initial diagnosis pointed to our setup. We had the database installed on one of our local, aging servers and ran the import via a client on a remote workstation. 
To address the issue of load time, we changed our setup and moved the database to our fastest machine, a Mac with 8GB of memory and multiple processors, installed the AT client on it, and saved copies of our finding aids to it for easy import.&lt;br /&gt;&lt;br /&gt;Our second attempt, which involved approximately 1800 finding aids, was much, much faster but still crashed. The likely culprits this time were large finding aids and a memory leak. We found that large finding aids (3MB+) crashed the import process when included as part of a batch import. In addition, we found a so-called memory leak (i.e. a successive loss of processing memory with each imported finding aid), which greatly slowed the process over time and contributed to the crash. As a result, we separated out large finding aids and imported them individually, and created smaller batches of finding aids (both with respect to total number and total size) to import in stages. Just to give you some idea of the time required to import larger finding aids, we found that using a remote client, a file of up to 2 MB averaged 20-30 minutes; 2-3 MB took 30-60 minutes; 5-6 MB took 90-120 minutes.&lt;br /&gt;&lt;br /&gt;These strategies proved effective, allowing us to import all but the largest of our finding aids (8-12 MB each), which we are currently working on. Because these present problems for our finding aid delivery system as well, one option is to split them up into multiple files, each around 2 MB. 
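Grouping the files into import batches capped by both count and total size is easy to script. Here is a sketch; the thresholds are our working numbers for illustration, not AT limits:

```python
# Split finding-aid files into import batches capped by both file count and
# total batch size; oversized files are set aside for individual import.
# Thresholds are illustrative, not AT-mandated.
def make_batches(files, max_files=100, max_bytes=10 * 2**20, big=3 * 2**20):
    singles = [name for name, size in files if size >= big]
    batches, current, current_bytes = [], [], 0
    for name, size in files:
        if size >= big:
            continue  # imported one at a time, outside any batch
        if current and (len(current) == max_files or current_bytes + size > max_bytes):
            batches.append(current)           # close the full batch
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        batches.append(current)
    return batches, singles

files = [("a.xml", 2**20), ("b.xml", 9 * 2**20), ("c.xml", 2 * 2**20)]
batches, singles = make_batches(files)
print(batches, singles)
```

Each returned batch is then imported as one stage, with the oversized singles run on their own so one huge file can't take a whole batch down with it.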
The only problem with this option is dealing with/maintaining multiple files.&lt;br /&gt;&lt;br /&gt;For other institutions with similar numbers and sizes of finding aids, these strategies may be of help to you.</description><link>http://atatyale.blogspot.com/2009/05/at-issues-large-finding-aids.html</link><author>noreply@blogger.com (Daniel Hartwig)</author><thr:total>1</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1144483264485712287.post-2274409444295984430</guid><pubDate>Wed, 22 Apr 2009 19:25:00 +0000</pubDate><atom:updated>2009-04-30T11:42:19.493-07:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Functionality</category><category domain="http://www.blogger.com/atom/ns#">Performance</category><title>AT Issues: Large Collections</title><description>Two issues that we have encountered during testing are load time and slow performance. Given MSSA&#39;s 100,000+ instances, 2600+ resources, 7000+ accessions, and multiple locations, we&#39;ve encountered lengthy start-up times and slow performance on even just a single client, let alone the dozen or so clients that will eventually be using the AT at a given time. Since we have not fully deployed the AT at this time, we cannot yet accurately test how the AT will perform when running across several clients. Two things we will be doing to speed up performance are to move the AT to a dedicated server and to simplify our location assignments. Concerning the latter, in a previous iteration involving writing location information to the tables directly, we created an enormous locations index, including many duplicate values, which significantly slowed client performance.&lt;br /&gt;&lt;br /&gt;A third performance issue encountered concerns report generation. Again, given the amount of data we have in the AT, running reports across our resources and accessions can take a considerable amount of time. 
Rather than running reports in the AT, which are not customizable, we use the database administration tool &lt;a href=&quot;http://www.navicat.com/&quot;&gt;Navicat&lt;/a&gt; for MySQL to query the AT tables directly. Aside from being much faster, this approach is customizable (provided you know a little MySQL) and allows for batch editing.</description><link>http://atatyale.blogspot.com/2009/04/large-collections-issues-for-at.html</link><author>noreply@blogger.com (Daniel Hartwig)</author><thr:total>1</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1144483264485712287.post-2715453443159649559</guid><pubDate>Tue, 21 Apr 2009 01:11:00 +0000</pubDate><atom:updated>2009-04-27T13:24:06.072-07:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Collections Collaborative</category><category domain="http://www.blogger.com/atom/ns#">Implementation</category><title>Deploying the AT Across Multiple Yale Repositories: Implementation</title><description>In setting up the AT for use across multiple Yale repositories, we encountered a number of practical issues that needed to be resolved. The two most important were the need for standardization of practice and the administrative set-up of the AT.&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;span style=&quot;font-weight: bold;&quot;&gt;Standardization of practice&lt;/span&gt;&lt;br /&gt;Each of the four participating repositories accessioned and managed special collections in a different way. To maximize the Toolkit&#39;s effectiveness, we therefore needed to create standard procedures for accessioning, including defining a minimum-level accession record and application of consistent accession numbers. 
In addition, we created &lt;a href=&quot;http://www.library.yale.edu/mssa/at/at_documentation.html&quot;&gt;documentation&lt;/a&gt; in the form of &lt;a href=&quot;http://www.library.yale.edu/mssa/at/at_instructions.html&quot;&gt;instructions&lt;/a&gt;, guidelines, and &lt;a href=&quot;http://www.library.yale.edu/mssa/at/at_instructions.html&quot;&gt;tutorials&lt;/a&gt; to guide both initial and future participants.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style=&quot;font-weight: bold;&quot;&gt;AT set-up&lt;/span&gt;&lt;br /&gt;Again, given the variety exhibited in participating repositories&#39; practices and collections information, we had to carefully consider whether to customize the Toolkit (e.g. use of user-defined fields, unique field labels, default values, etc.) to meet specific repository needs. The major challenge, however, was that the Toolkit only allows for customization across the AT instance as a whole, not for one repository within that instance.&lt;br /&gt;&lt;br /&gt;We set up one database instance for the AT at Yale and created repositories for each of the special collections within it. An alternative strategy would have been to create separate instances for each repository, allowing for repository-specific customization. The goal of the project, however, was to create one means for managing and querying special collections information across Yale&#39;s special collections. In addition, given the lack of distributed technical expertise and support, we decided to centrally manage the AT in a single instance.&lt;br /&gt;&lt;br /&gt;We initially chose not to customize the Toolkit, maintaining a vanilla installation for the Music Library, Arts Library, and Divinity Library collections. 
Given the sheer volume of collections information in Manuscripts and Archives (MSSA), however, we decided to create a separate MSSA instance, mostly for testing legacy data import, but also to incorporate MSSA-specific data elements in user-defined fields. We are also currently contracting out customization of the Toolkit to handle collections management processes, including Integrated Library System (ILS) export.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;</description><link>http://atatyale.blogspot.com/2009/04/deploying-at-across-multiple-yale_20.html</link><author>noreply@blogger.com (Daniel Hartwig)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1144483264485712287.post-937769439912106845</guid><pubDate>Sun, 19 Apr 2009 16:14:00 +0000</pubDate><atom:updated>2009-04-27T13:35:32.540-07:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Functionality</category><category domain="http://www.blogger.com/atom/ns#">Implementation</category><title>Populating Resources: MARCXML vs. Finding Aids</title><description>Choosing how to populate resources in the AT was an important consideration for Manuscripts and Archives, one that ultimately had us reversing course and scrapping our initial plans. Our initial dilemma was that we lacked finding aids for all of our collections, with many of our University Archives finding aids lacking inventories for some accessions (some in paper only, some taken in with no inventory). As a result, we initially chose not to import our finding aids, choosing instead to import MARCXML, which we transformed via a stylesheet into EAD for batch import. Given the lack of data normally provided by importing finding aids, we developed a script to tie our container data (Paradox) to the resources. 
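Our actual MARCXML-to-EAD transform was an XSLT stylesheet, but the gist of it can be sketched in Python. The MARC fields read and the EAD emitted here are pared down to a bare, illustrative minimum:

```python
import xml.etree.ElementTree as ET

# A toy MARCXML record; a real export carries many more fields.
MARCXML = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245"><subfield code="a">John Doe papers</subfield></datafield>
  <datafield tag="300"><subfield code="a">10 linear ft.</subfield></datafield>
</record>"""

NS = {"m": "http://www.loc.gov/MARC21/slim"}

def marc_to_ead(marcxml):
    record = ET.fromstring(marcxml)

    def subfield(tag, code):
        el = record.find(f"m:datafield[@tag='{tag}']/m:subfield[@code='{code}']", NS)
        return el.text if el is not None else ""

    # Build a minimal collection-level EAD skeleton; the real stylesheet
    # maps far more MARC fields into the elements the AT importer expects.
    ead = ET.Element("ead")
    did = ET.SubElement(ET.SubElement(ead, "archdesc", level="collection"), "did")
    ET.SubElement(did, "unittitle").text = subfield("245", "a")  # title
    ET.SubElement(did, "physdesc").text = subfield("300", "a")   # extent
    return ET.tostring(ead, encoding="unicode")

print(marc_to_ead(MARCXML))
```

The resulting skeletal EAD files were then batch imported, with our separate script tying the Paradox container data to the newly created resources.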
Several things didn&#39;t quite work, so we had to reconsider our options.&lt;br /&gt;&lt;br /&gt;Having attended the AT User Group meeting at SAA last fall, we realized that our approach caused too many complications and problems of sustainability. As a result, we spec&#39;d out importing our finding aids. Again, given our lack of finding aids for all collections, we worked out a plan to hire a consultant to help standardize our finding aids for import. We successfully imported 2000 of our 2500 finding aids and turned our attention to how to utilize the AT as a collection management system. We then realized we faced the problem of editing or significantly revising finding aids, especially on the University Archives side, where offices regularly provide accession inventories ready to copy and paste into EAD. Without a simple means for importing partial EAD for accessions, and without wanting to re-enter these manually in the AT, we reconsidered our plan again. Since our current EAD creation and editing process is sufficient, we decided that the AT is not currently capable of meeting our needs and that to have a consultant customize it to meet our needs would be beyond our means.&lt;br /&gt;&lt;br /&gt;As it stands now, we&#39;re back to square one; MARCXML is it (again). Unfortunately, given the considerable amount of work spent cleaning up and standardizing our finding aids, we will have to come up with a means for importing data from a separate AT EAD import (e.g. Resource Titles, Finding Aid Titles, and Citations) into a fresh AT instance populated via MARCXML. In addition, to meet our needs and function as a full collection management system, we will work with a consultant to modify the Toolkit to marry our container information with our resources and allow for easy export to our ILS (Voyager). 
Hopefully the need to easily import partial EAD will be worked out and we can use the AT as intended, populating resources via finding aids.</description><link>http://atatyale.blogspot.com/2009/04/populating-resources-marcxml-vs-finding.html</link><author>noreply@blogger.com (Daniel Hartwig)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1144483264485712287.post-3344811416394211951</guid><pubDate>Sat, 18 Apr 2009 17:14:00 +0000</pubDate><atom:updated>2009-04-27T13:29:44.502-07:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Development</category><category domain="http://www.blogger.com/atom/ns#">Functionality</category><title>Customizing the Toolkit: Instance Creation &amp;amp; Voyager Export</title><description>For some time now we have been looking to replace our collections management and accessioning systems, which are obsolete both in platform and functionality. When the Toolkit was being developed, we decided that since it would not fully meet our needs, we would not adopt it. A few years later, however, committed to consolidating our systems into an open-source alternative, we have reversed course and decided to move to the AT. Since the latest version (1.5) still does not fully capture our existing systems&#39; functionality, we have contracted out the modification of the Toolkit to meet two specific needs: 1) improved container (e.g. Instance) creation; and 2) Voyager export functionality.&lt;br /&gt;&lt;br /&gt;&lt;span style=&quot;font-weight: bold;&quot;&gt;Container/Instance creation&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Although version 1.5 of the AT comes with customizable Rapid Data Entry (RDE) screens and “sticky” values for improved data (e.g. Instance) entry, it is still time-consuming to enter information for large collections. 
Here is a screenshot of an RDE screen I created for Instance creation:&lt;br /&gt;&lt;br /&gt;&lt;img src=&quot;http://www.library.yale.edu/mssa/at/images/rde.png&quot; width=&quot;98%&quot; height=&quot;98%&quot; /&gt;&lt;br /&gt;&lt;br /&gt;AT 1.5 allows for the creation of one item at a time, with no option for batch creation and no automatic assignment of successive box numbers. Here is a screenshot of our current system:&lt;br /&gt;&lt;br /&gt;&lt;img src=&quot;http://www.library.yale.edu/mssa/at/images/paradox_containers.png&quot; width=&quot;100%&quot; height=&quot;100%&quot; /&gt;&lt;br /&gt;&lt;br /&gt;This system can populate successive items/containers, either on the next line or across n multiple lines. It also allows for assignment of location info, which the AT handles in a separate module.&lt;br /&gt;&lt;br /&gt;What we are building might look something like the following:&lt;br /&gt;&lt;br /&gt;&lt;img src=&quot;http://www.library.yale.edu/mssa/at/images/rde_redesign.jpg&quot; width=&quot;98%&quot; height=&quot;98%&quot; /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style=&quot;font-weight: bold;&quot;&gt;Voyager export functionality&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To easily export collection information from the AT to our ILS (Voyager), we need to add the following fields to each instance: 1) VoyagerBib; 2) VoyagerHolding; 3) Date (last changed); and 4) Restriction. This will require a redesign of the Resources and Instances module to include these new fields, as well as the incorporation of a script (triggered by an Export to Voyager button) to process newly entered data in the proper format for export into our ILS. 
We already use a similar script to do this work with our current system, so it shouldn’t be much work to modify it for use with the AT.&lt;br /&gt;&lt;br /&gt;Although it is unlikely we will complete all of our work in time for the next release of the AT, we hope to finish in time for incorporation into a subsequent release.</description><link>http://atatyale.blogspot.com/2009/04/customizing-toolkit-instance-creation.html</link><author>noreply@blogger.com (Daniel Hartwig)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1144483264485712287.post-7786690049606407456</guid><pubDate>Sat, 18 Apr 2009 16:11:00 +0000</pubDate><atom:updated>2009-04-27T18:22:33.528-07:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Development</category><title>Making the Jump: Standardization of Practice</title><description>In addition to the important and lengthy work spent standardizing our collections management data, Manuscripts &amp;amp; Archives also had to analyze its practices and procedures to determine how the AT could handle them.&lt;br /&gt;&lt;br /&gt;&lt;span style=&quot;font-weight: bold;&quot;&gt;1. Accessioning&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;For quite some time now we&#39;ve handled accessioning differently on the University Archives side than on the manuscript side. This has resulted in two widely different processes and two sets of disparate accession info. To remedy this situation, we developed an accession checklist and procedures common to both University Archives and manuscripts, as well as instructions and tutorials for accessioning using the AT.&lt;br /&gt;&lt;br /&gt;&lt;span style=&quot;font-weight: bold;&quot;&gt;2. Description&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;As with accessioning, description has also differed between University Archives and manuscripts. 
With the welcome addition of Bill Landis as head of arrangement and description, and considerable time spent on the University Archives side revising its processes, this disparity has been addressed. Also helpful has been the creation of descriptive standards for handling a/v materials and electronic records, as well as the creation of finding aid templates and EAD best practices guidelines.&lt;br /&gt;&lt;br /&gt;&lt;span style=&quot;font-weight: bold;&quot;&gt;3. Collections management&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This was less of a problem, as collections management was handled well by one person for quite a while. The only problems we encountered were from incomplete data and improper use of the system by clerical staff (e.g. entering temporary holdings, capturing accession or descriptive info, or entering several different data elements in a single note field). To address these problems, we had to map out export of the problematic data elements into appropriate AT fields (where possible) and plan for post-import clean-up in the AT.</description><link>http://atatyale.blogspot.com/2009/04/making-jump-standardization-of-practice.html</link><author>noreply@blogger.com (Daniel Hartwig)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-1144483264485712287.post-6128224455456524770</guid><pubDate>Sat, 18 Apr 2009 15:25:00 +0000</pubDate><atom:updated>2009-04-27T13:25:08.752-07:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Functionality</category><category domain="http://www.blogger.com/atom/ns#">Implementation</category><title>Reporting</title><description>In addition to the AT&#39;s built-in reports, which may or may not be sufficient for repository statistics, there are a few options for generating customized reports. Common reporting software tie-ins include Jasper Reports, Crystal Reports, and iReports. 
Those wishing to create or customize their own reports with these apps will also need to make use of the Toolkit’s application programming interface (API), which is available on the &lt;a href=&quot;http://www.archiviststoolkit.org/forDevelopers/thirdPartyLibraries.shtml&quot;&gt;Archivists’ Toolkit website&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Another option for those with knowledge of MySQL is to use a free database administration tool such as &lt;a href=&quot;http://www.navicat.com/&quot;&gt;Navicat&lt;/a&gt; for MySQL. The beauty of this application is that with a little MySQL you can query and &lt;strong&gt;batch edit&lt;/strong&gt; data in the AT MySQL tables. A similar tool is &lt;a href=&quot;http://www.dadabik.org/&quot;&gt;DaDaBIK&lt;/a&gt;, a free PHP application that allows you to easily create a highly customizable front-end for a database in order to search, insert, update, and delete records. &lt;strong&gt;&lt;em&gt;Although these tools allow you to easily batch edit data in the AT, be forewarned that editing the tables directly is not sanctioned and may cause problems when upgrading.&lt;/em&gt;&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;We ran into problems during the upgrade to 1.5 after we had written data directly to the tables, most likely because we created new data/values (especially primary keys) in the tables directly. Subsequent work utilizing these tools with data already imported into the AT via EAD and Accession (XML) import has not encountered problems.</description><link>http://atatyale.blogspot.com/2009/04/reporting.html</link><author>noreply@blogger.com (Daniel Hartwig)</author><thr:total>0</thr:total></item></channel></rss>