calcyte

A static-repository builder for human-scale data collections. Allows humans to describe data sets using spreadsheets for data entry. Run a script to generate blank spreadsheets and create HTML and JSON catalogs.

Calcyte

Audience

At this stage, this code is for developers only.

Code status

At this stage Calcyte is messy, unprofessional code. We are going to fix that; see the Roadmap.

About

Calcyte is (will be) a toolkit for:

  • Managing metadata for collections of data files, entered by people into automatically generated spreadsheets
  • Creating static HTML repositories, with a CATALOG.html file that serves as a gateway to the files
  • Packaging the static repositories for distribution using BagIt in the forthcoming Data Crate format

Installation

To use Calcyte on OS X:

  • Get Python 3 (eg by brew install python3)

  • Create a virtual environment and activate it:

    • Make a place for virtual environments if you don't have one: mkdir ~/virtualenvs
    • Make a virtual environment: python3 -m venv ~/virtualenvs/calcyte
    • Activate your virtual environment: . ~/virtualenvs/calcyte/bin/activate
  • Change directory to where you'd like to work: mkdir ~/working; cd ~/working

  • Get Calcyte: git clone https://codeine.research.uts.edu.au/eresearch/calcyte.git

  • Install the code (in your virtual environment if you set one up and activated it).

cd calcyte

pip install .

Usage

>./calcyfy --help
usage: calcyfy [-h] [--recurse] [--bagit BAGIT] [-force] dir

positional arguments:
  dir                   Directory to search for metadata.

optional arguments:
  -h, --help            show this help message and exit
  --recurse, -r         Recurse into directories.
  --bagit BAGIT, -b BAGIT
                        Path to a directory in which to create a bagit bag of
                        the data for distribution.
  -force, -f            Force deletion/replacement of output bag directory.
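
The help text above has the shape that Python's argparse module produces. As a rough sketch only (this is a reconstruction for illustration, not Calcyte's actual source, and the function name build_parser is an assumption), a parser giving similar options could look like this:

```python
import argparse


def build_parser():
    """Build a parser whose options mirror the help text shown above."""
    parser = argparse.ArgumentParser(prog="calcyfy")
    parser.add_argument("dir", help="Directory to search for metadata.")
    parser.add_argument("--recurse", "-r", action="store_true",
                        help="Recurse into directories.")
    parser.add_argument("--bagit", "-b",
                        help="Path to a directory in which to create a bagit "
                             "bag of the data for distribution.")
    # The help text shows a single-dash "-force", so the sketch keeps it.
    parser.add_argument("-force", "-f", action="store_true",
                        help="Force deletion/replacement of output bag directory.")
    return parser


# Parse the same arguments as the recursive test run described below.
args = build_parser().parse_args(["test_data/Glop_Pot", "-r"])
print(args.dir, args.recurse, args.bagit, args.force)
```

Recursion is off unless -r is given, which matches the Unix convention mentioned in the Roadmap.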

Try running Calcyte on the test data in recursive mode:

 ./calcyfy test_data/Glop_Pot -r

Calcyfy will create or update:

  • A CATALOG.xlsx file in each directory in test_data/Glop_Pot listing each file in the directory
  • A CATALOG.html file in test_data/Glop_Pot summarising the whole data collection
  • A CATALOG.json file containing JSON-LD linked data about the files

On the first run the HTML and JSON files are not very useful, because the spreadsheets do not yet contain any metadata.

Edit the spreadsheets in each directory to add metadata about your files (we've added sample metadata to the test directory already). The spreadsheet in the root directory contains extra tabs for describing associated entities: people, organisations, unusual file formats and equipment. TODO: A detailed cookbook about how to code metadata.

Roadmap

Here's a brief roadmap for Calcyte covering a number of areas. We will flesh this out over time, and create some milestones for development.

Code quality, testing etc

The initial version of Calcyte is not very high quality code; it was thrown together as a proof of concept. Now that the concept is proven we are going to clean it up in the following areas:

  • Making code PEP 8 compliant.
  • Improving test coverage.
  • Adding a proper logging system.
  • Dealing with exceptions throughout the code.
  • Continuous integration (via Docker?), leading to
  • Containerisation (via Docker?) to make it easy to deploy.

Specific cleanups:

  • Get rid of the Pandas dependency: it's not adding anything useful

Usability

Calcyte is initially to be run from the command line, via the calcyfy command. We are trying to embrace Unix conventions, such as not recursing into directories by default. In future Calcyte may run in other modes, such as:

  • Automated file-watching or timed operation to create spreadsheets and HTML/JSON without user intervention.
  • Triggering via a web application on top of a Share Sync service like ownCloud.

For now we will be exploring:

  • For data consumers: responsive HTML in the generated CATALOG.html, for use across multiple devices
  • Pattern-based metadata, such as:
    • Making a single spreadsheet entry for all files matching a pattern (eg test_.*\.py is a Python test)
    • Automated extraction of metadata from filenames using regexes, including some defaults for finding dates etc
    • Automated extraction of metadata from files using archiving and preservation tools (eg Apache Tika)
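
The first two pattern-based ideas above could be sketched roughly as follows. This is an illustration only: the rule list, the ISO-style date format and the function name describe are assumptions, not part of Calcyte.

```python
import re
from datetime import date

# Hypothetical rules: each maps a filename regex to metadata shared by
# every file that matches it.
PATTERN_RULES = [
    (re.compile(r"test_.*\.py"), {"description": "Python test"}),
]

# Hypothetical default: find ISO-style dates (YYYY-MM-DD) in filenames.
DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def describe(filename):
    """Return metadata inferred from a filename alone."""
    metadata = {}
    for pattern, shared in PATTERN_RULES:
        if pattern.fullmatch(filename):
            metadata.update(shared)
    match = DATE_RE.search(filename)
    if match:
        try:
            year, month, day = (int(part) for part in match.groups())
            metadata["date"] = date(year, month, day).isoformat()
        except ValueError:
            pass  # digits that look like a date but are not a real one
    return metadata


print(describe("test_catalog.py"))        # {'description': 'Python test'}
print(describe("survey_2017-03-21.csv"))  # {'date': '2017-03-21'}
```

Keeping the rules as data rather than code would let them live in a spreadsheet tab or a configuration file later on.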

Scalability

Calcyte is designed initially for human-scale data collections, that is, ones where it is practical to describe each file in a spreadsheet. We also need to allow:

  • Sensible, configurable defaults for how many files to add to a spreadsheet.
  • Ways to minimise catalog spreadsheet entries, such as pattern matching and flagging directories to be ignored or described only at the collection level.
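
One possible shape for these two controls is sketched below. The marker file name .calcyte-ignore, the default limit of 100 and the function name files_to_catalog are all hypothetical; Calcyte does not define them.

```python
import os

IGNORE_MARKER = ".calcyte-ignore"  # hypothetical marker file name
MAX_FILES_PER_SHEET = 100          # hypothetical default limit


def files_to_catalog(root, max_files=MAX_FILES_PER_SHEET):
    """Yield (directory, filenames) pairs, skipping flagged directories.

    A directory containing the marker file is skipped together with
    everything below it; other directories contribute at most
    max_files names each.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        if IGNORE_MARKER in filenames:
            dirnames[:] = []  # prune: do not descend any further
            continue
        dirnames.sort()  # visit subdirectories in a stable order
        yield dirpath, sorted(filenames)[:max_files]
```

Pruning dirnames in place is the documented way to stop os.walk from descending, so a flagged directory costs nothing beyond one listing.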

Previews and dissemination

In the future we would like to support automated thumbnail, preview and dissemination versions of content in Data Crates. For example:

  • Image thumbnails in CATALOG.html
  • Automatic extraction of CSV from proprietary time-sequence formats
  • Little GIF previews of large multi-page TIFF microscope images

There is a lot of overlap here with the digital preservation space, which already has file-format registries and software for identifying formats.

See Of The Web, an inactive project from Western Sydney University, which did some work in this area.

File tracking

At the moment Calcyte automatically finds new files in a directory and adds them to the CATALOG.xlsx where they can be described by a human.
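
The idea of finding new files can be sketched as a set difference between what is on disk and what the catalog already lists. This is not Calcyte's actual code: the function name find_new_files is hypothetical, and excluding the generated CATALOG files from the listing is an assumption.

```python
import os

# Assumption: the generated catalog files are not themselves catalogued.
CATALOG_FILES = {"CATALOG.xlsx", "CATALOG.html", "CATALOG.json"}


def find_new_files(directory, catalogued):
    """Return sorted names of files in directory that the catalog lacks.

    catalogued is an iterable of filenames already listed in the
    directory's spreadsheet.
    """
    on_disk = {entry.name for entry in os.scandir(directory) if entry.is_file()}
    return sorted(on_disk - set(catalogued) - CATALOG_FILES)
```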

Future plans include:

  • Logging missing files, and moving their entries to a "missing-files" tab in the catalog spreadsheet
  • Picking up more file information such as size, date and checksums, and maybe using these to track moved files
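
A minimal sketch of how checksums could separate moved files from missing ones is given below. It assumes SHA-256 and a snapshot shaped as {relative path: checksum}; neither is specified by Calcyte, and the function names are hypothetical.

```python
import hashlib
import os


def fingerprint(path, chunk_size=65536):
    """Return size, modification time and SHA-256 checksum for one file."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large data files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    stat = os.stat(path)
    return {"size": stat.st_size, "modified": stat.st_mtime,
            "sha256": digest.hexdigest()}


def classify(previous, current):
    """Compare two {relative_path: checksum} snapshots.

    Returns (moved, missing): moved maps an old path to the new path
    holding the same content; missing lists old paths whose content
    no longer appears at any new path.
    """
    new_by_checksum = {}
    for path, checksum in current.items():
        if path not in previous:
            new_by_checksum.setdefault(checksum, path)
    moved, missing = {}, []
    for path, checksum in previous.items():
        if path in current:
            continue
        if checksum in new_by_checksum:
            moved[path] = new_by_checksum[checksum]
        else:
            missing.append(path)
    return moved, missing
```

Entries in missing are the ones that would go to the "missing-files" tab, while entries in moved could keep their human-entered metadata under the new path.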

Validation

  • We need to be able to check that redundant metadata is in sync, eg validate that CATALOG.html and CATALOG.json match
  • Calcyte will probably be able to check the validity of Data Crates (as it will be the first implementation that creates them)
  • We'll explore JSON schemas for validation; if that doesn't work we will try a Schematron-style, pattern-based approach to validating key features.
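
To give a feel for the pattern-based option, here is a small sketch in which each rule pairs a message with a test over the parsed JSON. The keywords @context, @graph and @id come from JSON-LD itself, but whether CATALOG.json uses a top-level @graph is an assumption, and the rule set is purely illustrative.

```python
import json


def graph_items(doc):
    """Return the @graph list, or an empty list if it is absent or malformed."""
    graph = doc.get("@graph")
    return graph if isinstance(graph, list) else []


# Hypothetical rules: (message, test) pairs checked against the catalog.
RULES = [
    ("catalog has a JSON-LD @context",
     lambda doc: "@context" in doc),
    ("catalog has a non-empty @graph list",
     lambda doc: len(graph_items(doc)) > 0),
    ("every @graph item has an @id",
     lambda doc: all(isinstance(item, dict) and "@id" in item
                     for item in graph_items(doc))),
]


def validate(doc):
    """Return the messages of all rules that the document fails."""
    return [message for message, test in RULES if not test(doc)]


sample = json.loads('{"@context": "https://schema.org/", "@graph": [{"@id": "./"}]}')
print(validate(sample))  # [] means every rule passed
```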