Reviewer: Sam Wolski, eResearch Services, Griffith University (firstname.lastname@example.org). OS: OSX 10.8.2. Browser: Chrome 29.0.1547.65. Test Case(s): Supplied ‘HCS vLab Testing August’ document.
Preliminary Comments: The HCS vLab is easily one of the best interfaces I’ve come across in Australian research projects and eResearch applications. The Bootstrap framework is a great development platform, and the workflows and interfaces of the HCS vLab have been integrated well to form a beautifully clean and usable application. The following feedback is intended to provide a list of small improvements to the application.
John Hajek, Caroline Jones and Nick Thieberger gave a presentation on the HCS vLab at the Annual meeting of the Australian Linguistics Society (ALS 2013) on Friday 04/10/2013. Here are some questions that were raised by participants and the answers from the HCS vLab team.
Felicity Cox asked, with respect to the Mitchell and Delbridge data, whether it would be possible in the HCS vLab to search by audio type and data types, e.g. sentences, words, since that information is in the file names for the original M&D data.
Answer: Since the item name comes from the filename, we could search item names to filter by audio type and data type. This type of searching isn’t currently possible, but general metadata search functionality is currently being built into the system. Someone who knows more about each of the data sources could also help us improve the ingestion of metadata for that source.
Audience members wanted to be able to use the search box function on the main website to search the metadata fields (e.g. location of recording, or origin of speaker) not just the Item text contents.
Answer: We’re currently developing this functionality.
Felicity Cox and Adam Schembri were both quite interested in contributing legacy data to the HCS vLab, and asked if there would be any funding available to support contribution. It was suggested that people working on Australian languages might submit their data to PARADISEC and the data could enter the HCS vLab indirectly that way.
Answer: We will set up a process for taking data from people, cleaning it, and ingesting it; this process will be documented with the final release of the HCS vLab. Submitting data to PARADISEC was indeed the path envisaged for Australian languages data, but it means we need to put in place a way to regularly update the PARADISEC collection ingested into the HCS vLab.
Ben (HDR tester from Melbourne) thought it would be good if within the main site you could change the name of an Item List. (We did agree you can currently change it in Galaxy.)
Answer: We will add support for renaming item lists.
Aidan Wilson (HDR tester from Melbourne) said that he had had what he thought might be browser issues with viewing EOPAS.
Answer: We’re aware of the issue and have addressed it in the new version of the HCS vLab.
It was agreed that including EMU would be great, though some people asked if Praat could also be available.
Answer: Praat is not part of the set of tools slated for Phase I of the HCS vLab project, but we agree it would be good if we could find a way to include it. We are keeping a list of tools people have said they want and which we will consider for inclusion in Phase II (from July 2014). For now, users can download data files and use them in Praat. If users need to add the annotation files, we could add support for converting annotations to Praat format, or write a widget converting JSON-LD (our current format) to Praat format.
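As a rough illustration of the kind of conversion widget mentioned above, here is a minimal sketch in Python that turns a list of time-aligned annotations into a single-tier Praat TextGrid. The input structure (dicts with `start`, `end`, and `label`) is a simplifying assumption, not the actual JSON-LD layout used by the HCS vLab, and a real converter would need to handle multiple tiers and gaps between intervals.

```python
# Illustrative sketch only: the annotation structure below is assumed, and
# the HCS vLab's real JSON-LD annotations would first need mapping onto it.

def annotations_to_textgrid(annotations, tier_name="words"):
    """annotations: list of dicts with 'start', 'end' (seconds) and 'label'."""
    annotations = sorted(annotations, key=lambda a: a["start"])
    xmin = 0.0
    xmax = max(a["end"] for a in annotations)
    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        f"xmin = {xmin}",
        f"xmax = {xmax}",
        "tiers? <exists>",
        "size = 1",
        "item []:",
        "    item [1]:",
        '        class = "IntervalTier"',
        f'        name = "{tier_name}"',
        f"        xmin = {xmin}",
        f"        xmax = {xmax}",
        f"        intervals: size = {len(annotations)}",
    ]
    for i, a in enumerate(annotations, start=1):
        lines += [
            f"        intervals [{i}]:",
            f"            xmin = {a['start']}",
            f"            xmax = {a['end']}",
            f'            text = "{a["label"]}"',
        ]
    return "\n".join(lines)

if __name__ == "__main__":
    demo = [
        {"start": 0.0, "end": 0.6, "label": "hello"},
        {"start": 0.6, "end": 1.2, "label": "world"},
    ]
    print(annotations_to_textgrid(demo))
```

The output is plain text in Praat’s long TextGrid format, so a user could save it alongside the downloaded audio file and open both in Praat directly.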
Felicity Cox asked if ultrasound and EEG (any electronic data, really) could be put into the HCS vLab and then be available for analysis there.
Answer: This is something we would like to have, and it should already be possible: there is nothing special about these file types that would prevent them from being added.
Adam Schembri asked if ELAN would be included in the tools (Adam uses ELAN with video extensively, for sign language research). Many linguists use ELAN.
Answer: We already have some ELAN annotations in the EOPAS datasets, so to some extent we are supporting it.
Several linguists who have done work with historical sources (manuscripts, colonial letters, etc.) strongly wanted to have the PDF scan of the original source, rather than the typed-up version, treated as ‘Primary Data’.
Answer: We agree that they shouldn’t be considered “Primary Data” but the typed versions of the files are listed as “Original” by the collection creators. There are some PDF files in PARADISEC and it would be possible for researchers to add PDF scans to any of the AusNC collections.
Is there a vLab FAQ page we can add the questions and their answers to?
Answer: We will use these questions as the basis for an FAQ; for now we are posting them on the project blog.
Is there an HCS vLab mailing list for people to join?
Answer: Sorry, not yet.
It’s good to talk
We (the Intersect dev team) had some very productive initial chats with Peter Sefton & Steve Cassidy about what the HCS vLab architecture should look like. We perhaps see most solutions as looking a bit like a web app, and Peter perhaps sees most solutions as looking a bit like a repository, while Steve was there to keep us focussed on what we were aiming towards, as well as bringing valuable real-world experience with some of the tools and collections that will make up the vLab. Together I think we got close to a good initial architecture for the vLab. We came up with some high-level principles that our design should aspire to.
- Federated discovery is hard and is as slow as the slowest responder.
- Ingest should be decoupled from the processing you do on ingest (e.g. indexing).
- All access to data should be through an API.
- Move data, especially large data, as little as possible.
- Loose coupling and high cohesion just makes sense.
- Re-using technology is better than re-inventing it.
A picture’s worth a thousand words
We then captured the design in this initial diagram of the HCS vLab architecture:
Key points of the design
- All access, whether from the web or other things such as a workflow engine or command line tools, will be through an API. This should enable us to change how we do things behind the scenes without breaking the tools that use the API to access data.
- We will use a Repository to help us control access to data as well as give us a centralised place to store metadata and a place to store user generated artefacts such as Annotations and Item Lists.
- We will use a Repository Manager to abstract access to the repository away from the API. This should enable us to replace a poorly performing repository with something better without having to change our code too much.
- We will use Solr, or similar, to index the metadata and provide the back-end for a discovery service.
- We will use a Message Queue to decouple ingest from the processing that follows ingest. When an item is ingested its metadata will go into the Repository and a message will go onto a queue to tell listening workers that a new item has arrived for processing. These workers will then perform the necessary subsequent processing such as indexing the metadata for Solr.
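The ingest/worker decoupling described above can be sketched in miniature. The following Python example is purely illustrative: it uses an in-process `queue.Queue` and plain dicts standing in for the message broker, the Repository, and the Solr index, and all names (`ingest`, `index_worker`, the item IDs) are invented for this sketch rather than taken from the vLab codebase.

```python
# Minimal in-process sketch of queue-decoupled ingest. A dict stands in for
# the Repository (metadata store) and another for the Solr index; a real
# deployment would use a proper message broker between separate processes.
import queue
import threading

ingest_queue = queue.Queue()
repository = {}   # stands in for the Repository
index = {}        # stands in for the Solr index

def ingest(item_id, metadata):
    """Store metadata in the repository and enqueue a message for workers."""
    repository[item_id] = metadata
    ingest_queue.put(item_id)  # ingest returns immediately; indexing happens later

def index_worker():
    """Listening worker: picks up newly ingested items and indexes them."""
    while True:
        item_id = ingest_queue.get()
        if item_id is None:    # sentinel used to shut the worker down
            break
        index[item_id] = repository[item_id]["title"].lower()
        ingest_queue.task_done()

worker = threading.Thread(target=index_worker)
worker.start()

ingest("md:001", {"title": "Mitchell & Delbridge sentence recording"})
ingest("md:002", {"title": "PARADISEC field tape"})

ingest_queue.join()     # wait until the worker has processed everything
ingest_queue.put(None)  # stop the worker
worker.join()
print(sorted(index))
```

The point of the structure is that `ingest` never blocks on indexing: if the indexer is slow or down, items still land safely in the repository, and more workers can be added to drain the queue.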
In late March and April, the HCSvLab team conducted eight interviews with researchers from a range of disciplines. This was done as a follow up to the survey we conducted earlier in the project. The aim of the interviews was to dig deeper and gain a better understanding of the needs of researchers in relation to the virtual laboratory.
The needs of the future users of the virtual laboratory are as diverse as the disciplines they come from. However, the interviews uncovered some common themes. What follows is a summary of those common themes.
A number of researchers we spoke to are keen to get access to more publicly available, well-described data. Many researchers currently rely on the availability of corpora from overseas sources, such as the U.S. National Library of Medicine corpus, British National Corpus and publicly available EMA (Electromagnetic Articulography) data, for their research.
We also found that the majority of researchers are using data that they have collected themselves. Many of these researchers are not able to readily share that data due to commercial licensing or issues with privacy and consent.
Access to powerful search functionality is high on the list of priorities for a number of researchers. A couple of examples were provided of websites with powerful, easy-to-use search interfaces: BYU-BNC: British National Corpus and Collins “WordBanks”.
A convenient way of sharing annotations with others would be valuable to a number of researchers we talked to. Some researchers need to have multiple people annotating the same document for quality reasons, so a way to track or co-ordinate that would also be useful.
Tool Use and Workflows
The researchers we talked to use a wide variety of tools, and since the virtual laboratory will not be able to host the full range, it needs to support easy movement of data into and out of the tools that are available.
The researchers we spoke to were fairly evenly split between those that want to be able to share workflows (for collaboration or reproducibility reasons) and those that do not have a need to share workflows. For those researchers interested in sharing their workflow with others, it would be advantageous if the virtual laboratory were able to host all of the tools their workflow makes use of.
Many of the researchers interviewed make regular use of tools or scripts that they have developed themselves. These tools are often not general enough for a wider audience and have been specifically tailored with a particular research question in mind.
A number of researchers interviewed are interested in taking advantage of the cloud-computing infrastructure offered by NeCTAR via the virtual laboratory. In many cases, the researchers require that this infrastructure not be limited to use with just the corpora hosted by the virtual laboratory. For some researchers, it is a requirement that, in general, they be able to use the tools offered by the virtual laboratory with their own data.
Above and beyond
Researchers’ needs for the HCS virtual laboratory vary, but the common themes were:
- Access to more, well-described data
- Powerful search functionality
- Ability to share annotations with others
- Easy transportation of data in and out of tools made available in the virtual laboratory
- Sharing of workflows
- Access to cloud-computing infrastructure offered by NeCTAR
- Ability to use the tools and computing infrastructure offered by the virtual laboratory with data other than that hosted by the virtual laboratory
The dialogue with researchers continues with the HDR testing and we look forward to getting more input and hearing more feedback from the future users of the HCSvLab.
Researcher Input by Jared Berghold is licensed under a Creative Commons Attribution 3.0 Unported License.
One of the first things we developed on the project is a problem statement. This is intended as a high level summary of the problem we’re trying to solve in building the HCS Virtual Laboratory. I thought I’d present our current problem statement here and provide a bit of a commentary.
The ink isn’t quite dry on the contracts yet but we took the opportunity of having the Speech Science and Technology conference at Macquarie University to run a short workshop on the Virtual Laboratory. The goal of the workshop was partly to publicise the project but also to get some insight into the workflow that people were using in their research.