CNI Fall 2014 Meeting: Fedora 4 early adopters

Fedora 4 Early Adopters

David Wilcox, Fedora Product Manager, DuraSpace

Fedora 4.0 released November 27. Built by 35 Fedora community developers. Native citizen of the semantic web – linked data platform service. Hydra and Islandora integration.

Beta pilots – Art Institute of Chicago, Penn State, Stanford, UCSD.

62 members in support of Fedora, funding increased dramatically (over $500k). Effort around building sustainability – more members at lower funding amounts. Governance model – Leadership and steering groups.

Fedora 4 roadmap – short term (6 months) – 4.1 will support migrations from Fedora 3. Want to establish migration pilots, and prioritize 4.1 features.

Fedora 4.1 features – focus on migrations, but some new features – API partitioning, Web Access Control, Audit service, remote/asynch storage are candidates.

Fedora 4 training – 3 workshops held in October (DC, Australia, Colorado), more planned for 2015.

It is possible that Fedora 4 could be a back-end for VIVO.

If you want to go with Hydra at this point you should go to Fedora 4, not 3.

Declan Fleming, UCSD

Original goals – map UCSD’s deeply-nested metadata to simpler RDF vocabularies, taking advantage of Fedora 4’s RDF functionality. Ingest UCSD DAMS4’s 71k objects using different storage options to compare ingest performance, functionality, and repository performance. Synchronize content to disk and/or an external triple store.

Current status – Initial mapping of metadata completed for pilot work. Ingested sample dataset using multiple storage options: Modeshape, federated filesystem, and hybrid (Modeshape objects linked to federated filesystem files). Ingested full UCSD DAMS4 dataset into Fedora 4 using Modeshape.

Ongoing work – continuing to refine metadata mapping, as part of the broader Hydra community push toward interoperability and pluggable data models. Full-scale ingest with simultaneous indexing, full-scale ingest with hybrid storage (about ready to give up on that and embrace Modeshape), performance testing.

Metadata ingest slowed down over time – they use a lot of blank nodes, which adds to the complexity of the structure and might be the reason.

File operations were very reliable. Didn’t test huge files rigorously.

Stefano Cossu – Art Institute

DAMS project goals – will take over part of current Collection Management System duties – 270k objects, 2/3 of which are digitized. Strong integration with existing systems, adopt standards, single source for institution-wide shared data. Meant to become a central hub of knowledge.

LAKE – Linked Asset and Knowledge Ecosystem. Integrates with CITI (collection management system) which is the front-end to Fedora (LAKE) which acts as the asset store.

Why Fedora? Great integration capabilities, very adaptable, built on modern standards, focus on data preservation. Makes no assumptions about front-end interface. REST APIs. Speaks RDF natively.

Key features for the AIC – Content modeling, federation, asynchronous automation, external indexing, flexible storage.

Content modeling: adding/removing functionality via mix-ins. Can define types and sub-types. Spending lots of time building a content model. Serves as a foundation for ontology. Still debating whether JCR is the best model for building a content model. Additional content control is on their wish list.

Asynchronous Automation: Used Modeshape sequencers so far. Camel framework offers more generic functionality and flexibility. Uses: extract metadata on ingestion, create/destroy derivatives based on node events, index content.
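The event-driven pattern behind those uses can be sketched in a few lines. This is a plain-Python illustration only, not the Modeshape sequencer or Camel API; the event names (`node_created`, `node_removed`) and the handler names are hypothetical, and the handlers run inline here, where a message-based setup would run them asynchronously.

```python
from collections import defaultdict

# Hypothetical event names; a real repository names its events differently.
handlers = defaultdict(list)

def on(event_type):
    """Register a function to run whenever event_type is published."""
    def register(func):
        handlers[event_type].append(func)
        return func
    return register

log = []  # stands in for the real side effects (files written, index updated)

@on("node_created")
def extract_metadata(path):
    log.append(f"extract metadata from {path}")

@on("node_created")
def create_derivatives(path):
    log.append(f"create derivatives for {path}")

@on("node_created")
def index_content(path):
    log.append(f"index {path}")

@on("node_removed")
def destroy_derivatives(path):
    log.append(f"destroy derivatives for {path}")

def publish(event_type, path):
    """Deliver an event to every handler registered for its type.

    Handlers are called inline; a queue or broker would decouple them.
    """
    for handler in handlers[event_type]:
        handler(path)

publish("node_created", "/objects/1")
publish("node_removed", "/objects/1")
```

The point of the pattern is that ingest code only publishes events; new automation is added by registering another handler rather than changing the ingest path.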

Filesystem federation to access external sources, custom database connector.

Indexing: multiple indexing engines – powerful search/query tools: triplestore, Solr, etc.

Tom Cramer – Stanford

Exercising Fedora as a linked data repository – introducing Triannon and Stanford’s Fedora 4 beta pilot

Use case 1: digital manuscript annotations. Used open annotation W3C working group approach to map annotation into RDF. Tens of thousands of annotations – where to store, manage, and retrieve?

Use case 2: Linked data for libraries. Bibliographic data, person data, circulation and curation data. Build a virtual collection without enriching the core record using linked data to index and visualize.

Need an RDF store, need to persist, manage, and index. Not the ILS nor core repository – this is a fluid space while the repository is stable and reliable. All RDF / linked data.

Fedora was a good fit: Native RDF store, manage assets (bitstreams), built in service framework (versioning, indexing, APIs), easy to deploy.

Linked Data Platform (LDP): W3C draft spec, enables read-write operations of linked data via HTTP, developed at the same time as Fedora 4; Fedora 4 is one of a handful of current LDP implementations.
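To make "read-write operations of linked data via HTTP" concrete, here is a minimal sketch that builds (but does not send) the three kinds of request involved. The base URL, the `annotations` container, and the `anno1` name are assumptions for illustration, not details from the talk; the `Slug` header is only a naming hint that the server may ignore.

```python
from urllib.request import Request

# Assumption: a local Fedora 4 instance exposing its LDP API at this URL.
BASE_URL = "http://localhost:8080/rest"

def ldp_create(container, slug, turtle):
    """Build a POST asking an LDP container to create a child resource."""
    return Request(
        url=f"{BASE_URL}/{container}",
        data=turtle.encode("utf-8"),
        method="POST",
        headers={"Content-Type": "text/turtle", "Slug": slug},
    )

def ldp_read(path):
    """Build a GET asking for the resource's triples serialized as Turtle."""
    return Request(
        url=f"{BASE_URL}/{path}",
        method="GET",
        headers={"Accept": "text/turtle"},
    )

def ldp_delete(path):
    """Build a DELETE for the resource at path."""
    return Request(url=f"{BASE_URL}/{path}", method="DELETE")

# The empty <> subject refers to the resource being created.
triples = '<> <http://purl.org/dc/terms/title> "A sample note" .'
create = ldp_create("annotations", "anno1", triples)
read = ldp_read("annotations/anno1")
delete = ldp_delete("annotations/anno1")
```

Passing any of these to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without a server.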

Stanford pilot: install, configure & deploy Fedora 4; exercise LDP API for storing annotations and associated text/binary objects; develop support for RDF references to external objects; test scale with millions of small objects; integrate with read/write apps and operations – annotation tools (e.g. Annotator), indexing and visualization (Solr and Blacklight)

Current: Annotator (Mirador) <- json-ld -> Triannon (Rails engine for open annotations stored in Fedora 4) <-> LDP – Fedora 4.

Future: Blacklight and Solr.

Learned to date: Fedora 4 approaching 100% LDP 1.0 compliance; Triannon at alpha stage (can write, read & delete open annotations to/from Fedora 4). Still to come: updates to annotations, storage of binary blobs in Fedora, implement authn/z, deploy against real annotation clients, populate with data at scale.

Looking at Fedora 4 as a general store for enriching digital objects and records through annotating, curating, tagging.


CNI meeting Fall 2014: SHARE update

SHARE update
Tyler Walters, SHARE director and dean of libraries at Virginia Tech
Eric Celeste, SHARE technical director
Jeff Spies, Co-founder/CTO at Center for Open Science

SHARE is a higher education initiative to maximize research impact. (huh?)

Sponsored by ARL, AAU, APLU.

Knowing what’s going on and keeping informed of what’s going on.

Four working groups addressing key tasks: repository, workflow, technical, communication

Received $1 million from IMLS and Sloan to generate a notification service.

SHARE is a response to the OSTP memo, but its roots go back before that.

Infrastructure: Repositories, research network platforms, CRIS systems, standards and protocols, identifiers

Workflow – multiple silos = administrative burden

Policy – public access, open access, copyright, data management and sharing, internal policies.

Institutional context: US federal agencies join growing trend to require public access to funded research; measurable proliferation of institutional and disciplinary repositories; premium on impact and visibility in higher ed.

Research Context – Scholarly outcomes are contextualized by materials generated in the process and aftermath of scholarly inquiry. Research process generates materials covering methods employed, evidence used, and formative discussion.

Research libraries: collaboration among institutions going up; shift from collections as products to collections as components of the academy’s knowledge resources; library is supporting and embedded within the process of scholarship.

Notification Service: Knowing who is producing what, and under whose auspices, is critical to a wide range of stakeholders – funders, sponsored research offices, etc.

Researchers produce articles, preprints, presentations, datasets, and also administrative output like grant reports and data management plans. Research release events. Meant to be public.

Consumers of research release events: repositories, sponsored research offices, funders, public. Interest in process as well as product. Today each entity must make its own arrangements with each of the others to learn what’s going on. Notification service shares metadata about research release events.

Center for Open Science has partnered with SHARE to implement notification service.

Looking for feedback on proposed metadata schema, though the system is schema agnostic.

API – push API and content harvesters (pulling data in from various sources). Now have 24 providers and adding more. 16 use OAI-PMH while 8 use non-standard metadata formats.
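For the OAI-PMH side, a pull harvester boils down to requesting `ListRecords` and reading metadata out of the XML that comes back. The sketch below shows that shape under stated assumptions: the endpoint `http://example.org/oai` and the sample record are made up, and this is not how SHARE’s harvesters are actually written. Paging with `resumptionToken` is left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # only records changed since this date
    return f"{base_url}?{urlencode(params)}"

def titles(response_xml):
    """Pull every dc:title out of a ListRecords response."""
    root = ET.fromstring(response_xml)
    return [t.text for t in root.iterfind(".//oai:record//dc:title", NS)]

# Made-up response, trimmed to the elements the parser looks at.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample preprint</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("http://example.org/oai", from_date="2014-12-01")
found = titles(SAMPLE)
```

The 8 providers with non-standard formats would each need their own equivalent of `titles`, which is presumably why consistency across providers comes up as a lesson learned.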

Harvested data gets put into Open Science Framework – pushes out RSS/Atom, PubSubHubbub, etc. Sits on top of Elasticsearch. You can add a Lucene-format full-text search to a data request.
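A Lucene-format search against an Elasticsearch-backed service usually travels as a `query_string` query. A minimal sketch of building that request body follows; the field names `title` and `source` are hypothetical, since the notes do not record the schema or the endpoint.

```python
import json

def search_body(lucene_query, size=10):
    """Wrap a Lucene-syntax string in an Elasticsearch query_string query."""
    return {
        "query": {"query_string": {"query": lucene_query}},
        "size": size,  # maximum number of hits to return
    }

# Hypothetical field names; the real schema may differ.
body = search_body('title:"open access" AND source:arxiv')
payload = json.dumps(body)  # JSON text to send as the request body
```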

250k research release events so far. arXiv and CrossRef are largest providers. Averaging about 900 events per day. Now averaging 2-3k per day in last few days as new providers are added.

Developed push protocol for providers to push data rather than waiting for pull.

Public release: Early 2015 beta release, fall 2015 first full release.

Some early lessons: Metadata rights issues – some sites not sure about their right to, for example, share abstracts. Is there an explicit license for metadata (e.g. CC Zero)?

Inclusion of identifiers – need some key identifiers to be available in order to create effective notifications. Most sources do not even collect email addresses of authors, much less ORCID or ISNI. Most sources make no effort to collect funding information or grant award numbers. Guidelines? See

Consistency across providers – reduce errors, simplify preparing for new providers. Required for push reporting.

Next layer: Reconciliation service – takes output of notification service to create enhanced and interrelated data set.

Share Discovery – searchable and friendly.

Phase 2 benefits – Researchers can keep everyone informed by keeping anyone informed; institutions can assemble a more comprehensive record of impact; open access advocates can hold publishers accountable for promises; other systems can count on consistency of metadata from SHARE.

Relation to CHORUS – when items get into CHORUS it is a research release event, hopefully will get into notification service.

CNI Fall 2013 – SHARE Update: Higher Education And Public Access To Research

Tyler Walters, Virginia Tech, MacKenzie Smith, UC-Davis

SHARE – Ensuring broad and continuing access to research is central to the mission of higher education.

Catalyst was the February OSTP memo.

SHARE Tenets – How do we see the research world and our role in it? Independent of the operationalization of OSTP directive, the higher ed community is uniquely positioned to play a leading role in stewardship of research. How can we help PIs and researchers meet compliance requirements?

Higher ed also has an interest in collecting and preserving scholarly output.

Publications, data, and metadata, should be publicly accessible to accelerate research and discovery.

Complying with multiple requirements from multiple funding sources will place a significant burden on principal investigators and offices of sponsored research. The rumors are that different agencies will have different approaches and repositories, complicating the issue.

We need to talk more about workflows and policies. We can rely on existing standards where available.

SHARE is a cross-institutional framework to ensure access to, preservation and reuse of, and policy compliance for funded research. SHARE will be mostly a workflow architecture that can be implemented differently at different institutions. The framework will enable PIs to submit their funded research to any of the deposit locations designated by federal agencies using a common UI. It will package and deliver relevant metadata and files. Institutions implementing SHARE may elect to store copies in local repositories. Led by ARL, with support from AAU and APLU. Guided by a steering committee drawing from libraries, university administration, and other core constituencies.

Researchers – current funder workflows have 20+ steps. Multiple funders = tangle of compliance requirements and procedures, with potential to overwhelm PIs. Single deposit workflow = more research time, less hassle.

Funding Agencies – Streamlines receipt of information about research outputs; Increases likelihood of compliance

Universities – Optimize interaction among funded research, research officers, and granting agency; Creates organic link between compliance and analytics

General Public – makes it easier for the public to access, reuse, and mine research outputs and research data; adoption of standards and protocols will make it easier for search engines.

Project map: 1. Develop Project Roadmap (hopefully in January); 2. Assemble working groups; 3. Build prototypes; 4. Launch prototypes; 5. Refine; 6. Expand

MacKenzie – The great thing about SHARE is that it means something different to everyone you talk to. 🙂

Architecture (very basic at present) – 4 layers:

Thinking about how to federate content being collected in institutional repositories – content will be everywhere in a distributed content storage layer. Want customized discovery layers (e.g. DPLA) above that. Notification layer above that. Content aggregations for things like text mining (future), in order to support that will need to identify content aggregation mechanisms (great systems out there already).

Raises lots of issues:

  • Researchers don’t want to do anything new, but want to be able to apply. Want to embed notification layer into tools they already use.
  • Sponsored research offices are terrified of a giant unfunded mandate. So we have to provide value back to researcher and institution, and leverage what we already have rather than building new infrastructure.
  • Who should we look at as existing infrastructure to leverage?

What’s the balance between publications and data? Both were covered in the memo, but it was very vague. Most agencies and institutions have some idea how to deal with publications, but not the data piece. Whatever workflow SHARE deals with will have to incorporate data handling.

Want to leverage workflow in sponsored project offices to feed SHARE.

What do we know about CHORUS (the publishers’ response) at this point? Something will exist. It would be good to have notifications coming out of CHORUS – they are part of the ecosystem.

Faculty will get a lot more interested in what’s being put out on the web about their activities. Some campuses are tracking a lot of good data in promotion and tenure dossier systems, but that may not be able to be used for other purposes.

Will there be metadata shared for data or publications that can’t be made public? Interesting issues to deal with.

What does success look like? We don’t know yet – immediate problem is the OSTP mandate which will come soon. At base the notifications system is very important – the researcher letting the people who need to know learn that output has been produced from their research. Other countries have had notions of accountability for funded research for a long time. Even deans don’t know what publications from faculty are being produced. In Europe they don’t have that problem.

Want to be in a position by end of 2014 to invite people to look at and touch system. Send thoughts to