CSG Spring 2016: Common higher-ed logical data models and/or APIs


Satwinder Singh, Columbia

  • How do you get from the spaghetti diagram to a better place?
  • Foundation – Understand the systems of record, form working group with subject experts; identify user stories which then get translated into high level data models. Leads to:
  • Canonical model – abstracts systems of record to a higher level: logical data entities, data dictionary, metadata repository, leads to:
  • Generic Institution Models – are there generic Institution models for academia? Leads to:
  • Specifications and Platform: Specifications (schema standards, APIs, data formats, security).
  • Platform – IPaaS/On-Prem
    • API Gateway – how does it connect back to systems of record? Uses:
    • ESB – provides decoupling
  • API and Data Governance – Lifecycle (building hundreds of successful APIs requires process to decide and maintain them), intake process (look first to see if it already exists or can be extended from something that exists), portal; Data (Master Data Management) quality and stewards.
  • Agility – Impart higher velocity (need to decouple the what (“I need this data from …”) from the how (“I want a file”)); Build/use accelerators (e.g. pre-built connectors, or re-use things you’ve built already – the only way you can show ROI is by reuse); Use templates (interface specification; data mapping)
  • CoE@Columbia – Integration Principle (see Columbia’s document)
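
The canonical model and data-mapping template ideas above can be sketched roughly as follows. This is an illustration only, not Columbia's implementation: the system names (`hr_system`, `student_system`), the field names, and the `CanonicalPerson` entity are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanonicalPerson:
    """Logical entity that consumers see, independent of any system of record."""
    person_id: str
    full_name: str
    email: str


# Data-mapping templates: canonical field -> field name in each system of record.
# Both system names and every source field name are hypothetical.
FIELD_MAPS = {
    "hr_system": {"person_id": "EMPLID", "full_name": "NAME", "email": "EMAIL_ADDR"},
    "student_system": {"person_id": "sid", "full_name": "display_name", "email": "mail"},
}


def to_canonical(source: str, record: dict) -> CanonicalPerson:
    """Translate one source record into the canonical model using its mapping."""
    mapping = FIELD_MAPS[source]
    return CanonicalPerson(**{field: record[src] for field, src in mapping.items()})
```

Adding a third system of record then means adding one more mapping entry rather than touching consumers, which is the kind of reuse the agility point is after.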

Current Practices: Data models
Sarah Christen, Cornell – Survey results

  • 54% have created canonical data models at their institution – data warehouse, students, person data, APIs and reporting tools
  • 85% have data governance priorities
  • 77% using data dictionaries or business glossaries for metadata (people looking at Data Cookbook)


  • Went looking to build conceptual person model
  • ERP systems have models, but they are neither consistent nor shared
  • Formed working group which formed subcommittees
  • Some common attributes (name, DOB, known-by, etc.)
  • Location (can include online presences)
  • Event participation
  • Demographics (visa status, gender, ethnicity)
  • There’s discussion of the relationship of this person model and the people registries that have been built up in the identity management space. Keith Hazelton notes that if it’s just identity, it’s a subset, but if it’s used for access control, it needs more of the extended data to make access decisions.
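
A minimal sketch of the conceptual person model described above, with the identity registry treated as a subset of the fuller record. The attribute names and the choice of which fields count as "identity" are assumptions made for illustration, not the working group's actual model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Person:
    # Common attributes
    name: str
    date_of_birth: str                      # ISO 8601 date string
    known_by: Optional[str] = None
    # Locations, which can include online presences
    locations: List[str] = field(default_factory=list)
    # Event participation
    events: List[str] = field(default_factory=list)
    # Demographics
    visa_status: Optional[str] = None
    gender: Optional[str] = None
    ethnicity: Optional[str] = None


# Hypothetical choice of which attributes an identity registry would hold.
IDENTITY_FIELDS = ("name", "date_of_birth", "known_by")


def identity_view(person: Person) -> dict:
    """Return only the identity subset of the fuller person model."""
    return {f: getattr(person, f) for f in IDENTITY_FIELDS}
```

An access-control decision would need to read from the full `Person` record rather than from `identity_view`, which is the distinction raised in the discussion.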

Mike Fary, Chicago. Student Model

  • Built up representation of data objects used in Campus and Student Life units – took eight months.
  • Produced a data model that was good enough for the integrator to work with in the student system implementation project.
  • Realized there were some areas that needed further definition
  • Created a second working group to create definitions of FT/PT students, joint/dual degree programs, leave of absence/withdrawals. Took another six months
  • Went back afterwards to get reaction from people who had participated. Overwhelmingly people are positive.
  • In some cases agreed to disagree, but at least that’s now documented.
  • Value of the process is in the conversations and the realizations that different parts of the institution view the same data differently. Will need to continue those conversations as further models are built.

Jonathan Saperia – Harvard – VIVO Data Ontology

  • VIVO is a good example of something we can look at that has already been done.
  • Started out trying to solve a specific problem – wasn’t looking to define an ontology, but one came out of it.
  • VIVO is “an integrated view of the scholarly record of the University”: information about research and researchers.
  • Uses semantic web technology (RDF).
  • Started at Cornell in 2003; received an NIH grant in 2009.
  • Currently a DuraSpace project with members, supporters, and investors
  • More than 28 public implementations, 80 implementations in progress.
  • Use extends beyond research reporting, but it’s not easy.
  • Rather than fleshing out a description of a domain, they were driven by wanting to enable communication and collaboration in scholarship.
  • VIVO take-aways: Start small and build a case; sorting out what is public and what is not is important; stewardship of the project is critical – must be owned at the highest level of the institution; clearly identify where the home of the record will be and who is responsible for it; identify practices of the university; change management and governance are important. Practices across the institution vary and matter in different ways.
  • Domains don’t exist in isolation – lots of relationships
  • Danger is that models will become shelfware – will they evolve? You want something that’s part of a regular process to keep it alive.
  • Why do we separate the actual data about the mission of the university (research output, learning analytics) from the data about the business (student system, research funding)?
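
To make the RDF point concrete, here is a small sketch that expresses facts about a researcher as triples and serializes them as N-Triples using only the standard library. The researcher IRI is hypothetical, and the sketch uses the general-purpose FOAF vocabulary for brevity; the actual VIVO ontology is considerably richer.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"


def iri(value: str) -> str:
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value: str) -> str:
    """Quote a plain string literal, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(triples) -> str:
    """Serialize (subject, predicate, object) tuples, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


researcher = "https://vivo.example.edu/individual/n1234"  # hypothetical IRI
triples = [
    (iri(researcher), iri(RDF_TYPE), iri(FOAF + "Person")),
    (iri(researcher), iri(FOAF + "name"), literal("Ada Lovelace")),
]
print(to_ntriples(triples))
```

Because every statement is just subject, predicate, object, relationships to other domains (grants, courses, publications) can be added as more triples without changing a schema.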

Current Practices: APIs

Standards at Yale

  • encouraging consistency, maintainability, and best practices across applications
  • URIs
  • API description (OAI, RAML, etc.)
  • Common language, data dictionary and domain-specific ontologies (Data Cookbook)
  • Connections and Reusability (schema.org, JSON-LD, knowledge sources)
  • API Descriptions – RAML, OpenAPI (formerly known as Swagger), API Blueprint
  • Data Cookbook – Provides a central database that stores admin systems data definitions and report specifications – standards for defining your own campus terms
  • schema.org – adopted by major search engines; allows documenting/tagging content on website. Used in Siri and Knowledge Graph NLQA
  • Ontologies – Linked data world, very domain specific knowledge.
  • Plan for reusability
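
As an illustration of the schema.org and JSON-LD items above, the following sketch builds a JSON-LD description of a person using schema.org's `Person` and `CollegeOrUniversity` types. The function name, the example person, and the URLs are made up for illustration; this is not taken from Yale's standards.

```python
import json


def person_jsonld(name: str, profile_url: str, university: str) -> dict:
    """Describe a person and their affiliation with schema.org terms."""
    return {
        "@context": "https://schema.org",
        "@type": "Person",
        "name": name,
        "url": profile_url,
        "affiliation": {"@type": "CollegeOrUniversity", "name": university},
    }


# Hypothetical example data.
doc = person_jsonld("Ada Lovelace", "https://example.edu/people/ada", "Example University")
print(json.dumps(doc, indent=2))
```

The same JSON can be returned from an API or embedded in a web page, which is where the reusability argument comes from.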

Aaron Bucher – Data Governance

  • Roles – Data custodians; Data stewards (owners); Subject matter experts; Content developers
  • Enterprise Data Management And Reporting groups: Steering committee (owners); Data governance; Analytics collaborative (superusers sharing best practices); reporting and data specialists.
  • Data governance council – charged with establishing and maintaining data definitions; Meet monthly, vote on what definitions to establish.

Mike Chapple – Notre Dame

  • Have about 500 data definitions
  • Top lesson learned: It’s not about technology. It’s about collaborations and relationships.
  • Put together a process to adopt definitions by unanimous consensus.
    • Identify terms to work on; then identify stakeholders and roles
    • Stakeholders have to attend workshop where definitions are built – start with a draft, not a blank page

University of Michigan Medical School Data Governance

  • Hired a data governance manager 1.5 years ago
  • My data/our data – everyone thinks their data is their own
  • Data metrics on a dashboard to show how the program is proceeding
  • Will be headed into dealing with research data soon – there’s data there that should persist at the institution if a faculty member leaves, and there are compliance issues.

Jenn Stringer – users want agency over what data is collected about them – how do we provide that?


Sarah Christen – Survey results (APIs)

  • API Specs – 31% RAML, 46% Swagger
  • APIs are built with: REST 50%, both REST and SOAP 50%
  • Who are API consumers: 7%,
  • Do you have an API intake process? Yes 46% No 54%

Lech – Yale developer portal (https://developers.yale.edu/)

  • About six months old; created after students scraped a lot of web sites and after a New York Times article
  • Portal and architecture built on Layer 7 technology (now part of CA)
    • API Listings – several dozen with documentation
    • Resources –
    • Getting started documentation
    • Tools – API Explorer and source samples
    • Events and training – student developer and mentorship program
    • API Keys & OAuth2 tokens.
  • Two types of APIs – Public APIs and Secured APIs (require API key and OAuth login)
  • SOA Gateway – requires a technical person to make changes. Supports SSL/TLS. EULA Terms of Use (generic with a provision for amendment by specific data owners); WADL or RAML documentation; Yale NetID access, looking to add social identities.
  • Metrics: 31 APIs; 120 registered users (mostly students, some IT staff); 80 apps registered (in 8 months); Can see top applications, top orgs, performance, latency.
  • Found that students are using the APIs for senior level projects (important in terms of when things need to be up)
  • API Explorer – moving to use Postman (https://www.getpostman.com/)
  • Engagement – student developer & mentorship program, weekly training, ice cream social, YHack, CS50 lectures, Office hours.
  • They include some vendor APIs on their portal – bus location, dining, ZipCar.
  • One student app is used by 3-4k students – has lots of useful information for students – buses, laundry status, etc.
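
The public versus secured distinction above might look like this from a client's point of view. This is a sketch under stated assumptions: the base URL, the path, and the `X-API-Key` header name are hypothetical (gateways differ on how they accept keys), and the request is only constructed, not sent.

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request

API_KEY_HEADER = "X-API-Key"  # hypothetical header name; real gateways vary


def build_request(base_url: str, path: str, api_key: Optional[str] = None,
                  access_token: Optional[str] = None,
                  params: Optional[dict] = None) -> Request:
    """Build (but do not send) a GET request for a public or secured API."""
    query = f"?{urlencode(params)}" if params else ""
    headers = {"Accept": "application/json"}
    if api_key is not None:
        headers[API_KEY_HEADER] = api_key
    if access_token is not None:
        # Standard OAuth 2.0 bearer token usage.
        headers["Authorization"] = f"Bearer {access_token}"
    return Request(f"{base_url}{path}{query}", headers=headers)
```

A public call would omit both credentials, while a secured call passes the key issued by the portal plus a token obtained through the OAuth login.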

Kitty – the demand for learning analytics is going to drive the need to understand how to safeguard and treat types of data where governance has not matured. Yale has developed a provostial data governance group with the Library, faculty, etc. There’s a smaller group that deals with identity data.

Standards and Tools

REST, JSON, SCIM, Swagger, RAML – Keith Hazelton, Wisconsin Madison

  • Internet2 TIER and the API / Data Model Question – redrawing functional boundaries, refining the interfaces.
  • The one big gap in open source id management space is the entity registry space.
  • Goal is to enable IAM infrastructure mashups – (homegrown, TIER, commercial)
  • In TIER starting from the infrastructure layer – APIs being built for developers building TIER applications.
  • APIs and Data structures – SQL or NoSQL? Can we achieve non-breaking extensibility in data objects?
  • Entity representations in event-driven messaging and push/pull REST APIs. How different should data look if it’s a message rather than a response? Should they look any different? When is one appropriate and not the other?
  • REST-ful beats REST-less for loose coupling: implementation language agnostic; better for user-centric design
  • JSON – is XML’s bad rap deserved or not? JSON is adopting more of XML’s functions.
  • SCIM: A standard for cloud-era Provisioning: Design centered for provisioning into the cloud. TIER has settled on use of SCIM where it fits – resource types of user and group. Where it doesn’t, extend using SCIM mechanisms. Think twice before doing RPC-like things.
  • Representing APIs with Swagger 2 (now Open API) – decided it’s normative.
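
The SCIM and non-breaking extensibility points above can be illustrated with a small sketch. The core User schema URN and the `userName`, `name`, and `emails` attributes come from the SCIM 2.0 core schema; the campus extension URN and its `primaryAffiliation` attribute are hypothetical and are not TIER's actual extension.

```python
CORE_USER = "urn:ietf:params:scim:schemas:core:2.0:User"
# Hypothetical extension URN and attribute, invented for illustration.
CAMPUS_EXT = "urn:example:params:scim:schemas:extension:campus:2.0:User"

user = {
    "schemas": [CORE_USER, CAMPUS_EXT],
    "userName": "alovelace",
    "name": {"givenName": "Ada", "familyName": "Lovelace"},
    "emails": [{"value": "ada@example.edu", "primary": True}],
    # Extension attributes live under a key named after the extension's URN.
    CAMPUS_EXT: {"primaryAffiliation": "faculty"},
}


def primary_email(resource: dict):
    """Read a core attribute while ignoring any extensions that are present."""
    if CORE_USER not in resource.get("schemas", []):
        raise ValueError("not a SCIM core User resource")
    for email in resource.get("emails", []):
        if email.get("primary"):
            return email["value"]
    return None
```

A consumer written against the core schema keeps working when the extension block is added, which is one way to get non-breaking extensibility in data objects.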


