CSG Spring 2016: Common higher-ed logical data models and/or APIs

Introduction

Satwinder Singh, Columbia

  • How do you get from the spaghetti diagram to a better place?
  • Foundation – Understand the systems of record, form working group with subject experts; identify user stories which then get translated into high level data models. Leads to:
  • Canonical model – abstracts systems of record to a higher level: logical data entities, data dictionary, metadata repository, leads to:
  • Generic Institution Models – are there generic Institution models for academia? Leads to:
  • Specifications and Platform: Specifications (schema standards, APIs, data formats, security).
  • Platform – IPaaS/On-Prem
    • API Gateway – how does it connect back to systems of record? Uses:
    • ESB – provides decoupling
  • API and Data Governance – Lifecycle (building hundreds of successful APIs requires process to decide and maintain them), intake process (look first to see if it already exists or can be extended from something that exists), portal; Data (Master Data Management) quality and stewards.
  • Agility – Impart higher velocity (need to decouple what (“I need this data from …”) and how (“I want a file”)); Build/use accelerators (e.g. pre-built connectors, or re-use things you’ve built already – the only way you can show ROI is by reuse); Use templates (interface specification; data mapping)
  • CoE@Columbia – Integration Principle (see Columbia’s document)
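
A minimal sketch of the canonical-model and data-mapping-template ideas above. It assumes two hypothetical systems of record (an HR system and a student system). Every field name is invented for illustration and is not taken from Columbia's work. The mapping table plays the role of the template: consumers ask for canonical fields (the what) without knowing which source supplied them (the how).

```python
# Hypothetical field mappings: canonical name -> source-system field name.
# The systems and field names are invented for illustration only.
FIELD_MAPS = {
    "hr": {"person_id": "EMPLID", "family_name": "LAST_NM", "given_name": "FIRST_NM"},
    "student": {"person_id": "stu_id", "family_name": "surname", "given_name": "forename"},
}


def to_canonical(source: str, record: dict) -> dict:
    """Translate one system-of-record row into the canonical shape."""
    mapping = FIELD_MAPS[source]
    canonical = {name: record.get(field) for name, field in mapping.items()}
    canonical["source"] = source  # keep provenance for data stewards
    return canonical


# Sample rows, also invented.
hr_row = {"EMPLID": "1001", "LAST_NM": "Lovelace", "FIRST_NM": "Ada"}
student_row = {"stu_id": "S-77", "surname": "Hopper", "forename": "Grace"}

people = [to_canonical("hr", hr_row), to_canonical("student", student_row)]
```

Adding a third system of record in this sketch means adding one entry to the mapping table. Consumers of the canonical records would not need to change.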

Current Practices: Data models
Sarah Christen, Cornell – Survey results

  • 54% have created canonical data models at their institution – data warehouse, students, person data, APIs and reporting tools
  • 85% have data governance priorities
  • 77% using data dictionaries or business glossaries for metadata (people looking at Data Cookbook)

Satwinder

  • Went looking to build conceptual person model
  • ERP systems have models, but they are neither consistent nor shared
  • Formed working group which formed subcommittees
  • Some common attributes (name, DOB, etc, known-by)
  • Location (can include online presences)
  • Event participation
  • Demographics (visa status, gender, ethnicity)
  • There’s discussion of the relationship between this person model and the people registries that have been built up in the identity management space. Keith Hazelton notes that if it’s just identity it’s a subset, but if it’s used for access control it needs more of the extended data to make access decisions.
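
One way to picture the conceptual person model and the identity-subset point is sketched below. The attribute groups follow the bullets above, but the class names, field names, and the choice of which fields count as identity data are assumptions made for this example, not the working group's model.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class Location:
    kind: str   # e.g. "campus", "home", "online"
    value: str  # street address, building code, or URL


@dataclass
class Person:
    # Common attributes
    name: str
    known_by: Optional[str] = None
    date_of_birth: Optional[str] = None  # ISO 8601 date string
    # Extended data
    locations: List[Location] = field(default_factory=list)
    events: List[str] = field(default_factory=list)
    visa_status: Optional[str] = None
    gender: Optional[str] = None
    ethnicity: Optional[str] = None


# Assumed identity subset; a real registry would define its own.
IDENTITY_FIELDS = ("name", "known_by", "date_of_birth")


def identity_view(person: Person) -> dict:
    """Return only the subset an identity registry would need."""
    full = asdict(person)
    return {key: full[key] for key in IDENTITY_FIELDS}


p = Person(name="Ada Lovelace", known_by="Ada",
           locations=[Location("online", "https://example.edu/~ada")],
           visa_status="F-1")
```

An access-control decision that depends on, say, visa status would need the full record rather than `identity_view(p)`, which is the distinction raised in the discussion.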

Mike Fary, Chicago. Student Model

  • Built up representation of data objects used in Campus and Student Life units – took eight months.
  • Produced a data model that was good enough for the integrator to work with in the student system implementation project.
  • Realized there were some areas that needed further definition
  • Created a second working group to create definitions of FT/PT students, joint/dual degree programs, leave of absence/withdrawals. Took another six months
  • Went back afterwards to get reaction from people who had participated. Overwhelmingly people are positive.
  • In some cases agreed to disagree, but at least that’s now documented.
  • Value of the process is in the conversations and the realizations that different parts of the institution view the same data differently. Will need to continue those conversations as further models are built.

Jonathan Saperia – Harvard – VIVO Data Ontology

  • VIVO is a good example of something we can look at that has already been done.
  • Started out trying to solve a specific problem – wasn’t looking to define an ontology, but one came out of it.
  • VIVO is “an integrated view of the scholarly record of the University”: information about research and researchers.
  • Used semantic web technology. Uses RDF
  • Started at Cornell in 2003; received an NIH grant in 2009.
  • Currently a DuraSpace project with members, supporters, and investors
  • More than 28 public implementations, 80 implementations in progress.
  • Use extends beyond research reporting, but it’s not easy.
  • Rather than fleshing out a description of a domain, they were driven by wanting to enable communication and collaboration in scholarship.
  • VIVO take-aways: Start small and build a case; Sort out what is public and what is not (this is important); stewardship of the project is critical – must be owned at the highest level of the institution; clearly identify where the home of the record will be and who is responsible for it; identify practices of the university; change management and governance are important. Practices across the institution vary and matter in different ways.
  • Domains don’t exist in isolation – lots of relationships
  • Danger is that models will become shelfware – will they evolve? You want something that’s part of a regular process to keep it alive.
  • Why do we separate the actual data about the mission of the university (research output, learning analytics) from the data about the business (student system, research funding)?
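
To make the RDF point concrete, here is a small sketch of the subject-predicate-object idea using plain Python tuples rather than an RDF library. The `ex:` prefix and every term under it are placeholders invented for this example; they are not terms from the VIVO ontology, and the sample data is made up.

```python
# Triples are (subject, predicate, object). The "ex:" prefix and every term
# under it are placeholders, not terms from the VIVO ontology.
TRIPLES = [
    ("ex:person/1", "rdf:type", "ex:Researcher"),
    ("ex:person/1", "ex:name", "Ada Lovelace"),
    ("ex:paper/9", "rdf:type", "ex:Publication"),
    ("ex:paper/9", "ex:title", "Notes on the Analytical Engine"),
    ("ex:paper/9", "ex:author", "ex:person/1"),
]


def match(s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in TRIPLES
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def publications_by(person_uri):
    """Follow the author relationship back to each publication's title."""
    titles = []
    for paper, _, _ in match(p="ex:author", o=person_uri):
        titles += [title for _, _, title in match(s=paper, p="ex:title")]
    return titles
```

Because relationships are just more triples, linking researchers to grants or courses would not require a schema change in this sketch. That is one reason the "lots of relationships" observation fits a graph model well.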

Current Practices: APIs

Standards at Yale

  • encouraging consistency, maintainability, and best practices across applications
  • URIs
  • API description (OAI, RAML, etc)
  • Common language, data dictionary and domain-specific ontologies (Data Cookbook)
  • Connections and Reusability (schema.org, JSON-LD, knowledge sources)
  • API Descriptions – RAML, OpenAPI (formerly known as Swagger), API Blueprint
  • Data Cookbook – Provides a central database that stores admin systems data definitions and report specifications – standards for defining your own campus terms
  • schema.org – adopted by major search engines; allows documenting/tagging content on website. Used in SIRI and Knowledge Graph NLQA
  • Ontologies – Linked data world, very domain specific knowledge.
  • Plan for reusability
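
As an illustration of the schema.org and JSON-LD items above, the sketch below builds JSON-LD markup for a person using the schema.org `Person` and `CollegeOrUniversity` types. The helper name, sample values, and URL are assumptions for this example and do not describe Yale's implementation.

```python
import json


def person_jsonld(name: str, job_title: str, org_name: str, url: str) -> str:
    """Serialize a person as schema.org JSON-LD for embedding in a web page."""
    doc = {
        "@context": "https://schema.org",
        "@type": "Person",
        "name": name,
        "jobTitle": job_title,
        "url": url,
        "affiliation": {
            "@type": "CollegeOrUniversity",
            "name": org_name,
        },
    }
    return json.dumps(doc, indent=2)


# Sample values are invented.
markup = person_jsonld("Ada Lovelace", "Research Fellow",
                       "Example University", "https://example.edu/people/ada")
```

The resulting string would typically be placed inside a `<script type="application/ld+json">` element on the page so that search engines can read it.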

Aaron Bucher – Data Governance

  • Roles – Data custodians; Data stewards (owners); Subject matter experts; Content developers
  • Enterprise Data Management And Reporting groups: Steering committee (owners); Data governance; Analytics collaborative (superusers sharing best practices); reporting and data specialists.
  • Data governance council – charged with establishing and maintaining data definitions; Meet monthly, vote on what definitions to establish.

Mike Chapple – Notre Dame

  • Have about 500 data definitions
  • Top lesson learned: It’s not about technology. It’s about collaborations and relationships.
  • Put together a process to adopt definitions by unanimous consensus.
    • Identify terms to work on; then identify stakeholders and roles
    • Stakeholders have to attend workshop where definitions are built – start with a draft, not a blank page

University of Michigan Medical School Data Governance

  • Hired a data governance manager 1.5 years ago
  • My data/our data – everyone thinks their data is their own
  • Data metrics on a dashboard to show how the program is proceeding
  • Will be headed into dealing with research data soon – there’s data there that should persist at the institution if a faculty member leaves, and there are compliance issues.

Jenn Stringer – users want agency over what data is collected about them – how do we provide that?

APIs

Sarah – 

  • API Specs – 31% RAML, 46% Swagger
  • APIs are built with: REST 50%, both REST and SOAP 50%
  • Who are API consumers: 7%,
  • Do you have an API intake process? Yes 46% No 54%

Lech – Yale developer portal (https://developers.yale.edu/)

  • About six months old, after students scraped a lot of web sites and a New York Times article
  • Portal and architecture built on Layer 7 technology (now part of CA)
    • API Listings – several dozen with documentation
    • Resources –
    • Getting started documentation
    • Tools – API Explorer and source samples
    • Events and training – student developer and mentorship program
    • API Keys & OAUTH2 tokens.
  • Two types of APIs – Public APIs and Secured APIs (requires API key and OAuth login)
  • SOA Gateway – requires a technical person to make changes. Supports SSL/TLS. EULA Terms of Use (generic with a provision for amendment by specific data owners); WADL or RAML documentation; Yale NetID access, looking to add social identities.
  • Metrics: 31 APIs; 120 registered users (mostly students, some IT staff); 80 apps registered (in 8 months); Can see top applications, top orgs, performance, latency.
  • Found that students are using the APIs for senior level projects (important in terms of when things need to be up)
  • API Explorer – moving to use Postman (https://www.getpostman.com/ )
  • Engagement – student developer & mentorship program, weekly training, ice cream social, YHack, CS50 lectures, Office hours.
  • They include some vendor APIs on their portal – bus location, dining, ZipCar.
  • One student app is used by 3-4k students – has lots of useful information for students – buses, laundry status, etc.
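
The public versus secured distinction above can be sketched as a gateway-side check. This is not how the Layer 7 gateway is configured. The API paths, keys, and tokens are invented, and a real gateway would validate tokens against the OAuth server rather than a hard-coded set.

```python
# Hypothetical registry of APIs and credentials; names and keys are invented.
API_CATALOG = {
    "/dining/menus": {"secured": False},
    "/courses/enrollment": {"secured": True},
}
VALID_API_KEYS = {"demo-key-123"}
VALID_TOKENS = {"demo-token-abc"}


def authorize(path, api_key=None, bearer_token=None):
    """Return an HTTP-style status code for a request to the given API path."""
    api = API_CATALOG.get(path)
    if api is None:
        return 404  # unknown API
    if not api["secured"]:
        return 200  # public APIs need no credentials
    if api_key not in VALID_API_KEYS:
        return 401  # missing or unknown API key
    if bearer_token not in VALID_TOKENS:
        return 401  # user has not completed the OAuth login
    return 200
```

In this sketch the API key identifies the registered app and the bearer token represents the logged-in user, so a secured API needs both.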

Kitty – the demand for learning analytics is going to drive the need to understand how to safeguard and treat types of data where governance has not matured. Yale has developed a provostial data governance group with the Library, faculty, etc. There’s a smaller group that deals with identity data.

Standards and Tools

REST, JSON, SCIM, Swagger, RAML – Keith Hazelton, Wisconsin Madison

  • Internet2 TIER and the API / Data Model Question – redrawing functional boundaries, refining the interfaces.
  • The one big gap in open source id management space is the entity registry space.
  • Goal is to enable IAM infrastructure mashups – (homegrown, TIER, commercial)
  • In TIER starting from the infrastructure layer – APIs being built for developers building TIER applications.
  • APIs and Data structures – SQL or noSQL? Can we achieve non-breaking extensibility in data objects?
  • Entity representations in event-driven messaging and push/pull REST APIs. How different should data look if it’s a message rather than a response? Should they look any different? When is one appropriate and not the other?
  • REST-ful beats REST-less for loose coupling: implementation language agnostic; better for user-centric design
  • JSON – is XML’s bad rap deserved or not? JSON is adopting more of XML’s functions.
  • SCIM: A standard for cloud-era Provisioning: Design centered for provisioning into the cloud. TIER has settled on use of SCIM where it fits – resource types of user and group. Where it doesn’t, extend using SCIM mechanisms. Think twice before doing RPC-like things.
  • Representing APIs with Swagger 2 (now Open API) – decided it’s normative.
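
A rough sketch of "extend using SCIM mechanisms" and of non-breaking extensibility. The core User schema URN and the `userName` and `name` attributes come from the SCIM 2.0 core schema. The higher-ed extension URN and its `affiliations` attribute are hypothetical, invented for this example, and are not a TIER definition.

```python
import json

CORE_USER = "urn:ietf:params:scim:schemas:core:2.0:User"
# Hypothetical extension URN, invented for this example.
HIGHER_ED = "urn:example:params:scim:schemas:extension:highered:1.0:User"

scim_user = {
    "schemas": [CORE_USER, HIGHER_ED],
    "id": "2819c223",
    "userName": "alovelace",
    "name": {"familyName": "Lovelace", "givenName": "Ada"},
    # Extension attributes live under a key equal to the extension's URN.
    HIGHER_ED: {"affiliations": ["student", "employee"]},
}


def core_view(resource: dict) -> dict:
    """What a consumer that only understands the core schema would keep."""
    return {k: v for k, v in resource.items() if not k.startswith("urn:")}


payload = json.dumps(scim_user)
```

Because extension data is namespaced under its own URN, a consumer that ignores unknown URN keys keeps working when new extensions appear. That is one answer to the non-breaking extensibility question.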

 

Internet2 Tech Exchange 2015 – RESTful APIs and Resource Definitions for Higher Ed

Keith Hazelton – UWisc

TIER work growing out of CIFER – Not just RESTful APIs. The goal is to make identity infrastructure developer and integrator friendly.

Considering use of RAML API designer and raml.org tools for API design and documentation.

Data structures – the win is to get a canonical representation that can be shared across vertical silos. Looking at messaging approaches. Want to make sure that messaging and API approaches are using the same representations. Looking at JSON space.
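
A small sketch of what "same representation for messaging and APIs" could look like. It assumes a JSON person representation and an event envelope whose field names (`eventType`, `occurredAt`, `data`) are invented for illustration rather than taken from TIER or CIFER.

```python
import json
from datetime import datetime, timezone


def person_representation(person_id: str, display_name: str) -> dict:
    """The single canonical representation shared by both channels."""
    return {"id": person_id, "displayName": display_name}


def rest_response(person: dict) -> str:
    """Body of a GET response: just the representation."""
    return json.dumps(person)


def event_message(event_type: str, person: dict) -> str:
    """Event message: an envelope carrying the same representation."""
    envelope = {
        "eventType": event_type,  # hypothetical event names
        "occurredAt": datetime.now(timezone.utc).isoformat(),
        "data": person,
    }
    return json.dumps(envelope)


ada = person_representation("1001", "Ada Lovelace")
```

In this arrangement the message differs from the response only by its envelope, so a consumer can parse the entity the same way whether it was pushed or pulled.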

DSAWG – the TIER Data Structures and APIs Working Group – just forming, not yet officially launched. Will be openly announced.

Ben Oshrin, Spherical Cow

CIFER APIs – Quite a few proposed, some more mature than others.

More Mature: (Core schema – attributes that show up across multiple APIs); ID Match (creates a representation for asking “do I know this person already, and do I have an identifier?”); SOR to Registry (create a new role for a person); Authorization (standard ways of representing authorization queries).

Less mature: Registry extraction (way to pull or push data from registry – overlap with provisioning); Credential management (do we really need to have multiple password reset apps?)

Not even itemized: Management APIs; Monitoring APIs. Have come up in TIER discussions.
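
The ID Match question ("do I know this person already, and do I have an identifier?") can be illustrated with a toy matcher. This is not the CIFER ID Match API. The registry contents, the matching rules, and the result shapes are all assumptions made for the example.

```python
# Hypothetical registry contents; a real registry would be a service or database.
REGISTRY = [
    {"id": "R-1", "family_name": "Lovelace", "given_name": "Ada", "dob": "1815-12-10"},
    {"id": "R-2", "family_name": "Hopper", "given_name": "Grace", "dob": "1906-12-09"},
]


def id_match(family_name: str, given_name: str, dob: str) -> dict:
    """Answer 'do I know this person already, and do I have an identifier?'"""
    candidates = []
    for entry in REGISTRY:
        same_family = entry["family_name"].lower() == family_name.lower()
        same_given = entry["given_name"].lower() == given_name.lower()
        same_dob = entry["dob"] == dob
        if same_family and same_given and same_dob:
            return {"result": "match", "id": entry["id"]}
        if same_family and same_dob:
            candidates.append(entry["id"])  # close enough for human review
    if candidates:
        return {"result": "potential", "candidates": candidates}
    return {"result": "no_match"}
```

The three outcomes (match, potential match for review, no match) are the useful part of the sketch. Production matching would use more attributes and fuzzier comparison.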

Non-CIFER APIs / Protocols of interest: CAS, LDAP, OAuth, OIDC, ORCID, SAML2, SCIM, VOOT2

Use cases:

  • Intra-component: e.g. person registry queries group registry for authorization; group registry receives person subject records from person registry.
  • Enterprise to component: System of Record provisions student or employee data in Person Registry
  • Enterprise APIs: Home-grown person registry exposes person data to campus apps.

#TODO

API Docs; Implementations

CSG Spring 2013 – OAuth & Web Services

University as Platform – Jim Phelps, U Wisconsin Madison

In conversation with Undergraduate Advising, found that they have to use an average of 7 tools for every task they want to accomplish when talking with a student. They spend 40% of time navigating systems to perform a task, 40% teaching students how to use the systems, and 20% of their time doing high value advising. 

Platform principles: reusable, transparent, leveraged, consistent, fiscally efficient, drives agility and innovation.

Second use case – the curricular hub. Spent a lot of up front time building it, but then requires very little reinvestment afterwards. Now have 27 different consumers, all getting the same answers. 

Web Services, Authorization and the API Economy – Scotty Logan, Stanford

Data extracted from programmableweb.com – which doesn’t have an API. 

About 9k APIs registered – straight-line log chart since 2005. Mostly REST, SOAP coming back a bit. API Billionaires Club: Twitter, Google, Facebook, Netflix, etc.

90% of Expedia’s business is through APIs, half of Salesforce’s access is through the APIs. 1.1 million requests/second on S3.

Drivers: Mobile, cloud, social, desire to be open, competitor has an API

We have a dysfunctional relationship with the cloud – we’re consumers of APIs, and sometimes we produce APIs, but not consistently or on a huge scale. Who allows the cloud to push and pull data? Who allows mobile apps to push and pull? The mobile app economy relies on the API economy. How well do your apps work without a network? The Internet of Things relies on the API economy.

API Value Chain – App user->App Store->App->App developer-> world of APIs-> API-> API Team-> Internal systems. App developers are kingmakers – so they’re the ones you have to get APIs to. 

Web Services – SOA is dead – long live services (Anne Thomas Manes, Gartner). 

SOAP or REST? REST is more popular now, means more tools and libraries, examples, design/dev experience. SOAP uses HTTP, REST is HTTP.

Authorization – how does a person allow one app to access their data in a different app? Shibboleth (SAML) is great if there’s a browser involved. How do you authenticate if there’s no browser? OAuth + REST – Designed for REST-like APIs. Like REST, OAuth leverages HTTP features and response codes.

Allows a client application to ask a user for restricted permissions to act on their behalf / access their data in another system without exposing credentials.

OAuth authenticates apps, not people. OpenID Connect authenticates people and is built on top of OAuth 2.
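
To show what "asking a user for restricted permissions" looks like on the wire, here is a sketch of the first step of the OAuth 2.0 authorization code grant: building the URL the app redirects the user to. The parameter names come from the OAuth 2.0 specification. The endpoint, client ID, redirect URI, and scope name are hypothetical.

```python
import secrets
from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical authorization server endpoint.
AUTHORIZE_ENDPOINT = "https://auth.example.edu/oauth/authorize"


def authorization_url(client_id: str, redirect_uri: str, scope: str):
    """Build the URL the app sends the user's browser to, plus the state value."""
    state = secrets.token_urlsafe(16)  # lets the app reject forged callbacks
    query = urlencode({
        "response_type": "code",  # authorization code grant
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": scope,
        "state": state,
    })
    return f"{AUTHORIZE_ENDPOINT}?{query}", state


url, state = authorization_url("notes-app", "https://notes.example.edu/callback",
                               "enrollment.read")
```

The user signs in at the authorization server, not in the app, and the app only ever receives a code and later a token limited to the requested scope. The user's password is never exposed to the app.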

OAuth 2.0 vs 1.0/1.0a

OAuth 1 required signatures – harder than it looked. OAuth 2.0 relies on SSL/TLS – additional RFCs for signatures, JOSE specs for signed/encrypted tokens. 

API Evangelism and support at Stanford – Goal is to make it easier to create and deploy APIs and to find and use APIs “the right way”.

In the beginning – helped a few groups create basic APIs. Scott & Bruce as API evangelists – please make APIs, please make them RESTful, please use OAuth.

Med School CAP System – Community Academic Profiles – expanding to all faculty and staff. Other apps want data from CAP – “we’d like help with an API!” – two day REST design session with vendor. Investigated their API management tool – concerns about integration, cost. They wanted SAML – AS, token attribute passing, scope approved, rate limiting, API console, distributed ownership/control, workgroup integration (groups).

Found a few open source API management tools – big bloated unfamiliar framework or obscure implementation language or no OAuth 2.0 support.

They’re rolling their own – three pieces: OAuth Authorization Server; API Proxy/Gateway; 

Using CloudFoundry’s UAA, Client Credentials flow available now.
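
For reference, a sketch of what a Client Credentials token request looks like under the OAuth 2.0 specification: a form-encoded POST with `grant_type=client_credentials`, authenticated with HTTP Basic. The endpoint URL, client ID, secret, and scope are hypothetical, and the request is only built here, not sent. This does not describe Stanford's UAA configuration.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical token endpoint; substitute the authorization server's real one.
TOKEN_ENDPOINT = "https://auth.example.edu/oauth/token"


def client_credentials_request(client_id: str, client_secret: str, scope: str) -> Request:
    """Build (but do not send) a client credentials token request."""
    # Simple values only: IDs or secrets with special characters would need
    # form-encoding before being base64-encoded.
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    body = urlencode({"grant_type": "client_credentials", "scope": scope}).encode()
    return Request(
        TOKEN_ENDPOINT,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


req = client_credentials_request("batch-job", "s3cret", "directory.read")
```

No user is involved in this flow: the app authenticates as itself, which suits server-to-server integrations such as batch jobs.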

Developer portal – probably Drupal, will integrate with OAuth AS.

Federated OAuth is possible.

Web Services Gone Wild! Mojgan Amini, UCSD

Users were happy with their green screens – then came web applications, first with Perl then with Java. Had users using web apps and others using green screens. They centralized common services, giving birth to their middleware. Every new technology got wrapped into the middleware, which got very bloated. Attended 2006 Gartner Web Services Summit – SOA is the wave of the future. Churned out 2000+ web services! Lesson learned: a little more effort on governance up front would be a good thing. Then created a good set of web services that were reusable that were heavily used – starting with commonly used public services. Then started talking with admin users who wanted real time data over web services – provided a select set of authorized services for their purposes. Put them in a good place for launching mobile apps when they were needed. 

Still have Cobol, Perl, and Java apps along with 2000 base web services. Moving to a Red Hat middleware solution. Hoping to address self-service campus web services. 

Student Developers, Web Service APIs, and OAuth – Mark McCahill, Duke

Two cautionary tales about what can happen with student developers, web APIs and how OAuth can maybe help.

Duke launched Innovation CoLab – effectively a long format “hackathon” – student dev teams compete to identify and address campus needs, in hopes that useful apps will emerge along with more insight how to support the institution. Has pushed a bunch of changes within OIT because students want to move fast. 

Planning for CoLab – how can we enable student dev innovation – infrastructure (easy): on-demand VMs with pre-installed application stacks for student servers (60 different images using Bitnami); self-service Git repository + bug tracking. Enterprise data in easy-to-use form for mobile apps and alternate user interfaces (harder). What data do they want? In which formats? SOAP + WSDLs? XML-RPC? REST + JSON? The last won – students didn’t touch the others. 

Impedance mismatch between what students want and what IT staff are used to doing. Enterprise developers are used to servers and app stacks hand crafted by artisan system administrators. Enterprise devs use Java frameworks etc. Student developers want on-demand PaaS services such as Heroku and software ecosystems such as Node.js, Ruby-on-Rails, JSON, REST, curl, screen scraping.

CoLab: Note Cheese – Sharing class notes is a good thing; you don’t know everyone in your class and you don’t know who took good notes. Might end up with something like Quora meets Piazza with class notes. Why make users type in their class schedule? Devs can’t get class schedules via API, so they asked for NetID and PW, then screen scraped the classes. 

Lesson: OAuth opt-in and informed consent to access individual class enrollment? Allow students at least some access to Shibb SP signup tool – policy now is that students can’t get SPs, so need to change that. Provide pre-built shib along with API stacks.

CoLab: Hack Duke – Enterprise via REST – read-only REST access to everything. Live drill-down navigation of the APIs: classes, terms, course evaluations, departments. Event calendar, map coordinates, LDAP directory via REST. Available on GitHub. Backend is MongoDB, Node.js front end, a fork of openworld. Some concerns were expressed – controlled access to sensitive data? All students can view course evaluations, but should the public see this? Frequency of updates? Data validity? Departments offering courses are not the same as all departments. 

Lessons: if you don’t provide enterprise data in easy-to-use forms, the students will do it anyway. Read-only REST + JSON is the way to roll. Realtime browsable catalog of APIs (requiring an API key). OAuth for non-toxic user opt-in for release of data to applications. Expectation of live data for realtime course registration decision making. 
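
A minimal sketch of the "read-only REST + JSON" lesson, written as a plain function rather than a web server so that it stays self-contained. The course data, paths, and error bodies are invented for illustration and do not reflect the Hack Duke API.

```python
import json

# Hypothetical sample data standing in for enterprise course data.
COURSES = {
    "CS101": {"title": "Intro to Computing", "term": "2013-fall"},
    "MATH221": {"title": "Linear Algebra", "term": "2013-fall"},
}


def handle(method: str, path: str):
    """Return (status, JSON body) for a request to a read-only course API."""
    if method != "GET":
        return 405, json.dumps({"error": "read-only API"})
    parts = [p for p in path.split("/") if p]
    if parts == ["courses"]:
        return 200, json.dumps(sorted(COURSES))  # browsable list of IDs
    if len(parts) == 2 and parts[0] == "courses" and parts[1] in COURSES:
        return 200, json.dumps(COURSES[parts[1]])
    return 404, json.dumps({"error": "not found"})
```

Rejecting every method except GET keeps the service safe to expose widely, and returning a list at the collection path gives the drill-down navigation the notes describe.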

http://www.hackduke.com