CSG Fall 2016 – Next-gen web-based interactive computing environments

After a Reuben from Zingerman’s, the afternoon workshop is on next gen interactive web environments, coordinated by Mark McCahill from Duke.

Panel includes Tom Lewis from Washington and Eric Fraser from Berkeley.

What are they? What problem(s) are they trying to solve? Drive scale, lower costs in teaching. Reach more people with less effort.

What is driving development of these environments? Research by individual faculty drives use of the same platforms to engage in discovery together. Want to get software to students without them having to manage installs on their laptops. Web technology has gotten so much better – fast networks, modern browsers.

Common characteristics – Faculty can roll their own experiences using consumer services for free.

Tom: Tools: RStudio, Jupyter; Shiny; WebGL interactive 3d visualizations; Interactive web front-ends to “big data”. Is it integrated with LMS? Who supports?

What’s up at UW (Washington)?

Four patterns: Roll your own (and then commercialize); roll your own and leverage the cloud; department IT; central IT.

Roll your own: SageMathCloud cloud environment supports editing of Sage worksheets, LaTeX documents, and IPython notebooks. William Stein (faculty) created it with some one-time funding; now commercialized.

Roll your own and then leverage the cloud – Informatics 498f (Mike Freeman) Technical Foundations of Informatics. Intro to R and Python, build a Shiny app.

Department IT on local hardware: Code hosting and management service for Computer Science.

Central IT “productizes” a research app – SQLShare – Database/Query as a service. Browser-based app that lets you: easily upload large data sets to a managed environment; query data; share data.

Biggest need from faculty was in data management expertise (then storage, backup, security). Most data stored on personal devices. 90% of Researchers polled said they spend too much of their time handling data instead of doing science.

Upload data through browser. No need to design a database. Write SQL with some automated help and some guided help. Build on your own results. Rolling out fall quarter.
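A minimal sketch of that workflow (upload without designing a schema, query, then build on a saved result), using Python's built-in sqlite3 rather than SQLShare itself. The sample data, table name, and view name are all made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical sample upload; in a real service this would be the user's file.
UPLOAD = """site,species,sightings
A,sparrow,12
A,finch,3
B,sparrow,7
"""

def load_csv(conn, table, text):
    """Create a table straight from the CSV header, so no schema design is needed."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    cols = ", ".join(f'"{name}"' for name in header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', data)

conn = sqlite3.connect(":memory:")
load_csv(conn, "observations", UPLOAD)

# Save a query as a named view, then build a second query on top of it.
conn.execute(
    "CREATE VIEW per_site AS "
    "SELECT site, SUM(CAST(sightings AS INTEGER)) AS total "
    "FROM observations GROUP BY site"
)
busiest = conn.execute(
    "SELECT site, total FROM per_site ORDER BY total DESC LIMIT 1"
).fetchone()
print(busiest)
```

The view stands in for "build on your own results": the second query never touches the raw upload.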

JupyterHub for Data Science Education – Eric Fraser, Berkeley

All undergrads will take a foundational Data Science class (CS+Stat+Critical Thinking w/data), then connector courses into the majors. Fall 2015: 100 students; Fall 2016 500 students; Future: 1000-1500 students.

Infrastructure goals – simple to use; rich learning environment; cost effective; ability to scale; portable; environment agnostic;  common platform for foundations and connectors; extend through academic career and beyond. Student wanted notebooks from class to use in job interview after class was done.

What is a notebook? It’s a document with text and math and also an environment with code, data, and results. “Literate Computing” – Narratives anchored in live computation.

Publishing = advertising for research, then people want access to the data. Data and code can be linked in a notebook.

JupyterHub – manages authentication, spawns single-user servers on demand. Each user gets a complete notebook server.

Replicated JupyterHub deployment used for CogSci course. Tested on AWS for a couple of months, ran Fall 2015 on local donated hardware. Migrated to Azure in Spring 2016 and continuing there for summer and fall 2016. Plan for additional deployment to Google using Kubernetes.

Integration – Learning curve, large ecosystem: ansible, docker, swarm, dockerspawner, swarmspawner, restuser, etc.

How to push notebooks into student accounts? Used GitHub, but not all faculty are conversant with it. Interact: works with GitHub repos, starting to look at integration with Google Drive. Cloud providers are working on notebooks as a service: cloud.google.com/datalab, notebook.azure.com.

https://try.jupyterhub.org – access sample notebooks.

Mark McCahill – case studies from Duke

RStudio for statistics classes, Jupyter, then Shiny.

2014: containerized RStudio for intro stats courses. Grown to 600-700 student users/semester. Shib login at the cm-manage reservation system web site; users reserve and are mapped to a personal containerized RStudio. Didn’t use RStudio Pro – didn’t need authn at that level. Inside – NGINX proxy, Docker engine, docker-gen (updates config files for NGINX), config file.
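A rough sketch of the idea behind the docker-gen step: regenerate the proxy configuration whenever the set of containers changes. This is plain Python, not docker-gen's template language, and the usernames and ports are hypothetical.

```python
def render_nginx_locations(containers):
    """Return NGINX location blocks, one per user container.

    `containers` maps a username to the host port of that user's RStudio
    container (both hypothetical; a tool like docker-gen would read them
    from the Docker engine instead).
    """
    blocks = []
    for user, port in sorted(containers.items()):
        blocks.append(
            f"location /{user}/ {{\n"
            f"    proxy_pass http://127.0.0.1:{port}/;\n"
            f"}}\n"
        )
    return "\n".join(blocks)

config = render_nginx_locations({"student1": 49001, "student2": 49002})
print(config)
```

In a setup like the one described, the rendered text would be written to a file NGINX includes, followed by a reload.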

Faculty want additions/updates to packages about twice a semester. 2 or 3 incidents/semester with students who wedge their container, causing RStudio to save corrupted state. Easy fix: delete the bad saved-state file.

Considering providing faculty with automated workflow to update their container template, push to test, and then push to production.

Jupyter: Biostats courses started using in Fall 2015. By summer 2016 >8500 students using (with MOOC).

Upper division students use more resources: had to separate one Biostats course away from the other users and resized the VM. Need to have limits on amount of resources – Docker has cgroups. RStudio quiesces process if unused for two hours. Jupyter doesn’t have that.
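Since Jupyter lacked the idle behavior at the time, one option is an external job that finds idle servers and stops them. A minimal sketch of the selection logic only, assuming the last-activity timestamps are available from somewhere (the names and times here are invented):

```python
from datetime import datetime, timedelta

IDLE_LIMIT = timedelta(hours=2)  # mirrors the two-hour RStudio behavior noted above

def idle_servers(last_activity, now, limit=IDLE_LIMIT):
    """Return the names of servers whose last activity is older than `limit`."""
    return sorted(name for name, seen in last_activity.items() if now - seen > limit)

now = datetime(2016, 9, 20, 15, 0)
last_activity = {
    "alice": datetime(2016, 9, 20, 14, 30),  # active 30 minutes ago
    "bob": datetime(2016, 9, 20, 11, 45),    # idle for over three hours
}
to_stop = idle_servers(last_activity, now)
print(to_stop)
```

Actually stopping a container is left out; that part depends on how the servers are spawned.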


Shiny – reactive programming model to build interactive UI for R.

Making synthetic data from OPM dataset (which can’t be made public), and regression modeling against that data. Wanted to give people a way to compare their results with the real data.




CSG Spring 2016: Common higher-ed logical data models and/or APIs


Satwinder Singh, Columbia

  • How do you get from the spaghetti diagram to a better place?
  • Foundation – Understand the systems of record, form working group with subject experts; identify user stories which then get translated into high level data models. Leads to:
  • Canonical model – abstracts systems of record to a higher level: logical data entities, data dictionary, metadata repository, leads to:
  • Generic Institution Models – are there generic Institution models for academia? Leads to:
  • Specifications and Platform: Specifications (Schema standards, APIs, data formats, security).
  • Platform – iPaaS/On-Prem
    • API Gateway – how does it connect back to systems of record? Uses:
    • ESB – provides decoupling
  • API and Data Governance – Lifecycle (building hundreds of successful APIs requires process to decide and maintain them), intake process (look first to see if it already exists or can be extended from something that exists), portal; Data (Master Data Management) quality and stewards.
  • Agility – Impart higher velocity (need to decouple what (“I need this data from …”) and how (“I want a file”)); Build/use accelerators (e.g. pre-built connectors, or re-use things you’ve built already – the only way you can show ROI is by reuse); Use templates (interface specification; data mapping)
  • CoE@Columbia – Integration Principle (see Columbia’s document)
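To make the canonical model step concrete, here is a small sketch of two systems of record being mapped onto one logical entity. The record layouts and field names are hypothetical, not Columbia's.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Person:
    """Canonical (logical) person entity shared by all consumers."""
    person_id: str
    full_name: str
    email: str

def from_student_system(rec):
    # Hypothetical student-system record layout.
    return Person(rec["STU_ID"], f'{rec["FIRST"]} {rec["LAST"]}', rec["EMAIL_ADDR"].lower())

def from_hr_system(rec):
    # Hypothetical HR record layout.
    return Person(rec["emplid"], rec["display_name"], rec["mail"].lower())

student = from_student_system(
    {"STU_ID": "123", "FIRST": "Ada", "LAST": "Lovelace", "EMAIL_ADDR": "ADA@example.edu"}
)
employee = from_hr_system(
    {"emplid": "123", "display_name": "Ada Lovelace", "mail": "ada@example.edu"}
)
print(student == employee)
```

Consumers code against `Person` only, so a system of record can change its layout without breaking them; only its mapping function changes.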

Current Practices: Data models
Sarah Christen, Cornell – Survey results

  • 54% have created canonical data models at their institution – data warehouse, students, person data, APIs and reporting tools
  • 85% have data governance priorities
  • 77% using data dictionaries or business glossaries for metadata (people looking at Data Cookbook)


  • Went looking to build conceptual person model
  • ERP systems have models, but they are neither consistent nor shared
  • Formed working group which formed subcommittees
  • Some common attributes (name, DOB, etc, known-by)
  • Location (can include online presences)
  • Event participation
  • Demographics (visa status, gender, ethnicity)
  • There’s discussion of the relationship of this person model and the people registries that have been built up in the identity management space. Keith Hazelton notes that if it’s just identity it’s a subset, but if it’s used for access control it needs more of the extended data to make access decisions.

Mike Fary, Chicago. Student Model

  • Built up representation of data objects used in Campus and Student Life units – took eight months.
  • Produced a data model that was good enough for the integrator to work with in the student system implementation project.
  • Realized there were some areas that needed further definition
  • Created a second working group to create definitions of FT/PT students, joint/dual degree programs, leave of absence/withdrawals. Took another six months
  • Went back afterwards to get reaction from people who had participated. Overwhelmingly people are positive.
  • In some cases agreed to disagree, but at least that’s now documented.
  • Value of the process is in the conversations and the realizations that different parts of the institution view the same data differently. Will need to continue those conversations as further models are built.

Jonathan Saperia – Harvard – VIVO Data Ontology

  • VIVO is a good example of something we can look at that has already been done.
  • Started out trying to solve a specific problem – wasn’t looking to define an ontology, but one came out of it.
  • VIVO is “an integrated view of the scholarly record of the University”. Information about research and researchers.
  • Used semantic web technology. Uses RDF
  • Started at Cornell in 2003; received an NIH grant in 2009.
  • Currently a DuraSpace project with members, supporters, and investors
  • More than 28 public implementations, 80 implementations in progress.
  • Use extends beyond research reporting, but it’s not easy.
  • Rather than fleshing out a description of a domain, they were driven by wanting to enable communication and collaboration in scholarship.
  • VIVO take-aways: Start small and build a case; sorting out what is public and what is not is important; stewardship of the project is critical – must be owned at the highest level of the institution; clearly identify where the home of the record will be and who is responsible for it; identify practices of the university; change management and governance are important. Practices across the institution vary and matter in different ways.
  • Domains don’t exist in isolation  – lots of relationships
  • Danger is that models will become shelfware – will they evolve? You want something that’s part of a regular process to keep it alive.
  • Why do we separate the actual data about the mission of the university (research output, learning analytics) from the data about the business (student system, research funding)?

Current Practices: APIs

Standards at Yale

  • encouraging consistency, maintainability, and best practices across applications
  • URIs
  • API description (OAI, RAML, etc)
  • Common language, data dictionary and domain-specific ontologies (Data CookBook)
  • Connections and Reusability (schema.org, JSON-LD, knowledge sources)
  • API Descriptions – RAML, Open API (formerly known as Swagger), API Blueprint
  • DataCookbook – Provide for central database that stores admin systems data definitions and report specifications – standards for defining your own campus terms;
  • schema.org – adopted by major search engines; allows documenting/tagging content on website. Used in SIRI and Knowledge Graph NLQA
  • Ontologies – Linked data world, very domain specific knowledge.
  • Plan for reusability

Aaron Bucher – Data Governance

  • Roles – Data custodians; Data stewards (owners); Subject matter experts; Content developers
  • Enterprise Data Management And Reporting groups: Steering committee (owners); Data governance; Analytics collaborative (superusers sharing best practices); reporting and data specialists.
  • Data governance council – charged with establishing and maintaining data definitions; Meet monthly, vote on what definitions to establish.

Mike Chapple – Notre Dame

  • Have about 500 data definitions
  • Top lesson learned: It’s not about technology. It’s about collaborations and relationships.
  • Put together a process to adopt definitions by unanimous consensus.
    • Identify terms to work on; then identify stakeholders and roles
    • Stakeholders have to attend workshop where definitions are built – start with a draft, not a blank page

University of Michigan Medical School Data Governance

  • Hired a data governance manager 1.5 years ago
  • My data/our data – everyone thinks their data is their own
  • Data metrics on a dashboard to show how the program is proceeding
  • Will be headed into dealing with research data soon – there’s data there that should persist at the institution if a faculty member leaves, and there are compliance issues.

Jenn Stringer – users want agency over what data is collected about them – how do we provide that?


Sarah – 

  • API Specs – 31% RAML, 46% Swagger
  • APIs are built with: REST 50%, both REST and SOAP 50%
  • Who are API consumers: 7%,
  • Do you have an API intake process? Yes 46% No 54%

Lech – Yale developer portal (https://developers.yale.edu/)

  • About six months old, after students scraped a lot of web sites and a New York Times article
  • Portal and architecture built on Layer 7 technology (now part of CA)
    • API Listings – several dozen with documentation
    • Resources –
    • Getting started documentation
    • Tools – API Explorer and source samples
    • Events and training – student developer and mentorship program
    • API Keys & OAuth2 tokens.
  • Two types of APIs – Public APIs and Secured APIs (requires API key and OAuth login)
  • SOA Gateway – requires a technical person to make changes. Supports SSL/TLS. EULA Terms of Use (generic with a provision for amendment by specific data owners); WADL or RAML documentation; Yale NetID access, looking to add social identities.
  • Metrics: 31 APIs; 120 registered users (mostly students, some IT staff); 80 apps registered (in 8 months); Can see top applications, top orgs, performance, latency.
  • Found that students are using the APIs for senior level projects (important in terms of when things need to be up)
  • API Explorer – moving to use Postman (https://www.getpostman.com/ )
  • Engagement – student developer & mentorship program, weekly training, ice cream social, YHack, CS50 lectures, Office hours.
  • They include some vendor APIs on their portal – bus location, dining, ZipCar.
  • One student app is used by 3-4k students – has lots of useful information for students – buses, laundry status, etc.

Kitty – the demand for learning analytics is going to drive the need to understand how to safeguard and treat types of data where governance has not matured. Yale has developed a provostial data governance group with the Library, faculty, etc. There’s a smaller group that deals with identity data.

Standards and Tools

REST, JSON, SCIM, Swagger, RAML – Keith Hazelton, Wisconsin Madison

  • Internet2 TIER and the API / Data Model Question – redrawing functional boundaries, refining the interfaces.
  • The one big gap in open source id management space is the entity registry space.
  • Goal is to enable IAM infrastructure mashups – (homegrown, TIER, commercial)
  • In TIER starting from the infrastructure layer – APIs being built for developers building TIER applications.
  • APIs and Data structures – SQL or noSQL? Can we achieve non-breaking extensibility in data objects?
  • Entity representations in event-driven messaging and push/pull REST APIs. How different should data look if it’s a message rather than a response? Should they look any different? When is one appropriate and not the other?
  • REST-ful beats REST-less for loose coupling: implementation language agnostic; better for user-centric design
  • JSON – is XML’s bad rap deserved or not? JSON is adopting more of XML’s functions.
  • SCIM: A standard for cloud-era Provisioning: Design centered for provisioning into the cloud. TIER has settled on use of SCIM where it fits – resource types of user and group. Where it doesn’t, extend using SCIM mechanisms. Think twice before doing RPC-like things.
  • Representing APIs with Swagger 2 (now Open API) – decided it’s normative.
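A sketch of what "extend using SCIM mechanisms" can look like, and why it helps with non-breaking extensibility. It assumes SCIM 2.0 (the notes don't name a version). The core User schema URN is the standard one; the campus extension URN and its `affiliation` attribute are invented for illustration and are not TIER's.

```python
import json

CORE_USER = "urn:ietf:params:scim:schemas:core:2.0:User"
# Hypothetical campus extension URN, invented for this sketch.
CAMPUS_EXT = "urn:example:params:scim:schemas:extension:campus:2.0:User"

def make_user(user_name, given, family, affiliation=None):
    """Build a SCIM 2.0 User resource, adding campus data via an extension schema."""
    user = {
        "schemas": [CORE_USER],
        "userName": user_name,
        "name": {"givenName": given, "familyName": family},
    }
    if affiliation is not None:
        user["schemas"].append(CAMPUS_EXT)
        user[CAMPUS_EXT] = {"affiliation": affiliation}
    return user

def core_view(user):
    """What a client that only understands the core schema would read."""
    return {k: v for k, v in user.items() if not k.startswith("urn:")}

user = make_user("alovelace", "Ada", "Lovelace", affiliation="student")
print(json.dumps(user, indent=2))
```

Extension data lives under its own namespaced key, so a consumer that only knows the core attributes keeps working when new attributes are added.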


CSG Spring 2016: IT Innovation through Student Innovation

Julian Lombardi – Duke

Student life outside of classroom – integrated with lots of IoT technology. There’s an opportunity for universities to support new levels of engagement with technology. Can help lower barriers to innovation – create kinds of playgrounds. Foster innovation that is difficult to predict. Led by Academic and Media Technologies team.

Student – an autonomous, mobile, eating, and printing unit – Mark McCahill

Increasing emphasis on experiential learning on campuses.

Duke has an Innovation & Entrepreneurship program – foster innovation across the university. Lean heavily on the Entrepreneurship side – year long intensive programs.

Central IT’s role – serving students who want to tinker and explore, or who want to have an effect on IT Services. Innovation Co-Lab – help students create the next wave of technology for the Duke community.

Evan  Levine, Michael Faber –

Co-Lab established in spring of 2013.

Have a lot of students that know what they want to create. With today’s technology students can jump in and start right away.

How do you harness student energy? Pizza and t-shirts – cash is even better.

Inaugural Co-Lab Challenge

  • One project (called Hack Duke) created an API for institutional information (courses, directories, etc) by scraping sites.

Duke Mobile Challenge

  • Wanted people to develop for the DukeMobile app. Was too specific. Learned that students are capable of building to meet their needs if you give them the infrastructure.

Did a 3D printing challenge – a drill-mounted centrifuge was one project. Somebody wrote software that could scan a key that could be printed and worked in a lock.

Co-Lab Innovation Grants are what they use now. About 30 projects have come through since Fall ’14. Timelines fairly lengthy (> 1 semester). Amounts vary widely – some people don’t even need money, but want support and project management. Amounts range from $0-~$10k.

BioMetrix – Ivonna Demanyan, Gabby Levac

  • Developing a system for mobile monitoring of injury risks. They have an ACL tear prevention system – performs predictive analytics. Started with a Co-Lab grant. Now has a team of 7 full time (she’s a senior). Started with an Arduino strapped to her foot to see how her foot rolled. Been featured in lots of media. Mental Floss named them two of the most influential women, received Google entrepreneur award.

Space is a challenge. Currently have 33 3d printers. Have a lot of other physical computing equipment.

Co-Lab is going into new OIT engagement space. Will house research computing staff, media studio. Bring innovative students together with faculty and researchers.

FreeSpace – occupancy system for study rooms in the library. Use IR sensors to detect whether someone is in room, plus app to view data. Needed to incorporate data from the room reservation system. Projects are spanning gap across IT and innovation. Project didn’t work – implementation of that particular sensor didn’t work. How to connect students to institutional data? Needed a full stack software development resource –

  • Code Sharing (gitlab.duke.edu);
  • VMs (vm-manager.oit.duke.edu);
  • API service (apidocs.colab.duke.edu) – hub where student devs can come for keys and tokens and calls can be managed. Put together syndication server with node.js;
  • Local enterprise iOS app store – appstore.colab.duke.edu

Students were frustrated at the lack of an iOS print client, so they wrote one. Was popular enough to overload the Raspberry Pi it was running on – drove the vendor to develop a real app.

Mark McCahill

Big mismatch between classic IT organization and student innovation. Students want to get things done within the semester. Can’t take time for planning and enterprise-level reliability – here you can afford to fail. Thought they’d need a couple of VMs for student servers. Sysadmins hated the idea that students would have root access. Public IPs (in Duke namespace, but totally separate part of network); Ubuntu Linux in a “RHEL shop”; patching; overcommitting VMs. Up to 22 or more images – will do almost anything that’s asked for. If it’s hacked it’s the student’s problem. Have about 350 VMs reserved, but not all are student machines – finding that faculty have timelines similar to students and are less concerned than IT about reliability. Ended up with early adopter faculty using it for workshops – really self-sufficient. Let the faculty build the image and then create it as a template.

Impedance matching with the registrar – students want access to data, but registrar is nervous about letting it loose. Ended up putting together OAuth infrastructure – have tokens brokered so students can opt-in to release their information to a single application. Students don’t like trying to Shibbolize their web apps – want something to work this semester. New stuff wants to use OAuth, so they’ve backed into supporting that. Finding that medical center is interested in OAuth support for mobile apps.
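A toy model of the opt-in idea: consent is recorded per student and per application, and a token only works for the application it was issued to. This is not an OAuth implementation and not Duke's code; the class, method names, and scope names are all hypothetical.

```python
import secrets

class ConsentBroker:
    """Toy token broker: a token is valid for one student and one application."""

    def __init__(self):
        self._grants = {}  # token -> (student, app, scopes)

    def opt_in(self, student, app, scopes):
        """Record the student's consent and hand the application a token."""
        token = secrets.token_urlsafe(16)
        self._grants[token] = (student, app, frozenset(scopes))
        return token

    def allowed(self, token, app, scope):
        """True only if this token was issued to `app` and covers `scope`."""
        grant = self._grants.get(token)
        return grant is not None and grant[1] == app and scope in grant[2]

broker = ConsentBroker()
token = broker.opt_in("student42", "course-planner", {"schedule"})
print(broker.allowed(token, "course-planner", "schedule"))
print(broker.allowed(token, "other-app", "schedule"))
```

The registrar-side API would check `allowed` before releasing anything, so data leaves only for students who opted in, and only to that one application.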

Approaching this with the idea that everything you know is wrong will allow you to provide support not just to student innovation projects, but to a swath of faculty and researchers who typically are outside your central IT support envelope.

How can we identify and nurture this community?

  • 1:1 – Office hours – staffed by student employees as well as staff.
  • Many:many – Studio Nights. Creating community. Get together once a week, buy pizza.
  • One : many – up to now was focused on people who knew what they were doing. Roots program – the training arm of the Co-Lab. Taught 32 courses: Linux, HTML/CSS, Javascript, Rails, iOS, Python, UX Design, Web Accessibility, git, 3D printing and modeling, Rapid Prototyping, Arduino / Photon. Some popular, some not.

John Board

Duke SmartHome – a dorm for 10 students. Solar panels, green roof, LEED Platinum, $2M-ish. 6000 square foot “live in laboratory”.  Was a student senior project. Had an all student project management team.

  • Lesson 1: No coupling to faculty research agendas – was a mistake not to have faculty at least peripherally involved early.
  • Lesson 2: Working with companies is exciting. Students cultivated sponsors. Enormous complexity in legal agreements. One management shakeup at the sponsor and they’re all gone.
  • Lesson 3: Institutional memory in student organizations is fragile. Constant churn in communication tools and sites – no sense of preservation of history.
  • Lesson 4A: “Safety? It’s ok, I’m immortal”
  • Lesson 4B: Dorms are special creatures under law.
  • Lesson 5A: Being a donation magnet is a mixed blessing.
  • Lesson 5B: Free, but with added lawyers!
  • Lesson 6: Predicting student acceptance is a dark art.
  • Lesson 7: Being an on-campus funding agency is great, but…
  • Lesson 8: The semester lasts forever (if you’re an undergrad). After first two weeks students are overcommitted.
  • Lesson 9: Matching freshman exuberance with senior wisdom.
  • Lesson 10: IoT Hell – it’s worse than you imagine. Highest density of device security issues on campus.
  • Lesson 11: Goals are good.

CNI 2014 Fall meeting – Opening Plenary

I’m in DC for the annual fall meeting of the Coalition for Networked Information. This time the opening plenary is a discussion moderated by Cliff Lynch, CNI’s Executive Director, and including Tom Cramer (Chief Technology Strategist, Stanford University Libraries), Michele Kimpton (Chief Executive Officer, DuraSpace), and James Hilton (Dean of Libraries & Vice Provost for Digital Educational Initiatives, University of Michigan).

Cliff – Notable successes in people launching community source projects over the last 10-15 years. But the landscape is changing: economic model, speed of development are looking shaky, accelerated move to single or multi-tenant arrangements run elsewhere. Where does this leave you when you come to the point in the lifecycle when you need to think about new systems? How do we engage with new opportunities in the MOOC and Unizin space?

Community source – what is it and does it really have a future?

James Hilton – Community source is not going away. Is community source the same as open source? Open source is often used synonymously with a developer-autonomy-centric model. Kuali as compared with Sakai. How do you organize the labor that produces an outcome? We have many more tools to tune development – different organizational models can work.

Michele Kimpton (DuraSpace) – Tuning of community development model. If you want to collaborate and develop code together, that’s a community model. Code doesn’t advance if people are doing customization at their institutions themselves. Need to invest and be transparent to advance the code base.

Tom Cramer – Many forms of community – one form is to have a centralized organization, but just as many examples where the community is grass-roots driven from the edges, like Fedora.

Is there a trend taking us towards or away from the grass roots model to funding a central model?

Tom Cramer – Examples on both sides. Central organization can bring focus, but so can grassroots – e.g. BlackLight faceted browser for SOLR.

James H – as scale of investment goes up the pressure to organize and centralize goes up.

Is the presence of serious commercial players a factor in central vs. grass-roots?

Tom Cramer – if there’s an absence of commercial players that can buy space and time for grass-roots organizing. Central authority can make missteps, whether community or commercial.

James – Unizin is trying to organize community effort around content and analytics standards. Made a decision to adopt commercial software – in part because they wanted the speed that came with that. Contingent on contracts giving the control needed.

Michele Kimpton – Two models – when commercial entity makes product open-source that gives an exit strategy, but it’s not community controlled. Really serving core paying customers.

Tom Cramer – Community source projects have failed where they’ve been gated communities – fail to channel the interests outside the gates. Also true of vendor solutions – unless you can tap the bigger market it will be a problem.

James – Unizin focus is on creating relays that will be as agnostic as possible. Community development is in building workflows using repositories, not in refining the LMS. It’s not about software, it’s about business and economic models.

Cliff – move to talk about the move of software from local to redundant network hosting. There seems to be a big move in that direction. Seeing what would have been community source before now taking on character as community service – like DPN, APT, etc. How does that change the landscape?

James – makes you ask the question – what parts do I need to control, what do I not need? In Unizin trying to figure out what parts need control. The LMS is core infrastructure – go for economy of scale. Focus control on building digital workflows, helping humanists and research scientists know where stuff goes.

Michele Kimpton – 1700 institutions running DSpace – difficult to upgrade to new releases. DuraSpace wanted to provide a pathway for smaller institutions to run the latest code. Cloud infrastructure will flip IT in academic environments on its head. Will be hard to justify building data centers when they can buy IT as a service and buy only what they need. Can keep the same governing process and openness.

Tom Cramer – Running data center and installing and maintaining software is not the core competency. It’s higher up in the stack providing value to the community. Where do you want to maintain control? Curation, discovery, preservation.

Cliff – A lot of this software is getting big, volatile, and complex enough (especially in the security environment) doing maintenance and configuration management is getting to be troublesome. But if you’re out in the cloud you put need to do version control and validation – is that a worthwhile tradeoff?

James – if you’re committed to running everything in this compliance environment, that is all you will do. What do we value as academic institutions? What do we bring that’s unique?

Cliff – Barrier to innovation is everybody forking off code and doing local adaptations. Sense is that in the future, with networked software as a service, that area of variation really goes down. Can diffuse innovations faster.

Tom Cramer – Perhaps getting better at managing diversity. Seen lots of good examples that different communities are good at putting enhancements back into the code base. Separate question than running software as a service. In commercial world looking at securing different layers and diffusing innovation at those layers.

Michele Kimpton – There has been a lot of customization of both DSpace and Fedora, and that leads to frustration in upgrading. But customizations are needed. Part of the beauty of more innovation is you can look at aggregations across instances in the cloud – e.g. how do we aggregate pushing content into DPLA or DPN? Easier to do from cloud to cloud.

Tom Cramer – It’s standardization that enables that, not just cloud. e.g. standardized APIs.

Cliff – standards – are the places where standards are most applicable changing? Used to be notion of standards that allowed replacement of building blocks within a system. Now that you move into a world of aggregated things standards don’t mean as much – may work just fine to be expedient.

James – challenge is how do you move standards at the pace of technology?

Tom Cramer – role for standards based on size of the pool you want to swim in. There are important communities of practice around loose coupling whether informal or formal. Look at the numbers of people using SOLR for searching.

Cliff – Puzzle about how patterns of innovation change. Community source projects came from grassroots, where there is considerable technical expertise at the participating institutions. If we think about collective service-based aggregations do local technical experts become scarcer and does that imply less diversity of innovation?

James – if we can move innovation up the stack life gets better.

Tom Cramer – you don’t need to know how to run a server to have technical expertise. Successful solutions will figure out way to tap innovation coming from the edges. Be the community you want to be.

Cliff – you can look back and see the evolution – used to be many organizations that had huge knowledge of global networking, but now it’s held in fewer institutions.

Michele Kimpton – if the developer can focus on developing and not on setting up a server and talking to IT, it increases innovation. Can throw things up and see if they work. Capital costs to innovation are so much lower. That’s why in the commercial space you see cloud-based services spawning all over the place.

Discussion of contracting and procurement – the legal folks have the same challenge we do in figuring where we really need to be unique flowers. We all have indemnification and state rules. We don’t need 50 different ways to say it.