After a Reuben from Zingerman’s, the afternoon workshop is on next gen interactive web environments, coordinated by Mark McCahill from Duke.
Panel includes Tom Lewis from Washington, Eric Fraser from Berkeley
What are they? What problem(s) are trying to solve? Drive scale, lower costs in teaching. Reach more people with less effort.
What is driving development of these environments? Research by individual faculty drives use of the same platforms to engage in discovery together. Want to get software to students without them having to manage installs on their laptops. Web technology has gotten so much better – fast networks, modern browsers.
Common characteristics – Faculty can roll their own experiences using consumer services for free.
Tom: Tools: RStudio, Jupyter; Shiny; WebGL interactive 3d visualizations; Interactive web front-ends to “big data”. Is it integrated with LMS? Who supports?
What’s up at UW (Washington)?
Four patterns: Roll your own (and then commercialize); roll your own and leverage the cloud; department IT; central IT.
Roll your own: SageMathCloud cloud environment supports editing of Sage worksheets, LaTex documents, and IPython notebooks. William Stein (faculty) created with some one-time funding, now commercialized.
Roll your own and then leverage the cloud – Informatics 498f (Mike Freeman) Technical Foundations of Informatics. Intro to R and Python, build a Shiny app.
Department IT on local hardware: Code hosting and management service for Computer Science.
Central IT “productizes” a research app – SQLShare – Database/Query as a service. Browser-based app that lets you: easily upload large data sets to a managed environment; query data; share data.
Biggest need from faculty was in data management expertise (then storage, backup, security). Most data stored on personal devices. 90% of Researchers polled said they spend too much of their time handling data instead of doing science.
Upload data through browser. No need to design a database. Write SQL with some automated help and some guided help. Build on your own results. Rolling out fall quarter.
JupyterHub for Data Science Education – Eric Fraser, Berkeley
All undergrads will take a foundational Data Science class (CS+Stat+Critical Thinking w/data), then connector courses into the majors. Fall 2015: 100 students; Fall 2016 500 students; Future: 1000-1500 students.
Infrastructure goals – simple to use; rich learning environment; cost effective; ability to scale; portable; environment agnostic; common platform for foundations and connectors; extend through academic career and beyond. Student wanted notebooks from class to use in job interview after class was done.
What is a notebook? It’s a document with text and math and also an environment with code, data, and results. “Literate Computing” – Narratives anchored in live computation.
Publishing = advertising for research, then people want access to the data. Data and code can be linked in a notebook.
JupyterHub – manages authentication, spawns single-user servers on demand. Each user gets a complete notebook server.
Replicated JupyterHub deployment used for CogSci course. Tested on AWS for a couple of months, ran Fall 2015 on local donated hardware. Migrated to Azure in Spring 2016 – summer and fall 2016. Plan for additional deployment to Google using Kubernetes.
Integration – Learning curve, large ecosystem: ansible, docker, swarm, dockerspawner, swarmspawner, restuser, etc.
How to push notebooks into student accounts? Used github, but not all faculty are conversant. Interact: works with github repos, starting to look at integration with Google Drive. Cloud providers are working on notebooks as a service. cloudl.google.com/datalab, notebook.azure.com.
https://try.jupyterhub.org – access sample notebooks.
Mark McCahill – case studies from Duke
Rstudio for statistics classes, Jupyter, then Shiny.
2014: containerized RStudio for intro stats courses. Grown to 600-700 students users/semester. Shib at cm-manage reservation system web site, users reserve and are mapped to personal containerized RSTudio. Didn’t use RStudio Pro – didn’t need authn at that level. Inside – NGINX proxy, Docker engine, docker-gen (updates config files for NGINX), config file.
Faculty want additions/updates to packages about twice a semester. 2 or 3 incidents/semester with students that wedge their container cause RStudio to save corrupted state. Easy fix: delete the bad saved-state file.
Considering providing faculty with automated workflow to update their container template, push to test, and then push to production.
Jupyter: Biostats courses started using in Fall 2015. By summer 2016 >8500 students using (with MOOC).
Upper division students use more resources: had to separate one Biostats course away from the other users and resized the VM. Need to have limits on amount of resources – Docker has cgroups. RStudio quiesces process if unused for two hours. Jupyter doesn’t have that.
Shiny – reactive programming model to build interactive UI for R.
Making synthetic data from OPM dataset (which can’t be made public), and regression modeling against that data Wanted to give people a way that allows comparison of results to the real data.