Higher Ed Cloud Forum – Tools for predicting future conditions: weather & climate

Toby Ault, Marty Sullivan – Cornell

Numerical models of climate science. Most of the fluid dynamics in weather and climate models comes from physical equations; the solvers are “Dynamical Cores” that describe the flow of fluid in 3D space.

A continuum of scales needs to be accommodated – done through parameterization. Want to be able to sample as many parameterization schemes as possible.

Interested in intermediate time scales (weeks to months), which have been difficult to model. There’s uncertainty arising from different models, so having multiple models with multiple parameterizations that get averaged together with machine learning can have huge payoffs in accuracy.

Are the most useful variables for stakeholders the most predictive?

Weather simulation in the cloud:

Infrastructure as code makes things much easier. Able to containerize the models (which include lots of old code), so people don’t need to know all the nuts and bolts. Using lots of serverless – makes running web interfaces extremely easy.

Democratization of science – offered through a web interface. People can configure and run models for their locations.

Lots of orchestration behind the scenes: Deploying with CloudFormation, using ECS, etc.


Higher Ed Cloud Forum: Adventures in Cloudy High-Performance Computing

Gavin Burris – Wharton School

HPC – computing resources characterized by many nodes, many cores, lots of RAM, high-speed low-latency networks, and large data stores.

XSEDE — it’s free (if you’re funded by a national agency).

cloud – more consoles, more code, less hardware

Using Ansible to configure cloud resources the same as on-premise, both to deploy EC2 clusters from Python and with CfnCluster (CloudFormation cluster) to build and manage HPC clusters.

Univa UniCloud enables cloud resources to integrate with Univa scheduler.

Use Case: C++ simulation modeling code needed 500 iterations, each taking 3-4 days. Used MIT StarCluster with spot bids; for $500, finished the job in 4 days.

Use case: Where are the GPUs? Nobody was using them – they required different toolkits and code to utilize, so the GPUs were dropped in the hardware refresh. Used UniCloud to run cloud instances with GPUs.

“Cloud can accommodate outliers” — GPUs, large memory. A la carte to the researcher based on tagged billing. Policy-based launching of cloud instances.

Seamless transition – VPC VPN link provided by central networking, AWS looks like another server room subnet. Consistent configuration management with the same Ansible playbooks. Cloud mandate by 2020 for Wharton – getting rid of server rooms to reclaim space.

They’re doing NFS over the VPN – getting great throughput.

Cost comparison – local HPCC hardware $328k vs. AWS $294k for the FLOP equivalent.

Spotinst – manages preemption and moves loads to free instances.


CSG Fall 2017 – Big Data Analytic Tools and AI’s Impact on Higher Education

Mark McCahill – Duke

How did we get here?

Big data / AI / machine learning driven by declining costs for:

  • sensors: IoT and cheap sensors create mountains of data
  • storage is cheap – we can save data
  • networks are fast – data mountains can be moved (and so can tools)
  • CPUs are cheap (and so are GPUs)

Massive data – IoT; internet of social computing platforms (driving rapid evolution of analytic tools); science – physics, genomics

Big data analysis tools – CPU clock speeds are not increasing much — how can we coordinate CPUs running in parallel to speed analysis of large datasets? Break the data into parts and spread the work – Hadoop MapReduce.

Map/Reduce -> Hadoop -> Apache Spark

Apache Spark – open source MapReduce framework. Spark coordinates jobs run in parallel across a cluster to process partitioned data.

Advantages over Hadoop: 10 – 100x faster than Hadoop (through memory caching); code and transformation optimizers; support for multiple languages (Scala, Python, SQL, R, Java)

2 ways to use Spark:

  • semi-structured data (text files, gene sequencer output);  write transforms and filter functions; classic map/reduce
  • Structured data (implicit or explicitly named columns); transforms and filter structured data using R-style dataframe syntax; SQL with execution optimizers.
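The classic map/reduce style in the first bullet can be illustrated with a toy, single-machine sketch in plain Python (this shows the pattern Spark distributes across a cluster; it is not Spark code itself):

```python
# Toy single-machine illustration of the map -> shuffle -> reduce pattern
# that Spark parallelizes across a cluster (plain Python, not Spark).
from collections import defaultdict

def map_phase(lines):
    # map: emit (word, 1) pairs from each input record
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # shuffle: group values by key (Spark does this across partitions/nodes)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: combine the grouped values per key
    return {key: sum(values) for key, values in groups.items()}

lines = ["to be or not to be", "that is the question"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["to"])  # 2
```

In Spark the same pipeline would be a `flatMap`/`map`/`reduceByKey` chain over an RDD, with the shuffle handled by the cluster.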

Spark data models:

  • RDD (Resilient distributed dataset) storage allows Spark to recover from node failures in the cluster
  • Datasets – semi-structured data with strong typing and lambda functions, custom memory management, optimized execution plans
  • Dataframes – dataset with named columns, supports columnar storage, SQL, and even more optimization

Ambari – tool to manage and deploy a Spark cluster.

Machine Learning

use big datasets to train neural networks for pattern recognition based on ‘learned’ algorithms

Deep learning neural networks have multiple layers and non-linear activation functions

Common thread – training is expensive, and parallelism helps, lots of matrix math processing; GPUs start to look attractive.

Application areas: 3 lenses to look through in higher ed – research, coursework, operations

Example – TensorFire https://tenso.rs/

Case Study: Research

OPM has 40 years of longitudinal data on federal employees. Duke researchers have been developing synthetic data and differential privacy techniques to allow broader audiences to develop models run against the data in a privacy-preserving fashion.

Model verification measures: test the fit of a model developed against synthetic data to the real data. Verification measures for models need to run against many slices of the data, and the OPM data is big. Initial approach: run regression models and verifications from R on a large Windows VM against a MS-SQL database. But R is single-threaded; custom code written with R’s parallel library runs and manages multiple independent Windows processes to use more of the server’s CPU cores.
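A hedged sketch of that workaround’s shape, using Python’s multiprocessing in place of R’s parallel library (the slices and the fit function are invented stand-ins for the regression runs):

```python
# Hedged sketch: run many independent model fits in parallel, one worker
# process per slice (Python multiprocessing standing in for R's parallel
# library; assumes a fork-capable platform such as Linux).
from multiprocessing import Pool

def fit_one_slice(data_slice):
    # stand-in for fitting a regression model on one slice of the data;
    # here the "model" is just the slice mean
    return sum(data_slice) / len(data_slice)

slices = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # independent slices of the data
with Pool(processes=3) as pool:
    results = pool.map(fit_one_slice, slices)
print(results)  # [2.0, 5.0, 8.0]
```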

Rewrite R code to use SparkR

Initial setup: copy CSV files to HDFS; parse the CSV and store it in a Spark Dataframe; treat the Dataframe as a table; save the table as a Parquet file (columnar format)

Read parquet file and you’re ready to go – SQL queries and R operations on dataframes
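The same load-CSV, register-as-table, query-with-SQL flow can be sketched in miniature with the Python stdlib (sqlite3 standing in for Spark and Parquet; a toy analog, not the SparkR pipeline itself, with made-up sample data):

```python
# Miniature, hedged analog of the workflow above using only the Python
# stdlib: load CSV rows, register them as a table, query with SQL
# (sqlite3 standing in for Spark + Parquet).
import csv
import io
import sqlite3

csv_text = "name,grade\nalice,90\nbob,85\ncarol,90\n"   # made-up sample data
rows = list(csv.DictReader(io.StringIO(csv_text)))

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE scores (name TEXT, grade INTEGER)")
con.executemany("INSERT INTO scores VALUES (?, ?)",
                [(r["name"], int(r["grade"])) for r in rows])

# Once the data is a table, analysis is plain SQL, as with Spark dataframes.
top = con.execute("SELECT COUNT(*) FROM scores WHERE grade = 90").fetchone()[0]
print(top)  # 2
```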

Lessons: social science researchers can probably benefit from tools like Spark. Without access to a sandbox Spark cluster, how can they get up to speed on the tooling? RStudio continues to improve support for Spark via SparkR, sparklyr, etc. Plan to support Spark and R tools for large datasets.

Case Study: Coursework

Spring 2017 – grad biostats course using Jupyter notebooks.

PySpark – Python API for Spark.

Lessons learned: course assignment k-mer counts ran in ~ w minutes on a 10-server cluster (each server: 10 cores + 25 GB) for 40 students.
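The k-mer counting assignment itself is simple to sketch in plain Python (Spark’s contribution in the course was distributing it across the cluster; the sequence and k here are invented):

```python
# Sketch of the k-mer counting task: count all length-k substrings of a
# DNA sequence. In the course, Spark distributed this over a cluster;
# this single-machine version shows the computation itself.
from collections import Counter

def kmer_counts(seq, k):
    # slide a window of width k across the sequence and tally each k-mer
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

counts = kmer_counts("ATGATGA", 3)
print(counts["ATG"])  # 2
print(counts["TGA"])  # 2
```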

Azure has a Jupyter + Spark offering but not configured for courses.


Duke has a wealth of PHI data that researchers want – it lives in a secure enclave that is very locked down. Researchers want to use RStudio and Jupyter to run TensorFlow code against GPUs for image analysis. Don’t fight the firewall – automate the analysis tooling external to the protected enclave and package the tools in a container. Don’t use Docker (not suited for this research setting) – use Singularity.

Container tech developed at LBL. Runs as the user, not as root.

Singularity + GitLab CI – researcher does integrations and commits.

Lessons learned: Singularity not a big stretch if you do Docker. Automating build/deploy workflow simplifies moving tools into protected data enclaves (or clouds or compute clusters).

Using Machine Learning for campus operations. John Board (Duke)

We already have the data for … whatever. That enables manual mashups to answer specific questions and solve specific problems. It will rapidly give way to fully automated pulling of needed data for unanticipated questions, on demand. Data – speak for yourself and show us what’s interesting.

Academic outcome prediction – the easy question to ask is: does your grade in freshman calculus predict sticking with, and success in, engineering two years out? The hard part is asking: what are the predictors of success? We should be able to poll the corpus of university data (and aggregate across institutions).

Computing demands are significant.

How will Machine Learning Impact Learning? – Jennifer Sparrow, Penn State

What computers can’t teach (yet) – analyze, evaluate, create

What computers can teach: remember, understand, apply

b.book – robot-generated textbook. Faculty put in desired learning outcomes; the robot finds relevant articles (it only uses Wikipedia so far). 85% of students prefer this over a traditional textbook. 35% often or very often visited pages that were not part of the required readings.

When students make their own book using b.book, 73% said bbookx surfaced content they had never encountered before in their major.

Brainstorming tools – intelligent assistant for brainstorming partner. Can use b.bookx for iterative exploration.

Prerequisite knowledge – as bbook found texts, the computer was able to identify prerequisite knowledge and provide links to that.

Generated assessments – “adversarial learning”. Robot can generate multiple choice questions. Bbook doesn’t have any mercy. Computer generated assessments are too hard – even for faculty.

How do we put these tools in the hands of faculty and help use them to create open educational resources?

Simplification – grading writing assignments. Can use natural language processing to give feedback at a richer level than they’d get from faculty or TAs. Students can iterate as many times as they want. Faculty finding that final writing assignments are of a higher quality when using the NLP.

Learning analytics – IR & BI vs. data science: precisely summarize the past vs. approximate the future; enable strategic decisions vs. fine-scale decision making; small numbers of variables vs. hundreds to thousands of variables; blunt, absolute decision boundaries vs. nuanced boundaries that evolve over time.

Able to predict a student’s grade within half a letter grade 90% of the time. Those who fall outside the 90% have usually had some unpredictable event happen. Biggest factors: previous-semester GPA, historical average course grade, cumulative GPA, and number of credits currently enrolled. Smaller factors: course level, number of students who have completed the course, high school GPA, and number of credits earned.
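As a hedged illustration of the approach (not Penn State’s actual model), here is a least-squares fit predicting course grade from just one of the named factors, previous-semester GPA, with invented data points:

```python
# Hedged sketch: simple least-squares model predicting course grade from
# previous-semester GPA (one of the factors named above). Data points are
# invented; the real model uses many more factors.
def fit_line(xs, ys):
    # ordinary least squares for y = a*x + b
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

prev_gpa = [2.0, 2.5, 3.0, 3.5, 4.0]
grade    = [2.1, 2.4, 3.1, 3.4, 3.9]   # course grade on a 4-point scale
a, b = fit_line(prev_gpa, grade)
pred = a * 3.2 + b
# a prediction is "within half a letter grade" if |pred - actual| <= 0.5
print(abs(pred - 3.2) <= 0.5)  # True
```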

First Class – VR teaching simulations.

Legal and ethical issues. How do we know students are submitting their own work? Can we use predictive models to analyze that? Is it ok to correlate other data sources? What kind of assumptions can we make? Do we talk to the students? Are we ethically obligated to provide interventions?

Peeling Back the Corner – Ethical considerations for authoring and use of algorithms, agents and intelligent systems in instruction – Virginia Tech  — Dale Pike

Provost has ideas about using AI and ML for improving instruction and learning.

example: different search results from the same search by different people – we experience the world differently as a result of the algorithms applied to us. Wouldn’t it be good if the search engine allowed us to “peel back the corner” to see what factors are being used in the decision making, and allow us to help improve the algorithm?

When it gets to the point where systems are recommending learning pathways or making admission or graduation decisions, it becomes important.

Peeling back the corner = transparency of inputs and, where possible, implications, when authoring or evaluating algorithms, agents, and intelligent systems. Where possible, let me inform or modify the inputs or interpretation.

Proposal – could we provide a “Truth in Lending Act” to students that clearly indicates activity data being gathered? Required; Opt Out; Opt In

What does a control panel for this look like? How do you manage the infrastructure necessary without crumbling when somebody opts out?

Why? False positives and the potential for rolling implications.

Filters = choose what to show; Recommendation = offer choices; Decision = choose

Data informed learning and teaching

Learning analytics – understanding of individual performance in an academic setting based (usually) on trends of learning activity outcomes. The potential impact of learning analytics is constrained by the scope of analysis: Timeframe, scope of data, source of data. Increasing the impact of potential analysis also increases “creepy factors”.

Personalized/Adaptive learning: individual activity or performance alters (based on model) the substance, pace, or duration of learning experience.

Examples: CMU Simon Initiative Vision – a data-driven virtuous cycle of learning research and innovative educational practice causes demonstrably better learning outcomes for students from any background. Phil Long and Jon Mott — N2GDLE. Now we fix time variable and accept variability in performance. In the future do we unlock the time variable and encourage everybody to get higher performance?

Big Data Analytic Tools and ML/AI’s impact on higher ed. Charley Kneifel – Duke

Infrastructure: CPU/GPU/network as a service. GPUs are popular but hard-ish to share – CPUs are easy. “Commodity” GPUs (actual graphics cards) are very popular (4 in a tower server, 20 in a lab; problems with power and cooling). Centralized, virtualized GPUs make sense (depending on scale), with a mix of “double precision/compute” and graphics cards. Schedulable resource (Slurm at Duke) and interactive usage. Availability inside protected enclaves. Measure resources – do you have idle resources? Storage – HDFS, object stores, fast local filesystems. Network – pipe sizes to the Internet, Science DMZ, encrypted paths…; FIONAs with GPUs – edge delivery.

With VMware, you have to reboot servers to de-allocate GPUs, so it’s fairly disruptive.

Cloud vs. on-prem clusters vs. serverless: software packaging is important (portable, repeatable); support for moving packages to the cloud or into protected enclaves; training capacity vs. operational use; ability to use spare cycles (may need to cleanse the GPU); standard tool sets (Spark, machine learning, …); where is the data? Slurp or sip (can it be consumed on demand)? Serverless support for the tools used – you only pay for what you use, but remember you still need to manage it, and you need agreements for protected data (including BAAs); customized ASICs and even more specialized hardware (cloud or …); complex workflows.

Financial considerations: costs for different GPUs; peak capacity on-prem vs. cloud; pay for what you use; graphics GPUs are cheap but need a home and data.

Security – is the data private, covered by a data use agreement, or protected by law? Does it need to be combined with other data? Is there public/non-sensitive/de-identified data that can be used for dev purposes? IRBs – streamlined processes, training…

Staffing and support: general infrastructure – can it support and scale GPUs and ML? Software packaging – machine learning, R, Matlab, other tools? Tool expertise – both build/deploy and analysis expertise; operational usage vs. research usage; student usage and projects – leverage them!

VP of research support levels and getting faculty up to speed – projects that highlight success with big data/analytics; cross departmental services/projects – generally multidisciplinary; university/health system examples at Duke; protected data projects are often good targets; leverage students; sharing expertise; reduce barriers (IT should be fast for this).

Cloud Forum 2016 – Research In The Cloud

Daniel Fink from Cornell – Computational Ecology and Conservation using Microsoft Azure to draw insights from citizen science data.

Statistician by training. Citizen science and crowd sourced data.

Lab of Ornithology: Mission – to interpret and conserve the earth’s biological diversity through research, education, and citizen science focused on birds.

Why birds? They are living dinosaurs! > 10k species in all environments. Very adaptable and intelligent. Sensitive environmental indicators. Indian Vulture – 30 million in 1990, virtually extinct today. Most easily observed, counted, and studied of all widespread animal groups.

eBird – global bird monitoring project: citizen science for people to report what they see and interact with the data. 300k people have participated, still experiencing huge growth.

Taking the observation data and turning it into scientific information. Understanding the distribution, abundance, and movements of organisms.

Data visualizations: http://ebird.org/content/ebird/occurrence/

Data – want to know every observation everywhere, with very fine geographic resolution. Computationally fill gaps in observations, and reduce noise and bias in data using models.

Species distribution modeling has become a big thing in ecology. Link populations and environment – learn where species are seen more often or not. Link ebird data with remote sensing (satellite) data. Machine learning can build models. Scaling to continental scale presents problems. Species can use completely different sets of habitats in different places, making it hard to assemble broad models.

SpatioTemporal Exploratory Model (STEM) – divide (partition the extent into regions), train & predict models within regions, then recombine. Works well, but computationally expensive. On premise, for one species in North America: fit 10-30k models, 6k CPU hours, 420 hours wall clock (12 nodes, 144 CPUs). Can’t scale – also dealing with a growing number of observations in eBird – 30%/year. Also moving to larger spatial extents.
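The divide/train/recombine idea can be sketched as a toy (a made-up 1-D “extent”, invented observations, and a trivial per-region model; real STEM uses many overlapping spatiotemporal partitions and far richer models):

```python
# Toy sketch of the STEM pattern: partition the extent into regions, fit a
# simple local model per region, then recombine by looking up the model(s)
# covering a query point. All values are invented for illustration.
def region_of(x, size=10):
    return x // size                       # partition extent into regions

def train(points):
    # local "model": mean observed abundance in the region
    return sum(v for _, v in points) / len(points)

observations = [(3, 1.0), (7, 2.0), (12, 4.0), (18, 6.0)]
by_region = {}
for x, v in observations:                  # group observations by region
    by_region.setdefault(region_of(x), []).append((x, v))
models = {r: train(pts) for r, pts in by_region.items()}  # train per region

def predict(x):
    # recombine: real STEM averages many overlapping partitions; here each
    # point falls in exactly one region
    return models[region_of(x)]

print(predict(5))   # 1.5
print(predict(15))  # 5.0
```

Because every region trains independently, this is embarrassingly parallel, which is why MapReduce/Spark fit the workload.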

Cloud requirements – on-demand: reliably provision resources. Open source software: Linux, Hadoop, R. Sufficient CPU & RAM to reduce wall-clock time. A system that can scale in the future. Started shifting the workload about 1.5 years ago. Using MapReduce and Spark has been key, but they aren’t typical research computation tools.

In Azure: using HDInsight and Microsoft R Server – 5k CPU hours, 3 hours wall clock.

Applications – Where populations are, When they are there, What habitats are they in?

Participated in State of North America’s Birds 2016 study. Magnolia Warbler – wanted to summarize abundance in different seasons. Entire population concentrates in a single area in Central America in the winter that is a tenth the size of the breeding environment – poses a risk factor. Then looked to see if the same is true of 21 other species. Still see immense concentration in forested areas of Central America – Yucatan, Guatemala, Honduras. First time there is information to quantify risk assessment. Looking at assessing for climate change and land use.

50 species of water birds use the Pacific Flyway. Concentration point in the California Central Valley, which historically had a huge amount of wetlands, but now there’s less than 5% of what there was. BirdReturns – Nature Conservancy project for dynamic land conservation. Worked with rice growers in the Sacramento River Valley to determine what time of year will be most critical for those populations. The limit is water cover on the land. There’s an opportunity to ask farmers to add water to their paddies a little earlier in the spring and later in the fall, through cash incentives. Rice farmers submit bids; TNC selects bids based on abundance estimates (most birds per habitat per dollar). They’ve added 36k additional acres of habitat since 2014.

Quantifying habitat use. Populations use different habitats in different seasons; seeing a comprehensive picture of that is new and very interesting. Surprising observation of wood thrushes using cities as habitat during fall migration. Is it a fluke caused by observation bias? Is it common across multiple species?

Compare habitat use of resident species vs. migratory species. Looked at 20 neotropical migrants and 10 resident species. Found residents have pretty consistent habitat use, but migrants show seasonal differences, with a real association with urban areas in the fall. Two interpretations: 1) cities might provide important refuges for migrant species, or 2) cities attract species but are ecological traps without enough resources. Collaborators are setting up studies to find out. One hypothesis is that the birds are attracted to lights.

Heath Pardoe from NYU School of Medicine – Cloud-based neuroimaging data analysis in epilepsy using Amazon Web Services.

Comprehensive Epilepsy Center at NYU is a tertiary center, after the local physician and local hospital. Epilepsy is the primary seizure disorder (defined by having two unprovoked seizures in a lifetime). Many different causes and outcomes; figuring out the cause is a primary goal. There are medications and therapies, but the only known cure is surgery, removing a part of the brain. MRI plays a very big role in pre-surgical assessment. A ketogenic diet is quite effective in reducing seizures in children. Implanting electrodes can be effective, zapping when a seizure is likely in order to control brain activity. Research is ongoing on the use of medical marijuana to treat seizures. Medication works well in 2/3 of people, but 1/3 will continue to have seizures. The first step is to take an MRI scan and find the lesions. Radiologists evaluate MRI scans to identify lesions.

Human Epilepsy Project – study people as soon as they’re diagnosed with epilepsy to develop biomarkers for epilepsy outcomes, progression, and treatment response. Tracking patients 3-7 years after diagnosis. Image at diagnosis and at three years. Patients maintain a daily seizure diary on an iOS device. Take genomics and detailed clinical phenotyping. 27 epilepsy centers across the US, Canada, Australia, and Europe.

Analyzing images to detect brain changes over time. Parallel processing of MRI scans. Using StarCluster to create a cluster of EC2 machines (from 20-200) (load balances and manages nodes and turns them off when not used). Occasionally utilize compute optimized EC2 instances for computationally demanding tasks. Recently developed an MRI-based predictive modeling system using OpenCPU and R.

Have a local server in the office running an x2go server that people connect to from workstations; from that server, upload to the EC2 cluster. More than 10 million data points in an MRI scan. Cortical surface modelling delineates different types of brain matter; then you can measure properties to discriminate changes. To compare different patients you need to normalize, by computationally inflating brains like a balloon – called coregistration.

There are more advanced types of imaging.

Some studies done with these techniques: Using MRI to predict postsurgical memory changes.  Brain structure changes with antiepileptic medication use.

Work going on – image analysis as web service: age prediction using MRI. Your brain shrinks as you age. If there’s a big difference between your neurologic age and your chronological age, that can be indicative of poor brain health.

Difficulty of reproducing results is an issue in this field. Usually developed models sit on grad student’s computer never to be run again. Heath developed a web service running on EC2 that can be called to run model consistently.

CSG Fall 2016 – Next-gen web-based interactive computing environments

After a Reuben from Zingerman’s, the afternoon workshop is on next gen interactive web environments, coordinated by Mark McCahill from Duke.

Panel includes Tom Lewis from Washington, Eric Fraser from Berkeley

What are they? What problem(s) are they trying to solve? Drive scale, lower costs in teaching. Reach more people with less effort.

What is driving development of these environments? Research by individual faculty drives use of the same platforms to engage in discovery together. Want to get software to students without them having to manage installs on their laptops. Web technology has gotten so much better – fast networks, modern browsers.

Common characteristics – Faculty can roll their own experiences using consumer services for free.

Tom: Tools: RStudio, Jupyter; Shiny; WebGL interactive 3d visualizations; Interactive web front-ends to “big data”. Is it integrated with LMS? Who supports?

What’s up at UW (Washington)?

Four patterns: Roll your own (and then commercialize); roll your own and leverage the cloud; department IT; central IT.

Roll your own: SageMathCloud cloud environment supports editing of Sage worksheets, LaTex documents, and IPython notebooks. William Stein (faculty) created with some one-time funding, now commercialized.

Roll your own and then leverage the cloud – Informatics 498f (Mike Freeman) Technical Foundations of Informatics. Intro to R and Python, build a Shiny app.

Department IT on local hardware: Code hosting and management service for Computer Science.

Central IT “productizes” a research app – SQLShare – Database/Query as a service. Browser-based app that lets you: easily upload large data sets to a managed environment; query data; share data.

Biggest need from faculty was in data management expertise (then storage, backup, security). Most data stored on personal devices. 90% of Researchers polled said they spend too much of their time handling data instead of doing science.

Upload data through browser. No need to design a database. Write SQL with some automated help and some guided help. Build on your own results. Rolling out fall quarter.

JupyterHub for Data Science Education – Eric Fraser, Berkeley

All undergrads will take a foundational Data Science class (CS + Stat + critical thinking with data), then connector courses into the majors. Fall 2015: 100 students; Fall 2016: 500 students; future: 1000-1500 students.

Infrastructure goals – simple to use; rich learning environment; cost effective; ability to scale; portable; environment agnostic; common platform for foundations and connectors; extend through academic career and beyond. One student wanted to use notebooks from the class in a job interview after the class was done.

What is a notebook? It’s a document with text and math and also an environment with code, data, and results. “Literate Computing” – Narratives anchored in live computation.

Publishing = advertising for research, then people want access to the data. Data and code can be linked in a notebook.

JupyterHub – manages authentication, spawns single-user servers on demand. Each user gets a complete notebook server.

Replicated JupyterHub deployment used for CogSci course. Tested on AWS for a couple of months, ran Fall 2015 on local donated hardware. Migrated to Azure in Spring 2016 – summer and fall 2016. Plan for additional deployment to Google using Kubernetes.

Integration – Learning curve, large ecosystem: ansible, docker, swarm, dockerspawner, swarmspawner, restuser, etc.
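For the dockerspawner piece, a minimal `jupyterhub_config.py` might look like the sketch below (the image name, memory limit, and volume layout are placeholders, not Berkeley’s actual configuration):

```python
# jupyterhub_config.py – minimal sketch (placeholder image/limits, not the
# actual deployment). DockerSpawner launches one notebook container per
# user; JupyterHub handles authentication and proxying.
c = get_config()  # noqa  -- injected by JupyterHub at startup

c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"
c.DockerSpawner.image = "jupyter/datascience-notebook"  # per-user image
c.DockerSpawner.mem_limit = "2G"                        # cap RAM per student
c.DockerSpawner.remove = True                           # clean up on stop
# persist each student's work in a named Docker volume
c.DockerSpawner.volumes = {"jupyterhub-user-{username}": "/home/jovyan/work"}
```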

How to push notebooks into student accounts? Used GitHub, but not all faculty are conversant. Interact: works with GitHub repos; starting to look at integration with Google Drive. Cloud providers are working on notebooks as a service: cloud.google.com/datalab, notebook.azure.com.

https://try.jupyterhub.org – access sample notebooks.

Mark McCahill – case studies from Duke

Rstudio for statistics classes, Jupyter, then Shiny.

2014: containerized RStudio for intro stats courses. Grown to 600-700 student users/semester. Shib auth at the cm-manage reservation system web site; users reserve and are mapped to a personal containerized RStudio. Didn’t use RStudio Pro – didn’t need authn at that level. Inside – NGINX proxy, Docker engine, docker-gen (updates config files for NGINX), config file.

Faculty want additions/updates to packages about twice a semester. 2 or 3 incidents/semester with students who wedge their container, causing RStudio to save corrupted state. Easy fix: delete the bad saved-state file.

Considering providing faculty with automated workflow to update their container template, push to test, and then push to production.

Jupyter: biostats courses started using it in Fall 2015. By summer 2016, >8500 students were using it (including a MOOC).

Upper-division students use more resources: had to separate one biostats course away from the other users and resize the VM. Need limits on the amount of resources – Docker has cgroups. RStudio quiesces a process if unused for two hours; Jupyter doesn’t have that.


Shiny – reactive programming model to build interactive UI for R.

Making synthetic data from the OPM dataset (which can’t be made public), and regression modeling against that data. Wanted to give people a way to compare their results against the real data.



CSG Fall 2016: Large scale research and instructional computing in the Clouds


We’re at the University of Michigan in Ann Arbor for the fall CSG Meeting in the Michigan League. Fall semester is in full swing here.

Mark McCahill from Duke kicks off the workshop with an introduction on when and why the cloud might be a good fit.

The cloud is good for unpredictable loads due to the capability to elastically expand and shrink. Wisconsin example of spinning up 50-100k Condor cores in AWS. http://research.cs.wisc.edu/htcondor/HTCondorWeek2016/presentations/WedHover_provisioning.pdf

Research-specific, purpose-built clouds like Open Science Grid and XSEDE.

Is there enough demand on campus today to develop in-house expertise managing complex application stacks? e.g. should staff help researchers write hadoop applications?

Technical issues include integration with local resources like storage, monitoring, or authentication. That’s easier if you extend the data center network to the cloud, but what about network latency and bandwidth? There are issues around private IP address space, software licensing models, HPC job scheduling, slow connections, billing. Dynamic provisioning of reproducible compute environments for researchers takes more than VMs. Are research computing staff prepared for a more DevOps mindset?

New green field deployments are easier than enhancing existing resources.

Researchers will need to understand cost optimization in the cloud if they’re doing large scale work. That may be a place where central IT can help consult.

AWS Educate Starter – fewer credits than Educate, but students don’t need a credit card.

Case Studies: Duke large scale research & instructional cloud computing

MOOC course (Managing Big Data with MySQL) that wanted to provide 10k students with access to a million row MySQL database. Ended up with over 50k students enrolled.

Architecting for the cloud: Plan to migrate the workload – cloud provider choice will change over time. Incremental scaling with building-block design. Plan for intermittent failures – during provisioning and runtime. Failure of one VM should not affect others.

Wrote a Ruby on Rails app that runs on premise, maps each user to their assigned Docker container, and redirects them to the proper container host/port. Docker containers run Jupyter notebooks, with read-only MySQL access for students. Each VM runs 140 Jupyter notebook containers + 1 MySQL instance, so in the worst case only 140 users can be affected by a runaway SQL query. Containers are restarted once/day to clear sessions.
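The user-to-container mapping idea can be sketched as follows (a hypothetical Python stand-in for the Rails app; the host names, port scheme, and slot accounting are invented for illustration):

```python
# Hedged sketch of the mapping idea: each VM hosts up to 140 notebook
# containers, and a user is pinned to one container so a runaway query
# only affects users on that VM. Hosts/ports are made up.
CONTAINERS_PER_VM = 140
vms = ["vm-01", "vm-02"]                 # hypothetical container hosts

assignments = {}                         # user -> (host, port)

def assign(user):
    if user not in assignments:
        slot = len(assignments)          # next free container slot overall
        vm = vms[slot // CONTAINERS_PER_VM]
        port = 9000 + slot % CONTAINERS_PER_VM   # made-up port scheme
        assignments[user] = (vm, port)
    return assignments[user]             # redirect the user here

print(assign("alice"))   # ('vm-01', 9000)
print(assign("bob"))     # ('vm-01', 9001)
print(assign("alice"))   # ('vm-01', 9000)  (mapping is sticky)
```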

At this scale (50-60 servers) – 1-2% failure rates. Be prepared for provisioning anomalies. Putting Jupyter notebooks into git made it easy to distribute new versions as content was revised. Hit a peak of ~7400 concurrent users. Added a policy of reclaiming containers that had not been visited for 90 days.

Spring 2016 – $100k of Azure compute credits expiring June 30. The compute cluster had all the possible research software on all the nodes, through NFS mounts in the data center. To extend it to Azure, they had to put a VPN tunnel into private address space. Provision CentOS Linux VMs, then make repeated Puppet runs to get things set up, then mount NFS over the tunnel. SLURM started seeing nodes fail and then come back to life – needed deeper monitoring that knows more than just nodes being up or down. The default VPN link into Azure maxes out at 100-200 Mbps, so they throttle the Azure VMs at the OS level so they don’t do more than 10 Mbps each. They limit the number of VMs in an Azure subscription to 20 and run workloads that do more compute and less IO. Provisioned each VM at 16 cores with 112 GB RAM. Started seeing failures because there were no more A11 nodes available in the Azure East data center – unclear if/when there will be more nodes there, and other regions add latency. Ended up using $96k in one month: 80 nodes (16 cores and 112 GB RAM) in 4 groups of 20 nodes across several data centers, with a VPN tunnel for each subscription group.

(One school putting their Peoplesoft HR system in the cloud.)

Stratos Efstathiadis – NYU

– Experiences from running Big Data and Machine Learning courses on public clouds – Grad courses provided by NYU departments and centers. Popular courses with large numbers of students requiring substantial computing resources (GPUs, Hadoop, Spark, etc.).

They have substantial resources on premise. Scheduled tutorials on R, MapReduce, Hive, Spark, etc. Consultations with faculty, working closely with TAs. Why cloud? Timing of resources, ability to separate resources (courses vs. research), access to specific computing architectures, and students need to learn the cloud.

Need a systematic approach. Use case: Deep Learning class from the Center for Data Science. 40 student teams needed access to NVIDIA K80 GPU boards. Each team must have access to identical resources to compete, and instructors must be able to assign and control resources. Required 50 AWS g2.2xlarge instances. Issues: discounts/vouchers are stated per student, not per team. Need to enforce usage caps at various levels so instructor-imposed caps are not exceeded. Daily email notifications to instructors, TAs, and teams provide current costs and details. Students were charged for a full hour every time they spun up an instance. AWS costs were estimated at ~$65k per class; an on-prem solution was $200k.
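The full-hour billing gotcha is easy to illustrate. A sketch of the hourly rounding EC2 used at the time (the session lengths are hypothetical):

```python
import math

def billed_hours(session_minutes):
    """EC2 billing at the time rounded each instance run up to a whole hour."""
    return sum(math.ceil(m / 60) for m in session_minutes)

# Six separate 10-minute experiments bill as 6 instance-hours,
# even though actual use was only 60 minutes:
# billed_hours([10] * 6) -> 6
```

This is why repeatedly stopping and restarting instances inflated student costs, and why caps and daily cost emails mattered.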

Use case: Spatial data repository for searching, discovering, previewing and downloading GIS spatial data.  First generation was locally hosted – difficult to update, not scalable, couldn’t collaborate with other institutions; lack of in-house expertise; no single sign on. Decided to go to the cloud.

Use case: HPC disaster recovery
Datasets were available a few days after Sandy, but where to analyze them? Worked with other institutions to get access to HPC, but challenges included copying large volumes of data and different user environments and configurations. Started using MIT's StarCluster (Software Tools for Academics and Researchers); could also use AWS CfnCluster. Set up a Globus endpoint on S3 to copy data. Software licensing is a challenge – e.g. Matlab; worked things out with Mathworks. Currently they're syncing environments between the NY and Abu Dhabi campuses, but they're investigating the cloud – looking at the StarCluster/CfnCluster approach, but might also do a container-based approach with Docker.


Information, Interaction, and Influence – Information intensive research initiatives at the University of Chicago

Sam Volchenbaum 

Center for Research Informatics – established in 2011 in response to the need of researchers in the Biological Sciences for clinical data. Hired Bob Grossman to come in and start a data warehouse. Governance structure – important to set policies on storage, delivery, and use of data. Set up a secure, HIPAA- and FISMA-compliant data center and got it certified, allowing storage and serving of data with PHI. Got approval of the infrastructure from the IRB to serve up clinical data (once you strip out identifiers, it's not under HIPAA). Set up data feeds and had to prove compliance to the hospital – lots of maneuvers. Released data through open source software called i2b2 to discover cohorts meeting specific criteria. Developed a data request process to gain access to data; seemingly simple requests can require considerable coding. Will start charging for services next month. Next phase is a new UI with Google-like search.

Alison Brizious – Center on Robust Decision Making for Climate and Energy Policy

RDCEP is very much in the user community. Highly multi-disciplinary – eight institutions and 19 disciplines. Provides methods and tools to give policy makers information in areas of uncertainty. Research is computationally and information intensive. A recurring challenge is pulling large amounts of data from disparate sources and of varying quality. One example: evaluating how crops might fail in reaction to extreme events requires massive global data and highly specific local data, and the scales are often mismatched, e.g. between Iowa and Rwanda. Have used Computation Institute facilities to help with those issues. Need to merge and standardize data across multiple groups in other fields. Finding data and making it useful can dominate research projects; they want researchers to concentrate on analysis. Challenges: technical – data access, processing, sharing, reproducibility; cultural – multiple disciplines, what data sharing and access mean, incentives for sharing might be misaligned.

Michael Wilde – Computation Institute

Fundamental importance of the model of computation in the overall process of research and science. If science is focused on delivering knowledge in papers, lots of computation is embedded in those papers – disciplinary coding that represents huge amounts of intellectual capital, done in a chaotic way with no standard for how computation is expressed. If we had such a standard, we could expand the availability of computation and trace back what has been done. Started about ten years ago with the Grid Physics Network, applying these concepts to the LHC, the Sloan Sky Survey, and LIGO – virtual data. If we shipped, along with findings, a standard codified directory of how the data was derived, we could ship computation anywhere on the planet, and once findings were obtained, pass the recipes along to colleagues. Making lots of headway, lots of projects using the tools. Swift – a high-level programming/workflow language for expressing how data is derived, which can also be expressed visually. Trying to apply the kind of thinking the Web brought to society to make science easier to navigate.

Kyle Chard – Computation Institute

Collaboration around data – Globus project. Produce a research data management service. Allow researchers to manage big data – transfer, sharing, publishing. Goal is to make research as easy as running a business from a coffee shop. Base service is around transfer of large data – gets complex with multi-institutions, making sure data is the same from one place to the other. Globus helps with that. Allow sharing to happen from existing large data stores. Need ways to describe, organize, discover. Investigated metadata – first effort is around publishing – wrap up data, put in a place, describe the data. Leverage resources within the institution – provide a layer on top of that with publication and workflow, get a DOI. Services improve collaboration by allowing researchers to share data. Publication helps not only with public discoverability, but sharing within research groups.
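"Making sure data is the same from one place to the other" is end-to-end integrity checking, which Globus handles internally. The underlying idea can be sketched in plain Python (illustrative only, not Globus code):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so arbitrarily large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def transfer_verified(src_path, dst_path):
    """True when the source and destination copies hash identically."""
    return sha256_of(src_path) == sha256_of(dst_path)
```

Doing this reliably across institutions, firewalls, and restarts of multi-terabyte transfers is exactly the complexity the notes say Globus absorbs for researchers.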

James Evans – Knowledge Lab

Sociologist, Computation Institute. The Knowledge Lab started about a year ago, driven by a handful of questions: Where does knowledge come from? What drives attention and imagination? What role do social and institutional factors play in what research gets done? How is knowledge shared? Its purpose is to marry those questions with the explosion of digital information and the opportunities it provides. Answering four questions: How do we efficiently harvest and share knowledge from all over? How do we learn how knowledge is made from these traces? How do we represent and recombine knowledge in novel ways? How do we improve ways of acquiring knowledge? Interested in the long view – what kinds of questions could be asked? Providing mid-scale funding for research projects. Questions they've been asking: how science as an institution thinks and how scientists pick the next experiment; what's the balance of tradition and innovation in research; how people understand safety in their environment, using street-view data; taking data from millions of cancer papers to drive experiments with a knowledge engine; studying peer review – how does the review process happen? Challenges: the corpus of science and working with publishers – how to represent themselves as a safe harbor that can provide value back; how to enable rich data annotations at a level scientists can engage with; how to put this in a platform that fosters sustained engagement over time.

Alison Heath – Institute for Genomics and Systems Biology and Center for Data Intensive Science

Open Science Data Cloud – genomics, earth sciences, social sciences. How to leverage cloud infrastructure? How do you download and analyze petabyte-size datasets? Create something that looks kind of like Amazon or Google, but with instant access to large science datasets. What ideas do people come up with that involve multiple datasets? How do you analyze millions of genomes? How do you protect the security of that data? How do you create a protected cloud environment for it? BioNimbus protected data cloud hosts the bulk of the Cancer Genome Project – expected to be about 2.5 petabytes of data. Looked at building infrastructure, now looking at how to grow it and give people access. In the past, communities themselves have handled data curation – how to make that easier? Tools for mapping data to identifiers and citing data. But data changes – how do you map that? How far back do you keep it? Tech vs. cultural problems – culturally it has been difficult. Some data access is controlled by NIH – it took months to get them to release attributes about who can access data; email doesn't scale for those purposes. Reproducibility – with virtual machines you can save a snapshot to pass on.


Engagement needs to go well beyond the technical. James Evans is engaging with humanities researchers – holding the equivalent of focus groups around questions over a sustained period: hammering on questions, working with data, reimagining projects. Fund people to do projects that link into data. Groups with multiple post-docs and data-savvy students can work once you give them access; artisanal researchers need more interaction and interface work. Simplify the pipeline of research outputs – reduce it to 4-5 plug-and-play bins with menus of analysis options.

Alison – it helps to be your own first user group. Some user communities are technical, some are not. Globus has a Web UI, CLI, APIs, etc.; about 95% of the community use the web interface, which surprised them. Globus has a user experience team making things easier to use. It is easy to get tripped up on certificates, security, and authentication, which makes good interfaces difficult. Electronic Medical Record companies have no interest in sharing data across systems, which makes things very difficult.

CRI – some people see them as a service provider, some as a research group. Success is measured differently for each, so they're trying to track both sets of metrics and figure out how to pull them out of day-to-day workstreams. Satisfaction of users will be seen in repeat business and comments to the dean's office, not through surveys. Providing boilerplate language on methods and results for grants and writing letters of support goes a long way towards making life easier for people – CRI delivers results with methods and results sections ready to use in paper drafts.

Should journals require an archived VM for download? Having recipes at the right level of abstraction, in addition to the data, is important. Data stored in repositories is typically not high quality – it lacks metadata and curation. You can rerun the exact experiment that was run, but not others. If toolkits automatically produce that recipe for storage and transmission, people will find it easy.


CNI Fall 2013 – SHARE Update: Higher Education And Public Access To Research

Tyler Walters, Virginia Tech, MacKenzie Smith, UC-Davis

SHARE – Ensuring broad and continuing access to research is central to the mission of higher education.

Catalyst was the February OSTP memo.

SHARE Tenets – How do we see the research world and our role in it? Independent of the operationalization of the OSTP directive, the higher ed community is uniquely positioned to play a leading role in the stewardship of research. How can we help PIs and researchers meet compliance requirements?

Higher ed also has an interest in collecting and preserving scholarly output.

Publications, data, and metadata should be publicly accessible to accelerate research and discovery.

Complying with multiple requirements from multiple funding sources will place a significant burden on principal investigators and offices of sponsored research.  The rumors are that different agencies will have different approaches and repositories, complicating the issue.

We need to talk more about workflows and policies. We can rely on existing standards where available.

SHARE is a cross-institutional framework to ensure access to, preservation and reuse of, and policy compliance for funded research. SHARE will be mostly a workflow architecture that can be implemented differently at different institutions. The framework will enable PIs to submit their funded research to any of the deposit locations designated by federal agencies using a common UI; it will package and deliver relevant metadata and files. Institutions implementing SHARE may elect to store copies in local repositories. Led by ARL, with support from AAU and APLU, and guided by a steering committee drawing from libraries, university administration, and other core constituencies.

Researchers – current funder workflows have 20+ steps. Multiple funders = tangle of compliance requirements and procedures, with potential to overwhelm PIs. Single deposit workflow = more research time, less hassle.

Funding Agencies – Streamlines receipt of information about research outputs; Increases likelihood of compliance

Universities – Optimize interaction among funded research, research officers, and granting agency; Creates organic link between compliance and analytics

General Public – Makes it easier for the public to access, reuse, and mine research outputs and research data; adoption of standards and protocols will make it easier for search engines.

Project map: 1. Develop Project Roadmap (hopefully in January); 2. Assemble working groups; 3. Build prototypes; 4. Launch prototypes; 5. Refine; 6. Expand

Mackenzie –  The great thing about SHARE is that it means something different to everyone you talk to. 🙂

Architecture (very basic at present) – 4 layers:

Thinking about how to federate content being collected in institutional repositories:

  • Distributed content storage layer – content will be everywhere
  • Customized discovery layers (e.g. DPLA) above that
  • Notification layer above that
  • Content aggregations for things like text mining (future) – to support this, will need to identify content aggregation mechanisms (great systems out there already)

Raises lots of issues:

  • Researchers don't want to do anything new, but want to be able to comply. Want to embed the notification layer into tools they already use.
  • Sponsored research offices are terrified of a giant unfunded mandate. So we have to provide value back to researcher and institution, and leverage what we already have rather than building new infrastructure.
  • Who should we look at as existing infrastructure to leverage?

What's the balance between publications and data? Both were covered in the memo, but it was very vague. Most agencies and institutions have some idea how to deal with publications, but not the data piece. Whatever workflow SHARE adopts will have to incorporate data handling.

Want to leverage workflow in sponsored project offices to feed SHARE.

What do we know about CHORUS (the publisher’s response) at this point? Something will exist. It would be good to have notifications coming out of CHORUS – they are part of the ecosystem.

Faculty will get a lot more interested in what’s being put out on the web about their activities. Some campuses are tracking a lot of good data in promotion and tenure dossier systems, but  that may not be able to be used for other purposes.

Will there be metadata shared for data or publications that can’t be made public? Interesting issues to deal with.

What does success look like? We don’t know yet – immediate problem is the OSTP mandate which will come soon. At base the notifications system is very important – the researcher letting the people who need to know that there is output produced from their research. Other countries have had notions of accountability for funded research for a long time. Even deans don’t know what publications from faculty are being produced. In Europe they don’t have that problem.

Want to be in a position by end of 2014 to invite people to look at and touch system. Send thoughts to share@arl.org

CNI Fall 2013 – Creating A Data Interchange Standard For Researchers, Research, And Research Resources: VIVO-ISF

Dean B. Krafft, Brian Lowe, Cornell University

What is VIVO?

  • Software: an open-source semantic-web-based researcher and research discovery tool
  • Data: Institution-wide, publicly-visible information about research and researchers
  • Standards: A standard ontology (VIVO data) that interconnects researchers

VIVO normalizes complex inputs, connecting scientists and scholars with and through their research and scholarship.

Why is VIVO important?

  • The only standard way to exchange information about research and researchers across diverse institutions
  • Provides authoritative data from institutional databases of record as Linked Open Data
  • Supports search, analysis, and visualization of data
  • Extensible

An HTTP request can return either HTML or RDF data.
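Serving HTML or RDF from the same URL is HTTP content negotiation: the server inspects the Accept header and picks a representation. A minimal, simplified sketch (real servers also weigh q-values, and the RDF media types listed are common linked-data serializations, not necessarily VIVO's exact set):

```python
def choose_representation(accept_header):
    """Return the media type to serve for a given HTTP Accept header.

    Simplified sketch: clients asking for an RDF serialization get it,
    everyone else gets HTML.
    """
    rdf_types = ("application/rdf+xml", "text/turtle", "application/ld+json")
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip()
        if media_type in rdf_types:
            return media_type
    return "text/html"
```

A browser sending `text/html,application/xhtml+xml` would get the human-readable page, while a harvester sending `application/rdf+xml` would get Linked Open Data.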

Value for institutions and consortia

  • Common data substrate
  • Distributed curation beyond what is officially tracked
  • Data that is visible gets fixed

US Dept. of Agriculture implementing VIVO for 45,000 intramural researchers to link to Land Grant universities and international agricultural research institutions.

VIVO exploration and Analytics

  • structured data can be navigated, analyzed, and visualized within or across institutions.
  • VIVO can visualize strengths of networks
  • Create dashboards to understand impact

Providing the context for research data

  • Context is critical to find, understand, and reuse research data
  • Contexts include: narrative publications, research grant data, etc.
  • VIVO dataset registries: Australian National Data Registry, Datastar tool at Cornell

Currently hiring a full-time VIVO project director.

VIVO and the Integrated Semantic Framework

What is the ISF?

  • A semantic infrastructure to represent people based on all the products of their research and activities
  • A partnership between VIVO, eagle-i, and ShareCenter
  • A Clinical and Translational Information Exchange project (CTSAConnect): 18 months (Feb 2012 – Aug 2013), funded by NIH

People and Resources – VIVO interested primarily in people, eagle-i interested in genes, anatomy, manufacturer. Overlap in techniques, training, publications, protocols.

ISF Ontology about making relationships – connecting researchers, resources, and clinical activities. Not about classification and applying terms, but about linking things together.

Going beyond static CVs – distributed data, research and scholarship in context, context aids in disambiguation, contributor roles, outputs and outcomes beyond publications.

Linked Data Vocabularies: FOAF (Friend of a Friend) for people, organizations, groups; vCard (contact info; new version); BIBO (publications); SKOS (terminologies, controlled vocabularies, etc.).

Open Biomedical Ontologies (OBO family): OBI (Ontology of Biomedical Investigations); ERO (eagle-i Research Resource Ontology); RO (Relationship Ontology); IAO (Information Artifact Ontology – goes beyond bibliographic).

Basic Formal Ontology from OBO – Process, Role, Occurrent, Continuant, Spatial Region, Site.

Reified Relationships – Person-Position-Org, Person-Authorship-Article. The RDF subject/predicate model breaks down for some things, like modeling different position relationships over time. So the relationship is promoted to an entity of its own with its own metadata. This allows aggregation over time, e.g. a Position can be held over a particular time interval, which allows building a distributed CV over time. It also allows aggregating name-change data over time by applying time properties to multiple vCards.
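The reification pattern can be sketched outside RDF as well. Here the Position relationship is modeled as its own entity carrying time metadata (class and field names are illustrative, not actual VIVO-ISF terms):

```python
from dataclasses import dataclass
from typing import List, Optional

# Instead of a bare (person, holdsPosition, org) link, the Position is
# an entity of its own, so it can carry a time interval and a title.
@dataclass
class Position:
    person: str
    organization: str
    title: str
    start_year: int
    end_year: Optional[int]  # None = currently held

def positions_during(positions: List[Position], year: int) -> List[Position]:
    """All positions held in a given year – the basis of a CV over time."""
    return [p for p in positions
            if p.start_year <= year and (p.end_year is None or year <= p.end_year)]
```

Because each relationship is a first-class record, a distributed CV can be assembled by aggregating Position entities from many institutions and filtering by time.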

Beyond publication bylines – What are people doing? Roles are important in VIVO ISF. Person-Role-Project. Roles and outputs: Person-Role-Project-document, resource, etc.

Application examples: search (beta.vivosearch.org) can pull in data from distributed software (e.g. Harvard Profiles) using VIVO ontologies.

Use cases: Find publications supported by grants; discover and reuse expensive equipment and resources; demonstrate importance of facilities services to research results; discover people with access to resources or expertise in techniques.

Humanities and Artistic Works – performances of a work, translations, collections and exhibits. Steven McCauley and Theodore Lawless at Brown.

Collaborative development – DuraSpace VIVO-ISF Working Group. Biweekly calls Wed 2 pm ET. https://wiki.duraspace.org/display/VIVO/VIVO-ISF+Ontology+Working+Group

Linked Data for Libraries

December 5, 2013: Mellon made a 2-year grant to Cornell, Harvard, and Stanford, starting January 2014, to develop the Scholarly Resource Semantic Information Store (SRSIS) model to capture the intellectual value that librarians and other domain experts add to information resources, together with the social value evident from patterns of research.

Outcomes: Open source extensible SRSIS ontology compatible with VIVO, BIBFRAME and other ontologies for libraries.

Sloan has funded Cornell to integrate ORCID more closely with VIVO. At Cornell they're turning MARC records into RDF triples indexed with Solr – beta.blacklight.cornell.edu


CNI Fall 2013 – Visualizing: A New Data Support Role For Duke University Libraries

Angela Voss – Data Visualization Coordinator, Duke Libraries

Data visualizations can be typical types such as maps or tag clouds, or custom visualizations such as parallel-axes plots. Helping people match their data to their needs and what they want to get out of their data; also helping people think about the costs and benefits of creating visualizations.

Why visualize?

  • Explore data, uncover hidden patterns. e.g. Anscombe’s Quartet.
  • Translate something typically invisible into the visible – makes the abstract easier to understand, increase engagement. Important to people performing research as well as reporting to others.
  • Communicate results, contextualize data, tell a story, or possibly even mobilize action around a problem (see Hans Rosling: The River of Myths). Important to build context around data, not just assume the numbers speak for themselves.
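Anscombe's Quartet, mentioned above, is easy to verify in code: two of its y-series summarize almost identically yet look completely different when plotted (one a noisy line, one a parabola). Data values are from the published quartet:

```python
from statistics import mean, pstdev

# Two of Anscombe's four datasets: same x, near-identical summary
# statistics, very different shapes - the classic case for plotting first.
x  = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

# Identical mean and spread to two decimals, yet y2 traces a parabola.
assert round(mean(y1), 2) == round(mean(y2), 2) == 7.5
assert round(pstdev(y1), 2) == round(pstdev(y2), 2)
```

No summary table would reveal the difference; only the plot does, which is exactly the "uncover hidden patterns" point.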

Visualization at Duke

  • No single centralized community, but plenty of distributed groups and projects.
  • Library was already offering GIS help.
  • Who could support visualization? Faculty/department? College/school? Of the options, a campus-wide organization was the only one with wide enough reach – Duke created a position that reports jointly to the Libraries and OIT.
  • Position started in June 2012 – Dual report to Data and GIS Services in the Libraries and Research Computing in OIT.
  • Objectives: instruction and outreach; consultation; develop new visualization services, spaces, programs.

After 18 months, what has been the most successful?

  • Visualization workshop series – software (Tableau, which full-time students get free, and d3, a JavaScript library), data processing (text analysis, network analysis), best practices (designing academic figures/posters, top 10 dos and don'ts for charts and graphs). The barrier is understanding the data transformations needed to get data into the software.
  • Online instructional material
  • Just-in-time consulting – crucial to people getting started.
  • Ongoing visualization seminar series – this had been happening since 2002. Helped introduce the community.
  • Student data visualization contest
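The data-transformation barrier mentioned above is usually reshaping: turning a "wide" spreadsheet (one column per measure) into the "long"/tidy rows that tools like Tableau and d3 expect. A toy sketch with hypothetical column names:

```python
# A minimal "melt" (wide -> long), the reshape that trips up most
# workshop attendees. Column names below are made up for illustration.
def melt(rows, id_key, value_keys):
    """Turn each wide row into one long row per value column."""
    return [{id_key: row[id_key], "variable": k, "value": row[k]}
            for row in rows for k in value_keys]

wide = [{"county": "Durham", "2011": 5, "2012": 7}]
long_rows = melt(wide, "county", ["2011", "2012"])
```

Each (county, year, value) row can then be mapped directly to chart encodings such as color and position, which is hard to do from the wide form.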

d3 monthly study group – using GitHub to share sample code, and Gist and bl.ocks.org to see the visualization right away, e.g. http://bl.ocks.org/dukevis/6768900/.

Top 10 Dos and Don’ts for Charts and Graphs:

  • Simplify less important information
  • Don’t use 3D effects.
  • Don’t use rainbows for ordered, numerical variables. Use single hue, varying luminance.
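A single-hue, varying-luminance palette of the kind recommended above can be generated with nothing but the standard library. An illustrative sketch; in practice, established sequential colormaps (e.g. ColorBrewer) are the safer choice:

```python
import colorsys

def single_hue_ramp(hue, steps):
    """A sequential palette: one hue, lightness swept from light to dark.

    hue is in [0, 1] (e.g. 0.6 for blue); requires steps >= 2.
    Returns RGB triples with components in [0, 1].
    """
    # Hold hue and saturation fixed; vary only lightness (0.9 down to 0.2),
    # so perceived brightness orders the values - unlike a rainbow map.
    return [colorsys.hls_to_rgb(hue, 0.9 - 0.7 * i / (steps - 1), 0.6)
            for i in range(steps)]
```

Because only lightness changes, larger data values read as unambiguously darker, which a rainbow colormap cannot guarantee.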

Just in time consulting

  • Weekly walk-in consulting hours in the Data & GIS Services computer lab
  • Additional appointments outside of walk-in hours
  • Detailed support and troubleshooting via email

Weekly visualization seminars – Lunch provided, speakers from across campus and outside. Regularity helps. Live streaming and archived video. http://vis.duke.edu/FridayForum

Student data visualization contest

  • Goal: to advertise new services, take a survey of visualization at Duke – helped build relationships across the campus.
  • Open to Duke students, any type of visualization
  • Judged on insightfulness, narrative, aesthetics, technical merit, novelty
  • Awarded three finalists and two winners. Created posters of the winners to display in the lab, and run them on the monitor wall.

After 18 months, what are the challenges?

  • Marketing and outreach – easy to get overwhelmed by the people already using services at the expense of reaching new communities.
  • Staying current – every week there’s a new tool.
  • Project work, priorities – important to continue work as a visualizer on projects.
  • Disciplinary silos and conventions
  • Curriculum and skill gaps – there aren’t people teaching visualization at Duke as a separate topic. Common skill gaps: visualization types and tools; spreadsheet and/or database familiarity; scripting; robust data management practices; basic graphic design

Hopes for the future

  • Active student training program (courses, independent studies, student employment)
  • Additional physical and digital exhibit opportunities
  • Continued project and workshop development

What should a coordinator know?

  • Data transformations
  • Range of visualization types, tools
  • Range of teaching strategies
  • Marketing

What should a coordinator do?

  • Find access points to different communities
  • Use events to build community
  • Collaborate on research projects
  • Stockpile interesting datasets
  • Beware of unmanaged screens
  • Block out plenty of quiet time for the above

How should an organization establish a new visualization support program?

  • Identify potential early adopters
  • Budget for a few events, materials, etc
  • Involve other service points
  • Provide a support system for the coordinator
  • Expect high demand

Working primarily with staff and grad students; this quarter, a lot of undergrads due to a few courses.

Angela's background is mostly in communication. There's an IEEE visualization conference.