CSG Fall 2016 – Next-gen web-based interactive computing environments part 2

NYU Sample Cases – Stratos Efstathiadis

Web-based interactive tools supported by NYU Data Services

Most popular tools include Quantitative, Geospatial, Qualitative, Visualization. Courses, boot camps, etc.

Some tools are web-based (RStudio, ArcGIS Online, CART, Qualtrics, Tableau, plotly)

Services provided for tools: Training, Consultations; Pedagogy; Data; Accounts

Geospatial example: ArcGIS online used by courses in radicalization & religion and ethnic conflict, art & politics in the street. Initial consultation and needs assessment, account creation, training, data gathering, in depth consultations, initial web publishing, training round 2, technical support, lessons learned.

Certificate class in Big Data. Structured around a textbook developed from a set of similar classes. Three options: For credit; certificate class meeting four times a semester for three days each; tailored for an agency. Includes non-NYU students, students will be able to access and analyze protected/confidential data. Work in teams sharing code and data in project spaces; provide substantial analytic and visualization capabilities where everyone in the class can work simultaneously. User experience is important.

Deployed two PoC environments: on-premise (short-term) and on AWS (long-term).

Built NYU Secure Research Data Environment – serve broad communities of NYU scholars and their collaborators including government agencies and private sector; support a wide spectrum of data; provide access to powerful resources; enable collaboration; offer training; offer data curation and publications.

Part of Digital Repository Service: Band 1 (fast temporary storage); Band 2 (storage for ongoing activities); Band 3 (feature rich publication environment); Band 4 (secure data environment).

ARC Connect
Mark Montagues – Michigan

Goal: enable easier access to HPC resources for researchers who had never used the command line. Texas shared code they wrote for the XSEDE portal. Added federated authentication with Shib and mandatory multifactor. ITAR today, HIPAA on roadmap. Mandatory, enforced encryption for VNC sessions (no SSH tunnels needed). Web-enabled VNC viewer brings up a desktop. gsissh (part of the Globus Toolkit) enables authn between the ARC Connect web server and cluster node. Environment has RStudio and Jupyter. Researchers can install web apps in their home directory on the HPC cluster.

To take advantage of the infrastructure, web apps need to be able to run nicely behind a reverse proxy. Hoping to automate the environment more in the future.
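
What "running nicely behind a reverse proxy" tends to mean in practice is that the app does not assume it lives at the root of the site. The Python sketch below is only an illustration, not ARC Connect's code: it assumes the proxy passes the mount point in an `X-Forwarded-Prefix` header (a common convention, but an assumption here), and `PrefixMiddleware`, `demo_app` and `call` are hypothetical names.

```python
from wsgiref.util import setup_testing_defaults


class PrefixMiddleware:
    """Make a WSGI app aware of the URL prefix a reverse proxy mounts it under."""

    def __init__(self, app, header="HTTP_X_FORWARDED_PREFIX"):
        self.app = app
        self.header = header  # assumed header name; must be set by the proxy

    def __call__(self, environ, start_response):
        prefix = environ.get(self.header, "").rstrip("/")
        path = environ.get("PATH_INFO", "")
        if prefix:
            # Tell the app where it is mounted so generated links include the prefix.
            environ["SCRIPT_NAME"] = prefix
            # Some proxies forward the full path; strip the prefix if still present.
            if path == prefix or path.startswith(prefix + "/"):
                environ["PATH_INFO"] = path[len(prefix):] or "/"
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    """Tiny app that builds a link relative to wherever it is mounted."""
    link = environ.get("SCRIPT_NAME", "") + "/static/app.js"
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [link.encode("utf-8")]


def call(app, path, prefix=None):
    """Invoke a WSGI app once and return the response body as text."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    if prefix is not None:
        environ["HTTP_X_FORWARDED_PREFIX"] = prefix
    body = app(environ, lambda status, headers: None)
    return b"".join(body).decode("utf-8")
```

Calling `call(PrefixMiddleware(demo_app), "/user/alice/rstudio/index", "/user/alice/rstudio")` yields a link that keeps the per-user prefix, which is the behavior a proxied app in a home directory would need.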

Charlie suggests that Galaxy is a piece of software that is worth looking at.


CSG Fall 2016 – Next-gen web-based interactive computing environments

After a Reuben from Zingerman’s, the afternoon workshop is on next gen interactive web environments, coordinated by Mark McCahill from Duke.

Panel includes Tom Lewis from Washington, Eric Fraser from Berkeley

What are they? What problem(s) are they trying to solve? Drive scale, lower costs in teaching. Reach more people with less effort.

What is driving development of these environments? Research by individual faculty drives use of the same platforms to engage in discovery together. Want to get software to students without them having to manage installs on their laptops. Web technology has gotten so much better – fast networks, modern browsers.

Common characteristics – Faculty can roll their own experiences using consumer services for free.

Tom: Tools: RStudio, Jupyter; Shiny; WebGL interactive 3d visualizations; Interactive web front-ends to “big data”. Is it integrated with LMS? Who supports?

What’s up at UW (Washington)?

Four patterns: Roll your own (and then commercialize); roll your own and leverage the cloud; department IT; central IT.

Roll your own: SageMathCloud cloud environment supports editing of Sage worksheets, LaTeX documents, and IPython notebooks. William Stein (faculty) created with some one-time funding, now commercialized.

Roll your own and then leverage the cloud – Informatics 498f (Mike Freeman) Technical Foundations of Informatics. Intro to R and Python, build a Shiny app.

Department IT on local hardware: Code hosting and management service for Computer Science.

Central IT “productizes” a research app – SQLShare – Database/Query as a service. Browser-based app that lets you: easily upload large data sets to a managed environment; query data; share data.

Biggest need from faculty was in data management expertise (then storage, backup, security). Most data stored on personal devices. 90% of Researchers polled said they spend too much of their time handling data instead of doing science.

Upload data through browser. No need to design a database. Write SQL with some automated help and some guided help. Build on your own results. Rolling out fall quarter.
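
The workflow described above (upload, no schema design, query, build on earlier results) can be sketched with nothing but the Python standard library. This is not SQLShare's implementation; SQLite stands in for the managed database, and `upload_csv`, `save_query` and the sample data are made up for illustration.

```python
import csv
import io
import sqlite3


def upload_csv(conn, table, csv_text):
    """Create a table straight from the CSV header; values are stored as text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)


def save_query(conn, name, sql):
    """Store a query as a named view so later queries can build on it."""
    conn.execute(f'CREATE VIEW "{name}" AS {sql}')


conn = sqlite3.connect(":memory:")
upload_csv(conn, "samples", "site,depth_m,temp_c\nA,10,12.5\nA,20,11.0\nB,10,14.0\n")

# First result: mean temperature per site, saved under a name.
save_query(conn, "site_means",
           "SELECT site, AVG(CAST(temp_c AS REAL)) AS mean_temp "
           "FROM samples GROUP BY site")

# Second result builds on the first one instead of repeating its SQL.
warm_sites = conn.execute(
    "SELECT site FROM site_means WHERE mean_temp > 12 ORDER BY site").fetchall()
```

Saving each query as a view is one simple way to get the "build on your own results" behavior: the researcher never designs tables, only names results.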

JupyterHub for Data Science Education – Eric Fraser, Berkeley

All undergrads will take a foundational Data Science class (CS+Stat+Critical Thinking w/data), then connector courses into the majors. Fall 2015: 100 students; Fall 2016 500 students; Future: 1000-1500 students.

Infrastructure goals – simple to use; rich learning environment; cost effective; ability to scale; portable; environment agnostic;  common platform for foundations and connectors; extend through academic career and beyond. Student wanted notebooks from class to use in job interview after class was done.

What is a notebook? It’s a document with text and math and also an environment with code, data, and results. “Literate Computing” – Narratives anchored in live computation.

Publishing = advertising for research, then people want access to the data. Data and code can be linked in a notebook.

JupyterHub – manages authentication, spawns single-user servers on demand. Each user gets a complete notebook server.

Replicated JupyterHub deployment used for CogSci course. Tested on AWS for a couple of months, ran Fall 2015 on local donated hardware. Migrated to Azure in Spring 2016 – summer and fall 2016. Plan for additional deployment to Google using Kubernetes.

Integration – Learning curve, large ecosystem: ansible, docker, swarm, dockerspawner, swarmspawner, restuser, etc.

How to push notebooks into student accounts? Used GitHub, but not all faculty are conversant. Interact: works with GitHub repos, starting to look at integration with Google Drive. Cloud providers are working on notebooks as a service: cloud.google.com/datalab, notebook.azure.com.

https://try.jupyterhub.org – access sample notebooks.

Mark McCahill – case studies from Duke

RStudio for statistics classes, Jupyter, then Shiny.

2014: containerized RStudio for intro stats courses. Grown to 600-700 student users/semester. Shib at cm-manage reservation system web site, users reserve and are mapped to personal containerized RStudio. Didn’t use RStudio Pro – didn’t need authn at that level. Inside – NGINX proxy, Docker engine, docker-gen (updates config files for NGINX), config file.

Faculty want additions/updates to packages about twice a semester. 2 or 3 incidents/semester with students who wedge their container, causing RStudio to save corrupted state. Easy fix: delete the bad saved-state file.

Considering providing faculty with automated workflow to update their container template, push to test, and then push to production.

Jupyter: Biostats courses started using in Fall 2015. By summer 2016 >8500 students using (with MOOC).

Upper division students use more resources: had to separate one Biostats course away from the other users and resized the VM. Need to have limits on amount of resources – Docker has cgroups. RStudio quiesces process if unused for two hours. Jupyter doesn’t have that.
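
Since Jupyter did not quiesce idle sessions on its own, one option is a small external job that decides which containers to stop. The sketch below only shows the selection logic in Python; `sessions_to_cull`, the user names and the timestamps are hypothetical, and the two-hour limit simply mirrors the RStudio figure above. Memory limits would be applied separately, for example with Docker's `--memory` option on `docker run`.

```python
from datetime import datetime, timedelta

IDLE_LIMIT = timedelta(hours=2)  # mirrors the two-hour RStudio figure in the notes


def sessions_to_cull(last_activity, now, idle_limit=IDLE_LIMIT):
    """Return the users whose notebook containers have been idle too long.

    last_activity maps a user name to the datetime of their last request.
    """
    return sorted(user for user, seen in last_activity.items()
                  if now - seen >= idle_limit)


# Made-up activity log for three users.
now = datetime(2016, 9, 20, 15, 0)
activity = {
    "alice": datetime(2016, 9, 20, 14, 30),  # active 30 minutes ago
    "bob": datetime(2016, 9, 20, 12, 45),    # idle for 2h15m
    "carol": datetime(2016, 9, 20, 13, 0),   # idle for exactly 2h
}
idle_users = sessions_to_cull(activity, now)
```

Keeping the decision in a pure function makes it easy to test before wiring it to whatever actually stops the containers.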


Shiny – reactive programming model to build interactive UI for R.

Making synthetic data from OPM dataset (which can’t be made public), and regression modeling against that data. Wanted to give people a way that allows comparison of results to the real data.



CSG Fall 2016 – Large scale research and instructional computing in the Clouds, part 2

What is Harvard Doing for Research Computing?
Tom Vachon – Manager of Cloud Architecture at Harvard

Research Computing & Instructional Computing
Harvard AWS admin usage averages about $150k/month, research computing ~$90k. There’s Azure usage too.

They have a shared research computing facility with MIT and other campuses. Want to use cloud to burst capacity.

Cloud makes instructional computing easier, particularly in CS-centric classes.

How do you save money in the cloud? Spot Instances (if you can architect workload to survive termination with almost no notice); Auto-turndown (can base rules on tags); Provide ongoing cost data – if you don’t tell people how much they’re spending as they spend it they get surprised. How does region choice influence cost? Cheapest region may not be closest. Certain places might not have the features – e.g. high bandwidth InfiniBand only available in 2 regions in Azure. Understand cloud native concepts like AWS Placement Groups to get full bandwidth between instances. How do you connect to your provider – cannot be an afterthought. What speed? Harvard has a 40 Gb Direct Connect to AWS. What reliability? (Had issues with Azure VPN appliances which disconnect every six minutes.) Where do you do encryption? Network or application? They chose to require application encryption (including database connections) rather than encrypting the network connections themselves.
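
As an illustration of the tag-based auto-turndown idea, here is a minimal Python sketch of the rule itself, kept separate from any cloud API. Nothing here is Harvard's tooling: the `Instance` class, the `auto-off`, `on-from` and `on-until` tag names, and the default 8-to-18 window are all assumptions for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    instance_id: str
    state: str
    tags: dict = field(default_factory=dict)


def should_turn_down(instance, hour):
    """Apply a tag-based rule: stop opted-in instances outside their working hours.

    Assumed convention: tag "auto-off" = "true" opts in, and optional tags
    "on-from" / "on-until" give the working window in hours (default 8 to 18).
    """
    if instance.state != "running":
        return False
    if instance.tags.get("auto-off", "false").lower() != "true":
        return False
    start = int(instance.tags.get("on-from", 8))
    end = int(instance.tags.get("on-until", 18))
    return not (start <= hour < end)


# Hypothetical fleet evaluated at 20:00.
fleet = [
    Instance("i-dev-1", "running", {"auto-off": "true"}),
    Instance("i-prod-1", "running", {"env": "prod"}),
    Instance("i-class-1", "running", {"auto-off": "true", "on-until": "22"}),
    Instance("i-dev-2", "stopped", {"auto-off": "true"}),
]
to_stop = [i.instance_id for i in fleet if should_turn_down(i, hour=20)]
```

Making turndown opt-in by tag keeps production systems safe by default; a scheduled job would then pass `to_stop` to the provider's stop call.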

Cloud requires new tools. How will you handle multiple providers? They’re making golden images for each provider that have very little in them. Ideally have one config management product (they’re consolidating to SaltStack). Using Terraform to run images on multiple vendors – worth buying the enterprise version. Bonus if you can use same toolset on-premise.

Research Computing in AWS at Michigan
Todd Raeker – Advanced Research Computing Technology Services

What are researchers doing? What kinds of projects?

At Michigan environment is dominated by the Flux cluster: HPC and HTC, 28k computational cores. Researchers aren’t looking to do large-scale compute in the cloud.

In 2015 AWS program to explore cloud. Received 20 x $500 AWS credits. Most were small scale projects. Was primarily used by web applications – easy to learn and use. Researchers used AWS for data storage and compute. Easier to collaborate with colleagues at different institutions – researchers can manage their collaborations.

Pros and cons: Can be support intensive, need to train staff. Good for self-sufficient researchers (with savvy grad students). User setup can be made hard. Is it really cheaper?

Duke – 60% of research compute loads were using 1 core and 4 GB of RAM – ideal for moving to the cloud.

Asbed Bedrossian – USC, Jeremy Hallum, Michigan

Containerization can help with reproducibility of science results. That’s an area where IT can partner to make containers for researchers to distribute.  Science is a team sport now, complexity increasing rapidly. Challenge is to get attention of researcher to change how they work – need to meet person in the field (agricultural extension metaphor). Central IT can help facilitate that, but you can’t just hang out a shingle and hope they come. Ag Extension agents were about relationships.

Notre Dame starting by helping people with compliance issues in the cloud – preparing for NIST 800-171 using GovCloud.

Getting instructors to use modern tools (like checking content into Git and then doing the builds from there) can help socialize the new environments.

Harvard hopes to let people request provisioning in ServiceNow, then use Terraform to automatically fire off the work.  Georgetown looking at building self-service in Ansible Tower.

Research computing professionals will need to be expert at pricing models.



CSG Spring 2015 – Research Computing Directions, part 1

The afternoon workshop is on research computing directions. Strategic drivers are big data (e.g. gene sequencing); collaborations; mobile compute; monetization.

Issues: software-defined everything enables you to do things cheaper; cloud/web scale IT drives pricing down; mobile devices = sensor nets + ubiquitous connectivity; GPUs/massive parallelism. Containerized and virtualized workloads and commodity computing allow moving analysis tools to the data. Interconnect science DMZs. Federations, security & distributed research tools.

Case Studies – where are we today?

Beth Ann Bergsmark, Georgetown: Ten years ago did very little to shape central IT to align with researchers. We always started conversations as security issues. Researchers started realizing that the complexity of what they were building needed IT support. Central IT has adopted research support – grew organization to build support across the fabric of the organization. Built a group for supporting regulatory compliance. Most research has moved into the data centers on premise. Putting staff on grants – staff from traditional operations areas. Fantastic career path, plus making them more competitive for grants. Understanding how to create partnerships. Regulatory compliance control complexity continuing to grow, but research management software is also maturing. Thinking about integration of those apps. Research computing is driving future planning – networking, storage (including curation), compute. Research driving need for hybrid cloud architecture. Researchers will go where the opportunities and data are. Watching open data center initiative closely – AWS hosting public federal data. Portability becomes key. PIs and researchers move – on premise that’s hard. In cloud it should be easier. Need to build for portability. New funding models for responding to the life cycle of research.

Charles Antonelli – Michigan: Has been doing research support in the units since 1977. No central support for research computing except for 1968 era time sharing service. In 2008 there was a flurry of clusters built on campus in various units that had HPC needs. One of those was Engineering’s. In 2009-10 first central large cluster was born. Been growing since that time. Flux cluster: 18k cores with ~2500 accounts. Primary vehicle for on campus support of large scale research computing. Does not yet support sensitive data because it speaks NFS v3. That will be fixed with a new research file system. Cluster around 70% busy most of the time. Central IT does not provide much consulting help for users. There is an HPC consulting service currently staffed by 20% of one person. Have been looking at the cloud. Hard to understand how to use licensed software in the cloud. Have been using Globus Connect for a long time. Hooking up group stuff to the Globus endpoints.

Charley Kneifel – Duke: Duke Shared Cluster Resource – Monolithic cluster, Sun grid engine scheduler, solid scientific support staff, problematic financial model, breakaway/splintered clusters spun up by faculty. New provost and Vice Provost for research. Active faculty dissatisfaction, new director of research computing in IT. Now: Duke Compute Cluster, SLURM job scheduler, reinvigorated financial model – cover 80% of need on campus with no annual fees for housing nodes. Moving capex to opex. Faculty who’ve built their own clusters are now interested in collaborating. Going towards: Flexible compute cluster with multiple OS, virtualized and secure. Additional computing servers/services: specialized services such as GPU clusters or large memory machines. Flexible storage – long term, scratch, high performance SSD. Flexible networking: 10GB minimum, 40G+ interswitch connections; 20G+ storage connections; SDN services. Challenges: History, wall between health system and university. How to get there? Allocations/vouchers from middle; early engagement with researchers; matching grants; SDN services; cooperation with colleges/departments; support for protected network researchers; Outreach/training – docker days, meetings with faculty. Requires DevOps – automation, work flow support, hadoop on demand, GUI for researcher to link things together. Carrots such as subsidized storage, GPUs, large memory servers. Cut-n-pastable documents suitable for grant submissions; flexibility; removal of old hardware.

Tom Lewis, Chance Reschke, Washington: Conversations with research leaders in 2007-8. 50+ central IT staff involved, 127 researchers interviewed, selected according to: number and dollar amount of current grants relative to others; awards and recognitions. Learn about future directions of research and roles of technology. IT & data management expertise, data management infrastructure, computing power, communication & collaboration tools, data analysis & collection assistance. That equals cyberinfrastructure. By 2005 data centers were overwhelmed. In 2005 data science discussions began. By 2007 VP of research convened forums to discuss solutions, by 2010 rolled out first set of services. Why? Competitiveness & CO2 – faculty recruitment & retention, data center space crisis, climate action plan, scaling problem. Fill the gap: Speed of science/thought – faculty wanted access to large scale, responsive, supportable computing as they exceed the capacity of departments. A set want to run at huge scale – prep for petascale. Need big data pipelines to instruments. Data privacy for “cloudy” workloads. Who’s doing it? UW-IT doing most of it through service delivery, mostly through cost recovery. Libraries work on data curation, the eScience institute works on big data research. First investment was to build scale for the large researchers who were asking the Provost. Built credibility, and now getting new users. Faculty pay for the blade, which is kept for 4 years. Just added half an FTE data scientist for consulting.

UCSD – UCSD has 25% of all research funding for all of UC. Most research computing is at San Diego Supercomputer Center. Two ways to access – XSEDE (90% of activity). Users get programming support and assistance. There are champions around campus to help. Triton Shared Computing Cluster – recharge HPC. Can buy cycles or buy into the condo. 70% of overall funding comes from research side, rest comes from campus or UC system. Integrated Digital Infrastructure is a new initiative started by Larry Smarr: SDSC, Qualcomm Institute, PRISM @ UCSD, Library, Academic Computing, Calit2. Research data library for long term data curation is part of that initiative.

[CSG Spring 2010] Storing Data Forever

Serge from Princeton is talking about storing data. There’s a piece by MacKenzie Smith called Managing Research Data 101.

What do we mean by data? What about transcribing obsolete formats? Lot of metadata issues. Lots of issues.

What is “forever”? Serge thinks we’re talking about storing for as long as we possibly can, which can’t be precisely defined.

Why store data forever?
– because we have to – funding agencies want data “sharing” plans – e.g. NIH data sharing policy (2003). NIH says that applicants may request funds for data sharing and archiving.
Science Insider May 5 – Ed Seidel says NSF will require applicants to submit a data management plan. That could include saying “we will not retain data”.

– Because we need to encourage honesty – e.g. did Mendel cheat?
– Like open source, open data helps uncover mistakes or bugs.
– Open data and access movement – what about research data?

Michael Pickett asks who owns the data? At Brown, the institution claims to own the data.

Cliff Lynch notes that most of the time the data is not copyrightable, so that “ownership” comes down to “possession”.

There’s a great deal of variation by branch of science on what the release schedules look like – planetary research scientists get a couple of years to work their data before releasing to others, whereas in genomics the model is to pump out the data almost every night.

Current storage models
– Let someone else do it
– Government agency/lab/bureau e.g. NASA, NOAA
– Professional society

Dryad is an interesting model – if you publish in a given journal you can deposit your data there. That’s like GenBank.

DuraSpace wants to promote a cloud storage model based on DSpace and Fedora.

There are a number of data repositories that are government sponsored that started in universities.

Shel says that researchers will be putting data in the cloud as part of the research process, but where does it migrate to?

Serge’s proposal – Pay once, store endlessly (Terry notes that it’s also called a ponzi scheme).

Total cost of storage:
I = initial cost
d = rate at which storage costs decrease yearly, expressed as a fraction
r = how often, in years, storage is replaced
T = cost to store data forever

T = I + (1-d)^r I + (1-d)^{2r} I + …

If d = 20% and r = 4, the series sums to about 1.7 * I, so T = I * 2 covers it with room to spare.

If you charge twice the cost of initial storage, you can store the data forever.
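
The series above is geometric with ratio (1-d)^r, so for 0 < d < 1 it converges to I / (1 - (1-d)^r). The short Python sketch below checks that arithmetic; the function name `forever_cost` and its `generations` parameter are illustrative only and not part of Serge's proposal.

```python
def forever_cost(initial, d, r, generations=None):
    """Cost of storing data forever under the pay-once model.

    initial: cost of the first round of storage (I)
    d: yearly fractional decrease in storage cost
    r: years between storage replacements
    generations: if given, sum only that many purchases instead of the limit
    """
    ratio = (1 - d) ** r  # price of each replacement relative to the previous one
    if generations is None:
        return initial / (1 - ratio)  # closed form of the geometric series
    return sum(initial * ratio ** k for k in range(generations))


# The example from the talk: costs fall 20% a year, hardware replaced every 4 years.
multiplier = forever_cost(1.0, d=0.20, r=4)
```

With d = 0.20 and r = 4 the multiplier comes out near 1.69, comfortably under the factor of two quoted in the session. The margin shrinks quickly if prices fall more slowly: at d = 0.10 the same calculation gives roughly 2.9.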

They’re trying to implement this model at Princeton, calling it DataSpace.

People costs (calculated per gigabyte managed) also go down over time.

Cliff – there was a task force funded by NSF, Mellon, and JISC on sustainable models for digital preservation – http://brtf.sdsc.edu

[CSG Spring 2010] Staffing for Research Computing

Greg Anderson from Chicago is talking about funding staff for research computing.

Most people in the room raise their hand when asked if they dedicate staff to research computing on campus.

At Illinois they have 175 people in NCSA, but it doesn’t report to CIO.

Shel notes that employees have gotten stretched into doing lots of other things besides just providing research support. They’re trying to rein that back in in their career classification structures by requiring people to classify themselves. Now there’s 300 generalists classified as such.

At Princeton they’ve started a group of scientific sysadmins. The central folks are starting to help with technical supervision, creating some coherence across units. At Berkeley the central organization buys some time from some of the technical groups to make sure that they’re available to work with the central organization. Groups don’t get any design or consultation help unless they agree to put their computers in the data center.

At Columbia they have a central IT employee who works in the new center for (social sciences?) research computing – it’s a new model.

Greg asks how people know what the ratio of staff to research computing support should be and how do they make the case?

Shel asks whether anybody has surveyed grad students and postdocs about the sysadmin work they’re pressed into doing. He thinks that they’re seeing that work as more tangential to their research than they did a few years back.

Dave Lambert is talking about how the skill set for sysadmin has gotten sufficiently complex that the grad student or postdoc can’t hope to be successful at it. He cites the example of finding lots of insecure Oracle databases in research groups.

Klara asks why we always put funding at the start of the discussion of research support? Dave says it’s because of the funding model for research at our institutions. The domain scientists see any investment in this space by NSF as competing directly with the research funding. We need to think about how we build the political process to help lead on these issues.

[CSG Spring 2010] Research Computing Funding Issues

Alan Crosswell from Columbia kicks off the workshop on Research Computing Funding Issues. The goals of the session are: what works, what are best practices, what are barriers or enablers for best practices?

– Grants Jargon 101 – Alan
– Funding Infrastructure, primarily data centers- Alan
– Funding servers and storage – Curt
– Funding staff – Greg
– Funding storage and archival life cycle – Serge and Raj
– Summary and reports from related initiatives – Raj

Grants Jargon
– A21: Principles for determining costs applicable to grants, contracts and other agreements with educational institutions. What are allowed and unallowed costs.
– you can’t charge people different rates for the same service.
– direct costs – personnel, equipment, supplies, travel, consultants, tuition, central computer charges, core facility charges
– indirect costs a/k/a Facilities and Admin (F&A) – overhead costs such as heat, administrative salaries, etc.
– negotiated with federal government. Columbia’s rate is 61%. PIs see this as wasted money.
– modified direct costs – subtractions include equipment, participant support, GRA tuition, alteration or renovation, subcontracts > $25k.
Faculty want to know why everything they need isn’t included in the indirect cost. Faculty want to know why they can buy servers without paying overhead, but if they buy services from central IT they pay the overhead. Shel notes that CPU or storage as a service is the only logical direction, but how do we do that cost effectively under A21? Dave Lambert says that they negotiated a new agreement with HHS for their new data center. Dave Gift says that at Michigan State they let researchers buy nodes in a condo model, but some think that’s inefficient and not a good model for the future.
Alan asks whether other core shared facilities like gene sequencers are subject to indirect costs.
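
The faculty question about servers versus services follows directly from how the modified direct cost base is built. The Python sketch below works through it with made-up budgets; the category names, the $30k figure and the treatment of subcontracts (only the first $25k of each one carries overhead) are assumptions for illustration, and only the 61% rate comes from the notes.

```python
SUBCONTRACT_CAP = 25_000  # assumed: only the first $25k of each subcontract carries overhead

# Categories the notes list as subtracted from the modified direct cost base.
EXCLUDED = {"equipment", "participant_support", "gra_tuition", "renovation"}


def modified_direct_costs(direct_costs, subcontracts=()):
    """Sum direct costs, leaving out excluded categories and subcontract excess."""
    base = sum(amount for category, amount in direct_costs.items()
               if category not in EXCLUDED)
    base += sum(min(amount, SUBCONTRACT_CAP) for amount in subcontracts)
    return base


def total_charged(direct_costs, subcontracts=(), rate=0.61):
    """Return (indirect, total) for a budget at the given F&A rate."""
    indirect = rate * modified_direct_costs(direct_costs, subcontracts)
    total = sum(direct_costs.values()) + sum(subcontracts) + indirect
    return indirect, total


# Hypothetical budgets: the same $30k of computing bought two different ways.
buy_server = {"personnel": 100_000, "equipment": 30_000}
buy_service = {"personnel": 100_000, "central_computer_charges": 30_000}

server_indirect, server_total = total_charged(buy_server)
service_indirect, service_total = total_charged(buy_service)
```

Under these assumptions the grant is charged about $18,300 more when the computing is bought as a central service, which is the disincentive Shel's comment about CPU or storage as a service has to overcome.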

Campus Data Center Models
– Institutional core research facility – a number that grew out of former NSF supercomputer centers.
– Departmental closet clusters – sucking up lots of electricity that gets tossed back into the overhead.
– Shared data centers between administration and research – Columbia got some stimulus funding for some renovation around NIH research facilities.
– Multi-institution facilities (e.g. RENCI in North Carolina, recent announcement in Massachusetts)
– Cloud – faculty go out with credit card and buy cycles on Amazon
– Funding spans the gamut from fully institutionally funded to fully grant funded.

Funding pre-workshop survey results
– 19 of 22 have centrally run research data centers, mostly (15) centrally funded (9 counts of charge-back, 3 counts of grant funding)
– 18 of 22 respondents have departmentally run research data centers, mostly (14 counts) departmentally funded (3 counts of using charge back, 4 counts of grant funding)
– 14 have inventoried their research data centers
– 10 have gathered systematic data on research computing needs

Dave Lambert – had to create a cost allocation structure for the data centers for the rest of the institution to match what they charge grants for research use, in order to satisfy A21’s requirement to not charge different rates.

Kitty – as universities start revealing the costs of electricity to faculty, people will be encouraged to join the central facility. Dave notes that security often provides another incentive for people because of the visibility of incidents. At Georgetown they now have security office (in IT) review of research grants.

Curt Hillegas from Princeton is talking about Server and Short to Mid-Term Storage Funding
– talking about working storage, not long-term archival storage
– some funding has to kick-start the process – either an individual faculty member or central funding. Gary Chapman notes that there’s an argument to be made for central interim funding to keep the resources going between grant cycles.

Bernard says that at Minnesota they’ve done a server inventory and found that servers are located in 225 rooms in 150 different buildings, but only 15% of those are devoted to research. Sally Jackson thinks the same is approximately true at Illinois. At Princeton about 50% of computing is research, and that’s expected to grow.

Stanford is looking at providing their core image as an Amazon Machine Image.

At UC Berkeley they have three supported computational models available and they fund design consulting with PIs before the grant.

Cornell has a fee-for-service model that is starting to work well. At Princeton that has never worked.

Life Cycle management – you gotta kill the thing, to make room for the new. Terry says we need a “cash for computer clunkers” program. You need to offer transition help for researchers.