Higher Ed Cloud Forum: Getting Harvard’s Enterprise Systems to the Cloud

Ben Rota, Harvard

How a crisis optimized our organizational structure

Phase One Org Structures (as seen retroactively):

February 2015 – Trying to change culture as well as technology. People had expectations that were impossible to meet. Original cloud team was drawn from infrastructure and IdM – no developers or applications people.

May 2015 – Restructured to focus on migrating a single application and supporting Public Affairs.  Successful migration, but had tension between operational work and further migrations.

June 2015 – split into multiple, smaller scrum teams to support more simultaneous projects. Lacked cohesiveness, plus operations were still killing them.

September – December 2015 – Team was demoralized. Operational work continued to be a problem – a pile of cleanup work. No ability to reduce tech debt.

December 2015 – PeopleSoft group decided that 9.2 upgrade would be done in the cloud. Cloud team didn’t have enough resources available to help, but their consultant could help.

June 2016 – PeopleSoft team realized the consultant wasn’t working out. Cloud program put 0.5 FTE on the project.

December 2016 — PeopleSoft migration at significant risk – a migration team was created to respond to the impending crisis.

Application Portfolio Teams – co-located, cross-functional groups for portfolio migration projects. How’s it going? Migrations are accelerating. PeopleSoft, Oracle Financials, and Grants Management are migrated; the Alumni Affairs and Development system is close behind. Troubleshooting migration problems has gotten easier – co-location smooths communication. Shared goals break down silos.

Organizing work around HUIT-managed applications runs the risk of neglecting the “long tail” of smaller applications and systems. Too many product owners in the kitchen – how do you prioritize work when you have competing interests? Operational work vs. migration work is still a challenge, but now it’s more about prioritization within a team rather than across silos. DevOps still has too many definitions! Held a day-long workshop, open to all of HUIT, with a facilitated discussion about what people hope to get out of this effort.

Higher Ed Cloud Forum – Tools for predicting future conditions: weather & climate

Toby Ault, Marty Sullivan – Cornell

Numerical models for weather and climate science. Most of the fluid dynamics in weather and climate models consists of physical equations; the solvers, called “dynamical cores,” describe the flow of fluid in 3D space.

A continuum of scales needs to be accommodated – done through parameterization. Want to be able to sample as many parameterization schemes as possible.

Interested in intermediate time scales (weeks to months) that have been difficult to model. There’s uncertainty arising from different models, so having multiple models with multiple parameterizations that get averaged together with machine learning can have huge payoffs in accuracy.
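A minimal sketch of the multi-model averaging idea (not the presenters’ code, and the data here is made up): given forecasts from a few ensemble members and matching observations, a simple regression learns the weights used to blend the members into one calibrated prediction.

```python
# Illustrative only: combine several model/parameterization ensemble members
# by learning blend weights from past forecasts vs. observations.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: rows = past forecast dates, columns = ensemble members.
member_forecasts = np.array([
    [21.0, 22.5, 20.1],
    [18.3, 19.0, 17.8],
    [25.2, 24.1, 26.0],
    [15.5, 16.2, 14.9],
])
observed = np.array([21.8, 18.1, 25.5, 15.0])  # what actually happened

# Fit a weighted average of members (a simple "super-ensemble" regression).
blend = LinearRegression().fit(member_forecasts, observed)

# Combine a new set of member forecasts into a single prediction.
new_members = np.array([[19.4, 20.0, 18.7]])
print(blend.predict(new_members))
```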

Are the most useful variables for stakeholders the most predictive?

Weather simulation in the cloud:

Infrastructure as code makes things much easier. Able to containerize the models (which include lots of old code), so people don’t need to know all the nuts and bolts. Using lots of serverless – makes running web interfaces extremely easy.

Democratization of science – offered through a web interface. People can configure and run models for their locations.

Lots of orchestration behind the scenes: Deploying with CloudFormation, using ECS, etc.
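A rough sketch of that style of orchestration, under assumed names (the stack name, template URL, cluster, and task definition are hypothetical, not Cornell’s actual setup): stand up the infrastructure from a CloudFormation template, then run a containerized model as an ECS task with boto3.

```python
# Sketch: deploy a CloudFormation stack, then launch one containerized
# model run as an ECS task. All resource names are placeholders.
import boto3

cfn = boto3.client("cloudformation")
ecs = boto3.client("ecs")

# Stand up the stack that defines the cluster, roles, and networking.
cfn.create_stack(
    StackName="wrf-demo-stack",                     # hypothetical stack name
    TemplateURL="https://example-bucket.s3.amazonaws.com/wrf-stack.yaml",
    Capabilities=["CAPABILITY_NAMED_IAM"],
)
cfn.get_waiter("stack_create_complete").wait(StackName="wrf-demo-stack")

# Kick off one containerized model run on the cluster.
ecs.run_task(
    cluster="wrf-demo-cluster",                     # hypothetical cluster name
    taskDefinition="wrf-model:1",                   # hypothetical task definition
    count=1,
    overrides={"containerOverrides": [
        {"name": "wrf", "environment": [{"name": "FORECAST_HOURS", "value": "48"}]}
    ]},
)
```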

Higher Ed Cloud Forum: Desktop as a Service – Moonshot to production in 6 months

Deanna Mayer, Brady Phipps — University of Maryland University College (UMUC)

Primarily online programs: 90+ programs and specializations, 80k students worldwide, 140+ classroom and service locations in 20 countries. Heavily into IT outsourcing – started with a VDI vendor, but they couldn’t scale. Needed non-device-specific VDI that didn’t require an install.

Student requirements: fully integrated, one-click classroom experience; access across program, not limited to single course; secure environment providing immersive experience; ability to scale; single sign-on; rich metrics and analytics. Huge spikes in usage on Sunday nights before assignments were due.

January – April 2016: ran an RFP. No vendor met all requirements; most vendors focused on a single image across a corporation. Partnered with Amazon in April; project approved in June. Flew the local solutions architect to Seattle to sit with AWS side-by-side for three weeks. A ten-person project team in a room focused on the problem, due by October. Initial launch to 400 students in August. Cut the cord with the legacy vendor in May – moved over 60 courses. Now have over 10k students using it: 22.5 hours/month average usage, 25% drop in student support requests.

Launched AccelerEd, a new company with Aloft, a cloud services unit.

Higher Ed Cloud Forum: When can a computer improve your social skills?

Ehsan Hoque (University of Rochester)

Behavior mining -> Applications -> Deployment

Automated Prediction of Interview Performance -> My Automated Conversation Coach (MACH) -> ROCSpeak.com

MACH – My Automated Conversation coacH — originated from people with Asperger’s wanting help developing conversational skills.

Originally a research application, got a grant from Azure to develop a cloud version. As people use the framework, the data gets fed back into the model, which improves the performance.

In the end, it’s not the specific cloud functionality but the interaction with the people at the vendor that makes things work.

Higher Ed Cloud Forum: Epidemic Modeling in The Cloud: Projecting the Spread of Zika Virus

Matteo Chinazzi (Northeastern University)

MOBS lab — part of Network Science Institute at Northeastern, modeling contagion processes in structured populations, developing predictive computational tools for analysis of spatial spread of emerging diseases.

Heterogeneous interdisciplinary research group – physicists, economists, computer scientists, biologists, etc.

GLEAM – Global epidemic and mobility model – integrates different data layers – spatial, mobility, population data. For Zika, had to introduce mosquito data, temperature data, and economic data (living conditions).

Practical challenges:

  • unknown time and place of introduction of Zika in Brazil (Latin square sampling + long simulations (4+ years))
  • Parameters need to be calibrated and estimated: prediction errors add stochasticity at runtime.
  • Intrinsic stochasticity in the epidemic and traveling dynamics
  • Need quick iterations between different code implementations

Each simulation takes 6-7 minutes, and > 200k simulations are needed. Each scenario generates about 25TB of data, needed within a day. Tried on-premises, but there weren’t enough compute cores, resources were shared and bursty, and there was no reliable solution to analyze data at scale.

Migration to GCP – prompt replies and assistance from customer support (“your crazy quota increase request has been approved”)

Compute Engine – ability to scale in terms of compute cores – up to 30k cores consumed simultaneously. Can keep data without saturating on-prem NFS partitions. BigQuery – ability to scale in terms of data processing. In < 1 day they can run simulations and analyze outputs.

Workflow steps: custom OS images for each version of the model; startup scripts to initialize model parameters, execute runs, perform post-processing, and move results to a bucket; a Python script to launch VMs, check logs, run analysis on BigQuery, export data tables to a bucket, and download selected tables to the local cluster. Other scripts create PDFs with the simulation results.
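An illustrative sketch of that workflow shape (the project, image, bucket, and table names are made up, not the MOBS lab’s): launch a preemptible Compute Engine VM whose startup script runs the model and copies results to a bucket, then aggregate the loaded output with a BigQuery query.

```python
# Sketch only: launch a preemptible Compute Engine VM with a startup script,
# then query simulation output previously loaded into BigQuery.
import googleapiclient.discovery
from google.cloud import bigquery

PROJECT, ZONE = "epi-sim-project", "us-central1-b"   # placeholder project/zone

compute = googleapiclient.discovery.build("compute", "v1")
instance_body = {
    "name": "gleam-run-0001",
    "machineType": f"zones/{ZONE}/machineTypes/n1-highcpu-16",
    "scheduling": {"preemptible": True},              # ~1/5 the on-demand price
    "disks": [{
        "boot": True,
        "initializeParams": {
            "sourceImage": f"projects/{PROJECT}/global/images/gleam-model-image"
        },
    }],
    "networkInterfaces": [{"network": "global/networks/default",
                           "accessConfigs": [{"type": "ONE_TO_ONE_NAT"}]}],
    "metadata": {"items": [{
        "key": "startup-script",
        # The startup script sets parameters, runs the model, and copies
        # results to a Cloud Storage bucket before the VM goes away.
        "value": "#!/bin/bash\n/opt/gleam/run.sh && gsutil cp /out/*.csv gs://gleam-output/",
    }]},
}
compute.instances().insert(project=PROJECT, zone=ZONE, body=instance_body).execute()

# Later: aggregate results that a loading job has placed in BigQuery.
bq = bigquery.Client(project=PROJECT)
query = """
    SELECT week, AVG(new_infections) AS mean_infections
    FROM `epi-sim-project.zika.simulation_runs`
    GROUP BY week ORDER BY week
"""
for row in bq.query(query).result():
    print(row.week, row.mean_infections)
```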

Numbers: 750k+ instances launched, 300 TB of data analyzed, 10M+ global epidemics simulated, 110+ compute-years used.

Lessons learned: use preemptible VM instances (~1/5 of the price, predictable failure rate); use custom machine types; run concurrent loading jobs on BigQuery; use the Google Cloud Client Library for Python – from simulations to outputs with no human intervention; be aware of API rate limits.

Higher Ed Cloud Forum: Adventures in Cloudy High-Performance Computing

Gavin Burris – Wharton School

HPC – computing resources characterized by many nodes, many cores, lots of RAM, high-speed low-latency networks, and large data stores.

XSEDE — it’s free (if you’re funded by a national agency).

cloud – more consoles, more code, less hardware

Using Ansible to configure cloud resources the same as on-premises; Python to deploy EC2 clusters; and CfnCluster (CloudFormation cluster) to build and manage HPC clusters.

Univa UniCloud enables cloud resources to integrate with Univa scheduler.

Use case: C++ simulation modeling code needed 500 iterations, each taking 3-4 days. Used MIT StarCluster with spot bids. For $500, finished the job in 4 days.

Use case: Where are the GPUs? Nobody was using them – different toolkits and code were needed to take advantage of them – so GPUs were dropped in the hardware refresh. UniCloud is used to launch cloud instances with GPUs instead.

“Cloud can accommodate outliers” — GPUs, large memory. A la carte to the researcher based on tagged billing. Policy-based launching of cloud instances.

Seamless transition – VPC VPN link provided by central networking, AWS looks like another server room subnet. Consistent configuration management with the same Ansible playbooks. Cloud mandate by 2020 for Wharton – getting rid of server rooms to reclaim space.

They’re doing NFS over the VPN – getting great throughput.

Cost comparison – HPCC local hardware $328k vs. AWS $294k for the flop-equivalent.

Spotinst – manages preemption and moves workloads to available instances.

 

Cloud Compute Services Expansion – Lessons Learned

Mark Personett – University of Michigan

A project to enable all three campuses at Michigan to access cloud infrastructure with AWS, Azure, and Google.

What it is: enterprise agreement, shortcode billing, training, consulting, preconfigured security/network settings, Shibboleth integration, reporting. What it’s not: cloud strategy, governance, or operations.

Lessons learned:

BAA doesn’t cover every service. BAA is just a legal document. Account and billing differences.

AWS at U-M: the BAA is separate from the EA, and there’s a separate process to add units to the BAA. Single sign-on is not as integrated. No inherent account hierarchy.

GCP: billing accounts and “projects” separate concepts. Billing sub-accounts. GCP is API and API is GCP. API explorer is extremely helpful in writing API calls.

Azure: The distinction between resource groups and subscriptions is not always clear (finding that they need a subscription for each resource group in the general case). Office 365 challenges – if alumni get synced to your Azure AD, they get rolled into your instance and under your terms. VPN – there are tiers of VPN gateways; if you exceed the bandwidth limit, it resets your tunnel with no warning.

 

Higher Ed Cloud Forum: Beyond the Architecture — Rethinking Responsibilities

Glenn Blackler (UC Santa Cruz)

Cloud-First! Now What…?

Santa Cruz’s approach – hardware infrastructure was going to turn into a pumpkin in spring 2018. “Screw it – we’re all in, let’s jump.”

What’s our approach? How can existing teams support this change? Program work vs. migration specific work. Our focus – enterprise applications.

Defining the program: Plan for a quick win (build confidence, get familiar, identify training needs). Go big – went from a small PHP app to identity management infrastructure. All in! — moved PeopleSoft and Banner. Run concurrent migrations.

But really… why? Need to continually talk to customers about why they’re doing it. The benefits of cloud migration aren’t apparent – you have to sell it. The pitch: elasticity; DR/BR; accommodation (additional test environments); modernized tools and team structures; sustainability.

Teams – Separation of duties – today there is separation between sysadmins, app admins, and developers; it has always been a handoff-based, ticket-driven organization. They don’t know what the org looks like in the new world – they took really smart people, threw them in a room, and told them to figure it out. The core team includes app and sys admins, plus less frequent contributions from security, DBAs, networking, and devs.

Looking at Cloud Engineering Team that incorporates OS Setup/Config/App Config/Maintenance. DBA team still a bit separate. Security contributing across the board, but not necessarily hands on all the time. Teams are learning new things about each other that they didn’t know in the ticket-driven world.

Future – shared responsibilities mean fewer handoffs; engineers with wider breadth of skills; improved cross-team collaboration through shared code base; continuous improvement through evolving technical design and available services; adjusted job titles and responsibilities; ITS reorganization; budget impact, review of recharge model.

New ways of collaborating: sys and app admins using a single Git repository for code; shared tools/technologies and password management; cross-functional tier 1 support.

Lessons learned – don’t lock decisions down too early; use governance to end debates; identify project goals that foster exploration (within the timeline); use consultants carefully. Traditional PM will not work; push the boundaries of what is possible; required vs. ideal – compromise is important; don’t compare with mature on-premises architecture; be prepared for rumors.

Not everyone is on the bus – what about those who don’t want to get on?

Higher Ed Cloud Forum – Lightning Round #1

Phil Robinson – Cloud Progress at Cornell Student Services IT

First AWS account – July 2015 – adopted a cloud first strategy. Now have about 30 apps on AWS (migrations, rewrites, new apps). Automate with Jenkins and Ansible. Retiring on-prem VMs.

Custom class-roster app, used by students to decide what to take. Added a central syllabi feature this year. Using SNS+SQS as a message bus to orchestrate events; CloudFront delivery for syllabi; on-the-fly ClamAV scans on upload; Elasticsearch for search; SES for notifications by email. Developed in 3632 hours.
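A hedged sketch of the SNS+SQS message-bus pattern described above (the topic ARN, queue URL, and event fields are placeholders, not Cornell’s): the upload handler publishes an event to SNS, and a worker polls the subscribed SQS queue to drive the scan and indexing steps.

```python
# Sketch of an SNS topic fanning out to an SQS queue. All names are placeholders.
import json
import boto3

sns = boto3.client("sns")
sqs = boto3.client("sqs")

# Producer side: the upload handler announces the new object.
sns.publish(
    TopicArn="arn:aws:sns:us-east-1:123456789012:roster-events",   # placeholder
    Message=json.dumps({"event": "syllabus_uploaded",
                        "bucket": "roster-syllabi", "key": "cs1110/fall.pdf"}),
)

# Consumer side: a worker drains the queue subscribed to that topic.
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/syllabus-scan"  # placeholder
resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=5, WaitTimeSeconds=10)
for msg in resp.get("Messages", []):
    body = json.loads(msg["Body"])       # SNS envelope (raw delivery disabled)
    event = json.loads(body["Message"])  # original event payload
    print("would scan", event["bucket"], event["key"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```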

Looking towards containerizing and VDI.

Gerard Schockley – BU iPaaS ODS on AWS RDS

iPaaS ODS in RDS – an integration service designed to bring many data feeds into the SnapLogic platform. ODS = Operational Data Store. Using AWS Aurora.

Bob Winding – Cloud Automation Journey

Most fully automated in a GovCloud project. CloudFormation (VPCs, IAM, security groups, centralized alerts); Ansible and CloudFormation for server builds; console federation with ADFS; consistent process for all project accounts; a new project account in a couple of hours; decentralized maintenance of CloudFormation templates.

Penn –

What does “cloud native” mean at Penn?

Case study 1 – online giving portal: data ETL (Talend) to Postgres RDS (fundraising metadata); S3/CloudFront; to Oracle on-prem. Near real-time.

Case study 2 – service ordering (VDI and backup requests): on-prem PowerShell makes changes in AD groups and sends messages through SQS.

Case study 3 – device registration: on-prem registration; API keys handled in Lambda.

Sara Jeanes – Considerations in moving HPC workloads to the cloud

Initial framing questions: Do they have a preference for which cloud provider (do they have credits, different tech); Is there a multi-cloud resiliency need?

Workload questions: Can it be interrupted (use spot instances)? Large workloads have firewall considerations (Science DMZ).

Jeff Minelli – Penn State – CloudCheckr enabling transparency at Penn State

Gain insights into financial transparency, spend optimization, resource utilization and right-sizing, cost allocation, best practices, security & compliance, collection and unification of AWS API data, continuous monitoring, reporting and alerts

Working with CloudCheckr to enable SAML. Basic group email notifications. Configuration of $100 spending alerts.

Trying to get CloudCheckr into InCommon.

Network Firewall Policies for Hybrid Cloud – Brian Jemes – University of Idaho

In the cloud, they manage firewalls with server tags. It gets complicated when managing across on-prem and cloud. On-prem they have Cisco tools to manage ASA firewalls.

Options: manage hybrid cloud policy in on-prem firewall; manage hybrid policies with traditional firewalls in cloud; develop a hybrid tool.

Looking at a startup called Bracket Computing – cloud firewall policy manager. brkt.com – Provides micro-segmentation.

John Bailey – Washington University (St. Louis). Cloud IAM

Balance between security and usability. Enhancing usability with SPNEGO integrated auth: leverages the Kerberos token from machine login to perform a web SSO login, making the web login invisible to the customer.

Lou Tiseo – How categorizing resources helps in understanding cloud usage

Requiring seven different tags. Using Cloudyn management dashboard. Helped save costs by using reserved instances.

Chris Malek – Caltech – Automation tools for AWS ECS and Batch

deployfish – configures almost all aspects of an ECS service (load balancing, app autoscaling, volumes, environment, etc.). They’ve open sourced it. Create, inspect, scale, update, destroy, and restart ECS services with single commands; manage multiple environments (test, qa, prod, etc.). Integrates directly with Terraform. YAML driven.

batchbeagle — lets people manage AWS Batch. Create, update, disable, and destroy queues. Create, update, disable, and destroy compute environments. Create job definitions. Submit and manage jobs, etc.

Amanda Tan – Washington

Enabling cost notifications on AWS. Cost monitoring is difficult – it should be zero effort. Two-pronged approach: auto-tag resources, and send a daily email notification with total spend and resource usage. A CloudFormation template sets up a CloudWatch Events rule that invokes the auto-tag Lambda function. AutoTag tags resources with owner and principal-id. Notification works off DLT billing records, provided in S3 buckets twice a day.
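A sketch of the auto-tag pattern described here (an assumed implementation, not the actual function): a Lambda triggered by a CloudWatch Events rule for the CloudTrail RunInstances call tags each new EC2 instance with its creator.

```python
# Sketch: tag newly launched EC2 instances with owner and principal-id,
# driven by the CloudTrail RunInstances event delivered via CloudWatch Events.
import boto3

ec2 = boto3.client("ec2")

def handler(event, context):
    detail = event["detail"]
    principal = detail["userIdentity"]["principalId"]
    owner = principal.split(":")[-1]   # e.g. the federated username portion

    # Instance IDs appear in the CloudTrail responseElements for RunInstances.
    instance_ids = [item["instanceId"]
                    for item in detail["responseElements"]["instancesSet"]["items"]]

    if instance_ids:
        ec2.create_tags(
            Resources=instance_ids,
            Tags=[{"Key": "owner", "Value": owner},
                  {"Key": "principal-id", "Value": principal}],
        )
    return {"tagged": instance_ids}
```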

 

 

Self Service at Yale

Rob Starr, Jose Andrade, Louis Tiseo (Yale University)

The community told them they needed to be able to spin machines up and down at will for classes, etc. Started with a big local OpenStack environment; now building it out on AWS.

Wanted to deliver agility, automate and simplify provisioning, shared resources, and support structures, and reduce on-premises data centers (one data center by July 2018).

Users can self-service request servers, etc. Spinup – CAS integration, patched regularly, AD, DNS, Networking, Approved security, custom images.

Self-service platform – current manual process takes (maybe) 5 days. With Self-Service, it takes 10 minutes. Offering: Compute, Storage, Databases, Platforms, DNS

All created in the same AWS account. All servers have private IP addresses.

ElasticSearch is the source of truth.

Users don’t get access to the AWS console, but can log into the machines.

Built initial iteration in 3 months with 3 people. Took about a year to build out the microservices environment with 3-4 people. Built on PHP Laravel.

Have a TryIt environment that’s free, with limits.

Have spun up 1854 services since starting, average life of server is 64 days.