Cloud Billing Challenges

Bob Flynn, Indiana University

Microsoft Azure – the challenges. Pluses – Account management; Identity management; Networking; Security management; Incident Response.

Minus – Billing. Have to make a pre-commitment for your enrollment ($100/month). Everything that happens at your campus later is on the same bill, and the enrollment owner pays it. The first user that burns through the $1200 gets it for free (unless you figure out a way to rebill). Ongoing usage – does central IT (or Procurement) have to do rebilling? How does the account holder track their usage? Azure Marketplace purchases are sent to the enrollment admin, not the one using them. There are also issues with research and education credits. The solution? Started with Resource Groups and tags, but those were limited to 15 tags per resource group, and not all Azure tools are resource-group ready. Notifications go to the subscription owner. Started looking at allowing users to have their own subscription. VNet peering allows you to centrally manage the campus network connection. PO number added to the subscription name. Bell Techlogix is pulling the PO # via API – they're building a portal for account owners and setting alerts at PO thresholds.
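
A minimal sketch of the tagging approach, assuming the azure-identity and azure-mgmt-resource Python packages; the subscription ID, group name, and tag keys are hypothetical, not Indiana's actual scheme:

```python
# Sketch: tagging an Azure resource group for chargeback. Azure capped
# resource groups at 15 tags at the time of this talk. All names and the
# subscription ID below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"  # placeholder

client = ResourceManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Tag the group with the department and PO number so usage can be
# attributed on the consolidated enrollment bill.
client.resource_groups.create_or_update(
    "biology-research-rg",
    {
        "location": "eastus",
        "tags": {"department": "biology", "po_number": "PO-12345"},
    },
)
```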

Nicole Rawleigh, Cornell University

Have 65 accounts under AWS billing. In August 2015 they manually billed four financial accounts; by September 2016 they billed 45 financial accounts for 65 AWS accounts. Separating internal CIT costs from external units. Switched to creating multiple financial-system edocs manually. There is one consolidated bill, but there can also be multiple other bills/credits, and credits are applied to accounts manually. Going to automation! An API between AWS and Kuali Financial; a batch job runs once a month. Outstanding challenges: invoice attachment (they use CloudCheckr so users can see invoice charges); making sure that resources are correctly tagged; one financial edoc per financial-system account, not per AWS account; the batch error report is hard to deal with; automation covers the consolidated bill only.
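
A hedged sketch of the monthly batch idea – total the consolidated billing report per linked account so each becomes one financial edoc. Bucket, key, and column names are hypothetical; Cornell's real job talks to the Kuali Financial API:

```python
# Pull the AWS consolidated billing CSV from S3 and total charges per
# linked account. Column names vary by report format; these are assumed.
import csv
import io
from collections import defaultdict

import boto3

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="billing-reports", Key="2016-09-consolidated.csv")
rows = csv.DictReader(io.StringIO(obj["Body"].read().decode("utf-8")))

totals = defaultdict(float)
for row in rows:
    totals[row["LinkedAccountId"]] += float(row["TotalCost"] or 0)

for account_id, amount in sorted(totals.items()):
    print(f"AWS account {account_id}: ${amount:,.2f}")  # one edoc line each
```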

Erik Lundberg, University of Washington

Using DLT / Net+ for AWS. DLT provides a great billing front-end. Individual AWS accounts are associated with separate POs, and they get invoiced and paid directly. People can create a blanket PO on their university budget. Invoicing is all electronic and automatic (through Ariba). Next steps – get AWS Educate and research grants covered under the DLT contract.

 

Cloud Forum 2016 – Research In The Cloud

Daniel Fink from Cornell – Computational Ecology and Conservation using Microsoft Azure to draw insights from citizen science data.

Statistician by training. Works with citizen science and crowd-sourced data.

Lab of Ornithology: Mission – to interpret and conserve the earth’s biological diversity through research, education, and citizen science focused on birds.

Why birds? They are living dinosaurs! >10k species in all environments. Very adaptable and intelligent. Sensitive environmental indicators. Indian Vulture – 30 million in 1990, virtually extinct today. The most easily observed, counted, and studied of all widespread animal groups.

eBird. Global bird monitoring project – citizen science for people to report what they see and interact with the data. 300k people have participated, and it's still experiencing huge growth.

Taking the observation data and turning it into scientific information. Understanding distribution, abundance, and movements of organisms.

Data visualizations: http://ebird.org/content/ebird/occurrence/

Data – want to know every observation everywhere, with very fine geographic resolution. Computationally fill gaps in observations, and reduce noise and bias in data using models.

Species distribution modeling has become a big thing in ecology. Link populations and environment – learn where species are seen more often or not. Link eBird data with remote-sensing (satellite) data. Machine learning can build the models. Scaling to continental extents presents problems: species can use completely different sets of habitats in different places, making it hard to assemble broad models.

SpatioTemporal Exploratory Model (STEM) – divide (partition the extent into regions), train & predict models within regions, then recombine. Works well, but computationally expensive. On premise, one species in North America: fit 10-30k models, 6k CPU hours, 420 hours wall clock (12 nodes, 144 CPUs). Can't scale – also dealing with a growing number of observations in eBird (30%/year) and moving to larger spatial extents.
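
A minimal sketch of the divide/train/recombine idea, not the lab's actual STEM code (which runs in R on Hadoop/Spark); data, grid sizes, and the model choice here are toy assumptions:

```python
# Partition the spatial extent into overlapping regions, fit an
# independent model per region, then average predictions from every
# region covering a point.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(5000, 2))              # lon/lat-like coords
y = np.sin(X[:, 0] / 10) + rng.normal(0, 0.1, 5000)  # fake abundance signal

def region_bounds(cell=25, overlap=10):
    """Yield overlapping square regions tiling the 0-100 extent."""
    for x0 in range(0, 100, cell - overlap):
        for y0 in range(0, 100, cell - overlap):
            yield x0, y0, x0 + cell, y0 + cell

models = []
for x0, y0, x1, y1 in region_bounds():
    mask = (X[:, 0] >= x0) & (X[:, 0] < x1) & (X[:, 1] >= y0) & (X[:, 1] < y1)
    if mask.sum() < 50:
        continue  # skip data-poor regions
    m = RandomForestRegressor(n_estimators=20).fit(X[mask], y[mask])
    models.append(((x0, y0, x1, y1), m))

def predict(point):
    """Recombine: average every regional model covering the point."""
    preds = [m.predict([point])[0]
             for (x0, y0, x1, y1), m in models
             if x0 <= point[0] < x1 and y0 <= point[1] < y1]
    return float(np.mean(preds)) if preds else float("nan")

print(predict([42.0, 57.0]))
```

Each regional fit is independent, which is what makes the workload embarrassingly parallel and a good match for MapReduce/Spark.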

Cloud requirements: on-demand – reliably provision resources. Open-source software: Linux, Hadoop, R. Sufficient CPU & RAM to reduce wall-clock time. A system that can scale in the future. Started shifting the workload about 1.5 years ago. Using MapReduce and Spark has been key, but they aren't typical research computation tools.

In Azure: using HDInsight and Microsoft R Server – 5k CPU hours, 3 hours wall clock.

Applications – where populations are, when they are there, and what habitats they are in.

Participated in the State of North America's Birds 2016 study. Magnolia Warbler – wanted to summarize abundance in different seasons. The entire population concentrates in a single area in Central America in the winter that is a tenth the size of the breeding range – a risk factor. Then looked to see if the same is true of 21 other species. Still see immense concentration in forested areas of Central America – Yucatan, Guatemala, Honduras. First time there is information to quantify risk assessment. Looking at assessing for climate change and land use.

50 species of water birds use the Pacific Flyway. Concentration point is the California Central Valley, which historically had a huge amount of wetlands, but now has less than 5% of what there was. BirdReturns – a Nature Conservancy project for dynamic land conservation. Worked with rice growers in the Sacramento River Valley to determine what time of year will be most critical for those populations. The limiting factor is water cover on the land. There's an opportunity to ask farmers to add water to their paddies a little earlier in the spring and later in the fall, through cash incentives. Rice farmers submit bids; TNC selects bids based on abundance estimates (most birds per habitat per dollar). They've added 36k acres of habitat since 2014.

Quantifying habitat use. Populations use different habitats in different seasons; seeing a comprehensive picture of that is new and very interesting. Surprising observation of wood thrushes using cities as habitat during fall migration. Is it a fluke caused by observation bias? Is it common across multiple species?

Compare habitat use of resident species vs. migratory species. Looked at 20 neotropical migrants and 10 resident species. Found residents have pretty consistent habitat use, but migrants show seasonal differences, with a real association with urban areas in the fall. Two interpretations: 1) cities might provide important refuges for migrant species, or 2) cities are attracting species but are ecological traps without enough resources. Collaborators are setting up studies to find out. One hypothesis is that the birds are attracted to lights.

Heath Pardoe from NYU School of Medicine – Cloud-based neuroimaging data analysis in epilepsy using Amazon Web Services.

The Comprehensive Epilepsy Center at NYU is a tertiary center, after the local physician and local hospital. Epilepsy is the primary seizure disorder (defined as having two unprovoked seizures in a lifetime), with many different causes and outcomes. Figuring out the cause is a primary goal. There are medications and therapies; the only known cure is surgery, removing a part of the brain. MRI plays a very big role in pre-surgical assessment. A ketogenic diet is quite effective in reducing seizures in children. Implanted electrodes can be effective, zapping when a seizure is likely to control brain activity. Research is ongoing on the use of medical marijuana to treat seizures. Medication works well in 2/3 of people, but 1/3 will continue to have seizures. The first step is to take an MRI scan and find lesions; radiologists evaluate the scans to identify them.

Human Epilepsy Project – study people as soon as they're diagnosed with epilepsy to develop biomarkers for epilepsy outcomes, progression, and treatment response. Tracking patients 3-7 years after diagnosis. Imaging at diagnosis and at three years. Patients maintain a daily seizure diary on an iOS device. Genomics and detailed clinical phenotyping are also collected. 27 epilepsy centers across the US, Canada, Australia, and Europe.

Analyzing images to detect brain changes over time requires parallel processing of MRI scans. Using StarCluster to create a cluster of EC2 machines (from 20 to 200); it load-balances, manages nodes, and turns them off when not used. Occasionally they use compute-optimized EC2 instances for computationally demanding tasks. Recently developed an MRI-based predictive modeling system using OpenCPU and R.
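
StarCluster fronts a Sun Grid Engine scheduler, so fanning scans out can be as simple as one qsub job per scan. A hedged sketch; the paths and processing script are hypothetical, not NYU's actual pipeline:

```python
# Submit one SGE job per MRI scan on a StarCluster-managed cluster.
import pathlib
import subprocess

SCANS = pathlib.Path("/data/mri_scans")  # hypothetical shared directory

for scan in sorted(SCANS.glob("*.nii.gz")):
    subprocess.run(
        ["qsub", "-N", f"recon-{scan.stem}",  # job name per scan
         "-b", "y",                           # run the command as-is
         "process_scan.sh", str(scan)],       # hypothetical worker script
        check=True,
    )
```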

Have a local server in the office running an X2Go server that people connect to from workstations; from that server they upload to the EC2 cluster. There are more than 10 million data points in an MRI scan. Cortical surface modeling delineates different types of brain matter; then you can measure properties to discriminate changes. To compare different patients you need to normalize by computationally inflating brains like balloons – called coregistration.

There are more advanced types of imaging.

Some studies done with these techniques: using MRI to predict post-surgical memory changes; brain structure changes with antiepileptic medication use.

Ongoing work – image analysis as a web service: age prediction using MRI. Your brain shrinks as you age; if there's a big difference between your neurological age and your chronological age, that can be indicative of poor brain health.

Difficulty reproducing results is an issue in this field. Usually the models developed sit on a grad student's computer, never to be run again. Heath developed a web service running on EC2 that can be called to run the model consistently.
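
The actual service is built on OpenCPU and R; an analogous minimal pattern in Python/Flask, with the endpoint, feature names, and model file all hypothetical:

```python
# "Model as a web service": load a versioned model once, expose one
# prediction endpoint, so results are reproducible long after the grad
# student has moved on.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)
with open("brain_age_model.pkl", "rb") as f:
    model = pickle.load(f)  # trained once, shipped with the service

@app.route("/predict-age", methods=["POST"])
def predict_age():
    features = request.get_json()["features"]  # e.g. cortical measurements
    predicted = float(model.predict([features])[0])
    return jsonify({"predicted_age": predicted})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```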

Cloud Forum 2016 – Cornell’s BI move to the cloud

Jeff Christen – Cornell

Source systems – PeopleSoft, Kuali, Workday, Longview. Dimensional data marts: finance, student, contributor relations, research admin. BI tools – OBIEE and Tableau.

They do data replication and staging of data for the warehouses. Nightly replication to stage -> ETL -> data marts.

Why replication/stage? Consistent view of data for ETL processing; protects production source systems; tuning for ETL performance.

Started the journey to the cloud 2 years ago. Were using Oracle Streams – high maintenance, but met some needs. Oracle purchased a more robust tool and de-supported Streams. ETL tool challenge – were using Cognos Data Manager for 90% of their work, but IBM didn't continue to support it. Replaced it with WhereScape RED, which requires rewriting jobs. Apps were already moving off premise: Workday for HR/Payroll, PeopleSoft to AT&T hosting, Kuali Financials moving to AWS. Launched a pilot project to answer "what would it take to run the data warehouse environment in AWS?"

Small pilot – the Kuali warehouse in AWS. Which existing tools will work? Desire to use AWS services such as RDS where possible. Tested both user query performance and ETL performance.

Why Oracle RDS and not Redshift? Approximately 80% of the Kuali DW is operational reporting; it needs fine-grained security at the database level, and there's a lot of PL/SQL in the current environment. Currently exploring Redshift for non-sensitive, high-volume data.

Some re-architecting: Oracle Streams isn't supported with Oracle RDS (used Attunity). The Oracle Enterprise Manager scheduler isn't supported with Oracle RDS either – using Jenkins (so beautiful and simple). No access to the OS on RDS databases – installed Data Manager on a separate Linux EC2 instance, using WhereScape to call Data Manager from the RDS database.
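
Jenkins exposes a simple remote-build API, which is one way a nightly ETL kickoff can replace the OEM scheduler. A sketch with a hypothetical host, job name, and credentials:

```python
# Trigger a Jenkins job via its remote build API (POST /job/<name>/build).
import requests

JENKINS = "https://jenkins.example.edu"  # hypothetical Jenkins host

resp = requests.post(
    f"{JENKINS}/job/kdw-nightly-etl/build",
    auth=("etl-bot", "api-token"),  # Jenkins user + API token
    timeout=30,
)
resp.raise_for_status()  # 201 Created means the build was queued
```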

Needed to be more efficient. On premise the KDW had two physical servers, and inefficiencies in the ETL code and some dashboard queries were masked by the large servers. Prioritizing ETL code conversion by the longest-running areas helped get AWS within the nightly batch window. Some updates were made to dashboards to improve performance or offer better filter options. Hired a database tuning consultant (two weeks) to help with Oracle tuning.

Testing and user perception: started with internal unit testing and internal query execution time comparisons between on premise and AWS, then user testing of dashboards on premise versus AWS. Repointed production OBIEE financial dashboards to AWS for a day, three times. Some queries came back faster, some slower; went through optimization and tuning to get it comparable across the board.

Cutover to AWS on Sept. 8. Redirected all non-OBIEE ODBC client traffic in October. Agreed to keep the on-premise KDW loading in parallel for two month-end closings as a fallback.

Next steps: the parallel Research Admin mart is already in AWS – expect cutover by end of CY. Need more progress on ETL conversion before moving the student and contributor marts. Continue big data / non-traditional data investigation (Cloudera on AWS). Redshift for large non-sensitive data sets.

Lessons learned: off-premise hosting does not equal cloud technology. It's often hard to get data out of SaaS apps.

Cloud Forum 2016 – Lightning Rounds #2

Cloud VDI – Bob Winding (Notre Dame)

Use cases they looked at:

  • Classes that need locally installed software
  • Application delivery instead of high-end lab machines
  • Workstations for researchers where the whole project is in the cloud
    • NIST 800-171, ITAR, etc
    • Heavyweight, graphics and processing-intensive work

Looked at: WorkSpaces (AWS); Microsoft RDP and RDP Gateway; Fra.me; Ericom Blaze and Ericom Connect.

Performance is everything – did tests with PTC Creo, Siemens NX10, and SolidWorks. Set up a test environment in Oregon. Nobody in central IT knew how to operate the software. Found in almost every case that the remote setup beat local desktop performance. In some cases the local environment crashed under load, but in AWS it loaded in under 2 minutes (g2.xlarge).

Researchers observed that they can transfer

Cloud Governance – Do You Need a Cloud Center of Excellence? Laura Babson (Arizona)

A CCoE is a group that leads an organization in an area of focus.

It establishes best practices, governance, and frameworks for an organization.

Applications vs. operations – what do you do about tagging, automation, monitoring, security, etc.? Don't want to end up with different ops solutions for different applications.

A CCoE can help streamline decision making: it can make the decision if funding isn't required, or make a recommendation to a budget committee if funding is required.

Recent decision making: account strategy – how many accounts, and where to put each workload? Campus-to-cloud connectivity; monitoring; tagging policy.

Can help with communication and engagement across the organization

AWS CloudFront – Gerard Shockley (Boston U)

What is a CDN? A geographically dispersed, low-latency, high-bandwidth solution for distributing HTTP and HTTPS content.

Terminology: distribution (rules that control how CloudFront will access and deliver content); origin (where the content lives).

Only works with publicly visible infrastructure at AWS

Easy to get metrics and drill down into specifics

DevOps != DevOps – Orrie Gartner (Colorado)

Brought a new data center online 3 years ago to consolidate IT across campus, built a private cloud

Ops and dev teams work closely together, automating everything, accepting higher risks, building strong relationships between teams, and performing continuous integration and deployment.

Moving to the public cloud didn't go well this summer – lack of understanding of the vision and goals from other silos.

Ensure the entire enterprise strives for the same end goal, and communicate that goal.

Created a vision and articulated a cloud strategy: a 6-phase roadmap to the public cloud that includes embracing DevOps culture. A line in the strategic plan encourages every team to articulate how it will embrace DevOps concepts.

Educate Up. Educate Laterally. Educate Down.

Change is not easy – it means changing the culture of the organization. Prosci ADKAR is the model embraced for making organizational change. Small steps, like encouraging process folks to use Jira, the same tool used by the dev and ops folks.

Us versus Them – a View From the Information Security Bleachers – David McCartney (Ohio State)

Security is not the enemy – they’re scared, unaware, and unprepared for the cloud.

Scared – “how can we stop you?”

Unaware – why move? what kind of data? what security is needed (vs. what you think you need)? what did we do to deserve this?

Unprepared – How do current security services expand? What do you mean "no agent"? Logging? Auditing? Access management? Vulnerability scans? Incident response? What about regulatory and framework requirements?

Model Us + Them – Embrace security, buy them booze.

Engage security early, sell the opportunity to do something new and exciting, provide options for training and guidance.

MCloud: From Enable to Integrate – Mark Personett (Michigan)

MCloud is an umbrella service. Strictly IaaS – currently offering AWS, but that might include other providers later.

First iteration launched in 2014 – access to the UM enterprise agreement, optional consolidated billing, data egress waiver, and the M Cloud Consulting Service.

Working on launching M Cloud AWS Integrate: provisioning – private network space, Shibboleth integration, etc.; guardrails – security best practices, common logging, reporting, etc.; foundational services in AWS – AD, Shib, Kerberos, DNS, etc.; site-to-site VPN services.

Azure RemoteApp – Troy Igney (Washington U in St. Louis)

Two core requirements emerged when enrollment in a second-year CS class spiked: students needed Visual Studio, and new computers were too expensive. On-prem VDI – too expensive. Off-prem VDI – Azure RemoteApp.

Goal – deliver consistent development environment across a range of BYOD devices.

Challenges: supporting an entire class's logons at once required off-menu configuration from Microsoft.

Advantages – template once and deploy; capacity costs are based on current enrollment, dynamically adjusting for enrollment changes.

Largest RemoteApp deployment directly supporting classroom delivery.

Microsoft dropped RemoteApp in favor of Citrix virtualization technologies.

Lots of lessons learned supporting remote VDI

Adopting Cloud-Friendly Architecture for On-Premise Services – Eric Westfall (Indiana)

Indiana is primarily on premise with an increasing amount of SaaS. Have newer data centers and heavy investment in VMware. A hybrid environment is inevitable, but in the meantime they're working to be prepared with a "cloud-ready" app architecture:

  • 12-factor principles
  • Stateless architecture
  • Microservices
  • Object storage (using the S3 API in on-prem solutions)
  • Non-relational databases

Facilitating DevOps culture

Containerization – investing heavily in Docker. Adopting Docker Data Center

Hope it will allow them to take advantage of existing infrastructure investments, give dev and ops staff opportunities to experiment with cloud services, allow modernization of app architecture and delivery practices, and prepare for the inevitable future.

Cloud Initiative and Research – Steve Kwak (Northwestern)

Cloud Governance – October 2015. IT Directors from the schools and enterprise IT. Hired a consultant to help develop governance.

Cloud Architecture and Consulting Team – April 2016 – 5 initial team members. Set up initial environments at AWS and Azure, worked through billing and accounts, and provides consulting.

Running cloud days and "open mic" sessions with AWS.

Research environments – 3 centrally managed: HPC (heavy upfront investment for dedicated compute, always a queue); social science cluster (aging infrastructure, limited support); research data storage (separate storage from HPC). Looking to burst HPC to the cloud and move the other two.

Genomics pilot in AWS. Hired a third-party team to put the architecture together.

HPC environment – working on targeting specific workloads in the cloud with the scheduler, and figuring out bursting.

Controlled Approach to Research Computing in AWS – Paul Peterson (Emory)

The security team's mindset – they need a similar set of controls in the cloud as on premise. This is quite challenging.

Started working to build a Research Cloud. Collected 24 use cases and put them in three categories, divided into 2 VPC types. Worked with AWS Professional Services to build out the VPCs. The pilot started this summer and runs to the end of the year.

Type 1 VPC – one availability zone, no Internet gateway; access only through Emory. Single sign-on with Shib.

Type 2 has two availability zones and an Internet gateway.

Goal of project team is to make requests for VPCs easy. Automation is key.

Built a "generate VPC" service. Created an inventory of accounts, LDS groups, Exchange distribution lists, and CIDR ranges.

The service gets the next available account, adds admins to the LDS group, creates the SAML provider, creates the account alias, selects the CloudFormation template, gets the next available CIDR range, creates the stack, and computes subnets for the account. Takes less than 5 minutes.
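
A hedged sketch of a few of those steps in boto3; names, the metadata file, and the template URL are hypothetical, and Emory's service also updates LDS groups and its CIDR inventory (campus systems not shown here):

```python
# Automated account setup: SAML provider, account alias, and VPC stack.
import boto3

iam = boto3.client("iam")
cloudformation = boto3.client("cloudformation")

# Create the SAML provider so single sign-on with Shib works in the account.
with open("shibboleth-metadata.xml") as f:
    iam.create_saml_provider(SAMLMetadataDocument=f.read(), Name="Shibboleth")

# Human-friendly account alias for sign-in URLs.
iam.create_account_alias(AccountAlias="research-lab42")

# Launch the selected VPC template with the next CIDR from the inventory.
cloudformation.create_stack(
    StackName="research-vpc",
    TemplateURL="https://s3.amazonaws.com/templates/type1-vpc.yaml",
    Parameters=[{"ParameterKey": "VpcCidr", "ParameterValue": "10.64.12.0/24"}],
)
```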

We Demand, On-Demand: Berkeley Analytics Environments, VDI and the Cloud – Bill Allison (Berkeley)

Central IT budgets getting cut 10% year-over-year.

VDI use cases have been mostly around desktop apps, not research. Funded a pilot through December. User- and use-case-driven (faculty oriented) – need to tell the story from a faculty perspective. The Research IT group is like field workers, most with PhDs.

Analytics Environment on Demand – not a change in the way you compute, at least on the surface. Use the skills you already know. Creating an abstraction layer.

Art of Letting Go – Relationship advice for dev and ops in the cloud – Bryan Hopkins (Penn)

Team lead for the cloud app dev team. Cloud First program – replace homegrown frameworks with off-the-shelf frameworks; replace waterfall with agile; replace monoliths with integrations and composed apps.

Three things we've learned so far: 1. Have a clear try-and-scrap phase in R&D – give it leeway. 2. Accept that interests and traditional roles will collide. The dev team can help with platform tasks, the ops team can help with dev. Everyone cares about Jenkins – bring them together. 3. Let go of notions of perfection and clean lines. Off-the-shelf means you get what's on the shelf.


Cloud Forum 2016 – ERP In The Cloud

Jim Behm (Michigan), David McCartney (Ohio State), Glen Blackler (UC Santa Cruz), Erik Lundberg (Washington)

UMich – currently running PeopleSoft (Student, Financials, HR), Click Commerce, Blackbaud. Investigating IaaS for the student system and planning others.

Ohio State – currently running PeopleSoft (Finance, HR, and Student); converting Finance to Workday and then HR. Exploring Workday Student. Timing: 3 years for Finance, 5 years for HR.

UC Santa Cruz – Banner, PeopleSoft, and custom IdM on Solaris. Moving it all to AWS by spring 2018.

Washington – their most modern ERP is 30+ years old, COBOL on a mainframe. Moving HR/Payroll to Workday, others to follow; launching in June 2017. Completely restructuring business processes around HR, creating a single service center. Then they'll tackle Finance. Lesson learned – don't try implementing software without redoing the business process. Looking at how to create a sustainable organization capable of tackling these huge projects over 15-20 years.

What impact has your cloud move had on your IT staff?

Bentley University: Didn’t take into account the level of effort involved in regression and security testing. Unanticipated costs and resource issues.

Notre Dame is moving ERP to AWS. It had a big impact on the storage team, who no longer need to do what they used to.

Harvard is moving PeopleSoft HR into the cloud. Looking at it as a people issue, not technology. Very sensitive data, and the people who manage it on premise are invested but don't have cloud skills. Don't want to rely solely on the cloud team's expertise. Holding a PeopleSoft Day once a week with a consultant who has expertise moving PeopleSoft to AWS, the cloud team, and the PeopleSoft team, working together to solve problems and remove barriers. Building continuous integration and lots of automation. Arizona is doing that too.

Ohio State gutting the data warehouse and rebuilding from scratch. Not sure yet where it will end up.

How have you dealt with information in the cloud and the security ramifications?

Ohio State – Workday is different in terms of access than something like Box. Running into challenges getting enough visibility into the system; concerns about the ability to get logs and information they can consume.

UCSC – people don't understand the difference between SaaS and IaaS. Having to educate them on the local responsibilities still inherent in moving to AWS.

If you chose SaaS how did you enlist your campus and business partners to sacrifice flexibility of the current way for business standardization? How challenging was this?

At Cornell, HR decided to move to Workday without consulting IT initially. It was a wake-up call for IT in terms of commoditization.

Cloud Forum 2016 – Migration of OnBase to AWS

Sharif Nijim – Notre Dame

OnBase in AWS? Really? It's a Windows app. AWS does Windows fine; OnBase doesn't do the cloud fine. Licensing makes using elasticity painful. Looked at Hyland's own hosted offering, but it was way more expensive.

A few lessons learned: EFS doesn't do CIFS – not useful if you want Windows file service in the cloud. There are some products that can help. If they had to do it over again they'd probably run their own Windows file servers in AWS, but they used Panzura because they had some licenses.

Moving the data – had a couple of terabytes. Tried AWS Snowball; it was complete overkill for what they needed. Transferred 16 GB of database in about 17 minutes. S3 multi-part parallel upload works well, but there were ~7 million small document files, so they had to zip them up for transfer optimization and then rehydrate. Then tried Robocopy to trickle data over a couple of weeks. In order to make a choice, they had to understand how the application actually works: a document is written to disk and never changes (annotations go in the database), and OnBase segments documents into directories loosely the size of a DVD. So it doesn't matter if it takes a long time to move the data, as it never changes.
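
A sketch of the "zip, then multipart upload" approach; boto3's transfer manager does the parallel multipart work, and the bucket and paths here are hypothetical:

```python
# Bundle a directory of small files, then do a parallel multipart upload.
import shutil

import boto3
from boto3.s3.transfer import TransferConfig

# Zip one OnBase disk group (documents never change once written).
archive = shutil.make_archive("/tmp/onbase-docs", "zip",
                              "/data/onbase/diskgroup01")

config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # switch to multipart above 64 MB
    max_concurrency=16,                    # parallel part uploads
)
boto3.client("s3").upload_file(archive, "onbase-migration",
                               "diskgroup01.zip", Config=config)
# After upload, "rehydrate" by unzipping on the target file server in AWS.
```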

OnBase uses NTLM auth, which doesn't work well with load balancers, so they had to stand up HAProxy. Hoping that OnBase will implement Shib in the next year or so. Notre Dame's default procedure is to shut down servers outside of 7am-7pm, but with OnBase the hash codes for the licenses screw up how the license checks work. Had to get rid of load balancers and elasticity, but still gained separation of the web tier from the app tier.

Cut production users over on June 25, 2016, and nobody noticed. Shut down 25 servers and liberated 2 TB of space in the production environment.

Plea to cloud providers – make it easy to provision automation from the GUI, by exporting to templates or whatever.

Cloud Forum 2016 – Automate All Things

https://arizona.box.com/v/automate-things/

Mark Fischer from the University of Arizona is presenting on software-defined infrastructure with AWS CloudFormation, Docker, and Jenkins.

Mark – 20 years of web app dev, 5 years of infrastructure tools dev, and 2 years of AWS infrastructure dev.

Goals – codify infrastructure decisions, document the deployment process in code, ensure repeatable operations, empower developers and product owners to get more done quickly. Want to get to a place where deployments can be replicated quickly and reliably.

Just because you move to AWS doesn't mean you get automation. When you configure things manually in the console, it can leave lots of things hanging – security groups, SSH keys, IAM roles, etc. A three-tier web app may have 10+ separate resources that need to be configured.

Automation progression: manual infrastructure provisioning -> CloudFormation; manual environment configuration (libraries, Java versions, etc.) -> Docker; manual code deployment -> Jenkins.

CloudFormation

CloudFormation is a text-based template plus a deployment engine: a JSON text document defines AWS resources and resource relationships, and a parameter system lets you spin things up with different names, etc. CloudFormation tracks resources and can handle de-provisioning of all of the infrastructure pieces. If you don't like JSON, you can now use YAML – nicer to work with, and you can have comments.
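
A minimal sketch of the parameter idea – one template, many deployments with different names. The template is a toy (a single S3 bucket), and the stack and bucket names are hypothetical:

```python
# Deploy the same CloudFormation template twice with different parameters.
import boto3

TEMPLATE = """
AWSTemplateFormatVersion: '2010-09-09'
Parameters:
  Env:
    Type: String
    AllowedValues: [dev, prod]
Resources:
  AppBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub 'example-app-${Env}'   # bucket names must be unique
"""

cf = boto3.client("cloudformation")
for env in ("dev", "prod"):
    cf.create_stack(
        StackName=f"example-app-{env}",
        TemplateBody=TEMPLATE,
        Parameters=[{"ParameterKey": "Env", "ParameterValue": env}],
    )
```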

Configuration as code: CloudFormation allows you to codify your infrastructure deployments and track modifications to templates.

UA CloudFormation catalog. https://bitbucket.org/ua-ecs/service-catlog

Docker

Codifies server configuration in a Dockerfile. Files can be versioned and managed. There are lots of Docker-enabled environments.

Jenkins – DevOps Glue

Can fill gaps where you find yourself typing a couple of commands to paste automations together. Lots of functionality: check out a Git repository, build a Java project, run shell scripts, integrations with Slack, email, etc.

UA is migrating its financial system to AWS – Jenkins jobs are involved in lots of the steps. Using AWS OpsWorks (managed Chef environment). Usernames, passwords, and keys can be stored securely in Jenkins and then passed to jobs. Devs can launch environments for new code branches – the whole process takes 20-25 minutes.

Jenkins allows restricting access to jobs – can create, can run, can manage secrets. This lets you abstract AWS deployment capabilities: BAs can perform their own database refreshes (e.g. prod -> dev), and DevOps staff can manage Jenkins jobs without knowing all the credential secrets.

Sticking Points

Automation takes more time up front to get right, but subsequent deployments are shorter. IAM permissions can be a pain if you have lots of apps in the same account (making sure that people only have access to work on their own app). Persistent file storage is hard – use S3 and RDS as much as possible; EFS makes this slightly easier.

 

Cloud Forum 2016 – Lightning Rounds #1

5 minute lightning rounds

How the Cloud is living up to its promise in Cornell Student Services – Phil Robinson

Might have the largest app portfolio at Cornell – around 190 apps and sites, POS systems, etc. Compliance requirements include HIPAA. Pain points include lots of technical debt from inherited tech and lots of time spent keeping up with server patching and upgrades. Looking to leverage elasticity to match student-cycle spikes. Built a class roster with scheduler on AWS – scaled to over 1k simultaneous users in July, then scaled down. They have 10 apps in production in AWS. Identified an inspired team member to act as champion and prioritized cloud solutions. "Automate like crazy."

Using AWS workspaces for Graduate Students in applied social sciences – Chet Ramey & Jeff Gumpf (Case Western)

Pilot project to test virtual desktops via AWS WorkSpaces. The department was eliminating a computer lab as its building was being remodeled. WorkSpaces are easy to provision, manage, and use on multiple devices. Each person gets a WorkSpace, provisioned with stats software and other tools, in Spring 2016, paid for by central IT. Originally planned for 3 courses and 26 students; initial setup took about one hour. After the first week of operation the pilot was expanded to 6 courses and 110 WorkSpaces. Users were provisioned through the AWS Console: built a master WorkSpace, created an image and two bundles from it, and used them to provision users. Problem with the SPSS installer – it won't run on Windows Server; got around that. Included the Google Drive client for storage. About $150/student/semester, but with new AWS hourly pricing it would be ~$80.
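
At larger scale, the same provisioning can be scripted instead of clicked through in the console. A hedged sketch; the directory ID, bundle ID, and roster are hypothetical:

```python
# Provision one WorkSpace per enrolled student from a custom bundle.
import boto3

workspaces = boto3.client("workspaces")
students = ["jdoe", "asmith", "bnguyen"]  # from the course roster

workspaces.create_workspaces(
    Workspaces=[{
        "DirectoryId": "d-9067example",   # AD Connector / directory
        "UserName": user,
        "BundleId": "wsb-statsbundle1",   # image with SPSS etc. baked in
    } for user in students]
)
```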

Bringing IT Partners on Campus Along For the Ride – Susan Kelley (Yale)

Technology Architecture Committee – governs design and architecture, approves POCs, encourages documentation of strategies, runs working groups. Reviewed 31 projects in the last year. Formed a Cloud Working Group – 8 central IT staff and 7 IT partners. Decision 1: AWS and Azure. The Med School helped with how to interpret Azure bills. The School of Architecture wanted to get out of managing servers locally – used as a test case for VPC; within one year they migrated all their infrastructure to the cloud. Now they go around telling other IT teams what they learned.

Securing Research Data: NIST 800-171 Compliance in the Cloud – Bob Winding (Notre Dame)

Lots of work will need to be compliant by the end of 2017 – research that contains "controlled unclassified information" (ITAR, etc.). Held a workshop with AWS and several other schools; worked to create a Quickstart guide and a Purdue EDUCAUSE paper. GovCloud and the Quickstart cover about 30% of the mandated controls. GovCloud is a US-persons-only region, which helps. Providing a VDI/RDP gateway in the Shared Services VPC – the VDI client is the audit boundary. As long as you run on university-managed equipment you have an isolated environment. Still process-intensive, but you don't have to worry about infrastructure.

Cloud Adoption: A Developer’s Perspective – Brett Haranin

Cost Engineering in AWS: Do Like We Say, Not Like We Did – Ben Rota

Lessons learned:

  • Be careful that your people don’t confuse “cheap” with “free”
  • For cost estimates, you generally only need to worry about RDS, EC2, and S3
  • Easiest way to save money is to shut down what you don’t need (engineers aren’t used to doing this on premise)
  • Enforce tagging standards that help you understand your spend (including tags for testing)
  • Look out for unattached storage (see the sketch below)
  • Consider over-provisioning storage rather than buying PIOPS
  • Multi-AZ RDS instances are a low-risk way to get into RI purchases
  • Real bang for buck in RI purchases is to do them at all

How to do it? Set up Trusted Advisor or a third-party tool to help get a view of what's going on.
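
For the "unattached storage" bullet above, a small boto3 sketch; the $0.10/GB-month figure is an assumed gp2 price, so adjust per region:

```python
# List EBS volumes in the 'available' state (attached to nothing) and
# estimate the monthly waste.
import boto3

ec2 = boto3.client("ec2")
paginator = ec2.get_paginator("describe_volumes")

orphaned_gb = 0
for page in paginator.paginate(
        Filters=[{"Name": "status", "Values": ["available"]}]):
    for vol in page["Volumes"]:
        orphaned_gb += vol["Size"]
        print(f'{vol["VolumeId"]}: {vol["Size"]} GiB unattached')

print(f"~${orphaned_gb * 0.10:,.2f}/month wasted at $0.10/GB-month")
```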

Dirty Dancing In The Cloud – Scotty Logan

Why are we moving to the cloud? Geo-diversity, scalability, etc. Don't forklift to the cloud. "Go Cortez" – burn your boats behind you. Go to new stuff: DevOps, CI, CD, etc. But you still have firewalls and IP addresses – tightly coupled. Use TLS and client certs instead... but my CISO says we need to use static IPs and VPNs! If you have to, use NAT gateways (an AWS service)... until you can get to the happy place.

Jetstream: Bob Flynn (Indiana)

Expanding NSF XD's reach and impact: 70% of NSF researchers claimed to be resource-constrained. Jetstream is NSF's first production cloud facility. Infrastructure runs at Indiana and Texas, with dev at Arizona. Built on OpenStack. For researchers needing a handful of cores (1 to 44), devs, and instructors looking for a course environment. A set of preconfigured images (like AMIs) to pick from. Went live September 1; over 125 XSEDE projects. NSF is soliciting allocation requests, including Science Gateways. jetstream-cloud.org

Research Garden Path Case Studies – Rob Fatland (Washington)

CloudMaven – a GitHub repo. Don't recreate solutions. Has AWS procedures and a page on HIPAA compliance.

Prototyping library services using a high-performance NoSQL platform – Erik Mitchell (Berkeley)

It costs about $8 per book to put it in a storage facility. Looked at levels of duplication across two libraries and 27 fields = 378 million data items to compare. Looked at big data solutions. Used BigQuery on Google – a NoSQL database with a GUI and a SQL-like query language. Was able to analyze the data easily and discovered lots of places to save effort and money. Not everything needs to be an enterprise service, if the cloud service is easy enough to use at a local level.
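
A hedged sketch of that kind of duplication query with BigQuery's Python client; the dataset, table, and field names are hypothetical stand-ins for the fields actually compared:

```python
# Find records held by more than one library: candidates for dedup.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT oclc_number, COUNT(DISTINCT library) AS holding_libraries
    FROM `library_dedup.catalog_records`
    GROUP BY oclc_number
    HAVING holding_libraries > 1
"""
for row in client.query(query).result():
    print(row.oclc_number, row.holding_libraries)
```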

Umbrellas for a Rainy Day: Cloud Contracts and Procurement – Sara Jeanes (Internet2)

In the world of cloud, "the contract is king" – all you own is the contract. Typical procurement processes are long and cumbersome – that doesn't work for the cloud. Challenges: timeliness, risk management, price variability, pilots and trials. Possible solutions: consortia, "piggybacking", community communication and collaboration.

Red Cloud and Federated Cloud – Dave Lifka (Cornell)

Talked to lots of researchers – they want everything a la carte: cheap computing on demand, with no permanent staffing or aging hardware. Built Red Cloud, a Eucalyptus stack (100% AWS-compatible so you can burst). Gave each of them root, but then built a subscription model and a for-fee service building VMs for researchers. Available externally as well as internally to Cornell. Aristotle – data building blocks; bursting to other institutions and then to AWS. Building an allocation and accounting system. It's about time-to-science: a portal to tell people what resources they can get, when, and at what cost.

 

Cloud Forum 2016 – Routing to the Cloud

DNS – Notre Dame (Bob Winding)

Found early that everything relies on DNS. Need to integrate with AWS DNS to take advantage of both. How many "views" do you have on campus? Want to resolve all the views, but not undermine the virtues of Route 53. Think about query volumes – what do you do on campus? They delegate zone administration with Infoblox on campus, but it doesn't have great granularity for automation; AWS has great IAM controls for automation, but not granular delegation. They use Infoblox as primary DNS, but are looking at creating more authoritative zones in Route 53 so they can take advantage of the automation when spinning up new systems.
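
A sketch of the Route 53 automation that makes authoritative zones there attractive – registering a record as part of spinning up a new system. The zone ID and names are hypothetical:

```python
# UPSERT an A record in Route 53 at instance launch time.
import boto3

route53 = boto3.client("route53")
route53.change_resource_record_sets(
    HostedZoneId="Z1EXAMPLE",  # the cloud-delegated zone
    ChangeBatch={
        "Comment": "auto-registered at instance launch",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app01.cloud.example.edu.",
                "Type": "A",
                "TTL": 300,
                "ResourceRecords": [{"Value": "10.20.30.40"}],
            },
        }],
    },
)
```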

What do we do with publicly addressed end-points on campus? Had to have a way of routing public endpoints to private address space. When you put in VPN tunnels you create hidden peering connections via your campus, so you need to put ACLs in place. Need to think about visibility of traffic.

AWS Security – Boston University (Gerard Shockley)

Lessons learned around security groups – change the philosophy from individuals having desktop access to servers to using a VPN group or a bastion host. It's a challenge to convince them they can't have dedicated access. How to reassemble breadcrumbs for forensics? VPC Flow Logs, CloudTrail, and OS logs – a time-consuming challenge.

AWS Data Level Security – Harvard 

Have a policy to encrypt everything. Turning on encryption at rest by default. Encrypt database backups (RMAN). Also need to encrypt data in transit – they haven't needed to do that on premise between non-routable subnets, so they had to work with app owners to make sure data gets encrypted in transit. Some institutions are installing Tripwire on individual systems. Looking at replicating data to other vendors. Libraries are in a bind because their replication strategies make them unable to trust cloud vendors. There's some discussion of whether we can urge vendors toward the kinds of archival standards for preservation of digital materials that have evolved in the library world.
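
A sketch of "encrypt at rest by default" on the RDS side; the instance identifier, engine, and key alias are hypothetical, not Harvard's actual configuration:

```python
# Create an RDS instance with storage encryption under a KMS key.
import boto3

rds = boto3.client("rds")
rds.create_db_instance(
    DBInstanceIdentifier="hr-reporting-db",
    DBInstanceClass="db.m4.large",
    Engine="oracle-ee",
    AllocatedStorage=200,
    MasterUsername="admin",
    MasterUserPassword="use-a-secrets-manager-instead",  # placeholder
    StorageEncrypted=True,         # encryption at rest on by default
    KmsKeyId="alias/rds-default",  # institutional KMS key
)
```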

Notre Dame is refactoring their security groups so that services, databases, and users are each in groups, and they can specify which apps can route traffic to which databases, without relying on IP addresses. That's hard to do if you have to integrate on-premise resources that don't talk the same kind of security groups.
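
A sketch of a group-to-group rule instead of an IP-based one – allowing the app tier's security group to reach the database group on the Oracle port. Group IDs are hypothetical:

```python
# Authorize ingress from one security group to another (no IP ranges).
import boto3

ec2 = boto3.client("ec2")
ec2.authorize_security_group_ingress(
    GroupId="sg-0f00db0000000001",         # database tier group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 1521,
        "ToPort": 1521,
        "UserIdGroupPairs": [
            {"GroupId": "sg-0a00ap0000000002"},  # app tier group
        ],
    }],
)
```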

Caltech killed a $750k on-prem VDI project and is looking at AWS WorkSpaces very closely.

Most campuses seem to be building infrastructure like identity services into a "core VPC". There is a 50-something VPC peering limit before you hit performance limits. One school is only going to peer VPCs for Active Directory and will open public IPs for Shib, LDAP, etc.

Stanford moving their production identity infrastructure to AWS in the next year, in containers. Other schools also heading that direction. Cornell has put AD into AWS, using multiple regions.

Notre Dame looked at AWS’ directory service, but it needed a separate forest from campus, so didn’t meet their needs.

Notre Dame is planning to put VPN service into the cloud as well as on premise so it will continue to exist if campus is down. Arizona is standing up AD in Azure, bound to campus, and setting up some peering to AWS. Boston is moving all their AD infrastructure to Azure – looking at Azure AD. Stanford looked at Azure AD but decided not to use it and is building its own AD in Azure.

IPS/IDS in your VPC? Gerard – the cost is "staggering". Stanford is using CoreOS, which can't be modified while running, and running IdM systems in read-only containers – that provides intrusion prevention.

Cloud Forum 2016 – survey results and Grit!

We’re at Cornell for the 2nd Cloud Forum! We started out last night with a lovely reception and then Bill Allison, Bob Flynn and I had a great dinner at the Moosewood Restaurant.

This morning kicks off with summaries from Gerard Shockley of the registration survey. There are 92 registrants (registration was capped) from 52 institutions and 4 technology providers (AWS, Microsoft, Google, ...), with attendees in 83 roles, from CIO to architects to faculty.

  • 75% reported that cloud strategy is a work in progress.
  • 52% using AWS, 17% Azure, 19% Google, 2.7% Oracle; 20% using some form of "community cloud"
  • 71% report no validated cloud exit strategy
  • 30% say they’re “cloud first”, 52% using “opportunistic cloud”
  • 79% report on-premise facilities that are > 5 years old.
  • Most realize that reducing cost is not the main reason to move to the cloud. Improving service, added flexibility and agility, and improving support for research rank high.
  • Staff readiness is the highest ranked obstacle to broad cloud adoption.
  • 34% have signed a Net+ agreement for either IaaS or PaaS.
  • 70% have central IT involved in cloud support for Research Computing
  • 28% say their institution plans on performing clinical research in the cloud
  • 56% say they have signed a HIPAA BAA with a cloud service provider

Next, a session from Sharif Nijim from Notre Dame titled “Grit!”

There’s a shift in how we do things – e.g. from capacity planning to cloud financial engineering. Picking a partner to provide infrastructure services is a whole new level of trust. Hiring staff who can deal with the rate of change in the cloud is critical and hard. We’re all running software that is cloud unfriendly – how many of us are helping the vendors evolve? We’re all prototyping and learning and putting things in production and continuing to learn – sometimes the hard way.