Cloud Compute Services Expansion – Lessons Learned

Mark Personett – University of Michigan

A project to: Enable all three campuses and Michigan to access cloud infrastructure with AWS, Azure and Google

Enterprise agreement, shortcode billing, training, consulting, preconfigured security/network settings, Shibboleth integration, reporting. What it’s not: cloud strategy, governance, or operations.

Lessons learned:

BAA doesn’t cover every service. BAA is just a legal document. Account and billing differences.

AWS at U-M: The BAA is separate from the EA, and adding units to the BAA is a separate process. Single sign-on is not as integrated. No inherent hierarchy.

GCP: billing accounts and “projects” separate concepts. Billing sub-accounts. GCP is API and API is GCP. API explorer is extremely helpful in writing API calls.

Azure: Resource groups vs subscription not always clear (finding that they need to do subscriptions for each resource group in the general case). Office 365 challenges – if alumni get synced to your Azure AD they get rolled into your instance and under your terms. VPN – they have levels of VPNs – if you breach the bandwidth it resets your tunnel with no warning.

 

Higher Ed Cloud Forum: Beyond the Architecture — Rethinking Responsibilities

Glenn Blackler (UC Santa Cruz)

Cloud-First! Now What…?

Santa Cruz’s approach – hw infrastructure was going to turn into a pumpkin in spring 2018. “Screw it – we’re all in, let’s jump.”

What’s our approach? How can existing teams support this change? Program work vs. migration specific work. Our focus – enterprise applications.

Defining the program: Plan for a quick win (build confidence, get familiar, identify training needs). Go big – went from a small PHP app to identity management infrastructure. All in! — moved Peoplesoft and Banner. Run concurrent migrations.

But really… why? Need to continually talk to customers about why they’re doing it. Benefits of cloud migration aren’t apparent – have to sell it. The pitch: elasticity; DR/BR; accommodation (additional test environments); modernized tools and team structures; sustainability.

Teams – Separation of duties – now have separation between sysadmins and app admins and developers. Always been a handoff, ticket driven organization. Don’t know what org looks like in new world – took really smart people and threw them in a room and told them to figure it out. Core team includes App and Sys admins, plus less frequent contributions from security, DBA, networking, devs.

Looking at Cloud Engineering Team that incorporates OS Setup/Config/App Config/Maintenance. DBA team still a bit separate. Security contributing across the board, but not necessarily hands on all the time. Teams are learning new things about each other that they didn’t know in the ticket-driven world.

Future – shared responsibilities mean fewer handoffs; engineers with wider breadth of skills; improved cross-team collaboration through shared code base; continuous improvement through evolving technical design and available services; adjusted job titles and responsibilities; ITS reorganization; budget impact, review of recharge model.

New ways of collaborating: Sys and App admins using a single git repository for code; shared tools/technologies and password management; cross-functional tier 1 support.

Lessons learned – don’t lock decisions down too early; use governance to end debates; identify project goals that foster exploration (within timeline); use consultants carefully; traditional PM will not work; push boundaries of what is possible; required vs. ideal – compromise is important; don’t compare with mature on-premise architecture; be prepared for rumors.

Not everyone is on the bus – what about those who don’t want to get on?

Self Service at Yale

Rob Starr, Jose Andrade, Louis Tiseo (Yale University)

Community told them they needed to be able to spin machines up and down at will for classes, etc. Started with a big local OpenStack environment, now building it out at AWS.

Wanted to deliver agility, automate and simplify provisioning, shared resources, and support structures, and reduce on-premises data centers (one data center by July 2018).

Users can self-service request servers, etc. Spinup – CAS integration, patched regularly, AD, DNS, Networking, Approved security, custom images.

Self-service platform – current manual process takes (maybe) 5 days. With Self-Service, it takes 10 minutes. Offering: Compute, Storage, Databases, Platforms, DNS

All created in the same AWS account. All servers have private IP addresses.

ElasticSearch is the source of truth.

Users don’t get access to the AWS console, but can log into the machines.

Built initial iteration in 3 months with 3 people. Took about a year to build out the microservices environment with 3-4 people. Built on PHP Laravel.

Have a TryIt environment that’s free, with limits.

Have spun up 1854 services since starting, average life of server is 64 days.

Internet2 Tech Exchange 2015 – High Volume Logging Using Open Source Software

James Harr, Univ. Nebraska

ELK stack – ElasticSearch, Logstash, Kibana (+ Redis)

ElasticSearch indexes and analyzes JSON – no foreign keys or transactions, scalable, fast, I/O friendly. Needs lots of RAM

Kibana – WebUI to query ElasticSearch and visualize data.

Logstash – “a unix pipe on steroids” – start with input and output, but can add conditional filters (e.g. regex). Add-on tools like mutate and grok. Can have multiple inputs and outputs.

GROK – has a set of prebuilt regular expressions. Makes it easy to grab things and stuff them into fields. Have to do it on the way in, not after the fact (it’s a pipe tool). 306 built-in patterns.
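To make the grok idea concrete, here is a minimal sketch in Python of named patterns being expanded into a regular expression that stuffs matches into fields. It is not grok itself: the three-entry pattern table, the `%{NAME:field}` expansion helper, and the sample log line are all made up for illustration.

```python
import re

# Hypothetical pattern library, loosely modeled on grok's named patterns.
PATTERNS = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\w+",
    "INT": r"\d+",
}

def compile_grok(expr):
    """Expand %{NAME:field} tokens into named regex groups."""
    def expand(match):
        name, field = match.group(1), match.group(2)
        return "(?P<%s>%s)" % (field, PATTERNS[name])
    return re.compile(re.sub(r"%\{(\w+):(\w+)\}", expand, expr))

def grok(pattern, line):
    """Return the extracted fields as a dict, or None if the line does not match."""
    m = pattern.search(line)
    return m.groupdict() if m else None

LOGIN = compile_grok(r"login by %{WORD:user} from %{IP:client} port %{INT:port}")
fields = grok(LOGIN, "sshd: login by alice from 10.0.0.5 port 2222")
```

Every extracted value comes back as a string, so any numeric conversion would be a separate step on the way in.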

Grok GeoIP – includes built in database, breaks out geo data into fields.

LogStash – statsd – sums things up: give it keys and values, it adds up the values, and once a minute it sends the totals to another tool.
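The summing behavior can be sketched in a few lines of Python. This is only an illustration of the idea, not statsd's implementation: the `Aggregator` class and the metric names are hypothetical, and `flush()` is called by hand here where a real setup would call it from a once-a-minute timer.

```python
from collections import defaultdict

class Aggregator:
    """Sums values per key and hands the totals off on flush."""

    def __init__(self, sink):
        self.sink = sink                  # callable that receives {key: total}
        self.totals = defaultdict(int)

    def add(self, key, value=1):
        self.totals[key] += value

    def flush(self):
        """Send the current totals downstream and start a new interval."""
        snapshot = dict(self.totals)
        self.totals.clear()
        self.sink(snapshot)
        return snapshot

received = []                             # stand-in for the downstream tool
agg = Aggregator(received.append)
agg.add("firewall.denied")
agg.add("firewall.denied")
agg.add("http.bytes", 512)
agg.flush()
```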

Graphite – a graphing tool, easy to use. Three pieces of info per line: key you want logged to, value, time. Will create a new metric if it’s not in the database.
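As a small illustration, Graphite's plaintext format is one metric per line: the key, the value, and a Unix timestamp, separated by spaces. The helper name and the metric key below are made up; the sketch only builds the line and does not send it anywhere.

```python
import time

def graphite_line(key, value, timestamp=None):
    """Format one metric in Graphite's plaintext form: '<key> <value> <timestamp>'."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (key, value, timestamp)

# Hypothetical metric key; a fixed timestamp keeps the example reproducible.
line = graphite_line("logs.firewall.denied", 42, 1443700000)
```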

Can listen to twitter data with LogStash.

Redis – Message queue server

Queue – like a mailbox, can have multiple senders and receivers, but each message goes to one receiver. No receiver, messages pile up.

Channel (pub/sub) – like the radio, each message goes to all subscribers. No subscriber? message is lost, publisher is not held up. Useful for debugging.
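The difference between the two delivery models can be shown without a Redis server. The sketch below is a plain-Python simulation of the semantics described above, with hypothetical `Queue` and `Channel` classes; in Redis itself these are commonly built on list commands such as LPUSH/BRPOP and on PUBLISH/SUBSCRIBE.

```python
from collections import deque

class Queue:
    """Mailbox semantics: each message is held until exactly one receiver takes it."""
    def __init__(self):
        self.messages = deque()
    def send(self, msg):
        self.messages.append(msg)         # piles up if nobody is receiving
    def receive(self):
        return self.messages.popleft() if self.messages else None

class Channel:
    """Radio semantics: each message goes to every current subscriber, or nowhere."""
    def __init__(self):
        self.subscribers = []
    def subscribe(self, callback):
        self.subscribers.append(callback)
    def publish(self, msg):
        for callback in self.subscribers:
            callback(msg)
        return len(self.subscribers)      # 0 means the message was dropped

q = Queue()
q.send("a")
q.send("b")
first, second = q.receive(), q.receive()

ch = Channel()
dropped = ch.publish("early")             # no subscribers yet, message is lost
heard_1, heard_2 = [], []
ch.subscribe(heard_1.append)
ch.subscribe(heard_2.append)
delivered = ch.publish("late")            # both subscribers get a copy
```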

Composing a log system: Logstash is not a single service: split up concerns, use queues to deal with bursts and errors, use channels to troubleshoot.

General architecture – start simple:

Collector -> queue -> analyzer -> ElasticSearch -> Kibana

Keep collectors simple – reliability and speed are the goal. A single collector can listen to multiple things.

Queue goes into Redis. Most work done in analyzer – groking, passing things to statsd, etc. Can run multiple instances.

Channels can be used to also send data to other receivers.
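The collector, queue, and analyzer stages above could be sketched roughly as follows. This is a toy, single-process version: Python's in-memory `queue.Queue` stands in for Redis, a list stands in for ElasticSearch, and the "first word is the host" parsing rule is an assumption made up for the example.

```python
import json
from queue import Queue, Empty

def collect(lines, queue):
    """Collector: no parsing, just wrap the raw line and enqueue it."""
    for line in lines:
        queue.put(json.dumps({"message": line}))

def analyze(queue, index):
    """Analyzer: drain the queue, break out fields, hand documents to the index."""
    while True:
        try:
            event = json.loads(queue.get_nowait())
        except Empty:
            return
        host, _, text = event["message"].partition(" ")
        event.update(host=host, text=text)
        index.append(event)

buffer = Queue()      # stand-in for the Redis queue
index = []            # stand-in for ElasticSearch
collect(["fw1 denied tcp 10.0.0.5", "web2 GET /index.html"], buffer)
analyze(buffer, index)
```

Because the collector does nothing but enqueue, a burst of input only grows the queue; the slower field extraction happens later, in the analyzer.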

Composing a Log System – Archiving

collector -> queue -> analyzer -> archive -> archiver -> log file

JSON compresses very well. Do archiving after analyzer so all the fields are broken out.
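One quick way to check the compression claim on your own data is to gzip a batch of JSON events and compare sizes. The events below are synthetic and highly repetitive, so the ratio they produce says nothing about real logs; the field names are made up.

```python
import gzip
import json

# Synthetic events; real logs will compress differently.
events = [
    {"host": "fw1", "action": "denied", "proto": "tcp", "port": 22, "seq": i}
    for i in range(1000)
]
raw = "\n".join(json.dumps(e) for e in events).encode("utf-8")
packed = gzip.compress(raw)
ratio = len(raw) / len(packed)    # how many times smaller the archive is
```

The repeated field names are a large part of why broken-out JSON shrinks so well: every event carries the same keys.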

Split out indices so you can have different retention policies, dashboards, etc. e.g. firewall data different than Logstash.
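A sketch of what per-index retention might look like, assuming one index per kind of data per day. The retention numbers, the naming scheme, and the helper functions are assumptions for illustration, not anything stated in the talk.

```python
from datetime import date, timedelta

# Hypothetical retention policy, in days, per kind of data.
RETENTION_DAYS = {"firewall": 30, "syslog": 90}

def index_name(kind, day):
    """One index per kind of data per day, e.g. 'firewall-2015.10.01'."""
    return "%s-%s" % (kind, day.strftime("%Y.%m.%d"))

def expired(kind, day, today):
    """True when that day's index is older than the kind's retention window."""
    return today - day > timedelta(days=RETENTION_DAYS[kind])

today = date(2015, 10, 8)
old_day = date(2015, 9, 1)        # 37 days before 'today'
```

With the data split this way, pruning is a matter of dropping whole indices rather than deleting individual documents.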

Can use logstash to read syslog data from a device, filter out what you want to send to Splunk to get your data volume down.

Lessons (technical): clear query cache regularly (cron job every morning); more RAM is better, but the JVM doesn’t behave well after 32GB; split unrelated data into indices (e.g. syslog messages vs. firewall logs); start simple; use channels to try new things.

Lessons: It’s not about full text search, though that’s nice. It’s about having data analytics. ElasticSearch, LogStash, and Kibana are just tools in your toolbox. If you don’t have enough resources to keep everything, prune what you don’t need.

2015 Internet2 Technology Exchange

I’m in Cleveland for the next few days for the Internet2 Tech Exchange. This afternoon kicks off with a discussion of the future of cloud infrastructure and platform services. As I suppose is to be expected, the room is full of network and identity engineering professionals. I’m feeling a bit like a fish out of water as an application level guy, but it’s interesting to get this perspective on things.

One comment is that for Internet2 schools, bandwidth to cloud providers is no longer as much an issue as redundancy and latency. Shel observes that for non-I2 schools bandwidth is still a huge issue. The level of networking these schools don’t have is frightening.

The cloud conversation was followed by a TIER (Trusted Identity in Education and Research) investors meeting. There are 50 schools who have invested in TIER now. This is a multi-year project, with a goal of producing a permanently sustainable structure for keeping the work going. A lot of the work so far has been to gather requirements and build use cases. Internet2 has committed to keeping the funding for Shibboleth and Grouper going as part of the TIER funding. The technical work will be done by a dedicated team at Internet2.

The initial TIER release (baseline) will be a package that includes Shibboleth, Grouper, and COManage, brought together as a unified set, as well as setting the stage for Scalable Consent support. That release will happen in 2016. It will be a foundation for future (incremental) updates and enhancements. The API will be built for forward compatibility. Releases will be instrumented for continual feedback and improvement of the product.

CSG Spring 2014 – Cloud campfire stories, continued

Mark McCahill – Duke

Blackboard to Sakai – wanted to override Sakai’s group management with Grouper to keep Sakai from turning into a de-facto IDM system. Needed a development partner and a hosting partner. Two vendors joined forces – Unicon and Longsight. Doubling the testing does not double the fun. Business school wanted an even newer version of Sakai, so decided to run their own on premise. Started to see failure of course roster display for large courses. Errors in log that LDAP server was not responding. Server was up, but LDAP timeout was set too low for remote LDAP. 

Internet ate the course list! Course list communicated as automated batch upload from Duke’s student system. Turns out that the network was screwing up the file transfer – was hard to debug. Apps should be careful to check input data. 
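One way an app could check a batch file before loading it is to compare it against a checksum and row count sent alongside it. This is a generic sketch, not what Duke or the vendors did: the manifest idea, the function name, and the roster format are all assumptions.

```python
import hashlib

def verify_transfer(payload, expected_sha256, expected_rows):
    """Reject a batch file whose checksum or row count does not match the manifest."""
    digest = hashlib.sha256(payload).hexdigest()
    if digest != expected_sha256:
        return False, "checksum mismatch"
    rows = [r for r in payload.decode("utf-8").splitlines() if r.strip()]
    if len(rows) != expected_rows:
        return False, "expected %d rows, got %d" % (expected_rows, len(rows))
    return True, "ok"

# Hypothetical two-line roster and the hash the sender would have computed.
roster = b"netid1,COURSE101\nnetid2,COURSE101\n"
manifest_hash = hashlib.sha256(roster).hexdigest()
good = verify_transfer(roster, manifest_hash, 2)
truncated = verify_transfer(roster[:20], manifest_hash, 2)
```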

Office 365 – After HIPAA BAA issues were negotiated, wanted to move Med Center and University. It was hard to explain the complex University setup to Microsoft. MS has improved support planned for “mergers and acquisitions”. You need Microsoft to code Forefront Identity Manager to glue things together. MS throttles migration traffic. Silent (and inconsistent) failure modes mean that you copy the mailbox, then check very carefully that everything made it. Failure modes change over time as new releases slosh through the cloud. 

For users, waves of user-visible upgrades wash through the cloud at seemingly unpredictable intervals. Your service desk has to deal with the fact that different users are on different versions. O365 IMAP was slow for Pine. Duke faculty member reported that to the Pine developer (Eduardo Chappa) who fixed the code. 

Cisco Cloud Connect – Cisco’s Cloud services IDM strategy. Syncs attributes from your AD into the cloud. You can select which OUs to sync, but have to do it from a chooser list. But it takes about 20 minutes to populate that list at Duke. Cisco changed to use email address for the identity, and tries to deduce institution from the address. 

Box: Duke has a Box agreement with a BAA. Medical people want tighter controls on accounts, but it has one set of enterprise controls. The REST API might offer a workaround, but the calls are slow, so a single threaded folder traversal is way too slow. A Node.js script with non-blocking I/O allows too many concurrent REST calls, until Box throttles it. Will Box allow enough connections? Stay tuned. Maybe look at Box Events API.
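A middle ground between single-threaded traversal and unbounded concurrency is to cap the number of calls in flight. The sketch below shows that pattern in Python's asyncio rather than Node.js; it does not call Box at all (a short sleep stands in for the REST call), and the limit of 4 is an arbitrary assumption, not a known Box threshold.

```python
import asyncio

async def fetch_folder(folder_id, limiter, stats):
    """Stand-in for one REST call; the real call would go where sleep is."""
    async with limiter:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)
        stats["active"] -= 1
        return folder_id

async def traverse(folder_ids, max_concurrent):
    """Run all fetches, but never more than max_concurrent at once."""
    limiter = asyncio.Semaphore(max_concurrent)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(fetch_folder(f, limiter, stats) for f in folder_ids)
    )
    return results, stats["peak"]

results, peak = asyncio.run(traverse(range(20), max_concurrent=4))
```

Tuning `max_concurrent` to just under whatever the service tolerates would be a matter of experiment.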

Interesting thing now – how can we move arbitrary workloads from on-prem to cloud utilities? Look at Docker.io – lightweight virtualization, no AWS lock-in, open source framework, gaining significant momentum in the DevOps world.

Tom Lewis – Washington

Office 365 – Started three years ago – wanted to move live@edu users to O365 and then open to campus. Timeline was July-November 2011. Asked Microsoft for a test tenant – they didn’t understand why one was needed.

September-October 2011 – Uh-oh. Dogfood tenant finally provisioned by Microsoft. Not a true Edu tenant or XL size.

November-December 2011 – Eyes opening widely. Deep dive with Microsoft on O365 and US environment. Contracted with Premier Deployment team – more discovery on how bad the migration for Live@edu would be. Microsoft’s support for O365 was poor. Came up with strawman for migration.

January-February 2012 – Holding. Couldn’t get right tenant. UW Medicine wanted to move to O365.

March-May 2012 – More holding. Decided to freeze live@edu tenant (with change of domain name) and create new O365 tenant. Problems with DirSync, problems with FIM. Lots of false steps.

June-August 2012 – More holding. Contract should be ready to go by July 1st. Another strawman for migration finalized.

September-December 2012 – One of many stages of grief. Microsoft switched contract. Major outage of live@edu. Finally got right contract signed in November. Talks with Microsoft to confirm proper support for O365.

January-April 2013 – Grief diminishes. Scrapped Wave 14 and went to Wave 15. Microsoft provisions Wave 15 tenants.

May-August 2013 – Change of direction. Started with migrating live@edu users and opening open access. Moved local Exchange users from phase 2 to phase 1. Migrating local Exchange to online Exchange was problematic. Spun up SkyDrive Pro and Lync Online.

September-December 2013 – Progress finally. Contract with Cloudbearing to migrate live@edu.

January-April 2014 – Migrating mass amounts of users. Exchange online seems to work well, OneDrive works well, Lync works well for Windows users, not so much for Macs.

Lessons learned – Microsoft O365 technologies and support are not mature, so continued engagement required. O365 Teams still working at cross purposes. Many things are not so enterprise with licensing and otherwise. They will often release things to your tenant that will enable things that bust your HIPAA compliance. Verify and then trust with Microsoft and their partners. NET+ helps. 

Campus Change Management – Email costs will not diminish for a while (if ever). Communicate the timelines, communicate the details – lots of community meetings, public product backlog, talk up the value. Work closely in pilot mode with department IT, early adopters. Pilot early, pilot for a looooong time. Create a public and open communication channel.

Policy Implications – Account lifecycle management is a beast in the cloud. When to deprovision? Whither Alumni? Employee separation process is messy. Public product backlog. Prepare for lots of discussion on e-discovery. Engage early and often with your counsel.

Alan Crosswell – Columbia

Big HIPAA settlement from data breach. Had previously worked on consolidating data classification and security policies, harmonizing across research, medical, education. Using Code Green for data loss prevention. Have not turned on Google Drive for fears about sharing the wrong kinds of data. Piloting CloudLock for DLP on Google Drive. Also looked at CipherCloud, but didn’t buy.

DLP Challenges & benefits – Per user costs (about $9/user/year), added 1 FTE DLP admin, delayed roll out of Google Drive, had to increase CloudLock scanning to 3x daily to satisfy OGC, need to inform faculty that their stuff is being scanned, evidence that they are avoiding potential disclosures.

Bob Carrozoni – Cornell

Cloud feels like the new normal, but there’s a lot more to figure out.

Seeing a lot of crowd-sourcing – piloting TopCoder. Crowd = Skilled staff as a Service. Metered payment, scalable, elastic.

CSG Spring 2014 – Cloud Campfire Stories

Stories from: Stanford, Notre Dame, Duke, UW, Columbia, Cornell, Harvard

Bruce Vincent – Stanford

Broad use of SaaS, lots of times came in through the back door. Some significant PaaS usage (Force, Acquia, Beanstalk), emerging IaaS deployments. 

Everybody’s a player – all you need is an email address and a credit card. SaaS for all vs. all for SaaS. IT provides some SaaS services for campus use. There are some products which are more niche, for a small population. Should we get involved? How they go about deploying and engaging the vendor is where we can help, if we don’t take too long or act overly bureaucratic.

Not everyone wants to be a player – Vendor management, gnarly policy issues, system engineering complexities (opportunity to refactor what you’ve been doing before), integration complexities. 

AWS deployments – The scale of MOOCs makes IaaS a no-brainer. OpEx is starting to ramp up on the cloud side. Research groups using AWS. Deployed the Stanford emergency web site on AWS. Used Amazon Beanstalk to run a WordPress instance. Main Stanford home page moving to Amazon after commencement. Those kinds of moves lead to discussions with the distributed IT community and stir up interest. 45 technical staff have taken the three day “Architecting for AWS” course. 15 more in June. This has brought distributed interests out of the shadows/silos. Challenges: Consolidation of accounts from people who are already using the service – can’t get lists of who’s using it within the enterprise. Data classification, compliance, and FUD – good people trying to protect the institution, but the standard that’s put out there can be so much higher than the bar for existing services that it makes you question the reasons.

Direct Connect – gives you a dedicated pipe into one of their availability zones. Implementation is pretty complicated, but Amazon is good at turning it around. Get cheaper egress rates – no tiered billing. Also allows you to segment your address space across campus and AWS. Every AWS master account translates to a VLAN and BGP pairing, which gets messy. 

Google Compute: Shiny but rough, lots of interest in/from research computing, Google willing to talk leveraging existing peering and SDN with us. 

Other IaaS and the “virtual datacenter”

Before doing more vendor specific work, it’s time for an abstraction layer. Consider all the process and expertise IT provides to deliver on datacenter services… much of that translates. 

More of everything – not fewer on campus computing instances, more service administration, seeing benefits of consolidation, automation, and virtualization; integration to infrastructure; integration between SaaS to ____

Sharif Nijim – Notre Dame

Moved campus web site to AWS. Brought in an external agitator to stir up selection of a preferred infrastructure provider. How to scale?

4 stages – 4 projects

#1 Conductor – custom engineered CMS, runs 400 sites on campus. Cut that over to Amazon – much better performance. 50% improvement over Rackspace, with a 50% reduction in cost. 

#2 Mobile ND – Kurogo framework running on AWS.

#3 – AuthN/Z – Using Box, Google, Sakai (hosted off-campus). How to authenticate if campus network is down? By end of month going live on AWS.

#4 – Backup – can backups be solved for less than the maintenance on existing equipment? Local devices do dedupe and compression, cloud becomes authoritative store of the backup. Looking at Panzura. Proof of concept in June. Amazon claiming 11 9s of reliability in S3. Starting with 300 TB. Company in Chicago put in 115 TB and saw 15 TB stored in AWS after global dedupe and compression. 
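For a sense of scale, 115 TB in and 15 TB stored is roughly a 7.7:1 reduction. The sketch below shows the basic idea behind block-level dedupe (store each distinct block once, keyed by its hash); it is a generic illustration with made-up block contents, not how Panzura or any particular product works, and it leaves out compression.

```python
import hashlib

def dedupe(blocks):
    """Keep one copy of each distinct block, keyed by its SHA-256 digest."""
    store = {}
    recipe = []                       # digests in original order, for restore
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)
        recipe.append(digest)
    return store, recipe

def restore(store, recipe):
    """Rebuild the original sequence of blocks from the store."""
    return [store[d] for d in recipe]

blocks = [b"A" * 4096, b"B" * 4096, b"A" * 4096, b"A" * 4096]
store, recipe = dedupe(blocks)        # four blocks in, two distinct blocks stored
reduction = 115 / 15                  # figures quoted above: 115 TB in, 15 TB stored
```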

Cloud Fluency, Automation Fluency, DevOps. Organizational Tension – sysadmins need to work more closely with developers. 

Culture Change – How do you get people to the “oh my goodness” moment? One approach is to lead them through it – identify specific people to embed. Transparency – encourage people to be transparent about what they’re working on. Their Amazon architect is very responsive on Twitter. 

The Future – Will embark this summer on reflection and strategy assessment about data center in the next five years. What does the future hold for the two data center facilities?