CSG Spring 2018: Student success and student information strategy

Lech Maj (NYU)

Becky Joffrey (Cornell)

Student success – the ability to have a 360-degree view of the student lifecycle

Survey data:

How involved is the IT team with student success initiatives? 66 (out of 100)

How involved is the IT team with learning analytics? 70 (out of 100)

NYU – Bernie Savarese (Assistant VP for Student Success)

Reports through Enrollment Management (which includes Peoplesoft, Fiscal aspects, and Recruitment)

Student Success is everyone’s business!

Why? Position & Rank; Financial Stability; Perceived Value & Alumni Engagement; Deliver on Promises

First year retention and Graduation rate make up almost 1/3 of US News ranking. Found NYU was lagging peers in those measures.

Financial loss – first-year attrition means a net loss of $16 million.

Goals: First-year retention – 96% by 2020; six-year graduation – 90% by 2026: requires keeping 50 extra students per year.

Student Success Steering Committee, with a Technology Task Force.

Guiding Principles: Use technology to drive relationships; make a big complex place feel small; find and support students who need us most; identify and remove unintended barriers; find and surface evidence for continuous improvement.

Chose Starfish as a technology platform – launching in Fall 2018. Why a platform? See the whole student, aggregate critical information and systems, coordinate care, simplify resource referral, identify leading vs lagging indicators, deliver on promises.

Platform goals for Year One: All undergraduates in all schools; raise flags and alerts; predictive analytics and risk scores; appointment scheduling; shared notes; include student affairs/services

Need to be able to close loops – e.g. if a faculty member raises a concern about a student, they need to know what happened.

Students have the ability to see what’s going on in a dashboard.

Becky Joffrey – Cornell’s Student Engagement Ecosystem

Current State: Role-based infrastructure (point solutions, each serving a single role: club member, dorm resident, job seeker). Data lives within each transactional application. They spend a lot of time marrying data from systems, but still struggle.

Desired state: Person-based infrastructure – many roles that change over time, data moves and grows with a person; invest in understanding constituents. Duh – CRM.

Ended up with 23 Salesforce orgs – that’s no way to implement CRM. That caused the Provost to start a project to think about the student experience globally. Move from departmental intent to institutional intent.

Led to strong steering structure and governance. Provost funded, steering committee led by a dean and the Vice Provost of academic affairs, experience working group, analytics working group.

Initiative: Modernize technology to support the student experience; focus on student services, advising, student activities, and analytics. The goal is to connect all parts of the student experience. Audiences include students and the people at Cornell who support them.

experience.cornell.edu – Discovery website for student opportunities. When you click to apply for an experience, it takes you into Salesforce. A rich dashboard experience for the student that integrates and orchestrates all the different experiences. Advisors also have a dashboard. Other web sites can use the information to display filtered views of opportunities.

Putting finding tools in Drupal sites (Opportunities, Resources, Clubs, Events, People), and “Doing Things” in Salesforce (transactions)

The benefit is data: You can see who is doing X, but more importantly you can see who is NOT doing X; Data is collected via natural points of engagement vs. surveys and notes; Data benefits the entire institution, not just individual unit. Prioritize apps that will glean the richest sets of data.

Tips: Find a point of gravity that brings campus together; start with users’ problems; identify urgent need; build a horizontal solution, not a vertical one; consider the breadth of tools available and how they integrate; create an extendable architecture.

UC Berkeley – Oliver Heyer – BOAC and the Data Loch

Early Work: CalCentral – collects data from a variety of sources to allow students to do the work they need to do. Became student front-end to new Peoplesoft SIS.

Cloud LRS (learning record store) in AWS. Pulls in feeds of data for storage and analysis. Built an LTI tool (student privacy dashboard) on Caliper data from Canvas. Not in production yet.

Athletic Study Center didn’t have a view into Canvas data. Putting advisors into Canvas as observers didn’t provide a manageable way to provide information.

Berkeley is framing student success around issues of diversity and inclusion.

BOAC/Data Loch solves some important problems: Canvas big data > UCB data lake > learning analytics; custom cohorts; an early warning system; stability, security, and scalability. About 900 students are in the system, used by 40 advisers. Storing cached data from APIs in AWS (live use of local APIs didn’t scale).

Goals: Bird’s-eye view of learning and other data emerging from varied sources; data collection layer; data processing layer (Redshift, Spark, Athena); deliver insights as a service (descriptive, predictive, prescriptive).

Using AWS Glue for ETL into data catalog, which can be queried by Redshift Spectrum and then tables extracted into Redshift.
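
A minimal sketch of that pipeline in Python, with bucket, role, database, and table names invented for illustration: boto3 kicks off a Glue crawler to populate the Data Catalog, after which Redshift Spectrum can query the external table and hot subsets can be materialized into local Redshift tables.

```python
import boto3

glue = boto3.client("glue", region_name="us-west-2")

# Hypothetical names -- crawl raw event data on S3 into the Glue Data Catalog.
glue.create_crawler(
    Name="canvas-events-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="data_loch",
    Targets={"S3Targets": [{"Path": "s3://example-data-loch/canvas-events/"}]},
)
glue.start_crawler(Name="canvas-events-crawler")

# Once cataloged, Redshift Spectrum can scan the external table directly,
# and hot subsets can be pulled into local Redshift tables:
spectrum_sql = """
CREATE TABLE analytics.recent_requests AS
SELECT user_id, url, request_ts
FROM spectrum_data_loch.canvas_events   -- external (Spectrum) schema
WHERE request_ts > '2018-01-01';
"""
```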

Largest dataset – Canvas event logs: 650 GB, ~4.5 billion request records, compressed into Parquet. A full table scan with Redshift cost $1.25 and took 5 minutes (at Spectrum’s ~$5 per TB scanned, that implies roughly 250 GB actually read).

Next steps: Expansion of advising to College of Engineering in fall 2018. How to tell story to faculty? EDW collaboration – could move into data lake. Where does application live? Is it yet another place to go? Implications for campus data and cloud platform strategy in general?

MyUW: Supporting the Student Lifecycle – Jim Phelps (Washington)

Used user-centered design, including a student diary study where they asked students to record their information needs.

Findings: information overload; critical information hard to find; time management is difficult; information needs are dynamic, but predictable

Design goals: Personal, critical, curated, relevant, timely

Arrive at: actionable and personalized content on “cards”

Show students what they most need based on time in the quarter, e.g. where are my classes in weeks 1 and 2, when can I register in week 5, when are my finals in week 8.
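
As a hedged illustration (card names and week ranges invented, not MyUW’s actual data model), the card logic boils down to mapping the week of the quarter to the most relevant cards:

```python
# Hypothetical sketch of time-aware card selection across a 10-week quarter.
CARDS_BY_WEEK = {
    range(1, 3): ["where-are-my-classes"],
    range(4, 6): ["registration-opens"],
    range(8, 11): ["final-exam-schedule"],
}

def cards_for(week: int) -> list:
    """Return the cards most relevant to this point in the quarter."""
    return [card
            for weeks, cards in CARDS_BY_WEEK.items()
            if week in weeks
            for card in cards]

print(cards_for(5))  # ['registration-opens']
```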

Aggregates data from multiple sources.

Understanding the student lifecycle experience – transitioning to UW, exploring majors, in major, transitioning to profession

Understanding the co-curricular experience – Present interest, social catalyst, internalized motivation, major blocker?, information seeking, participation in co-curricular. How do we build a social component to help students connect?

Husky experience toolkit – tailored messages.

Assessment – continue to assess the effectiveness and usability of MyUW and the Husky Experience Toolkit. Surveys, log analysis, “guerrilla” user studies. Feeds back into the user-centered design process.

UC San Diego – Harnessing the power of analytics for student success – Amin Qazi and Christopher Rice

What is a student? Matriculated; non-matriculated; extension; undergraduate/graduate/professional; how long is one considered a student?

What constitutes success? better grades; improved time-to-graduation; retention; quality of experience; getting needed courses; improved job performance or advancement; personal satisfaction

What do students want? High quality degrees that are more career oriented and time compressed; sequential programs linking graduate offerings; dual graduate programs; online degrees from reputable institutions; stackable progression models

Building advanced data capability to: prepare the university for the application of AI and Machine Learning; guide reallocation of scarce resources in data driven ways; harness automation; empower units to harness the power of analytics

Overview of next generation data warehouse – layered architecture. Core data in middle, applications feed data in and out, connected by APIs.

Platform predictive capabilities (built on SAP HANA) – also working with Google and Amazon.

Bringing all information from university into a single data warehouse. Activity hubs (employee, student, academic activity, facilities, financial activity, advancement & alumni activity). Working on student activity hub first. Real-time data, personalized messaging and interactions, complete data integration, next-generation data science. Three classes of analytics: institutional, academic, learning.

Curated view of data, de-identified – demographics, enrollment, majors/minors, retention, student statistics per term, etc. Reports are generated by people in business units, delivered by Tableau or Cognos or API. Multiple levels of security. 10 years of data from SIS, seven years of data from LMS, rolling out now.

Goal is to have four activity hubs up by end of year, and sunset the enterprise data warehouse in two years.

Retrospective and Predictive analytics for student success.

Architecture is not enough. You must build a culture around analytics: Communities of practice; data governance committee; missionary work; easy to use platforms & tools; pushing analytics to the edges, away from ITS.

Strategic Academic Program Development (use data to build RFPs) – meritocracy of ideas; reach across campus; experiment (fail, learn, repeat); focus on what is best for the student.


CSG Spring 2018 – Life in a Post Password World

We’re at Carnegie Mellon for the Spring CSG meeting. The first workshop is about “Life in a Post Password World.”

Password Security: How Safe Are Our Passwords – Richard Biever, Duke

Intro to passwords: How are they stored? What are hashes? What are the problems with hashes? (not all created equal – see NTLM in Windows)

What risk are we attempting to remediate (e.g. phishing or cracking)? (See https://haveibeenpwned.com/ )

Password Cracking – Methods of Attack: brute force; brute force with a mask (i->1, e->3, etc.); dictionary; rainbow tables (precomputed hashes) – GPUs make these easier to compute.

Vulnerabilities: Length, Type (passwords or passphrases), complexity

Password policies and entropy – higher entropy means harder to guess, but also harder to use. NIST 800-63 defines some standards. User-chosen passwords are less entropic: a user-chosen 20-character password can have less entropy than a randomly generated 8-character one.

Attack Dynamics: offline attacks against exfiltrated hashes. Microsoft hashes are the easiest. Modern GPUs are fast – a P50 does 14.7 billion hashes/sec – defeating 33.5 bits of entropy every second! It gets even faster if you use cloud GPUs.
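
A back-of-the-envelope check of those numbers (illustrative only): a uniformly random password carries length × log2(alphabet size) bits of entropy, and a rig’s hash rate corresponds to log2(rate) bits of entropy defeated per second.

```python
import math

def random_password_entropy(length: int, alphabet_size: int) -> float:
    """Entropy in bits of a uniformly random password."""
    return length * math.log2(alphabet_size)

# A random 8-char password over ~95 printable ASCII characters:
print(random_password_entropy(8, 95))    # ~52.6 bits

# 14.7 billion hashes/second defeats log2(rate) bits of entropy per second:
rate = 14.7e9
print(math.log2(rate))                   # ~33.8 bits per second

# Time to exhaust that 52.6-bit space at that rate:
print(2 ** 52.6 / rate / 3600, "hours")  # on the order of 130 hours
```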

Moving forward

Authentication strength: More than password strength: MFA for everyone. Vision – when you hit a Shib web site, you should be able to use 2FA plus certificates. There’s a new standard called WebAuthn that lets you use your phone as a token.

Ongoing projects: MFA for everyone – complete; MFA for VPN – complete; evaluating current password policy for security + user-friendliness; certificate management as a new factor in authentication strength.

Mark McCahill – Investigating facial recognition – now on smartphones (wouldn’t depend on just that one factor); needs a high-res camera, compute, and a fast network. A person database plus a GPU-backed machine learning model returns inferences against the person database. Duke researcher Guillermo Sapiro’s group: ~150 ms latency for facial recognition inference run against 100k faces.

Early POC project – facial recognition on doors: a cheap sensor (Raspberry Pi + 8-megapixel camera module + power supply, ~$100 total) can stream video to an inference engine and unlock the door. Issues: live detection via stereoscopic images? Gesture? Others? Consistent illumination? Neural Compute Stick – gigaflops on USB. Intel Movidius USB stick – 1 watt of power, 100 GFLOPS, 10 inferences/second in continuous inference mode, $79 retail. Move processing to the edge.

Survey Results – Tim Gleason (Harvard)

  • Do you believe that a long multi-word passphrase is sufficient? 78% said no
  • Do you use a personal Password Manager? 71% yes – LastPass most often cited.
  • How many passwords saved in your personal password manager? 249 median (was 80 at the 2016 meeting)
  • Is your password manager protected with 2FA – 54% yes.
  • What types of second factors are permitted?
    • Push on phone 23%
    • text message 17%
    • telephone call 21%
    • hardware token 22%
    • u2f fido tokens 11%
    • other 3%
  • Do you use certs for personal authentication? 20% yes
  • Have you had a central AD hack? 25% yes
  • Has your institution been bitten by payroll bank account transfer attacks? 58% yes

Tim Gleason – Harvard

Password policy and standards (https://policy.security.harvard.edu): Includes data classifications based on risk, password requirements, multifactor services, network positioning and protections. NIST 800-63B provides the reference model. Harvard’s policy states “all users are responsible for protecting their Harvard passwords…” Policy requires that different passwords be used for Harvard and non-Harvard accounts, and no shared passwords.

Three options for password complexity: HarvardKey (web auth system); passwords of 20+ characters; or under 20 characters with a bunch of complexity options. They don’t expire passwords.
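
A toy sketch of that length-based trade-off; the “bunch of complexity options” is reduced here to an assumed character-class rule with assumed thresholds, not Harvard’s actual policy.

```python
import string

def acceptable_password(pw: str) -> bool:
    """Sketch: >= 20 chars of any kind passes, or shorter passwords
    must mix character classes (assumed rule, not the real policy)."""
    if len(pw) >= 20:
        return True
    classes = [
        any(c.islower() for c in pw),
        any(c.isupper() for c in pw),
        any(c.isdigit() for c in pw),
        any(c in string.punctuation for c in pw),
    ]
    return len(pw) >= 10 and sum(classes) >= 3  # assumed thresholds

print(acceptable_password("correct horse battery staple"))  # True: length >= 20
print(acceptable_password("Tr0ub4dor&3"))                   # True: mixed classes
```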

Deployed DUO in 2016, required for most services in 2017. LastPass Enterprise – free for Harvard affiliates.

Support Challenges: Single identity for life philosophy; Identity proofing is a distributed function between offices; HarvardKey enrollment.

Password recovery is generally self-service; lost or misplaced DUO tokens can require helpdesk interaction; user community is 24/7 challenge for identity proofing.

Four methods are used by support teams for remote identity proofing: phone number, in-person, asking the person to take a selfie and comparing it to their official photo, and a trusted third party.

Considerable room for improvement in the user support experience.

A Decade of PKI – Jim Jokl, UVa

Why PKI? Stronger normal authentication for common applications, for use by everyone. Also strong authentication for sensitive data access.

Digital Certificate – binds a person’s public key to their identity, signed by a certification authority.

Chose to run two different Certificate Authorities: a Standard Assurance CA, targeted at standard apps; and a High Assurance CA – an offline CA using hardware crypto modules, issuing to hardware tokens, requiring in-person identity proofing. A few thousand in use at any one time.

High Assurance CA applications – VPN (each user gets a custom network filter for which apps they can access); System admins (ssh and web authentication for network management)

Standard Assurance CA apps – web authentication, VPN, wireless, S/MIME for signed and encrypted email (neither common nor encouraged).

Developed a provisioning tool that provisions certificate and wireless, VPN settings, security settings, and network registration.

WebSSO – most people have certificates on their devices so web authentication is easy to use. Still not used for most web logins.

Started process to migrate from home-grown CAs to SecureW2. Commercial product for standard assurance provisioning and CA. SecureW2 hosted web services provides: provisioning, configuration, SAML Authn.

Goal to switch to InCommon/Comodo for issuing certs. May end up using SecureW2’s CA instead.

Passwords Are Weeds – Stories from The Farm – Scotty Logan (Stanford)

June 2013 – HIPAA breach. Moved to required laptop encryption, then device management

Mid 90s – Built WebAuth SSO.

In 2011 built a two-step authentication system.

2013 – made everyone change passwords, and then AD was hacked, so they made everyone do it again. Made two-step mandatory for everyone. Late 2014, switched to Duo (keeping their old UI).

2014 – Meeting about IPv6 addressing and 802.1X for network authentication. Decided to use certificates. One per person, or one per person per device? Per-device certs have UX advantages (no need to transfer keypairs between devices, a lost device doesn’t affect other devices, can identify the device in addition to the person and associate device status with the cert).

2016 – WiFi and RADIUS – passwords still terrible, but Duo mitigates phishing. Still using WebAuth but no developers left; increased use of SAML 2.0 from external providers; certificates for WiFi authentication and device management in place.

Built CertCache – another CA – the root private key is never stored as a whole; a CloudPath sub-CA issues certs to device/person pairs. Data for associated devices is stored with each certificate. CloudPath calls webhooks when a cert is issued or revoked. They use AWS API Gateway to transform the URL into an SQS message – no active code, configuration in Terraform. CertCache receives notifications from SQS, queries CloudPath for certificate details, and stores them in MySQL. Certificate status is set to “unknown” while BigFix updates details. Cert authn is only allowed if status is “ok” or “unknown” (within seven days).
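
A rough Python sketch of that CertCache consumer loop as described; the queue name, CloudPath endpoint and payload fields, and database schema are all assumed for illustration.

```python
import json
import boto3
import requests  # CloudPath REST call (endpoint below is hypothetical)
import pymysql

sqs = boto3.resource("sqs", region_name="us-west-2")
queue = sqs.get_queue_by_name(QueueName="certcache-events")  # assumed name
db = pymysql.connect(host="localhost", db="certcache",
                     user="certcache", password="...")

while True:
    for msg in queue.receive_messages(WaitTimeSeconds=20, MaxNumberOfMessages=10):
        event = json.loads(msg.body)  # webhook payload relayed API Gateway -> SQS
        # Query CloudPath for full certificate details (hypothetical API path).
        detail = requests.get(
            f"https://cloudpath.example.edu/api/certs/{event['cert_id']}"
        ).json()
        with db.cursor() as cur:
            # Status starts as "unknown" until BigFix reports device details.
            cur.execute(
                "REPLACE INTO certs (serial, device, person, status) "
                "VALUES (%s, %s, %s, %s)",
                (detail["serial"], detail["device"], detail["person"], "unknown"),
            )
        db.commit()
        msg.delete()
```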

Late 2016 – concern about authentication after an earthquake. Switch to SAML 2, ditch WebAuth. What about WebLogin? It becomes just another SAML relying party.

2017 – work on migrating to containers on AWS.

Current Status – RADIUS: cert-authn VPN profiles in production – authz via CertCache (device status) and LDAP (account status); containerized; still on campus, but so is the VPN service – going to investigate RadSec; VPN logs go to SUNAC for finer-grained network access.

Current Status – WebSSO – Migrating everything possible to SAML 2.0; WebLogin behind the IdP, still only on campus; WebSSO and supporting services running in AWS, but masters still on campus.

Going to disable text message and voice call 2FA (easily attacked channels) for some populations.

Phillip Kobezak, VaTech

Got into PKI around 2006 – lots of documentation and procedures, three CAs. Used Aladdin eTokens for personal digital certificates – required in-person identity proofing. Used for limited populations. Started having support issues because the vendor relied on specific browser functions that were falling out of support. In 2009, started using Vasco tokens with one-time passwords.

Did a separate CA for wireless certs, which operated for several years as the primary wireless authentication until eduroam became popular. It is now shut down, but they still use a separate identifier for the network.

Personal Digital Certs – now issued self-service, distributed, online, including key escrow. Uses: S/MIME email and project documentation signatures; encryption of PDFs, including portfolios.

Two-factor deployment with Duo: the 2013 AD compromise pointed out the need for stronger auth. Duo on the enterprise directory, AD, and VPN for all users, including alumni. Still need to address desktops/laptops.

Path forward: evaluation of additional password-less approaches. Specifically interested in device registration with certs.


Higher Ed Cloud Forum: Epidemic Modeling in The Cloud: Projecting the Spread of Zika Virus

Matteo Chinazzi (Northeastern University)

MOBS lab — part of Network Science Institute at Northeastern, modeling contagion processes in structured populations, developing predictive computational tools for analysis of spatial spread of emerging diseases.

Heterogeneous interdisciplinary research group – physicists, economists, computer scientists, biologists, etc.

GLEAM – Global epidemic and mobility model – integrates different data layers – spatial, mobility, population data. For Zika, had to introduce mosquito data, temperature data, and economic data (living conditions).

Practical challenges:

  • unknown time and place of introduction of Zika in Brazil (Latin hypercube sampling + long simulations (4+ years))
  • Parameters need to be calibrated and estimated; prediction errors add stochasticity at runtime.
  • Intrinsic stochasticity due to epidemic and traveling dynamics
  • Need quick iterations between different code implementations

Each simulation takes 6-7 minutes, and they need > 200k simulations. Each scenario generates about 25 TB of data, needed within a day. Tried on-premise, but there weren’t enough compute cores, resources were shared and bursty, and there was no reliable solution to analyze data at scale.

Migration to GCP – prompt replies and assistance from customer support (“your crazy quota increase request has been approved”)

Compute Engine – ability to scale in terms of compute cores – up to 30k cores consumed simultaneously. Can keep data without saturating on-prem NFS partitions. BigQuery – ability to scale in terms of data processing. In < 1 day they can run simulations and analyze outputs.

Workflow steps: custom OS images for each version of the model; startup scripts to initialize model parameters, execute runs, perform post-processing, and move results to a bucket; a Python script to launch VMs, check logs, run analysis on BigQuery, export data tables to a bucket, and download selected tables to the local cluster. Other scripts create PDFs with simulation results.
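
A condensed sketch of that launch-and-analyze loop, with project, zone, image, and table names invented: the Compute Engine call goes through the google-api-python-client, and the aggregation runs in BigQuery.

```python
from googleapiclient import discovery
from google.cloud import bigquery

PROJECT, ZONE = "epi-sims", "us-central1-b"  # hypothetical
compute = discovery.build("compute", "v1")

# Launch a preemptible worker from a custom model image (~1/5 the price).
body = {
    "name": "gleam-worker-001",
    "machineType": f"zones/{ZONE}/machineTypes/n1-highcpu-16",
    "scheduling": {"preemptible": True},
    "disks": [{"boot": True, "initializeParams": {
        "sourceImage": f"projects/{PROJECT}/global/images/gleam-model-v42"}}],
    "networkInterfaces": [{"network": "global/networks/default"}],
    # Startup script initializes parameters, runs the simulation,
    # post-processes, and copies results to a GCS bucket.
    "metadata": {"items": [
        {"key": "startup-script", "value": "#!/bin/bash\n./run_gleam.sh"}]},
}
compute.instances().insert(project=PROJECT, zone=ZONE, body=body).execute()

# Later: aggregate simulation outputs in BigQuery.
bq = bigquery.Client(project=PROJECT)
job = bq.query(
    "SELECT scenario, AVG(attack_rate) FROM sims.outputs GROUP BY scenario")
for row in job.result():
    print(list(row))
```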

Numbers: 750k+ instances launched, 300 TB of data analyzed, 10M+ global epidemics simulated, 110+ compute years.

Lessons learned: Use preemptible VM instances (~1/5 the price, predictable failure rate); use custom machine types; run concurrent load jobs on BigQuery; use the Google Cloud Client Library for Python – from simulations to outputs with no human intervention; be aware of API rate limits.

Higher Ed Cloud Forum: Cloud experiences in the Swiss higher education market

Immo Noack

SWITCH – Swiss NREN. Swiss universities are members. Core competencies: Network, security, and identity management. Around 45 universities in Switzerland.

Have local SWITCH services based in data centers in Zurich and Lausanne.

Buy IaaS through GEANT, the pan-European organization. The GEANT tender is not valid for Switzerland, but its conditions apply. Three parts: original IaaS providers (direct); original IaaS providers (indirect); resellers for indirect IaaS providers. The providers are AWS and Microsoft.

SWITCH’s role is expanding its cloud offering with external suppliers, provided exclusively by SWITCH to Swiss higher ed. Data protection is a big concern – they don’t want data in the US. GDPR is coming next May.

Findings: universities are rather cautious and prefer to build their own resources (Switzerland still invests heavily in higher ed). The budget process is not prepared for cloud usage; university IT units want to keep the existing stuff, but researchers want the cloud.

Higher Ed Cloud Forum – Lightning Round #1

Phil Robinson – Cloud Progress at Cornell Student Services IT

First AWS account – July 2015 – adopted a cloud first strategy. Now have about 30 apps on AWS (migrations, rewrites, new apps). Automate with Jenkins and Ansible. Retiring on-prem VMs.

Custom class-roster app, used by students to decide what to take. Added a central syllabi feature this year. Using SNS+SQS as a message bus, orchestrating events; CloudFront delivery for syllabi; on-the-fly ClamAV scans on upload; Elasticsearch for search; SES for notifications by email. Developed in 3632 hours.
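
A sketch of that SNS+SQS message-bus pattern (topic, queue, and field names assumed): the upload handler publishes one event to an SNS topic, and per-task SQS queues subscribed to the topic fan out the ClamAV scan, search indexing, and SES notification steps.

```python
import json
import boto3

sns = boto3.client("sns", region_name="us-east-1")
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:syllabus-events"  # hypothetical

def on_syllabus_upload(bucket: str, key: str, course_id: str) -> None:
    """Publish one event; the scan, index, and email queues are
    all subscribed to the same topic."""
    sns.publish(
        TopicArn=TOPIC_ARN,
        Message=json.dumps({"bucket": bucket, "key": key, "course": course_id}),
        MessageAttributes={
            "event": {"DataType": "String", "StringValue": "syllabus-uploaded"}
        },
    )

# One of the fan-out consumers, e.g. the ClamAV scan worker:
sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = sqs.get_queue_url(QueueName="syllabus-clamav-scan")["QueueUrl"]
resp = sqs.receive_message(QueueUrl=queue_url, WaitTimeSeconds=20)
for m in resp.get("Messages", []):
    payload = json.loads(json.loads(m["Body"])["Message"])  # unwrap SNS envelope
    print("scan", payload["bucket"], payload["key"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=m["ReceiptHandle"])
```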

Looking towards containerizing and VDI.

Gerard Schockley – BU iPaaS ODS in AWS RDS

iPaaS ODS in RDS – an integration service designed to bring many data feeds into the SnapLogic platform, feeding an Operational Data Store. Using AWS Aurora.

Bob Winding – Cloud Automation Journey

Most fully automated in a GovCloud project. CloudFormation (VPCs, IAM, security groups, centralized alerts); Ansible and CloudFormation for server builds; console federation with ADFS; a consistent process for all project accounts; a new project account in a couple of hours; decentralized maintenance of CF templates.

Penn

What does “cloud native” mean at Penn?

Case study 1 – online giving portal: data ETL (Talend) to Postgres RDS (fundraising metadata); S3/CloudFront; to Oracle on-prem. Near real-time.

Case study 2 – service ordering (VDI and backup requests). On-prem PowerShell makes changes in AD groups and sends messages through SQS.

Case study 3 – device registration. On-prem registration; API keys handled in Lambda.

Sara Jeanes – Considerations in moving HPC workloads to the cloud

Initial framing questions: Do they have a preference for which cloud provider (do they have credits? different tech)? Is there a multi-cloud resiliency need?

Workload questions: Can it be interrupted (use spot instances)? For large workloads, firewall considerations (ScienceDMZ).

Jeff Minelli – Penn State – CloudCheckr enabling transparency at Penn State

Gain insights into financial transparency, spend optimization, resource utilization and right-sizing, cost allocation, best practices, security & compliance, collection and unification of AWS API data, continuous monitoring, reporting and alerts

Working with CloudCheckr to enable SAML. Basic group email notifications. Configuration of $100 spending alerts.

Trying to get CloudCheckr into InCommon.

Network Firewall Policies for Hybrid Cloud – Brian Jemes – University of Idaho

In the cloud, managing firewalls with server tags. Gets complicated when managing across on-prem and cloud. On-prem they have Cisco tools to manage ASA firewalls.

Options: manage hybrid cloud policy in on-prem firewall; manage hybrid policies with traditional firewalls in cloud; develop a hybrid tool.

Looking at a startup called Bracket Computing – cloud firewall policy manager. brkt.com – Provides micro-segmentation.

John Bailey – Washington University (St. Louis). Cloud IAM

Balance between security and usability. Enhancing usability with SPNEGO integrated auth, which leverages the Kerberos token from machine login to perform a web SSO login, making the web login invisible to the customer.
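
For the client side of that flow, a hedged sketch using the requests-kerberos package (the SSO URL is hypothetical): the Kerberos ticket from machine login is reused, so SPNEGO negotiates the web login with no password prompt.

```python
# pip install requests requests-kerberos
import requests
from requests_kerberos import HTTPKerberosAuth, OPTIONAL

# The TGT obtained at machine login is presented silently via SPNEGO.
resp = requests.get(
    "https://sso.example.edu/login",   # hypothetical SSO endpoint
    auth=HTTPKerberosAuth(mutual_authentication=OPTIONAL),
)
print(resp.status_code)
```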

Lou Tiseo – how categorizing resources helps to understand cloud usage

Requiring seven different tags. Using Cloudyn management dashboard. Helped save costs by using reserved instances.

Chris Malek Caltech – Automation tools for AWS ECS and Batch

deployfish – configures almost all aspects of ECS services (load balancing, app autoscaling, volumes, environment, etc.). They’ve open-sourced it. Create, inspect, scale, update, destroy, and restart ECS services with single commands; manage multiple environments (test, qa, prod, etc.). Integrates directly with Terraform. YAML-driven.

batchbeagle — lets people manage AWS Batch. Create, update, disable, and destroy queues. Create, update, disable, and destroy compute environments. Create job descriptions. Submit and manage jobs, etc.

Amanda Tan – Washington

Enabling cost notifications on AWS. Cost monitoring is difficult – it should be zero effort. Two-pronged attack: auto-tag resources, and send a daily email notification with total spend and resource usage. A CloudFormation template sets up CloudWatch, which invokes an auto-tag Lambda function. AutoTag tags resources with owner and principal-id. Notification works off DLT billing records, provided in S3 buckets twice a day.
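
A stripped-down sketch of such an auto-tag Lambda (field paths follow the CloudTrail RunInstances event shape; tag keys are assumed): a CloudWatch Events rule delivers the API-call event, and the handler tags the new instances with the creator’s identity.

```python
import boto3

ec2 = boto3.client("ec2")

def handler(event, context):
    """Triggered by a CloudWatch Events rule on CloudTrail RunInstances calls."""
    detail = event["detail"]
    principal = detail["userIdentity"].get("principalId", "unknown")
    arn = detail["userIdentity"].get("arn", "unknown")
    instance_ids = [item["instanceId"] for item in
                    detail["responseElements"]["instancesSet"]["items"]]
    # Tag keys here ("Owner", "PrincipalId") are illustrative choices.
    ec2.create_tags(
        Resources=instance_ids,
        Tags=[{"Key": "Owner", "Value": arn},
              {"Key": "PrincipalId", "Value": principal}],
    )
```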


Stop Doing Cloud Security Assessments

Wyman Miles (Cornell)

Technology risk assessments – a lot of sound and fury, but we don’t find problems, and we slow down implementation and governance. They’re currently doing 120 assessments per quarter with 4 security engineers.

Between cyber-liability insurance, contracts, and our portraying as risks what are really just vendor stances, what do we really need to do?

Indiana is jumping in feet first with HECVAT – the Box assessment is done, hosted by REN-ISAC.

Notre Dame discovered a product that was coded by two guys in Russia and discarded it from consideration as a result of a security review.

Maybe we should only do real reviews where we know that sensitive data will be in play?

Frequently we find issues with products that are already in use, with or without central governance knowing about it.

“Most risks we discover are really our petty issues with implementations”

Stanford – need to get out in front of what people are actually using, and then spend time facilitating proper use. Use network flow analysis, purchase records.

Self Service at Yale

Rob Starr, Jose Andrade, Louis Tiseo (Yale University)

The community told them they needed to be able to spin machines up and down at will for classes, etc. Started with a big local OpenStack environment, now building it out on AWS.

Wanted to deliver agility, automate and simplify provisioning, shared resources, and support structures, and reduce on-premises data centers (one data center by July 2018).

Users can self-service request servers, etc. Spinup – CAS integration, patched regularly, AD, DNS, Networking, Approved security, custom images.

Self-service platform – the current manual process takes (maybe) 5 days. With self-service, it takes 10 minutes. Offering: compute, storage, databases, platforms, DNS.

All created in the same AWS account. All servers have private IP addresses.
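
A bare-bones sketch of what a Spinup-style provisioning call might do behind the scenes (all IDs are placeholders; Yale’s actual implementation is PHP/Laravel microservices, not this code): launch from an approved image into a private subnet with a pre-approved security group and no public IP.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def provision_server(owner_netid: str, name: str) -> str:
    """Self-service path: ~10 minutes instead of the ~5-day manual process."""
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # approved custom image (placeholder)
        InstanceType="t2.medium",
        MinCount=1, MaxCount=1,
        NetworkInterfaces=[{
            "DeviceIndex": 0,
            "SubnetId": "subnet-0abc",             # private subnet (placeholder)
            "Groups": ["sg-0def"],                 # pre-approved security group
            "AssociatePublicIpAddress": False,     # private IP addresses only
        }],
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "Name", "Value": name},
                     {"Key": "Owner", "Value": owner_netid}],
        }],
    )
    return resp["Instances"][0]["InstanceId"]
```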

ElasticSearch is the source of truth.

Users don’t get access to the AWS console, but can log into the machines.

Built initial iteration in 3 months with 3 people. Took about a year to build out the microservices environment with 3-4 people. Built on PHP Laravel.

Have a TryIt environment that’s free, with limits.

Have spun up 1854 services since starting, average life of server is 64 days.

Dealing with Controlled Unclassified Information (CUI) – Notre Dame

Bob Winding and Kolin Hodgson from Notre Dame

How do you know you have CUI in a contract? Look for DFARS 252.204-7012 – it requires all DoD contractors and subs to comply with NIST 800-171 and to report incidents within 72 hours.

NIST 800-171 has 14 families of controls, with 109 controls.

C3 project scope – compliance with national research compliance standards. Decided to do it in AWS GovCloud with NIST templates.

No easy way to isolate sensitive data on campus.

Have a new domain not connected with campus, but federated with ADFS. AWS has a document that defines the ITAR boundary. Use Cloud Protection Manager to do backups in GovCloud. Have a shared-services hub, and each research project or team gets a separate account. CloudWatch and CloudTrail events are sent to a separate security account. Started with Lambda functions, but now use an event bus to send CloudWatch events to security.
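
A sketch of that event-bus pattern (account IDs and rule names invented): the security account grants PutEvents permission, and each project account forwards its CloudTrail-backed CloudWatch events to the security account’s default event bus instead of running per-account Lambda functions.

```python
import json
import boto3

events = boto3.client("events", region_name="us-gov-west-1")
SECURITY_ACCOUNT = "210987654321"  # hypothetical security account ID

# In the security account (one-time): allow a project account to put events.
events.put_permission(
    Action="events:PutEvents",
    Principal="123456789012",          # a project account ID (placeholder)
    StatementId="AllowProjectAccount",
)

# In each project account: forward CloudTrail-backed events to security.
events.put_rule(
    Name="forward-to-security",
    EventPattern=json.dumps({"detail-type": ["AWS API Call via CloudTrail"]}),
)
events.put_targets(
    Rule="forward-to-security",
    Targets=[{
        "Id": "security-bus",
        "Arn": f"arn:aws:events:us-gov-west-1:{SECURITY_ACCOUNT}:event-bus/default",
    }],
)
```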

Have the ability to burst to on-campus HPC. Many jobs (e.g. multiple Matlab simulations) work fine in AWS, but InfiniBand/MPI kinds of low-latency jobs don’t. They’re building a secure enclave on campus that can be tunneled to from AWS – a “reverse hybrid model”. The research computing folks will manage the on-prem enclave from GovCloud. They’re using Ericom Connect to do the virtual app streaming – it outperformed local machines in almost every case. Defining the audit boundary as the RDP client on the university-owned device.

Printing is not allowed.

A GovCloud account is actually a child of a commercial account, and the root account is in the commercial account. If you delete the commercial account, the GovCloud account goes away. It can take a few days to get a GovCloud account.

Issues – need to partner with Research group. Pushback from researchers on what’s really needed; software licensing; breaking out costs.

Higher Ed Cloud Forum 2017 – Intro and Multi Account AWS Strategy

Survey Results

46 institutions attending, 4 vendors, 81 unique roles among 90 attendees.

40% cloud first, 12% have a documented cloud exit strategy.

82% AWS, 14% Azure, 4% Google, 2% other

Staff readiness is the #1 obstacle to broad adoption

42% have signed the I2 Net+ agreement, 11% have enterprise agreement with cloud provider

21% have containers/serverless in production, 9% non-prod, 70% not currently adopting.

Managing and Automating a Multi-Account Strategy in AWS: Brett Bendickenson (Arizona)

Have their own agreement with AWS. Currently have about 75 accounts in their consolidated billing, 24 of them in central IT.

UITS Cloud Advisory Team — cross functional group from within UITS to advise and decide on cloud practices and policies.

  • Tagging Policy – extremely important to get right up front. Tags: service, name, environment, createdby, contactnetid, accountnumber, subaccount

Multi-account strategy. Workloads are segregated into production and non-prod accounts. The tipping point was properly restricting everything by permissions – you can do it with IAM roles, but it’s a lot of work. Decided on further segregation by teams/technologies, e.g. Kuali, PeopleSoft, IAM. Each has prod and non-prod accounts.

Each account has an account steward (director or dept. head) — responsible for spend, security, etc. Each account has an email list, with the address used for the root login address. Password stored in common vault, secured with MFA hardware token (kept in Ops). Linked to a central billing account. Set of account foundation templates are deployed. Started using AWS Organizations.

Account foundation modeled after the AWS NIST 800-53 Quickstart CloudFormation template. A set of CloudFormation templates deploys roles, security controls, etc. Sets up an EC2 instance that runs a set of Ansible playbooks to set up Shib, base AWS info, IAM, logging, and Lambda.

Federated Roles – SysAdmin, IAMAdmin, InstanceOps, ReadOnly, BillingPurchasing. Using Grouper for authorizations.

Using federated identities, no IAM users (generally).

CloudTrail enabled in all accounts. Enabled for all regions, records all API calls, sent to a central S3 bucket in the root account. CloudTrail logs are also saved to CloudWatch Logs in each account for local reference.
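
In boto3 terms, that per-account baseline might look like the following sketch (the trail name, bucket, and ARNs are assumed): one multi-region trail shipping to the central bucket, plus a local CloudWatch Logs group.

```python
import boto3

ct = boto3.client("cloudtrail")

ct.create_trail(
    Name="org-trail",                          # assumed name
    S3BucketName="central-cloudtrail-logs",    # bucket in the root account
    IsMultiRegionTrail=True,                   # all regions
    IncludeGlobalServiceEvents=True,
    # Local copy for in-account reference (ARNs are placeholders):
    CloudWatchLogsLogGroupArn=(
        "arn:aws:logs:us-west-2:123456789012:log-group:cloudtrail:*"),
    CloudWatchLogsRoleArn="arn:aws:iam::123456789012:role/CloudTrailToCWL",
)
ct.start_logging(Name="org-trail")
```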

Alarms set for changes to network ACLs, security groups, root account activity, unauthorized access, IAM policy changes, access key creation, and CloudTrail changes. (Not all are used in non-prod.)

Lambda functions – alarm details (interrogates CloudTrail events and sends the actual API calls that raised the alarm); CreatedBy automated tagging for EC2 instances; OpsWorks tagging helper; Route53 helper (updates DNS); tag monitoring – checks tags on instance launch (looking at Cloud Custodian from Capital One (open source)); AMI lookup.

Arizona’s code: https://bitbucket.org/ua-ecs/service-catalog

CSG Fall 2017 – Campus Safety & Security, pt 1

We’re at Virginia Tech this time. The topic of this special day-long workshop at CSG is about Campus Safety and Security and what we’ve learned in the ten years since the VaTech shootings and in the wake of other major events at our campuses in terms of mass notifications and using technology to protect the people at our institutions.

Scott – The technology is easy; once we’ve communicated what the capabilities and limitations of the systems are, realistic expectations can enable planning.

The VaTech President formed a working group as an outcome of the 2007 event: the Telecom Infrastructure Working Group. Looked at 14 major university and regional systems. Involved over 80 committed professionals and faculty from IT, law enforcement, and administration, with contributions from more than 60 additional individuals. Examined: performance, stress-response, and interoperability of all communications for multiple areas. Notifications to the community, internal communications, etc. Who is the community, and how are they notified? What’s the risk of sending targeted communications? It’s increasingly feasible to know the locations of individuals – do we track that and attempt to target notifications accordingly? The nuances of what the event is have importance. How many preformed message templates should you have? It’s important to vet the accuracy of the information being communicated – that takes time for analysis, but how much time do you take?

In the analysis, the technology was only involved in the response — the mitigation, preparedness, and recovery involved other parts of the institution.

WebEx with Klara Jelinkova from Rice – Hurricane Harvey Response

Wed Aug 23 – Harvey strengthens to tropical storm
Thursday strengthens to Cat 1
Friday goes to Cat 4 and makes landfall.

When it happens that quickly, you have what you’ve got: they had a service list with criticality and an emergency preparedness plan for when people can’t come to work. The primary datacenter can operate for 10 days without power, and they needed it. The secondary network is on a medical backbone.

Planning – moving to VoIP; not all data was available in off-site tape backup, so they did a quick emergency backup to AWS Glacier (which challenged the firewall) – now looking at getting rid of tape entirely. Also looking at backup of HPC and research data — the researchers are supposed to pay for it, but nobody does. Moving major systems to cloud.

New plans they need: load balancers are dependent on OIT datacenters being operational – looking at a redesign in the cloud. IDM utilizes SMU for continuity, but needs to move to the cloud for scaling. Have a sophisticated email list service – everybody wanted to use it rather than the broad-blast emergency notification system. Realized that the list service is more critical than the alert system.

CISO was flooded and evacuated, so the learning management person ended up running the IT crisis center.

Institutional lessons: Standing Crisis Management Team – good. Includes student representation. Contracts – where are you on the list for food and fuel delivery? Things that matter: flushing toilets, drinking water, food, payroll (people go to the cash economy, so make sure they have funds), network, communications services. Knowing where your people are and what they are facing – where do they live; mash that up with flooded areas – can they get to work, do they have internet, etc. Loaded everybody from the ERP, geocoded addresses, put them on a map, and overlaid intelligence. Had needs assessment tools: housing assessments, childcare, etc. (forms built in Acquia). A lot of the hourly workers are not English speakers and don’t have smartphones (or don’t know how to get to the resources). Put students to work in phone banks to call every person who didn’t respond to surveys. Put together departmental reports that they sent daily. Had fewer requests for temporary housing than offers to house people. Assessed the impact of damage on specific courses. This was used to figure out when they were ready to reopen.

What worked: collecting data centrally but distributing initial assessment to divisions for analysis and followup. Didn’t sweat getting the data perfect initially. Gave deans and VPs sense of ownership. Brought in an academic geospatial research team for analysis that helped work with IT.

Quality of HR data was an issue.

Melissa Zak, UC Boulder, Ass’t Vice Chancellor of Safety  – Digital Engagement

October 5 – 3 significant events. Pre-event: strategy relations functional exercises, prior trainings, EMPG/EMOG/ECWG process and plans, alert notification systems, success of cyber teams (including law enforcement).

Somebody parked at the stadium and started chasing people with a machete. A low-threshold event because there was a small population present, but it included community members there for treatment. One person was on dispatch – which requires a lot of multitasking at the best of times. First alerts went out within 15 minutes of the first report to dispatch.

2nd event at 1 pm – a coffee shop employee called the corporate office about the first event, and they directed closing all the shops in the city, which led to reports of active-harmer events at multiple shops across the city. Social media began to erupt from campus. Sent out an alert that it was all clear, that there was no incident.

3rd event – 7:37 pm, another alert went to one student from another college about an event. But then people started wondering whether the alert system had been hacked. This really highlights the impact of messages spreading by social media – students will drive the event.

What went right? Great communication partnership with CUPD, CU, Boulder Police, Coroner, and CU Athletics.

What didn’t go as well? Messaging and clarity of messages. Community notification channels are important. If you have lots of people subscribed, it takes time to receive messages, and they may not arrive in order. Have now realized that sending notifications every 15 minutes is the best cadence. Now have a policy to send notifications informing people of any major deployment of police.

How do we deal with people who mainly communicate via social media channels?

Communication resource limitations – need to invoke more resources than just the one dispatcher.