Oren’s Blog

CSG Fall 2017 – Big Data Analytic Tools and AI’s Impact on Higher Education

Mark McCahill – Duke

How did we get here?

Big data / AI / machine learning driven by declining costs for:

  • sensors: IoT and cheap sensors create mountains of data
  • storage is cheap – we can save data
  • networks are fast – data mountains can be moved (and so can tools)
  • CPUs are cheap (and so are GPUs)

Massive data – IoT; internet of social computing platforms (driving rapid evolution of analytic tools); science – physics, genomics

Big data analysis tools – CPU clock speeds are not increasing much, so how can we coordinate CPUs running in parallel to speed analysis of large datasets? Break the data into parts and spread the work – Hadoop MapReduce.

Map/Reduce -> Hadoop -> Apache Spark

Apache Spark – open source MapReduce framework. Spark coordinates jobs run in parallel across a cluster to process partitioned data.
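As a rough illustration of the partition / map / reduce pattern described above, here is a minimal single-machine sketch in plain Python. It is not Spark or Hadoop code: the helper names (`partition`, `map_partition`, `map_reduce`) are hypothetical, and a thread pool stands in for the cluster workers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def partition(records, n_parts):
    """Split records into n_parts roughly equal slices (hypothetical helper)."""
    return [records[i::n_parts] for i in range(n_parts)]


def map_partition(lines):
    """Map step: count the words found in one partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def map_reduce(records, n_parts=4):
    """Run the map step on each partition in parallel, then merge the results."""
    parts = partition(records, n_parts)
    # Threads stand in for cluster workers here; a real cluster would ship
    # each partition to a different machine.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partial_counts = list(pool.map(map_partition, parts))
    # Reduce step: combine the per-partition counters into one.
    return reduce(lambda a, b: a + b, partial_counts, Counter())


lines = ["big data big tools", "data moves fast", "tools move too"]  # made-up input
word_counts = map_reduce(lines, n_parts=2)
print(word_counts["big"], word_counts["data"], word_counts["tools"])
```

The point of the pattern is that the map step needs no coordination between partitions, so only the small per-partition summaries have to be moved and merged.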

Advantages over Hadoop: 10 – 100x faster than Hadoop (through memory caching); code and transformation optimizers; support for multiple languages (Scala, Python, SQL, R, Java)

2 ways to use Spark:

  • semi-structured data (text files, gene sequencer output);  write transforms and filter functions; classic map/reduce
  • Structured data (implicitly or explicitly named columns); transform and filter structured data using R-style dataframe syntax; SQL with execution optimizers.

Spark data models:

  • RDD (Resilient distributed dataset) storage allows Spark to recover from node failures in the cluster
  • Datasets – semi-structured data with strong typing and lambda functions, custom memory management, optimized execution plans
  • Dataframes – dataset with named columns, supports columnar storage, SQL, and even more optimization

Ambari – tool to manage and deploy a Spark cluster.

Machine Learning

Use big datasets to train neural networks for pattern recognition based on ‘learned’ algorithms

Deep learning neural networks have multiple layers and non-linear activation functions

Common thread – training is expensive, and parallelism helps, lots of matrix math processing; GPUs start to look attractive.

Application areas: 3 lenses to look through in higher ed – research, coursework, operations

Example – TensorFire https://tenso.rs/

Case Study: Research

OPM has 40 years of longitudinal data on federal employees. Duke researchers have been developing synthetic data and differential privacy techniques to allow broader audiences to develop models run against data in a privacy preserving fashion.
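To give a flavor of what a differential privacy technique looks like, here is a minimal sketch of the standard Laplace mechanism applied to a counting query. This is a textbook illustration, not the Duke researchers' method; the function names, the `epsilon` value, and the sample ages are all assumptions made up for the example.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Return a noisy count of matching records.

    Adding or removing one person changes a count by at most 1 (sensitivity 1),
    so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)
ages = [34, 51, 29, 62, 45, 58, 41]  # made-up values, not OPM data
noisy = private_count(ages, lambda a: a >= 50, epsilon=0.5, rng=rng)
print(round(noisy, 2))
```

A smaller `epsilon` means more noise and stronger privacy; any single answer is off, but averages over many queries stay close to the truth.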

Model verification measures: test how well a model developed against the synthetic data fits the real data. Verification measures for models need to run against many slices of the data and OPM data is big. Initial approach: run regression models and verifications from R on a large Windows VM against a MS-SQL database. But R is single threaded, so custom code was written with the R parallel library to run and manage multiple independent Windows processes and use more of the server’s CPU cores.

Rewrite R code to use SparkR

Initial setup: copy CSV files to HDFS; parse CSV and store in Spark Dataframe; treat Dataframe as a table; save table as a Parquet file (columnar format)

Read parquet file and you’re ready to go – SQL queries and R operations on dataframes

Lessons: Social science researchers can probably benefit from tools like Spark; Without access to a sandbox Spark cluster, how can they get up to speed on the tooling? RStudio continues to improve support for Spark via SparkR, SparklyR, etc. Plan to support Spark and R tools for large datasets.

Case Study: Coursework

Spring 2017 – grad biostats course using Jupyter notebooks.

PySpark – Python API for Spark.

Lessons learned: Course assignment k-mer counts in ~ w minutes on a 10 server cluster (each server is 10 cores + 25 G) for 40 students.
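For readers who have not met k-mer counting, here is a minimal plain-Python sketch of the core computation. It is not the course's PySpark assignment: the function names and the two short reads are made up, and in a cluster job this counting would run once per partition of the sequence data before the counts are merged.

```python
from collections import Counter


def kmers(sequence, k):
    """Yield every overlapping substring of length k in the sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def count_kmers(sequences, k):
    """Count k-mers across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(kmers(seq.upper(), k))
    return counts


reads = ["ACGTAC", "GTACGT"]  # made-up reads, not course data
kmer_counts = count_kmers(reads, k=3)
print(kmer_counts.most_common(3))
```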

Azure has a Jupyter + Spark offering but not configured for courses.

Research

Duke has a wealth of PHI data that researchers want – lives in a secure enclave that is very locked down. Researchers want to use RStudio and Jupyter to run TensorFlow code against GPUs for image analysis. Don’t fight the firewall – automate the analysis tooling external to the protected enclave – package tools in a container. Don’t use Docker (not suited for research) – use Singularity.

Container tech developed at LBL. Runs as user not as root.

Singularity + GitLab CI – researcher does integrations and commits.

Lessons learned: Singularity not a big stretch if you do Docker. Automating build/deploy workflow simplifies moving tools into protected data enclaves (or clouds or compute clusters).

Using Machine Learning for campus operations. John Board (Duke)

We already have the data for or for … whatever. Enables manual mashups to answer specific questions, solve specific problems. Will rapidly be giving way to fully automated pulling of needed data for unanticipated questions on demand. Data – speak for yourself and show us what’s interesting.

Academic outcome prediction – easy to ask – does your grade in Freshman calculus predict sticking with and success in engineering 2 years out. Hard part is asking – what are the predictors for success? Should be able to poll corpus of university data (and aggregate across institutions).

Computing demands are significant.

How will Machine Learning Impact Learning? – Jennifer Sparrow, Penn State

What computers can’t teach (yet) – analyze, evaluate, create

What computers can teach: remember, understand, apply

b.book – robot-generated textbook. Faculty puts in desired learning outcomes, robot finds relevant articles (only uses Wikipedia so far). 85% of students prefer this over traditional textbook. 35% often or very often visited pages not a part of required readings.

When students make their own book using b.book – 73% said bbookx surfaced content they never encountered before in their major.

Brainstorming tools – intelligent assistant for brainstorming partner. Can use b.bookx for iterative exploration.

Prerequisite knowledge – as bbook found texts, the computer was able to identify prerequisite knowledge and provide links to that.

Generated assessments – “adversarial learning”. Robot can generate multiple choice questions. Bbook doesn’t have any mercy. Computer generated assessments are too hard – even for faculty.

How do we put these tools in the hands of faculty and help use them to create open educational resources?

Simplification – grading writing assignments. Can use natural language processing to give feedback at a richer level than they’d get from faculty or TAs. Students can iterate as many times as they want. Faculty finding that final writing assignments are of a higher quality when using the NLP.

Learning analytics – IR & BI vs data science: aims to precisely summarize the past vs aims to approximate future; enables people to make strategic decisions vs fine scale decision making; small numbers of variables vs hundreds to thousands of variables; decision boundaries are blunt and absolute vs decision boundaries are nuanced and evolved over time.

Able to predict a student’s grade within half a letter grade 90% of the time. Those that fall outside the 90% have usually had some unpredictable event happen. Biggest factors: previous semester GPA, historical average course grade, cumulative GPA, # of credits currently enrolled. Smaller factors: course level, # of students that have completed course; high school GPA; # of credits earned.
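The "within half a letter grade" figure above is easy to compute for any prediction model. A minimal sketch, assuming grades on a 4.0 scale and half a letter grade taken as 0.5 grade points; the function name and the sample grades are made up for illustration and are not Penn State data.

```python
def share_within(predicted, actual, tolerance=0.5):
    """Fraction of predictions within `tolerance` grade points of the actual grade."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= tolerance)
    return hits / len(actual)


# Made-up grades on a 4.0 scale, for illustration only.
predicted = [3.7, 3.0, 2.3, 3.3, 1.7, 4.0, 2.7, 3.0, 3.3, 2.0]
actual = [4.0, 3.0, 2.0, 3.7, 3.0, 4.0, 2.3, 3.3, 3.0, 2.0]
accuracy = share_within(predicted, actual)
print(accuracy)
```

In this toy data one student did far better than predicted, which is the kind of outlier the notes attribute to unpredictable events.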

First Class – VR teaching simulations.

Legal and ethical issues. How do we know students are submitting their own work? Can we use predictive models to analyze that? Is it ok to correlate other data sources? What kind of assumptions can we make? Do we talk to the students? Are we ethically obligated to provide interventions?

Peeling Back the Corner – Ethical considerations for authoring and use of algorithms, agents and intelligent systems in instruction – Virginia Tech  — Dale Pike

Provost has ideas about using AI and ML for improving instruction and learning.

example: different search results from the same search by different people – we experience the world differently as a result of the algorithms applied to us. Wouldn’t it be good if the search engine allowed us to “peel back the corner” to see what factors are being used in the decision making, and allow us to help improve the algorithm?

When it gets to the point where systems are recommending learning pathways or admission or graduation decisions it becomes important.

Peeling back the corner = transparency of inputs and, where possible, implications, when authoring or evaluating algorithms, agents, and intelligent systems. Where possible, let me inform or modify the inputs or interpretation.

Proposal – could we provide a “Truth in Lending Act” to students that clearly indicates activity data being gathered? Required; Opt Out; Opt In

What does a control panel for this look like? How do you manage the infrastructure necessary without crumbling when somebody opts out?

Why? False positives and the potential for rolling implications.

Filters = choose what to show; Recommendation = offer choices; Decision = choose

Data informed learning and teaching

Learning analytics – understanding of individual performance in an academic setting based (usually) on trends of learning activity outcomes. The potential impact of learning analytics is constrained by the scope of analysis: Timeframe, scope of data, source of data. Increasing the impact of potential analysis also increases “creepy factors”.

Personalized/Adaptive learning: individual activity or performance alters (based on model) the substance, pace, or duration of learning experience.

Examples: CMU Simon Initiative Vision – a data-driven virtuous cycle of learning research and innovative educational practice causes demonstrably better learning outcomes for students from any background. Phil Long and Jon Mott — N2GDLE. Now we fix time variable and accept variability in performance. In the future do we unlock the time variable and encourage everybody to get higher performance?

Big Data Analytic Tools and ML/AI’s impact on higher ed. Charley Kneifel – Duke

Infrastructure: CPU/GPU/Network as service. GPUs are popular but hard-ish to share – CPUs are easy. “Commodity” GPUs (actual graphics cards) are very popular (4 in a tower server, 20 in a lab, problems with power and cooling). Centralized, virtualized GPUs make sense (depending on scale), mix of “double precision/compute” and graphics cards. Schedulable resource (Slurm at Duke) and interactive usage. Availability inside protected enclaves. Measure resources – do you have idle resources? Storage – HDFS, object stores, fast local filesystem. Network – pipe sizes to Internet, Science DMZ, encrypted paths…; FIONAs with GPUs – edge delivery.

With VMware, servers have to be rebooted to de-allocate GPUs, so it’s fairly disruptive.

Cloud vs. on-prem clusters vs. serverless: Software packaging is important (portable, repeatable); Support for moving packages to cloud or into protected enclaves; Training capacity vs. operational use; Ability to use spare cycles (may need to cleanse the GPU); standard tool sets (Spark, Machine learning, …); Where is the data? Slurp or Sip (can it be consumed on demand)? Serverless support for tools used (only pay for what you use, but remember you need to manage it, agreements for protected data including BAAs); Customized ASICs and even more specialized hardware (Cloud or …); complex work flows.

Financial considerations: costs for different GPUs; peak capacity on prem vs cloud; pay for what you use; Graphics GPUs are cheap but need home and data.

Security – is data private, covered by data use agreement, protected by law, does it need to be combined with other data, is there public/non-sensitive/de-identified data that can be used for dev purposes, IRBs – streamlined processes, training…

Staffing and support: General infrastructure – can it support and scale GPUs and ML? Software packaging – Machine learning, R, Matlab, other tools? Tool expertise – both build/deploy and analysis expertise; operational usage vs research usage; student usage and projects – leverage them!

VP of research support levels and getting faculty up to speed – projects that highlight success with big data/analytics; cross departmental services/projects – generally multidisciplinary; university/health system examples at Duke; protected data projects are often good targets; leverage students; sharing expertise; reduce barriers (IT should be fast for this).

Campus Safety and Security, pt. 2

UVa events – Marge Sidebottom and Virginia Evans (UVA)

How do we determine where high risk areas are on any given day, and are they located in the right places for the controversy that might accompany any given guest speaker? Beginning to populate a system to record those. Look at controversial speakers, as well as protests. The lone wolf terrorist is the other common concern  – may find information that helps to plan better. Expect threat assessment team to look at issues within their own areas, and mitigate those – if they can’t then it escalates to the threat assessment team, which meets weekly.

Aug 11 & 12 – protest by white supremacists and neo-nazis. There were lots of advance preparations by the city and the campus. This was the culmination of a series of events over the previous months in different parks. Several hundred showed up at UVa on Friday night with lit torches and surrounded a small number of students. Violence broke out, but police dispersed activity. By late morning Saturday there were thousands in a small area of downtown Charlottesville, including heavily armed alt-right protesters. Then the car ramming event happened, and then the police helicopter crashed.

The University had begun planning three weeks prior to the event. Had 2 meetings a week of the emergency incident management team, and the President held daily meetings. There is a city/county/university EOC structure. The city decided to have their EOC in a different location, which compromised communications. University teams went on 12 hour shifts beginning Friday morning.

When protesters moved on campus, the events developed very rapidly. It became clear that they were not following the plan they had committed to.

Having the EOC stood up was very useful. Had the University’s emergency management team in a separate room, so they could be briefed regularly. At 11:50 on Saturday, cancelled activities on campus starting at noon to not have venues that presented opportunities for confrontations. Worked carefully with a long-planned wedding at the chapel, but it did take place. They were unaware of admissions tours that were going on – once they found out, rallied faculty to accompany student guides and families, and then ended tours early.

Taking care of the needs for mental health attention for participants is important.

John DiFava (MIT Chief of Police)

MIT culture – it can’t happen here, and it won’t happen here. Also the culture of the city of Cambridge is very open and loose. Campus police used to be focused on friendly service and would call in external agencies when in need. Times have changed – policing on campus is just as complex and demanding as any other type of policing. Universities are no longer isolated.

Columbine massacre had a tremendous impact – Officers followed procedure to establish perimeter and wait for tactical units to arrive. Now they are taught to make entry.

9/11 attacks had a significant impact on policing. MIT police lost all of their officers to other jurisdictions immediately. Interagency cooperation was inadequate. Created a cascading effect – the cavalry was out of town, so had to rely on local resources.

New reality – had to be able to function without assistance; aid would not arrive as quickly and in the quantity it once did.

Steps taken to improve capability and performance – a comprehensive approach: Recruitment process, promotional system, supervision, training improvements – do in-service training with Cambridge Police and Harvard; firearms requalification three times a year (twice during the day, once in low light); specialized training for every officer; active shooter training (with Cambridge PD, Harvard, and MSP).

Work with Institute entities – Emergency management reports to Police.

Emergency Communication: Interface Between Public Safety and IT
Andy Birchfield, Jeff McDole, Andy Palms: University of Michigan

Certain emergency phases – Pre-incident planning, inbound emergency notification, emergency assessment, emergency alert operation, emergency notification delivery. The value to the community of notifications is based on total time of all phases.

Pre-incident planning: Activities include: message templates; policy and procedure; establish expectations and know your community; analysis of delivery modes with recognition of delivery times for each mode; evaluation of lessons learned; training and exercises; prepare infrastructure.

Inbound emergency notification: Making it simple, do it like they do it every day (students choose their cell phones over using the emergency blue phones); Get as much information as possible: video, audio (phone), text; Enable people to contact us in the ways they know — social media, apps, etc; coverage and capacity; knowing where the person is.

Emergency Assessment: Issues include confirmation, authorization, timeliness. If you can get a message out in 8-10 minutes of an incident, you’re doing well.

Emergency alert operation: additional modes and desired content will delay message creation; decisions and effort slow the operator; Hick’s Law: the time it takes for a person to make a decision as a result of the possible choices they have: increasing the number of choices will increase the decision time logarithmically.
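Hick's Law is usually written as T = b · log2(n + 1), where n is the number of equally likely choices and b is a constant fitted from observation. A minimal sketch of what that means for an alert operator choosing among templates or delivery modes; the value `b=0.2` seconds is an assumption picked for illustration, not a measured figure.

```python
import math


def decision_time(n_choices, b=0.2):
    """Hick's Law: T = b * log2(n + 1); `b` is an assumed, illustrative constant."""
    if n_choices < 0:
        raise ValueError("n_choices must be non-negative")
    return b * math.log2(n_choices + 1)


for n in (1, 3, 7, 15):
    print(n, round(decision_time(n), 2))
```

Because the growth is logarithmic, each doubling of the option count adds a roughly constant delay rather than doubling the decision time.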

Emergency notification delivery: Speed is the priority. Issues: Get people to sign up for the right service(s) – there is not a single mode; infrastructure coverage. They can get delivery to every email inbox in Ann Arbor (~105k) in about 7 minutes, but email is not the only mode. They have apps with push notifications – time of delivery is right around 10 seconds. The future is focused messages to appropriate recipients, by topic, or location, by individual choice.
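The email figure above implies an average delivery rate, which is a handy number when comparing modes. A small worked calculation, treating the ~105k inboxes and 7 minutes from the notes as exact; the function names are hypothetical.

```python
def delivery_rate(recipients, minutes):
    """Average messages delivered per second over the whole send."""
    return recipients / (minutes * 60)


def minutes_to_deliver(recipients, rate_per_second):
    """Minutes needed to reach `recipients` at a steady delivery rate."""
    return recipients / rate_per_second / 60


email_rate = delivery_rate(105_000, 7)
print(email_rate)
print(minutes_to_deliver(105_000, email_rate))
```

That works out to about 250 emails per second on average, so the last inbox hears about the incident several minutes after the first, compared with roughly 10 seconds for push notifications.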

Emergency Notification Systems: Ludwig Gantner, Andrew Marinik, VA Tech

VTAlerts – designed with redundancy. Goal is that every member of the community will be notified by at least one channel. Originally built in-house, but now a complex hybrid environment with some local and some vendor channels in the cloud.

New beginning – recognition of prioritized support for public safety. Group within IT expanded to include more channels: VT Alerts, blue light telephones, next generation 911, security camera system. Having one group responsible gives one point of contact in IT for public safety officials. Having dedicated staff allows for much better response times. They’ve removed dependencies on single individuals.

Communication – notification and collaboration – use the ticketing system.

Sustainable support – important to be proactive rather than reactive in public safety. New monitoring capabilities, improved redundancy, long term planning, channel development.

Collaboration – IT recognizes technical needs; public safety prioritizes items.

ENS philosophy: What is happening, where is it happening, what do we want you to do about it? They have 21 templates.

Current Challenges: How do we institutionalize the process to avoid backsliding when people change? What are appropriate success metrics for system evaluation? What are the cyber-security concerns of the components and system as a whole?

Evolving Radio Technologies – Glenn Rodrigues, UC Boulder

LMR (land mobile radio) project at CU Boulder. Business problem: lack of ability to communicate between Public Safety officials and leadership during planned events and unplanned incidents; Officers don’t feel safe doing their job without proper communications. Plan of action: Complete LMR audit for University; short term fixes; long term fixes.

Audit: requirements – contractor had to be vendor neutral; LMR customer interview and use case mapping; technical recommendations backed with data. Output: Clients – CUPD + 9 other business units. Biggest problem was coverage inside buildings (and system overloads). Tech assessment: most equipment was over 10 years old and malfunctioning, no real resource dedicated to monitor and engage with customers, most portable radios were not optimal. Business assessment: lack of policy enforcement (internal and external); lack of visibility of individual unit needs; lack of engagement with business partners. Plans: stabilize current LMR system under limited budget in 3 months by replacing high risk or failed equipment, leverage existing University assets (monitoring, backup power). Longer term: want to patch LMR into the campus fiber backbone. RFI in process.

John Board – Duke

Had the opportunity to green field a managed, networked, camera system. Lawyers were concerned about lack of standardization and maintenance of existing cameras. Started with parking decks. Goal was evidentiary, not live surveillance. Budgeted cost actually included maintenance, ongoing verification and network and storage costs. All cameras installed and operationally verified by OIT. Cisco VSOM, decent API. 1024 cameras in operation now.

The institution is zealous about privacy. They have a policy about access to live and stored images; have a retention policy; there’s a committee that decides where cameras go (you can’t put cameras up outside the system). Challenge around need vs demand.

Wanting to do automated image analysis to verify that cameras are working, e.g. deviation of sample image vs reference image. EE faculty proposed writing an algorithm for this. After some experimentation came to an algorithm that filters ~80% of good cameras, while reliably identifying 100% of bad cameras. By using 3-day averages, safely filters 95% of good cameras – declaring victory!
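The idea of comparing a sample image with a reference image, and averaging over several days to smooth out one-off differences, can be sketched as follows. This is not the algorithm the EE faculty developed: the deviation measure (mean absolute pixel difference), the `threshold` value, and the tiny four-pixel "images" are all assumptions for illustration.

```python
from statistics import mean


def image_deviation(sample, reference):
    """Mean absolute pixel difference between two equal-sized grayscale images."""
    if len(sample) != len(reference):
        raise ValueError("images must have the same number of pixels")
    return mean(abs(s - r) for s, r in zip(sample, reference))


def camera_looks_bad(daily_samples, reference, threshold=40.0):
    """Flag a camera when its deviation, averaged over several days, is high."""
    avg_dev = mean(image_deviation(s, reference) for s in daily_samples)
    return avg_dev > threshold


reference = [100, 120, 140, 160]  # flattened reference image (made up)
healthy = [[102, 118, 141, 158], [90, 125, 150, 170], [100, 121, 139, 161]]
blocked = [[5, 4, 6, 5], [7, 5, 4, 6], [100, 120, 140, 160]]  # dark on two of three days
print(camera_looks_bad(healthy, reference), camera_looks_bad(blocked, reference))
```

The threshold sets the trade-off described above: it has to be low enough to catch every bad camera, at the cost of sending some good cameras for manual review.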

Va Tech – Crowd Monitoring and Management on Game Day – Major Mac Babb

Stadium holds 66k people. Originally built in 1965. Hokie Village across the street, 20 parking lots, most of which are licensed for alcohol.

Unified Command off 7th floor of stadium. 160 Officers, Office of Emergency Management, Communications / Dispatch, Rescue, Fire, Game Operations, Event Staff/Security/ADA Services (545 event personnel), Parking, and Stadium Facilities and Grounds Ops.

Technology Assistance – CAD terminals and radio dispatch. See same screens as regional center. Access to around 400 cameras around campus. Weather systems fed into ops center, Veoci incident management program, Athletics comms channels, social media, emergency notification system. Supported by security center at public safety building.

Team Tops Technology – University of Washington’s Approach to Crisis Communications – Andy Ward

Seattle Crisis Communication Team – News & Information, Police, Marketing, IT, Emergency Management, Housing & Food Services, UW Medical Center.

Roles – Initiator, Incident Commander (for communications), Communicator, Monitors

Crisis communications toolkit – UW Alert Blog (wordpress.com) — can send messages to banners on the UW home page and to the hot-line telephone. UW Alert (e2campus) sends text and email messages. UW Alert facebook and twitter channels. There’s an outdoor alert system (talkaphone) and an indoor alert system (PA capabilities on fire alarm system (problem is they have to send to all buildings at once)). Plan to use Red Cross Safe & Well system to account for people.

97% of time crisis communication team is activated by campus police – 20 some people, calling into a conference bridge. Initiator briefs team, primarily incident commander who decides what action to take. Person who initiates the call should be ready to send out the first message. Decide which tool(s) to use to send alert, and then team stays on the bridge after the message is sent.

Police are not the incident commanders for communications.

When incident is over, they send out an all-clear message.

IT’s role during an incident: Monitor technology performance; Troubleshoot immediately; Provide technical expertise; Provide depth to the team.

Police have ability to send messages if immediacy is needed.

Subteams from all 3 campuses meet and recommend policies.

CSG Fall 2017 – Campus Safety & Security, pt 1

We’re at Virginia Tech this time. The topic of this special day-long workshop at CSG is about Campus Safety and Security and what we’ve learned in the ten years since the VaTech shootings and in the wake of other major events at our campuses in terms of mass notifications and using technology to protect the people at our institutions.

Scott – The technology is easy once we’ve communicated what the capabilities and limitations of the systems are, so realistic expectations can enable planning.

VaTech President formed a working group as an outcome of event in 2007: Telecom Infrastructure Working Group. Looked at 14 major university and regional systems. Involved over 80 committed professionals and faculty from IT, law enforcement and administration, with contributions of more than 60 additional individuals. Examined: Performance, stress-response and interoperability of all communications for multiple areas. Notifications to community, internal communications, etc. Who is the community, how are they notified? What’s the risk of sending targeted communications? It’s increasingly feasible to know locations of individuals – do we track that and attempt to target notifications to that? The nuances of what the event is matter. How many preformed message templates should you have? Important to vet the accuracy of the information being communicated – time for analysis, but how much time do you take?

In the analysis, the technology was only involved in the response — the mitigation, preparedness, and recovery involved other parts of the institution.

WebEx with Klara Jelinkova from Rice – Hurricane Harvey Response

Wed Aug 23 – Harvey strengthens to tropical storm
Thursday strengthens to Cat 1
Friday goes to Cat 4 and makes landfall.

When it happens that quickly, you have what you’ve got: they had a service list with criticality and an emergency preparedness plan for when people can’t come to work. Primary datacenter can operate for 10 days without power, and they needed it. The secondary network is on a medical backbone.

Planning – moving to VOIP, not all data available in off site tape backup, so did a quick emergency backup to AWS Glacier (which challenged the firewall) – now looking at getting rid of tape entirely. Also looking at backup of HPC and research data — the researchers are supposed to pay for it, but nobody does. Moving major systems to cloud.

New plans they need: load balancers dependent on OIT datacenters being operational – looking at redesign in the cloud. IDM utilizes SMU for continuity, but needs to move to cloud for scaling. Have a sophisticated email list service – everybody wanted to use it rather than the broad blast emergency notification system. Realized that the list service is more critical than the alert system.

CISO was flooded and evacuated, so the learning management person ended up running the IT crisis center.

Institutional lessons: Standing Crisis Management Team – Good. Includes student representation. Contracts – where are you on the list for food and fuel delivery? Things that matter: flushing toilets, drinking water, food, payroll (people go to the cash economy, so make sure they have funds), network, communications services. Knowing where your people are and what they are facing – where do they live, mash that up with flooded areas – can they get to work, do they have internet, etc. Loaded everybody from ERP, geocoded addresses and put them on map and overlaid intelligence. Had needs assessment tools: housing assessments, childcare, etc. (forms built in Acquia). A lot of the hourly workers are not English speakers and don’t have smartphones (or know how to get to the resources). Put students to work in phone banks to call every person who didn’t respond to surveys. Put together departmental reports that they sent daily. Had fewer requests for temporary housing than offers to house people. Assessed impact of damage on specific courses. Was used to figure out when they were ready to reopen.
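The "geocode addresses and mash them up with flooded areas" step is, at its core, a point-in-polygon test. A minimal sketch using the standard ray-casting method; this is not Rice's actual tooling, and the rectangular flood zone, employee names, and coordinates are made up for illustration.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon's vertex list."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray cast to the right of the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


flood_zone = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]  # made-up rectangle
staff = {"employee_a": (1.0, 1.0), "employee_b": (5.0, 2.0)}  # made-up coordinates
affected = sorted(name for name, loc in staff.items()
                  if point_in_polygon(loc, flood_zone))
print(affected)
```

Real flood maps have irregular boundaries and many polygons, so a production version would lean on a GIS library, but the overlay logic is the same.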

What worked: collecting data centrally but distributing initial assessment to divisions for analysis and followup. Didn’t sweat getting the data perfect initially. Gave deans and VPs sense of ownership. Brought in an academic geospatial research team for analysis that helped work with IT.

Quality of HR data was an issue.

Melissa Zak, UC Boulder, Ass’t Vice Chancellor of Safety  – Digital Engagement

October 5 – 3 significant events. Pre-event: strategy relations functional exercises, prior trainings, EMPG/EMOG/ECWG process and plans, alert notification systems, success of cyber teams (including law enforcement).

Somebody parked at stadium and started chasing people with a machete. Low threshold event because there was a small population present, but included community members there for treatment. One person on dispatch – requires a lot of multitasking at the best of times. First alerts went out within 15 minutes of first report to dispatch.

2nd event at 1 pm – coffee shop employee called corporate office about the first event, and they directed closing all the shops in the city, which led to reports of active harmer events at multiple shops across the city. Social media begins to erupt from campus. Sent out an alert that it was all clear, that there was no incident.

3rd event – 7:37 pm another alert went to one student from another college about an event. But then people started wondering whether the alert system had been hacked. Really highlights the impact of messages spreading by social media – students will drive event.

What went right? Great communication partnership with CUPD, CU, Boulder Police, Coroner, and CU Athletics.

What didn’t go as well? Messaging and clarity of messages. Community notification channels are important. If you have lots of people subscribed, it takes time to receive messages, and they may not arrive in order. Have now realized that sending notifications every 15 minutes is the best cadence. Now have a policy to send notifications informing people of any major deployment of police.

How do we deal with people who mainly communicate via social media channels?

Communication resource limitations – need to invoke more resources than just the one dispatcher.

CSG Spring 2017 – Challenges of Shifting IT to a Trusted Business Transformation Partner

Jim Phelps (Washington) is setting the tone for the workshop on shifting IT to a business partner.

Crisis in retail – new Nordstrom building in Seattle with the network density of a data center and the reconfigurability of a maker space. User-centered design and hyper-personalization. Built on internet of things, machine learning and AI, all designed for end user on mobile devices. Driven by big data analysis in close to real time. Incredibly tight link between IT and business. Autonomous systems like Amazon Echo.

Not just retail – example of Pacific Northwest National Labs personalized app.

Technical Drivers and Cultural Drivers lead to Business Transformation

Comparing our crisis to their crisis – Google crisis in retail – 266 k results, crisis in higher education has 385k results. We have 147% more crisis!

What does digital mean for higher education? What can we do with AI, IOT, hyper-personalization to be user-centric and personalized? What help do our business partners want with this transition?

HBR Analytics asked business leaders what will be IT’s most important contribution to the business over the next three years?

Lowest ranked: Lead and implement most IT projects. What they really want is IT to drive business innovation, manage security and risk, support business-led IT initiatives and establish architectures to support digital. Looking for a different engagement model with evangelizing, consulting, brokering, coaching and (last) delivering.

Impacts of the distributed university – fragmentation in leadership and mission, as well as IT  and the business. IT’s leadership challenge: Working across IT teams to better align, work to overcome barriers between IT and the business; working upward to enable better informed more unified leadership; working across units to create shared language and definitions.

Tom Lewis (Washington) – A UCD Approach to the Five Methods of Engagement

Evangelizing – keep abreast of emerging digital trends and educate business partners on opportunities – know campus needs in order to pinpoint the right emerging trends, and work with campus to educate partners.

Consulting – Offer advice and frameworks to enable successful business leadership of technology investments. They offer a User Centered Design framework to help people focus on their users. Example of customer journey mapping. Points of engagement: project planning by helping draft project charter and scope; research design; data analysis. Helped team identify research questions, define scope, redirect focus from artifact to gathering insights, provide step-by-step advice.

Brokering – know campus need to provide internal connections; work with vendors to provide external connections; work with campus to provide leadership of SaaS investments. Example of Canvas selection and implementation. Identify opportunities through knowledge of campus needs. Validate campus needs and understand priorities.  Work intensively with the vendor.

Coaching – develop employee skills and share expertise with others.

Takeaways: Know campus needs, work with campus, work with service owners (internal or vendors).

Harvard IT Academy – Trusted Advisor, Facilitated by Deirdre O’Shea

IT Academy – Reskilling our IT professionals for a changing IT landscape, started summer 2015.  Skills identified – having a service mindset, being a trusted advisor, foundational knowledge of agile, ITIL, security, and project management. This year will start to identify technical skills by job families. Four levels in each competency – create common language, take foundational knowledge to think through how to apply it, take concepts and implement them for your team, expert, where you teach others. 52 IT facilitators across Harvard, 135 level 1 classes as of 4/30/17, 2,918 participant completions in Level 1.

Methods – co-facilitation, interactive dialog & exercises, challenge & support, materials, action plan. 3 year investment $1.5 million. Majority is for content licensing and bringing a vendor for ITIL certification. Two full-time staff.

Service Mindset – first class they rolled out. Trusted Advisor starts here. Three competencies: Accountability, Collaborative Partnerships, and Empathy. If you aren’t putting users at the center, you can’t become a trusted advisor.

Trusted Advisor – three competencies: effective communication, connecting, proactive problem solving. Introduction activity – who do you consider a trusted advisor? What characteristics did they demonstrate? (table exercise). Effective communication: factors that influence communication, active listening, miscommunication/ladder of inference, information exchange, questioning, exploring differences.

Active listening – create the right environment, listen until you no longer exist, paraphrase, perception check.

Connecting – partnership spiral, positioning ourselves as a value added partner, trust & credibility, developing & improving relationships.

View building trust as a Marathon of Sprints – good work sustained over time.

Proactive problem solving – identify future needs, influence strategies, motivate users to problem solve, benefits vs. features. Feature describes “what”; benefit describes “so what?”. A feature is what something is, a benefit is what something does.

Bring it together with a case study. Interview stakeholders, build advice, etc.

Christina Tenerowicz  (Colorado) – Business Analysis Relationship at CU

Business Analysis & Solution Architecture – started in Research Administration, now moved to central IT.

21 people in the group.

What’s successful? Business partnership, leadership, and technology. Everyone is accountable for successful delivery of a project. Program manager for each vertical – Research Admin, HR, Student Services, Academic Admin, Advancement and Athletics, Finance. Meet monthly with directors, do a multi-year roadmap showing business benefits along with costs and resources. They offer Business Needs Discovery and Requirements as a service. After a go-live you need to be there for post-implementation support and adoption.

Challenges – Relationship management; continual care and feeding; communication; educating and coaching leadership on business analysis.

Paul Erickson – “IT as either an adoption agency or a hospice”

Louis King – we don’t look where we can divest.

Mojgan Amini – Started putting all IT staff through Lean Six Sigma training, and invited business partners which really helped the conversation.

CSG Spring 2017 – Automating Campus Network Configuration

We’re at Yale for the Spring CSG meeting. It’s a beautiful, sunny New England spring day!

The first workshop is on automating campus network configuration, provisioning, and monitoring. Workshop presentations follow.

Mark McCahill – Duke – Thinking about network automation/monitoring

Campus wireless is one of the most complicated things we run. Campus APs – averages ~6k in 250 buildings across our campuses. RF spectrum issues. How reproducible are trouble reports?

How many staff support your network – 7.5 in engineering/architecture, 10 field staff, average.

We have not converged on network management tools at all.

Monitoring taxonomy – how can we categorize tools? Data gathering, analysis, alerting, trending.

Automation strategy – understand the environment – monitoring!
Ideal end-state – standardized process, consistent quality, reduced cycle time/increased productivity.

User centric monitoring of the wireless network

Users don’t tell us that much. Should I even tell IT there’s a problem? It’s not a good experience just because they don’t complain.

Crowd sourced monitoring – boomerang. javascript in a web page that attempts to download files of various sizes – can figure out latency and performance.
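A minimal sketch of the arithmetic behind this kind of measurement, in Python rather than the JavaScript that boomerang itself uses. It assumes each download takes roughly latency + size / bandwidth and fits a straight line through timed (size, seconds) samples. The function name and the model are illustrative assumptions, not boomerang's actual algorithm.

```python
def estimate_link(samples):
    """Least-squares fit of: seconds = latency + size_bytes / bandwidth.

    samples: list of (size_bytes, seconds) pairs from timed downloads.
    Returns (latency_seconds, bandwidth_bytes_per_second).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_size = sum(size for size, _ in samples) / n
    mean_time = sum(secs for _, secs in samples) / n
    sxx = sum((size - mean_size) ** 2 for size, _ in samples)
    if sxx == 0:
        raise ValueError("need at least two distinct file sizes")
    sxy = sum((size - mean_size) * (secs - mean_time) for size, secs in samples)
    slope = sxy / sxx  # seconds per byte
    if slope <= 0:
        raise ValueError("timings do not grow with file size")
    latency = mean_time - slope * mean_size  # intercept of the fitted line
    return latency, 1.0 / slope
```

Downloading several file sizes matters here: small files are dominated by latency and large files by bandwidth, so the fit needs both ends to separate the two.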

via.oit.duke.edu – zero-install. Duke’s shib page includes boomerang code. Results reported to via.oit.duke.edu, stored in a MySQL database. Self-service diagnostic testing available at via to check performance to various data centers.

https://github.com/duke-automation/via

You get into big data fairly quickly – it’s a statistics game. Put pages at your cloud and different data centers to measure to them individually.

Where are the trouble spots? Key questions; what are the chances of a good connection? Which wireless segments are overloaded? Instead of depending on vendor tools, use R to analyze data from boomerang. You can do statistical process control to gather objective measures.
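As a rough illustration of statistical process control on this kind of data, here is a Shewhart-style sketch in Python (the talk used R). It derives three-sigma control limits from a baseline of measurements and flags new points that fall outside them. The function names and the choice of a plain three-sigma rule are assumptions for illustration, not Duke's actual analysis.

```python
from statistics import mean, stdev


def control_limits(baseline):
    """Return (lower, center, upper) three-sigma limits for a baseline sample."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline measurements")
    center = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation
    return center - 3 * sigma, center, center + 3 * sigma


def out_of_control(baseline, new_points):
    """List (index, value) for each new point outside the control limits."""
    lower, _, upper = control_limits(baseline)
    return [(i, x) for i, x in enumerate(new_points) if x < lower or x > upper]
```

Fed with, say, per-segment latency samples, the flagged points give an objective "something changed here" signal that does not depend on anyone filing a trouble report.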

How to monitor when they can’t connect? Simulate users with strategically situated Raspberry Pi devices that do the EAP+PEAP authenticate & DHCP dance to get on the network. Source: https://github.com/duke-automation/raspi – dumping data into Splunk for analysis.
C program makes wpa_supplicant API calls to repeatedly cycle WiFi connection monitoring. Found bimodal distribution of DHCP response times. Also found no correlation between sites. Raspberry Pi tracking wlan interface drops, link quality and signal level.

Next steps – more Raspberry Pis in the field and more monitoring. Check http performance with boomerang on the Pis. Look more at DNS and number of SSIDs detected – could be rogue SSIDs.

Network data collection is a ‘big data’ problem, which is great for statistical analysis. Will use Apache Spark cluster to speed longitudinal analysis.

Should have an iOS app that says “this isn’t good for me right here right now”. Yale has one. https://github.com/YaleSTC/wifi-reporter

Eric Boyd, Michigan – perfSONAR overview

In the context of ScienceDMZ – how do you make sure you’re getting the end-to-end performance?

perfSONAR – enables fault isolation, verify correct operation, widely deployed in ESnet and other networks.

Problem statement – while networks interconnect, each network is owned by a separate organization – how do we troubleshoot across them? Performance issues are prevalent and distributed. Local testing will not find everything. “Soft failures” are different and often go undetected. Elephant flows (giant research loads) vs. mouse flows (web, email). With packets dropping at 0.0046%, you only get 5% of optimal speed.
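The talk did not say how the loss figure was derived, but a commonly cited approximation for why tiny loss rates hurt long flows is the Mathis et al. model, where steady-state TCP throughput is bounded by roughly (MSS / RTT) * C / sqrt(p). The sketch below is only that textbook approximation. The segment size, round-trip time and constant C = sqrt(3/2) are assumptions, not numbers from the presentation.

```python
from math import sqrt


def mathis_throughput_bps(mss_bytes, rtt_seconds, loss_rate, c=sqrt(1.5)):
    """Approximate upper bound on TCP throughput in bits per second.

    Uses the Mathis et al. model: rate <= (MSS / RTT) * C / sqrt(p).
    """
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    if rtt_seconds <= 0 or mss_bytes <= 0:
        raise ValueError("mss_bytes and rtt_seconds must be positive")
    return (mss_bytes * 8 / rtt_seconds) * c / sqrt(loss_rate)


# Hypothetical example: 1460-byte segments, 50 ms round trip, 0.0046% loss.
estimate = mathis_throughput_bps(1460, 0.050, 0.000046)
```

Under those assumed values the bound comes out in the tens of Mbit/s, a small fraction of a 1 Gbit/s path, and because of the square root, quadrupling the loss rate halves it. That is why soft failures hit elephant flows much harder than mouse flows.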

perfSONAR is open source, supported by ESnet, GEANT, Internet2.

Something will break – sometimes things break, + human error. 3 phases to deployment: get system up and running; holy cow, we have a lot of network problems; how do we keep it good?

Distributed information sharing mechanism. Can plug any tool into it. DNS lookups, building HTTP tool now. Using a $200 box to deploy on a network to measure performance. Trying to automate things – you don’t want to spend > .5 FTE on performance monitoring.

William Diegard – Rice – Network automation topics

You should do perfSONAR.

But there are some commercial products: Splunk, Extrahop, Deepfield. Not talking about the single largest automation system we run: the wireless controller.

What is automation? Anything that lets you spend time doing new things or be more efficient.

Splunk – MapReduce for your log file processing. The mother of all grep tools. Rice using it to track things and automate things like DMCA violations. Automatic system to look for POE shutdown on Cisco access switches. Monitor Data Transfer Node activities.

Extrahop – Application Performance Monitor, passive network traffic tap grabs “wire data” and reveals it. Makes you realize how little you know. Does a bunch of statistical analysis.  Answers question: “why is it slow?”  But – you have to care to take the time to look. Can deploy it in the cloud too. Rice uses it to measure eduroam performance, among other things.

Deepfield – it’s an Internet2 service you probably already have. Looks at traffic that Internet2 sees, shows where traffic is going, but you don’t see everything from your border. Does nice categorization.

Sean Dilda – Duke – Cartographer

Why? Network changes faster than diagrams. Troubleshooting network problems is hard! What port is this computer on? What VLAN? What firewalls?

What does it do? Logs into every switch/router every three hours and builds internal map of network. Can look up by IP and Hostname, or lookup by Mac, or see Switch/Router interface stats. They also pull in building data, link to floor plans and google maps. Can get summaries of VRF data, show VLAN stats (including same VLAN number on different LANs). It maps Layer 2 layouts. Great for showing how things plug together for local support staff or new network engineers. Will map routes from source to destination in a nice graphic layout.
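Cartographer's source was not shown, so the following is not its implementation. It is only a sketch of the general idea of tracing a path once you hold an internal map of the network: treat the map as an adjacency dictionary and run a breadth-first search. The device names in the usage note are hypothetical.

```python
from collections import deque


def find_path(links, source, destination):
    """Breadth-first search over an adjacency dict.

    links: dict mapping a device name to an iterable of neighbor names.
    Returns the list of hops from source to destination, or None.
    """
    if source == destination:
        return [source]
    previous = {source: None}  # device -> device we reached it from
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in links.get(node, ()):
            if neighbor in previous:
                continue  # already visited
            previous[neighbor] = node
            if neighbor == destination:
                # Walk the predecessor chain back to the source.
                path = [neighbor]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None
```

With a map such as `{"host-a": ["sw-1"], "sw-1": ["host-a", "core-1"], "core-1": ["sw-1", "sw-2"], "sw-2": ["core-1", "host-b"], "host-b": ["sw-2"]}`, calling `find_path(links, "host-a", "host-b")` walks through sw-1, core-1 and sw-2. A real Layer 2 or routed path lookup would also need VLAN, VRF and routing-table data.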

Who can use it? All IT staff across the university, and anyone with access to IPAM. (based on Grouper groups).

Use it to allow local staff with IPAM permissions to change port VLAN, bounce wired port, clear ARP entry, block IPs from network.

New tool: Planisphere. Combines data from Cartographer, DHCP, wireless/radius, device registration, end point management, VMWare, Cisco, etc. Can gather a lot of data about end devices.

Next steps: F5 load balancers, firewall rules and network ACLs, IPS blacklists, Planisphere metrics.

Plan to distribute source.

Scotty Logan – Stanford – Network Delegation and Automation

9 pairs of physical  firewalls, 600 virtual firewalls. Half the rules change per year. 65k firewall rules. Only 4-5k changes are manual. Firewall automation first deployed 2007.

1300 Local network admins active in last 30 days. Only 1200 people in IT job roles per HR.

SNSR – “snoozer” – self registration of devices.

If you come in via VPN or VDI, it shows who is associated with the session, to look up groups for authorization. Now have a web page for firewall requests that creates ServiceNow requests, and if you have permission it gets updated within an hour without manual intervention (or with an approval loop in ServiceNow).

Device compliance DB – Fed from devices, BigFix. VLRE. Very Lightweight Reporting Engine – runs on Macs and Windows, reports status of machines: do you have the firewall on, disk encrypted, etc? Started deployment of 802.1x with a dedicated RADIUS pool. Added integration with compliance API to see if device must be compliant and if it is.

OpsWare – automated switch and router management. Backup switch configurations nightly (can do diffs on them), scheduled config changes, check all devices for specific settings.
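The nightly-backup-plus-diff idea is easy to illustrate with Python's standard difflib. This is not how OpsWare works internally, just a sketch of comparing two saved copies of a switch configuration. The function name and labels are assumptions.

```python
import difflib


def config_diff(old_text, new_text, name="switch"):
    """Return a unified diff between two saved configurations as one string.

    An empty string means the configuration did not change.
    """
    diff_lines = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=f"{name} (previous backup)",
        tofile=f"{name} (latest backup)",
    )
    return "".join(diff_lines)
```

Run over each night's backups, a non-empty result points straight at the lines that changed, which is usually the first question in an outage review.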

Matt Brooks – CMU – Controlling Network Access

Limited release of .1x mostly in common spaces and where people float between buildings. WPA2 enterprise, pushing people towards it from clear-text network. Controls IP assignments via DHCP, most outlets are deactivated by default, self service portal for activating outlets. “Quick-reg” network for on-boarding.

Updating switch and router configs – NetMRI from Infoblox used for regular backups of running configs from every switch and firewall via TFTP; Visual diff tool to review changes; Password changes; software upgrades.  CMU NetConf – initial switch config, interface config changes day-to-day via self-service portal.

Scotty Logan – Dirty Dancing in the Cloud

Why are we moving to the cloud? Geo-diversity? Scalability? Cost? Availability?

Don’t do: artisanally crafted services; manual testing; manual deployment; tightly coupled services.

Do do: Devops, loosely coupled

Firewalls and IP addresses are not loosely coupled!
Difficult to get contiguous elastic IPs from AWS. So people do VPC VPNs, and Direct Connect and private routing. Like dumping a 1950s appliance in your brand new kitchen.

If only we had… Inter-networking, and transport layer security, and strong authentication… oh wait – we do!

But.. my CISO says we need to use static IPs – you need to talk to your CISO.

If you have to, use NAT gateways or NAT instances with Elastic IPs

Amazon now supports /56 IPv6 subnet for VPCs.

Azure only allows 200 Mbps per link, which HPC jobs can blow out very quickly. Duke doesn’t think that extending campus data center into the cloud is a good idea.

NetDB – Delegated administration. All 1300 local network admins can control firewalls, metadata, delegate domains, carve up address space, etc for devices via self-service.

William Diegard 

Needed to replace stack. Ended up with Infoblox. Talked about outsourcing DNS, which was a huge traumatic conversation. Good thing about picking Infoblox was the conversations around campus. Infoblox allows for self-service and visibility of networking to campus. Training session was done by user support team, not networking. They don’t have as much need for frequent firewall rules as other schools due to the broad segmentation of the network.

Matt Brooks 

Application Suite – all home grown. Moving more towards InfoBlox. Using NetReg as IPAM system – registration of machines on devices, tracks network and switch metadata. CANDO – tracks structured cabling on campus. Allows (central IT) users to request and sometimes modify outlet configs, and configure interfaces on systems you own in the data center (add VLAN to your trunk, etc). NetConf – switch auto-builder does automated provisioning of new switches and PortAdmin does automated configuration of interfaces based on activations in CANDO.

Staffing and Skill Sets for Network Automation Teams

Matt Brooks – CMU

How is team structured? 9 network ops engineers, two network design engineers, 3-4 network software engineers. Pair developers and network engineers in offices.

What do we look for in developers? Pick two: Developer experience (required) and one of SysAdmin or Networking experience.  Must be genuinely interested in learning the third part. Will learn the third skill by working outages for things that aren’t yours and working on technologies you don’t currently know. Look for generalists, not specialists. A truly curious and self-driven person. The team runs the servers they run on.

War stories

Mark – In early days of perfSONAR discovered that part of backbone wasn’t as strong as it should be. perfSONAR made one of the core routers slow down so much that hospital VOIP didn’t function. Silver lining was that it pointed out some bad router configs.

William – When you start giving client services access to tools, people can do powerful things. Someone wiped out the entire admin database, which caused all the ports to drop.

Scotty – Guy who runs DNS infrastructure pushed out a software change that took down DNS.

Eric – One of perils of network automation is you concentrate your mistakes. What was a local problem becomes a global problem. Automated config of Internet2 switches, in deployment paused between steps to check accuracy, which caused a race condition that erased all the rules on the network. Took Internet2 network down for 20 minutes.

Scotty – maybe we should be applying webscale iterative deployment and testing to our switches, where we have thousands of devices.
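One way to read "iterative deployment" for switches is a staged rollout: change one device, verify it, then widen the batch, and stop at the first failure so a mistake stays local. The sketch below shows that pattern only. The callbacks, batch sizes and return shape are hypothetical, and nothing here reflects tooling that any of the speakers described.

```python
def staged_rollout(devices, apply_change, is_healthy, first_batch=1, growth=4):
    """Apply a change in growing batches, stopping at the first unhealthy device.

    apply_change(device) pushes the change; is_healthy(device) verifies it.
    Returns (changed_devices, failed_device), where failed_device is None
    if every batch passed its health check.
    """
    if first_batch < 1 or growth < 1:
        raise ValueError("first_batch and growth must be at least 1")
    changed = []
    batch_size = first_batch
    index = 0
    while index < len(devices):
        batch = devices[index:index + batch_size]
        for device in batch:
            apply_change(device)
            changed.append(device)
        for device in batch:
            if not is_healthy(device):
                return changed, device  # halt: later devices stay untouched
        index += len(batch)
        batch_size *= growth
    return changed, None
```

The list of changed devices doubles as the rollback list when a health check fails.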

Matt – When moving some addresses to a new space, engineer copied a SQL block out of a wiki, but the where clause was outside the highlighted code area on the wiki, so the statement got applied all across the network. Took a full weekend to restore.
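A guard against this class of mistake is to run bulk statements inside a transaction and refuse to commit when far more rows are touched than expected. The sketch below uses Python's built-in sqlite3 module purely for illustration. The function, the table in the usage note and the row limit are hypothetical, and CMU's actual database and tooling were not described.

```python
import sqlite3


def guarded_update(conn, sql, params, max_rows):
    """Run one UPDATE or DELETE, rolling back if it touches too many rows.

    Returns the number of rows changed when the statement is committed.
    """
    cur = conn.cursor()
    try:
        cur.execute(sql, params)
        if cur.rowcount > max_rows:
            raise RuntimeError(
                f"statement touched {cur.rowcount} rows, limit is {max_rows}"
            )
        conn.commit()
        return cur.rowcount
    except Exception:
        conn.rollback()  # leave the data exactly as it was
        raise
```

With a hypothetical `hosts(name, subnet)` table, an `UPDATE hosts SET subnet = ?` that lost its WHERE clause would exceed any sensible `max_rows` and be rolled back instead of rewriting every address on the network.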

Also – tied into IdM systems. Deprovision outlets and systems that a person is individually responsible for. Glitch in IdM system caused 1500 active accounts to be de-activated. 4k devices deleted from network very rapidly. Cobbled together a script to figure out what happened and then put data back in place. Managed to do it in an hour.

CSG Winter 2017 – Recommendations/guides for updating IT skill portfolio

Paul Erickson – Nebraska

Framing the issue – don’t have the skill sets or expertise for a cloud world. Run, grow, transform – on-prem, traditional focus on “run” – how to change a working environment and shift investment/resources (how do you change an engine while the car is running?)?

What skill sets are we missing? Process management (granting/revoking; provisioning; integration; authorizations/permissions, vendor coordination, managing interdependencies); Integration; Product/Service management; Client relationship management.

People who contributed in the past might not have the skills to take us into the future. How do we offer them opportunities to grow that honor their contributions and allow them to grow? Adapt and evolve in an environment of continuous change.

Identify ideal employee skills. Help those who are great technologists make the transition.

Denise Cunningham – Columbia

Head of HR for technology division.

One reason people resist change is because they focus on what they have to give up instead of what they have to gain. Important to keep this at front of mind.

A framework for Organizational Performance & Change: Burke-Litwin Model

[Figure: Burke-Litwin model]

External environment (e.g. the cloud) impacts the organization. The spine of the model – external environment influences leadership, which influences management practice, then work unit climate, then motivation, then individual.

Focus on Work Unit Climate: What it feels like to work here; nature of our interaction with each other; interpersonal relations in the group; what we focus on and consider important.

What factors influence Work Unit Climate? Leadership and management practices. Work unit climate is the most direct factor in performance.

There’s a learning climate or a performance climate. Learning: emphasis on improving skills and abilities; stresses process and learning; motivated to increase competence and change. Performance: emphasis is on demonstrating skills; stresses outcomes and results; people are afraid to make mistakes or change.

Goals: Learning: quality; trying new things; original ideas; effort. Performance: following standard procedures; high performance standards; getting task done on time.

Feedback: Learning climate: supportive/coaching role; improving work quality; two-way feedback, questions encouraged. Performance climate: evaluative role; level of competence compared to other employees, one-way feedback, questions discouraged.

When implementing change employees want to hear about it from their manager.

There is no correlation between strong individual contributors and leaders.

Changing organizational culture can take 12-18 months. Or two years in higher education. Can’t do it at all without leadership being a part of it.

When people say they’re going to get in trouble, that can be a rationale for not changing. How do you make sure new staff don’t become part of a dysfunctional culture? Ask questions at hiring about the core values. Zappos does this well. Build the values into the performance appraisal.

 

 

 

CSG Winter 2017 – Cloud ERP Workshop

Stanford University – Cloud Transformations – Bruce Vincent

Why Cloud and Why now? Earthquake danger; campus space; quick provisioning; easy scalability; new features and functions more quickly

Vision for Stanford UIT cloud transformation program: Starting to behave like an enterprise. Shift most of service portfolio to cloud. A lot of self-examination – assessment of organization and staff. Refactoring of skills.

Trends and areas of importance: Cloud  – requires standards, process changes, amended roles; Automation – not just for efficiency – requires API integration; IAM – federated and social identities, post-password era nearing for SSO; Security – stop using address based access control; Strategic placement of strong tech staff in key positions; timescale of cloud ignores our annual cycles.

Challenges regarding cloud deployments: Business processes tightly coupled within SaaS products, e.g. ServiceNow and Salesforce; Tracking our assets which increasingly exist in disparate XaaS products; Representing the interrelationships between cloud assets; Not using our own domain namespace in URLs.

Trying to make ServiceNow the system of record about assets – need to integrate it with the automation of spinning instances up and down in the cloud.

Cloud ERP – Governance and Cloud ERP – Jim Phelps, Washington

UW going live with Workday in July. Migrating from old mainframe system and distributed business processes and systems. Business process change is difficult. Built an integrated service center (ISC) with 4 tiers of help.

Integrated Governance Model:  across business domains; equal voice from campus; linking business and technology; strategic, transformative, efficient…

Governance Design: Approach – set strategic direction; build roadmap; govern change – built out RACI diagram.

“Central” vs “Campus” change requests – set up a rubric for evaluating: governance should review and approve major changes.

Need for a common structured change request: help desk requests and structured change requests should be easily rerouted to each others’ queues.

Governance seats (proposed): 7 people – small and nimble, but representative of campus diversity.

Focus of governance group needs to be delivering greatest value for the whole university and leading transformational change of HR/P domains. Members must bring a transformational and strategic vision to the table. They must drive continuous change and improvements over time.

Next challenge: transition planning and execution – balancing implementation governance with ISC governance throughout transition – need to have a clear definition of stabilization.

Next steps: determine role of new EVP in RACI; Align with vision of executive director of ISC; provost to formally instantiate ISC governance; develop and implement transition plan; turn into operational processes

UMN ERP Governance – Sharon Ramallo

Went live with 9.2 Peoplesoft on 4/20/2015 – no issues at go-live!

Implemented governance process and continue to operate governance

Process: Planning, Budgeting; Refine; Execution; Refine

  • Executive Oversight Committee – Chair: VP Finance. Members: VP OIT, HR, Vice Provost
  • Operational Administrative Steering Committee: Chair: Sr. Dir App Dev;
  • Administrative Computing Steering Committee – people who run the operational teams
  • Change Approval Board

Their CAB process builds a calendar in ServiceNow.

USC Experience in the Cloud – Steve O’Donnell

Current admin systems  – Kuali KFS/Coeus, custom SIS (Mainframe), Lawson, Workday, Cognos

Staffing and skill modernization: Burden of support shifts from an IT knowledge base to more of a business knowledge base – in terms of accountability and knowledge.  IT skill still required for integrations, complex reporting, etc. USC staffing and skill requirements disrupted.

Challenges: Who drives the roadmap and support? IT Ownership vs. business ownership; Central vs. Decentralized; Attrition in legacy system support staff. At risk skills: legacy programmers, data center, platform support, analysts supporting individual areas.

Mitigation: establishing clear vision for system ownership and support; restructure existing support org; repurpose by offering re-tooling/training; Opportunity for less experienced resources – leverage recent grads, get fresh thinking; fellowship/internships to help augment teams.

Business Process Engineering – USC Use cases

Kuali Deployment: Don’t disrupt campus operations. No business process changes. Easier to implement, but no big bang.

Workday HCM/Payroll: Use delivered business process as starting point. Engaged folks from central business, without enough input from campus at large. Frustrating for academics. Workday as a design partner was challenging. Make change management core from beginning – real lever is conversations with campus partners. Sketch future state impact early and consult with individual areas.

Current Approach – FIN pre-implementation investment

Demonstrations & Data gathering (requirements gathering): Sep – Nov. Led by Deloitte consultants; cover each administrative area; work team identifies USC requirements; Community reviews and provides feedback. Use the services folks, not the sales folks.

Workshops (develop requirements)- Nov – Feb. Led by USC business analysts, supported by Deloitte; Work teams further clarify requirements and identify how USC will use Workday; Community reviews draft and provides feedback

Playbacks (configure): March – May. Co-led by consultants and business analysts; Workday configured to execute high-level USC business requirements; Audience includes central and department-level users

Outcomes: Requirements catalog; application fit-gap; blueprint for new chart of accounts; future business process concepts; impacts on other enterprise systems; data conversion requirements; deployment scope, support model

CIO Panel – John Board; Bill Clebsch; Virginia Evans; Ron Kraemer; Kelli Trosvig

Cloud – ready for prime time ERP or not? Bill – approaching cautiously, we don’t know if these are the ultimate golden handcuffs. How do we get out of the SaaS vendors when we need to? Peoplesoft HR implementation has 6,000 customizations and a user community that is very used to being coddled to keep their processes. ERP is towards the bottom of the list for cloud.

Virginia – ERP was at the bottom of list, but business transformation and merger of medical center and physicians with university HR drove reconsideration. Eventually everything will be in the cloud.

John – ERP firmly at the bottom of the list.

Kelli – at Washington we were not ready for the implementation we took on – trusted that we could keep quirky business processes, but that wasn’t the case. Took a lot of expenditure of political capital. Everyone around the table thought it was all about other people changing. Very difficult to get large institutions onto SaaS solutions because the business processes are so inflexible. Natural tendency is to stick with what you know – many people in our institutions have never worked anywhere else. Probably easier at smaller or more top-down institutions.

Ron – Should ask is higher-ed ready for prime time ERP or not? We keep trying to fix the flower when it fails to bloom. People changing ERPs are doing it because they have to – data center might be dying, COBOL programmers might be done. Try to spend time fixing the ecosystem. Stop fixing the damn flower.

Kelli – it’s about how you do systemic change, not at a theoretical level.

Bill – what problem are we trying to solve? Need to be clear when we go into implementations. At Stanford want to get rid of data centers – space is at too much of a premium, too hard to get permits, etc.

John – there’s an opportunity to be trusted to advise on system issues, integration, etc.

Kelli & Ron – The financial model of cap-ex vs. op-ex is a critical success factor.

Ron – separating pre-sales versions from reality is critical. That’s where we can play an important role.

John – we have massive intellectual expertise on campus, but we’ve done a terrible job of leveraging our information to help make the campus work better. We’ve got the data, but we haven’t been using it well.

Bernie – we need to start with rationalizing our university businesses before we tackle the ERP.

Ron – incumbent on us to tell a story to the Presidents. When ND looks at moving Ellucian they think what if they can stop running things that require infrastructure and licenses on campus? Positions us better than we are today. Epiphany over the last 6 months: We have to start telling stories – we can’t just pretend we know the right things to do. Let’s start gathering stories and sharing them.

Kitty – Part of the story is about the junk we have right now. The leaders don’t necessarily know how bad the business processes and proliferation of services are.