CSG Fall 2017 – Big Data Analytic Tools and AI’s Impact on Higher Education

Mark McCahill – Duke

How did we get here?

Big data / AI / machine learning driven by declining costs for:

  • sensors: IoT and cheap sensors create mountains of data
  • storage is cheap – we can save data
  • networks are fast – data mountains can be moved (and so can tools)
  • CPUs are cheap (and so are GPUs)

Massive data – IoT; internet of social computing platforms (driving rapid evolution of analytic tools); science – physics, genomics

Big data analysis tools – CPU clock speeds aren't increasing much – how can we coordinate CPUs running in parallel to speed analysis of large datasets? Break the data into parts and spread the work – Hadoop MapReduce.

Map/Reduce -> Hadoop -> Apache Spark

Apache Spark – open source MapReduce framework. Spark coordinates jobs run in parallel across a cluster to process partitioned data.

Advantages over Hadoop: 10–100x faster (through in-memory caching); code and transformation optimizers; support for multiple languages (Scala, Python, SQL, R, Java)

2 ways to use Spark (a sketch follows this list):

  • semi-structured data (text files, gene sequencer output): write transform and filter functions; classic map/reduce
  • structured data (implicitly or explicitly named columns): transform and filter using R-style dataframe syntax or SQL, with execution optimizers
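To make the two styles concrete, here is a minimal PySpark sketch (file names and column names are hypothetical) showing classic map/reduce over text next to dataframe/SQL over structured data:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("two-ways").getOrCreate()
    sc = spark.sparkContext

    # 1) Semi-structured: classic map/reduce word count over a text file
    counts = (sc.textFile("notes.txt")
                .flatMap(lambda line: line.split())
                .map(lambda w: (w, 1))
                .reduceByKey(lambda a, b: a + b))
    print(counts.take(5))

    # 2) Structured: named columns, dataframe transforms, and SQL
    df = spark.read.csv("grades.csv", header=True, inferSchema=True)
    df.createOrReplaceTempView("grades")
    spark.sql("SELECT course, AVG(score) FROM grades GROUP BY course").show()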

Spark data models (sketch below):

  • RDD (Resilient distributed dataset) storage allows Spark to recover from node failures in the cluster
  • Datasets – semi-structured data with strong typing and lambda functions, custom memory management, optimized execution plans
  • Dataframes – dataset with named columns, supports columnar storage, SQL, and even more optimization
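A minimal PySpark sketch of moving between the RDD and dataframe models (the data is hypothetical; note the strongly typed Dataset API is Scala/Java only, so Python exposes RDDs and dataframes):

    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.appName("models").getOrCreate()

    # an RDD of Rows – recoverable after node failures via lineage
    rdd = spark.sparkContext.parallelize([Row(gene="BRCA1", hits=12),
                                          Row(gene="TP53", hits=7)])

    df = spark.createDataFrame(rdd)   # RDD -> dataframe with named columns
    df.printSchema()
    back = df.rdd                     # dataframe -> RDD of Rows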

Ambari – tool to manage and deploy a Spark cluster.

Machine Learning

use big datasets to train neural networks for pattern recognition based on ‘learned’ algorithms

Deep learning neural networks have multiple layers and non-linear activation functions

Common thread – training is expensive, parallelism helps, and the work is mostly matrix math; GPUs start to look attractive.
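A toy NumPy sketch (shapes hypothetical) of why deep nets reduce to matrix math: each layer is a matrix multiply followed by a non-linear activation, which is exactly the workload GPUs accelerate:

    import numpy as np

    def relu(x):
        return np.maximum(0, x)          # non-linear activation

    rng = np.random.default_rng(0)
    X = rng.normal(size=(64, 100))       # mini-batch of 64 examples, 100 features
    W1 = rng.normal(size=(100, 256))     # layer weights are just matrices...
    W2 = rng.normal(size=(256, 10))

    h = relu(X @ W1)                     # ...so each layer is a matrix multiply
    out = h @ W2
    print(out.shape)                     # (64, 10)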

Application areas: 3 lenses to look through in higher ed – research, coursework, operations

Example – TensorFire https://tenso.rs/

Case Study: Research

OPM has 40 years of longitudinal data on federal employees. Duke researchers have been developing synthetic data and differential privacy techniques to allow broader audiences to develop and run models against the data in a privacy-preserving fashion.

Model verification measures: test the fit of a model developed against the synthetic data to the real data. Verification measures need to run against many slices of the data, and the OPM data is big. Initial approach: run regression models and verifications from R on a large Windows VM against a MS-SQL database. But R is single-threaded; custom code written with the R parallel library runs and manages multiple independent Windows processes to use more server CPU cores.
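The workaround is process-level parallelism: since the runtime is single-threaded, launch one worker process per core. The researchers used R's parallel library; the sketch below shows the same pattern in Python for illustration (fit_slice is a hypothetical worker):

    from multiprocessing import Pool

    def fit_slice(slice_id):
        # fit and verify one regression model against one slice of the data
        return slice_id, "model stats here"

    if __name__ == "__main__":
        with Pool(processes=16) as pool:          # one worker per CPU core
            results = pool.map(fit_slice, range(100))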

Rewrite R code to use SparkR

Initial setup: copy CSV files to HDFS; parse the CSV and store it in a Spark dataframe; treat the dataframe as a table; save the table as a Parquet file (columnar format)

Read the Parquet file and you're ready to go – SQL queries and R operations on dataframes
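A minimal sketch of that setup (the researchers used SparkR; shown here in PySpark for consistency with the course example below, with hypothetical paths and column names):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("opm-setup").getOrCreate()

    # parse CSV (already copied to HDFS) into a Spark dataframe
    df = spark.read.csv("hdfs:///data/opm.csv", header=True, inferSchema=True)

    # save as Parquet, a columnar format that later queries read efficiently
    df.write.parquet("hdfs:///data/opm.parquet")

    # read the Parquet file back: SQL queries and dataframe operations
    opm = spark.read.parquet("hdfs:///data/opm.parquet")
    opm.createOrReplaceTempView("opm")
    spark.sql("SELECT agency, COUNT(*) FROM opm GROUP BY agency").show()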

Lessons: social science researchers can probably benefit from tools like Spark, but without access to a sandbox Spark cluster, how can they get up to speed on the tooling? RStudio continues to improve support for Spark via SparkR, sparklyr, etc. Plan to support Spark and R tools for large datasets.

Case Study: Coursework

Spring 2017 – grad biostats course using Jupyter notebooks.

PySpark – Python API for Spark.

Lessons learned: the course assignment's k-mer counts ran in ~w minutes on a 10-server cluster (each server 10 cores + 25 GB) for 40 students.
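A sketch of k-mer counting as a classic map/reduce job in PySpark (the input path and k are hypothetical; a real assignment would parse FASTA/FASTQ records rather than raw lines):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kmers").getOrCreate()
    sc = spark.sparkContext
    k = 21

    reads = sc.textFile("hdfs:///data/reads.txt")   # one sequence per line
    kmers = (reads
             .flatMap(lambda s: (s[i:i + k] for i in range(len(s) - k + 1)))
             .map(lambda kmer: (kmer, 1))
             .reduceByKey(lambda a, b: a + b))
    print(kmers.takeOrdered(10, key=lambda kv: -kv[1]))  # ten most frequent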

Azure has a Jupyter + Spark offering but not configured for courses.

Research

Duke has a wealth of PHI data that researchers want – it lives in a secure enclave that is very locked down. Researchers want to use RStudio and Jupyter to run TensorFlow code against GPUs for image analysis. Don't fight the firewall – automate the analysis tooling external to the protected enclave and package the tools in a container. Don't use Docker (not well suited for research) – use Singularity.

Container tech developed at LBL. Runs as user not as root.

Singularity + GitLab CI – researcher does integrations and commits.

Lessons learned: Singularity not a big stretch if you do Docker. Automating build/deploy workflow simplifies moving tools into protected data enclaves (or clouds or compute clusters).

Using Machine Learning for Campus Operations – John Board (Duke)

We already have the data for … whatever. That enables manual mashups to answer specific questions and solve specific problems. This will rapidly give way to fully automated pulling of needed data for unanticipated questions, on demand. Data – speak for yourself and show us what's interesting.

Academic outcome prediction – it's easy to ask: does your grade in freshman calculus predict sticking with, and success in, engineering two years out? The hard part is asking: what are the predictors of success? We should be able to poll the corpus of university data (and aggregate across institutions).

Computing demands are significant.

How will Machine Learning Impact Learning? – Jennifer Sparrow, Penn State

What computers can’t teach (yet) – analyze, evaluate, create

What computers can teach: remember, understand, apply

BBookX – robot-generated textbook. Faculty put in desired learning outcomes, and the robot finds relevant articles (only uses Wikipedia so far). 85% of students prefer this over a traditional textbook. 35% often or very often visited pages that were not part of the required readings.

When students make their own book using BBookX, 73% said it surfaced content they had never encountered before in their major.

Brainstorming tools – an intelligent assistant as a brainstorming partner. Can use BBookX for iterative exploration.

Prerequisite knowledge – as BBookX found texts, the computer was able to identify prerequisite knowledge and provide links to it.

Generated assessments – “adversarial learning”. The robot can generate multiple-choice questions. BBookX doesn't have any mercy: computer-generated assessments are too hard – even for faculty.

How do we put these tools in the hands of faculty and help them use the tools to create open educational resources?

Simplification – grading writing assignments. Can use natural language processing to give feedback at a richer level than students would get from faculty or TAs. Students can iterate as many times as they want. Faculty find that final writing assignments are of higher quality when the NLP is used.

Learning analytics – IR & BI vs. data science:

  • aims to precisely summarize the past vs. aims to approximate the future
  • enables people to make strategic decisions vs. fine-scale decision making
  • small numbers of variables vs. hundreds to thousands of variables
  • decision boundaries are blunt and absolute vs. nuanced and evolving over time

Able to predict a student's grade within half a letter grade 90% of the time. Those who fall outside the 90% have usually had some unpredictable event happen. The biggest factors are previous-semester GPA, historical average course grade, cumulative GPA, and # of credits currently enrolled. Smaller factors: course level, # of students that have completed the course, high school GPA, # of credits earned.
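A hedged sketch of such a model in scikit-learn, using the factors above as features (the toy numbers and the linear model are illustrative assumptions, not Penn State's actual system):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # columns: prev_sem_gpa, hist_avg_course_grade, cum_gpa, credits_enrolled
    X_train = np.array([[3.2, 2.9, 3.1, 15],
                        [2.1, 2.5, 2.4, 12],
                        [3.8, 3.4, 3.7, 18]])
    y_train = np.array([3.0, 2.3, 3.6])          # final course grade, 4.0 scale

    model = LinearRegression().fit(X_train, y_train)
    print(model.predict([[3.0, 3.1, 3.2, 14]]))  # predicted grade, one student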

First Class – VR teaching simulations.

Legal and ethical issues. How do we know students are submitting their own work? Can we use predictive models to analyze that? Is it ok to correlate other data sources? What kind of assumptions can we make? Do we talk to the students? Are we ethically obligated to provide interventions?

Peeling Back the Corner – Ethical considerations for authoring and use of algorithms, agents and intelligent systems in instruction – Dale Pike, Virginia Tech

Provost has ideas about using AI and ML for improving instruction and learning.

Example: different search results from the same search by different people – we experience the world differently as a result of the algorithms applied to us. Wouldn't it be good if the search engine allowed us to “peel back the corner” to see what factors are being used in the decision making, and allowed us to help improve the algorithm?

When systems start recommending learning pathways or making admission or graduation decisions, this becomes important.

Peeling back the corner = transparency of inputs and, where possible, implications, when authoring or evaluating algorithms, agents, and intelligent systems. Where possible, let me inform or modify the inputs or interpretation.

Proposal – could we provide a “Truth in Lending Act”-style disclosure to students that clearly indicates what activity data is being gathered? Required; Opt Out; Opt In

What does a control panel for this look like? How do you manage the infrastructure necessary without crumbling when somebody opts out?

Why? False positives and the potential for rolling implications.

Filters = choose what to show; Recommendation = offer choices; Decision = choose

Data-informed learning and teaching

Learning analytics – understanding of individual performance in an academic setting, based (usually) on trends in learning activity outcomes. The potential impact of learning analytics is constrained by the scope of analysis: timeframe, scope of data, source of data. Increasing the potential impact of analysis also increases the “creepy factor”.

Personalized/Adaptive learning: individual activity or performance alters (based on model) the substance, pace, or duration of learning experience.

Examples: CMU Simon Initiative vision – a data-driven virtuous cycle of learning research and innovative educational practice causes demonstrably better learning outcomes for students from any background. Phil Long and Jon Mott – N2GDLE. Today we fix the time variable and accept variability in performance. In the future, do we unlock the time variable and encourage everybody to reach higher performance?

Big Data Analytic Tools and ML/AI's Impact on Higher Ed – Charley Kneifel (Duke)

Infrastructure – CPU/GPU/network as a service:

  • GPUs are popular but hard-ish to share – CPUs are easy
  • “commodity” GPUs (actual graphics cards) are very popular (4 in a tower server, 20 in a lab; problems with power and cooling)
  • centralized, virtualized GPUs make sense (depending on scale), with a mix of “double precision/compute” and graphics cards
  • schedulable resource (Slurm at Duke) and interactive usage
  • availability inside protected enclaves
  • measure resources – do you have idle resources?
  • storage – HDFS, object stores, fast local filesystems
  • network – pipe sizes to the Internet, Science DMZ, encrypted paths…; FIONAs with GPUs for edge delivery

With VMware you have to reboot servers to de-allocate GPUs, so it's fairly disruptive.

Cloud vs. on-prem clusters vs. serverless:

  • software packaging is important (portable, repeatable); support for moving packages to the cloud or into protected enclaves
  • training capacity vs. operational use
  • ability to use spare cycles (may need to cleanse the GPU)
  • standard tool sets (Spark, machine learning, …)
  • where is the data? slurp or sip (can it be consumed on demand)?
  • serverless support for the tools used – you only pay for what you use, but remember you still need to manage it (agreements for protected data, including BAAs)
  • customized ASICs and even more specialized hardware (cloud or …)
  • complex workflows

Financial considerations: costs for different GPUs; peak capacity on-prem vs. cloud; pay for what you use; graphics GPUs are cheap but need a home and data.

Security – is the data private, covered by a data use agreement, or protected by law? Does it need to be combined with other data? Is there public/non-sensitive/de-identified data that can be used for dev purposes? IRBs – streamlined processes, training…

Staffing and support: general infrastructure – can it support and scale GPUs and ML? Software packaging – machine learning, R, Matlab, other tools? Tool expertise – both build/deploy and analysis expertise; operational usage vs. research usage; student usage and projects – leverage them!

VP of research: support levels and getting faculty up to speed – projects that highlight success with big data/analytics; cross-departmental services/projects – generally multidisciplinary; university/health system examples at Duke; protected data projects are often good targets; leverage students; share expertise; reduce barriers (IT should be fast for this).

CSG Spring 2015 – The Data Driven University, part 2

Tom Lewis, Washington

Who are the traditional players? Institutional Research; Office of Educational Assessment; Data Warehouse Team (do good work, but saw their client as being Finance).

Modern players & practices – sources of change: from above (President, Provost, VPs, AVPs, Chancellors); from the middle (deans, chairs, heads of admin units – especially those focused on undergrads); from below (staff doing the work, faculty); from the outside (BI and analytics vendors).

Becoming Modern –

Course Demand Dashboards – Notify.UW. Enterprising students were screen-scraping the registration system to notify others about openings in courses, charging other students for the service. So the university built Notify.UW, which notifies students when openings occur in a class via email or SMS. Almost 25k subscribers. What else can be done with the data? Understanding course demand: Notify.UW knows what classes students want; the student system knows about course offerings and utilization of capacity. Mashing them up shows where demand exceeds capacity.

The cool stuff: central IT BAs and engineers pulled in a like-minded colleague from the DW to do innovation work with data. The Provost, deans, and chairs got excited; dashboards were built out using Tableau.

The Great Civitas Pilot – why student success analytics? People don't understand much about their students, when to do interventions, or longitudinal views of program efficacy and impacts. Tried Civitas – take data from the student system, LMS, and data warehouse. Illume: analyze key institution metrics, starting with persistence; view historical results and predictions of the future. Inspire for Advisors.

The cool stuff: admin heads looked to IT to help solve the problem because of the success of the course demand dashboards. Faculty, teaching, and program support staff are eager to get started.

Show Me the Data!

Assessment folks didn't understand the value of giving access to data that hasn't been analyzed. The IT team interviewed people about their data needs, then involved the assessment people in building Tableau dashboards to meet those needs.

Data Warehouse folks have gotten the religion – look at the UW Data & Analytics page.

Central IT is the instigator and change agent, but needs BAs with deep data analysis skills.

We all need to be hiring data scientists with deep curiosity – we can't keep having technical folks whose answer is that it takes too long to go through the data. We should partner with existing data science centers on campus. If we're really going to be data-driven universities, IT will be at the center – we touch all parts of the institution, we have the tools, and we know the most about how the data interacts.

Mark Chiang – UC Berkeley

Used to have to go to separate offices to get data, mash it up into spreadsheets, and do pivot tables – for every request.

Data Warehouse: Cal Answers – Students (applicants, curriculum, demographics, financials); Alumni; Finance; Research; HR; Facilities.

Built out high-level dashboards for deans and chairs to answer questions about curricula: enrollments, offerings, instructor data, etc. Facilitates discussions among deans, faculty, and administrators. The effort was driven by the CFO. Makes the job much easier. Led to substantial additional investment.

Can build out prototypes in a couple of weeks on top of live data to prove concepts before building the real enterprise work.

Discussion

Will the data warehouse look significantly different in a few years? We don’t do a good job of understanding the way data security needs to change as data ages. There’s a place to incorporate new types of data like sentiment analysis on social media. Instructure is working on making Canvas data available via AWS Redshift. Much of the new thinking and activity about data is not coming from the traditional BI/DW teams, but those folks are more willing to partner now than they used to be.

CSG Spring 2014 – Analytics Discussion

ECAR Analytics Maturity Index – could use it to assess which group to partner with and to judge feasibility.

NYU started analytics several years ago and chose certain kinds of data. 

Dave Vernon – Cornell
Hopes and dreams for the Cornell Office of Data Architecture and Analytics (ODAA)
Current state of data usability at Cornell: like a library system with hundreds of libraries, each with unique catalog systems (if any), each requiring esoteric knowledge, each dependent on specialists who don't talk to each other.

Traditional “BI” – not analytics but report generation. Aging data governance.

ODAA – to support Cornell's mission by maximizing the value of data resources. Act as a catalyst/focal point to enable access to teaching, research, and admin data. Acknowledge limited resources, but attempt to maximize the value of existing resources.

Rethink governance: success as the norm, restrictions only as needed? Broad campus involvement in data management – “freeing” of structured / unstructured data. Stop arguing over tools: OBIEE vs Tableau, etc. Form user groups – get the analysts talking. 

Service Strategy: Expand Institutional Intelligence initiative: create focused value from a select corpus of admin data (metadata, data provenance, governance, and sustainable funding). Cost recovered reporting and analytics services. User groups, consultants, catalog and promulgate admin and research data resources. 

Resource strategy: what do you put together in this office? Oracle people, reporting people. Re-allocate savings. Add skilled data and analytics professionals. Modest investment in legacy tool refresh. People are getting stuck in those discussions of tools.

Measures of success: ODAA becomes a known and trusted resource. Cultural evolution – open, not insular. Data becomes actionable and self-service. Broad campus involvement in data management, “freeing” of data – have to work on data stewards to convince them that they must make a compelling argument to keep data private. Continued success of legacy services.

At NYU, IR owns data stewardship and governance, but a group in a functional unit (not IT) acts as the front door for data access. Currently just an admin data focus, but growing out of that. Two recent challenges – student access to data (pressing on the governance structure), and learning analytics (people want access to LMS click streams – what about privacy concerns?).

Stanford – an IR group of about 15 people reports to the provost and handles admin data. A group reporting to the dean of research handles research data. Teaching & learning is under another VP. Groups specialize, reducing conflict. Data scientists are part of those groups.

Washington is spinning up a data science studio with research people, IT, and library people as a physical space for people to collocate.

Jim Phelps – can we use the opportunity of replacing ERPs to have the larger discussion about data access and analytics?

Notre Dame halted its BI effort to go deeply into a data governance process, and as part of that is getting a handle on all of the sets of data they have. Building a data portal (data.nd.edu) that catalogs reports – more a collection of data definitions than a catalog of data. A concept at this point, but moving in that direction. Registry of data – all data must be addressable by URL. The catalog shows existing reports and what roles are necessary to get access. Terms used in the data are defined.

Duke – Not hearing demand for this on campus, but getting good information on IT activity using Splunk on data streams. Could get traction by showing competencies in analysis.

At NYU, a new VP for Enrollment Management with lots of data analysis expertise wowed the Board with sophisticated analyses, driving demand for that in other applications.

Data Science Venn diagram – http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Dave Vernon – There’s an opportunity to ride the big data wave to add value by bringing people together and getting the conversations going and make people feel included. 

How are these teams staffed? You can't expect to hire someone who knows all the pieces, so you have to build cross-functional teams to bring skills together. Michigan State has a Master's program in analytics, so they're bringing in some good people from there. The last four hires at Notre Dame have been right off the campus. Now have 8 FTE in the BI space.