CSG Fall 2017 – Big Data Analytic Tools and AI’s Impact on Higher Education

Mark McCahill – Duke

How did we get here?

Big data / AI / machine learning driven by declining costs for:

  • sensors: IoT and cheap sensors create mountains of data
  • storage is cheap – we can save data
  • networks are fast – data mountains can be moved (and so can tools)
  • CPUs are cheap (and so are GPUs)

Massive data – IoT; internet of social computing platforms (driving rapid evolution of analytic tools); science – physics, genomics

Big data analysis tools – CPU clock speeds are not increasing much, so how can we coordinate CPUs running in parallel to speed up analysis of large datasets? Break the data into parts and spread the work – Hadoop MapReduce.

Map/Reduce -> Hadoop -> Apache Spark
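A minimal single-machine sketch of the map/reduce idea, in plain Python rather than Hadoop or Spark. The function names and the sample partitions are illustrative assumptions; a real framework would run the map step for each partition on a different node of the cluster.

```python
from collections import Counter
from functools import reduce


def map_partition(lines):
    """Map step: count words within one partition of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine the partial counts from two partitions."""
    return left + right


def word_count(partitions):
    """Apply the map step to every partition, then reduce the partial results."""
    # A cluster would run the map step for each partition in parallel.
    partials = [map_partition(p) for p in partitions]
    return reduce(merge_counts, partials, Counter())


# Hypothetical data, already split into two partitions.
partitions = [
    ["big data big tools", "data moves fast"],
    ["tools move to data"],
]
totals = word_count(partitions)
```

Because the reduce step only needs the partial counts, not the raw lines, each partition can be processed where its data lives and only the small summaries have to move.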

Apache Spark – open source MapReduce framework. Spark coordinates jobs run in parallel across a cluster to process partitioned data.

Advantages over Hadoop: 10 – 100x faster than Hadoop (through memory caching); code and transformation optimizers; support for multiple languages (Scala, Python, SQL, R, Java)

2 ways to use Spark:

  • Semi-structured data (text files, gene sequencer output): write transform and filter functions; classic map/reduce
  • Structured data (implicitly or explicitly named columns): transform and filter structured data using R-style dataframe syntax; SQL with execution optimizers.

Spark data models:

  • RDD (Resilient distributed dataset) storage allows Spark to recover from node failures in the cluster
  • Datasets – semi-structured data with strong typing and lambda functions, custom memory management, optimized execution plans
  • Dataframes – dataset with named columns, supports columnar storage, SQL, and even more optimization

Ambari – tool to manage and deploy a Spark cluster.

Machine Learning

Use big datasets to train neural networks for pattern recognition based on ‘learned’ algorithms.

Deep learning neural networks have multiple layers and non-linear activation functions

Common thread – training is expensive, and parallelism helps, lots of matrix math processing; GPUs start to look attractive.
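To make "multiple layers and non-linear activation functions" concrete, here is a small forward pass in plain Python. It is a sketch under assumptions: the weights are hand-picked for illustration rather than trained, and ReLU is just one common choice of activation. The weighted sums inside `dense` are the matrix math that GPUs speed up.

```python
def relu(values):
    """Non-linear activation: negative inputs become zero."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum of the inputs plus a bias per unit."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Run inputs through each (weights, biases) layer, with ReLU between layers."""
    activations = inputs
    for index, (weights, biases) in enumerate(layers):
        activations = dense(activations, weights, biases)
        if index < len(layers) - 1:  # leave the final layer's output linear
            activations = relu(activations)
    return activations


# Hypothetical hand-picked weights; together the two layers compute |x1 - x2|.
layers = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),  # hidden layer: 2 inputs -> 2 units
    ([[1.0, 1.0]], [0.0]),                     # output layer: 2 units -> 1 output
]
output = forward([3.0, 1.0], layers)
```

Without the ReLU step the two layers would collapse into a single linear map, which could not represent an absolute difference; the non-linearity is what makes the extra layer useful.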

Application areas: 3 lenses to look through in higher ed – research, coursework, operations

Example – TensorFire https://tenso.rs/

Case Study: Research

OPM has 40 years of longitudinal data on federal employees. Duke researchers have been developing synthetic data and differential privacy techniques to allow broader audiences to develop models run against data in a privacy-preserving fashion.
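The notes do not say which differential privacy techniques the Duke researchers use. As a generic illustration only, the sketch below shows the Laplace mechanism, a standard building block, applied to a counting query. The record layout, the `private_count` helper, and the epsilon value are all assumptions made for the example.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Return a noisy count of the records that match the predicate.

    A count changes by at most 1 when one person is added or removed
    (sensitivity 1), so Laplace noise with scale 1/epsilon is used.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical toy records: (years_of_service, grade_level)
records = [(5, 9), (12, 11), (30, 13), (2, 7), (18, 12)]
rng = random.Random(0)  # fixed seed only so the sketch is repeatable
noisy = private_count(records, lambda r: r[0] >= 10, epsilon=0.5, rng=rng)
```

Smaller epsilon values add more noise and give stronger privacy; the noise averages out over many queries, which is why real systems also track a total privacy budget.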

Model verification measures: test how well a model developed against the synthetic data fits the real data. Verification measures need to run against many slices of the data, and the OPM data is big. Initial approach: run regression models and verifications from R on a large Windows VM against a MS-SQL database. But R is single threaded, so custom code was written with the R parallel library to run and manage multiple independent Windows processes and use more server CPU cores.

Rewrite R code to use SparkR

Initial setup: copy CSV files to HDFS; parse CSV and store in Spark Dataframe; treat Dataframe as a table; save table as a Parquet file (columnar format)

Read parquet file and you’re ready to go – SQL queries and R operations on dataframes
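The setup above moves data from a row-oriented format (CSV) to a column-oriented one (Parquet). The sketch below is a stand-in for that idea using only the Python standard library; it does not use Spark or write Parquet, and the column names and values are invented for illustration.

```python
import csv
import io


def csv_to_columns(csv_text):
    """Parse CSV text and pivot its rows into one list per named column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns


# Hypothetical CSV content with a header row.
csv_text = "agency,year,salary\nA,2001,50000\nB,2001,62000\nA,2002,51500\n"
columns = csv_to_columns(csv_text)

# A query on one column touches only that column's values.
mean_salary = sum(int(s) for s in columns["salary"]) / len(columns["salary"])
```

With a columnar layout, a query that needs only `salary` never reads `agency` or `year`, which is one reason analytic queries over wide tables tend to run faster against columnar files.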

Lessons: Social science researchers can probably benefit from tools like Spark; Without access to a sandbox Spark cluster, how can they get up to speed on the tooling? RStudio continues to improve support for Spark via SparkR, SparklyR, etc. Plan to support Spark and R tools for large datasets.

Case Study: Coursework

Spring 2017 – grad biostats course using Jupyter notebooks.

PySpark – Python API for Spark.

Lessons learned: Course assignment k-mer counts in ~ w minutes on a 10 server cluster (each server is 10 cores + 25 G) for 40 students.
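For readers unfamiliar with the assignment, k-mer counting means tallying every overlapping substring of length k in a set of sequences. The sketch below is plain Python with made-up reads, not the course's PySpark code; it counts per read and then merges, which is the same map-then-reduce shape a cluster job would follow.

```python
from collections import Counter


def kmers(sequence, k):
    """Yield every overlapping substring of length k in the sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def count_kmers(sequences, k):
    """Count k-mers per sequence (map), then merge the counts (reduce)."""
    total = Counter()
    for seq in sequences:
        total.update(Counter(kmers(seq, k)))
    return total


reads = ["ACGTAC", "GTACGT"]  # hypothetical reads
counts = count_kmers(reads, k=3)
```

In PySpark this shape is commonly expressed with the RDD `flatMap` and `reduceByKey` operations, so that each partition of reads is counted on a different worker.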

Azure has a Jupyter + Spark offering but not configured for courses.


Duke has a wealth of PHI data that researchers want – it lives in a secure enclave that is very locked down. Researchers want to use RStudio and Jupyter to run TensorFlow code against GPUs for image analysis. Don’t fight the firewall – automate the analysis tooling external to the protected enclave – package tools in a container. Don’t use Docker (not suited for research) – use Singularity.

Container tech developed at LBL. Runs as user not as root.

Singularity + GitLab CI – researcher does integrations and commits.

Lessons learned: Singularity not a big stretch if you do Docker. Automating build/deploy workflow simplifies moving tools into protected data enclaves (or clouds or compute clusters).

Using Machine Learning for campus operations. John Board (Duke)

We already have the data for or for … whatever. Enables manual mashups to answer specific questions, solve specific problems. Will rapidly be giving way to fully automated pulling of needed data for unanticipated questions on demand. Data – speak for yourself and show us what’s interesting.

Academic outcome prediction – easy to ask – does your grade in Freshman calculus predict sticking with and success in engineering 2 years out? Hard part is asking – what are the predictors for success? Should be able to poll corpus of university data (and aggregate across institutions).

Computing demands are significant.

How will Machine Learning Impact Learning? – Jennifer Sparrow, Penn State

What computers can’t teach (yet) – analyze, evaluate, create

What computers can teach: remember, understand, apply

BBookX – robot-generated textbook. Faculty puts in desired learning outcomes, robot finds relevant articles (only uses Wikipedia so far). 85% of students prefer this over a traditional textbook. 35% often or very often visited pages not a part of required readings.

When students make their own book using BBookX – 73% said BBookX surfaced content they had never encountered before in their major.

Brainstorming tools – intelligent assistant as a brainstorming partner. Can use BBookX for iterative exploration.

Prerequisite knowledge – as BBookX found texts, the computer was able to identify prerequisite knowledge and provide links to that.

Generated assessments – “adversarial learning”. Robot can generate multiple choice questions. BBookX doesn’t have any mercy. Computer-generated assessments are too hard – even for faculty.

How do we put these tools in the hands of faculty and help them use the tools to create open educational resources?

Simplification – grading writing assignments. Can use natural language processing to give feedback at a richer level than they’d get from faculty or TAs. Students can iterate as many times as they want. Faculty finding that final writing assignments are of a higher quality when using the NLP.

Learning analytics – IR & BI vs data science:

  • IR & BI aims to precisely summarize the past; data science aims to approximate the future
  • IR & BI enables people to make strategic decisions; data science enables fine-scale decision making
  • IR & BI uses small numbers of variables; data science uses hundreds to thousands of variables
  • IR & BI decision boundaries are blunt and absolute; data science decision boundaries are nuanced and evolve over time

Able to predict a student’s grade within half a letter grade 90% of the time. Those that fall outside the 90% have usually had some unpredictable event happen. Biggest factors: previous semester GPA, historical average course grade, cumulative GPA, # of credits currently enrolled. Smaller factors: course level; # of students that have completed the course; high school GPA; # of credits earned.
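The accuracy figure above is a "within tolerance" measure rather than an exact-match rate. A minimal sketch of how such a measure could be computed follows. It assumes grades are on a 4.0 scale and that half a letter grade means 0.5 grade points; neither detail is stated in the talk, and the sample grades are invented.

```python
def within_half_grade(predicted, actual, tolerance=0.5):
    """Return the fraction of predictions within `tolerance` points of the actual grade."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= tolerance)
    return hits / len(actual)


# Hypothetical grades on a 4.0 scale (assumption: half a letter grade = 0.5 points).
predicted = [3.7, 3.0, 2.3, 4.0, 1.7]
actual = [4.0, 3.3, 3.3, 3.7, 2.0]
share = within_half_grade(predicted, actual)
```

The students who fall outside the tolerance (the third one here) are the cases the talk attributes to unpredictable events, so they are worth listing individually rather than only counting.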

First Class – VR teaching simulations.

Legal and ethical issues. How do we know students are submitting their own work? Can we use predictive models to analyze that? Is it ok to correlate other data sources? What kind of assumptions can we make? Do we talk to the students? Are we ethically obligated to provide interventions?

Peeling Back the Corner – Ethical considerations for authoring and use of algorithms, agents and intelligent systems in instruction – Dale Pike, Virginia Tech

Provost has ideas about using AI and ML for improving instruction and learning.

Example: different search results from the same search by different people – we experience the world differently as a result of the algorithms applied to us. Wouldn’t it be good if the search engine allowed us to “peel back the corner” to see what factors are being used in the decision making, and allow us to help improve the algorithm?

When it gets to the point where systems are recommending learning pathways or admission or graduation decisions it becomes important.

Peeling back the corner = transparency of inputs and, where possible, implications, when authoring or evaluating algorithms, agents, and intelligent systems. Where possible, let me inform or modify the inputs or interpretation.

Proposal – could we provide a “Truth in Lending Act” to students that clearly indicates activity data being gathered? Required; Opt Out; Opt In

What does a control panel for this look like? How do you manage the infrastructure necessary without crumbling when somebody opts out?
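One way to picture the three tiers in the proposal (Required, Opt Out, Opt In) is as a per-stream policy that the data pipeline checks before gathering anything. The sketch below is purely illustrative: the stream names, the `may_collect` helper, and the way a student's choice is recorded are all assumptions.

```python
from enum import Enum


class Collection(Enum):
    REQUIRED = "required"  # always gathered
    OPT_OUT = "opt_out"    # gathered unless the student declines
    OPT_IN = "opt_in"      # gathered only if the student agrees


def may_collect(policy, student_choice):
    """Decide whether a data stream may be gathered for one student.

    student_choice is True (agreed), False (declined), or None (no choice made).
    """
    if policy is Collection.REQUIRED:
        return True
    if policy is Collection.OPT_OUT:
        return student_choice is not False
    return student_choice is True


# Hypothetical data streams and one student's recorded choices.
policies = {
    "lms_logins": Collection.REQUIRED,
    "wifi_location": Collection.OPT_IN,
    "library_use": Collection.OPT_OUT,
}
choices = {"wifi_location": None, "library_use": False}
collected = sorted(
    name for name, policy in policies.items()
    if may_collect(policy, choices.get(name))
)
```

Treating a missing stream as a normal case in every downstream model is one way to keep the infrastructure from "crumbling" when a student opts out.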

Why? False positives and the potential for rolling implications.

Filters = choose what to show; Recommendation = offer choices; Decision = choose

Data informed learning and teaching

Learning analytics – understanding of individual performance in an academic setting based (usually) on trends of learning activity outcomes. The potential impact of learning analytics is constrained by the scope of analysis: Timeframe, scope of data, source of data. Increasing the impact of potential analysis also increases “creepy factors”.

Personalized/Adaptive learning: individual activity or performance alters (based on model) the substance, pace, or duration of learning experience.

Examples: CMU Simon Initiative Vision – a data-driven virtuous cycle of learning research and innovative educational practice causes demonstrably better learning outcomes for students from any background. Phil Long and Jon Mott — N2GDLE. Now we fix time variable and accept variability in performance. In the future do we unlock the time variable and encourage everybody to get higher performance?

Big Data Analytic Tools and ML/AI’s impact on higher ed. Charley Kneifel – Duke

Infrastructure: CPU/GPU/Network as service. GPUs are popular but hard-ish to share – CPUs are easy. “Commodity” GPUs (actual graphics cards) are very popular (4 in a tower server, 20 in a lab, problems with power and cooling). Centralized, virtualized GPUs make sense (depending on scale), mix of “double precision/compute” and graphics cards. Schedulable resource (Slurm at Duke) and interactive usage. Availability inside protected enclaves. Measure resources – do you have idle resources? Storage – HDFS, object stores, fast local filesystem. Network – pipe sizes to Internet, Science DMZ, encrypted paths…; FIONAs with GPUs – edge delivery.

With VMware, servers have to be rebooted to de-allocate GPUs, so it’s fairly disruptive.

Cloud vs. on-prem clusters vs. serverless: Software packaging is important (portable, repeatable); Support for moving packages to cloud or into protected enclaves; Training capacity vs. operational use; Ability to use spare cycles (may need to cleanse the GPU); standard tool sets (Spark, Machine learning, …); Where is the data? Slurp or Sip (can it be consumed on demand)? Serverless support for tools used – only pay for what you use, but remember you need to manage it; agreements for protected data (including BAAs); customized ASICs and even more specialized hardware (Cloud or …); complex work flows.

Financial considerations: costs for different GPUs; peak capacity on prem vs cloud; pay for what you use; Graphics GPUs are cheap but need a home and data.

Security – is data private, covered by data use agreement, protected by law, does it need to be combined with other data, is there public/non-sensitive/de-identified data that can be used for dev purposes, IRBs – streamlined processes, training…

Staffing and support: General infrastructure – can it support and scale GPUs and ML? Software packaging – Machine learning, R, Matlab, other tools? Tool expertise – both build/deploy and analysis expertise; operational usage vs research usage; student usage and projects – leverage them!

VP of research support levels and getting faculty up to speed – projects that highlight success with big data/analytics; cross departmental services/projects – generally multidisciplinary; university/health system examples at Duke; protected data projects are often good targets; leverage students; sharing expertise; reduce barriers (IT should be fast for this).

