Oren’s Blog

Higher Ed Cloud Forum: Adventures in Cloudy High-Performance Computing

Gavin Burris – Wharton School

HPC – Computing resources characterized by many nodes, many cores, lots of RAM, high-speed, low-latency networks, and large data stores.

XSEDE – it’s free (if you’re funded by a national agency).

cloud – more consoles, more code, less hardware

Using Ansible to configure cloud resources the same way as on-premises systems; deploying EC2 clusters from Python; CfnCluster (CloudFormation Cluster) to build and manage HPC clusters.

Univa UniCloud enables cloud resources to integrate with Univa scheduler.

Use case: C++ simulation modeling code needed 500 iterations, and each took 3-4 days. Used MIT StarCluster with spot bids. Finished the job in 4 days for $500.
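A quick back-of-the-envelope check of that use case, as a sketch. The figures are the ones in the notes; treating the 500 iterations as fully independent (so they can all run at once) is an assumption, and the function names are made up for illustration.

```python
# Figures from the talk notes; full independence of iterations is assumed.
ITERATIONS = 500
DAYS_PER_ITERATION = (3, 4)   # each iteration took 3-4 days
TOTAL_COST_USD = 500
WALL_CLOCK_DAYS = 4


def serial_days(iterations, days_range):
    """Return (min, max) days if the iterations ran one after another."""
    low, high = days_range
    return iterations * low, iterations * high


def cost_per_iteration(total_cost, iterations):
    """Average spend per iteration."""
    return total_cost / iterations


low, high = serial_days(ITERATIONS, DAYS_PER_ITERATION)
print(f"Serial run: {low}-{high} days; on spot instances: {WALL_CLOCK_DAYS} days")
print(f"Average cost: ${cost_per_iteration(TOTAL_COST_USD, ITERATIONS):.2f} per iteration")
```

Run one after another, the iterations would take 1500-2000 days, which is why fanning out across a spot cluster matters here.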

Use case: Where are the GPUs? Nobody was using them – people had to use different toolkits and code to take advantage of them. So they got rid of the GPUs in the hardware refresh and used UniCloud to reach cloud instances with GPUs.

“Cloud can accommodate outliers” — GPUs, large memory. A la carte to the researcher based on tagged billing. Policy-based launching of cloud instances.

Seamless transition – VPC VPN link provided by central networking, AWS looks like another server room subnet. Consistent configuration management with the same Ansible playbooks. Cloud mandate by 2020 for Wharton – getting rid of server rooms to reclaim space.

They’re doing NFS over the VPN – getting great throughput.

Cost comparison – HPCC local hardware $328k, AWS $294 for flop equiv.

Spotinst – manages preemption and moves loads to free instances.

 


Cloud Compute Services Expansion – Lessons Learned

Mark Personett – University of Michigan

A project to: Enable all three campuses and Michigan to access cloud infrastructure with AWS, Azure and Google

Enterprise agreement, shortcode billing, training, consulting, preconfigured security/network settings, Shibboleth integration, reporting. What it’s not: cloud strategy, governance, or operations.

Lessons learned:

BAA doesn’t cover every service. BAA is just a legal document. Account and billing differences.

AWS at U-M: BAA separate from EA and have to do a separate process to add units to the BAA. Single-sign-on is not as integrated. No inherent hierarchy.

GCP: billing accounts and “projects” separate concepts. Billing sub-accounts. GCP is API and API is GCP. API explorer is extremely helpful in writing API calls.

Azure: Resource groups vs subscription not always clear (finding that they need to do subscriptions for each resource group in the general case). Office 365 challenges – if alumni get synced to your Azure AD they get rolled into your instance and under your terms. VPN – they have levels of VPNs – if you breach the bandwidth it resets your tunnel with no warning.

 

Higher Ed Cloud Forum: Cloud experiences in the Swiss higher education market

Immo Noack

SWITCH – Swiss NREN. Swiss universities are members. Core competencies: Network, security, and identity management. Around 45 universities in Switzerland.

Have local SWITCH services based in data centers in Zurich and Lausanne.

Buy IaaS through GEANT, which is the pan-European organization. The GEANT tender is not valid for Switzerland, but conditions apply. Three parts: Original IaaS providers (direct); original IaaS providers (indirect); Resellers for IaaS indirect providers. Providers are AWS and Microsoft.

SWITCH’s role is expanding its cloud offering with external suppliers, provided exclusively by SWITCH to Swiss higher ed. Data protection is a big concern – they don’t want data in the US. GDPR is coming next May.

Findings: universities are rather cautious and prefer to build their own resources (they still invest heavily in higher ed). The budget process is not prepared for cloud usage. University IT units want to keep the existing stuff, but researchers want the cloud.

Higher Ed Cloud Forum: Beyond the Architecture — Rethinking Responsibilities

Glenn Blackler (UC Santa Cruz)

Cloud-First! Now What…?

Santa Cruz’s approach – hw infrastructure was going to turn into a pumpkin in spring 2018. “Screw it – we’re all in, let’s jump.”

What’s our approach? How can existing teams support this change? Program work vs. migration specific work. Our focus – enterprise applications.

Defining the program: Plan for a quick win (build confidence, get familiar, identify training needs). Go big – went from a small PHP app to identity management infrastructure. All in! — moved Peoplesoft and Banner. Run concurrent migrations.

But really. … why? Need to continually talk to customers about why they’re doing it. Benefits of cloud migration aren’t apparent – have to sell it. The pitch: elasticity, DR/BR, Accommodation (additional test environments); modernized tools and team structures; sustainability.

Teams – Separation of duties – now have separation between sysadmins and app admins and developers. Always been a handoff, ticket driven organization. Don’t know what org looks like in new world – took really smart people and threw them in a room and told them to figure it out. Core team includes App and Sys admins, plus less frequent contributions from security, DBA, networking, devs.

Looking at Cloud Engineering Team that incorporates OS Setup/Config/App Config/Maintenance. DBA team still a bit separate. Security contributing across the board, but not necessarily hands on all the time. Teams are learning new things about each other that they didn’t know in the ticket-driven world.

Future – shared responsibilities mean fewer handoffs; engineers with wider breadth of skills; improved cross-team collaboration through shared code base; continuous improvement through evolving technical design and available services; adjusted job titles and responsibilities; ITS reorganization; budget impact, review of recharge model.

New ways of collaborating: Sys and App admins using a single git repository for code. Shared tools/technologies, password management; cross-functional tier 1 support;

Lessons learned – don’t lock decisions down too early, use governance to end debates, identify project goals that foster exploration (within timeline), use consultants carefully. Traditional PM will not work, push boundaries of what is possible, required vs. ideal – compromise is important; don’t compare with mature on-premise architecture; be prepared for rumors;

Not everyone is on the bus – what about those who don’t want to get on?

Higher Ed Cloud Forum – Lightning Round #1

Phil Robinson – Cloud Progress at Cornell Student Services IT

First AWS account – July 2015 – adopted a cloud first strategy. Now have about 30 apps on AWS (migrations, rewrites, new apps). Automate with Jenkins and Ansible. Retiring on-prem VMs.

Custom class-roster app, used by students to decide what to take. Added central syllabi feature this year. Using SNS+SQS as message bus, orchestrating events; CloudFront delivery for syllabi; on-the-fly ClamAV scans on upload; ElasticSearch for searching; SES for notifications by email. Developed in 3632 hours.

Looking towards containerizing and VDI.

Gerard Schockley – BU iPaas RDS AWS

IPaaS ODS in RDS – integration service designed to integrate many data feeds into SnapLogic platform. Operational Data Store. Using AWS Aurora.

Bob Winding – Cloud Automation Journey

Most fully automated in GovCloud project. CloudFormation (VPCs, IAM, Security Groups, centralized alerts); Ansible and CloudFormation for server builds; console federation with ADFS; consistent process for all project accounts; new project account in a couple of hours; decentralized maintenance of CF templates.

Penn –

What does “cloud native” mean at Penn?

Case study 1 – online giving portal: Data ETL (Talend); to Postgres RDS (fundraising metadata); S3 / CloudFront; to Oracle on-prem. Near real-time.

Case study 2 – service ordering (VDI and backup requests): on-prem PowerShell makes changes in AD groups, sends messages through SQS.

Case study 3 – device registration: on-prem registration; does API keys in Lambda.

Sara Jeanes – Considerations in moving HPC workloads to the cloud

Initial framing questions: Do they have a preference for which cloud provider (do they have credits, different tech); Is there a multi-cloud resiliency need?

Workload questions: Can it be interrupted (use spot instances), large workloads firewall considerations (ScienceDMZ);

Jeff Minelli – Penn State – CloudCheckr enabling transparency at Penn State

Gain insights into financial transparency, spend optimization, resource utilization and right-sizing, cost allocation, best practices, security & compliance, collection and unification of AWS API data, continuous monitoring, reporting and alerts

Working with CloudCheckr to enable SAML. Basic group email notifications. Configuration of $100 spending alerts.

Trying to get CloudCheckr into InCommon.

Network Firewall Policies for Hybrid Cloud – Brian Jemes – University of Idaho

In the cloud, managing firewalls with server tags. Gets complicated when managing across on-prem and cloud. On-prem they have Cisco tools to manage ASA firewalls.

Options: manage hybrid cloud policy in on-prem firewall; manage hybrid policies with traditional firewalls in cloud; develop a hybrid tool.

Looking at a startup called Bracket Computing – cloud firewall policy manager. brkt.com – Provides micro-segmentation.

John Bailey – Washington University (St. Louis). Cloud IAM

Balance between security and usability. Enhancing usability with SPNEGO integrated auth: it leverages the Kerberos token from the machine login to perform a web SSO login, making the web login invisible to the customer.

Lou Tiseo – how categorizing resources helps to understand cloud usage

Requiring seven different tags. Using Cloudyn management dashboard. Helped save costs by using reserved instances.

Chris Malek Caltech – Automation tools for AWS ECS and Batch

deployfish – configure almost all aspects of an ECS service (load balancing, app autoscaling, volumes, environment, etc). They’ve open sourced it. Create, inspect, scale, update, destroy and restart ECS services with single commands; manage multiple environments (test, qa, prod, etc). Integrates directly with Terraform. YAML driven.

batchbeagle — allowing people to manage AWS Batch. Create, update, disable, and destroy queues. Create, update, disable, and destroy compute environments. Create job descriptions. Submit and manage jobs, etc.

Amanda Tan – Washington

Enabling cost notifications on AWS. Cost monitoring is difficult – should be zero effort. Two-pronged attack: auto-tag resources, send email notification with total spend and resource usage daily. CloudFormation template sets up CloudWatch, which invokes the AutoTag Lambda function. AutoTag tags resources with owner and principal-id. Notification works off DLT billing records, provided in S3 buckets twice a day.
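The auto-tagging step might look roughly like the sketch below. It is not the presenter's code: it assumes the Lambda receives a CloudTrail `RunInstances` record wrapped in a CloudWatch event (fields under `detail`), and the function name and sample values are hypothetical. The tag names `owner` and `principal-id` are the ones mentioned in the notes.

```python
def tags_for_run_instances(event):
    """Map a CloudTrail RunInstances event to {instance_id: tags}.

    Assumes the CloudWatch Events wrapper, where the CloudTrail record
    sits under event["detail"].
    """
    detail = event["detail"]
    identity = detail["userIdentity"]
    # For an assumed role the ARN ends in .../role-name/session-name;
    # for an IAM user it ends in user/name. Take the last path segment.
    owner = identity["arn"].split("/")[-1]
    tags = {"owner": owner, "principal-id": identity["principalId"]}
    items = detail["responseElements"]["instancesSet"]["items"]
    return {item["instanceId"]: dict(tags) for item in items}


# Hypothetical sample event, trimmed to the fields the sketch reads.
sample_event = {
    "detail": {
        "eventName": "RunInstances",
        "userIdentity": {
            "arn": "arn:aws:sts::123456789012:assumed-role/Researcher/jdoe",
            "principalId": "AROAEXAMPLEID:jdoe",
        },
        "responseElements": {
            "instancesSet": {"items": [{"instanceId": "i-0abc123"}]}
        },
    }
}
print(tags_for_run_instances(sample_event))
```

In a real Lambda the returned mapping would be handed to the EC2 tagging API (for example boto3's `create_tags`); that call is left out so the sketch runs without AWS credentials.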

 

 

Stop Doing Cloud Security Assessments

Wyman Miles, Cornell

Technology risk assessments – a lot of sound and fury, but we don’t find problems and we slow down implementation and governance. They’re currently doing 120 assessments per quarter with 4 security engineers.

Between cyber-liability insurance and contracts, and our portrayal as risks what are really just vendor stances, what do we really need to do?

Indiana jumping in feet first with HECVAT – Box one is done, hosted by REN-ISAC.

Notre Dame discovered a product that was coded by two guys in Russia and discarded it from consideration as a result of a security review.

Maybe we should only do real reviews where we know that sensitive data will be in play?

Frequently we find issues with products that are already in use, with or without central governance knowing about it.

“Most risks we discover are really our petty issues with implementations”

Stanford – need to get out in front of what people are actually using, and then spend time facilitating proper use. Use network flow analysis, purchase records.

Self Service at Yale

Rob Starr, Jose Andrade, Louis Tiseo (Yale University)

Community told them they needed to be able to spin machines up and down at will for classes, etc. Started with a big local OpenStack environment, now building it out at AWS.

Wanted to deliver agility, automate and simplify provisioning, shared resources, and support structures, and reduce on-premises data centers (one data center by July 2018).

Users can self-service request servers, etc. Spinup – CAS integration, patched regularly, AD, DNS, Networking, Approved security, custom images.

Self-service platform – current manual process takes (maybe) 5 days. With Self-Service, it takes 10 minutes. Offering: Compute, Storage, Databases, Platforms, DNS

All created in the same AWS account. All servers have private IP addresses.

ElasticSearch is the source of truth.

Users don’t get access to the AWS console, but can log into the machines.

Built initial iteration in 3 months with 3 people. Took about a year to build out the microservices environment with 3-4 people. Built on PHP Laravel.

Have a TryIt environment that’s free, with limits.

Have spun up 1854 services since starting, average life of server is 64 days.

Dealing with Controlled Unclassified Information (CUI) – Notre Dame

Bob Winding and Kolin Hodgson from Notre Dame

How do you know you have CUI in a contract? Look for DFARS 252.204-7012 – requires all DoD contractors and subs to comply with NIST 800-171 and to report incidents within 72 hours.

NIST 800-171 has 14 families of controls, with 110 controls.

C3 project scope – compliance with national research compliance standards. Decided to do in AWS GovCloud with NIST templates.

No easy way to isolate sensitive data on campus.

Have a new domain not connected with campus, but federated with ADFS. AWS has a document that defines ITAR boundary. Use cloud protection manager to do backups in GovCloud. Have a Shared Services hub and each research project or team gets a separate account. CloudWatch and CloudTrail events sent to a separate security account. Started with lambda functions, but now use event bus to send cloud watch events to security.

Have the ability to burst to on-campus HPC. Many jobs (e.g. multiple Matlab simulations) work fine in AWS. But InfiniBand MPI kinds of low-latency jobs don’t work in AWS. They’re building a secure enclave on campus that can be tunneled to from AWS: a “reverse hybrid model”. The research computing folks will manage the on-prem enclave from GovCloud. They’re using Ericom Connect to do the virtual app streaming – outperformed local machines in almost every case. Defining the audit boundary as the RDP client on the university-owned device.

Printing is not allowed.

A GovCloud account is actually a child of a commercial account and the root account is in the commercial account. If you delete the commercial account the GovCloud account goes away. It can take a few days to get a GovCloud account.

Issues – need to partner with Research group. Pushback from researchers on what’s really needed; software licensing; breaking out costs.

Higher Ed Cloud Forum 2017 – Intro and Multi Account AWS Strategy

Survey Results

46 institutions attending, 4 vendors, 81 unique roles among 90 attendees.

40% cloud first, 12% have a documented cloud exit strategy.

82% AWS, 14% Azure, 4% Google, 2% other

Staff readiness is the #1 obstacle to broad adoption

42% have signed the I2 Net+ agreement, 11% have enterprise agreement with cloud provider

21% have containers/serverless in production, 9% non-prod, 70% not currently adopting.

Managing and Automating a Multi-Account Strategy in AWS: Brett Bendickenson (Arizona)

Have their own agreement with AWS. Currently have about 75 accounts in their consolidated billing. 24 accounts in central IT.

UITS Cloud Advisory Team — cross functional group from within UITS to advise and decide on cloud practices and policies.

  • Tagging Policy – extremely important to get right up front. Service, name, environment, created by, contactnetid, accountnumber, sub account
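A tagging policy like this is easy to check mechanically. The sketch below is an illustration, not Arizona's implementation: the exact tag key spellings (lowercase, no spaces) are an assumption derived from the list in the notes, and `missing_tags` is a hypothetical helper.

```python
# Required tags from the notes; the exact key spellings are an assumption.
REQUIRED_TAGS = (
    "service",
    "name",
    "environment",
    "createdby",
    "contactnetid",
    "accountnumber",
    "subaccount",
)


def missing_tags(resource_tags):
    """Return the required tag keys that are absent or blank.

    Keys are compared case-insensitively with spaces removed, so
    "Created By" and "createdby" count as the same tag.
    """
    present = {
        key.lower().replace(" ", "")
        for key, value in resource_tags.items()
        if str(value).strip()
    }
    return [tag for tag in REQUIRED_TAGS if tag not in present]


print(missing_tags({"Service": "kuali", "Name": "web-1", "environment": "prod"}))
```

A check like this could run at instance launch and flag resources whose result is non-empty.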

Multi-account strategy. Workloads segregated into production and non-prod accounts. Tipping point was properly restricting everything by permissions – can do it with IAM roles, but it’s a lot of work. Decided on further segregation by teams / technologies, e.g. Kuali, PeopleSoft, IAM. Each has prod and non-prod accounts.

Each account has an account steward (director or dept. head) — responsible for spend, security, etc. Each account has an email list, with the address used for the root login address. Password stored in common vault, secured with MFA hardware token (kept in Ops). Linked to a central billing account. Set of account foundation templates are deployed. Started using AWS Organizations.

Account foundation modeled after the AWS NIST 800-53 Quickstart CloudFormation Template. Set of CloudFormation templates which deploy roles, security controls, etc. Sets up an EC2 instance that runs a set of Ansible playbooks that set up Shib, base AWS info, IAM, Logging, Lambda.

Federated Roles – SysAdmin, IAMAdmin, InstanceOps, ReadOnly, BillingPurchasing. Using Grouper for authorizations.

Using federated identities, no IAM users (generally).

CloudTrail enabled in all accounts. Enabled for all regions, records all API calls, sent to a central S3 Bucket in root account. CloudTrail logs also saved to CloudWatch logs in account for local reference.

Alarms set for changes in Network ACL, Security Group changes, Root Account activity, unauthorized access, IAM Policy changes, access key creation, cloud trail changes. (not all used in non-prod)
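To make the alarm categories concrete, here is a small sketch that classifies a single CloudTrail record into the categories listed above. It is illustrative only: the alarm labels and the `classify` helper are hypothetical, the event-name table is a partial sample of the relevant API actions, and the real setup described in the talk uses CloudWatch alarms rather than Python.

```python
# Partial, illustrative mapping from CloudTrail eventName to an alarm label.
ALARM_BY_EVENT = {
    "CreateNetworkAclEntry": "network-acl-change",
    "DeleteNetworkAclEntry": "network-acl-change",
    "ReplaceNetworkAclEntry": "network-acl-change",
    "AuthorizeSecurityGroupIngress": "security-group-change",
    "RevokeSecurityGroupIngress": "security-group-change",
    "PutRolePolicy": "iam-policy-change",
    "AttachRolePolicy": "iam-policy-change",
    "CreateAccessKey": "access-key-creation",
    "StopLogging": "cloudtrail-change",
    "DeleteTrail": "cloudtrail-change",
    "UpdateTrail": "cloudtrail-change",
}


def classify(record):
    """Return the list of alarm labels raised by one CloudTrail record."""
    alarms = []
    if record.get("userIdentity", {}).get("type") == "Root":
        alarms.append("root-account-activity")
    code = record.get("errorCode") or ""
    # Error codes appear as e.g. "AccessDenied" or "Client.UnauthorizedOperation".
    if code.startswith("AccessDenied") or "UnauthorizedOperation" in code:
        alarms.append("unauthorized-access")
    label = ALARM_BY_EVENT.get(record.get("eventName"))
    if label:
        alarms.append(label)
    return alarms


print(classify({"eventName": "CreateAccessKey", "userIdentity": {"type": "Root"}}))
```

One record can raise more than one alarm, which is why the function returns a list.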

Lambda Functions – Alarm details (interrogates CloudTrail events and sends actual API calls that raised the alarm); CreatedBy automated tagging for EC2 instances; OpsWorks tagging helper; Route53 helper (updates DNS); Tag monitoring – checks tags on instance launch (looking at Cloud Custodian from Capital One (open source)); AMI lookup.

Arizona’s code: https://bitbucket.org/ua-ecs/service-catalog

CSG Fall 2017 – Big Data Analytic Tools and AI’s Impact on Higher Education

Mark McCahill – Duke

How did we get here?

Big data / AI / machine learning driven by declining costs for:

  • sensors: IoT and cheap sensors create mountains of data
  • storage is cheap – we can save data
  • networks are fast – data mountains can be moved (and so can tools)
  • CPUs are cheap (and so are GPUs)

Massive data – IoT; internet of social computing platforms (driving rapid evolution of analytic tools); science – physics, genomics

Big data analysis tools – CPU clock speeds are not increasing much, so how can we coordinate CPUs running in parallel to speed analysis of large datasets? Break the data into parts and spread the work – Hadoop MapReduce.
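The "break into parts and spread work" idea can be shown in a few lines of plain Python. This is a single-machine sketch of the map/reduce pattern, not Hadoop itself: the helper names are made up, and the map calls run sequentially here where a cluster would run them on different nodes.

```python
from collections import Counter
from functools import reduce


def partition(items, parts):
    """Split items into `parts` roughly equal slices."""
    return [items[i::parts] for i in range(parts)]


def map_count(lines):
    """Map step: count words within one partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial counts."""
    return left + right


def word_count(lines, parts=3):
    """Partition the input, map each partition, then reduce the partials."""
    partials = [map_count(chunk) for chunk in partition(lines, parts)]
    return reduce(reduce_counts, partials, Counter())


print(word_count(["a b a", "b c", "a"]))
```

Because each partition is counted independently, the map step is the part that parallelizes; only the small partial counts have to be brought together in the reduce step.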

Map/Reduce -> Hadoop -> Apache Spark

Apache Spark – open source MapReduce framework. Spark coordinates jobs run in parallel across a cluster to process partitioned data.

Advantages over Hadoop: 10 – 100x faster than Hadoop (through memory caching); code and transformation optimizers; support for multiple languages (Scala, Python, SQL, R, Java)

2 ways to use Spark:

  • semi-structured data (text files, gene sequencer output);  write transforms and filter functions; classic map/reduce
  • Structured data (implicit or explicitly named columns); transforms and filter structured data using R-style dataframe syntax; SQL with execution optimizers.

Spark data models:

  • RDD (Resilient distributed dataset) storage allows Spark to recover from node failures in the cluster
  • Datasets – semi-structured data with strong typing and lambda functions, custom memory management, optimized execution plans
  • Dataframes – dataset with named columns, supports columnar storage, SQL, and even more optimization

Ambari – tool to manage and deploy a Spark cluster.

Machine Learning

use big datasets to train neural networks for pattern recognition based on ‘learned’ algorithms

Deep learning neural networks have multiple layers and non-linear activation functions

Common thread – training is expensive, and parallelism helps, lots of matrix math processing; GPUs start to look attractive.
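To see where the matrix math comes from, here is a tiny forward pass through a two-layer network in plain Python. It is a toy sketch: the weights are arbitrary made-up numbers, ReLU is just one common choice of non-linear activation, and real frameworks do the same arithmetic on large matrices, which is what GPUs accelerate.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    """Non-linear activation: negative values become zero."""
    return [max(0.0, v) for v in vector]


def forward(layers, inputs):
    """Run inputs through layers given as (weights, biases) pairs.

    ReLU is applied after every layer except the last one.
    """
    activation = inputs
    for index, (weights, biases) in enumerate(layers):
        pre = [z + b for z, b in zip(matvec(weights, activation), biases)]
        is_last = index == len(layers) - 1
        activation = pre if is_last else relu(pre)
    return activation


# Arbitrary illustrative weights: 2 inputs -> 2 hidden units -> 1 output.
LAYERS = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),
    ([[2.0, 1.0]], [0.5]),
]
print(forward(LAYERS, [3.0, 1.0]))  # prints [5.5]
```

Training repeats passes like this (plus the corresponding gradient computations) over the whole dataset many times, which is why it is expensive and why parallel hardware helps.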

Application areas: 3 lenses to look through in higher ed – research, coursework, operations

Example – TensorFire https://tenso.rs/

Case Study: Research

OPM has 40 years of longitudinal data on federal employees. Duke researchers have been developing synthetic data and differential privacy techniques to allow broader audiences to develop models run against data in a privacy-preserving fashion.

Model verification measures: test the fit of a model developed against synthetic data to the real data. Verification measures for models need to run against many slices of the data, and the OPM data is big. Initial approach: run regression models and verifications from R on a large Windows VM against a MS-SQL database. But R is single threaded; custom code was written with the R parallel library to run and manage multiple independent Windows processes to use more server CPU cores.

Rewrite R code to use SparkR

Initial setup: copy CSV files to HDFS; parse CSV and store in Spark Dataframe; treat Dataframe as a table; save table as a Parquet file (columnar format)

Read parquet file and you’re ready to go – SQL queries and R operations on dataframes

Lessons: Social science researchers can probably benefit from tools like Spark; Without access to a sandbox Spark cluster, how can they get up to speed on the tooling? RStudio continues to improve support for Spark via SparkR, SparklyR, etc. Plan to support Spark and R tools for large datasets.

Case Study: Coursework

Spring 2017 – grad biostats course using Jupyter notebooks.

PySpark – Python API for Spark.

Lessons learned: Course assignment k-mer counts in ~ w minutes on a 10-server cluster (each server is 10 cores + 25 G) for 40 students.
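For readers who haven't met the term, a k-mer count tallies every overlapping substring of length k in a sequence. The sketch below is a single-machine version in plain Python, not the course's PySpark assignment; the function name and the example sequence are made up for illustration.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping substring of length k in sequence."""
    if k <= 0:
        raise ValueError("k must be positive")
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


counts = kmer_counts("ACGTACGT", 3)
for kmer, count in sorted(counts.items()):
    print(kmer, count)
```

On a cluster the same counting would be applied per partition of the sequence data and the partial counts merged, which is the map/reduce pattern Spark coordinates.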

Azure has a Jupyter + Spark offering but not configured for courses.

Research

Duke has a wealth of PHI data that researchers want – it lives in a secure enclave that is very locked down. Researchers want to use RStudio and Jupyter to run TensorFlow code against GPUs for image analysis. Don’t fight the firewall – automate the analysis tooling external to the protected enclave – package tools in a container. Don’t use Docker (not suited for research) – use Singularity.

Container tech developed at LBL. Runs as user not as root.

Singularity + GitLab CI – researcher does integrations and commits.

Lessons learned: Singularity not a big stretch if you do Docker. Automating build/deploy workflow simplifies moving tools into protected data enclaves (or clouds or compute clusters).

Using Machine Learning for campus operations. John Board (Duke)

We already have the data for or for … whatever. Enables manual mashups to answer specific questions, solve specific problems. Will rapidly be giving way to fully automated pulling of needed data for unanticipated questions on demand. Data – speak for yourself and show us what’s interesting.

Academic outcome prediction – easy to ask – does your grade in Freshman calculus predict sticking with and success in engineering 2 years out. Hard part is asking – what are the predictors for success? Should be able to poll corpus of university data (and aggregate across institutions).

Computing demands are significant.

How will Machine Learning Impact Learning? – Jennifer Sparrow, Penn State

What computers can’t teach (yet) – analyze, evaluate, create

What computers can teach: remember, understand, apply

BBookX – robot-generated textbook. Faculty put in desired learning outcomes, and the robot finds relevant articles (only uses Wikipedia so far). 85% of students prefer this over a traditional textbook. 35% often or very often visited pages that were not part of the required readings.

When students make their own book using BBookX, 73% said BBookX surfaced content they had never encountered before in their major.

Brainstorming tools – intelligent assistant as a brainstorming partner. Can use BBookX for iterative exploration.

Prerequisite knowledge – as BBookX found texts, the computer was able to identify prerequisite knowledge and provide links to that.

Generated assessments – “adversarial learning”. The robot can generate multiple choice questions. BBookX doesn’t have any mercy. Computer-generated assessments are too hard – even for faculty.

How do we put these tools in the hands of faculty and help use them to create open educational resources?

Simplification – grading writing assignments. Can use natural language processing to give feedback at a richer level than they’d get from faculty or TAs. Students can iterate as many times as they want. Faculty finding that final writing assignments are of a higher quality when using the NLP.

Learning analytics – IR & BI vs data science: aims to precisely summarize the past vs aims to approximate future; enables people to make strategic decisions vs fine scale decision making; small numbers of variables vs hundreds to thousands of variables; decision boundaries are blunt and absolute vs decision boundaries are nuanced and evolved over time.

Able to predict a student’s grade within half a letter grade 90% of the time. Those that fall outside the 90% have usually had some unpredictable event happen. The biggest factors are previous semester GPA, historical average course grade, cumulative GPA, and # of credits currently enrolled. Smaller factors: course level; # of students that have completed the course; high school GPA; # of credits earned.

First Class – VR teaching simulations.

Legal and ethical issues. How do we know students are submitting their own work? Can we use predictive models to analyze that? Is it ok to correlate other data sources? What kind of assumptions can we make? Do we talk to the students? Are we ethically obligated to provide interventions?

Peeling Back the Corner – Ethical considerations for authoring and use of algorithms, agents and intelligent systems in instruction – Virginia Tech  — Dale Pike

Provost has ideas about using AI and ML for improving instruction and learning.

example: different search results from the same search by different people – we experience the world differently as a result of the algorithms applied to us. Wouldn’t it be good if the search engine allowed us to “peel back the corner” to see what factors are being used in the decision making, and allow us to help improve the algorithm?

When it gets to the point where systems are recommending learning pathways or admission or graduation decisions it becomes important.

Peeling back the corner = transparency of inputs and, where possible, implications, when authoring or evaluating algorithms, agents, and intelligent systems. Where possible, let me inform or modify the inputs or interpretation.

Proposal – could we provide a “Truth in Lending Act” to students that clearly indicates activity data being gathered? Required; Opt Out; Opt In

What does a control panel for this look like? How do you manage the infrastructure necessary without crumbling when somebody opts out?

Why? False positives and the potential for rolling implications.

Filters = choose what to show; Recommendation = offer choices; Decision = choose

Data informed learning and teaching

Learning analytics – understanding of individual performance in an academic setting based (usually) on trends of learning activity outcomes. The potential impact of learning analytics is constrained by the scope of analysis: Timeframe, scope of data, source of data. Increasing the impact of potential analysis also increases “creepy factors”.

Personalized/Adaptive learning: individual activity or performance alters (based on model) the substance, pace, or duration of learning experience.

Examples: CMU Simon Initiative Vision – a data-driven virtuous cycle of learning research and innovative educational practice causes demonstrably better learning outcomes for students from any background. Phil Long and Jon Mott — N2GDLE. Now we fix time variable and accept variability in performance. In the future do we unlock the time variable and encourage everybody to get higher performance?

Big Data Analytic Tools and ML/AI’s impact on higher ed. Charley Kneifel – Duke

Infrastructure: CPU/GPU/Network as service. GPUs are popular but hard-ish to share – CPUs are easy. “Commodity” GPUs (actual graphics cards) are very popular (4 in a tower server, 20 in a lab, problems with power and cooling). Centralized, virtualized GPUs make sense (depending on scale), mix of “double precision/compute” and graphics cards. Schedulable resource (Slurm at Duke) and interactive usage. Availability inside protected enclaves. Measure resources – do you have idle resources? Storage – HDFS, object stores, fast local filesystem. Network – pipe sizes to Internet, Science DMZ, encrypted paths…; FIONAs with GPUs – edge delivery.

With VMware they have to reboot servers to de-allocate GPUs, so it’s fairly disruptive.

Cloud vs. on-prem clusters vs. serverless: Software packaging is important (portable, repeatable); support for moving packages to cloud or into protected enclaves; training capacity vs. operational use; ability to use spare cycles (may need to cleanse the GPU); standard tool sets (Spark, machine learning, …); Where is the data? Slurp or sip (can it be consumed on demand)? Serverless support for tools used – only pay for what you use, but remember you need to manage it (agreements for protected data, including BAAs); customized ASICs and even more specialized hardware (Cloud or …); complex workflows.

Financial considerations: costs for different GPUs; peak capacity on prem vs cloud; pay for what you use; Graphics GPUs are cheap but need home and data.

Security – is data private, covered by data use agreement, protected by law, does it need to be combined with other data, is there public/non-sensitive/de-identified data that can be used for dev purposes, IRBs – streamlined processes, training…

Staffing and support: General infrastructure – can it support and scale GPUs and ML? Software packaging – machine learning, R, Matlab, other tools? Tool expertise – both build/deploy and analysis expertise; operational usage vs research usage; student usage and projects – leverage them!

VP of research support levels and getting faculty up to speed – projects that highlight success with big data/analytics; cross departmental services/projects – generally multidisciplinary; university/health system examples at Duke; protected data projects are often good targets; leverage students; sharing expertise; reduce barriers (IT should be fast for this).