Oren’s Blog

CSG Winter 2018 – Research and Teaching & Learning IT: Partnering with the Library

This morning’s workshop on partnering between IT and Libraries features Jenn Stringer/Chris Hoffman (Berkeley), Jennifer Sparrow/Joe Salem (Penn State), Diane Butler (Rice), Cliff Lynch (CNI), Louis King (Yale), David Millman (NYU)

The morning is starting off with some thoughts from Cliff Lynch (CNI):

Reminders of some things many haven’t lived through: In the early 90s there was a call not only for collaboration between IT and Libraries, but serious talk of merging. It was tried at a few institutions, like Columbia University. The takeaway was that it’s fairly crazy at large institutions. The mission expansion of each has been in differing rather than overlapping areas. But it’s been successful at a number of liberal arts organizations.

When CNI was founded it was totally viewed as a collaboration between the CIO and the head of the Library at member institutions. In the early 2000s that makeup was changing. The representation was the head of the library and someone doing research or academic computing, or doing digital work in the libraries. This led to increasing disengagement of the CIOs. Starting around 2000, CNI started putting on executive roundtables with the intent of re-engaging the CIOs. It was fairly easy in the first few years to come up with topics in that sweet spot, but it got harder. If you look back from 1990 – 2005 you see that Libraries had low levels of technical expertise. At the same time libraries had developed some internal expertise in technologies important for digital humanities, data curation, etc., where there is now more competence than in the central IT org, which has structured its mission around infrastructure, compliance, etc. Libraries continue to rely on IT for fundamental infrastructure.

If you look at the landscape, how much IT capability is native to the library, and how much replicates or complements the expertise in IT? This is hugely inconsistent. If you polled the CSG campuses you’d be surprised at the degree of variation in organic IT expertise in the library.

Collaborations involving the library have become much more multilateral rather than bilateral with IT – involving partners like University Presses, Museums, research data management, digital scholarship centers (often involving an academic school or department), geospatial centers, maker spaces.

Don’t forget collaboration on institutional policies. Data governance, privacy and reuse of student data and analytics, responsibility of university to preserving scholarly products. Had a recent roundtable looking at policy implications of adoption of widespread cloud platforms.

This area does not lend itself to checklists.

UC Berkeley – Chris Hoffman

A history of good intentions – Museum Informatics Project – Housed in Library, Digital collections and DAMS. Complicating factors: Sustainability, budget cuts, grant funding; priorities; loss of key champions; culture.

Collectionspace – managing collections for museums.

Research Data Management – an impetus for change. New drivers (DMP requirements), new change leaders, new models for partnership. Benchmarking justified need. Broad definition of research data – all digital parts of a research project. Priority to nurture collaboration between IT and Library. Co-funded a position for program manager. Campus-wide perspective, investing in understanding and bridging cultures.

What’s next? More challenging tests to partnerships, RDM 2.0, Visualization and makerspaces, more fundamental technologies? (archival storage, virtual teaching and research environments); strategic alignment?

NYU – Stratos Efstathiadis, David Millman, David Ackerman

Research technology works closely with Libraries.

Data Services – estab. 2008. 11 FTE. Consultation and instructional support for scholars using quantitative, qualitative, survey design, and geospatial software and methods. Joint service of IT and Libraries.

Digital Library Technology Services – estab ca. 2000. Digital content publication and preservation. New services to support current scholarly communication. R&D to develop new services and partnerships, 19 FTE.

Research Data Management Services – estab 2015. 2 FTE. Promulgate best practices in data organization, curation, description, publication, compliance, preservation planning, and sharing.

Research Cloud Services – new collaboration built on other preexisting services. Interconnected research storage environment. Reimagine a spectrum of cloud storage from dynamic to published final products. Provide backbone for researchers but also Libraries’ collections and workflows.

Yale – Louis King

Considerable history at Yale in working in digital transformation space.

Office of Digital Assets and Infrastructure – Sept 2008. Work closely with Library and ITS. Focus on Digital Assets & Infrastructure. Take advantage of disciplinary approach of libraries and technical capacity of IT.

Looking for ways to gain efficiencies and lower overhead for people who want to manage digital content.

Had some substantial initial success, but changes: Initial provost sponsor left Yale, 2009 financial crash, VP retired, two library director transitions, transition in IT director, emerging digital systems in Library.

Late 2012 relaunch as Yale Digital Collections Center, but closed in 2015. But it catalyzed momentum towards digital transformation at Yale. Established the foundation for many successful current and future collaborations.

Rice University – Diane Butler

Library and IT have been partners for a long time. For a very short time, the organizations were merged. Research IT and library have been partners since 2012 and informally even further back. Began with library providing the service and IT providing the core infrastructure but has morphed into a collaborative partnership.

Areas of collaboration: Data Management (through Library). Provide consultation, including creating DMP, describing and organizing data, storing data, and sharing data. Training, Access to resources such as platform for sharing and preserving publications and small-to-medium datasets. Still an area for work as faculty aren’t very engaged.

Digital Scholarship: Service provided by library with IT providing infrastructure. Preserving scholarship, navigating copyright and open access, managing and visualizing data, digitizing materials, consultation, etc.  Research IT has history in supporting engineering and sciences, but not so much in humanities.

Digital Humanities: Imagine Rio Project. Most successful collaborative project to date. Architecture and history professors joining together to imagine Rio de Janeiro. Searchable atlas of social and urban evolution of Rio.

Positive outcomes: Research IT had not supported Humanities or qualitative social sciences previously. Success of project has brought in more funding. Research IT now has 2 facilitators that are working with faculty in those disciplines.

At Rice the board has come up with some base funding for research computing, so that all of the work doesn’t have to be funded by grants.

Penn State – Joe Salem and Jennifer Sparrow

Strong history of working with libraries, IT, and student services on accessibility issues. Thinking about spaces in place and how to leverage institutional spaces. Built a “blue box” classroom.

Worked on the Dreamery – a co-learning space for bringing emerging technologies onto campus.

Driving strategic initiatives: Collaborative, technology-infused space. Inherited a space called the Knowledge Commons. Includes a corner with staffing from both Libraries and Academic Tech. Service partnership profile has grown from just a focus on media, to overall platform for supporting students. Work on curricular support together – open educational resources and portable content. Instructional design is a focus.

Learning Spaces committee – Provide leadership in innovative instruction.

What makes the partnership work … or not? What does each side bring to the table?

Chris – Berkeley

Visualization service at Berkeley. HearstCAVE: Connected virtual spaces over the Pacific Research Platform around archaeology preservation. Thinking about how it connects with data science.

Makerspaces at UCB – pockets of excellence and experimentation. Jacobs Institute for Design Innovation. Talking with library and ETS to look at space.

Hooking the two together in a Center for Connected Learning.

Research Data Management at Yale
Much Ado about Something: Complex funder requirements; reliable verification of results; reuse of data in new research.

What are the responsibilities and rights of the University and faculty regarding research data? They put out a Yale Research Data & Materials Policy. Developed over 2-2 years with collaboration across the university. There is significant collaboration in support of that policy – Library and IT collaboration: Research Data Strategic Initiative Group, Research Data Consultation Group, Yale Center for Research Computing.

Recommendation: Research Data Service Unit; Reports within Library – Assessment, coordination, outreach and communication. Federated support model for all research data support services – research technology, data management, metadata, outreach & communications, customer relations, education and training, research data administrative analytics.

NYU – David Millman

Bottom-up requirements – survey local researchers: IT/Lib complementary styles, contacts. Survey peers: IT/Lib coordinated.

Executive review: Dean, AVP-level

NYU – research repository service identification. Umbrella of services – research lifecycle. Creation, manipulation, publication, etc. Holistic – customer focus. 1 – HPC storage; 2 – “medium” performance storage (CIFS, NFS); 3 – “published” storage – preserved, curated, citable.

IT/Library crossover strategy questions: business of universities: long-term preservation of scholarship. Any updates on our participation in digital preservation facilities? Some of our colleagues have recommended highly distributed protocols for better preservation. How do we approach this?




CSG Winter 2018 – Much Ado About GDPR

We’re in sunny, warm LA for the Winter CSG meeting, hosted by USC.  Last night, Asbed coordinated a group to go out for tacos at http://chichenitzarestaurant.com/ , which was excellent!

This morning we’re starting off with a workshop on GDPR, featuring: Sharif Nijim (Notre Dame), Jim Behm (Michigan), Paul Erickson (Nebraska), Alan Crosswell (Columbia), and Kitty Bridges (NYU)

GDPR = General Data Protection Regulation – 127 days until enforcement on May 25

Membership survey:
87% think GDPR is an institutional risk
58% identified as beginners in GDPR
70% either don’t know or don’t think their institution will be compliant
41% have engaged outside counsel
50% General Counsel and IT partnership to lead compliance initiative.

What is GDPR? Alan Crosswell.

EU regulation on personal data protection, applicable to people, products, or services. Replaces old regulations dating back to 1995. Covers: personal data (relating to people). Examples: IP address, genetic data, health data, research data, video surveillance. Who is covered? EU individuals or any company that offers products/services to EU individuals or collects/processes their personal data (includes non-EU citizens located in EU).

Requirements: Identify personal data; data protection by design; individual rights on data usage (transparency, right to data erasure, right to data portability, etc); obtain proper consent (opt-in); withdrawn consent and the right to be forgotten (opt-out); breach notification; designate a data protection / privacy officer (DPO).

What does it mean for a student to have the right to be forgotten?

Penalty: Failing to report breaches within 72 hours can bring a maximum fine of 20 million euros or 4% of organizational annual revenue – whichever is greater.
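As a quick illustration of the "whichever is greater" rule noted above, here is a minimal sketch. The revenue figures are made up for illustration, and the function name is hypothetical; this is arithmetic on the numbers in these notes, not legal guidance.

```python
FLAT_CAP_EUR = 20_000_000   # flat ceiling from the notes above
REVENUE_PERCENT = 4         # percentage of annual revenue from the notes above


def max_fine_eur(annual_revenue_eur: int) -> float:
    """Return the ceiling on a fine: the greater of the flat cap or 4% of revenue."""
    revenue_based = annual_revenue_eur * REVENUE_PERCENT / 100
    return max(FLAT_CAP_EUR, revenue_based)


# Hypothetical annual revenues (EUR), chosen only to show both branches.
for revenue in (100_000_000, 500_000_000, 2_000_000_000):
    print(f"revenue {revenue:>13,} -> ceiling {max_fine_eur(revenue):>12,.0f}")
```

Under these numbers the flat cap dominates until annual revenue passes 500 million euros, after which the 4% figure takes over.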

Preparing for GDPR – key steps: Promote awareness; discover PII you hold; implement data protection by design; identify legal basis for processing personal data; review procedures for communicating personal data, individual data rights, data consent, guardian consent for minor’s data, data breach detection, response, and notification; designate data protection / privacy officer.

EU Individual – physically located in an EU member state, both EU citizens and non-EU citizens.
Personal Data – relating to identified natural person. name, ID number, location data, online identifier, address, email, passport, cookies, drivers license, etc
Consent: freely given, unambiguous indication of data subject’s wishes.

Question: does this include firewall logs? General agreement that it does.

Comment: This is subject to legal jurisdiction, and the thought that this is generally applicable to everything we do might not be correct.

GDPR Scenarios

Recruiting: NYU recruiter holding open house in Paris for EU people to find out about NYU. Recruiter gathers name, interests, and hands over wifi credentials. Need to give an explicit consent form, saying which elements are collected, what they’ll be used for, and how long they’ll be retained. Has to be provided in the native language. (Is your admissions prospecting software aware of and planning on how to handle GDPR? That’s institutional data – it’s an indemnification issue. What kind of contract language do you have?).
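One way to picture what the recruiting scenario asks for (which elements are collected, what they are used for, how long they are kept, and in what language the form was presented) is a small consent record. This is only a sketch: the class, field names, and sample values are all hypothetical and are not drawn from any admissions product or from the workshop.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional, Tuple


@dataclass
class ConsentRecord:
    """Hypothetical record of one person's opt-in consent."""
    subject_id: str
    elements: Tuple[str, ...]      # which data elements were consented to
    purpose: str                   # what they will be used for
    language: str                  # language the consent form was presented in
    collected_on: date
    retention_days: int            # how long the data may be retained
    withdrawn_on: Optional[date] = None

    def expires_on(self) -> date:
        return self.collected_on + timedelta(days=self.retention_days)

    def permits(self, element: str, on: date) -> bool:
        """True if using `element` on date `on` is covered by this consent."""
        if self.withdrawn_on is not None and on >= self.withdrawn_on:
            return False
        in_window = self.collected_on <= on <= self.expires_on()
        return in_window and element in self.elements


# Made-up example: a prospect at an open house consents to two elements for one year.
record = ConsentRecord(
    subject_id="prospect-001",
    elements=("name", "interests"),
    purpose="recruiting follow-up",
    language="fr",
    collected_on=date(2018, 1, 18),
    retention_days=365,
)
print(record.permits("name", date(2018, 6, 1)))
print(record.permits("email", date(2018, 6, 1)))
```

A record like this would have to travel with the data as it moves between systems, which is the tracking problem raised in the admissions scenario.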

Admissions: Need name, national ID, country of origin, addresses, high school transcripts, etc. to make effective admissions decision. Also use that information for research (see Unizin). How is consent for data tracked through the various systems?  (Common App Organization GDPR adjustment ETA? – “early spring”).

Question – has anyone reached out to European universities on what they’re doing to prepare for GDPR?

Matriculation – example of alleged assault from student abroad. What happens if student exercises GDPR rights to not share data back to the US? Could contracts with partners abroad be affected if we don’t behave according to GDPR? Example of LMS vendor that is spinning up version of LMS in the EU specifically for GDPR – do we keep our data there for EU citizens?

Research – What about information about researchers kept on servers? Do legal federations with agreements help us? GEANT did a study on GDPR impact on eduGAIN. Emerging attribute release agreements help with GDPR compliance. GEANT is submitting a new code of conduct for GDPR – a way of publishing attributes in an open and transparent way. Coming out later this year. Transparency, documentation, and incident response are critical pieces.

Alumni and Benefactors – What data are collected and where is it? What if they want to be removed? Compliance might be viewed as a revenue issue. There is a notion in GDPR of “legitimate interest” but that isn’t a blanket clause.

Comment: We should follow advice of counsel on how to approach GDPR. It may not be worth a lot of worry at this point about how much this impacts us. Just because it’s over the Internet doesn’t make it different than any other issue between countries and how citizens are treated. We all need to decide what our risk posture will be.

How many campuses operate summer camps with people under 16 from EU countries?

If institutions are backing away from collecting citizenship data (from concern about undocumented people), does that impede our effort to comply?

Educause and GDPR: Trying to curate best resources – see page at: https://library.educause.edu/topics/policy-and-law/eu-general-data-protection-regulation-gdpr

Good to start with JISC resources. https://www.jisc.ac.uk/gdpr

Territoriality – we higher ed institutions generally have enough business links that we should surmise that GDPR might apply in some way.

Educause is working with other US higher ed groups (NACUA, etc) on GDPR guidance. It’s slow going, and all organizations are struggling with what advice to give members.

Notre Dame – Initial meeting with General Counsel (8/2017); Elevated to information governance committee (9/2017); Assigned to IT by institutional risk committee (10/2017); Compliance questionnaire circulated (11/2017); Questionnaire data aggregated and analyzed (1/2018) – Hard to collect data across the institution – will need help from general counsel in complying with collection. The vision is that data stewards will be accountable for the data in their areas. Impossible to collect every last piece of data, but important to show due diligence and have a process for dealing with issues.

NYU – Have hired external counsel – issuing questionnaires. In data collection mode, focused on central administrative entities. Don’t yet know what the institutional posture will be. General Counsel will advise. Will likely think of this as responsibility of business offices, who have been involved in discussions. IT is a key partner, knowing how things are connected together. First thing to focus on: Documenting identity data; movements of data between systems; prioritizing what to worry about first (biggest risk). Especially tricky areas are warehousing, analytics, and logs. Logs: operational logs (IP addresses, MAC addresses, authentication logs, DB logs, application logs) used for troubleshooting and trending. Can they even be made anonymous? Audit logs – understanding who has access, understanding really how long things need to be kept in identifiable form.
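On the question of whether operational logs can be made anonymous, one commonly discussed middle ground is pseudonymization: replacing each IP address with a keyed hash so that lines can still be correlated for troubleshooting and trending without storing the raw address. The sketch below is an assumption on my part, not something presented in the session; the function name, token format, and sample log line are hypothetical. Pseudonymized data is generally still treated as personal data under GDPR, since whoever holds the key can re-link it, so this reduces exposure rather than settling the question.

```python
import hashlib
import hmac
import re

# Simple IPv4 matcher; IPv6 and MAC addresses would need their own patterns.
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def pseudonymize_ips(line: str, key: bytes) -> str:
    """Replace every IPv4 address in a log line with a stable keyed token."""
    def replace(match: "re.Match[str]") -> str:
        digest = hmac.new(key, match.group(0).encode("ascii"), hashlib.sha256)
        return "ip-" + digest.hexdigest()[:12]

    return IPV4_PATTERN.sub(replace, line)


# Made-up firewall log line using documentation-reserved addresses.
sample_line = "2018-01-18T09:15:02 ACCEPT src=192.0.2.10 dst=198.51.100.7 port=443"
print(pseudonymize_ips(sample_line, key=b"rotate-this-key"))
```

The same address always maps to the same token under one key, so trending still works; rotating or destroying the key after the retention period breaks the link back to the original addresses.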

Nebraska: Bringing together multiple conversations around GDPR – General Counsel coordinating. Work in progress – expect to at least have posture before deadline. NACUA webinar was very helpful. Distance ed group started early on. Good test of relationships across campus – IT as implementer. Research group is interested in GDPR to help guide data governance. Indemnification – example of SaaS contract where vendor struck out “global standards.”

UMich – Started this past summer – taking a “cautious approach”. Concern about the extent to which regulations will apply to US institutions. Group led by General Counsel with representation across campus. Counsel has hired a consultant to help guide campus through the process. There are enough gray areas that it’s unlikely that campuses will be held accountable in May. For state institutions, it may be the state that is accountable, not the institution. Might not be the case at Michigan.

Rice – Chief Compliance Officer leads a working group with the CISO. Creating an institutional web site for information.

UVa and Va Tech – very early in process. Conversations with General Counsel.  State AG’s office has hired counsel who should be issuing guidance for state higher ed.

Ron – IT is the only organization that touches every other organization in multiple domains – so it falls to us to be of service.

Minnesota – Counsel leading effort, still assessing impact and how much needs to be done.

Iowa – In due diligence approach, with Counsel taking lead. Will be naming a privacy officer. Creating a plan for operations that take place in the EU, which is a relatively small set.

CMU – very early on. Taking gap analysis approach.

Sharif – taking the approach that much institutional data is “legitimate interest” vs. asking specific consent. But that still requires transparency. How far does legitimate interest go?

Maybe this worry is overblown (as it was with CALEA)? It’s primarily targeting Googles and Facebooks, not higher ed.

Should we be reviewing cloud contracts for how GDPR is or isn’t covered? Could Educause help come up with a checklist for review? To what extent does it affect Net+ contracts? (e.g. LMS).  We could have an area on the CSG site for sharing information.

We may be likely to see something analogous in the US, so this won’t be wasted effort. Much of what we need to do for GDPR is just good enterprise data practice.

It’s not an IT project, it’s about institutional risk. Should be part of that regular assessment process.


Higher Ed Cloud Forum: Getting Harvard’s Enterprise Systems to the Cloud

Ben Rota, Harvard

How a crisis optimized our organizational structure

Phase One Org Structures (as seen retroactively):

February 2015 – Trying to change culture as well as technology. People had expectations that were impossible to meet. Original cloud team was drawn from infrastructure and IdM – no developers or applications people.

May 2015 – Restructured to focus on migrating a single application and supporting Public Affairs.  Successful migration, but had tension between operational work and further migrations.

June 2015 – split into multiple, smaller scrum teams to support more simultaneous projects. Lacked cohesiveness, plus operations were still killing them.

September – December 2015 – team was demoralized. Operational work continued to be a problem – pile of cleanup work. No ability to reduce tech debt.

December 2015 – PeopleSoft group decided that 9.2 upgrade would be done in the cloud. Cloud team didn’t have enough resources available to help, but their consultant could help.

June 2016 – PeopleSoft team realizes consultant didn’t work out. Cloud program put .5 FTE on the project.

December 2016 – PeopleSoft migration at significant risk – migration team created to respond to impending crisis.

Application Portfolio Teams – co-located, cross-functional group for portfolio migration projects. How’s it going? Migrations are accelerating. PeopleSoft, Oracle Financials, and Grants Management are migrated. Close to migrating the Alumni Affairs and Development system. Troubleshooting migration problems has gotten easier – co-location smooths communication. Shared goals break down silos.

Organization of work around HUIT managed applications runs the risk of neglecting the “long tail” of smaller applications and systems. Too many product owners in the kitchen – how do you prioritize work when you have competing interests? Operational work vs. migration work is still a challenge, but now it’s more about prioritization within a team rather than across silos.  DevOps still has too many definitions! Had a day-long workshop open to all of HUIT in a facilitated discussion about what they hope to get out of this effort.

Higher Ed Cloud Forum – Tools for predicting future conditions: weather & climate

Toby Ault, Marty Sullivan – Cornell

Numerical models of climate science. Most of the fluid dynamics in weather and climate models are physical equations, and the solvers are “Dynamical Cores” that tell you about the flows of fluid in 3D space.

Continuum of scale needs to be accommodated – done through parameterization. Want to be able to sample as many parameterization schemes as possible.

Interested in intermediate time scales (weeks to months) that have been difficult to model. There’s uncertainty arising from different models so having multiple models with multiple parameterizations that get averaged together with machine learning can have huge payouts in accuracy.
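To make the idea of combining multiple models concrete, here is a minimal sketch of one simple stand-in for the blending step: weighting each ensemble member by the inverse of its past mean squared error. This is not the speakers' method (they mentioned machine learning without details), and every number and function name below is made up for illustration.

```python
def inverse_error_weights(past_forecasts, past_observations):
    """Weight each ensemble member by 1 / (its mean squared error on past data)."""
    raw = []
    for member in past_forecasts:
        squared_errors = [(f - o) ** 2 for f, o in zip(member, past_observations)]
        mse = sum(squared_errors) / len(squared_errors)
        raw.append(1.0 / (mse + 1e-9))  # small constant avoids division by zero
    total = sum(raw)
    return [w / total for w in raw]


def blend(new_forecasts, weights):
    """Weighted average of one new forecast value per ensemble member."""
    return sum(w * f for w, f in zip(weights, new_forecasts))


# Hypothetical past temperatures and three members' hindcasts of them.
observed = [10.0, 12.0, 11.0, 13.0]
hindcasts = [
    [10.5, 12.5, 11.5, 13.5],  # member A: small bias
    [12.0, 14.0, 13.0, 15.0],  # member B: large bias
    [9.0, 11.0, 10.0, 12.0],   # member C: moderate bias
]
weights = inverse_error_weights(hindcasts, observed)
blended = blend([14.5, 16.0, 13.0], weights)
print(weights, blended)
```

Members that tracked past observations more closely get more say in the blended forecast; a learned model could replace the fixed weighting rule while keeping the same overall shape.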

Are the most useful variables for stakeholders the most predictive?

Weather simulation in the cloud:

Infrastructure as code makes things much easier. Able to containerize the models (which include lots of old code), so people don’t need to know all the nuts and bolts. Using lots of serverless – makes running web interfaces extremely easy.

Democratization of science – offered through a web interface. People can configure and run models for their locations.

Lots of orchestration behind the scenes: Deploying with CloudFormation, using ECS, etc.

Higher Ed Cloud Forum: Desktop as a Service – Moonshot to production in 6 months

Deanna Mayer, Brady Phipps – University of Maryland University College (UMUC)

Primarily online programs, 90+ programs and specializations, 80k students worldwide, 140+ classroom and services locations in 20 countries. Heavily into IT outsourcing – started with a VDI vendor, but they couldn’t scale. Needed non-device specific VDI that didn’t require an install.

Student requirements: fully integrated, one-click classroom experience; access across program, not limited to single course; secure environment providing immersive experience; ability to scale; single sign-on; rich metrics and analytics. Huge spikes in usage on Sunday nights before assignments were due.

January – April 2016 did an RFP. No vendor met all requirements. Most vendors focused on a single image across a corporation. Partnered with Amazon in April, project approved in June. Flew local solutions architect to Seattle to sit with AWS side-by-side for three weeks. Ten people for project team in a room focused on the problem, due by October. Initial launch to 400 students in August. Cut cord with legacy vendor in May – moved over 60 courses. Now have over 10k students using it. 22.5 hours/month average usage, 25% drop in student support requests.

Launched AccelerEd, a new company with Aloft, a cloud services unit.

Higher Ed Cloud Forum: When can a computer improve your social skills?

Ehsan Hoque (University of Rochester)

Behavior mining -> Applications -> Deployment

Automated Prediction of Interview Performance -> My Automated Conversation Coach (MACH) -> ROCSpeak.com

MACH – My Automated Conversation coacH – originated from people with Asperger’s wanting help developing conversational skills.

Originally a research application, got a grant from Azure to develop a cloud version. As people use the framework, the data gets fed back into the model, which improves the performance.

At the end, it’s not the specific cloud functionality but the interaction with the people at the vendor that makes things work.

Higher Ed Cloud Forum: Epidemic Modeling in The Cloud: Projecting the Spread of Zika Virus

Matteo Chinazzi (Northeastern University)

MOBS lab — part of Network Science Institute at Northeastern, modeling contagion processes in structured populations, developing predictive computational tools for analysis of spatial spread of emerging diseases.

Heterogeneous interdisciplinary research group – physicists, economists, computer scientists, biologists, etc.

GLEAM – Global epidemic and mobility model – integrates different data layers – spatial, mobility, population data. For Zika, had to introduce mosquito data, temperature data, and economic data (living conditions).

Practical challenges:

  • unknown time and place of introduction of Zika in Brazil (Latin square sampling + long simulations (4+ years))
  • Parameters need to be calibrated and estimated: prediction errors add stochasticity at runtime.
  • Intrinsic stochasticity due to epidemic and traveling dynamics
  • Need quick iterations between different code implementations

Each simulation takes 6-7 minutes, need > 200k simulations. Each scenario generates about 25TB of data, needed in a day. Tried on-premise, but not enough compute cores, resources were shared and bursty, and there was no reliable solution to analyze data at scale.
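A back-of-the-envelope calculation shows why on-premise cores ran out. The sketch below assumes each simulation occupies one core for its whole run and that work parallelizes perfectly; both are my assumptions, not something stated in the talk, and the function name is hypothetical.

```python
import math


def cores_needed(simulations: int, minutes_per_simulation: float,
                 deadline_hours: float) -> int:
    """Cores required to finish all runs by the deadline, one core per run."""
    total_core_minutes = simulations * minutes_per_simulation
    return math.ceil(total_core_minutes / (deadline_hours * 60))


# 200k simulations at 6 to 7 minutes each, results needed within a day.
low = cores_needed(200_000, 6, 24)
high = cores_needed(200_000, 7, 24)
print(f"{low} to {high} cores busy for a full 24 hours")
```

That is for a single batch of 200k runs with no failures, reruns, or post-processing, which helps explain both the shortfall on a shared local cluster and the much higher peak core counts reported on the cloud side.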

Migration to GCP – prompt replies and assistance from customer support (“your crazy quota increase request has been approved”)

Compute Engine – ability to scale in terms of compute cores – up to 30k cores consumed simultaneously. Can keep data without saturating on-prem NFS partitions. Big Query – ability to scale in terms of data processing. In < 1 day can run simulations and analyze outputs.

Workflow steps: Custom OS images for each version of model; startup scripts to initialize model parameters, execute runs, perform post-processing and move to bucket; Python script to launch VMs, check logs, run analysis on BigQuery, export data tables to bucket, and download selected tables on local cluster. Other scripts to create pdf with simulation results.

Numbers: has 750k+ instances, analyzed 300 TB of data, simulated 10M+ global epidemics, 110+ compute years

Lessons learned: Use preemptible VM instances (~1/5 of price, predictable failure rate); use custom machine types; run concurrent loading jobs on BigQuery; use Google Cloud Client Library for Python – from simulations to outputs with no human interventions; Be aware of API rate limits.
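The last lesson, being aware of API rate limits, usually translates into retrying with exponential backoff when launching many VMs or load jobs from a script. Here is a minimal, library-agnostic sketch. `RateLimitError` is a hypothetical stand-in for whatever exception a real client raises when a quota is exceeded, and the delay values are arbitrary; none of this is taken from the speaker's scripts.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a client's 'quota exceeded' error."""


def call_with_backoff(request, max_attempts=5, base_delay=1.0,
                      max_delay=32.0, sleep=time.sleep):
    """Call request(); on rate-limit errors wait 1s, 2s, 4s, ... plus jitter."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter keeps many parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))


# Demo with a fake request that is rate-limited twice before succeeding.
attempts = {"count": 0}


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise RateLimitError("quota exceeded")
    return "ok"


recorded_delays = []
result = call_with_backoff(flaky_request, sleep=recorded_delays.append)
print(result, recorded_delays)
```

Passing `sleep` as a parameter keeps the demo instant; in a real script the default `time.sleep` would apply. The same wrapper shape also suits resubmitting work after a preemptible VM is reclaimed, though that needs its own error type.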