CSG Fall 2016: Large scale research and instructional computing in the Clouds

We’re at the University of Michigan in Ann Arbor for the fall CSG Meeting in the Michigan League. Fall semester is in full swing here.

Mark McCahill from Duke kicks off the workshop with an introduction on when and why the cloud might be a good fit.

The cloud is good for unpredictable loads because capacity can expand and shrink elastically. Wisconsin offered an example of spinning up 50-100k HTCondor cores in AWS: http://research.cs.wisc.edu/htcondor/HTCondorWeek2016/presentations/WedHover_provisioning.pdf

There are also research-specific, purpose-built clouds like Open Science Grid and XSEDE.

Is there enough demand on campus today to develop in-house expertise managing complex application stacks? E.g., should staff help researchers write Hadoop applications?

Technical issues include integration with local resources like storage, monitoring, or authentication. That’s easier if you extend the data center network to the cloud, but what about network latency and bandwidth? There are issues around private IP address space, software licensing models, HPC job scheduling, slow connections, billing. Dynamic provisioning of reproducible compute environments for researchers takes more than VMs. Are research computing staff prepared for a more DevOps mindset?

New green field deployments are easier than enhancing existing resources.

Researchers will need to understand cost optimization in the cloud if they’re doing large scale work. That may be a place where central IT can help consult.

AWS Educate Starter – fewer credits than Educate, but students don’t need a credit card.

Case Studies: Duke large scale research & instructional cloud computing

A MOOC (Managing Big Data with MySQL) wanted to provide 10k students with access to a million-row MySQL database. It ended up with over 50k students enrolled.

Architecting for the cloud: Plan to migrate the workload – cloud provider choice will change over time. Incremental scaling with building-block design. Plan for intermittent failures – during provisioning and runtime. Failure of one VM should not affect others.
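The "plan for intermittent failures" point can be sketched as a retry loop around provisioning that keeps one VM's failure from stopping the rest. This is only an illustration, not Duke's tooling: `provision_with_retry`, `provision_fleet`, the attempt count, and the backoff delays are all hypothetical, and the `provision` callable stands in for whatever cloud API is actually used.

```python
from __future__ import annotations

import time
from typing import Callable, Iterable


def provision_with_retry(
    provision: Callable[[str], None],
    name: str,
    attempts: int = 3,          # hypothetical retry budget
    base_delay: float = 1.0,    # hypothetical initial backoff, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to provision one VM, backing off exponentially between attempts."""
    for attempt in range(attempts):
        try:
            provision(name)
            return True
        except Exception:
            # Wait before retrying, but not after the final attempt.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return False


def provision_fleet(
    provision: Callable[[str], None],
    names: Iterable[str],
    **retry_options,
) -> tuple[list[str], list[str]]:
    """Provision each VM independently; one failure does not stop the others."""
    succeeded: list[str] = []
    failed: list[str] = []
    for name in names:
        if provision_with_retry(provision, name, **retry_options):
            succeeded.append(name)
        else:
            failed.append(name)
    return succeeded, failed
```

The failed list is returned rather than raised so that an operator (or a later pass) can deal with the stragglers while the healthy VMs go into service.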

Wrote a Ruby on Rails app that runs on premise, maps each user to their assigned Docker container, and redirects them to the proper container host/port. The Docker containers run Jupyter notebooks, and students get read-only access to MySQL. Each VM runs 140 Jupyter notebook containers + 1 MySQL instance, so in the worst case only 140 users can be affected by a runaway SQL query. Containers are restarted once a day to clear sessions.
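To make the mapping idea concrete, here is a minimal sketch in Python (the actual app was Ruby on Rails, and its logic was not shown in the talk). It assumes users are numbered in enrollment order and that VMs are filled one at a time; the host names, `BASE_PORT`, and the function names are hypothetical. Only the 140-containers-per-VM figure comes from the notes.

```python
from __future__ import annotations

from dataclasses import dataclass

CONTAINERS_PER_VM = 140  # from the notes: 140 Jupyter containers per VM
BASE_PORT = 9000         # hypothetical first container port on each host


@dataclass(frozen=True)
class Assignment:
    host: str
    port: int


def assign_container(user_index: int, hosts: list[str]) -> Assignment:
    """Map the Nth enrolled user to a (host, port) slot, filling one VM at a time."""
    if user_index < 0:
        raise ValueError("user_index must be non-negative")
    vm_index, slot = divmod(user_index, CONTAINERS_PER_VM)
    if vm_index >= len(hosts):
        raise LookupError("no capacity left; provision another VM")
    return Assignment(hosts[vm_index], BASE_PORT + slot)


def redirect_url(assignment: Assignment) -> str:
    """URL the front-end app would redirect the user's browser to."""
    return f"http://{assignment.host}:{assignment.port}/"
```

Because each block of 140 users lands on one VM with its own MySQL instance, a runaway query stays confined to that block, and adding capacity means appending another host to the list.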

At this scale (50-60 servers) they saw 1-2% failure rates. Be prepared for provisioning anomalies. Putting Jupyter notebooks into git made it easy to distribute new versions as content was revised. Hit a peak of ~7400 concurrent users. Added a policy of reclaiming containers that had not been visited for 90 days.
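The 90-day reclaim policy amounts to a simple filter over last-visit timestamps. A minimal sketch, assuming the front-end app records when each container was last visited; the `last_visit` dictionary and the function name are hypothetical, and only the 90-day threshold comes from the notes.

```python
from __future__ import annotations

from datetime import datetime, timedelta

IDLE_LIMIT = timedelta(days=90)  # from the notes: reclaim after 90 days unvisited


def containers_to_reclaim(
    last_visit: dict[str, datetime], now: datetime
) -> list[str]:
    """Return IDs of containers whose last visit is more than 90 days before `now`."""
    return sorted(
        container_id
        for container_id, seen in last_visit.items()
        if now - seen > IDLE_LIMIT
    )
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the function, keeps the policy easy to test and to rerun against historical data.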

Spring 2016 – $100k of Azure compute credits expiring June 30. The compute cluster had all the possible research software on all the nodes, through NFS mounts in the data center. To extend it to Azure they had to put a VPN tunnel in private address space: provision CentOS Linux VMs, make repeated Puppet runs to get things set up, then mount NFS over the tunnel. SLURM started seeing nodes fail and then come back to life, so they needed deeper monitoring that knows more than just whether nodes are up or down.

The default VPN link into Azure maxes out at 100-200 Mbps, so they throttle the Azure VMs at the OS level so that each does no more than 10 Mbps. They limit the number of VMs in an Azure subscription to 20 and run workloads that do more compute and less I/O. Each VM was provisioned with 16 cores and 112 GB RAM. They started seeing failures because there were no more A11 nodes available in the Azure East data center, and it was unclear if/when there would be more nodes there. Other regions add latency.

They ended up using $96k in one month: 80 nodes (16 cores and 112 GB RAM each) in 4 groups of 20 nodes in several data centers, with a VPN tunnel for each subscription group.
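The figures in this case study hang together arithmetically: 20 VMs throttled to 10 Mbps each can push at most 200 Mbps through a tunnel, which is the top of the quoted VPN range. The talk did not say the limits were derived this way, so treat that as an inference. A small sketch of the arithmetic, with hypothetical function names and all constants taken from the notes:

```python
from __future__ import annotations

# Figures taken from the notes above.
PER_VM_CAP_MBPS = 10          # OS-level throttle per Azure VM
VMS_PER_SUBSCRIPTION = 20     # VM limit per Azure subscription
SUBSCRIPTION_GROUPS = 4       # one VPN tunnel per group
CORES_PER_VM = 16
RAM_GB_PER_VM = 112
VPN_LINK_MBPS = (100, 200)    # quoted range for the default VPN link


def worst_case_tunnel_mbps(
    vms: int = VMS_PER_SUBSCRIPTION, per_vm_mbps: int = PER_VM_CAP_MBPS
) -> int:
    """Peak traffic one tunnel carries if every VM saturates its throttle."""
    return vms * per_vm_mbps


def fleet_totals(
    groups: int = SUBSCRIPTION_GROUPS,
    vms_per_group: int = VMS_PER_SUBSCRIPTION,
    cores: int = CORES_PER_VM,
    ram_gb: int = RAM_GB_PER_VM,
) -> dict[str, int]:
    """Aggregate node, core, and RAM counts across all subscription groups."""
    nodes = groups * vms_per_group
    return {"nodes": nodes, "cores": nodes * cores, "ram_gb": nodes * ram_gb}
```

With the defaults, the fleet comes to 80 nodes, 1,280 cores, and 8,960 GB of RAM, and each tunnel's worst case sits at the upper end of the link's range, which fits the choice to favor compute-heavy, I/O-light workloads.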

(One school putting their Peoplesoft HR system in the cloud.)

Stratos Efstathiadis – NYU

– Experiences from running Big Data and Machine Learning courses on public clouds – Grad courses provided by NYU departments and centers. Popular courses with large numbers of students requiring substantial computing resources (GPUs, Hadoop, Spark, etc).

They have substantial resources on premise. Scheduled tutorials on R, MapReduce, Hive, Spark, etc. Consultations with faculty, work closely with TAs. Why cloud? Timing of resources, ability to separate resources (courses vs. research), access to specific computing architectures, students need to learn the cloud.

Need a systematic approach. Use case: Deep Learning class from the Center for Data Science. 40 student teams that needed access to NVIDIA K80 GPU boards. Each team must have access to identical resources to compete. Instructors must be able to assign and control resources. Required 50 AWS g2.2xlarge instances. Issues: Discounts/vouchers are issued per student, not per team. Need to enforce usage caps at various levels so instructor-imposed caps are not exceeded. Daily email notifications to instructors, TAs and teams providing current costs and details. Students were charged for a full hour every time they spun up an instance. AWS costs were estimated at ~$65k per class. On-prem solution was $200k.
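The full-hour billing and the per-team caps interact in a way that is easy to underestimate, so here is a small sketch of the arithmetic. It is not NYU's tooling: the function names are hypothetical, and the hourly rate and cap are parameters the caller supplies rather than figures from the talk. The only behavior taken from the notes is that every instance launch is billed as at least one full hour.

```python
from __future__ import annotations

import math


def billed_hours(session_minutes: list[float]) -> int:
    """Hours charged when each session is rounded up to a whole hour."""
    return sum(math.ceil(minutes / 60) for minutes in session_minutes if minutes > 0)


def team_cost(session_minutes: list[float], hourly_rate: float) -> float:
    """Cost for one team; `hourly_rate` is a caller-supplied assumption."""
    return billed_hours(session_minutes) * hourly_rate


def teams_over_cap(
    usage: dict[str, list[float]], hourly_rate: float, cap: float
) -> list[str]:
    """Names of teams whose cost so far exceeds the instructor-imposed cap."""
    return sorted(
        team
        for team, sessions in usage.items()
        if team_cost(sessions, hourly_rate) > cap
    )
```

For example, three sessions of 5, 61, and 120 minutes add up to just over three hours of actual use but are billed as five hours, which is why many short start/stop cycles can burn through a team's cap faster than expected. A daily report like the one described could be built on top of `teams_over_cap`.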

Use case: Spatial data repository for searching, discovering, previewing and downloading GIS spatial data. First generation was locally hosted: difficult to update, not scalable, couldn’t collaborate with other institutions, lack of in-house expertise, no single sign-on. Decided to go to the cloud.

Use case: HPC disaster recovery
Datasets were available a few days after Sandy, but where to analyze them? They worked with other institutions to get access to HPC, but challenges included copying large volumes of data and dealing with different user environments and configurations. Started using MIT’s StarCluster (from STAR, Software Tools for Academics and Researchers); could also use AWS CfnCluster. Set up a Globus endpoint on S3 to copy data. Software licensing is a challenge, e.g. MATLAB; they worked things out with MathWorks. Currently they’re syncing environments between the New York and Abu Dhabi campuses, but they’re investigating the cloud: looking at the StarCluster/CfnCluster approach, but also might do a container-based approach with Docker.

 
