Higher Ed Cloud Forum: Epidemic Modeling in The Cloud: Projecting the Spread of Zika Virus

Matteo Chinazzi (Northeastern University)

MOBS lab – part of the Network Science Institute at Northeastern. Models contagion processes in structured populations and develops predictive computational tools for analyzing the spatial spread of emerging diseases.

Heterogeneous interdisciplinary research group – physicists, economists, computer scientists, biologists, etc.

GLEAM – Global Epidemic and Mobility model – integrates different data layers: spatial, mobility, and population data. For Zika, they had to introduce mosquito data, temperature data, and economic data (living conditions).
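To make the idea of layering mobility on top of population data concrete, here is a toy sketch. It is not GLEAM: it is a two-patch stochastic SIR model with person-to-person transmission only (no mosquito, temperature, or economic layers), and every name and parameter value in it is an assumption chosen for illustration.

```python
import random

def binomial(rng, n, p):
    """Draw from Binomial(n, p) by summing Bernoulli trials (fine for toy sizes)."""
    return sum(1 for _ in range(n) if rng.random() < p)

def step(patches, travel, beta, gamma, rng):
    """Advance all patches one day: infection and recovery, then travel.

    patches: list of [S, I, R] counts, one entry per subpopulation.
    travel[src][dst]: daily probability that an infectious person in
    src moves to dst (only infectious people travel in this toy model).
    """
    updated = []
    for s, i, r in patches:
        n = s + i + r
        # Chance that a susceptible is infected by at least one of i contacts.
        p_inf = 1.0 - (1.0 - beta / n) ** i if n > 0 else 0.0
        infections = binomial(rng, s, p_inf)
        recoveries = binomial(rng, i, gamma)
        updated.append([s - infections, i + infections - recoveries, r + recoveries])
    arrivals = [0] * len(updated)
    for src, row in enumerate(travel):
        for dst, rate in enumerate(row):
            if dst == src:
                continue
            # Draw movers from whoever is still left in src, so counts stay >= 0.
            movers = binomial(rng, updated[src][1], rate)
            updated[src][1] -= movers
            arrivals[dst] += movers
    for dst, count in enumerate(arrivals):
        updated[dst][1] += count
    return updated

# Hypothetical setup: the outbreak is seeded in patch 0 only.
rng = random.Random(1)
state = [[990, 10, 0], [1000, 0, 0]]
travel = [[0.0, 0.01], [0.01, 0.0]]
for _ in range(30):
    state = step(state, travel, beta=0.3, gamma=0.1, rng=rng)
print(state)
```

Rerunning with a different seed gives a different trajectory, which is why many repetitions are needed per scenario.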

Practical challenges:

  • Unknown time and place of introduction of Zika in Brazil (handled with Latin square sampling plus long simulations of 4+ years)
  • Parameters need to be calibrated and estimated: prediction errors add stochasticity at runtime.
  • Intrinsic stochasticity due to epidemic and traveling dynamics
  • Need quick iterations between different code implementations
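The talk did not say how the Latin square sampling was set up. As a minimal sketch of the general idea, assume an equal number of candidate introduction dates and candidate introduction places: the sample picks one cell per row and per column of the date-by-place grid, so every date and every place is used exactly once. The dates and city names below are hypothetical placeholders.

```python
import random

def latin_square_sample(dates, places, rng):
    """Pair each candidate date with exactly one candidate place.

    Equivalent to choosing one cell in every row and every column
    of the len(dates) x len(places) grid.
    """
    if len(dates) != len(places):
        raise ValueError("need equally many candidate dates and places")
    shuffled = list(places)
    rng.shuffle(shuffled)
    return list(zip(dates, shuffled))

# Hypothetical candidates, for illustration only.
dates = ["2013-04", "2013-08", "2013-12", "2014-04"]
places = ["city_a", "city_b", "city_c", "city_d"]
for date, place in latin_square_sample(dates, places, random.Random(7)):
    print(date, place)
```

Each sampled (date, place) pair would then seed its own batch of long stochastic simulations.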

Each simulation takes 6-7 minutes, and more than 200k simulations are needed. Each scenario generates about 25 TB of data, and results are needed within a day. Tried on-premise, but there were not enough compute cores, resources were shared and bursty, and there was no reliable way to analyze data at scale.

Migration to GCP – prompt replies and assistance from customer support (“your crazy quota increase request has been approved”)

Compute Engine – ability to scale in terms of compute cores – up to 30k cores consumed simultaneously. Can keep data without saturating on-prem NFS partitions. BigQuery – ability to scale in terms of data processing. In < 1 day can run simulations and analyze outputs.
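A back-of-the-envelope check of why the core count matters, using the figures from the talk (200k simulations as a lower bound, 6.5 minutes as the midpoint of 6-7, 30k cores). It assumes one simulation per core and perfect scaling, which real runs will not quite achieve; the 500-core cluster is a hypothetical comparison point, not a number from the talk.

```python
def compute_budget(n_sims, minutes_per_sim, cores):
    """Return (total core-hours, ideal wall-clock hours on `cores` cores)."""
    core_hours = n_sims * minutes_per_sim / 60.0
    return core_hours, core_hours / cores

core_hours, cloud_hours = compute_budget(200_000, 6.5, 30_000)
_, small_cluster_hours = compute_budget(200_000, 6.5, 500)  # hypothetical on-prem size

print(f"total work:      {core_hours:,.0f} core-hours")
print(f"on 30k cores:    {cloud_hours:.2f} hours")
print(f"on 500 cores:    {small_cluster_hours:.1f} hours")
```

Under these assumptions the simulations alone are roughly 21,700 core-hours: well under an hour at 30k cores, but close to two days on 500 dedicated cores, before any data analysis.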

Workflow steps: Custom OS images for each version of the model; startup scripts to initialize model parameters, execute runs, perform post-processing, and move outputs to a bucket; a Python script to launch VMs, check logs, run analysis on BigQuery, export data tables to a bucket, and download selected tables to the local cluster. Other scripts create a PDF with simulation results.
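A rough sketch of the first two workflow pieces: splitting the runs across VMs and generating a per-VM startup script. The binary names, paths, flags, bucket, and scenario name are all hypothetical, since the talk gave none of them; only `gsutil -m cp -r` is a real command.

```python
def plan_batches(n_sims, sims_per_vm):
    """Split run IDs 0..n_sims-1 into contiguous (first, last) ranges, one per VM."""
    return [(start, min(start + sims_per_vm, n_sims) - 1)
            for start in range(0, n_sims, sims_per_vm)]

# Hypothetical script layout: run the model, post-process, copy to a bucket.
STARTUP_TEMPLATE = """#!/bin/bash
set -euo pipefail
for run_id in $(seq {first} {last}); do
  /opt/model/run_simulation --config /opt/model/{scenario}.cfg --run-id "$run_id" --out /tmp/out
done
/opt/model/postprocess /tmp/out
gsutil -m cp -r /tmp/out "gs://{bucket}/{scenario}/{first}-{last}/"
"""

def render_startup_script(scenario, bucket, first, last):
    """Fill in the template for one VM's share of the runs."""
    return STARTUP_TEMPLATE.format(scenario=scenario, bucket=bucket,
                                   first=first, last=last)

batches = plan_batches(n_sims=10, sims_per_vm=4)
print(batches)
print(render_startup_script("zika_scenario", "my-bucket", *batches[1]))
```

The launcher script would then pass each rendered script to a VM as startup metadata and poll the logs until the outputs land in the bucket.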

Numbers: 750k+ instances, 300 TB of data analyzed, 10M+ global epidemics simulated, 110+ compute years.

Lessons learned: use preemptible VM instances (~1/5 of the price, predictable failure rate); use custom machine types; run concurrent loading jobs on BigQuery; use the Google Cloud Client Library for Python – from simulations to outputs with no human intervention; be aware of API rate limits.
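Three of these lessons can be sketched without touching the network. The dict below follows the shape of a Compute Engine `instances.insert` request body with a custom machine type and preemptible scheduling; the retry helper is a generic exponential backoff for rate-limited calls. The validation rules reflect the documented custom machine type constraints as best I recall them, so check the current docs before relying on them. Function names, the image path, and the rate-limit test are assumptions, and nothing here is submitted to the API.

```python
import random
import time

def custom_machine_type(zone, vcpus, memory_mb):
    """Build a custom machine type path such as .../custom-4-8192."""
    if vcpus != 1 and vcpus % 2 != 0:
        raise ValueError("vCPU count must be 1 or an even number")
    if memory_mb % 256 != 0:
        raise ValueError("memory must be a multiple of 256 MB")
    return f"zones/{zone}/machineTypes/custom-{vcpus}-{memory_mb}"

def preemptible_instance_body(name, zone, vcpus, memory_mb, image, startup_script):
    """Request body for one preemptible VM booting from a custom image."""
    return {
        "name": name,
        "machineType": custom_machine_type(zone, vcpus, memory_mb),
        "scheduling": {"preemptible": True, "automaticRestart": False,
                       "onHostMaintenance": "TERMINATE"},
        "disks": [{"boot": True, "autoDelete": True,
                   "initializeParams": {"sourceImage": image}}],
        "networkInterfaces": [{"network": "global/networks/default"}],
        "metadata": {"items": [{"key": "startup-script", "value": startup_script}]},
    }

def call_with_backoff(fn, is_rate_limited, max_tries=5, base_delay=1.0,
                      sleep=time.sleep, rng=random):
    """Retry fn() with exponential backoff plus jitter on rate-limit errors."""
    for attempt in range(max_tries):
        try:
            return fn()
        except Exception as exc:
            # Re-raise anything that is not a rate limit, or the final failure.
            if not is_rate_limited(exc) or attempt == max_tries - 1:
                raise
            sleep(base_delay * 2 ** attempt + rng.uniform(0, 1))

body = preemptible_instance_body(
    "sim-0", "us-central1-a", 4, 8192,
    "projects/my-project/global/images/model-v1",  # hypothetical image
    "#!/bin/bash\necho start",
)
print(body["machineType"])
```

In a real launcher, the API call that submits `body` would be wrapped in `call_with_backoff`, with `is_rate_limited` inspecting whatever error the client library raises.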

Higher Ed Cloud Forum: Adventures in Cloudy High-Performance Computing

Gavin Burris – Wharton School

HPC – computing resources characterized by many nodes, many cores, lots of RAM, high-speed, low-latency networks, and large data stores.

XSEDE – it’s free (if you’re funded by a national agency).

Cloud – more consoles, more code, less hardware.

Using Ansible to configure cloud resources the same as on-premise; deploying EC2 clusters from Python; CfnCluster (CloudFormation cluster) to build and manage HPC clusters.

Univa UniCloud enables cloud resources to integrate with Univa scheduler.

Use case: C++ simulation modeling code, needed 500 iterations, each took 3-4 days. Used MIT StarCluster with spot bids. For $500, the job finished in 4 days.
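The arithmetic behind this use case, as a quick sketch. It takes 3.5 days as the midpoint of the quoted 3-4 days per iteration and assumes the iterations are independent of each other, which the talk implied but did not state.

```python
def summarize(iterations, days_per_iteration, wall_days, total_cost):
    """Compare running everything serially with the parallel cloud run."""
    serial_days = iterations * days_per_iteration
    return {
        "serial_days": serial_days,
        "serial_years": serial_days / 365.0,
        "speedup": serial_days / wall_days,
        "cost_per_iteration": total_cost / iterations,
    }

stats = summarize(iterations=500, days_per_iteration=3.5, wall_days=4, total_cost=500.0)
for key, value in stats.items():
    print(f"{key}: {value:.2f}")
```

Run one after another, the iterations would take about 1,750 days (close to five years); finishing in 4 days works out to roughly $1 per iteration.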

Use case: Where are the GPUs? Nobody was using the local ones, because doing so required different toolkits and code. So they dropped GPUs in the hardware refresh and used UniCloud to provide cloud instances with GPUs instead.

“Cloud can accommodate outliers” — GPUs, large memory. A la carte to the researcher based on tagged billing. Policy-based launching of cloud instances.

Seamless transition – VPC VPN link provided by central networking, AWS looks like another server room subnet. Consistent configuration management with the same Ansible playbooks. Cloud mandate by 2020 for Wharton – getting rid of server rooms to reclaim space.

They’re doing NFS over the VPN – getting great throughput.

Cost comparison – HPCC local hardware $328k, AWS $294 for flop equiv.

Spotinst – manages preemption and moves loads to free instances.