Cloud Forum 2016 – Research In The Cloud

Daniel Fink from Cornell – Computational Ecology and Conservation using Microsoft Azure to draw insights from citizen science data.

Statistician by training. Citizen science and crowd sourced data.

Lab of Ornithology: Mission – to interpret and conserve the earth’s biological diversity through research, education, and citizen science focused on birds.

Why birds? They are living dinosaurs! > 10k species in all environments. Very adaptable and intelligent. Sensitive environmental indicators. Indian Vulture – 30 million in 1990, virtually extinct today. Most easily observed, counted, and studied of all widespread animal groups.

eBird. Global bird monitoring project: citizen science for people to report what they see and interact with data. 300k people have participated, still experiencing huge growth.

Taking the observation data and turning it into scientific information. Understanding distribution, abundance, and movements of organisms.

Data visualizations:

Data – want to know every observation everywhere, with very fine geographic resolution. Computationally fill gaps in observations, and reduce noise and bias in data using models.

Species distribution modeling has become a big thing in ecology. Link populations and environment – learn where species are seen more often or not. Link eBird data with remote sensing (satellite) data. Machine learning can build models. Scaling to continental scale presents problems. Species can use completely different sets of habitats in different places, making it hard to assemble broad models.

SpatioTemporal Exploratory Model (STEM) – Divide (partition the extent into regions), train & predict models within regions, then recombine. Works well, but computationally expensive. On premise, for one species in North America: fit 10-30k models, 6k CPU hours, 420 hours wall clock (12 nodes, 144 CPUs). Can’t scale – also dealing with a growing number of observations in eBird – 30%/year. Also moving to larger spatial extents.
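The divide / train / recombine idea can be sketched in a few lines of Python. This is only an illustration under loud assumptions, not the Lab's implementation: the regions are randomly placed squares, the "model" inside each region is just the mean observed count, and every name here (make_regions, fit_local_model, stem_predict) is made up for the example.

```python
import random
from statistics import mean

def make_regions(extent, size, n_regions, rng):
    """Divide: randomly place overlapping square regions (x0, y0, size) in a square extent."""
    return [(rng.uniform(0, extent - size), rng.uniform(0, extent - size), size)
            for _ in range(n_regions)]

def in_region(region, x, y):
    x0, y0, size = region
    return x0 <= x <= x0 + size and y0 <= y <= y0 + size

def fit_local_model(region, observations):
    """Train: stand-in local model, the mean count of observations inside the region."""
    counts = [count for (x, y, count) in observations if in_region(region, x, y)]
    return mean(counts) if counts else None

def stem_predict(regions, models, x, y):
    """Recombine: average the predictions of every local model whose region covers (x, y)."""
    preds = [m for r, m in zip(regions, models)
             if m is not None and in_region(r, x, y)]
    return mean(preds) if preds else None

# Hypothetical observations: (x, y, count), with higher counts in the western half.
observations = [(x, y, 10.0 if x < 50 else 2.0)
                for x in range(0, 100, 5) for y in range(0, 100, 5)]

rng = random.Random(0)
regions = make_regions(extent=100, size=30, n_regions=200, rng=rng)
models = [fit_local_model(r, observations) for r in regions]

west = stem_predict(regions, models, 10, 50)
east = stem_predict(regions, models, 90, 50)
print(west, east)
```

Each local model only ever sees data from its own region, which is why the approach can cope with a species using different habitats in different places. It is also why the cost grows quickly: every region is a separate fit.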

Cloud requirements: on-demand: reliably provision resources. Open Source software: Linux, Hadoop, R. Sufficient CPU & RAM to reduce wall clock time. System that can scale in the future. Started shifting workload about 1.5 years ago. Using MapReduce and Spark has been key, but they aren’t typical research computation tools.

In Azure: Using HDInsight and Microsoft R Server – 5k CPU hours, 3 hours wall clock.
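A quick back-of-the-envelope comparison of the on-premise and Azure figures quoted in these notes. The numbers are the rounded ones from the talk, and reading CPU hours divided by wall clock hours as "average CPUs busy" is my assumption, not something the speaker stated.

```python
# Figures as quoted in the talk notes (rounded).
on_prem_wall_hours = 420.0   # on premise: 12 nodes, 144 CPUs
on_prem_cpu_hours = 6000.0
azure_wall_hours = 3.0
azure_cpu_hours = 5000.0

# How much sooner the results arrive.
wall_clock_speedup = on_prem_wall_hours / azure_wall_hours

# Assumption: CPU hours / wall clock hours approximates the average number of CPUs in use.
azure_avg_cpus = azure_cpu_hours / azure_wall_hours

print(f"wall clock speedup: {wall_clock_speedup:.0f}x")
print(f"implied average CPUs busy on Azure: {azure_avg_cpus:.0f}")
```

The total compute is roughly the same in both cases. The gain comes from running far more of it at once.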

Applications – Where populations are, When they are there, What habitats are they in?

Participated in State of North America’s Birds 2016 study. Magnolia Warbler – wanted to summarize abundance in different seasons. Entire population concentrates in a single area in Central America in the winter that is a tenth the size of the breeding environment – poses a risk factor. Then looked to see if the same is true of 21 other species. Still see immense concentration in forested areas of Central America – Yucatan, Guatemala, Honduras. First time there is information to quantify risk assessment. Looking at assessing for climate change and land use.

50 species of water birds using the Pacific Flyway. Concentration point in the California Central Valley, which has had a huge amount of wetlands historically, but now there’s less than 5% of what there was. BirdReturns – Nature Conservancy project for Dynamic Land Conservation. Worked with rice growers in Sacramento River Valley to determine what time of year will be most critical for those populations. The limit is water cover on the land. There’s an opportunity to ask farmers to add water to their paddies a little earlier in the spring and later in the fall, through cash incentives. Rice farmers submit bids, TNC selects bids based on abundance estimates (most birds per habitat per dollar). They’ve put 36k additional acres of habitat since 2014.
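The "most birds per habitat per dollar" selection could look something like the sketch below. The notes don't describe the actual auction mechanics, so this is an assumed greedy ranking under a fixed budget; the Bid fields, the select_bids helper, and all the numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Bid:
    farm: str
    acres: float
    price: float           # total dollars asked by the grower
    birds_per_acre: float  # modeled abundance estimate for that field and time window

def birds_per_dollar(bid):
    """Expected birds supported by the flooded field, per dollar paid."""
    return bid.acres * bid.birds_per_acre / bid.price

def select_bids(bids, budget):
    """Greedy pick: best birds-per-dollar first, skipping bids that no longer fit the budget."""
    chosen, spent = [], 0.0
    for bid in sorted(bids, key=birds_per_dollar, reverse=True):
        if spent + bid.price <= budget:
            chosen.append(bid)
            spent += bid.price
    return chosen

# Hypothetical bids, for illustration only.
bids = [
    Bid("A", acres=100, price=5000, birds_per_acre=20),
    Bid("B", acres=200, price=8000, birds_per_acre=10),
    Bid("C", acres=50, price=4000, birds_per_acre=60),
]
chosen = select_bids(bids, budget=10000)
print([b.farm for b in chosen])
```

The abundance estimates are what make the ranking possible: without them, a cheap bid on a field the birds never use would look like a bargain.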

Quantifying habitat uses. Populations use different habitats in different seasons. Seeing a comprehensive picture of that is new and very interesting. Surprising observation of wood thrushes using cities as a habitat during fall migrations. Is it a fluke caused by observation bias? Is it common across multiple species?

Compare habitat use of resident species vs. migratory species. Looked at 20 neotropical migrants, and 10 resident species. Found residents have pretty consistent habitat use, but migrants show seasonal differences, with a real association with urban areas in the fall. Two interpretations: 1) cities might contribute important refuges for migrant species or, 2) cities are attracting species but are ecological traps without enough resources. Collaborators are setting up studies to see. Hypothesis that they are attracted to lights.

Heath Pardoe from NYU School of Medicine – Cloud-based neuroimaging data analysis in epilepsy using Amazon Web Services.

Comprehensive Epilepsy Center at NYU is a tertiary center, after local physician and local hospital. Epilepsy is the primary seizure disorder (defined by having two unprovoked seizures in a lifetime). Many different causes and outcomes. Figuring out the cause is a primary goal. There are medications and therapies. Only known cure is surgery, removing a part of the brain. MRI plays a very big role in pre-surgical assessment. Ketogenic diet is quite effective in reducing seizures in children. Implanting electrodes can be effective, zapping when a seizure is likely to control brain activity. Research ongoing on use of medical marijuana to treat seizures. Medication works well in 2/3 of people, but 1/3 will continue to have seizures. First step is to take an MRI scan and find the lesions. Radiologists evaluate MRI scans to identify lesions.

Human Epilepsy Project – study people as soon as they’re diagnosed with epilepsy to develop biomarkers for epilepsy outcomes, progression, and treatment response. Tracking patients 3-7 years after diagnosis. Image at diagnosis and three years. Maintain a daily seizure diary on iOS device. Take genomics and detailed clinical phenotyping. 27 epilepsy centers across US, Canada, Australia, and Europe.

Analyzing images to detect brain changes over time. Parallel processing of MRI scans. Using StarCluster to create a cluster of EC2 machines (from 20-200) (load balances and manages nodes and turns them off when not used). Occasionally utilize compute optimized EC2 instances for computationally demanding tasks. Recently developed an MRI-based predictive modeling system using OpenCPU and R.

Have a local server in office running x2go server that people connect to from workstations. From that server upload to EC2 cluster. More than 10 million data points in an MRI scan. Cortical Surface Modelling delineates different types of brain matter. Then you can measure properties to discriminate changes. To compare different patients you need to normalize, by computationally inflating brains like a balloon – called coregistration.

There are more advanced types of imaging.

Some studies done with these techniques: Using MRI to predict postsurgical memory changes. Brain structure changes with antiepileptic medication use.

Work going on – image analysis as web service: age prediction using MRI. Your brain shrinks as you age. If there’s a big difference between your neurologic age and your chronological age, that can be indicative of poor brain health.
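The comparison between predicted and chronological age amounts to a simple difference per subject. A minimal sketch, assuming the MRI-based age prediction already exists: the brain_age_gaps helper, the subjects, and the 5-year cutoff are all hypothetical and are not clinical values from the talk.

```python
def brain_age_gaps(subjects, threshold_years=5.0):
    """Return (subject_id, gap, flagged) for each subject.

    gap = MRI-predicted age minus chronological age, in years.
    threshold_years is a made-up cutoff for illustration only.
    Assumption: a large gap in either direction is flagged.
    """
    results = []
    for subject_id, predicted_age, chronological_age in subjects:
        gap = predicted_age - chronological_age
        results.append((subject_id, gap, abs(gap) >= threshold_years))
    return results

# Hypothetical subjects: (id, predicted age from MRI, chronological age).
subjects = [("s01", 42.0, 40.0), ("s02", 61.5, 50.0)]
results = brain_age_gaps(subjects)
for subject_id, gap, flagged in results:
    print(subject_id, f"{gap:+.1f} years", "flag" if flagged else "ok")
```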

Difficulty of reproducing results is an issue in this field. Usually developed models sit on a grad student’s computer, never to be run again. Heath developed a web service running on EC2 that can be called to run the model consistently.

