CNI Fall Meeting 2010 – Cyberinfrastructure Framework

Cyberinfrastructure Framework
Alan Blatecky – acting director, NSF Office of Cyberinfrastructure

5 crises
Computing tech
Data, provenance, and viz
Software
Organization for multidisciplinary science
Education

Science and scholarship are team sports
Collaboration/partnerships will change
– building dynamic coalitions in real time
Ownership of data plus low cost fuels growth in the number of data systems
– federation and interop become more important

Innovation and discovery will be driven by analysis
Mobility and personal control will drive innovation and research communities
– e.g. using accelerometers for earthquake detection
Gaming, virtual worlds, and social networks will transform the way we do science, research and education

All the layers have to work together for the system to function: a cyberinfrastructure ecosystem.

The goal of virtual proximity – you are one with your resources. Collapse the barrier of distance. All resources are virtually present, accessible, and secure.

NSF

Data enabled science
Community research networks
New computational infrastructure
Access and connection to facilities

Impacts on NSF
CI as enabling infrastructure for S&E
New role for data
Multi-disciplinary approaches essential
Education – embedded and integral
More coordinated post-award management

Examples

Transient & data-intensive astronomy
– seeing events as they occur
– complex interconnected earth systems

Four data challenges
Volume
Growth
Distribution
Data sharing

Sea of data
– data enabled sciences
— immediate and long term support of data
— focus on data life cycle issues
— development of data tools – mining, visualization, algorithms
— broad computation science education program
– advanced computational infrastructure
— software elements -> integration -> institutes
— sustained long-term investment in software
– data services – integration, preservation, access, analysis
— community access networks – building virtual communities
— collab tools, secure systems to link people etc
– access and connectivity
— connections to facilities and instruments
— OOI, sensor networks, telescopes (desktop connectivity hasn’t improved)
— cybersecurity
— networking, end-to-end

– data sciences

[ECAR Summer 2008] Bob Franza

Seattle Science Foundation – to nurture networks of experts. They have an 18k sq. ft. facility in Seattle, but if they build more bricks in the future they will have failed. Working in virtual environments (Second Life) – not games.

Using the virtual environment to solve real problems of distributed teams, not just as a neat technology.

CareCyte – why should workflow and health care be foreign concepts to each other? If you were going to redesign a health service facility, what would you build? Rethought the design – ultra-fast design, manufacture, assembly, near-laminar airflows, all internal walls are furniture that can be reconfigured easily, etc. Rendered the facility in about 2.5 days in Second Life, and can show people the ideas and design in an engaging way that can’t be accomplished with drawings and static images. The technology must not be used in ways that merely replicate existing activities (e.g. giving lectures).

“Recruiters will look at somebody with a World of Warcraft score of 70 or above as CEO material.”

Doesn’t like the term “virtual” – has negative semantics around it, including accountability. Prefers the term “immersive”.

Bob talks about the ability to become things you aren’t in the environment, whether that’s a molecule to better understand how physics works, or seeing what it’s like to be in a wheelchair, or changing gender.

He’s asked what the implications of immersive environments are for university enterprises.

They’re looking at the undergraduate health sciences curriculum – can’t find anatomy profs anymore. Can’t supply cadavers for education – why do you need them? The curriculum is the same worldwide – you’ve got buildings on campuses with students coming in and getting bored. What does it cost to heat, cool, and illuminate those buildings? We have brought no imagination to these challenges. We have to look at the cost of operations. What is the cost of distributing rolls of toilet paper into thousands of classroom buildings?

All of the retired faculty could be participating in these immersive environments to bring education to many more people.

Macro nodes – very large data centers sitting next to hydroelectric generating stations. Biological scientists haven’t figured this out to the extent that astronomers and physicists have with things like the Hubble – collaborating to create a shared resource.

Have to stop thinking about physical space as the basis of anything except for those things that absolutely require it.

We have no technology excuses – the fundamental issue is will. Oil prices will drive that will.

We have to stop asking students to do pattern recognition. The way we’ve been evaluating students doesn’t have anything to do with the challenges they will face. But if they have to get along with a group of others to actually accomplish something, that will translate.

Bob invites people to contact him and work with them in Second Life.

[CSG Spring 2008] Cyberinfrastructure Workshop – Funding for CI – Three Success Stories

Steve Cawley – University of Minnesota

The CIO’s job – get the money. What has worked at Minnesota, what doesn’t work? What are we doing to move forward?

Three centrally funded research areas –

The Minnesota Supercomputer Institute. $6.5 million budget to provide high-performance computing to the institution. Moving from the engineering/science college to the VP for Research.

Central IT – Network is a common good. Unified gig to the desk, meshed 10 gig backbone. BOREAS-net: a 2,100-mile owned optical network connecting to Chicago and Kansas City. Chicago CIC fiber ring and OMNIPOP. NLR and Internet2.

Had good luck funding SAN storage for researchers over the past three years. Received startup funding for server centralization utilizing virtualization. New data center plan covers central IT plus research.

University Library – expanding expertise including data collection and curation, data preservation and stewardship. Virtual organization tools. University Digital Conservancy. Rich Media Services.

Problems – limited collaboration between researchers.

Heavy reliance on chargeback is a detriment. Central IT was 80% chargeback, now only 20% is chargeback. Common good services should be centrally funded.

Moving forward – the Research Cyberinfrastructure Alliance. Great exec partnership – VP for Research, VP for Tech, U Librarian. Try to speak with one voice.

Input from interviews with faculty. Large storage needs. Little thought being given to long term data preservation. University does not exist in a vacuum.

Julian Lombardi – Duke

Center for Computational Science Engineering and Math (CSEM) – Visualization Lab, Duke Shared Cluster Resource – blades donated to cluster. Those who donated got priority cycles.
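The notes don't say how donor priority was actually implemented on the Duke cluster. As a minimal sketch, assuming a simple two-level scheme in which jobs from donating groups always run before other jobs and ties are broken by submission order, it could look like this (the class and group names are hypothetical):

```python
import heapq
from itertools import count


class SharedClusterQueue:
    """Hypothetical two-level job queue: donors first, then everyone else (FIFO within a level)."""

    def __init__(self, donors):
        self._donors = set(donors)
        self._heap = []
        self._seq = count()  # submission counter, used to keep FIFO order within a level

    def submit(self, owner, job):
        # Priority 0 for groups that donated blades, 1 for all other users.
        priority = 0 if owner in self._donors else 1
        heapq.heappush(self._heap, (priority, next(self._seq), owner, job))

    def next_job(self):
        # Pop the lowest (priority, sequence) pair: donors first, oldest first.
        _, _, owner, job = heapq.heappop(self._heap)
        return owner, job


queue = SharedClusterQueue(donors={"physics"})
queue.submit("history", "h1")
queue.submit("physics", "p1")
queue.submit("history", "h2")
order = [queue.next_job() for _ in range(3)]
print(order)  # [('physics', 'p1'), ('history', 'h1'), ('history', 'h2')]
```

A real scheduler would also need fair-share limits so non-donors are not starved; this sketch only illustrates the "donors get priority cycles" idea.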

Provost discontinued funding to the center. Cluster and Vis Lab were still being supported by OIT and departments.

Needs are – broad and participatory direction setting; support for emerging and inter-disciplines.

Bill Clebsch – Stanford – Cyberinfrastructure: A Cultural Change

Religious camps blocking progress.

There were three separate camps – the schools, the faculty, and the administration. Everyone pretending that computing can be managed locally.

Asked the Dean of Research to send out letters to the top fifty researchers. Spent time with each of them to find out what the state of computing is.

Exploding Urban Myths – when they talked to faculty they found out that the received wisdom wasn’t true.

Myths and facts #1 – myth: scientific research computing methodology has not fundamentally changed (heard from Provost). Fact: researchers’ computational needs have changed fundamentally in the last five years, increased computing availability itself directly yields research benefits, researchers have abandoned the notion that computing equipment needs to be down the hall.

Myths and Facts #2: Myth: faculty needs are highly specialized and cannot be met with shared facilities. Fact: Faculty are willing to share resources, clusters, and cycles. Research methodologies are surprisingly similar regardless of discipline (e.g. larger data sets; simulation studies; from shared memory to parallelism). Episodic nature of research computing allows for coordinated resource sharing.

Myths and Facts #3: Myth: distributed facilities can keep pace with demand. Fact: Lessons learned from BioX Program: running out of space, cycles, cooling, and power. Cross-disciplinary facility economies of scale. Multi-disciplinary computing economies of scale.

Myths and Facts #4: Myth: Central computing facilities are bureaucratic and inflexible. Fact: Colocating and sharing models reduce overhead. Modularity in building, power, cooling, and support help create sustainability. Move from control to collaboration empowers faculty and reduces central cost. Faculty own the clusters – OIT will just cycle power on request.

Where is this going? 21st century cyberinfrastructure costs will dwarf 20th century ERP investments. Sustainability will be an economic necessity. Cloud/grid computing will affect investment horizons.

[CSG Spring 2008] Cyberinfrastructure Workshop – Jim Pepin

Disruptive Change –

Things creating exponential change – transistors, disk capacity, new mass storage, parallel apps, storage management, optics.

Federated identity (“Ken is a disruptive change”) team science/academics; CI as a tool for all scholarship.

Lack of diversity in computing architectures – X86 or X64 has “won” – maybe IBM/Power or Sun/SPARC at the edges. Innovation is in consumer space – game boxes, iPhones, etc.

Network futures – optical bypasses (we’ve brought on ourselves by building crappy networks with friction). GLIF examples. Security is driving researchers away from campus networks. Will we see our networks become the “campus phone switch” of 2010?

Data futures – Massive storage (really really big); object oriented (in some cases); preservation and provenance (how do we know the data is real?); distributed; blur between databases and file systems. Metadata.

New Operating Environments – Operating systems in network (grids) not really OSs. How to build petascale single systems – scaling apps is the biggest problem. “Cargo cult” systems and apps. Haven’t trained a generation of users or apps people to use these new parallel environments.

In response to a question Jim says that grids work for very special cases, but are too heavyweight for general use. Cloud computing works in embarrassingly parallel applications. Big problems want a bunch of big resources that you can’t get.

The distinction is made between high throughput computing and high performance computing.
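For readers unfamiliar with the term, an embarrassingly parallel workload is one whose tasks share no state and never need to talk to each other, so they can be farmed out to any number of independent workers. A minimal sketch using Python's standard `concurrent.futures`, with threads standing in for cloud nodes; the `simulate` function and its input are hypothetical placeholders for a real task:

```python
from concurrent.futures import ThreadPoolExecutor


def simulate(seed: int) -> int:
    """Hypothetical independent task: the result depends only on its own input."""
    return sum((seed * i) % 7 for i in range(1000))


def run_all(seeds):
    # Each task runs in isolation; no communication between workers is needed.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(simulate, seeds))  # map() returns results in input order


results = run_all(range(8))
print(len(results))  # 8
```

The tightly coupled problems Jim describes are the opposite case: workers must exchange data constantly, which is why they want big machines with fast interconnects rather than a pool of loosely connected nodes.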

100s of teraflops on campus – how to tie into national petascale systems, all the problems of teragrid and VOs on steroids – network security friction points, identity management, non-homogenous operating environments.

Computation – massively parallel – many cores (doubling every 2-3 years). Massive collections of nodes with high speed interconnect – heat and power density, optical on chip technology. Legacy code scales poorly.

Vis/remote access – SHDTV-like quality (4k) enables true telemedicine and robotic surgery; massive storage ties to this.

Versus – old code, written on 360s or VAXes, vector optimized; static IT models – defending the castle of the campus. Researchers don’t play well with others. Condo model evolving. Will we have to get used to the two-port internet? Thinking this is just for science and engineering – there are also social science apps (e.g. education outcomes at Clemson – large data, statistics on a huge scale) and the Shoah Foundation at USC – many terabytes of video.

Vision/sales pitch – access to various kinds of resources – parallel high performance, flexible node configurations, large storage of various flavors, viz, leading edge networks.

Storage farms – diverse data models: large streams (easy to do); large number of small files (hard to do); integrate mandates (security, preservation), blur between institution data, and personal/research; storage spans external, campus, departmental, local. The speed of light matters.

[CSG Spring 2008] Cyberinfrastructure Workshop – CI at UC San Diego

Cyberinfrastructure at UC San Diego

Elazar Harel

Unique assets at UCSD: CalIT2; SDSC; Scripps Institution of Oceanography

They have a sustainable funding model for the network. Allows them to invest in cyberinfrastructure without begging or borrowing from other sources.

Implemented ubiquitous Shibboleth and OpenID presence.

Formed a CI design team – joint workgroup.

New CI Network designed to provide 10 gig or multiples directly to labs. First pilot is in genomics. Rapid deployment of ad-hoc connections. Bottleneck-free 10 gig channels. Working to have reasonable security controls and be as green as possible.

Just bought two Sun Blackboxes – being installed tomorrow. Will be used by labs.

Chaitan Baru – SDSC

Some VO Projects – BIRN (www.birn.net) – NIH Biomedical Informatics Research Network – shares neuroscience imaging data. NEES (www.nees.org) Network for earthquake engineering simulations; GEON (www.geongrid.org) Geosciences network; TEAM (www.teamnetwork.org) field ecology data; GLEON (www.gleon.org) Global Lakes; TDAR (www.tdar.org) digital archaeology record; MOCA (moca.anthropgeny.org) comparative anthropogeny

Cyberinfrastructure at the speed of research – research moves very fast. Researchers think that Google is the best tool they’ve ever used. In some cases “do what it takes” to keep up: take shortcuts; leverage infrastructure from other CI projects and off-the-shelf products. Difficult because it can be stressful on developers who take pride in creating their own, and engineers may think the PI is changing course too many times. In other cases “don’t get too far ahead” of the users – sometimes we build too much technology, and the user community may see no apparent benefit to the infrastructure being developed.

The sociology of the research community influences how you think about data.

Portal-based science environments. Support for resource sharing and collaboration. Lots of commonalities, including identity and access issues. Lots of them use the same technologies (e.g. GEON and others). Ways of accessing data and instruments. Lots of interest from scientists in doing server-side processing of data rather than just sharing whole data sets via FTP, e.g. LiDAR on the GEON portal. The OpenTopography model is an attempt to generalize that. The EarthScope data portal is another example – includes SDSC, IRIS, UNAVCO (Boulder), and ICDP (Potsdam).

Cyberdashboards – live status of information as it’s being collected. Notifications of events is also desirable.

Cyberdashboard for Emergency Response – collecting all 911 calls in California. Data mining of spatiotemporal data. Analysis of calls during San Diego wildfires Oct 2007. Wildfire evacuations – visualization of data from Red Cross disastersafe database.

Cyberinfrastructure for Visualization

On-demand access to data – short lead times from request to readiness to rendering and display.

On-demand access to computing – online modeling, analysis and visualization tools

Online collaboration environments – software architecture, facility architecture.

SDSC/Calit2 synthesis center – conceived as a collaboration space to do science together – brings together high performance computing; large scale data storage; in-person collaboration; consultation. Has big HD screens, a stereoscopic screen, videoconferencing, etc. Used for workshops, classes, meetings, site visits. Needs tech staff to run it, and research staff to help with visualization, integration, data mining. So far has been on project-based funding; lately there’s been a recharge fee.

Calit2 stereo wall (C-Wall) – Dual HD resolution (1920 x 2048 pixels) with JVC HD2k projectors.

Calit2 digital cinema theater – 200 seats, 8.2 sound, Sony SRX-R110, SGI Prism with 21 TB, 10GE to computers

The StarCAVE – 30 JVC HD2k (1920 x 1080) projectors.

225 megapixel hiperspace tiled display.
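For a rough sense of scale, the raw pixel counts of the displays listed above can be compared with a little arithmetic. This sketch simply multiplies the resolutions and projector counts quoted in these notes; it ignores stereo pairing and any projector overlap, so the StarCAVE figure is raw projector pixels rather than a viewable resolution.

```python
def megapixels(width: int, height: int, count: int = 1) -> float:
    """Raw pixel count in megapixels for `count` displays of width x height."""
    return width * height * count / 1_000_000


c_wall = megapixels(1920, 2048)             # C-Wall, dual HD
starcave = megapixels(1920, 1080, count=30)  # StarCAVE, 30 HD projectors
hiperspace = 225.0                           # tiled display, figure as quoted in the notes

print(f"C-Wall:     {c_wall:.1f} MP")      # C-Wall:     3.9 MP
print(f"StarCAVE:   {starcave:.1f} MP")    # StarCAVE:   62.2 MP
print(f"HIPerSpace: {hiperspace:.1f} MP")  # HIPerSpace: 225.0 MP
```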

In response to a question from Terry Gray, Chaitan notes that the pendulum is swinging a bit in that PIs still want to own their own clusters, but they no longer want to run them – they want them housed and administered in data centers. Elazar notes that they’re trying to make the hardware immaterial – a few years from now they may all be in the cloud, but the service component to help researchers get what they need will remain on campus.

[CSG Spring 2008] Cyberinfrastructure Workshop – Virtual Organizations

Ken Klingenstein –

An increasing artifact of the landscape of scientific research, largely from the cost nature of new instruments.

Always inter-institutional, frequently international – presents interesting security and privacy issues.

Having a “mission” in teaching and a need for administration. All of these proposals end with “in the final year of our proposal three thousand students will be able to do this simulation”. Three thousand students did hit the Teragrid a few months back for a challenge – 50% of the jobs never returned.

Tend to cluster around unique global scale facilities and instruments.

Heavily reflected in agency solicitations and peer review processes.

Being seen now in arts and humanities.

VO Characteristics – distributed across space and time; dynamic management structures; collaboratively enabled; computationally enhanced.

Building effective VOs. Workshop run by NSF in January 2008. A few very insightful talks, and many not-so-insightful talks. http://www.ci.uchicago.edu/events/VirtOrg2008/

Fell into the rathole of competing collab tools.

Virtual Org Drivers (VOSS) – solicitation just closed. Studying the sociology – org life cycles, production and innovation, etc.

NSF Datanet – to develop new methods, management structures, and technologies. “Those of us who are familiar with boiling the ocean recognize an opportunity.”

COmanage environment – externalizes ID management, privileges, and groups. Being developed by Internet2 with Stanford as lead institution. Apps being targeted: Confluence (done), Sympa, Asterisk, DimDim, Bedework, Subversion.

Two specimen VOs

LIGO-GEO-VIRGO (www.ligo.org)

Ocean Observing Initiative ( http://www.joiscience.org/ocean_observing )

The new order – stick sensors wherever you can and then correlate the hell out of them.

Lessons Learned – people collaborate externally but compete internally; time zones are hell; big turf issue of the local VO sysadmin – LIGO has 9 different wiki technologies spread out over 15 or more sites (collaboration hell). Diversity driven by autonomous sysadmins. Many instruments are black boxes – give you a shell script as your access control. Physical access control matters with these instruments. There are big science egos involved.

Jim Leous – Penn State – A VO Case Study.

Research as a process: lit search/forming the team; writing the proposal; funding; data collection; data processing; publish; archive.

Science & Engineering Indicators 2008

Publications with authors from multiple institutions grew from 41% to 65%. Coauthorship with foreign authors increased by 9% between 1995 and 2005.

How do we support this? Different collaborative tools. Lit Search – refworks, zotero, del.icio.us; Research info systems – Kuali Research; home grown; Proposals – wikis, google docs; etc. Lots of logins. COmanage moves the identity and access management out of individual tools and into the collaboration itself.

Need to manage attributes locally – not pollute the central directory with attributes for a specific collaboration effort.
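One way to picture this idea is a registry that belongs to the collaboration itself: it holds group memberships keyed by a federated identifier, and each tool asks the registry instead of keeping its own accounts. The sketch below is purely illustrative and is not COmanage's actual data model or API; every class, function, group name, and identifier in it is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class CollaborationRegistry:
    """Hypothetical per-collaboration store; attributes live here, not in a campus directory."""

    name: str
    groups: Dict[str, Set[str]] = field(default_factory=dict)

    def add_member(self, group: str, federated_id: str) -> None:
        self.groups.setdefault(group, set()).add(federated_id)

    def is_member(self, group: str, federated_id: str) -> bool:
        return federated_id in self.groups.get(group, set())


# Each tool delegates its access decision to the shared registry (hypothetical group names).
def can_edit_wiki(registry: CollaborationRegistry, user: str) -> bool:
    return registry.is_member("wiki-editors", user)


def can_commit_code(registry: CollaborationRegistry, user: str) -> bool:
    return registry.is_member("committers", user)


vo = CollaborationRegistry("example-vo")
vo.add_member("wiki-editors", "alice@campus-a.example")
vo.add_member("committers", "alice@campus-a.example")
vo.add_member("wiki-editors", "bob@campus-b.example")

print(can_edit_wiki(vo, "bob@campus-b.example"))    # True
print(can_commit_code(vo, "bob@campus-b.example"))  # False
```

The point of the design is that the researcher logs in once with a home-institution identity, and collaboration-specific attributes never have to be added to any campus's central directory.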

What about institutions that don’t participate? LIGO – 600 scientists from 45 institutions.

LIGO challenges – data rates of 0.5 PB/yr across three detectors (> 1 TB/day); many institutions provide shared infrastructure (e.g. clusters, wikis, instrument control/calibration); international collaboration with other organizations; a typical researcher has dozens of accounts.
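As a quick sanity check on the data rate quoted above, 0.5 PB per year does work out to a bit more than 1 TB per day. This is a back-of-the-envelope sketch assuming decimal units (1 PB = 1000 TB) and a 365-day year, neither of which the talk specified:

```python
# Assumptions (not from the talk): decimal units and a 365-day year.
TB_PER_PB = 1000
DAYS_PER_YEAR = 365


def tb_per_day(pb_per_year: float) -> float:
    """Convert a yearly data volume in PB to a daily volume in TB."""
    return pb_per_year * TB_PER_PB / DAYS_PER_YEAR


daily = tb_per_day(0.5)
print(f"{daily:.2f} TB/day")  # 1.37 TB/day
```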

Penn State Gravity Team implemented LIGO roster based on LDAP and Kerberos – Penn State “just went out and did it” – drove soul searching from LIGO folks – “why shouldn’t we do this?”. Led to LIGO Hackathon in January, which was very productive. Implemented Shibboleth, several SPs, Confluence, Grouper, etc.

Next steps are to leverage the evolving LIGO IAM infrastructure; establish a permanent instance of LIGO COmanage; encourage remaining institutions to join InCommon; and (eventually) detect a gravity wave?

Bernie Gulachek – Positioning University of Minnesota’s Research Cyberinfrastructure – forming a Virtual Org at Minnesota – the Research Cyberinfrastructure Alliance.

A group of folks who have provided research technology support – academic health center; college of liberal arts; minnesota supercomputer institute; library; etc.

Not (right now) a conversation about technology, but about organization, alliances, and partnerships. Folks not necessarily accountable to each other, but are willing to come together and change the way they think about things to achieve the greater common good.

Both the health center and the college of liberal arts came to IT to ask how to build sustainable support for research technology.

Assessing Readiness – will this be something successful, or a one-off partnership? What precepts need to be in place for partnership? The goal is to position the institution for computationally intensive research. They have a (short) set of principles for the Alliance.

Research support has been silo’ed – need to have a connection with a specific campus organization, and the researcher needs to bridge those individual organizations. The vision is to bring the silos together. Get research infrastructure providers talking together. Researcher consultations – hired a consultant.

Common Service Portfolio – Consulting Services; Application Support Services; Infrastructure Services – across the silos. Might be offered differently in different disciplines. Consulting Services are the front door to the researcher.

Group is meeting weekly, discussing projects and interests.

[CSG Spring 2008] Cyberinfrastructure Workshop – Bamboo Project

I’m in Ann Arbor for the Spring CSG meeting. The first day is a workshop focusing on cyberinfrastructure issues.

The NSF Atkins report defines CI as high-performance computing; data, information; observation, measurement; interfaces, visualization; and collaboration services. Today will concentrate on the last two.

The workshop agenda will cover interdisciplinary science; virtual organizations; visualization; mapping scientific problems to IT infrastructures; and getting CI funded.

Chad Kainz from University of Chicago is leading off, talking about the Bamboo Planning Project. The Our Cultural Commonwealth report from ACLS served the same kind of function in the humanities that the Atkins report did in the sciences.

Chad starts off with a scenario of a faculty member in a remote Wyoming institution who creates a mashup tool for correlating medieval text with maps, and publishes that tool, which gets picked up for research by someone in New Jersey, where it is used for scholarly discourse. The Wyoming faculty member then uses the fact of that discourse in her tenure review.

What if we could make it easier for faculty to take that moment of inspiration to create something and share it with others? How do we get away from the server under the desk and yet another database?

How can we advance arts and humanities research through the development of shared technology services?

There are a seemingly unending number of humanities disciplines each with only a handful of people – you don’t build infrastructure for a handful of people. One of the challenges is how we boil this down to commonalities to enable working together. Day 2 of the Berkeley Bamboo workshop showed that unintentional normalization will lead to watering down the research innovations. The next workshop will start by trying to look at the common elements.

About eighty institutions participating in the first set of workshops.

One idea is to have demonstrators and pilot projects between workshops to test ideas, explore commonalities, demonstrate shared services, and experiment with new application models. There is one project exposing textual analysis services from the ARTFL project that will probably be the first example.