We’re here at the Pyle Center at the University of Wisconsin for the spring CSG meeting.
Bill Clebsch from Stanford is leading the Data Center Workshop.
The first panel, being introduced by Jim Pepin from USC, is setting the context.
There was a meeting two weeks ago with NSF about cyberinfrastructure at the campus level. Twenty years ago we were the people supporting high-end folks on campus. Over the last ten years we’ve gotten to a point where we don’t change anything quickly. There’s a whole lot going on on campus around clusters and networking. People in departments want to build clusters – what will that drive in terms of campus infrastructure for housing them?
Previously there were no line items in grants for facilities and operations. The sweet spot used to be 32 node clusters, but now it’s 128 or 256 nodes. If faculty want to get funded they need to be at the cutting edge of research, and the 32 node cluster in a grad student office isn’t that. There’s been a lot of pushback and rethinking at the funding agencies.
Kevin Moroney – You could suffer “death by [NSF] Young Investigator Awards” with 16 node clusters.
Jim notes that science has become team science, which drives more computing.
The back channel has some chatter about water-cooled racks in machine rooms.
Penn State has just invested in some water-cooled racks.
Some discussion about power consumption – people are planning 18-24 kW per rack for high performance computing and 4-6 kW for an enterprise computing rack.
Walter points out in the back channel that “APC has a ‘half rack CRAC’ where you have a ‘real’ system in the rack ‘hut’. The hut is a set of back-to-back racks with a roof, so that heat is contained in the hut and the in-rack AC units suck in that air and exhaust room-temperature air. A new CMU research facility that is using the APC system has a writeup at http://www.pdl.cmu.edu/PDL-FTP/News/newsletter05.pdf”
Jim and Kevin note that in a colo offering, if you charge by the rack then people will stuff the racks as full as possible, raising the kilowatts-per-rack average in the machine room.
Jim notes that USC has rented some colo space in a downtown facility for mirroring for disaster recovery.
There’s a bunch of discussion about offsite backup, disaster recovery, and how campuses prioritize what services will be brought back in which order.
Theresa from MIT notes that for key communication services (web, email) it’s good to have geographically distributed, load-balanced services, so that one location can go down and the service just runs more heavily loaded.
Patrick from MIT says that funding bodies are starting to hold campuses responsible for recovery of research data, so if the data center is holding research data it’s another thing to worry about.
In a brief discussion on power infrastructure (most campuses here have multiple power routes into the data center) Jim says that USC is always being compared to Disney for what they do – Steve Worona says that’s a Mickey Mouse solution 🙂
Theresa says that the financial systems can actually stand some outage time, but it’s the communications infrastructure – web and email – that is truly critical to keep running, along with the network. That’s the lesson of 9/11.
About a third of institutions here report that they’ve had a total data center outage within the last three years.