We’re at Yale for the Spring CSG meeting. It’s a beautiful, sunny New England spring day!
The first workshop is on automating campus network configuration, provisioning, and monitoring.
Workshop Presentations
Mark McCahill – Duke – Thinking about network automation/monitoring
Campus wireless is one of the most complicated things we run. Campus APs – averages ~6k in 250 buildings across our campuses. RF spectrum issues. How reproducible are trouble reports?
How many staff support your network – 7.5 in engineering/architecture, 10 field staff, average.
We have not converged on network management tools at all.
Monitoring taxonomy – how can we categorize tools? Data gathering, analysis, alerting, trending.
Automation strategy – understand the environment – monitoring!
Ideal end-state – standardized process, consistent quality, reduced cycle time/increased productivity.
User centric monitoring of the wireless network
Users don’t tell us that much. Should I even tell IT there’s a problem? It’s not a good experience just because they don’t complain.
Crowd sourced monitoring – boomerang. JavaScript in a web page that attempts to download files of various sizes, from which you can figure out latency and performance.
via.oit.duke.edu – zero-install. Duke’s shib page includes boomerang code. Results reported to via.oit.duke.edu, stored in a MySQL db. Self-service diagnostic testing available at via to check performance to various data centers.
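A rough sketch of how timing downloads of different sizes can separate latency from bandwidth. This is not boomerang's actual algorithm; it simply assumes download time is roughly latency plus size divided by bandwidth and fits a straight line. The function name and the sample numbers are made up for illustration.

```python
def fit_latency_bandwidth(samples):
    """Least-squares fit of: seconds = latency + size_bytes / bandwidth.

    samples: list of (size_bytes, seconds) pairs with at least two
    distinct sizes. Returns (latency_seconds, bandwidth_bytes_per_second).
    """
    n = len(samples)
    mean_size = sum(size for size, _ in samples) / n
    mean_time = sum(secs for _, secs in samples) / n
    sxx = sum((size - mean_size) ** 2 for size, _ in samples)
    if sxx == 0:
        raise ValueError("need at least two distinct file sizes")
    sxy = sum((size - mean_size) * (secs - mean_time) for size, secs in samples)
    slope = sxy / sxx  # seconds per byte
    if slope <= 0:
        raise ValueError("timings do not grow with size; cannot estimate bandwidth")
    latency = mean_time - slope * mean_size  # intercept of the fitted line
    return latency, 1.0 / slope


# Hypothetical timings for 10 KB, 100 KB and 1 MB downloads.
samples = [(10_000, 0.06), (100_000, 0.15), (1_000_000, 1.05)]
latency, bandwidth = fit_latency_bandwidth(samples)
```

With many noisy samples per client the same fit still works, which is part of why this turns into a statistics exercise.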
https://github.com/duke-automation/via
You get into big data fairly quickly – it’s a statistics game. Put pages at your cloud and different data centers to measure to them individually.
Where are the trouble spots? Key questions; what are the chances of a good connection? Which wireless segments are overloaded? Instead of depending on vendor tools, use R to analyze data from boomerang. You can do statistical process control to gather objective measures.
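The talk used R for this; as a minimal Python sketch of the statistical process control idea, the snippet below computes simple Shewhart-style control limits (mean plus or minus three sample standard deviations) from a baseline and flags later measurements outside them. The function names and latency values are hypothetical, and a production control chart would likely use more careful limits.

```python
import statistics


def control_limits(baseline, sigmas=3.0):
    """Return (lower, center, upper) control limits from baseline samples."""
    center = statistics.fmean(baseline)
    spread = statistics.stdev(baseline)  # sample standard deviation
    return center - sigmas * spread, center, center + sigmas * spread


def out_of_control(baseline, new_points, sigmas=3.0):
    """Return (index, value) for each new point outside the control limits."""
    lower, _, upper = control_limits(baseline, sigmas)
    return [(i, x) for i, x in enumerate(new_points) if x < lower or x > upper]


# Hypothetical latency samples (ms) for one wireless segment.
baseline_ms = [20, 22, 19, 21, 20, 23, 18, 21]
flagged = out_of_control(baseline_ms, [21, 40, 19])
```

Here `flagged` holds the points that fall outside the limits, which gives an objective trigger rather than waiting for complaints.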
How to monitor when they can’t connect? Simulate users with strategically situated Raspberry Pi devices that do the EAP+PEAP authenticate & DHCP dance to get on the network. Source: https://github.com/duke-automation/raspi – dumping data into Splunk for analysis.
C program makes wpa_supplicant API calls to repeatedly cycle WiFi connection monitoring. Found bimodal distribution of DHCP response times. Also found no correlation between sites. Raspberry Pi tracking wlan interface drops, link quality and signal level.
Next steps – more Raspberry Pis in the field and more monitoring. Check http performance with boomerang on the Pis. Look more at DNS and number of SSIDs detected – could be rogue SSIDs.
Network data collection is a ‘big data’ problem, which is great for statistical analysis. Will use Apache Spark cluster to speed longitudinal analysis.
Should have an iOS app that says “this isn’t good for me right here right now”. Yale has one. https://github.com/YaleSTC/wifi-reporter
Eric Boyd, Michigan – perfSONAR overview
In the context of ScienceDMZ – how do you make sure you’re getting the end-to-end performance?
perfSONAR – enables fault isolation, verify correct operation, widely deployed in ESnet and other networks.
Problem statement – while networks interconnect, each network is owned by a separate organization – how do we troubleshoot across them? Performance issues are prevalent and distributed. Local testing will not find everything. “Soft failures” are different and often go undetected. Elephant flows (giant research loads) vs. mouse flows (web, email). With packets dropping at .0046%, you only get 5% of optimal speed.
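To see why such a small loss rate hurts elephant flows, one commonly cited model is the Mathis et al. bound for a single TCP flow: throughput is at most (MSS / RTT) * (C / sqrt(p)). The sketch below evaluates that bound. It does not reproduce the talk's 5% figure, which depends on the path's round-trip time and link speed; the MSS, RTT and constant C used here are assumptions chosen only for illustration.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_seconds, loss_rate, c=1.0):
    """Upper bound on single-flow TCP throughput (bits/s) per the Mathis model.

    loss_rate is a fraction (0.0046% is 0.000046). c is the model constant,
    assumed to be 1.0 here.
    """
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return (mss_bytes * 8 / rtt_seconds) * (c / math.sqrt(loss_rate))


# Assumed values: 1460-byte MSS, 50 ms round trip, 0.0046% loss.
bound = mathis_throughput_bps(1460, 0.050, 0.000046)
```

Because the bound scales with 1/RTT, the same loss rate that is invisible on a campus LAN can cripple a long-distance research transfer.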
perfSONAR is open source, supported by ESnet, GEANT, Internet2.
Something will break – sometimes things break, + human error. 3 phases to deployment: get system up and running; holy cow, we have a lot of network problems; how do we keep it good?
Distributed information sharing mechanism. Can plug any tool into it. DNS lookups, building HTTP tool now. Using a $200 box to deploy on a network to measure performance. Trying to automate things – you don’t want to spend > .5 FTE on performance monitoring.
William Diegard – Rice – Network automation topics
You should do perfSONAR.
But some commercial products: Splunk, Extrahop, Deepfield. Not talking about the single largest automation system we run: the wireless controller.
What is automation? Anything that lets you spend time doing new things or be more efficient.
Splunk – MapReduce for your log file processing. The mother of all grep tools. Rice using it to track things and automate things like DMCA violations. Automatic system to look for POE shutdown on Cisco access switches. Monitor Data Transfer Node activities.
Extrahop – Application Performance Monitor, passive network traffic tap grabs “wire data” and reveals it. Make you realize how little you know. Does a bunch of statistical analysis. Answers question: “why is it slow?” But – you have to care to take the time to look. Can deploy it in the cloud too. Rice uses it to measure eduroam performance, among other things.
Deepfield – it’s an internet2 service you probably already have. Looks at traffic that Internet2 sees, shows where traffic is going, but you don’t see everything from your border. Does nice categorization.
Sean Dilda – Duke – Cartographer
Why? Network changes faster than diagrams. Troubleshooting network problems is hard! What port is this computer on? What VLAN? What firewalls?
What does it do? Logs into every switch/router every three hours and builds internal map of network. Can look up by IP and hostname, or look up by MAC, or see switch/router interface stats. They also pull in building data, link to floor plans and Google Maps. Can get summaries of VRF data, show VLAN stats (including same VLAN number on different LANs). It maps Layer 2 layouts. Great for showing how things plug together for local support staff or new network engineers. Will map routes from source to destination in a nice graphic layout.
Who can use it? All IT staff across the university, and anyone with access to IPAM. (based on Grouper groups).
Use it to allow local staff with IPAM permissions to change port VLAN, bounce wired port, clear ARP entry, block IPs from network.
New tool: Planisphere. Combines data from Cartographer, DHCP, wireless/radius, device registration, end point management, VMWare, Cisco, etc. Can gather a lot of data about end devices.
Next steps: F5 load balancers, firewall rules and network ACLs, IPS blacklists, Planisphere metrics.
Plan to distribute source.
Scotty Logan – Stanford – Network Delegation and Automation
9 pairs of physical firewalls, 600 virtual firewalls. Half the rules change per year. 65k firewall rules. Only 4-5k changes are manual. Firewall automation first deployed 2007.
1300 Local network admins active in last 30 days. Only 1200 people in IT job roles per HR.
SNSR – “snoozer” – self registration of devices.
If you come in via VPN or VDI, it shows who is associated with the session, to look up groups for authorization. Now have a web page for firewall requests, which creates ServiceNow requests, and if you have permission it gets updated within an hour without manual intervention (or with an approval loop in ServiceNow).
Device compliance DB – Fed from devices, BigFix. VLRE. Very Lightweight Reporting Engine – runs on Macs and Windows, reports status of machines: do you have the firewall on, disk encrypted, etc? Started deployment of 802.1x with a dedicated RADIUS pool. Added integration with compliance API to see if device must be compliant and if it is.
OpsWare – automated switch and router management. Backup switch configurations nightly (can do diffs on them), scheduled config changes, check all devices for specific settings.
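As an illustration of what diffing nightly config backups can look like, here is a minimal sketch using Python's standard difflib. It is not how OpsWare does it; the function name and the switch config lines are hypothetical.

```python
import difflib


def config_diff(old_text, new_text, old_label="previous", new_label="current"):
    """Return unified-diff lines between two config backups (empty if identical)."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_label,
            tofile=new_label,
            lineterm="",
        )
    )


# Hypothetical backups from two consecutive nights.
last_night = "hostname sw1\ninterface Gi1/0/1\n switchport access vlan 10\n"
tonight = "hostname sw1\ninterface Gi1/0/1\n switchport access vlan 20\n"

for line in config_diff(last_night, tonight):
    print(line)
```

An empty result means nothing changed overnight, so a scheduled job could alert only when the list is non-empty.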
Matt Brooks – CMU – Controlling Network Access
Limited release of .1x mostly in common spaces and where people float between buildings. WPA2 enterprise, pushing people towards it from clear-text network. Controls IP assignments via DHCP, most outlets are deactivated by default, self service portal for activating outlets. “Quick-reg” network for on-boarding.
Updating switch and router configs – NetMRI from Infoblox used for regular backups of running configs from every switch and firewall via TFTP; Visual diff tool to review changes; Password changes; software upgrades. CMU NetConf – initial switch config, interface config changes day-to-day via self-service portal.
Scotty Logan – Dirty Dancing in the Cloud
Why are we moving to the cloud? Geo-diversity? Scalability? Cost? Availability?
Don’t do: artisanally crafted services; manual testing; manual deployment; tightly coupled services.
Do do: Devops, loosely coupled
Firewalls and IP addresses are not loosely coupled!
Difficult to get contiguous elastic IPs from AWS. So people do VPC VPNs, and Direct Connect and private routing. Like dumping a 1950s appliance in your brand new kitchen.
If only we had… Inter-networking, and transport layer security, and strong authentication… oh wait – we do!
But.. my CISO says we need to use static IPs – you need to talk to your CISO.
If you have to, use NAT gateways or NAT instances with Elastic IPs
Amazon now supports /56 IPv6 subnet for VPCs.
Azure only allows 200 Mbps per link, which HPC jobs can blow out very quickly. Duke doesn’t think that extending campus data center into the cloud is a good idea.
NetDB – Delegated administration. All 1300 local network admins can control firewalls, metadata, delegate domains, carve up address space, etc for devices via self-service.
William Diegard
Needed to replace stack. Ended up with Infoblox. Talked about outsourcing DNS, which was a huge traumatic conversation. Good thing about picking Infoblox was the conversations around campus. Infoblox allows for self-service and visibility of networking to campus. Training session was done by user support team, not networking. They don’t have as much need for frequent firewall rules as other schools due to the broad segmentation of the network.
Matt Brooks
Application Suite – all home grown. Moving more towards Infoblox. Using NetReg as IPAM system – registration of machines on devices, tracks network and switch metadata. CANDO – tracks structured cabling on campus. Allows (central IT) users to request and sometimes modify outlet configs, and configure interfaces on systems you own in the data center (add VLAN to your trunk, etc). NetConf – switch auto-builder does automated provisioning of new switches and PortAdmin does automated configuration of interfaces based on activations in CANDO.
Staffing and Skill Sets for Network Automation Teams
Matt Brooks – CMU
How is team structured? 9 network ops engineers, two network design engineers, 3-4 network software engineers. Pair developers and network engineers in offices.
What do we look for in developers? Pick two: Developer experience (required) and one of SysAdmin or Networking experience. Must be genuinely interested in learning the third part. Will learn the third skill by working outages for things that aren’t yours and working on technologies you don’t currently know. Look for generalists, not specialists. A truly curious and self-driven person. The team runs the servers they run on.
War stories
Mark – In early days of perfSONAR discovered that part of backbone wasn’t as strong as it should be. perfSONAR made one of the core routers slow down so much that hospital VOIP didn’t function. Silver lining was that it pointed out some bad router configs.
William – When you start giving client services access to tools, people can do powerful things. Someone wiped out the entire admin database, which caused all the ports to drop.
Scotty – Guy who runs DNS infrastructure pushed out a software change that took down DNS.
Eric – One of perils of network automation is you concentrate your mistakes. What was a local problem becomes a global problem. Automated config of Internet2 switches, in deployment paused between steps to check accuracy, which caused a race condition that erased all the rules on the network. Took Internet2 network down for 20 minutes.
Scotty – maybe we should be applying webscale iterative deployment and testing to our switches, where we have thousands of devices.
Matt – When moving some addresses to a new space, engineer copied a SQL block out of a wiki, but the where clause was outside the highlighted code area on the wiki, so the statement got applied all across the network. Took a full weekend to restore.
Also – tied into IdM systems. Deprovision outlets and systems that a person is individually responsible for. Glitch in IdM system caused 1500 active accounts to be de-activated. 4k devices deleted from network very rapidly. Cobbled together a script to figure out what happened and then put data back in place. Managed to do it in an hour.