Internet2 Tech Exchange 2015 – High Volume Logging Using Open Source Software

James Harr, Univ. Nebraska

ELK stack – ElasticSearch, Logstash, Kibana, (+redis)

ElasticSearch indexes and analyzes JSON – no foreign keys or transactions, scalable, fast, I/O friendly. Needs lots of RAM

KIbana – WebUI to query ElasticSearch and visualize data.

Log stash – “a unix pipe on steroids” – start with input and output, but can add conditional filters (e.g. regex). Add-on tools like mutate and grok. Can have multiple inputs and outputs.

GROK – has a set of prebuilt regular expressions. Makes it easy to grab things and stuff them into fields. Have to do it on the way in not after the fact (it’s a pipe tool). 306 built-in patterns.

Grok GeoIP – includes built in database, breaks out geo data into fields.

LogStash – statsd – sums things up – give it key and values, adds values and once a minute sends to another tool.

Graphite – a graphing tool, easy to use. Three pieces of info per line: key you want logged to, time, value. Will create a new metric if it’s not in the database.

Can listen to twitter data with LogStash.

Redis – Message queue server

Queue – like a mailbox, can have multiple senders and receivers, but each message goes to one receiver. No receiver, messages pile up.

Channel (pub/sub) – like the radio, each message goes to all subscribers. No subscriber? message is lost, publisher is not held up. Useful for debugging.

Composing a log system: Logstash is not a single service: split up concerns, use queues to deal with bursts, errors. use channels to troubleshoot.

General architecture – start simple:

Collector -> queue -> analyzer -> ElasticSearch -> Kibana

Keep collectors simple – reliability and speed are the goal. A single collector can listen to multiple things.

Queue goes into Redis. Most work done in analyzer – groking, passing things to statsd, etc.Can run multiple instances.

Channels can be used to also send data to other receivers.

Composing a Log System – Archiving

collector -> queue -> analyzer -> archive -> archiver -> log file

JSON compresses very well. Do archiving after analyzer so all the fields are broken out.

Split out indices so you can have different retention policies, dashboards, etc. e.g. firewall data different than log stash.

Can use logstash to read syslog data from a device, filter out what you want to send to Splunk to get your data volume down.

Lessons (technical): clear query cache regularly (cron job every morning); more RAM is better, but the JVM doesn’t behave well after 32GB; Split unrelated data into indices (e.g. syslog messages vs. firewall logs); part simple; use channels to try new things.

Lessons: It’s not about full text search, though that’s nice. It’s about having data analytics. ElasticSearch, LogStash, and Kibana are just tools in your toolbox. If you don’t have enough resources to keep everything, prune what you don’t need.


Internet2 Tech Exchange 2015 – RESTful APIs and Resource Definitions for Higher Ed

Keith Hazelton – UWisc

TIER work growing out of CIFER – Not just RESTful APIs. The goal is to make identity infrastructure developer and integrator friendly.

Considering use of RAML API designer and tools for API design and documentation.

Data structures – the win is to get a canonical representation that can be shared across vertical silos. Looking at messaging approaches. Want to make sure that messaging and API approaches are using the same representations. Looking at JSON space.

DSAWG – the TiER Data Structures and APIs Working Group – just forming, not yet officially launched. Will be openly announced.

Ben Oshrin, Spherical Cow

CIFER APIs – Quite a few proposed, some more mature than others.

More Mature: (Core schema – attributes that show up across multiple APIs); ID Match (creates a representation for asking “do I know this person already, and do I have an identifier?”); SOR to Registry (create a new role for a person); Authorization (standard ways of representing authorization queries).

Less mature: Registry extraction (way to pull or push data from registry – overlap with provisioning); Credential management (do we really need to have multiple password reset apps?)’

Not even itemized: Management APIs; Monitoring APIs. Have come up in TIER discussions.

Non CIFER  APIs / Protocols of interest: CAS, LDAP, OAuth, OIDC, ORCID, SAML2, SCIM, VOOT2

Use cases:

  • Intra-component: e.g. person registry queries group registry for authorization; group registry receives person subject records from person registry.
  • Enterprise to component: System or Record provisions student or employee data in Person Registry
  • Enterprise APIs: Homw grown person registry exposes person data to campus apps.


API Docs; Implementations

Internet2 2015 Tech Ex – The Age of Consent

Ken Klingenstein is talking about the work going on to enable informed consent in releasing identity attributes to services. I walked in a little late, so I’m missing a bunch of the detail at the beginning, but he invokes Kim Cameron’s 7 Laws of Identity.

Consent done properly is something that users will only want to do once for a relying party, and be informed out of band when attributes are released. There is interesting research about whether users take on consent differs between offline and online.

Some federations already support consent.

Rob Carter – Duke

Why do we need an architecture for consent? Use cases at Duke:

  • Student data – would like to release attributes about students that may be FERPA protected. If a student can “consent” to an attribute release, FERPA may not even be involved (according to Duke’s previous registrar). Consent has to be informed (what’s sufficient “informedness”?); has  to be revokable (means you need to store it so the person can review and change); Non-repudiation and audibility (We know that the person gave the consent and when it was given).
  • Student devs – Trying to get students working on development. When a student wants to have another student share information with friends in an app, questions come up about release of information. Would like to have same kind of consent framework in OAuth as they have in other environments (e.g. Shibboleth).
  • Would be nice to have a single place for a user to manage their consents.

Nathan Dors – Washington

UW entering age of consent. Asks the audience where they are with consent frameworks – almost all are just entering the discussion. UW wants to go from uninformed consent (or not doing things because of the barrier of getting consent). Consent is highly ubiquitous already in the consumer world – Google, Facebook, etc. Help desk need to understand how to explain things to users. ID Management needs to be able to help developers understand what they need to do to get consents. Need to figure out how to layer in consent into a bunch of idPs including Azure AD which has its own consent framework. Need to apply existing privacy policies and data governance to consent context.

Larry Peterson (U of Arizona) – Give Your Data the Edge (Cloud Infrastructure for Big Data Research)

Data Management Challenge
     Distributed Set of Collaborators
     Existing Data Sets (sometimes curated, sometimes less)
     Taking advantage of commodity data storage
Researchers widely distributed, data widely distributed
Pre-Stage, then Write Back – Read/write workload, widely distributed. Tend to make assumption that researcher is data management expert.
Goal: Enable a scalable number of collaborators and apps to share access to data independent of where it’s stored: minimize operational burden on users; maximize use of commodity infrastructure; maximize aggregate I/O performance
Syndicate Solution – Add a CDN into the mix. Distribute big data in the same way Netflix distributes video, using caching.
Syndicate gateways sit between players and the caching transport (which uses HTTP instead of TCP).
Metadata service on the side which has to scale.
Result is all collaborators share a volume.
Gateways bridge application workflow and HTTP transport, e.g. IRODS, Hadoop drivers. Data gets acquired from existing repositories, or commodity storage (e.g. S3 or Box) which get treated as block storage. Gives ability to spread risk over multiple backing stores.
Metadata store manages data consistency and key distribution. Uses adaptive HTTP streaming. Plays a role of security distributing credentials which are delivered through the same CDN.
Shared volume is mounted just like Dropbox. Can auto mount volumes into multiple VMs like in EC2.
Service composition – Syndicate = CDN + Obect Store + NoSQL DB
Value Add
Storage Service
CDN gives scalable read bandwidth (Akamai Hyper Cache and RequestRouter). Built a CDN for R&E content on Internet2s footprint.
Object store – gives data durability. (S3, Glacier, DropBox, Box, Swift).
NoSQL DB (Google App Engine) for metadata service
Multi-tier cloud. Commodity cloud is one of the tiers. You could contribute private clouds into the project too. Internet2 Backbone. Regional & Campus (4 servers minimum). -> End Users
Caching hierarchy – some in the private cloud, less at the regional & campus side.
Request Router in the I2 backbone, which tells you which cache to get service from.
Value Proposition: Cloud Ready (Allows users to mount shared voluments into cloud-hosted VMs with minimal operational overhead; Adapt to existing workflows – makes it easy to integrate existing user workflows. There are ways to build modular plugins on user side or data side. Sustainable Design – I got a big data problem, I need to connect commodity disk to my workload.
Will first be used by IPlant community. More info at
Scientific Data with Syndicate – John Hartman, Univ. of Arizona Computer Science
Use Hadoop and Syndicate to support “big data” science – meta-genomics research w/ Bonnie Hurwitz, Agriculture and Biosystems.
Meta-genomics – statistical analysis of DNA rather than sequencing entire genomes. Sequencing produces snippets of DNA (called reads) – requires very pure samples of DNA. Instead, look at samples in the environment, e.g. compare population of reads in one sample with reads in another sample. Tara Oceans Expedition; Bacterial infections; Colon cancer.
Tara is a ship collecting information about the oceans, taking samples to enable comparative analysis. Currently about 9 TB of data from ship.
Looking at bacterial infections with Dr. George Watts. Treatment depends on identifying characteristics of the bacteria, so ideal to perform meta-genomic analysis on an infection to classify and determine treatment.
Analysis Techniques – originally custom HPC applications with manual data staging; now- Hadoop application with manual data staging; future – Hadoop application with data access via Syndicate.
Hadoop: open-source MapReduce; includes Hadoop Distributed File System, so storage nodes for the computation nodes. Tasks are run on local data when possible – hadoop task scheduler knows data location. Data must be manually staged into HDFS, and Hadoop does are managed by central controller.
Trying to allow remote Hadoop data access: Storage in iRODS and HDFS; Transport by Syndicate and HTTP; enable federation between Hadoop clusters.
Storage-side functionality: delivers data sets to Syndicate via SG; Publish/subscribe mechanism keeps datasets up to date via Rabbit MQ. Integration of Syndicate and iRODs authentication mechanisms.
Working on federating Haddop clusters via Syndicate, so clusters can pull data from each other.
Challenges – Identity Management; While-file write (need to write results back to storage; syndicate designed for file reads, block writes) have to provide consistency at dataset level; Performance.
Biologists are thrilled when this works at all.

2015 Internet2 Technology Exchange

I’m in Cleveland for the next few days for the Internet2 Tech Exchange. This afternoon kicks off with a discussion of the future of cloud infrastructure and platform services. As I suppose is to be expected, the room is full of network and identity engineering professionals. I’m feeling a bit like a fish out of water as an application level guy, but it’s interesting to get this perspective on things.

One comment is that for Internet2 schools, bandwidth to cloud providers is no longer as much an issue as redundancy and latency. Shel observes that for non-I2 schools bandwidth is still a huge issue. The level of networking these schools don’t have is frightening.

The cloud conversation was followed by a TIER (Trusted Identity in Education and Research) investors meeting. There are 50 schools who have invested in TIER now. This is a multi-year project, with a goal of producing a permanently sustainable structure for keeping the work going. A lot of the work so far has been to gather requirements and build use cases. Internet2 has committed to keeping the funding for Shibboleth and Grouper going as part of the TIER funding. The technical work will be done by a dedicated team at Internet2.

The initial TIER release (baseline) will be a package that includes Shibboleth, Grouper, and COManage, brought together as a unified set, as well as setting the stage for Scalable Consent support. That release will happen in 2016. It will be a foundation for future (incremental) updates and enhancements. The API will be built for forward compatibility. Releases will be instrumented for continual feedback and improvement of the product.