These are my raw notes from talks held on Wednesday at LCA2014. They may contain errors and mis-heard quotes, and are completely un-reviewed and not spell checked:
Lightning Talks / Conference Open
- Interesting attendees: Linus, Tridge, Jon Oxer
- Zero footprint discovery
- Extremely scalable monitoring
Problems Addressed
- Risk Management
- Maintaining a detailed discovery database
- Discovering forgotten systems
- Software discovery
- Monitoring services and systems
- Finding unmonitored services
- Intrusions
- Why Discovery?
Unique Powerful Features
- Continuous discovery by listening
- Zero network footprint
- Every change noticed
- Dependency discovery
- Low network load
Uniformly, fully distributed work
- Monitoring and discovery are fully distributed
- Reliable
- Only edge conditions are centralised
- Adding systems does not increase monitoring work
- Each server monitors 2 or 4 neighbours
- Each server monitors its own services
- Repair and alerting is low volume
- Detects switch failure by nominating 1 server per switch for a cross switch ring.
- 95% of traffic stays in the same switch
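A toy sketch of the neighbour-monitoring ring described above, assuming a simple ordered server list (the real project's ring management is more involved): each server watches its two immediate neighbours, so per-server monitoring work stays constant as the fleet grows.

```python
# Toy sketch (not the project's actual code): arrange servers in a ring and
# have each one monitor its two immediate neighbours.
def ring_neighbours(servers):
    """Return {server: [left_neighbour, right_neighbour]} for a ring."""
    n = len(servers)
    return {
        servers[i]: [servers[(i - 1) % n], servers[(i + 1) % n]]
        for i in range(n)
    }

servers = ["srv01", "srv02", "srv03", "srv04", "srv05"]
for server, neighbours in ring_neighbours(servers).items():
    print(f"{server} monitors {neighbours}")
```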
Architectural Components
- Collective Management Authority - per installation
- Nano probes - per server
- Data storage
- Nanoprobe management:
- Configure and direct
- Hear alerts and discovery
- Update rings (join / leave)
- Update database
- Issue alerts
- Nanoprobe functions
- Announce self to CMA
- Do as CMA instructs (sketch after this list)
- No persistent state across reboots
- Linux-HA Base Service Monitoring
- Local Resource Manager (LRM)
- Pros:
- Simple, scalable
- Uniform work distribution
- No single point of failure (the CMA can be clustered)
- Light network load
- Multi-tenant
- Cons:
- Active agents
- Potentially slow startup at power-on (for large numbers of machines)
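A toy illustration of the nanoprobe startup flow noted above (announce self to the CMA, then do as the CMA instructs, keeping no persistent state). The UDP transport, port number and JSON message shape are assumptions for the sketch, not the project's actual wire protocol.

```python
# Toy nanoprobe sketch: announce to the CMA, then loop waiting for instructions.
import json
import socket

CMA_ADDR = ("192.0.2.1", 1984)   # placeholder CMA address and port

def announce_and_listen():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    hello = {"type": "announce", "hostname": socket.gethostname()}
    sock.sendto(json.dumps(hello).encode(), CMA_ADDR)

    while True:                               # no persistent state kept locally
        data, _ = sock.recvfrom(65535)
        instruction = json.loads(data)
        print("CMA instructs:", instruction)  # e.g. start monitoring a service

if __name__ == "__main__":
    announce_and_listen()
```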
- Why a Graph DB
- Humans describe things as graphs
- Dependency and Discovery is fundamentally a graph
- Speed of a graph query depends on the size of the sub-graph, not the total graph
- Natural visualisation
- Schema-less design: good for heterogeneous environments
- Graph model == object model
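A small illustration of the "dependency is fundamentally a graph" point: answering "what does this host depend on?" only walks the sub-graph reachable from that host, no matter how large the whole graph is. This is a toy adjacency dict, not a real graph database.

```python
# Toy dependency graph: query cost is proportional to the reachable sub-graph.
from collections import deque

depends_on = {
    "web01": ["nginx", "app01"],
    "app01": ["postgres", "redis"],
    "nginx": [], "postgres": ["san01"], "redis": [], "san01": [],
    "batch17": ["nfs01"], "nfs01": [],   # unrelated nodes: never visited below
}

def dependencies(node):
    """Breadth-first walk of everything `node` transitively depends on."""
    seen, queue = set(), deque([node])
    while queue:
        for dep in depends_on.get(queue.popleft(), []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

print(dependencies("web01"))   # {'nginx', 'app01', 'postgres', 'redis', 'san01'}
```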
Discovery API
- Scripts perform discovery with JSON output
- Three sample discovery snippets
- OS Information
- Service discovery
- Client discovery
- Service discovery is brilliant.
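A hedged sketch of what a discovery snippet in this style might look like: gather some OS information and print it as JSON for the monitoring system to store. The field names here are assumptions, not the project's actual schema.

```python
# Sketch of an OS-information discovery script that emits JSON.
import json
import platform
import socket

os_info = {
    "discovertype": "os",          # assumed label, not the project's schema
    "host": socket.gethostname(),
    "data": {
        "system": platform.system(),
        "release": platform.release(),
        "machine": platform.machine(),
    },
}

print(json.dumps(os_info, indent=2))
```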
Current Status
- Released in April 2013
- Nanoprobe is functional
- Need adopters
- Checkpoint / Restore
- Otherwise standard Linux
- Namespaces
- Allows granularity
- Presents a subset of host resources
- Allows picking and choosing components
- Not everything is namespace aware
- setns allows you to enter a namespace (sketch after this list)
- No need to ssh into a namespace
- Veth is a virtual ethernet pipe
- Containers need multi-layer security defences - no one tool currently provides what's required.
- LXC is worth looking at. Docker is built on LXC
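A minimal sketch of the setns point above, assuming Linux, root privileges and glibc: open the target process's namespace file under /proc and call setns(2) to join its network namespace instead of ssh'ing in.

```python
# Enter another process's network namespace via setns(2).
import ctypes
import os

CLONE_NEWNET = 0x40000000  # from <sched.h>

libc = ctypes.CDLL("libc.so.6", use_errno=True)

def enter_net_namespace(pid):
    """Switch this process into the network namespace of `pid`."""
    fd = os.open(f"/proc/{pid}/ns/net", os.O_RDONLY)
    try:
        if libc.setns(fd, CLONE_NEWNET) != 0:
            raise OSError(ctypes.get_errno(), "setns failed")
    finally:
        os.close(fd)

# Example: enter_net_namespace(container_init_pid); sockets opened afterwards
# use the container's network stack.
```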
- TPP impacts domestic legislative capabilities
- There's a lack of transparency
- Restrictive intellectual property impacts
- Potentially affects access to affordable medicine
- ISPs to monitor IP infringements
- Corporations more able to sue the state for laws that impact them
- Creates an infringement of national sovereignty
What Can You Do?
Political Landscape
- The Greens and the Pirate Party are actively opposed and are calling for transparency
- The ALP appears to be against ISDS despite previously negotiating
- The Nationals are giving hints of disquiet
- The Liberals claim TPPA will be good for industry
Groups with a Tech Focus
Broader Coalitions
- AFTINET
- Choice Australia
- ACTU
- Public Health Association Australia
- MADGE Australia
- Environmental activism
Draw on other strategies
- Utilise Beautiful Trouble
- Consider and map your spectrum of allies
- Shift the discourse
- Tactics that welcome participation
- Allow for tiered participation
- Make it possible for anyone who wants to do something to do something
- Ensure compelling frameworks
- Think strategically about how you frame it
- Think about your organisation's structure
- Avoid burnout - keep a balanced life
Direct Action
- Bring about the change you want to see
- Can gain visibility for negotiations
Internal Strategies
- Approaches:
- Cloud provisioning
- Traditional system administration
- Targeting and classifying nodes
- Configuration management vs monitoring
- Parametrisation - word of the conference
- Parametrise your automation system
- Define data in one place
- RECLASS will merge it (see the merge sketch after this list)
- Currently uses the yaml_fs backend
- Multiple inheritance
- Adapters interface between configuration management and reclass
- CLI switches
- output in YAML / JSON
- Ansible and SALT are supported currently
- SALT integration is now via a SALT module (better performance)
- Provides inventory information for Ansible
- Future work
- Logging framework
- Membership lists
- Tests
- Disk caching
- Long running process
- Composable(?)
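A toy illustration (not reclass's actual API) of the merge behaviour described above: a node lists several classes, their parameters are deep-merged in order, and node-level parameters win, so each value is defined in exactly one place.

```python
# Toy parameter merge in the spirit of reclass; data that would normally live
# in yaml_fs class files is inlined here as dicts.
def deep_merge(base, override):
    """Recursively merge `override` into a copy of `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

classes = {
    "base":      {"ntp": {"servers": ["ntp1.example.com"]}, "timezone": "UTC"},
    "webserver": {"nginx": {"worker_processes": 4}},
    "sydney":    {"ntp": {"servers": ["ntp.syd.example.com"]}},
}

node = {"classes": ["base", "webserver", "sydney"],
        "parameters": {"nginx": {"worker_processes": 8}}}

params = {}
for cls in node["classes"]:                        # multiple inheritance, in order
    params = deep_merge(params, classes[cls])
params = deep_merge(params, node["parameters"])    # node-level overrides win

print(params)
```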
- Rollup - alert summarisation
- Alert routing
- Does three things
- Receives an event (event sketch after this list)
- API (RESTful JSON)
- No restarts required
- Bulletproof use it in anger; two developers are paid to work on it.
- Ruby, Redis, EventMachine based.
- Designed for humans
- Considers alert fatigue
- Normalcy bias
- Confirmation bias
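A minimal sketch of feeding Flapjack an event, assuming its convention of consuming JSON events from a Redis list named "events" (the field names are from memory and may differ between versions).

```python
# Push a JSON check result onto the Redis queue Flapjack is assumed to consume.
import json
import time
import redis  # pip install redis

event = {
    "entity":  "web01.example.com",   # hypothetical host
    "check":   "HTTP",
    "type":    "service",
    "state":   "critical",            # ok / warning / critical / unknown
    "summary": "HTTP 500 from / on web01",
    "time":    int(time.time()),
}

r = redis.Redis(host="localhost", port=6379)
r.lpush("events", json.dumps(event))   # Flapjack workers pop and route these
```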
Why?
- Multi-tenant support
- Segregated responsibility
- Check engine independence (event producers)
- Self-checking with oobetet
- Rollup - alert summarisation
- Contacts store media types (email, SMS, etc.), summary thresholds, entities, checks and history
- Hooks up to Google Hangouts / Jabber media types
- Tagging can be used for grouping
- No hard/soft states
- Nagios / Icinga used as a dumb alert checker (only configure check execution)
- Allows scaling
@Bulletproof
- Process >~60 events/second
- Manage - a customer portal for customers to manage their own notification rules
- manage-flapjack-sync does what it says :-)
Shortcomings
- 30s fixed broadcast delay (why?)
- Assumes a single external source of truth (Puppet, CRM, Ansible via API - needs to be written)
- Contacts need to be imported and exported from an external source (what sources? - whatever sources you write support for)
Other Features:
- Release planning is public
- As is bug tracking
- Semantic versioning
- Write/run tests (unit and integration)
- .deb .rpm packages provided
- Solid documentation available
- A bad first experience is considered a bug
- Slides
- Use file level syncing, handle exceptions via configuration management (sync sketch at the end of these notes)
- Google wrote a custom rsync
- Root partition is the same on all servers
- Ran Red Hat 7.1 for over 10 years and wanted to upgrade without rebooting (were the machines that old too?)
- On all of Google's production machines
- At Google's scale, maintaining your own distro based on Debian makes a lot of sense
- File level syncing recovers from any state and is more reliable
- Forcing services not to write to the root FS helps with distro switches
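A toy illustration of the file level syncing approach in these notes: sync a golden root image file-by-file, excluding the paths that configuration management owns. It uses plain rsync rather than Google's custom rsync, and all paths and hostnames are made up.

```python
# Sync the root filesystem from a golden image, skipping config-managed paths.
import subprocess

GOLDEN_IMAGE = "goldmaster::root/"                            # hypothetical rsync source
CONFIG_MANAGED = ["/etc/hostname", "/etc/network/", "/var/"]  # exceptions

def sync_root(dry_run=True):
    cmd = ["rsync", "-aHx", "--delete"]
    cmd += [f"--exclude={path}" for path in CONFIG_MANAGED]
    if dry_run:
        cmd.append("--dry-run")        # preview changes before applying
    cmd += [GOLDEN_IMAGE, "/"]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    sync_root(dry_run=True)
```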