Layer8


15nov2009

the following was written in response to this post on the FoRK mailing list.

Those of you in the large-scale technology operations space will be familiar with Puppet, a long-time favorite for infrastructure automation, and Chef, a more recent entrant. Both of them are open source and both have a component responsible for system discovery: Puppet includes a tool called facter and Chef includes one called ohai[4].

System discovery is the task of collecting various facts (hence the name facter) about the system on which you are running so the rest of the automation can run from a consistent view. System discovery is abstraction, and that brings a host of questions around implementation and presentation. facter takes a minimalist approach: it returns a compact set of information and relies on a number of native C extensions (which, one might argue, pushes you towards only returning a compact set of information). ohai (now) takes a rather maximalist approach: the data returned can be quite large, for example when run on OS X with the plist gem installed, and it avoids use of any native C extensions.

I cannot comment on the history or philosophy of facter, but I can do so for ohai. I wrote quite a bit of the ohai code (though I did not write the initial version, am not its maintainer, and have not contributed code for some time), and am primarily responsible for the volume of information it collects compared to similar tools (the philosophy being ‘collect it all, let the users decide what matters’). ohai began life as approximately a pure Ruby version of facter to support Chef. The data returned was similar (and similarly unstructured), the main difference being its avoidance of C extensions. The motivation for remaining pure Ruby was some combination of simplicity and a desire for consistency with the rest of Chef. Where facter uses native interfaces to collect system data, ohai relies on a lot of popen4() and regex matching. This has made ohai incredibly easy to port to new platforms, and it went from 1 (Linux) to 4 (Linux of various descriptions, Solaris, FreeBSD, and OSX) in a couple of weeks by folks more familiar with Ruby and a command line than the system-level C interfaces.

In so doing, I learned quite a bit about how command-line output varies between platforms and the issues of semantic mismatch between programming languages and the operating systems on which they run. My lessons do not lead me to conclude we are so far from an “80%” solution or that there is cause for despair.

The first contribution I made to ohai was to suggest a slight restructuring to support multiple platforms. This introduced hierarchy both in the code layout (with OS-specific plugins) and in the data output. The use of JSON for the output is the least bad option, given the common alternatives and the use of CouchDB at the server, but is not entirely satisfactory. The most bothersome issue is its lack of references. For example, I might have several IP addresses on a host, but I want one that I can refer to as its canonical address in the rest of the automation. Automatically deciding which IP to use is easy (take the primary IP address on the interface used for the default route), but indicating which address has been chosen creates a new problem: I have a top level notion of the IP address, but no way to indicate, in the data structure, where it came from.

As an example, the top level entry looks like this:

 "ipaddress": "172.16.100.202"

And the actual network interface definition (which is in the network->interfaces sub-hash) looks like this:

  "en1": {
    "status": "active",
    "flags": [
      "UP",
      "BROADCAST",
      "SMART",
      "RUNNING",
      "SIMPLEX",
      "MULTICAST"
    ],
    "number": "1",
    "addresses": {
      "00:23:6c:90:47:10": {
        "family": "lladdr"
      },
      "fe80::223:6cff:fe90:4710": {
        "scope": "Link",
        "prefixlen": "64",
        "family": "inet6"
      },
      "172.16.100.202": {
        "broadcast": "172.16.100.255",
        "netmask": "255.255.255.0",
        "family": "inet"
      }
    },
    "mtu": "1500",
    "media": {
      "supported": [
        {
          "autoselect": {
            "options": [

            ]
          }
        }
      ],
      "selected": [
        {
          "autoselect": {
            "options": [

            ]
          }
        }
      ]
    },
    "type": "en",
    "arp": {
      "172.16.100.1": "0:1b:c:f:90:23",
      "172.16.100.201": "0:23:12:a8:2d:84",
      "172.16.100.246": "0:16:cb:a9:70:4b"
    },
    "encapsulation": "Ethernet"
  }

If I want to know where the default address came from, I have to iterate over the interfaces to find it. If I added a tag to the default interface, I would then have to update two places should there be a change. Storing a reference to the default interface would be a cleaner solution, but is not supported in JSON. Creating a JSON-based format that supported references seems not such a problem; it just hasn’t been done, to my knowledge (and please don’t suggest XML, it is too bloated and complex for consideration). This is minor compared to the other, big challenges, though.
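To make the idea of a reference convention concrete, here is a minimal sketch in Python. The `$ref` key and its list-of-keys path are assumptions invented for illustration; neither ohai nor JSON itself defines them. The data is trimmed from the example above.

```python
import json

# Trimmed version of the ohai output shown above. The "$ref" convention is
# hypothetical: a value of {"$ref": [...keys...]} points at another node in
# the same document instead of copying it.
DOC = json.loads("""
{
  "ipaddress": {"$ref": ["network", "interfaces", "en1", "addresses", "172.16.100.202"]},
  "network": {
    "interfaces": {
      "en1": {
        "addresses": {
          "172.16.100.202": {
            "broadcast": "172.16.100.255",
            "netmask": "255.255.255.0",
            "family": "inet"
          }
        }
      }
    }
  }
}
""")


def resolve(doc, value):
    """Follow a {"$ref": path} value to the node it points at."""
    if isinstance(value, dict) and set(value) == {"$ref"}:
        node = doc
        for key in value["$ref"]:
            node = node[key]  # raises KeyError if the reference dangles
        return node
    return value  # plain values resolve to themselves


ref = DOC["ipaddress"]
address = ref["$ref"][-1]    # the canonical address itself
interface = ref["$ref"][2]   # the interface it was taken from
details = resolve(DOC, ref)  # netmask, broadcast, family
```

With something like this, the address is recorded in one place only, and both the interface it came from and its details can be found without iterating over every interface.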

The second problem, and one most clearly an issue for all languages interacting with the OS for systems work, is process management. While abstractions like threads and event callbacks are (reasonably) well understood, Unix-style process management remains just this side of a black art; look at the daemonization code in any C server code for an example. Scripting languages like Ruby and Python tend to just punt and directly expose the C process management interface, hence the use of popen4() all over the place in ohai. Mocking for testing and dealing with grandchildren and orphan processes present many, often obscure, problems. They can be dealt with, but nobody has bothered to write reasonable libraries to do this in Ruby (parts of it are now in Chef), and I am not familiar enough with Python to know what folks do there. Again, there is a semantic gap between what the OS is exposing and how the languages consume it. This gap does not really exist for the lightweight concurrency mechanisms, particularly event-based concurrency, where the language support is quite good (see EventMachine in Ruby and Twisted in Python, both of which are libraries, not language features; process management should yield to similar effort).
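As a rough illustration of the kind of small wrapper that would help, here is a sketch in Python (used only for illustration; ohai is Ruby and uses popen4()). The function name `run_command` and its return shape are my own assumptions. It puts the child in its own process group so that a timeout can take out grandchildren as well, and it is POSIX only.

```python
import os
import signal
import subprocess


def run_command(argv, timeout=10.0):
    """Run argv and return (exit_status, output, timed_out).

    The child gets its own session, and so its own process group, so that on
    timeout the whole group (child plus any grandchildren) can be killed.
    """
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        start_new_session=True,  # calls setsid() in the child
    )
    try:
        out, _ = proc.communicate(timeout=timeout)
        return proc.returncode, out.decode(errors="replace"), False
    except subprocess.TimeoutExpired:
        try:
            # After setsid() the process group id equals the child's pid.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group vanished between the timeout and the kill
        out, _ = proc.communicate()  # reap the child, keep partial output
        return proc.returncode, out.decode(errors="replace"), True
```

This covers only one of the obscure cases (a command that hangs after spawning children); daemonization and mocking for tests would need their own treatment.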

The third big problem I encountered was the wild variation in command output. At the amusing end of the spectrum, I received a bug report from someone running Linux with German localization and the output of ifconfig was entirely translated into German, something you are unlikely to see in a C API. Generally, the challenge in working across platforms might be summarized in this way: the more optimized a system is for direct consumption by a human operator, the harder it is to write automation that doesn’t use ‘native’ APIs. Windows is the obvious extreme example of this, but the unexpected offender here is Solaris.
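The popen-and-regex approach, and one defence against translated output, can be sketched as follows in Python. The sample text is hand-written in the BSD ifconfig style to match the en1 data shown earlier, and the regexes and function names are assumptions for illustration; this is not ohai's actual parser.

```python
import os
import re
import subprocess

# Illustrative BSD-style ifconfig output, hand-written to match the en1 data
# shown earlier; real output varies by platform and version.
SAMPLE = """\
en1: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
\tether 00:23:6c:90:47:10
\tinet6 fe80::223:6cff:fe90:4710%en1 prefixlen 64 scopeid 0x6
\tinet 172.16.100.202 netmask 0xffffff00 broadcast 172.16.100.255
"""

IFACE_RE = re.compile(r"^(\w+): flags=\w+<([^>]*)> mtu (\d+)")
INET_RE = re.compile(r"^\s+inet (\S+) netmask (\S+)(?: broadcast (\S+))?")


def run_in_c_locale(argv):
    """Run a command with LC_ALL=C so its output is not translated."""
    env = dict(os.environ, LC_ALL="C", LANG="C")
    return subprocess.run(
        argv, env=env, capture_output=True, text=True, check=True
    ).stdout


def parse_ifconfig(text):
    """Parse BSD-style ifconfig text into {interface: {...}}."""
    interfaces = {}
    current = None
    for line in text.splitlines():
        m = IFACE_RE.match(line)
        if m:
            current = interfaces.setdefault(
                m.group(1),
                {"flags": m.group(2).split(","), "mtu": m.group(3), "addresses": {}},
            )
            continue
        m = INET_RE.match(line)
        if m and current is not None:
            current["addresses"][m.group(1)] = {
                "family": "inet",
                "netmask": m.group(2),
                "broadcast": m.group(3),
            }
    return interfaces


parsed = parse_ifconfig(SAMPLE)
```

Forcing the C locale handles the translation problem, but not the differences between platforms: each one still needs its own patterns, and even this sample reports the netmask in hex rather than dotted form.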

Solaris is, in my estimation, the best OS core (kernel, filesystems, etc) on the market. It is also the long-time favorite of old-school sysadmins who pride themselves on knowing every last inch of their systems and only using automation to take care of certain, recurring tasks, rather than the full-auto, lights out style encouraged by Puppet and Chef. The output from things like ifconfig is optimized for them, being particularly verbose and human-readable, but extensive variation in output makes them very involved to parse (see the ifconfig man page for a taste). At another point in the space, there are things like the OSX system_profiler command that will happily generate XML output exactly for ease of consumption by code rather than people. All of which is really to say operating systems can, should, and sometimes do, expose interfaces above the level of the native C APIs, but intended for consumption by scripting tools. Things like system_profiler show one way of doing that, though the XML-ified plist output is not a winner. An OS that had ‘automation modes’ on all its system management tools would be a massive win for system language users and would, I think, not be hard (just a small matter of programming).
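For contrast with regex scraping, here is a sketch of consuming plist-style XML output with Python's standard plistlib. The sample document is hand-written and heavily trimmed; its keys (`_dataType`, `_items`, `_name`, `interface`) are assumptions about the shape of system_profiler output and may not match a real machine.

```python
import plistlib

# Hand-written, heavily trimmed sample in the XML plist format; real
# system_profiler output is far larger and its keys may differ.
SAMPLE_PLIST = b"""<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<array>
  <dict>
    <key>_dataType</key>
    <string>SPNetworkDataType</string>
    <key>_items</key>
    <array>
      <dict>
        <key>_name</key>
        <string>AirPort</string>
        <key>interface</key>
        <string>en1</string>
      </dict>
    </array>
  </dict>
</array>
</plist>
"""


def items_by_type(plist_bytes):
    """Map each data type to the list of its item dicts."""
    sections = plistlib.loads(plist_bytes)
    return {s["_dataType"]: s.get("_items", []) for s in sections}


network = items_by_type(SAMPLE_PLIST)["SPNetworkDataType"]
```

No patterns to maintain and no locale to worry about: the structure arrives as lists and dictionaries, which is the point of an ‘automation mode’ even when the wire format is unlovely.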

To sum up, I see three areas that need attention to better integrate systems languages and the operating systems on which they run:

  1. A simple data format that supports internal references and more data types, whether a new format or JSON convention.
  2. Proper process management as robust and simple as event-based concurrency.
  3. Operating system interfaces appropriate for scripting/automation tools.

I didn’t intend this post to be quite so long, so my apologies and thanks to those of you who made it this far. It represents my experience in one, possibly representative, corner of dealing with the challenges at the interface between systems languages and systems. It is my pious hope, to quote Roger Penrose, that none of the challenges I describe above are fundamental and all could be solved with only a modicum of effort from some motivated folk. Whether they are the same sort of problems that raised Jeff Bone’s ire I can’t say, but I remain quite optimistic there isn’t cause for despair or anger in this.

20oct2009

an amazing thing almost happened on twitter. i say almost because the folks at twitter intervened before the amazing could take place. why they chose to take action is easily understood, but in so doing a rare opportunity was lost.

the facts are these:

this morning, RevRunWisdom (yes, that Rev Run), tweeted the phrase “Know God… Know Peace. No God.. No Peace!.”.

revrunwisdom tweet

other twitter users began retweeting it, and others retweeted from them, and so on, making the phrase extremely common in the public tweet stream.

the twitter trending topics list is constructed automatically and, as far as i can tell, with only a small amount of filtering; one would assume various offensive bits of slang are in common use on the service, but they never appear in the trending topics. the constant retweeting of “Know God… Know Peace. No God.. No Peace!.” caused it, quite naturally, to appear in the trending topics. however, the automatic parsing into shorter chunks, appropriate for trending topics, resulted in two phrases showing up on the list: “No God” and “Know Peace”.

the appearance of the term “No God” in trending topics caused two new waves of tweets and retweets: religious people expressing confusion and, sometimes, outrage and non-religious people expressing delight. this is what folks in control theory call positive feedback. angrily tweeting about the presence of “No God” in trending topics only made it appear more frequently in the tweet stream, quickly driving it to the top of the list.

after several hours of this, with “No God” locked firmly in the top trending topics slot, and the volume of “No God” tweets rising, the folks at twitter stepped in. the trending topics list suddenly only showed “Know God”. the link presented actually went to a search that included both “No God” and “Know God”, but the offending term no longer appeared in trending topics. once again, i do not fault the folks at twitter for this choice, it makes sense in many ways for their business.

i am, however, disappointed. the “No God” incident was modern, american, civic debate in the purest form i’ve ever witnessed, with every action and reaction visible and documented. having misunderstood the reasons for a situation, a group of people lashed out in a manner that did nothing but serve to amplify the very situation angering them. their ignorance of the simple mechanics of trending topics and their direct, literal responsibility for its contents drove them to take exactly the wrong action.

i like to think the people angrily tweeting and retweeting “No God” would’ve learned within a few hours or days and changed their tactics, letting the topic drop, and so, disappear. even more optimistically, i like to think that would lead to thousands of ‘a-ha!’ moments as they came to understand not just this trap, but the hundreds of other, similar traps set for them all the time in their lives: at home, at work, at school, at church, by the media, by the government, by big corporations, by the nature of the world. i like to think “No God” would have resulted in a lot of people suddenly able to listen and think before lashing out. imagine it, if you can.

this was the amazing moment lost and i, for one, mourn it. one more thing i like to think, though, is that it is only the first, not the last, and eventually those lessons will be learned, changing our notions about debate and disagreement, and the world along with them.

09oct2009

everything you ever wanted to know about US oil refinery capacity.
hand-carved brooks leather saddles from kara ginther are drool-inducing.
cliff reminded me of the hundred year old color photos of pre-soviet russia.
the berlin reunion performance by royal de luxe is startling and wonderful.
colonel john boyd provides much food for thought on web operations with his OODA loop, even if he is talking about aerial combat: part 1, part 2, part 3, part 4.
the how and why of github’s use of unicorn is essential reading for those running large, rails/merb-ish apps.

07oct2009

i made these early this year with the intention of writing a lovely little blog post to go with them. as i haven’t yet, and as a few folks i know can benefit from seeing them, here they are without much explanation: the OODA loop applied to web ops, diagrams A and B.



diagram A


diagram B

16aug2009

i missed my ketones terribly. i started (back) on a ketogenic diet to lose weight. over the past 10 years i’d gained over 30lbs, and it certainly wasn’t muscle. what became quickly apparent was the incredible effect of ketosis on my mental state. whereas i normally have extreme mood swings based on my blood sugar, in ketosis i noticed being hungry, but it was hardly a distraction. i also started waking up earlier, without an alarm, and feeling much more rested. the downside is that it is a lot of effort to eliminate carbohydrates and the simplest solution is to eat a lot of meat and cheese, with a few, specific vegetables. it gets boring.

then, over the course of a week of going away parties for friends and other, minor celebrations, the carbs came back. and along with them, my mood oscillated wildly, my productivity dropped essentially to nothing, and i was miserable. that was three weeks ago and i have just pushed back into ketosis. the initial transition brought a bout of (twitter) mania, but, thankfully, that has passed. the past few weeks have been a useful, if unpleasant, experiment in using diet to manage my (unruly) mind. i don’t think i’ll be repeating it soon.

but enough about me, here are some interesting fragments from the past few weeks:

  • coda hale found yet another timing vulnerability in a common cryptography library and took the opportunity to remind us all about the subtleties of properly implementing this stuff. nate lawson recently delivered a great presentation at google on the same topic that you really must watch.

  • scala is all the rage of late, as evidenced by the sudden appearance of a bunch of books about it. there’s beginning scala, then programming scala, then that other programming scala, and, finally, programming in scala. in the tree killing business, at least, scala is a language to watch.

  • vmware has been doing great work in hardware virtualization performance measurement. this paper from 2006 is well-written and informative. consider it a must read. some further results, based on vmware’s vmark benchmark, covering performance with nested page tables, are in this blog entry from february of this year.

  • i received a copy of what is shaping up to be an excellent book: applied ballistics for long range shooting. new goal? shooting accurately at ranges and in situations where coriolis effects matter.

  • finally, there is the slow web. it’s like slow food. but, the web. savor it.

16jul2009

the gpl is fatally flawed, though the flaws were not obvious until online services became as prominent as they now are. attempts to update the gpl for “the cloud”, like the agpl, only serve to exacerbate the problem. more permissive licenses, such as bsd, mit, and apache, are less troublesome, though still lacking in certain areas, as i will explain.

the original intent of the gpl was to ensure users of software had access to the source so they could inspect it, fix it, and modify it for their needs, irrespective of the desires, or continued existence, of the software vendor. it is a noble goal, motivated by the best of intentions, and much excellent software has been produced under the gpl, with linux being the highest profile project.

with the widespread adoption of online services, a weakness in the gpl was exposed: because online services never handed software over to users, they were not obligated to share the source with them. the affero variant of the gpl closes this apparent gap by requiring source disclosure even when the software is provided as a service rather than as object code.

however, the agpl simultaneously makes itself unattractive to service providers who are rightfully concerned about contamination of their proprietary code such that they must release it. this is, in fact, the goal of the gpl and its variants: it acts as a virus to force the release of ever more source. the gpl serves to rigidly control what you can and cannot do with software covered by it, and is thus the license equivalent of digital rights management.

this leads to a related problem. the gpl produces, in practice, a two-tiered structure dividing those who control a software project from those who merely contribute to it. those in the former group are free to create a dual-license: those who want to use the software for non-commercial purposes can do so freely, but those wanting to use the software commercially must pay. the latter group cannot do this, regardless of how much they may have contributed to the project (though, of course, they could create a new project and rewrite all encumbered components). even worse, in a complete subversion of the intent of the gpl, a company can now make open source closed to its users simply by paying for a license. the license intended to protect the rights of users is instead being optimized for the rights of developers.

when the gpl is abused like this, as it is more and more frequently, the most obvious difference between it and the permissive licenses is a matter of who decides who gets paid. under the gpl, that control rests only with the project owner, just like content drm. under a permissive license, anyone can decide.

however, another, more troubling issue looms: content is at least as important as code, but open content licensing, like creative commons, and open source licensing are treated independently. as a result, vendors can adhere to both the letter and the spirit of an open source license, whether the gpl or something more permissive, yet users may have little control over the content managed by the software. as one concrete example, amazon made certain changes to linux for their kindle reader and, as required by the gpl, released those changes. however, the actual kindle application code is closed source and the content users purchase is accessible and transferable as seen fit by amazon or the publishers.

this is the real licensing hole. most users have little interest in source code access for the applications that manage their content, but they have intense interest in access to their content. if i store my personal photos on a photo site and attach all sorts of description and tag information to them, can i easily download them, with all the metadata, when i want? if i purchase books for my electronic book reader, can i easily back them up, transfer them between my devices, and continue to access them even if the company from which i purchased them goes out of business? these applications are empty shells, of no use to anyone, without their content.

the open source community needs to recognize the weaknesses, in practice, of licenses like the gpl and focus attention not on further controlling how people can use code, but, in the spirit of user freedom, on ensuring access to their content.

this post was greatly improved by input from my reviewers, andrew and coda.

emil makes a very good point about constraints on project owners running off and selling contributed code. this does not invalidate my point that the gpl and its variants are being twisted in favor of developers vs. the original purpose of protecting users. i’ve updated the post to reflect his feedback.

16jul2009

yesterday i explained the dangers of the common misunderstanding of service level agreements as insurance policies. while i mentioned a strategy of using multiple vendors rather than relying on the SLA offered by a single vendor, some more specific details will be useful in understanding and internalizing this approach.

over the past ten years i have participated in or led negotiations for internet and CDN bandwidth at internap, amazon, and microsoft. at first i invested significant time and effort in defining SLAs, methodology, metrics, and penalties, as is common practice. what eventually became apparent were two things:

  1. defining meaningful SLAs for public internet services, as opposed to private telco links, is not generally possible.
  2. SLA failure penalties are insufficient compensation for business impact.

from this experience and these realizations i changed my approach significantly. the two facets of the new strategy were, and are:

  1. only enter into contracts with as small a traffic commitment as feasible and with no penalties for termination, regardless of cause.
  2. engage multiple vendors for all bandwidth services.

availability, which is always the responsibility of the customer, is now actually under the customer’s control, rather than being delegated to a vendor via an SLA. should a vendor fail to deliver the desired service level, even for a short period of time, traffic can be shifted to other vendors until quality improves. should a vendor prove too unreliable to use at all, their services can be terminated and other vendors brought in to replace them.

to make best use of this strategy it is important to have proper software support in place. for example, a single CDN vendor should be used for content on each page served, and the vendor used varied dynamically across requests; mixing multiple CDN vendors on a single page can actually reduce availability. similar traffic engineering can be done for requests to your own web servers using DNS-based global load balancing, though with coarser granularity. similar principles will apply to “the cloud” as the interfaces and functionality in the space are commoditized.
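as a rough sketch of that software support, here is one way to pick a single CDN vendor per page request, in python. the vendor names, hostnames, weights, and health flags are all made up for illustration; how health is measured and how weights are set is left out entirely.

```python
import random

# hypothetical vendors: hostnames, weights, and health flags are made up.
VENDORS = {
    "cdn-a": {"host": "assets.cdn-a.example", "weight": 3, "healthy": True},
    "cdn-b": {"host": "assets.cdn-b.example", "weight": 1, "healthy": True},
}


def pick_vendor(vendors, rng=random):
    """pick one healthy vendor for this request, in proportion to its weight."""
    healthy = [(name, v) for name, v in vendors.items() if v["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy cdn vendor")
    names = [name for name, _ in healthy]
    weights = [v["weight"] for _, v in healthy]
    return rng.choices(names, weights=weights, k=1)[0]


def asset_urls(paths, vendors, rng=random):
    """build every asset url on a page from the same, single vendor."""
    host = vendors[pick_vendor(vendors, rng)]["host"]
    return ["https://%s/%s" % (host, p.lstrip("/")) for p in paths]
```

shifting traffic away from a misbehaving vendor is then a matter of flipping its health flag or lowering its weight. keeping each page on one vendor matters because a page that needs two vendors is only up when both are: assuming independent failures, two vendors at 99.9% each give about 99.8% for the page.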

as heinlein said, TANSTAAFL, and high-availability distributed systems are not exceptions. you are responsible for your availability. understand clearly the business value to you of a vendor SLA and be prepared to change your strategy, and put in the technical and contract work required, if it will not meet your business needs.

15jul2009

vijay posted a (better late than never) rebuttal to a post from november last year by joe weinman of at&t. i agree with all the points vijay makes, and want to focus in on a particular area of joe’s article:

(4) SLAs with financial penalties – Not only won’t enterprises accept “Well, after all, it’s still in beta” as an excuse for service outages, they demand meaningful SLAs (service level agreements) with clear metrics for evaluating achievement of those SLAs, backed up by monitoring and management systems, and financial penalties such as credits or refunds if service levels aren’t met. A “free” or low-cost service with questionable delivery quality is about as attractive to a CIO as an offer of free neurosurgery from someone who just skimmed a blog on how to do it in three easy steps.

ah, the mighty service level agreement! the tooth and claw by which the wily customer brings the vendor to heel. get the SLA right and you, the customer, can sit back and relax, safe in the knowledge that should there be an outage, you are covered. your business is protected from harm by the warm, experienced embrace of a big, stable telco. pinch me, i must be dreaming.

vijay refers to SLAs as “an actuarial game”. the situation is rather worse than that. the trouble is that many intelligent people mistake an SLA for an insurance policy. it most definitely is not.

an insurance policy is purchased for a price, often based on actuarial tables, that reflects the risk of the policy being paid out and the size of the pay out. the value of the policy is that it is an actual hedge: in the event of a claim, the holder is compensated for (approximately) the full value lost. the insurance industry is predicated on most policy holders paying far more over the life of their policies than they are paid out, and on there not being catastrophic events that cause simultaneous claims by a large number of policy holders.

a service level agreement does not work this way. an SLA is not a hedge against the business impact of an outage: it is a refund policy. the maximum value of an SLA ‘claim’ is your monthly bill. the cost to your business of an SLA failure is likely to be far higher, but you will not be compensated for that loss. a six hour service outage might cost your small business 10,000 dollars. receiving a 500 dollar service credit is cold comfort.

SLA failures become more common as you move up the stack from the rigid, extremely well-characterized, layer 1 telco sweet spot. outages that impact large sections of your customer base simultaneously are inevitable in large-scale, shared software infrastructure. if SLAs were insurance policies, vendors would quickly be out of business.

given this, the question remains: how do you achieve confidence in the availability of the services on which your business relies? the answer is to use multiple vendors for the same services. this is already common practice in other areas: internet connection multihoming, multiple CDN vendors, multiple ad networks, etc. the cloud does not change this. if you want high availability, you’re going to have to work for it.

21may2009

i have thought for some time, though only recently managed to crystalize the ideas into such simple form, that there is a significant gap between how people (and companies) succeed:

how people succeed

and how people think people (and companies) succeed:

how people think people succeed

the space between is filled by the diet books of the business world. in individuals, the disconnect on how they succeeded is prevalent among the beneficiaries of the various technology company millionaire factories. those who recognize the importance of hard work and discipline go on to continued success, while those who come to believe the main driver for their success was just their own talent achieve little but frustration.

i have no deep research to support my claims. they are, however, distributed free of charge, giving them a much better ROI than any business book. think i’m getting it wrong? contact me and let me know how!

10apr2009

last year i spoke at the panorama capital cio summit. until early the morning of my presentation, i didn’t even know what i was going to discuss. i didn’t sleep well the night before and got out of bed before 6am realizing i had something to say about how standardization and adoption of ‘cloud’ stuff will be similar to what happened in networking.

focus on many aspects of cloud standardization has reached astonishing levels of confusion and vitriol, none of which seem warranted. in an effort to introduce a bit of perspective into the debate, i am posting my slides from that talk last year. enjoy.

Copyright © 2009