Seth Eliot, “Your Path to Data-Driven Quality”

Another software QA video recommended by Gilberto Castañeda is “Your Path to Data-Driven Quality,” presented by Microsoft’s Seth Eliot. (This video is actually from an earlier presentation.) Eliot delivered this presentation at the April 1-3, 2014 Seattle ALM Forum. The slides Eliot presented are available as PDF and PowerPoint files.

I confess that I had never heard of Eliot before Gilberto brought him up; but Eliot’s title commands respect! He’s Microsoft’s “Principal Knowledge Engineer, Test Excellence.” I’ll bet that looks nice on a business card. He works with services, cloud delivery, and data-driven products. Before his Microsoft gig, Eliot spent some time at I can’t help but wonder if his stay there overlapped that of my late friend Paul S. Davis; but that was really some time ago.

Eliot proposes to give us a road map, but aphoristically points out that this will not be enough: we also need roads. He’s going to propose “a way to get there” which listeners can apply to their particular environments.

A general impression: This presentation owes much to Alan Turing. To Eliot, everything is data; code is data. And so are management dictates….

The lowest form of data-driven quality (soon to be abbreviated as “DDQ”) is “HiPPO-driven,” based on the Highest Paid Person’s Opinion. By the way, Eliot claims to have tested HiPPO-driven decisions using data gathered online, finding that one-third were wrong, one-third were right, and one-third had no noticeable impact.

Let’s go to a higher level of DDQ. You can use scoring engines to apply Bayesian analysis to historical data. Frankly, this is not Eliot’s forte. He’s more interested in a real-time approach: “testing in production” (TiP), using production data.

“It’s not as difficult as you might think.”

That’s good to know! Because guess what everyone in Eliot’s (small) audience, and I, were tensing up over!

Why would we want to test in production? Because real users are surprising: they do weird stuff. Environments can be chaotic and uncontrolled. Fine, but I’m waiting for the other shoe to drop, and it’s got to be this: What’s implicit here is that a more structured, anticipatory testing approach will not encompass this scope of weirdness; and that this is the swamp from which things will crawl and bite you in the ass.

Like any good road map, Eliot’s is very compact. The bare outline on his slide (at 7:20) needs the context of his explanation and the following slides, so I won’t bother to reproduce its bullet items here. For novices, lack of familiarity means that the hard parts—the magic—will be in designing for production-quality data, selecting data sources, and using the right tools. There’s also a circular or cyclical quality to this, which the linear list can only enumerate as answering your questions and learning new questions. (Eliot adds a seventh bullet and graphical overlay to make the loop explicit at about 7:26.)

Next, Eliot shows a slide full of widgets, illustrating the kind of real-time production data available at Microsoft. Old school Mission Control fans will not be disappointed.

Eliot’s road map begins with something that seems easy: Defining your questions. Why not just plunge in? Why not just get data and try to draw correlations? Because they would be unhelpful, like the statistical correlation between using sunblock and being more likely to drown. Kids, don’t try this at home! Plunging into the data is for advanced users.
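To make the sunblock example concrete, here is a minimal sketch of how a hidden third factor (hot weather) can make two unrelated measurements move together. Nothing here comes from Eliot's talk: the numbers are invented, and the names simulate and pearson are my own, hypothetical helpers.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient, no external libraries."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def simulate(days=2000, seed=0):
    """Invented data: weather drives both columns; neither drives the other."""
    rng = random.Random(seed)
    rows = []
    for _ in range(days):
        hot = rng.random() < 0.5
        sunblock_sales = rng.gauss(100 if hot else 20, 10)
        drownings = rng.gauss(5 if hot else 1, 1)
        rows.append((hot, sunblock_sales, drownings))
    return rows

rows = simulate()

# Across all days the two columns look strongly related.
overall = pearson([r[1] for r in rows], [r[2] for r in rows])

# Hold the weather fixed (hot days only) and the relationship fades.
hot_days = [r for r in rows if r[0]]
within_hot = pearson([r[1] for r in hot_days], [r[2] for r in hot_days])

print(f"all days: {overall:.2f}   hot days only: {within_hot:.2f}")
```

If you plunged into this data without a question in mind, the first number would look like a finding. Asking "what else changes on those days?" is the kind of question that has to come first.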

How about an example, Seth? What questions does Microsoft ask about Exchange Online? “Is the application available?” (Some of us may yet think of this as “dial tone.”) This is important, because when the application’s not available, it can stop the user’s work.

Note that the user’s perception of availability can be more subtle than the provider’s.

Eliot’s first example of this is an occasion when the Japanese version of Exchange Online silently failed and loaded labels in English. From the server point of view, the application worked; but most Japanese users were cast adrift.
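The gap between the two points of view can be sketched in a few lines. This is only an illustration under my own assumptions: PageResponse and its fields are hypothetical, not anything Eliot showed or any real Exchange Online interface.

```python
from dataclasses import dataclass

@dataclass
class PageResponse:
    """Hypothetical summary of one page load (all fields are assumptions)."""
    status: int            # HTTP status the server returned
    requested_locale: str  # locale the user asked for, e.g. "ja-JP"
    label_locale: str      # locale of the labels actually rendered

def server_view_available(resp: PageResponse) -> bool:
    # The server's question: did I answer successfully?
    return resp.status == 200

def user_view_available(resp: PageResponse) -> bool:
    # The user's question: did I get something I can actually use?
    return resp.status == 200 and resp.label_locale == resp.requested_locale

# A silent fallback to English for a Japanese user.
fallback = PageResponse(status=200, requested_locale="ja-JP", label_locale="en-US")

print(server_view_available(fallback))  # True
print(user_view_available(fallback))    # False
```

The second check is the sort of thing you only think to measure once you have defined availability from the user's side.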

Slow response may also make an application as good as dead to users, in a way that is not so clearly visible from the server end. Eliot shows a graph of user abandonment vs. time waiting to start a video stream (13:37). About five seconds of delay is enough to get rid of almost everyone.

Streaming Video Abandonment vs. Delay for Different Broadband Connection Types

Except mobile users, who have been conditioned to be much more patient.

Perhaps these users remember the ground crew members who did not immediately let go when a gust of wind caught the flying aircraft carrier USS Akron…and who can blame them? Let go while you still can!

The advantage of production testing in situations like these “is manifest.” And here’s where the paradigm-shifting light bulb really began to turn on for me, as I began to understand how dramatically different Eliot’s approach is from what I’ve spent most of my testing time doing. Eliot wants to watch a dashboard showing distributed, real-time application behavior; whereas my peers and I explore the territory bounded by anticipating user behavior, and evaluating the consistency of database tables. (These intersect Eliot’s worldview as means of acquiring “active” data, as opposed to “passive,” real user data: a distinction he introduces at 18:33, discussed below.) When you click a box offering to participate in the Windows Customer Experience Improvement Program, you’re volunteering data for Eliot’s production testing pipeline, or perhaps for another one much like it.

A thumbnail sketch adapted from one of Eliot’s graphs sums up how active data—all the stuff you and your co-workers gather using your made-up test cases and jury-rigged fake data—yields to passive data as the application goes into production and, with any luck, as the users pile on like the Clampett family on Jed’s old truck.

Active and Passive Test Data

Eliot’s “active”/“passive” semantics confused me at first. When you work at Microsoft scale, surely the application is hopping up and down in production in a way it never does in sterile testing environments. You can beat on it with load tests, but to get the noise level and surprises of real users and deployment environments that time forgot? But Eliot’s choice of adjectives is drawn from the tester’s perspective, as it really should be. Active data is what you go out and hunt. Passive data…you can catch that in drift nets.

By the way, Eliot’s schematic curves of active and passive tests are not meant to reflect the natural order of things: what happens when you sit back and watch nature take its course. “Staged data acquisition mitigates risk.” Of course! You’d stage a deployment of any large, outward-facing system, right? First the internal users, then maybe some friendly beta testers, and so on. Data acquisition goes hand in hand.
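Here is a rough sketch of what "staged" might look like in code: widen the audience one stage at a time, and stop as soon as the data from the current stage looks bad. The stage names follow the paragraph above; the function staged_rollout, the error-rate measure, and the 1% threshold are all my own assumptions, not Eliot's.

```python
STAGES = ["internal users", "friendly beta testers", "all users"]

def staged_rollout(error_rate_for_stage, max_error_rate=0.01):
    """Walk through STAGES in order, halting at the first unhealthy one.

    error_rate_for_stage: callable returning the observed error rate
    (0.0 to 1.0) for a stage name. The 1% threshold is an arbitrary
    illustrative default.
    Returns (stages completed, stage that halted the rollout or None).
    """
    completed = []
    for stage in STAGES:
        if error_rate_for_stage(stage) > max_error_rate:
            return completed, stage
        completed.append(stage)
    return completed, None

# Invented observations: the beta stage surfaces a problem.
observed = {
    "internal users": 0.002,
    "friendly beta testers": 0.05,
    "all users": 0.001,
}

completed, halted_at = staged_rollout(observed.get)
print(completed, halted_at)
```

Each stage both limits who gets hurt and supplies the passive data that decides whether the next stage happens at all.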

We knew that in our guts, right? Sometimes it’s important to make these things explicit.

Here Eliot begins to focus on the questions we might bring to testing in production; or rather, the kind of answer we’re looking for. How would we rank availability, performance, and usage? Or: is there any situation in which availability does not come first?

Yes, Eliot suggests: Twitter and social platforms might value usage and feature adoption over availability, for example.

What scenarios are most important? Is it more important that a user should be able to send email immediately? Or that the product logo display properly? (The Marketing department may have a hard time with this, but I’m going to go with sending email. But maybe there’s some context in which I’d feel differently.)

In any case, think about these things, because they affect the priority of your test scenarios.

I enjoy collecting buzz phrases, and wouldn’t consider ending this overview without recognizing Eliot’s linguistic contributions.

Most of us have heard of “eating your own dog food.” (If you haven’t, it means using your own products. Wikipedia’s got more depth.) Eliot takes this to the next level: “It worked in dog food,” “We dogfooded it.”

In Eliot’s world, these concepts are important enough, and thrown around frequently enough, to rate acronyms:

  • DDQ: data-driven quality.
  • HiPPO: highest-paid person’s opinion.
  • RUM: real user measurement (i.e. acquiring passive data).
  • TiP: testing in production.

Open Lecture by James Bach on Software Testing

This YouTube video was my recent introduction to James Bach, the enfant terrible of the “context-driven school of software testing.” (I thank my colleague Gilberto Castañeda, “El Águila de Tenochtitlan,” for suggesting I watch this video.)

When I write enfant terrible—which I guess means “terrible infant”—I don’t mean an upset child, fists clenched, face red, who stops screaming only to draw breath. We’re talking about an adult here, who on the strength of his reputation, has been invited to speak to students in Estonia. (Does anyone in Estonia even know your name? Not mine.) In this context, we’re talking about someone who upsets some established order by expressing himself enthusiastically and articulately.

I typically find these people quite compelling. Their presence tells us that a discipline has room for finding success without first enduring a Druid-like, decades-long apprenticeship. (Not that Bach advocates ignorance, some kind of Noble Savage approach, or barefoot and wide-eyed laying on of hands. He urges learning all the tools and technologies you can.)

I suspect that when Bach planted himself behind the podium in this college lecture hall, he had a very general outline in mind. In fact, he brought a few slides, which he seems to go through in order. However, he’s not afraid to range far and wide to make his points. The broad outline is defined, but the sentences are JIT. Something he does a lot of is telling stories. (This, too, is an approach that reels me in. The “story” is a way of encapsulating ideas which I can consume like M&Ms. Stories are brain candy.) Much of the presentation consists of Bach’s anecdotes about approaching a software testing situation and turning over rocks to reveal previously invisible ugly stuff.

How does he do it?

This is part of what Bach tries to express with the “context-driven testing” label. He does it by recognizing that the testing context, the context of software or system success or failure, is an inclusive one. It generally covers a lot more territory than the problem description makes clear—where “generally”  means something like “while pigs do not fly.” Bach zooms out, or looks at the system from other perspectives, to see potential vulnerabilities. It’s creative. It’s insightful. It provides the stuff of good stories.

One story which stands out as I listen to the presentation for the second time, in the background, is Bach’s summary of Miyamoto Musashi’s Book of Five Rings. Bach summarizes the book as an account of how Musashi survived many duels. How? By using a particular type of sword? A technique? By training with a particular school?


In Bach’s telling, Musashi owed his survival to mastering many weapons (learn all the tools and technologies you can), and then doing whatever was necessary to win.

This will do for now as a metaphor for context-driven testing.

When you hear one of Bach’s stories, and feel that you laugh or thrill with him, this encourages you to believe that you, too, are capable of joining the context-driven brotherhood (a belief Bach explicitly encourages elsewhere), and sharing in these penetrating insights. Bach fosters this perception. He begins with simple, clearly defined situations, giving his audience a chance to warm up and stretch before they hear about more complex situations. In his first example, he shows the students some simple code and then discusses all the situations not captured by the code’s logic. After a little stumbling, Bach has probably closely aligned his audience’s understanding of the example with his own. After that, each additional story reinforces their complicity.

There are many more trivial, discrete, and personal reasons I find Bach really compelling. There’s a whole body of shared memes surfacing in his discussion: The Book of Five Rings. Simulated annealing. Artificial intelligence and knowledge representation, circa 1980s. To name a few.

On the other hand, I’ve bought some of my groceries as an advocate for various processes: the SEI CMMI, XP, and a few things in between. Bach is quick to express his disdain for “best practices” and methodologies which seek to eliminate human “squishiness.”

I’m very interested in how I can use the Selenium framework for some automated testing. Bach acknowledges that this approach has its uses, but belongs in the back seat.

As I write, I’m infatuated with Bach and what I understand of context-driven testing. Will this survive more exposure?