Seth Eliot, “Your Path to Data-Driven Quality”

Another software QA video recommended by Gilberto Castañeda is “Your Path to Data-Driven Quality,” presented by Microsoft’s Seth Eliot. (This video is actually from an earlier presentation.) Eliot delivered this presentation at the April 1-3, 2014 Seattle ALM Forum. The slides Eliot presented are available as PDF and PowerPoint files.

I confess that I had never heard of Eliot before Gilberto brought him up; but Eliot’s title commands respect! He’s Microsoft’s “Principal Knowledge Engineer, Test Excellence.” I’ll bet that looks nice on a business card. He works with services, cloud delivery, and data-driven products. Before his Microsoft gig, Eliot spent some time at Amazon.com. I can’t help but wonder if his stay there overlapped with that of my late friend Paul S. Davis; but that was really some time ago.

Eliot proposes to give us a road map, but aphoristically points out that this will not be enough: we also need roads. He’s going to propose “a way to get there” which listeners can apply to their particular environments.

A general impression: This presentation owes much to Alan Turing. To Eliot, everything is data; code is data. And so are management dictates….

The lowest form of data-driven quality (soon to be abbreviated as “DDQ”) is “HiPPO-driven,” based on the Highest Paid Person’s Opinion. By the way, Eliot claims to have tested HiPPO-driven decisions using data gathered online, finding that one-third were wrong, one-third were right, and one-third had no noticeable impact.

Let’s go to a higher level of DDQ. You can use scoring engines to apply Bayesian analysis to historical data. Frankly, this is not Eliot’s forte. He’s more interested in a real-time approach: “testing in production” (TiP), using production data.
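Eliot doesn’t linger on that branch, so take the following as my own back-of-the-envelope sketch of what “Bayesian analysis on historical data” might look like; the “late change” feature and every number in it are invented for illustration, not anything from the talk:

```python
# A toy illustration (mine, not Eliot's) of Bayesian scoring over historical data:
# estimate P(module has a defect | it was changed late in the cycle),
# using invented counts from a hypothetical defect-tracking history.

prior_defect = 0.10          # P(defect): 10% of past releases of this module shipped a bug
p_late_given_defect = 0.60   # P(late change | defect): buggy releases were usually touched late
p_late_given_ok = 0.20       # P(late change | no defect)

p_late = (p_late_given_defect * prior_defect
          + p_late_given_ok * (1 - prior_defect))

posterior = p_late_given_defect * prior_defect / p_late
print(f"P(defect | late change) = {posterior:.2f}")   # roughly 0.25
```

A scoring engine is, at heart, that calculation repeated over many features and many components. Now, back to the real-time approach, TiP.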

“It’s not as difficult as you might think.”

That’s good to know! Because guess what everyone in Eliot’s (small) audience, and I, were tensing up over!

Why would we want to test in production? Because real users are surprising: they do weird stuff. Environments can be chaotic and uncontrolled. Fine, but I’m waiting for the other shoe to drop, and it’s got to be this: What’s implicit here is that a more structured, anticipatory testing approach will not encompass this scope of weirdness; and that this is the swamp from which things will crawl and bite you in the ass.

Like any good road map, Eliot’s is very compact. The bare outline on his slide (at 7:20) needs the context of his explanation and the following slides, so I won’t bother to reproduce its bullet items here. For novices, lack of familiarity means that the hard parts—the magic—will be in designing for production-quality data, selecting data sources, and using the right tools. There’s also a circular or cyclical quality to this, which the linear list can only enumerate as answering your questions and learning new questions. (Eliot adds a seventh bullet and graphical overlay to make the loop explicit at about 7:26.)

Next, Eliot shows a slide full of widgets, illustrating the kind of real-time production data available at Microsoft. Old school Mission Control fans will not be disappointed.

Eliot’s road map begins with something that seems easy: Defining your questions. Why not just plunge in? Why not just get data and try to draw correlations? Because they would be unhelpful, like the statistical correlation between using sunblock and being more likely to drown. Kids, don’t try this at home! Plunging into the data is for advanced users.
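To see why Eliot insists on questions first, here’s a throwaway sketch with data I made up: a hidden confounder (sunny weather) makes sunscreen use and drownings correlate strongly even though neither causes the other.

```python
# Toy data (invented) showing why "just get data and correlate" misleads:
# sunny weather drives both sunscreen use and swimming, and swimming drives
# drownings, so sunscreen "correlates" with drowning without causing it.
import numpy as np

rng = np.random.default_rng(0)
sunny_days = rng.uniform(0, 1, 1000)                # how sunny each day is
sunscreen = sunny_days + rng.normal(0, 0.1, 1000)   # sunscreen use tracks sunshine
drownings = sunny_days + rng.normal(0, 0.1, 1000)   # so do swimming accidents

r = np.corrcoef(sunscreen, drownings)[0, 1]
print(f"correlation(sunscreen, drownings) = {r:.2f}")  # about 0.9, and entirely spurious
```

Without a question (and a causal hunch) in hand, that correlation of roughly 0.9 looks like a finding.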

How about an example, Seth? What questions does Microsoft ask about Exchange Online? “Is the application available?” (Some of us may yet think of this as “dial tone.”) This is important, because when the application’s not available, it can stop the user’s work.

Note that the user’s perception of availability can be more subtle than the provider’s.

Eliot’s first example of this is an occasion when the Japanese version of Exchange Online silently failed to localize and loaded its labels in English. From the server’s point of view, the application worked; but most Japanese users were cast adrift.
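The Japanese-labels incident is a good argument for probes that check what the user actually sees rather than what the server reports. Here’s a minimal sketch of such a check; the URL, the expected localized string, and the five-second budget are placeholders of mine, not anything from Eliot’s pipeline:

```python
# Sketch of an "active" availability probe that checks user-visible correctness,
# not just server health. URL, marker string, and thresholds are placeholders.
import time
import requests

PROBE_URL = "https://mail.example.com/owa/?lang=ja-JP"   # hypothetical endpoint
EXPECTED_MARKER = "受信トレイ"                            # "Inbox" in Japanese
LATENCY_BUDGET_S = 5.0                                    # see the abandonment curve below

def probe():
    start = time.monotonic()
    resp = requests.get(PROBE_URL, timeout=30)
    elapsed = time.monotonic() - start

    ok_status = resp.status_code == 200          # what the server sees
    ok_language = EXPECTED_MARKER in resp.text   # what the Japanese user sees
    ok_latency = elapsed <= LATENCY_BUDGET_S     # "available" in any useful sense

    return {"status": ok_status, "localized": ok_language,
            "fast_enough": ok_latency, "seconds": round(elapsed, 2)}

if __name__ == "__main__":
    print(probe())
```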

Slow response may also make an application as good as dead to users, in a way that is not so clearly visible from the server end. Eliot shows a graph of user abandonment vs. time waiting to start a video stream (13:37). About five seconds of delay is enough to get rid of almost everyone.

Streaming Video Abandonment vs. Delay for Different Broadband Connection Types

Except mobile users, who have been conditioned to be much more patient.

Perhaps these users remember the ground crew members who did not immediately let go when a gust of wind caught the flying aircraft carrier USS Akron…and who can blame them? Let go while you still can!

The advantage of production testing in situations like these “is manifest.” And here’s where the paradigm-shifting light bulb really began to turn on for me, as I began to understand how dramatically different Eliot’s approach is from what I’ve spent most of my testing time doing. Eliot wants to watch a dashboard showing distributed, real-time application behavior; whereas my peers and I explore the territory bounded by anticipating user behavior and evaluating the consistency of database tables. (These intersect Eliot’s worldview as means of acquiring “active” data, as opposed to “passive,” real-user data: a distinction he introduces at 18:33, discussed below.) When you click a box offering to participate in the Windows Customer Experience Improvement Program, you’re volunteering data for Eliot’s production testing pipeline, or perhaps for another one much like it.

A thumbnail sketch adapted from one of Eliot’s graphs sums up how active data—all the stuff you and your co-workers gather using your made-up test cases and jury-rigged fake data—yields to passive data as the application goes into production and, with any luck, as the users pile on like the Clampett family on Jed’s old truck.

Active and Passive Test Data

Eliot’s “active”/“passive” semantics confused me at first. When you work at Microsoft scale, surely the application is hopping up and down in production in a way it never does in sterile testing environments. You can beat on it with load tests, but how do you get the noise level and the surprises of real users and of deployment environments that time forgot? But Eliot’s choice of adjectives is drawn from the tester’s perspective, as it really should be. Active data is what you go out and hunt. Passive data…you can catch that in drift nets.
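For the concrete shape of what lands in those drift nets, here’s an invented sketch of a passive, RUM-style event being reported from the client; the endpoint and field names are mine and have nothing to do with Microsoft’s actual telemetry:

```python
# Invented sketch of passive (RUM-style) data: the client reports what real users
# actually did and how long it took; nobody wrote a test case for any of it.
import time
import requests

COLLECTOR_URL = "https://telemetry.example.com/events"   # hypothetical collector

def report_event(user_id: str, action: str, duration_ms: int, succeeded: bool) -> None:
    """Fire-and-forget usage event, the raw material of passive data."""
    event = {
        "ts": time.time(),
        "user": user_id,            # anonymized in any real pipeline
        "action": action,           # e.g. "send_mail", "open_inbox"
        "duration_ms": duration_ms,
        "ok": succeeded,
    }
    try:
        requests.post(COLLECTOR_URL, json=event, timeout=5)
    except requests.RequestException:
        pass    # telemetry must never take the application down with it

# report_event("u123", "send_mail", duration_ms=840, succeeded=True)
```

Aggregate enough of these and you have the raw material for a dashboard like the one Eliot shows.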

By the way, Eliot’s schematic curves of active and passive tests are not meant to reflect the natural order of things: what happens when you sit back and watch nature take its course. “Staged data acquisition mitigates risk.” Of course! You’d stage a deployment of any large, outward-facing system, right? First the internal users, then maybe some friendly beta testers, and so on. Data acquisition goes hand in hand.
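To make that concrete, here’s one invented shape a staged plan could take: rings of progressively wider exposure, each gated on the health data gathered from the ring before it. The ring names, percentages, and thresholds are all mine, not Eliot’s:

```python
# Invented ring-based rollout plan: each stage widens exposure only if the
# passive data from the previous stage stays healthy.
ROLLOUT_RINGS = [
    {"name": "internal dogfood", "traffic_pct": 1,   "min_availability": 0.999},
    {"name": "friendly beta",    "traffic_pct": 5,   "min_availability": 0.999},
    {"name": "one datacenter",   "traffic_pct": 25,  "min_availability": 0.9995},
    {"name": "worldwide",        "traffic_pct": 100, "min_availability": 0.9995},
]

def next_ring(current_index: int, observed_availability: float) -> int:
    """Advance to the next ring only if observed availability meets the bar."""
    ring = ROLLOUT_RINGS[current_index]
    if observed_availability < ring["min_availability"]:
        return current_index        # hold (or roll back) and keep gathering data
    return min(current_index + 1, len(ROLLOUT_RINGS) - 1)
```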

We knew that in our guts, right? Sometimes it’s important to make these things explicit.

Here Eliot begins to focus on the questions we might bring to testing in production; or rather, the kind of answer we’re looking for. How would we rank availability, performance, and usage? Or: is there any situation in which availability does not come first?

Yes, Eliot suggests: Twitter and social platforms might value usage and feature adoption over availability, for example.

What scenarios are most important? Is it more important that a user should be able to send email immediately? Or that the product logo display properly? (The Marketing department may have a hard time with this, but I’m going to go with sending email. But maybe there’s some context in which I’d feel differently.)

In any case, think about these things, because they affect the priority of your test scenarios.
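One way to make that explicit is a crude weighted score per scenario, with the weights encoding the availability-versus-performance-versus-usage ranking from a moment ago. Every number below is invented purely to show the shape of the exercise:

```python
# Invented weighted ranking of test scenarios: weights encode what the business
# answered when asked to rank availability, performance, and usage.
WEIGHTS = {"availability": 0.5, "performance": 0.3, "usage": 0.2}

SCENARIOS = {
    # scenario: impact scores (0-10) per quality dimension
    "send email immediately": {"availability": 10, "performance": 8, "usage": 9},
    "logo displays properly": {"availability": 1,  "performance": 1, "usage": 3},
}

def priority(impacts: dict) -> float:
    return sum(WEIGHTS[dim] * score for dim, score in impacts.items())

for name, impacts in sorted(SCENARIOS.items(), key=lambda kv: -priority(kv[1])):
    print(f"{name}: {priority(impacts):.1f}")
# Sending email outranks the logo, which is roughly where I landed above too.
```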

I enjoy collecting buzz phrases, and wouldn’t consider ending this overview without recognizing Eliot’s linguistic contributions.

Most of us have heard of “eating your own dog food.” (If you haven’t, it means using your own products. Wikipedia’s got more depth.) Eliot takes this to the next level: “It worked in dog food,” “We dogfooded it.”

In Eliot’s world, these concepts are important enough, and thrown around frequently enough, to rate acronyms:

  • DDQ: data-driven quality.
  • HiPPO: highest-paid person’s opinion.
  • RUM: real user measurement (i.e. acquiring passive data).
  • TiP: testing in production.