You could think of a hospital as a big cruise liner, afloat on a sea of data, charting and correcting course as the captain and crew read the shifting wave patterns, monitor their instruments, pump the bilge ...

Put more prosaically, hospital executives have an awful lot of numbers to navigate while checking their dashboards and generating feedback internally and externally. Artificial intelligence technologies like machine discovery, machine learning and natural language generation are revolutionizing the performance of those tasks.

Take benchmarking

It wasn’t until the mid-1990s that the notion of comparing an organization’s performance with results achieved by other organizations in the same business even entered the health care mindset. The first mention of the term “benchmarking” in the literature catalogued in the National Library of Medicine’s PubMed dates from 1989. Today, the search term “hospital benchmarking” turns up 5,636 PubMed entries.

Type those words into a search engine and you’ll be deluged with URLs. The free online Dartmouth Atlas of Health Care, for example, lets you compare a vast array of operational and financial indicators for the regions or hospitals of your choice against national and state averages, or against other regions and hospitals. You can download massively detailed Hospital Compare data sets from Medicare.gov, or get a broad-brush annual sense of how your own hospital stacks up against national averages on some 200 indicators. And scores of vendors offer customized benchmarking reports full of eye-catching charts and graphs.

However, as noted by Victor Sower, emeritus professor of management at Sam Houston State University in Huntsville, Texas, writing for the American Society for Quality, “Doing a simple comparison with a national average ... doesn’t tell the hospital how to improve operations. ... [It doesn’t tell] which hospitals are the best performers and what best in class performance is.”

Moreover, he points out, “National averages provide no measure of variation in performance.” He cites a real-life example in which it took as many as seven calls to get a clinic appointment before a quality improvement project, three at the most afterward — even though the number of calls necessary on average remained unchanged at 1.4. Standard deviation had been slashed from 0.99 to 0.52 — a laudable achievement, but invisible when only averages are compared.
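Sower’s point is easy to demonstrate. The short sketch below uses invented call counts, not the clinic’s actual figures, to reproduce the phenomenon he describes: two distributions with an identical average number of calls but very different spread.

```python
import statistics

# Hypothetical call counts for 20 appointment requests, before and after
# a quality-improvement project. The data are invented to illustrate
# Sower's point (identical averages, very different variation); they do
# not reproduce the clinic's actual figures.
before = [1] * 17 + [2] * 2 + [7]        # one caller needed 7 calls
after  = [1] * 14 + [2] * 4 + [3] * 2    # nobody needed more than 3

for label, calls in (("before", before), ("after", after)):
    print(f"{label:6} mean={statistics.mean(calls):.1f} "
          f"sd={statistics.pstdev(calls):.2f} max={max(calls)}")
```

Both lines report a mean of 1.4 calls, yet the standard deviation drops by half and the worst case falls from seven calls to three — an improvement that a comparison of averages alone would never surface.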

Then again, simply using averages as yardsticks “isn’t very motivating,” suggests artificial intelligence pioneer Raul Valdes-Perez, co-founder and CEO of OnlyBoth, a Pittsburgh-based developer of automated benchmarking engines (and co-founder of Vivisimo, now an IBM Watson company).

In fact, he recently wrote in Harvard Business Review, “Peer comparison as generally practiced suffers from tunnel vision and so misses critical insights, to everyone’s detriment.

“Almost universally,” he continues, “the benchmarker chooses one or two organizational goals, then picks a few key metrics (key performance indicators) relevant to those goals, and finally selects several peer groups from a limited set. The outputs are then the mean, median, distribution or high-percentile values for those peer groups on those metrics. The conclusion is that the organization may or may not have a problem, which may or may not be addressable.”

From numbers to words

Data analysts today employ a variety of machine-assisted techniques lumped under the rubric of “advanced analytics.” Data mining, predictive analytics, location intelligence and other AI technologies let statisticians trawl the depths of structured and unstructured information — from social media commentary to the networked sensors of the Internet of Things — to judge current performance and forecast future outcomes.

But grasping the import of mathematical measures may not be part of the skill set of the executives charged with strategizing and making decisions. That’s why organizations typically rely on an intermediate staff of statistical experts to interpret the data and pass along to the C-suite reports, summaries, charts and graphs — “actionable” visualizations. Even those graphic presentations often need dumbing down for all but the most numerate.

Here’s where an AI branch called natural language generation comes into play.

Based on machine learning, NLG converts data points — numbers — into ordinary English sentences (or, of course, text in any desired language).
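The core idea can be sketched in a few lines. This is a minimal, rule-based illustration of turning a data point into a sentence — commercial NLG systems are far more sophisticated, and every name and threshold here is a made-up assumption, not any vendor’s actual logic.

```python
def describe_metric(hospital, metric, value, peer_mean, peer_sd):
    """Turn one benchmarking data point into an English sentence.
    A toy, rule-based sketch of the numbers-to-words idea."""
    sds = (value - peer_mean) / peer_sd if peer_sd else 0.0
    if abs(sds) < 1:
        verdict = "is close to the peer average"
    elif sds < 0:
        verdict = "is well below the peer average"
    else:
        verdict = "is well above the peer average"
    return (f"{hospital}'s {metric} of {value:.1f}% {verdict} "
            f"of {peer_mean:.1f}% (standard deviation {peer_sd:.2f}%).")

# Hypothetical hospital, using the readmission figures quoted later
# in this article as sample inputs.
print(describe_metric("General Hospital", "readmission rate",
                      4.0, 4.9, 0.58))
```

Real systems layer vocabulary variation, document planning and learned models on top, but the pipeline is the same: numbers in, readable sentences out.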

“Instead of manually analyzing, interpreting and communicating insights to employees and customers, our intelligent system ... does it automatically,” Narrative Science, a Chicago-based NLG business software developer, says on its website. “Advanced natural language generation is an important technology that turns copious data into a summary that enables you to read, digest and understand what is happening and its impact on you.”

Narrative Science customers are primarily in finance, sales and consulting. They rely on NLG to “explain the numbers in investment portfolios, trading records and market statistics.” But as a demonstration of the power of machine discovery and natural language generation in health care, OnlyBoth has applied these technologies to derive benchmarking insights from the latest Centers for Medicare & Medicaid Services data sets for almost every hospital in the country. The tool is free online.

How is this hospital doing?

OnlyBoth benchmarking algorithms enable the engine to recognize context and detect relative implications in raw data. For example, when queried about one of the nation’s premier medical institutions, the first noteworthy metric it reports is that it “has the lowest rate of wounds that split open after surgery on the abdomen or pelvis (0.86 percent) among all 4,803 hospitals [in the Hospital Compare database]. That 0.86 percent compares with an average of 1.7 percent and standard deviation of 0.34 percent across the 4,803 hospitals.”

That’s just one of 28 “good or neutral” results reported when one clicks on a tab labeled “How is this hospital doing?” Another is that it has “the lowest rate of readmission after hip/knee surgery (4 percent)” among the hospitals in its county. “That 4 percent compares to an average of 4.9 percent and standard deviation of 0.58 percent across the 11 hospitals.” The peer group has been narrowed.

Click on “What’s best in class?” for readmission after hip or knee surgery and the benchmarking engine takes a broader view of the hospital: “147th best (tied with 56 others) among the 2,525 hospitals with applicable values and that are an acute care hospital, which range from a best of 2.6 percent ... to a worst of 8.5 percent ... with an average of 4.9 percent and standard deviation of 0.65 percent.”
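The “147th best (tied with 56 others)” phrasing suggests standard competition ranking, in which tied hospitals all share a rank equal to one plus the number of strictly better performers. The sketch below illustrates that convention with a hypothetical peer group; the ranking rule is inferred from the wording, not confirmed by OnlyBoth.

```python
def competition_rank(value, values, lower_is_better=True):
    """Standard competition ("1224") ranking: rank = 1 + the number of
    strictly better values; tied values all share that rank."""
    if lower_is_better:
        return 1 + sum(1 for v in values if v < value)
    return 1 + sum(1 for v in values if v > value)

# Readmission rates for a made-up six-hospital peer group (lower is better).
rates = [2.6, 4.0, 4.0, 4.0, 4.9, 8.5]
rank = competition_rank(4.0, rates)
ties = rates.count(4.0) - 1
print(f"rank {rank}, tied with {ties} others")  # -> rank 2, tied with 2 others
```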

If you’re looking for problem areas, click on “Where could it improve?” and the engine returns 11 “ambivalent or bad” results. For example, this hospital “has the fewest patients who reported that the area around their room was always quiet at night (40 percent) among the 811 hospitals with at least 80 percent of patients who reported ‘yes,’ they would definitely recommend the hospital. ... That 40 percent compares to an average of 69.4 percent and standard deviation of 10.7 percent across the 811 hospitals.”

What’s remarkable about this benchmark is that it was discovered by the machine dynamically, by combing through the entire Medicare.gov data set and finding provocative associations few if any human analysts would be likely to note.

Now what?

If the finding is positive, the engine suggests asking the following questions: “Why is this happening?” “Should we celebrate, praise, reward, publish or brag?” and “Is there an underlying good practice that we should disseminate or emulate elsewhere?”

If the benchmark shows underperformance — all that noise at night, say — the prompt suggests: “Should we be concerned?” “Why is this happening?” “Can this be fixed?” “What actions could be taken?” and “Which action should we take, if any?”

In his Harvard Business Review piece, Valdes-Perez imagines three ways a hospital CEO might react: “1. We’re profitable [and] prestigious. ... What’s a little nocturnal noise? 2. There’s been night-time construction next door for the last year, and it’s almost done, so the problem will solve itself. 3. I can’t think of any reason why we should be at the bottom of this elite peer group. I’ll forward this paragraph to our chief of operations to investigate and report back what may be happening.”

The OnlyBoth benchmarking engine allows users to create their own peer groups, picking from a long list of attributes, and to weigh themselves against best in class in categories from acute myocardial infarction 30-day mortality rate to colon surgery site infections. The engine includes more than 500,000 benchmarking insights, about 110 per hospital.

“What we’re looking for is outlier behavior,” says Valdes-Perez. “Where the hospital is near the top or the bottom are the interesting metrics. The most powerful heuristic is the size of the peer group. Other things being equal, being exceptional within a larger peer group is a better distinction than within a smaller group.”
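Valdes-Perez’s heuristic — extreme position matters, and it matters more in a bigger peer group — can be sketched as a simple scoring rule. This is a guessed illustration of the idea, not OnlyBoth’s actual algorithm.

```python
def interestingness(value, peers):
    """Score how noteworthy a metric is: extremity within the peer
    group (0 = dead middle, 1 = very top or bottom) weighted by group
    size, so the same outlier position counts for more in a larger
    group. A guessed heuristic, not OnlyBoth's actual scoring.
    Requires at least one peer."""
    group = sorted(peers + [value])
    n = len(group)
    rank = group.index(value) + 1        # 1 = smallest in the group
    extremity = abs(rank - (n + 1) / 2) / ((n - 1) / 2)
    return extremity * n

# Being the quietest-at-night outlier among 812 hospitals is a bigger
# distinction than the same bottom rank among 12 (hypothetical values).
print(interestingness(40.0, [55.0 + i * 0.05 for i in range(811)]))  # -> 812.0
print(interestingness(40.0, [55.0 + i for i in range(11)]))          # -> 12.0
```

Under this rule a hospital sitting exactly in the middle of its peers scores zero, however large the group — which matches the intuition that averages alone make for dull benchmarks.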

OnlyBoth has also applied machine discovery and NLG to benchmark the nation’s 15,665 CMS-certified nursing homes, 1,889 private colleges and the tax regimens of the world's countries.

“We’re doing for this process what search engines did for information seeking,” Valdes-Perez says. “And hospitals are using our website every day. We can tell. I say to hospital executives, ‘Maybe instead of paying a consultant a lot of money to benchmark, you can just look here.’ ”

David Ollier Weber is a principal of The Kila Springs Group in Placerville, Calif., and a regular contributor to H&HN Daily.