Rohan Pal
2023
Microsoft Power BI
Evaluating usability of natural language prompts feature for generating mathematical formulas
Collaborators
Colette Chen
Maomao Ding
Swati
The Quick Measure Suggestions feature enables users to quickly create new measures using natural language prompts instead of writing formulas in DAX (Data Analysis Expressions).
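To illustrate the kind of translation involved (using a hypothetical Sales table and made-up column names, not an actual output from the feature), a prompt such as "total sales for the United States of America" corresponds to a DAX measure along the lines of:

US Sales = CALCULATE(SUM(Sales[Amount]), Sales[Country] = "United States of America")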

As Microsoft integrated OpenAI’s GPT models across its product suite, the Power BI team wanted to evaluate the usability of the Quick Measure Suggestions feature before launching to general availability.
Purpose of Study
Gather insights on the usability of Power BI’s Quick Measure Suggestions feature
01 Evaluate discoverability of the feature in workplace settings
02 Identify challenges in submitting a natural language prompt
03 Discover best practices to set user expectations for the feature
Power BI users with experience in DAX
A critical requirement was that users should have some experience with DAX (Data Analysis Expressions) in order for us to validate their expectations against the outcome.
Participant recruitment
In collaboration with the Power BI team, we recruited six participants who used Power BI every day for their personal or professional work.

The recruitment process, including screening, signing a Non-Disclosure Agreement with Microsoft, and scheduling, was done through the User Interviews platform. Furthermore, all our participants were based in the United States because of data regulations in specific geographic regions.
Prompting to Prompt - Designing tasks
Our tasks were designed to first give participants a feel for the dataset they would use during the study. This was followed by a check of the feature's discoverability, and then by calculations of varying difficulty in Power BI.
When writing our script, we had to iterate on our language several times because participants would reuse the exact words from our task descriptions in their prompts to the Quick Measure Suggestions feature.
Not really natural language
Typing the prompt in certain ways led to correct results, even though those phrasings were not always natural language.

Prompts have to be specific to the data; the feature is not intelligent enough to recognize values phrased naturally. For instance, the user has to enter "United States of America", not "U.S." or "United States".
Confusing interaction with the input box
There is a blue underline. But what does that mean?

The blue underline indicates a match with an existing field in the dataset, but participants interpreted it as other things, such as auto-suggestions or a filtering system for specific terms.
Expectations inspired by ChatGPT
Because this feature is launching after people have grown used to ChatGPT, participants expected more than just calculations.

The feature is limited to providing formulas, but participants wanted guidance toward their final outcome, which could be a graph or a refined calculation.
Interface shortcomings
Participants wanted to name the measure they were creating before adding the calculation as a card to their dashboards, but they could not find a way to do so. They missed the formula bar at the top left, where the measure name can actually be edited.

Some participants also expected a typo recognizer so they would not make mistakes while writing a natural language prompt.
Poor discoverability of additional suggestions
Five of six participants did not notice the variations of suggested measures shown below the first expanded suggestion card, even though they appeared multiple times throughout the test.

Furthermore, participants reported that the variations looked similar to each other and were hard to tell apart.
Unexpected output
Every output provides a "Preview value", which is not optimal for certain types of prompts and leads to confusion.

The preview is confusing for prompts with multiple variables, especially categorical or time/trend-related ones. The current design is best suited to displaying single, text-based answers.

The "Preview value" is an intermediate step, but participants expect the final output (an analysis or a visualization).
System Usability Scale
We used the System Usability Scale (SUS), a standard list of ten questions that participants rated as "Strongly Agree", "Agree", "Neutral", "Disagree", or "Strongly Disagree".
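For context, standard SUS scoring converts each of the ten responses to a 0–4 contribution (response minus 1 for odd-numbered items, 5 minus the response for even-numbered items), then multiplies the sum by 2.5 to give a score out of 100.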
60.7
SUS Score
AI Trust Score
Because this is a heavily AI-based feature, we also measured usability with Microsoft's AI Trust Score, which evaluates enterprise users' trust in an AI system through six questions.

The AI Trust Score for this project cannot be shared here.
What went well
The feature had high discoverability, and participants followed three distinct flows to reach the Quick Measure Suggestions natural language input box.

No participants had difficulty starting to write a natural language prompt. However, there were challenges in understanding the suggestions and other interface elements once they started typing.

Furthermore, all participants, especially those newer to or less familiar with DAX, were highly positive about the feature's potential, and everyone said they would be willing to use it in their everyday work.