The satisfaction gap
What drives platform satisfaction and what gets in the way
Role: Lead UX Researcher
Scope: Generative research, mixed-methods survey design, AI-assisted qualitative analysis, research dissemination
THE PROBLEM
By the time we launched this study, HarvardSites Drupal had been Harvard's primary web publishing platform for over a year.
The first year had been consumed by migrating thousands of sites from OpenScholar, Harvard’s legacy web publishing platform, to HarvardSites Drupal, while simultaneously building a design system and component library. Getting the platform stable and the sites transferred came first. A systematic look at the site builder experience was the next priority.
HarvardSites serves a wide range of people: staff, faculty, and students across every school and administrative unit, managing everything from small faculty profiles to flagship departmental sites. Most aren’t professional web developers. A site is one piece of a much larger job, and the platform’s usability has real downstream effects on how much time people spend on their sites, how confident they feel making changes, and whether they stay on HarvardSites or start looking for alternatives.
We needed to know where the platform stood, so we launched a survey to collect data that would actually inform product decisions, not just surface a list of pain points and user delights.
THE CONSTRAINTS
There were no prior baselines to compare against and no established precedent for how findings from a study this large would be consumed or used to inform product decisions. The survey needed to capture both quantitative metrics we could track over time and rich qualitative data that would explain the numbers.
162 participants, 20+ questions, many of them open-ended. Manual thematic synthesis at that scale would take weeks. At the same time, I had real professional concerns about using AI for qualitative analysis. These were concerns I’d researched carefully and discussed with colleagues in the field across different industries. AI can surface patterns, but it can also introduce bias, flatten nuance, and misrepresent sentiment in ways that aren’t always obvious. Those risks need to be actively managed when AI is being used in UX research analysis.
Leadership was invested in the findings, which meant that the work had organizational weight and that there was genuine pressure to move quickly and efficiently. Using AI to assist with the analysis phase was identified as a path worth exploring. This meant figuring out how to do that in a way I could stand behind professionally.
DECISION POINTS
Decision #1: Design the survey to serve both measurement and meaning
What I noticed
We needed the study to do two things at once. The first was establishing baseline metrics: scores for editing confidence, design flexibility, overall satisfaction, and likelihood to recommend that could be tracked over time as the platform evolves. The second was understanding why those scores landed where they did. Numbers alone wouldn’t tell us where to focus product work.
The choice I made
We organized the survey around four research objectives: getting started experience, building and editing experience, design flexibility and content management, and overall satisfaction. Within each area, we paired Likert-scale rating questions with open-ended follow-ups asking respondents to describe what was easiest, most difficult, and what would make the biggest difference.
Why that choice was made
A study that only measured satisfaction scores would tell us where things stood but not what to do about it. And a purely qualitative study at this scale wouldn't give us anything trackable over time. Pairing the two meant the findings would provide a baseline we could return to, and enough context to make the numbers actionable.
Decision #2: Build a responsible framework for AI-assisted analysis
What I noticed
As the study moved toward the analysis phase, using AI to help with qualitative synthesis became part of the conversation. I’d done enough research on the topic to have a clear understanding of the tradeoffs. AI can accelerate pattern detection in a large dataset, but it can also miss context, misread sentiment, and produce outputs that sound confident without reflecting what participants actually said. On a dataset this size, those risks were a real concern that had to be taken seriously.
The question wasn’t whether AI could be useful. It was whether it could be used in a way that preserved the integrity of the research. I wasn’t willing to hand the analysis off to a model and trust the output. But I also recognized that building a thoughtful framework for how AI fits into qualitative synthesis was worth the investment, both for this study and for how the team might approach similar work going forward.
The choice I made
I spent two weeks engineering and testing a thematic synthesis prompt from scratch. This included researching how AI models handle qualitative data, running iterative trials, and building in safeguards around scoping, evidence thresholds, bias checks, and verbatim quote requirements. The prompt was designed to accelerate pattern detection, not to replace human judgment.
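Safeguards like evidence thresholds and verbatim quote requirements can also be checked mechanically after the model responds. The sketch below is an illustration only, not the prompt or tooling used in this study. The function name, the data shapes, the threshold of three responses, and the sample responses are all invented placeholders.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't fail the check."""
    return re.sub(r"\s+", " ", text).strip().lower()


def audit_theme(theme: dict, responses: dict[str, str], min_evidence: int = 3) -> dict:
    """Check one AI-proposed theme against the raw survey responses.

    A quote counts as verified only if its text appears verbatim (after
    normalization) in the response it is attributed to. The theme meets the
    evidence threshold only if enough distinct responses back it up.
    """
    verified, unverified = [], []
    for quote in theme["quotes"]:
        source = normalize(responses.get(quote["response_id"], ""))
        needle = normalize(quote["text"])
        if needle and needle in source:
            verified.append(quote)
        else:
            unverified.append(quote)

    supporting_ids = {q["response_id"] for q in verified}
    return {
        "theme": theme["name"],
        "verified_quotes": len(verified),
        "unverified_quotes": len(unverified),
        "meets_threshold": len(supporting_ids) >= min_evidence,
    }


# Placeholder data, invented for illustration (not study responses).
responses = {
    "r1": "I wish I had more control over the layout.",
    "r2": "Spacing options are too limited.",
    "r3": "Editing is easy once you learn it.",
}
theme = {
    "name": "Layout control",
    "quotes": [
        {"response_id": "r1", "text": "more control over the layout"},
        {"response_id": "r2", "text": "I can't change spacing"},  # paraphrase, not verbatim
    ],
}
result = audit_theme(theme, responses, min_evidence=2)
print(result)
```

A check like this only flags quotes that cannot be traced to a response. Whether a verified quote actually supports the theme is still a human call.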
The AI-generated themes became the starting point for analysis in Condens.io, our qualitative analysis and research repository platform. From there, my co-researcher and I worked through the survey responses directly, clustering highlights and applying codes and sub-level codes to surface patterns and sub-themes. That hands-on process is where we caught miscoded responses and instances where the AI hadn't accurately reflected what participants said.
We presented the findings alongside a full demo of the methodology, including an honest accounting of where the AI performed well (roughly 75% accuracy) and where human judgment was needed.
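The case study does not say how the accuracy figure was calculated. One plausible reading is simple agreement: the share of responses where the AI's first-pass theme matched the code the researchers assigned during manual review. The sketch below assumes that reading. The function name, response IDs, and theme labels are hypothetical.

```python
def first_pass_accuracy(
    ai_codes: dict[str, str], human_codes: dict[str, str]
) -> tuple[float, list[str]]:
    """Return the agreement rate and the response IDs where AI and human codes differ.

    Only responses coded by both the AI and the researchers are compared.
    """
    shared = sorted(ai_codes.keys() & human_codes.keys())
    if not shared:
        raise ValueError("no overlapping responses to compare")

    disagreements = [rid for rid in shared if ai_codes[rid] != human_codes[rid]]
    accuracy = (len(shared) - len(disagreements)) / len(shared)
    return accuracy, disagreements


# Placeholder codes, invented for illustration (not study data).
ai_codes = {
    "r1": "design flexibility",
    "r2": "nested workflows",
    "r3": "image control",
    "r4": "design flexibility",
    "r5": "reliability",
}
human_codes = {
    "r1": "design flexibility",
    "r2": "nested workflows",
    "r3": "image control",
    "r4": "global content editing",
    "r5": "reliability",
}
accuracy, disagreements = first_pass_accuracy(ai_codes, human_codes)
print(f"agreement: {accuracy:.0%}, review these: {disagreements}")
```

Returning the disagreeing IDs alongside the rate keeps the focus on the audit itself: those are the responses a researcher would reread.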
Why that choice was made
The prompt engineering was a necessary step that made using AI defensible at all. Without the constraints built into the prompt and the audit step that followed, the analysis would have carried risks I wasn’t comfortable with on a high-visibility and high-impact study. In this project, the AI functioned as a tool for accelerating first-pass pattern detection, with human judgment as the final filter.
This framework saved an estimated two weeks on the analysis phase. It also became something the team could learn from. The demo my colleague and I gave was both a retrospective on the study and a foundation for how AI-assisted qualitative analysis could be approached responsibly in future work.
A snippet from the prompt engineered for AI-assisted first-pass thematic analysis.
Decision #3: Organize the findings around drivers, not a list of pain points
What I noticed
The data showed a clear gap between capability and experience. Site builders rated their editing confidence fairly high at 3.79 out of 5, but overall satisfaction was lower at 3.53, and perceived design flexibility lower still at 2.58. Users felt capable of making edits without breaking things. The platform just didn't give them enough room to do what they actually wanted.
The OpenScholar comparison also touched every area. 84% of participants were managing migrated sites, meaning their baseline for "normal" was a different system with different capabilities. That context needed to be part of how the findings were read.
The choice I made
I organized the report around the drivers of satisfaction rather than a flat list of issues. The statistical relationship between perceived design flexibility and overall satisfaction became the anchor: flexibility ratings accounted for roughly 34% of the variation in satisfaction scores. From there, the qualitative findings gave context to what flexibility meant in practice. Layout control, component variety, spacing options, and color options were among the most desired areas of design flexibility.
The migration context was surfaced explicitly, with the data showing clearly that hands-on support during transition was the strongest predictor of positive outcomes, not the platform itself.
Why that choice was made
Structuring the findings this way gave the team something to act on. Rather than a list of complaints with no direction on importance, our report led with what had the most measurable impact on how users felt about the platform. That made it easier for the team to prioritize, and for leadership to understand what kind of investment would move the needle on satisfaction over time.
A section of the affinity map board in Condens used to cluster survey responses.
OUTCOMES
This study produced the first comprehensive baseline for the HarvardSites Drupal site builder experience.
162 site builders surveyed across Harvard's schools and administrative units
Baseline metrics established for editing confidence (3.79/5), perceived design flexibility (2.58/5), overall satisfaction (3.53/5), and likelihood to recommend (3.45/5)
Perceived design flexibility identified as the primary driver of satisfaction, explaining ~34% of satisfaction variance
Key friction patterns documented, including nested workflows, lack of global content editing, image control limitations, and reliability gaps
Hands-on migration support identified as the strongest predictor of positive transition outcomes
AI-assisted thematic synthesis framework piloted and shared with the team as a repeatable methodology for future qualitative work
Interested in the methodology? The thematic synthesis prompt is available to download in markdown.
REFLECTION
Building the thematic synthesis prompt was a design problem in its own right: two weeks of research, iteration, and trial and error. If I were to do it again, I'd document the methodology as it developed rather than reconstructing it afterward. Earlier documentation would have made the team presentation stronger and the framework easier to hand off.
On the research side, the thing I keep coming back to is how much the OpenScholar comparison shapes how site builders experience HarvardSites. Users aren’t evaluating the platform in isolation, but against workflows they knew and capabilities that are now gone. That context matters for how satisfaction data is interpreted, and it's something worth tracking deliberately as the legacy comparison fades over time. A 3.53 today means something different than a 3.53 in two years.