Tested chunking + embeddings data from 3 production websites. [P]

By AI Maestro May 23, 2026 1 min read

A recent study tested a retrieval-augmented generation (RAG) pipeline, specifically its chunking and embedding stages, on data from three production websites: Intercom, HubSpot, and KPMG. The results show that content density varies widely across these corpora.

  • The Intercom corpus yielded the most high-quality chunks (96), nearly half of them from help-center articles. HubSpot’s high-quality chunks, by contrast, were mostly case studies tied to specific business outcomes such as revenue growth.
  • KPMG’s corpus contained very little substantive content, because most of its material focuses on positioning and marketing. Many of its chunks were classified as REJECTED because they were navigational or legal content.
  • Semantic relevance tests showed that retrieval could still route queries correctly even on a thin corpus like KPMG’s. For example, queries about family business succession and ESG issues mapped accurately to the relevant pages on the company’s site.
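The routing behaviour described in the last bullet can be sketched as a nearest-neighbour lookup over page embeddings. This is a minimal illustration and not the study's actual method: the page names, the three-dimensional vectors, and the `route` helper are all hypothetical, and a real pipeline would use vectors produced by an embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def route(query_vec, page_vecs):
    """Return the name of the page whose embedding is closest to the query."""
    return max(page_vecs, key=lambda name: cosine(query_vec, page_vecs[name]))


# Toy embeddings (assumption: invented for illustration only).
pages = {
    "family-business": [0.9, 0.1, 0.0],
    "esg": [0.1, 0.9, 0.1],
    "legal-notice": [0.0, 0.1, 0.9],
}

# A query vector that leans toward the first dimension.
query = [0.8, 0.2, 0.1]
best_page = route(query, pages)
print(best_page)
```

Because routing only needs the best-matching page, it can succeed even when few of a site's chunks are substantive, which is consistent with the KPMG observation.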

This research underscores the importance of understanding the quality and nature of each corpus before deploying models in real-world applications. The key takeaway is a metric called ‘yield score’, the proportion of high-quality chunks relative to total chunks, which can help predict which brands might require more careful handling and alternative phrasing.
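As a rough sketch of the definition above, yield score can be computed as the share of chunks labelled high quality. The `HIGH_QUALITY` label name and the sample labels are assumptions made for illustration; only the REJECTED category appears in the study summary, and the numbers below are not results from it.

```python
from collections import Counter


def yield_score(labels, good_label="HIGH_QUALITY"):
    """Proportion of chunks carrying the high-quality label.

    `good_label` is a hypothetical label name; returns 0.0 for an empty corpus.
    """
    if not labels:
        return 0.0
    return Counter(labels)[good_label] / len(labels)


# Illustrative labels only (assumption: not data from the study).
sample_labels = ["HIGH_QUALITY"] * 4 + ["REJECTED"] * 6
print(yield_score(sample_labels))  # 0.4
```

A low score under this definition would flag a corpus like the marketing-heavy one described earlier, where most chunks are navigational or legal.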

  1. The study highlights how much content quality varies across websites, particularly between those with substantive information and those focused on marketing or positioning.
  2. It introduces a new metric, ‘yield score’, for identifying which brands benefit most from more nuanced and context-sensitive language when using generative models.
  3. It points out a limitation of current benchmarks that assume uniform quality in training data: such assumptions may lead to suboptimal performance in real-world applications.

