We Built an AI Product. The Hard Part Wasn't the AI.
Everyone's obsessed with AI models. After building a 300-tool data directory, we learned the model is 10% of the work. The other 90% is data collection, validation, and quality — and that's the real moat.
Egor Burlakov
6 min read
Everyone's talking about AI models. Which LLM is fastest, which agent framework is best, which prompt technique unlocks 10x productivity. After spending months building modern-datatools.com — a directory that tracks 300 data tools across 11 categories with reviews, pricing, comparisons, and market landscapes — we learned something that doesn't make for a sexy headline:
The model is maybe 10% of the work. The other 90% is data.
The rebuild that changed everything
Our first version was fast to build. We scraped some websites, fed the results to an LLM, and generated tool pages. It looked impressive for about a week — until we started finding hallucinated pricing tiers, outdated feature lists, and comparisons that contradicted each other across pages.
So we rebuilt from scratch. Not the AI part — the data part.
We built a pipeline that scrapes 18 distinct sources, including GitHub, PyPI, DockerHub, Google Trends, Product Hunt, Wikipedia, TrustRadius, HackerNews, StackOverflow, the AWS CloudFormation Registry, npm, and vendor websites (pricing pages, feature pages, and general content). Each tool is cross-referenced against an average of 7.7 sources. In total, we maintain 2,300+ source records feeding into 300 tool profiles.
Only after that data is collected, validated, and scored does AI touch it. The difference was night and day.
Why data collection is the hardest problem
When people say "just collect the data," they underestimate four things:
Scraping is fragile. We maintain scrapers for 18 sources. GitHub changes their API rate limits. PyPI restructures their JSON endpoint. A vendor redesigns their pricing page and our parser returns garbage. A scraper that works today breaks tomorrow. You need retry logic with exponential backoff, fallback sources, and monitoring — not a one-time script. Our pipeline has dedicated error handling for each source, and we still spend time fixing broken scrapers.
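To make that concrete, here's a stripped-down sketch of the kind of retry logic we mean. This isn't our pipeline code; the function name and the GitHub endpoint are purely illustrative:

```python
import random
import time

import requests


def fetch_with_backoff(url: str, max_retries: int = 5, base_delay: float = 1.0) -> dict:
    """Fetch a JSON endpoint, retrying transient failures with exponential backoff."""
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, timeout=10)
            if resp.status_code == 429:  # rate limited: treat like any transient failure
                raise requests.HTTPError("rate limited", response=resp)
            resp.raise_for_status()
            return resp.json()
        except (requests.HTTPError, requests.ConnectionError, requests.Timeout):
            if attempt == max_retries - 1:
                raise  # give up and let monitoring flag the broken source
            # exponential backoff with jitter: roughly 1s, 2s, 4s, 8s between attempts
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 1))


# One of many per-source calls, e.g. pulling star counts from the GitHub API
repo = fetch_with_backoff("https://api.github.com/repos/airbytehq/airbyte")
print(repo["stargazers_count"])
```

Fallback sources and alerting sit on top of this; the retry loop itself is the easy part.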
Validation must be automated — you can't scale manual checks. We track 300 tools, each with structured pricing details, features, metrics, and relationships. That's thousands of data points changing constantly. Manual verification is impossible at this scale.
So we built automated quality checkers that score every page from 0 to 100. Each checker runs deterministic tests: does the review mention the tool's actual pricing model? Does the comparison reference real features from both tools? Is the pricing page consistent with what the scraper found on the vendor's site? Right now, 1,529 pages are scored, with an average quality score of 95.9/100 and no page below 90. Nothing gets published below that threshold.
This is the same principle behind data pipeline testing — if you don't test your data automatically, you're shipping bugs to production. (You can see our full quality methodology on our methodology page.)
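For a sense of what "deterministic tests" means in practice, here's a toy version of a page checker. The field names, point deductions, and threshold are illustrative, not our actual scoring rules:

```python
from dataclasses import dataclass, field

PUBLISH_THRESHOLD = 90  # pages scoring below this never go live


@dataclass
class QualityReport:
    score: int = 100
    failures: list[str] = field(default_factory=list)


def check_review_page(page: dict, tool: dict) -> QualityReport:
    """Run deterministic checks on a generated page; every failure costs points."""
    report = QualityReport()

    # Does the review mention the tool's actual pricing model?
    if tool["pricing_model"].lower() not in page["body"].lower():
        report.score -= 20
        report.failures.append("pricing model not mentioned")

    # Does every claimed feature exist in the scraped feature list?
    scraped_features = {f.lower() for f in tool["features"]}
    for feature in page["claimed_features"]:
        if feature.lower() not in scraped_features:
            report.score -= 10
            report.failures.append(f"unverified feature: {feature}")

    # Is the quoted price consistent with what the scraper found on the vendor's site?
    if page.get("starting_price") != tool.get("starting_price"):
        report.score -= 25
        report.failures.append("price disagrees with scraped pricing page")

    return report


tool = {"pricing_model": "usage-based", "features": ["CDC replication", "dbt integration"],
        "starting_price": "$2.50/credit"}
page = {"body": "Pricing is usage-based, starting at $2.50 per credit.",
        "claimed_features": ["dbt integration"], "starting_price": "$2.50/credit"}

report = check_review_page(page, tool)
if report.score < PUBLISH_THRESHOLD:
    raise SystemExit(f"blocked from publishing: {report.failures}")
```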
Relationships are where value lives. Knowing that "Airbyte exists" is worthless. Knowing that it's an open-source alternative to Fivetran, integrates with dbt, costs 80% less, and has 15,000 GitHub stars — that's the moat. We maintain alternative relationships (presented on our alternative pages) and integration relationships (used in our recommendation wizard). Raw data is a commodity. Curated, connected data is the defensible asset.
Freshness erodes trust. A pricing page that shows last year's numbers is worse than no pricing page at all. We run weekly scraping cycles across all 18 sources and freshness checks that flag stale data before users see it. When ClickHouse changes their cloud pricing or Databricks adds a new SKU, our pipeline catches it within a week.
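A freshness check is almost boring in its simplicity; the hard part is wiring it into the weekly cycle. A sketch, with made-up freshness budgets and field names:

```python
from datetime import datetime, timedelta, timezone

# Illustrative freshness budgets; real budgets vary by source.
MAX_AGE = {
    "vendor_pricing": timedelta(days=7),
    "github": timedelta(days=7),
    "google_trends": timedelta(days=14),
}


def stale_records(source_records: list[dict]) -> list[dict]:
    """Flag records whose last successful scrape is older than the source's budget."""
    now = datetime.now(timezone.utc)
    return [
        r for r in source_records
        if now - r["last_scraped_at"] > MAX_AGE.get(r["source"], timedelta(days=30))
    ]


records = [{"tool": "ClickHouse", "source": "vendor_pricing",
            "last_scraped_at": datetime(2024, 1, 1, tzinfo=timezone.utc)}]
for r in stale_records(records):
    print(f"STALE: {r['tool']} / {r['source']}: rescrape before next publish")
```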
Collect everything. Seriously.
Here's something we wish we'd internalized earlier: collect more data than you think you need.
Ten years ago we called it "big data" and it required specialized infrastructure. Today, with AI as your processing engine, the calculus has flipped. Storage is cheap. Computation is cheap. The expensive thing is not having the data when you realize you need it.
We started by tracking just names and descriptions. Then we added GitHub stars, PyPI weekly downloads, Docker pulls, Google Trends interest scores, Product Hunt upvotes, pricing tiers with structured JSON, feature matrices, user reviews from TrustRadius, Wikipedia summaries, and StackOverflow tag activity. Every new data source we added made the AI-generated content measurably better — not because the model improved, but because it had more ground truth to work with.
Here's a concrete example: when we added Google Trends data, our market landscape pages went from generic popularity claims to actual trend-backed rankings. When we added DockerHub pull counts alongside GitHub stars, we could distinguish tools that are talked about from tools that are actually used in production. Each signal alone is noisy. Combined, they tell a story no single source can.
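One simple way to combine those signals is to log-compress each one and weight them, so heavy-tailed metrics like star counts don't drown out everything else. This is a sketch of the idea, not our actual formula; the weights and numbers are illustrative:

```python
import math

# Illustrative weights; in practice these get tuned per category.
WEIGHTS = {
    "github_stars": 0.3,
    "pypi_weekly_downloads": 0.3,
    "docker_pulls": 0.2,
    "trends_interest": 0.2,
}


def adoption_score(signals: dict[str, float]) -> float:
    """Blend noisy popularity signals on a log scale into a single rough score."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        # +1 so missing or zero signals contribute nothing instead of blowing up
        total += weight * math.log10(signals.get(name, 0) + 1)
    return round(total * 20, 1)  # crude rescaling into a roughly 0-100 range


talked_about = {"github_stars": 15_000, "pypi_weekly_downloads": 5_000,
                "docker_pulls": 2_000, "trends_interest": 60}
used_in_prod = {"github_stars": 4_000, "pypi_weekly_downloads": 900_000,
                "docker_pulls": 3_000_000, "trends_interest": 25}
print(adoption_score(talked_about), adoption_score(used_in_prod))
```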
Where AI helps (and where it doesn't)
To be clear — AI is essential in this pipeline. It's excellent at:
Extracting structure from chaos: turning a messy vendor pricing page into clean JSON with tiers, limits, and per-unit costs (see the sketch after this list)
Generating content from data: writing a 2,000-word comparison from structured data points about two tools — their features, pricing, GitHub activity, and user sentiment (though developing reliable, efficient prompts is substantial work in its own right; we'll cover that in a future post)
Catching anomalies: flagging when a tool's pricing suddenly jumps 10x between scraping cycles, or when a "free" tool suddenly shows enterprise-only pricing
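Here's roughly what the first point looks like in code. The prompt, model choice, and schema are illustrative; any LLM client with JSON output would work the same way:

```python
import json

from openai import OpenAI  # shown for illustration; swap in your preferred client

client = OpenAI()

EXTRACTION_PROMPT = (
    "Extract pricing from the page below as JSON shaped like "
    '{"tiers": [{"name": string, "monthly_price_usd": number or null, "limits": string}]}. '
    "Use null for anything not explicitly stated on the page. Do not guess.\n\nPAGE:\n"
)
REQUIRED_TIER_KEYS = {"name", "monthly_price_usd", "limits"}


def extract_pricing(page_text: str) -> dict:
    """Turn a scraped pricing page into structured JSON, then validate the structure."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{"role": "user", "content": EXTRACTION_PROMPT + page_text}],
        response_format={"type": "json_object"},
    )
    data = json.loads(response.choices[0].message.content)

    # Deterministic validation: structurally broken extractions never reach a page.
    if not isinstance(data.get("tiers"), list):
        raise ValueError("extraction missing 'tiers' list")
    for tier in data["tiers"]:
        missing = REQUIRED_TIER_KEYS - tier.keys()
        if missing:
            raise ValueError(f"tier missing keys: {missing}")
    return data
```

Note that JSON mode guarantees syntactically valid JSON, not truthful JSON, which is exactly why the validation here and the quality gates below still matter.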
But AI is bad at knowing when it's wrong. An LLM will confidently generate a pricing table with plausible-looking numbers that are completely fabricated. That's why you need automated quality gates around the AI, not instead of it. Golden test sets that verify extractor output against known-good data. Cross-source checks that flag when GitHub says a repo has 5K stars but our database says 50K. Deterministic quality scores that catch structural problems no LLM would notice.
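The cross-source check is the least glamorous and most effective piece of that. A minimal version, with a made-up disagreement threshold:

```python
def cross_source_flags(tool: str, star_counts: dict[str, int],
                       max_ratio: float = 3.0) -> list[str]:
    """Flag a metric when independent sources disagree by more than max_ratio."""
    flags = []
    sources = list(star_counts.items())
    for i, (src_a, val_a) in enumerate(sources):
        for src_b, val_b in sources[i + 1:]:
            low, high = sorted((val_a, val_b))
            if low == 0 or high / low > max_ratio:
                flags.append(f"{tool}: {src_a}={val_a} vs {src_b}={val_b}, hold for review")
    return flags


# GitHub reports 5K stars but our stored record says 50K: something is wrong, don't publish.
print(cross_source_flags("example-tool", {"github_api": 5_000, "our_database": 50_000}))
```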
Case in point: our recommendation wizard
We just shipped a recommendation engine that asks you 3–5 questions — what you're building, your team size, budget, cloud provider — and produces a personalized tool stack with an architecture diagram and data-backed explanations for every pick.
Building the UI took less than a day. The AI-powered scoring algorithm? A few hundred lines of code. But the feature only works because of what's behind it: 1,131 verified tool integrations ("Snowflake works natively with dbt, Fivetran, and Looker"), structured tags across 7 dimensions (deployment model, scale, cloud provider, language ecosystem), and live community signals that distinguish tools people talk about from tools people actually use.
Without that data layer, the wizard would just be a fancy form that outputs generic advice. With it, we can say things like: "We recommend Dagster for orchestration because it has native dbt integration, 11K GitHub stars with rising momentum, and fits your self-hosted + Python constraint — while Prefect is a strong runner-up with a more generous free cloud tier."
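To show the shape of the idea (not our actual algorithm), here's a minimal sketch: hard constraints filter candidates, then verified integrations and community signals rank the survivors. The tags, weights, and star counts below are illustrative:

```python
from dataclasses import dataclass


@dataclass
class Tool:
    name: str
    tags: set[str]           # deployment model, cloud, language ecosystem, ...
    integrations: set[str]   # verified integrations from the data layer
    github_stars: int


def recommend(candidates: list[Tool], required_tags: set[str],
              must_integrate_with: set[str]) -> list[tuple[str, float]]:
    """Filter on hard constraints, then rank survivors with data-backed signals."""
    ranked = []
    for tool in candidates:
        if not required_tags <= tool.tags:
            continue  # hard constraint, e.g. must fit self-hosted + Python
        hits = len(must_integrate_with & tool.integrations)
        # Verified integrations dominate; stars only break ties.
        score = hits * 10 + min(tool.github_stars / 1_000, 20)
        ranked.append((tool.name, round(score, 1)))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


orchestrators = [
    Tool("Dagster", {"self-hosted", "python"}, {"dbt", "snowflake"}, 11_000),
    Tool("Prefect", {"self-hosted", "python"}, {"snowflake"}, 16_000),
]
print(recommend(orchestrators, {"self-hosted", "python"}, {"dbt"}))
# -> [('Dagster', 21.0), ('Prefect', 16.0)]
```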
That's not AI being smart. That's curated data making AI useful.
The punchline
If you're building an AI-powered product, here's the uncomfortable truth: spend 80% of your energy on data infrastructure. Build the scrapers, the validators, the quality checkers, the freshness monitors. Make it automated because you can't scale manual verification.
The model will improve on its own — or you'll swap it for a better one next quarter. But your curated, validated, interconnected dataset is something competitors can't replicate overnight. That's the real moat.
Written by Egor Burlakov
Engineering and science leader with experience building scalable data infrastructure, data pipelines, and science applications. Sharing insights about data tools, architecture patterns, and best practices.