Idea and goals
This is a playground / proof of concept to see what is possible with LLMs (text-to-text, text-to-image and image-to-text) in the context of news articles from around the world. At the core is a fairly simple pipeline:
- A crawler attempts to fetch news articles and images from various websites.
- Each article is summarized using an LLM (text-to-text).
- Articles are clustered based on word relevancy, and each cluster is then summarized from the combined summaries of its member articles.
- Images close to the cluster description are identified (image-to-text) and clustered as well. If no suitable image is available, an image prompt is generated (text-to-text) and a third-party service creates the image (text-to-image).
- Finally, the HTML is generated and deployed.
This is a work in progress for testing various algorithms and LLMs, and can therefore change at random.
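The staged flow above can be sketched as a sequence of transformations over a shared article shape. This is a minimal, hypothetical sketch, not the project's actual code: the type names, stub stages, and placeholder logic are all illustrative stand-ins for the five Node applications and their LLM calls.

```typescript
// Illustrative article record, loosely mirroring the per-app JSON data stores.
interface Article {
  url: string;
  text: string;
  summary?: string;
  cluster?: number;
}

// Each pipeline stage takes the previous stage's output, as the five
// sequential node apps do via their JSON files.
type Stage = (articles: Article[]) => Article[];

// Stub stages standing in for summarize -> cluster (the real versions
// would call an LLM and run HDBSCAN respectively).
const summarize: Stage = (articles) =>
  articles.map((a) => ({ ...a, summary: a.text.slice(0, 40) }));
const cluster: Stage = (articles) =>
  articles.map((a, i) => ({ ...a, cluster: i % 2 }));

// Run the stages in order, like the apps running in sequence.
function runPipeline(articles: Article[], stages: Stage[]): Article[] {
  return stages.reduce((acc, stage) => stage(acc), articles);
}
```

The reduce over stages keeps each step independent, which matches the stated design choice of separate applications with their own data stores.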
Tools in use
This is being created using the following tools:
- Development environment: five separate Node applications (TypeScript) that run in sequence, each with its own data store (JSON files). This is done for simplicity. Most of these were developed using vibe coding for speed.
- Crawler: primarily Cheerio and Puppeteer, with Playwright also being tested (all open-source Node modules).
- Clustering: embedding-based clustering using the HDBSCAN algorithm (density-based) over local embeddings from Ollama (embeddinggemma model), with the min-samples parameter set to 2 for the desired cluster granularity.
- LLM access: OpenRouter for all text-to-text and image-to-text (Llama 4 Maverick); for text-to-image, Black Forest Labs' Flux via their API (not free!).
- Hosting: Firebase Hosting (Google Cloud).
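To make the clustering step above concrete, here is a simplified, hypothetical stand-in for it: instead of real HDBSCAN, it groups embedding vectors by connected components over a cosine-similarity threshold, and marks groups smaller than `minSamples` as noise (`-1`), mirroring HDBSCAN's min-samples behaviour in spirit. Function names and the threshold value are illustrative assumptions, not part of the project.

```typescript
// Cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Simplified density-style clustering (NOT real HDBSCAN): vectors whose
// pairwise similarity meets `threshold` join one component via union-find;
// components smaller than `minSamples` are labeled noise (-1).
function clusterEmbeddings(
  vectors: number[][],
  minSamples = 2,
  threshold = 0.8
): number[] {
  const n = vectors.length;
  const parent = Array.from({ length: n }, (_, i) => i);
  const find = (x: number): number =>
    parent[x] === x ? x : (parent[x] = find(parent[x]));
  for (let i = 0; i < n; i++)
    for (let j = i + 1; j < n; j++)
      if (cosine(vectors[i], vectors[j]) >= threshold) parent[find(i)] = find(j);
  // Count component sizes, then relabel: clusters 0..k-1, noise -1.
  const sizes = new Map<number, number>();
  for (let i = 0; i < n; i++) sizes.set(find(i), (sizes.get(find(i)) ?? 0) + 1);
  const labelOf = new Map<number, number>();
  let next = 0;
  return vectors.map((_, i) => {
    const root = find(i);
    if ((sizes.get(root) ?? 0) < minSamples) return -1;
    if (!labelOf.has(root)) labelOf.set(root, next++);
    return labelOf.get(root)!;
  });
}
```

In the real pipeline the input vectors would come from the Ollama embeddinggemma model and the grouping from HDBSCAN; this sketch only shows why min-samples = 2 matters: a single article with no close neighbour is treated as noise rather than forming its own cluster.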
Who is behind this
This is an example project by Vanguard Signals that uses vector databases and large language models (LLMs) to process and generate content.