Wanted to share a personal project I've been working on: I love classic literature, and I'm also taking an AI class here at UIUC, so I trained an LLM from scratch on Victorian lit from 1837 to 1899. This is Mr. Chatterbox, the Victorian Gentleman Chatbot: huggingface.co/spaces/tvent...
Type your questions or conversation starters into the chat box and receive replies written in authentic 19th‑century style. The bot replies with text based only on Victorian literature, letting you...
Segment any object in an image dataset with a text prompt — one command.
uv run segment-objects.py data output --class-name deer
Pixel-level masks via SAM3. Perfect for agents building their own training data.
Runs on @hf.co Jobs.
huggingface.co/datasets/uv-...
huggingface.co/spaces/davan...
One of the nicest things about Nvidia model releases is that they ship the training data.
What does it look like? I sampled 250k examples from 24 datasets in the Nemotron post-training v3 collection and built an interactive Embedding Atlas to explore it.
The new @hf.co storage Buckets open up the Hub beyond models and datasets.
Example: IIIF image hosting.
With Buckets, just upload static tiles and any IIIF viewer zooms straight from CDN!
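For anyone wondering what "static tiles" means here: a minimal sketch, assuming the IIIF Image API 3.0 URL layout (`{region}/{size}/{rotation}/{quality}.{format}`) and canonical `w,h` sizes. The function name `iiif_tile_paths`, the 512 px tile size and the scale factors are hypothetical choices for illustration. This is not the UV script linked in the thread, only the arithmetic a viewer does when it asks for a tile.

```python
from math import ceil


def iiif_tile_paths(width, height, tile_size=512, scale_factors=(1, 2, 4)):
    """List relative IIIF Image API 3.0 paths for every static tile.

    Assumption: tiles are square, rotation is 0, and files are named
    default.jpg. A static host only has to serve these exact paths.
    """
    paths = []
    for s in scale_factors:
        # Full-resolution pixels covered by one tile at this scale factor.
        span = tile_size * s
        for y in range(0, height, span):
            for x in range(0, width, span):
                # Clip edge tiles to the image bounds.
                w = min(span, width - x)
                h = min(span, height - y)
                whole_image = (x, y, w, h) == (0, 0, width, height)
                # Canonical form uses "full" for the whole image and
                # "max" when that whole image is served at full size.
                region = "full" if whole_image else f"{x},{y},{w},{h}"
                if whole_image and s == 1:
                    size = "max"
                else:
                    size = f"{ceil(w / s)},{ceil(h / s)}"
                paths.append(f"{region}/{size}/0/default.jpg")
    return paths


if __name__ == "__main__":
    for path in iiif_tile_paths(1000, 800, scale_factors=(1, 2)):
        print(path)
```

Each path is a file the viewer can fetch directly, which is why a bucket behind a CDN is enough and no image server is needed.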
Is olmOCR-bench getting close to saturation? Top score is now 85.9%.
Yesterday, Datalab took #1 with chandra-ocr-2. A year ago, the best was 79.
Visualised the race to get there using @hf.co leaderboard data
You never know what data will be used for!
I uploaded a @britishlibrary.bsky.social dataset to Hugging Face in 2022. IIRC one of my first PRs to a HF repo!
4 years later, someone trains a Victorian chatbot on it
More libraries should be sharing their public domain collections for AI to build on!
IIIF manifest (interoperable with any IIIF viewer): huggingface.co/buckets/dava...
UV script to generate tiles from your own images: huggingface.co/datasets/uv-...
Very cool work! Happy to see this dataset continue to be used!
cc @iiif.bsky.social :)
Want to talk to the past? Here's an LLM "trained entirely from scratch on a corpus of over 28,000 Victorian-era British texts published between 1837 & 1899, drawn from a dataset made available by the British Library"
Quite different from an LLM roleplaying a Victorian. huggingface.co/spaces/tvent...
Ethan Mollick
This app lets you upload your vector embeddings (e.g., CSV or JSON files) and instantly creates an interactive 2‑D/3‑D plot where similar items cluster together. You can explore the layout, hover o...