Find, curate & license ML datasets with Valyu

A comprehensive data infrastructure platform that simplifies your dataset licensing, provenance, and distribution process.

Retrieval Augmented Generation Capabilities

Enhance the accuracy and reliability of generative AI models by seeding them with diverse data sources.

Easy Access to Datasets for Deep Learning

A growing catalogue of general and domain-specific datasets for pretraining, finetuning, and RAG.

Commission Bespoke Training Datasets

Request and create custom-made datasets tailored to specific needs.


License your datasets or buy licensed training data, and confidently deploy your models to production without copyright issues.

Easily Verify Quality and Provenance

Comprehensive documentation of sources and key metrics to help you assess the dataset.

Trusted Data

Quality data with provenance

Performant AI models require quality datasets. A large part of ML is just data, so using quality data reduces the time it takes to train your models. Our datasets come with detailed assessments of quality and provenance, so you can make an informed decision about your training data.


Valyu connects to your ML workflow and apps with SDK tooling


Benchmark and perform valuations of datasets.


Synthesize new finetuning datasets or create new data products.


Import datasets directly into your workflows and notebooks and use them there.

Prompt Enrichment

Perform RAG with third-party licensed data and mitigate hallucinations in your apps.
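To make the idea concrete, here is a minimal sketch of prompt enrichment in plain Python. It does not use the Valyu SDK: every name in it (`Passage`, `retrieve`, `enrich_prompt`) is hypothetical, the retrieval is a simple keyword-overlap score rather than a production retriever, and the passages you feed it are assumed to be licensed data you already have access to.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    """Hypothetical record for one piece of licensed text."""
    text: str
    source: str   # where the passage came from (provenance)
    license: str  # licence under which it may be used


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of words, dropping basic punctuation."""
    words = (w.strip(".,?!").lower() for w in text.split())
    return {w for w in words if w}


def retrieve(query: str, corpus: list[Passage], k: int = 2) -> list[Passage]:
    """Return up to k passages that share the most words with the query."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(p.text)), p) for p in corpus]
    # Drop passages with no overlap, then rank by overlap (highest first).
    scored = [(score, p) for score, p in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:k]]


def enrich_prompt(query: str, corpus: list[Passage], k: int = 2) -> str:
    """Build a prompt that places the retrieved passages before the question."""
    passages = retrieve(query, corpus, k)
    context = "\n".join(
        f"[{i}] {p.text} (source: {p.source}, license: {p.license})"
        for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer using only the context below. "
        "If the context is not sufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
```

Keeping the source and licence next to each passage means the model's answer can be traced back to the data it was grounded on. In a real app, the keyword match would typically be replaced by an embedding-based search.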


Get answers to common questions

Frequently asked questions about Valyu. If you have any additional questions or feedback, please let us know.  

What is Valyu and how does it work?

Valyu is a data provenance and licensing platform that connects data providers with ML engineers looking for diverse, high-quality datasets for training models. We source datasets in partnership with data providers and supply them to data consumers through our tooling.

What are the key features of Valyu?

More than a data exchange, Valyu is a comprehensive data infrastructure to govern, secure, and ensure the quality of dataset assets for training and knowledge (RAG) tasks. The platform lets you enforce robust privacy controls, apply simple licensing, and attach detailed data cards for provenance.

The platform also has a growing set of SDK tools to benchmark, refine and synthesize datasets, create data cards and manage provenance, which integrate directly into your ML workflows, apps, and pipelines.

You can create your own quality datasets derived from existing data (data synthesis) to improve model accuracy and context. Integrate first- and third-party data to reduce hallucinations and boost application performance.

Ideal for RAG, LLMs, and chatbots, Valyu provides the quality datasets needed for prompt augmentation and reliable AI results.

I'm a data provider, how can I add my datasets to the platform?

You can have a chat with us and be part of our beta. We will soon be opening up the platform so that any provider can self-serve. In the meantime, send us a message using the Contact form. We're looking forward to potentially partnering with you!

What kind of training datasets do you have?

We have a diverse range of datasets across multiple domains, including healthcare, manufacturing, and publishing. Modalities include text, image, audio, and video. Feel free to reach out to us for more information.

I'm an AI company looking for a specific dataset, what should I do?

You can commission bespoke datasets for your specific needs through our platform. Simply let us know your requirements and we can help you commission the dataset.

What are the costs associated with using Valyu?

There are no hidden costs to distribute your content or datasets on Valyu. We charge a small fee upon a successful transaction to cover the operation and maintenance of the platform.