playbook/antigravity-awesome-skills/skills/hugging-face-dataset-viewer/SKILL.md

132 lines
4.9 KiB
Markdown

---
source: "https://github.com/huggingface/skills/tree/main/skills/huggingface-datasets"
name: hugging-face-dataset-viewer
description: Query Hugging Face datasets through the Dataset Viewer API for splits, rows, search, filters, and parquet links.
risk: unknown
---
# Hugging Face Dataset Viewer
## When to Use
Use this skill when you need read-only exploration of a Hugging Face dataset through the Dataset Viewer API.
Use this skill to execute read-only Dataset Viewer API calls for dataset exploration and extraction.
## Core workflow
1. Optionally validate dataset availability with `/is-valid`.
2. Resolve `config` + `split` with `/splits`.
3. Preview with `/first-rows`.
4. Paginate content with `/rows` using `offset` and `length` (max 100).
5. Use `/search` for text matching and `/filter` for row predicates.
6. Retrieve parquet links via `/parquet` and totals/metadata via `/size` and `/statistics`.
## Defaults
- Base URL: `https://datasets-server.huggingface.co`
- Default API method: `GET`
- Query params should be URL-encoded.
- `offset` is 0-based.
- `length` max is usually `100` for row-like endpoints.
- Gated/private datasets require `Authorization: Bearer <HF_TOKEN>`.
## Dataset Viewer
- `Validate dataset`: `/is-valid?dataset=<namespace/repo>`
- `List subsets and splits`: `/splits?dataset=<namespace/repo>`
- `Preview first rows`: `/first-rows?dataset=<namespace/repo>&config=<config>&split=<split>`
- `Paginate rows`: `/rows?dataset=<namespace/repo>&config=<config>&split=<split>&offset=<int>&length=<int>`
- `Search text`: `/search?dataset=<namespace/repo>&config=<config>&split=<split>&query=<text>&offset=<int>&length=<int>`
- `Filter with predicates`: `/filter?dataset=<namespace/repo>&config=<config>&split=<split>&where=<predicate>&orderby=<sort>&offset=<int>&length=<int>`
- `List parquet shards`: `/parquet?dataset=<namespace/repo>`
- `Get size totals`: `/size?dataset=<namespace/repo>`
- `Get column statistics`: `/statistics?dataset=<namespace/repo>&config=<config>&split=<split>`
- `Get Croissant metadata (if available)`: `/croissant?dataset=<namespace/repo>`
Pagination pattern:
```bash
curl "https://datasets-server.huggingface.co/rows?dataset=stanfordnlp/imdb&config=plain_text&split=train&offset=0&length=100"
curl "https://datasets-server.huggingface.co/rows?dataset=stanfordnlp/imdb&config=plain_text&split=train&offset=100&length=100"
```
When pagination is partial, use response fields such as `num_rows_total`, `num_rows_per_page`, and `partial` to drive continuation logic.
Search/filter notes:
- `/search` matches string columns (full-text style behavior is internal to the API).
- `/filter` requires predicate syntax in `where` and optional sort in `orderby`.
- Keep filtering and searches read-only and side-effect free.
## Querying Datasets
Use `npx parquetlens` with Hub parquet alias paths for SQL querying.
Parquet alias shape:
```text
hf://datasets/<namespace>/<repo>@~parquet/<config>/<split>/<shard>.parquet
```
Derive `<config>`, `<split>`, and `<shard>` from Dataset Viewer `/parquet`:
```bash
curl -s "https://datasets-server.huggingface.co/parquet?dataset=cfahlgren1/hub-stats" \
| jq -r '.parquet_files[] | "hf://datasets/\(.dataset)@~parquet/\(.config)/\(.split)/\(.filename)"'
```
Run SQL query:
```bash
npx -y -p parquetlens -p @parquetlens/sql parquetlens \
"hf://datasets/<namespace>/<repo>@~parquet/<config>/<split>/<shard>.parquet" \
--sql "SELECT * FROM data LIMIT 20"
```
### SQL export
- CSV: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.csv' (FORMAT CSV, HEADER, DELIMITER ',')"`
- JSON: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.json' (FORMAT JSON)"`
- Parquet: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.parquet' (FORMAT PARQUET)"`
## Creating and Uploading Datasets
Use one of these flows depending on dependency constraints.
Zero local dependencies (Hub UI):
- Create dataset repo in browser: `https://huggingface.co/new-dataset`
- Upload parquet files in the repo "Files and versions" page.
- Verify shards appear in Dataset Viewer:
```bash
curl -s "https://datasets-server.huggingface.co/parquet?dataset=<namespace>/<repo>"
```
Low dependency CLI flow (`npx @huggingface/hub` / `hfjs`):
- Set auth token:
```bash
export HF_TOKEN=<your_hf_token>
```
- Upload parquet folder to a dataset repo (auto-creates repo if missing):
```bash
npx -y @huggingface/hub upload datasets/<namespace>/<repo> ./local/parquet-folder data
```
- Upload as private repo on creation:
```bash
npx -y @huggingface/hub upload datasets/<namespace>/<repo> ./local/parquet-folder data --private
```
After upload, call `/parquet` to discover `<config>/<split>/<shard>` values for querying with `@~parquet`.
## Limitations
- Use this skill only when the task clearly matches the scope described above.
- Do not treat the output as a substitute for environment-specific validation, testing, or expert review.
- Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.