Methodology

How this data was gathered, cleaned, weighted, and presented.

Data Collection

Yoot is a WhatsApp-based civic polling infrastructure operated by Youth Ki Awaaz. Questions are deployed through the WhatsApp Business API to a growing panel that currently stands at 12,000+ registered participants across India. Over 500 days of polling, the panel has collectively generated 2,00,000+ responses across 500+ questions.

Participants join voluntarily through organic recruitment across Youth Ki Awaaz's platforms. Each poll is sent as a WhatsApp message; respondents tap to answer. Not all participants answer every question. Response counts vary per question, depending on topic, timing, and community engagement.

Cleaning & Validation

Raw response data exported from BigQuery undergoes a multi-step cleaning pipeline:

Removal of duplicate responses per WhatsApp ID per question
Filtering of bot-like or spam responses
Separation of free-text responses from categorical responses for analysis
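
As a rough illustration of the de-duplication and separation steps, here is a minimal Python sketch. The field names (whatsapp_id, question_id, answer), the keep-the-first-response rule, and the options_by_question lookup are assumptions made for the example, not a description of Yoot's actual export schema or pipeline. Bot and spam filtering is left out because the criteria are not described here.

```python
def deduplicate(responses):
    """Keep one response per (whatsapp_id, question_id) pair.

    Assumption: the first response seen is the one kept.
    """
    seen = set()
    kept = []
    for row in responses:
        key = (row["whatsapp_id"], row["question_id"])
        if key in seen:
            continue  # later duplicate for the same person and question
        seen.add(key)
        kept.append(row)
    return kept


def split_by_type(responses, options_by_question):
    """Separate answers that match a known option from free-text answers.

    options_by_question is a hypothetical mapping:
    question_id -> set of categorical option labels.
    """
    categorical, free_text = [], []
    for row in responses:
        options = options_by_question.get(row["question_id"], set())
        if row["answer"] in options:
            categorical.append(row)
        else:
            free_text.append(row)
    return categorical, free_text
```

In this sketch, a question with no entry in options_by_question has all of its answers treated as free text.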

For this archive, only questions with identifiable categorical response options are included. Free-text and open-ended questions are excluded from the scatter visualisation.

Weighting

Yoot applies raking-based post-stratification to adjust for demographic imbalances in the panel relative to India's young population. Weights are calculated at the question level, accounting for differential non-response across questions.

Population benchmarks are drawn from Census 2011 projections and NFHS-5 data. The raking algorithm iteratively adjusts weights across a three-way State × Gender × Age cross-classification until convergence (tolerance: 1e-6, max 50 iterations). Weights are normalised to a mean of 1 at each iteration.
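
The raking loop can be sketched roughly as follows. This is an illustrative implementation of standard iterative proportional fitting, not Yoot's production code: it assumes each dimension (for example state, gender, age) is adjusted against its own set of target shares, and the data structures and the function name rake are hypothetical. The tolerance, iteration cap, and mean-1 normalisation follow the values stated above.

```python
def rake(respondents, targets, tol=1e-6, max_iter=50):
    """Return one weight per respondent.

    respondents: list of dicts, one key per raking dimension.
    targets: {dimension: {category: population share}}, shares summing to 1.
    Every category present in respondents must appear in targets.
    """
    n = len(respondents)
    weights = [1.0] * n
    for _ in range(max_iter):
        previous = list(weights)
        for dim, shares in targets.items():
            # Current weighted total for each category of this dimension.
            totals = {}
            for w, r in zip(weights, respondents):
                totals[r[dim]] = totals.get(r[dim], 0.0) + w
            grand = sum(totals.values())
            # Rescale so each category's weighted share matches its target.
            weights = [
                w * shares[r[dim]] * grand / totals[r[dim]]
                for w, r in zip(weights, respondents)
            ]
        # Normalise to a mean weight of 1 at each iteration.
        mean = sum(weights) / n
        weights = [w / mean for w in weights]
        # Stop once no weight moved by more than the tolerance.
        if max(abs(a - b) for a, b in zip(weights, previous)) < tol:
            break
    return weights
```

Because weights are computed at the question level, a function like this would be called once per question, on only the respondents who answered it.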

Design effects of 1.5 to 2.0 are typical for weighted digital surveys of this kind. Margins of error are reported as Wilson score intervals at 95% confidence, computed on effective sample sizes that account for these design effects.
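
To make the margin-of-error calculation concrete, here is a minimal sketch of a Wilson score interval computed on an effective sample size. The Kish approximation used for the effective sample size is an assumption (the exact formula Yoot uses is not stated here), and the function names are illustrative.

```python
import math


def effective_sample_size(weights):
    """Kish approximation (an assumption): (sum of w)^2 / sum of w^2."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


def wilson_interval(p_hat, n_eff, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives about 95% confidence."""
    z2 = z * z
    denom = 1 + z2 / n_eff
    centre = (p_hat + z2 / (2 * n_eff)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n_eff + z2 / (4 * n_eff * n_eff)) / denom
    return centre - half, centre + half
```

For example, a question with 600 responses and a design effect of 1.5 has an effective sample size of 600 / 1.5 = 400, so a weighted share of 50% would be reported with wilson_interval(0.5, 400), an interval of roughly plus or minus 5 percentage points.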

Note: The visualisation on the Questions tab shows unweighted response counts and percentages. Weighted analysis is applied in Yoot's thematic reports and partner deliverables.

Clustering & Visualisation

To map 500+ questions into a navigable visual space, we used the following pipeline:

Embedding: Each question text was encoded using the all-MiniLM-L6-v2 sentence transformer model (384-dimensional embeddings), which captures semantic similarity between questions regardless of surface wording.
Clustering: K-Means clustering (k=18, 30 initialisations) was applied to the embeddings. Cluster names were assigned through manual inspection of member questions. Some clusters are tighter (e.g., Exam Culture, 10 questions) and some broader (e.g., Civic & Political Life, 53 questions).
Dimensionality reduction: UMAP (n_neighbors=15, min_dist=0.15) was used to project the 384-dimensional embeddings into 2D coordinates for the scatter plot. Proximity on the map reflects semantic similarity: questions that sit near each other are about related things, even if they belong to different named clusters.
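
The three steps above can be sketched in Python as follows. This assumes the sentence-transformers, scikit-learn, and umap-learn packages; the fixed random_state values and the function name build_question_map are assumptions added for reproducibility and illustration, and the original seeds are not documented here.

```python
# Parameters taken from the description above; random_state is an assumption.
MODEL_NAME = "all-MiniLM-L6-v2"
KMEANS_PARAMS = {"n_clusters": 18, "n_init": 30, "random_state": 0}
UMAP_PARAMS = {"n_neighbors": 15, "min_dist": 0.15, "n_components": 2, "random_state": 0}


def build_question_map(questions):
    """Return (cluster labels, 2D coordinates) for a list of question texts.

    Needs at least 18 questions, since K-Means cannot fit more
    clusters than there are samples.
    """
    # Imported here so the parameters can be inspected without the heavy dependencies.
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans
    import umap

    # 384-dimensional sentence embeddings, one row per question.
    embeddings = SentenceTransformer(MODEL_NAME).encode(questions)
    # Cluster in the full embedding space, not in the 2D projection.
    labels = KMeans(**KMEANS_PARAMS).fit_predict(embeddings)
    # Project to 2D only for plotting.
    coords = umap.UMAP(**UMAP_PARAMS).fit_transform(embeddings)
    return labels, coords
```

Clustering on the 384-dimensional embeddings while plotting the UMAP projection is why two nearby dots can carry different cluster colours.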

Dot size corresponds to response count. Colour corresponds to cluster assignment.

Limitations

This archive is a civic data project, not a nationally representative survey. We want to be clear about what it is and what it is not:

Self-selected panel: Participants join voluntarily through Youth Ki Awaaz's networks. The panel skews toward digitally connected, English/Hindi-literate young people with access to WhatsApp. It does not claim to represent all of India's youth.
Variable response rates: Not all 12,000+ panelists answer every question. Some questions received over 1,000 responses; others just 100. Questions with fewer responses carry wider uncertainty. This archive includes questions with as few as 100 responses to preserve the breadth of what was asked, while the 500+ and 1000+ filters allow focus on higher-confidence data.
Sampling: While Yoot applies post-stratification weighting to correct for known demographic imbalances, the underlying recruitment is not probability-based. Results should be read as indicative patterns from an engaged panel, not as population estimates.
Evolving infrastructure: Yoot's data pipeline, question design protocols, and weighting methodology have evolved over the project's 500-day life. Earlier questions may reflect less refined processes. This is a living project, and the methodology improves continuously.

We share this data because we believe civic transparency matters more than methodological perfection. These responses are authentic, even where the sample is imperfect. We see this as a contribution to the well-recognised need for civic data commons in India, and we welcome scrutiny as part of that process.

Yoot by Youth Ki Awaaz | Licensed under CC BY-NC-SA 4.0 | Built with assistance from Claude by Anthropic