# How Jogg Generates High-Quality AI/ML Questions

### The Science and Craft Behind Every Question You Answer

*by the Jogg Team | MokingBird Oy*

---

Every quiz app has questions. But not every quiz app's questions are worth your time.

If you have ever opened a "machine learning quiz" on some random platform and been asked *"what does ML stand for?"* or given a multiple-choice question where three of the four answers are so obviously wrong they barely count as distractors, you know the problem. Shallow questions teach you to recognize the shape of correct answers, not to actually understand the material.

At Jogg, we think differently. Here is the full story of how we create, validate, and continuously improve the questions you encounter on your AI/ML learning journey.

---

## Starting Point: What Makes a Good AI/ML Question?

Before we talk about systems and pipelines, let us talk about principles. A good question in the context of AI/ML learning should do at least one of the following:

1. **Test genuine understanding**, not just memorization of facts
2. **Have a single, clearly defensible correct answer**: no ambiguity, no "well it depends on the framework" traps
3. **Have plausible distractors**: wrong answers should reflect real misconceptions, not obvious nonsense
4. **Be appropriately difficult for its level**: not too easy to feel trivial, not so obscure that it stops being educational
5. **Connect to something real**: a real algorithm, a real paper, a real architectural choice that matters in practice

These five criteria drive everything about how questions get created and reviewed at Jogg.

---

## Phase 1: The MVP Foundation: Expert-Curated Questions

The first questions in Jogg were not generated by any AI system. They were **hand-crafted by human experts** with deep knowledge of AI/ML.
Our curation team worked through the full 9-lane curriculum (from mathematical foundations through data preprocessing, model training, RAG systems, inference optimization, deployment, multimodal AI, and AI safety), writing questions that meet our quality bar.

This was deliberate. Before we could train machine learning models to generate good questions, we needed a **gold standard dataset** of what "good" actually looks like.

### What Expert Curation Looks Like

When a domain expert writes a question for Jogg, they go through a structured process:

**Step 1: Pick a key concept**

Not just a fact, but a concept that has genuine depth. For example: not "what is a transformer?" but "why does the self-attention mechanism in transformers scale quadratically with sequence length, and what architectural innovations address this?"

**Step 2: Write the correct answer**

The answer should be provably correct, not a matter of opinion. It should reference the actual technical reason, not a simplified rule of thumb.

**Step 3: Write three distractors**

Each distractor should:

- Sound plausible to someone with partial knowledge
- Represent a real misconception in the field
- Not be easy to rule out by process of elimination

A good distractor is almost as hard to write as the correct answer.

**Step 4: Tag difficulty level**

Every question gets classified:

- **Beginner:** Requires understanding of core terminology and basic concepts
- **Intermediate:** Requires understanding of how components interact
- **Advanced:** Requires understanding of tradeoffs, limitations, and deeper principles
- **Expert:** Requires understanding of research-level nuances, edge cases, and cross-cutting concerns

**Step 5: Peer review**

Every question is reviewed by at least one other domain expert before it enters the question bank.

This process is slow. It is expensive. But it produces questions that actually teach you something.
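
The five steps above imply a fairly regular record for each question. As a minimal sketch of what such a record and its basic checks could look like (the `Question` class, its field names, and the `problems` method are illustrative assumptions, not Jogg's actual schema):

```python
from dataclasses import dataclass
from enum import Enum


class Difficulty(Enum):
    """The four difficulty levels named in Step 4."""
    BEGINNER = "beginner"
    INTERMEDIATE = "intermediate"
    ADVANCED = "advanced"
    EXPERT = "expert"


@dataclass(frozen=True)
class Question:
    """Hypothetical record for one curated question."""
    concept: str
    stem: str
    correct_answer: str
    distractors: tuple[str, ...]
    difficulty: Difficulty
    reviewers: tuple[str, ...] = ()  # assumed: IDs of peer reviewers (Step 5)

    def problems(self) -> list[str]:
        """Return rule violations; an empty list means the record is well formed."""
        issues = []
        if len(self.distractors) != 3:
            issues.append("expected exactly three distractors")
        options = [self.correct_answer, *self.distractors]
        # Compare options case-insensitively so "A" and "a" count as duplicates.
        if len({option.strip().lower() for option in options}) != len(options):
            issues.append("answer options must be distinct")
        if not self.reviewers:
            issues.append("needs at least one peer reviewer")
        return issues
```

A check like this only covers the mechanical rules (three distinct distractors, a difficulty tag, a reviewer). Whether a distractor reflects a real misconception still takes expert judgment.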

---

## Phase 2: Research Paper-Based Questions: A Different Challenge

One of Jogg's most distinctive features is our **Papers Quest** mode: quiz questions derived directly from landmark AI/ML research papers.

Writing questions about research papers is harder than writing general curriculum questions, for a few reasons:

- Papers often contain subtle technical claims that are easy to misrepresent
- The "correct" interpretation of a result can evolve as the field's understanding matures
- The most important contribution of a paper is sometimes not its headline result, but a methodological choice or negative finding buried in an appendix

Our approach to paper-based questions:

**Read the full paper, not just the abstract.** Our question writers read every paper in its entirety before writing a single question.

**Focus on key contributions over incidental details.** A question about FlashAttention should test your understanding of *why IO-awareness matters for attention computation*, not whether you memorized a specific benchmark number from Table 3.

**Test for conceptual transfer.** The best questions aren't "what did this paper propose?" but "given this paper's contribution, what would you expect to happen if you applied it to this scenario?"

Our current curated papers include some of the most important works in modern AI/ML:

- *Attention Is All You Need*: the original Transformer
- *BERT*: bidirectional pre-training for language understanding
- *GPT-3*: few-shot learning at massive scale
- *LoRA*: parameter-efficient fine-tuning
- *RAG*: retrieval-augmented generation
- *FlashAttention*: IO-aware fast attention
- *Scaling Laws for Neural Language Models*
- *Constitutional AI*: harmlessness from AI feedback (RLHF/RLAIF)
- *Denoising Diffusion Probabilistic Models*
- *CLIP*: vision-language models through contrastive learning
- ...and more, for a total of 20 landmark papers

---

## Phase 3: MokingbirdDataGen: AI-Assisted Question Generation

As Jogg grows, manual question curation cannot scale to meet the depth and breadth the platform needs. This is where our proprietary **MokingbirdDataGen** system comes in.

MokingbirdDataGen is a custom, two-model AI content generation system developed by MokingBird Oy. It is designed specifically for generating high-quality, educationally rigorous quiz content.

### The Architecture

MokingbirdDataGen is built on two fine-tuned language models working in tandem:

#### The Generator Model

The Generator LLM (based on Mistral-7B, fine-tuned with LoRA adapters) is responsible for actually creating questions. Given a source document, topic taxonomy tag, and difficulty target, it generates:

- The question stem
- The correct answer
- Three plausible distractors
- A detailed explanation

The Generator does not generate questions blindly. It has been trained on our expert-curated question bank, learning the patterns and quality attributes of questions we consider excellent.

#### The Classifier Model

The Classifier LLM (also Mistral-7B + LoRA) works in parallel with the Generator.
Its job is to:

- Label questions with rich metadata (difficulty, topic, subtopic, cognitive level)
- Flag questions that may be ambiguous, incorrect, or poorly constructed
- Assign quality scores based on criteria derived from our expert curation standards

#### Reinforcement Learning (GPRO)

Both models are trained using a custom reinforcement learning approach we call **Mokingbird-GPRO-Hybrid**, a novel combination of field-level process supervision and outcome-level reward. This ensures both models optimize for the right goals:

- Generator reward: question quality, factual correctness, appropriate difficulty calibration, distractor plausibility
- Classifier reward: accuracy of metadata labels compared to expert-labeled gold data

The two models train together in a feedback loop, continuously improving question quality.

### Why Fine-Tuned Models Instead of Prompting GPT-4?

This is a fair question. Using a general-purpose LLM API would be faster to set up. We chose to build custom fine-tuned models because:

1. **Quality consistency:** A fine-tuned model trained on our exact quality standards produces more consistently good questions than prompt engineering a general-purpose model
2. **Domain depth:** A model fine-tuned on AI/ML content deeply understands the domain-specific terminology, paper citations, and technical nuances that general models often get wrong
3. **Control:** We control the full pipeline, so we can improve quality, fix systematic errors, and retrain without dependency on external API changes
4. **Data privacy:** All generation happens on our infrastructure, so research paper content is not sent to external APIs

---

## Phase 4: Difficulty Calibration: Making "Hard" Actually Mean Hard

Writing a question that is labeled "advanced" is one thing. Ensuring that it actually discriminates between intermediate and advanced learners is another.

Jogg uses **Item Response Theory (IRT)** to calibrate question difficulty dynamically after launch.
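
For intuition, one widely used IRT form is the two-parameter logistic (2PL) model, in which the probability of a correct answer is `1 / (1 + exp(-a * (theta - b)))`, where `theta` is the learner's ability, `b` is the question's difficulty, and `a` is its discrimination. The sketch below fits `a` and `b` for a single question by plain gradient ascent on the log-likelihood. It assumes ability estimates are already known, and the function names, learning rate, and step count are illustrative choices rather than a description of Jogg's batch job; a production system would more likely estimate abilities and item parameters jointly.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL probability that a learner of ability theta answers correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fit_item(responses, a=1.0, b=0.0, lr=0.5, steps=500):
    """Estimate discrimination a and difficulty b for one question.

    responses: list of (theta, correct) pairs, where theta is a learner's
    ability estimate (assumed known here) and correct is 0 or 1.
    """
    n = len(responses)
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for theta, correct in responses:
            # Residual between the observed answer and the model's prediction.
            err = correct - p_correct(theta, a, b)
            grad_a += err * (theta - b)  # d(log-likelihood)/da
            grad_b += -err * a           # d(log-likelihood)/db
        # Step along the average gradient to increase the log-likelihood.
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b
```

A question that only high-ability learners tend to get right ends up with a larger fitted `b` than one that most learners answer correctly, which is the quantity a difficulty label can then be checked against.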

IRT is a psychometric framework used in standardized testing (it's behind exams like the SAT and GRE). The idea is that a question's "true" difficulty can be estimated statistically based on how many people answer it correctly, accounting for the ability level of the people who answered it.

Here is how it works in Jogg:

1. A new question enters the system with an **estimated difficulty** based on expert judgment
2. Over time, as more users answer the question, we collect anonymous response data
3. Nightly, a batch process fits an IRT model (2-parameter logistic) to the response data
4. The difficulty estimate is updated to reflect observed performance
5. Questions that are performing outside their difficulty band (e.g., an "advanced" question that 90% of beginners answer correctly) get flagged for review

This means the difficulty labels in Jogg are **empirically validated**, not just subjectively assigned.

---

## Phase 5: Adaptive Difficulty: Questions That Learn From You

Beyond calibrating questions, Jogg adapts which questions you see based on your personal performance profile.

The **Mixed Practice** mode uses your question history to determine:

- Which topics have gaps in your understanding
- Which concepts you've mastered and need less reinforcement
- Which difficulty level is appropriate for your current level in each topic area

This is combined with **FSRS spaced repetition** for Daily Jogg, which determines the optimal timing for reviewing each concept based on your forgetting curve.

The result is that no two users' Jogg experiences are exactly alike: the system continuously adjusts to give each individual learner the questions that will help them grow most efficiently.

---

## Continuous Quality Improvement

Question quality is not a one-time concern. We have ongoing processes to ensure quality stays high:

### User Reporting

Every question has an in-app reporting mechanism.
If you believe a question contains an error, is ambiguous, or has become outdated, you can flag it. Our content team reviews all flags.

### Analytics-Driven Review

Questions with anomalous response patterns are automatically flagged for review:

- Questions with very high or very low correctness rates relative to their labeled difficulty
- Questions where users spend unusually long or short times deliberating
- Questions with unusual skip or flag rates

### Regular Content Audits

The AI/ML field moves fast. What was an emerging topic two years ago may now be foundational, or obsolete. We conduct regular curriculum audits to ensure content stays current with the state of the field.

---

## What This Means for You as a Learner

When you answer a question in Jogg, you can trust:

- The correct answer is actually correct, not just "probably right"
- The difficulty label reflects real calibration data, not just gut feeling
- The wrong answers represent real misconceptions, not random filler
- The question tests something worth knowing
- If the field has evolved and a question has become outdated, it will be updated

This is what it means to take question quality seriously. It is a lot more work than scraping the internet for quiz questions or asking a chatbot to generate 500 questions in five minutes. But it is the only approach that produces learning that actually sticks.

---

## The Future: Personalized Question Generation

Looking ahead, our ambition is to enable fully personalized question generation, where you can specify your background, goals, and areas of interest, and Jogg generates questions tailored specifically to your learning path.

Already, you can request questions derived from specific research papers.
In the future, Jogg will be able to generate questions calibrated to your specific skill level, focused on the specific topics most relevant to your goals, whether that's preparing for a specific type of interview, mastering a particular model architecture, or staying current with a specific research area.

This is the promise of AI-assisted personalized learning at scale. And it is what MokingbirdDataGen is being built to deliver.

---

*Have feedback on question quality? Use the in-app flag feature or reach out to [email protected].*

*Jogg: Built for serious learners, with questions that prove it.*
Jogg Blog