How Jogg Generates High-Quality AI/ML Questions

The Science and Craft Behind Every Question You Answer

by the Jogg Team | MokingBird Oy

Every quiz app has questions. But not every quiz app's questions are worth your time.

If you have ever opened a "machine learning quiz" on some random platform and been asked "what does ML stand for?", or been given a multiple-choice question where three of the four answers are so obviously wrong that they barely count as distractors, you know the problem. Shallow questions teach you to recognize the shape of correct answers, not to understand the material.

At Jogg, we think differently. Here is the full story of how we create, validate, and continuously improve the questions you encounter on your AI/ML learning journey.

Starting Point: What Makes a Good AI/ML Question?

Before we talk about systems and pipelines, let us talk about principles. A good question in the context of AI/ML learning should meet all five of the following criteria:

Test genuine understanding, not just memorization of facts
Have a single, clearly defensible correct answer — no ambiguity, no "well it depends on the framework" traps
Have plausible distractors — wrong answers should reflect real misconceptions, not obvious nonsense
Be appropriately difficult for its level — not too easy to feel trivial, not so obscure that it stops being educational
Connect to something real — a real algorithm, a real paper, a real architectural choice that matters in practice

These five criteria drive everything about how questions get created and reviewed at Jogg.

Phase 1: The MVP Foundation — Expert-Curated Questions

The first questions in Jogg were not generated by any AI system. They were hand-crafted by human experts with deep knowledge of AI/ML.

Our curation team worked through the full 9-lane curriculum — from mathematical foundations through data preprocessing, model training, RAG systems, inference optimization, deployment, multimodal AI, and AI safety — writing questions that meet our quality bar.

This was deliberate. Before we could train machine learning models to generate good questions, we needed a gold standard dataset of what "good" actually looks like.

What Expert Curation Looks Like

When a domain expert writes a question for Jogg, they go through a structured process:

Step 1: Pick a key concept. Not just a fact, but a concept that has genuine depth. For example: not "what is a transformer?" but "why does the self-attention mechanism in transformers scale quadratically with sequence length, and what architectural innovations address this?"

Step 2: Write the correct answer. The answer should be provably correct, not a matter of opinion. It should reference the actual technical reason, not a simplified rule of thumb.

Step 3: Write three distractors. Each distractor should:

Sound plausible to someone with partial knowledge
Represent a real misconception in the field
Not be easy to rule out by process of elimination

A good distractor is almost as hard to write as the correct answer.

Step 4: Tag difficulty level. Every question gets classified:

Beginner: Requires understanding of core terminology and basic concepts
Intermediate: Requires understanding of how components interact
Advanced: Requires understanding of tradeoffs, limitations, and deeper principles
Expert: Requires understanding of research-level nuances, edge cases, and cross-cutting concerns

Step 5: Peer review. Every question is reviewed by at least one other domain expert before it enters the question bank.

This process is slow. It is expensive. But it produces questions that actually teach you something.
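
The five steps above imply a record shape for each curated question. The sketch below is one hypothetical way to represent it; the class names, field names, and checks are assumptions made for illustration, not Jogg's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class Difficulty(Enum):
    # The four levels named in Step 4.
    BEGINNER = "beginner"
    INTERMEDIATE = "intermediate"
    ADVANCED = "advanced"
    EXPERT = "expert"


@dataclass(frozen=True)
class Question:
    # Hypothetical field names; the article does not describe a schema.
    concept: str
    stem: str
    correct_answer: str
    distractors: tuple[str, ...]
    difficulty: Difficulty
    reviewers: tuple[str, ...] = ()

    def problems(self) -> list[str]:
        """Return structural problems; an empty list means the record is well-formed."""
        found = []
        if len(self.distractors) != 3:
            found.append("expected exactly three distractors")
        options = [self.correct_answer, *self.distractors]
        # Compare case-insensitively so "A" and "a" count as duplicates.
        if len({option.strip().lower() for option in options}) != len(options):
            found.append("answer options must be distinct")
        if not self.reviewers:
            found.append("needs at least one peer reviewer")
        return found
```

Checks like these only catch structural slips. Whether a distractor reflects a real misconception still needs the human judgment described in Steps 3 and 5.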

Phase 2: Research Paper-Based Questions — A Different Challenge

One of Jogg's most distinctive features is our Papers Quest mode — quiz questions derived directly from landmark AI/ML research papers.

Writing questions about research papers is harder than writing general curriculum questions, for a few reasons:

Papers often contain subtle technical claims that are easy to misrepresent
The "correct" interpretation of a result can evolve as the field's understanding matures
The most important contribution of a paper is sometimes not its headline result, but a methodological choice or negative finding buried in an appendix

Our approach to paper-based questions:

Read the full paper, not just the abstract. Our question writers read every paper in its entirety before writing a single question.

Focus on key contributions over incidental details. A question about FlashAttention should test your understanding of why IO-awareness matters for attention computation — not whether you memorized a specific benchmark number from Table 3.

Test for conceptual transfer. The best questions aren't "what did this paper propose?" but "given this paper's contribution, what would you expect to happen if you applied it to this scenario?"

Our current curated papers include some of the most important works in modern AI/ML:

Attention Is All You Need — the original Transformer
BERT — bidirectional pre-training for language understanding
GPT-3 — few-shot learning at massive scale
LoRA — parameter-efficient fine-tuning
RAG — retrieval-augmented generation
FlashAttention — IO-aware fast attention
Scaling Laws for Neural Language Models
Constitutional AI — harmlessness from AI feedback (RLHF/RLAIF)
Denoising Diffusion Probabilistic Models
CLIP — vision-language models through contrastive learning
...and more, for a total of 20 landmark papers

Phase 3: MokingbirdDataGen — AI-Assisted Question Generation

As Jogg grows, manual question curation cannot scale to meet the depth and breadth the platform needs. This is where our proprietary MokingbirdDataGen system comes in.

MokingbirdDataGen is a custom, two-model AI content generation system developed by MokingBird Oy. It is designed specifically for generating high-quality, educationally rigorous quiz content.

The Architecture

MokingbirdDataGen is built on two fine-tuned language models working in tandem:

The Generator Model

The Generator LLM (based on Mistral-7B, fine-tuned with LoRA adapters) is responsible for actually creating questions. Given a source document, topic taxonomy tag, and difficulty target, it generates:

The question stem
The correct answer
Three plausible distractors
A detailed explanation

The Generator does not generate questions blindly. It has been trained on our expert-curated question bank, learning the patterns and quality attributes of questions we consider excellent.

The Classifier Model

The Classifier LLM (also Mistral-7B + LoRA) works in parallel with the Generator. Its job is to:

Label questions with rich metadata (difficulty, topic, subtopic, cognitive level)
Flag questions that may be ambiguous, incorrect, or poorly constructed
Assign quality scores based on criteria derived from our expert curation standards
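
To make the Classifier's three jobs concrete, here is a minimal sketch of what its output for one question might look like and how that output could decide a question's fate. The record fields, the flag strings, the 0-to-1 score range, and the 0.8 threshold are all assumptions for illustration; the article does not specify them.

```python
from dataclasses import dataclass, field


@dataclass
class ClassifierVerdict:
    # Metadata labels named in the article.
    difficulty: str
    topic: str
    subtopic: str
    cognitive_level: str
    # Assumed to lie in [0, 1]; the real scale is not documented.
    quality_score: float
    # Hypothetical flag strings such as "ambiguous" or "incorrect".
    flags: list[str] = field(default_factory=list)


def route(verdict: ClassifierVerdict, min_quality: float = 0.8) -> str:
    """Decide what happens to a generated question (hypothetical policy)."""
    if "incorrect" in verdict.flags:
        return "reject"
    if verdict.flags or verdict.quality_score < min_quality:
        return "human_review"
    return "accept"
```

In this sketch anything flagged or scored below the threshold goes to a person rather than being silently dropped, which keeps experts in the loop for borderline cases.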

Reinforcement Learning (GPRO)

Both models are trained using a custom reinforcement learning approach we call Mokingbird-GPRO-Hybrid — a novel combination of field-level process supervision and outcome-level reward. This ensures both models optimize for the right goals:

Generator reward: question quality, factual correctness, appropriate difficulty calibration, distractor plausibility
Classifier reward: accuracy of metadata labels compared to expert-labeled gold data

The two models train together in a feedback loop, continuously improving question quality.
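
The article does not publish the reward formulas, so the following is only a sketch of the general idea of blending field-level scores with an outcome-level score. The averaging, the `process_weight` parameter, and both function names are assumptions, not the actual Mokingbird-GPRO-Hybrid definition.

```python
def hybrid_reward(
    field_scores: dict[str, float],
    outcome_score: float,
    process_weight: float = 0.5,
) -> float:
    """Blend per-field scores (stem, answer, distractors, ...) with one outcome score.

    All scores are assumed to lie in [0, 1]. The linear blend is an
    illustrative choice, not the documented method.
    """
    if not field_scores:
        raise ValueError("field_scores must not be empty")
    if not 0.0 <= process_weight <= 1.0:
        raise ValueError("process_weight must be between 0 and 1")
    process_score = sum(field_scores.values()) / len(field_scores)
    return process_weight * process_score + (1.0 - process_weight) * outcome_score


def classifier_reward(predicted: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of expert-labeled metadata fields that the prediction matches."""
    if not gold:
        raise ValueError("gold labels must not be empty")
    matches = sum(predicted.get(key) == value for key, value in gold.items())
    return matches / len(gold)
```

For example, a question whose stem and answer score 1.0 but whose distractors and explanation score 0.5 has a field-level average of 0.75, so a perfect outcome score with equal weighting yields 0.875.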

Why Fine-Tuned Models Instead of Prompting GPT-4?

This is a fair question. Using a general-purpose LLM API would be faster to set up. We chose to build custom fine-tuned models because:

Quality consistency: A fine-tuned model trained on our exact quality standards produces more consistently good questions than prompt engineering a general-purpose model
Domain depth: A model fine-tuned on AI/ML content deeply understands the domain-specific terminology, paper citations, and technical nuances that general models often get wrong
Control: We control the full pipeline — we can improve quality, fix systematic errors, and retrain without dependency on external API changes
Data privacy: All generation happens on our infrastructure — research paper content is not sent to external APIs

Phase 4: Difficulty Calibration — Making "Hard" Actually Mean Hard

Writing a question that is labeled "advanced" is one thing. Ensuring that it actually discriminates between intermediate and advanced learners is another.

Jogg is designed to calibrate question difficulty with Item Response Theory (IRT) once enough reviewed response data has accumulated and the production batch pipeline is in place.

IRT is a psychometric framework used in standardized testing (it's behind exams like the SAT and GRE). The idea is that a question's "true" difficulty can be estimated statistically based on how many people answer it correctly, accounting for the ability level of the people who answered it.

Here is how the calibration loop is designed to work in Jogg:

A new question enters the system with an estimated difficulty based on expert judgment
Over time, as more users answer the question, we collect anonymous response data
In the target pipeline, a reviewed batch process fits an IRT model (2-parameter logistic) to response data
The difficulty estimate is updated to reflect observed performance
Questions that are performing outside their difficulty band (e.g., an "advanced" question that 90% of beginners answer correctly) get flagged for review

Once this pipeline is running, the difficulty labels in Jogg will be empirically validated, not just subjectively assigned.
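
For readers who want to see the model itself: in the 2-parameter logistic form, the probability of a correct answer is 1 / (1 + exp(-a * (theta - b))), where theta is the learner's ability, b the question's difficulty, and a its discrimination. The sketch below is a deliberately simplified illustration. It assumes learner abilities are already known and holds discrimination fixed, whereas a production fit would estimate all parameters jointly. The function names and grid settings are hypothetical.

```python
import math


def p_correct(ability: float, discrimination: float, difficulty: float) -> float:
    """2PL probability that a learner answers the question correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


def log_likelihood(
    responses: list[tuple[float, bool]], discrimination: float, difficulty: float
) -> float:
    """Log-likelihood of (ability, was_correct) pairs for one question."""
    total = 0.0
    for ability, was_correct in responses:
        p = p_correct(ability, discrimination, difficulty)
        total += math.log(p if was_correct else 1.0 - p)
    return total


def estimate_difficulty(
    responses: list[tuple[float, bool]],
    discrimination: float = 1.0,
    lo: float = -4.0,
    hi: float = 4.0,
    steps: int = 161,
) -> float:
    """Grid-search the difficulty that best explains the observed responses."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda b: log_likelihood(responses, discrimination, b))
```

If learners of ability 1.0 split evenly between right and wrong on a question, the estimate lands at a difficulty of 1.0, the point where such a learner has a 50% chance of success.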

Phase 5: Adaptive Difficulty — Questions That Learn From You

Beyond calibrating questions, Jogg adapts which questions you see based on your personal performance profile.

The Mixed Practice mode uses your question history to determine:

Which topics have gaps in your understanding
Which concepts you've mastered and need less reinforcement
Which difficulty level is appropriate for your current level in each topic area
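
One simple way to turn question history into topic priorities is to weight each topic by how often you miss it. The sketch below is an assumption about how this could work, not a description of Jogg's selection logic; the smoothing constants and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Iterable


def topic_weights(
    history: Iterable[tuple[str, bool]],
    prior_correct: int = 1,
    prior_total: int = 2,
) -> dict[str, float]:
    """Map each topic to a sampling weight proportional to its estimated gap.

    history holds (topic, was_correct) pairs. The priors smooth the accuracy
    estimate so a topic with few answers is never treated as fully mastered.
    """
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for topic, was_correct in history:
        total[topic] += 1
        correct[topic] += int(was_correct)
    gaps = {
        topic: 1.0 - (correct[topic] + prior_correct) / (total[topic] + prior_total)
        for topic in total
    }
    norm = sum(gaps.values())
    return {topic: gap / norm for topic, gap in gaps.items()}
```

With three correct answers on one topic and three misses on another, the smoothed accuracies are 0.8 and 0.2, so the weaker topic receives about 80% of the sampling weight.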

This is combined with FSRS spaced repetition for Daily Jogg, which determines the optimal timing for reviewing each concept based on your forgetting curve.
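
As a rough illustration of the forgetting-curve idea, the sketch below uses the power-law curve published for FSRS-4.5, in which a card's "stability" is the number of days after which recall probability drops to 90%. Which FSRS version and parameters Jogg uses is not stated in this article, so the constants and function names here are assumptions.

```python
# Constants from the published FSRS-4.5 forgetting curve (assumed version).
DECAY = -0.5
FACTOR = 19 / 81  # makes retrievability exactly 0.9 when elapsed == stability


def retrievability(elapsed_days: float, stability: float) -> float:
    """Estimated probability of recalling a concept after elapsed_days."""
    return (1.0 + FACTOR * elapsed_days / stability) ** DECAY


def next_interval(stability: float, desired_retention: float = 0.9) -> float:
    """Days to wait so that recall probability falls to desired_retention."""
    return stability / FACTOR * (desired_retention ** (1.0 / DECAY) - 1.0)
```

Lowering the desired retention lengthens the interval: you review less often and accept a higher chance of forgetting between reviews.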

The result is that no two users' Jogg experiences are exactly alike — the system continuously adjusts to give each individual learner the questions that will help them grow most efficiently.

Continuous Quality Improvement

Question quality is not a one-time concern. We have ongoing processes to ensure quality stays high:

User Reporting

Every question has an in-app reporting mechanism. If you believe a question contains an error, is ambiguous, or has become outdated, you can flag it. Our content team reviews all flags.

Analytics-Driven Review

Questions with anomalous response patterns are automatically flagged for review:

Questions with very high or very low correctness rates relative to their labeled difficulty
Questions where users spend unusually long or short times deliberating
Questions with unusual skip or flag rates
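
The three signals above could be checked with a few threshold rules. The sketch below is hypothetical throughout: the expected correctness bands, the 30-second typical time, and the other thresholds are invented for illustration and are not Jogg's real values.

```python
from dataclasses import dataclass


@dataclass
class QuestionStats:
    labeled_difficulty: str
    correct_rate: float    # fraction of responses that were correct
    median_seconds: float  # median time users spent before answering
    skip_rate: float       # fraction of users who skipped the question


# Hypothetical expected correctness bands per difficulty label.
EXPECTED_CORRECT_RATE = {
    "beginner": (0.70, 0.95),
    "intermediate": (0.50, 0.85),
    "advanced": (0.30, 0.70),
    "expert": (0.15, 0.55),
}


def review_reasons(
    stats: QuestionStats,
    typical_seconds: float = 30.0,
    time_factor: float = 3.0,
    max_skip_rate: float = 0.25,
) -> list[str]:
    """Return the reasons, if any, to send a question for human review."""
    reasons = []
    low, high = EXPECTED_CORRECT_RATE[stats.labeled_difficulty]
    if not low <= stats.correct_rate <= high:
        reasons.append("correct rate outside expected band")
    too_slow = stats.median_seconds > typical_seconds * time_factor
    too_fast = stats.median_seconds < typical_seconds / time_factor
    if too_slow or too_fast:
        reasons.append("unusual deliberation time")
    if stats.skip_rate > max_skip_rate:
        reasons.append("unusual skip rate")
    return reasons
```

A question that trips a rule is only queued for a person to look at. The rules do not decide on their own that a question is bad.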

Regular Content Audits

The AI/ML field moves fast. What was an emerging topic two years ago may now be foundational — or obsolete. We conduct regular curriculum audits to ensure content stays current with the state of the field.

What This Means for You as a Learner

When you answer a question in Jogg, you can trust:

The correct answer is actually correct, not just "probably right"
The difficulty label is checked against real calibration data as it becomes available, not just gut feeling
The wrong answers represent real misconceptions, not random filler
The question tests something worth knowing
If the field has evolved and a question has become outdated, it will be updated

This is what it means to take question quality seriously. It is a lot more work than scraping the internet for quiz questions or asking a chatbot to generate 500 questions in five minutes. But it is the only approach that produces learning that actually sticks.

The Future: Personalized Question Generation

Looking ahead, our ambition is to enable fully personalized question generation — where you can specify your background, goals, and areas of interest, and Jogg generates questions tailored specifically to your learning path.

Already, you can request questions derived from specific research papers. In the future, Jogg will be able to generate questions calibrated to your skill level and focused on the topics most relevant to your goals, whether that's preparing for a particular type of interview, mastering a model architecture, or staying current with a research area.

This is the promise of AI-assisted personalized learning at scale. And it is what MokingbirdDataGen is being built to deliver.

Have feedback on question quality? Use the in-app flag feature or reach out to [email protected].

Jogg — Built for serious learners, with questions that prove it.