Sometimes you need to test a hypothesis, but good old, well-known techniques like A/B tests don’t work: for example, when there is no way to randomize the test and control groups. That’s where the Difference-in-Differences method comes in. Using an example from EdTech, we show how we changed the product based on the data obtained through this kind of analysis.
Victor Pchelin
Product Analyst in the Educational Analytics Department at Netology
What hypothesis we tested: it’s too difficult for students to learn programming
A/B testing is a well-established practice at Netology, but so far we have used it in commercial analytics: we look at conversions and payments, which is a well-known and well-studied area. Now we are bringing the data-driven approach to educational analytics as well, that is, testing content and teaching methods in order to select the optimal structure of courses and materials for students.
One of the problems we faced was the lack of time for students.
The workload of courses, especially those related to programming, is high. And if people come without a background, it is difficult for them to master the program.
For our part, we expected that it would take 10-15 hours a week to complete the course, but in fact students spend much more. Because of this, they drop out.
People are working, and during the week there is not much time that they are willing to spend on learning. When you need to watch several webinars and submit several homework assignments in seven days, while reading additional literature at the same time, students do not have time to properly master the material.
Our task is to make a course that does not scare people off with an excessive workload, while still meeting our training requirements and preparing qualified specialists.
We conducted a UX study and confirmed that the problem is real: the actual study load exceeds the planned one. This gave rise to the idea of an experiment on redistributing the educational load on students.
How We Conducted Difference-in-Differences Testing: Hypotheses
Purpose and metrics of the experiment
The goal of our experiment is to make the actual study load feasible, increase student activity on the course, and reduce the number of dropouts.
The main hypothesis is that a decrease in the educational load should lead to an increase in student activity.
In Netology, we use segmentation by activity. Every few days, we track how each student attends classes, solves problems, and learns the material. Based on that, we assign one of four statuses:

active;

lagging behind;

lagging far behind;

missing.
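This segmentation can be sketched as a simple rule-based function. The thresholds and the input measure below are hypothetical, for illustration only; the article does not describe Netology’s actual rules.

```python
# Hypothetical sketch of activity segmentation: the thresholds and the
# "share of completed activities" input are illustrative assumptions,
# not Netology's actual scoring rules.
def activity_status(completed_share: float) -> str:
    """Map the share of completed classes/assignments to a status."""
    if completed_share >= 0.8:
        return "active"
    elif completed_share >= 0.5:
        return "lagging behind"
    elif completed_share > 0.0:
        return "lagging far behind"
    else:
        return "missing"

statuses = [activity_status(s) for s in (0.9, 0.6, 0.2, 0.0)]
print(statuses)
# ['active', 'lagging behind', 'lagging far behind', 'missing']
```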
The main metric of our experiment is the share of students in the two most active segments. This is a universal metric that does not depend on the content of a particular course, so it is suitable for different professions.
We compared the numbers in the active groups from January to April. During this period, several training streams start at once: three in the test group and three in the control group. The required sample size was determined with a Monte Carlo simulation; we assumed that three cohorts would be enough to get a statistically significant result.
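The idea behind Monte Carlo sample-size estimation can be sketched like this: assume an effect size, simulate the experiment many times, and check how often the test reaches significance (the statistical power). All numbers below are illustrative assumptions, not the article’s actual parameters, and the significance test here is a plain two-proportion z-test.

```python
import random
from math import sqrt

def power_for_n(n, p_control=0.50, p_test=0.55, sims=1000):
    """Estimate the power of a two-proportion z-test via Monte Carlo.

    n: students per group; p_*: assumed shares of active students
    (hypothetical values for illustration).
    """
    hits = 0
    for _ in range(sims):
        x_c = sum(random.random() < p_control for _ in range(n))
        x_t = sum(random.random() < p_test for _ in range(n))
        p_pool = (x_c + x_t) / (2 * n)
        se = sqrt(2 * p_pool * (1 - p_pool) / n)
        # two-sided z-test at alpha = 0.05
        if se > 0 and abs(x_t / n - x_c / n) / se > 1.96:
            hits += 1
    return hits / sims

random.seed(0)
# Increase n until the simulated power is acceptable (e.g. ~80%)
for n in (200, 800, 1600):
    print(n, round(power_for_n(n), 2))
```

The smallest n whose simulated power clears the target is the required per-group sample size.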
Why We Chose Difference-in-Differences as the Testing Method
We have a module that teaches the basics of HTML — this is the first module for several professional courses at once: FullStack, JavaScript, Bitrix, and Frontend.
We took Frontend students (hereinafter referred to as FE) as a test group, and the rest as a control group. In terms of the number of students, the groups turned out to be approximately equal.
Difference-in-Differences was chosen as the methodology. This is a method of testing hypotheses by comparing the values of a metric in different time periods: before and after the changes are introduced in the test group.
It differs from A/B testing in several ways. The main one is how users are distributed among the experiment groups. For A/B testing, we need random assignment because it eliminates the influence of factors that we have not taken into account and are not testing. That is, the test and control groups are randomized so that they differ in only one parameter, the one we are studying. For example, the groups should have an equal ratio of men and women, unless we are investigating the gender factor.
In the context of Netology, the distribution cannot be randomized, because students purposefully enroll in their chosen profession. We can’t mix them up, just as we can’t show different content within the same profession. A person’s choice of course is influenced by many factors that would distort the result: income, free time, existing knowledge, and much more.
So we took the Difference-in-Differences method, which can be used to study non-randomized groups and still get reliable results.
How Difference-in-Differences Works
In Difference-in-Differences, we take two groups and observe them over a historical period to see how similarly the metric’s dynamics behave in each. Then we introduce changes in one of the groups, the test group, and record the results. Finally, we compare the results of the test and control groups.
Certain conditions must be met for the method to be applicable. For example, before the experiment, average activity in the test group and in the control group should change in the same way (the parallel-trends assumption). Also, each student’s activity should not depend on the activity of other students; this held for us because the students had almost no interaction with each other during their studies.
The graph shows how the method works. We have two groups: the green line is the test group, and the orange line is the control group. In the period BEFORE the changes, the dynamics of the groups were similar: if activity increased in one group, it also increased in the other.
AFTER the changes, the dynamics of the groups begin to differ. To assess the effect of the changes, we compare the actual dynamics with the estimated ones: the estimated dynamics of the test group follow the dynamics of the control group.
Why We Are Sure That the Method Is Correct
When planning the experiment, we ran A/A tests. We had a historical sample of student learning over a six-month period. We took this data, divided it into two periods, and compared them with each other, as if the second period were the testing phase.
This was to see whether the testing method would detect a difference between the observations. Ideally, it should find nothing, because there was virtually no change. We ran many simulations and estimated the probability of a Type I error at no higher than 5%, which means the results of the study can be trusted.
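The logic of an A/A check can be sketched as follows: draw both “periods” from the same distribution, run the significance test many times, and verify that the share of (false) rejections stays near the alpha level. The sample sizes and shares below are illustrative assumptions.

```python
import random
from math import sqrt

def aa_false_positive_rate(n=500, p=0.5, sims=4000):
    """A/A check: both 'periods' come from the same distribution,
    so a correct test should reject at roughly the alpha level.
    n and p are hypothetical, for illustration."""
    rejections = 0
    for _ in range(sims):
        x1 = sum(random.random() < p for _ in range(n))
        x2 = sum(random.random() < p for _ in range(n))
        p_pool = (x1 + x2) / (2 * n)
        se = sqrt(2 * p_pool * (1 - p_pool) / n)
        if se > 0 and abs(x1 / n - x2 / n) / se > 1.96:
            rejections += 1
    return rejections / sims

random.seed(1)
print(round(aa_false_positive_rate(), 3))  # expected to land near alpha = 0.05
```

A rate far above 5% would mean the test machinery itself produces spurious “effects” and cannot be trusted.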
We are also interested in the confidence interval and the statistical significance of the effect of the changes: they show how confident we can be that the effect is real. To estimate the confidence interval, we fit a linear regression. The final conclusion about statistical significance is based on the p-value, which should be less than a predetermined critical value, 0.05 by default.
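A minimal sketch of the regression approach: in the standard Difference-in-Differences regression, the coefficient on the interaction of “treated” and “post” is the effect estimate, and its standard error gives the confidence interval. The data below is synthetic and the effect size is an assumption, for illustration only.

```python
import numpy as np

# DiD regression sketch: y ~ 1 + treated + post + treated:post.
# The interaction coefficient beta[3] is the difference-in-differences
# estimate. All data here is synthetic, for illustration only.
rng = np.random.default_rng(42)
n = 400
treated = rng.integers(0, 2, n)   # 1 = test group
post = rng.integers(0, 2, n)      # 1 = observation after the change
effect = 0.05                     # assumed true uplift (hypothetical)
y = (0.5 + 0.03 * treated + 0.02 * post
     + effect * treated * post + rng.normal(0, 0.1, n))

X = np.column_stack([np.ones(n), treated, post, treated * post])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Standard errors from the OLS covariance matrix
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
cov = sigma2 * np.linalg.inv(X.T @ X)
did, did_se = beta[3], np.sqrt(np.diag(cov))[3]
print(f"DiD estimate: {did:.3f} +/- {1.96 * did_se:.3f} (95% CI)")
```

If zero lies outside the confidence interval (equivalently, the p-value is below the critical value), the effect is considered statistically significant.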
What data we received
In the test group, the average workload decreased by three hours per week. The number of classes on the course did not decrease, but the period of study became longer.
This is what the change in student activity looks like in different weeks of study. The red line is the test group, the blue line is the control group. In the test group, activity increased both relative to the control group and relative to the previous period.
In total, based on the results of the test, we received four metric values: the metric before and after the changes for each of the two groups. From these, we construct three differences.

The metric after changes minus the metric before the changes for the test group.

The metric after changes minus the metric before changes for the control group.

The increment of the metric values in the test group minus the increment of the metric values in the control group.
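The three differences above reduce to simple arithmetic on the four metric values. The numbers here are hypothetical placeholders, not the experiment’s actual figures:

```python
# Four metric values (shares of active students; hypothetical numbers)
test_before, test_after = 0.60, 0.68
control_before, control_after = 0.61, 0.64

delta_test = test_after - test_before            # difference 1: test group
delta_control = control_after - control_before   # difference 2: control group
did_effect = delta_test - delta_control          # difference 3: DiD estimate

print(round(delta_test, 2), round(delta_control, 2), round(did_effect, 2))
# 0.08 0.03 0.05
```

The third difference strips out the change that would have happened anyway (as seen in the control group), leaving the effect attributable to the intervention.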
At the beginning of the experiment, we observed a 3.6% increase in activity; by the end of the experiment, activity in the test group had increased by 4.4%. The p-value is 0.09. Based on the test results, we will draw conclusions about how the load affects activity.
What conclusions have we drawn and what will it lead to?
The experiment is over, and based on the results, we decided to adjust the curriculum in those modules where there is a high load.
We also made sure that the changes did not hurt student reviews. So far, the students in the test group like the content more, and 13% more of them made it to the final test. Some students, however, noted that learning had become too slow for them and asked to be transferred to the control group, although only three people did so.
The test also gave us new hypotheses: for example, that activity is affected by the timing of classes and by their difficulty. We will check these in upcoming experiments.
Get a new major or promotion with these Netology courses:
For the most attentive — a 10% discount with a promo code dshabr10.