Intro
edmsyn
is an R package, the name is essentially Educational Data Synthesizer (originally it was the concatenation of Educational Data Mining and Synthetic, not quite gramatically meaningful).
What is synthesizing data?
Are you familiar with Data Mining? or Machine Learning, or Statistics? I would not delve into distinguishing these concepts, but would instead fearlessly throw a terrible generalization as follows: All you need to do in these fields of study are:
- Collect data (for
edmsyn
you collect data from educational context) - Propose a model (and you are forced to choose a parameterised one)
- Learn the parameters, test them on test data, report the performance of your model
Synthesizing data in edmsyn
takes an additional step: say you obtained a set of learnt parameters, you tweak them a little bit, and then generate new data from them, and finally call it synthetic data. There’s that!
But what’s good doing so?
You may want to Google a bit to find that the answer is not a short one, or simply read this paper from Behzad Beheshti (my colleague at polymtl.ca - now already got his PhD) to see one application of synthetic data.
What’s special about this R package?
Yes, obviously there are several R packages that serve the purpose of learning parameters from different models, generate them under a bunch of different options (including ones that I don’t really understand). These packages are in fact very specialised, each of them go really deep into various aspects of a single model (or a class of models). How am I confident that edmsyn
are going to do anything decently good?
In fact edmsyn
serve something slightly different: it is useful when users want to study synthetic data under many different models. The package provides a framework that is really flexible:
- You throw in data, then say the magic word (SYNTHESIZE!), and everything happen. Or,
- You throw in a bunch of parameters, then say the magic words (GENERATE!) and everything happen.
A bunch of parameters as I mentioned few seconds ago is actually not a simple concept. Let’s say the data here is a matrix A
of size \( m \times n\), with each entry being a real number between 0 and 1. If you throw in two numbers m
and n
, then the generating method can simply be as follows: generate \( m \times n \) numbers between 0 and 1, one by one arrage each of them into the result matrix. But what if you throw in m
and a vector v
of \( n \) real values, representing the expected value of \( n \) columns of A
? Then things become a little complicated.
And as I mentioned above, all you have to do is saying the magic word. So basically edmsyn
allow you to do either of the below:
See that depending on what is inputted, edmsyn::generate
automatically figured out what to do! And don’t forget that you are working across different models, so the following is okay too:
edmsyn::generate(model = 'B', m = 4, n = 3, p = 6)
edmsyn::generate(model = 'C', v = c(0.5, 0.5, 0.5), t = matrix(0,3,5))
context: another magic word
See that models A
, B
and C
are sharing some parameters? That is what happen in Educational Data Mining. Specifically the famous Q
matrix, defining the relationship between items and skills, is being used everywhere (not literally, but close)! That is why edmsyn
introduces a useful notion that becomes the building block of the whole framework. The new thing here is called a context. It makes all the illustrative code above even simpler (in other words, the first magic word become even less verbose!). Look at the code below to see how it works:
context <- edmsyn::create.context(m = 4, n = 3, v = c(0.5,0.5,0.5), t = matrix(0,3,5))
# magic!
A <- edmsyn::generate('A', context)
B <- edmsyn::generate('B', context)
C <- edmsyn::generate('C', context)
Notice that n
should be the length of v
. So putting both of them into a single context is kinda redundant, they are inclusive! And guess what, edmsyn
allows you to put in only v
, n
will be automatically inferred.
context <- edmsyn::create.context(m = 4, v = c(0.5,0.5,0.5), t = matrix(0,3,5))
context <- edmsyn::generate('A', context) # fine!
To make it even better, this is also possible:
context <- edmsyn::create.context(n = 3, t = matrix(0,3,5))
# recall that model C need v and t, but context only have n and t
context <- edmsyn::generate(model = 'C', context)
In this case, edmsyn
understands that it need to do an intermediate step: generate v
from n
, before using the resulting v
and t
to generate C
. All of that happens without you having to specifically tell the package what to do.
Sufficiency and Consistency
What if you sneakily (or accidentally by some kind of bug - very likely when it comes to big complicated applications) throw in a single context something like this: edmsyn::create.context(n = 4, v = c(0.5, 0.5, 0.5))
? You will be caught for inputting v
with an incompatible length to n
. The situation is called inconsistent context and ofcourse will be caught by function edmsyn::create.context()
Simpler, if you do something like this edmsyn::generate(model = 'A', edmsyn::generate.context(m = 3))
. You will be caught for not giving enough information with respect to model A
. This is called insufficient context and will be caught by function edmsyn::generate()
.
When you are working across many models and contexts, the two essential conditions are being sufficient and consistent. They are ensured and automatically resolved whenever possible by edmsyn
Okay I got it, edmsyn
is trying to be a flexible, shared interface to various specialised EDM package.
You are god damn right. But there is even more as I am working on edmsyn
towards the second version of it. Right now if you are interested in using it right away (and are familiar to EDM literature) jump right to the vignette for a thorough tutorial on using it (you will find that the package is a little bit different from what I explained above, but the spirit is still there).
And of course to install it:
Good luck!