The German language is known to be relatively complicated, and grammatical gender in particular causes a lot of confusion. While English has only one definite article (the), German uses three:
- der (masculine)
- die (feminine)
- das (neuter)
While rules to determine the gender of a noun exist, almost no native German speaker can name them. We can sidestep this problem (determining the gender without memorizing the rules) with some simple machine learning using the Accord.NET framework.
Let’s quickly name the steps that will follow:
- find and extract a dataset of noun-gender associations
- split it into training, test and validation datasets
- extract features into something the algorithm can use
- train a Naive Bayes classifier
- test the model with the test dataset
After quite a while of searching, I found this machine-readable, CC-BY-SA 4.0 licensed XML file from Daniel Naber.
In our Universal Windows App we can then load all nouns into a List:
    // MinLength = 4 letters
    Words = await Parser.LoadNounsAsync("morphy-export-20110722.xml", int.MaxValue, MinLength);
The next step is to split the dataset into training, test and validation sets. For this purpose I wrote a SplitRandom method that randomly selects elements from an IEnumerable and returns a List<T>[] with a specified size.
    // randomly split the dataset into three almost equally sized sets
    var splits = Words.SplitRandom(3);
    trainingDataset = splits[0];
    testDataset = splits[1];
    validationDataset = splits[2];
You can look up the definitions of SplitRandom<T>() and LadeSubstantive() in the source code of the sample application at the bottom of this post.
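If you do not want to open the sample first, here is a minimal sketch of how such a split could be written. It is my own stand-in and not the code from the sample application: the class name SplitExtensions and the optional seed parameter are assumptions added for illustration. It shuffles the items and then deals them out round-robin, so the parts differ in size by at most one element.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SplitExtensions
{
    // Hypothetical stand-in for the sample's SplitRandom<T>().
    // The optional seed is an assumption that makes runs reproducible.
    public static List<T>[] SplitRandom<T>(this IEnumerable<T> source, int parts, int? seed = null)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (parts < 1) throw new ArgumentOutOfRangeException(nameof(parts));

        Random random = seed.HasValue ? new Random(seed.Value) : new Random();
        List<T> items = source.ToList();

        // Fisher-Yates shuffle: walk backwards and swap each element
        // with a randomly chosen element at or before its position.
        for (int i = items.Count - 1; i > 0; i--)
        {
            int j = random.Next(i + 1); // 0 <= j <= i
            T tmp = items[i];
            items[i] = items[j];
            items[j] = tmp;
        }

        var result = new List<T>[parts];
        for (int p = 0; p < parts; p++)
        {
            result[p] = new List<T>();
        }

        // Deal the shuffled items round-robin so the part sizes
        // differ by at most one element.
        for (int i = 0; i < items.Count; i++)
        {
            result[i % parts].Add(items[i]);
        }

        return result;
    }
}
```

Passing a fixed seed, for example Words.SplitRandom(3, 42), gives the same split on every run, which helps when you want to compare models on identical data.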
To train the Naive Bayes, we have to select features and represent them as numbers so the algorithm can use them. Our assumption is that the gender of German nouns can be determined from the suffix, so for simplicity we start with the last four letters. In the sample application, I represent the letters as enums so they can be cast to int or double as required. Each instance of the Wort class can now have a Features property of type int[]:
    public int[] Features
    {
        get
        {
            // The Naive Bayes expects integer symbols starting at 0,
            // so each letter is cast to its enum value
            return new int[]
            {
                (int)GetLetter(-1),
                (int)GetLetter(-2),
                (int)GetLetter(-3),
                (int)GetLetter(-4)
            };
        }
    }
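To make the encoding behind the Features property more concrete, here is a small sketch of how letters could be mapped to enum values. The Letter enum, the LetterEncoding class and the SuffixFeatures helper are hypothetical names I made up for this illustration; the enum in the sample application may be laid out differently. I assume that negative positions count from the end of the word and that anything outside the alphabet maps to Unknown.

```csharp
using System;

// Hypothetical alphabet: 26 Latin letters plus the German umlauts and sharp s.
public enum Letter
{
    Unknown = 0,
    A, B, C, D, E, F, G, H, I, J, K, L, M,
    N, O, P, Q, R, S, T, U, V, W, X, Y, Z,
    Ae, Oe, Ue, Eszett
}

public static class LetterEncoding
{
    // Same order as the enum members after Unknown.
    // The escapes are a-umlaut, o-umlaut, u-umlaut and sharp s.
    private const string Alphabet = "abcdefghijklmnopqrstuvwxyz\u00e4\u00f6\u00fc\u00df";

    public static Letter ToLetter(char c)
    {
        int index = Alphabet.IndexOf(char.ToLowerInvariant(c));
        return index < 0 ? Letter.Unknown : (Letter)(index + 1);
    }

    // Negative positions count from the end: -1 is the last letter.
    public static Letter GetLetter(string word, int position)
    {
        int index = position < 0 ? word.Length + position : position;
        if (index < 0 || index >= word.Length)
        {
            return Letter.Unknown; // the word is shorter than the requested suffix
        }
        return ToLetter(word[index]);
    }

    // Encodes the last `length` letters, last letter first.
    public static int[] SuffixFeatures(string word, int length = 4)
    {
        var features = new int[length];
        for (int i = 0; i < length; i++)
        {
            features[i] = (int)GetLetter(word, -(i + 1));
        }
        return features;
    }
}
```

With this layout, LetterEncoding.SuffixFeatures("Zeitung") encodes the letters g, n, u and t, in that order. Reserving 0 for Unknown keeps every feature value non-negative even for unexpected characters.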
Before we continue with building the model, let’s add the required NuGet packages from the Accord framework. As we are doing this for a Universal Windows App, we cannot use the original Accord packages, but someone has already published portable packages. In the NuGet package manager, you can find them by their names:
- portable.accord.machinelearning
- portable.accord.statistics
From our training dataset we can now build the feature and label arrays using LINQ:
    int[][] inputs = trainingDataset.Select(w => w.Features).ToArray<int[]>();
    int[] outputs = trainingDataset.Select(w => w.Label).ToArray();
The next step is to build and train a Naive Bayes model:
    // number of classes, then the number of possible symbols for each feature
    NaiveBayes bayes = new NaiveBayes(
        Wort.LabelClasses,
        inputs[0].Select(i => Extensions.LetterValues.Length).ToArray());
    double error = bayes.Estimate(inputs, outputs);
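If you are curious what a discrete Naive Bayes roughly does with these arrays, the following sketch counts how often each letter appears per gender and picks the class with the highest log-probability. It is a simplified illustration that I wrote for this post, not Accord's implementation: the TinyNaiveBayes class, its Fit and Predict methods and the add-one (Laplace) smoothing are all my own assumptions.

```csharp
using System;

// Minimal categorical Naive Bayes with add-one (Laplace) smoothing.
// Illustration only; this is not how Accord implements it internally.
public sealed class TinyNaiveBayes
{
    private readonly int classes;
    private readonly int[] symbols;              // number of possible values per feature
    private readonly double[] logPrior;          // log P(class)
    private readonly double[][][] logLikelihood; // [class][feature][symbol] = log P(symbol | class)

    public TinyNaiveBayes(int classes, int[] symbols)
    {
        this.classes = classes;
        this.symbols = symbols;
        logPrior = new double[classes];
        logLikelihood = new double[classes][][];
    }

    // Labels must lie in [0, classes) and feature values in [0, symbols[f]).
    public void Fit(int[][] inputs, int[] outputs)
    {
        var classCounts = new int[classes];
        foreach (int label in outputs)
        {
            classCounts[label]++;
        }

        for (int c = 0; c < classes; c++)
        {
            logPrior[c] = Math.Log((classCounts[c] + 1.0) / (outputs.Length + classes));
            logLikelihood[c] = new double[symbols.Length][];

            for (int f = 0; f < symbols.Length; f++)
            {
                // Count how often each symbol occurs for feature f within class c.
                var counts = new int[symbols[f]];
                for (int n = 0; n < inputs.Length; n++)
                {
                    if (outputs[n] == c) counts[inputs[n][f]]++;
                }

                logLikelihood[c][f] = new double[symbols[f]];
                for (int s = 0; s < symbols[f]; s++)
                {
                    // Add-one smoothing avoids log(0) for unseen symbols.
                    logLikelihood[c][f][s] =
                        Math.Log((counts[s] + 1.0) / (classCounts[c] + symbols[f]));
                }
            }
        }
    }

    // Returns the class with the highest log-posterior (up to a constant).
    public int Predict(int[] input)
    {
        int best = 0;
        double bestScore = double.NegativeInfinity;
        for (int c = 0; c < classes; c++)
        {
            double score = logPrior[c];
            for (int f = 0; f < input.Length; f++)
            {
                score += logLikelihood[c][f][input[f]];
            }
            if (score > bestScore)
            {
                bestScore = score;
                best = c;
            }
        }
        return best;
    }
}
```

The "naive" part is the sum inside Predict: each letter position contributes independently, so the model cannot learn that a particular combination of letters matters more than its parts.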
Testing the model can now be done using the test dataset:
    // Classify the test dataset using the model
    int[][] testFeatures = testDataset.Select(w => w.Features).ToArray();
    int[] testLabels = testDataset.Select(w => w.Label).ToArray();
    // predict the labels
    int[] testPredictions = testFeatures.Apply(bayes.Compute);
By counting the zeros after subtracting the correct labels from the predicted labels, we get the number of correct predictions:
    // using Accord.Math for .Subtract() and .Count()
    double correctPredictions = testPredictions.Subtract(testLabels).Count(x => (x == 0));
    System.Diagnostics.Debug.WriteLine($"{correctPredictions / testPredictions.Length * 100} % success rate.");
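The same count can also be done with plain LINQ by comparing the two arrays pairwise. The Evaluation class below is a hypothetical helper of mine, not part of the sample or of Accord. Besides the success rate, it sketches a confusion matrix, which shows which genders get mixed up with each other rather than only how many predictions were right.

```csharp
using System;
using System.Linq;

public static class Evaluation
{
    // Fraction of positions where the prediction equals the true label.
    public static double Accuracy(int[] predicted, int[] actual)
    {
        if (predicted.Length != actual.Length || predicted.Length == 0)
        {
            throw new ArgumentException("Both arrays must be non-empty and of equal length.");
        }

        int correct = predicted.Zip(actual, (p, a) => p == a).Count(match => match);
        return (double)correct / predicted.Length;
    }

    // matrix[a, p] counts the items with true label a that were predicted as p.
    public static int[,] ConfusionMatrix(int[] predicted, int[] actual, int classes)
    {
        if (predicted.Length != actual.Length)
        {
            throw new ArgumentException("Both arrays must be of equal length.");
        }

        var matrix = new int[classes, classes];
        for (int i = 0; i < predicted.Length; i++)
        {
            matrix[actual[i], predicted[i]]++;
        }
        return matrix;
    }
}
```

Called as Evaluation.Accuracy(testPredictions, testLabels), this returns a value between 0 and 1, so multiply by 100 for a percentage. The diagonal of the confusion matrix holds the correct predictions for each gender.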
When you run this, it produces a model with a success rate of 70-75 %.
In the sample application you can see how a Decision Tree performs compared to the Naive Bayes (hint: better).
You can grab the whole thing from the MSDN code gallery: https://code.msdn.microsoft.com/Predicting-Noun-Genders-ef904a12
Have fun!