Developing Greek CV parsing: an Odyssey
9 December 2016 Blog Kim Pieschel
Textkernel has been developing its 17th CV parsing language, which will be released soon. You can read about the development of the Greek CV parsing model in Konstantinos’ blog post:
My journey started halfway through August, when I was given the task of teaching a machine to read Greek CVs. I got settled in the quiet, cosy office amongst all the NLP and ML researchers and engineers, and that was it: I had joined Textkernel and, in particular, the so-called Textractor team.
Luckily, the air-conditioning unit provided a cool environment during those hellish 30°C days, creating the ideal atmosphere to work on an exciting topic and, in my case, in my mother tongue. My name is Konstantinos Lampridis. I am an MSc student of Artificial Intelligence at the University of Amsterdam, and I am going to take you through the process we followed, up to the point where we could state that Textkernel is able to parse Greek – the language with the longest documented history of any living language, spanning 34 centuries of written records.
By Konstantinos Lampridis
Preamble: equipping our engine with a new language parsing model
But how can one teach a machine to read Greek CVs? Technically, we cast it as an NLP problem and solve it by deploying a computational model. This model comprises various components, such as Machine Learning models and algorithms, engineered linguistic features and lexical resources. The Textractor engine runs this computational model as a pipeline of those components. So practically, all I had to do was tailor a version of the model that could deal accurately with the Greek language. The goal was to make it capable of understanding the contents of a Greek CV, processing a document as sections, items, words and characters.
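To make the idea of "a pipeline of components" concrete, here is a minimal sketch. The component names, interfaces and section keywords are purely illustrative assumptions, not Textkernel's actual API:

```python
# Illustrative sketch: a parsing model as a pipeline of components.
# All names (Tokenizer, Sectioner, Pipeline) and the section keywords
# are hypothetical, invented for this example.

class Component:
    def process(self, doc: dict) -> dict:
        raise NotImplementedError

class Tokenizer(Component):
    """Split the raw text into tokens (whitespace split, for brevity)."""
    def process(self, doc):
        doc["tokens"] = doc["text"].split()
        return doc

class Sectioner(Component):
    """Tag tokens that look like Greek CV section headers."""
    SECTION_HEADERS = {"εκπαίδευση": "education", "εμπειρία": "experience"}

    def process(self, doc):
        doc["sections"] = [
            self.SECTION_HEADERS.get(tok.lower()) for tok in doc["tokens"]
        ]
        return doc

class Pipeline:
    """Run each component in order, passing the document dict along."""
    def __init__(self, components):
        self.components = components

    def parse(self, text):
        doc = {"text": text}
        for component in self.components:
            doc = component.process(doc)
        return doc

pipeline = Pipeline([Tokenizer(), Sectioner()])
result = pipeline.parse("Εκπαίδευση Πανεπιστήμιο Αθηνών")
```

Each component only reads and writes the shared document dict, so language-specific pieces (a Greek tokenizer, Greek section keywords) can be swapped in without touching the rest of the pipeline.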
Luckily for me, I had a lot of development tools at my disposal and got to use them within a robust framework implementing the state of the art in Machine Learning. Successive generations of model versions were trained on annotated data, then evaluated, documented and archived. These versions were also compared against each other at various levels of the information-extraction process, and conclusions were drawn from the metrics of those comparison tests. After exploring the design space this way, we reached an acceptable level of performance.
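The evaluate-and-compare loop can be sketched roughly as follows. The metric (token-level accuracy), the data format and the toy model versions are all assumptions for illustration; the real system compares models on much richer extraction metrics:

```python
# Hypothetical sketch of comparing model versions on an annotated dev set.
# Data format, metric and model versions are invented for this example.

def accuracy(predicted, gold):
    """Fraction of token labels predicted correctly."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold) if gold else 0.0

def select_best(versions, dev_set):
    """Evaluate each model version on the dev set; return the best (name, score)."""
    scores = {}
    for name, predict in versions.items():
        predicted = [label for tokens, _ in dev_set for label in predict(tokens)]
        gold = [label for _, labels in dev_set for label in labels]
        scores[name] = accuracy(predicted, gold)
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy dev set: token sequences paired with gold labels.
dev = [(["2010", "Αθήνα"], ["DATE", "O"])]
versions = {
    "baseline": lambda toks: ["O"] * len(toks),
    "v2": lambda toks: ["DATE" if t.isdigit() else "O" for t in toks],
}
best, score = select_best(versions, dev)  # v2 wins: it tags the year correctly
```

Archiving each version's score alongside its configuration is what makes the comparisons across generations reproducible.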
Scylla and Charybdis: the challenges
“So, how can this be an odyssey if all those technologies are at your command?”, someone might rationally ask. Well, all those models are trained using sets of documents. Each document has to be transformed, in a meaningful way, into a representation understandable by a machine. One way to do that is to extract features based on linguistic patterns. Engineering these features to drive the model’s performance towards better accuracy required both technical and linguistic knowledge. The challenge was to use my intuition about the model’s performance, base the feature design on it, and incorporate the language’s specifics into the system as well as possible.
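As a small illustration of what "features based on linguistic patterns" can look like for Greek, here is a hypothetical feature extractor for a single token. The specific feature set is an assumption made for this sketch, not the features the production model uses:

```python
# Hypothetical linguistic features for a Greek token.
# The feature set is illustrative only.
import unicodedata

def token_features(token):
    """Map a token to a dict of simple linguistic features."""
    letters = [c for c in token if c.isalpha()]
    return {
        "is_capitalized": token[:1].isupper(),
        # All letters belong to the Greek script (checked via Unicode names).
        "is_greek": bool(letters) and all(
            "GREEK" in unicodedata.name(c, "") for c in letters
        ),
        # Greek marks stressed vowels with the tonos accent (e.g. ά, ή).
        "has_tonos": any("TONOS" in unicodedata.name(c, "") for c in token),
        "is_digit": token.isdigit(),
        # Suffixes are informative in a morphologically rich language.
        "suffix3": token[-3:].lower(),
    }

features = token_features("Αθήνα")  # "Athens"
```

Features like the tonos flag and suffixes are exactly the kind of language-specific signal that has to be designed per language: they carry information in Greek that an English feature set would never look for.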
Syntax, grammar, tonicity and orthography – in general, the diversity, richness and complexity of the Greek language – proved to be an initial challenge for the algorithms. A lot of design choices had to be made, but fortunately my command of Greek, after two and a half years in Amsterdam, is still perfect – and then it all came down to crafting the code.
Ithaca: Textkernel can now parse Greek
Reaching the present, we have found our Ithaca. We can now confidently say that we outperform almost all competitors (soon to be all of them) on the information we care to extract. Working within the Textractor team during the development process was, and still is, a valuable experience. So perhaps, once more: it is not about the destination, but the journey.