The challenges behind parsing & matching CVs and jobs
8 July 2016 Blog Mihai Rotaru
For the human eye, reading a CV (resume) or a job ad is an easy task. These semi-structured documents are usually divided into sections and have layouts that make it easy to quickly identify important information.
In contrast, a computer system that parses CVs needs to be continuously trained and adapted to deal with the endless expressivity of human language. As a leader in the field of language technology, Textkernel is working hard to provide the best CV parser to our customers. In this blog article I will explain how we achieve this and discuss the focus of our current research efforts.
Why rule-based parsers do not work
There is a lot of variability and ambiguity in the language used in CVs. There are many ways to write dates, and new job titles and skills appear every month. Someone’s name can also be a company name (e.g. Harvey Nash) or even an IT skill (e.g. Cassandra). The only way a CV parser can deal with this is to “understand” the context in which words occur and the relationships between them. That is why a rule-based parser quickly runs into two big limitations: 1) the rules get quite complex to account for exceptions and ambiguity, and 2) the coverage remains limited.
There are different sorts of CV parsers on the market (keyword-based, grammar-based and statistical-based), but in order to gain the most accuracy we need a mix of all these methods.
Machine learning to the rescue
One of the issues with rule-based approaches is that no rule is 100% reliable. Take the example of finding the candidate’s name in a CV: “Name:” is not always a reliable left context, not every word after it belongs to the name, and entries in an extensive list of first names can also be street names or other concepts.
Machine learning solves this problem by estimating the quality of these signals from annotated data and by combining the evidence from several signals in a principled way. These signals are called features and encode information that we think is relevant for our prediction task (e.g. “is the word preceded by ‘Name:’?”, “does the word start with an uppercase letter?”). In general, these features are written by hand.
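To make this concrete, here is a minimal sketch of hand-written token features in Python. The feature names and the tiny example are purely illustrative, not Textkernel’s actual feature set:

```python
def token_features(tokens, i):
    """Encode hand-written signals for the token at position i as a feature dict.
    Illustrative features only, not a production feature set."""
    word = tokens[i]
    prev = tokens[i - 1] if i > 0 else "<START>"
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),         # "does the word start with uppercase?"
        "prev.is_name_label": prev == "Name:",  # "is the word preceded by 'Name:'?"
        "word.isdigit": word.isdigit(),
    }

print(token_features(["Name:", "Harvey", "Nash"], 1))
# {'word.lower': 'harvey', 'word.istitle': True, 'prev.is_name_label': True, 'word.isdigit': False}
```

A statistical model then learns, from annotated CVs, how much weight each of these signals actually deserves.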
It requires a lot of intuition, black magic and sweat to get the right features for a task. Nowadays, there is a lot of exciting new research that learns the features automatically from a lot of data. At Textkernel we are also pursuing this avenue and I will talk about it in a later post.
Let’s go back to CV parsing: how are we going to apply machine learning to this problem? It turns out that a lot of smart people have already studied the problem of extracting information from text (e.g. names, companies, locations, etc.). The general class of problems is called sequence labelling (or tagging), with several well-known instantiations such as Named Entity Recognition and Part-of-Speech Tagging. Several statistical models have been applied successfully to this problem, such as Hidden Markov Models and Conditional Random Fields.
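To give a feel for how such a model labels a sequence, here is a toy Viterbi decoder over a hand-made HMM. The states, vocabulary and probabilities below are invented for illustration and have nothing to do with a real trained model:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely label sequence for obs under a toy HMM.
    All probabilities are log-probabilities; unknown words get a large penalty."""
    UNK = -1e9  # log-probability stand-in for "never seen"
    V = [{s: start_p[s] + emit_p[s].get(obs[0], UNK) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({}); back.append({})
        for s in states:
            prev, score = max(((p, V[t - 1][p] + trans_p[p][s]) for p in states),
                              key=lambda x: x[1])
            V[t][s] = score + emit_p[s].get(obs[t], UNK)
            back[t][s] = prev
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

log = math.log
states = ["DATE", "TITLE", "COMPANY"]
start_p = {"DATE": log(0.9), "TITLE": log(0.05), "COMPANY": log(0.05)}
trans_p = {  # made-up transition probabilities encoding "DATE, then TITLE, then COMPANY"
    "DATE":    {"DATE": log(0.1),  "TITLE": log(0.8),  "COMPANY": log(0.1)},
    "TITLE":   {"DATE": log(0.05), "TITLE": log(0.15), "COMPANY": log(0.8)},
    "COMPANY": {"DATE": log(0.1),  "TITLE": log(0.1),  "COMPANY": log(0.8)},
}
emit_p = {  # made-up emission probabilities
    "DATE": {"1999-2001": log(0.9)},
    "TITLE": {"Preparator": log(0.5)},
    "COMPANY": {"Universitatea": log(0.5)},
}
path = viterbi(["1999-2001", "Preparator", "Universitatea"], states, start_p, trans_p, emit_p)
print(path)  # ['DATE', 'TITLE', 'COMPANY']
```

The decoder picks the label sequence that best fits both the words themselves and the order in which the labels tend to occur, which is exactly the point of treating the text as a sequence.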
Why should we look at a sequence and not individual words? The answer is that the context and the order in the sequence are very important signals for our problem. Imagine the problem of finding the date, job title and the company from the following text:
1999-2001: Preparator at Universitatea de Vest
I am 99.7% sure you won’t understand all the words in this sentence. This is actually the reality for all statistical models: they will never know all words. But there are a few things we can spot even without knowing every word: you can easily recognise the date range, there is a word that looks like “university” and, most importantly, the word “at” comes before it. And since a typical pattern for describing work experiences is “DATE: JOB_TITLE at COMPANY”, we can easily segment the sequence into its parts: the date is “1999-2001”, the job title is “Preparator” and the company is “Universitatea de Vest”. This is the intuition behind sequence labelling and why it works much better than classifying individual words.
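As a purely illustrative aside, the “DATE: JOB_TITLE at COMPANY” pattern could be captured with a single regular expression. A real statistical parser treats such patterns as soft features combined with many others, never as a hard rule:

```python
import re

# Illustrative only: one regex for the "DATE: JOB_TITLE at COMPANY" pattern.
PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{4}):\s*(?P<title>.+?)\s+at\s+(?P<company>.+)$"
)

m = PATTERN.match("1999-2001: Preparator at Universitatea de Vest")
print(m.groupdict())
# {'date': '1999-2001', 'title': 'Preparator', 'company': 'Universitatea de Vest'}
```

The brittleness is easy to see: a job title containing the word “at”, a date written as “1999 – 2001”, or a missing colon all break the rule, while a sequence labeller degrades gracefully.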
Ideally, we would like to have a single model that takes in an entire CV, treats it as a sequence and finds all the information in it. But this is a very complicated task as the model needs to predict many labels. Moreover, words are very ambiguous depending on their position in the CV: names can be either the candidate name in the contact section or the reference name in the reference section; “University” can be part of either the educational institution (in education section) or the employer (in work experience).
Thus, we approach the parsing task with a two-stage system: a first statistical model segments the CV into sections, and then specialised statistical models handle each individual section. Let’s talk briefly about section segmentation. It, too, can be cast as a sequence labelling problem, so several statistical tools are available to solve it. But applying previous research to the section segmentation problem is not trivial, and a very important choice is the level of granularity: should we treat the CV as a sequence of words or as a sequence of lines? This choice also affects which statistical tools we can apply effectively.
Initially, we had a word-level approach using Hidden Markov Models (HMMs), but it was problematic because of the very ambiguity I mentioned above (e.g. names, “University”). From time to time, the model would hallucinate new sections just because certain ambiguous words were present. In addition, it is hard for HMMs to take advantage of crucial multi-word clues like section headers (e.g. “Work experience”), presentation clues (e.g. lines that start with dates), etc. By making the simplifying assumption that a line can belong to only a single section, we could take advantage of a more powerful statistical model: Conditional Random Fields. This assumption also simplified the problem, as we no longer had to label sequences of 2000+ words, but sequences of 100+ lines. The improvements were impressive (a 50% reduction in errors) and this approach has become our baseline for all languages.
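To illustrate the kind of line-level clues a CRF can exploit, here is a small sketch. The feature names and the header list are hypothetical, chosen only to mirror the clues mentioned above:

```python
import re

# Hypothetical lists/patterns for illustration; a real system would be far richer.
SECTION_HEADERS = {"work experience", "education", "skills", "references"}
DATE_START = re.compile(r"^\s*\d{4}")

def line_features(line):
    """Illustrative line-level features for section segmentation."""
    stripped = line.strip()
    return {
        "looks_like_header": stripped.lower() in SECTION_HEADERS,
        "starts_with_date": bool(DATE_START.match(stripped)),
        "is_short": len(stripped.split()) <= 3,
    }

print(line_features("Work experience"))
print(line_features("1999-2001: Preparator at Universitatea de Vest"))
```

Because each feature describes a whole line, multi-word clues like “Work experience” become a single strong signal instead of being scattered across word-level labels.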
I hope you now have a better picture of why CV parsing is quite a challenging task. Next time I will talk about some of the exciting developments in representation learning and Deep Learning that we have put into practice recently. More specifically, I will describe how we deal with unknown words using word representations learned via word embeddings, and our use of Siamese networks for job title normalisation, a task that is crucial for matching CVs and jobs (e.g. “java programmer” is the same as “java developer”). I will also discuss some of the software engineering challenges of using machine learning in industry.
And speaking of matching, I want to reassure you that we are not heading for a future where your CV will be ignored just because some cold and unemotional algorithm thinks there is no match. There is a lot of subjectivity involved in selecting the right applicants for a job, including the fact that hiring managers have little time to look at your CV. Textkernel provides tools to help hiring managers do more in the same amount of time. And if they are looking for a “Java programmer” and you wrote in your CV that you are a “Java developer”, we will make sure they see you too!
Are you curious about Textkernel? We are growing rapidly and hiring new colleagues!