A Hybrid Approach to Concept Extraction and Recognition-based Matching in the Domain of Human Resources

Dan Crow, John DeSanto

Unicru, Inc.

dcrow@unicru.com, desanto@mac.com


Abstract

In this paper we describe the Convex system for extracting concepts from résumés and subsequently matching the best qualified candidates to jobs. A blend of knowledge-based and speculative concept extraction provides high quality results even outside the scope of the built-in knowledge. A comparison test shows the results found by Convex are significantly better than those found by engines using a keyword or statistical conceptual approach.

Introduction

This paper describes the significant features of a commercial system that finds candidates who are well matched to a job requirement. This problem is interesting for several reasons. First, this is a real-world problem whose solution has a significant impact on many businesses – employers are swamped with applicants for job openings and a system that screens out unqualified candidates can realize substantial time and cost savings. Second, this problem is not “AI complete” because the domain is narrow enough to avoid the need for a general-purpose solution. Finally, the domain is still broad enough to require an interestingly rich solution – in particular the main document in this domain is the résumé, which is semi-structured free text describing a wide but not limitless set of topics.

The system described in this paper, called Convex, contains technology to extract meaningful concepts from résumés and an engine for efficiently matching those concepts to job descriptions.

Concept Extraction (CE) is the task of automatically extracting abstract concepts from unstructured or semi-structured documents; these concepts represent the meaning or “aboutness” of the original document. A concept is a representation-neutral term that captures the intended meaning of a word, phrase or other component of a document. For example, extracting work history details from a résumé results in the generation of terminology-independent concepts representing that candidate’s work history. CE is a specific type of Information Extraction as described in [10], for example.

Recognition-based Matching (RM) involves using those elements of a document that are most useful for the task of matching that document to queries. The Convex system specifically extracts from a résumé those concepts that are likely to be searched for. For example, we do not extract a concept for the candidate’s home phone number, because recruiters rarely want to find candidates by their phone number.

We do, however, extract a concept for the candidate’s ZIP code, since a common query is to find candidates who live near a prospective employer. The knowledge required to make these kinds of extraction decisions is empirically derived through interviews with a wide range of users and subject-matter experts.

What Convex Does

The main purpose of Convex is to match candidates to job requirements. To perform a sophisticated match, the system has to solve an interesting set of problems, described in the following sections.

Searching vs. Matching

A key component of the Convex system is the match engine. We draw a categorical distinction between match engines and search engines.

A search engine returns the set of documents that exactly meet the search criteria, typically a Boolean combination of words or phrases. Examples include well-known Internet search engines such as AltaVista and Google.

By contrast, a match engine locates documents by interpreting the user’s requirements from the query. The results are therefore more closely aligned with what the user wants. For example, in the domain of human resources, a match engine might interpret a request for a “software engineer” to include someone with the job title “computer programmer”, because these are - in this context - synonymous terms.

A significant drawback of search engines is that they require users to create queries that predict the exact terminology used in the documents they are searching for. For example, a user may be searching a database of candidates’ résumés for people with marketing experience. One candidate describes herself as a “Category Planner” - a common marketing role. A search for “Marketing” won’t find this candidate’s résumé even though it matches what the user is looking for.

The most significant barrier to creating an effective match engine is the problem of acquiring and codifying enough knowledge to produce useful matches. This is especially hard to do when the scope of documents to be searched is the entire contents of the Internet. By restricting the knowledge domain – in the Convex case to résumés – you can bound the problem sufficiently to make it tractable.

Recognition vs. Understanding

The goal of many AI systems is to understand the domain of interest. This is a hard problem to solve because understanding implies the ability to answer generalized queries about the domain. This in turn requires a semantic model of the domain that is - or is usefully close to - correct, consistent and complete. For all but simplified “toy” domains a comprehensive semantic model is expensive to create and maintain. It also assumes that the domain can be formally modeled, though in reality many domains are inherently inconsistent and poorly understood.

An alternative approach is to recognize significant terms in the domain without fully understanding them. This allows the creation of a system that can do useful work without having to have a complete semantic model of the domain. For example, it is a relatively easy task for a system to recognize significant terms using a list of synonyms. However, it is much more complex to be able to answer general questions about the relationships between synonymous terms.

For the task of matching users’ descriptions of themselves (résumés) to job requirements, it is sufficient that the system knows that a set of user terms maps to a requirement term. It does not need to know why this mapping is valid, or be able to reason about the mapping itself.

This is another way of bounding the problem to make it computationally and logistically tractable.

Concept Extraction using Domain-Specific Knowledge

The concept extraction process is driven by a knowledge base containing domain-specific knowledge tailored to the task of recognizing significant terms within a résumé. This knowledge is stored and maintained using the Protégé-2000 tool [2]. The knowledge base contains a formal but limited model of the semantics of the domain, encoding features such as parent-child IS_A relationships and USES relationships between roles and skills.

The most significant feature of the knowledge base is that its knowledge is domain- and task-specific. The knowledge base does not contain a generalized semantics of the domain, only enough knowledge to support the recognition task. Thus Convex is an RM system.
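
The paper does not specify the internal representation of this knowledge base, but the following minimal Python sketch illustrates the kind of recognition-oriented structure described above (canonical names, synonyms, IS_A and USES relationships); the class and field names are hypothetical.

    from dataclasses import dataclass, field

    @dataclass
    class Concept:
        """One node in a recognition-oriented knowledge base (hypothetical structure)."""
        canonical_name: str                         # e.g. "Java Programming"
        synonyms: set = field(default_factory=set)  # surface forms to recognize
        is_a: list = field(default_factory=list)    # parent concepts (IS_A relationship)
        uses: list = field(default_factory=list)    # e.g. a role USES these skills

    oop = Concept("Object-oriented programming languages")
    java = Concept("Java Programming", synonyms={"java", "java programming"}, is_a=[oop])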

The knowledge is used by the conceptualizer component to process résumés and extract concepts. The conceptualizer uses a pipeline architecture that operates separately from the match engine. In other words, the computationally intensive task of extracting concepts from semi-structured text is not performed at query time.

The conceptualizer examines parsed résumés, applying a series of specialized extraction modules, each of which extracts a specific type of concept from a specific section of the résumé. For example, the skill module examines the work history section of a résumé and extracts concepts that are contained in the knowledge base’s skills hierarchy. The extraction process algorithm is:

  1. Take the text of a single job description – ignoring the job title, company and date headers.
  2. Locate any words or phrases that are known synonyms for a skill in the knowledge base.
  3. For each synonym, extract a term that is the canonical name of the skill.

  4. Generate a score for each extracted term, using the following formula:

score = Σ(LoS / Rec) + RS

where:

LoS is the Length of Service – the number of months this skill was used for. This is calculated from the start and end dates of the relevant job description.

Rec is the Recency - measuring how recently the skill was used. This is calculated as 1 + the number of years ago that the relevant job ended.

RS is a Related Skills score - this factors in any other related skills that the candidate has in the same job. Each related skill is factored according to the type of relationship it holds to the skill in the knowledge base. Table 1 shows the default relationship weightings used.

Table 1: Convex scoring algorithm weightings

  Relationship type    Factor
  Sibling              0.5
  Parent               0.6
  Child                0.4
  Related-To           0.3

We sum the scores for a skill derived from each job on the résumé to generate an overall score for that skill concept for the candidate. However, we cap the number of occurrences of a term that are included in the score – this prevents attempted “gaming” of the system, where a résumé contains keywords repeated several times, often hidden in the original document through tricks such as using white text on a white background.
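
As a minimal Python sketch of the scoring just described: each job contributes LoS / Rec plus the weighted related-skills term, and the per-job scores for a skill are summed with an occurrence cap. The relationship weights follow Table 1, while the cap value and the function names are illustrative assumptions, not the production parameters.

    # Default relationship weightings from Table 1; the occurrence cap is illustrative.
    REL_WEIGHTS = {"sibling": 0.5, "parent": 0.6, "child": 0.4, "related-to": 0.3}
    MAX_OCCURRENCES = 3

    def job_skill_score(months_of_service, years_since_job_ended, related_skill_types):
        """Score one skill within one job: LoS / Rec plus the weighted related skills."""
        recency = 1 + years_since_job_ended
        related = sum(REL_WEIGHTS.get(rel, 0.0) for rel in related_skill_types)
        return months_of_service / recency + related

    def candidate_skill_score(per_job_scores):
        """Sum the per-job scores for a skill, capping how many occurrences count."""
        return sum(sorted(per_job_scores, reverse=True)[:MAX_OCCURRENCES])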

In addition to the skill extraction module described here, the conceptualizer contains specialized extractors for educational and professional qualifications, candidate location information, job titles, companies and industries, and the candidate’s desired job, amongst others.

Speculative Concept Extraction

A significant disadvantage of using knowledge to extract concepts is that the system fails when it operates outside the bounds of its knowledge base. Our knowledge base is built on the open-world assumption that it is incomplete. This means that some meaningful terms in candidate résumés are not recognized by the knowledge-based extractors. To overcome this limitation, Convex also uses extraction techniques that do not rely on the contents of the knowledge base. These extractors are called speculative because they allow the system to extract concepts that are likely to be meaningful, rather than known to be.

Convex uses two types of speculative extraction: shallow natural language parsing, and heuristics.

Shallow Natural Language Parsing

Convex uses two domain-independent, language-specific NLP techniques to extract noun phrases from sentences, following the information retrieval approach of [5]; other examples of the shallow NLP approach are found in [3]. We currently identify noun phrases using the barrier word algorithm and part-of-speech tagging.

The barrier word algorithm - described in [7] - is a fast way of identifying short noun phrases in text. In our case the text is a job description extracted from a résumé. The job description is split into sentences and barrier words are identified. Any sequence of words that occurs between barrier words is extracted as a potential concept. The empirically derived list of barrier words includes articles, prepositions, verbs, and domain-specific stopwords.
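
The following Python sketch illustrates the barrier word idea. The barrier-word list shown is a small illustrative sample, not the empirically derived list used in Convex, and the phrase-length limit is an assumption.

    import re

    # Illustrative barrier words only; the production list is empirically derived.
    BARRIER_WORDS = {"the", "a", "an", "of", "in", "for", "with", "and", "to",
                     "developed", "managed", "responsible", "duties"}

    def barrier_word_phrases(sentence, max_len=4):
        """Return word sequences falling between barrier words as candidate concepts."""
        words = re.findall(r"[A-Za-z][A-Za-z+#.\-]*", sentence)
        phrases, current = [], []
        for word in words:
            if word.lower() in BARRIER_WORDS:
                if current:
                    phrases.append(" ".join(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(" ".join(current))
        return [p for p in phrases if len(p.split()) <= max_len]

    # barrier_word_phrases("Developed distributed billing systems in Java and C++")
    # -> ['distributed billing systems', 'Java', 'C++']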

We also use a maximum entropy part-of-speech tagger, as described in [9], to locate noun phrases in job descriptions. Noun phrases are often indicators of the subject of the text. In the context of a résumé, they often represent significant skills and experiences.
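
To illustrate how tagged text yields noun-phrase candidates, here is a Python sketch using NLTK’s off-the-shelf tagger and a simple chunk grammar; both are stand-ins for illustration only, not the maximum entropy tagger of [9] used in Convex.

    import nltk  # requires the punkt and averaged_perceptron_tagger models

    def noun_phrases(sentence):
        """Tag a sentence and chunk adjective+noun runs as candidate noun phrases."""
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        chunker = nltk.RegexpParser("NP: {<JJ>*<NN.*>+}")
        tree = chunker.parse(tagged)
        return [" ".join(word for word, _ in subtree.leaves())
                for subtree in tree.subtrees() if subtree.label() == "NP"]

    # noun_phrases("designed relational database schemas for payroll systems")
    # typically yields ['relational database schemas', 'payroll systems']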

Heuristics

Convex contains a set of heuristic extractors that embody special-case domain-specific rules to extract likely concepts. These heuristics have been derived from observation of the analysis performed by expert recruiters and hiring managers when they examine a candidate résumé. For example, the Management Heuristic determines if a candidate has experience managing people. Many job requisitions require candidates to have people management skills, so it is useful to extract this concept from résumés.

The management heuristic first examines the job titles listed in a résumé. Any job title containing the term “manager” provides evidence of a management role, with the exception of certain known job titles such as “product manager” that do not imply management responsibilities. The text of the job description in a résumé is then scanned for key phrases that indicate management experience. Examples of these empirically-derived phrases include:

oversaw a team of …
directed … staff
managed … people

Occurrences of patterns such as these add to the evidence of people management. If enough evidence is found in a single résumé then the concept is extracted.
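
A simplified Python sketch of this kind of evidence accumulation follows; the pattern list, title exceptions and evidence threshold shown are illustrative, not the production rules.

    import re

    # Illustrative rules; the production heuristic uses empirically derived lists.
    TITLE_EXCEPTIONS = {"product manager"}
    MANAGEMENT_PATTERNS = [r"oversaw a team of", r"directed .{0,40}? staff",
                           r"managed .{0,40}? people"]
    EVIDENCE_THRESHOLD = 2

    def has_management_experience(job_titles, job_descriptions):
        """Accumulate evidence of people management from titles and description text."""
        evidence = 0
        for title in job_titles:
            t = title.lower()
            if "manager" in t and t not in TITLE_EXCEPTIONS:
                evidence += 1
        for text in job_descriptions:
            for pattern in MANAGEMENT_PATTERNS:
                evidence += len(re.findall(pattern, text, flags=re.IGNORECASE))
        return evidence >= EVIDENCE_THRESHOLD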

Speculative Extraction Framework

Each speculative extraction technique has an assigned confidence level, again empirically determined. Extracted concepts are assigned the sum of the confidence levels of all the extractors that find them.

Once all the speculative extractors have been applied to a résumé, concepts whose total confidence falls below a defined threshold are discarded. The confidence levels and threshold are set such that only concepts found by at least two speculative extractors are retained. In contrast, concepts found by a knowledge-based extractor are always accepted.
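
In Python, the acceptance rule reduces to the following sketch; the confidence values and threshold are illustrative, chosen so that no single speculative extractor can clear the threshold on its own.

    def accept_speculative(concept_hits, threshold=0.7):
        """concept_hits maps each concept to the confidence levels of the speculative
        extractors that found it; keep a concept only if the sum clears the threshold."""
        return {concept: sum(levels)
                for concept, levels in concept_hits.items()
                if sum(levels) >= threshold}

    # A concept must be found by at least two extractors to survive (illustrative values):
    hits = {"category planning": [0.5, 0.4], "fast-paced environment": [0.5]}
    print(accept_speculative(hits))   # {'category planning': 0.9}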

Conceptual Inferencing

Once a set of concepts has been extracted from a résumé by the knowledge-based or speculative extractors, the system uses a set of knowledge-based inference rules to generate further relevant concepts. The current system includes two major types of inference: class category and related skills.

Class Category Inference

The system can generate concepts representing a category of things from extracted examples of the things themselves. For example, the knowledge base specifies that the three job titles “Brand Manager”, “Market Research Manager” and “Marketing Communications Manager” are instances of the class “Marketing Role”. If a résumé contains at least one of these concepts then the system infers that the candidate has the “Marketing Role” concept, because of the IS_A relationship between the class and the instance.

The score for the concept(s) generated by such inferences is calculated using the number of relevant child concepts found. So a candidate who has only the “Marketing Communications Manager” concept has a lower score associated with their “Marketing Role” concept than a candidate who has all three child role concepts.
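
A minimal Python sketch of this inference follows. The proportional scoring shown is one illustrative reading of “calculated using the number of relevant child concepts found”, not necessarily the exact production formula.

    def infer_category_concepts(candidate_concepts, category_members):
        """category_members maps a class concept (e.g. "Marketing Role") to the set of
        instance concepts related to it by IS_A; the inferred score grows with the
        number of child concepts found (the proportional scheme is illustrative)."""
        inferred = {}
        for category, members in category_members.items():
            found = candidate_concepts & members
            if found:
                inferred[category] = len(found) / len(members)
        return inferred

    roles = {"Marketing Role": {"Brand Manager", "Market Research Manager",
                                "Marketing Communications Manager"}}
    print(infer_category_concepts({"Marketing Communications Manager"}, roles))
    # {'Marketing Role': 0.333...}, lower than a candidate holding all three child roles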

This form of inferencing enables Convex to match candidates with job-relevant experience regardless of how specifically they describe that experience.

Related Skill Inference

Convex can infer a candidate’s job title without the presence of a recognized title on the résumé. Each job in the knowledge base is related to a set of primary skills. These are the skills that are highly indicative of that job and are rarely associated with other jobs. A candidate who has at least half of the primary skills of a job is given the concept for that job, whether or not they mention the job title on their résumé.

For example, the job title “mortgage underwriter” has seven primary skills related to it, including “cash flow analysis”, “credit risk analysis”, “property appraisal” and “mortgage banking risk”. A candidate with these four skills (more than half of the seven) on their résumé is inferred to be a mortgage underwriter.
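
In Python the rule reduces to a set intersection, as the sketch below shows; the three primary skills not named in the paper’s example are represented by placeholders.

    def infer_job_titles(candidate_skills, primary_skills_by_job):
        """Give the candidate a job-title concept when they hold at least half of
        that job's primary skills, whether or not the title appears on the résumé."""
        return [job for job, primary in primary_skills_by_job.items()
                if len(candidate_skills & primary) >= len(primary) / 2]

    primary = {"Mortgage Underwriter": {
        "cash flow analysis", "credit risk analysis", "property appraisal",
        "mortgage banking risk",
        # the remaining three primary skills are not named in the paper
        "primary skill 5", "primary skill 6", "primary skill 7"}}
    skills = {"cash flow analysis", "credit risk analysis",
              "property appraisal", "mortgage banking risk"}
    print(infer_job_titles(skills, primary))   # ['Mortgage Underwriter']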

Concept Discovery

The concepts found by the speculative extractors are by definition not already in the knowledge base. If the speculative extractors are producing high-quality concepts then these represent new domain knowledge that would improve the scope and quality of the knowledge base.

The Convex concept discovery system (CDS) adds to the knowledge base by analyzing the context in which new concepts are located and incrementally expanding the scope of the knowledge base. To do this, it first records all speculatively extracted concepts and the context in which they appear. Context is defined as the set of words that surround the occurrence of the concept in the source text.

Machine learning techniques are used to locate patterns in these contexts. Each context is analyzed to find known terms from the knowledge base; all other words in the context are marked as “unknown”. This reduces the context to a symbolic representation.

The set of symbolic contexts for a concept is analyzed to discover commonly co-occurring context symbols. If the co-occurring context symbols are known concepts, the concept discovery system associates the speculative concept with the common parent of the known concepts, if one exists.

For example, consider the following two contexts for the speculatively extracted concept C#:

…languages: Java, C#, C++, C…
…experience with C#, .NET, C++…

In this example the context size is 2, meaning the two words before and after the concept term are included. In practice a context size of 3 or 4 has been found to be most effective; 2 is chosen here to simplify the example. Assuming the concepts C++, Java, C and .NET appear in the knowledge base, these contexts are reduced to:

<U> <Java> C# <C++> <C>
<U> <U> C# <.NET> <C++>

Because the known concept C++ co-occurs frequently with C#, the system infers that these concepts are related. C++ is a child of the concept Object-oriented programming languages in the knowledge base, so the concept discovery system suggests that C# be added as an object-oriented programming language. A complete description of the learning algorithm is beyond the scope of this paper, but it has two interesting features:

The use of shallow NLP and domain-specific heuristics to generate input terms for the CDS is similar to the work of [4], taking a “learning from clear cases” approach that “processes simple, easily extractable, parsable and interpretable constructs of natural language conveying taxonomic knowledge”.
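
The following compact Python sketch reproduces the context symbolization and parent-suggestion steps on the C# example above; the co-occurrence threshold and function names are illustrative assumptions rather than the algorithm’s actual parameters.

    from collections import Counter

    def symbolize(context_words, known_concepts):
        """Replace words not found in the knowledge base with the <U> symbol."""
        return [w if w in known_concepts else "<U>" for w in context_words]

    def suggest_parent(contexts, known_concepts, parent_of, min_cooccurrence=2):
        """Suggest a parent class for a speculative concept from the known concepts
        that co-occur in its contexts (the threshold here is illustrative)."""
        counts = Counter()
        for ctx in contexts:
            counts.update(w for w in symbolize(ctx, known_concepts) if w != "<U>")
        for concept, n in counts.most_common():
            if n >= min_cooccurrence and concept in parent_of:
                return parent_of[concept]
        return None

    contexts = [["languages:", "Java", "C++", "C"],          # words around "C#"
                ["experience", "with", ".NET", "C++"]]
    known = {"Java", "C++", "C", ".NET"}
    parent_of = {"C++": "Object-oriented programming languages",
                 "Java": "Object-oriented programming languages"}
    print(suggest_parent(contexts, known, parent_of))
    # Object-oriented programming languages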

Concept Extraction and Matching

An important principle of Convex is the separation of concept extraction from matching. Extracting concepts from a résumé is a relatively slow operation, taking several seconds. For typical customers we process anywhere from several tens of thousands to a few million résumés. Conceptualizing this many résumés at query time would be prohibitively slow.

To avoid this time cost, the conceptualizer extracts concepts and stores them in a concept space – a multi-dimensional database in which every concept corresponds to a single dimension. The concept score is converted into a position along the dimension. For example, a candidate with the following three concepts: (Java 35, Management 64, Sales 23) would be represented by a point in a three-dimensional space whose co-ordinate is (35, 64, 23). In practice, candidates are represented by many more than 3 dimensions – on average Convex generates 43 concepts per candidate, and many of these are shared by only a few candidates. A conceptualization of 1,000,000 candidates resulted in a concept space of 14.6 million dimensions.

Both candidates and job requisitions are represented as points in the same concept space, so finding the candidates who match a requisition involves finding the nearest neighbors to the co-ordinate representing the requisition. The requisitions carry additional information specifying which concepts are required, and this allows the match engine to quickly reduce the set of possible results to a small sub-region of the concept space and from that find the set of closest candidates and rank them.
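
A simplified Python sketch of matching in the concept space: candidates missing a required concept are filtered out first, and the remainder are ranked by distance to the requisition’s point. Euclidean distance over sparse concept vectors is an illustrative choice here; the paper does not specify the metric or the index structure used.

    import math

    def match(requisition, required, candidates, top_k=4):
        """Filter candidates on required concepts, then rank the rest by distance
        to the requisition's point in the sparse concept space."""
        def distance(a, b):
            dims = set(a) | set(b)
            return math.sqrt(sum((a.get(d, 0.0) - b.get(d, 0.0)) ** 2 for d in dims))

        eligible = [c for c in candidates
                    if all(r in c["concepts"] for r in required)]
        eligible.sort(key=lambda c: distance(c["concepts"], requisition))
        return eligible[:top_k]

    candidates = [
        {"name": "A", "concepts": {"Java": 35, "Management": 64, "Sales": 23}},
        {"name": "B", "concepts": {"Java": 70, "SQL": 40}},
    ]
    print(match({"Java": 60, "SQL": 30}, required={"Java"}, candidates=candidates))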

An important feature of the concept space representation is that it allows us to easily find candidates who are similar to a given person, as well as to a given job requisition. Finding the nearest neighbors to a point representing a person – for example, a high performing current employee – allows the user to “clone” a person, finding candidates who are highly similar to an existing employee. It also allows “navigational cloning” where you specify a person as a starting point and a set of variances from that person. For example, you can ask for candidates who are like your best salesman but with more marketing experience, or who are located in a different part of the country. This allows users to avoid having to create abstract queries to specify what they want.

The concept space is optimized for rapid retrieval of matching résumés. Convex can determine the best-matching candidates for a job description in under 5 seconds, from a set of 1,000,000 résumés, on mid-range server hardware.

Comparison with other search and match technologies

There are two main approaches to solving the problem of searching large sets of documents. The most common is keyword search, where every word (usually excluding known stopwords) in each document is indexed. Queries are expressed as Boolean combinations of words, and all documents that match the query are returned. More sophisticated keyword search engines order the returned list of matching documents; the best known of these ranking algorithms is Google Inc.’s PageRank [8].

The second approach is conceptual search. These systems, also known as match engines, attempt to understand documents at the conceptual or semantic level and match documents to queries at the semantic level. Convex takes this approach, as do several other significant commercial and research projects. Unlike the knowledge-based approach used by Convex, most of these engines use statistical analysis to discover semantic structure in documents, and are therefore referred to as statistical conceptual match engines. Among the more successful statistical approaches are Latent Semantic Analysis, for example [1], and Naïve Bayesian analysis described in [6].

The statistical approach has many attractive features, primarily domain-independence and a level of robustness in the face of unknown data. However, although statistical conceptual engines produce demonstrably more relevant results than keyword search engines, they still make errors. Because they rely on finding patterns of word co-occurrence, they sometimes draw conclusions that are semantically incorrect. For example, the phrase “sales manager” frequently occurs on résumés, which can lead a statistical engine to associate the words “sales” and “manager”. A query for an “engineering manager” then returns candidates with the word “sales” on their résumé because “manager” is linked to “sales”; this is incorrect in the context of this query. In addition, because statistical systems treat résumés as unstructured lists of words, they cannot take into account the importance of more recent experiences, and so tend to return over-qualified candidates.

The knowledge-based approach used by Convex produces more relevant results than either keyword or statistical conceptual search but is only applicable when the document domain is sufficiently limited to allow the construction of a significant body of relevant knowledge.

Candidate Differentiation Test

We have conducted a series of result relevancy tests on Convex; in this paper we present one test from that series. The test involved a keyword search engine, a commercial LSA engine and Convex attempting to distinguish between 16 similar yet distinct candidates for 4 similar technology jobs. This mirrors a common real-world situation where many broadly similar candidates apply for a job and the system must find those who meet the specific job requirements.

The test set consisted of 16 résumés selected from a pool of more than 10,000 by subject matter experts (SMEs) who did not have prior knowledge of any of the tested engines. The SMEs were asked to select four candidates in each of the following categories: Network Administrator, Oracle Database Administrator, Java Programmer and Software Engineer. Using four actual job descriptions, expert users of each engine constructed queries they believed would retrieve relevant candidates for those jobs; these users did not have prior knowledge of the 16-résumé test set. An independent group of SMEs then rated the relevance of the top four results returned by each engine.

Table 2 shows the number of appropriate candidates returned by each engine for each job. For example, returning four Java Developers in response to the Java Developer job description scores 4, but returning two Java Developers, an Oracle DBA and a Software Engineer scores 2.

Table 2: Aggregate comparison test results

  Position            Convex   LSA   Keyword
  Java Developer         4      3       2
  Network Admin.         4      3       3
  Oracle DBA             4      4       1
  Software Engineer      4      1       2
  Totals                16     11       8
  % correct            100%    69%     50%

Convex correctly identified the four most relevant candidates for every job. The LSA engine performed well, correctly identifying 69% of the candidates, while the keyword engine only found appropriate candidates 50% of the time.

Conclusions

In the limited domain of résumés, Convex’s hybrid of knowledge-based, shallow natural language processing and heuristic concept extraction techniques is shown to be significantly better than both keyword and statistical conceptual techniques. Using speculative concept extraction goes some way to addressing two major drawbacks of a purely knowledge-based approach to matching: the lack of robustness in the face of unknown data and the potential cost of extending the domain knowledge base.

The recognition-based matching approach successfully solves a meaningful real-world problem without requiring an intractably large amount of knowledge.

Acknowledgements

Many thanks to the dedicated engineering team whose hard work made the Convex project possible: Dave Blaettler, Greg Dotson, Christopher Glover, Markus Guehrs, Steve Kalidonis, Indus Khaitan, David Leftwich, Visnu Pitiyanuvath, and Nikolas Sredanovic.

References

[1] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman, “Indexing by Latent Semantic Analysis”, Journal of the American Society for Information Science, 41(6), pp. 391-407

[2] W.E. Grosso, H. Eriksson, R.W. Fergerson, J.H. Gennari, S.W. Tu, and M.M. Musen, “Knowledge Modeling at the Millennium – The Design and Evolution of Protege-2000”, Proceedings of the 12th International Workshop on Knowledge Acquisition, Modeling and Management (KAW'99), Banff, Canada, 1999

[3] P.J. Hayes, “Intelligent High-Volume Text Processing Using Shallow, Domain-Specific Techniques”, in P.S. Jacobs (ed.), Text-Based Intelligent Systems: Current Research and Practice in Information Extraction and Retrieval, pp. 227-241

[4] L.M. Iwanska, N. Mata and K. Kruger, “Fully Automatic Acquisition of Taxonomic Knowledge from Large Corpora of Texts”, in L.M. Iwanska and S.C. Shapiro (eds.), Natural Language Processing and Knowledge Representation, pp. 335-345. AAAI Press

[5] C. Jacquemin “Spotting and Discovering Terms through Natural Language Processing”, MIT Press, 2001

[6] D. Michie, D.J. Spiegelhalter, and C.C. Taylor, “Machine Learning, Neural and Statistical Classification”, (edited collection). New York: Ellis Horwood

[7] G.W. Moore, R.E. Miller and G.M. Hutchins, “Indexing by MeSH titles of natural language pathology phrases identified on first encounter using the Barrier Word Method”, in J.R. Scherrer, R.A. Cote and S.H. Mandil (eds.), Computerized Natural Medical Language Processing for Knowledge Representation, pp. 29-39. North-Holland

[8] L. Page, S. Brin, R. Motwani and T. Winograd “The PageRank Citation Ranking: Bringing Order to the Web” Stanford Digital Library Technologies Project

[9] A. Ratnaparkhi “A maximum entropy part of speech tagger”,  Proceedings of the ACL-SIGDAT Conference on Empirical Methods in Natural Language Processing, pp. 491-497, Philadelphia

[10] G. Salton “Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer”. Addison-Wesley
