Parsing resumes

Anyone who has faced the task of automated resume analysis knows the current state of affairs in this area: existing parsers are mostly limited to extracting contact data and a few fields such as "title" and "city".

For any meaningful analysis this is not enough. It is important not only to pick out certain lines and mark them with tags, but also to determine what kind of objects lie behind them.

A live example (a fragment of the XML parsing result for a resume, produced by Sovren, one of the leaders in this field):

<EmployerOrg>
  <EmployerOrgName>OOO Zvezda-DSME</EmployerOrgName>
  <PositionHistory positionType="directHire">
    <Title>Leading specialist of the information systems development department</Title>
    <OrgName>
      <OrganizationName>OOO Zvezda-DSME</OrganizationName>
    </OrgName>

The Sovren parser handled field extraction well. No wonder: these guys have been in this business for nearly 20 years!

But what do we do with "Leading specialist of the information systems development department"? How do we figure out what the position actually is, and how much relevant experience this person has for it?

If your task is to find an employee matching a vacancy's requirements, or the other way around, a vacancy matching a candidate's experience and wishes, then keyword search and bag-of-words comparison give mediocre results. For objects that have many synonymous names this approach simply does not work.

First you need to normalize the names, turning "specialists in whatever" into programmers, system administrators and other otolaryngologists.

For that you need a knowledge base, a taxonomy of objects. The catch is that it is not enough to describe, say, only construction occupations: people change fields, and a builder's resume may also mention jobs that have nothing to do with construction.

And if the taxonomy describes only construction, texts from other domains will produce false positives. In construction an "architect" is one thing, in IT quite another. "Operation", "Action", "Object" and many phrases containing these words are examples of ambiguities that have to be resolved.

Simple normalization will not save the father of Russian democracy either. The imagination of the people who write resumes and job postings never ceases to amaze. Unfortunately for us developers, this means that in the general case it is impossible to identify an object from the string describing it alone. Of course, you can try to train some classifier, feeding it the "desired position" field.
And it will even work: "accountant", "secretary", "programmer".
But in resumes people also write "specialist of department N", and whether that means an accountant or a secretary can only be understood from context, from the set of duties performed.

It would seem simple enough: fine, take the context into account, let the classifier also learn from the responsibilities. But no, extracting the set of responsibilities runs into the same problem: ambiguity of interpretation, plus all sorts of strange anaphora.

We decided to apply a probabilistic (Bayesian) approach:

Analyzing the source text, for every extracted string (e.g. "architect", "customer") we determine the set of all its possible interpretations (for example, for "architect" these would be "building architect", "software architect", etc.). The result is a set of sets of interpretations. Then we look for the combination of interpretations, one from each set, whose plausibility is maximal.
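
A rough sketch of this idea in Python (the strings, the candidate sets and the pairwise_plausibility placeholder are invented for illustration; brute-force enumeration as shown here would of course need pruning at real scale):

from itertools import product, combinations

# Hypothetical mapping: surface string -> possible knowledge-base interpretations
candidate_interpretations = {
    "architect": ["building architect", "software architect"],
    "customer": ["retail customer", "B2B client"],
}

def pairwise_plausibility(a, b):
    # Placeholder: in a real system this would come from co-occurrence statistics
    compatible = {("software architect", "B2B client")}
    return 1.0 if (a, b) in compatible or (b, a) in compatible else 0.1

def combination_score(combo):
    # Crude joint plausibility: product of pairwise plausibilities
    score = 1.0
    for a, b in combinations(combo, 2):
        score *= pairwise_plausibility(a, b)
    return score

# Enumerate one interpretation per string and keep the most plausible combination
best = max(product(*candidate_interpretations.values()), key=combination_score)
print(best)  # ('software architect', 'B2B client')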

For example:

  • 2007-2009, OOO Umbrella corp. Senior manager. Working with key clients, finding new clients, negotiations, closing deals;
  • 2001-2007, OOO Horns and hooves. Client relations manager. Advising customers on all issues, checkout, payment in cash or by bank transfer.

In both entries the word "manager" refers to very different positions. From the context we can understand that in the first case the position really is managerial, while in the second it would be more correct to call it a salesperson.

To choose between "client relations manager" and "salesperson", we evaluate the plausibility of combining each of these positions with the skills found at that workplace. The skills themselves may likewise have several candidate interpretations, so the task is to pick the most plausible combination among all the objects found in the text.

The number of objects of different types (skills, positions, industries, cities, etc.) is very large, hundreds of thousands in our knowledge base, so the space a resume lives in is very, very high-dimensional. To learn in such a space, most machine learning algorithms would need an astronomical number of examples.

We decided to cut corners: drastically reduce the number of parameters and use the training results only where we have a sufficient number of examples.

To start with, we began collecting statistics on combinations of feature tuples such as position-industry, position-department, position-skill. Based on these statistics we estimate the plausibility of new, previously unseen combinations of objects and pick the best one.
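
Here is a minimal sketch of how such pair statistics might drive the choice in the manager/salesperson example; the counts, the smoothing constant and the vocabulary size are invented for illustration and this is not our production model:

from collections import Counter

# Invented (position, skill) co-occurrence counts from annotated resumes
pair_counts = Counter({
    ("client relations manager", "advising customers"): 40,
    ("client relations manager", "cash handling"): 35,
    ("salesperson", "advising customers"): 60,
    ("salesperson", "cash handling"): 80,
    ("manager", "negotiations"): 70,
    ("manager", "closing deals"): 50,
})
position_counts = Counter({"client relations manager": 100,
                           "salesperson": 150,
                           "manager": 120})

def plausibility(position, skills, alpha=1.0, vocab_size=1000):
    # Naive-Bayes-style score: product of Laplace-smoothed P(skill | position),
    # so a combination never seen before still gets a small non-zero score
    score = 1.0
    for skill in skills:
        score *= (pair_counts[(position, skill)] + alpha) / \
                 (position_counts[position] + alpha * vocab_size)
    return score

skills = ["advising customers", "cash handling"]
best = max(["client relations manager", "salesperson", "manager"],
           key=lambda p: plausibility(p, skills))
print(best)  # salesperson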

In the example above, the skill set tilts the parser toward the manager in the first case and toward the salesperson in the second.

Simple counters and Bayesian probabilities make it possible to get good results from a small number of examples. Our knowledge base now contains about one hundred thousand resumes annotated by specialists, and that is enough to resolve most ambiguities for common objects.

The output is a JSON object describing a vacancy or a resume in terms of our knowledge base rather than in terms invented by the applicant or the employer.

This representation can be used for exact matching by parameters, for evaluating ("scoring") an applicant's resume, or for matching resume-vacancy pairs.
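
For instance, once both a resume and a vacancy are expressed as knowledge-base IDs, matching no longer depends on fuzzy strings. A toy sketch (the vacancy structure and the overlap formula here are assumptions, not our actual API):

def skill_ids(obj):
    # Collect knowledge-base skill IDs from an object shaped like the JSON below
    return {s["skill_id"] for s in obj.get("skill_ids", [])}

def match_score(resume, vacancy):
    # Jaccard overlap between the candidate's and the vacancy's skills, 0..1
    required, possessed = skill_ids(vacancy), skill_ids(resume)
    if not (required or possessed):
        return 0.0
    return len(required & possessed) / len(required | possessed)

resume = {"skill_ids": [{"skill_id": 91}, {"skill_id": 596}, {"skill_id": 2688}]}
vacancy = {"skill_ids": [{"skill_id": 91}, {"skill_id": 2688}]}
print(match_score(resume, vacancy))  # 2 shared out of 3 distinct -> ~0.67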

We made a simple interface where you can upload a resume (doc, docx, pdf (not scanned images) and other formats) and get its JSON representation. Just don't forget about Federal Law 152-FZ: better not to experiment with resumes containing real personal data :)

For example, this resume:

Pupkin Vasily Lvovich
Omsk
tel +7923123321123

Responsible and hardworking sales Manager.

Experience

  • 2001-2002: Buttonprev. Cleaner of the yard grounds. Performed snow removal work especially conscientiously.
  • From 12.03.2005 to 30.01.2007: shop "Hope". Senior stall vendor for the sale of sausage. Increased sausage sales by 146% in 2 months.
  • 2002-2003: Casement. Tinsmith and repairman in a garage. Modified "Cossacks" and "Mercedes".

Additional information
Physically strong, smart, handsome. Has a Range Rover Sport and a category B driving license.

is converted into the following JSON:

{
"url": null,
"name": "sales Manager.",
"skill_ids": [
{
"cv_skill_id": 5109999,
"skill_name": "liability",
"skill_id": 91,
"skill_level_id": 1,
"skill_level_name": "Basic"
},
{
"cv_skill_id": 5110000,
"skill_name": "hardwork",
"skill_id": 596,
"skill_level_id": 1,
"skill_level_name": "Basic"
},
{
"cv_skill_id": 5109998,
"skill_name": "implementation of housekeeping",
"skill_id": 1474,
"skill_level_id": 1,
"skill_level_name": "Basic"
},
{
"cv_skill_id": 5109997,
"skill_name": "car",
"skill_id": 2688,
"skill_level_id": 2,
"skill_level_name": "Average"

],
"description": "no Description available",
"ts": "2016-09-14 06:00:51.136898",
"jobs": [
{
"employer_id": null,
"description": ": Buttonprev. Cleaner areas in the yard. Especially faithfully performed work on the snow"
"department_id": null,
"company_size_id": null,
"industry_id": null,
"start_date": "2001-01-01",
"cv_job_id": 1812412,
"company_size_name": null,
"employer_name": null,
"job_id": 336,
"department_name": null,
"industry_name": null,
"end_date": "2002-01-01",
"job_name": "Janitor"
},
{
"employer_id": null,
"description": ": Casement Tinsmith-the repairers in the garage. Altered \"Cossacks\" \"Mercedes\". Additionally, Physically strong, intelligent, beautiful. There is a car Range Rover Sport and the rights of category B",
"department_id": null,
"company_size_id": null,
"industry_id": null,
"start_date": "2002-01-01",
"cv_job_id": 1812414,
"company_size_name": null,
"employer_name": null,
"job_id": 268,
"department_name": null,
"industry_name": null,
"end_date": "2003-01-01",
"job_name": "worker centers"
},
{
"employer_id": null,
"description": "12.03.2005 on 30.01.2007 Store \"Hope\". Senior Stallman for the sale of sausage. Increased sales of sausage 146% in 2 months",
"department_id": null,
"company_size_id": null,
"industry_id": 39,
"start_date": "2005-03-12",
"cv_job_id": 1812413,
"company_size_name": null,
"employer_name": null,
"job_id": 354,
"department_name": null,
"industry_name": "Retail trade",
"end_date": "2007-01-30",
"job_name": "Seller"
}
],
"cv_file_id": 16598,
"favorite_industries": [
{
"name": "Retail trade",
"industry_id": 39
}
],
"wage_min": null,
"cv_id": 1698916,
"favorite_areas_data": [
[
{
"id": 198830,
"name": "Russian Federation",
"level": 1
},
{
"id": 10005,
"name": "Siberian Federal district",
"level": 2
},
{
"id": 88,
"name": "Omsk oblast",
"level": 3
},
{
"id": 727,
"name": "Omsk",
"level": 4
}
]
],
"certificate_ids": [
{
"certificate_name": "Driving license category B",
"certificate_id": 118,
"cv_certificate_id": 604445
}
],
"cv_owner": "own",
"favorite_jobs": [
{
"name": "sales Manager",
"job_id": 112
}
],
"cv_status_id": 2,
"filename": "test_resume.odt"
}

Personal data recognition is disabled in this parser, so don't look for it in the JSON.

In my biased opinion, the result is interesting and has plenty of uses, although we are still very far from object recognition accuracy comparable to a human's. We need to grow the knowledge base, train the algorithm on more examples, introduce additional heuristics and possibly specialized classifiers, for instance for industries.

I wonder what methods you use or would use for this? I'm especially curious whether anyone applies a semantic approach à la ABBYY Compreno.
