Updating documentation. Adding extra_skill_patterns.jsonl to allow users to simply add new skill patterns.

Kabir Khan 2019-04-23 12:29:00 -07:00
Parent 508fb43c6b
Commit 9b26c97f0d
5 changed files with 233 additions and 29 deletions

@@ -4,10 +4,10 @@ The Skills Extractor is a Named Entity Recognition (NER) model that takes text a
 ## Definitions
-### What is a Cognitive Skill
+### What is a Cognitive Skill?
 A Cognitive Skill is a Feature of Azure Search designed to Augment data in a search index.
-### What is a Skill in terms of the Skills Extractor Service?
+### What is a Skill in terms of the Skills Extractor?
 A Skill is a Technical Concept/Tool or a Business related/Personal attribute.
 Example skills:

@@ -10,8 +10,193 @@ Before running this sample, you must have the following:
* Install the [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest). This article requires the Azure CLI version 2.0 or later. Run `az --version` to find the version you have.
You can also use the [Azure Cloud Shell](https://shell.azure.com/bash).
* If you want to run the function locally, install [Azure Function Core Tools version 2.x](https://docs.microsoft.com/en-us/azure/azure-functions/functions-run-local#v2)
## Deploy Docker Container to Azure Functions using Azure CLI
# Running the Function App locally
If you want to skip this and deploy the prebuilt container to an Azure Function instead, skip ahead to [Deploy Docker Container to Azure Functions using Azure CLI](#deploy-docker-container-to-azure-functions-using-azure-cli)
## Clone this repo
```
git clone https://github.com/Microsoft/SkillsExtractorCognitiveSearch
```
## Start the Function
```
func host start
```
When the Functions host starts, it writes something like the following output, which has been truncated for readability:
```output
%%%%%%
%%%%%%
@ %%%%%% @
@@ %%%%%% @@
@@@ %%%%%%%%%%% @@@
@@ %%%%%%%%%% @@
@@ %%%% @@
@@ %%% @@
@@ %% @@
%%
%
...
Content root path: C:\functions\MyFunctionProj
Now listening on: http://0.0.0.0:7071
Application started. Press Ctrl+C to shut down.
Http Functions:
extract_skills: [POST] http://localhost:7071/api/extract_skills
get_skill: [GET] http://localhost:7071/api/skills/{skill_id}
[8/27/2018 10:38:27 PM] Host started (29486ms)
[8/27/2018 10:38:27 PM] Job host started
```
## Test the Function locally
Let's make sure we can process Azure Search Records as expected.
Run the following POST request to test your function locally.
```http
POST http://localhost:7071/api/extract_skills?skill_property=name
Content-Type: application/json
```
```json
{
"values": [
{
"recordId": "a3",
"data": {
"text": "Azure Machine Learning team is a fast-paced product group within the Microsoft Artificial Intelligence & Research organization. We are building the Machine Learning development platform that will make it easy for all data scientists and AI Developers to create and deploy robust, scalable and highly available machine learning solutions on the cloud, using the best of the open source ecosystem and innovation from inside the company and the latest breakthroughs in research We are looking for a Principal Program Manager who is passionate about delivering a highly available and reliable cloud platform that will transform the data science and machine learning experience. Come to build out and lead the Artificial Intelligence/Machine/Deep Learning investments, define innovative experiences for data science, and extend the state of the art in deep learning performance and efficiency in the industry. We are looking for a creative and technically minded product visionary with expertise in building and shipping large scale user experiences. A strong performer that can lead our UX efforts and drive cross-team initiatives to address major customer needs. Responsibilities In this role you will be responsible for leading the development of our Azure Machine Learning Web UX and drive our journey to modernize the appearance combining the different AML properties into one portal UX. Youll help define the roadmap and the vision for our next generation of data science web experiences. You will be working closely with design and research counterparts to invent great data science experiences and lead the development team to deliver these modern experiences to our highly engaged customers. You will work with teams across the org to make this experience simple, reliable, fast and coherent with the other Microsoft AI products. 
Qualifications Have experience in building and leading teams and driving large cross team efforts.Have a track record in building and shipping products with an emphasis on UX.Are equally comfortable articulating the big picture and refining the small details.Are customer and data obsessed, always learning and refining your features to deliver maximum business impact. Have a BS Degree in Engineering, Computer Science, or similar degree. 7+ years of Product or Program Management experience. #AIPLATFORM##AIPLATREF#",
"language": "en"
}
}
]
}
```
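For scripted testing, the same request body can be assembled programmatically. This is a minimal stdlib-only sketch of the `values`/`recordId`/`data` shape the endpoint expects; the sample text here is shortened and hypothetical:

```python
import json

# Minimal record in the Cognitive Search custom-skill request shape:
# a top-level "values" list, each entry with a recordId and a data payload.
record = {
    "recordId": "a3",
    "data": {
        "text": "We are looking for a Principal Program Manager with UX expertise.",
        "language": "en",
    },
}
body = json.dumps({"values": [record]})

parsed = json.loads(body)
print(parsed["values"][0]["recordId"])          # a3
print(parsed["values"][0]["data"]["language"])  # en
```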
You should see a response that looks like this:
```json
{
"values": [
{
"recordId": "a3",
"data": {
"skills": [
"Artificial intelligence",
"Azure Machine Learning",
"Business",
"Computer science",
"Data science",
"Deep learning",
"Design",
"Engineering",
"Machine learning"
]
},
"errors": null,
"warnings": null
}
]
}
```
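A downstream consumer can pull the skills list out of this response shape with plain JSON parsing; a minimal stdlib-only sketch (the response literal here is abbreviated from the example above):

```python
import json

# Abbreviated version of the custom-skill response shown above
response_text = """
{"values": [{"recordId": "a3",
             "data": {"skills": ["Azure Machine Learning", "Deep learning"]},
             "errors": null, "warnings": null}]}
"""

response = json.loads(response_text)
# Each record's extracted skills live under data.skills
skills = response["values"][0]["data"]["skills"]
print(skills)  # ['Azure Machine Learning', 'Deep learning']
```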
It looks like we missed one potential skill: **UX**
Let's add **UX** as a skill we want to extract.
## Add a new Skill
The main `data/skill_patterns.jsonl` file is built programmatically from `data/skills.json`,
so we won't edit it directly. Instead, we'll add any new patterns to `data/extra_skill_patterns.jsonl`.
Open the `data/extra_skill_patterns.jsonl` file and add the following lines.
```json
{"label":"SKILL|ux","pattern":[{"LOWER":"ux"}]}
{"label":"SKILL|ux","pattern":[{"LOWER":"user"}, {"LOWER":"experience"}]}
```
This pattern syntax is specific to spaCy's rule-based matching. If you want to learn more about the pattern syntax, the spaCy docs have a great user guide here:
https://spacy.io/usage/rule-based-matching#entityruler
> JSONL (files like `*.jsonl`) is an extension of JSON where each line is a separate JSON object.
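Because each line is standalone JSON, loading the file only needs a line-by-line parse. A minimal stdlib-only sketch of what this looks like (the service itself reads the file with `srsly.read_jsonl`):

```python
import json

# Two pattern lines as they would appear in data/extra_skill_patterns.jsonl
jsonl_text = (
    '{"label":"SKILL|ux","pattern":[{"LOWER":"ux"}]}\n'
    '{"label":"SKILL|ux","pattern":[{"LOWER":"user"}, {"LOWER":"experience"}]}\n'
)

# JSONL: parse each non-empty line as an independent JSON object
patterns = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

print(len(patterns))                       # 2
print(patterns[0]["label"])                # SKILL|ux
print(patterns[1]["pattern"][1]["LOWER"])  # experience
```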
## Restart the Function and test your new patterns
End the running host process and start the Functions host again with
```
func host start
```
Run the same POST request as before to `http://localhost:7071/api/extract_skills?skill_property=name`
Your response should now include **UX** in the list of skills:
```json
{
"values": [
{
"recordId": "a3",
"data": {
"skills": [
"Artificial intelligence",
"Azure Machine Learning",
"Business",
"Computer science",
"Data science",
"Deep learning",
"Design",
"Engineering",
"Machine learning",
"UX"
]
},
"errors": null,
"warnings": null
}
]
}
```
## Build a Docker Image from the Dockerfile
In the root folder, run the `docker build` command, providing your Docker Hub user id, image name, and tag name.
For instance:
```
docker build . -t <docker_id>/skills-extractor-cognitive-search:v1.0.0
```
## Push the built image to Docker Hub
You need to be logged in to Docker Hub to push this image.
Log in to Docker Hub:
```
docker login -u <docker_id> -p <dockerhub_password>
```
Push your image to Docker Hub.
For instance:
```
docker push <docker_id>/skills-extractor-cognitive-search:v1.0.0
```
Next we'll deploy this container to Azure Functions using the Azure CLI
# Deploy Docker Container to Azure Functions using Azure CLI
The first step is to deploy your own instance of the Skills Extractor Azure Function.
You can build and host the container yourself from this repo or use the prebuilt container image: `mcr.microsoft.com/wwllab/skills/skills-extractor-cognitive-search`


@@ -17,7 +17,7 @@ skills_extractor = SkillsExtractor(nlp)
 async def extract_from_record(
-    doc: RecordRequest, skill_property: str = "id"
+    record: RecordRequest, skill_property: str = "id"
 ):
     """Extract Skills from a single RecordRequest"""
     extracted_skills = skills_extractor.extract_skills(record.data.text)
@@ -29,7 +29,7 @@ async def extract_from_record(
         skills.add(skill_id)
     return {
-        "recordId": doc.recordId,
+        "recordId": record.recordId,
         "data": {"skills": sorted(list(skills))},
         "warnings": None,
         "errors": None,
@@ -96,6 +96,6 @@ async def main(req: func.HttpRequest) -> func.HttpResponse:
     logging.info(f"Extracting Skills from {len(body.values)} Records.")
     response_headers = {"Content-Type": "application/json"}
-    values_res = await extract_from_docs(body.values, skill_property)
+    values_res = await extract_from_records(body.values, skill_property)
     return func.HttpResponse(values_res.json(), headers=response_headers)


@@ -2,6 +2,7 @@
 # Licensed under the MIT License.
 from collections import defaultdict
+import itertools
 from pathlib import Path
 import srsly
@@ -19,8 +20,9 @@ class SkillsExtractor:
         self.skills = self._get_skills()
         patterns = self._build_patterns(self.skills)
+        extra_patterns = self._get_extra_skill_patterns()
         ruler = EntityRuler(nlp, overwrite_ents=True)
-        ruler.add_patterns(patterns)
+        ruler.add_patterns(itertools.chain(patterns, extra_patterns))
         if not self.nlp.has_pipe("skills_ruler"):
             self.nlp.add_pipe(ruler, name="skills_ruler")
@@ -29,6 +31,12 @@ class SkillsExtractor:
         skills_path = self.data_path/"skills.json"
         skills = srsly.read_json(skills_path)
         return skills
+
+    def _get_extra_skill_patterns(self):
+        """Load extra user added skill patterns"""
+        extra_patterns_path = self.data_path/"extra_skill_patterns.jsonl"
+        extra_skill_patterns = srsly.read_jsonl(extra_patterns_path)
+        return extra_skill_patterns
 
     def _skill_pattern(self, skill: str, split_token: str = None):
         """Create a single skill pattern"""
@@ -38,10 +46,11 @@ class SkillsExtractor:
         else:
             split = skill.split()
         for b in split:
-            if b.upper() == skill:
-                pattern.append({"TEXT": b})
-            else:
-                pattern.append({"LOWER": b.lower()})
+            if b:
+                if b.upper() == skill:
+                    pattern.append({"TEXT": b})
+                else:
+                    pattern.append({"LOWER": b.lower()})
         return pattern
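The `_skill_pattern` logic above can be sketched as a standalone function (same LOWER/TEXT convention, including the new empty-token guard; the name `skill_pattern` is just for illustration):

```python
def skill_pattern(skill: str, split_token: str = None) -> list:
    """Build a spaCy-style token pattern from a skill name.

    A token equal to the whole skill uppercased (e.g. "UX") matches on
    exact TEXT; every other token matches case-insensitively via LOWER.
    Empty tokens produced by splitting are skipped.
    """
    pattern = []
    tokens = skill.split(split_token) if split_token else skill.split()
    for tok in tokens:
        if not tok:
            continue  # guard against empty strings from the split
        if tok.upper() == skill:
            pattern.append({"TEXT": tok})
        else:
            pattern.append({"LOWER": tok.lower()})
    return pattern

print(skill_pattern("Machine Learning"))  # [{'LOWER': 'machine'}, {'LOWER': 'learning'}]
print(skill_pattern("UX"))                # [{'TEXT': 'UX'}]
```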
@@ -102,24 +111,6 @@ class SkillsExtractor:
             if "|" in ent.label_:
                 ent_label, skill_id = ent.label_.split("|")
                 if ent_label == "SKILL" and skill_id:
-                    skill_info = self.skills[skill_id]
-                    sources = skill_info['sources']
-                    # Some sources have better Skill Descriptions than others.
-                    # This is a simple heuristic for cascading through the sources
-                    # to pick the best description available per skill
-                    main_source = sources[0]
-                    for source in sources:
-                        if source["sourceName"] == "Github Topics":
-                            main_source = source
-                            break
-                        elif source["sourceName"] == "Microsoft Academic Topics":
-                            main_source = source
-                            break
-                        elif source["sourceName"] == "Stackshare Skills":
-                            main_source = source
-                            break
                     found_skills[skill_id]["matches"].append(
                         {
                             "start": ent.start_char,
@@ -128,6 +119,34 @@ class SkillsExtractor:
                             "text": ent.text,
                         }
                     )
+                    try:
+                        skill_info = self.skills[skill_id]
+                        sources = skill_info['sources']
+                        # Some sources have better Skill Descriptions than others.
+                        # This is a simple heuristic for cascading through the sources
+                        # to pick the best description available per skill
+                        main_source = sources[0]
+                        for source in sources:
+                            if source["sourceName"] == "Github Topics":
+                                main_source = source
+                                break
+                            elif source["sourceName"] == "Microsoft Academic Topics":
+                                main_source = source
+                                break
+                            elif source["sourceName"] == "Stackshare Skills":
+                                main_source = source
+                                break
+                    except KeyError:
+                        # This happens when a pattern defined in data/extra_skill_patterns.jsonl
+                        # is matched. The skill is not added to data/skills.json so there's no
+                        # extra metadata about the skill from an established source.
+                        sources = []
+                        main_source = {
+                            "displayName": ent.text,
+                            "shortDescription": "",
+                            "longDescription": ""
+                        }
                     keys = ["displayName", "shortDescription", "longDescription"]
                     for k in keys: