DEPRECATED - A JavaScript library for parsing metadata on a web page.

farhanpatel bf32095533 Update version number		2019-06-25 11:53:04 -07:00
tests	Add support for language detection.	2018-12-18 12:51:14 -08:00
.babelrc	Adds webpack scripts and config for generating JS that will work on iOS	2016-09-15 14:12:15 -04:00
.eslintignore	Fixes eslint warnings and removed bin/ from patch	2016-09-26 16:19:47 -04:00
.eslintrc	Clean package.json and .eslintrc	2016-06-29 09:03:46 -07:00
.gitignore	Remove Fathom 1.0 Dependency fixes #90	2017-08-10 14:03:07 -04:00
CODE_OF_CONDUCT.md	Add Mozilla Code of Conduct file	2019-03-29 14:59:18 -07:00
LICENSE	Initial commit	2016-06-20 17:50:32 -04:00
README.md	Update docs to reflect using domino instead of jsdom fixes #105	2018-11-02 13:37:16 -04:00
circle.yml	Remove coveralls access token fixes #101	2018-04-19 16:31:11 -04:00
karma.conf.js	Clean package.json and .eslintrc	2016-06-29 09:03:46 -07:00
package.json	Update version number	2019-06-25 11:53:04 -07:00
parser.js	Simplify icon score calculation by just using one demension.	2019-06-25 11:52:26 -07:00
url-utils.js	Remove Fathom 1.0 Dependency fixes #90	2017-08-10 14:03:07 -04:00
webpack.config.js	Remove Fathom 1.0 Dependency fixes #90	2017-08-10 14:03:07 -04:00

README.md

Page Metadata Parser

A JavaScript library for parsing metadata in web pages.

Overview

Purpose

The purpose of this library is to be able to find a consistent set of metadata for any given web page. Each individual kind of metadata has many rules which define how it may be located. For example, a description of a page could be found in any of the following DOM elements:

<meta name="description" content="A page's description"/>

<meta property="og:description" content="A page's description" />

Because different web pages represent their metadata in any number of possible DOM elements, the Page Metadata Parser collects rules for different ways a given kind of metadata may be represented and abstracts them away from the caller.

The output of the metadata parser for the above example would be

{description: "A page's description"}

regardless of which particular kind of description tag was used.

Supported schemas

This library employs parsers for the following formats:

opengraph

twitter

meta tags

Requirements

This library is meant to be used either in the browser (embedded directly in a website or in a browser add-on/extension) or on a server (Node.js).

The parser depends only on the Node URL library or the Browser URL library.

Each function expects to be passed a Document object, which may be created either directly by a browser or on the server using a Document compatible object, such as that provided by domino.
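The standard URL class is available as a global in both modern browsers and Node.js, so URL handling can be written once for both environments. As a minimal sketch of the kind of work it enables, the snippet below resolves relative URLs against the URL of the page. The helper name toAbsoluteUrl and the example.com addresses are hypothetical and are not part of this library's API.

```javascript
// Hypothetical helper (not part of page-metadata-parser): resolve a possibly
// relative URL found in a page against the URL of the page itself.
function toAbsoluteUrl(pageUrl, candidate) {
  // The two-argument URL constructor resolves `candidate` relative to `pageUrl`.
  return new URL(candidate, pageUrl).href;
}

const pageUrl = 'https://example.com/articles/today.html';

const rootRelative = toAbsoluteUrl(pageUrl, '/favicon.ico');
const pathRelative = toAbsoluteUrl(pageUrl, 'images/preview.png');
const alreadyAbsolute = toAbsoluteUrl(pageUrl, 'https://cdn.example.com/icon.png');

console.log(rootRelative);    // https://example.com/favicon.ico
console.log(pathRelative);    // https://example.com/articles/images/preview.png
console.log(alreadyAbsolute); // https://cdn.example.com/icon.png
```

An absolute candidate is returned unchanged, so the same helper can be applied to every URL-valued field without checking its form first.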

Usage

Installation

npm install --save page-metadata-parser

Usage in the browser

The library can be built to be deployed directly to a modern browser by using

npm run bundle

and embedding the resultant js file directly into a page like so:

<script src="page-metadata-parser.bundle.js" type="text/javascript"></script>

<script>

  const metadata = metadataparser.getMetadata(window.document, window.location);

  console.log("The page's title is ", metadata.title);

</script>

Usage in node

To use the library in node, you must first construct a DOM API compatible object from an HTML string, for example:

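const {getMetadata} = require('page-metadata-parser');
const domino = require('domino');

// Top-level await is not available in CommonJS; global fetch needs Node.js 18+.
async function main() {
  const url = 'https://github.com/mozilla/page-metadata-parser';
  const response = await fetch(url);
  const html = await response.text();
  const doc = domino.createWindow(html).document;
  const metadata = getMetadata(doc, url);
  console.log(metadata);
}

main().catch(console.error);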

Metadata Rules

Rules

A single rule tells the parser about one kind of DOM node in which a specific piece of content may be found.

For instance, a rule to parse the title of a page found in a DOM tag like this:

<meta property="og:title" content="Page Title" />

would be represented by the following rule:

['meta[property="og:title"]', element => element.getAttribute('content')]

A rule consists of two parts, a query selector compatible string which is used to look up the target content, and a callable which receives an element and returns the desired content from that element.
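To make the two parts concrete, the sketch below takes the rule shown above apart and applies its callable by hand. The metaElement object is a hypothetical stand-in for the <meta> element that the selector would match; with a real Document it would come from doc.querySelector(selector).

```javascript
// The rule from the text: a [selector, callable] pair.
const ogTitleRule = ['meta[property="og:title"]', element => element.getAttribute('content')];

// Destructure the rule into its two parts.
const [selector, getContent] = ogTitleRule;

// Hypothetical stand-in for <meta property="og:title" content="Page Title" />.
// It implements only the one method the callable uses.
const metaElement = {
  getAttribute: name => (name === 'content' ? 'Page Title' : null),
};

// The callable extracts the desired content from the matched element.
const title = getContent(metaElement);

console.log(selector); // meta[property="og:title"]
console.log(title);    // Page Title
```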

Many rules together form a Rule Set. This library applies each rule to a page and chooses the 'best' result. The order in which rules are defined indicates their preference, with the first rule being the most preferred. A Rule Set can be defined like so:

const titleRules = {
  rules: [
    ['meta[property="og:title"]', element => element.getAttribute('content')],
    ['title', element => element.text],
  ]
};

In this case, the OpenGraph title will be preferred over the title tag.
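The preference order can be illustrated with a simplified sketch that tries each rule in turn and returns the first non-empty value. This is an illustration only, not the library's implementation, and the way the library picks the 'best' result may differ. The names applyRuleSet and makeStubDocument are hypothetical, the stub stands in for a real Document, and the callables are assumed to receive the matched element directly, as described in the Rules section.

```javascript
// Hypothetical stand-in for a Document: maps a selector to the elements it
// matches. A real Document (browser or domino) provides querySelectorAll itself.
function makeStubDocument(elementsBySelector) {
  return {
    querySelectorAll: selector => elementsBySelector[selector] || [],
  };
}

// Simplified sketch: try each rule in order, return the first non-empty value.
function applyRuleSet(doc, ruleSet) {
  for (const [selector, getValue] of ruleSet.rules) {
    for (const element of doc.querySelectorAll(selector)) {
      const value = getValue(element);
      if (value) {
        return value;
      }
    }
  }
  return undefined; // no rule matched
}

const titleRules = {
  rules: [
    ['meta[property="og:title"]', element => element.getAttribute('content')],
    ['title', element => element.text],
  ],
};

// Element stand-ins that implement only what the callables use.
const ogElement = {getAttribute: name => (name === 'content' ? 'OpenGraph Title' : null)};
const titleElement = {text: 'Plain Title'};

const pageWithBoth = makeStubDocument({
  'meta[property="og:title"]': [ogElement],
  title: [titleElement],
});
const pageWithTitleOnly = makeStubDocument({title: [titleElement]});

console.log(applyRuleSet(pageWithBoth, titleRules));      // OpenGraph Title
console.log(applyRuleSet(pageWithTitleOnly, titleRules)); // Plain Title
```

When both elements are present the earlier rule wins, and the later rule serves as a fallback for pages without OpenGraph tags.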

This library includes many rules for each piece of metadata, which should allow it to find metadata consistently across many types of pages. It is meant to be a community-driven effort, so if no rule finds a piece of information on a particular website, contributors are encouraged to add new rules!

Built-in Rule Sets

This library provides rule sets to find the following forms of metadata in a page:

Field	Description
description	A user displayable description for the page.
icon	A URL which contains an icon for the page.
image	A URL which contains a preview image for the page.
keywords	The meta keywords for the page.
provider	A string representation of the sub and primary domains.
title	A user displayable title for the page.
type	The type of content as defined by opengraph.
url	A canonical URL for the page.

To use a single rule set to find a particular piece of metadata within a page, pass getMetadata a Document object, a URL, and an object that maps the field name to that rule set. It applies each rule in the rule set until one matches and returns the result under that field name.

Example:

const {getMetadata, metadataRuleSets} = require('page-metadata-parser');

// getMetadata returns an object keyed by field name, so pull the title out of it.
const {title: pageTitle} = getMetadata(doc, url, {title: metadataRuleSets.title});

Extending a single rule

To add your own custom rule to an existing rule set, push it onto that rule set's rules array. A rule pushed this way comes last, so it has the lowest preference.

Example:

const {getMetadata, metadataRuleSets} = require('page-metadata-parser');

const customDescriptionRuleSet = metadataRuleSets.description;

// Push the rule itself (a [selector, callable] pair), not an array wrapping it.
customDescriptionRuleSet.rules.push(
  ['meta[name="customDescription"]', element => element.getAttribute('content')]
);

const pageDescription = getMetadata(doc, url, {description: customDescriptionRuleSet});
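Because customDescriptionRuleSet above refers to the same object as metadataRuleSets.description, pushing onto its rules array also changes the built-in rule set for every later call. If that is not wanted, one option is to copy the rule set before extending it. In the sketch below, builtInDescription is a hypothetical stand-in with the same {rules: [...]} shape, and its two rules and their order are assumptions made for illustration.

```javascript
// Hypothetical stand-in for metadataRuleSets.description. The real rule set
// comes from require('page-metadata-parser'); these rules are assumptions.
const builtInDescription = {
  rules: [
    ['meta[property="og:description"]', element => element.getAttribute('content')],
    ['meta[name="description"]', element => element.getAttribute('content')],
  ],
};

const customRule = [
  'meta[name="customDescription"]',
  element => element.getAttribute('content'),
];

// Copy the rule set and its rules array, then append the custom rule to the
// copy so the original rule set is left untouched.
const customDescriptionRuleSet = {
  ...builtInDescription,
  rules: [...builtInDescription.rules, customRule],
};

console.log(builtInDescription.rules.length);       // 2
console.log(customDescriptionRuleSet.rules.length); // 3
```

Placing customRule first in the copied array instead of last would give it the highest preference rather than the lowest.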

Using all rules

To parse all of the available metadata on a page using all of the rule sets provided in this library, call getMetadata with the Document and the page URL and omit the rule set argument.

const {getMetadata} = require('page-metadata-parser');

const pageMetadata = getMetadata(doc, url);