W-shingling

Quality 0.50 · 1 view · Updated 2 months ago

title: "W-shingling" type: doc version: 1 created: 2026-02-28 author: "Wikipedia contributors" status: active scope: public tags: ["natural-language-processing"] topic_path: "technology/computing" source: "https://en.wikipedia.org/wiki/W-shingling" license: "CC BY-SA 4.0" wikipedia_page_id: 0 wikipedia_revision_id: 0

In natural language processing a w-shingling is a set of unique shingles (therefore n-grams) each of which is composed of contiguous subsequences of tokens within a document, which can then be used to ascertain the similarity between documents. The symbol w denotes the quantity of tokens in each shingle selected, or solved for.

The document, "a rose is a rose is a rose" can therefore be maximally tokenized as follows:

:(a,rose,is,a,rose,is,a,rose)

The set of all contiguous sequences of 4 tokens (Thus 4=n, thus 4-grams) is

:{ (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is), (a,rose,is,a), (rose,is,a,rose) }

Which can then be reduced, or maximally shingled in this particular instance to

:{ (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is) }.

Resemblance

For a given shingle size, the degree to which two documents A and B resemble each other can be expressed as the ratio of the magnitudes of their shinglings' intersection and union, or

:r(A,B)=

where |A| is the size of set A. The resemblance is a number in the range [0,1], where 1 indicates that two documents are identical. This definition is identical with the Jaccard coefficient describing similarity and diversity of sample sets.

References

Does not yet use the term "shingling".

::callout[type=info title="Wikipedia Source"] This article was imported from Wikipedia and is available under the Creative Commons Attribution-ShareAlike 4.0 License. Content has been adapted to SurfDoc format. Original contributors can be found on the article history page. ::