linkify-it-py


This is a Python port of linkify-it.

Link recognition library with FULL Unicode support. Focused on high-quality link pattern detection in plain text.

Demo

JavaScript Demo

Why it’s awesome:

  • Full unicode support, with astral characters!

  • International domains support.

  • Allows rules extension & custom normalizers.

Install

pip install linkify-it-py

or

conda install -c conda-forge linkify-it-py

Usage examples

Example 1. Simple use

from linkify_it import LinkifyIt


linkify = LinkifyIt()

print(linkify.test("Site github.com!"))
# => True

print(linkify.match("Site github.com!"))
# => [linkify_it.main.Match({
#         'schema': '',
#         'index': 5,
#         'last_index': 15,
#         'raw': 'github.com',
#         'text': 'github.com',
#         'url': 'http://github.com'
#     })]

Example 2. With options

from linkify_it import LinkifyIt
from linkify_it.tlds import TLDS


# Reload full tlds list & add unofficial `.onion` domain.
linkify = (
    LinkifyIt()
    .tlds(TLDS)               # Reload with full tlds list
    .tlds("onion", True)      # Add unofficial `.onion` domain
    .add("git:", "http:")     # Add `git:` protocol as "alias"
    .add("ftp:", None)        # Disable `ftp:` protocol
    .set({"fuzzy_ip": True})  # Enable IPs in fuzzy links (without schema)
)
print(linkify.test("Site tamanegi.onion!"))
# => True

print(linkify.match("Site tamanegi.onion!"))
# => [linkify_it.main.Match({
#         'schema': '',
#         'index': 5,
#         'last_index': 19,
#         'raw': 'tamanegi.onion',
#         'text': 'tamanegi.onion',
#         'url': 'http://tamanegi.onion'
#     })]

Example 3. Add Twitter mentions handler

import re

from linkify_it import LinkifyIt


linkify = LinkifyIt()

def validate(obj, text, pos):
    tail = text[pos:]

    if not obj.re.get("twitter"):
        obj.re["twitter"] = re.compile(
            "^([a-zA-Z0-9_]){1,15}(?!_)(?=$|" + obj.re["src_ZPCc"] + ")"
        )
    found = obj.re["twitter"].search(tail)
    if found:
        # The linkifier allows punctuation chars before the prefix,
        # but "@@mention" is not a valid mention.
        if pos >= 2 and text[pos - 2] == "@":
            return False
        return len(found.group())
    return 0

def normalize(obj, match):
    match.url = "https://twitter.com/" + re.sub(r"^@", "", match.url)

linkify.add("@", {"validate": validate, "normalize": normalize})
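A quick, hedged check of the handler above; the handle @user_name is just an illustrative value, and the Match fields follow the shape shown in Example 1:

print(linkify.test("Mention @user_name here"))
# => True

print(linkify.match("Mention @user_name here")[0].url)
# => 'https://twitter.com/user_name'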

API

API documentation

LinkifyIt(schemas, options)

Creates a new linkifier instance with optional additional schemas.

By default understands:

  • http(s)://..., ftp://..., mailto:... & //... links

  • “fuzzy” links and emails (google.com, foo@bar.com).

schemas is a dict, where each key/value pair describes a protocol/rule (a combined sketch follows the options list below):

  • key - link prefix (usually a protocol name with : at the end, skype: for example). linkify-it-py makes sure that the prefix is not preceded by an alphanumeric char.

  • value - rule to check the tail after the link prefix

    • str

      • just alias to existing rule

    • dict

      • validate - either a re.Pattern (must start with ^ and must not include the link prefix itself), or a validator function which, given the arguments self, text and pos, returns the length of the match in text starting at index pos. pos is the index right after the link prefix. self can be used to access the linkify object and cache data.

      • normalize - optional function to normalize the text and url of a matched result (for example, for Twitter mentions).

options:

  • fuzzy_link - recognize URLs without the http(s):// prefix. Default True.

  • fuzzy_ip - allow IPs in the fuzzy links above. Can conflict with some text, such as version numbers. Default False.

  • fuzzy_email - recognize emails without mailto: prefix. Default True.

  • --- - set True to terminate a link at --- (when it's treated as a long dash). Default False.
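For example, a minimal sketch passing both arguments at construction time; the skype: rule and its regex below are illustrative assumptions, not a built-in rule:

import re

from linkify_it import LinkifyIt

linkify = LinkifyIt(
    # schemas: key is the link prefix, value is the rule for its tail.
    {"skype:": {"validate": re.compile(r"^[a-zA-Z][a-zA-Z0-9_.,-]{5,31}")}},
    # options: same keys as accepted by .set()
    {"fuzzy_ip": True},
)

print(linkify.test("Ping me at skype:echo123"))
# => True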

.test(text)

Searches for a linkifiable pattern and returns True on success, or False on failure.

.pretest(text)

Quick check whether a link could exist in the text. Can be used to skip more expensive .test() calls. Returns False if a link definitely cannot be found, True if a .test() call is needed to know for sure.
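For example, a small sketch of the intended pattern:

from linkify_it import LinkifyIt

linkify = LinkifyIt()

for text in ("no links in here", "see github.com"):
    # Cheap pre-check first; run the full scan only when it may pay off.
    if linkify.pretest(text) and linkify.test(text):
        print(linkify.match(text))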

.test_schema_at(text, name, position)

Similar to .test() but checks only the given schema's tail exactly at the given position (the index right after the schema prefix). Returns the length of the found pattern (0 on fail).
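A small sketch; position 11 here is the index right after the http: prefix in this particular string:

from linkify_it import LinkifyIt

linkify = LinkifyIt()

text = "Visit http://github.com now"
# "http:" occupies indexes 6-10, so its tail starts at index 11.
print(linkify.test_schema_at(text, "http:", 11))
# => length of the matched "//github.com" tail (0 if nothing matches)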

.match(text)

Returns a list of found link matches, or None if nothing is found.

Each match has:

  • schema - link schema, can be empty for fuzzy links, or // for protocol-neutral links.

  • index - offset of matched text

  • last_index - index of the next char after the match end

  • raw - matched text

  • text - normalized text

  • url - link, generated from matched text
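The offsets make it easy to rebuild the input with anchors around each match; a short sketch (linkify_html is a hypothetical helper, not part of the library):

from linkify_it import LinkifyIt

linkify = LinkifyIt()

def linkify_html(text):
    result, last = [], 0
    for m in linkify.match(text) or []:
        result.append(text[last:m.index])
        result.append('<a href="{}">{}</a>'.format(m.url, m.text))
        last = m.last_index
    result.append(text[last:])
    return "".join(result)

print(linkify_html("Site github.com!"))
# => 'Site <a href="http://github.com">github.com</a>!'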

.match_at_start(text)

Checks if a match exists at the start of the string. Returns a Match (see the docs for .match(text)) or None if there is no URL at the start. Doesn't work with fuzzy links.
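For example (assuming the snake_case name used by the Python port):

from linkify_it import LinkifyIt

linkify = LinkifyIt()

print(linkify.match_at_start("http://github.com is a site"))
# => Match covering 'http://github.com'

print(linkify.match_at_start("Text before http://github.com"))
# => None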

.tlds(list_tlds, keep_old=False)

Loads (or merges) a new TLDs list. These are needed for fuzzy links (without a schema) to avoid false positives. By default:

  • 2-letter root zones are ok.

  • biz|com|edu|gov|net|org|pro|web|xxx|aero|asia|coop|info|museum|name|shop|рф are ok.

  • encoded (xn--...) root zones are ok.

If that's not enough, you can reload the defaults with a more detailed zones list.

.add(key, value)

Adds a new schema to the schemas object. As described in the constructor definition, key is a link prefix (skype:, for example), and value is either a str that aliases another schema, or a dict with validate and optionally normalize definitions. To disable an existing rule, use .add(key, None).

.set(options)

Overrides default options. Properties that are not passed remain unchanged.
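For example:

linkify.set({"fuzzy_email": False})  # other options keep their current values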

License

MIT