Write an Extractor

This document walks through creating remarkable.RemarkableHighlightExtractorExample, an example extractor.

To start, let’s make sure we have a clear goal.

Note

We want to create a new extractor that outputs a list of highlights for a given document:

remarking (run|persist) json --extractors remarkable_example my_book_name

This should run our extractor on my_book_name and output the highlights our extractor finds.

In order to achieve our goal, we need to implement the remarking.HighlightExtractor interface.

The remarking.HighlightExtractor Interface

The remarking.HighlightExtractor class has a fairly straightforward interface for us to implement.
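
To make the shape of that interface concrete, here is a minimal stand-in written with Python's abc module. It is only a sketch: HighlightExtractorSketch, IncompleteExtractor, and can_instantiate are hypothetical names, and the real remarking.HighlightExtractor may be defined differently. It shows the two methods this guide implements and, assuming they are abstract, how Python refuses to instantiate a subclass that skips one of them.

```python
from abc import ABC, abstractmethod
from typing import Any, List


class HighlightExtractorSketch(ABC):
    """Hypothetical stand-in for remarking.HighlightExtractor (assumed shape)."""

    @classmethod
    @abstractmethod
    def get_extractor_instance_data(cls) -> List[Any]:
        """Describe the named variants of this extractor."""

    @abstractmethod
    def get_highlights(self, working_path: str, document: Any) -> List[Any]:
        """Return the highlights found in one document."""


class IncompleteExtractor(HighlightExtractorSketch):
    """Implements only one of the two required methods."""

    @classmethod
    def get_extractor_instance_data(cls) -> List[Any]:
        return []


def can_instantiate(extractor_class: type) -> bool:
    """Report whether the class can be created with no arguments."""
    try:
        extractor_class()
    except TypeError:
        # Raised by abc when an abstract method is left unimplemented.
        return False
    return True


print(can_instantiate(IncompleteExtractor))
```

With this stand-in, IncompleteExtractor cannot be created until get_highlights is added, which is why the example below implements both methods from the start.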

Let’s start by creating a file called remarking/highlight_extractor/remarkable_highlight_extractor_example.py and adding this to the file:

from typing import List

from remarking import HighlightExtractor
from remarking import ExtractorData
from remarking import Document
from remarking import Highlight


class RemarkableHighlightExtractorExample(HighlightExtractor):
    """ Extracts highlights from reMarkable documents. """

    @classmethod
    def get_extractor_instance_data(cls) -> List[ExtractorData]:
        """ Return a list of :class:`ExtractorData` instaces representing
            different run options for the extractor.
        """
        return [
            ExtractorData(
                extractor_name="remarkable_example",
                instance=cls(),
                description=cls.__doc__
            )
        ]

    def get_highlights(self, working_path: str, document: Document) -> List[Highlight]:
        """ Retrieve all highlights for document. """
        return []

Let’s check if remarking has found our extractor:

remarking list extractors

This should show our extractor named remarkable_example and produce no errors.

Let’s dive into implementing the methods a bit more now.

HighlightExtractor.get_extractor_instance_data

The HighlightExtractor.get_extractor_instance_data method returns a list of ExtractorData instances. remarking uses each instance to offer the extractor through remarking list extractors and the --extractors option for output writers.

An entry in remarking list extractors will be placed for each ExtractorData entry we return.
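
One way to picture this is a lookup from extractor name to entry. The sketch below is an illustration, not remarking's actual registration code: ExtractorDataSketch is a hypothetical stand-in that carries only the three fields used in this guide, index_by_name is a made-up helper, and rejecting duplicate names is an assumption about what a sensible registry might do.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ExtractorDataSketch:
    """Hypothetical stand-in with the three fields used in this guide."""
    extractor_name: str
    instance: Any
    description: str


def index_by_name(entries: List[ExtractorDataSketch]) -> Dict[str, ExtractorDataSketch]:
    """Build a name -> entry lookup, rejecting duplicate names (an assumption)."""
    index: Dict[str, ExtractorDataSketch] = {}
    for entry in entries:
        if entry.extractor_name in index:
            raise ValueError(f"duplicate extractor name: {entry.extractor_name}")
        index[entry.extractor_name] = entry
    return index


entries = [
    ExtractorDataSketch("remarkable_example", object(), "Extracts highlights."),
]
index = index_by_name(entries)

# One listed name per entry returned by the extractor.
print(sorted(index))
```

Under this picture, the name passed to --extractors is simply the key that selects which instance runs.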

Let’s plan for our extractor to have two modes, a fast mode and an accurate mode. The accurate mode should be expected to take longer.

To reflect this, let’s change the implementation of get_extractor_instance_data and add a constructor:

def __init__(self, fast: bool = False):
    self.fast = fast

@classmethod
def get_extractor_instance_data(cls) -> List[ExtractorData]:
    """ Return a list of :class:`ExtractorData` instaces representing
        different run options for the extractor.
    """
    return [
        ExtractorData(
            extractor_name="remarkable_example_accurate",
            instance=cls(fast=False),
            description=cls.__doc__ + " This version is more accurate."
        ),
        ExtractorData(
            extractor_name="remarkable_example_fast",
            instance=cls(fast=True),
            description=cls.__doc__ + " This version runs faster."
        )
    ]

Let’s test our change by running remarking list extractors.

We should see two extractors, one for each of the entries we returned.
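
A small caveat about building descriptions from cls.__doc__: the docstring in this example is written with padding spaces, so plain concatenation can leave a double space in the listed description, and __doc__ is None when Python runs with -OO. The helper below is a hedged sketch of one way to guard against both; build_description and the Example class are hypothetical names and not part of remarking.

```python
from typing import Optional


def build_description(doc: Optional[str], suffix: str = "") -> str:
    """Join a trimmed docstring and an optional suffix with a single space."""
    base = (doc or "").strip()
    # Skip empty parts so a missing docstring or suffix adds no stray space.
    return " ".join(part for part in (base, suffix.strip()) if part)


class Example:
    """ Extracts highlights from reMarkable documents. """


print(build_description(Example.__doc__, "This version runs faster."))
```

Inside get_extractor_instance_data, the same idea would replace cls.__doc__ + " This version runs faster." with a call to such a helper.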

HighlightExtractor.get_highlights

HighlightExtractor.get_highlights is where the magic happens. It accepts a working_path where all the documents for the current execution of remarking are stored.

It also accepts a document indicating which document to return highlights for.

remarking expects the extractor to return a list of Highlight objects that represent the highlights found.

Let’s make our implementation simple. A more complicated implementation can be seen in remarking/highlight_extractor/remarkable_highlight_extractor.py.

def get_highlights(self, working_path: str, document: Document) -> List[Highlight]:
    """ Retrieve all highlights for document. """
    if self.fast:
        quote = f"A fast quote from {document.name}"
    else:
        quote = f"An accurate quote from {document.name}"

    return [
        Highlight.create_highlight(
            doc_id=document.id,
            text=quote,
            page_number=1,
            extraction_method="RemarkableHighlightExtractorExample",
        )
    ]

We’re not actually parsing highlights in this example.

Instead, we simply return a quote indicating whether we ran the fast option. We also include the document name.

Let’s test this by running:

remarking run json --extractors remarkable_example_fast library

This should run the fast version of our extractor and return a single highlight per document found in library.

We can also run the accurate version with:

remarking run json --extractors remarkable_example_accurate library

Both examples should run without error.
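
The same behaviour can also be checked in-process, without going through the command line. The sketch below is self-contained and does not import remarking: DocumentSketch, HighlightSketch, and ExampleExtractorSketch are hypothetical stand-ins that mirror only the fields and the fast/accurate branch used in this guide, and the working path is a placeholder that the sketch never reads.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DocumentSketch:
    """Hypothetical stand-in exposing only the id and name fields."""
    id: str
    name: str


@dataclass
class HighlightSketch:
    """Hypothetical stand-in for the values passed to create_highlight."""
    doc_id: str
    text: str
    page_number: int
    extraction_method: str


class ExampleExtractorSketch:
    """Mirrors the fast/accurate branch of the example extractor."""

    def __init__(self, fast: bool = False):
        self.fast = fast

    def get_highlights(self, working_path: str, document: DocumentSketch) -> List[HighlightSketch]:
        mode = "A fast" if self.fast else "An accurate"
        return [
            HighlightSketch(
                doc_id=document.id,
                text=f"{mode} quote from {document.name}",
                page_number=1,
                extraction_method="RemarkableHighlightExtractorExample",
            )
        ]


document = DocumentSketch(id="doc-1", name="my_book_name")
for extractor in (ExampleExtractorSketch(fast=True), ExampleExtractorSketch()):
    highlights = extractor.get_highlights("/tmp/unused", document)
    # Each mode should yield exactly one highlight tied to the document.
    assert len(highlights) == 1
    assert highlights[0].doc_id == "doc-1"
    print(highlights[0].text)
```

A test for the real extractor would follow the same pattern, using remarking's own Document and Highlight classes in place of the stand-ins.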

Congratulations, your extractor just ran!

All together now

Our final implementation of remarking/highlight_extractor/remarkable_highlight_extractor_example.py should be:

from typing import List

from remarking import HighlightExtractor
from remarking import ExtractorData
from remarking import Document
from remarking import Highlight

class RemarkableHighlightExtractorExample(HighlightExtractor):
    """ Extracts highlights from reMarkable documents. """

    def __init__(self, fast: bool = False):
        self.fast = fast

    @classmethod
    def get_extractor_instance_data(cls) -> List[ExtractorData]:
        """ Return a list of :class:`ExtractorData` instaces representing
            different run options for the extractor.
        """
        return [
            ExtractorData(
                extractor_name="remarkable_example_accurate",
                instance=cls(fast=False),
                description=cls.__doc__ + " This version is more accurate."
            ),
            ExtractorData(
                extractor_name="remarkable_example_fast",
                instance=cls(fast=True),
                description=cls.__doc__ + " This version runs faster."
            )
        ]

    def get_highlights(self, working_path: str, document: Document) -> List[Highlight]:
        """ Retrieve all highlights for document. """
        if self.fast:
            quote = f"A fast quote from {document.name}"
        else:
            quote = f"An accurate quote from {document.name}"

        return [
            Highlight.create_highlight(
                doc_id=document.id,
                text=quote,
                page_number=1,
                extraction_method="RemarkableHighlightExtractorExample",
            )
        ]

Check out the implementation of remarking.RemarkableHighlightExtractor for an example of a more complex extractor!

Next Steps

Once you have designed your new extractor, open a pull request as outlined in the contribution doc and someone will review it!