perceval.backends.core package¶

Submodules¶

perceval.backends.core.askbot module¶

class perceval.backends.core.askbot.Askbot(url, tag=None, archive=None, ssl_verify=True)[source]¶

Bases: perceval.backend.Backend

Askbot backend.

This class retrieves the questions posted on an Askbot site. To initialize this class the URL must be provided. The url will be set as the origin of the data.

Parameters

url – Askbot site URL
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification

CATEGORIES = ['question']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
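As a rough illustration of the rule above, the following sketch rejects any category that is not listed in CATEGORIES. The class SketchBackend and its use of ValueError are assumptions made for this example; they are not Perceval's implementation.

```python
class SketchBackend:
    """Hypothetical stand-in for a backend; not part of Perceval."""

    CATEGORIES = ['question']

    def fetch(self, category='question'):
        # Only categories declared in CATEGORIES are accepted
        if category not in self.CATEGORIES:
            raise ValueError("category %r is not valid for this backend" % category)
        # A real backend would yield the fetched items here
        return iter(())
```

The check runs before any item is produced, so an invalid category fails immediately rather than in the middle of a fetch.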

EXTRA_SEARCH_FIELDS = {'tags': ['tags']}¶

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

{
    'key-1': value-1,
    'key-2': value-2,
    'key-3': value-3
}

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

{
    'project_id': ['fields', 'project', 'id'],
    'project_key': ['fields', 'project', 'key'],
    'project_name': ['fields', 'project', 'name']
}

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.
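The path lookup described above can be sketched as follows. The function name resolve_search_fields and the sample item are assumptions made for illustration; the sketch does not cover how Perceval handles missing paths.

```python
def resolve_search_fields(item, extra_search_fields):
    """Follow each path through the nested item and collect the values."""
    fields = {}
    for name, path in extra_search_fields.items():
        value = item
        for step in path:
            # A step may be a dict key or a list index
            value = value[step]
        fields[name] = value
    return fields


# Made-up item used only to exercise the sketch
sample_item = {'fields': {'project': {'id': '10', 'key': 'PRJ', 'name': 'Project'}}}
sample_extra = {
    'project_id': ['fields', 'project', 'id'],
    'project_key': ['fields', 'project', 'key'],
    'project_name': ['fields', 'project', 'name'],
}
search_fields = resolve_search_fields(sample_item, sample_extra)
```

Because each step is applied with plain indexing, the same loop also works for paths that mix keys and list positions, such as ['component', 0, '__text__'].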

fetch(category='question', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶

Fetch the questions/answers from the repository.

The method retrieves, from an Askbot site, the questions and answers updated since the given date.

Parameters

category – the category of items to fetch
from_date – obtain questions/answers updated since this date

Returns

a generator of items

fetch_items(category, **kwargs)[source]¶

Fetch the questions

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]¶

Returns whether it supports archiving items on the fetch process.

Returns: this backend supports items archive

classmethod has_resuming()[source]¶

Returns whether it supports resuming the fetch process.

Returns: this backend supports items resuming

static metadata_category(item)[source]¶

Extracts the category from an Askbot item.

This backend only generates one type of item which is ‘question’.

static metadata_id(item)[source]¶: Extracts the identifier from an Askbot question item.

static metadata_updated_on(item)[source]¶

Extracts the update time from an Askbot item.

The timestamp is extracted from ‘last_activity_at’ field. This date is a UNIX timestamp but needs to be converted to a float value.

Parameters: item – item generated by the backend
Returns: a UNIX timestamp
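A minimal sketch of the conversion, assuming the field may arrive as a string; the function name updated_on_sketch and the sample value are hypothetical.

```python
def updated_on_sketch(item):
    # 'last_activity_at' already holds a UNIX timestamp; only the type changes
    return float(item['last_activity_at'])


sample_timestamp = updated_on_sketch({'last_activity_at': '1457450955'})
```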

version = '0.8.0'¶

class perceval.backends.core.askbot.AskbotClient(base_url, archive=None, from_archive=False, ssl_verify=True)[source]¶

Bases: perceval.client.HttpClient

Askbot client.

This class implements a simple client to retrieve distinct kinds of data from an Askbot site.

Parameters

base_url – URL of the Askbot site
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification

Raises

HTTPError – when an error occurs doing the request

API_QUESTIONS = 'api/v1/questions/'¶

HREQUEST_WITH = 'X-Requested-With'¶

PAVATAR_SIZE = 'avatar_size'¶

PPAGE = 'page'¶

PPOST_ID = 'post_id'¶

PPOST_TYPE = 'post_type'¶

PSORT = 'sort'¶

RCOMMENTS = 's/post_comments'¶

RCOMMENTS_OLD = 'post_comments'¶

RHTML_QUESTION = 'question/'¶

VANSWER = 'answer'¶

VAVATAR_SIZE = 0¶

VHTTP_REQUEST = 'XMLHttpRequest'¶

VORDER_API = 'activity-asc'¶

VORDER_HTML = 'votes'¶

get_api_questions(path)[source]¶

Retrieve a question page using the API.

Parameters: path – path of the question page to retrieve

get_comments(post_id)[source]¶

Retrieve a list of comments by a given id.

Parameters: post_id – post identifier

get_html_question(question_id, page=1)[source]¶

Retrieve a raw HTML question and all its information.

Parameters

question_id – question identifier
page – page to retrieve

class perceval.backends.core.askbot.AskbotCommand(*args, debug=False)[source]¶

Bases: perceval.backend.BackendCommand

Class to run Askbot backend from the command line.

BACKEND¶: alias of perceval.backends.core.askbot.Askbot

classmethod setup_cmd_parser()[source]¶: Returns the Askbot argument parser.

class perceval.backends.core.askbot.AskbotParser[source]¶

Bases: object

Askbot HTML parser.

This class parses a plain HTML document, converting questions, answers, comments and user information into dict items.

static parse_answers(html_question)[source]¶

Parse the answers of a given HTML question.

The method parses the answers related to a given HTML question, as well as all the comments related to each answer.

Parameters: html_question – raw HTML question element
Returns: a list with the answers

static parse_number_of_html_pages(html_question)[source]¶

Parse number of answer pages to paginate over them.

Parameters: html_question – raw HTML question element
Returns: an integer with the number of pages

static parse_question_container(html_question)[source]¶

Parse the question info container of a given HTML question.

The method parses the information available in the question information container. The container can have up to 2 elements: the first one contains the information related to the user who generated the question and the date (if any). The second one contains the date of the update and the user who updated it (if not the same who generated the question).

Parameters: html_question – raw HTML question element
Returns: an object with the parsed information

static parse_user_info(update_info)[source]¶

Parse the user information of a given HTML container.

The method parses all the available user information in the container. If the class “user-info” exists, the method will get all the available information in the container. Otherwise, if a class “tip” exists, the container belongs to a wiki post with no associated user. Failing that, the container may be empty.

Parameters: update_info – beautiful soup answer container element
Returns: an object with the parsed information

perceval.backends.core.bugzilla module¶

class perceval.backends.core.bugzilla.Bugzilla(url, user=None, password=None, max_bugs=200, max_bugs_csv=10000, tag=None, archive=None, ssl_verify=True)[source]¶

Bases: perceval.backend.Backend

Bugzilla backend.

This class allows fetching the bugs stored in a Bugzilla repository. To initialize this class, the URL of the server must be provided. The URL will be set as the origin of the data.

Parameters

url – Bugzilla server URL
user – Bugzilla user
password – Bugzilla user password
max_bugs – maximum number of bugs requested on the same query
max_bugs_csv – maximum number of bugs requested per CSV query
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification

CATEGORIES = ['bug']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

EXTRA_SEARCH_FIELDS = {'component': ['component', 0, '__text__'], 'product': ['product', 0, '__text__']}¶

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

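{
    'key-1': value-1,
    'key-2': value-2,
    'key-3': value-3
}

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

{
    'project_id': ['fields', 'project', 'id'],
    'project_key': ['fields', 'project', 'key'],
    'project_name': ['fields', 'project', 'name']
}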

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.

fetch(category='bug', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶

Fetch the bugs from the repository.

The method retrieves, from a Bugzilla repository, the bugs updated since the given date.

Parameters

category – the category of items to fetch
from_date – obtain bugs updated since this date

Returns

a generator of bugs

fetch_items(category, **kwargs)[source]¶

Fetch the bugs

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]¶

Returns whether it supports archiving items on the fetch process.

Returns: this backend supports items archive

classmethod has_resuming()[source]¶

Returns whether it supports resuming the fetch process.

Returns: this backend supports items resuming

static metadata_category(item)[source]¶

Extracts the category from a Bugzilla item.

This backend only generates one type of item which is ‘bug’.

static metadata_id(item)[source]¶: Extracts the identifier from a Bugzilla item.

static metadata_updated_on(item)[source]¶

Extracts and converts the update time from a Bugzilla item.

The timestamp is extracted from the ‘delta_ts’ field and converted to UNIX timestamp format. Because Bugzilla servers ignore the timezone on HTTP requests, the timezone is ignored during the conversion, too.

Parameters: item – item generated by the backend
Returns: a UNIX timestamp
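The timezone-insensitive conversion can be sketched as follows. The input layout ("YYYY-MM-DD HH:MM:SS", optionally followed by a timezone) and the function name delta_ts_to_unix are assumptions for this example, not a description of Perceval's code.

```python
from datetime import datetime, timezone


def delta_ts_to_unix(delta_ts):
    # Keep only "YYYY-MM-DD HH:MM:SS"; any trailing timezone is discarded
    naive = datetime.strptime(delta_ts[:19], "%Y-%m-%d %H:%M:%S")
    # Interpret the remaining date as UTC before converting it
    return naive.replace(tzinfo=timezone.utc).timestamp()
```

With this approach, two values that differ only in their timezone suffix map to the same UNIX timestamp.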

static parse_bug_activity(raw_html)[source]¶

Parse a Bugzilla bug activity HTML stream.

This method extracts the information about activity from the given HTML stream. The bug activity is stored in an HTML table. Each parsed activity event is returned as a dictionary.

If the given HTML is invalid, the method will raise a ParseError exception.

Parameters: raw_html – HTML string to parse
Returns: a generator of parsed activity events
Raises: ParseError – raised when an error occurs parsing the given HTML stream

static parse_buglist(raw_csv)[source]¶

Parse a Bugzilla CSV bug list.

The method parses the CSV file and returns an iterator of dictionaries. Each dictionary contains the summary of a bug.

Parameters: raw_csv – CSV string to parse
Returns: a generator of parsed bugs
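A minimal sketch of this kind of CSV parsing using only the standard library. The function name parse_buglist_sketch and the column names in the sample are hypothetical; real Bugzilla bug lists may use different columns.

```python
import csv
import io


def parse_buglist_sketch(raw_csv):
    """Yield one dictionary per CSV row, keyed by the header names."""
    reader = csv.DictReader(io.StringIO(raw_csv), delimiter=',', quotechar='"')
    for row in reader:
        yield dict(row)


# Made-up CSV content used only to exercise the sketch
raw = 'bug_id,changeddate\n15,"2009-07-30 11:35:33"\n17,"2009-08-01 08:00:00"\n'
bug_summaries = list(parse_buglist_sketch(raw))
```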

static parse_bugs_details(raw_xml)[source]¶

Parse a Bugzilla bugs details XML stream.

This method returns a generator which parses the given XML, producing an iterator of dictionaries. Each dictionary stores the information related to a parsed bug.

If the given XML is invalid or does not contain any bug, the method will raise a ParseError exception.

Parameters: raw_xml – XML string to parse
Returns: a generator of parsed bugs
Raises: ParseError – raised when an error occurs parsing the given XML stream
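The behaviour described above (a generator of dictionaries that fails on invalid or empty input) can be sketched with the standard library. ParseErrorSketch and parse_bugs_details_sketch are hypothetical names, and the flattened output is a simplification that does not reproduce the dictionary layout Perceval produces.

```python
import xml.etree.ElementTree as ET


class ParseErrorSketch(Exception):
    """Hypothetical stand-in for Perceval's ParseError."""


def parse_bugs_details_sketch(raw_xml):
    try:
        root = ET.fromstring(raw_xml)
    except ET.ParseError as exc:
        raise ParseErrorSketch("invalid XML: %s" % exc) from exc

    bugs = root.findall('bug')
    if not bugs:
        raise ParseErrorSketch("no bugs found in the XML stream")

    for bug in bugs:
        # Flatten each <bug> element into {tag: text}
        yield {child.tag: child.text for child in bug}
```

Since the function is a generator, the errors surface when the result is first iterated, not when the function is called.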

version = '0.12.0'¶

class perceval.backends.core.bugzilla.BugzillaClient(base_url, user=None, password=None, max_bugs_csv=10000, archive=None, from_archive=False, ssl_verify=True)[source]¶

Bases: perceval.client.HttpClient

Bugzilla API client.

This class implements a simple client to retrieve distinct kinds of data from a Bugzilla repository. Currently, it only supports 3.x and 4.x servers.

When it is initialized, it checks if the given Bugzilla is available and retrieves its version.

Parameters

base_url – URL of the Bugzilla server
user – Bugzilla user
password – user password
max_bugs_csv – max bugs requested per CSV query
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification

Raises

BackendError – when an error occurs initializing the client

CGI_BUG = 'show_bug.cgi'¶

CGI_BUGLIST = 'buglist.cgi'¶

CGI_BUG_ACTIVITY = 'show_activity.cgi'¶

CGI_LOGIN = 'index.cgi'¶

CTYPE_CSV = 'csv'¶

CTYPE_XML = 'xml'¶

OLD_STYLE_VERSIONS = ['3.2.3', '3.2.2']¶

PBUGZILLA_LOGIN = 'Bugzilla_login'¶

PBUGZILLA_PASSWORD = 'Bugzilla_password'¶

PBUG_ID = 'id'¶

PCHFIELD_FROM = 'chfieldfrom'¶

PCTYPE = 'ctype'¶

PEXCLUDE_FIELD = 'excludefield'¶

PLIMIT = 'limit'¶

PLOGIN = 'GoAheadAndLogIn'¶

PLOGOUT = 'logout'¶

PORDER = 'order'¶

URL = '%(base)s/%(cgi)s'¶

VERSION_REGEX = re.compile('.+bugzilla version="([^"]+)"', re.DOTALL)¶

bug_activity(bug_id)[source]¶

Get the activity of a bug in HTML format.

Parameters: bug_id – bug identifier

buglist(from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶

Get a summary of bugs in CSV format.

Parameters: from_date – retrieve bugs that were updated since that date

bugs(*bug_ids)[source]¶

Get the information of a list of bugs in XML format.

Parameters: bug_ids – list of bug identifiers

call(cgi, params)[source]¶

Run an API command.

Parameters

cgi – cgi command to run on the server
params – dict with the HTTP parameters needed to run the given command
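As an illustration of how the URL template and the parameters might combine, the sketch below expands the template and appends the parameters as a query string. Whether the client sends them in the query string or in the request body is not stated here, so that part is an assumption, as are the function name build_request_url and the example host.

```python
from urllib.parse import urlencode

URL = '%(base)s/%(cgi)s'


def build_request_url(base_url, cgi, params=None):
    # Expand the template with the server URL and the CGI command
    url = URL % {'base': base_url, 'cgi': cgi}
    if params:
        # Assumption: parameters travel in the query string
        url = url + '?' + urlencode(params)
    return url


example_url = build_request_url('https://bugzilla.example.com', 'show_bug.cgi',
                                {'id': 15, 'ctype': 'xml'})
```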

login(user, password)[source]¶

Authenticate a user in the server.

Parameters

user – Bugzilla user
password – user password

logout()[source]¶: Logout from the server.

metadata()[source]¶: Get metadata information in XML format.

static sanitize_for_archive(url, headers, payload)[source]¶

Sanitize the payload of an HTTP request by removing the login and password information before storing/retrieving archived items.

Parameters

url – HTTP url request
headers – HTTP headers request
payload – HTTP payload request

Returns

url, headers and the sanitized payload
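A minimal sketch of such a sanitizing step, using the values of PBUGZILLA_LOGIN and PBUGZILLA_PASSWORD listed above as the keys to drop. The function name and the decision to return a copy are assumptions; the actual method may remove more keys or modify the payload in place.

```python
SENSITIVE_KEYS = ('Bugzilla_login', 'Bugzilla_password')


def sanitize_payload_sketch(url, headers, payload):
    # Build a copy without the credentials; the caller's dict is left untouched
    clean = {key: value for key, value in (payload or {}).items()
             if key not in SENSITIVE_KEYS}
    return url, headers, clean
```

Returning a copy keeps the credentials available to the live request while the archived version stays free of them.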

class perceval.backends.core.bugzilla.BugzillaCommand(*args, debug=False)[source]¶

Bases: perceval.backend.BackendCommand

Class to run Bugzilla backend from the command line.

BACKEND¶: alias of perceval.backends.core.bugzilla.Bugzilla

classmethod setup_cmd_parser()[source]¶: Returns the Bugzilla argument parser.

perceval.backends.core.bugzillarest module¶

class perceval.backends.core.bugzillarest.BugzillaREST(url, user=None, password=None, api_token=None, max_bugs=500, tag=None, archive=None, ssl_verify=True)[source]¶

Bases: perceval.backend.Backend

Bugzilla backend that uses its REST API.

This class allows fetching the bugs stored in a Bugzilla server (version 5.0 or later). To initialize this class, the URL of the server must be provided. The URL will be set as the origin of the data.

Parameters

url – Bugzilla server URL
user – Bugzilla user
password – Bugzilla user password
api_token – Bugzilla token
max_bugs – maximum number of bugs requested on the same query
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification

CATEGORIES = ['bug']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

EXTRA_SEARCH_FIELDS = {'component': ['component'], 'product': ['product']}¶

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

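{
    'key-1': value-1,
    'key-2': value-2,
    'key-3': value-3
}

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

{
    'project_id': ['fields', 'project', 'id'],
    'project_key': ['fields', 'project', 'key'],
    'project_name': ['fields', 'project', 'name']
}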

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.

fetch(category='bug', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶

Fetch the bugs from the repository.

The method retrieves, from a Bugzilla repository, the bugs updated since the given date.

Parameters

category – the category of items to fetch
from_date – obtain bugs updated since this date

Returns

a generator of bugs

fetch_items(category, **kwargs)[source]¶

Fetch the bugs

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]¶

Returns whether it supports archiving items on the fetch process.

Returns: this backend supports items archive

classmethod has_resuming()[source]¶

Returns whether it supports resuming the fetch process.

Returns: this backend supports items resuming

static metadata_category(item)[source]¶

Extracts the category from a Bugzilla item.

This backend only generates one type of item which is ‘bug’.

static metadata_id(item)[source]¶: Extracts the identifier from a Bugzilla item.

static metadata_updated_on(item)[source]¶

Extracts the update time from a Bugzilla item.

The timestamp used is extracted from ‘last_change_time’ field. This date is converted to UNIX timestamp format taking into account the timezone of the date.

Parameters: item – item generated by the backend
Returns: a UNIX timestamp
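A timezone-aware conversion can be sketched as follows, assuming the field holds an ISO 8601 string such as "2013-06-25T11:55:46Z". Both the input format and the function name last_change_time_to_unix are assumptions for this example.

```python
from datetime import datetime


def last_change_time_to_unix(value):
    # %z accepts "Z" as well as offsets such as "+02:00" (Python 3.7+)
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S%z").timestamp()
```

Unlike a conversion that discards the offset, two strings naming the same instant in different timezones yield the same timestamp here.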

version = '0.10.0'¶

class perceval.backends.core.bugzillarest.BugzillaRESTClient(base_url, user=None, password=None, api_token=None, archive=None, from_archive=False, ssl_verify=True)[source]¶

Bases: perceval.client.HttpClient

Bugzilla REST API client.

This class implements a simple client to retrieve distinct kinds of data from a Bugzilla > 5.0 repository using its REST API.

When the user and password parameters are given, it logs in to the server. Further requests will use the token obtained during the sign-in phase.

Parameters

base_url – URL of the Bugzilla server
user – Bugzilla user
password – user password
api_token – api token for user; when this is provided user and password parameters will be ignored
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification

Raises

BackendError – when an error occurs initializing the client

PBUGZILLA_LOGIN = 'login'¶

PBUGZILLA_PASSWORD = 'password'¶

PBUGZILLA_TOKEN = 'token'¶

PEXCLUDE_FIELDS = 'exclude_fields'¶

PIDS = 'ids'¶

PINCLUDE_FIELDS = 'include_fields'¶

PLAST_CHANGE_TIME = 'last_change_time'¶

PLIMIT = 'limit'¶

POFFSET = 'offset'¶

PORDER = 'order'¶

RATTACHMENT = 'attachment'¶

RBUG = 'bug'¶

RCOMMENT = 'comment'¶

RHISTORY = 'history'¶

RLOGIN = 'login'¶

URL = '%(base)s/rest/%(resource)s'¶

VCHANGE_DATE_ORDER = 'changeddate'¶

VEXCLUDE_ATTCH_DATA = 'data'¶

VINCLUDE_ALL = '_all'¶

attachments(*bug_ids)[source]¶

Get the attachments of the given bugs.

Parameters: bug_ids – list of bug identifiers

bugs(from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), offset=None, max_bugs=500)[source]¶

Get the information of a list of bugs.

Parameters

from_date – retrieve bugs that were updated since that date; dates are converted to UTC
offset – starting position for the search; e.g., to return the 11th element, set this value to 10
max_bugs – maximum number of bugs to return per query
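The interplay between offset and max_bugs can be sketched with a generic pagination loop. The helper iter_bugs and the in-memory fake_fetch_page are hypothetical; they stand in for repeated calls to a method with the parameters listed above and do not contact any server.

```python
def iter_bugs(fetch_page, max_bugs=500):
    """Yield items page by page, advancing the offset after each full page."""
    offset = 0
    while True:
        page = fetch_page(offset=offset, max_bugs=max_bugs)
        yield from page
        # A short (or empty) page means there is nothing left to request
        if len(page) < max_bugs:
            break
        offset += max_bugs


# Made-up data: twelve bug ids, numbered from 1
ALL_BUGS = list(range(1, 13))


def fake_fetch_page(offset, max_bugs):
    return ALL_BUGS[offset:offset + max_bugs]
```

With this zero-based offset, requesting offset=10 and max_bugs=1 returns the 11th element of the list.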

call(resource, params)[source]¶

Retrieve the given resource.

Parameters

resource – resource to retrieve
params – dict with the HTTP parameters needed to retrieve the given resource

Raises

BugzillaRESTError – raised when an error is returned by the server

comments(*bug_ids)[source]¶

Get the comments of the given bugs.

Parameters: bug_ids – list of bug identifiers

history(*bug_ids)[source]¶

Get the history of the given bugs.

Parameters: bug_ids – list of bug identifiers

login(user, password)[source]¶

Authenticate a user in the server.

Parameters

user – Bugzilla user
password – user password

static sanitize_for_archive(url, headers, payload)[source]¶

Sanitize the payload of an HTTP request by removing the login, password and token information before storing/retrieving archived items.

Parameters

url – HTTP url request
headers – HTTP headers request
payload – HTTP payload request

Returns

url, headers and the sanitized payload

class perceval.backends.core.bugzillarest.BugzillaRESTCommand(*args, debug=False)[source]¶

Bases: perceval.backend.BackendCommand

Class to run BugzillaREST backend from the command line.

BACKEND¶: alias of perceval.backends.core.bugzillarest.BugzillaREST

classmethod setup_cmd_parser()[source]¶: Returns the BugzillaREST argument parser.

exception perceval.backends.core.bugzillarest.BugzillaRESTError(**kwargs)[source]¶

Bases: perceval.errors.BaseError

Raised when an error occurs using the API

message = '%(error)s (code: %(code)s)'¶
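The message template is a %-style format string with named fields. A sketch of how keyword arguments could fill it is shown below; the classes are hypothetical stand-ins, and the error text and code are made-up values.

```python
class BaseErrorSketch(Exception):
    """Hypothetical base class that formats 'message' with keyword arguments."""

    message = 'error'

    def __init__(self, **kwargs):
        super().__init__()
        # Named fields in the template are filled from the keyword arguments
        self.msg = self.message % kwargs

    def __str__(self):
        return self.msg


class BugzillaRESTErrorSketch(BaseErrorSketch):
    message = '%(error)s (code: %(code)s)'


sample_error = BugzillaRESTErrorSketch(error='Invalid token', code=32000)
```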

perceval.backends.core.confluence module¶

class perceval.backends.core.confluence.Confluence(url, tag=None, archive=None, ssl_verify=True)[source]¶

Bases: perceval.backend.Backend

Confluence backend.

This class allows fetching the historical contents (content versions) stored on a Confluence server. To initialize this class, pass the URL of the server. The URL will be set as the origin of the data.

Parameters

url – URL of the server
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification

CATEGORIES = ['historical content']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

fetch(category='historical content', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶

Fetch the contents by version from the server.

This method fetches the different historical versions (or snapshots) of the contents stored in the server that were updated since the given date. Only those snapshots created or updated after from_date will be returned.

Take into account that the seconds of the from_date parameter will be ignored because the Confluence REST API only accepts the date, hours and minutes for timestamp values.

Parameters

category – the category of items to fetch
from_date – obtain historical versions of contents updated since this date

Returns

a generator of historical versions

fetch_items(category, **kwargs)[source]¶

Fetch the contents

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]¶

Returns whether it supports archiving items on the fetch process.

Returns: this backend supports items archive

classmethod has_resuming()[source]¶

Returns whether it supports resuming the fetch process.

Returns: this backend supports items resuming

static metadata_category(item)[source]¶

Extracts the category from a Confluence item.

This backend only generates one type of item which is ‘historical content’.

static metadata_id(item)[source]¶

Extracts the identifier from a Confluence item.

This identifier is a combination of two fields because a historical content does not have a unique identifier of its own. In this case, the ‘id’ and ‘version’ values are combined because it should not be possible to have two equal version numbers for the same content. The value to return will follow the pattern: <content>#v<version> (e.g., 28979#v10).
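The pattern can be sketched as follows. The location of the version number inside the item (item['version']['number']) is an assumption made for this example, as is the function name.

```python
def metadata_id_sketch(item):
    # Assumed layout: content id at the top level, version number under 'version'
    return "%s#v%s" % (item['id'], item['version']['number'])


sample_id = metadata_id_sketch({'id': '28979', 'version': {'number': 10}})
```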

static metadata_updated_on(item)[source]¶

Extracts and converts the update time from a Confluence item.

The timestamp is extracted from ‘when’ field on ‘version’ section. This date is converted to UNIX timestamp format.

Parameters: item – item generated by the backend
Returns: a UNIX timestamp

static parse_contents_summary(raw_json)[source]¶

Parse a Confluence summary JSON list.

The method parses a JSON stream and returns an iterator of dictionaries. Each dictionary is a content summary.

Parameters: raw_json – JSON string to parse
Returns: a generator of parsed content summaries.

static parse_historical_content(raw_json)[source]¶

Parse a Confluence historical content JSON stream.

This method parses a JSON stream and returns a dictionary that contains the data of a historical content.

Parameters: raw_json – JSON string to parse
Returns: a dict with historical content

search_fields(item)[source]¶

Add search fields to an item.

It adds the values of metadata_id plus the page ancestor IDs, the content ID and the content version number.

Parameters: item – the item to extract the search fields values
Returns: a dict of search fields

version = '0.12.0'¶

class perceval.backends.core.confluence.ConfluenceClient(base_url, archive=None, from_archive=False, ssl_verify=True)[source]¶

Bases: perceval.client.HttpClient

Confluence REST API client.

This class implements a client to retrieve contents from a Confluence server using its REST API.

Parameters

base_url – URL of the Confluence server
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification

MSEARCH = 'search'¶

PANCESTORS = 'ancestors'¶

PCQL = 'cql'¶

PEXPAND = 'expand'¶

PLIMIT = 'limit'¶

PSTART = 'start'¶

PSTATUS = 'status'¶

PVERSION = 'version'¶

RCONTENTS = 'content'¶

RHISTORY = 'history'¶

RSPACE = 'space'¶

URL = '%(base)s/rest/api/%(resource)s'¶

VCQL = "lastModified>='%(date)s' order by lastModified"¶

VEXPAND = ['body.storage', 'history', 'version']¶

VHISTORICAL = 'historical'¶

contents(from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), offset=None, max_contents=200)[source]¶

Get the contents of a repository.

This method returns an iterator that manages the pagination over contents. Take into account that the seconds of from_date parameter will be ignored because the API only works with hours and minutes.

Parameters

from_date – fetch the contents updated since this date
offset – fetch the contents starting from this offset
max_contents – maximum number of contents to fetch per request
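To illustrate why the seconds of from_date are dropped, the sketch below fills the VCQL template with a date formatted down to the minute. The exact format string and the function name build_cql are assumptions for this example.

```python
from datetime import datetime, timezone

VCQL = "lastModified>='%(date)s' order by lastModified"


def build_cql(from_date):
    # Only date, hours and minutes are kept; seconds are discarded
    date = from_date.strftime("%Y-%m-%d %H:%M")
    return VCQL % {'date': date}


sample_cql = build_cql(datetime(2016, 1, 1, 10, 30, 59, tzinfo=timezone.utc))
```

Two from_date values within the same minute therefore produce the same query.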

historical_content(content_id, version)[source]¶

Get the snapshot of a content for the given version.

Parameters

content_id – fetch the snapshot of this content
version – snapshot version of the content

class perceval.backends.core.confluence.ConfluenceCommand(*args, debug=False)[source]¶

Bases: perceval.backend.BackendCommand

Class to run Confluence backend from the command line.

BACKEND¶: alias of perceval.backends.core.confluence.Confluence

classmethod setup_cmd_parser()[source]¶: Returns the Confluence argument parser.

perceval.backends.core.discourse module¶

class perceval.backends.core.discourse.Discourse(url, api_username=None, api_token=None, tag=None, archive=None, max_retries=10, sleep_time=5, ssl_verify=True)[source]¶

Bases: perceval.backend.Backend

Discourse backend for Perceval.

This class retrieves the topics posted in a Discourse board. To initialize this class the URL must be provided. The url will be set as the origin of the data.

Parameters

url – Discourse URL
api_username – Discourse API username
api_token – Discourse API access token
tag – label used to mark the data
archive – archive to store/retrieve items
max_retries – number of max retries to a data source before raising a RetryError exception
sleep_time – time (in seconds) to sleep in case of connection problems
ssl_verify – enable/disable SSL verification

CATEGORIES = ['topic']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

EXTRA_SEARCH_FIELDS = {'category_id': ['category_id']}¶

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

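{
    'key-1': value-1,
    'key-2': value-2,
    'key-3': value-3
}

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

{
    'project_id': ['fields', 'project', 'id'],
    'project_key': ['fields', 'project', 'key'],
    'project_name': ['fields', 'project', 'name']
}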

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.

fetch(category='topic', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶

Fetch the topics from the Discourse board.

The method retrieves, from a Discourse board, the topics updated since the given date.

Parameters

category – the category of items to fetch
from_date – obtain topics updated since this date

Returns

a generator of topics

fetch_items(category, **kwargs)[source]¶

Fetch the topics

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]¶

Returns whether it supports archiving items on the fetch process.

Returns: this backend supports items archive

classmethod has_resuming()[source]¶

Returns whether it supports resuming the fetch process.

Returns: this backend supports items resuming

static metadata_category(item)[source]¶

Extracts the category from a Discourse item.

This backend only generates one type of item which is ‘topic’.

static metadata_id(item)[source]¶: Extracts the identifier from a Discourse item.

static metadata_updated_on(item)[source]¶

Extracts the update time from a Discourse item.

The timestamp used is extracted from ‘last_posted_at’ field. This date is converted to UNIX timestamp format taking into account the timezone of the date.

Parameters: item – item generated by the backend
Returns: a UNIX timestamp

version = '0.13.1'¶

class perceval.backends.core.discourse.DiscourseClient(base_url, api_username=None, api_key=None, sleep_time=5, max_retries=10, archive=None, from_archive=False, ssl_verify=True)[source]¶

Bases: perceval.client.HttpClient

Discourse API client.

This class implements a simple client to retrieve topics from any Discourse board.

Parameters

base_url – URL of the Discourse site
api_username – Discourse API username
api_key – Discourse API access token
sleep_time – time (in seconds) to sleep in case of connection problems
max_retries – number of max retries to a data source before raising a RetryError exception
archive – collect issues already retrieved from an archive
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification

Raises

HTTPError – when an error occurs doing the request
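The roles of max_retries and sleep_time can be sketched conceptually as a retry loop around a request that may answer with status 429 (the value listed in EXTRA_STATUS_FORCELIST). This is not the client's actual retry mechanism: the fixed wait, the function fetch_with_retries and the class RetryErrorSketch are all assumptions made for illustration.

```python
import time


class RetryErrorSketch(Exception):
    """Hypothetical stand-in for the RetryError mentioned above."""


def fetch_with_retries(request, max_retries=10, sleep_time=5, sleep=time.sleep):
    """Call request() until it stops answering 429 or retries run out."""
    for attempt in range(max_retries + 1):
        status, body = request()
        if status != 429:
            return body
        if attempt < max_retries:
            # Wait before the next attempt
            sleep(sleep_time)
    raise RetryErrorSketch("max retries (%d) exceeded" % max_retries)
```

The sleep function is injectable so the loop can be exercised without actually waiting.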

ALL_TOPICS = None¶

EXTRA_STATUS_FORCELIST = [429]¶

HKEY = 'Api-Key'¶

HUSER = 'Api-Username'¶

POSTS = 'posts'¶

PPAGE = 'page'¶

TJSON = '.json'¶

TOPIC = 't'¶

TOPICS_SUMMARY = 'latest'¶

post(post_id)[source]¶

Retrieve the post with post_id identifier.

Parameters: post_id – identifier of the post to retrieve

static sanitize_for_archive(url, headers, payload)[source]¶

Sanitize the payload of an HTTP request by removing the user and key information before storing/retrieving archived items.

Parameters

url – HTTP url request
headers – HTTP headers request
payload – HTTP payload request

Returns

the url, the headers and the sanitized payload
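The sanitizing step can be pictured with the following minimal sketch. It is not the library's implementation: it assumes the credentials travel under the HKEY and HUSER names listed above and, since the text does not say which container carries them, it cleans both the headers and the payload.

```python
import copy

# Header/parameter names taken from the class attributes listed above.
HKEY = 'Api-Key'
HUSER = 'Api-Username'


def sanitize_for_archive(url, headers, payload):
    """Return the request data with credential fields removed.

    Assumption: credentials may appear in either the headers or the
    payload, so both containers are cleaned. The inputs are copied,
    so the caller's dicts are left untouched.
    """
    clean_headers = copy.deepcopy(headers) if headers else headers
    clean_payload = copy.deepcopy(payload) if payload else payload

    for container in (clean_headers, clean_payload):
        if container:
            container.pop(HKEY, None)
            container.pop(HUSER, None)

    return url, clean_headers, clean_payload
```

Working on copies keeps the live request usable after the archive key has been computed.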

topic(topic_id)[source]¶

Retrieve the topic with topic_id identifier.

Parameters: topic_id – identifier of the topic to retrieve

topics_page(page=None)[source]¶

Retrieve the #page summaries of the latest topics.

Parameters: page – number of page to retrieve
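As a rough illustration of how the path constants above could combine into request URLs, here is a small sketch. The helper names (build_url, topics_page_url) and the exact URL layout are assumptions made for the example, not the client's actual code.

```python
from urllib.parse import urlencode

# Constants copied from the class attributes listed above.
TOPICS_SUMMARY = 'latest'
TOPIC = 't'
POSTS = 'posts'
TJSON = '.json'
PPAGE = 'page'


def build_url(base_url, *parts, params=None):
    """Join path parts, append the JSON suffix and optional query params."""
    path = '/'.join(str(part) for part in parts) + TJSON
    url = base_url.rstrip('/') + '/' + path
    if params:
        url += '?' + urlencode(params)
    return url


def topics_page_url(base_url, page=None):
    """URL of one page of topic summaries (hypothetical helper)."""
    params = {PPAGE: page} if page is not None else None
    return build_url(base_url, TOPICS_SUMMARY, params=params)
```

With these assumptions, `build_url(base, TOPIC, 42)` and `build_url(base, POSTS, 7)` would address a single topic and a single post respectively.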

class perceval.backends.core.discourse.DiscourseCommand(*args, debug=False)[source]¶

Bases: perceval.backend.BackendCommand

Class to run Discourse backend from the command line.

BACKEND¶: alias of perceval.backends.core.discourse.Discourse

classmethod setup_cmd_parser()[source]¶: Returns the Discourse argument parser.

perceval.backends.core.dockerhub module¶

class perceval.backends.core.dockerhub.DockerHub(owner, repository, tag=None, archive=None, ssl_verify=True)[source]¶

Bases: perceval.backend.Backend

DockerHub backend for Perceval.

This class retrieves data from a repository stored in the Docker Hub site. To initialize this class, the owner and the repository from which data will be fetched must be provided. The origin of the data will be built with both parameters.

Shortcut _ owner for official Docker repositories will be replaced by its long name: library.

Parameters

owner – DockerHub owner
repository – DockerHub repository owned by owner
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification

CATEGORIES = ['dockerhub-data']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

EXTRA_SEARCH_FIELDS = {'name': ['name'], 'namespace': ['namespace']}¶

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

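{
    'key-1': value-1,
    'key-2': value-2,
    'key-3': value-3
}

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

{
    'project_id': ['fields', 'project', 'id'],
    'project_key': ['fields', 'project', 'key'],
    'project_name': ['fields', 'project', 'name']
}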

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.
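The path lookup described here can be sketched as follows. The helper names (resolve_path, build_search_fields) are hypothetical and the handling of missing keys is an assumption; only the EXTRA_SEARCH_FIELDS value comes from this class.

```python
# Value documented above for the DockerHub backend.
EXTRA_SEARCH_FIELDS = {'name': ['name'], 'namespace': ['namespace']}


def resolve_path(data, path):
    """Walk nested dicts following `path`; return None if a key is missing."""
    value = data
    for key in path:
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def build_search_fields(item_id, data, extra_fields):
    """Build the search fields dict: the item id plus every extra field."""
    fields = {'item_id': item_id}
    for name, path in extra_fields.items():
        fields[name] = resolve_path(data, path)
    return fields


data = {'name': 'perceval', 'namespace': 'grimoirelab', 'pull_count': 10}
print(build_search_fields('abc123', data, EXTRA_SEARCH_FIELDS))
```

A longer path such as ['fields', 'project', 'id'] is resolved the same way, one nested dict per step.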

fetch(category='dockerhub-data')[source]¶

Fetch data from a Docker Hub repository.

The method retrieves, from a repository stored in Docker Hub, its data which includes number of pulls, stars, description, among other data.

Parameters: category – the category of items to fetch
Returns: a generator of data

fetch_items(category, **kwargs)[source]¶

Fetch the Docker Hub items

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]¶

Returns whether it supports archiving items on the fetch process.

Returns: this backend supports items archive

classmethod has_resuming()[source]¶

Returns whether it supports resuming the fetch process.

Returns: this backend supports items resuming

static metadata_category(item)[source]¶

Extracts the category from a Docker Hub item.

This backend only generates one type of item which is ‘dockerhub-data’.

static metadata_id(item)[source]¶: Extracts the identifier from a Docker Hub item.

static metadata_updated_on(item)[source]¶

Extracts and converts the update time from a Docker Hub item.

The timestamp is extracted from ‘fetched_on’ field. This field is not part of the data provided by Docker Hub. It is added by this backend.

Parameters: item – item generated by the backend
Returns: a UNIX timestamp

static parse_json(raw_json)[source]¶

Parse a Docker Hub JSON stream.

The method parses a JSON stream and returns a dict with the parsed data.

Parameters: raw_json – JSON string to parse
Returns: a dict with the parsed data

version = '0.6.0'¶

class perceval.backends.core.dockerhub.DockerHubClient(archive=None, from_archive=False, ssl_verify=True)[source]¶

Bases: perceval.client.HttpClient

DockerHub API client.

Client for fetching information from the DockerHub server using its REST API v2.

Parameters

archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification

RREPOSITORY = 'repositories'¶

repository(owner, repository)[source]¶: Fetch information about a repository.

class perceval.backends.core.dockerhub.DockerHubCommand(*args, debug=False)[source]¶

Bases: perceval.backend.BackendCommand

Class to run DockerHub backend from the command line.

BACKEND¶: alias of perceval.backends.core.dockerhub.DockerHub

classmethod setup_cmd_parser()[source]¶: Returns the DockerHub argument parser.

perceval.backends.core.gerrit module¶

class perceval.backends.core.gerrit.Gerrit(hostname, user=None, port='29418', max_reviews=500, disable_host_key_check=False, id_filepath=None, tag=None, archive=None, blacklist_ids=None)[source]¶

Bases: perceval.backend.Backend

Gerrit backend.

Class to fetch the reviews from a Gerrit server. To initialize this class, the hostname of the server must be provided. The hostname will be set as the origin of the data.

Parameters

hostname – Gerrit server Hostname
user – SSH user used to connect to the Gerrit server
port – SSH port
max_reviews – maximum number of reviews requested on the same query
disable_host_key_check – disable host key controls
tag – label used to mark the data
archive – archive to store/retrieve items
blacklist_ids – exclude the reviews while fetching
id_filepath – path to SSH private key

CATEGORIES = ['review']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

EXTRA_SEARCH_FIELDS = {'project_name': ['project'], 'review_hash': ['id']}¶

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

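{
    'key-1': value-1,
    'key-2': value-2,
    'key-3': value-3
}

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

{
    'project_id': ['fields', 'project', 'id'],
    'project_key': ['fields', 'project', 'key'],
    'project_name': ['fields', 'project', 'name']
}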

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.

ORIGIN_UNIQUE_FIELD = OriginUniqueField(name='number', type=<class 'str'>)¶

A field unique to a given origin for items produced by this backend.

If ORIGIN_UNIQUE_FIELD is defined, users can pass a list of blocked values which should not be included in the results, if the field defined here contains them. For example, if ORIGIN_UNIQUE_FIELD were set to post_id, then users could pass a list of post ids that should be excluded from the results.

If set to None, blacklisting will be disabled completely. Otherwise, this should be set to an OriginUniqueField containing the name and data type of the field.

Note: Origin in this context refers to one site, API, or other remote that contains several repositories, each consisting of many items of several categories. For example, for the backend GitLab, an origin would be one instance of GitLab, such as gitlab.com or opensource.ieee.org, each of which contains many repositories, which in turn contain items such as issues and merge requests.

To access this field, please prefer origin_unique_field().
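One way to picture the blacklisting mechanism is the sketch below. The filter_blacklisted helper is hypothetical and the namedtuple is a stand-in for the library's OriginUniqueField; the example assumes blocked values are coerced to the declared type before comparison.

```python
from collections import namedtuple

# Stand-in for the library type; field names follow the repr shown above.
OriginUniqueField = namedtuple('OriginUniqueField', ['name', 'type'])
ORIGIN_UNIQUE_FIELD = OriginUniqueField(name='number', type=str)


def filter_blacklisted(items, blacklist_ids, unique_field=ORIGIN_UNIQUE_FIELD):
    """Yield only the items whose unique field is not in the blacklist.

    When `unique_field` is None, blacklisting is disabled and every
    item is yielded unchanged.
    """
    if unique_field is None or not blacklist_ids:
        yield from items
        return

    # Coerce the blocked values to the declared type so that 2 and '2' match.
    blocked = {unique_field.type(value) for value in blacklist_ids}

    for item in items:
        value = item.get(unique_field.name)
        if value is not None and unique_field.type(value) in blocked:
            continue
        yield item


reviews = [{'number': '1'}, {'number': '2'}, {'number': '3'}]
print(list(filter_blacklisted(reviews, [2])))
```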

fetch(category='review', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶

Fetch the reviews from the repository.

The method retrieves, from a Gerrit repository, the reviews updated since the given date.

Parameters

category – the category of items to fetch
from_date – obtain reviews updated since this date

Returns

a generator of reviews

fetch_items(category, **kwargs)[source]¶

Fetch the reviews

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]¶

Returns whether it supports archiving items on the fetch process.

Returns: this backend supports items archive

classmethod has_resuming()[source]¶

Returns whether it supports resuming the fetch process.

Returns: this backend does not support items resuming

static metadata_category(item)[source]¶

Extracts the category from a Gerrit item.

This backend only generates one type of item which is ‘review’.

static metadata_id(item)[source]¶: Extracts the identifier from a Gerrit item.

static metadata_updated_on(item)[source]¶

Extracts and converts the update time from a Gerrit item.

The timestamp is extracted from ‘lastUpdated’ field. This date is a UNIX timestamp but needs to be converted to a float value.

Parameters: item – item generated by the backend
Returns: a UNIX timestamp

static parse_reviews(raw_data)[source]¶: Parse a Gerrit reviews list.

version = '0.13.1'¶

class perceval.backends.core.gerrit.GerritClient(repository, user=None, max_reviews=500, blacklist_reviews=None, disable_host_key_check=False, port='29418', id_filepath=None, archive=None, from_archive=False)[source]¶

Bases: object

Gerrit API client.

This class implements a client to retrieve reviews from a Gerrit repository using the ssh API. Currently it supports <2.8 and >=2.9 versions in incremental mode.

Check the next link for more info: https://gerrit-documentation.storage.googleapis.com/Documentation/2.12/cmd-query.html

Parameters

repository – Hostname of the Gerrit server
user – SSH user to be used to connect to gerrit server
max_reviews – max number of reviews per query
blacklist_reviews – exclude the reviews of this list while fetching
disable_host_key_check – disable host key controls
port – SSH port
id_filepath – SSH private key path
archive – collect issues already retrieved from an archive
from_archive – it tells whether to write/read the archive

CMD_GERRIT = 'gerrit'¶

CMD_VERSION = 'version'¶

MAX_RETRIES = 3¶

RETRY_WAIT = 60¶

VERSION_REGEX = re.compile('gerrit version (\\d+)\\.(\\d+).*')¶
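To show what VERSION_REGEX captures, the following sketch applies it to a version banner. The parse_version helper and the (major, minor) return shape are assumptions for illustration; the pattern itself is the one listed above.

```python
import re

# Same pattern as the class attribute above.
VERSION_REGEX = re.compile(r'gerrit version (\d+)\.(\d+).*')


def parse_version(raw):
    """Extract (major, minor) from the output of the version command."""
    match = VERSION_REGEX.match(raw.strip())
    if not match:
        raise ValueError('invalid Gerrit version string: %r' % raw)
    return int(match.group(1)), int(match.group(2))


print(parse_version('gerrit version 2.12.1'))
```

Only the first two numeric components are captured; anything after the minor number is matched by the trailing `.*` and ignored.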

next_retrieve_group_item(last_item=None, entry=None)[source]¶: Return the item to start from in next reviews group.

reviews(last_item, filter_=None)[source]¶: Get the reviews starting from last_item.

static sanitize_for_archive(cmd)[source]¶

Sanitize the Gerrit command by removing username information before storing/retrieving archived items.

Parameters: cmd – Gerrit command
Returns: the sanitized cmd

property version¶: Return the Gerrit server version.

class perceval.backends.core.gerrit.GerritCommand(*args, debug=False)[source]¶

Bases: perceval.backend.BackendCommand

Class to run Gerrit backend from the command line.

BACKEND¶: alias of perceval.backends.core.gerrit.Gerrit

classmethod setup_cmd_parser()[source]¶: Returns the Gerrit argument parser.

perceval.backends.core.git module¶

exception perceval.backends.core.git.EmptyRepositoryError(**kwargs)[source]¶

Bases: perceval.errors.RepositoryError

Exception raised when a repository is empty

message = '%(repository)s is empty'¶

class perceval.backends.core.git.Git(uri, gitpath, tag=None, archive=None)[source]¶

Bases: perceval.backend.Backend

Git backend.

This class allows fetching the commits from a Git repository (local or remote) or from a log file. To initialize this class, you have to provide the URI of the repository and a value for gitpath. This URI will be set as the origin of the data.

When gitpath is a directory or does not exist, it will be considered as the place where the repository is/will be cloned; when gitpath is a file it will be considered as a Git log file.

Parameters

uri – URI of the Git repository
gitpath – path to the repository or to the log file
tag – label used to mark the data
archive – archive to store/retrieve items

Raises

RepositoryError – raised when there was an error cloning or updating the repository.

CATEGORIES = ['commit']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

fetch(category='commit', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), to_date=datetime.datetime(2100, 1, 1, 0, 0, tzinfo=tzutc()), branches=None, latest_items=False, no_update=False)[source]¶

Fetch commits.

The method retrieves from a Git repository or a log file a list of commits. Commits are returned in the same order they were obtained.

When from_date parameter is given it returns items committed since the given date.

The list of branches is a list of strings, with the names of the branches to fetch. If the list of branches is empty, no commit is fetched. If the list of branches is None, all commits for all branches will be fetched.

The parameter latest_items returns only those commits which are new since the last time this method was called.

The parameter no_update returns all commits without performing an update of the repository before.

Take into account that from_date and branches are ignored when the commits are fetched from a Git log file or when latest_items flag is set.

The class raises a RepositoryError exception when an error occurs accessing the repository.

Parameters

category – the category of items to fetch
from_date – obtain commits newer than a specific date (inclusive)
to_date – obtain commits older than a specific date
branches – names of branches to fetch from (default: None)
latest_items – sync with the repository to fetch only the newest commits
no_update – if enabled, don’t update the repo with the latest changes

Returns

a generator of commits

fetch_items(category, **kwargs)[source]¶

Fetch the commits

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]¶

Returns whether it supports archiving items on the fetch process.

Returns: this backend does not support items archive

classmethod has_resuming()[source]¶

Returns whether it supports resuming the fetch process.

Returns: this backend supports items resuming

static metadata_category(item)[source]¶

Extracts the category from a Git item.

This backend only generates one type of item which is ‘commit’.

static metadata_id(item)[source]¶: Extracts the identifier from a Git item.

static metadata_updated_on(item)[source]¶

Extracts the update time from a Git item.

The timestamp used is extracted from ‘CommitDate’ field. This date is converted to UNIX timestamp format taking into account the timezone of the date.

Parameters: item – item generated by the backend
Returns: a UNIX timestamp
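The conversion can be sketched with the standard library alone. The helper name is hypothetical, and the example assumes that ‘CommitDate’ uses the default Git date layout shown in the sample log (for instance "Tue Aug 14 14:30:13 2012 -0300") and that the process runs under an English/C locale.

```python
from datetime import datetime, timezone


def commit_date_to_timestamp(commit_date):
    """Convert a Git date such as 'Tue Aug 14 14:30:13 2012 -0300'.

    The %z directive makes the resulting datetime timezone-aware, so
    timestamp() accounts for the -0300 offset.
    """
    parsed = datetime.strptime(commit_date, '%a %b %d %H:%M:%S %Y %z')
    return parsed.timestamp()


ts = commit_date_to_timestamp('Tue Aug 14 14:30:13 2012 -0300')
# The same instant expressed in UTC is 17:30:13.
print(ts == datetime(2012, 8, 14, 17, 30, 13, tzinfo=timezone.utc).timestamp())
```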

static parse_git_log_from_file(filepath)[source]¶

Parse a Git log file.

The method parses the Git log file and returns an iterator of dictionaries. Each of these contains a commit.

Parameters

filepath – path to the log file

Returns

a generator of parsed commits

Raises

ParseError – raised when the format of the Git log file is invalid
OSError – raised when an error occurs reading the given file

static parse_git_log_from_iter(iterator)[source]¶

Parse a Git log obtained from an iterator.

The method parses the Git log fetched from an iterator, where each item is a line of the log. It returns an iterator of dictionaries. Each dictionary contains a commit.

Parameters: iterator – iterator of Git log lines
Raises: ParseError – raised when the format of the Git log is invalid

version = '0.12.1'¶

class perceval.backends.core.git.GitCommand(*args, debug=False)[source]¶

Bases: perceval.backend.BackendCommand

Class to run Git backend from the command line.

BACKEND¶: alias of perceval.backends.core.git.Git

classmethod setup_cmd_parser()[source]¶: Returns the Git argument parser.

class perceval.backends.core.git.GitParser(stream)[source]¶

Bases: object

Git log parser.

This class parses a plain Git log stream, converting plain commits into dict items.

Not every Git log output is valid to be parsed. The Git log stream must have a specific structure. It must contain raw commits data and stats about modified files. The next excerpt shows an example of a valid log:

commit aaa7a9209f096aaaadccaaa7089aaaa3f758a703
Author:     John Smith <jsmith@example.com>
AuthorDate: Tue Aug 14 14:30:13 2012 -0300
Commit:     John Smith <jsmith@example.com>
CommitDate: Tue Aug 14 14:30:13 2012 -0300

    Commit for testing

:000000 100644 0000000... aaaaaaa... A  aaa/otherthing
:000000 100644 0000000... aaaaaaa... A  aaa/something
:000000 100644 0000000... aaaaaaa... A  bbb/bthing
0       0       aaa/otherthing
0       0       aaa/something
0       0       bbb/bthing

Each commit starts with the ‘commit’ tag that is followed by the SHA-1 of the commit, its parents (two or more parents in the case of a merge) and a list of refs, if any.

commit 456a68ee1407a77f3e804a30dff245bb6c6b872f ce8e0b86a1e9877f42fe9453ede418519115f367 51a3b654f252210572297f47597b31527c475fb8 (HEAD -> refs/heads/master)

The commit line is followed by one or more headers. Each header has a key and a value:

Author:     John Smith <jsmith@example.com>
AuthorDate: Tue Aug 14 14:30:13 2012 -0300
Commit:     John Smith <jsmith@example.com>
CommitDate: Tue Aug 14 14:30:13 2012 -0300

Then, an empty line divides the headers from the commit message.

    First line of the commit

    Commit message split into one or several lines.
    Each line of the message starts with 4 spaces.

Commit messages can contain a list of ‘trailers’. These trailers have the same format as headers but their meaning is project dependent. This is an example of a commit message with trailers:

    Commit message with trailers

    This is the body of the message where trailers are included.
    Trailers are part of the body so each line of the message
    starts with 4 spaces.

    Signed-off-by: John Doe <jdoe@example.com>
    Signed-off-by: Jane Rae <jrae@example.com>

After a new empty line, actions and stats over files can be found. An action line starts with one or more ‘:’ chars and contains data about the old and new permissions of a file, its old and new indexes, the action code and the filepath to the file. In the case of a copied, renamed or moved file, the new filepath to that file is included.

:100644 100644 e69de29... e69de29... R100  aaa/otherthing  aaa/otherthing.renamed

Stats lines include the number of lines added and removed, and the name of the file. The new name is also included for moved or renamed files.

10 0 aaa/{otherthing => otherthing.renamed}

The commit ends with an empty line.

Take into account that one empty line is valid at the beginning of the log. This allows parsing empty logs without raising exceptions.

This example was generated using the next command:

git log --raw --numstat --pretty=fuller --decorate=full --parents -M -C -c --remotes=origin --all

Parameters: stream – a file object which stores the log

ACTION_PATTERN = '^(?P<sc>\\:+)\n (?P<modes>(?:\\d{6}[ \\t])+)\n (?P<indexes>(?:[a-f0-9]+\\.{,3}[ \\t])+)\n (?P<action>[^\\t]+)\\t+\n (?P<file>[^\\t]+)\n (?:\\t+(?P<newfile>.+))?$'¶

COMMIT = 1¶

COMMIT_PATTERN = '^commit[ \\t](?P<commit>[a-f0-9]{40})\n (?:[ \\t](?P<parents>[a-f0-9][a-f0-9 \\t]+))?\n (?:[ \\t]\\((?P<refs>.+)\\))?$\n '¶

EMPTY_LINE_PATTERN = '^$'¶

FILE = 4¶

GIT_ACTION_REGEXP = re.compile('^(?P<sc>\\:+)\n (?P<modes>(?:\\d{6}[ \\t])+)\n (?P<indexes>(?:[a-f0-9]+\\.{,3}[ \\t])+)\n (?P<action>[^\\t]+)\\t+\n (?P<file>[^\\t]+)\n (?:\\t+(?P<newfile>.+))?$', re.VERBOSE)¶

GIT_COMMIT_REGEXP = re.compile('^commit[ \\t](?P<commit>[a-f0-9]{40})\n (?:[ \\t](?P<parents>[a-f0-9][a-f0-9 \\t]+))?\n (?:[ \\t]\\((?P<refs>.+)\\))?$\n ', re.VERBOSE)¶

GIT_HEADER_TRAILER_REGEXP = re.compile('^(?P<name>[a-zA-z0-9\\-]+)\\:[ \\t]+(?P<value>.+)$', re.VERBOSE)¶

GIT_MESSAGE_REGEXP = re.compile('^[\\s]{4}(?P<msg>.*)$', re.VERBOSE)¶

GIT_NEXT_STATE_REGEXP = re.compile('^$', re.VERBOSE)¶

GIT_STATS_REGEXP = re.compile('^(?P<added>\\d+|-)[ \\t]+(?P<removed>\\d+|-)[ \\t]+(?P<file>.+)$', re.VERBOSE)¶

HEADER = 2¶

HEADER_TRAILER_PATTERN = '^(?P<name>[a-zA-z0-9\\-]+)\\:[ \\t]+(?P<value>.+)$'¶

INIT = 0¶

MESSAGE = 3¶

MESSAGE_LINE_PATTERN = '^[\\s]{4}(?P<msg>.*)$'¶

STATS_PATTERN = '^(?P<added>\\d+|-)[ \\t]+(?P<removed>\\d+|-)[ \\t]+(?P<file>.+)$'¶

TRAILERS = ['Signed-off-by']¶

parse()[source]¶: Parse the Git log stream.
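The sketch below shows how two of the patterns above could be applied to single log lines. It is only an illustration, not the parser's state machine: the helper names are hypothetical, and the commit pattern assumes the refs are wrapped in literal parentheses, as in the sample commit line of the class description.

```python
import re

# Patterns modelled on COMMIT_PATTERN and STATS_PATTERN listed above.
COMMIT_PATTERN = r"""^commit[ \t](?P<commit>[a-f0-9]{40})
                     (?:[ \t](?P<parents>[a-f0-9][a-f0-9 \t]+))?
                     (?:[ \t]\((?P<refs>.+)\))?$
                     """
STATS_PATTERN = r"^(?P<added>\d+|-)[ \t]+(?P<removed>\d+|-)[ \t]+(?P<file>.+)$"

COMMIT_RE = re.compile(COMMIT_PATTERN, re.VERBOSE)
STATS_RE = re.compile(STATS_PATTERN, re.VERBOSE)


def parse_commit_line(line):
    """Split a 'commit ...' line into hash, parents and refs."""
    match = COMMIT_RE.match(line)
    if not match:
        return None
    parents = match.group('parents')
    refs = match.group('refs')
    return {
        'commit': match.group('commit'),
        'parents': parents.split() if parents else [],
        'refs': refs.split(', ') if refs else [],
    }


def parse_stats_line(line):
    """Split a numstat line; '-' (binary files) becomes None."""
    match = STATS_RE.match(line)
    if not match:
        return None

    def to_int(value):
        return None if value == '-' else int(value)

    return {
        'added': to_int(match.group('added')),
        'removed': to_int(match.group('removed')),
        'file': match.group('file'),
    }
```

Because the patterns are compiled with re.VERBOSE, the line breaks and indentation inside COMMIT_PATTERN are ignored, while the whitespace inside character classes such as `[ \t]` still matches.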

class perceval.backends.core.git.GitRef(hash, refname)¶

Bases: tuple

property hash¶: Alias for field number 0

property refname¶: Alias for field number 1

class perceval.backends.core.git.GitRepository(uri, dirpath)[source]¶

Bases: object

Manage a Git repository.

This class provides access to a Git repository running some common commands such as clone, pull or log. To create an instance from a remote repository, use clone() class method.

Parameters

uri – URI of the repository
dirpath – local directory where the repository is stored

GIT_PRETTY_OUTPUT_OPTS = ['--raw', '--numstat', '--pretty=fuller', '--decorate=full', '--parents', '-M', '-C', '-c']¶

classmethod clone(uri, dirpath)[source]¶

Clone a Git repository.

Make a bare copy of the repository stored in uri into dirpath. The repository can be either local or remote.

Parameters

uri – URI of the repository
dirpath – directory where the repository will be cloned

Returns

a GitRepository instance for the cloned repository

Raises

RepositoryError – when an error occurs cloning the given repository

count_objects()[source]¶

Count the objects of a repository.

The method returns the total number of objects (packed and unpacked) available on the repository.

Raises: RepositoryError – when an error occurs counting the objects of a repository

is_detached()[source]¶

Check if the repo is in a detached state.

The repository is in a detached state when HEAD is not a symbolic reference.

Returns: whether the repository is detached or not
Raises: RepositoryError – when an error occurs checking the state of the repository

is_empty()[source]¶

Determines whether the repository is empty or not.

Returns True when the repository is empty. Under the hood, it checks the number of objects on the repository. When this number is 0, the repository is empty.

Raises: RepositoryError – when an error occurs accessing the repository
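The object-count check can be sketched as below. This example assumes the counts come from the text printed by `git count-objects -v`, where `count` is the number of loose objects and `in-pack` the number of packed ones; the helper names are hypothetical and the command is not executed here.

```python
def parse_count_objects(output):
    """Return the total number of objects (loose + packed).

    `output` is expected to look like the text printed by
    `git count-objects -v`, one 'key: value' pair per line.
    """
    counts = {}
    for line in output.splitlines():
        key, sep, value = line.partition(':')
        if sep:
            counts[key.strip()] = value.strip()
    return int(counts.get('count', 0)) + int(counts.get('in-pack', 0))


def is_empty(output):
    """A repository with no objects at all is considered empty."""
    return parse_count_objects(output) == 0


sample = "count: 3\nsize: 12\nin-pack: 120\npacks: 1\nsize-pack: 45\n"
print(parse_count_objects(sample), is_empty(sample))
```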

log(from_date=None, to_date=None, branches=None, encoding='utf-8')[source]¶

Read the commit log from the repository.

The method returns the Git log of the repository using the following options:

git log --raw --numstat --pretty=fuller --decorate=full
--all --reverse --topo-order --parents -M -C -c --remotes=origin

When from_date is given, it gets the commits equal to or newer than that date. This date is given as a datetime object.

Parameters

from_date – fetch commits newer than a specific date (inclusive)
to_date – fetch commits older than a specific date
branches – names of branches to fetch from (default: None)
encoding – encode the log using this format

Returns

a generator where each item is a line from the log

Raises

EmptyRepositoryError – when the repository is empty and the action cannot be performed
RepositoryError – when an error occurs fetching the log
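A possible way to assemble such a command line is sketched below. The build_log_command helper is hypothetical, and the mapping of from_date and to_date to `--since`/`--until`, as well as the handling of the branches argument, are assumptions made for the example rather than the method's documented behaviour.

```python
from datetime import datetime, timezone

# Options copied from GIT_PRETTY_OUTPUT_OPTS above.
GIT_PRETTY_OUTPUT_OPTS = [
    '--raw', '--numstat', '--pretty=fuller', '--decorate=full',
    '--parents', '-M', '-C', '-c',
]


def build_log_command(from_date=None, to_date=None, branches=None):
    """Build the argument list for a `git log` call (illustrative only)."""
    cmd = ['git', 'log', '--reverse', '--topo-order']
    cmd.extend(GIT_PRETTY_OUTPUT_OPTS)

    # Assumption: date bounds are passed with --since / --until.
    if from_date is not None:
        cmd.append('--since=' + from_date.strftime('%Y-%m-%d %H:%M:%S %z'))
    if to_date is not None:
        cmd.append('--until=' + to_date.strftime('%Y-%m-%d %H:%M:%S %z'))

    if branches is None:
        # No restriction: walk every ref.
        cmd.extend(['--all', '--remotes=origin'])
    elif branches:
        # Assumption: branch names are passed as plain revision arguments.
        cmd.extend(branches)
    else:
        # Empty list: ask Git for no commits at all.
        cmd.append('--max-count=0')

    return cmd


print(build_log_command(from_date=datetime(2020, 1, 1, tzinfo=timezone.utc)))
```

Returning a list instead of a single string avoids shell quoting issues when the command is later run with a subprocess call.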

rev_list(branches=None)[source]¶

Read the list of commits from the repository.

The method returns the Git rev-list of the repository using the following options:

git rev-list --topo-order

Parameters

branches – names of branches to fetch from (default: None)

Raises

EmptyRepositoryError – when the repository is empty and the action cannot be performed
RepositoryError – when an error occurs executing the command

show(commits=None, encoding='utf-8')[source]¶

Show the data of a set of commits.

The method returns the output of Git show command for a set of commits using the following options:

git show --raw --numstat --pretty=fuller --decorate=full
--parents -M -C -c [<commit>...<commit>]

When the list of commits is empty, the command will return data about the last commit, like the default behaviour of git show.

Parameters

commits – list of commits to show data
encoding – encode the output using this format

Returns

a generator where each item is a line from the show output

Raises

EmptyRepositoryError – when the repository is empty and the action cannot be performed
RepositoryError – when an error occurs fetching the show output

sync()[source]¶

Keep the repository in sync.

This method will synchronize the repository with its ‘origin’, fetching the newest objects and updating references. It uses low-level commands, which make it possible to keep track of what has changed in the repository.

The method also returns a list of hashes related to the new commits fetched during the process.

Returns: list of new commits
Raises: RepositoryError – when an error occurs synchronizing the repository

update()[source]¶

Update repository from its remote.

Calling this method, the repository will be synchronized with the remote repository using ‘fetch’ command for ‘heads’ refs. Any commit stored in the local copy will be removed; refs will be overwritten.

Raises: RepositoryError – when an error occurs updating the repository

perceval.backends.core.github module¶

class perceval.backends.core.github.GitHub(owner=None, repository=None, api_token=None, github_app_id=None, github_app_pk_filepath=None, base_url=None, tag=None, archive=None, sleep_for_rate=False, min_rate_to_sleep=10, max_retries=5, sleep_time=1, max_items=100, ssl_verify=True)[source]¶

Bases: perceval.backend.Backend

GitHub backend for Perceval.

This class allows fetching the issues stored in a GitHub repository. Note that since version 0.20.0, api_token accepts a list of tokens, so the backend must be initialized as follows:

```
GitHub(
    owner='chaoss', repository='grimoirelab',
    api_token=[TOKEN-1, TOKEN-2, ...], sleep_for_rate=True,
    sleep_time=300
)
```

Parameters

owner – GitHub owner
repository – GitHub repository from the owner
api_token – list of GitHub auth tokens to access the API
github_app_id – GitHub App ID
github_app_pk_filepath – GitHub App private key PEM file path
base_url – GitHub URL in enterprise edition case; when no value is set the backend will fetch the data from the GitHub public site.
tag – label used to mark the data
archive – archive to store/retrieve items
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to sleep until it will be reset
max_retries – number of max retries to a data source before raising a RetryError exception
max_items – max number of category items (e.g., issues, pull requests) per query
sleep_time – time to sleep in case of connection problems
ssl_verify – enable/disable SSL verification

CATEGORIES = ['issue', 'pull_request', 'repository']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

CLASSIFIED_FIELDS = [['user_data'], ['merged_by_data'], ['assignee_data'], ['assignees_data'], ['requested_reviewers_data'], ['comments_data', 'user_data'], ['comments_data', 'reactions_data', 'user_data'], ['reviews_data', 'user_data'], ['review_comments_data', 'user_data'], ['review_comments_data', 'reactions_data', 'user_data']]¶

A list of fields that should be considered sensitive or confidential.

Fields listed here will be hidden from fetched items, when this behaviour is requested.

Fields are represented as a list of strings. As items returned are dicts that may contain nested dicts, each entry is a list which stores the “path” or nested dicts keys to the field to remove. For example, [‘my’, ‘classified’, ‘field’] will remove field from item[‘data’][‘my’][‘classified’] dict.

Classified data filtering and archiving are not compatible to prevent data leaks or security issues.
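The path-based removal can be sketched as follows. The function names are hypothetical, and the descent into lists (for example, one dict per comment under ‘comments_data’) is an assumption about how nested collections are handled.

```python
import copy


def remove_classified_fields(item, classified_fields):
    """Return a copy of `item` without the classified fields.

    Each entry of `classified_fields` is a path of keys relative to
    item['data'], e.g. ['comments_data', 'user_data'].
    """
    filtered = copy.deepcopy(item)
    for path in classified_fields:
        _remove_path(filtered['data'], path)
    return filtered


def _remove_path(node, path):
    """Delete the field reached by `path`, descending into lists."""
    if isinstance(node, list):
        for element in node:
            _remove_path(element, path)
        return
    if not isinstance(node, dict) or not path:
        return

    key, rest = path[0], path[1:]
    if key not in node:
        return
    if not rest:
        del node[key]
    else:
        _remove_path(node[key], rest)


item = {'data': {
    'title': 'A bug',
    'user_data': {'login': 'jsmith'},
    'comments_data': [{'body': 'hi', 'user_data': {'login': 'jdoe'}}],
}}
fields = [['user_data'], ['comments_data', 'user_data']]
print(remove_classified_fields(item, fields))
```

Filtering a deep copy leaves the fetched item intact, which matters if the unfiltered data is still needed elsewhere in the pipeline.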

fetch(category='issue', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), to_date=datetime.datetime(2100, 1, 1, 0, 0, tzinfo=tzutc()), filter_classified=False)[source]¶

Fetch the issues/pull requests from the repository.

The method retrieves, from a GitHub repository, the issues/pull requests updated since the given date.

Parameters

category – the category of items to fetch
from_date – obtain issues/pull requests updated since this date
to_date – obtain issues/pull requests until a specific date (included)
filter_classified – remove classified fields from the resulting items

Returns

a generator of issues

fetch_items(category, **kwargs)[source]¶

Fetch the items (issues or pull_requests or repo information)

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]¶

Returns whether it supports archiving items on the fetch process.

Returns: this backend supports items archive

classmethod has_resuming()[source]¶

Returns whether it supports resuming the fetch process.

Returns: this backend supports items resuming

static metadata_category(item)[source]¶

Extracts the category from a GitHub item.

This backend generates three types of item: ‘issue’, ‘pull_request’ and ‘repository’.

static metadata_id(item)[source]¶: Extracts the identifier from a GitHub item.

static metadata_updated_on(item)[source]¶

Extracts the update time from a GitHub item.

The timestamp used is extracted from ‘updated_at’ field. This date is converted to UNIX timestamp format. As GitHub dates are in UTC the conversion is straightforward.

Parameters: item – item generated by the backend
Returns: a UNIX timestamp
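Because the dates are in UTC, the conversion reduces to parsing the string and attaching the UTC timezone. The sketch below assumes ‘updated_at’ values look like "2016-01-01T00:00:00Z"; the helper name is hypothetical.

```python
from datetime import datetime, timezone


def updated_at_to_timestamp(updated_at):
    """Convert a UTC date string such as '2016-01-01T00:00:00Z'."""
    parsed = datetime.strptime(updated_at, '%Y-%m-%dT%H:%M:%SZ')
    # The trailing 'Z' means UTC, so no offset arithmetic is needed.
    return parsed.replace(tzinfo=timezone.utc).timestamp()


print(updated_at_to_timestamp('2016-01-01T00:00:00Z'))
```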

search_fields(item)[source]¶

Add search fields to an item.

It adds the values of metadata_id plus the owner and repo.

Parameters: item – the item to extract the search fields values
Returns: a dict of search fields

version = '0.27.0'¶

class perceval.backends.core.github.GitHubClient(owner, repository, tokens=None, github_app_id=None, github_app_pk_filepath=None, base_url=None, sleep_for_rate=False, min_rate_to_sleep=10, sleep_time=1, max_retries=5, max_items=100, archive=None, from_archive=False, ssl_verify=True)[source]¶

Bases: perceval.client.HttpClient, perceval.client.RateLimitHandler

Client for retrieving information from GitHub API

Parameters

owner – GitHub owner
repository – GitHub repository from the owner
tokens – list of GitHub auth tokens to access the API
github_app_id – GitHub App ID
github_app_pk_filepath – GitHub App private key PEM file path
base_url – GitHub URL in enterprise edition case; when no value is set the backend will fetch the data from the GitHub public site.
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to sleep until it will be reset
sleep_time – time to sleep in case of connection problems
max_retries – number of max retries to a data source before raising a RetryError exception
max_items – max number of category items (e.g., issues, pull requests) per query
archive – collect issues already retrieved from an archive
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification

EXTRA_STATUS_FORCELIST = [403, 500, 502, 503]¶

HACCEPT = 'Accept'¶

HAUTHORIZATION = 'Authorization'¶

PDIRECTION = 'direction'¶

PPER_PAGE = 'per_page'¶

PSINCE = 'since'¶

PSORT = 'sort'¶

PSTATE = 'state'¶

RCOMMENTS = 'comments'¶

RCOMMITS = 'commits'¶

RISSUES = 'issues'¶

RORGS = 'orgs'¶

RPULLS = 'pulls'¶

RRATE_LIMIT = 'rate_limit'¶

RREACTIONS = 'reactions'¶

RREPOS = 'repos'¶

RREQUESTED_REVIEWERS = 'requested_reviewers'¶

RREVIEWS = 'reviews'¶

RUSERS = 'users'¶

VACCEPT = 'application/vnd.github.squirrel-girl-preview'¶

VACCEPT_V3 = 'application/vnd.github.v3+json'¶

VDIRECTION_ASC = 'asc'¶

VSORT_UPDATED = 'updated'¶

VSTATE_ALL = 'all'¶

calculate_time_to_reset()[source]¶: Calculate the seconds to reset the token requests, by obtaining the difference between the current date and the next date when the token is fully regenerated.
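As a rough sketch of that calculation, the function below takes the reset moment as a UNIX epoch value and returns the seconds left until then. The name `seconds_to_reset`, the `now` argument and the clamping to zero are assumptions made for illustration; the client may obtain and round these values differently.

```python
from datetime import datetime, timezone


def seconds_to_reset(reset_epoch, now=None):
    """Seconds between 'now' and the moment the token is fully regenerated."""
    if now is None:
        now = datetime.now(timezone.utc)
    reset_at = datetime.fromtimestamp(reset_epoch, tz=timezone.utc)
    # Never return a negative wait if the reset moment has already passed.
    return max(0.0, (reset_at - now).total_seconds())


# Fixed reference time so the example is reproducible.
now = datetime(2020, 1, 1, tzinfo=timezone.utc)
print(seconds_to_reset(now.timestamp() + 90, now=now))
```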

fetch(url, payload=None, headers=None, method='GET', stream=False, auth=None)[source]¶

Fetch the data from a given URL.

Parameters

url – link to the resource
payload – payload of the request
headers – headers of the request
method – type of request call (GET or POST)
stream – defer downloading the response body until the response content is available
auth – auth of the request

Returns: a response object

fetch_items(path, payload)[source]¶: Return the items from the GitHub API using link-based pagination
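Link-based pagination can be sketched as follows: each response carries a `Link` header, and the URL tagged `rel="next"` points at the following page. The helpers `next_link` and `paginate`, and the `fetch_page` callable that returns a `(body, link_header)` pair, are hypothetical stand-ins for the HTTP layer; the parsing assumes URLs that contain no commas.

```python
import re


def next_link(link_header):
    """Return the URL tagged rel="next" in a Link header, or None."""
    for part in link_header.split(","):
        match = re.match(r'\s*<([^>]+)>\s*;\s*rel="([^"]+)"', part)
        if match and match.group(2) == "next":
            return match.group(1)
    return None


def paginate(fetch_page, first_url):
    """Yield the body of every page, following rel="next" links."""
    url = first_url
    while url:
        body, link_header = fetch_page(url)
        yield body
        url = next_link(link_header or "")


# Fake two-page source used instead of real HTTP requests.
pages = {
    "u1": (["a", "b"], '<u2>; rel="next", <u2>; rel="last"'),
    "u2": (["c"], '<u1>; rel="prev"'),
}
print(list(paginate(lambda url: pages[url], "u1")))
```

Iteration stops as soon as a response has no `rel="next"` entry, which is how the last page is recognised in this sketch.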

issue_comment_reactions(comment_id)[source]¶: Get reactions of an issue comment

issue_comments(issue_number)[source]¶: Get the issue comments from pagination

issue_reactions(issue_number)[source]¶: Get reactions of an issue

issues(from_date=None)[source]¶

Fetch the issues from the repository.

The method retrieves, from a GitHub repository, the issues updated since the given date.

Parameters: from_date – obtain issues updated since this date
Returns: a generator of issues
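The class constants listed above (`PSTATE`, `PPER_PAGE`, `PDIRECTION`, `PSORT`, `PSINCE` and the matching `V...` values) suggest how the query for updated issues could be assembled. The sketch below is an assumption about how they fit together, not the client's actual code; `build_issues_payload` is a hypothetical helper, and the use of `isoformat()` for the `since` value is likewise assumed.

```python
from datetime import datetime, timezone

# Values copied from the constants documented on this class.
PDIRECTION, PPER_PAGE, PSINCE, PSORT, PSTATE = (
    "direction", "per_page", "since", "sort", "state")
VDIRECTION_ASC, VSORT_UPDATED, VSTATE_ALL = "asc", "updated", "all"


def build_issues_payload(from_date=None, per_page=100):
    """Build query parameters for issues sorted by ascending update time."""
    payload = {
        PSTATE: VSTATE_ALL,
        PPER_PAGE: per_page,
        PDIRECTION: VDIRECTION_ASC,
        PSORT: VSORT_UPDATED,
    }
    if from_date is not None:
        # Assumption: the date is sent as an ISO 8601 string.
        payload[PSINCE] = from_date.isoformat()
    return payload


print(build_issues_payload(datetime(2020, 1, 1, tzinfo=timezone.utc)))
```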

pull_commits(pr_number)[source]¶: Get pull request commits

pull_requested_reviewers(pr_number)[source]¶: Get pull requested reviewers

pull_review_comment_reactions(comment_id)[source]¶: Get reactions of a review comment

pull_review_comments(pr_number)[source]¶: Get pull request review comments

pull_reviews(pr_number)[source]¶: Get pull request reviews

pulls(from_date=None)[source]¶

Fetch the pull requests from the repository.

The method retrieves, from a GitHub repository, the pull requests updated since the given date.

Parameters: from_date – obtain pull requests updated since this date
Returns: a generator of pull requests

repo()[source]¶: Get repository data

static sanitize_for_archive(url, headers, payload)[source]¶

Sanitize the payload of an HTTP request by removing the token information before storing/retrieving archived items

Parameters

url – HTTP url request
headers – HTTP headers request
payload – HTTP payload request

Returns

url, headers and the sanitized payload
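One plausible reading of this sanitizing step is dropping the `Authorization` header (see `HAUTHORIZATION` above) so that no token ends up in the archive. The function below is a sketch under that assumption and may not match the client's implementation; it returns a copy of the headers rather than changing the caller's dict.

```python
HAUTHORIZATION = "Authorization"  # same value as the class constant


def sanitize_for_archive(url, headers, payload):
    """Return url, headers and payload with the auth header removed."""
    if headers and HAUTHORIZATION in headers:
        # Build a new dict so the caller's headers stay untouched.
        headers = {k: v for k, v in headers.items() if k != HAUTHORIZATION}
    return url, headers, payload


# Made-up request data for illustration.
original = {"Authorization": "token abc", "Accept": "application/json"}
print(sanitize_for_archive("https://example.org/issues", original, {"page": 1}))
```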

user(login)[source]¶: Get the user information and update the user cache

user_orgs(login)[source]¶: Get the user public organizations

class perceval.backends.core.github.GitHubCommand(*args, debug=False)[source]¶

Bases: perceval.backend.BackendCommand

Class to run GitHub backend from the command line.

BACKEND¶: alias of perceval.backends.core.github.GitHub

classmethod setup_cmd_parser()[source]¶: Returns the GitHub argument parser.

perceval.backends.core.githubql module¶

class perceval.backends.core.githubql.GitHubQL(owner=None, repository=None, api_token=None, github_app_id=None, github_app_pk_filepath=None, base_url=None, tag=None, archive=None, sleep_for_rate=False, min_rate_to_sleep=10, max_retries=5, sleep_time=1, max_items=100, ssl_verify=True)[source]¶

Bases: perceval.backends.core.github.GitHub

GitHubQL backend for Perceval using the GitHub API v4. Most of the methods are inherited from the GitHub backend.

This class fetches the issue events of a GitHub repository. Note that the events retrieved also include those of pull requests, since in GitHub every pull request is an issue, but an issue may not be a pull request. Pull requests can be identified by the attribute pull_request included in data.issue.

Because the GitHub API v3 cannot fetch issue events after a given date, the events are fetched via the GitHub API v4 (based on GraphQL).

All issues of a given tracker are retrieved in ascending order based on the last time they were updated. For each issue, its events (optionally from/until a given date) are collected using a GraphQL call. Each event is returned by Perceval together with the corresponding issue (available in data.issue).
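Since each event is returned together with its issue in data.issue, telling pull request events apart comes down to checking for the pull_request attribute there. The sketch below uses made-up items that contain only the fields needed for the check; the helper name `is_pull_request_event` is hypothetical.

```python
def is_pull_request_event(item):
    """True when the event's issue carries the 'pull_request' attribute."""
    return "pull_request" in item["data"]["issue"]


# Made-up items; real ones contain many more fields.
events = [
    {"data": {"issue": {"number": 1}}},
    {"data": {"issue": {"number": 2, "pull_request": {"url": "..."}}}},
]
print([is_pull_request_event(event) for event in events])
```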

Since the events are collected issue by issue, incremental fetching is not supported. This limitation is due to the fact that events that occur on an issue may not update the issue attributes. Since there is no way to identify new events from the attributes of an issue, all issues must be fetched for every execution.

No user information beyond the login is included in data returned by this backend. Thus, the backend doesn’t require filter classified support.

Parameters

owner – GitHub owner
repository – GitHub repository from the owner
api_token – list of GitHub auth tokens to access the API
github_app_id – GitHub App ID
github_app_pk_filepath – GitHub App private key PEM file path
base_url – GitHub URL in enterprise edition case; when no value is set the backend will fetch the data from the GitHub public site.
tag – label used to mark the data
archive – archive to store/retrieve items
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to sleep until it will be reset
max_retries – number of max retries to a data source before raising a RetryError exception
max_items – max number of category items per query
sleep_time – time to sleep in case of connection problems
ssl_verify – enable/disable SSL verification

CATEGORIES = ['event']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
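The rule that only the categories listed in CATEGORIES may be passed to fetch() can be pictured as a simple membership check. The sketch below is an illustration only: `check_category` is a hypothetical helper, and `ValueError` stands in for whatever exception the library actually raises.

```python
CATEGORIES = ["event"]  # value documented for this backend


def check_category(category, categories=CATEGORIES):
    """Return the category if the backend can fetch it, else raise."""
    if category not in categories:
        # ValueError is an assumption made for this sketch.
        raise ValueError("category %r is not valid for this backend" % category)
    return category


print(check_category("event"))
```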

fetch(category='event', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), to_date=datetime.datetime(2100, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶

Fetch the issue events from the repository.

The method retrieves, from a GitHub repository, the issue events since/until a given date.

Parameters

category – the category of items to fetch
from_date – obtain issue events since this date
to_date – obtain issue events until this date (included)

Returns

a generator of events

fetch_items(category, **kwargs)[source]¶

Fetch the items

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_resuming()[source]¶

Returns whether it supports to resume the fetch process.

Returns: this backend doesn’t support items resuming

static metadata_category(item)[source]¶

Extracts the category from a GitHub item.

This backend generates one type of item which is ‘event’.

static metadata_id(item)[source]¶: Extracts the identifier from a GitHub item.

static metadata_updated_on(item)[source]¶

Extracts the update time from a GitHub item.

The timestamp used is extracted from ‘createdAt’ field. This date is converted to UNIX timestamp format. As GitHub dates are in UTC the conversion is straightforward.

Parameters: item – item generated by the backend
Returns: a UNIX timestamp

version = '0.4.0'¶

class perceval.backends.core.githubql.GitHubQLClient(owner, repository, tokens=None, github_app_id=None, github_app_pk_filepath=None, base_url=None, sleep_for_rate=False, min_rate_to_sleep=10, sleep_time=1, max_retries=5, max_items=100, archive=None, from_archive=False, ssl_verify=True)[source]¶

Bases: perceval.backends.core.github.GitHubClient

Client for retrieving information from GitHub API

Parameters

owner – GitHub owner
repository – GitHub repository from the owner
tokens – list of GitHub auth tokens to access the API
github_app_id – GitHub App ID
github_app_pk_filepath – GitHub App private key PEM file path
base_url – GitHub URL in enterprise edition case; when no value is set the backend will fetch the data from the GitHub public site.
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to sleep until it will be reset
sleep_time – time to sleep in case of connection problems
max_retries – number of max retries to a data source before raising a RetryError exception
max_items – max number of category items (e.g., issues, pull requests) per query
archive – collect events already retrieved from an archive
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification

VACCEPT = 'application/vnd.github.squirrel-girl-preview,application/vnd.github.starfox-preview+json'¶

VPER_PAGE = 100¶

events(issue_number, is_pull, from_date)[source]¶

Get the issue events of the types declared at EVENT_TYPES from the GraphQL API

Parameters

issue_number – number of the issue
is_pull – boolean value to identify a pull request
from_date – fetch events after a given date

class perceval.backends.core.githubql.GitHubQLCommand(*args, debug=False)[source]¶

Bases: perceval.backends.core.github.GitHubCommand

Class to run GitHubQL backend from the command line.

BACKEND¶: alias of perceval.backends.core.githubql.GitHubQL

perceval.backends.core.gitlab module¶

class perceval.backends.core.gitlab.GitLab(owner=None, repository=None, api_token=None, is_oauth_token=False, base_url=None, tag=None, archive=None, sleep_for_rate=False, min_rate_to_sleep=10, max_retries=5, sleep_time=1, blacklist_ids=None, extra_retry_after_status=None, ssl_verify=True)[source]¶

Bases: perceval.backend.Backend

GitLab backend for Perceval.

This class fetches the issues stored in a GitLab repository.

Parameters

owner – GitLab owner
repository – GitLab repository from the owner
api_token – GitLab auth token to access the API
is_oauth_token – True if the token is OAuth (default False)
base_url – GitLab URL in enterprise edition case; when no value is set the backend will fetch the data from the GitLab public site.
tag – label used to mark the data
archive – archive to store/retrieve items
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to sleep until it will be reset
max_retries – number of max retries to a data source before raising a RetryError exception
sleep_time – time (in seconds) to sleep in case of connection problems
blacklist_ids – ids of items that must not be retrieved
extra_retry_after_status – retry HTTP requests after status (default 500 and 502). These statuses complement the ones (413, 429, 503) defined in the HttpClient class
ssl_verify – enable/disable SSL verification

CATEGORIES = ['issue', 'merge_request']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

ORIGIN_UNIQUE_FIELD = OriginUniqueField(name='iid', type=<class 'int'>)¶

A field unique to a given origin for items produced by this backend.

If set to None, blacklisting will be disabled completely. Otherwise, this should be set to an OriginUniqueField containing the name and data type of the field.

To access this field, please prefer origin_unique_field().
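To make the link between this field and blacklisting concrete, the sketch below filters out items whose unique field value appears in a blacklist. The namedtuple is a stand-in that mirrors the repr shown above, and `skip_blacklisted` is a hypothetical helper; the library's own filtering may work differently.

```python
from collections import namedtuple

# Stand-in definition mirroring OriginUniqueField(name='iid', type=int).
OriginUniqueField = namedtuple("OriginUniqueField", ["name", "type"])
ORIGIN_UNIQUE_FIELD = OriginUniqueField(name="iid", type=int)


def skip_blacklisted(items, blacklist_ids, field=ORIGIN_UNIQUE_FIELD):
    """Yield the items whose unique field value is not blacklisted."""
    if field is None:
        # No unique field: blacklisting is disabled completely.
        yield from items
        return
    # Coerce the ids to the declared type (e.g. '2' -> 2).
    blocked = {field.type(value) for value in blacklist_ids}
    for item in items:
        if item[field.name] not in blocked:
            yield item


items = [{"iid": 1}, {"iid": 2}, {"iid": 3}]  # made-up items
print(list(skip_blacklisted(items, ["2"])))
```

Carrying the data type alongside the field name lets ids supplied as strings, for example from the command line, be compared against the typed values found in the items.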

fetch(category='issue', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶

Fetch the issues/merge requests from the repository.

The method retrieves, from a GitLab repository, the issues/merge requests updated since the given date.

Parameters

category – the category of items to fetch
from_date – obtain issues updated since this date

Returns

a generator of issues

fetch_items(category, **kwargs)[source]¶

Fetch the items (issues or merge_requests)

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]¶

Returns whether it supports archiving items on the fetch process.

Returns: this backend supports items archive

classmethod has_resuming()[source]¶

Returns whether it supports to resume the fetch process.

Returns: this backend does not support items resuming

static metadata_category(item)[source]¶

Extracts the category from a GitLab item.

This backend generates two types of item which are ‘issue’ and ‘merge_request’.

static metadata_id(item)[source]¶: Extracts the identifier from a GitLab item.

static metadata_updated_on(item)[source]¶

Extracts the update time from a GitLab item.

The timestamp used is extracted from ‘updated_at’ field. This date is converted to UNIX timestamp format. As GitLab dates are in UTC the conversion is straightforward.

Parameters: item – item generated by the backend
Returns: a UNIX timestamp

search_fields(item)[source]¶

Add search fields to an item.

It adds the values of metadata_id plus the owner, project and iid of the issue or merge requests. Optionally, if the project is part of a (nested) group, all groups are also included to the search fields via the attribute groups.

Parameters: item – the item to extract the search fields values
Returns: a dict of search fields

version = '0.12.0'¶

class perceval.backends.core.gitlab.GitLabClient(owner, repository, token, is_oauth_token=False, base_url=None, sleep_for_rate=False, min_rate_to_sleep=10, sleep_time=1, max_retries=5, extra_retry_after_status=None, archive=None, from_archive=False, ssl_verify=True)[source]¶

Bases: perceval.client.HttpClient, perceval.client.RateLimitHandler

Client for retrieving information from GitLab API

Parameters

owner – GitLab owner
repository – GitLab owner’s repository
token – GitLab auth token to access the API
is_oauth_token – True if the token is OAuth (default False)
base_url – GitLab URL in enterprise edition case; when no value is set the backend will fetch the data from the GitLab public site.
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to sleep until it will be reset
sleep_time – time (in seconds) to sleep in case of connection problems
max_retries – number of max retries to a data source before raising a RetryError exception
extra_retry_after_status – retry HTTP requests after status
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification
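The `is_oauth_token` flag determines which request header carries the token: the GitLab API accepts OAuth tokens in an `Authorization: Bearer` header and personal access tokens in a `PRIVATE-TOKEN` header. The sketch below shows that choice using the header names documented on this class (`HAUTHORIZATION`, `HPRIVATE_TOKEN`); `auth_headers` is a hypothetical helper and the client may build its headers differently.

```python
HAUTHORIZATION = "Authorization"
HPRIVATE_TOKEN = "PRIVATE-TOKEN"


def auth_headers(token, is_oauth_token=False):
    """Pick the header that carries the token for the given token type."""
    if is_oauth_token:
        return {HAUTHORIZATION: "Bearer " + token}
    return {HPRIVATE_TOKEN: token}


print(auth_headers("abc"))
print(auth_headers("abc", is_oauth_token=True))
```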

HAUTHORIZATION = 'Authorization'¶

HPRIVATE_TOKEN = 'PRIVATE-TOKEN'¶

HRATE_LIMIT = 'RateLimit-Remaining'¶

HRATE_LIMIT_RESET = 'RateLimit-Reset'¶

PORDER_BY = 'order_by'¶

PPER_PAGE = 'per_page'¶

PSORT = 'sort'¶

PSTATE = 'state'¶

PUPDATE_AFTER = 'updated_after'¶

PVIEW = 'view'¶

REMOJI = 'award_emoji'¶

RISSUES = 'issues'¶

RMERGES = 'merge_requests'¶

RNOTES = 'notes'¶

RPROJECTS = 'projects'¶

RVERSIONS = 'versions'¶

VORDER_UPDATED_AT = 'updated_at'¶

VPER_PAGE = 100¶

VSORT_ASC = 'asc'¶

VSTATE_ALL = 'all'¶

VVIEW_SIMPLE = 'simple'¶

calculate_time_to_reset()[source]¶: Calculate the seconds to reset the token requests, by obtaining the difference between the current date and the next date when the token is fully regenerated.

emojis(item_type, item_id)[source]¶: Get emojis from pagination

fetch(url, payload=None, headers=None, method='GET', stream=False)[source]¶

Fetch the data from a given URL.

Parameters

url – link to the resource
payload – payload of the request
headers – headers of the request
method – type of request call (GET or POST)
stream – defer downloading the response body until the response content is available

Returns: a response object

fetch_items(path, payload)[source]¶: Return the items from GitLab API using links pagination

issues(from_date=None)[source]¶: Get the issues from pagination

merge(merge_id)[source]¶: Get the merge full data

merge_version(merge_id, version_id)[source]¶: Get merge version detail

merge_versions(merge_id)[source]¶: Get the merge versions from pagination

merges(from_date=None)[source]¶: Get the merge requests from pagination

note_emojis(item_type, item_id, note_id)[source]¶: Get emojis of a note

notes(item_type, item_id)[source]¶: Get the notes from pagination

static sanitize_for_archive(url, headers, payload)[source]¶

Sanitize the payload of an HTTP request by removing the token information before storing/retrieving archived items

Parameters

url – HTTP url request
headers – HTTP headers request
payload – HTTP payload request

Returns

url, headers and the sanitized payload

class perceval.backends.core.gitlab.GitLabCommand(*args, debug=False)[source]¶

Bases: perceval.backend.BackendCommand

Class to run GitLab backend from the command line.

BACKEND¶: alias of perceval.backends.core.gitlab.GitLab

classmethod setup_cmd_parser()[source]¶: Returns the GitLab argument parser.

perceval.backends.core.gitter module¶

class perceval.backends.core.gitter.Gitter(group=None, room=None, api_token=None, max_items=100, sleep_for_rate=False, min_rate_to_sleep=10, tag=None, archive=None, ssl_verify=True)[source]¶

Bases: perceval.backend.Backend

Gitter backend.

This class retrieves the messages sent to a Gitter room. To access the server an API token is required.

The origin of the data will be set to the GITTER_URL plus the identifier of the room, i.e., ‘https://gitter.im/{group}/{room}’.
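The origin string can be assembled as in the sketch below. The value given to `GITTER_URL` here is inferred from the example URL above rather than taken from the library, and `room_origin` is a hypothetical helper name.

```python
GITTER_URL = "https://gitter.im/"  # assumed value, inferred from the example


def room_origin(group, room):
    """Build the origin for a room, in the form GITTER_URL + group/room."""
    return GITTER_URL + "{}/{}".format(group, room)


print(room_origin("some-group", "some-room"))
```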

Parameters

group – group to which the room belongs
room – identifier of the room from which the messages are to be fetched
api_token – token or key needed to use the API
max_items – maximum number of messages requested in a single query
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to sleep until it will be reset
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification

CATEGORIES = ['message']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

fetch(category='message', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶

Fetch the messages from the room.

This method fetches the messages sent in the room since the given date.

Parameters

category – the category of items to fetch
from_date – date from which messages are to be fetched

Returns

a generator of messages

fetch_items(category, **kwargs)[source]¶

Fetch the messages.

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]¶

Returns whether it supports archiving items on the fetch process.

Returns: this backend supports items archive

classmethod has_resuming()[source]¶

Returns whether it supports to resume the fetch process.

Returns: this backend does not support items resuming

static metadata_category(item)[source]¶

Extracts the category from a Gitter item.

This backend only generates one type of item which is ‘message’.

static metadata_id(item)[source]¶: Extracts the identifier from a Gitter item.

static metadata_updated_on(item)[source]¶

Extracts and converts the sent time of a message from a Gitter item.

The timestamp is extracted from ‘sent’ field and converted to a UNIX timestamp.

Parameters: item – item generated by the backend
Returns: a UNIX timestamp

search_fields(item)[source]¶

Add search fields to an item.

It adds the values of metadata_id, group, room and room_id.

Parameters: item – the item to extract the search fields values
Returns: a dict of search fields

version = '0.1.0'¶

class perceval.backends.core.gitter.GitterClient(api_token, max_items=100, archive=None, sleep_for_rate=False, min_rate_to_sleep=10, from_archive=False, ssl_verify=True)[source]¶

Bases: perceval.client.HttpClient, perceval.client.RateLimitHandler

Gitter API client.

Client for fetching information from the Gitter server using its REST API.

Parameters

api_token – key needed to use the API
max_items – maximum number of items per request
archive – an archive to store/read fetched data
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to sleep until it will be reset
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification

HAUTHORIZATION = 'Authorization'¶

PBEFORE_ID = 'beforeId'¶

PLIMIT = 'limit'¶

RMESSAGES = 'chatMessages'¶

RROOMS = 'rooms'¶

calculate_time_to_reset()[source]¶: Number of seconds to wait. They are contained in the rate limit reset header

fetch(url, payload=None, headers=None)[source]¶

Fetch the data from a given URL.

Parameters

url – link to the resource
payload – payload of the request
headers – headers of the request

Returns: a response object

get_room_id(room)[source]¶: Fetch the room id of a room.

message_page(room_id, before_id)[source]¶: Fetch a page of messages.

static sanitize_for_archive(url, headers, payload)[source]¶

Sanitize the payload of an HTTP request by removing the token information before storing/retrieving archived items.

Parameters

url – HTTP url request
headers – HTTP headers request
payload – HTTP payload request

Returns

url, headers and the sanitized payload

class perceval.backends.core.gitter.GitterCommand(*args, debug=False)[source]¶

Bases: perceval.backend.BackendCommand

Class to run Gitter backend from the command line.

BACKEND¶: alias of perceval.backends.core.gitter.Gitter

classmethod setup_cmd_parser()[source]¶: Returns the Gitter argument parser.

perceval.backends.core.googlehits module¶

class perceval.backends.core.googlehits.GoogleHits(keywords, tag=None, archive=None, max_retries=5, sleep_time=1, ssl_verify=True)[source]¶

Bases: perceval.backend.Backend

GoogleHits backend for Perceval.

This class retrieves the number of hits for a given list of keywords via the Google API. To initialize this class a list of keywords is needed.

Parameters

keywords – a list of keywords
tag – label used to mark the data
archive – archive to store/retrieve items
max_retries – number of max retries to a data source before raising a RetryError exception
sleep_time – time (in seconds) to sleep in case of connection problems
ssl_verify – enable/disable SSL verification

CATEGORIES = ['google_hits']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

EXTRA_SEARCH_FIELDS = {'keywords': ['keywords']}¶

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

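{
‘key-1’: value-1, ‘key-2’: value-2, ‘key-3’: value-3

}

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

{
‘project_id’: [‘fields’, ‘project’, ‘id’], ‘project_key’: [‘fields’, ‘project’, ‘key’], ‘project_name’: [‘fields’, ‘project’, ‘name’]

}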

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.
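The "path" idea can be illustrated by walking each list of keys down into a nested item. This is a minimal sketch with a made-up item; `resolve_search_fields` is a hypothetical helper, and it assumes every key on a path is present in the item.

```python
def resolve_search_fields(item, extra_search_fields):
    """Follow each path inside the item and collect the values found."""
    fields = {}
    for name, path in extra_search_fields.items():
        value = item
        for key in path:
            # Descend one level per key on the path.
            value = value[key]
        fields[name] = value
    return fields


# Made-up item shaped to match the example paths.
item = {"fields": {"project": {"id": 7, "key": "PRJ", "name": "Project"}}}
extra = {
    "project_id": ["fields", "project", "id"],
    "project_key": ["fields", "project", "key"],
    "project_name": ["fields", "project", "name"],
}
print(resolve_search_fields(item, extra))
```

For this backend, where EXTRA_SEARCH_FIELDS is `{'keywords': ['keywords']}`, the path has a single step, so the value is read directly from the top level of the item.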

fetch(category='google_hits')[source]¶

Fetch data from Google API.

The method retrieves a list of hits for some given keywords using the Google API.

Parameters: category – the category of items to fetch
Returns: a generator of data

fetch_items(category, **kwargs)[source]¶

Fetch Google hit items

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]¶

Returns whether it supports archiving items on the fetch process.

Returns: this backend supports items archive

classmethod has_resuming()[source]¶

Returns whether it supports to resume the fetch process.

Returns: this backend supports items resuming

static metadata_category(item)[source]¶

Extracts the category from a GoogleHits item.

This backend only generates one type of item which is ‘google_hits’.

static metadata_id(item)[source]¶: Extracts the identifier from a GoogleHit item.

static metadata_updated_on(item)[source]¶

Extracts the update time from a GoogleHit item.

The timestamp is based on the current time when the hit was extracted. This field is not part of the data provided by Google API. It is added by this backend.

Parameters: item – item generated by the backend
Returns: a UNIX timestamp

version = '0.4.0'¶

class perceval.backends.core.googlehits.GoogleHitsClient(sleep_time=1, max_retries=5, archive=None, from_archive=False, ssl_verify=True)[source]¶

Bases: perceval.client.HttpClient

GoogleHits API client.

Client for fetching hits data from Google API.

Parameters

sleep_time – time (in seconds) to sleep in case of connection problems
max_retries – number of max retries to a data source before raising a RetryError exception
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification

EXTRA_STATUS_FORCELIST = [429]¶

PQUERY = 'q'¶

hits(keywords)[source]¶: Fetch information about a list of keywords.

class perceval.backends.core.googlehits.GoogleHitsCommand(*args, debug=False)[source]¶

Bases: perceval.backend.BackendCommand

Class to run GoogleHits backend from the command line.

BACKEND¶: alias of perceval.backends.core.googlehits.GoogleHits

classmethod setup_cmd_parser()[source]¶: Returns the GoogleHits argument parser.

perceval.backends.core.groupsio module¶

class perceval.backends.core.groupsio.Groupsio(group_name, dirpath, email, password, tag=None, archive=None, ssl_verify=True)[source]¶

Bases: perceval.backends.core.mbox.MBox

Groups.io backend.

This class fetches the messages of a Groups.io group. Initialize this class by passing the name of the group, the directory path where the mbox files will be fetched and stored, and the email and password of the Groupsio user. The origin of the data will be set to the url of the group on Groups.io.

To find the names of the groups you are subscribed to, you can use the following script: https://gist.github.com/valeriocos/2e2231e17fd3052800303bf99bd0c7c4

Parameters

group_name – Name of the group
dirpath – directory path where the mboxes are stored
email – Groupsio user email
password – Groupsio user password
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification

CATEGORIES = ['message']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

fetch(category='message', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶

Fetch the messages from a Groups.io group.

The method fetches the mbox files from a remote Groups.io group and retrieves the messages stored on them.

Parameters

category – the category of items to fetch
from_date – obtain messages since this date

Returns

a generator of messages

fetch_items(category, **kwargs)[source]¶

Fetch the messages

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]¶

Returns whether it supports archiving items on the fetch process.

Returns: this backend does not support items archive

classmethod has_resuming()[source]¶

Returns whether it supports to resume the fetch process.

Returns: this backend supports items resuming

search_fields(item)[source]¶

Add search fields to an item.

It adds the values of metadata_id plus the group_name

Parameters: item – the item to extract the search fields values
Returns: a dict of search fields

version = '0.4.2'¶

class perceval.backends.core.groupsio.GroupsioClient(group_name, dirpath, email, password, ssl_verify=True)[source]¶

Bases: perceval.backends.core.mbox.MailingList

Manage mailing list archives stored by Groups.io.

This class gives access to remote and local mbox archives from a mailing list stored by Groups.io. It also allows keeping them in sync.

Parameters

group_name – Name of the group
dirpath – directory path where the mboxes are stored
email – Groupsio user email
password – Groupsio user password
ssl_verify – enable/disable SSL verification

PEMAIL = 'email'¶

PGROUP_ID = 'group_id'¶

PLIMIT = 'limit'¶

PPAGE_TOKEN = 'page_token'¶

PPASSWORD = 'password'¶

PSTART_TIME = 'start_time'¶

RDOWNLOAD_ARCHIVES = 'downloadarchives'¶

RGET_SUBSCRIPTIONS = 'getsubs'¶

RLOGIN = 'login'¶

fetch(from_date=None)[source]¶

Fetch the mbox files from the remote archiver.

Stores the archives in the path given during the initialization of this object. Archives that do not have a valid extension will be ignored.

Groups.io archives are returned as a .zip file, which contains one file in mbox format.

Parameters: from_date – fetch messages after a given date (included) expressed in ISO format
Returns: a list of tuples, storing the links and paths of the fetched archives
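Since a Groups.io archive arrives as a .zip file containing a single mbox file, unpacking it can be sketched with the standard `zipfile` module. The helper `extract_single_mbox` is hypothetical, and the archive here is built in memory with made-up content so the example needs no download.

```python
import io
import zipfile


def extract_single_mbox(zip_bytes):
    """Return (file name, content) of the only file inside a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = archive.namelist()
        if len(names) != 1:
            raise ValueError("expected exactly one file, found %d" % len(names))
        return names[0], archive.read(names[0])


# Build a small in-memory archive with made-up mbox content.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("messages.mbox", "From alice@example.com Thu Jan  1 00:00:00 1970\n")

name, content = extract_single_mbox(buffer.getvalue())
print(name, len(content))
```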

subscriptions(per_page=100)[source]¶

Fetch the groupsio paginated subscriptions for a given token

Parameters: per_page – number of subscriptions per page
Returns: an iterator of subscriptions
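Token-based pagination of this kind can be sketched as a generator that keeps requesting pages until no further token is returned. The parameter names come from the `PLIMIT` and `PPAGE_TOKEN` constants above; the response keys `data` and `next_page_token`, the `fetch_page` callable and the helper name `iter_subscriptions` are assumptions made for this illustration.

```python
PLIMIT = "limit"
PPAGE_TOKEN = "page_token"


def iter_subscriptions(fetch_page, per_page=100):
    """Yield subscriptions page by page until no next token is returned."""
    token = None
    while True:
        params = {PLIMIT: per_page}
        if token:
            params[PPAGE_TOKEN] = token
        response = fetch_page(params)
        # Assumed response shape: {'data': [...], 'next_page_token': ...}
        yield from response["data"]
        token = response.get("next_page_token")
        if not token:
            break


# Fake two-page source standing in for the remote API.
def fake_fetch(params):
    if params.get(PPAGE_TOKEN) == "t2":
        return {"data": ["group-c"], "next_page_token": 0}
    return {"data": ["group-a", "group-b"], "next_page_token": "t2"}


print(list(iter_subscriptions(fake_fetch, per_page=2)))
```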

class perceval.backends.core.groupsio.GroupsioCommand(*args, debug=False)[source]¶

Bases: perceval.backend.BackendCommand

Class to run Groupsio backend from the command line.

BACKEND¶: alias of perceval.backends.core.groupsio.Groupsio

classmethod setup_cmd_parser()[source]¶: Returns the Groupsio argument parser.

perceval.backends.core.hyperkitty module¶

class perceval.backends.core.hyperkitty.HyperKitty(url, dirpath, tag=None, archive=None, ssl_verify=True)[source]¶

Bases: perceval.backends.core.mbox.MBox

HyperKitty backend.

This class fetches the email messages stored on a HyperKitty archiver. Initialize this class by passing the URL of the mailing list archiver and the directory path where the mbox files will be fetched and stored. The origin of the data will be set to the value of url.

Parameters

url – URL to the HyperKitty mailing list archiver
dirpath – directory path where the mboxes are stored
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification

CATEGORIES = ['message']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

fetch(category='message', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶

Fetch the messages from the HyperKitty mailing list archiver.

The method fetches the mbox files from a remote HyperKitty mailing list archiver and retrieves the messages stored on them.

Note that HyperKitty does not yet provide any information about which message is the first one on the mailing list. For this reason, a from_date value earlier than the date the first message was sent will cause empty mbox files to be downloaded.

Parameters

category – the category of items to fetch
from_date – obtain messages since this date

Returns

a generator of messages

fetch_items(category, **kwargs)[source]¶

Fetch the messages

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]¶

Returns whether the backend supports archiving items during the fetch process.

Returns: this backend does not support items archive

classmethod has_resuming()[source]¶

Returns whether the backend supports resuming the fetch process.

Returns: this backend supports items resuming

version = '0.6.0'¶

class perceval.backends.core.hyperkitty.HyperKittyCommand(*args, debug=False)[source]¶

Bases: perceval.backend.BackendCommand

Class to run HyperKitty backend from the command line.

BACKEND¶: alias of perceval.backends.core.hyperkitty.HyperKitty

classmethod setup_cmd_parser()[source]¶: Returns the HyperKitty argument parser.

class perceval.backends.core.hyperkitty.HyperKittyList(url, dirpath, ssl_verify=True)[source]¶

Bases: perceval.backends.core.mbox.MailingList

Manage mailing list archives stored by HyperKitty archiver.

This class gives access to remote and local mbox archives from a mailing list stored by HyperKitty. It also keeps them in sync.

Notice that this class only works with HyperKitty version 1.0.4 or greater. Previous versions do not export messages in MBox format.

Parameters

url – URL to the HyperKitty archiver for this list
dirpath – path to the local mboxes archives
ssl_verify – enable/disable SSL verification

PEND = 'end'¶

PSTART = 'start'¶

REXPORT = 'export'¶

fetch(from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶

Fetch the mbox files from the remote archiver.

This method stores the archives in the path given during the initialization of this object.

HyperKitty archives are accessed month by month and stored following the schema year-month. Archives are fetched from the given month till the current month.

Parameters: from_date – fetch archives that store messages sent on or after the given date; only year and month values are compared
Returns: a list of tuples, storing the links and paths of the fetched archives
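The month-by-month walk described above can be sketched as follows. This is only an illustration: the helper months_since is hypothetical, and the exact 'YYYY-MM' string format is an assumption based on the "year-month" schema mentioned in the text.

```python
from datetime import datetime, timezone


def months_since(from_date, until):
    """Yield 'YYYY-MM' keys from the month of from_date to the month of until."""
    year, month = from_date.year, from_date.month
    # Only year and month are compared; days and times are ignored.
    while (year, month) <= (until.year, until.month):
        yield "{:04d}-{:02d}".format(year, month)
        month += 1
        if month > 12:
            year, month = year + 1, 1


start = datetime(2019, 11, 20, tzinfo=timezone.utc)
end = datetime(2020, 2, 3, tzinfo=timezone.utc)
keys = list(months_since(start, end))
```

Here keys covers November 2019 through February 2020, regardless of the day of the month given in either date.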

property mboxes¶

Get the mboxes managed by this mailing list.

Returns the archives sorted by date in ascending order.

Returns: a list of .MBoxArchive objects

perceval.backends.core.jenkins module¶

class perceval.backends.core.jenkins.Jenkins(url, user=None, api_token=None, tag=None, archive=None, detail_depth=1, sleep_time=10, blacklist_ids=None, ssl_verify=True)[source]¶

Bases: perceval.backend.Backend

Jenkins backend for Perceval.

This class retrieves the builds from a Jenkins site. To initialize this class the URL must be provided. The url will be set as the origin of the data.

Parameters

url – Jenkins url
user – Jenkins user
api_token – Jenkins auth token to access the API
tag – label used to mark the data
archive – archive to store/retrieve items
detail_depth – control the detail level of the data returned by the API
sleep_time – time (in seconds) to sleep in case of connection problems
archive – collect builds already retrieved from an archive
blacklist_ids – exclude the job IDs in this list while fetching
ssl_verify – enable/disable SSL verification

CATEGORIES = ['build']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

EXTRA_SEARCH_FIELDS = {'number': ['number']}¶

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

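{
    'key-1': value-1,
    'key-2': value-2,
    'key-3': value-3
}

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item ('item_id': item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

{
    'project_id': ['fields', 'project', 'id'],
    'project_key': ['fields', 'project', 'key'],
    'project_name': ['fields', 'project', 'name']
}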

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.
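To make the notion of a "path" concrete, here is a minimal sketch of how such a list of keys could be resolved against an item. The helpers resolve_path and build_search_fields are hypothetical names introduced for illustration and are not part of the Perceval API; the sample build dict is made up.

```python
def resolve_path(item, path):
    """Follow a list of keys into a nested dict; return None if a key is missing."""
    value = item
    for key in path:
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def build_search_fields(item, extra_fields):
    """Map every search field name to the value found at its path."""
    return {name: resolve_path(item, path)
            for name, path in extra_fields.items()}


EXTRA_SEARCH_FIELDS = {'number': ['number']}
build = {'number': 42, 'result': 'SUCCESS'}
fields = build_search_fields(build, EXTRA_SEARCH_FIELDS)
```

Returning None for a missing key is a design choice of this sketch; an implementation could equally raise an error or skip the field.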

ORIGIN_UNIQUE_FIELD = OriginUniqueField(name='url', type=<class 'str'>)¶

A field unique to a given origin for items produced by this backend.

If set to None, blacklisting will be disabled completely. Otherwise, this should be set to an OriginUniqueField containing the name and data type of the field.

To access this field, please prefer origin_unique_field().

fetch(category='build')[source]¶

Fetch the builds from the url.

The method retrieves the builds from a Jenkins URL.

Parameters: category – the category of items to fetch
Returns: a generator of builds

fetch_items(category, **kwargs)[source]¶

Fetch the contents

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]¶

Returns whether the backend supports archiving items during the fetch process.

Returns: this backend supports items archiving

classmethod has_resuming()[source]¶

Returns whether the backend supports resuming the fetch process.

Returns: this backend does not support items resuming

static metadata_category(item)[source]¶

Extracts the category from a Jenkins item.

This backend only generates one type of item which is ‘build’.

static metadata_id(item)[source]¶: Extracts the identifier from a Build item.

static metadata_updated_on(item)[source]¶

Extracts the update time from a Jenkins item.

The timestamp is extracted from ‘timestamp’ field. This date is a UNIX timestamp but needs to be converted to a float value.

Parameters: item – item generated by the backend
Returns: a UNIX timestamp
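A minimal sketch of the conversion described above. It assumes that the 'timestamp' field of a build holds milliseconds since the epoch, as the Jenkins JSON API reports; the function name updated_on and the sample build are hypothetical.

```python
def updated_on(item):
    """Return the build time as a float UNIX timestamp in seconds."""
    # Assumption: 'timestamp' is expressed in milliseconds since the epoch.
    return float(item['timestamp']) / 1000.0


build = {'number': 42, 'timestamp': 1458874078582}
ts = updated_on(build)
```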

version = '0.16.0'¶

class perceval.backends.core.jenkins.JenkinsClient(url, user=None, api_token=None, blacklist_jobs=None, detail_depth=1, sleep_time=10, archive=None, from_archive=False, ssl_verify=True)[source]¶

Bases: perceval.client.HttpClient

Jenkins API client.

This class implements a simple client to retrieve jobs/builds from projects in a Jenkins node. The amount of data returned for each request depends on the detail_depth value selected (minimum and default is 1). Note that increasing detail_depth may considerably slow down the fetch operation and cause broken-connection errors.

Parameters

url – URL of the Jenkins node (e.g., https://build.opnfv.org/ci)
user – Jenkins user
api_token – Jenkins auth token to access the API
blacklist_jobs – exclude the jobs of this list while fetching
detail_depth – set the detail level of the data returned by the API
sleep_time – time (in seconds) to sleep in case of connection problems
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification

Raises

HTTPError – when an error occurs doing the request

EXTRA_STATUS_FORCELIST = [410, 502, 503]¶

MAX_RETRIES = 5¶

PDEPTH = 'depth'¶

RAPI = 'api'¶

RJOB = 'job'¶

RJSON = 'json'¶

get_builds(job_name, url)[source]¶

Retrieve all builds from a job

Parameters

job_name – name of the job
url – target url to fetch builds
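The resource and parameter constants listed above suggest the usual Jenkins JSON endpoint layout, <base>/job/<name>/api/json?depth=<n>. The following sketch composes such a URL; the helper builds_url and the example host are hypothetical, and the client may assemble its requests differently.

```python
from urllib.parse import quote, urlencode

RJOB = 'job'
RAPI = 'api'
RJSON = 'json'
PDEPTH = 'depth'


def builds_url(base_url, job_name, detail_depth=1):
    """Compose <base>/job/<name>/api/json?depth=<n> for a Jenkins job."""
    path = "/".join([base_url.rstrip("/"), RJOB,
                     quote(job_name, safe=""), RAPI, RJSON])
    return path + "?" + urlencode({PDEPTH: detail_depth})


url = builds_url("https://jenkins.example.com/", "my job", detail_depth=2)
```

Percent-encoding the job name avoids malformed URLs when names contain spaces or slashes.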

get_jobs(url)[source]¶

Retrieve all jobs

Parameters: url – target url to fetch jobs

class perceval.backends.core.jenkins.JenkinsCommand(*args, debug=False)[source]¶

Bases: perceval.backend.BackendCommand

Class to run Jenkins backend from the command line.

BACKEND¶: alias of perceval.backends.core.jenkins.Jenkins

classmethod setup_cmd_parser()[source]¶: Returns the Jenkins argument parser.

perceval.backends.core.jira module¶

class perceval.backends.core.jira.Jira(url, project=None, user=None, password=None, cert=None, max_results=100, tag=None, archive=None, ssl_verify=True)[source]¶

Bases: perceval.backend.Backend

JIRA backend for Perceval.

This class retrieves the issues stored in a JIRA issue tracking system. To initialize this class the URL must be provided. The url will be set as the origin of the data.

Note that when fetching data with an authenticated access (i.e., user and password), information about issue transitions and operations (e.g., edit-issue, comment-issue) is included in the JSON documents produced by the backend.

Parameters

url – JIRA’s endpoint
project – filter issues by project
user – Jira user
password – Jira user password
cert – SSL certificate path (PEM)
max_results – max number of results per query
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification

CATEGORIES = ['issue']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

EXTRA_SEARCH_FIELDS = {'issue_key': ['key'], 'project_id': ['fields', 'project', 'id'], 'project_key': ['fields', 'project', 'key'], 'project_name': ['fields', 'project', 'name']}¶

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

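{
    'key-1': value-1,
    'key-2': value-2,
    'key-3': value-3
}

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item ('item_id': item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

{
    'project_id': ['fields', 'project', 'id'],
    'project_key': ['fields', 'project', 'key'],
    'project_name': ['fields', 'project', 'name']
}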

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.

fetch(category='issue', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶

Fetch the issues from the site.

The method retrieves, from a JIRA site, the issues updated since the given date.

Parameters

category – the category of items to fetch
from_date – retrieve issues updated from this date

Returns

a generator of issues

fetch_items(category, **kwargs)[source]¶

Fetch the issues

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]¶

Returns whether the backend supports archiving items during the fetch process.

Returns: this backend supports items archive

classmethod has_resuming()[source]¶

Returns whether the backend supports resuming the fetch process.

Returns: this backend supports items resuming

static metadata_category(item)[source]¶

Extracts the category from a Jira item.

This backend only generates one type of item which is ‘issue’.

static metadata_id(item)[source]¶: Extracts the identifier from a Jira item.

static metadata_updated_on(item)[source]¶

Extracts the update time from a Jira item.

The timestamp used is extracted from ‘updated’ field. This date is converted to UNIX timestamp format taking into account the timezone of the date.

Parameters: item – item generated by the backend
Returns: a UNIX timestamp
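The timezone-aware conversion can be sketched with the standard library alone. The date layout used here ('2015-12-01T10:05:26.000+0100') and the location of the field under 'fields' are assumptions about the item structure; jira_updated_on is a hypothetical helper.

```python
from datetime import datetime, timezone


def jira_updated_on(item):
    """Convert the 'updated' field of an issue into a UNIX timestamp."""
    # Assumption: dates look like '2015-12-01T10:05:26.000+0100'.
    updated = item['fields']['updated']
    dt = datetime.strptime(updated, "%Y-%m-%dT%H:%M:%S.%f%z")
    # An aware datetime yields the same timestamp whatever its UTC offset.
    return dt.timestamp()


issue = {'fields': {'updated': '2015-12-01T10:05:26.000+0100'}}
ts = jira_updated_on(issue)
```

Because the parsed datetime carries its offset, 10:05:26 at +0100 and 09:05:26 at UTC map to the same timestamp.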

static parse_issues(raw_page)[source]¶

Parse a JIRA API raw response.

The method parses the API response and retrieves the issues from the received page.

Parameters: raw_page – raw API response from which to parse the issues
Returns: a generator of issues

version = '0.14.0'¶

class perceval.backends.core.jira.JiraClient(url, project, user, password, cert, max_results=100, archive=None, from_archive=False, ssl_verify=True)[source]¶

Bases: perceval.client.HttpClient

JIRA API client.

This class implements a simple client to retrieve issues from any JIRA issue tracking system.

Parameters

url – URL of the JIRA server
project – filter issues by project
user – JIRA’s username
password – JIRA’s password
cert – SSL certificate
max_results – max number of results per query
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification

Raises

HTTPError – when an error occurs doing the request

PEXPAND = 'expand'¶

PJQL = 'jql'¶

PMAX_RESULTS = 'maxResults'¶

PSTART_AT = 'startAt'¶

RCOMMENT = 'comment'¶

RESOURCE = 'rest/api'¶

RFIELD = 'field'¶

RISSUE = 'issue'¶

RSEARCH = 'search'¶

VERSION_API = '2'¶

VEXPAND = 'renderedFields,transitions,operations,changelog'¶

get_comments(issue_id)[source]¶

Retrieve all the comments of a given issue.

Parameters: issue_id – ID of the issue

get_fields()[source]¶: Retrieve all the fields available.

get_issues(from_date)[source]¶

Retrieve all the issues from a given date.

Parameters: from_date – obtain issues updated since this date

get_items(from_date, url, expand_fields=True)[source]¶

Retrieve all the items from a given date.

Parameters

url – endpoint API url
from_date – obtain items updated since this date
expand_fields – if True, it includes the expand fields in the payload
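The parameter constants of this client (jql, startAt, maxResults, expand) point to offset-based pagination. The sketch below generates one payload per page; page_payloads is a hypothetical helper, the JQL expression is made up, and the way the client actually builds its query is not documented here.

```python
PJQL = 'jql'
PSTART_AT = 'startAt'
PMAX_RESULTS = 'maxResults'
PEXPAND = 'expand'
VEXPAND = 'renderedFields,transitions,operations,changelog'


def page_payloads(jql, total, max_results=100, expand_fields=True):
    """Yield one query payload per page until `total` results are covered."""
    for start_at in range(0, total, max_results):
        payload = {PJQL: jql, PSTART_AT: start_at, PMAX_RESULTS: max_results}
        if expand_fields:
            payload[PEXPAND] = VEXPAND
        yield payload


# Hypothetical JQL expression; the query built by the client may differ.
pages = list(page_payloads("project = PRJ order by updated asc", total=250))
```

With 250 results and 100 results per query, three payloads are produced, starting at offsets 0, 100 and 200.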

class perceval.backends.core.jira.JiraCommand(*args, debug=False)[source]¶

Bases: perceval.backend.BackendCommand

Class to run Jira backend from the command line.

BACKEND¶: alias of perceval.backends.core.jira.Jira

classmethod setup_cmd_parser()[source]¶: Returns the Jira argument parser.

perceval.backends.core.jira.filter_custom_fields(fields)[source]¶

Filter custom fields from a given set of fields.

Parameters: fields – set of fields
Returns: an object with the filtered custom fields

perceval.backends.core.jira.map_custom_field(custom_fields, fields)[source]¶

Add extra information for custom fields.

Parameters

custom_fields – set of custom fields with the extra information
fields – fields of the issue where to add the extra information

Returns

a set of items with the extra information mapped

perceval.backends.core.launchpad module¶

class perceval.backends.core.launchpad.Launchpad(distribution, package=None, items_per_page=75, sleep_time=300, tag=None, archive=None, ssl_verify=True)[source]¶

Bases: perceval.backend.Backend

Launchpad backend for Perceval.

This class fetches the issues stored in Launchpad.

Parameters

distribution – Launchpad distribution
package – Distribution package
items_per_page – number of items in a retrieved page
sleep_time – time (in seconds) to sleep in case of connection problems
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification

CATEGORIES = ['issue']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

fetch(category='issue', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶

Fetch the issues from a project (distribution/package).

The method retrieves, from a Launchpad project, the issues updated since the given date.

Parameters

category – the category of items to fetch
from_date – obtain issues updated since this date

Returns

a generator of issues

fetch_items(category, **kwargs)[source]¶

Fetch the issues

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]¶

Returns whether the backend supports archiving items during the fetch process.

Returns: this backend supports items archive

classmethod has_resuming()[source]¶

Returns whether the backend supports resuming the fetch process.

Returns: this backend supports items resuming

static metadata_category(item)[source]¶

Extracts the category from a Launchpad item.

This backend only generates one type of item which is ‘issue’.

static metadata_id(item)[source]¶: Extracts the identifier from a Launchpad item.

static metadata_updated_on(item)[source]¶

Extracts the update time from a Launchpad item.

The timestamp used is extracted from ‘date_last_updated’ field. This date is converted to UNIX timestamp format. As Launchpad dates are in UTC in ISO 8601 (e.g., ‘2008-03-26T01:43:15.603905+00:00’) the conversion is straightforward.

Parameters: item – item generated by the backend
Returns: a UNIX timestamp
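Since the dates are ISO 8601 strings with an explicit UTC offset, the conversion can be illustrated in a couple of lines. The helper launchpad_updated_on is hypothetical and the sample item contains only the field relevant here.

```python
from datetime import datetime, timezone


def launchpad_updated_on(item):
    """Convert the ISO 8601 'date_last_updated' field into a UNIX timestamp."""
    return datetime.fromisoformat(item['date_last_updated']).timestamp()


issue = {'date_last_updated': '2008-03-26T01:43:15.603905+00:00'}
ts = launchpad_updated_on(issue)
```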

search_fields(item)[source]¶

Add search fields to an item.

It adds the value of metadata_id plus additional values specific to this backend. Only the issue category applies, since it is the only category this backend produces.

Parameters: item – the item to extract the search fields values
Returns: a dict of search fields

version = '0.8.1'¶

class perceval.backends.core.launchpad.LaunchpadClient(distribution, package=None, items_per_page=75, sleep_time=300, archive=None, from_archive=False, ssl_verify=True)[source]¶

Bases: perceval.client.HttpClient

Client for retrieving information from Launchpad API

Parameters

distribution – Launchpad distribution
package – Distribution package
items_per_page – number of items in a retrieved page
sleep_time – time (in seconds) to sleep in case of connection problems
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification

HCONTENT_TYPE = 'Content-type'¶

PMODIFIED_SINCE = 'modified_since'¶

POMIT_DULPLICATES = 'omit_duplicates'¶

PORDER_BY = 'order_by'¶

PSTATUS = 'status'¶

PWS_OP = 'ws.op'¶

PWS_SIZE = 'ws.size'¶

PWS_START = 'ws.start'¶

RBUGS = 'bugs'¶

RSOURCE = '+source'¶

VCONTENT_TYPE = 'application/json'¶

VDATE_LAST_MODIFIED = 'date_last_updated'¶

VOMIT_DUPLICATES = 'false'¶

VSEARCH_TASKS = 'searchTasks'¶

VSTATUS = ['New', 'Incomplete', 'Opinion', 'Invalid', "Won't Fix", 'Expired', 'Confirmed', 'Triaged', 'In Progress', 'Fix Committed', 'Fix Released', 'Incomplete (with response)', 'Incomplete (without response)']¶

issue(issue_id)[source]¶: Get the issue data by its ID

issue_collection(issue_id, collection_name)[source]¶: Get a collection list of a given issue

issues(start=None)[source]¶: Get the issues from pagination

user(user_name)[source]¶: Get the user data by URL

user_name(user_link)[source]¶: Get user name from link

class perceval.backends.core.launchpad.LaunchpadCommand(*args, debug=False)[source]¶

Bases: perceval.backend.BackendCommand

Class to run Launchpad backend from the command line.

BACKEND¶: alias of perceval.backends.core.launchpad.Launchpad

classmethod setup_cmd_parser()[source]¶: Returns the Launchpad argument parser.

perceval.backends.core.mattermost module¶

class perceval.backends.core.mattermost.Mattermost(url, channel, api_token, max_items=60, tag=None, archive=None, team=None, sleep_for_rate=False, min_rate_to_sleep=10, sleep_time=1, ssl_verify=True)[source]¶

Bases: perceval.backend.Backend

Mattermost backend.

This class retrieves the posts sent to a Mattermost channel. To access the server an API token is required, which must have enough permissions to read from the given channel.

To initialize this class the URL of the server must be provided. The origin of the data will be set using this URL plus the channel from which the data is obtained (e.g., https://mattermost.example.com/abcdefg). If channel and team names are used instead of a channel ID, the origin takes the form of the URL plus the team plus the channel.

The team parameter is only required if providing a channel name instead of a channel ID.

Parameters

url – URL of the server
channel – identifier/name of the channel where data will be fetched
api_token – token or key needed to use the API
max_items – maximum number of messages requested in the same query
tag – label used to mark the data
archive – archive to store/retrieve items
team – (optional) The name of the team the channel is in
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to sleep until it is reset
sleep_time – time (in seconds) to sleep in case of connection problems
ssl_verify – enable/disable SSL verification

CATEGORIES = ['post']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

EXTRA_SEARCH_FIELDS = {'channel_id': ['channel_data', 'id'], 'channel_name': ['channel_data', 'name']}¶

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

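{
    'key-1': value-1,
    'key-2': value-2,
    'key-3': value-3
}

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item ('item_id': item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

{
    'project_id': ['fields', 'project', 'id'],
    'project_key': ['fields', 'project', 'key'],
    'project_name': ['fields', 'project', 'name']
}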

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.

fetch(category='post', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶

Fetch the posts from the channel.

This method fetches the posts stored on the channel that were sent since the given date.

Parameters

category – the category of items to fetch
from_date – obtain posts sent since this date

Returns

a generator of posts

fetch_items(category, **kwargs)[source]¶

Fetch the messages.

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]¶

Returns whether the backend supports archiving items during the fetch process.

Returns: this backend supports items archive

classmethod has_resuming()[source]¶

Returns whether the backend supports resuming the fetch process.

Returns: this backend does not support items resuming

static metadata_category(item)[source]¶

Extracts the category from a Mattermost item.

This backend only generates one type of item which is ‘post’.

static metadata_id(item)[source]¶: Extracts the identifier from a Mattermost item.

static metadata_updated_on(item)[source]¶

Extracts and converts the update time from a Mattermost item.

The timestamp is extracted from ‘update_at’ field. This field is already a UNIX timestamp but it needs to be converted to float.

Parameters: item – item generated by the backend
Returns: a UNIX timestamp

static parse_json(raw_json)[source]¶

Parse a Mattermost JSON stream.

The method parses a JSON stream and returns a dict with the parsed data.

Parameters: raw_json – JSON string to parse
Returns: a dict with the parsed data

version = '0.5.0'¶

class perceval.backends.core.mattermost.MattermostClient(base_url, api_token, max_items=60, sleep_for_rate=False, min_rate_to_sleep=10, sleep_time=1, archive=None, from_archive=False, ssl_verify=True)[source]¶

Bases: perceval.client.HttpClient, perceval.client.RateLimitHandler

Mattermost API client.

Client for fetching information from a Mattermost server using its REST API.

Parameters

base_url – URL of the Mattermost server
api_token – token or key needed to use the API
max_items – maximum number of items fetched per request
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to sleep until it is reset
sleep_time – time (in seconds) to sleep in case of connection problems
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification

API_URL = '%(base_url)s/api/v4/%(entrypoint)s'¶

HAUTHORIZATION = 'Authorization'¶

PCHANNEL_ID = 'channel_id'¶

PPAGE = 'page'¶

PPER_PAGE = 'per_page'¶

RCHANNELS = 'channels'¶

RCHANNELS_BY_NAME = 'teams/name/%s/channels/name/%s'¶

RPOSTS = 'posts'¶

RUSERS = 'users'¶

calculate_time_to_reset()[source]¶

Number of seconds to wait.

The time is obtained as the difference between the current date and the next date when the token is fully regenerated.
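The computation amounts to a clamped subtraction, sketched below. How the reset date is obtained from the server is not covered here, so reset_ts is a hypothetical input, as is the helper name seconds_to_reset.

```python
import time


def seconds_to_reset(reset_ts, now_ts=None):
    """Seconds left until reset_ts; never negative."""
    if now_ts is None:
        now_ts = time.time()
    # Clamp at zero so a reset date in the past means "do not wait".
    return max(0.0, float(reset_ts) - float(now_ts))


wait = seconds_to_reset(reset_ts=1_600_000_030, now_ts=1_600_000_000)
```

Passing now_ts explicitly keeps the function deterministic and easy to test; callers would normally omit it.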

channel(channel)[source]¶: Fetch the channel information

channel_by_name(team: str, channel: str)[source]¶

Fetch the channel information by channel/team name

This provides identical information to the channel() method, with the key difference of looking up a channel by channel name and team name instead of by the channel ID.
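Using the API_URL and RCHANNELS_BY_NAME templates listed above, the lookup URL for a channel name can be composed as in this sketch. The helper channel_by_name_url and the example team and channel names are hypothetical.

```python
API_URL = '%(base_url)s/api/v4/%(entrypoint)s'
RCHANNELS_BY_NAME = 'teams/name/%s/channels/name/%s'


def channel_by_name_url(base_url, team, channel):
    """Fill both templates to get the URL of a channel looked up by name."""
    entrypoint = RCHANNELS_BY_NAME % (team, channel)
    return API_URL % {'base_url': base_url, 'entrypoint': entrypoint}


url = channel_by_name_url('https://mattermost.example.com',
                          'myteam', 'town-square')
```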

fetch(url, payload=None, headers=None, method='GET', stream=False, auth=None)[source]¶

Override fetch method to handle API rate limit.

Parameters

url – link to the resource
payload – payload of the request
headers – headers of the request
method – type of request call (GET or POST)
stream – defer downloading the response body until the response content is available
auth – auth of the request

Returns

a response object

posts(channel, page=None)[source]¶: Fetch the history of a channel.

static sanitize_for_archive(url, headers, payload)[source]¶

Sanitize the payload of an HTTP request by removing the token information before storing/retrieving archived items.

Parameters

url – HTTP url request
headers – HTTP headers request
payload – HTTP payload request

Returns

url, headers and the sanitized payload
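The idea of stripping credentials before archiving can be sketched as follows. This illustration assumes that the token travels in the Authorization header, as the HAUTHORIZATION constant suggests; the function sanitize and the sample values are hypothetical and may not match what the client does.

```python
import copy

HAUTHORIZATION = 'Authorization'


def sanitize(url, headers, payload):
    """Return url, a copy of headers without the auth token, and payload."""
    # Work on a copy so the headers used for the live request stay intact.
    clean_headers = copy.deepcopy(headers) if headers else {}
    clean_headers.pop(HAUTHORIZATION, None)
    return url, clean_headers, payload


request_headers = {HAUTHORIZATION: 'Bearer secret', 'Accept': 'application/json'}
_, archived_headers, _ = sanitize('https://mattermost.example.com/api/v4/posts',
                                  request_headers, {'page': 0})
```

Copying before removing the header matters: the same dict is usually still needed to perform the request itself.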

user(user)[source]¶: Fetch user data.

class perceval.backends.core.mattermost.MattermostCommand(*args, debug=False)[source]¶

Bases: perceval.backend.BackendCommand

Class to run Mattermost backend from the command line.

BACKEND¶: alias of perceval.backends.core.mattermost.Mattermost

DESCRIPTION = 'Can either be called a channel ID, or a channel name. If a channel name is used, the team name is required. Otherwise, the team argument is ignored.'¶

classmethod setup_cmd_parser()[source]¶: Returns the Mattermost argument parser.

perceval.backends.core.mbox module¶

class perceval.backends.core.mbox.MBox(uri, dirpath, tag=None, archive=None, ssl_verify=True)[source]¶

Bases: perceval.backend.Backend

MBox backend.

This class fetches the email messages stored in one or several mbox files. Initialize this class by passing the directory path where the mbox files are stored. The origin of the data will be set to the value of uri.

Parameters

uri – URI of the mboxes; typically, the URL of their mailing list
dirpath – directory path where the mboxes are stored
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification

CATEGORIES = ['message']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

DATE_FIELD = 'Date'¶

MESSAGE_ID_FIELD = 'Message-ID'¶

fetch(category='message', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶

Fetch the messages from a set of mbox files.

The method retrieves, from mbox files, the messages stored in these containers.

Parameters

category – the category of items to fetch
from_date – obtain messages since this date

Returns

a generator of messages

fetch_items(category, **kwargs)[source]¶

Fetch the messages

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]¶

Returns whether the backend supports archiving items during the fetch process.

Returns: this backend does not support items archive

classmethod has_resuming()[source]¶

Returns whether the backend supports resuming the fetch process.

Returns: this backend supports items resuming

static metadata_category(item)[source]¶

Extracts the category from a MBox item.

This backend only generates one type of item which is ‘message’.

static metadata_id(item)[source]¶: Extracts the identifier from a MBox item.

static metadata_updated_on(item)[source]¶

Extracts the update time from a MBox item.

The timestamp used is extracted from ‘Date’ field in its several forms. This date is converted to UNIX timestamp format.

Parameters: item – item generated by the backend
Returns: a UNIX timestamp

static parse_mbox(filepath)[source]¶

Parse a mbox file.

This method parses an mbox file and returns an iterator of dictionaries. Each of them contains an email message.

Parameters: filepath – path of the mbox to parse

Returns: a generator of messages; each message is stored in a dictionary of type requests.structures.CaseInsensitiveDict
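For readers unfamiliar with the mbox format, the following sketch reads a small mbox file with the standard mailbox module and yields one dictionary per message. It is not the parser used by this backend: iter_messages is a hypothetical helper, it returns plain dicts rather than CaseInsensitiveDict objects, and the sample messages are made up.

```python
import mailbox
import os
import tempfile


def iter_messages(filepath):
    """Yield each message of an mbox file as a dict of headers plus body."""
    mbox = mailbox.mbox(filepath, create=False)
    try:
        for msg in mbox:
            data = dict(msg.items())
            data['body'] = msg.get_payload()
            yield data
    finally:
        mbox.close()


SAMPLE = (
    "From alice@example.com Sat Jan  3 01:05:34 2015\n"
    "Message-ID: <1@example.com>\n"
    "Date: Sat, 03 Jan 2015 01:05:34 +0000\n"
    "Subject: hello\n"
    "\n"
    "First message\n"
    "\n"
    "From bob@example.com Sun Jan  4 02:00:00 2015\n"
    "Message-ID: <2@example.com>\n"
    "Date: Sun, 04 Jan 2015 02:00:00 +0000\n"
    "Subject: re: hello\n"
    "\n"
    "Second message\n"
)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.mbox")
    with open(path, "w", newline="\n") as fd:
        fd.write(SAMPLE)
    messages = list(iter_messages(path))
```

The 'Message-ID' and 'Date' keys read here correspond to the MESSAGE_ID_FIELD and DATE_FIELD constants of the MBox class.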

version = '0.13.1'¶

class perceval.backends.core.mbox.MBoxArchive(filepath)[source]¶

Bases: object

Class to access a mbox archive.

Mbox archives can be stored in plain or compressed files (gzip, bz2 or zip).

Parameters: filepath – path to the mbox file
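One common way to tell plain files from compressed ones is to inspect the first bytes of the file. The sketch below assumes that approach; the source does not say how compressed_type or is_compressed() are computed, and detect_compressed_type and MAGIC_NUMBERS are hypothetical names.

```python
import bz2
import gzip
import os
import tempfile

# Leading "magic" bytes of each compressed format named above.
MAGIC_NUMBERS = {
    "gz": b"\x1f\x8b",
    "bz2": b"BZh",
    "zip": b"PK\x03\x04",
}


def detect_compressed_type(filepath):
    """Return 'gz', 'bz2' or 'zip', or None when the file looks plain."""
    with open(filepath, "rb") as handle:
        head = handle.read(4)
    for name, magic in MAGIC_NUMBERS.items():
        if head.startswith(magic):
            return name
    return None


with tempfile.TemporaryDirectory() as tmpdir:
    plain = os.path.join(tmpdir, "list.mbox")
    gzipped = os.path.join(tmpdir, "list.mbox.gz")
    bzipped = os.path.join(tmpdir, "list.mbox.bz2")
    data = b"From alice@example.com Mon Jan  4 10:00:00 2016\n\nHello\n"
    with open(plain, "wb") as handle:
        handle.write(data)
    with gzip.open(gzipped, "wb") as handle:
        handle.write(data)
    with bz2.open(bzipped, "wb") as handle:
        handle.write(data)
    detected = [detect_compressed_type(p) for p in (plain, gzipped, bzipped)]
```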

property compressed_type¶

property container¶

property filepath¶

is_compressed()[source]¶

class perceval.backends.core.mbox.MBoxCommand(*args, debug=False)[source]¶

Bases: perceval.backend.BackendCommand

Class to run MBox backend from the command line.

BACKEND¶: alias of perceval.backends.core.mbox.MBox

classmethod setup_cmd_parser()[source]¶: Returns the MBox argument parser.

class perceval.backends.core.mbox.MailingList(uri, dirpath)[source]¶

Bases: object

Manage mailing lists archives.

This class gives access to the local mboxes archives that a mailing list manages.

Parameters

uri – URI of the mailing list, usually its URL
dirpath – path to the mboxes archives

property mboxes¶

Get the mboxes managed by this mailing list.

Returns the archives sorted by name.

Returns: a list of MBoxArchive objects

perceval.backends.core.mediawiki module¶

class perceval.backends.core.mediawiki.MediaWiki(url, tag=None, archive=None, ssl_verify=True)[source]¶

Bases: perceval.backend.Backend

MediaWiki backend for Perceval.

This class retrieves the wiki pages and edits from a MediaWiki site. To initialize this class the URL must be provided. The origin of the data will be set to this URL.

It uses different APIs to support MediaWiki versions before and after 1.27. The pre-1.27 approach performs better, but it needs different logic for full and incremental retrieval.

In pre-1.27 versions, the incremental approach uses the recent changes API, which only covers the last MAX_RECENT_DAYS days. If the from_date used is older than that, all the pages must be retrieved and the consumer of the items must filter them itself.

Both approaches return a common format: a page with all its revisions. They differ only in how the list of pages is generated.

The page and revision data downloaded are the standard ones. More data could be gathered using additional properties.

Deleted pages are not analyzed.
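The choice between the retrieval approaches described above can be sketched as a small decision function. This is only an illustration of the documented rules: choose_retrieval is a hypothetical helper, the returned labels reuse the list names that appear among the client constants, and the value given to MAX_RECENT_DAYS is an assumption because the source does not state it.

```python
from datetime import datetime, timedelta, timezone

# Assumed value for illustration; the source names MAX_RECENT_DAYS
# but does not say how many days it covers.
MAX_RECENT_DAYS = 30


def choose_retrieval(from_date, reviews_api, now=None):
    """Pick a page-listing strategy following the class description."""
    now = now or datetime.now(timezone.utc)
    if reviews_api:
        # MediaWiki >= 1.27: a single API serves full and incremental fetches.
        return "allrevisions"
    if from_date is not None and now - from_date <= timedelta(days=MAX_RECENT_DAYS):
        # Recent enough to be covered by the recent changes API.
        return "recentchanges"
    # No date, or a date too old: list every page; the consumer filters by date.
    return "allpages"


now = datetime(2020, 6, 30, tzinfo=timezone.utc)
recent = choose_retrieval(datetime(2020, 6, 20, tzinfo=timezone.utc), False, now)
old = choose_retrieval(datetime(2019, 1, 1, tzinfo=timezone.utc), False, now)
modern = choose_retrieval(datetime(2019, 1, 1, tzinfo=timezone.utc), True, now)
```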

Parameters

url – MediaWiki url
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification

CATEGORIES = ['page']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

fetch(category='page', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), reviews_api=False)[source]¶

Fetch the pages from the backend url.

The method retrieves, from a MediaWiki url, the wiki pages.

Parameters

category – the category of items to fetch
from_date – obtain pages updated since this date
reviews_api – use the reviews API available in MediaWiki >= 1.27

Returns

a generator of pages

fetch_items(category, **kwargs)[source]¶

Fetch the pages

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]¶

Returns whether it supports archiving items on the fetch process.

Returns: this backend supports items archive

classmethod has_resuming()[source]¶

Returns whether the backend supports resuming the fetch process.

Returns: this backend does not support items resuming

static metadata_category(item)[source]¶

Extracts the category from a MediaWiki item.

This backend only generates one type of item which is ‘page’.

static metadata_id(item)[source]¶: Extracts the identifier from a MediaWiki page.

static metadata_updated_on(item)[source]¶

Extracts the update field from a MediaWiki item.

The timestamp is extracted from ‘update’ field. This date is a UNIX timestamp but needs to be converted to a float value.

Parameters: item – item generated by the backend
Returns: a UNIX timestamp

version = '0.11.0'¶

class perceval.backends.core.mediawiki.MediaWikiClient(url, archive=None, from_archive=False, ssl_verify=True)[source]¶

Bases: perceval.client.HttpClient

MediaWiki API client.

This class implements a simple client to retrieve pages from projects in a MediaWiki node.

Parameters

url – URL of mediawiki site: https://wiki.mozilla.org
archive – an archive to store/retrieve the fetched data
from_archive – define whether the archive is used to store/read data
ssl_verify – enable/disable SSL verification

Raises

HTTPError – when an error occurs doing the request

PACTION = 'action'¶

PAP_CONTINUE = 'apcontinue'¶

PAP_LIMIT = 'aplimit'¶

PAP_NAMESPACE = 'apnamespace'¶

PARV_CONTINUE = 'arvcontinue'¶

PARV_DIR = 'arvdir'¶

PARV_LIMIT = 'arvlimit'¶

PARV_NAMESPACE = 'arvnamespace'¶

PARV_PROP = 'arvprop'¶

PARV_START = 'arvstart'¶

PFORMAT = 'format'¶

PLIST = 'list'¶

PMETA = 'meta'¶

PPAGE_IDS = 'pageids'¶

PPROP = 'prop'¶

PRC_CONTINUE = 'rccontinue'¶

PRC_LIMIT = 'rclimit'¶

PRC_NAMESPACE = 'rcnamespace'¶

PRC_PROP = 'rcprop'¶

PRV_DIR = 'rvdir'¶

PRV_LIMIT = 'rvlimit'¶

PRV_START = 'rvstart'¶

PSIPROP = 'siprop'¶

VALL_PAGES = 'allpages'¶

VALL_REVISIONS = 'allrevisions'¶

VIDS = 'ids'¶

VJSON = 'json'¶

VNAMESPACES = 'namespaces'¶

VNEWER = 'newer'¶

VQUERY = 'query'¶

VRC_PROP = 'title|timestamp|ids'¶

VRECENT_CHANGES = 'recentchanges'¶

VREVISIONS = 'revisions'¶

VSITE_INFO = 'siteinfo'¶

call(params)[source]¶

Run an API command.

Parameters

cgi – cgi command to run on the server
params – dict with the HTTP parameters needed to run the given command

get_namespaces()[source]¶: Retrieve all contents namespaces.

get_pages(namespace, apcontinue='')[source]¶: Retrieve all pages from a namespace starting from apcontinue.
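To show how the parameter constants listed above might combine into a request for one batch of pages, here is a minimal sketch. The helper allpages_params and the limit of 500 are assumptions made for the example; only the parameter names and values come from the constants in this class.

```python
from urllib.parse import urlencode


def allpages_params(namespace, apcontinue="", limit=500):
    """Build the HTTP parameters for one 'allpages' query."""
    params = {
        "action": "query",         # PACTION -> VQUERY
        "list": "allpages",        # PLIST -> VALL_PAGES
        "apnamespace": namespace,  # PAP_NAMESPACE
        "aplimit": limit,          # PAP_LIMIT (assumed value)
        "format": "json",          # PFORMAT -> VJSON
    }
    if apcontinue:
        # Resume the listing from the continuation token of the last batch.
        params["apcontinue"] = apcontinue  # PAP_CONTINUE
    return params


query = urlencode(allpages_params(0, apcontinue="Main_Page"))
```

A dict like this would then be handed to something like call(params); how the client sends it is not shown in this reference.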

get_pages_from_allrevisions(namespaces, from_date=None, arvcontinue=None)[source]¶

get_recent_pages(namespaces, rccontinue='')[source]¶: Retrieve recent pages from all namespaces starting from rccontinue.

get_revisions(pageid, last_date=None)[source]¶

get_version()[source]¶

class perceval.backends.core.mediawiki.MediaWikiCommand(*args, debug=False)[source]¶

Bases: perceval.backend.BackendCommand

Class to run MediaWiki backend from the command line.

BACKEND¶: alias of perceval.backends.core.mediawiki.MediaWiki

classmethod setup_cmd_parser()[source]¶: Returns the MediaWiki argument parser.

perceval.backends.core.meetup module¶

class perceval.backends.core.meetup.Meetup(group, api_token, max_items=200, tag=None, archive=None, sleep_for_rate=False, min_rate_to_sleep=1, sleep_time=30, ssl_verify=True)[source]¶

Bases: perceval.backend.Backend

Meetup backend.

This class fetches the events of a group from the Meetup server. Initialize it by passing the OAuth2 token needed for authentication in the api_token parameter.

Parameters

group – name of the group where data will be fetched
api_token – OAuth2 token to access the API
max_items – maximum number of items requested on the same query
tag – label used to mark the data
archive – archive to store/retrieve items
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to sleep until it is reset
sleep_time – time (in seconds) to sleep in case of connection problems
ssl_verify – enable/disable SSL verification

CATEGORIES = ['event']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

CLASSIFIED_FIELDS = [['group', 'topics'], ['event_hosts'], ['rsvps'], ['venue']]¶

A list of fields that should be considered sensitive or confidential.

Fields listed here will be hidden from fetched items, when this behaviour is requested.

Classified data filtering and archiving are not compatible to prevent data leaks or security issues.
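Each entry of CLASSIFIED_FIELDS reads as a path of keys into the item. The sketch below shows one way such paths could be applied; it assumes that "hidden" means the field is removed from a copy of the item, which the source does not specify, and filter_classified is a hypothetical helper name.

```python
import copy

CLASSIFIED_FIELDS = [["group", "topics"], ["event_hosts"], ["rsvps"], ["venue"]]


def filter_classified(item, classified_fields=CLASSIFIED_FIELDS):
    """Return a copy of item without the fields named by each key path."""
    filtered = copy.deepcopy(item)
    for path in classified_fields:
        node = filtered
        # Walk down to the parent of the last key; stop if the path is absent.
        for key in path[:-1]:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if isinstance(node, dict):
            node.pop(path[-1], None)
    return filtered


# Made-up event used only for illustration.
event = {
    "id": "1",
    "group": {"id": 7, "topics": ["python"]},
    "venue": {"city": "Madrid"},
    "rsvps": [],
}
public_event = filter_classified(event)
```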

EXTRA_SEARCH_FIELDS = {'group_id': ['group', 'id'], 'group_name': ['group', 'name']}¶

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

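{
    'key-1': value-1,
    'key-2': value-2,
    'key-3': value-3
}

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item ('item_id': item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

{
    'project_id': ['fields', 'project', 'id'],
    'project_key': ['fields', 'project', 'key'],
    'project_name': ['fields', 'project', 'name']
}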

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.
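A minimal sketch of how such paths could be resolved against an item, using this backend's EXTRA_SEARCH_FIELDS. The helper resolve_search_fields is hypothetical, the 'item_id' entry follows the backend documentation's description of search_fields, and returning None for a missing path is an assumption.

```python
EXTRA_SEARCH_FIELDS = {"group_id": ["group", "id"], "group_name": ["group", "name"]}


def resolve_search_fields(item, item_id, extra_fields=EXTRA_SEARCH_FIELDS):
    """Build the search fields dict by following each key path in the item."""
    search_fields = {"item_id": item_id}
    for field, path in extra_fields.items():
        value = item
        for key in path:
            if not isinstance(value, dict) or key not in value:
                # Path not present in this item (assumed to yield None).
                value = None
                break
            value = value[key]
        search_fields[field] = value
    return search_fields


# Made-up event used only for illustration.
event = {"id": "e42", "group": {"id": 7, "name": "Python Madrid"}}
fields = resolve_search_fields(event, event["id"])
```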

fetch(category='event', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), to_date=None, filter_classified=False)[source]¶

Fetch the events from the server.

This method fetches the events of a group stored on the server that were updated since the given date. The comments and RSVPs of each event are included within it.

Parameters

category – the category of items to fetch
from_date – obtain events updated since this date
to_date – obtain events updated before this date
filter_classified – remove classified fields from the resulting items

Returns

a generator of events

fetch_items(category, **kwargs)[source]¶

Fetch the events

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]¶

Returns whether it supports archiving items on the fetch process.

Returns: this backend supports items archive

classmethod has_resuming()[source]¶

Returns whether the backend supports resuming the fetch process.

Returns: this backend supports items resuming

static metadata_category(item)[source]¶

Extracts the category from a Meetup item.

This backend only generates one type of item which is ‘event’.

static metadata_id(item)[source]¶: Extracts the identifier from a Meetup item.

static metadata_updated_on(item)[source]¶

Extracts and converts the update time from a Meetup item.

The timestamp is extracted from ‘updated’ field and converted to a UNIX timestamp.

Parameters: item – item generated by the backend
Returns: a UNIX timestamp

static parse_json(raw_json)[source]¶

Parse a Meetup JSON stream.

The method parses a JSON stream and returns a list with the parsed data.

Parameters: raw_json – JSON string to parse
Returns: a list with the parsed data

version = '0.17.0'¶

class perceval.backends.core.meetup.MeetupClient(api_token, max_items=200, sleep_for_rate=False, min_rate_to_sleep=1, sleep_time=30, archive=None, from_archive=False, ssl_verify=True)[source]¶

Bases: perceval.client.HttpClient, perceval.client.RateLimitHandler

Meetup API client.

Client for fetching information from the Meetup server using its REST API v3.

Parameters

api_token – OAuth2 token needed to access the API
max_items – maximum number of items per request
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to sleep until it is reset
sleep_time – time (in seconds) to sleep in case of connection problems
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification
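The interplay of sleep_for_rate and min_rate_to_sleep can be pictured with the sketch below. It is an assumption-laden illustration, not the client's code: seconds_to_wait and RateLimitReached are hypothetical names, and raising an error when sleeping is disabled is a guess at the behaviour.

```python
class RateLimitReached(Exception):
    """Hypothetical error used when the client is not allowed to sleep."""


def seconds_to_wait(remaining, reset_in, sleep_for_rate=False, min_rate_to_sleep=1):
    """Return how many seconds to sleep before the next request (0 if none)."""
    if remaining is None or remaining > min_rate_to_sleep:
        # Enough requests left: no need to wait.
        return 0
    if not sleep_for_rate:
        # Assumed behaviour: fail instead of blocking the caller.
        raise RateLimitReached("rate limit reached; reset in %s seconds" % reset_in)
    return max(0, reset_in)


no_wait = seconds_to_wait(remaining=10, reset_in=5)
wait = seconds_to_wait(remaining=1, reset_in=5, sleep_for_rate=True)
```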

EXTRA_STATUS_FORCELIST = [429]¶

PFIELDS = 'fields'¶

PKEY_OAUTH2 = 'Authorization'¶

PORDER = 'order'¶

PPAGE = 'page'¶

PRESPONSE = 'response'¶

PSCROLL = 'scroll'¶

PSTATUS = 'status'¶

RCOMMENTS = 'comments'¶

REVENTS = 'events'¶

RRSVPS = 'rsvps'¶

VEVENT_FIELDS = ['event_hosts', 'featured', 'group_topics', 'plain_text_description', 'rsvpable', 'series']¶

VRESPONSE = ['yes', 'no']¶

VRSVP_FIELDS = ['attendance_status']¶

VSTATUS = ['cancelled', 'upcoming', 'past', 'proposed', 'suggested']¶

VUPDATED = 'updated'¶

calculate_time_to_reset()[source]¶: Returns the number of seconds to wait, as given in the rate limit reset header.

comments(group, event_id)[source]¶: Fetch the comments of a given event.

events(group, from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶: Fetch the events pages of a given group.

rsvps(group, event_id)[source]¶: Fetch the rsvps of a given event.

static sanitize_for_archive(url, headers, payload)[source]¶

Sanitize the payload of an HTTP request by removing the token information before storing/retrieving archived items.

Parameters

url – HTTP url request
headers – HTTP headers request
payload – HTTP payload request

Returns

url, headers and the sanitized payload

class perceval.backends.core.meetup.MeetupCommand(*args, debug=False)[source]¶

Bases: perceval.backend.BackendCommand

Class to run Meetup backend from the command line.

BACKEND¶: alias of perceval.backends.core.meetup.Meetup

classmethod setup_cmd_parser()[source]¶: Returns the Meetup argument parser.

perceval.backends.core.nntp module¶

class perceval.backends.core.nntp.NNTP(host, group, tag=None, archive=None)[source]¶

Bases: perceval.backend.Backend

NNTP backend.

This class fetches the articles published on a news group using NNTP. It is initialized with the host and the name of the news group.

Parameters

host – host
group – name of the group
tag – label used to mark the data
archive – archive to store/retrieve items

CATEGORIES = ['article']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

EXTRA_SEARCH_FIELDS = {'newsgroups': ['Newsgroups']}¶

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

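{
    'key-1': value-1,
    'key-2': value-2,
    'key-3': value-3
}

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item ('item_id': item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

{
    'project_id': ['fields', 'project', 'id'],
    'project_key': ['fields', 'project', 'key'],
    'project_name': ['fields', 'project', 'name']
}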

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.

fetch(category='article', offset=1)[source]¶

Fetch articles posted on a news group.

This method fetches those messages or articles published on a news group starting on the given offset.

Parameters

category – the category of items to fetch
offset – obtain messages from this offset

Returns

a generator of articles

fetch_items(category, **kwargs)[source]¶

Fetch the articles

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]¶

Returns whether it supports archiving items on the fetch process.

Returns: this backend supports items archive

classmethod has_resuming()[source]¶

Returns whether the backend supports resuming the fetch process.

Returns: this backend supports items resuming

metadata(item, filter_classified=False)[source]¶

NNTP metadata.

This method takes items, overriding metadata decorator, to add extra information related to NNTP.

Parameters

item – an item fetched by a backend
filter_classified – sets if classified fields were filtered

static metadata_category(item)[source]¶

Extracts the category from a NNTP item.

This backend only generates one type of item which is ‘article’.

static metadata_id(item)[source]¶: Extracts the identifier from a NNTP item.

static metadata_updated_on(item)[source]¶

Extracts the update time from a NNTP item.

The timestamp is extracted from ‘Date’ field and converted to a UNIX timestamp.

Parameters: item – item generated by the backend
Returns: a UNIX timestamp

static parse_article(raw_article)[source]¶

Parse a NNTP article.

This method parses a NNTP article stored in a string object and returns a dictionary.

Parameters: raw_article – NNTP article string
Returns: a dictionary of type requests.structures.CaseInsensitiveDict
Raises: ParseError – when an error is found parsing the article
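Parsing a raw article into a dictionary can be approximated with the standard library email package. In this sketch, parse_article_sketch is a hypothetical name, a plain dict stands in for requests.structures.CaseInsensitiveDict, ValueError stands in for ParseError, and the 'body' key is an assumption.

```python
import email


def parse_article_sketch(raw_article):
    """Parse a raw article string into a dict of headers plus a 'body' key."""
    message = email.message_from_string(raw_article)
    if not message.keys():
        # Stand-in for ParseError: nothing that looks like headers was found.
        raise ValueError("no headers found in article")
    article = {name: value for name, value in message.items()}
    article["body"] = message.get_payload()
    return article


# Made-up article used only for illustration.
raw = (
    "Message-ID: <abc@example.com>\n"
    "Newsgroups: example.dev\n"
    "Date: Mon, 04 Jan 2016 10:00:00 +0000\n"
    "Subject: Hello\n"
    "\n"
    "Article body\n"
)
article = parse_article_sketch(raw)
```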

version = '0.6.0'¶

class perceval.backends.core.nntp.NNTPCommand(*args, debug=False)[source]¶

Bases: perceval.backend.BackendCommand

Class to run NNTP backend from the command line.

BACKEND¶: alias of perceval.backends.core.nntp.NNTP

classmethod setup_cmd_parser()[source]¶: Returns the NNTP argument parser.

class perceval.backends.core.nntp.NNTTPClient(host, archive=None, from_archive=False)[source]¶

Bases: object

NNTP client

Parameters

host – host
group – name of the group
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive

ARTICLE = 'article'¶

GROUP = 'group'¶

OVER = 'over'¶

article(article_id)[source]¶

Fetch article data

Parameters: article_id – id of the article to fetch

group(group_name)[source]¶

Fetch group data

Parameters: group_name – name of the group

over(offset)[source]¶

Fetch messages data

Parameters: offset – a tuple representing the offset to retrieve

quit()[source]¶

perceval.backends.core.pagure module¶

class perceval.backends.core.pagure.Pagure(namespace=None, repository=None, api_token=None, tag=None, archive=None, max_retries=5, sleep_time=1, max_items=100, ssl_verify=True)[source]¶

Bases: perceval.backend.Backend

Pagure backend for Perceval.

This class fetches the issues stored in a Pagure repository.

Parameters

namespace – Pagure namespace
repository – Pagure repository
api_token – Pagure API token to access the API
tag – label used to mark the data
archive – archive to store/retrieve items
max_retries – number of max retries to a data source before raising a RetryError exception
max_items – max number of category items (e.g., issues, pull requests) per query
sleep_time – time to sleep in case of connection problems
ssl_verify – enable/disable SSL verification

CATEGORIES = ['issue']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

fetch(category='issue', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), to_date=datetime.datetime(2100, 1, 1, 0, 0, tzinfo=tzutc()), filter_classified=False)[source]¶

Fetch the issues from the repository.

The method retrieves, from a Pagure repository, the issues updated since/until the given date.

Parameters

category – the category of items to fetch
from_date – obtain issues updated since this date
to_date – obtain issues updated until a specific date (included)
filter_classified – remove classified fields from the resulting items

Returns

a generator of issues

fetch_items(category, **kwargs)[source]¶

Fetch the items (issues)

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]¶

Returns whether it supports archiving items on the fetch process.

Returns: this backend supports items archive

classmethod has_resuming()[source]¶

Returns whether the backend supports resuming the fetch process.

Returns: this backend supports items resuming

static metadata_category(item)[source]¶

Extracts the category from a Pagure item.

This backend generates one type of item which is ‘issue’.

static metadata_id(item)[source]¶: Extracts the identifier from a Pagure item.

static metadata_updated_on(item)[source]¶

Extracts the update time from a Pagure item.

The timestamp used is extracted from ‘last_updated’ field. This date is converted to UNIX timestamp format. As Pagure dates are in timestamp format the conversion is straightforward.

Parameters: item – item generated by the backend
Returns: a UNIX timestamp

search_fields(item)[source]¶

Add search fields to an item.

It adds the values of metadata_id plus the namespace and repo.

Parameters: item – the item to extract the search fields values
Returns: a dict of search fields

version = '0.1.2'¶

class perceval.backends.core.pagure.PagureClient(namespace, repository, token, sleep_time=1, max_retries=5, max_items=100, archive=None, from_archive=False, ssl_verify=True)[source]¶

Bases: perceval.client.HttpClient

Client for retrieving information from Pagure API

Parameters

namespace – Pagure namespace
repository – Pagure repository
token – Pagure API token to access the API
sleep_time – time to sleep in case of connection problems
max_retries – number of max retries to a data source before raising a RetryError exception
max_items – max number of category items per query
archive – collect issues already retrieved from an archive
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification

HAUTHORIZATION = 'Authorization'¶

PORDER = 'order'¶

PPER_PAGE = 'per_page'¶

PSINCE = 'since'¶

PSTATUS = 'status'¶

RISSUES = 'issues'¶

VORDER_ASC = 'asc'¶

VSTATUS_ALL = 'all'¶

fetch(url, payload=None, headers=None)[source]¶

Fetch the data from a given URL.

Parameters

url – link to the resource
payload – payload of the request
headers – headers of the request

Returns: a response object

fetch_items(path, payload)[source]¶

Return the items from Pagure API using links pagination

Parameters

path – Path from which the item is to be fetched
payload – Payload to be added to the request

Returns

a generator of items

issues(from_date=None)[source]¶

Fetch the issues from the repository.

The method retrieves, from a Pagure repository, the issues updated since the given date.

Parameters: from_date – obtain issues updated since this date
Returns: a generator of issues

static sanitize_for_archive(url, headers, payload)[source]¶

Sanitize payload of a HTTP request by removing the token information before storing/retrieving archived items

Parameters

url – HTTP url request
headers – HTTP headers request
payload – HTTP payload request

Returns

url, headers and the sanitized payload
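What removing the token information might look like is sketched below. It assumes the token travels in the HTTP header named by HAUTHORIZATION and that the inputs are left untouched. Neither point is confirmed by this reference, so treat the function as an illustration only.

```python
import copy

HAUTHORIZATION = "Authorization"


def sanitize_for_archive(url, headers, payload):
    """Return url, headers and payload with the token header stripped."""
    # Work on a copy so the live request keeps its credentials (assumption).
    clean_headers = copy.deepcopy(headers) if headers else headers
    if clean_headers and HAUTHORIZATION in clean_headers:
        clean_headers.pop(HAUTHORIZATION)
    return url, clean_headers, payload


# Made-up request data used only for illustration.
request_headers = {"Authorization": "token SECRET", "Accept": "application/json"}
url, headers, payload = sanitize_for_archive(
    "https://pagure.example.org/api/0/repo/issues",
    request_headers,
    {"status": "all"},
)
```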

class perceval.backends.core.pagure.PagureCommand(*args, debug=False)[source]¶

Bases: perceval.backend.BackendCommand

Class to run Pagure backend from the command line.

BACKEND¶: alias of perceval.backends.core.pagure.Pagure

classmethod setup_cmd_parser()[source]¶: Returns the Pagure argument parser.

perceval.backends.core.phabricator module¶

class perceval.backends.core.phabricator.ConduitClient(base_url, api_token, max_retries=5, sleep_time=1, archive=None, from_archive=False, ssl_verify=True)[source]¶

Bases: perceval.client.HttpClient

Conduit API Client.

Phabricator uses Conduit as the Phabricator REST API. This class implements some of its methods to retrieve the contents from a Phabricator server.

Parameters

base_url – URL of the Phabricator server
api_token – token to get access to restricted methods of the API
max_retries – number of max retries to a data source before raising a RetryError exception
sleep_time – time (in seconds) to sleep in case of connection problems
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification

EXTRA_STATUS_FORCELIST = [429, 502, 503]¶

MANIPHEST_TASKS = 'maniphest.search'¶

MANIPHEST_TRANSACTIONS = 'maniphest.gettasktransactions'¶

PAFTER = 'after'¶

PATTACHMENTS = 'attachments'¶

PCONSTRAINTS = 'constraints'¶

PHAB_PHIDS = 'phid.query'¶

PHAB_USERS = 'user.query'¶

PHIDS = 'phids'¶

PIDS = 'ids'¶

PMODIFIED_START = 'modifiedStart'¶

PORDER = 'order'¶

PPROJECTS = 'projects'¶

URL = '%(base)s/api/%(method)s'¶

VOUTDATED = 'outdated'¶
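The URL constant is a printf-style template with named placeholders. The sketch below fills it for the method named by MANIPHEST_TASKS; the helper conduit_url and the trailing-slash handling are additions for the example, not part of the documented client.

```python
URL = "%(base)s/api/%(method)s"
MANIPHEST_TASKS = "maniphest.search"


def conduit_url(base_url, method):
    """Fill the URL template for a given Conduit method."""
    # Dropping a trailing slash avoids a double slash in the result.
    return URL % {"base": base_url.rstrip("/"), "method": method}


# Hypothetical server address used only for illustration.
tasks_url = conduit_url("https://phab.example.org/", MANIPHEST_TASKS)
```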

phids(*phids)[source]¶

Retrieve data about PHIDs.

Parameters: phids – list of PHIDs

static sanitize_for_archive(url, headers, payload)[source]¶

Sanitize payload of a HTTP request by removing the token information before storing/retrieving archived items

Parameters

url – HTTP url request
headers – HTTP headers request
payload – HTTP payload request

Returns

url, headers and the sanitized payload

tasks(from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶

Retrieve tasks.

Parameters: from_date – retrieve tasks that were updated since that date; dates are converted to epoch time.

transactions(*phids)[source]¶

Retrieve tasks transactions.

Parameters: phids – list of tasks identifiers

users(*phids)[source]¶

Retrieve users.

Parameters: phids – list of user identifiers

exception perceval.backends.core.phabricator.ConduitError(**kwargs)[source]¶

Bases: perceval.errors.BaseError

Raised when an error occurs using Conduit

message = '%(error)s (code: %(code)s)'¶
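The message attribute is a template filled from the keyword arguments given to the exception. The sketch below mimics that pattern with stand-in classes; SketchBaseError and SketchConduitError are hypothetical and only assume that the base class formats message with the received kwargs.

```python
class SketchBaseError(Exception):
    """Stand-in base error that formats its message from keyword arguments."""

    message = "Unknown error"

    def __init__(self, **kwargs):
        super().__init__()
        self.msg = self.message % kwargs

    def __str__(self):
        return self.msg


class SketchConduitError(SketchBaseError):
    """Stand-in for an error that reports a Conduit error text and code."""

    message = "%(error)s (code: %(code)s)"


# Made-up error values used only for illustration.
text = str(SketchConduitError(error="Invalid token", code="ERR-INVALID-AUTH"))
```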

class perceval.backends.core.phabricator.Phabricator(url, api_token, tag=None, archive=None, max_retries=5, sleep_time=1, ssl_verify=True)[source]¶

Bases: perceval.backend.Backend

Phabricator backend.

This class fetches the tasks stored on a Phabricator server. Initialize it by passing the URL of the server and the API token. The origin of the data will be set to this URL.

Parameters

url – URL of the server
api_token – token needed to use the API
tag – label used to mark the data
archive – archive to store/retrieve items
max_retries – number of max retries to a data source before raising a RetryError exception
sleep_time – time (in seconds) to sleep in case of connection problems
ssl_verify – enable/disable SSL verification

CATEGORIES = ['task']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

fetch(category='task', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶

Fetch the tasks from the server.

This method fetches the tasks stored on the server that were updated since the given date. The transactions data related to each task is also included within them.

Parameters

category – the category of items to fetch
from_date – obtain tasks updated since this date

Returns

a generator of tasks

fetch_items(category, **kwargs)[source]¶

Fetch the tasks

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]¶

Returns whether it supports archiving items on the fetch process.

Returns: this backend supports items archive

classmethod has_resuming()[source]¶

Returns whether the backend supports resuming the fetch process.

Returns: this backend supports items resuming

static metadata_category(item)[source]¶

Extracts the category from a Phabricator item.

This backend only generates one type of item which is ‘task’.

static metadata_id(item)[source]¶: Extracts the identifier from a Phabricator item.

static metadata_updated_on(item)[source]¶

Extracts and converts the update time from a Phabricator item.

The timestamp is extracted from ‘dateModified’ field. This date is in UNIX timestamp format but needs to be converted to a float number.

Parameters: item – item generated by the backend
Returns: a UNIX timestamp

static parse_phids(results)[source]¶

Parse a Phabricator PHIDs JSON stream.

This method parses a JSON stream and returns a list iterator. Each item is a dictionary that contains the PHID parsed data.

Parameters: results – JSON to parse
Returns: a generator of parsed PHIDs

static parse_tasks(raw_json)[source]¶

Parse a Phabricator tasks JSON stream.

The method parses a JSON stream and returns a list iterator. Each item is a dictionary that contains the task parsed data.

Parameters: raw_json – JSON string to parse
Returns: a generator of parsed tasks

static parse_tasks_transactions(raw_json)[source]¶

Parse a Phabricator tasks transactions JSON stream.

The method parses a JSON stream and returns a dictionary with the parsed transactions.

Parameters: raw_json – JSON string to parse
Returns: a dict with the parsed transactions

static parse_users(raw_json)[source]¶

Parse a Phabricator users JSON stream.

The method parses a JSON stream and returns a list iterator. Each item is a dictionary that contains the user parsed data.

Parameters: raw_json – JSON string to parse
Returns: a generator of parsed users

version = '0.13.0'¶

class perceval.backends.core.phabricator.PhabricatorCommand(*args, debug=False)[source]¶

Bases: perceval.backend.BackendCommand

Class to run Phabricator backend from the command line.

BACKEND¶: alias of perceval.backends.core.phabricator.Phabricator

classmethod setup_cmd_parser()[source]¶: Returns the Phabricator argument parser.

perceval.backends.core.pipermail module¶

class perceval.backends.core.pipermail.Pipermail(url, dirpath, tag=None, archive=None, ssl_verify=True)[source]¶

Bases: perceval.backends.core.mbox.MBox

Pipermail backend.

This class fetches the email messages stored on a Pipermail archiver. Initialize it by passing the URL of the archiver and the directory path where the mbox files will be fetched and stored. The origin of the data will be set to the value of url.

Parameters

url – URL to the Pipermail archiver
dirpath – directory path where the mboxes are stored
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification

CATEGORIES = ['message']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

fetch(category='message', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶

Fetch the messages from the Pipermail archiver.

The method fetches the mbox files from a remote Pipermail archiver and retrieves the messages stored on them.

Parameters

category – the category of items to fetch
from_date – obtain messages since this date

Returns

a generator of messages

fetch_items(category, **kwargs)[source]¶

Fetch the messages

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]¶

Returns whether it supports archiving items on the fetch process.

Returns: this backend does not support items archive

classmethod has_resuming()[source]¶

Returns whether the backend supports resuming the fetch process.

Returns: this backend supports items resuming

version = '0.11.1'¶

class perceval.backends.core.pipermail.PipermailCommand(*args, debug=False)[source]¶

Bases: perceval.backend.BackendCommand

Class to run Pipermail backend from the command line.

BACKEND¶: alias of perceval.backends.core.pipermail.Pipermail

classmethod setup_cmd_parser()[source]¶: Returns the Pipermail argument parser.

class perceval.backends.core.pipermail.PipermailList(url, dirpath, ssl_verify=True)[source]¶

Bases: perceval.backends.core.mbox.MailingList

Manage mailing list archives stored by Pipermail archiver.

This class gives access to the remote and local mbox archives of a mailing list stored by Pipermail. It also allows keeping them in sync.

Parameters

url – URL to the Pipermail archiver for this list
dirpath – path to the local mboxes archives
ssl_verify – enable/disable SSL verification

fetch(from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶

Fetch the mbox files from the remote archiver.

Stores the archives in the path given during the initialization of this object. Archives that do not have a valid extension are ignored.

Pipermail archives usually include in their file names the date of the stored messages, following the schema year-month. When from_date is given, the method returns only the mboxes whose year and month are equal to or later than that date.

Parameters: from_date – fetch archives that store messages dated on or after the given date; only year and month values are compared
Returns: a list of tuples, storing the links and paths of the fetched archives
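
The year-month comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the helper names are hypothetical, and archive names are assumed to look like '2016-June.txt.gz' (four-digit year, English month name).

```python
import datetime
import re

# Assumption: archive file names start with '<year>-<English month name>.'
MONTHS = {
    name: number
    for number, name in enumerate(
        ['January', 'February', 'March', 'April', 'May', 'June', 'July',
         'August', 'September', 'October', 'November', 'December'],
        start=1)
}
ARCHIVE_RE = re.compile(r'^(?P<year>\d{4})-(?P<month>[A-Za-z]+)\.')


def archive_is_selected(filename, from_date):
    """Return True when the archive's year and month are on or after from_date."""
    match = ARCHIVE_RE.match(filename)
    if match is None or match.group('month') not in MONTHS:
        return False  # names that do not follow the schema are skipped
    year = int(match.group('year'))
    month = MONTHS[match.group('month')]
    # Only year and month are compared; the day of from_date is ignored.
    return (year, month) >= (from_date.year, from_date.month)


def select_archives(filenames, from_date):
    """Keep the archives whose year-month is equal to or later than from_date."""
    return [name for name in filenames if archive_is_selected(name, from_date)]
```

With from_date set to 15 June 2016, an archive named '2016-June.txt.gz' is still selected, because the day is not part of the comparison.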

property mboxes¶

Get the mboxes managed by this mailing list.

Returns the archives sorted by date in ascending order.

Returns: a list of .MBoxArchive objects

perceval.backends.core.redmine module¶

class perceval.backends.core.redmine.Redmine(url, api_token=None, max_issues=100, tag=None, archive=None, ssl_verify=True)[source]¶

Bases: perceval.backend.Backend

Redmine backend.

This class fetches the issues stored on a Redmine server. To initialize it, pass the URL of the server. Some servers require authentication to access certain data; in that case, pass the API token in the api_token parameter.

Parameters

url – URL of the server
api_token – token needed to use the API
max_issues – maximum number of issues requested on the same query
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification

CATEGORIES = ['issue']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

EXTRA_SEARCH_FIELDS = {'project_id': ['project', 'id'], 'project_name': ['project', 'name']}¶

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

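{
    'key-1': value-1,
    'key-2': value-2,
    'key-3': value-3
}

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item ('item_id': item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

{
    'project_id': ['fields', 'project', 'id'],
    'project_key': ['fields', 'project', 'key'],
    'project_name': ['fields', 'project', 'name']
}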

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.
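
The "path" lookup described above can be sketched with the EXTRA_SEARCH_FIELDS value of this backend. The helper names find_value and build_search_fields are hypothetical and the sample issue is invented for illustration; how Perceval resolves the paths internally is not shown in this reference.

```python
# Value documented for the Redmine backend.
EXTRA_SEARCH_FIELDS = {
    'project_id': ['project', 'id'],
    'project_name': ['project', 'name'],
}


def find_value(data, path):
    """Follow the keys in path through nested dicts; return None if a key is missing."""
    current = data
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def build_search_fields(item_data, item_id, extra_fields):
    """Build the search fields dict: the item id plus one entry per extra field."""
    fields = {'item_id': item_id}
    for name, path in extra_fields.items():
        fields[name] = find_value(item_data, path)
    return fields
```

For an issue whose data contains {'project': {'id': 7, 'name': 'demo'}}, the result holds 'project_id' and 'project_name' next to 'item_id'.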

fetch(category='issue', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶

Fetch the issues from the server.

This method fetches the issues stored on the server that were updated since the given date. Data about attachments, journals and watchers (among others) are included within each issue.

Parameters

category – the category of items to fetch
from_date – obtain issues updated since this date

Returns

a generator of issues

fetch_items(category, **kwargs)[source]¶

Fetch the issues

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]¶

Returns whether it supports archiving items on the fetch process.

Returns: this backend supports items archive

classmethod has_resuming()[source]¶

Returns whether it supports resuming the fetch process.

Returns: this backend supports items resuming

static metadata_category(item)[source]¶

Extracts the category from a Redmine item.

This backend only generates one type of item which is ‘issue’.

static metadata_id(item)[source]¶: Extracts the identifier from a Redmine item.

static metadata_updated_on(item)[source]¶

Extracts and converts the update time from a Redmine item.

The timestamp is extracted from ‘updated_on’ field and converted to a UNIX timestamp.

Parameters: item – item generated by the backend
Returns: a UNIX timestamp
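
As a rough illustration of the conversion described above, the sketch below turns an 'updated_on' string into a UNIX timestamp. It assumes the field is a UTC date formatted like '2016-07-27T12:39:42Z'; the function name is hypothetical and this is not the backend's own code.

```python
import datetime


def updated_on_to_unix(item):
    """Convert the item's 'updated_on' string to a UNIX timestamp (float)."""
    # Assumption: the date is expressed in UTC with a trailing 'Z'.
    parsed = datetime.datetime.strptime(item['updated_on'], '%Y-%m-%dT%H:%M:%SZ')
    parsed = parsed.replace(tzinfo=datetime.timezone.utc)
    return parsed.timestamp()
```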

static parse_issue_data(raw_json)[source]¶

Parse a Redmine issue JSON stream.

The method parses a JSON stream and returns a dictionary with the parsed data for the given issue.

Parameters: raw_json – JSON string to parse
Returns: a dictionary with the parsed issue data

static parse_issues(raw_json)[source]¶

Parse a Redmine issues JSON stream.

The method parses a JSON stream and returns a list iterator. Each item is a dictionary that contains the issue parsed data.

Parameters: raw_json – JSON string to parse
Returns: a generator of parsed issues

static parse_user_data(raw_json)[source]¶

Parse a Redmine user JSON stream.

The method parses a JSON stream and returns a dictionary with the parsed data for the given user.

Parameters: raw_json – JSON string to parse
Returns: a dictionary with the parsed user data

version = '0.11.0'¶

class perceval.backends.core.redmine.RedmineClient(base_url, api_token=None, archive=None, from_archive=False, ssl_verify=True)[source]¶

Bases: perceval.client.HttpClient

Redmine API client.

This class implements a client that retrieves issues from a Redmine server. Redmine servers provide a REST API that returns its results in JSON format.

Parameters

base_url – URL of the Redmine server
api_token – token to get access to restricted data stored in the server
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification

CATTACHMENTS = 'attachments'¶

CCHANGESETS = 'changesets'¶

CCHILDREN = 'children'¶

CJOURNALS = 'journals'¶

CJSON = '.json'¶

CRELATIONS = 'relations'¶

CWATCHERS = 'watchers'¶

PINCLUDE = 'include'¶

PKEY = 'key'¶

PLIMIT = 'limit'¶

POFFSET = 'offset'¶

PSORT = 'sort'¶

PSTATUS_ID = 'status_id'¶

PUPDATED_ON = 'updated_on'¶

RISSUES = 'issues'¶

RUSERS = 'users'¶

URL = '%(base)s/%(resource)s'¶

issue(issue_id)[source]¶

Get the information of the given issue.

Parameters: issue_id – issue identifier

issues(from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), offset=None, max_issues=100)[source]¶

Get the information of a list of issues.

Parameters

from_date – retrieve issues that were updated since that date; dates are converted to UTC
offset – starting position for the search
max_issues – maximum number of issues to return per query
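
One plausible way the class constants and the parameters above combine into a request is sketched below. The constant values are the ones listed for this class; build_issues_request is a hypothetical helper, and the choice of status filter, sort key and date format are assumptions rather than documented behaviour.

```python
import datetime

# Constant values as listed for RedmineClient.
URL = '%(base)s/%(resource)s'
RISSUES = 'issues'
CJSON = '.json'
PKEY = 'key'
PLIMIT = 'limit'
POFFSET = 'offset'
PSORT = 'sort'
PSTATUS_ID = 'status_id'
PUPDATED_ON = 'updated_on'


def build_issues_request(base_url, from_date, offset=0, max_issues=100,
                         api_token=None):
    """Hypothetical helper: return (url, params) for one page of issues."""
    url = URL % {'base': base_url, 'resource': RISSUES + CJSON}
    # Dates are converted to UTC before being sent.
    utc_date = from_date.astimezone(datetime.timezone.utc)
    params = {
        PSTATUS_ID: '*',     # assumption: include issues in any status
        PSORT: PUPDATED_ON,  # assumption: sort by update time
        PUPDATED_ON: '>=' + utc_date.strftime('%Y-%m-%dT%H:%M:%SZ'),
        PLIMIT: max_issues,
        POFFSET: offset,
    }
    if api_token is not None:
        params[PKEY] = api_token
    return url, params
```

The server URL used when calling the helper (for instance 'https://redmine.example.com') is a placeholder.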

static sanitize_for_archive(url, headers, payload)[source]¶

Sanitize the payload of an HTTP request by removing the token information before storing/retrieving archived items.

Parameters

url – HTTP url request
headers – HTTP headers request
payload – HTTP payload request

Returns

url, headers and the sanitized payload
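
A minimal sketch of this kind of sanitizing is shown below, assuming the token travels in the payload under the key named by PKEY ('key'). The function body is illustrative only; in particular, copying the payload instead of modifying it in place is a choice made for this sketch.

```python
PKEY = 'key'  # value listed for RedmineClient.PKEY


def sanitize_for_archive(url, headers, payload):
    """Return url, headers and a copy of payload without the API token."""
    if payload and PKEY in payload:
        payload = dict(payload)  # copy so the caller's dict keeps its token
        payload.pop(PKEY)
    return url, headers, payload
```

Removing the token keeps credentials out of the archive and makes the archived request independent of which token was used to fetch it.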

user(user_id)[source]¶

Get the information of the given user.

Parameters: user_id – user identifier

class perceval.backends.core.redmine.RedmineCommand(*args, debug=False)[source]¶

Bases: perceval.backend.BackendCommand

Class to run Redmine backend from the command line.

BACKEND¶: alias of perceval.backends.core.redmine.Redmine

classmethod setup_cmd_parser()[source]¶: Returns the Redmine argument parser.

perceval.backends.core.rocketchat module¶

class perceval.backends.core.rocketchat.RocketChat(url, channel, user_id, api_token, max_items=100, sleep_for_rate=False, min_rate_to_sleep=10, tag=None, archive=None, ssl_verify=True)[source]¶

Bases: perceval.backend.Backend

Rocket.Chat backend.

This class fetches messages from a channel (room) on a Rocket.Chat server. An API token and a user ID are required to access the server.

Parameters

url – server url from where messages are to be fetched
channel – name of the channel from where data will be fetched
user_id – generated User Id using your Rocket.Chat account
api_token – token needed to use the API
max_items – maximum number of messages requested per query
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to sleep until it will be reset
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification

CATEGORIES = ['message']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

EXTRA_SEARCH_FIELDS = {'channel_id': ['channel_info', '_id'], 'channel_name': ['channel_info', 'name']}¶

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

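{
    'key-1': value-1,
    'key-2': value-2,
    'key-3': value-3
}

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item ('item_id': item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

{
    'project_id': ['fields', 'project', 'id'],
    'project_key': ['fields', 'project', 'key'],
    'project_name': ['fields', 'project', 'name']
}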

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.

fetch(category='message', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), filter_classified=False)[source]¶

Fetch the messages from the channel.

This method fetches the messages stored on the channel that were sent since the given date.

Parameters

category – the category of items to fetch
from_date – obtain messages sent since this date
filter_classified – remove classified fields from the resulting items

Returns

a generator of messages

fetch_items(category, **kwargs)[source]¶

Fetch the messages.

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]¶

Returns whether it supports archiving items on the fetch process.

Returns: this backend supports items archive

classmethod has_resuming()[source]¶

Returns whether it supports resuming the fetch process.

Returns: this backend supports items resuming

static metadata_category(item)[source]¶

Extracts the category from a Rocket.Chat item.

This backend only generates one type of item which is ‘message’.

static metadata_id(item)[source]¶: Extracts the identifier from a Rocket.Chat item.

static metadata_updated_on(item)[source]¶

Extracts the update time from a Rocket.Chat item.

The timestamp is extracted from ‘ts’ field, and then converted into a UNIX timestamp.

Parameters: item – item generated by the backend
Returns: extracted timestamp

static parse_channel_info(raw_channel_info)[source]¶

Parse a channel’s information JSON stream.

This method parses a JSON stream, containing the information of the channel, and returns a dict with the parsed data.

Parameters: raw_channel_info – JSON string to parse
Returns: a dict with the parsed channel’s information

static parse_messages(raw_messages)[source]¶

Parse a channel messages JSON stream.

This method parses a JSON stream, containing the history of a channel. It returns a list of messages and the total messages count in that channel.

Parameters: raw_messages – JSON string to parse
Returns: a tuple with a list of dicts with the parsed messages and a total messages count in the channel.
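
The shape of the return value can be illustrated with a short sketch. It assumes the raw JSON carries the messages under a 'messages' key and the channel total under a 'total' key; those key names are assumptions about the server response, and this is not the backend's own code.

```python
import json


def parse_messages(raw_messages):
    """Parse the raw JSON and return (list of message dicts, total count)."""
    result = json.loads(raw_messages)
    messages = result['messages']  # assumed key holding the page of messages
    total = result['total']        # assumed key holding the channel total
    return messages, total
```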

version = '0.1.0'¶

class perceval.backends.core.rocketchat.RocketChatClient(url, user_id, api_token, max_items=100, sleep_for_rate=False, min_rate_to_sleep=10, from_archive=False, archive=None, ssl_verify=True)[source]¶

Bases: perceval.client.HttpClient, perceval.client.RateLimitHandler

Rocket.Chat API client.

Client for fetching information from the Rocket.Chat server using its REST API.

Parameters

url – server url from where messages are to be fetched
user_id – generated User Id using your Rocket.Chat account
api_token – token needed to use the API
max_items – maximum number of messages requested per query
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to sleep until it will be reset
from_archive – it tells whether to write/read the archive
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification

HAUTH_TOKEN = 'X-Auth-Token'¶

HUSER_ID = 'X-User-Id'¶

PCHANNEL_NAME = 'roomName'¶

PCOUNT = 'count'¶

POLDEST = 'oldest'¶

RCHANNEL_INFO = 'channels.info'¶

RCHANNEL_MESSAGES = 'channels.messages'¶

calculate_time_to_reset()[source]¶: Number of seconds to wait. They are contained in the rate limit reset header.

channel_info(channel)[source]¶: Fetch information about a channel.

fetch(url, payload=None, headers=None)[source]¶

Fetch the data from a given URL.

Parameters

url – link to the resource
payload – payload of the request
headers – headers of the request

Returns

a response object

messages(channel, from_date, offset)[source]¶

Fetch messages from a channel.

The messages are fetched in ascending order, i.e. from the oldest to the latest, based on the time they were last updated. A query is also passed as a parameter to fetch the messages from a given date.
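
Because messages() takes an offset, a caller typically walks the channel page by page. The loop below is a generic sketch of offset-based pagination, not Perceval's code: fetch_page is a hypothetical callable that returns one page of messages together with the total count.

```python
def iter_messages(fetch_page, max_items=100):
    """Yield every message by requesting pages at increasing offsets."""
    offset = 0
    while True:
        messages, total = fetch_page(offset, max_items)
        if not messages:
            break  # nothing left to read
        yield from messages
        offset += len(messages)
        if offset >= total:
            break  # every message reported by the server has been read


def make_fake_fetcher(data):
    """Build an in-memory fetch_page for trying out the loop."""
    def fetch_page(offset, count):
        return data[offset:offset + count], len(data)
    return fetch_page
```

With 250 fake messages and max_items=100, the loop requests offsets 0, 100 and 200 and then stops.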

static sanitize_for_archive(url, headers, payload)[source]¶

Sanitize the payload of an HTTP request by removing the token and user ID information before storing/retrieving archived items.

Parameters

url – HTTP url request
headers – HTTP headers request
payload – HTTP payload request

Returns

url, headers and the sanitized payload
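
A minimal sketch of this sanitizing step follows, assuming the credentials are carried in the headers named by HAUTH_TOKEN and HUSER_ID. Where the real client stores them is not stated in this reference, so treat the header-based approach as an assumption.

```python
HAUTH_TOKEN = 'X-Auth-Token'  # value listed for RocketChatClient.HAUTH_TOKEN
HUSER_ID = 'X-User-Id'        # value listed for RocketChatClient.HUSER_ID


def sanitize_for_archive(url, headers, payload):
    """Return url, payload and a copy of headers without the credentials."""
    if headers:
        headers = {
            name: value
            for name, value in headers.items()
            if name not in (HAUTH_TOKEN, HUSER_ID)
        }
    return url, headers, payload
```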

class perceval.backends.core.rocketchat.RocketChatCommand(*args, debug=False)[source]¶

Bases: perceval.backend.BackendCommand

Class to run Rocket.Chat backend from the command line.

BACKEND¶: alias of perceval.backends.core.rocketchat.RocketChat

classmethod setup_cmd_parser()[source]¶: Returns the Rocket.Chat argument parser.

perceval.backends.core.rss module¶

class perceval.backends.core.rss.RSS(url, tag=None, archive=None, ssl_verify=True)[source]¶

Bases: perceval.backend.Backend

RSS backend for Perceval.

This class retrieves the entries from a RSS feed. To initialize this class the URL must be provided. The url will be set as the origin of the data.

Parameters

url – RSS url
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification

CATEGORIES = ['entry']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

fetch(category='entry')[source]¶

Fetch the entries from the url.

The method retrieves all entries from a RSS url

Parameters: category – the category of items to fetch
Returns: a generator of entries

fetch_items(category, **kwargs)[source]¶

Fetch the entries

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]¶

Returns whether it supports archiving entries on the fetch process.

Returns: this backend supports entries archive

classmethod has_resuming()[source]¶

Returns whether it supports resuming the fetch process.

Returns: this backend does not support entries resuming

static metadata_category(item)[source]¶

Extracts the category from a RSS item.

This backend only generates one type of item which is ‘entry’.

static metadata_id(item)[source]¶: Extracts the identifier from an entry item.

static metadata_updated_on(item)[source]¶

Extracts the update time from a RSS item.

The timestamp is extracted from ‘published’ field. This date is a datetime string that needs to be converted to a UNIX timestamp float value.

Parameters: item – item generated by the backend
Returns: a UNIX timestamp

classmethod parse_feed(raw_entries)[source]¶

version = '0.7.0'¶

class perceval.backends.core.rss.RSSClient(url, archive=None, from_archive=False, ssl_verify=True)[source]¶

Bases: perceval.client.HttpClient

RSS API client.

This class implements a simple client to retrieve entries from projects in a RSS node.

Parameters

url – URL of rss node: https://item.opnfv.org/ci
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification

Raises

HTTPError – when an error occurs doing the request

get_entries()[source]¶: Retrieve all entries from a RSS feed

class perceval.backends.core.rss.RSSCommand(*args, debug=False)[source]¶

Bases: perceval.backend.BackendCommand

Class to run RSS backend from the command line.

BACKEND¶: alias of perceval.backends.core.rss.RSS

classmethod setup_cmd_parser()[source]¶: Returns the RSS argument parser.

perceval.backends.core.slack module¶

class perceval.backends.core.slack.Slack(channel, api_token, max_items=1000, tag=None, archive=None, ssl_verify=True)[source]¶

Bases: perceval.backend.Backend

Slack backend.

This class retrieves the messages sent to a Slack channel. To access the server an API token is required, which must have enough permissions to read from the given channel.

The origin of the data will be set to the SLACK_URL plus the identifier of the channel; e.g. ‘https://slack.com/C01234ABC’.

Parameters

channel – identifier of the channel where data will be fetched
api_token – token or key needed to use the API
max_items – maximum number of messages requested per query
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification

CATEGORIES = ['message']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

EXTRA_SEARCH_FIELDS = {'channel_id': ['channel_info', 'id'], 'channel_name': ['channel_info', 'name']}¶

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

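{
    'key-1': value-1,
    'key-2': value-2,
    'key-3': value-3
}

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item ('item_id': item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

{
    'project_id': ['fields', 'project', 'id'],
    'project_key': ['fields', 'project', 'key'],
    'project_name': ['fields', 'project', 'name']
}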

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.

fetch(category='message', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶

Fetch the messages from the channel.

This method fetches the messages stored on the channel that were sent since the given date.

Parameters

category – the category of items to fetch
from_date – obtain messages sent since this date

Returns

a generator of messages

fetch_items(category, **kwargs)[source]¶

Fetch the messages

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]¶

Returns whether it supports archiving items on the fetch process.

Returns: this backend supports items archive

classmethod has_resuming()[source]¶

Returns whether it supports resuming the fetch process.

Returns: this backend does not support items resuming

static metadata_category(item)[source]¶

Extracts the category from a Slack item.

This backend only generates one type of item which is ‘message’.

static metadata_id(item)[source]¶

Extracts the identifier from a Slack item.

This identifier is a combination of two fields because Slack messages do not have a unique identifier. The ‘ts’ and ‘user’ values (or ‘bot_id’ when the message is sent by a bot) are combined because there have been cases where two messages were sent by different users at the same time.

When neither the ‘user’ nor the ‘bot_id’ attribute is present (e.g., the bot was deleted), the fallback is to generate the identifier from the ‘ts’ and ‘username’ values.
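
The fallback order described above can be sketched as follows. The function name is hypothetical, and joining the two values by plain string concatenation is an assumption; the reference only states which fields are combined.

```python
def slack_message_id(item):
    """Combine 'ts' with 'user', 'bot_id' or 'username', in that order of preference."""
    if 'user' in item:
        sender = item['user']
    elif 'bot_id' in item:
        sender = item['bot_id']
    elif 'username' in item:
        sender = item['username']
    else:
        raise KeyError("message has no 'user', 'bot_id' or 'username' field")
    # Assumption: the two values are simply concatenated.
    return item['ts'] + sender
```

Two messages that share the same 'ts' but come from different senders therefore receive different identifiers.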

static metadata_updated_on(item)[source]¶

Extracts and converts the update time from a Slack item.

The timestamp is extracted from ‘ts’ field and converted to a UNIX timestamp.

Parameters: item – item generated by the backend
Returns: a UNIX timestamp

static parse_channel_info(raw_channel_info)[source]¶

Parse a channel info JSON stream.

This method parses a JSON stream, containing the information from a channel, and returns a dict with the parsed data.

Parameters: raw_channel_info – JSON string to parse
Returns: a dict with the parsed information about a channel

static parse_history(raw_history)[source]¶

Parse a channel history JSON stream.

This method parses a JSON stream, containing the history of a channel, and returns a list with the parsed data. It also returns whether there are more messages that are not included in this stream.

Parameters: raw_history – JSON string to parse
Returns: a tuple with a list of dicts with the parsed messages and ‘has_more’ value

static parse_user(raw_user)[source]¶

Parse a user’s info JSON stream.

This method parses a JSON stream, containing the information from a user, and returns a dict with the parsed data.

Parameters: raw_user – JSON string to parse
Returns: a dict with the parsed user’s information

version = '0.10.0'¶

class perceval.backends.core.slack.SlackClient(api_token, max_items=1000, archive=None, from_archive=False, ssl_verify=True)[source]¶

Bases: perceval.client.HttpClient

Slack API client.

Client for fetching information from the Slack server using its REST API.

Parameters

api_token – key needed to use the API
max_items – maximum number of items per request
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification

AUTHORIZATION_HEADER = 'Authorization'¶

PCHANNEL = 'channel'¶

PCOUNT = 'count'¶

PLATEST = 'latest'¶

POLDEST = 'oldest'¶

PTOKEN = 'token'¶

PUSER = 'user'¶

RCONVERSATION_HISTORY = 'conversations.history'¶

RCONVERSATION_INFO = 'conversations.info'¶

RCONVERSATION_MEMBERS = 'conversations.members'¶

RUSER_INFO = 'users.info'¶

URL = 'https://slack.com/api/%(resource)s'¶

channel_info(channel)[source]¶: Fetch information about a channel.

conversation_members(conversation)[source]¶

Fetch the number of members in a conversation, which is a supertype for public and private channels, DMs and group DMs.

Parameters: conversation – the ID of the conversation

history(channel, oldest=None, latest=None)[source]¶: Fetch the history of a channel.

static sanitize_for_archive(url, headers, payload)[source]¶

Sanitize the payload of an HTTP request by removing the token information before storing/retrieving archived items.

Parameters

url – HTTP url request
headers – HTTP headers request
payload – HTTP payload request

Returns

url, headers and the sanitized payload

user(user_id)[source]¶: Fetch user info.

exception perceval.backends.core.slack.SlackClientError(**kwargs)[source]¶

Bases: perceval.errors.BaseError

Raised when an error occurs using the Slack client

message = '%(error)s'¶

class perceval.backends.core.slack.SlackCommand(*args, debug=False)[source]¶

Bases: perceval.backend.BackendCommand

Class to run Slack backend from the command line.

BACKEND¶: alias of perceval.backends.core.slack.Slack

classmethod setup_cmd_parser()[source]¶: Returns the Slack argument parser.

perceval.backends.core.stackexchange module¶

class perceval.backends.core.stackexchange.StackExchange(site, tagged=None, api_token=None, access_token=None, max_questions=100, tag=None, archive=None, ssl_verify=True)[source]¶

Bases: perceval.backend.Backend

StackExchange backend for Perceval.

This class retrieves the questions stored in any of the StackExchange sites. To initialize this class the site must be provided.

Parameters

site – StackExchange site
tagged – filter items by question Tag
api_token – StackExchange application key for the API
access_token – StackExchange user access_token for the API
max_questions – max of questions per page retrieved
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification

CATEGORIES = ['question']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

EXTRA_SEARCH_FIELDS = {'tags': ['tags']}¶

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

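{
    'key-1': value-1,
    'key-2': value-2,
    'key-3': value-3
}

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item ('item_id': item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

{
    'project_id': ['fields', 'project', 'id'],
    'project_key': ['fields', 'project', 'key'],
    'project_name': ['fields', 'project', 'name']
}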

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.

fetch(category='question', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶

Fetch the questions from the site.

The method retrieves, from a StackExchange site, the questions updated since the given date.

Parameters

category – the category of items to fetch
from_date – obtain questions updated since this date

Returns

a generator of questions

fetch_items(category, **kwargs)[source]¶

Fetch the questions

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]¶

Returns whether it supports archiving items on the fetch process.

Returns: this backend supports items archive

classmethod has_resuming()[source]¶

Returns whether it supports resuming the fetch process.

Returns: this backend supports items resuming

static metadata_category(item)[source]¶

Extracts the category from a StackExchange item.

This backend only generates one type of item which is ‘question’.

static metadata_id(item)[source]¶: Extracts the identifier from a StackExchange item.

static metadata_updated_on(item)[source]¶

Extracts the update time from a StackExchange item.

The timestamp is extracted from ‘last_activity_date’ field. This date is a UNIX timestamp but needs to be converted to a float value.

Parameters: item – item generated by the backend
Returns: a UNIX timestamp

static parse_questions(raw_page)[source]¶

Parse a StackExchange API raw response.

The method parses the API response, retrieving the questions from the received items.

Parameters: raw_page – raw API response from which to parse the questions
Returns: a generator of questions

version = '0.12.1'¶

class perceval.backends.core.stackexchange.StackExchangeClient(site, tagged, token, access_token=None, max_questions=100, archive=None, from_archive=False, ssl_verify=True)[source]¶

Bases: perceval.client.HttpClient

StackExchange API client.

This class implements a simple client to retrieve questions from any StackExchange site.

Parameters

site – StackExchange site
tagged – filter items by question Tag
token – StackExchange application key for the API
access_token – StackExchange user access token for the API
max_questions – max number of questions per query
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification

Raises

HTTPError – when an error occurs doing the request

PACCESSTOKEN = 'access_token'¶

PFILTER = 'filter'¶

PKEY = 'key'¶

PMIN = 'min'¶

PORDER = 'order'¶

PPAGE = 'page'¶

PPAGESIZE = 'pagesize'¶

PSITE = 'site'¶

PSORT = 'sort'¶

PTAGGED = 'tagged'¶

RQUESTIONS = 'questions'¶

STACKEXCHANGE_API_URL = 'https://api.stackexchange.com'¶

VERSION_API = '2.2'¶

VQUESTIONS_FILTER = 'Bf*y*ByQD_upZqozgU6lXL_62USGOoV3)MFNgiHqHpmO_Y-jHR'¶

get_questions(from_date)[source]¶

Retrieve all the questions from a given date.

Parameters: from_date – obtain questions updated since this date
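
To show how the class constants might come together in a request for questions updated since a date, here is a hedged sketch. The constant values are those listed for this class; build_questions_request is a hypothetical helper, and the 'desc' and 'activity' values, as well as the omission of the filter and access token parameters, are assumptions made for brevity.

```python
import datetime

# Constant values as listed for StackExchangeClient.
STACKEXCHANGE_API_URL = 'https://api.stackexchange.com'
VERSION_API = '2.2'
RQUESTIONS = 'questions'
PKEY = 'key'
PMIN = 'min'
PORDER = 'order'
PPAGE = 'page'
PPAGESIZE = 'pagesize'
PSITE = 'site'
PSORT = 'sort'
PTAGGED = 'tagged'


def build_questions_request(site, from_date, page=1, max_questions=100,
                            tagged=None, token=None):
    """Hypothetical helper: return (url, params) for one page of questions."""
    url = '/'.join([STACKEXCHANGE_API_URL, VERSION_API, RQUESTIONS])
    params = {
        PSITE: site,
        PMIN: int(from_date.timestamp()),  # lower bound as a UNIX timestamp
        PORDER: 'desc',                    # assumption
        PSORT: 'activity',                 # assumption: sort by last activity
        PPAGE: page,
        PPAGESIZE: max_questions,
    }
    if tagged is not None:
        params[PTAGGED] = tagged
    if token is not None:
        params[PKEY] = token
    return url, params
```

Pass a timezone-aware from_date; a naive datetime would be interpreted in the local timezone by timestamp().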

static sanitize_for_archive(url, headers, payload)[source]¶

Sanitize the payload of an HTTP request by removing the token information before storing/retrieving archived items.

Parameters

url – HTTP url request
headers – HTTP headers request
payload – HTTP payload request

Returns

url, headers and the sanitized payload

class perceval.backends.core.stackexchange.StackExchangeCommand(*args, debug=False)[source]¶

Bases: perceval.backend.BackendCommand

Class to run StackExchange backend from the command line.

BACKEND¶: alias of perceval.backends.core.stackexchange.StackExchange

classmethod setup_cmd_parser()[source]¶: Returns the StackExchange argument parser.

perceval.backends.core.supybot module¶

class perceval.backends.core.supybot.Supybot(uri, dirpath, tag=None, archive=None)[source]¶

Bases: perceval.backend.Backend

Supybot IRC log backend.

This class fetches the messages stored by Supybot in log files. Initialize this class providing the directory where those IRC log files are stored.

The log filenames expected by this backend should follow the pattern #channel_YYYY-MM-DD.log (e.g. #grimoirelab_2016-06-27.log). This is needed to determine the date when messages were sent. Other filenames might work too, but the behaviour is unknown.

The format of the messages must also follow a pattern. These patterns can be found in the SupybotParser class documentation.

Parameters

uri – URI of the IRC archives; typically, the URL of their IRC channel
dirpath – directory path where the archives are stored
tag – label used to mark the data
archive – archive to store/retrieve items
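
The filename pattern mentioned in the class description (#channel_YYYY-MM-DD.log) can be checked with a small helper. This is a sketch for illustration only: parse_log_filename and its regular expression are hypothetical, and the backend's handling of other filenames is not covered here.

```python
import datetime
import re

# '#', the channel name, '_', an ISO date and the '.log' extension.
LOG_RE = re.compile(r'^#(?P<channel>.+)_(?P<date>\d{4}-\d{2}-\d{2})\.log$')


def parse_log_filename(filename):
    """Return (channel, date) for a matching filename, or None otherwise."""
    match = LOG_RE.match(filename)
    if match is None:
        return None
    date = datetime.datetime.strptime(match.group('date'), '%Y-%m-%d').date()
    return match.group('channel'), date
```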

CATEGORIES = ['message']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

fetch(category='message', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶

Fetch the messages from the Supybot IRC logger.

The method parses and returns the messages saved in the IRC log files stored by Supybot in dirpath.

Parameters

category – the category of items to fetch
from_date – obtain messages since this date

Returns

a generator of messages

fetch_items(category, **kwargs)[source]¶

Fetch the messages

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]¶

Returns whether it supports archiving items during the fetch process.

Returns: this backend does not support items archive

classmethod has_resuming()[source]¶

Returns whether it supports resuming the fetch process.

Returns: this backend supports items resuming

static metadata_category(item)[source]¶

Extracts the category from a Supybot item.

This backend only generates one type of item which is ‘message’.

static metadata_id(item)[source]¶

Extracts the identifier from a Supybot item.

This identifier is a mix of three fields because IRC messages do not have any unique identifier. In this case, the ‘timestamp’, ‘nick’ and ‘body’ values are combined because there have been cases where two messages were sent by the same user at the same time.
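A minimal sketch of such a composite identifier follows. How Perceval actually joins the fields is not stated here, so the separator, the SHA-1 digest and the helper name `message_id` are all assumptions.

```python
import hashlib


def message_id(item):
    """Build a stable identifier from the timestamp, nick and body of a message."""
    # Including the body keeps apart two messages sent by the same nick
    # within the same second.
    raw = "-".join([item["timestamp"], item["nick"], item["body"]])
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()
```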

static metadata_updated_on(item)[source]¶

Extracts the update time from a Supybot item.

The timestamp used is extracted from ‘timestamp’ field. This date is converted to UNIX timestamp format taking into account the timezone of the date.

Parameters: item – item generated by the backend
Returns: a UNIX timestamp
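The conversion can be sketched with the standard library alone. This is an illustration rather than the backend's code, and it assumes the timestamp always carries a UTC offset such as +0000; the helper name `to_unix_timestamp` is hypothetical.

```python
import datetime


def to_unix_timestamp(ts):
    """Convert a timestamp like '2016-06-27T12:00:00+0000' to a UNIX timestamp."""
    # %z reads the UTC offset, so the result is timezone-aware.
    dt = datetime.datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S%z")
    return dt.timestamp()
```

Because the offset is taken into account, '2016-06-27T12:00:00+0000' and '2016-06-27T14:00:00+0200' map to the same UNIX timestamp.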

static parse_supybot_log(filepath)[source]¶

Parse a Supybot IRC log file.

The method parses the Supybot IRC log file and returns an iterator of dictionaries. Each of these contains a message from the file.

Parameters

filepath – path to the IRC log file

Returns

a generator of parsed messages

Raises

ParseError – raised when the format of the Supybot log file is invalid
OSError – raised when an error occurs reading the given file

version = '0.10.0'¶

class perceval.backends.core.supybot.SupybotCommand(*args, debug=False)[source]¶

Bases: perceval.backend.BackendCommand

Class to run Supybot backend from the command line.

BACKEND¶: alias of perceval.backends.core.supybot.Supybot

classmethod setup_cmd_parser()[source]¶: Returns the Supybot argument parser.

class perceval.backends.core.supybot.SupybotParser(stream)[source]¶

Bases: object

Supybot IRC parser.

This class parses a Supybot IRC log stream, converting plain log lines (or messages) into dict items. Each dictionary will contain the date of the message, the type of message (comment or server message), the nick of the sender and its body.

Each line in a log starts with a date in ISO format, including its timezone, followed by two spaces and a message.

There are two types of valid messages in a Supybot log: comment messages and server messages. A comment message follows either of these two patterns:

2016-06-27T12:00:00+0000  <nick> body of the message
2016-06-27T12:00:00+0000  * nick waves hello

A valid server message has the following pattern:

2016-06-27T12:00:00+0000  *** nick is known as new_nick

An exception is raised when any of the lines does not follow any of the above formats.

Parameters: stream – an iterator which produces Supybot log lines
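The line format described above can be exercised with a short sketch built on the regular expressions listed among this class's attributes. It is a simplification: the bot and empty-line patterns are left out, `ValueError` stands in for ParseError, and the function name `parse_line` and the output keys are assumptions.

```python
import re

# Patterns copied from the class attributes documented in this section.
# They contain no literal whitespace, so re.VERBOSE is not needed here.
TIMESTAMP = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[\+\-]?\d{0,4})\s\s(?P<msg>.+)$"
)
SERVER = re.compile(r"^\*\*\*\s(?P<body>(?P<nick>(.*?)(!.*)?)\s.+)$")
COMMENT = re.compile(r"^<(?P<nick>(.*?)(!.*)?)>\s(?P<body>.+)$")
ACTION = re.compile(r"^\*\s?(?P<body>(?P<nick>([^\s\*]+?)(!.*)?)\s.+)$")


def parse_line(line):
    """Return a dict for one log line, or raise ValueError if it is malformed."""
    found = TIMESTAMP.match(line.rstrip("\n"))
    if found is None:
        raise ValueError("invalid timestamp or separator: %r" % line)
    ts, msg = found.group("ts"), found.group("msg")

    # Server messages are tried first; both comment forms share one type.
    for regex, itype in ((SERVER, "server"), (COMMENT, "comment"), (ACTION, "comment")):
        result = regex.match(msg)
        if result:
            return {
                "timestamp": ts,
                "type": itype,
                "nick": result.group("nick"),
                "body": result.group("body"),
            }
    raise ValueError("unknown message format: %r" % line)
```

Note that the timestamp and the message must be separated by exactly two whitespace characters for the first pattern to match.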

BOT_PATTERN = '^-(?P<nick>(.*?)(!.*)?)-\\s(?P<body>.+)$'¶

COMMENT_ACTION_PATTERN = '^\\*\\s?(?P<body>(?P<nick>([^\\s\\*]+?)(!.*)?)\\s.+)$'¶

COMMENT_PATTERN = '^<(?P<nick>(.*?)(!.*)?)>\\s(?P<body>.+)$'¶

EMPTY_BOT_PATTERN = '^-(.*?)(!.*)?-\\s*$'¶

EMPTY_COMMENT_ACTION_PATTERN = '^\\*\\s?([^\\s\\*]+?)(!.*)?\\s*$'¶

EMPTY_COMMENT_PATTERN = '^<(.*?)(!.*)?>\\s*$'¶

EMPTY_PATTERN = '^\\s*$'¶

SERVER_PATTERN = '^\\*\\*\\*\\s(?P<body>(?P<nick>(.*?)(!.*)?)\\s.+)$'¶

SUPYBOT_BOT_REGEX = re.compile('^-(?P<nick>(.*?)(!.*)?)-\\s(?P<body>.+)$', re.VERBOSE)¶

SUPYBOT_COMMENT_ACTION_REGEX = re.compile('^\\*\\s?(?P<body>(?P<nick>([^\\s\\*]+?)(!.*)?)\\s.+)$', re.VERBOSE)¶

SUPYBOT_COMMENT_REGEX = re.compile('^<(?P<nick>(.*?)(!.*)?)>\\s(?P<body>.+)$', re.VERBOSE)¶

SUPYBOT_EMPTY_BOT_REGEX = re.compile('^-(.*?)(!.*)?-\\s*$', re.VERBOSE)¶

SUPYBOT_EMPTY_COMMENT_ACTION_REGEX = re.compile('^\\*\\s?([^\\s\\*]+?)(!.*)?\\s*$', re.VERBOSE)¶

SUPYBOT_EMPTY_COMMENT_REGEX = re.compile('^<(.*?)(!.*)?>\\s*$', re.VERBOSE)¶

SUPYBOT_EMPTY_REGEX = re.compile('^\\s*$', re.VERBOSE)¶

SUPYBOT_SERVER_REGEX = re.compile('^\\*\\*\\*\\s(?P<body>(?P<nick>(.*?)(!.*)?)\\s.+)$', re.VERBOSE)¶

SUPYBOT_TIMESTAMP_REGEX = re.compile('^(?P<ts>\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}[\\+\\-]?\\d{0,4})\\s\\s\n (?P<msg>.+)$\n ', re.VERBOSE)¶

TCOMMENT = 'comment'¶

TIMESTAMP_PATTERN = '^(?P<ts>\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}[\\+\\-]?\\d{0,4})\\s\\s\n (?P<msg>.+)$\n '¶

TSERVER = 'server'¶

parse()[source]¶

Parse a Supybot IRC stream.

Returns an iterator of dicts. Each dict contains information about the date, type, nick and body of a single log entry.

Returns: iterator of parsed lines
Raises: ParseError – when an invalid line is found parsing the given stream

perceval.backends.core.telegram module¶

class perceval.backends.core.telegram.Telegram(bot, bot_token, tag=None, archive=None, ssl_verify=True)[source]¶

Bases: perceval.backend.Backend

Telegram backend.

The Telegram backend fetches the messages that a Telegram bot can receive. Usually, these messages are direct or private messages but a bot can be configured to receive every message sent to a channel/group where it is subscribed. Take into account that messages are removed from the Telegram server 24 hours after they are sent. Moreover, once they are fetched using an offset, these messages are also removed. This means every time this backend is called, messages will be deleted.

Initialize this class passing the name of the bot and the authentication token used by this bot. The authentication token is provided by Telegram once the bot is created.

The origin of the data will be set to the TELEGRAM_URL plus the name of the bot; e.g. ‘http://telegram.org/mybot’.

Parameters

bot – name of the bot
bot_token – authentication token used by the bot
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification

CATEGORIES = ['message']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

EXTRA_SEARCH_FIELDS = {'chat_id': ['message', 'chat', 'id'], 'chat_name': ['message', 'chat', 'title']}¶

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

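{
    ‘key-1’: value-1,
    ‘key-2’: value-2,
    ‘key-3’: value-3
}

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

{
    ‘project_id’: [‘fields’, ‘project’, ‘id’],
    ‘project_key’: [‘fields’, ‘project’, ‘key’],
    ‘project_name’: [‘fields’, ‘project’, ‘name’]
}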

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.
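The idea of a “path” can be shown with a small sketch that walks each path through a nested item. It is not Perceval's implementation; the helper name `resolve_search_fields` and the choice to return None for a missing step are assumptions.

```python
def resolve_search_fields(item, extra_search_fields):
    """Follow each path in extra_search_fields through the nested item."""
    fields = {}
    for name, path in extra_search_fields.items():
        value = item
        for key in path:
            # Stop at the first missing step and record None for this field.
            if not isinstance(value, dict) or key not in value:
                value = None
                break
            value = value[key]
        fields[name] = value
    return fields
```

With the Telegram value of EXTRA_SEARCH_FIELDS shown above, an item whose ‘message’ holds a ‘chat’ with ‘id’ and ‘title’ keys would produce the ‘chat_id’ and ‘chat_name’ search fields.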

fetch(category='message', offset=1, chats=None)[source]¶

Fetch the messages the bot can read from the server.

The method retrieves, from the Telegram server, the messages sent with an offset equal to or greater than the given one.

A list of chats, groups and channels identifiers can be set using the parameter chats. When it is set, only those messages sent to any of these will be returned. An empty list will return no messages.

Parameters

category – the category of items to fetch
offset – obtain messages from this offset
chats – list of chat names used to filter messages

Returns

a generator of messages

Raises

ValueError – when chats is an empty list

fetch_items(category, **kwargs)[source]¶

Fetch the messages

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]¶

Returns whether it supports archiving items during the fetch process.

Returns: this backend supports items archive

classmethod has_resuming()[source]¶

Returns whether it supports resuming the fetch process.

Returns: this backend supports items resuming

metadata(item, filter_classified=False)[source]¶

Telegram metadata.

The method takes an item and overrides the metadata information to add extra information related to Telegram.

Currently, it adds the ‘offset’ keyword.

Parameters

item – an item fetched by a backend
filter_classified – sets if classified fields were filtered

static metadata_category(item)[source]¶

Extracts the category from a Telegram item.

This backend only generates one type of item which is ‘message’.

static metadata_id(item)[source]¶: Extracts the identifier from a Telegram item.

static metadata_updated_on(item)[source]¶

Extracts and converts the update time from a Telegram item.

The timestamp is extracted from ‘date’ field that is inside of ‘message’ dict. This date is converted to UNIX timestamp format.

Parameters: item – item generated by the backend
Returns: a UNIX timestamp

static parse_messages(raw_json)[source]¶

Parse a Telegram JSON messages list.

The method parses the JSON stream and returns an iterator of dictionaries. Each of these contains a Telegram message.

Parameters: raw_json – JSON string to parse
Returns: a generator of parsed messages
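A rough sketch of this kind of parsing is given below. It assumes the raw JSON has the usual Bot API response layout, with the updates listed under a "result" key; the helper name `parse_updates` is hypothetical.

```python
import json


def parse_updates(raw_json):
    """Yield each update contained in a getUpdates-style JSON response."""
    data = json.loads(raw_json)
    # Assumed layout: {"ok": true, "result": [ {...}, {...} ]}
    for update in data.get("result", []):
        yield update
```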

version = '0.11.1'¶

class perceval.backends.core.telegram.TelegramBotClient(bot_token, archive=None, from_archive=False, ssl_verify=True)[source]¶

Bases: perceval.client.HttpClient

Telegram Bot API 2.0 client.

This class implements a simple client to retrieve those messages sent to a Telegram bot. This includes personal messages or messages sent to a channel (when privacy settings are disabled).

Parameters

bot_token – token for the bot
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification

API_URL = 'https://api.telegram.org/bot%(token)s/%(method)s'¶

OFFSET = 'offset'¶

UPDATES_METHOD = 'getUpdates'¶

static sanitize_for_archive(url, headers, payload)[source]¶

Sanitize the URL of an HTTP request by removing the token information before storing/retrieving archived items.

Parameters

url – HTTP url request
headers – HTTP headers request
payload – HTTP payload request

Returns

the sanitized url, plus the headers and payload
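Because the token is embedded in the URL built from API_URL, sanitizing means rewriting that URL segment. The sketch below illustrates the idea with an invented token; the regular expression, the placeholder and the helper name `sanitize_url` are assumptions, not the client's actual code.

```python
import re

# Template as documented in the API_URL attribute of this class.
API_URL = "https://api.telegram.org/bot%(token)s/%(method)s"


def sanitize_url(url, placeholder="XXXXX"):
    """Replace the token segment of a Bot API URL with a placeholder."""
    return re.sub(r"/bot[^/]+/", "/bot%s/" % placeholder, url)


# The token below is made up for the example.
url = API_URL % {"token": "123456:ABC-DEF", "method": "getUpdates"}
sanitized = sanitize_url(url)
```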

updates(offset=None)[source]¶

Fetch the messages that a bot can read.

When the offset is given, it will retrieve all the messages whose offset is greater than or equal to it. Note that, due to how the API works, all previous messages will be removed from the server.

Parameters: offset – fetch the messages starting on this offset

class perceval.backends.core.telegram.TelegramCommand(*args, debug=False)[source]¶

Bases: perceval.backend.BackendCommand

Class to run Telegram backend from the command line.

BACKEND¶: alias of perceval.backends.core.telegram.Telegram

classmethod setup_cmd_parser()[source]¶: Returns the Telegram argument parser.

perceval.backends.core.twitter module¶

class perceval.backends.core.twitter.Twitter(query, api_token, max_items=100, sleep_for_rate=False, min_rate_to_sleep=1, sleep_time=30, tag=None, archive=None, ssl_verify=True)[source]¶

Bases: perceval.backend.Backend

Twitter backend.

This class fetches samples of tweets containing specific keywords. Initialize this class by passing the API token needed for authentication with the parameter api_token.

Parameters

query – query to fetch tweets
api_token – token or key needed to use the API
max_items – maximum number of items requested on the same query
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to sleep until it will be reset
sleep_time – time (in seconds) to sleep in case of connection problems
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification

CATEGORIES = ['tweet']¶

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

fetch(category='tweet', since_id=None, max_id=None, geocode=None, lang=None, include_entities=True, tweets_type='mixed')[source]¶

Fetch the tweets from the server.

This method fetches tweets published in the last seven days from the Twitter Search API.

Parameters

category – the category of items to fetch
since_id – if not null, it returns results with an ID greater than the specified ID
max_id – if not None, it returns results with an ID less than the specified ID
geocode – if enabled, returns tweets by users located at latitude,longitude,"mi"|"km"
lang – if enabled, restricts tweets to the given language, given by an ISO 639-1 code
include_entities – if disabled, it excludes entities node
tweets_type – type of tweets returned. Default is “mixed”, others are “recent” and “popular”

Returns

a generator of tweets

fetch_items(category, **kwargs)[source]¶

Fetch the tweets

Parameters

category – the category of items to fetch
kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]¶

Returns whether it supports archiving items during the fetch process.

Returns: this backend supports items archive

classmethod has_resuming()[source]¶

Returns whether it supports resuming the fetch process.

Returns: this backend supports items resuming

static metadata_category(item)[source]¶

Extracts the category from a Twitter item.

This backend only generates one type of item which is ‘tweet’.

static metadata_id(item)[source]¶: Extracts the identifier from a Twitter item.

static metadata_updated_on(item)[source]¶

Extracts and converts the update time from a Twitter item.

The timestamp is extracted from ‘created_at’ field and converted to a UNIX timestamp.

Parameters: item – item generated by the backend
Returns: a UNIX timestamp
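As a sketch of the conversion, the standard library can parse the ‘created_at’ string directly. The format string assumes the layout used by the v1.1 API (for example "Wed Oct 10 20:19:24 +0000 2018") and an English locale for day and month names; the helper name `created_at_to_timestamp` is hypothetical.

```python
import datetime


def created_at_to_timestamp(created_at):
    """Convert a 'created_at' value into a UNIX timestamp."""
    # %z reads the UTC offset, so the result is timezone-aware.
    dt = datetime.datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")
    return dt.timestamp()
```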

search_fields(item)[source]¶

Add search fields to an item.

It adds the values of metadata_id plus the hashtags of a tweet.

Parameters: item – the item to extract the search fields values
Returns: a dict of search fields
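A possible shape for this method is sketched below. It assumes the tweet follows the v1.1 layout, where hashtags live under ‘entities’ and each one has a ‘text’ key, and that ‘id_str’ is the identifier; the helper name and the ‘hashtags’ output key are hypothetical.

```python
def tweet_search_fields(item):
    """Return the item id plus the hashtag texts found in a tweet."""
    hashtags = [tag["text"] for tag in item.get("entities", {}).get("hashtags", [])]
    return {
        "item_id": item["id_str"],  # assumed to be what metadata_id returns
        "hashtags": hashtags,
    }
```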

version = '0.4.0'¶

class perceval.backends.core.twitter.TwitterClient(api_key, max_items=100, sleep_for_rate=False, min_rate_to_sleep=1, sleep_time=30, archive=None, from_archive=False, ssl_verify=True)[source]¶

Bases: perceval.client.HttpClient, perceval.client.RateLimitHandler

Twitter API client.

Client for fetching information from the Twitter server using its REST API v1.1.

Parameters

api_key – key needed to use the API
max_items – maximum number of items per request
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to sleep until it will be reset
sleep_time – time (in seconds) to sleep in case of connection problems
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification

HAUTHORIZATION = 'Authorization'¶

PCOUNT = 'count'¶

PGEOCODE = 'geocode'¶

PINCLUDE_ENTITIES = 'include_entities'¶

PLANG = 'lang'¶

PMAX_ID = 'max_id'¶

PQUERY = 'q'¶

PRESULT_TYPE = 'result_type'¶

PSINCE_ID = 'since_id'¶

calculate_time_to_reset()[source]¶: Number of seconds to wait. They are contained in the rate limit reset header
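One way to picture this calculation is shown below. It assumes the reset header carries an epoch time in seconds, which is an assumption about the header rather than something stated here; the helper name `seconds_until_reset` is hypothetical.

```python
import time


def seconds_until_reset(reset_epoch, now=None):
    """Return how many seconds remain until the rate limit resets, never negative."""
    now = time.time() if now is None else now
    return max(0, int(reset_epoch) - int(now))
```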

static sanitize_for_archive(url, headers, payload)[source]¶

Sanitize the payload of an HTTP request by removing the token information before storing/retrieving archived items.

Parameters

url – HTTP url request
headers – HTTP headers request
payload – HTTP payload request

Returns

url, headers and the sanitized payload

tweets(query, since_id=None, max_id=None, geocode=None, lang=None, include_entities=True, result_type='mixed')[source]¶

Fetch tweets for a given query between since_id and max_id.

Parameters

query – query to fetch tweets
since_id – if not null, it returns results with an ID greater than the specified ID
max_id – if not null, it returns results with an ID less than the specified ID
geocode – if enabled, returns tweets by users located at latitude,longitude,"mi"|"km"
lang – if enabled, restricts tweets to the given language, given by an ISO 639-1 code
include_entities – if disabled, it excludes entities node
result_type – type of tweets returned. Default is “mixed”, others are “recent” and “popular”

Returns

a generator of tweets
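The arguments above map naturally onto the request parameter names listed in the class constants (PQUERY, PCOUNT, PSINCE_ID and so on). The sketch below builds such a parameter dict; the helper name `build_search_params` and the choice to omit unset optional values are assumptions, not the client's actual behaviour.

```python
def build_search_params(query, max_items=100, since_id=None, max_id=None,
                        geocode=None, lang=None, include_entities=True,
                        result_type="mixed"):
    """Build the query parameters for a search request."""
    params = {
        "q": query,
        "count": max_items,
        "include_entities": include_entities,
        "result_type": result_type,
    }
    optional = {
        "since_id": since_id,
        "max_id": max_id,
        "geocode": geocode,
        "lang": lang,
    }
    # Only send the optional filters that were actually set.
    params.update({k: v for k, v in optional.items() if v is not None})
    return params
```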

class perceval.backends.core.twitter.TwitterCommand(*args, debug=False)[source]¶

Bases: perceval.backend.BackendCommand

Class to run Twitter backend from the command line.

BACKEND¶: alias of perceval.backends.core.twitter.Twitter

classmethod setup_cmd_parser()[source]¶: Returns the Twitter argument parser.
