perceval.backends.core package

Submodules

perceval.backends.core.askbot module

class perceval.backends.core.askbot.Askbot(url, tag=None, archive=None, ssl_verify=True)[source]

Bases: perceval.backend.Backend

Askbot backend.

This class retrieves the questions posted on an Askbot site. To initialize this class the URL must be provided. The url will be set as the origin of the data.

Parameters
  • url – Askbot site URL

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

  • ssl_verify – enable/disable SSL verification

CATEGORIES = ['question']

A list of categories that can be fetched by this backend.

Every backend is able to produce items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

EXTRA_SEARCH_FIELDS = {'tags': ['tags']}

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

{
    'key-1': value-1,
    'key-2': value-2,
    'key-3': value-3
}

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

{
    'project_id': ['fields', 'project', 'id'],
    'project_key': ['fields', 'project', 'key'],
    'project_name': ['fields', 'project', 'name']
}

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.
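As a rough illustration of how such a "path" can be resolved, the sketch below walks each list of keys through a nested item and collects the results next to the default item_id field. The helper names find_value and build_search_fields, the sample item, and the choice to store None for a missing path are assumptions made for this example, not part of the documented API.

```python
def find_value(item, path):
    """Follow a list of keys (or list indexes) down into a nested item."""
    value = item
    for key in path:
        value = value[key]
    return value


def build_search_fields(item, item_id, extra_search_fields):
    """Return the default 'item_id' field plus every extra search field."""
    fields = {'item_id': item_id}
    for name, path in extra_search_fields.items():
        try:
            fields[name] = find_value(item, path)
        except (KeyError, IndexError, TypeError):
            # Assumption: a path that cannot be resolved yields None.
            fields[name] = None
    return fields


EXTRA_SEARCH_FIELDS = {
    'project_id': ['fields', 'project', 'id'],
    'project_key': ['fields', 'project', 'key'],
    'project_name': ['fields', 'project', 'name'],
}

# Hypothetical item, shaped only to match the paths above.
item = {
    'id': '42',
    'fields': {'project': {'id': '10', 'key': 'PRJ', 'name': 'Project'}},
}

search_fields = build_search_fields(item, item['id'], EXTRA_SEARCH_FIELDS)
```

Because the path elements are used as plain subscripts, a path may mix dict keys and list indexes, such as ['product', 0, '__text__'].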

fetch(category='question', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the questions/answers from the repository.

The method retrieves, from an Askbot site, the questions and answers updated since the given date.

Parameters
  • category – the category of items to fetch

  • from_date – obtain questions/answers updated since this date

Returns

a generator of items

fetch_items(category, **kwargs)[source]

Fetch the questions

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]

Returns whether the backend supports archiving items during the fetch process.

Returns

this backend supports archiving items

classmethod has_resuming()[source]

Returns whether the backend supports resuming the fetch process.

Returns

this backend supports resuming the fetch process

static metadata_category(item)[source]

Extracts the category from an Askbot item.

This backend only generates one type of item which is ‘question’.

static metadata_id(item)[source]

Extracts the identifier from an Askbot question item.

static metadata_updated_on(item)[source]

Extracts the update time from an Askbot item.

The timestamp is extracted from the ‘last_activity_at’ field. This date is a UNIX timestamp but needs to be converted to a float value.

Parameters

item – item generated by the backend

Returns

a UNIX timestamp

version = '0.8.0'
class perceval.backends.core.askbot.AskbotClient(base_url, archive=None, from_archive=False, ssl_verify=True)[source]

Bases: perceval.client.HttpClient

Askbot client.

This class implements a simple client to retrieve different kinds of data from an Askbot site.

Parameters
  • base_url – URL of the Askbot site

  • archive – an archive to store/read fetched data

  • from_archive – it tells whether to write/read the archive

  • ssl_verify – enable/disable SSL verification

Raises

HTTPError – when an error occurs doing the request

API_QUESTIONS = 'api/v1/questions/'
HREQUEST_WITH = 'X-Requested-With'
PAVATAR_SIZE = 'avatar_size'
PPAGE = 'page'
PPOST_ID = 'post_id'
PPOST_TYPE = 'post_type'
PSORT = 'sort'
RCOMMENTS = 's/post_comments'
RCOMMENTS_OLD = 'post_comments'
RHTML_QUESTION = 'question/'
VANSWER = 'answer'
VAVATAR_SIZE = 0
VHTTP_REQUEST = 'XMLHttpRequest'
VORDER_API = 'activity-asc'
VORDER_HTML = 'votes'
get_api_questions(path)[source]

Retrieve a question page using the API.

Parameters

path – path of the questions page to retrieve

get_comments(post_id)[source]

Retrieve a list of comments by a given id.

Parameters

post_id – post identifier

get_html_question(question_id, page=1)[source]

Retrieve a raw HTML question and all its information.

Parameters
  • question_id – question identifier

  • page – page to retrieve

class perceval.backends.core.askbot.AskbotCommand(*args, debug=False)[source]

Bases: perceval.backend.BackendCommand

Class to run Askbot backend from the command line.

BACKEND

alias of perceval.backends.core.askbot.Askbot

classmethod setup_cmd_parser()[source]

Returns the Askbot argument parser.

class perceval.backends.core.askbot.AskbotParser[source]

Bases: object

Askbot HTML parser.

This class parses a plain HTML document, converting questions, answers, comments and user information into dict items.

static parse_answers(html_question)[source]

Parse the answers of a given HTML question.

The method parses the answers related with a given HTML question, as well as all the comments related to the answer.

Parameters

html_question – raw HTML question element

Returns

a list with the answers

static parse_number_of_html_pages(html_question)[source]

Parse number of answer pages to paginate over them.

Parameters

html_question – raw HTML question element

Returns

an integer with the number of pages

static parse_question_container(html_question)[source]

Parse the question info container of a given HTML question.

The method parses the information available in the question information container. The container can have up to 2 elements: the first one contains the information related to the user who generated the question and the date (if any). The second one contains the date of the update and the user who updated it (if not the same who generated the question).

Parameters

html_question – raw HTML question element

Returns

an object with the parsed information

static parse_user_info(update_info)[source]

Parse the user information of a given HTML container.

The method parses all the available user information in the container. If the class “user-info” exists, the method gets all the available information in the container. Otherwise, if a class “tip” exists, the container is a wiki post with no associated user. Failing both, the container may be empty.

Parameters

update_info – beautiful soup answer container element

Returns

an object with the parsed information

perceval.backends.core.bugzilla module

class perceval.backends.core.bugzilla.Bugzilla(url, user=None, password=None, max_bugs=200, max_bugs_csv=10000, tag=None, archive=None, ssl_verify=True)[source]

Bases: perceval.backend.Backend

Bugzilla backend.

This class fetches the bugs stored in a Bugzilla repository. To initialize this class the URL of the server must be provided. The url will be set as the origin of the data.

Parameters
  • url – Bugzilla server URL

  • user – Bugzilla user

  • password – Bugzilla user password

  • max_bugs – maximum number of bugs requested on the same query

  • max_bugs_csv – maximum number of bugs requested per CSV query

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

  • ssl_verify – enable/disable SSL verification

CATEGORIES = ['bug']

A list of categories that can be fetched by this backend.

Every backend is able to produce items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

EXTRA_SEARCH_FIELDS = {'component': ['component', 0, '__text__'], 'product': ['product', 0, '__text__']}

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

{
    'key-1': value-1,
    'key-2': value-2,
    'key-3': value-3
}

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

{
    'project_id': ['fields', 'project', 'id'],
    'project_key': ['fields', 'project', 'key'],
    'project_name': ['fields', 'project', 'name']
}

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.

fetch(category='bug', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the bugs from the repository.

The method retrieves, from a Bugzilla repository, the bugs updated since the given date.

Parameters
  • category – the category of items to fetch

  • from_date – obtain bugs updated since this date

Returns

a generator of bugs

fetch_items(category, **kwargs)[source]

Fetch the bugs

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]

Returns whether the backend supports archiving items during the fetch process.

Returns

this backend supports archiving items

classmethod has_resuming()[source]

Returns whether the backend supports resuming the fetch process.

Returns

this backend supports resuming the fetch process

static metadata_category(item)[source]

Extracts the category from a Bugzilla item.

This backend only generates one type of item which is ‘bug’.

static metadata_id(item)[source]

Extracts the identifier from a Bugzilla item.

static metadata_updated_on(item)[source]

Extracts and converts the update time from a Bugzilla item.

The timestamp is extracted from the ‘delta_ts’ field. This date is converted to UNIX timestamp format. Because Bugzilla servers ignore the timezone on HTTP requests, it is ignored during the conversion, too.

Parameters

item – item generated by the backend

Returns

a UNIX timestamp
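The conversion described above can be sketched with the standard library alone. This is only an illustration: the function name delta_ts_to_timestamp and the sample date format are assumptions, and the sketch takes "ignoring the timezone" to mean that the wall-clock time is read as UTC.

```python
from datetime import datetime, timezone


def delta_ts_to_timestamp(delta_ts):
    """Convert a 'delta_ts' string to a UNIX timestamp, ignoring its timezone."""
    # Assumption: the value looks like '2013-06-25 11:55:46 +0200'.
    parsed = datetime.strptime(delta_ts, '%Y-%m-%d %H:%M:%S %z')
    # Discard the reported offset and treat the wall-clock time as UTC.
    as_utc = parsed.replace(tzinfo=timezone.utc)
    return as_utc.timestamp()


ts = delta_ts_to_timestamp('2013-06-25 11:55:46 +0200')
```

With this approach two strings that differ only in their offset produce the same timestamp.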

static parse_bug_activity(raw_html)[source]

Parse a Bugzilla bug activity HTML stream.

This method extracts the information about activity from the given HTML stream. The bug activity is stored in an HTML table. Each parsed activity event is returned as a dictionary.

If the given HTML is invalid, the method will raise a ParseError exception.

Parameters

raw_html – HTML string to parse

Returns

a generator of parsed activity events

Raises

ParseError – raised when an error occurs parsing the given HTML stream

static parse_buglist(raw_csv)[source]

Parse a Bugzilla CSV bug list.

The method parses the CSV file and returns an iterator of dictionaries. Each of these contains the summary of a bug.

Parameters

raw_csv – CSV string to parse

Returns

a generator of parsed bugs
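A minimal sketch of this kind of CSV parsing, written with the standard csv module. The column names in the sample are hypothetical; a real bug list uses whatever columns the server returns.

```python
import csv
import io


def parse_buglist(raw_csv):
    """Yield one dictionary per row of a CSV bug list."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    for row in reader:
        # Each row maps the header names to that row's values.
        yield dict(row)


# Hypothetical sample; real column names depend on the server.
raw_csv = (
    'bug_id,changeddate\n'
    '15,2009-07-30 11:35:33\n'
    '17,2009-07-30 11:35:37\n'
)

bugs = list(parse_buglist(raw_csv))
```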

static parse_bugs_details(raw_xml)[source]

Parse a Bugzilla bugs details XML stream.

This method returns a generator which parses the given XML, producing an iterator of dictionaries. Each dictionary stores the information related to a parsed bug.

If the given XML is invalid or does not contain any bug, the method will raise a ParseError exception.

Parameters

raw_xml – XML string to parse

Returns

a generator of parsed bugs

Raises

ParseError – raised when an error occurs parsing the given XML stream
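The behaviour described above (a generator of dictionaries, with an error for invalid or empty input) might look roughly like the following. The ParseError class here is a local stand-in, and the assumed XML layout (bug elements directly under the root) and the flat tag-to-text mapping are simplifications for illustration.

```python
import xml.etree.ElementTree as ET


class ParseError(Exception):
    """Local stand-in for the ParseError exception named above."""


def parse_bugs_details(raw_xml):
    """Yield a dict per <bug> element; raise ParseError on bad or empty XML."""
    try:
        root = ET.fromstring(raw_xml)
    except ET.ParseError as exc:
        raise ParseError('invalid XML: %s' % exc) from exc

    bugs = root.findall('bug')
    if not bugs:
        raise ParseError('no bugs found')

    for bug in bugs:
        # Flat mapping of child tag to text; nested fields are ignored here.
        yield {child.tag: child.text for child in bug}


raw_xml = '<bugzilla version="4.2.1"><bug><bug_id>8</bug_id></bug></bugzilla>'
bugs = list(parse_bugs_details(raw_xml))
```

Since the function is a generator, nothing is parsed and no exception is raised until the caller starts iterating.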

version = '0.12.0'
class perceval.backends.core.bugzilla.BugzillaClient(base_url, user=None, password=None, max_bugs_csv=10000, archive=None, from_archive=False, ssl_verify=True)[source]

Bases: perceval.client.HttpClient

Bugzilla API client.

This class implements a simple client to retrieve different kinds of data from a Bugzilla repository. Currently, it only supports 3.x and 4.x servers.

When it is initialized, it checks if the given Bugzilla is available and retrieves its version.

Parameters
  • base_url – URL of the Bugzilla server

  • user – Bugzilla user

  • password – user password

  • max_bugs_csv – max bugs requested per CSV query

  • archive – an archive to store/read fetched data

  • from_archive – it tells whether to write/read the archive

  • ssl_verify – enable/disable SSL verification

Raises

BackendError – when an error occurs initializing the client

CGI_BUG = 'show_bug.cgi'
CGI_BUGLIST = 'buglist.cgi'
CGI_BUG_ACTIVITY = 'show_activity.cgi'
CGI_LOGIN = 'index.cgi'
CTYPE_CSV = 'csv'
CTYPE_XML = 'xml'
OLD_STYLE_VERSIONS = ['3.2.3', '3.2.2']
PBUGZILLA_LOGIN = 'Bugzilla_login'
PBUGZILLA_PASSWORD = 'Bugzilla_password'
PBUG_ID = 'id'
PCHFIELD_FROM = 'chfieldfrom'
PCTYPE = 'ctype'
PEXCLUDE_FIELD = 'excludefield'
PLIMIT = 'limit'
PLOGIN = 'GoAheadAndLogIn'
PLOGOUT = 'logout'
PORDER = 'order'
URL = '%(base)s/%(cgi)s'
VERSION_REGEX = re.compile('.+bugzilla version="([^"]+)"', re.DOTALL)
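To show how two of these constants can be used, the sketch below fills the URL template and applies VERSION_REGEX to a sample document. The server address and the metadata text are made up for the example; only the constant values come from the listing above.

```python
import re

URL = '%(base)s/%(cgi)s'
CGI_BUG = 'show_bug.cgi'
VERSION_REGEX = re.compile(r'.+bugzilla version="([^"]+)"', re.DOTALL)

# Hypothetical server address.
url = URL % {'base': 'https://bugzilla.example.org', 'cgi': CGI_BUG}

# Hypothetical metadata response; only the root element matters here.
metadata = (
    '<?xml version="1.0"?>\n'
    '<bugzilla version="4.2.1" urlbase="https://bugzilla.example.org/">\n'
    '</bugzilla>'
)

# re.DOTALL lets '.+' span the newline that precedes the root element.
match = VERSION_REGEX.match(metadata)
version = match.group(1) if match else None
```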
bug_activity(bug_id)[source]

Get the activity of a bug in HTML format.

Parameters

bug_id – bug identifier

buglist(from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Get a summary of bugs in CSV format.

Parameters

from_date – retrieve bugs that were updated since that date

bugs(*bug_ids)[source]

Get the information of a list of bugs in XML format.

Parameters

bug_ids – list of bug identifiers

call(cgi, params)[source]

Run an API command.

Parameters
  • cgi – cgi command to run on the server

  • params – dict with the HTTP parameters needed to run the given command

login(user, password)[source]

Authenticate a user in the server.

Parameters
  • user – Bugzilla user

  • password – user password

logout()[source]

Logout from the server.

metadata()[source]

Get metadata information in XML format.

static sanitize_for_archive(url, headers, payload)[source]

Sanitize the payload of an HTTP request by removing the login and password information before storing/retrieving archived items.

Parameters
  • url – HTTP url request

  • headers – HTTP headers request

  • payload – HTTP payload request

Returns

url, headers and the sanitized payload
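A hedged sketch of what such sanitizing could look like, using the PBUGZILLA_LOGIN and PBUGZILLA_PASSWORD parameter names listed for this client. Working on a copy of the payload is a choice made for this example; the sample URL and credentials are invented.

```python
PBUGZILLA_LOGIN = 'Bugzilla_login'
PBUGZILLA_PASSWORD = 'Bugzilla_password'


def sanitize_for_archive(url, headers, payload):
    """Return url, headers and a copy of the payload without credentials."""
    # Assumption: copy the payload so the caller's dict stays untouched.
    clean = dict(payload) if payload else payload
    if clean:
        clean.pop(PBUGZILLA_LOGIN, None)
        clean.pop(PBUGZILLA_PASSWORD, None)
    return url, headers, clean


payload = {'Bugzilla_login': 'jdoe', 'Bugzilla_password': 'secret', 'id': '8'}
url, headers, clean = sanitize_for_archive(
    'https://bugzilla.example.org/show_bug.cgi', None, payload
)
```

Removing the credentials means the archived request no longer depends on who fetched the data.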

class perceval.backends.core.bugzilla.BugzillaCommand(*args, debug=False)[source]

Bases: perceval.backend.BackendCommand

Class to run Bugzilla backend from the command line.

BACKEND

alias of perceval.backends.core.bugzilla.Bugzilla

classmethod setup_cmd_parser()[source]

Returns the Bugzilla argument parser.

perceval.backends.core.bugzillarest module

class perceval.backends.core.bugzillarest.BugzillaREST(url, user=None, password=None, api_token=None, max_bugs=500, tag=None, archive=None, ssl_verify=True)[source]

Bases: perceval.backend.Backend

Bugzilla backend that uses its REST API.

This class fetches the bugs stored in a Bugzilla server (version 5.0 or later). To initialize this class the URL of the server must be provided. The url will be set as the origin of the data.

Parameters
  • url – Bugzilla server URL

  • user – Bugzilla user

  • password – Bugzilla user password

  • api_token – Bugzilla token

  • max_bugs – maximum number of bugs requested on the same query

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

  • ssl_verify – enable/disable SSL verification

CATEGORIES = ['bug']

A list of categories that can be fetched by this backend.

Every backend is able to produce items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

EXTRA_SEARCH_FIELDS = {'component': ['component'], 'product': ['product']}

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

{
    'key-1': value-1,
    'key-2': value-2,
    'key-3': value-3
}

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

{
    'project_id': ['fields', 'project', 'id'],
    'project_key': ['fields', 'project', 'key'],
    'project_name': ['fields', 'project', 'name']
}

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.

fetch(category='bug', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the bugs from the repository.

The method retrieves, from a Bugzilla repository, the bugs updated since the given date.

Parameters
  • category – the category of items to fetch

  • from_date – obtain bugs updated since this date

Returns

a generator of bugs

fetch_items(category, **kwargs)[source]

Fetch the bugs

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]

Returns whether the backend supports archiving items during the fetch process.

Returns

this backend supports archiving items

classmethod has_resuming()[source]

Returns whether the backend supports resuming the fetch process.

Returns

this backend supports resuming the fetch process

static metadata_category(item)[source]

Extracts the category from a Bugzilla item.

This backend only generates one type of item which is ‘bug’.

static metadata_id(item)[source]

Extracts the identifier from a Bugzilla item.

static metadata_updated_on(item)[source]

Extracts the update time from a Bugzilla item.

The timestamp used is extracted from ‘last_change_time’ field. This date is converted to UNIX timestamp format taking into account the timezone of the date.

Parameters

item – item generated by the backend

Returns

a UNIX timestamp
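A timezone-aware conversion of this kind can be sketched with datetime.fromisoformat. The function name and the assumption that ‘last_change_time’ is an ISO 8601 string carrying either a trailing 'Z' or a numeric offset are illustrative only.

```python
from datetime import datetime, timezone


def last_change_time_to_timestamp(value):
    """Convert an ISO 8601 date with timezone to a UNIX timestamp."""
    # Older Python versions do not accept a trailing 'Z', so normalize it.
    normalized = value.replace('Z', '+00:00')
    # The offset in the string is honoured when computing the timestamp.
    return datetime.fromisoformat(normalized).timestamp()


utc_ts = last_change_time_to_timestamp('2016-06-08T14:25:49Z')
offset_ts = last_change_time_to_timestamp('2016-06-08T16:25:49+02:00')
```

Both sample strings name the same instant, so they convert to the same timestamp.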

version = '0.10.0'
class perceval.backends.core.bugzillarest.BugzillaRESTClient(base_url, user=None, password=None, api_token=None, archive=None, from_archive=False, ssl_verify=True)[source]

Bases: perceval.client.HttpClient

Bugzilla REST API client.

This class implements a simple client to retrieve different kinds of data from a Bugzilla repository (version 5.0 or later) using its REST API.

When the user and password parameters are given, it logs in to the server. Further requests will use the token obtained during the sign-in phase.

Parameters
  • base_url – URL of the Bugzilla server

  • user – Bugzilla user

  • password – user password

  • api_token – api token for user; when this is provided user and password parameters will be ignored

  • archive – an archive to store/read fetched data

  • from_archive – it tells whether to write/read the archive

  • ssl_verify – enable/disable SSL verification

Raises

BackendError – when an error occurs initializing the client

PBUGZILLA_LOGIN = 'login'
PBUGZILLA_PASSWORD = 'password'
PBUGZILLA_TOKEN = 'token'
PEXCLUDE_FIELDS = 'exclude_fields'
PIDS = 'ids'
PINCLUDE_FIELDS = 'include_fields'
PLAST_CHANGE_TIME = 'last_change_time'
PLIMIT = 'limit'
POFFSET = 'offset'
PORDER = 'order'
RATTACHMENT = 'attachment'
RBUG = 'bug'
RCOMMENT = 'comment'
RHISTORY = 'history'
RLOGIN = 'login'
URL = '%(base)s/rest/%(resource)s'
VCHANGE_DATE_ORDER = 'changeddate'
VEXCLUDE_ATTCH_DATA = 'data'
VINCLUDE_ALL = '_all'
attachments(*bug_ids)[source]

Get the attachments of the given bugs.

Parameters

bug_ids – list of bug identifiers

bugs(from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), offset=None, max_bugs=500)[source]

Get the information of a list of bugs.

Parameters
  • from_date – retrieve bugs that were updated since that date; dates are converted to UTC

  • offset – starting position for the search; e.g., to return the 11th element, set this value to 10

  • max_bugs – maximum number of bugs to return per query
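The relationship between offset and max_bugs can be illustrated with a small offset-based pagination loop. Everything here is a stand-in: paginate and fake_fetch are hypothetical names, and the in-memory list replaces the server.

```python
def paginate(fetch_page, max_bugs=500, offset=0):
    """Yield bugs page by page until a short or empty page is returned."""
    while True:
        page = fetch_page(offset=offset, limit=max_bugs)
        yield from page
        if len(page) < max_bugs:
            # A page smaller than the limit means there is nothing left.
            break
        offset += max_bugs


# Fake data source standing in for the server: 12 bugs in total.
ALL_BUGS = [{'id': n} for n in range(1, 13)]


def fake_fetch(offset, limit):
    return ALL_BUGS[offset:offset + limit]


all_ids = [bug['id'] for bug in paginate(fake_fetch, max_bugs=5)]

# Starting at offset 10 skips the first ten bugs, so the 11th comes first.
from_eleventh = [bug['id'] for bug in paginate(fake_fetch, max_bugs=5, offset=10)]
```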

call(resource, params)[source]

Retrieve the given resource.

Parameters
  • resource – resource to retrieve

  • params – dict with the HTTP parameters needed to retrieve the given resource

Raises

BugzillaRESTError – raised when an error is returned by the server

comments(*bug_ids)[source]

Get the comments of the given bugs.

Parameters

bug_ids – list of bug identifiers

history(*bug_ids)[source]

Get the history of the given bugs.

Parameters

bug_ids – list of bug identifiers

login(user, password)[source]

Authenticate a user in the server.

Parameters
  • user – Bugzilla user

  • password – user password

static sanitize_for_archive(url, headers, payload)[source]

Sanitize the payload of an HTTP request by removing the login, password and token information before storing/retrieving archived items.

Parameters
  • url – HTTP url request

  • headers – HTTP headers request

  • payload – HTTP payload request

Returns

url, headers and the sanitized payload

class perceval.backends.core.bugzillarest.BugzillaRESTCommand(*args, debug=False)[source]

Bases: perceval.backend.BackendCommand

Class to run BugzillaREST backend from the command line.

BACKEND

alias of perceval.backends.core.bugzillarest.BugzillaREST

classmethod setup_cmd_parser()[source]

Returns the BugzillaREST argument parser.

exception perceval.backends.core.bugzillarest.BugzillaRESTError(**kwargs)[source]

Bases: perceval.errors.BaseError

Raised when an error occurs using the API

message = '%(error)s (code: %(code)s)'
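The message attribute is a %-style template filled from the keyword arguments. The sketch below shows one way this could work; the BaseError class is a simplified stand-in written for this example, and the error text and code are invented.

```python
class BaseError(Exception):
    """Stand-in base class; subclasses define a 'message' template."""

    message = 'error'

    def __init__(self, **kwargs):
        super().__init__()
        # Fill the class-level template with the keyword arguments.
        self.msg = self.message % kwargs

    def __str__(self):
        return self.msg


class BugzillaRESTError(BaseError):
    message = '%(error)s (code: %(code)s)'


err = BugzillaRESTError(error='Invalid token', code=32000)
text = str(err)
```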

perceval.backends.core.confluence module

class perceval.backends.core.confluence.Confluence(url, tag=None, archive=None, ssl_verify=True)[source]

Bases: perceval.backend.Backend

Confluence backend.

This class fetches the historical contents (content versions) stored on a Confluence server. Initialize this class by passing the URL of this server. The url will be set as the origin of the data.

Parameters
  • url – URL of the server

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

  • ssl_verify – enable/disable SSL verification

CATEGORIES = ['historical content']

A list of categories that can be fetched by this backend.

Every backend is able to produce items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

fetch(category='historical content', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the contents by version from the server.

This method fetches the different historical versions (or snapshots) of the contents stored in the server that were updated since the given date. Only those snapshots created or updated after from_date will be returned.

Note that the seconds of the from_date parameter are ignored because the Confluence REST API only accepts the date, hours and minutes in timestamp values.

Parameters
  • category – the category of items to fetch

  • from_date – obtain historical versions of contents updated since this date

Returns

a generator of historical versions

fetch_items(category, **kwargs)[source]

Fetch the contents

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]

Returns whether the backend supports archiving items during the fetch process.

Returns

this backend supports archiving items

classmethod has_resuming()[source]

Returns whether the backend supports resuming the fetch process.

Returns

this backend supports resuming the fetch process

static metadata_category(item)[source]

Extracts the category from a Confluence item.

This backend only generates one type of item which is ‘historical content’.

static metadata_id(item)[source]

Extracts the identifier from a Confluence item.

This identifier is a mix of two fields because a historical content does not have any unique identifier. In this case, the ‘id’ and ‘version’ values are combined because it should not be possible to have two equal version numbers for the same content. The returned value follows the pattern: <content>#v<version> (e.g. 28979#v10).
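The <content>#v<version> pattern can be reproduced in a couple of lines. The item layout used here, with the version number under item['version']['number'], is an assumption for the example.

```python
def metadata_id(item):
    """Combine the content id and its version number into one identifier."""
    # Assumption: the version number lives under item['version']['number'].
    return '%s#v%s' % (item['id'], item['version']['number'])


# Hypothetical item reduced to the two fields that matter here.
item = {'id': '28979', 'version': {'number': 10}}
identifier = metadata_id(item)
```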

static metadata_updated_on(item)[source]

Extracts and converts the update time from a Confluence item.

The timestamp is extracted from the ‘when’ field in the ‘version’ section. This date is converted to UNIX timestamp format.

Parameters

item – item generated by the backend

Returns

a UNIX timestamp

static parse_contents_summary(raw_json)[source]

Parse a Confluence summary JSON list.

The method parses a JSON stream and returns an iterator of dictionaries. Each dictionary is a content summary.

Parameters

raw_json – JSON string to parse

Returns

a generator of parsed content summaries.

static parse_historical_content(raw_json)[source]

Parse a Confluence historical content JSON stream.

This method parses a JSON stream and returns a dictionary that contains the data of a historical content.

Parameters

raw_json – JSON string to parse

Returns

a dict with historical content

search_fields(item)[source]

Add search fields to an item.

It adds the values of metadata_id plus the page ancestor IDs, the content ID and the content version number.

Parameters

item – the item to extract the search fields values

Returns

a dict of search fields

version = '0.12.0'
class perceval.backends.core.confluence.ConfluenceClient(base_url, archive=None, from_archive=False, ssl_verify=True)[source]

Bases: perceval.client.HttpClient

Confluence REST API client.

This class implements a client to retrieve contents from a Confluence server using its REST API.

Parameters
  • base_url – URL of the Confluence server

  • archive – an archive to store/read fetched data

  • from_archive – it tells whether to write/read the archive

  • ssl_verify – enable/disable SSL verification

MSEARCH = 'search'
PANCESTORS = 'ancestors'
PCQL = 'cql'
PEXPAND = 'expand'
PLIMIT = 'limit'
PSTART = 'start'
PSTATUS = 'status'
PVERSION = 'version'
RCONTENTS = 'content'
RHISTORY = 'history'
RSPACE = 'space'
URL = '%(base)s/rest/api/%(resource)s'
VCQL = "lastModified>='%(date)s' order by lastModified"
VEXPAND = ['body.storage', 'history', 'version']
VHISTORICAL = 'historical'
contents(from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), offset=None, max_contents=200)[source]

Get the contents of a repository.

This method returns an iterator that manages the pagination over contents. Note that the seconds of the from_date parameter are ignored because the API only works with hours and minutes.

Parameters
  • from_date – fetch the contents updated since this date

  • offset – fetch the contents starting from this offset

  • max_contents – maximum number of contents to fetch per request
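To make the note about ignored seconds concrete, the sketch below fills the VCQL template listed for this client with a date truncated to minutes. The build_cql helper and the 'YYYY-MM-DD HH:MM' date format are assumptions for illustration.

```python
from datetime import datetime, timezone

VCQL = "lastModified>='%(date)s' order by lastModified"


def build_cql(from_date):
    """Build the CQL query for contents modified since from_date."""
    # Assumption: dates are sent as 'YYYY-MM-DD HH:MM', so seconds are dropped.
    date = from_date.strftime('%Y-%m-%d %H:%M')
    return VCQL % {'date': date}


cql = build_cql(datetime(2016, 6, 16, 9, 30, 59, tzinfo=timezone.utc))
```

Because of the truncation, contents modified earlier in the same minute as from_date may also be returned.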

historical_content(content_id, version)[source]

Get the snapshot of a content for the given version.

Parameters
  • content_id – fetch the snapshot of this content

  • version – snapshot version of the content

class perceval.backends.core.confluence.ConfluenceCommand(*args, debug=False)[source]

Bases: perceval.backend.BackendCommand

Class to run Confluence backend from the command line.

BACKEND

alias of perceval.backends.core.confluence.Confluence

classmethod setup_cmd_parser()[source]

Returns the Confluence argument parser.

perceval.backends.core.discourse module

class perceval.backends.core.discourse.Discourse(url, api_username=None, api_token=None, tag=None, archive=None, max_retries=10, sleep_time=5, ssl_verify=True)[source]

Bases: perceval.backend.Backend

Discourse backend for Perceval.

This class retrieves the topics posted in a Discourse board. To initialize this class the URL must be provided. The url will be set as the origin of the data.

Parameters
  • url – Discourse URL

  • api_username – Discourse API username

  • api_token – Discourse API access token

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

  • max_retries – number of max retries to a data source before raising a RetryError exception

  • sleep_time – time (in seconds) to sleep in case of connection problems

  • ssl_verify – enable/disable SSL verification

CATEGORIES = ['topic']

A list of categories that can be fetched by this backend.

Every backend is able to produce items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

EXTRA_SEARCH_FIELDS = {'category_id': ['category_id']}

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

{
    'key-1': value-1,
    'key-2': value-2,
    'key-3': value-3
}

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

{
    'project_id': ['fields', 'project', 'id'],
    'project_key': ['fields', 'project', 'key'],
    'project_name': ['fields', 'project', 'name']
}

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.

fetch(category='topic', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the topics from the Discourse board.

The method retrieves, from a Discourse board, the topics updated since the given date.

Parameters
  • category – the category of items to fetch

  • from_date – obtain topics updated since this date

Returns

a generator of topics

fetch_items(category, **kwargs)[source]

Fetch the topics

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]

Returns whether the backend supports archiving items during the fetch process.

Returns

this backend supports archiving items

classmethod has_resuming()[source]

Returns whether the backend supports resuming the fetch process.

Returns

this backend supports resuming the fetch process

static metadata_category(item)[source]

Extracts the category from a Discourse item.

This backend only generates one type of item which is ‘topic’.

static metadata_id(item)[source]

Extracts the identifier from a Discourse item.

static metadata_updated_on(item)[source]

Extracts the update time from a Discourse item.

The timestamp used is extracted from ‘last_posted_at’ field. This date is converted to UNIX timestamp format taking into account the timezone of the date.

Parameters

item – item generated by the backend

Returns

a UNIX timestamp
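A timezone-aware conversion of this kind might look like the sketch below. It uses only the standard library and assumes the field holds an ISO 8601 string; the function name to_unix_timestamp and the sample item are hypothetical, and the backend itself may rely on a different date parser.

```python
from datetime import datetime, timezone


def to_unix_timestamp(date_str):
    """Convert an ISO 8601 date string into a UNIX timestamp (float)."""
    # A trailing 'Z' means UTC; it is rewritten as an explicit offset
    # because older versions of fromisoformat() do not accept 'Z'.
    if date_str.endswith('Z'):
        date_str = date_str[:-1] + '+00:00'
    dt = datetime.fromisoformat(date_str)
    if dt.tzinfo is None:
        # Assumption: dates without an offset are treated as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.timestamp()


item = {'last_posted_at': '1970-01-02T00:00:00.000Z'}  # hypothetical item
updated_on = to_unix_timestamp(item['last_posted_at'])
```

Because the offset is honoured before conversion, two strings that denote the same instant in different timezones produce the same timestamp.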

version = '0.13.1'
class perceval.backends.core.discourse.DiscourseClient(base_url, api_username=None, api_key=None, sleep_time=5, max_retries=10, archive=None, from_archive=False, ssl_verify=True)[source]

Bases: perceval.client.HttpClient

Discourse API client.

This class implements a simple client to retrieve topics from any Discourse board.

Parameters
  • base_url – URL of the Discourse site

  • api_username – Discourse API username

  • api_key – Discourse API access token

  • sleep_time – time (in seconds) to sleep in case of connection problems

  • max_retries – number of max retries to a data source before raising a RetryError exception

  • archive – collect issues already retrieved from an archive

  • from_archive – it tells whether to write/read the archive

  • ssl_verify – enable/disable SSL verification

Raises

HTTPError – when an error occurs doing the request

ALL_TOPICS = None
EXTRA_STATUS_FORCELIST = [429]
HKEY = 'Api-Key'
HUSER = 'Api-Username'
POSTS = 'posts'
PPAGE = 'page'
TJSON = '.json'
TOPIC = 't'
TOPICS_SUMMARY = 'latest'
post(post_id)[source]

Retrieve the post with the post_id identifier.

Parameters

post_id – identifier of the post to retrieve

static sanitize_for_archive(url, headers, payload)[source]

Sanitize the payload of an HTTP request by removing the user and key information before storing/retrieving archived items.

Parameters
  • url – HTTP url request

  • headers – HTTP headers request

  • payload – HTTP payload request

Returns

url, headers and the sanitized payload
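The idea behind the sanitizing step can be sketched as below. This is an illustration under stated assumptions, not the client's code: it assumes the credentials travel under the names given by HKEY and HUSER, and it strips them from both the headers and the payload because the exact location is not specified here.

```python
HKEY = 'Api-Key'
HUSER = 'Api-Username'
SENSITIVE_KEYS = (HKEY, HUSER)


def _strip_credentials(mapping):
    """Return a copy of the mapping without credential entries."""
    if not mapping:
        return mapping
    return {k: v for k, v in mapping.items() if k not in SENSITIVE_KEYS}


def sanitize_for_archive(url, headers, payload):
    """Return the request triple with user and key information removed."""
    # Copies are returned so the live request keeps its credentials.
    return url, _strip_credentials(headers), _strip_credentials(payload)


# Hypothetical request data.
url, headers, payload = sanitize_for_archive(
    'https://example.com/latest.json',
    {'Api-Key': 'secret', 'Api-Username': 'jsmith', 'Accept': 'application/json'},
    {'page': 1},
)
```

Stripping credentials keeps secrets out of the archive, and applying the same function when reading gives stored and requested entries an identical form.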

topic(topic_id)[source]

Retrieve the topic with the topic_id identifier.

Parameters

topic_id – identifier of the topic to retrieve

topics_page(page=None)[source]

Retrieve the #page summaries of the latest topics.

Parameters

page – number of page to retrieve

class perceval.backends.core.discourse.DiscourseCommand(*args, debug=False)[source]

Bases: perceval.backend.BackendCommand

Class to run Discourse backend from the command line.

BACKEND

alias of perceval.backends.core.discourse.Discourse

classmethod setup_cmd_parser()[source]

Returns the Discourse argument parser.

perceval.backends.core.dockerhub module

class perceval.backends.core.dockerhub.DockerHub(owner, repository, tag=None, archive=None, ssl_verify=True)[source]

Bases: perceval.backend.Backend

DockerHub backend for Perceval.

This class retrieves data from a repository stored in the Docker Hub site. To initialize this class, the owner and the repository from which data will be fetched must be provided. The origin of the data will be built with both parameters.

The shortcut owner _ for official Docker repositories will be replaced by its long name: library.
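The shortcut expansion amounts to a one-line substitution, sketched here with a hypothetical helper named normalize_owner; how the backend applies it internally is not shown in this reference.

```python
def normalize_owner(owner):
    """Expand the '_' shortcut used for official Docker repositories."""
    return 'library' if owner == '_' else owner


official = normalize_owner('_')
regular = normalize_owner('grimoirelab')
```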

Parameters
  • owner – DockerHub owner

  • repository – DockerHub repository owned by owner

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

  • ssl_verify – enable/disable SSL verification

CATEGORIES = ['dockerhub-data']

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

EXTRA_SEARCH_FIELDS = {'name': ['name'], 'namespace': ['namespace']}

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

{
    'key-1': value-1,
    'key-2': value-2,
    'key-3': value-3
}

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

{
    'project_id': ['fields', 'project', 'id'],
    'project_key': ['fields', 'project', 'key'],
    'project_name': ['fields', 'project', 'name']
}

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.

fetch(category='dockerhub-data')[source]

Fetch data from a Docker Hub repository.

The method retrieves, from a repository stored in Docker Hub, its data which includes number of pulls, stars, description, among other data.

Parameters

category – the category of items to fetch

Returns

a generator of data

fetch_items(category, **kwargs)[source]

Fetch the Docker Hub items

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns

this backend supports items archive

classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns

this backend supports items resuming

static metadata_category(item)[source]

Extracts the category from a Docker Hub item.

This backend only generates one type of item which is ‘dockerhub-data’.

static metadata_id(item)[source]

Extracts the identifier from a Docker Hub item.

static metadata_updated_on(item)[source]

Extracts and converts the update time from a Docker Hub item.

The timestamp is extracted from ‘fetched_on’ field. This field is not part of the data provided by Docker Hub. It is added by this backend.

Parameters

item – item generated by the backend

Returns

a UNIX timestamp

static parse_json(raw_json)[source]

Parse a Docker Hub JSON stream.

The method parses a JSON stream and returns a dict with the parsed data.

Parameters

raw_json – JSON string to parse

Returns

a dict with the parsed data

version = '0.6.0'
class perceval.backends.core.dockerhub.DockerHubClient(archive=None, from_archive=False, ssl_verify=True)[source]

Bases: perceval.client.HttpClient

DockerHub API client.

Client for fetching information from the DockerHub server using its REST API v2.

Parameters
  • archive – an archive to store/read fetched data

  • from_archive – it tells whether to write/read the archive

  • ssl_verify – enable/disable SSL verification

RREPOSITORY = 'repositories'
repository(owner, repository)[source]

Fetch information about a repository.

class perceval.backends.core.dockerhub.DockerHubCommand(*args, debug=False)[source]

Bases: perceval.backend.BackendCommand

Class to run DockerHub backend from the command line.

BACKEND

alias of perceval.backends.core.dockerhub.DockerHub

classmethod setup_cmd_parser()[source]

Returns the DockerHub argument parser.

perceval.backends.core.gerrit module

class perceval.backends.core.gerrit.Gerrit(hostname, user=None, port='29418', max_reviews=500, disable_host_key_check=False, id_filepath=None, tag=None, archive=None, blacklist_ids=None)[source]

Bases: perceval.backend.Backend

Gerrit backend.

Class to fetch the reviews from a Gerrit server. To initialize this class the Hostname of the server must be provided. The hostname will be set as the origin of the data.

Parameters
  • hostname – Gerrit server Hostname

  • user – SSH user used to connect to the Gerrit server

  • port – SSH port

  • max_reviews – maximum number of reviews requested on the same query

  • disable_host_key_check – disable host key controls

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

  • blacklist_ids – exclude the reviews while fetching

  • id_filepath – path to SSH private key

CATEGORIES = ['review']

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

EXTRA_SEARCH_FIELDS = {'project_name': ['project'], 'review_hash': ['id']}

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

{
    'key-1': value-1,
    'key-2': value-2,
    'key-3': value-3
}

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

{
    'project_id': ['fields', 'project', 'id'],
    'project_key': ['fields', 'project', 'key'],
    'project_name': ['fields', 'project', 'name']
}

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.

ORIGIN_UNIQUE_FIELD = OriginUniqueField(name='number', type=<class 'str'>)

A field unique to a given origin for items produced by this backend.

If ORIGIN_UNIQUE_FIELD is defined, users can pass a list of blocked values which should not be included in the results, if the field defined here contains them. For example, if ORIGIN_UNIQUE_FIELD were set to post_id, then users could pass a list of post ids that should be excluded from the results.

If set to None, blacklisting will be disabled completely. Otherwise, this should be set to an OriginUniqueField containing the name and data type of the field.

Note: Origin in this context refers to one site, API, or other remote that contains several repositories, each consisting of many items of several categories. For example, for the GitLab backend, an origin would be one GitLab instance, such as gitlab.com or opensource.ieee.org, each of which contains many repositories, which in turn contain items such as issues and merge requests.

To access this field, please prefer origin_unique_field().
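The blacklisting behaviour described above can be sketched as a simple filter. The namedtuple below is a stand-in that mirrors the two fields shown in the ORIGIN_UNIQUE_FIELD value; the function filter_blacklisted and the sample reviews are hypothetical, and casting values with the declared type before comparing is an assumption.

```python
from collections import namedtuple

# Stand-in for the OriginUniqueField shown above (same two fields).
OriginUniqueField = namedtuple('OriginUniqueField', ['name', 'type'])


def filter_blacklisted(items, unique_field, blacklist_ids):
    """Yield only the items whose unique field value is not blacklisted."""
    if unique_field is None or not blacklist_ids:
        # Blacklisting disabled or nothing to block: pass everything through.
        yield from items
        return
    blocked = {unique_field.type(value) for value in blacklist_ids}
    for item in items:
        value = item.get(unique_field.name)
        if value is not None and unique_field.type(value) in blocked:
            continue
        yield item


field = OriginUniqueField(name='number', type=str)
reviews = [{'number': '101'}, {'number': '102'}, {'number': '103'}]
kept = list(filter_blacklisted(reviews, field, ['102']))
```

Casting both sides with the declared type means a blacklist given as integers would still match string-valued fields.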

fetch(category='review', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the reviews from the repository.

The method retrieves, from a Gerrit repository, the reviews updated since the given date.

Parameters
  • category – the category of items to fetch

  • from_date – obtain reviews updated since this date

Returns

a generator of reviews

fetch_items(category, **kwargs)[source]

Fetch the reviews

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns

this backend supports items archive

classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns

this backend does not support items resuming

static metadata_category(item)[source]

Extracts the category from a Gerrit item.

This backend only generates one type of item which is ‘review’.

static metadata_id(item)[source]

Extracts the identifier from a Gerrit item.

static metadata_updated_on(item)[source]

Extracts and converts the update time from a Gerrit item.

The timestamp is extracted from ‘lastUpdated’ field. This date is a UNIX timestamp but needs to be converted to a float value.

Parameters

item – item generated by the backend

Returns

a UNIX timestamp

static parse_reviews(raw_data)[source]

Parse a Gerrit reviews list.

version = '0.13.1'
class perceval.backends.core.gerrit.GerritClient(repository, user=None, max_reviews=500, blacklist_reviews=None, disable_host_key_check=False, port='29418', id_filepath=None, archive=None, from_archive=False)[source]

Bases: object

Gerrit API client.

This class implements a client to retrieve reviews from a Gerrit repository using the ssh API. Currently it supports <2.8 and >=2.9 versions in incremental mode.

Check the next link for more info: https://gerrit-documentation.storage.googleapis.com/Documentation/2.12/cmd-query.html

Parameters
  • repository – Hostname of the Gerrit server

  • user – SSH user to be used to connect to gerrit server

  • max_reviews – max number of reviews per query

  • blacklist_reviews – exclude the reviews of this list while fetching

  • disable_host_key_check – disable host key controls

  • port – SSH port

  • id_filepath – SSH private key path

  • archive – collect issues already retrieved from an archive

  • from_archive – it tells whether to write/read the archive

CMD_GERRIT = 'gerrit'
CMD_VERSION = 'version'
MAX_RETRIES = 3
RETRY_WAIT = 60
VERSION_REGEX = re.compile('gerrit version (\\d+)\\.(\\d+).*')
next_retrieve_group_item(last_item=None, entry=None)[source]

Return the item to start from in next reviews group.

reviews(last_item, filter_=None)[source]

Get the reviews starting from last_item.

static sanitize_for_archive(cmd)[source]

Sanitize the Gerrit command by removing username information before storing/retrieving archived items.

Parameters

cmd – Gerrit command

Returns

the sanitized cmd

property version

Return the Gerrit server version.
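As an illustration of how VERSION_REGEX can be applied, the sketch below extracts the major and minor numbers from a version banner. The function parse_gerrit_version and the sample banner are hypothetical; how the client obtains the raw output over SSH is not covered here.

```python
import re

# Same pattern as the VERSION_REGEX class attribute listed above.
VERSION_REGEX = re.compile(r'gerrit version (\d+)\.(\d+).*')


def parse_gerrit_version(raw_output):
    """Return (major, minor) parsed from the output of the version command."""
    match = VERSION_REGEX.match(raw_output)
    if match is None:
        raise ValueError('unexpected version output: %r' % raw_output)
    return int(match.group(1)), int(match.group(2))


version = parse_gerrit_version('gerrit version 2.12.2')
supports_newer_queries = version >= (2, 9)
```

Returning a tuple of integers makes comparisons such as "at least 2.9" straightforward, since tuples compare element by element.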

class perceval.backends.core.gerrit.GerritCommand(*args, debug=False)[source]

Bases: perceval.backend.BackendCommand

Class to run Gerrit backend from the command line.

BACKEND

alias of perceval.backends.core.gerrit.Gerrit

classmethod setup_cmd_parser()[source]

Returns the Gerrit argument parser.

perceval.backends.core.git module

exception perceval.backends.core.git.EmptyRepositoryError(**kwargs)[source]

Bases: perceval.errors.RepositoryError

Exception raised when a repository is empty

message = '%(repository)s is empty'
class perceval.backends.core.git.Git(uri, gitpath, tag=None, archive=None)[source]

Bases: perceval.backend.Backend

Git backend.

This class allows fetching the commits from a Git repository (local or remote) or from a log file. To initialize this class, you have to provide the URI of the repository and a value for gitpath. This URI will be set as the origin of the data.

When gitpath is a directory or does not exist, it will be considered as the place where the repository is/will be cloned; when gitpath is a file it will be considered as a Git log file.

Parameters
  • uri – URI of the Git repository

  • gitpath – path to the repository or to the log file

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

Raises

RepositoryError – raised when there was an error cloning or updating the repository.

CATEGORIES = ['commit']

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

fetch(category='commit', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), to_date=datetime.datetime(2100, 1, 1, 0, 0, tzinfo=tzutc()), branches=None, latest_items=False, no_update=False)[source]

Fetch commits.

The method retrieves from a Git repository or a log file a list of commits. Commits are returned in the same order they were obtained.

When from_date parameter is given it returns items committed since the given date.

The list of branches is a list of strings, with the names of the branches to fetch. If the list of branches is empty, no commit is fetched. If the list of branches is None, all commits for all branches will be fetched.

When the parameter latest_items is set, only those commits which are new since the last time this method was called are returned.

When the parameter no_update is set, all commits are returned without updating the repository beforehand.

Take into account that from_date and branches are ignored when the commits are fetched from a Git log file or when latest_items flag is set.

The class raises a RepositoryError exception when an error occurs accessing the repository.

Parameters
  • category – the category of items to fetch

  • from_date – obtain commits newer than a specific date (inclusive)

  • to_date – obtain commits older than a specific date

  • branches – names of branches to fetch from (default: None)

  • latest_items – sync with the repository to fetch only the newest commits

  • no_update – if enabled, don’t update the repo with the latest changes

Returns

a generator of commits

fetch_items(category, **kwargs)[source]

Fetch the commits

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns

this backend does not support items archive

classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns

this backend supports items resuming

static metadata_category(item)[source]

Extracts the category from a Git item.

This backend only generates one type of item which is ‘commit’.

static metadata_id(item)[source]

Extracts the identifier from a Git item.

static metadata_updated_on(item)[source]

Extracts the update time from a Git item.

The timestamp used is extracted from ‘CommitDate’ field. This date is converted to UNIX timestamp format taking into account the timezone of the date.

Parameters

item – item generated by the backend

Returns

a UNIX timestamp
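For the date layout used in Git logs (for example "Tue Aug 14 14:30:13 2012 -0300"), a conversion could be sketched with the standard library as below. The format string and the helper name commit_date_to_timestamp are assumptions made for illustration; the backend may use a more tolerant date parser.

```python
from datetime import datetime

# Assumed layout of the 'CommitDate' header: weekday, month, day,
# time, year and UTC offset.
GIT_DATE_FORMAT = '%a %b %d %H:%M:%S %Y %z'


def commit_date_to_timestamp(commit_date):
    """Convert a Git log date string into a UNIX timestamp (float)."""
    # %z yields a timezone-aware datetime, so timestamp() honours the offset.
    return datetime.strptime(commit_date, GIT_DATE_FORMAT).timestamp()


item = {'CommitDate': 'Tue Aug 14 14:30:13 2012 -0300'}  # hypothetical item
updated_on = commit_date_to_timestamp(item['CommitDate'])
```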

static parse_git_log_from_file(filepath)[source]

Parse a Git log file.

The method parses the Git log file and returns an iterator of dictionaries. Each one of these contains a commit.

Parameters

filepath – path to the log file

Returns

a generator of parsed commits

Raises
  • ParseError – raised when the format of the Git log file is invalid

  • OSError – raised when an error occurs reading the given file

static parse_git_log_from_iter(iterator)[source]

Parse a Git log obtained from an iterator.

The method parses the Git log fetched from an iterator, where each item is a line of the log. It returns an iterator of dictionaries. Each dictionary contains a commit.

Parameters

iterator – iterator of Git log lines

Raises

ParseError – raised when the format of the Git log is invalid

version = '0.12.1'
class perceval.backends.core.git.GitCommand(*args, debug=False)[source]

Bases: perceval.backend.BackendCommand

Class to run Git backend from the command line.

BACKEND

alias of perceval.backends.core.git.Git

classmethod setup_cmd_parser()[source]

Returns the Git argument parser.

class perceval.backends.core.git.GitParser(stream)[source]

Bases: object

Git log parser.

This class parses a plain Git log stream, converting plain commits into dict items.

Not every Git log output is valid to be parsed. The Git log stream must have a specific structure. It must contain raw commits data and stats about modified files. The next excerpt shows an example of a valid log:

commit aaa7a9209f096aaaadccaaa7089aaaa3f758a703
Author: John Smith <jsmith@example.com>
AuthorDate: Tue Aug 14 14:30:13 2012 -0300
Commit: John Smith <jsmith@example.com>
CommitDate: Tue Aug 14 14:30:13 2012 -0300

    Commit for testing

:000000 100644 0000000... aaaaaaa... A aaa/otherthing
:000000 100644 0000000... aaaaaaa... A aaa/something
:000000 100644 0000000... aaaaaaa... A bbb/bthing
0 0 aaa/otherthing
0 0 aaa/something
0 0 bbb/bthing

Each commit starts with the ‘commit’ tag that is followed by the SHA-1 of the commit, its parents (two or more parents in the case of a merge) and a list of refs, if any.

commit 456a68ee1407a77f3e804a30dff245bb6c6b872f ce8e0b86a1e9877f42fe9453ede418519115f367 51a3b654f252210572297f47597b31527c475fb8 (HEAD -> refs/heads/master)

The commit line is followed by one or more headers. Each header has a key and a value:

Author: John Smith <jsmith@example.com>
AuthorDate: Tue Aug 14 14:30:13 2012 -0300
Commit: John Smith <jsmith@example.com>
CommitDate: Tue Aug 14 14:30:13 2012 -0300

Then, an empty line divides the headers from the commit message.

    First line of the commit

    Commit message split into one or several lines.
    Each line of the message starts with 4 spaces.

Commit messages can contain a list of ‘trailers’. These trailers have the same format of headers but their meaning is project dependent. This is an example of a commit message with trailers:

    Commit message with trailers

    This is the body of the message where trailers are included.
    Trailers are part of the body so each line of the message
    starts with 4 spaces.

    Signed-off-by: John Doe <jdoe@example.com>
    Signed-off-by: Jane Rae <jrae@example.com>

After a new empty line, actions and stats over files can be found. An action line starts with one or more ‘:’ chars and contains data about the old and new permissions of a file, its old and new indexes, the action code and the filepath to the file. In the case of a copied, renamed or moved file, the new filepath to that file is included.

:100644 100644 e69de29... e69de29... R100 aaa/otherthing aaa/otherthing.renamed

Stats lines include the number of lines added and removed, and the name of the file. The new name is also included for moved or renamed files.

10 0 aaa/{otherthing => otherthing.renamed}

The commit ends with an empty line.

Take into account that one empty line is valid at the beginning of the log. This allows parsing empty logs without raising exceptions.

This example was generated using the following command:

git log --raw --numstat --pretty=fuller --decorate=full --parents -M -C -c --remotes=origin --all

Parameters

stream – a file object which stores the log
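To illustrate how the line formats described above can be recognised, the sketch below applies the COMMIT_PATTERN and STATS_PATTERN strings documented for this class to two sample lines. It is a standalone illustration rather than the parser itself: the resulting dict layouts are hypothetical, and the tab separators in the stats line are an assumption about the raw log.

```python
import re

# Patterns copied from the COMMIT_PATTERN and STATS_PATTERN attributes.
COMMIT_PATTERN = r"""^commit[ \t](?P<commit>[a-f0-9]{40})
                     (?:[ \t](?P<parents>[a-f0-9][a-f0-9 \t]+))?
                     (?:[ \t]\((?P<refs>.+)\))?$
                     """
STATS_PATTERN = r"^(?P<added>\d+|-)[ \t]+(?P<removed>\d+|-)[ \t]+(?P<file>.+)$"

COMMIT_RE = re.compile(COMMIT_PATTERN, re.VERBOSE)
STATS_RE = re.compile(STATS_PATTERN, re.VERBOSE)

# A merge commit line: SHA-1, two parents and a list of refs.
commit_line = ("commit 456a68ee1407a77f3e804a30dff245bb6c6b872f "
               "ce8e0b86a1e9877f42fe9453ede418519115f367 "
               "51a3b654f252210572297f47597b31527c475fb8 "
               "(HEAD -> refs/heads/master)")
m = COMMIT_RE.match(commit_line)
commit = {
    'commit': m.group('commit'),
    'parents': m.group('parents').split(),
    'refs': [ref.strip() for ref in m.group('refs').split(',')],
}

# A stats line for a renamed file: added lines, removed lines, file name.
stats_line = "10\t0\taaa/{otherthing => otherthing.renamed}"
s = STATS_RE.match(stats_line)
stats = {'added': s.group('added'),
         'removed': s.group('removed'),
         'file': s.group('file')}
```

With re.VERBOSE, the whitespace used to lay out the pattern is ignored, while the spaces and tabs inside character classes such as [ \t] remain significant.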

ACTION_PATTERN = '^(?P<sc>\\:+)\n                      (?P<modes>(?:\\d{6}[ \\t])+)\n                      (?P<indexes>(?:[a-f0-9]+\\.{,3}[ \\t])+)\n                      (?P<action>[^\\t]+)\\t+\n                      (?P<file>[^\\t]+)\n                      (?:\\t+(?P<newfile>.+))?$'
COMMIT = 1
COMMIT_PATTERN = '^commit[ \\t](?P<commit>[a-f0-9]{40})\n                     (?:[ \\t](?P<parents>[a-f0-9][a-f0-9 \\t]+))?\n                     (?:[ \\t]\\((?P<refs>.+)\\))?$\n                     '
EMPTY_LINE_PATTERN = '^$'
FILE = 4
GIT_ACTION_REGEXP = re.compile('^(?P<sc>\\:+)\n                      (?P<modes>(?:\\d{6}[ \\t])+)\n                      (?P<indexes>(?:[a-f0-9]+\\.{,3}[ \\t])+)\n                      (?P<action>[^\\t]+)\\t+\n                      (?P<file>[^\\t]+)\n                      (?:\\t+(?P<newfile>.+))?$', re.VERBOSE)
GIT_COMMIT_REGEXP = re.compile('^commit[ \\t](?P<commit>[a-f0-9]{40})\n                     (?:[ \\t](?P<parents>[a-f0-9][a-f0-9 \\t]+))?\n                     (?:[ \\t]\\((?P<refs>.+)\\))?$\n                     ', re.VERBOSE)
GIT_HEADER_TRAILER_REGEXP = re.compile('^(?P<name>[a-zA-z0-9\\-]+)\\:[ \\t]+(?P<value>.+)$', re.VERBOSE)
GIT_MESSAGE_REGEXP = re.compile('^[\\s]{4}(?P<msg>.*)$', re.VERBOSE)
GIT_NEXT_STATE_REGEXP = re.compile('^$', re.VERBOSE)
GIT_STATS_REGEXP = re.compile('^(?P<added>\\d+|-)[ \\t]+(?P<removed>\\d+|-)[ \\t]+(?P<file>.+)$', re.VERBOSE)
HEADER = 2
HEADER_TRAILER_PATTERN = '^(?P<name>[a-zA-z0-9\\-]+)\\:[ \\t]+(?P<value>.+)$'
INIT = 0
MESSAGE = 3
MESSAGE_LINE_PATTERN = '^[\\s]{4}(?P<msg>.*)$'
STATS_PATTERN = '^(?P<added>\\d+|-)[ \\t]+(?P<removed>\\d+|-)[ \\t]+(?P<file>.+)$'
TRAILERS = ['Signed-off-by']
parse()[source]

Parse the Git log stream.

class perceval.backends.core.git.GitRef(hash, refname)

Bases: tuple

property hash

Alias for field number 0

property refname

Alias for field number 1

class perceval.backends.core.git.GitRepository(uri, dirpath)[source]

Bases: object

Manage a Git repository.

This class provides access to a Git repository running some common commands such as clone, pull or log. To create an instance from a remote repository, use clone() class method.

Parameters
  • uri – URI of the repository

  • dirpath – local directory where the repository is stored

GIT_PRETTY_OUTPUT_OPTS = ['--raw', '--numstat', '--pretty=fuller', '--decorate=full', '--parents', '-M', '-C', '-c']
classmethod clone(uri, dirpath)[source]

Clone a Git repository.

Make a bare copy of the repository stored in uri into dirpath. The repository can be either local or remote.

Parameters
  • uri – URI of the repository

  • dirpath – directory where the repository will be cloned

Returns

a GitRepository class having cloned the repository

Raises

RepositoryError – when an error occurs cloning the given repository

count_objects()[source]

Count the objects of a repository.

The method returns the total number of objects (packed and unpacked) available on the repository.

Raises

RepositoryError – when an error occurs counting the objects of a repository

is_detached()[source]

Check if the repo is in a detached state.

The repository is in a detached state when HEAD is not a symbolic reference.

Returns

whether the repository is detached or not

Raises

RepositoryError – when an error occurs checking the state of the repository

is_empty()[source]

Determines whether the repository is empty or not.

Returns True when the repository is empty. Under the hood, it checks the number of objects on the repository. When this number is 0, the repository is empty.

Raises

RepositoryError – when an error occurs accessing the repository

log(from_date=None, to_date=None, branches=None, encoding='utf-8')[source]

Read the commit log from the repository.

The method returns the Git log of the repository using the following options:

git log --raw --numstat --pretty=fuller --decorate=full
        --all --reverse --topo-order --parents -M -C -c --remotes=origin

When from_date is given, it gets the commits equal to or newer than that date. This date is given as a datetime object.

The list of branches is a list of strings, with the names of the branches to fetch. If the list of branches is empty, no commit is fetched. If the list of branches is None, all commits for all branches will be fetched.

Parameters
  • from_date – fetch commits newer than a specific date (inclusive)

  • branches – names of branches to fetch from (default: None)

  • encoding – encode the log using this format

Returns

a generator where each item is a line from the log

Raises
rev_list(branches=None)[source]

Read the list of commits from the repository.

The list of branches is a list of strings, with the names of the branches to fetch. If the list of branches is empty, no commit is fetched. If the list of branches is None, all commits for all branches will be fetched.

The method returns the Git rev-list of the repository using the following options:

git rev-list --topo-order

Parameters

branches – names of branches to fetch from (default: None)

Raises
show(commits=None, encoding='utf-8')[source]

Show the data of a set of commits.

The method returns the output of Git show command for a set of commits using the following options:

git show --raw --numstat --pretty=fuller --decorate=full
         --parents -M -C -c [<commit>...<commit>]

When the list of commits is empty, the command will return data about the last commit, like the default behaviour of git show.

Parameters
  • commits – list of commits to show data

  • encoding – encode the output using this format

Returns

a generator where each item is a line from the show output

Raises
sync()[source]

Keep the repository in sync.

This method will synchronize the repository with its ‘origin’, fetching the newest objects and updating references. It uses low-level commands which make it possible to keep track of what has changed in the repository.

The method also returns a list of hashes related to the new commits fetched during the process.

Returns

list of new commits

Raises

RepositoryError – when an error occurs synchronizing the repository

update()[source]

Update repository from its remote.

Calling this method, the repository will be synchronized with the remote repository using ‘fetch’ command for ‘heads’ refs. Any commit stored in the local copy will be removed; refs will be overwritten.

Raises

RepositoryError – when an error occurs updating the repository

perceval.backends.core.github module

class perceval.backends.core.github.GitHub(owner=None, repository=None, api_token=None, github_app_id=None, github_app_pk_filepath=None, base_url=None, tag=None, archive=None, sleep_for_rate=False, min_rate_to_sleep=10, max_retries=5, sleep_time=1, max_items=100, ssl_verify=True)[source]

Bases: perceval.backend.Backend

GitHub backend for Perceval.

This class allows fetching the issues stored in a GitHub repository. Note that since version 0.20.0, the api_token accepts a list of tokens, thus the backend must be initialized as follows:

GitHub(
    owner='chaoss', repository='grimoirelab',
    api_token=[TOKEN-1, TOKEN-2, ...],
    sleep_for_rate=True,
    sleep_time=300
)

Parameters
  • owner – GitHub owner

  • repository – GitHub repository from the owner

  • api_token – list of GitHub auth tokens to access the API

  • github_app_id – GitHub App ID

  • github_app_pk_filepath – GitHub App private key PEM file path

  • base_url – GitHub URL in enterprise edition case; when no value is set the backend will fetch the data from the GitHub public site.

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

  • sleep_for_rate – sleep until rate limit is reset

  • min_rate_to_sleep – minimum rate needed to sleep until it will be reset

  • max_retries – number of max retries to a data source before raising a RetryError exception

  • max_items – max number of category items (e.g., issues, pull requests) per query

  • sleep_time – time to sleep in case of connection problems

  • ssl_verify – enable/disable SSL verification

CATEGORIES = ['issue', 'pull_request', 'repository']

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

CLASSIFIED_FIELDS = [['user_data'], ['merged_by_data'], ['assignee_data'], ['assignees_data'], ['requested_reviewers_data'], ['comments_data', 'user_data'], ['comments_data', 'reactions_data', 'user_data'], ['reviews_data', 'user_data'], ['review_comments_data', 'user_data'], ['review_comments_data', 'reactions_data', 'user_data']]

A list of fields that should be considered sensitive or confidential.

Fields listed here will be hidden from fetched items, when this behaviour is requested.

Fields are represented as a list of strings. As items returned are dicts that may contain nested dicts, each entry is a list which stores the “path” or nested dicts keys to the field to remove. For example, [‘my’, ‘classified’, ‘field’] will remove field from item[‘data’][‘my’][‘classified’] dict.
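The path-based removal described above can be sketched as follows. This is a minimal illustration, not the backend's implementation: the helper names and the sample item are hypothetical, and descending into lists (so that every comment loses its user_data) is an assumption about how nested collections are handled.

```python
def remove_classified_field(node, path):
    """Remove the key at `path` from nested dicts, descending into lists."""
    if isinstance(node, list):
        for element in node:
            remove_classified_field(element, path)
        return
    if not isinstance(node, dict) or not path:
        return
    key, rest = path[0], path[1:]
    if not rest:
        # Last path component: drop the field itself if present.
        node.pop(key, None)
    elif key in node:
        remove_classified_field(node[key], rest)


def filter_classified(item, classified_fields):
    """Strip every classified path from item['data'] in place."""
    for path in classified_fields:
        remove_classified_field(item['data'], path)
    return item


# Hypothetical item using two of the paths from CLASSIFIED_FIELDS.
item = {'data': {
    'title': 'Example issue',
    'user_data': {'login': 'jsmith'},
    'comments_data': [
        {'body': 'First', 'user_data': {'login': 'jdoe'}},
        {'body': 'Second', 'user_data': {'login': 'jrae'}},
    ],
}}
filtered = filter_classified(item, [['user_data'], ['comments_data', 'user_data']])
```

Only the last key of each path is removed, so the surrounding structure (here, the comment bodies) stays intact.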

Classified data filtering and archiving are not compatible to prevent data leaks or security issues.

fetch(category='issue', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), to_date=datetime.datetime(2100, 1, 1, 0, 0, tzinfo=tzutc()), filter_classified=False)[source]

Fetch the issues/pull requests from the repository.

The method retrieves, from a GitHub repository, the issues/pull requests updated since the given date.

Parameters
  • category – the category of items to fetch

  • from_date – obtain issues/pull requests updated since this date

  • to_date – obtain issues/pull requests until a specific date (included)

  • filter_classified – remove classified fields from the resulting items

Returns

a generator of issues

fetch_items(category, **kwargs)[source]

Fetch the items (issues or pull_requests or repo information)

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns

this backend supports items archive

classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns

this backend supports items resuming

static metadata_category(item)[source]

Extracts the category from a GitHub item.

This backend generates three types of item: ‘issue’, ‘pull_request’ and ‘repository’.

static metadata_id(item)[source]

Extracts the identifier from a GitHub item.

static metadata_updated_on(item)[source]

Extracts the update time from a GitHub item.

The timestamp used is extracted from ‘updated_at’ field. This date is converted to UNIX timestamp format. As GitHub dates are in UTC the conversion is straightforward.

Parameters

item – item generated by the backend

Returns

a UNIX timestamp
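As a rough illustration of the conversion described above, the following sketch turns a UTC ‘updated_at’ string into a UNIX timestamp using only the standard library. The function name and the exact string format (‘YYYY-MM-DDTHH:MM:SSZ’) are assumptions; the backend itself may parse dates differently.

```python
from datetime import datetime, timezone


def updated_at_to_unix(updated_at):
    """Convert a UTC string such as '2016-01-01T00:00:00Z' to a UNIX timestamp."""
    parsed = datetime.strptime(updated_at, "%Y-%m-%dT%H:%M:%SZ")
    # The value is already in UTC, so only the tzinfo has to be attached.
    return parsed.replace(tzinfo=timezone.utc).timestamp()
```

Attaching the UTC tzinfo before calling timestamp() avoids the local-time interpretation that naive datetime objects would otherwise get.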

search_fields(item)[source]

Add search fields to an item.

It adds the values of metadata_id plus the owner and repo.

Parameters

item – the item to extract the search fields values

Returns

a dict of search fields

version = '0.27.0'
class perceval.backends.core.github.GitHubClient(owner, repository, tokens=None, github_app_id=None, github_app_pk_filepath=None, base_url=None, sleep_for_rate=False, min_rate_to_sleep=10, sleep_time=1, max_retries=5, max_items=100, archive=None, from_archive=False, ssl_verify=True)[source]

Bases: perceval.client.HttpClient, perceval.client.RateLimitHandler

Client for retrieving information from the GitHub API

Parameters
  • owner – GitHub owner

  • repository – GitHub repository from the owner

  • tokens – list of GitHub auth tokens to access the API

  • github_app_id – GitHub App ID

  • github_app_pk_filepath – GitHub App private key PEM file path

  • base_url – GitHub URL in enterprise edition case; when no value is set the backend will fetch the data from the GitHub public site.

  • sleep_for_rate – sleep until rate limit is reset

  • min_rate_to_sleep – minimum rate needed to sleep until it will be reset

  • sleep_time – time to sleep in case of connection problems

  • max_retries – number of max retries to a data source before raising a RetryError exception

  • max_items – max number of category items (e.g., issues, pull requests) per query

  • archive – collect issues already retrieved from an archive

  • from_archive – it tells whether to write/read the archive

  • ssl_verify – enable/disable SSL verification

EXTRA_STATUS_FORCELIST = [403, 500, 502, 503]
HACCEPT = 'Accept'
HAUTHORIZATION = 'Authorization'
PDIRECTION = 'direction'
PPER_PAGE = 'per_page'
PSINCE = 'since'
PSORT = 'sort'
PSTATE = 'state'
RCOMMENTS = 'comments'
RCOMMITS = 'commits'
RISSUES = 'issues'
RORGS = 'orgs'
RPULLS = 'pulls'
RRATE_LIMIT = 'rate_limit'
RREACTIONS = 'reactions'
RREPOS = 'repos'
RREQUESTED_REVIEWERS = 'requested_reviewers'
RREVIEWS = 'reviews'
RUSERS = 'users'
VACCEPT = 'application/vnd.github.squirrel-girl-preview'
VACCEPT_V3 = 'application/vnd.github.v3+json'
VDIRECTION_ASC = 'asc'
VSORT_UPDATED = 'updated'
VSTATE_ALL = 'all'
calculate_time_to_reset()[source]

Calculate the seconds to reset the token requests, by obtaining the difference between the current date and the next date when the token is fully regenerated.
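The calculation can be sketched as a plain date difference. This is an illustrative helper, not the client's code: the name seconds_to_reset is hypothetical, the reset moment is assumed to be given as UTC epoch seconds, and clamping at zero is a choice made for this sketch.

```python
from datetime import datetime, timezone


def seconds_to_reset(reset_epoch, now=None):
    """Seconds between `now` and the moment the rate limit is regenerated."""
    now = now or datetime.now(timezone.utc)
    reset_at = datetime.fromtimestamp(reset_epoch, tz=timezone.utc)
    # A reset date in the past means there is nothing left to wait for.
    return max(0.0, (reset_at - now).total_seconds())
```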

fetch(url, payload=None, headers=None, method='GET', stream=False, auth=None)[source]

Fetch the data from a given URL.

Parameters
  • url – link to the resource

  • payload – payload of the request

  • headers – headers of the request

  • method – type of request call (GET or POST)

  • stream – defer downloading the response body until the response content is available

  • auth – auth of the request

Returns

a response object

fetch_items(path, payload)[source]

Return the items from the GitHub API using links pagination
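Links pagination means following the rel="next" entry of the HTTP Link header until no further page is advertised. The sketch below shows the idea in isolation; fetch_page is a hypothetical callable that returns a (body, Link header) pair, and the regular expression is a simplification of full Link header parsing.

```python
import re

# Captures the URL of the entry tagged rel="next" in a Link header.
LINK_NEXT = re.compile(r'<([^>]+)>;\s*rel="next"')


def paginate(fetch_page, url):
    """Yield every page body, following rel="next" links until none is left."""
    while url:
        body, link_header = fetch_page(url)
        yield body
        match = LINK_NEXT.search(link_header or "")
        url = match.group(1) if match else None
```

Because the function is a generator, a caller can stop consuming pages early without triggering further requests.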

issue_comment_reactions(comment_id)[source]

Get reactions of an issue comment

issue_comments(issue_number)[source]

Get the issue comments from pagination

issue_reactions(issue_number)[source]

Get reactions of an issue

issues(from_date=None)[source]

Fetch the issues from the repository.

The method retrieves, from a GitHub repository, the issues updated since the given date.

Parameters

from_date – obtain issues updated since this date

Returns

a generator of issues

pull_commits(pr_number)[source]

Get pull request commits

pull_requested_reviewers(pr_number)[source]

Get the requested reviewers of a pull request

pull_review_comment_reactions(comment_id)[source]

Get reactions of a review comment

pull_review_comments(pr_number)[source]

Get pull request review comments

pull_reviews(pr_number)[source]

Get pull request reviews

pulls(from_date=None)[source]

Fetch the pull requests from the repository.

The method retrieves, from a GitHub repository, the pull requests updated since the given date.

Parameters

from_date – obtain pull requests updated since this date

Returns

a generator of pull requests

repo()[source]

Get repository data

static sanitize_for_archive(url, headers, payload)[source]

Sanitize the payload of an HTTP request by removing the token information before storing/retrieving archived items

Parameters
  • url – HTTP url request

  • headers – HTTP headers request

  • payload – HTTP payload request

Returns

url, headers and the sanitized payload
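A minimal sketch of this kind of sanitizing is shown below. It assumes the token travels in the ‘Authorization’ header, as the HAUTHORIZATION constant suggests; the function name strip_token is hypothetical and the real method may differ in detail.

```python
def strip_token(url, headers, payload):
    """Return the request triple with the 'Authorization' header removed.

    Assumption: the token is only carried by the 'Authorization' header.
    """
    if headers and "Authorization" in headers:
        # Build a new dict so the caller's headers are left untouched.
        headers = {k: v for k, v in headers.items() if k != "Authorization"}
    return url, headers, payload
```

Removing the token before computing an archive key means the same request can be found again in the archive regardless of which token was used to fetch it.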

user(login)[source]

Get the user information and update the user cache

user_orgs(login)[source]

Get the user public organizations

class perceval.backends.core.github.GitHubCommand(*args, debug=False)[source]

Bases: perceval.backend.BackendCommand

Class to run GitHub backend from the command line.

BACKEND

alias of perceval.backends.core.github.GitHub

classmethod setup_cmd_parser()[source]

Returns the GitHub argument parser.

perceval.backends.core.githubql module

class perceval.backends.core.githubql.GitHubQL(owner=None, repository=None, api_token=None, github_app_id=None, github_app_pk_filepath=None, base_url=None, tag=None, archive=None, sleep_for_rate=False, min_rate_to_sleep=10, max_retries=5, sleep_time=1, max_items=100, ssl_verify=True)[source]

Bases: perceval.backends.core.github.GitHub

GitHubQL backend for Perceval using the GitHub API v4. Most of the methods are inherited from the GitHub backend.

This class allows fetching the issue events of a GitHub repository. Note that the events retrieved also include those of pull requests, since in GitHub every pull request is an issue, but an issue may not be a pull request. Pull requests can be identified by the attribute pull_request included in data.issue.

Because the GitHub API v3 cannot fetch issue events after a given date, the events are fetched via the GitHub API v4 (based on GraphQL).

All issues of a given tracker are retrieved in ascending order based on the last time they were updated. For each issue, its events (optionally from/until a given date) are collected using a GraphQL call. Each event is returned by Perceval together with the corresponding issue (available in data.issue).

Since the events are collected issue by issue, incremental fetching is not supported. This limitation is due to the fact that events that occur on an issue may not update the issue attributes. Since there is no way to identify new events from the attributes of an issue, all issues must be fetched for every execution.

No user information beyond the login is included in data returned by this backend. Thus, the backend doesn’t require filter classified support.
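The issue-by-issue collection described above can be pictured with a small, self-contained sketch. The data shapes are assumptions made for illustration (each issue carries an ‘events’ list, each event a UTC ‘createdAt’ string); they are not the payloads returned by the GraphQL API, and events_in_window is a hypothetical helper.

```python
from datetime import datetime, timezone


def events_in_window(issues, from_date, to_date):
    """Yield each event created within [from_date, to_date] with its issue."""
    for issue in issues:
        for event in issue["events"]:
            created = datetime.strptime(event["createdAt"], "%Y-%m-%dT%H:%M:%SZ")
            created = created.replace(tzinfo=timezone.utc)
            # Both bounds are inclusive in this sketch.
            if from_date <= created <= to_date:
                yield {
                    "event": event,
                    # The issue travels with every one of its events.
                    "issue": {k: v for k, v in issue.items() if k != "events"},
                    # A pull request is an issue with a pull_request attribute.
                    "is_pull": "pull_request" in issue,
                }
```

The sketch also shows why every issue has to be visited on each run: nothing on the issue itself tells whether it has new events.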

Parameters
  • owner – GitHub owner

  • repository – GitHub repository from the owner

  • api_token – list of GitHub auth tokens to access the API

  • github_app_id – GitHub App ID

  • github_app_pk_filepath – GitHub App private key PEM file path

  • base_url – GitHub URL in enterprise edition case; when no value is set the backend will fetch the data from the GitHub public site.

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

  • sleep_for_rate – sleep until rate limit is reset

  • min_rate_to_sleep – minimum rate needed to sleep until it will be reset

  • max_retries – number of max retries to a data source before raising a RetryError exception

  • max_items – max number of category items per query

  • sleep_time – time to sleep in case of connection problems

  • ssl_verify – enable/disable SSL verification

CATEGORIES = ['event']

A list of categories that can be fetched by this backend.

Every backend is able to produce items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

fetch(category='event', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), to_date=datetime.datetime(2100, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the issue events from the repository.

The method retrieves, from a GitHub repository, the issue events since/until a given date.

Parameters
  • category – the category of items to fetch

  • from_date – obtain issue events since this date

  • to_date – obtain issue events until this date (included)

Returns

a generator of events

fetch_items(category, **kwargs)[source]

Fetch the items

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns

this backend doesn’t support items resuming

static metadata_category(item)[source]

Extracts the category from a GitHub item.

This backend generates one type of item, which is ‘event’.

static metadata_id(item)[source]

Extracts the identifier from a GitHub item.

static metadata_updated_on(item)[source]

Extracts the update time from a GitHub item.

The timestamp used is extracted from ‘createdAt’ field. This date is converted to UNIX timestamp format. As GitHub dates are in UTC the conversion is straightforward.

Parameters

item – item generated by the backend

Returns

a UNIX timestamp

version = '0.4.0'
class perceval.backends.core.githubql.GitHubQLClient(owner, repository, tokens=None, github_app_id=None, github_app_pk_filepath=None, base_url=None, sleep_for_rate=False, min_rate_to_sleep=10, sleep_time=1, max_retries=5, max_items=100, archive=None, from_archive=False, ssl_verify=True)[source]

Bases: perceval.backends.core.github.GitHubClient

Client for retrieving information from GitHub API

Parameters
  • owner – GitHub owner

  • repository – GitHub repository from the owner

  • tokens – list of GitHub auth tokens to access the API

  • github_app_id – GitHub App ID

  • github_app_pk_filepath – GitHub App private key PEM file path

  • base_url – GitHub URL in enterprise edition case; when no value is set the backend will fetch the data from the GitHub public site.

  • sleep_for_rate – sleep until rate limit is reset

  • min_rate_to_sleep – minimum rate needed to sleep until it will be reset

  • sleep_time – time to sleep in case of connection problems

  • max_retries – number of max retries to a data source before raising a RetryError exception

  • max_items – max number of category items (e.g., issues, pull requests) per query

  • archive – collect events already retrieved from an archive

  • from_archive – it tells whether to write/read the archive

  • ssl_verify – enable/disable SSL verification

VACCEPT = 'application/vnd.github.squirrel-girl-preview,application/vnd.github.starfox-preview+json'
VPER_PAGE = 100
events(issue_number, is_pull, from_date)[source]

Get the issue events of the types declared at EVENT_TYPES from the GraphQL API

Parameters
  • issue_number – number of the issue

  • is_pull – boolean value to identify a pull request

  • from_date – fetch events after a given date

class perceval.backends.core.githubql.GitHubQLCommand(*args, debug=False)[source]

Bases: perceval.backends.core.github.GitHubCommand

Class to run GitHubQL backend from the command line.

BACKEND

alias of perceval.backends.core.githubql.GitHubQL

perceval.backends.core.gitlab module

class perceval.backends.core.gitlab.GitLab(owner=None, repository=None, api_token=None, is_oauth_token=False, base_url=None, tag=None, archive=None, sleep_for_rate=False, min_rate_to_sleep=10, max_retries=5, sleep_time=1, blacklist_ids=None, extra_retry_after_status=None, ssl_verify=True)[source]

Bases: perceval.backend.Backend

GitLab backend for Perceval.

This class allows fetching the issues stored in a GitLab repository.

Parameters
  • owner – GitLab owner

  • repository – GitLab repository from the owner

  • api_token – GitLab auth token to access the API

  • is_oauth_token – True if the token is OAuth (default False)

  • base_url – GitLab URL in enterprise edition case; when no value is set the backend will fetch the data from the GitLab public site.

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

  • sleep_for_rate – sleep until rate limit is reset

  • min_rate_to_sleep – minimum rate needed to sleep until it will be reset

  • max_retries – number of max retries to a data source before raising a RetryError exception

  • sleep_time – time (in seconds) to sleep in case of connection problems

  • blacklist_ids – ids of items that must not be retrieved

  • extra_retry_after_status – retry HTTP requests after these statuses (default 500 and 502). These statuses complement the ones (413, 429, 503) defined in the HttpClient class

  • ssl_verify – enable/disable SSL verification

CATEGORIES = ['issue', 'merge_request']

A list of categories that can be fetched by this backend.

Every backend is able to produce items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

ORIGIN_UNIQUE_FIELD = OriginUniqueField(name='iid', type=<class 'int'>)

A field unique to a given origin for items produced by this backend.

If ORIGIN_UNIQUE_FIELD is defined, users can pass a list of blocked values which should not be included in the results, if the field defined here contains them. For example, if ORIGIN_UNIQUE_FIELD were set to post_id, then users could pass a list of post ids that should be excluded from the results.

If set to None, blacklisting will be disabled completely. Otherwise, this should be set to an OriginUniqueField containing the name and data type of the field.

Note: Origin in this context refers to one site, API, or other remote that contains several repositories, each consisting of many items of several categories. For example, for the GitLab backend, an origin would be one GitLab instance, such as gitlab.com or opensource.ieee.org, each of which contains many repositories, which in turn contain items such as issues and merge requests.

To access this field, please prefer origin_unique_field().
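The blacklisting behaviour can be sketched as a simple filter over the unique field. Everything below is illustrative: the namedtuple is a local stand-in for Perceval's OriginUniqueField, skip_blacklisted is a hypothetical helper, and items are assumed to be flat dicts that expose the field directly.

```python
from collections import namedtuple

# Local stand-in mirroring the (name, type) pair shown in ORIGIN_UNIQUE_FIELD.
OriginUniqueField = namedtuple("OriginUniqueField", ["name", "type"])


def skip_blacklisted(items, unique_field, blacklist_ids):
    """Yield the items whose unique field value is not in the blacklist."""
    if unique_field is None:
        # No unique field defined: blacklisting is disabled.
        yield from items
        return
    # Coerce the blocked values to the declared type, e.g. '2' -> 2.
    blocked = {unique_field.type(value) for value in blacklist_ids or []}
    for item in items:
        if item.get(unique_field.name) not in blocked:
            yield item
```

Coercing the blacklist to the declared type lets values read from the command line as strings match integer fields such as ‘iid’.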

fetch(category='issue', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the issues/merge requests from the repository.

The method retrieves, from a GitLab repository, the issues/merge requests updated since the given date.

Parameters
  • category – the category of items to fetch

  • from_date – obtain issues updated since this date

Returns

a generator of issues

fetch_items(category, **kwargs)[source]

Fetch the items (issues or merge_requests)

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns

this backend supports items archive

classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns

this backend does not support items resuming

static metadata_category(item)[source]

Extracts the category from a GitLab item.

This backend generates two types of item, which are ‘issue’ and ‘merge_request’.

static metadata_id(item)[source]

Extracts the identifier from a GitLab item.

static metadata_updated_on(item)[source]

Extracts the update time from a GitLab item.

The timestamp used is extracted from ‘updated_at’ field. This date is converted to UNIX timestamp format. As GitLab dates are in UTC the conversion is straightforward.

Parameters

item – item generated by the backend

Returns

a UNIX timestamp

search_fields(item)[source]

Add search fields to an item.

It adds the values of metadata_id plus the owner, project and iid of the issue or merge requests. Optionally, if the project is part of a (nested) group, all groups are also included to the search fields via the attribute groups.

Parameters

item – the item to extract the search fields values

Returns

a dict of search fields

version = '0.12.0'
class perceval.backends.core.gitlab.GitLabClient(owner, repository, token, is_oauth_token=False, base_url=None, sleep_for_rate=False, min_rate_to_sleep=10, sleep_time=1, max_retries=5, extra_retry_after_status=None, archive=None, from_archive=False, ssl_verify=True)[source]

Bases: perceval.client.HttpClient, perceval.client.RateLimitHandler

Client for retrieving information from the GitLab API

Parameters
  • owner – GitLab owner

  • repository – GitLab owner’s repository

  • token – GitLab auth token to access the API

  • is_oauth_token – True if the token is OAuth (default False)

  • base_url – GitLab URL in enterprise edition case; when no value is set the backend will fetch the data from the GitLab public site.

  • sleep_for_rate – sleep until rate limit is reset

  • min_rate_to_sleep – minimum rate needed to sleep until it will be reset

  • sleep_time – time (in seconds) to sleep in case of connection problems

  • max_retries – number of max retries to a data source before raising a RetryError exception

  • extra_retry_after_status – retry HTTP requests after status

  • archive – an archive to store/read fetched data

  • from_archive – it tells whether to write/read the archive

  • ssl_verify – enable/disable SSL verification

HAUTHORIZATION = 'Authorization'
HPRIVATE_TOKEN = 'PRIVATE-TOKEN'
HRATE_LIMIT = 'RateLimit-Remaining'
HRATE_LIMIT_RESET = 'RateLimit-Reset'
PORDER_BY = 'order_by'
PPER_PAGE = 'per_page'
PSORT = 'sort'
PSTATE = 'state'
PUPDATE_AFTER = 'updated_after'
PVIEW = 'view'
REMOJI = 'award_emoji'
RISSUES = 'issues'
RMERGES = 'merge_requests'
RNOTES = 'notes'
RPROJECTS = 'projects'
RVERSIONS = 'versions'
VORDER_UPDATED_AT = 'updated_at'
VPER_PAGE = 100
VSORT_ASC = 'asc'
VSTATE_ALL = 'all'
VVIEW_SIMPLE = 'simple'
calculate_time_to_reset()[source]

Calculate the seconds to reset the token requests, by obtaining the difference between the current date and the next date when the token is fully regenerated.
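How the rate-limit headers and min_rate_to_sleep could fit together is sketched below. It is a simplified illustration rather than the client's logic: the helper seconds_to_wait is hypothetical, ‘RateLimit-Reset’ is assumed to hold epoch seconds, and the current time is passed in explicitly to keep the function deterministic.

```python
def seconds_to_wait(headers, min_rate_to_sleep, now_epoch):
    """Return how long to pause before the next request (0 means no pause)."""
    remaining = headers.get("RateLimit-Remaining")
    reset = headers.get("RateLimit-Reset")
    if remaining is None or reset is None:
        # Without rate-limit information there is nothing to wait for.
        return 0
    if int(remaining) > min_rate_to_sleep:
        return 0
    # Never return a negative wait when the reset time has already passed.
    return max(0, int(reset) - int(now_epoch))
```

With sleep_for_rate enabled, a caller would sleep for the returned number of seconds; otherwise it could treat a non-zero value as an error condition.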

emojis(item_type, item_id)[source]

Get emojis from pagination

fetch(url, payload=None, headers=None, method='GET', stream=False)[source]

Fetch the data from a given URL.

Parameters
  • url – link to the resource

  • payload – payload of the request

  • headers – headers of the request

  • method – type of request call (GET or POST)

  • stream – defer downloading the response body until the response content is available

Returns

a response object

fetch_items(path, payload)[source]

Return the items from GitLab API using links pagination

issues(from_date=None)[source]

Get the issues from pagination

merge(merge_id)[source]

Get the merge full data

merge_version(merge_id, version_id)[source]

Get merge version detail

merge_versions(merge_id)[source]

Get the merge versions from pagination

merges(from_date=None)[source]

Get the merge requests from pagination

note_emojis(item_type, item_id, note_id)[source]

Get emojis of a note

notes(item_type, item_id)[source]

Get the notes from pagination

static sanitize_for_archive(url, headers, payload)[source]

Sanitize the payload of an HTTP request by removing the token information before storing/retrieving archived items

Parameters
  • url – HTTP url request

  • headers – HTTP headers request

  • payload – HTTP payload request

Returns

url, headers and the sanitized payload

class perceval.backends.core.gitlab.GitLabCommand(*args, debug=False)[source]

Bases: perceval.backend.BackendCommand

Class to run GitLab backend from the command line.

BACKEND

alias of perceval.backends.core.gitlab.GitLab

classmethod setup_cmd_parser()[source]

Returns the GitLab argument parser.

perceval.backends.core.gitter module

class perceval.backends.core.gitter.Gitter(group=None, room=None, api_token=None, max_items=100, sleep_for_rate=False, min_rate_to_sleep=10, tag=None, archive=None, ssl_verify=True)[source]

Bases: perceval.backend.Backend

Gitter backend.

This class retrieves the messages sent to a Gitter room. To access the server an API token is required.

The origin of the data will be set to the GITTER_URL plus the identifier of the room, i.e., ‘https://gitter.im/{group}/{room}’.

Parameters
  • group – group to which the room belongs

  • room – identifier of the room from which the messages are to be fetched

  • api_token – token or key needed to use the API

  • max_items – maximum number of messages requested in the same query

  • sleep_for_rate – sleep until rate limit is reset

  • min_rate_to_sleep – minimum rate needed to sleep until it will be reset

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

  • ssl_verify – enable/disable SSL verification

CATEGORIES = ['message']

A list of categories that can be fetched by this backend.

Every backend is able to produce items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

fetch(category='message', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the messages from the room.

This method fetches the messages sent in the room since the given date.

Parameters
  • category – the category of items to fetch

  • from_date – date from which messages are to be fetched

Returns

a generator of messages

fetch_items(category, **kwargs)[source]

Fetch the messages.

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns

this backend supports items archive

classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns

this backend does not support items resuming

static metadata_category(item)[source]

Extracts the category from a Gitter item.

This backend only generates one type of item which is ‘message’.

static metadata_id(item)[source]

Extracts the identifier from a Gitter item.

static metadata_updated_on(item)[source]

Extracts and converts the sent time of a message from a Gitter item.

The timestamp is extracted from ‘sent’ field and converted to a UNIX timestamp.

Parameters

item – item generated by the backend

Returns

a UNIX timestamp

search_fields(item)[source]

Add search fields to an item.

It adds the values of metadata_id, group, room and room_id.

Parameters

item – the item to extract the search fields values

Returns

a dict of search fields

version = '0.1.0'
class perceval.backends.core.gitter.GitterClient(api_token, max_items=100, archive=None, sleep_for_rate=False, min_rate_to_sleep=10, from_archive=False, ssl_verify=True)[source]

Bases: perceval.client.HttpClient, perceval.client.RateLimitHandler

Gitter API client.

Client for fetching information from the Gitter server using its REST API.

Parameters
  • api_token – key needed to use the API

  • max_items – maximum number of items per request

  • archive – an archive to store/read fetched data

  • sleep_for_rate – sleep until rate limit is reset

  • min_rate_to_sleep – minimum rate needed to sleep until it will be reset

  • from_archive – it tells whether to write/read the archive

  • ssl_verify – enable/disable SSL verification

HAUTHORIZATION = 'Authorization'
PBEFORE_ID = 'beforeId'
PLIMIT = 'limit'
RMESSAGES = 'chatMessages'
RROOMS = 'rooms'
calculate_time_to_reset()[source]

Number of seconds to wait. They are contained in the rate limit reset header

fetch(url, payload=None, headers=None)[source]

Fetch the data from a given URL.

Parameters
  • url – link to the resource

  • payload – payload of the request

  • headers – headers of the request

Returns

a response object

get_room_id(room)[source]

Fetch the room id of a room.

message_page(room_id, before_id)[source]

Fetch a page of messages.
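Page-by-page retrieval with a ‘beforeId’ cursor can be sketched as follows. The sketch rests on assumptions that the text above does not confirm: fetch_page is a hypothetical callable taking before_id and limit, each page is assumed to be ordered from oldest to newest, and each message is assumed to carry an ‘id’ key.

```python
def iter_messages(fetch_page, limit=100):
    """Yield messages page by page, walking backwards in time with a cursor."""
    before_id = None
    while True:
        page = fetch_page(before_id=before_id, limit=limit)
        if not page:
            break
        yield from page
        # Assumed ordering: the first message of a page is the oldest one,
        # so it becomes the cursor for the next (older) page.
        before_id = page[0]["id"]
        if len(page) < limit:
            # A short page means the beginning of the room was reached.
            break
```

Note that messages come out newest page first; a caller that needs strict chronological order would have to collect and reorder them.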

static sanitize_for_archive(url, headers, payload)[source]

Sanitize the payload of an HTTP request by removing the token information before storing/retrieving archived items.

Parameters
  • url – HTTP url request

  • headers – HTTP headers request

  • payload – HTTP payload request

Returns

url, headers and the sanitized payload

class perceval.backends.core.gitter.GitterCommand(*args, debug=False)[source]

Bases: perceval.backend.BackendCommand

Class to run Gitter backend from the command line.

BACKEND

alias of perceval.backends.core.gitter.Gitter

classmethod setup_cmd_parser()[source]

Returns the Gitter argument parser.

perceval.backends.core.googlehits module

class perceval.backends.core.googlehits.GoogleHits(keywords, tag=None, archive=None, max_retries=5, sleep_time=1, ssl_verify=True)[source]

Bases: perceval.backend.Backend

GoogleHits backend for Perceval.

This class retrieves the number of hits for a given list of keywords via the Google API. To initialize this class a list of keywords is needed.

Parameters
  • keywords – a list of keywords

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

  • max_retries – number of max retries to a data source before raising a RetryError exception

  • sleep_time – time (in seconds) to sleep in case of connection problems

  • ssl_verify – enable/disable SSL verification

CATEGORIES = ['google_hits']

A list of categories that can be fetched by this backend.

Every backend is able to produce items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

EXTRA_SEARCH_FIELDS = {'keywords': ['keywords']}

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

{

‘key-1’: value-1, ‘key-2’: value-2, ‘key-3’: value-3

}

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

{

‘project_id’: [‘fields’, ‘project’, ‘id’], ‘project_key’: [‘fields’, ‘project’, ‘key’], ‘project_name’: [‘fields’, ‘project’, ‘name’]

}

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.
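To make the “path” idea concrete, here is a minimal sketch that resolves each path against an item and builds the resulting dict. The helper build_search_fields is hypothetical, and returning None for a path that cannot be resolved is an assumption of this sketch, not documented behaviour.

```python
def build_search_fields(item_id, data, extra_search_fields):
    """Build a search_fields dict from an item id and a set of key paths."""
    fields = {"item_id": item_id}
    for name, path in extra_search_fields.items():
        value = data
        # Follow the path one key at a time through the nested dicts.
        for key in path:
            value = value.get(key) if isinstance(value, dict) else None
            if value is None:
                break
        fields[name] = value
    return fields
```

For instance, with the example mapping above, ‘project_key’ would be looked up as data[‘fields’][‘project’][‘key’].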

fetch(category='google_hits')[source]

Fetch data from Google API.

The method retrieves a list of hits for some given keywords using the Google API.

Parameters

category – the category of items to fetch

Returns

a generator of data

fetch_items(category, **kwargs)[source]

Fetch Google hit items

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns

this backend supports items archive

classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns

this backend supports items resuming

static metadata_category(item)[source]

Extracts the category from a GoogleHits item.

This backend only generates one type of item which is ‘google_hits’.

static metadata_id(item)[source]

Extracts the identifier from a GoogleHits item.

static metadata_updated_on(item)[source]

Extracts the update time from a GoogleHits item.

The timestamp is based on the current time when the hit was extracted. This field is not part of the data provided by Google API. It is added by this backend.

Parameters

item – item generated by the backend

Returns

a UNIX timestamp

version = '0.4.0'
class perceval.backends.core.googlehits.GoogleHitsClient(sleep_time=1, max_retries=5, archive=None, from_archive=False, ssl_verify=True)[source]

Bases: perceval.client.HttpClient

GoogleHits API client.

Client for fetching hits data from Google API.

Parameters
  • sleep_time – time (in seconds) to sleep in case of connection problems

  • max_retries – number of max retries to a data source before raising a RetryError exception

  • archive – an archive to store/read fetched data

  • from_archive – it tells whether to write/read the archive

  • ssl_verify – enable/disable SSL verification

EXTRA_STATUS_FORCELIST = [429]
PQUERY = 'q'
hits(keywords)[source]

Fetch information about a list of keywords.

class perceval.backends.core.googlehits.GoogleHitsCommand(*args, debug=False)[source]

Bases: perceval.backend.BackendCommand

Class to run GoogleHits backend from the command line.

BACKEND

alias of perceval.backends.core.googlehits.GoogleHits

classmethod setup_cmd_parser()[source]

Returns the GoogleHits argument parser.

perceval.backends.core.groupsio module

class perceval.backends.core.groupsio.Groupsio(group_name, dirpath, email, password, tag=None, archive=None, ssl_verify=True)[source]

Bases: perceval.backends.core.mbox.MBox

Groups.io backend.

This class allows fetching the messages of a Groups.io group. Initialize this class by passing the name of the group, the directory path where the mbox files will be fetched and stored, and the email and password of the Groups.io user. The origin of the data will be set to the URL of the group on Groups.io.

In order to know the group names where you are subscribed, you can use the following script: https://gist.github.com/valeriocos/2e2231e17fd3052800303bf99bd0c7c4

Parameters
  • group_name – Name of the group

  • dirpath – directory path where the mboxes are stored

  • email – Groupsio user email

  • password – Groupsio user password

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

  • ssl_verify – enable/disable SSL verification

CATEGORIES = ['message']

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

fetch(category='message', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the messages from a Groups.io group.

The method fetches the mbox files from a remote Groups.io group and retrieves the messages stored on them.

Parameters
  • category – the category of items to fetch

  • from_date – obtain messages since this date

Returns

a generator of messages

fetch_items(category, **kwargs)[source]

Fetch the messages

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns

this backend does not support items archive

classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns

this backend supports items resuming

search_fields(item)[source]

Add search fields to an item.

It adds the values of metadata_id plus the group_name

Parameters

item – the item to extract the search fields values

Returns

a dict of search fields

version = '0.4.2'
class perceval.backends.core.groupsio.GroupsioClient(group_name, dirpath, email, password, ssl_verify=True)[source]

Bases: perceval.backends.core.mbox.MailingList

Manage mailing list archives stored by Groups.io.

This class gives access to the remote and local mbox archives of a mailing list stored by Groups.io. It can also keep them in sync.

Parameters
  • group_name – Name of the group

  • dirpath – directory path where the mboxes are stored

  • email – Groupsio user email

  • password – Groupsio user password

  • ssl_verify – enable/disable SSL verification

PEMAIL = 'email'
PGROUP_ID = 'group_id'
PLIMIT = 'limit'
PPAGE_TOKEN = 'page_token'
PPASSWORD = 'password'
PSTART_TIME = 'start_time'
RDOWNLOAD_ARCHIVES = 'downloadarchives'
RGET_SUBSCRIPTIONS = 'getsubs'
RLOGIN = 'login'
fetch(from_date=None)[source]

Fetch the mbox files from the remote archiver.

Stores the archives in the path given during the initialization of this object. Archives that do not have a valid extension are ignored.

Groups.io archives are returned as a .zip file, which contains one file in mbox format.

Parameters

from_date – fetch messages after a given date (included) expressed in ISO format

Returns

a list of tuples, storing the links and paths of the fetched archives
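
As a rough illustration of the archive format described above, the following sketch unpacks a .zip file and reads the messages from the mbox file it contains. It uses only the Python standard library. The helper name extract_mbox, the file names, and the sample message are assumptions made for this example and are not part of Perceval.

```python
import mailbox
import os
import tempfile
import zipfile


def extract_mbox(zip_path, dest_dir):
    """Extract the first member of a zip archive and return its path.

    Hypothetical helper: assumes the archive holds a single mbox file.
    """
    with zipfile.ZipFile(zip_path) as zf:
        member = zf.namelist()[0]
        return zf.extract(member, path=dest_dir)


# Build a tiny archive to stand in for a downloaded Groups.io file.
workdir = tempfile.mkdtemp()
raw_mbox = (
    "From alice@example.com Thu Jan  1 00:00:00 1970\n"
    "From: alice@example.com\n"
    "Subject: hello\n"
    "\n"
    "First message\n"
)
zip_path = os.path.join(workdir, "archive.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("messages.mbox", raw_mbox)

# Unpack the archive and read the subjects of the stored messages.
mbox_path = extract_mbox(zip_path, os.path.join(workdir, "out"))
subjects = [msg["Subject"] for msg in mailbox.mbox(mbox_path)]
```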

subscriptions(per_page=100)[source]

Fetch the groupsio paginated subscriptions for a given token

Parameters

per_page – number of subscriptions per page

Returns

an iterator of subscriptions

class perceval.backends.core.groupsio.GroupsioCommand(*args, debug=False)[source]

Bases: perceval.backend.BackendCommand

Class to run Groupsio backend from the command line.

BACKEND

alias of perceval.backends.core.groupsio.Groupsio

classmethod setup_cmd_parser()[source]

Returns the Groupsio argument parser.

perceval.backends.core.hyperkitty module

class perceval.backends.core.hyperkitty.HyperKitty(url, dirpath, tag=None, archive=None, ssl_verify=True)[source]

Bases: perceval.backends.core.mbox.MBox

HyperKitty backend.

This class fetches the email messages stored on a HyperKitty archiver. Initialize this class by passing the URL of the mailing list archiver and the directory path where the mbox files will be fetched and stored. The origin of the data will be set to the value of url.

Parameters
  • url – URL to the HyperKitty mailing list archiver

  • dirpath – directory path where the mboxes are stored

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

  • ssl_verify – enable/disable SSL verification

CATEGORIES = ['message']

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

fetch(category='message', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the messages from the HyperKitty mailing list archiver.

The method fetches the mbox files from a remote HyperKitty mailing list archiver and retrieves the messages stored on them.

Note that HyperKitty does not yet provide any information about which message is the first one on the mailing list. For this reason, setting from_date to a date earlier than the one on which the first message was sent will cause empty mbox files to be downloaded.

Parameters
  • category – the category of items to fetch

  • from_date – obtain messages since this date

Returns

a generator of messages

fetch_items(category, **kwargs)[source]

Fetch the messages

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns

this backend does not support items archive

classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns

this backend supports items resuming

version = '0.6.0'
class perceval.backends.core.hyperkitty.HyperKittyCommand(*args, debug=False)[source]

Bases: perceval.backend.BackendCommand

Class to run HyperKitty backend from the command line.

BACKEND

alias of perceval.backends.core.hyperkitty.HyperKitty

classmethod setup_cmd_parser()[source]

Returns the HyperKitty argument parser.

class perceval.backends.core.hyperkitty.HyperKittyList(url, dirpath, ssl_verify=True)[source]

Bases: perceval.backends.core.mbox.MailingList

Manage mailing list archives stored by HyperKitty archiver.

This class gives access to the remote and local mbox archives of a mailing list stored by HyperKitty. It can also keep them in sync.

Notice that this class only works with HyperKitty version 1.0.4 or greater. Previous versions do not export messages in MBox format.

Parameters
  • url – URL to the HyperKitty archiver for this list

  • dirpath – path to the local mboxes archives

  • ssl_verify – enable/disable SSL verification

PEND = 'end'
PSTART = 'start'
REXPORT = 'export'
fetch(from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the mbox files from the remote archiver.

This method stores the archives in the path given during the initialization of this object.

HyperKitty archives are accessed month by month and stored following the schema year-month. Archives are fetched from the given month till the current month.

Parameters

from_date – fetch archives that store messages sent on or after the given date; only year and month values are compared

Returns

a list of tuples, storing the links and paths of the fetched archives
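
The month-by-month traversal described above can be sketched as follows. This is a minimal illustration using only the standard library, not the code used by HyperKittyList; the helper name month_ranges and the sample dates are assumptions.

```python
from datetime import datetime, timezone


def month_ranges(from_date, until):
    """Yield (start, end) datetimes for each month from from_date to until.

    Hypothetical helper; only the year and month of the inputs are used.
    """
    year, month = from_date.year, from_date.month
    while (year, month) <= (until.year, until.month):
        start = datetime(year, month, 1, tzinfo=timezone.utc)
        # Roll over to January of the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
        end = datetime(year, month, 1, tzinfo=timezone.utc)
        yield start, end


ranges = list(month_ranges(datetime(2020, 11, 15), datetime(2021, 2, 3)))
# Labels following the year-month schema mentioned above.
labels = [start.strftime("%Y-%m") for start, _ in ranges]
```

Each pair covers one calendar month, so the day of from_date has no effect on which archives are selected.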

property mboxes

Get the mboxes managed by this mailing list.

Returns the archives sorted by date in ascending order.

Returns

a list of MBoxArchive objects

perceval.backends.core.jenkins module

class perceval.backends.core.jenkins.Jenkins(url, user=None, api_token=None, tag=None, archive=None, detail_depth=1, sleep_time=10, blacklist_ids=None, ssl_verify=True)[source]

Bases: perceval.backend.Backend

Jenkins backend for Perceval.

This class retrieves the builds from a Jenkins site. To initialize this class the URL must be provided. The url will be set as the origin of the data.

Parameters
  • url – Jenkins url

  • user – Jenkins user

  • api_token – Jenkins auth token to access the API

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

  • detail_depth – control the detail level of the data returned by the API

  • sleep_time – time (in seconds) to sleep in case of connection problems

  • archive – collect builds already retrieved from an archive

  • blacklist_ids – exclude the job IDs in this list while fetching

  • ssl_verify – enable/disable SSL verification

CATEGORIES = ['build']

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

EXTRA_SEARCH_FIELDS = {'number': ['number']}

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

{

‘key-1’: value-1, ‘key-2’: value-2, ‘key-3’: value-3

}

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

{

‘project_id’: [‘fields’, ‘project’, ‘id’], ‘project_key’: [‘fields’, ‘project’, ‘key’], ‘project_name’: [‘fields’, ‘project’, ‘name’]

}

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.

ORIGIN_UNIQUE_FIELD = OriginUniqueField(name='url', type=<class 'str'>)

A field unique to a given origin for items produced by this backend.

If ORIGIN_UNIQUE_FIELD is defined, users can pass a list of blocked values which should not be included in the results, if the field defined here contains them. For example, if ORIGIN_UNIQUE_FIELD were set to post_id, then users could pass a list of post ids that should be excluded from the results.

If set to None, blacklisting will be disabled completely. Otherwise, this should be set to an OriginUniqueField containing the name and data type of the field.

Note: Origin in this context refers to one site, API, or other remote that contains several repositories, each consisting of many items of several categories. For example, for the GitLab backend, an origin would be one instance of GitLab, such as gitlab.com or opensource.ieee.org, each of which contains many repositories, which in turn contain items such as issues and merge requests.

To access this field, please prefer origin_unique_field().
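
To make the blacklisting idea concrete, here is a minimal sketch of filtering items by the value of their unique field. The namedtuple is a stand-in defined only so the example runs on its own, and filter_blocked and the sample builds are hypothetical; this is not how Perceval implements the feature.

```python
from collections import namedtuple

# Stand-in for Perceval's OriginUniqueField, defined here so the sketch runs alone.
OriginUniqueField = namedtuple("OriginUniqueField", ["name", "type"])

ORIGIN_UNIQUE_FIELD = OriginUniqueField(name="url", type=str)


def filter_blocked(items, unique_field, blocked_values):
    """Yield the items whose unique field value is not in the blocked set.

    Hypothetical helper illustrating the idea only.
    """
    blocked = set(blocked_values)
    for item in items:
        if item.get(unique_field.name) not in blocked:
            yield item


# Made-up builds for illustration.
builds = [
    {"url": "https://jenkins.example.com/job/a/1/", "number": 1},
    {"url": "https://jenkins.example.com/job/b/7/", "number": 7},
]
kept = list(filter_blocked(builds, ORIGIN_UNIQUE_FIELD,
                           ["https://jenkins.example.com/job/b/7/"]))
```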

fetch(category='build')[source]

Fetch the builds from the url.

The method retrieves the builds from a Jenkins URL.

Parameters

category – the category of items to fetch

Returns

a generator of builds

fetch_items(category, **kwargs)[source]

Fetch the contents

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns

this backend supports items archiving

classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns

this backend does not support items resuming

static metadata_category(item)[source]

Extracts the category from a Jenkins item.

This backend only generates one type of item which is ‘build’.

static metadata_id(item)[source]

Extracts the identifier from a Build item.

static metadata_updated_on(item)[source]

Extracts the update time from a Jenkins item.

The timestamp is extracted from ‘timestamp’ field. This date is a UNIX timestamp but needs to be converted to a float value.

Parameters

item – item generated by the backend

Returns

a UNIX timestamp

version = '0.16.0'
class perceval.backends.core.jenkins.JenkinsClient(url, user=None, api_token=None, blacklist_jobs=None, detail_depth=1, sleep_time=10, archive=None, from_archive=False, ssl_verify=True)[source]

Bases: perceval.client.HttpClient

Jenkins API client.

This class implements a simple client to retrieve jobs/builds from projects in a Jenkins node. The amount of data returned for each request depends on the detail_depth value selected (minimum and default is 1). Note that increasing detail_depth may considerably slow down the fetch operation and cause broken connection errors.

Parameters
  • url – URL of the Jenkins node (e.g., https://build.opnfv.org/ci)

  • user – Jenkins user

  • api_token – Jenkins auth token to access the API

  • blacklist_jobs – exclude the jobs of this list while fetching

  • detail_depth – set the detail level of the data returned by the API

  • sleep_time – time (in seconds) to sleep in case of connection problems

  • archive – an archive to store/read fetched data

  • from_archive – it tells whether to write/read the archive

  • ssl_verify – enable/disable SSL verification

Raises

HTTPError – when an error occurs doing the request

EXTRA_STATUS_FORCELIST = [410, 502, 503]
MAX_RETRIES = 5
PDEPTH = 'depth'
RAPI = 'api'
RJOB = 'job'
RJSON = 'json'
get_builds(job_name, url)[source]

Retrieve all builds from a job

Parameters
  • job_name – name of the job

  • url – target url to fetch builds
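
The class constants suggest how a request URL for a job might be put together. The sketch below shows one plausible composition, following the general layout of the Jenkins remote access API; the helper job_api_url and the example server are assumptions, and the actual request code in JenkinsClient may differ.

```python
from urllib.parse import urlencode

# Constants copied from JenkinsClient.
RJOB = "job"
RAPI = "api"
RJSON = "json"
PDEPTH = "depth"


def job_api_url(base_url, job_name, detail_depth=1):
    """Build a JSON API URL for a job (hypothetical helper)."""
    path = "/".join([base_url.rstrip("/"), RJOB, job_name, RAPI, RJSON])
    # The depth parameter controls how much detail the API returns.
    return path + "?" + urlencode({PDEPTH: detail_depth})


url = job_api_url("https://jenkins.example.com/", "my-job", detail_depth=2)
```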

get_jobs(url)[source]

Retrieve all jobs

Parameters

url – target url to fetch jobs

class perceval.backends.core.jenkins.JenkinsCommand(*args, debug=False)[source]

Bases: perceval.backend.BackendCommand

Class to run Jenkins backend from the command line.

BACKEND

alias of perceval.backends.core.jenkins.Jenkins

classmethod setup_cmd_parser()[source]

Returns the Jenkins argument parser.

perceval.backends.core.jira module

class perceval.backends.core.jira.Jira(url, project=None, user=None, password=None, cert=None, max_results=100, tag=None, archive=None, ssl_verify=True)[source]

Bases: perceval.backend.Backend

JIRA backend for Perceval.

This class retrieves the issues stored in a JIRA issue tracking system. To initialize this class the URL must be provided. The url will be set as the origin of the data.

Note that when fetching data with an authenticated access (i.e., user and password), information about issue transitions and operations (e.g., edit-issue, comment-issue) is included in the JSON documents produced by the backend.

Parameters
  • url – JIRA’s endpoint

  • project – filter issues by project

  • user – Jira user

  • password – Jira user password

  • cert – SSL certificate path (PEM)

  • max_results – max number of results per query

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

  • ssl_verify – enable/disable SSL verification

CATEGORIES = ['issue']

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

EXTRA_SEARCH_FIELDS = {'issue_key': ['key'], 'project_id': ['fields', 'project', 'id'], 'project_key': ['fields', 'project', 'key'], 'project_name': ['fields', 'project', 'name']}

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

{

‘key-1’: value-1, ‘key-2’: value-2, ‘key-3’: value-3

}

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

{

‘project_id’: [‘fields’, ‘project’, ‘id’], ‘project_key’: [‘fields’, ‘project’, ‘key’], ‘project_name’: [‘fields’, ‘project’, ‘name’]

}

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.
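
The path lookup described above can be sketched with the EXTRA_SEARCH_FIELDS of this backend. The helpers resolve_path and build_search_fields and the sample issue are hypothetical and only illustrate the idea; they are not Perceval's implementation.

```python
EXTRA_SEARCH_FIELDS = {
    "issue_key": ["key"],
    "project_id": ["fields", "project", "id"],
    "project_key": ["fields", "project", "key"],
    "project_name": ["fields", "project", "name"],
}


def resolve_path(item, path):
    """Follow a list of keys into a nested dict; return None if a key is missing."""
    value = item
    for key in path:
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def build_search_fields(item, item_id, extra_fields):
    """Combine the default 'item_id' entry with the extra search fields."""
    fields = {"item_id": item_id}
    for name, path in extra_fields.items():
        fields[name] = resolve_path(item, path)
    return fields


# Made-up issue for illustration.
issue = {
    "id": "10001",
    "key": "PROJ-7",
    "fields": {"project": {"id": "42", "key": "PROJ", "name": "Project"}},
}
search_fields = build_search_fields(issue, issue["id"], EXTRA_SEARCH_FIELDS)
```

Returning None for a missing key is a design choice of this sketch; it keeps the set of search field names stable across items.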

fetch(category='issue', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the issues from the site.

The method retrieves, from a JIRA site, the issues updated since the given date.

Parameters
  • category – the category of items to fetch

  • from_date – retrieve issues updated from this date

Returns

a generator of issues

fetch_items(category, **kwargs)[source]

Fetch the issues

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns

this backend supports items archive

classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns

this backend supports items resuming

static metadata_category(item)[source]

Extracts the category from a Jira item.

This backend only generates one type of item which is ‘issue’.

static metadata_id(item)[source]

Extracts the identifier from a Jira item.

static metadata_updated_on(item)[source]

Extracts the update time from a Jira item.

The timestamp used is extracted from ‘updated’ field. This date is converted to UNIX timestamp format taking into account the timezone of the date.

Parameters

item – item generated by the backend

Returns

a UNIX timestamp
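
As an illustration of a timezone-aware conversion, the sketch below parses a date string with a UTC offset and turns it into a UNIX timestamp using the standard library. The date format and the helper name updated_to_unix are assumptions; the backend may rely on a different parser.

```python
from datetime import datetime


def updated_to_unix(updated):
    """Convert a string such as '2016-05-09T13:10:59.000+0200' to UNIX time."""
    # %z parses the UTC offset, so the result accounts for the time zone.
    return datetime.strptime(updated, "%Y-%m-%dT%H:%M:%S.%f%z").timestamp()


# The same instant expressed in two different time zones.
ts_utc = updated_to_unix("2016-05-09T11:10:59.000+0000")
ts_cest = updated_to_unix("2016-05-09T13:10:59.000+0200")
```

Both calls yield the same value because the offset is applied before the conversion.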

static parse_issues(raw_page)[source]

Parse a JIRA API raw response.

The method parses the API response retrieving the issues from the received items

Parameters

raw_page – raw API response from which to parse the issues

Returns

a generator of issues

version = '0.14.0'
class perceval.backends.core.jira.JiraClient(url, project, user, password, cert, max_results=100, archive=None, from_archive=False, ssl_verify=True)[source]

Bases: perceval.client.HttpClient

JIRA API client.

This class implements a simple client to retrieve issues from any JIRA issue tracking system.

Parameters
  • url – URL of the JIRA server

  • project – filter issues by project

  • user – JIRA’s username

  • password – JIRA’s password

  • cert – SSL certificate

  • max_results – max number of results per query

  • archive – an archive to store/read fetched data

  • from_archive – it tells whether to write/read the archive

  • ssl_verify – enable/disable SSL verification

Raises

HTTPError – when an error occurs doing the request

PEXPAND = 'expand'
PJQL = 'jql'
PMAX_RESULTS = 'maxResults'
PSTART_AT = 'startAt'
RCOMMENT = 'comment'
RESOURCE = 'rest/api'
RFIELD = 'field'
RISSUE = 'issue'
RSEARCH = 'search'
VERSION_API = '2'
VEXPAND = 'renderedFields,transitions,operations,changelog'
get_comments(issue_id)[source]

Retrieve all the comments of a given issue.

Parameters

issue_id – ID of the issue

get_fields()[source]

Retrieve all the fields available.

get_issues(from_date)[source]

Retrieve all the issues from a given date.

Parameters

from_date – obtain issues updated since this date

get_items(from_date, url, expand_fields=True)[source]

Retrieve all the items from a given date.

Parameters
  • url – endpoint API url

  • from_date – obtain items updated since this date

  • expand_fields – if True, it includes the expand fields in the payload
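
The parameter constants above hint at how paginated requests could be assembled. The following sketch builds one payload per page; page_payloads is a hypothetical helper, the JQL string is illustrative only, and the way JiraClient actually builds its queries may differ.

```python
# Parameter names and values copied from JiraClient.
PJQL = "jql"
PSTART_AT = "startAt"
PMAX_RESULTS = "maxResults"
PEXPAND = "expand"
VEXPAND = "renderedFields,transitions,operations,changelog"


def page_payloads(jql, total, max_results=100, expand_fields=True):
    """Yield one request payload per page of results (hypothetical helper)."""
    for start_at in range(0, total, max_results):
        payload = {PJQL: jql, PSTART_AT: start_at, PMAX_RESULTS: max_results}
        if expand_fields:
            payload[PEXPAND] = VEXPAND
        yield payload


# The JQL string is an assumption made for this example.
jql = 'project = "PROJ" ORDER BY updated ASC'
pages = list(page_payloads(jql, total=250))
```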

class perceval.backends.core.jira.JiraCommand(*args, debug=False)[source]

Bases: perceval.backend.BackendCommand

Class to run Jira backend from the command line.

BACKEND

alias of perceval.backends.core.jira.Jira

classmethod setup_cmd_parser()[source]

Returns the Jira argument parser.

perceval.backends.core.jira.filter_custom_fields(fields)[source]

Filter custom fields from a given set of fields.

Parameters

fields – set of fields

Returns

an object with the filtered custom fields

perceval.backends.core.jira.map_custom_field(custom_fields, fields)[source]

Add extra information for custom fields.

Parameters
  • custom_fields – set of custom fields with the extra information

  • fields – fields of the issue where to add the extra information

Returns

a set of items with the extra information mapped
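
One plausible reading of these two helpers is sketched below. The function names filter_custom and map_custom, the shape of the returned data, and the sample fields are assumptions made for illustration; the real functions may return different structures.

```python
def filter_custom(fields):
    """Return the custom fields from a list of field descriptions, keyed by id."""
    return {field["id"]: field for field in fields if field.get("custom")}


def map_custom(custom_fields, issue_fields):
    """Attach the id and name of each custom field to its value in an issue."""
    return {
        key: {"id": key, "name": custom_fields[key]["name"], "value": value}
        for key, value in issue_fields.items()
        if key in custom_fields
    }


# Made-up field descriptions and issue fields for illustration.
fields = [
    {"id": "summary", "name": "Summary", "custom": False},
    {"id": "customfield_10010", "name": "Sprint", "custom": True},
]
issue_fields = {"summary": "Fix login", "customfield_10010": "Sprint 3"}

custom = filter_custom(fields)
mapped = map_custom(custom, issue_fields)
```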

perceval.backends.core.launchpad module

class perceval.backends.core.launchpad.Launchpad(distribution, package=None, items_per_page=75, sleep_time=300, tag=None, archive=None, ssl_verify=True)[source]

Bases: perceval.backend.Backend

Launchpad backend for Perceval.

This class fetches the issues stored in Launchpad.

Parameters
  • distribution – Launchpad distribution

  • package – Distribution package

  • items_per_page – number of items in a retrieved page

  • sleep_time – time (in seconds) to sleep in case of connection problems

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

  • ssl_verify – enable/disable SSL verification

CATEGORIES = ['issue']

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

fetch(category='issue', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the issues from a project (distribution/package).

The method retrieves, from a Launchpad project, the issues updated since the given date.

Parameters
  • category – the category of items to fetch

  • from_date – obtain issues updated since this date

Returns

a generator of issues

fetch_items(category, **kwargs)[source]

Fetch the issues

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns

this backend supports items archive

classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns

this backend supports items resuming

static metadata_category(item)[source]

Extracts the category from a Launchpad item.

This backend only generates one type of item which is ‘issue’.

static metadata_id(item)[source]

Extracts the identifier from a Launchpad item.

static metadata_updated_on(item)[source]

Extracts the update time from a Launchpad item.

The timestamp used is extracted from ‘date_last_updated’ field. This date is converted to UNIX timestamp format. As Launchpad dates are in UTC in ISO 8601 (e.g., ‘2008-03-26T01:43:15.603905+00:00’) the conversion is straightforward.

Parameters

item – item generated by the backend

Returns

a UNIX timestamp
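
For a date in the ISO 8601 form quoted above, the conversion can be sketched with the standard library alone. The helper name iso_to_unix is hypothetical, and the backend may use a different parser.

```python
from datetime import datetime


def iso_to_unix(date_str):
    """Convert an ISO 8601 string with a UTC offset to a UNIX timestamp."""
    # fromisoformat keeps the offset, so timestamp() needs no extra adjustment.
    return datetime.fromisoformat(date_str).timestamp()


ts = iso_to_unix("2008-03-26T01:43:15.603905+00:00")
```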

search_fields(item)[source]

Add search fields to an item.

It adds the values of metadata_id plus additional values depending on the item category. For the categories issue and pull_request, the search fields include the issue/pull request number, labels, state and the name of the milestone. For the category repository, license and language are set as search fields.

Parameters

item – the item to extract the search fields values

Returns

a dict of search fields

version = '0.8.1'
class perceval.backends.core.launchpad.LaunchpadClient(distribution, package=None, items_per_page=75, sleep_time=300, archive=None, from_archive=False, ssl_verify=True)[source]

Bases: perceval.client.HttpClient

Client for retrieving information from Launchpad API

Parameters
  • distribution – Launchpad distribution

  • package – Distribution package

  • items_per_page – number of items in a retrieved page

  • sleep_time – time (in seconds) to sleep in case of connection problems

  • archive – an archive to store/read fetched data

  • from_archive – it tells whether to write/read the archive

  • ssl_verify – enable/disable SSL verification

HCONTENT_TYPE = 'Content-type'
PMODIFIED_SINCE = 'modified_since'
POMIT_DULPLICATES = 'omit_duplicates'
PORDER_BY = 'order_by'
PSTATUS = 'status'
PWS_OP = 'ws.op'
PWS_SIZE = 'ws.size'
PWS_START = 'ws.start'
RBUGS = 'bugs'
RSOURCE = '+source'
VCONTENT_TYPE = 'application/json'
VDATE_LAST_MODIFIED = 'date_last_updated'
VOMIT_DUPLICATES = 'false'
VSEARCH_TASKS = 'searchTasks'
VSTATUS = ['New', 'Incomplete', 'Opinion', 'Invalid', "Won't Fix", 'Expired', 'Confirmed', 'Triaged', 'In Progress', 'Fix Committed', 'Fix Released', 'Incomplete (with response)', 'Incomplete (without response)']
issue(issue_id)[source]

Get the issue data by its ID

issue_collection(issue_id, collection_name)[source]

Get a collection list of a given issue

issues(start=None)[source]

Get the issues from pagination

user(user_name)[source]

Get the user data by URL

user_name(user_link)[source]

Get user name from link

class perceval.backends.core.launchpad.LaunchpadCommand(*args, debug=False)[source]

Bases: perceval.backend.BackendCommand

Class to run Launchpad backend from the command line.

BACKEND

alias of perceval.backends.core.launchpad.Launchpad

classmethod setup_cmd_parser()[source]

Returns the Launchpad argument parser.

perceval.backends.core.mattermost module

class perceval.backends.core.mattermost.Mattermost(url, channel, api_token, max_items=60, tag=None, archive=None, team=None, sleep_for_rate=False, min_rate_to_sleep=10, sleep_time=1, ssl_verify=True)[source]

Bases: perceval.backend.Backend

Mattermost backend.

This class retrieves the posts sent to a Mattermost channel. To access the server an API token is required, which must have enough permissions to read from the given channel.

To initialize this class the URL of the server must be provided. The origin of the data will be set using this URL plus the channel from which the data is obtained (e.g., https://mattermost.example.com/abcdefg). If channel and team names are used instead of a channel ID, the origin will take the form URL plus team plus channel.

The team parameter is only required if providing a channel name instead of a channel ID.

Parameters
  • url – URL of the server

  • channel – identifier/name of the channel where data will be fetched

  • api_token – token or key needed to use the API

  • max_items – maximum number of messages requested in a single query

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

  • team – (optional) The name of the team the channel is in

  • sleep_for_rate – sleep until rate limit is reset

  • min_rate_to_sleep – minimum rate needed to sleep until it is reset

  • sleep_time – time (in seconds) to sleep in case of connection problems

  • ssl_verify – enable/disable SSL verification

CATEGORIES = ['post']

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

EXTRA_SEARCH_FIELDS = {'channel_id': ['channel_data', 'id'], 'channel_name': ['channel_data', 'name']}

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

{

‘key-1’: value-1, ‘key-2’: value-2, ‘key-3’: value-3

}

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

{

‘project_id’: [‘fields’, ‘project’, ‘id’], ‘project_key’: [‘fields’, ‘project’, ‘key’], ‘project_name’: [‘fields’, ‘project’, ‘name’]

}

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.

fetch(category='post', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the posts from the channel.

This method fetches the posts stored on the channel that were sent since the given date.

Parameters
  • category – the category of items to fetch

  • from_date – obtain posts sent since this date

Returns

a generator of posts

fetch_items(category, **kwargs)[source]

Fetch the messages.

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns

this backend supports items archive

classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns

this backend does not support items resuming

static metadata_category(item)[source]

Extracts the category from a Mattermost item.

This backend only generates one type of item which is ‘post’.

static metadata_id(item)[source]

Extracts the identifier from a Mattermost item.

static metadata_updated_on(item)[source]

Extracts and converts the update time from a Mattermost item.

The timestamp is extracted from ‘update_at’ field. This field is already a UNIX timestamp but it needs to be converted to float.

Parameters

item – item generated by the backend

Returns

a UNIX timestamp

static parse_json(raw_json)[source]

Parse a Mattermost JSON stream.

The method parses a JSON stream and returns a dict with the parsed data.

Parameters

raw_json – JSON string to parse

Returns

a dict with the parsed data

version = '0.5.0'
class perceval.backends.core.mattermost.MattermostClient(base_url, api_token, max_items=60, sleep_for_rate=False, min_rate_to_sleep=10, sleep_time=1, archive=None, from_archive=False, ssl_verify=True)[source]

Bases: perceval.client.HttpClient, perceval.client.RateLimitHandler

Mattermost API client.

Client for fetching information from a Mattermost server using its REST API.

Parameters
  • base_url – URL of the Mattermost server

  • api_token – token needed to use the API

  • max_items – maximum number of items fetched per request

  • sleep_for_rate – sleep until rate limit is reset

  • min_rate_to_sleep – minimum rate needed to sleep until it will be reset

  • sleep_time – time (in seconds) to sleep in case of connection problems

  • archive – an archive to store/read fetched data

  • from_archive – it tells whether to write/read the archive

  • ssl_verify – enable/disable SSL verification

API_URL = '%(base_url)s/api/v4/%(entrypoint)s'
HAUTHORIZATION = 'Authorization'
PCHANNEL_ID = 'channel_id'
PPAGE = 'page'
PPER_PAGE = 'per_page'
RCHANNELS = 'channels'
RCHANNELS_BY_NAME = 'teams/name/%s/channels/name/%s'
RPOSTS = 'posts'
RUSERS = 'users'
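The URL constants above are printf-style templates. The following is a minimal sketch of how they could be expanded, assuming the client fills them with the % operator. The build_url helper and the server, team and channel names are made up for illustration and are not part of the library.

```python
# Constants copied from MattermostClient; everything else is illustrative.
API_URL = '%(base_url)s/api/v4/%(entrypoint)s'
RCHANNELS_BY_NAME = 'teams/name/%s/channels/name/%s'
RPOSTS = 'posts'


def build_url(base_url, entrypoint):
    """Expand API_URL for a given server and entry point (hypothetical helper)."""
    return API_URL % {'base_url': base_url, 'entrypoint': entrypoint}


# Hypothetical server, team and channel names.
posts_url = build_url('https://mattermost.example.com', RPOSTS)
channel_url = build_url('https://mattermost.example.com',
                        RCHANNELS_BY_NAME % ('my-team', 'town-square'))
print(posts_url)
print(channel_url)
```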
calculate_time_to_reset()[source]

Number of seconds to wait.

The time is obtained from the difference between the current date and the next date when the token is fully regenerated.
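As a rough illustration of that calculation, the sketch below subtracts the current date from a reset date and clamps the result at zero. The function name and the way the reset date is obtained are assumptions; the real client reads its rate-limit state from the server responses.

```python
from datetime import datetime, timedelta, timezone


def seconds_to_reset(reset_at, now=None):
    """Return the seconds left until reset_at, never less than zero."""
    now = now or datetime.now(timezone.utc)
    return max((reset_at - now).total_seconds(), 0.0)


# A fixed "current date" keeps the example deterministic.
now = datetime(2020, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(seconds_to_reset(now + timedelta(seconds=90), now=now))
print(seconds_to_reset(now - timedelta(seconds=5), now=now))
```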

channel(channel)[source]

Fetch the channel information

channel_by_name(team: str, channel: str)[source]

Fetch the channel information by channel/team name

This provides identical information to the channel() method, with the key difference of looking up a channel by channel name and team name instead of by the channel ID.

fetch(url, payload=None, headers=None, method='GET', stream=False, auth=None)[source]

Override fetch method to handle API rate limit.

Parameters
  • url – link to the resource

  • payload – payload of the request

  • headers – headers of the request

  • method – type of request call (GET or POST)

  • stream – defer downloading the response body until the response content is available

  • auth – auth of the request

Returns

a response object

posts(channel, page=None)[source]

Fetch the history of a channel.

static sanitize_for_archive(url, headers, payload)[source]

Sanitize payload of a HTTP request by removing the token information before storing/retrieving archived items.

Parameters
  • url – HTTP url request

  • headers – HTTP headers request

  • payload – HTTP payload request

Returns

url, headers and the sanitized payload
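A minimal sketch of this kind of sanitizing, assuming the token travels in the header named by HAUTHORIZATION ('Authorization'). The function below is illustrative and is not the library's code; it works on a copy so the live request keeps its credentials.

```python
HAUTHORIZATION = 'Authorization'


def sanitize(url, headers, payload):
    """Return the request parts with the authorization header removed."""
    clean_headers = dict(headers or {})      # copy, do not mutate the caller's dict
    clean_headers.pop(HAUTHORIZATION, None)  # drop the token if present
    return url, clean_headers, payload


request_headers = {'Authorization': 'Bearer not-a-real-token', 'Accept': 'application/json'}
sanitized = sanitize('https://mattermost.example.com/api/v4/posts', request_headers, None)
print(sanitized)
```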

user(user)[source]

Fetch user data.

class perceval.backends.core.mattermost.MattermostCommand(*args, debug=False)[source]

Bases: perceval.backend.BackendCommand

Class to run Mattermost backend from the command line.

BACKEND

alias of perceval.backends.core.mattermost.Mattermost

DESCRIPTION = 'Can either be called a channel ID, or a channel name.  If a channel name is used, the team name is required. Otherwise, the team argument is ignored.'
classmethod setup_cmd_parser()[source]

Returns the Mattermost argument parser.

perceval.backends.core.mbox module

class perceval.backends.core.mbox.MBox(uri, dirpath, tag=None, archive=None, ssl_verify=True)[source]

Bases: perceval.backend.Backend

MBox backend.

This class fetches the email messages stored in one or several mbox files. Initialize this class by passing the directory path where the mbox files are stored. The origin of the data will be set to the value of uri.

Parameters
  • uri – URI of the mboxes; typically, the URL of their mailing list

  • dirpath – directory path where the mboxes are stored

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

  • ssl_verify – enable/disable SSL verification

CATEGORIES = ['message']

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

DATE_FIELD = 'Date'
MESSAGE_ID_FIELD = 'Message-ID'
fetch(category='message', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the messages from a set of mbox files.

The method retrieves, from mbox files, the messages stored in these containers.

Parameters
  • category – the category of items to fetch

  • from_date – obtain messages since this date

Returns

a generator of messages

fetch_items(category, **kwargs)[source]

Fetch the messages

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]

Returns whether the backend supports archiving items during the fetch process.

Returns

this backend does not support items archive

classmethod has_resuming()[source]

Returns whether the backend supports resuming the fetch process.

Returns

this backend supports items resuming

static metadata_category(item)[source]

Extracts the category from a MBox item.

This backend only generates one type of item which is ‘message’.

static metadata_id(item)[source]

Extracts the identifier from a MBox item.

static metadata_updated_on(item)[source]

Extracts the update time from a MBox item.

The timestamp used is extracted from ‘Date’ field in its several forms. This date is converted to UNIX timestamp format.

Parameters

item – item generated by the backend

Returns

a UNIX timestamp
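The conversion from a ‘Date’ header to a UNIX timestamp can be sketched with the standard library alone. This is only an illustration: the helper name is made up, and the backend itself may accept more date variants than email.utils does.

```python
from email.utils import parsedate_to_datetime


def date_to_timestamp(date_header):
    """Convert an RFC 2822 style Date header into a UNIX timestamp (float)."""
    return parsedate_to_datetime(date_header).timestamp()


utc_ts = date_to_timestamp('Thu, 01 Jan 1970 00:00:10 +0000')
cet_ts = date_to_timestamp('Thu, 01 Jan 1970 01:00:10 +0100')
print(utc_ts, cet_ts)
```

Both headers describe the same instant, so they map to the same timestamp regardless of the time zone they were written in.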

static parse_mbox(filepath)[source]

Parse a mbox file.

This method parses a mbox file and returns an iterator of dictionaries. Each of these contains an email message.

Parameters

filepath – path of the mbox to parse

Returns

a generator of messages; each message is stored in a dictionary of type requests.structures.CaseInsensitiveDict
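To give a feel for what parsing a mbox file involves, here is a sketch built on the standard mailbox module. It is not the backend's implementation: it yields plain dicts instead of CaseInsensitiveDict objects, stores the payload under a made-up 'body' key, and the sample messages are invented.

```python
import mailbox
import os
import tempfile


def parse_mbox_sketch(filepath):
    """Yield one dict per message with its headers and a 'body' key."""
    box = mailbox.mbox(filepath, create=False)
    try:
        for message in box:
            item = {key: value for key, value in message.items()}
            item['body'] = message.get_payload()
            yield item
    finally:
        box.close()


SAMPLE = (
    "From alice@example.com Thu Jan  1 00:00:10 1970\n"
    "From: alice@example.com\n"
    "Message-ID: <1@example.com>\n"
    "Date: Thu, 01 Jan 1970 00:00:10 +0000\n"
    "Subject: Hello\n"
    "\n"
    "First message\n"
    "\n"
    "From bob@example.com Thu Jan  1 00:00:20 1970\n"
    "From: bob@example.com\n"
    "Message-ID: <2@example.com>\n"
    "Date: Thu, 01 Jan 1970 00:00:20 +0000\n"
    "Subject: Re: Hello\n"
    "\n"
    "Second message\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.mbox')
    with open(path, 'w') as fd:
        fd.write(SAMPLE)
    messages = list(parse_mbox_sketch(path))

print([m['Message-ID'] for m in messages])
```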

version = '0.13.1'
class perceval.backends.core.mbox.MBoxArchive(filepath)[source]

Bases: object

Class to access a mbox archive.

MBOX archives can be stored in plain or compressed files (gzip, bz2 or zip).

Parameters

filepath – path to the mbox file
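One way to tell plain files from compressed ones is to look at the first bytes of the file. The sketch below does that with the well-known gzip, bz2 and zip signatures. It is an assumption about how such a check could work, not a description of how MBoxArchive decides, and the helper name and file names are hypothetical.

```python
import gzip
import os
import tempfile

# Leading bytes ("magic numbers") of the supported compressed formats.
MAGIC = {
    b'\x1f\x8b': 'gz',
    b'BZh': 'bz2',
    b'PK\x03\x04': 'zip',
}


def detect_compression(filepath):
    """Return 'gz', 'bz2' or 'zip' for compressed files, None otherwise."""
    with open(filepath, 'rb') as fd:
        head = fd.read(4)
    for magic, kind in MAGIC.items():
        if head.startswith(magic):
            return kind
    return None


with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, 'list.mbox')
    packed = os.path.join(tmp, 'list.mbox.gz')
    with open(plain, 'wb') as fd:
        fd.write(b'From alice@example.com Thu Jan  1 00:00:10 1970\n')
    with gzip.open(packed, 'wb') as fd:
        fd.write(b'From alice@example.com Thu Jan  1 00:00:10 1970\n')
    plain_type = detect_compression(plain)
    packed_type = detect_compression(packed)

print(plain_type, packed_type)
```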

property compressed_type
property container
property filepath
is_compressed()[source]
class perceval.backends.core.mbox.MBoxCommand(*args, debug=False)[source]

Bases: perceval.backend.BackendCommand

Class to run MBox backend from the command line.

BACKEND

alias of perceval.backends.core.mbox.MBox

classmethod setup_cmd_parser()[source]

Returns the MBox argument parser.

class perceval.backends.core.mbox.MailingList(uri, dirpath)[source]

Bases: object

Manage mailing list archives.

This class gives access to the local mbox archives that a mailing list manages.

Parameters
  • uri – URI of the mailing list, usually its URL

  • dirpath – path to the mboxes archives

property mboxes

Get the mboxes managed by this mailing list.

Returns the archives sorted by name.

Returns

a list of MBoxArchive objects

perceval.backends.core.mediawiki module

class perceval.backends.core.mediawiki.MediaWiki(url, tag=None, archive=None, ssl_verify=True)[source]

Bases: perceval.backend.Backend

MediaWiki backend for Perceval.

This class retrieves the wiki pages and edits from a MediaWiki site. To initialize this class the URL must be provided. The origin of the data will be set to this URL.

It uses different APIs to support MediaWiki versions before and after 1.27. The pre-1.27 approach performs better, but it needs different logic for full and incremental retrieval.

In pre-1.27 versions, the incremental approach uses the recent changes API, which only covers the last MAX_RECENT_DAYS days. If the from_date used is older, all the pages must be retrieved and the consumer of the items must filter them itself.

Both approaches return a common format: a page with all its revisions. They differ only in how the list of pages is generated.

The page and revision data downloaded are the standard ones. More data could be gathered using additional properties.

Deleted pages are not analyzed.
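The choice between incremental and full retrieval on older MediaWiki versions can be pictured as a simple date comparison. In the sketch below the value of MAX_RECENT_DAYS and the function name are assumptions made for the example; check the backend source for the real constant.

```python
from datetime import datetime, timedelta, timezone

MAX_RECENT_DAYS = 30  # assumed value, for illustration only


def can_use_recent_changes(from_date, now=None, max_days=MAX_RECENT_DAYS):
    """True when from_date falls inside the window the recent changes API covers."""
    now = now or datetime.now(timezone.utc)
    return now - from_date <= timedelta(days=max_days)


now = datetime(2020, 6, 30, tzinfo=timezone.utc)
recent = can_use_recent_changes(datetime(2020, 6, 15, tzinfo=timezone.utc), now=now)
too_old = can_use_recent_changes(datetime(2019, 1, 1, tzinfo=timezone.utc), now=now)
print(recent, too_old)
```

When the second case applies, every page has to be retrieved and the caller is left to discard the pages it has already seen.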

Parameters
  • url – MediaWiki url

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

  • ssl_verify – enable/disable SSL verification

CATEGORIES = ['page']

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

fetch(category='page', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), reviews_api=False)[source]

Fetch the pages from the backend url.

The method retrieves, from a MediaWiki url, the wiki pages.

Parameters
  • category – the category of items to fetch

  • from_date – obtain pages updated since this date

  • reviews_api – use the reviews API available in MediaWiki >= 1.27

Returns

a generator of pages

fetch_items(category, **kwargs)[source]

Fetch the pages

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]

Returns whether the backend supports archiving items during the fetch process.

Returns

this backend supports items archive

classmethod has_resuming()[source]

Returns whether the backend supports resuming the fetch process.

Returns

this backend does not support items resuming

static metadata_category(item)[source]

Extracts the category from a MediaWiki item.

This backend only generates one type of item which is ‘page’.

static metadata_id(item)[source]

Extracts the identifier from a MediaWiki page.

static metadata_updated_on(item)[source]

Extracts the update field from a MediaWiki item.

The timestamp is extracted from ‘update’ field. This date is a UNIX timestamp but needs to be converted to a float value.

Parameters

item – item generated by the backend

Returns

a UNIX timestamp

version = '0.11.0'
class perceval.backends.core.mediawiki.MediaWikiClient(url, archive=None, from_archive=False, ssl_verify=True)[source]

Bases: perceval.client.HttpClient

MediaWiki API client.

This class implements a simple client to retrieve pages from projects in a MediaWiki node.

Parameters
  • url – URL of the MediaWiki site, e.g. https://wiki.mozilla.org

  • archive – an archive to store/retrieve the fetched data

  • from_archive – define whether the archive is used to store/read data

  • ssl_verify – enable/disable SSL verification

Raises

HTTPError – when an error occurs doing the request

PACTION = 'action'
PAP_CONTINUE = 'apcontinue'
PAP_LIMIT = 'aplimit'
PAP_NAMESPACE = 'apnamespace'
PARV_CONTINUE = 'arvcontinue'
PARV_DIR = 'arvdir'
PARV_LIMIT = 'arvlimit'
PARV_NAMESPACE = 'arvnamespace'
PARV_PROP = 'arvprop'
PARV_START = 'arvstart'
PFORMAT = 'format'
PLIST = 'list'
PMETA = 'meta'
PPAGE_IDS = 'pageids'
PPROP = 'prop'
PRC_CONTINUE = 'rccontinue'
PRC_LIMIT = 'rclimit'
PRC_NAMESPACE = 'rcnamespace'
PRC_PROP = 'rcprop'
PRV_DIR = 'rvdir'
PRV_LIMIT = 'rvlimit'
PRV_START = 'rvstart'
PSIPROP = 'siprop'
VALL_PAGES = 'allpages'
VALL_REVISIONS = 'allrevisions'
VIDS = 'ids'
VJSON = 'json'
VNAMESPACES = 'namespaces'
VNEWER = 'newer'
VQUERY = 'query'
VRC_PROP = 'title|timestamp|ids'
VRECENT_CHANGES = 'recentchanges'
VREVISIONS = 'revisions'
VSITE_INFO = 'siteinfo'
call(params)[source]

Run an API command.

Parameters
  • cgi – cgi command to run on the server

  • params – dict with the HTTP parameters needed to run the given command

get_namespaces()[source]

Retrieve all contents namespaces.

get_pages(namespace, apcontinue='')[source]

Retrieve all pages from a namespace starting from apcontinue.
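The class constants map directly onto MediaWiki query parameters. The sketch below shows how a request for one batch of pages could be assembled from them, assuming the continuation token is only sent when it is not empty. The pages_params helper and the limit of 500 are illustrative choices, not taken from the client.

```python
from urllib.parse import urlencode

# Parameter names and values copied from the MediaWikiClient constants.
PACTION, VQUERY = 'action', 'query'
PLIST, VALL_PAGES = 'list', 'allpages'
PAP_NAMESPACE, PAP_LIMIT, PAP_CONTINUE = 'apnamespace', 'aplimit', 'apcontinue'
PFORMAT, VJSON = 'format', 'json'


def pages_params(namespace, apcontinue='', limit=500):
    """Build the query parameters for one batch of pages (hypothetical helper)."""
    params = {
        PACTION: VQUERY,
        PLIST: VALL_PAGES,
        PAP_NAMESPACE: namespace,
        PAP_LIMIT: limit,
        PFORMAT: VJSON,
    }
    if apcontinue:
        # Resume the listing where the previous batch stopped.
        params[PAP_CONTINUE] = apcontinue
    return params


first = pages_params(0)
following = pages_params(0, apcontinue='Main_Page')
print(urlencode(first))
print(urlencode(following))
```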

get_pages_from_allrevisions(namespaces, from_date=None, arvcontinue=None)[source]
get_recent_pages(namespaces, rccontinue='')[source]

Retrieve recent pages from all namespaces starting from rccontinue.

get_revisions(pageid, last_date=None)[source]
get_version()[source]
class perceval.backends.core.mediawiki.MediaWikiCommand(*args, debug=False)[source]

Bases: perceval.backend.BackendCommand

Class to run MediaWiki backend from the command line.

BACKEND

alias of perceval.backends.core.mediawiki.MediaWiki

classmethod setup_cmd_parser()[source]

Returns the MediaWiki argument parser.

perceval.backends.core.meetup module

class perceval.backends.core.meetup.Meetup(group, api_token, max_items=200, tag=None, archive=None, sleep_for_rate=False, min_rate_to_sleep=1, sleep_time=30, ssl_verify=True)[source]

Bases: perceval.backend.Backend

Meetup backend.

This class fetches the events of a group from the Meetup server. Initialize this class by passing the OAuth2 token needed for authentication with the parameter api_token.

Parameters
  • group – name of the group where data will be fetched

  • api_token – OAuth2 token to access the API

  • max_items – maximum number of items requested on the same query

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

  • sleep_for_rate – sleep until rate limit is reset

  • min_rate_to_sleep – minimum rate needed to sleep until it will be reset

  • sleep_time – time (in seconds) to sleep in case of connection problems

  • ssl_verify – enable/disable SSL verification

CATEGORIES = ['event']

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

CLASSIFIED_FIELDS = [['group', 'topics'], ['event_hosts'], ['rsvps'], ['venue']]

A list of fields that should be considered sensitive or confidential.

Fields listed here will be hidden from fetched items, when this behaviour is requested.

Fields are represented as a list of strings. As items returned are dicts that may contain nested dicts, each entry is a list which stores the “path” or nested dicts keys to the field to remove. For example, [‘my’, ‘classified’, ‘field’] will remove field from item[‘data’][‘my’][‘classified’] dict.

Classified data filtering and archiving are not compatible to prevent data leaks or security issues.
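The path-based removal described above can be sketched as a walk over nested dicts. This is an illustration under the stated shape (fields live under item['data']), not the library's own filtering code; the function name and sample item are invented.

```python
import copy

CLASSIFIED_FIELDS = [['group', 'topics'], ['event_hosts'], ['rsvps'], ['venue']]


def filter_classified_sketch(item, classified_fields=CLASSIFIED_FIELDS):
    """Return a copy of item with the classified fields removed from item['data']."""
    cleaned = copy.deepcopy(item)
    for path in classified_fields:
        node = cleaned['data']
        # Walk down to the dict that holds the last key of the path.
        for key in path[:-1]:
            node = node.get(key)
            if not isinstance(node, dict):
                break  # the path does not exist in this item
        else:
            node.pop(path[-1], None)
    return cleaned


item = {
    'data': {
        'id': '1',
        'group': {'id': 7, 'topics': ['python']},
        'venue': {'city': 'Madrid'},
    }
}
cleaned = filter_classified_sketch(item)
print(cleaned)
```

Working on a deep copy leaves the fetched item untouched, which makes it easy to compare both versions while debugging.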

EXTRA_SEARCH_FIELDS = {'group_id': ['group', 'id'], 'group_name': ['group', 'name']}

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

{

‘key-1’: value-1, ‘key-2’: value-2, ‘key-3’: value-3

}

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

{

‘project_id’: [‘fields’, ‘project’, ‘id’], ‘project_key’: [‘fields’, ‘project’, ‘key’], ‘project_name’: [‘fields’, ‘project’, ‘name’]

}

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.
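Resolving those paths is a matter of following each list of keys through the item data. The sketch below uses the Meetup definition of EXTRA_SEARCH_FIELDS; the helper names, the choice of None for missing values and the sample data are assumptions.

```python
EXTRA_SEARCH_FIELDS = {'group_id': ['group', 'id'], 'group_name': ['group', 'name']}


def find_value(data, path):
    """Follow a list of keys through nested dicts; None when the path is missing."""
    node = data
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def build_search_fields(item_id, data, extra=EXTRA_SEARCH_FIELDS):
    """Combine the default 'item_id' field with the backend's extra fields."""
    fields = {'item_id': item_id}
    for name, path in extra.items():
        fields[name] = find_value(data, path)
    return fields


event = {'id': 'abc123', 'group': {'id': 42, 'name': 'Example Group'}}
search_fields = build_search_fields(event['id'], event)
print(search_fields)
```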

fetch(category='event', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), to_date=None, filter_classified=False)[source]

Fetch the events from the server.

This method fetches those events of a group stored on the server that were updated since the given date. Comments and rsvps data are included within each event.

Parameters
  • category – the category of items to fetch

  • from_date – obtain events updated since this date

  • to_date – obtain events updated before this date

  • filter_classified – remove classified fields from the resulting items

Returns

a generator of events

fetch_items(category, **kwargs)[source]

Fetch the events

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]

Returns whether the backend supports archiving items during the fetch process.

Returns

this backend supports items archive

classmethod has_resuming()[source]

Returns whether the backend supports resuming the fetch process.

Returns

this backend supports items resuming

static metadata_category(item)[source]

Extracts the category from a Meetup item.

This backend only generates one type of item which is ‘event’.

static metadata_id(item)[source]

Extracts the identifier from a Meetup item.

static metadata_updated_on(item)[source]

Extracts and converts the update time from a Meetup item.

The timestamp is extracted from ‘updated’ field and converted to a UNIX timestamp.

Parameters

item – item generated by the backend

Returns

a UNIX timestamp

static parse_json(raw_json)[source]

Parse a Meetup JSON stream.

The method parses a JSON stream and returns a list with the parsed data.

Parameters

raw_json – JSON string to parse

Returns

a list with the parsed data

version = '0.17.0'
class perceval.backends.core.meetup.MeetupClient(api_token, max_items=200, sleep_for_rate=False, min_rate_to_sleep=1, sleep_time=30, archive=None, from_archive=False, ssl_verify=True)[source]

Bases: perceval.client.HttpClient, perceval.client.RateLimitHandler

Meetup API client.

Client for fetching information from the Meetup server using its REST API v3.

Parameters
  • api_token – OAuth2 token needed to access the API

  • max_items – maximum number of items per request

  • sleep_for_rate – sleep until rate limit is reset

  • min_rate_to_sleep – minimum rate needed to sleep until it will be reset

  • sleep_time – time (in seconds) to sleep in case of connection problems

  • archive – an archive to store/read fetched data

  • from_archive – it tells whether to write/read the archive

  • ssl_verify – enable/disable SSL verification

EXTRA_STATUS_FORCELIST = [429]
PFIELDS = 'fields'
PKEY_OAUTH2 = 'Authorization'
PORDER = 'order'
PPAGE = 'page'
PRESPONSE = 'response'
PSCROLL = 'scroll'
PSTATUS = 'status'
RCOMMENTS = 'comments'
REVENTS = 'events'
RRSVPS = 'rsvps'
VEVENT_FIELDS = ['event_hosts', 'featured', 'group_topics', 'plain_text_description', 'rsvpable', 'series']
VRESPONSE = ['yes', 'no']
VRSVP_FIELDS = ['attendance_status']
VSTATUS = ['cancelled', 'upcoming', 'past', 'proposed', 'suggested']
VUPDATED = 'updated'
calculate_time_to_reset()[source]

Number of seconds to wait. They are contained in the rate limit reset header.

comments(group, event_id)[source]

Fetch the comments of a given event.

events(group, from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the events pages of a given group.

rsvps(group, event_id)[source]

Fetch the rsvps of a given event.

static sanitize_for_archive(url, headers, payload)[source]

Sanitize payload of a HTTP request by removing the token information before storing/retrieving archived items.

Parameters
  • url – HTTP url request

  • headers – HTTP headers request

  • payload – HTTP payload request

Returns

url, headers and the sanitized payload

class perceval.backends.core.meetup.MeetupCommand(*args, debug=False)[source]

Bases: perceval.backend.BackendCommand

Class to run Meetup backend from the command line.

BACKEND

alias of perceval.backends.core.meetup.Meetup

classmethod setup_cmd_parser()[source]

Returns the Meetup argument parser.

perceval.backends.core.nntp module

class perceval.backends.core.nntp.NNTP(host, group, tag=None, archive=None)[source]

Bases: perceval.backend.Backend

NNTP backend.

This class fetches the articles published on a news group using NNTP. It is initialized by giving the host and the name of the news group.

Parameters
  • host – host

  • group – name of the group

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

CATEGORIES = ['article']

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

EXTRA_SEARCH_FIELDS = {'newsgroups': ['Newsgroups']}

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

{

‘key-1’: value-1, ‘key-2’: value-2, ‘key-3’: value-3

}

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

{

‘project_id’: [‘fields’, ‘project’, ‘id’], ‘project_key’: [‘fields’, ‘project’, ‘key’], ‘project_name’: [‘fields’, ‘project’, ‘name’]

}

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.

fetch(category='article', offset=1)[source]

Fetch articles posted on a news group.

This method fetches those messages or articles published on a news group starting on the given offset.

Parameters
  • category – the category of items to fetch

  • offset – obtain messages from this offset

Returns

a generator of articles

fetch_items(category, **kwargs)[source]

Fetch the articles

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]

Returns whether the backend supports archiving items during the fetch process.

Returns

this backend supports items archive

classmethod has_resuming()[source]

Returns whether the backend supports resuming the fetch process.

Returns

this backend supports items resuming

metadata(item, filter_classified=False)[source]

NNTP metadata.

This method takes items, overriding metadata decorator, to add extra information related to NNTP.

Parameters
  • item – an item fetched by a backend

  • filter_classified – sets if classified fields were filtered

static metadata_category(item)[source]

Extracts the category from a NNTP item.

This backend only generates one type of item which is ‘article’.

static metadata_id(item)[source]

Extracts the identifier from a NNTP item.

static metadata_updated_on(item)[source]

Extracts the update time from a NNTP item.

The timestamp is extracted from ‘Date’ field and converted to a UNIX timestamp.

Parameters

item – item generated by the backend

Returns

a UNIX timestamp

static parse_article(raw_article)[source]

Parse a NNTP article.

This method parses a NNTP article stored in a string object and returns a dictionary.

Parameters

raw_article – NNTP article string

Returns

a dictionary of type requests.structures.CaseInsensitiveDict

Raises

ParseError – when an error is found parsing the article
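An article is a set of headers followed by a body, so the standard email package is enough to sketch the parsing step. The function below is a simplified stand-in: it returns a plain dict rather than a CaseInsensitiveDict, stores the payload under a made-up 'body' key, and does not raise ParseError. The sample article is invented.

```python
import email


def parse_article_sketch(raw_article):
    """Turn a raw article string into a dict of headers plus a 'body' key."""
    message = email.message_from_string(raw_article)
    article = {key: value for key, value in message.items()}
    article['body'] = message.get_payload()
    return article


RAW = (
    "Message-ID: <1@news.example.com>\n"
    "Newsgroups: comp.lang.python\n"
    "Date: Thu, 01 Jan 1970 00:00:10 +0000\n"
    "Subject: Hello\n"
    "\n"
    "Hello, world\n"
)

article = parse_article_sketch(RAW)
print(article['Newsgroups'], article['Message-ID'])
```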

version = '0.6.0'
class perceval.backends.core.nntp.NNTPCommand(*args, debug=False)[source]

Bases: perceval.backend.BackendCommand

Class to run NNTP backend from the command line.

BACKEND

alias of perceval.backends.core.nntp.NNTP

classmethod setup_cmd_parser()[source]

Returns the NNTP argument parser.

class perceval.backends.core.nntp.NNTTPClient(host, archive=None, from_archive=False)[source]

Bases: object

NNTP client

Parameters
  • host – host

  • group – name of the group

  • archive – an archive to store/read fetched data

  • from_archive – it tells whether to write/read the archive

ARTICLE = 'article'
GROUP = 'group'
OVER = 'over'
article(article_id)[source]

Fetch article data

Parameters

article_id – id of the article to fetch

group(group_name)[source]

Fetch group data

Parameters

group_name – name of the group

over(offset)[source]

Fetch messages data

Parameters

offset – a tuple representing the offset to retrieve

quit()[source]

perceval.backends.core.pagure module

class perceval.backends.core.pagure.Pagure(namespace=None, repository=None, api_token=None, tag=None, archive=None, max_retries=5, sleep_time=1, max_items=100, ssl_verify=True)[source]

Bases: perceval.backend.Backend

Pagure backend for Perceval.

This class fetches the issues stored in a Pagure repository.

Parameters
  • namespace – Pagure namespace

  • repository – Pagure repository

  • api_token – Pagure API token to access the API

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

  • max_retries – number of max retries to a data source before raising a RetryError exception

  • max_items – max number of category items (e.g., issues, pull requests) per query

  • sleep_time – time to sleep in case of connection problems

  • ssl_verify – enable/disable SSL verification

CATEGORIES = ['issue']

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

fetch(category='issue', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), to_date=datetime.datetime(2100, 1, 1, 0, 0, tzinfo=tzutc()), filter_classified=False)[source]

Fetch the issues from the repository.

The method retrieves, from a Pagure repository, the issues updated since/until the given date.

Parameters
  • category – the category of items to fetch

  • from_date – obtain issues updated since this date

  • to_date – obtain issues until a specific date (included)

  • filter_classified – remove classified fields from the resulting items

Returns

a generator of issues

fetch_items(category, **kwargs)[source]

Fetch the items (issues)

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]

Returns whether the backend supports archiving items during the fetch process.

Returns

this backend supports items archive

classmethod has_resuming()[source]

Returns whether the backend supports resuming the fetch process.

Returns

this backend supports items resuming

static metadata_category(item)[source]

Extracts the category from a Pagure item.

This backend generates one type of item which is ‘issue’.

static metadata_id(item)[source]

Extracts the identifier from a Pagure item.

static metadata_updated_on(item)[source]

Extracts the update time from a Pagure item.

The timestamp used is extracted from ‘last_updated’ field. This date is converted to UNIX timestamp format. As Pagure dates are in timestamp format the conversion is straightforward.

Parameters

item – item generated by the backend

Returns

a UNIX timestamp

search_fields(item)[source]

Add search fields to an item.

It adds the values of metadata_id plus the namespace and repo.

Parameters

item – the item to extract the search fields values

Returns

a dict of search fields

version = '0.1.2'
class perceval.backends.core.pagure.PagureClient(namespace, repository, token, sleep_time=1, max_retries=5, max_items=100, archive=None, from_archive=False, ssl_verify=True)[source]

Bases: perceval.client.HttpClient

Client for retrieving information from Pagure API

Parameters
  • namespace – Pagure namespace

  • repository – Pagure repository

  • token – Pagure API token to access the API

  • sleep_time – time to sleep in case of connection problems

  • max_retries – number of max retries to a data source before raising a RetryError exception

  • max_items – max number of category items per query

  • archive – collect issues already retrieved from an archive

  • from_archive – it tells whether to write/read the archive

  • ssl_verify – enable/disable SSL verification

HAUTHORIZATION = 'Authorization'
PORDER = 'order'
PPER_PAGE = 'per_page'
PSINCE = 'since'
PSTATUS = 'status'
RISSUES = 'issues'
VORDER_ASC = 'asc'
VSTATUS_ALL = 'all'
fetch(url, payload=None, headers=None)[source]

Fetch the data from a given URL.

Parameters
  • url – link to the resource

  • payload – payload of the request

  • headers – headers of the request

Returns

a response object

fetch_items(path, payload)[source]

Return the items from Pagure API using links pagination

Parameters
  • path – Path from which the item is to be fetched

  • payload – Payload to be added to the request

Returns

a generator of items
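Link-based pagination means each response says where the next page lives, and the client keeps following that link until there is none. The sketch below shows the loop with a fake fetch function. The 'items' and 'next' keys and the URLs are hypothetical and only stand in for whatever the Pagure API actually returns.

```python
def paginate(fetch_page, first_url):
    """Yield the items of each page, following 'next' links until they run out."""
    url = first_url
    while url:
        page = fetch_page(url)
        yield page['items']
        url = page.get('next')


# Fake server responses keyed by URL; a real client would perform HTTP requests.
FAKE_PAGES = {
    'https://pagure.example.com/api/issues?page=1': {
        'items': ['issue-1', 'issue-2'],
        'next': 'https://pagure.example.com/api/issues?page=2',
    },
    'https://pagure.example.com/api/issues?page=2': {
        'items': ['issue-3'],
        'next': None,
    },
}

batches = list(paginate(FAKE_PAGES.__getitem__,
                        'https://pagure.example.com/api/issues?page=1'))
print(batches)
```

Because the function is a generator, pages are only requested as the caller consumes them.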

issues(from_date=None)[source]

Fetch the issues from the repository.

The method retrieves, from a Pagure repository, the issues updated since the given date.

Parameters

from_date – obtain issues updated since this date

Returns

a generator of issues

static sanitize_for_archive(url, headers, payload)[source]

Sanitize payload of a HTTP request by removing the token information before storing/retrieving archived items

Parameters
  • url – HTTP url request

  • headers – HTTP headers request

  • payload – HTTP payload request

Returns

url, headers and the sanitized payload

class perceval.backends.core.pagure.PagureCommand(*args, debug=False)[source]

Bases: perceval.backend.BackendCommand

Class to run Pagure backend from the command line.

BACKEND

alias of perceval.backends.core.pagure.Pagure

classmethod setup_cmd_parser()[source]

Returns the Pagure argument parser.

perceval.backends.core.phabricator module

class perceval.backends.core.phabricator.ConduitClient(base_url, api_token, max_retries=5, sleep_time=1, archive=None, from_archive=False, ssl_verify=True)[source]

Bases: perceval.client.HttpClient

Conduit API Client.

Phabricator uses Conduit as the Phabricator REST API. This class implements some of its methods to retrieve the contents from a Phabricator server.

Parameters
  • base_url – URL of the Phabricator server

  • api_token – token to get access to restricted methods of the API

  • max_retries – number of max retries to a data source before raising a RetryError exception

  • sleep_time – time (in seconds) to sleep in case of connection problems

  • archive – an archive to store/read fetched data

  • from_archive – it tells whether to write/read the archive

  • ssl_verify – enable/disable SSL verification

EXTRA_STATUS_FORCELIST = [429, 502, 503]
MANIPHEST_TASKS = 'maniphest.search'
MANIPHEST_TRANSACTIONS = 'maniphest.gettasktransactions'
PAFTER = 'after'
PATTACHMENTS = 'attachments'
PCONSTRAINTS = 'constraints'
PHAB_PHIDS = 'phid.query'
PHAB_USERS = 'user.query'
PHIDS = 'phids'
PIDS = 'ids'
PMODIFIED_START = 'modifiedStart'
PORDER = 'order'
PPROJECTS = 'projects'
URL = '%(base)s/api/%(method)s'
VOUTDATED = 'outdated'
phids(*phids)[source]

Retrieve data about PHIDs.

Parameters

phids – list of PHIDs

static sanitize_for_archive(url, headers, payload)[source]

Sanitize payload of a HTTP request by removing the token information before storing/retrieving archived items

Parameters
  • url – HTTP url request

  • headers – HTTP headers request

  • payload – HTTP payload request

Returns

url, headers and the sanitized payload

tasks(from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Retrieve tasks.

Parameters

from_date – retrieve tasks that were updated since that date; dates are converted to epoch time.
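
The conversion to epoch time mentioned above can be illustrated with the standard library alone. This is only a sketch: the helper name to_epoch is hypothetical, and treating naive datetimes as UTC is an assumption rather than documented behaviour of the client.

```python
import datetime


def to_epoch(dt):
    """Convert a datetime to whole seconds since 1970-01-01 UTC.

    Naive datetimes are treated as UTC here; that choice is an assumption.
    """
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=datetime.timezone.utc)
    return int(dt.timestamp())


default_start = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)
new_year = datetime.datetime(2020, 1, 1, tzinfo=datetime.timezone.utc)
print(to_epoch(default_start), to_epoch(new_year))
```
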

transactions(*phids)[source]

Retrieve task transactions.

Parameters

phids – list of task identifiers

users(*phids)[source]

Retrieve users.

Parameters

phids – list of user identifiers

exception perceval.backends.core.phabricator.ConduitError(**kwargs)[source]

Bases: perceval.errors.BaseError

Raised when an error occurs using Conduit

message = '%(error)s (code: %(code)s)'
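
The message attribute is a %-style template, which suggests that the keyword arguments given to the exception are interpolated into it. A rough sketch of that interpolation, assuming the keywords are named error and code as in the template (the helper render_message and the sample values are hypothetical; how BaseError actually builds its text is not shown here):

```python
# Template copied from the attribute above.
MESSAGE = '%(error)s (code: %(code)s)'


def render_message(**kwargs):
    """Interpolate keyword arguments into the template (hypothetical helper)."""
    return MESSAGE % kwargs


# Illustrative values only; they are not real Conduit error codes.
text = render_message(error='invalid token', code='ERR-EXAMPLE')
print(text)
```
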
class perceval.backends.core.phabricator.Phabricator(url, api_token, tag=None, archive=None, max_retries=5, sleep_time=1, ssl_verify=True)[source]

Bases: perceval.backend.Backend

Phabricator backend.

This class fetches the tasks stored on a Phabricator server. To initialize it, pass the URL of the server and the API token. The origin of the data will be set to this URL.

Parameters
  • url – URL of the server

  • api_token – token needed to use the API

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

  • max_retries – number of max retries to a data source before raising a RetryError exception

  • sleep_time – time (in seconds) to sleep in case of connection problems

  • ssl_verify – enable/disable SSL verification

CATEGORIES = ['task']

A list of categories that can be fetched by this backend.

Every backend produces items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

fetch(category='task', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the tasks from the server.

This method fetches the tasks stored on the server that were updated since the given date. The transactions data related to each task is also included within them.

Parameters
  • category – the category of items to fetch

  • from_date – obtain tasks updated since this date

Returns

a generator of tasks
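
Because fetch() returns a generator, tasks are consumed lazily as the caller iterates. The sketch below shows one way to drive it. It uses a hypothetical stub in place of a real backend so that it runs without a Phabricator server, and the 'category' and 'data' item keys are assumptions about the item layout.

```python
import datetime


def task_summaries(backend, from_date):
    """Collect (category, data) pairs for every task the backend yields.

    `backend` is any object whose fetch() accepts the documented
    `category` and `from_date` arguments.
    """
    return [(item['category'], item['data'])
            for item in backend.fetch(category='task', from_date=from_date)]


class StubBackend:
    """Hypothetical stand-in for a configured backend instance."""

    def fetch(self, category='task', from_date=None):
        yield {'category': category, 'data': {'id': 1}}
        yield {'category': category, 'data': {'id': 2}}


since = datetime.datetime(2021, 1, 1, tzinfo=datetime.timezone.utc)
summaries = task_summaries(StubBackend(), since)
print(summaries)
```

With Perceval installed, an instance created as Phabricator(url, api_token) could presumably be passed in place of the stub.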

fetch_items(category, **kwargs)[source]

Fetch the tasks

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns

this backend supports items archive

classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns

this backend supports items resuming

static metadata_category(item)[source]

Extracts the category from a Phabricator item.

This backend only generates one type of item which is ‘task’.

static metadata_id(item)[source]

Extracts the identifier from a Phabricator item.

static metadata_updated_on(item)[source]

Extracts and converts the update time from a Phabricator item.

The timestamp is extracted from ‘dateModified’ field. This date is in UNIX timestamp format but needs to be converted to a float number.

Parameters

item – item generated by the backend

Returns

a UNIX timestamp

static parse_phids(results)[source]

Parse a Phabricator PHIDs JSON stream.

This method parses a JSON stream and returns a list iterator. Each item is a dictionary that contains the PHID parsed data.

Parameters

results – JSON to parse

Returns

a generator of parsed PHIDs

static parse_tasks(raw_json)[source]

Parse a Phabricator tasks JSON stream.

The method parses a JSON stream and returns a list iterator. Each item is a dictionary that contains the task parsed data.

Parameters

raw_json – JSON string to parse

Returns

a generator of parsed tasks

static parse_tasks_transactions(raw_json)[source]

Parse a Phabricator tasks transactions JSON stream.

The method parses a JSON stream and returns a dictionary with the parsed transactions.

Parameters

raw_json – JSON string to parse

Returns

a dict with the parsed transactions

static parse_users(raw_json)[source]

Parse a Phabricator users JSON stream.

The method parses a JSON stream and returns a list iterator. Each item is a dictionary that contains the user parsed data.

Parameters

raw_json – JSON string to parse

Returns

a generator of parsed users

version = '0.13.0'
class perceval.backends.core.phabricator.PhabricatorCommand(*args, debug=False)[source]

Bases: perceval.backend.BackendCommand

Class to run Phabricator backend from the command line.

BACKEND

alias of perceval.backends.core.phabricator.Phabricator

classmethod setup_cmd_parser()[source]

Returns the Phabricator argument parser.

perceval.backends.core.pipermail module

class perceval.backends.core.pipermail.Pipermail(url, dirpath, tag=None, archive=None, ssl_verify=True)[source]

Bases: perceval.backends.core.mbox.MBox

Pipermail backend.

This class fetches the email messages stored on a Pipermail archiver. To initialize it, pass the URL of the archiver and the directory path where the mbox files will be fetched and stored. The origin of the data will be set to the value of url.

Parameters
  • url – URL to the Pipermail archiver

  • dirpath – directory path where the mboxes are stored

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

  • ssl_verify – enable/disable SSL verification

CATEGORIES = ['message']

A list of categories that can be fetched by this backend.

Every backend produces items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

fetch(category='message', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the messages from the Pipermail archiver.

The method fetches the mbox files from a remote Pipermail archiver and retrieves the messages stored on them.

Parameters
  • category – the category of items to fetch

  • from_date – obtain messages since this date

Returns

a generator of messages

fetch_items(category, **kwargs)[source]

Fetch the messages

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns

this backend does not support items archive

classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns

this backend supports items resuming

version = '0.11.1'
class perceval.backends.core.pipermail.PipermailCommand(*args, debug=False)[source]

Bases: perceval.backend.BackendCommand

Class to run Pipermail backend from the command line.

BACKEND

alias of perceval.backends.core.pipermail.Pipermail

classmethod setup_cmd_parser()[source]

Returns the Pipermail argument parser.

class perceval.backends.core.pipermail.PipermailList(url, dirpath, ssl_verify=True)[source]

Bases: perceval.backends.core.mbox.MailingList

Manage mailing list archives stored by Pipermail archiver.

This class gives access to the remote and local mbox archives of a mailing list stored by Pipermail. It can also keep them in sync.

Parameters
  • url – URL to the Pipermail archiver for this list

  • dirpath – path to the local mboxes archives

  • ssl_verify – enable/disable SSL verification

fetch(from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the mbox files from the remote archiver.

Stores the archives in the path given during the initialization of this object. Archives that do not have a valid extension are ignored.

Pipermail archives usually include in their file names the date of the stored messages, following the schema year-month. When from_date is given, only the mboxes whose year and month are equal to or later than that date are returned.

Parameters

from_date – fetch archives that store messages dated on or after the given date; only year and month values are compared

Returns

a list of tuples, storing the links and paths of the fetched archives
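
The year-month comparison described above can be sketched as follows. This is not the backend's code: the file-name pattern (a four-digit year, a hyphen and a full English month name, as in 2016-March.txt.gz) and the helper names are assumptions made for illustration.

```python
import datetime

# Month names as they are assumed to appear in archive file names.
MONTHS = {name: number for number, name in enumerate(
    ['January', 'February', 'March', 'April', 'May', 'June', 'July',
     'August', 'September', 'October', 'November', 'December'], start=1)}


def archive_period(filename):
    """Return (year, month) parsed from a file name, or None if it does not match."""
    stem = filename.split('.', 1)[0]
    year, _, month = stem.partition('-')
    if not year.isdigit() or month not in MONTHS:
        return None
    return int(year), MONTHS[month]


def select_archives(filenames, from_date):
    """Keep archives whose (year, month) is equal to or later than from_date's."""
    threshold = (from_date.year, from_date.month)
    selected = []
    for name in filenames:
        period = archive_period(name)
        # The day of from_date is deliberately ignored.
        if period is not None and period >= threshold:
            selected.append(name)
    return selected


names = ['2015-December.txt.gz', '2016-March.txt.gz', '2016-April.txt', 'index.html']
recent = select_archives(names, datetime.datetime(2016, 3, 15))
print(recent)
```

Comparing (year, month) tuples keeps the March archive even though from_date falls in the middle of that month, which matches the statement that only year and month values are compared.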

property mboxes

Get the mboxes managed by this mailing list.

Returns the archives sorted by date in ascending order.

Returns

a list of .MBoxArchive objects

perceval.backends.core.redmine module

class perceval.backends.core.redmine.Redmine(url, api_token=None, max_issues=100, tag=None, archive=None, ssl_verify=True)[source]

Bases: perceval.backend.Backend

Redmine backend.

This class fetches the issues stored on a Redmine server. To initialize it, pass the URL of the server. Some servers require authentication to access some data; if this is the case, pass the API token in the api_token parameter.

Parameters
  • url – URL of the server

  • api_token – token needed to use the API

  • max_issues – maximum number of issues requested on the same query

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

  • ssl_verify – enable/disable SSL verification

CATEGORIES = ['issue']

A list of categories that can be fetched by this backend.

Every backend produces items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

EXTRA_SEARCH_FIELDS = {'project_id': ['project', 'id'], 'project_name': ['project', 'name']}

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

    {
        'key-1': value-1,
        'key-2': value-2,
        'key-3': value-3
    }

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

    {
        'project_id': ['fields', 'project', 'id'],
        'project_key': ['fields', 'project', 'key'],
        'project_name': ['fields', 'project', 'name']
    }

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.
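
The "path" lookup can be sketched as a walk through nested dictionaries. The helper names resolve_path and build_search_fields and the sample item are hypothetical; only the EXTRA_SEARCH_FIELDS value and the item_id default come from the text above.

```python
# Value of EXTRA_SEARCH_FIELDS documented for this backend.
EXTRA_SEARCH_FIELDS = {
    'project_id': ['project', 'id'],
    'project_name': ['project', 'name'],
}


def resolve_path(item, path):
    """Follow a list of keys through nested dicts and return the final value."""
    value = item
    for key in path:
        value = value[key]
    return value


def build_search_fields(item, item_id, extra_fields):
    """Combine the default item_id entry with the backend-specific fields."""
    fields = {'item_id': item_id}
    for name, path in extra_fields.items():
        fields[name] = resolve_path(item, path)
    return fields


# Made-up issue data, shaped only to match the paths above.
issue = {'id': 42, 'project': {'id': 7, 'name': 'demo'}}
search_fields = build_search_fields(issue, '42', EXTRA_SEARCH_FIELDS)
print(search_fields)
```
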

fetch(category='issue', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the issues from the server.

This method fetches the issues stored on the server that were updated since the given date. Data about attachments, journals and watchers (among others) are included within each issue.

Parameters
  • category – the category of items to fetch

  • from_date – obtain issues updated since this date

Returns

a generator of issues

fetch_items(category, **kwargs)[source]

Fetch the issues

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns

this backend supports items archive

classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns

this backend supports items resuming

static metadata_category(item)[source]

Extracts the category from a Redmine item.

This backend only generates one type of item which is ‘issue’.

static metadata_id(item)[source]

Extracts the identifier from a Redmine item.

static metadata_updated_on(item)[source]

Extracts and converts the update time from a Redmine item.

The timestamp is extracted from ‘updated_on’ field and converted to a UNIX timestamp.

Parameters

item – item generated by the backend

Returns

a UNIX timestamp
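
As an illustration of the conversion, the sketch below turns an 'updated_on' string into a UNIX timestamp. The date format (ISO 8601 in UTC with a trailing Z) and the position of the field at the top level of the item are assumptions, and the helper name is hypothetical.

```python
import datetime


def updated_on_to_timestamp(item):
    """Parse item['updated_on'] (assumed 'YYYY-MM-DDTHH:MM:SSZ') into seconds since the epoch."""
    parsed = datetime.datetime.strptime(item['updated_on'], '%Y-%m-%dT%H:%M:%SZ')
    # The trailing 'Z' means UTC, so attach that time zone before converting.
    return parsed.replace(tzinfo=datetime.timezone.utc).timestamp()


stamp = updated_on_to_timestamp({'updated_on': '2020-01-01T00:00:00Z'})
print(stamp)
```
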

static parse_issue_data(raw_json)[source]

Parse a Redmine issue JSON stream.

The method parses a JSON stream and returns a dictionary with the parsed data for the given issue.

Parameters

raw_json – JSON string to parse

Returns

a dictionary with the parsed issue data

static parse_issues(raw_json)[source]

Parse a Redmine issues JSON stream.

The method parses a JSON stream and returns a list iterator. Each item is a dictionary that contains the issue parsed data.

Parameters

raw_json – JSON string to parse

Returns

a generator of parsed issues

static parse_user_data(raw_json)[source]

Parse a Redmine user JSON stream.

The method parses a JSON stream and returns a dictionary with the parsed data for the given user.

Parameters

raw_json – JSON string to parse

Returns

a dictionary with the parsed user data

version = '0.11.0'
class perceval.backends.core.redmine.RedmineClient(base_url, api_token=None, archive=None, from_archive=False, ssl_verify=True)[source]

Bases: perceval.client.HttpClient

Redmine API client.

This class implements a client that retrieves issues from a Redmine server. Redmine servers provide a REST API that returns its results in JSON format.

Parameters
  • base_url – URL of the Redmine server

  • api_token – token to get access to restricted data stored in the server

  • archive – an archive to store/read fetched data

  • from_archive – it tells whether to write/read the archive

  • ssl_verify – enable/disable SSL verification

CATTACHMENTS = 'attachments'
CCHANGESETS = 'changesets'
CCHILDREN = 'children'
CJOURNALS = 'journals'
CJSON = '.json'
CRELATIONS = 'relations'
CWATCHERS = 'watchers'
PINCLUDE = 'include'
PKEY = 'key'
PLIMIT = 'limit'
POFFSET = 'offset'
PSORT = 'sort'
PSTATUS_ID = 'status_id'
PUPDATED_ON = 'updated_on'
RISSUES = 'issues'
RUSERS = 'users'
URL = '%(base)s/%(resource)s'
issue(issue_id)[source]

Get the information of the given issue.

Parameters

issue_id – issue identifier

issues(from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), offset=None, max_issues=100)[source]

Get the information of a list of issues.

Parameters
  • from_date – retrieve issues that were updated since that date; dates are converted to UTC

  • offset – starting position for the search

  • max_issues – maximum number of issues to return per query
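
The offset and max_issues parameters suggest offset-based pagination. A minimal sketch of such a loop is shown below. The fetch_page callable and the in-memory data are hypothetical stand-ins for real requests, and the way the backend itself iterates over pages is not documented here.

```python
def iter_issue_pages(fetch_page, max_issues=100):
    """Yield successive pages until the source returns an empty one.

    `fetch_page(offset, limit)` is a hypothetical callable that returns a
    list with at most `limit` issues, starting at position `offset`.
    """
    offset = 0
    while True:
        page = fetch_page(offset, max_issues)
        if not page:
            break
        yield page
        # Advance by the number of issues actually received.
        offset += len(page)


# In-memory stand-in for a server that holds five issues.
ISSUES = [1, 2, 3, 4, 5]


def fake_fetch_page(offset, limit):
    return ISSUES[offset:offset + limit]


pages = list(iter_issue_pages(fake_fetch_page, max_issues=2))
print(pages)
```
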

static sanitize_for_archive(url, headers, payload)[source]

Sanitize the payload of an HTTP request by removing the token information before storing/retrieving archived items.

Parameters
  • url – HTTP url request

  • headers – HTTP headers request

  • payload – HTTP payload request

Returns

url, headers and the sanitized payload

user(user_id)[source]

Get the information of the given user.

Parameters

user_id – user identifier

class perceval.backends.core.redmine.RedmineCommand(*args, debug=False)[source]

Bases: perceval.backend.BackendCommand

Class to run Redmine backend from the command line.

BACKEND

alias of perceval.backends.core.redmine.Redmine

classmethod setup_cmd_parser()[source]

Returns the Redmine argument parser.

perceval.backends.core.rocketchat module

class perceval.backends.core.rocketchat.RocketChat(url, channel, user_id, api_token, max_items=100, sleep_for_rate=False, min_rate_to_sleep=10, tag=None, archive=None, ssl_verify=True)[source]

Bases: perceval.backend.Backend

Rocket.Chat backend.

This class fetches messages from a channel (room) on a Rocket.Chat server. An API token and a User Id are required to access the server.

Parameters
  • url – server url from where messages are to be fetched

  • channel – name of the channel from where data will be fetched

  • user_id – generated User Id using your Rocket.Chat account

  • api_token – token needed to use the API

  • max_items – maximum number of messages requested on the same query

  • sleep_for_rate – sleep until rate limit is reset

  • min_rate_to_sleep – minimum rate needed to sleep until it will be reset

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

  • ssl_verify – enable/disable SSL verification

CATEGORIES = ['message']

A list of categories that can be fetched by this backend.

Every backend produces items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

EXTRA_SEARCH_FIELDS = {'channel_id': ['channel_info', '_id'], 'channel_name': ['channel_info', 'name']}

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

    {
        'key-1': value-1,
        'key-2': value-2,
        'key-3': value-3
    }

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

    {
        'project_id': ['fields', 'project', 'id'],
        'project_key': ['fields', 'project', 'key'],
        'project_name': ['fields', 'project', 'name']
    }

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.

fetch(category='message', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), filter_classified=False)[source]

Fetch the messages from the channel.

This method fetches the messages stored on the channel that were sent since the given date.

Parameters
  • category – the category of items to fetch

  • from_date – obtain messages sent since this date

  • filter_classified – remove classified fields from the resulting items

Returns

a generator of messages

fetch_items(category, **kwargs)[source]

Fetch the messages.

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns

this backend supports items archive

classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns

this backend supports items resuming

static metadata_category(item)[source]

Extracts the category from a Rocket.Chat item.

This backend only generates one type of item which is ‘message’.

static metadata_id(item)[source]

Extracts the identifier from a Rocket.Chat item.

static metadata_updated_on(item)[source]

Extracts the update time from a Rocket.Chat item.

The timestamp is extracted from ‘ts’ field, and then converted into a UNIX timestamp.

Parameters

item – item generated by the backend

Returns

extracted timestamp

static parse_channel_info(raw_channel_info)[source]

Parse a channel’s information JSON stream.

This method parses a JSON stream, containing the information of the channel, and returns a dict with the parsed data.

Parameters

raw_channel_info – JSON string to parse

Returns

a dict with the parsed channel’s information

static parse_messages(raw_messages)[source]

Parse a channel messages JSON stream.

This method parses a JSON stream, containing the history of a channel. It returns a list of messages and the total messages count in that channel.

Parameters

raw_messages – JSON string to parse

Returns

a tuple with a list of dicts with the parsed messages and a total messages count in the channel.

version = '0.1.0'
class perceval.backends.core.rocketchat.RocketChatClient(url, user_id, api_token, max_items=100, sleep_for_rate=False, min_rate_to_sleep=10, from_archive=False, archive=None, ssl_verify=True)[source]

Bases: perceval.client.HttpClient, perceval.client.RateLimitHandler

Rocket.Chat API client.

Client for fetching information from the Rocket.Chat server using its REST API.

Parameters
  • url – server url from where messages are to be fetched

  • user_id – generated User Id using your Rocket.Chat account

  • api_token – token needed to use the API

  • max_items – maximum number of messages requested on the same query

  • sleep_for_rate – sleep until rate limit is reset

  • min_rate_to_sleep – minimum rate needed to sleep until it will be reset

  • from_archive – it tells whether to write/read the archive

  • archive – archive to store/retrieve items

  • ssl_verify – enable/disable SSL verification
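
One plausible reading of sleep_for_rate and min_rate_to_sleep is sketched below: once the number of remaining requests drops to the threshold, the client either waits for the reset or gives up. The function name, the use of RuntimeError and the behaviour when sleeping is disabled are all assumptions; the actual logic lives in RateLimitHandler and may differ.

```python
def seconds_to_sleep(remaining, seconds_to_reset,
                     sleep_for_rate=False, min_rate_to_sleep=10):
    """Return how long to pause before the next request (0 means go ahead)."""
    if remaining > min_rate_to_sleep:
        # Enough requests left; no need to wait.
        return 0
    if not sleep_for_rate:
        # Assumed behaviour: fail instead of blocking the caller.
        raise RuntimeError(
            'rate limit exhausted; %s seconds to reset' % seconds_to_reset)
    return max(seconds_to_reset, 0)


print(seconds_to_sleep(50, 30, sleep_for_rate=True))
print(seconds_to_sleep(5, 30, sleep_for_rate=True))
```
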

HAUTH_TOKEN = 'X-Auth-Token'
HUSER_ID = 'X-User-Id'
PCHANNEL_NAME = 'roomName'
PCOUNT = 'count'
POLDEST = 'oldest'
RCHANNEL_INFO = 'channels.info'
RCHANNEL_MESSAGES = 'channels.messages'
calculate_time_to_reset()[source]

Number of seconds to wait. They are contained in the rate limit reset header.

channel_info(channel)[source]

Fetch information about a channel.

fetch(url, payload=None, headers=None)[source]

Fetch the data from a given URL.

Parameters
  • url – link to the resource

  • payload – payload of the request

  • headers – headers of the request

Returns

a response object

messages(channel, from_date, offset)[source]

Fetch messages from a channel.

The messages are fetched in ascending order, i.e. from the oldest to the latest, based on the time they were last updated. A query is also passed as a parameter to fetch the messages from a given date.

static sanitize_for_archive(url, headers, payload)[source]

Sanitize the payload of an HTTP request by removing the token and user id information before storing/retrieving archived items.

Parameters
  • url – HTTP url request

  • headers – HTTP headers request

  • payload – HTTP payload request

Returns

url, headers and the sanitized payload
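
A minimal sketch of this kind of sanitizing is shown below. It assumes the credentials travel in the headers named by the HAUTH_TOKEN and HUSER_ID constants, and it is not the method's actual implementation.

```python
# Header names copied from the class constants.
HAUTH_TOKEN = 'X-Auth-Token'
HUSER_ID = 'X-User-Id'


def sanitize_for_archive(url, headers, payload):
    """Return the request triple with the credential headers removed (sketch)."""
    clean_headers = {key: value for key, value in (headers or {}).items()
                     if key not in (HAUTH_TOKEN, HUSER_ID)}
    return url, clean_headers, payload


# Made-up request data for illustration.
request = sanitize_for_archive(
    'https://chat.example.com/api/v1/channels.messages',
    {'X-Auth-Token': 'secret', 'X-User-Id': 'abc123', 'Accept': 'application/json'},
    {'roomName': 'general', 'count': 100},
)
print(request)
```

Building a new dict leaves the caller's headers untouched, so the live request can still be sent with its credentials.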

class perceval.backends.core.rocketchat.RocketChatCommand(*args, debug=False)[source]

Bases: perceval.backend.BackendCommand

Class to run Rocket.Chat backend from the command line.

BACKEND

alias of perceval.backends.core.rocketchat.RocketChat

classmethod setup_cmd_parser()[source]

Returns the Rocket.Chat argument parser.

perceval.backends.core.rss module

class perceval.backends.core.rss.RSS(url, tag=None, archive=None, ssl_verify=True)[source]

Bases: perceval.backend.Backend

RSS backend for Perceval.

This class retrieves the entries from a RSS feed. To initialize this class the URL must be provided. The url will be set as the origin of the data.

Parameters
  • url – RSS url

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

  • ssl_verify – enable/disable SSL verification

CATEGORIES = ['entry']

A list of categories that can be fetched by this backend.

Every backend produces items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

fetch(category='entry')[source]

Fetch the entries from the url.

The method retrieves all entries from a RSS url

Parameters

category – the category of items to fetch

Returns

a generator of entries

fetch_items(category, **kwargs)[source]

Fetch the entries

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving entries on the fetch process.

Returns

this backend supports entries archive

classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns

this backend does not support entries resuming

static metadata_category(item)[source]

Extracts the category from a RSS item.

This backend only generates one type of item which is ‘entry’.

static metadata_id(item)[source]

Extracts the identifier from an entry item.

static metadata_updated_on(item)[source]

Extracts the update time from a RSS item.

The timestamp is extracted from ‘published’ field. This date is a datetime string that needs to be converted to a UNIX timestamp float value.

Parameters

item – item generated by the backend

Returns

a UNIX timestamp
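
A sketch of this conversion with the standard library is shown below. It assumes that 'published' holds an RFC 822 style date, as is common in RSS feeds, and that the field sits at the top level of the item. The helper name is hypothetical and the backend may rely on a different parser.

```python
from email.utils import parsedate_to_datetime


def published_to_timestamp(item):
    """Convert item['published'] (assumed RFC 822 date string) to a float timestamp."""
    return parsedate_to_datetime(item['published']).timestamp()


stamp = published_to_timestamp({'published': 'Wed, 01 Jan 2020 00:00:00 +0000'})
print(stamp)
```
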

classmethod parse_feed(raw_entries)[source]
version = '0.7.0'
class perceval.backends.core.rss.RSSClient(url, archive=None, from_archive=False, ssl_verify=True)[source]

Bases: perceval.client.HttpClient

RSS API client.

This class implements a simple client to retrieve entries from projects in a RSS node.

Parameters
  • url – URL of rss node: https://item.opnfv.org/ci

  • archive – an archive to store/read fetched data

  • from_archive – it tells whether to write/read the archive

  • ssl_verify – enable/disable SSL verification

Raises

HTTPError – when an error occurs doing the request

get_entries()[source]

Retrieve all entries from a RSS feed

class perceval.backends.core.rss.RSSCommand(*args, debug=False)[source]

Bases: perceval.backend.BackendCommand

Class to run RSS backend from the command line.

BACKEND

alias of perceval.backends.core.rss.RSS

classmethod setup_cmd_parser()[source]

Returns the RSS argument parser.

perceval.backends.core.slack module

class perceval.backends.core.slack.Slack(channel, api_token, max_items=1000, tag=None, archive=None, ssl_verify=True)[source]

Bases: perceval.backend.Backend

Slack backend.

This class retrieves the messages sent to a Slack channel. To access the server an API token is required, which must have enough permissions to read from the given channel.

The origin of the data will be set to the SLACK_URL plus the identifier of the channel; e.g., ‘https://slack.com/C01234ABC’.

Parameters
  • channel – identifier of the channel where data will be fetched

  • api_token – token or key needed to use the API

  • max_items – maximum number of messages requested on the same query

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

  • ssl_verify – enable/disable SSL verification

CATEGORIES = ['message']

A list of categories that can be fetched by this backend.

Every backend produces items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

EXTRA_SEARCH_FIELDS = {'channel_id': ['channel_info', 'id'], 'channel_name': ['channel_info', 'name']}

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

    {
        'key-1': value-1,
        'key-2': value-2,
        'key-3': value-3
    }

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

    {
        'project_id': ['fields', 'project', 'id'],
        'project_key': ['fields', 'project', 'key'],
        'project_name': ['fields', 'project', 'name']
    }

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.

fetch(category='message', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the messages from the channel.

This method fetches the messages stored on the channel that were sent since the given date.

Parameters
  • category – the category of items to fetch

  • from_date – obtain messages sent since this date

Returns

a generator of messages

fetch_items(category, **kwargs)[source]

Fetch the messages

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]

Returns whether it supports archiving items on the fetch process.

Returns

this backend supports items archive

classmethod has_resuming()[source]

Returns whether it supports resuming the fetch process.

Returns

this backend does not support items resuming

static metadata_category(item)[source]

Extracts the category from a Slack item.

This backend only generates one type of item which is ‘message’.

static metadata_id(item)[source]

Extracts the identifier from a Slack item.

This identifier is a combination of two fields because Slack messages do not have a unique identifier. The ‘ts’ and ‘user’ values (or ‘bot_id’ when the message is sent by a bot) are combined because there have been cases where two messages were sent by different users at the same time.

When neither the ‘user’ nor the ‘bot_id’ attribute is present (e.g., bot deleted), the fallback is to generate the identifier from the ‘ts’ and ‘username’ values.
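
The fallback order described above can be sketched as follows. Plain string concatenation and the flat item layout are assumptions; the exact format produced by the backend is not specified here, and the helper name is hypothetical.

```python
def message_id(item):
    """Build an identifier from 'ts' plus the first available author field."""
    if 'user' in item:
        author = item['user']
    elif 'bot_id' in item:
        author = item['bot_id']
    else:
        # Neither 'user' nor 'bot_id' is present, e.g. the bot was deleted.
        author = item['username']
    return item['ts'] + author


# Made-up messages for illustration.
print(message_id({'ts': '1486999900.000136', 'user': 'U0001'}))
print(message_id({'ts': '1486999900.000136', 'bot_id': 'B0001'}))
print(message_id({'ts': '1486999900.000136', 'username': 'ghost'}))
```
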

static metadata_updated_on(item)[source]

Extracts and converts the update time from a Slack item.

The timestamp is extracted from ‘ts’ field and converted to a UNIX timestamp.

Parameters

item – item generated by the backend

Returns

a UNIX timestamp

static parse_channel_info(raw_channel_info)[source]

Parse a channel info JSON stream.

This method parses a JSON stream, containing the information from a channel, and returns a dict with the parsed data.

Parameters

raw_channel_info – JSON string to parse

Returns

a dict with the parsed information about a channel

static parse_history(raw_history)[source]

Parse a channel history JSON stream.

This method parses a JSON stream, containing the history of a channel, and returns a list with the parsed data. It also returns whether there are more messages that are not included in this stream.

Parameters

raw_history – JSON string to parse

Returns

a tuple with a list of dicts with the parsed messages and ‘has_more’ value
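
The shape of the returned tuple can be illustrated with a short sketch. It assumes the response carries a 'messages' list and a 'has_more' flag, and the sample payload is made up. It is not the backend's parser.

```python
import json


def parse_history(raw_history):
    """Return (messages, has_more) from a channel history JSON string (sketch)."""
    result = json.loads(raw_history)
    # Treat a missing flag as "no more messages"; this default is an assumption.
    return result['messages'], result.get('has_more', False)


raw = '{"ok": true, "messages": [{"ts": "1486999900.000136"}], "has_more": false}'
messages, has_more = parse_history(raw)
print(len(messages), has_more)
```

A caller could keep requesting older history while has_more is true.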

static parse_user(raw_user)[source]

Parse a user’s info JSON stream.

This method parses a JSON stream, containing the information from a user, and returns a dict with the parsed data.

Parameters

raw_user – JSON string to parse

Returns

a dict with the parsed user’s information

version = '0.10.0'
class perceval.backends.core.slack.SlackClient(api_token, max_items=1000, archive=None, from_archive=False, ssl_verify=True)[source]

Bases: perceval.client.HttpClient

Slack API client.

Client for fetching information from the Slack server using its REST API.

Parameters
  • api_token – key needed to use the API

  • max_items – maximum number of items per request

  • archive – an archive to store/read fetched data

  • from_archive – it tells whether to write/read the archive

  • ssl_verify – enable/disable SSL verification

AUTHORIZATION_HEADER = 'Authorization'
PCHANNEL = 'channel'
PCOUNT = 'count'
PLATEST = 'latest'
POLDEST = 'oldest'
PTOKEN = 'token'
PUSER = 'user'
RCONVERSATION_HISTORY = 'conversations.history'
RCONVERSATION_INFO = 'conversations.info'
RCONVERSATION_MEMBERS = 'conversations.members'
RUSER_INFO = 'users.info'
URL = 'https://slack.com/api/%(resource)s'
channel_info(channel)[source]

Fetch information about a channel.

conversation_members(conversation)[source]

Fetch the number of members in a conversation. A conversation is a supertype that covers public and private channels, DMs and group DMs.

Parameters

conversation – the ID of the conversation

history(channel, oldest=None, latest=None)[source]

Fetch the history of a channel.

static sanitize_for_archive(url, headers, payload)[source]

Sanitize payload of a HTTP request by removing the token information before storing/retrieving archived items

Parameters
  • url – HTTP url request

  • headers – HTTP headers request

  • payload – HTTP payload request

Returns

url, headers and the sanitized payload

user(user_id)[source]

Fetch user info.

exception perceval.backends.core.slack.SlackClientError(**kwargs)[source]

Bases: perceval.errors.BaseError

Raised when an error occurs using the Slack client

message = '%(error)s'
class perceval.backends.core.slack.SlackCommand(*args, debug=False)[source]

Bases: perceval.backend.BackendCommand

Class to run Slack backend from the command line.

BACKEND

alias of perceval.backends.core.slack.Slack

classmethod setup_cmd_parser()[source]

Returns the Slack argument parser.

perceval.backends.core.stackexchange module

class perceval.backends.core.stackexchange.StackExchange(site, tagged=None, api_token=None, access_token=None, max_questions=100, tag=None, archive=None, ssl_verify=True)[source]

Bases: perceval.backend.Backend

StackExchange backend for Perceval.

This class retrieves the questions stored in any of the StackExchange sites. To initialize this class the site must be provided.

Parameters
  • site – StackExchange site

  • tagged – filter items by question Tag

  • api_token – StackExchange application key for the API

  • access_token – StackExchange user access_token for the API

  • max_questions – max of questions per page retrieved

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

  • ssl_verify – enable/disable SSL verification

CATEGORIES = ['question']

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

EXTRA_SEARCH_FIELDS = {'tags': ['tags']}

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

{
    'key-1': value-1,
    'key-2': value-2,
    'key-3': value-3
}

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

{
    'project_id': ['fields', 'project', 'id'],
    'project_key': ['fields', 'project', 'key'],
    'project_name': ['fields', 'project', 'name']
}

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.
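
The “path” idea can be sketched as a walk through nested dicts. This is only an illustration under stated assumptions: the helper resolve_search_fields is hypothetical, and returning None for a missing key is a choice made here, not documented behaviour:

```python
def resolve_search_fields(item, extra_fields):
    """Follow each path in extra_fields through the nested item dict."""
    found = {}
    for name, path in extra_fields.items():
        value = item
        for key in path:
            # Assumption: a missing key yields None instead of raising.
            value = value.get(key) if isinstance(value, dict) else None
            if value is None:
                break
        found[name] = value
    return found


item = {'fields': {'project': {'id': 7, 'key': 'PRJ', 'name': 'Demo'}}}
paths = {'project_id': ['fields', 'project', 'id'],
         'project_name': ['fields', 'project', 'name']}
print(resolve_search_fields(item, paths))
```

With this backend's EXTRA_SEARCH_FIELDS, the single path ['tags'] would simply pick the top-level ‘tags’ value of the item.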

fetch(category='question', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the questions from the site.

The method retrieves, from a StackExchange site, the questions updated since the given date.

Parameters
  • category – the category of items to fetch

  • from_date – obtain questions updated since this date

Returns

a generator of questions

fetch_items(category, **kwargs)[source]

Fetch the questions

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]

Returns whether the backend supports archiving items during the fetch process.

Returns

this backend supports items archive

classmethod has_resuming()[source]

Returns whether the backend supports resuming the fetch process.

Returns

this backend supports items resuming

static metadata_category(item)[source]

Extracts the category from a StackExchange item.

This backend only generates one type of item which is ‘question’.

static metadata_id(item)[source]

Extracts the identifier from a StackExchange item.

static metadata_updated_on(item)[source]

Extracts the update time from a StackExchange item.

The timestamp is extracted from ‘last_activity_date’ field. This date is a UNIX timestamp but needs to be converted to a float value.

Parameters

item – item generated by the backend

Returns

a UNIX timestamp

static parse_questions(raw_page)[source]

Parse a StackExchange API raw response.

The method parses the API response retrieving the questions from the received items

Parameters

raw_page – raw API response from which to parse the questions

Returns

a generator of questions

version = '0.12.1'
class perceval.backends.core.stackexchange.StackExchangeClient(site, tagged, token, access_token=None, max_questions=100, archive=None, from_archive=False, ssl_verify=True)[source]

Bases: perceval.client.HttpClient

StackExchange API client.

This class implements a simple client to retrieve questions from any StackExchange site.

Parameters
  • site – StackExchange site

  • tagged – filter items by question Tag

  • token – StackExchange application key for the API

  • access_token – StackExchange user access token for the API

  • max_questions – max number of questions per query

  • archive – an archive to store/read fetched data

  • from_archive – it tells whether to write/read the archive

  • ssl_verify – enable/disable SSL verification

Raises

HTTPError – when an error occurs doing the request

PACCESSTOKEN = 'access_token'
PFILTER = 'filter'
PKEY = 'key'
PMIN = 'min'
PORDER = 'order'
PPAGE = 'page'
PPAGESIZE = 'pagesize'
PSITE = 'site'
PSORT = 'sort'
PTAGGED = 'tagged'
RQUESTIONS = 'questions'
STACKEXCHANGE_API_URL = 'https://api.stackexchange.com'
VERSION_API = '2.2'
VQUESTIONS_FILTER = 'Bf*y*ByQD_upZqozgU6lXL_62USGOoV3)MFNgiHqHpmO_Y-jHR'
get_questions(from_date)[source]

Retrieve all the questions from a given date.

Parameters

from_date – obtain questions updated since this date
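
To show how the class constants above could fit together, here is a hedged sketch that assembles a questions request URL. The function questions_url is hypothetical, and the ‘desc’ and ‘activity’ values for the order and sort parameters are assumptions rather than values confirmed by this documentation:

```python
from urllib.parse import urlencode

# Values copied from the StackExchangeClient attributes listed above.
STACKEXCHANGE_API_URL = 'https://api.stackexchange.com'
VERSION_API = '2.2'
RQUESTIONS = 'questions'


def questions_url(site, from_ts, page=1, pagesize=100, tagged=None):
    """Build a URL for questions active since the UNIX timestamp from_ts."""
    params = {
        'site': site,
        'min': from_ts,
        'order': 'desc',     # assumption
        'sort': 'activity',  # assumption
        'page': page,
        'pagesize': pagesize,
    }
    if tagged:
        params['tagged'] = tagged
    base = '%s/%s/%s' % (STACKEXCHANGE_API_URL, VERSION_API, RQUESTIONS)
    return base + '?' + urlencode(params)


print(questions_url('stackoverflow', 0, tagged='python'))
```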

static sanitize_for_archive(url, headers, payload)[source]

Sanitize payload of a HTTP request by removing the token information before storing/retrieving archived items

Parameters
  • url – HTTP url request

  • headers – HTTP headers request

  • payload – HTTP payload request

Returns

url, headers and the sanitized payload

class perceval.backends.core.stackexchange.StackExchangeCommand(*args, debug=False)[source]

Bases: perceval.backend.BackendCommand

Class to run StackExchange backend from the command line.

BACKEND

alias of perceval.backends.core.stackexchange.StackExchange

classmethod setup_cmd_parser()[source]

Returns the StackExchange argument parser.

perceval.backends.core.supybot module

class perceval.backends.core.supybot.Supybot(uri, dirpath, tag=None, archive=None)[source]

Bases: perceval.backend.Backend

Supybot IRC log backend.

This class fetches the messages stored by Supybot in log files. Initialize this class providing the directory where those IRC log files are stored.

The log filenames expected by this backend should follow the pattern #channel_YYYY-MM-DD.log (e.g., #grimoirelab_2016-06-27.log). This is needed to determine the date when messages were sent. Other filenames might work too, but the behaviour is unknown.

The format of the messages must also follow a pattern. These patterns can be found in the SupybotParser class documentation.

Parameters
  • uri – URI of the IRC archives; typically, the URL of their IRC channel

  • dirpath – directory path where the archives are stored

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

CATEGORIES = ['message']

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

fetch(category='message', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]

Fetch the messages from the Supybot IRC logger.

The method parses and returns the messages saved in the IRC log files and stored by Supybot in dirpath.

Parameters
  • category – the category of items to fetch

  • from_date – obtain messages since this date

Returns

a generator of messages

fetch_items(category, **kwargs)[source]

Fetch the messages

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]

Returns whether the backend supports archiving items during the fetch process.

Returns

this backend does not support items archive

classmethod has_resuming()[source]

Returns whether the backend supports resuming the fetch process.

Returns

this backend supports items resuming

static metadata_category(item)[source]

Extracts the category from a Supybot item.

This backend only generates one type of item which is ‘message’.

static metadata_id(item)[source]

Extracts the identifier from a Supybot item.

This identifier is a combination of three fields because IRC messages do not have a unique identifier. The ‘timestamp’, ‘nick’ and ‘body’ values are combined because there have been cases where two messages were sent by the same user at the same time.
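
A rough sketch of why the body is part of the identifier follows. Hashing with SHA-1 and the helper name supybot_message_id are assumptions made for illustration; Perceval's own scheme may differ:

```python
import hashlib


def supybot_message_id(item):
    """Build an identifier from the 'timestamp', 'nick' and 'body' fields."""
    # Assumption: SHA-1 over the joined fields, chosen only for this sketch.
    raw = ':'.join([item['timestamp'], item['nick'], item['body']])
    return hashlib.sha1(raw.encode('utf-8')).hexdigest()


first = {'timestamp': '2016-06-27T12:00:00+0000', 'nick': 'nick', 'body': 'hello'}
second = dict(first, body='hello again')
# Same sender and same second, but different bodies give different identifiers.
print(supybot_message_id(first) != supybot_message_id(second))
```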

static metadata_updated_on(item)[source]

Extracts the update time from a Supybot item.

The timestamp used is extracted from ‘timestamp’ field. This date is converted to UNIX timestamp format taking into account the timezone of the date.

Parameters

item – item generated by the backend

Returns

a UNIX timestamp
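
The timezone-aware conversion can be sketched with the standard library. This assumes the timestamp always carries an offset such as +0000 (the parser's pattern also tolerates a missing offset, which this sketch does not handle); supybot_ts_to_unix is a hypothetical name:

```python
from datetime import datetime


def supybot_ts_to_unix(ts):
    """Convert an ISO-like timestamp with a UTC offset to a UNIX timestamp."""
    # '%z' accepts offsets such as '+0000' or '+0200'.
    return datetime.strptime(ts, '%Y-%m-%dT%H:%M:%S%z').timestamp()


# Both strings describe the same instant, so they yield the same value.
print(supybot_ts_to_unix('2016-06-27T12:00:00+0000'))
print(supybot_ts_to_unix('2016-06-27T14:00:00+0200'))
```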

static parse_supybot_log(filepath)[source]

Parse a Supybot IRC log file.

The method parses the Supybot IRC log file and returns an iterator of dictionaries. Each of these contains a message from the file.

Parameters

filepath – path to the IRC log file

Returns

a generator of parsed messages

Raises
  • ParseError – raised when the format of the Supybot log file is invalid

  • OSError – raised when an error occurs reading the given file

version = '0.10.0'
class perceval.backends.core.supybot.SupybotCommand(*args, debug=False)[source]

Bases: perceval.backend.BackendCommand

Class to run Supybot backend from the command line.

BACKEND

alias of perceval.backends.core.supybot.Supybot

classmethod setup_cmd_parser()[source]

Returns the Supybot argument parser.

class perceval.backends.core.supybot.SupybotParser(stream)[source]

Bases: object

Supybot IRC parser.

This class parses a Supybot IRC log stream, converting plain log lines (or messages) into dict items. Each dictionary will contain the date of the message, the type of message (comment or server message), the nick of the sender and its body.

Each line on a log starts with a date in ISO format including its timezone and it is followed by two spaces and by a message.

There are two types of valid messages in a Supybot log: comment messages and server messages. A comment message follows either of these two patterns:

2016-06-27T12:00:00+0000  <nick> body of the message
2016-06-27T12:00:00+0000  * nick waves hello

A valid server message has the following pattern:

2016-06-27T12:00:00+0000  *** nick is known as new_nick

An exception is raised when any of the lines does not follow any of the above formats.

Parameters

stream – an iterator which produces Supybot log lines

BOT_PATTERN = '^-(?P<nick>(.*?)(!.*)?)-\\s(?P<body>.+)$'
COMMENT_ACTION_PATTERN = '^\\*\\s?(?P<body>(?P<nick>([^\\s\\*]+?)(!.*)?)\\s.+)$'
COMMENT_PATTERN = '^<(?P<nick>(.*?)(!.*)?)>\\s(?P<body>.+)$'
EMPTY_BOT_PATTERN = '^-(.*?)(!.*)?-\\s*$'
EMPTY_COMMENT_ACTION_PATTERN = '^\\*\\s?([^\\s\\*]+?)(!.*)?\\s*$'
EMPTY_COMMENT_PATTERN = '^<(.*?)(!.*)?>\\s*$'
EMPTY_PATTERN = '^\\s*$'
SERVER_PATTERN = '^\\*\\*\\*\\s(?P<body>(?P<nick>(.*?)(!.*)?)\\s.+)$'
SUPYBOT_BOT_REGEX = re.compile('^-(?P<nick>(.*?)(!.*)?)-\\s(?P<body>.+)$', re.VERBOSE)
SUPYBOT_COMMENT_ACTION_REGEX = re.compile('^\\*\\s?(?P<body>(?P<nick>([^\\s\\*]+?)(!.*)?)\\s.+)$', re.VERBOSE)
SUPYBOT_COMMENT_REGEX = re.compile('^<(?P<nick>(.*?)(!.*)?)>\\s(?P<body>.+)$', re.VERBOSE)
SUPYBOT_EMPTY_BOT_REGEX = re.compile('^-(.*?)(!.*)?-\\s*$', re.VERBOSE)
SUPYBOT_EMPTY_COMMENT_ACTION_REGEX = re.compile('^\\*\\s?([^\\s\\*]+?)(!.*)?\\s*$', re.VERBOSE)
SUPYBOT_EMPTY_COMMENT_REGEX = re.compile('^<(.*?)(!.*)?>\\s*$', re.VERBOSE)
SUPYBOT_EMPTY_REGEX = re.compile('^\\s*$', re.VERBOSE)
SUPYBOT_SERVER_REGEX = re.compile('^\\*\\*\\*\\s(?P<body>(?P<nick>(.*?)(!.*)?)\\s.+)$', re.VERBOSE)
SUPYBOT_TIMESTAMP_REGEX = re.compile('^(?P<ts>\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}[\\+\\-]?\\d{0,4})\\s\\s\n                        (?P<msg>.+)$\n                        ', re.VERBOSE)
TCOMMENT = 'comment'
TIMESTAMP_PATTERN = '^(?P<ts>\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}[\\+\\-]?\\d{0,4})\\s\\s\n                        (?P<msg>.+)$\n                        '
TSERVER = 'server'
parse()[source]

Parse a Supybot IRC stream.

Returns an iterator of dicts. Each dict contains information about the date, type, nick and body of a single log entry.

Returns

iterator of parsed lines

Raises

ParseError – when an invalid line is found parsing the given stream
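
The line format can be exercised with a simplified sketch built on the patterns listed among the class attributes. It is not the library's parser: it skips the bot and empty-message patterns, the function parse_line is hypothetical, and it raises ValueError where the real parser raises ParseError:

```python
import re

# Patterns copied from the SupybotParser attributes; the timestamp pattern is
# written on a single line here, so re.VERBOSE is not needed.
TIMESTAMP = re.compile(
    r'^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[\+\-]?\d{0,4})\s\s(?P<msg>.+)$')
COMMENT = re.compile(r'^<(?P<nick>(.*?)(!.*)?)>\s(?P<body>.+)$')
ACTION = re.compile(r'^\*\s?(?P<body>(?P<nick>([^\s\*]+?)(!.*)?)\s.+)$')
SERVER = re.compile(r'^\*\*\*\s(?P<body>(?P<nick>(.*?)(!.*)?)\s.+)$')


def parse_line(line):
    """Classify one log line and split it into timestamp, type, nick and body."""
    m = TIMESTAMP.match(line)
    if not m:
        raise ValueError('invalid line: %r' % line)
    ts, msg = m.group('ts'), m.group('msg')
    # Try each message pattern in turn.
    for kind, regex in (('server', SERVER), ('comment', COMMENT), ('comment', ACTION)):
        found = regex.match(msg)
        if found:
            return {'timestamp': ts, 'type': kind,
                    'nick': found.group('nick'), 'body': found.group('body')}
    raise ValueError('invalid message: %r' % msg)


print(parse_line('2016-06-27T12:00:00+0000  <nick> body of the message'))
print(parse_line('2016-06-27T12:00:00+0000  *** nick is known as new_nick'))
```

Note that the date and the message must be separated by exactly two whitespace characters for the timestamp pattern to match.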

perceval.backends.core.telegram module

class perceval.backends.core.telegram.Telegram(bot, bot_token, tag=None, archive=None, ssl_verify=True)[source]

Bases: perceval.backend.Backend

Telegram backend.

The Telegram backend fetches the messages that a Telegram bot can receive. Usually, these messages are direct or private messages but a bot can be configured to receive every message sent to a channel/group where it is subscribed. Take into account that messages are removed from the Telegram server 24 hours after they are sent. Moreover, once they are fetched using an offset, these messages are also removed. This means every time this backend is called, messages will be deleted.

Initialize this class passing the name of the bot and the authentication token used by this bot. The authentication token is provided by Telegram once the bot is created.

The origin of the data will be set to the TELEGRAM_URL plus the name of the bot; e.g., ‘http://telegram.org/mybot’.

Parameters
  • bot – name of the bot

  • bot_token – authentication token used by the bot

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

  • ssl_verify – enable/disable SSL verification

CATEGORIES = ['message']

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

EXTRA_SEARCH_FIELDS = {'chat_id': ['message', 'chat', 'id'], 'chat_name': ['message', 'chat', 'title']}

A set of search fields to simplify query operations.

The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:

{
    'key-1': value-1,
    'key-2': value-2,
    'key-3': value-3
}

These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:

{
    'project_id': ['fields', 'project', 'id'],
    'project_key': ['fields', 'project', 'key'],
    'project_name': ['fields', 'project', 'name']
}

Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.

fetch(category='message', offset=1, chats=None)[source]

Fetch the messages the bot can read from the server.

The method retrieves, from the Telegram server, the messages sent with an offset equal to or greater than the given one.

A list of chats, groups and channels identifiers can be set using the parameter chats. When it is set, only those messages sent to any of these will be returned. An empty list will return no messages.

Parameters
  • category – the category of items to fetch

  • offset – obtain messages from this offset

  • chats – list of chat names used to filter messages

Returns

a generator of messages

Raises

ValueError – when chats is an empty list

fetch_items(category, **kwargs)[source]

Fetch the messages

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]

Returns whether the backend supports archiving items during the fetch process.

Returns

this backend supports items archive

classmethod has_resuming()[source]

Returns whether the backend supports resuming the fetch process.

Returns

this backend supports items resuming

metadata(item, filter_classified=False)[source]

Telegram metadata.

The method takes an item and overrides the metadata information to add extra information related to Telegram.

Currently, it adds the ‘offset’ keyword.

Parameters
  • item – an item fetched by a backend

  • filter_classified – sets if classified fields were filtered

static metadata_category(item)[source]

Extracts the category from a Telegram item.

This backend only generates one type of item which is ‘message’.

static metadata_id(item)[source]

Extracts the identifier from a Telegram item.

static metadata_updated_on(item)[source]

Extracts and converts the update time from a Telegram item.

The timestamp is extracted from ‘date’ field that is inside of ‘message’ dict. This date is converted to UNIX timestamp format.

Parameters

item – item generated by the backend

Returns

a UNIX timestamp

static parse_messages(raw_json)[source]

Parse a Telegram JSON messages list.

The method parses the JSON stream and returns an iterator of dictionaries. Each of these contains a Telegram message.

Parameters

raw_json – JSON string to parse

Returns

a generator of parsed messages

version = '0.11.1'
class perceval.backends.core.telegram.TelegramBotClient(bot_token, archive=None, from_archive=False, ssl_verify=True)[source]

Bases: perceval.client.HttpClient

Telegram Bot API 2.0 client.

This class implements a simple client to retrieve those messages sent to a Telegram bot. This includes personal messages or messages sent to a channel (when privacy settings are disabled).

Parameters
  • bot_token – token for the bot

  • archive – an archive to store/read fetched data

  • from_archive – it tells whether to write/read the archive

  • ssl_verify – enable/disable SSL verification

API_URL = 'https://api.telegram.org/bot%(token)s/%(method)s'
OFFSET = 'offset'
UPDATES_METHOD = 'getUpdates'
static sanitize_for_archive(url, headers, payload)[source]

Sanitize URL of a HTTP request by removing the token information before storing/retrieving archived items

Parameters
  • url – HTTP url request

  • headers – HTTP headers request

  • payload – HTTP payload request

Returns

the sanitized url, plus the headers and payload
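
Because the token is embedded in API_URL, sanitizing means rewriting the URL itself. The sketch below shows one way this might be done; the helper sanitize_url and the exact shape of the sanitized URL are assumptions, not the library's documented output:

```python
import re

# Value copied from the TelegramBotClient attributes.
API_URL = 'https://api.telegram.org/bot%(token)s/%(method)s'


def sanitize_url(url):
    """Strip the bot token from a Telegram API URL (illustrative only)."""
    # Assumption: the token is the rest of the path segment that starts with 'bot'.
    return re.sub(r'/bot[^/]+/', '/bot/', url, count=1)


url = API_URL % {'token': '12345:ABCDEF', 'method': 'getUpdates'}
print(sanitize_url(url))
```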

updates(offset=None)[source]

Fetch the messages that a bot can read.

When the offset is given, it will retrieve all the messages with an offset greater than or equal to it. Take into account that, due to how the API works, all previous messages will be removed from the server.

Parameters

offset – fetch the messages starting on this offset

class perceval.backends.core.telegram.TelegramCommand(*args, debug=False)[source]

Bases: perceval.backend.BackendCommand

Class to run Telegram backend from the command line.

BACKEND

alias of perceval.backends.core.telegram.Telegram

classmethod setup_cmd_parser()[source]

Returns the Telegram argument parser.

perceval.backends.core.twitter module

class perceval.backends.core.twitter.Twitter(query, api_token, max_items=100, sleep_for_rate=False, min_rate_to_sleep=1, sleep_time=30, tag=None, archive=None, ssl_verify=True)[source]

Bases: perceval.backend.Backend

Twitter backend.

This class fetches samples of tweets containing specific keywords. Initialize this class by passing the API key needed for authentication with the parameter api_token.

Parameters
  • query – query to fetch tweets

  • api_token – token or key needed to use the API

  • max_items – maximum number of items requested on the same query

  • sleep_for_rate – sleep until rate limit is reset

  • min_rate_to_sleep – minimum rate needed to sleep until it will be reset

  • sleep_time – time (in seconds) to sleep in case of connection problems

  • tag – label used to mark the data

  • archive – archive to store/retrieve items

  • ssl_verify – enable/disable SSL verification

CATEGORIES = ['tweet']

A list of categories that can be fetched by this backend.

Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.

The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().

Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.

fetch(category='tweet', since_id=None, max_id=None, geocode=None, lang=None, include_entities=True, tweets_type='mixed')[source]

Fetch the tweets from the server.

This method fetches tweets published in the last seven days from the Twitter Search API.

Parameters
  • category – the category of items to fetch

  • since_id – if not null, it returns results with an ID greater than the specified ID

  • max_id – if not null, it returns results with an ID less than the specified ID

  • geocode – if enabled, returns tweets by users located at latitude,longitude,"mi"|"km"

  • lang – if enabled, restricts tweets to the given language, given by an ISO 639-1 code

  • include_entities – if disabled, it excludes entities node

  • tweets_type – type of tweets returned. Default is “mixed”, others are “recent” and “popular”

Returns

a generator of tweets

fetch_items(category, **kwargs)[source]

Fetch the tweets

Parameters
  • category – the category of items to fetch

  • kwargs – backend arguments

Returns

a generator of items

classmethod has_archiving()[source]

Returns whether the backend supports archiving items during the fetch process.

Returns

this backend supports items archive

classmethod has_resuming()[source]

Returns whether the backend supports resuming the fetch process.

Returns

this backend supports items resuming

static metadata_category(item)[source]

Extracts the category from a Twitter item.

This backend only generates one type of item which is ‘tweet’.

static metadata_id(item)[source]

Extracts the identifier from a Twitter item.

static metadata_updated_on(item)[source]

Extracts and converts the update time from a Twitter item.

The timestamp is extracted from ‘created_at’ field and converted to a UNIX timestamp.

Parameters

item – item generated by the backend

Returns

a UNIX timestamp
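
A minimal sketch of this conversion, assuming ‘created_at’ uses the form ‘Wed Aug 27 13:08:45 +0000 2008’ and that the default C locale is active so the English day and month names parse (tweet_created_at_to_unix is a hypothetical name):

```python
from datetime import datetime


def tweet_created_at_to_unix(item):
    """Convert the 'created_at' field of a tweet-like dict to a UNIX timestamp."""
    # Assumption: the field looks like 'Wed Aug 27 13:08:45 +0000 2008'.
    parsed = datetime.strptime(item['created_at'], '%a %b %d %H:%M:%S %z %Y')
    return parsed.timestamp()


print(tweet_created_at_to_unix({'created_at': 'Wed Aug 27 13:08:45 +0000 2008'}))
```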

search_fields(item)[source]

Add search fields to an item.

It adds the values of metadata_id plus the hashtags of a tweet.

Parameters

item – the item to extract the search fields values

Returns

a dict of search fields
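
The idea can be sketched as follows, assuming hashtags are stored under item['entities']['hashtags'] as dicts with a ‘text’ key. The function name, the ‘hashtags’ output key and passing the identifier as an argument are all choices made for this illustration:

```python
def tweet_search_fields(item, item_id):
    """Collect the item identifier and the hashtag texts of a tweet-like dict."""
    # Assumption: hashtags live under item['entities']['hashtags'][i]['text'].
    hashtags = [h['text'] for h in item.get('entities', {}).get('hashtags', [])]
    return {'item_id': item_id, 'hashtags': hashtags}


tweet = {'id_str': '1', 'entities': {'hashtags': [{'text': 'perceval'}, {'text': 'oss'}]}}
print(tweet_search_fields(tweet, tweet['id_str']))
```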

version = '0.4.0'
class perceval.backends.core.twitter.TwitterClient(api_key, max_items=100, sleep_for_rate=False, min_rate_to_sleep=1, sleep_time=30, archive=None, from_archive=False, ssl_verify=True)[source]

Bases: perceval.client.HttpClient, perceval.client.RateLimitHandler

Twitter API client.

Client for fetching information from the Twitter server using its REST API v1.1.

Parameters
  • api_key – key needed to use the API

  • max_items – maximum number of items per request

  • sleep_for_rate – sleep until rate limit is reset

  • min_rate_to_sleep – minimum rate needed to sleep until it will be reset

  • sleep_time – time (in seconds) to sleep in case of connection problems

  • archive – an archive to store/read fetched data

  • from_archive – it tells whether to write/read the archive

  • ssl_verify – enable/disable SSL verification

HAUTHORIZATION = 'Authorization'
PCOUNT = 'count'
PGEOCODE = 'geocode'
PINCLUDE_ENTITIES = 'include_entities'
PLANG = 'lang'
PMAX_ID = 'max_id'
PQUERY = 'q'
PRESULT_TYPE = 'result_type'
PSINCE_ID = 'since_id'
calculate_time_to_reset()[source]

Returns the number of seconds to wait, which is obtained from the rate limit reset header.

static sanitize_for_archive(url, headers, payload)[source]

Sanitize payload of a HTTP request by removing the token information before storing/retrieving archived items

Parameters
  • url – HTTP url request

  • headers – HTTP headers request

  • payload – HTTP payload request

Returns

url, headers and the sanitized payload

tweets(query, since_id=None, max_id=None, geocode=None, lang=None, include_entities=True, result_type='mixed')[source]

Fetch tweets for a given query between since_id and max_id.

Parameters
  • query – query to fetch tweets

  • since_id – if not null, it returns results with an ID greater than the specified ID

  • max_id – if not null, it returns results with an ID less than the specified ID

  • geocode – if enabled, returns tweets by users located at latitude,longitude,"mi"|"km"

  • lang – if enabled, restricts tweets to the given language, given by an ISO 639-1 code

  • include_entities – if disabled, it excludes entities node

  • result_type – type of tweets returned. Default is “mixed”, others are “recent” and “popular”

Returns

a generator of tweets

class perceval.backends.core.twitter.TwitterCommand(*args, debug=False)[source]

Bases: perceval.backend.BackendCommand

Class to run Twitter backend from the command line.

BACKEND

alias of perceval.backends.core.twitter.Twitter

classmethod setup_cmd_parser()[source]

Returns the Twitter argument parser.

Module contents