perceval.backends.core package¶
Submodules¶
perceval.backends.core.askbot module¶
- class perceval.backends.core.askbot.Askbot(url, tag=None, archive=None, ssl_verify=True)[source]¶
Bases: perceval.backend.Backend
Askbot backend.
This class retrieves the questions posted on an Askbot site. To initialize this class the URL must be provided. The url will be set as the origin of the data.
- Parameters
url – Askbot site URL
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification
- CATEGORIES = ['question']¶
A list of categories that can be fetched by this backend.
Every backend produces items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- EXTRA_SEARCH_FIELDS = {'tags': ['tags']}¶
A set of search fields to simplify query operations.
The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:
{
'key-1': value-1, 'key-2': value-2, 'key-3': value-3
}
These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item ('item_id': item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:
{
'project_id': ['fields', 'project', 'id'], 'project_key': ['fields', 'project', 'key'], 'project_name': ['fields', 'project', 'name']
}
Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.
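The "path" idea described above can be sketched in a few lines. This is only an illustration, not Perceval's own code: the helpers `resolve_path` and `build_search_fields` and the sample item are hypothetical names introduced here.

```python
def resolve_path(item, path):
    """Follow a list of keys (or list indexes) into a nested item."""
    value = item
    for key in path:
        value = value[key]
    return value


def build_search_fields(item, item_id, extra_search_fields):
    """Return a search_fields-style dict for one item (hypothetical helper)."""
    fields = {"item_id": item_id}
    for name, path in extra_search_fields.items():
        fields[name] = resolve_path(item, path)
    return fields


# Made-up item whose layout matches the example paths above.
issue = {"fields": {"project": {"id": "10", "key": "PRJ", "name": "Project"}}}
extra = {
    "project_id": ["fields", "project", "id"],
    "project_key": ["fields", "project", "key"],
    "project_name": ["fields", "project", "name"],
}
search_fields = build_search_fields(issue, "1", extra)
```

A path may mix dict keys and list indexes, which is why a value such as ['component', 0, '__text__'] can also be resolved by the same loop.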
- fetch(category='question', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶
Fetch the questions/answers from the repository.
The method retrieves, from an Askbot site, the questions and answers updated since the given date.
- Parameters
category – the category of items to fetch
from_date – obtain questions/answers updated since this date
- Returns
a generator of items
- fetch_items(category, **kwargs)[source]¶
Fetch the questions
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_archiving()[source]¶
Returns whether the backend supports archiving items during the fetch process.
- Returns
this backend supports items archive
- classmethod has_resuming()[source]¶
Returns whether the backend supports resuming the fetch process.
- Returns
this backend supports items resuming
- static metadata_category(item)[source]¶
Extracts the category from an Askbot item.
This backend only generates one type of item which is ‘question’.
- static metadata_updated_on(item)[source]¶
Extracts the update time from an Askbot item.
The timestamp is extracted from ‘last_activity_at’ field. This date is a UNIX timestamp but needs to be converted to a float value.
- Parameters
item – item generated by the backend
- Returns
a UNIX timestamp
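As a rough sketch of the conversion described above, assuming the 'last_activity_at' field arrives as a numeric string (the sample item and the function name `updated_on` are made up for illustration):

```python
def updated_on(item):
    """Return the item's last activity time as a float UNIX timestamp."""
    return float(item["last_activity_at"])


# Hypothetical, heavily trimmed Askbot question.
question = {"id": 1, "last_activity_at": "1470000000"}
timestamp = updated_on(question)
```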
- version = '0.8.0'¶
- class perceval.backends.core.askbot.AskbotClient(base_url, archive=None, from_archive=False, ssl_verify=True)[source]¶
Bases: perceval.client.HttpClient
Askbot client.
This class implements a simple client to retrieve different kinds of data from an Askbot site.
- Parameters
base_url – URL of the Askbot site
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification
- Raises
HTTPError – when an error occurs doing the request
- API_QUESTIONS = 'api/v1/questions/'¶
- HREQUEST_WITH = 'X-Requested-With'¶
- PAVATAR_SIZE = 'avatar_size'¶
- PPAGE = 'page'¶
- PPOST_ID = 'post_id'¶
- PPOST_TYPE = 'post_type'¶
- PSORT = 'sort'¶
- RCOMMENTS = 's/post_comments'¶
- RCOMMENTS_OLD = 'post_comments'¶
- RHTML_QUESTION = 'question/'¶
- VANSWER = 'answer'¶
- VAVATAR_SIZE = 0¶
- VHTTP_REQUEST = 'XMLHttpRequest'¶
- VORDER_API = 'activity-asc'¶
- VORDER_HTML = 'votes'¶
- get_api_questions(path)[source]¶
Retrieve a question page using the API.
- Parameters
path – page to retrieve
- class perceval.backends.core.askbot.AskbotCommand(*args, debug=False)[source]¶
Bases: perceval.backend.BackendCommand
Class to run Askbot backend from the command line.
- BACKEND¶
alias of perceval.backends.core.askbot.Askbot
- class perceval.backends.core.askbot.AskbotParser[source]¶
Bases: object
Askbot HTML parser.
This class parses a plain HTML document, converting questions, answers, comments and user information into dict items.
- static parse_answers(html_question)[source]¶
Parse the answers of a given HTML question.
The method parses the answers related to a given HTML question, as well as all the comments related to each answer.
- Parameters
html_question – raw HTML question element
- Returns
a list with the answers
- static parse_number_of_html_pages(html_question)[source]¶
Parse number of answer pages to paginate over them.
- Parameters
html_question – raw HTML question element
- Returns
an integer with the number of pages
- static parse_question_container(html_question)[source]¶
Parse the question info container of a given HTML question.
The method parses the information available in the question information container. The container can have up to 2 elements: the first one contains the information related to the user who generated the question and the date (if any). The second one contains the date of the update and the user who updated it (if not the same who generated the question).
- Parameters
html_question – raw HTML question element
- Returns
an object with the parsed information
- static parse_user_info(update_info)[source]¶
Parse the user information of a given HTML container.
The method parses all the available user information in the container. If the class “user-info” exists, the method gets all the available information in the container. Otherwise, if a class “tip” exists, the container belongs to a wiki post with no associated user. Failing that, the container may be empty.
- Parameters
update_info – beautiful soup answer container element
- Returns
an object with the parsed information
perceval.backends.core.bugzilla module¶
- class perceval.backends.core.bugzilla.Bugzilla(url, user=None, password=None, max_bugs=200, max_bugs_csv=10000, tag=None, archive=None, ssl_verify=True)[source]¶
Bases: perceval.backend.Backend
Bugzilla backend.
This class fetches the bugs stored in a Bugzilla repository. To initialize this class the URL of the server must be provided. The url will be set as the origin of the data.
- Parameters
url – Bugzilla server URL
user – Bugzilla user
password – Bugzilla user password
max_bugs – maximum number of bugs requested on the same query
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification
- CATEGORIES = ['bug']¶
A list of categories that can be fetched by this backend.
Every backend produces items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- EXTRA_SEARCH_FIELDS = {'component': ['component', 0, '__text__'], 'product': ['product', 0, '__text__']}¶
A set of search fields to simplify query operations.
The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:
{
'key-1': value-1, 'key-2': value-2, 'key-3': value-3
}
These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item ('item_id': item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:
{
'project_id': ['fields', 'project', 'id'], 'project_key': ['fields', 'project', 'key'], 'project_name': ['fields', 'project', 'name']
}
Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.
- fetch(category='bug', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶
Fetch the bugs from the repository.
The method retrieves, from a Bugzilla repository, the bugs updated since the given date.
- Parameters
category – the category of items to fetch
from_date – obtain bugs updated since this date
- Returns
a generator of bugs
- fetch_items(category, **kwargs)[source]¶
Fetch the bugs
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_archiving()[source]¶
Returns whether the backend supports archiving items during the fetch process.
- Returns
this backend supports items archive
- classmethod has_resuming()[source]¶
Returns whether the backend supports resuming the fetch process.
- Returns
this backend supports items resuming
- static metadata_category(item)[source]¶
Extracts the category from a Bugzilla item.
This backend only generates one type of item which is ‘bug’.
- static metadata_updated_on(item)[source]¶
Extracts and converts the update time from a Bugzilla item.
The timestamp is extracted from ‘delta_ts’ field. This date is converted to UNIX timestamp format. Because Bugzilla servers ignore the timezone on HTTP requests, it is ignored during the conversion, too.
- Parameters
item – item generated by the backend
- Returns
a UNIX timestamp
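One way to read "the timezone is ignored" is sketched below with the standard library only. The input format ("YYYY-MM-DD HH:MM:SS" optionally followed by an offset) and the choice to treat the remaining naive value as UTC are assumptions made for this example; `delta_ts_to_timestamp` is a hypothetical name, not part of the backend.

```python
from datetime import datetime, timezone


def delta_ts_to_timestamp(delta_ts):
    """Convert a 'delta_ts'-like string to a UNIX timestamp, dropping its timezone."""
    # Keep only the first 19 characters ("YYYY-MM-DD HH:MM:SS"); any
    # trailing timezone text is discarded on purpose.
    naive = datetime.strptime(delta_ts[:19], "%Y-%m-%d %H:%M:%S")
    # Assumption: the naive value is interpreted as UTC.
    return naive.replace(tzinfo=timezone.utc).timestamp()


with_offset = delta_ts_to_timestamp("2015-08-12 18:32:11 +0200")
without_offset = delta_ts_to_timestamp("2015-08-12 18:32:11")
```

Both calls yield the same value, which is the practical effect of ignoring the offset.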
- static parse_bug_activity(raw_html)[source]¶
Parse a Bugzilla bug activity HTML stream.
This method extracts the information about activity from the given HTML stream. The bug activity is stored in an HTML table. Each parsed activity event is returned as a dictionary.
If the given HTML is invalid, the method will raise a ParseError exception.
- Parameters
raw_html – HTML string to parse
- Returns
a generator of parsed activity events
- Raises
ParseError – raised when an error occurs parsing the given HTML stream
- static parse_buglist(raw_csv)[source]¶
Parse a Bugzilla CSV bug list.
The method parses the CSV file and returns an iterator of dictionaries. Each of them contains the summary of a bug.
- Parameters
raw_csv – CSV string to parse
- Returns
a generator of parsed bugs
- static parse_bugs_details(raw_xml)[source]¶
Parse a Bugzilla bugs details XML stream.
This method returns a generator which parses the given XML, producing an iterator of dictionaries. Each dictionary stores the information related to a parsed bug.
If the given XML is invalid or does not contain any bug, the method will raise a ParseError exception.
- Parameters
raw_xml – XML string to parse
- Returns
a generator of parsed bugs
- Raises
ParseError – raised when an error occurs parsing the given XML stream
- version = '0.12.0'¶
- class perceval.backends.core.bugzilla.BugzillaClient(base_url, user=None, password=None, max_bugs_csv=10000, archive=None, from_archive=False, ssl_verify=True)[source]¶
Bases: perceval.client.HttpClient
Bugzilla API client.
This class implements a simple client to retrieve different kinds of data from a Bugzilla repository. Currently, it only supports 3.x and 4.x servers.
When it is initialized, it checks if the given Bugzilla is available and retrieves its version.
- Parameters
base_url – URL of the Bugzilla server
user – Bugzilla user
password – user password
max_bugs_csv – max bugs requested per CSV query
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification
- Raises
BackendError – when an error occurs initializing the client
- CGI_BUG = 'show_bug.cgi'¶
- CGI_BUGLIST = 'buglist.cgi'¶
- CGI_BUG_ACTIVITY = 'show_activity.cgi'¶
- CGI_LOGIN = 'index.cgi'¶
- CTYPE_CSV = 'csv'¶
- CTYPE_XML = 'xml'¶
- OLD_STYLE_VERSIONS = ['3.2.3', '3.2.2']¶
- PBUGZILLA_LOGIN = 'Bugzilla_login'¶
- PBUGZILLA_PASSWORD = 'Bugzilla_password'¶
- PBUG_ID = 'id'¶
- PCHFIELD_FROM = 'chfieldfrom'¶
- PCTYPE = 'ctype'¶
- PEXCLUDE_FIELD = 'excludefield'¶
- PLIMIT = 'limit'¶
- PLOGIN = 'GoAheadAndLogIn'¶
- PLOGOUT = 'logout'¶
- PORDER = 'order'¶
- URL = '%(base)s/%(cgi)s'¶
- VERSION_REGEX = re.compile('.+bugzilla version="([^"]+)"', re.DOTALL)¶
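The URL template and VERSION_REGEX above can be exercised on their own. The sketch below copies both constants and applies them to a made-up XML fragment; the helper names `build_url` and `parse_version`, the example host, and the sample document are assumptions, and how the client actually obtains the document is not shown here.

```python
import re

URL = "%(base)s/%(cgi)s"
VERSION_REGEX = re.compile(r'.+bugzilla version="([^"]+)"', re.DOTALL)


def build_url(base, cgi):
    """Fill the URL template with a base URL and a CGI script name."""
    return URL % {"base": base, "cgi": cgi}


def parse_version(xml_text):
    """Return the version attribute of the <bugzilla> element, or None."""
    match = VERSION_REGEX.match(xml_text)
    return match.group(1) if match else None


# Made-up response fragment, used only to exercise the regular expression.
sample = (
    '<?xml version="1.0"?>\n'
    '<bugzilla version="4.2.1" urlbase="https://bugs.example.org/">\n'
    "</bugzilla>"
)
bug_url = build_url("https://bugs.example.org", "show_bug.cgi")
version = parse_version(sample)
```

re.DOTALL matters here because the text before the `<bugzilla>` element may span several lines.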
- bug_activity(bug_id)[source]¶
Get the activity of a bug in HTML format.
- Parameters
bug_id – bug identifier
- buglist(from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶
Get a summary of bugs in CSV format.
- Parameters
from_date – retrieve bugs that were updated since that date
- bugs(*bug_ids)[source]¶
Get the information of a list of bugs in XML format.
- Parameters
bug_ids – list of bug identifiers
- call(cgi, params)[source]¶
Run an API command.
- Parameters
cgi – cgi command to run on the server
params – dict with the HTTP parameters needed to run the given command
- login(user, password)[source]¶
Authenticate a user in the server.
- Parameters
user – Bugzilla user
password – user password
- static sanitize_for_archive(url, headers, payload)[source]¶
Sanitize payload of a HTTP request by removing the login and password information before storing/retrieving archived items.
- Parameters
url – HTTP url request
headers – HTTP headers request
payload – HTTP payload request
- Returns
url, headers and the sanitized payload
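A minimal sketch of this kind of sanitizing, assuming the credentials travel in the payload under the PBUGZILLA_LOGIN and PBUGZILLA_PASSWORD parameter names listed above. It returns a filtered copy rather than modifying the payload in place, which is a choice made for this example and may differ from the real method.

```python
PBUGZILLA_LOGIN = "Bugzilla_login"
PBUGZILLA_PASSWORD = "Bugzilla_password"


def sanitize_for_archive(url, headers, payload):
    """Return the request data with login and password removed from the payload."""
    if payload:
        sensitive = (PBUGZILLA_LOGIN, PBUGZILLA_PASSWORD)
        # Build a new dict so the caller's payload is left untouched.
        payload = {k: v for k, v in payload.items() if k not in sensitive}
    return url, headers, payload


# Made-up request used only for illustration.
request_payload = {
    "Bugzilla_login": "jdoe",
    "Bugzilla_password": "secret",
    "ctype": "csv",
}
clean = sanitize_for_archive("https://bugs.example.org/buglist.cgi", {}, request_payload)
```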
- class perceval.backends.core.bugzilla.BugzillaCommand(*args, debug=False)[source]¶
Bases: perceval.backend.BackendCommand
Class to run Bugzilla backend from the command line.
- BACKEND¶
perceval.backends.core.bugzillarest module¶
- class perceval.backends.core.bugzillarest.BugzillaREST(url, user=None, password=None, api_token=None, max_bugs=500, tag=None, archive=None, ssl_verify=True)[source]¶
Bases: perceval.backend.Backend
Bugzilla backend that uses its REST API.
This class fetches the bugs stored in a Bugzilla server (version 5.0 or later). To initialize this class the URL of the server must be provided. The url will be set as the origin of the data.
- Parameters
url – Bugzilla server URL
user – Bugzilla user
password – Bugzilla user password
api_token – Bugzilla token
max_bugs – maximum number of bugs requested on the same query
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification
- CATEGORIES = ['bug']¶
A list of categories that can be fetched by this backend.
Every backend produces items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- EXTRA_SEARCH_FIELDS = {'component': ['component'], 'product': ['product']}¶
A set of search fields to simplify query operations.
The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:
{
'key-1': value-1, 'key-2': value-2, 'key-3': value-3
}
These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item ('item_id': item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:
{
'project_id': ['fields', 'project', 'id'], 'project_key': ['fields', 'project', 'key'], 'project_name': ['fields', 'project', 'name']
}
Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.
- fetch(category='bug', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶
Fetch the bugs from the repository.
The method retrieves, from a Bugzilla repository, the bugs updated since the given date.
- Parameters
category – the category of items to fetch
from_date – obtain bugs updated since this date
- Returns
a generator of bugs
- fetch_items(category, **kwargs)[source]¶
Fetch the bugs
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_archiving()[source]¶
Returns whether the backend supports archiving items during the fetch process.
- Returns
this backend supports items archive
- classmethod has_resuming()[source]¶
Returns whether the backend supports resuming the fetch process.
- Returns
this backend supports items resuming
- static metadata_category(item)[source]¶
Extracts the category from a Bugzilla item.
This backend only generates one type of item which is ‘bug’.
- static metadata_updated_on(item)[source]¶
Extracts the update time from a Bugzilla item.
The timestamp used is extracted from ‘last_change_time’ field. This date is converted to UNIX timestamp format taking into account the timezone of the date.
- Parameters
item – item generated by the backend
- Returns
a UNIX timestamp
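For comparison with the timezone-ignoring conversion of the non-REST backend, a timezone-aware conversion might look like the sketch below. It assumes 'last_change_time' is an ISO 8601 string such as "2015-08-12T18:32:11Z"; `last_change_to_timestamp` is a hypothetical name, and the library's own date parsing is not shown here.

```python
from datetime import datetime


def last_change_to_timestamp(value):
    """Convert an ISO 8601 date with timezone to a UNIX timestamp."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat(),
    # so it is rewritten as an explicit UTC offset first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value).timestamp()


utc_value = last_change_to_timestamp("1970-01-01T00:00:10Z")
offset_value = last_change_to_timestamp("1970-01-01T02:00:10+02:00")
```

Both inputs denote the same instant, so both produce the same timestamp.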
- version = '0.10.0'¶
- class perceval.backends.core.bugzillarest.BugzillaRESTClient(base_url, user=None, password=None, api_token=None, archive=None, from_archive=False, ssl_verify=True)[source]¶
Bases: perceval.client.HttpClient
Bugzilla REST API client.
This class implements a simple client to retrieve different kinds of data from a Bugzilla repository (version 5.0 or later) using its REST API.
When the user and password parameters are given, the client logs in to the server. Further requests will use the token obtained during the sign-in phase.
- Parameters
base_url – URL of the Bugzilla server
user – Bugzilla user
password – user password
api_token – api token for user; when this is provided user and password parameters will be ignored
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification
- Raises
BackendError – when an error occurs initializing the client
- PBUGZILLA_LOGIN = 'login'¶
- PBUGZILLA_PASSWORD = 'password'¶
- PBUGZILLA_TOKEN = 'token'¶
- PEXCLUDE_FIELDS = 'exclude_fields'¶
- PIDS = 'ids'¶
- PINCLUDE_FIELDS = 'include_fields'¶
- PLAST_CHANGE_TIME = 'last_change_time'¶
- PLIMIT = 'limit'¶
- POFFSET = 'offset'¶
- PORDER = 'order'¶
- RATTACHMENT = 'attachment'¶
- RBUG = 'bug'¶
- RCOMMENT = 'comment'¶
- RHISTORY = 'history'¶
- RLOGIN = 'login'¶
- URL = '%(base)s/rest/%(resource)s'¶
- VCHANGE_DATE_ORDER = 'changeddate'¶
- VEXCLUDE_ATTCH_DATA = 'data'¶
- VINCLUDE_ALL = '_all'¶
- attachments(*bug_ids)[source]¶
Get the attachments of the given bugs.
- Parameters
bug_ids – list of bug identifiers
- bugs(from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), offset=None, max_bugs=500)[source]¶
Get the information of a list of bugs.
- Parameters
from_date – retrieve bugs that were updated since that date; dates are converted to UTC
offset – starting position for the search; e.g., to return the 11th element, set this value to 10.
max_bugs – maximum number of bugs to return per query
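The interplay of offset and max_bugs can be illustrated without a server. In this sketch an in-memory list stands in for the Bugzilla query; `fetch_page` and `fetch_all` are hypothetical helpers, not methods of the client.

```python
def fetch_page(bugs, offset, max_bugs):
    """Stand-in for one query: skip `offset` bugs, return at most `max_bugs`."""
    return bugs[offset:offset + max_bugs]


def fetch_all(bugs, max_bugs):
    """Request pages until the (simulated) server returns an empty one."""
    offset = 0
    while True:
        page = fetch_page(bugs, offset, max_bugs)
        if not page:
            break
        yield from page
        offset += len(page)


all_bugs = list(range(1, 26))  # 25 fake bug ids: 1..25
eleventh = fetch_page(all_bugs, offset=10, max_bugs=1)
collected = list(fetch_all(all_bugs, max_bugs=10))
```

Because the offset is zero-based, an offset of 10 skips the first ten bugs and the page starts at the 11th.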
- call(resource, params)[source]¶
Retrieve the given resource.
- Parameters
resource – resource to retrieve
params – dict with the HTTP parameters needed to retrieve the given resource
- Raises
BugzillaRESTError – raised when an error is returned by the server
- comments(*bug_ids)[source]¶
Get the comments of the given bugs.
- Parameters
bug_ids – list of bug identifiers
- history(*bug_ids)[source]¶
Get the history of the given bugs.
- Parameters
bug_ids – list of bug identifiers
- login(user, password)[source]¶
Authenticate a user in the server.
- Parameters
user – Bugzilla user
password – user password
- static sanitize_for_archive(url, headers, payload)[source]¶
Sanitize payload of a HTTP request by removing the login, password and token information before storing/retrieving archived items.
- Parameters
url – HTTP url request
headers – HTTP headers request
payload – HTTP payload request
- Returns
url, headers and the sanitized payload
- class perceval.backends.core.bugzillarest.BugzillaRESTCommand(*args, debug=False)[source]¶
Bases: perceval.backend.BackendCommand
Class to run BugzillaREST backend from the command line.
- BACKEND¶
- exception perceval.backends.core.bugzillarest.BugzillaRESTError(**kwargs)[source]¶
Bases: perceval.errors.BaseError
Raised when an error occurs using the API.
- message = '%(error)s (code: %(code)s)'¶
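The message template above uses named %-style placeholders, so it is presumably filled from the keyword arguments passed to the exception. The class below is a hypothetical stand-in written for this example, not the real BaseError, and the error text and code are invented.

```python
class TemplateError(Exception):
    """Hypothetical exception that formats its message from keyword arguments."""

    message = "%(error)s (code: %(code)s)"

    def __init__(self, **kwargs):
        # Each placeholder name in the template must be supplied as a keyword.
        super().__init__(self.message % kwargs)


err = TemplateError(error="Invalid token", code=32000)
text = str(err)
```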
perceval.backends.core.confluence module¶
- class perceval.backends.core.confluence.Confluence(url, tag=None, archive=None, ssl_verify=True)[source]¶
Bases: perceval.backend.Backend
Confluence backend.
This class fetches the historical contents (content versions) stored on a Confluence server. Initialize this class by passing the URL of this server. The url will be set as the origin of the data.
- Parameters
url – URL of the server
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification
- CATEGORIES = ['historical content']¶
A list of categories that can be fetched by this backend.
Every backend produces items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- fetch(category='historical content', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶
Fetch the contents by version from the server.
This method fetches the different historical versions (or snapshots) of the contents stored in the server that were updated since the given date. Only those snapshots created or updated after from_date will be returned.
Take into account that the seconds of the from_date parameter will be ignored because the Confluence REST API only accepts the date, hours and minutes for timestamp values.
- Parameters
category – the category of items to fetch
from_date – obtain historical versions of contents updated since this date
- Returns
a generator of historical versions
- fetch_items(category, **kwargs)[source]¶
Fetch the contents
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_archiving()[source]¶
Returns whether the backend supports archiving items during the fetch process.
- Returns
this backend supports items archive
- classmethod has_resuming()[source]¶
Returns whether the backend supports resuming the fetch process.
- Returns
this backend supports items resuming
- static metadata_category(item)[source]¶
Extracts the category from a Confluence item.
This backend only generates one type of item which is ‘historical content’.
- static metadata_id(item)[source]¶
Extracts the identifier from a Confluence item.
This identifier will be the mix of two fields because a historical content does not have any unique identifier. In this case, ‘id’ and ‘version’ values are combined because it should not be possible to have two equal version numbers for the same content. The value to return will follow the pattern: <content>#v<version> (e.g., 28979#v10).
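The pattern can be sketched as follows. The item layout, with the version number stored under item['version']['number'], is an assumption made for this example, and `content_version_id` is a hypothetical name.

```python
def content_version_id(item):
    """Combine content id and version number as <content>#v<version>."""
    return "{}#v{}".format(item["id"], item["version"]["number"])


# Made-up, heavily trimmed historical content.
snapshot = {"id": "28979", "version": {"number": 10}}
identifier = content_version_id(snapshot)
```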
- static metadata_updated_on(item)[source]¶
Extracts and converts the update time from a Confluence item.
The timestamp is extracted from ‘when’ field on ‘version’ section. This date is converted to UNIX timestamp format.
- Parameters
item – item generated by the backend
- Returns
a UNIX timestamp
- static parse_contents_summary(raw_json)[source]¶
Parse a Confluence summary JSON list.
The method parses a JSON stream and returns an iterator of dictionaries. Each dictionary is a content summary.
- Parameters
raw_json – JSON string to parse
- Returns
a generator of parsed content summaries.
- static parse_historical_content(raw_json)[source]¶
Parse a Confluence historical content JSON stream.
This method parses a JSON stream and returns a dictionary that contains the data of a historical content.
- Parameters
raw_json – JSON string to parse
- Returns
a dict with historical content
- search_fields(item)[source]¶
Add search fields to an item.
It adds the values of metadata_id plus the page ancestor IDs, the content ID and the content version number.
- Parameters
item – the item to extract the search fields values
- Returns
a dict of search fields
- version = '0.12.0'¶
- class perceval.backends.core.confluence.ConfluenceClient(base_url, archive=None, from_archive=False, ssl_verify=True)[source]¶
Bases: perceval.client.HttpClient
Confluence REST API client.
This class implements a client to retrieve contents from a Confluence server using its REST API.
- Parameters
base_url – URL of the Confluence server
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification
- MSEARCH = 'search'¶
- PANCESTORS = 'ancestors'¶
- PCQL = 'cql'¶
- PEXPAND = 'expand'¶
- PLIMIT = 'limit'¶
- PSTART = 'start'¶
- PSTATUS = 'status'¶
- PVERSION = 'version'¶
- RCONTENTS = 'content'¶
- RHISTORY = 'history'¶
- RSPACE = 'space'¶
- URL = '%(base)s/rest/api/%(resource)s'¶
- VCQL = "lastModified>='%(date)s' order by lastModified"¶
- VEXPAND = ['body.storage', 'history', 'version']¶
- VHISTORICAL = 'historical'¶
- contents(from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), offset=None, max_contents=200)[source]¶
Get the contents of a repository.
This method returns an iterator that manages the pagination over contents. Take into account that the seconds of from_date parameter will be ignored because the API only works with hours and minutes.
- Parameters
from_date – fetch the contents updated since this date
offset – fetch the contents starting from this offset
max_contents – maximum number of contents to fetch per request
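To make the note about ignored seconds concrete, the sketch below fills the VCQL template listed above with a date truncated to minutes. The "YYYY-MM-DD HH:MM" date format and the helper name `build_cql` are assumptions made for this example.

```python
from datetime import datetime, timezone

VCQL = "lastModified>='%(date)s' order by lastModified"


def build_cql(from_date):
    """Build a CQL query for contents modified since from_date (minute precision)."""
    # Seconds are dropped: only the date, hours and minutes are formatted.
    date = from_date.strftime("%Y-%m-%d %H:%M")
    return VCQL % {"date": date}


query = build_cql(datetime(2016, 7, 8, 9, 10, 59, tzinfo=timezone.utc))
```

Two from_date values that differ only in their seconds therefore produce the same query.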
- class perceval.backends.core.confluence.ConfluenceCommand(*args, debug=False)[source]¶
Bases: perceval.backend.BackendCommand
Class to run Confluence backend from the command line.
- BACKEND¶
perceval.backends.core.discourse module¶
- class perceval.backends.core.discourse.Discourse(url, api_username=None, api_token=None, tag=None, archive=None, max_retries=10, sleep_time=5, ssl_verify=True)[source]¶
Bases: perceval.backend.Backend
Discourse backend for Perceval.
This class retrieves the topics posted in a Discourse board. To initialize this class the URL must be provided. The url will be set as the origin of the data.
- Parameters
url – Discourse URL
api_username – Discourse API username
api_token – Discourse API access token
tag – label used to mark the data
archive – archive to store/retrieve items
max_retries – number of max retries to a data source before raising a RetryError exception
sleep_time – time (in seconds) to sleep in case of connection problems
ssl_verify – enable/disable SSL verification
- CATEGORIES = ['topic']¶
A list of categories that can be fetched by this backend.
Every backend produces items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- EXTRA_SEARCH_FIELDS = {'category_id': ['category_id']}¶
A set of search fields to simplify query operations.
The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:
{
'key-1': value-1, 'key-2': value-2, 'key-3': value-3
}
These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item ('item_id': item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:
{
'project_id': ['fields', 'project', 'id'], 'project_key': ['fields', 'project', 'key'], 'project_name': ['fields', 'project', 'name']
}
Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.
- fetch(category='topic', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶
Fetch the topics from the Discourse board.
The method retrieves, from a Discourse board, the topics updated since the given date.
- Parameters
category – the category of items to fetch
from_date – obtain topics updated since this date
- Returns
a generator of topics
- fetch_items(category, **kwargs)[source]¶
Fetch the topics
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_archiving()[source]¶
Returns whether the backend supports archiving items during the fetch process.
- Returns
this backend supports items archive
- classmethod has_resuming()[source]¶
Returns whether the backend supports resuming the fetch process.
- Returns
this backend supports items resuming
- static metadata_category(item)[source]¶
Extracts the category from a Discourse item.
This backend only generates one type of item which is ‘topic’.
- static metadata_updated_on(item)[source]¶
Extracts the update time from a Discourse item.
The timestamp used is extracted from ‘last_posted_at’ field. This date is converted to UNIX timestamp format taking into account the timezone of the date.
- Parameters
item – item generated by the backend
- Returns
a UNIX timestamp
- version = '0.13.1'¶
- class perceval.backends.core.discourse.DiscourseClient(base_url, api_username=None, api_key=None, sleep_time=5, max_retries=10, archive=None, from_archive=False, ssl_verify=True)[source]¶
Bases: perceval.client.HttpClient
Discourse API client.
This class implements a simple client to retrieve topics from any Discourse board.
- Parameters
base_url – URL of the Discourse site
api_username – Discourse API username
api_key – Discourse API access token
sleep_time – time (in seconds) to sleep in case of connection problems
max_retries – number of max retries to a data source before raising a RetryError exception
archive – collect issues already retrieved from an archive
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification
- Raises
HTTPError – when an error occurs doing the request
- ALL_TOPICS = None¶
- EXTRA_STATUS_FORCELIST = [429]¶
- HKEY = 'Api-Key'¶
- HUSER = 'Api-Username'¶
- POSTS = 'posts'¶
- PPAGE = 'page'¶
- TJSON = '.json'¶
- TOPIC = 't'¶
- TOPICS_SUMMARY = 'latest'¶
- post(post_id)[source]¶
Retrieve the post with post_id identifier.
- Parameters
post_id – identifier of the post to retrieve
- static sanitize_for_archive(url, headers, payload)[source]¶
Sanitize the payload of an HTTP request by removing the user and key information before storing/retrieving archived items.
- Parameters
url – HTTP url request
headers – HTTP headers request
payload – HTTP payload request
- Returns
url, headers and the sanitized payload
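The idea behind this kind of sanitizing can be sketched as follows. This is only an illustration: the constant values mirror HKEY and HUSER listed above, but whether the credentials live in the headers, the payload, or both is an assumption of this sketch, and the function is a stand-in rather than Perceval's own implementation.

```python
import copy

# Same values as the HKEY and HUSER constants of the client
HKEY = "Api-Key"
HUSER = "Api-Username"


def sanitize_for_archive(url, headers, payload):
    """Return url, headers and payload with the credential entries removed."""
    # Work on copies so the live request keeps its credentials.
    clean_headers = copy.deepcopy(headers) if headers else headers
    clean_payload = copy.deepcopy(payload) if payload else payload
    for container in (clean_headers, clean_payload):
        if container:
            # Assumption: credentials may appear in either container.
            container.pop(HKEY, None)
            container.pop(HUSER, None)
    return url, clean_headers, clean_payload


# Hypothetical request data
headers = {HKEY: "secret", HUSER: "jdoe", "Accept": "application/json"}
payload = {"page": 1}
url, clean_headers, clean_payload = sanitize_for_archive(
    "https://example.com/latest.json", headers, payload
)
print(clean_headers, clean_payload)
```

Copying before removing keeps the original request usable while the archived version carries no secrets.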
- class perceval.backends.core.discourse.DiscourseCommand(*args, debug=False)[source]¶
Bases:
perceval.backend.BackendCommand
Class to run Discourse backend from the command line.
- BACKEND¶
perceval.backends.core.dockerhub module¶
- class perceval.backends.core.dockerhub.DockerHub(owner, repository, tag=None, archive=None, ssl_verify=True)[source]¶
Bases:
perceval.backend.Backend
DockerHub backend for Perceval.
This class retrieves data from a repository stored in the Docker Hub site. To initialize this class, the owner and the repository from which data will be fetched must be provided. The origin of the data will be built with both parameters.
The shortcut owner _ used for official Docker repositories will be replaced by its long name: library.
- Parameters
owner – DockerHub owner
repository – DockerHub repository owned by owner
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification
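The owner shortcut described above can be pictured with a small sketch. The function and constant names are hypothetical and only illustrate the replacement rule; how Perceval builds the final origin string is not shown here.

```python
OFFICIAL_SHORTCUT = "_"       # shortcut used for official Docker repositories
OFFICIAL_OWNER = "library"    # long name that replaces the shortcut


def normalize_owner(owner):
    """Replace the official-repository shortcut with its long name."""
    return OFFICIAL_OWNER if owner == OFFICIAL_SHORTCUT else owner


print(normalize_owner("_"))            # official repository owner
print(normalize_owner("grimoirelab"))  # any other owner is kept as-is
```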
- CATEGORIES = ['dockerhub-data']¶
A list of categories that can be fetched by this backend.
Every backend is able to produce items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- EXTRA_SEARCH_FIELDS = {'name': ['name'], 'namespace': ['namespace']}¶
A set of search fields to simplify query operations.
The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:
{
‘key-1’: value-1, ‘key-2’: value-2, ‘key-3’: value-3
}
These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:
{
‘project_id’: [‘fields’, ‘project’, ‘id’], ‘project_key’: [‘fields’, ‘project’, ‘key’], ‘project_name’: [‘fields’, ‘project’, ‘name’]
}
Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.
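To make the "path" idea concrete, here is a minimal sketch of how such paths could be resolved against an item. The function name `build_search_fields`, the sample item, and the choice to store None for a missing path are assumptions of this example, not Perceval's actual behaviour.

```python
def build_search_fields(item_id, data, extra_search_fields):
    """Resolve each search field path against the item data."""
    fields = {"item_id": item_id}
    for name, path in extra_search_fields.items():
        value = data
        for key in path:
            # Walk one level deeper; stop with None if the path is missing.
            if not isinstance(value, dict) or key not in value:
                value = None
                break
            value = value[key]
        fields[name] = value
    return fields


# Hypothetical item data shaped like the example paths above
data = {"fields": {"project": {"id": "10", "key": "PRJ", "name": "Project"}}}
extra = {
    "project_id": ["fields", "project", "id"],
    "project_key": ["fields", "project", "key"],
    "project_name": ["fields", "project", "name"],
}
search_fields = build_search_fields("abc123", data, extra)
print(search_fields)
```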
- fetch(category='dockerhub-data')[source]¶
Fetch data from a Docker Hub repository.
The method retrieves, from a repository stored in Docker Hub, its data which includes number of pulls, stars, description, among other data.
- Parameters
category – the category of items to fetch
- Returns
a generator of data
- fetch_items(category, **kwargs)[source]¶
Fetch the Docker Hub items
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_archiving()[source]¶
Returns whether it supports archiving items on the fetch process.
- Returns
this backend supports items archive
- classmethod has_resuming()[source]¶
Returns whether it supports resuming the fetch process.
- Returns
this backend supports items resuming
- static metadata_category(item)[source]¶
Extracts the category from a Docker Hub item.
This backend only generates one type of item which is ‘dockerhub-data’.
- static metadata_updated_on(item)[source]¶
Extracts and converts the update time from a Docker Hub item.
The timestamp is extracted from ‘fetched_on’ field. This field is not part of the data provided by Docker Hub. It is added by this backend.
- Parameters
item – item generated by the backend
- Returns
a UNIX timestamp
- static parse_json(raw_json)[source]¶
Parse a Docker Hub JSON stream.
The method parses a JSON stream and returns a dict with the parsed data.
- Parameters
raw_json – JSON string to parse
- Returns
a dict with the parsed data
- version = '0.6.0'¶
- class perceval.backends.core.dockerhub.DockerHubClient(archive=None, from_archive=False, ssl_verify=True)[source]¶
Bases:
perceval.client.HttpClient
DockerHub API client.
Client for fetching information from the DockerHub server using its REST API v2.
- Parameters
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification
- RREPOSITORY = 'repositories'¶
- class perceval.backends.core.dockerhub.DockerHubCommand(*args, debug=False)[source]¶
Bases:
perceval.backend.BackendCommand
Class to run DockerHub backend from the command line.
- BACKEND¶
perceval.backends.core.gerrit module¶
- class perceval.backends.core.gerrit.Gerrit(hostname, user=None, port='29418', max_reviews=500, disable_host_key_check=False, id_filepath=None, tag=None, archive=None, blacklist_ids=None)[source]¶
Bases:
perceval.backend.Backend
Gerrit backend.
Class to fetch the reviews from a Gerrit server. To initialize this class the Hostname of the server must be provided. The hostname will be set as the origin of the data.
- Parameters
hostname – Gerrit server Hostname
user – SSH user used to connect to the Gerrit server
port – SSH port
max_reviews – maximum number of reviews requested on the same query
disable_host_key_check – disable host key controls
tag – label used to mark the data
archive – archive to store/retrieve items
blacklist_ids – exclude the reviews while fetching
id_filepath – path to SSH private key
- CATEGORIES = ['review']¶
A list of categories that can be fetched by this backend.
Every backend is able to produce items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- EXTRA_SEARCH_FIELDS = {'project_name': ['project'], 'review_hash': ['id']}¶
A set of search fields to simplify query operations.
The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:
{
‘key-1’: value-1, ‘key-2’: value-2, ‘key-3’: value-3
}
These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:
{
‘project_id’: [‘fields’, ‘project’, ‘id’], ‘project_key’: [‘fields’, ‘project’, ‘key’], ‘project_name’: [‘fields’, ‘project’, ‘name’]
}
Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.
- ORIGIN_UNIQUE_FIELD = OriginUniqueField(name='number', type=<class 'str'>)¶
A field unique to a given origin for items produced by this backend.
If ORIGIN_UNIQUE_FIELD is defined, users can pass a list of blocked values which should not be included in the results, if the field defined here contains them. For example, if ORIGIN_UNIQUE_FIELD were set to post_id, then users could pass a list of post ids that should be excluded from the results.
If set to None, blacklisting will be disabled completely. Otherwise, this should be set to an OriginUniqueField containing the name and data type of the field.
Note: Origin in this context refers to one site, API, or other remote that contains several repositories, each consisting of many items of several categories. For example, for the backend GitLab, an origin would be one instance of GitLab, such as gitlab.com or opensource.ieee.org, which each contain many repositories, which contain items such as issues and merge requests.
To access this field, please prefer origin_unique_field().
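The blacklisting mechanism can be illustrated with a short sketch. The namedtuple below is a stand-in for OriginUniqueField, and `filter_blocked` and the sample reviews are hypothetical; the real backend may apply the filter differently.

```python
from collections import namedtuple

# Stand-in with the same two fields shown in the signature above
OriginUniqueField = namedtuple("OriginUniqueField", ["name", "type"])


def filter_blocked(items, origin_unique_field, blocked_values):
    """Yield only the items whose unique field is not in the blocked list."""
    if origin_unique_field is None:
        # Blacklisting is disabled: every item passes through.
        yield from items
        return
    cast = origin_unique_field.type
    blocked = {cast(value) for value in blocked_values}
    for item in items:
        value = item.get(origin_unique_field.name)
        if value is not None and cast(value) in blocked:
            continue
        yield item


field = OriginUniqueField(name="number", type=str)
reviews = [{"number": "1"}, {"number": "2"}, {"number": "3"}]
kept = list(filter_blocked(reviews, field, ["2"]))
print(kept)
```

Casting both sides with the declared type lets a caller pass the blocked values as strings or integers without changing the outcome.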
- fetch(category='review', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶
Fetch the reviews from the repository.
The method retrieves, from a Gerrit repository, the reviews updated since the given date.
- Parameters
category – the category of items to fetch
from_date – obtain reviews updated since this date
- Returns
a generator of reviews
- fetch_items(category, **kwargs)[source]¶
Fetch the reviews
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_archiving()[source]¶
Returns whether it supports archiving items on the fetch process.
- Returns
this backend supports items archive
- classmethod has_resuming()[source]¶
Returns whether it supports resuming the fetch process.
- Returns
this backend does not support items resuming
- static metadata_category(item)[source]¶
Extracts the category from a Gerrit item.
This backend only generates one type of item which is ‘review’.
- static metadata_updated_on(item)[source]¶
Extracts and converts the update time from a Gerrit item.
The timestamp is extracted from ‘lastUpdated’ field. This date is a UNIX timestamp but needs to be converted to a float value.
- Parameters
item – item generated by the backend
- Returns
a UNIX timestamp
- version = '0.13.1'¶
- class perceval.backends.core.gerrit.GerritClient(repository, user=None, max_reviews=500, blacklist_reviews=None, disable_host_key_check=False, port='29418', id_filepath=None, archive=None, from_archive=False)[source]¶
Bases:
object
Gerrit API client.
This class implements a client to retrieve reviews from a Gerrit repository using the ssh API. Currently it supports <2.8 and >=2.9 versions in incremental mode.
Check the next link for more info: https://gerrit-documentation.storage.googleapis.com/Documentation/2.12/cmd-query.html
- Parameters
repository – Hostname of the Gerrit server
user – SSH user to be used to connect to gerrit server
max_reviews – max number of reviews per query
blacklist_reviews – exclude the reviews of this list while fetching
disable_host_key_check – disable host key controls
port – SSH port
id_filepath – SSH private key path
archive – collect issues already retrieved from an archive
from_archive – it tells whether to write/read the archive
- CMD_GERRIT = 'gerrit'¶
- CMD_VERSION = 'version'¶
- MAX_RETRIES = 3¶
- RETRY_WAIT = 60¶
- VERSION_REGEX = re.compile('gerrit version (\\d+)\\.(\\d+).*')¶
- next_retrieve_group_item(last_item=None, entry=None)[source]¶
Return the item to start from in next reviews group.
- static sanitize_for_archive(cmd)[source]¶
Sanitize the Gerrit command by removing username information before storing/retrieving archived items.
- Parameters
cmd – Gerrit command
- Returns
the sanitized cmd
- property version¶
Return the Gerrit server version.
- class perceval.backends.core.gerrit.GerritCommand(*args, debug=False)[source]¶
Bases:
perceval.backend.BackendCommand
Class to run Gerrit backend from the command line.
- BACKEND¶
alias of
perceval.backends.core.gerrit.Gerrit
perceval.backends.core.git module¶
- exception perceval.backends.core.git.EmptyRepositoryError(**kwargs)[source]¶
Bases:
perceval.errors.RepositoryError
Exception raised when a repository is empty
- message = '%(repository)s is empty'¶
- class perceval.backends.core.git.Git(uri, gitpath, tag=None, archive=None)[source]¶
Bases:
perceval.backend.Backend
Git backend.
This class fetches the commits from a Git repository (local or remote) or from a log file. To initialize this class, you have to provide the URI of the repository and a value for gitpath. This uri will be set as the origin of the data.
When gitpath is a directory or does not exist, it will be considered as the place where the repository is/will be cloned; when gitpath is a file it will be considered as a Git log file.
- Parameters
uri – URI of the Git repository
gitpath – path to the repository or to the log file
tag – label used to mark the data
archive – archive to store/retrieve items
- Raises
RepositoryError – raised when there was an error cloning or updating the repository.
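The rule for interpreting gitpath can be sketched with the standard library. The helper `classify_gitpath` and its return labels are invented for this illustration; they only restate the rule given above.

```python
import os
import tempfile


def classify_gitpath(gitpath):
    """Tell whether gitpath would be treated as a repository or a log file."""
    # An existing regular file is read as a Git log file.
    if os.path.isfile(gitpath):
        return "log_file"
    # A directory, or a path that does not exist yet, is where the
    # repository is (or will be) cloned.
    return "repository"


with tempfile.TemporaryDirectory() as tmpdir:
    log_path = os.path.join(tmpdir, "git.log")
    with open(log_path, "w") as log_file:
        log_file.write("")
    kinds = {
        "existing_dir": classify_gitpath(tmpdir),
        "missing_path": classify_gitpath(os.path.join(tmpdir, "missing")),
        "log_file": classify_gitpath(log_path),
    }
print(kinds)
```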
- CATEGORIES = ['commit']¶
A list of categories that can be fetched by this backend.
Every backend is able to produce items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- fetch(category='commit', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), to_date=datetime.datetime(2100, 1, 1, 0, 0, tzinfo=tzutc()), branches=None, latest_items=False, no_update=False)[source]¶
Fetch commits.
The method retrieves from a Git repository or a log file a list of commits. Commits are returned in the same order they were obtained.
When from_date parameter is given it returns items committed since the given date.
The list of branches is a list of strings, with the names of the branches to fetch. If the list of branches is empty, no commit is fetched. If the list of branches is None, all commits for all branches will be fetched.
The parameter latest_items returns only those commits which are new since the last time this method was called.
The parameter no_update returns all commits without performing an update of the repository before.
Take into account that from_date and branches are ignored when the commits are fetched from a Git log file or when latest_items flag is set.
The class raises a RepositoryError exception when an error occurs accessing the repository.
- Parameters
category – the category of items to fetch
from_date – obtain commits newer than a specific date (inclusive)
to_date – obtain commits older than a specific date
branches – names of branches to fetch from (default: None)
latest_items – sync with the repository to fetch only the newest commits
no_update – if enabled, don’t update the repo with the latest changes
- Returns
a generator of commits
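The three cases of the branches parameter (None, empty list, list of names) can be summarised in a small sketch. `select_branches` is a hypothetical helper, and dropping names that do not exist in the repository is an assumption of this example.

```python
def select_branches(available, branches=None):
    """Return the branches whose commits would be fetched."""
    if branches is None:
        # None means "all branches".
        return list(available)
    # An empty list selects nothing; otherwise keep the requested names.
    # Assumption: names that are not available are silently ignored.
    return [name for name in branches if name in available]


available = ["master", "dev"]
print(select_branches(available))           # every branch
print(select_branches(available, []))       # no branch, so no commits
print(select_branches(available, ["dev"]))  # only the requested branch
```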
- fetch_items(category, **kwargs)[source]¶
Fetch the commits
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_archiving()[source]¶
Returns whether it supports archiving items on the fetch process.
- Returns
this backend does not support items archive
- classmethod has_resuming()[source]¶
Returns whether it supports resuming the fetch process.
- Returns
this backend supports items resuming
- static metadata_category(item)[source]¶
Extracts the category from a Git item.
This backend only generates one type of item which is ‘commit’.
- static metadata_updated_on(item)[source]¶
Extracts the update time from a Git item.
The timestamp used is extracted from ‘CommitDate’ field. This date is converted to UNIX timestamp format taking into account the timezone of the date.
- Parameters
item – item generated by the backend
- Returns
a UNIX timestamp
- static parse_git_log_from_file(filepath)[source]¶
Parse a Git log file.
The method parses the Git log file and returns an iterator of dictionaries. Each of these contains a commit.
- Parameters
filepath – path to the log file
- Returns
a generator of parsed commits
- Raises
ParseError – raised when the format of the Git log file is invalid
OSError – raised when an error occurs reading the given file
- static parse_git_log_from_iter(iterator)[source]¶
Parse a Git log obtained from an iterator.
The method parses the Git log fetched from an iterator, where each item is a line of the log. It returns an iterator of dictionaries. Each dictionary contains a commit.
- Parameters
iterator – iterator of Git log lines
- Raises
ParseError – raised when the format of the Git log is invalid
- version = '0.12.1'¶
- class perceval.backends.core.git.GitCommand(*args, debug=False)[source]¶
Bases:
perceval.backend.BackendCommand
Class to run Git backend from the command line.
- BACKEND¶
alias of
perceval.backends.core.git.Git
- class perceval.backends.core.git.GitParser(stream)[source]¶
Bases:
object
Git log parser.
This class parses a plain Git log stream, converting plain commits into dict items.
Not every Git log output is valid to be parsed. The Git log stream must have a specific structure. It must contain raw commits data and stats about modified files. The next excerpt shows an example of a valid log:
commit aaa7a9209f096aaaadccaaa7089aaaa3f758a703
Author:     John Smith <jsmith@example.com>
AuthorDate: Tue Aug 14 14:30:13 2012 -0300
Commit:     John Smith <jsmith@example.com>
CommitDate: Tue Aug 14 14:30:13 2012 -0300

    Commit for testing

:000000 100644 0000000... aaaaaaa... A  aaa/otherthing
:000000 100644 0000000... aaaaaaa... A  aaa/something
:000000 100644 0000000... aaaaaaa... A  bbb/bthing
0       0       aaa/otherthing
0       0       aaa/something
0       0       bbb/bthing
Each commit starts with the ‘commit’ tag that is followed by the SHA-1 of the commit, its parents (two or more parents in the case of a merge) and a list of refs, if any.
commit 456a68ee1407a77f3e804a30dff245bb6c6b872f ce8e0b86a1e9877f42fe9453ede418519115f367 51a3b654f252210572297f47597b31527c475fb8 (HEAD -> refs/heads/master)
The commit line is followed by one or more headers. Each header has a key and a value:
Author:     John Smith <jsmith@example.com>
AuthorDate: Tue Aug 14 14:30:13 2012 -0300
Commit:     John Smith <jsmith@example.com>
CommitDate: Tue Aug 14 14:30:13 2012 -0300
Then, an empty line divides the headers from the commit message.
    First line of the commit

    Commit message split into one or several lines.
    Each line of the message starts with 4 spaces.

Commit messages can contain a list of ‘trailers’. These trailers have the same format as headers but their meaning is project dependent. This is an example of a commit message with trailers:

    Commit message with trailers

    This is the body of the message where trailers are included.
    Trailers are part of the body so each line of the message
    starts with 4 spaces.

    Signed-off-by: John Doe <jdoe@example.com>
    Signed-off-by: Jane Rae <jrae@example.com>
After a new empty line, actions and stats over files can be found. An action line starts with one or more ‘:’ chars and contains data about the old and new permissions of a file, its old and new indexes, the action code and the filepath to the file. In the case of a copied, renamed or moved file, the new filepath to that file is included.
:100644 100644 e69de29... e69de29... R100 aaa/otherthing aaa/otherthing.renamed
Stats lines include the number of lines added and removed, and the name of the file. The new name is also included for moved or renamed files.
10 0 aaa/{otherthing => otherthing.renamed}
The commit ends with an empty line.
Take into account that one empty line is valid at the beginning of the log. This allows empty logs to be parsed without raising exceptions.
This example was generated using the next command:
git log --raw --numstat --pretty=fuller --decorate=full --parents -M -C -c --remotes=origin --all
- Parameters
stream – a file object which stores the log
- ACTION_PATTERN = '^(?P<sc>\\:+)\n (?P<modes>(?:\\d{6}[ \\t])+)\n (?P<indexes>(?:[a-f0-9]+\\.{,3}[ \\t])+)\n (?P<action>[^\\t]+)\\t+\n (?P<file>[^\\t]+)\n (?:\\t+(?P<newfile>.+))?$'¶
- COMMIT = 1¶
- COMMIT_PATTERN = '^commit[ \\t](?P<commit>[a-f0-9]{40})\n (?:[ \\t](?P<parents>[a-f0-9][a-f0-9 \\t]+))?\n (?:[ \\t]\\((?P<refs>.+)\\))?$\n '¶
- EMPTY_LINE_PATTERN = '^$'¶
- FILE = 4¶
- GIT_ACTION_REGEXP = re.compile('^(?P<sc>\\:+)\n (?P<modes>(?:\\d{6}[ \\t])+)\n (?P<indexes>(?:[a-f0-9]+\\.{,3}[ \\t])+)\n (?P<action>[^\\t]+)\\t+\n , re.VERBOSE)¶
- GIT_COMMIT_REGEXP = re.compile('^commit[ \\t](?P<commit>[a-f0-9]{40})\n (?:[ \\t](?P<parents>[a-f0-9][a-f0-9 \\t]+))?\n (?:[ \\t]\\((?P<refs>.+)\\))?$\n ', re.VERBOSE)¶
- GIT_HEADER_TRAILER_REGEXP = re.compile('^(?P<name>[a-zA-z0-9\\-]+)\\:[ \\t]+(?P<value>.+)$', re.VERBOSE)¶
- GIT_MESSAGE_REGEXP = re.compile('^[\\s]{4}(?P<msg>.*)$', re.VERBOSE)¶
- GIT_NEXT_STATE_REGEXP = re.compile('^$', re.VERBOSE)¶
- GIT_STATS_REGEXP = re.compile('^(?P<added>\\d+|-)[ \\t]+(?P<removed>\\d+|-)[ \\t]+(?P<file>.+)$', re.VERBOSE)¶
- HEADER = 2¶
- HEADER_TRAILER_PATTERN = '^(?P<name>[a-zA-z0-9\\-]+)\\:[ \\t]+(?P<value>.+)$'¶
- INIT = 0¶
- MESSAGE = 3¶
- MESSAGE_LINE_PATTERN = '^[\\s]{4}(?P<msg>.*)$'¶
- STATS_PATTERN = '^(?P<added>\\d+|-)[ \\t]+(?P<removed>\\d+|-)[ \\t]+(?P<file>.+)$'¶
- TRAILERS = ['Signed-off-by']¶
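As an illustration of how the patterns listed above match log lines, the sketch below compiles COMMIT_PATTERN and STATS_PATTERN (copied from the values above) and applies them to the sample commit and stats lines from the class description. It only demonstrates the regular expressions; it is not the parser's state machine.

```python
import re

# Patterns copied from the attribute values above
COMMIT_PATTERN = r"""^commit[ \t](?P<commit>[a-f0-9]{40})
(?:[ \t](?P<parents>[a-f0-9][a-f0-9 \t]+))?
(?:[ \t]\((?P<refs>.+)\))?$
"""
STATS_PATTERN = r"^(?P<added>\d+|-)[ \t]+(?P<removed>\d+|-)[ \t]+(?P<file>.+)$"

commit_re = re.compile(COMMIT_PATTERN, re.VERBOSE)
stats_re = re.compile(STATS_PATTERN, re.VERBOSE)

commit_line = (
    "commit 456a68ee1407a77f3e804a30dff245bb6c6b872f "
    "ce8e0b86a1e9877f42fe9453ede418519115f367 "
    "51a3b654f252210572297f47597b31527c475fb8 "
    "(HEAD -> refs/heads/master)"
)
stats_line = "10\t0\taaa/{otherthing => otherthing.renamed}"

commit_match = commit_re.match(commit_line)
# A merge commit lists two or more parents separated by whitespace.
parents = commit_match.group("parents").split()
refs = commit_match.group("refs")

stats_match = stats_re.match(stats_line)
stats = stats_match.groupdict()

print(commit_match.group("commit"), parents, refs)
print(stats)
```

Because the commit pattern is compiled with re.VERBOSE, the line breaks inside it are ignored, while the spaces inside character classes such as `[ \t]` still count.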
- class perceval.backends.core.git.GitRef(hash, refname)¶
Bases:
tuple
- property hash¶
Alias for field number 0
- property refname¶
Alias for field number 1
- class perceval.backends.core.git.GitRepository(uri, dirpath)[source]¶
Bases:
object
Manage a Git repository.
This class provides access to a Git repository by running some common commands such as clone, pull or log. To create an instance from a remote repository, use the clone() class method.
- Parameters
uri – URI of the repository
dirpath – local directory where the repository is stored
- GIT_PRETTY_OUTPUT_OPTS = ['--raw', '--numstat', '--pretty=fuller', '--decorate=full', '--parents', '-M', '-C', '-c']¶
- classmethod clone(uri, dirpath)[source]¶
Clone a Git repository.
Make a bare copy of the repository stored in uri into dirpath. The repository can be either local or remote.
- Parameters
uri – URI of the repository
dirpath – directory where the repository will be cloned
- Returns
a GitRepository instance for the cloned repository
- Raises
RepositoryError – when an error occurs cloning the given repository
- count_objects()[source]¶
Count the objects of a repository.
The method returns the total number of objects (packed and unpacked) available on the repository.
- Raises
RepositoryError – when an error occurs counting the objects of a repository
- is_detached()[source]¶
Check if the repo is in a detached state.
The repository is in a detached state when HEAD is not a symbolic reference.
- Returns
whether the repository is detached or not
- Raises
RepositoryError – when an error occurs checking the state of the repository
- is_empty()[source]¶
Determines whether the repository is empty or not.
Returns True when the repository is empty. Under the hood, it checks the number of objects on the repository. When this number is 0, the repository is empty.
- Raises
RepositoryError – when an error occurs accessing the repository
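One way to picture the object count and the emptiness check is to parse the output of `git count-objects -v`, which reports loose objects as `count` and packed objects as `in-pack`. That this particular command is what the class runs is an assumption of this sketch, and the sample output below is made up.

```python
def count_objects(count_output):
    """Sum loose and packed objects from 'git count-objects -v' output."""
    stats = {}
    for line in count_output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            stats[key.strip()] = value.strip()
    # 'count' holds loose (unpacked) objects, 'in-pack' the packed ones.
    return int(stats["count"]) + int(stats["in-pack"])


def is_empty(count_output):
    """A repository with no objects at all is considered empty."""
    return count_objects(count_output) == 0


# Hypothetical outputs of 'git count-objects -v'
populated = "count: 3\nsize: 12\nin-pack: 12\npacks: 1\nsize-pack: 4\n"
fresh = "count: 0\nsize: 0\nin-pack: 0\npacks: 0\nsize-pack: 0\n"
print(count_objects(populated), is_empty(populated), is_empty(fresh))
```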
- log(from_date=None, to_date=None, branches=None, encoding='utf-8')[source]¶
Read the commit log from the repository.
The method returns the Git log of the repository using the following options:
git log --raw --numstat --pretty=fuller --decorate=full --all --reverse --topo-order --parents -M -C -c --remotes=origin
When from_date is given, it gets the commits equal to or newer than that date. This date is given as a datetime object.
The list of branches is a list of strings, with the names of the branches to fetch. If the list of branches is empty, no commit is fetched. If the list of branches is None, all commits for all branches will be fetched.
- Parameters
from_date – fetch commits newer than a specific date (inclusive)
branches – names of branches to fetch from (default: None)
encoding – encode the log using this format
- Returns
a generator where each item is a line from the log
- Raises
EmptyRepositoryError – when the repository is empty and the action cannot be performed
RepositoryError – when an error occurs fetching the log
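A command line of this kind could be assembled as in the sketch below. The option list comes from GIT_PRETTY_OUTPUT_OPTS above, and `--since` and `--until` are standard git log options, but mapping from_date and to_date onto them, the date format, and returning None for an empty branch list are assumptions of this example.

```python
from datetime import datetime, timezone

GIT_PRETTY_OUTPUT_OPTS = [
    "--raw", "--numstat", "--pretty=fuller", "--decorate=full",
    "--parents", "-M", "-C", "-c",
]
DATE_FORMAT = "%Y-%m-%d %H:%M:%S %z"  # assumed format for the date options


def build_log_command(from_date=None, to_date=None, branches=None):
    """Build the argument list for a 'git log' call, or None if nothing to fetch."""
    if branches is not None and len(branches) == 0:
        # An empty list of branches means no commit is fetched.
        return None
    cmd = ["git", "log", "--reverse", "--topo-order"]
    cmd.extend(GIT_PRETTY_OUTPUT_OPTS)
    if from_date is not None:
        cmd.append("--since=" + from_date.strftime(DATE_FORMAT))
    if to_date is not None:
        cmd.append("--until=" + to_date.strftime(DATE_FORMAT))
    if branches is None:
        # None means all commits for all branches.
        cmd.extend(["--all", "--remotes=origin"])
    else:
        cmd.extend(branches)
    return cmd


cmd = build_log_command(from_date=datetime(2020, 1, 1, tzinfo=timezone.utc))
print(cmd)
```

Passing the arguments as a list, rather than a single shell string, avoids quoting problems with the space inside the date value.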
- rev_list(branches=None)[source]¶
Read the list of commits from the repository.
The list of branches is a list of strings, with the names of the branches to fetch. If the list of branches is empty, no commit is fetched. If the list of branches is None, all commits for all branches will be fetched.
The method returns the Git rev-list of the repository using the following options:
git rev-list --topo-order
- Parameters
branches – names of branches to fetch from (default: None)
- Raises
EmptyRepositoryError – when the repository is empty and the action cannot be performed
RepositoryError – when an error occurs executing the command
- show(commits=None, encoding='utf-8')[source]¶
Show the data of a set of commits.
The method returns the output of Git show command for a set of commits using the following options:
git show --raw --numstat --pretty=fuller --decorate=full --parents -M -C -c [<commit>...<commit>]
When the list of commits is empty, the command will return data about the last commit, like the default behaviour of git show.
- Parameters
commits – list of commits to show data
encoding – encode the output using this format
- Returns
a generator where each item is a line from the show output
- Raises
EmptyRepositoryError – when the repository is empty and the action cannot be performed
RepositoryError – when an error occurs fetching the show output
- sync()[source]¶
Keep the repository in sync.
This method will synchronize the repository with its ‘origin’, fetching the newest objects and updating references. It uses low-level commands that make it possible to keep track of what has changed in the repository.
The method also returns a list of hashes related to the new commits fetched during the process.
- Returns
list of new commits
- Raises
RepositoryError – when an error occurs synchronizing the repository
- update()[source]¶
Update repository from its remote.
Calling this method, the repository will be synchronized with the remote repository using ‘fetch’ command for ‘heads’ refs. Any commit stored in the local copy will be removed; refs will be overwritten.
- Raises
RepositoryError – when an error occurs updating the repository
perceval.backends.core.github module¶
- class perceval.backends.core.github.GitHub(owner=None, repository=None, api_token=None, github_app_id=None, github_app_pk_filepath=None, base_url=None, tag=None, archive=None, sleep_for_rate=False, min_rate_to_sleep=10, max_retries=5, sleep_time=1, max_items=100, ssl_verify=True)[source]¶
Bases:
perceval.backend.Backend
GitHub backend for Perceval.
This class fetches the issues stored in a GitHub repository. Note that since version 0.20.0, api_token accepts a list of tokens, so the backend must be initialized as follows:
```
GitHub(
    owner='chaoss', repository='grimoirelab',
    api_token=[TOKEN-1, TOKEN-2, ...], sleep_for_rate=True,
    sleep_time=300
)
```
- Parameters
owner – GitHub owner
repository – GitHub repository from the owner
api_token – list of GitHub auth tokens to access the API
github_app_id – GitHub App ID
github_app_pk_filepath – GitHub App private key PEM file path
base_url – GitHub URL in enterprise edition case; when no value is set the backend will fetch the data from the GitHub public site.
tag – label used to mark the data
archive – archive to store/retrieve items
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to sleep until it will be reset
max_retries – number of max retries to a data source before raising a RetryError exception
max_items – max number of category items (e.g., issues, pull requests) per query
sleep_time – time to sleep in case of connection problems
ssl_verify – enable/disable SSL verification
- CATEGORIES = ['issue', 'pull_request', 'repository']¶
A list of categories that can be fetched by this backend.
Every backend is able to produce items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- CLASSIFIED_FIELDS = [['user_data'], ['merged_by_data'], ['assignee_data'], ['assignees_data'], ['requested_reviewers_data'], ['comments_data', 'user_data'], ['comments_data', 'reactions_data', 'user_data'], ['reviews_data', 'user_data'], ['review_comments_data', 'user_data'], ['review_comments_data', 'reactions_data', 'user_data']]¶
A list of fields that should be considered sensitive or confidential.
Fields listed here will be hidden from fetched items, when this behaviour is requested.
Fields are represented as a list of strings. As items returned are dicts that may contain nested dicts, each entry is a list which stores the “path” or nested dicts keys to the field to remove. For example, [‘my’, ‘classified’, ‘field’] will remove field from item[‘data’][‘my’][‘classified’] dict.
Classified data filtering and archiving are not compatible to prevent data leaks or security issues.
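The path-based removal of classified fields can be sketched as below. The function names and the sample item are hypothetical, and descending into lists (so that a path such as ['comments_data', 'user_data'] reaches every comment) is an assumption about how nested lists are handled.

```python
import copy


def _remove_path(node, path):
    """Remove the field addressed by path from a nested dict/list structure."""
    if isinstance(node, list):
        # Apply the same path to every element of a list.
        for element in node:
            _remove_path(element, path)
        return
    if not isinstance(node, dict) or not path:
        return
    key = path[0]
    if key not in node:
        return
    if len(path) == 1:
        del node[key]
    else:
        _remove_path(node[key], path[1:])


def remove_classified(item, classified_fields):
    """Return a copy of item with the classified fields removed from its data."""
    cleaned = copy.deepcopy(item)
    for path in classified_fields:
        _remove_path(cleaned["data"], path)
    return cleaned


item = {
    "data": {
        "title": "Sample issue",
        "user_data": {"login": "jdoe"},
        "comments_data": [{"body": "hi", "user_data": {"login": "jrae"}}],
    }
}
cleaned = remove_classified(item, [["user_data"], ["comments_data", "user_data"]])
print(cleaned)
```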
- fetch(category='issue', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), to_date=datetime.datetime(2100, 1, 1, 0, 0, tzinfo=tzutc()), filter_classified=False)[source]¶
Fetch the issues/pull requests from the repository.
The method retrieves, from a GitHub repository, the issues/pull requests updated since the given date.
- Parameters
category – the category of items to fetch
from_date – obtain issues/pull requests updated since this date
to_date – obtain issues/pull requests until a specific date (included)
filter_classified – remove classified fields from the resulting items
- Returns
a generator of issues
- fetch_items(category, **kwargs)[source]¶
Fetch the items (issues or pull_requests or repo information)
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_archiving()[source]¶
Returns whether it supports archiving items on the fetch process.
- Returns
this backend supports items archive
- classmethod has_resuming()[source]¶
Returns whether it supports resuming the fetch process.
- Returns
this backend supports items resuming
- static metadata_category(item)[source]¶
Extracts the category from a GitHub item.
This backend generates three types of item which are ‘issue’, ‘pull_request’ and ‘repo’ information.
- static metadata_updated_on(item)[source]¶
Extracts the update time from a GitHub item.
The timestamp used is extracted from ‘updated_at’ field. This date is converted to UNIX timestamp format. As GitHub dates are in UTC the conversion is straightforward.
- Parameters
item – item generated by the backend
- Returns
a UNIX timestamp
- search_fields(item)[source]¶
Add search fields to an item.
It adds the values of metadata_id plus the owner and repo.
- Parameters
item – the item to extract the search fields values
- Returns
a dict of search fields
- version = '0.27.0'¶
- class perceval.backends.core.github.GitHubClient(owner, repository, tokens=None, github_app_id=None, github_app_pk_filepath=None, base_url=None, sleep_for_rate=False, min_rate_to_sleep=10, sleep_time=1, max_retries=5, max_items=100, archive=None, from_archive=False, ssl_verify=True)[source]¶
Bases:
perceval.client.HttpClient, perceval.client.RateLimitHandler
Client for retrieving information from GitHub API
- Parameters
owner – GitHub owner
repository – GitHub repository from the owner
tokens – list of GitHub auth tokens to access the API
github_app_id – GitHub App ID
github_app_pk_filepath – GitHub App private key PEM file path
base_url – GitHub URL in enterprise edition case; when no value is set the backend will fetch the data from the GitHub public site.
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to sleep until it will be reset
sleep_time – time to sleep in case of connection problems
max_retries – number of max retries to a data source before raising a RetryError exception
max_items – max number of category items (e.g., issues, pull requests) per query
archive – collect issues already retrieved from an archive
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification
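How `sleep_for_rate` and `min_rate_to_sleep` might interact can be sketched as a small decision function. This is an illustration under stated assumptions, not the library's implementation: the function name `rate_limit_action`, the `<=` threshold comparison, and raising an error when sleeping is disabled are all assumptions.

```python
def rate_limit_action(remaining, min_rate_to_sleep=10, sleep_for_rate=False):
    """Decide what a client could do before sending the next request.

    Hypothetical helper. 'remaining' is the number of requests left in the
    current rate-limit window, or None when it is unknown.
    """
    if remaining is None or remaining > min_rate_to_sleep:
        return 'continue'
    # The remaining rate is at or below the threshold (assumed comparison).
    return 'sleep' if sleep_for_rate else 'raise'


print(rate_limit_action(500))
print(rate_limit_action(5, sleep_for_rate=True))
print(rate_limit_action(5, sleep_for_rate=False))
```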
- EXTRA_STATUS_FORCELIST = [403, 500, 502, 503]¶
- HACCEPT = 'Accept'¶
- HAUTHORIZATION = 'Authorization'¶
- PDIRECTION = 'direction'¶
- PPER_PAGE = 'per_page'¶
- PSINCE = 'since'¶
- PSORT = 'sort'¶
- PSTATE = 'state'¶
- RCOMMENTS = 'comments'¶
- RCOMMITS = 'commits'¶
- RISSUES = 'issues'¶
- RORGS = 'orgs'¶
- RPULLS = 'pulls'¶
- RRATE_LIMIT = 'rate_limit'¶
- RREACTIONS = 'reactions'¶
- RREPOS = 'repos'¶
- RREQUESTED_REVIEWERS = 'requested_reviewers'¶
- RREVIEWS = 'reviews'¶
- RUSERS = 'users'¶
- VACCEPT = 'application/vnd.github.squirrel-girl-preview'¶
- VACCEPT_V3 = 'application/vnd.github.v3+json'¶
- VDIRECTION_ASC = 'asc'¶
- VSORT_UPDATED = 'updated'¶
- VSTATE_ALL = 'all'¶
- calculate_time_to_reset()[source]¶
Calculate the seconds until the token requests are reset, by obtaining the difference between the current date and the next date when the token is fully regenerated.
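The calculation can be sketched as the difference between two points in time. The helper below is hypothetical and assumes the reset moment is available as a UNIX epoch value; how the client actually obtains that value is not shown here.

```python
import math
from datetime import datetime, timezone


def seconds_to_reset(reset_epoch, now=None):
    """Return the whole seconds left until 'reset_epoch' (never negative).

    Hypothetical helper; 'now' can be injected to make the result testable.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    reset = datetime.fromtimestamp(reset_epoch, tz=timezone.utc)
    remaining = (reset - now).total_seconds()
    # Round up so the caller never wakes before the reset moment.
    return max(0, math.ceil(remaining))


current = datetime(2020, 1, 1, tzinfo=timezone.utc)
print(seconds_to_reset(current.timestamp() + 90, now=current))
```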
- fetch(url, payload=None, headers=None, method='GET', stream=False, auth=None)[source]¶
Fetch the data from a given URL.
- Parameters
url – link to the resource
payload – payload of the request
headers – headers of the request
method – type of request call (GET or POST)
stream – defer downloading the response body until the response content is available
auth – auth of the request
- Returns
a response object
- issues(from_date=None)[source]¶
Fetch the issues from the repository.
The method retrieves, from a GitHub repository, the issues updated since the given date.
- Parameters
from_date – obtain issues updated since this date
- Returns
a generator of issues
- pulls(from_date=None)[source]¶
Fetch the pull requests from the repository.
The method retrieves, from a GitHub repository, the pull requests updated since the given date.
- Parameters
from_date – obtain pull requests updated since this date
- Returns
a generator of pull requests
- static sanitize_for_archive(url, headers, payload)[source]¶
Sanitize the payload of an HTTP request by removing the token information before storing/retrieving archived items.
- Parameters
url – HTTP url request
headers – HTTP headers request
payload – HTTP payload request
- Returns
url, headers and the sanitized payload
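A minimal sketch of this kind of sanitizing, assuming the token travels in the ‘Authorization’ header (the value of `HAUTHORIZATION` listed above). The function name `strip_auth_header` is hypothetical and the real method may treat the request differently.

```python
def strip_auth_header(url, headers, payload):
    """Return the request parts without the 'Authorization' header.

    Hypothetical helper; the input dict is copied, not modified in place.
    """
    if headers and 'Authorization' in headers:
        headers = dict(headers)
        headers.pop('Authorization')
    return url, headers, payload


request = strip_auth_header(
    'https://api.example.org/repos',          # placeholder URL
    {'Authorization': 'token SECRET', 'Accept': 'application/json'},
    {'per_page': 100},
)
print(request)
```

Removing the credential before archiving keeps stored requests comparable across runs that use different tokens, and keeps secrets out of the archive.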
- class perceval.backends.core.github.GitHubCommand(*args, debug=False)[source]¶
Bases:
perceval.backend.BackendCommand
Class to run GitHub backend from the command line.
- BACKEND¶
alias of
perceval.backends.core.github.GitHub
perceval.backends.core.githubql module¶
- class perceval.backends.core.githubql.GitHubQL(owner=None, repository=None, api_token=None, github_app_id=None, github_app_pk_filepath=None, base_url=None, tag=None, archive=None, sleep_for_rate=False, min_rate_to_sleep=10, max_retries=5, sleep_time=1, max_items=100, ssl_verify=True)[source]¶
Bases:
perceval.backends.core.github.GitHub
GitHubQL backend for Perceval using the GitHub API v4. Most of the methods are inherited from the GitHub backend.
This class fetches the issue events of a GitHub repository. Note that the events retrieved also include those of pull requests since, in GitHub, every pull request is an issue, but an issue may not be a pull request. Pull requests can be identified by the attribute pull_request included in data.issue.
Because the GitHub API v3 cannot fetch issue events after a given date, the events are fetched via the GitHub API v4 (based on GraphQL).
All issues of a given tracker are retrieved in ascending order based on the last time they were updated. For each issue, its events (optionally from/until a given date) are collected using a GraphQL call. Each event is returned by Perceval together with the corresponding issue (available in data.issue).
Since the events are collected issue by issue, incremental fetching is not supported. This limitation is due to the fact that events that occur on an issue may not update the issue attributes. Since there is no way to identify new events from the attributes of an issue, all issues must be fetched on every execution.
No user information beyond the login is included in the data returned by this backend. Thus, the backend does not require filter-classified support.
- Parameters
owner – GitHub owner
repository – GitHub repository from the owner
api_token – list of GitHub auth tokens to access the API
github_app_id – GitHub App ID
github_app_pk_filepath – GitHub App private key PEM file path
base_url – GitHub URL in enterprise edition case; when no value is set the backend will fetch the data from the GitHub public site.
tag – label used to mark the data
archive – archive to store/retrieve items
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to sleep until it will be reset
max_retries – number of max retries to a data source before raising a RetryError exception
max_items – max number of category items per query
sleep_time – time to sleep in case of connection problems
ssl_verify – enable/disable SSL verification
- CATEGORIES = ['event']¶
A list of categories that can be fetched by this backend.
Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- fetch(category='event', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), to_date=datetime.datetime(2100, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶
Fetch the issue events from the repository.
The method retrieves, from a GitHub repository, the issue events since/until a given date.
- Parameters
category – the category of items to fetch
from_date – obtain issue events since this date
to_date – obtain issue events until this date (included)
- Returns
a generator of events
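The since/until selection can be illustrated with plain data. This sketch is not the backend's GraphQL logic: the `events` and `type` keys, the `number` field, and the helper names are hypothetical, while ‘createdAt’ and the attached issue follow the description in this section. The upper bound is inclusive, as stated for to_date.

```python
from datetime import datetime, timezone


def parse_utc(value):
    """Parse a 'YYYY-MM-DDTHH:MM:SSZ' string (assumed layout) as UTC."""
    parsed = datetime.strptime(value, '%Y-%m-%dT%H:%M:%SZ')
    return parsed.replace(tzinfo=timezone.utc)


def events_in_range(issues, from_date, to_date):
    """Yield events created within [from_date, to_date], issue by issue."""
    for issue in issues:
        for event in issue['events']:
            if from_date <= parse_utc(event['createdAt']) <= to_date:
                # Return each event together with the issue it belongs to.
                yield {**event, 'issue': {'number': issue['number']}}


issues = [
    {'number': 1, 'events': [
        {'type': 'closed', 'createdAt': '2020-01-10T00:00:00Z'},
        {'type': 'reopened', 'createdAt': '2020-03-01T00:00:00Z'},
    ]},
    {'number': 2, 'events': [
        {'type': 'labeled', 'createdAt': '2020-01-31T00:00:00Z'},
    ]},
]
start = datetime(2020, 1, 1, tzinfo=timezone.utc)
end = datetime(2020, 1, 31, tzinfo=timezone.utc)
selected = list(events_in_range(issues, start, end))
print([(e['type'], e['issue']['number']) for e in selected])
```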
- fetch_items(category, **kwargs)[source]¶
Fetch the items
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_resuming()[source]¶
Returns whether the backend supports resuming the fetch process.
- Returns
this backend doesn’t support items resuming
- static metadata_category(item)[source]¶
Extracts the category from a GitHub item.
This backend generates one type of item which is ‘event’.
- static metadata_updated_on(item)[source]¶
Extracts the update time from a GitHub item.
The timestamp is extracted from the ‘createdAt’ field and converted to UNIX timestamp format. As GitHub dates are in UTC, the conversion is straightforward.
- Parameters
item – item generated by the backend
- Returns
a UNIX timestamp
- version = '0.4.0'¶
- class perceval.backends.core.githubql.GitHubQLClient(owner, repository, tokens=None, github_app_id=None, github_app_pk_filepath=None, base_url=None, sleep_for_rate=False, min_rate_to_sleep=10, sleep_time=1, max_retries=5, max_items=100, archive=None, from_archive=False, ssl_verify=True)[source]¶
Bases:
perceval.backends.core.github.GitHubClient
Client for retrieving information from the GitHub API.
- Parameters
owner – GitHub owner
repository – GitHub repository from the owner
tokens – list of GitHub auth tokens to access the API
github_app_id – GitHub App ID
github_app_pk_filepath – GitHub App private key PEM file path
base_url – GitHub URL in enterprise edition case; when no value is set the backend will fetch the data from the GitHub public site.
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to sleep until it will be reset
sleep_time – time to sleep in case of connection problems
max_retries – number of max retries to a data source before raising a RetryError exception
max_items – max number of category items (e.g., issues, pull requests) per query
archive – collect events already retrieved from an archive
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification
- VACCEPT = 'application/vnd.github.squirrel-girl-preview,application/vnd.github.starfox-preview+json'¶
- VPER_PAGE = 100¶
- class perceval.backends.core.githubql.GitHubQLCommand(*args, debug=False)[source]¶
Bases:
perceval.backends.core.github.GitHubCommand
Class to run GitHubQL backend from the command line.
- BACKEND¶
perceval.backends.core.gitlab module¶
- class perceval.backends.core.gitlab.GitLab(owner=None, repository=None, api_token=None, is_oauth_token=False, base_url=None, tag=None, archive=None, sleep_for_rate=False, min_rate_to_sleep=10, max_retries=5, sleep_time=1, blacklist_ids=None, extra_retry_after_status=None, ssl_verify=True)[source]¶
Bases:
perceval.backend.Backend
GitLab backend for Perceval.
This class fetches the issues stored in a GitLab repository.
- Parameters
owner – GitLab owner
repository – GitLab repository from the owner
api_token – GitLab auth token to access the API
is_oauth_token – True if the token is OAuth (default False)
base_url – GitLab URL in enterprise edition case; when no value is set the backend will fetch the data from the GitLab public site.
tag – label used to mark the data
archive – archive to store/retrieve items
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to sleep until it will be reset
max_retries – number of max retries to a data source before raising a RetryError exception
sleep_time – time (in seconds) to sleep in case of connection problems
blacklist_ids – ids of items that must not be retrieved
extra_retry_after_status – additional HTTP status codes after which requests are retried (default 500 and 502). These statuses complement the ones (413, 429, 503) defined in the HttpClient class
ssl_verify – enable/disable SSL verification
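How the extra statuses might combine with the defaults can be sketched as a set union. The constant and function names below are hypothetical; only the status codes themselves (413, 429, 503 and the default extras 500, 502) come from the parameter description above.

```python
# Status codes named in the parameter description above.
BASE_RETRY_STATUS = [413, 429, 503]
DEFAULT_EXTRA_RETRY_STATUS = [500, 502]


def retry_statuses(extra_retry_after_status=None):
    """Return the sorted list of statuses that would trigger a retry.

    Hypothetical helper: merges the base list with the extra one.
    """
    if extra_retry_after_status is None:
        extra_retry_after_status = DEFAULT_EXTRA_RETRY_STATUS
    return sorted(set(BASE_RETRY_STATUS) | set(extra_retry_after_status))


print(retry_statuses())
print(retry_statuses([404, 410]))
```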
- CATEGORIES = ['issue', 'merge_request']¶
A list of categories that can be fetched by this backend.
Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- ORIGIN_UNIQUE_FIELD = OriginUniqueField(name='iid', type=<class 'int'>)¶
A field unique to a given origin for items produced by this backend.
If ORIGIN_UNIQUE_FIELD is defined, users can pass a list of blocked values which should not be included in the results, if the field defined here contains them. For example, if ORIGIN_UNIQUE_FIELD were set to post_id, then users could pass a list of post ids that should be excluded from the results.
If set to None, blacklisting will be disabled completely. Otherwise, this should be set to an OriginUniqueField containing the name and data type of the field.
Note: Origin in this context refers to one site, API, or other remote that contains several repositories, each consisting of many items of several categories. For example, for the GitLab backend, an origin would be one GitLab instance, such as gitlab.com or opensource.ieee.org, each of which contains many repositories, which in turn contain items such as issues and merge requests.
To access this field, please prefer origin_unique_field().
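The blacklisting idea can be sketched with a local stand-in for the field descriptor. The namedtuple below only mirrors the `name` and `type` attributes shown in the signature above; the function `skip_blacklisted` and the sample items are hypothetical, and the backend's real filtering may differ.

```python
from collections import namedtuple

# Local stand-in with the two attributes shown above (name, type).
OriginUniqueField = namedtuple('OriginUniqueField', ['name', 'type'])
FIELD = OriginUniqueField(name='iid', type=int)


def skip_blacklisted(items, blacklist_ids, field=FIELD):
    """Yield only the items whose unique field is not in the blacklist."""
    # Cast the blocked values to the declared type so '2' and 2 match.
    blocked = {field.type(value) for value in (blacklist_ids or [])}
    for item in items:
        if field.type(item[field.name]) in blocked:
            continue
        yield item


items = [{'iid': 1}, {'iid': 2}, {'iid': 3}]
print(list(skip_blacklisted(items, ['2'])))
```

Casting with the declared type is one way to make values passed as strings (for example from a command line) comparable with the values found in the items.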
- fetch(category='issue', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶
Fetch the issues/merge requests from the repository.
The method retrieves, from a GitLab repository, the issues/merge requests updated since the given date.
- Parameters
category – the category of items to fetch
from_date – obtain issues/merge requests updated since this date
- Returns
a generator of issues/merge requests
- fetch_items(category, **kwargs)[source]¶
Fetch the items (issues or merge_requests)
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_archiving()[source]¶
Returns whether the backend supports archiving items during the fetch process.
- Returns
this backend supports items archive
- classmethod has_resuming()[source]¶
Returns whether the backend supports resuming the fetch process.
- Returns
this backend does not support items resuming
- static metadata_category(item)[source]¶
Extracts the category from a GitLab item.
This backend generates two types of items: ‘issue’ and ‘merge_request’.
- static metadata_updated_on(item)[source]¶
Extracts the update time from a GitLab item.
The timestamp is extracted from the ‘updated_at’ field and converted to UNIX timestamp format. As GitLab dates are in UTC, the conversion is straightforward.
- Parameters
item – item generated by the backend
- Returns
a UNIX timestamp
- search_fields(item)[source]¶
Add search fields to an item.
It adds the values of metadata_id plus the owner, project and iid of the issue or merge request. Optionally, if the project is part of a (nested) group, all groups are also included in the search fields via the attribute groups.
- Parameters
item – the item to extract the search fields values
- Returns
a dict of search fields
- version = '0.12.0'¶
- class perceval.backends.core.gitlab.GitLabClient(owner, repository, token, is_oauth_token=False, base_url=None, sleep_for_rate=False, min_rate_to_sleep=10, sleep_time=1, max_retries=5, extra_retry_after_status=None, archive=None, from_archive=False, ssl_verify=True)[source]¶
Bases:
perceval.client.HttpClient, perceval.client.RateLimitHandler
Client for retrieving information from the GitLab API.
- Parameters
owner – GitLab owner
repository – GitLab owner’s repository
token – GitLab auth token to access the API
is_oauth_token – True if the token is OAuth (default False)
base_url – GitLab URL in enterprise edition case; when no value is set the backend will fetch the data from the GitLab public site.
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to sleep until it will be reset
sleep_time – time (in seconds) to sleep in case of connection problems
max_retries – number of max retries to a data source before raising a RetryError exception
extra_retry_after_status – retry HTTP requests after status
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification
- HAUTHORIZATION = 'Authorization'¶
- HPRIVATE_TOKEN = 'PRIVATE-TOKEN'¶
- HRATE_LIMIT = 'RateLimit-Remaining'¶
- HRATE_LIMIT_RESET = 'RateLimit-Reset'¶
- PORDER_BY = 'order_by'¶
- PPER_PAGE = 'per_page'¶
- PSORT = 'sort'¶
- PSTATE = 'state'¶
- PUPDATE_AFTER = 'updated_after'¶
- PVIEW = 'view'¶
- REMOJI = 'award_emoji'¶
- RISSUES = 'issues'¶
- RMERGES = 'merge_requests'¶
- RNOTES = 'notes'¶
- RPROJECTS = 'projects'¶
- RVERSIONS = 'versions'¶
- VORDER_UPDATED_AT = 'updated_at'¶
- VPER_PAGE = 100¶
- VSORT_ASC = 'asc'¶
- VSTATE_ALL = 'all'¶
- VVIEW_SIMPLE = 'simple'¶
- calculate_time_to_reset()[source]¶
Calculate the seconds until the token requests are reset, by obtaining the difference between the current date and the next date when the token is fully regenerated.
- fetch(url, payload=None, headers=None, method='GET', stream=False)[source]¶
Fetch the data from a given URL.
- Parameters
url – link to the resource
payload – payload of the request
headers – headers of the request
method – type of request call (GET or POST)
stream – defer downloading the response body until the response content is available
- Returns
a response object
- static sanitize_for_archive(url, headers, payload)[source]¶
Sanitize the payload of an HTTP request by removing the token information before storing/retrieving archived items.
- Parameters
url – HTTP url request
headers – HTTP headers request
payload – HTTP payload request
- Returns
url, headers and the sanitized payload
- class perceval.backends.core.gitlab.GitLabCommand(*args, debug=False)[source]¶
Bases:
perceval.backend.BackendCommand
Class to run GitLab backend from the command line.
- BACKEND¶
alias of
perceval.backends.core.gitlab.GitLab
perceval.backends.core.gitter module¶
- class perceval.backends.core.gitter.Gitter(group=None, room=None, api_token=None, max_items=100, sleep_for_rate=False, min_rate_to_sleep=10, tag=None, archive=None, ssl_verify=True)[source]¶
Bases:
perceval.backend.Backend
Gitter backend.
This class retrieves the messages sent to a Gitter room. To access the server an API token is required.
The origin of the data will be set to the GITTER_URL plus the identifier of the room, i.e., ‘https://gitter.im/{group}/{room}’.
- Parameters
group – group to which the room belongs
room – identifier of the room from which the messages are to be fetched
api_token – token or key needed to use the API
max_items – maximum number of messages requested in the same query
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to sleep until it will be reset
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification
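As a small illustration of the origin pattern quoted above, the origin can be composed from the group and room names. The constant value and the helper name are assumptions based only on the ‘https://gitter.im/{group}/{room}’ pattern; the group and room used here are placeholders.

```python
# Assumed base URL, taken from the pattern 'https://gitter.im/{group}/{room}'.
GITTER_URL = 'https://gitter.im/'


def gitter_origin(group, room):
    """Build the origin string for a room (hypothetical helper)."""
    return f'{GITTER_URL}{group}/{room}'


print(gitter_origin('example-group', 'example-room'))
```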
- CATEGORIES = ['message']¶
A list of categories that can be fetched by this backend.
Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- fetch(category='message', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶
Fetch the messages from the room.
This method fetches the messages sent in the room since the given date.
- Parameters
category – the category of items to fetch
from_date – date from which messages are to be fetched
- Returns
a generator of messages
- fetch_items(category, **kwargs)[source]¶
Fetch the messages.
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_archiving()[source]¶
Returns whether the backend supports archiving items during the fetch process.
- Returns
this backend supports items archive
- classmethod has_resuming()[source]¶
Returns whether the backend supports resuming the fetch process.
- Returns
this backend does not support items resuming
- static metadata_category(item)[source]¶
Extracts the category from a Gitter item.
This backend only generates one type of item which is ‘message’.
- static metadata_updated_on(item)[source]¶
Extracts and converts the sent time of a message from a Gitter item.
The timestamp is extracted from the ‘sent’ field and converted to a UNIX timestamp.
- Parameters
item – item generated by the backend
- Returns
a UNIX timestamp
- search_fields(item)[source]¶
Add search fields to an item.
It adds the values of metadata_id, group, room and room_id.
- Parameters
item – the item to extract the search fields values
- Returns
a dict of search fields
- version = '0.1.0'¶
- class perceval.backends.core.gitter.GitterClient(api_token, max_items=100, archive=None, sleep_for_rate=False, min_rate_to_sleep=10, from_archive=False, ssl_verify=True)[source]¶
Bases:
perceval.client.HttpClient, perceval.client.RateLimitHandler
Gitter API client.
Client for fetching information from the Gitter server using its REST API.
- Parameters
api_token – key needed to use the API
max_items – maximum number of items per request
archive – an archive to store/read fetched data
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to sleep until it will be reset
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification
- HAUTHORIZATION = 'Authorization'¶
- PBEFORE_ID = 'beforeId'¶
- PLIMIT = 'limit'¶
- RMESSAGES = 'chatMessages'¶
- RROOMS = 'rooms'¶
- calculate_time_to_reset()[source]¶
Number of seconds to wait, taken from the rate limit reset header.
- fetch(url, payload=None, headers=None)[source]¶
Fetch the data from a given URL.
- Parameters
url – link to the resource
payload – payload of the request
headers – headers of the request
- Returns
a response object
- static sanitize_for_archive(url, headers, payload)[source]¶
Sanitize the payload of an HTTP request by removing the token information before storing/retrieving archived items.
- Parameters
url – HTTP url request
headers – HTTP headers request
payload – HTTP payload request
- Returns
url, headers and the sanitized payload
- class perceval.backends.core.gitter.GitterCommand(*args, debug=False)[source]¶
Bases:
perceval.backend.BackendCommand
Class to run Gitter backend from the command line.
- BACKEND¶
alias of
perceval.backends.core.gitter.Gitter
perceval.backends.core.googlehits module¶
- class perceval.backends.core.googlehits.GoogleHits(keywords, tag=None, archive=None, max_retries=5, sleep_time=1, ssl_verify=True)[source]¶
Bases:
perceval.backend.Backend
GoogleHits backend for Perceval.
This class retrieves the number of hits for a given list of keywords via the Google API. To initialize this class a list of keywords is needed.
- Parameters
keywords – a list of keywords
tag – label used to mark the data
archive – archive to store/retrieve items
max_retries – number of max retries to a data source before raising a RetryError exception
sleep_time – time (in seconds) to sleep in case of connection problems
ssl_verify – enable/disable SSL verification
- CATEGORIES = ['google_hits']¶
A list of categories that can be fetched by this backend.
Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- EXTRA_SEARCH_FIELDS = {'keywords': ['keywords']}¶
A set of search fields to simplify query operations.
The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:
{
    'key-1': value-1,
    'key-2': value-2,
    'key-3': value-3
}
These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item ('item_id': item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:
{
    'project_id': ['fields', 'project', 'id'],
    'project_key': ['fields', 'project', 'key'],
    'project_name': ['fields', 'project', 'name']
}
Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.
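The path lookup described above can be sketched as a walk through nested dicts. The helpers `resolve_path` and `build_search_fields` are hypothetical names used for illustration, and the sample item is invented; the backend's own handling of missing keys is not shown in this section and is not modelled here.

```python
def resolve_path(item, path):
    """Follow a list of keys through nested dicts and return the value."""
    value = item
    for key in path:
        value = value[key]
    return value


def build_search_fields(item, item_id, extra_search_fields):
    """Return the search fields for an item (hypothetical helper)."""
    # The item id is always present; extra fields are resolved by path.
    fields = {'item_id': item_id}
    for name, path in extra_search_fields.items():
        fields[name] = resolve_path(item, path)
    return fields


sample = {
    'keywords': ['perceval', 'grimoirelab'],
    'fields': {'project': {'id': 7}},
}
extra = {'keywords': ['keywords'], 'project_id': ['fields', 'project', 'id']}
print(build_search_fields(sample, 'abc', extra))
```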
- fetch(category='google_hits')[source]¶
Fetch data from Google API.
The method retrieves a list of hits for some given keywords using the Google API.
- Parameters
category – the category of items to fetch
- Returns
a generator of data
- fetch_items(category, **kwargs)[source]¶
Fetch Google hit items
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_archiving()[source]¶
Returns whether the backend supports archiving items during the fetch process.
- Returns
this backend supports items archive
- classmethod has_resuming()[source]¶
Returns whether the backend supports resuming the fetch process.
- Returns
this backend supports items resuming
- static metadata_category(item)[source]¶
Extracts the category from a GoogleHits item.
This backend only generates one type of item which is ‘google_hits’.
- static metadata_updated_on(item)[source]¶
Extracts the update time from a GoogleHits item.
The timestamp is based on the current time when the hit was extracted. This field is not part of the data provided by Google API. It is added by this backend.
- Parameters
item – item generated by the backend
- Returns
a UNIX timestamp
- version = '0.4.0'¶
- class perceval.backends.core.googlehits.GoogleHitsClient(sleep_time=1, max_retries=5, archive=None, from_archive=False, ssl_verify=True)[source]¶
Bases:
perceval.client.HttpClient
GoogleHits API client.
Client for fetching hits data from Google API.
- Parameters
sleep_time – time (in seconds) to sleep in case of connection problems
max_retries – number of max retries to a data source before raising a RetryError exception
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification
- EXTRA_STATUS_FORCELIST = [429]¶
- PQUERY = 'q'¶
- class perceval.backends.core.googlehits.GoogleHitsCommand(*args, debug=False)[source]¶
Bases:
perceval.backend.BackendCommand
Class to run GoogleHits backend from the command line.
- BACKEND¶
perceval.backends.core.groupsio module¶
- class perceval.backends.core.groupsio.Groupsio(group_name, dirpath, email, password, tag=None, archive=None, ssl_verify=True)[source]¶
Bases:
perceval.backends.core.mbox.MBox
Groups.io backend.
This class fetches the messages of a Groups.io group. Initialize this class by passing the name of the group, the directory path where the mbox files will be fetched and stored, and the email and password of the Groups.io user. The origin of the data will be set to the URL of the group on Groups.io.
To find the names of the groups you are subscribed to, you can use the following script: https://gist.github.com/valeriocos/2e2231e17fd3052800303bf99bd0c7c4
- Parameters
group_name – Name of the group
dirpath – directory path where the mboxes are stored
email – Groupsio user email
password – Groupsio user password
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification
- CATEGORIES = ['message']¶
A list of categories that can be fetched by this backend.
Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- fetch(category='message', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶
Fetch the messages from a Groups.io group.
The method fetches the mbox files from a remote Groups.io group and retrieves the messages stored on them.
- Parameters
category – the category of items to fetch
from_date – obtain messages since this date
- Returns
a generator of messages
- fetch_items(category, **kwargs)[source]¶
Fetch the messages
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_archiving()[source]¶
Returns whether the backend supports archiving items during the fetch process.
- Returns
this backend does not support items archive
- classmethod has_resuming()[source]¶
Returns whether the backend supports resuming the fetch process.
- Returns
this backend supports items resuming
- search_fields(item)[source]¶
Add search fields to an item.
It adds the values of metadata_id plus the group_name.
- Parameters
item – the item to extract the search fields values
- Returns
a dict of search fields
- version = '0.4.2'¶
- class perceval.backends.core.groupsio.GroupsioClient(group_name, dirpath, email, password, ssl_verify=True)[source]¶
Bases:
perceval.backends.core.mbox.MailingList
Manage mailing list archives stored by Groups.io.
This class gives access to the remote and local mbox archives of a mailing list stored by Groups.io. It can also keep them in sync.
- Parameters
group_name – Name of the group
dirpath – directory path where the mboxes are stored
email – Groupsio user email
password – Groupsio user password
ssl_verify – enable/disable SSL verification
- PEMAIL = 'email'¶
- PGROUP_ID = 'group_id'¶
- PLIMIT = 'limit'¶
- PPAGE_TOKEN = 'page_token'¶
- PPASSWORD = 'password'¶
- PSTART_TIME = 'start_time'¶
- RDOWNLOAD_ARCHIVES = 'downloadarchives'¶
- RGET_SUBSCRIPTIONS = 'getsubs'¶
- RLOGIN = 'login'¶
- fetch(from_date=None)[source]¶
Fetch the mbox files from the remote archiver.
Stores the archives in the path given during the initialization of this object. Archives without a valid extension will be ignored.
Groups.io archives are returned as a .zip file, which contains one file in mbox format.
- Parameters
from_date – fetch messages after a given date (included) expressed in ISO format
- Returns
a list of tuples, storing the links and paths of the fetched archives
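Handling a .zip that wraps a single mbox file can be sketched with the standard `zipfile` module. This is an illustration only: the helper name `extract_mbox`, the in-memory handling, and the sample file name are assumptions, and the client's real download and storage steps are not shown.

```python
import io
import zipfile


def extract_mbox(zip_bytes):
    """Return (name, content) of the single file inside a zip archive.

    Hypothetical helper; raises ValueError if the archive does not hold
    exactly one file.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = archive.namelist()
        if len(names) != 1:
            raise ValueError('expected exactly one file in the archive')
        return names[0], archive.read(names[0])


# Build a tiny archive in memory to exercise the helper.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('messages.mbox',
                     'From alice@example.com Thu Jan  1 00:00:00 1970\n')

name, content = extract_mbox(buffer.getvalue())
print(name, len(content))
```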
- class perceval.backends.core.groupsio.GroupsioCommand(*args, debug=False)[source]¶
Bases:
perceval.backend.BackendCommand
Class to run Groupsio backend from the command line.
- BACKEND¶
perceval.backends.core.hyperkitty module¶
- class perceval.backends.core.hyperkitty.HyperKitty(url, dirpath, tag=None, archive=None, ssl_verify=True)[source]¶
Bases:
perceval.backends.core.mbox.MBox
HyperKitty backend.
This class fetches the email messages stored on a HyperKitty archiver. Initialize this class by passing the URL of the mailing list archiver and the directory path where the mbox files will be fetched and stored. The origin of the data will be set to the value of url.
- Parameters
url – URL to the HyperKitty mailing list archiver
dirpath – directory path where the mboxes are stored
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification
- CATEGORIES = ['message']¶
A list of categories that can be fetched by this backend.
Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- fetch(category='message', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶
Fetch the messages from the HyperKitty mailing list archiver.
The method fetches the mbox files from a remote HyperKitty mailing list archiver and retrieves the messages stored on them.
Note that HyperKitty does not yet provide any information about which message is the first on the mailing list. For this reason, setting from_date to a date earlier than that of the first message will cause empty mbox files to be downloaded.
- Parameters
category – the category of items to fetch
from_date – obtain messages since this date
- Returns
a generator of messages
- fetch_items(category, **kwargs)[source]¶
Fetch the messages
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_archiving()[source]¶
Returns whether the backend supports archiving items during the fetch process.
- Returns
this backend does not support items archive
- classmethod has_resuming()[source]¶
Returns whether the backend supports resuming the fetch process.
- Returns
this backend supports items resuming
- version = '0.6.0'¶
- class perceval.backends.core.hyperkitty.HyperKittyCommand(*args, debug=False)[source]¶
Bases:
perceval.backend.BackendCommand
Class to run HyperKitty backend from the command line.
- BACKEND¶
- class perceval.backends.core.hyperkitty.HyperKittyList(url, dirpath, ssl_verify=True)[source]¶
Bases:
perceval.backends.core.mbox.MailingList
Manage mailing list archives stored by the HyperKitty archiver.
This class gives access to the remote and local mbox archives of a mailing list stored by HyperKitty. It can also keep them in sync.
Notice that this class only works with HyperKitty version 1.0.4 or greater. Previous versions do not export messages in MBox format.
- Parameters
url – URL to the HyperKitty archiver for this list
dirpath – path to the local mboxes archives
ssl_verify – enable/disable SSL verification
- PEND = 'end'¶
- PSTART = 'start'¶
- REXPORT = 'export'¶
- fetch(from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶
Fetch the mbox files from the remote archiver.
This method stores the archives in the path given during the initialization of this object.
HyperKitty archives are accessed month by month and stored following the schema year-month. Archives are fetched from the given month till the current month.
- Parameters
from_date – fetch archives that store messages sent on or after the given date; only year and month values are compared
- Returns
a list of tuples, storing the links and paths of the fetched archives
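The month-by-month walk described above can be sketched as follows. This is an illustration under assumptions, not the class's actual implementation: the helper names months_to_fetch and archive_name are hypothetical, and the exact file naming used on disk is not confirmed by this page.

```python
from datetime import datetime, timezone


def months_to_fetch(from_date, until=None):
    """Yield (year, month) pairs from from_date's month up to until's month."""
    until = until or datetime.now(timezone.utc)
    year, month = from_date.year, from_date.month
    # Only year and month are compared; the day of from_date is ignored.
    while (year, month) <= (until.year, until.month):
        yield year, month
        # Advance one month, rolling over to January of the next year.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


def archive_name(year, month):
    """Label an archive following a year-month schema (exact format assumed)."""
    return "%04d-%02d" % (year, month)
```

For a from_date in November 2023 and a current date in February 2024, the generator yields four months, crossing the year boundary.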
- property mboxes¶
Get the mboxes managed by this mailing list.
Returns the archives sorted by date in ascending order.
- Returns
a list of MBoxArchive objects
perceval.backends.core.jenkins module¶
- class perceval.backends.core.jenkins.Jenkins(url, user=None, api_token=None, tag=None, archive=None, detail_depth=1, sleep_time=10, blacklist_ids=None, ssl_verify=True)[source]¶
Bases:
perceval.backend.Backend
Jenkins backend for Perceval.
This class retrieves the builds from a Jenkins site. To initialize this class the URL must be provided. The url will be set as the origin of the data.
- Parameters
url – Jenkins url
user – Jenkins user
api_token – Jenkins auth token to access the API
tag – label used to mark the data
archive – archive to store/retrieve items
detail_depth – control the detail level of the data returned by the API
sleep_time – time (in seconds) to sleep in case of connection problems
archive – collect builds already retrieved from an archive
blacklist_ids – exclude the job IDs in this list while fetching
ssl_verify – enable/disable SSL verification
- CATEGORIES = ['build']¶
A list of categories that can be fetched by this backend.
Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- EXTRA_SEARCH_FIELDS = {'number': ['number']}¶
A set of search fields to simplify query operations.
The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:
{
'key-1': value-1, 'key-2': value-2, 'key-3': value-3
}
These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item ('item_id': item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:
{
'project_id': ['fields', 'project', 'id'], 'project_key': ['fields', 'project', 'key'], 'project_name': ['fields', 'project', 'name']
}
Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.
- ORIGIN_UNIQUE_FIELD = OriginUniqueField(name='url', type=<class 'str'>)¶
A field unique to a given origin for items produced by this backend.
If ORIGIN_UNIQUE_FIELD is defined, users can pass a list of blocked values; items whose field defined here contains one of those values are excluded from the results. For example, if ORIGIN_UNIQUE_FIELD were set to post_id, then users could pass a list of post ids that should be excluded from the results.
If set to None, blacklisting will be disabled completely. Otherwise, this should be set to an OriginUniqueField containing the name and data type of the field.
Note: Origin in this context refers to one site, API, or other remote that contains several repositories, each consisting of many items of several categories. For example, for the GitLab backend, an origin would be one GitLab instance, such as gitlab.com or opensource.ieee.org, each of which contains many repositories, which in turn contain items such as issues and merge requests.
To access this field, please prefer origin_unique_field().
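The blacklisting behaviour can be pictured with a short filter. This is a sketch under assumptions: OriginUniqueField is re-created here as a plain namedtuple with the two attributes shown in the signature, and drop_blacklisted is a hypothetical helper, not a Perceval function.

```python
from collections import namedtuple

# Hypothetical re-creation of the (name, type) pair shown above.
OriginUniqueField = namedtuple('OriginUniqueField', ['name', 'type'])

ORIGIN_UNIQUE_FIELD = OriginUniqueField(name='url', type=str)


def drop_blacklisted(items, blacklist, unique_field=ORIGIN_UNIQUE_FIELD):
    """Yield only the items whose unique field value is not in the blacklist."""
    if unique_field is None:
        # Blacklisting disabled: pass everything through.
        yield from items
        return
    # Coerce the blocked values to the declared data type before comparing.
    blocked = {unique_field.type(value) for value in blacklist}
    for item in items:
        if item.get(unique_field.name) not in blocked:
            yield item
```

With ORIGIN_UNIQUE_FIELD set to None the filter is a no-op, which mirrors the statement that blacklisting is then disabled completely.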
- fetch(category='build')[source]¶
Fetch the builds from the url.
The method retrieves the builds from a Jenkins URL. Unlike other backends, this method takes no date parameter.
- Parameters
category – the category of items to fetch
- Returns
a generator of builds
- fetch_items(category, **kwargs)[source]¶
Fetch the contents
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_archiving()[source]¶
Returns whether the backend supports archiving items during the fetch process.
- Returns
this backend supports item archiving
- classmethod has_resuming()[source]¶
Returns whether the backend supports resuming the fetch process.
- Returns
this backend does not support resuming
- static metadata_category(item)[source]¶
Extracts the category from a Jenkins item.
This backend only generates one type of item which is ‘build’.
- static metadata_updated_on(item)[source]¶
Extracts the update time from a Jenkins item.
The timestamp is extracted from ‘timestamp’ field. This date is a UNIX timestamp but needs to be converted to a float value.
- Parameters
item – item generated by the backend
- Returns
a UNIX timestamp
- version = '0.16.0'¶
- class perceval.backends.core.jenkins.JenkinsClient(url, user=None, api_token=None, blacklist_jobs=None, detail_depth=1, sleep_time=10, archive=None, from_archive=False, ssl_verify=True)[source]¶
Bases:
perceval.client.HttpClient
Jenkins API client.
This class implements a simple client to retrieve jobs/builds from projects in a Jenkins node. The amount of data returned for each request depends on the detail_depth value selected (minimum and default is 1). Note that increasing the detail_depth may considerably slow down the fetch operation and cause broken-connection errors.
- Parameters
url – URL of the Jenkins node, e.g. https://build.opnfv.org/ci
user – Jenkins user
api_token – Jenkins auth token to access the API
blacklist_jobs – exclude the jobs of this list while fetching
detail_depth – set the detail level of the data returned by the API
sleep_time – time (in seconds) to sleep in case of connection problems
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification
- Raises
HTTPError – when an error occurs doing the request
- EXTRA_STATUS_FORCELIST = [410, 502, 503]¶
- MAX_RETRIES = 5¶
- PDEPTH = 'depth'¶
- RAPI = 'api'¶
- RJOB = 'job'¶
- RJSON = 'json'¶
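The constants above suggest how a request URL for a job could be assembled. The sketch below follows the usual Jenkins remote API layout (job/NAME/api/json with a depth query parameter); whether the client composes its URLs exactly this way is an assumption, and job_builds_url and the example host are hypothetical.

```python
from urllib.parse import quote, urlencode

# Values copied from the class constants listed above.
RAPI, RJOB, RJSON, PDEPTH = 'api', 'job', 'json', 'depth'


def job_builds_url(base_url, job_name, detail_depth=1):
    """Build <base>/job/<name>/api/json?depth=N for a Jenkins job."""
    path = "/".join([
        base_url.rstrip("/"),
        RJOB,
        quote(job_name, safe=""),  # escape characters that are unsafe in a path
        RAPI,
        RJSON,
    ])
    return path + "?" + urlencode({PDEPTH: detail_depth})


print(job_builds_url("https://jenkins.example.com", "my-job"))
```

A larger detail_depth makes each response carry more nested data, which is why the class description warns about slower fetches.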
- class perceval.backends.core.jenkins.JenkinsCommand(*args, debug=False)[source]¶
Bases:
perceval.backend.BackendCommand
Class to run Jenkins backend from the command line.
- BACKEND¶
perceval.backends.core.jira module¶
- class perceval.backends.core.jira.Jira(url, project=None, user=None, password=None, cert=None, max_results=100, tag=None, archive=None, ssl_verify=True)[source]¶
Bases:
perceval.backend.Backend
JIRA backend for Perceval.
This class retrieves the issues stored in a JIRA issue tracking system. To initialize this class the URL must be provided. The url will be set as the origin of the data.
Note that when fetching data with an authenticated access (i.e., user and password), information about issue transitions and operations (e.g., edit-issue, comment-issue) is included in the JSON documents produced by the backend.
- Parameters
url – JIRA’s endpoint
project – filter issues by project
user – Jira user
password – Jira user password
cert – SSL certificate path (PEM)
max_results – max number of results per query
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification
- CATEGORIES = ['issue']¶
A list of categories that can be fetched by this backend.
Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- EXTRA_SEARCH_FIELDS = {'issue_key': ['key'], 'project_id': ['fields', 'project', 'id'], 'project_key': ['fields', 'project', 'key'], 'project_name': ['fields', 'project', 'name']}¶
A set of search fields to simplify query operations.
The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:
{
'key-1': value-1, 'key-2': value-2, 'key-3': value-3
}
These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item ('item_id': item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:
{
'project_id': ['fields', 'project', 'id'], 'project_key': ['fields', 'project', 'key'], 'project_name': ['fields', 'project', 'name']
}
Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.
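The "path" lookup can be sketched in a few lines. This is an illustration, not the backend's code: resolve_path and extra_search_fields are hypothetical helpers, the sample issue is made up, and returning None for a missing key is an assumption.

```python
# Copied from the Jira backend's EXTRA_SEARCH_FIELDS shown above.
EXTRA_SEARCH_FIELDS = {
    'issue_key': ['key'],
    'project_id': ['fields', 'project', 'id'],
    'project_key': ['fields', 'project', 'key'],
    'project_name': ['fields', 'project', 'name'],
}


def resolve_path(item, path):
    """Follow a list of keys into nested dicts; return None if a key is missing."""
    value = item
    for key in path:
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def extra_search_fields(item, fields=EXTRA_SEARCH_FIELDS):
    """Build the extra search fields dict for one item."""
    return {name: resolve_path(item, path) for name, path in fields.items()}


# Made-up issue used only to exercise the helpers.
issue = {'key': 'DEMO-1',
         'fields': {'project': {'id': '10', 'key': 'DEMO', 'name': 'Demo'}}}
print(extra_search_fields(issue))
```

Each path is walked one key at a time, so ['fields', 'project', 'id'] reads issue['fields']['project']['id'].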
- fetch(category='issue', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶
Fetch the issues from the site.
The method retrieves, from a JIRA site, the issues updated since the given date.
- Parameters
category – the category of items to fetch
from_date – retrieve issues updated from this date
- Returns
a generator of issues
- fetch_items(category, **kwargs)[source]¶
Fetch the issues
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_archiving()[source]¶
Returns whether the backend supports archiving items during the fetch process.
- Returns
this backend supports item archiving
- classmethod has_resuming()[source]¶
Returns whether the backend supports resuming the fetch process.
- Returns
this backend supports resuming
- static metadata_category(item)[source]¶
Extracts the category from a Jira item.
This backend only generates one type of item which is ‘issue’.
- static metadata_updated_on(item)[source]¶
Extracts the update time from a Jira item.
The timestamp used is extracted from ‘updated’ field. This date is converted to UNIX timestamp format taking into account the timezone of the date.
- Parameters
item – item generated by the backend
- Returns
a UNIX timestamp
- static parse_issues(raw_page)[source]¶
Parse a JIRA API raw response.
The method parses the API response and retrieves the issues from the received items.
- Parameters
raw_page – raw API response from which to parse the issues
- Returns
a generator of issues
- version = '0.14.0'¶
- class perceval.backends.core.jira.JiraClient(url, project, user, password, cert, max_results=100, archive=None, from_archive=False, ssl_verify=True)[source]¶
Bases:
perceval.client.HttpClient
JIRA API client.
This class implements a simple client to retrieve issues from any JIRA issue tracking system.
- Parameters
url – URL of the JIRA server
project – filter issues by project
user – JIRA’s username
password – JIRA’s password
cert – SSL certificate
max_results – max number of results per query
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification
- Raises
HTTPError – when an error occurs doing the request
- PEXPAND = 'expand'¶
- PJQL = 'jql'¶
- PMAX_RESULTS = 'maxResults'¶
- PSTART_AT = 'startAt'¶
- RCOMMENT = 'comment'¶
- RESOURCE = 'rest/api'¶
- RFIELD = 'field'¶
- RISSUE = 'issue'¶
- RSEARCH = 'search'¶
- VERSION_API = '2'¶
- VEXPAND = 'renderedFields,transitions,operations,changelog'¶
- get_comments(issue_id)[source]¶
Retrieve all the comments of a given issue.
- Parameters
issue_id – ID of the issue
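The startAt and maxResults parameters point to offset-based paging. The loop below is a sketch of that idea only: paginate and fake_search are hypothetical names, fake_search replaces the HTTP call with made-up data, and the way the real client drives its requests may differ.

```python
# Parameter names and the expand value are copied from the constants above.
PSTART_AT, PMAX_RESULTS, PJQL, PEXPAND = 'startAt', 'maxResults', 'jql', 'expand'
VEXPAND = 'renderedFields,transitions,operations,changelog'


def paginate(search, jql, max_results=100):
    """Yield issues page by page until the reported total is reached.

    `search` is a hypothetical callable that takes the query parameters and
    returns a dict with 'issues' and 'total' keys.
    """
    start_at = 0
    while True:
        params = {PJQL: jql, PSTART_AT: start_at,
                  PMAX_RESULTS: max_results, PEXPAND: VEXPAND}
        page = search(params)
        issues = page['issues']
        yield from issues
        start_at += len(issues)
        # Stop on an empty page or once every reported issue has been seen.
        if not issues or start_at >= page['total']:
            break


def fake_search(params):
    """Stand-in for an HTTP call: serves slices of five made-up issues."""
    data = [{'key': 'DEMO-%d' % n} for n in range(1, 6)]
    start = params[PSTART_AT]
    return {'issues': data[start:start + params[PMAX_RESULTS]], 'total': len(data)}
```

With max_results=2 the five made-up issues arrive in three pages, and the generator still yields them in order.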
- class perceval.backends.core.jira.JiraCommand(*args, debug=False)[source]¶
Bases:
perceval.backend.BackendCommand
Class to run Jira backend from the command line.
- BACKEND¶
alias of
perceval.backends.core.jira.Jira
- perceval.backends.core.jira.filter_custom_fields(fields)[source]¶
Filter custom fields from a given set of fields.
- Parameters
fields – set of fields
- Returns
an object with the filtered custom fields
- perceval.backends.core.jira.map_custom_field(custom_fields, fields)[source]¶
Add extra information for custom fields.
- Parameters
custom_fields – set of custom fields with the extra information
fields – fields of the issue where to add the extra information
- Returns
a set of items with the extra information mapped
perceval.backends.core.launchpad module¶
- class perceval.backends.core.launchpad.Launchpad(distribution, package=None, items_per_page=75, sleep_time=300, tag=None, archive=None, ssl_verify=True)[source]¶
Bases:
perceval.backend.Backend
Launchpad backend for Perceval.
This class fetches the issues stored in Launchpad.
- Parameters
distribution – Launchpad distribution
package – Distribution package
items_per_page – number of items in a retrieved page
sleep_time – time (in seconds) to sleep in case of connection problems
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification
- CATEGORIES = ['issue']¶
A list of categories that can be fetched by this backend.
Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- fetch(category='issue', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶
Fetch the issues from a project (distribution/package).
The method retrieves, from a Launchpad project, the issues updated since the given date.
- Parameters
category – the category of items to fetch
from_date – obtain issues updated since this date
- Returns
a generator of issues
- fetch_items(category, **kwargs)[source]¶
Fetch the issues
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_archiving()[source]¶
Returns whether the backend supports archiving items during the fetch process.
- Returns
this backend supports item archiving
- classmethod has_resuming()[source]¶
Returns whether the backend supports resuming the fetch process.
- Returns
this backend supports resuming
- static metadata_category(item)[source]¶
Extracts the category from a Launchpad item.
This backend only generates one type of item which is ‘issue’.
- static metadata_updated_on(item)[source]¶
Extracts the update time from a Launchpad item.
The timestamp used is extracted from ‘date_last_updated’ field. This date is converted to UNIX timestamp format. As Launchpad dates are in UTC in ISO 8601 (e.g., ‘2008-03-26T01:43:15.603905+00:00’) the conversion is straightforward.
- Parameters
item – item generated by the backend
- Returns
a UNIX timestamp
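The conversion of an ISO 8601 date with a UTC offset can be sketched with the standard library alone. The backend may rely on a different date parser; iso8601_to_unixtime is a hypothetical helper used only for illustration.

```python
from datetime import datetime, timezone


def iso8601_to_unixtime(value):
    """Convert an ISO 8601 string with a UTC offset into a UNIX timestamp."""
    # fromisoformat keeps the offset, so the result is an aware datetime.
    return datetime.fromisoformat(value).timestamp()


# The sample date is the one quoted in the description above.
ts = iso8601_to_unixtime('2008-03-26T01:43:15.603905+00:00')
print(datetime.fromtimestamp(ts, tz=timezone.utc).isoformat())
```

Because the string already carries its offset, no timezone guessing is needed, which is what makes the conversion straightforward.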
- search_fields(item)[source]¶
Add search fields to an item.
It adds the values of metadata_id plus additional values depending on the item category. For the categories issue and pull_request, the search fields include the issue/pull request number, labels, state and the name of the milestone. For the category repository, license and language are set as search fields.
- Parameters
item – the item to extract the search fields values
- Returns
a dict of search fields
- version = '0.8.1'¶
- class perceval.backends.core.launchpad.LaunchpadClient(distribution, package=None, items_per_page=75, sleep_time=300, archive=None, from_archive=False, ssl_verify=True)[source]¶
Bases:
perceval.client.HttpClient
Client for retrieving information from the Launchpad API.
- Parameters
distribution – Launchpad distribution
package – Distribution package
items_per_page – number of items in a retrieved page
sleep_time – time (in seconds) to sleep in case of connection problems
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification
- HCONTENT_TYPE = 'Content-type'¶
- PMODIFIED_SINCE = 'modified_since'¶
- POMIT_DULPLICATES = 'omit_duplicates'¶
- PORDER_BY = 'order_by'¶
- PSTATUS = 'status'¶
- PWS_OP = 'ws.op'¶
- PWS_SIZE = 'ws.size'¶
- PWS_START = 'ws.start'¶
- RBUGS = 'bugs'¶
- RSOURCE = '+source'¶
- VCONTENT_TYPE = 'application/json'¶
- VDATE_LAST_MODIFIED = 'date_last_updated'¶
- VOMIT_DUPLICATES = 'false'¶
- VSEARCH_TASKS = 'searchTasks'¶
- VSTATUS = ['New', 'Incomplete', 'Opinion', 'Invalid', "Won't Fix", 'Expired', 'Confirmed', 'Triaged', 'In Progress', 'Fix Committed', 'Fix Released', 'Incomplete (with response)', 'Incomplete (without response)']¶
- class perceval.backends.core.launchpad.LaunchpadCommand(*args, debug=False)[source]¶
Bases:
perceval.backend.BackendCommand
Class to run Launchpad backend from the command line.
- BACKEND¶
perceval.backends.core.mattermost module¶
- class perceval.backends.core.mattermost.Mattermost(url, channel, api_token, max_items=60, tag=None, archive=None, team=None, sleep_for_rate=False, min_rate_to_sleep=10, sleep_time=1, ssl_verify=True)[source]¶
Bases:
perceval.backend.Backend
Mattermost backend.
This class retrieves the posts sent to a Mattermost channel. To access the server an API token is required, which must have enough permissions to read from the given channel.
To initialize this class the URL of the server must be provided. The origin of the data will be set to this URL plus the channel from which the data is obtained (e.g., https://mattermost.example.com/abcdefg). If channel and team names are used instead of a channel ID, the origin takes the form URL plus team plus channel.
The team parameter is only required if providing a channel name instead of a channel ID.
- Parameters
url – URL of the server
channel – identifier/name of the channel where data will be fetched
api_token – token or key needed to use the API
max_items – maximum number of messages requested per query
tag – label used to mark the data
archive – archive to store/retrieve items
team – (optional) The name of the team the channel is in
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to sleep until it is reset
sleep_time – time (in seconds) to sleep in case of connection problems
ssl_verify – enable/disable SSL verification
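The two forms of the origin described above can be sketched as a simple join. The helper build_origin is hypothetical, and joining the parts with a slash is an assumption based on the example URL in the class description.

```python
def build_origin(url, channel, team=None):
    """Join the server URL with the channel, and the team when one is given."""
    parts = [url.rstrip("/")]
    if team:
        # Team names are only needed when the channel is given by name.
        parts.append(team)
    parts.append(channel)
    return "/".join(parts)


print(build_origin("https://mattermost.example.com", "abcdefg"))
print(build_origin("https://mattermost.example.com", "town-square", team="myteam"))
```

The team and channel names in the second call are made up for the example.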
- CATEGORIES = ['post']¶
A list of categories that can be fetched by this backend.
Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- EXTRA_SEARCH_FIELDS = {'channel_id': ['channel_data', 'id'], 'channel_name': ['channel_data', 'name']}¶
A set of search fields to simplify query operations.
The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:
{
'key-1': value-1, 'key-2': value-2, 'key-3': value-3
}
These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item ('item_id': item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:
{
'project_id': ['fields', 'project', 'id'], 'project_key': ['fields', 'project', 'key'], 'project_name': ['fields', 'project', 'name']
}
Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.
- fetch(category='post', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶
Fetch the posts from the channel.
This method fetches the posts stored on the channel that were sent since the given date.
- Parameters
category – the category of items to fetch
from_date – obtain posts sent since this date
- Returns
a generator of posts
- fetch_items(category, **kwargs)[source]¶
Fetch the messages.
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_archiving()[source]¶
Returns whether the backend supports archiving items during the fetch process.
- Returns
this backend supports item archiving
- classmethod has_resuming()[source]¶
Returns whether the backend supports resuming the fetch process.
- Returns
this backend does not support resuming
- static metadata_category(item)[source]¶
Extracts the category from a Mattermost item.
This backend only generates one type of item which is ‘post’.
- static metadata_updated_on(item)[source]¶
Extracts and converts the update time from a Mattermost item.
The timestamp is extracted from ‘update_at’ field. This field is already a UNIX timestamp but it needs to be converted to float.
- Parameters
item – item generated by the backend
- Returns
a UNIX timestamp
- static parse_json(raw_json)[source]¶
Parse a Mattermost JSON stream.
The method parses a JSON stream and returns a dict with the parsed data.
- Parameters
raw_json – JSON string to parse
- Returns
a dict with the parsed data
- version = '0.5.0'¶
- class perceval.backends.core.mattermost.MattermostClient(base_url, api_token, max_items=60, sleep_for_rate=False, min_rate_to_sleep=10, sleep_time=1, archive=None, from_archive=False, ssl_verify=True)[source]¶
Bases:
perceval.client.HttpClient, perceval.client.RateLimitHandler
Mattermost API client.
Client for fetching information from a Mattermost server using its REST API.
- Parameters
base_url – URL of the Mattermost server
api_token – token needed to use the API
max_items – maximum number of items fetched per request
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to sleep until it is reset
sleep_time – time (in seconds) to sleep in case of connection problems
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification
- API_URL = '%(base_url)s/api/v4/%(entrypoint)s'¶
- HAUTHORIZATION = 'Authorization'¶
- PCHANNEL_ID = 'channel_id'¶
- PPAGE = 'page'¶
- PPER_PAGE = 'per_page'¶
- RCHANNELS = 'channels'¶
- RCHANNELS_BY_NAME = 'teams/name/%s/channels/name/%s'¶
- RPOSTS = 'posts'¶
- RUSERS = 'users'¶
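The API_URL and RCHANNELS_BY_NAME templates use old-style string formatting, and they can be combined as shown below. Only the two constants come from this page; channel_by_name_url, the host, and the team and channel names are hypothetical.

```python
# Templates copied from the class constants listed above.
API_URL = '%(base_url)s/api/v4/%(entrypoint)s'
RCHANNELS_BY_NAME = 'teams/name/%s/channels/name/%s'


def channel_by_name_url(base_url, team, channel):
    """Fill both templates to address a channel by team and channel name."""
    entrypoint = RCHANNELS_BY_NAME % (team, channel)
    return API_URL % {'base_url': base_url, 'entrypoint': entrypoint}


print(channel_by_name_url('https://mattermost.example.com', 'myteam', 'town-square'))
```

The named placeholders in API_URL take a dict, while the positional ones in RCHANNELS_BY_NAME take a tuple.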
- calculate_time_to_reset()[source]¶
Number of seconds to wait.
The time is obtained from the difference between the current date and the next date when the token is fully regenerated.
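The waiting time can be sketched as a plain date difference. The helper seconds_to_reset is hypothetical, and clamping negative values to zero is an assumption added so that a reset date in the past never produces a negative wait.

```python
from datetime import datetime, timezone


def seconds_to_reset(reset_at, now=None):
    """Return the seconds left until reset_at, never less than zero."""
    now = now or datetime.now(timezone.utc)
    return max(0.0, (reset_at - now).total_seconds())
```

Both dates must be timezone-aware (or both naive) for the subtraction to work.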
- channel_by_name(team: str, channel: str)[source]¶
Fetch the channel information by channel/team name.
This provides the same information as the channel() method; the key difference is that the channel is looked up by channel name and team name instead of by channel ID.
- fetch(url, payload=None, headers=None, method='GET', stream=False, auth=None)[source]¶
Override fetch method to handle API rate limit.
- Parameters
url – link to the resource
payload – payload of the request
headers – headers of the request
method – type of request call (GET or POST)
stream – defer downloading the response body until the response content is available
auth – auth of the request
- Returns
a response object
- static sanitize_for_archive(url, headers, payload)[source]¶
Sanitize the payload of an HTTP request by removing the token information before storing/retrieving archived items.
- Parameters
url – HTTP url request
headers – HTTP headers request
payload – HTTP payload request
- Returns
url, headers and the sanitized payload
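One way to strip token information before archiving is to drop the authorization header from a copy of the request data. This is a sketch under that assumption; where the real method removes the token from is not stated on this page, and the function below is a stand-alone illustration rather than the class's static method.

```python
# Header name copied from the HAUTHORIZATION constant above.
HAUTHORIZATION = 'Authorization'


def sanitize_for_archive(url, headers, payload):
    """Return the request triple with the Authorization header removed."""
    # Build a new dict so the caller's headers are left untouched.
    clean_headers = {
        key: value for key, value in (headers or {}).items()
        if key != HAUTHORIZATION
    }
    return url, clean_headers, payload
```

Working on a copy keeps the live request usable while the archived version carries no credentials.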
- class perceval.backends.core.mattermost.MattermostCommand(*args, debug=False)[source]¶
Bases:
perceval.backend.BackendCommand
Class to run Mattermost backend from the command line.
- BACKEND¶
- DESCRIPTION = 'Can either be called a channel ID, or a channel name. If a channel name is used, the team name is required. Otherwise, the team argument is ignored.'¶
perceval.backends.core.mbox module¶
- class perceval.backends.core.mbox.MBox(uri, dirpath, tag=None, archive=None, ssl_verify=True)[source]¶
Bases:
perceval.backend.Backend
MBox backend.
This class fetches the email messages stored in one or several mbox files. Initialize this class by passing the directory path where the mbox files are stored. The origin of the data will be set to the value of uri.
- Parameters
uri – URI of the mboxes; typically, the URL of their mailing list
dirpath – directory path where the mboxes are stored
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification
- CATEGORIES = ['message']¶
A list of categories that can be fetched by this backend.
Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- DATE_FIELD = 'Date'¶
- MESSAGE_ID_FIELD = 'Message-ID'¶
- fetch(category='message', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶
Fetch the messages from a set of mbox files.
The method retrieves, from mbox files, the messages stored in these containers.
- Parameters
category – the category of items to fetch
from_date – obtain messages since this date
- Returns
a generator of messages
- fetch_items(category, **kwargs)[source]¶
Fetch the messages
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_archiving()[source]¶
Returns whether the backend supports archiving items during the fetch process.
- Returns
this backend does not support item archiving
- classmethod has_resuming()[source]¶
Returns whether the backend supports resuming the fetch process.
- Returns
this backend supports resuming
- static metadata_category(item)[source]¶
Extracts the category from a MBox item.
This backend only generates one type of item which is ‘message’.
- static metadata_updated_on(item)[source]¶
Extracts the update time from a MBox item.
The timestamp used is extracted from ‘Date’ field in its several forms. This date is converted to UNIX timestamp format.
- Parameters
item – item generated by the backend
- Returns
a UNIX timestamp
- static parse_mbox(filepath)[source]¶
Parse a mbox file.
This method parses an mbox file and returns an iterator of dictionaries. Each of them contains an email message.
- Parameters
filepath – path of the mbox to parse
- Returns
a generator of messages; each message is stored in a dictionary of type requests.structures.CaseInsensitiveDict
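The idea of turning an mbox file into a stream of dictionaries can be sketched with the standard mailbox module. This is not the backend's parser: the function below is a stand-alone illustration, it uses plain dicts instead of CaseInsensitiveDict, and the 'body' key and the sample messages are assumptions made for the example.

```python
import mailbox
import os
import tempfile


def parse_mbox(filepath):
    """Yield each message of an mbox file as a dict of headers plus the body."""
    mbox = mailbox.mbox(filepath, create=False)
    try:
        for message in mbox:
            parsed = dict(message.items())
            # Payload of simple, non-multipart messages; 'body' is an assumed key.
            parsed['body'] = message.get_payload()
            yield parsed
    finally:
        mbox.close()


# Two made-up messages in mbox format.
SAMPLE = (
    "From alice@example.com Thu Jan  1 00:00:00 2015\n"
    "Message-ID: <1@example.com>\n"
    "Date: Thu, 01 Jan 2015 00:00:00 +0000\n"
    "Subject: Hello\n"
    "\n"
    "First message\n"
    "\n"
    "From bob@example.com Fri Jan  2 00:00:00 2015\n"
    "Message-ID: <2@example.com>\n"
    "Date: Fri, 02 Jan 2015 00:00:00 +0000\n"
    "Subject: Re: Hello\n"
    "\n"
    "Second message\n"
)


def demo():
    """Write the sample to a temporary file and return the parsed Message-IDs."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "sample.mbox")
        with open(path, "w") as fd:
            fd.write(SAMPLE)
        return [msg['Message-ID'] for msg in parse_mbox(path)]
```

The Message-ID and Date headers read here correspond to the MESSAGE_ID_FIELD and DATE_FIELD constants of the backend.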
- version = '0.13.1'¶
- class perceval.backends.core.mbox.MBoxArchive(filepath)[source]¶
Bases:
object
Class to access an mbox archive.
MBOX archives can be stored in plain or compressed files (gzip, bz2 or zip).
- Parameters
filepath – path to the mbox file
- property compressed_type¶
- property container¶
- property filepath¶
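Telling plain files from compressed ones can be done by looking at the first bytes of the file. The sketch below uses the well-known gzip, bzip2 and zip signatures; how MBoxArchive itself detects the type is not stated here, and detect_compression and open_text are hypothetical helpers.

```python
import bz2
import gzip

# Leading bytes of each container format.
MAGIC = {b'\x1f\x8b': 'gz', b'BZh': 'bz2', b'PK\x03\x04': 'zip'}


def detect_compression(filepath):
    """Return 'gz', 'bz2', 'zip' or None based on the file's leading bytes."""
    with open(filepath, 'rb') as fd:
        head = fd.read(4)
    for magic, kind in MAGIC.items():
        if head.startswith(magic):
            return kind
    return None


def open_text(filepath):
    """Open a plain, gzip or bz2 file for reading text (zip is left out here)."""
    kind = detect_compression(filepath)
    if kind == 'gz':
        return gzip.open(filepath, 'rt')
    if kind == 'bz2':
        return bz2.open(filepath, 'rt')
    if kind == 'zip':
        raise NotImplementedError("zip containers are not handled in this sketch")
    return open(filepath, 'r')
```

Checking the content rather than the file extension keeps the detection working for archives that were renamed.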
- class perceval.backends.core.mbox.MBoxCommand(*args, debug=False)[source]¶
Bases:
perceval.backend.BackendCommand
Class to run MBox backend from the command line.
- BACKEND¶
alias of
perceval.backends.core.mbox.MBox
- class perceval.backends.core.mbox.MailingList(uri, dirpath)[source]¶
Bases:
object
Manage mailing list archives.
This class gives access to the local mbox archives that a mailing list manages.
- Parameters
uri – URI of the mailing list, usually its URL
dirpath – path to the mboxes archives
- property mboxes¶
Get the mboxes managed by this mailing list.
Returns the archives sorted by name.
- Returns
a list of MBoxArchive objects
perceval.backends.core.mediawiki module¶
- class perceval.backends.core.mediawiki.MediaWiki(url, tag=None, archive=None, ssl_verify=True)[source]¶
Bases:
perceval.backend.Backend
MediaWiki backend for Perceval.
This class retrieves the wiki pages and edits from a MediaWiki site. To initialize this class the URL must be provided. The origin of the data will be set to this URL.
It uses different APIs to support pre and post 1.27 MediaWiki versions. The pre 1.27 approach performs better but needs different logic for full and incremental retrieval.
In pre 1.27 the incremental approach uses the recent changes API, which only covers the last MAX_RECENT_DAYS days. If the from_date used is older, all the pages must be retrieved and the consumer of the items must filter them itself.
Both approaches return a common format: a page with all its revisions. They differ in how the list of pages is generated.
Only the standard page and revision data are downloaded. More data could be gathered using additional properties.
Deleted pages are not analyzed.
- Parameters
url – MediaWiki url
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification
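The choice between incremental and full retrieval in the pre 1.27 approach, and the filtering left to the consumer, can be sketched as follows. The value given to MAX_RECENT_DAYS is an assumption, as are the helper names; the 'update' key follows the field named in metadata_updated_on().

```python
from datetime import datetime, timedelta, timezone

MAX_RECENT_DAYS = 30  # assumed value; check the backend for the real limit


def can_use_recent_changes(from_date, now=None, max_days=MAX_RECENT_DAYS):
    """Tell whether from_date is recent enough for the recent changes API."""
    now = now or datetime.now(timezone.utc)
    return now - from_date <= timedelta(days=max_days)


def filter_pages(pages, from_date):
    """Consumer-side filter for the full-retrieval case."""
    limit = from_date.timestamp()
    # 'update' is treated as a UNIX timestamp, as described for this backend.
    return [page for page in pages if page['update'] >= limit]
```

When can_use_recent_changes returns False, every page has to be retrieved and a filter such as filter_pages is applied afterwards by the consumer.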
- CATEGORIES = ['page']¶
A list of categories that can be fetched by this backend.
Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- fetch(category='page', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), reviews_api=False)[source]¶
Fetch the pages from the backend url.
The method retrieves, from a MediaWiki url, the wiki pages.
- Parameters
category – the category of items to fetch
from_date – obtain pages updated since this date
reviews_api – use the reviews API available in MediaWiki >= 1.27
- Returns
a generator of pages
- fetch_items(category, **kwargs)[source]¶
Fetch the pages
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_archiving()[source]¶
Returns whether it supports archiving items on the fetch process.
- Returns
this backend supports items archive
- classmethod has_resuming()[source]¶
Returns whether it supports resuming the fetch process.
- Returns
this backend does not support items resuming
- static metadata_category(item)[source]¶
Extracts the category from a MediaWiki item.
This backend only generates one type of item which is ‘page’.
- static metadata_updated_on(item)[source]¶
Extracts the update field from a MediaWiki item.
The timestamp is extracted from the ‘update’ field. This value is a UNIX timestamp but needs to be converted to a float.
- Parameters
item – item generated by the backend
- Returns
a UNIX timestamp
- version = '0.11.0'¶
- class perceval.backends.core.mediawiki.MediaWikiClient(url, archive=None, from_archive=False, ssl_verify=True)[source]¶
Bases:
perceval.client.HttpClient
MediaWiki API client.
This class implements a simple client to retrieve pages from projects in a MediaWiki node.
- Parameters
url – URL of the MediaWiki site, e.g. https://wiki.mozilla.org
archive – an archive to store/retrieve the fetched data
from_archive – define whether the archive is used to store/read data
ssl_verify – enable/disable SSL verification
- Raises
HTTPError – when an error occurs doing the request
- PACTION = 'action'¶
- PAP_CONTINUE = 'apcontinue'¶
- PAP_LIMIT = 'aplimit'¶
- PAP_NAMESPACE = 'apnamespace'¶
- PARV_CONTINUE = 'arvcontinue'¶
- PARV_DIR = 'arvdir'¶
- PARV_LIMIT = 'arvlimit'¶
- PARV_NAMESPACE = 'arvnamespace'¶
- PARV_PROP = 'arvprop'¶
- PARV_START = 'arvstart'¶
- PFORMAT = 'format'¶
- PLIST = 'list'¶
- PMETA = 'meta'¶
- PPAGE_IDS = 'pageids'¶
- PPROP = 'prop'¶
- PRC_CONTINUE = 'rccontinue'¶
- PRC_LIMIT = 'rclimit'¶
- PRC_NAMESPACE = 'rcnamespace'¶
- PRC_PROP = 'rcprop'¶
- PRV_DIR = 'rvdir'¶
- PRV_LIMIT = 'rvlimit'¶
- PRV_START = 'rvstart'¶
- PSIPROP = 'siprop'¶
- VALL_PAGES = 'allpages'¶
- VALL_REVISIONS = 'allrevisions'¶
- VIDS = 'ids'¶
- VJSON = 'json'¶
- VNAMESPACES = 'namespaces'¶
- VNEWER = 'newer'¶
- VQUERY = 'query'¶
- VRC_PROP = 'title|timestamp|ids'¶
- VRECENT_CHANGES = 'recentchanges'¶
- VREVISIONS = 'revisions'¶
- VSITE_INFO = 'siteinfo'¶
- call(params)[source]¶
Run an API command.
- Parameters
cgi – cgi command to run on the server
params – dict with the HTTP parameters needed to run the given command
- get_pages(namespace, apcontinue='')[source]¶
Retrieve all pages from a namespace starting from apcontinue.
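The constants listed above map onto the query parameters of the MediaWiki API. The sketch below shows how a request for all the pages of a namespace could be assembled from them. The helper name, the way the constants are combined, and the limit of 500 are assumptions made for illustration, not the client's actual code:

```python
from urllib.parse import urlencode

# Values copied from the MediaWikiClient constants listed above
PACTION, VQUERY = "action", "query"
PLIST, VALL_PAGES = "list", "allpages"
PAP_NAMESPACE, PAP_CONTINUE, PAP_LIMIT = "apnamespace", "apcontinue", "aplimit"
PFORMAT, VJSON = "format", "json"


def all_pages_params(namespace, apcontinue="", limit=500):
    """Build the query parameters for one 'allpages' request.

    The default limit is an assumption, not a value taken from the client.
    """
    params = {
        PACTION: VQUERY,
        PLIST: VALL_PAGES,
        PAP_NAMESPACE: namespace,
        PAP_LIMIT: limit,
        PFORMAT: VJSON,
    }
    if apcontinue:
        # Continuation token returned by the previous response
        params[PAP_CONTINUE] = apcontinue
    return params


query = urlencode(all_pages_params(0, apcontinue="Main_Page"))
```

The resulting query string would be appended to the API endpoint of the wiki; the continuation token is only sent when a previous response provided one.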
- class perceval.backends.core.mediawiki.MediaWikiCommand(*args, debug=False)[source]¶
Bases:
perceval.backend.BackendCommand
Class to run MediaWiki backend from the command line.
- BACKEND¶
perceval.backends.core.meetup module¶
- class perceval.backends.core.meetup.Meetup(group, api_token, max_items=200, tag=None, archive=None, sleep_for_rate=False, min_rate_to_sleep=1, sleep_time=30, ssl_verify=True)[source]¶
Bases:
perceval.backend.Backend
Meetup backend.
This class fetches the events of a group from the Meetup server. To initialize it, pass the OAuth2 token needed for authentication in the api_token parameter.
- Parameters
group – name of the group where data will be fetched
api_token – OAuth2 token to access the API
max_items – maximum number of items requested on the same query
tag – label used to mark the data
archive – archive to store/retrieve items
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to sleep until it is reset
sleep_time – time (in seconds) to sleep in case of connection problems
ssl_verify – enable/disable SSL verification
- CATEGORIES = ['event']¶
A list of categories that can be fetched by this backend.
Every backend is able to produce items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to
fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- CLASSIFIED_FIELDS = [['group', 'topics'], ['event_hosts'], ['rsvps'], ['venue']]¶
A list of fields that should be considered sensitive or confidential.
Fields listed here will be hidden from fetched items, when this behaviour is requested.
Each field is represented as a list of strings. As the items returned are dicts that may contain nested dicts, each entry is a list that stores the “path” of nested dict keys leading to the field to remove. For example, [‘my’, ‘classified’, ‘field’] will remove field from the item[‘data’][‘my’][‘classified’] dict.
To prevent data leaks or security issues, classified data filtering and archiving cannot be used together.
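The path-based removal described above can be sketched as follows. This is an illustration under stated assumptions rather than Perceval's implementation: the helper name and the sample event are hypothetical, and only the CLASSIFIED_FIELDS value comes from this class.

```python
import copy


def remove_classified(item, classified_fields):
    """Return a copy of item with every classified field dropped from item['data']."""
    filtered = copy.deepcopy(item)
    for path in classified_fields:
        node = filtered["data"]
        # Walk down to the dict that holds the last key of the path
        for key in path[:-1]:
            node = node.get(key)
            if not isinstance(node, dict):
                break  # the path does not exist in this item
        else:
            node.pop(path[-1], None)
    return filtered


CLASSIFIED_FIELDS = [["group", "topics"], ["event_hosts"], ["rsvps"], ["venue"]]

# Hypothetical event, reduced to a few fields
event = {
    "data": {
        "id": "1",
        "group": {"id": 7, "topics": ["python"]},
        "venue": {"city": "Madrid"},
        "rsvps": [],
    }
}
clean = remove_classified(event, CLASSIFIED_FIELDS)
```

Working on a deep copy keeps the original item intact, so the unfiltered data stays available to the caller.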
- EXTRA_SEARCH_FIELDS = {'group_id': ['group', 'id'], 'group_name': ['group', 'name']}¶
A set of search fields to simplify query operations.
The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from
fetch() in a dict with the following shape:
{
    ‘key-1’: value-1,
    ‘key-2’: value-2,
    ‘key-3’: value-3
}
These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict
EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:
{
    ‘project_id’: [‘fields’, ‘project’, ‘id’],
    ‘project_key’: [‘fields’, ‘project’, ‘key’],
    ‘project_name’: [‘fields’, ‘project’, ‘name’]
}
Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.
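To make the “path” idea concrete, the sketch below resolves each search field by walking its path inside an item. The helper name and the sample event are hypothetical; the EXTRA_SEARCH_FIELDS value is the one defined for this backend. It is only an illustration of the lookup, not the library's own code:

```python
def resolve_search_fields(data, extra_search_fields):
    """Map each search field name to the value found at its path in data."""
    resolved = {}
    for name, path in extra_search_fields.items():
        value = data
        for key in path:
            # Stop early when the path does not exist in this item
            value = value.get(key) if isinstance(value, dict) else None
            if value is None:
                break
        resolved[name] = value
    return resolved


EXTRA_SEARCH_FIELDS = {"group_id": ["group", "id"], "group_name": ["group", "name"]}

# Hypothetical event, reduced to the fields the paths point at
event = {"id": "abc", "group": {"id": 42, "name": "Example Group"}}
fields = resolve_search_fields(event, EXTRA_SEARCH_FIELDS)
```

Missing paths resolve to None in this sketch, which keeps the set of keys stable across items.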
- fetch(category='event', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), to_date=None, filter_classified=False)[source]¶
Fetch the events from the server.
This method fetches the events of a group stored on the server that were updated since the given date. Comments and RSVPs are included within each event.
- Parameters
category – the category of items to fetch
from_date – obtain events updated since this date
to_date – obtain events updated before this date
filter_classified – remove classified fields from the resulting items
- Returns
a generator of events
- fetch_items(category, **kwargs)[source]¶
Fetch the events
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_archiving()[source]¶
Returns whether it supports archiving items on the fetch process.
- Returns
this backend supports items archive
- classmethod has_resuming()[source]¶
Returns whether it supports resuming the fetch process.
- Returns
this backend supports items resuming
- static metadata_category(item)[source]¶
Extracts the category from a Meetup item.
This backend only generates one type of item which is ‘event’.
- static metadata_updated_on(item)[source]¶
Extracts and converts the update time from a Meetup item.
The timestamp is extracted from the ‘updated’ field and converted to a UNIX timestamp.
- Parameters
item – item generated by the backend
- Returns
a UNIX timestamp
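A minimal sketch of this conversion, assuming the ‘updated’ field holds milliseconds since the epoch. The function name and the sample value are hypothetical:

```python
def updated_on(item):
    """Convert the 'updated' field (assumed to be in milliseconds) to UNIX seconds."""
    return int(item["updated"]) / 1000.0


ts = updated_on({"updated": 1460065164000})
```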
- static parse_json(raw_json)[source]¶
Parse a Meetup JSON stream.
The method parses a JSON stream and returns a list with the parsed data.
- Parameters
raw_json – JSON string to parse
- Returns
a list with the parsed data
- version = '0.17.0'¶
- class perceval.backends.core.meetup.MeetupClient(api_token, max_items=200, sleep_for_rate=False, min_rate_to_sleep=1, sleep_time=30, archive=None, from_archive=False, ssl_verify=True)[source]¶
Bases:
perceval.client.HttpClient, perceval.client.RateLimitHandler
Meetup API client.
Client for fetching information from the Meetup server using its REST API v3.
- Parameters
api_token – OAuth2 token needed to access the API
max_items – maximum number of items per request
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to sleep until it is reset
sleep_time – time (in seconds) to sleep in case of connection problems
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification
- EXTRA_STATUS_FORCELIST = [429]¶
- PFIELDS = 'fields'¶
- PKEY_OAUTH2 = 'Authorization'¶
- PORDER = 'order'¶
- PPAGE = 'page'¶
- PRESPONSE = 'response'¶
- PSCROLL = 'scroll'¶
- PSTATUS = 'status'¶
- RCOMMENTS = 'comments'¶
- REVENTS = 'events'¶
- RRSVPS = 'rsvps'¶
- VEVENT_FIELDS = ['event_hosts', 'featured', 'group_topics', 'plain_text_description', 'rsvpable', 'series']¶
- VRESPONSE = ['yes', 'no']¶
- VRSVP_FIELDS = ['attendance_status']¶
- VSTATUS = ['cancelled', 'upcoming', 'past', 'proposed', 'suggested']¶
- VUPDATED = 'updated'¶
- calculate_time_to_reset()[source]¶
Number of seconds to wait, as given in the rate limit reset header.
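The interplay between the reset header and min_rate_to_sleep can be sketched as below. The header name X-RateLimit-Reset, both helper names, and the comparison used to decide when to sleep are assumptions made for illustration; the client may behave differently:

```python
def time_to_reset(headers, reset_header="X-RateLimit-Reset"):
    """Seconds to wait, read from the rate limit reset header (never negative)."""
    return max(int(headers.get(reset_header, 0)), 0)


def must_sleep(remaining, min_rate_to_sleep=1):
    """True when the remaining quota has dropped to the configured minimum."""
    return remaining <= min_rate_to_sleep


wait = time_to_reset({"X-RateLimit-Reset": "12"})
```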
- events(group, from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶
Fetch the events pages of a given group.
- static sanitize_for_archive(url, headers, payload)[source]¶
Sanitize the payload of an HTTP request by removing the token information before storing/retrieving archived items.
- Parameters
url – HTTP url request
headers – HTTP headers request
payload – HTTP payload request
- Returns
url, headers and the sanitized payload
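The idea behind sanitizing can be sketched as follows: drop the credential before the request data is written to, or looked up in, the archive. Where the token actually lives is an assumption here; the sketch removes an 'Authorization' entry (the value of PKEY_OAUTH2) from both headers and payload, and the token_key argument is hypothetical:

```python
def sanitize_for_archive(url, headers, payload, token_key="Authorization"):
    """Return the request parts with the token entry removed."""
    # Build new dicts so the live request keeps its credentials
    clean_headers = {k: v for k, v in (headers or {}).items() if k != token_key}
    clean_payload = {k: v for k, v in (payload or {}).items() if k != token_key}
    return url, clean_headers, clean_payload


url, headers, payload = sanitize_for_archive(
    "https://api.example.org/events",
    {"Authorization": "Bearer secret", "Accept": "application/json"},
    {"page": 200},
)
```

Removing the token also means the same request yields the same archive entry regardless of which credentials were used to fetch it.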
- class perceval.backends.core.meetup.MeetupCommand(*args, debug=False)[source]¶
Bases:
perceval.backend.BackendCommand
Class to run Meetup backend from the command line.
- BACKEND¶
alias of
perceval.backends.core.meetup.Meetup
perceval.backends.core.nntp module¶
- class perceval.backends.core.nntp.NNTP(host, group, tag=None, archive=None)[source]¶
Bases:
perceval.backend.Backend
NNTP backend.
This class fetches the articles published on a news group using NNTP. It is initialized with the host and the name of the news group.
- Parameters
host – host
group – name of the group
tag – label used to mark the data
archive – archive to store/retrieve items
- CATEGORIES = ['article']¶
A list of categories that can be fetched by this backend.
Every backend is able to produce items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to
fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- EXTRA_SEARCH_FIELDS = {'newsgroups': ['Newsgroups']}¶
A set of search fields to simplify query operations.
The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from
fetch() in a dict with the following shape:
{
    ‘key-1’: value-1,
    ‘key-2’: value-2,
    ‘key-3’: value-3
}
These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict
EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:
{
    ‘project_id’: [‘fields’, ‘project’, ‘id’],
    ‘project_key’: [‘fields’, ‘project’, ‘key’],
    ‘project_name’: [‘fields’, ‘project’, ‘name’]
}
Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.
- fetch(category='article', offset=1)[source]¶
Fetch articles posted on a news group.
This method fetches those messages or articles published on a news group starting on the given offset.
- Parameters
category – the category of items to fetch
offset – obtain messages from this offset
- Returns
a generator of articles
- fetch_items(category, **kwargs)[source]¶
Fetch the articles
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_archiving()[source]¶
Returns whether it supports archiving items on the fetch process.
- Returns
this backend supports items archive
- classmethod has_resuming()[source]¶
Returns whether it supports resuming the fetch process.
- Returns
this backend supports items resuming
- metadata(item, filter_classified=False)[source]¶
NNTP metadata.
This method overrides the metadata decorator to add extra NNTP-related information to the items.
- Parameters
item – an item fetched by a backend
filter_classified – sets if classified fields were filtered
- static metadata_category(item)[source]¶
Extracts the category from a NNTP item.
This backend only generates one type of item which is ‘article’.
- static metadata_updated_on(item)[source]¶
Extracts the update time from an NNTP item.
The timestamp is extracted from the ‘Date’ field and converted to a UNIX timestamp.
- Parameters
item – item generated by the backend
- Returns
a UNIX timestamp
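A minimal sketch of this conversion using only the standard library, assuming the ‘Date’ field follows the usual RFC 2822 date format and carries a UTC offset. The function name and the sample value are hypothetical:

```python
from email.utils import parsedate_to_datetime


def updated_on(item):
    """Convert the RFC 2822 'Date' header of an article to a UNIX timestamp."""
    return parsedate_to_datetime(item["Date"]).timestamp()


ts = updated_on({"Date": "Thu, 01 Jan 1970 00:01:00 +0000"})
```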
- static parse_article(raw_article)[source]¶
Parse an NNTP article.
This method parses an NNTP article stored in a string object and returns a dictionary.
- Parameters
raw_article – NNTP article string
- Returns
a dictionary of type requests.structures.CaseInsensitiveDict
- Raises
ParseError – when an error is found parsing the article
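As an illustration of what parsing an article involves, the sketch below uses the standard library email module to split a raw article into headers and body. It returns a plain dict rather than the CaseInsensitiveDict mentioned above, and the 'body' key and the sample article are hypothetical, so treat it as a simplified stand-in and not the backend's parser:

```python
import email

# Hypothetical raw article: headers, a blank line, then the body
RAW_ARTICLE = (
    "From: jdoe@example.com\n"
    "Newsgroups: example.dev\n"
    "Subject: Hello\n"
    "Date: Thu, 01 Jan 1970 00:01:00 +0000\n"
    "\n"
    "Body of the article.\n"
)


def parse_article(raw_article):
    """Return the headers of a raw article plus its body under 'body'."""
    msg = email.message_from_string(raw_article)
    article = dict(msg.items())
    article["body"] = msg.get_payload()
    return article


article = parse_article(RAW_ARTICLE)
```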
- version = '0.6.0'¶
- class perceval.backends.core.nntp.NNTPCommand(*args, debug=False)[source]¶
Bases:
perceval.backend.BackendCommand
Class to run NNTP backend from the command line.
- BACKEND¶
alias of
perceval.backends.core.nntp.NNTP
- class perceval.backends.core.nntp.NNTTPClient(host, archive=None, from_archive=False)[source]¶
Bases:
object
NNTP client
- Parameters
host – host
group – name of the group
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
- ARTICLE = 'article'¶
- GROUP = 'group'¶
- OVER = 'over'¶
perceval.backends.core.pagure module¶
- class perceval.backends.core.pagure.Pagure(namespace=None, repository=None, api_token=None, tag=None, archive=None, max_retries=5, sleep_time=1, max_items=100, ssl_verify=True)[source]¶
Bases:
perceval.backend.Backend
Pagure backend for Perceval.
This class fetches the issues stored in a Pagure repository.
- Parameters
namespace – Pagure namespace
repository – Pagure repository
api_token – Pagure API token to access the API
tag – label used to mark the data
archive – archive to store/retrieve items
max_retries – number of max retries to a data source before raising a RetryError exception
max_items – max number of category items (e.g., issues, pull requests) per query
sleep_time – time to sleep in case of connection problems
ssl_verify – enable/disable SSL verification
- CATEGORIES = ['issue']¶
A list of categories that can be fetched by this backend.
Every backend is able to produce items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to
fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- fetch(category='issue', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), to_date=datetime.datetime(2100, 1, 1, 0, 0, tzinfo=tzutc()), filter_classified=False)[source]¶
Fetch the issues from the repository.
The method retrieves, from a Pagure repository, the issues updated since/until the given date.
- Parameters
category – the category of items to fetch
from_date – obtain issues updated since this date
to_date – obtain issues updated until a specific date (included)
filter_classified – remove classified fields from the resulting items
- Returns
a generator of issues
- fetch_items(category, **kwargs)[source]¶
Fetch the items (issues)
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_archiving()[source]¶
Returns whether it supports archiving items on the fetch process.
- Returns
this backend supports items archive
- classmethod has_resuming()[source]¶
Returns whether it supports resuming the fetch process.
- Returns
this backend supports items resuming
- static metadata_category(item)[source]¶
Extracts the category from a Pagure item.
This backend generates one type of item which is ‘issue’.
- static metadata_updated_on(item)[source]¶
Extracts the update time from a Pagure item.
The timestamp used is extracted from the ‘last_updated’ field. This date is converted to UNIX timestamp format. As Pagure dates are already in timestamp format, the conversion is straightforward.
- Parameters
item – item generated by the backend
- Returns
a UNIX timestamp
- search_fields(item)[source]¶
Add search fields to an item.
It adds the values of metadata_id plus the namespace and repo.
- Parameters
item – the item to extract the search fields values
- Returns
a dict of search fields
- version = '0.1.2'¶
- class perceval.backends.core.pagure.PagureClient(namespace, repository, token, sleep_time=1, max_retries=5, max_items=100, archive=None, from_archive=False, ssl_verify=True)[source]¶
Bases:
perceval.client.HttpClient
Client for retrieving information from Pagure API
- Parameters
namespace – Pagure namespace
repository – Pagure repository
token – Pagure API token to access the API
sleep_time – time to sleep in case of connection problems
max_retries – number of max retries to a data source before raising a RetryError exception
max_items – max number of category items per query
archive – collect issues already retrieved from an archive
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification
- HAUTHORIZATION = 'Authorization'¶
- PORDER = 'order'¶
- PPER_PAGE = 'per_page'¶
- PSINCE = 'since'¶
- PSTATUS = 'status'¶
- RISSUES = 'issues'¶
- VORDER_ASC = 'asc'¶
- VSTATUS_ALL = 'all'¶
- fetch(url, payload=None, headers=None)[source]¶
Fetch the data from a given URL.
- Parameters
url – link to the resource
payload – payload of the request
headers – headers of the request
- Returns
a response object
- fetch_items(path, payload)[source]¶
Return the items from Pagure API using links pagination
- Parameters
path – Path from which the item is to be fetched
payload – Payload to be added to the request
- Returns
a generator of items
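Link-based pagination can be sketched as a generator that keeps requesting pages until the response no longer names a next one. The 'pagination' and 'next' field names, the helper name, and the injected fetch_page callable are assumptions made so the example runs without a server; only the 'issues' key comes from the RISSUES constant:

```python
def iter_paginated(first_url, fetch_page):
    """Yield items page by page, following the link to the next page."""
    url = first_url
    while url:
        page = fetch_page(url)
        yield from page["issues"]
        # Assumed layout: the response names the next page, or None at the end
        url = page.get("pagination", {}).get("next")


# Hypothetical responses keyed by URL, standing in for HTTP requests
PAGES = {
    "page-1": {"issues": [{"id": 1}, {"id": 2}], "pagination": {"next": "page-2"}},
    "page-2": {"issues": [{"id": 3}], "pagination": {"next": None}},
}
ids = [issue["id"] for issue in iter_paginated("page-1", PAGES.__getitem__)]
```

Passing the page fetcher as an argument keeps the pagination logic independent of the HTTP layer, which also makes it easy to test.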
- issues(from_date=None)[source]¶
Fetch the issues from the repository.
The method retrieves, from a Pagure repository, the issues updated since the given date.
- Parameters
from_date – obtain issues updated since this date
- Returns
a generator of issues
- static sanitize_for_archive(url, headers, payload)[source]¶
Sanitize the payload of an HTTP request by removing the token information before storing/retrieving archived items.
- Parameters
url – HTTP url request
headers – HTTP headers request
payload – HTTP payload request
- Returns
url, headers and the sanitized payload
- class perceval.backends.core.pagure.PagureCommand(*args, debug=False)[source]¶
Bases:
perceval.backend.BackendCommand
Class to run Pagure backend from the command line.
- BACKEND¶
alias of
perceval.backends.core.pagure.Pagure
perceval.backends.core.phabricator module¶
- class perceval.backends.core.phabricator.ConduitClient(base_url, api_token, max_retries=5, sleep_time=1, archive=None, from_archive=False, ssl_verify=True)[source]¶
Bases:
perceval.client.HttpClient
Conduit API Client.
Phabricator uses Conduit as the Phabricator REST API. This class implements some of its methods to retrieve the contents from a Phabricator server.
- Parameters
base_url – URL of the Phabricator server
api_token – token to get access to restricted methods of the API
max_retries – number of max retries to a data source before raising a RetryError exception
sleep_time – time (in seconds) to sleep in case of connection problems
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification
- EXTRA_STATUS_FORCELIST = [429, 502, 503]¶
- MANIPHEST_TASKS = 'maniphest.search'¶
- MANIPHEST_TRANSACTIONS = 'maniphest.gettasktransactions'¶
- PAFTER = 'after'¶
- PATTACHMENTS = 'attachments'¶
- PCONSTRAINTS = 'constraints'¶
- PHAB_PHIDS = 'phid.query'¶
- PHAB_USERS = 'user.query'¶
- PHIDS = 'phids'¶
- PIDS = 'ids'¶
- PMODIFIED_START = 'modifiedStart'¶
- PORDER = 'order'¶
- PPROJECTS = 'projects'¶
- URL = '%(base)s/api/%(method)s'¶
- VOUTDATED = 'outdated'¶
- static sanitize_for_archive(url, headers, payload)[source]¶
Sanitize the payload of an HTTP request by removing the token information before storing/retrieving archived items.
- Parameters
url – HTTP url request
headers – HTTP headers request
payload – HTTP payload request
- Returns
url, headers and the sanitized payload
- tasks(from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶
Retrieve tasks.
- Parameters
from_date – retrieve tasks that were updated since that date; dates are converted to epoch time.
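The URL template and parameter names listed for this client suggest how a task request could be assembled. The sketch below is a guess at that composition, not the client's code: the helper name and the example server are hypothetical, while the constant values are copied from the attributes above.

```python
from datetime import datetime, timezone

# Values copied from the ConduitClient constants listed above
URL = "%(base)s/api/%(method)s"
MANIPHEST_TASKS = "maniphest.search"
PCONSTRAINTS, PMODIFIED_START = "constraints", "modifiedStart"


def tasks_request(base_url, from_date):
    """Build the URL and parameters to search tasks modified since from_date."""
    url = URL % {"base": base_url, "method": MANIPHEST_TASKS}
    # Dates are sent as epoch time (whole seconds)
    params = {PCONSTRAINTS: {PMODIFIED_START: int(from_date.timestamp())}}
    return url, params


url, params = tasks_request(
    "https://phab.example.org",
    datetime(1970, 1, 2, tzinfo=timezone.utc),
)
```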
- exception perceval.backends.core.phabricator.ConduitError(**kwargs)[source]¶
Bases:
perceval.errors.BaseError
Raised when an error occurs using Conduit
- message = '%(error)s (code: %(code)s)'¶
- class perceval.backends.core.phabricator.Phabricator(url, api_token, tag=None, archive=None, max_retries=5, sleep_time=1, ssl_verify=True)[source]¶
Bases:
perceval.backend.Backend
Phabricator backend.
This class fetches the tasks stored on a Phabricator server. To initialize it, pass the URL of the server and the API token. The origin of the data will be set to this URL.
- Parameters
url – URL of the server
api_token – token needed to use the API
tag – label used to mark the data
archive – archive to store/retrieve items
max_retries – number of max retries to a data source before raising a RetryError exception
sleep_time – time (in seconds) to sleep in case of connection problems
ssl_verify – enable/disable SSL verification
- CATEGORIES = ['task']¶
A list of categories that can be fetched by this backend.
Every backend is able to produce items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to
fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- fetch(category='task', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶
Fetch the tasks from the server.
This method fetches the tasks stored on the server that were updated since the given date. The transactions data related to each task is also included within them.
- Parameters
category – the category of items to fetch
from_date – obtain tasks updated since this date
- Returns
a generator of tasks
- fetch_items(category, **kwargs)[source]¶
Fetch the tasks
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_archiving()[source]¶
Returns whether it supports archiving items on the fetch process.
- Returns
this backend supports items archive
- classmethod has_resuming()[source]¶
Returns whether it supports resuming the fetch process.
- Returns
this backend supports items resuming
- static metadata_category(item)[source]¶
Extracts the category from a Phabricator item.
This backend only generates one type of item which is ‘task’.
- static metadata_updated_on(item)[source]¶
Extracts and converts the update time from a Phabricator item.
The timestamp is extracted from the ‘dateModified’ field. This date is in UNIX timestamp format but needs to be converted to a float number.
- Parameters
item – item generated by the backend
- Returns
a UNIX timestamp
- static parse_phids(results)[source]¶
Parse a Phabricator PHIDs JSON stream.
This method parses a JSON stream and returns a list iterator. Each item is a dictionary that contains the PHID parsed data.
- Parameters
results – JSON to parse
- Returns
a generator of parsed PHIDs
- static parse_tasks(raw_json)[source]¶
Parse a Phabricator tasks JSON stream.
The method parses a JSON stream and returns a list iterator. Each item is a dictionary that contains the task parsed data.
- Parameters
raw_json – JSON string to parse
- Returns
a generator of parsed tasks
- static parse_tasks_transactions(raw_json)[source]¶
Parse a Phabricator tasks transactions JSON stream.
The method parses a JSON stream and returns a dictionary with the parsed transactions.
- Parameters
raw_json – JSON string to parse
- Returns
a dict with the parsed transactions
- static parse_users(raw_json)[source]¶
Parse a Phabricator users JSON stream.
The method parses a JSON stream and returns a list iterator. Each item is a dictionary that contains the user parsed data.
- Parameters
raw_json – JSON string to parse
- Returns
a generator of parsed users
- version = '0.13.0'¶
- class perceval.backends.core.phabricator.PhabricatorCommand(*args, debug=False)[source]¶
Bases:
perceval.backend.BackendCommand
Class to run Phabricator backend from the command line.
- BACKEND¶
perceval.backends.core.pipermail module¶
- class perceval.backends.core.pipermail.Pipermail(url, dirpath, tag=None, archive=None, ssl_verify=True)[source]¶
Bases:
perceval.backends.core.mbox.MBox
Pipermail backend.
This class fetches the email messages stored on a Pipermail archiver. To initialize it, pass the URL of the archiver and the directory path where the mbox files will be fetched and stored. The origin of the data will be set to the value of url.
- Parameters
url – URL to the Pipermail archiver
dirpath – directory path where the mboxes are stored
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification
- CATEGORIES = ['message']¶
A list of categories that can be fetched by this backend.
Every backend is able to produce items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to
fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- fetch(category='message', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶
Fetch the messages from the Pipermail archiver.
The method fetches the mbox files from a remote Pipermail archiver and retrieves the messages stored on them.
- Parameters
category – the category of items to fetch
from_date – obtain messages since this date
- Returns
a generator of messages
- fetch_items(category, **kwargs)[source]¶
Fetch the messages
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_archiving()[source]¶
Returns whether it supports archiving items on the fetch process.
- Returns
this backend does not support items archive
- classmethod has_resuming()[source]¶
Returns whether it supports resuming the fetch process.
- Returns
this backend supports items resuming
- version = '0.11.1'¶
- class perceval.backends.core.pipermail.PipermailCommand(*args, debug=False)[source]¶
Bases:
perceval.backend.BackendCommand
Class to run Pipermail backend from the command line.
- BACKEND¶
- class perceval.backends.core.pipermail.PipermailList(url, dirpath, ssl_verify=True)[source]¶
Bases:
perceval.backends.core.mbox.MailingList
Manage mailing list archives stored by Pipermail archiver.
This class gives access to the remote and local mbox archives of a mailing list stored by Pipermail. It also keeps them in sync.
- Parameters
url – URL to the Pipermail archiver for this list
dirpath – path to the local mboxes archives
ssl_verify – enable/disable SSL verification
- fetch(from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶
Fetch the mbox files from the remote archiver.
Stores the archives in the path given during the initialization of this object. Archives that do not have a valid extension will be ignored.
Pipermail archives usually include in their file names the date of the stored messages, following the schema year-month. When from_date is given, only the mboxes whose year and month are equal to or later than that date are returned.
- Parameters
from_date – fetch archives that store messages sent on or after the given date; only year and month values are compared
- Returns
a list of tuples, storing the links and paths of the fetched archives
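The year-month comparison can be sketched as below. The file name pattern (for instance 2016-March.txt.gz), the accepted extensions, and the helper name are assumptions chosen for illustration; the backend's own rules for valid names and extensions may differ:

```python
import re
from datetime import datetime

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
# Assumed schema: <year>-<Month name>.txt, optionally gzip-compressed
ARCHIVE_NAME = re.compile(r"^(\d{4})-(\w+)\.txt(\.gz)?$")


def is_on_or_after(filename, from_date):
    """True when the archive's year-month is equal to or later than from_date's."""
    match = ARCHIVE_NAME.match(filename)
    if not match or match.group(2) not in MONTHS:
        return False  # names that do not follow the schema are ignored
    year, month = int(match.group(1)), MONTHS.index(match.group(2)) + 1
    # Only year and month take part in the comparison; the day is ignored
    return (year, month) >= (from_date.year, from_date.month)


from_date = datetime(2016, 3, 20)
names = ["2016-February.txt.gz", "2016-March.txt.gz", "2017-January.txt", "index.html"]
selected = [name for name in names if is_on_or_after(name, from_date)]
```

Comparing (year, month) tuples means an archive for March 2016 is still selected when from_date falls in the middle of that month.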
- property mboxes¶
Get the mboxes managed by this mailing list.
Returns the archives sorted by date in ascending order.
- Returns
a list of .MBoxArchive objects
perceval.backends.core.redmine module¶
- class perceval.backends.core.redmine.Redmine(url, api_token=None, max_issues=100, tag=None, archive=None, ssl_verify=True)[source]¶
Bases:
perceval.backend.Backend
Redmine backend.
This class fetches the issues stored on a Redmine server. To initialize it, pass the URL of the server. Some servers require authentication to access certain data; in that case, pass the API token in the api_token parameter.
- Parameters
url – URL of the server
api_token – token needed to use the API
max_issues – maximum number of issues requested on the same query
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification
- CATEGORIES = ['issue']¶
A list of categories that can be fetched by this backend.
Every backend is able to produce items that fall into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to
fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- EXTRA_SEARCH_FIELDS = {'project_id': ['project', 'id'], 'project_name': ['project', 'name']}¶
A set of search fields to simplify query operations.
The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from
fetch() in a dict with the following shape:
{
    ‘key-1’: value-1,
    ‘key-2’: value-2,
    ‘key-3’: value-3
}
These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict
EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:
{
    ‘project_id’: [‘fields’, ‘project’, ‘id’],
    ‘project_key’: [‘fields’, ‘project’, ‘key’],
    ‘project_name’: [‘fields’, ‘project’, ‘name’]
}
Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.
- fetch(category='issue', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶
Fetch the issues from the server.
This method fetches the issues stored on the server that were updated since the given date. Data about attachments, journals and watchers (among others) are included within each issue.
- Parameters
category – the category of items to fetch
from_date – obtain issues updated since this date
- Returns
a generator of issues
- fetch_items(category, **kwargs)[source]¶
Fetch the issues
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_archiving()[source]¶
Returns whether it supports archiving items during the fetch process.
- Returns
this backend supports items archive
- classmethod has_resuming()[source]¶
Returns whether it supports resuming the fetch process.
- Returns
this backend supports items resuming
- static metadata_category(item)[source]¶
Extracts the category from a Redmine item.
This backend only generates one type of item which is ‘issue’.
- static metadata_updated_on(item)[source]¶
Extracts and converts the update time from a Redmine item.
The timestamp is extracted from ‘updated_on’ field and converted to a UNIX timestamp.
- Parameters
item – item generated by the backend
- Returns
a UNIX timestamp
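A minimal sketch of the conversion described above, assuming the ‘updated_on’ field holds an ISO 8601 string in UTC with a trailing ‘Z’. The function name and the sample item are hypothetical, and the backend itself may rely on a different date parser.

```python
from datetime import datetime, timezone


def updated_on_to_timestamp(item):
    """Convert an ISO 8601 'updated_on' string into a UNIX timestamp."""
    raw = item['updated_on']
    # Assumption: the value ends in 'Z' (UTC), e.g. '2016-07-27T09:57:02Z'.
    parsed = datetime.strptime(raw, '%Y-%m-%dT%H:%M:%SZ')
    return parsed.replace(tzinfo=timezone.utc).timestamp()


sample_item = {'updated_on': '1970-01-02T00:00:00Z'}  # made-up item
print(updated_on_to_timestamp(sample_item))
```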
- static parse_issue_data(raw_json)[source]¶
Parse a Redmine issue JSON stream.
The method parses a JSON stream and returns a dictionary with the parsed data for the given issue.
- Parameters
raw_json – JSON string to parse
- Returns
a dictionary with the parsed issue data
- static parse_issues(raw_json)[source]¶
Parse a Redmine issues JSON stream.
The method parses a JSON stream and returns a list iterator. Each item is a dictionary that contains the issue parsed data.
- Parameters
raw_json – JSON string to parse
- Returns
a generator of parsed issues
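The generator behaviour described above can be sketched as follows. The example assumes that a page of results keeps its issues in a top-level ‘issues’ list; the function name `iter_issues` and the sample payload are illustrative only.

```python
import json


def iter_issues(raw_json):
    """Yield each issue dict from a JSON page of issues."""
    payload = json.loads(raw_json)
    # Assumption: the page keeps its issues in a top-level 'issues' list.
    for issue in payload.get('issues', []):
        yield issue


raw_page = '{"issues": [{"id": 1, "subject": "first"}, {"id": 2, "subject": "second"}]}'
issue_ids = [issue['id'] for issue in iter_issues(raw_page)]
print(issue_ids)
```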
- static parse_user_data(raw_json)[source]¶
Parse a Redmine user JSON stream.
The method parses a JSON stream and returns a dictionary with the parsed data for the given user.
- Parameters
raw_json – JSON string to parse
- Returns
a dictionary with the parsed user data
- version = '0.11.0'¶
- class perceval.backends.core.redmine.RedmineClient(base_url, api_token=None, archive=None, from_archive=False, ssl_verify=True)[source]¶
Bases:
perceval.client.HttpClient
Redmine API client.
This class implements a client that retrieves issues from a Redmine server. Redmine servers provide a REST API that returns its results in JSON format.
- Parameters
base_url – URL of the Redmine server
api_token – token to get access to restricted data stored in the server
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification
- CATTACHMENTS = 'attachments'¶
- CCHANGESETS = 'changesets'¶
- CCHILDREN = 'children'¶
- CJOURNALS = 'journals'¶
- CJSON = '.json'¶
- CRELATIONS = 'relations'¶
- CWATCHERS = 'watchers'¶
- PINCLUDE = 'include'¶
- PKEY = 'key'¶
- PLIMIT = 'limit'¶
- POFFSET = 'offset'¶
- PSORT = 'sort'¶
- PSTATUS_ID = 'status_id'¶
- PUPDATED_ON = 'updated_on'¶
- RISSUES = 'issues'¶
- RUSERS = 'users'¶
- URL = '%(base)s/%(resource)s'¶
- issue(issue_id)[source]¶
Get the information of the given issue.
- Parameters
issue_id – issue identifier
- issues(from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), offset=None, max_issues=100)[source]¶
Get the information of a list of issues.
- Parameters
from_date – retrieve issues that were updated since that date; dates are converted to UTC
offset – starting position for the search
max_issues – maximum number of issues to return per query
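To show how the class constants might fit together, the sketch below composes a query URL for one page of issues. The constant values are copied from the listing above; the function name, the example host and the chosen sort key are assumptions, not the client's actual request logic.

```python
from urllib.parse import urlencode

# Constants copied from the RedmineClient listing above.
URL = '%(base)s/%(resource)s'
RISSUES = 'issues'
CJSON = '.json'
PLIMIT = 'limit'
POFFSET = 'offset'
PSORT = 'sort'


def build_issues_url(base_url, offset, max_issues):
    """Compose a query URL for one page of issues (illustrative only)."""
    resource = RISSUES + CJSON
    url = URL % {'base': base_url, 'resource': resource}
    params = {
        PLIMIT: max_issues,
        POFFSET: offset,
        PSORT: 'updated_on',  # assumed sort key
    }
    return url + '?' + urlencode(params)


issues_url = build_issues_url('https://redmine.example.com', offset=0, max_issues=100)
print(issues_url)
```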
- static sanitize_for_archive(url, headers, payload)[source]¶
Sanitize the payload of an HTTP request by removing the token information before storing/retrieving archived items.
- Parameters
url – HTTP url request
headers – HTTP headers request
payload – HTTP payload request
- Returns
url, headers and the sanitized payload
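A sketch of what such sanitizing could look like, assuming the token travels in the payload under the PKEY name listed above. Copying the payload before removing the key is a choice made for this example and may differ from the library's behaviour.

```python
PKEY = 'key'  # name of the token parameter, as listed in RedmineClient


def sanitize_for_archive(url, headers, payload):
    """Return the request with the API token stripped from the payload."""
    # Work on a copy so the caller's payload keeps its token.
    sanitized = dict(payload) if payload else payload
    if sanitized and PKEY in sanitized:
        sanitized.pop(PKEY)
    return url, headers, sanitized


original = {'key': 'secret-token', 'limit': 100}
_, _, clean = sanitize_for_archive('https://redmine.example.com/issues.json', None, original)
print(clean)
```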
- class perceval.backends.core.redmine.RedmineCommand(*args, debug=False)[source]¶
Bases:
perceval.backend.BackendCommand
Class to run Redmine backend from the command line.
- BACKEND¶
perceval.backends.core.rocketchat module¶
- class perceval.backends.core.rocketchat.RocketChat(url, channel, user_id, api_token, max_items=100, sleep_for_rate=False, min_rate_to_sleep=10, tag=None, archive=None, ssl_verify=True)[source]¶
Bases:
perceval.backend.Backend
Rocket.Chat backend.
This class fetches messages from a channel (room) on a Rocket.Chat server. An API token and a User Id are required to access the server.
- Parameters
url – server url from where messages are to be fetched
channel – name of the channel from where data will be fetched
user_id – generated User Id using your Rocket.Chat account
api_token – token needed to use the API
max_items – maximum number of messages requested in the same query
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to sleep until it will be reset
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification
- CATEGORIES = ['message']¶
A list of categories that can be fetched by this backend.
Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- EXTRA_SEARCH_FIELDS = {'channel_id': ['channel_info', '_id'], 'channel_name': ['channel_info', 'name']}¶
A set of search fields to simplify query operations.
The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:
{‘key-1’: value-1, ‘key-2’: value-2, ‘key-3’: value-3}
These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:
{‘project_id’: [‘fields’, ‘project’, ‘id’], ‘project_key’: [‘fields’, ‘project’, ‘key’], ‘project_name’: [‘fields’, ‘project’, ‘name’]}
Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.
- fetch(category='message', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()), filter_classified=False)[source]¶
Fetch the messages from the channel.
This method fetches the messages stored on the channel that were sent since the given date.
- Parameters
category – the category of items to fetch
from_date – obtain messages sent since this date
filter_classified – remove classified fields from the resulting items
- Returns
a generator of messages
- fetch_items(category, **kwargs)[source]¶
Fetch the messages.
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_archiving()[source]¶
Returns whether it supports archiving items during the fetch process.
- Returns
this backend supports items archive
- classmethod has_resuming()[source]¶
Returns whether it supports resuming the fetch process.
- Returns
this backend supports items resuming
- static metadata_category(item)[source]¶
Extracts the category from a Rocket.Chat item.
This backend only generates one type of item which is ‘message’.
- static metadata_updated_on(item)[source]¶
Extracts the update time from a Rocket.Chat item.
The timestamp is extracted from ‘ts’ field, and then converted into a UNIX timestamp.
- Parameters
item – item generated by the backend
- Returns
extracted timestamp
- static parse_channel_info(raw_channel_info)[source]¶
Parse a channel’s information JSON stream.
This method parses a JSON stream, containing the information of the channel, and returns a dict with the parsed data.
- Parameters
raw_channel_info – JSON string to parse
- Returns
a dict with the parsed channel’s information
- static parse_messages(raw_messages)[source]¶
Parse a channel messages JSON stream.
This method parses a JSON stream, containing the history of a channel. It returns a list of messages and the total messages count in that channel.
- Parameters
raw_messages – JSON string to parse
- Returns
a tuple with a list of dicts with the parsed messages and a total messages count in the channel.
- version = '0.1.0'¶
- class perceval.backends.core.rocketchat.RocketChatClient(url, user_id, api_token, max_items=100, sleep_for_rate=False, min_rate_to_sleep=10, from_archive=False, archive=None, ssl_verify=True)[source]¶
Bases:
perceval.client.HttpClient, perceval.client.RateLimitHandler
Rocket.Chat API client.
Client for fetching information from the Rocket.Chat server using its REST API.
- Parameters
url – server url from where messages are to be fetched
user_id – generated User Id using your Rocket.Chat account
api_token – token needed to use the API
max_items – maximum number of messages requested in the same query
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to sleep until it will be reset
from_archive – it tells whether to write/read the archive
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification
- HAUTH_TOKEN = 'X-Auth-Token'¶
- HUSER_ID = 'X-User-Id'¶
- PCHANNEL_NAME = 'roomName'¶
- PCOUNT = 'count'¶
- POLDEST = 'oldest'¶
- RCHANNEL_INFO = 'channels.info'¶
- RCHANNEL_MESSAGES = 'channels.messages'¶
- calculate_time_to_reset()[source]¶
Number of seconds to wait. They are contained in the rate limit reset header.
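The calculation can be pictured with the sketch below. It assumes the reset header carries an epoch timestamp in milliseconds; the function name and its parameters are hypothetical, and the current time is passed in explicitly so the result is reproducible.

```python
def seconds_until_reset(reset_epoch_ms, now_epoch_s):
    """Seconds left until the rate limit resets, never negative."""
    # Assumption: the reset header carries an epoch timestamp in milliseconds.
    remaining = reset_epoch_ms / 1000 - now_epoch_s
    return max(remaining, 0)


wait = seconds_until_reset(reset_epoch_ms=1_600_000_030_000, now_epoch_s=1_600_000_000)
print(wait)
```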
- fetch(url, payload=None, headers=None)[source]¶
Fetch the data from a given URL.
- Parameters
url – link to the resource
payload – payload of the request
headers – headers of the request
- Returns
a response object
- messages(channel, from_date, offset)[source]¶
Fetch messages from a channel.
The messages are fetched in ascending order, i.e. from the oldest to the latest, based on the time they were last updated. A query is also passed as a parameter to fetch the messages from a given date.
- static sanitize_for_archive(url, headers, payload)[source]¶
Sanitize the payload of an HTTP request by removing the token and user id information before storing/retrieving archived items.
- Parameters
url – HTTP url request
headers – HTTP headers request
payload – HTTP payload request
- Returns
url, headers and the sanitized payload
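For illustration, and assuming the credentials are carried in the two headers named by HAUTH_TOKEN and HUSER_ID, removing them before archiving might look like this. The helper `strip_credentials` is hypothetical and only covers the header side of the request.

```python
# Header names copied from the RocketChatClient listing above.
HAUTH_TOKEN = 'X-Auth-Token'
HUSER_ID = 'X-User-Id'


def strip_credentials(headers):
    """Return a copy of the headers without the token and user id."""
    return {
        name: value
        for name, value in headers.items()
        if name not in (HAUTH_TOKEN, HUSER_ID)
    }


request_headers = {
    'X-Auth-Token': 'secret-token',
    'X-User-Id': 'user-123',
    'Accept': 'application/json',
}
clean_headers = strip_credentials(request_headers)
print(clean_headers)
```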
- class perceval.backends.core.rocketchat.RocketChatCommand(*args, debug=False)[source]¶
Bases:
perceval.backend.BackendCommand
Class to run Rocket.Chat backend from the command line.
- BACKEND¶
perceval.backends.core.rss module¶
- class perceval.backends.core.rss.RSS(url, tag=None, archive=None, ssl_verify=True)[source]¶
Bases:
perceval.backend.Backend
RSS backend for Perceval.
This class retrieves the entries from a RSS feed. To initialize this class the URL must be provided. The url will be set as the origin of the data.
- Parameters
url – RSS url
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification
- CATEGORIES = ['entry']¶
A list of categories that can be fetched by this backend.
Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- fetch(category='entry')[source]¶
Fetch the entries from the url.
The method retrieves all entries from a RSS url
- Parameters
category – the category of items to fetch
- Returns
a generator of entries
- fetch_items(category, **kwargs)[source]¶
Fetch the entries
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_archiving()[source]¶
Returns whether it supports archiving entries during the fetch process.
- Returns
this backend supports entries archive
- classmethod has_resuming()[source]¶
Returns whether it supports resuming the fetch process.
- Returns
this backend does not support entries resuming
- static metadata_category(item)[source]¶
Extracts the category from a RSS item.
This backend only generates one type of item which is ‘entry’.
- static metadata_updated_on(item)[source]¶
Extracts the update time from a RSS item.
The timestamp is extracted from ‘published’ field. This date is a datetime string that needs to be converted to a UNIX timestamp float value.
- Parameters
item – item generated by the backend
- Returns
a UNIX timestamp
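A minimal sketch of that conversion, assuming the ‘published’ value uses the RFC 822 date format that is common in RSS 2.0 feeds. The function name and the sample entry are hypothetical, and the backend may use a more tolerant date parser.

```python
from email.utils import parsedate_to_datetime


def published_to_timestamp(item):
    """Convert an RFC 822 'published' date string to a UNIX timestamp."""
    # Assumption: the feed uses the RFC 822 date format common in RSS 2.0.
    return parsedate_to_datetime(item['published']).timestamp()


sample_entry = {'published': 'Fri, 02 Jan 1970 00:00:00 +0000'}  # made-up entry
print(published_to_timestamp(sample_entry))
```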
- version = '0.7.0'¶
- class perceval.backends.core.rss.RSSClient(url, archive=None, from_archive=False, ssl_verify=True)[source]¶
Bases:
perceval.client.HttpClient
RSS API client.
This class implements a simple client to retrieve entries from projects in a RSS node.
- Parameters
url – URL of rss node: https://item.opnfv.org/ci
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification
- Raises
HTTPError – when an error occurs doing the request
- class perceval.backends.core.rss.RSSCommand(*args, debug=False)[source]¶
Bases:
perceval.backend.BackendCommand
Class to run RSS backend from the command line.
- BACKEND¶
alias of
perceval.backends.core.rss.RSS
perceval.backends.core.slack module¶
- class perceval.backends.core.slack.Slack(channel, api_token, max_items=1000, tag=None, archive=None, ssl_verify=True)[source]¶
Bases:
perceval.backend.Backend
Slack backend.
This class retrieves the messages sent to a Slack channel. To access the server an API token is required, which must have enough permissions to read from the given channel.
The origin of the data will be set to the SLACK_URL plus the identifier of the channel; e.g. ‘https://slack.com/C01234ABC’.
- Parameters
channel – identifier of the channel where data will be fetched
api_token – token or key needed to use the API
max_items – maximum number of messages requested in the same query
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification
- CATEGORIES = ['message']¶
A list of categories that can be fetched by this backend.
Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- EXTRA_SEARCH_FIELDS = {'channel_id': ['channel_info', 'id'], 'channel_name': ['channel_info', 'name']}¶
A set of search fields to simplify query operations.
The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:
{‘key-1’: value-1, ‘key-2’: value-2, ‘key-3’: value-3}
These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:
{‘project_id’: [‘fields’, ‘project’, ‘id’], ‘project_key’: [‘fields’, ‘project’, ‘key’], ‘project_name’: [‘fields’, ‘project’, ‘name’]}
Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.
- fetch(category='message', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶
Fetch the messages from the channel.
This method fetches the messages stored on the channel that were sent since the given date.
- Parameters
category – the category of items to fetch
from_date – obtain messages sent since this date
- Returns
a generator of messages
- fetch_items(category, **kwargs)[source]¶
Fetch the messages
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_archiving()[source]¶
Returns whether it supports archiving items during the fetch process.
- Returns
this backend supports items archive
- classmethod has_resuming()[source]¶
Returns whether it supports resuming the fetch process.
- Returns
this backend does not support items resuming
- static metadata_category(item)[source]¶
Extracts the category from a Slack item.
This backend only generates one type of item which is ‘message’.
- static metadata_id(item)[source]¶
Extracts the identifier from a Slack item.
This identifier will be the mix of two fields because Slack messages do not have any unique identifier. In this case, ‘ts’ and ‘user’ values (or ‘bot_id’ when the message is sent by a bot) are combined because there have been cases where two messages were sent by different users at the same time.
In the case where neither the ‘user’ nor the ‘bot_id’ attribute is present (e.g., bot deleted), the fallback option is to generate the identifier using the ‘ts’ and ‘username’ values.
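The fallback order described above can be sketched as follows. Joining the two values by plain concatenation is an assumption made for this example, as are the function name and the sample messages.

```python
def message_identifier(item):
    """Build an identifier from 'ts' plus the best available sender field."""
    # Preference order follows the description: user, then bot_id, then username.
    for sender_field in ('user', 'bot_id', 'username'):
        if sender_field in item:
            return item['ts'] + item[sender_field]
    raise ValueError('message has no sender field')


user_message = {'ts': '1486969200.000136', 'user': 'U0001'}
bot_message = {'ts': '1486969200.000136', 'bot_id': 'B0001'}
print(message_identifier(user_message))
print(message_identifier(bot_message))
```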
- static metadata_updated_on(item)[source]¶
Extracts and converts the update time from a Slack item.
The timestamp is extracted from ‘ts’ field and converted to a UNIX timestamp.
- Parameters
item – item generated by the backend
- Returns
a UNIX timestamp
- static parse_channel_info(raw_channel_info)[source]¶
Parse a channel info JSON stream.
This method parses a JSON stream, containing the information from a channel, and returns a dict with the parsed data.
- Parameters
raw_channel_info – JSON string to parse
- Returns
a dict with the parsed information about a channel
- static parse_history(raw_history)[source]¶
Parse a channel history JSON stream.
This method parses a JSON stream, containing the history of a channel, and returns a list with the parsed data. It also returns whether there are more messages that are not included in this stream.
- Parameters
raw_history – JSON string to parse
- Returns
a tuple with a list of dicts with the parsed messages and ‘has_more’ value
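The ‘has_more’ value lends itself to a paging loop. The sketch below assumes each page is a JSON object with ‘messages’ (newest first) and ‘has_more’ keys; `fetch_page` is a hypothetical callable standing in for the real client, and the two canned pages are made up.

```python
import json


def collect_history(fetch_page):
    """Gather messages page by page until 'has_more' is false.

    fetch_page is a hypothetical callable taking the 'latest' timestamp
    (or None for the first page) and returning a raw JSON string.
    """
    messages = []
    latest = None
    while True:
        page = json.loads(fetch_page(latest))
        messages.extend(page['messages'])
        if not page['has_more'] or not page['messages']:
            return messages
        # Continue from the oldest message seen so far.
        latest = page['messages'][-1]['ts']


# Two canned pages standing in for a real server.
PAGES = {
    None: '{"messages": [{"ts": "30"}, {"ts": "20"}], "has_more": true}',
    '20': '{"messages": [{"ts": "10"}], "has_more": false}',
}

all_messages = collect_history(lambda latest: PAGES[latest])
print([m['ts'] for m in all_messages])
```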
- static parse_user(raw_user)[source]¶
Parse a user’s info JSON stream.
This method parses a JSON stream, containing the information from a user, and returns a dict with the parsed data.
- Parameters
raw_user – JSON string to parse
- Returns
a dict with the parsed user’s information
- version = '0.10.0'¶
- class perceval.backends.core.slack.SlackClient(api_token, max_items=1000, archive=None, from_archive=False, ssl_verify=True)[source]¶
Bases:
perceval.client.HttpClient
Slack API client.
Client for fetching information from the Slack server using its REST API.
- Parameters
api_token – key needed to use the API
max_items – maximum number of items per request
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification
- AUTHORIZATION_HEADER = 'Authorization'¶
- PCHANNEL = 'channel'¶
- PCOUNT = 'count'¶
- PLATEST = 'latest'¶
- POLDEST = 'oldest'¶
- PTOKEN = 'token'¶
- PUSER = 'user'¶
- RCONVERSATION_HISTORY = 'conversations.history'¶
- RCONVERSATION_INFO = 'conversations.info'¶
- RCONVERSATION_MEMBERS = 'conversations.members'¶
- RUSER_INFO = 'users.info'¶
- URL = 'https://slack.com/api/%(resource)s'¶
- conversation_members(conversation)[source]¶
Fetch the number of members in a conversation. A conversation is a supertype that covers public and private channels, DMs and group DMs.
- Parameters
conversation – the ID of the conversation
- static sanitize_for_archive(url, headers, payload)[source]¶
Sanitize the payload of an HTTP request by removing the token information before storing/retrieving archived items.
- Parameters
url – HTTP url request
headers – HTTP headers request
payload – HTTP payload request
- Returns
url, headers and the sanitized payload
- exception perceval.backends.core.slack.SlackClientError(**kwargs)[source]¶
Bases:
perceval.errors.BaseError
Raised when an error occurs using the Slack client.
- message = '%(error)s'¶
- class perceval.backends.core.slack.SlackCommand(*args, debug=False)[source]¶
Bases:
perceval.backend.BackendCommand
Class to run Slack backend from the command line.
- BACKEND¶
alias of
perceval.backends.core.slack.Slack
perceval.backends.core.stackexchange module¶
- class perceval.backends.core.stackexchange.StackExchange(site, tagged=None, api_token=None, access_token=None, max_questions=100, tag=None, archive=None, ssl_verify=True)[source]¶
Bases:
perceval.backend.Backend
StackExchange backend for Perceval.
This class retrieves the questions stored in any of the StackExchange sites. To initialize this class the site must be provided.
- Parameters
site – StackExchange site
tagged – filter items by question Tag
api_token – StackExchange application key for the API
access_token – StackExchange user access_token for the API
max_questions – max of questions per page retrieved
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification
- CATEGORIES = ['question']¶
A list of categories that can be fetched by this backend.
Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- EXTRA_SEARCH_FIELDS = {'tags': ['tags']}¶
A set of search fields to simplify query operations.
The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:
{‘key-1’: value-1, ‘key-2’: value-2, ‘key-3’: value-3}
These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:
{‘project_id’: [‘fields’, ‘project’, ‘id’], ‘project_key’: [‘fields’, ‘project’, ‘key’], ‘project_name’: [‘fields’, ‘project’, ‘name’]}
Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.
- fetch(category='question', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶
Fetch the questions from the site.
The method retrieves, from a StackExchange site, the questions updated since the given date.
- Parameters
from_date – obtain questions updated since this date
- Returns
a generator of questions
- fetch_items(category, **kwargs)[source]¶
Fetch the questions
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_archiving()[source]¶
Returns whether it supports archiving items during the fetch process.
- Returns
this backend supports items archive
- classmethod has_resuming()[source]¶
Returns whether it supports resuming the fetch process.
- Returns
this backend supports items resuming
- static metadata_category(item)[source]¶
Extracts the category from a StackExchange item.
This backend only generates one type of item which is ‘question’.
- static metadata_updated_on(item)[source]¶
Extracts the update time from a StackExchange item.
The timestamp is extracted from ‘last_activity_date’ field. This date is a UNIX timestamp but needs to be converted to a float value.
- Parameters
item – item generated by the backend
- Returns
a UNIX timestamp
- static parse_questions(raw_page)[source]¶
Parse a StackExchange API raw response.
The method parses the API response retrieving the questions from the received items
- Parameters
raw_page – raw API response from which to parse the questions
- Returns
a generator of questions
- version = '0.12.1'¶
- class perceval.backends.core.stackexchange.StackExchangeClient(site, tagged, token, access_token=None, max_questions=100, archive=None, from_archive=False, ssl_verify=True)[source]¶
Bases:
perceval.client.HttpClient
StackExchange API client.
This class implements a simple client to retrieve questions from any StackExchange site.
- Parameters
site – StackExchange site
tagged – filter items by question Tag
token – StackExchange application key for the API
access_token – StackExchange user access token for the API
max_questions – max number of questions per query
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification
- Raises
HTTPError – when an error occurs doing the request
- PACCESSTOKEN = 'access_token'¶
- PFILTER = 'filter'¶
- PKEY = 'key'¶
- PMIN = 'min'¶
- PORDER = 'order'¶
- PPAGE = 'page'¶
- PPAGESIZE = 'pagesize'¶
- PSITE = 'site'¶
- PSORT = 'sort'¶
- PTAGGED = 'tagged'¶
- RQUESTIONS = 'questions'¶
- STACKEXCHANGE_API_URL = 'https://api.stackexchange.com'¶
- VERSION_API = '2.2'¶
- VQUESTIONS_FILTER = 'Bf*y*ByQD_upZqozgU6lXL_62USGOoV3)MFNgiHqHpmO_Y-jHR'¶
- get_questions(from_date)[source]¶
Retrieve all the questions from a given date.
- Parameters
from_date – obtain questions updated since this date
- static sanitize_for_archive(url, headers, payload)[source]¶
Sanitize the payload of an HTTP request by removing the token information before storing/retrieving archived items.
- Parameters
url – HTTP url request
headers – HTTP headers request
payload – HTTP payload request
- Returns
url, headers and the sanitized payload
- class perceval.backends.core.stackexchange.StackExchangeCommand(*args, debug=False)[source]¶
Bases:
perceval.backend.BackendCommand
Class to run StackExchange backend from the command line.
- BACKEND¶
perceval.backends.core.supybot module¶
- class perceval.backends.core.supybot.Supybot(uri, dirpath, tag=None, archive=None)[source]¶
Bases:
perceval.backend.Backend
Supybot IRC log backend.
This class fetches the messages stored by Supybot in log files. Initialize this class providing the directory where those IRC log files are stored.
The log filenames expected by this backend should follow the pattern #channel_YYYY-MM-DD.log (e.g. #grimoirelab_2016-06-27.log). This is needed to determine the date when messages were sent. Other filenames might work too, but the behaviour is unknown.
The format of the messages must also follow a pattern. These patterns can be found in the SupybotParser class documentation.
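To make the filename convention concrete, the sketch below pulls the date out of a name such as #grimoirelab_2016-06-27.log. The regular expression and the helper `log_date` are hypothetical; the backend's own matching rules are not shown in this documentation.

```python
import re
from datetime import datetime

# Hypothetical pattern for names such as '#grimoirelab_2016-06-27.log'.
FILENAME_REGEX = re.compile(r'^#?(?P<channel>.+)_(?P<date>\d{4}-\d{2}-\d{2})\.log$')


def log_date(filename):
    """Return the date encoded in a log filename, or None if it does not match."""
    match = FILENAME_REGEX.match(filename)
    if match is None:
        return None
    return datetime.strptime(match.group('date'), '%Y-%m-%d').date()


print(log_date('#grimoirelab_2016-06-27.log'))
```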
- Parameters
uri – URI of the IRC archives; typically, the URL of their IRC channel
dirpath – directory path where the archives are stored
tag – label used to mark the data
archive – archive to store/retrieve items
- CATEGORIES = ['message']¶
A list of categories that can be fetched by this backend.
Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- fetch(category='message', from_date=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=tzutc()))[source]¶
Fetch the messages from the Supybot IRC logger.
The method parses and returns the messages saved in the IRC log files and stored by Supybot in dirpath.
- Parameters
category – the category of items to fetch
from_date – obtain messages since this date
- Returns
a generator of messages
- fetch_items(category, **kwargs)[source]¶
Fetch the messages
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_archiving()[source]¶
Returns whether it supports archiving items during the fetch process.
- Returns
this backend does not support items archive
- classmethod has_resuming()[source]¶
Returns whether it supports resuming the fetch process.
- Returns
this backend supports items resuming
- static metadata_category(item)[source]¶
Extracts the category from a Supybot item.
This backend only generates one type of item which is ‘message’.
- static metadata_id(item)[source]¶
Extracts the identifier from a Supybot item.
This identifier will be the mix of three fields because IRC messages do not have any unique identifier. In this case, ‘timestamp’, ‘nick’ and ‘body’ values are combined because there have been cases where two messages were sent by the same user at the same time.
- static metadata_updated_on(item)[source]¶
Extracts the update time from a Supybot item.
The timestamp used is extracted from ‘timestamp’ field. This date is converted to UNIX timestamp format taking into account the timezone of the date.
- Parameters
item – item generated by the backend
- Returns
a UNIX timestamp
- static parse_supybot_log(filepath)[source]¶
Parse a Supybot IRC log file.
The method parses the Supybot IRC log file and returns an iterator of dictionaries. Each of these contains a message from the file.
- Parameters
filepath – path to the IRC log file
- Returns
a generator of parsed messages
- Raises
ParseError – raised when the format of the Supybot log file is invalid
OSError – raised when an error occurs reading the given file
- version = '0.10.0'¶
- class perceval.backends.core.supybot.SupybotCommand(*args, debug=False)[source]¶
Bases:
perceval.backend.BackendCommand
Class to run Supybot backend from the command line.
- BACKEND¶
- class perceval.backends.core.supybot.SupybotParser(stream)[source]¶
Bases:
object
Supybot IRC parser.
This class parses a Supybot IRC log stream, converting plain log lines (or messages) into dict items. Each dictionary will contain the date of the message, the type of message (comment or server message), the nick of the sender and its body.
Each line on a log starts with a date in ISO format including its timezone and it is followed by two spaces and by a message.
There are two types of valid messages in a Supybot log: comment messages and server messages. A comment message follows either of these two patterns:
2016-06-27T12:00:00+0000  <nick> body of the message
2016-06-27T12:00:00+0000  * nick waves hello
A valid server message has the following pattern:
2016-06-27T12:00:00+0000  *** nick is known as new_nick
An exception is raised when any of the lines does not follow any of the above formats.
- Parameters
stream – an iterator which produces Supybot log lines
- BOT_PATTERN = '^-(?P<nick>(.*?)(!.*)?)-\\s(?P<body>.+)$'¶
- COMMENT_ACTION_PATTERN = '^\\*\\s?(?P<body>(?P<nick>([^\\s\\*]+?)(!.*)?)\\s.+)$'¶
- COMMENT_PATTERN = '^<(?P<nick>(.*?)(!.*)?)>\\s(?P<body>.+)$'¶
- EMPTY_BOT_PATTERN = '^-(.*?)(!.*)?-\\s*$'¶
- EMPTY_COMMENT_ACTION_PATTERN = '^\\*\\s?([^\\s\\*]+?)(!.*)?\\s*$'¶
- EMPTY_COMMENT_PATTERN = '^<(.*?)(!.*)?>\\s*$'¶
- EMPTY_PATTERN = '^\\s*$'¶
- SERVER_PATTERN = '^\\*\\*\\*\\s(?P<body>(?P<nick>(.*?)(!.*)?)\\s.+)$'¶
- SUPYBOT_BOT_REGEX = re.compile('^-(?P<nick>(.*?)(!.*)?)-\\s(?P<body>.+)$', re.VERBOSE)¶
- SUPYBOT_COMMENT_ACTION_REGEX = re.compile('^\\*\\s?(?P<body>(?P<nick>([^\\s\\*]+?)(!.*)?)\\s.+)$', re.VERBOSE)¶
- SUPYBOT_COMMENT_REGEX = re.compile('^<(?P<nick>(.*?)(!.*)?)>\\s(?P<body>.+)$', re.VERBOSE)¶
- SUPYBOT_EMPTY_BOT_REGEX = re.compile('^-(.*?)(!.*)?-\\s*$', re.VERBOSE)¶
- SUPYBOT_EMPTY_COMMENT_ACTION_REGEX = re.compile('^\\*\\s?([^\\s\\*]+?)(!.*)?\\s*$', re.VERBOSE)¶
- SUPYBOT_EMPTY_COMMENT_REGEX = re.compile('^<(.*?)(!.*)?>\\s*$', re.VERBOSE)¶
- SUPYBOT_EMPTY_REGEX = re.compile('^\\s*$', re.VERBOSE)¶
- SUPYBOT_SERVER_REGEX = re.compile('^\\*\\*\\*\\s(?P<body>(?P<nick>(.*?)(!.*)?)\\s.+)$', re.VERBOSE)¶
- SUPYBOT_TIMESTAMP_REGEX = re.compile('^(?P<ts>\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}[\\+\\-]?\\d{0,4})\\s\\s\n (?P<msg>.+)$\n ', re.VERBOSE)¶
- TCOMMENT = 'comment'¶
- TIMESTAMP_PATTERN = '^(?P<ts>\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}[\\+\\-]?\\d{0,4})\\s\\s\n (?P<msg>.+)$\n '¶
- TSERVER = 'server'¶
- parse()[source]¶
Parse a Supybot IRC stream.
Returns an iterator of dicts. Each dict contains information about the date, type, nick and body of a single log entry.
- Returns
iterator of parsed lines
- Raises
ParseError – when an invalid line is found parsing the given stream
perceval.backends.core.telegram module¶
- class perceval.backends.core.telegram.Telegram(bot, bot_token, tag=None, archive=None, ssl_verify=True)[source]¶
Bases: perceval.backend.Backend
Telegram backend.
The Telegram backend fetches the messages that a Telegram bot can receive. Usually, these messages are direct or private messages but a bot can be configured to receive every message sent to a channel/group where it is subscribed. Take into account that messages are removed from the Telegram server 24 hours after they are sent. Moreover, once they are fetched using an offset, these messages are also removed. This means every time this backend is called, messages will be deleted.
Initialize this class passing the name of the bot and the authentication token used by this bot. The authentication token is provided by Telegram once the bot is created.
The origin of the data will be set to the TELEGRAM_URL plus the name of the bot; for example, ‘http://telegram.org/mybot’.
- Parameters
bot – name of the bot
bot_token – authentication token used by the bot
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification
- CATEGORIES = ['message']¶
A list of categories that can be fetched by this backend.
Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- EXTRA_SEARCH_FIELDS = {'chat_id': ['message', 'chat', 'id'], 'chat_name': ['message', 'chat', 'title']}¶
A set of search fields to simplify query operations.
The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:
{
‘key-1’: value-1, ‘key-2’: value-2, ‘key-3’: value-3
}
These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:
{
‘project_id’: [‘fields’, ‘project’, ‘id’], ‘project_key’: [‘fields’, ‘project’, ‘key’], ‘project_name’: [‘fields’, ‘project’, ‘name’]
}
Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.
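To make the notion of a "path" concrete, here is a minimal sketch of how such paths could be resolved against nested item data. The helper `resolve_search_fields`, the sample update, and the choice to store `None` when a path element is missing are assumptions for illustration, not the backend's actual code.

```python
# EXTRA_SEARCH_FIELDS as declared by the Telegram backend in this section.
EXTRA_SEARCH_FIELDS = {
    "chat_id": ["message", "chat", "id"],
    "chat_name": ["message", "chat", "title"],
}


def resolve_search_fields(data, extra_fields):
    """Hypothetical helper: follow each path through nested dicts."""
    resolved = {}
    for name, path in extra_fields.items():
        value = data
        for key in path:
            # Stop early and store None when a path element is missing.
            if not isinstance(value, dict) or key not in value:
                value = None
                break
            value = value[key]
        resolved[name] = value
    return resolved


# Made-up item data, shaped like a Telegram update.
update = {"update_id": 1, "message": {"chat": {"id": -42, "title": "dev"}}}
fields = resolve_search_fields(update, EXTRA_SEARCH_FIELDS)
```

Here `fields` maps `chat_id` to the value found at `update['message']['chat']['id']` and `chat_name` to the chat title.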
- fetch(category='message', offset=1, chats=None)[source]¶
Fetch the messages the bot can read from the server.
The method retrieves, from the Telegram server, the messages sent with an offset equal to or greater than the given one.
A list of chats, groups and channels identifiers can be set using the parameter chats. When it is set, only those messages sent to any of these will be returned. An empty list will return no messages.
- Parameters
category – the category of items to fetch
offset – obtain messages from this offset
chats – list of chat names used to filter messages
- Returns
a generator of messages
- Raises
ValueError – when chats is an empty list
- fetch_items(category, **kwargs)[source]¶
Fetch the messages
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_archiving()[source]¶
Returns whether the backend supports archiving items during the fetch process.
- Returns
this backend supports items archive
- classmethod has_resuming()[source]¶
Returns whether the backend supports resuming the fetch process.
- Returns
this backend supports items resuming
- metadata(item, filter_classified=False)[source]¶
Telegram metadata.
The method takes an item and overrides the metadata information to add extra information related to Telegram.
Currently, it adds the ‘offset’ keyword.
- Parameters
item – an item fetched by a backend
filter_classified – sets if classified fields were filtered
- static metadata_category(item)[source]¶
Extracts the category from a Telegram item.
This backend only generates one type of item which is ‘message’.
- static metadata_updated_on(item)[source]¶
Extracts and converts the update time from a Telegram item.
The timestamp is extracted from ‘date’ field that is inside of ‘message’ dict. This date is converted to UNIX timestamp format.
- Parameters
item – item generated by the backend
- Returns
a UNIX timestamp
- static parse_messages(raw_json)[source]¶
Parse a Telegram JSON messages list.
The method parses the JSON stream and returns an iterator of dictionaries. Each of them contains a Telegram message.
- Parameters
raw_json – JSON string to parse
- Returns
a generator of parsed messages
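A minimal sketch of this parsing step, assuming the raw JSON has the shape returned by the Bot API's getUpdates method, where the updates are listed under a `result` key. The sample payload is made up, and the function body is an illustration rather than the backend's code.

```python
import json


def parse_messages(raw_json):
    """Sketch: yield each update found under the 'result' key."""
    payload = json.loads(raw_json)
    for message in payload["result"]:
        yield message


# Made-up response body with a single update.
raw = '{"ok": true, "result": [{"update_id": 1, "message": {"text": "hi"}}]}'
messages = list(parse_messages(raw))
```

Because the function is a generator, nothing is parsed until the caller starts iterating.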
- version = '0.11.1'¶
- class perceval.backends.core.telegram.TelegramBotClient(bot_token, archive=None, from_archive=False, ssl_verify=True)[source]¶
Bases: perceval.client.HttpClient
Telegram Bot API 2.0 client.
This class implements a simple client to retrieve those messages sent to a Telegram bot. This includes personal messages or messages sent to a channel (when privacy settings are disabled).
- Parameters
bot_token – token for the bot
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification
- API_URL = 'https://api.telegram.org/bot%(token)s/%(method)s'¶
- OFFSET = 'offset'¶
- UPDATES_METHOD = 'getUpdates'¶
- static sanitize_for_archive(url, headers, payload)[source]¶
Sanitize URL of a HTTP request by removing the token information before storing/retrieving archived items
- Parameters
url – HTTP url request
headers – HTTP headers request
payload – HTTP payload request
- Returns
the sanitized url, plus the headers and payload
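One way such sanitizing could work is sketched below: since the token is embedded in the URL path (see API_URL), a regular expression can mask it. The placeholder `botXXXXX`, the regular expression, and the sample token are assumptions for illustration.

```python
import re


def sanitize_for_archive(url, headers, payload):
    """Sketch: mask the bot token embedded in the request URL."""
    # API_URL has the form https://api.telegram.org/bot<token>/<method>
    url = re.sub(r"/bot[^/]+/", "/botXXXXX/", url)
    return url, headers, payload


# Made-up token, used only to show the substitution.
url = "https://api.telegram.org/bot12345:secret/getUpdates"
clean_url, _, _ = sanitize_for_archive(url, {}, {"offset": 1})
```

Headers and payload are returned untouched in this sketch, matching the description above.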
- updates(offset=None)[source]¶
Fetch the messages that a bot can read.
When the offset is given, it will retrieve all the messages whose offset is greater than or equal to it. Take into account that, due to how the API works, all previous messages will be removed from the server.
- Parameters
offset – fetch the messages starting on this offset
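The class constants suggest how a request for updates could be assembled. The sketch below builds the URL and query parameters from API_URL, UPDATES_METHOD and OFFSET; the helper `build_updates_request` is hypothetical, and no HTTP request is sent.

```python
# Constants copied from the TelegramBotClient attributes in this section.
API_URL = "https://api.telegram.org/bot%(token)s/%(method)s"
OFFSET = "offset"
UPDATES_METHOD = "getUpdates"


def build_updates_request(bot_token, offset=None):
    """Hypothetical helper: return the URL and query parameters."""
    url = API_URL % {"token": bot_token, "method": UPDATES_METHOD}
    params = {}
    if offset is not None:
        # Only send the offset when the caller provided one.
        params[OFFSET] = offset
    return url, params


url, params = build_updates_request("TOKEN", offset=10)
```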
- class perceval.backends.core.telegram.TelegramCommand(*args, debug=False)[source]¶
Bases: perceval.backend.BackendCommand
Class to run the Telegram backend from the command line.
- BACKEND¶
perceval.backends.core.twitter module¶
- class perceval.backends.core.twitter.Twitter(query, api_token, max_items=100, sleep_for_rate=False, min_rate_to_sleep=1, sleep_time=30, tag=None, archive=None, ssl_verify=True)[source]¶
Bases: perceval.backend.Backend
Twitter backend.
This class fetches samples of tweets containing specific keywords. Initialize it by passing the API token needed for authentication with the parameter api_token.
- Parameters
query – query to fetch tweets
api_token – token or key needed to use the API
max_items – maximum number of items requested on the same query
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to sleep until it will be reset
sleep_time – time (in seconds) to sleep in case of connection problems
tag – label used to mark the data
archive – archive to store/retrieve items
ssl_verify – enable/disable SSL verification
- CATEGORIES = ['tweet']¶
A list of categories that can be fetched by this backend.
Every backend is able to produce items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- fetch(category='tweet', since_id=None, max_id=None, geocode=None, lang=None, include_entities=True, tweets_type='mixed')[source]¶
Fetch the tweets from the server.
This method fetches tweets published in the last seven days from the Twitter Search API.
- Parameters
category – the category of items to fetch
since_id – if not null, it returns results with an ID greater than the specified ID
max_id – if not null, it returns results with an ID less than the specified ID
geocode – if enabled, returns tweets by users located at latitude,longitude,”mi”|”km”
lang – if enabled, restricts tweets to the given language, given by an ISO 639-1 code
include_entities – if disabled, it excludes entities node
tweets_type – type of tweets returned. Default is “mixed”, others are “recent” and “popular”
- Returns
a generator of tweets
- fetch_items(category, **kwargs)[source]¶
Fetch the tweets
- Parameters
category – the category of items to fetch
kwargs – backend arguments
- Returns
a generator of items
- classmethod has_archiving()[source]¶
Returns whether the backend supports archiving items during the fetch process.
- Returns
this backend supports items archive
- classmethod has_resuming()[source]¶
Returns whether the backend supports resuming the fetch process.
- Returns
this backend supports items resuming
- static metadata_category(item)[source]¶
Extracts the category from a Twitter item.
This backend only generates one type of item which is ‘tweet’.
- static metadata_updated_on(item)[source]¶
Extracts and converts the update time from a Twitter item.
The timestamp is extracted from ‘created_at’ field and converted to a UNIX timestamp.
- Parameters
item – item generated by the backend
- Returns
a UNIX timestamp
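As an illustration of this conversion, the sketch below parses a ‘created_at’ value with the standard library. It assumes the date uses the Twitter API v1.1 layout (for example ‘Wed Aug 27 13:08:45 +0000 2008’); the helper name `created_at_to_unix` is hypothetical, and `datetime.strptime` is a stand-in for whatever date parser the backend uses.

```python
from datetime import datetime, timezone


def created_at_to_unix(item):
    """Hypothetical helper: convert 'created_at' to a UNIX timestamp."""
    # Twitter API v1.1 dates look like 'Wed Aug 27 13:08:45 +0000 2008'.
    dt = datetime.strptime(item["created_at"], "%a %b %d %H:%M:%S %z %Y")
    return dt.timestamp()


# Made-up tweet containing only the field used here.
tweet = {"created_at": "Wed Aug 27 13:08:45 +0000 2008"}
ts = created_at_to_unix(tweet)

# Converting back to an aware datetime is a quick sanity check.
roundtrip = datetime.fromtimestamp(ts, tz=timezone.utc)
```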
- search_fields(item)[source]¶
Add search fields to an item.
It adds the values of metadata_id plus the hashtags of a tweet.
- Parameters
item – the item to extract the search fields values
- Returns
a dict of search fields
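A rough sketch of what such a dict could look like is given below. It assumes the tweet follows the Twitter API v1.1 layout, with hashtags listed under `entities.hashtags` as objects carrying a `text` key. Using `id_str` as the item id and `hashtags` as the name of the extra key are assumptions for the example.

```python
def search_fields(item):
    """Sketch: collect the item id and the hashtag texts of a tweet."""
    entities = item.get("entities", {})
    hashtags = [tag["text"] for tag in entities.get("hashtags", [])]
    return {"item_id": item["id_str"], "hashtags": hashtags}


# Made-up tweet containing only the fields used here.
tweet = {
    "id_str": "1005148958111555584",
    "entities": {"hashtags": [{"text": "opensource"}, {"text": "grimoirelab"}]},
}
fields = search_fields(tweet)
```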
- version = '0.4.0'¶
- class perceval.backends.core.twitter.TwitterClient(api_key, max_items=100, sleep_for_rate=False, min_rate_to_sleep=1, sleep_time=30, archive=None, from_archive=False, ssl_verify=True)[source]¶
Bases: perceval.client.HttpClient, perceval.client.RateLimitHandler
Twitter API client.
Client for fetching information from the Twitter server using its REST API v1.1.
- Parameters
api_key – key needed to use the API
max_items – maximum number of items per request
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to sleep until it will be reset
sleep_time – time (in seconds) to sleep in case of connection problems
archive – an archive to store/read fetched data
from_archive – it tells whether to write/read the archive
ssl_verify – enable/disable SSL verification
- HAUTHORIZATION = 'Authorization'¶
- PCOUNT = 'count'¶
- PGEOCODE = 'geocode'¶
- PINCLUDE_ENTITIES = 'include_entities'¶
- PLANG = 'lang'¶
- PMAX_ID = 'max_id'¶
- PQUERY = 'q'¶
- PRESULT_TYPE = 'result_type'¶
- PSINCE_ID = 'since_id'¶
- calculate_time_to_reset()[source]¶
Number of seconds to wait. They are contained in the rate limit reset header
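A minimal sketch of this calculation, assuming the reset header (x-rate-limit-reset in the Twitter API v1.1) carries the reset moment as epoch seconds. The helper `seconds_to_reset` and its `now` parameter are hypothetical; the latter exists only to make the example deterministic.

```python
import time


def seconds_to_reset(reset_epoch, now=None):
    """Hypothetical helper: seconds left until the rate limit resets."""
    now = time.time() if now is None else now
    # Header values arrive as strings; never return a negative wait.
    return max(0, int(reset_epoch) - int(now))


wait = seconds_to_reset("1500000060", now=1500000000)
```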
- static sanitize_for_archive(url, headers, payload)[source]¶
Sanitize payload of a HTTP request by removing the token information before storing/retrieving archived items
- Parameters
url – HTTP url request
headers – HTTP headers request
payload – HTTP payload request
- Returns
url, headers and the sanitized payload
- tweets(query, since_id=None, max_id=None, geocode=None, lang=None, include_entities=True, result_type='mixed')[source]¶
Fetch tweets for a given query between since_id and max_id.
- Parameters
query – query to fetch tweets
since_id – if not null, it returns results with an ID greater than the specified ID
max_id – if not null, it returns results with an ID less than the specified ID
geocode – if enabled, returns tweets by users located at latitude,longitude,”mi”|”km”
lang – if enabled, restricts tweets to the given language, given by an ISO 639-1 code
include_entities – if disabled, it excludes entities node
result_type – type of tweets returned. Default is “mixed”, others are “recent” and “popular”
- Returns
a generator of tweets
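The parameter constants listed for this class hint at how the query string could be assembled. The sketch below maps the arguments of tweets() onto those names and leaves out the optional ones that are not set. The helper `build_search_params` is hypothetical, and no request is sent.

```python
# Parameter names copied from the TwitterClient attributes in this section.
PCOUNT = "count"
PGEOCODE = "geocode"
PINCLUDE_ENTITIES = "include_entities"
PLANG = "lang"
PMAX_ID = "max_id"
PQUERY = "q"
PRESULT_TYPE = "result_type"
PSINCE_ID = "since_id"


def build_search_params(query, max_items=100, since_id=None, max_id=None,
                        geocode=None, lang=None, include_entities=True,
                        result_type="mixed"):
    """Hypothetical helper: build the query parameters of a search."""
    params = {
        PQUERY: query,
        PCOUNT: max_items,
        PINCLUDE_ENTITIES: include_entities,
        PRESULT_TYPE: result_type,
    }
    optional = {PSINCE_ID: since_id, PMAX_ID: max_id,
                PGEOCODE: geocode, PLANG: lang}
    # Skip optional filters the caller did not set.
    params.update({k: v for k, v in optional.items() if v is not None})
    return params


params = build_search_params("grimoirelab", since_id=100, lang="en")
```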
- class perceval.backends.core.twitter.TwitterCommand(*args, debug=False)[source]¶
Bases: perceval.backend.BackendCommand
Class to run the Twitter backend from the command line.
- BACKEND¶