perceval package¶
Subpackages¶
- perceval.backends package
- Subpackages
- perceval.backends.core package
- Submodules
- perceval.backends.core.askbot module
- perceval.backends.core.bugzilla module
- perceval.backends.core.bugzillarest module
- perceval.backends.core.confluence module
- perceval.backends.core.discourse module
- perceval.backends.core.dockerhub module
- perceval.backends.core.gerrit module
- perceval.backends.core.git module
- perceval.backends.core.github module
- perceval.backends.core.githubql module
- perceval.backends.core.gitlab module
- perceval.backends.core.gitter module
- perceval.backends.core.googlehits module
- perceval.backends.core.groupsio module
- perceval.backends.core.hyperkitty module
- perceval.backends.core.jenkins module
- perceval.backends.core.jira module
- perceval.backends.core.launchpad module
- perceval.backends.core.mattermost module
- perceval.backends.core.mbox module
- perceval.backends.core.mediawiki module
- perceval.backends.core.meetup module
- perceval.backends.core.nntp module
- perceval.backends.core.pagure module
- perceval.backends.core.phabricator module
- perceval.backends.core.pipermail module
- perceval.backends.core.redmine module
- perceval.backends.core.rocketchat module
- perceval.backends.core.rss module
- perceval.backends.core.slack module
- perceval.backends.core.stackexchange module
- perceval.backends.core.supybot module
- perceval.backends.core.telegram module
- perceval.backends.core.twitter module
- Module contents
- perceval.backends.core package
- Module contents
- Subpackages
Submodules¶
perceval.archive module¶
- class perceval.archive.Archive(archive_path)[source]¶
Bases: object
Basic class for archiving raw items fetched by Perceval.
This class archives raw items, usually HTML pages or JSON documents, for later recovery. These raw items are fetched, stored and retrieved by a backend.
Each stored item will have a hash code used as unique identifier. Hash codes are generated using URIs and other parameters needed to fetch raw items.
When an instance of Archive is initialized, it expects to access an existing archive file. To create a new and empty archive, use the create class method instead. After creating a new archive, its metadata must be initialized by calling the init_metadata method.
- Parameters
archive_path – path where this archive is stored
- Raises
ArchiveError – when the archive does not exist or is invalid
- ARCHIVE_CREATE_STMT = 'CREATE TABLE archive ( id INTEGER PRIMARY KEY AUTOINCREMENT, hashcode VARCHAR(256) UNIQUE NOT NULL, uri TEXT, payload BLOB, headers BLOB, data BLOB)'¶
- ARCHIVE_TABLE = 'archive'¶
- METADATA_CREATE_STMT = 'CREATE TABLE metadata ( origin TEXT, backend_name TEXT, backend_version TEXT, category TEXT, backend_params BLOB, created_on TEXT)'¶
- METADATA_TABLE = 'metadata'¶
- classmethod create(archive_path)[source]¶
Create a brand new archive.
Call this method to create a new and empty archive. It will initialize the storage file in the path defined by archive_path.
- Parameters
archive_path – absolute path where the archive file will be created
- Raises
ArchiveError – when the archive file already exists
- init_metadata(origin, backend_name, backend_version, category, backend_params)[source]¶
Init metadata information.
Metadata is composed of the basic information needed to identify where archived data came from and how it can be retrieved and built into Perceval items.
- Parameters
origin – identifier of the repository
backend_name – name of the backend
backend_version – version of the backend
category – category of the items fetched
backend_params – dict representation of the fetch parameters
- Raises
ArchiveError – when an error occurs initializing the metadata
- static make_hashcode(uri, payload, headers)[source]¶
Generate a SHA1 based on the given arguments.
Hashcodes created by this method will be used as unique identifiers for the raw items or resources stored by this archive.
- Parameters
uri – URI to the resource
payload – payload of the request needed to fetch the resource
headers – headers of the request needed to fetch the resource
- Returns
a SHA1 hash code
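As a rough illustration of the idea, the sketch below derives a SHA1 hex digest from a URI, a payload and headers using only the standard library. The function name `sketch_hashcode` and the serialization step (sorted-key JSON joined with ':') are assumptions made for this example; the way Perceval actually combines the three arguments may differ.

```python
import hashlib
import json


def sketch_hashcode(uri, payload, headers):
    """Build a SHA1 hex digest from a request triple (illustrative only)."""

    def encode(value):
        # Assumption: dicts are serialized with sorted keys so that the
        # order of the keys does not change the resulting digest.
        return json.dumps(value, sort_keys=True)

    content = ':'.join([uri, encode(payload), encode(headers)])
    return hashlib.sha1(content.encode('utf-8')).hexdigest()


# The same request described with a different key order gives the same code.
code_a = sketch_hashcode('https://example.com/api', {'page': 1, 'q': 'x'}, {'Accept': 'json'})
code_b = sketch_hashcode('https://example.com/api', {'q': 'x', 'page': 1}, {'Accept': 'json'})
code_c = sketch_hashcode('https://example.com/other', {'page': 1, 'q': 'x'}, {'Accept': 'json'})
```

Hashing the full request triple, rather than the URI alone, keeps two requests to the same URL with different payloads or headers from colliding in the archive.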
- retrieve(uri, payload, headers)[source]¶
Retrieve a raw item from the archive.
The method will return the data content corresponding to the hashcode derived from the given parameters.
- Parameters
uri – request URI
payload – request payload
headers – request headers
- Returns
the archived data
- Raises
ArchiveError – when an error occurs retrieving data
- store(uri, payload, headers, data)[source]¶
Store a raw item in this archive.
The method will store data content in this archive. The unique identifier for that item will be generated using the rest of the parameters.
- Parameters
uri – request URI
payload – request payload
headers – request headers
data – data to store in this archive
- Raises
ArchiveError – when an error occurs storing the given data
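The store and retrieve cycle can be sketched with the standard `sqlite3` module and the table layout given in ARCHIVE_CREATE_STMT. Everything else here is an assumption for illustration: the helper names, the use of `pickle` to serialize the BLOB columns, the `repr()`-based hashcode, and the `KeyError` raised on a miss (the real class raises ArchiveError).

```python
import hashlib
import pickle
import sqlite3

# Table layout copied from ARCHIVE_CREATE_STMT.
CREATE_STMT = (
    "CREATE TABLE archive ( "
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "hashcode VARCHAR(256) UNIQUE NOT NULL, "
    "uri TEXT, payload BLOB, headers BLOB, data BLOB)"
)


def hashcode(uri, payload, headers):
    # Assumption: repr() of the arguments is stable enough for this demo.
    raw = ':'.join([uri, repr(payload), repr(headers)])
    return hashlib.sha1(raw.encode('utf-8')).hexdigest()


def store(conn, uri, payload, headers, data):
    """Insert one raw item, keyed by the hashcode of its request."""
    row = (hashcode(uri, payload, headers), uri,
           pickle.dumps(payload), pickle.dumps(headers), pickle.dumps(data))
    conn.execute(
        "INSERT INTO archive (hashcode, uri, payload, headers, data) "
        "VALUES (?, ?, ?, ?, ?)", row)
    conn.commit()


def retrieve(conn, uri, payload, headers):
    """Return the data stored for the given request."""
    cursor = conn.execute("SELECT data FROM archive WHERE hashcode = ?",
                          (hashcode(uri, payload, headers),))
    found = cursor.fetchone()
    if found is None:
        raise KeyError("entry not found in archive")
    return pickle.loads(found[0])


conn = sqlite3.connect(':memory:')
conn.execute(CREATE_STMT)
store(conn, 'https://example.com/issues', {'page': 1}, {}, '{"id": 7}')
restored = retrieve(conn, 'https://example.com/issues', {'page': 1}, {})
```

Because the hashcode column is declared UNIQUE, storing the same request twice fails at the database level instead of silently overwriting the first entry.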
- class perceval.archive.ArchiveManager(dirpath)[source]¶
Bases: object
Manager for handling archives in Perceval.
This class manages the creation, deletion and access of Archive objects. Archives are stored under the dirpath directory, using a random SHA1 for each file. The first byte of the hashcode will be the name of the subdirectory; the remaining bytes, the archive name.
- Parameters
dirpath – path where the archives are stored
- STORAGE_EXT = '.sqlite3'¶
- create_archive()[source]¶
Create a new archive.
The method creates in the filesystem a brand new archive with a random SHA1 as its name. The first byte of the hashcode will be the name of the subdirectory; the remaining bytes, the archive name.
- Returns
a new Archive object
- Raises
ArchiveManagerError – when an error occurs creating the new archive
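The naming scheme described above can be sketched as follows. One byte of a SHA1 hex digest corresponds to two hexadecimal characters, so the first two characters name the subdirectory and the remaining 38 name the file. The helper name `sketch_archive_path` and the way the random SHA1 is seeded are assumptions for this example.

```python
import hashlib
import os
import uuid

STORAGE_EXT = '.sqlite3'


def sketch_archive_path(dirpath, hashcode):
    """Split a 40-character SHA1 hex digest into subdirectory and file name."""
    # One byte is two hexadecimal characters.
    subdir, name = hashcode[:2], hashcode[2:]
    return os.path.join(dirpath, subdir, name + STORAGE_EXT)


# Assumption: any random input is good enough as a seed for the random SHA1.
random_sha1 = hashlib.sha1(uuid.uuid4().bytes).hexdigest()
path = sketch_archive_path('archives', random_sha1)

# A fixed digest makes the layout easy to see.
fixed_path = sketch_archive_path('archives', 'ab' + 'c' * 38)
```

Spreading files over up to 256 subdirectories keeps any single directory from growing too large when many archives are created.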
- remove_archive(archive_path)[source]¶
Remove an archive.
This method deletes from the filesystem the archive stored in archive_path.
- Parameters
archive_path – path to the archive
- Raises
ArchiveManagerError – when an error occurs removing the archive
- search(origin, backend_name, category, archived_after)[source]¶
Search archives.
Get the archives which store data based on the given parameters. These parameters define the origin of the data (origin), how the data was fetched (backend_name) and the data type (category). Only those archives created on or after archived_after will be returned.
The method returns a list with the file paths to those archives. The list is sorted by the date of creation of each archive.
- Parameters
origin – data origin
backend_name – backend used to fetch data
category – type of the items fetched by the backend
archived_after – get archives created on or after this date
- Returns
a list with archive names which match the search criteria
perceval.backend module¶
- class perceval.backend.Backend(origin, tag=None, archive=None, blacklist_ids=None, ssl_verify=True)[source]¶
Bases: object
Abstract class for backends.
Base class to fetch data from a repository. This repository will be named as ‘origin’. During the initialization, an Archive object can be provided for archiving raw data from the repositories.
To avoid a NotImplementedError, derived classes have to implement or define:
- fetch_items(), to retrieve items from the repository
- has_archiving(), whether this backend supports archives
- has_resuming(), whether this backend supports resuming
- metadata_id(), to produce a unique id from an item
- metadata_updated_on(), to find the last time an item was modified
- metadata_category(), to identify the category of an item
- _init_client(), to initialize the backend’s client
- CATEGORIES, defining the set of categories the backend produces
- [Optional] CLASSIFIED_FIELDS, to hide certain fields from results
- [Optional] EXTRA_SEARCH_FIELDS, to add easy access fields to items
- [Optional] ORIGIN_UNIQUE_FIELD, to enable item blacklisting
For more information on the details of implementing these methods, please see the docs on each method.
The fetched items can be tagged using the tag parameter, which is useful to trace data. When it is set to None or to an empty string, the tag will be the same as the origin attribute.
To track which version of the backend was used during the fetching process, this class provides a version attribute that each backend may override.
Each fetch operation generates a summary, available via the property summary. By default, it includes the last UUID generated, number of items fetched, skipped and their sum, plus the min, max and last updated_on times. Furthermore, for backends using offsets, the corresponding summary contains the min and max offsets retrieved. Finally, the summary also includes some extra fields, which can be used by any backend to include fetch-specific information.
Backends also produce a set of search fields, exposed in the search_fields attribute of each item returned by a call to fetch(). These contain the item_id, as well as any number of backend-specific fields.
- Parameters
origin – identifier of the repository
tag – tag items using this label
archive – archive to store/retrieve data
ssl_verify – enable/disable SSL verification
- Raises
ValueError – raised when archive is not an instance of Archive class
- CATEGORIES = []¶
A list of categories that can be fetched by this backend.
Every backend produces items falling into a limited set of categories. The specific categories a backend can fetch are unique to that backend.
The categories defined in this variable (and only the categories defined in this variable) can be passed to fetch() and returned from metadata_category().
Implementing backends can define any category they need, as long as categories are short, descriptive, snake_case strings, such as “commit”, “merge_request”, or “pull_request”.
- CLASSIFIED_FIELDS = []¶
A list of fields that should be considered sensitive or confidential.
Fields listed here will be hidden from fetched items, when this behaviour is requested.
Fields are represented as a list of strings. As items returned are dicts that may contain nested dicts, each entry is a list which stores the “path” or nested dicts keys to the field to remove. For example, [‘my’, ‘classified’, ‘field’] will remove field from item[‘data’][‘my’][‘classified’] dict.
Classified data filtering and archiving are not compatible to prevent data leaks or security issues.
- EXTRA_SEARCH_FIELDS = {}¶
A set of search fields to simplify query operations.
The use of search fields can avoid the manual inspection of the items. The search fields are included with items returned from fetch() in a dict with the following shape:
{
    ‘key-1’: value-1,
    ‘key-2’: value-2,
    ‘key-3’: value-3
}
These fields are added to the item metadata information in the search_fields attribute. By default, search_fields contains the id of the item (‘item_id’: item_id_value), obtained via the method metadata_id. However, each backend can set extra search fields using the dict EXTRA_SEARCH_FIELDS. An example of EXTRA_SEARCH_FIELDS is provided below:
{
    ‘project_id’: [‘fields’, ‘project’, ‘id’],
    ‘project_key’: [‘fields’, ‘project’, ‘key’],
    ‘project_name’: [‘fields’, ‘project’, ‘name’]
}
Each key in the dict is a search field to be included in the item metadata information, while the corresponding value is a list that stores the “path” of the search field value within the item.
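The path lookup described above can be sketched as a small helper that walks each list of keys through the item data. The function name `sketch_search_fields`, the sample data, and the choice to store None when a path cannot be resolved are assumptions for this example, not documented behaviour.

```python
EXTRA_SEARCH_FIELDS = {
    'project_id': ['fields', 'project', 'id'],
    'project_key': ['fields', 'project', 'key'],
    'project_name': ['fields', 'project', 'name'],
}


def sketch_search_fields(item_id, data, extra_search_fields):
    """Build a search_fields dict by following each path inside data."""
    fields = {'item_id': item_id}
    for name, path in extra_search_fields.items():
        value = data
        for key in path:
            # Stop walking as soon as the path leaves the nested dicts.
            value = value.get(key) if isinstance(value, dict) else None
            if value is None:
                break
        # Assumption: unresolved paths are reported as None.
        fields[name] = value
    return fields


data = {'fields': {'project': {'id': '10', 'key': 'PRJ', 'name': 'Project'}}}
fields = sketch_search_fields('1', data, EXTRA_SEARCH_FIELDS)
missing = sketch_search_fields('2', {'fields': {}}, EXTRA_SEARCH_FIELDS)
```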
- ORIGIN_UNIQUE_FIELD = None¶
A field unique to a given origin for items produced by this backend.
If ORIGIN_UNIQUE_FIELD is defined, users can pass a list of blocked values which should not be included in the results, if the field defined here contains them. For example, if ORIGIN_UNIQUE_FIELD were set to post_id, then users could pass a list of post ids that should be excluded from the results.
If set to None, blacklisting will be disabled completely. Otherwise, this should be set to an OriginUniqueField containing the name and data type of the field.
Note: Origin in this context refers to one site, API, or other remote that contains several repositories, each consisting of many items of several categories. For example, for the GitLab backend, an origin would be one GitLab instance, such as gitlab.com or opensource.ieee.org, each of which contains many repositories, which in turn contain items such as issues and merge requests.
To access this field, please prefer origin_unique_field().
- property archive¶
- property categories¶
See CATEGORIES.
- property classified_fields¶
A list of fields to be hidden from results.
Fields are represented as a list of strings, where each string is a period delimited field. For example, ‘attributes.author_info.secret_info’ would hide the secret info of the author in the attributes dict.
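A minimal sketch of how period-delimited field names could be used to strip nested values from an item is shown below. The helper name `sketch_filter_classified`, the sample item, and the decision to work on a deep copy rather than modify the item in place are assumptions for this example.

```python
import copy


def sketch_filter_classified(item, classified_fields):
    """Return a copy of item with classified fields removed from item['data']."""
    filtered = copy.deepcopy(item)
    for field in classified_fields:
        # 'a.b.c' becomes parents ['a', 'b'] and leaf 'c'.
        *parents, leaf = field.split('.')
        node = filtered['data']
        for key in parents:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        # Fields that are not present are silently ignored.
        if isinstance(node, dict):
            node.pop(leaf, None)
    return filtered


item = {'data': {'attributes': {'author_info': {'name': 'jdoe', 'secret_info': 'x'}}}}
filtered = sketch_filter_classified(item, ['attributes.author_info.secret_info'])
```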
- fetch(category, filter_classified=False, **kwargs)[source]¶
Fetch items from the repository.
The method retrieves items from a repository.
To remove classified fields from the resulting items, set the parameter filter_classified. Take into account that this parameter is incompatible with archiving items. Raw client data are archived before any other process; therefore, classified data are stored within the archive. To prevent possible data leaks or security issues when users do not need these fields, archiving and filtering are not compatible.
- Parameters
category – the category of the items fetched
filter_classified – remove classified fields from the resulting items
kwargs – a list of other parameters (e.g., from_date, offset, etc. specific for each backend)
- Returns
a generator of items
- Raises
BackendError – either when the category is not valid or ‘filter_classified’ and ‘archive’ are active at the same time.
- fetch_from_archive()[source]¶
Fetch the items from an archive.
It returns the items stored within an archive. If this method is called but no archive was provided, the method will raise an ArchiveError exception.
- Returns
a generator of items
- Raises
ArchiveError – raised when an error occurs accessing an archive
- fetch_items(category, **kwargs)[source]¶
Retrieve raw data from the repository.
This method is to be implemented by implementors of Backend, and is intended for internal use. Developers hoping to retrieve processed results should use the fetch() method.
This method receives a category of items to fetch from the repository. This will be one of the categories defined in the CATEGORIES class variable. The method also receives a list of keyword arguments. These arguments include any command line variables defined by the corresponding BackendCommand.
The method is then responsible for retrieving all items matching the criteria defined by the keyword args and the given category, then returning them as a generator of dicts. The structure of the dicts is irrelevant, but each dict should represent exactly one item.
- Parameters
category – the category of items to retrieve from the repository
kwargs – additional arguments to assist or specify retrieval
- Returns
a generator producing items
- filter_classified_data(item)[source]¶
Remove classified or confidential data from an item.
It removes those fields that contain data considered as classified. Classified fields are defined in CLASSIFIED_FIELDS class attribute.
- Parameters
item – fields will be removed from this item
- Returns
the same item but with confidential data filtered
- classmethod has_archiving()[source]¶
Whether or not this backend supports archiving requests.
For implementors, this means whether _init_client() can be called with from_archive=True and whether the backend will respect that. If the client used by the backend is an HttpClient, and _init_client() passes from_archive on to the client.HttpClient’s initializer, this should be true.
Classified data filtering and archiving are not compatible to prevent data leaks or security issues.
- classmethod has_resuming()[source]¶
Whether this backend supports resuming interrupted collections.
When interrupted, some backends may support resuming the collection by setting the from_date parameter on fetch_items() or fetch() to the date of the last item retrieved from the repository.
However, for some backends, this cannot be done, for example because results are retrieved from newest to oldest. If resuming was attempted on a backend like this, then some items would be missed.
For example, if the backend was in the middle of retrieving items from January 5th through 1st, but was interrupted when retrieving items from the 3rd, then it would be missing items for the 2nd and 1st. If this backend was resumed by setting from_date to the most recent item (the 5th), these missing items would not be retrieved, since they are earlier than the from_date.
This method is used to indicate that this backend can be resumed in this manner without missing any items. If a backend declares that it supports resuming, then from_date should be set to the date of the most recent item from the last collection, even if it failed. Otherwise, from_date should be set to the most recent item of the last successful collection. Resuming in this manner should not leave any holes in the collected items.
This can be used to speed up collections by skipping network IO for items that have already been downloaded and added to the database.
Additionally, from_date may be set regardless of this setting if the last collection did not fail, or if the user is not interested in items earlier than the provided datetime.
Implementers should return a constant True if their backend supports resuming collections in this manner, or False otherwise.
- metadata(item, filter_classified=False)[source]¶
Add metadata to an item.
It adds metadata to a given item such as how and when it was fetched. The contents from the original item will be stored under the ‘data’ keyword.
- Parameters
item – an item fetched by a backend
filter_classified – sets if classified fields were filtered
- static metadata_category(item)[source]¶
Identify the category of a given item.
Every item returned by fetch_items() should belong to exactly one of the categories listed in the CATEGORIES class member. This method should determine which category the item belongs to, given one of the items returned by the fetch_items() method.
Note that for all items returned by a call to fetch_items(), this method is expected to return a category equivalent to the one passed as the category argument.
- Returns
One of the strings in CATEGORIES
- static metadata_id(item)[source]¶
Produce a unique identifier for an item.
Given one of the items produced by fetch_items(), produce a unique identifier for that item. Typically, this is an identifier given by the repository itself, such as a commit hash or post id.
The id should be represented by a string.
- static metadata_updated_on(item)[source]¶
Determine the last time an item was updated.
Given one of the items produced by fetch_items(), attempt to identify the last time this item was modified.
- Returns
The timestamp of the last modification, represented as epoch seconds (a UNIX timestamp)
- property origin¶
- property origin_unique_field¶
See ORIGIN_UNIQUE_FIELD.
- search_fields(item)[source]¶
Add search fields to an item.
It adds the values of the fields defined in the EXTRA_SEARCH_FIELDS class attribute with their corresponding keys.
- Parameters
item – the item to extract the search fields values
- Returns
a dict of search fields
- property ssl_verify¶
- property summary¶
- version = '0.12.0'¶
- class perceval.backend.BackendCommand(*args, debug=False)[source]¶
Bases: object
Abstract class to run backends from the command line.
When the class is initialized, it parses the given arguments using the argument parser defined in the setup_cmd_parser method. Those arguments will be stored in the attribute parsed_args.
The arguments will be used to initialize and run the Backend object assigned to this command. The backend used to run the command is stored under the BACKEND class attribute. Any class derived from this one must set its own Backend class.
Moreover, the method setup_cmd_parser must be implemented to execute the backend.
- Parameters
debug – boolean flag to check if application is running in debug mode
- BACKEND = None¶
- run()[source]¶
Fetch and write items.
This method runs the backend to fetch the items from the given origin. Items are converted to JSON objects and written to the defined output. A summary with the result is written to the log.
If fetch-archive parameter was given as an argument during the initialization of the instance, the items will be retrieved using the archive manager.
- class perceval.backend.BackendCommandArgumentParser(backend, from_date=False, to_date=False, offset=False, basic_auth=False, token_auth=False, archive=False, aliases=None, blacklist=False, ssl_verify=False)[source]¶
Bases: object
Manage and parse backend command arguments.
This class defines and parses a set of arguments common to backend commands. Some parameters like archive or the different types of authentication can be set during the initialization of the instance.
- Parameters
backend – backend object
from_date – set from_date argument
to_date – set to_date argument
offset – set offset argument
basic_auth – set basic authentication arguments
token_auth – set token/key authentication arguments
archive – set archiving arguments
aliases – define aliases for parsed arguments
ssl_verify – set SSL verify argument
- Raises
AttributeError – when both from_date and offset are set to True
- parse(*args)[source]¶
Parse a list of arguments.
Parse argument strings needed to run a backend command. The result will be an argparse.Namespace object populated with the values obtained after the validation of the parameters.
- Parameters
args – argument strings
- Returns
an object with the parsed values
- class perceval.backend.BackendItemsGenerator(backend_class, backend_args, category, filter_classified=False, manager=None, fetch_archive=False, archived_after=None)[source]¶
Bases: object
BackendItemsGenerator class.
This class provides a generator through the items attribute that will fetch items from any data source and/or archive in a transparent way. A summary with the result of the process can be accessed via the attribute summary.
To initialize an instance, it is necessary to pass the backend that will be used to fetch data, its parameters and other useful data such as the category of the items to retrieve and the archive options.
This object can also be used as a context manager.
- Parameters
backend_class – backend class to fetch items
backend_args – dict of arguments needed to fetch the items
category – category of the items to retrieve. If None, it will use the default backend category
filter_classified – remove classified fields from the resulting items. Note that filter classified is not supported for archived items.
manager – archive manager where the items will be retrieved
fetch_archive – If enabled, items are fetched from archives
archived_after – return items archived after this date
- property summary¶
Return the summary object of the last fetch execution
- class perceval.backend.OriginUniqueField(name, type)¶
Bases: tuple
- property name¶
Alias for field number 0
- property type¶
Alias for field number 1
- class perceval.backend.Summary[source]¶
Bases: object
Summary class for fetch executions.
This class models the summary of a fetch execution. It includes the last UUID, number of items fetched, skipped and their sum, plus the minimum, maximum and last updated_on times.
Furthermore, for backends using offsets, the corresponding summary contains the minimum, maximum and last offsets retrieved.
Finally, the summary also includes some extra fields, which can be used by any backend to include fetch-specific information.
- property total¶
Number of items retrieved. This includes fetched and skipped items.
- perceval.backend.fetch(backend_class, backend_args, category, filter_classified=False, manager=None)[source]¶
Fetch items using the given backend.
Generator to get items using the given backend class. When an archive manager is given, this function will store the fetched items in an Archive. If an exception is raised, this archive will be removed to avoid corrupted archives.
The parameters needed to initialize the backend class and get the items are given using backend_args dict parameter.
- Parameters
backend_class – backend class to fetch items
backend_args – dict of arguments needed to fetch the items
category – category of the items to retrieve. If None, it will use the default backend category
filter_classified – remove classified fields from the resulting items
manager – archive manager needed to store the items
- Returns
a generator of items
- perceval.backend.fetch_from_archive(backend_class, backend_args, manager, category, archived_after)[source]¶
Fetch items from an archive manager.
Generator to get the items of a category (previously fetched by the given backend class) from an archive manager. Only those items archived after the given date will be returned.
The parameters needed to initialize backend and get the items are given using backend_args dict parameter.
- Parameters
backend_class – backend class to retrieve items
backend_args – dict of arguments needed to retrieve the items
manager – archive manager where the items will be retrieved
category – category of the items to retrieve
archived_after – return items archived after this date
- Returns
a generator of archived items
- perceval.backend.find_backends(top_package)[source]¶
Find available backends.
Look for the Perceval backends and commands under top_package and its sub-packages. When top_package defines a namespace, backends under that same namespace will be found too.
- Parameters
top_package – package storing backends
- Returns
a tuple with two dicts: one with Backend classes and one with BackendCommand classes
- perceval.backend.uuid(*args)[source]¶
Generate a UUID based on the given parameters.
The UUID will be the SHA1 of the concatenation of the values from the list. The separator between these values is ‘:’. Each value must be a non-empty string, otherwise, the function will raise an exception.
- Parameters
*args – list of arguments used to generate the UUID
- Returns
a universal unique identifier
- Raises
ValueError – when any one of the values is not a string, is empty or None.
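The rule described above (SHA1 over the values joined with ‘:’, rejecting anything that is not a non-empty string) can be sketched with `hashlib`. The function name `sketch_uuid` and the UTF-8 encoding of the joined string are assumptions for this example; the real function may encode its input differently.

```python
import hashlib


def sketch_uuid(*args):
    """Return the SHA1 hex digest of the arguments joined with ':'."""
    for value in args:
        # None, numbers and empty strings are all rejected.
        if not isinstance(value, str) or not value:
            raise ValueError("every value must be a non-empty string")
    content = ':'.join(args)
    # Assumption: the joined string is encoded as UTF-8 before hashing.
    return hashlib.sha1(content.encode('utf-8')).hexdigest()


identifier = sketch_uuid('https://example.com/repo', 'issue-42')
```

Since the digest depends only on its inputs, fetching the same item from the same origin twice yields the same identifier, which makes duplicates easy to detect downstream.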
perceval.client module¶
- class perceval.client.HttpClient(base_url, max_retries=5, sleep_time=1, extra_headers=None, extra_status_forcelist=None, extra_retry_after_status=None, archive=None, from_archive=False, ssl_verify=True)[source]¶
Bases: object
Abstract class for HTTP clients.
Base class to query data sources, taking care of retrying requests in case of connection issues. If the data source does not send back a response after retrying a request, a RetryError exception is thrown.
Sub-classes can use the fetch method to obtain data from the data source.
To track which version of the client was used during the fetching process, this class provides a version attribute that each client may override.
- Parameters
base_url – base URL of the data source
max_retries – number of max retries to a data source before raising a RetryError exception
sleep_time – time (in seconds) to sleep in case of connection problems
extra_headers – extra headers to be included in the requests
extra_status_forcelist – a set of HTTP status codes that will force a retry on
extra_retry_after_status – a set of HTTP status codes that will perform a retry respecting the Retry-After header
archive – archive to store/retrieve items
from_archive – if True the data is fetched from an archive
ssl_verify – enable/disable SSL verification
- DEFAULT_HEADERS = {'User-Agent': 'Perceval/0.17.12'}¶
- DEFAULT_METHOD_WHITELIST = False¶
- DEFAULT_RAISE_ON_REDIRECT = True¶
- DEFAULT_RAISE_ON_STATUS = True¶
- DEFAULT_RESPECT_RETRY_AFTER_HEADER = True¶
- DEFAULT_RETRY_AFTER_STATUS_CODES = [413, 429, 503]¶
- DEFAULT_SLEEP_TIME = 1¶
- DEFAULT_STATUS_FORCE_LIST = [408, 423, 504]¶
- GET = 'GET'¶
- MAX_RETRIES = 5¶
- MAX_RETRIES_ON_CONNECT = 5¶
- MAX_RETRIES_ON_READ = 5¶
- MAX_RETRIES_ON_REDIRECT = 5¶
- MAX_RETRIES_ON_STATUS = 5¶
- POST = 'POST'¶
- fetch(url, payload=None, headers=None, method='GET', stream=False, auth=None)[source]¶
Fetch the data from a given URL.
- Parameters
url – link to the resource
payload – payload of the request
headers – headers of the request
method – type of request call (GET or POST)
stream – defer downloading the response body until the response content is available
auth – auth of the request
- Returns
a response object
- static sanitize_for_archive(url, headers, payload)[source]¶
Sanitize the URL, headers and payload of an HTTP request before storing/retrieving items. By default, this method does not modify the url, headers and payload. Modifications take place within the specific backends that redefine sanitize_for_archive.
- Parameters
url – HTTP url request
headers – HTTP headers request
payload – HTTP payload request
- Returns
url, headers and payload sanitized
- version = '0.3.0'¶
- class perceval.client.RateLimitHandler[source]¶
Bases: object
Class to handle rate limit for HTTP clients.
- Parameters
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to sleep until it is reset
rate_limit_header – header to know the current rate limit
rate_limit_reset_header – header to know the next rate limit reset
- MAX_RATE_LIMIT = 500¶
- MIN_RATE_LIMIT = 10¶
- RATE_LIMIT_HEADER = 'X-RateLimit-Remaining'¶
- RATE_LIMIT_RESET_HEADER = 'X-RateLimit-Reset'¶
- setup_rate_limit_handler(sleep_for_rate=False, min_rate_to_sleep=10, rate_limit_header='X-RateLimit-Remaining', rate_limit_reset_header='X-RateLimit-Reset')[source]¶
Setup the rate limit handler.
- Parameters
sleep_for_rate – sleep until rate limit is reset
min_rate_to_sleep – minimum rate needed to make the fetching process sleep
rate_limit_header – header from which to extract the rate limit data
rate_limit_reset_header – header from which to extract the rate limit reset data
- sleep_for_rate_limit()[source]¶
The fetching process sleeps until the rate limit is restored or raises a RateLimitError exception if sleep_for_rate flag is disabled.
- update_rate_limit(response)[source]¶
Update the rate limit and the time to reset from the response headers.
- Parameters
response – the response object
- version = '0.2'¶
perceval.errors module¶
- exception perceval.errors.ArchiveError(**kwargs)[source]¶
Bases: perceval.errors.BaseError
Generic error for archive objects
- message = '%(cause)s'¶
- exception perceval.errors.ArchiveManagerError(**kwargs)[source]¶
Bases: perceval.errors.BaseError
Generic error for archive manager
- message = '%(cause)s'¶
- exception perceval.errors.BackendCommandArgumentParserError(**kwargs)[source]¶
Bases: perceval.errors.BaseError
Generic error for BackendCommandArgumentParser
- message = '%(cause)s'¶
- exception perceval.errors.BackendError(**kwargs)[source]¶
Bases: perceval.errors.BaseError
Generic error for backends
- message = '%(cause)s'¶
- exception perceval.errors.BaseError(**kwargs)[source]¶
Bases: Exception
Base class for Perceval exceptions.
Derived classes can override the error message by declaring the message property.
- message = 'Perceval base error'¶
- exception perceval.errors.HttpClientError(**kwargs)[source]¶
Bases: perceval.errors.BaseError
Generic error for HTTP Client
- message = '%(cause)s'¶
- exception perceval.errors.ParseError(**kwargs)[source]¶
Bases: perceval.errors.BaseError
Exception raised when a parsing error occurs
- message = '%(cause)s'¶
- exception perceval.errors.RateLimitError(**kwargs)[source]¶
Bases: perceval.errors.BaseError
Exception raised when the rate limit is exceeded
- message = '%(cause)s; %(seconds_to_reset)s seconds to rate reset'¶
- property seconds_to_reset¶
- exception perceval.errors.RepositoryError(**kwargs)[source]¶
Bases: perceval.errors.BaseError
Generic error for repositories
- message = '%(cause)s'¶
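The message attributes listed for these exceptions are %-style templates filled from keyword arguments. The sketch below shows one way such a hierarchy could work. The class names `SketchBaseError` and `SketchRateLimitError` and the constructor body are assumptions for illustration, not the Perceval source; only the two template strings' shapes come from the listing above.

```python
class SketchBaseError(Exception):
    """Base error whose text comes from a class-level template."""

    message = 'Sketch base error'

    def __init__(self, **kwargs):
        super().__init__()
        # The template is filled with the keyword arguments by name.
        self.msg = self.message % kwargs

    def __str__(self):
        return self.msg


class SketchRateLimitError(SketchBaseError):
    """Derived error that only overrides the template."""

    message = '%(cause)s; %(seconds_to_reset)s seconds to rate reset'


error = SketchRateLimitError(cause='rate limit exhausted', seconds_to_reset=30)
text = str(error)
plain = str(SketchBaseError())
```

With this design a new error type needs only a new template string, and callers always pass the details as named arguments.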
perceval.perceval module¶
- class perceval.perceval.ListBackends(option_strings, dest, **kwargs)[source]¶
Bases: argparse.Action
perceval.utils module¶
- perceval.utils.check_compressed_file_type(filepath)[source]¶
Check if filename is a compressed file supported by the tool.
This function uses magic numbers (first four bytes) to determine the type of the file. Supported types are ‘gz’ and ‘bz2’. When the filetype is not supported, the function returns None.
- Parameters
filepath – path to the file
- Returns
‘gz’ or ‘bz2’; None if the type is not supported
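Magic-number detection can be sketched as below: gzip streams start with the bytes 1f 8b and bzip2 streams start with the ASCII characters BZh. The helper name `sketch_compressed_type` and the temporary sample files are assumptions made for this example.

```python
import bz2
import gzip
import os
import tempfile


def sketch_compressed_type(filepath):
    """Return 'gz', 'bz2' or None based on the first bytes of the file."""
    with open(filepath, 'rb') as handle:
        magic = handle.read(4)
    if magic.startswith(b'\x1f\x8b'):
        return 'gz'
    if magic.startswith(b'BZh'):
        return 'bz2'
    return None


with tempfile.TemporaryDirectory() as tmpdir:
    gz_path = os.path.join(tmpdir, 'sample.gz')
    bz2_path = os.path.join(tmpdir, 'sample.bz2')
    txt_path = os.path.join(tmpdir, 'sample.txt')

    # Write one small file of each kind to check the detection.
    with gzip.open(gz_path, 'wb') as handle:
        handle.write(b'hello')
    with bz2.open(bz2_path, 'wb') as handle:
        handle.write(b'hello')
    with open(txt_path, 'wb') as handle:
        handle.write(b'hello')

    detected = [sketch_compressed_type(p) for p in (gz_path, bz2_path, txt_path)]
```

Checking the content rather than the file extension means a renamed or extension-less mailbox file is still recognized correctly.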
- perceval.utils.message_to_dict(msg)[source]¶
Convert an email message into a dictionary.
This function transforms an email.message.Message object into a dictionary. Headers are stored as key:value pairs while the body of the message is stored inside body key. Body may have two other keys inside, ‘plain’, for plain body messages and ‘html’, for HTML encoded messages.
The returned dictionary has the type requests.structures.CaseInsensitiveDict because the same headers can appear with different case formats in the same message.
- Parameters
msg – email message of type email.message.Message
- Returns
dictionary of type requests.structures.CaseInsensitiveDict
- Raises
ParseError – when an error occurs transforming the message to a dictionary
- perceval.utils.months_range(from_date, to_date)[source]¶
Generate a months range.
Generator of months starting on from_date until to_date. Each returned item is a tuple of two datetime objects like in (month, month+1). Thus, the result will follow the sequence:
((fd, fd+1), (fd+1, fd+2), …, (td-2, td-1), (td-1, td))
- Parameters
from_date – generate dates starting on this month
to_date – generate dates until this month
- Returns
a generator of months range
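The sequence above can be sketched with a plain generator over `datetime` objects. The function name `sketch_months_range` and the choice to align every boundary to the first day of its month are assumptions for this example; the original function may handle days and time zones differently.

```python
from datetime import datetime


def sketch_months_range(from_date, to_date):
    """Yield (month, next_month) pairs between two dates (illustrative only)."""

    def next_month(date):
        # December rolls over to January of the following year.
        year = date.year + (1 if date.month == 12 else 0)
        month = 1 if date.month == 12 else date.month + 1
        return date.replace(year=year, month=month)

    # Assumption: ranges are aligned to the first day of each month.
    current = datetime(from_date.year, from_date.month, 1)
    end = datetime(to_date.year, to_date.month, 1)

    while current < end:
        following = next_month(current)
        yield current, following
        current = following


pairs = list(sketch_months_range(datetime(2023, 11, 15), datetime(2024, 2, 3)))
```

Each pair can be used directly as the start and end of one monthly query, which suits sources that are archived month by month.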
- perceval.utils.remove_invalid_xml_chars(raw_xml)[source]¶
Remove control and invalid characters from an xml stream.
Looks for invalid characters and substitutes them with whitespaces. This solution is based on these two posts: Olemis Lang’s response on StackOverflow (http://stackoverflow.com/questions/1707890) and lawlesst’s on GitHub Gist (https://gist.github.com/lawlesst/4110923), which is based on the previous answer.
- Parameters
raw_xml – XML stream
- Returns
a purged XML stream
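One way to implement this kind of clean-up is a regular expression that matches every character outside the ranges allowed by XML 1.0 and replaces it with a space. The helper name `sketch_remove_invalid_xml_chars` and the exact pattern are assumptions for this example and are not taken from the posts cited above.

```python
import re

# Characters allowed by XML 1.0: tab, line feed, carriage return and the
# ranges U+0020-U+D7FF, U+E000-U+FFFD and U+10000-U+10FFFF.
INVALID_XML_CHARS = re.compile(
    r'[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)


def sketch_remove_invalid_xml_chars(raw_xml):
    """Replace each character that XML 1.0 forbids with a single space."""
    return INVALID_XML_CHARS.sub(' ', raw_xml)


cleaned = sketch_remove_invalid_xml_chars('<a>ok\x00\x1b</a>\n')
```

Replacing with a space instead of deleting keeps the length of the stream unchanged, so character offsets reported by a parser still point at the same place in the original input.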
- perceval.utils.xml_to_dict(raw_xml)[source]¶
Convert an XML stream into a dictionary.
This function transforms an XML stream into a dictionary. The attributes are stored as single elements while child nodes are stored into lists. The text node is stored using the special key ‘__text__’.
This code is based on Winston Ewert’s solution to this problem. See http://codereview.stackexchange.com/questions/10400/convert-elementtree-to-dict for more info. The code was licensed as cc by-sa 3.0.
- Parameters
raw_xml – XML stream
- Returns
a dict with the XML data
- Raises
ParseError – raised when an error occurs parsing the given XML stream
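The mapping described above (attributes as single values, child nodes in lists, text under ‘__text__’) can be sketched with `xml.etree.ElementTree`. The function name `sketch_xml_to_dict`, dropping the root tag name, skipping whitespace-only text, and assuming that attribute names never collide with child tag names are all choices made for this example.

```python
import xml.etree.ElementTree as ET


def sketch_xml_to_dict(raw_xml):
    """Convert an XML string into nested dicts and lists (illustrative only)."""

    def node_to_dict(node):
        # Attributes are stored as single values.
        result = dict(node.attrib)
        # Whitespace-only text is skipped in this sketch.
        if node.text and node.text.strip():
            result['__text__'] = node.text
        # Child nodes with the same tag are collected into one list.
        for child in node:
            result.setdefault(child.tag, []).append(node_to_dict(child))
        return result

    return node_to_dict(ET.fromstring(raw_xml))


parsed = sketch_xml_to_dict(
    '<bug id="1"><comment>first</comment><comment>second</comment></bug>'
)
```

Always wrapping children in a list, even when there is only one, spares callers from checking whether a key holds a dict or a list.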