Task(name=None, url=None, grab=None, grab_config=None, priority=None, priority_set_explicitly=True, network_try_count=0, task_try_count=1, disable_cache=False, refresh_cache=False, valid_status=None, use_proxylist=True, cache_timeout=None, delay=None, raw=False, callback=None, fallback_name=None, **kwargs)
Task for spider.
__init__(name=None, url=None, grab=None, grab_config=None, priority=None, priority_set_explicitly=True, network_try_count=0, task_try_count=1, disable_cache=False, refresh_cache=False, valid_status=None, use_proxylist=True, cache_timeout=None, delay=None, raw=False, callback=None, fallback_name=None, **kwargs)
Create Task object.
If more than one of the url, grab, and grab_config options is non-empty, they are processed in the following order:
* grab overrides grab_config
* grab_config overrides url
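The precedence rule above can be sketched as follows. This is an illustrative sketch only; resolve_url and its dict-based arguments are hypothetical and not part of the Grab API:

```python
# Hypothetical sketch of the url / grab / grab_config precedence.
# Here "grab" and "grab_config" are stand-in dicts, not real Grab objects.
def resolve_url(url=None, grab=None, grab_config=None):
    if grab is not None:
        # grab overrides grab_config: the URL configured on the
        # Grab instance wins over everything else
        return grab["url"]
    if grab_config is not None:
        # grab_config overrides the plain url option
        return grab_config["url"]
    return url
```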
name — name of the task. After a successful network operation the task's result is passed to the task_<name> method.
url — URL of the network document. Every task requires either the url or the grab option to be specified.
grab — a configured Grab instance. Use this option when the url option is not enough. Do not forget to configure the url option of the Grab instance, because in that case the url option of the Task constructor will be overwritten with grab.config['url'].
priority — priority of the Task. Tasks with lower priority are processed earlier. By default each new task is assigned a random priority from the (80, 100) range.
priority_set_explicitly — internal flag which tells whether the task priority was assigned manually or generated by the spider according to the priority generation rules.
network_try_count — you will probably not need to use this. It is used internally to track how many times the task was restarted due to network errors. The Spider instance has a network_try_limit option; when the network_try_count attribute of the task exceeds network_try_limit, processing of the task is abandoned.
task_try_count — the same as network_try_count, but it is increased only when you use the clone method. You can also set it manually. This is useful if you want to restart a task after it was cancelled due to multiple network errors. As you might have guessed, there is a task_try_limit option in the Spider instance. Together, the network_try_limit and task_try_limit options guarantee that you will not get an infinite loop of task restarts.
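The interaction of the two counters with their limits can be sketched as follows. This is an illustrative sketch, not Grab's actual code, and the default limit of 10 is an assumption for the example:

```python
# Illustrative sketch: a task is abandoned as soon as either counter
# exceeds its limit, which prevents an infinite restart loop.
# The limit defaults here (10) are assumptions for illustration.
def should_abandon(network_try_count, task_try_count,
                   network_try_limit=10, task_try_limit=10):
    return (network_try_count > network_try_limit
            or task_try_count > task_try_limit)
```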
disable_cache — if True, disable the cache subsystem. The document will be fetched from the network and will not be saved to the cache.
refresh_cache — if True, the document will be fetched from the network and then saved to the cache.
valid_status — extra status codes which count as valid.
use_proxylist — whether to use the proxy list that was configured via the setup_proxylist method of the spider.
delay — if specified, tells the spider to schedule the task and execute it after delay seconds.
raw — if True, the network response is forwarded to the corresponding handler without any check of the HTTP status code or network errors. If False (the default), a failed response is put back into the task queue or, if the tries limit is reached, processing of the request is finished.
callback — if you pass a function in the callback option, the network response is passed to that callback, the usual task_* handler is ignored, and no error is raised if such a task_* handler does not exist.
fallback_name — the name of the method that is called when the spider gives up on the task (due to multiple network errors).
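The precedence of callback over the task_<name> handler can be sketched like this. pick_handler and its arguments are hypothetical, for illustration only, and do not belong to the Grab API:

```python
# Illustrative sketch of handler dispatch: an explicit callback wins
# over the task_<name> handler; if neither exists this sketch simply
# returns None.
def pick_handler(task_name, callback=None, handlers=None):
    handlers = handlers or {}
    if callback is not None:
        return callback
    return handlers.get("task_" + task_name)
```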
**kwargs — any non-standard named arguments passed to the Task constructor are saved as attributes of the object. You can read their values later as attributes or with the get method, which lets you supply a default value for attributes that do not exist.
clone — clone the Task instance.
Resets network_try_count, increases task_try_count, and resets the priority attribute if it was not set explicitly.
get — return the value of the attribute, or None if no such attribute exists.
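The **kwargs-to-attributes behaviour and the get fallback described above can be mimicked with a minimal sketch. TaskSketch is a hypothetical stand-in, not Grab's actual Task implementation:

```python
# Minimal sketch of how extra keyword arguments become attributes
# and how get() falls back to a default value.
class TaskSketch:
    def __init__(self, name=None, url=None, **kwargs):
        self.name = name
        self.url = url
        # Every non-standard keyword argument becomes an attribute
        for key, value in kwargs.items():
            setattr(self, key, value)

    def get(self, key, default=None):
        # Return the attribute value, or the default if it is missing
        return getattr(self, key, default)

task = TaskSketch(name="page", url="http://example.com/", page_number=3)
```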