Scrapy in practice: using item pipelines
1. Deduplication
Decide what counts as a duplicate based on the business scenario, then drop the repeats. The code is as follows:
from scrapy.exceptions import DropItem

# drop duplicate items
class DedupPipeline:
    def __init__(self):
        self.link_set = set()

    def process_item(self, item, spider):
        # a miRNA-target pair identifies a record, so use it as the dedup key
        link = item['miRNA'] + item['target']
        if link in self.link_set:
            raise DropItem('duplicate item: {}'.format(link))
        self.link_set.add(link)
        return item
2. Validating data
Check each item for validity and keep only the valid ones. The code is as follows:
# drop items that fail validation
class ValidatePipeline:
    def process_item(self, item, spider):
        # miRNAs known to be invalid for this dataset
        invalid_miRNAs = ['bta-miR-15b', 'bta-miR-10a']
        if item['miRNA'] in invalid_miRNAs:
            raise DropItem('invalid miRNA: {}'.format(item['miRNA']))
        return item
3. Writing to a file
Save the information in each item to a file. The code is as follows:
# write items to a tab-separated file
class ExportPipeline:
    def __init__(self):
        self.file = open('items.tsv', 'w')

    def close_spider(self, spider):
        # close the file handle when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        line = '{}\t{}\t{}\n'.format(item['miRNA'], item['target'], item['species'])
        self.file.write(line)
        return item
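A minimal sketch of the TSV output this kind of pipeline produces, written to a temporary file so the example is self-contained (the item values are hypothetical):

```python
import os
import tempfile

# hypothetical items, with the same fields the pipeline expects
items = [
    {'miRNA': 'bta-miR-26a', 'target': 'GRB10', 'species': 'Bos taurus'},
    {'miRNA': 'bta-miR-26a', 'target': 'PTEN', 'species': 'Bos taurus'},
]

# write one tab-separated line per item, as ExportPipeline does
path = os.path.join(tempfile.mkdtemp(), 'items.tsv')
with open(path, 'w') as f:
    for item in items:
        f.write('{}\t{}\t{}\n'.format(item['miRNA'], item['target'], item['species']))

# read the file back to inspect the result
with open(path) as f:
    lines = f.read().splitlines()
print(len(lines))  # 2
```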
4. Persistence
Store the information in each item in a database, using sqlite3 as an example. The code is as follows:
import sqlite3

# store items in a SQLite database
class WritedbPipeline:
    def open_spider(self, spider):
        conn = sqlite3.connect('test.db')
        cursor = conn.cursor()
        cursor.execute('''CREATE TABLE IF NOT EXISTS target
                          (mirna text, gene text, species text,
                           UNIQUE(mirna, gene, species))''')
        self.conn = conn
        self.cursor = cursor

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        c = self.cursor
        # INSERT OR IGNORE skips rows that violate the UNIQUE constraint
        # instead of raising sqlite3.IntegrityError
        c.execute('INSERT OR IGNORE INTO target VALUES (?, ?, ?)',
                  (item['miRNA'], item['target'], item['species']))
        return item
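A quick sketch of how the UNIQUE constraint interacts with INSERT OR IGNORE, using an in-memory database so it runs without touching test.db (the row values are hypothetical):

```python
import sqlite3

# in-memory database with the same schema as the pipeline
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('''CREATE TABLE IF NOT EXISTS target
               (mirna text, gene text, species text,
                UNIQUE(mirna, gene, species))''')

rows = [
    ('bta-miR-26a', 'GRB10', 'Bos taurus'),
    ('bta-miR-26a', 'GRB10', 'Bos taurus'),  # duplicate, silently skipped
    ('bta-miR-26a', 'PTEN', 'Bos taurus'),
]
for row in rows:
    # OR IGNORE drops rows violating UNIQUE(mirna, gene, species)
    # instead of raising sqlite3.IntegrityError
    cur.execute('INSERT OR IGNORE INTO target VALUES (?, ?, ?)', row)
conn.commit()

cur.execute('SELECT COUNT(*) FROM target')
count = cur.fetchone()[0]
print(count)  # 2
```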
In pipelines.py, each class defines one component. When several components are enabled, they must be registered in settings.py, whose priority values control the order in which the components run (lower values run first). The code is as follows:
ITEM_PIPELINES = {
    'hello_world.pipelines.ValidatePipeline': 200,
    'hello_world.pipelines.DedupPipeline': 300,
    'hello_world.pipelines.ExportPipeline': 400,
    'hello_world.pipelines.WritedbPipeline': 500,
}
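Scrapy calls process_item on each enabled pipeline in ascending priority order, feeding each component's return value to the next; a DropItem raised anywhere stops the chain for that item. A minimal sketch of that chaining, using a stand-in DropItem exception so it runs outside Scrapy (item values are hypothetical):

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem, so the sketch runs without Scrapy."""

class ValidatePipeline:
    def process_item(self, item, spider):
        if item['miRNA'] in ['bta-miR-15b', 'bta-miR-10a']:
            raise DropItem('invalid miRNA')
        return item

class DedupPipeline:
    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        key = item['miRNA'] + item['target']
        if key in self.seen:
            raise DropItem('duplicate item')
        self.seen.add(key)
        return item

# components in ascending priority order, mimicking ITEM_PIPELINES
pipelines = [ValidatePipeline(), DedupPipeline()]

def run(item):
    # pass each pipeline's return value to the next, as Scrapy does
    for p in pipelines:
        item = p.process_item(item, spider=None)
    return item

kept = []
for item in [
    {'miRNA': 'bta-miR-15b', 'target': 'GRB10'},  # dropped by validation
    {'miRNA': 'bta-miR-26a', 'target': 'GRB10'},  # kept
    {'miRNA': 'bta-miR-26a', 'target': 'GRB10'},  # dropped as duplicate
]:
    try:
        kept.append(run(item))
    except DropItem:
        pass
print(len(kept))  # 1
```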