What will happen if you are a Python programmer
in the financial community?
Take my internship experience as an example
- Crawler -
Can you eat Python?
"You've written crawlers in Py... (uh, how do you even pronounce that?) Python? But here we all write C# and VBA."
But that's fine,
just learn C#, VBA, and SQL before your first day!
Inner voice: is it really that easy? QQ...
__ __ Brokerage, proprietary trading department
#3 Daily option strike/price distribution chart: TAIEX options, puts expiring 201309
strike price (exercise price)
Buyers
Sellers
Brokers 1, 2, 3, 4
Brokers 1, 2, 3, 4
So my supervisor drew me a chart...
#3 Daily option strike/price "battle map"
An option is a rights contract: after paying a premium, the buyer has the right, on an agreed future date (the expiration date), to buy or sell a fixed quantity of the underlying asset at the agreed strike price.
Does the strike price distribution reflect expectations about the market?
Simply put, it's just grabbing data~
[C#] Arachnode.net? (not free)
R is slower
Try a Python solution: Scrapy
Windows Platform : C#
C#
Scrapy
(Official description) A web crawling framework, used to crawl websites and extract structured data from their pages.
In other words, a package that lets you develop web crawlers quickly.
How quickly? And why is it fast?
XPath Parser
Scrapy is a web crawling framework, used to crawl websites and extract
structured data from their pages.
# Regular expressions: every character is treated the same
# Alternative: XPath, since an HTML doc can be treated as structured data
XPath is like an "address"
# C:/Python27
# html/body/div[@class="wrapper"]/div[@class="header clearfix"]/h1[@class="logo"]/a
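The "address" idea can be tried without Scrapy: Python's standard library ships a limited XPath subset in `xml.etree.ElementTree`. A minimal sketch, with a made-up HTML snippet shaped like the path above (the class names and link are invented):

```python
# Stdlib-only illustration of "XPath as an address".
# xml.etree.ElementTree supports a small XPath subset, enough to walk
# down a tree by tag names and class attributes.
import xml.etree.ElementTree as ET

doc = """
<html>
  <body>
    <div class="wrapper">
      <div class="header clearfix">
        <h1 class="logo"><a href="/home">PyLadies</a></h1>
      </div>
    </div>
  </body>
</html>
"""

root = ET.fromstring(doc)  # root is the <html> element
# Like a file path C:/Python27, the XPath walks down the tree:
link = root.find("body/div[@class='wrapper']/div[@class='header clearfix']/h1[@class='logo']/a")
print(link.text, link.get("href"))  # PyLadies /home
```

Real pages are rarely well-formed XML, which is why Scrapy's own selectors (built on a lenient HTML parser) are used in practice.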
A Simple Demo
[Diagram: crawl flow]
TAIFEX homepage (http://www.taifex.com.tw/chinese/index.asp): after-hours info, company profile, products, ...
1st response: an XPath rule extracts the URLs to follow
2nd request for each URL; from the 2nd response, another XPath rule extracts the items (e.g. the opening price)
Pseudo code:

import scrapy  # the toolkit

class MySpider(scrapy.Spider):          # inherit Scrapy's ready-made Spider
    name = "crawler_name"
    start_urls = [initial_request_URL]  # i.e. the top-level root

    def parse(self, first_response):    # the first callback must be named parse!!
        hxs = HtmlXPathSelector(first_response)  # html response => XPath-structured object
        xpath = "rule for crawling URLs"
        extracted_data = hxs.select(xpath).extract()
        yield Request(url=extracted_url, callback=self.next_parser)

    def next_parser(self, second_response):  # this function's name is up to you
        hxs = HtmlXPathSelector(second_response)
        xpath = "rule for scraping data"
        extracted_data = hxs.select(xpath).extract()
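The two-stage pattern in the pseudo code can be sketched with the standard library alone; the pages below are canned strings standing in for HTTP responses, and every name, URL, and number is invented:

```python
# Stdlib sketch of the two-stage crawl: parse() extracts links from an
# index page, next_parser() extracts a value from each detail page.
import re

INDEX_PAGE = """
<a href="/report?page=1">day 1</a>
<a href="/report?page=2">day 2</a>
"""

DETAIL_PAGES = {
    "/report?page=1": "<td>open</td><td>8120</td>",
    "/report?page=2": "<td>open</td><td>8135</td>",
}

def parse(index_html):
    # Stage 1: the "rule" that finds URLs to follow
    for url in re.findall(r'href="([^"]+)"', index_html):
        yield url

def next_parser(detail_html):
    # Stage 2: the "rule" that scrapes the data itself
    return re.search(r"<td>open</td><td>(\d+)</td>", detail_html).group(1)

opens = [next_parser(DETAIL_PAGES[url]) for url in parse(INDEX_PAGE)]
print(opens)  # ['8120', '8135']
```

In real Scrapy the framework does the fetching and scheduling; only the two "rule" callbacks are yours to write, which is why the crawler stays so short.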
cmd command
# Create a Scrapy project
[cmd] scrapy startproject PyLadiesDemo

# Run the crawler with name = "TaiFex"
[cmd] scrapy crawl TaiFex

See the official tutorial: http://doc.scrapy.org/en/latest/intro/tutorial.html
Demo
# sample code
# A good tool (for error detection with try/except):
[cmd] scrapy shell url_that_u_want_2_crawl
Exception handling + setting pdb breakpoints

# Exception handling
def parse(self, response):
    try:
        ...
    except:
        ...

# Setting a breakpoint
import pdb
pdb.set_trace()

Websites always have some surprises,
or you may get the XPath wrong; this way you can fix it on the spot.
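A runnable, stdlib-only sketch of the same habit: wrap the fragile extraction step in try/except, with the pdb breakpoint left commented out for interactive runs. `parse_row` and the sample rows are invented for the example:

```python
# Defensive extraction: bad rows are skipped instead of killing the
# whole crawl, and a pdb breakpoint can be dropped in to inspect them.
import pdb

def parse_row(row):
    try:
        strike, price = row.split(",")
        return int(strike), float(price)
    except ValueError:
        # On a real run, uncomment to inspect the offending row:
        # pdb.set_trace()
        return None

rows = ["8100,52.0", "oops-not-a-row", "8200,31.5"]
parsed = [r for r in (parse_row(x) for x in rows) if r is not None]
print(parsed)  # [(8100, 52.0), (8200, 31.5)]
```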
#3 Daily option strike/price "battle map"
Find the “url pattern”
crawl pages and scrape data
store into DB
Visualization: histogram
Google Developer Tools
Python: Scrapy
XPath Helper
Django -> MSSQL
Excel
Windows Platform
query string: http://www.taifex.com.tw/chinese/3/fcm_opt_rep.asp?
commodity_id=TXO&commodity_idt=TXO&settlemon=201310W1&pccode=P&curpage=1&pagenum=1
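Once the pattern is known, each page's URL can be generated instead of hand-edited; a sketch with `urllib.parse.urlencode`, using the parameter names verbatim from the URL above (the reading of `pccode` as put/call is an assumption):

```python
# Rebuild the TAIFEX report URL from its query parameters.
from urllib.parse import urlencode

BASE = "http://www.taifex.com.tw/chinese/3/fcm_opt_rep.asp"

params = {
    "commodity_id": "TXO",
    "commodity_idt": "TXO",
    "settlemon": "201310W1",  # settlement month (W1 = first weekly)
    "pccode": "P",            # assumed: P = put, C = call
    "curpage": 1,
    "pagenum": 1,
}
url = BASE + "?" + urlencode(params)
print(url)
```

Looping over `settlemon` and `curpage` then yields every page the crawler needs to visit.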
Step 1: Find the "url pattern"
Contract rollover dates have many exceptions:
[TXO] 3 near months + 2 quarterly months; regular monthly options settle on the third Wednesday of the month
Weekly options settle every Wednesday
What about typhoon days and long holidays?
[TGO] 6 nearest even months
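The regular part of the rule (monthly settlement on the third Wednesday) is easy to compute with the standard library; the typhoon-day and holiday exceptions would still need a manual lookup table. A sketch:

```python
# Compute the third Wednesday of a month, the regular TXO monthly
# settlement day.  Holiday adjustments are NOT handled here.
import calendar
import datetime

def third_wednesday(year, month):
    first = datetime.date(year, month, 1)
    # days until the first Wednesday (Monday == 0, Wednesday == 2)
    offset = (calendar.WEDNESDAY - first.weekday()) % 7
    return first + datetime.timedelta(days=offset + 14)

print(third_wednesday(2013, 9))  # 2013-09-18
```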
Step 1: Find the "url pattern"
What?! pip doesn't work!!!!!
Because the company blocks the site!!!
◢ ▆ ▅ ▄ ▃ melt╰ (〒皿〒)╯ down ▃ ▄ ▅ ▇ ◣
(installer using a .bat file)
Step 2: crawl pages and scrape data
Python is beautiful ...
Python crawler: 49 lines (ver 1), 150 lines (final ver)
vs.
C# crawler: 700 lines (and it can only crawl one page!!)
A landslide win!
Can we use MSSQL?
Can Django talk to M$SQL?
YES!
<Solution> install "django-mssql"
and modify "settings.py": http://django-mssql.readthedocs.org/en/latest/
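For reference, the settings.py change looks roughly like the following; the values are placeholders and the exact options should be checked against the django-mssql docs linked above:

```python
# Sketch of the DATABASES block in settings.py for django-mssql.
# Host, database name, and credentials below are placeholders.
DATABASES = {
    "default": {
        "ENGINE": "sqlserver_ado",       # backend provided by django-mssql
        "NAME": "my_database",
        "HOST": r"localhost\SQLEXPRESS",
        "USER": "",    # empty user/password => Windows integrated auth
        "PASSWORD": "",
    }
}
```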
Can it be packaged into an exe?
<Solution 1> py2exe: launch the crawler from a Python script instead
https://scrapy.readthedocs.org/en/latest/topics/practices.html
Failed! @@
Can it be packaged into an exe?
<Solution 2> installer.bat: use pip offline (no updates)
It works! (though it still fails on a few machines)
Conclusion
# scrapy: a decent lightweight crawler framework
Rapid development that lets you focus on the parser; easy to learn and easy to maintain
Remember the exception handling
# django: killing a chicken with an ox cleaver
If no other products are planned on top of it,
writing a plain txt file directly may be less effort
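The "just write a txt" alternative can be sketched with the stdlib csv module; the rows are invented sample data, and `io.StringIO` stands in for a real output file:

```python
# Dump scraped rows straight to delimited text, no ORM required.
import csv
import io

rows = [
    ("201309", "P", 8100, 52.0),
    ("201309", "P", 8200, 31.5),
]

buf = io.StringIO()  # stands in for open("puts.txt", "w", newline="")
writer = csv.writer(buf)
writer.writerow(["settlemon", "pccode", "strike", "price"])
writer.writerows(rows)
print(buf.getvalue())
```

The resulting file imports directly into Excel, which matches the tool list earlier in the slides.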
# Bonus: my supervisor finally knows how to pronounce Python! <( ̄︶ ̄)>
Reference
# Official website tutorial: www.scrapy.org
# Taipei.py, Thoen's slides: http://files.meetup.com/6816242/%28Pycon%20Taipei%29%20Scrapy-20130328.pdf
# Thanks Tim & c3h3 !