2018-SASxSTCWorkshop

Agenda

Time Agenda
0930 Registration starts
1020 Opening
1030 Analytics talk by Fusionex
1200 Q & A
1230 Lunch
1330 Workshop starts
1620 Closing speech
1630 End

Workshop Agenda

Time Agenda
1330 Ice Breaking
1333 Introduce TAs
1334 Install requirements
1350 Basic web scraping
1415 Scrape the CIA World Factbook
1445 Pandas + Matplotlib
1620 End

Slides

Google Slides

Workshop

  1. Install all the requirements before starting

    $ pip install -r ./requirements.txt

    OR

    $ python -m pip install -r ./requirements.txt
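
    Optionally, you can isolate the workshop dependencies in a virtual environment first, then run the pip command above inside it (a minimal sketch; the environment name venv is arbitrary):

    $ python -m venv venv
    $ source venv/bin/activate    # On Windows: venv\Scripts\activate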
  2. Errors (Windows)

    Missing Tk

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python3.6/site-packages/matplotlib/pyplot.py", line 115, in <module>
        _backend_mod, new_figure_manager, draw_if_interactive, _show = pylab_setup()
      File "/usr/local/lib/python3.6/site-packages/matplotlib/backends/__init__.py", line 32, in pylab_setup
        globals(),locals(),[backend_name],0)
      File "/usr/local/lib/python3.6/site-packages/matplotlib/backends/backend_tkagg.py", line 6, in <module>
        from six.moves import tkinter as Tk
      File "/usr/local/lib/python3.6/site-packages/six.py", line 92, in __get__
        result = self._resolve()
      File "/usr/local/lib/python3.6/site-packages/six.py", line 115, in _resolve
        return _import_module(self.mod)
      File "/usr/local/lib/python3.6/site-packages/six.py", line 82, in _import_module
        __import__(name)
      File "/usr/local/lib/python3.6/tkinter/__init__.py", line 36, in <module>
        import _tkinter # If this fails your Python may not be configured for Tk
    ModuleNotFoundError: No module named '_tkinter'

    Solution: Reinstall Python and make sure the Tcl/Tk component is selected during installation
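
    You can verify that Tk is available by launching Python's built-in test window:

    $ python -m tkinter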

    Microsoft Visual C++ 14.0

    running build_ext
    building 'twisted.test.raiser' extension
    error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

    Solution:

    1. Manually download the Twisted wheel from here

    2. Run

    $ pip install Twisted-18.9.0-cp37-cp37m-win32.whl

    Missing pywin32

    ModuleNotFoundError: No module named 'pywin32'

    Solution:

    $ pip install pywin32

    OR

    $ pip install pypiwin32

Scraping Workshop

  1. First, create a scrapy project

    $ scrapy startproject sasxstc
  2. We will start by scraping this website

  3. Create a new file in projectdir/sasxstc/sasxstc/spiders/SampleSpider.py

  4. Insert the boilerplate code below into SampleSpider.py

    import scrapy
    
    
    class SampleSpider(scrapy.Spider):
        name = "sample"
        start_urls = [
            'https://sunwaytechclub.github.io/2018-SASxSTCWorkshop/1.html'
        ]
    
        '''
        Called after every request
        This is where your scraping code should be
        '''
        def parse(self, response):
            result = response.body
            return {"result": result}
  5. Run the spider

    $ scrapy crawl sample
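
    Once your parse method returns JSON-serializable values, you can also write the scraped items to a file with Scrapy's feed export flag:

    $ scrapy crawl sample -o output.json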
  6. If you get the following error:

      File "/home/gaara/.virtualenvs/sasxstc/lib/python3.7/site-packages/twisted/conch/manhole.py", line 154
        def write(self, data, async=False):
                                  ^
    SyntaxError: invalid syntax

    run

    $ pip install git+https://github.com/twisted/twisted.git@trunk
  7. XPath basics

    # Extract every html tag
    result = response.xpath('//html').extract()
    # Extract the body tag that is a direct child of html
    result = response.xpath('//html/body').extract()
    # Extract every body tag
    result = response.xpath('//body').extract()
    # Extract every p tag along the path html/body/div/p
    result = response.xpath('//html/body/div/p').extract()
    # Extract the text within every p tag
    result = response.xpath('//p/text()').extract()
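
    A convenient way to experiment with these selectors is Scrapy's interactive shell, which fetches a URL and gives you the same response object to play with:

    $ scrapy shell 'https://sunwaytechclub.github.io/2018-SASxSTCWorkshop/1.html'
    >>> response.xpath('//p/text()').extract()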
  8. XPath with id and class

    # Extract the div with id 'd01' (including everything inside it)
    result = response.xpath('//div[@id="d01"]').extract()
    # Extract the text of every p tag with class 'blue'
    result = response.xpath('//p[@class="blue"]/text()').extract()
    # Extract the word SAS
    result = response.xpath('//div[@id="d02"]/p[@class="red"]/text()').extract()
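
    Scrapy also supports CSS selectors, which are often shorter for id and class lookups; these are roughly equivalent to the XPath queries above:

    # Same as '//div[@id="d01"]'
    result = response.css('div#d01').extract()
    # Same as '//p[@class="blue"]/text()'
    result = response.css('p.blue::text').extract()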

Scraping the CIA Factbook

CIA stands for Central Intelligence Agency. The CIA World Factbook contains a wide range of country-level data, such as GDP and population growth rate. The list of available datasets can be found here.

  1. First, visit this URL and observe the structure of the page.

  2. Let's create an empty spider FactbookSpider.py

    from pprint import pprint
    import scrapy
    
    
    class FactbookSpider(scrapy.Spider):
        name = "factbook"
        start_urls = [
            'https://www.cia.gov/library/publications/the-world-factbook/rankorder/rankorderguide.html'
        ]
    
        '''
        Called after every request
        This is where your scraping code should be
        '''
    
        def parse(self, response):
            pass
  3. Time to get the links

    links = response.xpath('//body//div[@id="profileguide"]/div[@class="answer"]//a')
    for index, link in enumerate(links):
        text = link.xpath('text()').extract_first()
        link = link.xpath('@href').extract_first()
        print(text)
        print(link)
  4. Now, join the URL to make it absolute

    link = response.urljoin(link.xpath('@href').extract_first())
  5. Put the links into a results dict

    links = response.xpath('//body//div[@id="profileguide"]/div[@class="answer"]//a')
    results = {}
    for index, link in enumerate(links):
        text = link.xpath('text()').extract_first()
        link = response.urljoin(link.xpath('@href').extract_first())
        results[text] = link
    
    pprint(results)
  6. Crawl into one of the links

    def parse(self, response):
        links = response.xpath('//body//div[@id="profileguide"]/div[@class="answer"]//a')
        results = {}
        for index, link in enumerate(links):
            text = link.xpath('text()').extract_first()
            link = response.urljoin(link.xpath('@href').extract_first())
            results[text] = link
    
        yield scrapy.Request(
            results["Population growth rate:"],
            callback=self.parse_population,
            meta={"links": results}
        )
    
    def parse_population(self, response):
        meta = response.meta
        pprint(meta)
  7. Scrape the rows and store them in results

    rows = response.xpath('//div[@class="wfb-text-box"]//table[@id="rankOrder"]/tbody/tr')
    results = {}
    for index, row in enumerate(rows):
        if not row.xpath('@class').extract_first() == "rankHeading":
            id = row.xpath('@id').extract_first()
            name = row.xpath('td[@class="region"]//text()').extract_first()
            population_growth = row.xpath('td[3]/text()').extract_first()
            print(id + " " + name + " " + population_growth)
            results[id] = {
                "name": name,
                "population_growth_rate": population_growth
            }
  8. Do the same to extract the infant mortality rate

    def parse_population(self, response):
        meta = response.meta
        rows = response.xpath('//div[@class="wfb-text-box"]//table[@id="rankOrder"]/tbody/tr')
        results = {}
        for index, row in enumerate(rows):
            if not row.xpath('@class').extract_first() == "rankHeading":
                id = row.xpath('@id').extract_first()
                name = row.xpath('td[@class="region"]//text()').extract_first()
                population_growth = row.xpath('td[3]/text()').extract_first()
                results[id] = {
                    "name": name,
                    "population_growth_rate": population_growth
                }
        meta["results"] = results
        yield scrapy.Request(
            meta["links"]["Infant mortality rate:"],
            callback=self.parse_infant_mortality,
            meta=meta
        )
    
    def parse_infant_mortality(self, response):
        meta = response.meta
        results = meta["results"]
    
        rows = response.xpath('//div[@class="wfb-text-box"]//table[@id="rankOrder"]/tbody/tr')
        for index, row in enumerate(rows):
            if not row.xpath('@class').extract_first() == 'rankHeading':
                id = row.xpath('@id').extract_first()
                infant_mortality_rate = row.xpath('td[3]/text()').extract_first()
                results[id]["infant_mortality_rate"] = infant_mortality_rate
    
        return results
  9. Getting an error?

    Traceback (most recent call last):
      File "/home/gaara/.virtualenvs/sasxstc/lib/python3.7/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "/home/gaara/Desktop/2018-SASxSTCWorkshop/sasxstc/sasxstc/spiders/FactbookSpider.py", line 60, in parse_gdp
        results[id]["infant_mortality_rate"] = infant_mortality_rate
    KeyError: 'kv'

    The infant mortality table contains country codes (such as 'kv') that never appeared in the population table, so surround the assignment with try and except

    try:
        results[id]["infant_mortality_rate"] = infant_mortality_rate
    except KeyError:
        pass
  10. Now you have the Factbook data with you!

  11. However, before we move forward, let's make use of Scrapy's item pipelines

    Add a field to projectdir/sasxstc/sasxstc/items.py

    class SasxstcItem(scrapy.Item):
        # define the fields for your item here like:
        results = scrapy.Field()

    Uncomment the following lines in projectdir/sasxstc/sasxstc/settings.py

    # ITEM_PIPELINES = {
    #    'sasxstc.pipelines.SasxstcPipeline': 300,
    #}

    Import the item in projectdir/sasxstc/sasxstc/spiders/FactbookSpider.py

    from sasxstc.items import SasxstcItem

    Replace the last line (return results) of projectdir/sasxstc/sasxstc/spiders/FactbookSpider.py with the following

    # return results
    item = SasxstcItem()
    item["results"] = results
    return item
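
    Run the spider again; the returned item is now passed to the pipeline's process_item method:

    $ scrapy crawl factbook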
  12. You are good to go now!

Analytics

  1. Go to projectdir/sasxstc/sasxstc/pipelines.py

  2. Add imports

    import pandas
    import seaborn
    from matplotlib import pyplot
    from pprint import pprint
    from scipy import stats
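
    The snippets in the next few steps go inside the pipeline's process_item method. For reference, a minimal sketch of the surrounding class (assuming the default template generated by scrapy startproject):

    class SasxstcPipeline(object):
        def process_item(self, item, spider):
            # The code from steps 3 to 10 goes here
            return item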
  3. Separate the results into different lists

    results = item["results"]
    country_name = []
    population_growth = []
    infant_mortality = []

    for country_code in list(results.keys()):
        country_name.append(results[country_code]["name"])
        population_growth.append(float(results[country_code]["population_growth_rate"]))
        try:
            infant_mortality.append(float(results[country_code]["infant_mortality_rate"]))
        except KeyError:
            infant_mortality.append(None)
  4. Put the data into a Pandas DataFrame

    data = pandas.DataFrame(
        {
            "infant_mortality": infant_mortality,
            "population_growth": population_growth
        },
        index=country_name
    )

    pprint(data)
  5. Run it and see what the data looks like

  6. Drop the rows with empty fields

    data = data.dropna(how='any')
    pprint(data)
  7. Plot the graph

    seaborn.jointplot(x="infant_mortality", y="population_growth", data=data, kind="reg")
    
    pyplot.show()
  8. Run it!

  9. Now, add the R and P values

    seaborn.jointplot(x="infant_mortality", y="population_growth", data=data,
                      kind="reg", stat_func=stats.pearsonr)
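
    Note: stat_func has been deprecated in newer seaborn releases; if your version rejects it, a manual alternative is to compute Pearson's r with scipy and annotate the joint axes yourself (a sketch, assuming the data DataFrame from the earlier steps):

    r_value, p_value = stats.pearsonr(data["infant_mortality"], data["population_growth"])
    grid = seaborn.jointplot(x="infant_mortality", y="population_growth", data=data, kind="reg")
    # ax_joint is the central scatter/regression axes of the JointGrid
    grid.ax_joint.annotate("r = {0:.2f}, p = {1:.2g}".format(r_value, p_value),
                           xy=(0.05, 0.95), xycoords='axes fraction')
    pyplot.show()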
  10. Add the equation of the regression line

    slope, intercept, r_value, p_value, std_err = stats.linregress(data["infant_mortality"].tolist(), data["population_growth"].tolist())
    
    seaborn.jointplot(x="infant_mortality", y="population_growth", data=data,
                      kind="reg", stat_func=stats.pearsonr)
    pyplot.annotate("y={0:.1f}x+{1:.1f}".format(slope, intercept), xy=(0.05, 0.95), xycoords='axes fraction')
    pyplot.show()
  11. And, you are done!