forked from swcarpentry/web-data-python
-
Notifications
You must be signed in to change notification settings - Fork 0
/
02-csv.html
152 lines (147 loc) · 9.41 KB
/
02-csv.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<title>Software Carpentry: Working With Data on the Web</title>
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap.css" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap-theme.css" />
<link rel="stylesheet" type="text/css" href="css/swc.css" />
<link rel="alternate" type="application/rss+xml" title="Software Carpentry Blog" href="http://software-carpentry.org/feed.xml"/>
<meta charset="UTF-8" />
<!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
</head>
<body class="lesson">
<div class="container card">
<div class="banner">
<a href="http://software-carpentry.org" title="Software Carpentry">
<img alt="Software Carpentry banner" src="img/software-carpentry-banner.png" />
</a>
</div>
<article>
<div class="row">
<div class="col-md-10 col-md-offset-1">
<h1 class="title">Working With Data on the Web</h1>
<h2 class="subtitle">Handling CSV Data</h2>
<div id="learning-objectives" class="objectives panel panel-warning">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-certificate"></span>Learning Objectives</h2>
</div>
<div class="panel-body">
<ul>
<li>Explain what CSV is, and read CSV data sets.</li>
</ul>
</div>
</div>
<p>Our little program gets the data we want, but returns it as one long character string rather than as a list of numbers. There are two ways we could convert the former to the latter:</p>
<ul>
<li>Write a function to split that string on newline characters to create lines, then split the lines on commas and convert the second part of each to a number.</li>
<li>Use a couple of Python libraries to do this for us.</li>
</ul>
<p>Most experienced programmers would say that the second approach is easier, but “easy” is relative: using standard libraries is only more effective in practice if we know those libraries exist, and know enough about them to think about our problem in terms of what they can do.</p>
<p>Let’s give both methods a try. Here’s a small program to test the first approach:</p>
<pre class="sourceCode python"><code class="sourceCode python">input_data = <span class="st">'''1901,12.3</span>
<span class="st">1902,45.6</span>
<span class="st">1903,78.9'''</span>
as_lines = input_data.split(<span class="st">'</span><span class="ch">\n</span><span class="st">'</span>)
<span class="dt">print</span>(<span class="st">'input data as lines:'</span>)
<span class="dt">print</span>(as_lines)
<span class="kw">for</span> line in as_lines:
fields = line.split(<span class="st">','</span>) <span class="co"># turn '1901,12.3' into ['1901', '12.3']</span>
year = <span class="dt">int</span>(fields[<span class="dv">0</span>]) <span class="co"># turn the text '1901' into the integer 1901</span>
value = <span class="dt">float</span>(fields[<span class="dv">1</span>]) <span class="co"># turn the text '12.3' into the number 12.3</span>
<span class="dt">print</span>(year, <span class="st">':'</span>, value)</code></pre>
<pre class="output"><code>input data as lines:
['1901,12.3', '1902,45.6', '1903,78.9']
1901 : 12.3
1902 : 45.6
1903 : 78.9</code></pre>
<p>We start by defining a string in our program to use as input data so that we can easily check the correctness of our output. The first three lines of code turn this one multi-line string into a list of strings by splitting on the newline characters (which are written <code>\n</code> in our program). The <code>for</code> loop then extracts the year and value from each line by splitting the line on the comma and converting the digits to numbers.</p>
<div id="escape-sequences" class="callout panel panel-info">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pushpin"></span>Escape Sequences</h2>
</div>
<div class="panel-body">
<p>Programmers need a way to put quotes, double quotes, and other special characters in strings. To do this, they use <a href="reference.html#escape-sequence">escape sequences</a>: <code>\'</code> for a single quote, <code>\"</code> for a double quote, <code>\n</code> for a newline, and so on.</p>
</div>
</div>
<p>Now let’s have a look at how we could parse the data using standard Python libraries. The library we’ll use is called <code>csv</code>. It doesn’t read data itself: instead, it takes the lines read by something else and turns them into lists of values by splitting on commas. We will also need to split the string on line separators before <code>csv</code> can read it:</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="ch">import</span> csv
<span class="ch">import</span> io
data = <span class="st">u'''first,FIRST</span>
<span class="st">second,SECOND</span>
<span class="st">third,THIRD'''</span>
reader = io.StringIO(data)
wrapper = csv.reader(reader)
<span class="kw">for</span> record in wrapper:
<span class="dt">print</span>(record)</code></pre>
<pre class="output"><code>['first', 'FIRST']
['second', 'SECOND']
['third', 'THIRD']</code></pre>
<p>Putting it all together, we can get data for Canada like this:</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="ch">import</span> requests
<span class="ch">import</span> io
<span class="ch">import</span> csv
url = <span class="st">'http://climatedataapi.worldbank.org/climateweb/rest/v1/country/cru/tas/year/CAN.csv'</span>
response = requests.get(url)
<span class="kw">if</span> response.status_code != <span class="dv">200</span>:
<span class="dt">print</span>(<span class="st">'Failed to get data:'</span>, response.status_code)
<span class="kw">else</span>:
reader = io.StringIO(response.text)
wrapper = csv.reader(reader)
<span class="kw">for</span> record in wrapper:
year = <span class="dt">int</span>(record[<span class="dv">0</span>])
value = <span class="dt">float</span>(record[<span class="dv">1</span>])
<span class="dt">print</span>(year, <span class="st">':'</span>, value)</code></pre>
<pre class="error"><code>---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-6-da21db395042> in <module>()
7 wrapper = csv.reader(reader)
8 for record in wrapper:
----> 9 year = int(record[0])
10 value = float(record[1])
11 print year, ':', value
ValueError: invalid literal for int() with base 10: 'year'</code></pre>
<p>That error occurs because the first line of data is:</p>
<pre><code>year,data</code></pre>
<p>When we try to convert the string <code>'year'</code> to an integer, Python quite rightly complains. The fix is straightforward: we just need to ignore lines that start with the word <code>year</code>. And while we’re at it, we’ll put our results into a list instead of just printing them:</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="ch">import</span> requests
<span class="ch">import</span> io
<span class="ch">import</span> csv
url = <span class="st">'http://climatedataapi.worldbank.org/climateweb/rest/v1/country/cru/tas/year/CAN.csv'</span>
response = requests.get(url)
<span class="kw">if</span> response.status_code != <span class="dv">200</span>:
<span class="dt">print</span>(<span class="st">'Failed to get data:'</span>, response.status_code)
<span class="kw">else</span>:
reader = io.StringIO(response.text)
wrapper = csv.reader(reader)
results = []
<span class="kw">for</span> record in wrapper:
<span class="kw">if</span> record[<span class="dv">0</span>] != <span class="st">'year'</span>:
year = <span class="dt">int</span>(record[<span class="dv">0</span>])
value = <span class="dt">float</span>(record[<span class="dv">1</span>])
results.append([year, value])
<span class="dt">print</span>(<span class="st">'first five results'</span>)
<span class="dt">print</span>(results[:<span class="dv">5</span>])</code></pre>
<pre class="output"><code>first five results
[[1901, -7.67241907119751], [1902, -7.862711429595947], [1903, -7.910782814025879], [1904, -8.155729293823242], [1905, -7.547311305999756]]</code></pre>
</div>
</div>
</article>
<div class="footer">
<a class="label swc-blue-bg" href="http://software-carpentry.org">Software Carpentry</a>
<a class="label swc-blue-bg" href="https://github.com/swcarpentry/lesson-template">Source</a>
<a class="label swc-blue-bg" href="mailto:[email protected]">Contact</a>
<a class="label swc-blue-bg" href="LICENSE.html">License</a>
</div>
</div>
<!-- Javascript placed at the end of the document so the pages load faster -->
<script src="http://software-carpentry.org/v5/js/jquery-1.9.1.min.js"></script>
<script src="css/bootstrap/bootstrap-js/bootstrap.js"></script>
</body>
</html>