  1. {
  2. "metadata": {
  3. "name": "",
  4. "signature": "sha256:e9ce83ce359770d776e46f0cbff068bb20238ec1716dfdaa9f755145605adda4"
  5. },
  6. "nbformat": 3,
  7. "nbformat_minor": 0,
  8. "worksheets": [
  9. {
  10. "cells": [
  11. {
  12. "cell_type": "heading",
  13. "level": 1,
  14. "metadata": {},
  15. "source": [
  16. "Python for Data Science"
  17. ]
  18. },
  19. {
  20. "cell_type": "markdown",
  21. "metadata": {},
  22. "source": [
  23. "[Joe McCarthy](http://i...content-available-to-author-only...y.com/joe), \n",
  24. "*Director, Analytics & Data Science*, [Atigeo, LLC](http://a...content-available-to-author-only...o.com)"
  25. ]
  26. },
  27. {
  28. "cell_type": "code",
  29. "collapsed": false,
  30. "input": [
  31. "from IPython.display import display, Image, HTML"
  32. ],
  33. "language": "python",
  34. "metadata": {},
  35. "outputs": [],
  36. "prompt_number": 1
  37. },
  38. {
  39. "cell_type": "heading",
  40. "level": 3,
  41. "metadata": {},
  42. "source": [
  43. "Navigation"
  44. ]
  45. },
  46. {
  47. "cell_type": "markdown",
  48. "metadata": {},
  49. "source": [
  50. "Notebooks in this primer:\n",
  51. "\n",
  52. "1. [Introduction](1_Introduction.ipynb)\n",
  53. "2. **Data Science: Basic Concepts** (*you are here*)\n",
  54. "3. [Python: Basic Concepts](3_Python_Basic_Concepts.ipynb)\n",
  55. "4. [Using Python to Build and Use a Simple Decision Tree Classifier](4_Python_Simple_Decision_Tree.ipynb)\n",
  56. "5. [Next Steps](5_Next_Steps.ipynb)"
  57. ]
  58. },
  59. {
  60. "cell_type": "heading",
  61. "level": 2,
  62. "metadata": {},
  63. "source": [
  64. "2. Data Science: Basic Concepts"
  65. ]
  66. },
  67. {
  68. "cell_type": "heading",
  69. "level": 3,
  70. "metadata": {},
  71. "source": [
  72. "Data Science and Data Mining"
  73. ]
  74. },
  75. {
  76. "cell_type": "markdown",
  77. "metadata": {},
  78. "source": [
  79. "<a href=\"http://d...content-available-to-author-only...z.com/\"><img src=\"http://a...content-available-to-author-only...y.com/images/0636920028918/cat.gif\" style=\"margin: 0px 0px 5px 20px; width: 125px; float: right;\" title=\"Data Science for Business, by Provost and Fawcett\" alt=\"DataScienceForBusiness_cover.jpg\" /></a>\n",
  80. "Foster Provost and [Tom Fawcett](http://h...content-available-to-author-only...t.net/~tom.fawcett/public_html/index.html) offer succinct descriptions of data science and data mining in [Data Science for Business](http://d...content-available-to-author-only...z.com/):\n",
  81. "\n",
  82. "> **Data science** involves principles, processes and techniques for understanding phenomena via the (automated) analysis of data.\n",
  83. "> \n",
  84. "> **Data mining** is the extraction of knowledge from data, via technologies that incorporate these principles."
  85. ]
  86. },
  87. {
  88. "cell_type": "heading",
  89. "level": 3,
  90. "metadata": {},
  91. "source": [
  92. "Knowledge Discovery, Data Mining and Machine Learning"
  93. ]
  94. },
  95. {
  96. "cell_type": "markdown",
  97. "metadata": {},
  98. "source": [
  99. "Provost & Fawcett also offer some history and insights into the relationship between *data mining* and *machine learning*, terms which are often used somewhat interchangeably:\n",
  100. "\n",
  101. "> The field of Data Mining (or KDD: Knowledge Discovery and Data Mining) started as an offshoot of Machine Learning, and they remain closely linked. Both fields are concerned with the analysis of data to find useful or informative patterns. Techniques and algorithms are shared between the two; indeed, the areas are so closely related that researchers commonly participate in both communities and transition between them seamlessly. Nevertheless, it is worth pointing out some of the differences to give perspective.\n",
  102. "> \n",
  103. ">Speaking generally, because Machine Learning is concerned with many types of performance improvement, it includes subfields such as robotics and computer vision that are not part of KDD. It also is concerned with issues of agency and cognition \u2014 how will an intelligent agent use learned knowledge to reason and act in its environment \u2014 which are not concerns of Data Mining.\n",
  104. "> \n",
  105. ">Historically, KDD spun off from Machine Learning as a research field focused on concerns raised by examining real-world applications, and a decade and a half later the KDD community remains more concerned with applications than Machine Learning is. As such, research focused on commercial applications and business issues of data analysis tends to gravitate toward the KDD community rather than to Machine Learning. KDD also tends to be more concerned with the entire process of data analytics: data preparation, model learning, evaluation, and so on.\n"
  106. ]
  107. },
  108. {
  109. "cell_type": "heading",
  110. "level": 3,
  111. "metadata": {},
  112. "source": [
  113. "Cross Industry Standard Process for Data Mining (CRISP-DM)"
  114. ]
  115. },
  116. {
  117. "cell_type": "markdown",
  118. "metadata": {},
  119. "source": [
  120. "The [Cross Industry Standard Process for Data Mining](https://e...content-available-to-author-only...a.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining) introduced a process model for data mining in 2000 that has become widely adopted.\n",
  121. "\n",
  122. "<a href=\"https://e...content-available-to-author-only...a.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining\"><img src=\"https://u...content-available-to-author-only...a.org/wikipedia/commons/thumb/b/b9/CRISP-DM_Process_Diagram.png/479px-CRISP-DM_Process_Diagram.png\" title=\"Cross Industry Standard Process for Data Mining\" alt=\"CRISP-DM_Process_Diagram\" /></a>\n",
  123. "\n",
  124. "The model emphasizes the ***iterative*** nature of the data mining process, distinguishing several different stages that are regularly revisited in the course of developing and deploying data-driven solutions to business problems:\n",
  125. "\n",
  126. "* Business understanding\n",
  127. "* Data understanding\n",
  128. "* Data preparation\n",
  129. "* Modeling\n* Evaluation\n",
  130. "* Deployment\n",
  131. "\n",
  132. "We will be focusing primarily on using Python for **data preparation** and **modeling**."
  133. ]
  134. },
  135. {
  136. "cell_type": "heading",
  137. "level": 3,
  138. "metadata": {},
  139. "source": [
  140. "Data Science Workflow"
  141. ]
  142. },
  143. {
  144. "cell_type": "markdown",
  145. "metadata": {},
  146. "source": [
  147. "[Philip Guo](http://w...content-available-to-author-only...e.net/) presents a [Data Science Workflow](http://c...content-available-to-author-only...m.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext) that offers a slightly different process model, emphasizing the importance of **reflection** and some of the meta-data, data management and bookkeeping challenges that typically arise in the data science process. His 2012 PhD thesis, [Software Tools to Facilitate Research Programming](http://p...content-available-to-author-only...e.net/projects/pubs/guo_phd_dissertation.pdf), offers an insightful and more comprehensive description of many of these challenges.\n",
  148. "\n",
  149. "<a href=\"http://c...content-available-to-author-only...m.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext\"><img src=\"http://c...content-available-to-author-only...m.org/system/assets/0001/3678/rp-overview.jpg\" title=\"Data Science Workflow, by Philip Guo\" alt=\"pguo-data-science-overview.jpg\" style=\"width: 500px\" /></a>"
  150. ]
  151. },
  152. {
  153. "cell_type": "markdown",
  154. "metadata": {},
  155. "source": [
  156. "Provost & Fawcett list a number of different tasks in which data science techniques are employed:\n",
  157. "\n",
  158. "* Classification and class probability estimation \n",
  159. "* Regression (aka value estimation) \n",
  160. "* Similarity matching \n",
  161. "* Clustering \n",
  162. "* Co-occurrence grouping (aka frequent itemset mining, association rule discovery, market-basket analysis) \n",
  163. "* Profiling (aka behavior description, fraud / anomaly detection) \n",
  164. "* Link prediction \n",
  165. "* Data reduction \n",
  166. "* Causal modeling \n",
  167. "\n",
  168. "We will be focusing primarily on **classification** and **class probability estimation** tasks, which are defined by Provost & Fawcett as follows:\n",
  169. "\n",
  170. "> *Classification* and *class probability estimation* attempt to predict, for each individual in a population, which of a (small) set of classes this individual belongs to. Usually the classes are mutually exclusive. An example classification question would be: \u201cAmong all the customers of MegaTelCo, which are likely to respond to a given offer?\u201d In this example the two classes could be called will respond and will not respond.\n",
  171. "\n",
  172. "To further simplify this primer, we will focus exclusively on **supervised** methods, in which the data is explicitly labeled with classes. There are also *unsupervised* methods that involve working with data in which there are no pre-specified class labels."
  173. ]
  174. },
  175. {
  176. "cell_type": "heading",
  177. "level": 3,
  178. "metadata": {},
  179. "source": [
  180. "Supervised Classification"
  181. ]
  182. },
  183. {
  184. "cell_type": "markdown",
  185. "metadata": {},
  186. "source": [
  187. "The [Natural Language Toolkit (NLTK) book](http://w...content-available-to-author-only...k.org/book) provides a diagram and succinct description (below, with italics and bold added for emphasis) of supervised classification:\n",
  188. "\n",
  189. "<a href=\"http://w...content-available-to-author-only...k.org/book/ch06.html\"><img src=\"http://w...content-available-to-author-only...k.org/images/supervised-classification.png\" title=\"Supervised Classification, from NLTK book, Chapter 6\" alt=\"nltk_ch06_supervised-classification.png\" style=\"width: 500px\" /></a>\n",
  190. "\n",
  191. "> *Supervised Classification*. (a) During *training*, a **feature extractor** is used to convert each **input value** to a **feature set**. These feature sets, which capture the basic information about each input that should be used to classify it, are discussed in the next section. Pairs of feature sets and **labels** are fed into the **machine learning algorithm** to generate a **model**. (b) During *prediction*, the same feature extractor is used to convert **unseen inputs** to feature sets. These feature sets are then fed into the model, which generates **predicted labels**."
  192. ]
  193. },
  194. {
  195. "cell_type": "heading",
  196. "level": 3,
  197. "metadata": {},
  198. "source": [
  199. "Data Mining Terminology"
  200. ]
  201. },
  202. {
  203. "cell_type": "markdown",
  204. "metadata": {},
  205. "source": [
  206. "* **Structured** data has simple, well-defined patterns (e.g., a table or graph)\n",
  207. "* **Unstructured** data has less well-defined patterns (e.g., text, images)\n",
  208. "* **Model**: a pattern that captures / generalizes regularities in data (e.g., an equation, set of rules, decision tree)\n",
  209. "* **Attribute** (aka *variable*, *feature*, *signal*, *column*): an element used in a model\n",
  210. "* **Example** (aka *instance*, *feature vector*, *row*): a representation of an entity being modeled\n",
  211. "* **Target attribute** (aka *dependent variable*, *class label*): the class / type / category of an entity being modeled"
  212. ]
  213. },
  214. {
  215. "cell_type": "heading",
  216. "level": 3,
  217. "metadata": {},
  218. "source": [
  219. "Data Mining Example: UCI Mushroom dataset"
  220. ]
  221. },
  222. {
  223. "cell_type": "markdown",
  224. "metadata": {},
  225. "source": [
  226. "The [Center for Machine Learning and Intelligent Systems](http://c...content-available-to-author-only...i.edu/) at the University of California, Irvine (UCI), hosts a [Machine Learning Repository](https://a...content-available-to-author-only...i.edu/ml/datasets.html) containing over 200 publicly available data sets.\n",
  227. "\n",
  228. "<a href=\"https://a...content-available-to-author-only...i.edu/ml/datasets/Mushroom\"><img src=\"https://a...content-available-to-author-only...i.edu/ml/assets/MLimages/Large73.jpg\" style=\"margin: 0px 0px 5px 20px; width: 125px; float: right;\" title=\"Mushrooms from Agaricus and Lepiota Family\" alt=\"mushroom\"/></a>\n",
  229. "We will use the [mushroom](https://a...content-available-to-author-only...i.edu/ml/datasets/Mushroom) data set, which forms the basis of several examples in Chapter 3 of the Provost & Fawcett data science book.\n",
  230. "\n",
  231. "The following description of the dataset is provided at the UCI repository:\n",
  232. "\n",
  233. ">This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525 [The Audubon Society Field Guide to North American Mushrooms, 1981]). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like \"leaflets three, let it be\" for Poisonous Oak and Ivy.\n",
  234. "> \n",
  235. "> **Number of Instances**: 8124\n",
  236. "> \n",
  237. "> **Number of Attributes**: 22 (all nominally valued)\n",
  238. "> \n",
  239. "> **Attribute Information**: (*classes*: edible=e, poisonous=p)\n",
  240. "> \n",
  241. "> 1. *cap-shape*: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s\n",
  242. "> 2. *cap-surface*: fibrous=f, grooves=g, scaly=y, smooth=s\n",
  243. "> 3. *cap-color*: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y\n",
  244. "> 4. *bruises?*: bruises=t, no=f\n",
  245. "> 5. *odor*: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s\n",
  246. "> 6. *gill-attachment*: attached=a, descending=d, free=f, notched=n\n",
  247. "> 7. *gill-spacing*: close=c, crowded=w, distant=d\n",
  248. "> 8. *gill-size*: broad=b, narrow=n\n",
  249. "> 9. *gill-color*: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y\n",
  250. "> 10. *stalk-shape*: enlarging=e, tapering=t\n",
  251. "> 11. *stalk-root*: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?\n",
  252. "> 12. *stalk-surface-above-ring*: fibrous=f, scaly=y, silky=k, smooth=s\n",
  253. "> 13. *stalk-surface-below-ring*: fibrous=f, scaly=y, silky=k, smooth=s\n",
  254. "> 14. *stalk-color-above-ring*: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y\n",
  255. "> 15. *stalk-color-below-ring*: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y\n",
  256. "> 16. *veil-type*: partial=p, universal=u\n",
  257. "> 17. *veil-color*: brown=n, orange=o, white=w, yellow=y\n",
  258. "> 18. *ring-number*: none=n, one=o, two=t\n",
  259. "> 19. *ring-type*: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z\n",
  260. "> 20. *spore-print-color*: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y\n",
  261. "> 21. *population*: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y\n",
  262. "> 22. *habitat*: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d\n",
  263. "> \n",
  264. "> **Missing Attribute Values**: 2480 of them (denoted by \"?\"), all for attribute #11.\n",
  265. "> \n",
  266. "> **Class Distribution**: -- edible: 4208 (51.8%) -- poisonous: 3916 (48.2%) -- total: 8124 instances\n",
  267. "\n",
  268. "The [data file](https://a...content-available-to-author-only...i.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data) associated with this dataset has one instance of a hypothetical mushroom per line, with abbreviations for the values of the class and each of the other 22 attributes separated by commas.\n",
  269. "\n",
  270. "Here is a sample line from the data file:\n",
  271. "\n",
  272. "p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\n",
  273. "\n",
  274. "This instance represents a mushroom with the following attribute values:\n",
  275. "\n",
  276. "*class*: edible=e, **poisonous=p**\n",
  277. "\n",
  278. "1. *cap-shape*: bell=b, conical=c, convex=x, flat=f, **knobbed=k**, sunken=s\n",
  279. "2. *cap-surface*: **fibrous=f**, grooves=g, scaly=y, smooth=s\n",
  280. "3. *cap-color*: **brown=n**, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y\n",
  281. "4. *bruises?*: bruises=t, **no=f**\n",
  282. "5. *odor*: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, **none=n**, pungent=p, spicy=s\n",
  283. "6. *gill-attachment*: attached=a, descending=d, **free=f**, notched=n\n",
  284. "7. *gill-spacing*: **close=c**, crowded=w, distant=d\n",
  285. "8. *gill-size*: broad=b, **narrow=n**\n",
  286. "9. *gill-color*: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, **white=w**, yellow=y\n",
  287. "10. *stalk-shape*: **enlarging=e**, tapering=t\n",
  288. "11. *stalk-root*: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, **missing=?**\n",
  289. "12. *stalk-surface-above-ring*: fibrous=f, scaly=y, **silky=k**, smooth=s\n",
  290. "13. *stalk-surface-below-ring*: fibrous=f, **scaly=y**, silky=k, smooth=s\n",
  291. "14. *stalk-color-above-ring*: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, **white=w**, yellow=y\n",
  292. "15. *stalk-color-below-ring*: **brown=n**, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y\n",
  293. "16. *veil-type*: **partial=p**, universal=u\n",
  294. "17. *veil-color*: brown=n, orange=o, **white=w**, yellow=y\n",
  295. "18. *ring-number*: none=n, **one=o**, two=t\n",
  296. "19. *ring-type*: cobwebby=c, **evanescent=e**, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z\n",
  297. "20. *spore-print-color*: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, **white=w**, yellow=y\n",
  298. "21. *population*: abundant=a, clustered=c, numerous=n, scattered=s, **several=v**, solitary=y\n",
  299. "22. *habitat*: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, **woods=d**\n",
  300. "\n",
  301. "Building a model with this data set will serve as a motivating example throughout much of this primer."
  302. ]
  303. },
  304. {
  305. "cell_type": "heading",
  306. "level": 3,
  307. "metadata": {},
  308. "source": [
  309. "Navigation"
  310. ]
  311. },
  312. {
  313. "cell_type": "markdown",
  314. "metadata": {},
  315. "source": [
  316. "Notebooks in this primer:\n",
  317. "\n",
  318. "1. [Introduction](1_Introduction.ipynb)\n",
  319. "2. **Data Science: Basic Concepts** (*you are here*)\n",
  320. "3. [Python: Basic Concepts](3_Python_Basic_Concepts.ipynb)\n",
  321. "4. [Using Python to Build and Use a Simple Decision Tree Classifier](4_Python_Simple_Decision_Tree.ipynb)\n",
  322. "5. [Next Steps](5_Next_Steps.ipynb)"
  323. ]
  324. }
  325. ],
  326. "metadata": {}
  327. }
  328. ]
  329. }
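The sample line from the mushroom data file, decoded by hand in the notebook above, can also be parsed programmatically. What follows is a minimal sketch, assuming the field order given in the quoted UCI description (class label first, then the 22 attributes). The names `ATTRIBUTE_NAMES`, `VALUE_NAMES`, `parse_instance` and `describe` are hypothetical helpers introduced here for illustration and do not come from the notebook. The abbreviation tables cover only four attributes to keep the example short.

```python
# Attribute names in file order, copied from the dataset description quoted
# in the notebook; the first field of every line is the class label.
ATTRIBUTE_NAMES = [
    "class", "cap-shape", "cap-surface", "cap-color", "bruises?", "odor",
    "gill-attachment", "gill-spacing", "gill-size", "gill-color",
    "stalk-shape", "stalk-root", "stalk-surface-above-ring",
    "stalk-surface-below-ring", "stalk-color-above-ring",
    "stalk-color-below-ring", "veil-type", "veil-color", "ring-number",
    "ring-type", "spore-print-color", "population", "habitat",
]

# Abbreviation tables for an illustrative subset of attributes (assumption:
# the remaining attributes would be filled in the same way).
VALUE_NAMES = {
    "class": {"e": "edible", "p": "poisonous"},
    "cap-shape": {"b": "bell", "c": "conical", "x": "convex",
                  "f": "flat", "k": "knobbed", "s": "sunken"},
    "odor": {"a": "almond", "l": "anise", "c": "creosote", "y": "fishy",
             "f": "foul", "m": "musty", "n": "none", "p": "pungent",
             "s": "spicy"},
    "stalk-root": {"b": "bulbous", "c": "club", "u": "cup", "e": "equal",
                   "z": "rhizomorphs", "r": "rooted", "?": "missing"},
}


def parse_instance(line):
    """Split one comma-separated line into an attribute-name -> code dict."""
    values = line.strip().split(",")
    if len(values) != len(ATTRIBUTE_NAMES):
        raise ValueError("expected %d fields, got %d"
                         % (len(ATTRIBUTE_NAMES), len(values)))
    return dict(zip(ATTRIBUTE_NAMES, values))


def describe(instance, attribute):
    """Return the spelled-out value if a table exists, else the raw code."""
    code = instance[attribute]
    return VALUE_NAMES.get(attribute, {}).get(code, code)


# The sample line shown in the notebook.
sample_line = "p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d"
instance = parse_instance(sample_line)

for attribute in ("class", "cap-shape", "odor", "stalk-root"):
    print(attribute, "=", describe(instance, attribute))
```

Keeping the raw single-letter codes in the dictionary and translating them only for display means the parsed instance stays close to the file format, which is convenient when the same values are later used as features for a classifier.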