Clojure Data Analysis Cookbook

Looking to use Clojure for data analysis?

This book covers Incanter and Weka, and it even goes into creating data visualizations for the web with D3 and ClojureScript. It provides over 100 recipes, some short and some more extended.

Now out! Order this through Packt or Amazon.

Data

Throughout the book, I use a number of datasets. Some of these are standard datasets, some are from the UCI Machine Learning Repository, some from census.ire.org, some from other sources, and some I've put together myself. I've uploaded them all here for archiving and easy access. Here they all are, with a few notes about each:

2010 US Census Data

This data is downloaded from the Investigative Reporters and Editors Census dataset site. You can also download raw census data from the US Census Bureau.

all_160.P3.csv: This is race data (P3) from the census. This is a place-level summary (160), and I've merged this data for all states.
all_160_in_51.P3.csv: This is race data (P3) from the census. This is a place-level summary (160) for Virginia (51).
all_160_in_51.P35.csv: This is family counts (P35) from the census. This is a place-level summary (160) for Virginia (51).
census-race.json: This is the data from all_160.P3.csv, mentioned above, translated into JSON.
clusters.json: This is a graph of the clusters of the data from all_160.P3.csv, mentioned above. The clusters were generated with K-means from that dataset, aggregated by state. The JSON data structure represents the nodes and links (edges) in the graph, along with the aggregated data.
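
To make the nodes-and-links layout concrete, here is a minimal sketch in Python that turns such a structure into an adjacency map. The key names (`nodes`, `links`, `source`, `target`, `name`) follow the common D3 force-layout convention and are an assumption; the actual keys in clusters.json may differ, and the sample data is invented for illustration.

```python
import json
from collections import defaultdict

# Hypothetical sample in the nodes-and-links shape. The real clusters.json
# may use different key names and carries aggregated data on each node.
SAMPLE = """
{"nodes": [{"name": "cluster-0"}, {"name": "Virginia"}, {"name": "Maryland"}],
 "links": [{"source": 0, "target": 1}, {"source": 0, "target": 2}]}
"""


def adjacency(graph):
    """Map each node name to the sorted names of its neighbours."""
    names = [node["name"] for node in graph["nodes"]]
    neighbours = defaultdict(set)
    for link in graph["links"]:
        # Links refer to nodes by their index in the "nodes" array.
        a, b = names[link["source"]], names[link["target"]]
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {name: sorted(neighbours[name]) for name in names}


graph = json.loads(SAMPLE)
print(adjacency(graph))
```

Index-based links keep the file compact, which is why the sketch resolves indices to names before building the map.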

Abalone

This dataset is from the UCI Machine Learning Repository. It contains sex, age, and measurements of abalone. This can be used to predict the age from the abalone's physical measurements.

abalone.data: This is the data in CSV format.
abalone.json: The data from abalone.data formatted as JSON.
abalone.names: This is information about the data, including the fields and their ranges of values.

Accident Fatalities

This dataset was selected and downloaded from the US National Highway Traffic Safety Administration. This dataset includes the speed limit and other factors related to the accidents.

accident-fatalities.tsv

Chick Weights

This is from the Incanter datasets package. It's also found in the R datasets package.

chick-weight.json: This is the Incanter dataset converted to JSON.

Currencies and Exchange Rates

These are a couple of datasets used to illustrate working with semantic web data and web scraping.

currencies.ttl: This dataset is from Telegraphis, and it contains linked data with information about various currencies, such as names, ISO codes, and symbols.
x-rates-usd.html: This is a snapshot of a rates table from X-Rates.

Doctor Who Companions

This is a dataset that I've pulled together from Wikipedia listing the actors and companions from the British television program Doctor Who.

companions.clj: This is a set of Clojure forms that define the in-memory data for this information.
companions.txt: This is a list of the companions as CSV. It lists an identifier (usually the first name) for each and their first name.

FASTA datasets

FASTA files are used in bioinformatics to exchange nucleotide and peptide sequences. This is a small collection of them to use for testing a custom FASTA parser.

abc-transporter.fasta
dehydratase.fasta
elephas.fasta
maltophilia.fasta
mchu.fasta
ovax-chick.fasta
salmonella.fasta
seqeuence-1.fasta
sequences.fasta
transferase.fasta
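
For orientation, a FASTA file is a sequence of records, each a header line starting with `>` followed by one or more lines of sequence data. The following is a minimal sketch of a parser in Python, not the parser from the book; the function name `parse_fasta` and the sample records are hypothetical.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may be wrapped across several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Invented sample data: one nucleotide record and one peptide record.
SAMPLE = """>seq1 hypothetical example record
ACGTAC
GTTA
>seq2
MKV
"""

records = list(parse_fasta(SAMPLE.splitlines()))
print(records)
```

The same function works on an open file handle, since it only needs an iterable of lines.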

IBM stock prices

This dataset was downloaded from Google Finance. It contains the prices of IBM stock between Nov 26, 2001 and Nov 23, 2012, a span of about eleven years.

ibm.csv

Ionosphere data

This dataset is from an antenna array in Labrador. It contains a number of measurements of free electrons in the ionosphere. This dataset can be found in the UCI Machine Learning Repository, but this copy is in Attribute-Relation File Format (ARFF) for use with Weka.

ionosphere.arff: This is pulled from the Weka distribution for easier access.

Iris

This is a standard dataset that's almost everywhere. We also use the copy that ships with Incanter several times in the book. For more information about this dataset, see its page at the UCI Machine Learning Repository.

iris.arff: This is pulled from the Weka distribution for easier access.

Mushroom

This is another standard dataset from the UCI Machine Learning Repository. This contains categorical data on mushrooms, including whether they're edible or poisonous.

agaricus-lepiota.data: The data file from the UCI web site.
agaricus-lepiota.names: Information about the data, including field names.
mushroom.arff: The same dataset packaged as an ARFF file for Weka.

TV-Related Sample Datasets

This is a series of datasets I threw together to illustrate loading different data formats.

small-sample-header.csv
small-sample-header.xls
small-sample-list.html
small-sample-table.html
small-sample.csv
small-sample.json
small-sample.sqlite
small-sample.xml

The Adventures of Sherlock Holmes

This text is from Project Gutenberg. It's a collection of Sherlock Holmes short stories written by Sir Arthur Conan Doyle.

pg1661.txt

Spelling Training Corpus

This is the training corpus used in Peter Norvig's article, "How to Write a Spelling Corrector."

big.txt

World Bank dataset

I downloaded this dataset about income inequality from the World Bank. It needed to be filtered and pivoted, and here is the final result.

world-bank-filtered.csv
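
As a rough illustration of what "filtered and pivoted" can mean for this kind of data, the sketch below takes long-format rows (one row per country, year, and indicator) and produces one row per country and year with a column per indicator. The column names, indicator names, and values are all invented for the example and do not reflect the actual World Bank download or the steps used to produce the file above.

```python
import csv
import io
from collections import defaultdict

# Invented long-format sample: one row per (country, year, indicator).
LONG = """country,year,indicator,value
AAA,2000,gini,40.1
AAA,2000,population,1000
AAA,2000,unused,7
AAA,2001,gini,39.5
BBB,2000,gini,33.0
"""


def filter_and_pivot(rows, keep):
    """Keep only the indicators in `keep`, then pivot them into columns."""
    wide = defaultdict(dict)
    for row in rows:
        # Filter: drop unwanted indicators and rows with missing values.
        if row["indicator"] in keep and row["value"] != "":
            key = (row["country"], row["year"])
            wide[key][row["indicator"]] = float(row["value"])
    # Pivot: one output row per (country, year), one column per indicator.
    return [
        {"country": country, "year": year, **values}
        for (country, year), values in sorted(wide.items())
    ]


rows = csv.DictReader(io.StringIO(LONG))
table = filter_and_pivot(rows, keep={"gini", "population"})
for record in table:
    print(record)
```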

This is a dataset on how much land is used for agriculture in China.

chn-land.csv

Delicious RSS Feed

This is a compressed subset of a scrape of a Delicious RSS feed. I can't find the original online anywhere anymore, so I'm putting it here.

delicious-rss-214k.json.xz

State of the Union dataset

This is a scraping of US Presidents' State of the Union (SOTU) addresses.

sotu.tar.gz

Flight Data

This is a compressed copy of data on US domestic flights from 1990–2009.

flights_with_colnames.csv.gz
