Documentation

`nbdtools` Package

This package includes tools to predict and plot neighborhoods.

`nbdpred` Module

class nbdtools.nbdpred.NbdPred(loc_and_n)

Bases: object

A neighborhood predictor built from a list of places whose neighborhoods are known. Predictions use the nearest-neighbor method.

Parameters:	loc_and_n (list) – the data used to build a predictor, as a list of places. Each place is a list of two floats and a str; the two floats are the location of the place, and the str is the neighborhood the place belongs to.

We use the following example throughout.

>>> from nbdtools.nbdpred import NbdPred
>>> loc_and_n = [[0, 0, 'A'], [0, 1, 'A'], [2, 0, 'B'], [2, 1, 'B']] 
>>> npred = NbdPred(loc_and_n)
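
To make the nearest-neighbor idea concrete, here is a minimal pure-Python sketch of how a label could be chosen for a new location. The function name `nearest_neighborhood` is hypothetical and is not part of `nbdtools`; it assumes plain Euclidean distance on the two coordinates.

```python
import math


def nearest_neighborhood(loc_and_n, point):
    """Return the neighborhood of the known place closest to `point`.

    Hypothetical helper for illustration only; not part of nbdtools.
    """
    # Each place is [x, y, label]; pick the one with the smallest
    # Euclidean distance to the query point.
    closest = min(
        loc_and_n,
        key=lambda place: math.hypot(place[0] - point[0], place[1] - point[1]),
    )
    return closest[2]


loc_and_n = [[0, 0, 'A'], [0, 1, 'A'], [2, 0, 'B'], [2, 1, 'B']]
label_near_a = nearest_neighborhood(loc_and_n, (0, 2))  # closest place is [0, 1, 'A']
label_near_b = nearest_neighborhood(loc_and_n, (3, 0))  # closest place is [2, 0, 'B']
```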

Todo

Neighborhood graph

make_predictor(train_percent)

Split the data set into training and test sets, then return a nearest-neighbor predictor trained on the training set together with its classification rate on the test set.

Parameters:	train_percent (float) – the share of the data set that goes into the training set; the example below passes it as a fraction (0.5)

Returns:	a nearest neighbor predictor and its classification rate
Return type:	`sklearn.neighbors.KNeighborsClassifier`, float

>>> from nbdtools.nbdpred import NbdPred
>>> loc_and_n = [[0, 0, 'A'], [0, 1, 'A'], [2, 0, 'B'], [2, 1, 'B']] 
>>> npred = NbdPred(loc_and_n)
>>> nnclassifier, classrate = npred.make_predictor(train_percent=0.5)
>>> print classrate
1.0
>>> print nnclassifier.predict([0,2])
['A']
>>> print nnclassifier.predict([3,0])
['B']
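
The split-and-score step can be sketched without scikit-learn as follows. This is an illustration under stated assumptions, not the package's implementation: `split_places` and `classification_rate` are hypothetical names, the split is a seeded shuffle, and the classification rate is taken to be the fraction of test places whose nearest training place has the same label.

```python
import math
import random


def predict(train, point):
    """Label of the training place nearest to `point` (Euclidean distance)."""
    closest = min(
        train,
        key=lambda p: math.hypot(p[0] - point[0], p[1] - point[1]),
    )
    return closest[2]


def split_places(loc_and_n, train_fraction, seed=0):
    """Shuffle the places and cut them into (train, test) lists."""
    places = list(loc_and_n)
    random.Random(seed).shuffle(places)
    n_train = int(round(train_fraction * len(places)))
    return places[:n_train], places[n_train:]


def classification_rate(train, test):
    """Fraction of test places whose predicted label matches the true one."""
    correct = sum(1 for x, y, label in test if predict(train, (x, y)) == label)
    return correct / float(len(test))


loc_and_n = [[0, 0, 'A'], [0, 1, 'A'], [2, 0, 'B'], [2, 1, 'B']]
train, test = split_places(loc_and_n, train_fraction=0.5)
rate = classification_rate(train, test)  # depends on which places land in `train`
```

With a data set this small the rate depends heavily on the split: if both training places come from the same neighborhood, every test place from the other neighborhood is misclassified.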

neighborhoods = None: The set of neighborhoods.

neighborhoods_list = None: The list of neighborhoods.

num_neighborhoods = None: The number of neighborhoods.

`datatools` Package

This package includes tools to analyze neighborhood data.

`nbddataframe` Module

class datatools.nbddataframe.NBDDataFrame(df, min_lat=47.5, max_lat=47.75, min_long=-122.44, max_long=-122.2, debug=False)

Bases: object

A neighborhood data cleaner and preliminary analyzer.

Parameters:

df (pandas.DataFrame) – the neighborhood data; it should have at least latitude, longitude and date columns
min_lat (float) – the minimum latitude considered, default is minlat (47.5)
max_lat (float) – the maximum latitude considered, default is maxlat (47.75)
min_long (float) – the minimum longitude considered, default is minlong (-122.44)
max_long (float) – the maximum longitude considered, default is maxlong (-122.2)
debug (bool) – if True, produce verbose output; default is False

Raises Exception:
	if df does not have the required columns

For example, we read a test DataFrame into an NBDDataFrame.

>>> from datatools.nbddataframe import testdataframe, NBDDataFrame
>>> nbddf = NBDDataFrame(testdataframe, debug=True)
The DataFrame is in the correct format

An exception is raised if the DataFrame is not in the correct format (in this case, a date column is missing).

>>> nbddf = NBDDataFrame(testdataframe[['latitude', 'longitude']])
Traceback (most recent call last):
    ...
Exception: DataFrame format error
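
The format check can be pictured as a test for the three required column names. The sketch below is an assumption about how such a check might look, not the constructor's actual code; `check_columns` and `REQUIRED_COLUMNS` are hypothetical names, and only the error message is taken from the example above.

```python
# Column names the documentation says the DataFrame must contain.
REQUIRED_COLUMNS = ('latitude', 'longitude', 'date')


def check_columns(columns):
    """Raise if any required column name is absent from `columns`.

    Hypothetical helper for illustration; NBDDataFrame may check differently.
    """
    missing = [name for name in REQUIRED_COLUMNS if name not in columns]
    if missing:
        raise Exception('DataFrame format error')
    return True


ok = check_columns(['val', 'latitude', 'longitude', 'date'])
```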

To print missing data info,

>>> nbddf = NBDDataFrame(testdataframe)
>>> print nbddf.print_info()
The number of rows is 13.
2 rows are missing val.
1 rows are missing longitude.
2 rows have out-of-bounds location.

To remove rows with missing location or date data,

>>> nbddf.remove_missing_data()
>>> print nbddf.print_info()
The number of rows is 12.
2 rows are missing val.
2 rows have out-of-bounds location.

To remove rows with out-of-bounds locations,

>>> nbddf.remove_outofbounds_data()
>>> print nbddf.print_info()
The number of rows is 10.
2 rows are missing val.
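
For readers who want to see what the two cleaning steps amount to on a plain pandas DataFrame, here is a rough sketch. It assumes the column names `latitude`, `longitude` and `date`, and it treats the bounds as inclusive; neither the helper name `clean_locations` nor the inclusive comparison is confirmed by the package.

```python
import pandas as pd


def clean_locations(df, min_lat=47.5, max_lat=47.75,
                    min_long=-122.44, max_long=-122.2):
    """Drop rows with missing location/date, then rows outside the bounds.

    Hypothetical helper; bounds are assumed to be inclusive.
    """
    # Step 1: remove rows where latitude, longitude or date is missing.
    complete = df.dropna(subset=['latitude', 'longitude', 'date'])
    # Step 2: keep only rows that fall inside the bounding box.
    in_bounds = (
        complete['latitude'].between(min_lat, max_lat)
        & complete['longitude'].between(min_long, max_long)
    )
    return complete[in_bounds]


sample = pd.DataFrame({
    'latitude': [47.60, 47.40, float('nan'), 47.65],
    'longitude': [-122.37, -122.34, -122.30, -122.30],
    'date': pd.to_datetime(
        ['2000-01-01', '1986-06-20', '1987-10-13', '2000-01-01']),
})
cleaned = clean_locations(sample)  # row 1 is out of bounds, row 2 lacks a latitude
```

Unlike `remove_missing_data()` and `remove_outofbounds_data()`, this sketch returns a new DataFrame rather than modifying an object in place.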

Preliminary plots:

>>> nbddf.plot_rowcount_by_month()
>>> nbddf.plot_map()

Todo

function to add neighborhoods
cleaning functions
grouping functions
vis/analysis functions
normalization functions
Use shapefiles to get nbd instead of nearest-nbr predictor.

get_df()

Get the underlying DataFrame.

Returns:	the underlying DataFrame
Return type:	pandas.DataFrame

plot_map(df=None, filename='row_locations_map.png')

Plot the row locations on a map.

Parameters:

df (DataFrame) – the data to plot; if None, this object’s underlying DataFrame is used, default is None
filename (str) – the name of the file with the plot, default is “row_locations_map.png”

plot_rowcount_by_month(df=None, filename='rowcount_by_month.png')

Plot the number of rows by month.

Parameters:

df (DataFrame) – the data to plot; if None, this object’s underlying DataFrame is used, default is None
filename (str) – the name of the file with the plot, default is “rowcount_by_month.png”
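
The quantity behind a row-count-by-month plot can be computed with a simple tally. The sketch below is only an illustration: `rowcount_by_month` is a hypothetical function, and grouping by (year, month) rather than by month of the year alone is an assumption about what the plot shows.

```python
from collections import Counter
from datetime import date


def rowcount_by_month(dates):
    """Count rows per (year, month), sorted chronologically.

    Hypothetical helper; the real plot may group differently.
    """
    counts = Counter((d.year, d.month) for d in dates)
    return sorted(counts.items())


dates = [
    date(2000, 1, 1), date(1986, 6, 20), date(1987, 10, 13),
    date(2000, 1, 1), date(2000, 1, 15),
]
monthly = rowcount_by_month(dates)
# monthly == [((1986, 6), 1), ((1987, 10), 1), ((2000, 1), 3)]
```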

print_info()

Print the number of rows, number of missing values for each column, and number of rows with location out of bounds.

Returns:	the above information
Return type:	str

remove_missing_data(): Remove rows with missing location or date data.

remove_outofbounds_data(): Remove rows with out-of-bounds locations.

datatools.nbddataframe.get_csv_data(filename, nbdname=None, latname='latitude', longname='longitude', datename='date', sep='\t')

Read neighborhood data from a csv into a pandas.DataFrame compatible with NBDDataFrame. The csv data should have at least latitude and longitude columns, named latname and longname, and a date column named datename. A neighborhood column named nbdname is optional.

Parameters:

filename (str) – the name of the csv file
nbdname (str) – (optional) the name of the neighborhood column if there is one, default is None
latname (str) – the name of the latitude column, default is ‘latitude’
longname (str) – the name of the longitude column, default is ‘longitude’
datename (str) – the name of the date column, default is ‘date’
sep (str) – the separation character, default is tab

Returns:	a pandas.DataFrame compatible with `NBDDataFrame`
Return type:	pandas.DataFrame

For example,

>>> from datatools.nbddataframe import testdata, get_csv_data
>>> fil = open('test.csv', 'w') #make a test csv file
>>> fil.write(testdata) #write data randomly generated for test purposes 
>>> fil.close()
>>> df = get_csv_data(
...                   filename='test.csv', nbdname='neighborhood',
...                   latname='lat', longname='lon', datename='date', 
...                   sep=' '
... )
>>> df.head()
        val   latitude   longitude      rand nbd       date
0  4.076444  47.600025 -122.373928  0.127659   B 2000-01-01
1  4.252051  47.400000 -122.341304  0.875592   A 1986-06-20
2  1.322697  47.634556 -122.344292  0.365462   B 1987-10-13
3 -9.362756  47.650959 -122.299352  0.095750   B 2000-01-01
4       NaN  47.544586 -122.343625  0.171634   B 2000-01-01
>>> df.loc[0, 'nbd']
'B'
>>> df.mean().loc['latitude']
47.597802519907695
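
The column mapping that `get_csv_data` performs can be imitated with the standard `csv` module. This is a sketch under assumptions, not the function's implementation: `read_nbd_rows` is a hypothetical name, dates are assumed to be in `YYYY-MM-DD` form, rows are returned as dictionaries rather than a DataFrame, and missing values are not handled.

```python
import csv
import io
from datetime import datetime


def read_nbd_rows(text, nbdname=None, latname='latitude',
                  longname='longitude', datename='date', sep='\t'):
    """Parse delimited text into rows keyed by the standard column names.

    Hypothetical helper; assumes ISO dates and no missing values.
    """
    rows = []
    for raw in csv.DictReader(io.StringIO(text), delimiter=sep):
        row = {
            'latitude': float(raw[latname]),
            'longitude': float(raw[longname]),
            'date': datetime.strptime(raw[datename], '%Y-%m-%d'),
        }
        if nbdname is not None:
            # The neighborhood column is optional.
            row['nbd'] = raw[nbdname]
        rows.append(row)
    return rows


text = (
    "lat lon date neighborhood\n"
    "47.60 -122.37 2000-01-01 B\n"
    "47.40 -122.34 1986-06-20 A\n"
)
rows = read_nbd_rows(text, nbdname='neighborhood',
                     latname='lat', longname='lon', sep=' ')
```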

datatools.nbddataframe.get_db_data(engine, tablename='nbddata', nbdname=None, latname='latitude', longname='longitude', datename='date', index_col=None)

Read neighborhood data from a database into a pandas.DataFrame compatible with NBDDataFrame. The table should have at least latitude and longitude columns, named latname and longname, and a date column named datename. A neighborhood column named nbdname is optional.

Parameters:

engine (sqlalchemy.engine.Engine) – the database engine
tablename (str) – the name of the database table, default is ‘nbddata’
nbdname (str) – (optional) the name of the neighborhood column if there is one, default is None
latname (str) – the name of the latitude column, default is ‘latitude’
longname (str) – the name of the longitude column, default is ‘longitude’
datename (str) – the name of the date column, default is ‘date’
index_col (str) – the name of the index column if there is one, default is None

Returns:	a DataFrame compatible with `NBDDataFrame`
Return type:	pandas.DataFrame

For example,

>>> from datatools.nbddataframe import testdataframe, NBDDataFrame
>>> from datatools.nbddataframe import make_db, get_db_data
>>> neigh_dataframe = NBDDataFrame(testdataframe)
>>> engine = make_db(nbddf=neigh_dataframe, tablename='neigh_data')
>>> df2 = get_db_data(engine=engine, tablename='neigh_data', 
...                   nbdname='nbd', latname='latitude', 
...                   longname='longitude', datename='date'
... )
>>> df2.loc[1, 'nbd']
u'A'

datatools.nbddataframe.make_db(nbddf, dbname=None, tablename='nbddata')

Write NBDDataFrame data into a SQLite database.

Parameters:

nbddf (NBDDataFrame) – the `NBDDataFrame` object containing the data
dbname (str) – the name of the database; if no name is given, the database will be in-memory-only, default is None
tablename (str) – the name of the table, default is ‘nbddata’

Returns:	an engine to the database
Return type:	sqlalchemy.engine.Engine

For example,

>>> from datatools.nbddataframe import testdataframe, NBDDataFrame
>>> from datatools.nbddataframe import make_db
>>> neigh_dataframe = NBDDataFrame(testdataframe)
>>> engine = make_db(nbddf=neigh_dataframe, tablename='neigh_data')
>>> con = engine.connect()
>>> res = con.execute("select latitude from neigh_data where nbd = 'A'")
>>> data = res.fetchall()
>>> data[0][0]
47.4
>>> con.close()
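
The in-memory behavior when no database name is given can be illustrated with the standard-library `sqlite3` module in place of SQLAlchemy. This is a substitute technique for illustration only: `make_sqlite_table` is a hypothetical helper, the four-column schema is assumed, and a raw connection is returned instead of an engine.

```python
import sqlite3


def make_sqlite_table(rows, dbname=None, tablename='nbddata'):
    """Write (latitude, longitude, date, nbd) tuples into a SQLite table.

    Hypothetical helper using sqlite3 directly, not SQLAlchemy.
    """
    # With no name, SQLite keeps the database in memory only.
    con = sqlite3.connect(dbname if dbname is not None else ':memory:')
    # The table name is interpolated, so it must come from trusted code.
    con.execute(
        'CREATE TABLE {} (latitude REAL, longitude REAL, date TEXT, nbd TEXT)'
        .format(tablename)
    )
    con.executemany(
        'INSERT INTO {} VALUES (?, ?, ?, ?)'.format(tablename), rows
    )
    con.commit()
    return con


rows = [
    (47.60, -122.37, '2000-01-01', 'B'),
    (47.40, -122.34, '1986-06-20', 'A'),
]
con = make_sqlite_table(rows, tablename='neigh_data')
res = con.execute("SELECT latitude FROM neigh_data WHERE nbd = 'A'")
lat_a = res.fetchall()[0][0]
con.close()
```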

datatools.nbddataframe.maxlat = 47.75: The default maximum latitude.

datatools.nbddataframe.maxlong = -122.2: The default maximum longitude.

datatools.nbddataframe.minlat = 47.5: The default minimum latitude.

datatools.nbddataframe.minlong = -122.44: The default minimum longitude.

datatools.nbddataframe.rename_cols(df, nbdname=None, latname='latitude', longname='longitude', datename='date')

Rename the latitude, longitude, date and (if there is one) neighborhood columns so that the DataFrame df is compatible with NBDDataFrame. The date column is also converted to a time type.

Parameters:

df (pandas.DataFrame) – the DataFrame whose columns are renamed
nbdname (str) – (optional) the name of the neighborhood column if there is one, default is None
latname (str) – the name of the latitude column, default is ‘latitude’
longname (str) – the name of the longitude column, default is ‘longitude’
datename (str) – the name of the date column, default is ‘date’

Returns:	a pandas.DataFrame compatible with `NBDDataFrame`
Return type:	pandas.DataFrame
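
A renaming step of this kind might look roughly like the following in pandas. The sketch is an assumption about the behavior, not the function's source: `rename_columns` is a hypothetical name, and the target names `latitude`, `longitude`, `date` and `nbd` are inferred from the column headers in the `get_csv_data` example output.

```python
import pandas as pd


def rename_columns(df, nbdname=None, latname='latitude',
                   longname='longitude', datename='date'):
    """Return a copy of `df` with standard column names and a datetime date.

    Hypothetical helper; target names are inferred from the documentation.
    """
    mapping = {latname: 'latitude', longname: 'longitude', datename: 'date'}
    if nbdname is not None:
        mapping[nbdname] = 'nbd'
    renamed = df.rename(columns=mapping)  # rename returns a new DataFrame
    renamed['date'] = pd.to_datetime(renamed['date'])
    return renamed


raw = pd.DataFrame({
    'lat': [47.6],
    'lon': [-122.37],
    'when': ['2000-01-01'],
    'neighborhood': ['B'],
})
tidy = rename_columns(raw, nbdname='neighborhood',
                      latname='lat', longname='lon', datename='when')
```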
