Documentation

nbdtools Package

This package includes tools to predict and plot neighborhoods.

nbdpred Module

class nbdtools.nbdpred.NbdPred(loc_and_n)

Bases: object

A nearest-neighbor neighborhood predictor, built from a list of places whose neighborhoods are known.

Parameters: loc_and_n (list) – the data used to build the predictor, as a list of places. Each place is a list of two floats and a str: the two floats are the location of the place, and the str is the neighborhood the place belongs to.

We use the following example throughout.

>>> from nbdtools.nbdpred import NbdPred
>>> loc_and_n = [[0, 0, 'A'], [0, 1, 'A'], [2, 0, 'B'], [2, 1, 'B']] 
>>> npred = NbdPred(loc_and_n)
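The nearest-neighbor rule that NbdPred is described as using can be sketched in plain Python. This is a minimal illustration of the idea, not the class's actual implementation; the helper name nearest_neighborhood is hypothetical.

```python
# Illustrative only: a hand-rolled 1-nearest-neighbor lookup over the
# same [x, y, neighborhood] triples that NbdPred accepts.
loc_and_n = [[0, 0, 'A'], [0, 1, 'A'], [2, 0, 'B'], [2, 1, 'B']]


def nearest_neighborhood(places, x, y):
    """Return the neighborhood of the known place closest to (x, y)."""
    # Squared Euclidean distance is enough for ranking candidates.
    closest = min(places, key=lambda p: (p[0] - x) ** 2 + (p[1] - y) ** 2)
    return closest[2]


print(nearest_neighborhood(loc_and_n, 0, 2))  # A
print(nearest_neighborhood(loc_and_n, 3, 0))  # B
```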

Todo

Neighborhood graph

make_predictor(train_percent)

Split the data set into training and test sets, return a nearest neighbor predictor trained on the training set and the classification rate on the test set.

Parameters: train_percent (float) – the share of the data set that goes into the training set, given as a fraction (0.5 in the example below)
Returns: a nearest neighbor predictor and its classification rate on the test set
Return type: sklearn.neighbors.KNeighborsClassifier, float
>>> from nbdtools.nbdpred import NbdPred
>>> loc_and_n = [[0, 0, 'A'], [0, 1, 'A'], [2, 0, 'B'], [2, 1, 'B']] 
>>> npred = NbdPred(loc_and_n)
>>> nnclassifier, classrate = npred.make_predictor(train_percent=0.5)
>>> print classrate
1.0
>>> print nnclassifier.predict([[0, 2]])
['A']
>>> print nnclassifier.predict([[3, 0]])
['B']
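The split, train, and score procedure described above can be approximated directly with scikit-learn. This sketch shows the general technique and is not make_predictor's actual code; the data, the fixed random_state, and the stratified split are choices made here so the result is reproducible. It is written for Python 3 and passes predict a 2-D array (one row per query location), which is what scikit-learn expects.

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: two well-separated groups of places.
places = [[0, y, 'A'] for y in range(4)] + [[10, y, 'B'] for y in range(4)]
X = [[x, y] for x, y, _ in places]   # locations
labels = [n for _, _, n in places]   # neighborhoods

# Keep half of the places for training and hold out the rest for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, train_size=0.5, random_state=0, stratify=labels
)

# A 1-nearest-neighbor classifier trained on the training half.
nnclassifier = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# Classification rate: the fraction of held-out places labeled correctly.
classrate = nnclassifier.score(X_test, y_test)
print(classrate)

# predict takes a 2-D array: one row per query location.
print(nnclassifier.predict([[0, 5], [11, 0]]))
```

Because the two groups are far apart relative to their spread, every held-out place is closer to a training place from its own group, whichever places end up in the training half.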
neighborhoods = None

The set of neighborhoods.

neighborhoods_list = None

The list of neighborhoods.

num_neighborhoods = None

The number of neighborhoods.

datatools Package

This package includes tools to analyze neighborhood data.

nbddataframe Module

class datatools.nbddataframe.NBDDataFrame(df, min_lat=47.5, max_lat=47.75, min_long=-122.44, max_long=-122.2, debug=False)

Bases: object

A neighborhood data cleaner and preliminary analyzer.

Parameters:
  • df (pandas.DataFrame) – the neighborhood data; it must have at least latitude, longitude and date columns
  • min_lat (float) – the minimum considered latitude, default is minlat (47.5)
  • max_lat (float) – the maximum considered latitude, default is maxlat (47.75)
  • min_long (float) – the minimum considered longitude, default is minlong (-122.44)
  • max_long (float) – the maximum considered longitude, default is maxlong (-122.2)
  • debug (bool) – if True, produce verbose output; default is False
Raises Exception:
 if df does not have the required columns

For example, we read a test DataFrame into an NBDDataFrame.

>>> from datatools.nbddataframe import testdataframe, NBDDataFrame
>>> nbddf = NBDDataFrame(testdataframe, debug=True)
The DataFrame is in the correct format

An exception is raised if the DataFrame is not in the correct format (in this case, a date column is missing).

>>> nbddf = NBDDataFrame(testdataframe[['latitude', 'longitude']])
Traceback (most recent call last):
    ...
Exception: DataFrame format error
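A format check of this kind can be sketched with pandas as follows. The function name check_format and the REQUIRED_COLUMNS tuple are hypothetical; only the three column names and the error message come from the description above, and the real class may validate differently.

```python
import pandas as pd

# Column names taken from the class description above.
REQUIRED_COLUMNS = ('latitude', 'longitude', 'date')


def check_format(df):
    """Raise an Exception if df lacks any required column."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise Exception('DataFrame format error')


good = pd.DataFrame({'latitude': [47.6], 'longitude': [-122.3],
                     'date': pd.to_datetime(['2000-01-01'])})
check_format(good)  # passes silently

try:
    check_format(good[['latitude', 'longitude']])  # date column dropped
except Exception as err:
    print(err)  # DataFrame format error
```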

To print missing data info,

>>> nbddf = NBDDataFrame(testdataframe)
>>> print nbddf.print_info()
The number of rows is 13.
2 rows are missing val.
1 rows are missing longitude.
2 rows have out-of-bounds location.

To remove rows with missing location or date data,

>>> nbddf.remove_missing_data()
>>> print nbddf.print_info()
The number of rows is 12.
2 rows are missing val.
2 rows have out-of-bounds location.

To remove rows with out-of-bounds locations,

>>> nbddf.remove_outofbounds_data()
>>> print nbddf.print_info()
The number of rows is 10.
2 rows are missing val.
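For readers who want to see what these two cleaning steps amount to, here is a minimal pandas sketch. It assumes the documented default bounds and uses made-up rows; it is not the code behind remove_missing_data or remove_outofbounds_data.

```python
import pandas as pd

# Default bounds documented for NBDDataFrame.
MIN_LAT, MAX_LAT = 47.5, 47.75
MIN_LONG, MAX_LONG = -122.44, -122.2

# Hypothetical rows: one missing longitude, one south of the bounds.
df = pd.DataFrame({
    'latitude': [47.60, 47.40, 47.63, 47.65],
    'longitude': [-122.37, -122.34, None, -122.30],
    'date': pd.to_datetime(['2000-01-01', '1986-06-20',
                            '1987-10-13', '2000-01-01']),
})

# Count missing values per column (the kind of summary print_info reports).
print(df.isnull().sum())

# Drop rows missing a location or date value.
cleaned = df.dropna(subset=['latitude', 'longitude', 'date'])

# Keep only rows whose location falls inside the bounding box.
in_bounds = (cleaned['latitude'].between(MIN_LAT, MAX_LAT)
             & cleaned['longitude'].between(MIN_LONG, MAX_LONG))
cleaned = cleaned[in_bounds]
print(len(cleaned))
```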

Preliminary plots:

>>> nbddf.plot_rowcount_by_month()
>>> nbddf.plot_map()

Todo

  • function to add neighborhoods
  • cleaning functions
  • grouping functions
  • vis/analysis functions
  • normalization functions
  • Use shapefiles to get nbd instead of nearest-nbr predictor.
get_df()

Get the underlying DataFrame.

Returns: the underlying DataFrame
Return type: pandas.DataFrame
plot_map(df=None, filename='row_locations_map.png')

Plot the row locations on a map.

Parameters:
  • df (DataFrame) – the data to plot: if None, then this object’s underlying DataFrame is used, default is None
  • filename (str) – the name of the file with the plot, default is “row_locations_map.png”
plot_rowcount_by_month(df=None, filename='rowcount_by_month.png')

Plot the number of rows by month.

Parameters:
  • df (DataFrame) – the data to plot: if None, then this object’s underlying DataFrame is used, default is None
  • filename (str) – the name of the file with the plot, default is “rowcount_by_month.png”
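The monthly row count behind such a plot can be computed with a pandas group-by. The sketch below uses made-up dates and only computes the counts; the plotting line is left as a comment because it assumes matplotlib is installed, and none of this is taken from the actual plot_rowcount_by_month implementation.

```python
import pandas as pd

# Hypothetical dates standing in for the DataFrame's date column.
df = pd.DataFrame({'date': pd.to_datetime(
    ['2000-01-01', '2000-01-15', '2000-02-03', '1987-10-13'])})

# Group rows by calendar month and count them.
rowcount = df.groupby(df['date'].dt.to_period('M')).size()
print(rowcount)

# With matplotlib installed, a bar chart could then be saved with, e.g.:
# rowcount.plot(kind='bar').get_figure().savefig('rowcount_by_month.png')
```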
print_info()

Print the number of rows, number of missing values for each column, and number of rows with location out of bounds.

Returns: the above information
Return type: str
remove_missing_data()

Remove rows with missing location or date data.

remove_outofbounds_data()

Remove rows with out-of-bounds locations.

datatools.nbddataframe.get_csv_data(filename, nbdname=None, latname='latitude', longname='longitude', datename='date', sep='\t')

Read neighborhood data from a CSV file into a pandas.DataFrame compatible with NBDDataFrame. The file must have at least latitude and longitude columns, named latname and longname, and a date column named datename. A neighborhood column named nbdname is optional.

Parameters:
  • filename (str) – the name of the csv file
  • nbdname (str) – (optional) the name of the neighborhood column if there is one, default is None
  • latname (str) – the name of the latitude column, default is ‘latitude’
  • longname (str) – the name of the longitude column, default is ‘longitude’
  • datename (str) – the name of the date column, default is ‘date’
  • sep (str) – the separation character, default is tab
Returns: a pandas.DataFrame compatible with NBDDataFrame
Return type: pandas.DataFrame

For example,

>>> from datatools.nbddataframe import testdata, get_csv_data
>>> fil = open('test.csv', 'w') #make a test csv file
>>> fil.write(testdata) #write data randomly generated for test purposes 
>>> fil.close()
>>> df = get_csv_data(
...                   filename='test.csv', nbdname='neighborhood',
...                   latname='lat', longname='lon', datename='date', 
...                   sep=' '
... )
>>> df.head()
        val   latitude   longitude      rand nbd       date
0  4.076444  47.600025 -122.373928  0.127659   B 2000-01-01
1  4.252051  47.400000 -122.341304  0.875592   A 1986-06-20
2  1.322697  47.634556 -122.344292  0.365462   B 1987-10-13
3 -9.362756  47.650959 -122.299352  0.095750   B 2000-01-01
4       NaN  47.544586 -122.343625  0.171634   B 2000-01-01
>>> df.loc[0, 'nbd']
'B'
>>> df.mean().loc['latitude']
47.597802519907695
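The underlying read step can be sketched with pandas.read_csv. This is an illustration under assumptions, not get_csv_data itself: the data is made up, it is read from an in-memory buffer rather than a file, and the columns keep their original names.

```python
import io
import pandas as pd

# Hypothetical space-separated data with non-default column names.
raw = ("lat lon date neighborhood\n"
       "47.60 -122.37 2000-01-01 B\n"
       "47.40 -122.34 1986-06-20 A\n")

# sep selects the delimiter; parse_dates converts the date column.
df = pd.read_csv(io.StringIO(raw), sep=' ', parse_dates=['date'])
print(df.dtypes)
print(df.loc[1, 'neighborhood'])  # A
```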
datatools.nbddataframe.get_db_data(engine, tablename='nbddata', nbdname=None, latname='latitude', longname='longitude', datename='date', index_col=None)

Read neighborhood data from a database into a pandas.DataFrame compatible with NBDDataFrame. The table must have at least latitude and longitude columns, named latname and longname, and a date column named datename. A neighborhood column named nbdname is optional.

Parameters:
  • engine (sqlalchemy.engine.Engine) – the database engine
  • tablename (str) – the name of the database table, default is ‘nbddata’
  • nbdname (str) – (optional) the name of the neighborhood column if there is one, default is None
  • latname (str) – the name of the latitude column, default is ‘latitude’
  • longname (str) – the name of the longitude column, default is ‘longitude’
  • datename (str) – the name of the date column, default is ‘date’
  • index_col (str) – the name of the index column if there is one, default is None
Returns: a DataFrame compatible with NBDDataFrame
Return type: pandas.DataFrame

For example,

>>> from datatools.nbddataframe import testdataframe, NBDDataFrame
>>> from datatools.nbddataframe import make_db, get_db_data
>>> neigh_dataframe = NBDDataFrame(testdataframe)
>>> engine = make_db(nbddf=neigh_dataframe, tablename='neigh_data')
>>> df2 = get_db_data(engine=engine, tablename='neigh_data', 
...                   nbdname='nbd', latname='latitude', 
...                   longname='longitude', datename='date'
... )
>>> df2.loc[1, 'nbd']
u'A'
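Reading a table into a DataFrame can also be tried without this package. The sketch below swaps in the standard-library sqlite3 module for the SQLAlchemy engine and uses pandas.read_sql_query; the rows are made up, and this is not how get_db_data is necessarily implemented.

```python
import sqlite3
import pandas as pd

# Build a small in-memory table (hypothetical rows).
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE neigh_data '
            '(latitude REAL, longitude REAL, date TEXT, nbd TEXT)')
con.executemany('INSERT INTO neigh_data VALUES (?, ?, ?, ?)', [
    (47.60, -122.37, '2000-01-01', 'B'),
    (47.40, -122.34, '1986-06-20', 'A'),
])

# Read the table back, converting the date column to datetimes.
df2 = pd.read_sql_query('SELECT * FROM neigh_data', con,
                        parse_dates=['date'])
con.close()
print(df2.loc[1, 'nbd'])  # A
```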
datatools.nbddataframe.make_db(nbddf, dbname=None, tablename='nbddata')

Write NBDDataFrame data into a SQLite database.

Parameters:
  • nbddf (NBDDataFrame) – the NBDDataFrame object containing the data
  • dbname (str) – the name of the database; if no name is given, the database will be in-memory-only, default is None
  • tablename (str) – the name of the table, default is ‘nbddata’
Returns: an engine to the database
Return type: sqlalchemy.engine.Engine

For example,

>>> from datatools.nbddataframe import testdataframe, NBDDataFrame
>>> from datatools.nbddataframe import make_db
>>> neigh_dataframe = NBDDataFrame(testdataframe)
>>> engine = make_db(nbddf=neigh_dataframe, tablename='neigh_data')
>>> con = engine.connect()
>>> res = con.execute("select latitude from neigh_data where nbd = 'A'")
>>> data = res.fetchall()
>>> data[0][0]
47.4
>>> con.close()
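The write step can be sketched with DataFrame.to_sql. As an assumption for illustration, this uses a plain sqlite3 connection instead of the SQLAlchemy engine that make_db returns, and a small made-up DataFrame instead of an NBDDataFrame.

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({'latitude': [47.60, 47.40],
                   'longitude': [-122.37, -122.34],
                   'nbd': ['B', 'A']})

# ':memory:' gives an in-memory-only database, like dbname=None above.
con = sqlite3.connect(':memory:')
df.to_sql('neigh_data', con, index=False)

# Query the new table with ordinary SQL.
rows = con.execute(
    "SELECT latitude FROM neigh_data WHERE nbd = 'A'").fetchall()
con.close()
print(rows[0][0])  # 47.4
```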
datatools.nbddataframe.maxlat = 47.75

The default maximum latitude.

datatools.nbddataframe.maxlong = -122.2

The default maximum longitude.

datatools.nbddataframe.minlat = 47.5

The default minimum latitude.

datatools.nbddataframe.minlong = -122.44

The default minimum longitude.

datatools.nbddataframe.rename_cols(df, nbdname=None, latname='latitude', longname='longitude', datename='date')

Rename the latitude, longitude, date and (if there is one) neighborhood columns so that the DataFrame df is compatible with NBDDataFrame. The date column is also converted to a time type.

Parameters:
  • df (pandas.DataFrame) – the DataFrame whose columns are renamed
  • nbdname (str) – (optional) the name of the neighborhood column if there is one, default is None
  • latname (str) – the name of the latitude column, default is ‘latitude’
  • longname (str) – the name of the longitude column, default is ‘longitude’
  • datename (str) – the name of the date column, default is ‘date’
Returns: a pandas.DataFrame compatible with NBDDataFrame
Return type: pandas.DataFrame
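The renaming and date conversion can be sketched with DataFrame.rename and pandas.to_datetime. The input column names ('lat', 'lon', 'when', 'neighborhood') are hypothetical, and the target names follow the examples in this documentation; rename_cols itself may differ in detail.

```python
import pandas as pd

# Hypothetical input with non-standard column names and string dates.
df = pd.DataFrame({'lat': [47.60], 'lon': [-122.37],
                   'when': ['2000-01-01'], 'neighborhood': ['B']})

# Map the source names onto the names used throughout this documentation.
renamed = df.rename(columns={'lat': 'latitude', 'lon': 'longitude',
                             'when': 'date', 'neighborhood': 'nbd'})

# Convert the date strings to pandas datetimes.
renamed['date'] = pd.to_datetime(renamed['date'])
print(list(renamed.columns))
```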