Documentation
nbdtools Package
This package includes tools to predict and plot neighborhoods.
nbdpred Module
- class nbdtools.nbdpred.NbdPred(loc_and_n)
Bases: object
A neighborhood predictor class built from a list of places whose neighborhoods are known. The predictor is a nearest-neighbor classifier.
Parameters: loc_and_n (list) – the data used to build a predictor, as a list of places. Each place is a list of two floats and a str; the two floats are the location of the place, and the str is the neighborhood the place belongs to. We use the following example throughout.
>>> from nbdtools.nbdpred import NbdPred
>>> loc_and_n = [[0, 0, 'A'], [0, 1, 'A'], [2, 0, 'B'], [2, 1, 'B']]
>>> npred = NbdPred(loc_and_n)
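To make the nearest-neighbor idea concrete, here is a minimal standalone sketch of how a place could be assigned to the neighborhood of its closest known place. It is not the NbdPred implementation; the function name nearest_neighbor_label and the use of Euclidean distance are assumptions made for illustration.

```python
import math


def nearest_neighbor_label(loc_and_n, point):
    """Return the neighborhood of the known place closest to `point`.

    Hypothetical helper: each place is [x, y, neighborhood], and distance
    is plain Euclidean distance (an assumption, not taken from NbdPred).
    """
    closest = min(
        loc_and_n,
        key=lambda place: math.hypot(place[0] - point[0], place[1] - point[1]),
    )
    return closest[2]


loc_and_n = [[0, 0, 'A'], [0, 1, 'A'], [2, 0, 'B'], [2, 1, 'B']]
label_near_a = nearest_neighbor_label(loc_and_n, (0, 2))  # closest place is [0, 1, 'A']
label_near_b = nearest_neighbor_label(loc_and_n, (3, 0))  # closest place is [2, 0, 'B']
```

The two query points are the same ones used in the make_predictor example, so the labels can be compared directly.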
Todo
Neighborhood graph
- make_predictor(train_percent)
Split the data set into training and test sets, then return a nearest-neighbor predictor trained on the training set together with its classification rate on the test set.
Parameters: train_percent (float) – the share of the data set that goes into the training set, given as a fraction (0.5 in the example below)
Returns: a nearest-neighbor predictor and its classification rate
Return type: sklearn.neighbors.KNeighborsClassifier, float
>>> from nbdtools.nbdpred import NbdPred
>>> loc_and_n = [[0, 0, 'A'], [0, 1, 'A'], [2, 0, 'B'], [2, 1, 'B']]
>>> npred = NbdPred(loc_and_n)
>>> nnclassifier, classrate = npred.make_predictor(train_percent=0.5)
>>> print classrate
1.0
>>> print nnclassifier.predict([0,2])
['A']
>>> print nnclassifier.predict([3,0])
['B']
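The split-then-score procedure can be sketched without scikit-learn. The block below is an illustration only: split_and_score, its seed parameter, and the 1-nearest-neighbor rule are assumptions, and the real method returns a sklearn.neighbors.KNeighborsClassifier rather than a plain function.

```python
import math
import random


def split_and_score(loc_and_n, train_fraction, seed=0):
    """Split places into train/test sets and score a 1-nearest-neighbor rule.

    Hypothetical helper; returns (predict, classification_rate), where the
    rate is None when no places are held out for testing.
    """
    places = list(loc_and_n)
    random.Random(seed).shuffle(places)  # seeded so the split is reproducible
    n_train = max(1, int(round(train_fraction * len(places))))
    train, test = places[:n_train], places[n_train:]

    def predict(point):
        # Label of the training place nearest to `point` (Euclidean distance).
        nearest = min(
            train,
            key=lambda p: math.hypot(p[0] - point[0], p[1] - point[1]),
        )
        return nearest[2]

    if not test:
        return predict, None
    correct = sum(1 for x, y, label in test if predict((x, y)) == label)
    return predict, float(correct) / len(test)


loc_and_n = [[0, 0, 'A'], [0, 1, 'A'], [2, 0, 'B'], [2, 1, 'B']]
predict_half, rate_half = split_and_score(loc_and_n, 0.5)
predict_all, rate_all = split_and_score(loc_and_n, 1.0)
```

With only four places, the rate for a 50% split depends heavily on which places land in the training set, so a small data set like this one is useful for checking the mechanics rather than for judging accuracy.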
- neighborhoods = None
The set of neighborhoods.
- neighborhoods_list = None
The list of neighborhoods.
- num_neighborhoods = None
The number of neighborhoods.
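These three attributes default to None and are presumably filled in from loc_and_n. The source does not say how, so the following is only a sketch of one plausible derivation; in particular, sorting the list is an assumption.

```python
loc_and_n = [[0, 0, 'A'], [0, 1, 'A'], [2, 0, 'B'], [2, 1, 'B']]

# Distinct neighborhood labels found in the third field of each place.
neighborhoods = set(place[2] for place in loc_and_n)
# A list with a stable order (sorting is an assumption made here).
neighborhoods_list = sorted(neighborhoods)
num_neighborhoods = len(neighborhoods)
```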
datatools Package
This package includes tools to analyze neighborhood data.
nbddataframe Module
- class datatools.nbddataframe.NBDDataFrame(df, min_lat=47.5, max_lat=47.75, min_long=-122.44, max_long=-122.2, debug=False)
Bases: object
A neighborhood data cleaner and preliminary analyzer.
Parameters: - df (pandas.DataFrame) – the neighborhood data: at the least, it should have latitude, longitude and date columns
- min_lat (float) – the minimum considered latitude, default is minlat
- max_lat (float) – the maximum considered latitude, default is maxlat
- min_long (float) – the minimum considered longitude, default is minlong
- max_long (float) – the maximum considered longitude, default is maxlong
- debug (bool) – if True, produce verbose output; default is False
Raises Exception: if df does not have the required columns
For example, we read a test DataFrame into an NBDDataFrame.
>>> from datatools.nbddataframe import testdataframe, NBDDataFrame
>>> nbddf = NBDDataFrame(testdataframe, debug=True)
The DataFrame is in the correct format
An exception is raised if the DataFrame is not in the correct format (in this case, a date column is missing).
>>> nbddf = NBDDataFrame(testdataframe[['latitude', 'longitude']])
Traceback (most recent call last):
    ...
Exception: DataFrame format error
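The format check can be pictured as a simple test on column names. The sketch below is not the NBDDataFrame code: check_format and REQUIRED_COLUMNS are hypothetical names, and the required columns are taken from the parameter description above.

```python
# Column names required by the description of `df` (assumed exact spelling).
REQUIRED_COLUMNS = ('latitude', 'longitude', 'date')


def check_format(columns, debug=False):
    """Raise if any required column name is absent from `columns`."""
    missing = [name for name in REQUIRED_COLUMNS if name not in columns]
    if missing:
        raise Exception('DataFrame format error')
    if debug:
        print('The DataFrame is in the correct format')
    return True


ok = check_format(['val', 'latitude', 'longitude', 'rand', 'nbd', 'date'])
```

Calling check_format(['latitude', 'longitude']) raises the same error message as the example above, because the date column is absent.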
To print information about missing data,
>>> nbddf = NBDDataFrame(testdataframe)
>>> print nbddf.print_info()
The number of rows is 13.
2 rows are missing val.
1 rows are missing longitude.
2 rows have out-of-bounds location.
To remove rows with missing location or date data,
>>> nbddf.remove_missing_data()
>>> print nbddf.print_info()
The number of rows is 12.
2 rows are missing val.
2 rows have out-of-bounds location.
To remove rows with out-of-bounds locations,
>>> nbddf.remove_outofbounds_data()
>>> print nbddf.print_info()
The number of rows is 10.
2 rows are missing val.
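The two cleaning steps amount to row filters. The following sketch works on plain dictionaries instead of a DataFrame; the helper names, the inclusive bounds, and the third sample row are assumptions made for illustration.

```python
import math

# Default bounds documented for NBDDataFrame.
MIN_LAT, MAX_LAT = 47.5, 47.75
MIN_LONG, MAX_LONG = -122.44, -122.2


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def remove_missing(rows):
    """Drop rows lacking a latitude, longitude or date."""
    keys = ('latitude', 'longitude', 'date')
    return [r for r in rows if not any(is_missing(r.get(k)) for k in keys)]


def remove_out_of_bounds(rows):
    """Keep rows inside the bounding box (bounds assumed inclusive)."""
    return [
        r for r in rows
        if MIN_LAT <= r['latitude'] <= MAX_LAT
        and MIN_LONG <= r['longitude'] <= MAX_LONG
    ]


rows = [
    {'latitude': 47.600025, 'longitude': -122.373928, 'date': '2000-01-01'},
    {'latitude': 47.4, 'longitude': -122.341304, 'date': '1986-06-20'},  # below MIN_LAT
    {'latitude': 47.634556, 'longitude': None, 'date': '1987-10-13'},    # hypothetical gap
]
without_missing = remove_missing(rows)
cleaned = remove_out_of_bounds(without_missing)
```

Missing rows are removed first so that the bounds comparison never sees None or NaN.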
Preliminary plots:
>>> nbddf.plot_rowcount_by_month()
>>> nbddf.plot_map()
Todo
- function to add neighborhoods
- cleaning functions
- grouping functions
- vis/analysis functions
- normalization functions
- Use shapefiles to get nbd instead of nearest-nbr predictor.
- get_df()
Get the underlying DataFrame.
Returns: the underlying DataFrame
Return type: pandas.DataFrame
- plot_map(df=None, filename='row_locations_map.png')
Plot the locations of the rows on a map.
Parameters: - df (DataFrame) – the data to plot: if None, then this object’s underlying DataFrame is used, default is None
- filename (str) – the name of the file with the plot, default is “row_locations_map.png”
- plot_rowcount_by_month(df=None, filename='rowcount_by_month.png')
Plot the number of rows by month.
Parameters: - df (DataFrame) – the data to plot: if None, then this object’s underlying DataFrame is used, default is None
- filename (str) – the name of the file with the plot, default is “rowcount_by_month.png”
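The quantity behind this plot is a count of rows per month. The sketch below computes such counts with the standard library only; grouping by (year, month) rather than by calendar month alone is an assumption, as is the helper name rowcount_by_month.

```python
from collections import Counter
from datetime import date


def rowcount_by_month(dates):
    """Count rows per (year, month), returned in chronological order."""
    counts = Counter((d.year, d.month) for d in dates)
    return sorted(counts.items())


# Dates taken from the df.head() output in the get_csv_data example.
dates = [
    date(2000, 1, 1), date(1986, 6, 20), date(1987, 10, 13),
    date(2000, 1, 1), date(2000, 1, 1),
]
monthly = rowcount_by_month(dates)
```

The resulting pairs could then be passed to any plotting library as x and y values.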
- datatools.nbddataframe.get_csv_data(filename, nbdname=None, latname='latitude', longname='longitude', datename='date', sep='\t')
Read neighborhood data from a csv into a pandas.DataFrame compatible with NBDDataFrame. The csv data should have at the least latitude and longitude columns with names latname and longname, and a date column with name datename. A neighborhood column with name nbdname is optional.
Parameters: - filename (str) – the name of the csv file
- nbdname (str) – (optional) the name of the neighborhood column if there is one, default is None
- latname (str) – the name of the latitude column, default is ‘latitude’
- longname (str) – the name of the longitude column, default is ‘longitude’
- datename (str) – the name of the date column, default is ‘date’
- sep (str) – the separation character, default is tab
Returns: a pandas.DataFrame compatible with NBDDataFrame
Return type: pandas.DataFrame
For example,
>>> from datatools.nbddataframe import testdata, get_csv_data
>>> fil = open('test.csv', 'w') #make a test csv file
>>> fil.write(testdata) #write data randomly generated for test purposes
>>> fil.close()
>>> df = get_csv_data(
...     filename='test.csv', nbdname='neighborhood',
...     latname='lat', longname='lon', datename='date',
...     sep=' '
... )
>>> df.head()
        val   latitude   longitude      rand nbd       date
0  4.076444  47.600025 -122.373928  0.127659   B 2000-01-01
1  4.252051  47.400000 -122.341304  0.875592   A 1986-06-20
2  1.322697  47.634556 -122.344292  0.365462   B 1987-10-13
3 -9.362756  47.650959 -122.299352  0.095750   B 2000-01-01
4       NaN  47.544586 -122.343625  0.171634   B 2000-01-01
>>> df.loc[0, 'nbd']
'B'
>>> df.mean().loc['latitude']
47.597802519907695
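To show what reading and standardizing such a file involves, here is a standalone sketch that parses delimited text with the csv module instead of pandas. The function read_rows, the sample text, and the '%Y-%m-%d' date format are assumptions; get_csv_data itself returns a pandas.DataFrame.

```python
import csv
import io
from datetime import datetime


def read_rows(text, nbdname=None, latname='latitude', longname='longitude',
              datename='date', sep='\t'):
    """Parse delimited text into dicts keyed by the standard column names."""
    renames = {latname: 'latitude', longname: 'longitude', datename: 'date'}
    if nbdname is not None:
        renames[nbdname] = 'nbd'
    rows = []
    for raw in csv.DictReader(io.StringIO(text), delimiter=sep):
        # Rename the configured columns; leave any other column untouched.
        row = {renames.get(key, key): value for key, value in raw.items()}
        row['latitude'] = float(row['latitude'])
        row['longitude'] = float(row['longitude'])
        row['date'] = datetime.strptime(row['date'], '%Y-%m-%d')  # assumed format
        rows.append(row)
    return rows


# Hypothetical tab-separated sample with non-standard column names.
sample = "lat\tlon\tdate\tneighborhood\n47.600025\t-122.373928\t2000-01-01\tB\n"
rows = read_rows(sample, nbdname='neighborhood', latname='lat', longname='lon')
```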
- datatools.nbddataframe.get_db_data(engine, tablename='nbddata', nbdname=None, latname='latitude', longname='longitude', datename='date', index_col=None)
Read neighborhood data from a database into a pandas.DataFrame compatible with NBDDataFrame. The table should have at the least latitude and longitude columns with names latname and longname, and a date column with name datename. A neighborhood column with name nbdname is optional.
Parameters: - engine (sqlalchemy.engine.Engine) – the database engine
- tablename (str) – the name of the database table, default is ‘nbddata’
- nbdname (str) – (optional) the name of the neighborhood column if there is one, default is None
- latname (str) – the name of the latitude column, default is ‘latitude’
- longname (str) – the name of the longitude column, default is ‘longitude’
- datename (str) – the name of the date column, default is ‘date’
- index_col (str) – the name of the index column if there is one, default is None
Returns: a DataFrame compatible with NBDDataFrame
Return type: pandas.DataFrame
For example,
>>> from datatools.nbddataframe import testdataframe, NBDDataFrame
>>> from datatools.nbddataframe import make_db, get_db_data
>>> neigh_dataframe = NBDDataFrame(testdataframe)
>>> engine = make_db(nbddf=neigh_dataframe, tablename='neigh_data')
>>> df2 = get_db_data(engine=engine, tablename='neigh_data',
...                   nbdname='nbd', latname='latitude',
...                   longname='longitude', datename='date'
...                   )
>>> df2.loc[1, 'nbd']
u'A'
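Reading a table whose columns use other names can be sketched with the standard sqlite3 module, using SQL aliases to map them onto the standard names. This is an illustration under assumptions: read_table, the table layout, and the column names lat, lon, day and hood are hypothetical, and get_db_data works through a SQLAlchemy engine instead.

```python
import sqlite3


def read_table(con, tablename='nbddata', nbdname=None, latname='latitude',
               longname='longitude', datename='date'):
    """Select location, date and optional neighborhood columns under standard names."""
    columns = [
        '"{0}" AS latitude'.format(latname),
        '"{0}" AS longitude'.format(longname),
        '"{0}" AS date'.format(datename),
    ]
    if nbdname is not None:
        columns.append('"{0}" AS nbd'.format(nbdname))
    # Identifiers cannot be bound as SQL parameters, so they are quoted and
    # interpolated here; only pass trusted table and column names.
    query = 'SELECT {0} FROM "{1}"'.format(', '.join(columns), tablename)
    cursor = con.execute(query)
    names = [description[0] for description in cursor.description]
    return [dict(zip(names, record)) for record in cursor.fetchall()]


con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE neigh_data (lat REAL, lon REAL, day TEXT, hood TEXT)')
con.execute("INSERT INTO neigh_data VALUES (47.4, -122.341304, '1986-06-20', 'A')")
rows = read_table(con, tablename='neigh_data', nbdname='hood',
                  latname='lat', longname='lon', datename='day')
con.close()
```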
- datatools.nbddataframe.make_db(nbddf, dbname=None, tablename='nbddata')
Write NBDDataFrame data into a SQLite database.
Parameters: - nbddf (NBDDataFrame) – the NBDDataFrame object containing the data
- dbname (str) – the name of the database; if no name is given, the database will be in-memory-only, default is None
- tablename (str) – the name of the table, default is ‘nbddata’
Returns: an engine to the database
Return type: sqlalchemy.engine.Engine
For example,
>>> from datatools.nbddataframe import testdataframe, NBDDataFrame
>>> from datatools.nbddataframe import make_db
>>> neigh_dataframe = NBDDataFrame(testdataframe)
>>> engine = make_db(nbddf=neigh_dataframe, tablename='neigh_data')
>>> con = engine.connect()
>>> res = con.execute("select latitude from neigh_data where nbd = 'A'")
>>> data = res.fetchall()
>>> data[0][0]
47.4
>>> con.close()
- datatools.nbddataframe.maxlat = 47.75
The default maximum latitude.
- datatools.nbddataframe.maxlong = -122.2
The default maximum longitude.
- datatools.nbddataframe.minlat = 47.5
The default minimum latitude.
- datatools.nbddataframe.minlong = -122.44
The default minimum longitude.
- datatools.nbddataframe.rename_cols(df, nbdname=None, latname='latitude', longname='longitude', datename='date')
Rename the latitude, longitude, date and (if there is one) neighborhood columns of the DataFrame df so that it is compatible with NBDDataFrame. The date column is also converted to a time type.
Parameters: - df (pandas.DataFrame) – the DataFrame whose columns are renamed
- nbdname (str) – (optional) the name of the neighborhood column if there is one, default is None
- latname (str) – the name of the latitude column, default is ‘latitude’
- longname (str) – the name of the longitude column, default is ‘longitude’
- datename (str) – the name of the date column, default is ‘date’
Returns: a pandas.DataFrame compatible with NBDDataFrame
Return type: pandas.DataFrame
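The renaming step can be sketched on plain dictionaries. This is only an illustration of the idea: rename_keys, its date_format parameter, and the sample column names are hypothetical, and rename_cols itself operates on a pandas.DataFrame.

```python
from datetime import datetime


def rename_keys(rows, nbdname=None, latname='latitude', longname='longitude',
                datename='date', date_format='%Y-%m-%d'):
    """Rename the configured keys to the standard names and parse the date."""
    renames = {latname: 'latitude', longname: 'longitude', datename: 'date'}
    if nbdname is not None:
        renames[nbdname] = 'nbd'
    renamed = []
    for row in rows:
        # Keys not listed in `renames` are copied through unchanged.
        new_row = {renames.get(key, key): value for key, value in row.items()}
        new_row['date'] = datetime.strptime(new_row['date'], date_format)
        renamed.append(new_row)
    return renamed


rows = [{'lat': 47.4, 'lon': -122.341304, 'when': '1986-06-20', 'hood': 'A', 'val': 4.25}]
renamed = rename_keys(rows, nbdname='hood', latname='lat', longname='lon', datename='when')
```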