*This is a group work of a team of 3 people, including myself for class DS8003*
Problem Definition
In this project, we will perform analysis on the theft from motor vehicles reported to the Toronto Police Service occurred in 140 neighborhoods of the City of Toronto, using the four different technology tools, namely Tableau, Spark, Hive and Kibana. This document is written to introduce the dataset that will be examined in depth, to report the insights we discovered, and to give potential solutions.
Before going any further, it is vital to understand the problem first. Theft of all kinds occurs on a daily basis, especially in major cities. For this project, we are using a dataset given directly from the Toronto Police Service[Dataset 1], which documents all reported theft from motor vehicle cases dated back from 2014 until now. We also used an additional “Theft” dataset from the Toronto Police Service[Dataset 2], along with some economic data retrieved indirectly and directly from Statistic Canada. Hence, it gives enough range and depth to explore whether there is a trend of rising theft from motor vehicles. As a result, we found that although the overall theft from motor vehicles is on a downwarding trend since 2014, there is several noticeable correlation between theft and times during a week of a day, as well as the movement of crimes over the year toward specific neighbourhoods. We also identify some potential solutions to counter these directions of development.
Data Description
Our dataset, Theft from Motor vehicles contains 76,914 rows and 26 columns. Due to a high number of columns, our data description view gets a little distorted so we can use the below command to see our data in a concise format.
Work Distribution
The work done for this project was an equal contribution by all the team members. Everyone participated in their full capacities during our weekly meetings. During these weekly meetings, we would actively brainstorm ideas and talk about the tools that we would be working on for our individual pieces of the project but we’d come together and collaborate when anyone needed support. We worked together on the presentation as well and provided each other constructive criticism on how we can all improve all parts of the presentation. We also contributed to the report by working on our sections individually and then reading and editing and providing feedback on all sections so as to have one cohesive report.
Solution Description
For this project, we used Hive and Spark, PySpark specifically, to process data and Kibana with Tableau for data visualization. The reason that led us to this decision is the fact that we had over 100 thousand lines of combined data, so using Hive and Spark will boost up the processing data time compared to python native data types.
Hive
Subsequent to loading our data into the Hadoop Distributed File System (HDFS), which allows for the efficient storage and processing of large datasets, Hive was deemed an excellent option for our analysis because it provides a quick and equally efficient way to read, write, and manage up to petabytes of data using a SQL-like interface (HiveQL) [1]. After defining the table schemas for, and populating data from the Theft From Motor Vehicle and Auto Theft datasets respectively, all rows of the two tables were joined together using the common columns event_unique_id, occurrence_year, occurrence_month, occurrence_day_of_week, and occurrence_hour (Figure 1), which had been renamed slightly from the original dataset. A type column was added to differentiate the theft types by the labels ‘Theft From Auto’ and ‘Auto Theft’.
In the Figure 2 query, we created a temporary table for each theft category, with the theft year (between 2014 and 2021) and two columns representing theft counts aggregated by year, and the percent difference relative to 2014. The 2014 values were previously queried and stored into the variables from_auto_2014 and auto_theft_2014 for easy reference, and the percent difference was calculated and processed in ‘xx.x%’ format. The two tables were then joined together using occurrence_year as the common key.
We engaged in a similar process in Figure 3, this time joining two temporary tables ranking theft count by month, with occurrence_month as the common column. The goal was to compare the relative ranks between theft types in order to determine whether there were commonalities in the highest-risk months.
For Figure 4, the point of interest was time of day, and we bucketed each hour into a four-hour time-slot before ranking the number of thefts accordingly. This time, three temporary tables were used, with the first one defining and joining the the time_of_day column to the columns from the main auto_related_theft table for easy reference.
Finally, we specifically looked at ‘Theft From Auto’ (Figure 5), grouping the data by the days of week and occurrence hour, and ranking them by theft counts. Here, the goal was to highlight the top ten high-risk day and time combinations.
Spark
Spark Dataframe is able to support parallel computing, which translates to much faster computing time compared to Python’s Pandas Dataframe. Since we have 76,914 lines of data from the “Theft from motor vehicles” dataset, it makes sense to use Spark Dataframe to load our dataset. Spark has a Python library called “pyspark”, which we used for our project. In the code snippet below, we import pyspark library, then create a Spark session before reading data with Spark. It was used in several insights in this document. For instance, when we analyze the general trend of crimes moving toward downtown Toronto and Eotbicoke, which will be discussed later, pyspark allows us to quickly group values by “Neighbourhood” column and count total cases (figure 6).
Figure 6: Pyspark code snippet
Furthermore, in order to assist mapping with Kibana, which will be discussed after this, we also used pyspark to convert string format of month to number type (figure 7). Here, since the number of cases before 2014 are significantly lower than after 2014, we decided to focus our analysis on data after 2014.
Figure 7: Pyspark converting format month.
Finally, in some cases, it is not convenient to switch back and forth between tools, we used pyspark to graph charts when it is needed.
Figure 8: Pyspark converting to Pandas dataframe to draw a chart
Kibana
Kibana is a visualization tool that is a part of an ELK stack. It is available with both machine-based installation and web-based interface, which is very user-friendly. However, after using it for a map and a graph, we realized it is quite difficult to use outside of the ELK stack, so we moved to Tableau for similar ease of chart generation and used a python library to draw charts if needed. For an instance of using Kibana, we used processed data from pyspark that already computed the total number of theft from motor vehicles as input for Kibana. As we already have years as X axis and total number of crimes as Y axis, it is simple to generate a line chart like figure 21, which is a part of our analysis on correlation between economic and theft from motor vehicles.
Tableau
Tableau is a very popular business intelligence and data visualization tool, and for good reason. Not only can it handle significant amounts of data efficiently, it has helpful features such as Quick Table Calculations, and allows users to create static and interactive displays with its drag-and-drop features in very little time. We opted to use it as a far more user-friendly alternative to Kibana, which had proved somewhat inflexible. Tableau allows users to bring insights to life, displaying them in an easily digestible and visually appealing format. We used it to create charts to display the time-based trends (annual, monthly, etc.) as well as breakdowns by Premise Type that we previously analyzed using Hive and PySpark. These will be shown in the appropriate Insights sections. The tool enabled us to automatically compute counts by merely dragging and dropping the desired fields into the appropriate sections (e.g. into Colour on the Marks card to create separate bars for the different kinds of theft), and make easy adjustments such as adding aliases and filters, and custom sorting.
The minimal code used (Figure 9) was to mimic the grouped four-hour time slots temp table query from Hive; this could also technically have been included in a new extended Hive table prior to importing the data to Tableau, which would have altogether eliminated the need for code.
Insights
I. Thefts from motor vehicles have not seen an alarming increase relative to 2014.
As we can see from the above figures, there was a steady increase in both types of theft from 2016 onwards; however, the peak at 2020 for Theft From Auto (tfa) only represents a 5.6% increase relative to 2014. The following year’s numbers were 15.7% lower than in 2014. Meanwhile, Auto Theft (at) has seen continuous year-over-year increases since 2016, with its own peak at 2021 representing a 79.5% increase relative to 2014. Although it is possible that auto theft crime is an outlier, given the drastic increase, the comparison between the two categories, the fact that 5 out of 7 years since 2014 have seen lower numbers than in that year, and the most recent year’s results all suggest that there has not generally been an alarming increase in thefts from motor vehicles. This, however, does not mean that interventions are not required.
II. There is strong alignment between thefts from motor vehicles and auto thefts in terms of high-risk periods.
As per Figure 12, the top four months for Theft From Auto (tfa) are October, September, August, and July. For Auto Theft (at), they are October, November, September, and July. This means that they share 3 out of 4 top months (i.e. July, September, and October) in common, with October being the highest-risk month for both. Figures 13 and 14, meanwhile, show perfect alignment in terms of the times of day that people’s vehicles are at the greatest risk of being stolen or stolen from. The time period between 8pm and midnight is when people should be the most concerned, followed by 4pm to 8pm. These commonalities mean that, despite the earlier noted differences in top level annual trends and their different classifications (i.e. Major Crime vs. Non-Major Crime), there may be sufficient grounds to propose that these crimes be jointly monitored and addressed by the same task forces.
III. There are particular days and times, such as 6pm on Fridays, and midnight on Monday and Sunday, where increased police presence is warranted.
Looking at thefts from motor vehicles, we can see that 6pm and midnight are generally high-risk times of day. In particular, Friday in the 6pm hour and Monday and Sunday in the midnight hour are the top three time frames of concern. This assessment can assist the Toronto Police Service in their workforce planning, so as to ensure that a greater number of officers are on patrol duty as appropriate. The results of this analysis also expose some of the dangers of grouping/bucketing data. For example, the 12am hour was included in the 12am - 4am group in the earlier section, and because there is a decrease in activity over the following three hours, its level of risk was not adequately captured.
IV. A movement in areas of committed cases toward downtown Toronto and Etobicoke
After we composed a list of neighbourhoods with most reported cases since 2014 (figure 17), we went to map it out to get a more overview look (figure 17).
Figure 17: Most reported cases neighbourhoods since 2014 in descending order.
Figure 18: Map of neighbourhoods with most reported cases since 2014.
In general, this map shows us that the east side of North York, East York and York are relatively safe, as most crimes are concentrated in downtown Toronto, Scarborough, Etobicoke, and West of North York. As we can see from figure 17, after the 6th one, rate of changes decreases significantly, which suggests the top 6 concentrated most of the cases and needs more attention. Thus, if we combine that along with table A describing the number of cases for those areas over the years and figure 20 counting the number of time each neighbourhood appears in the top 10 every year since 2014, it can be observed that although at its peak, West Humber-Clairville cases in a year is much less than Kensington-Chinatown’s one, it appears in the top 10 for 9 consecutive years, more than the 7 times Kensington-Chinatown.
Table A: Number of cases per neighbourhood per year in descending order.
Figure 19: Code to find top 10 most dangerous neighbourhood by year, then count number of times they made it to the top 10 since 2014.
In fact, Kensington-Chinatown number of cases are dropping fairly consistently since its peak in 2014, with 2020 and 2021 are not even in the top 20 while it is the opposite trend for West Humber-Clairville. Therefore, it can be concluded that West Humber-Clairville right now is actually more dangerous than Kensington-Chinatown. Although Islington-City Center West makes it in the top 5 of most thefts from auto from auto vehicle since 2014 and managed to be in the top 10 for 9 consecutive years, its peak did not manage to make it to top 20 since 2014. Thus, it can be suggested that this area is consistently risky with no noticeable trend.
Figure 20: Number of times a neighbourhood appears in the top 10 every year since 2014.
If we combine figure 17, table A and figure 18 together, we can notice that there is a trend of theft from motor vehicles moving toward downtown Toronto and Etobicoke. With this information, we suggest that more resources should be allocated to these two areas in anticipation of rising crimes.
V. A negative correlation with sense of wealth
With all that information we have seen, it is difficult not to wonder about possible catalysts or causation of rising theft from motor vehicles. So we decided to go out of the box to find an answer to it. A popular belief was that there is a link between wealth and theft. Basically, this theory says people tend to go the criminal path more often when it is difficult to put food on the table. In fact, there are also some studies on this subject. For instance, there is a study conducted in 2009 proposing the relationship between median income and property theft [2]. Although it does not fully match our situation, we decided to explore whether this holds true for theft from motor vehicles.
From our dataset, we can graph the total number of theft from motor vehicles from 2014-2022 (figure 21). Due to the nature of our question, for economic data, we decided to go straight to charts of data where possible as there are many datasets we need to look at to meet.
Figure 21: Total number of reported cases from 2014-2022
Figure 22: Median income in Ontario
The theory does not seem applicable to our cases, as although median income data is only available until 2020 (figure 22), median income is positively correlated with total number of theft from motor vehicles, except for 2016. This doesn’t make sense, as it suggests people steal more when they have more income. In fact, when we look at this theory with the unemployment rate of Canada (figure 23), it seems to have no relationship with theft from motor vehicles, which disproves our initial belief.
Figure 23: Unemployment rate in Ontario
Thus, we decided to look under another direction: sense of wealth; and finally we found one piece of data that seems to support the idea. Before 2020, theft from motor vehicles seems to have negative correlation with Personal Saving Rate from 2014-2019 (figure 24). This looks more promising, but more research is still needed to have a concrete conclusion of the correlation between the sense of wealth and theft from motor vehicles. From this angle, it can be suggested that more resources could be allocated on community support and public security during those times when the sense of wealth decreases, as well as raising personal awareness to prevent chances of being a victim of theft from motor vehicles.
Figure 24: Personal Saving rate in Canada
VI. A maximum of these thefts amount to under $5000 from motor vehicles and are substantially seen happening “outside”
Here you can see a relationship between the UCR Code which is a code set system that the police use to identify the type of offence and the premise type. It’s evident from the data that premise type “outside” has an extremely high count of UCR code 2142. UCR code 2142 signifies theft under $5000 from a motor vehicle and UCR code 2132 represents theft above $5000 from motor vehicles. Both these offences are maximum observed “outside”. Outside basically covers cars parked on the street, cars parked near curbs, cars parked in dark or poorly lit areas. Our second maxima is observed in the case of premise type - “house” which refers to cars being parked in the driveway of the victim’s home or even the residential streets in front of the victim’s homes.
Future Work
For future work, we can continue to verify the wealth-crime theory with more Ontario specific datasets. For instance, we can compare total theft from motor vehicles per year with Ontario Consumer Confidence and Ontario Personal Saving Rate. There is also potential ground to explore with surveys specifically asking people to classify themselves as living paycheck to paycheck change rate or self-classifying as low-mid-high class.
References
Dataset 1: “Open data dataset,” City of Toronto Open Data Portal. [Online]. Available: https://open.toronto.ca/dataset/theft-from-motor-vehicle/. [Accessed: 05-Dec-2022].
Dataset 2: “Auto Theft,” Toronto Police Service Public Safety Data Portal. [Online]. Available: https://data.torontopolice.on.ca/datasets/TorontoPS::auto-theft-1/about. [Accessed: 23-Nov-2022].
[1]: “What is Apache Hive?” [Online]. Available: https://aws.amazon.com/big-data/what-is-hive/. [Accessed: 04-Dec-2022].
[2]: C. Muroi and R. Baumann, “The non-linear effect of wealth on crime - college of the holy cross.” [Online]. Available: https://college.holycross.edu/RePEc/hcx/Muroi-Baumann_Crime.pdf. [Accessed: 11-Dec-2022].
Figure 22: L. Jeudy, “Median total family income in Ontario,” Statista, 21-Oct-2022. [Online]. Available: https://www.statista.com/statistics/582818/median-total-family-income-ontario/. [Accessed: 29-Nov-2022].
Figure 23: “Annual average unemployment rate Canada and provinces 1976-2021.” [Online]. Available: https://www.stats.gov.nl.ca/Statistics/Topics/labour/PDF/UnempRate.pdf. [Accessed: 29-Nov-2022].
Figure 24: “Canada household saving Rate 2022 data - 2023 forecast - 1961-2021 historical,” Canada Household Saving Rate - 2022 Data - 2023 Forecast - 1961-2021 Historical. [Online]. Available: https://tradingeconomics.com/canada/personal-savings. [Accessed: 29-Nov-2022]
留言