Artificial Informer - Dissecting a Machine Learning Powered Investigation

Dissecting a Machine Learning Powered Investigation

Uncovering local property tax evasion using machine learning and statistical modeling. An investigative recipe.

By Brandon Roberts

If there's one universal investigative template I've come across in my journalism career, it's this: take a list of names or organizations, find those names in another dataset[1] and identify the most suspicious ones for further investigation. I call this a "data in, data out" investigation. While I've personally only applied this pattern to property tax related stories, it could also be extended to areas like campaign finance and background research.

Before I get into the technical dirt, I want to talk about why data in, data out investigations are ideal. Going into a dataset essentially blind and letting the data provide leads helps keep things less biased. Smaller newsrooms can benefit, too, because they don't need whistleblowers telling them exactly who or what to look for. To me, the best part is these kinds of investigations lend themselves well to advanced computer assisted methods—eliminating a lot of the tedium that goes into investigative work.

I'm going to dissect a data in, data out investigation and the machine learning (ML)[10] tools that went into it.

The Investigation: An Overview

The City of Austin, like many cities across the US, requires people who list their homes on sites like HomeAway and Airbnb to register and pay taxes on each stay, just like hotels do. These types of rentals are called "short-term rentals" (STRs). Properties used as STRs generally aren't eligible for most tax exemptions. In Texas, home owners can apply for a so-called "homestead exemption". These add up to significant tax savings. Because of this, there are lots of rules around them. For example, only one property per person or couple can be exempt. It needs to be their primary residence and it can't be used commercially.

We wanted to see whether the appropriate rental taxes were being paid and if the appraisal district was letting STR operators apply for and receive tax exemptions—a clear violation of the law.

A list of STRs registered with the city was the primary dataset. Secondarily, we got datasets of appraisal district properties and electric utility subscribers. We took the STR list and used a ML algorithm to look for matches in the other two datasets. This identified STR operators (individuals or couples) who owned multiple properties and had applied for exemptions.

Next, we checked each joint STR/property record for exemptions and ranked them by suspiciousness. This ranking was accomplished using ML. From there, we filed a series of public information requests to verify our findings.

Joining two datasets based on one of their columns is a basic strategy common to many investigations. Relational databases were created to do exactly this. But in our case, we had low quality data and a lot of it: roughly a million records, combined. Names were spelled inconsistently and addresses were often abbreviated or were missing apartment numbers. Standard tools didn't cut it, so we turned to ML.

Linking Records with Machine Learning

The foundation of my approach to this investigation is a ML algorithm called Locality Sensitive Hashing (LSH). This is part of the unsupervised[19] machine learning family of algorithms. Unsupervised learning gets its name because it learns directly from the data itself. Unlike other forms of ML, unsupervised methods don't require people to tediously sort through data and tag individual records. These algorithms are inexpensive to use, can work in a variety of situations and lend themselves to straightforward, practical applications. Unsupervised learning excels in problems like search, uncovering structure, similarity, grouping and segmentation. It's my opinion that unsupervised machine learning holds the most promise for the future of local and nonprofit investigative journalism.

LSH is an algorithmic method for taking large amounts of data and efficiently grouping together records which are similar to each other. It can handle inexact or "fuzzy" matches, greatly reducing the amount of preprocessing[8] needed.

LSH: A Visual Description

An animation of the process of grouping three short random words into groups using LSH.

In short, LSH operates by taking data points and assigning them to "buckets" or groups. When using LSH to link records from two datasets, we take the larger dataset, build an LSH index and then take the smaller dataset and query that index. The result of a query is a list of records which should be similar to the record we're querying.

I'm going to give a technical overview of the specific LSH algorithm I used. Feel free to skip this part if you don't care.

--------------- START TECHNICAL DISCUSSION ---------------

In order to understand LSH, we first need to discuss the k-Nearest Neighbors (k-NN) algorithm. What k-NN does is take one set of records and allow you to find some number of most similar records. That number is the k in k-NN. So a k-NN algorithm that only finds the single most similar record is known as k-NN with k=1 or (sometimes) 1-NN. In its most basic form, this algorithm checks every point against each other and gets a score for how similar they are. We then rank these and take the k-number of closest matches. While this algorithm has its place, it is problematic in our investigation for several reasons:

I had a dataset of 20,000 properties and another dataset of 800,000 utilities records. Doing the math, 20,000 records times 800,000, tells us 16,000,000,000 checks need to be performed to find groups using plain old k-NN.
Even if doing that number of checks was feasible, I'd need to come up with a distance[2] threshold—a number that establishes how similar two records need to be in order to be considered the same.

Both of these problems are annoying to solve and can be avoided by using LSH, which is a form of approximate nearest neighbors.

LSH goes about finding similar records using what I mentally categorize as a mathematical cheat code: if two records, A and B, are similar to a third record, then the two are likely similar to each other. It's a massive simplification of how LSH works, but it serves as a decent mental model. This gets around having to check every record against every record in order to find groups. We only need to check all records against that third thing.

In reality, LSH works by projecting[9] our records onto a randomized hyperplane (basically a random matrix[11] of data), doing a little rounding, and assigning them to a bucket. The smaller the hyperplane is, the higher chance that records will end up in the same bucket, despite dissimilarity.

Generally, this is a lot quicker than k-NN because, for each record, we only need to do a multiplication with the hyperplane. Mathematically speaking, LSH scales linearly with the number of records, as opposed to k-NN which scales exponentially. In my example above of 20,000 and 800,000 records, I only need to do 820,000 checks (800,000 plus 20,000 checks against our hyperplane). Obviously, this is far less than the 16,000,000,000 checks with plain k-NN.

--------------- END TECHNICAL DISCUSSION ---------------

Grouping records in LSH is done probabilistically, meaning you probably won't find every match and some non-matches might also fall into your results. Fortunately, there are ways to tune LSH in order to avoid this. LSH has two main configuration options, if you will. The most important being the hash[5] size. The larger the hash size, the more similar records must be to end up in the same bucket. Another way to improve accuracy is to use multiple LSH indexes. This increases the chances of records falling into the same bucket if they are indeed similar. I tend to use three LSH indexes with a hash size of either 512 or 1024 bits. (Another strategy for finding a good hash size is to start with the square root of the number of records and go up from there. Wikipedia has a decent page on this.)

The process for grouping records between two datasets using LSH goes like this:

Choose a column to match records with and extract that column from each dataset (e.g., we want to match against owner name, so get the "owner name" column from one dataset and "name" on the other).
Take only one of our extracted columns for matching and build a LSH index with it (e.g., build a LSH index on the "owner name" column).
Take the second, remaining column and query (get bucket assignments from) our LSH index (e.g., for each value in the "name" column, get the most similar names from the "owner name" column). The results of each query forms a "group" of similar records.
For each match in each group, grab the original full record. We will use the combined group of records to continue our investigation.

This process results in a list of grouped records. I tend to flatten these groups into wide rows by combining fields from both datasets. I also add some extra columns describing things like the number of matches found in the bucket and some averages from the records. We can use this data to filter out groups that aren't worth investigating further and then rank the remaining records by investigative interest. But instead of doing this last step by hand, I propose using another ML technique: regression.

Ranking Records by Suspiciousness with Lasso Regression

So, we've successfully identified groups of records and now want to uncover ones that warrant intensive, manual research. We've also built a few aggregates from each group to work with. In the investigation described above, we had these aggregate variables (or features[4]) to work with:

Number of properties owned.
Total number of exemptions held.
Total monetary amount of combined exemptions.
Property type (there were three types of properties, with different rules about how exemptions may legally be applied).
Average valuation of all properties owned.

Typically, one might consider writing a program to take these values and use them to rank grouped records. We could do this by assigning an importance value, or weight[21], to each variable, combining them and obtaining a score for each group.

This may sound good, but there are still some open questions. How do we combine the variables for each group? How do we compare variables with different ranges? The valuation of a property, for example, is in the hundreds of thousands, while the number of exemptions applied to a property is in the single digits. How do we combine everything once we've decided on importance? What do we do if some of the variables are redundant and overpower other, more important, variables? Luckily there is ML algorithm that can help with all of these problems.

Regression is a statistical modeling technique in the supervised[17] family of ML algorithms. If you've heard about ML before, it's almost guaranteed to be a supervised learning algorithm. This family of algorithms is useful in cases where you have some task to replicate and examples of it being done. Some problems and algorithms need more examples than others.

A classic example is using regression to predict the price of a home based on the number of rooms, bathrooms, existence of a garage, etc. If we give a regression algorithm enough data about homes along with the price of each (we call this training data[18]), the model can learn to predict price when given information about new homes.

Lasso, short for Least Absolute Shrinkage and Selection Operator, is a specific type of regression that is not only able to make predictions based on training data, but it can also automatically assign importance values and eliminate useless features from consideration.

In order to use regression to rank our properties by suspiciousness, we need to score some of our records by hand. So I browsed my dataset, found a few seemingly egregious examples, and gave them high scores (I chose zero through ten, with ten being very suspicious). Then I found a few records I wanted to be sure to ignore and gave them a low score. With a few of these hand labeled records, I was ready to build my Lasso model.

Lasso regression model

Visual description of a Lasso regression model. In this scatter matrix, each feature is plotted against the other features, forming a scatter plot for each combination. For example, the bottom left plot shows "property type" on the y-axis vs. "number of exemptions" on the x-axis. Points are colored by the suspiciousness scores obtained by the combination of the two features. Red dots are the most suspicious; dark blue/purple dots are the least suspicious.

There's a major benefit to using regression models: the weights they assign to your input variables are straightforward to understand. Other ML models are basically black boxes. This is so much the case that there's an entire subfield of ML dedicated to building models that can interpret other models. In journalism, we need to quickly check our model's assumptions and biases. Regression gives us a way to do this.

As an example, here's a breakdown of a Lasso model I trained:

           Lasso model explanation

Feature                    Importance
_______________________________________________
Property type          =>  2.3802236455789316
Total exemption amount =>  0.9168560761627526
Number of properties   => -0.052699999380756465
Number of exemptions   =>  Ignored
Valuation              =>  Ignored


                (Prop type * 2.38)
              - (No. props * 0.05)
       + (Total exempt amt * 0.91)
       ===========================
                    Suspiciousness

In general, this model thought "property type" was the most important feature. But it found records with large numbers of properties owned by a single person worth ignoring. This is because the data lacked details like unit/apartment numbers, so searches returned properties for every apartment in a complex. Because of this, the regression model decided to ignore buckets with huge numbers of properties. The total exemption amount would bring up a record's suspiciousness score, while number of exemptions and valuation were ignored.

In regression, each of these weights gets multiplied by the corresponding value from each record. When summed, we get a suspiciousness score for each record. Data in, data out.

There were three property types: 1, 2 and 3. Since there were very few type 2 and 3 properties in my dataset, it was important to look into them. Anything rare in a dataset usually is worth looking into. Some of the addresses pointed to large apartment complexes. Unfortunately, our data was lacking apartment or suite numbers, so searches for these properties would bring back every unit in the building. If a bucket contained a lot of properties, it was probably because of a search result like this. Total exemption amount was also an important variable. In this case, my model figured out that "number of exemptions" and "valuation" were directly related to the "total exemption amount". Because of this, we only needed the total exemption amount and could ignore the other two variables.

Using this, I was able to run my grouped records through the trained Lasso regression model and get back a ranked list of records. The top scoring records being the most likely to be worth manual investigative effort.

Conclusion

In one investigation, when the algorithms were done spitting out results, we filed public information requests for hundreds of property tax applications. We got a stack of paper four inches thick that my girlfriend and I carefully spread across our living room floor, cataloged and verified. The Austin Bulldog used this analysis to produce a two-part investigative story about the appraisal district's reluctance to vet tax exemption applications. Hundreds of thousands of dollars in back taxes were reclaimed for taxpayers.

In another investigation, we identified properties that were likely running large, illegal online STRs and getting unwarranted tax breaks. The fallout from this one is still ongoing.

Both investigations wouldn't have been possible without algorithmic assistance. But no matter how many fancy computational methods you use during an investigation, you can never escape the need to go back to the primary, physical sources and meticulously check your work.

Machine learning is just a tool, not magic. It won't do our investigation for us and it won't follow the leads it provides. But when used correctly, it can unlock leads from data and help get us to the part we’re good at: writing stories that expose wrongdoing and effecting change.

Issue One: Table of Contents

Arificial Informer. April 2019.
Edited by Lisa van Dam-Bates.
Written by Brandon Roberts.
Print version (PDF) available here.

An Interview with Chase Davis

A pioneer in applying machine learning techniques to investigative journalism.
Dissecting a Machine Learning Powered Investigation

Uncovering property tax evasion in Austin, Texas, using machine learning and statistical modeling. An investigative recipe.
Rethinking Web Scraping with AutoScrape

Web scraping is tedious. In this article, we explore patterns for building scrapers and introduce a new tool, AutoScrape.
Glossary of Terminology

Explanations of the various technical terms and concepts found in this issue.

Glossary of Terminology

1. Dataset A collection of machine readable records, typically from a single source. A dataset can be a single file (Excel or CSV), a database table, or a collection of documents. In machine learning, a dataset is commonly called a corpus. When the dataset is being used to train[18] a machine learning model[12], it can be called a training dataset (a.k.a. a training set). Datasets need to be transformed into a matrix[11] before they can be used by a machine learning model.

Further reading: Training, Validation and Test Sets - Wikipedia

2. Distance Function, Distance Metric A method for quantifying how dissimilar, or far apart, two records are. Euclidean distance, the simplest distance metric used, is attributed to the Ancient Greek mathematician Euclid. This distance metric finds the length of a straight line between two points, as if using a ruler. Cosine distance is another popular metric which measures the angle between two points using trigonometry.

3. Document Object Model, DOM A representation of a HTML page using a hierarchical tree. This is the way that browsers "see" web pages. As an example, we have a simple HTML page and its corresponding DOM tree:

              HTML                             DOM
---------------------------------------------------------------
<html>                           |            html
  <head>                         |              |
    <title>Example DOM</title>   |       ---------------
    <style>*{margin: 0;}</style> |       |             |
  </head>                        |      head          body
  <body>                         |       |             |
    <h1>Example Page!</h1>       |   --------     ----------
    <div>                        |   |      |     |   |    |
      <button>Save</button>      | title  style  h1  div footer
    </div>                       |                    |
    <footer>A footer</footer>    |                  button
  </body>                        |
</html>                          |

4. Feature A column in a dataset representing a specific type of values. A feature is typically represented as a variable in a machine learning model. For example, in a campaign finance dataset, a feature might be "contribution amount" or "candidate name". The number of features in a dataset determines its dimensionality. In many machine learning algorithms, high dimensional data (data with lots of features) is notoriously difficult to work with.

5. Hash A short label or string of characters identifying a piece of data. Hashes are generated by a hash function. An example of this comes from the most common use case: password hashes. Instead of storing passwords in a database for anyone to read (and steal), password hashes are stored. For example, the password "Thing99" might get turned into something like b3aca92c793ee0e9b1a9b0a5f5fc044e05140df3 by a hash function and saved in a database. When logging in, the website will hash the provided password and check it against the one in the database. A strong cryptographic hash function can't feasibly be reversed and uniquely identifies a record. In other usages, such as in_ LSH_, a hash may identify a group of similar records. Hashes are a fixed length, unlike the input data used to create them.

Further reading: "Hash Function" - Wikipedia, "MinHash for dummies", "An Introduction to Cryptographic Hash Functions" - Dev-HQ

6. k-NN, k-Nearest Neighbors An algorithm for finding the k number of most similar records to a given record, or query point. k-NN can use a variety of distance metrics[2] to measure dissimilarity, or distance_, between points. In k-NN, when _k is equal to 1, the algorithm will return the single most similar record. When k is greater than 1, the algorithm will return multiple records. A common practice is to take the similar records, average them and make educated guesses about the query point.

7. Locality Sensitive Hashing, LSH A method, similar in application to k-NN[6], for identifying similar records given a query record. LSH uses some statistical tricks like hashing[5] and projection[9] to do this as a performance optimization. Due to this, it can be used on large amounts of data. The penalty for this is that it's possible for false records to turn up in the results and, inversely, for actual similar records to be missed.

8. Preprocessing A step in data analysis that happens before any actual analysis occurs to transform the data into a specific format or to clean it. A common preprocessing task is lowercasing and stripping symbols. Vectorization[20] is a common preprocessing step found in machine learning and statistics.

9. Projection A mathematical method for taking an input vector[20] and transforming it into another dimension. Typically, this is done by taking high dimensional data (data with a large number of columns) and converting it to a lower dimension. A simple example of this would be taking a 3D coordinate and turning it into a 2D point. This is one of the key concepts behind LSH[7].

Further reading: Random Projection - Wikipedia

10. Machine Learning, ML A field of statistics and computer science focused on building or using algorithms that can perform tasks, without being told specifically how to accomplish them, by learning from data. The type of data required by the machine learning algorithm, labeled or unlabeled, splits the field into two major groups: supervised[17] and unsupervised[19], respectively. Machine learning is a subfield of Artificial Intelligence.

11. Matrix Rectangularly arranged data made up of rows and columns. In the machine learning context, every cell in a matrix is a number. The numbers in a matrix may represent a letter, number or category.

          Example m-by-n matrix (2x3)

      n columns (3)
     .---------------------------------.
   m | 11.347271944951725 | 2203 | 2.0 | <- row vector
rows |--------------------+------+-----|
 (2) | 9.411296351528783  | 1867 | 1.0 |
     `---------------------------------'
          \
           This is element (2,1)

Each row, which represents a single record, is known as a vector[20]. The process of turning source data into a matrix is known as vectorization.

12. Model, Statistical Model A collection of assumptions that a machine learning algorithm has learned from a dataset. Fundamentally, a model consists of numbers, known as weights, that can be be plugged into a machine learning algorithm. We use models to get data out of machine learning algorithms.

13. Outlier Detection A method for identifying records that are out of place in the context of a dataset. These outlying data points can be thought of as strange, suspicious, fraudulent, rare, unique, etc. Outlier detection is a major subfield of machine learning with applications in fraud detection, quality assurance and alert systems.

14. Regression A statistical method for identifying relationships among the features[4] in a dataset.

15. Scraping, Web Scraping The process of loading a web page, extracting information and collecting it into a specific structure (a database, spreadsheet, etc). Typically web scraping is done automatically with a program, or tool, known as a web scraper.

16. String A piece of data, arranged sequentially, made up of letters, numbers or symbols. Technically speaking, computers represent everything as numbers, but they are converted to letters when needed. Numbers, words, sentences, paragraphs and even entire documents can be represented as strings.

17. Supervised Machine Learning, Supervised Learning A subfield of machine learning where algorithms learn to predict values or categories from human-labeled data. Examples of supervised machine learning problems: (1) predicting the temperature of a future day using a dataset of historical weather readings and (2) classifying emails by whether or not they are spam from a set of categorized emails. The goal of supervised machine learning is to learn from one dataset and then make accurate predictions on new data (this is known as generalization).

18. Training The process of feeding a statistical or machine learning algorithm data for the purpose of learning to predict, identifying structure, or extracting knowledge. As an example, consider a list of legitimate campaign contributions. Once an algorithm has been shown this data, it generates a model[12] representing how these contributions typically look. This model can be used to spot unusual contributions, since the model has learned what normal ones look like. There are many different methods for training models, but most of them are iterative, step-based procedures that slowly improve over time. A common analogy for how models are trained is hill climbing: knowing that a flat area (a good solution) is at the top of a hill, but only being able to see a short distance due to thick fog, the top can be found by following steep paths. Training is also known as model fitting.

19. Unsupervised Machine Learning, Unsupervised Learning A subfield of machine learning where algorithms learn to identify the structure or find patterns within a dataset. Unsupervised algorithms don't require human labeling or organization, and therefore can be used on a wide variety of datasets and in many situations. Examples of unsupervised use cases: (1) discovering natural groups of records in a dataset, (2) finding similar documents in a dataset and (3) identifying the way that events normally occur and using this to detect unusual events (a.k.a. outlier detection and anomaly detection[13]).

20. Vectorization, Vector The process of turning a raw source dataset into a numerical matrix[11]. Each record becomes a row of the matrix, known as a vector.

21. Weight A number that is used to either increase or decrease the importance of a feature[4]. Weights are used in supervised machine learning to quantify how well one variable predicts another; in unsupervised learning, weights are used to emphasize features that segment a dataset into groups.

22. XPath A description of the location of an element on a web page. From the browser's perspective, a web page is represented as a hierarchical tree known as the Document Object Model (DOM)[3]. An XPath selector describes a route through this tree that leads to a specific part of the page.