Artificial Informer - Issue One

Glossary of Terminology

1. Dataset A collection of machine-readable records, typically from a single source. A dataset can be a single file (Excel or CSV), a database table, or a collection of documents. In machine learning, a dataset, particularly one made up of text documents, is often called a corpus. When the dataset is being used to train[18] a machine learning model[12], it can be called a training dataset (a.k.a. a training set). Datasets need to be transformed into a matrix[11] before they can be used by a machine learning model.

Further reading: Training, Validation and Test Sets - Wikipedia

2. Distance Function, Distance Metric A method for quantifying how dissimilar, or far apart, two records are. Euclidean distance, one of the simplest distance metrics, is attributed to the Ancient Greek mathematician Euclid. This distance metric finds the length of a straight line between two points, as if using a ruler. Cosine distance is another popular metric. It treats each record as a vector[20] and uses trigonometry to measure the angle between the two vectors: the smaller the angle, the more similar the records.
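
As a minimal sketch of the two metrics described above, assuming each record has already been turned into a list of numbers, both distances can be computed in a few lines of Python. The function names and sample points are illustrative only.

```python
import math

def euclidean_distance(a, b):
    """Length of the straight line between points a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """One minus the cosine of the angle between vectors a and b.

    Assumes neither vector is all zeros (the angle is undefined then).
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# A 3-4-5 right triangle: the straight-line distance is 5.
print(euclidean_distance([3.0, 4.0], [0.0, 0.0]))  # 5.0

# Perpendicular vectors are as far apart as non-negative data can get.
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0

# Vectors pointing the same way have a cosine distance of (nearly) zero,
# even though their Euclidean distance is not zero.
print(cosine_distance([1.0, 2.0], [2.0, 4.0]))
```

Note that cosine distance ignores how long the vectors are and only compares their direction, which is why it is often used for documents of different lengths.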

3. Document Object Model, DOM A representation of an HTML page as a hierarchical tree. This is the way that browsers "see" web pages. As an example, here is a simple HTML page and its corresponding DOM tree:

              HTML                             DOM
<html>                           |            html
  <head>                         |              |
    <title>Example DOM</title>   |       ---------------
    <style>*{margin: 0;}</style> |       |             |
  </head>                        |      head          body
  <body>                         |       |             |
    <h1>Example Page!</h1>       |   --------     ----------
    <div>                        |   |      |     |   |    |
      <button>Save</button>      | title  style  h1  div footer
    </div>                       |                    |
    <footer>A footer</footer>    |                  button
  </body>                        |
</html>                          |
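
The same tree can be printed programmatically. The sketch below uses Python's built-in XML parser as a stand-in for a browser, which only works here because the example page happens to be well-formed XML; real browsers use a far more forgiving HTML parser. The helper name tree_lines is made up for this illustration.

```python
import xml.etree.ElementTree as ET

PAGE = """<html>
  <head>
    <title>Example DOM</title>
    <style>*{margin: 0;}</style>
  </head>
  <body>
    <h1>Example Page!</h1>
    <div>
      <button>Save</button>
    </div>
    <footer>A footer</footer>
  </body>
</html>"""

def tree_lines(element, depth=0):
    """Return one indented line per element, parents before children."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(tree_lines(child, depth + 1))
    return lines

root = ET.fromstring(PAGE)
dom_outline = tree_lines(root)
print("\n".join(dom_outline))
```

Each level of indentation in the output corresponds to one level of the tree in the diagram above.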

4. Feature A column in a dataset representing a specific type of value. A feature is typically represented as a variable in a machine learning model. For example, in a campaign finance dataset, a feature might be "contribution amount" or "candidate name". The number of features in a dataset determines its dimensionality. In many machine learning algorithms, high-dimensional data (data with lots of features) is notoriously difficult to work with.

5. Hash A short label or string of characters identifying a piece of data. Hashes are generated by a hash function. An example of this comes from the most common use case: password hashes. Instead of storing passwords in a database for anyone to read (and steal), password hashes are stored. For example, the password "Thing99" might get turned into something like b3aca92c793ee0e9b1a9b0a5f5fc044e05140df3 by a hash function and saved in a database. When logging in, the website will hash the provided password and check it against the one in the database. A strong cryptographic hash function can't feasibly be reversed and, for practical purposes, uniquely identifies a record (two different records producing the same hash is possible in theory, but extremely unlikely). In other usages, such as in LSH[7], a hash may identify a group of similar records. Hashes have a fixed length, unlike the input data used to create them.
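
A minimal sketch of the login check described above, using Python's hashlib. SHA-1 is assumed here only because it produces 40-character hashes like the example; the glossary does not say which hash function generated that value, and real password storage typically uses salted, deliberately slow functions instead. The function names are illustrative.

```python
import hashlib

def sha1_hex(text):
    """Hash a string and return the result as 40 hexadecimal characters."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def check_password(attempt, stored_hash):
    """Hash the attempt and compare it to the hash saved at sign-up."""
    return sha1_hex(attempt) == stored_hash

# Only the hash is stored, never the password itself.
stored = sha1_hex("Thing99")

print(check_password("Thing99", stored))  # True
print(check_password("thing99", stored))  # False

# The output length is fixed no matter how long the input is.
short_hash = sha1_hex("Thing99")
long_hash = sha1_hex("Thing99" * 1000)
print(len(short_hash), len(long_hash))  # 40 40
```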

Further reading: "Hash Function" - Wikipedia, "MinHash for dummies", "An Introduction to Cryptographic Hash Functions" - Dev-HQ

6. k-NN, k-Nearest Neighbors An algorithm for finding the k most similar records to a given record, or query point. k-NN can use a variety of distance metrics[2] to measure dissimilarity, or distance, between points. In k-NN, when k is equal to 1, the algorithm will return the single most similar record. When k is greater than 1, the algorithm will return multiple records. A common practice is to take the similar records, average them and make educated guesses about the query point.
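
A brute-force sketch of this idea, assuming records are lists of numbers and Euclidean distance is the chosen metric. It compares the query against every record, which is fine for small datasets; the sample records are made up.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(records, query, k):
    """Return the k records closest to the query, nearest first."""
    return sorted(records, key=lambda r: euclidean(r, query))[:k]

def average(vectors):
    """Column-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

records = [[1.0, 1.0], [2.0, 2.0], [9.0, 9.0], [10.0, 10.0]]
query = [1.5, 1.0]

nearest_one = k_nearest(records, query, 1)
nearest_two = k_nearest(records, query, 2)

print(nearest_one)           # [[1.0, 1.0]]
print(nearest_two)           # [[1.0, 1.0], [2.0, 2.0]]
print(average(nearest_two))  # [1.5, 1.5]
```

The averaged neighbors in the last line are the "educated guess" about the query point.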

7. Locality Sensitive Hashing, LSH A method, similar in application to k-NN[6], for identifying similar records given a query record. Instead of comparing the query against every record, LSH uses statistical tricks like hashing[5] and projection[9] to narrow the search to a small group of likely matches. This makes it fast enough to use on large amounts of data. The penalty is that dissimilar records can turn up in the results (false positives) and, conversely, genuinely similar records can be missed (false negatives).
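
There are several LSH schemes. The sketch below shows one of them, random hyperplane hashing, chosen here as an illustration rather than as the method any particular tool uses. Each bit of the hash records which side of a random plane a vector falls on, so vectors pointing in similar directions tend to share a hash. All names and parameters are assumptions made for this example.

```python
import random
from collections import defaultdict

def make_hyperplanes(num_bits, num_features, seed=0):
    """One random direction per hash bit (seeded so runs are repeatable)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(num_features)]
            for _ in range(num_bits)]

def lsh_hash(vector, hyperplanes):
    """Project the vector onto each direction and keep only the sign."""
    bits = []
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)

def build_index(records, hyperplanes):
    """Group records into buckets keyed by their hash."""
    buckets = defaultdict(list)
    for record in records:
        buckets[lsh_hash(record, hyperplanes)].append(record)
    return buckets

def candidates(query, buckets, hyperplanes):
    """Only records in the query's bucket are considered possible matches."""
    return buckets.get(lsh_hash(query, hyperplanes), [])

planes = make_hyperplanes(num_bits=8, num_features=3)
index = build_index([[1.0, 2.0, 3.0], [-3.0, 0.5, -1.0]], planes)

# A vector pointing in the same direction lands in the same bucket.
print(candidates([2.0, 4.0, 6.0], index, planes))
```

Records that are similar but fall just across one of the random planes end up in different buckets, which is the source of the missed matches mentioned above. Using several independent hash tables is a common way to reduce that risk.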

8. Preprocessing A step in data analysis that happens before any actual analysis occurs, either to transform the data into a specific format or to clean it. Common preprocessing tasks for text include lowercasing and stripping symbols. Vectorization[20] is a common preprocessing step found in machine learning and statistics.
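
As a small illustration of the lowercasing and symbol-stripping task, here is one possible cleaning function. It assumes English text and keeps only the letters a-z, digits and spaces, so accented or non-Latin characters would be dropped; a real project would adjust the pattern to its data.

```python
import re

def preprocess(text):
    """Lowercase, remove symbols, and collapse repeated whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)   # strip symbols
    return re.sub(r"\s+", " ", text).strip()  # tidy up spacing

print(preprocess("  ACME Corp., Inc.  "))  # acme corp inc
print(preprocess("Acme Corp Inc"))         # acme corp inc
```

After cleaning, two differently formatted spellings of the same name become identical strings and can be matched directly.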

9. Projection A mathematical method for taking an input vector[20] and transforming it into a vector with a different number of dimensions. Typically, this is done by taking high-dimensional data (data with a large number of columns) and converting it to a lower dimension. A simple example of this would be taking a 3D coordinate and turning it into a 2D point. This is one of the key concepts behind LSH[7].
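
The 3D-to-2D example can be sketched as a multiplication by a small matrix. The matrix below simply discards the third coordinate; a random projection, as used in LSH, would fill the matrix with random numbers instead. The names are illustrative.

```python
def project(vector, projection_matrix):
    """Multiply a vector by a matrix: one output number per matrix row."""
    return [sum(v * w for v, w in zip(vector, row))
            for row in projection_matrix]

# Two rows, three columns: takes 3 numbers in, gives 2 numbers out.
DROP_Z = [
    [1.0, 0.0, 0.0],  # keep x
    [0.0, 1.0, 0.0],  # keep y
]

point_3d = [2.0, 5.0, 7.0]
point_2d = project(point_3d, DROP_Z)
print(point_2d)  # [2.0, 5.0]
```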

Further reading: Random Projection - Wikipedia

10. Machine Learning, ML A field of statistics and computer science focused on building or using algorithms that can perform tasks, without being told specifically how to accomplish them, by learning from data. The type of data required by the machine learning algorithm, labeled or unlabeled, splits the field into two major groups: supervised[17] and unsupervised[19], respectively. Machine learning is a subfield of Artificial Intelligence.

11. Matrix Rectangularly arranged data made up of rows and columns. In the machine learning context, every cell in a matrix is a number. The numbers in a matrix may represent a letter, number or category.

          Example m-by-n matrix (2x3)

      n columns (3)
   m | 11.347271944951725 | 2203 | 2.0 | <- row vector
rows |--------------------+------+-----|
 (2) | 9.411296351528783  | 1867 | 1.0 |
           This is element (2,1)

Each row, which represents a single record, is known as a vector[20]. The process of turning source data into a matrix is known as vectorization.
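
In code, a matrix like the one above is often stored as a list of rows. The sketch below uses plain Python lists and a hypothetical element helper that follows the diagram's convention of counting rows and columns from 1 (Python itself counts from 0).

```python
matrix = [
    [11.347271944951725, 2203, 2.0],  # row 1, a vector
    [9.411296351528783, 1867, 1.0],   # row 2, a vector
]

m = len(matrix)     # number of rows
n = len(matrix[0])  # number of columns

def element(matrix, row, column):
    """Look up a cell using 1-based (row, column) numbering."""
    return matrix[row - 1][column - 1]

print(m, n)                   # 2 3
print(element(matrix, 2, 1))  # 9.411296351528783
```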

12. Model, Statistical Model A collection of assumptions that a machine learning algorithm has learned from a dataset. Fundamentally, a model consists of numbers, known as weights[21], that can be plugged into a machine learning algorithm. We use models to get results, such as predictions, out of machine learning algorithms.

13. Outlier Detection A method for identifying records that are out of place in the context of a dataset. These outlying data points can be thought of as strange, suspicious, fraudulent, rare, unique, etc. Outlier detection is a major subfield of machine learning with applications in fraud detection, quality assurance and alert systems.
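
There are many outlier detection methods. As one simple sketch, the function below flags values that sit unusually far from the median, measured in units of the typical (median) deviation. The cutoff of 3.5 and the contribution amounts are assumptions chosen for illustration.

```python
import statistics

def find_outliers(values, threshold=3.5):
    """Flag values far from the median relative to the typical deviation."""
    median = statistics.median(values)
    deviations = [abs(v - median) for v in values]
    typical = statistics.median(deviations)
    if typical == 0:
        # Most values are identical, so anything different stands out.
        return [v for v in values if v != median]
    return [v for v, d in zip(values, deviations) if d / typical > threshold]

contributions = [50, 75, 60, 55, 65, 5000]
outliers = find_outliers(contributions)
print(outliers)  # [5000]
```

The median is used instead of the average because a single extreme value, like the 5000 here, can drag the average far enough to hide itself.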

14. Regression A statistical method for identifying relationships among the features[4] in a dataset.
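
The simplest case is fitting a straight line through two features, known as simple linear regression. The sketch below uses the ordinary least squares formulas on made-up data; it is one form of regression among many.

```python
def fit_line(xs, ys):
    """Least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data in which y is always 2 * x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```

The slope describes the relationship: for each increase of 1 in the first feature, the second feature rises by about 2.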

15. Scraping, Web Scraping The process of loading a web page, extracting information and collecting it into a specific structure (a database, spreadsheet, etc.). Typically, web scraping is done automatically with a program, or tool, known as a web scraper.
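
A minimal sketch of the extracting and collecting steps, using Python's built-in HTML parser. The page content is a made-up example stored in a string; a real scraper would first load the page over the network. The class name TableScraper is hypothetical.

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# A hypothetical page, already loaded.
PAGE = (
    "<table>"
    "<tr><th>Donor</th><th>Amount</th></tr>"
    "<tr><td>A. Smith</td><td>250</td></tr>"
    "<tr><td>B. Jones</td><td>1000</td></tr>"
    "</table>"
)

scraper = TableScraper()
scraper.feed(PAGE)
header, *records = scraper.rows
print(header)   # ['Donor', 'Amount']
print(records)  # [['A. Smith', '250'], ['B. Jones', '1000']]
```

The resulting rows could then be written to a spreadsheet or database.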

16. String A piece of data, arranged sequentially, made up of letters, numbers or symbols. Technically speaking, computers represent everything as numbers, but they are converted to letters when needed. Numbers, words, sentences, paragraphs and even entire documents can be represented as strings.
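
To illustrate the point that strings are numbers underneath, the short sketch below converts a string to its numeric codes and back. The example word is arbitrary.

```python
word = "Hi!"

# Each character has a number behind it.
code_points = [ord(character) for character in word]
print(code_points)  # [72, 105, 33]

# The numbers can be turned back into the original string.
rebuilt = "".join(chr(number) for number in code_points)
print(rebuilt)  # Hi!

# A number stored as a string is text, not a quantity.
print("2" + "2")            # 22
print(int("2") + int("2"))  # 4
```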

17. Supervised Machine Learning, Supervised Learning A subfield of machine learning where algorithms learn to predict values or categories from human-labeled data. Examples of supervised machine learning problems: (1) predicting the temperature of a future day using a dataset of historical weather readings and (2) classifying emails by whether or not they are spam from a set of categorized emails. The goal of supervised machine learning is to learn from one dataset and then make accurate predictions on new data (this is known as generalization).

18. Training The process of feeding a statistical or machine learning algorithm data for the purpose of learning to predict, identifying structure, or extracting knowledge. As an example, consider a list of legitimate campaign contributions. Once an algorithm has been shown this data, it generates a model[12] representing how these contributions typically look. This model can be used to spot unusual contributions, since the model has learned what normal ones look like. There are many different methods for training models, but most of them are iterative, step-based procedures in which the model slowly improves over time. A common analogy for how models are trained is hill climbing: imagine knowing that a flat area (a good solution) lies at the top of a hill, but being able to see only a short distance because of thick fog. The top can still be found by repeatedly taking a step along the steepest upward path. Training is also known as model fitting.
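
One common iterative method is gradient descent, sketched below for a model with a single weight. It is the mirror image of the hill climbing analogy: instead of walking up toward a peak, each step walks down the slope of the error. The data, learning rate and step count are assumptions picked for this example.

```python
def train(xs, ys, learning_rate=0.05, steps=200):
    """Fit y = weight * x by repeatedly nudging the weight downhill."""
    weight = 0.0  # start with a poor guess
    losses = []
    n = len(xs)
    for _ in range(steps):
        errors = [weight * x - y for x, y in zip(xs, ys)]
        # Mean squared error: how wrong the current weight is.
        losses.append(sum(e * e for e in errors) / n)
        # The slope of the error with respect to the weight.
        gradient = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        # Take a small step in the direction that reduces the error.
        weight -= learning_rate * gradient
    return weight, losses

# Hypothetical data in which y is always twice x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

weight, losses = train(xs, ys)
print(round(weight, 4))       # 2.0
print(losses[0], losses[-1])  # error at the first and last step
```

The learned weight is the model: plugging it back into weight * x gives predictions for new values of x.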

19. Unsupervised Machine Learning, Unsupervised Learning A subfield of machine learning where algorithms learn to identify the structure or find patterns within a dataset. Unsupervised algorithms don't require human labeling or organization, and therefore can be used on a wide variety of datasets and in many situations. Examples of unsupervised use cases: (1) discovering natural groups of records in a dataset, (2) finding similar documents in a dataset and (3) identifying the way that events normally occur and using this to detect unusual events (a.k.a. outlier detection[13] or anomaly detection).

20. Vectorization, Vector The process of turning a raw source dataset into a numerical matrix[11]. Each record becomes a row of the matrix, known as a vector.
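
A minimal sketch of vectorization, assuming a dataset of records with numeric fields and one category field. Categories are replaced by integer codes here for brevity; other encodings are often preferred in practice. The field names and records are made up for illustration.

```python
def vectorize(records, numeric_fields, categorical_field):
    """Turn a list of dictionaries into a matrix of numbers."""
    # Assign each distinct category a number, in alphabetical order.
    categories = sorted({r[categorical_field] for r in records})
    codes = {category: i for i, category in enumerate(categories)}

    matrix = []
    for record in records:
        row = [float(record[field]) for field in numeric_fields]
        row.append(float(codes[record[categorical_field]]))
        matrix.append(row)  # each row is one record's vector
    return matrix, codes

# Hypothetical campaign finance records.
records = [
    {"amount": 250, "year": 2018, "state": "NM"},
    {"amount": 1000, "year": 2016, "state": "TX"},
    {"amount": 75, "year": 2018, "state": "CA"},
]

matrix, codes = vectorize(records, ["amount", "year"], "state")
print(codes)  # {'CA': 0, 'NM': 1, 'TX': 2}
for vector in matrix:
    print(vector)
```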

21. Weight A number that is used to either increase or decrease the importance of a feature[4]. Weights are used in supervised machine learning to quantify how well one variable predicts another; in unsupervised learning, weights are used to emphasize features that segment a dataset into groups.

22. XPath A description of the location of an element on a web page. From the browser's perspective, a web page is represented as a hierarchical tree known as the Document Object Model (DOM)[3]. An XPath selector describes a route through this tree that leads to a specific part of the page.
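
As a sketch, the selectors below walk the small example page from the DOM[3] entry. Python's built-in ElementTree module is used here; it understands only a limited subset of XPath and requires well-formed markup, so treat this as an illustration rather than a full scraper.

```python
import xml.etree.ElementTree as ET

PAGE = """<html>
  <head>
    <title>Example DOM</title>
  </head>
  <body>
    <h1>Example Page!</h1>
    <div>
      <button>Save</button>
    </div>
    <footer>A footer</footer>
  </body>
</html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# A full route from the top of the tree: html -> body -> div -> button.
button = root.find("./body/div/button")
print(button.text)  # Save

# A shortcut: find a footer anywhere below the root.
footer = root.find(".//footer")
print(footer.text)  # A footer

# A route that does not exist in the tree finds nothing.
print(root.find("./body/table"))  # None
```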